


Learning Task Space Control through Goal Directed Exploration

Lorenzo Jamone1, Lorenzo Natale2, Kenji Hashimoto1, Giulio Sandini2, Atsuo Takanishi1

Waseda University, Tokyo, Japan1

Robotics, Brain and Cognitive Science Department - Italian Institute of Technology, Genova, Italy2

lorejam@takanishi.mech.waseda.ac.jp, lorenzo.natale@iit.it, k_hashimoto@takanishi.mech.waseda.ac.jp, giulio.sandini@iit.it, takanisi@waseda.jp

Abstract—We present an autonomous goal-directed strategy to learn how to control a redundant robot in the task space. We discuss the advantages of exploring the state space through goal-directed actions defined in the task space (i.e. learning by trying to do) instead of performing motor babbling in the joints space, and we stress the importance of learning to be performed online, without any separation between training and execution. Our solution relies on learning the forward model and then inverting it for the control; different approaches to learn the forward model are described and compared. Experimental results on a simulated humanoid robot are provided to support our claims. The robot learns autonomously how to perform reaching actions directed toward 3D targets in task space by using arm and waist motion, not relying on any prior knowledge or initial motor babbling. To test the ability of the system to adapt to sudden changes both in the robot structure and in the perceived environment we artificially introduce two different kinds of kinematic perturbations: a modification of the length of one link and a rotation of the task space reference frame. Results demonstrate that the online update of the model allows the robot to cope with such situations.

I. INTRODUCTION

This work investigates how to learn a kinematic model of

the robot and use it for control. Specifically, we learn the

forward model of a redundant robot and we invert it for the

control. Learning is performed online, without any separation

between system training and task execution; no prior knowl-

edge about the system is given. Our main contribution is to

combine advanced control using learned models [1], [2] and

goal directed exploration [3], [4]. Differently from previous

works, we want the robot to learn autonomously through a goal

directed exploration of the task space, without any previous

exploration of the motor space (motor babbling). This strategy

allows us to restrict the exploration of the state space to the

regions which are more relevant to the task fulfillment; as soon

as the robot needs to perform tasks in other regions of the state

space the learned model is updated accordingly. Experimental

results show the advantages of this approach. Moreover, the

learned model has to be updated if changes in the robot

structure or in the perceived environment occur. We simulate

such situations by artificially introducing two different kinds

of kinematic perturbations: a modification of the size of one

link and a 45◦ CCW rotation of the task space reference frame.

Results demonstrate how the robot is able to cope with these

situations.

The paper is organized as follows. In section II we provide

general considerations about online learning and exploration,

and in section III we discuss how we applied those considera-

tions to learning task space control, also in relation to previous

works; then, in section IV we describe the details of our

implementation. Finally, in sections V and VI we report the

experimental results and we draw our conclusions.

II. ONLINE LEARNING AND EXPLORATION

Past works [5]–[7] tackled the learning problem by con-

sidering two distinct phases: an exploration phase in which

the robot collects data about the system followed by a later

exploitation phase in which the robot performs its actions

using a neural network which has been trained with the data

gathered previously. This approach has two problems. First, at

some point the robot (i.e. learning system) should decide to

stop learning and activate the controller that uses the network.

Besides the problem of identifying this moment (how

can the robot tell that it has learned enough?), the process

would probably take too much time (due to the curse of dimen-

sionality [8]), and such a learned model would no longer be

useful if the system changes. Finally, two different controllers

are needed to drive the robot actions during two different

phases, as the goal changes (i.e. during the exploration phase

the goal is just to gather data for learning). This is neither

intuitive nor practical. Indeed, the robot should explore the

learning space driven by the same motivations that characterize

its later actions (i.e. a goal directed exploration). This also

helps to reduce the size of such a space, by identifying the

areas which are more important to the task fulfillment, at least

at the current developmental stage. Recently, goal directed

exploration has been proposed by different authors for the case

of inverse kinematics learning [3], [4]. Here we investigate this

aspect for the case in which the forward kinematic model is

learned and then inverted for control. Furthermore, differently

from [4], we do not impose any particular choice of the goals,

as we assume that those are already given.

III. LEARNING TASK SPACE CONTROL

In this work we consider task space control of a redundant

robot. Task space (or operational space) is the space in which

generally robot tasks are described (e.g. the 3D Cartesian

space), as opposed to joints space (or motor space). Since the


robot needs to be controlled in joints space, an inverse model is

required which maps the desired task space behavior into the

appropriate joints space trajectories (i.e. motor commands).

When the joints space dimension is larger than the task

space dimension, infinitely many solutions exist for the model

inversion: in this case, we say that the robot is redundant

with respect to the specified task. Given this scenario, there

are mainly two possibilities to exploit learning: to learn the

forward model and then invert it for the control, or to directly

learn the inverse model. Learning the forward model gives the

opportunity to choose the best inverse solution according to the

situation. Different methods can be exploited (e.g. extended

Jacobian [9], augmented Jacobian [10], local minimization

through null space projection [11]) to solve a secondary task

while performing the main one (i.e. end-effector positioning).

Possible choices for such a secondary task can be joint limits

avoidance, obstacle avoidance, or in general the optimization

of an arbitrary function. One main drawback of this approach

is that the model inversion can be problematic if the Jacobian

matrix becomes ill-conditioned (i.e. the system is close to sin-

gular configurations): joint velocities obtained by the inversion

of such a matrix can become very large or even infeasible. The problem

is avoided when the inverse model is learned directly from the

gathered velocity data: since they are generated by the robot

motion, they are physically feasible by definition. However,

when the inverse model is directly learned from data, the

system can be controlled only exploiting the specific inverse

solution which has been learned. A robotic implementation of

this latter strategy is proposed in [1]. Regarding forward

model learning, we consider the interesting work in [2]

as a main reference. The authors propose an approach which

combines forward kinematics model learning with task space

control, exploiting null space projections to solve multiple

tasks simultaneously. After a motor babbling phase, the system

is controlled with the learned model, which is also constantly

updated online during the movements; results are provided in

simulation on a 3 DOF planar manipulator. The Jacobian is

derived analytically from the learned forward kinematic model,

and then inverted. As they claim, this computation propagates

the estimation errors of the learned model. An alternative

solution is to learn directly the non-linear Jacobian matrix from

the gathered velocity data, as we investigated in [12]. However,

as long as velocities are derived from position measurements

an estimation error is always introduced.

In this work we follow the same approach as in [2] and we

extend those results by considering a more complex scenario, in

which a simulated humanoid robot with joint limits is con-

sidered; the robot performs reaching actions directed toward

3D target points in the Cartesian space exploiting both arm

and waist motion (7 DOFs). Differently from their work, we

do not perform any motor space exploration (babbling phase)

before the reaching actions start: the exploration of the state

space is fully realized during the execution of goal directed

movements in task space. The main goal of these movements

is to bring the end-effector (i.e. robot hand) into a specified

3D position, with the required accuracy. The secondary goal

is to keep the robot joints as far as possible from the physical

limits: this latter is achieved by local minimization through

null space projection, as in [11]. As a result of this strategy,

the motor space is not explored exhaustively; only the regions

which are more relevant for the current tasks are covered. This

allows a faster convergence of the estimation error in those

areas; then, as soon as the robot needs to perform actions in

other regions of the motor space, the learned model is updated

online. Keeping the system far from the joint limits also helps

to carry out a safe and efficient goal directed exploration. In

general, since in the beginning the model estimation can be

poor or even completely wrong, the robot can get stuck at

joint limits without finding a direction to move out: even if

a target in a direction which is opposite to the joint limits

is chosen, a wrongly estimated Jacobian could push the robot

again toward the limits. Without motion the model cannot

be updated, and the robot could get stuck forever. This effect

is consistently reduced by choosing joint limit avoidance as a

secondary task.

Finally, concerning the problems related to matrix inversion

near singularities, here we choose to rely on the damped least

squares method [13].

IV. PROPOSED SOLUTIONS

The solution we chose for the design of the task space

controller was originally proposed in [11]. Motor velocities

q̇ are generated as follows:

q̇ = Km · C1 + Ks · C2   (IV.1)

where

C1 = J†(q) ẋdm   (IV.2)

C2 = (I − J†(q)J(q)) q̇ds   (IV.3)

where J(q) is the Jacobian matrix which maps from motor

velocities to task velocities, J†(q) is its Moore-Penrose gen-

eralized inverse, (I − J†(q)J(q)) is a null-space projector,

ẋdm is the desired task space velocity for the main task and

q̇ds is the desired joints space velocity for the secondary task.

At every control step these desired velocities are chosen as

follows:

ẋdm = xd − x   (IV.4)

q̇ds = −Kg ∇M(q)   (IV.5)

where xd and x are the desired and actual task space positions

and ∇M(q) is the gradient of M(q), the function we want

to minimize as a secondary task. Km, Ks and Kg are

positive definite diagonal gain matrices respectively for the

main task, the secondary task and the gradient descent. Since

our secondary task is to keep the joints as far as possible from

the limits we chose M(q) as in [11]:

M(q) = (1/N) Σi=1..N ((qi − ai) / (ai − qi,max))²   (IV.6)

ai = (qi,max + qi,min) / 2   (IV.7)


where N is the overall number of joints and qi,min and qi,max

are the lower and upper joint limits.
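
To make the control law concrete, here is a minimal sketch in Python/NumPy of one step of the controller in equations IV.1-IV.7; the function names, the scalar gains used in place of the diagonal gain matrices, and their values are ours, not from the paper.

import numpy as np

def joint_limit_gradient(q, q_min, q_max):
    """Gradient of M(q) in eq. IV.6, which pushes each joint toward its mid-range a_i."""
    a = 0.5 * (q_max + q_min)
    n = len(q)
    return (2.0 / n) * (q - a) / (a - q_max) ** 2

def control_step(q, x, x_d, J, q_min, q_max, K_m=1.0, K_s=0.5, K_g=0.1):
    """One step of eq. IV.1: main task (reach x_d) plus joint-limit avoidance
    projected into the null space of the task Jacobian J (3x7)."""
    J_pinv = np.linalg.pinv(J)                      # Moore-Penrose inverse J†(q)
    x_dot_dm = x_d - x                              # eq. IV.4: desired task velocity
    q_dot_ds = -K_g * joint_limit_gradient(q, q_min, q_max)   # eq. IV.5
    C1 = J_pinv @ x_dot_dm                          # eq. IV.2: main task contribution
    C2 = (np.eye(len(q)) - J_pinv @ J) @ q_dot_ds   # eq. IV.3: null-space contribution
    return K_m * C1 + K_s * C2                      # eq. IV.1: joint velocity command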

If the secondary task is specified in a different space (i.e.

M(z)) an appropriate Jacobian is needed to transform the

desired velocities into the joints space; equation IV.5 is hence

modified as follows:

q̇ds = −Kg Jz†(q) ∇M(z)   (IV.8)

where Jz(q) is the Jacobian which maps velocities in the q space to velocities in the z space. See [14] for details.

At every control step we check whether the system is close

to singularities by computing the smallest singular value of the

Jacobian through singular value decomposition (SVD). If the

smallest singular value δm is lower than a predefined threshold

δT we rely on the damped least squares solution [13]. In this

case J†(q) in equations IV.2 and IV.3 is computed as follows:

J†(q) = J^T(q) (J(q) J^T(q) + λ²I)⁻¹   (IV.9)

where

λ² = [1 − (δm / δT)²] · λ²MAX   (IV.10)

In our implementation we chose δT = 0.0001 and λ²MAX = 0.00005.
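
A minimal sketch of this singularity check, assuming the same δT and λ²MAX values quoted above; the function name and the use of NumPy's SVD are our own choices.

import numpy as np

def damped_pinv(J, delta_T=1e-4, lambda2_max=5e-5):
    """Return J†(q): the plain pseudo-inverse away from singularities, or the
    damped least squares inverse of eq. IV.9 when the smallest singular value
    of J drops below the threshold delta_T (eq. IV.10 sets the damping)."""
    delta_m = np.linalg.svd(J, compute_uv=False).min()
    if delta_m >= delta_T:
        return np.linalg.pinv(J)
    lambda2 = (1.0 - (delta_m / delta_T) ** 2) * lambda2_max   # eq. IV.10
    m = J.shape[0]
    return J.T @ np.linalg.inv(J @ J.T + lambda2 * np.eye(m))  # eq. IV.9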

In the following sections we describe the two approaches we

investigated to obtain a model of the Jacobian J(q) through

learning: derive it from a learned forward kinematic model or

learn it directly from velocity measurements.

A. Derive the Jacobian from the learned forward model

Following the approach of [2] a neural network is trained

using Locally Weighted Projection Regression (LWPR [15])

to estimate the forward kinematic model of the system,

x = f(q). Then the estimated model is differentiated to obtain

the Jacobian J(q), defined as ∂f/∂q. The derivative computation

comes “for free” since the LWPR algorithm already performs

it every time the model is updated.

At every control step i the LWPR neural network is trained

with a pair < qi, xi > and the local Jacobian J(qi) is

computed. We will refer to this neural network as the kinematic map.
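
The key point is that the controller only needs the derivative of the learned forward model. LWPR provides this derivative analytically; the sketch below substitutes a finite-difference approximation around q so that any incrementally learned forward model f_hat can be plugged in (the function is illustrative, not the paper's implementation).

import numpy as np

def jacobian_from_forward_model(f_hat, q, eps=1e-4):
    """Numerical Jacobian of a learned forward model f_hat: R^7 -> R^3,
    used here as a stand-in for the analytic derivative that LWPR returns."""
    q = np.asarray(q, dtype=float)
    x0 = np.asarray(f_hat(q))
    J = np.zeros((x0.size, q.size))
    for j in range(q.size):
        dq = np.zeros_like(q)
        dq[j] = eps
        J[:, j] = (np.asarray(f_hat(q + dq)) - x0) / eps
    return J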

B. Learn the Jacobian from velocity measurements

In [12] we proposed an algorithm to learn online a model

of the Jacobian. The estimation was based on Receptive

Field Weighted Regression (RFWR [16]) and incremental least

squares (ILS [17]). Here we use the same algorithm, but

we employ LWPR instead of RFWR; LWPR is actually an

enhanced version of RFWR. Its main advantages are that both

update and prediction should be faster (it is an O(n) algorithm,

where n is the input dimension, while RFWR is O(n^2)) and

that the dimensionality is locally reduced with a linear projection

regression that projects locally along the relevant dimensions.

LWPR should outperform RFWR especially in the presence of

a large input space with irrelevant dimensions.

A local Jacobian matrix Ji is estimated from sensory measure-

ments (arm joints displacements, Δq, and hand displacements,

Δx) in the neighborhood of qi. Then, an LWPR neural network

is trained with pairs < qi, Ji >, for different qi. We will

refer to this neural network as the Jacobian map.

The Jacobian map is updated during the movement as follows:

1. collect a new sample (Δx, Δq)i at time i

2. retrieve Ji = LWPR(qi)

3. update Ji with (Δx, Δq)i → Ji+1

4. train LWPR(qi, Ji+1)

where the matrix update at step 3 is done with ILS. At every

control step i the local Jacobian J(qi) is obtained by querying

the map with the current qi.
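
A rough sketch of the update loop described above. The rank-one update below is a simplified, Broyden-style stand-in for the ILS update of [17], and the dictionary keyed on a coarse discretisation of q is only a placeholder for the LWPR map; neither is the paper's implementation.

import numpy as np

jacobian_map = {}          # placeholder for the LWPR Jacobian map

def _key(q, res=0.2):
    """Coarse discretisation of the joint vector, standing in for LWPR's receptive fields."""
    return tuple(np.round(np.asarray(q) / res).astype(int))

def update_jacobian_map(q, dq, dx, J_init):
    J_i = jacobian_map.get(_key(q), J_init)         # step 2: retrieve J_i for q_i
    innovation = dx - J_i @ dq                      # task-space prediction error
    J_next = J_i + np.outer(innovation, dq) / (dq @ dq + 1e-6)   # step 3: rank-one update
    jacobian_map[_key(q)] = J_next                  # step 4: (re)train the map at q_i
    return J_next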

C. LWPR initialization

In general, the output of the LWPR neural network is null

if no local model is associated with the current input. This

always happens in the beginning, when still no samples have

been collected and used for training, but can occur also later,

if the system works in regions of the motor space that have

not been explored yet. If the estimated Jacobian J(q) is null,

the controller IV.1 only relies on C2 (since C1 becomes zero)

and drives the joints toward their mean value, in order to

minimize M(q). As soon as M(q) is minimized, the system

can then easily get stuck around that configuration: if the

output of the LWPR neural network is still null the robot

would not move anymore even if the reaching target changes

(C1 remains zero), it is not able to collect data for training

and thus it cannot learn the kinematic model. Therefore,

we need an initialization matrix for J(q), which is used

every time the LWPR outputs zero. In our implementation we

chose a matrix with random non-null coefficients between zero

and one. The motion generated using such a matrix will not

necessarily bring the hand closer to the target but will produce

exploratory movement. Indeed, a feature of the employed

learning algorithms is that we can use even a “wrong” matrix

as initialization, without preventing the convergence of the

estimation error.
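
A minimal sketch of this fallback, assuming a 3x7 task Jacobian; the function name, the zero test and the coefficient range are our own choices.

import numpy as np

_rng = np.random.default_rng(0)

def jacobian_or_random(J_learned, shape=(3, 7)):
    """Return the learned Jacobian if the map gives a non-null prediction,
    otherwise a random matrix with coefficients between zero and one so that
    C1 in eq. IV.1 still produces exploratory motion."""
    if J_learned is None or not np.any(J_learned):
        return _rng.uniform(0.01, 1.0, size=shape)   # random non-null coefficients
    return J_learned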

V. EXPERIMENTAL RESULTS

The experiments are carried out using the iCub Dynamic

Simulator [18]: a snapshot is displayed in figure 1. All the

software has been realized using YARP [19]. For the sake of

clarity, we recall the definition of joint space vector q and task

space vector x, as used hereinafter:

q = [θsy θsp θsr θe θwy θwr θwp]^T ∈ R⁷

x = [xh yh zh]^T ∈ R³

where θsy , θsp, θsr are the shoulder yaw, pitch and roll

rotations (elevation/depression, adduction/abduction and ro-

tation of the arm), θe is the elbow flexion/extension, θwy ,

θwr, θwp are the waist yaw, roll and pitch rotations (rotation,

adduction/abduction, elevation/depression of the trunk), and

xh, yh, zh are the three Cartesian coordinates describing the

hand position with respect to a fixed reference frame placed


on the ground, midway between the robot's feet.

The robot joint limits are defined as follows:

qmin = [−90°, 0°, −35°, 10°, −50°, −30°, −10°]^T

qmax = [90°, 160°, 95°, 100°, 50°, 30°, 70°]^T
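
For reference, the same limits written as arrays in the order given above (degrees); this is only a restatement of the values, as they could be fed to a joint-limit term like the one sketched in section IV.

import numpy as np

# [shoulder yaw, shoulder pitch, shoulder roll, elbow, waist yaw, waist roll, waist pitch], degrees
q_min = np.array([-90.0,   0.0, -35.0,  10.0, -50.0, -30.0, -10.0])
q_max = np.array([ 90.0, 160.0,  95.0, 100.0,  50.0,  30.0,  70.0])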

We compare the two learning algorithms (kinematic map and

Jacobian map) in section V-A, and the two exploration strate-

gies (random motor babbling and goal directed exploration) in

section V-B. Then in section V-C we evaluate the learning and

adaptation capabilities of the system when using the kinematic

map algorithm and the goal directed exploration strategy.

Experiments presented in this latter section are organized as

follows: first the robot performs reaching movements directed

toward 3D targets (goal directed exploration), then we evaluate

the learning performance by executing some test movements,

which consist of following a given task space trajectory. It is

important to notice that this separation has been introduced

only to assess the quality of the model estimation with repeat-

able and significant tests: it does not mean that learning and

execution occur in different phases. The right image in figure

1 provides information about the distribution of the 3D targets and

the test trajectories. The three black spheres in the kinematic

sketch of the robot indicate the actuated joints: the 3-DOF

waist, the 3-DOF shoulder, and the 1-DOF elbow. The green dots

are a representative subset of the targets: they are all physically

reachable and most of them are located in front of the robot,

mainly on the right side. The two test trajectories are displayed

in blue: the first one is a solid line which depicts a cube-like

figure, the other one is a straight dashed line. We refer to the

first one as the cube test and to the second one as the line test.

Fig. 1. On the left: the iCub Simulator. On the right: a sketch of the robot kinematics, the target points of the goal directed exploration (green dots), the test trajectories (blue solid line for the cube test, blue dashed line for the line test).

A. Comparison of learning algorithms

We first compare the estimation performance of the two

proposed learning algorithms. Indeed, the quality of the task

space controller depends on the quality of the estimation. Dur-

ing the goal directed exploration data is collected at every time

step i in the form < qi, xi, Δqi, Δxi >, and used for training

either the kinematic map, xi = f(qi), or the Jacobian map,

Δxi = J(qi)Δqi, as described in sections IV-A and IV-B.

We define the Jacobian estimation error Je = Δx̂i − Δxi,

where Δx̂i = (∂f/∂q)(qi) Δqi in the case of the kinematic map and

Δx̂i = Ĵ(qi) Δqi in the case of the Jacobian map. We built a test

set of 30K samples to evaluate the estimation performance

of the learning algorithms. In figure 2 the Root Mean Squared

Error (RMSE) of the Jacobian estimation is depicted (i.e. the

average of the RMSE of the three components of Je). Then,

in figure 3 we show the hand trajectories during the cube

test performed with the two learned maps, trained with 150K

samples. The kinematic map learns faster than the Jacobian

map, and also the final RMSE (after 150K samples) appears

to be lower: 0.0033 for the kinematic map, 0.0037 for the

Jacobian map. As a consequence, the hand trajectory during

the cube test turns out to be better (straighter) when using

the kinematic map for the control. A possible reason for this

result is that the velocity measurements used to train the

Jacobian map < Δqi,Δxi > are derived from the measured

positions (Δqi = qi − qi−1, Δxi = xi − xi−1) and this

computation introduces a further estimation error.

Therefore, we will rely on the kinematic map for all the tests

presented hereinafter.
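
A sketch of how such a comparison can be computed over a held-out test set; the function and the tuple layout of the test samples are our own assumptions, not the paper's code.

import numpy as np

def jacobian_rmse(test_set, predict_dx):
    """Average over x, y, z of the RMSE of J_e = dx_hat - dx on a test set of
    (q, dq, dx) tuples; predict_dx(q, dq) is either (df/dq)(q) dq (kinematic map)
    or J_hat(q) dq (Jacobian map)."""
    errors = np.array([predict_dx(q, dq) - dx for q, dq, dx in test_set])
    return np.sqrt(np.mean(errors ** 2, axis=0)).mean()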

Fig. 2. RMSE of the Jacobian estimation with training samples collected during the goal directed exploration, using the two presented approaches: the kinematic map, blue solid line, and the Jacobian map, blue dashed line.

Fig. 3. Hand trajectories during the cube test, after 150K samples have been collected with the goal directed exploration and used for training: controlling with the kinematic map, left image, and controlling with the Jacobian map, right image.

B. Comparison of exploration strategies

An important feature of the goal directed exploration is that

learning is restricted to the regions which are more relevant

for the control: learning such a reduced space should be

faster than learning the whole space. To support our claim

we compared the two approaches. We performed both a

random motor babbling phase (movements toward uniformly

distributed joints configurations were executed controlling the


system in joints space) and a goal directed exploration phase

(the hand was driven toward task space positions using the

task space controller IV.1). Training samples for the kinematic

map were collected during the movements: the distributions of

joint positions in the two cases are shown in figures 4 and 5.

The distribution of hand positions in task space was the same

both in the random motor babbling and in the goal directed

exploration. Then we compared the Jacobian estimation RMSE

of the kinematic map in the two cases: using the random

motor babbling training samples and using the goal directed

exploration training samples. Results in figure 6 show that the

goal directed exploration achieves a faster and more

precise estimation.

Moreover, thanks to the null-space projection exploited in the

controller, the joints are maintained as far as possible from the

limits. As already pointed out, this helps to carry out a safe

and effective exploration. As a comparison, we performed the

goal directed exploration again using controller IV.2 alone,

without null-space projection. The percentage of movements

ending at joint limits in this case was 65%, while it was only

10% when exploiting the null-space projection: as a result the

collected data were highly biased toward the joint limits and the

model estimation was poor.

Fig. 4. Joint positions distribution during the goal directed exploration. Joints are maintained as far as possible from the physical limits.

Fig. 5. Joint positions distribution during the random motor babbling. Joint positions are uniformly distributed over the whole range.

C. Model learning and adaptation

First we evaluated the performance of the controller in the

execution of the cube test as more data samples are acquired

and used to update the kinematic map with the goal directed

exploration. Results are depicted in figure 7. The system

Fig. 6. RMSE of the Jacobian estimation using the kinematic map, with training samples collected during the goal directed exploration, blue solid line, and the random motor babbling, blue dashed line. The test set samples have been collected during the goal directed exploration.

Fig. 7. Hand trajectory during the cube test using the kinematic map trained with an increasing number of samples. From the left: 50K samples, 1000K samples, 7000K samples.

starts to be controllable after 50K samples are learned (almost

all the final positions are reached, albeit with inaccurate

movements), and trajectories become almost straight around

1000K samples. Then performance further improves as more

samples are learned.

1) Unexplored space: The following experiment aims at

evaluating the ability of the learning algorithm to update the

model when a new region of the motor and the task space

is explored. The line test (as defined in the beginning of this

section) is executed for several iterations; data is collected

during the motion to update the model. The resulting hand

trajectories are displayed in figure 8.

A good estimation of the model is achieved around iteration

80, when about 150K samples have been collected and used for

training. In figure 9 the joints positions distribution is depicted.

Even if the controller tries to keep the joints far from the

limits, the solution of the main task (hand trajectory) for the

line test requires driving some of the joints close to the limits.

Therefore, the motor space is explored in a region which was

not covered in the previous goal directed exploration (compare

to figure 4). The same holds for the task space (hand positions),

as can be deduced from the right image in figure 1: the line

test reference trajectory lies in a region of the task space that

has been explored only partially.

2) Robot kinematic modification: After 1000K samples

have been learned during the goal directed exploration we

introduce a modification in the robot kinematics. Specifically,

we change the length of the forearm from 0.18 meters to

0.30 meters. This can simulate different situations, like the

replacement of a robot part with a different one or the use of

a tool for reaching. The goal directed exploration is executed

further with the new kinematics and the cube test is performed

at different learning stages (i.e. increasing number of training


samples). Results are shown in figure 10. The first plot on

the left shows the hand trajectory as soon as the kinematic

modification is introduced. After some learning is performed

hand trajectories become straight again, as in the center and

right plots. The learning process is faster if we train the system

on the specific trajectory. Figure 11 shows the improvements of

the controller if the cube test is repeated for several iterations.

In this case only 7 iterations of the test movement (about 100K

training samples) are sufficient to recover a good estimation

of the kinematic model. Of course in this case the new model

is learned only in this specific region of the workspace.

After 2800K samples have been learned with the new kinemat-

ics the modification is removed. The goal directed exploration

is continued with the original kinematics and the cube test is

performed at different learning stages. Figure 12 shows how

the system is able to recover a good estimation of the model.

Fig. 8. Hand trajectory during the execution of the line test (inside an unexplored region of the state space), at different learning stages. From the top-left corner: after 1, 20, 40, 60, 80, 100 iterations of the test movement.

Fig. 9. Joint positions distribution during the execution of the line test. Movements are performed in a region of the motor space that was not explored during the goal directed exploration.

Fig. 10. Hand trajectories during the cube test, after a modification of the robot kinematics. The goal directed exploration is performed with the new robot kinematics and new samples are collected to update the model. From the left: 0 samples, 700K samples, 2800K samples.

Fig. 11. Hand trajectories during the cube test, after a modification of the robot kinematics. Several iterations of the cube test are performed with the new robot kinematics and new samples are collected to update the model. From the left: 1 iteration, 7 iterations (about 100K samples), 14 iterations (about 200K samples).

Fig. 12. Hand trajectories during the cube test, after the modification of the robot kinematics is removed. The goal directed exploration is performed again with the original robot kinematics and new samples are collected to update the model. From the left: 0 samples, 700K samples, 2800K samples.

Fig. 13. Hand trajectories during the cube test, after a 45° CCW rotation of the reference frame. The goal directed exploration is performed with the rotated reference frame and new samples are collected to update the model. From the left: 0 samples, 700K samples, 2800K samples.

Fig. 14. Hand trajectories during the cube test, after a 45° CCW rotation of the reference frame. Several iterations of the cube test are performed with the rotated reference frame and new samples are collected to update the model. From the left: 1 iteration, 10 iterations (about 150K samples), 25 iterations (about 375K samples).

Fig. 15. Hand trajectories during the cube test, after the 45° CCW rotation of the reference frame is removed. The goal directed exploration is performed again with the original reference frame and new samples are collected to update the model. From the left: 0 samples, 700K samples, 2800K samples.


3) Rotated reference frame: The second modification we

introduce is a 45◦ CCW rotation around the y axis of the task

space reference frame. The perceived position of the hand is

modified as follows:

xh^MOD = cos 45° xh − sin 45° zh   (V.1)

yh^MOD = yh   (V.2)

zh^MOD = cos 45° xh + sin 45° zh   (V.3)
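
A minimal sketch of this perturbation applied to the perceived hand position, written in the form of equations V.1-V.3 (at 45° the cosine and sine coincide, so the expression matches a proper rotation); the function name is ours.

import numpy as np

def perturb_hand_position(x_h, y_h, z_h, angle_deg=45.0):
    """Rotate the perceived hand position as in eqs. V.1-V.3."""
    c, s = np.cos(np.radians(angle_deg)), np.sin(np.radians(angle_deg))
    return c * x_h - s * z_h, y_h, c * x_h + s * z_h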

The goal directed exploration is executed again with the

rotated reference frame and the cube test is performed at

different learning stages. Results are shown in figure 13. We

can observe a similar behavior to the one shown in the case

of the kinematic modification. Again, the learning process is

faster if we train the system on the specific trajectory, as shown

in figure 14. After 2800K samples have been learned with the

rotated reference frame the modification is removed. Again,

the plots in figure 15 show how the system is able to recover

a good estimation of the model.

VI. CONCLUSIONS

We presented a study about the control of a redundant robot

in the task space (operational space) through an autonomously

learned model, and provided experimental results on a sim-

ulated humanoid robot with joint limits. The robot starts

without any prior knowledge and learns online the kinematic

model necessary to control the end-effector in task space

by using arm and waist motion. Differently from previous

works, there is no separation between an exploration phase

and an exploitation phase; on the contrary, the robot shows a

continuous behavior whose performance evolves over time. We

showed the advantages provided by goal directed exploration

in the task space with respect to random motor babbling in the

joints space. Since the learning space is explored through goal

directed movements the part of the motor space that is actually

covered depends on the choice of the controller. Specifically,

we exploit the redundancy of the robot to address two tasks

simultaneously: to cancel the task space position error (main

task) and to keep the joints as far as possible from the limits

(secondary task). We realize this through Jacobian control

with null-space projection. As a consequence of this strategy

the motor space is not explored exhaustively, but only in the

regions that are more relevant for the task fulfillment: this

speeds up learning in those regions. As soon as actions need

to be performed in other areas the learned model is updated

accordingly. To test the ability of the system to adapt to

sudden changes both in the robot structure and in the perceived

environment we artificially introduced two different kinds of

kinematic perturbations: a modification of the length of one

link and a rotation of the task space reference frame. Results

demonstrate that the online update of the model allows the

robot to cope with such situations.

The proposed strategy is very general and can be applied to

any robot which needs to be controlled in a specific task

space. In particular, we are already implementing it on the

humanoid robot Kobian [20], in order to control the robot

hand for reaching and grasping purposes; in this case, the task

space variable is the hand position in the robot visual space

(i.e. uncalibrated camera coordinates).

REFERENCES

[1] A. D’Souza, S. Vijayakumar, and S. Schaal, “Learning inverse kinematics,” in International Conference on Intelligent Robots and Systems (IROS), Maui, Hawaii, USA, Oct. 29 - Nov. 03 2001, pp. 298–303.

[2] C. Salaun, V. Padois, and O. Sigaud, “Control of redundant robots using learned models: an operational space control approach,” in International Conference on Intelligent Robots and Systems (IROS), St. Louis, USA, October 11-15 2009, pp. 878–885.

[3] M. Rolf, J. J. Steil, and M. Gienger, “Goal babbling permits direct learning of inverse kinematics,” IEEE Transactions on Autonomous Mental Development, vol. 2, no. 3, pp. 216–229, 2010.

[4] A. Baranes and P. Oudeyer, “Intrinsically motivated goal exploration for active motor learning in robots: A case study,” in International Conference on Intelligent Robots and Systems (IROS), Taipei, Taiwan, 2010.

[5] S. Rougeaux and Y. Kuniyoshi, “Robust tracking by a humanoid vision system,” in International Workshop on Humanoid and Human Friendly Robotics, Tsukuba, Japan, 1998.

[6] D. Demers and K. Kreutz-Delgado, “Learning global direct inverse kinematics,” in Advances in Neural Information Processing Systems. Morgan Kaufmann, 1992, pp. 589–595.

[7] L. Natale, F. Nori, G. Metta, and G. Sandini, “Learning precise 3D reaching in a humanoid robot,” in International Conference on Development and Learning. London, UK: IEEE, 2007.

[8] R. E. Bellman, Dynamic Programming. Princeton University Press, 1956.

[9] J. Baillieul, “Kinematic programming alternatives for redundant manipulators,” in International Conference on Robotics and Automation (ICRA), St. Louis, USA, 1985, pp. 722–728.

[10] L. Sciavicco and B. Siciliano, “A solution algorithm for the inverse kinematic problem for redundant manipulators,” International Journal of Robotics and Automation, no. 4, pp. 403–410, 1988.

[11] A. Liégeois, “Automatic supervisory control of the configuration and behavior of multibody mechanisms,” Transactions on Systems, Man and Cybernetics, no. 7, pp. 868–871, 1977.

[12] L. Jamone, L. Natale, M. Fumagalli, F. Nori, G. Metta, and G. Sandini, “Machine-learning based control of a human-like tendon driven neck,” in International Conference on Robotics and Automation. Anchorage, Alaska, USA: IEEE-RAS, 2010.

[13] Y. Nakamura and H. Hanafusa, “Inverse kinematic solutions with singularity robustness for robot manipulator control,” Transactions of the ASME Journal of Dynamic Systems, Measurement and Control, no. 108, pp. 163–171, 1986.

[14] S. Chiaverini, “Singularity-robust task-priority redundancy resolution for real-time kinematic control of robot manipulators,” Transactions on Robotics and Automation, no. 13(3), pp. 398–410, 1997.

[15] S. Vijayakumar and S. Schaal, “Locally weighted projection regression: An O(n) algorithm for incremental real time learning in high dimensional space,” in International Conference on Machine Learning (ICML), 2000, pp. 1079–1086.

[16] S. Schaal and C. G. Atkeson, “Constructive incremental learning from only local information,” Neural Computation, vol. 10(8), pp. 2047–2084, 1998.

[17] K. Hosoda and M. Asada, “Versatile visual servoing without knowledge of true Jacobian,” in International Conference on Intelligent Robots and Systems (IROS). IEEE, 1994.

[18] V. Tikhanoff, P. Fitzpatrick, G. Metta, L. Natale, F. Nori, and A. Cangelosi, “An open source simulator for cognitive robotics research: The prototype of the iCub humanoid robot simulator,” in Workshop on Performance Metrics for Intelligent Systems, National Institute of Standards and Technology, Washington DC, August 19-21 2008.

[19] G. Metta, P. Fitzpatrick, and L. Natale, “YARP: yet another robot platform,” International Journal on Advanced Robotics Systems, March 2006, special issue on Software Development and Integration in Robotics.

[20] N. Endo, K. Endo, M. Zecca, and A. Takanishi, “Modular design of emotion expression humanoid robot Kobian,” in 18th CISM-IFToMM Symposium on Robot Design, Dynamics, and Control (ROMANSY), Udine, Italy, 2010.
