
Improved Adaptive-Reinforcement Learning Control

for Morphing Unmanned Air Vehicles

James Doebbler ∗, Monish D. Tandale †, John Valasek ‡

Texas A&M University, College Station, TX 77843-3141

and

Andrew J. Meade, Jr. §

William Marsh Rice University, Houston, Texas 77251-1892

This paper presents an improved Adaptive-Reinforcement Learning Control methodology for the problem of unmanned air vehicle morphing. Sequential Function Approximation, a Galerkin-based scattered data approximation scheme, is used to generalize the learning from previously experienced quantized states and actions to the continuous state-action space, all of which may not have been experienced before. The reinforcement learning morphing control function is integrated with an adaptive dynamic inversion control trajectory tracking function. An episodic unsupervised learning simulation using the Q-Learning method is developed to learn the optimal shape change policy, and optimality is addressed by cost functions representing optimal shapes corresponding to flight conditions. The methodology is demonstrated with a numerical example of a hypothetical 3-D smart unmanned air vehicle that can morph in all three spatial dimensions, tracking a specified trajectory and autonomously morphing over a set of shapes corresponding to flight conditions along the trajectory. Results presented in the paper show that this methodology is capable of learning the required shape and morphing into it, and accurately tracking the reference trajectory in the presence of parametric uncertainties, unmodeled dynamics and disturbances.

I. Introduction

Morphing research has led to a series of breakthroughs in a wide variety of disciplines that, when fully realized for aircraft applications, have the potential to produce large increments in aircraft system safety, affordability, and environmental compatibility.1 The National Aeronautics and Space Administration's Morphing Project defines morphing as an efficient, multi-point adaptability that includes macro, micro, structural and/or fluidic approaches.2 A hypothetical example is shown in Fig.1(a).

The Defense Advanced Research Projects Agency (DARPA) uses the definition of an air vehicle that is able to change its state substantially (on the order of 50% more wing area or wing span and chord) to adapt to changing mission environments, thereby providing superior system capability that is not possible without reconfiguration. Such a design integrates innovative combinations of advanced materials, actuators, flow controllers, and mechanisms to achieve the state change.3 In spite of their relevance, these definitions do not adequately address or describe the supervisory and control aspects of morphing. Reference 4 attempts to do so by making a distinction between Morphing for Mission Adaptation and Morphing for Control. In the context of flight vehicles, Morphing for Mission Adaptation is a large scale, relatively slow, in-flight

∗Graduate Research Assistant, Flight Simulation Laboratory, Aerospace Engineering Department. Student Member, AIAA. [email protected]

†Graduate Research Assistant and Doctoral Candidate, Flight Simulation Laboratory, Aerospace Engineering Department. Student Member, AIAA. [email protected]

‡Associate Professor & Director, Flight Simulation Laboratory, Aerospace Engineering Department. Associate Fellow, AIAA. [email protected], http://jungfrau.tamu.edu/valasek/

§Associate Professor, Department of Mechanical Engineering and Materials Science. Associate Fellow, AIAA. [email protected]


Infotech@Aerospace, 26-29 September 2005, Arlington, Virginia

AIAA 2005-7159

Copyright © 2005 by James Doebbler, Monish D. Tandale, John Valasek, and Andrew J. Meade. Published by the American Institute of Aeronautics and Astronautics, Inc., with permission.

Figure 1. Morphing Air Vehicle. (a) Conceptual Self-Optimizing Aircraft; (b) shapes of the simplified morphing air vehicle for several axis dimensions (x=4, y=1, z=1; x=2, y=1, z=2; x=2, y=2, z=1; x=1, y=2, z=2).

shape change to enable a single vehicle to perform multiple diverse mission profiles. Conversely, Morphing for Control is an in-flight physical or virtual shape change to achieve multiple control objectives, such as maneuvering, flutter suppression, load alleviation and active separation control.

Previous research by the authors developed and demonstrated an Adaptive-Reinforcement Learning Control (A-RLC) methodology for the problem of aircraft morphing.4 This paper develops an improved Adaptive-Reinforcement Learning Control methodology by using Sequential Function Approximation, a Galerkin-based scattered data approximation scheme, to more efficiently and accurately generalize the knowledge gained from discrete experiences to the entire state space.

The paper is organized as follows. Section II explains in detail the morphing vehicle model and the setup of the numerical simulation. The reinforcement learning methodology is presented in Section III. Section IV describes the sequential function approximation method that improves the performance of the reinforcement learning over the K-nearest neighbor method. The trajectory tracking controller is developed in Section V, followed by the A-RLC functionality, numerical simulation results, and conclusions in the subsequent sections.

II. Morphing Air Vehicle Simulation

A simplified morphing air vehicle model in the shape of an ellipsoid is used to obtain the preliminary results in this paper. Fig.1(b) shows some of the possible shapes that this vehicle can morph into. The morphing used in this research involves a change in the dimensions of the ellipsoid axes, while maintaining a constant total volume. The Reinforcement Learning module specifies the y and z axis dimensions corresponding to the current flight condition, and the x dimension is calculated by enforcing the constant volume condition, $x = \frac{6V}{\pi y z}$. It is assumed that the ellipsoidal air vehicle is composed of a smart material (Piezoelectric, Shape Memory Alloy (SMA), or Carbon Nanotubes) whose shape can be modulated by applying voltage. For simulation purposes, the morphing dynamics are assumed to be the simple nonlinear differential equations given by Eqn.1 and Eqn.2 respectively. The model relating the y dimension to the applied voltage Vy is given in Eqn.1

$\dot{y} + 2y\dot{y} = V_y$  (1)

and similarly for the z dimension

$\dot{z} + 3z\dot{z} = V_z$  (2)


Fig.2(a) shows how the y-dimension and the z-dimension evolve for various constant inputs, with varied initial conditions. The morphing dynamics are such that the control voltage causes a change in the dimensions; as soon as the control voltage goes to zero the dimension remains constant. Fig.2(a), subplot two, shows the change in the z dimension when various possible inputs from the action set are applied for one second. Fig.2(a), subplot one, shows the change in the y dimension when various possible inputs are applied twice consecutively, each input remaining constant for one second.
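To make the morphing model concrete, the following Python sketch (our own illustration, not code from the paper) integrates the assumed form of Eqns.1 and 2 with a forward-Euler step and recovers x from the constant-volume condition; the time step, voltage profile, and volume value are arbitrary choices.

import numpy as np

def morph_step(y, z, Vy, Vz, dt=0.01):
    # One forward-Euler step of the assumed morphing dynamics of Eqns. 1 and 2,
    # read here as (1 + 2y)*ydot = Vy and (1 + 3z)*zdot = Vz.
    ydot = Vy / (1.0 + 2.0 * y)
    zdot = Vz / (1.0 + 3.0 * z)
    return y + dt * ydot, z + dt * zdot

def x_from_volume(y, z, V=2.0 * np.pi):
    # Constant-volume condition of the ellipsoid: x = 6V / (pi * y * z).
    # V = 2*pi is an arbitrary placeholder value.
    return 6.0 * V / (np.pi * y * z)

# Apply a constant voltage for one second, then zero voltage: the dimensions
# change while the voltage is applied and remain constant afterwards.
y, z = 2.0, 2.0
for k in range(200):                              # 2 s at dt = 0.01 s
    Vy, Vz = (0.5, -0.5) if k < 100 else (0.0, 0.0)
    y, z = morph_step(y, z, Vy, Vz)
print(y, z, x_from_volume(y, z))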

The Cost Function is selected to be a function of the current shape and the current flight condition. The current shape can be identified uniquely from the y and z dimensions of the morphing ellipsoid. The current flight condition is identified by discrete values from 0 to 5, so the Cost Function is a map J : [2, 4] × [2, 4] × [0, 5] → [0, ∞).

The total cost function J can be written as a sum J = Jy + Jz, where Jy and Jz are the costs associated with the y and the z dimension respectively for the different flight conditions. Let Sy(F) be the optimal y dimension at flight condition F and Sz(F) be the optimal z dimension at flight condition F. The optimal shapes are arbitrarily selected to be nonlinear functions of the flight condition and are shown graphically in Fig.2(c).

$S_y = 3 + \cos\left(\tfrac{\pi}{5}F\right)$  (3)

$S_z = 2 + 2e^{-0.5F}$  (4)

With the optimal y dimension given by Eqn.3, the cost function Jy is defined as

$J_y = (y - S_y(F))^2$  (5)

where y is any arbitrary y dimension and F is the current flight condition. The surface plot of Jy is shown in Fig.2(b). Similarly, the cost function Jz is defined as

$J_z = (z - S_z(F))^2$  (6)

where z is any arbitrary z dimension, F is the current flight condition and Sz is the optimal z dimension at flight condition F. The surface plot of Jz is shown in Fig.2(d). For the simulation we specify the flight conditions that the morphing vehicle should fly at various locations along a pre-designated flight path (Fig.2(e)). For this preliminary example the optimal shapes are not correlated to the flight path, but depend only on the flight condition.
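For concreteness, the cost functions of Eqns.3-6 can be evaluated directly; the short Python sketch below is illustrative only, and the function names are ours.

import numpy as np

def S_y(F):
    # Optimal y dimension at flight condition F (Eqn. 3).
    return 3.0 + np.cos(np.pi * F / 5.0)

def S_z(F):
    # Optimal z dimension at flight condition F (Eqn. 4).
    return 2.0 + 2.0 * np.exp(-0.5 * F)

def total_cost(y, z, F):
    # Total cost J = Jy + Jz (Eqns. 5 and 6).
    return (y - S_y(F)) ** 2 + (z - S_z(F)) ** 2

# Example: cost of the shape (y, z) = (3.0, 2.5) at flight condition F = 2.
print(total_cost(3.0, 2.5, 2))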

The objective of the Reinforcement Learning (RL) Module is to learn the optimal control policy that, for a specific flight condition, commands the optimal voltage which changes the vehicle shape towards the optimal shape Sy and Sz. Therefore, the RL module seeks to minimize the total cost J over the entire flight trajectory.

A. Mathematical Model for the Dynamic Behavior of the Morphing Air Vehicle

The dynamic behavior of the morphing air vehicle is modeled by nonlinear six degree-of-freedom equations.5

The equations are written in two coordinate systems: an inertial axis system, and a body fixed axis system with origin at the center of mass of the vehicle.

The structured model approach is followed and the dynamic equations are partitioned into ‘kinematic level’ and ‘acceleration level’ equations. The kinematic level variables are dx, dy, dz, φ, θ and ψ. The states dx, dy, dz are the positions of the center of mass of the morphing vehicle along the inertial XN, YN and ZN axes. φ, θ and ψ are the 3-2-1 Euler angles which give the relative orientation of the body axis with the inertial axis. The acceleration level states are the body-axis linear velocities u, v, w and the body-axis angular velocities p, q, r.

Let $p_c = [d_x\ d_y\ d_z]^T$, $v_c = [u\ v\ w]^T$, $\sigma = [\phi\ \theta\ \psi]^T$ and $\omega = [p\ q\ r]^T$. The kinematic states and the acceleration states are related by the differential equations

$\dot{p}_c = J_l v_c$  (7)

$\dot{\sigma} = J_a \omega$  (8)


Figure 2. Morphing Aircraft Simulation. (a) Morphing dynamics of the y-dimension and the z-dimension when subjected to various input voltages Vy and Vz; (b) objective function Jy; (c) optimal shapes at the different flight conditions; (d) objective function Jz; (e) flight conditions encountered while travelling along the flight path.


where

$J_l = \begin{bmatrix} C_\theta C_\psi & S_\phi S_\theta C_\psi - C_\phi S_\psi & C_\phi S_\theta C_\psi + S_\phi S_\psi \\ C_\theta S_\psi & S_\phi S_\theta S_\psi + C_\phi C_\psi & C_\phi S_\theta S_\psi - S_\phi C_\psi \\ -S_\theta & S_\phi C_\theta & C_\phi C_\theta \end{bmatrix}, \qquad J_a = \begin{bmatrix} 1 & S_\phi \tan\theta & C_\phi \tan\theta \\ 0 & C_\phi & -S_\phi \\ 0 & S_\phi \sec\theta & C_\phi \sec\theta \end{bmatrix}$  (9)

and $C_\phi = \cos\phi$, $S_\theta = \sin\theta$, and so on. The acceleration level differential equations are

$m\dot{v}_c + \tilde{\omega} m v_c = F + F_d$  (10)

$I\dot{\omega} + \dot{I}\omega + \tilde{\omega} I \omega = M + M_d$  (11)

where m is the mass of the morphing air vehicle, F is the control force, Fd is the drag force, I is the body axis moment of inertia, M is the control torque, Md is the drag moment, and $\tilde{\omega}V$ is the matrix representation of the cross-product between vector ω and vector V,

$\tilde{\omega} = \begin{bmatrix} 0 & -r & q \\ r & 0 & -p \\ -q & p & 0 \end{bmatrix}$  (12)

Note that for this study the motion of the morphing air vehicle is simulated in the absence of gravity. Also, Eqn.11 has an additional term $\dot{I}\omega$ when compared to rigid body equations of motion. This term is a consequence of the shape change and is responsible for speeding up or slowing down the rotation of the vehicle due to the time rate of change in the moment of inertia about a particular axis.

The drag force Fd is modeled as a function of the air density ρ, the square of the velocity along the axis, and the projected area of the ellipsoidal vehicle perpendicular to the axis. The dimensions of the ellipsoid axes along the body axes are x, y, z respectively.

$F_d = -\frac{\rho\pi}{8}\begin{bmatrix} u^2\,\mathrm{sgn}(u)\,yz \\ v^2\,\mathrm{sgn}(v)\,xz \\ w^2\,\mathrm{sgn}(w)\,xy \end{bmatrix}$  (13)

Similarly, the drag moment Md is modeled as a function of the air density ρ, the square of the angular velocity along the axis, and the projected areas of the ellipsoid on the coordinate system planes parallel to the axis.

$M_d = -\frac{\rho\pi}{8}\begin{bmatrix} p^2\,\mathrm{sgn}(p)\,x(y + z) \\ q^2\,\mathrm{sgn}(q)\,y(x + z) \\ r^2\,\mathrm{sgn}(r)\,z(x + y) \end{bmatrix}$  (14)

B. Trajectory Generation

The reference trajectory is arbitrary and is generated as a combination of straight lines and sinusoidal curves. Fig.3(a) shows a sample reference trajectory that the vehicle is required to track. The total flight path is divided into an odd number of segments of variable lengths. During every odd numbered segment the dy and dz locations remain constant, with their values generated by a random function. The even numbered segments connect the trajectories in two adjacent sections with smooth sinusoidal curves.

The morphing vehicle is required to move along the flight trajectory with a constant inertial velocity along the XN inertial axis. For the attitude reference, it is required to trace prescribed sinusoidal oscillations about the XN inertial axis. Fig.3(b) shows the projections of the reference trajectory in the Y-X, Z-Y and Z-X planes, where X is the downrange component, Y is the crossrange component and Z is the altitude. Fig.3(b), subplot 1, shows the top view and subplot 3 shows the side view with motion from left to right. Subplot 2 shows the front view with the vehicle moving towards the observer.
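As a hedged illustration of this construction (the segment lengths, number of segments, level range, and blend shape below are our own arbitrary choices, not the paper's values), straight segments with random constant dy and dz can be joined by smooth sinusoidal blends:

import numpy as np

def reference_trajectory(n_segments=7, seg_len=15.0, dx_rate=0.5, dt=0.1, seed=0):
    # Piecewise reference: segments 0, 2, 4, ... (odd-numbered in 1-based counting)
    # hold a random (dy, dz) level constant; the segments in between blend the two
    # adjacent levels with a half-cosine, giving a smooth sinusoidal transition.
    rng = np.random.default_rng(seed)
    levels = rng.uniform(0.0, 4.0, size=((n_segments + 1) // 2, 2))
    t = np.arange(0.0, n_segments * seg_len, dt)
    dy = np.zeros_like(t)
    dz = np.zeros_like(t)
    for i, ti in enumerate(t):
        seg = int(ti // seg_len)
        k = seg // 2
        if seg % 2 == 0:                       # straight segment: hold level k
            dy[i], dz[i] = levels[k]
        else:                                  # blend level k -> level k + 1
            s = (ti - seg * seg_len) / seg_len
            blend = 0.5 * (1.0 - np.cos(np.pi * s))
            dy[i], dz[i] = (1 - blend) * levels[k] + blend * levels[k + 1]
    dx = dx_rate * t                           # constant inertial velocity along X_N
    return t, dx, dy, dz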


Figure 3. Reference Trajectory Generation. (a) Reference trajectory for positions along the inertial axes; (b) projections of the reference trajectory in the Y-X, Z-Y and Z-X planes.

III. Reinforcement Learning Module

A. A Brief Introduction to Reinforcement Learning

Reinforcement learning (RL) is a method of learning from interaction to achieve a goal. The learner and decision-maker is called the agent. The thing it interacts with, comprising everything outside the agent, is called the environment. The agent interacts with its environment at each instance of a sequence of discrete time steps, t = 0, 1, 2, 3, .... At each time step t, the agent receives some representation of the environment's state, st ∈ S, where S is the set of possible states, and on that basis it selects an action, at ∈ A(st), where A(st) is the set of actions available in state st. One time step later, partially as a consequence of its action, the agent receives a numerical reward, rt+1 ∈ R, and finds itself in a new state, st+1. The mapping from states to probabilities of selecting each possible action at each time step, denoted by π, is called the agent's policy, where πt(s, a) indicates the probability that at = a given st = s at time t. Reinforcement learning methods specify how the agent changes its policy as a result of its experiences. The agent's goal is to maximize the total amount of reward it receives over the long run.

Almost all reinforcement learning algorithms are based on estimating value functions. For a policy π, there are two types of value functions. One is the state-value function V π(s), which estimates how good it is, under policy π, for the agent to be in state s. It is defined as the expected return starting from s and thereafter following policy π. The other is the action-value function Qπ(s, a), which estimates how good it is, under policy π, for the agent to perform action a in state s. It is defined as the expected return starting from s, taking action a, and thereafter following policy π. The process of computing V π(s) or Qπ(s, a) is called policy evaluation. π can be improved to a better π′ that, given a state, always selects the action, of all possible actions, with the best value based on V π(s) or Qπ(s, a). This process is called policy improvement. V π′(s) or Qπ′(s, a) can then be computed to improve π′ to an even better π′′. The ultimate goal of RL is to find the optimal policy π∗ that has the optimal state-value function, denoted by V ∗(s) and defined as V ∗(s) = maxπ V π(s), or the optimal action-value function, denoted by Q∗(s, a) and defined as Q∗(s, a) = maxπ Qπ(s, a). This recursive way of finding an optimal policy is called policy iteration. To date, there are three major methods for policy iteration: dynamic programming, Monte Carlo methods, and temporal-difference learning.

Reinforcement Learning has been applied to a wide variety of physical control tasks, both real and simulated. For example, the acrobot is a two-link, under-actuated robot roughly analogous to a gymnast swinging on a high bar; controlling such a system by RL has been studied by many researchers.6,7,8 In many applications of RL to control tasks, the state space is too large to enumerate the value function, and some function approximator must be used to compactly represent the value function. Commonly used approaches include neural networks, clustering, nearest-neighbor methods, tile coding, and the cerebellar model articulation controller.


B. Implementation of Reinforcement Learning Module

For the present research, the agent in the smart morphing vehicle problem is its RL module. It attempts to minimize the total amount of cost over the entire flight trajectory. To reach this goal, it endeavors to learn, from its interaction with the environment, the optimal policy that, given the specific flight condition, commands the optimal voltage that changes the morphing vehicle shape towards the optimal one. The environment is the set of flight conditions in which the vehicle is flying. We assume that the RL module has no prior knowledge of the relationship between voltages and the dimensions of the morphing vehicle, as defined by the morphing control functions MCy and MCz in Eqn.1 and Eqn.2. It also does not know the relationship between the flight conditions, costs and the optimal shapes as defined in Fig.2(c), Fig.2(b) and Fig.2(d). However, the RL module does know all possible voltages that can be applied. It has accurate, real-time information of the morphing vehicle shape, the present flight condition, and the current cost provided by a variety of sensors.

The RL module uses a 1-step Q-learning method, which is a common off-policy Temporal Difference (TD) control algorithm. The algorithm is illustrated as follows:9

Q-Learning()

• Initialize Q(s, a) arbitrarily

• Repeat (for each episode)

– Initialize s

– Repeat (for each step of the episode)

∗ Choose a from s using a policy derived from Q(s, a) (e.g. ε-Greedy Policy)
∗ Take action a, observe r, s′
∗ Q(s, a) ← Q(s, a) + α[r + γ maxa′ Q(s′, a′) − Q(s, a)]
∗ s ← s′

– until s is terminal

• return Q(s, a)

The agent learns the greedy policy, defined as π(s, a) = argmaxa Q(s, a). As the learning episodes increase, the learned action-value function Q(s, a) converges asymptotically to the optimal action-value function Q∗(s, a). The method is an off-policy one because it evaluates the target policy (the greedy policy) while following another policy. The policy used in updating Q(s, a) can be a random policy, with each action having the same probability of being selected. The other option is an ε-greedy policy, where ε is a small value: the action a with the maximum Q(s, a) is selected with probability 1−ε, otherwise a random action is selected.
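The following self-contained Python sketch shows the tabular form of the 1-step Q-learning update with an ε-greedy behavior policy on a generic discretized problem; the environment interface, decay schedule, and parameter values are illustrative assumptions rather than the settings used for the morphing vehicle, whose continuous state requires the function approximation discussed later.

import numpy as np

def q_learning(env_step, n_states, n_actions, episodes=200,
               alpha=0.1, gamma=0.95, eps0=1.0, eps_decay=0.99, seed=0):
    # Tabular 1-step Q-learning with an epsilon-greedy behavior policy.
    # env_step(s, a) must return (reward, next_state, done).
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    eps = eps0
    for _ in range(episodes):
        s = rng.integers(n_states)                    # arbitrary initial state
        done = False
        while not done:
            if rng.random() < eps:                    # explore
                a = int(rng.integers(n_actions))
            else:                                     # exploit
                a = int(np.argmax(Q[s]))
            r, s_next, done = env_step(s, a)
            # 1-step Q-learning update (off-policy TD target)
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
        eps *= eps_decay                              # shift from exploration to exploitation
    return Q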

When selecting the next action, one typical problem the agent has to face is the exploration-exploitation dilemma.9 If the agent selects a greedy action that has the highest value, then it is exploiting its knowledge obtained so far about the values of the actions. If instead it selects one of the non-greedy actions, then it is exploring to improve its estimate of the non-greedy actions' values. Naturally, the RL agent is more explorative in the early stage of the learning and more exploitative in the late stage. We realize the change from exploration to exploitation by changing the value of ε in the ε-greedy policy. At first, ε is large, so that each action has equal probability of being selected and the agent can explore as many state-action pairs as possible. The value of ε is decreased gradually as the learning goes on, which allows the agent to exploit its learning result. The whole policy iteration process is implemented in the following manner:

Policy Iteration()

• Initialize Q(s, a) arbitrarily

• Initialize Samples

• Repeat (for each episode)

– Repeat (for each step of the episode)

∗ // Collect samples
∗ Choose a from s using the ε-greedy policy derived from Q(s, a)
∗ Take action a, observe r, s′
∗ Samples ← Samples + (s, a, r, s′)
∗ if (enough samples collected)

· // Compute Q values of the samples
· Repeat (for each iteration)
· Repeat (for each sample)
· Q(s, a) ← Q(s, a) + α[r + γ maxa′ Q(s′, a′) − Q(s, a)]
· Re-initialize Samples

– Until s is terminal

• Given a new state, approximate its Q value based on the sampled states and Q values

If the number of states and actions of an RL problem is small, its Q(s, a) can be represented using a table, where the action value for each state-action pair is stored in one entry of the table. Since the RL problem for the morphing vehicle has states (the shape of the ellipsoidal vehicle) on continuous domains, it is impossible to enumerate the action-value for each state-action pair. The solution is to use some type of approximation method to determine the best action at a specific state based on the learned values and states.

In previous versions of the learning algorithm we used the K-nearest neighbors method to approximate the action-value function Q at an arbitrary state-action pair (s, a). In this method we first collect a set of state-action pair samples; then we compute optimal action-values of these samples using the Q-learning method illustrated above; and finally we determine the action-value of an arbitrary new state-action pair using the data from its K nearest neighbors in the set of samples. During the learning process, we found that there were some regions of the state space which were not explored fully enough to provide the K-nearest neighbors method with enough good nearby data points. The key problem we experienced was that the KNN algorithm sometimes included action-value function values for states that had not been updated during the learning process and therefore still contained the random initial values of the action-value function for those states. The result of this problem is that sometimes the chosen action was not a good approximation of the learned action-value function at a specific state. This problem was apparent when the chosen action would morph the body to a dimension that was noticeably different from the optimal shape for the current flight condition, which indicates a problem in learning the optimal shape. We decided to try to improve the shape learning by utilizing a method of function approximation that does not have as much of a dependence on the quality of the data in a small region and can utilize more global data in the absence of detailed local information. The following section describes such a method.
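Before moving on, the K-nearest-neighbor approximation just described reduces, in its simplest form, to the few lines below (our own sketch; the inverse-distance weighting is one possible choice, since the paper does not specify how the K neighbors are combined).

import numpy as np

def knn_q(query, sample_points, sample_q, k=5):
    # Approximate Q at a continuous state-action point `query` from sampled
    # (state, action) points and their learned Q values.
    d = np.linalg.norm(sample_points - query, axis=1)
    idx = np.argsort(d)[:k]                          # indices of the K nearest samples
    w = 1.0 / (d[idx] + 1e-9)                        # inverse-distance weights (one possible choice)
    return float(np.sum(w * sample_q[idx]) / np.sum(w))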

IV. Sequential Function Approximation Method

In this section we describe an adaptive and matrix-free scheme developed for interpolating and approximating sparse multi-dimensional scattered data. The Sequential Function Approximation (SFA) method10 is based on a sequential Galerkin approach to artificial neural networks and is linear in storage with respect to the number of samples, s. The method is simple to program and requires neither ad-hoc parameters for the user to tune, nor rescaling of the inputs. Approximation by artificial neural networks is discussed first, along with optimization and basis functions; the SFA approach and its implementation are discussed in Sections B and C.

A. Function Approximation in Artificial Neural Network Applications

Any approximation algorithm that uses a combination of basis functions mapped into a graph-directed representation can be called an artificial neural network. Function approximation in the framework of neural computations has been based primarily on the results of Cybenko11 and Hornik et al.,12 who showed that a continuous d-dimensional function (where d represents the dimension of the input) can be arbitrarily well approximated by a linear combination of one-dimensional functions φ (basis functions).

$u \approx u^a_n(\xi) = c_0 + \sum_{i=1}^{n} c_i\,\phi(\eta(\xi, \beta_i))$  (15)


where η ∈ R, ξ ∈ R^d, βi ∈ R^m, and ci ∈ R represent the basis function argument, the d-dimensional independent input variables, the nonlinear function optimization parameters (for m basis parameters), and the i-th linear coefficients, respectively. Additionally, u(ξ) is the target function and $u^a_n(\xi)$ is the n-th stage of the target function approximation, where a is a positive constant greater than 1 and n is the number of bases.

The appropriate linear and nonlinear network parameters in Eq. (15) are traditionally selected by solving a non-linear optimization problem with the objective function given by the L2 norm (defined as $\|f\|_2 = \left(\int_\Omega f^2\,d\xi\right)^{1/2}$ for some arbitrary function f) of the error over some domain of interest Ω,

$\varepsilon^2 \equiv \int_\Omega (u - u^a_n)^2\,d\xi = \|u - u^a_n\|_2^2$ .  (16)

In the neural network literature, the numerical minimization of Eq. (16) by a gradient descent procedure is known as the backpropagation algorithm.13 More sophisticated optimization methods, including the conjugate gradient and the Levenberg-Marquardt methods, have also been used for neural network training. However, it has been found that even these advanced optimization methods are prone to poor convergence.14

Clearly, the training algorithms must address a multi-dimensional optimization problem with non-linear dependence on the network parameters βi.

As an alternative, Jones15,16 and Barron17 proposed the following iterative algorithm for sequential approximation:

$u^a_n(\xi) = \alpha_n u^a_{n-1}(\xi) + c_n\,\phi(\eta(\xi, \beta_n))$  (17)

where βn, cn, and αn are selected optimally at each iteration of the algorithm. As a result, the high-dimensional optimization problem associated with neural network training is reduced to a series of simpler low-dimensional problems. A general principle of statistics was utilized to show that the upper bound of the error ε is of the order $C/\sqrt{n}$, where C is a positive constant. Orr18 then introduced a forward selection method of sequential network training, in effect a method of incremental function approximation. At each iteration of the algorithm an additional basis function, producing the largest reduction of error in the previous iteration, is chosen from a given set of functions and added to the approximation. This forward selection training method can be inefficient as it may require significant computational resources when the set of trial functions is large. A similar principle is utilized in Platt's resource allocating networks (RAN).19 Whenever an unusual pattern is presented to the network in on- or off-line network training, a new computational “unit” is allocated. Note that these computational units respond to local regions of the input space.

The concept of sequential approximation is one of the major features of the method proposed in this paper for the solution of scattered data approximation problems.

B. The Proposed Method

The basic principles of the proposed computational method for solving scattered data approximation problems are presented in this section. The development of this method was motivated by the similarities between the iterative optimization procedures reviewed in Section A and the Method of Weighted Residuals (MWR), specifically the Galerkin method.20

To begin, we can write the function residual rn using the n-th stage of Eq. (15) as

$r(c_n, \xi, \beta_n) = r_n = u(\xi) - u^a_n(\xi) = u(\xi) - u^a_{(n-1)}(\xi) - c_n\phi(\eta(\xi, \beta_n)) = r_{(n-1)} - c_n\phi(\eta(\xi, \beta_n)) = r_{(n-1)} - c_n\phi_n$ .

Recalling that the inner product 〈f, v〉 of two arbitrary functions f and v is defined as $\int_\Omega f v\,d\xi$, and utilizing the Petrov-Galerkin approach, we select a cn that will force the function residual to be orthogonal to the basis function,

$\langle r_n, \phi_n\rangle = -\left\langle r_n, \frac{\partial r_n}{\partial c_n}\right\rangle = -\frac{1}{2}\frac{\partial \langle r_n, r_n\rangle}{\partial c_n} = 0$

which is equivalent to selecting a value of cn that will minimize 〈rn, rn〉, or

$c_n = \frac{\langle \phi_n, r_{(n-1)}\rangle}{\langle \phi_n, \phi_n\rangle}$ .  (18)


The values of our remaining variables (βn) must minimize the same objective

$\langle r_n, r_n\rangle = \langle r_{(n-1)}, r_{(n-1)}\rangle - 2c_n\langle \phi_n, r_{(n-1)}\rangle + c_n^2\langle \phi_n, \phi_n\rangle$  (19)

that can be rewritten with the substitution of Eq. (18) as

$\langle r_n, r_n\rangle = \langle r_{(n-1)}, r_{(n-1)}\rangle\left(1 - \frac{\langle \phi_n, r_{(n-1)}\rangle^2}{\langle \phi_n, \phi_n\rangle\,\langle r_{(n-1)}, r_{(n-1)}\rangle}\right)$ .  (20)

Recalling the definition of the cosine using arbitrary functions f and v,

$\cos(\theta) = \frac{\langle f, v\rangle}{\langle f, f\rangle^{1/2}\langle v, v\rangle^{1/2}}$ ,

and the equivalence between the inner product and the square of the L2 norm, Eq. (20) can be written as

$\|r_n\|_2 = \|r_{(n-1)}\|_2 \sin(\theta_n)$ ,  (21)

where θn is the angle between φn and r(n−1). With Eq. (21), ‖rn‖2 < ‖r(n−1)‖2 as long as φn is not orthogonal to the previous equation residual r(n−1), that is, θn ≠ π/2. By inspection, the minimum of Eq. (21) is

$c_n\phi_n = c_n\phi(\eta(\xi, \beta_n)) = r_{(n-1)}$ .

Therefore, to force ‖rn‖2 → 0 a low-dimensional function approximation problem must be solved at each stage n. This involves an unconstrained nonlinear optimization in the determination of βn. The dimensionality of the nonlinear optimization problem is kept low since we are solving for only one basis at a time.

There are no theoretical restrictions on the type and distribution of local basis functions. Both B1-splines21,22 and Gaussian tensors23 have been used with Eq. (21), using equation residuals, in the solution of ordinary and partial differential equations. These popular bases should be available for use in scattered data approximation, since reference 24 demonstrated that approximating functions through the solution of differential equations is a more stringent test of bases than straight-forward function approximation. There are neither ad-hoc parameters for the user to tune, nor rescaling of the inputs in the method. The algorithm can be initialized with either an empty set (r0 = u) or an arbitrary number of predetermined functions and coefficients. The second option enables the use of solutions from previous numerical analyses.

C. Implementation of The Algorithm

In the scattered data approximation we have only discrete samplings of the observed function u(ξ). Given a finite number of samples s, we can write $r_{(n-1)} = \{r_{(n-1)}(\xi_1), \cdots, r_{(n-1)}(\xi_s)\}$ and $\phi_n = \{\phi_n(\xi_1), \cdots, \phi_n(\xi_s)\}$. The equations derived in Section B are directly applicable if we use the discrete inner product.

The choice of bases plays a major role in the efficiency and utility of the scattered data approximation algorithm as a whole. We chose normalized bell-shaped bases. Defining $\xi^*_n$ such that $|r_{(n-1)}(\xi^*_n)| = \max|r_{(n-1)}|$ (with |f| representing the absolute value of f), then $\beta_n = \{\xi^*_n, \sigma_n\}$ where

$\phi(\eta(\xi, \beta_n)) = \phi_n(\xi) = \exp\left(-\sigma_n^2\,(\xi - \xi^*_n)\cdot(\xi - \xi^*_n)\right)$ .

Though the functional form of the SFA scheme allows the basis center to be located anywhere in R^d, the application to scattered data approximation constrains the centers to the set of sample points $\{\xi_1, \cdots, \xi_s\}$. The remaining optimization variable σn is continuous and unconstrained.

Numerical experiments show that the discrete inner product ($\langle f, v\rangle_D = \sum_{i=1}^{s} f_i v_i$) formulation of Eq. (20) has a number of local minima with respect to σn. This results in an unsatisfactory convergence rate when using common gradient-based optimization methods. These experiments resulted in the two-step formulation used for this paper, which is a combination of Eqs. (18) and (19),

$\min_{\sigma_n}\left(\langle r_{(n-1)}, r_{(n-1)}\rangle_D - 2 g_n\langle \phi_n, r_{(n-1)}\rangle_D + g_n^2\langle \phi_n, \phi_n\rangle_D\right)$  (22)

$c_n = \frac{\langle \phi_n, r_{(n-1)}\rangle_D}{\langle \phi_n, \phi_n\rangle_D}$  (23)


where $g_n = r_{(n-1)}(\xi^*_n)$ (the value of the component in vector r(n−1) with maximum magnitude). This is the form of the SFA algorithm used in this paper. Note that by using the scalar gn with a normalized bell-shaped basis function centered at $\xi^*_n$, max|rn| ≤ max|r(n−1)|, while at the same time the minimization of Eq. (22) and satisfaction of Eq. (23) ensure that $\langle r_n, r_n\rangle_D < \langle r_{(n-1)}, r_{(n-1)}\rangle_D$. When using RBFs in this implementation of the SFA method, the basis parameter σn should vary as $Ca^{n/\nu}$, where C is some positive constant. Then, as per the results of reference 24, the upper bound of the residual should be

$\langle r_n, r_n\rangle_D = \|r_n\|_{2,D}^2 \leq C a^{-n}$  (24)

where u(ξ) is from the Sobolev space $H^\nu(R^d)$. In this implementation, Eq. (24) indicates that we should observe exponential convergence that does not depend explicitly on the dimensionality of the approximation; it sidesteps the "curse of dimensionality".16 Together, Eqs. (22) and (23) make up the SFA algorithm used in this paper.

The two more popular scattered data approximation methods using RBFs in the literature are RBF neural networks and Support Vector Machines (SVMs). RBF neural networks usually consist of a single hidden layer whose number of transfer functions, n, is chosen through trial and error by the user. The nonlinear parameters of the hidden layer are determined simultaneously through optimization, and the output parameters are solved for by the inversion of an n × n matrix, where n ≤ s. RBF neural networks, besides requiring on the order of n^2 memory and time resources, are also strongly susceptible to problems of dimensionality.25 Support Vector Machines, on the other hand, are very attractive for high-dimensional regression since the complexity of the approximation depends only on the number of support vectors and is independent of the dimension of the input d.26 However, SVMs require user-tuned ad-hoc parameters and on the order of s^2 memory and time resources.27

The scattered data approximation scheme presented in this paper is free of matrix construction and evaluations and is only of order s in storage. Computational cost is reduced while efficiency is enhanced by the low-dimensional unconstrained optimization. Theoretically, the method should be applicable to a number of popular local interpolation functions and should, with the use of RBFs, sidestep the limitations due to dimensionality. The scheme does not require user interaction other than setting the tolerance for termination; the computer does the work. To implement the algorithm the user takes the following steps:

1. Initiate the algorithm with $r_0 = \{u(\xi_1), \cdots, u(\xi_s)\}$.

2. Search the components of r(n−1) for the maximum magnitude. Record the component index j∗.

3. Set $\xi^*_n = \xi_{j^*}$.

4. Satisfy Eq. (22) with φn centered at $\xi_{j^*}$, initializing the optimization parameter with σn = 0.

5. Calculate the coefficient cn from Eq. (23).

6. Update the residual vector $r_n = r_{(n-1)} - c_n\phi_n$. Repeat the cycle until the termination criteria have been met.

A combination of three criteria can be used to terminate the SFA algorithm: $\|r_n\|_{2,D}^2$ or |gn| falls below a user specified tolerance (τ), or the number of bases n exceeds the user specified maximum. In this paper we terminate the calculations when either |gn| ≤ τ or n ≥ s.
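A compact Python sketch of steps 1-6 is given below (our own rendering of the SFA loop; the one-dimensional search over σn uses a generic scalar minimizer from SciPy rather than whichever routine the authors used, and the paper's initialization σn = 0 is noted only in a comment).

import numpy as np
from scipy.optimize import minimize_scalar

def sfa_fit(xi, u, tol=1e-3, max_bases=None):
    # Sequential Function Approximation with normalized Gaussian bases.
    # xi: (s, d) array of sample inputs, u: (s,) array of sample targets.
    s = len(u)
    max_bases = s if max_bases is None else max_bases
    r = u.copy()                                   # step 1: r0 = u at the samples
    centers, sigmas, coeffs = [], [], []
    while len(coeffs) < max_bases:
        j = int(np.argmax(np.abs(r)))              # steps 2-3: max-magnitude residual component
        g = r[j]
        if abs(g) <= tol:                          # termination on |g_n|
            break
        d2 = np.sum((xi - xi[j]) ** 2, axis=1)
        def obj(sigma):                            # step 4: Eq. (22) with phi centered at xi[j]
            phi = np.exp(-sigma**2 * d2)
            return r @ r - 2.0 * g * (phi @ r) + g**2 * (phi @ phi)
        sigma = minimize_scalar(obj).x             # paper starts from sigma_n = 0; generic search here
        phi = np.exp(-sigma**2 * d2)
        c = (phi @ r) / (phi @ phi)                # step 5: Eq. (23)
        r = r - c * phi                            # step 6: update the residual
        centers.append(xi[j]); sigmas.append(sigma); coeffs.append(c)
    return np.array(centers), np.array(sigmas), np.array(coeffs)

def sfa_eval(x, centers, sigmas, coeffs):
    # Evaluate the SFA model at new points x of shape (m, d).
    d2 = np.sum((x[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.exp(-(sigmas**2) * d2) @ coeffs

Given samples of the learned action values, a model built once with sfa_fit can then be evaluated with sfa_eval to generalize those values to arbitrary continuous states.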

The method is linear in storage with respect to s since it needs to store only s + 1 vectors to compute the residuals: one vector of length s (rn) and s vectors of length d (ξ1, · · · , ξs). Generating the SFA model requires one vector of length n (c1, · · · , cn) and n vectors of length d + 1 (β1, · · · , βn), each βi consisting of a center and a width.

A drawback found in using Eqs. (22) and (23) in practice is their sensitivity to the initial value of σn. Though Eq. (21) is stable, numerical experiments have shown that some initial values of σn cause local extrema in the formulations of Eqs. (22) and (23) to be selected, such that cn → 0 while gn is finite. Problems are not experienced with the SFA scheme if we initialize with σn = 0.

Lastly, since the SFA scheme works with the L2 norm, time-dependent processes must be reformulated as a mapping of inputs to desired outputs. Assume the inputs are sequential in time, that is, ξi is recorded at ti where ti < ti+1. Limiting our attention to a single target sensor, we can use the Runge-Kutta temporal integration technique from the finite difference literature28 to write

$u(t_{i+1}) = u(t_i) + (t_{i+1} - t_i)\,G(\xi_i, u(t_i), t_i)$ ,  (25)


where G is an unknown mapping function. If the input vector ξi includes u(ti) as well as ti, and if ti+1 − ti is roughly constant for all i, then Eq. (25) can be rewritten as

$u(t_{i+1}) = G(\xi_i)$  (26)

since G is still an unknown function. Therefore, with the sets $\{u(t_1), \cdots, u(t_s)\}$ and $\{\xi_1, \cdots, \xi_s\}$ we can use the SFA scheme to construct the mapping function G sequentially.
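In code, the reformulation of Eq. (26) simply amounts to building input-output pairs from a sampled time history before calling the approximation scheme; the sketch below uses a synthetic signal purely for illustration.

import numpy as np

# Synthetic time history of a single target sensor, sampled at a roughly
# constant rate so that Eq. (26) applies.
t = np.linspace(0.0, 10.0, 201)
u = np.sin(t) + 0.1 * t

# Inputs xi_i include u(t_i) and t_i; the target is u(t_{i+1}).
xi = np.column_stack([u[:-1], t[:-1]])
target = u[1:]

# These (xi, target) pairs can now be passed to the SFA scheme to learn G.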

D. Conclusions

An adaptive and matrix-free scheme has been developed for interpolating and approximating sparse multi-dimensional scattered data. The scheme requires neither ad-hoc parameters for the user to tune, nor rescaling of the inputs. The SFA method is based on a sequential Galerkin approach to artificial neural networks and is linear in storage with respect to the number of samples, s. Using this method, an accurate approximation is built by incremental additions of optimal local basis functions. Matrix construction and evaluations are avoided. Computational cost is reduced while efficiency is enhanced by the low-dimensional unconstrained optimization. Theoretically, the method should be applicable to a number of popular local basis functions.

Future work will investigate improving the extrapolation by merging mathematical models with the experimental data through regularization. Future investigations will also include more sophisticated optimization routines, the reformulation of Eqs. (22) and (23) to decrease their sensitivity to initial values of σn, and the use of various basis functions such as low-order polynomials, B-splines, and Gaussian tensors.

V. Structured Adaptive Model Inversion

A. A Brief Introduction to Structured Adaptive Model Inversion (SAMI)

The goal of the SAMI controller is to track the reference trajectories even when the dynamic properties of the vehicle change due to the morphing. Structured Adaptive Model Inversion (SAMI)29 is based on the concepts of Feedback Linearization,30 Dynamic Inversion, and Structured Model Reference Adaptive Control (SMRAC).31,32,33 In SAMI, dynamic inversion is used to solve for the control. The dynamic inversion is approximate, as the system parameters are not modeled accurately. An adaptive control structure is wrapped around the dynamic inverter to account for the uncertainties in the system parameters.34,35,36 This controller is designed to drive the error between the output of the actual plant and the reference trajectories to zero, with prescribed error dynamics. Most dynamic systems can be represented as two sets of differential equations: an exactly known kinematic level part, and a momentum level part with uncertain system parameters. The adaptation included in this framework can be limited to only the uncertain momentum level equations, thus restricting the adaptation to a subset of the state-space and enabling efficient adaptation. SAMI has been shown to be effective for tracking spacecraft37 and aggressive aircraft maneuvers.38 The SAMI approach has been extended to handle actuator failures and to facilitate correct adaptation in the presence of actuator saturation.39,40,41

B. Mathematical Formulation of the SAMI Controller for Attitude Control

The attitude equations of motion for the morphing vehicle are given by Eqn.8 and Eqn.11. Without the angular drag and the $\dot{I}\omega$ term, Eqn.8 and Eqn.11 can be manipulated to obtain the following form29

$I^*_a(\sigma)\ddot{\sigma} + C^*_a(\sigma, \dot{\sigma})\dot{\sigma} = P^T_a(\sigma)\, M$  (27)

where the matrices $I^*_a(\sigma)$, $C^*_a(\sigma, \dot{\sigma})$ and $P_a(\sigma)$ are defined as

$P_a(\sigma) \triangleq J_a^{-1}(\sigma)$  (28)

$I^*_a(\sigma) \triangleq P^T_a I P_a$  (29)

$C^*_a(\sigma, \dot{\sigma}) \triangleq -I^*_a \dot{J}_a P_a + P^T_a \widetilde{[P_a\dot{\sigma}]}\, I P_a$  (30)

The left hand side of Eqn.27 can be linearly parameterized as follows

$I^*_a(\sigma)\ddot{\sigma} + C^*_a(\sigma, \dot{\sigma})\dot{\sigma} = Y_a(\sigma, \dot{\sigma}, \ddot{\sigma})\,\theta$  (31)


where $Y_a(\sigma, \dot{\sigma}, \ddot{\sigma})$ is a regression matrix and θ is the constant inertia parameter vector defined as $\theta \triangleq [I_{11}\ I_{22}\ I_{33}\ I_{12}\ I_{13}\ I_{23}]^T$. It can be seen that the product of the inertia matrix and a vector can be written as

$I\nu = \Lambda(\nu)\,\theta, \quad \forall \nu \in R^3$  (32)

where $\Lambda \in R^{3\times 6}$ is defined as

$\Lambda(\nu) \triangleq \begin{bmatrix} \nu_1 & 0 & 0 & \nu_2 & \nu_3 & 0 \\ 0 & \nu_2 & 0 & \nu_1 & 0 & \nu_3 \\ 0 & 0 & \nu_3 & 0 & \nu_1 & \nu_2 \end{bmatrix}$  (33)

The terms on the left hand side of Eqn.27 can be written as

$I^*_a \ddot{\sigma} = P^T_a I P_a \ddot{\sigma} = P^T_a \Lambda(P_a\ddot{\sigma})\,\theta$  (34)

$C^*_a \dot{\sigma} = -P^T_a I P_a \dot{J}_a P_a \dot{\sigma} + P^T_a \widetilde{[P_a\dot{\sigma}]}\, I P_a \dot{\sigma} = P^T_a \left[-\Lambda(P_a \dot{J}_a P_a \dot{\sigma}) + \widetilde{[P_a\dot{\sigma}]}\,\Lambda(P_a\dot{\sigma})\right]\theta$  (35)

Combining Eqns.34 and 35, we have the linear minimal parameterization for the inertia matrix.42

$I^*_a(\sigma)\ddot{\sigma} + C^*_a(\sigma, \dot{\sigma})\dot{\sigma} = P^T_a \left[\Lambda(P_a\ddot{\sigma}) - \Lambda(P_a \dot{J}_a P_a \dot{\sigma}) + \widetilde{[P_a\dot{\sigma}]}\,\Lambda(P_a\dot{\sigma})\right]\theta = Y_a(\sigma, \dot{\sigma}, \ddot{\sigma})\,\theta$  (36)

The attitude tracking problem can be formulated as follows. The control objective is to track an attitude trajectory in terms of the 3-2-1 Euler angles. The desired reference trajectory is assumed to be twice differentiable with respect to time. Let $\epsilon \triangleq \sigma - \sigma_r$ be the tracking error. Differentiating twice and multiplying by $I^*_a$ throughout,

$I^*_a \ddot{\epsilon} = I^*_a \ddot{\sigma} - I^*_a \ddot{\sigma}_r$  (37)

Adding $(C_{d_a} + C^*_a(\sigma, \dot{\sigma}))\dot{\epsilon} + K_{d_a}\epsilon$ on both sides, where $C_{d_a}$ and $K_{d_a}$ are the design matrices,

$I^*_a \ddot{\epsilon} + (C_{d_a} + C^*_a(\sigma, \dot{\sigma}))\dot{\epsilon} + K_{d_a}\epsilon = I^*_a \ddot{\sigma} - I^*_a \ddot{\sigma}_r + (C_{d_a} + C^*_a(\sigma, \dot{\sigma}))\dot{\epsilon} + K_{d_a}\epsilon$  (38)

The RHS of Eqn.38 can be written as

$(I^*_a \ddot{\sigma} + C^*_a(\sigma, \dot{\sigma})\dot{\sigma}) - (I^*_a \ddot{\sigma}_r + C^*_a(\sigma, \dot{\sigma})\dot{\sigma}_r) + C_{d_a}\dot{\epsilon} + K_{d_a}\epsilon$  (39)

From Eqn.27 and the construction of $Y_a$ similar to Eqn.36, the RHS of Eqn.38 can be further written as

$P^T_a M - Y_a(\sigma, \dot{\sigma}, \dot{\sigma}_r, \ddot{\sigma}_r)\,\theta + C_{d_a}\dot{\epsilon} + K_{d_a}\epsilon$  (40)

So the control law can now be chosen as

$M = P^{-T}_a \left[Y_a(\sigma, \dot{\sigma}, \dot{\sigma}_r, \ddot{\sigma}_r)\,\theta - C_{d_a}\dot{\epsilon} - K_{d_a}\epsilon\right]$  (41)

The above control law requires that the inertia parameters θ be known accurately, but they may not be known accurately in actual practice. So, by using the certainty equivalence principle,36 adaptive estimates $\hat{\theta}$ of the inertia parameters θ are used for calculating the control:

$M = P^{-T}_a \left[Y_a(\sigma, \dot{\sigma}, \dot{\sigma}_r, \ddot{\sigma}_r)\,\hat{\theta} - C_{d_a}\dot{\epsilon} - K_{d_a}\epsilon\right]$  (42)

With the control law given in Eqn.42, the closed loop dynamics take the following form

$I^*_a \ddot{\epsilon} + (C_{d_a} + C^*_a(\sigma, \dot{\sigma}))\dot{\epsilon} + K_{d_a}\epsilon = Y_a(\sigma, \dot{\sigma}, \dot{\sigma}_r, \ddot{\sigma}_r)\,\tilde{\theta}$  (43)

where $\tilde{\theta} = \hat{\theta} - \theta$. Lyapunov stability analysis shows that the update law $\dot{\hat{\theta}} = -\tau\, Y_a(\sigma, \dot{\sigma}, \dot{\sigma}_r, \ddot{\sigma}_r)^T \dot{\epsilon}$ ensures asymptotic stability of the tracking error dynamics for bounded reference trajectories. Following similar lines as the attitude controller, a SAMI controller for the control of the linear states of the vehicle can also be implemented.
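As a hedged illustration of how the control law of Eqn.42 and the parameter update law would be evaluated at each time step (our own sketch; the regression matrix Ya, the matrix Pa, the gain matrices, and the adaptation gain τ are assumed to be computed or chosen by the caller):

import numpy as np

def sami_attitude_step(Ya, Pa, theta_hat, eps, eps_dot, Cda, Kda, tau, dt):
    # One evaluation of the adaptive attitude control law of Eqn. 42 and of the
    # gradient parameter update used with it (Euler-integrated here).
    # Ya: 3x6 regression matrix, Pa: 3x3, theta_hat: 6 inertia parameter estimates.
    # Control moment: M = Pa^{-T} [Ya*theta_hat - Cda*eps_dot - Kda*eps]
    M = np.linalg.solve(Pa.T, Ya @ theta_hat - Cda @ eps_dot - Kda @ eps)
    # Parameter update: theta_hat_dot = -tau * Ya^T * eps_dot
    theta_hat_next = theta_hat - dt * tau * (Ya.T @ eps_dot)
    return M, theta_hat_next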


VI. A-RLC Architecture Functionality

Figure 4. Adaptive-Reinforcement Learning Control Architecture.

The Adaptive-Reinforcement Learning Control Architecture is composed of two sub-systems: Reinforcement Learning (RL) and Structured Adaptive Model Inversion (SAMI) (Fig.4). The two sub-systems interact significantly during both the episodic learning stage, when the optimal shape change policy is learned, and the operational stage, when the plant morphs and tracks a trajectory. For either stage, the system functions as follows. Considering the RL sub-system at the top of Fig.4 and moving counterclockwise, the reinforcement learning module initially commands an arbitrary action from the set of admissible actions. This action is sent to the plant, which produces a shape change. The cost associated with the resultant shape change, in terms of drag force and fuel consumption, is evaluated with the cost function. A sample, composed of the current shape, the current action, the reward (cost), and the resultant shape, is sent to the Sample Database. After the Sample Database collects a sufficient number of samples, the action-value function updates itself. For the next step, a new action is generated based on the current policy and the current action-value function, and the sequence repeats itself. Shape changes in the plant due to actions generated by the RL Module cause the plant dynamics to change. The SAMI controller maintains trajectory tracking irrespective of the changing dynamics of the plant due to these shape changes.

VII. Numerical Example

A. Purpose and Scope

The purpose of the numerical example is to demonstrate the learning performance and the trajectory tracking performance of the A-RLC architecture. The learning takes place over 200 unsupervised learning episodes, each consisting of a single 200 second transit through a 100 meter long path. The sequence of the flight conditions is random, and changes twice during each episode. For learning purposes, the shape change is commanded at ten second intervals. The reference trajectory to be tracked in each episode is generated arbitrarily. Finally, trajectory tracking performance is demonstrated with a single pass through the path, with a randomly generated reference trajectory and an arbitrary flight condition change at approximately 50 second intervals. Fig.2(e) shows a typical flight condition sequence.

B. Learning the Optimal Policy for Shape Change

The RL problem can be decoupled into two independent problems that correspond to the y and z dimensions respectively. For the y dimension RL problem, the state has two elements, the value of the y dimension and the flight condition: s = {y, F}. The state set consisting of all possible states is a 2-dimensional continuous Cartesian space: S = [2, 4] × [0, 5]. For each state, the action set consists of all possible voltages that can be applied: A(s) = {−1, −0.5, 0, 0.1, 0.5}. Similarly for the z dimension, the state set is [2, 4] × [0, 5], and the action set is {−1, −0.5, 0, 0.1, 0.5}.

Figure 5. Evaluating the Learning. (a) Comparison between the true optimal shape and the shape learned by the RL agent using the KNN method; (b) comparison between the true optimal shape and the shape learned by the RL agent using the SFA method.

For simplicity, the training is conducted only at six discrete flight conditions: 0, 1, 2, 3, 4, 5. For each flight condition, the number of samples for approximating the action-value function is 2000. The RL agent runs 10 iterations on these samples to update their Q values using the Q-learning algorithm. Fig.5 compares the learning performance of the KNN method and the SFA method. Each plot compares the optimal shape with the actual shape achieved after learning. The results for the KNN method shown in Fig.5(a) reveal problems in the learning of the optimal shape. During the second flight condition encountered in the flight, the vehicle does not fully morph in the y dimension to the desired shape. The same behavior is also apparent in the z dimension during the first flight condition. Moreover, the Reinforcement Learning agent stays at a non-optimal shape and makes no attempt to reach the optimal shape. In Fig.5(b), the shapes achieved by the morphing vehicle more closely match the optimal shape in both dimensions. Values of the learned shape are seen to be close, but not exactly equal, to the optimal values. The RL agent cannot learn the optimal policy exactly because the applied control takes discrete values by definition, so it cannot take on arbitrary continuous values.

The optimal action-value function for the continuous domain is not known, so the performance of the two function approximation methods cannot be compared directly. The sampled data points are known, and each method learns these points very well. The difference in the quality of approximation lies in the points that are not near the sampled data points, so the performance of the two methods can be compared by examining the errors between the obtained shape and the true shape. Using normalized error values for the entire shape time histories, the numerical simulation using the KNN method had a total RMS error of 1.42 in the y dimension and 0.821 in the z dimension. The example run using the SFA method had total RMS errors of 1.27 and 0.661 in the y and z dimensions, respectively. This is a reduction of the RMS error by slightly more than 10% in the y dimension and nearly 20% in the z dimension. It is important to note that all of these error values include the transient errors between flight condition changes due to the morphing dynamics. Since these errors are clearly present in the time history of the obtained shape, but are not caused by the function approximation method used, the actual percent improvement in the function approximation achieved by the SFA method is higher than these numbers indicate.
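The comparison metric is a straightforward root-mean-square error over the shape time histories; a minimal sketch of one way to compute it is shown below (the normalization by the mean optimal value is our assumption, since the paper does not state its normalization).

import numpy as np

def normalized_rms(obtained, optimal):
    # RMS error between an obtained shape time history and the optimal one,
    # normalized here by the mean optimal value (one possible choice).
    err = np.asarray(obtained) - np.asarray(optimal)
    return np.sqrt(np.mean(err ** 2)) / np.mean(optimal)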

C. Trajectory Tracking

The control objective for the SAMI controller is to track the reference trajectories irrespective of initial condition errors, parametric uncertainties, and changes in the dynamic behavior of the aircraft due to morphing.

Fig.6(a) shows that the adaptive control maintains close tracking of the reference trajectories. Fig.6(b) and Fig.6(c) show the time histories of the linear and angular states respectively. The deviations from the


Figure 6. Plots for Trajectory Tracking Performance. (a) Projections of the trajectory in the Y-X, Z-Y and Z-X planes; (b) time histories of the linear states; (c) time histories of the angular states; (d) time histories of the adaptive parameters; (e) time histories of the control forces and moments.


reference trajectory seen in Fig.6(b), subplot 2, are due to rapid changes in the morphing aircraft dimensions. The adaptive control has to undergo rapid adaptation during a shape change, as seen in Fig.6(d). The adaptive controller formulation assumes that the true parameters are constant, hence the shape change causes disturbances in the tracking behavior. Fig.6(e) shows that the control forces and moments are well behaved and within saturation limits.

In the current simulation, the true parameters which are adapted are the inertia and the mass. The mass remains constant, but the inertia changes as the aircraft morphs into different shapes. The SAMI controller does not implement the term $\dot{I}\omega$ of Eqn.11, so $\dot{I}\omega$ acts as unmodeled dynamics and perturbs the tracking. However, the SAMI controller is able to maintain adequate tracking performance.

The stability proof guarantees asymptotic stability of the tracking error only for constant true parameters. Here it has been applied to a morphing aircraft. Considering the timescales for morphing for mission adaptation, the mission specifications change very slowly compared to the rate of adaptation of the SAMI controller. Also, after the shape changes for a mission specification it will be held constant for the mission duration. Thus, due to the timescale separation and the time interval between two successive changes in mission specification, the shape change behaves like a piecewise constant parametric change for the adaptive controller, which can be handled effectively.

Note that the design of the controller does not explicitly account for the drag force and drag moment, which act as external disturbances. However, the simulation results show that SAMI is able to handle the external disturbances due to drag, thereby ensuring that the A-RLC controller performs well.

VIII. Conclusions

This paper developed an improved Adaptive-Reinforcement Learning Control methodology for morphing aircraft, combining machine learning and adaptive dynamic inversion control. Sequential Function Approximation was used to generalize the learning from previously experienced quantized states and actions to the continuous state-action space, all of which may not have been experienced before. The reinforcement learning morphing control function was integrated with an adaptive dynamic inversion control trajectory tracking function. An episodic unsupervised learning simulation using the Q-Learning method was developed to learn the optimal shape change policy, and optimality was addressed by cost functions representing optimal shapes corresponding to flight conditions. The methodology was demonstrated with a numerical example of a hypothetical 3-D unmanned air vehicle that can morph in all three spatial dimensions, tracking a specified trajectory and autonomously morphing over a set of shapes corresponding to flight conditions along the trajectory.

Based on the results presented in this paper, it is concluded that the method of function approximation used strongly affects the reliability of applying the learned data at points other than those sampled during the learning process. In the example presented, use of the K-Nearest Neighbors method resulted in noticeable errors in the learned shape, while the Galerkin-based Sequential Function Approximation method showed improved shape learning. Comparing normalized error values, the K-Nearest Neighbors method had a total RMS error of 1.42 in the y dimension and 0.821 in the z dimension, compared to 1.27 and 0.661, respectively, using Galerkin-based Sequential Function Approximation. This is a reduction of the RMS error by slightly more than 10% in the y dimension and nearly 20% in the z dimension.
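
The quoted percentage reductions follow directly from the normalized RMS values reported above; a quick check (Python, using only the numbers in the text):

    # Normalized RMS errors of the learned shape, taken from the text above.
    knn = {"y": 1.42, "z": 0.821}   # K-Nearest Neighbors
    sfa = {"y": 1.27, "z": 0.661}   # Sequential Function Approximation

    for axis in ("y", "z"):
        reduction = (knn[axis] - sfa[axis]) / knn[axis] * 100.0
        print(f"{axis}: {reduction:.1f}% reduction")
    # Roughly 10.6% in y and 19.5% in z, i.e. slightly more than 10% and
    # nearly 20%, as stated.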

Since Structured Adaptive Model Inversion Control assumes that unknown varying parameters are piecewise constant and change slowly, the time interval between shape change commands cannot be arbitrarily small. However, Morphing for Mission Adaptation should not pose a problem, since compared to the time constants of the system elements, the time interval between shape changes is relatively large and the shape remains constant for long durations.

Extensions to this work will encompass morphing of a vehicle more representative of a practical Unmanned Air Vehicle, with morphing times as low as five seconds between shape changes. Hysteresis effects of the assumed Shape Memory Alloy material will be included, and other control methodologies such as Linear Parameter Varying control will be investigated as a means of controlling the faster morphing changes.

Acknowledgment

The material on morphing aircraft is based upon work supported by NASA under award no. NCC-1-02038. Support for the sequential function approximation work was provided by the NASA Ames Research Center grant NCC-2-8077 and NASA Cooperative Agreement No. NCC-1-02038. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Aeronautics and Space Administration.
