
A Reinforcement Learning Approach Towards Autonomous Suspended Load Manipulation Using Aerial Robots

Ivana Palunko1, Aleksandra Faust2, Patricio Cruz3, Lydia Tapia2, and Rafael Fierro3

Abstract— In this paper, we present a problem where a suspended load, carried by a rotorcraft aerial robot, performs trajectory tracking. We want to accomplish this by specifying the reference trajectory for the suspended load only. The aerial robot needs to discover/learn its own trajectory which ensures that the suspended load tracks the reference trajectory. As a solution, we propose a method based on least-square policy iteration (LSPI), which is a type of reinforcement learning algorithm. The proposed method is verified through simulation and experiments.

Keywords: Aerial robotics, aerial load transportation, motion planning and control, machine learning, quadrotor control, trajectory tracking, reinforcement learning

I. INTRODUCTION

Unmanned aerial vehicles (UAVs) are flexible platforms with an increasing role in assisting humans in tasks that are deemed too risky or in some cases impractical for humans. Examples of such missions are search and rescue, environmental monitoring, surveillance, and cooperative manipulation. The agility and speed of rotorcraft UAVs (RUAVs) allow them not only to take off and land in a very limited area, but also to carry and manipulate payloads more precisely and efficiently. Therefore, they are well suited for use in urban environments, canyons, caves, and other cluttered environments, for example in assistive robotics, carrying and/or transporting a load.

Existing RUAV applications vary from basic hovering [1] and trajectory tracking [2], to formation control [3], surveillance [4], aggressive maneuvering [5], aerobatic flips [6], and aerial manipulators [7]. Trajectory tracking for helicopters based on reinforcement learning resulted in autonomous aggressive helicopter flights. This work was summarized by Abbeel et al. [8]. This line of research relies on apprenticeship learning to achieve aggressive autonomous maneuvers such as flips, rolls, loops, chaos, tic-tocs, and auto-rotation landings. An expert pilot demonstrates the target trajectory, and the cost function is learned from it.

This work was supported by NSF grant ECCS #1027775 and the Department of Energy URPR Grant #DE-FG52-04NA25590. Ivana Palunko is supported by the European Community's Seventh Framework Programme under grant No. 285939 (ACROSS). Lydia Tapia is supported in part by the National Institutes of Health (NIH) Grant P20RR018754 to the Center for Evolutionary and Theoretical Immunology.

1 Ivana Palunko is with the Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia, ivana.palunko@fer.hr

2 Aleksandra Faust and Lydia Tapia are with the Department of Computer Science, University of New Mexico, Albuquerque, NM 87131, {afaust, tapia}@cs.unm.edu

3 Patricio Cruz and Rafael Fierro are with the Department of Electrical and Computer Engineering, University of New Mexico, Albuquerque, NM 87131-0001, {pcruzec, rfierro}@ece.unm.edu

Fig. 1. Hummingbird quadrotor with a suspended load at the MARHES Lab, University of New Mexico.

Parameters for an initial linear model are learned from observing an expert flying non-aggressive maneuvers. Then, the trajectory is broken down into small segments, so that each one can be learned separately using a linear dynamics model.

Quadrotors, a type of RUAV, are inherently unstable systems with highly nonlinear dynamics. The task of load carrying puts additional strain on the controller, because the suspended load alters the quadrotor dynamics through forces and torques transferred along the suspension cable. Furthermore, maneuvering with a suspended load can be a hazardous task, especially in cluttered and hardly accessible environments. Reducing the residual oscillations of the suspended load is one method to achieve better control over the system. Another is to control the trajectory of the suspended load directly. In this paper we focus on the problem where a suspended load, carried by a rotorcraft aerial robot, performs trajectory tracking. The solution to this problem has a broad spectrum of applications ranging from motion planning and aerial robotics to control of cranes and robotic manipulators [9], [10].

In order to solve this problem we propose a method based on least square policy iteration (LSPI), which is a type of reinforcement learning algorithm. One of the main advantages of this method is that it is model free, meaning that in the design process we did not use any explicit knowledge of the system model.

Fig. 2. Load trajectory tracking.

This makes the method robust to model uncertainties and noise, as demonstrated in our results. Another advantage is that it can be implemented as an online learning algorithm, which brings us a step closer to a fully autonomous UAV transportation system. Online least square policy iteration has been successfully implemented to solve the benchmark control problems of balancing an inverted pendulum and balancing and riding a bicycle [11], [12]. The method shows promising convergence properties [13] and algorithmic stability [12].

The rest of the paper is organized as follows: Section II formally states the problem, Section III discusses the algorithm and methodology in detail, including algorithm convergence, Section IV shows the simulation and experimental results implemented on a quadrotor with a suspended load, and Section V presents the closing remarks.

II. PROBLEM STATEMENT

In our previous work [14] we used a dynamic programming approach to generate trajectories which ensured swing-free maneuvers of a suspended load carried by a quadrotor in an open-loop fashion. By open-loop we mean that there was no feedback involved to control the swing of the load; only the profile of the trajectory was shaped based on the model of the system. Since this approach was a model-based method, the experimental results showed sensitivity with respect to model uncertainties. In this paper we move a step further, as we focus on trajectory tracking of a suspended load as depicted in Figure 2 and defined as:

Definition 1: Given a reference trajectory P_ref(t) = [x_Lref(t)  y_Lref(t)  z_Lref(t)]^T, the position of the suspended load P_L(t) = [x_L(t)  y_L(t)  z_L(t)]^T and a small positive constant ε_L > 0, we say that the suspended load performs trajectory tracking if |e_L(t)| ≤ ε_L, where e_L(t) = [x_Lref(t) − x_L(t)  y_Lref(t) − y_L(t)  z_Lref(t) − z_L(t)]^T.

We want to achieve this goal by specifying the reference trajectory for the suspended load only. The quadrotor needs to learn its own trajectory and at the same time ensure that the suspended load tracks the reference trajectory. Therefore we are faced with an optimal control problem again. Since the quadrotor should learn its trajectory from the interaction between a learning algorithm and the aerial load manipulation system, we choose least-square policy iteration (LSPI), a reinforcement learning algorithm, as the basis of our approach. The implementation is depicted in Figure 3. The inputs to the LSPI algorithm are the vector of the tracking error and the vector of the suspended load displacement angles. These variables are obtained by sensors such as an on-board camera or an indoor positioning system. The outputs from the LSPI algorithm are three control signals which are sent to the quadrotor. These signals are additional control inputs to the quadrotor control system. Since they are bounded, which is defined by the algorithm [11], the quadrotor remains stable.
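For concreteness, a minimal sketch of the tracking criterion in Definition 1 applied to sampled trajectories; the 5 cm tolerance and the use of the Euclidean norm for |e_L(t)| are illustrative assumptions, not values from the paper.

```python
import numpy as np

def tracks_reference(p_load, p_ref, eps_load=0.05):
    """Check Definition 1 on sampled data: the load tracks the reference
    if the tracking error stays within eps_load at every sample.

    p_load, p_ref: (T, 3) arrays of load and reference positions
    [x_L, y_L, z_L] and [x_Lref, y_Lref, z_Lref].
    eps_load: illustrative tolerance (5 cm here).
    """
    e_load = p_ref - p_load                              # e_L(t), sampled
    return bool(np.all(np.linalg.norm(e_load, axis=1) <= eps_load))
```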

III. SUSPENDED LOAD TRAJECTORY TRACKING USING LEAST SQUARE POLICY ITERATION

Model-free reinforcement learning (RL) algorithms can be divided into three categories:

• value iteration algorithms, which iteratively construct an optimal value function from which a greedy policy is derived,

• policy iteration algorithms, which construct value functions that are then used to construct new, improved policies,

• policy search algorithms, which use optimization techniques to directly search for an optimal policy.

Offline RL algorithms use data collected in advance, while online RL algorithms learn a solution by interacting with the process. Online RL algorithms must balance the need to collect informative data (exploration) with the need to control the process well (exploitation).

Policy iteration (PI) algorithms iteratively evaluate and improve control policies. Least-squares techniques for policy evaluation are sample-efficient and have relaxed convergence requirements.

The quadrotor with a suspended load can be represented as a deterministic Markov Decision Process (MDP) (X, U, f, ρ), where X is the state space of the process, U is the action space of the controller, f : X × U → X is the transition function of the process,

x_{k+1} = f(x_k, u_k),

and ρ : X × U → R is the reward function,

r_{k+1} = ρ(x_k, u_k).

The controller chooses actions according to its policy h : X → U,

u_k = h(x_k).
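As a concreteness aid, a minimal Python sketch of this deterministic MDP interface; the class and method names are our own, introduced only for illustration.

```python
from typing import Callable
import numpy as np

class DeterministicMDP:
    """Deterministic MDP (X, U, f, rho): f gives the next state,
    rho gives the reward, and a policy h maps states to actions."""

    def __init__(self,
                 f: Callable[[np.ndarray, float], np.ndarray],
                 rho: Callable[[np.ndarray, float], float]):
        self.f = f        # transition function: x_{k+1} = f(x_k, u_k)
        self.rho = rho    # reward function:     r_{k+1} = rho(x_k, u_k)

    def step(self, x_k: np.ndarray, u_k: float):
        """Apply action u_k in state x_k; return (x_{k+1}, r_{k+1})."""
        return self.f(x_k, u_k), self.rho(x_k, u_k)
```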

The goal is to find an optimal policy that maximizes the return from any initial state x_0. The return R is a cumulative aggregation of rewards,

R^h(x_0) = ∑_{k=0}^{∞} γ^k ρ(x_k, h(x_k)),

Fig. 3. Block scheme for load trajectory tracking using LSPI.

where γ ∈ [0, 1] is the discount factor. The state-action value function (Q-function) Q^h : X × U → R of a policy h is

Q^h(x, u) = ρ(x, u) + γ R^h(f(x, u)).

The optimal Q-function is

Q^*(x, u) = max_h Q^h(x, u),

and the optimal policy (the greedy policy in Q^*) satisfies

h^*(x) ∈ arg max_u Q^*(x, u).
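Building on the DeterministicMDP sketch above, a minimal illustration of the discounted return under a fixed policy and of greedy-action selection over a discretized action set; the finite rollout horizon and the action grid are simplifications for illustration, not part of the paper's formulation.

```python
import numpy as np

def discounted_return(mdp, h, x0, gamma=0.95, horizon=500):
    """Approximate R^h(x0) = sum_k gamma^k * rho(x_k, h(x_k)) by a
    finite rollout (a truncation of the infinite sum)."""
    R, x = 0.0, x0
    for k in range(horizon):
        u = h(x)
        x, r = mdp.step(x, u)       # r corresponds to step k
        R += (gamma ** k) * r
    return R

def greedy_action(Q, x, action_grid):
    """h(x) in argmax_u Q(x, u), searched over a finite action grid."""
    return max(action_grid, key=lambda u: Q(x, u))
```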

Policy iteration evaluates policies by iteratively constructing an estimate of the state-action value function. This estimate is then used to construct a new, improved policy. The policy evaluation step is done at every iteration l by minimizing the least-squares error in the Bellman equation [15]

Q^{h_l}(x, u) = ρ(x, u) + γ Q^{h_l}(f(x, u), h_l(f(x, u)))

for Q^{h_l} of the current policy h_l. The new policy is the greedy policy with respect to this state-action value function:

h_{l+1}(x) ∈ arg max_u Q^{h_l}(x, u).

The sequence of Q-functions produced by policy iteration converges asymptotically to Q^* as l → ∞. In continuous spaces, policy evaluation cannot be solved exactly, and the value function has to be approximated. A linearly parametrized Q-function approximator Q̂ consists of n basis functions (BFs) φ_1, . . . , φ_n : X × U → R and an n-dimensional parameter vector θ:

Q̂ = ∑_{l=1}^{n} φ_l(x, u) θ_l = φ^T(x, u) θ,

where φ(x, u) = [φ_1(x, u), . . . , φ_n(x, u)]^T. The control action u is a scalar bounded to an interval U = [u_L, u_H]. In this paper we use state-dependent basis functions and orthogonal polynomials of the action variable, which separates approximation over the state space from approximation over the action space. Orthogonal polynomials are chosen because it is simple to solve the maximization problem over the action variables, and orthogonality ensures a better conditioned regression problem at the policy improvement steps. As approximators over the state-action space we use Chebyshev polynomials of the first kind, which are defined as

ψ_0(ξ̄) = 1,
ψ_1(ξ̄) = ξ̄,                                      (1)
ψ_{j+1}(ξ̄) = 2ξ̄ ψ_j(ξ̄) − ψ_{j−1}(ξ̄),

where

ξ̄ = −1 + 2(ξ − ξ_L)/(ξ_H − ξ_L),                  (2)

and ξ is any variable for which we build the Chebyshev polynomial. These polynomials are defined on the interval [−1, 1]. The approximator over the state-action space is given as

φ(x, u) = Φ(u) ⊗ Ψ(x),                             (3)

where Φ(u) ∈ R^{N×1} is the vector of Chebyshev polynomials in the action space, Ψ(x) ∈ R^{M×1} is the vector of Chebyshev polynomials in the state space, N and M are the dimensions of the Chebyshev polynomials, and ⊗ is the Kronecker product. Chebyshev polynomials are good approximators because they show uniform convergence, as shown in [13] and [16].

The online least square policy iteration algorithm is described by Algorithm 1. The difference between an offline and an online variation of LSPI is that the online algorithm interacts with the system directly at every iteration (line 10). Since we do not want the algorithm to get stuck in local minima, exploration and exploitation actions have to be balanced (line 9). To be able to converge quickly, the algorithm needs to exploit the interaction with the environment. On the other hand, in order to find the global minimum, it has to explore the action space with probability ε_k. As time passes, the algorithm relies more on the current policy than on exploration, so ε_k decays exponentially as the number of steps increases. Instead of waiting until many samples have been collected, the online LSPI improves the policy by solving for the Q-function parameters every K_θ iterations using the current values of Γ, Λ and z. When K_θ = 1, the policy is improved at every iteration step, which is called fully optimistic. Usually K_θ should be a number greater than zero but not too large. In our case K_θ = 7, which we picked based on the algorithm performance; changing this parameter changes the performance of the algorithm.
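Before turning to Algorithm 1, a minimal Python sketch of the feature construction in (1)-(3): each variable is scaled to [−1, 1] with (2), the recursion (1) is evaluated, and the state and action polynomial vectors are combined with a Kronecker product. Handling a multi-dimensional state by a tensor product over its components, and the polynomial degrees used here, are illustrative assumptions rather than the paper's exact choices.

```python
import numpy as np

def chebyshev(xi, xi_lo, xi_hi, degree):
    """psi_0 .. psi_degree of the first kind, after scaling xi to [-1, 1]
    via xi_bar = -1 + 2*(xi - xi_lo)/(xi_hi - xi_lo), as in (2)."""
    xb = -1.0 + 2.0 * (xi - xi_lo) / (xi_hi - xi_lo)
    psi = [1.0, xb]
    for _ in range(degree - 1):
        psi.append(2.0 * xb * psi[-1] - psi[-2])      # recursion (1)
    return np.array(psi[:degree + 1])

def features(x, u, x_bounds, u_bounds, deg_x=3, deg_u=3):
    """phi(x, u) = Phi(u) kron Psi(x), as in (3). Psi(x) is built as a
    tensor (Kronecker) product over the state components, an
    illustrative choice for multi-dimensional states."""
    Psi = np.array([1.0])
    for xi, (lo, hi) in zip(np.atleast_1d(x), x_bounds):
        Psi = np.kron(Psi, chebyshev(xi, lo, hi, deg_x))
    Phi = chebyshev(u, u_bounds[0], u_bounds[1], deg_u)
    return np.kron(Phi, Psi)                          # basis vector phi(x, u)
```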

Algorithm 1 Online LSPI with ε-greedy exploration
 1: Input: discount factor γ,
 2:   BFs φ_1, . . . , φ_n : X × U → R,
 3:   policy improvement interval K_θ, exploration {ε_k}_{k=0}^{∞},
 4:   a small constant β_Γ > 0
 5: l ← 0, initialize policy h_0
 6: Γ_0 ← β_Γ I_{n×n}, Λ_0 ← 0, z_0 ← 0
 7: measure initial state x_0
 8: for every time step k = 0, 1, 2, . . . do
 9:   u_k ← h_l(x_k) with prob. 1 − ε_k (exploit); a uniformly random action in U with prob. ε_k (explore)
10:   apply u_k, measure next state x_{k+1} and reward r_{k+1}
11:   Γ_{k+1} ← Γ_k + φ(x_k, u_k) φ^T(x_k, u_k)
12:   Λ_{k+1} ← Λ_k + φ(x_k, u_k) φ^T(x_{k+1}, h_l(x_{k+1}))
13:   z_{k+1} ← z_k + φ(x_k, u_k) r_{k+1}
14:   if k = (l + 1) K_θ then
15:     θ_l ← solve (1/(k+1)) Γ_{k+1} θ_l = (1/(k+1)) Λ_{k+1} θ_l + (1/(k+1)) z_{k+1}
16:     h_{l+1}(x) ← arg max_u φ^T(x, u) θ_l, ∀x
17:     l ← l + 1
18:   end if
19: end for
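As an illustration, a Python sketch of the main loop of Algorithm 1. It assumes a plant interface `step(x, u) -> (x_next, r)`, a feature map `phi(x, u)` (for example, the `features` sketch above with its bounds fixed), and a finite action grid spanning [u_lo, u_hi]; the exploration decay rate is a placeholder, and the discount factor is applied to the Λ term in the solve step, consistent with the Bellman equation given earlier.

```python
import numpy as np

def online_lspi(step, phi, action_grid, n, gamma=0.9, K_theta=7,
                beta_gamma=1e-3, eps0=1.0, eps_decay=0.995,
                n_steps=800, x0=None):
    """Online LSPI with epsilon-greedy exploration (cf. Algorithm 1).
    step(x, u) returns (x_next, r); phi(x, u) returns a length-n feature
    vector; actions are searched over a finite grid in [u_lo, u_hi]."""
    Gamma = beta_gamma * np.eye(n)                     # line 6
    Lam = np.zeros((n, n))
    z = np.zeros(n)
    theta = np.zeros(n)
    greedy = lambda s: max(action_grid, key=lambda u: phi(s, u) @ theta)
    x, l, eps = x0, 0, eps0                            # line 7
    for k in range(n_steps):                           # line 8
        if np.random.rand() < eps:                     # line 9: explore
            u = np.random.choice(action_grid)
        else:                                          # line 9: exploit h_l
            u = greedy(x)
        x_next, r = step(x, u)                         # line 10
        f = phi(x, u)
        Gamma += np.outer(f, f)                                  # line 11
        Lam += np.outer(f, phi(x_next, greedy(x_next)))          # line 12
        z += f * r                                               # line 13
        if k == (l + 1) * K_theta:                     # line 14
            # line 15: solve Gamma*theta = gamma*Lam*theta + z; the
            # 1/(k+1) factors cancel, and we place the discount on Lam.
            theta = np.linalg.solve(Gamma - gamma * Lam, z)
            l += 1                                     # lines 16-17
        x = x_next
        eps *= eps_decay            # exponentially decaying exploration
    return theta
```

In the paper's setup, three such loops run in parallel, one per axis, each receiving one component of the reward vector (4) and producing one of the control actions u_Lx, u_Ly, u_Lz.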

The reward function vector r_k ∈ R^{3×1} at step k is defined as

r_k = −c_u |u_{Lk}| − e_{Lk}^T c_e e_{Lk} − δ_{Lk}^T c_a δ_{Lk},          (4)

where u_{Lk} ∈ R^{3×1} is the action vector at step k, e_{Lk} ∈ R^{3×1} is the vector of the load tracking error at step k, δ_{Lk} ∈ R^{3×1} is the load displacement angle vector, c_u ∈ R^{3×1} is a vector with positive entries, and c_e ∈ R^{3×3} and c_a ∈ R^{3×3} are diagonal positive definite matrices. With this reward function we penalize the tracking error, the load swing angles, and the actions. Since in LSPI the action and the reward are scalars, in order to implement 2D or 3D trajectory tracking we develop three distinct functions that run three LSPI algorithms in parallel, providing three control actions for the quadrotor, u_Lx, u_Ly and u_Lz, as depicted in Figure 3. The control actions are bounded to the interval [u_lo, u_hi].
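A small sketch of the reward (4) as it would feed the three parallel LSPI agents; the weight values c_u, c_e and c_a below are placeholders, not the ones used in the paper.

```python
import numpy as np

def rewards(u_L, e_L, delta_L,
            c_u=np.array([0.1, 0.1, 0.1]),
            c_e=np.diag([10.0, 10.0, 10.0]),
            c_a=np.diag([1.0, 1.0, 1.0])):
    """Reward vector r_k of (4): penalize the action magnitude, the load
    tracking error, and the load displacement angles. One component
    feeds each of the three parallel LSPI agents. Weights are placeholders."""
    r_action = -c_u * np.abs(u_L)            # -c_u |u_Lk|, elementwise
    r_error = -(e_L @ c_e @ e_L)             # -e_Lk^T c_e e_Lk (scalar)
    r_swing = -(delta_L @ c_a @ delta_L)     # -delta_Lk^T c_a delta_Lk (scalar)
    return r_action + r_error + r_swing      # broadcasts to r_k in R^3
```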

[13, Theorem 4.2 (Mean convergence of an API algorithm), Corollary 4.1 (Mean convergence of RLSAPI)] and [17, Theorem 7.1 (Mean convergence of LS/RLSAPI with exact Chebyshev polynomials)] ensure mean convergence of the approximate policy iteration algorithm with Chebyshev approximators. This guarantees that the solution of the optimal control problem solved using Chebyshev-based LSPI generates a sequence of policies that asymptotically converges to the optimal performance in the mean.

IV. SIMULATION AND EXPERIMENTAL RESULTS

The algorithm described in the previous section is an online LSPI. This means that during every simulation step, one iteration of the algorithm is performed while interacting with the system. Therefore, the algorithm provides an action at every simulation step, based on the previous outcome of the action or based on exploration. We use the nonlinear model of the quadrotor and the suspended load described in [14] to build the simulation environment.

Fig. 4. Simulation results for online LSPI used for suspended load tracking the straight line: (a) quadrotor position (x_q, y_q, z_q vs. t); (b) load tracking error (e_xL, e_yL vs. number of samples); (c) load displacement in 2D (reference and simulation).

Fig. 5. Simulation results for online LSPI used for suspended load tracking the straight line, multiple simulation trials (x_L and y_L vs. t for the reference and three simulation trials).

The model provides feedback to the LSPI algorithm in the form of the load tracking error e_Lk and the load displacement angles θ_Lk and φ_Lk, as depicted in Figure 3. The algorithm and simulations were written and executed in Matlab and Simulink 2011 on a Windows machine. In order to check the execution time of the proposed algorithm, we ran the algorithm for 10000 steps, which took 0.562 seconds. Therefore, the average execution time of one iteration step k of the LSPI algorithm is 5.62 · 10^{−5} seconds. This shows that the algorithm is suitable for real-time implementation. As described in the previous section, we use three LSPI algorithms in parallel. In order to execute these algorithms in parallel, we used the Parallel Computing Toolbox from Matlab.

For our first simulation, we have the load tracking a straight line in the xy plane, going back and forth. Red lines in Figure 4(c) show the targeted trajectory. We tracked the performance of the algorithm for 800 steps, and these results are presented in Figure 4 in blue lines. Figure 4(a) shows the quadrotor trajectory resulting from the learning. Figure 4(c) presents the load trajectory and displacement from the targeted trajectory. Figure 4(b) depicts the load tracking error that the agent is trying to minimize. The error is bounded, which shows stability of the algorithm over 800 iteration steps. To additionally confirm the convergence and stability of the LSPI algorithm, we refer to Figure 5, which shows results for 3 simulation trials. We can see that the load starts tracking the reference trajectory very fast in each of the simulation trials. Furthermore, we can see that the load closely follows the reference trajectory after 15 seconds, not diverging more than a few centimeters from the reference trajectory in all three trials. Figure 6 shows simulation results of the performance of the algorithm with respect to noisy measurements of the displacement angle Φ_L and the tracking error e_xL. Additionally, at time t = 20 s there is a change in the length of the suspension cable of the load. We can see that the algorithm is robust and that the load tracks the reference trajectory.

Fig. 6. Simulation results for online LSPI, showing the algorithm performance with respect to noisy measurements and to a variable length of the suspension cable (load displacement Φ_L, tracking error e_xL, load trajectory x_L, and cable length L vs. t).

To verify that the simulation results are sound and that the produced LSPI trajectories are valid, we performed experiments using an AscTec Hummingbird quadrotor (see Figure 1). The experiments were performed in the MARHES lab at the University of New Mexico. The weight of the load is 45 g and the length of the suspension cable is 62 cm. A more detailed description of the experimental testbed can be found in [18]. We generated trajectories for the quadrotor by learning in simulation, and tested them on a real quadrotor with a suspended load. We measured the trajectories of the quadrotor and the load, as well as the load displacement angles, using a VICON indoor motion tracking system. We compare the experimental results with the simulation predictions. To show that with the LSPI algorithm the system is able to perform more complex trajectories, we chose a Lissajous curve as a second trajectory to track. Figure 7(a) compares the actual quadrotor trajectory as flown (in blue) with the trajectory predicted by simulation (in red). The differences between them are negligible, never exceeding 15 cm. Figure 7(b) compares the load trajectory recorded in the experiment (in blue) with the simulation-predicted trajectory (in red) and the target trajectory (in black).

Videos of the simulations and experiments can be found at [19].

V. CONCLUSIONS

Tracking and controlling the position of a suspended load can be critical to mission success. In this paper we present a model-free approach to solving this problem using a reinforcement learning algorithm. The method converges quickly to learn a policy that minimizes the tracking error of the load with respect to the reference trajectory.

Fig. 7. Experimental results using the trajectory generated in simulation using LSPI: (a) quadrotor position (simulation and experiment); (b) suspended load position (reference, simulation, and experiment).

We show simulation results of this policy on a variety of trajectories, from a simple straight line to a more complex Lissajous curve. In both cases, the error is bounded to centimeters. In order to further validate the simulation, experiments were run on an experimental testbed. In these experiments, we see that the load is able to follow a Lissajous curve trajectory closely.

The proposed method is designed so that it can be easily implemented on an off-the-shelf UAV system with minimal prerequisites regarding extra sensors, computational power, or additional knowledge of the system's model.

The reinforcement learning methods presented are flexible and can be used both offline and online for trajectory tracking. This flexibility makes them highly appropriate for quadrotor policy learning.

REFERENCES

[1] S. Bouabdallah, P. Murrieri, and R. Siegwart, "Design and control of an indoor micro quadrotor," in IEEE International Conference on Robotics and Automation, vol. 5, Barcelona, Spain, April 2004, pp. 4393–4398.

[2] G. Hoffman, S. Waslander, and C. Tomlin, "Quadrotor helicopter trajectory tracking control," in AIAA Guidance, Navigation and Control Conference and Exhibit, Honolulu, Hawaii, April 2008, pp. 1–14.

[3] A. Schollig, F. Augugliaro, and R. D'Andrea, "A platform for dance performances with multiple quadrocopters," in IEEE International Conference on Intelligent Robots and Systems, 2010, pp. 1–8.

[4] M. Valenti, D. Dale, J. How, and D. Pucci de Farias, "Mission health management for 24/7 persistent surveillance operations," in AIAA Conference on Guidance, Navigation and Control, Hilton Head, SC, August 2007, pp. 1–18.

[5] D. Mellinger, N. Michael, and V. Kumar, "Trajectory generation and control for precise aggressive maneuvers with quadrotors," in International Symposium on Experimental Robotics, New Delhi, India, December 2010.

[6] S. Lupashin, A. Schollig, M. Sherback, and R. D'Andrea, "A simple learning strategy for high-speed quadrocopter multi-flips," in IEEE International Conference on Robotics and Automation, May 2010, pp. 1642–1648.

[7] M. Orsag, C. Korpela, M. Pekala, and P. Oh, "Stability in aerial manipulation," in Proc. of IEEE American Control Conference (ACC), 2013, to appear.

[8] P. Abbeel, A. Coates, and A. Ng, "Autonomous helicopter aerobatics through apprenticeship learning," International Journal of Robotics Research, vol. 29, 2010.

[9] D. Zameroski, G. Starr, J. Wood, and R. Lumia, "Rapid swing-free transport of nonlinear payloads using dynamic programming," ASME Journal of Dynamic Systems, Measurement, and Control, vol. 130, no. 4, July 2008.

[10] G. Starr, J. Wood, and R. Lumia, "Rapid transport of suspended payloads," in IEEE International Conference on Robotics and Automation, April 2005, pp. 1394–1399.

[11] L. Busoniu, R. Babuska, B. De Schutter, and D. Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators. Boca Raton, Florida: CRC Press, 2010.

[12] M. G. Lagoudakis and R. Parr, "Least-squares policy iteration," Journal of Machine Learning Research, vol. 4, pp. 1107–1149, December 2003.

[13] J. Ma and W. B. Powell, "A convergent recursive least squares policy iteration algorithm for multi-dimensional Markov decision process with continuous state and action spaces," in IEEE Conference on Approximate Dynamic Programming and Reinforcement Learning, March 2009.

[14] I. Palunko, R. Fierro, and P. Cruz, "Trajectory generation for swing-free maneuvers of a quadrotor with suspended payload: A dynamic programming approach," in IEEE International Conference on Robotics and Automation, St. Paul, MN, USA, May 2012.

[15] R. E. Bellman, Dynamic Programming. Princeton, NJ: Princeton University Press, 1957.

[16] W. B. Powell and J. Ma, "A review of stochastic algorithms with continuous value function approximation and some new approximate policy iteration algorithms for multi-dimensional continuous applications," Journal of Control Theory and Applications, vol. 9, no. 3, pp. 336–352, 2011.

[17] J. Ma and W. B. Powell, "Convergence analysis of on-policy LSPI for multi-dimensional continuous state and action space MDPs and extension with orthogonal polynomial approximation," working paper, Dept. of Operations Research and Financial Engineering, Princeton University, Princeton, NJ, 2010.

[18] I. Palunko, P. Cruz, and R. Fierro, "Agile load transportation: Safe and efficient load manipulation with aerial robots," IEEE Robotics and Automation Magazine, vol. 19, no. 9, pp. 69–80, September 2012.

[19] (2012, September) MARHES - simulation and experimental videos. [Online]. Available: http://marhes.ece.unm.edu/index.php/Ipalunko:Home