Reproduction of Human Arm Movements Using Kinect-Based Motion Capture Data

José Rosado, Filipe Silva, Vítor Santos and Zhenli Lu, Member, IEEE


Abstract— Exploring the full potential of humanoid robots requires their ability to learn, generalize and reproduce complex tasks that will be faced in dynamic environments. In recent years, significant attention has been devoted to recovering kinematic information from human motion using a motion capture system. This paper demonstrates and evaluates the use of a Kinect-based capture system that estimates 3D human poses and converts them into gesture imitation by a robot. The main objectives are twofold: (1) to improve the initially estimated poses through a correction method based on constrained optimization, and (2) to present a method for computing the joint angles of the upper limbs corresponding to motion data from a human demonstrator. The feasibility of the approach is demonstrated by experimental results showing the upper-limb imitation of human actions by a robot model.

I. INTRODUCTION

Programming robots to perform complex tasks and extend their repertoire can be extremely tedious and time consuming. Learning from demonstration is a promising methodology that offers a more intuitive approach to teaching a robot how to generate its own motor skills [1, 2]. To this end, the robot should be able to estimate human poses while a desired task is performed, as well as to translate the skeleton data into appropriate motor commands. In recent years, a large body of work has studied the use of marker-based motion capture systems for extracting 3D poses as input for training robots to perform complex motions [3-6]. Despite much research progress, these systems are usually expensive, require careful calibration, and their application is limited to rigid environments. To overcome these limitations, the main challenge is to develop accurate methods for extracting 3D human poses from image sequences using low-cost systems as a valid alternative.

Recently, the field of markerless motion capture has experienced a strong evolution with the development of high-speed and cheap depth cameras. In particular, the depth data provided by the PrimeSense sensor opened up new opportunities for extracting gesture-based interactions with a more portable and less costly system. The publication of the tracking algorithm of the Kinect Software Development Kit [8] and the availability of several development environments (e.g., the Microsoft SDK) have contributed to a growing interest in model-free approaches. However, the success of these alternatives depends on the accuracy and robustness required in each specific area of application.

J. Rosado is with the Department of Computer Science and Systems Engineering, Coimbra Institute of Engineering, IPC, Coimbra, Portugal ([email protected]). F. Silva is with the Department of Electronics, Telecommunications and Informatics, Institute of Electronics and Telematics Engineering of Aveiro, University of Aveiro, Portugal (email: [email protected]). V. Santos is with the Department of Mechanical Engineering, Institute of Electronics and Telematics Engineering of Aveiro, University of Aveiro, Portugal (email: [email protected]). Z. Lu is with the Institute of Electronics and Telematics Engineering of Aveiro, University of Aveiro, Portugal (email: [email protected]).

This paper addresses the main concern associated with the use of Kinect-based human motion capture in robotics: the lack of a kinematic model to assure coherence in the provided poses. The main objective is to demonstrate and evaluate both a pose correction method for human actions and an inverse kinematics technique. The former aims to assure constant limb lengths over an entire sequence of poses. The latter converts each 3D pose into the corresponding angles of the upper-body joints, including a validation test to deal with physical limitations (e.g., joint limits). The motivation of this work is to create a database of classified motions for learning control in robotics.

In line with this, the remainder of the paper is organized as follows: Section II presents the motion capture system based on a single Kinect camera and the experimental conditions. Section III describes the pose correction method based on constrained optimization. Section IV focuses on the kinematic mapping from 3D poses to joint angles. Section V discusses the results achieved to validate the proposed solutions. Finally, Section VI concludes the paper and proposes future extensions.

II. HUMAN MOTION CAPTURE

The Kinect sensor provides a 640×480 depth image at 30 frames per second for skeleton-based pose estimation, with a depth resolution of a few centimeters. The human skeleton estimated from the depth image includes a total of 20 body joints, which are the input for our approach. The captured data consist of a set of Cartesian points in the 3D volume for each human pose, called raw data hereinafter. Several studies have assessed the accuracy of the depth reconstruction and joint positions from the Kinect pose estimation, including comparisons with ground-truth motion capture data [9-11]. In general, these studies highlight the potential of the Kinect skeleton in controlled body postures whenever self-occlusions are avoided.

In the experiments, we used a single Kinect camera positioned at about 3 meters from the human subject to capture the whole body standing upright. The human pose estimation is fully automatic and did not require calibration.


Fig. 1. Frame-to-frame variation of the limb lengths for a static posture (left) and a reaching arm movement (right).

In this study, attention is devoted to the upper limbs, including the shoulder, elbow and wrist joints of both the right and left arms. In order to ensure the most convenient acquisition conditions, the human subject was asked to avoid lower trunk movements and to perform controlled scapular motions. Precautions were also taken to avoid occlusions of the upper limb parts.

Besides the accuracy and robustness of the skeletal poses, a critical element is the stability of the estimated frame-to-frame body geometry. As mentioned before, a characteristic of the human body skeletonization with the Kinect sensor is that the limb lengths are not kept constant through the entire sequence and differ between the two arms. Fig. 1 illustrates the frame-to-frame variations of the limb lengths for a static posture and for a reaching arm movement. In the static case, the mean value is around 268 mm for the arm and 233 mm for the forearm, while the standard deviations are around 3.65 mm and 1.51 mm, respectively. These measures are significantly different during the execution of a reaching movement: a mean of 265 mm for the arm and 216 mm for the forearm, with standard deviations around 15.9 mm and 8.8 mm, respectively.
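As a rough check of this variability, the limb lengths and their statistics can be computed directly from the raw joint positions. The sketch below is a Python/NumPy illustration; the array layout, joint ordering and synthetic noise level are assumptions for this example, not the Kinect SDK's data format:

```python
import numpy as np

def limb_length_stats(frames, joint_a, joint_b):
    """Mean and standard deviation of the distance between two joints
    across all frames of a captured sequence.
    frames: array of shape (n_frames, n_joints, 3), one 3D point per joint."""
    frames = np.asarray(frames, dtype=float)
    d = np.linalg.norm(frames[:, joint_a] - frames[:, joint_b], axis=1)
    return d.mean(), d.std()

# Synthetic example: a 268 mm upper arm corrupted by per-axis sensor noise.
rng = np.random.default_rng(0)
shoulder = np.zeros((100, 3))
elbow = np.tile([268.0, 0.0, 0.0], (100, 1)) + rng.normal(0.0, 3.65, (100, 3))
frames = np.stack([shoulder, elbow], axis=1)      # shape (100, 2, 3)
mean_len, std_len = limb_length_stats(frames, 0, 1)
```

On real captures, running this per limb over a static posture versus a reaching movement would reproduce the kind of comparison shown in Fig. 1.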

III. CONSTRAINT-BASED MOTION FILTERING

The pose correction method aims to convert the motion of a source human subject into a new motion, while satisfying a given set of kinematic constraints. These kinematic constraints are formulated in order to assure a kinematic model with constant limb lengths. The proposed method, applied to each individual frame, can be divided into two main steps:

• Static calibration: the first step is a static calibration of the arms, prior to each data collection, to define the reference model of the subject's anthropometry. Concretely, the human subject was asked to hold the arms fully extended and aligned with the trunk (fundamental standing position), while several frames are acquired. A distance vector between consecutive joints (shoulder-elbow and elbow-wrist) is calculated as the mean value taken over all these frames for both arms. It should be pointed out that this arm calibration is the basis for the joint-angle calculations in Section IV: all joint angles are defined as zero degrees at this calibration posture.

• Pose correction: the basic problem is to find the closest configuration $X = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^{3 \times n}$, with $x_i \in \mathbb{R}^3$, to the measurements that are observed over time, such that the distance between consecutive points (i.e., the link lengths) remains constant.

In line with this, we deal with the following optimization problem:

$$\min_{X \in \Omega} \; \sum_{i=1}^{n} \left\| \hat{x}_i - x_i \right\| \qquad (1)$$

where $\Omega$ is a certain subset of $\mathbb{R}^{3 \times n}$ and $\|\cdot\|$ is an appropriate norm which measures the goodness of fit. Here, we adopt the Euclidean norm as the measure of closeness. The goal is to minimize the objective function (1) by selecting a value of $X$ that satisfies all the quadratic equality constraints defined by:

$$\left\| x_i - x_{i+1} \right\| = d_{i,i+1} \qquad (2)$$

where the left-hand side is the Euclidean distance between two consecutive points and the right-hand side is the corresponding link length in the reference model.

The constrained minimization problem was solved with the OPTI toolbox, which can optimize a quadratic function of several variables subject to quadratic constraints. The comparison between the human skeletons obtained from the Kinect raw data and those after the pose correction is illustrated in Fig. 2. Different poses are represented for a movement sequence involving both the right and the left arm.
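The paper uses the MATLAB OPTI toolbox for this step; as an illustrative stand-in only, the same quadratically constrained problem of eqs. (1)-(2) can be sketched with SciPy's SLSQP solver (function names, data layout and the squared formulation of the constraints are assumptions of this sketch):

```python
import numpy as np
from scipy.optimize import minimize

def correct_pose(measured, lengths):
    """Find the chain configuration closest (in the Euclidean sense) to the
    measured joint positions such that consecutive links keep the reference
    lengths.  measured: (n, 3) raw joint positions; lengths: (n-1,) reference
    link lengths from the static calibration."""
    measured = np.asarray(measured, dtype=float)
    n = measured.shape[0]

    def objective(x):                       # squared version of eq. (1)
        return np.sum((x.reshape(n, 3) - measured) ** 2)

    def link_constraint(i):                 # squared version of eq. (2)
        def c(x):
            p = x.reshape(n, 3)
            return np.sum((p[i] - p[i + 1]) ** 2) - lengths[i] ** 2
        return c

    cons = [{"type": "eq", "fun": link_constraint(i)} for i in range(n - 1)]
    res = minimize(objective, measured.ravel(), constraints=cons, method="SLSQP")
    return res.x.reshape(n, 3)

# Shoulder-elbow-wrist chain whose measured links (300, 250) are pulled
# back to the calibrated reference lengths (268, 233).
raw = np.array([[0.0, 0.0, 0.0], [300.0, 0.0, 0.0], [300.0, 250.0, 0.0]])
corrected = correct_pose(raw, np.array([268.0, 233.0]))
```

Applied frame by frame, this yields corrected skeletons with constant limb lengths, as in Fig. 2.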

Fig. 2. Overlap of the human skeletons extracted from the Kinect and those after the constraint-based optimization at different frames (green points and black lines are original Kinect data; red points and blue lines are motion-constrained filtered data; green and red lines are the respective trajectories of the wrists).

IV. KINEMATIC MAPPING

One of the main issues in using motion capture data for training robots is converting the 3D joint positions into joint angles relative to a robot model. In this context, the human skeleton is replaced by two 4 degree-of-freedom (DOF) robot arms of the same dimensions. Then, an inverse kinematics algorithm generates the corresponding joint angles of the robot for each pose. The problem is decomposed into a per-frame inverse kinematics algorithm, followed by motion filtering and interpolation.

A. Inverse kinematics

The filtered movement data are the input to the inverse kinematics module, in which the human arms are modeled as two independent 4-DOF serial chains consisting of a 3-DOF shoulder (revolute joints with intersecting axes) and a 1-DOF elbow joint. The implementation of the inverse kinematics follows some basic assumptions. First, the robot model was defined to match the anthropometric measures of the human subject, avoiding the retargeting problem (i.e., compensating for body differences). Second, the perturbations in the movement data caused by the movement of the subject's shoulder are ignored. Concretely, we consider that all joint positions are uniformly affected by the perturbations and that the shoulders are at the origin of the reference system with fixed coordinate frames. Third, the inverse kinematics considers mechanical constraints on the joints, such as physical limits both on the range of joint motions (e.g., the elbow cannot invert its motion when fully stretched) and on the maximum joint velocities.

Given the 3D positions of the shoulder, elbow and wrist, the inverse kinematics algorithm is simplified: two degrees of freedom completely describe the elbow when the position of the shoulder is known (the elbow lies on the surface of a sphere centered at the shoulder). Similarly, the wrist can only lie on the surface of a sphere centered at the elbow. Thus, the configuration of the arm is completely represented by four variables (the joint angles). Attention was devoted to avoiding the discontinuous jumps near ±180° associated with the use of inverse tangent functions.
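The paper does not give the closed-form angle expressions, so the following is only a hedged illustration of the two ingredients above: recovering the elbow flexion angle from the three joint positions, and removing the ±180° jumps by unwrapping the per-frame angle sequence (the zero-angle convention here follows the calibration posture; everything else is an assumption of this sketch):

```python
import numpy as np

def elbow_angle(shoulder, elbow, wrist):
    """Elbow flexion angle from the three 3D joint positions: the angle
    between the upper-arm and forearm vectors (0 rad = fully stretched,
    matching the zero defined at the calibration posture)."""
    u = np.asarray(elbow, float) - np.asarray(shoulder, float)   # upper arm
    f = np.asarray(wrist, float) - np.asarray(elbow, float)      # forearm
    cos_a = np.dot(u, f) / (np.linalg.norm(u) * np.linalg.norm(f))
    return np.arccos(np.clip(cos_a, -1.0, 1.0))                  # robust acos

def continuous(angles):
    """Remove the discontinuous jumps near +/-180 deg produced by inverse
    tangent functions by unwrapping the per-frame angle sequence."""
    return np.unwrap(np.asarray(angles, float))

# Right angle at the elbow: upper arm along x, forearm along y (lengths in mm).
theta = elbow_angle([0, 0, 0], [268, 0, 0], [268, 233, 0])   # ~pi/2
```

The three shoulder angles would be obtained analogously from the direction of the upper-arm vector in the shoulder frame, with `np.unwrap` applied to each joint trajectory.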

Additionally, the implemented algorithm includes a validation test, since there may be motions where the robot's joints are not able to approximate the human pose in a reasonable way due to physical limitations. The proposed strategy to cope properly with the joint velocity limits is to slow down the task-space trajectory whenever the limits are encountered. Thus, whenever the generated joint velocities violate the speed limits of the joint actuators, the trajectory is scaled in time by an appropriate constant that simultaneously assures tracking of the desired arm path and fulfillment of the velocity constraints.
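A minimal sketch of this time-scaling strategy, under the assumption that a single constant factor is applied to the whole trajectory:

```python
import numpy as np

def time_scale(t, q, v_max):
    """Slow down a joint trajectory q(t) whenever its velocity exceeds
    v_max: the time axis is stretched by one constant factor, so the joint
    path (and hence the task-space path) is preserved exactly.
    t: (m,) time stamps; q: (m, n_joints) joint angles."""
    t, q = np.asarray(t, float), np.asarray(q, float)
    v = np.gradient(q, t, axis=0)            # per-frame joint velocities
    k = np.max(np.abs(v)) / v_max            # worst-case violation ratio
    if k <= 1.0:
        return t, q                          # already within the limits
    return t * k, q                          # same path, k times slower

# A sinusoidal joint trajectory with peak speed ~2*pi rad/s, limited to 1 rad/s.
t = np.linspace(0.0, 1.0, 31)
q = np.column_stack([np.sin(2 * np.pi * t)])
t2, q2 = time_scale(t, q, v_max=1.0)
```

Because only the time stamps change, the sequence of poses, and therefore the demonstrated arm path, is tracked unchanged while the velocity constraint is met.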

B. Filtering and Interpolation

The frame rate of the Kinect sensor and high-frequency components in the movement data imposed a post-processing stage to refine the results. The exact procedure combines basic interpolation and smoothing techniques. On the one hand, the joint-angle trajectories are filtered using a moving-average algorithm to smooth out short-term fluctuations, based on predefined trial onset and termination times. On the other hand, the strategy adopted to provide a more detailed description of the action performed by the human subject is to use spline interpolation over the set of observations, in order to satisfy the requirements of differentiability. To evaluate the different steps of post-processing, we use a measure based on jerk, the third time derivative of position, to quantify smoothness at the level of the joint-angle trajectories. Concretely, the particular jerk metric used to quantify movement smoothness is the integrated squared jerk [12], defined by:

$$\eta_{isj} = \int_{t_1}^{t_2} \dddot{x}(t)^2 \, dt \qquad (3)$$
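The post-processing chain and the metric of eq. (3) can be sketched as follows (the window size, finite-difference scheme and synthetic test signal are illustrative choices of this sketch, not the paper's):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def moving_average(x, w):
    """Smooth out short-term fluctuations with a centered moving average."""
    return np.convolve(x, np.ones(w) / w, mode="same")

def integrated_squared_jerk(t, x):
    """Integrated squared jerk, eq. (3): the squared third time derivative
    of the trajectory, integrated over the observation interval."""
    jerk = np.gradient(np.gradient(np.gradient(x, t), t), t)
    return float(np.sum(jerk ** 2) * (t[1] - t[0]))

# Synthetic joint-angle trajectory with measurement noise.
t = np.linspace(0.0, 2.0, 61)
rng = np.random.default_rng(1)
noisy = np.sin(np.pi * t) + rng.normal(0.0, 0.01, t.size)
smooth = moving_average(noisy, 5)
spline = CubicSpline(t, smooth)            # differentiable representation
isj_raw = integrated_squared_jerk(t, noisy)
isj_smooth = integrated_squared_jerk(t, smooth)
```

Comparing the metric before and after filtering reproduces the kind of evaluation summarized in Fig. 3: the smoothed, interpolated trajectories score a much lower integrated squared jerk than the raw joint-angle data.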

A comparison of the movement smoothness measures among the original signal (after pose correction), the moving-average filtered signal, the cubic spline interpolation and the fifth-order spline interpolation was performed (Fig. 3). The exact procedure to be followed depends on the ultimate goal. In any case, the previous considerations may be of importance in determining what strategies are appropriate to the problem at hand.

Fig. 3. Comparison of the smoothness measure for different motion post-processing methods applied to the joint-angle trajectories (the ordinate is plotted on a logarithmic scale).

V. GESTURE IMITATION IN A ROBOT

Several real-time movements executed by a human subject were captured using the Kinect sensor to provide validation for our algorithms. Two different movements were chosen to illustrate the results: a rhythmic motion repeated many times and a discrete sequence of upper-limb movements. In the first experiment, the human subject was asked to repeat a circular path trying to keep, as much as possible, a constant speed across all trials. Fig. 4 and Fig. 5 show the variability always present in human movements, both in the task- and joint-spaces. Since the details vary, it seems necessary to consolidate the demonstrated movements having in mind the desired final result (i.e., the extent to which the motor goal is reached).

Fig. 4. Variability of human movements in the task-space for the execution of a circular path repeated many times.

Fig. 5. Variability of human movements in the joint-space for the execution of a circular path repeated many times.

The second experiment consists of a gesture imitation task using the two arms in different configurations around the T-pose. Fig. 6 compares the positions of the right and left wrists as given by the filtered data and by the robot simulation. The consistency between the two curves suggests the efficacy of the proposed human motion reconstruction algorithm.

VI. CONCLUSIONS

In this paper, we have described and demonstrated the potential of the Kinect sensor for gesture imitation by an upper-body robot from the demonstrations of a human teacher. The implementation of the proposed ideas on the 4-DOF robot model shows that human-demonstrated gestures are well replicated by the robot. In this context, the approach is useful for providing a natural and intuitive interface for a user to teach complex movements to a robot. The main goal is to create real data sets that, if combined with others, can later be used for learning a compact representation of the task. In this context, they will assist in developing learning techniques for manipulation/locomotion behaviors based on examples of human demonstrations.

Fig. 6. Comparison of the motion capture data (left) with the gestures replicated by the robot (the end-effector path is represented in both cases).

ACKNOWLEDGMENT

This work is partially funded by FEDER through the Operational Program Competitiveness Factors - COMPETE and by National Funds through FCT - Foundation for Science and Technology in the context of the project FCOMP-01-0124-FEDER-022682 (FCT reference Pest-C/EEI/UI0127/2011). Zhenli Lu is supported by FCT under contract CIENCIA 2007 (Post Doctoral Research Positions for the National Scientific and Technological System).

REFERENCES

[1] Billard, A., Calinon, S., Dillmann, R., Schaal, S.: Robot Programming by Demonstration. In: Siciliano, B., Khatib, O. (eds.) Handbook of Robotics, Springer, New York, NY, USA (2008)

[2] Argall, B.D., Chernova, S., Veloso, M., Browning, B.: A Survey of Robot Learning from Demonstration. Robotics and Autonomous Systems, 57(5): 469-483 (2009)

[3] Dasgupta, A., Nakamura, Y.: Making Feasible Walking Motion of Humanoid Robots from Human Motion Capture Data. In: IEEE International Conference on Robotics and Automation, pp. 1044-1049 (1999)

[4] Elgammal, A., Lee, C-S.: Tracking People on a Torus. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(3): 520-538 (2009)

[5] Inamura, T., Toshima, I., Tanie, H., Nakamura, Y.: Embodied Symbol Emergence Based on Mimesis Theory. International Journal of Robotics Research, 23(4-5): 363-377 (2004)

[6] Kulic, D., Takano, W., Nakamura, Y.: Incremental Learning, Clustering and Hierarchy Formation of Whole Body Motion Patterns using Adaptive Hidden Markov Chains. International Journal of Robotics Research, 27(7): 761-784 (2008)

[7] Shon, A.P., Grochow, K., Hertzmann, A., Rao, R.P.: Learning Shared Latent Structure for Image Synthesis and Robotic Imitation. In: Weiss, Y., Schölkopf, B., Platt, J.C. (eds.) Advances in Neural Information Processing Systems, MIT Press, Cambridge, MA (2005)

[8] Shotton, J., Fitzgibbon, A.W., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., Blake, A.: Real-time Human Pose Recognition in Parts from Single Depth Images. In: IEEE Computer Vision and Pattern Recognition, Colorado Springs, USA (2011)

[9] Khoshelham, K., Elberink, S.O.: Accuracy and Resolution of Kinect Depth Data for Indoor Mapping Applications. Sensors, 12(2): 1437-1454 (2012)

[10] Smisek, J., Jancosek, M., Pajdla, T.: 3D with Kinect. In: International Conference on Computer Vision Workshops, pp. 1154-1160, Barcelona, Spain (2011)

[11] Obdržálek, S., Kurillo, G., Ofli, F., Bajcsy, R., Seto, E., Jimison, H., Pavel, M.: Accuracy and Robustness of Kinect Pose Estimation in the Context of Coaching of Elderly Population. In: International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 1188-1193, California, USA (2012)

[12] Platz, T., Denzler, P., Kaden, B., Mauritz, K.-H.: Motor Learning After Recovery from Hemiparesis. Neuropsychologia, 32, 1209-1223 (1994)