
3D Gesture Recognition with Growing Neural Gas

J.A. Serra-Pérez, J. García-Rodríguez, S. Orts-Escolano, J.M. García-Chamizo, A. Angelopoulou, A. Psarrou and M. Mentzeopoulos

Abstract— We propose the design of a real-time system to recognize and interpret hand gestures acquired with low-cost 3D sensors. 3D hand pose segmentation, characterization and tracking are implemented using a growing neural gas (GNG) structure. The system's capacity to obtain information with a high number of degrees of freedom allows the encoding of many gestures and very accurate motion capture. The use of hand pose models, combined with the motion information provided by the GNG, addresses the problem of hand motion representation. A natural interface applied to a virtual mirror writing system and a system to estimate hand pose have been designed to demonstrate the validity of the approach.

I. INTRODUCTION

In recent years there has been increasing research in human-computer interaction (HCI) aimed at creating more user-friendly interfaces that make direct use of natural human abilities in communication and manipulation.

The adoption of direct-interaction HCI will enable the deployment of a wide range of applications in more sophisticated computing environments, such as virtual reality (VR) or augmented reality (AR) systems. The development of these systems means addressing complex research problems, including sophisticated input/output techniques, different types of interaction and assessment methods. In the domain of input techniques, direct interaction requires the capture and interpretation of the movements of the head, eyes, face, hands, arms and even the whole body.

Among the different body parts, the hand is the most effective general-purpose interaction tool, owing to its dual role in communication and manipulation. Some trends embrace both modalities of interaction, allowing intuitive and natural interaction. Gesture languages based on hand postures (i.e., static gestures) or movement patterns (i.e., dynamic gestures) have been used to implement command and control interfaces [1-4]. Gesticulation, the spontaneous movement of the arms and hands that accompanies speech, has proved to be a powerful tool for multimodal user interfaces [5-9].

Manuscript received February 20, 2013. J. Serra-Pérez, J. García-Rodríguez, S. Orts-Escolano and J.M. García-Chamizo are with the Department of Computer Technology, University of Alicante, PO Box 99, 03080, Alicante, Spain (phone: +34 965903400; fax: +34 965909643; e-mail: {jserra, jgarcia, sorts, juanma}@dtic.ua.es).

A. Angelopoulou, A. Psarrou and M. Mentzeopoulos are with the Computer Vision and Imaging research group, School of Computer Science, University of Westminster, Cavendish W1W 6UW, United Kingdom (e-mail: {agelopa, psarroa, mentzem}@wmin.ac.uk).

Object manipulation interfaces [10-12] use the hand for navigation, selection and manipulation tasks in virtual environments.

In many applications, such as the control of machinery or manipulators, the handling of computer-based avatars, or musical performance [13], the hand serves as an efficient control device with a high number of degrees of freedom (DOF). Finally, some immersive VR applications, such as surgical simulations [14] and training systems [15], include the manipulation of complex objects in their very definition. The widespread deployment of gesture-based HCI systems requires developing general-purpose systems for the capture and interpretation of hand movement.

Currently, the most effective tools for capturing hand motion are electro-mechanical or magnetic devices (data gloves) [16,17]. These devices are worn on the hand to measure the location and angles of the finger joints. They offer the most comprehensive set of real-time measurements, are application-independent, and allow full functionality of the hand in HCI systems. However, they have several disadvantages in terms of use: they are very expensive, impede the natural movement of the hand, and require calibration processes and complex installation procedures to obtain accurate measurements.

Computer vision (CV) represents a promising alternative to data gloves due to its potential to provide more natural interaction. However, several challenges must still be overcome before widespread use: accuracy, processing speed, and generality, among others. The recovery of all the degrees of freedom of hand motion from images with inevitable self-occlusions is a very complex and computationally intensive problem. As a result, current implementations of computer-vision-based systems have little in common with glove-based ones. Since the late 70s [18], the predominant method of vision-based interaction has been based on appearance models of hand movement [19,20]. These models have been successfully applied to the construction of classification systems for the detection of elements of a gesture vocabulary. However, the 3D motion information provided by these systems is limited to rough estimates of finger positions, their orientations and/or the pose of the palm, obtained using specific appearance characteristics that limit generality.

In particular, the estimation of the 3D pose of the hand is of special interest because, by understanding the configuration of the hands, we will be able to build systems that can interpret human activities and understand important aspects of the interaction of a human with their physical and social environment. Several works address this problem using only visual data, without markers [21,22]. Existing approaches can be classified as model-based and appearance-based. The former provide a continuum of solutions but are computationally expensive and depend on the availability of a large amount of visual information (usually provided by a multi-camera system). Appearance-based systems have a lower computational cost and much smaller hardware complexity, but only recognize a discrete number of hand poses, generally corresponding to the training set.

Despite the large amount of work in this field [23,24], the problem is still open and presents several theoretical and practical challenges, due to a number of difficulties common to most systems, among which the following stand out:

High dimensionality problem: The hand is an articulated object with more than 20 DOF. Although natural hand movement does not exercise all 20 degrees of freedom, due to the interdependencies between the fingers, studies have shown that it is not possible to use fewer than 6 DOF. Along with the location and orientation of the hand itself, a large number of parameters must still be estimated.

Self-occlusions: Since the hand is an articulated object, its projections generate a variety of shapes with numerous self-occlusions, which makes it harder to segment the various parts of the hand and to extract high-level features.

Processing speed: Even for a single image, a real-time CV system needs to process a large amount of data. Furthermore, the latency requirements of some applications are very demanding in terms of computing power. With current architectures, some of the existing algorithms require expensive, dedicated hardware with parallel processing capabilities to operate in real time.

Uncontrolled environments: For general use, many HCI systems must operate with unconstrained backgrounds and a wide range of lighting conditions. Even the localization of a rigid object against an arbitrary background is a complex problem in computer vision.

Rapid hand movements: The hand can move very fast, reaching speeds of up to 5 m/s in translation and 300°/s in wrist rotation. Presently, cameras typically support frame rates of 30-60 Hz, and it is very difficult for tracking algorithms to run at 30 Hz. The combination of high-speed hand movement and low sampling rates creates additional difficulties for tracking algorithms (i.e., images in consecutive frames become less and less correlated as the speed of hand movement increases).

In this work, we propose a new model-based approach to 3D hand tracking. The observations come from a low-cost 3D sensor (Kinect), and the optimization is performed with a variant of growing neural networks, the GNG. We aim to achieve accurate and robust tracking at an acquisition frequency of at least 10-15 Hz. The proposed method is novel because it (a) provides accurate solutions to the problem of 3D hand tracking, (b) requires no complex hardware configuration, (c) is based solely on visual data, without physical markers, (d) is not very sensitive to lighting conditions, and (e) runs in real time.

The rest of the paper is organized as follows: section 2 presents the growing neural gas, describes its ability to obtain reduced representations of objects while preserving topological information, and extends these capabilities to the representation of point cloud sequences and motion analysis. In section 3, the hand pose estimation system is described. Finally, section 4 presents experiments applying the system to virtual mirror writing and hand gesture recognition, followed by our conclusions and future work.

II. TOPOLOGICAL REPRESENTATION WITH GROWING NEURAL GAS

One way to obtain a reduced and compact representation of 2D shapes or 3D surfaces is to use a topographic mapping where a low dimensional map is fitted to the high dimensional manifold of the shape, whilst preserving the topographic structure of the data. A common way to achieve this is by using self-organising neural networks where input patterns are projected onto a network of neural units such that similar patterns are projected onto units adjacent in the network and vice versa.

The approach presented in this paper is based on self-organising networks trained using the Growing Neural Gas learning method [25], an incremental training algorithm. The links between the units in the network are established through competitive Hebbian learning. As a result, the algorithm can be used in cases where the topological structure of the input pattern is not known a priori, and it yields topology-preserving maps of the feature manifold.

A. Growing Neural Gas

From the Neural Gas model [26] and Growing Cell Structures [27], Fritzke developed the Growing Neural Gas model, which has no predefined topology of connections between neurons. A growth process takes place from a minimal network size, and new units are inserted successively using a particular type of vector quantisation [28], [29]. To determine where to insert new units, local error measures are gathered during the adaptation process and each new unit is inserted near the unit with the highest accumulated error. At each adaptation step a connection between the winner and the second-nearest unit is created, as dictated by the competitive Hebbian learning algorithm. This continues until an ending condition is fulfilled, for example the evaluation of the optimal network topology based on some measure, the insertion of a predefined number of neurons, or a temporal constraint. In addition, in GNG networks the learning parameters are constant in time, in contrast to other methods whose learning is based on decaying parameters. The network is specified as:

A set $A$ of nodes (neurons), where each neuron $c \in A$ has an associated reference vector $w_c \in \mathbb{R}^d$. The reference vectors can be regarded as positions in the input space of their corresponding neurons.

A set of edges (connections) between pairs of neurons. These connections are not weighted and their purpose is to define the topological structure. The edges are determined using the competitive Hebbian learning algorithm. An edge ageing scheme is used to remove connections that become invalid due to the activation of the neurons during the adaptation process.

The GNG learning algorithm is as follows:

1. Start with two neurons $a$ and $b$ at random positions $w_a$ and $w_b$ in $\mathbb{R}^d$.

2. Generate a random input signal $\xi$ according to a density function $P(\xi)$.

3. Find the nearest neuron (winner neuron) $s_1$ and the second nearest $s_2$.

4. Increase the age of all the edges emanating from $s_1$.

5. Add the squared distance between the input signal and the winner neuron to the error counter of $s_1$:

$$\Delta error(s_1) = \| w_{s_1} - \xi \|^2$$

6. Move the winner neuron $s_1$ and its topological neighbours (neurons connected to $s_1$) towards $\xi$ by learning steps $\epsilon_w$ and $\epsilon_n$, respectively, of the total distance:

$$\Delta w_{s_1} = \epsilon_w (\xi - w_{s_1})$$

$$\Delta w_{s_n} = \epsilon_n (\xi - w_{s_n})$$

7. If $s_1$ and $s_2$ are connected by an edge, set the age of this edge to 0. If the edge does not exist, create it.

8. Remove the edges older than $a_{max}$. If this results in isolated neurons (without emanating edges), remove them as well.

9. Every certain number $\lambda$ of input signals generated, insert a new neuron as follows:

- Determine the neuron $q$ with the maximum accumulated error.
- Insert a new neuron $r$ between $q$ and its neighbour $f$ with the largest accumulated error:

$$w_r = 0.5 (w_q + w_f)$$

- Insert new edges connecting the neuron $r$ with neurons $q$ and $f$, removing the old edge between $q$ and $f$.
- Decrease the error variables of neurons $q$ and $f$, multiplying them by a constant $\alpha$. Initialize the error variable of $r$ with the new value of the error variable of $q$.

10. Decrease all error variables by multiplying them by a constant $\beta$.

11. If the stopping criterion is not yet fulfilled, go to step 2.

In summary, the adaptation of the network to the input space takes place in step 6. The insertion of connections (step 7) between the two neurons closest to each randomly generated input pattern establishes an induced Delaunay triangulation in the input space. The elimination of connections (step 8) removes the edges that no longer belong to the triangulation; this is done by eliminating the connections between neurons that are no longer nearest to each other or that have acquired nearer neighbours. Finally, the accumulated error (step 5) allows the identification of those zones of the input space where more neurons are needed to improve the mapping.
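
To make the learning loop concrete, the following is a minimal NumPy sketch of the algorithm above. The class name, the sample() callback, and the reading of step 10 as multiplication by $(1 - \beta)$ are our assumptions, not code from the paper; step 8's isolated-neuron removal is omitted for brevity.

```python
import numpy as np

class GrowingNeuralGas:
    def __init__(self, dim=3, eps_w=0.1, eps_n=0.01, a_max=250,
                 lam=2000, alpha=0.5, beta=0.0005):
        self.eps_w, self.eps_n = eps_w, eps_n
        self.a_max, self.lam = a_max, lam
        self.alpha, self.beta = alpha, beta
        # Step 1: two neurons at random positions in R^d.
        self.w = [np.random.rand(dim), np.random.rand(dim)]
        self.error = [0.0, 0.0]
        self.edges = {}  # {(i, j): age}, with i < j

    def adapt(self, x):
        # Steps 3-8: one inner-loop iteration for an input signal x.
        d = [np.linalg.norm(x - w) for w in self.w]
        s1, s2 = (int(i) for i in np.argsort(d)[:2])
        for e in self.edges:                          # step 4: age winner's edges
            if s1 in e:
                self.edges[e] += 1
        self.error[s1] += d[s1] ** 2                  # step 5: accumulate error
        self.w[s1] += self.eps_w * (x - self.w[s1])   # step 6: move winner...
        for e in self.edges:
            if s1 in e:                               # ...and its neighbours
                n = e[0] if e[1] == s1 else e[1]
                self.w[n] += self.eps_n * (x - self.w[n])
        self.edges[tuple(sorted((s1, s2)))] = 0       # step 7: refresh/create edge
        self.edges = {e: a for e, a in self.edges.items()
                      if a <= self.a_max}             # step 8 (isolated-neuron
                                                      # removal omitted)

    def insert(self):
        # Step 9: new neuron between q (max error) and its worst neighbour f.
        q = int(np.argmax(self.error))
        nbrs = [e[0] if e[1] == q else e[1] for e in self.edges if q in e]
        if not nbrs:
            return
        f = max(nbrs, key=lambda n: self.error[n])
        r = len(self.w)
        self.w.append(0.5 * (self.w[q] + self.w[f]))
        del self.edges[tuple(sorted((q, f)))]
        self.edges[tuple(sorted((q, r)))] = 0
        self.edges[tuple(sorted((f, r)))] = 0
        self.error[q] *= self.alpha                   # decrease errors by alpha
        self.error[f] *= self.alpha
        self.error.append(self.error[q])              # init error of r

    def fit(self, sample, max_neurons=200):
        # Steps 2-11: sample() draws one input signal according to P(xi).
        step = 0
        while len(self.w) < max_neurons:
            step += 1
            self.adapt(sample())
            if step % self.lam == 0:                  # step 9 every lambda signals
                self.insert()
            self.error = [e * (1.0 - self.beta)       # step 10: error decay
                          for e in self.error]
```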

B. Visual Data Representation with Growing Neural Gas

The ability of neural gases to preserve the topology will be employed in this work for the representation and tracking of objects. Identifying the points of the input data that belong to objects allows the network to adapt its structure to this input subspace, obtaining an induced Delaunay triangulation of the object.

Let $O = (A_G, A_V)$ be an object defined by a geometric appearance $A_G$ and a visual appearance $A_V$. The geometric appearance $A_G$ is given by morphologic parameters $G_M$ (local deformations) and positional parameters $G_P$ (translation, rotation and scale):

$$A_G = (G_M, G_P)$$

The visual appearance $A_V$ is the set of object characteristics such as colour, depth, texture or brightness, among others. In particular, consider objects in two dimensions. Given a support domain $S \subseteq \mathbb{R}^2$, an image intensity function $I(x, y): S \to [0, I_{max}]$ and an object $O$, its standard potential field $\Psi_T(x, y) = f_T(I(x, y))$ is the transformation $\Psi_T: S \to [0, 1]$ which associates to each point $(x, y) \in S$ the degree of compliance with the visual property $T$ of the object $O$, given its associated intensity $I(x, y)$. We consider:

- The space of input signals as the set of points in the image:

$$V = \{x_0, \ldots, x_n\}, \quad x_i \in S$$

- The probability density function given by the standard potential field obtained at each point of the image:

$$(p_0, \ldots, p_n) = \Psi_T(x_0, \ldots, x_n)$$

Fig. 1. Different image features representation with GNG.

Learning takes place following the GNG algorithm described in the previous section. Through this process, a representation based on the neural network structure is obtained which preserves the topology of the object $O$ for a certain feature $T$; that is, from the visual appearance $A_V$ of the object, an approximation to its geometric appearance $A_G$ is obtained.

Henceforth we call Topology Preserving Graph, $TPG = \langle A, C \rangle$, the undirected graph defined by a set of vertices (neurons) $A$ and a set of edges $C$ that connect them, which preserves the topology of an object from the considered standard potential field.

By choosing different transformations $\Psi_T(x_0, \ldots, x_n) = f_T(I(x_0, \ldots, x_n))$ one can obtain, for example, the representation of objects in one (silhouette, figure 1 left), two (shape, figure 1 middle) or three dimensions (volume, figure 1 right), each inducing a different structure in the network.

In our case the input data will be acquired with 3D sensors, and the topology preserving graph described by the neural network structure (neuron reference vectors and the edges connecting them) will also be three-dimensional.
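
As an illustration of this 3D input space, the sketch below (our construction, not the paper's code) back-projects a Kinect-style depth map into a set of 3D points that can be sampled as input signals for the GNG. The camera intrinsics fx, fy, cx, cy are assumed placeholder values.

```python
import numpy as np

def depth_to_points(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5):
    """Back-project an HxW depth map (metres, 0 = invalid) with the
    pinhole model; returns an N x 3 array of 3D points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x[valid], y[valid], depth[valid]], axis=1)
```

A uniform sampler over these points, e.g. `sample = lambda: pts[np.random.randint(len(pts))]`, then plays the role of the density function $P(\xi)$ when the potential field is binary (object/background).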

C. Point Cloud Sequences Representation

GNG has been adapted to represent Point Cloud Sequences using models learnt from previous data acquisitions. We have introduced several improvements to the network in order to accelerate the representation and allow the architecture to work faster.

The main difference with the original GNG algorithm is the omission of the insertion/deletion actions (steps 8 to 11) after the first frame. Since no neurons are added or deleted, the system keeps the correspondence during the whole sequence, intrinsically solving the feature correspondence problem. For the initial frame t0 the representation is obtained by a complete adaptation of a GNG. For the following frames, the previous network structure is reused: the new representation of the object is obtained by iterating only the internal loop of the GNG learning algorithm, relocating the neurons and creating or removing edges. This adaptive method is also able to meet real-time constraints, because the number λ of times the internal loop is performed can be chosen according to the time available between two successive frames, which depends on the acquisition rate. With this adaptive method, the mean time to obtain a GNG from each capture is about 10 ms.

GNG provides a reduction of the input data while preserving its structure. This has two advantages. First, fewer points have to be processed, which speeds up the subsequent feature extraction step. Second, outliers, one of the main sources of error in this kind of application, are reduced.
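
A sketch of this adaptive step, reusing the GrowingNeuralGas class sketched in section II.A (again our assumption, not the paper's code): after the first frame only the inner loop runs, so no neurons are created or destroyed and neuron indices stay in correspondence across frames.

```python
import numpy as np

def track_frame(gng, points, lam_track=1000):
    """Readapt an already-trained GNG to the point cloud of a new frame.
    lam_track is the number of inner-loop iterations, chosen to fit the
    time budget between two captures."""
    for i in np.random.randint(len(points), size=lam_track):
        gng.adapt(points[i])       # relocate neurons, update edges only
    return np.array(gng.w)         # neuron positions for this frame
```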

D. Motion Representation

Motion can be classified according to the way it is perceived. The common and relative motion of objects can be represented using the graph obtained from the neural network structure for each capture of the sequence.

In the case of common motion, the trajectory followed by an object can be analyzed by tracking its centroid along the sequence. This centroid may be calculated from the positions of the nodes of the graph that represents the object in each capture.

To follow the relative motion, the changes in position of each node with respect to the centroid of the object must be calculated for each capture. By following the path of each node, changes in the morphology of the object can be analyzed and recognized.

One of the most important problems in tracking, the correspondence between features along the sequence, can be solved in an intrinsic way [30], since the positions of the neurons are known at all times without the need for additional processing.

1) Common Motion

To analyze the common motion, it is necessary to follow the centroid of the object, computed from the reference vectors of the neurons that represent it, which defines a single path for the object. In this case, the common motion $M_C$ is the path described by the centroid $c_m$ of the $TPG$ obtained with the structure of the neural network along frames $0$ to $f$:

$$M_C = Tray_{c_m} = \{c_m^{t_0}, \ldots, c_m^{t_f}\}$$

2) Relative Motion

To analyze the relative motion of an object, it is necessary to consider the specific movement of each neuron with respect to a specific point of the object, typically its centroid, which requires a specific track for each of the neurons representing the object. Thus the relative motion $M_R$ is determined by the position changes of the individual neurons with respect to the centroid $c_m$, for each node $i$:

$$M_R = \{Tray_{c_i}, \forall i \in A\}$$

where

$$Tray_{c_i} = \{w_i^{t_0} - c_m^{t_0}, \ldots, w_i^{t_f} - c_m^{t_f}\}$$

and $w_i$ is the reference vector of node $i$, while $c_m$ is the centroid of the graph obtained from the neural network representing the scene along frames $0$ to $f$.

3) Motion Analysis

Motion analysis is performed in a sequence by tracking the individual objects or entities that appear in the scene. The analysis of the trajectory described by each object is used to interpret its movement. In this case, the motion of an object is interpreted through the paths followed by each of the neurons of the GNG, or $TPG$:

$$M = \{Tray_i, \forall i \in A\}$$

where each path is determined by the sequence of positions (reference vectors) of the individual neurons in the map:

$$Tray_i = \{w_i^{t_0}, \ldots, w_i^{t_f}\}$$

In some cases, to address movement recognition, a parameterization of the trajectories is performed; some suggestions for such parameterizations can be found in [31].

Alternatively, direct measures of similarity between paths, such as the modified Hausdorff distance [32], permit the comparison of trajectories and the learning of semantic models of the scenes [33].
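
Under the definitions above, both kinds of trajectory can be read directly off the per-frame neuron positions. A small sketch (the array-shape convention is ours):

```python
import numpy as np

def trajectories(frames):
    """frames: list of (N x 3) neuron-position arrays, one per capture,
    with the same N throughout (no insertions/deletions after frame 0)."""
    W = np.stack(frames)              # f x N x 3: w_i at each frame
    cm = W.mean(axis=1)               # centroid c_m per frame
    common = cm                       # M_C: path of the centroid
    relative = W - cm[:, None, :]     # M_R: w_i - c_m for every neuron i
    return common, relative
```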

4) Hausdorff Distance

Let the distance between two points $a$ and $b$ be the Euclidean distance $d(a, b) = \|a - b\|$. The distance between a point $a$ and a set of points $B = \{b_1, \ldots, b_{N_b}\}$ is defined as $d(a, B) = \min_{b \in B} \|a - b\|$. We consider two measures of direct distance between two sets of points $A = \{a_1, \ldots, a_{N_a}\}$ and $B = \{b_1, \ldots, b_{N_b}\}$:

$$d(A, B) = \max_{a \in A} d(a, B) \quad (13)$$

$$d(A, B) = \frac{1}{N_a} \sum_{a \in A} d(a, B) \quad (14)$$

Direct measurements between sets of points may be combined to obtain an indirect measurement with a high discriminatory power between the sets of points which define the paths:

$$f(d(A, B), d(B, A)) = \max(d(A, B), d(B, A)) \quad (15)$$

Combining equations (13) and (15) yields the well-known Hausdorff distance, and combining equations (14) and (15) gives a variant known as the modified Hausdorff distance, which has more discriminatory power for classification and object recognition. For the comparison of trajectories, both measures obtain similar results for all types of movements. If some characteristics of the objects, such as gestures or movements of the face, are also known, the result can be improved by a normalization or parameterization of the trajectories, followed by a representation of the characteristics considered.
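
The three measures are straightforward to implement. A sketch for trajectories given as point sets (the function names are ours):

```python
import numpy as np

def _directed(A, B, reduce):
    # d(a, B) = min over b of ||a - b|| for every a in A,
    # reduced with max (eq. 13) or mean (eq. 14).
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2).min(axis=1)
    return reduce(d)

def hausdorff(A, B):
    # eqs. (13) + (15): classical Hausdorff distance.
    return max(_directed(A, B, np.max), _directed(B, A, np.max))

def modified_hausdorff(A, B):
    # eqs. (14) + (15): modified Hausdorff distance [32].
    return max(_directed(A, B, np.mean), _directed(B, A, np.mean))
```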

III. 3D GESTURE REPRESENTATION

The previous section described the ability of neural gases to represent and track data streams in several dimensions. To analyze the movement, certain characteristics are tracked in each of the shots of each sequence: not the object itself but its representation obtained with the neural network; that is, the positions of the neurons of the network are used as features.

We propose the construction of a system for the representation, tracking and recognition of hand gestures, based on neural gases and 3D sensors, capable of describing both common and relative motion.

The proposed system consists of the following modules:

- Acquisition: based on the Kinect sensor.
- Segmentation: obtaining a grid from the points closest to the camera by sweeping the depth map, plus an HSV threshold.
- Characterization: using the neural network structure.
- Tracking: using the path described by the neurons of the GNG network as it adapts to the input of each shot.

Figure 2 shows a flowchart of the system.

Fig. 2. Flowchart of the system.

A. Data Acquisition

The whole experimental phase is based on real-world sequences obtained with a Kinect sensor. Such sensors belong to the so-called RGB-D cameras, since they capture RGB images with per-pixel depth information. Specifically, Microsoft's Kinect sensor captures 640x480-pixel images with corresponding depth information, based on an infrared projector combined with a CMOS sensor with a resolution of 320x240 pixels, and can reach rates of up to 30 frames per second.

A first processing step of the sensor data yields the RGB description plus the z-axis component of the coordinates of the points in three-dimensional space.

B. Hands Segmentation

For the segmentation of the hands from the background, a hybrid technique has been used, based on depth information together with appropriate thresholds in the HSV color model. These thresholds were determined by training the system with multiple users under different lighting conditions. This model has been widely used in the literature on the segmentation of faces and hands. Figure 3 shows an example of segmentation.

Fig. 3. Segmentation of the hand from the background.
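
A sketch of such a hybrid depth + HSV segmentation using OpenCV; the depth window and skin-tone thresholds below are illustrative placeholders, not the values trained in the paper.

```python
import cv2
import numpy as np

def segment_hand(bgr, depth, z_near=0.4, z_far=0.8,
                 hsv_lo=(0, 30, 60), hsv_hi=(20, 150, 255)):
    """Return a binary mask of hand pixels from a registered RGB-D pair."""
    # Keep the pixels closest to the camera (depth in metres)...
    depth_mask = ((depth > z_near) & (depth < z_far)).astype(np.uint8) * 255
    # ...and intersect with a skin-tone range in HSV space.
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    color_mask = cv2.inRange(hsv, np.array(hsv_lo), np.array(hsv_hi))
    return cv2.bitwise_and(depth_mask, color_mask)
```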

C. Gesture Characterization with Growing Neural Networks

The growing neural gas presented in Section 2 is used to characterize the hands, obtaining a reduced topological representation in the form of a graph defining an induced Delaunay triangulation of the input data. The experiments section establishes the minimum number of neurons necessary for an adequate representation of the hand, which enables the subsequent tracking and recognition of gestures. Figure 4 shows an example of hand characterization.

Fig. 4. Hand Characterization with GNG.

D. Gesture Recognition Based on Trajectories

The path followed by the neurons can be obtained and interpreted by processing the neuron positions along the sequence. This evolution can be studied at the level of global motion, by following the centroids of the map, or locally, by studying the deformation of the object. This is possible because the system does not restart the map for each shot but readapts the previous map without inserting or deleting neurons. Thus, the neurons are used as stable visual markers that define the 3D topology of the objects in the scene.

IV. EXPERIMENTATION

Several experiments have been performed to validate the system. First, a proper parameterization of the neural network was obtained; then a number of global motion experiments related to a virtual mirror writing system were conducted. Finally, the system learned and labeled various hand poses, performed by several users; subsequently the same gestures, made by other users, were submitted to the system, with a high rate of correct recognition.

All the experiments were carried out using the RGBDemo framework, computer vision software written in C++ that makes it easy to get started with Kinect-like cameras and to develop our own projects.

A. Optimal Parametrization of GNG

This experiment measured the mean squared error of different representations of the hand obtained with varying numbers of neurons. From the graph in Figure 5, it can be seen that with approximately 200 neurons the error is low and the quality of representation is adequate. We chose the minimum number of neurons with acceptable quality, since this reduces the computational cost and allows real-time operation.

The other GNG parameters used were: $\lambda = 2000$, $\epsilon_w = 0.1$, $\epsilon_n = 0.01$, $\alpha = 0.5$, $\beta = 0.0005$, $a_{max} = 250$.

Fig. 5. Representation error based on the number of neurons in the network.
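
Mapped onto the GNG sketch from section II.A, this parameter set would be applied as follows (the class and the sample() callback are our assumptions):

```python
gng = GrowingNeuralGas(dim=3, eps_w=0.1, eps_n=0.01, a_max=250,
                       lam=2000, alpha=0.5, beta=0.0005)
gng.fit(sample, max_neurons=200)   # ~200 neurons gave adequate quality above
```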

B. Virtual Mirror Writing Recognition

This section presents an application of the system to handwriting recognition in a virtual mirror, by tracking the trajectory of the centroid. Figure 6 shows a set of gestures used to test the system.


Fig. 6. Virtual mirror writing gestures examples.

Through a training phase, the paths described by the centroid of the neural network representation of different characters, written virtually by different users, were labeled. Subsequently, a recognition phase was carried out on the symbols written by new users, comparing the paths described with the labeled ones using the Hausdorff distances described in section 2.D.4, with a success rate greater than 90%.
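
This recognition step amounts to nearest-template matching under the modified Hausdorff distance of section 2.D.4. A sketch (the template list is assumed to come from the training phase):

```python
def classify(traj, templates):
    """templates: list of (label, centroid_trajectory) pairs; returns the
    label whose trajectory is closest under the modified Hausdorff distance."""
    return min(templates, key=lambda t: modified_hausdorff(traj, t[1]))[0]
```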

C. Recognition of Hand Poses

This section presents an application of the system to gesture recognition. Figure 7 shows the set of gestures used as input to the system. Once the sequences have been learned by the GNG, Hausdorff distances are used to compare the sets of points that define each path; in this case, the paths followed by all the neurons representing the hand gestures, relative to the centroid of the representation.

In figure 7, the pose on the left defines the start position and the central ones are different final positions for different gestures. The representation on the right shows the trajectories described by the neurons during the realization of the gesture. As in the previous experiment, training/recognition and labeling stages were carried out. The results obtained for a reduced set of gestures are promising, with a success rate of 95%.

Fig. 7. Set of gestures used for the trajectories study.

In this case it is especially important to be able to represent 3D space, as can be seen in figure 8, since some gestures present trajectories along all axes that would be impossible to perceive with a system based only on the x and y axes.

Fig. 8. 3D trajectories evolution during the gesture.

V. CONCLUSIONS

In this paper we have presented a novel architecture to represent and recognize gestures based on neural networks and 3D sensors. It has been shown that neural gases are suitable for the reduced representation of objects in three dimensions through a graph which defines the interconnection of neurons. Moreover, the processing of the position information of these neurons over time allows us to build the hand trajectories and interpret them.

For the recognition of gestures we used the Hausdorff distance to measure the similarity between the sets of points that define the different global and/or local trajectories of our markers (the neurons).

Finally, to validate the system, a framework was developed and used to test several global and local gestures made by different users, obtaining good results in the recognition task. However, noisy images and occlusions with the environment remain two major problems.

As future work, we will improve the system's performance at all stages to achieve a natural interface that allows us to interact with any object manipulation system. Likewise, the acceleration of the whole system on GPUs is contemplated.

REFERENCES

[1] F.K.H. Quek, Unencumbered gestural interaction, IEEE MultiMedia 3 (4) (1996) 36-47.

[2] M. Turk, Gesture recognition, in: K.M. Stanney (Ed.), Handbook of Virtual Environments: Design, Implementation, and Applications, Lawrence Erlbaum Associates, Hillsdale, NJ, (2002), pp. 223-238.

[3] S. Lenman, L. Bretzner, B. Thuresson, Using marking menus to develop command sets for computer vision based hand gesture interfaces, in: NordiCHI '02: Second Nordic Conference on Human-Computer Interaction, ACM Press, New York, NY, USA, (2002), pp. 239-242.

[4] M. Nielsen, M. Störring, T.B. Moeslund, E. Granum, A procedure for developing intuitive and ergonomic gesture interfaces for HCI, in: 5th International Gesture Workshop, (2003), pp. 409-420.

[5] A. Wexelblat, An approach to natural gesture in virtual environments, ACM Transactions on Computer-Human Interaction 2 (3) (1995) 179-200.

[6] F. Quek, D. McNeill, R. Bryll, S. Duncan, X.-F. Ma, C. Kirbas, K.E. McCullough, R. Ansari, Multimodal human discourse: gesture and speech, ACM Transactions on Computer-Human Interaction 9 (3) (2002) 171-193.

[7] R.A. Bolt, Put-that-there: voice and gesture at the graphics interface, in: SIGGRAPH '80: 7th Annual Conference on Computer Graphics and Interactive Techniques, ACM Press, New York, NY, USA, (1980), pp. 262-270.

[8] D.B. Koons, C.J. Sparrell, Iconic: speech and depictive gestures at the human-machine interface, in: CHI '94: Conference Companion on Human Factors in Computing Systems, ACM Press, New York, NY, USA, (1994), pp. 453-454.

[9] M. Billinghurst, Put that where? Voice and gesture at the graphics interface, SIGGRAPH Computer Graphics 32 (4) (1998) 60-63.

[10] D. Bowman, Principles for the design of performance-oriented interaction techniques, in: K.M. Stanney (Ed.), Handbook of Virtual Environments: Design, Implementation, and Applications, Lawrence Erlbaum Associates, Hillsdale, NJ, (2002), pp. 201-207.

[11] J. Gabbard, A taxonomy of usability characteristics in virtual environments, Master's thesis, Department of Computer Science, University of Western Australia, (1997).

[12] V. Buchmann, S. Violich, M. Billinghurst, A. Cockburn, FingARtips: gesture based direct manipulation in augmented reality, in: GRAPHITE '04: 2nd International Conference on Computer Graphics and Interactive Techniques in Australasia and South East Asia, ACM Press, New York, NY, USA, (2004), pp. 212-221.

[13] D.J. Sturman, Whole hand input, Ph.D. thesis, MIT, (1992).

[14] A. Liu, F. Tendick, K. Cleary, C. Kaufmann, A survey of surgical simulation: applications, technology, and education, Presence: Teleoperators and Virtual Environments 12 (6) (2003) 599-614.

[15] VGX, Virtual Glovebox, http://biovis.arc.nasa.gov/vislab/vgx.htm.

[16] D.J. Sturman, D. Zeltzer, A survey of glove-based input, IEEE Computer Graphics and Applications 14 (1) (1994) 30-39.

[17] E. Foxlin, Motion tracking requirements and technologies, in: K.M. Stanney (Ed.), Handbook of Virtual Environments: Design, Implementation, and Applications, Lawrence Erlbaum Associates, Hillsdale, NJ, (2002), pp. 163-210.

[18] M.W. Krueger, T. Gionfriddo, K. Hinrichsen, Videoplace: an artificial reality, in: SIGCHI Conference on Human Factors in Computing Systems, ACM Press, New York, NY, USA, (1985), pp. 35-40.

[19] V.I. Pavlovic, R. Sharma, T.S. Huang, Visual interpretation of hand gestures for human-computer interaction: a review, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (7) (1997) 677-695.

[20] J.L. Crowley, J. Coutaz, F. Bérard, Perceptual user interfaces: things that see, Communications of the ACM 43 (3) (2000) 54.

[21] I. Oikonomidis, N. Kyriazis, A.A. Argyros, Full DOF tracking of a hand interacting with an object by modeling occlusions and physical constraints, in: ICCV, (2011).

[22] I. Oikonomidis, N. Kyriazis, A.A. Argyros, Efficient model-based 3D tracking of hand articulations using Kinect, in: BMVC, (2011).

[23] T.B. Moeslund, A. Hilton, V. Krüger, A survey of advances in vision-based human motion capture and analysis, CVIU 104 (2006) 90-126.

[24] A. Erol, G. Bebis, M. Nicolescu, R.D. Boyle, X. Twombly, Vision-based hand pose estimation: a review, CVIU 108 (2007) 52-73.

[25] B. Fritzke, A growing neural gas network learns topologies, in: Advances in Neural Information Processing Systems 7, G. Tesauro, D.S. Touretzky and T.K. Leen (Eds.), MIT Press, Cambridge, MA, (1995).

[26] T. Martinetz, S.G. Berkovich, K.J. Schulten, "Neural-gas" network for vector quantization and its application to time-series prediction, IEEE Transactions on Neural Networks 4 (4) (1993) 558-569.

[27] B. Fritzke, Growing cell structures - a self-organising network for unsupervised and supervised learning, Technical Report TR-93-026, International Computer Science Institute, Berkeley, California, (1993).

[28] H.-U. Bauer, M. Hermann, T. Villmann, Neural maps and topographic vector quantization, Neural Networks 12 (4-5) (1999) 659-676.

[29] T. Martinetz, K. Schulten, Topology representing networks, Neural Networks 7 (3) (1994) 507-522.

[30] Z. Zhang, Le problème de la mise en correspondance: l'état de l'art [The correspondence problem: the state of the art], Research Report No. 2146, Institut National de Recherche en Informatique et en Automatique, (1993).

[31] C. Cédras, M. Shah, Motion-based recognition: a survey, Image and Vision Computing 13 (2) (1995) 129-155.

[32] M.P. Dubuisson, A.K. Jain, A modified Hausdorff distance for object matching, in: Proceedings of the International Conference on Pattern Recognition, Jerusalem, Israel, (1994), pp. 566-568.

[33] X. Wang, K. Tieu, E. Grimson, Learning semantic scene models by trajectory analysis, MIT CSAIL Technical Report, (2006).