Surveillance and human–computer interaction applications of self-growing models



Applied Soft Computing 11 (2011) 4413–4431


José García-Rodríguez*, Juan Manuel García-Chamizo
Department of Computing Technology, University of Alicante, Ap. 99, E03080 Alicante, Spain
* Corresponding author. Tel.: +34 965902616; fax: +34 965909643. E-mail addresses: [email protected] (J. García-Rodríguez), [email protected] (J.M. García-Chamizo).
doi:10.1016/j.asoc.2011.02.007

Article history: Received 27 April 2010; received in revised form 20 December 2010; accepted 10 February 2011; available online 23 February 2011.

Keywords: Self-growing models; Topology preservation; Growing Neural Gas; Surveillance systems; Human–computer interaction

Abstract

The aim of this work is to build self-growing-based architectures to support visual surveillance and human–computer interaction systems. The objectives include identifying and tracking persons or objects in the scene, and interpreting user gestures for interaction with services, devices and systems implemented in the digital home. The system must address multiple vision tasks of various levels, such as segmentation, representation or characterization, and analysis and tracking of the movement, in order to build a robust representation of the environment and interpret the elements of the scene.

It is also necessary to integrate the vision module into a global system that operates in a complex environment, receiving images from acquisition devices at video frequency, offering results to higher-level systems, and monitoring and taking decisions in real time; it must satisfy a set of requirements such as time constraints, high availability, robustness, high processing speed and re-configurability.

Based on our previous work with neural models to represent objects, in particular the Growing Neural Gas (GNG) model and the study of topology preservation as a function of the parameter selection, we propose to extend the capabilities of this self-growing model to track objects and represent their motion in image sequences under temporal restrictions.

These neural models have various interesting features, such as their ability to readjust to new input patterns without restarting the learning process, their adaptability to represent deformable objects, and even objects that are divided into different parts, and the intrinsic resolution of the feature correspondence problem for sequence analysis and motion tracking. We propose to build an architecture based on the GNG, called GNG-Seq, to represent and analyze the motion in image sequences. Several experiments are presented that demonstrate the validity of the architecture to solve problems of target tracking, motion analysis and human–computer interaction.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Visual tracking is a very active research field in computer vision. The development of visual surveillance processes in dynamic scenes often includes steps for modeling the environment, motion detection, classification of moving objects, tracking, and recognition of the actions performed. Most of the work focuses on applications related to tracking people or vehicles, which have a large number of potential uses, such as controlling access to special areas, identification of people, traffic analysis, anomaly detection and alarm management, or interactive monitoring using multiple cameras [38].

Some visual surveillance systems have marked important milestones. The real-time visual surveillance system W4 [32] uses



a combination of shape analysis and tracking, building appearance models to detect and track groups of people and to monitor their behaviors, even in the presence of occlusion and in outdoor environments. This system uses a single camera and a grayscale sensor. The Pfinder system [76] is used to retrieve a three-dimensional description of a person in a large space. It follows a single person without occlusions in complex scenes, and has been used in various applications. Another single-person tracking system, TI [57], detects moving objects in indoor scenes using motion detection, performs tracking using first-order prediction, and carries out recognition by applying predicates to a behavior graph formed by matching object links in successive frames. This system does not support small movements of objects in the background. The CMU system [46] can monitor activity over a wide area using multiple networked cameras. It can detect and track multiple people and vehicles in complex scenes and monitor their activities for long periods of time.

The Defense Advanced Research Projects Agency (DARPA) funded the project “Visual Surveillance and Monitoring (VSAM)”


[15], whose purpose was the development of technologies for automatic video understanding to help a human operator in the evaluation of complex scenarios, both civilian and military.

The IBM S3 system [70] is based on two components: analysis of the behavior from the automatic processing of video, called SSE (Smart Surveillance System), and a set of services for data processing and content-based search, MILS (Middleware for Large Scale Surveillance), used to build applications for video analysis. SSE performs detection, tracking and classification of objects, labeling them as person, group or vehicle, and generating both immediate alerts and annotated video. In turn, MILS provides an information system that stores the recorded video for queries of different levels of complexity.

Chromate (Crowd Management with Telematic Imaging and Communicative Assistance) [43] was a project funded by the European Union whose main goal was to improve the surveillance of passengers in public transport, allowing the use and integration of video, detection and wireless transmission technologies. This project was followed by BINOCULARS (Pro-active Integrated Systems for Security Management by Technological Institutional and Communication Assistance) [72], which examined the technical, organizational, ethical and sociological implications of surveillance on public transport. BINOCULARS is formed by a network of intelligent input devices that send and receive messages to/from a central server, which coordinates the activity of the devices, stores and retrieves information from a database and provides an interface to a human operator. It is a modular and scalable architecture that can use standard commercial hardware.

ADVISOR [65] was also developed as part of a European project. Its main objective is to assist human operators in the selection, recording and automatic annotation of images relating to events of interest. The stored video can be accessed through queries on its semantic content.

These surveillance systems used public transport facilities, such as airports and subway stations, among others, as test environments.

VIGILANT [31] is a tracking system for the multi-camera monitoring of pedestrians in a parking lot. It automatically generates a database of events in the environment, enriched with low-level information (colour), 2D and 3D trajectories and object classification. It does not provide high-level semantics, but a graphical interface allows the user to perform sophisticated queries on the information it possesses. One important issue is that monitoring is done globally, i.e., the cameras report which object they are following and when it is entering the field of view of another camera, so that the monitoring can continue.

Recognition of actions has been extensively investigated [16,35]. Trajectory analysis is also one of the basic problems in understanding actions [37]. Relevant works on object tracking can also be found in [48,49,69,70,77,78], among others.

Moreover, the majority of visual surveillance systems for scene analysis depend on the use of knowledge about the scenes, where the objects move in a predefined manner [1,6,36,64].

In recent years, work on the analysis of behaviors has been successful thanks to the use of effective and robust techniques for detecting and tracking objects and people, which have allowed interest to focus on higher levels of scene understanding. Moreover, thanks to the proliferation of low-cost vision sensors, embedded processors and efficient wireless networks, a large amount of research has focused on the use of multiple sources of information for the analysis of behavior. This also has potential uses such as human–computer interfaces for virtual reality, gaming, gesture-based control, presence detection, and event intelligence for environmental or human-centric applications such as the detection of falls in the care of older people. Besides the obvious advantages, the use of vision sensor networks brings a number of challenges, such as the management and fusion of data, the operation


of distributed and central detection units, and algorithmic techniques for the efficient extraction of interesting behavior with efficient management of redundant data. In this line, recent research has been presented in the field of surveillance. In the field of detection and monitoring, [44,62,71,74] present interesting works. Databases like VIHASi [61] were also presented for the evaluation of action recognition methods, as well as work on monitoring and surveillance in nursing homes [11].

There are also several works that use self-organizing models for the representation and tracking of objects. Fritzke [26] proposed a variation of the GNG to map non-stationary distributions, which [27] applies to the representation and tracking of people. In [67] the use of self-organized networks for human–machine interaction is suggested. In [9], amendments to self-organizing models for the characterization of movement are proposed.

From the works cited, only Frezza-Buet [27] represents both the local and the global movement; however, that work does not consider time constraints, does not exploit the knowledge of previous frames for segmentation and prediction in subsequent frames, and does not use the structure of the neural network to solve the correspondence problem in motion analysis.

Considering the work in the area and our previous studies on the representation capabilities of self-growing neural models, we propose to design a modular system capable of capturing images from a camera, identifying areas of interest and representing the morphology of the entities in the scene, as well as analyzing the evolution of these entities in time and obtaining semantic knowledge about the actions that occur in the scene. We propose to represent the entities through a flexible model able to characterize their morphological and positional changes along the image sequence. The representation should identify the entities over time, establish the correspondence between the various observations, and allow the description of the behavior of the entities through the interpretation of the dynamics of the representation model.

The time constraints of the problem suggest the need for high-availability systems capable of obtaining representations of acceptable quality in a limited time. Besides, the large amount of data suggests the definition of parallel solutions.

We propose a neural architecture based on the GNG that is able to adapt the topology of the network of neurons to the shape of the entities that appear in the images, and that can represent and characterize objects of interest in the scenes, with the ability to track the objects through a sequence of images in a robust and simple way.

It has been demonstrated that architectures based on Fritzke's Growing Neural Gas (GNG) [25] can meet the temporal restrictions of problems such as object tracking or gesture recognition, processing sequences of images while offering an acceptable quality of representation that is refined very quickly depending on the time available.

With regard to the processing of image sequences, several improvements have been introduced to accelerate the tracking and allow the architecture to work at video frequency. In this proposal, the use of the GNG for the representation of objects in sequences solves the costly problem of matching features over time, using the positions of the neurons in the network. Likewise, the use of simple prediction facilitates the tracking of the neurons and reduces the time needed to readapt the network between frames, without damaging the quality and speed of the system response. Furthermore, it can be ensured that, from the map obtained for the first image, only a fast re-adaptation will be required to locate and track the objects in subsequent images, allowing the system to work at video frequency.

The data stored throughout the sequence in the structure of the neural network about the characteristics of the represented entities, such as position, colour and others, provide information on deformation,


merging, the paths followed by these entities, and other events that may be analyzed and interpreted, giving a semantic description of the behaviors of these entities.

The remainder of the paper is organized as follows: Section 2 provides a description of the learning algorithm of the GNG, with its accelerated version, and presents the concept of topology preservation. In Section 3, an explanation of how self-growing models can be adapted to represent and track 2D objects from image sequences is given. Section 4 demonstrates the application of the model to surveillance and human–computer interaction systems, followed by our major conclusions and further work.

2. Self-growing models under time constraints

From the Neural Gas model [53] and the Growing Cell Structures [24], Fritzke developed the Growing Neural Gas model [25], with no predefined topology of connections between neurons, in which, starting from an initial number of neurons, new ones are added.

This model has been used in applications related to robotics [30,51], data compression [5,29], recognition of gestures [4,23], cluster analysis [47,59,60], biomedicine [17], biology [56] or 3D reconstruction [18,34], among many others.

In previous work [22], the capabilities of the GNG to represent 2D objects through small changes in the algorithm and its parameters were presented, obtaining an adequate representation of the scene in the time available. However, computer vision and image processing tasks have, in many cases, temporal constraints determined by the sampling rate.

2.1. Growing Neural Gas

The Growing Neural Gas (GNG) [25] is an incremental neural model able to learn the topological relations of a given set of input patterns by means of competitive Hebbian learning. Unlike other methods, the incremental character of this model avoids the need to specify the network size in advance. On the contrary, starting from a minimal network size, a growth process takes place in which new neurons are inserted successively, using a particular type of vector quantization [53].

To determine where to insert new neurons, local error measures are gathered during the adaptation process, and each new unit is inserted near the neuron with the highest accumulated error. At each adaptation step, a connection between the winner and the second-nearest neuron is created, as dictated by the competitive Hebbian learning algorithm. This continues until an ending condition is fulfilled. In addition, in the GNG network the learning parameters are constant in time, in contrast to other methods whose learning is based on decaying parameters. In the remainder of this section we describe the Growing Neural Gas algorithm. The network is specified as:

• A set N of nodes (neurons). Each neuron c ∈ N has an associated reference vector w_c ∈ R^d. The reference vectors can be regarded as positions in the input space of their corresponding neurons.

• A set of edges (connections) between pairs of neurons. These connections are not weighted, and their purpose is to define the topological structure. An edge aging scheme is used to remove connections that become invalid due to the motion of the neurons during the adaptation process.

The GNG learning algorithm that approaches the network to the input manifold is as follows:

1. Start with two neurons a and b at random positions w_a and w_b in R^d.


2. Generate a random input pattern ξ according to the data distribution P(ξ) of the input patterns. In our case, since the input space is a 2D image, the input pattern is the (x, y) coordinate of the points belonging to the shape of the object to represent. Typically, for the training of the network we generate 1000–10,000 input patterns, depending on the complexity of the input space.

3. Find the nearest neuron (winner neuron) s_1 and the second nearest, s_2.

4. Increase the age of all the edges emanating from s_1.

5. Add the squared distance between the input signal and the winner neuron to an error counter of s_1:

\Delta error(s_1) = \| w_{s_1} - \xi \|^2    (1)

6. Move the winner neuron s_1 and its topological neighbors (neurons connected to s_1) towards ξ by learning steps ε_w and ε_n, respectively, of the total distance:

\Delta w_{s_1} = \varepsilon_w (\xi - w_{s_1})    (2)

\Delta w_{s_n} = \varepsilon_n (\xi - w_{s_n})    (3)

7. If s_1 and s_2 are connected by an edge, set the age of this edge to 0. If the edge does not exist, create it.

8. Remove the edges whose age is larger than a_max. If this results in isolated neurons (without emanating edges), remove them as well.

9. Every certain number λ of input signals generated, insert a new neuron as follows:
• Determine the neuron q with the maximum accumulated error.
• Insert a new neuron r between q and its neighbor f with the largest accumulated error:

w_r = 0.5 (w_q + w_f)    (4)

• Insert new edges connecting the neuron r with the neurons q and f, removing the old edge between q and f.
• Decrease the error variables of the neurons q and f by multiplying them by a constant α. Initialize the error variable of r with the new value of the error variable of q.

10. Decrease all error variables by multiplying them by a constant β.

11. If the stopping criterion is not yet achieved, go to step 2.

In summary, the adaptation of the network to the input space takes place in step 6. The insertion of connections (step 7) between the two neurons closest to the randomly generated input patterns establishes an induced Delaunay triangulation in the input space. The elimination of connections (step 8) removes the edges that no longer comprise the triangulation. This is done by eliminating the connections between neurons that are no longer close, or that have nearer neurons. Finally, the accumulated error (step 5) allows the identification of those zones of the input space where it is necessary to increase the number of neurons to improve the mapping.
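As an illustration of the data involved, the following is a minimal sketch of steps 3-8 in Python with NumPy. It is our own illustration, not the authors' implementation: the class name and parameter values are assumptions, and the removal of isolated neurons and the insertion step are omitted for brevity.

import numpy as np

class GNG:
    # Minimal GNG state: reference vectors, accumulated errors and aged edges.
    def __init__(self, w_a, w_b):
        self.w = [np.array(w_a, dtype=float), np.array(w_b, dtype=float)]
        self.error = [0.0, 0.0]        # accumulated squared error per neuron
        self.age = {}                  # edge (i, j) with i < j  ->  age

    def adapt_step(self, xi, eps_w=0.1, eps_n=0.005, a_max=50):
        # One adaptation step (steps 3-8) for a single input pattern xi.
        xi = np.asarray(xi, dtype=float)
        dist = [float(np.sum((w - xi) ** 2)) for w in self.w]
        order = np.argsort(dist)
        s1, s2 = int(order[0]), int(order[1])      # step 3: winner and second nearest
        for e in self.age:
            if s1 in e:
                self.age[e] += 1                   # step 4: age edges emanating from s1
        self.error[s1] += dist[s1]                 # step 5: Eq. (1)
        self.w[s1] += eps_w * (xi - self.w[s1])    # step 6: Eq. (2), move the winner
        for (i, j) in self.age:
            if s1 in (i, j):
                n = j if i == s1 else i
                self.w[n] += eps_n * (xi - self.w[n])   # step 6: Eq. (3), move neighbors
        self.age[(min(s1, s2), max(s1, s2))] = 0   # step 7: refresh or create the edge
        self.age = {e: a for e, a in self.age.items() if a <= a_max}  # step 8: prune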

2.1.1. Accelerated Growing Neural Gas
To obtain a complete network, with all its neurons, in a predetermined time, the GNG learning algorithm has to be modified to accelerate its conclusion. The main factor that affects the learning time is the number λ of input patterns generated per iteration (step 2), since new neurons are inserted at smaller intervals, taking less time to complete the network. Another alternative is the insertion of more than one neuron per iteration [12,13], repeating k times step 9 of the learning algorithm. In this accelerated version of the GNG, step 9 is repeated in each iteration, inserting several neurons in those zones where a bigger accumulated error exists, and creating the corresponding connections (Fig. 1).

[Fig. 1. Accelerated GNG learning algorithm: create GNG (1); generate pattern (2); calculate and compare the distances to the neurons (3); modify the age of the edges (4); modify the error of the winner neuron (5); modify the weights (6); create edges (7); delete edges and isolated neurons (8, 8'); insert k neurons (9); modify the error counters (10). The reconfiguration module (steps 2-8) is repeated λ times before each pass of the insertion/deletion module (steps 9-10), and the whole process is repeated until the ending condition is fulfilled (11).]

This modification of the GNG algorithm has interesting applications in problems with temporal restrictions. The choice of the right parameters requires previous experimentation with the problem to be solved [28].
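A sketch of the insertion step repeated k times, reusing the hypothetical GNG class above (again our own illustration, not the authors' code):

def insert_k_neurons(gng, k, alpha=0.5):
    # Step 9 repeated k times: insert neurons where the accumulated error is biggest.
    for _ in range(k):
        q = max(range(len(gng.w)), key=lambda i: gng.error[i])
        nbrs = [j if i == q else i for (i, j) in gng.age if q in (i, j)]
        if not nbrs:
            break
        f = max(nbrs, key=lambda n: gng.error[n])   # neighbor of q with largest error
        gng.w.append(0.5 * (gng.w[q] + gng.w[f]))   # Eq. (4): reference vector of r
        r = len(gng.w) - 1
        gng.error[q] *= alpha
        gng.error[f] *= alpha
        gng.error.append(gng.error[q])              # initialize the error of r
        gng.age.pop((min(q, f), max(q, f)), None)   # remove the old edge q-f
        gng.age[(min(q, r), max(q, r))] = 0         # connect r with q and f
        gng.age[(min(f, r), max(f, r))] = 0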

2.2. Topology preservation

The final result of the self-organizing or competitive learning process is closely related to the concept of Delaunay triangulation. The Voronoi region of a neuron consists of all the points of the input space for which it is the winning neuron. Therefore, as a result of competitive learning, a graph (neural network) is obtained whose vertices are the neurons of the network and whose edges are the connections between them, and which represents the Delaunay triangulation of the input space corresponding to the reference vectors of the neurons of the network.


2.2.1. Topology Preserving Networks
Traditionally, it has been suggested that this triangulation resulting from competitive learning preserves the topology of the input space. However, Martinetz and Schulten [54] introduced a new condition which restricts this quality.

It is proposed that the mapping Φ_w of V onto A preserves the vicinity when vectors that are close in the input space V are mapped to nearby neurons of A.

It is also noted that the inverse mapping

\Phi_w^{-1} : A \to V, \quad c \in A \mapsto w_c \in V    (5)

preserves the neighborhood if nearby neurons of A have associated reference vectors that are close in the input space.

Combining the two definitions, a Topology Preserving Network (TPN) can be defined as a network A whose mappings Φ_w and Φ_w^{-1} preserve the neighborhood.


Fig. 2. (a) Delaunay triangulation and (b) induced Delaunay triangulation.

Thus, self-organizing maps, or Kohonen maps [45], are not TPNs, as has traditionally been considered, since this condition would only hold if the topology of the map and that of the input space coincided. Since the network topology is established a priori, possibly ignoring the topology of the input space, it is not possible to ensure that the mappings Φ_w and Φ_w^{-1} preserve the neighborhood.

The Growing Cell Structures [24] are not TPNs either, since the topology of the network is established a priori (triangles, tetrahedra, etc.). However, they improve the performance compared to Kohonen maps, due to their capacity for the insertion and removal of neurons.

In the case of the neural gases, like Growing Neural Gas and Neural Gas, the mechanism for adjusting the network through competitive learning [52] generates an induced Delaunay triangulation (Fig. 2), a graph obtained from the Delaunay triangulation which retains only the edges of the Delaunay triangulation between points that belong to the input space V. In [54] it is demonstrated that these models are TPNs, and thus can be used effectively in the representation of objects (Fig. 3) [22] and their movement [23].

3. Image sequence processing with self-growing models: GNG-Seq

The ability of neural gases to preserve the topology will be employed in this work for the representation and tracking of objects. Identifying the points of the image that belong to objects allows the network to adapt its structure to this input subspace, obtaining an induced Delaunay triangulation of the object.

Let an object O = [A_G, A_V] be defined by a geometric appearance A_G and a visual appearance A_V. The geometric appearance A_G is given by a morphologic parameter G_M (local deformations) and positional parameters G_P (translation, rotation and scale):

A_G = [G_M, G_P]    (6)

The visual appearance A_V is a set of object characteristics such as colour, texture or brightness, among others.

Fig. 3. Two-dimensional objects represented by a self-organizing network.


3.1. Representation of 2D objects with GNG

In particular, we consider objects in two dimensions. Given a domain of support S ⊆ R², an image intensity function I(x, y) ∈ R such that I : S → [0, I_max], and an object O, its standard potential field Ψ_T(x, y) = f_T(I(x, y)) is the transformation Ψ_T : S → [0, 1] which associates to each point (x, y) ∈ S the degree of compliance with the visual property T of the object O by its associated intensity I(x, y).

Considering:

• The space of input signals as the set of points in the image:

V = S, \quad \xi = (x, y) \in S    (7)

• The probability density function according to the standard potential field obtained for each point of the image:

p(\xi) = p(x, y) = \Psi_T(x, y)    (8)

Learning takes place following the Accelerated GNG algorithm described in Fig. 1. In this way, a representation based on the neural network structure is obtained which preserves the topology of the object O according to a certain feature T; that is, from the visual appearance A_V of the object, an approximation to its geometric appearance A_G is obtained.
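As a sketch of Eqs. (7) and (8), the input patterns can be drawn from the normalized potential field by rejection sampling; this is our illustrative choice, not necessarily the authors' implementation, and psi is assumed to be an H × W array with values in [0, 1].

import numpy as np

def sample_patterns(psi, n_patterns, rng=None):
    # Draw (x, y) patterns with acceptance probability psi(x, y), i.e. p(xi) of Eq. (8).
    rng = rng or np.random.default_rng()
    h, w = psi.shape
    patterns = []
    while len(patterns) < n_patterns:
        x, y = int(rng.integers(w)), int(rng.integers(h))
        if rng.random() < psi[y, x]:
            patterns.append((x, y))
    return patterns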

Henceforth we call Topology Preserving Graph TPG = ⟨A, C⟩ the undirected graph, defined by a set of vertices (neurons) A and a set of edges C that connect them, that preserves the topology of an object from the considered standard potential field.

Fig. 4 shows an overview of the system used to obtain the TPG of an object from a scene. It can be observed that different TPGs can be obtained from different object features T without changing the learning algorithm of the neural gases. Using different Ψ_T(x, y), one can obtain, for example, the representation of objects in two dimensions (Fig. 5, left) or their silhouettes (Fig. 5, right), which produce different structures in the network for each of these standard potential fields.



[Fig. 4. System description to obtain the Topology Preserving Graph of an object: the image and the normalized potential field calculation Ψ_T(x, y) feed the Growing Neural Gas learning, which produces the TPG.]


3.2. Growing Neural Gas for motion analysis: GNG-Seq

To analyze the movement, for each image in a sequence, objects are tracked following the representation obtained with the neural network structure; that is, using the positions, or reference vectors, of the neurons in the network as stable markers to follow. It is necessary to obtain a TPG representation of each instance, position and shape of the object for all the images in the sequence.

One of the most advantageous characteristics of the GNG is that it is not required to restart the learning of the network for each image in the sequence. The neural network structure obtained from previous images can be used as a starting point for the representation of new frames, provided that the sampling rate is sufficiently high. In this way, a prediction based on the historical images plus a small re-adjustment of the network provides a new representation in a very short time (total learning time / N), where the total learning time is that of the complete learning algorithm, which depends on the number of

Fig. 5. Different adaptations of the neural gas to the same object.

[Fig. 6. Image sequence management with GNG-Seq: process the first frame; then, for each new frame, calculate Ψ, calculate the prediction and readjust the map, until the sequence is finished.]

input patterns λ and the number of neurons N. This provides a very high processing speed, of up to 100 frames per second. This model of GNG for the representation and processing of image sequences has been called GNG for sequences, or GNG-Seq.

The process of tracking an object in each image is based on the following schedule:

(1) Calculation of the transformation function to segment the object from the background, based on information from previous images stored in the neural network structure.
(2) Prediction of the new neuron reference vectors.
(3) Re-adjustment of the neuron reference vectors.

Fig. 6 outlines the process used to track objects. It differentiates the processing of the first frame, for which no previous data are available and the complete network needs to be learned, from a second level, which predicts and updates the positions of the neurons depending on the information from previous frames of the sequence stored in the neural network structure. In this way, the objects can be segmented in the new image, their new positions predicted, and the map readjusted based on the information available from previous maps.

The Accelerated GNG algorithm presented in Fig. 1 is used to obtain the representation for the first frame of the image sequence, performing the complete learning process. For the next frames, however, the final positions (reference vectors) of the neurons obtained for the previous frame are used as starting points, and only the internal loop of the general algorithm is executed, with no insertion or deletion of neurons, but with repositioning of neurons and deletion of edges where necessary (Fig. 7).
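The schedule can be summarized with the following sketch, where compute_potential_field, learn_complete_gng and predict_positions are hypothetical placeholders for the procedures of Sections 3.2.1 and 3.2.2 and for the complete learning of Fig. 1; the names and structure are our own illustration, not the authors' code.

def gng_seq(frames, n_patterns=1000):
    # First frame: complete learning. Later frames: prediction + internal loop only.
    gng, maps = None, []
    for t, frame in enumerate(frames):
        psi = compute_potential_field(frame, maps)      # step (1): segmentation
        if t == 0:
            gng = learn_complete_gng(psi, n_patterns)   # accelerated GNG (Fig. 1)
        else:
            predict_positions(gng, maps)                # step (2): motion prediction
            for xi in sample_patterns(psi, n_patterns):
                gng.adapt_step(xi)                      # step (3): readjust, no insert/delete
        maps.append([w.copy() for w in gng.w])          # history used by the next frames
    return gng, maps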

3.2.1. Segmentation
Image segmentation based on colour information is a field studied by many researchers [14,41]. The use of colour to track objects in image sequences is also important [10,50,55,66]. Colour information provides several advantages over geometric information, but it can be inadequate in adverse conditions, such as partial occlusion or changes of scale or resolution [20]. A good segmentation of the image is very important to obtain a correct collection of the entities of interest, and is necessary for the success of the tasks of characterization, analysis and tracking of the movement in the scene.

[Fig. 7. GNG-Seq algorithm for frames > 1: preprocess the frame; obtain a pattern; calculate and compare the distances to the neurons; modify the age of all the edges; modify the weights; create edges; delete edges; modify the error counters; repeat λ times, continuing until the ending condition is fulfilled. No neurons are inserted or deleted.]


The system supports different approaches to segmentation, providing information on the different characteristics of the image to improve the results. In the case of applications for human–machine interaction, information about the skin colour can be stored in the network structure, which can improve segmentation based on a Gaussian mixture model [55] or on the EM (Expectation–Maximization) algorithm. In applications for tracking people, the neuron reference vectors can be used to restrict the search area for persons or objects in motion. In robotics applications, the graph defined by the neural network could be used to restrict the possible trajectories of the moving objects in the scene.

It is necessary to distinguish the segmentation process of the first frame from that of the subsequent ones.

Segmenting the first image of a sequence can be accomplished by simple methods of histogram analysis and the creation of multivariable, multilevel thresholds in the case of colour images, or of simple thresholds in the case of images represented in gray levels. Probabilistic algorithms can also be used, such as EM (Expectation–Maximization), based on modeling with Gaussian distributions the probability that pixels belong to entities of interest. Another possibility is the modeling and subtraction of the background, to obtain the entities in the scene that belong to the foreground.


The segmentation of subsequent images will be heavily influenced by the type of environment that needs to be represented. In scenarios with low resolution, with multiple entities of interest that change, appear and disappear, the best segmentation results will be based on the movement.

If the situation is relatively static and the background is homogeneous, with entities of interest that move and deform but do not appear and disappear, techniques based on histograms and probabilistic methods can be used with good results.

In this paper, both techniques are described in applications of human–machine interaction and surveillance systems with different approaches. For problems involving the representation of changing scenarios, in which multiple people interact, entering and leaving the scene, a segmentation based on the movement was chosen, which restricts the self-growing map to learning a limited input space that has been segmented in each image of the sequence.

[Fig. 8. GNG-Seq algorithm remarking prediction applied to segmentation: initiate the system; acquire the first frame; segment and characterize the entities with GNG; save the map; then, until the end of the sequence, acquire a new frame, obtain patterns based on the map information, segment the image and characterize the entities with GNG.]


Fig. 8 shows the scheme used to distinguish the treatment of the first frame from that of the subsequent ones, for which information about the colour and location of the entities in previous frames is available, stored in the neural network structure.

The general formulation of the technique can be described this way:

\Psi_T(x, y, t) = \Psi_T(I(x, y, t - n), TPG_{t-n})    (9)

where n represents the number of previous frames considered for the prediction of the segmentation and representation of the current image, and t is the current frame. For example, if n = 2, information about frames t − 1 and t − 2 stored in the map structure will be considered to segment frame t.

3.2.1.1. Segmentation with colour information. In applications for human–machine interaction (HCI) that work on devices based on gesture recognition, a segmentation scheme based on histograms that are updated after each frame was chosen.


The proposal is to adaptively merge segmentation and characterization. In the case of images with good resolution, for the first frame a background subtraction scheme should be used to obtain the entities in the scene. The entities are characterized, and information about the colour of the pixels represented by the reference vector of each neuron (the maximum, minimum and average values of the HSI colour model of the pixels in the original image) is stored in the neuron structure. Using these values to define a window around the reference vector coordinates of each neuron, the information stored in each neuron is used to update the thresholds (or Gaussian models), predicting the segmentation and characterization of subsequent frames:

\varphi(x, y, t) = \varphi(\varphi(x, y, t - n), TPG_{t-n})    (10)

In particular, let ϕ be the function that represents the values of the chosen colour model in the vicinity of the coordinates that represent each neuron, where n represents the number of previous frames considered for the prediction of the segmentation and representation of the current image, and t is the current frame.

In this case, the neural network learns patterns generated in the vicinity window of each of the neurons of the map learned for the previous image, rather than random patterns at any position of the image, as is done in the general algorithm, since it is assumed that the entities to characterize in the new image have a restricted movement between frames. Therefore, in this case the input space is the set of points of the image around each neuron reference vector. If we consider the reference vectors (ω_x, ω_y) of the N neurons of the map and a maximum distance d_max delimiting boxes around each neuron reference vector, it is possible to define the values of the input space S′:

S' \subset S, \quad S' = \{ (x, y) \in S \mid d((x, y), (\omega_x, \omega_y)) < d_{max} \}, \quad \xi = (x, y) \in S'    (11)

That is, all the points (x, y) inside a square around any neuron reference vector (ω_x, ω_y) belong to S′. The distance d_max should be chosen considering the acceleration of the objects in the image. This acceleration is calculated considering a motion vector based on the neuron reference vectors at times t − 1 and t − 2 (see Section 3.3.2).

In this way, with each new frame, the threshold values stored in the neurons of the map are updated to segment the next frame. This technique integrates segmentation and characterization. It avoids the need to segment the image completely and takes only the input patterns necessary to learn the map. However, every few frames it is necessary to update the model, to detect and represent new items that might appear in the scene or to exclude others that disappear. The use of a Gaussian Mixture Model (GMM) built with information about the objects belonging to the foreground could be useful to make the segmentation more robust.
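A sketch of the per-neuron colour update under this scheme; the window size and the colour attribute are our own illustrative assumptions, and frame_hsi is assumed to be an H × W × 3 HSI image.

import numpy as np

def update_colour_info(gng, frame_hsi, half_win=8):
    # Store min/max/mean HSI values of the window around each reference vector;
    # these act as the segmentation thresholds for the next frame.
    h, w, _ = frame_hsi.shape
    gng.colour = {}
    for i, wv in enumerate(gng.w):
        x, y = int(round(wv[0])), int(round(wv[1]))
        x0, x1 = max(x - half_win, 0), min(x + half_win + 1, w)
        y0, y1 = max(y - half_win, 0), min(y + half_win + 1, h)
        patch = frame_hsi[y0:y1, x0:x1].reshape(-1, 3)
        gng.colour[i] = (patch.min(axis=0), patch.max(axis=0), patch.mean(axis=0))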

Fig. 9 shows an example of the segmentation of images from a sequence of a hand gesture, which uses the information stored in the neurons of the maps learned on previous frames to update the segmentation thresholds and to restrict the search window where objects can appear in the images.

3.2.1.2. Segmentation with motion information. In the case of applications where colour cannot be used to perform the segmentation, like surveillance applications with low-resolution images (Fig. 10), the objects of interest can be segmented by different inter-frame estimation techniques and background modeling. In this case, information from the neuron reference vectors can be used to define a rectangle around the objects of interest that restricts the area where objects may appear in subsequent frames. Thus, it is only necessary to segment inside these boxes, and every few frames it is necessary to update the model, to detect and represent new items that might appear in the scene or to exclude others that disappear.


Fig. 9. Segmentation with colour information stored in the map structure. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


Likewise, information on the position of the entities within the images is used to locate them and perform a selective segmentation: because of the limited speed and acceleration of the objects, it is only necessary to analyze the areas of the image near the neuron reference vectors of the previous map; that is, the minimum bounding box that contains the TPG, whose coordinates are calculated as:

x_{min} = x_{min}(TPG_{t-i}) - \varepsilon
y_{min} = y_{min}(TPG_{t-i}) - \varepsilon
x_{max} = x_{max}(TPG_{t-i}) + \varepsilon
y_{max} = y_{max}(TPG_{t-i}) + \varepsilon
\quad i, t \in [0, n] \wedge (i \le t)    (12)

The expression indicates that the box with top-left corner (x_min, y_min) and bottom-right corner (x_max, y_max) that locates the entity for tracking or segmentation is calculated from the maximum and minimum coordinates of the nodes that form the TPG mapping the entity, taken over the i previous frames, plus a value ε which represents the displacement of the entity due to the speed and acceleration of its movement. For example, if i = 2, information about frames t − 1 and t − 2 stored in the map structure will be considered to segment frame t and also to calculate ε.
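Eq. (12) reduces to a coordinate-wise minimum and maximum over the node positions of the previous maps, enlarged by the margin ε; a minimal sketch (our own illustration):

import numpy as np

def tpg_bounding_box(prev_maps, eps):
    # Search window of Eq. (12) from the reference vectors of frames t-1, t-2, ...
    pts = np.vstack(prev_maps)
    x_min, y_min = pts.min(axis=0) - eps
    x_max, y_max = pts.max(axis=0) + eps
    return x_min, y_min, x_max, y_max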

Fig. 10. Segmentation with information stored in the maps structure.


3.2.2. Prediction and correction
To accelerate the process of re-adjusting the neural network structure to the representation of new frames, the knowledge of the positions of the nodes of the graph created in the learning of previous frames can be used to predict the new positions of the neurons before applying the learning algorithm. In this way, the prediction brings the network close to its final position, using less learning time.

Depending on the information available on the neuron reference vectors of previous frames, predictive mechanisms may be implemented based on the velocity and acceleration of the objects, or on Kalman filters (linear dynamical systems) [8,75] and particle filters (nonlinear) [3,42,63], among others.

We employ a simple approach based on the neuron trajectories, computing a motion vector from the positions of the neurons at times t − 1 and t − 2. In this way, the velocity and acceleration of the objects, or parts of the objects, being tracked are obtained.
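This amounts to a second-order extrapolation per neuron; a sketch under our assumption that maps stores one (N, 2) array of reference vectors per frame:

import numpy as np

def predict_positions(maps):
    # Velocity from the maps at t-1 and t-2; acceleration uses t-3 when available.
    w1, w2 = np.asarray(maps[-1]), np.asarray(maps[-2])
    v = w1 - w2
    a = w1 - 2 * w2 + np.asarray(maps[-3]) if len(maps) >= 3 else 0.0
    return w1 + v + 0.5 * a   # starting estimate, corrected by the readjustment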



Fig. 11. Single object tracking.



Fig. 12. Complex object tracking.

Prediction allows the system to segment the object in the new image, to predict its new position and to readjust the map based on the information from previous maps. As the speed and acceleration of the different nodes of the graph are known, the new position of the TPG can be predicted before iterating the internal loop. For noisy systems it would be necessary to use techniques such as Kalman filters for linear systems with Gaussian noise, and particle filters for systems with non-linear dynamics or non-Gaussian noise. As one of the objectives of this work is the design of a system operating under time constraints, the use of sophisticated filters to predict the position of each neuron was not considered, due to their high computational cost. Brémond and Thonnat [7] or Haritaoglu et al. [32] use a similar approach.

The prediction system works properly for image sequences representing objects with constant acceleration and with little or no noise. After each frame, once the prediction has been applied, the information in the neural network structure is updated with the positions of the neuron reference vectors obtained after applying the GNG algorithm.

The example in Fig. 11 presents 3 frames from a sequence that represents a circular object in motion with constant acceleration. Red neurons represent the map in the previous frame, green neurons represent the prediction based on the 3 previous frames, and yellow represents the correction to the final position. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 12 presents the initial and final frames of a sequence that represents a local motion of the hand. As in the previous sequence, the position of the neurons in the previous frame is shown in red, the prediction in green, and the final position in yellow. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


To test the effect of the prediction in terms of placement and quantization error, experiments were performed with real sequences from the CAVIAR database [21], which are presented in Figs. 23 and 24. The test was performed on a short sequence of 250 frames, using versions of the GNG with k (number of neurons inserted per iteration) = 1 and λ (input signals) = 1000 and 10,000, in both cases with and without prediction.

As shown in Fig. 13, the results are better in the cases in which the prediction is applied, especially when a small value of λ is chosen. For larger values of λ there is little improvement, since for high values the learning time is enough to properly position the neurons to characterize the entities without prior prediction.

The placement error indicates the distance between the positions of the neurons in consecutive frames. By applying the prediction, the map is closer to its final position. As in the case of the quantization error, the results of the versions that use prediction are better, even for versions with higher values of λ (Fig. 14).

[Fig. 13. Quantization error (error vs. frames) for CAVIAR sequences, for λ = 1000 and λ = 10,000, with and without prediction.]


[Fig. 14. Placement error (error vs. frames) for CAVIAR sequences, for λ = 1000 and λ = 10,000, with and without prediction.]

The graphs in Fig. 15 show the difference in the quantization error with respect to the variant without prediction, as a function of the number of input signals per iteration; that is, the improvement obtained using the prediction with respect to the versions without it. The prediction is especially important when few input signals are available, and therefore there is less time to update the map between frames. The valleys of error in the graph coincide with periods in which speed and acceleration are stable, and there the improvement from using prediction is large. However, when entities stop (peaks), major errors occur until the system balances. In the event of a large number of input signals, the GNG has enough time and signals to adapt the neuron reference vectors even without prediction; in this case, the use of prediction may even be negative.

As an example of the importance of implementing a prediction system, Fig. 16 shows the adaptation of the map to a new frame. At the top is the starting frame. On the bottom left, prediction was applied and the evolution is good; in the right image, however, prediction was not applied and the adjustment is incorrect.

3.3. Representation of motion

Motion can be classified according to its perception. Common and relative motion can be represented with the graph obtained from the neural network for every frame of the image sequence.

In the case of motion tracking in common mode, the analysis of the trajectory followed by an object can be done by following its centroid throughout the sequence. This centroid can be calculated from the positions of the nodes of the graph that represents the object in each image.

To track the movement in relative mode, the changes in the position of each node with respect to the centroid of the object should be calculated for each frame. By following the trajectory of each node, the changes in the morphology of the object can be analyzed and recognized.

[Fig. 15. Quantization error difference with respect to the version without prediction (error vs. frames) for CAVIAR sequences, for λ = 1000 and λ = 10,000.]


One of the most important problems of tracking objects, the correspondence between features along the frames of the sequence, can be intrinsically solved [79], since the position of the neurons is known at any time, without requiring any additional processing.

3.3.1. Common motion
To analyze the common motion, it suffices to follow the centroid of the object, computed from the centroid of the neuron reference vectors that represent it, defining a single trajectory for the object. Fig. 18 shows the path followed in the sequence of Fig. 17. Examples can also be found in Figs. 23 and 24.

In this case, the common movement M_C is regarded as the trajectory described by the centroid c_m of the TPG obtained with the neural network structure along the frames 0 to f:

M_C = Tray_{c_m} = \{ c_{m_{t_0}}, \ldots, c_{m_{t_f}} \}    (13)

3.3.2. Relative motion
To analyze the relative movement of an object, the specific motion of the individual neurons with respect to a particular point of the object, usually its centroid, is considered; this requires specific tracking of each of the trajectories of the neurons that map the object (Fig. 19).

Therefore, the relative motion M_R is determined by the position changes of the individual neurons with respect to the centroid c_m, for every node i:

M_R = [Tray_{cm_i}], \quad \forall i \in A    (14)

where

Tray_{cm_i} = \{ w_{i_{t_0}} - c_{m_{t_0}}, \ldots, w_{i_{t_f}} - c_{m_{t_f}} \}    (15)

where w_i is the reference vector of node i and c_m is the centroid of the graph obtained from the neural network that represents the image along the frames 0 to f.
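Since GNG-Seq neither inserts nor deletes neurons after the first frame, the neuron indices are stable along the sequence, and Eqs. (13)-(15) reduce to simple array operations; a sketch, assuming maps is a list of (N, 2) arrays of reference vectors, one per frame:

import numpy as np

def common_motion(maps):
    # Eq. (13): trajectory of the centroid c_m of the TPG along the sequence.
    return [np.mean(m, axis=0) for m in maps]

def relative_motion(maps):
    # Eqs. (14)-(15): row i of each element is w_i(t) - c_m(t).
    return [np.asarray(m) - np.mean(m, axis=0) for m in maps]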

3.4. Motion analysis

The analysis of motion in a sequence is done by tracking the individual objects or entities that appear in the scene. The analysis of the trajectory described by each object is used to interpret its movement.

In this case, the motion of an object is interpreted through the trajectories followed by each of the neurons of the TPG:

M = [Tray_i], \quad \forall i \in A    (16)

where the trajectory is determined by the succession of positions (reference vectors) of the individual neurons throughout the map:

Tray_i = \{ w_{i_{t_0}}, \ldots, w_{i_{t_f}} \}    (17)

In some cases, to address the recognition of the movement, a parameterization of the trajectories is performed; some proposals for parameterization can be found in [10].

Direct measures of similarity between trajectories are also used, such as the modified Hausdorff distance [19] for the comparison of trajectories and the learning of semantic scene models [73].

3.4.1. Hausdorff distance
Let the distance between two points a and b be defined as the Euclidean distance d(a, b) = ||a − b||. The distance between a point a and a set of points B = {b_1, ..., b_{N_b}} is defined as d(a, B) = min_{b∈B} ||a − b||. There exist two different ways to calculate the direct distance between two sets of points A = {a_1, ..., a_{N_a}} and B = {b_1, ..., b_{N_b}}.


Fig. 16. Examples of adaptation with and without prediction.

Consider now two direct distance measures between sets of points:

d(A, B) = \max_{a \in A} d(a, B)    (18)

\bar{d}(A, B) = \frac{1}{N_a} \sum_{a \in A} d(a, B)    (19)

Direct measures between sets of points can be combined to obtain an indirect measure with a high discriminatory power between sets of points that define paths:

f(d(A, B), d(B, A)) = \max(d(A, B), d(B, A))    (20)
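A sketch of Eqs. (18)-(20) for two trajectories stored as arrays of points (our own illustration):

import numpy as np

def directed(A, B):
    # d(a, B) for every point a of A, with Euclidean point-to-set distances.
    B = np.asarray(B)
    return np.array([np.min(np.linalg.norm(B - a, axis=1)) for a in np.asarray(A)])

def hausdorff(A, B):
    # Eqs. (18) and (20): classical Hausdorff distance.
    return max(directed(A, B).max(), directed(B, A).max())

def modified_hausdorff(A, B):
    # Eqs. (19) and (20): modified Hausdorff distance.
    return max(directed(A, B).mean(), directed(B, A).mean())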

Combining Eqs. (18) and (20), the well-known Hausdorff distance is obtained, while combining Eqs. (19) and (20) gives a variant of it, called the modified Hausdorff distance, which has more discriminatory power for classification and object recognition. For the comparison of trajectories, both measures give similar results for all types of movements. Also, if some specific features of the objects are known, such as gestures or movements of the face, the results can be improved with a normalization or a prior parameterization of the trajectories followed by the representation of the characteristics considered.

4. Surveillance and human–computer interaction applications

In order to validate our proposal, some applications of GNG-Seq to the tracking of multiple objects in visual surveillance and to gesture recognition for human–machine interaction systems are presented.

4.1. Tracking multiple objects in visual surveillance systems

There are several studies on the labeling and tracking of multiple objects, some of them based on the trajectory [33] or on the current state [2,40]. Sullivan and Carlsson [68] explore the way in which the objects interact. There is also an important field of study on related problems such as occlusion [39,58].


The technique used for tracking multiple objects is based on the GNG-Seq, whose fast algorithm separates the different objects present in the image. Once the objects in the image are separated, it is possible to identify the groups of neurons that map each of them and to follow them separately. These groups are identified and labeled in order to use them as a reference and keep the correspondence between frames.
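Identifying these groups amounts to finding the connected components of the TPG; a union-find sketch over the edge set used in the earlier illustrations (again an assumption of ours, not the authors' code):

def label_groups(n_neurons, edges):
    # Label the connected components of the TPG; each component is one object.
    parent = list(range(n_neurons))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    for i, j in edges:
        parent[find(i)] = find(j)           # merge the endpoints of every edge
    return [find(i) for i in range(n_neurons)]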

The system has several advantages compared to other tracking systems:

• The graph obtained from the neural network structure permits the representation of both local and global movement.

• The information stored in the structure of the neural network along the sequence permits the representation of motion and the analysis of entity behavior, based on the trajectories followed by the neural network nodes.

• The feature correspondence problem is solved by the structure of the neural network itself.

• Real-time processing of motion is supported, since the accelerated version of the neural network obtains fast representations depending on the time available for the first frame. Moreover, from the second frame on, the speed of the algorithm increases, since no neurons are added or deleted and the previous structure can be used as a starting point to process subsequent frames.

However, some drawbacks should be considered:

• The quality of the representation depends heavily on the robustness of the segmentation results.

• The management of prediction and occlusion is very simple, and erratic motion or complicated interactions can affect the performance of the system.

4.1.1. Merger and division

The ability of Growing Neural Gas to break up in order to map all of the input space is especially useful for objects that are divided. The network will eliminate unnecessary edges so that objects are represented independently by groups of neurons.


Fig. 17. Frames from different sequences.

If the input space is unified again, the network adapts to these changes by introducing new edges that reflect homogeneous input spaces. In all cases the neurons remain, without adding or deleting any, so that objects or persons that come together and split into groups can be identified and even tracked separately or together.

This last feature is a great advantage of the representation model and gives the system great versatility in tracking entities or groups of entities in video sequences.

The merging of entities is represented as the union of the Topology Preserving Graphs that mapped the entities; that is, the necessary edges will be introduced to convert the isolated groups of neurons into only one big group:

GPT_1 ∪ GPT_2 ∪ . . . ∪ GPT_n ⇒ GPT_G    (21)

In the case of the division of entities, the map that represents the group splits into different clusters. Contrary to the merging process, edges among neurons will be deleted to create a number of clusters that represent the different entities in the scene:

GPT_G ⇒ (GPT_1, GPT_2, . . . , GPT_n)    (22)
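Since GNG-Seq adds or removes only edges between frames, never neurons, a merge or a division can be signalled simply by a change in the number of graph components. A minimal C++ sketch of this classification follows (illustrative names; the component counts would come from a cluster-labeling pass such as the one sketched in Section 4.1):

// Illustrative sketch of the merge/split behaviour of Eqs. (21)-(22).
#include <iostream>

enum class GroupEvent { None, Merge, Split };

// Compare the component counts of two consecutive frames.
GroupEvent classify(int clustersPrev, int clustersNow) {
    if (clustersNow < clustersPrev) return GroupEvent::Merge;  // GPT_1 ∪ ... ⇒ GPT_G
    if (clustersNow > clustersPrev) return GroupEvent::Split;  // GPT_G ⇒ (GPT_1, ...)
    return GroupEvent::None;
}

int main() {
    std::cout << (classify(2, 1) == GroupEvent::Merge) << '\n';  // 1: two people meet
    std::cout << (classify(1, 3) == GroupEvent::Split) << '\n';  // 1: a group separates
}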

4.1.2. Occlusions

The modeling of individual objects during tracking does not consider the interaction between multiple objects or the interactions of these objects with the background, for instance partial or total occlusion among different objects.

The way in which occlusions are handled in this work is to discard the image if the object is completely concealed by the background. In each image, once an object is characterized and segmented, the pixels belonging to each object are counted. Frames are discarded if the percentage of pixels lost with respect to the average value calculated for the previous frames is very high, and the consideration of frames is resumed when the rate becomes acceptable again; a sketch of this rule is given below. In the case of partial occlusion with the background or between objects, the network would be expected to adapt to the new transitory form, since information from previous frames is available in the neural network structure.
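A minimal C++ sketch of this discard rule follows (our illustration; the class name OcclusionGate and the 50% loss threshold are assumptions, not values given in the paper):

// Illustrative sketch of the frame-discard rule for total occlusions.
#include <iostream>

class OcclusionGate {
    double avgPixels = 0.0;   // running average of object pixel counts
    int    frames    = 0;
    double maxLoss;           // e.g. 0.5 = discard if >50% of the pixels are lost
public:
    explicit OcclusionGate(double maxLossRatio) : maxLoss(maxLossRatio) {}

    // Returns true if the frame should be processed, false if discarded.
    bool accept(int objectPixels) {
        if (frames > 0) {
            double loss = 1.0 - objectPixels / avgPixels;
            if (loss > maxLoss) return false;   // object mostly concealed: skip frame
        }
        // Discarded frames do not update the running average.
        avgPixels = (avgPixels * frames + objectPixels) / (frames + 1);
        ++frames;
        return true;
    }
};

int main() {
    OcclusionGate gate(0.5);
    for (int px : {900, 880, 300, 870})      // third frame: heavy occlusion
        std::cout << gate.accept(px) << ' '; // prints: 1 1 0 1
    std::cout << '\n';
}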

4.1.3. Experimentation

To demonstrate the capability of the model to track multiple objects, two examples are presented: firstly, a synthetic example with two circles moving in different directions with constant acceleration (Fig. 20), and secondly, part of a sequence from the CAVIAR database (Context Aware Vision using Image-based Active Recognition) (Fisher [21]) has been used as input to the system.

Fig. 21 shows the global movement on the left and the complete movement on the right: the paths followed by the centroids of each object representation in the first case, and the trajectories followed by the neurons that represent the objects in the second.


Fig. 18. Trajectories followed by the centroid.

Fig. 19. Trajectories followed by all the neurons.

The CAVIAR database has been chosen to test the system with a more realistic sequence. Fig. 22 presents an example in which two people walk together and separate in a lobby. This example demonstrates the ability of the system to represent simple objects, as well as its versatility in adapting to different divisions or mergers of objects.

Figs. 23 and 24 show the first frame in the top row, a middle frame in the central row and the last frame in the bottom row of the example sequence, with the original image in the left column, the segmented image and the application of the network onto the image in the central column, and the trajectories described by the objects in the right one.

In the frames of the example in Fig. 23, two people can be observed who start walking from distant points in the same direction, meet at a point and then walk together. The map starts with two clusters and then merges into a single one. In Fig. 24, a group of people walk together and after a few meters split into groups. At first they are mapped by a single cluster, but when they split, the map that represents them also splits into different parts.

Fig. 20. Representation of the second and fifth frames of a sequence with two objects in movement.

Fig. 21. Representation of trajectories followed by centroids (left) and all the neurons (right).

The system represents people with different clusters while they walk separately, merging them into a single cluster when they meet. This feature can be used in video-surveillance systems. The definition of the path followed by the entities that appear in the image, based on the paths followed by the neurons that map each entity, allows us to study the behavior of those entities over time and to give semantic content to the evolution of the scene. With this representation it is possible to identify individuals who have been part of a group or have evolved separately, since neurons are not deleted or added and the neurons involved in the representation of each entity remain stable over time. Different scenarios are stored in a database and can be analyzed through measures that compare sets of points, such as the Hausdorff and extended or modified Hausdorff distances, which are widely used in the next section for gesture comparison.

Fig. 22. Sequence of two people walking in the lobby of the INRIA labs.

The fact that entities are represented by several neurons allows the study of their deformations (local motion) and the interpretation of simple actions undertaken by such entities.

All of the above characteristics make the model for representing and analyzing motion in image sequences very powerful. The image sequences have more than 1000 frames with a size of 250 × 200 pixels, and the processing speed obtained is higher than 90 frames per second on average. The first frame takes longer to process, since the complete learning algorithm must be used; for subsequent frames, however, the speed is higher than 100 frames per second. The system has been tested on a Pentium IV 2.4 GHz processor, and the C++ Builder environment has been used to program the algorithms. Video acquisition time is not considered, since this factor is highly dependent on the camera and communication bus employed.

Examples of the paper's experiments are available at www.dtic.ua.es/~jgarcia/experiments.

4.2. Human–computer interaction applications: gesture recognition systems

In this section, the model is applied to a gesture recognition system. Fig. 25 shows the set of gestures used as input to the system. Once the image sequences are learned and represented by GNG-Seq, Hausdorff distances are used for the comparison of the sets of points that define a path, in this case the paths followed by all of the neuron reference vectors that represent the different hand gestures.
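As an illustration of this comparison, the following C++ sketch (our assumption, not the paper's code) assigns a query trajectory set to the stored gesture template with the smallest modified Hausdorff distance:

// Illustrative sketch: nearest-template gesture classification by
// modified Hausdorff distance, as used for Tables 1 and 2.
#include <algorithm>
#include <cmath>
#include <iostream>
#include <limits>
#include <string>
#include <vector>

struct Point { double x, y; };

static double dmin(const Point& a, const std::vector<Point>& B) {
    double best = std::numeric_limits<double>::max();
    for (const auto& b : B) best = std::min(best, std::hypot(a.x - b.x, a.y - b.y));
    return best;
}
static double dmean(const std::vector<Point>& A, const std::vector<Point>& B) {
    double s = 0.0;
    for (const auto& a : A) s += dmin(a, B);
    return s / static_cast<double>(A.size());
}
double modifiedHausdorff(const std::vector<Point>& A, const std::vector<Point>& B) {
    return std::max(dmean(A, B), dmean(B, A));   // Eqs. (19)-(20)
}

struct GestureTemplate { std::string name; std::vector<Point> traj; };

// Choose the stored gesture whose trajectory set is closest to the query.
std::string classifyGesture(const std::vector<Point>& query,
                            const std::vector<GestureTemplate>& templates) {
    std::string best = "unknown";
    double bestDist = std::numeric_limits<double>::max();
    for (const auto& t : templates) {
        double d = modifiedHausdorff(query, t.traj);
        if (d < bestDist) { bestDist = d; best = t.name; }
    }
    return best;
}

int main() {
    std::vector<GestureTemplate> db = {
        {"G1", {{0, 0}, {1, 0}, {2, 0}}},     // horizontal stroke
        {"G2", {{0, 0}, {0, 1}, {0, 2}}}};    // vertical stroke
    std::cout << classifyGesture({{0.1, 0}, {1.1, 0}, {2, 0.1}}, db) << '\n';  // G1
}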

Gesture I1 defines the beginning for the sequences ending with gestures G1–G8, and I2 defines the starting gesture for the sequences ending with gestures G9 and G10.

Tables 1 and 2 show the Hausdorff distance and its extended (modified) version. Every gesture has been compared with 20 versions of the gesture made by different people, and only the worst case, with the greatest distance, is shown.

Both measures failed on the same gestures, but the success rate remains high in both cases at 80%, even though the gestures were not previously normalized. These measures provide good results in scenarios in which the motion is stable, for example surveillance cameras that monitor traffic on a highway, or cameras in a garage or in corridors. However, they are affected when used with manually operated cameras, as in the example given, since the same gesture can be captured in different ways by different users and lead to misidentification if a prior normalization is not performed.

A database of gestures with a simple background has been used to compare results with previous systems. However, segmentation results of images with complex backgrounds do not affect the results, as can be observed in Fig. 9.


Fig. 23. Trajectories for scene 1 CAVIAR DB.

Fig. 24. Trajectories for scene 2 CAVIAR DB.


Fig. 25. Set of gestures to study trajectories.

Table 1
Hausdorff distance for gestures in Fig. 25.

      G1    G2    G3    G4    G5    G6    G7    G8    G9   G10
G1    445
G2   1397   786
G3   1243   859   534
G4   1182   907   977   521
G5    806  1068   894   694   463
G6   2122  1522  1765  1549  1654  1932
G7   1059   859   858  1015   988   842   786
G8   3015  5984  3323  3488  2966  5898  4573  3780
G9    943  1110   982  1488  1285  1058  1461  1151   636
G10   690  1174   739  1065   680  1157  1422   983   949   517

Table 2
Modified Hausdorff distance for gestures in Fig. 25.

        G1      G2      G3      G4      G5      G6      G7      G8      G9     G10
G1     7.15
G2    14.86    8.33
G3    23.62   15.15    5.21
G4     9.68   15.45   15.45    6.16
G5    17.21   20.54   23.52   13.15    5.37
G6    55.29   37.30   45.54   40.62   43.23   50.08
G7    12.30   12.10   12.02   12.01    9.82   11.75    7.10
G8    96.41  152.43  113.0    92.32   85.17   16.45  105.1   74.35
G9    12.35   12.18   13.80   19.84   20.23   14.91   17.97   20.75    4.58
G10   11.44   14.37   14.73   18.42   13.42   18.86   15.74   15.46    8.17  10.05


5. Conclusions

In this work a system based on the GNG neural network, capable of representing motion under time constraints, has been presented.

The proposed system incorporates mechanisms for prediction, based on information stored within the network structure about the characteristics of objects such as colour, shape or situation, to anticipate certain operations such as the segmentation and positioning of objects in subsequent frames. This provides a more rapid adaptation to the objects in the image, restricting the areas of search and anticipating the new positions of objects.

By processing information on the positions of the map neurons (reference vectors) along time, it is possible to construct the paths followed by the represented objects and to interpret them. This evolution can be studied from the global movement, using the centroids of the map, or from the local movement, by studying the deformations of the objects based on the changes of the neural network structure over time. This is possible because the system does not restart the map at every frame; it only readjusts the network structure starting from the previous map, without inserting or deleting neurons. In this way the neurons are used as markers that define the stable form of objects.

Through the implementation of surveillance applications, the capabilities of the system for tracking and motion analysis have been demonstrated. The system automatically handles the mergers and divisions among entities that appear in the images and can detect and interpret the actions that are performed in video sequences. The GNG-Seq, based on the Accelerated GNG with mechanisms of prediction in the segmentation and positioning system, enables us to manage image sequences at video frequency.

Acknowledgement

This work was partially supported by the University of Alicante project GRE09-16.

References

[1] E. Andre, G. Herzog, T. Rist, On the simultaneous interpretation of real world image sequences and their natural language description: the system SOCCER, in: Proceedings of the European Conference on Artificial Intelligence, Munich, 1988, pp. 449–454.

[2] A.A. Argyros, M.I.-A. Lourakis, Real-time tracking of multiple skin-coloured objects with a possibly moving camera, in: Proceedings of the European Conference on Computer Vision, 2004, pp. 368–379.

[3] M.S. Arulampalam, S. Maskell, N. Gordon, T. Clapp, A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking, IEEE Transactions on Signal Processing 50 (2) (2002) 174–188.

[4] H.J. Boehme, A. Brakensiek, U.D. Braumann, M. Krabbes, A. Corradini, H.M. Gross, Neural networks for gesture-based remote control of a mobile robot, in: Proceedings of the IEEE World Congress on Computational Intelligence, vol. 1, Anchorage, 1998, pp. 372–377.

[5] L. Bougrain, F. Alexandre, Unsupervised connectionist clustering algorithms for a better supervised prediction: application to a radio communication problem, in: Proceedings of the International Joint Conference on Neural Networks, vol. 28, Washington, 1999, pp. 381–391.

[6] M. Brand, V. Kettnaker, Discovery and segmentation of activities in video, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (8) (2000) 844–851.

[7] F. Brémond, M. Thonnat, Tracking multiple nonrigid objects in video sequences, IEEE Transactions on Circuits and Systems for Video Technology 8 (5) (1998) 585–591.

[8] E. Brookner, Tracking and Kalman Filtering Made Easy, John Wiley & Sons, 1998.

[9] X. Cao, P.N. Suganthan, Video shot motion characterization based on hierarchical overlapped growing neural gas networks, Multimedia Systems 9 (2003) 378–385.

[10] C. Cédras, M. Shah, Motion-based recognition: a survey, Image and Vision Computing 13 (2) (1995) 129–155.

[11] D. Chen, M.Y. Chen, H.D. Wactlar, C. Gao, A. Bharucha, Video measurement of resident-on-resident physical aggression in nursing homes, in: ACM Workshop on Vision Networks for Behavior Analysis, 2008, pp. 61–68.


[12] G. Cheng, A. Zell, Externally growing cell structures for pattern classification, in: Proceedings of the ICSC Symposia on Neural Computation, Berlin, 2000, pp. 233–239.

[13] G. Cheng, A. Zell, Double growing neural gas for disease diagnosis, in: Proceedings of the Artificial Neural Networks in Medicine and Biology Conference (ANNIMAB-1), 2000, pp. 309–314.

[14] H.D. Cheng, X.H. Jiang, Y. Sun, J. Wang, Colour image segmentation: advances and prospects, Pattern Recognition 34 (2000) 2259–2281.

[15] T. Collins, A.J. Lipton, T. Kanade, A system for video surveillance and monitoring: VSAM final report, Technical Report CMU-RI-TR-00-12, Robotics Institute, Carnegie Mellon University, 2000.

[16] T. Collins, A.J. Lipton, T. Kanade, Introduction to the special section on video surveillance, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000) 745–746.

[17] Z. Cselényi, Mapping the dimensionality, density and topology of data: the growing adaptive neural gas, Computer Methods and Programs in Biomedicine 78 (2005) 141–156.

[18] R.L. do Rêgo, A.F. Araújo, F.B. de Lima Neto, Growing self-organising maps for surface reconstruction from unstructured point clouds, in: Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), 2007, pp. 1900–1905.

[19] M.P. Dubuisson, A.K. Jain, A modified Hausdorff distance for object matching, in: Proceedings of the International Conference on Pattern Recognition, Jerusalem, Israel, 1994, pp. 566–568.

[20] M.D. Fairchild, Colour Appearance Models, Addison Wesley, 1998.

[21] R.B. Fisher, PETS04 surveillance ground truth data set, in: Proceedings of the Sixth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, 2004, pp. 1–5.

[22] F. Flórez, J.M. García, J. García, A. Hernández, Representation of 2D objects with a topology preserving network, in: Proceedings of the 2nd International Workshop on Pattern Recognition in Information Systems, Alicante, 2002, pp. 267–276.

[23] F. Flórez, J.M. García, J. García, A. Hernández, Hand gesture recognition following the dynamics of a topology-preserving network, in: Proceedings of the 5th IEEE International Conference on Automatic Face and Gesture Recognition, Washington, DC, 2002, pp. 318–323.

[24] B. Fritzke, Growing cell structures – a self-organising network for unsupervised and supervised learning, Technical Report TR-93-026, International Computer Science Institute, Berkeley, California, 1993.

[25] B. Fritzke, A growing neural gas network learns topologies, in: G. Tesauro, D.S. Touretzky, T.K. Leen (Eds.), Advances in Neural Information Processing Systems, vol. 7, MIT Press, Cambridge, MA, 1995.

[26] B. Fritzke, A self-organizing network that can follow non-stationary distributions, in: Proceedings of the 7th International Conference on Artificial Neural Networks, 1997, pp. 613–618.

[27] H. Frezza-Buet, Following non-stationary distributions by controlling the vector quantization accuracy of a growing neural gas network, Neurocomputing 71 (7–9) (2008) 1191–1202.

[28] J. García-Rodríguez, F. Flórez-Revuelta, J.M. García-Chamizo, Growing neural gas for vision tasks with time restrictions, in: Proceedings of the International Conference on Artificial Neural Networks, vol. 2, Athens, Greece, 2006, pp. 578–586.

[29] J. García-Rodríguez, F. Flórez-Revuelta, J.M. García-Chamizo, Image compression using growing neural gas, in: Proceedings of the International Joint Conference on Artificial Neural Networks, Orlando, USA, 2007, pp. 366–370.

[30] J. García-Rodríguez, F. Flórez-Revuelta, J.M. García-Chamizo, Learning topologic maps with growing neural gas, Lecture Notes in Artificial Intelligence 4693 (2) (2007) 468–476.

[31] D. Greenhill, P. Remagnino, G.A. Jones, VIGILANT: content-querying of video surveillance streams, in: Video Based Surveillance Systems – Computer Vision and Distributed Processing, Kluwer Academic Publishers, 2002, pp. 193–204.

[32] H. Haritaoglu, D. Harwood, L.S. Davis, W4: Who? When? Where? What? A real time system for detecting and tracking people, in: Proceedings of the International Conference on Automatic Face and Gesture Recognition, 1998, pp. 222–227.

[33] M. Han, W. Xu, H. Tao, Y. Gong, An algorithm for multiple object trajectory tracking, in: Proceedings of the IEEE Computer Vision and Pattern Recognition Conference, 2004.

[34] Y. Holdstein, A. Fischer, Three-dimensional surface reconstruction using Meshing Growing Neural Gas (MGNG), Visual Computation 24 (2008) 295–302.

[35] R.J. Howarth, H. Buxton, Conceptual descriptions from monitoring and watching image sequences, Image and Vision Computing 18 (9) (2000) 105–135.

[36] R.J. Howarth, B. Hilary, An analogical representation of space and time, Image and Vision Computing 10 (7) (1992) 467–478.

[37] W. Hu, D. Xie, T.A. Tan, A hierarchical self-organizing approach for learning the patterns of motion trajectories, IEEE Transactions on Neural Networks 15 (1) (2004) 135–144.

[38] W. Hu, T. Tan, L. Wang, S. Maybank, A survey on visual surveillance of object motion behaviors, IEEE Transactions on Systems, Man and Cybernetics 34 (3) (2004) 334–352.

[39] Y. Huang, I. Essa, Tracking multiple objects through occlusions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2005.

[40] C. Hue, J.C. Le Cadre, P. Pérez, Tracking multiple objects with particle filtering, IEEE Transactions on Aerospace and Electronic Systems 38 (3) (2002) 791–812.

[41] S. Ji, W. Park, Image segmentation of colour image based on region coherency, in: Proceedings of the International Conference on Image Processing, 1998, pp. 80–83.




[42] S.J. Julier, J.K. Uhlmann, Unscented filtering and nonlinear estimation, Proceedings of the IEEE 92 (3) (2004) 401–422.

[43] L. Khoudour, J.P. Deparis, J.L. Bruyelle, F. Caestaing, D. Aubert, S. Bouchafa, S.A. Velastin, M.A. Vincencio-Silva, M. Wherett, Project CROMATICA, ICIAP (2) (1997) 757–764.

[44] D. Kim, D. Kim, A novel fitting algorithm using the ICP and the particle filters for robust 3D human body motion tracking, in: ACM Workshop on Vision Networks for Behavior Analysis, 2008, pp. 69–76.

[45] T. Kohonen, Self-organising Maps, Springer-Verlag, Berlin/Heidelberg, 2001.

[46] A.J. Lipton, H. Fujioshi, R.S. Patil, Moving target classification and tracking from real-time video, in: Proceedings of the IEEE Workshop on Applications of Computer Vision, 1998, pp. 8–14.

[47] P.J. Lisboa, B. Edisbury, A. Vellido, Business Applications of Neural Networks: The State-of-the-Art of Real-world Applications, World Scientific, 2000.

[48] J.G. Lou, H. Yang, W.M. Hu, T.N. Tan, An illumination invariant change detection algorithm, in: Asian Conference on Computer Vision, 2002, pp. 13–18.

[49] J.G. Lou, H. Yang, W.M. Hu, T.N. Tan, Visual vehicle tracking using an improved EKF, in: Asian Conference on Computer Vision, 2002, pp. 296–301.

[50] L. Lucchese, S.K. Mitra, Colour image segmentation: a state-of-the-art survey, Proceedings of the Indian National Science Academy 67 (2) (2001) 207–221.

[51] S. Marsland, U. Nehmzow, J. Shapiro, A real-time novelty detector for a mobile robot, in: EUREL Advanced Robotics Conference, Salford, 2000.

[52] T. Martinetz, Competitive Hebbian learning rule forms perfectly topology preserving maps, in: Proceedings of ICANN, 1993.

[53] T. Martinetz, S.G. Berkovich, K.J. Schulten, "Neural-Gas" network for vector quantization and its application to time-series prediction, IEEE Transactions on Neural Networks 4 (4) (1993) 558–569.

[54] T. Martinetz, K. Schulten, Topology representing networks, Neural Networks 7 (3) (1994) 507–522.

[55] S. McKenna, Y. Raja, S. Gong, Colour model selection and adaptation in dynamic scenes, in: European Conference on Computer Vision, 1998.

[56] T. Ogura, K. Iwasaki, C. Sato, Topology representing network enables highly accurate classification of protein images taken by cryo electron-microscope without masking, Journal of Structural Biology 143 (2003) 185–200.

[57] T. Olson, F. Brill, Moving object detection and event recognition algorithms for smart cameras, in: Proceedings of the DARPA Image Understanding Workshop, 1997, pp. 159–175.

[58] V. Papadourakis, A. Argyros, Multiple objects tracking in the presence of long-term occlusions, Computer Vision and Image Understanding (2010), doi:10.1016/j.cviu.2010.02.003.

[59] A.K. Qin, P.N. Suganthan, Robust growing neural gas algorithm with application in cluster analysis, Neural Networks 17 (2004) 1135–1148.

[60] C. Rehtanz, C. Leder, Stability assessment of electric power systems using growing neural gas and self-organising maps, in: Proceedings of ESANN, 2000, pp. 401–406.

[61] H. Ragheb, S.A. Velastin, P. Remagnino, T. Ellis, ViHASi: virtual human action silhouette data for the performance evaluation of silhouette-based action recognition methods, in: ACM Workshop on Vision Networks for Behavior Analysis, 2008, pp. 77–84.

[62] S. Rao, N.C. Pramod, C.K. Paturu, People detection in image and video data, in: ACM Workshop on Vision Networks for Behavior Analysis, 2008, pp. 85–92.

[63] B. Ristic, S. Arulampalam, N. Gordon, Beyond the Kalman Filter: Particle Filters for Tracking Applications, Artech House, 2004.

[64] K. Schaefer, M. Haag, W. Theilmann, H. Nagel, Integration of image sequence evaluation and fuzzy metric temporal logic programming, in: C. Habel, G. Brewka, B. Nebel (Eds.), KI-97: Advances in Artificial Intelligence, Lecture Notes in Computer Science, vol. 1303, Springer, New York, 1997, pp. 301–312.

[65] N.T. Siebel, Design and implementation of people tracking algorithms for visual surveillance applications, PhD thesis, Department of Computer Science, The University of Reading, Reading, UK, 2003.

[66] M. Sonka, V. Hlavac, R. Boyle, Image Processing, Analysis and Machine Vision, 2nd edition, Brooks/Cole Publishing Company, 1998.

[67] E. Stergiopoulou, N. Papamarkos, A. Atsalakis, Hand gesture recognition via a new self-organized neural network, CIARP, LNCS 3773, 2005, pp. 891–904.

[68] J. Sullivan, S. Carlsson, Tracking and labelling of interacting multiple targets, in: Proceedings of the European Conference on Computer Vision, Springer, LNCS 3953, no. 3, 2006, pp. 619–632.

[69] Y. Tian, T.N. Tan, H.Z. Sun, A novel robust algorithm for real-time object tracking, Chinese Journal of Automation 28 (5) (2002) 851–853.

[70] Y. Tian, L.M.G. Brown, A. Hampapur, M. Lu, A.W. Senior, C. Shu, IBM smart surveillance system (S3): event based video surveillance system with an open and extensible framework, Machine Vision and Applications 19 (5–6) (2008) 315–327.

[71] D. Tweed, J.M. Ferryman, Enhancing change detection in low-quality surveillance footage using Markov random fields, in: ACM Workshop on Vision Networks for Behaviour Analysis, 2008, pp. 23–30.

[72] S.A. Velastin, B.A. Boghossian, B.P.L. Lo, J. Sun, M.A. Vicencio-Silva, PRISMATICA: toward ambient intelligence in public transport environments, IEEE Transactions on Systems, Man, and Cybernetics, Part A 35 (1) (2005) 164–182.

[73] X. Wang, K. Tieu, E. Grimson, Learning semantic scene models by trajectory analysis, MIT CSAIL Technical Report, 2006.

[74] J. Wang, Y. Makihara, Y. Yagi, People tracking and segmentation using spatiotemporal shape constraints, in: ACM Workshop on Vision Networks for Behavior Analysis, 2008, pp. 31–38.

[75] G. Welch, G. Bishop, An introduction to the Kalman filter, ACM SIGGRAPH, Course 8, 2001. Available from: http://www.cs.unc.edu/~welch/kalman/.

[76] C. Wren, A. Azarbayejani, T. Darell, A. Pentland, Pfinder: real-time tracking of the human body, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (7) (1997) 780–785.

[77] Y. Wu, Q. Liu, T.S. Huang, An adaptive self-organising colour segmentation algorithm with application to robust real-time human hand localization, in: Proceedings of the IEEE Asian Conference on Computer Vision, Taiwan, 2000, pp. 1106–1111.

[78] H. Yang, J.G. Lou, H.Z. Sun, W.M. Hu, T.N. Tan, Efficient and robust vehicle localization, in: IEEE International Conference on Image Processing, 2001, pp. 355–358.

[79] Z. Zhang, Le problème de la mise en correspondance: l'état de l'art (The correspondence problem: the state of the art), Rapport de recherche no. 2146, Institut National de Recherche en Informatique et en Automatique, 1993.

José García Rodríguez received the BSc, MSc and PhD in computer science from the University of Alicante (Spain) in 1994, 1996 and 2009, respectively. He is currently associate professor in the Department of Computer Technology of the University of Alicante. His research interest areas are image understanding, computer vision, pattern recognition, neural networks and man–machine interfaces.

Juan Manuel García Chamizo received the MSc in physics from the University of Granada (Spain) in 1980 and the PhD from the University of Alicante (Spain) in 1994. He is currently professor in the Department of Computer Technology of the University of Alicante and head of the Industrial Informatics and Computer Nets research group. His research interest areas are computer vision, neural networks, industrial informatics and biomedicine.