

Particle filtering with multiple and heterogeneous cameras

Rafael Muñoz-Salinas, R. Medina-Carnicer, F.J. Madrid-Cuevas, A. Carmona-Poyato

Department of Computing and Numerical Analysis, University of Córdoba, 14071 Córdoba, Spain

Article info

Article history:

Received 3 December 2008

Received in revised form 11 December 2009

Accepted 21 January 2010

Keywords:

People tracking

Stereo vision

Particle filters

Multiple-views

doi:10.1016/j.patcog.2010.01.015

This work has been carried out with the support of the Research Projects DPI2006-02608 and TIN2007-66367 funded by the Spanish Ministry of Science and Technology and FEDER.

Corresponding author. E-mail address: [email protected] (R. Muñoz-Salinas).

Abstract

This work proposes a novel particle filter for tracking multiple people using multiple and heterogeneous cameras, namely monocular and stereo cameras. Our approach is to define confidence models and observation models for each type of camera. Particles are evaluated independently in each camera, and then the data are fused in accordance with the confidence. Confidence models take into account several sources of information. On the one hand, they consider occlusion information from an occlusion map calculated using a depth-ordered particle evaluation. On the other hand, the relative precision of sensors is considered so that the contribution of a sensor in the final data fusion step is proportional to its precision. We have defined confidence and observation models for monocular and stereo cameras and have designed tests to validate our proposal. The experiments show that our method is able to operate with each type individually and in combination. Two other remarkable properties of our method are that it is highly parallelizable and that it does not impose restrictions on the cameras' positions or orientations.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

Multicamera people tracking has become an attractive research field in the last 10 years. The use of multiple cameras allows solutions to the occlusion problem and coverage of wider areas than can be covered by a single camera. Most of the work on multicamera people tracking has been developed using monocular cameras. Nevertheless, recent works have shown that stereo cameras provide better tracking results [26–28] than monocular cameras do. Therefore, people tracking using multiple stereo cameras seems to be a promising tracking alternative. However, the popularization of stereo tracking systems is mainly deterred by the fact that most of the current surveillance systems are already based on monocular cameras. Therefore, the technology transition should be smooth and allow for the reuse of existing hardware.

Particle filtering is now the most popular visual tracking approach [6,13,20,36]. It has gained popularity in the vision community because of its ability to deal with non-linear and non-Gaussian systems. The most common data fusion approach consists of obtaining independent likelihoods from each sensor and multiplying them. However, this approach has several limitations. First, in real scenarios one might need to use a set of heterogeneous sensors, each one with a different level of reliability. However, in such a case, fusing their observations by considering all the sensors as equally reliable is a suboptimal strategy, since it does not make the best use of each sensor. Second, in many cases, the reliability of a sensor might change dynamically according to the target state. For instance, a camera should be considered more reliable if it observes the target perfectly than if it observes the target partially occluded.

This work proposes a novel particle filter for fusing information from a set of heterogeneous sensors, assuming that they might produce measures with different levels of confidence. A sensor confidence model is defined for each type of sensor, indicating its reliability in observing a particle. All observations of a particle are then fused while taking confidence into account, so that sensors with low confidence levels are assigned a lower relevance in the final data fusion step.

The filter is employed to provide a novel solution to the multicamera people tracking problem, which is able to operate with either monocular cameras, stereo cameras or a combination of both types of cameras. For that purpose, we define observation and confidence models for each type of camera. The observation model of a monocular camera evaluates foreground and color information, whereas the stereo camera observation model also evaluates depth information. The confidence models of both cameras take occlusion into account in order to compute their reliability dynamically. Occlusion is computed using a depth-ordered particle evaluation scheme that permits a parallelized implementation of the algorithm.

The rest of the paper is organized as follows. First, Section 2 explains the most relevant related work. Section 3 provides a brief overview of particle filtering and stereo computation. Section 4 presents the proposed particle filtering algorithm, whereas Section 5 explains the observation and confidence models proposed for monocular and stereo cameras. Finally, Section 6 shows the experiments carried out and Section 7 draws some conclusions.

2. Related work

The multicamera people tracking problem has been addressed from multiple perspectives, but most often using monocular cameras. A common approach used by many authors is to intersect the medial axes of the people blobs on the ground plane using the floor homography [2,16,17,19]. The main problem is that the majority of the people's silhouettes need to be visible, and in most cases also the people's feet. Therefore, cameras must be placed at elevated positions and relatively far from the people. Although this restriction may be feasible in outdoor scenarios, it might be difficult in indoor scenarios where the areas to be covered are small and the cameras must be placed closer to the people. A solution to the tracking problem in these scenarios is proposed in Ref. [9] by Fleuret et al. These authors present a tracking approach using multiple monocular cameras placed at eye level. The monitored area is discretized into cells to create a probabilistic occupancy map, and an iterative process is run at each frame in order to determine the locations of the people in the map. The authors claim that the computing time is improved by the use of integral images, but at the expense of imposing restrictions on the camera position and orientation, i.e., the cameras must be placed in such a manner as to prevent people from appearing inclined in the images. In [30], the authors describe a distributed self-configurable tracking system using multiple monocular cameras with and without overlap. In order to allow efficient camera collaboration, the Kalman–Consensus filter is employed so that each camera comes to a consensus with its neighboring cameras about the actual state of the target. Wu et al. [34] proposed a method specially designed for tracking a large number of close objects from multiple cameras. Their proposal focuses on the data association problem, for which they propose the use of a greedy randomized adaptive search algorithm.

Although promising results have been obtained using multiple monocular cameras, some studies have shown that stereo vision provides better results for object tracking. The authors of this work have proposed several approaches for people detection and tracking using a single stereo camera. While Ref. [28] proposes a tracking approach combining color and stereo extracted directly from the camera image, Refs. [26,27] propose the use of plan-view maps to represent stereo information more efficiently. In all the tests performed, stereo information has improved the tracking results. However, using a single stereo camera still imposes strong limitations on the extent of the monitored area. In [25], Mittal and Davis present a probabilistic approach for tracking people in cluttered scenes using multiple monocular cameras. They employed a fine camera calibration in order to obtain depth information from camera pairs projected onto plan-view maps. In [21], Krumm et al. show a people tracking system for a smart room using a pair of stereo cameras with a short baseline. People are detected by grouping 3D blobs extracted from the stereo information. The main drawback of their approach is the difficulty of reliably segmenting the 3D blobs. In [35], Zhao et al. propose a tracking system with multiple short-baseline stereo cameras. Each camera performs an independent tracking procedure and then reports its results to a centralized tracker.

Important deterrents to the popularization of stereo tracking systems are that most of the surveillance systems installed nowadays employ monocular cameras, and that tracking approaches have not been proposed that can employ the existing equipment in combination with stereo cameras. Our aim is to provide a framework for operating simultaneously with both types of sensors so that current tracking systems could be improved by simply adding stereo cameras.

Particle filters are probably one of the most important tracking frameworks in the vision community. However, the most frequently employed fusion approach consists of considering the observations from all sensors jointly, without considering their relative reliability. In other words, they generally treat all kinds of sensors as equally reliable. However, some researchers have proposed alternative fusion approaches with particle filters. In [33] the authors propose an adaptive particle filter for tracking a single object using two monocular cameras. Each particle is assigned two likelihoods, one for each camera. Their fusion strategy consists of using a joint likelihood if the target is visible in both views, and the maximum likelihood in case it is only visible in one camera. The scope of their work is very restricted in the number of views and targets that can be employed. In addition, their approach is only valid for tracking in the camera image plane, and cannot be employed to determine the 3D location of the target. In [8], Du and Piater propose a method for tracking people in multiple monocular cameras using a separate particle filter in each camera. At each time step, each filter resamples and evaluates particles using the ground plane homography and the people's medial axes. Then, data fusion is performed using a Markov random field, and belief propagation is employed to reduce the computational effort. In Ref. [23] the authors proposed a multi-view-based cooperative tracking approach based on the homographic relation between different monocular views. Their approach is similar to [8] except for the explicit management of occlusion. They apply two hidden Markov processes (a tracking and an occlusion process) for each target in each view. Based on the occlusion process, the cooperative tracking process reallocates resources among the different trackers in each view. These three works are interesting contributions to the multisensor tracking problem using particle filters. Nevertheless, all three assume identical sensors (with the same precision and reliability), thus limiting their extension to more complex scenarios where heterogeneous sensors are available. A solution to that problem is given in [14]. Han et al. propose a kernel-based Bayesian filter in which more particles are assigned to the most reliable sensors. The sensors' reliability is dynamically determined based on the observed likelihood, i.e., sensors providing higher likelihoods are considered more reliable. Finally, independent posteriors are combined as a mixture of Gaussian kernels. One drawback is that, in practice, their approach does not really fuse information. Instead, the tracking is primarily biased towards the observations of the sensor with the highest likelihood, which attracts most of the particles. Besides, while the sensor with the highest likelihood might be the one that observes the target best, it is not necessarily the one with the highest precision. Therefore, their approach does not consider the relative confidence levels of the different sensors employed.

2.1. Proposed contribution

As can be seen, most of the sensor fusion approaches used with particle filters fail to implement the idea of combining information from heterogeneous sensors so as to assign more relevance to the most reliable ones. This work aims to fill that gap by proposing a tracking framework that offers two main contributions. First, we propose a novel multisensor multitarget particle filter that takes into account the sensors' confidence levels. In our approach, the higher the confidence of a sensor, the more influence it has in the tracking process. Confidence takes several sources of information into account. First, it considers occlusion information, using an occlusion map that is calculated by sorting particles according to their distance to the camera. Therefore, a sensor that is not able to see a particle because it is occluded is considered to have a low level of confidence. In contrast to Ref. [14], we detect occlusions based on the physical restrictions of the sensors employed instead of detecting them based on the posteriors (which might be noisy). Second, each type of sensor has a parameter that indicates its reliability compared to the rest of the sensors. Thus, the contributions of more reliable sensors are given higher weights in the final data fusion step. As the second contribution, this paper provides a solution to the people tracking problem using monocular and stereo cameras simultaneously. Our proposal is able to operate with either monocular cameras, stereo cameras or a combination of both types. For that purpose, we define monocular and stereo observation models as well as confidence models to fuse their observations. Two remarkable properties of the proposed method are that it is highly parallelizable and that it does not impose restrictions on the cameras' positions or orientations.

3. Introductory concepts

This section provides an introduction to two basic concepts employed in this work. First, we briefly explain the basics of stereo computation; for an updated review of the state of the art, the interested reader is referred to [3]. Then, the basics of particle filtering are explained.

3.1. Stereo processing

The minimal possible stereo system is composed of a pair of cameras whose optical centers are separated by a distance b. A point P = (X, Y, Z) in space projects to two locations, p = (x, y) and p′ = (x′, y), on the same scan line in both camera images. The displacement of the projection in one image with respect to the other is called the disparity, and the set of all disparities between two images is the so-called disparity map. Disparities can only be computed for points that are seen in both images, but this is not always possible because of lack of texture or occlusions. The points whose disparity cannot be calculated are called unmatched points. In this work, we have employed a stereo correlation algorithm based on the sum of absolute differences (SAD).

Knowing the intrinsic parameters of the stereo system, it is possible to reconstruct the three-dimensional structure of the disparity map. The depth of a given point can be calculated by triangulation as

Z = f_l b / d,   (1)

where f_l is the focal length of the cameras and d is the disparity calculated as d = x − x′. It is also possible to compute the other two components (X and Y) of the three-dimensional position of P as

X = x Z / f_l,   Y = −y Z / f_l.   (2)
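To make the triangulation above concrete, the following sketch reconstructs the 3D coordinates of a matched pixel from its disparity, assuming an idealized rectified pair. The names (focal_px, baseline_m) and the sample values are illustrative, not taken from the paper.

```python
import numpy as np

def reconstruct_point(x, y, disparity, focal_px, baseline_m):
    """Triangulate a pixel (x, y) with a given disparity (Eqs. (1)-(2)).

    x, y        : pixel coordinates relative to the principal point
    disparity   : d = x - x' (pixels); must be > 0 to be valid
    focal_px    : focal length expressed in pixels
    baseline_m  : distance between the optical centers (meters)
    """
    if disparity <= 0:
        return None                            # unmatched point: no depth available
    Z = focal_px * baseline_m / disparity      # Eq. (1)
    X = x * Z / focal_px                       # Eq. (2)
    Y = -y * Z / focal_px                      # Eq. (2); image y-axis points down
    return np.array([X, Y, Z])

# Example: a pixel 40 px right of the principal point with 8 px of disparity,
# seen by a stereo pair with a 12 cm baseline and a 500 px focal length.
print(reconstruct_point(40, -25, 8, focal_px=500, baseline_m=0.12))
```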

3.2. Particle filtering

Target tracking consists of estimating the posterior distribution p(x_t | z_{1:t}) of a system's state x_t conditioned on a sequence of observations z_{1:t}. In order to make the problem computationally tractable, it is assumed that the process is Markovian, i.e., the current system state is conditioned only on the previous state. So, p(x_t | z_{1:t}) can be computed based on the current observation z_t and the previous estimate p(x_{t−1} | z_{1:t−1}). For that purpose, a prediction-and-update application of Bayes' law is employed:

p(x_t | z_{1:t−1}) = ∫ p(x_t | x_{t−1}) p(x_{t−1} | z_{1:t−1}) dx_{t−1},   (3)

p(x_t | z_{1:t}) ∝ p(z_t | x_t) p(x_t | z_{1:t−1}).   (4)

In the above equations, p(x_t | x_{t−1}) represents the system evolution and p(z_t | x_t) the measurement process.

Several approaches have been proposed to solve these equations. However, particle filters are especially interesting because they can deal naturally with systems where both the posterior density and the observation density are non-Gaussian. They are also able to manage multiple hypotheses simultaneously. These algorithms approximate the filtering distribution by a weighted set of particles {(s_i, π_i)}. The importance weights π_i are approximations of the posterior p(x_t | z_{1:t}) such that Σ_i π_i = 1. The CONDENSATION algorithm [13] is probably the most prevalent such algorithm in the computer vision community.

Despite the many advantages of particle filters, they present several problems when applied to multi-target tracking. First, the standard particle filter does not define a method for identifying individual modes (or targets). Second, the particles of a standard particle filter quickly collapse to a single mode, discarding the others (usually referred to as the coalescence or hijacking problem) [4,31]. Third, a joint particle filter suffers from exponential complexity as the number of targets grows [32]. Fourth, particle filters do not define a method to fuse information from multiple heterogeneous sensors. Recent approaches have overcome the first three problems through the use of multiple particle filters (MPFs) [7,18,29,31], i.e., employing an independent particle filter for each target. In that case, interaction factors, which modify the weights of particles, are employed to avoid the coalescence problem.
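For reference, the snippet below sketches one predict-update-resample iteration of a generic bootstrap (CONDENSATION-style) particle filter on a 2D state, as a minimal illustration of Eqs. (3)-(4). It is not the multi-camera filter proposed later; the Gaussian motion and observation models are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, z, motion_sigma, obs_sigma):
    """One iteration of a bootstrap particle filter.

    particles : (N, 2) array of state hypotheses (e.g., ground-plane positions)
    weights   : (N,) normalized importance weights
    z         : observed 2D position for this time step
    """
    n = len(particles)
    # Resample according to the previous weights.
    idx = rng.choice(n, size=n, p=weights)
    particles = particles[idx]
    # Predict: propagate through the motion model p(x_t | x_{t-1}) (Eq. (3)).
    particles = particles + rng.normal(0.0, motion_sigma, particles.shape)
    # Update: weight by the observation likelihood p(z_t | x_t) (Eq. (4)).
    d2 = np.sum((particles - z) ** 2, axis=1)
    weights = np.exp(-d2 / (2.0 * obs_sigma ** 2))
    weights /= weights.sum()
    return particles, weights

particles = rng.uniform(0.0, 5.0, size=(200, 2))
weights = np.full(200, 1.0 / 200)
particles, weights = particle_filter_step(particles, weights,
                                          z=np.array([2.0, 3.0]),
                                          motion_sigma=0.1, obs_sigma=0.3)
print(np.average(particles, axis=0, weights=weights))   # posterior mean estimate
```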

4. Multiple-view particle filtering algorithm

This section explains the algorithm proposed for tracking people using multiple cameras. The goal of our people tracking problem is to estimate the ground plane positions

X(t) = {X(t)_1, ..., X(t)_{n_t}}   (9)

of a set of people in the area of analysis. Let n_t represent the number of people being tracked at time t and X(t)_p the position in the ground plane of the p-th person. Let us assume that there is a set of V heterogeneous cameras sharing a common reference system (obtained by calibration), thus making it possible to know the projection of a three-dimensional point in each of the cameras. It is important to note that a fine camera calibration is not required, i.e., epipolar lines are not employed. Let us also assume that people are mostly seen in a standing position and that there is a people detector mechanism that indicates the positions of the people entering the area under surveillance in an initial time step.

By

P(t) = {P_p(t) | p = 1...n_t}   (10)

let us denote the set of trackers employed at time t, where P_p(t) represents the information that each tracker keeps about its target:

P_p(t) = {S_p(t), A_p(t)}.   (11)


The parameter S_p(t) represents the particle set employed for estimating the target's location, whereas A_p(t) represents the information on the person's appearance.

The particle set employed is defined as

S_p(t) = {(s_{p,i}(t), π_{p,i}(t), θ^v_{p,i}(t)) | v = 1...V, i = 1...N_p(t)}.   (12)

Each particle s_{p,i}(t) has an associated particle weight π_{p,i}(t) that represents the likelihood obtained after fusing information from all the cameras. Independent view weights are also employed to represent the information obtained by each camera. The view weights

θ^v_{p,i}(t) = {π^v_{p,i}(t), c^v_{p,i}(t)}   (13)

are comprised of two elements: π^v_{p,i}(t), representing the likelihood value observed from the v-th camera, and the confidence value c^v_{p,i}(t) ∈ [0,1], representing how reliable that camera's observation is. In other words, π^v_{p,i}(t) represents how likely the person is to be at the particle location according to the observation obtained from camera v, and c^v_{p,i}(t) represents the reliability of the observation. The definition of c^v_{p,i}(t) might be different from one application to another, and should be based on the sensors employed. In any case, c^v_{p,i}(t) = 1 when the sensor is considered to be completely reliable for the observation in question, and it tends to 0 as the sensor's reliability decreases. In our case, both monocular and stereo cameras are employed, and their confidence models are explained later in Section 5.3.

The view weights are fused to calculate the final particle weight as

π_{p,i}(t) ∝ (1/V) Σ_v π^v_{p,i}(t) c^v_{p,i}(t).   (14)

The idea is that the contribution of a view to the final particle weight depends on its confidence. So, π_{p,i}(t) = 1 means that all cameras see the particle position s_{p,i}(t) perfectly and that they "believe" that the target is there. As π_{p,i}(t) tends to 0, it is less likely that the target is at s_{p,i}(t).
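A possible implementation of the fusion rule of Eq. (14) is sketched below; the per-view likelihoods and confidences are assumed to have been computed already, and the function name is ours.

```python
import numpy as np

def fuse_view_weights(view_likelihoods, view_confidences):
    """Fuse per-camera observations of one particle (Eq. (14)).

    view_likelihoods : (V,) likelihoods pi^v of the particle in each of the V views
    view_confidences : (V,) confidences c^v in [0, 1] of those observations
    Returns the unnormalized fused particle weight.
    """
    V = len(view_likelihoods)
    return float(np.dot(view_likelihoods, view_confidences)) / V

# Three cameras: the first sees the particle well, the second is partially
# occluded, and the third does not see it at all.
print(fuse_view_weights(np.array([0.9, 0.6, 0.0]),
                        np.array([1.0, 0.3, 0.0])))
```

The fused weights of all particles of a tracker would then be normalized over the particle set before estimating the target position.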

Additionally, each tracker P_p(t) keeps a color model of the clothes of its target in each of the cameras:

A_p(t) = {a^v_p(t) | v = 1...V}.   (15)

Since we consider that the scene might be analyzed by cameras with different sensor characteristics and that illumination is not uniform, a point in the scene might be seen with a different color in each of the cameras. Therefore, a different color model a^v_p(t) is kept for each camera. The color models A_p(t) are initialized from the information of the first frame. Later, they are dynamically updated in order to adapt them to illumination changes and body movements.

The outline of the proposed algorithm is shown in Fig. 1. In an initial stage, before tracking takes place, a background model for each camera is created. We have employed the method proposed in Ref. [11], which combines color and depth information. The method models each pixel as a mixture of Gaussians with four components: three components for color information (using the HSV color space) and a fourth component for depth information. In the case of monocular images, the depth component is simply ignored. When depth information is available, it is employed to improve the foreground extraction: foreground pixels with the same color as the background can be correctly classified as foreground using a simple depth test.

Following that stage, the image set is captured and background subtraction is performed (step i). Let us denote by B^v the background map obtained for the v-th camera. A pixel of the background map is 1 when it is classified as background and 0 otherwise. The trackers then iterate in order to estimate the new locations of the people being tracked.

Particle propagation (step ii) is performed using a random walk movement model because of the unpredictable behavior of people. The random noise applied is assumed to follow a Normal distribution N(0, σ_m²), whose deviation is calculated based on the fact that the average human walking speed is about 1 m/s (3.6 km/h). Then, if the proposed system is able to operate at f hertz, the parameter σ_m is calculated as

σ_m = 1 / f.   (16)

In this way, the propagation step is adapted to the frame rate employed. The parameter σ_m increases as the camera frame rate f decreases. Thus, for a system with a lower frame rate, the particles are drawn more spread out from one iteration to the next.
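A minimal sketch of this propagation step, assuming ground-plane particles stored as an (N, 2) array and a known frame rate f:

```python
import numpy as np

def propagate(particles, frame_rate, rng=np.random.default_rng()):
    """Random-walk propagation with sigma_m = 1 / f (Eq. (16))."""
    sigma_m = 1.0 / frame_rate          # meters; wider spread at low frame rates
    return particles + rng.normal(0.0, sigma_m, particles.shape)

particles = np.zeros((50, 2))                       # 50 particles at the origin
print(propagate(particles, frame_rate=15).std())    # roughly 1/15 m of spread
```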

After propagation, the algorithm proceeds with the particle evaluation (step iii). For each particle, the view weights θ^v_{p,i}(t) are calculated in each camera independently (see Section 5). One of the most attractive advantages of using multiple cameras is the management of occlusion, and our algorithm is specifically designed to deal with it. When a person is occluded in a camera, another camera might be employed to keep track of that person. For that purpose, an occlusion map O^v is maintained for each camera (as in Ref. [22]). The occlusion map has the same dimensions as the original camera image and indicates at each pixel whether it is occupied by any of the people being tracked. The occlusion map is calculated independently for each camera using a depth-ordered approach. First, the targets are sorted according to their distance from the camera using the positions estimated in the previous iteration. Then, starting from the person nearest to the camera, the view weights are calculated. Afterwards, the "best" particle s^v_{p,b}(t) of that camera is selected. The best particle is the one that maximizes the product of its weight and confidence (Eq. (5)). Note that the best particle might not be the best global solution, but it is a good local solution that allows us to calculate the occlusion independently for each camera. Thus, the third step of the algorithm can be distributed in as many processes as cameras. Then, the position of the best particle is employed to project the person's silhouette into the occlusion map O^v (setting all the points inside the silhouette to 1). Afterwards, the particles of the next person are evaluated, but this time employing the occlusion map to take occlusions into account (this is explained in greater depth in the next section). In brief, particles projecting at image positions already occupied by other people are assigned low values of the certainty factor c^v_{p,i}(t). Therefore, in the data fusion step, the cameras in which a person is occluded are not as relevant in determining the person's location as the cameras in which the person is fully visible. The computation of our occlusion map is slightly different from the one calculated by Lanz in Ref. [22]: instead of using a different occlusion map for each person, we employ a single occlusion map for all of them.
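The depth-ordered evaluation can be organized per camera roughly as below. The sketch is deliberately simplified: targets project to axis-aligned boxes and only the visibility part of the confidence is computed, so all names and data structures are illustrative rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def evaluate_camera(particle_boxes, target_depths, image_shape):
    """Depth-ordered particle evaluation for one camera (step iii), simplified.

    particle_boxes : list (one entry per target) of (N, 4) arrays of candidate
                     boxes (x0, y0, x1, y1) in image coordinates
    target_depths  : previous distance of each target to this camera
    image_shape    : (height, width) of the camera image
    Returns, per target, the visibility confidence of each particle.
    """
    occlusion_map = np.zeros(image_shape, dtype=np.uint8)      # O^v: 1 = occupied
    confidences = [None] * len(particle_boxes)
    # Process targets from nearest to farthest (depth-ordered evaluation).
    for t in np.argsort(target_depths):
        boxes = particle_boxes[t]
        conf = np.empty(len(boxes))
        for i, (x0, y0, x1, y1) in enumerate(boxes.astype(int)):
            region = occlusion_map[y0:y1, x0:x1]
            # Confidence drops when the projection is already occupied by a
            # nearer person (cf. the unocc term of Eq. (32)).
            conf[i] = 1.0 - region.mean() if region.size else 0.0
        confidences[t] = conf
        # Mark the best (most visible) hypothesis as occupied for farther targets.
        bx0, by0, bx1, by1 = boxes[int(conf.argmax())].astype(int)
        occlusion_map[by0:by1, bx0:bx1] = 1
    return confidences

# Two targets whose candidate boxes overlap: the farther one loses confidence.
boxes_a = np.array([[40, 40, 80, 160]] * 3) + rng.integers(-5, 5, (3, 4))
boxes_b = np.array([[60, 40, 100, 160]] * 3) + rng.integers(-5, 5, (3, 4))
print(evaluate_camera([boxes_a, boxes_b], target_depths=[2.0, 4.0],
                      image_shape=(240, 320)))
```

Because each camera only needs its own occlusion map and the previous position estimates, this loop can run in a separate process per camera, which is what makes the algorithm highly parallelizable.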

Fig. 1. Multiple-view particle filtering.

When the particle sets have been evaluated in all the cameras, the data are fused in order to obtain a global estimation of the people's locations π_{p,i}(t) (step iv). Note that π_{p,i}(t) takes into account both the information from all the views and the information about the rest of the targets via the interaction factor I_{p,i}(t). The interaction factor is employed to prevent particles of one target from invading the region of another target (the coalescence problem), thus imposing the physical restriction that two targets cannot be at the same location simultaneously. The interaction factor is defined in this work as a Gaussian function:

I_{p,i}(t) = 1 − exp(−d_{mp}(t)² / (2 σ_{dm}²)),   (17)

where d_{mp}(t) is the Euclidean distance to the nearest target (excluding itself) and σ_{dm} is the deviation. The parameter σ_{dm} models the allowed distance between particles from different trackers. The value σ_{dm} = 0.5 is employed, assuming that the width of the average person is approximately 0.25 m. Since 95% of the area is within 2σ around the mean, particles drawn nearer than 0.25 m to other people receive very low values of interaction. Thus, the system enforces the physical restriction that two people cannot be at the same location simultaneously.
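A direct transcription of the interaction factor of Eq. (17), using the 0.5 m deviation quoted above; the helper name is ours.

```python
import numpy as np

def interaction_factor(particle_pos, other_targets, sigma_dm=0.5):
    """Interaction factor of Eq. (17): ~0 near other targets, ~1 far from them.

    particle_pos  : (2,) ground-plane position of the particle
    other_targets : (M, 2) estimated positions of the other tracked people
    """
    if len(other_targets) == 0:
        return 1.0
    d_min = np.min(np.linalg.norm(other_targets - particle_pos, axis=1))
    return 1.0 - np.exp(-d_min ** 2 / (2.0 * sigma_dm ** 2))

others = np.array([[1.0, 1.0], [3.0, 2.0]])
print(interaction_factor(np.array([1.1, 1.0]), others))   # close -> small factor
print(interaction_factor(np.array([5.0, 5.0]), others))   # far   -> close to 1
```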

Using the fused information, the best location hypothesis E[S_p(t)] and the target's global confidence c_{E[S_p(t)]} are estimated (step v). The confidence value c_{E[S_p(t)]}, which is calculated using the confidence of the best particles in each camera, is employed later (step vi) to determine whether the target is not properly detected. In case c_{E[S_p(t)]} falls below a minimum value c_min, we consider that the estimation of the target is not reliable. This normally indicates that the target is not properly visible in any camera. If so, the number of particles and the random noise employed for particle propagation are increased in an attempt to relocate the person in the following iterations. If the situation persists over too long a time, then the algorithm considers that the person is lost and its tracker stops working.

In case there is no total occlusion in all the cameras, the color models A_p(t) are updated using the information from the best hypothesis (step vii). Finally, the background models are smoothly adapted to changes in the environment. To prevent people standing for long periods of time from becoming part of the background model, pixels marked as occupied in the occlusion maps O^v are not updated.

The next section provides a detailed explanation of how particles are evaluated.

5. Particle evaluation

This section explains the proposed observation and confidence models. First, Section 5.1 shows the 3D geometric model employed to model the appearance of people and the information extracted from its projection. Then, Section 5.2 explains the proposed camera models. Section 5.3 explains how the confidence values are calculated. Finally, Section 5.4 shows how the color models are updated to adapt them to changes in illumination and body pose.

5.1. 3D model projection

The proposed method relies on the use of a geometric 3D model representing the silhouettes of people. We have selected a basic model consisting of a box whose dimensions are based on the dimensions of an average adult person. The box is assumed to have width B_w = 0.5 m and height B_h = 1.8 m. Although the model dimensions are fixed in this work, they can be adapted to the particular characteristics of the people being observed. Since the cameras are calibrated, it is possible to calculate the projection of the 3D model in each camera for a given position s_{p,i}(t). Let us define by

p(s_{p,i}(t))^v = {p_i(x_i, y_i)}   (18)

the image pixels of the v-th camera image that lie in the projection of a 3D model placed at s_{p,i}(t). Fig. 2 shows the projection of the model employed in three different cameras. Although, in practice, a solid model is employed, Fig. 2 shows a wireframe version for viewing purposes.

Fig. 2. Projection of the 3D geometric model employed for tracking people. The cameras are calibrated in order to calculate the model projection in each camera.

Note that some of the pixels in p(s_{p,i}(t))^v might be background pixels and, therefore, be irrelevant. Let us denote by

f(s_{p,i}(t))^v = {p_i | B_{p_i} = 0 ∧ p_i ∈ p(s_{p,i}(t))^v}   (19)

the pixels in p(s_{p,i}(t))^v that are foreground pixels (B_{p_i} = 0). Also, some of the pixels in f(s_{p,i}(t))^v might have already been set as belonging to another person in the occlusion map O^v. So, let us denote by

v(s_{p,i}(t))^v = {p_i | O_{p_i} = 0 ∧ p_i ∈ f(s_{p,i}(t))^v}   (20)

the foreground pixels that have not yet been occupied by other people, i.e., O_{p_i} = 0. For the sake of clarity, we will omit (s_{p,i}(t))^v in the rest of this paper when referring to the sets defined above. Instead, they will be denoted by p, f and v.

5.2. Camera models

As explained in Section 4, the observations are analyzed independently for each camera, obtaining a set of view weights θ^v_{p,i}(t) (see Eq. (13)). Each view weight is composed of a likelihood value π^v_{p,i}(t) and a confidence value c^v_{p,i}(t). The first one indicates how likely the particle is to be placed at the true person's location, according to the observation from the v-th view. This value is calculated differently depending on the type of camera. The camera models proposed in this work combine the number of foreground points of the projection, their color distribution and their depth (in the case of stereo cameras). The idea is that if the person being tracked is at the particle location, then (i) there must be a set of foreground points in the projected region; (ii) the color distribution of the visible points must be similar to the color distribution of the person model; and, in the case of stereo cameras, (iii) the 3D locations of the foreground points should be near the particle's location. The camera models proposed in this work are then defined as follows:

π^v_{p,i}(t) = π^{o,v}_{p,i}(t) π^{c,v}_{p,i}(t)                   if v is monocular,
π^v_{p,i}(t) = π^{o,v}_{p,i}(t) π^{c,v}_{p,i}(t) π^{d,v}_{p,i}(t)   if v is stereo.   (21)

The parameter π^{o,v}_{p,i}(t) represents the likelihood of the particle s_{p,i}(t) according to the foreground observation. The parameter π^{c,v}_{p,i}(t) represents the particle likelihood based on the color information, and π^{d,v}_{p,i}(t) the likelihood based on the depth information. Then, π^v_{p,i}(t) is calculated using only the first two likelihoods for monocular cameras, while all three components are employed for stereo cameras. Below, we explain how the foreground, color and depth observation models are defined.

5.2.1. Foreground model

The foreground model is based on the idea that a person placed at the particle location must project to foreground points in the corresponding camera region. Therefore, we evaluate the number of foreground points in the region where the 3D model projects as evidence of the presence of a person. Let us define the proportion of foreground pixels of the projection as

F^v_{p,i}(t) = |f| / |p|,   (22)

where |·| denotes the cardinality of a set. Then, F^v_{p,i}(t) ∈ [0,1] is 0 when there are no foreground points in the region and tends to 1 as the proportion of foreground points increases. Note that the parameter F^v_{p,i}(t) is independent of the distance from the target to the camera, since it is divided by the total number of points projected.

F^v_{p,i}(t) is assumed to follow a Normal distribution N(μ_o, σ_o). Then, we shall denote its likelihood distribution function as follows:

π^{o,v}_{p,i}(t) = p(F^v_{p,i}(t) | s_{p,i}(t)) = (1 / (σ_o √(2π))) exp(−(F^v_{p,i}(t) − μ_o)² / (2 σ_o²)).   (23)

We have empirically determined through several tests that the best distribution parameters are μ_o = 0.7 and σ_o = 0.3.

Note that we have assumed |p| ≠ 0. If this condition is not met (i.e., the particle is not visible from the v-th camera), then we set π^{o,v}_{p,i}(t) = 0.
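The foreground likelihood of Eqs. (22)-(23) can be evaluated from the background map and the projected pixel set as in the sketch below, where the masks are boolean images and μ_o = 0.7, σ_o = 0.3 are the values reported above.

```python
import numpy as np

def foreground_likelihood(projection_mask, background_map,
                          mu_o=0.7, sigma_o=0.3):
    """Foreground observation model (Eqs. (22)-(23)).

    projection_mask : boolean image, True where the 3D box projects
    background_map  : binary image B^v, 1 = background, 0 = foreground
    """
    n_proj = projection_mask.sum()
    if n_proj == 0:                                   # particle not visible
        return 0.0
    # Proportion of projected pixels that are foreground (Eq. (22)).
    F = np.sum((background_map == 0) & projection_mask) / n_proj
    # Gaussian likelihood around mu_o (Eq. (23)).
    return np.exp(-(F - mu_o) ** 2 / (2.0 * sigma_o ** 2)) / (sigma_o * np.sqrt(2 * np.pi))

# A 10x10 projection over an image where 70% of those pixels are foreground.
bg = np.ones((20, 20), dtype=np.uint8)
bg[5:12, 5:15] = 0                                   # foreground blob
proj = np.zeros((20, 20), dtype=bool)
proj[5:15, 5:15] = True
print(foreground_likelihood(proj, bg))
```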

5.2.2. Color model

A color histogram a^v_p(t) is maintained at each camera to model the colors of the clothes of the person being tracked. Color histograms have often been used for modeling color in tracking problems, since they allow the global properties of objects to be captured with invariance to scale, rotation and translation [5]. In this work, histograms are created using the color of the non-occluded foreground pixels, i.e., the points in v. The HSV color space [10] is employed because it is relatively invariant to illumination changes. Thus, a histogram is comprised of n_h n_s bins for the hue and saturation. However, as chromatic information is not reliable when the value component is too low or too high, pixels in those regions are not used to describe the chromaticity. Because these "color-free" pixels might contain important information, histograms are also populated with n_v bins to capture their luminance information. Thus, histograms are composed of m = n_h n_s + n_v bins. Let us define a function b: R² → {1...m} that associates a pixel p_i with the index of the histogram bin b(p_i) = w corresponding to its color. Then, the w-th bin of a histogram is calculated as

a(w) = (Σ_{p_i ∈ v} k[b(p_i) − w]) / |v|,   (24)

where k is the Kronecker delta function. Note that the histogram bins are normalized: Σ_{w=1}^{m} a(w) = 1.

The observed color information is represented by the variable C^v_{p,i}(t), which is the distance between the color histogram maintained by the tracker, a^v_p(t), and the color histogram extracted from the region v (denoted by a^v(s_{p,i}(t))). The parameter C^v_{p,i}(t) is defined as the Bhattacharyya distance between the histograms [1,15], which is calculated as

C^v_{p,i}(t) = √(1 − ρ(a^v_p, a^v(s_{p,i}(t)))),   (25)

where ρ is the Bhattacharyya coefficient:

ρ(a, b) = Σ_w √(a(w) b(w)).   (26)

Eq. (26) is 1 when both distributions are identical, and tends to 0 as they differ. So, C^v_{p,i}(t) ∈ [0,1] is a normalized distance that can be used to compare distributions created with different numbers of pixels.

We assume that C^v_{p,i}(t) ~ Hn(0, σ_c) follows a half-normal distribution. Then, we shall denote its likelihood distribution function as

π^{c,v}_{p,i}(t) = p(C^v_{p,i}(t) | s_{p,i}(t)) = (2 / (σ_c √(2π))) exp(−C^v_{p,i}(t)² / (2 σ_c²)).   (27)

The parameter σ_c was experimentally determined as σ_c = 0.3. We have assumed that |v| ≠ 0; otherwise, π^{c,v}_{p,i}(t) = 0. Obtaining |v| = 0 means either that the particle does not project to the image, or that it is completely occluded by other people that are nearer to the camera.
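Putting the color model together, a possible sketch of the histogram of Eq. (24), the Bhattacharyya distance of Eqs. (25)-(26) and the half-normal likelihood of Eq. (27) is shown below. The bin counts (8x8 + 4) and the thresholds on the value channel are illustrative choices; the paper does not specify them here.

```python
import numpy as np

N_H, N_S, N_V = 8, 8, 4          # m = n_h*n_s + n_v bins (illustrative sizes)

def color_histogram(hsv_pixels):
    """Normalized color histogram of the visible foreground pixels (Eq. (24)).

    hsv_pixels : (K, 3) array with H, S, V components in [0, 1]
    """
    h, s, v = hsv_pixels.T
    hist = np.zeros(N_H * N_S + N_V)
    chromatic = (v > 0.2) & (v < 0.9)                     # reliable chromaticity
    hi = np.minimum((h[chromatic] * N_H).astype(int), N_H - 1)
    si = np.minimum((s[chromatic] * N_S).astype(int), N_S - 1)
    np.add.at(hist, hi * N_S + si, 1.0)
    vi = np.minimum((v[~chromatic] * N_V).astype(int), N_V - 1)
    np.add.at(hist, N_H * N_S + vi, 1.0)                  # "color-free" pixels
    return hist / max(hist.sum(), 1.0)

def color_likelihood(model_hist, observed_hist, sigma_c=0.3):
    """Bhattacharyya distance (Eqs. (25)-(26)) and half-normal likelihood (Eq. (27))."""
    rho = np.sum(np.sqrt(model_hist * observed_hist))
    C = np.sqrt(max(1.0 - rho, 0.0))
    return 2.0 * np.exp(-C ** 2 / (2.0 * sigma_c ** 2)) / (sigma_c * np.sqrt(2 * np.pi))

pixels = np.random.default_rng(2).uniform(size=(500, 3))
model = color_histogram(pixels)
print(color_likelihood(model, color_histogram(pixels[:250])))
```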

5.2.3. Depth model

Depth is a powerful source of information for tracking purposes, as we have already demonstrated in previous works [26–28]. If a particle is located at the location of a real person, then the 3D locations of the pixels in v must be close to the particle's location. To check that restriction, let us denote by d the set of visible pixels for which depth information has been calculated. Remember that depth information might not be available in all pixels (as previously explained in Section 3.1). Let us also denote by B(d_i) the ground plane coordinates of a pixel. Finally, let us define the variable D^v_{p,i}(t), indicating the degree to which the points from d are close to the particle's location, as follows:

D^v_{p,i}(t) = (1 / |v|) Σ_{i=1}^{|d|} exp(−(B(d_i) − s_{p,i}(t))² / (2 σ_z²)).   (28)

The value D^v_{p,i}(t) is obtained by summing, for each pixel in d, a value proportional to its distance to the particle location s_{p,i}(t). The value added for each pixel is modeled as a Gaussian function. So, for pixels located exactly at s_{p,i}(t) we add 1, but as the distance increases, the value added tends to 0. The parameter σ_z varies the influence of the distance and is set, according to the dimensions of the 3D model employed, to

σ_z = B_w / 2,   (29)

i.e., half the width of the 3D box. Finally, the sum computed is divided by the number of visible points |v|. So, D^v_{p,i}(t) is invariant with respect to the size of the projection and to the amount of occlusion.

D^v_{p,i}(t) is assumed to follow a Normal distribution N(μ_d, σ_d). Consequently, its likelihood is defined by

π^{d,v}_{p,i}(t) = p(D^v_{p,i}(t) | s_{p,i}(t)) = (1 / (σ_d √(2π))) exp(−(D^v_{p,i}(t) − μ_d)² / (2 σ_d²)).   (30)

We have empirically determined that μ_d = 0.5 and σ_d = 0.2 provide appropriate results. Finally, in the case when |v| = 0, we set π^{d,v}_{p,i}(t) = 0.
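A sketch of the depth observation model of Eqs. (28)-(30), assuming the ground-plane coordinates B(d_i) of the pixels with valid disparity are already available; B_w = 0.5 m, μ_d = 0.5 and σ_d = 0.2 are the values quoted in the text.

```python
import numpy as np

def depth_likelihood(pixel_ground_xy, n_visible, particle_xy,
                     box_width=0.5, mu_d=0.5, sigma_d=0.2):
    """Depth observation model (Eqs. (28)-(30)).

    pixel_ground_xy : (K, 2) ground-plane coordinates B(d_i) of the visible
                      pixels with valid disparity
    n_visible       : |v|, number of visible (non-occluded) foreground pixels
    particle_xy     : (2,) ground-plane position of the particle
    """
    if n_visible == 0:
        return 0.0
    sigma_z = box_width / 2.0                                    # Eq. (29)
    d2 = np.sum((pixel_ground_xy - particle_xy) ** 2, axis=1)
    D = np.sum(np.exp(-d2 / (2.0 * sigma_z ** 2))) / n_visible   # Eq. (28)
    # Gaussian likelihood around mu_d (Eq. (30)).
    return np.exp(-(D - mu_d) ** 2 / (2.0 * sigma_d ** 2)) / (sigma_d * np.sqrt(2 * np.pi))

pts = np.random.default_rng(3).normal([2.0, 3.0], 0.2, size=(300, 2))
print(depth_likelihood(pts, n_visible=400, particle_xy=np.array([2.0, 3.0])))
```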

5.3. Confidence models

The confidence value c^v_{p,i}(t) of the view weights is employed to fuse the independent observations (Eq. (6)). This value is in the range [0,1] and indicates the degree to which the observation from view v can be relied on. It is 0 when the observation is completely unreliable, and it increases as the reliability increases. In our problem, the confidence of an observation is considered high when (i) the visibility of the particle in the camera is high; and, if the camera is stereo, (ii) when the particle is near the camera. The first condition ensures that cameras with high visibility of the target make a large contribution to the final fusion stage. The second condition is mainly related to the fact that stereo information becomes unreliable as the distance to the observed object increases. Thus, observations from stereo cameras placed near the target are considered more reliable.

The confidence parameter is then calculated as

c^v_{p,i}(t) = γ_m W^v_{p,i}(t)                 if v is monocular,
c^v_{p,i}(t) = γ_st W^v_{p,i}(t) φ^v_{p,i}(t)   if v is stereo,   (31)

where W^v_{p,i}(t) ∈ [0,1] indicates the degree of visibility of the particle from the v-th camera, and φ^v_{p,i}(t) ∈ (0,1] indicates the degree of confidence of the depth information employed to calculate π^{d,v}_{p,i}(t). These functions are explained in more detail below. Finally, the γ parameters, which are in the range [0,1], are included in order to set the relative importance of the different types of sensors. A sensor is assigned γ = 1 when it is considered the most reliable of the available sensors, whereas low values of γ indicate that a sensor is considered less reliable than the other sensors available. Obviously, when only one type of sensor is employed, γ is not useful; the use of this parameter only makes sense when different types of sensors are employed. We provide a discussion of how this parameter's values are chosen in the experimental section.

5.3.1. Definition of the visibility confidence parameter W^v_{p,i}(t)

The term visibility includes two aspects in our work. First, it refers to the degree to which the 3D model projects into the field of view of the v-th camera. For that purpose, let us define the measure proj^v_{p,i}(t). It is 1 when the model fully projects onto the camera image, and tends to 0 as the model projects outside the camera's field of view. So, proj^v_{p,i}(t) = 0 means that the particle is not visible from the v-th camera. Second, the visibility also refers to the proportion of the projected model that is not occluded by other people. Thus, we define

unocc^v_{p,i}(t) = |v| / |p|   if |p| ≠ 0,
unocc^v_{p,i}(t) = 0           otherwise,   (32)

as the proportion of points projected that are not occluded. The visibility factor is then defined as

W^v_{p,i}(t) = proj^v_{p,i}(t) · unocc^v_{p,i}(t).   (33)

So, W^v_{p,i}(t) = 1 for particles projecting completely into the field of view of the camera at positions not occluded by other people. However, it tends to 0 as the projection lies either outside the view of the camera or in occluded areas.

5.3.2. Definition of the depth confidence parameter φ^v_{p,i}(t)

Errors in stereo information increase with the distance to the observed object. Thus, we define the confidence measure φ^v_{p,i}(t). It tends to 0 for particles far from the camera, and to 1 for particles close to the camera. The definition of near/far is different from one stereo camera to another. It mainly depends on the optical properties of the camera, i.e., cameras with a large focal length have higher precision at large distances than cameras with a short focal length. To account for this property, we have defined φ^v_{p,i}(t) as

φ^v_{p,i}(t) = exp(−Z^v_{p,i}(t)² / (2 σ_c²)),   (34)

where Z^v_{p,i}(t) represents the distance from the particle s_{p,i}(t) to the v-th camera. Thus, φ^v_{p,i}(t) decreases as the distance increases. The smoothness of the curve is modified by the parameter σ_c, which is defined as

σ_c = Z_max / 2.   (35)

Therefore, when Z^v_{p,i}(t) = Z_max, the value of φ^v_{p,i}(t) is very small, indicating a lack of confidence. In fact, the parameter Z_max can be seen as an upper limit indicating the maximum distance at which stereo errors can be considered too high for camera v. From Eq. (1), we can derive

Z_max = f_l b / d_min.   (36)

Thus, Z_max is expressed in terms of a disparity limit d_min. The use of d_min allows us to have a uniform rule to calculate Z_max for each camera. In consequence, a camera with a long focal length can be far from an object and still have a higher level of confidence than a closer camera with a smaller focal length. The disparity limit d_min must be selected according to the precision of the stereo matching algorithm employed. We have determined experimentally that d_min = 3 provides appropriate results in our case.
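Combining Sections 5.3.1 and 5.3.2, the confidence of one observation could be computed as below. The proj and unocc inputs are assumed to come from the model projection, and the focal length, baseline and γ values are illustrative defaults, not the paper's settings.

```python
import numpy as np

def confidence(proj, unocc, is_stereo, particle_distance=None,
               focal_px=500.0, baseline_m=0.12, d_min=3.0,
               gamma_mono=1.0, gamma_stereo=1.0):
    """Confidence c^v of an observation (Eqs. (31)-(36)).

    proj              : degree to which the 3D model projects inside the image, in [0, 1]
    unocc             : proportion of the projection not occluded by others (Eq. (32))
    is_stereo         : whether the camera is a stereo sensor
    particle_distance : distance from the particle to the camera (stereo only)
    """
    visibility = proj * unocc                                  # Eq. (33)
    if not is_stereo:
        return gamma_mono * visibility                         # Eq. (31), monocular
    z_max = focal_px * baseline_m / d_min                      # Eq. (36)
    sigma = z_max / 2.0                                        # Eq. (35)
    phi = np.exp(-particle_distance ** 2 / (2.0 * sigma ** 2)) # Eq. (34)
    return gamma_stereo * visibility * phi                     # Eq. (31), stereo

print(confidence(1.0, 0.8, is_stereo=False))                   # monocular camera
print(confidence(1.0, 0.8, is_stereo=True, particle_distance=6.0))
```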

5.4. Color model update

Changes in illumination conditions and body movements might alter the observed color distribution of a person's appearance. It is therefore necessary to continuously update the models a^v_p(t) of the people. These are updated using the color models of the best estimated hypothesis E[S_p(t)] at each iteration. Let us denote by ā^v_p(t) the color model in the v-th view of a particle placed at E[S_p(t)]. Then, the bins w of the color histograms are updated as

a^v_p(t+1, w) = (1 − λ) a^v_p(t, w) + λ ā^v_p(t, w),   (37)

where the parameter λ ∈ [0,1] weights the contribution of the observed color model to the updated one. The value of λ is usually set to a small value (0.01) so that the color model adapts smoothly to illumination changes. Finally, the updated histogram is normalized so that its bins sum up to one.
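The running update of Eq. (37) is a simple exponential blend followed by renormalization; a minimal sketch with λ = 0.01 as in the text:

```python
import numpy as np

def update_color_model(model_hist, observed_hist, lam=0.01):
    """Blend the stored color model with the best hypothesis' histogram (Eq. (37))."""
    updated = (1.0 - lam) * model_hist + lam * observed_hist
    return updated / updated.sum()               # keep the bins summing to one

model = np.array([0.5, 0.3, 0.2])
observed = np.array([0.2, 0.5, 0.3])
print(update_color_model(model, observed))
```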

6. Experimental results

This section explains the experiments that were conducted to test the proposed algorithm. It is divided into three subsections. The first one, Section 6.1, evaluates the performance of the proposed algorithm using only stereo cameras. The second one, Section 6.2, evaluates the algorithm's performance when only monocular cameras are employed. Finally, Section 6.3 evaluates the algorithm using a combination of stereo and monocular cameras.

Several video sequences were recorded in three different scenarios in order to perform the experiments. The first scenario is a lab of approximately 5×4 m. Sequences were recorded using two different camera configurations. In the first one (see Fig. 4), a total of three stereo firewire cameras were placed at a height of 3 m. The cameras were set to record at 15 fps with a resolution of 320×240 pixels. In the second configuration (see Fig. 9), five firewire monocular cameras were placed at a height of 3 m, configured to record with the same frame rate and resolution as in the previous configuration. In both cases, camera synchronization was achieved by the firewire hardware.

The second scenario is the PEIS room [24], a robotized apartment employed for the development and research of mobile and embedded robotic systems aimed at helping elderly people (see Fig. 6). The room is equipped with four monocular web cameras and two stereo systems. The first stereo system (composed of firewire cameras) is placed on the ceiling at a height of approximately 5 m, pointing downwards. The second stereo system (composed of web cameras) is placed at a height of approximately 3 m in a slanting position. The four monocular web cameras are placed around the room at a height of approximately 3 m in a slanting position. The area covered by the cameras is 3×4 m. In addition, all the cameras are roughly synchronized via software to record at 7 fps with a resolution of 320×240 pixels.

Finally, the third scenario is a corridor of approximately 15 m in length (see Fig. 7). Two stereo firewire systems were placed at a height of 2 m at opposite ends of the corridor, recording at 15 fps with a resolution of 320×240 pixels.

The stereo cameras employed in all the scenarios are manufactured by Videre Design. The web cameras employed connect to the USB port and are manufactured by Logitech. In the PEIS room, one of the two stereo cameras employed is comprised of a pair of monocular web cameras. For that purpose, we employed the calibration and rectification routines provided by the OpenCV library [12]. The stereo correspondence algorithm employed in all cases is a correlation algorithm using SAD and performing subpixel interpolation.

The number of people in the recorded sequences varied from 2 to 4. The people were instructed to move freely about the environment. Therefore, interactions and occlusions are frequent in the recorded videos. A total of 10 different people participated in the experiments.

6.1. Performance evaluation with stereo cameras

This section analyzes the performance of the algorithm when only stereo cameras are employed. The experiments conducted were designed to evaluate two crucial aspects of the proposed method: first, the error in determining the positions of the people being tracked (tracking error), and second, the evolution of the error as the number of particles grows. As regards the first objective, the positions of people in six different sequences were determined in each frame in order to obtain quantitative measures of the tracking error. The positions were manually extracted to indicate the locations of the people's heads in the ground plane. A total of 8122 positions have been manually extracted from the 4178 frames recorded. The tracking error for each person was calculated as the distance from their actual position (manually determined) to the position estimated by the proposed tracking method.

As regards the second objective, the performance of particle filter methods increases as the number of particles grows. Nevertheless, the higher the number of particles employed, the higher the computational effort required by the algorithm. Moreover, there is usually a limit to the number of particles, beyond which no significant improvement is obtained. Our aim is to examine the error evolution in order to determine the minimum number of particles required for the proposed method to perform adequately. Thus, the tracking error of our algorithm was analyzed in the recorded sequences for an increasing number of particles. Furthermore, due to the stochastic nature of the algorithm, each test was repeated 10 times with different seeds for the random number generator.

The results of our experimentation are shown in Fig. 3. The horizontal axis of the graph in Fig. 3 represents the number of particles employed for each tracker. The vertical axis represents the RMSE (root mean-square error) in determining the people's positions. The RMSE has been obtained from the errors of the 10 runs for all the people in the sequences. Each line of the graph represents the tracking error in a particular scenario. As expected, the tracking error is reduced as the number of particles increases in all the scenarios. However, it can be observed that, in the lab and PEIS room scenarios, only twenty particles are needed to reach the smallest error, whereas over a hundred particles are needed to obtain the smallest error in the corridor scenario. As will be shown later, the reason behind this difference is the reduced visibility and the great distance between the cameras in the corridor scenario. In any case, the error is around 0.1 m in all the scenarios, which can be considered reliable, taking into account that the ground truth was extracted manually.

Fig. 3. Tracking errors (expressed in meters) as the number of particles grows in each of the scenarios tested.

The tests were performed on an Intel Pentium Core 2 with 1 GB of RAM running GNU/Linux. In our tests, the background removal takes 30 ms for each image. The time required to evaluate a view weight depends on the distance to the camera, but on average it requires 1 ms. Thus, evaluating 50 particles in a scenario with three cameras requires approximately 120 ms. Although our implementation is sequential (only one processor is employed), the algorithm can be parallelized in order to reduce the computing time.

In the following section, we show images from some of the sequences employed to test our proposal. Fig. 4 shows some scenes recorded at the lab (first scenario). In the sequence, four people enter the room and move about, causing frequent occlusions of each other in some of the cameras. The figure shows the tracking results at four different time instants. The upper rows show the camera images at a particular time instant. In the images, the models are drawn at the best position estimated by the tracker E[S_p(t)] (see Eq. (7)). Below the camera images, the figure shows a ground map of the monitored area where the locations and orientations of the cameras have been superimposed. The particles have been drawn on the ground maps in the form of circles whose color indicates their likelihood and confidence. The color scheme employed is indicated in Fig. 5. Each target is represented by a different color: red, blue, green and light blue. The particles of the red target are drawn in pure red when π^v_{p,i}(t) = 1 and c^v_{p,i}(t) = 1, i.e., particles with high confidence and high likelihood. Black is used for particles with c^v_{p,i}(t) = 0 and π^v_{p,i}(t) = 0, i.e., particles with low confidence, either because they are not visible or because they are very far from the camera. The rest of the possible intermediate values of π^v_{p,i}(t) and c^v_{p,i}(t) are represented by the colors inside the rectangle. The maps labeled fusion show the results of merging information from all views.
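The following is a small sketch of one way such a two-dimensional color coding could be generated. The bilinear blend and the corner colors (black at low likelihood and low confidence, dark gray at the mixed corners, the target's pure color at the top corner) are our own assumptions for illustration and not the exact map shown in Fig. 5.

```python
def particle_color(likelihood, confidence, base_rgb,
                   c00=(0, 0, 0), c10=(64, 64, 64), c01=(64, 64, 64)):
    """Bilinear blend of four corner colors indexed by (likelihood, confidence),
    both clamped to [0, 1]; returns an (R, G, B) tuple for drawing the particle."""
    p = max(0.0, min(1.0, likelihood))
    c = max(0.0, min(1.0, confidence))
    out = []
    for k in range(3):
        v = (c00[k] * (1 - p) * (1 - c) + c10[k] * p * (1 - c)
             + c01[k] * (1 - p) * c + base_rgb[k] * p * c)
        out.append(int(round(v)))
    return tuple(out)

red_target = (255, 0, 0)
print(particle_color(1.0, 1.0, red_target))  # pure red: confident and well matched
print(particle_color(0.0, 0.0, red_target))  # black: low confidence, low likelihood
```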

In the first frame of Fig. 4, the red target is entering the lab. It can be noticed that the particles for the first camera are drawn in dark gray, thus indicating that the target is occluded in that camera. Nevertheless, since the target is properly seen in the rest of the cameras, the final location estimated by the tracker is very accurate. The rest of the images in the figure show how the other people enter the room and walk around. As can easily be seen, occlusions are frequent. Take for instance the third frame (#403). The person marked in green is invisible in the first camera, and only partially visible in the rest of the cameras. Likewise, the person in blue is occluded in the third camera by the person in red, but visible in the rest of the cameras. Despite the frequent occlusions, the proposed algorithm is able to track all the people reliably.

Fig. 6 shows images from one of the tests performed in the second scenario: the PEIS room. As previously indicated, two stereo cameras were placed in the monitored area: one on the ceiling and another one in a slanting position. The camera in the slanting position is the one comprised of two monocular web cameras. The scene in Fig. 6 shows three people that enter the room and chat. During the chat, they change their positions and cause occlusions. It is interesting to note that a ceiling camera reduces occlusions significantly. Nevertheless, such a configuration requires very wide angle lenses and a high ceiling (which is not always possible). On the other hand, slanting cameras monitor wider areas than cameras placed on ceilings, but occlusions are more frequent. In any case, a point worth mentioning is that the proposed algorithm is able to work independently of the camera configuration employed. As can be seen, tracking can be reliably done with a small error.

Finally, Fig. 7 shows one of the tests performed in the corridor scenario. In the figure, we can see three people moving from one end of the corridor to the other. As we explained earlier, this is the scenario that requires the highest number of particles in order to obtain reliable results. This is an especially difficult tracking scenario for two reasons. First, the distance between the cameras is very large, so we observed large stereo errors in some of the frames. Nevertheless, the confidence measure helps improve the tracking by taking into account the most reliable information. Second, since the cameras are in front of each other, there are very


Fig. 4. Four people move randomly about a lab, causing frequent occlusions, and leave the field of view of some of the cameras.


severe and frequent occlusions. Moreover, in some cases, the targets are completely occluded in both cameras, e.g., frame #656. However, the tracking can be recovered thanks to the occlusion handling procedure employed (step vi of the algorithm in Fig. 1).

6.2. Performance evaluation with monocular cameras

This section aims to show that the proposed algorithm is not only valid for stereo cameras, but also for monocular cameras. As in the previous case, we are interested in determining the number

of particles that provides an appropriate trade-off between precision and computing time. But we are also interested in determining the minimum number of cameras required in order to obtain reliable results. So the tests were repeated for different numbers of cameras. We analyzed a total of 2890 frames, and a total of 6647 people's positions. Each test was repeated 10 times in order to consider the stochastic nature of particle filters. The results are shown in Fig. 8.

The graph at the left side of Fig. 8 shows the results obtained in the Lab scenario. The graph shows the RMSE as the number of particles grows. The test was repeated for different camera


Fig. 5. Color map used to represent the particles in the ground maps; the axes correspond to the likelihood π^v_{p,i}(t) and the confidence c^v_{p,i}(t).

Fig. 6. Tracking results in the second scenario, the PEIS room. Three people enter and start a chat. Notice how the proposed method is able to operate even with a ceiling camera.

Fig. 7. Tracking results in the corridor scenario. Two stereo cameras are placed at opposite ends of the corridor. Complete occlusions take place in that scenario, e.g., in frame #656.


configurations. For instance, the blue line represents the mean error obtained by the tracker in all the positions analyzed, using all the possible combinations of four cameras in the lab. The results obtained clearly show how the tracking precision improves as the number of cameras increases. The reason is that, by using a large number of cameras, simultaneous occlusions in all of the cameras become unlikely. Thus, targets are visible in almost every situation. We can observe that the minimum error obtained, 0.15 m, is slightly higher than that obtained by using

stereo cameras. However, it is small enough to be used in many tracking applications. Another conclusion obtained from the results is that at least four cameras are required in that scenario in order to obtain reliable results.

Fig. 9 shows some frames from one of the test sequences employed. As in Fig. 4, the odd rows show the camera images, while the even rows show the ground map with the particles in it. The color representation of Fig. 5 has been employed for the particles. The figure shows a scene with four people entering


Fig. 8. Tracking errors for the first and second scenarios, using different numbers of cameras and particles.


the lab and moving about, causing frequent occlusions. It can be clearly seen that the proposed algorithm tracks the targets' positions reliably.

The graph at the right side of Fig. 8 shows the results of processing the sequences recorded in the PEIS room (the second scenario). In that scenario, we placed a total of four cheap web cameras around the area of surveillance. The tests were performed, as in the previous case, using a variable number of cameras. The results show that the higher the number of cameras, the smaller the tracking error obtained. As in the previous case, a minimum of four cameras is required to obtain reliable results in that scenario.

Finally, the experiments in the corridor scenario were also performed using exclusively monocular cameras, although these results are not presented graphically. The results indicate that it is impossible to track the position of the targets reliably using monocular cameras in that camera configuration, at least with the method proposed. In general, we have noticed that the highest tracking precision is obtained when the target is seen from at least two cameras separated by an angle of 90 degrees. In that case, the intersection of their lines of sight (LOS) produces a good estimate of the target's location. However, as the cameras become parallel, the intersection of their LOS provides little information on the target's location. This effect is especially evident in the corridor scenario, where the two cameras are in front of each other. To show that problem better, we ran a test on a corridor with a single person moving from one end to the other. The sequence was recorded using stereo cameras and then processed with and without stereo information. The results are shown in Fig. 10. At the left side of the figure, we show some frames processed using both stereo and monocular information. At the right side there is a graph showing the tracking error of each alternative in every frame. When the person first appears (frame #87), both trackers are initialized with the same position. As time passes, the monocular tracker is unable to keep track of the person's location (see the red line of the graph). However, we can see that the tracker using stereo information is able to track the person, maintaining a low tracking error (see the blue line). We might therefore conclude that our monocular tracking approach is influenced by the position of the cameras. We believe that this problem is not unique to our method; all tracking methods relying on the intersection of LOS might well have the same difficulty.
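This geometric argument can be made concrete with a small numerical sketch (ours, not taken from the paper): a ground-plane target is localized by the least-squares intersection of the lines of sight of two cameras, and the conditioning of that linear system degrades as the two rays become parallel. The camera positions and target location below are illustrative values only.

```python
import numpy as np

def intersect_los(origins, bearings):
    """Least-squares ground-plane point closest to all 2D lines of sight."""
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for o, d in zip(origins, bearings):
        d = np.asarray(d, dtype=float)
        d = d / np.linalg.norm(d)
        P = np.eye(2) - np.outer(d, d)   # projector onto the normal of the ray
        A += P
        b += P @ np.asarray(o, dtype=float)
    return np.linalg.solve(A, b), np.linalg.cond(A)

target = np.array([5.0, 0.2])
setups = {
    "well separated (about 90 degrees)": [np.array([0.0, 0.0]), np.array([5.0, 5.0])],
    "corridor (cameras facing each other)": [np.array([0.0, 0.0]), np.array([10.0, 0.0])],
}
for name, cams in setups.items():
    est, cond = intersect_los(cams, [target - c for c in cams])
    print(f"{name}: estimate = {est.round(2)}, condition number = {cond:.1f}")
```

With noiseless bearings both setups recover the target, but the corridor configuration yields a condition number several hundred times larger, so small bearing errors translate into large localization errors along the corridor axis.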

6.3. Performance evaluation with a combination of stereo and monocular cameras

This section aims to show the ability of the proposed method to combine information from stereo and monocular cameras

simultaneously. We will focus our analysis on the lab scenario. For that purpose, the sequences employed in Section 6.1 were also used for these tests. The scenes were recorded using three stereo cameras. However, in order to test the combined use of stereo and monocular cameras, we ran the tests while discarding stereo information from some of the cameras at a time, i.e., with some of the stereo cameras playing the role of monocular cameras. We performed a comprehensive test of all possible combinations of stereo and monocular roles.

In the previous tests, the use of the γ parameters (Eq. (31)) was irrelevant because only one type of sensor was employed. However, in this case, the γ parameters are employed to set the relative importance of one sensor over another. For that purpose, we performed several tests with the sequences in order to analyze the variation of the system behavior depending on γ. The tests were performed using a total of 50 particles for each person and repeated 10 times. First, we tested the system using two stereo cameras and one monocular camera. For that configuration, we tested two sets of combinations of γst and γm. First, we assumed that the stereo cameras were the most reliable type of sensor (i.e., γst = 1.0) and varied γm in the range [0.1, 1]. Then, we tested the opposite behavior, i.e., we set the monocular cameras as the most reliable type of sensor (γm = 1.0) and tested different values of γst. The results are shown on the left side of Fig. 11. As can be seen, assuming that the two stereo cameras are the most reliable sensors provides good results independently of the value of γm. However, the opposite is not true. The lowest error was obtained for γst = 1.0 and γm = 0.6. Then, we repeated the test using only one stereo camera and two monocular cameras. The results are shown at the right side of Fig. 11. In that case, the best result was obtained for γm = 0.4 and γst = 1.0. The results clearly show that stereo cameras are more reliable tracking sensors than monocular ones.
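To make the role of these weights concrete, the sketch below shows one plausible confidence-weighted fusion rule. It only illustrates how a per-sensor-type factor γ and a per-view confidence can down-weight unreliable views; it is not the paper's Eq. (31), and all function names and numerical values are ours (the γ values correspond to the best-performing configuration reported above for two stereo cameras and one monocular camera).

```python
import math

def fuse(views, gamma):
    """views: list of (sensor_type, likelihood, confidence) for one particle;
    gamma: dict mapping sensor_type -> relative reliability in (0, 1]."""
    weights = [gamma[t] * c for t, _, c in views]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # no view provides usable information about this particle
    # Weighted geometric mean of the per-view likelihoods.
    log_l = sum(w * math.log(max(l, 1e-12)) for (t, l, c), w in zip(views, weights))
    return math.exp(log_l / total)

gammas = {"stereo": 1.0, "monocular": 0.6}
views = [("stereo", 0.9, 1.0), ("stereo", 0.8, 0.3), ("monocular", 0.4, 1.0)]
print(f"fused likelihood = {fuse(views, gammas):.3f}")
```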

To better understand the results obtained, we performed a deeper analysis of one of the sequences tested. In particular, we will focus on the scene shown in Fig. 12(b). The graphs in Fig. 12(a) show the results for each set of tests performed on the sequence using the best configuration of the γ parameters. Note that it is not until frame #130 that the first person enters the room and error computation can be done. The leftmost graph shows the mean tracking error in each frame when the scene is processed using stereo information from the three cameras. As can be noticed, the error is small in every frame, and never greater than 0.2 m.

The second graph shows the results when one camera is configured as monocular and the rest as stereo cameras. There are three possible combinations of roles. All of them were tested and are shown with lines of different colors. One can clearly see that, in general, the error is very similar to the one obtained for three stereo cameras. However, there are instances in which the


Fig. 9. Tracking results in the first scenario using five monocular cameras.


tracking error is higher than 0.2 m, especially in the case of the blue line. It seems that, for that distribution of stereo-and-monocular camera roles, the error obtained is higher than in the rest of the configurations. The reason is that there are frames of the sequence in which people are only visible to the monocular camera. In that situation, the tracking cannot be properly performed and the error increases. However, moments later the person becomes visible in more than one camera and the tracking is recovered.

The third graph shows the results when two of the cameras are configured as monocular and the other one as stereo. Again, the three possible combinations were tested and plotted as different graphs. As we can see, the tracking error obtained depends strongly on which configuration is employed. The configuration shown in red seems to be the most promising one, while the one represented by the green line shows the highest error. As in the previous case, these results show that when occlusions occur, tracking is less reliable in monocular cameras.


Fig. 10. Comparison of tracking performance in the corridor. The top row shows the tracking results using stereo and the bottom row the results when monocular cameras are employed. The graph shows the tracking error in each frame for the two camera configurations.

Fig. 11. Analysis of the tracking error for different values of the γst and γm parameters; in each panel the error is shown on a color scale ranging from 0.09 m to 1.2 m. See text for details.


The rightmost graph shows the results obtained when all the cameras are configured as monocular. In that case, the system is unable to keep reliable track of the people in the scene. A minimum of four cameras is required to obtain reliable results in this scenario, as we saw in Section 6.2.

Fig. 12(b) shows frames from the scene when three monocular cameras are employed. We can see that the three people enter the scene and move, causing occlusions. Let us first pay attention to frames #265–361. We can see that the people in blue and red cross and the monocular tracker swaps their identities. This fact is reflected in the rightmost graph of Fig. 12(a) by a large increase in the tracking error. Later, around frame #400, they cross again and their identities are swapped again. As regards the person in green, we can observe that the tracker also has difficulties between frames #353 and #394. In that part of the sequence, the person in green jumps over the chessboard panel and between the two other people. During that part of the sequence, the person is occluded in the first and third cameras, so the tracker tries to determine his position using only the second camera. As can be

seen, it takes a while until the tracker recovers the person's location.

Another point worth noting is related to the scene illumination. We have found that the algorithm's main difficulty is related to under-segmentation. This occurs when the target is in front of a background of the same color, so that the background subtraction technique is unable to extract its silhouette. Segmentation errors are managed by our foreground model (Section 5.2.1). However, if under-segmentation is too severe in a camera, it might raise a false miss. If that occurs in only a small subset of the cameras, the target can be successfully tracked using the rest of the cameras. However, if under-segmentation occurs in the majority of the cameras, the target might be lost. Fortunately, this problem only occurs with monocular cameras. When depth information is available, the background segmentation algorithm employed [11] is capable of extracting the target's silhouette even if the target is in front of a background of the same color. The over-segmentation problem, which normally takes place when the scene suffers from a sudden


Fig. 12. (a) Tracking error of the scene in (b) for all possible combinations of stereo and monocular cameras. (b) Frames of a scene showing the tracking results when the scene is processed using three monocular cameras.


change of illumination, is less frequent in our case. The reason is that the background subtraction method employed is relatively robust to illumination changes, since it gives more importance to chroma and depth information than to illumination.
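As a toy illustration of why depth alleviates under-segmentation, the sketch below classifies a pixel as foreground if either its chroma or its depth departs from the background model. It is our own simplified example in the spirit of the color-plus-depth segmentation of [11], not that algorithm, and the thresholds are arbitrary.

```python
def is_foreground(chroma, depth, bg_chroma, bg_depth,
                  chroma_thresh=20.0, depth_thresh=0.3):
    """chroma: (Cb, Cr) pair; depth in meters (or None if unavailable);
    bg_chroma, bg_depth: background model values for the same pixel."""
    chroma_diff = max(abs(chroma[0] - bg_chroma[0]), abs(chroma[1] - bg_chroma[1]))
    depth_diff = 0.0
    if depth is not None and bg_depth is not None:
        depth_diff = abs(depth - bg_depth)
    return chroma_diff > chroma_thresh or depth_diff > depth_thresh

# Camouflage case: same chroma as the background, but 1.5 m closer.
print(is_foreground((110.0, 140.0), 2.0, (110.0, 140.0), 3.5))   # True (depth cue)
print(is_foreground((110.0, 140.0), None, (110.0, 140.0), None))  # False (monocular)
```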

7. Conclusions

This work has proposed a novel particle filter for tracking multiple targets using multiple and heterogeneous cameras. For that purpose, each type of camera defines both an observation and a confidence model. The latter models the confidence of the camera in determining the likelihood of a particle. The particles are evaluated independently in each camera, thus enabling a parallel implementation. In a final step, information is fused taking confidence into account. The confidence model proposed considers occlusion, which is calculated using an occlusion map in each camera. Cameras in which the target is fully visible are assigned a higher confidence level than those in which the target is occluded.

The algorithm proposed allows for tracking people using either monocular cameras, stereo cameras or a combination of both types. The tests performed in several environments show that the tracking error of our system is low, that it is able to recover from both partial and total occlusions, and that stereo cameras are more reliable for tracking purposes than monocular cameras.

References

[1] F. Aherne, N. Thacker, P. Rockett, The Bhattacharyya metric as an absolute similarity measure for frequency coded data, Kybernetica 32 (1997) 1–7.
[2] J. Black, T. Ellis, P. Rosin, Multiview image surveillance and tracking, in: Workshop on Motion and Video Computing, 2002, pp. 169–174.
[3] M.Z. Brown, D. Burschka, G.D. Hager, Advances in computational stereo, IEEE Transactions on Pattern Analysis and Machine Intelligence (2003) 993–1008.
[4] C. Hue, J.L. Cadre, P. Perez, Sequential Monte Carlo methods for multiple target tracking and data fusion, IEEE Transactions on Signal Processing 50 (2002) 309–325.
[5] D. Comaniciu, V. Ramesh, P. Meer, Kernel-based object tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence (2003) 564–577.
[6] J. Deutscher, A. Blake, I. Reid, Articulated body motion captured by annealed particle filtering, IEEE Conference on Computer Vision and Pattern Recognition 2 (2000) 126–133.
[7] W. Du, J. Piater, Tracking by cluster analysis of feature points and multiple particle filters, in: IEEE Conference on Advanced Video and Signal Based Surveillance, 2005, pp. 165–170.
[8] W. Du, J. Piater, Multi-camera people tracking by collaborative particle filters and principal axis-based integration, in: Lecture Notes in Computer Science, vol. 4843, 2007, pp. 365–374.
[9] F. Fleuret, J. Berclaz, R. Lengagne, P. Fua, Multicamera people tracking with a probabilistic occupancy map, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (2008) 267–282.
[10] J.D. Foley, A. van Dam, Fundamentals of Interactive Computer Graphics, Addison-Wesley, Reading, MA, 1982.
[11] M. Harville, G. Gordon, J. Woodfill, Foreground segmentation using adaptive mixture models in color and depth, in: IEEE Workshop on Detection and Recognition of Events in Video, 2001, pp. 3–11.
[12] Intel, OpenCV: open source computer vision library, <http://www.intel.com/research/mrl/opencv/>.
[13] M. Isard, A. Blake, CONDENSATION—conditional density propagation for visual tracking, International Journal of Computer Vision 29 (1998) 5–28.
[14] B. Han, S.-W. Joo, L.S. Davis, Probabilistic fusion tracking using mixture kernel-based Bayesian filtering, in: IEEE 11th International Conference on Computer Vision, 2007, pp. 1–8.
[15] T. Kailath, The divergence and Bhattacharyya distance measures in signal selection, IEEE Transactions on Communication Technology 15 (1967) 52–60.
[16] J. Kang, I. Cohen, G. Medioni, Tracking objects from multiple and moving cameras, IEE Intelligent Distributed Surveillance Systems (2004) 31–35.


[17] S.M. Khan, M. Shah, A multiview approach to tracking people in crowded scenes using a planar homography constraint, in: Lecture Notes in Computer Science, vol. 3954, 2006, pp. 133–146.
[18] Z. Khan, T. Balch, F. Dellaert, MCMC-based particle filtering for tracking a variable number of interacting targets, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005) 1805–1819.
[19] K. Kim, L.S. Davis, Multi-camera tracking and segmentation of occluded people on ground plane using search-guided particle filtering, ECCV (3) (2006) 98–109.
[20] G. Kitagawa, Monte Carlo filter and smoother for non-Gaussian nonlinear state space models, Journal of Computational and Graphical Statistics 5 (1996) 1–25.
[21] J. Krumm, S. Harris, B. Meyers, B. Brumitt, M. Hale, S. Shafer, Multi-camera multi-person tracking for EasyLiving, in: Third IEEE International Workshop on Visual Surveillance, 2000, pp. 3–10.
[22] O. Lanz, Approximate Bayesian multibody tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence (2006) 1436–1449.
[23] K.-C. Lien, C.-L. Huang, Multiview-based cooperative tracking of multiple human objects, EURASIP Journal on Image and Video Processing 2008 (2008) 13 pages, doi:10.1155/2008/253039.
[24] R. Lundh, L. Karlsson, A. Saffiotti, Dynamic self-configuration of an ecology of robots, in: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2007, pp. 3403–3409.
[25] A. Mittal, L.S. Davis, M2 tracker: a multi-view approach to segmenting and tracking people in a cluttered scene, International Journal of Computer Vision 53 (2001) 189–203.
[26] R. Munoz Salinas, A Bayesian plan-view map based approach for multiple-person detection and tracking, Pattern Recognition 41 (2008) 3665–3676.
[27] R. Munoz Salinas, E. Aguirre, M. García-Silvente, People detection and tracking using stereo vision and color, Image and Vision Computing 25 (2007) 995–1007.
[28] R. Munoz Salinas, M. García-Silvente, R. Medina-Carnicer, Adaptive multimodal stereo people tracking without background modelling, Journal of Visual Communication and Image Representation 19 (2008) 75–91.
[29] K. Okuma, A. Taleghani, D. De Freitas, J.J. Little, D.G. Lowe, A boosted particle filter: multi target detection and tracking, in: Lecture Notes in Computer Science, vol. 3021, 2004, pp. 28–39.
[30] C. Soto, B. Song, A.K. Roy-Chowdhury, Distributed multi-target tracking in a self-configuring camera network, in: Computer Vision and Pattern Recognition (CVPR), 2009, pp. 1486–1493.
[31] J. Vermaak, A. Doucet, P. Perez, Maintaining multimodality through mixture tracking, in: Ninth IEEE International Conference on Computer Vision, 2003, pp. 1110–1116.
[32] J. Vermaak, S.J. Godsill, P. Perez, Monte Carlo filtering for multi-target tracking and data association, IEEE Transactions on Aerospace and Electronic Systems 41 (2005) 309–332.
[33] Y.-D. Wang, J.-K. Wu, Adaptive particle filter for data fusion of multiple cameras, Journal of VLSI Signal Processing 49 (2007) 363–376.
[34] Z. Wu, N.I. Hristov, T.L. Hedrick, T.H. Kunz, M. Betke, Tracking a large number of objects from multiple views, in: International Conference on Computer Vision (ICCV), 2009.
[35] T. Zhao, M. Aggarwal, R. Kumar, H. Sawhney, Real-time wide area multi-camera stereo tracking, in: Computer Vision and Pattern Recognition (CVPR), 2005, pp. 976–983.
[36] S. Zhou, R. Chellappa, B. Moghaddam, Visual tracking and recognition using appearance-adaptive models in particle filters, IEEE Conference on Pattern Recognition 13 (2004) 1491–1506.

About the Author—R. MUNOZ-SALINAS received the Bachelor degree in Computer Science and the Ph.D. degree from the University of Granada (Spain), the latter in 2006. Since 2006 he has been working with the Department of Computing and Numerical Analysis of the University of Cordoba, where he is currently an assistant professor. His research is focused mainly on Mobile Robotics, Human-Robot Interaction, Artificial Vision and Soft Computing techniques applied to Robotics.

About the Author—R. MEDINA-CARNICER received the Bachelor degree in Mathematics from the University of Sevilla (Spain) and the Ph.D. in Computer Science from the Polytechnic University of Madrid (Spain) in 1992. Since 1993 he has been a lecturer in Computer Vision at the University of Cordoba (Spain). His research is focused on edge detection, evaluation of computer vision algorithms and pattern recognition.

About the Author—F.J. MADRID-CUEVAS received the Bachelor degree in Computer Science from the University of Malaga (Spain) and the Ph.D. degree from the Polytechnic University of Madrid (Spain), in 1995 and 2003, respectively. Since 1996 he has been working with the Department of Computing and Numerical Analysis of the University of Cordoba, where he is currently an assistant professor. His research is focused mainly on image segmentation, 2-D object recognition and evaluation of computer vision algorithms.

About the Author—A. CARMONA-POYATO received his title of Agronomic Engineering and the Ph.D. degree from the University of Cordoba (Spain), in 1986 and 1989, respectively. Since 1990 he has been working with the Department of Computing and Numerical Analysis of the University of Cordoba as a lecturer. His research is focused on image processing and 2-D object recognition.