
Left-Luggage Detection using Bayesian Inference

Fengjun Lv Xuefeng Song Bo Wu Vivek Kumar Singh Ramakant Nevatia

Institute for Robotics and Intelligent Systems, University of Southern California

Los Angeles, CA 90089-0273
{flv, xsong, bowu, viveksin, nevatia}@usc.edu

Abstract

This paper presents a system that incorporates low-level object tracking and high-level event inference to solve the video event recognition problem. First, the tracking module combines the results of blob tracking and human tracking to provide the trajectory as well as the basic type (human or non-human) of each detected object. The trajectories are then mapped to 3D world coordinates, given the camera model. Useful features such as speed, direction and distance between objects are computed and used as evidence. Events are represented as hypotheses and recognized in a Bayesian inference framework. The proposed system has been successfully applied to many event recognition tasks in real-world environments. In particular, we show results of detecting the left-luggage event on the PETS 2006 dataset.

1. Introduction

Video event recognition has recently been an active topic in the computer vision and artificial intelligence communities. It is a difficult task because it consists not only of low-level vision processing tasks such as tracking and object detection, which by themselves are difficult problems, but also of high-level event inference, which has to handle the ambiguity in event definition, the variations in execution style and duration, and the fusion of multiple sources of information.

Another major difficulty is the lack of event data and evaluation criteria. The acquisition of event data is hard either because events of interest (especially unusual events) rarely happen in real life, or because recording of such events is prohibited in many situations. Many researchers therefore have to rely on staged data, which requires a lot of effort to make it look realistic. This difficulty impedes the evaluation and comparison of different event recognition algorithms.

The PETS 2006 workshop serves the need for event data by providing a publicly available dataset that contains event scenarios in a real-world environment. The workshop focuses on detecting the left-luggage event in a railway station. The event has a clear definition consisting of the following three rules: (1) Contextual rule: a luggage is owned and attended by a person who enters the scene with the luggage, until such point that the luggage is not in physical contact with the person. (2) Spatial rule: a luggage is unattended when the owner is farther than 3 meters from the luggage. (3) Temporal rule: if a luggage has been left unattended by the owner for a period of 30 consecutive seconds, in which time the owner has not re-attended to the luggage, nor has the luggage been attended to by a second party, the alarm event is triggered.

Although detection of left-luggage is the only task considered in PETS 2006, the event contains large variations in terms of the types of luggage and the complexity of scenarios. The dataset includes five types of luggage: {briefcase, suitcase, 25 liter rucksack, 70 liter backpack, ski gear carrier}. The size and shape of these types differ, which increases the difficulty of luggage detection and tracking. The dataset considers the following seven scenarios with increasing complexity. (1) Person 1 with a rucksack loiters before leaving the rucksack unattended. (2) Persons 1 and 2 enter the scene from opposite directions. Person 1 places a suitcase on the ground, before persons 1 and 2 leave together without the suitcase. (3) Person 1 temporarily places his briefcase on the ground before picking it up again and leaving. (4) Person 1 places a suitcase on the ground. Person 2 arrives and meets with person 1. Person 1 leaves the scene without his suitcase. (5) Person 1 with a ski gear carrier loiters before leaving the luggage unattended. (6) Persons 1 and 2 enter the scene together. Person 1 places a rucksack on the ground, before persons 1 and 2 leave together without the rucksack. (7) Person 1 with a suitcase loiters before leaving the suitcase unattended. During this event five other persons move in close proximity to the suitcase. Accordingly, the event recognition system should have the flexibility to accommodate these variations.

We present an event recognition system that tackles the challenges in tracking and event recognition problems. For tracking, a combination of a blob tracker (see Section 2.1) and a human tracker (see Section 2.2) is used to provide the trajectory of each human and carried object. For event recognition, each event is represented and recognized in a Bayesian inference framework (see Section 3.1). In particular, the algorithm is applied to detect the left-luggage event on the PETS 2006 dataset (see Section 3.2). The overview diagram of our system is shown in Figure 1.

Figure 1: The overview diagram of our system

Note that in PETS 2006, each scenario is filmed from multiple cameras. We only use the third camera because it provides the best viewpoint, and the results based on this single viewpoint are satisfactory.

Many efforts have been made to solve the human tracking problem. As it is hard to provide an adequate description of the whole literature, we only list some works that are closely related to ours. The observations of human hypotheses may come from various cues. Some use the result of background subtraction as observations and try to fit multiple human hypotheses to explain the foreground blobs [12]. As the hypothesis space is usually of high dimension, sampling algorithms such as MCMC [12] or dynamic programming algorithms such as EM [7] are applied. Other methods, e.g. [2], build deformable silhouette models based on edge features to detect pedestrians and associate the detection responses to form tracks. These two types of approaches are complementary.

Proposed methods for video event recognition can be divided into three sub-categories based on the underlying features they use: (1) those based on 2-D image features such as optical flow [3], body shape/contour [11], and space-time interest points [4]; (2) those based on 2-D trajectories of tracked objects such as hands, legs or the whole body [8]; and (3) those based on 3-D body pose acquired from motion capture or a 3-D pose tracking system [5]. None of the above methods, however, involves the manipulation of objects such as luggage.

2. The Tracking Module

We use two components for the tracking task. A Kalman filter based blob tracker, which provides trajectories of all foreground objects including humans and carried objects, is described in Section 2.1. A human body detection based human tracker is described in Section 2.2. We describe the combination of both trackers in Section 2.3.

2.1 Kalman Filter based Blob Tracking

[Figure 2 block diagram components: Background Subtraction, Input Frame (t), Foreground Blobs (t), Object Tracks (t-1), Estimated Object Tracks (t), Track-Blob Association Matrix, New Track Initialization, Track Ending, Object Blob Split, Object Blob Merge, Object Tracks (t), Object Knowledge, Scene and Camera Knowledge]

Figure 2: An overview of the blob tracker

Figure 2 shows an overview of our blob tracking algorithm. We learn the background model from the first five hundred frames and apply it to detect the foreground pixels at each frame. Connected foreground pixels form foreground blobs. At each frame $t$, a blob $B_t^i$ contains the following information:

• Size and location: for simplicity, we use a rectangular bounding box ($x_{\min}$, $y_{\min}$, width, height)

• Appearance: we use the color histogram of the blob

Ideally, one blob corresponds to one real object and vice versa. But we have to consider situations such as one object splitting into several blobs or multiple objects merging into one blob. For a blob $B_t^i$ and an object $O_t^j$, we assign an association value based on the overlap between the bounding boxes of $O_t^j$ and $B_t^i$:

$$Assoc(B_t^i, O_t^j) = \begin{cases} 1 & \text{if } |B_t^i \cap P_t^j| \,/\, \min(|B_t^i|, |P_t^j|) > \theta \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

where $P_t^j$ is the predicted bounding box of $O_t^j$ from the previous frame using the Kalman filter, $B_t^i \cap P_t^j$ is the intersection of the two bounding boxes, and $\theta$ is a threshold between 0 and 1. If $Assoc(B_t^i, O_t^j)$ is one, we say $B_t^i$ and $O_t^j$ are matched.

Figure 3: Blob tracking results

At time $t$, assume we have $m$ blobs and $n$ predicted objects. For all pairs of blob and object, we construct an $m \times n$ association matrix. The following are possible situations that we need to consider: (1) If the blob-object match is one-to-one, we simply update the object with the blob. (2) If a blob has no matched object, a new object is created. (3) If an object has no matched blob for several frames, the object is removed. (4) If an object has multiple matched blobs, the blobs are merged into a bounding box of the object. (5) If a blob has multiple matched objects, the blob is segmented based on the appearance (color histogram) of each matched object.
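To make the association step concrete, the following sketch (illustrative, not the authors' code) computes the association value in the min-normalized form reconstructed in Eq. 1 above, with an assumed threshold of 0.5, and builds the m x n association matrix for one frame. The `Box` class and function names are our own.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned bounding box (x_min, y_min, width, height)."""
    x: float
    y: float
    w: float
    h: float

    def area(self) -> float:
        return self.w * self.h

def intersection_area(a: Box, b: Box) -> float:
    """Area of the overlap between two boxes (0 if disjoint)."""
    ix = max(0.0, min(a.x + a.w, b.x + b.w) - max(a.x, b.x))
    iy = max(0.0, min(a.y + a.h, b.y + b.h) - max(a.y, b.y))
    return ix * iy

def assoc(blob: Box, predicted: Box, theta: float = 0.5) -> int:
    """Association value of Eq. 1 (as reconstructed): 1 if the overlap between
    the blob and the Kalman-predicted object box is large relative to the
    smaller of the two boxes."""
    overlap = intersection_area(blob, predicted)
    return int(overlap / min(blob.area(), predicted.area()) > theta)

def association_matrix(blobs, predicted_objects, theta=0.5):
    """m x n matrix of association values for all blob/object pairs."""
    return [[assoc(b, p, theta) for p in predicted_objects] for b in blobs]

# Example: one blob overlapping the first of two predicted objects.
blobs = [Box(10, 10, 40, 80)]
objects = [Box(12, 15, 38, 78), Box(200, 50, 30, 60)]
print(association_matrix(blobs, objects))   # [[1, 0]]
```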

Luggage is detected based on its mobility: it seldom moves after the start of its track. Some detected humans and luggage are shown in Figure 3.
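A minimal sketch of this mobility cue, assuming a track is stored as a list of bounding-box centers in image coordinates; the displacement threshold is an illustrative value, not one from the paper:

```python
import math

def is_luggage(track_centers, max_displacement=10.0):
    """Classify a tracked object as luggage if it has barely moved since the
    start of its track (mobility cue only; humans are ruled out separately by
    the human tracker in the combined system)."""
    if len(track_centers) < 2:
        return False
    x0, y0 = track_centers[0]
    return all(math.hypot(x - x0, y - y0) <= max_displacement
               for x, y in track_centers[1:])

print(is_luggage([(100, 200), (101, 201), (99, 199)]))   # True: nearly static
print(is_luggage([(100, 200), (140, 230), (190, 260)]))  # False: moving object
```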

2.2 Detection based Human Tracking

For shape based human tracking, we take the single frame human detection responses as the observations of human hypotheses. Tracking is done in 2D with a data association style method. The detector is a generalization of the body part detector proposed by Wu et al. in [9]. We do not rely on skin color or face, hence our method can deal with rear views. The algorithm of [9] has been extended in [10] to track multiple humans based on the detectors. Our tracking algorithm is a simplified version of [10].

Figure 4: Human tracking results

2.2.1 Multi-View Human Body Detection

In [9], four part detectors are learned for detecting the full body, head-shoulder, torso, and legs. We only use the full-body detector here. Two detectors are learned: one for the left profile view and one for the frontal/rear view. A third detector, for the right profile view, is generated by flipping the left profile detector horizontally. Nested cascade detectors are learned by boosting edgelet feature based weak classifiers, as in [9]. The training set contains 1,700 positive samples for the frontal/rear view, 1,120 for the left profile view, and 7,000 negative images. The positive samples are collected from the Internet and the MIT pedestrian set [6]. The negative images are all from the Internet. The training set is fully independent of the test sequences, and the detectors learned are for generic situations. For detection, the input image is scanned by all three detectors and the union of their responses is taken as the multi-view detection result.

2.2.2 Multiple Human Tracking

Humans are tracked in 2D by associating the frame detection responses. This 2D tracking method is a simplified version of [10]. In [10], the detection responses come from four part detectors and a combined detector. To start a trajectory, an initialization confidence is calculated from several consecutive responses, which correspond to one human hypothesis, based on cues from color, shape, and position. If the initialization confidence is larger than a threshold, a trajectory is started. To track the human, data association with the combined detection responses is attempted first; if this fails, data association with the part detection responses is attempted; if this fails again, a color based mean-shift tracker [1] is used to follow the person. The strategy for trajectory termination is similar to that for initialization: a termination confidence is calculated when an existing trajectory has been lost by the detector for several time steps, and if it is larger than a threshold, the trajectory is terminated.
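The exact confidence measures of [10] are not reproduced here; as a hedged sketch, the start/stop logic can be written with placeholder confidence functions (the thresholds and window length below are illustrative, not values from the paper):

```python
def should_start_track(recent_responses, init_confidence,
                       theta_init=0.5, T=5):
    """Start a trajectory when T consecutive detection responses support one
    human hypothesis and the initialization confidence exceeds a threshold.
    `init_confidence` stands in for the color/shape/position cue combination
    of [10], which is not reproduced here."""
    return (len(recent_responses) >= T
            and init_confidence(recent_responses) > theta_init)

def should_end_track(frames_since_last_detection, term_confidence, trajectory,
                     theta_term=0.5, T=5):
    """Terminate a trajectory that has been lost by the detector for T frames
    and whose termination confidence exceeds a threshold."""
    return (frames_since_last_detection >= T
            and term_confidence(trajectory) > theta_term)

# Illustrative use with trivial stand-in confidence functions.
starts = should_start_track([0.9, 0.8, 0.85, 0.9, 0.95],
                            init_confidence=lambda rs: sum(rs) / len(rs))
ends = should_end_track(7, term_confidence=lambda tr: 0.8, trajectory=[])
print(starts, ends)  # True True
```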

In this work, we do not use the combined detection method described in [9]. Since the occlusions are not strong in this dataset, a local feature based full-body detector gives satisfactory results. Fig. 4 shows some sample frames of the tracking results.

Figure 5: Tracking results of individual trackers and the combined tracker. (a), (c), (e): Tracking results on the same frame using the blob tracker alone, the human tracker alone, and the combined tracker. The combined tracker can separate two persons from a merged blob and locate the luggage as well. (b), (d), (f): Another example. The combined tracker confirms that the object in the middle is not human. It also tracks the person that is missed by the human tracker due to the large tilt angle.

2.3 The Combination of the Blob Tracker and the Human Tracker

In most cases, the blob tracker does a good job of tracking objects and locating luggage. However, when multiple humans appear merged from the beginning, the tracker cannot separate them well, and sometimes the tracker mis-identifies the type of the tracked objects based only on their mobility. See Figure 5(a) and (b) for examples of these issues.

On the other hand, the human tracker does not track objects other than humans. Also, although the learned detectors work over a fairly wide range of tilt angles, when the tilt angle is too large the detectors cannot find the human reliably. These issues can be clearly seen in Figure 5(c) and (d).

We can overcome the above limitations of the blob tracker and the human tracker by combining their results, so that if one trajectory from the human tracker largely overlaps one trajectory from the blob tracker, the first trajectory supersedes the second one. This is because the human tracker is based on human shape, so it is more reliable. The blob tracker, on the other hand, enhances the results by including objects that are missed by the human tracker. The combined results of the above examples are shown in Figure 5(e) and (f).
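As an illustration of this combination rule (a minimal sketch, not the authors' implementation; the per-frame overlap measure and the 0.5 threshold are assumptions):

```python
def boxes_overlap(a, b):
    """True if two (x, y, w, h) boxes intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def trajectory_overlap(traj_a, traj_b):
    """Fraction of shared frames in which the two bounding boxes overlap.
    Each trajectory maps frame -> (x, y, w, h)."""
    common = set(traj_a) & set(traj_b)
    if not common:
        return 0.0
    hits = sum(1 for f in common if boxes_overlap(traj_a[f], traj_b[f]))
    return hits / len(common)

def combine(human_tracks, blob_tracks, min_overlap=0.5):
    """Human-tracker trajectories supersede blob-tracker trajectories they
    largely overlap; remaining blob trajectories (e.g. luggage) are kept."""
    kept_blobs = [b for b in blob_tracks
                  if not any(trajectory_overlap(h, b) >= min_overlap
                             for h in human_tracks)]
    return list(human_tracks) + kept_blobs

# Example: one human track supersedes the blob track it overlaps.
human = [{1: (10, 10, 40, 80), 2: (12, 10, 40, 80)}]
blobs = [{1: (11, 12, 38, 78), 2: (13, 12, 38, 78)},     # same person, superseded
         {1: (200, 150, 20, 30), 2: (200, 150, 20, 30)}]  # luggage, kept
print(len(combine(human, blobs)))  # 2
```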

3. The Event Recognition Module

We first describe our event recognition algorithm, followed by the details of how this algorithm is applied to detect the left-luggage event on the PETS 2006 dataset.

3.1 Event Recognition using Bayesian Inference

Our event recognition algorithm takes object trajectories (provided by the underlying detection/tracking modules) as its only observations. The algorithm assumes that the type of each object is known. The basic object types include human, carried object, vehicle and scene object (e.g. door, stairs). This information is either automatically provided by the detection/tracking modules or specified by system users.

The trajectory is first smoothed using a median filter to remove abrupt changes. Useful physical properties such as the position, speed and direction of a moving object and the distance between two objects are inferred from the object trajectories. Since these physical properties are measured in world coordinates, 2D object trajectories need to be mapped to 3D world coordinates first. For an object on the ground plane, we assume that the vertical coordinate of its lowest point is zero. Given the camera model, we use the center of the bottom of the image bounding box as the object's lowest point and map this point to a 3D position on the ground plane.
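A minimal sketch of this 2D-to-3D mapping, assuming the camera model has been reduced to an image-to-ground homography (the matrix below is illustrative, not the PETS 2006 calibration):

```python
import numpy as np

def bottom_center(box):
    """Bottom-center of an image bounding box (x, y, w, h): the point assumed
    to touch the ground plane, in homogeneous image coordinates."""
    x, y, w, h = box
    return np.array([x + w / 2.0, y + h, 1.0])

def image_to_ground(point_h, H):
    """Map a homogeneous image point to ground-plane coordinates (Z = 0) via
    an image-to-ground homography H derived from the camera model."""
    X = H @ point_h
    return X[:2] / X[2]

# Illustrative homography (made-up values, not the dataset calibration).
H = np.array([[0.02, 0.0,  -5.0],
              [0.0,  0.05, -8.0],
              [0.0,  0.001, 1.0]])

foot = bottom_center((320, 180, 60, 140))   # person box in pixels
print(image_to_ground(foot, H))             # ground-plane position in meters
```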

System users have to provide an event model for each event in advance. A successful event model has to handle the ambiguity in the event definition (which implies that we should define an event in a probabilistic domain instead of a logical one) and incorporate multiple cues into a computational framework.

Due to these considerations, we use Bayesian inference as the event modeling and recognition framework: events and related cues are considered as hypotheses and evidence, respectively; competing events are treated as competing hypotheses. For example, regarding the relationship between a moving object A and a zone B in terms of the positions of A and B, there are four possibilities (hypotheses), corresponding to four competing events: {$H_1$: A is outside of B, $H_2$: A is entering B, $H_3$: A is inside B, $H_4$: A is exiting B}. We consider the following two pieces of evidence: {$E_1$: the distance between A and B some time (e.g. 0.5 second) ago, $E_2$: the distance between A and B now}. The complete model also includes the prior probabilities of each hypothesis and the conditional probabilities of each piece of evidence given the hypotheses. In the above example, we can set the prior probabilities of all hypotheses to be equal ($P(H_i) = 0.25$, $i = 1, \ldots, 4$) and use the following conditional probabilities:

$P(E_1 = 0 \mid H_i)$: {0, 0, 1, 1}
$P(E_1 > 0 \mid H_i)$: {1, 1, 0, 0}
$P(E_2 = 0 \mid H_i)$: {0, 1, 1, 0}
$P(E_2 > 0 \mid H_i)$: {1, 0, 0, 1}

The meaning of this table is clear. Take the hypothesis $H_2$, "A is entering B", for example: $E_1 > 0$ and $E_2 = 0$ mean that A was outside of B some time ago and A is inside B now. Given the evidence, the probability of $H_2$ is computed using Bayes' rule:

$$P(H_2 \mid E_1, E_2) = \frac{P(E_1, E_2 \mid H_2)\,P(H_2)}{\sum_{i=1}^{4} P(E_1, E_2 \mid H_i)\,P(H_i)} \qquad (2)$$

If we assume each piece of evidence is conditionally independent given the hypothesis, Eq. 2 can be rewritten as:

$$P(H_2 \mid E_1, E_2) = \frac{P(E_1 \mid H_2)\,P(E_2 \mid H_2)\,P(H_2)}{\sum_{i=1}^{4} P(E_1 \mid H_i)\,P(E_2 \mid H_i)\,P(H_i)} \qquad (3)$$

Note that the above example has the simplest form of a probability distribution function: a threshold function (i.e. the distance is quantized to only two values, "= 0" and "> 0") and the probability is either 0 or 1 (a deterministic case). For more complex cases, the following distribution functions are provided in our system: {Histogram, Threshold, Uniform, Gaussian, Rayleigh, Maxwell}. The parameters of the distribution functions are learned from training data. If such data are not available, the parameters are specified based on the user's knowledge of the event.

If a hypothesis $H$ has no competing counterparts, we consider $\neg H$, the opposite of the hypothesis, and use the following equation to compute the posterior probability of the hypothesis given the evidence:

$$P(H \mid E_1, \ldots, E_n) = \frac{P(H)\prod_{i=1}^{n} P(E_i \mid H)}{P(H)\prod_{i=1}^{n} P(E_i \mid H) + P(\neg H)\prod_{i=1}^{n} P(E_i \mid \neg H)} \qquad (4)$$

When $P(E_i \mid \neg H)$ is unknown or hard to estimate, we use a default value of 0.5.
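To make the inference step concrete, the following sketch (illustrative, not the authors' implementation) evaluates the posterior of Eq. 3 for the zone example above, using the deterministic 0/1 conditional probability tables as reconstructed; the variable names are our own.

```python
HYPOTHESES = ["outside", "entering", "inside", "exiting"]
PRIORS = {h: 0.25 for h in HYPOTHESES}

# Conditional probability tables for the zone example, in the hypothesis
# order (outside, entering, inside, exiting). "dist_before" is the distance
# to the zone a short time ago, "dist_now" the current distance; each is
# quantized to "zero" or "positive".
CPT = {
    ("dist_before", "zero"):     dict(zip(HYPOTHESES, [0, 0, 1, 1])),
    ("dist_before", "positive"): dict(zip(HYPOTHESES, [1, 1, 0, 0])),
    ("dist_now", "zero"):        dict(zip(HYPOTHESES, [0, 1, 1, 0])),
    ("dist_now", "positive"):    dict(zip(HYPOTHESES, [1, 0, 0, 1])),
}

def posterior(evidence):
    """Eq. 3: naive-Bayes posterior over the competing hypotheses, assuming
    the pieces of evidence are conditionally independent given the hypothesis."""
    scores = {}
    for h in HYPOTHESES:
        p = PRIORS[h]
        for e in evidence:               # e = (variable, quantized value)
            p *= CPT[e][h]
        scores[h] = p
    total = sum(scores.values())
    return {h: s / total for h, s in scores.items()}

# Object was away from the zone 0.5 s ago and touches it now -> "entering".
print(posterior([("dist_before", "positive"), ("dist_now", "zero")]))
```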

3.2 Left-Luggage Detection

For the specific task of detecting left-luggage, as described in the introduction, the following events are modeled.

• Drop off luggage (denoted as D): Three pieces of evidence are used.

  - $E_1$: The luggage did not appear a short time (e.g. 0.5 second) ago.

  - $E_2$: The luggage shows up now.

  - $E_3$: The distance between the person and the luggage now is less than some threshold (e.g. 1.5 meters).

$E_1$ and $E_2$ together impose a temporal constraint: the luggage was carried by the human before the drop-off, so the separate track of the luggage has not started yet. Once it appears, we know the drop-off has just happened.

$E_3$ is a spatial constraint: the owner has to be close to the luggage when the drop-off has just happened. This constraint eliminates irrelevant persons who are just passing by when the drop-off takes place. In case multiple persons are close to the luggage when the drop-off takes place, the closest person is considered the owner.

The prior and the conditional probabilities are listed as follows:

$P(D) = P(\neg D) = 0.5$
$P(E_i \mid D) = 1$, $i = 1, 2, 3$
$P(E_i \mid \neg D) = 0.5$, $i = 1, 2, 3$

• Abandon luggage (denoted as A): Two pieces of evidence are used.

  - $E_1$: The distance between the owner and the luggage a short time (e.g. 0.5 second) ago was less than the alarm distance (3 meters).

  - $E_2$: The distance now is larger than the alarm distance.

$E_1$ and $E_2$ together incorporate the spatial rule of PETS 2006 that is described in the introduction.

The prior and the conditional probabilities are listed as follows:

$P(A) = P(\neg A) = 0.5$
$P(E_i \mid A) = 1$, $i = 1, 2$
$P(E_i \mid \neg A) = 0.5$, $i = 1, 2$


Table 1: The ground truth (rows with dark background) and our results (rows with white background). Locations are measured in meters. The last four rows show the frame number and time (in seconds) when an event is triggered.

We first detect the drop-off event on all combinations of human-luggage pairs, and only those pairs with high probability are considered further for the abandon event. This is how the contextual rule is applied.

For the temporal rule, we require that if the current frame t gets a high probability of the abandon event and any previous frame t' (within the range of 30 seconds before t) also got a high probability, the high probability at frame t' is discarded. Finally, alarm events are triggered at the frames with high probability. PETS 2006 also defines a warning event, which is treated exactly the same here except for a smaller distance threshold (= 2 meters).
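The overall rule chaining can be sketched as follows. This is a simplified illustration, not the authors' implementation: it works directly on a per-frame owner-luggage distance signal, ignores the probabilistic machinery and second-party attendance, and only the 3 m / 2 m / 30 s constants come from the PETS 2006 definition (frame rate 25 fps as stated in the text).

```python
FPS = 25
ALARM_DIST = 3.0        # meters (PETS 2006 spatial rule)
WARNING_DIST = 2.0      # meters (warning event)
ALARM_DELAY = 30 * FPS  # 750 frames (temporal rule)

def abandon_frames(owner_lugg_dist, threshold):
    """Frames where the owner-luggage distance crosses the threshold from
    below, i.e. where the abandon hypothesis would get a high probability."""
    return [t for t in range(1, len(owner_lugg_dist))
            if owner_lugg_dist[t - 1] < threshold <= owner_lugg_dist[t]]

def alarm_frame(owner_lugg_dist, threshold=ALARM_DIST):
    """Temporal rule: a crossing followed by another crossing within 30 s is
    discarded (the owner re-attended and left again); otherwise the alarm is
    triggered 30 s after the crossing, provided the distance stayed above the
    threshold for the whole interval."""
    candidates = abandon_frames(owner_lugg_dist, threshold)
    for t in candidates:
        if any(t < u <= t + ALARM_DELAY for u in candidates):
            continue   # re-attended and left again: this crossing is discarded
        window = owner_lugg_dist[t:t + ALARM_DELAY + 1]
        if len(window) > ALARM_DELAY and all(d >= threshold for d in window):
            return t + ALARM_DELAY
    return None

# Toy example: the owner walks away at frame 100 and never comes back.
dist = [0.5] * 100 + [5.0] * 1000
print(alarm_frame(dist))                # 850 = 100 + 750 (alarm)
print(alarm_frame(dist, WARNING_DIST))  # warning event, same logic at 2 m
```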

4. Results and Evaluation

The following information is provided in the ground-truth data of the PETS 2006 dataset:

• The number of persons and luggage involved in the alarm event

• The location of the involved luggage

• The frame (time) when the warning event is triggered

• The frame (time) when the alarm event is triggered

We compare our results with the ground truth in Table 1. The rows with dark background are the ground-truth data.

Figure 6 shows the key frames (with tracking results) in which the drop off luggage and the abandon luggage events are detected. The 3D mapping on the ground plane is shown at the right side of each frame. The floor pattern is also provided as a reference. Note that the frame number shown in the abandon luggage event is the frame number of the alarm event minus 750 (equivalent to 30 seconds in a video with a 25 fps frame rate).

We can see in Figure 6 that some errors on the ground plane are visible, because the size and position of the bounding boxes are not perfect, and a small offset of a point in 2D can result in a significant error in 3D when the object is far away from the camera. Nonetheless, our tracking algorithms work very well. Note that in Figure 6(e), two persons on the left are not separated because they are merged in a single moving blob and the human detector fails to respond due to a large tilt angle.

The event recognition algorithm detects all (warning and alarm) events within an error of 9 frames (0.36 second). The involved persons include anyone who is close enough to the involved luggage when and after the abandon luggage event takes place. The correct number of involved persons is detected in all sequences except the last one. In that sequence, three persons walk closely as a group; one of them is severely occluded by the other two and thus missed by the tracker.

The clear definition of the alarm event contributes to the good performance of our event recognition system. The contextual rule is crucial because detecting the drop-off event provides a filtering mechanism that disregards irrelevant persons and luggage before they are considered for the abandon event. Furthermore, our tracker sometimes mis-identifies the type of an object based only on its mobility, which could result in many false alarms. Due to the contextual rule, the possibility of such false alarms is significantly reduced. To illustrate the importance of the contextual rule, we list in Table 2 the number of triggered events without applying the contextual rule. This demonstrates that high-level reasoning can eliminate the errors of low-level processing.

5. Conclusions and Future Work

We have presented an object tracking and event recognition system that has successfully tackled the challenge proposed by the PETS 2006 workshop. For tracking, we combine the results of a blob tracker and a human tracker, which provides not only the trajectory but also the type of each tracked object.


Figure 6: The key frames (with tracking results) in which the drop off luggage and the abandon luggage events are detected. The figures shown are, from left to right in each row, the key frame of drop off luggage in 2D and 3D, and the key frame of abandon luggage in 2D and 3D, respectively.


Table 2: Without the contextual rule, the results contain many false alarms.

For event inference, events are modeled and recognized in a Bayesian inference framework, which gives perfect event recognition results even though the tracking is not perfect.

Note that our system is designed as a framework for general event recognition tasks and is capable of recognizing (and has already been applied to) a much broader range of events. For this specific left-luggage task, we used our existing system without additional effort beyond specifying the event models described in Section 3.2, which took very little time. In conclusion, our system provides quick solutions to event recognition tasks such as left-luggage detection.

Real-world scenarios are usually more complicated than the ones presented here, in terms of the number of involved persons and objects and the variation in execution style and duration. For future work, we need to consider more sophisticated algorithms to handle such complexities.

References

[1] D. Comaniciu, V. Ramesh and P. Meer. The Variable Bandwidth Mean Shift and Data-Driven Scale Selection. In ICCV 2001, Vol I: 438-445.

[2] L. Davis, V. Philomin and R. Duraiswami. Tracking humans from a moving platform. In ICPR 2000, Vol IV: 171-178.

[3] A. A. Efros, A. C. Berg, G. Mori and J. Malik. Recognizing Action at a Distance. In ICCV 2003, 726-733.

[4] I. Laptev and T. Lindeberg. Space-time interest points. In ICCV 2003, 432-439.

[5] F. Lv and R. Nevatia. Recognition and Segmentation of 3-D Human Action using HMM and Multi-Class AdaBoost. In ECCV 2006, 359-372.

[6] C. Papageorgiou, T. Evgeniou and T. Poggio. A Trainable Pedestrian Detection System. In Proc. of Intelligent Vehicles, 1998, 241-246.

[7] J. R. Peter, H. Tu and N. Krahnstoever. Simultaneous Estimation of Segmentation and Shape. In CVPR 2005, Vol II: 486-493.

[8] C. Rao, A. Yilmaz and M. Shah. View-Invariant Representation and Recognition of Actions. In IJCV 50(2), Nov. 2002, 203-226.

[9] B. Wu and R. Nevatia. Detection of Multiple, Partially Occluded Humans in a Single Image by Bayesian Combination of Edgelet Part Detectors. In ICCV 2005, Vol I: 90-97.

[10] B. Wu and R. Nevatia. Tracking of Multiple, Partially Occluded Humans based on Static Body Part Detection. To appear in CVPR 2006.

[11] A. Yilmaz and M. Shah. Actions Sketch: A Novel Action Representation. In CVPR 2005, 984-989.

[12] T. Zhao and R. Nevatia. Tracking multiple humans in crowded environment. In CVPR 2004, Vol II: 406-413.
