Object Labelling from Human Action Recognition

© 2003 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Patrick Peursum, Svetha Venkatesh, Geoff A.W. West, Hung H. Bui
School of Computing

Curtin University of Technology, GPO Box U1987, Perth 6845, Western Australia

E-mail: {peursump, svetha, geoff, buihh}@cs.curtin.edu.au

Abstract

This paper presents a method for finding and classifying objects within real-world scenes by using the activity of humans interacting with these objects to infer the object's identity. Objects are labelled using evidence accumulated over time and multiple instances of human interactions. This approach is inspired by the problems and opportunities that exist in recognition tasks for intelligent homes, namely cluttered, wide-angle views coupled with significant and repeated human activity within the scene. The advantages of such an approach include the ability to detect salient objects in a cluttered scene independent of the object's physical structure, adapt to changes in the scene and resolve conflicts in labels by weight of past evidence. This initial investigation seeks to label chairs and open floor spaces by recognising activities such as walking and sitting. Findings show that the approach can locate objects with a reasonably high degree of accuracy, with occlusions of the human actor being a significant aid in reducing over-labelling.

1 Introduction

Most approaches to object recognition rely on classifying an object by comparing a model of the object against a database of known objects [4]. Function-based object recognition, introduced by Stark and Bowyer [13], is a variation on traditional methods, where object models are analysed and classified based on their functional components. For example, a chair could be defined as any object that has a flat, stable sitting surface.

However, such object recognition techniques rely on analysing the physical appearance of an object, a technique that is difficult to apply successfully to the cluttered scenes and complex objects that are typical of indoor, real-world environments. As an alternative, other researchers have reasoned that function-based recognition can be performed by monitoring the activities occurring within a scene rather than using a structural model of the object. Some applications of this are in path finding within a scene, either for the purposes of detecting unusual behaviour or predicting trajectories [15, 7], or for determining the extent of the pathways and obstacles that exist within the scene [16, 9, 5]. However, these investigations have mostly been applied to outdoor scenes and are limited to using the trajectories of moving people or cars, thus they do not infer anything beyond the position of paths and obstacles within the scene. Very little research has attempted to merge human action recognition — such as that performed by Bobick and Davis [3] — with the goal of identifying and classifying the objects that are being interacted with.

This paper explores an activity-based approach to learning and classifying functional objects in an indoor, real-world environment monitored by stationary cameras. The premise of this approach is that since humans interact differently with objects that differ in their functionality, it should be possible to identify objects using their associated visual human interaction signatures. The advantage of such an approach is that it considers object recognition independent of the object's physical structure. Furthermore, the system can use an evidence-based framework to classify objects in an incremental manner, and thus should be flexible enough to adapt to the scene as it changes over time (such as an object being moved). This would be particularly applicable to ‘smart houses’ and other intelligent monitoring systems that require robust, function-oriented scene understanding of indoor environments.

As an initial investigation, only the recognition of chairs and open floor spaces within a scene monitored by video cameras is considered. The activities of a single person are monitored with four video cameras to detect signature activities (including walking, sitting down into a chair, remaining seated and standing back up). Activities are modelled with Hidden Markov Models (HMMs) [12] due to their proven aptitude in classifying human actions [18, 11]. The HMMs are used to recognise the activity a person is conducting, which is then used along with the person's location to label areas of the scene. Partial occlusions of the person are taken as significant indicators of an object's boundaries and affect labelling accordingly. This evidence is then accumulated over time and multiple instances of human interactions. The system specifically avoids making any assumptions regarding the orientation or position of tracked people, or the relative positions of the cameras themselves.

The rest of this paper is organised as follows: First, a brief review of related literature in this field is given in Section 2. An overview of the system's architecture and a description of the methods used to track people, detect different activities and label objects in the scene follows in Section 3. An analysis of the experimental results can be found in Section 4 and conclusions are presented in Section 5.

2 Related Work

Recently, human activity recognition from motion analysis has become a significant topic for research and has been applied to a variety of problems. Early work [18] was limited to recognising patterns of action that were previously learned using HMMs. Other work [3] attempts to represent and recognise actions in low resolution video sequences using what is termed “motion-energy images”, applied to the task of recognising the action of a person sitting down (among others).

Grimson et al. [5, 8] use activity occurring within a scene to establish a common coordinate system between multiple cameras monitoring the same scene. They build on this to detect unusual behaviours of tracked objects by comparing the motion of an object against the learned motion of previous objects, as do Stauffer [14] and Haritaoglu [6]. Objects that fail to fall within any of the previously learned ‘typical’ behaviours are classed as anomalous, which has obvious applications to video surveillance systems. Other work [7, 15] has similar goals, but takes the approach of mapping trajectories to assist in predicting future positions of objects, then comparing an object's trajectory against these predictions.

Such use of activity is limited to recognising the activity and any anomalies, with no attempt made to use the activity to learn about the scene being viewed. Along these lines, some researchers [16, 9] use the motion of objects (cars and people) to find the pathways that exist within a cluttered outdoor scene. Grimson et al. [5] also describe a novel system that uses the motion and occlusion of tracked objects to infer the position and depth of occluding obstacles and open areas within the scene. In this regard, the work by Grimson et al. is perhaps closest to the research outlined in this paper, though they limit the scope of their work to inferring the position of objects, and do not attempt to classify what the objects are.

3 Tracking, Activity Segmentation and Labelling

3.1 Overview

The system performs four major operations in sequence to produce a labelled image of a given scene (see Figure 1). The video is captured at 25 frames per second from four cameras monitoring the scene (a laboratory). There is a high degree of overlap between the fields of view for each camera to maximise the chance that an object is viewed by more than one camera. All cameras are mounted in the ceiling, one in each corner, with a partition in the laboratory occluding parts of each view of the lab (see Figure 2). The captured video is saved to disk in MPEG-4 format, which is then processed offline for object segmentation and tracking to produce the raw data needed for activity segmentation and scene labelling. This is then separately processed to produce a labelled image of the scene from all four camera views.

Figure 2. Laboratory layout. Intensity indicates height of obstacles, with light grey for desks, tables and chairs, dark grey for occluding partitions and cupboards, and black for the ceiling-mounted cameras. Position and orientation of the target chairs varied between experiments.

3.2 Foreground Object Segmentation and Tracking

Background subtraction is employed to segment objects from the video stream, using a mixture model of Gaussian distributions to model the background as described in [14]. This background model was chosen since it can robustly adapt to changes in the background definition over time, which is essential for this research since, in future experiments, it must handle background objects being moved about the scene. Foreground objects (i.e. people) are segmented out from the background, outlined by a bounding box and tracked using a Kalman filter [14, 10, 2]. Tracking occurs on the centroid of the bounding box since it is less susceptible to occlusion and noise than the bottom edge of the box. Initial parameters for the Kalman filters were estimated by evaluating example videos against hand-labelled ground truths.

Figure 1. Major steps in activity-based scene labelling.

3.3 Calibration of Views

Each view is calibrated to the world coordinate system via a set of landmark points using an algorithm developed by Tsai [17]. Correspondences are then found between views of an object by their proximity in the world coordinate system (assuming that all objects are standing on the ground plane). Additionally, partial occlusions of objects are detected by comparing the world heights and positions of the object in all views. If the object's lower portion is occluded in one view, it will report a smaller height than the other views, and the bounding box is automatically adjusted to reflect the correct height.
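
To make the cross-view height comparison concrete, the following minimal Python sketch flags views whose reported world height falls well below the consensus height for the same person. The paper gives no implementation details, so the function, the field names and the 0.85 tolerance are illustrative assumptions only.

    def find_lower_occlusions(view_heights_mm, tolerance=0.85):
        # view_heights_mm: world-frame height (mm) of the same tracked
        # person as reported by each calibrated view.
        # A view reporting much less than the tallest estimate is assumed
        # to have the person's lower portion occluded; the shortfall says
        # how far the bounding box should be stretched downwards.
        consensus = max(view_heights_mm.values())
        occlusions = {}
        for view, height in view_heights_mm.items():
            if height < tolerance * consensus:
                occlusions[view] = consensus - height  # missing height in mm
        return occlusions

    # Example: the SE camera sees only the upper 1200 mm of a ~1750 mm person.
    print(find_lower_occlusions({"NW": 1745, "NE": 1750, "SW": 1738, "SE": 1200}))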

3.4 Activity Segmentation

Four Hidden Markov Models (HMMs) were trained with continuous data extracted from video sequences, one HMM for each activity that would be recognised:

• Walking
• Sitting down into a chair
• Person seated in a chair
• Standing up from a chair

Training data consisted of six examples with four views per example (24 sequences in total) of a person walking into the room, sitting down into a chair, standing back up and leaving the room. For each example, the chair was positioned at different orientations and positions within the room, with the caveat that all cameras were able to view the chair. Each sequence was manually segmented into the four different activities, which were then used to train the HMMs. Training features are as follows (a sketch of the per-frame feature computation is given after the list):

• Real-world height (in mm).

• Change in height between this frame and the previous frame, expressed as a proportion of the total height to minimise dependency on the object's height.

• Change in width, also expressed as a proportion of the total width.

• Speed of the object (absolute velocity).
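
As a concrete illustration of these features, the sketch below computes one observation vector per frame from the tracked bounding box. The dictionary field names are hypothetical (the paper does not specify its data structures), but the four features follow the list above.

    import numpy as np

    def frame_features(curr, prev):
        # One observation vector for the HMMs (a sketch). 'curr' and 'prev'
        # describe the tracked person in consecutive frames; field names
        # such as "height_px" and "cx" are illustrative only.
        d_height = (curr["height_px"] - prev["height_px"]) / curr["height_px"]
        d_width = (curr["width_px"] - prev["width_px"]) / curr["width_px"]
        speed = np.hypot(curr["cx"] - prev["cx"], curr["cy"] - prev["cy"])
        return np.array([curr["world_height_mm"],  # real-world height (mm)
                         d_height,                 # height change / total height
                         d_width,                  # width change / total width
                         speed])                   # absolute speed of centroid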

These HMMs then form the basis for automatically segmenting test video sequences into blocks that each relate to one particular activity. Only modelled activities were conducted, and only one person moved through the scene in order to simplify processing.

The sitting down and standing up activities are both modelled with Bakis strict left-right HMMs, where each state may only transition to itself or the next state in the model, but no others. This improves accuracy over a standard HMM since sitting-down and standing-up motions are non-cyclic even though they exhibit seemingly cyclic motion profiles (see Figure 3). In a standard HMM, this pseudo-cyclic motion would be incorporated into the trained model, which would later cause confusion when attempting to separate sitting from standing. The problem is avoided with the use of left-right models. In contrast, the walking and seated activities are modelled using standard HMMs due to their cyclic motion profiles. The models for sitting and standing contain ten states, with walking comprising five states and the seated model three states. These numbers were arrived at empirically by comparing the performance of the models given different numbers of states.

Activity segmentation proceeds by only considering frames within a fixed-size moving window (a window size of 30 frames was found to provide the best results). The features from frames within this window are then used to calculate the log likelihood for each HMM, selecting the best HMM as the activity description for the block of frames. The activity is estimated to have begun halfway into the window, based on the reasoning that an HMM will become dominant over the previous activity's HMM when at least half the frames of the window relate to the new activity. The window is then moved one frame forward and the entire process is repeated.
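
The strict left-right (Bakis) structure described above can be encoded as a transition matrix in which each state may move only to itself or its successor. The sketch below (the 0.6 self-loop probability is an illustrative initial value, not from the paper) builds such a matrix, which could then seed training in any HMM toolkit.

    import numpy as np

    def bakis_transition_matrix(n_states, self_loop=0.6):
        # Strict left-right transitions: state i goes to i or i+1 only;
        # the final state absorbs. Zero entries stay zero under EM
        # re-estimation, so the constraint survives training.
        A = np.zeros((n_states, n_states))
        for i in range(n_states - 1):
            A[i, i] = self_loop
            A[i, i + 1] = 1.0 - self_loop
        A[-1, -1] = 1.0
        return A

    print(bakis_transition_matrix(10))  # e.g. for the ten-state 'sit'/'stand' models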



Figure 3. Example motion profiles for sitting down (a) and standing up (b). The dip corresponds to the person leaning forwards for balance, just before settling into the seat (or just after rising from the seat for (b)). Motion incorrectly appears to be cyclic due to the dip.

However, this basic method tends to produce short bursts of incorrect activity labelling due to an incorrect HMM temporarily becoming more probable than the correct HMM because of noise, occlusions or other random factors. To solve this, a heuristic confidence test is performed on the HMM log likelihoods, which mandates that the most likely HMM must significantly outperform the next most likely model. The ratio between the best and second best HMM log likelihoods is taken, with a ‘significant’ difference being defined by an arbitrary threshold (currently 0.75). If no significant HMM is found, the previous activity is re-instated. Increasing the significance threshold has the effect of minimising false positives due to noise, at the cost of reducing the ability to recognise genuine activities.
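
A minimal sketch of the windowed segmentation with this confidence test is given below. It assumes each trained model exposes a log-likelihood method score(window) (as, for example, hmmlearn's GaussianHMM does); the exact form of the ratio test is not spelled out in the paper, so the comparison here assumes negative log likelihoods.

    def segment_activities(features, models, window=30, threshold=0.75):
        # features: per-frame observation vectors; models: dict mapping an
        # activity name to a trained HMM with a score(chunk) log-likelihood
        # method. Returns (frame index, activity) pairs.
        labels = []
        current = "walk"  # assumed initial activity (the person walks in)
        for t in range(window, len(features) + 1):
            chunk = features[t - window:t]
            ll = {name: m.score(chunk) for name, m in models.items()}
            ranked = sorted(ll, key=ll.get, reverse=True)
            best, second = ranked[0], ranked[1]
            # Confidence test: the winner must clearly beat the runner-up,
            # otherwise the previous activity is re-instated. With negative
            # log likelihoods a smaller best/second ratio means a larger gap.
            if ll[best] / ll[second] <= threshold:
                current = best
            # The new activity is deemed to start halfway into the window.
            labels.append((t - window // 2, current))
        return labels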

3.5 Activity Voting

Each view performs activity segmentation independently of all other views. To improve activity segmentation further, each view then casts a vote as to the activity being performed. Votes are equally weighted and, if a deadlock occurs, the current model is re-instated. The elected activity is then used by all views to perform scene labelling, again independently of one another.
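
A sketch of the equal-weight vote (a straightforward reading of the description above, with the deadlock rule keeping the currently elected activity) could look like this:

    from collections import Counter

    def vote_activity(view_votes, current):
        # view_votes: the activity label elected independently by each view.
        # A tie between the top candidates is a deadlock, so the current
        # activity is re-instated; otherwise the majority label wins.
        counts = Counter(view_votes).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            return current
        return counts[0][0]

    print(vote_activity(["sit", "sit", "walk", "sit"], current="walk"))  # -> sit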

3.6 Scene Object Labelling

Labelling of objects in the scene is performed by taking each frame of an activity block and updating the view based on the activity being conducted and the position of the person. This occurs by maintaining a weight for each label (chair or floor) for every pixel in a view's background image. The weights lie within the range 0 to 1, and are initialised to 0. When a pixel (x, y) is updated (due to an activity occurring at that pixel), all weights are updated via the following exponential-forgetting function:

w^L_{t+1}(x, y) = w^L_t(x, y) · (1 − η) + (η · κ),

where κ = 1 if L = detected object, and 0 otherwise.

• w^L_t(x, y) is the weight of the Lth label (chair or floor) at time t, pixel (x, y).

• η is the learning rate for learning labels, and is generally very small (less than 0.05) to avoid building up weights too quickly.

• κ is the update value that controls which label will be strengthened.

This exponential forgetting facilitates the elimination of incorrect labelling due to bad activity segmentation when new, more correct labels become apparent over time. It also provides the possibility of adapting to changes in the placement of labelled objects (chairs) since old labels will become negligible over time.
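
The per-pixel update can be applied over whole weight maps at once. The sketch below assumes one floating-point weight image per label and a boolean mask of the pixels covered by the current evidence (e.g. the fitted ellipse of the person); allowing η to be a per-pixel image anticipates the occlusion rule of Section 3.7. The data layout is an assumption, not taken from the paper.

    import numpy as np

    def update_label_weights(weights, detected_label, mask, eta=0.04):
        # weights: dict mapping each label ('chair', 'floor') to a float
        # image of weights in [0, 1], initialised to zero.
        # mask: boolean image of pixels covered by the current evidence.
        # eta: scalar learning rate, or a per-pixel image of rates.
        eta_m = eta[mask] if isinstance(eta, np.ndarray) else eta
        for label, w in weights.items():
            kappa = 1.0 if label == detected_label else 0.0
            w[mask] = w[mask] * (1.0 - eta_m) + eta_m * kappa
        return weights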

Chairs are labelled whenever the sitting, seated or standing up activities are detected. The fitted ellipse of the seated person is used as the labelling area, which is by implication close to the area of the chair. Only the last 10% of the frames are used for labelling if the activity is ‘sit’, since the sitting action begins in a standing position and ends with the person seated. Similarly, the ‘stand’ action causes only the first 10% of frames to be labelled. Floor space is labelled when the walking activity occurs. However, only the lowest 5% of the fitted ellipse is labelled as floor space since this area corresponds to the feet of the person.
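
These frame-selection rules amount to a simple filter on each activity block; a sketch follows (the activity names are those used in this paper, and the 10% boundaries are taken from the text above):

    def frames_for_labelling(activity, frames):
        # Choose which frames of an activity block contribute label evidence.
        n = len(frames)
        if activity == "sit":        # starts standing, ends seated
            return frames[int(0.9 * n):]           # last 10% of frames
        if activity == "stand":      # starts seated, ends standing
            return frames[:max(1, int(0.1 * n))]   # first 10% of frames
        return frames                # 'seated' and 'walk' use every frame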

3.7 Use of Occlusion to Constrain Labelling

Partial occlusions that occur when a person walks on the far side of a chair are used to affect the labelling and future learning of the chair's boundaries. The occlusion is used as strong evidence that the unoccluded area is not part of the chair. The weights of the chair labels for this area are reduced to 0 when the occlusion is detected, and the learning rate (η) for the area is reduced by a factor of 4 to retard the speed at which the chair label is later relearned. For example, if η was 0.04 and an occlusion effect was detected, η would be reduced to 0.01 for all future learning of chair labels in the unoccluded region. Floor label learning would not be affected. Note that the learning rate is not reduced to zero since mistakes in defining the area of partial occlusions would then become irreversible.
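
Expressed against the weight maps of the earlier sketch, the occlusion rule could be applied as follows (chair_eta is assumed to be a per-pixel learning-rate image; the factor of 4 is the one quoted above):

    def apply_occlusion_evidence(chair_weights, chair_eta, unoccluded_mask,
                                 slowdown=4.0):
        # In the region that stayed visible while the person was partially
        # occluded by the chair, erase the chair label and slow down its
        # future relearning. Floor label learning is left untouched.
        chair_weights[unoccluded_mask] = 0.0
        chair_eta[unoccluded_mask] /= slowdown  # e.g. 0.04 -> 0.01
        return chair_weights, chair_eta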


(a) Initial walk, (b) Sit once in each chair, (c) More walking, sitting, (d) Final labelling

Figure 4. Sequence of images showing the progression of labelling for one run. Outlines indicate chair, floor and obstacle boundaries. Upper row shows chair labelling, lower row shows floor labelling. Intensity indicates the weight of the label, and strengthens as more evidence is accumulated.

4 Results and Analysis

4.1 Experiments


Figure 5. a) NW view of scene, b) with labels.

Three video sequences, each approximately one minute in duration and comprising four views, were taken of a person entering the scene, moving about and occasionally sitting down in one of the target chairs. The chairs remained in fixed positions throughout the experiments. Activity segmentation and scene labelling were performed on each of the sequences to produce three sets of four labelled images (one image for each view of the scene — see Figures 4, 5 and 7). These labelled images were then analysed together with the original video sequences to evaluate the effectiveness of the system and identify possible reasons for inaccuracies.

4.2 Activity Segmentation

The ground-truth for the starting frame of each activity instance was estimated by manually segmenting the video sequences. The uncertainty for manual segmentation is roughly ±5 frames, though this is a subjective judgement. The difference (in frames) between this ground-truth and the automatic activity segmentation was then evaluated. Table 1 shows the mean error and variance for each type of segmented activity.

Table 1. Error means and variances for activity segmentation.

Model    Num. Instances   Found Instances   Mean Error (frames)   Error Var (frames)
Walk     19               19                -2.94                  24.56
Sit      19               19                 5.15                  48.31
Seated   19                2                 0                      8
Stand    19               19               -50.24                 726.32

The ‘walk’ and ‘sit’ activities are segmented highly accurately given that the ground-truth uncertainty is ±5 frames. Also significant is that the ‘sit’ activity is generally segmented slightly later than the actual ‘sit’ action, and conversely the ‘walk’ activity is segmented slightly earlier. This means that the beginning and end of chair interactions are conservatively estimated, further improving the robustness of segmentation.


The most concerning aspect of the data is that only two ‘seated’ instances were actually detected (out of 19), and that the system generally detects the beginning of the ‘stand’ actions far too early. In fact, the two failures are related.

Figure 6. Motion profile of a full sequence for sitting into a chair and standing back up. Sections show ground-truth.

To illustrate, Figure 6 shows the motion profile for an entire walk-sit-seated-stand-walk sequence. As the segmentation window of 30 frames moves over the data, it incrementally shifts from frames containing the ‘sit’ action to those containing the ‘seated’ action. However, at around frame 160, the motion profile within the window looks strikingly similar to the first third of the ‘stand’ profile, even though it is actually part of ‘sit’. This results in the ‘stand’ model's log likelihood increasing markedly, and the system often misinterprets the end of a ‘sit’ action as the beginning of a ‘stand’ action. This causes the premature detection of a ‘stand’ action at around the time that the ‘seated’ action actually begins. The mistake is not corrected over the next few frames because of the threshold requiring a model to significantly outperform all other models before a new activity label is accepted: it prevents the ‘seated’ model from replacing the ‘stand’ model since the ‘stand’ model's log likelihood remains sufficiently high to avoid being replaced, and so the ‘seated’ action is never detected.

One possible solution to this problem is to enforce a high-level heuristic that requires a ‘seated’ action to follow a ‘sit’ action. Another possibility is to investigate whether additional features can improve the ability of the system to discriminate between ‘seated’ and ‘standing’. Finally, HMM termination probabilities [1] could be used to reduce the confusion between the end of the ‘sit’ state and the start of the ‘stand’ state.

Fortunately, the loss of the ‘seated’ activity label merely results in less evidence for the chair labelling, which is easily offset by observing more instances of a person sitting in the chair. No other negative effects become apparent since the ‘stand’ action simply stretches to encompass the interim ‘seated’ action, and the ‘sitting’ and ‘walking’ activities accurately detect the start and end points of the entire sitting sequence.

4.3 Scene Labelling Accuracy

Chair labelling was evaluated by comparing the area labelled as ‘chair’ against the true extent of the chairs in each of the views (where a chair's extent also includes the space between the chair legs). Table 2 shows the confusion matrix for chair labelling.

Table 2. Confusion matrix for chair labelling. Note that ‘Other’ relates to non-floor and non-chair objects (walls, cupboards, desks, etc.).

                        Classified as (pixels)
Actual        Chair      Floor       Other       Recall
Chair         18,127     3,068       5,083       68.98%
Floor         11,503     168,320     72,435      66.7%
Other         7,313      7,686       628,065     97.7%
Precision     49.07%     93.99%      89.01%
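
To make the table's figures concrete, the chair recall is computed row-wise and the chair precision column-wise: recall(chair) = 18,127 / (18,127 + 3,068 + 5,083) ≈ 68.98%, while precision(chair) = 18,127 / (18,127 + 11,503 + 7,313) ≈ 49.07%.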

The system achieved a recall rate of 68.98% for chair labelling, representing the correctly-labelled percentage of the total chair area across all views. Thus it is evident that chair labelling manages to locate chairs very successfully, with nearly 7 out of 10 chair pixels found. Inaccuracies are mostly due to the fact that the labels are produced from the seated person, who is almost always offset slightly from the chair itself since they sit on the chair rather than within it.

To measure how closely chair labelling was able to fit within chair boundaries, it is necessary to refer to precision. Even though the precision value of 49.07% seems quite low (indicating that about half the chair labels were outside of chairs), it is not unexpected since the seated person's extent is nearly always larger than the chair itself. For example, the person's head and shoulders are almost always higher than the chair's back. Additionally, the offset of the person from the chair further degrades the accuracy of labelling.

The use of occlusion to localise the extent of the chair was found to have provided significant benefits to precision. Unfortunately, the effectiveness of occlusion was not fully exploited due to the limited number of occlusions that the experiments contained. To illustrate the potential of occlusion, consider only the chair views that actually experienced occlusions: precision for these was fairly high at 70.3%. In contrast, if occlusion effects had not been taken into account, the precision for these same chair views would have been 55.4%, not much better than the overall precision of 49.07%. The effectiveness of occlusion can be explained by the fact that it is particularly useful in detecting and reducing one of the primary causes of over-labelling; that is, a person's head and shoulders rising above the chair's back.


NW View NE View

SW View SE View

Figure 7. Floor (dark red) and chair (light blue) labelling for all views of the same test run. Edges show chairs, floors and occluding objects in the scene. Note that floor labelling is reasonably adept at detecting edges of occluding objects such as walls and chairs. Note also the effects of occlusion in the NW view (top-left chair) and SE view (lower-right chair) for finding chair boundaries.

Though Table 2 shows that the precision of floor labels is extremely good (93.99%), this is misleading since the open floor space extends over a large proportion of the view. Similarly misleading is the recall figure for the floor, which seems quite low (66.7%). The issue here is that some portions of open floor space were not actually walked over by the person during the experiments, so gaps exist in the coverage and adversely affect the recall. Thus it is ill-advised to attempt to analyse floor labelling statistically as was done with chair labels. Instead, floor label evaluation was restricted to visually inspecting the labelled images for over-labelling, where floor labels incorrectly spilled into chair areas or onto occluding walls and partitions — see Figure 7. Overall, floor labelling manages to detect occluding edges reasonably well with only minimal overflow. Over-labelling into chair spaces is also minimal, due both to the success of floor labelling and the fact that chair labels tend to overpower the floor labels.

4.4 Limitations and Future Work

The most limiting factor of this research is the coarseness of the measurements: essentially, the bounding box and fitted ellipse of a person are used for almost all processing. This causes several problems:

• Occlusions are based on the bounding box, and so can ‘over-occlude’, leading to incorrect judgements regarding the extent of the occlusion (see Figure 8).

• Only activities that involve full-body movements can be detected. This restricts the range of objects that can be classified.

• Over-labelling will always occur since the bounding box and fitted ellipse are always larger than the area actually taken up by the person's body.

Silhouette analysis [6] and human pose estimation are two possible methods for improving the granularity of measurements over the current bounding box approach. Furthermore, the lack of collaboration between views when performing labelling reduces label accuracy and means that each labelled view cannot be easily corresponded to the real-world location of objects — only ground-plane coordinates can be retrieved. Some method of finding the coordinates of a non-ground-plane 3D point by corresponding the point across multiple views could be used to overcome this.

Additionally, the experiments detailed in this paper assume that the objects being labelled (chairs and floor) are in fixed locations. This is an unrealistic constraint for chairs and many other objects in a typical home. Further experiments must be conducted to determine whether the system can handle changes in the location of objects, an issue that has already been allowed for through the use of an adaptive background model and evidence-based labelling but has not been specifically investigated.

Finally, the precision of labels estimated from activities could be improved by using some form of image segmentation and/or relaxation labelling to detect the bounds of the actual object within the image associated with the activity label. Such refinement could be guided by both image information and evidence from human activity (such as occlusions).

5 Conclusions

This paper has presented the concept of identifying and classifying objects within a scene by detecting and recognising the activities that humans perform when interacting with these objects. Only chairs and floor spaces were considered. The accuracy of chair labelling was quite good, with nearly 70% of all chair pixels successfully labelled and occlusions proving to be a powerful tool for reducing over-labelling. Floor labelling also showed the promise of the technique, but the ubiquity of floor space within the scene makes any statistical analysis difficult to justify. The challenge now is to test the potential of this labelling approach by applying the method to a variety of objects other than ‘chair’ and ‘floor’. This will require addressing current limitations, such as the need for finer measurements of human body actions (which could be obtained from human pose estimation techniques or silhouette analysis) as well as collaboration between views during the object labelling phase via 3D point correspondences.

Figure 8. Three frames in an occlusion sequence: frames 368 (a), 370 (b) and 375 (c). Frame (b) shows how the bounding box combines with occlusion by the chair seat to cause the chair back to be incorrectly considered part of the unoccluded area. Frame (c) shows a correct occlusion boundary since the person is now fully behind the chair back.

References

[1] Y. Al-Ohali, M. Cheriet, and C. Suen. Introducing termination probabilities to HMM. In Proceedings of the IEEE International Conference on Pattern Recognition, 2002.

[2] J. Black, T. Ellis, and P. Rosin. Multi view image surveillance and tracking. In Proceedings of the IEEE Workshop on Motion and Video Computing, page 6, 2002.

[3] A. F. Bobick and J. W. Davis. An appearance-based representation of action. In Proceedings of the IEEE International Conference on Pattern Recognition, 1996.

[4] R. J. Campbell and P. J. Flynn. A survey of free-form object representation and recognition techniques. Computer Vision and Image Understanding, 81(2):166–210, February 2001.

[5] W. E. L. Grimson, C. Stauffer, R. Romano, and L. Lee. Using adaptive tracking to classify and monitor activities in a site. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 22–29, 1998.

[6] I. Haritaoglu, D. Harwood, and L. S. Davis. W4: Real-time surveillance of people and their activities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):809–830, August 2000.

[7] N. Johnson and D. Hogg. Learning the distribution of object trajectories for event recognition. Image and Vision Computing, 14:609–615, 1996.

[8] L. Lee, R. Romano, and G. Stein. Monitoring activities from multiple video streams: Establishing a common coordinate frame. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(5):758–767, May 2000.

[9] D. Makris and T. Ellis. Finding paths in video sequences. In Proceedings of the British Machine Vision Conference, pages 263–272, 2001.

[10] A. Mittal and L. S. Davis. M2Tracker: A multi-view approach to segmenting and tracking people in a cluttered scene using region-based stereo. In Proceedings of the European Conference on Computer Vision, volume 1, pages 18–33, 2002.

[11] N. M. Oliver, B. Rosario, and A. P. Pentland. A Bayesian computer vision system for modeling human interactions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(5):831–843, May 2000.

[12] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, February 1989.

[13] L. Stark and K. Bowyer. Achieving generalized object recognition through reasoning about association of function to structure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(8):1097–1104, October 1991.

[14] C. Stauffer and W. E. L. Grimson. Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):747–757, August 2000.

[15] N. Sumpter and A. J. Bulpitt. Learning spatio-temporal patterns for predicting object behaviour. In Proceedings of the British Machine Vision Conference, pages 649–658, 1998.

[16] M. K. Teal and T. J. Ellis. Spatial-temporal reasoning based on object motion. In Proceedings of the British Machine Vision Conference, 1996.

[17] R. Y. Tsai. An efficient and accurate camera calibration technique for 3D machine vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 364–374, 1986.

[18] J. Yamato, J. Ohya, and K. Ishii. Recognizing human action in time-sequential images using hidden Markov model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 379–385, 1992.
