Spatio-temporal constraints for on-line 3D object recognition in videos


Transcript of Spatio-temporal constraints for on-line 3D object recognition in videos

Computer Vision and Image Understanding 113 (2009) 1198–1209


Spatio-temporal constraints for on-line 3D object recognition in videos

Nicoletta Noceti, Elisabetta Delponte, Francesca Odone *

DISI, Università degli Studi di Genova, Via Dodecaneso 35, 16146 Genova, Italy

Article info

Article history: Received 15 July 2008; Accepted 10 June 2009; Available online 18 August 2009.

Keywords: View-based 3D object recognition; Local keypoints; Space–time coherence; Continuous views; Interactive recognition; On-line recognition; Real-time recognition.

1077-3142/$ - see front matter © 2009 Elsevier Inc. All rights reserved. doi:10.1016/j.cviu.2009.06.006

* Corresponding author. Fax: +39 010 353 6699. E-mail address: [email protected] (F. Odone).

1 We refer to the problem of recognizing a given specific object and not elements of a class of objects; therefore our problem is purely recognition rather than categorization.

Abstract

This paper considers view-based 3D object recognition in videos. The availability of video sequences allows us to address recognition exploiting both space and time information to build models of the object that are robust to view-point variations. In order to limit the amount of information potentially available in a video we adopt a description of the video content based on the use of local scale-invariant features, both in the object modeling (training sequence) and in the recognition phase (test sequences). Then, by means of an ad hoc matching procedure, we look for similar groups of features both in modeling and recognition. The final pipeline we propose is based on the construction of an incremental model of the test sequence, thanks to which we perform on-line recognition. We present experimental results on object recognition in videos, showing how our approach can be effectively applied to rather complex settings.

© 2009 Elsevier Inc. All rights reserved.

1. Introduction

This paper deals with the problem of recognizing 3D objects¹ in real-world scenes; to this purpose it proposes an on-line recognition method for video streams able to recognize objects in a possibly cluttered environment.

The main objectives of the work are (1) to design a fast algorithm for object recognition in possibly cluttered environments, and (2) to use it to show how, by observing an object for some time from slightly different view-points, recognition performance improves.

In order to keep the computational cost low and to reduce the information redundancy typical of video sequences, we describe the frame content by means of local keypoints. We devise a method that exploits the spatio-temporal information obtained from keypoint tracking: we use the video sequence to find 2D descriptions stable to view-point changes, and to match such descriptions favoring groups of features close to one another within a single view and among similar view-points. This choice is again motivated by computational reasons, and also by the fact that view-centered approaches have proved an effective choice for object recognition.

The contributions of our work are threefold. First, an appearance-based object recognition method that exploits temporal coherence both in the modeling and in the recognition phase, and thus fits naturally in the video analysis framework. In spite of the redundancy of the data that we use, the descriptions we obtain are compact, since they only keep information which is very distinctive across time. Second, an effective matching procedure upon which we base our recognition module: a simple nearest neighbor is strengthened with a strategy that favors matching between groups of features belonging to similar fields of view and being spatially close. Finally, an on-line recognition procedure that may be optimized to perform in real time with little modification of the feature extraction phase. New information on the observed scene, gathered from a video stream, is processed as soon as it becomes available. To this purpose we build an incremental model of the scene that encodes elements from the recently seen objects.

Our approach takes some inspiration from biological vision systems that gather information by means of motion (motion of the pupil and motion of the head). Fixating an unfamiliar object for a few seconds and possibly observing it from slightly different view-points is a common practice in human experience: by doing so, we are able to collect useful information, including important cues for depth perception and object recognition. It is known that the cerebral cortex uses spatio-temporal continuity to build representations of 3D objects that are invariant with respect to scale, view-point, and position variations [34]. In the method that we propose such a motivation is intertwined with computational requirements: the video description that we need to obtain must be stable and robust in case of long observations of the scene, as happens for long video streams.

To assess our method we first validate its recognition performance on a set of previously recorded video sequences of increasing difficulty. They include clutter, illumination and scene changes, occlusions, and the presence of multiple objects of the same kind. Then we report qualitative experiments showing the effectiveness of on-line recognition and suggesting the importance of exploiting continuous views whenever they are available: they allow us to increase the recognition performance and to significantly increase the tolerance to view-point changes. Our experimental analysis is based on the argument that modeling the scene by means of a continuous set of views may help (i) capturing the idea of representations invariant to view-point changes which is typical of biological vision systems, and (ii) increasing the recognition confidence by observing the object for an amount of time which depends on how difficult the object is (e.g., flat objects are recognized more quickly).

The on-line experiments that we report are carried out by means of a fully functioning on-line recognition system engineered to recognize one object at a time, which returns simple feedback to the user suggesting view-point changes that would facilitate recognition. From a functional stand-point, the system's main feature is interactivity, which allows the user to understand why recognition is not working and to update the observation point accordingly.

The remainder of the paper is organized as follows. Section 2 reports an analysis of related work, and Section 3 gives an overview of the proposed method. Then, Section 4 describes how we build a model from an image sequence, while Section 5 is devoted to the matching procedure between training and test models that we propose. Section 6 illustrates the recognition method applied to a video sequence or to a video stream. In Section 7 we assess the recognition method, then we report a qualitative evaluation of the benefits of an on-line interactive procedure for object recognition. Finally, in Section 8 we draw some conclusions and discuss future developments.

2. Related work

The state of the art on object recognition is vast, and it goes back to the early years of computer vision [22,15,23,2]. Here we focus primarily on view-based or appearance-based approaches, instead of the so-called model-based methods [3,16,14]. There are several reasons motivating view-based approaches to object recognition, among which the studies on biological vision. Over the last decade, a number of physiological studies in nonhuman primates have established several basic facts about the cortical mechanisms of recognition that have inspired many artificial recognition systems [10,33]. Also, neurophysiological studies [36] showed that receptive field profiles of the mammalian retina and of the visual cortex can be well modeled by superpositions of local Gaussian filters. The idea of exploiting simple local features has attracted the attention of the computer vision community (see, for instance, [20,5,19,30,11,35]). The popularity of local approaches is due to the fact that, unlike global methods [24,29], they produce relatively compact descriptions of the image content and do not suffer from the presence of cluttered background, scene variations and occlusions. Therefore, they are considered a simple and valuable alternative to more complex appearance-based models such as aspect graphs [1] and active appearance models [6,17]. Also, by means of appropriate descriptions of the local information, images can be matched effectively, even in the presence of illumination changes and scale variations. Many 2D local keypoint detectors and descriptors have been proposed in the literature, and comparative evaluations aiming at putting some order in this vast body of work have been carried out [26,25]. As for descriptors, SIFT [21] is considered the most reliable in a variety of situations [26,25].

The main problem with using local keypoints for recognition is that, while observing minute details, the overall object appearance may be lost. This phenomenon does not occur in biological systems, which are usually able to capture the global essence from local features.

To address these well known problems a number of approaches are available. A first possible way is to build explicit 3D models from local features extracted from multiple views of the object [30,4]. This approach is an effective choice, in particular if one wants to recognize rigid objects and if no time constraints are set.

Alternatively, a very common strategy is to summarize local information in global descriptions of the object, for instance by means of codebooks [5,19,7]. In particular, [7] reports our early attempts to exploit space–time information to model view-point changes. We extracted features in video sequences and combined them in a bag-of-keypoints description of the image content. Time continuity was only used in the training phase to select the most stable features, but the actual image representation was computed on single frames. The main drawbacks of the approach are its weakness to an increasing amount of clutter and the fact that the model does not scale well even as the number of modeled objects grows moderately. Also, real-time performance was not achievable because of the complex nature of the adopted image description. These issues have been addressed in the present work.

Closeness constraints have been used to increase the quality of matches [11]. Explicit geometric information may also be applied with success to object classes (see for instance [32], where a set of canonical parts are linked together to form a 3D structure of the object).

Another important source of information is temporal continuity, whenever a dense sequence of the object of interest is available. Temporal continuity may be used to extract features robust to view-point changes, by means of feature tracking. Those features may be used to compute 3D information for reconstruction and recognition (see, for instance, [28]). Temporal continuity may also help to cluster related keypoints observed at different view-points, as suggested in [12], where temporal information is used in the training phase only, to obtain a richer object model, while at run time only one image is available. The method proposed in [12] is shown to be successful on nearly flat objects with highly textured surfaces, with almost no features due to shape. Our approach is related to [12], as discussed later in this paper.

3. Overview of our approach

Let us briefly describe our approach to 3D object recognition. In order to show the type of data we will deal with, let us have a look at Fig. 1, showing sample frames from a simple image sequence containing a 3D object that we may want to recognize. We implicitly represent the 3D information from a video sequence by means of time-invariant features, i.e., local features that are distinctive in space and smooth and stable in time. To do so, we extract 2D scale-invariant local features (Harris corners in scale-space) from the images and track them along the video sequence by means of a Kalman filter, forming feature trajectories.

Dynamic filters, robust to temporary occlusions, allow us to deal with non-convex objects, since, while rotating around a 3D object, self-occlusions may cause temporary interruptions in a trajectory. For a given trajectory we obtain a compact time-invariant feature by averaging the descriptions of the keypoints belonging to the same trajectory. The overall procedure described so far is summarized in Fig. 2 (left).

Thus we obtain a spatio-temporal model made of all the time-invariant features observed in a video sequence or a part of it. In the modeling phase we acquire a video sequence of one object of interest at a time, observing it in relatively controlled environments from different view-points that are meaningful for recognition purposes (see Fig. 1). Fig. 2 (right) shows a visual impression of the local regions of the video sequence that participate in modeling the object of interest. Observe that areas belonging to different time views are included. Also, the same part may appear more than once, if different views of it are important for recognition – in the figure goofy's left arm appears twice. In the test (recognition) phase a buffer of adjacent frames stored from a video stream is used to model the recently seen content of the observed scene of interest, similarly to training. Then the test model at time t is compared to the training models by means of a matching technique [9] that exploits the spatial and temporal coherence of time-invariant features: after computing a first set of robust matches with a simple NN strategy, the procedure is reinforced by analyzing spatio-temporal neighborhoods and deleting isolated matches while adding weaker matches that are spatio-temporally close to robust ones. Finally, recognition is based on the matching procedure and relies on an analysis of the recent content of the video, which is compared with all the available models.

Fig. 1. The object goofy in a sequence capturing its variation in space and time.

Fig. 2. Overview of our approach. Left: the construction of a time-invariant feature. Right: a visual representation of the local regions of the image sequence that participate in object modeling.

In the next sections we report detailed descriptions of the main phases of our method.

4. Building models with time-invariant features

In this section we introduce the time-invariant features and the model derived from an image sequence, upon which we base our object recognition method.

4.1. Feature extraction and cleaning

For each image of the sequence we extract Harris corners in the scale-space [18] and describe them using SIFT [21]:

SIFT_k = (p_k, s_k, d_k, H_k),

where p_k is the position in space, s_k refers to the keypoint scale level and d_k to the principal direction of the keypoint. H_k contains the local orientation histograms around the kth keypoint.

In order to model the variation of keypoints in time and space we cluster different occurrences of the same keypoint in the various frames. To do it efficiently we exploit the temporal coherence of the image sequence, tracking features along the image sequence with the help of a dynamic filter. The filter models the evolution in time of the position p_k, scale s_k and orientation d_k. The use of dynamic filters allows us to track and cluster robustly local keypoints that are affected by temporary occlusions or by localization noise (see Fig. 2, left). Although our early works concentrated on non-linear models (see [7]), our current approach is based on the linear Kalman filter, which is adequate for our case (keypoint tracking in dense sequences) and more efficient to compute.
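To make the tracking step concrete, the sketch below shows one way such a per-keypoint tracker could be organized in Python. It is a minimal reading of the description above, not the authors' implementation: a random-walk Kalman filter over the state (x, y, s_k, d_k), where the class name, the noise magnitudes q and r, and the occlusion tolerance max_missed are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code): a random-walk Kalman filter over a
# keypoint's position, scale and principal direction, used to link detections of
# the same keypoint across frames. Noise magnitudes are illustrative assumptions.
import numpy as np

class KeypointTrack:
    def __init__(self, p, s, d, q=1e-2, r=1.0):
        # State: [x, y, scale, direction]; dynamics F = I (random walk).
        self.x = np.array([p[0], p[1], s, d], dtype=float)
        self.P = np.eye(4)          # state covariance
        self.Q = q * np.eye(4)      # process noise (assumed)
        self.R = r * np.eye(4)      # measurement noise (assumed)
        self.missed = 0             # frames without an associated detection

    def predict(self):
        # With F = I the mean is unchanged and only the uncertainty grows;
        # this is what lets a track survive a short self-occlusion.
        self.P = self.P + self.Q
        return self.x

    def update(self, p, s, d):
        # Standard Kalman update with H = I.
        z = np.array([p[0], p[1], s, d], dtype=float)
        K = self.P @ np.linalg.inv(self.P + self.R)   # Kalman gain
        self.x = self.x + K @ (z - self.x)
        self.P = (np.eye(4) - K) @ self.P
        self.missed = 0

    def mark_missed(self, max_missed=5):
        # Tolerate a few missed frames before the trajectory is closed.
        self.missed += 1
        return self.missed <= max_missed
```

At each new frame, predicted states would be associated with the closest detected keypoints, and unmatched tracks kept alive for a few frames so that temporary occlusions do not break a trajectory.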

Nevertheless, at the end of the tracking procedure the quality of the obtained trajectories may vary. There is the need for a cleaning procedure that eliminates those that are unstable or contain errors: to this purpose, a multi-stage evaluation procedure is applied to each trajectory (see Fig. 3). First we check whether the variance of the scale (var_S) and of the principal direction (var_d), and the total variance of the local orientation histogram (var_H), are below pre-defined thresholds σ_S, σ_d, σ_H. Thresholds σ_d and σ_H are estimated over a training set of stable trajectories, and the current value of σ_d is 6. σ_S is set equal to 1.5, corresponding to the gap between two scale levels of the considered scale-space.

Fig. 3. The stages of the trajectory cleaning procedure.


Then, trajectories with high variances are further analyzed in order to check whether they contain abrupt changes that could be caused by tracking errors.

To do so, we perform a Sum of Squared Differences (SSD) correlation test, normalized with respect to the keypoint size, between the first gray-level patch of the trajectory and the subsequent ones. Then we compute the difference between adjacent SSD values. In the presence of abrupt changes of the similarity values, i.e., if one of the computed differences is above a threshold σ_SSD, the trajectory is discarded. σ_SSD is set to 0.45. This operation allows us to estimate the intra-class variability of the feature set induced by the tracking phase. The adopted heuristic exploits the available temporal coherence, rather than relying on a fully unsupervised method such as clustering.
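A compact sketch of this two-stage cleaning is given below; it is our interpretation of the text, with the patches assumed to be resampled to a common size, the threshold on the histogram variance (whose value is not reported) left as a parameter, and σ_S = 1.5, σ_d = 6 and σ_SSD = 0.45 taken from above.

```python
# Sketch of the two-stage trajectory cleaning (an interpretation, not the original
# code). A trajectory is given as parallel lists of gray-level patches, scales,
# principal directions and orientation histograms; patches are assumed to have
# been resampled to a common size.
import numpy as np

def keep_trajectory(patches, scales, directions, hists,
                    sigma_S=1.5, sigma_d=6.0, sigma_H=None, sigma_SSD=0.45):
    scales = np.asarray(scales, dtype=float)
    directions = np.asarray(directions, dtype=float)
    hists = np.asarray(hists, dtype=float)

    # Stage 1: keep trajectories whose scale, direction and histogram variances
    # are all below their thresholds.
    if (scales.var() <= sigma_S and directions.var() <= sigma_d
            and (sigma_H is None or hists.var(axis=0).sum() <= sigma_H)):
        return True

    # Stage 2: normalized SSD between the first patch and each following one;
    # an abrupt jump between adjacent SSD values reveals a tracking error.
    first = np.asarray(patches[0], dtype=float)
    ssd = [np.sum((np.asarray(p, dtype=float) - first) ** 2) / first.size
           for p in patches[1:]]
    jumps = np.abs(np.diff(ssd))
    return jumps.size == 0 or float(jumps.max()) <= sigma_SSD
```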

Fig. 4 (left) shows the SSD values evolving along the sequence for two trajectories: the green one refers to a good feature, whose appearance varies slowly and smoothly in time (Fig. 4 (middle) shows the corresponding patches); the red one refers to an unstable trajectory containing tracking errors: a visual inspection of the gray-level patches (Fig. 4 (right)) confirms this observation. Notice that both trajectories had passed the first cleaning step. The use of gray-level patches is justified by the fact that such information is never used in the tracking phase, therefore it may be an effective cross-check of the trajectory stability.

Fig. 4. The SSD computed between the patch extracted in the first frame and the following ones. Left: the green line (lower) shows the SSD trend for a stable trajectory (reported in the middle) while the red one (upper) shows a patch that presents abrupt changes due to tracking errors (reported on the right). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

4.2. The spatio-temporal model

Once a set of trajectories has been extracted from an image sequence, we build a set of time-invariant features—as many as the number of trajectories. We call this collection of features a spatio-temporal model of the sequence. Notice that we do not keep any information on the relative motion between the camera and the object, as it is not informative for the recognition purpose. As for the model size, thanks to the tracking and cleaning phase that keeps only stable features, the number of local elements decreases considerably (one order of magnitude smaller, as we will see in the experiments).

A time-invariant feature is a compact description of the appearance of an object element that is stable with respect to the view-point:

f_k = {H̄_k, p̄_k, t_k}

• H̄_k: a spatial appearance descriptor, that is, the average of all local orientation histograms of a trajectory {H_k}_t, t = t_i, ..., t_f.
• p̄_k = (x̄_k, ȳ_k): the average position of the trajectory keypoints (centroid of a spatial neighborhood).
• t_k = {t_k^i, t_k^f}: a temporal descriptor that contains information on when the feature first appeared in the sequence and on when it was last observed (this limits the temporal neighborhood).

More in detail, all the keypoints linked by a tracking trajectory (slightly different views of the same object element) participate in the construction of a time-invariant feature. Average values are good representatives of the original keypoints, as the tracking procedure, followed by cleaning, is robust and leads to a class of keypoints with a small variance. Fig. 5 (middle column) shows the similarity values of each keypoint H_kt of the trajectory with its average H̄_k; similarity is computed with histogram intersection (at the basis of our matching method, see Section 5). We consider two different keypoints: a planar one, shown on the top row, is a detail of a book; a 3D one, shown on the bottom row, is a detail of the hand of a plastic puppet. Notice how the similarity values are quite high and very stable, even if the trajectory appearance may change (Fig. 5 (left column) reports the gray-level patches).
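The sketch below condenses a cleaned trajectory into such a time-invariant feature and shows the histogram intersection similarity mentioned above. It is one possible reading of the definitions in this section (the names and the normalization of the intersection are our choices), not the authors' code.

```python
# Sketch of a time-invariant feature f_k = {H̄_k, p̄_k, t_k} built from a cleaned
# trajectory, and of the histogram intersection similarity used to compare two
# such features (an illustration; the normalization is one common choice).
from dataclasses import dataclass
import numpy as np

@dataclass
class TimeInvariantFeature:
    H: np.ndarray     # average local orientation histogram, H̄_k
    p: np.ndarray     # average keypoint position, p̄_k = (x̄_k, ȳ_k)
    t_start: int      # frame where the feature first appeared, t_k^i
    t_end: int        # frame where it was last observed, t_k^f

def build_feature(hists, positions, frames):
    return TimeInvariantFeature(H=np.mean(hists, axis=0),
                                p=np.mean(positions, axis=0),
                                t_start=int(min(frames)),
                                t_end=int(max(frames)))

def histogram_intersection(h1, h2):
    # For L1-normalized histograms this returns a similarity in [0, 1].
    return float(np.minimum(h1, h2).sum() / max(h1.sum(), 1e-12))
```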

Other summarizing descriptions were possible: [13], for instance, selects a stable (i.e., frontal) keypoint within the trajectory, estimating the minimum of a quadratic fit of the relative change between a keypoint at time t and t + 1. This choice only seems appropriate for planar objects: Fig. 5 (right), on top, replicates the evaluation reported in [13] on a planar feature, obtaining a stable feature at frame 10; the bottom shows how the same evaluation cannot be made on a 3D feature.

The main drawback of choosing an average description instead of a delegate keypoint is that some histogram peculiarities may be suppressed or smoothed, especially if the sequence is long, in apparent contradiction with the fact that a long trajectory is associated with a stable feature. This can be easily explained by recalling that keypoint tracking is performed over adjacent frames and no guarantee of global stability is required. To this purpose two countermeasures have been devised: (i) while building the average value H̄_k we discard all keypoints that deviate from the average; (ii) very long sequences are cut into shorter ones of maximum length N. The latter measure is skipped if the size of the model is big (i.e., in the case of complex objects or scenes).

Fig. 5. Comparison of our approach to aggregating information from a trajectory and the one proposed in [13], for a planar feature (top) and a 3D one (bottom). Each plot reports either the similarity to the average descriptor or the relative descriptor changes (with a quadratic curve fit) as a function of the frame number. See text.

Fig. 6. A simplified visual description of the two-stage matching between spatio-temporal models. Time-invariant features, depicted as trajectories of image patches, are represented in a space (S)–time (t) reference system.

This basic scheme for building the model of a sequence is adapted to the two different tasks of object modeling and object recognition in a scene of interest. In the modeling phase we may rest on the assumption that the sequence contains one object only, acquired in favorable circumstances. In the test or recognition phase, instead, no assumptions can be made either on the content of the scene (even if in the application domains that we will consider the user will be willing to collaborate to favor recognition) or on its length. Therefore, while in the modeling phase the model is built strictly following the above procedure, in the test phase we build an incremental model of the recently seen elements. This implies that time-invariant features that are too old are discarded from the model. More details on this incremental model will be given in Section 6.

5. Matching models with spatio-temporal constraints

In this section we report our two-stage matching, which exploits spatial and temporal coherence within each model. A preliminary version of this approach has been proposed in [9]. Let us consider a test model T and an object model M. The problem we address is to see whether the test model T contains the object M.

Our two-step matching strategy, visually described in Fig. 6, is performed as follows:

Step 1: We perform a nearest-neighbor matching between model M and test T: this procedure gives us initial hypotheses for the presence of the object represented by M in the observed scene. For each time-invariant feature in M, we use histogram intersection [31] between spatial appearance descriptors to check whether T contains a similar feature: we set a minimum similarity threshold S1 to obtain a collection of robust matches (f_M, f_T)_S1.

Step 2: This stage, based on enhancing spatio-temporally coherent matches, helps us to confirm or to reject the previous hypothesis. We use spatio-temporal constraints and perform a backward matching procedure from the test to the training model.
• We detect the subsets of M and T containing most matches, I_M and I_T, exploiting the temporal descriptors t of the features matched after the first step. Doing so we obtain hints about the object appearance in the scene, since I_M marks a particular range of views. In order to refine this detection, we reject matches which are spatially isolated, according to the features' average positions p.
• For each pair (f_M, f_T)_S1 belonging to I_M and I_T we increase the number of matches in a spatio-temporal neighborhood around the previously detected features. Let us consider a match (f_M, f_T)_S1: we can specify a spatio-temporal neighborhood Q (see Fig. 6, where the cylinder bounds the region in which this second search is performed) and consider the two sets, F_M and F_T, belonging to it. Such features (of M and T respectively) appear together with f_M and f_T for a chosen period of time. A new matching step is then performed between F_T and F_M considering a lower similarity threshold S2: a pair (f'_T, f'_M)_S2 is a new match if the features are spatially close to f_T and f_M, respectively.


The similarity thresholds have been set to S1 = 0.6, corresponding to the equal error rate on a training set of example features, and consequently we set S2 = 0.5. This procedure increases the number of high scores when groups of features which are spatio-temporally coherent appear both in training and test. Fig. 7 shows examples of the increased number of matches obtained thanks to the second step.

Thanks to the compact representation of visual information, the matching phase is efficient even when the sequence is long and there are several detected keypoints.
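A simplified sketch of the two-stage matching is given below. It reuses the TimeInvariantFeature and histogram_intersection helpers sketched in Section 4.2, takes S1 = 0.6 and S2 = 0.5 from above, treats the spatial radius and temporal margin of the neighborhood Q as assumed parameters, and omits the detection of the densest subsets I_M, I_T and the rejection of isolated matches.

```python
# Simplified sketch of the two-stage matching between an object model M and a test
# model T (lists of TimeInvariantFeature objects); histogram_intersection is the
# helper sketched in Section 4.2. The radius and temporal margin of Q are assumptions.
import numpy as np

def temporally_close(f, g, t_margin):
    # True if the two features' temporal extents overlap, up to a margin.
    return f.t_start - t_margin <= g.t_end and g.t_start <= f.t_end + t_margin

def match_models(M, T, S1=0.6, S2=0.5, radius=30.0, t_margin=10):
    # Step 1: nearest-neighbor matching with the strict threshold S1.
    robust = []
    for fM in M:
        sims = [histogram_intersection(fM.H, fT.H) for fT in T]
        if sims and max(sims) >= S1:
            robust.append((fM, T[int(np.argmax(sims))]))

    matches = list(robust)
    seen = {(id(a), id(b)) for a, b in matches}

    # Step 2: around every robust match, add pairs that are spatially close and
    # temporally overlapping, accepted at the weaker threshold S2.
    for fM, fT in robust:
        FM = [g for g in M if g is not fM and np.linalg.norm(g.p - fM.p) <= radius
              and temporally_close(fM, g, t_margin)]
        FT = [g for g in T if g is not fT and np.linalg.norm(g.p - fT.p) <= radius
              and temporally_close(fT, g, t_margin)]
        for gM in FM:
            for gT in FT:
                key = (id(gM), id(gT))
                if key not in seen and histogram_intersection(gM.H, gT.H) >= S2:
                    matches.append((gM, gT))
                    seen.add(key)
    return matches
```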



Fig. 8. A schematic description of the time-invariant features participating in the construction of the incremental model (see text); the time axis ranges from 0 to the current time t_now, with a memory window of length Δt.

6. The recognition pipeline

In this section we describe how the matching strategy is applied to recognize known objects in the observed scene.

We assume that our test data are video sequences of a scene of interest that may contain one or more objects previously acquired by the system in a modeling phase.

Let us first assume that the video sequence has been acquired and stored prior to the recognition phase. During recognition the video sequence is accessed and local features are extracted and tracked along the first frames of the sequence. An incremental model of the sequence is built, according to the formulation of Section 4.2, with the following specifications: when, at time t1, the size of the model reaches a minimal number of spatio-temporal features S_min, we look for known objects in the scene: the two-step matching is performed between the test model at time t1 (which encodes the information from the temporal window [0, t1]) and all training models.

Notice that, for space occupancy and performance reasons, the test models have a finite memory of length Δt: the size of the memory induces a temporal window of analysis that we slide ahead as time goes on. Time-invariant features that do not overlap with this temporal window are discarded, since they do not belong to recently seen elements. As pointed out in Section 4.2, in order to maintain the stability of spatio-temporal features a cutting procedure limits the length of the trajectories. In this incremental procedure we simply finalize a time-invariant feature composed of N keypoints, and start building a new one with the new observations. Fig. 8 depicts such an incremental model. Each segment represents a trajectory, and the dark dots indicate the beginning and the end of a trajectory. Intermediate dots indicate that after N frames a trajectory is finalized and a new one, containing more recent views of the feature, is started (see, for instance, trajectory D1). At the current time t_now the model contains {B1, D1, D2, E1, F1}.
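A sketch of this incremental test model is shown below; it reuses the build_feature helper from Section 4.2, and the values of N and Δt are illustrative, not those used by the authors.

```python
# Sketch of the incremental test model with finite memory (an interpretation of
# the description above). build_feature is the helper sketched in Section 4.2;
# N and delta_t are illustrative values.
class IncrementalModel:
    def __init__(self, N=30, delta_t=150):
        self.N = N                # maximum trajectory length before finalization
        self.delta_t = delta_t    # finite memory: length of the temporal window
        self.features = []        # finalized TimeInvariantFeature objects

    def finalize_trajectory(self, hists, positions, frames):
        # Called when a trajectory ends or reaches N keypoints; a new trajectory
        # is then started from the subsequent observations.
        self.features.append(build_feature(hists[:self.N], positions[:self.N],
                                           frames[:self.N]))

    def prune(self, t_now):
        # Discard time-invariant features that do not overlap the analysis window
        # [t_now - delta_t, t_now]: they describe elements seen too long ago.
        self.features = [f for f in self.features
                         if f.t_end >= t_now - self.delta_t]

    def reset(self):
        # Invoked when a large number of trajectories is abruptly interrupted,
        # e.g. when the user's attention moves to another object.
        self.features = []
```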

Fig. 7. An image region showing the position in space of a feature that found its match at Step 1 (A) and further features added at Step 2 thanks to their spatio-temporal vicinity to feature A (C). The circle B bounds the neighborhood Q in space.

The scene model is reset when a high number of trajectories is abruptly interrupted – this usually happens when the user's attention moves from one object to another.

Matching is performed each time the current model is significantly richer than, or different from, the one used in the previous matching stage. Matching a test model at time t against the training models can produce one of the following results (a minimal decision sketch is given after the list):

• There exists just one model M (of an object O) producing W_M ≥ W_min matches, while for every other model M_i, W_M − W_{M_i} ≥ δ. In this case we reach the conclusion that object O is in the scene. At this point the system is ready for another recognition, while the temporal window slides ahead.
• More than one model verifies the given conditions. We consider such objects as our hypotheses, then we extract new information to improve the description. When the test model has grown by at least n items, we retry recognition only among the hypotheses.
• There are no models verifying the conditions. For a fixed delay the system continues to grow the model; after a given number of matching attempts without success, it concludes that no known objects are present in the scene.
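The rule above can be summarized with the following minimal sketch, where W_min plays the role of the threshold on the number of matches discussed in Section 7, while the margin δ, the default values and the function name are illustrative assumptions.

```python
# Decision sketch for the three outcomes listed above (an illustration of the rule;
# the margin delta and the default values are assumptions, W_min plays the role of
# the threshold on the number of matches discussed in Section 7).
def decide(match_counts, W_min=10, delta=5):
    """match_counts maps each object model name to its number of matches W_M."""
    ranked = sorted(match_counts.items(), key=lambda kv: kv[1], reverse=True)
    above = [(name, w) for name, w in ranked if w >= W_min]
    if not above:
        return "none", []                                # keep growing the model
    best_name, best_w = above[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    if len(above) == 1 and best_w - runner_up >= delta:
        return "recognized", [best_name]                 # object O is in the scene
    return "ambiguous", [name for name, _ in above]      # retry among hypotheses
```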

This very same recognition pipeline may be applied to on-line recognition; in this case a video stream is acquired from a frame grabber and processed for as long as acquisition goes on.

The on-line procedure may be used for real-time processing. In order to keep up with real-time processing only a few slight modifications of the above procedure are advisable: (i) feature extraction is performed at the first frame and only feature tracking is applied in subsequent frames; extraction is performed again only if the current number of trajectories drops below a given threshold (10% of the original trajectories); (ii) during the cleaning procedure no SSD analysis is performed, it being the most time consuming step of the procedure. These modifications decrease the recognition performance, but this is countered by additional functionalities that may be given to the on-line system. Indeed, it is worth noticing that when the recognition process is activated in real time on a video stream our system gathers hypotheses on the presence of objects in the scene and may send feedback to the user on how to explore the world. In the case of multiple matches it suggests to the user to observe the object of interest from a slightly different view-point. On the system output window a text message "...?" (see Fig. 12) appears when the system has gathered enough information to make a judgement (i.e., the test model is richer than S_min features), but no evidence in favor of one object has been reached. In the case of no matches it suggests to change scale or observation point before concluding that no known object is present in the scene. In this case a text message "empty scene" suggests that no known objects are present. When the system starts receiving positive recognition feedback referring to a specific object, it returns a first text message showing that recognition is positive but the system confidence is still low: "May be object X?". Finally, a positive recognition stable for at least 10 frames returns a positive answer: "This is object X".
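The messages above could be driven by a small rule such as the following sketch; only the message strings and the 10-frame stability requirement come from the text, while the function and its arguments are hypothetical.

```python
# Hypothetical mapping from the recognition state to the on-screen messages quoted
# above; only the messages and the 10-frame stability rule come from the text.
def feedback(outcome, hypotheses, stable_frames):
    if outcome == "none":
        return "empty scene"            # no known object after repeated attempts
    if outcome == "ambiguous":
        return "...?"                   # enough data gathered, no clear winner yet
    if stable_frames >= 10:             # outcome == "recognized"
        return "This is object %s" % hypotheses[0]
    return "May be object %s?" % hypotheses[0]
```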

The system response is incrementally updated and cleaned at run time, so that the model of the observed scene does not grow to a size which is difficult to manage.

Fig. 9. Test frames reporting the two-stage matches against a set of known objects (see text).

7. Experiments

This section first describes experiments carried out on a set of previously acquired video sequences. It then reports a discussion on the importance of the temporal component, which first appeared in [8].

7.1. Experimental setting

The problem of 3D object recognition that we tackle aims at spotting the presence of an object of interest in a video sequence.

In our experiments we consider a dataset of 20 objects of different levels of complexity (books, puppets, jars) in terms of geometry, texture richness, and reflectance. Two deformable objects are included in the set (a soft puppet and a puppet with a wire skeleton whose position may be changed) and two puppets with adjustable limbs are also present.

We model each object starting from a video sequence (training sequence) acquired by a still camera while the object rotates on a turntable. Fig. 1 shows sample frames from one of these videos. In this modeling (or training) stage the environment is relatively controlled: the background is a plain white wall, and the scene is illuminated by a conventional neon light.

At run time, the object scale and view-point are not known a priori, and the object could also be occluded by other objects. Acquisition conditions may change (in particular scene illumination may be very different from training) and object positions and orientations may differ. Also, the quality of the obtained models is affected by the quality of the acquired sequences: if the camera motion is too fast or the acquisition rate is too low, the amount of errors induced by the tracking phase reduces the efficacy of the computed models. The videos recorded off-line for the purpose of evaluating the performance of the method have been acquired at 25 fps—enough to obtain good models. Instead, when the system runs in the on-line mode, with a single processor, the acquisition rate is lower – about 10 fps – therefore test models could be noisier, unless the camera motion slows down to guarantee small variations from frame to frame.

7.2. 3D object recognition

To assess the robustness of our approach we perform tests on video sequences, each of which mainly focuses on one out of the 20 objects. These test sequences are not acquired in favorable circumstances: other objects appear in the field of view, some of them belonging to the dataset of 20 known objects. Several models at a time may be taken into account, but in these experiments we assume that one object at a time is recognized.

Test videos are acquired so as to represent a wide selection of possible different conditions: illumination conditions have been changed in various ways, using different sources of natural and artificial light; some of the videos have been acquired on different days and in a different place from the training; different camera motions have been explored, including hand-held camera motion. Some of the known objects are occasionally temporarily occluded, and the scene often contains multiple objects, some of them known to the system.

A common feature of all videos is that they contain a "main character", i.e., an object that appears in most frames and is observed from the most favorable view-point possible. In this way the dataset tries to replicate the flavor of the on-line recognition procedure we are considering, where the user moves the camera around one object, trying to gather as much information on its appearance as possible. For the purpose of obtaining quantitative evaluations the test videos have been labeled with respect to this main character. No labels are associated with secondary characters that may appear in the scene. Occasionally our recognition system returns a positive feedback on the presence of a secondary character in a video (see examples in Fig. 9), increasing the number of false positives.

In order to give a flavor of the experiments performed, Fig. 9 shows sample frames of the test sequences highlighting the features that matched positively with some object models. Different colors distinguish features matched at the first or at the second step. Again, the circles indicate the spatial neighborhoods used at the second stage – the temporal neighborhoods do not appear since they cannot be drawn on a single frame.

Overall we performed 840 experiments. For each object known to the system we evaluate its presence over all the test data. Figs. 9 and 10 show a few frames from the test sequences.

Fig. 11 (right) shows a ROC curve that describes the system performance on the labeled set of videos as the threshold on the number of matches varies. Our system currently works with a threshold set to 10, which gives us a good compromise between false positives and false negatives.

The shape of the curve shows how a subset of the true positives is always missed, unless a very low threshold (up to two matches) is accepted. This is because some videos contain very different views of the main character with respect to the training set. Fig. 10 shows three frames out of a difficult video: scale and view-point are always very different from the training model. Also, severe occlusions of the object often occur.

Fig. 11 (left) reports, in gray-levels, the obtained confusion matrix with respect to all objects: training models are on the columns, while each row is derived from the test sequences containing a given object. The gray-level of each entry A_ij is proportional to how often model j has been spotted in the test sequences containing object i. Notice that some gray entries are due to secondary objects appearing in the videos.

Fig. 11. System results on a labeled set of videos. The confusion matrix summarizing the recognition results for the 20 objects (left); a ROC curve summarizing the system performance as the threshold on the number of matches necessary to achieve recognition varies (right).

Fig. 10. Frames out of a rather difficult sequence of object "goofy". The object always appears very different from the training set (see text).

Table 1. Results obtained with a multi-frame analysis as the threshold on the number of positive matches increases (see text). Training and test sets correspond to frames of the training and test sequences (see text).

Threshold   Precision   Recall
12          0.87        0.54
13          0.88        0.49
14          0.90        0.46
15          0.91        0.43
16          0.94        0.38
17          0.95        0.36
18          0.98        0.32
19          0.98        0.29

Table 2. Results obtained with a bag-of-keypoints (b.o.k.) approach (see text). Training and test sets correspond to frames of the training and test sequences. The first two rows refer to a NN classifier, the next rows to an SVM classifier equipped with a linear and a histogram intersection kernel respectively.

Matching method                       Precision   Recall
NN (similarity ≥ 0.5)                 0.11        0.15
NN (similarity ≥ 0.6)                 0.73        0.06
SVM e.e.r. (linear kernel)            0.32        0.32
SVM e.e.r. (histogram int. kernel)    0.39        0.39

The overall performance of our method on the whole dataset of videos, in terms of precision and recall,² is reported in Table 3. We have compared our results with two baseline methods: first, we consider a description of the object based on the local keypoints extracted from multiple views, reminiscent of, but simpler than, the method proposed in [20]. Then, we adopt a bag-of-keypoints approach and perform recognition with nearest neighbor (NN) and support vector machine (SVM) classifiers, similarly to [5].

Both algorithms are naturally applied to test images and not to test videos. A voting scheme allows us to obtain results related to each test video and then to compare them with our method.

We first evaluate the baseline methods on our dataset on a frame-by-frame basis, then we extend the most promising methods to the case of videos.

Tables 1 and 2 report the results obtained on a test set made of frames extracted from the test sequences. The training sets are a relatively dense multi-view description of the object (for computational reasons, instead of all the video frames, we sampled one frame every five).

We start off following the multi-view approach: similarly to [20], the best candidate match for each keypoint is found by identifying its nearest neighbor in the database of keypoints from training images. The best match is kept if it is significantly better than the second best match. The results obtained with this strategy are rather poor (only 3% of test images are correctly classified). We then add a threshold on the minimum number of features of the test image that must match the training features. Table 1 reports the results obtained with thresholds in the range [12–19]; we observe how, as the threshold grows, recall degrades quite quickly. In the next experiments we keep the threshold at 15.

2 We recall that precision = TP/(TP + FP) and recall = TP/(TP + FN), where TP = true positives, TN = true negatives, FP = false positives and FN = false negatives.

Notice that, having chosen a quite dense sampling of the training videos, the number of training features is very high. On the positive side are the quite good recognition rates; on the negative side, the computational cost of the matching strategy and the space required to store the models. Our method provides a compact alternative to such an approach. Indeed, we obtain trajectories of 60 frames on average, and about one fourth of the original trajectories survives the cleaning phase. Thanks to this, and to the fact that a time-invariant feature carries the information of all local features belonging to the trajectory, we obtain a space reduction of the model from [50,000–100,000] elements (when keeping all the local keypoints of a training set) to [2500–3500] elements. This is a considerable space saving and allows for faster matches.

Table 3. Results obtained with a video-based analysis. For what concerns the multi-view approach, the first row was obtained with a very dense sampling of the training videos (one every 5), the second row with a sparse sampling (so as to keep about 10 frames). Our method gives a better compromise between efficiency and performance (see text).

Matching method                                  Precision   Recall
Dense multi-view                                 0.91        0.62
Sparse multi-view                                0.51        0.54
Our time-invariant features (# matches ≥ 10)     0.84        0.59

To obtain a more compact model from a training set of images, a well known approach in the literature is the use of codebooks. We compare our method with such a strategy, implementing a bag-of-keypoints baseline. We build a vocabulary for each object, clustering the features from each training set with k-means and keeping the cluster centroids as words of the vocabulary. The k-means algorithm is initialized with a number of clusters equal to 70% of the average number of features per frame, to obtain a vocabulary of size similar to the time-invariant feature set (in the order of a few thousand elements). Then, each training image is represented with respect to the vocabulary, as are the test images. In accordance with our matching strategy, we first perform a nearest neighbor similarity match between the test image and each training image, to see whether a test image belongs to one object or not. The similarity measure adopted is histogram intersection. Since we build a model for each object of interest, a similarity threshold (0 ≤ t ≤ 1) needs to be set for each object. Table 2 (first two rows) reports the results obtained with the two best thresholds and shows how this approach is not appropriate for the problem at hand. Following the suggestions in [5], we then adopt a more complex SVM classifier, well known in the literature for its generalization properties. We first adopt a linear kernel as in [5], then a histogram intersection kernel, for conformity with the other experiments. Such a kernel has been shown to be very appropriate for dealing with histogram-like descriptions and has the very nice property that it does not depend on any parameter [27]. The selection of the regularization parameter is performed with a leave-one-out procedure. Table 2 (bottom two rows) reports the results obtained with the SVM offset b chosen at the equal error rate (e.e.r.). The results confirm that the main limitation of the approach is in the description phase. Bag-of-keypoints descriptions capture all meaningful local structures in the scene, including the ones generated by clutter. Since the background may change considerably in our setting, as the objects are allowed to be moved from place to place, such a global description is not appropriate. Testing image regions would increase performance at the price of efficiency. This approach shares many similarities with our previous attempt to perform object recognition [7]. The failure of such models when background and context information is not exploitable is apparent.
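For reference, the baseline just described can be sketched as follows; this is our reconstruction for illustration (vocabulary size and similarity thresholds are taken from the text where available and assumed otherwise), not the code used for Table 2.

```python
# Sketch of the bag-of-keypoints baseline (our reconstruction for illustration).
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, avg_features_per_frame):
    # One vocabulary per object: k-means centroids of the training descriptors,
    # with k equal to 70% of the average number of features per frame.
    k = max(1, int(0.7 * avg_features_per_frame))
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(descriptors)

def bag_of_keypoints(descriptors, vocabulary):
    # Represent an image as an L1-normalized histogram of visual-word occurrences.
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def nn_recognize(test_hist, train_hists, threshold=0.6):
    # Nearest-neighbor decision with histogram intersection, as in the first two
    # rows of Table 2 (one similarity threshold per object).
    sims = [float(np.minimum(test_hist, h).sum()) for h in train_hists]
    return max(sims) >= threshold
```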

Finally, Table 3 compares the results obtained with our method on a video-based analysis. The results in the first row have been obtained by applying a voting scheme on the video frames to the frame-by-frame results presented in Table 1. The results obtained are in line with the ones reported with our approach, but with a very high computational cost in comparing two big sets of keypoints. The table also reports, on the second row, results on training and test sets of size comparable to the ones considered with our method. These smaller datasets have been obtained by sub-sampling the original video frames: we kept about 10 frames per video, with a considerable decrease both in computational cost and in the obtained performance.

7.3. On-line recognition and the importance of continuous views

One of the main objectives of our work is the development of an on-line method that allows the user to interact with the recognition system. While it is not straightforward to report results on on-line recognition, since no ground truth is available, we may comment on the fact that our system may run for sessions of several minutes and, more importantly, that thanks to the feedback the system returns to the user, recognition rates (in particular recall) may increase. In particular, it is clear that when the system reaches a deadlock (more than one object model matches with success after various attempts, or no match is found for a long time) the user is advised to change the observation view-point, or to get closer to the object of interest. Fig. 12 shows a few examples of when the system reaches a decision or when it is puzzled by the presence of more than one object or by the lack of sufficient information.³

Fig. 12. Sample frames from the system output.

Fig. 13. The importance of observing objects for enough time, from slightly different view-points (see text). Each plot reports the number of matches obtained against the object models under consideration (e.g., Book geo, Goofy, Dewey, Bambi, Coffee, Easy box, Donald, Scrooge, Tea, Biscuit) as a function of the number of evaluations or of the frame number, together with the recognition threshold.

In this section we report a few qualitative results on the 3D object set that confirm the ability of our system to direct the user towards a better recognition. Thanks to the experiments reported we are also implicitly verifying the robustness of our system with respect to illumination and scene changes since, for these experiments, we use object models that were built 2 months before the recognition experiments. We switch on our recognition system and observe an object for enough time so that the system can recognize it. The time span required before recognition is achieved depends on the difficulty of the object: on average we observed that flat, highly textured objects are correctly classified, with no ambiguity, after a shorter amount of time than poorly textured or complex 3D objects.

Fig. 13 shows a few prototypical examples of the results we achieved in our experiments. On the x-axis is a measure of the temporal component (the number of frames when a single object is evaluated; the number of evaluations when different objects are compared, since the number of frames required to spot them may be very different), on the y-axis the number of robust matches achieved against all the object models considered. The horizontal line indicates the threshold (set to 15 in our case) above which an object is recognized.

Notice that the system works approximately in real time when performing the tracking, while the computation slows down during feature re-extraction (about 10 frames per second – performed when the data become insufficient) and matching attempts.

(a) We evaluate the effectiveness of our object representation considering a purely temporal component: the camera is still and the model is built using the features that are more robust, essentially, to acquisition noise. On the x-axis the number of evaluations indicates how often the system tries a new classification after enough new information has been gathered. The purpose of this plot is to show how recognition is nearly immediate in the case of a flat textured object (10 frames are enough for recognizing the book), while more time is necessary to recognize a more complex object (the puppet goofy is recognized after 40 frames). It is remarkable, though, to observe how a correct recognition is achieved after some time.

(b) For the purpose of assessing the need for a spatio-temporal description we compare the time needed for recognizing a 3D object with a still camera and while slightly rotating the object, in order to show some of its most peculiar sides. The plot shows how motion helps to achieve recognition more rapidly.

(c) The adopted scale-invariant descriptions allow us to obtain a good tolerance to scale variations, while our on-line recognition system, for performance reasons, does not perform an explicit scale search and assumes that the distance between object and camera is close to the one adopted in the modeling stage. Plot (c) is obtained by acquiring an object that starts very distant from the camera (at least 1 m further than in the modeling conditions) and then moves closer to it. The figure shows nicely how the recognition reliability grows to a maximum as the object approaches and then starts decreasing when it becomes much bigger than the original model.

3 A demo video of the on-line recognition procedure is available for download at http://slipguru.disi.unige.it/Downloads/multimedia/online_recognition.avi.

(d) Plain objects make recognition harder, since it is difficult to spot features that are both meaningful in space and robust in time: plot (d) shows how, for a very smooth and reflective object, about 100 frames are needed before the model is complex enough to be useful.

(e) The system is observing a very complex scene with many known objects at a wrong scale: a high number of models is considered, but none of them passes the minimal recognition threshold. When the user decides to focus on one specific object, recognition is immediately obtained.

(f) A highly cluttered scene containing two similar objects (a box and a book). After an initial confusion, the correct object is detected when the camera moves around it for about 50 frames, keeping it at the center of attention.

8. Conclusions

This paper presented a view-based approach to 3D object recognition in videos. The proposed method is based on the use of local scale-invariant features and captures the essence of the object appearance by selecting those features that are stable in time (i.e., across view-point changes) and meaningful in space. These features produce a compact description of the object, tolerant to scale, illumination, and view-point changes.

A two-step matching procedure allowed us to first exploit local similarities and then to reinforce them by looking for groups of matching features in space–time neighborhoods. This allows us to consider loose vicinity constraints among groups of features. The method was applied on-line, thanks to the use of an incremental model that describes the recently seen objects of a test sequence. We presented very promising experimental results on the recognition of 3D objects in cluttered environments.
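To make the two-step idea concrete, the sketch below first matches descriptors with a plain nearest-neighbour search and then keeps only the candidates supported by another match that is close in the test image and comes from a nearby model frame (i.e., a similar field of view). The data layout, the distance-ratio test and the two radii are illustrative assumptions and do not reproduce the exact procedure or parameters used in our experiments.

import numpy as np

def two_step_matching(test_kps, model_kps, ratio=0.8, space_radius=40.0, time_radius=5):
    # Each keypoint is assumed to be a dict with a descriptor ('desc'), an
    # image position ('pos') and, for model keypoints, the index of the
    # model frame it comes from ('frame').
    # Step 1: nearest-neighbour candidate matches on descriptor distances.
    model_desc = np.array([m["desc"] for m in model_kps])
    candidates = []
    for i, t in enumerate(test_kps):
        d = np.linalg.norm(model_desc - np.asarray(t["desc"]), axis=1)
        order = np.argsort(d)
        if len(order) > 1 and d[order[0]] < ratio * d[order[1]]:
            candidates.append((i, int(order[0])))
    # Step 2: keep a candidate only if another candidate is spatially close
    # in the test frame and temporally close in the model (similar view).
    robust = []
    for a, (ia, ja) in enumerate(candidates):
        for b, (ib, jb) in enumerate(candidates):
            if a == b:
                continue
            near_space = np.linalg.norm(np.asarray(test_kps[ia]["pos"]) -
                                        np.asarray(test_kps[ib]["pos"])) < space_radius
            near_time = abs(model_kps[ja]["frame"] - model_kps[jb]["frame"]) <= time_radius
            if near_space and near_time:
                robust.append((ia, ja))
                break
    return robust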

The fact that on-line recognition can be applied to real-time processing allowed us to perform a series of experiments based on the interaction between the system and the user. We observed how the temporal component, that is, the observation of an object from continuous view-points, may greatly help recognition. As expected, planar objects are much simpler to model and recognize, while in the case of complex 3D objects characterized by self-occlusions the use of continuous views leads to an improved performance.

The devised system was mainly motivated by the interest in understanding the benefits of using multiple or continuous views in the presence of critical conditions, in particular occlusions, scene changes, scale and illumination variations, and clutter. System scalability was not a primary goal: the current version of our system can deal with about 20 known objects at a time, and we are currently devising an indexing module that would allow us to increase the number of known objects. Also, for a more complete description of objects, it will be useful to merge information from different videos into a single object model.

Acknowledgments

The authors are grateful to Augusto Destrero, Alessandro Verri, and Matteo Santoro for fruitful discussions and proofreading. They would also like to thank the anonymous reviewers for their useful suggestions. The video acquisition infrastructure was developed under the technology transfer program with Imavis s.r.l. (http://www.imavis.com/). This work has been partially supported by the FIRB Project LEAP (RBIN04PARL) and the PRIN Project 3SHIRT.
