Geo-Contextual Priors for Attentive Urban Object Recognition

Katrin Amlacher, Gerald Fritz, Patrick Luley, Alexander Almer and Lucas Paletta

Abstract— Mobile vision services have recently been proposed for the support of urban nomadic users. While camera phones with image based recognition of urban objects provide intuitive interfaces for the exploration of urban space and mobile work, similar methodology can be applied to vision in mobile robots and autonomous aerial vehicles. A major issue for the performance of the service, which involves indexing into a huge number of reference images, is ambiguity in the visual information. We propose to exploit geo-information in association with visual features to restrict the search to a local context. In a mobile image retrieval task of urban object recognition, we determine object hypotheses from (i) mobile image based appearance and (ii) GPS based positioning, and investigate the performance of Bayesian information fusion with respect to benchmark geo-referenced image databases (TSG-20, TSG-40). This work specifically proposes to introduce position information as geo-contextual priors for geo-attention based object recognition, in order to better prime the vision task. The results from geo-referenced image capture in an urban scenario prove a significant increase in recognition accuracy (> 10%) when using geo-contextual information in contrast to omitting it; the application of geo-attention improves accuracy by a further > 5%.

I. INTRODUCTION

Mobile object recognition and visual positioning have recently been proposed in terms of mobile vision services for the support of urban nomadic users (Fig. 1). The performance of these services is highly dependent on the uncertainty in the visual information. Covering large urban areas with naive approaches would require referring to a huge number of reference images and consequently to highly ambiguous features.

Previous work on mobile vision services primarily advanced the state of the art in computer vision methodology for application in urban scenarios. [1] provided a first innovative attempt at building identification, proposing local affine features for object matching. [2] introduced image retrieval methodology for indexing visually relevant information from the web for mobile location recognition. Subsequent attempts [3], [4], [5] advanced the methodology further towards highly robust building recognition; however, the contribution of geo-information to the performance of the vision service has not yet been investigated.

In this paper we propose to exploit contextual information from geo-services with the purpose of cutting down the visual search space to a subset of all available object hypotheses

This work is supported in part by the European Commission funded project MOBVIS under grant number FP6-511051 and by the FWF Austrian National Research Network Cognitive Vision under sub-project S9104-N04.

JOANNEUM RESEARCH Forschungsgesellschaft mbH, Institute of Digital Image Processing, Wastiangasse 6, 8010 Graz, Austria, [email protected]

in the large urban area. Geo-information in association with visual features enables restricting the search to a local context. We describe the embedding of the problem in a general system implementation of an Attentive Machine Interface (AMI, Fig. 2) that enables contextual processing of multi-sensor information in a probabilistic framework. We extract object hypotheses in the local context from (i) mobile image based appearance and (ii) GPS based positioning, and investigate the performance of Bayesian information fusion with respect to the reference databases (TSG-20, TSG-40; Sec. IV). The results from experimental tracks and image capture in an urban scenario prove a significant increase in recognition accuracy (> 10%) when using the geo-contextual information.

Fig. 1. Free exploration of urban environments with camera phones and image based recognition services. This work proposes significantly better performance using position priors for object recognition.

II. URBAN OBJECT DETECTION AND RECOGNITION

Urban image based recognition provides the technology for both object awareness and positioning. Outdoor geo-referencing still mainly relies on satellite based signals, where problems arise when the user enters urban canyons and the availability of satellite signals dramatically decreases due to various shadowing effects. Cell identification is not treated here due to its large positioning error. Alternative concepts for localization, such as INS or markers that would need to be massively distributed across the urban area, are economically not affordable. In the following, we propose a system for mobile image retrieval. For image based urban object recognition, we briefly describe and make use of the methodology presented in [5].

Mobile recognition system In the first stage, the user captures an image of an object of interest in the field of view, and a software client initiates wireless data submission to the server. Assuming that a GPS receiver is available, the mobile device reads the current position estimate and sends it together with the image to the server.

In the second stage, the web service reads the message and analyzes the geo-referenced image. Based on the current quality of service and the given decision for object detection

Fig. 2. Schematic sketch of the client-server system architecture used for mobile urban object recognition using multi-sensor information. Context generating components for object recognition (Sec. II) and positioning are integrated in the context graph for information fusion (geo-indexed object recognition, Sec. III).

and identification, the server prepares the associated annotation information from the content database and sends it back to the client for visualization.

Informative features for recognition Research on visual object detection has recently focused on the development of local interest operators [6], [7] and the integration of local information into object recognition [8]. The SIFT (Scale Invariant Feature Transform) descriptor [7] is widely used for its robust matching despite viewpoint, illumination and scale changes in the object image captures, which is mandatory for mobile vision services. The Informative Features Approach [8], [5] used here applies local density estimations to determine the posterior entropy, making the local information content explicit with respect to object discrimination.
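
The extraction stage can be sketched with OpenCV's SIFT implementation (version 4.4+); this is our own illustration, not the authors' code, and the file name is a placeholder:

    import cv2

    # Extract standard 128-D SIFT descriptors from a query image (stage I in Fig. 3).
    img = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(img, None)
    print(len(keypoints), "keypoints, descriptor matrix:", descriptors.shape)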

The information content of a posterior distribution is determined with respect to given task specific hypotheses. In contrast to costly global optimization, one expects that it is sufficiently accurate to estimate the local information content from the posterior distribution within a sample test point's local neighborhood in descriptor space. One is primarily interested in the information content of any sample local descriptor d_i in descriptor space D, d_i ∈ R^|D|, with respect to the task of object recognition, where o_k denotes an object hypothesis from a given object set S_O. For this, one needs to estimate the entropy H(O|d_i) of the posterior distribution P(o_k|d_i), k = 1...Ω, where Ω is the number of instantiations of the object class variable O. The Shannon conditional entropy denotes

H(O|d_i) ≡ −∑_k P(o_k|d_i) log P(o_k|d_i).   (1)

One approximates the posteriors at d_i using only those samples g_j inside a Parzen window of local neighborhood ε, ||d_i − d_j|| ≤ ε, j = 1...J. Fig. 3 depicts discriminative descriptors in an entropy-coded representation of local SIFT features d_i. From discriminative descriptors one proceeds to entropy

Fig. 3. Concept for recognition from informative local descriptors. (I) First, standard SIFT descriptors are extracted within the test image. (II) Decision making analyzes the descriptor voting for a MAP decision. (III) In i-SIFT attentive processing, a decision tree estimates the SIFT specific entropy, and only informative descriptors are attended for decision making (II).

thresholded object representations, providing increasingly sparse representations with increasing recognition accuracy, in terms of storing only the selected descriptor information that is relevant for classification purposes, i.e., those d_i with H(O|d_i) ≤ H_Θ.
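
A minimal Python sketch of this selection, assuming labeled training descriptors and a hard Parzen window; the function names and the values of ε and H_Θ are illustrative, not the authors' implementation:

    import numpy as np

    def posterior_entropy(d_i, sample_descs, sample_labels, num_objects, eps=0.3):
        """Estimate H(O|d_i) (Eq. 1) from the labeled samples g_j that fall
        inside the Parzen window ||d_i - d_j|| <= eps."""
        dists = np.linalg.norm(sample_descs - d_i, axis=1)
        neighbors = sample_labels[dists <= eps]
        if neighbors.size == 0:
            return np.log(num_objects)  # no evidence: assume maximum entropy
        # Approximate the posterior P(o_k|d_i) by the neighbors' label histogram.
        p = np.bincount(neighbors, minlength=num_objects) / neighbors.size
        p = p[p > 0]
        return float(-np.sum(p * np.log(p)))

    def informative_descriptors(descs, sample_descs, sample_labels, num_objects,
                                h_theta=1.0):
        """i-SIFT style selection: keep only descriptors with H(O|d_i) <= H_Theta."""
        H = np.array([posterior_entropy(d, sample_descs, sample_labels, num_objects)
                      for d in descs])
        return descs[H <= h_theta]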

Attentive object detection and recognition Detection tasks require the rejection of images whenever they do not contain any objects of interest. For this we estimate the entropy of the posterior distribution, obtained from a normalized histogram of the object votes, and reject images with posterior entropies above a predefined threshold. The proposed recognition process is thus characterized by an entropy driven selection of image regions for classification, and a voting operation.
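
The rejection rule can be sketched as follows; the vote bookkeeping and the threshold value are our assumptions:

    import numpy as np

    def detect_and_recognize(votes, num_objects, h_reject=2.0):
        """MAP recognition with entropy based background rejection: `votes`
        holds the object index voted for by each informative descriptor."""
        counts = np.bincount(votes, minlength=num_objects)
        if counts.sum() == 0:
            return None                    # no informative descriptors at all
        p = counts / counts.sum()          # normalized vote histogram
        q = p[p > 0]
        if -np.sum(q * np.log(q)) > h_reject:
            return None                    # posterior too ambiguous: background
        return int(np.argmax(p))           # MAP object hypothesis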

III. GEO-INDEXED OBJECT RECOGNITION

Geo-services provide access to information about a local context that is stored in a digital city map. Map information in terms of map features is indexed via a current estimate of the user position, which can be derived from satellite based signals (GPS), dead-reckoning devices and so on. The map features can provide geo-contextual information in terms of locations of points of interest, objects of traffic infrastructure, information about road structure, and shop information.

A. Geo-services for object hypotheses

In previous work [9] we already emphasised the general relevance of geo-services for the application of mobile vision services, such as mobile object recognition. However, the contribution of positioning to recognition was treated merely on a conceptual level, and the contribution of the geo-services to the performance of geo-indexed object recognition was not quantitatively assessed. Fig. 5 depicts a novel methodology to introduce geo-service based object hypotheses. (i) A geo-focus is first defined with respect to a radius of expected position accuracy with respect to the city map. (ii) Distances between the user position and points of interest (e.g., tourist sight buildings) that are within the geo-focus are estimated. (iii) The distances are then weighted according to a normal

Fig. 4. Bottom-up versus top-down processing of geo-contextual information with respect to the vision task: (a) bottom-up: context graph, (b) bottom-up: feature space, (c) top-down: context graph, (d) top-down: feature space. (a) Bottom-up processing of vision and GPS based information is fused into a Maximum A Posteriori (MAP) decision maker. (b) Query features in visual feature space suffer from association with ambiguous nearest-neighbor features. In top-down processing (c), position information impacts the distribution in visual feature space (d) and, through this, the visual classification task itself.

density function by

p(x) = 1 / ((2π)^(d/2) |Σ|^(1/2)) · exp( −(1/2) (x − µ)^T Σ^(−1) (x − µ) ).   (2)

By investigating different values of σ, assuming (Σ)_ij = δ_ij σ_j^2, we can tune the impact of the distances on the weighting of object hypotheses. (iv) Finally, the weighted distances are normalised and determine the confidence values of the individual object hypotheses.
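
A minimal sketch of steps (i)-(iv), assuming 2D map coordinates in meters and an isotropic covariance Σ = σ²I; the function name, focus radius and σ values are illustrative:

    import numpy as np

    def geo_priors(user_pos, poi_positions, focus_radius=50.0, sigma=25.0):
        """Steps (i)-(iv): geo-focus, distances to POIs, Gaussian weighting
        (Eq. 2, isotropic case; the normalizing constant cancels under the
        final renormalization) yielding a prior P(o_k|x_g) per POI."""
        d = np.linalg.norm(poi_positions - user_pos, axis=1)   # (ii) distances
        w = np.exp(-0.5 * (d / sigma) ** 2)                    # (iii) weighting
        w[d > focus_radius] = 0.0                              # (i) geo-focus
        if w.sum() == 0.0:                                     # no POI in focus
            return np.full(len(poi_positions), 1.0 / len(poi_positions))
        return w / w.sum()                                     # (iv) normalize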

B. Contextual Awareness And Attentive Machine Interface

The Attentive Machine Interface (Fig. 2, [10]) uses a context framework that defines a cue as an abstraction of logical and physical sensors, which may represent a context itself, generating a recursive definition of context. Sensor data, cues and context descriptions are defined in a framework of uncertainty. Attention is the act of selecting and enabling detail, in response to situation specific data, within a choice of given information sources, with the purpose of operating exclusively on it. Attention enabled by the AMI therefore focuses operations on a specific detail of a situation that is described by the context.

The architecture of the AMI reflects the enabling of both bottom-up and top-down information processing and supports snapshot based (e.g., image) as well as continuous operation on a stream of input data (e.g., video). Fig. 2 outlines the embedding of the AMI within a client-server system architecture for mobile vision services with support from multi-sensor information. A user interface generates task information (mobile vision service) that is fed into the AMI. The user request for context information is handled by a Master Control (MC) component that schedules the processing (multiple users can start several tasks) and associates with each task corresponding system monitoring (SM) procedures. A concrete task is then performed by the Task Processor (TP), which, firstly, requests a hierarchical description of services, i.e., context-generating modules (context subgraph) and, secondly, executes the services in the order of the subgraph description. Since such a subgraph can provide several ways of processing, the appropriate part can be

selected by means of, e.g., time constraints, the confidence of the expected result and the quality of the context-generating services. If a service goes offline, the TP can switch to another similar service or to another processing chain, where already processed data is reused. The Context Graph Manager (CGM) maintains and manages context-generating modules in a graph structure (Context Graph). These context-generating modules are services that receive an input cue (an image, a GPS signal, etc.) from the Data Control (DC) module and generate a specific context abstraction from an integration of the input cues. The CGM assembles the subgraph according to several constraints, such as task information and the availability of context-generating modules and data, and ensures that the subgraph is processable. The AMI functionality makes it possible to arbitrarily combine services and implements process flow regulation mechanisms, e.g., switching to another service when a service goes offline. It is also possible to invoke an additional processing chain if the confidence of the result is too low. Multiple users can concurrently request context information, and the services are targeted towards fast and accurate (robust) responses.
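
The TP's fallback behavior can be illustrated with a hypothetical sketch; the class, the use of ConnectionError to signal an offline service, and the representation of the subgraph as a list of alternative-service lists are all our assumptions:

    class TaskProcessor:
        """Hypothetical sketch of the TP fallback logic: run the services of a
        context subgraph in order, switching to an alternative when one is
        offline."""
        def __init__(self, subgraph):
            # subgraph: list of processing steps, each a list of alternative services
            self.subgraph = subgraph

        def run(self, cue):
            for alternatives in self.subgraph:
                for service in alternatives:      # preference order within a step
                    try:
                        cue = service(cue)        # reuse processed data downstream
                        break
                    except ConnectionError:
                        continue                  # service offline: try the next one
                else:
                    raise RuntimeError("no service available for this step")
            return cue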

C. Bayesian information fusion (Bottom-Up)

Fig. 4a depicts the context graph of the bottom-up information processing approach for geo-indexed object recognition. Object recognition and geo-services are computed in parallel, and the resulting posteriors are integrated in a post-processing step. Fig. 4b illustrates that recognition with nearest-neighbor approaches for classification in the visual feature space is not impacted by geo-services.

Distributions over object hypotheses from vision (Sec. II) and geo-services are then integrated via Bayesian decision fusion. Although an analytic investigation of both visual and position signal based information should prove statistical dependency between the corresponding random variables, we assume that it is here sufficient to pursue a naive Bayes approach for the integration of the hypotheses (in order to get a rapid estimate of the contribution of geo-services

Fig. 5. Geo-services are applied to impact computer vision from position estimates: (a) distances between the estimated user position and geo-referenced points of interest (POIs); (b) priors for the computer vision task are determined by exponential weighting of distances (Eq. 2) with different spatial scales σ.

to mobile vision services) by

P(o_k | y_{i,v}, x_{i,g}) = α p(o_k | y_{i,v}) p(o_k | x_{i,g}),   (3)

where indices v and g mark information from image (y) andpositioning (x), respectively.
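
In code, Eq. 3 reduces to an elementwise product of the two posteriors followed by renormalization; a minimal sketch, where the function name is ours:

    import numpy as np

    def fuse_bottom_up(p_vision, p_geo):
        """Naive Bayes fusion of the vision and geo posteriors (Eq. 3); the
        normalization constant alpha makes the fused posterior sum to one."""
        fused = np.asarray(p_vision, dtype=float) * np.asarray(p_geo, dtype=float)
        return fused / fused.sum()

For instance, fuse_bottom_up(p_vision, geo_priors(user_pos, pois)) would combine the outputs of the vision classifier and the geo-service sketch above.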

D. Geo-Contextual Priors for Recognition (Top-Down)

Fig. 4c depicts the context graph of the top-down information processing approach for geo-indexed object recognition. The classification stage in visual object recognition is here completely dependent on the priors derived from the geo-services. Fig. 4d illustrates that recognition with nearest-neighbor approaches for classification in the visual feature space is directly impacted by the geo-contextual priors: visual features of hypotheses that were rejected by the geo-services are removed, clearing the feature space for improved nearest-neighbor classification:

P(o_k | y_{i,v}, x_{i,g}) = β p(y_{i,v} | o_k) p(o_k | x_{i,g}),   (4)

where indices v and g mark information from image (y) andpositioning (x), respectively.
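
One plausible realization of this top-down pruning, sketched under the assumptions that reference descriptors carry object labels, that hypotheses below a prior cutoff are discarded, and that the query is non-empty; all names and the cutoff are ours:

    import numpy as np

    def top_down_recognition(query_descs, ref_descs, ref_labels, p_geo,
                             prior_cutoff=1e-3):
        """Top-down geo-attention (Eq. 4): prune reference features of
        hypotheses whose geo-prior falls below a cutoff, vote with nearest
        neighbors in the cleared feature space, and weight by the prior."""
        p_geo = np.asarray(p_geo, dtype=float)
        keep = p_geo[ref_labels] >= prior_cutoff     # geo-plausible objects only
        descs, labels = ref_descs[keep], ref_labels[keep]
        votes = [labels[np.argmin(np.linalg.norm(descs - q, axis=1))]
                 for q in query_descs]
        counts = np.bincount(votes, minlength=len(p_geo))
        post = counts * p_geo                        # p(y|o_k) * p(o_k|x_g)
        return post / post.sum()                     # beta normalizes to one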

Fig. 6. (a) Airborne image of the test site with user geo-track (red), query image captures (light blue), and points of interest (POIs, blue). (b) Spatial heat map color-coding the number of hypotheses selected for recognition in top-down information processing, demonstrating attentive processing with selective attention in hypothesis space (the maximum number is 40 in the TSG-40 database, see Sec. IV).

IV. EXPERIMENTS

The overall goal of the experiments was to determine and quantify the contribution of geo-services to object recognition in urban environments. The performance in the detection and recognition of objects of interest in the query images, with respect to a given reference image database and a given methodology (TSG-20 [5]), was compared to the identical processing using, in addition, geo-information and information fusion for the integration of object hypotheses (Sec. III).

User scenario In the application scenario, we imagine a tourist exploring a foreign city for tourist sights (e.g., buildings). He is equipped with a mobile device with built-in GPS and can send image based queries to a server using UMTS or WLAN based connectivity. The server performs geo-indexed object recognition and is expected to respond with tourist relevant annotation if a point of interest was identified.

Hardware and Image Databases In the experiments we used an ultra-mobile PC (Sony Vaio) with 1.3 MPixel image captures and a camera-equipped mobile phone (HTC Touch Cruise) with 1 MPixel image captures. Reference imagery [5] with 640×480 resolution of the building objects of the TSG-20 database¹ was captured with a mobile phone (Nokia 6230); the building objects of the TSG-40 database² were captured with an SLR camera (Canon EOS), where the high-resolution images were downsampled to 1 MPixel. The reference imagery contains changes in 3D viewpoint, partial occlusions, scale changes from varying exposure distances, and various illumination changes. For each object we selected 2 images, taken with a viewpoint change of ≈ ±30° and at similar distance to the object, for training to determine the i-SIFT based object representation (Sec. II). 2 additional views were taken for test purposes.

Query Image Databases For the evaluation of background detection we used a dataset of 120 query images, containing only images of buildings and street sides without TSG-20 objects. Further datasets were acquired with the UMPC, consisting of seven images per TSG-20 object, and with the HTC, consisting of six images per TSG-40 object, from different viewpoints; the images were captured on different days under different weather conditions.

Results on object detection and recognition In the first evaluation stage, each individual image query was evaluated for vision based object detection and recognition (Sec. II), then regarding the extraction of geo-service based object hypotheses (Sec. III), and finally with respect to Bayesian decision fusion of the individual probability distributions.

Detection is a pre-processing step to recognition that avoids geo-services supporting objects that are not in the query image. Preliminary experiments resulted in low performance (TP rate 89.2%, FP rate 20.1%); however, geo-indexed object recognition then finds more correct hypotheses.

The evaluation of the complete database of image queries about TSG-20 objects (Fig. 8a,b) proves a decisive advantage of taking geo-service based information into account, in contrast to purely vision based object recognition. While vision based recognition is on a low level (≈ 84%, probably due to low sensor quality), an exponentially weighted spatial enlargement of the scope of object hypotheses with geo-services increased the recognition accuracy up to ≈ 97%. With increasing σ, an increasing number of object hypotheses is taken into information fusion, and the performance finally drops to the vision based recognition performance (uniform distribution of the geo-service based object hypotheses).

In addition, we applied the methodologies to the TSG-40 database, where images of objects of a complete inner city area were processed (Fig. 6). The results confirm the superiority of the top-down processing applied to both high-quality (Fig. 8c) and low-quality (Fig. 8d) imagery. Fig. 7 depicts sample query images associated with the corresponding distributions over object hypotheses from vision, geo-services, and information fusion. The results demonstrate significant increases in the confidences of the correct object hypotheses. The reason to use the TSG-20 and TSG-40 databases was to

¹ http://dib.joanneum.at/cape/TSG-20/
² http://dib.joanneum.at/cape/TSG-40/

Fig. 7. Resulting posteriors (blue = bottom-up, green = top-down processing) from a sample recognition experiment: (a) geo-referenced train image, (b) query image, (c) vision based posterior from object recognition, (d) geo-service based posterior, and (e) Bayesian fusion based posterior using visual and position information.

make use of geo-referenced training images. Other publicly available building image databases, such as the ZuBuD [1], do not provide geo-coordinates for our framework.

V. CONCLUSIONS

In this work we investigated the contribution of geo-contextual information to the improvement of performance in visual object detection and recognition. We argued that geo-information provides a focus on the local object context that enables a meaningful selection of expected object hypotheses, and finally proved that the performance of urban object recognition can be significantly improved. This work will be relevant for a multitude of mobile vision services and for geo-indexed image retrieval, enabling more accurate and more robust mobile applications.

Fig. 8. Performance results on the complete geo-referenced databases (Geo = geo-services, OR = object recognition, OR+Geo = bottom-up processing, R+OR+Geo = top-down processing): (a) TSG-20 results (bottom-up: ≈ 97%, top-down: ≈ 99%), (b) results (speed) of top-down processing, (c) results on the TSG-40 database with high-quality imagery (max. 2500 keys per object class, H_Θ = 3.5), (d) results on the TSG-40 database with low-quality imagery. The results demonstrate a superior performance of the top-down geo-indexed object recognition approach, in particular for local geo-contexts with spatial scale σ.

REFERENCES

[1] H. Shao, T. Svoboda, and L. van Gool, "HPAT indexing for fast object/scene recognition based on local appearance," in Proc. Intl. Conf. on Image and Video Retrieval, CIVR 2003, 2003, pp. 71–80.

[2] T. Yeh, K. Tollmar, and T. Darrell, "Searching the web with mobile images for location recognition," in Proc. IEEE Computer Vision and Pattern Recognition, CVPR 2004, Washington, DC, 2004, pp. 76–81.

[3] R. Maree, P. Geurts, J. Piater, and L. Wehenkel, "Decision trees and random subwindows for object recognition," in ICML Workshop on Machine Learning Techniques for Processing Multimedia Content (MLMM 2005), 2005.

[4] S. Obdrzalek and J. Matas, "Sub-linear indexing for large scale object recognition," in Proc. British Machine Vision Conference, vol. 1, 2005, pp. 1–10.

[5] G. Fritz, C. Seifert, and L. Paletta, "A mobile vision system for urban object detection with informative local descriptors," in Proc. IEEE 4th International Conference on Computer Vision Systems, ICVS, New York, NY, January 2006.

[6] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors," in Proc. Computer Vision and Pattern Recognition, CVPR 2003, Madison, WI, 2003.

[7] D. Lowe, "Distinctive image features from scale-invariant keypoints," Intl. Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[8] G. Fritz, C. Seifert, and L. Paletta, "Urban object recognition from informative local features," in Proc. IEEE Intl. Conference on Robotics and Automation, ICRA, 2005, pp. 132–138.

[9] P. Luley, L. Paletta, A. Almer, M. Schardt, and J. Ringert, "Geo-services and computer vision for object awareness in mobile system applications," in Proc. 3rd Symposium on LBS and Cartography. Springer, 2005, pp. 61–64.

[10] K. Amlacher and L. Paletta, "An attentive machine interface using geo-contextual awareness for mobile vision tasks," in Proc. European Conference on Artificial Intelligence, vol. 178, 2008, pp. 601–605.