Qualitative scene interpretation using planar surfaces

11
Autonomous Robots 8, 129–139 (2000) c 2000 Kluwer Academic Publishers. Printed in The Netherlands. Qualitative Scene Interpretation Using Planar Surfaces A. BRANCA, E. STELLA AND A. DISTANTE Istituto Elaborazione Segnali ed Immagini - CNR, Via Amendola 166/5 Bari, Italy Abstract. Our aim is to provide an autonomous vehicle moving into an indoor environment with a visual system to perform a qualitative 3D structure reconstruction of the surrounding environment by recovering the different planar surfaces present in the observed scene. The method is based on qualitative detection of planar surfaces by using projective invariant constraints without the use of depth estimates. The goal is achieved by analyzing two images acquired by observing the scene from two different points of view. The method can be applied to both stereo images and motion images. Our method recovers planar surfaces by clustering high variance interest points whose cross ratio measurements are preserved in two different perspective projections. Once interest points are extracted from each image, the clustering process requires to grouping corresponding points by preserving the cross ratio measurements. We solve the twofold problem of finding corresponding points and grouping the coplanar ones through a global optimization approach based on matching of high relational graphs and clustering on the corresponding association graph through a relaxation labeling algorithm. Through our experimental tests, we found the method to be very fast to converge to a solution, showing how higher order interactions, instead to giving rise to a more complex problem, help to speed-up the optimization process and to reach at same time good results. Keywords: 3D structure reconstruction, projective invariants, graph matching, maximum clique, relaxation labeling 1. Introduction The main goal of an autonomous agent moving into an indoor environment is to build up a 3D description of the surrounding scene by using 2D images acquired from different viewpoints by the vision system of which it is equipped. Inferring 3D information from 2D images taken from different viewpoints can be performed by Euclidean (metric) or projective (non metric) models. Metric ap- proaches establish a model to relate the pixel coordi- nates to the space coordinates, by computing the dis- tance (depth) from the camera optical center to points in the world from their correspondences in a sequence of images using the camera projection matrix (Tsai, 1986; Faugeras and Toscani, 1986). Though this class of methods can capture the Euclidean representation of the 3D space, a calibrated system is required. However, we know that it is not always possible to assume that cameras can be calibrated off-line, and, in addition, individual depth estimates often result noisy and sparse, producing a few qualitative information useful for scene analysis. Moreover, a lot of navigational problems don’t re- quire the precise knowledge of the 3D position of points in the scene, but only a qualitative and compact descrip- tion of the environment useful for scene interpretation, object modeling, object recognition, 3D motion esti- mation, ground plane obstacle detection. Computing structure without explicit camera calibration should be more robust than using calibration because we need not make any assumptions about the Euclidean geometry. When no initial assumption is made about either the intrinsic or extrinsic parameters of the camera, from point correspondences in pairs of images one is able

Transcript of Qualitative scene interpretation using planar surfaces

Autonomous Robots 8, 129–139 (2000)c© 2000 Kluwer Academic Publishers. Printed in The Netherlands.

Qualitative Scene Interpretation Using Planar Surfaces

A. BRANCA, E. STELLA AND A. DISTANTEIstituto Elaborazione Segnali ed Immagini - CNR, Via Amendola 166/5 Bari, Italy

Abstract. Our aim is to provide an autonomous vehicle moving into an indoor environment with a visual systemto perform a qualitative 3D structure reconstruction of the surrounding environment by recovering the differentplanar surfaces present in the observed scene.

The method is based on qualitative detection of planar surfaces by using projective invariant constraints withoutthe use of depth estimates. The goal is achieved by analyzing two images acquired by observing the scene fromtwo different points of view. The method can be applied to both stereo images and motion images.

Our method recovers planar surfaces by clustering high variance interest points whose cross ratio measurementsare preserved in two different perspective projections. Once interest points are extracted from each image, theclustering process requires to grouping corresponding points by preserving the cross ratio measurements.

We solve the twofold problem of finding corresponding points and grouping the coplanar ones through a globaloptimization approach based on matching of high relational graphs and clustering on the corresponding associationgraph through a relaxation labeling algorithm.

Through our experimental tests, we found the method to be very fast to converge to a solution, showing howhigher order interactions, instead to giving rise to a more complex problem, help to speed-up the optimizationprocess and to reach at same time good results.

Keywords: 3D structure reconstruction, projective invariants, graph matching, maximum clique, relaxationlabeling

1. Introduction

The main goal of an autonomous agent moving intoan indoor environment is to build up a 3D descriptionof the surrounding scene by using 2D images acquiredfrom different viewpoints by the vision system of whichit is equipped.

Inferring 3D information from 2D images taken fromdifferent viewpoints can be performed by Euclidean(metric) or projective (non metric) models. Metric ap-proaches establish a model to relate the pixel coordi-nates to the space coordinates, by computing the dis-tance (depth) from the camera optical center to pointsin the world from their correspondences in a sequenceof images using the camera projection matrix (Tsai,1986; Faugeras and Toscani, 1986). Though this classof methods can capture the Euclidean representation ofthe 3D space, a calibrated system is required.

However, we know that it is not always possible toassume that cameras can be calibrated off-line, and, inaddition, individual depth estimates often result noisyand sparse, producing a few qualitative informationuseful for scene analysis.

Moreover, a lot of navigational problems don’t re-quire the precise knowledge of the 3D position of pointsin the scene, but only a qualitative and compact descrip-tion of the environment useful for scene interpretation,object modeling, object recognition, 3D motion esti-mation, ground plane obstacle detection. Computingstructure without explicit camera calibration shouldbe more robust than using calibration because weneed not make any assumptions about the Euclideangeometry.

When no initial assumption is made about either theintrinsic or extrinsic parameters of the camera, frompoint correspondences in pairs of images one is able

130 Branca, Stella and Distante

only to compute a projective representation of the world(Faugeras, 1992; Hartley et al., 1992).

More recent work (Faugeras, 1992) focuses on theextraction of information from the cameras and the im-ages by only using results of projective geometry.

Recovering of a non-metric 3D structure up to a pro-jective transformation requires only the geometric in-formation relating different viewpoints (collineations,fundamental matrix, epipolar geometry) from whichto proceed towards the recovery of the non-metric 3Dstructure (Luong and Faugeras, 1996; Deriche et al.,1994; Prichett and Zisserman, 1998; Beardsley et al.,1994).

In the context of human made environments most ofexpected objects (walls, floors, ceilings, doors, poly-hedral objects) are characterized by planar faces. Ouraim is to obtain a raw segmentation of the scene intoplanar regions characterized by only some sparse co-planar points. Various approaches have been proposedin literature to use the projective geometry theory to ap-proximate the observed scene with a set of planar faces(Chabbi and Berger, 1996; Carlsson, 1996; Rothwellet al., 1995; Nagao, 1996; Zhang and Faugeras, 1994).

Planar surfaces imply a very strong constraint on thekind of correspondences that occur between two im-ages. More precisely, if the tokens that are put into cor-respondence are produced by visual features situatedin a plane, there exists an analytic transformation bet-ween the two projective planes (Semple and Kneebone,1979) completely specified by a 3 by 3transformationmatrix named homography.

Approaches using planar homographies to detectplanar surfaces impose to small subsets of corre-spondences (supposed to be coplanar on the basis oftheir nearness, geometrical configuration or simple co-planarity tests) to be merged on the basis of their ho-mography similarity or to be grown of additional cor-respondences (Prichett and Zisserman, 1998; Sinclairand Blake, 1996). The main drawback of this classof approaches is to require the initial homography es-timates to be refined during the plane reconstruction.Besides initial erroneous homography estimates couldprevent optimal plane recovery, repeating the compu-tations on the growing sets can result into a more timeconsuming approach.

In this work, our aim is to exploit the fundamen-tal property of planar homographies of preserving thecross-ratio of five arbitrary co-planar points.

Most of previous attempts to use the projectiveinvariance of cross-ratio of five coplanar points, as

constraint in planar region detection (Chabbi andBerger, 1996; Sinclair and Blake, 1996; Gurdjoset al., 1996; Carlsson, 1996; Rothwell et al., 1995;Oberkampf et al., 1996; Kanatani, 1994), have encoun-tered a lot of difficulties due to the use of probabilis-tic analysis (Maybank, 1995a, b). The performance ofprobabilistic approaches depends on the choice of rulefor deciding whether five image points have a givencross-ratio. The definition of a robust decision rule—which takes into accounts the effects on the cross ra-tio of small errors in locating the image points—andthresholds on the probabilities in the decision rule is adifficult task. Moreover, an individual estimate of thecross-ratio invariance of a five-point set does not allow,immediately, the clustering of all points available on agiven plane.

Our idea is to overcome the problems derived fromthe use of probabilistic decision rules by consideringthis projective invariance constraint into a global op-timization process performing at same time matchingand grouping of coplanar features. By considering alarge number of intersecting subsets of five points, ob-tained as combinations of available sparse features, theclustering problem is solved by searching for a solutionin the space of all potential matches by imposing to sat-isfy the five-order constraint of cross ratio. Imposingthe cross ratio invariance to be globally satisfied sparesus to deciding about local measurements of cross-ratio,giving rise to a more robust approach.

Using graph theory, our clustering problem is equiv-alent to find all maximal cliques (i.e., the largest subsetsof nodes mutually compatible) of an association graphwhose nodes represent the potential feature matchesand the 5-order links are weighted by cross ratio simi-larity: each maximal clique represent a planar face. Inliterature maximal clique finding algorithms (Pardalosand Xue, 1994; Horaud and Skordas, 1989) are es-sentially implemented as recursive graph search meth-ods which are known to be NP-complete problemsand of an exponential growth in computing time asthe number of association graph nodes increases. Onthe other hand optimization techniques (hopfield neuralnetworks (Jagota, 1995), relaxation labeling (Pelillo,1995)) have been resulted more reliable, convergingin polynomial times to an optimal solution. In Pelillo(1995) it is shown as a an optimal solution to the generalproblem ofmaximum clique finding(i.e., finding thelargest maximal clique) can be achieved through a sim-ple version of the original relaxation labeling model ofRosenfeld, Hummel, Zucker (Rosenfeld et al., 1976).

Qualitative Scene Interpretation 131

Based on the results in (Pelillo, 1995, 1997), we ex-tend the classical relaxation labeling approach (basedon binary compatibility matrices) to treat with com-patibility matrices of order five (with coefficients de-termined through cross ratio similarity measurements)and apply it to our context of planar surfaces recov-ering by clustering of coplanar features. Since in ourcontext we are not interested to an exact 3D reconstruc-tion but only to a raw approximation of the surroundingenvironment with a set of planar surfaces, an optimiza-tion approach, as the non-linear relaxation labeling, issufficient.

Once a raw segmentation of the observed scene isdetermined with some sparse features, we can obtaina more refined segmentation by recovering a lot of ad-ditional points on each plane through the associatedhomography matrix.

Summarizing, in our method, high interest fea-tures, extracted using the Moravec’s interest operator(Moravec, 1983) from each frame, are clustered intolarge sets by imposing to satisfy the cross-ratio invari-ance constraint (Section 2). By using the graph theorya planar region is represented by a set of totally con-nected nodes of an association graph (Section 3) and itis determined through an optimization approach basedon a non-linear relaxation labeling process (Section 4).Finally, on each cluster of the so recovered co-planarfeatures, the associated planar homography is recover-ed in order to obtain a more refined scene segmenta-tion (Section 5). In our experimental tests (Section 6),we found our method to be very fast in converging toa correct solution, showing as despite common assump-tions, higher order interactions help to speed-up theprocess.

2. The Projective Invariance Constraint

Clustering of coplanar features proposed in this paperis based on projective invariance of cross ratio of fivearbitrary coplanar points.

We use the Moravec interest operator to isolatefeature-points with high directional variance (Moravec,1983), i.e., the projections on the image plane of 3Dpoints occurring at corners and edges.

Our aim is to grouping coplanar features into clus-ters representing different planes present in the scenewithout any a priori knowledge.

The cross-ratio is the simplest numerical propertyof an object that is unchanged under projection to animage. In the plane it is defined on four coplanar lines

Figure 1. The pencil of four coplanar lines has a cross ratio definedby the angles between lines. Any line intersecting the pencil has thesame cross-ratio for the points of intersection of the line with thepencil.

(u1, u2, u3, u4), incident at a single point (pencil oflines) (Fig. 1), in terms of the angles between them andis given by:

cr(α13, α24, α23, α14) = sin(α13)× sin(α24)

sin(α23)× sin(α14)(1)

whereαi j is the angle formed by the incident linesui

andu j . Any five coplanar point setP = (p1, p2, p3,

p4, p5) (not three of which collinear) will be charac-terized by the invariance of cross-ratiocross(P) of thepencil of coplanar lines generated joining a point ofPwith the other four (Fig. 2).

Projective invariance of cross ratio imposes for eachsubset of five coplanar pointsP in the first image thecorresponding points in the second imageP′ to havethe same cross ratio.

Actually in our context we are dealing with featuresextracted from different unknown planes: given five

Figure 2. Any set of five coplanar points has the cross ratio of thecorresponding pencil.

132 Branca, Stella and Distante

arbitrary features, if the cross ratio is not preserved wecannot decide if the features are on different planes orthey have been mismatched. In fact, in most previousworks, using cross ratio invariance, assumptions aboutcoplanarity or correctness of some matches are a priorimade.

Our idea is to solve this twofold problem by impos-ing the cross ratio invariance constraint to be satisfiedon a lot of combinations of available features. The goalis to recovering large subsets of features all mutuallycompatible (i.e., correctly matched and coplanar) byimposing cross ratio invariance to be globally satisfied.

3. Mapping to High-Order Graphs

We solve this global constraint satisfaction problemthrough an optimization approach selecting sets ofmatches mutually compatible, i.e., pairs of featurescorrectly matched and coplanar. We represent the twoframes of the sequence by two relational graphs char-acterized by

— nodes({n1i }, {n2 j }) representing the feature points({pi }, {qj }) extracted using the Moravec’s interestoperator

— five-order links ({L1i1i2i3i4i5}, {L2 j1 j2 j3 j4 j5}) weight-ed by the cross-ratio

(CR1i1i2i3i4i5 = cross(pi1, pi2, pi3, pi4, pi5),

CR2 j1 j2 j3 j4 j5 = cross(qj1,qj2,qj3,qj4,qj5))

evaluated on the features associated to the five con-necting nodes.

Optimal matches should be recovered through a searchon the association graphG = {{nh}, {lhklmn}} consist-ing of

— nodes{nh} representing candidate matches— five-order links{lhklmn}weighted by cross ratio sim-

ilarity (2):

Chklmn= e−(|CR1hi ki li mi ni −CR2h j k j l j mj n j |2) (2)

Theoretically, the association graphG should consistsof N × M nodes{nh}, representing candidate matches{(phi ,qhj )} and( N×M

5 ) links weighted by compatibil-ity matrix values{Chklmn}. Actually, in order to reducethe number of links involved in the process, we gen-erate a nodenh only if the corresponding featuresphi

andqhj have an high radiometric similarity. Each planeis characterized by a set of assignment nodes totallyconnected as specified by the imposed projective invari-ant constraint of cross-ratio. The goal is to determineall clusters of nodes mutually compatible according tothe compatibility matrixC.

In graph theory a set of nodes mutually compatibleis known as clique; a clique is maximal if no strictsuperset of it is a clique; a maximum clique is a cliquehaving the largest cardinality.

Our problem requires to recovering all maximalcliques identifying all different planes. We solve theproblem of co-planar feature detection by finding iter-atively all maximum cliques of the constructed associ-ated graph.

4. Co-Planar Feature Detection

An efficient optimization tool capable of solving themaximum clique problem is the relaxation labelingmodel.

Relaxation labeling processes solve difficult con-strained satisfaction problems, making use of contex-tual information to solve local ambiguities and achieveglobal consistency (Pelillo, 1997).

A relaxation labeling process takes as input an initiallabeling (matching) assignment and iteratively updatesit considering the compatibility model.

If we assign to each nodenh an initial labeling3h

representing the degree of confidence of the hypothesisphi is matched with qhj , a relaxation algorithm updatesthe labeling{3h} in accordance with the compatibilitymodel, by using asupport functionγ (3) (which quan-tifies the degree of agreement of theh-th match withthe context).

γh =∑klmn

Chklmn3k3l3m3n (3)

The rule of adjustment of the labeling{3h} should in-creases3h whenγh is high and decreases it whenγh

is low. This leads to the following updating rule:

3h = 3hγh

/∑k

3kγk (4)

Iteratively all labeling of{nk} nodes are updated untila stable state is reached. In the stable state all{nh}nodes with non-null labeling3h will represent optimalmatches between coplanar features.

Qualitative Scene Interpretation 133

In Pelillo (1997) it has been shown that this algorithmposses a Liapunov function. This amounts to statingthat each relaxation labeling iteration actually increasesthe labeling consistency, and the algorithm eventuallyapproaches the nearest consistent solution. The solu-tion is the largest set of nodes mutually compatibleaccording to the compatibility matrix{Chklmn}, i.e., themaximum clique identifying the largest planar face.

4.1. The Clustering Process

The relaxation labeling approach gives us an approxi-mation of the maximum clique: i.e., the largest set offeatures correctly matched and coplanar, this amountsto estimate the planar surface for which a greatest num-ber of features have been extracted. Actually our goalis to recover as many subsets of mutually compatiblenodes as the large planar surfaces are observed in thescene. In order to recover all planar faces in whichthe available features can be organized, we apply therelaxation labeling process iteratively, until all featureshave been clustered.

At each iteration, the current largest subset of featurematches mutually compatible under cross ratio invari-ant constraint is identified as the maximum clique andit is pruned from the association graph. When all max-imal cliques have been identified the remaining graphnodes will represent mismatched features or sparse iso-lated features.

We point out that the method in finding the maximalcliques recovers at same time all correct matches.

Finally, in order to reduce the computational com-plexity (of order( N

5 ), if N is the number of involvedmatch nodes), for each cluster to be determined, insteadto apply the relaxation process for maximum cliquefinding to the whole setM of theN involved matches,we considerm smallest subsets{M1, . . . ,Mm}—eachof n matches randomly selected fromM (n¿ N)—sothat Mi ∩ M j 6= ∅ and Mi ⊂ M (for i, j = 1 . . .m)andm( N

5 )¿ (N5 ). The maximum clique of the whole

set M is generated by merging all maximum cliquesestimated from all smallest subsets{M1, . . . ,Mm}.

The so clustered mutually compatible features arepruned fromM and the process is repeated on the re-maining.

5. Planar Structure Reconstruction

The projective coordinates(p, p′) of correspondencesamong the projections of 3D points belonging to a 3D

plane into two image planes are related by an analytictransformation from the first image coordinates to thesecond image coordinates depending only on the rela-tive positions of the two image planes and the 3D plane.

This analytic transformation is completely specifiedby a 3 by 3transformation matrix named homogra-phy H, which is a collineation between the two im-age planes (considered as projective planes (Sempleand Kneebone, 1979)) and depends of the rotation andtranslation of the 3D plane parameters between the twoviews.

Assuming a pinhole model for the camera, and theimage plane to be in front of the center of projection.In projective co-ordinates an image positionp is givenby

p =

x

y

f

(5)

where f is the focal length of the camera.The homography is computed, up to a nonzero scalar

factor (λ), as the unique solution of the system of equa-tions:

p′i = λHpi (6)

This system can be solved if four correspondences(pi, p′i) are available: only eight coefficients ofH needto be computed, sinceH is defined up to a nonzeroscalar factor.

However, due to errors often occurring in the avail-able correspondences, the goal is to find the matrixwhich approximates at best the solution of this systemaccording to a given criterion.

In fact, in our work, the homography of a plane isestimated by minimizing a functional of all availablecorrespondences:

E(H) =∑

i

(p′i − Hpi)2 (7)

The method we use to minimizeE is an iterativeapproach based on gradient descendent.

In our context we can compute for each cluster of co-planar features recovered through relaxation labelingusing the cross-ratio constraint, the homography matrixdescribing uniquely for the corresponding 3D plane therelation among the features projected on two differentimage planes. All matches among features of the sameplane are described by the same collineation. Once a

134 Branca, Stella and Distante

collineation has been estimated for each cluster of co-planar points, a lot of additional points of the sameplane can be recovered.

We group all points whose correspondences com-puted using an homography satisfy the constraint ofradiometric similarity. Since radiometric measures pro-vides reliable results only in the neighborhood of highvariance regions, the points we select to be tested forregion growing are always high variance features.

6. Experimental Results

The first experiment, reported in Fig. 3, is about the re-sults obtained while testing our method in differentcomplex contexts generated synthetically. The goal wasto measure the ability to cluster coplanar features ofplanes apparently indistinguishable because of theirsimilar texture and approximatively similar distancefrom the optical center of the TV camera.

The second experiment, reported in Fig. 4, shows theability of the system to recover different planar surfacein a stereo context.

Finally, the remaining experiments have been per-formed with the aim to test the ability of the systemto reconstruct the free space on the ground plane on

Figure 3. A synthetic time varying image sequence. The simulated scene is constituted of three orthogonal planes (one horizontal (α), andtwo vertical (β, θ )). The simulated camera motion is a forward translation along a direction parallel toα while the camera axis is rotated to anangle of 45◦ with respect to the moving direction. (a, b) are the two analyzed consecutive images with superimposed the extracted features. (c)is the map of feature correspondences as estimated by the radiometric similarity (it is correct because the images are synthetic). (d–f) are therecovered planes:α, β and respectively.

which the vehicle moves. Figures 5–7 report someresults obtained on real time-varying image sequencesacquired in our laboratory by a TV camera (6 mm focallength) mounted respectively on a LABMATE platformby TRC (Fig. 5) and aNomadic Scoutmobile vehicle(Fig. 6).

In the experiment, the vehicle was constrained tomove forward by following rectilinear paths alongwhich different obstacles were located on the floor. TheTV camera was inclined so the optical axis intersectsthe floor at a distance from the vehicle of approxima-tively 50 cm.

In the experiment, in Fig. 5 the proposed clusteringapproach has been tested on a large number of extractedfeatures. Though the results are acceptable, the com-putational time grows exponentially with the numberof involved features.

In the experiment, in Figs. 6 and 7 the clustering pro-cess has been applied only on a small number of ex-tracted features (Fig. 6), then more additional pointshave been recovered through the estimated homogra-phies. Initially, only 21 high variance features (Fig. 6)was found through the Moravec interest operator, thenthe raw segmentation into two different clusters usingthe cross-ratio constraint is obtained. From the firstrecovered cluster of 11 features detected on the floor

Qualitative Scene Interpretation 135

Figure 4. Stereo sequence: (a, b) the two analyzed images with overimposed the extracted features, (c) the initial potential match set obtainedafter pruning based on radiometric similarity, (d, e) the two recovered planes, (f ) the whole final match set obtained after the relaxation process(it is the merging of the two maximum clique obtained for both planes (d) and (e).

Figure 5. Two consecutive images of a time varying image sequence acquired with a CCD camera mounted on our mobile platform LABMATEby TRC, while it is moving on our laboratory in order to detecting all obstacles on the ground plane. The optical axis of the camera is rotated to40◦ with respect the ground. The frame distance is of 200 mm. (a, b) Two consecutive frames with superimposed all the extracted features, (c)the ground plane recovered as the largest cluster of coplanar features (i.e., the free space without obstacles), (d) the initial potential match set asrecovered after the radiometric similarity based pruning step, (e) the displacement field of the recovered free ground plane.

surface (Fig. 6(b)), through the associated homographymatrix 110 additional co-planar features have been esti-mated (Fig. 7(a)). Similarly, from only 7 features clus-tered on the obstacle surface (Fig. 6(c)), 144 additionalco-planar features have been recovered (Fig. 7(b)).

Finally, the results obtained in a context where alarger ground plane is present are reported in Fig. 8.The experiments have been performed on the“Marbled-Block” image sequence made available byMichael Otte at KOGS/IAKS Universitaet Karlsruhe.

136 Branca, Stella and Distante

Figure 6. Lab-seq: (a) Sparse features extracted initially using the Moravec interest operator. (d) Initial correspondences estimated throughradiometric similarity. (b, e) The first largest cluster of features and correspondences recovered by imposing the cross-ratio invariance constrainton the extracted features. (c, f ) The second largest cluster of features recovered by imposing the cross-ratio invariance constraint on the featuresdiscarded from the first cluster.

Figure 7. Lab-seq: Additional coplanar features recovered on the floor through the collineation estimated for the first extracted cluster.

Qualitative Scene Interpretation 137

Figure 8. Marbled-Block-seq: (a, c) First largest cluster of co-planar features recovered using the cross-ratio constraint on the high-varianceextracted features. (b, d) The planar surface reconstruction using the homography matrix estimated on the sparse features.

This image sequence (30 frames) shows a polyhedralscene characterized by a large horizontal planar surfaceon which four columns and a right-to-left moving mar-bled block are placed, the images were acquired by aforward moving camera. The results here reported havebeen obtained by considering the frames 20-th and 25-th. In Fig. 8(a) the 20-th frame with overimposed thelargest recovered cluster of co-planar features (55 highvariance points) extracted on the horizontal surface isshown.

The reconstruction process, based on the estimatedhomography, has been able to recover (Fig. 8(b)) aquite dense map of 1219 features covering up the wholeplanar surface.

In our experiments, we find the minimum cardinal-ity of small subsets{M1, . . . ,Mm} involved to estimatea maximum clique is 10, and a maximum of 30 iter-ations are necessary to relaxate (i.e., to reach a stablestate the relaxation algorithm) requiring 0.1 seconds.Since in real contexts the number of large surfaces to berecovered are no more than two or three, the requiredcomputational time to perform the whole process is ofthe order of 1 second at maximum.

7. Conclusions

An optimization approach to perform a qualitative 3Dstructure reconstruction has been described. The ap-proach can constitute the basis of a lot of vision basedalgorithms in contexts of mobile robot navigation. Wepropose to determine the segmentation of a scene intoplanar surfaces through a global optimization approachinvolving the simplest numerical property of a planarobject that is unchanged under projection to an im-age plane: the cross-ratio of five arbitrary points. Themethod is based on relaxation of an association graphcharacterized by links of order five weighted by crossratio similarities between the connected node-features.The goal is to determine all sets of mutually compatiblenodes (i.e., features correctly matched and coplanar)(maximal cliques).

Once a raw segmentation is obtained with onlysome sparse points, a full segmentation is recoveredby adding new points using the homographies relat-ing two different views of planar objects. The inte-gration of this reconstruction scheme based on planarhomographies, with the optimization approach based

138 Branca, Stella and Distante

on cross-ratio similarities, enables the whole system tooperate in real time.

References

Beardsley, P.A., Reid, I.D., Zisserman, A., and Murray, D.W. 1994.Active visual navigation using non-metric structure. TechnicalReport No. OUEL 2047/94.

Carlsson, S. 1996. Projectively invariant decomposition and recog-nition of planar shapes.International Journal of Computer Vision,17(2):193–209.

Chabbi, H. and Berger, M.O. 1996. Using projective geometryto recover planar surfaces in stereovision.Pattern Recognition,29(4):533–548.

Deriche, R., Zhang, A., Luong, Q.-T., and Faugeras, O.D. 1994. Ro-bust recovery of the epipolar geometry for an uncalibrated stereorig. In Proceedings European Conference on Computer VisionECCV’94, pp. 565–576.

Faugeras, O.D. 1992. What can be seen in three dimensions with anuncalibrated stereo rig? InProceedings of European Conferenceon Computer Vision ECCV’92, pp. 563–578.

Faugeras, O.D. and Toscani, G. 1986. The calibration problemfor stereo. InProceedings Computer Vision Pattern RecognitionCVPR’86, pp. 15–20.

Gurdjos, P., Dalle, P., and Castan, S. 1996. Tracking 3D coplanarpoints in the invariant perspective coordinates plane. InProceed-ings of International Conference on Pattern Recognition ICPR’96.

Hartley, R.I., Gupta, R., and Chang, T. 1992. Stereo from uncali-brated cameras. InProceedings of Computer Vision and PatternRecognition CVPR’92, pp. 761–764.

Horaud, R. and Skordas, T. 1989. Stereo correspondence throughfeature grouping and maximal cliques.IEEE Trans. Pattern Anal.Machine Intell., 11(11):1168–1180.

Hummel, R.A. and Zucker, S.W. 1983. On the foundations of relax-ation labelling processes.IEEE Trans. Patter. Anal. Mach. Intell.,5(3):267–287.

Jagota, A. 1995. Approximating maximum clique with a hopfieldneural network.IEEE Transaction on Neural Networks, 6(3):724–735.

Kanatani, K. 1994. Computational cross ratio for computer vision.CVGIP: Image Understanding, 60(3):371–381.

Luong, Q.-T. and Faugeras, O.D. 1996. The fundamental matrix:Theory, algorithms, and stability analysis.International Journalof Computer Vision, 1:43–75.

Maybank, S.J. 1995a. Probabilistic analysis of the application ofcross ratio to model based vision: Misclassification.InternationalJournal of Computer Vision, 14:199–210.

Maybank, S.J. 1995b. Probabilistic analysis of the application ofcross ratio to model based vision.International Journal of Com-puter Vision, 16:5–33.

Moravec, H.P. 1983. The Stanford Cart and the CMU rover. InPro-ceedings IEEE.

Nagao, K. 1996. Direct methods for evaluating the planarity andrigidity of a surface using only 2D views. InProceedings of In-ternational Conference of Pattern Recognition ICPR’96, pp. 417–422.

Oberkampf, D., DeMenthon, D.F., and Davis, L.S. 1996. Iterativepose estimation using coplanar feature points.CVGIP: Image Un-derstanding, 63(3):495–511.

Pardalos, P.M. and Xue, J. 1994. The maximum clique problem.Global Optimization, 4:301–328.

Pelillo, M. 1995. Relaxation labeling networks for the maximumclique problem.Journal of Artificial Neural Networks, 2(4):313–328.

Pelillo, M. 1997. The dynamics of nonlinear relaxation labeling pro-cesses.Journal of Mathematical Imaging and Vision, 7(4):309–323.

Prichett, P. and Zisserman, A. 1998. Wide baseline stereo matching.In Proceedings of International Conference on Computer VisionICCV’98.

Rosenfeld, A., Hummel, R.A., and Zucker, S.W. 1976. Scene la-beling by relaxation operations.IEEE Trans. Syst. Man. Cyber.,6(6):420–433.

Rothwell, C.A., Zisserman, A., Forsyth, D.A., and Mundy, J.L. 1995.Planar object recognition using projective shape representation.International Journal of Computer Vision, 16:57–99.

Semple, J. and Kneebone, G. 1979.Algebraic Projective Geometry,Oxford University Press.

Sinclair, D. and Blake, A. 1996. Qualitative planar region detection.International Journal of Computer Vision, 18(1):77–91.

Tsai, R.Y. 1986. An efficient and accurate camera calibration tech-nique for 3D machine vision. InProceedings of Computer Visionand Pattern Recognition CVPR’86, pp. 364–374.

Zhang, Z. and Faugeras, O. 1994. Finding planes and clusteringof objects from 3D line segments with application to 3D motiondetermination.CVGIP: Image Understanding, 60(3):267–284.

Antonella Brancawas born in Poggiardo (Lecce-Italy) in 1968. Shereceived the degree in Computer Science from the University of Bariin 1992. Since 1993 she has worked as research associate at Insti-tute for Signal and Image Processing of Italian National ResearchCouncil. Her areas of interest include computer vision, artificialneural networks, robotics.

Ettore Stellawas born in Bari (Italy) in 1960 and received the degreein Computer Science from Bari University in 1984. From 1987 to1990 he was a research scientist at Italian Space Agency (ASI) at theCentro di Geodesia Spaziale (CGS) in Matera (Italy). Since 1990 hehas been a research scientist at IESI-CNR. His current interests arein computer vision, planning, neural networks, control of a mobilerobot operating in indoor environment and obstacle avoidance.

Qualitative Scene Interpretation 139

Arcangelo Distantewas born in Francavilla Fontana (Brindisi-Italy)in 1945. He received the degree in Computer Science at Universityof BARI (Italy) in 1976. Until 1981 he worked for I.N.F.N. (Na-tional Nuclear Physics Institute) and subsequently for IESI-CNR.Currently, he is the coordinator of the Robot Vision Group at IESI-CNR and director of the IESI Institute. His research interests arefocused on 3D object reconstruction, representation of visual in-formation and generation of 3D modelling, shape representation forimage understanding, vision for robotic navigation, and architecturesfor computer vision. Dr. A. Distante is a member of the IAPR andSPIE.