A low-dimensional local descriptor incorporating TPS warping for image matching



Image and Vision Computing 28 (2010) 1184–1195



Yang Duanduan *, Andrzej Sluzek
School of Computer Engineering, Nanyang Technological University, Blk N4, Nanyang Avenue, Singapore 639798, Singapore

Article info

Article history: Received 12 November 2008; Received in revised form 15 October 2009; Accepted 9 December 2009

Keywords: Image descriptor; Image matching; Interest points; TPS warping

0262-8856/$ - see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.imavis.2009.12.003

* Corresponding author. Tel.: +65 6790 4618/4592; fax: +65 6792 6559. E-mail addresses: [email protected] (Y. Duanduan), [email protected] (A. Sluzek).

Abstract

This paper proposes a low-dimensional image descriptor combining shape characteristics and location information. The shape characteristics are obtained from simple rectangular patterns approximating the interest regions. Although the shape component of the descriptor does not perform as well as high-dimensional descriptors (e.g. SIFT), it can be built very quickly. Moreover, it usually provides enough correspondences to align locations of interest regions. The alignment is based on the thin plate spline (TPS) warping algorithm with control points automatically identified by our method. Subsequently, the aligned coordinates contribute additional dimensions to the descriptor. The process may be iterated several times until no further improvement is achieved. Experiments show that incorporation of location data into the descriptor improves performance. The proposed descriptor is compared to SIFT (a standard benchmark which is considered one of the best local descriptors [1]) for real images with various geometric and photometric transformations and for diversified types of scenes. Results show the proposed low-dimensional descriptor generally performs better than the SIFT descriptor, while its computational cost is far lower.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

Image matching is one of the most important and fundamental topics in computer vision and image processing, with many areas of application, such as image retrieval, object detection and recognition, object tracking, vision-based navigation, etc. Currently, methods based on local features seem to dominate in image matching due to their flexibility, simplicity, and good performance [2]. Usually, local features are extracted from images in the form of interest points (keypoints) that are subsequently characterized by feature descriptors.

Detectors of interest points have been developed since the early 1980s (e.g. [3,4]). In the following years, scale and rotation invariance of interest point detectors was an important issue, and many such detectors have been proposed. They actually extract locations with their corresponding neighborhoods, so that the term interest region detector is more appropriate. The most typical detectors are DoG [5], Harris-Laplace [6], Hessian-Laplace [7,8], Harris-Affine [8] and Hessian-Affine [9]. The papers [8,9] have evaluated these interest region detectors, and the Harris-Affine and Hessian-Affine detectors are found to be the best. Therefore, in this paper we mainly use the Harris-Affine and Hessian-Affine detectors to extract interest regions that are subsequently characterized by the proposed descriptor.

The most straightforward feature descriptors are color or intensity characteristics of interest regions, e.g. [10]. The differential feature descriptors use a set of image derivatives with a given order, e.g. [11]. In [12], Mindru et al. introduce moment invariants for interest region description under changing viewpoint and illumination. The responses of filters can also be used as feature descriptors, e.g. [13]. Lowe in [7] proposes the Scale Invariant Feature Transform (SIFT), which is computed by sampling the magnitudes and orientations of local image gradients and building smoothed orientation histograms. This feature description provides robustness against localization errors and small geometric distortions. PCA-SIFT [14], CSIFT [15], and the Gradient Location and Orientation Histogram (GLOH) [1] are proposed to enhance SIFT by applying PCA, incorporating color information, and applying PCA after changing the location grid, respectively. There are also various other feature descriptors, such as spin images [16]. In [1], an experimental evaluation of several different descriptors for image matching has been reported, and the SIFT-based descriptors are found to be the best performers.

An important property of human vision is that objects are generally recognized by detecting various geometric patterns which may have diversified shapes yet are structurally similar, e.g. [17]. This psycho-physiological fact has been exploited, directly or indirectly, in several methods of image processing and recognition (e.g. [18,19]). Our descriptor is also inspired by this concept.

Fig. 2. (a) Division of interest regions and (b) the eight rectangular patterns P0–P7 used for approximations.


The proposed descriptor is applied to interest regions extracted by the Harris-Affine (or Hessian-Affine) detector. The regions are further divided into nine sub-regions, and each of them is approximated by eight simple rectangular patterns. The motivation for such a division and more details on the proposed descriptor are given in Section 2.

Using simple rectangular patterns limits the performance of the descriptor but has the advantage of very low computational complexity. Therefore, when two images are matched, this preliminary step is supplemented by a more sophisticated method. In Section 3, the proposed descriptor is expanded by location data of interest regions. We present a method of image alignment based on the thin plate spline (TPS) warping algorithm. In fact, the images are not actually aligned; only the locations of interest regions are aligned for expanding the descriptor. The parameters obtained from the TPS transformation are incorporated into the region descriptors, so that the alignment procedure may be iteratively repeated (if subsequent iterations improve the alignment). Selection of reliable control points is the key to the alignment operation, so we propose an automatic method (based on local affine approximations) to find the control points. Experiments (described in Section 4) show that the alignment is an effective tool that significantly improves the performance of our descriptor.

Further in Section 4, we describe experiments evaluating our descriptor. The data for our experiments are from [1], where they are used for comparing image descriptors. We compare our descriptor with SIFT because the evaluation of Mikolajczyk and Schmid [1] shows that the SIFT-based descriptors perform best. The results show the proposed low-dimensional descriptor performs, in general, better than SIFT, while the computational cost of our descriptor is far lower.

2. Simple descriptor of interest regions

2.1. Introduction

The analyzed images are pre-processed using the Harris-Affine (or Hessian-Affine) detector of interest regions. The detector provides not only the location and scale of interest regions but also elliptical or circular neighborhoods (representing affine/scale distortions of regions), as shown in Fig. 1a. In order to use simple rectangular approximations, we replace the elliptical shapes by rectangles of the same size, orientation and proportions (see Fig. 1b).

Our descriptor is built over nine sub-regions of such rectangles (Fig. 2a shows how extracted regions are divided).

The descriptor values are obtained by approximating sub-regions with predefined rectangular patterns (see Fig. 2b). It should be noted that the approximations of a given rectangle (sub-region) are obtained using correspondingly transformed (rotated and stretched) patterns so that they match the shape of the rectangle.

Fig. 1. (a) An example of Harris-Affine results and (b) the modification of results.

Although eight patterns are shown in the figure, only four of them are mutually independent (the lower-row patterns are inversions of the top row), so the approximations provide only four descriptor values (for the patterns' inversions we use the same values with the opposite sign).

There are several reasons to use such patterns. First, it can be noticed that the proposed rectangular patterns correspond to eight major directions of the intensity gradient (with π/4 increments before stretching the patterns). Therefore, the approximation scores (see Eq. (1)) represent local directional properties of interest regions. This is a concept similar to the well-proven SIFT descriptors. Additionally, the winning approximation of the central sub-region is used to obtain the rotational invariance of the descriptors (as explained in Section 2.2).

Secondly, approximations of individual sub-regions (and the union of these approximations for the whole rectangle) produce simplified pattern-based local image representations invariant under rotation, scaling and stretching. In spite of the patterns' simplicity, they have been found (see [19]) sufficiently discriminative in object detection, although the reported approach is closer to the Haar transform than to direct pattern matching. In the future we may consider more sophisticated patterns (e.g. the patterns discussed in [18]) instead.

Last but not least, the descriptors produced by the patterns are simple enough to iteratively perform matching if necessary (details are discussed in Section 3). We believe, therefore, that the proposed descriptors can be an efficient tool for preliminary matching of local image contents.

2.2. Descriptor building

Assume that S(x, y) is the intensity function for a given rectangular sub-region S. The score of approximation by the Pj pattern is defined as the difference of the average intensities of the white and black areas of the pattern superimposed on the rectangle:

$$\mathrm{App}(S, P_j) = \frac{\sum_{(x,y)\in P_j^{\text{white}}} S(x,y)}{\mathrm{area}(P_j^{\text{white}})} - \frac{\sum_{(x,y)\in P_j^{\text{black}}} S(x,y)}{\mathrm{area}(P_j^{\text{black}})} \qquad (1)$$

Eight approximation scores can be obtained from only four calculations because App(S, Pj) = −App(S, Pk), where k = j + 4 (mod 8).
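To make Eq. (1) concrete, the following minimal sketch computes the approximation scores, assuming each pattern has already been rotated and stretched to fit the sub-region and is represented as a boolean mask (True over its white area); the function names and the mask representation are ours, not part of the paper.

```python
import numpy as np

def approximation_score(sub_region: np.ndarray, pattern_mask: np.ndarray) -> float:
    """Eq. (1): mean intensity over the pattern's white area minus
    mean intensity over its black area."""
    white = sub_region[pattern_mask]    # pixels under the white part of Pj
    black = sub_region[~pattern_mask]   # pixels under the black part of Pj
    return white.mean() - black.mean()

def all_eight_scores(sub_region, base_masks):
    """Only four patterns (P0..P3) need evaluating; the other four scores
    follow from the symmetry App(S, P_{j+4 (mod 8)}) = -App(S, P_j)."""
    scores = [approximation_score(sub_region, m) for m in base_masks]
    return scores + [-s for s in scores]
```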

It can be noted that, in order to fit a given rectangle, the four patterns can be stretched and rotated in four different ways (see Fig. 3). However, as explained below, the final descriptor is identical no matter which option is eventually selected.

Fig. 3. Four alternative ways of fitting a pattern to a rectangle. (a) An exemplary pattern; (b) an interest region with division, where the exemplary pattern fits the central sub-region; and (c) the four transformations satisfying the fitting condition (stretching and rotation by θ, θ + π/2, θ + π, θ + 3π/2).

Fig. 5. (a) The winning approximation is by P6; thus P6, P7, P0 and P1 will be used for descriptor building. (b) The winning approximation is by P0, i.e. P0, P1, P2 and P3 will be used for descriptor building. The rotational invariance is clearly preserved.

The rotational invariance of our descriptor is obtained by identifying the dominant orientation of the central sub-region. The maximum approximation score for the central sub-region indicates this orientation. Then, the remaining sub-regions are approximated in the clockwise order, starting from the sub-region opposite the black part of the winning approximation of the central sub-region. Using Fig. 4 as a reference, the operations are performed as follows:

1. Label the central sub-region as S0.
2. Assume that P0 is transformed as shown in Fig. 4b to fit S0. All other patterns are rotated and stretched in the same way (Fig. 4d).
3. Approximate S0 using the transformed patterns (P0, P1, ..., P7) and record the winning approximation (P0 is the winning approximation in the example).
4. Label the sub-region opposite the black part of P0 (the winning approximation) as S1 (as pointed by the solid arrow in Fig. 4b).
5. Label the remaining sub-regions in the clockwise order (Fig. 4c). The order of S1, S2, ..., S8 is thus rotationally invariant.

Thus, the descriptor-building algorithm for the whole rectangle S can be specified by the following steps:

Step 1: Approximate (based on Eq. (1)) the sub-region S0 using the eight patterns in the configuration fitting the sub-region. Let App(S0, Pk) be the winning approximation indicating the dominant orientation. Memorize four approximation scores, App(S0, Pk), App(S0, Pk+1(mod 8)), App(S0, Pk+2(mod 8)) and App(S0, Pk+3(mod 8)), as the sub-descriptor for S0. The approximations by the remaining four patterns are discarded because of symmetry. Thus, the sub-descriptor for S0 is denoted as

$$SD_{S_0} = [\mathrm{App}(S_0, P_k),\ \mathrm{App}(S_0, P_{k+1(\mathrm{mod}\,8)}),\ \mathrm{App}(S_0, P_{k+2(\mathrm{mod}\,8)}),\ \mathrm{App}(S_0, P_{k+3(\mathrm{mod}\,8)})] \qquad (2)$$

Step 2: Using the dominant orientation obtained in Step 1, label the remaining sub-regions as S1, ..., S8 in the clockwise direction. Examples of labeling are shown in Fig. 5.

Step 3: Approximate the S1, ..., S8 sub-regions using the same Pk, Pk+1(mod 8), Pk+2(mod 8) and Pk+3(mod 8) patterns used for S0. The approximation scores are considered the corresponding sub-descriptors for each sub-region, i.e.

$$SD_{S_j} = [\mathrm{App}(S_j, P_k),\ \mathrm{App}(S_j, P_{k+1(\mathrm{mod}\,8)}),\ \mathrm{App}(S_j, P_{k+2(\mathrm{mod}\,8)}),\ \mathrm{App}(S_j, P_{k+3(\mathrm{mod}\,8)})], \quad j = 1, 2, \ldots, 8 \qquad (3)$$

Step 4: Concatenate the sub-descriptors to form the 36-dimensional shape descriptor of the rectangle S:

$$D\_shape(S) = [SD_{S_0}, SD_{S_1}, \ldots, SD_{S_8}] \qquad (4)$$

Fig. 4. (a) The P0 pattern; (b) labeling S0 and transforming P0 to fit S0; (c) labeling S1, S2, ..., S8; and (d) transforming all patterns similarly to P0.
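As an illustration of Steps 1–4, the sketch below (reusing all_eight_scores from the earlier sketch) assembles the 36-dimensional descriptor, assuming the nine sub-regions are supplied with S0 first and S1..S8 already labeled clockwise; the geometric labeling of Step 2 itself is omitted, and masks_for is a hypothetical helper returning the mask of pattern Pj fitted to a given sub-region.

```python
def build_shape_descriptor(sub_regions, masks_for):
    """Sketch of the descriptor-building algorithm (Eqs. (2)-(4)).

    sub_regions: list of 9 arrays, S0 (central) first, then S1..S8
                 labeled clockwise from the dominant orientation.
    masks_for:   masks_for(S, j) -> boolean mask of pattern Pj
                 transformed to fit sub-region S (hypothetical helper).
    """
    s0 = sub_regions[0]
    s0_scores = all_eight_scores(s0, [masks_for(s0, j) for j in range(4)])
    k = int(np.argmax(s0_scores))            # winning pattern = dominant orientation
    kept = [(k + i) % 8 for i in range(4)]   # Pk, Pk+1, Pk+2, Pk+3 (mod 8)
    descriptor = []
    for S in sub_regions:                    # SD_S0, SD_S1, ..., SD_S8 (Eq. (4))
        scores = all_eight_scores(S, [masks_for(S, j) for j in range(4)])
        descriptor += [scores[j] for j in kept]
    return np.asarray(descriptor)            # 9 sub-regions x 4 scores = 36-D
```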

3. Expanding the descriptor using location data

The 36-dimensional descriptor built in Section 2 is often insufficient for reliable image matching (see the experiments described in Section 4). The main reasons are apparently the simplicity of the approximation patterns and the relatively low dimensionality. However, the descriptor is usually efficient enough to identify the most prominent correspondences, which can be subsequently used for image alignment. Actually, the images are not aligned pixel by pixel; only the locations of interest regions are aligned for expanding the descriptor. We assume that if the images contain the same (at least partially) scenes, the results of alignment would eliminate some false matches and would identify correct matches previously considered too weak.

Fig. 6. Identification of coherent points. The arrows denote the best matches for interest points. In the example, Rs_2 and Rd_1 are a pair of coherent points.

2D images of 3D scenes can generally be aligned only by non-linear transformations of unpredictable analytical forms, so image warping is a standard method for image alignment. One of the popular warping techniques is the thin plate spline (TPS) warping method, see [20]. Thus, we develop a TPS-based method to align the locations of interest regions and to integrate the alignment results into our descriptor.

3.1. Introduction to TPS

A thin plate spline (TPS) transformation is a popular tool for surface interpolation over scattered data, obtained by minimizing the bending energy

$$\iint \left[\left(\frac{\partial^2 f}{\partial x^2}\right)^2 + 2\left(\frac{\partial^2 f}{\partial x\,\partial y}\right)^2 + \left(\frac{\partial^2 f}{\partial y^2}\right)^2\right] dx\,dy$$

of the warping function f(x, y). The solution for the warping function f(x, y), which is the desired displacement at a point (x, y), has the following form:

$$f(x, y) = a_1 + a_x x + a_y y + \sum_{i=1}^{n} w_i\, U(\|(x_i, y_i) - (x, y)\|) \qquad (5)$$

where U(r) = r² log r² (the kernel function) and (x1, y1), (x2, y2), ..., (xn, yn) are so-called control points. In image alignment we actually need two warping functions, f_x(x, y) and f_y(x, y), to define the displacements in the X and Y directions.

In order to solve Eq. (5), we need correspondences for the control points. The control points (x1, y1), (x2, y2), ..., (xn, yn) are source points, and their corresponding target points are denoted as (x′1, y′1), (x′2, y′2), ..., (x′n, y′n).

Using the set of source points, an n × 3 matrix P is defined as

$$P = \begin{bmatrix} 1 & x_1 & y_1 \\ \vdots & \vdots & \vdots \\ 1 & x_n & y_n \end{bmatrix} \qquad (6)$$

Using the kernel function, we define another matrix K as K_ij = U(‖(x_i, y_i) − (x_j, y_j)‖).

Finally, we define L as a combination of K and P:

$$L = \begin{bmatrix} K & P \\ P^T & 0 \end{bmatrix}$$

where 0 is a 3 × 3 matrix of zeros.

The solution of Eq. (5), i.e. the vector W = (w_1, ..., w_n) and the coefficients a_1, a_x, a_y, is obtained from

$$L^{-1} Y = [W \mid a_1, a_x, a_y]^T \qquad (7)$$

where Y = (V | 0 0 0)^T. V is an n-vector consisting of coordinates of the target points: V = (x′1, x′2, ..., x′n) to solve for the function f_x(x, y) of X-direction displacements, and V = (y′1, y′2, ..., y′n) to solve for the function f_y(x, y) of Y-direction displacements.
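A compact NumPy sketch of Eqs. (5)-(7): fit_tps solves for W and (a1, ax, ay) for both coordinate directions at once, and tps_warp evaluates Eq. (5). This is our own illustrative implementation, not the authors' code, with a small clamp added to avoid log(0) at the control points.

```python
import numpy as np

def _kernel(d2):
    # U(r) = r^2 log r^2 expressed via squared distances d2, with U(0) = 0
    return np.where(d2 > 0, d2 * np.log(np.maximum(d2, 1e-12)), 0.0)

def fit_tps(src, dst):
    """Solve Eq. (7). src, dst: (n, 2) arrays of control points.
    Returns an (n+3, 2) parameter array: rows w_1..w_n, then a_1, a_x, a_y,
    one column per coordinate direction (f_x and f_y)."""
    n = len(src)
    d2 = ((src[:, None, :] - src[None, :, :]) ** 2).sum(-1)
    K = _kernel(d2)                                 # K_ij of the text
    P = np.hstack([np.ones((n, 1)), src])           # Eq. (6)
    L = np.block([[K, P], [P.T, np.zeros((3, 3))]])
    Y = np.vstack([dst, np.zeros((3, 2))])          # (V | 0 0 0)^T, both directions
    return np.linalg.solve(L, Y)

def tps_warp(points, src, params):
    """Evaluate the warping functions of Eq. (5) at the given points."""
    d2 = ((points[:, None, :] - src[None, :, :]) ** 2).sum(-1)
    U = _kernel(d2)
    P = np.hstack([np.ones((len(points), 1)), points])
    return U @ params[:len(src)] + P @ params[len(src):]
```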

3.2. Descriptor expanding by the alignment

When two images are matched, we arbitrarily select one of them to be the destination image, and the other one becomes the source image. Although minor differences may exist, the results are generally unaffected by this choice. Then, we try to align the coordinates from the source image to the destination image using the TPS algorithm. The alignment is based on two functions f_x and f_y (for X and Y displacements) derived from Eq. (5).

To perform the alignment, two sets of control points (source points and target points) must be identified. The following procedure automatically selects the most reliable control points. The control point candidates are the interest points (i.e. the centers of interest regions) detected in both images by the Harris-Affine (or Hessian-Affine) detector and subsequently described by the shape descriptor of Section 2.

First, coherent points are found based on the descriptors. Assume Rs_i is an interest point in the source image and its best match in the destination image (found using the nearest neighbor (NN) method in the shape descriptor space) is Rd_j. If Rs_i is also the best match for Rd_j, the points are considered a pair of coherent points (see Fig. 6). Such a pair consists of a source coherent point and a destination coherent point.
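A minimal sketch of the coherent-point test (mutual nearest neighbours in shape-descriptor space); the brute-force distance matrix is our simplification.

```python
def coherent_pairs(desc_src, desc_dst):
    """Return index pairs (i, j) such that desc_dst[j] is the best match
    of desc_src[i] and vice versa (the coherent points of Fig. 6)."""
    d = np.linalg.norm(desc_src[:, None, :] - desc_dst[None, :, :], axis=-1)
    best_dst = d.argmin(axis=1)   # best destination match for each source point
    best_src = d.argmin(axis=0)   # best source match for each destination point
    return [(i, int(j)) for i, j in enumerate(best_dst) if best_src[j] == i]
```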

The most reliable coherent points will be subsequently selected as the control points for TPS warping. The selection procedure is based on the assumption that, for source and destination images related by a TPS transformation, the warping functions can be locally approximated by affine equations, because the term $\sum_{i=1}^{n} w_i U(\|(x_i, y_i) - (x, y)\|)$ of Eq. (5) is almost constant within a small area.

The whole set of source coherent points in the source image is partitioned into several groups by K-means spatial clustering (the number of clusters is discussed in Section 4.1). Then, the best local affine transform for each cluster is found. Assume a cluster contains m source coherent points (x_s1, y_s1), (x_s2, y_s2), ..., (x_sm, y_sm), with their corresponding destination coherent points denoted as (x_t1, y_t1), (x_t2, y_t2), ..., (x_tm, y_tm). Any three non-collinear pairs of such coherent points define an affine transform. The best affine transform is the one which is (approximately) satisfied by the most pairs. In other words, we find the transform Aff for which the largest number of pairs satisfies

$$\|\mathrm{Aff}(x_{sj}, y_{sj}) - (x_{tj}, y_{tj})\| < T \qquad (8)$$

where T is a constant threshold.

Because the threshold is used to reject non-affine transformations, it makes no sense to assign an excessively large or small value to T. In this paper, we experimentally set T = 10 pixels. Other thresholds (small variations around T = 10) have also been tested, but generally the same clustering results are obtained.

In the unlikely event of several transforms satisfying Eq. (8) equally well, the selected one would have the smallest overall deviation specified by Eq. (8). If the number of pairs satisfying Eq. (8) is large enough (at least half of the remaining coherent pairs, i.e. (m − 3)/2), we assume that an affine transform locally exists between the images. Then, the three point pairs defining the selected affine transform Aff are used as control points for the TPS alignment. If the number of pairs satisfying Eq. (8) is not large enough (i.e. less than (m − 3)/2), the cluster is discarded.
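The per-cluster selection can be sketched as an exhaustive search over triples of coherent pairs (a RANSAC-style variant would sample triples instead); the acceptance test against (m − 3)/2 follows the text, while the tie-breaking by smallest overall deviation is omitted for brevity.

```python
from itertools import combinations
import numpy as np

def best_local_affine(src_pts, dst_pts, T=10.0):
    """Find the affine transform (defined by three coherent pairs) that
    satisfies Eq. (8) for the largest number of pairs in one cluster."""
    m = len(src_pts)
    best_M, best_inliers = None, -1
    for idx in combinations(range(m), 3):
        A = np.hstack([src_pts[list(idx)], np.ones((3, 1))])
        if abs(np.linalg.det(A)) < 1e-9:              # skip (nearly) collinear triples
            continue
        M = np.linalg.solve(A, dst_pts[list(idx)])    # 3x2 affine parameter matrix
        pred = np.hstack([src_pts, np.ones((m, 1))]) @ M
        inliers = int((np.linalg.norm(pred - dst_pts, axis=1) < T).sum())
        if inliers > best_inliers:
            best_M, best_inliers = M, inliers
    if best_inliers >= (m - 3) / 2:                   # cluster accepted
        return best_M
    return None                                       # cluster discarded
```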

Once the control points (both source points and target points) are identified, we can derive the warping functions f_x and f_y (for X and Y displacements) from Eq. (5). Thus, each interest point (x_s, y_s) in the source image is aligned to the point with coordinates (x′_s, y′_s) = (f_x(x_s, y_s), f_y(x_s, y_s)). Exemplary results of image alignment using TPS warping are given in Fig. 7. Although only the interest points need alignment, these examples show that TPS warping actually brings the source images very close to the destinations.

Subsequently, the 36-dimensional shape descriptors D_shape defined by Eq. (4) are expanded to 38 dimensions. This is done slightly differently for the interest regions of the destination image (where the interest point coordinates are directly incorporated, see Eq. (9A)) than for the interest regions of the source image (where the warped coordinates are used, see Eq. (9B)).

$$D(S_{dest}) = [D\_shape(S_{dest}),\ norm_x \cdot x_d,\ norm_y \cdot y_d] \qquad (9A)$$

$$D(S_{source}) = [D\_shape(S_{source}),\ norm_x \cdot x'_s,\ norm_y \cdot y'_s] \qquad (9B)$$

where D_shape(S_dest) is the shape descriptor for the interest region in the destination image located at (x_d, y_d), and D_shape(S_source) is the shape descriptor for the interest region in the source image located at (x_s, y_s). (x′_s, y′_s) = (f_x(x_s, y_s), f_y(x_s, y_s)) is the location of the interest region in the warped source image. norm_x and norm_y are scaling factors normalizing image coordinates to a standard range. The initial values of these factors indicate the importance of the positional alignment.

The added dimensions incorporate the localization data into the descriptors. If two interest regions are an actual match, their descriptor values in the added dimensions should be very similar.

To verify the intrinsic integrity of the image alignment, we recalculate the coherent points using the 38-dimensional descriptor. If the number of coherent points does not increase noticeably, the values of norm_x and norm_y are set to zero (which indicates that the image alignment does not improve the matching results). Otherwise, the values of norm_x and norm_y are kept unchanged or increased. Then, we re-identify the set of control points using the new 38-dimensional descriptor and re-align the images again. The process can be iteratively repeated until the number of coherent points stops increasing or the maximum number of iterations is reached (in the experiments, the maximum number of iterations is arbitrarily set to 5).
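The whole expansion loop of this section can be summarized as follows, reusing the sketches above; select_control_points stands for the clustering-plus-affine selection described earlier (not implemented here), and the "noticeable increase" test is simplified to a strict comparison.

```python
MAX_ITERATIONS = 5

def expand_with_alignment(desc_src, desc_dst, pts_src, pts_dst, norm_xy):
    """Sketch of the iterative descriptor expansion (Eqs. (9A)-(9B))."""
    pairs = coherent_pairs(desc_src, desc_dst)
    d_src, d_dst = desc_src, desc_dst
    for _ in range(MAX_ITERATIONS):
        # placeholder for the K-means clustering + Eq. (8) selection of Section 3.2
        src_cp, dst_cp = select_control_points(pairs, pts_src, pts_dst)
        params = fit_tps(src_cp, dst_cp)
        warped = tps_warp(pts_src, src_cp, params)        # aligned source locations
        d_src = np.hstack([desc_src, norm_xy * warped])   # Eq. (9B), 38-D
        d_dst = np.hstack([desc_dst, norm_xy * pts_dst])  # Eq. (9A), 38-D
        new_pairs = coherent_pairs(d_src, d_dst)
        if len(new_pairs) <= len(pairs):                  # no noticeable gain: stop
            break
        pairs = new_pairs
    return d_src, d_dst
```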

Fig. 7. Exemplary results of TPS alignment: (a) source images; (b) aligned source images; and (c) destination images.

4. Experiments

The performance of our descriptor has been evaluated on real images with different geometric and photometric transformations and for different scene types. We deliberately selected the images used for the descriptor evaluation in [1]; Fig. 8 shows examples of such images. Various image transformations have been used to evaluate our descriptor, including rotations (e.g. Fig. 8a), scale changes (e.g. Fig. 8a and h), viewpoint changes (e.g. Fig. 8c and g), image blur (e.g. Fig. 8b and e), JPEG compression (e.g. Fig. 8f), and illumination changes (e.g. Fig. 8d). Results are obtained for regions detected with the Harris-Affine and Hessian-Affine detectors. All experiments are performed on a personal computer with a 2.66 GHz Intel Core 2 processor.

The interest regions are matched using both the nearest neighbor (NN) method and threshold-based matching. For threshold-based matching, two regions are matched if the distance between their descriptors is below a threshold. In NN matching, a threshold is also involved: two regions match if their descriptors are nearest neighbors and the distance between them is below this threshold.
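The two matching strategies can be sketched as follows (brute-force distances; the helper names are ours):

```python
def nn_matches(desc_a, desc_b, threshold):
    """NN matching: nearest neighbour whose distance is below the threshold."""
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=-1)
    nn = d.argmin(axis=1)
    return [(i, int(j)) for i, j in enumerate(nn) if d[i, j] < threshold]

def threshold_matches(desc_a, desc_b, threshold):
    """Threshold matching: every pair of descriptors closer than the threshold."""
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=-1)
    return [(int(i), int(j)) for i, j in zip(*np.where(d < threshold))]
```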

We use the same evaluation criterion as in [1]. It is based on the number of correct matches and the number of false matches obtained for a pair of images. The descriptors from the reference image are compared with descriptors from the transformed image, and we count the number of correct matches as well as the number of false matches. The results are presented using a recall versus precision parameterization. Recall is defined as the ratio between the number of correctly matched regions and the number of corresponding regions for two images of the same scene:

$$\text{recall} = \frac{\text{No. of correct matches}}{\text{No. of correspondences}} \qquad (10)$$


Fig. 8. Exemplary scenarios for the descriptor evaluation: (a) and (h) rotation + zoom, (b) and (e) image blur, (c) and (g) viewpoint change, (d) light change, and (f) JPEG compression.

Fig. 9. The performance of our descriptor for Fig. 8a over successive iterations (recall, #correct/277, plotted against 1-precision; curves for the 1st to 4th iterations). The regions are detected by the Harris-Affine detector.


More details about the definitions of correct matches and correspondences can be found in [1]. Precision depends on the number of false matches versus the total number of matches:

$$\text{precision} = 1 - \frac{\text{No. of false matches}}{\text{total No. of matches}} \qquad (11)$$

It should be noted that a higher precision value usually indicates tighter thresholds in the matching algorithms, i.e. for smaller thresholds fewer false matches can be expected.

As mentioned in [1], a perfect descriptor would give recall equal to 1 for any precision. We aim for our descriptor to obtain both high recall and high precision.
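For reference, Eqs. (10) and (11) reduce to:

```python
def recall_precision(n_correct, n_false, n_correspondences):
    """Eqs. (10) and (11). The total number of matches is correct + false."""
    recall = n_correct / n_correspondences
    precision = 1 - n_false / (n_correct + n_false)
    return recall, precision

# e.g. 150 correct and 50 false matches out of 300 correspondences:
# recall = 0.5, precision = 0.75
```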

4.1. Parameters of the algorithm

In Section 3.2, we set the maximum number of iterations to five in order to terminate the iteration cycle. Actually, in our experiments convergence is often reached (i.e. the number of coherent points stops increasing) after only one or two iterations. Nevertheless, an example is shown in Fig. 9 where four iterations are needed to saturate the number of coherent points (note that each iteration improves the performance of our descriptor).

In Section 3.2, the K-means clustering method is used to spatially partition the source coherent points. The value of K is determined by the number of coherent points: if we aim to have clusters containing approximately C points, the value of K is calculated as K = ⌊m/C⌋, where m is the number of coherent points. It is thus possible that different warping functions can be obtained for different C.

We have attempted to choose the most appropriate value of C based on the errors between the actual image-mapping function and the calculated warping functions (the image database used in the experiments provides such mapping functions). For each pair of test images, the absolute difference between the mapping function M(x, y) and the derived warping function W(x, y) was calculated at each source interest point (x, y). We investigated the average warp errors for the interest points (obtained by the Harris-Affine detector) of all images in the database using four values of C: 5, 10, 15 and 20. The minimum average error was found for C = 10. Thus, in the experiments C is set to 10.

Table 1
The recall and precision of our descriptor with and without the location information.

Image pairs   With positional data (38D)   Without positional data (36D)
              Recall     Precision         Recall     Precision
Fig. 8a       0.7857     0.1222            0.4643     0.0722
Fig. 8b       0.7697     0.5116            0.4257     0.2829
Fig. 8c       0.5275     0.1902            0.1498     0.0503
Fig. 8d       0.7339     0.3653            0.3303     0.1644
Fig. 8e       0.5932     0.2925            0.1925     0.0949
Fig. 8f       0.9545     0.8008            0.7677     0.6441
Fig. 8g       0.5397     0.1879            0.1605     0.0457
Fig. 8h       0.7003     0.3371            0.2981     0.1434

4.2. Comparison of 36-dimensional and 38-dimensional descriptors

In this section, we compare the results for the exemplary pairs of images from Fig. 8 to show that incorporation of the positional data improves the performance of our descriptor. Only the standard NN algorithm (i.e. without any threshold) is used for interest region matching, and the interest regions are obtained by the Harris-Affine detector. Table 1 shows the recall and precision results of the matching.

From Table 1, we can see that both precision and recall are significantly improved by the incorporation of positional data.

Fig. 10. Evaluation for blurred images. (a) A structured scene (Fig. 8b) with Harris-Affine regions using NN matching. (b) A structured scene (Fig. 8b) with Harris-Affine regions using threshold-based matching. (c) A structured scene (Fig. 8b) with Hessian-Affine regions using NN matching. (d) A structured scene (Fig. 8b) with Hessian-Affine regions using threshold-based matching. (e) A textured scene (Fig. 8e) with Harris-Affine regions using NN matching. (f) A textured scene (Fig. 8e) with Harris-Affine regions using threshold-based matching. (g) A textured scene (Fig. 8e) with Hessian-Affine regions using NN matching. (h) A textured scene (Fig. 8e) with Hessian-Affine regions using threshold-based matching. In each plot, recall is shown against 1-precision for the SIFT descriptor and our descriptor.

The aligned locations of interest regions are, therefore, useful for the discrimination power and robustness of our descriptor.

4.3. Comparison between our descriptor and the SIFT descriptor

In [1], an experimental evaluation of several different local descriptors was reported, and the SIFT-based descriptors showed the best performance in image matching. Therefore, we choose the SIFT descriptor as a benchmark for ours. The interest regions are detected by both the Hessian-Affine and Harris-Affine detectors for most images. However, for the images with only scale changes, the scale-invariant detectors, Harris-Laplace and Hessian-Laplace, are used for consistency with [1]. We use both the NN matching method and the threshold matching method. Diversified thresholds are used to obtain precision values ranging from near 0 to near 1.

We first compare the performance when a matched image is significantly blurred. Fig. 10a–h shows the results for a structured scene (Fig. 8b) and a textured scene (Fig. 8e). Both Harris-Affine regions and Hessian-Affine regions are evaluated with the NN matching and the threshold matching methods. It can be seen that our descriptor is better than SIFT when applied to a blurred image of a structured scene. In the case of a textured scene, our descriptor is also better than SIFT, but only for lower precision values (i.e. higher 1-precision). There are two possible explanations for the good performance of our descriptor. First, it can obtain edge information from the rectangular approximations (which is important in blurred images). Secondly, the aligned locations of interest regions are robust against image blur.

Subsequently, our descriptor has been evaluated for scale changes plus rotations (e.g. Fig. 8a and h). For consistency with [1], the Harris-Laplace and Hessian-Laplace detectors are used to detect the interest regions, and the threshold-based matching method is used to evaluate the performance. The performance curves are shown in Fig. 11. For both the textured and structured scenes, with Harris-Laplace regions and Hessian-Laplace regions, we observe higher scores for our descriptor when precision is small, but lower scores for higher precisions. The possible reason is


that location information is incorporated in our descriptor but the scales of regions are not. When the matching threshold is low (i.e. precision is high), there are rather few correct matches, because the regions might have similar aligned locations while their sizes are different. When, however, the matching threshold is high (which results in a small precision), more correct matches are found. Similar observations have been made in other similarly distorted images.

Fig. 11. Evaluation of scale change plus rotation. (a) A structured scene (Fig. 8h) with Harris-Laplace regions. (b) A structured scene (Fig. 8h) with Hessian-Laplace regions. (c) A textured scene (Fig. 8a) with Harris-Laplace regions. (d) A textured scene (Fig. 8a) with Hessian-Laplace regions.

Fig. 12 shows the results for the illumination change (Fig. 8d). For the NN matching method, our descriptor obtains slightly higher scores than SIFT with Harris-Affine regions. For threshold-based matching, similarly to the previous scenario of scale change and rotation, our descriptor is worse than SIFT for higher precisions and better for lower precisions with Harris-Affine regions. For the Hessian-Affine regions, similar observations are made.

Fig. 12. Evaluation of illumination changes. (a) Harris-Affine regions using NN matching. (b) Harris-Affine regions using threshold-based matching. (c) Hessian-Affine regions using NN matching. (d) Hessian-Affine regions using threshold-based matching.

Then, we compare the performance for JPEG compression (see Fig. 8f). The results are straightforward compared to the other transformations. On the Harris-Affine regions, we can see from Fig. 13 that both our descriptor and SIFT perform very well. However, our descriptor is superior (for threshold-based matching) and gives almost 100% recall when precision is less than 0.7 on Harris-Affine regions. On Hessian-Affine regions, our descriptor performs better than SIFT using both matching methods.

Fig. 13. Evaluation for JPEG compression. (a) Harris-Affine regions using NN matching. (b) Harris-Affine regions using threshold-based matching. (c) Hessian-Affine regions using NN matching. (d) Hessian-Affine regions using threshold-based matching.

Finally, we compare the cases of viewpoint changes. The viewpoint change is about 50°, which is (according to [1]) the most challenging of the evaluated transformations. For structured scenes (Fig. 14a and b), our descriptor is better than SIFT on Harris-Affine regions, but worse (or partially worse) on Hessian-Affine regions. For textured scenes, our descriptor is worse than SIFT for higher precisions and better for lower ones. The possible reason is that the simple rectangular approximations produce numerous similar descriptors for this textured scene, so that it is hard to accurately align the locations.

Altogether, our descriptor (with a much lower dimensionality) performs better than SIFT in the majority of the analyzed cases.

Fig. 14. Evaluation of viewpoint change. (a) A structured scene (Fig. 8c) with Harris-Affine regions using NN matching. (b) A structured scene (Fig. 8c) with Harris-Affine regions using threshold-based matching. (c) A structured scene (Fig. 8c) with Hessian-Affine regions using NN matching. (d) A structured scene (Fig. 8c) with Hessian-Affine regions using threshold-based matching. (e) A textured scene (Fig. 8g) with Harris-Affine regions using NN matching. (f) A textured scene (Fig. 8g) with Harris-Affine regions using threshold-based matching. (g) A textured scene (Fig. 8g) with Hessian-Affine regions using NN matching. (h) A textured scene (Fig. 8g) with Hessian-Affine regions using threshold-based matching.

4.4. Performance evaluation using a database

In the above experiments, the performance of our descriptor is tested with image pairs which come from the same object/scene. Regions from the same ground truth may be correlated, which possibly leads to a statistical bias in the performance evaluation. Thus, in this section, we test our descriptor with a relatively large dataset. Meanwhile, we provide an example confirming the usefulness of our descriptor in image matching.

The application example is scene retrieval, where effective image matching is very important. We set up a database containing 200 images, with 48 query images from eight totally different scenes (each scene having 6 images with transformations), while the remaining 152 images serve as confusing data. The 48 query images are from the data set of [1]. Some examples are shown in Fig. 15. The confusing data (see Fig. 16) are randomly selected from the Internet.

Fig. 15. Some examples of query images.

The goal of scene retrieval is to automatically identify which images from the database present the same scene. Thus, a query image is compared not only with the correlated images (which are the retrieval targets) but also with uncorrelated images (which are confusing data) based on the extracted local descriptors. In this way, the proposed descriptors and SIFT are tested with both correlated and independent regions.

Fig. 16. Exemplary confusing data.

We use the normalized average rank [21] as a measure of matching performance:

$$\tilde{R} = \frac{1}{N N_R}\left(\sum_{i=1}^{N_R} R_i - \frac{N_R(N_R - 1)}{2}\right) \qquad (12)$$

where R_i is the rank at which the i-th relevant image is retrieved, N_R is the number of relevant images for a given query, and N is the number of images in the database. A smaller normalized rank means a better performance.
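A direct transcription of Eq. (12) (ranks as defined in [21]):

```python
def normalized_average_rank(relevant_ranks, n_images):
    """Eq. (12): relevant_ranks are the ranks R_i at which the N_R
    relevant images are retrieved; n_images is N."""
    n_r = len(relevant_ranks)
    return (sum(relevant_ranks) - n_r * (n_r - 1) / 2) / (n_images * n_r)
```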

The retrieval process is repeated 48 times for all 48 query images. Each time, the query is compared with the other 199 images. For each comparison, we identify the key regions in both the query and the reference image using the Hessian-Affine detector, and apply the algorithm of this paper (or SIFT) to build local descriptors. We assume two descriptors are matched if they are the nearest neighbor descriptors in both the query and the reference (which is identical to the definition of coherent points in Section 3.2). Then, the number of matched descriptors is used as the score of image similarity. The number of matched keypoints is also used as the score in [7,22], which is a relatively simple and straightforward method to evaluate the performance of different descriptors.

Table 2 gives the relevant rank distributions of the two kinds of descriptors for this database, where R_med denotes the median value of the normalized average rank over all 48 retrievals; R_avg denotes the average value; R_min and R_max denote the minimal and maximal values; and the upper and lower quartile values are denoted by R_up_qua and R_low_qua, respectively. We can see that our descriptor obtains better performance than SIFT.

Exemplary retrieval results obtained using our descriptors are shown in Fig. 17. In each row, the leftmost image is the query and the subsequent images are the best matches (five nearest neighbors). Cases of incorrectly retrieved images can be noticed in rows (f)–(h). They illustrate the general tendency that erroneous matching results happen when the query image is a textured scene. In such images, numerous keypoints are described by similar rectangular approximations, so that it is very difficult to accurately align their locations.

In this experiment, the performance of our descriptor is evaluated in the application of image retrieval using a relatively large database, where the 48 × 199 image matching results hopefully provide unbiased statistics.

Table 2
Performance of SIFT and our descriptor on the scene retrieval database.

Descriptors       R_med    R_avg    [R_min, R_max]    [R_low_qua, R_up_qua]
SIFT              0.014    0.087    [0, 0.688]        [0, 0.043]
Our descriptors   0        0.046    [0, 0.406]        [0, 0.025]

4.5. Execution time estimates

In this section, we compare execution times both for descriptor building and for image matching. In image matching, the dominant factor is the dimensionality of the descriptors (and, of course, the number of interest regions), because the main computation is calculating distances between descriptors. Thus, our 38-dimensional descriptor must outperform SIFT, which has 128 dimensions. As an example experimentally confirming this conclusion, we recorded the NN matching time for the Fig. 8a image pair (with 540 and 682 interest regions). Using our 38-dimensional descriptor, the matching time is 157 ms, while using the 128-dimensional SIFT descriptor, the matching time increases to 1143 ms.

In descriptor building, our descriptor is calculated in two phases. First, a simple approximation-based descriptor is built, and subsequently the images are aligned (possibly iteratively) to incorporate the positional data into the descriptor. Thus, we estimate the execution times for the two phases separately and compare them with the execution time for building SIFT.

Table 3 shows that the approximation part of our descriptor can be built very quickly, but the image alignment and positional data incorporation usually take longer because of the clustering of coherent points. The alignment time also strongly depends on the number of iterations; if the number of iterations is small (as happens for the image pairs of Fig. 8d and f), the second phase becomes shorter than the first one. Nevertheless, the total execution time is dramatically shorter than the time of building SIFT descriptors for all analyzed images (the paper presents only a sample of results).

Table 3
The execution time of descriptor building.

Image pairs   Number of          Building our descriptor (s)                Building
              interest regions   Shape approx.   Location align.   Total    SIFT (s)
Fig. 8a       540 + 682          0.20            1.10              1.30     4.38
Fig. 8b       945 + 516          0.32            0.54              0.86     9.73
Fig. 8c       1791 + 2035        0.47            1.31              1.78     17.98
Fig. 8d       337 + 219          0.14            0.07              0.21     3.16
Fig. 8e       653 + 957          0.33            0.60              0.93     6.15
Fig. 8f       246 + 236          0.13            0.06              0.19     2.40
Fig. 8g       2267 + 2165        0.64            0.32              0.96     10.38
Fig. 8h       2797 + 1339        0.57            0.44              1.01     14.57

5. Conclusions

This paper proposes a relatively low-dimensional image descriptor based on both shape approximations and localization data. The approximations are obtained using simple rectangular patterns stretched and rotated to fit rectangles provided by a detector of interest points (regions). The shape approximation phase is the foundation of the descriptor, because its results are used to identify the control points for the image alignment (i.e. indirectly providing positional data for the enhanced descriptor). The alignment is based on the TPS warping algorithm with its control points automatically identified by the algorithm. The localization data are incorporated into the descriptor using the aligned locations of interest regions.

The proposed descriptor has been evaluated on real images with different geometric and photometric transformations and for different scene types, and compared with SIFT, considered one of the best local descriptors. Results show the proposed low-dimensional descriptor in most cases performs better than SIFT, and its computational complexity is dramatically lower. Cases of poorer performance (selected classes of images with high precision of matching) will be analyzed for further improvements of our descriptor. We also provide a concrete application example in which the descriptor performance is tested with a relatively large database containing both correlated and uncorrelated images. The tests show the proposed descriptor performs better than SIFT.

Fig. 17. Some retrieval results using our descriptors.

In future work we would like to further investigate the performance and usefulness of the proposed method. In particular, we will focus on the following three issues. First, it can be noticed that the shape-approximation part of the descriptors can be built for individual images, while the two coordinate-based dimensions can be added only when two images are matched. Therefore, we will attempt to develop an exemplary database of images where the shape approximations are computed off-line and memorized together with the images in the database. Then, when a visual information retrieval operation is attempted (a search for images similar to a query image), we need to calculate only the TPS transformations and add two dimensions to the descriptors. If warping is not successful (i.e. the images are considered different), the results would be available very quickly (see the exemplary time estimates in Table 3). Thus, the search for potential matches would be performed efficiently. Another proposed improvement is to find the optimum number of clusters of coherent points (see Section 3.2) providing the best correspondence of image warping. Such a subset of clusters would potentially indicate the most similar fragments of two images (even if other parts of the images are very different, e.g. the same objects on different backgrounds). Finally, we plan to expand the list of patterns used for the approximations (e.g. based on the ideas presented in [18]). The results, if encouraging, will be reported in future papers.

Acknowledgment

The results presented in this paper were obtained under A*STAR Science and Engineering Research Council Grant 072 134 0052. The financial support of SERC is gratefully acknowledged.

References

[1] K. Mikolajczyk, C. Schmid, A performance evaluation of local descriptors, IEEE Trans. PAMI 27 (2005) 1615–1630.
[2] F. Peronnin, C. Dance, G. Csurka, M. Bressan, Adapted vocabularies for generic visual categorization, LNCS 3954 (2006) 464–475.
[3] H. Moravec, Rover visual obstacle avoidance, in: Proc. Int. Joint Conf. Artificial Intelligence, Vancouver, 1981, pp. 785–790.
[4] C. Harris, M. Stephens, A combined corner and edge detector, in: Proc. 4th Alvey Vision Conf., Manchester, 1988, pp. 147–151.
[5] D.G. Lowe, Object recognition from local scale-invariant features, in: Proc. 7th Int. Conf. Computer Vision, vol. II, Kerkyra, September 1999, pp. 1150–1157.
[6] K. Mikolajczyk, C. Schmid, Indexing based on scale invariant interest points, in: Proc. 8th Int. Conf. Computer Vision, vol. I, Vancouver, July 2001, pp. 525–531.
[7] D.G. Lowe, Distinctive image features from scale invariant keypoints, Int. J. Comp. Vis. 60 (2004) 91–110.
[8] K. Mikolajczyk, C. Schmid, Scale and affine invariant interest point detectors, Int. J. Comp. Vis. 60 (2004) 63–86.
[9] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, L. Van Gool, A comparison of affine region detectors, Int. J. Comp. Vis. 65 (2005) 43–72.
[10] R. Fergus, P. Perona, A. Zisserman, Object class recognition by unsupervised scale-invariant learning, in: Proc. CVPR 2003, vol. II, Madison, June 2003, pp. 264–271.
[11] J.J. Koenderink, A.J. van Doorn, Representation of local geometry in the visual system, Biol. Cybern. 55 (1987) 367–375.
[12] F. Mindru, T. Tuytelaars, L. van Gool, Th. Moons, Moment invariants for recognition under changing viewpoint and illumination, Comp. Vis. Image Understand. 94 (2004) 3–27.
[13] F. Schaffalitzky, A. Zisserman, Multi-view matching for unordered image sets, LNCS 2350 (2002) 414–431.
[14] Y. Ke, R. Sukthankar, PCA-SIFT: a more distinctive representation for local image descriptors, in: Proc. IEEE CVPR 2004, vol. II, Washington DC, June 2004, pp. 506–513.
[15] A.E. Abdel-Hakim, A.A. Farag, CSIFT: a SIFT descriptor with color invariant characteristics, in: Proc. CVPR 2006, vol. II, New York, June 2006, pp. 1978–1983.
[16] S. Lazebnik, C. Schmid, J. Ponce, Sparse texture representation using affine-invariant neighborhoods, in: Proc. CVPR 2003, vol. II, Madison, June 2003, pp. 319–324.
[17] M.J. Tarr, H.H. Bülthoff, M. Zabinski, V. Blanz, To what extent do unique parts influence recognition across changes in viewpoint, Psychol. Sci. 8 (1997) 282–289.
[18] A. Sluzek, On moment-based local operators for detecting image patterns, Image Vis. Comput. 23 (2005) 287–298.
[19] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proc. CVPR 2001, vol. I, Hawaii, December 2001, pp. 511–518.
[20] F.L. Bookstein, Principal warps: thin plate splines and the decomposition of deformations, IEEE Trans. PAMI 16 (1989) 460–468.
[21] H. Muller, W. Muller, S. Marchand-Maillet, T. Pun, Performance evaluation in content-based image retrieval, Pattern Recogn. Lett. 22 (2001) 593–601.
[22] Y. Ke, R. Sukthankar, L. Huston, Efficient near-duplicate detection and sub-image retrieval, ACM Multimedia (2004) 869–876.