Building kernels from binary strings for image matching


Paper Title: Building kernels from binary strings for image matching
Paper number: TIP-00651-2003.R1
Journal: IEEE Transactions on Image Processing

Contact information:

Francesca Odone,
Istituto Nazionale di Fisica della Materia
c/o DISI, Università di Genova
Via Dodecaneso 35, I-16146 Genova, Italy
tel +39 0103536607, fax +39 0103536699
e-mail [email protected]

Annalisa Barla,
DISI, Università di Genova
Via Dodecaneso 35, I-16146 Genova, Italy
tel +39 0103536610, fax +39 0103536699
e-mail [email protected]

Alessandro Verri,
DISI, Università di Genova
Via Dodecaneso 35, I-16146 Genova, Italy
tel +39 0103536601, fax +39 0103536699
e-mail [email protected]

Software used for formatting: LaTeX for Linux, TeX version 3.14159


Building Kernels from Binary Strings for Image Matching

Francesca Odone, Annalisa Barla, and Alessandro Verri

Abstract— In the statistical learning framework the use of appropriate kernels may be the key for substantial improvement in solving a given problem. In essence, a kernel is a similarity measure between input points satisfying some mathematical requirements and possibly capturing the domain knowledge. In this paper we focus on kernels for images: we represent the image information content with binary strings and discuss various bitwise manipulations obtained using logical operators and convolution with non-binary stencils. In the theoretical contribution of our work we show that histogram intersection is a Mercer's kernel and we determine the modifications under which a similarity measure based on the notion of Hausdorff distance is also a Mercer's kernel. In both cases we determine explicitly the mapping from input to feature space. The presented experimental results support the relevance of our analysis for developing effective trainable systems.

I. INTRODUCTION

A main issue of statistical learning approaches like Support Vector Machines and other kernel methods [1], [2] is which kernel to use for which problem. A number of general-purpose kernels are available in the literature but, especially in the case in which only a relatively small number of examples is available, there is little doubt that the use of an appropriate kernel function can lead to a substantial increase in the generalization ability of the developed learning system. Ultimately, the choice of a specific kernel function is based on the prior knowledge of the problem domain.

In this paper we discuss kernels based on the manipulation of binary vectors in the computer vision domain, considering both histogram and iconic representations. Since binary vectors can also be used for representing the information content of signals other than images, the presented analysis may be of interest to a number of application domains.

The difficulty of building kernels lies in satisfying certain mathematical requirements while incorporating the prior knowledge of the application domain. For this reason, we studied different similarity measures for image matching and determined under what assumptions they can be used to build kernels for images. Since the image kernels we found are based on manipulating binary strings, we first present the results of our work discussing kernels for binary strings in abstract terms, and then we specialize our results to the computer vision domain.

The topic of kernels for images is relatively new. A number of studies have been reported on the use of general-purpose kernels for image-based classification problems [3], [4], [5], [6], [7], [8]. One of the first attempts to introduce prior knowledge on the correlation properties of images in the design of an image kernel can be found in [9]; in [10], instead, a family of functions which seems to be better suited than Gaussian kernels for dealing with image histograms has been studied. In essence, the kernels described in [10] ensure heavier tails than Gaussian kernels, in an attempt to counter the well-known phenomenon of diagonally dominant kernel matrices in the case of high-dimensional inputs. In [11] an RBF kernel is engineered in order to use the Earth Mover's Distance as a dissimilarity measure, while in [12] an RBF kernel based on the $\chi^2$ distance has been shown to be of Mercer's type. As opposed to the global approach to image representation, which is common to all of the previous works, in [13] a class of Mercer's kernels for local image features is proposed.

F. Odone is with Istituto Nazionale di Fisica della Materia, Genova, Italy. E-mail: [email protected]

A. Barla and A. Verri are with INFM-DISI, Università di Genova, Genova, Italy. E-mail: {barla,verri}@disi.unige.it

This paper is organized as follows. In Section II we set the background and describe the mathematical requirements for kernel functions; in Section III we discuss the various manipulations on binary strings which we studied. Kernels for histogram-based representations are discussed in Section IV, while kernels for iconic representations are the theme of Section V. Experimental results, including extensive comparison with previous work, are reported in Section VI. We draw the conclusions of our work in Section VII.

II. KERNEL FUNCTIONS

In this section we summarize the mathematical requirements a function needs to satisfy in order to be a legitimate kernel for SVMs.

A. Support vector machines

Following [14], many problems of statistical learning [15], [1] can be cast in an optimization framework in which the goal is to determine a function minimizing a functional $H$ of the form

$$H[f] = \sum_{i=1}^{\ell} V\big(y_i, f(\mathbf{x}_i)\big) + \lambda \, \|f\|_K^2 \qquad (1)$$

where the $\ell$ pairs $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_\ell, y_\ell)$, the examples, are i.i.d. random variables drawn from the space $X \times Y$ according to some fixed but unknown probability distribution, $V$ is a loss function measuring the fit of the function $f$ to the data, $\|f\|_K$ the norm of $f$ induced by a certain function $K$, named kernel, controlling the smoothness (or capacity) of $f$, and $\lambda > 0$ a trade-off parameter. For several choices of the loss function $V$, the minimizer of the functional in (1) takes the general form

$$f(\mathbf{x}) = \sum_{i=1}^{\ell} \alpha_i \, K(\mathbf{x}, \mathbf{x}_i) \qquad (2)$$


where the coefficients $\alpha_i$ depend on the examples. The mathematical requirements on $K$ must ensure the convexity of (1) and hence the uniqueness of the minimizer (2).
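To make the regularization framework concrete, here is a minimal sketch (in Python with NumPy; not the paper's software) of the square-loss case, for which the coefficients $\alpha_i$ of the minimizer (2) solve the linear system $(K + \lambda I)\alpha = y$; the Gaussian kernel used here is only a placeholder for the image kernels developed later.

```python
import numpy as np

# Minimal sketch: with the square loss V(y, f(x)) = (y - f(x))^2, the
# minimizer of (1) is given in closed form by the linear system
# (K + lambda*I) alpha = y, and predictions take the form (2).

def gaussian_kernel(X1, X2, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), a standard Mercer kernel
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (50, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=50)

lam = 1e-2
K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # coefficients of (2)

X_new = np.array([[0.3]])
f_new = gaussian_kernel(X_new, X).dot(alpha)           # f(x) = sum_i alpha_i K(x, x_i)
```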

SVMs for classification [15], for example, correspond to choices of $V$ like

$$V\big(y_i, f(\mathbf{x}_i)\big) = \big(1 - y_i f(\mathbf{x}_i)\big)_+$$

with $(t)_+ = t$ if $t > 0$, and $0$ otherwise, and lead to a convex QP problem with linear constraints in which many of the $\alpha_i$ vanish. The points $\mathbf{x}_i$ for which $\alpha_i \neq 0$ are termed support vectors and are the only examples needed to determine the solution (2).

Before discussing the mathematical requirements on $K$, we briefly consider two special cases of classifiers that we will use in our experiments: binary classifiers and novelty detectors, or one-class classifiers.

B. Binary classification

In the case of binary classification [15] we have $y_i \in \{-1, 1\}$ for $i = 1, \dots, \ell$, and the dual optimization problem can be written as

$$\max_{\alpha}\; \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \alpha_i \alpha_j \, y_i y_j \, K(\mathbf{x}_i, \mathbf{x}_j) \qquad (3)$$

subject to

$$\sum_{i=1}^{\ell} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C \quad \text{for } i = 1, \dots, \ell.$$

A new point $\mathbf{x}$ is classified according to the sign of the expression

$$\sum_{i=1}^{\ell} \alpha_i \, K(\mathbf{x}_i, \mathbf{x}) + b,$$

where the coefficient $b$ can be determined from the Kuhn-Tucker conditions. A closer look at the QP problem (3) reveals that the uniqueness of the solution is ensured by the convexity of the objective function.
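As an illustration of how such a dual problem is solved in practice, the following minimal sketch assumes scikit-learn, which accepts a precomputed Gram matrix; the 16-bin random "histograms" are toy data, not the paper's, and the histogram intersection kernel anticipates Section IV.

```python
import numpy as np
from sklearn.svm import SVC

# Minimal sketch (not the authors' code): solving the dual problem (3)
# with an off-the-shelf solver and a precomputed Gram matrix, which is
# convenient when the kernel, like the ones built in this paper, is not
# one of the standard options.

def histogram_intersection(H1, H2):
    # Gram matrix of the histogram intersection kernel of Section IV:
    # K(A, B) = sum_i min(A_i, B_i), computed for all pairs of rows.
    return np.minimum(H1[:, None, :], H2[None, :, :]).sum(-1)

rng = np.random.default_rng(0)
H_train = rng.random((40, 16))          # toy 16-bin "histograms"
y_train = rng.choice([-1, 1], size=40)  # toy labels

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(histogram_intersection(H_train, H_train), y_train)

H_test = rng.random((5, 16))
# Rows: test points, columns: training points, as SVC expects.
labels = clf.predict(histogram_intersection(H_test, H_train))
```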

C. Novelty detection

In the case of novelty detection described in [16], for all training points we have $y_i = 1$ (i.e., all the training examples describe only one class) and the optimization problem reduces to

$$\min_{\alpha}\; \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \alpha_i \alpha_j \, K(\mathbf{x}_i, \mathbf{x}_j) - \sum_{i=1}^{\ell} \alpha_i \, K(\mathbf{x}_i, \mathbf{x}_i) \qquad (4)$$

subject to

$$\sum_{i=1}^{\ell} \alpha_i = 1, \qquad 0 \le \alpha_i \le C.$$

If $K(\mathbf{x}, \mathbf{x})$ is constant over the domain $X$, a novelty is detected if the inequality

$$\sum_{i=1}^{\ell} \alpha_i \, K(\mathbf{x}_i, \mathbf{x}) \ge \tau \qquad (5)$$

is violated for some fixed value of the threshold parameter $\tau > 0$. Similarly to the case above, the uniqueness of the solution is ensured by requiring the convexity of the quadratic form in the objective function.
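A corresponding sketch for the one-class case, under the assumption that scikit-learn's OneClassSVM (the closely related ν-formulation, which coincides with the description above when $K(\mathbf{x}, \mathbf{x})$ is constant) is acceptable:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Minimal sketch (assumption: scikit-learn's OneClassSVM, a related
# nu-formulation that coincides with the above when K(x, x) is constant
# over the domain). The kernel is again passed as a precomputed Gram matrix.

def rbf_gram(X1, X2, sigma=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))   # K(x, x) = 1: constant diagonal

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, (100, 2))     # one class only: no labels

det = OneClassSVM(kernel="precomputed", nu=0.1)
det.fit(rbf_gram(X_train, X_train))

X_test = np.vstack([rng.normal(0.0, 1.0, (5, 2)),    # more of the same class
                    rng.normal(8.0, 1.0, (5, 2))])   # novelties
# Positive scores: inequality (5) holds; negative: novelty detected.
scores = det.decision_function(rbf_gram(X_test, X_train))
```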

D. Kernel requirements

The convexity of the functionals (3) and (4) is guaranteed if the function $K$ is positive definite. We remind that a function $K: X \times X \to \mathbb{R}$ is positive definite if for all $m$ and all choices of $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_m \in X$ and $\alpha_1, \alpha_2, \dots, \alpha_m \in \mathbb{R}$

$$\sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j \, K(\mathbf{x}_i, \mathbf{x}_j) \ge 0. \qquad (6)$$

A theorem of functional analysis due to Mercer [17] allows us to write a positive definite function as an expansion of certain functions $\phi_k(\mathbf{x})$, $k = 1, \dots, M$, with $M$ possibly infinite, or

$$K(\mathbf{x}, \mathbf{x}') = \sum_{k=1}^{M} \phi_k(\mathbf{x}) \, \phi_k(\mathbf{x}'). \qquad (7)$$

For this reason, a positive definite function is also called a Mercer's kernel, or simply a kernel. Whether or not this expansion is explicitly computed, for any kernel $K$, $K(\mathbf{x}, \mathbf{x}')$ can be thought of as the inner product between the vectors $\phi(\mathbf{x}) = (\phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \dots, \phi_M(\mathbf{x}))$ and $\phi(\mathbf{x}') = (\phi_1(\mathbf{x}'), \phi_2(\mathbf{x}'), \dots, \phi_M(\mathbf{x}'))$. For a rigorous statement of Mercer's theorem the interested reader is referred to [17], [18].
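Condition (6) can be checked empirically on any finite sample by computing the eigenvalues of the Gram matrix; the sketch below (NumPy, toy data) applies this necessary test to two simple candidate similarity measures.

```python
import numpy as np

# Minimal sketch: checking condition (6) empirically on a finite sample.
# A negative eigenvalue of the Gram matrix proves a candidate similarity
# measure is not positive definite (passing the test, of course, proves nothing).

def min_gram_eigenvalue(kernel, points):
    G = np.array([[kernel(x, y) for y in points] for x in points])
    return np.linalg.eigvalsh(G).min()

xs = [0.5, 1.0, 2.0, 3.5]
print(min_gram_eigenvalue(min, xs))   # >= 0: min(x, y) is positive definite here
print(min_gram_eigenvalue(max, xs))   # < 0: max(x, y) is not a Mercer's kernel
```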

III. MANIPULATING BINARY STRINGS

In this section we discuss the manipulations of binary strings that we studied and applied to images. We start by establishing the notation we use throughout the paper.

A. Notation

We denote strings with upper case letters like $A$ and $B$, and bits with lower case letters like $a$ and $b$. A binary string of fixed length is a sequence of bits like $0010$ or $a_1 a_2 \cdots a_n$, with $a_j \in \{0, 1\}$ for $j = 1, \dots, n$. A string $A$ of length $n$ can also be written as an $n$-dimensional vector $\mathbf{A} = (a_1, \dots, a_n)$. If $\mathbf{V}$ is an ordinary $n$-dimensional vector, the components of which can take on arbitrary values, we write $\mathbf{V} = (V_1, \dots, V_n)$. Without risk of confusion we do not distinguish between the truth values "0" and "1" and the numerical values "0" and "1". Finally, we write $\mathbf{A} \cdot \mathbf{B}$ for the standard inner product between the two vectors $\mathbf{A}$ and $\mathbf{B}$.

B. Making kernels with logical connectives

We first deal with the case in which all binary strings have the same length and the similarity between strings is computed bitwise. As we will see in the next section, this analysis is relevant to histogram-like representations.

The question we address is which two-place logical operator acting bitwise on pairs of binary strings $A$ and $B$ defines a Mercer's kernel. We start by counting the number of operators which can be defined on bit pairs. Since there are 4 possible input bit pairs, $(0,0)$, $(0,1)$, $(1,0)$, and $(1,1)$, and 2 possible output values, 0 and 1, simple counting shows that there are only $2^4 = 16$ different logical operators on bit pairs. Following standard notation we use $\wedge$, $\vee$, and $\leftrightarrow$ for the conjunction, disjunction, and if-and-only-if two-place logical


operators, and $\neg$ for the negate one-place operator. We start with the conjunction, the AND operator, and define

$$K_\wedge(A, B) = \sum_{j=1}^{n} a_j \wedge b_j. \qquad (8)$$

It is easy to see that $K_\wedge$ is the linear kernel acting on the vectors $\mathbf{A}$ and $\mathbf{B}$ since

$$\sum_{j=1}^{n} a_j \wedge b_j = \sum_{j=1}^{n} a_j b_j = \mathbf{A} \cdot \mathbf{B}.$$

Of the remaining 15 cases, only the negate of the disjunction, the NOR operator, and the if-and-only-if operator define Mercer's kernels. Indeed, using the identity

$$\neg(a_j \vee b_j) = (\neg a_j) \wedge (\neg b_j),$$

we find that

$$K_{\neg\vee}(A, B) = \sum_{j=1}^{n} \neg(a_j \vee b_j) = \sum_{j=1}^{n} (1 - a_j)(1 - b_j) = (\mathbf{1} - \mathbf{A}) \cdot (\mathbf{1} - \mathbf{B}),$$

from which we see that $K_{\neg\vee}$ is the standard inner product between the binary vectors obtained by negating each component of the original vectors $\mathbf{A}$ and $\mathbf{B}$ respectively. Furthermore, from the fact that the sum of kernels is a kernel, we find that

$$K_\leftrightarrow(A, B) = \sum_{j=1}^{n} a_j \leftrightarrow b_j$$

is a Mercer's kernel since

$$K_\leftrightarrow(A, B) = K_\wedge(A, B) + K_{\neg\vee}(A, B).$$

That the other 13 two-place logical operators do not define Mercer's kernels can be immediately verified through straightforward counterexamples. Let us consider, for example, $\oplus$, the exclusive-or operator, and define

$$K_\oplus(A, B) = \sum_{j=1}^{n} a_j \oplus b_j = \|\mathbf{A} - \mathbf{B}\|_{L_1}.$$

Explicit computation of the $2 \times 2$ matrix $K$, $K_{hk} = K_\oplus(\mathbf{A}_h, \mathbf{A}_k)$, for the two binary strings $\mathbf{A}_1 = (1, 0)$ and $\mathbf{A}_2 = (0, 1)$ gives

$$K = \begin{pmatrix} 0 & 2 \\ 2 & 0 \end{pmatrix},$$

the determinant of which is negative. We thus have that $K_\oplus$ is not a Mercer's kernel.
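The counterexample, and the fact that $K_\wedge$, $K_{\neg\vee}$, and $K_\leftrightarrow$ are inner products, can be reproduced in a few lines (a sketch in Python with NumPy):

```python
import numpy as np

# Minimal sketch of the bitwise kernels of this section: K_AND (8),
# K_NOR, and K_IFF on binary strings of equal length, plus the XOR
# counterexample reproduced numerically.

def k_and(A, B):  return int(np.sum(A & B))            # = A . B
def k_nor(A, B):  return int(np.sum(~(A | B) & 1))     # = (1-A) . (1-B)
def k_iff(A, B):  return k_and(A, B) + k_nor(A, B)     # sum of two kernels
def k_xor(A, B):  return int(np.sum(A ^ B))            # NOT a kernel

A1 = np.array([1, 0])
A2 = np.array([0, 1])
G = np.array([[k_xor(a, b) for b in (A1, A2)] for a in (A1, A2)])
print(G, np.linalg.det(G))   # [[0 2], [2 0]], determinant -4 < 0
```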

C. Beyond bitwise matching

Up to now we have discussed operators acting bitwise on binary strings. If nearby bits in a string are correlated, we aim at studying Mercer's kernels able to capture short-range correlations. These kernels are expected to be particularly useful when dealing with images and signals in the original domain.

The key idea is to replace the binary vector $\mathbf{A}$ with the non-binary vector $\mathbf{A}^S$. The vector $\mathbf{A}^S$ can be thought of as the result of convolving $\mathbf{A}$ with the stencil¹ $S$,

$$\mathbf{A}^S = \mathbf{A} * S. \qquad (9)$$

Intuitively, the vector $\mathbf{A}^S$ captures short-range spatial correlations between nearby components of $\mathbf{A}$. For example, if $S = (\tfrac{1}{2}, 1, \tfrac{1}{2})$ we have

$$A^S_j = a_j + \tfrac{1}{2}(a_{j-1} + a_{j+1}),$$

where, for simplicity, we consider the last component, $a_n$, adjacent to the first, $a_1$ (equivalently, all summations on subscripts are taken modulo $n$). We immediately see that each stencil $S$ identifies a Mercer's kernel $K_S$, or

$$K_S(A, B) = \mathbf{A}^S \cdot \mathbf{B}^S,$$

where the convolution in (9) defines the feature mapping.

We now show that a stencil satisfying some additional constraints leads to a new type of Mercer's kernel which, as we will see in Section V, is well suited for iconic matching.

Theorem: Let $S$ be defined as

$$S = \Big(\underbrace{\tfrac{1}{2m}, \dots, \tfrac{1}{2m}}_{m},\; 1,\; \underbrace{\tfrac{1}{2m}, \dots, \tfrac{1}{2m}}_{m}\Big), \qquad (10)$$

then there exists a mapping $\phi: \mathbb{R}^n \to \mathbb{R}^{2m \cdot n}$, $\mathbf{V} \mapsto \tilde{\mathbf{V}}$, such that for all vectors $\mathbf{V} \in \mathbb{R}^n$

$$\mathbf{V}^S \cdot \mathbf{V} = \tilde{\mathbf{V}} \cdot \tilde{\mathbf{V}}.$$

Proof: Using the stencil $S$ of (10), for the components $j = 1, \dots, n$ of $\mathbf{V}^S$ we find

$$V^S_j = V_j + \frac{1}{2m} \sum_{l=1}^{m} \big(V_{j-l} + V_{j+l}\big).$$

Taking the inner product between $\mathbf{V}^S$ and $\mathbf{V}$ gives

$$\mathbf{V}^S \cdot \mathbf{V} = \frac{1}{4m} \Bigg( 4m \sum_{j=1}^{n} V_j^2 + 2 \sum_{j=1}^{n} \sum_{\substack{l=-m \\ l \neq 0}}^{m} V_j V_{j+l} \Bigg).$$

Expanding and rearranging the terms of the above sums we obtain

$$4m\, \mathbf{V}^S \cdot \mathbf{V} = \underbrace{V_1^2 + \dots + V_1^2}_{4m} + \dots + \underbrace{V_n^2 + \dots + V_n^2}_{4m} + \underbrace{2V_1 V_{1-m} + \dots + 2V_1 V_{1+m}}_{2m} + \dots + \underbrace{2V_n V_{n-m} + \dots + 2V_n V_{n+m}}_{2m}.$$

1. A stencil is a pattern of weights used in computing a convolution, arranged in such a way as to suggest the spatial relationships between the places where the weights are applied [19].


Finally, grouping the squares and the corresponding mixed products gives

$$4m\, \mathbf{V}^S \cdot \mathbf{V} = \sum_{j=1}^{n} \sum_{\substack{l=-m \\ l \neq 0}}^{m} \big(V_j + V_{j+l}\big)^2. \qquad (11)$$

From Eq. (11) it follows that the dot product $\mathbf{V}^S \cdot \mathbf{V}$ is the sum of $2m \cdot n$ squares and can thus be regarded as the square of the norm of a vector $\tilde{\mathbf{V}}$. The $2m \cdot n$ components of $\tilde{\mathbf{V}}$, which are of the type

$$\frac{1}{2\sqrt{m}} \big(V_j + V_{j+l}\big)$$

for $j = 1, \dots, n$ and $l = -m, \dots, m$ ($l \neq 0$), can be thought of as features² of a feature mapping underlying the Mercer's kernel $K^*$ defined as

$$K^*(\mathbf{A}, \mathbf{B}) = \tilde{\mathbf{A}} \cdot \tilde{\mathbf{B}}.$$

2. For the specific stencil $S$ in the theorem every component appears twice in the r.h.s. of (11), and the mapping could be more parsimoniously defined in a feature space of dimensionality $m \cdot n$. In general, however, the dimensionality of the feature space is $2m \cdot n$.

Simple counting arguments can be advocated to extend this theorem to other stencils, subject to the condition that the central entry is 1 and the remaining $2m$ nonnegative entries sum up to 1.

We conclude by observing that, somewhat surprisingly, the kernel $K^*$ can also be written as

$$K^*(\mathbf{A}, \mathbf{B}) = \frac{1}{2} \big( \mathbf{A}^S \cdot \mathbf{B} + \mathbf{B}^S \cdot \mathbf{A} \big). \qquad (12)$$

As we will see in Section V, this form is closely related to a similarity measure proposed for iconic matching in [20].
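The theorem and the equivalence (12) are easy to verify numerically; the following sketch (NumPy, circular convolution as in the proof, $m = 2$) builds $\mathbf{V}^S$ and the feature map $\tilde{\mathbf{V}}$ explicitly:

```python
import numpy as np

# Minimal sketch: numerical check of the theorem and of Eq. (12) for a
# circular stencil with m = 2, i.e. S = (1/4, 1/4, 1, 1/4, 1/4).

def conv_stencil(V, m):
    # Circular convolution of V with the stencil of (10): V^S.
    out = V.astype(float)
    for l in range(1, m + 1):
        out = out + (np.roll(V, l) + np.roll(V, -l)) / (2 * m)
    return out

def phi(V, m):
    # Feature map of the theorem: components (V_j + V_{j+l}) / (2 sqrt(m)).
    feats = [(V + np.roll(V, -l)) / (2 * np.sqrt(m))
             for l in range(-m, m + 1) if l != 0]
    return np.concatenate(feats)

rng = np.random.default_rng(0)
A = rng.integers(0, 2, 12)
B = rng.integers(0, 2, 12)
m = 2

lhs = 0.5 * (conv_stencil(A, m) @ B + conv_stencil(B, m) @ A)  # Eq. (12)
rhs = phi(A, m) @ phi(B, m)                                    # inner product
print(np.isclose(lhs, rhs))   # True: K* is a Mercer's kernel
```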

We are now in a position to study under what conditions a number of similarity measures, well known to the computer vision community, can be used as kernels in statistical learning methods. We start by discussing the case of images represented through histograms.

IV. HISTOGRAM INTERSECTION KERNEL

In this section we focus on histogram intersection, a technique made popular in [21] for color indexing with application to object recognition.

We start off by recalling the definitions of histograms for images and of histogram intersection. In the case of images, a histogram can be associated to properties or characteristics of each pixel, such as the brightness or the color of pixels, or the direction or the magnitude of edges. After defining the range of possible values for a given property and dividing the range into intervals (called bins), the histogram is computed by counting how many pixels of the image belong to each interval.
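For instance, a global color histogram of the kind used in Section VI can be computed in a few lines (a sketch assuming NumPy; the 15-bins-per-channel resolution anticipates the experiments below, and the toy image is random data):

```python
import numpy as np

# Minimal sketch: a global RGB color histogram with 15 bins per channel,
# the representation used in the indoor/outdoor experiments of Section VI.

def color_histogram(image, bins=15):
    # image: H x W x 3 array of RGB values in [0, 255]
    pixels = image.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3,
                             range=[(0, 256)] * 3)
    return hist.ravel() / hist.sum()   # normalized histogram

rng = np.random.default_rng(0)
toy_image = rng.integers(0, 256, (48, 64, 3))
h = color_histogram(toy_image)         # vector of length 15**3
```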

Histogram intersection can be used as a similarity measure for histogram-based representations of images. In the simple case of 1-D images of $N$ pixels, we denote with $A$ and $B$ the $m$-bin histograms of images $I$ and $J$ respectively and define the histogram intersection $K_{\mathrm{int}}$ as

$$K_{\mathrm{int}}(A, B) = \sum_{i=1}^{m} \min\{A_i, B_i\} \qquad (13)$$

with $A_i$ and $B_i$ the $i$-th bins for $i = 1, \dots, m$, and $\sum_{i=1}^{m} A_i = \sum_{i=1}^{m} B_i = N$. Intuitively, histogram intersection measures the degree of similarity between two histograms, and we clearly have $0 \le K_{\mathrm{int}} \le N$. Unlike other similarity measures, like the $L_2$ distance for example, this similarity measure is not only well suited to deal with color and scale changes, but can also be successfully used in the presence of non-segmented objects. Furthermore, $K_{\mathrm{int}}$ can be computed efficiently and adapted to search for partially occluded objects in images.

The effectiveness of histogram intersection as a similarity measure for color indexing raises the question of whether or not this measure can be adopted in kernel-based methods. We answer this question by finding explicitly the feature mapping after which histogram intersection is an inner product.

If we represent a histogram $A$ with the $n$-dimensional binary vector $\mathbf{A}$ (with $n = m \cdot N$) defined as

$$\mathbf{A} = \Big(\underbrace{1, \dots, 1}_{A_1}, \underbrace{0, \dots, 0}_{N - A_1}, \underbrace{1, \dots, 1}_{A_2}, \underbrace{0, \dots, 0}_{N - A_2}, \dots, \underbrace{1, \dots, 1}_{A_m}, \underbrace{0, \dots, 0}_{N - A_m}\Big) \qquad (14)$$

and similarly $B$ with $\mathbf{B}$, it can be readily seen that $K_{\mathrm{int}}(A, B)$ in (13) is equal to the standard inner product between the two corresponding vectors $\mathbf{A}$ and $\mathbf{B}$, or, in the notation of the previous section,

$$K_{\mathrm{int}}(A, B) = K_\wedge(\mathbf{A}, \mathbf{B}).$$

We thus have that $K_{\mathrm{int}}$ is a Mercer's kernel and that the binary vector in (14) describes explicitly the mapping $\mathbf{A} = \phi(A)$ between input and feature space.
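The mapping (14) and the equality $K_{\mathrm{int}} = K_\wedge$ can be checked directly (a sketch with NumPy; the random histograms are toy data):

```python
import numpy as np

# Minimal sketch: the unary encoding of Eq. (14) and a numerical check
# that histogram intersection (13) equals the inner product of the
# encoded binary vectors.

def unary_encode(H, N):
    # Each bin H_i becomes H_i ones followed by N - H_i zeros.
    return np.concatenate([np.r_[np.ones(h), np.zeros(N - h)] for h in H])

rng = np.random.default_rng(0)
m, N = 8, 50
A = rng.multinomial(N, np.ones(m) / m)   # two m-bin histograms, sum = N
B = rng.multinomial(N, np.ones(m) / m)

k_int = np.minimum(A, B).sum()                     # Eq. (13)
dot = unary_encode(A, N) @ unary_encode(B, N)      # K_AND in feature space
print(k_int == int(dot))   # True
```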

A few comments are in order. First, we notice the redundant representation of the information content in (14). The $n$ components of $\mathbf{A}$ are dependent features, since the original histogram $A$ is uniquely determined by the rightmost 1 in each of the $m$ $N$-tuplets of components in (14).

Second, the assumption of dealing with images with the same number of pixels is unnecessary. In the case of images of different size, the same analysis can be carried out by normalizing the histogram areas and repeating the same construction as above on the resulting normalized histograms. Alternatively, it is sufficient to take $n$ large enough (for example, equal to $\tilde{N} \cdot m$ with $\tilde{N}$ equal to the number of pixels of the largest image in the considered set).

Third, the generalization of this result to higher dimensions, like 2-D images and 3-D color space representations, is straightforward.

Fourth, if we restrict our attention to the case in which the binary strings have the same number $N$ of bits equal to 1, the close relation between histogram intersection and the $L_1$ distance becomes apparent. Indeed, from the fact that for two strings of length $n$ with $N$ 1s each, $\tilde{N}$ of which at the same location, we have

$$\mathbf{A} \cdot \mathbf{B} = \tilde{N} \quad \text{and} \quad \|\mathbf{A} - \mathbf{B}\|_{L_1} = 2(N - \tilde{N}),$$

it follows that

$$2\, \mathbf{A} \cdot \mathbf{B} + \|\mathbf{A} - \mathbf{B}\|_{L_1} = 2N. \qquad (15)$$


Using Eq. (15) and the fact that the $L_1$ distance in input space is equal to the $L_1$ bitwise distance in feature space, that is,

$$\sum_{i=1}^{m} |A_i - B_i| = \sum_{j=1}^{n} |a_j - b_j|,$$

we have that

$$K_{\mathrm{int}}(A, B) = N - \frac{1}{2}\, \|\mathbf{A} - \mathbf{B}\|_{L_1}.$$

This leads to the conclusion that, under the assumption of dealing with binary strings with the same number $N$ of bits equal to 1, the $L_1$ distance is the distance naturally associated with histogram intersection [21], [22].
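The identity is again easy to verify (NumPy sketch, toy histograms with equal total count $N$):

```python
import numpy as np

# Minimal sketch: checking K_int(A, B) = N - 0.5 * ||A - B||_1 on random
# histograms with the same total count N (Eq. (15) and its consequence).

rng = np.random.default_rng(0)
m, N = 16, 100
A = rng.multinomial(N, np.ones(m) / m)   # m-bin histogram, sum = N
B = rng.multinomial(N, np.ones(m) / m)

k_int = np.minimum(A, B).sum()
print(k_int == N - 0.5 * np.abs(A - B).sum())   # True
```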

V. HAUSDORFF KERNEL

In this section we show that, by means of appropriate modifications, a similarity measure for grey-level images based on the notion of Hausdorff distance is a Mercer's kernel.

A. Matching grey-level images with Hausdorff distance

Let us start by describing the essence of the similarity measure proposed in [20]. Without loss of generality, and for ease of notation, we consider the case of 1-D images.

Suppose we have two $N$-pixel grey-level images, $I$ and $J$, of which we want to compute the degree of similarity. In order to compensate for small grey-level changes or local transformations, we define a neighborhood $\mathcal{N}_i$ of each pixel $i = 1, \dots, N$ and evaluate the expression

$$k_\varepsilon(I, J) = \sum_{i=1}^{N} \theta\big(\varepsilon - |I(i) - J(i_J)|\big), \qquad (16)$$

where $\theta$ is the unit step function, and $i_J$ denotes the pixel of $J$ most similar to $I(i)$ in the neighborhood $\mathcal{N}_i$ of $i$. For each pixel $i$, the corresponding term in the sum (16) equals 1 if $|I(i) - J(i_J)| \le \varepsilon$ (that is, if in the neighborhood $\mathcal{N}_i$ there exists a pixel $i_J$ in which $J$ differs by no more than $\varepsilon$ from $I(i)$) and 0 otherwise. The rationale behind (16) is to evaluate the degree of overlap of two images bypassing pixel-wise correspondences that do not take into account the effects of small variations and acquisition noise.

Unless the set $\mathcal{N}_i$ consists of the only element $i$, $k_\varepsilon(I, J) \neq k_\varepsilon(J, I)$. Symmetry can be restored, for example, by taking the average

$$k(I, J) = \frac{1}{2} \big( k_\varepsilon(I, J) + k_\varepsilon(J, I) \big).$$

The quantity $k(I, J)$ can be computed in three steps:

1) Expand the two images $I$ and $J$ into 2-D binary matrices $A$ and $B$ respectively, the second dimension being the grey value. For example, for $A$ we write

$$A(i, g) = \begin{cases} 1 & \text{if } I(i) = g \\ 0 & \text{otherwise.} \end{cases}$$

2) Dilate both matrices by growing their nonzero entries, i.e., by adding 1s in a neighborhood of size $\varepsilon$ in the grey-value dimension and of half the linear size of $\mathcal{N}_i$ in the spatial dimension. Let $A^d$ and $B^d$ be the resulting 2-D dilated binary matrices.

3) Compute the size of the intersections between $A$ and $B^d$, and $B$ and $A^d$, and take the average of the two values. Thinking of the 2-D binary matrices as (binary) vectors, in obvious notation we could also write

$$k(I, J) = \frac{1}{2} \big( \mathbf{A} \cdot \mathbf{B}^d + \mathbf{B} \cdot \mathbf{A}^d \big). \qquad (17)$$

Under appropriate conditions on the dilation sizes, $k$ is closely related to the partial directed Hausdorff distance between the binary matrices $A$ and $B$ thought of as 2-D point sets [23], [20]. As a similarity measure, the function $k$ in (17) has been shown to have several interesting properties, like tolerance to small changes affecting images due to geometric transformations, viewing conditions, and noise, and robustness to occlusions [20].
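The three steps translate directly into code; the following sketch (assuming SciPy's binary dilation; image sizes, $\varepsilon$, and the neighborhood size are illustrative toy values) computes $k(I, J)$ for 1-D grey-level images:

```python
import numpy as np
from scipy.ndimage import binary_dilation

# Minimal sketch of the three-step computation of k(I, J) in Eq. (17)
# for 1-D grey-level images with G grey levels (toy sizes).

def expand(I, G):
    # Step 1: N x G binary matrix, A[i, g] = 1 iff I(i) = g.
    A = np.zeros((len(I), G), dtype=bool)
    A[np.arange(len(I)), I] = True
    return A

def hausdorff_similarity(I, J, G, eps=2, half_nbh=1):
    A, B = expand(I, G), expand(J, G)
    # Step 2: dilate by half_nbh pixels spatially and eps grey levels.
    struct = np.ones((2 * half_nbh + 1, 2 * eps + 1), dtype=bool)
    Ad, Bd = binary_dilation(A, struct), binary_dilation(B, struct)
    # Step 3: average the two intersection sizes.
    return 0.5 * ((A & Bd).sum() + (B & Ad).sum())

I = np.array([3, 3, 4, 5, 9, 9])
J = np.array([3, 4, 4, 6, 8, 9])
print(hausdorff_similarity(I, J, G=16))
```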

B. Hausdorff-based Mercer’s kernels

Here again we ask whether or not the similarity measure above can be adopted as a kernel in statistical learning methods. For a fixed training set of images $\{I_1, \dots, I_\ell\}$, an empirical answer to this question can be simply provided by checking the positive semidefiniteness of the $\ell \times \ell$ matrix

$$K_{hk} = k(I_h, I_k) = \frac{1}{2} \big( \mathbf{A}_h \cdot \mathbf{A}_k^d + \mathbf{A}_k \cdot \mathbf{A}_h^d \big).$$

A simple counterexample shows that $k(I, J)$ is not a Mercer's kernel. Consider the three 1-pixel images $I_1 = 1$, $I_2 = 2$, and $I_3 = 3$, and $\varepsilon = 1$. Explicit computation of the entries of the $3 \times 3$ matrix $K$ gives

$$K = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 1 \end{pmatrix},$$

which, having negative determinant, is clearly not positive semidefinite.

In [24] $k$ was used as a kernel for one-class SV-based classification. The argument presented there to justify the convexity of the optimization problem was flawed: as we discussed in Section II, the fact that $k$ is not a Mercer's kernel may lead to a non-convex problem with multiple solutions. The very promising experimental results in [24] suggest that the local minimum obtained as the solution of the optimization problem was empirically close to the global minimum.

We now show that through an appropriate alteration of $k$ it is possible to obtain a Mercer's kernel without losing the attractive properties of Hausdorff-based similarity measures. The function $k$ is not a Mercer's kernel because the dilation stage makes the off-diagonal elements of the "kernel" matrix too large. Using the theorem of Section III-C, we see that if we redefine the dilation stage as a convolution with a 3-D stencil $S$ of the type in expression (10) along all 3 dimensions, the newly defined quantity $k^*$,

$$k^*(\mathbf{A}, \mathbf{B}) = \frac{1}{2} \big( \mathbf{A} \cdot \mathbf{B}^S + \mathbf{B} \cdot \mathbf{A}^S \big),$$

is an inner product (see Eq. (12)). From the proof of the theorem, it follows that this result is made possible by (a) the reduced amount of dilation built into the convolution stage, as opposed to the original dilation stage, and (b) the linearity of the convolution operator. An example in the simpler 1-D case is shown in Figure 1. In the 3-D case the stencil is defined over a cuboid with the central entry equal to 1 and the off-center weights summing up to 1.
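On the three 1-pixel images of the counterexample above, the modified kernel indeed yields a positive semidefinite matrix; a sketch follows (NumPy, $m = 1$, using a zero-padded rather than circular convolution, which only removes boundary squares from the grouping argument of the proof and so preserves positivity):

```python
import numpy as np

# Minimal sketch: on the three 1-pixel images of the counterexample, the
# stencil-based kernel k* (m = 1 in the grey-value dimension) yields a
# positive semidefinite matrix, unlike the dilation-based k.

def expand(I, G=5):
    A = np.zeros(G)
    A[I] = 1.0
    return A

def conv_stencil(V, m=1):
    out = V.copy()
    for l in range(1, m + 1):
        # zero-padded convolution with the stencil S = (1/2, 1, 1/2)
        out[:-l] += V[l:] / (2 * m)
        out[l:] += V[:-l] / (2 * m)
    return out

def k_star(A, B):
    return 0.5 * (A @ conv_stencil(B) + B @ conv_stencil(A))

imgs = [expand(1), expand(2), expand(3)]
K = np.array([[k_star(a, b) for b in imgs] for a in imgs])
print(np.linalg.eigvalsh(K).min() >= -1e-12)   # True: K is PSD
```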


Fig. 1. Difference between $\mathbf{A}^d$ and $\mathbf{A}^S$. Given the binary vector $\mathbf{A}$ (top), the vector $\mathbf{A}^d$ (middle) is obtained by dilating each unit entry of $\mathbf{A}$, as specified in step 2 of Section V-A, within a neighborhood of half-size 2. The vector $\mathbf{A}^S$ (bottom) is obtained by convolving $\mathbf{A}$ with the stencil $S = (\tfrac{1}{4}, \tfrac{1}{4}, 1, \tfrac{1}{4}, \tfrac{1}{4})$. Notice that $\mathbf{A}^d$ is still a binary vector while $\mathbf{A}^S$ has rational entries that may also be greater than 1.

VI. EXPERIMENTAL RESULTS

In this section we describe the experiments we performed to corroborate the theoretical analysis presented in the previous sections. We first consider two different image classification tasks addressed through histogram-based representations: indoor/outdoor classification and cityscape retrieval. Then we deal with the problem of 3-D object detection from a single image with an iconic representation.

A. Histogram-based representation

Most of the images used for the indoor/outdoor classification and cityscape retrieval problems were downloaded from the Web by randomly crawling Web sites containing relevant images. A small percentage consisted of holiday photos of members of our group.

1) Indoor/outdoor image classification: Deciding whether an image is a picture of an indoor or an outdoor scene is a typical example of understanding the semantics of the image content [25]. Its practical use ranges from scene understanding to automatic image adjustment for high-quality printouts and intelligent film development.

For our experiments we gathered a collection of 1392 outdoor and 681 indoor images. Figures 2 and 3 show some examples of the indoor and outdoor images used, respectively. Figure 4, instead, shows examples of ambiguous images: indoor images with natural outdoor light effects (often caused by the presence of windows), outdoor images taken at sunset, or close-ups. The images of the dataset are not homogeneous in size, format, resolution, or acquisition device. Their typical size is in the range of $10^4$ to $10^5$ pixels. This problem can be naturally seen as a binary classification problem: we therefore use a Support Vector Machine for binary classification. Since an important cue for this problem is color distribution [25], we decided to represent each image by means of a single, global color histogram and compare the performance of SVMs with various kernels.

We built a training set of 400 indoor and 400 outdoor images by randomly sampling the entire population. All remaining 1273 images (992 of which outdoor and 281 indoor) formed the test set.

Fig. 2. Samples of indoor images used in our experiments.

Fig. 3. Samples of outdoor images used in our experiments.

The results we obtained on HSV and RGB color histograms, comparing the histogram intersection kernel with common off-the-shelf kernels and with the heavy-tail RBF kernels [10] defined as

$$K(\mathbf{x}, \mathbf{x}') = \exp\left( - \frac{1}{2\sigma^2} \sum_i \big| x_i^a - x_i'^{\,a} \big|^b \right),$$

are shown in Tables I and II respectively. As in [10], we call Laplacian kernels the heavy-tail RBF kernels with $b = 1$, and sublinear kernels the ones with $b = 0.5$. For all RBF kernels we trained the system varying the value of $\sigma$ in a reasonably wide range. Here we report the results obtained with the best performing $\sigma$ for each type of kernel.

For both color spaces, in a first set of experiments we built histograms with $15 \times 15 \times 15$ bins, and we then checked the sensitivity to bin resolution considering coarser ($8 \times 8 \times 8$ bins) and finer ($20 \times 20 \times 20$ bins) color histograms. For each kernel and each histogram resolution the displayed recognition rates were obtained by averaging the results over 4 different splits between training and test sets, built according to the procedure above, and computing the associated standard deviation.


TABLE I
RECOGNITION RATES (R.R.) FOR SVMS WITH DIFFERENT KERNELS ON THE INDOOR/OUTDOOR CLASSIFICATION PROBLEM IN THE HSV COLOR SPACE. THE R.R. ARE OBTAINED BY AVERAGING OVER 4 DIFFERENT SPLITS (THE INDICATED DEVIATIONS ARE STANDARD DEVIATIONS). IN THE CASE OF RBF KERNELS ONLY THE PARAMETER VALUES LEADING TO THE BEST RECOGNITION RATES ARE REPORTED.

Kernel type            | 8x8x8      | 15x15x15   | 20x20x20
histogram intersection | 69.6 ± 0.9 | 89.7 ± 0.3 | 72.0 ± 0.9
linear                 | 70.6 ± 1.3 | 82.7 ± 1.2 | 68.2 ± 1.9
2-nd deg polynomial    | 69.9 ± 0.5 | 82.8 ± 1.1 | 68.7 ± 1.0
4-th deg polynomial    | 67.2 ± 0.7 | 81.9 ± 0.6 | 69.0 ± 0.5
Gaussian (best σ)      | 69.1 ± 0.6 | 83.5 ± 0.7 | 68.9 ± 1.1
Laplacian (σ₁, a₁)     | 71.2 ± 0.2 | 90.7 ± 1.0 | 73.2 ± 0.9
Laplacian (σ₂, a₂)     | 66.1 ± 0.8 | 90.8 ± 0.6 | 78.2 ± 0.9
Laplacian (σ₃, a₃)     | 66.5 ± 0.5 | 91.2 ± 0.7 | 79.1 ± 1.2
Sublinear (best σ, a)  | 73.6 ± 0.4 | 89.7 ± 1.0 | 73.0 ± 1.6

TABLE II
RECOGNITION RATES (R.R.) FOR SVMS WITH DIFFERENT KERNELS ON THE INDOOR/OUTDOOR CLASSIFICATION PROBLEM IN THE RGB COLOR SPACE. THE R.R. ARE OBTAINED BY AVERAGING OVER 4 DIFFERENT SPLITS (THE INDICATED DEVIATIONS ARE STANDARD DEVIATIONS). IN THE CASE OF RBF KERNELS ONLY THE PARAMETER VALUES LEADING TO THE BEST RECOGNITION RATES ARE REPORTED.

Kernel type            | 8x8x8      | 15x15x15   | 20x20x20
histogram intersection | 58.7 ± 2.2 | 88.4 ± 0.3 | 88.4 ± 1.1
linear                 | 55.6 ± 1.0 | 82.6 ± 1.1 | 82.5 ± 1.7
2-nd deg polynomial    | 53.3 ± 1.7 | 83.6 ± 0.5 | 83.6 ± 1.8
4-th deg polynomial    | 50.4 ± 0.8 | 84.6 ± 0.8 | 84.5 ± 1.2
Gaussian (best σ)      | 50.8 ± 0.4 | 85.8 ± 1.2 | 84.3 ± 1.0
Laplacian (σ₁, a₁)     | 47.1 ± 0.7 | 90.4 ± 0.2 | 89.8 ± 1.2
Laplacian (σ₂, a₂)     | 50.0       | 90.2 ± 0.5 | 90.1 ± 0.9
Laplacian (σ₃, a₃)     | 48.6 ± 0.1 | 89.8 ± 0.7 | 88.6 ± 0.4
Sublinear (best σ, a)  | 48.6 ± 0.4 | 89.1 ± 0.3 | 89.2 ± 0.3

Fig. 4. Some examples in which the labeling is ambiguous (see text).

By inspecting Tables I and II a number of conclusions can be drawn. First, histogram intersection and heavy-tail kernels behave better than off-the-shelf kernels at intermediate resolution ($15 \times 15 \times 15$ bins) for both color spaces. The results vary differently with respect to bin resolution for the two color spaces. In the RGB case very good results are also obtained at the finer resolution (for a few kernels even better rates), while in the HSV case there is a significant drop in performance at both the coarser and the finer resolutions. At peak performance, however, the choice of the color space does not seem to be crucial, as reported in [10]. The reasonably small range of variation of the various standard deviations indicates that the obtained results are quite stable.

Second, heavy-tail kernels for a rather wide selection of parameters give slightly though consistently better results than histogram intersection (around 10 out of 1000 images). This limited gain should be weighed against the fact that histogram intersection has no parameters to tune and is much cheaper to compute. The issue of parameter estimation should not be neglected, because the training sets were separated with no error by all kernels. This said, the heavy-tail kernels appear to be better suited to capture the sparsity of the color histograms.

Table III displays the number of support vectors for both the HSV and RGB color spaces. From the entries of Table III it is clear that the HSV representation leads to the selection of a higher number of support vectors. Furthermore, the number of support vectors for each color space is higher for the kernels performing best on the given test. The implications, if any, of these empirical findings need to be investigated. In the case of HSV space we also performed experiments with an unbalanced sampling (a different number of bins per channel). In the case of indoor/outdoor classification the quantity of light and the saturation seemed to be more important than the shade of colors; for this reason we built histograms with only 4 bins for the hue (instead of 15). The recognition rates drop slightly for all kernels with respect to the $15 \times 15 \times 15$ case, but this choice is an interesting compromise, allowing us to achieve still very good results with a significant reduction in the dimensionality of the input space.


TABLE III
NUMBER OF SUPPORT VECTORS FOR DIFFERENT KERNELS, WITH A TRAINING SET OF 800 ITEMS AND 15x15x15 BINS FOR BOTH HSV AND RGB.

Kernel type         | RGB | HSV
histogram int.      | 436 | 532
linear              | 372 | 419
polynomial (deg. 2) | 355 | 395
polynomial (deg. 4) | 342 | 370
Gaussian (σ₁)       | 341 | 379
Gaussian (σ₂)       | 364 | 396
Laplacian (σ₁)      | 564 | 632
Laplacian (σ₂)      | 544 | 615
Sublinear (σ)       | 544 | 581

We conclude by observing that the results obtained using appropriate kernels (both histogram intersection and heavy-tail kernels) are very close to those reported in [25], in which a fairly complex classification procedure was proposed and tested on a different database, reaching a recognition rate slightly better than 90%. Overall, it appears that the learning-from-examples approach is able to exploit the fact that color distribution captures an essential aspect of the indoor/outdoor classification problem.

2) Cityscape retrieval: The problem of finding cityscapes in an image dataset³ amounts to finding a method for understanding whether an image is a picture taken from a distance sufficient to obtain a panoramic view of a city: skylines are well-known examples of cityscapes.

For this set of experiments we collected a dataset of 803 positive examples consisting of open city views and skylines (some examples of which are shown in Figure 5). Somewhat arbitrarily, we decided to sample the space of non-cityscapes by gathering 849 examples of city details (like street corners, buildings, etc.), different sorts of man-made objects, people, and landscapes (see Figure 6 for some examples).

Fig. 5. Samples of cityscapes (top), the detected edge points (middle), and the histograms of edge directions (bottom).

An important cue is the geometry of the scene: a possible approach is to look for specific geometric features, like straight

3. The relevance of this problem to news agencies has been brought to our attention by Reuters.

Fig. 6. Samples of non-cityscapes (top), the detected edge points (middle), and the histograms of edge directions (bottom).

lines or polygons. In an attempt to keep the preprocessing stage to a minimum and to investigate the relevance of using appropriate kernels in the statistical learning framework, we resorted to using a binary SVM classifier and representing image information with edge-direction histograms. We use the Canny edge detector to extract edges and estimate their orientation. Histograms were obtained by dividing the range between 0 and 180 degrees into 10 equally spaced intervals and adding a further bin for non-edge pixels.

As in the previous set of experiments, we formed a training set of 800 images (400 cityscapes and 400 non-cityscapes) by randomly sampling the image dataset, using the remaining 852 images of the entire dataset as the test set (403 cityscapes and 449 non-cityscapes).
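A sketch of the edge-direction histogram representation follows (assuming OpenCV; the Canny thresholds and the file name are illustrative, not the settings used in the experiments):

```python
import cv2
import numpy as np

# Minimal sketch of an 11-bin edge-direction histogram: 10 orientation
# bins over [0, 180) degrees plus one bin for non-edge pixels.

def edge_direction_histogram(gray):
    edges = cv2.Canny(gray, 100, 200) > 0
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    theta = np.degrees(np.arctan2(gy, gx)) % 180.0   # orientation in [0, 180)
    hist, _ = np.histogram(theta[edges], bins=10, range=(0.0, 180.0))
    hist = np.append(hist, (~edges).sum())           # 11th bin: non-edge pixels
    return hist / hist.sum()                         # normalize histogram area

img = cv2.imread("some_image.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file
h = edge_direction_histogram(img)
```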

Table IV shows the recognition rates obtained using different kernels and two different widths of the low-pass filter used for edge detection. In the case of RBF kernels only the values leading to the best recognition rates are included.

The results are displayed in terms of precision, the proportion of cityscapes within all images retrieved by the system, and recall, the proportion of cityscapes retrieved by the system within all cityscapes in the test set. The precision-recall evaluation is popular in the world of retrieval: the measure of precision is important for retrieval purposes as it describes the proportion of retrieved items that are correct for the query; the measure of recall is important for classification purposes since it gives the percentage of positive items detected in a validation data set. In a good classifier the two quantities should be high and possibly well balanced (otherwise further tuning is required).

As can be seen by inspecting Table IV, the histogram intersection kernel appears to perform best in this case. The


effectiveness of heavy-tail kernels seems rather limited, if present at all. The combinations of parameters leading to the best results are displayed in the table.

This quite different behavior is likely to be due to the very different nature of the inputs in this and the previous set of experiments. Here, histograms live in a space of 11 dimensions and, since the typical image size is $10^4$ to $10^5$ pixels, are dense. As before, model selection is not straightforward, given that for all kernels we obtained perfect separation of the training set. Once more, the advantage of using a kernel which does not depend on free parameters is clear. Concluding, we observe that these results, which represent a substantial improvement with respect to the same approach with co-occurrence matrices instead of histograms of edge directions [26], are very similar to those reported in [27], possibly the state of the art on the subject. There is little doubt that better results could be obtained by employing more sophisticated representations.

B. Iconic representation

To evaluate the effectiveness of the Hausdorff-based kernel described in Section V we performed some experiments with a view-based identification system. The idea is to train a system from positive examples only to identify a 3-D object from a single image. Following an appearance-based approach, the training set consists of a collection of views of a certain object taken by a camera moving around it. At run time the system must answer the question of whether the current image, or a part of it, is a view of the same object. The object is a toy duck to be detected in a cluttered environment. An important component of this system is the technique used for image matching. Figure 7 shows example frames

Fig. 7. Example frames from the original toy duck sequence.

belonging to the training sequence. The collection of views is built using a semiautomatic tracking stage, the purpose of which is to obtain a crude form of foreground/background segmentation. A rectangular region containing the object of interest is identified in the first frame and used as a template. The region is then tracked through the sequence, maximizing in each frame the matching score of Eq. (17), with a fixed tolerance $\varepsilon$ on the grey levels and a fixed neighborhood size. If the maximized matching score falls below a given threshold, a new rectangular region, or template, is drawn around the object of interest in the current frame and the procedure is restarted. Figure 8 shows the results of the tracking after initialization with the template shown in the first frame. It can be easily checked that the object is not always centered, since the threshold determining the need for a new initialization of the tracking procedure has been set to a small number to minimize manual intervention. Each image in the segmented sequence is resized to a fixed common size, so that the data live in a space of a few thousand dimensions.

In our experiments we train a one-class SVM [16] on 460 images taken from a single image sequence. We test the system on two different image sequences of the same object (a total of 660 images) and on 1360 images of negative examples. The positive test data were acquired on different days by different people. To avoid searching across scale, in the acquisition stage the apparent size of the object of interest throughout the sequences was kept approximately constant. The negative test data contain other views of the same environment, including similar objects (e.g., other toys) at roughly the same scale. Clearly, searching across scale cannot be avoided when developing a practical detection system.

The results of the experiments performed are shown by means of Receiver Operating Characteristic (ROC) curves. Each point of a ROC curve represents a pair consisting of the false-alarm rate and the hit rate of the system, obtained by varying the radius of the sphere in feature space. The system efficiency can be evaluated by the growth rate of its ROC curve: for a given false-alarm rate, the better system is the one with a higher hit probability. The overall performance of a system can be measured by the equal error rate (e.e.r.), representing the best compromise between false alarms and hit rate. Since the results obtained seem hardly affected by changing the regularization parameter $C$ of the SVM in the range 0.1–0.9, we fix $C = 0.2$.
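A sketch of how a ROC curve and the e.e.r. are obtained from the decision values of the one-class SVM by sweeping the acceptance threshold (NumPy, with synthetic scores in place of the real system's outputs):

```python
import numpy as np

# Minimal sketch: building a ROC curve and the equal error rate from
# one-class decision scores by sweeping the acceptance threshold (the
# radius of the sphere in feature space).

def roc_and_eer(pos_scores, neg_scores):
    thresholds = np.sort(np.concatenate([pos_scores, neg_scores]))
    hit = np.array([(pos_scores >= t).mean() for t in thresholds])
    fa = np.array([(neg_scores >= t).mean() for t in thresholds])
    eer_idx = np.argmin(np.abs((1 - hit) - fa))  # miss rate == false alarms
    return fa, hit, 0.5 * ((1 - hit[eer_idx]) + fa[eer_idx])

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 0.5, 660)    # scores of positive test images
neg = rng.normal(-0.5, 0.5, 1360)  # scores of negative test images
fa, hit, eer = roc_and_eer(pos, neg)
print(f"equal error rate: {eer:.3f}")
```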

The ROC curves in Figures 9, 10, and 11 show the performances obtained with various kernels on the identification problem described above. Figure 9 compares the results of polynomial kernels of various degrees: it shows immediately that more complex polynomials do not raise the recognition accuracy. Figure 10 shows the results obtained with Gaussian RBF kernels for different values of $\sigma$; as illustrated by the figure, it is not easy to find a range of $\sigma$ leading to acceptable results. Figure 11 compares the performances of the modified Hausdorff kernel for different 3-D stencils. The curve identified by the triple $n_1 n_2 n_3$ corresponds to a stencil defined over a cuboid of side $(2n_1 + 1) \times (2n_2 + 1) \times (2n_3 + 1)$.

Concluding, we also checked the performance of nearest-neighbor techniques, using the distance measures naturally defined by the various kernels. The obtained results are invariably poor, with an e.e.r. in the range of 40%.


TABLE IV
ACCURACY OF THE CITYSCAPE RETRIEVAL SYSTEM, FOR DIFFERENT KERNELS AND TWO DIFFERENT IMAGE FILTERINGS (FILTER WIDTHS σ₁ AND σ₂). IN THE CASE OF RBF KERNELS ONLY THE PARAMETER VALUES LEADING TO THE BEST RECOGNITION RATES ARE REPORTED.

                       |        filter width σ₁       |        filter width σ₂
Kernel type            | Precision (%) | Recall (%)   | Precision (%) | Recall (%)
histogram intersection | 85.1 ± 1.7    | 90.1 ± 1.7   | 82.2 ± 1.4    | 87.1 ± 3.2
linear                 | 86.0 ± 0.0    | 82.8 ± 1.9   | 83.6 ± 0.3    | 80.2 ± 2.6
2-nd deg polynomial    | 84.2 ± 0.7    | 85.9 ± 4.6   | 83.4 ± 1.0    | 83.8 ± 2.7
4-th deg polynomial    | 82.3 ± 0.3    | 90.0 ± 2.7   | 82.6 ± 1.9    | 84.2 ± 2.1
Gaussian RBF (σ)       | 83.5 ± 2.8    | 79.7 ± 6.7   | 84.0 ± 0.6    | 79.5 ± 2.6
Laplacian RBF (σ, a)   | 84.4 ± 1.4    | 83.0 ± 1.6   | 83.1 ± 1.0    | 82.3 ± 1.2

Fig. 8. Tracking results after initialization with the template shown in the first frame. See text for a description of the tracking procedure.

VII. CONCLUSIONS

In this paper we dealt with the problem of engineering kernels for images. We represented images with binary strings and discussed bitwise manipulations obtained using logical operators and convolution with non-binary stencils. From the theoretical viewpoint, we used our analysis to show that histogram intersection is a Mercer's kernel and to determine the modifications under which a similarity measure based on the notion of Hausdorff distance is a Mercer's kernel. Extensive experimental results support the tenet that choosing an appropriate kernel leads to the construction of more effective trainable systems. Current work on the topic of kernel engineering focuses on extending the presented analysis to binary strings of different length and on studying the relation with string kernels.

ACKNOWLEDGMENT

We thank John Shawe-Taylor, Yoram Singer, and Emanuele Franceschi for useful discussions. This research has been partially funded by the INFM Advanced Research Project MAIA, the FIRB Project ASTA RBAU01877P, and by the EU Project IST-2000-25341 KerMIT.

REFERENCES

[1] V. Vapnik, Statistical Learning Theory. John Wiley and Sons, New York, 1998.

[2] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge Univ. Press, 2000.

[3] M. Pontil and A. Verri, "Support vector machines for 3D object recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 637–646, 1998.


Fig. 9. ROC curves comparing various polynomial kernels (degrees 1 to 6) for the toy duck experiment (see text).

Fig. 10. ROC curves comparing various Gaussian RBF kernels (σ from 3000 to 100000) for the toy duck experiment (see text).

Fig. 11. ROC curves comparing the Hausdorff-based Mercer's kernel for various stencils (triples 111, 311, 333, 433, 444, 555, 655) for the toy duck experiment (see text).

[4] C. Papageorgiou and T. Poggio, "A trainable system for object detection," International Journal of Computer Vision, vol. 38, no. 1, pp. 15–33, 2000.

[5] G. Guodong, S. Li, and C. Kapluk, "Face recognition by support vector machines," in Proc. IEEE International Conference on Automatic Face and Gesture Recognition, 2000, pp. 196–201.

[6] K. Jonsson, J. Matas, J. Kittler, and Y. Li, "Learning support vectors for face verification and recognition," in Proc. IEEE Int. Conf. on Automatic Face and Gesture Recognition, 2000.

[7] B. Heisele, P. Ho, and T. Poggio, "Face recognition with support vector machines: global versus component-based approach," in ICCV, 2001, pp. 688–694.

[8] A. Mohan, C. Papageorgiou, and T. Poggio, "Example-based object detection in images by components," IEEE Trans. Patt. Anal. Mach. Intell., vol. 23, pp. 349–361, 2001.

[9] B. Schölkopf, P. Simard, A. Smola, and V. Vapnik, "Prior knowledge in support vector kernels," in Advances in Neural Inf. Proc. Systems, vol. 10. MIT Press, 1998, pp. 640–646.

[10] O. Chapelle, P. Haffner, and V. Vapnik, "SVMs for histogram-based image classification," IEEE Transactions on Neural Networks, special issue on Support Vectors, 1999.

[11] F. Jing, M. Li, H. Zhang, and B. Zhang, "Support vector machines for region-based image retrieval," in Proc. IEEE International Conference on Multimedia and Expo, 2003.

[12] S. Belongie, C. Fowlkes, F. Chung, and J. Malik, "Spectral partitioning with indefinite kernels using the Nyström extension," in ECCV, part III, Copenhagen, Denmark, May 2002, p. 531 ff.

[13] C. Wallraven, B. Caputo, and A. Graf, "Recognition with local features: the kernel recipe," in Proceedings of the International Conference on Computer Vision, vol. I, 2003, p. 257 ff.

[14] T. Evgeniou, M. Pontil, and T. Poggio, "Regularization networks and support vector machines," Advances in Computational Mathematics, vol. 13, pp. 1–50, 2000.

[15] V. Vapnik, The Nature of Statistical Learning Theory. John Wiley and Sons, New York, 1995.

[16] D. M. J. Tax and R. Duin, "Uniform object generation for optimizing one-class classifiers," Journal of Machine Learning Research, Special Issue on Kernel Methods, no. 2, pp. 155–173, 2002.

[17] R. Courant and D. Hilbert, Methods of Mathematical Physics, Vol. 2. Interscience, London, UK, 1962.

[18] T. Poggio, S. Mukherjee, R. Rifkin, A. Rakhlin, and A. Verri, "b," in Proceedings of the Conference on Uncertainty in Geometric Computations, 2001.

[19] B. K. P. Horn, Robot Vision. MIT Press, 1986.

[20] F. Odone, E. Trucco, and A. Verri, "General purpose matching of grey level arbitrary images," in 4th International Workshop on Visual Forms, ser. LNCS 2059, 2001, pp. 573–582.

[21] M. J. Swain and D. H. Ballard, "Color indexing," IJCV, vol. 7, no. 1, pp. 11–32, 1991.

[22] M. J. Swain, "Interactive indexing into image databases," in Storage and Retrieval for Image and Video Databases (SPIE), 1993, pp. 95–103.

[23] D. Huttenlocher, G. Klanderman, and W. Rucklidge, "Comparing images using the Hausdorff distance," Trans. on PAMI, vol. 15, no. 9, 1993.

[24] A. Barla, F. Odone, and A. Verri, "Hausdorff kernel for 3D object acquisition and detection," in ECCV, 2002.

[25] M. Szummer and R. W. Picard, "Indoor-outdoor image classification," in IEEE Intl. Workshop on Content-based Access of Image and Video Databases, 1998.

[26] A. Barla, F. Odone, and A. Verri, "Old fashioned state-of-the-art image classification," in Proceedings of the International Conference on Image Analysis and Processing, Mantova, Italy, 2003.

[27] A. K. Jain and A. Vailaya, "Image retrieval using color and shape," Pattern Recognition, vol. 29, pp. 1233–1244, August 1996.



Francesca Odone received a Laurea degree in Information Science in 1997 and a PhD in Computer Science in 2002, both from the University of Genova, Genova, Italy. She visited Heriot-Watt University, Edinburgh, UK, in 1997 as a research associate and in 1999 as a visiting PhD student. Since 2002 she has been a research associate of the Istituto Nazionale di Fisica della Materia, working at the Department of Computer and Information Science of the University of Genova. She has published papers on computer vision applied to automation, motion analysis, image matching, image classification, and view-based object recognition. Her present research focuses on statistical learning and its application to computer vision and image understanding problems.


Annalisa Barla received a Laurea degree in Physics in 2001 from the University of Genova, Genova, Italy. Since 2002 she has been a PhD student in Computer Science at the Department of Computer and Information Science of the University of Genova. Her main research interests are connected with statistical learning applied to image understanding problems, such as content-based image retrieval and classification. She has published papers on image representation and kernel engineering for image classification.


Alessandro Verri received both the Laurea and the PhD in Physics from the University of Genoa, Genoa, Italy, in 1984 and 1989, respectively. Since 1989 he has been with the University of Genoa, where he is a professor in the Department of Computer and Information Science. He has been a visiting scientist and professor at MIT, INRIA (Rennes), ICSI (Berkeley), and Heriot-Watt University (Edinburgh). He has published nearly 50 papers on stereopsis, motion analysis in natural and machine vision systems, shape representation and recognition, pattern recognition, and 3-D object recognition. He is co-author of a textbook on computer vision with Dr. E. Trucco (Prentice Hall). Currently, he is interested in the mathematical and computational aspects of computer vision and statistical learning theory, and he is leading a number of applied projects on the development of computer vision-based solutions to industrial problems. He has been and is in the program committees of the major international conferences in the areas of image processing and computer vision, and serves as a referee for several leading journals in the field.