Efficient Similarity Search in Nonmetric Spaces with Local Constant Embedding

Lei Chen, Member, IEEE, and Xiang Lian, Student Member, IEEE

Abstract—Similarity-based search is a key component of many applications such as multimedia retrieval, data mining, and Web search and retrieval. Two important issues arise in similarity search: the design of a distance function to measure similarity, and the improvement of search efficiency. Many distance functions have been proposed that attempt to closely mimic human recognition. Unfortunately, some of these well-designed distance functions do not follow the triangle inequality and are therefore nonmetric. As a consequence, efficient retrieval using these nonmetric distance functions becomes more challenging, since most existing index structures assume that the indexed distance function is metric. In this paper, we address this challenge by proposing an efficient method, local constant embedding (LCE), which divides the data set into disjoint groups so that, by constant shifting, the triangle inequality holds within each group. Furthermore, we design a pivot selection approach for the converted metric distance and create an index structure to improve retrieval efficiency. Moreover, we propose a novel method to answer approximate similarity search in the nonmetric space with guaranteed query accuracy. Extensive experiments show that our method works well on various nonmetric distance functions and improves retrieval efficiency by an order of magnitude over the linear scan and existing retrieval approaches, with no false dismissals.

Index Terms—LCE, nonmetric, similarity, pivot.


1 INTRODUCTION

SIMILARITY-BASED search is widely used in multimedia retrieval, data mining, Web search and retrieval, and a number of other applications. There are two key issues related to similarity-based search: 1) the design of a distance function and 2) the development of efficient search algorithms. The first issue directly affects the accuracy of the retrieval, since the distance function indicates how similar (close) two objects are. The problem is complicated by the fact that similarity is subjective and determined by human observation. Therefore, in most cases, we want to select a distance function that captures the characteristics of the data and also closely mimics similarity as recognized by humans. A number of distance functions satisfying these constraints have been proposed (for example, Dynamic Time Warping (DTW) [4], [25], the partial Hausdorff distance [22], etc.). Unfortunately, some of them do not follow the triangle inequality, a required property of a metric distance function, defined as follows:

Definition 1.1 (Metric Distance Function). Given a distance function dist defined on space D × D, dist is metric if it satisfies the following properties:

1. dist(x, y) > 0,
2. dist(x, y) = 0 ⇔ x = y,
3. dist(x, y) = dist(y, x), and
4. dist(x, z) ≤ dist(x, y) + dist(y, z),

where x, y, and z are any three data objects in D.

In the rest of this paper, we use the term "nonmetric" to refer to distance functions that violate only Property 4, that is, the triangle inequality.

Although there certainly are distance functions that are metric (for example, the L_p-norms [37]), it has been argued that they do not always properly represent the human perception of similarity [19], [23]. As a vivid example [23], many people observe that a centaur is very similar to both a person and a horse. However, they also observe that the difference between the person and the horse is more than twice as great as that between the person (or horse) and the centaur. This indicates that the human judgment of similarity does not follow the triangle inequality and thus cannot be fully captured by metric distance functions. Furthermore, as pointed out by Jacobs et al. [23], it is not poor feature selection or careless design that causes a distance function to violate the triangle inequality; inherently, distance functions that are robust to noisy data usually violate this inequality. Much work in psychology also suggests that human similarity judgments do not follow the triangle inequality [33]. Therefore, these "good" robust yet nonmetric distance functions are often used as the underlying measures for similarity search in many important applications such as time-series matching and image retrieval. For instance, Longest Common Subsequence (LCSS) [5], [13], [34], DTW [4], [25], [26], [38], and Edit Distance on Real sequence (EDR) [11] are nonmetric but robust for retrieving noisy time-series data similar to a query time series in the context of time-series databases.

Furthermore, the nonmetric partial Hausdorff distance [22] and the Dynamic Partial Function (DPF) [19] can be used for retrieving multimedia data, such as images (or videos) similar to a query image (or video), from multimedia databases. In this paper, we consider the similarity search with a nonmetric distance function.

In reality, many nonmetric distance functions are computationally expensive, including DTW [4], [25], [26], [38] and EDR [11], whose computation cost is O(n²), where n is the length of the sequences being compared. In the metric space, most indexing algorithms have successfully applied the triangle inequality to significantly reduce the number of distance computations, which is one of the most important factors for retrieval efficiency during similarity-based search (the other factor being the I/O cost). In the nonmetric space, however, the triangle inequality no longer holds. One possible way to answer similarity queries without false dismissals is to carry out a linear scan of the data set D. Obviously, this is not a scalable solution, in terms of both computation and I/O cost, when the size of the data set (|D|) becomes large. Thus, it is crucial to determine whether we can still apply the triangle inequality to prune the search space and thereby improve the retrieval efficiency.
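To make the O(n²) cost concrete, the following is a minimal dynamic-programming DTW sketch (a textbook-style illustration, not the implementation evaluated in this paper; the pointwise cost abs(s[i] - t[j]) is an assumed choice):

# Minimal O(n^2) dynamic-programming DTW (illustrative sketch only;
# the pointwise cost abs(s[i] - t[j]) is an assumed choice).
def dtw(s, t):
    n, m = len(s), len(t)
    INF = float("inf")
    # dp[i][j]: cost of the best warping path aligning s[:i] with t[:j]
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],      # stretch t
                                  dp[i][j - 1],      # stretch s
                                  dp[i - 1][j - 1])  # advance both
    return dp[n][m]

Filling the n × m table is exactly what makes the cost quadratic in the sequence length.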

Several approaches have been proposed to address this issue, including dynamically embedding queries [3], similarity-preserving modification [31], and constant shift embedding [27]. Some of these methods are based on approximations [3], [31], which trade accuracy for efficiency and thus introduce undesired false dismissals (data objects that satisfy the query constraint but are absent from the query result). Others have low pruning power due to the embedded distance function [27]. Based on these observations, a question that needs to be explored is: "Can we improve the retrieval efficiency of search in nonmetric spaces over large databases?"

In this paper, we address the above question and make the following contributions:

1. We introduce a local constant embedding (LCE) method to assign data objects to different groups such that the triangle inequality holds within each group.

2. We design a grouping method to guarantee that no false dismissals are introduced during search within these groups.

3. We propose a pivot selection technique in nonmetric spaces and construct an index based on the selected pivots.

4. Last, we also present a method to answer approximate similarity search in the nonmetric space with a guaranteed query accuracy.

The rest of this paper is arranged as follows: Section 2 briefly reviews the existing work on searching in nonmetric spaces. Section 3 presents the novel LCE method, which enables similarity search in nonmetric spaces. Section 4 introduces a new pivot selection method and an indexing structure for nonmetric spaces, modified from iDistance [24]. Section 5 discusses approximate similarity search in nonmetric spaces. Section 6 presents extensive experimental results on various nonmetric measures. Finally, Section 7 concludes this paper.

2 RELATED WORK

In this section, we first give a brief review of similarity search in metric spaces and the triangle inequality, which is one of the most important pruning tools in the metric space. Then, we outline several approaches that have been proposed for improving retrieval efficiency in a nonmetric space, where the triangle inequality does not hold. In general, these approaches can be classified into two categories: embedding based and classification based. Fig. 1 summarizes the commonly used symbols in this paper.

2.1 Similarity Search in Metric Spaces

Similarity search in metric spaces [9], [28], [39] has been studied for many years and is widely used in many applications such as information retrieval, Web search, and the like. Its problem definition is given as follows: Assume that we have a large data set D containing N data objects. Given a query object q and a similarity threshold ε, the similarity search in metric spaces retrieves all data objects x in D satisfying the condition dist(q, x) ≤ ε, where the distance function dist(q, x) between objects q and x can be any metric such as the L_p-norms [2], [15], ERP [10], and so on.

One of the most important properties of metric distance functions is the triangle inequality, which is the underlying principle of most indexing algorithms in the metric space. We illustrate how the triangle inequality can be used to prune the search space with a simple example. Assume that we have a small data set containing two data objects x and y (that is, N = 2) and a metric distance function dist that is computationally costly. Given a query object q and a radius ε, we want to find all objects that fall into the range centered at q with radius ε. In the example, the distance between x and y is known, precomputed in advance before the query execution; the pairwise distances are stored in an N × N (that is, 2 × 2) matrix pmatrix. During query processing, we first access object x by computing the distance dist(q, x) between q and x. If dist(q, x) > ε, object x does not belong to the query result. As a second step, when we scan the next object y, we can save the computation cost by using the distances dist(q, x) and dist(x, y) that have already been calculated. In particular, we check whether the absolute distance difference |dist(q, x) − dist(x, y)|, which is a lower bound of dist(q, y) by the triangle inequality, is greater than ε. If the answer is positive (that is, dist(q, y) ≥ |dist(q, x) − dist(x, y)| > ε), object y can be safely pruned, and the expensive distance computation dist(q, y) between q and y can be avoided.


Fig. 1. Meanings of notations.

Therefore, by utilizing the triangle inequality, we can reduce the number of costly distance computations for range queries. Although the example illustrated here involves a range query, the solution can easily be extended to handle k-nearest-neighbor (kNN) queries by applying the approach used in [20], [30]. Specifically, we can first issue a range query centered at the query point with an initial radius ε′. If the number of retrieved objects is below k, we increase the search radius to ε″ (> ε′) and issue another range query. Repeating this process, we can obtain the final kNN results.
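The pruning idea in this example can be sketched as follows (our illustration, assuming a metric dist and the precomputed matrix pmatrix; this is not the paper's pseudocode):

# Range query with triangle-inequality pruning (illustrative sketch).
# pmatrix[i][j] holds the precomputed distance between data[i] and data[j].
def range_query(q, data, dist, pmatrix, eps):
    result, computed = [], {}          # computed: index -> dist(q, data[index])
    for i, x in enumerate(data):
        # best available lower bound on dist(q, x) via already-computed objects
        lb = max((abs(d_qj - pmatrix[j][i]) for j, d_qj in computed.items()),
                 default=0.0)
        if lb > eps:
            continue                   # pruned: dist(q, x) >= lb > eps
        d = dist(q, x)                 # expensive computation, only when needed
        computed[i] = d
        if d <= eps:
            result.append(x)
    return result

A kNN query can reuse range_query by reissuing it with a growing radius until at least k objects are returned.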

In nonmetric spaces, the triangle inequality does not hold; thus, performing efficient similarity search is a very challenging problem. In the sequel, we present the two categories of solutions that have been proposed for improving retrieval efficiency in nonmetric spaces: embedding based and classification based.

2.2 Embedding Methods

With embedding methods, we usually embed a nonmetric distance function into a metric one and apply a metric index to improve the retrieval efficiency. Within this category, we can further distinguish approximate embedding from exact embedding. Approximate embedding refers to methods that may introduce false dismissals, whereas exact methods do not. Athitsos et al. [3] proposed an approximate embedding approach that maps data objects onto a vector space. Specifically, they introduce a "query-sensitive" distance function together with the embedding method, motivated by the fact that different embedding dimensions may be important for different query objects. Therefore, in the embedded vector space, in order to enhance the retrieval accuracy, they construct an embedding (with a classifier trained by the AdaBoost algorithm in [29]) such that the weights of the L1 distance can be adjusted automatically according to the query object. Recently, Skopal [31] presented an embedding method, TriGen (TG), that converts a nonmetric distance function into a metric one by a class of metric-preserving and similarity-preserving modifiers. In particular, TG tries various distance modifiers on samples obtained from the data set and chooses the best modifier that statistically guarantees the metric-preserving ratio of the data set. As mentioned by the author, however, some nonmetric measures are very hard to convert for efficient similarity search using this approach. Many other approximate embedding techniques have been proposed to map an arbitrary space to either a Euclidean or a pseudo-Euclidean space, such as Lipschitz embedding [20], FastMap [14], MetricMap [36], and SparseMap [21]. Even though these methods try to preserve a large amount of the proximity structure of the original space, false dismissals are unavoidable.

The only exact embedding method proposed so far is constant shift embedding (CSE) [27], which converts a distance function that does not follow the triangle inequality into one that does, assuming that query objects are contained in the data set. The idea is as follows: Given a distance function dist that is defined on data space D × D and does not follow the triangle inequality, there exist three data objects x, y, z ∈ D such that dist(x, y) + dist(y, z) < dist(x, z). We can convert dist into another distance function dist′ by adding a positive value c to each distance value calculated by dist, that is, dist′(x, y) = dist(x, y) + c. If c is large enough, we have dist′(x, y) + dist′(y, z) ≥ dist′(x, z) (equivalent to dist(x, y) + dist(y, z) + c ≥ dist(x, z)). Thus, the triangle inequality holds for dist′.
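Spelling out the equivalence, and rearranging it into the lower bound discussed next (added here for clarity):

dist′(x, y) + dist′(y, z) ≥ dist′(x, z)
⇔ (dist(x, y) + c) + (dist(y, z) + c) ≥ dist(x, z) + c
⇔ dist(x, y) + dist(y, z) + c ≥ dist(x, z),

so in the converted space, dist(x, y) is lower bounded by dist(x, z) − dist(y, z) − c.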

In [27], the value of c is set to the minimum eigenvalue of the pairwise distance matrix. However, an analysis of the converted pairwise distance matrix shows that this minimum eigenvalue is quite large, rendering pruning by the triangle inequality essentially useless [11]. The reason is that the lower bound of dist(x, y), namely dist(x, z) − dist(y, z) − c, becomes too small to prune any data object. Reducing the value of c may increase the pruning ability but may also cause some distances to violate the triangle inequality and introduce false dismissals.

2.3 Classification-Based Method

Classification-based methods transform similarity-based search in a nonmetric space into a classification problem. Basically, these methods proceed in three steps: 1) the data set is clustered into a set of classes using clustering algorithms on the nonmetric distance function; 2) for each class, a set of representative objects is selected to represent the class (Goh et al. [19] use a set of atypical objects, that is, objects that differ from the other objects in the class, whereas Jacobs et al. [23] use correlated objects); and 3) the similarity search is performed by classifying the query object into a class. The classification-based approach is still an approximate search in the nonmetric space, since it introduces false dismissals.

The approach that we propose in this paper belongs to the first category, that is, exact embedding, and assumes that query objects follow the data distribution. Compared to approximate solutions for handling search in nonmetric spaces, our approach does not introduce false dismissals and is therefore more accurate. Different from the CSE method [27], which uses a global shifting value c, we apply different c values to different groups of objects, which significantly increases the pruning power.

3 LOCAL CONSTANT EMBEDDING

In order to address the problem of similarity search in the nonmetric space, we propose the LCE method. Specifically, similar to the assumption made in [27], we consider a static data set, where all the data reside on disk, and moreover, we assume that query objects follow the same pairwise distance distribution as the data objects in the data set D. Later in the paper, we discuss the cases of dynamic data sets and of query objects that do not follow the same distribution as the data objects. The intuition behind LCE is to divide the data set into several disjoint groups and assign a different group shifting value c to each group so as to make the data objects within each group follow the triangle inequality. The pruning power of each group is thereby increased, since we use smaller (local) group shifting values instead of the one large global shifting value of CSE. However, there are two important yet nontrivial grouping issues in LCE that need to be addressed.


First, we have to decide the number of data objects assigned to each group in order to achieve high pruning power. For example, for groups with large group shifting values, the pruning power is low, so we should not assign many data objects to them. Moreover, we also need to consider the indexing implications of different group sizes; for example, a large group size may result in high query processing cost. The second grouping issue is how to distribute data objects into groups. Note that the grouping method is not a simple division of the data set into several partitions. Instead, we need to consider the relationships of data objects within and among groups such that no false dismissals are introduced for queries.

We summarize the LCE method in two steps as follows:

1. Compute the pairwise distance matrix pmatrix for the data set D. Then, obtain the minimum constant shifting value c for each distinct triplet in D. Let c_min(D) and c_max(D) be the minimum and maximum values of c over the whole data set, respectively.

2. Divide the N data objects into m groups G_1, G_2, ..., G_m of approximately the same size. Let the maximum c value in each group G_i be c_max(G_i), where c_max(G_i) ≤ c_max(G_j) for i < j.

3.1 Finding c_min(D) and c_max(D)

The first step of LCE is to retrieve the c values of all distinct triplets that do not follow the triangle inequality. We name these triplets nontriangle triplets. The following lemma states that for any nontriangle triplet, we can find one constant c that makes it follow the triangle inequality.

Lemma 3.1. Given three data objects x, y, and z and their pairwise distances dist(x, y), dist(y, z), and dist(x, z), where the distance function dist(·, ·) is nonmetric, there is at most one inequality over these distances that does not follow the triangle inequality.

Based on Lemma 3.1 (a violated inequality must place the largest of the three distances alone on one side, and only one distance can be the strict maximum), we can obtain the minimum constant shifting value c for each triplet by computing the difference between the largest distance and the sum of the other two distances, as given in the following corollary.

Corollary 3.1. Given three data objects x, y, and z and their pairwise distances dist(x, y), dist(y, z), and dist(x, z), if

dist(x, y) ≤ dist(y, z) ≤ dist(x, z) and
dist(x, y) + dist(y, z) < dist(x, z),

then c = dist(x, z) − (dist(x, y) + dist(y, z)).

The following corollary claims that if a constant c can be used to convert a nontriangle triplet into a triplet following the triangle inequality, then any constant c′ larger than c is able to do the conversion as well.

Corollary 3.2. Given a nontriangle triplet, if a constant c can convert the distances in the triplet into ones following the triangle inequality, then any constant c′ ≥ c can also be used to convert the distances into ones following the triangle inequality.

Now, given a distance function dist and the N × N pairwise distance matrix pmatrix for N data objects, we can compute the constant shifting values c of all distinct nontriangle triplets, as claimed in Corollary 3.1, by examining every three data objects in pmatrix. Furthermore, we select the minimum and maximum of these c values as c_min(D) and c_max(D), respectively.
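A direct O(N³) sketch of this step follows (our illustration of Corollary 3.1 applied to every triplet in pmatrix; not the paper's pseudocode):

import itertools

# Minimum shifting value c of every nontriangle triplet, returning
# (c_min(D), c_max(D)); O(N^3) over the N x N matrix pmatrix.
def shifting_value_range(pmatrix):
    n = len(pmatrix)
    cs = []
    for i, j, k in itertools.combinations(range(n), 3):
        d1, d2, d3 = sorted((pmatrix[i][j], pmatrix[j][k], pmatrix[i][k]))
        if d1 + d2 < d3:               # only the largest distance can violate
            cs.append(d3 - (d1 + d2))  # minimum c that repairs this triplet
    return (min(cs), max(cs)) if cs else (0.0, 0.0)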

3.2 Dividing Triplets into Groups

In the second step of LCE, we want to divide the N data objects into m disjoint groups, to each of which we assign its own group shifting value. Within each group, the original nonmetric distance function dist can therefore be converted into a metric one, dist′, using the group shifting value. Given any query object, we can consider each group separately and apply the triangle inequality in the converted space to prune data objects within the group.

Before we present our algorithm for dividing the N data objects into m groups, we show the rationale for choosing the group sizes through simple experiments. Specifically, we took two data sets, stock and image histogram, and computed the pairwise DTW and partial Hausdorff distances of their data objects, respectively. Then, we divided each data set into disjoint groups such that their group shifting values are 1, 2, 3, and so on (an arithmetic series). The resulting group sizes with respect to the group shifting values approximately follow a Zipf distribution. That is, for small group shifting values, the group size is very large, whereas for large values, the group size is small. In fact, when the group shifting value is too large, there is little chance of pruning false positives (data objects that do not qualify for the query constraints). Therefore, in order to achieve high pruning power, we want a finer division over c for small group shifting values and a coarser division over c intervals with larger values. Motivated by this, our grouping algorithm aims at dividing the N data objects into m groups G_1, G_2, ..., G_m of approximately the same size: since a large number of data objects have small group shifting values, given a fixed number of groups, dividing the data set into groups of equal size favors data objects with small shifting values. With this division strategy, we assign each group G_i the group shifting value c_max(G_i), which is the maximum c value over all triplets in G_i. For any two groups G_i and G_j with i < j, we have c_max(G_i) ≤ c_max(G_j).

In order to avoid false dismissals, the resulting groups must satisfy the following property:

Grouping Property. For any two data objects x, y ∈ G_i and any object z ∈ G_j (G_i and G_j not necessarily distinct), the c value of the triplet ⟨x, y, z⟩ must obey c ≤ c_max(G_i), where 1 ≤ i, j ≤ m.

If the above grouping property does not hold within the same group (i = j) or between groups (i ≠ j), a retrieval method using the converted triangle inequality may introduce false dismissals. When objects x, y, and z are in the same group G_i and their minimum c is greater than c_max(G_i), it is clear that using c_max(G_i) for pruning introduces false dismissals. We explore the case where the triplet spans two different groups using a simple example with a query object q ∈ G_j and two data objects x, y ∈ G_i. Let the minimum c value of the triplet ⟨x, y, q⟩ satisfy c > c_max(G_i).


When we use the lower bound of dist(q, y), |dist(q, x) − dist(x, y)| − c_max(G_i), to filter, we may introduce false dismissals. This is because the real lower bound of dist(q, y) is |dist(q, x) − dist(x, y)| − c, which is less than |dist(q, x) − dist(x, y)| − c_max(G_i).

Fig. 2 gives the general framework of our grouping approach. Initially (step 1 in Fig. 2), we set the maximum c value c_max(G_1) of the first group G_1 to c_min(D) + (c_max(D) − c_min(D))/m and then perform the grouping algorithm so that the whole data set is divided into two groups, G_1 and G_1^res. Obviously, G_1 and G_1^res should satisfy the grouping property. That is, for any data object x in G_1^res and any data objects y and z in G_1, the c value for the triplet ⟨x, y, z⟩ must be no larger than c_max(G_1). If the cardinality of G_1 is approximately N/m (equal size for each group), then we go on to obtain the second group; otherwise, c_max(G_1) is adjusted so that the resulting group size is N/m.

As a second step (step 2 in Fig. 2), we intend to obtain a second group G_2. Similarly, we can choose c_max(G_2) to be c_max(G_1) + (c_max(D) − c_min(D))/m and partition G_1^res into two parts: G_2 and G_2^res. Here, the grouping condition for G_2 differs slightly from that for G_1: for any data object x (x′) in G_2^res (G_1) and any two objects y and z in G_2, the c value for the triplet ⟨x, y, z⟩ (⟨x′, y, z⟩) should not be greater than c_max(G_2). In case the c value of ⟨x, y, z⟩ (⟨x′, y, z⟩) violates this condition, we must remove either y or z from G_2 to G_2^res to break the violation. If the size of group G_2 is about N/m, we continue finding the next group; otherwise, c_max(G_2) is varied so that the group size meets our requirement. We do not need to consider the case in which two data objects are in G_1 and one data object is in G_2, since this case has already been addressed when we grouped G_1 and G_1^res (G_2 ⊆ G_1^res). This procedure repeats until the (m − 1)th step, where the residual partition G_{m−1}^res is taken as the last group G_m.

The detailed steps of our grouping method are given in Fig. 3. First, we initialize a label array label_arr of size N with values of −1, which stores the label indicating to which group each object belongs (lines 1 and 2). Since we want the resulting m groups to be of nearly equal size, we divide the whole data set incrementally into m groups by adjusting the bounding value c_max(G_i). In particular, we first scan the precomputed distance matrix pmatrix and obtain the minimum and maximum c values, c_min(D) and c_max(D), respectively, over all triplets in the data set (line 3). Then, we obtain one group at a time. Specifically, for group G_i, we assign to one group those objects whose c values fall into the interval (c_max(G_{i−1}), c_max(G_i)], where c_max(G_i) is initially set to c_max(G_{i−1}) + (c_max(D) − c_min(D))/m and c_max(G_0) = c_min(D), as accomplished in procedure AssignGroupLabel (line 6). Let BS be the set containing the c_max(G_j) values, where 0 ≤ j ≤ i. Moreover, if the size of the resulting group is not around N/m, we adjust c_max(G_i) for this group and relabel the group with the new c_max(G_i) (line 7). That is, if the resulting group size is smaller than N/m, we increase c_max(G_i) by a step value δ ∈ (0, (c_max(D) − c_min(D))/m]. If the increased c_max(G_i) results in a group size much larger than N/m, we use a smaller step value δ/2 instead. The regrouping iteration stops once an appropriate group size (around N/m) is obtained. Finally, we output each group G_i and its c_max(G_i) (line 8).
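Since Fig. 3 itself is not reproduced here, its control flow can be sketched roughly as follows (our reconstruction under stated assumptions: assign_group_label behaves as described below, and the tolerance on the group size and the bound on adjustment iterations are our choices):

# Rough outline of NonMetricGrouping (a reconstruction of Fig. 3, not the
# authors' code). assign_group_label(label_arr, i, lo, hi) is assumed to
# clear old i-labels, label group i from triplets with c in (lo, hi],
# repair grouping-property violations, and return the size of group i.
def nonmetric_grouping(pmatrix, m, assign_group_label):
    N = len(pmatrix)
    label_arr = [-1] * N                              # lines 1-2: all unlabeled
    c_min_D, c_max_D = shifting_value_range(pmatrix)  # line 3 (sketch above)
    step = (c_max_D - c_min_D) / m
    bounds = [c_min_D]                                # c_max(G_0) = c_min(D)
    for i in range(1, m):                             # groups G_1 .. G_{m-1}
        hi = bounds[-1] + step
        size = assign_group_label(label_arr, i, bounds[-1], hi)
        delta = step
        for _ in range(32):                           # line 7: adjust c_max(G_i)
            if abs(size - N / m) <= 0.1 * N / m:      # "around N/m" (assumed tolerance)
                break
            delta /= 2.0                              # shrink the step on overshoot
            hi += delta if size < N / m else -delta
            size = assign_group_label(label_arr, i, bounds[-1], hi)
        bounds.append(hi)
    for p in range(N):                                # residual objects form G_m
        if label_arr[p] == -1:
            label_arr[p] = m
    return label_arr, bounds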

Fig. 2. Framework of grouping objects.

Fig. 3. Nonmetric grouping.

Fig. 4. Assign group label.

In the following, we discuss in detail how the procedure AssignGroupLabel, shown in Fig. 4, assigns labels to data objects and removes possibly violating triplets. Assume that we now plan to label the ith group, whereas the previous (i − 1) groups have already been obtained. In the label array label_arr, some entries are labeled with integers from 1 to (i − 1), whereas the others are unlabeled with values of −1. As illustrated in Fig. 4 (line 1), we enumerate all triplets T that have c values within the interval (c_max(G_{i−1}), c_max(G_i)]. Then, for each triplet ⟨x, y, z⟩ in T, we set the label of each object p in the triplet to i if the object has not been assigned any group identifier before (that is, its label is −1; lines 2-4). After assigning the label i, only three cases (shown in Fig. 5) need to be further investigated for possible violations of the grouping property due to the new label:

• Case 1. For any triplet ⟨x, y(y′), z(z′)⟩, only one object (say, x) in the triplet is marked with the label i, whereas the other two, y and z, are assigned to one group from 1 to (i − 1), or y′ and z′ are assigned to two different groups from 1 to (i − 1), as shown in Fig. 5a. After separating the data set D into groups, the search is carried out in each group. Thus, we do not need to investigate the case where y′ and z′ are located in different groups, since the search will never be conducted on two different groups at the same time. Case 1 therefore refers only to the case where y and z are located in the same group from 1 to (i − 1).

• Case 2. For any triplet ⟨x, y, z⟩, only two objects in the triplet (for example, x and y) are marked with i, whereas the remaining one (for example, z) is in a previous group, from 1 to (i − 1), as shown in Fig. 5b.

• Case 3. For any triplet ⟨x, y, z⟩, all three objects x, y, and z are assigned to group i, as shown in Fig. 5c.

The above three cases may cause the grouping property to be violated. In Case 1, as depicted in Fig. 6a, object x is marked as group G_i, and both objects y and z are in another group, G_{i−1} (note that y and z need only be in the same group G_j, where j < i, not necessarily in G_{i−1}). If the minimum c value of the triplet ⟨x, y, z⟩ is greater than c_max(G_{i−1}) (the group of y and z has the c bound c_max(G_{i−1})), which violates the grouping property, we randomly select either object y or object z and relabel it to i to break the violation, as shown in Fig. 6b. The detailed steps for handling Case 1 are given in lines 2-6 of procedure AssignGroupLabel. The rationale for handling Case 1 is that if x is the query object and y (or z) is the pivot, then the c value of this triplet (in the example, > c_max(G_{i−1})) exceeds the maximum c of the group where object y (or z) resides. Thus, we randomly choose object y or object z and set its label to G_i. The resulting triplet has one object labeled G_{i−1} and the other two labeled G_i, which corresponds to Case 2.

In Case 2, as shown in Fig. 7a, objects x and y in a triplet ⟨x, y, z⟩ are labeled i, and the third one, z, is in a group other than G_i (such as G_{i−1}). If the c value of the triplet ⟨x, y, z⟩ is greater than c_max(G_i), the grouping property does not hold. We randomly remove either object x or object y from G_i and unlabel it (for example, y) for future groups, as shown in Fig. 7b. In procedure AssignGroupLabel, lines 9 and 10 handle Case 2.

In Case 3, as shown in Fig. 8a, all objects in a triplet ⟨x, y, z⟩ are labeled i, but the triplet has a c value greater than c_max(G_i), which means that the triplet does not belong to group G_i. In this scenario, we randomly choose one object (for example, x) from the triplet and unlabel it to break the violation, as shown in Fig. 8b. The detailed steps are listed in lines 11 and 12 of procedure AssignGroupLabel.

Note that although we need to enumerate triplets in lines 1 and 7 of procedure AssignGroupLabel, whose computational complexity is O(N³) in the worst case (the total cost of grouping is thus O(mN³)), this computation can be performed offline.

Fig. 5. Three cases of assigning group labels. (a) Case 1. (b) Case 2. (c) Case 3.

Fig. 6. Handling Case 1 of assigning group labels. (a) Case 1, where the c value of ⟨x, y, z⟩ is greater than c_max(G_{i−1}). (b) Handling Case 1.

Fig. 7. Handling Case 2 of assigning group labels. (a) Case 2, where the c value of ⟨x, y, z⟩ is greater than c_max(G_i). (b) Handling Case 2.

Fig. 8. Handling Case 3 of assigning group labels. (a) Case 3, where the c value of ⟨x, y, z⟩ is greater than c_max(G_i). (b) Handling Case 3.

Speeding up the Grouping. Since the time complexity of the grouping procedure is cubic, for a large data size N we propose an alternative that speeds up the grouping at the cost of some query efficiency. Because the major grouping cost comes from N, the basic idea of our method is to reduce N. In particular, we apply the clustering algorithm of [18] to the data set and obtain N′ superobjects, each representing a set of objects, where N′ < N. Note that although the clustering algorithm [18] is designed for objects in a metric space, it relies only on the pairwise distances among data objects, so we can apply it in the nonmetric space as well. Therefore, after the clustering, we obtain a set of N′ superobjects, over which we perform the modified grouping algorithm.

Our speedup method performs the grouping procedure over superobjects, trading query efficiency for grouping efficiency. Specifically, since each superobject x_S represents a set of data objects, we can compute the range of shifting values [c_min(x_S), c_max(x_S)] over all triplets (of data objects) within superobject x_S, whose computation cost is only O((N/N′)³). Furthermore, for each triplet of superobjects ⟨x_S, y_S, z_S⟩, the range of shifting values for ⟨x, y, z⟩, where x ∈ x_S, y ∈ y_S, and z ∈ z_S, can be calculated by the following lemma.

Lemma 3.2. Let dist_LB(x_S, y_S) (dist_UB(x_S, y_S)) be the minimum (maximum) possible distance between any two data objects from superobjects x_S and y_S, respectively. Given a triplet of superobjects ⟨x_S, y_S, z_S⟩, for any objects x ∈ x_S, y ∈ y_S, and z ∈ z_S, the shifting value for the triplet ⟨x, y, z⟩ is bounded by the interval [c_min(x_S, y_S, z_S), c_max(x_S, y_S, z_S)], where

c_min(x_S, y_S, z_S) = min_{u_S, v_S, w_S ∈ {x_S, y_S, z_S}} (dist_LB(u_S, v_S) − (dist_UB(u_S, w_S) + dist_UB(v_S, w_S)))

and

c_max(x_S, y_S, z_S) = max_{u_S, v_S, w_S ∈ {x_S, y_S, z_S}} (dist_UB(u_S, v_S) − (dist_LB(u_S, w_S) + dist_LB(v_S, w_S))).
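Assuming dist_lb and dist_ub lookups between superobjects (for example, backed by the matrices pmatrix_LB and pmatrix_UB introduced below), the Lemma 3.2 bounds can be evaluated directly (a minimal sketch):

import itertools

# Interval [c_min, c_max] of shifting values for all object triplets drawn
# from superobjects xs, ys, zs (Lemma 3.2). dist_lb/dist_ub return the
# minimum/maximum possible distance between two superobjects.
def superobject_c_bounds(xs, ys, zs, dist_lb, dist_ub):
    sups = (xs, ys, zs)
    c_min = min(dist_lb(u, v) - (dist_ub(u, w) + dist_ub(v, w))
                for u, v, w in itertools.permutations(sups))
    c_max = max(dist_ub(u, v) - (dist_lb(u, w) + dist_lb(v, w))
                for u, v, w in itertools.permutations(sups))
    return c_min, c_max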

We modify procedure AssignGroupLabel (in Fig. 4) to handle the case of superobjects; the result is procedure AssignGroupLabel_Speedup in Fig. 9, in which the underlined parts are the differences from the previous version over normal data objects. The procedure takes as input the N′ × N′ lower (upper) bound distance matrix pmatrix_LB (pmatrix_UB) for superobjects, recording dist_LB(x_S, y_S) (dist_UB(x_S, y_S)) for any two superobjects x_S and y_S, which requires much smaller space than the original N × N matrix. These matrices are used to compute the range of shifting values for triplets of superobjects, as given in Lemma 3.2. The procedure AssignGroupLabel_Speedup enumerates all triplets of superobjects ⟨x_S, y_S, z_S⟩ whose upper bounds of shifting values c_max(x_S, y_S, z_S) and c_max(u_S) (for all the unlabeled superobjects u_S ∈ {x_S, y_S, z_S}) fall into the interval (c_max(G_{i−1}), c_max(G_i)] (line 1). By doing so, we obtain a set of candidates to be labeled as group G_i.

As a second step, we perform the validation phase, checking for any violation of the grouping property among triplets of superobjects. Similar to procedure AssignGroupLabel, for any triplet ⟨x_S, y_S, z_S⟩, we have three cases, as shown in Fig. 5. Specifically, the first case corresponds to the one where two superobjects y_S and z_S are in the same group G_j (for j ∈ [1, i)) and the other one, x_S, is in group G_i. If the upper bound c_max(x_S, y_S, z_S) is greater than the group shifting value c_max(G_j), then we randomly select a superobject, either y_S or z_S, and label it as the ith group to break the violation of the grouping property. For the second case, assume that two superobjects x_S and y_S are in G_i and another one, z_S, is in G_j (for j ∈ [1, i)). If it holds that c_max(x_S, y_S, z_S) is greater than c_max(G_i), we randomly select a superobject, either x_S or y_S, and unlabel it. Finally, for the third case, where x_S, y_S, and z_S are all in G_i and c_max(x_S, y_S, z_S) > c_max(G_i), we simply unlabel a random superobject among them. Note that lines 5 and 8 in Fig. 9 exactly correspond to the validation of the grouping property for these three cases. Other parts of procedure AssignGroupLabel_Speedup are analogous to AssignGroupLabel.

As a result, we can divide the data objects into groups with different labels. Since, in the grouping procedure, we check all triplets of superobjects for possible violations of the grouping property and break any violations found, the resulting groups guarantee that no false dismissals are introduced if we use their own group shifting values to perform the similarity search. Furthermore, due to the use of superobjects, the total time complexity of the grouping over superobjects is only O(mN′³), which is much (that is, (N/N′)³ times) smaller than the O(mN³) over data objects directly, for N′ < N. Note that in procedure AssignGroupLabel_Speedup, we compute the maximum possible c values for triplets of superobjects ⟨x_S, y_S, z_S⟩ instead of the exact c values for triplets of data objects. Thus, intuitively, a violation of the grouping property is more likely to occur in line 5 or line 8 of procedure AssignGroupLabel_Speedup, which may result in higher group shifting values and a lower ability to prune objects.

Fig. 9. Assign group label over superobjects.

Discussions on the Dynamic Data Set. We now relax the assumption of static data sets and discuss how to handle the grouping of dynamic data sets. In particular, we first consider the insertion of a new data object x, as illustrated in procedure Insertion in Fig. 10. Assume that we have already obtained m groups over the existing data set D. Our goal is to decide into which group the new object x should be placed. In order to achieve high pruning power, we want to add x to a group with a group shifting value as small as possible. Thus, we start from G_1, followed by G_2, G_3, and so on. Specifically, for each group G_i (line 1), we first check the grouping property: for any objects y, z ∈ G_i, the c value of the triplet ⟨x, y, z⟩ must never be greater than c_max(G_i) (assuming that x is the query object, as in lines 2 and 3). Here, the time complexity of picking any two data objects in a group is O(|G_i|²). If the property does not hold (lines 4 and 5), we randomly select y (or z) from G_i and move it to a group G_j such that its c value is within (c_max(G_j), c_max(G_{j+1})] for j > i. Then, the grouping property is verified for G_j due to the newly included y (or z). Specifically, for any objects p_1 ∈ G_j and p_2 ∈ G_k (k > i), if the triplet ⟨p_1, p_2, y⟩ has a shifting value greater than c_max(G_j), we move y to the next group G_{j+1} and repeatedly check the grouping property for G_{j+1} (lines 5-7). After all violations of the grouping property have been handled, we consider the triplet ⟨x, y, z⟩ for all y ∈ G_i and z ∈ G_j (j ≠ i) (assuming that x is in G_i and z is the query object). If the c value of the triplet ⟨x, y, z⟩ is greater than c_max(G_i), then x cannot be included in G_i, and we have to consider the next group G_{i+1} (lines 8-11). Only when all such triplets satisfy the grouping property can we add x to group G_i and terminate the procedure (lines 12 and 13). Note that although procedure Insertion can incrementally perform the grouping under insertion updates, the resulting group sizes cannot be guaranteed. When the group sizes become skewed toward groups with large group shifting values (that is, low pruning ability for queries), procedure NonMetricGrouping (in Fig. 3) is invoked to regroup the data set.

Finally, the deletion of a data object from the data set D is easy: we simply remove the object from the group in which it resides, which cannot cause any violation of the grouping property.

3.3 Query Processing

With LCE, we can significantly improve search performance by converting the pairwise distances from nonmetric to metric and using the triangle inequality to prune objects via pivots. In particular, for each group G_i, we can select pivots p using any existing pivot selection technique for the metric space, such as random selection [6], HF selection [17], and so on. Fig. 11 illustrates the procedure of query processing with pivots. When a range query q with a radius ε arrives, the query processing procedure QueryProcessing issues m range queries, one on each group G_i (lines 2 and 3).

Then, after obtaining all m query results rslt, each candidate object x in rslt is checked by computing its real distance dist(q, x) from the query object q (lines 5 and 6) and is removed from rslt if it does not fall into the range. More specifically, in lines 2 and 3 of QueryProcessing, the procedure PivotRangeSearch (Fig. 12) is invoked, which applies a linear scan using the pivots p within group G_i. Specifically, for each group G_i, instead of computing the real distance between q and every data object x as in the linear scan, we can now use the pivots in group G_i to do the pruning first. That is, if the lower bound of the distance dist(q, x) from a query object q to object x, |dist(q, p_j) − dist(p_j, x)| − c_max(G_i), is greater than ε for at least one pivot p_j, then object x can be safely pruned (lines 3-5 of procedure PivotRangeSearch). Note that here, all distances within G_i are the converted distances, taking into account the group c value c_max(G_i). Compared to the pure linear scan, pruning with pivots significantly reduces the computation cost in that many real distance calculations are saved. On the other hand, the page accesses of this method are not reduced, since we still need to scan every object on the pages sequentially.
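The filtering step of PivotRangeSearch can be sketched as follows (our illustration; pivot_dists[j][idx] is assumed to hold the precomputed distance from the jth pivot to the idxth object of the group):

# Pivot-based filtering within one group G_i (sketch of the idea behind
# PivotRangeSearch); returns the candidates that survive every pivot test.
def pivot_range_search(q, group, pivots, dist, pivot_dists, c_max_gi, eps):
    d_q_p = [dist(q, p) for p in pivots]     # one real computation per pivot
    candidates = []
    for idx, x in enumerate(group):
        # converted-space lower bound on dist(q, x), maximized over pivots
        lb = max(abs(d_q_p[j] - pivot_dists[j][idx]) - c_max_gi
                 for j in range(len(pivots)))
        if lb <= eps:                        # cannot be pruned safely: keep x
            candidates.append(x)
    return candidates                        # refined later by real dist(q, x)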

In this work, we assume that the pairwise distance between a query object and a data object in the database follows the same distribution as the pairwise distances of data objects in the data set. Since all possible cases that may violate the grouping property have been handled in procedure NonMetricGrouping, and queries are searched within each group, the following theorem holds.

Theorem 3.1. Given a data set D and a nonmetric distance function dist, procedure NonMetricGrouping assigns objects to m groups. No false dismissals will be introduced if we convert range queries range(q) over D to searches over each group G_i (1 ≤ i ≤ m).

Proof. As illustrated in Fig. 5, there are in total three possible cases in procedure NonMetricGrouping when we assign group labels to data objects. Furthermore, in each case, we check all triplets and adjust the labels of those objects in triplets that violate the grouping property. As a result, the groups obtained by invoking procedure NonMetricGrouping obey the grouping property; that is, within each group and among the groups, the triangle inequality holds in the converted space. Moreover, based on the assumption that the distance between a query object and an object in the data set follows the same distribution as the pairwise distances in the data set, the same three cases apply to the query object and have been handled by procedure NonMetricGrouping. Thus, for each range query over each group (as in PivotRangeSearch), no false dismissals will be introduced, and hence, the entire query procedure QueryProcessing does not incur false dismissals. □

Fig. 10. Assign group labels upon insertion.

Fig. 11. Query processing.

Fig. 12. Pivot range search.

4 INDEXING OVER GROUPS

In order to achieve lower I/O cost than the linear scan, we make a further improvement by creating a metric indexing structure for each group under dist′, such as the M-tree [12], MVP-tree [7], iDistance [24], and so on. Whenever a range query q arrives, we issue m range queries on the m individual indices built for the groups and answer the query by returning the union of the m results. Obviously, accessing a tree-like structure is more efficient than a sequential scan in terms of I/O cost. However, the I/O cost is still high in the sense that we have to issue m range queries on m trees, each descending from the root to a leaf. Furthermore, specific to our problem, each group G_i has its own embedding value c_max(G_i). This motivates our adoption of a variant of the iDistance indexing method to speed up query processing.

Whether we use a (pivot-based) linear scan or an index search, the selection of pivots directly affects the performance of these methods. Therefore, in the sequel, we first discuss pivot selection in the nonmetric space, followed by the indexing technique with the selected pivots.

4.1 Pivot Selection

Specifically, for indexing over groups, we first need to consider how to select pivots in each group G_i. Recall that in the metric space, a pivot p is considered good if it has high pruning power [8]. Specifically, given a query object q, for any data object x in the data set, the absolute difference between the distances dist(q, p) and dist(p, x) should be as large as possible, where dist is a metric distance function. The rationale is that if |dist(q, p) − dist(p, x)| is large enough, then by the triangle inequality dist(q, x) ≥ |dist(q, p) − dist(p, x)|, the possibility of pruning object x is high. In nonmetric spaces, however, since the triangle inequality no longer holds, no previous work has used pivots to support range queries. In our work, we propose a novel pivot selection method for nonmetric spaces that takes the embedding values into account. In particular, given a query object q and any data object x in D, an object p is selected as a pivot if |dist(q, p) − dist(p, x)| − c_max(p) is large, where c_max(p) is defined as the maximum c value over all triplets ⟨q, p, x⟩ containing p. The rationale is that when we convert the nonmetric space into a metric one by adding c to the nonmetric distances between objects, it holds that dist(q, x) ≥ |dist(q, p) − dist(p, x)| − c_max(p). Therefore, a greater value of |dist(q, p) − dist(p, x)| − c_max(p) indicates higher pruning power.

We have already obtained the m groups of data. Now, we aim at selecting pivots p within each group G_i so that the pruning power for queries is high. We illustrate the selection of the first pivot in Fig. 13. Since each object in G_i is a possible candidate pivot for G_i, we consider them one by one. In particular, for each pivot candidate p ∈ G_i, we enumerate all triplets containing p of the form ⟨q, p, x⟩, where q is any object in D and x is any object in G_i. For each such triplet ⟨q, p, x⟩, we calculate its measure φ_{q,p,x} = |dist(q, p) − dist(p, x)| − c_max(p). Note that since p is an object in group G_i, it holds that c_max(p) ≤ c_max(G_i). Then, we sum up the measures φ_{q,p,x} of all triplets and obtain the summation result φ_p, which measures the goodness of selecting p as the pivot, similar to the pivot selection measure for metric spaces proposed in [8]. Intuitively, since a large φ_{q,p,x} indicates a high probability that object x can be pruned via object p, the summation of all such φ_{q,p,x} well estimates the pruning power of object p if selected as a pivot. Finally, we choose p as a pivot if its measure φ_p is the largest among all candidates.

Fig. 13. Illustration of pivot selection.

Fig. 14. The procedure of pivot selection.

Fig. 14 gives the details of our pivot selection algorithm in nonmetric spaces. The algorithm is more general in the sense that it can incrementally identify several pivots in a group G_i. The basic idea for selecting the subsequent pivots is to fix the previously selected pivots and then choose the remaining ones so that the total pruning power is the highest. Specifically, in lines 3-10 of procedure PivotSelection, for each q in the data set D and each object x in G_i, we first compute φ_{q,p,x} for the candidate p and φ_{q,piv,x} for all previously selected pivots piv, and we pick the maximum value φ_max among them, which indicates the highest pruning power achieved by p and the pivots piv in collaboration. Similar to the selection of the first pivot, we sum up φ_max over all combinations of q and x and obtain the summation φ_p. Among all candidate pivots, we select the one with the largest φ_p as the next pivot, that is, the one with the highest pruning power in collaboration with the other pivots (line 10), and add it to the pivot set pivot_set. Finally, if the size of pivot_set has not reached the specified number piv_num, we go on to find the next pivot in the same manner (line 12); otherwise, the algorithm terminates.
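The measure driving this selection can be written compactly as follows (our illustration; c_max_of(p) is assumed to return c_max(p) as defined above):

# Goodness measure phi_p of a pivot candidate p, given the already-selected
# pivots: sums, over all (q, x) pairs, the best pruning margin achieved by
# p together with the previous pivots (sketch, not the paper's pseudocode).
def pivot_score(p, prev_pivots, data, group, dist, c_max_of):
    def phi(piv, q, x):
        return abs(dist(q, piv) - dist(piv, x)) - c_max_of(piv)
    return sum(max(phi(piv, q, x) for piv in [p] + prev_pivots)
               for q in data for x in group)

The next pivot is then the candidate in G_i with the largest pivot_score, and the loop repeats until piv_num pivots have been selected.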

With regard to the number of pivots to choose, we follow the rule in the metric space [17] that the best number of pivots lies between (dim + 1) and 2 × (dim + 1), where dim is the intrinsic dimensionality of the data in the metric space. Since we have converted the nonmetric space into a metric one, the same optimum should hold. Thus, we simply choose twice the intrinsic dimensionality, 2 × (dim + 1). To show its effectiveness, we also compare our pivot selection technique with other selection methods proposed for the metric space, such as random [6] and HF selection [17], in the experiment section.

4.2 iDistance Indexing

Since the objects in each group G_i follow the triangle inequality after adding c_max(G_i) to their original distances, we can plug in a metric space indexing structure such as the MVP-tree [7], M-tree [12], Slim-tree [32], or iDistance [24]. However, with m separate range queries (procedure QueryProcessing), the I/O cost is quite high. In fact, in LCE, we partition the data set D into m disjoint groups G_1, ..., G_m and make each group G_i metric by introducing the c_max value, which exactly fits the prerequisite of iDistance: a metric space with m disjoint partitions. Thus, we can create a single B+-tree to index all the objects in the data set D following the framework of iDistance.

Specifically, assume that the group shifting value c_max(G_i) of each group G_i is defined as the maximum c value c_max(p) of the first pivot p in G_i. We map each data object x_j in G_i to a one-dimensional key i × MAX′ + dist(x_j, p_i), where MAX′ can be chosen as (the maximum distance from data objects to pivots) + (the maximum c value among all triplets). Finally, we index these one-dimensional keys in a B+-tree. For a range query q with radius ε over group G_i, the resulting query interval for searching the B+-tree is [i × MAX′ + dist(q, p_i) − ε − c_max(p), i × MAX′ + dist(q, p_i) + ε + c_max(p)], for 1 ≤ i ≤ m.
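In code, the one-dimensional mapping and the per-group search interval are a direct transcription of the formulas above (MAX′ is written MAXP here; the function names are ours):

# One-dimensional iDistance-style key and per-group query interval.
def idistance_key(i, x, pivot_i, dist, MAXP):
    # group G_i occupies the key range [i * MAXP, (i + 1) * MAXP)
    return i * MAXP + dist(x, pivot_i)

def query_interval(i, q, pivot_i, dist, MAXP, eps, c_max_p):
    d = dist(q, pivot_i)
    lo = i * MAXP + d - eps - c_max_p    # keys below lo cannot qualify
    hi = i * MAXP + d + eps + c_max_p    # keys above hi cannot qualify
    return lo, hi                        # B+-tree range scanned for group G_i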

As mentioned in Section 4.1, for each group $G_i$, we may choose more than one pivot. Since iDistance uses only one pivot per group, in order to increase the pruning power with more pivots, we can store the distances from data objects to these additional pivots in compressed form as a Hilbert value [16]. Assume that $k_i$ pivots $p_{i1}, p_{i2}, \ldots, p_{ik_i}$ are used in group $G_i$. Apart from one pivot, say, $p_{i1}$, used for building the $B^+$-tree, for each data object $x_j$ in $G_i$, we compute the $(k_i - 1)$ distances from $x_j$ to $p_{i2}, p_{i3}, \ldots, p_{ik_i}$, respectively. Then, we convert the $(k_i - 1)$-dimensional vector $\langle dist(x_j, p_{i2}), dist(x_j, p_{i3}), \ldots, dist(x_j, p_{ik_i}) \rangle$ into a single Hilbert value [16] and store this number in the leaf node of the $B^+$-tree, together with the key $(i \times MAX_0 + dist(x_j, p_{i1}))$. Thus, the revised iDistance index is both space and time efficient. Note that once we have constructed the index for the data objects in the nonmetric space, we no longer need to store the large pairwise distance matrix pmatrix (taking up $N \times N$ space), since we can use the index to prune false positives during the similarity search.
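The following sketch shows how a leaf entry could carry both the iDistance key and the compressed extra-pivot distances. The helper hilbert_index is hypothetical (any space-filling-curve encoder could play its role), and quantize and bits are illustrative parameters of ours, not details from the paper.

```python
def quantize(value, max_value, bits):
    """Map a distance in [0, max_value] to an integer grid coordinate."""
    cell = (1 << bits) - 1
    return min(cell, int(value / max_value * cell))

def leaf_entry(x, i, pivots_i, dist, MAX0, max_dist, bits, hilbert_index):
    """Build one leaf entry: (B+-tree key, Hilbert value, object).

    pivots_i = [p_i1, p_i2, ..., p_iki]; p_i1 keys the B+-tree, and the
    remaining (k_i - 1) pivot distances are packed into one Hilbert value
    by the caller-supplied (assumed) encoder hilbert_index.
    """
    key = i * MAX0 + dist(x, pivots_i[0])
    coords = [quantize(dist(x, p), max_dist, bits) for p in pivots_i[1:]]
    return (key, hilbert_index(coords, bits), x)
```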

Fig. 15 presents procedure IndexRangeSearch, which details how a range query is carried out for group $G_i$ using iDistance. In order to obtain all candidates, we invoke IndexRangeSearch (Fig. 15) once for each group $G_i$ $(1 \le i \le m)$ and then take the union of the returned candidate sets. In particular, we plug IndexRangeSearch (which obtains the remaining candidates after pruning with $k_i$ pivots) into line 3 of procedure QueryProcessing (Fig. 11), replacing PivotRangeSearch (which returns the candidates after pruning with a single pivot).

Fig. 15. Index range search.

5 APPROXIMATE SIMILARITY SEARCH IN THE NONMETRIC SPACE

In the previous sections, we discussed how to perform an efficient range query in the nonmetric space. The exact retrieval of range query answers is, however, based on the assumption that query objects follow the same pairwise distance distribution as the data objects. In this section, we study the case where query objects do not follow the data distribution and present a novel approach to perform the approximate similarity search with a guaranteed query accuracy.

In particular, our approximate similarity search problem is stated as follows: Assume that we have a data set $D$, which LCE partitions into $m$ disjoint groups $G_i$ with $m$ group shifting values $c_{max}(G_i)$, respectively, as described above. Given a range query with query object $q$ and a query accuracy threshold $\rho \in (0, 1]$, our goal is to set an approximate group shifting value $c^A_{max}(G_i, \rho)$ for LCE in each group $G_i$ $(1 \le i \le m)$ such that the number of actual answer objects retrieved by the LCE method (using $c^A_{max}(G_i, \rho)$) is at least $\rho$ times the total number of actual query answers. Here, the query accuracy is measured by the recall ratio, that is, the number of qualified objects obtained from query processing divided by the total number of qualified objects.

In order to answer queries with a given query accuracy, we propose an approach consisting of two steps: distribution detection and parameter estimation. Specifically, we consider each group $G_i$ (for $1 \le i \le m$) individually. In the first step, distribution detection, we decide whether or not query object $q$ indeed follows the same distribution as the data objects in group $G_i$. If this is the case, then we can simply set $c^A_{max}(G_i, \rho)$ to the group shifting value $c_{max}(G_i)$ for group $G_i$; otherwise, we have to estimate the $c^A_{max}(G_i, \rho)$ value in the second step, parameter estimation, based on the query history, such that the query results satisfy the $\rho$ constraint. Note that we assume a query history is available. If it is not, we have to conduct a sequential scan for the query after the first step and accumulate the query history.

Fig. 16 illustrates the basic idea of the distribution detection step of the approximate similarity search. In particular, assume that we have obtained $m$ disjoint groups $G_1, G_2, \ldots, G_m$ from data set $D$ by LCE. Taking the first group $G_1$ as an example, we preprocess $G_1$ by randomly picking a group sample set $S_1$ of size $k$ (for example, $s_1$, $s_2$, and $s_3$ in Fig. 16), which can be considered as representatives of group $G_1$. Moreover, we can precompute offline a histogram $H(G_1)$ (with $b$ bins) of the $c$ values of all triplets within group $G_1$, that is, the frequency $freq(G_1, c)$ as a function of the $c$ values. Upon the arrival of query object $q$, we can similarly construct another histogram $H_q(S_1)$ (with $b$ bins) recording the distribution of $c$ values over all possible triplets consisting of query object $q$ and any two samples in $S_1$, that is, the frequency $freq_q(S_1, c)$ as a function of the $c$ values. Intuitively, if the query object $q$ follows the same distribution as the objects in $G_1$, then it is very likely that the two distributions $H(G_1)$ and $H_q(S_1)$ have similar shapes. Therefore, we can evaluate the similarity between these two distributions by Pearson's chi-square test [1]. In particular, for each of the two histograms $H(G_1)$ and $H_q(S_1)$, we normalize the frequencies of its bins such that the total mass in the histogram equals 1. Then, letting $E_i$ and $O_i$ be the contents of the $i$th bin in histograms $H(G_1)$ and $H_q(S_1)$, respectively, we have

$$\chi_0^2 = \sum_{i=1}^{b} \frac{(O_i - E_i)^2}{E_i},$$

where $b$ is the number of degrees of freedom of the chi-square distribution. Let $cdf(z)$ be the cumulative distribution function of the chi-square distribution. We say that the two distributions (in histograms $H(G_1)$ and $H_q(S_1)$) are similar if $cdf(\chi_0^2) = \Pr\{\chi^2 \le \chi_0^2\} \le \tau$, where $\tau$ is a specified threshold within $[0, 1]$. Thus, if it holds that $cdf(\chi_0^2) \le \tau$, then we let $c^A_{max}(G_1, \rho) = c_{max}(G_1)$; otherwise, we continue with the second step, parameter estimation, as follows.
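A compact sketch of this detection test, using NumPy and SciPy's chi-square CDF, is given below. The bin count, sample set, and threshold tau follow the setup above; the function c_value, which computes the $c$ value of a triplet, is an assumed callable standing in for the paper's triplet computation.

```python
import itertools
import numpy as np
from scipy.stats import chi2

def c_histogram(objects, c_value, bins, c_range):
    """Normalized histogram of c values over all triplets of `objects`.

    Cubic in len(objects), but computed offline per group, as in the paper.
    """
    cs = [c_value(a, b, c) for a, b, c in itertools.combinations(objects, 3)]
    hist, _ = np.histogram(cs, bins=bins, range=c_range)
    return hist / max(hist.sum(), 1)

def follows_distribution(q, samples, H_group, c_value, bins, c_range, tau=0.99):
    """Distribution detection: does query q look like the group's data?"""
    # Triplets consisting of q and any two samples from the group.
    cs = [c_value(q, s1, s2) for s1, s2 in itertools.combinations(samples, 2)]
    hist, _ = np.histogram(cs, bins=bins, range=c_range)
    H_q = hist / max(hist.sum(), 1)
    eps = 1e-12                              # guard against empty bins
    chi0 = np.sum((H_q - H_group) ** 2 / (H_group + eps))
    return chi2.cdf(chi0, df=bins) <= tau    # similar iff cdf <= tau
```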

In the second step, in order to improve the query accuracy (recall ratio), we can increase the group shifting value from $c_{max}(G_1)$ to a higher value $c^A_{max}(G_1, \rho)$ for group $G_1$. Thus, we need to estimate $c^A_{max}(G_1, \rho)$ under the query accuracy constraint (that is, the recall ratio threshold $\rho$) by utilizing the query history. In particular, Fig. 17 illustrates the rationale of our estimation for group $G_1$, assuming that a number of historical query objects are available. Based on the query history, we can learn the relationship between the recall ratio and the obtained $cdf(\chi_0^2)$ value when each of $c_{max}(G_1), c_{max}(G_2), \ldots, c_{max}(G_m)$ is used as the group shifting value for $G_1$, as illustrated by the $m$ charts in Fig. 17. Since

$$c_{max}(G_1) < c_{max}(G_2) < \cdots < c_{max}(G_m),$$

a larger group shifting value results in a higher recall ratio but lower pruning power. Thus, our goal is to choose the smallest of these $m$ shifting values such that the resulting query accuracy satisfies the $\rho$ constraint.

As depicted in Fig. 17, the leftmost chart illustrates the relationship between the recall ratio and $cdf(\chi^2)$, calculated in the distribution detection step, when $c_{max}(G_1)$ is used as the group shifting value. Without loss of generality, assume that the computed value is $cdf(\chi_0^2)$, whose corresponding recall ratio in the leftmost chart is $\rho'$. Unfortunately, since $\rho' < \rho$, the query accuracy cannot be guaranteed if we use $c_{max}(G_1)$ as the group shifting value, and a larger value is required. We therefore look at the second chart, which uses the larger $c_{max}(G_2)$ as the group shifting value. Since this time the recall ratio $\rho''$ corresponding to $cdf(\chi_0^2)$ is greater than the threshold $\rho$, $c_{max}(G_2)$ can be used as the group shifting value while guaranteeing the query accuracy. The estimation of $c^A_{max}(G_i, \rho)$ for the other groups $G_i$ is similar. The only difference is that for group $G_2$ there are $(m - 1)$ charts, as shown in Fig. 17, using $c_{max}(G_2), c_{max}(G_3), \ldots, c_{max}(G_m)$, respectively, as the group shifting values; for group $G_3$ there are $(m - 2)$ charts, and so on. For the last group $G_m$, if the query accuracy cannot be reached, we have to sequentially scan the entire group (since it is not known how large the shifting value must be to guarantee the query accuracy). In this way, we can set the approximate group shifting value $c^A_{max}(G_i, \rho)$ for each group with a guaranteed query accuracy.
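A minimal sketch of this selection rule: given, for one group, a recall-versus-cdf curve per candidate shifting value (learned from the query history), pick the smallest shifting value whose predicted recall at the observed $cdf(\chi_0^2)$ meets $\rho$. The structure recall_curves is an assumed precomputed mapping, not a data structure named in the paper.

```python
import numpy as np

def estimate_shift(cdf_chi0, rho, shift_values, recall_curves):
    """Pick the smallest group shifting value meeting the recall target.

    shift_values:  candidate shifts c_max(G_1) < ... < c_max(G_m).
    recall_curves: per shift, (cdf_points, recall_points) learned from
                   the query history, with cdf_points sorted ascending.
    Returns the chosen shift, or None (fall back to a sequential scan).
    """
    for shift in shift_values:
        cdf_pts, recall_pts = recall_curves[shift]
        predicted_recall = np.interp(cdf_chi0, cdf_pts, recall_pts)
        if predicted_recall >= rho:
            return shift
    return None   # no candidate guarantees the accuracy: scan the group
```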


Fig. 16. The distribution detection step of approximate similarity search.

Fig. 17. The parameter estimation step of approximate similarity search.

6 EXPERIMENTAL EVALUATION

In this section, we evaluate the effectiveness and efficiency of the proposed LCE. Throughout our experiments, we use two real data sets: stock [35] and histogram [31]. The stock data set contains daily stock prices from 4,622 American companies; each series contains 1,024 daily closing prices, from March 1995 to March 1999. The histogram data set consists of 64-level gray-scale histograms transformed from 10,000 Web-crawled images. For stock, we test two nonmetric distance functions: DTW and the partial Hausdorff distance (pHausdorff [22]). For histogram, we apply two other nonmetric measures: the squared $L_2$ distance $sqL_2$ and the fractional $L_p$ distance $L_{0.75}$ $(p = 0.75)$. Specifically, given two vectors $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$ of arity $n$, the squared $L_2$ distance $sqL_2$ is the squared Euclidean distance between $X$ and $Y$, that is, $\sum_{i=1}^{n}(x_i - y_i)^2$, and the fractional $L_p$ distance is $\left(\sum_{i=1}^{n}|x_i - y_i|^p\right)^{1/p}$, where $0 < p < 1$.
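For reference, the two histogram measures are straightforward to compute; the few lines below are our own NumPy sketch of these definitions, not code from the paper.

```python
import numpy as np

def sq_l2(x, y):
    """Squared Euclidean distance: sum_i (x_i - y_i)^2 (nonmetric)."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sum(d * d))

def fractional_lp(x, y, p=0.75):
    """Fractional L_p distance, 0 < p < 1 (violates the triangle inequality)."""
    d = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float(np.sum(d ** p) ** (1.0 / p))
```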

Since the original stock and histogram data sets are small, we extract vectors of arity 16 (consisting of 16 consecutive values) from each series to make the total number of subsequences large enough (for the purpose of the scalability test) and precompute the pairwise distance matrix using the nonmetric distance function. The resulting distance matrix is the input of our grouping methods. Following the same assumption as in CSE [27], we randomly select objects from the data set as query objects. Section 6.1 studies the effect of the number of groups on the query performance. Sections 6.2 and 6.3 present the experimental results of LCE under different query selectivities and data sizes, respectively. Section 6.4 compares the pruning power of LCE with that of CSE and, moreover, verifies the effectiveness of our pivot selection technique in nonmetric spaces. Section 6.5 presents the query performance of the approximate similarity search. All our experimental results are averages over 100 runs.

6.1 The Performance versus the Number of Groups

First, we evaluate the effect of the number of groups produced by the proposed grouping algorithm, in terms of both computation and I/O cost, during range queries on the $B^+$-tree. In particular, we set the page size of the $B^+$-tree to 1,024 bytes. Our experimental results are compared with the pure linear scan. Specifically, assuming that all the objects are stored on disk, the linear scan sequentially accesses consecutive disk blocks and computes the real distance from the query object to the data on each page from scratch. Thus, the computation cost of the linear scan is determined by the size of the data set, and its number of I/Os is the size of the data set divided by the capacity of each page. For a fair comparison, we choose the same 1,024-byte page size as used in the $B^+$-tree. In our experiment, we test data sets of size 5,000. Each vector of length 16 takes up approximately 64 bytes with floating-point values. Therefore, the number of page accesses for the linear scan is $5{,}000/(1{,}024/64) = 312.5$.

Fig. 18 illustrates the computation and I/O costs of our approach compared to the linear scan, where the data set size is 5,000, the query selectivity is 0.1 percent, and the number $m$ of groups is set to 5, 10, 20, and 50, respectively. In particular, the query selectivity is defined as the number of objects retrieved by the range query divided by the total number of objects in the data set. Moreover, the pruning power is measured by the number of objects that can be safely pruned by the search procedure (so that the computation of their real distances to the query object is avoided) divided by the total data size; a smaller number of real distance computations indicates higher pruning power. We also define the total elapsed time as the total time to complete a search procedure. Figs. 18a and 18c illustrate the percentage of distance computations of our method relative to the linear scan for different $m$ values. For all four measures on the two data sets, when the number $m$ of groups increases, the distance computation cost decreases, since more pivots are available to prune objects. Figs. 18b and 18d show the percentage of page accesses of our approach compared to the linear scan. When the number of groups increases, page accesses also increase, because more groups result in a larger number of query intervals in the $B^+$-tree, that is, more descending paths (higher I/O) are required per query.

Fig. 18. The performance versus $m$. (a) Distance computation (stock). (b) Page accesses (stock). (c) Distance computation (histogram). (d) Page accesses (histogram).

The ratio of the total elapsed time of our method to that of the linear scan is shown above each column in Figs. 18b and 18d (and in the subsequent figures as well). As the value of $m$ increases, this ratio first decreases, due to the decreasing distance computation cost, and then increases, because of the growing I/O cost. This indicates a trade-off between the computation and I/O costs under different numbers $m$ of groups: when $m$ increases, the cost of distance computations decreases (due to higher pruning power), whereas the number of page accesses (I/Os) increases. Thus, for a specific data set, we can partition it into groups with different $m$ values and test the performance (the total elapsed time) of the resulting indices; an appropriate $m$ is the one with the lowest total elapsed time. In general, since the I/O cost usually dominates the distance computations, this $m$ value is small. We use $m = 10$ for all subsequent experiments.

6.2 The Performance versus Query Selectivity

Fig. 19. The performance versus query selectivity. (a) Distance computation (stock). (b) Page accesses (stock). (c) Distance computation (histogram). (d) Page accesses (histogram).

Next, we evaluate the performance of LCE under different query selectivities, compared to the linear scan. Specifically, we test the stock and histogram data sets of size 5,000, each divided into 10 groups, and vary the query radius $\varepsilon$ so that the query selectivity is 0.02 percent, 0.05 percent, 0.1 percent, 0.5 percent, and 1 percent. Figs. 19a and 19c illustrate the percentage of distance computations of LCE, which is much lower than that of the linear scan. When the query selectivity increases from 0.02 percent to 1 percent, the required number of real distance computations, compared with that of the linear scan, does not increase too rapidly, which shows the robustness of our method. Moreover, Figs. 19b and 19d show the I/O costs of LCE and the linear scan; the number of page accesses increases only gradually as the query selectivity increases. Based on the figures, LCE greatly improves the retrieval efficiency in terms of both computation and I/O costs.

6.3 The Performance versus Data Size

Fig. 20. The performance versus data size. (a) Distance computation (stock). (b) Page accesses (stock). (c) Distance computation (histogram). (d) Page accesses (histogram).

Then, we consider the scalability of LCE compared with the linear scan. In particular, we fix the query selectivity to 0.1 percent, set the number of groups to 10, and vary the data size from 1,000 to 50,000. Figs. 20a and 20c present the results for the computation cost. As expected, the number of distance computations increases, since the number of qualified objects also grows with a larger data set at a fixed selectivity. Interestingly, in Figs. 20b and 20d, the number of page accesses of LCE divided by that of the linear scan decreases as the data size grows. A brief analysis is as follows. Assume that the data size is $|D|$. The I/O cost of LCE consists of two parts: the index search on the $B^+$-tree (descending from the root to a leaf) and the search over leaf nodes (sequentially scanning leaf nodes of the $B^+$-tree until the search range is exhausted). The former part costs at most $m \cdot \lceil \log_F |D| \rceil$ for the $m$ search paths of the groups, where $F$ is the minimum fanout of a tree node, whereas the latter costs $sel \cdot |D|/F$, where $sel$ is the query selectivity. The ratio of the I/O cost of LCE to that of the linear scan is therefore

$$\frac{m \cdot \lceil \log_F |D| \rceil + sel \cdot |D|/F}{|D|/F'},$$

where $F'$ is the capacity of a page in the linear scan. Obviously, as $|D|$ grows, this ratio decreases, which is vividly illustrated in Figs. 20b and 20d. The resulting curves indicate that, compared to the linear scan, LCE scales very well with the data size.
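To see the trend numerically, the following few lines evaluate this ratio for growing $|D|$; the parameter values for $m$, $F$, $F'$, and $sel$ are our own illustrative choices, selected only to mirror the experimental setting.

```python
import math

def io_ratio(D, m=10, F=16, F_prime=16, sel=0.001):
    """Ratio of LCE's I/O cost to the linear scan's, per the analysis above."""
    lce_ios = m * math.ceil(math.log(D, F)) + sel * D / F
    scan_ios = D / F_prime
    return lce_ios / scan_ios

for D in (1_000, 5_000, 10_000, 50_000):
    print(D, round(io_ratio(D), 3))   # the ratio shrinks as |D| grows
```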

Fig. 21. Preprocessing time versus data size.

Apart from the scalability of the query efficiency, we also record the preprocessing time for the data sets in Fig. 21, where the data size varies from 1,000 to 50,000 and $m = 10$. The figure shows that the required preprocessing time grows cubically with the data size, which confirms the complexity analysis of the grouping in Section 3.2.

6.4 Comparisons with Constant Shift Embedding and TriGen and Pivot Selections

We now compare the pruning power of our method, LCE, with that of CSE [27]. Recall that CSE is the only other method that can answer similarity queries in nonmetric spaces without false dismissals; it uses a single global $c$ to make nonmetric distance functions follow the triangle inequality. Fig. 22a compares CSE with LCE in terms of pruning power (that is, the percentage of data objects that can be pruned) for the four measures, two on each data set. Specifically, we test the two data sets stock and histogram with a large size of 10,000, fix the query selectivity to 0.1 percent, and set $m$ to 10. For CSE, we select the same number of pivots as in LCE. As illustrated in the figure, LCE prunes many more candidates than CSE, since a large global $c$ value significantly reduces the pruning power of the similarity search.

We also compare the pruning power of LCE with the recent approximate embedding method TG [31]. Recall that TG tries a number of metric-preserving distance modifiers on a sample set so as to recover the triangle inequality in the data set. In particular, we randomly sample 1 percent of the data set to find the best distance modifier among the 117 modifiers used in [31]. The experimental settings are the same as in the previous experiment (the data size is 10,000, the query selectivity is 0.1 percent, and $m = 10$). Fig. 22b shows the pruning power of TG and LCE for the four distance measures. Although the pruning power of TG is close to that of LCE, it is inferior over all four measures. Moreover, TG is an approximate approach, which may introduce false dismissals during the similarity search: in our experiments, when the query selectivity increases to 1 percent, on average at least one false dismissal occurs in every 10 queries.

We further demonstrate the effectiveness of our pivot selection technique by comparing it with two other approaches: random [6] and HF [17] selection. We evaluate the pruning power on both the stock and histogram data sets of size 10,000, with two nonmetric measures each, where the selectivity is set to 0.1 percent and $m = 10$. Fig. 23 shows that our selection method is the best of the three, since it takes the nonmetric property of the data into account. In summary, our experimental results confirm the usefulness of LCE for the exact similarity search in nonmetric spaces without false dismissals, applying local shifting values for different groups instead of a single global one (as in CSE) or a distance modifier over the entire data set (as in TG).

6.5 Performance of the Approximate Similarity Search

Finally, we study the performance of the approximate similarity search discussed in Section 5, where query objects do not follow the data distribution. In order to test the query performance, we synthetically generate 200 query objects, where the first 100 are used as the query history and the rest for the query evaluation. Specifically, the pairwise distances $dist(q, o)$ from a query object $q$ to the data objects $o$ are obtained as follows. First, we randomly pick an object $q'$ from the data set $D$ (which is then excluded from the data set). Then, we produce a new query object $q$ such that its pairwise distance $dist(q, o)$ to any object $o \in D$ is given by $dist(q', o) + u$, where $u$ is a random number drawn from a variable $X$. Let $min$ ($max$) be the minimum (maximum) possible distance $dist(q', o)$ over all $o \in D$. Variable $X$ follows either the Uniform or the Zipf (with skewness 0.8) distribution within the interval $[-\theta \cdot (max - min)/2,\ \theta \cdot (max - min)/2]$, where $\theta$ is a parameter indicating the deviation of the query distribution from that of the data. Intuitively, the larger $\theta$ is, the more likely it is that query objects do not follow the data distribution.
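A sketch of this query-generation scheme for the Uniform case is given below (a skewed variant would draw $u$ from a Zipf-like law mapped into the same interval). The function takes the row of precomputed distances $dist(q', o)$ and the deviation parameter $\theta$; the clipping to nonnegative distances is our own safeguard, not a step stated in the paper.

```python
import numpy as np

def perturb_query_distances(dist_row, theta, rng=None):
    """Given dist(q', o) for all o in D, return dist(q, o) = dist(q', o) + u,
    with u drawn uniformly from [-theta*(max-min)/2, +theta*(max-min)/2]."""
    rng = rng or np.random.default_rng()
    d = np.asarray(dist_row, dtype=float)
    half = theta * (d.max() - d.min()) / 2.0
    u = rng.uniform(-half, half, size=d.shape)
    return np.clip(d + u, 0.0, None)   # keep distances nonnegative (our choice)
```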

As mentioned in Section 5, we partition the data set $D$ into $m$ groups using LCE. Within each group $G_i$, we precompute the histogram $H(G_i)$ with 10 bins offline and select a random sample set $S_i$ of size 20, to be used in the distribution detection step. Furthermore, based on the query history (the first 100 generated query objects), we also precalculate the relationships between the recall ratio and $cdf(\chi^2)$ under the different shifting values, which are used in the parameter estimation step.
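Under the same assumptions as the estimation sketch in Section 5, the recall-versus-cdf curves could be accumulated from the query history roughly as follows; run_query, true_answers, and detect_cdf are assumed callables standing in for the query machinery, not functions defined by the paper.

```python
def build_recall_curves(history, shift_values, run_query, true_answers, detect_cdf):
    """For each candidate shift, collect (cdf(chi0^2), recall) pairs over
    the historical queries, sorted by the cdf value for interpolation.

    run_query(q, shift) and true_answers(q) are assumed to return sets.
    """
    curves = {}
    for shift in shift_values:
        pts = []
        for q in history:
            truth = true_answers(q)
            got = run_query(q, shift)
            recall = len(got & truth) / max(len(truth), 1)
            pts.append((detect_cdf(q), recall))
        pts.sort()                          # ascending in the cdf value
        curves[shift] = ([c for c, _ in pts], [r for _, r in pts])
    return curves
```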

Upon the arrival of a query object $q$, we first evaluate whether or not the query object follows a distribution similar to that of each group. In particular, we construct the histogram $H_q(S_i)$ (with 10 bins) over the triplets formed by query object $q$ and any two samples in $S_i$ and compute the $cdf(\chi_0^2)$ value between $H(G_i)$ and $H_q(S_i)$. Here, we set $\tau$ to 0.99. If $cdf(\chi_0^2) \le 0.99$, then we use $c_{max}(G_i)$ as the group shifting value $c^A_{max}(G_i, \rho)$ of $G_i$ (that is, query object $q$ approximately follows the data distribution); otherwise, we continue with the parameter estimation step and estimate $c^A_{max}(G_i, \rho)$ such that, based on the query history, the query accuracy is expected to be above $\rho$.

Fig. 22. The performance comparison. (a) CSE and LCE. (b) TG and LCE.

Fig. 23. Comparisons of pivot selections.

Fig. 24. The performance versus $\rho$. (a) Uniform. (b) Zipf.

Fig. 24 illustrates the recall ratio of the approximate similarity search over the two data sets of size 5,000 with the four measures, where $\rho$ varies from 0.7 to 0.85, $m = 10$, and the query selectivity is set to 0.1 percent. Note that the query precision is originally defined in information retrieval as the number of actual results obtained divided by the total number of retrieved objects. Since the final results after refinement (checking their actual distances to the query object) are exactly the true answers, the precision is always 100 percent; thus, we only test the recall ratio. Fig. 24a shows the experimental results for query objects generated with a Uniform variable $X$, and Fig. 24b with a Zipf variable $X$, where $\theta = 0.5$ in both cases. The value above each column indicates the percentage of data objects that can be pruned during the approximate query processing. Based on the figures, we find that the actual recall ratio of similarity queries is close to the expected threshold $\rho$. Moreover, as $\rho$ increases, the percentage of pruned data objects decreases (although it remains large), showing the efficiency of the search procedure.

Fig. 25. The performance versus $\theta$. (a) Uniform. (b) Zipf.

Fig. 25 varies the parameter $\theta$ from 0.1 to 1, where the data size is 5,000, $\rho = 0.75$, $m = 10$, and the query selectivity is 0.1 percent. We test the effect of $\theta$ on the recall ratio and on the percentage of pruned objects during the approximate similarity search. Fig. 25a shows the results for the Uniform $X$, whereas Fig. 25b shows the results for the skewed (Zipf) $X$. We find that even when $\theta$ grows, that is, the distribution of query objects deviates further from that of the data objects, the recall ratio remains high. Note that we use the same number (100) of query objects as history in the parameter estimation step for all tested $\theta$ values. Since a large $\theta$ admits more possible query objects in the space, the estimation of the group shifting value may become less accurate for large $\theta$ due to the lack of query history (that is, the query samples in the history are too few); thus, in the figures, the recall ratio decreases with increasing $\theta$. On the other hand, the percentage of pruned objects also decreases with increasing $\theta$, since larger group shifting values (with lower pruning power) have to be selected in order to satisfy the $\rho$ constraint. In summary, when query objects do not follow the data distribution, our approximate similarity search still achieves good efficiency while meeting the guaranteed query accuracy.

7 CONCLUSIONS

In this paper, we address the problem of similarity-based search in nonmetric spaces, where the triangle inequality does not hold. Most existing index structures support only metric distance functions, yet such functions often fail to capture the human perception of similarity; measures that are robust to noise and close to human recognition are usually nonmetric. We therefore propose a novel approach, LCE, which assigns data objects to different groups and associates a specific constant shifting value with each group so that the triangle inequality holds within it. Moreover, we propose a pivot selection technique and build a $B^+$-tree for the efficient processing of similarity queries. Furthermore, we propose a novel method to answer approximate similarity searches in nonmetric spaces with a guaranteed query accuracy. Extensive experiments have confirmed the effectiveness and efficiency of our LCE method under various experimental settings.

ACKNOWLEDGMENTS

Funding for this work was provided by Hong Kong RGC Grants under Project 611907 and the National Grand Fundamental Research 973 Program of China under Grant 2006CB303000.

REFERENCES

[1] Wikipedia, http://en.wikipedia.org/wiki/Chi-square_test, 2007.
[2] R. Agrawal, C. Faloutsos, and A. Swami, "Efficient Similarity Search in Sequence Databases," Proc. Fourth Int'l Conf. Foundations of Data Organization and Algorithms (FODO), 1993.
[3] V. Athitsos, M. Hadjieleftheriou, G. Kollios, and S. Sclaroff, "Query-Sensitive Embeddings," Proc. ACM SIGMOD, 2005.
[4] D.J. Berndt and J. Clifford, "Finding Patterns in Time Series: A Dynamic Programming Approach," Advances in Knowledge Discovery and Data Mining, 1996.
[5] J.S. Boreczky and L.A. Rowe, "Comparison of Video Shot Boundary Detection Techniques," Proc. Int'l Symp. Storage and Retrieval for Image and Video Databases, 1996.
[6] T. Bozkaya and M. Ozsoyoglu, "Indexing Large Metric Spaces for Similarity Search Queries," ACM Trans. Database Systems, vol. 24, no. 3, pp. 361-404, 1999.
[7] T. Bozkaya, N. Yazdani, and Z.M. Ozsoyoglu, "Matching and Indexing Sequences of Different Lengths," Proc. Sixth Int'l Conf. Information and Knowledge Management (CIKM), 1997.
[8] B. Bustos, G. Navarro, and E. Chavez, "Pivot Selection Techniques for Proximity Searching in Metric Spaces," Pattern Recognition Letters, 2003.
[9] E. Chavez, G. Navarro, R. Baeza-Yates, and J.L. Marroquín, "Searching in Metric Spaces," ACM Computing Surveys, 2001.
[10] L. Chen and R. Ng, "On the Marriage of Edit Distance and Lp Norms," Proc. 30th Int'l Conf. Very Large Data Bases (VLDB), 2004.
[11] L. Chen, M.T. Ozsu, and V. Oria, "Robust and Fast Similarity Search for Moving Object Trajectories," Proc. ACM SIGMOD, 2005.
[12] P. Ciaccia, M. Patella, and P. Zezula, "M-Tree: An Efficient Access Method for Similarity Search in Metric Spaces," Proc. 23rd Int'l Conf. Very Large Data Bases (VLDB), 1997.
[13] G. Das, D. Gunopulos, and H. Mannila, "Finding Similar Time Series," Proc. First European Symp. Principles of Data Mining and Knowledge Discovery (PKDD), 1997.
[14] C. Faloutsos and K.-I. Lin, "FastMap: A Fast Algorithm for Indexing, Data Mining and Visualization of Traditional and Multimedia Datasets," Proc. ACM SIGMOD, 1995.
[15] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos, "Fast Subsequence Matching in Time-Series Databases," Proc. ACM SIGMOD, 1994.
[16] C. Faloutsos and S. Roseman, "Fractals for Secondary Key Retrieval," Proc. Eighth ACM Symp. Principles of Database Systems (PODS), 1989.
[17] R.F.S. Filho, A.J.M. Traina, and C. Faloutsos, "Similarity Search without Tears: The OMNI Family of All-Purpose Access Methods," Proc. 17th Int'l Conf. Data Eng. (ICDE), 2001.


[18] V. Ganti, R. Ramakrishnan, J. Gehrke, A.L. Powell, and J.C. French, "Clustering Large Datasets in Arbitrary Metric Spaces," Proc. 15th Int'l Conf. Data Eng. (ICDE), 1999.
[19] K.-S. Goh, B.T. Li, and E. Chang, "Dyndex: A Dynamic and Non-Metric Space Indexer," Proc. 10th ACM Int'l Conf. Multimedia, 2002.
[20] G.R. Hjaltason and H. Samet, "Properties of Embedding Methods for Similarity Searching in Metric Spaces," IEEE Trans. Pattern Analysis and Machine Intelligence, 2003.
[21] G. Hristescu and M. Farach-Colton, "Cluster-Preserving Embedding of Proteins," technical report, Center for Discrete Math. and Theoretical Computer Science, 1999.
[22] D.P. Huttenlocher, G.A. Klanderman, and W.A. Rucklidge, "Comparing Images Using the Hausdorff Distance," IEEE Trans. Pattern Analysis and Machine Intelligence, 1993.
[23] D.W. Jacobs, D. Weinshall, and Y. Gdalyahu, "Classification with Nonmetric Distances: Image Retrieval and Class Representation," IEEE Trans. Pattern Analysis and Machine Intelligence, 2000.
[24] H.V. Jagadish, B.C. Ooi, K.-L. Tan, C. Yu, and R. Zhang, "iDistance: An Adaptive B+-Tree-Based Indexing Method for Nearest Neighbor Search," ACM Trans. Database Systems, vol. 30, no. 2, pp. 364-397, 2005.
[25] E. Keogh, "Exact Indexing of Dynamic Time Warping," Proc. 28th Int'l Conf. Very Large Data Bases (VLDB), 2002.
[26] S. Kim, S. Park, and W. Chu, "An Index-Based Approach for Similarity Search Supporting Time Warping in Large Sequence Databases," Proc. 17th Int'l Conf. Data Eng. (ICDE), 2001.
[27] V. Roth, J. Laub, J. Buhmann, and K.-R. Müller, "Going Metric: Denoising Pairwise Data," Proc. Int'l Conf. Neural Information Processing Systems (NIPS), 2002.
[28] H. Samet, Foundations of Multidimensional and Metric Data Structures. Addison-Wesley, 2006.
[29] R. Schapire and Y. Singer, "Improved Boosting Algorithms Using Confidence-Rated Predictions," Machine Learning, 1999.
[30] T. Seidl and H. Kriegel, "Optimal Multi-Step k-Nearest Neighbor Search," Proc. ACM SIGMOD, 1998.
[31] T. Skopal, "On Fast Non-Metric Similarity Search by Metric Access Methods," Proc. 10th Int'l Conf. Extending Database Technology (EDBT), 2006.
[32] C. Traina Jr., A.J.M. Traina, B. Seeger, and C. Faloutsos, "Slim-Trees: High-Performance Metric Trees Minimizing Overlap between Nodes," Proc. Seventh Int'l Conf. Extending Database Technology (EDBT), 2000.
[33] A. Tversky, "Features of Similarity," Psychological Rev., 1977.
[34] M. Vlachos, G. Kollios, and D. Gunopulos, "Discovering Similar Multidimensional Trajectories," Proc. 18th Int'l Conf. Data Eng. (ICDE), 2002.
[35] C.Z. Wang and X. Wang, "Supporting Content-Based Searches on Time Series via Approximation," Proc. 12th Int'l Conf. Scientific and Statistical Database Management (SSDBM), 2000.
[36] X. Wang, J. Wang, K. Lin, D. Shasha, B. Shapiro, and K. Zhang, "An Index Structure for Data Mining and Clustering," Knowledge and Information Systems, vol. 2, no. 2, pp. 161-184, 2000.
[37] B.-K. Yi and C. Faloutsos, "Fast Time Sequence Indexing for Arbitrary Lp Norms," Proc. 26th Int'l Conf. Very Large Data Bases (VLDB), 2000.
[38] B.-K. Yi, H. Jagadish, and C. Faloutsos, "Efficient Retrieval of Similar Time Sequences under Time Warping," Proc. 14th Int'l Conf. Data Eng. (ICDE), 1998.
[39] P. Zezula, G. Amato, V. Dohnal, and M. Batko, Similarity Search: The Metric Space Approach. Springer, 2006.

Lei Chen received the BS degree in computer science and engineering from Tianjin University, China, in 1994, the MA degree from the Asian Institute of Technology, Thailand, in 1997, and the PhD degree in computer science from the University of Waterloo, Canada, in 2005. He is currently an assistant professor in the Department of Computer Science and Engineering, Hong Kong University of Science and Technology. His research interests include multimedia and time series databases, sensor and peer-to-peer databases, and stream and probabilistic databases. He is a member of the IEEE and the IEEE Computer Society.

Xiang Lian received the BS degree from Nanjing University in 2003. He is currently working toward the PhD degree in the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong. His research interests include stream time series and probabilistic databases. He is a student member of the IEEE and the IEEE Computer Society.

