
Continuously Adaptive Similarity Search

The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters.

Citation Zhang, Huayi et al. "Continuously Adaptive Similarity Search." Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, May 2020, Portland, Oregon, Association for Computing Machinery, May 2020.

As Published http://dx.doi.org/10.1145/3318464.3380601

Publisher Association for Computing Machinery (ACM)

Version Author's final manuscript

Citable link https://hdl.handle.net/1721.1/130067

Terms of Use Creative Commons Attribution-Noncommercial-Share Alike

Detailed Terms http://creativecommons.org/licenses/by-nc-sa/4.0/

Continuously Adaptive Similarity Search

Huayi Zhang*

Worcester Polytechnic Institute

[email protected]

Lei Cao*

MIT CSAIL

[email protected]

Yizhou Yan

Worcester Polytechnic Institute

[email protected]

Samuel Madden

MIT CSAIL

[email protected]

Elke A. Rundensteiner

Worcester Polytechnic Institute

[email protected]

ABSTRACT

Similarity search is the basis for many data analytics tech-

niques, including k-nearest neighbor classification and out-

lier detection. Similarity search over large data sets relies

on i) a distance metric learned from input examples and ii)

an index to speed up search based on the learned distance

metric. In interactive systems, input to guide the learning

of the distance metric may be provided over time. As this

new input changes the learned distance metric, a naive ap-

proach would adopt the costly process of re-indexing all

items after each metric change. In this paper, we propose

the first solution, called OASIS, to instantaneously adapt the

index to conform to a changing distance metric without this

prohibitive re-indexing process. To achieve this, we prove

that locality-sensitive hashing (LSH) provides an invariance property, meaning that an LSH index built on the original

distance metric is equally effective at supporting similarity

search using an updated distance metric as long as the trans-

form matrix learned for the new distance metric satisfies

certain properties. This observation allows OASIS to avoid

recomputing the index from scratch in most cases. Further,

for the rare cases when an adaption of the LSH index is

shown to be necessary, we design an efficient incremental

LSH update strategy that re-hashes only a small subset of the

items in the index. In addition, we develop an efficient dis-

tance metric learning strategy that incrementally learns the

new metric as inputs are received. Our experimental study

using real world public datasets confirms the effectiveness

of OASIS at improving the accuracy of various similarity

search-based data analytics tasks by instantaneously adapt-

ing the distance metric and its associated index in tandem,

while achieving up to 3 orders of magnitude speedup over

the state-of-the-art techniques.

1 INTRODUCTION

Similarity search is important in many data analytics tech-

niques including i) outlier detection such as distance or

density-based models [4, 6, 25, 37]; ii) classification such as

*Equal Contribution

k-Nearest Neighbor (π‘˜NN) classification [11]; and iii) classi-

cal clustering methods such as density-based clustering [16]

and micro-clustering [1]. These techniques all rely on simi-

larity search to detect outliers that are far away from their

neighbors, to classify testing objects into a class based on

the classes of their k-nearest neighbors, or to identify objects

that are similar to each other.

The effectiveness and efficiency of similarity search relies

on two important yet dependent factors, namely an appropri-

ate distance metric and an effective indexing method. In simi-

larity search, the distance metric must capture correctly the

similarity between objects as per the domain. Pairs of objects

measured as similar by the distance metric should indeed

be semantically close in the application domain. Indexing

ensures the efficiency of the similarity search in retrieving

objects similar according to the distance metric. Clearly, the

construction of the index relies on the distance metric to

determine the similarity among the objects.

Unfortunately, general-purpose metrics such as Euclidean

distance often do not meet the requirements of applica-

tions [19]. To overcome this limitation, distance metric learn-

ing [46–49] has been introduced to learn a customized dis-

tance metric guided by which pairs of objects the application

considers similar or dissimilar, using so-called similarity examples.

As a motivating example, in a collaboration with a large

hospital in the US (name removed due to anonymity), we

have been developing a large-scale interactive labelling system to label EEG segments (450 million segments, 30TB)

with 6 classes representing different types of seizures. These

labelled EEG segments are used to train a classifier which

thereafter can automatically detect seizures based on EEG

signals collected during the clinical observation of patients.

To reduce the manual labelling effort by the neurologists, our

system supports similarity search requests submitted by the

neurologists to return the k nearest neighbors of the to-be-

labelled segment. The goal is to propagate the labels provided

by the experts to similar segments. However, when using Eu-

clidean distance in these nearest neighbor queries, segments

may not agree with the returned k nearest neighbors on class

labels. To solve this problem, our system employs distance


Figure 1: Example of distance metric learning

metric learning to construct a custom distance metric from

the similarity examples given by the neurologists. This en-

sures that the nearest neighbors of each segment are more

likely to correspond to segments that share the same labels

as depicted in Fig. 1. This then empowers us to automatically

label EEG segments with higher accuracy.

It has been shown that the effectiveness of distance met-

ric learning depends on the number of similarity examples

that are available in the application [5, 19]. Unfortunately, a

sufficient number of similarity examples may not be avail-

able beforehand to learn an accurate distance metric. In-

stead, similarity examples may be discovered over time as

the users interact with their data. For example, in our above

labelling system, whenever the neurologists find a segment

that should not be labeled with the same label as its nearest

neighbors, the distance metric has to be updated immediately

to avoid this mistake in the next search.

As another example, in a financial fraud detection appli-

cation, analysts may discover new fraudulent transactions

not known before. To make sure that the fraud detector can

accurately detect the latest threats, its distance metric must

be updated according to the new similarity examples, so that

the new types of fraud can be separable (distant) from nor-

mal objects. Therefore, the distance metric used in similarity

search has to be immediately learned and updated as soon

as new similarity examples become available.

Challenges & State-of-the-Art. While the continuous

learning of a distance metric to be used in similarity search

has been studied, it has been overlooked in the literature that

a change in distance metric renders the index constructed

using the old distance metric ineffective. This is because

the objects co-located by the index may no longer be close

under the new distance metric. Therefore, the index must

be re-built whenever the distance metric changes to assure

the users always work with the most up-to-date distance

semantics. Note that unlike index updates triggered by the

insertion of new objects, a change in the distance metric may

influence all existing objects. However, re-building an index

from scratch is known to be an expensive process, often tak-

ing hours or days [21]. Although this cost may be acceptable

in traditional databases where an index is typically built only

once upfront, in our context where the distance metric may

change repeatedly this is not viable. In particular, in interac-

tive systems such as our labelling system, hours of quiescent

time due to rebuilding the index is often not acceptable to the

neurologists, whose time is precious.

However, without the support of an index, similarity search

is neither scalable nor efficient. In other words, in similar-

ity search the need to quickly adapt to the latest semantics

conflicts with the need to be efficient.

Further, like indexing, distance metric learning itself is

an expensive process [47, 48]. Repeatedly learning a new

distance metric from scratch every time when a similar-

ity example becomes available would introduce a tremen-

dous computational overhead especially when new exam-

ples arrive frequently. Although some online distance met-

ric learning methods have been proposed [9, 24, 41], these

techniques [24] either tend to be inefficient due to the

repeated computation of the all-pair distances among all ob-

jects involved in all similarity examples received so far, or

others [9, 41] directly adopt the update method used in solv-

ing the general online learning problem [13]. This can lead

to an invalid distance metric producing negative distance

values [24].

Our Proposed Approach. In this work, we propose the first continuously adaptive similarity search strategy, or OASIS

for short. It uses an incremental distance metric learning

method that together with an efficient index update strat-

egy satisfies the conflicting requirements of adaptivity and

efficiency in similarity search.

β€’ OASIS models the problem of incremental distance met-

ric learning as an optimization problem of minimizing a

regularized loss [13, 41]. This enables OASIS to learn a new

distance metric based on only the current distance metric and

the expected distances between the objects involved in the

newly provided similarity example. Therefore, it efficiently

updates the distance metric with complexity quadratic in the

dimensionality of the data – independent of the total num-

ber of objects involved in all similarity examples (training

objects).

β€’ We observe that the popular Locality Sensitive Hashing

(LSH) [17, 23, 31, 32], widely used in online applications to

deal with big data, is robust against changes to the distance

metric, such that in many cases it is not necessary to update

the LSH index even when the distance metric changes. Dis-

tance metric learning transforms the original input feature

space into a new space using a transform matrix learned

from the similarity examples. We formally prove in Sec. 5

that hashing the objects in the original input space is equal

to hashing the objects in the new space using a family of new


hash functions converted from the original hash functions.

Therefore, if the new hash functions remain locality-sensitive

– i.e., similar objects are mapped to the same β€œbuckets” with

high probability, then the original LSH index will continue

to remain effective in supporting similarity search also with

the new distance metric. We show that this holds as long as

the transform matrix learned by distance metric learning sat-

isfies certain statistical properties. Leveraging this invariance observation, OASIS avoids the unnecessary recomputation

of LSH hashes.

β€’ Further, we design an efficient method to adapt the LSH

index when an update is found to be necessary. It is built on

our finding that the identification of the objects that most

likely would move to different buckets after the distance

metric changes can be transformed into the problem of find-

ing the objects that have a large inner product with some

query objects. Therefore, the objects that have to be updated

in the index can be quickly discovered by submitting a few

queries to a compact inner product-based LSH index. This

way, OASIS effectively avoids re-hashing the large majority

of objects during each actual LSH update.

Our experimental study using public datasets and large

synthetic datasets confirms that OASIS is up to 3 orders of

magnitude faster than approaches that use traditional online

metric learning methods [24] and update the index whenever

the distance metric changes. Moreover, OASIS is shown to

significantly improve the accuracy of similarity search-based

data analytics tasks from outlier detection, classification,

labeling, and nearest neighbor-based information retrieval

compared to approaches that do not update the distance

metric or use traditional online metric learning methods.

2 PRELIMINARIES

2.1 Distance Metric Learning

The goal of distance metric learning is to learn a distance

function customized to the specific semantics of similarity

between objects of the application. Typical distance met-

ric learning techniques [46–49] achieve this by learning a

Mahalanobis distance metric $D_M$, defined by:

$$D_M(\vec{x}_i, \vec{x}_j) = (\vec{x}_i - \vec{x}_j)^T M (\vec{x}_i - \vec{x}_j) \quad (1)$$

Here $\vec{x}_i$ and $\vec{x}_j$ denote two $d$-dimensional objects, while $M$ represents a $d \times d$ matrix, called the transform matrix. The Mahalanobis metric can be viewed as: (1) applying first a linear transformation from the $d$-dimensional input space $R^{d_o}$ to a new $d$-dimensional input space $R^{d_n}$, represented by $L: R^{d_o} \to R^{d_n}$; followed by (2) computing the Euclidean distance in the new input space as in Equation 2:

$$D_M(\vec{x}_i, \vec{x}_j) = \|L(\vec{x}_i - \vec{x}_j)\|^2 \quad (2)$$

where $L^T L = M$.
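To make the decomposition concrete, the following minimal NumPy sketch (not the authors' code) checks numerically that Eq. 1 and Eq. 2 agree when $M = L^T L$:

```python
import numpy as np

# Minimal sketch: the Mahalanobis metric of Eq. (1) equals the squared
# Euclidean distance after the linear map L, since M = L^T L.
def mahalanobis_sq(x_i, x_j, M):
    diff = x_i - x_j
    return diff @ M @ diff

def mahalanobis_via_L(x_i, x_j, L):
    z = L @ (x_i - x_j)          # transform the difference into the new space
    return z @ z                 # squared Euclidean norm in that space

rng = np.random.default_rng(0)
L = rng.normal(size=(4, 4))
M = L.T @ L                      # positive semidefinite by construction
x_i, x_j = rng.normal(size=4), rng.normal(size=4)
assert np.isclose(mahalanobis_sq(x_i, x_j, M), mahalanobis_via_L(x_i, x_j, L))
```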

$M$ must be positive semidefinite, i.e., have no negative eigenvalues [19], to assure that the distance function will return positive distance values. Distance metric learning learns the transform matrix $M$ based on the user inputs regarding which items are or are not considered to be similar, so-called similarity examples. The learned transform matrix aims to ensure that all user-provided similarity examples are satisfied in the transformed space.

The particular format of the similarity examples may differ from application to application. In kNN classification [11] and outlier detection [25], users may explicitly tell us which class some examples belong to. kNN classification assigns an object to the class most common among its k nearest neighbors discovered from the labeled training examples. Therefore, to improve the classification accuracy, given an example $\vec{x}_i$, distance metric learning aims to learn a distance metric that makes $\vec{x}_i$ share the same class with its k nearest neighbor examples, as shown in Fig. 1. This is also what is leveraged by the neurologists who specify kNN queries in our motivating example about the online labeling system. In distance-based outlier detection [25], outliers are defined as the objects that have fewer than $k$ neighbors within a distance range $d$. Distance metric learning therefore aims to learn a transform matrix such that in the transformed space the distance between an outlier example and its $k$-th nearest neighbor is larger than $d$. In clustering, users may tell us that two example objects (or examples) either must be or cannot be in the same cluster.

In the literature, different distance metric learning meth-

ods have been proposed for various applications, such as

Large Margin Nearest Neighbor (LMNN) [47, 48] for kNN

classification, MELODY [51] for outlier detection, and DML [49]

for clustering.

2.2 Locality Sensitive Hashing

Similarity search on large data relies on indexing, because

many search operations, including π‘˜ nearest neighbor (π‘˜NN)

search, are expensive. In online applications or interactive

systems which have stringent response time requirements,

indexing that produces approximate results tends to be suf-

ficient as long as it comes with accuracy guarantee; in par-

ticular, locality sensitive hashing (LSH) [17, 23, 31, 32], is a

popular method because it is scalable to big data, effective in

high dimensional space, and amenable to dynamic data, etc.

LSH hashes input objects so that similar objects map to

the same β€œbuckets” with high probability. The number of

buckets is kept much smaller than the universe of possible

input items. To achieve this goal, LSH uses a family of hash

functions that make sure the objects close to each other

have a higher probability to collide than the objects far from


each other, so-called locality-preserving hashing. Here we

introduce a popular type of hash function used in LSH:

𝑓 (βˆ’β†’π‘₯ ) =βŒŠβˆ’β†’π‘Ž Β· βˆ’β†’π‘₯ + 𝑏

π‘Š

βŒ‹(3)

whereβˆ’β†’π‘Ž is a random vector whose entries are drawn

from a normal distribution. This ensures 𝑓 (βˆ’β†’π‘₯ ) is locality-preserving [12].𝑏 is a random constant from the range [0,π‘Š ),withπ‘Š a constant that denotes the number of the buckets.

Next, we describe the process of building a classical (K, N)-parameterized LSH index:

β€’ Define an LSH family $F$ of locality-preserving functions $f: R^d \to S$ which map objects from the $d$-dimensional input space $R^d$ to a bucket $s \in S$.

β€’ Given the input parameters $K$ and $N$, define a new LSH family $G$ containing $N$ meta hash functions $g_n \in G$, where each meta function $g_n$ is constructed from $K$ random hash functions $f_1, f_2, ..., f_K$ from $F$. Each $g_n$ corresponds to one hash table.

β€’ Hash objects: given an input object $\vec{x}$, $g_n(\vec{x}) = (f_1(\vec{x}), f_2(\vec{x}), ..., f_K(\vec{x}))$ $(1 \le n \le N)$.

The $N$ hash tables produced by the $N$ meta hash functions correspond to the index structure of the given input data.

The procedure of querying the nearest neighbors of an object $\vec{q}$ from an LSH index is:

β€’ Given an object $\vec{q}$, for each hash function $g_n \in G$, calculate $g_n(\vec{q})$, and return the objects $\vec{x}$ where $g_n(\vec{x}) = g_n(\vec{q})$, that is, the objects $\vec{x}$ that fall into the same bucket as $\vec{q}$.

β€’ Union the results returned from all hash functions $g_n$ ($n$: 1 ... N) as candidates into one set, and find the nearest neighbors of $\vec{q}$ from this candidate set.
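To illustrate the construction and query procedure above, here is a minimal Python/NumPy sketch; the class name LSHIndex and all parameter defaults are our own assumptions rather than part of the paper:

```python
import numpy as np
from collections import defaultdict

# Minimal sketch of a (K, N)-parameterized LSH index using the Euclidean
# hash family f(x) = floor((a . x + b) / W) from Eq. (3). Not the authors' code.
class LSHIndex:
    def __init__(self, dim, K=4, N=8, W=4.0, seed=0):
        rng = np.random.default_rng(seed)
        # one (K x dim) block of random hyperplanes per meta hash function g_n
        self.A = rng.normal(size=(N, K, dim))
        self.B = rng.uniform(0, W, size=(N, K))
        self.W = W
        self.tables = [defaultdict(list) for _ in range(N)]

    def _key(self, n, x):
        # g_n(x) = (f_1(x), ..., f_K(x)), used as the bucket key of table n
        return tuple(np.floor((self.A[n] @ x + self.B[n]) / self.W).astype(int))

    def insert(self, idx, x):
        for n, table in enumerate(self.tables):
            table[self._key(n, x)].append(idx)

    def candidates(self, q):
        # union of the buckets that q falls into across the N hash tables
        out = set()
        for n, table in enumerate(self.tables):
            out.update(table.get(self._key(n, q), []))
        return out
```

A kNN query would then rank the returned candidates by the actual distance metric, exactly as in the procedure above.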

In addition to π‘˜NN queries, LSH is commonly used to

speed up data analytics techniques [40] for which similar-

ity search plays a critical role. Examples include distance or

density-based outliers [4, 6, 25, 37], and density-based or spec-

tral clustering [16, 30]. In these applications, exact results

of similarity search that would require significantly more

retrieval time are not necessarily better than approximate

results with an accuracy guarantee as produced by LSH.

2.3 Problem Statement

We target the scenario where new similarity examples (e.g.,

user labels) continuously become available, requiring an

update to the similarity metric and the associated index.

Definition 2.1. Continuously Adaptive Similarity Search. Given the transform matrix $M_t$ used in the current distance metric $D_{M_t}$ and the current LSH index $H_t$ produced by LSH family $F_t$ w.r.t. distance metric $D_{M_t}$, after receiving a new similarity example $E$ at time step $t$, our goal is that:

Figure 2: The OASIS System

(1) The transform matrix $M_t$ be updated to $M_{t+1}$ such that the new similarity example $E$ is satisfied by the new distance metric $D_{M_{t+1}}$ using $M_{t+1}$; and

(2) The LSH index $H_t$ be updated to $H_{t+1}$ so as to be equivalent to re-hashing the data using a new LSH family $F'$ which is locality-preserving w.r.t. the new distance metric $D_{M_{t+1}}$.

3 THE OVERALL OASIS APPROACH

As shown in Fig. 2, given a similarity example supplied by a

particular application, the continuous metric learning com-

ponent of OASIS first converts it into one or multiple dis-

tance constraints in a standard format (Sec. 4.1). Then, our

incremental learning strategy updates the distance metric to

satisfy these constraints (Sec. 4.2). Taking the new distance

metric as input, the LSH update component determines if the

LSH index has to be updated based on our invariance obser-

vation established in Lemma 5.2 (Sec. 5). If necessary, OASIS

invokes our two-level LSH-based update strategy (Sec. 6) to

update the LSH index according to the new distance metric.

Applications. OASIS can be used to support a broad

range of applications, including but not limited to π‘˜NN

classification [11] used in text categorization and credit rat-

ing [19, 22], distance-based outlier detection [7, 8] to detect

network intrusion or financial frauds [15], and information

retrieval supported by π‘˜NN queries to retrieve documents,

images, or products that are most relevant to the given object.

LSH indexes are often used in these applications to speed up

the neighbor search process as discussed in Sec. 2.2.

In these applications, the new similarity examples might

correspond to the objects mis-classified by the π‘˜NN classifier,

the outliers that are captured recently and believed to be

different from the existing outliers, or a retrieved object not

relevant to the query object. OASIS then adapts the existing

distance metric to avoid these newly observed classification

errors, capture the new type of outliers, or ensure that only

relevant objects are returned. Next, the LSH index will be

updated to effectively support the new queries subsequently

submitted by the users or to sequentially process new data,

as shown in Fig. 2.

4 CONTINUOUS DISTANCE METRIC LEARNING

In this section, we show how different types of similarity ex-

amples in various similarity search-based applications can be


encoded into the distance constraints in a quadruple format

(Sec. 4.1). In Sec. 4.2, we introduce an incremental distance

metric learning strategy using the distance constraints as input. This general method serves diverse applications.

4.1 Modeling the Similarity Examples

In OASIS, a similarity example is converted into one or multiple distance constraints in quadruple format $C = (\vec{u}, \vec{v}, dt, b)$ that place constraints on the distance between two objects. More specifically, in $C$, $dt$ corresponds to the target distance between objects $\vec{u}$ and $\vec{v}$. When the binary flag $b$ is -1, the distance between examples $\vec{u}$ and $\vec{v}$ cannot be larger than $dt$. When $b$ is 1, the distance between $\vec{u}$ and $\vec{v}$ should be larger than $dt$.

Next, we show how to convert different types of similarity

examples in different analytics tasks into distance constraints.

To provide some representative examples, we illustrate the

conversion principles for kNN classifiers, π‘˜NN queries, and

distance-based outliers.

4.1.1 Distance Constraints in kNN Classifier

In a π‘˜NN classifier, given a new exampleβˆ’β†’π‘ž , the new con-

straints can be formed based onβˆ’β†’π‘ž ’s nearest neighbors. The

intuition is that the nearest neighbors ofβˆ’β†’π‘ž should be the

objects that belong to the same class asβˆ’β†’π‘ž . This will improve

the classification accuracy of the π‘˜NN classifier.

First, we define the concept of target neighbor.

Definition 4.1. Given an object $\vec{q}$, its target neighbors, denoted as $N^t(\vec{q})$, correspond to the k nearest neighbors, determined by the current learned distance metric $D_M$, that share the same label $y_i$ with $\vec{q}$ in the labeled example dataset.

The distance constraints are formed based on the nearest neighbor of $\vec{q}$ and its target neighbors as shown in Def. 4.2.

Definition 4.2. Given the new labeled object $\vec{q}$ and its nearest neighbor with a class label distinct from $\vec{q}$'s label, denoted as $N^n(\vec{q})$, for each target neighbor of $\vec{q}$ denoted as $N^t_i(\vec{q}) \in N^t(\vec{q})$, a new constraint $C_i$ is formed as $C_i = (\vec{q}, N^t_i(\vec{q}), D_M(\vec{q}, N^n(\vec{q})), -1)$, where $D_M$ denotes the distance computed using the current learned distance metric.
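As an illustration of Def. 4.2, the hypothetical helper below (its name and argument names are ours) builds one quadruple per target neighbor, using the current metric $D_M$ to set the target distance:

```python
import numpy as np

# Hypothetical helper illustrating Def. 4.2 (names are ours, not the paper's):
# each target neighbor of the mis-labeled object q must end up closer to q
# than q's nearest differently-labeled example N^n(q).
def form_knn_constraints(q, target_neighbors, nearest_other_class, M):
    def dist(a, b):                          # current learned metric D_M, Eq. (1)
        d = np.asarray(a) - np.asarray(b)
        return float(d @ M @ d)
    dt = dist(q, nearest_other_class)        # target distance D_M(q, N^n(q))
    # one quadruple (u, v, dt, b) per target neighbor; b = -1 means "not larger"
    return [(q, t, dt, -1) for t in target_neighbors]
```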

In Lemma 4.3, we show that the new distance metric

learned from these constraints is able to improve the classifi-

cation accuracy of the π‘˜NN classifier.

Lemma 4.3. Given an object $\vec{q}$ previously mis-classified by the kNN classifier, $\vec{q}$ will no longer be mis-classified if the new distance metric parameterized by transform matrix $M_{t+1}$ satisfies the new constraints $C = \{C_1, C_2, ..., C_k\}$ defined by Def. 4.2.

Proof. By the new constraints $C$, the distance between the newly mis-classified object $\vec{q}$ ($\vec{u}_t$) and any of its target neighbors $N^t_i(\vec{q})$ ($\vec{v}_t$) has to be smaller than the distance between $\vec{q}$ and its nearest neighbor with non-matching label $N^n(\vec{q})$ ($dt_t$). Then the target neighbors of $\vec{q}$ are guaranteed to be the k nearest neighbors of $\vec{q}$ in the labeled example set. By the definition of the kNN classifier, $\vec{q}$ belongs to the class of its target neighbors, which share the same class with $\vec{q}$. Therefore, $\vec{q}$ is now guaranteed to be correctly classified. Lemma 4.3 is proven. β–‘

Apply to π‘˜ Nearest Neighbor (π‘˜NN) Queries. The aboveconstraint definition can be equally applied to support π‘˜NN

queries. Same as in the case of π‘˜NN classification, in a π‘˜NN

query, a small set of labeled examples are collected to learn

a distance metric that makes sure an example objectβˆ’β†’π‘ž is

in the same class with its nearest neighbors found in the

example set. The key difference between π‘˜NN queries and

π‘˜NN classification is that inπ‘˜NNqueries the similarity search

is conducted on the large input dataset instead of the labeled

example set, although its distance metric is learned based on

the example set.

4.1.2 Distance Constraints in Distance-based Outliers

As described in Sec. 2.1, in the classical distance-based outlier definition [25], outliers are defined as the objects that have fewer than $k$ neighbors within a distance range $d$. Similar to kNN classification, OASIS forms the distance constraints based on the k nearest neighbors of the newly received example $\vec{q}$ as shown in Def. 4.4.

Definition 4.4. Given the new labeled object $\vec{q}$ and its k nearest neighbors denoted as $KNN(\vec{q})$, for each object $\vec{n}_i \in KNN(\vec{q})$, a new constraint $C_i = (\vec{q}, \vec{n}_i, d, 1)$ is formed if $\vec{q}$ is an outlier. Otherwise, $C_i = (\vec{q}, \vec{n}_i, d, -1)$.
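Analogously, a small sketch of Def. 4.4 under the same quadruple format (hypothetical helper, not the authors' code):

```python
# Sketch of Def. 4.4: constraints from a newly labeled object q and its k
# nearest neighbors; b = 1 pushes the neighbors beyond distance d for an
# outlier example, b = -1 pulls them within d for an inlier example.
def form_outlier_constraints(q, knn, d, is_outlier):
    b = 1 if is_outlier else -1
    return [(q, n_i, d, b) for n_i in knn]
```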

Using the distance metric learned from the constraints formed by Def. 4.4, the distance-threshold outliers can be effectively detected in the newly transformed space. Intuitively, given an object $\vec{q}$ which is an inlier, but was mis-classified as an outlier by the outlier detector using the old distance metric in the old space, it will no longer be mis-classified in the new space. This is because in the new space the distances between $\vec{q}$ and its $k$ nearest neighbors are no larger than $d$. Therefore, $\vec{q}$ now has at least $k$ neighbors. On the other hand, if $\vec{q}$ is an outlier, but was missed by the outlier detector in the old space, it tends to be captured in the new metric space. This is because now the distances between $\vec{q}$ and its $k$ nearest neighbors are larger than $d$. Therefore, it is unlikely that $\vec{q}$ has $k$ neighbors.


4.2 Incremental Learning Strategy

The incremental learning strategy in OASIS takes one distance constraint $C_t = (\vec{u}_t, \vec{v}_t, dt_t, b_t)$ as input. First, it computes the distance $\hat{dt}_t = D_{M_t}(\vec{u}_t, \vec{v}_t)$ based on the current distance metric using the transform matrix $M_t$. It incurs a loss $l(\hat{dt}_t, dt^*_t)$, where $dt^*_t = dt_t(1 + b_t\gamma)$ corresponds to the target distance $dt_t$ weighted by $\gamma \in (0, 1)$. If $b_t$ is -1, then the weighted target distance $dt^*_t$ is smaller than the target distance $dt_t$ specified in $C_t$. Otherwise, $dt^*_t$ is larger than $dt_t$. Therefore, $l(\hat{dt}_t, dt^*_t)$ measures the difference between the current distance and the expected distance between the examples specified in $C_t$.

Then OASIS updates the matrix from $M_t$ to $M_{t+1}$ under the optimization goal of minimizing the total loss of all distance constraints received so far, i.e.,

$$Loss = \sum_t l(\hat{dt}_t, dt^*_t) \quad (4)$$

The goal is to ensure that the constraints seen previously are still satisfied. OASIS uses a typical model for the loss function, namely the squared loss [19]: $l(\hat{dt}_t, dt^*_t) = \frac{1}{2}(\hat{dt}_t - dt^*_t)^2$, though others could be plugged in.

Leveraging the state-of-the-art online learning strategy [13, 41], OASIS solves the problem of minimizing Eq. 4 by learning an $M_{t+1}$ that minimizes a regularized loss:

$$M_{t+1} = \arg\min \; D_{ld}(M_{t+1}, M_t) + \eta\, l(\bar{dt}_t, dt^*_t) \quad (5)$$

In Eq. 5, $D_{ld}(M_{t+1}, M_t)$ is a regularization function given by $D_{ld}(M_{t+1}, M_t) = tr(M_{t+1}M_t^{-1}) - \log\det(M_{t+1}M_t^{-1}) - d$. Here the LogDet divergence [10] is used as the regularization function due to its success in metric learning [13]. $\bar{dt}_t = \delta_t^T M_{t+1} \delta_t$ corresponds to the distance between $\vec{u}_t$ and $\vec{v}_t$ computed using the new transform matrix $M_{t+1}$, where $\delta_t = \vec{u}_t - \vec{v}_t$. $\eta > 0$ denotes the regularization parameter, and $d$ corresponds to the data dimension.

By [38], Eq. 5 can be minimized if:

$$M_{t+1} = M_t - \frac{\eta\,(\bar{dt}_t - dt^*_t)\, M_t \delta_t \delta_t^T M_t}{1 + \eta\,(\bar{dt}_t - dt^*_t)\, \delta_t^T M_t \delta_t} \quad (6)$$

The matrix $M_{t+1}$ resulting from Eq. 6 is guaranteed to be positive definite [24]. This satisfies the requirement of distance metric learning.

Next, we show how to compute $M_{t+1}$ using Eq. 6. First, we have to compute $\bar{dt}_t$. Since $\bar{dt}_t$ is a function of $M_{t+1}$, to compute $\bar{dt}_t$ we have to eliminate $M_{t+1}$ from its computation first. By multiplying Eq. 6 by $\delta_t^T$ on the left and $\delta_t$ on the right, we get:

$$\delta_t^T M_{t+1} \delta_t = \delta_t^T M_t \delta_t - \frac{\eta\,(\bar{dt}_t - dt^*_t)\, \delta_t^T M_t \delta_t \, \delta_t^T M_t \delta_t}{1 + \eta\,(\bar{dt}_t - dt^*_t)\, \delta_t^T M_t \delta_t}$$

Since $\bar{dt}_t = \delta_t^T M_{t+1} \delta_t$ and $\hat{dt}_t = \delta_t^T M_t \delta_t$, we get:

$$\bar{dt}_t = \frac{\eta\, dt^*_t \hat{dt}_t - 1 + \sqrt{(\eta\, dt^*_t \hat{dt}_t - 1)^2 + 4\eta\, \hat{dt}_t^2}}{2\eta\, \hat{dt}_t} \quad (7)$$

Using Eq. 7, we can directly compute $\bar{dt}_t$. Next, $M_{t+1}$ can be computed by plugging Eq. 7 into Eq. 6.

Algorithm 1 Incremental Distance Metric Learning

1: function UpdateMetric(Current metric $M_t$, $C_t = (\vec{u}_t, \vec{v}_t, dt_t, b_t)$, $\gamma$)
2:   $\hat{dt}_t = D_{M_t}(\vec{u}_t, \vec{v}_t)$;
3:   $dt^*_t = dt_t(1 + b_t\gamma)$;
4:   Compute $\bar{dt}_t$ using Eq. 7;
5:   Compute $M_{t+1}$ using Eq. 6;

The main steps of the incremental distance metric learning are shown in Alg. 1.

Time Complexity Analysis. The time complexity of this method is determined by the product operation between a $d$-dimensional vector and a $d \times d$ matrix. It is quadratic in the number of dimensions $d$, but does not depend on the number of distance constraints received so far nor on the number of example objects involved in these constraints.
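A compact NumPy rendition of Alg. 1 (Eqs. 6 and 7) might look as follows; the default values for $\gamma$ and $\eta$ are illustrative assumptions only, not values prescribed by the paper:

```python
import numpy as np

# Minimal sketch of Alg. 1: one incremental metric update per constraint.
def update_metric(M_t, u_t, v_t, dt_t, b_t, gamma=0.2, eta=0.5):
    delta = u_t - v_t
    d_hat = delta @ M_t @ delta                  # current distance \hat{dt}_t
    dt_star = dt_t * (1.0 + b_t * gamma)         # weighted target distance dt*_t
    # Eq. (7): closed form for \bar{dt}_t, the distance under M_{t+1}
    a = eta * dt_star * d_hat - 1.0
    d_bar = (a + np.sqrt(a * a + 4.0 * eta * d_hat ** 2)) / (2.0 * eta * d_hat)
    # Eq. (6): rank-one update of the transform matrix
    coef = eta * (d_bar - dt_star) / (1.0 + eta * (d_bar - dt_star) * d_hat)
    return M_t - coef * (M_t @ np.outer(delta, delta) @ M_t)
```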

5 LOCALITY SENSITIVE HASHING: INVARIANCE OBSERVATION

In this section, we will establish the invariance observation for LSH, stating that in many cases it is not necessary to update the LSH index at all when the distance metric changes. That is, we will show that an LSH index built on the Euclidean distance is sufficient to support similarity search under the learned distance metric as long as the eigenvalues of its transform matrix $M$ fall into a certain range.

For this, we first show that the original LSH index $H$ built using an LSH family $F$ can be considered as an LSH index $H'$ with respect to the learned distance metric $D_M$ built using a different family of hash functions $F'$.

Lemma 5.1. Given an LSH index $H$ w.r.t. Euclidean distance built using a family of locality-preserving hash functions:

$$f(\vec{u}) = \left\lfloor \frac{\vec{a} \cdot \vec{u} + b}{W} \right\rfloor \quad (8)$$

then $H$ can be considered to be an LSH index w.r.t. the learned distance metric $D_M$ with transform matrix $M = L^T L$, built using the following family of hash functions:

$$f'(\vec{u}') = \left\lfloor \frac{\vec{a}' \cdot \vec{u}' + b}{W} \right\rfloor \quad (9)$$

where $\vec{a}' = \vec{a} L^{-1}$ and $\vec{u}' = L\vec{u}$.ΒΉ

ΒΉHere we consider $\vec{a}'$ and $\vec{a}$ as row vectors for ease of presentation.


Proof. Given a transform matrix $M$ in the learned distance metric, we get an $L$ where $M = L^T L$. Replacing $\vec{a}$ with $\vec{a} L^{-1} L$ (since $\vec{a} = \vec{a} L^{-1} L$) in Eq. 8, we get $f(\vec{u}) = \lfloor (\vec{a} L^{-1} L \vec{u} + b) / W \rfloor$. Considering $\vec{a} L^{-1}$ as a new hyperplane $\vec{a}'$, we get $f(\vec{u}) = \lfloor (\vec{a}' L \vec{u} + b) / W \rfloor$. By the definition of the Mahalanobis distance (Sec. 2), $L\vec{u}$ corresponds to the object in the new space transformed from $\vec{u}$. Replacing $L\vec{u}$ with $\vec{u}'$, we get $f'(\vec{u}') = \lfloor (\vec{a}' \vec{u}' + b) / W \rfloor$. It corresponds to a hash function in the transformed space, where $\vec{u}' = L\vec{u}$ is an input object and $\vec{a}' = \vec{a} L^{-1}$ the new hyperplane. Lemma 5.1 is proven. β–‘

Locality-Preserving. By the definition of LSH (Sec. 2.2), if the hash functions are locality-preserving, then the LSH index $H$ is effective in hashing the nearby objects into the same buckets based on the learned distance metric $D_M$. Therefore, if $f'()$ (Eq. 9) is locality-preserving, then the LSH index $H$ built on the Euclidean distance is still effective in supporting the similarity search under $D_M$.

Next, we show that the entries in the hyperplane $\vec{a}'$ of $f'()$ (Eq. 9) correspond to a normal distribution $N(0, I)$ as long as the eigenvalues of the transform matrix $M$ satisfy certain properties. By [12], this assures that the hash function $f'()$ using $\vec{a}'$ as the hyperplane is locality-preserving.

Given a hash table produced by $h_n$ hash functions, we use matrix $A$ to represent the hyperplanes of all hash functions. Therefore, $A$ is an $h_n \times d$ matrix, where $d$ denotes the dimension of the data. Each row corresponds to the hyperplane (hash vector) $\vec{a}$ of one hash function.

Since a valid transform matrix $M$ is always symmetric and positive definite as discussed in Sec. 2.1, it can be decomposed into $M = R^T S R$, where $R^T R = I$ and $S$ is a diagonal matrix formed by the eigenvalues of $M$. Since $M = L^T L$, we get $L = S^{1/2} R$.

$$A' = A L^{-1} = A R^{-1} S^{-1/2} \quad (10)$$

We use $[\vec{e}_1, \vec{e}_2, ..., \vec{e}_d]$ to represent $A R^{-1}$, where $\vec{e}_i$ represents the $i$-th column of $A R^{-1}$. We denote the elements on the diagonal of $S^{-1/2}$ as $[s_1, s_2, ..., s_d]$. We then have $A R^{-1} S^{-1/2} = [s_1\vec{e}_1, s_2\vec{e}_2, ..., s_d\vec{e}_d]$.

Next, we show that each row vector in $A R^{-1} S^{-1/2}$, which represents the hyperplane $\vec{a}'$ of one hash function $f'()$, corresponds to a vector drawn from a normal distribution $N(0, I)$ as long as the eigenvalues of the transform matrix $M$ satisfy certain properties. By [12], this assures that the hash function $f'()$ using $\vec{a}'$ as the hyperplane is locality-preserving.

Lemma 5.2. Statistically, the hyperplane $\vec{a}'$ in Eq. 9 corresponds to a vector drawn from a normal distribution $N(0, I)$ as long as, for every $s_i$ on the main diagonal of $S^{-1/2}$: $0.05 < P(\chi^2_{h_n} < h_n s_i^2) < 0.95$, where the p-value is set as 0.1.

Proof. We first show that each row vector in $A R^{-1}$ obeys a normal distribution $N(0, I)$. Since each row vector in $A$ obeys a normal distribution $N(0, I)$, where $I$ is the identity matrix, each row vector in $A R^{-1}$ obeys a multivariate normal distribution $N(0, \Sigma)$ [18], where $\Sigma = R^{-1}(R^{-1})^T$ is the covariance matrix. Since $R^T R = I$, it is easy to see that $\Sigma = R^{-1}(R^{-1})^T = I$. Thus each row vector in $A R^{-1}$ can be regarded as a vector drawn from a normal distribution $N(0, I)$. Note this indicates that each element in $A R^{-1}$ can be regarded as a sample drawn from a normal distribution $N(0, 1)$. Therefore, each column vector $\vec{e}_i$ in $A R^{-1}$ can also be regarded as a vector drawn from a normal distribution $N(0, I)$.

Next, we show that each column vector $s_i\vec{e}_i$ in $A R^{-1} S^{-1/2}$ can also be regarded as a vector drawn from a normal distribution. This will establish our conclusion that each row vector in $A R^{-1} S^{-1/2}$, which represents the hyperplane $\vec{a}'$ of one hash function $f'()$, can be regarded as a vector drawn from a normal distribution [18].

We use the $\chi^2$ test to determine whether every $s_i\vec{e}_i \in [s_1\vec{e}_1, s_2\vec{e}_2, ..., s_d\vec{e}_d]$ statistically corresponds to a vector drawn from a normal distribution $N(0, I)$ [34]. Given an $h_n$-dimensional vector drawn from a normal distribution $N(0, I)$, the sum of the squares of its $h_n$ elements obeys a $\chi^2$ distribution with $h_n$ degrees of freedom [28]. We already know that $\vec{e}_i$ is a vector drawn from a normal distribution $N(0, I)$. Therefore, as long as the sum of the squares of the elements in $s_i\vec{e}_i$ also obeys a $\chi^2$ distribution with $h_n$ degrees of freedom, statistically $s_i\vec{e}_i$ corresponds to a vector drawn from a normal distribution $N(0, I)$ [34].

First, we calculate the expected value of the sum of the squares of the elements in $s_i\vec{e}_i$. Since $\vec{e}_i$ corresponds to a vector drawn from a normal distribution $N(0, I)$, $E(\sum_{j=1}^{h_n} e_{ij}^2) = h_n$, where $e_{ij}$ denotes the $j$-th element in $\vec{e}_i$. Since $s_i$ is a constant, we get $E(\sum_{j=1}^{h_n} (s_i e_{ij})^2) = s_i^2 E(\sum_{j=1}^{h_n} e_{ij}^2) = h_n s_i^2$.

We set the p-value as 0.1. By [34], if $0.05 < P(\chi^2_{h_n} < h_n s_i^2) < 0.95$, we cannot reject the hypothesis that $s_i\vec{e}_i$ is a vector drawn from a normal distribution. This immediately shows that each row vector $\vec{a}'$ in $A R^{-1} S^{-1/2}$ statistically corresponds to a vector drawn from a normal distribution $N(0, I)$. β–‘

Testing Method. Testing whether Lemma 5.2 holds is straightforward. That is, we first compute the eigenvalues $eg_i$ of the transform matrix $M$. Since $S$ is a diagonal matrix formed by the eigenvalues of $M$, the element $s_i$ on the main diagonal of $S^{-1/2}$ corresponds to $eg_i^{-1/2}$. We then use the $\chi^2$ table [34] to test whether $0.05 < P(\chi^2_{h_n} < h_n s_i^2) < 0.95$. This in fact corresponds to $0.05 < P(\chi^2_{h_n} < h_n\, eg_i^{-1}) < 0.95$. The eigenvalue computation and $\chi^2$ test can be easily done using Python libraries like NumPy and SciPy [35]. As an example, given an LSH index with 3 hash functions, by [34], for every eigenvalue $eg_i$ of $M$, if $0.35 < 3 \times eg_i^{-1} < 7.81$, that is $0.38 < eg_i < 8.57$, then we do not have to update the LSH index.
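A sketch of this test using NumPy and SciPy is given below; the function name and its boolean convention are our own:

```python
import numpy as np
from scipy.stats import chi2

# Sketch of the Lemma 5.2 test: the existing LSH index can be kept if every
# eigenvalue eg_i of M keeps h_n * eg_i^{-1} inside the central 90% of the
# chi^2 distribution with h_n degrees of freedom.
def needs_lsh_update(M, h_n):
    eigvals = np.linalg.eigvalsh(M)             # M is symmetric positive definite
    stats = h_n / eigvals                       # h_n * s_i^2 with s_i = eg_i^{-1/2}
    p = chi2.cdf(stats, df=h_n)
    return not np.all((0.05 < p) & (p < 0.95))  # True -> LSH must be updated
    # e.g., h_n = 3 reproduces the bounds 0.38 < eg_i < 8.57 quoted above.
```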

Analysis. Now we analyze why Lemma 5.2 is often met. By Eq. 5, $M_{t+1} = \arg\min D_{ld}(M_{t+1}, M_t) + \eta\, l(\bar{dt}_t, dt^*_t)$. $l(\bar{dt}_t, dt^*_t)$ represents the difference between the targeted distance and the distance computed using $M_{t+1}$. Since many solutions exist that can make $l(\bar{dt}_t, dt^*_t)$ close to 0 [38], the solution of Eq. 5 is mainly determined by $D_{ld}(M_{t+1}, M_t)$. Initially, $M_t = I$ when $t = 0$. Accordingly, $D_{ld}(M_{t+1}, M_t) = D_{ld}(M_{t+1}, I) = tr(M_{t+1}) - \log\det(M_{t+1}) - d$. Note that $\det(M_{t+1}) = \prod_{i=1}^{d} eg_i$ and $tr(M_{t+1}) = \sum_{i=1}^{d} eg_i$, where $eg_i$ denotes an eigenvalue of $M_{t+1}$. By [45], $D_{ld}$ is minimized when $eg_1 = eg_2 = ... = eg_d = \sqrt[d]{\det(M_{t+1})}$. Since in distance metric learning $d$ usually is not small [47], $\sqrt[d]{\det(M_{t+1})}$ is close to 1 [45]. In a similar manner, we can show that $M_{t+1}$ has eigenvalues similar to those of $M_t$ when $t > 0$. By induction, the eigenvalues of $M_{t+1}$ tend to be close to 1, and thus tend to satisfy the requirement in Lemma 5.2.

6 THE UPDATE OF LSH

The LSH index should be updated when the distance met-

ric changes, and Lemma 5.2 is not satisfied. To assure ef-

ficiency of similarity search, we now propose a fast LSH

update mechanism. It only re-hashes the objects that have

a high probability of being moved to a different bucket due

to an update of the distance metric. It includes two key in-

novations: (1) mapping the problem of finding the objects

that are strongly impacted by the update of the distance

metric to an inner product similarity search problem; and

(2) quickly locating such objects using an auxiliary inner

product similarity-based LSH index.

Measure the Influence of Distance Metric Change. Given a learned distance metric $D_M$ with transform matrix

$M = L^T L$, by Sec. 2.1, computing the distance using the

learned distance metric is equivalent to computing Euclidean

distance in an input space transformed by 𝐿. Therefore, build-

ing an LSH index that effectively serves the similarity search

using the learned distance metric𝐷𝑀 corresponds to hashing

the input objects in the transformed space.

Suppose we have two distance metrics w.r.t. transform matrices $M_{old} = L_{old}^T L_{old}$ and $M_{new} = L_{new}^T L_{new}$. Then an LSH index corresponding to each of these two metrics respectively can be built using the following hash functions:

$$f(\vec{u})_{old} = \left\lfloor \frac{\vec{a} \cdot L_{old}\vec{u} + b}{W} \right\rfloor, \quad f(\vec{u})_{new} = \left\lfloor \frac{\vec{a} \cdot L_{new}\vec{u} + b}{W} \right\rfloor$$

Then, given an object $\vec{u}$ as input, the difference between the outputs produced by the two hash functions can be computed as:

$$f(\vec{u})_{diff} = \left\lfloor \frac{\vec{a} \cdot (L_{diff}\vec{u})}{W} \right\rfloor \quad (11)$$

where $L_{diff} = L_{old} - L_{new}$ represents the delta matrix between $L_{old}$ and $L_{new}$. $f(\vec{u})_{diff}$ represents how much the change of the distance metric impacts $\vec{u}$ in the LSH index.

Obviously, it is not practical to find the objects that are strongly impacted by the change of the distance metric by computing $f(\vec{u})_{diff}$ for each input object, since this would be as expensive as re-hashing all objects in the index.

An Inner Product Similarity Search Problem. As shown in Eq. 11, the value of $f(\vec{u})_{diff}$ is determined by $\vec{a} \cdot (L_{diff}\vec{u})$. Since $\vec{a} \cdot (L_{diff}\vec{u}) = \vec{a}^T L_{diff} \cdot \vec{u}$, the value of $f(\vec{u})_{diff}$ is thus determined by $\vec{a}^T L_{diff} \cdot \vec{u}$, that is, the inner product between $\vec{a}^T L_{diff}$ and $\vec{u}$. The larger the inner product is, the larger $f(\vec{u})_{diff}$ is. Therefore, the problem of finding the objects $\vec{u}$ that should be re-hashed can be converted into the problem of finding the objects that have a large inner product with a query object $\vec{q} = \vec{a}^T L_{diff}$, where $\vec{q}$ is constructed from the hyperplane $\vec{a}$ used in the hash function and the delta matrix $L_{diff}$.
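A minimal sketch of this reduction (illustrative only): the bucket shift of Eq. 11 for an object $\vec{u}$ is just an inner product with the constructed query $\vec{q}$:

```python
import numpy as np

# Sketch of Eq. (11): the shift of object u under the metric change is driven
# by the inner product between q = a^T L_diff and u, so the strongly impacted
# objects are exactly those with a large inner product with q.
def bucket_shift(a, L_old, L_new, u, W):
    L_diff = L_old - L_new               # delta matrix as defined above
    q = a @ L_diff                       # query object for inner-product search
    return int(np.floor((q @ u) / W))    # f(u)_diff
```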

By [44], finding the objects $\vec{u}$ that have a large inner product with the object $\vec{q}$ corresponds to an inner product similarity search problem. This problem can be efficiently solved if we have an LSH index that effectively supports inner product similarity search. Whenever the LSH has to be updated, to find the objects that have to be re-hashed, we thus only need to construct the query object $\vec{q}$ and then submit a similarity search query to the inner product-based LSH index.

Next, we first present a data augmentation-based LSH

mechanism to efficiently support inner product similarity

search. We then introduce a multi-probe based strategy that

effectively leverages the output of the inner product similar-

ity search to find the objects that have to be updated.

6.1 Inner Product Similarity LSH

The key insight of this LSH mechanism is that, by augmenting the query object $\vec{q}$ and the input object $\vec{u}$ with one extra dimension, the inner product similarity search can be efficiently supported by leveraging any classical cosine similarity-based LSH.

First, consider the cosine similarity defined in [43] as:

$$cos(\vec{q}, \vec{u}) = \frac{\vec{q} \cdot \vec{u}}{\|\vec{q}\| \|\vec{u}\|} \quad (12)$$

The inner product similarity [36, 42] can be expressed as:

$$inner(\vec{q}, \vec{u}) = \vec{q} \cdot \vec{u} = \frac{\vec{q} \cdot \vec{u}}{\|\vec{q}\| \|\vec{u}\|} \times \|\vec{q}\| \times \|\vec{u}\| \quad (13)$$

By Eq. 12 and 13, we get:

$$inner(\vec{q}, \vec{u}) = cos(\vec{q}, \vec{u}) \times \|\vec{q}\| \times \|\vec{u}\| \quad (14)$$

In our OASIS, the inner product similarity search is used to search for the neighbors of the query object $\vec{q}$. During one similarity search, Eq. 14 is repeatedly used to compute the distance between $\vec{q}$ and different input objects $\vec{u}$. Therefore, for our similarity search purpose, $\|\vec{q}\|$ can be safely ignored from Eq. 14, because it does not influence the relative affinity from different $\vec{u}$ to $\vec{q}$.

However, the search results are influenced by $\|\vec{u}\|$. Even if an object $\vec{u}$ has a high cosine similarity with the query object $\vec{q}$, it does not necessarily mean that $\vec{u}$ has a high inner product similarity with $\vec{q}$, because of the varying $\|\vec{u}\|$ of different $\vec{u}$.

We now propose to fill this gap between the inner product similarity and cosine similarity by appending one extra dimension to the query object $\vec{q}$ and the input object $\vec{u}$ as shown below:

$$\vec{q}' = [\vec{q}, 0], \quad \vec{u}' = [\vec{u}, \sqrt{1 - \|\vec{u}\|^2}]$$

Next, we show that after this data augmentation, the inner product similarity $inner(\vec{q}, \vec{u})$ can be computed based on the cosine similarity between the augmented $\vec{q}'$ and $\vec{u}'$.

Lemma 6.1.

$$inner(\vec{q}, \vec{u}) = \vec{q} \cdot \vec{u} = cos(\vec{q}', \vec{u}') \times \|\vec{q}\| \quad (15)$$

Proof. Using $\vec{q}'$ and $\vec{u}'$ in the computation of cosine similarity, we get $cos(\vec{q}', \vec{u}') = \frac{\vec{q} \cdot \vec{u}}{\|\vec{q}\| \|\vec{u}'\|}$. Since $\|\vec{u}'\| = \|[\vec{u}, \sqrt{1 - \|\vec{u}\|^2}]\| = 1$, we get $cos(\vec{q}', \vec{u}') = \frac{\vec{q} \cdot \vec{u}}{\|\vec{q}\|}$. In turn, we get $inner(\vec{q}, \vec{u}) = \vec{q} \cdot \vec{u} = cos(\vec{q}', \vec{u}') \times \|\vec{q}\|$. β–‘

By Lemma 6.1, if we build a cosine similarity LSH index for

the augmented objects, then the original objects that fall into

the same bucket will have a high inner product similarity.

Therefore, our data augmentation strategy ensures that

any cosine similarity LSH index can be effectively used to

support inner product similarity search. In this work, we

adopt the method introduced in [2].
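The augmentation itself is a one-liner per object; the sketch below assumes the data has been rescaled so that $\|\vec{u}\| \le 1$ (needed for the extra coordinate to be real) and verifies Lemma 6.1 on a random example:

```python
import numpy as np

# Sketch of the augmentation in Sec. 6.1 (not the authors' code).
def augment_query(q):
    return np.append(q, 0.0)

def augment_object(u):
    # assumes ||u|| <= 1 so the extra coordinate is a real number
    return np.append(u, np.sqrt(max(0.0, 1.0 - u @ u)))

rng = np.random.default_rng(1)
q = rng.normal(size=5)
u = rng.normal(size=5)
u = 0.5 * u / np.linalg.norm(u)                        # rescale so ||u|| <= 1
qa, ua = augment_query(q), augment_object(u)
cos_sim = (qa @ ua) / (np.linalg.norm(qa) * np.linalg.norm(ua))
assert np.isclose(cos_sim * np.linalg.norm(q), q @ u)  # Lemma 6.1
```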

6.2 Cosine Similarity LSH-based Update

Accuracy vs. Efficiency. OASIS re-hashes the objects that have a large inner product with the query object $\vec{q}$. These objects are produced using a similarity query conducted on the inner product LSH introduced above. Therefore, the cost of adapting the LSH to fit the new distance metric depends on the number of objects returned by the similarity query. If too many objects were to be returned, then the update would be expensive. On the other hand, returning too few objects tends to miss the objects that should have changed their buckets in the new LSH. Hence it risks sacrificing accuracy in supporting similarity search using the new distance metric.

In this work, we tackle this delicate trade off by designing a

multi-probe LSH [31] based approach. This approach returns

as few objects as possible that could potentially change their

bucket membership in the new LSH, while still ensuring the

accuracy of the new LSH in supporting similarity search.

Multi-probe LSH. As shown in Fig. 3, compared to the traditional LSH solution, a multi-probe LSH [31] divides each large bucket into $m$ small buckets. Then at query time, instead of probing the one single bucket $b$ that the query object falls into, it probes the $m$ small buckets. Multi-probe tends to have a higher accuracy for the nearest neighbor search task than the traditional LSH [31].

As an example, in Fig. 3, the traditional LSH returns $n_1$ and $n_3$ as the $k$ nearest neighbors of $\vec{q}$ ($k$ = 2 in this case). However, in fact, $n_2$ is closer to $\vec{q}$ than $n_3$. In this case, $n_2$ is missed, because it falls into a nearby bucket. Multi-probe LSH (in this case $m$ = 3) instead correctly returns $n_1$ and $n_2$ as the nearest neighbors of $\vec{q}$, because it also probes the two neighboring buckets. As can be observed, the larger $m$ is, the higher the accuracy of the nearest neighbor search becomes. However, a larger $m$ also increases the search cost.

Figure 3: Multi-probe VS traditional LSH

Figure 4: Avoidance of Additional Error in kNN Search

Avoidance of Additional Error in kNN Search. The observation here is that, using multi-probe, even if we do not re-hash an object that should be moved to a different bucket from the old LSH to the new LSH w.r.t. the newly learned distance metric, often this does not impact the accuracy of kNN search using the new distance metric, as shown below.

Given one input object $\vec{u}$, assume it is hashed into bucket $b_{old}$ in the old LSH index $H$, while it should have been hashed into bucket $b_{new}$ in the new LSH $H'$. Therefore, $\vec{u}$ moves $e = |b_{old} - b_{new}|$ buckets due to the change of the distance metric, where $e$ can be computed using Eq. 11. For example, in Fig. 4, object $\vec{u}$ moves two buckets. Suppose $\vec{u}$ is a nearest neighbor of query $\vec{q}$. $\vec{q}$ should be hashed into bucket $b_{new}$ in $H'$ per the property of LSH. Suppose $\vec{u}$ is not re-hashed and thus remains in bucket $b_{old}$ even in the new LSH $H'$. In this case, if $m \ge 2 \times e + 1$, then the multi-probe strategy is still able to find $\vec{u}$ as the nearest neighbor of $\vec{q}$. As shown in Fig. 4, if we set $m$ as $5 = 2 \times 2 + 1$, then $\vec{u}$ will still be correctly captured as the nearest neighbor of $\vec{q}$, because multi-probe also considers the objects in the 2 nearby buckets to the left of $b_{new}$ as candidates.

Search the To-be-updated Objects. Based on this observation, we design a search strategy that, by only examining the objects in a small number of buckets, efficiently locates the objects that could introduce additional error in the approximate kNN search if not re-hashed in the new LSH.

As shown in Alg. 2, given the delta matrix $L_{diff} = L_{new} - L_{old}$ and the parameter $m$ of multi-probe, we first construct the query object $\vec{q}$ as $\vec{a}^T L_{diff}$ (Line 2). Then we search for the to-be-re-hashed objects starting from the bucket $b$ of the inner product LSH that $\vec{q}$ falls into. For each object $\vec{u}$ in bucket $b$, using Eq. 11 we compute the number of buckets that $\vec{u}$ moves, denoted as $e$ (Lines 7-9). $\vec{u}$ will be re-hashed if $e$ is larger than $\frac{m-1}{2}$ (Lines 10-11), since otherwise additional error may be introduced in the kNN search. If the $e$ averaged over all objects in bucket $b$, denoted as $avg(e)$, is larger than $\frac{m-1}{2}$, the buckets $b_n$ nearby $b$ are searched iteratively until the $avg(e)$ of a bucket $b_n$ falls under $\frac{m-1}{2}$ (Lines 13-17). These objects are then re-hashed using $f(\vec{u})_{new} = \lfloor (\vec{a} \cdot L_{new}\vec{u} + b) / W \rfloor$.

Algorithm 2 Locate To-be-updated Objects

1: function locateUpdate(Delta matrix $L_{diff}$, numberOfProbes $m$, hash vector $\vec{a}$, hash constant $W$)
2:   $\vec{q} = \vec{a}^T \cdot L_{diff}$;
3:   list[] candidates = null; i = 1;
4:   list[] buckets = buckets.add(H(q));
5:   while ($i \le \frac{m-1}{2}$) do
6:     eSum = 0; count = 0;
7:     for each bucket $b$ in buckets do
8:       for each object $\vec{u}$ in bucket $b$ do
9:         $e = \lfloor \vec{a} \cdot (L_{diff}\vec{u}) / W \rfloor$;
10:        if $e > \frac{m-1}{2}$ then
11:          candidates.add($\vec{u}$);
12:        eSum += e; count++;
13:    avg(e) = eSum / count;
14:    if $avg(e) > \frac{m-1}{2}$ then
15:      buckets.addNeighbor(H(q), i); i++;
16:    else
17:      return candidates;

7 EXPERIMENTAL EVALUATION

7.1 Experimental Setup and Methodologies

Experimental Setup. All experiments are conducted on a

LINUX server with 12 CPU cores and 512 GB RAM. All code

is developed in Python 2.7.5.

Datasets. We evaluate our OASIS using both real datasets

and synthetic datasets.

Real Datasets. We use four real datasets publicly available, with various dimensions and sizes: PageBlock [33],

KDD99 [39], CIFAR10 [26], and HEPMASS, to evaluate how

OASIS performs on different analytics tasks.

PageBlock contains information about different types

of blocks in document pages. There are 10 attributes in each

instance. PageBlock contains 5,473 instances, with 560 in-

stances that are not text blocks labeled as class 1 and the

other text instances labeled as class 2. PageBlock is used to

evaluate the performance of OASIS for π‘˜NN classification.

CIFAR10 [26] is a popular benchmark dataset for image

classification. It includes 10 classes and 60,000 training im-

ages. All images are of the size 32Γ—32. Following the typical

practice [14], we first use unsupervised deep learning to ex-

tract a 4,096 dimensional feature vector to represent each

raw image. We then perform PCA to further reduce the di-

mension of the feature vectors to 100 dimensions. CIFAR10

is used in our labeling evaluation.

KDD99 corresponds to network intrusion detection data

curated as a benchmark dataset for outlier detection studies.

There are 24 classes of intrusion connections and normal

connections. It contains in total 102,053 instances with 3,965

intrusion connections labeled as outliers. Each instance has

35 attributes plus the class label. Given KDD99 is an outlier

benchmark, we use it to evaluate the effectiveness of OASIS

in outlier detection.

HEPMASS was generated in high-energy physics experi-

ments for the task of searching for the signatures of exotic

particles. There are 10,500,000 instances in HEPMASS. Each

instance has 28 attributes. It is composed of two classes of

instances. 50% of the instances correspond to a signal process.

The other instances are a background process.

Synthetic Datasets. Since most public data sets with la-

bels are relatively small, we use synthetic datasets to evaluate

the scalability of OASIS, in particular the performance of our

LSH update strategy. We generate synthetic datasets using

scikit-learn 0.19.1. Each synthetic dataset has 50 attributes.

The largest dataset contains 100 million instances. The in-

stances are divided into 2 classes with 50 million instances

each. The data in each class is skewed and can be grouped

into multiple clusters.
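A sketch of this generation step with scikit-learn is shown below; the parameter values are illustrative and shown at a reduced scale, since the exact generator settings are not spelled out here.

from sklearn.datasets import make_classification

# two classes, 50 attributes, multiple clusters per class (illustrative settings)
X, y = make_classification(n_samples=100000,        # scaled down; up to 100 million in the experiments
                           n_features=50,
                           n_informative=30,
                           n_classes=2,
                           n_clusters_per_class=5,
                           random_state=0)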

Experimental Methodology & Metrics. We evaluate the

effectiveness and efficiency of our OASIS approach in four


Figure 5: kNN Classification: Accuracy and CPU Time (PageBlock). (a) CPU time: varying K; (b) Accuracy: varying K.

Figure 6: Outlier Detection: F1 Score and CPU Time (KDD99). (a) CPU time: varying K; (b) F1 score: varying K.

Figure 7: Labeling: Accuracy and CPU Time (CIFAR10). (a) CPU time: varying K; (b) Accuracy: varying K.

Figure 8: kNN Query: Accuracy and CPU Time (HEPMASS). (a) CPU time: varying K; (b) Accuracy: varying K.

different data analytics tasks, including labeling, kNN classification, outlier detection, and data exploration supported by kNN queries.

Effectiveness. When evaluating kNN classification, we measure the accuracy of the prediction. When evaluating outlier detection, we measure the F1 score. When evaluating labeling, we measure the accuracy of the automatically produced labels. When evaluating the kNN query, we measure the accuracy of the returned results.
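For clarity, these metrics can be computed as in the small sketch below (y_true and y_pred are hypothetical placeholders for ground-truth labels and the predictions of the respective task).

from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 2, 2, 1, 2]                 # hypothetical ground truth
y_pred = [1, 2, 1, 1, 2]                 # hypothetical predictions

accuracy = accuracy_score(y_true, y_pred)          # kNN classification, labeling, kNN query
f1 = f1_score(y_true, y_pred, pos_label=1)         # outlier detection (treating class 1 as the outlier class)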

We compare OASIS against three alternative approaches: (1) a baseline approach (No Update) that does not update the distance metric; (2) a static update approach (Static Update) that re-learns the transform matrix from scratch each time a new example is collected; it repeatedly applies the most popular distance metric learning methods with respect to different tasks [47, 49, 51] and is expected to produce the distance metric that best fits the similarity examples seen over time; and (3) an online metric learning method, LEGO [24].

Efficiency. Efficiency is evaluated by the running time.

We measure the running time of our continuous metric learn-

ing method and our LSH update method separately.

As in the effectiveness evaluation, we compare the effi-

ciency of our incremental metric learning method in OASIS

against (1) No Update, (2) Static Update, and (3) LEGO.

For evaluating our LSH update method in OASIS, we mea-

sure the total running time of similarity searches plus the

update of the LSH index. We compare against two alternative

approaches: (1) similarity search by linear scan without the

support of LSH (Linear Scan); and (2) similarity search with

the LSH rebuilt from scratch whenever the distance metric

changes (Rebuild LSH). We use the large HEPMASS dataset

and synthetic datasets to evaluate the LSH update method

of OASIS. The other three real datasets are small and thus

not appropriate for this scale measurement experiment.

Evaluation of the Proposed Inner Product LSH. We

also evaluate our proposed inner product LSH by measuring

the recall and query time.

7.2 Evaluation of kNN Classification

When supporting kNN classification, OASIS starts by using Euclidean distance as the initial distance metric. If a testing object is misclassified, it is used as a new example that triggers the update process of OASIS, namely updating the distance metric model and the LSH. Static Update re-learns the distance metric using LMNN [47], the most popular distance metric learning method used in kNN classifiers.

Evaluation of Classification Accuracy. We evaluate

how OASIS improves the accuracy of π‘˜NN classification

using the PageBlock data set. Initially, the labeled training

set contains 1,000 instances. We vary the parameter K, the most important parameter in the kNN classifier, from 3 to 10.

As shown in Fig. 5(b), OASIS outperforms No Update in all

cases because of the progressively enriched labeled dataset

and the continuous update of the distance metric model,

while No Update always uses the initially labeled dataset and the fixed distance metric model, which becomes stale over time. Further, the accuracy of OASIS is very close to Static Update. Static Update re-learns the distance metric from scratch


every time a new example is acquired; therefore, its

performance is expected to be the best. This shows that the

distance metric model incrementally learned by OASIS is

close to the optimal model, while being much more efficient

than Static Update (see Fig. 5(a)). LEGO also incrementally up-

dates the distance metric model. However, it performs worse

than OASIS in all cases, although better than No Update.

This is as expected, because when forming the new distance

constraint for a new received similarity example, LEGO first

computes the pairwise distances among all similarity ex-

amples received so far and uses the 5th percentile distance

as the target distance. Using a target distance computed by

this hard-coded rule, given one particular similarity example,

its neighbors in the same class are not necessarily closer

than its non-matching neighbors, potentially introducing

classification errors in LEGO. In contrast, OASIS computes

a customized target distance for each example that ensures

this similarity example will no longer be misclassified,

as described in Sec. 4.1.1.
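The two ways of setting the target distance can be contrasted with the small sketch below (illustrative names and toy data; the exact constraint construction of OASIS is given in Sec. 4.1.1 and is not reproduced here).

import numpy as np
from scipy.spatial.distance import pdist

examples = np.random.rand(200, 10)                 # similarity examples seen so far (toy data)

# LEGO's hard-coded rule: one global target, the 5th percentile of all pairwise distances
lego_target = np.percentile(pdist(examples), 5)

# per-example flavor: tie the target to the new example's own neighborhood,
# e.g. the distance to its k-th nearest matching neighbor (k = 5 is a hypothetical choice)
new_example = np.random.rand(10)
dists = np.linalg.norm(examples - new_example, axis=1)
per_example_target = np.sort(dists)[4]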

Evaluation of Efficiency. As shown in Fig. 5, OASIS significantly outperforms Static Update by up to 3 orders of magnitude

in updating the distance model. This is because the incremen-

tal update strategy has a time complexity that is quadratic

in the dimensionality of the data and is not dependent on

the number of examples acquired so far. While Batch Update

and Static Update both have to re-learn the distance metric

model from scratch. Apart from the time complexity which

is quadratic in the size of the labeled dataset and in number

of data dimensions, learning the distance metric model from

scratch also consumes a huge amount of memory. There-

fore, it tends to fail due to an out of memory error when the

number of examples is large. Our OASIS instead requires

near-zero extra memory besides the transform matrix.

OASIS is up to 100x faster than LEGO in updating the

distance metric. This is because LEGO has to be aware of all

pairwise distances among all examples seen so far in each

update, while OASIS only needs the kNN of the new example

to produce the distance constraints.

7.3 Evaluation of Outlier Detection

We use the KDD99 dataset to evaluate the performance of

OASIS on supporting distance-based outliers. If an object is

mis-classified, we use the strategy discussed in Sec. 4.1.2 to

produce the distance constraints and then update the dis-

tance metric. For Static Update, we insert the newly acquired

outlier or inlier examples into the training set and re-learn

the distance metric using MELODY [51] – the most recent

distance metric learning-based outlier detection method. Ini-

tially the training set has 1,000 examples. Similar to OASIS, LEGO also updates the distance metric online when an object gets misclassified, using the constraints produced by

its own strategy.

Evaluation of F1 Score. We evaluate how OASIS improves

the accuracy of outlier detection by continuously updating

the distance metric. The input parameter K is varied in this case. As shown in Fig. 6(b), OASIS outperforms Static Update in F1 score by about 10%. Note that since Static Update re-learns the distance metric from scratch each time a new exam-

this is not the case. This indicates that the general incremen-

tal update strategy used in OASIS is very effective, sometimes

even better than the distance metric learning method cus-

tomized to the analytics tasks. The F1 scores of No Update and LEGO are much lower. The reason that LEGO performs

poorly is that it uses the 5th percentile pairwise distance

as target distance. The learned distance metric using this

small target distance tends to make the outliers close to their

neighbors. This is not suitable for outlier detection, since

based on the general principle of outlier detection, outliers

correspond to the objects that have large distances to their

neighbors [15].

Evaluation of Efficiency. As shown in Fig. 6(a), OASIS is

up to 5 orders of magnitude faster than Static Update and 3

orders of magnitude faster than the online algorithm LEGO.

7.4 Evaluation of Labeling

We use the CIFAR10 dataset to evaluate the effectiveness

of OASIS at supporting labeling applications. We randomly

select 100 images as an initial labeled data set to learn a

distance metric using LMNN [47]. Then we randomly select

3,000 images as a labeled query set and consider the rest of

CIFAR10 as an unlabeled set. We then propagate labels to

the unlabeled set using kNN queries following the OASIS

procedure described in Sec. 1.

Evaluation of Accuracy. The labeling accuracy is mea-

sured as the ratio of the number of correctly labeled objects

to all objects labeled by propagation. As shown in Fig. 7(b), the accuracy of OASIS is comparable to the Static Update solution, and higher than No Update and LEGO. Note that when k increases, the accuracy of the system increases even further. This property is particularly useful in labeling, because a larger k

indicates that more labels can be automatically produced.

Evaluation of Efficiency. As shown in Fig. 7(a), OASIS is

2 orders of magnitude faster than Static Update and LEGO. The processing times of Static Update and LEGO are close to each other.

7.5 Evaluation of kNN Queries

The results are shown in Fig. 8. Please refer to the extended

version [3] of this manuscript for the analysis.


Figure 9: Recall metric for kNN query with LSH update: varying K (HEPMASS dataset). (a) k=10; (b) k=50; (c) k=100; (d) Average.

Figure 10: Recall metric for kNN query with LSH update: varying K (Synthetic dataset). (a) k=10; (b) k=50; (c) k=100; (d) Average.

Figure 11: CPU time of LSH update with varying dataset sizes. (a) HEPMASS dataset; (b) Synthetic dataset.

Figure 12: Inner Product LSH: varying K (HEPMASS). (a) Recall: varying K; (b) CPU time: varying dataset size.

7.6 Evaluation of LSH Updates

We use the largest real dataset HEPMASS and large synthetic

datasets to evaluate both the effectiveness and efficiency of

our LSH update method. For effectiveness, we measure the

recall of the exact kNN among the candidates returned from

LSH. For efficiency, we measure the overall processing time

including the time for updating LSH and for kNN search. 100

distance metric updates are triggered in each case. Each LSH

index uses 3 hash tables.

Effectiveness. The recall is measured for 100 queries. We

vary the parameter k (k = 10, 50, and 100) in the kNN queries. We compare the LSH update method in OASIS against Rebuild, which re-hashes all objects whenever the distance metric changes.
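The recall metric used here can be computed as in the sketch below (hypothetical names; lsh_candidate_ids stands for whatever candidate set the index returns for a query).

import numpy as np

def knn_recall(data, query, lsh_candidate_ids, k):
    # fraction of the exact k nearest neighbors that appear among the LSH candidates
    dists = np.linalg.norm(data - query, axis=1)
    exact_knn = set(np.argsort(dists)[:k])
    return len(exact_knn & set(lsh_candidate_ids)) / float(k)

# averaged over the 100 queries used in the experiments, e.g.:
# avg_recall = np.mean([knn_recall(data, q, index.query(q), k) for q in queries])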

Fig. 9 and 10 show how the recall of our LSH update

method and Rebuild change over time during the update

process. The two curves w.r.t. LSH update and Rebuild look

similar. In addition, we also measure the average recall. As

shown in Figs. 9(d) and 10(d), the average recall of LSH update

is only slightly lower than Rebuild, confirming the effective-

ness of our LSH update strategy.

Efficiency. We compare our LSH update with Linear Scan, which does not use an index, and with Rebuild LSH. We measure the total processing time of 1,000 kNN queries on each dataset.

Real HEPMASS dataset. We vary the sizes of the sub-

sets of HEPMASS used in the evaluation from 100k to 10M.

As shown in Fig. 11(a), our LSH update method consistently

outperforms Rebuild LSH by 3 to 4 orders of magnitude. This

is because our LSH update strategy avoids the update of the

LSH index in most cases. Further, in the rare case of an actual

update, only a small number of objects are re-hashed. Our

LSH update method also consistently outperforms Linear Scan by 2 to 3 orders of magnitude, because the time complexity of kNN search is reduced from linear to sub-linear by LSH. Rebuild LSH is even slower than Linear Scan, although it also uses the sub-linear time complexity approximate kNN

search. This is due to the costs of having to re-hash all objects

in each update. This update overhead outweighs the query

time saved by the approximate kNN search.


Synthetic Datasets. The results on the synthetic datasets are similar to those on the HEPMASS dataset, as shown in Fig. 11(b). When the dataset grows to 100M instances, our method outperforms the alternative methods by an even larger margin. This con-

firms that our method is scalable to large datasets.

7.7 Evaluation of Inner Product LSH

We compare our inner product LSH, used in OASIS to quickly find the to-be-rehashed objects, against two state-of-the-art inner product LSH methods, ALSH [42] and Hyperplane [36], using the real HEPMASS dataset.

Following the typical practice of evaluating approximate

similarity search [29], we measure both recall and query

processing time. For this, we configure all LSH methods to

have a similar response time by first tuning the parameters on

a small sample dataset (10k) and then evaluating the accuracy

on larger datasets. As confirmed in Fig. 12(b), after parameter

tuning, these methods indeed have a similar query time when

increasing the sizes of the datasets (with our OASIS slightly

faster).

As shown in Fig. 12(a), the OASIS inner product LSH significantly outperforms the alternative methods ALSH and Hyperplane. When varying the parameter K, the recall of OASIS

LSH is consistently higher than 0.85, while the recall of ALSH

and Hyperplane remains below 0.2. This is because our data

augmentation mechanism ensures that inner product similar-

ity search can be supported using the cosine similarity LSH,

while cosine similarity LSH is well studied in the literature

and achieves high recall [2].
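To illustrate the general idea (not the exact augmentation used by OASIS), one standard transformation in the spirit of [36], which lets a cosine-similarity LSH answer maximum inner product queries, is sketched below.

import numpy as np

def augment_data(X):
    # scale so every vector has norm <= 1, then append a component that
    # makes each augmented vector have exactly unit norm
    X = X / np.linalg.norm(X, axis=1).max()
    extra = np.sqrt(np.maximum(0.0, 1.0 - np.sum(X ** 2, axis=1, keepdims=True)))
    return np.hstack([X, extra])

def augment_query(q):
    # unit-normalize and append a zero; the cosine similarity between the
    # augmented query and an augmented data vector is then monotone in q . x
    return np.append(q / np.linalg.norm(q), 0.0)

X_aug = augment_data(np.random.rand(1000, 28))
q_aug = augment_query(np.random.rand(28))
scores = X_aug.dot(q_aug)        # ranking by this equals ranking by the inner product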

8 RELATED WORK

Online Learning. Some prior research [9, 24, 41] focused

on online metric learning. [24] solves the optimization problem using the exact gradient. This avoids the problem

that traditional online learning approaches could produce

non-definite matrices [13] because of the approximate so-

lution for the gradient of the loss function. However, [24]

cannot ensure that the previously mis-classified objects are

no longer mis-classified in the new transformed space, be-

cause it uses the 5th percentile of the pairwise distances

among all training examples to set the target distance. We

solve this problem by forming the new constraints on the

optimization function that take the distance relationship

between the matching neighbors of the new item and its

non-matching neighbors into consideration.

[9] proposed a method that incrementally learns a good

function to measure the similarity between images. [41] fo-

cuses on sparse data like image datasets similar to [9]. Both

[9] and [41] cannot guarantee the transform matrix M to

be positive semidefinite. Therefore, they cannot be directly

applied to distance metric learning.

Incremental LSH Update. In [24], a method is sketched

to support the update of cosine similarity-based LSH in a lin-

early transformed space. It repeatedly samples some objects

to re-hash whenever the transform changes and periodically

rebuilds the LSH from scratch to ensure its effectiveness. In

this work, we prove that the LSH built in the original input

feature space is equally effective in supporting similarity

search in the transformed input feature space as long as

the transform matrix satisfies Lemma 5.2. Therefore, OASIS

avoids updating LSH in most cases.

LSH for Inner Product Similarity Search. Maximum inner product search (MIPS) has attracted much attention recently because of its applications in deep learning. Accordingly, inner product similarity-based LSH methods were proposed to speed up MIPS [20, 36, 42]. To support MIPS, [42] extends the LSH framework to allow asymmetric hashing. [36] instead presents a simple symmetric LSH that has stronger

guarantees and better empirical performance than the asym-

metric LSH in supporting MIPS. However, the LSH in [36],

simply using the hyperplane hash function, is shown to have

a long running time – often as long as the full linear scan.

Active Distance Metric Learning. In [27, 50], active

learning-based approaches were proposed that learn a dis-

tance metric by selectively sampling the to-be-labeled objects

from the existing dataset. Therefore, their focus is on reduc-

ing the labeling work from the users. However, unlike our

work, they do not account for the continuously incoming

objects that had previously not been seen. Therefore, once a

distance metric is learned, it will never be updated. We in-

stead focus on the continuous update of the distance metric

driven by dynamically acquiring similarity constraints.

9 CONCLUSION

In this work, we propose OASIS, which consistently improves

the accuracy and efficiency of similarity search-based data

mining tasks. OASIS features an incremental distance metric

learning strategy that adapts the distance metric whenever

a new similarity example becomes available. OASIS also offers

an update mechanism that only updates the LSH index when

it is absolutely necessary and re-hashes a small subset of

objects when the update actually happens. Our experimental

study confirms the 3 orders of magnitude speedup of our

strategies over the state-of-the-art methods.

REFERENCES

[1] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for clustering evolving data streams. VLDB, 2003.
[2] A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt. Practical and optimal LSH for angular distance. In NIPS, pages 1225–1233, 2015.
[3] Anonymous. Continuously adaptive similarity search. https://drive.google.com/file/d/1hFsqVD6LlQPRm7SBydk1TRlzGuD1yvf2/view?usp=sharing, 2019.


[4] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In KDD, pages 29–38, 2003.
[5] A. Bellet, A. Habrard, and M. Sebban. A survey on metric learning for feature vectors and structured data. CoRR, abs/1306.6709, 2013.
[6] M. M. Breunig, H. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In SIGMOD, pages 93–104, 2000.
[7] L. Cao, J. Wang, and E. A. Rundensteiner. Sharing-aware outlier analytics over high-volume data streams. In SIGMOD, pages 527–540, 2016.
[8] L. Cao, D. Yang, Q. Wang, Y. Yu, J. Wang, and E. A. Rundensteiner. Scalable distance-based outlier detection over high-volume data streams. In ICDE, pages 76–87, 2014.
[9] G. Chechik, U. Shalit, V. Sharma, and S. Bengio. An online algorithm for large scale image similarity learning. In NIPS, pages 306–314, 2009.
[10] A. Cichocki, S. Cruces, and S.-i. Amari. Log-determinant divergences revisited: Alpha-beta and gamma log-det divergences. Entropy, 17:2988–3034, 2015.
[11] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Trans. Inf. Theor., 13(1):21–27.
[12] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, pages 253–262. ACM, 2004.
[13] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. ICML, pages 209–216, New York, NY, USA, 2007. ACM.
[14] A. Dosovitskiy, J. T. Springenberg, M. A. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with convolutional neural networks. In NIPS, pages 766–774, 2014.
[15] C. C. Aggarwal. Outlier Analysis. Second Edition. Springer, 2017.
[16] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. KDD'96, pages 226–231, 1996.
[17] J. Gan, J. Feng, Q. Fang, and W. Ng. Locality-sensitive hashing scheme based on dynamic collision counting. SIGMOD, pages 541–552, 2012.
[18] N. Hansen. The CMA evolution strategy: A tutorial. arXiv preprint arXiv:1604.00772, 2016.
[19] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.
[20] Q. Huang, G. Ma, J. Feng, Q. Fang, and A. K. Tung. Accurate and fast asymmetric locality-sensitive hashing scheme for maximum inner product search. In KDD, pages 1561–1570. ACM, 2018.
[21] S. Idreos, O. Papaemmanouil, and S. Chaudhuri. Overview of data exploration techniques. In SIGMOD, pages 277–281, 2015.
[22] S. B. Imandoust and M. Bolandraftar. Application of k-nearest neighbor (kNN) approach for predicting economic events: Theoretical background. 2013.
[23] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. STOC, pages 604–613. ACM, 1998.
[24] P. Jain, B. Kulis, I. S. Dhillon, and K. Grauman. Online metric learning and fast similarity search. NIPS, pages 761–768, USA, 2008.
[25] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In VLDB, pages 392–403, 1998.
[26] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[27] K. Kumaran, D. Papageorgiou, Y. Chang, M. Li, and M. Takác. Active metric learning for supervised classification. CoRR, abs/1803.10647, 2018.
[28] H. O. Lancaster and E. Seneta. Chi-square distribution. Encyclopedia of Biostatistics, 2, 2005.
[29] W. Li, Y. Zhang, Y. Sun, W. Wang, M. Li, W. Zhang, and X. Lin. Approximate nearest neighbor search on high dimensional data - experiments, analyses, and improvement. TKDE, 2019.
[30] U. Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, Dec. 2007.
[31] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe LSH: Efficient indexing for high-dimensional similarity search. PVLDB, pages 950–961, 2007.
[32] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Intelligent probing for locality sensitive hashing: Multi-probe LSH and beyond. Proc. VLDB Endow., 10(12):2021–2024, Aug. 2017.
[33] D. Malerba, F. Esposito, and G. Semeraro. A further comparison of simplification methods for decision-tree induction. In Learning from Data, pages 365–374. Springer, 1996.
[34] J. McDonald. Chi-square test of goodness-of-fit: Handbook of Biological Statistics. Sparky House Publishing, 2009.
[35] K. Millman and M. Aivazis. Python for scientists and engineers. Computing in Science & Engineering, 13(02):9–12, Mar. 2011.
[36] B. Neyshabur and N. Srebro. On symmetric and asymmetric LSHs for inner product search. ICML, pages 1926–1934. JMLR.org, 2015.
[37] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. LOCI: fast outlier detection using the local correlation integral. In ICDE, pages 315–326, 2003.
[38] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, New York, NY, USA, 3rd edition, 2007.
[39] X. Qin, L. Cao, E. A. Rundensteiner, and S. Madden. Scalable kernel density estimation-based local outlier detection over large data streams. 2019.
[40] A. Rajaraman and J. D. Ullman. Mining of Massive Datasets. Cambridge University Press, New York, NY, USA, 2011.
[41] S. Shalev-Shwartz, Y. Singer, and A. Y. Ng. Online and batch learning of pseudo-metrics. In ICML, page 94, 2004.
[42] A. Shrivastava and P. Li. Improved asymmetric locality sensitive hashing (ALSH) for maximum inner product search (MIPS). pages 812–821, 2015.
[43] A. Singhal. Modern information retrieval: A brief overview. IEEE Data Eng. Bull., 24(4):35–43, 2001.
[44] R. Spring and A. Shrivastava. Scalable and sustainable deep learning via randomized hashing. In SIGKDD, pages 445–454, 2017.
[45] J. M. Steele. The Cauchy-Schwarz Master Class: An Introduction to the Art of Mathematical Inequalities. Cambridge University Press, 2004.
[46] F. Wang and J. Sun. Survey on distance metric learning and dimensionality reduction in data mining. Data Min. Knowl. Discov., 29(2):534–564, Mar. 2015.
[47] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In NIPS, pages 1473–1480, 2006.
[48] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res., 10:207–244, June 2009.
[49] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In NIPS, pages 521–528, Cambridge, MA, USA, 2002. MIT Press.
[50] L. Yang, R. Jin, and R. Sukthankar. Bayesian active distance metric learning. In UAI, pages 442–449, 2007.
[51] G. Zheng, S. L. Brantley, T. Lauvaux, and Z. Li. Contextual spatial outlier detection with metric learning. In SIGKDD, pages 2161–2170, 2017.