Effective Incremental Clustering for Duplicate Detection in Large Databases


Francesco Folino
ICAR-CNR, Via Bucci 41c, 87036 Rende (CS), Italy
[email protected]

Giuseppe Manco
ICAR-CNR, Via Bucci 41c, 87036 Rende (CS), Italy
[email protected]

Luigi Pontieri
ICAR-CNR, Via Bucci 41c, 87036 Rende (CS), Italy
[email protected]

Abstract

We propose an incremental algorithm for discovering clusters of duplicate tuples in large databases. The core of the approach is the usage of an indexing technique which, for any newly arrived tuple µ, allows us to efficiently retrieve a set of tuples in the database that are most similar to µ, and that are likely to refer to the same real-world entity as µ. The proposed index is based on a hashing approach which tends to assign similar objects to the same buckets. Empirical and analytical evaluation demonstrates that the proposed approach achieves satisfactory efficiency, at the cost of a small loss in accuracy.

1 Introduction

Recognizing similarities in large collections of data is a major issue in the context of information integration systems. An important challenge in such a setting is to discover and properly manage duplicate tuples, i.e., syntactically different tuples which are actually identical from a semantic viewpoint, in that they refer to the same real-world entity. There are several application scenarios involving this important task. A typical example consists in the reconciliation of demographic data sources in a data warehousing setting. Names and addresses can be stored in rather different formats, thus raising the need for an effective reconciliation strategy, which can be crucial to effective decision making. In such cases the problem is the analysis of a (typically large) volume of small strings, in order to reconstruct the semantic information on the basis of the few syntactic data available. By assuming that each possible string represents a dimension along which the information contained in a tuple can be projected, each tuple carries a small informative content in a high-dimensional space. This issue has given rise to a large body of work in different research communities, and under a variety of names (e.g., De-duplication [15], Merge/Purge [10], Entity-Name Matching [6]).

Traditionally, the problem of tuple de-duplication has been addressed mainly from an accuracy viewpoint, by focusing on the minimization of incorrect matchings. However, efficiency and scalability issues play a predominant role in many application contexts where large data volumes are involved, especially when the object-identification task is part of an interactive application calling for short response times. In general, the large volume of involved data imposes severe restrictions on the design of data structures and algorithms for data de-duplication, and disqualifies any approach requiring quadratic time in the database size or producing many random disk accesses and continuous paging activity. Thus, in this paper our main objective is to devise a scalable method for duplicate detection that can be profitably applied to large databases, in an incremental way.

More specifically, many de-duplication approaches in the literature essentially attempt to match or cluster duplicated records (e.g., [6, 14]), and do not guarantee an adequate level of scalability. On the other hand, the usage of traditional clustering algorithms is made unviable by the high number of object clusters expected in a typical de-duplication scenario, which can be of the same order as the size of the database. From a scalability perspective, to the best of our knowledge, the only suitable approach is the one proposed in [13], where objects are first grouped in "canopies", i.e., subsets containing objects suspected to be similar, and pairwise similarities are computed only inside each canopy. However, this approach does not cope with incrementality issues. Recently, some approaches have been proposed [9, 1, 5] which exploit efficient indexing schemes based on the extraction of relevant features from the tuples under consideration. Such approaches can be adapted to deal with the de-duplication problem, even though they are not specifically designed to approach the problem from an incremental clustering perspective.

The solution we propose essentially relies on an efficient and incremental clustering technique that allows us to discover all clusters containing duplicate tuples. The core of the approach is the usage of a suitable indexing technique which, for any newly arrived tuple µ, allows us to efficiently retrieve a set of tuples in the database which are likely to be most similar to µ, and hence are expected to refer to the same real-world entity associated with µ. The proposed indexing technique is based on a hashing scheme which tends to assign objects with high similarity to the same buckets. We exploit a refined key-generation technique which, for each tuple under consideration, guarantees a controlled level of approximation in the search for the nearest neighbors of that tuple. To this purpose, we resort to a family of locality-sensitive hashing functions [11, 3, 8], which are guaranteed to assign any two objects to the same buckets with a probability which is directly related to their mutual similarity. Notably, with the help of an empirical evaluation performed on synthesized and real data, we can assess that the hashing-based method ensures effective on-line matching, at the expense of a limited accuracy loss.

Notice that the approach proposed in this paper is a substantial improvement over the proposal in [4]. There, we adopted an indexing scheme tailored to a set-based distance function, and the management of each tuple was faced at a coarser granularity. This paper extends and improves that proposal in both effectiveness and efficiency, by allowing for direct control on the degree of granularity needed to discover the actual neighbors (duplicates) of a tuple.

2 Problem statement and overview of the approach

In the following we introduce some basic notation and preliminary definitions. An item domain M = {a1, a2, . . . , am} is a collection of items. We assume m to be very large: typically, M represents the set of all possible strings available from a given alphabet. Moreover, we assume that M is equipped with a distance function distM(·, ·) : M × M → [0, 1], expressing the degree of dissimilarity between two generic items ai and aj.

A tuple µ is a subset of M. An example tuple is

{Alfred, Whilem, Salisbury, Hill, 3001, London}

which represents registry information about a subject. Notice that a more appropriate representation [4] can take into account a relational schema in which each tuple fits. For example, in the above schema, a more informative setting requires separating the tuple into the fields NAME, ADDRESS, CITY, and associating an itemset with each field: µ[NAME] = {Alfred, Whilem}, µ[ADDRESS] = {Salisbury, Hill, 3001}, µ[CITY] = {London}. For ease of presentation, we shall omit such details: the results which follow can be easily generalized to such a context.

We assume that the set of all tuples is equipped with a distance function, dist(µ, ν) ∈ [0, 1], which can be defined for comparing any two tuples µ and ν by suitably combining the distance values computed through distM on the values of matching fields. In the following, we assume that both dist and distM are defined in terms of the Jaccard coefficient.
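For concreteness, a minimal Python sketch of this set-based distance is given below; it illustrates the definition rather than the authors' implementation, and the same function can be applied at the token level over q-gram sets:

def jaccard_distance(x: set, y: set) -> float:
    """Jaccard distance: 1 - |x ∩ y| / |x ∪ y| (0.0 for two empty sets)."""
    if not x and not y:
        return 0.0
    return 1.0 - len(x & y) / len(x | y)

# Example: two registry tuples represented as sets of string tokens.
mu = {"Alfred", "Whilem", "Salisbury", "Hill", "3001", "London"}
nu = {"Alfred", "Whilem", "Salisbury", "Hill", "London"}
print(jaccard_distance(mu, nu))  # 1 - 5/6 ≈ 0.167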

The core of the Entity Resolution problem [4] can be roughly stated as the problem of detecting, within a database DB = {µ1, . . . , µN} of tuples, a suitable partitioning C1, . . . , CK of the tuples, such that for each group Ci intra-group similarity is high and extra-group similarity is low. For example, the dataset

µ1  Jeff, Lynch, Maverick, Road, 181, Woodstock
µ2  Anne, Talung, 307, East, 53rd, Street, NYC
µ3  Jeff, Alf., Lynch, Maverick, Rd, Woodstock, NY
µ4  Anne, Talug, 53rd, Street, NYC
µ5  Mary, Anne, Talung, 307, East, 53rd, Street, NYC

can be partitioned into C1 = {µ1, µ3} and C2 = {µ2, µ4, µ5}.

This is essentially a clustering problem, but it is formulated in a specific situation, where there are several pairs of tuples in DB that are quite dissimilar from each other. This can be formalized by assuming that the size of the set {〈µi, µj〉 | dist(µi, µj) ≈ 1} is O(N²): thus, we can expect the number K of clusters to be very high – typically, O(N).

A key intuition is that, in such a situation, cluster membership can be detected by means of a minimal number of comparisons, by considering only some relevant neighbors for each new tuple, efficiently extracted from the current database through a proper retrieval method. Moreover, we intend to cope with the clustering problem in an incremental setting, where a new database DB∆ must be integrated with a previously reconciled one, DB. Practically speaking, the cost of clustering the tuples in DB∆ must be (almost) independent of the size N of DB. To this purpose, each tuple in DB∆ is associated with a cluster in the previously computed partition P, which is detected through a sort of nearest-neighbor classification scheme.

The algorithm in Figure 1 summarizes our solution to the data reconciliation problem. Notably, the clustering method is parametric w.r.t. the distance function used to compare any two tuples, and is defined in an incremental way, for it allows us to integrate a new set of tuples into a previously computed partition. In fact, the algorithm receives a database DB and an associated partition P, besides the set of new tuples DB∆; as a result, it produces a new partition P′ of DB ∪ DB∆, obtained by adapting P with the tuples from DB∆.

In more detail, for each tuple µi in DB∆ to be clustered, the k most prominent neighbors of µi are retrieved through procedure KNEARESTNEIGHBOR. The cluster membership of µi is then determined by calling the MOSTLIKELYCLASS procedure, which estimates the most likely cluster among those associated with the neighbors of µi. If such a cluster does not exist, µi is estimated not to belong to any of the

GENERATE-CLUSTERS(P, DB∆, k)
Output: A partition P′ of DB ∪ DB∆
 1: P′ ← P; DB′ ← DB;
 2: Let P′ = {C1, . . . , Cm} and DB∆ = {µ1, . . . , µn};
 3: for i = 1 . . . n do
 4:   neighbors ← KNEARESTNEIGHBOR(DB′, µi, k);
 5:   Cj ← MOSTLIKELYCLASS(neighbors, P′);
 6:   DB′ ← DB′ ∪ {µi};
 7:   if Cj = ∅ then
 8:     create a new cluster Cm+1 = {µi};
 9:     P′ ← P′ ∪ {Cm+1};
10:   else
11:     Cj ← Cj ∪ {µi};
12:     PROPAGATE(neighbors, P′);
13:   end if
14: end for

PROPAGATE(S, P)
P1: for all µ ∈ S do
P2:   neighbors ← KNEARESTNEIGHBOR(DB, µ, k);
P3:   C ← MOSTLIKELYCLASS(neighbors, P);
P4:   if µ ∉ C then
P5:     C ← C ∪ {µ};
P6:     PROPAGATE(neighbors, P);
P7:   end if
P8: end for

Figure 1. Clustering algorithm

existing clusters with a sufficient degree of certainty, and hence it is assigned to a newly generated cluster.

Finally, procedure PROPAGATE is meant to scan the neighbors of µi in order to possibly revise their cluster memberships, since in principle they could be affected by the insertion of µi. Notice that, in typical Entity Resolution settings, where clusters are quite far from each other, the propagation affects only a reduced number of tuples, and ends in a few iterations. Further details on the clustering scheme sketched above can be found in [4].

It is worth remarking that the complexity of the algorithm in Figure 1, given the size N of DB and M of DB∆, depends on three major tasks: (i) the search for neighbors (line 4, having cost n, where n is the number of tuples retrieved by each search), (ii) the voting procedure (line 5, with a cost proportional to k), and (iii) the propagation of cluster labels (line 12, having a cost proportional to n, based on the discussion above). As these are performed for each tuple in DB∆, the overall complexity is O(M(n + k)). Since k is O(1), it follows that the main contribution to the complexity of the clustering procedure is due to the cost O(n) of the KNEARESTNEIGHBOR procedure. Therefore, the main efforts towards computational savings are to be addressed in designing an efficient method for neighbor search. Our main goal is to accomplish this task by minimizing the number of accesses to the database, and avoiding the computation of all pairwise distances.
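To make the control flow of Figure 1 concrete, here is a minimal Python sketch of the two procedures; k_nearest_neighbor and most_likely_class are placeholders for the index-based search and the voting procedure described in the text, all names are illustrative, and the explicit removal of a tuple from its old cluster (only implicit in the pseudocode) is our own assumption:

def generate_clusters(partition, db, db_delta, k,
                      k_nearest_neighbor, most_likely_class):
    """Incrementally integrate the tuples of db_delta into an existing
    partition of db (cf. Figure 1). A partition is a list of clusters,
    each a list of tuples; the two callables stand for the index-based
    neighbor search and the voting procedure of the paper."""
    db = list(db)
    for mu in db_delta:
        neighbors = k_nearest_neighbor(db, mu, k)            # line 4
        cluster = most_likely_class(neighbors, partition)    # line 5
        db.append(mu)
        if cluster is None:            # no sufficiently likely cluster:
            partition.append([mu])     # open a new singleton cluster
        else:
            cluster.append(mu)
            propagate(neighbors, partition, db, k,
                      k_nearest_neighbor, most_likely_class)
    return partition

def propagate(tuples, partition, db, k,
              k_nearest_neighbor, most_likely_class):
    """Revise the cluster membership of the given tuples, recursively
    propagating any change to their own neighbors (cf. PROPAGATE)."""
    for mu in tuples:
        neighbors = k_nearest_neighbor(db, mu, k)
        cluster = most_likely_class(neighbors, partition)
        if cluster is not None and mu not in cluster:
            for c in partition:        # move mu to its new cluster
                if mu in c:
                    c.remove(mu)
            cluster.append(mu)
            propagate(neighbors, partition, db, k,
                      k_nearest_neighbor, most_likely_class)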

In [4], we proposed a hashing scheme which maps any tuple into a proper set of features, so that the similarity between two tuples can be evaluated by simply looking at their respective features. To this purpose, a hash-based index structure, simply called Hash Index, was introduced, which consists of a pair H = 〈FI, ES〉, where:

– ES, referred to as External Store, is a storage structure devoted to managing a set of tuple buckets through an optimized usage of disk pages: each bucket gathers tuples that are estimated to be similar to each other, in that they share a relevant set of properly defined features;

– FI, referred to as Feature Index, is an indexing structure which, for each given feature s, allows us to efficiently recognize all the buckets in ES that contain tuples exhibiting s.

Figure 2 shows how such an index can be used to perform nearest-neighbor searches, thus supporting the whole clustering approach previously described. The algorithm works according to the number k of desired neighbors.

KNEARESTNEIGHBORS(DB, µ, k)
 1: Let S = {s | s is a relevant feature of µ};
 2: N ← ∅;
 3: while S ≠ ∅ do
 4:   x = S.Extract();
 5:   h ← FI.Search(x);
 6:   if (h = 0) then
 7:     h ← FI.Insert(x);
 8:   else
 9:     while ν = ES.Read(h) do
10:      if N.size < k or dist(µ, ν) < N.MaxDist() then
11:        N.Insert(ν, dist(µ, ν));
12:      end if
13:    end while
14:  end if
15:  ES.Insert(µ, h);
16: end while
17: return N;

Figure 2. The KNEARESTNEIGHBOR procedure.

A major point in the proposed approach is the choice of features for indexing tuples, which strongly impacts the effectiveness of the whole method, and should be carefully tailored to the criterion adopted for comparing tuples. In [4] we described an indexing scheme which is meant to retrieve similar tuples according to a set-based dissimilarity function, namely the Jaccard distance: for any two tuples µ, ν ⊆ M, dist(µ, ν) = 1 − |µ ∩ ν|/|µ ∪ ν|. In practice, we assume that distM corresponds to the Dirichlet function, and that, consequently, the dissimilarity between two itemsets is measured by evaluating their degree of overlap.

In this case, a possible way of indexing a tuple µ simply consists in extracting a number of non-empty subsets of µ, named subkeys of µ, as indexing features. As the number of all subkeys is exponential in the cardinality of the given tuple, the method is tuned to produce a minimal set of "significant" subkeys. In particular, a subkey s of µ is said to be δ-significant if ⌊|µ| × (1 − δ)⌋ ≤ |s| ≤ |µ|.

Notably, any tuple ν such that distJ(µ, ν) ≤ δ must contain at least one of the δ-significant subkeys of µ [4]. Therefore, searching for tuples that exhibit at least one of the δ-significant subkeys derived from a tuple µ constitutes a strategy for retrieving all the neighbors of µ without scanning the whole database. This strategy also guarantees an adequate level of selectivity: indeed, if µ and ν contain a significant number of different items, then their δ-significant subkeys do not overlap. As a consequence, the probability that µ is retrieved for comparison with ν is low.
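A sketch of the subkey enumeration implied by the definition above (illustrative Python, not the system's code; the exponential blow-up it exhibits for growing δ is precisely drawback 1 discussed below):

from itertools import combinations
from math import floor

def significant_subkeys(tuple_items, delta):
    """Enumerate the δ-significant subkeys of a tuple µ: all subsets s
    with ⌊|µ|·(1−δ)⌋ ≤ |s| ≤ |µ| (and at least one item)."""
    n = len(tuple_items)
    lo = max(1, floor(n * (1 - delta)))
    for size in range(lo, n + 1):
        for s in combinations(sorted(tuple_items), size):
            yield frozenset(s)

mu = frozenset({"Jeff", "Lynch", "Maverick", "Road", "181", "Woodstock"})
keys = list(significant_subkeys(mu, delta=0.2))
print(len(keys))  # |µ|=6, sizes 4..6: C(6,4)+C(6,5)+C(6,6) = 15+6+1 = 22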

Despite its simplicity, this indexing scheme was proven to work quite well in practical cases [4]. Nevertheless, two main drawbacks can be observed:

1. The cost of the approach critically depends on the number of δ-significant subkeys: the larger the set of subkeys, the higher the number of writes needed to update the index.

2. More importantly, the proposed key-generation technique suffers from a coarse-grained dissimilarity between itemsets, which does not take into account a more refined definition of distM. Indeed, the proposed approach is, in principle, bound to fail in cases where the likeness among single tokens has to be detected as well. As an example, the tuples

µ1  Jeff, Lynch, Maverick, Road, 181, Woodstock
µ2  Jef, Lync, Maverik, Rd, 181, Woodstock

are not reckoned as similar in the proposed approach (even though they clearly refer to the same entity), due to slight differences between some semantically equivalent tokens. Notice that lowering the degree δ of dissimilarity partially alleviates this effect, but at the cost of notably worsening the performance of the index.

3 Hierarchical approximate hashing based on q-grams

Our objective in this section is to define a hash-based index which is capable of overcoming the drawbacks described above. In particular, we aim at defining a key-generation scheme which only requires a constant number of disk writes and reads, yet guarantees a fixed (low) rate of false negatives.

To overcome these limitations, we have to generate a fixed number of subkeys, which however are capable of reflecting both the differences among itemsets and those among tokens. To this purpose, we define a key-generation scheme by combining two different techniques:

– the adoption of hash functions based on the notion of minwise independent permutation [8, 2], for bounding the probability of collisions;

– the use of q-grams (i.e., contiguous substrings of size q) for a proper approximation of the similarity among string tokens [9] (see the extraction sketch right after this list).
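A possible q-gram extraction helper is sketched below in Python; the convention for tokens shorter than q is an assumption of ours:

def qgrams(token: str, q: int) -> set:
    """Return the set of contiguous substrings of length q of the token;
    tokens shorter than q yield the token itself (one common convention)."""
    if len(token) < q:
        return {token}
    return {token[i:i + q] for i in range(len(token) - q + 1)}

print(qgrams("Maverick", 3))  # {'Mav', 'ave', 'ver', 'eri', 'ric', 'ick'}
print(qgrams("Maverik", 3) & qgrams("Maverick", 3))  # 4 grams in common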

A locality-sensitive hash function H for a set S, equipped with a distance function D, is a function which ties the probability of collisions to the distance between elements. Formally, for each pair p, q ∈ S and value ε, there exist values P^ε_1 and P^ε_2 such that: (i) if D(p, q) ≤ ε then Pr[H(p) = H(q)] ≥ P^ε_1, and (ii) if D(p, q) > ε then Pr[H(p) = H(q)] ≤ P^ε_2.

Clearly, such a function H provides a simple solution to the problem of false negatives described in the previous section. Indeed, for each µ, we can define a representation rep(µ) = {H(a) | a ∈ µ}, and fill the hash-based index by exploiting δ-significant subkeys from such a representation. To this aim, we can exploit the theory of minwise independent permutations [2]. A minwise independent permutation is a coding function π of a set X of generic items such that, for each x ∈ X, the probability of the code associated with x being the minimum is uniformly distributed, i.e., Pr[min(π(X)) = π(x)] = 1/|X|.

A minwise independent permutation π naturally defines a locality-sensitive hash function H over an itemset X, defined as H(X) = min(π(X)). Indeed, for any two itemsets X and Y, it can be easily verified that Pr[min(π(X)) = min(π(Y))] = |X ∩ Y|/|X ∪ Y|. This suggests that, by approximating distM(ai, aj) with the Jaccard similarity among some given features of ai and aj, we can adopt the above-envisaged solution to the problem of false negatives. When M contains string tokens (as usually happens in a typical entity resolution setting), the features of interest of a given token a can be represented by the q-grams of a. It has been shown [9, 16] that the comparison of q-grams provides a suitable approximation of the edit distance, which is typically adopted as a classical tool for comparing strings.
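The collision property can be checked empirically; the self-contained Python sketch below estimates Pr[min(π(X)) = min(π(Y))] with truly random permutations and compares it with the Jaccard similarity (the sets and the sampling scheme are arbitrary examples of ours):

import random

def minhash(items, perm):
    """min(π(X)) for a permutation π given as an item -> rank mapping."""
    return min(perm[x] for x in items)

X = set("ABCDEF")            # |X ∩ Y| = 4, |X ∪ Y| = 8
Y = set("CDEFGH")            # Jaccard similarity = 4/8 = 0.5

universe = sorted(X | Y)
rng = random.Random(42)
trials, collisions = 10_000, 0
for _ in range(trials):
    ranks = list(range(len(universe)))
    rng.shuffle(ranks)                     # one truly random permutation
    perm = dict(zip(universe, ranks))
    collisions += minhash(X, perm) == minhash(Y, perm)

print(collisions / trials)   # ≈ 0.5, matching |X ∩ Y| / |X ∪ Y|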

In the following, we show how minwise functions can be effectively exploited to generate suitable keys. Consider the example tuples

µ1  Jeff, Lynch, Maverick, Road, 181, Woodstock
µ2  Jef, Lync, Maverik, Rd, Woodstock

Clearly, the tuples µ1 and µ2 refer to the same entity (and hence should be associated with the same key). The detection of such a similarity can be accomplished by resorting to the following observations:

1. Some tokens in µ1 are strongly similar to tokens in µ2: in particular, Jeff and Jef, Lynch and Lync, Road and Rd, and Maverick and Maverik. Thus, denoting by a1, a2, a3, and a4 the four "approximately common" terms, the tuples can be represented as:

µ1  a1, a2, a3, a4, 181, Woodstock
µ2  a1, a2, a3, a4, Woodstock

2. The postprocessed tuples exhibit only a single mismatch. If a minwise permutation is applied to both, the resulting key shall very likely be the same.

Thus, a minwise function can be applied over the "purged" representation of a tuple µ, in order to obtain an effective key. The purged version of µ should guarantee that tokens exhibiting high similarity with tokens in other tuples change their representation towards a "common approximate" token.

Again, the approximate representation of a token, described in point 1 of the example above, can be obtained by resorting to minwise functions. Given two tokens ai and aj, recall that their dissimilarity distM(ai, aj) is defined in terms of the Jaccard coefficient. In practice, if feat(a) represents a set of features of token a, then distM(ai, aj) = 1 − |feat(ai) ∩ feat(aj)|/|feat(ai) ∪ feat(aj)|. The set feat(a) can be defined in terms of q-grams. The latter represent the simplest yet effective information content of a token; indeed, they have been widely used and proved fruitful in estimating the similarity of strings [9, 16]. Hence, by applying a minwise function to the set of q-grams of a, we again have the guarantee that similar tokens collapse to a unique representation.

Thus, given a tuple µ to encode, the key-generation scheme we propose works on two different levels:

– In the first level, each element a ∈ µ is encoded by exploiting a minwise hash function H^l. This guarantees that two similar but different tokens a and b are, with high probability, associated with the same code. As a side effect, tuples µ and ν sharing "almost similar" tokens are purged into two representations where such tokens converge towards unique representations.

– In the second level, the set rep(µ) obtained from the first level is encoded by exploiting a further minwise hash function H^u. Again, this guarantees that tuples sharing several codes are likely to be associated with the same key.

The final code obtained in this way can be effectively adopted in the indexing structure described in Section 2.

A key point is the definition of a proper family of minwise independent permutations upon which to define the hash functions. A very simple idea is to randomly map a feature x of a generic set X to a natural number. Then, provided that the mapping is random, the probability of mapping a generic x ∈ X to the minimum number is uniformly distributed, as required. In practice, it is hard to obtain a truly random mapping. Hence, we exploit a family of "practically" minwise independent permutations [2], i.e., the functions π(x) = (a · c(x) + b) mod p, where a ≠ 0 and c(x) is a unique numeric code associated with x (such as, e.g., the code obtained by concatenating the ASCII codes of its characters). Provided that a, b, c(x) and p are large enough, the behavior of π is practically random, as we expect.
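The Python sketch below instantiates such a family and wires it into the two-level scheme described above; the prime, the feature coding c(x), and every function name are illustrative choices of ours, not the actual implementation:

import random

P = 2_147_483_647          # a large prime (2^31 - 1); an illustrative choice

def code(x: str) -> int:
    """A unique numeric code c(x) for a feature, here from its UTF-8 bytes."""
    return int.from_bytes(x.encode("utf-8"), "big")

def make_pi(rng: random.Random):
    """One 'practically minwise' permutation π(x) = (a·c(x) + b) mod P."""
    a = rng.randrange(1, P)                      # a ≠ 0
    b = rng.randrange(0, P)
    return lambda x: (a * code(x) + b) % P

def minhash(features, pi):
    """H_π(X) = min(π(X))."""
    return min(pi(x) for x in features)

def qgrams(token, q=3):
    """Contiguous substrings of size q (whole token if shorter than q)."""
    return {token[i:i + q] for i in range(max(1, len(token) - q + 1))}

def tuple_key(mu, h_lower, h_upper):
    """Two-level key: level 1 (H^l) replaces each token by a minwise code
    over its q-grams; level 2 (H^u) hashes the purged representation,
    concatenating k upper-level encodings into the final key."""
    rep = {str(minhash(qgrams(a), h_lower)) for a in mu}    # rep(µ)
    return tuple(minhash(rep, hu) for hu in h_upper)

rng = random.Random(7)
h_lower = make_pi(rng)
h_upper = [make_pi(rng) for _ in range(3)]   # k = 3 concatenated encodings
mu1 = {"Jeff", "Lynch", "Maverick", "Road", "181", "Woodstock"}
mu2 = {"Jef", "Lync", "Maverik", "Rd", "Woodstock"}
print(tuple_key(mu1, h_lower, h_upper))
print(tuple_key(mu2, h_lower, h_upper))
# The two keys coincide only when the purged representations collide under
# every upper-level encoding; the full method additionally uses h alternative
# lower-level encodings, disjunctively, to keep false negatives low.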

We further act on the randomness of the encoding by combining several alternative functions (obtained by choosing different values of a, b and p). Recall that a hash function on π is defined as Hπ(X) = min(π(X)), and that Pr[Hπ(X) = Hπ(Y)] = |X ∩ Y|/|X ∪ Y| = ε. Notice that the choice of a, b and p in π introduces a probabilistic bias in Hπ, which can in principle cause false negatives. Let us consider the events A ≡ "sets X and Y are associated with the same code", and B = ¬A. Then, pA = ε and pB = 1 − ε. By exploiting h different encodings H^l_1, . . . , H^l_h (which differ in the underlying π permutations), the probability that all the encodings exhibit a different code for X and Y is (1 − ε)^h. If ε > 1/2 represents the average similarity of items, we can exploit the h different encodings for computing h alternative representations rep_1(µ), . . . , rep_h(µ) of a tuple µ. Then, by exploiting all these representations in a disjunctive manner, we lower the probability of false negatives to (1 − ε)^h.

In general, allowing several trials favors high probabilities. Consider the case where ε < 1/2. Then, the probability that, in k trials (corresponding to k different choices of a, b and p), at least one trial is B is 1 − ε^k. We can apply this to the second-level encoding, where, conversely from the previous case, the probabilistic bias can cause false positives. Indeed, two dissimilar tuples µ and ν could in principle be associated with the same token, due to a specific bias in π which affects the computation of the minimum random code both in rep_i(µ) and in rep_i(ν). If, conversely, a key is computed as a concatenation of k different encodings H^u_1, . . . , H^u_k, the probability of having a different key for µ and ν is 1 − ε^k, where ε is the Jaccard similarity between rep_i(µ) and rep_i(ν).
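A quick numeric reading of the two bounds, for arbitrary example values of ε, h and k:

# False-negative probability for h disjunctive lower-level encodings,
# and probability of distinct keys for k concatenated upper-level ones.
eps_items, h = 0.8, 5      # average item similarity (example value)
eps_tuples, k = 0.3, 3     # rep-level Jaccard of two dissimilar tuples
print((1 - eps_items) ** h)   # (1-0.8)^5 = 0.00032: rare false negatives
print(1 - eps_tuples ** k)    # 1-0.3^3 = 0.973: dissimilar tuples split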

It is worth noting that the application of LSH methods to nearest-neighbor searches in high-dimensional spaces has been thoroughly studied: [8, 7], in particular, adopt a hierarchical combination of LSH functions similar to ours. However, their objective is rather different: given an object q from some given domain D and a threshold δ, retrieve the set S = {p ∈ D | d(q, p) < δ} in O(|S|). By contrast, in a de-duplication scenario the value δ is unknown, and is in general parametric to the similarity function to be adopted. Thus, using LSH in this context raises some novel, specific issues, which have not yet been studied. In particular, since the level of mismatch (e.g., misspelling errors) actually affecting duplicate tuples is an application-dependent parameter, it is not clear how to tune the h and k parameters. On the other hand, the extraction of q-grams from input strings introduces a further degree of freedom in the method, since determining a proper value for the q-gram size is not a trivial task, in general. Notice that the empirical analysis illustrated in the following section is crucial in this respect, as it allows us to get insight into the issue of optimally setting the above parameters.

4 Experimental Results

The above discussion shows that the effectiveness of the approach relies on choosing proper values of h and k. Low values of h favor false negatives, whereas high values favor false positives. Analogously, low values of k favor false positives, whereas high values should, in principle, favor false negatives.

Thus, this section is devoted to studying suitable values of these parameters that ensure a high correspondence between the retrieved and the expected neighbors of a tuple. To this purpose, for a generic tuple µ we are interested in evaluating the number TPµ of true positives (i.e., the tuples which are retrieved and belong to the same cluster as µ), and in comparing it with the number of false positives FPµ (i.e., tuples retrieved without being neighbors of µ) and of false negatives FNµ (i.e., neighbors of µ which are not retrieved). As global indicators we exploit the average precision and recall per tuple, i.e.,

precision = (1/N) · Σ_{µ∈DB} TPµ/(TPµ + FPµ),  recall = (1/N) · Σ_{µ∈DB} TPµ/(TPµ + FNµ),

where N denotes the number of tuples in DB.

The values of such quality indicators influence the effectiveness of the clustering scheme in Figure 1. In general, high values of precision allow for correct de-duplication: indeed, the retrieval of true positives directly influences the MOSTLIKELYCLASS procedure, which assigns each tuple to a cluster. When precision is low, the clustering method can be effective only if recall is high.
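Aggregating the per-tuple counts into these two indicators is straightforward; a small Python sketch follows (the TP/FP/FN triples are assumed to come from an evaluation harness, and the handling of empty denominators is our own convention):

def avg_precision_recall(counts):
    """counts: iterable of (TP, FP, FN) triples, one per tuple in DB.
    Returns the average per-tuple precision and recall defined above."""
    n = prec = rec = 0
    for tp, fp, fn in counts:
        prec += tp / (tp + fp) if tp + fp else 1.0  # no retrieval: vacuous
        rec += tp / (tp + fn) if tp + fn else 1.0
        n += 1
    return prec / n, rec / n

print(avg_precision_recall([(4, 1, 0), (3, 0, 2), (5, 5, 0)]))
# (0.766..., 0.866...): averages of {0.8, 1.0, 0.5} and {1.0, 0.6, 1.0}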

Notice that low precision may cause a degradation of performance if the number of false positives is not bounded. Thus, we also evaluate the efficiency of the indexing scheme, in terms of the number of tuples retrieved by each search. This value depends on h and k, and is clearly related to the rate of false positives.

Experiments were conducted on both real and synthesized data. For the real data, we exploited a (not publicly available) collection of 105,140 tuples, representing information about customers of an Italian bank. Synthetic data was produced by generating 50,000 clusters with an average of 20 tuples per cluster, each tuple containing 20 tokens on average. Each cluster was obtained by first generating a representative of the cluster, and then producing the desired duplicates as perturbations of the representative. A perturbation was accomplished by either deleting, adding or modifying a token of the cluster representative. The number of perturbations was governed by a Gaussian distribution with mean p. The parameter p was exploited to study the sensitivity of the proposed approach to the level of noise affecting the data, due, for example, to misspelling errors.

Figures 3 and 4 illustrate the results of some tests we conducted on this synthesized data, in order to analyze the sensitivity to the parameters q, h and k (relative to the indexing scheme), and p (relative to the noise in the data). In particular, the values of q ranged over 2, 3, 1-2 (both 1-grams and 2-grams) and 1-2-3 (q-grams with sizes 1, 2 and 3).

Figure 3. Results on synthetic data w.r.t. q-gram size (q) and nr. of hash functions (h, k): (a) precision vs. h, k and q; (b) recall vs. h, k and q.

Figures 3.(a) and 3.(b) show precision and recall results for different values of h and k, with p = 2. We can notice that precision rises for increasing values of k, and decreases for increasing values of h. The latter statement does not hold when q-grams of size 1 are considered. In general, more stable results are guaranteed by using q-grams of size 3. As to recall, we can notice that, when k is fixed, increasing values of h correspond to improvements as well. If h is fixed and k is increased, recall decreases only when q = 3. Here, the best results are guaranteed by fixing q = 1-2-3. In general, when h ≥ 3 and k ≥ 3, the indexing scheme exhibits good performance.

Figures 4.(a) and 4.(b) are useful to check the robustness of the index. As expected, the effectiveness of the approach tends to degrade when higher values of the perturbation factor p are used to increase intra-cluster dissimilarity. However, the proposed retrieval strategy keeps exhibiting values of precision and recall that can still enable effective clustering. In more detail, the impact of perturbation on precision is clearly emphasized when tuples are encoded using 1-grams as well, whereas using only either 2-grams or 3-grams makes the precision results more stable. Notice that for q = 3 a nearly maximum value of precision is achieved, even when a quite perturbed data set is used.

Figure 4. Results on synthetic data w.r.t. q-gram size (q) and perturbation: (a) precision vs. perturbation and q; (b) recall vs. perturbation and q.

Figure 5 provides some details on the progress of the number of retrieved neighbors, TP, FP and FN, as an increasing number of tuples, up to 1,000,000, is inserted into the index. For space reasons, only some selected combinations of h, k and q are considered, which were deemed quite effective in the previous analysis. Anyway, we point out that some general results of the analysis illustrated here also apply to the other cases. The values are averaged over a window of 5,000 tuples. In general, it is interesting to observe that the number of retrievals for each tuple is always bounded, although the index grows for increasing values of the data size. This general behavior, which we verified for all configurations of h, k and q, clearly demonstrates the scalability of the approach. In particular, we observe that for q = 3 the number of retrievals is always very low and nearly independent of the number of tuples inserted (see Figures 5.(c) and 5.(d)). More generally, the figures confirm the conceptual analysis that the number of I/O operations directly depends on the parameter h, the latter determining the number of searches and updates against the index.

Figure 5. Scalability w.r.t. the data size: (a) q = 2, h = 3, k = 3; (b) q = 2, h = 5, k = 5; (c) q = 3, h = 3, k = 3; (d) q = 3, h = 5, k = 5.

All these figures also agree with the main outcomes of the effectiveness analysis previously conducted with the help of Figure 3. In particular, notice the quick decrease of FP and FN when both k and h turn from 3 to 5 in the case of q = 2 (Figures 5.(a) and 5.(b)), which motivates the improvement in both precision and recall observed in these cases. Moreover, the high precision guaranteed by our approach with q-grams of size 3 is substantiated by Figures 5.(c) and 5.(d), where the number of retrieved tuples is very close to TP; in particular, for k = 5 (Figure 5.(d)) the FP curve definitely flattens onto the horizontal axis.

The above considerations are confirmed by the experiments on real data. Figure 6.(a) shows the precision and recall results obtained by using different values of q, whereas Figure 6.(b) summarizes the average number of retrievals and the quality indices. As we can see, recall is quite high even if precision is low (thus still allowing for effective clustering). Notice that the average number of retrievals is low, thus guaranteeing good scalability of the approach.

Figure 6. Results on real data using different q-gram sizes: (a) precision and recall; (b) average number of retrievals, TP, FP and FN.

5 Conclusions and Future Work

In this paper, we addressed the problem of recognizing duplicate data, specifically focusing on scalability and incrementality issues. The core of the proposed approach is an incremental clustering algorithm, which aims at discovering clusters of duplicate tuples. To this purpose, we studied a refined key-generation technique, which allows a controlled level of approximation in the search for the nearest neighbors of a tuple. An empirical analysis, on both synthesized and real data, showed the validity of the approach.

Clearly, when strings are too small or too different to contain enough informative content, the de-duplication task cannot be properly accomplished by the proposed clustering algorithm. To this purpose, we plan to study the extension of the proposed approach to different scenarios, where more informative similarity functions can be exploited. An example is the adoption of link-based similarity: recently, some techniques were proposed [12] which have proved effective, but still suffer from the incrementality issues which are the focus of this paper.

References

[1] R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In Proc. VLDB Conf., pages 586–597, 2002.

[2] A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-wise independent permutations. In Proc. STOC Conf., pages 327–336, 1998.

[3] A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic clustering of the Web. In Proc. WWW Conf., pages 1157–1166, 1997.

[4] E. Cesario, F. Folino, G. Manco, and L. Pontieri. An incremental clustering scheme for duplicate detection in large databases. In Proc. IDEAS Conf., 2005.

[5] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In Proc. SIGMOD Conf., pages 313–324, 2003.

[6] W. W. Cohen and J. Richman. Learning to match and cluster large high-dimensional data sets for data integration. In Proc. SIGKDD Conf., pages 475–480, 2002.

[7] A. Gionis, D. Gunopulos, and N. Koudas. Efficient and tunable similar set retrieval. In Proc. SIGMOD Conf., 2001.

[8] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. VLDB Conf., pages 518–529, 1999.

[9] L. Gravano et al. Approximate string joins in a database (almost) for free. In Proc. VLDB Conf., pages 491–500, 2001.

[10] M. A. Hernandez and S. J. Stolfo. The merge/purge problem for large databases. In Proc. SIGMOD Conf., pages 127–138, 1995.

[11] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. STOC Conf., pages 604–613, 1998.

[12] D. Kalashnikov, S. Mehrotra, and Z. Chen. Exploiting relationships for domain independent data cleaning. In Proc. SIAM Conf., pages 262–273, 2005.

[13] A. K. McCallum, K. Nigam, and L. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proc. SIGKDD Conf., pages 169–178, 2000.

[14] A. E. Monge and C. P. Elkan. The field matching problem: Algorithms and applications. In Proc. SIGKDD Conf., pages 267–270, 1996.

[15] S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In Proc. SIGKDD Conf., pages 269–278, 2002.

[16] E. Ukkonen. Approximate string matching with q-grams and maximal matches. TCS, 92(1):191–211, 1992.