Discovery of Image Versions in Large Collections




Jun Jie Foo, Ranjan Sinha, and Justin Zobel

School of Computer Science & IT, RMIT University, Melbourne, Australia, 3001

{jufoo,rsinha,jz}@cs.rmit.edu.au

Abstract. Image collections may contain multiple copies, versions, and fragments of the same image. Storage or retrieval of such duplicates and near-duplicates may be unnecessary and, in the context of collections derived from the web, their presence may represent infringements of copyright. However, identifying image versions is a challenging problem, as they can be subject to a wide range of digital alterations, and is potentially costly as the number of image pairs to be considered is quadratic in collection size. In this paper, we propose a method for finding the pairs of near-duplicates based on manipulation of an image index. Our approach is an adaptation of a robust object recognition technique and a near-duplicate document detection algorithm to this application domain. We show that this method requires only moderate computing resources, and is highly effective at identifying pairs of near-duplicates.

1 Introduction

Many digital images found on resources such as the web are copies or variants of each other. Identification of these near-duplicate images is a challenging problem, as two versions are rarely identical. They may differ in filename, format, and size; simply saving an image may lead to bitwise differences due to the variations in the coding standards in different software. Some common modifications include conversion to greyscale, change in color balance and contrast, rescaling, rotating, cropping, and filtering. These are instances where the near-duplicates — which we term co-derivative — are derived from the same digital image source, and are sometimes known as identical near-duplicates [8]. Non-identical near-duplicates — which are of interest in media tracking and filtering [8,23] — are images that share the same scenes or objects. In this work, we do not address non-identical near-duplicates due to their subjectivity in interpretation.

Although the detection of copied digital images has been extensively researched in the field of digital watermarking [6,9,11,12], such methods are ill-suited for retrieval applications [16,19]. Similarly, content-based retrieval [22] techniques are unsuitable, as they are designed to identify images with similar traits and have limited effectiveness for this task [3,17,20].

For the task of retrieval of near-duplicate images in response to a query, Ke et al. [14] have demonstrated near-perfect accuracy using PCA-SIFT local descriptors. Lu and Hsu [16] demonstrated an effective method of image hashing for retrieval of near-duplicate images. However, little prior work concerns sifting a collection to find all near-duplicate pairs. The RIME [3] system was designed to address this issue using a cluster-based approach, but the severity of the image alterations it handles is limited. Recently, Zhang and Chang [23] proposed a framework that identifies near-duplicate images using machine learning by graph matching. They present an application of this detection for topic and semantic association, wherein they observe average effectiveness on a small collection of images; efficiency and scalability remain an issue.

T.-J. Cham et al. (Eds.): MMM 2007, LNCS 4352, Part II, pp. 433–442, 2007. © Springer-Verlag Berlin Heidelberg 2007

In this work, we propose a new method for automatically identifying the co-derivative images in a large collection, based on analysis of the index generated by an existing robust object-recognition technique. The rationale is that co-derivative images (or their sub-parts) should share identical objects that are unlikely to be matched in unrelated images. Such an approach enables us to identify candidate near-duplicate pairs, which we can then process in more detail to verify whether they are indeed co-derivative.

To avoid the inefficiency inherent in quadratic-cost comparison of every pair of images, we apply the concept of discovery [1] that is used to identify near-duplicate documents, showing that it can be adapted for images. We explore an approach that exploits the PCA-SIFT local descriptors [14] indexed using locality-sensitive hashing [5], which have been shown to be highly effective for near-duplicate image retrieval. With this approach, co-derivatives are likely to have features with similar hash values; processing the hash table provides an efficient mechanism for identifying candidate pairs.

Using collections of 10,000 to 40,000 images, we show that even severely altered co-derivative images can be efficiently identified using our approach. For mild alterations, including moderate cropping and alterations to properties such as hue and intensity, both accuracy and completeness typically exceed 80%, while computational costs remain reasonable.

2 Co-derivative Detection and the Discovery Problem

Zhang and Chang [23] and Jaimes et al. [8] have broadly categorized duplicate images into three main categories by the type of changes, namely scene, camera, and image. Image changes describe digital editing operations such as cropping, filtering, and color, contrast, or resolution alteration. It is these changes that we define to lead to images being co-derived, since they are derived from the same digital source; we investigate the discovery of such images in this paper.

The problem of automatic identification of pairs of co-derivative images can be conceptualized using a relationship graph [1], where each node represents a unique image; the presence of an edge between two nodes reflects a co-derivative relationship between the images. This discovery process of the relationships in a given collection is a challenging task due to the quadratic number of potential edges (image pairs) for a collection of images [23].
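The scale of the problem can be made concrete with a short sketch (a toy illustration, not the authors' implementation): the number of potential edges grows quadratically with collection size, and the graph itself can be represented simply as a set of image-ID pairs.

```python
from itertools import combinations

def potential_edges(n_images: int) -> int:
    """Number of distinct image pairs (potential edges) in a collection."""
    return n_images * (n_images - 1) // 2

# The candidate space grows quadratically with collection size.
assert potential_edges(10_000) == 49_995_000

# A relationship graph can be held as a set of edges, each edge a
# pair of image IDs; combinations() enumerates every possible pair.
images = ["img1", "img2", "img3"]
all_pairs = set(combinations(images, 2))
assert len(all_pairs) == potential_edges(len(images))
```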

Fig. 1. An example of a co-derivative image pair; the lines depict the matching PCA-SIFT local descriptors

In this paper, we borrow concepts from near-duplicate document detection [1,2,21]. Text documents are parsed into representative units of words or characters; these units can be indexed in an inverted file [24], where each entry contains the postings list of documents (IDs) in which this particular unit occurs, along with any auxiliary information. Co-derivative document detection algorithms exploit the postings list to generate the relationship graph; the principal differences between the algorithms in the literature lie in unit selection heuristics [1].
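A minimal sketch of such an inverted file, with hypothetical document IDs and units, might look like this (the system described here stores hash-keys of image features rather than words):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each unit (a word, or a hash-key for image features) to the
    postings list of document IDs in which it occurs."""
    index = defaultdict(set)
    for doc_id, units in docs.items():
        for unit in units:
            index[unit].add(doc_id)
    return index

docs = {1: ["a", "b", "c"], 2: ["b", "c"], 3: ["c"]}
index = build_inverted_index(docs)
assert index["c"] == {1, 2, 3}   # postings list for unit "c"
assert index["a"] == {1}
```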

Broder et al. [2] propose counting the number of possible document pairings (edges) in each postings list to identify the number of co-occurring units in any two documents. Although this method is effective, it is costly, as the number of unique edges that can be generated is quadratic in the length of the postings list. Shivakumar and Garcia-Molina [21] address the scalability issues with a filter-and-refine scheme, based on hash-based probabilistic counting. The key idea is to set an upper bound on the number of unique edges by coarse counting in the first pass — using a hash table — to discard edges that do not have sufficient co-occurring units. This method is efficient: given a hash table of sufficient size, the number of identified edges can be dramatically reduced [1,21]. Our approach is to adapt such methods to the image domain, using, instead of postings, distinctive interest points [13] and locality sensitive hashing [5].

Distinctive interest points. Images can be characterized by local or global features, such as color, shape, texture, or salient points [22]. Local image features such as interest points have been shown to be robust for applications of object recognition and image matching [18]. Intuitively, co-derivative images should share identical objects; thus robust object recognition techniques are well-suited for this application domain. We use the SIFT [15] interest point detector and the PCA-SIFT local descriptors, as they have been demonstrated to be effective for query-based co-derivative image retrieval [4,14]; for convenience, we refer to each as a PCA-SIFT feature. An illustration that this matching works well is shown in Figure 1. In this pair of images, the dashed lines connect corresponding (automatically detected) PCA-SIFT features; as can be seen, the matching is accurate.

To generate local descriptors, PCA-SIFT uses the same information as the original SIFT descriptor, that is, location, scale, and dominant orientations. Ke et al. [13,14] have empirically determined that an n = 36 dimensional feature space for the local descriptor performs well for object detection and, specifically, for near-duplicate image retrieval, wherein any two PCA-SIFT local descriptors are deemed similar (a match) within a Euclidean distance (L2-norm) of 3,000. Hence, we use the same setting in this work, where the only difference lies in the similarity assessment of PCA-SIFT local descriptors, which we describe later.

The number of PCA-SIFT local descriptors is dictated by the number of keypoints that SIFT detects, typically ranging from hundreds to thousands per image (depending on image complexity). To recognize two objects with reasonable reliability, a minimum match of 5 local descriptors has been empirically observed to work [13]. Indexing such a considerable number of PCA-SIFT local descriptors for a large image collection is costly; fortunately, the number of keypoints that SIFT generates can be significantly reduced by varying the threshold value in the second stage, which has been demonstrated to significantly improve efficiency with only slight loss in accuracy [4]. Here, we use the original approach,¹ as our emphasis is on effectiveness.

Locality sensitive hashing. Given units that can be extracted from images and used to assess co-derivation, we need a method of indexing these units so that features from co-derived images are gathered together. Such clustering should allow efficient determination of the relationship graph. In this paper, we make use of the locality sensitive hashing (LSH) structure used to index the PCA-SIFT features for approximate nearest-neighbor search [5,7].

All points sharing identical hash values (collisions) within a given hash table are estimated, by the Manhattan distance (L1-norm embedded in the Hamming space), to be closer to each other than those that do not. Thus, the search space of an approximate nearest-neighbor match is greatly reduced to those that share identical hash values. Hence, the greater the number of hash-collisions between points from two images, the higher the probability that they are co-derivatives due to increased PCA-SIFT feature matches; a feature match is approximated by a hash-collision in the L1-norm instead of an L2-distance of 3,000.
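The principle can be illustrated with a deliberately simplified hash function: here the (coordinate, threshold) tests are hand-picked for a 3-d toy example, whereas the actual construction of Gionis et al. [5] samples them at random over an embedding of the points. Nearby points answer the same tests and so collide in the same bucket.

```python
def lsh_hash(point, tests):
    """One locality-sensitive hash: each test is a (coordinate, threshold)
    pair, and the hash value is the tuple of test outcomes. Points close
    in L1 tend to pass the same tests, hence collide."""
    return tuple(point[c] >= t for c, t in tests)

# Hand-picked tests for a 3-d toy example (a real implementation
# samples these at random, as in Gionis et al.).
tests = [(0, 2.0), (1, 4.0), (2, 6.0), (0, 7.0), (1, 8.0), (2, 3.0)]

a, b = [1.0, 5.0, 9.0], [1.2, 5.3, 8.8]   # close in L1
c = [9.0, 1.0, 0.0]                        # far from both
assert lsh_hash(a, tests) == lsh_hash(b, tests)   # collision
assert lsh_hash(a, tests) != lsh_hash(c, tests)   # no collision
```

Using l independent tables, as the paper does, raises the probability that at least one table produces a collision for each true near neighbor.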

An additional L2-norm verification can be used to discard false positive matches under the L1-norm, but has been observed to have limited impact on effectiveness [5]. Moreover, any additional verification is costly in our application, as we aim to identify all instances of co-derivatives. In this work, we assume similarity of two PCA-SIFT features by a hash-collision in the L1-norm.

3 Deriving the Co-derivative Relationship Graph

To generate the relationship graph of an image collection we apply a filter-and-refine scheme. As a first-pass pruning strategy, we use hash-based probabilistic counting (explained below) to quickly discard pairs of images that are not near-duplicates, after which the pruned collection is further processed so that false positives are discarded. Hash-based probabilistic counting is suited to this domain because of the character of the PCA-SIFT features. In the post-indexing phase, the PCA-SIFT features of an image are mapped to a series of hash-keys across l LSH hash tables, such that two features sharing identical hash-keys are, with high confidence, close to each other in a d-dimensional space. Each hash-key can be treated as a unit, akin to words in text, and thus an image is transformed into a series of representative units that can be stored in a postings list; each entry consists of a list of images that contain PCA-SIFT features sharing identical hash-keys.

¹ An average of around 1,600 descriptors are extracted per image on our collection.

As in the approach of Shivakumar and Garcia-Molina [21] for relationship graph generation, all possible image pairs (edges) in every postings list are hashed to an array A of m counters using a universal hash function h. For an edge between two nodes id1 and id2 that share co-occurring entries (units) in the postings list, A[h(id1, id2)] is incremented, where id denotes the image ID; this is essentially a hash-counter. Due to hash collisions, this hash-counter can occasionally generate spurious edges, whereby edges with no co-occurring features share hash locations with ones that do, resulting in a large number of false positive matches. This occurs with high probability when the number of hash-counters is insufficient; but, given that we are only concerned with values lower than or equal to a threshold T, the number of hash-counters can be increased by using bytes, or smaller multiples of bits, rather than words; thus the rate of hash collisions can be reduced without impact on memory usage.
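As a rough sketch of this first pass (the hash function and postings below are toy stand-ins, not the paper's implementation): byte-sized saturating counters keep memory low, and because collisions can only inflate a count, thresholding on the array never discards a true candidate edge.

```python
from itertools import combinations

M = 1 << 20                                # number of byte counters

def edge_slot(id1, id2, m=M):
    """Toy stand-in for a universal hash h over an image-ID pair."""
    a, b = sorted((id1, id2))
    return (a * 1_000_003 + b) % m

def coarse_count(postings_lists, m=M):
    """First pass: increment a byte counter for every pair (edge) in every
    postings list. Collisions can only inflate a count, never lose one, so
    discarding edges whose counter stays below T is a safe pre-filter."""
    A = bytearray(m)                       # bytes, not words: many more counters
    for postings in postings_lists:
        for id1, id2 in combinations(postings, 2):
            slot = edge_slot(id1, id2, m)
            if A[slot] < 255:              # saturating add; we only test against T
                A[slot] += 1
    return A

postings_lists = [[1, 2, 3], [1, 2], [1, 2], [2, 3]]
A = coarse_count(postings_lists)
assert A[edge_slot(1, 2)] == 3             # (1,2) co-occurs in 3 lists
assert A[edge_slot(1, 3)] == 1             # below a threshold T=2: discarded
```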

As the number of matching units (PCA-SIFT features) between two co-derivative images ranges from tens to thousands, we apply thresholding to discard edges that do not have sufficient matches. Furthermore, we expect to find some image pairs that share identical objects but are neither co-derivatives nor near-duplicates (pictures of flags, for example). Hence, we experimentally increment this parameter to find an optimal threshold for identifying all co-derivative images, while omitting as many false matches as possible.

To minimize the number of spurious edges, we accumulate the matching features between two images using exact counting, that is, we keep an integer counter for every unique edge using a static structure with no possibility of collision. As such, each counter reflects the actual number of PCA-SIFT feature matches between two images; we use the same threshold value as in the pruning strategy to discard edges without this minimum number of matching features.
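The second, exact pass might be sketched as follows (again with toy postings; the `candidates` set is assumed to come from the first-pass pruning):

```python
from collections import Counter
from itertools import combinations

def exact_count(postings_lists, candidates):
    """Second pass: exact per-edge integer counters (no collisions) for
    the candidate edges that survived the coarse first pass."""
    counts = Counter()
    for postings in postings_lists:
        for edge in combinations(sorted(postings), 2):
            if edge in candidates:
                counts[edge] += 1
    return counts

postings_lists = [[1, 2, 3], [1, 2], [1, 2], [2, 3]]
candidates = {(1, 2), (2, 3)}              # edges kept by the first pass
T = 3                                      # minimum matching features
counts = exact_count(postings_lists, candidates)
graph = {edge for edge, c in counts.items() if c >= T}
assert graph == {(1, 2)}                   # (2,3) has only 2 matches
```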

4 Evaluation Methodology

For our image collection, we use three image data sets of crawled images from the SPIRIT collection [10]. To enable a meaningful evaluation, we generate a seed collection of 5,000 co-derivative images by using 100 randomly selected individual images from the same collection, each of which is then digitally transformed using 50 different alterations. Hence, each of the 50 images has a co-derivative relationship to each of the other 49 images, with a resultant 122,500 manually generated edges; each of the three image data sets — C1, C2, and C3 — is populated with the seed collection, and randomly gathered images from the SPIRIT collection, to aggregated sizes of 10,000, 20,000, and 40,000, respectively.
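The size of the seeded reference graph follows directly from this construction: each group of 50 co-derivative versions of one source forms a clique, so the edge count is a simple binomial computation.

```python
from math import comb

sources, versions = 100, 50          # 100 source images, 50 alterations each
assert sources * versions == 5_000   # size of the seed collection

# Each group of 50 versions forms a clique in the reference graph:
# every image is connected to the other 49.
edges_per_clique = comb(versions, 2)
assert edges_per_clique == 1225
assert sources * edges_per_clique == 122_500   # total seeded edges
```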

The list of alterations is similar to that of Ke et al. [14]:

1. format change: format change from .jpg to .gif (1 alteration)
2. colorise: each of the red, green, and blue channels is tinted by 10% (3)
3. contrast: increase and decrease contrast (2)
4. severe contrast: increase and decrease contrast 3× that of the original image (2)
5. crop: crop to 95%, 90%, 80%, and 70% of the image, preserving the centre region (4)
6. severe crop: crop to 60%, 50%, and 10% of the image, preserving the centre region (3)
7. despeckle: apply the "despeckle" operation of ImageMagick (1)
8. border: a frame of size 10% of the image is added using random colors (4)
9. rotate: rotate the image (by 90°, 180°, and 270°) about its centre (3)
10. scale-up: increase scale by 2×, 4×, and 8× (3)
11. scale-down: decrease scale by 2×, 4×, and 8× (3)
12. saturation: alter saturation by 70%, 80%, 90%, 110%, and 120% (5)
13. intensity: alter intensity level by 80%, 90%, 110%, and 120% (4)
14. severe intensity: alter intensity level by 50% and 150% (2)
15. rotate+crop: rotate the image (by 90°, 180°, and 270°), cropping 50% in the centre region (3)
16. rotate+scale: rotate the image (by 90°, 180°, and 270°), decreasing scale 4× (3)
17. shear: apply an affine warp on both x and y axes using 5° and 15° (4)

The number in parentheses is the number of instances of each alteration type.²

Making an arbitrary distinction based on our experience with using these techniques for image search, we separate the image alterations according to whether matches are easy or hard to detect. All alterations in categories 4, 6, 14, 15, 16, and 17 are hard, as are the 8× scalings in categories 10 and 11; all other alterations are (relatively!) easy.

As all three of our collections represent real-world images, we do not expect to find many instances of transformations as severe as those of our seeded images. Thus the seeded images are essential for evaluating the effectiveness of our approach on identifying severely altered co-derivatives; moreover, they also serve as a good testbed for testing the limits of various threshold values (denoted by Tn) to derive an optimal setting for identifying co-derivatives.

For all three collections, images are converted into greyscale³ and resized to 512 pixels on the longer edge. Experiments used a two-processor Xeon 3 GHz machine with 4 GB of main memory running Linux 2.4. The effectiveness of our approach can be evaluated by assessing the relationship graph. An ideal human-evaluated relationship graph of the entire image collection, otherwise known as a reference graph [1], provides a benchmark for measuring a co-derivative detection algorithm. We use the evaluation metrics of coverage and density [1], which are similar to recall and precision in information retrieval. The coverage of a computer-generated relationship graph is its completeness relative to the reference graph; the density is the proportion of edges that are correct.
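Under these definitions, coverage and a simplified, collection-level variant of average precision can be computed from edge sets (the paper's average precision is computed per image used in coverage estimation; this sketch aggregates over the whole graph):

```python
def coverage(found_edges, reference_edges):
    """Completeness of the generated graph relative to the reference
    graph (analogous to recall)."""
    return len(found_edges & reference_edges) / len(reference_edges)

def precision(found_edges, reference_edges):
    """Proportion of identified edges that are correct (analogous
    to precision)."""
    return len(found_edges & reference_edges) / len(found_edges)

reference = {(1, 2), (1, 3), (2, 3), (4, 5)}
found = {(1, 2), (2, 3), (4, 5), (6, 7)}
assert coverage(found, reference) == 0.75   # 3 of 4 reference edges found
assert precision(found, reference) == 0.75  # 3 of 4 found edges correct
```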

In practice, generation of a complete reference graph is implausibly difficult. However, given that we have a pre-generated reference graph with the use of a seed collection, we can evaluate the coverage of a relationship graph using the ratio of pre-determined edges that are identified in the reference graph. True density of a relationship graph cannot be appropriately evaluated without a complete reference graph of the entire collection; thus, to estimate density, we can select edges from the computer-generated graph to determine if co-derivative relationships exist between connecting nodes, a labour-intensive process. For a less resource-intensive evaluation, we also evaluate the average precision of the computer-generated relationship graph — the ratio of co-derivative edges to the total identified edges in that graph for each image used in coverage estimation.

² All alterations are created using ImageMagick, http://www.imagemagick.com
³ The PCA-SIFT features are extracted from greyscale images and hence are inherently robust against color changes.

Table 1. Estimated coverage (%), average precision (%), and number of identified edges of the co-derivative relationship graph generated using collections C1, C2, and C3; seven threshold values ranging from T4 to T256 are used

                          T4         T8       T16       T32       T64      T128      T256
Collection C1
  Coverage (%)          93.5       91.3      88.5      84.3      77.5      68.2      54.9
  Avg. precision (%)    22.0       55.7      83.3      92.1      93.2      91.7      89.4
  No. of edges     2,754,583    579,428   187,167   123,536   102,678    86,762    68,286
  Run-time (mins)        6.5        5.1       6.7       5.1       4.7       6.0       5.1
Collection C2
  Coverage (%)          93.6       91.4      88.6      84.6      78.0      68.9      55.9
  Avg. precision (%)    18.8       51.4      80.9      91.3      93.0      91.6      89.4
  No. of edges     6,660,498  1,211,766   304,877   155,629   114,054    91,522    70,892
  Run-time (mins)       12.4        9.3       8.5       8.4       8.3       8.2       8.1
Collection C3
  Coverage (%)          93.6       91.5      88.8      84.7      78.2      69.2      56.4
  Avg. precision (%)    16.8       48.8      79.4      90.8      92.9      91.5      89.4
  No. of edges    15,559,155  2,499,650   551,458   226,994   137,399    99,132    73,528
  Run-time (mins)       28.3       18.2      16.0      15.6      15.6      15.5      15.3

5 Results

We experiment with seven threshold values (T) from 4 to 256 — doubling progressively — for each collection. Images are a match if the number of shared features exceeds the threshold. As shown in Table 1, using our algorithm, the seeded co-derivative images are detected with high overall effectiveness for all collections. As anticipated, coverage favors low threshold values whereas precision favors higher thresholds. A small number of feature matches between two images is insufficient to discard non-co-derivative pairs, due to spurious edges that are generated by images that share similar or identical objects. Evaluation time does not rise drastically with collection size, suggesting, in this preliminary implementation, that this approach should scale to much larger collections.

A threshold of 16 or 32 leads to large gains in precision, and reductions in evaluation time, without great loss of coverage. Thus the number of shared features between some co-derivative images (after severe alterations) occasionally does not rise above the noise level; indeed, operations such as severe cropping, shearing, or scaling may make images appear unrelated even to humans. To explore this further, we separately measured coverage and precision for easy and hard alterations, as shown in Table 2. As can be seen, much of the difficulty is due to the hard alterations. For the easy alterations, coverage and precision uniformly remain above 80% and 95%, respectively, from threshold 16 up to 128. Note too that collection size has little impact on the accuracy of the method.

Table 2. Estimated coverage and average precision (%) of the co-derivative relationship graph generated using collections C1, C2, and C3; hard and easy alterations are evaluated separately, with thresholds from T4 to T256

                            T4      T8     T16     T32     T64    T128    T256
Collection C1
  Easy  Coverage          97.1    95.9    94.4    91.8    87.4    80.6    68.5
        Avg. precision    18.0    53.6    86.0    96.7    98.5    98.5    97.8
  Hard  Coverage          87.1    83.0    78.0    70.9    59.9    46.3    30.7
        Avg. precision    29.2    59.5    78.6    83.9    83.9    79.4    74.4
Collection C2
  Easy  Coverage          97.1    96.0    94.5    92.0    87.7    81.1    69.4
        Avg. precision    14.9    48.7    83.2    95.9    98.2    98.4    97.8
  Hard  Coverage          87.2    83.2    78.3    71.4    60.6    47.2    31.9
        Avg. precision    25.7    56.1    76.7    83.3    83.7    79.4    74.4
Collection C3
  Easy  Coverage          97.1    96.0    94.5    92.1    87.9    81.3    69.8
        Avg. precision    13.0    45.8    81.6    95.3    98.1    98.4    97.9
  Hard  Coverage          87.4    83.4    78.5    71.7    61.0    47.6    32.4
        Avg. precision    23.6    54.0    75.7    82.9    83.6    79.3    74.5

Also shown in Table 1 is the number of identified co-derivative edges for each threshold value. We observe a considerable reduction in the identified edges — for all three collections — as the threshold value is increased. The observed coverage, relative to the number of identified edges, indicates that our approach is indeed effective for narrowing the search space of candidate edges. Table 1 further shows that the number of identified co-derivative edges — for all threshold values — follows a linear trend as the collection size grows, even though the growth of the number of total edges is quadratic. This indicates that the index structure should also grow linearly, which makes this a scalable approach for moderate-sized image collections.

For our final experiment, estimation of the density of our generated relationship graph, we manually sample 100 identified edges that do not exist in the reference graph of collection C3 — our largest available collection. Because we are concerned with only co-derivative images and not near-duplicates, a random sampling of edges from the set of identified edges is inappropriate, as PCA-SIFT features are likely to identify all instances of near-duplicates; hence, a random sampling could result in no matches found by our definition. For a more meaningful evaluation, we first rank the edges by their number of matching features, and sample using the top 10% of the identified edges. We conjecture that co-derivative pairs should have more matching features between them than non-co-derivative pairs; hence a simple ranking is appropriate to quantify the differences. The estimated density is observed to be 96%, but we believe this to be an overestimate due to a reasonable number of exact duplicates that may have been crawled from different sources. Nevertheless, this experiment demonstrates that, in addition to the seeded co-derivative images, other unseeded instances are also identified with high effectiveness.
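This ranking-based sampling can be sketched as follows (the edge counts and sampling parameters here are illustrative, not the paper's data):

```python
import random

def sample_for_density(edges_with_counts, fraction=0.10, sample_size=100):
    """Rank unverified edges by matching-feature count and draw the
    manual-evaluation sample from the top fraction, on the premise that
    co-derivative pairs share more features than unrelated pairs."""
    ranked = sorted(edges_with_counts, key=lambda e: e[1], reverse=True)
    top = ranked[: max(1, int(len(ranked) * fraction))]
    return random.sample(top, min(sample_size, len(top)))

edges = [(("a", "b"), 120), (("c", "d"), 8), (("e", "f"), 95), (("g", "h"), 5)]
sample = sample_for_density(edges, fraction=0.5, sample_size=2)
assert all(count >= 95 for _, count in sample)   # drawn from the top half
```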

6 Conclusions

We have proposed a new method for automatic identification of the pairs of co-derivative images in a large collection. We have demonstrated that near-duplicate text document detection techniques can be effectively adapted for images indexed using PCA-SIFT local descriptors and locality sensitive hashing. Our findings corroborate our hypothesis that co-derivative images can be automatically and effectively identified using a filter-and-refine scheme: a first-pass strategy identifies all images that share identical objects so that a smaller image set can be further processed for co-derivative pairs. Accuracy is high, especially for less severe image alterations, and the computational costs are moderate. That is, our method provides effective discovery of duplicates and near-duplicates, and thus is a practical approach to collection management and protection of copyright.

References

1. Y. Bernstein and J. Zobel. A scalable system for identifying co-derivative documents. In Proc. SPIRE Int. Conf. on String Processing and Information Retrieval, pages 55–67, October 2004.

2. A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks, 29(8-13):1157–1166, 1997.

3. E. Chang, J. Z. Wang, and G. Wiederhold. RIME: A replicated image detector for the world-wide web. In Proc. SPIE Int. Conf. on Multimedia Storage and Archiving Systems III, 1998.

4. J. J. Foo and R. Sinha. Pruning SIFT for scalable near-duplicate image matching. In Proc. ADC Australasian Database Conference, Feb 2007.

5. A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. VLDB Int. Conf. on Very Large Data Bases, pages 518–529, Edinburgh, Scotland, UK, September 1999. Morgan Kaufmann.

6. F. Hartung and M. Kutter. Multimedia watermarking techniques. Proceedings of the IEEE, 87(7):1079–1107, 1999.

7. P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. STOC Int. Conf. on Theory of Computing, pages 604–613, Dallas, Texas, USA, May 1998. ACM Press.

8. A. Jaimes, S.-F. Chang, and A. C. Loui. Duplicate detection in consumer photography and news video. In Proc. MM Int. Conf. on Multimedia, pages 423–424, 2002.

9. N. F. Johnson, Z. Duric, and S. Jajodia. On "Fingerprinting" images for recognition. In Proc. MIS Int. Workshop on Multimedia Information Systems, pages 4–11, Indian Wells, California, October 1999.

10. H. Joho and M. Sanderson. The SPIRIT collection: an overview of a large web collection. SIGIR Forum, 38(2):57–61, 2004.

11. X. Kang, J. Huang, and Y. Q. Shi. An image watermarking algorithm robust to geometric distortion. In Proc. IWDW Int. Workshop on Digital Watermarking, pages 212–223, Seoul, Korea, November 2002. Springer.

12. X. Kang, J. Huang, Y. Q. Shi, and Y. Lin. A DWT-DFT composite watermarking scheme robust to both affine transform and JPEG compression. IEEE Trans. Circuits and Systems for Video Technology, 13(8):776–786, 2003.

13. Y. Ke and R. Sukthankar. PCA-SIFT: A more distinctive representation for local image descriptors. In Proc. CVPR Int. Conf. on Computer Vision and Pattern Recognition, pages 506–513, Washington, DC, USA, June–July 2004. IEEE Computer Society.

14. Y. Ke, R. Sukthankar, and L. Huston. An efficient parts-based near-duplicate and sub-image retrieval system. In Proc. MM Int. Conf. on Multimedia, pages 869–876, New York, NY, USA, October 2004. ACM Press.

15. D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. Journal of Computer Vision, 60(2):91–110, 2004.

16. C.-S. Lu and C.-Y. Hsu. Geometric distortion-resilient image hashing scheme and its applications on copy detection and authentication. Multimedia Systems, 11(2):159–173, 2005.

17. J. Luo and M. A. Nascimento. Content based sub-image retrieval via hierarchical tree matching. In Proc. MMDB Int. Workshop on Multimedia Databases, pages 63–69, November 2003.

18. K. Mikolajczyk and C. Schmid. Scale and affine invariant interest point detectors. Int. Journal of Computer Vision, 60(1):63–86, 2004.

19. A. Qamra, Y. Meng, and E. Y. Chang. Enhanced perceptual distance functions and indexing for image replica recognition. IEEE Trans. Pattern Analysis and Machine Intelligence, 27(3):379–391, 2005.

20. N. Sebe, M. S. Lew, and D. P. Huijsmans. Multi-scale sub-image search. In Proc. MM Int. Conf. on Multimedia, pages 79–82, Orlando, FL, USA, October–November 1999. ACM Press.

21. N. Shivakumar and H. Garcia-Molina. Finding near-replicas of documents and servers on the web. In Proc. WebDB Int. Workshop on World Wide Web and Databases, pages 204–212, March 1998.

22. A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based image retrieval at the end of the early years. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(12):1349–1380, December 2000.

23. D. Zhang and S.-F. Chang. Detecting image near-duplicate by stochastic attributed relational graph matching with learning. In Proc. MM Int. Conf. on Multimedia, pages 877–884, October 2004.

24. J. Zobel and A. Moffat. Inverted files for text search engines. ACM Computing Surveys, June 2006.