INVITED PAPER
It's All About the Data
This paper explains how training data is important for many computer vision algorithms and presents case studies of how the Internet can be used to obtain high-quality data.
By Tamara L. Berg, Alexander Sorokin, Member IEEE, Gang Wang,
David Alexander Forsyth, Fellow IEEE, Derek Hoiem, Ian Endres, Member IEEE,
and Ali Farhadi, Member IEEE
ABSTRACT | Modern computer vision research consumes
labelled data in quantity, and building datasets has become
an important activity. The Internet has become a tremendous
resource for computer vision researchers. By seeing the
Internet as a vast, slightly disorganized collection of visual
data, we can build datasets. The key point is that visual data are
surrounded by contextual information like text and HTML tags,
which is a strong, if noisy, cue to what the visual data means. In
a series of case studies, we illustrate how useful this contextual
information is. It can be used to build a large and challenging
labelled face dataset with no manual intervention. With very
small amounts of manual labor, contextual data can be used
together with image data to identify pictures of animals. In fact,
these contextual data are sufficiently reliable that a very large
pool of noisily tagged images can be used as a resource to build
image features, which reliably improve on conventional visual
features. By seeing the Internet as a marketplace that can
connect sellers of annotation services to researchers, we can
obtain accurately annotated datasets quickly and cheaply. We
describe methods to prepare data, check quality, and set prices
for work for this annotation process. The problems posed by
attempting to collect very big research datasets are fertile for
researchers because collecting datasets requires us to focus on
two important questions: What makes a good picture? What is
the meaning of a picture?
KEYWORDS | Computer vision; Internet
I. INTRODUCTION
As the world moves online, Internet users are creating and
distributing more and more images and video. The number
of images indexed by search engines like Google and
Yahoo! is growing exponentially, and is currently in the
tens of billions. The same growth and proliferation is
experienced by community photo Web sites like Flickr,
PicasaWeb, Panoramio, Woophy, Fotki, and Facebook. Flickr alone currently has more than three billion photos,
with several million new pictures uploaded per day. In
recent years, the trend has been to make much of this
visual data publicly available on the Web, usually attached
to various forms of context such as captions, tags,
keywords, Web pages, and so on. By doing so, Internet
users have transformed computer vision by providing a
need for new applications to sort, search, browse, and interact with this content. This vast amount of visual data
also creates an interesting sampling of images and video on
which researchers can develop and evaluate theories.
The core intellectual problems in computer vision are
recognition and reconstruction, very broadly interpreted.
In recognition, one attempts to attach semantics to visual
data like images or video. The nature of the semantics and
of the attachment both vary widely. It is often valuable to mark particular instances of objects; for example, that image is a picture of my fluffy cat. Alternatively, one might want to mark categories; for example, that image is a picture of a cat. Finally, one might want to mark attributes of objects; for example, that image contains a fluffy thing.
There are strong relations between these ideas that
remain somewhat mysterious. Categories are hard to
define with any precision but represent a pool of instances linked by some visual and some semantic similarity. Visual
categories may or may not mirror semantic or taxonomic
categories. There are objects that are very different but
Manuscript received April 6, 2009; revised August 13, 2009; accepted
August 14, 2009. Date of publication May 17, 2010; date of current version July 21,
2010. This work was supported in part by the National Science Foundation under
Awards IIS-0803603 and IIS-0534837, in part by the Department of Homeland Security
under MIAS, a DHS-IDS Center for Multimodal Information Access and Synthesis at the
University of Illinois at Urbana-Champaign, in part by the Office of Naval Research
under Grant N00014-01-1-0890 as part of the MURI program, and in part by a
Beckman Institute postdoctoral fellowship.
T. L. Berg is with the Department of Computer Science, State University of New York
Stony Brook, Stony Brook, NY 11794 USA (e-mail: [email protected]).
A. Sorokin, G. Wang, D. A. Forsyth, D. Hoiem, I. Endres, and A. Farhadi are with the
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana,
IL 61801 USA (e-mail: [email protected]; [email protected]; [email protected];
[email protected]; [email protected]; [email protected]).
Digital Object Identifier: 10.1109/JPROC.2009.2032355
1434 Proceedings of the IEEE | Vol. 98, No. 8, August 2010 0018-9219/$26.00 ©2010 IEEE
share many visual attributes. For example, penguins are not fish, though they can look quite like fish and often
appear in the same context as fish. Alternatively, there are
categories that can vary widely in visual appearance yet
still carry the same semantic label (e.g., "chair" can refer to any object you can sit on, from beanbags to stools to rocking chairs). Adding to the confusion is the fact that a
particular object may belong to many categories; for example, my bicycle is a bicycle, but it is also a means of transportation, a wheeled object, and a metal object, and it
can be an obstacle.
In reconstruction, one attempts to build representations
of the geometric and photometric properties of the world
from visual data. The visual data could be single images,
multiple images, video, or even the product of exotic imaging
systems (e.g., thermal images or laser range scanner data).
The representations take many forms. One often wants to create models like those used in computer graphics to render
in virtual environments. Alternatives include complex data
structures linking images so that they can be presented to
viewers in a way that gives a strong sense of movement and
layout in space. Progress in this area has been driven by a
variety of practical problems and has spawned many applications; the key guide is [1]. An important recent trend involves applying multiview reconstruction methods to "found" images: if enough tourists have photographed the
Colosseum and one collects these photographs together, then
one can build compelling representations of that space [2],
[3]. We will not discuss reconstruction further in this paper,
but refer interested readers to other papers in this issue.
Recognition is a topic of intense research, and a comprehensive review would take us out of our way; instead we provide some entry points to the literature. Textbook accounts [4], [5] are now out of date. A good overview of
topics is provided by the edited proceedings of a recent
workshop [6]. Our definition of recognition is deliberately
broad to bring together discussion from several related
research areas. One important subtopic in recognition is
object recognition, where one builds models to recognize
object categories or instances. There are annual competitions culminating in workshops, and a good survey of the state of the art can be obtained by looking at the proceedings.1 Other important subtopics include activity recognition, where one builds descriptions of what people
are doing from visual data (we are not aware of an
extensive review; [7] has a fair sketch of recent literature);
face recognition, where one must attach identities to
pictures or video of faces (recent reviews in [8]–[10],
critical discussion in [11]); and detection, where one must localize all instances of a particular category in an image
(the recognition workshops, given above, deal with some
aspects of detection). A small number of object categories have been the subjects of intense study because they are
extremely common or particularly important to people.
Pedestrians are one such category [12]–[15]. Faces are the
most important such category, and detecting frontal faces
is now a fairly reliable and, as we shall see, extremely
useful technology (see Section III-A and [16]–[18]).
The central activity of object recognition research is
building systems to evaluate ideas about object representation. The main tools used are statistical in nature, and a
major part of the research involves learning from labelled
training data. Much of the research debate involves what
to learn and how to learn it. In the case of frontal face
detectors, the established method is to search all image
windows for a face by computing features from the window, then presenting these features to a classifier trained
to respond to faces only. Other object categories can be detected like this, but the details and success of such
approaches vary widely. This is because most object categories are extremely variable in appearance: dog species
range from the tiny Chihuahua to the large German
Shepherd; different cars can have different colors and
shapes; pedestrians wear a wide range of clothes and might
raise or lower their arms. All of this variability means that
learning a classifier that will effectively respond to an object category might require an unmanageable quantity of
examples to sample all possible appearances. This problem
can be simplified by constructing complex feature representations that suppress the worst of these sources of
variation; doing so usually involves a component of learning from data too.
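The window-scanning scheme just described (compute features for every image window, then classify each window) can be sketched in a few lines. The feature extractor (mean intensity) and the "classifier" (a fixed threshold) below are toy stand-ins for illustration only, not the trained face detectors the text refers to:

```python
# Sketch of sliding-window detection: every window is scored by a
# classifier trained to respond to the target category only. The
# feature and classifier here are hypothetical toy stand-ins.

def iter_windows(width, height, win=24, stride=8):
    """Yield top-left corners of all win x win windows at one scale."""
    for y in range(0, height - win + 1, stride):
        for x in range(0, width - win + 1, stride):
            yield x, y

def detect(image, features, classify, win=24, stride=8):
    """Return windows whose feature vector the classifier accepts."""
    h = len(image)
    w = len(image[0]) if h else 0
    hits = []
    for x, y in iter_windows(w, h, win, stride):
        patch = [row[x:x + win] for row in image[y:y + win]]
        if classify(features(patch)):
            hits.append((x, y))
    return hits

# Toy usage: a bright 24 x 24 "object" in the top-left corner of a
# dark 48 x 48 image; "features" = mean intensity, "classifier" =
# a simple brightness threshold.
image = [[200 if x < 24 and y < 24 else 10 for x in range(48)]
         for y in range(48)]
mean = lambda p: sum(map(sum, p)) / (len(p) * len(p[0]))
hits = detect(image, mean, lambda f: f > 128)
```

In a real detector the threshold classifier would be replaced by one trained on labelled windows, which is exactly where the demand for labelled data discussed here comes from.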
Recognition research and reconstruction research share
one important property: a strong demand for visual data in immense quantities. Humans, our best example of a
working recognition system, observe huge amounts of
visual data from which to learn to recognize objects in early
life. If one estimates the visual system as capturing 30 frames per second, then this translates to roughly 9 × 10^8 images per year. Most of these data will be unlabelled, but some
labelled data will be provided through parental teaching,
etc. A fair guess is that object recognition research will require hundreds of millions of images to build a similar
working recognition system. The Internet is exciting
because it has visual data on this scale. Furthermore,
data drawn from the Internet have a better chance of being
"fair" than data made in-house because they are produced by
people unrelated to the research problem. From the point
of view of a computer vision researcher, the Internet is a
device that has the potential to produce intriguing datasets. Unfortunately, these data have not yet been correctly
annotated. Current processes of hand annotation are
difficult and expensive to use at the kind of scale we want.
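The 9 × 10^8 figure follows from simple arithmetic, assuming the visual system captures frames around the clock:

```python
# Quick check of the estimate above: 30 frames per second, accumulated
# over every second of a year, is just under 10^9 images.
frames_per_second = 30
seconds_per_year = 60 * 60 * 24 * 365
images_per_year = frames_per_second * seconds_per_year  # 946 080 000
```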
In this paper, we demonstrate several fully or semiautomatic methods for producing large, well-labelled datasets
that are useful for computer vision research. We explore
two different but related ways to produce these datasets:
1 The 2007 workshop dealt with a dataset from Pascal and one from the California Institute of Technology; proceedings are at http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2007/workshop/index.html. The 2008 workshop dealt with a dataset from Pascal; proceedings are at http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2008/workshop/index.html.
by utilizing surrounding textual information in conjunction with image information (Section III-A to C) or by
using inexpensive human labeling resources (Section IV).
II. CURRENT METHODS FOR PREPARING ANNOTATED DATASETS
We do not really know what the ideal dataset would be for
object recognition research. However, some of the properties of a good dataset are apparent.
• Variety: a good dataset should be rich enough that
it represents all major visual phenomena we are
likely to encounter in applying the system we are
training and testing to all the images in the world.
• Scale: a good dataset will be big. We do not really
know what the major visual phenomena are, but
we have the best chance that all are represented well if the dataset is big.
• Precision: the labels in the dataset should be
reasonably accurate, though systems should be able
to handle some amount of noise in the labeling
process.
• Suitability: the dataset should have many examples
relevant to the problems we wish to solve.
• Cost: the dataset should not be unreasonably expensive.
• Representativeness: the dataset should represent visual
phenomena "fairly"; we should not be selecting data
to emphasize rare but interesting effects, or to
deemphasize common but difficult ones.
We can say with confidence that the perfect dataset
will be big. A rough estimate of the size of this dataset can
be obtained as follows. It should contain images of objects in the contexts in which they occur. The images should
cover most significant categories, and there should be
many images of each category. Within a category, the
images should show all variations important for training
and for testing, including the different shapes, textures,
and configurations the members can take. All important
viewing conditions should be represented by images taken
at different viewpoints, under different illumination conditions, and so on. These images should be labelled.
It is difficult to be crisp about what would be in the
labels (for example, should one label every book in a cluttered office? If not, which ones?), but the labels would
include the names of the categories of the important
objects, where those objects were in the image, informa-
tion about the context in which they lie, and the viewing
conditions. There might also be information about why those objects were worth labeling and others were not. A
fair guess at the number of categories that should be
represented is in the thousands; a fair guess of the number
of images needed to represent internal variations in each
category is thousands or more; and a fair guess of the
number of images needed to represent view variations
might be hundreds or more. All this means that we might
need hundreds of millions of images. This figure could be too small by several orders of magnitude if each guess is
too small; it could be too big by several orders of
magnitude if we discover processes of generalization
across examples that, for example, link internal variation
between categories or views. It is a fallacy to believe that,
because good datasets are big, then big datasets are good.
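The estimate above is the product of three order-of-magnitude guesses; taking the lower bound of each guess from the text:

```python
# Back-of-envelope version of the dataset-size estimate. Each figure is
# an order-of-magnitude guess from the text, not a measured quantity.
categories = 1_000            # "in the thousands"
images_per_category = 1_000   # internal variation: "thousands or more"
view_variations = 100         # viewpoint/illumination: "hundreds or more"
total_images = categories * images_per_category * view_variations
# 100 000 000 images: the "hundreds of millions" scale quoted above
```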
Building hand-labelled datasets has now become a
significant activity in computer vision. The most prominent examples here include: the Berkeley segmentation dataset [19]; the Caltech 5, Caltech 101 [20], and Caltech 256 [21]
datasets; the Pascal VOC datasets [22], [23]; the University
of Illinois at Urbana-Champaign (UIUC) car dataset [24];
the localized semantics dataset [25]; the Massachusetts
Institute of Technology (MIT) [26] and INRIA [27] pedes-
trian datasets; and the Yale [28], FERET [29], and CMU
PIE [30] face datasets. Every dataset above is a focused data collection targeted at a specific research problem: segmentation, car detection, pedestrian detection, face detection
and recognition, and object category recognition. Such
datasets will continue to be an important part of computer
vision research because they do well on the precision and
suitability criteria (above). The annotations are high-
quality, and the data represent what one wants to study.
But datasets like this will always be relatively small because the strategy does not scale well: it consumes significant amounts of highly skilled labor, requires much
management work, and is expensive.
Another strategy for building large scale datasets is to
get volunteers to label images for free. LabelMe [31] is a
public online image annotation tool that now has more than
11 845 images and 18 524 video frames with at least one
object labelled [31]. The current Web site counter displays 222 970 labelled objects. The annotation process is simple
and intuitive; users can browse existing annotations to get
the idea of what kind of annotations are required and then
label segmentation polygons within an image to denote an
object’s location. The dataset is freely available for
download and comes with a handy Matlab toolbox to browse
and search the dataset. The dataset is semicentralized. MIT
maintains a publicly accessible repository, accepts images to be added to the dataset, and distributes the source code
to allow interested parties to set up a similar repository. The
ESP game [32] and Peekaboom [33] are interactive games
that collect image annotations by entertaining people.
Players cooperate and receive points by providing textual
and location information that is likely to describe the
content of the image to their partner. These games have
produced a great deal of data (over 37 million2 and 1 million [33] annotations, respectively). The Peekaboom project
recently released a collection of 57 797 images annotated
through gameplay, but to date this collection has not been
widely used, for reasons we do not know. While datasets
produced by volunteers can be very large, there remain
2 See www.espgame.org.
difficulties with precision (volunteers may not do the right thing), with variety (volunteers may prefer to label a
particular kind of image), and with suitability (volunteers
may not wish to label the kind of data we want labelled).
The game-based approach has two inconveniences. The first is centralization: to achieve proper scale, it is necessary to have a well-attended game service that features the game, which constrains the publishing of a new game to obtain project-specific annotations. The second is the game itself: to achieve reasonable scale, one has to design a game that is entertaining, or else no one will play it. This requires creativity and experimentation to create an appropriate annotation interface.
Finally, dedicated annotation services can provide
quality and scale, but at a high price. ImageParsing.com
has built one of the world's largest annotated datasets [34].
With more than 49 357 images, 587 391 video frames, and 3 927 130 annotated physical objects, this is an invaluable
resource for vision scientists. At the same time, the cost of
entry is steep. Obtaining standard data would require at
least a $1000 investment, and custom annotations would
require at least $5000.3 ImageParsing.com provides high-
quality annotations and has a large number of images
available for free. Another large annotated dataset that has
recently been released is ImageNet [35]. This data collection provides whole-image-level category labels for more
than three million images collected using several Internet
search engines for about 5000 WordNet [36] synsets and
labelled via a large-scale annotation effort using Amazon’s
Mechanical Turk service.4 It is important to note that these
two datasets [34], [35] present probably the most rigorous
and the most varied collections to date for image labeling
and classification tasks.
One important issue that computer vision researchers
should be aware of when collecting and distributing datasets
from the Internet is copyright. Images posted on the Internet
are owned by the photographer, and usage rights are not
usually specified (except in the special case of examples like
the Creative Commons license on Flickr.com). Most use of
such images for educational and research purposes, in-
cluding republishing images in research papers, may be covered under fair-use policies, but the general issue of
copyright for Internet images is still a bit murky and has not
been satisfactorily resolved to our knowledge.
No current procedure for producing datasets is entirely
satisfactory. In what follows, we describe two mechanisms
to use the Internet to produce useful and interesting data-
sets. One can regard the Internet as a vast, disorganized
repository, and build methods to prowl through this repository to collect data that have the right form, or that
can be levered into the right form (Section III-A to C).
These methods alleviate or bypass the necessity of human
labeling to create data collections but also present other
challenges or inconveniences. The most prominent challenge here is how to design effective and accurate computer vision and natural language processing algorithms
for labeling data, something that can be quite difficult.
Alternatively, one can regard the Internet as a marketplace
where it is possible to find someone willing to help
improve data quickly and cheaply (Section IV).
III. THE INTERNET AS A REPOSITORY
It is not helpful to think of the Internet as a big pile of
images or videos because most visual data items on the
Internet are surrounded by complex and interesting
contexts that can be useful for guiding the labeling process. Generally, we shall talk about images, but much of
what we say applies to video as well. In the simplest case,
images are associated with captions, which contain explicit information about the image's content. In more complex
cases, images appear embedded in Web pages, which render text near the image. The relations between
this text and the image can be extremely revealing. Resources like Wikipedia contain highly structured information in both text and visual form but tend to have relatively
few images. Lastly, there are resources where one can
perform keyword searches for images, like Google's image search or Flickr; images obtained from these sites have, at
least implicitly, the context of the site’s index and the
search used to obtain the image attached to them.
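As a concrete illustration of mining the text that surrounds an image, the sketch below pairs each `img` tag in a page with the text that precedes it, using only the standard library. The page markup is invented, and a real pipeline would also harvest captions, alt text, keywords, and tags:

```python
# Sketch of extracting the "text near the image" context described
# above. ImageContextParser is a hypothetical minimal harvester, not
# any system from the paper.
from html.parser import HTMLParser

class ImageContextParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.images = []  # (src, nearby text) pairs
        self.text = []    # running text buffer

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src", "")
            # Associate the image with the text seen since the last image.
            self.images.append((src, " ".join(self.text).strip()))
            self.text = []

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

# Invented page fragment for illustration.
page = '<p>A penguin at the zoo.</p><img src="penguin.jpg"><p>More text.</p>'
parser = ImageContextParser()
parser.feed(page)
```

Resetting the buffer at each image is a crude proximity model; the case studies below rely on richer cues, but the basic move of levering nearby text against image content is the same.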
We describe three case studies that show how useful
that context can be. In the first, we show how news pictures, taken with their captions, can yield a rich and
demanding face dataset. In the second, we show how to
use text lying near pictures, taken with image features, to collect a complex dataset of animals. In the third, we show
that a very big collection of pictures with noisy labels can
improve the performance of an object recognizer, even
though both the pictures and the labels are noisy and
uncontrolled. Each of these case studies illustrates what we
believe are important principles in using Internet
resources to produce labelled datasets.
1) Throwing away data items that cannot be labelled for some reason or another is a valuable strategy if
one starts with an enormous collection. With care,
this produces data with biases no worse than other
methods of building datasets.
2) Levering sources of information against one
another is very valuable, and the opportunities to
do so appear to be quite widespread. For example,
the text that appears near an image, taken with that image, can be very helpful in deciding what
labels an image should carry.
Each of these case studies illustrates how useful it is to
start with Web-scale resources. In the first case study, we
rely on being able to throw away captioned images we
cannot work with; because the original collection is so
large, there are enough captioned images we can handle to
3 See www.ImageParsing.com.
4 See http://www.mturk.com/.
yield a large dataset. In the second case study, because the collection is so large, we can find enough data items where
the relations between the visual data and the context are
clear enough that we can label effectively. In the third case
study, we find we can obtain a useful signal for image
labeling from many images with quite noisy labels
attached. This works because we have so many images
that we can smooth this noise effectively. Related and
future work to these studies is discussed in Section V.
A. Case Study: Preparing a Labeled Dataset of Names and Faces
Many nontraditional kinds of image data are now being
made freely available online. One such source of data is
news photographs with associated captions posted by
various news sources (Associated Press, Yahoo! News,
CNN, etc.) at rates of thousands of new images per day. This particular data source opens up new problems for
exploration. In traditional face recognition, the question
posed is: "Given a face, whose face is this?" In general, the
answer to this question could be one of hundreds,
thousands, or even millions of possible people. However,
for news photographs with captions, one can, without losing a great deal of accuracy, reduce this problem to consider only those people whose names appear in the corresponding caption (usually fewer than about ten
names). This is an example of the power of levering
sources of information against one another.
In our work on automatically labeling face images in
photographs [37], [38], we show that a large and realistic
face dataset (or face dictionary) can be built from a
collection of news photographs and their associated
captions (one photo-caption pair is shown in Fig. 2). Our automatically constructed face dataset consists of
30 281 face images, obtained by applying a face finder to
approximately half a million captioned news images. The
faces are labelled using image information from the photographs and word information extracted from the corresponding caption. Here we take advantage of recent
computer vision successes on accurate face detection
algorithms to focus on parts of the image we care about: those that contain people's faces. The other 470 000 images
we collected were thrown away as not containing faces, not
containing faces that were big enough to identify reason-
ably accurately, or not having names in their captions. If
one has sufficient noisy data, then it is entirely reasonable
to throw away 90% of the data to get a clean dataset.
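A minimal sketch of this aggressive filtering step: keep a photo-caption pair only if a sufficiently large face is detected and the caption contains at least one name. The detector outputs and name lists below are invented stand-ins for the cited components, and the size threshold is illustrative, not the paper's value:

```python
# Hypothetical filtering sketch: most of a huge noisy collection is
# thrown away, keeping only pairs that pass both tests.

MIN_FACE_SIZE = 60  # pixels; illustrative threshold only

def keep_pair(face_sizes, names, min_size=MIN_FACE_SIZE):
    """face_sizes: detected face sizes (pixels); names: names in caption."""
    big_enough = [f for f in face_sizes if f >= min_size]
    return bool(big_enough) and bool(names)

# Toy collection: most items fail one test or the other, so most of the
# data is discarded, as in the text.
collection = [
    ([80], ["Colin Powell"]),   # kept
    ([20], ["Colin Powell"]),   # face too small
    ([90], []),                 # no names in caption
    ([], ["Sophia Loren"]),     # no face found
]
kept = [pair for pair in collection if keep_pair(*pair)]
```

The point is not the thresholds themselves but the strategy: with an enormous starting pool, discarding the unusable majority still leaves a large, clean dataset.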
This dataset is more realistic than usual face recogni-
tion datasets because it contains faces captured "in the wild" under a wide range of positions, poses, facial expressions, and illuminations. The data display a rich variety of
phenomena found in real-world face recognition tasks: significant variations in color, hairstyle, expression, etc.
Equally interesting is that it does not contain large numbers of faces in highly unusual and seldom-seen poses, such
as upside down. Rather than building a database of face
images by choosing arbitrary ranges of pose, lighting, expression, and so on, we simply let the properties of a "natural" data source determine these parameters. News
photographs tend to be well illuminated and captured with
nice high-end cameras. The faces we find also tend to be
mainly frontal, relatively unoccluded, and large due to the
face detector and how we select detections. Name frequencies have the long tails that occur in natural language
problems. We expect that face images follow roughly the same distribution. We have hundreds to thousands of
images of a few individuals (e.g., President Bush) and a
large number of individuals who appear only a few times or
in only one picture (e.g., Sophia Loren in Fig. 1). One expects
real applications to have this property. For example, in
airport security cameras, a few people (security guards, or
airline staff perhaps) might be seen often, but the majority
of people would appear infrequently. All of these factors lead to biases in our created dataset and will have
implications for recognition algorithms trained using such a dataset; e.g., algorithms may not be as good at recognition
on highly occluded faces or recognition under low lighting
conditions. However, we believe that in the long run, developing detectors, recognizers, and other computer vision tools
around such a database in addition to current laboratory
produced recognition datasets will help produce programs that work better in realistic everyday settings.
We describe our construction of a face dictionary as a
sequence of three steps. First, we detect names in captions
using an open source named entity recognizer [39]. Next, we
detect [40], align, and represent faces using standard face
representations with some modifications to handle the large
quantity of data encountered here. Finally, we associate
names with faces using either a purely appearance-based clustering method or an enhanced method that also incorporates text cues in a maximum entropy framework [41].
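As a rough illustration of the first step, the sketch below treats runs of capitalized words as candidate names. The real system uses an open source named entity recognizer [39]; this toy regex is only meant to show the idea, and its output illustrates the over-merging a trained NER is better at avoiding:

```python
# Crude stand-in for a named entity recognizer: runs of two or more
# capitalized words become candidate names. Note the noise: titles
# like "President" and "State" get absorbed into the match.
import re

NAME_PATTERN = re.compile(r"(?:[A-Z][a-z]+ )+[A-Z][a-z]+")

def candidate_names(caption):
    return NAME_PATTERN.findall(caption)

# Invented caption for illustration.
caption = ("President George Bush meets Secretary of State Colin Powell "
           "in Washington on Monday.")
names = candidate_names(caption)
# names: ['President George Bush', 'State Colin Powell']
```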
This fully automatic method alternates between updating the best assignment of names to faces given the current
appearance and language models and updating the models
given the current set of assignments (some predicted
assignments are shown in Fig. 1). This allows us to build an
appearance model for each individual in the collection and a language model of depiction that is general across all
captions without having to provide any hand-labelled
supervisory information. The intuition behind how we do
this is as follows: if we observe a face across multiple
photographs with varying appearance but some visual
characteristics in common, and notice that this face always
co-occurs with the name "George Bush," then by pooling information across multiple pictures, we can learn an appearance model of the individual with the name "George Bush." On the language side, we might observe that often
when a name is followed by, for example, a present-tense
verb, the face that this name refers to is likely to be depicted
in the corresponding photograph. This information can be
pooled across all captions and names to learn a general
model for depiction.
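The alternation can be sketched as a hard-EM style loop. The one-dimensional "appearance" values, the toy name models, and all the data below are invented stand-ins for the face descriptors and appearance/language models of the actual system:

```python
# Minimal sketch of the alternation described above: assign each face
# the best name from its own caption, then refit each name's model
# from its assigned faces, and repeat.

def assign(faces, models):
    """Give each face the candidate name whose model mean is closest."""
    out = []
    for value, candidates in faces:
        out.append(min(candidates, key=lambda n: abs(value - models[n])))
    return out

def update(faces, labels):
    """Refit each name's model as the mean appearance of its faces."""
    sums, counts = {}, {}
    for (value, _), name in zip(faces, labels):
        sums[name] = sums.get(name, 0.0) + value
        counts[name] = counts.get(name, 0) + 1
    return {n: sums[n] / counts[n] for n in sums}

# faces: (appearance value, candidate names from the caption)
faces = [(1.0, ["Bush", "Blair"]), (1.2, ["Bush"]),
         (5.0, ["Blair", "Bush"]), (5.1, ["Blair"])]
models = {"Bush": 0.0, "Blair": 10.0}  # deliberately rough start
for _ in range(5):
    labels = assign(faces, models)
    models = update(faces, labels)
```

Restricting each face to the names in its own caption is what makes the search tractable, exactly the reduction from millions of identities to about ten described above.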
One important discovery we have made is that the
context in which a name appears in a caption provides
powerful cues as to whether it is depicted in the associated
image. Our language model learns several useful indications
of depiction, including the fact that if a name appears in a
caption directly followed by various forms of present-tense
verbs, that person is likely to be found in the accompanying
photo; names that appear near the beginning of the caption
tend to be pictured; words like Bshown,[ Bpictured,[ or
contextual cues such as B(L)[ or B(C)[ are good indicators
Fig. 1. The figure shows a representative set of clusters, illustrating a series of important properties of both the dataset and the method.
1) Some faces are very frequent and appear in many different expressions and poses, with a rich range of illuminations (e.g., clusters labelled
Secretary of State Colin Powell or Donald Rumsfeld). 2) Some faces are rare or appear in either repeated copies of one or two pictures or
only slightly different pictures (e.g., cluster labelled Chelsea Clinton or Sophia Loren). 3) Some faces are not, in fact, photographs (M. Ali).
4) The association between proper names and face is still somewhat noisy, for example, Leonard Nemoy, which shows a name associated with
the wrong face, while other clusters contain mislabeled faces (e.g., Donald Rumsfeld or Angelina Jolie). 5) Occasionally faces are incorrectly
detected by the face detector (Strom Thurmond). 6) Some names are genuinely ambiguous (James Bond, two different faces naturally
associated with the name (the first is an actor who played James Bond, the second an actor who was a character in a James Bond film).
7) Some faces appear in black in white (Marilyn Monroe) while most are in color. 8) Our clustering is quite resilient in the presence of spectacles
(Hans Blix, Woody Allen), perhaps wigs (John Bolton) and mustaches (John Bolton).
Berg et al. : It’s All About the Data
Vol. 98, No. 8, August 2010 | Proceedings of the IEEE 1439
of depiction when they occur. By incorporating simple natural
language techniques, we show significant improvement
in our face labeling procedure over a purely appearance-based
model: an image-based model gives approximately
67% accuracy on this task, while an image plus text model
improves performance to about 78%. Additionally, by learning
the visual and textual models in concert, we boost the
amount of information available from either the images or
the text alone. This increases the performance of both learned
models. We have conclusively shown that by incorporating
language information we can improve a vision task,
namely, automatic labeling of faces in images.
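These caption cues are simple enough to sketch in code. The following is a toy illustration only: the verb list, cue tokens, window size, and weights are invented for the example, not the values our learned language model actually estimates.

```python
import re

# Illustrative cue lists; the real model learns such cues from data.
DEPICTION_VERBS = {"smiles", "waves", "speaks", "gestures", "arrives"}
POSITION_CUES = {"shown", "pictured", "(L)", "(C)", "(R)"}

def depiction_score(caption, name):
    """Score how likely `name` is depicted, using three textual cues:
    position near the start of the caption, a present-tense verb directly
    following the name, and nearby indicators like "pictured" or "(L)"."""
    score = 0.0
    idx = caption.find(name)
    if idx < 0:
        return score
    # Names near the beginning of the caption tend to be pictured.
    if idx < len(caption) * 0.25:
        score += 1.0
    # A present-tense verb directly after the name is a strong cue.
    after = caption[idx + len(name):].lstrip()
    first_word = re.split(r"\W+", after, maxsplit=1)[0].lower() if after else ""
    if first_word in DEPICTION_VERBS:
        score += 2.0
    # Indicator tokens in a small window around the name.
    window = caption[max(0, idx - 20): idx + len(name) + 20].lower()
    if any(cue.lower() in window for cue in POSITION_CUES):
        score += 1.0
    return score

caption = "President George Bush waves to reporters, with Colin Powell (L)."
```

Here the first name outranks the others because it opens the caption and is followed by a present-tense verb, while "Colin Powell" picks up only the "(L)" cue.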
Once our procedure is complete, we have a large,
accurately labelled set of faces, an appearance model for
each individual depicted, and a natural language model
that can produce accurate results on captions in isolation.
This language model can accurately label names in captions, without looking at a single pixel in the accompanying
photo, with an accuracy of 86%. The dataset we produce
consists of 30 281 faces with roughly 3000 different
individuals that can be used for further exploration of face
recognition algorithms. Another product of our system is a
Web interface that organizes the news in a novel way,
according to individuals present in news photographs.
Users are able to browse the news according to individual (Fig. 2), bring up multiple photographs of a person, and view
the original news photographs and associated captions
featuring that person.
1) What Can Be Done With Names and Faces: While
perfect automatic labeling is not yet possible, the dataset of
labelled faces we have produced (Section III-A) has already
proven useful because it is large, because it contains challenging phenomena, and because correcting labels for a
subset is relatively straightforward.
Because the dataset is large, it is possible to focus
inquiry on specific subsets of the data. For example, Ozkan
and Duygulu [42] present a clustering algorithm that
requires many images of each individual. They used the
most frequent 23 people in the database (each of whom
occurred more than 200 times) in experiments on refining person-name assignments. In that work, more than half of
the labels needed to be correct, a threshold easily exceeded
by the current dataset. Work by Ferencz et al. concentrates
on another subset of the data: they consider a large number
of individuals (more than 400) with at least two images
per person for work on modeling appearance for face
recognition [43]. That work required completely correct
labeling of the faces, which was easily accomplished in a couple of hours with a custom interface to clean up the
automatic labeling in the "names and faces" dataset.
Because the dataset exhibits challenging phenomena, it
has been used as the starting point for the fully labelled
"labelled faces in the wild" (LFW) dataset [44], designed to
test face recognition on images of faces with real-world
variation. Benchmark results have shown that these data
are indeed very challenging for traditional algorithms and have focused research on effective face-recognition
algorithms [45], [46].
Besides our work on labeling faces in news photo-
graphs, there have been several other related projects in
the domain of face labeling with contextual information.
Gallagher and Chen have further expanded on our notion
of context to introduce gender and age cues into the face-
labeling problem [47]. By estimating probabilities of names referring to men or women, and older or younger
people, they can constrain the labeling problem accordingly (female faces should only be matched to female
names, an older face is more likely to correspond to the
name "Mildred," etc.), leading to improved performance.
Similar problems have also been explored for labeling:
faces in consumer photo collections [48], [49], faces in
video with associated transcripts [50]–[53], or in video with a small amount of manual labeling [54].
The trend in this area is toward further understanding
and use of the connection between images of faces and
context, in the images themselves, associated text or meta-
data, and the world at large.
B. Case Study: Preparing a Labeled Dataset of Animals on the Web
Not every picture has an explicit caption; sometimes,
pictures just have text nearby. Nonetheless, we can lever this
text against image features to obtain very useful results. Text
is a natural source of information about the content of
images, but the relationship between free text and images on
a Web page is complex. In particular, there are no obvious
indicators linking particular text items with image content
(a problem that does not arise if one confines attention to captions, annotations, or image names). All this makes text a
noisy cue to image content if used alone. However, this
noisy cue can be helpful if combined appropriately with
good image descriptors and good examples.
In our work on identifying images containing categories
of animals [55], we develop a method to classify images
depicting animals in a wide range of aspects, configurations,
and appearances. In addition, the images typically portray multiple species that differ in appearance (e.g.,
uakaris, vervet monkeys, spider monkeys, rhesus monkeys,
etc.). Our method is accurate despite this variation and
relies on an integration of information from four simple
cues: text, color, shape, and texture.
We demonstrate one application by harvesting approx-
imately 14 000 pictures for ten animal categories from the
Web. Since there is little point in looking for, say, "alligator" in Web pages that do not have words like "alligator," "reptile," or "swamp," we use Google to focus the search.
Using Google text search, we retrieve the top 1000 Web
pages for each category and use our method to rerank the
images on the returned pages.
From the retrieved images, we first select a small set of
example images from each query using only text infor-
mation. We use latent Dirichlet allocation [56] to discover a set of ten latent topics from the words contained on the
retrieved Web pages. These latent topics give distributions
over words and are used to select highly likely words for
each topic. We rank images according to their nearby
word likelihoods and select a set of 30 exemplars for each
topic.
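The ranking step can be sketched as follows. The word likelihoods below are hand-set stand-ins; in our system they come from the relevant LDA topic, and we keep 30 exemplars rather than the toy value used here.

```python
from math import log

# Stand-in word likelihoods for a "relevant" topic; in the real system
# these come from a ten-topic LDA model fit to the retrieved pages [56].
relevant_topic = {"frog": 0.08, "pond": 0.05, "amphibian": 0.04, "green": 0.02}
FLOOR = 1e-6  # tiny likelihood for words the topic assigns no mass to

def word_score(nearby_words, topic):
    """Summed log-likelihood of an image's nearby words under a topic."""
    return sum(log(topic.get(w, FLOOR)) for w in nearby_words)

def select_exemplars(images, topic, k=30):
    """Rank images by nearby-word likelihood; keep the top k as exemplars."""
    ranked = sorted(images, key=lambda im: word_score(im["words"], topic),
                    reverse=True)
    return ranked[:k]

images = [
    {"url": "a.jpg", "words": ["frog", "pond", "green"]},
    {"url": "b.jpg", "words": ["boots", "clip", "sale"]},
]
top = select_exemplars(images, relevant_topic, k=1)
```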
Words and images can be ambiguous (e.g., "alligator" could refer to "alligator boots" or "alligator clips" as well
as the animal). Currently, there is no known method for
breaking this polysemy-like phenomenon automatically.
Therefore, at this point, we ask the user to identify which
topics are relevant to the concept they are searching for.
The user labels each topic as relevant or background,
Fig. 2. We have created a Web interface for organizing and browsing news photographs according to individual. Our dataset consists of
30 281 faces depicting approximately 3000 different individuals. Here we show a screen shot of our face dictionary (top), one cluster from
that face dictionary (actress Jennifer Lopez) (bottom left), and one of the indexed pictures with corresponding caption (bottom right).
This face dictionary allows a user to search for photographs of an individual and gives access to the original news photographs and
captions featuring that individual. It also provides a new way of organizing the news according to the individuals present in its photos.
depending on whether the associated images and words illustrate the category well. Given this labeling, we merge
selected topics into a single relevant topic and unselected
topics into a background topic (pooling their exemplars
and likely words). These exemplars often have high
precision, a fact that is not surprising given that most
successful commercial image search techniques rely on
textual information to index images. The word score for an
image is thus the summed likelihood of nearby words given the relevant topic model.
Visual cues are evaluated by a voting method that
compares local image phenomena with the relevant visual
exemplars for the category. For each image, we compute
features of three types: shape, color, and texture. For each
type of feature, we create two pools: one containing
positive features from the relevant exemplars and the other
negative features from the background exemplars. For each feature of a particular type in a query image, we apply
a 1-nearest neighbor classifier with similarity measured
using normalized correlation to label the feature as the
relevant topic or the background topic. For each visual cue
(color, shape, and texture), we compute the sum of the
similarities of features matching positive exemplars,
normalized so that scores range between zero and one.
The final score is the sum of the four individual cue scores. By combining these modalities, a much
better ranking is achieved than using any of the cues alone.
Precision-recall curves are shown in Fig. 3, where, as you
move down the ranked set of images, precision is measured
as the percentage of images returned that are good and
recall as the percentage of all good images that are
returned. Though in isolation each of the visual and textual
features is relatively weak, the combination performs quite
well at this ranking task and generally provides an image ranking that is far superior to the original Google ranking.
The resulting sets of animal images (Fig. 4) are quite
compelling and demonstrate that we can handle a broad
range of animals.
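A minimal sketch of the final scoring and of the precision measure behind the curves; the per-cue scores and ground-truth flags below are invented for illustration.

```python
def combined_score(cue_scores):
    """Final score: the sum of the four per-cue scores (text, color,
    shape, texture), each already normalized to [0, 1]."""
    return sum(cue_scores.values())

def precision_at_k(ranked_labels, k):
    """Fraction of the top-k returned images that are good (label True)."""
    top = ranked_labels[:k]
    return sum(top) / float(len(top))

# Toy data: per-image cue scores plus a ground-truth "good" flag.
images = [
    ({"text": 0.9, "color": 0.7, "shape": 0.8, "texture": 0.6}, True),
    ({"text": 0.2, "color": 0.9, "shape": 0.1, "texture": 0.2}, False),
    ({"text": 0.8, "color": 0.6, "shape": 0.7, "texture": 0.7}, True),
]
ranked = sorted(images, key=lambda im: combined_score(im[0]), reverse=True)
labels = [good for _, good in ranked]
```

Sweeping k from 1 to the length of the ranked list, together with the matching recall values, traces out curves of the kind shown in Fig. 3.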
Because we apply our system to Web pages with free
text, the word cue is extremely noisy; but, similarly to
the face labeling task (discussed in Section III-A), we show
unequivocal evidence that combining visual information with textual information improves performance on an
image-labeling task. By considering top-ranked images
using our text and image classification model, we can
produce labelled datasets where the majority of labels are
correct. For the ten original animal categories, this
corresponds to an average precision of 55% for the top
100 returned results, as compared to the original Google
ranking precision of 21%. The giraffe and frog classifiers are especially accurate, returning 74 and 83 true positives,
respectively. For one category, monkey, we have collected
an additional much larger set of about 13 000 images using
the original query plus related query words like "old
world" or "science." The dataset that we produce from this
set of images is startlingly accurate (81% precision for the
first 500 images and 69% for the first 1000). Not only does
the resulting collection contain monkeys in a variety of poses, aspects, and depictions, but it also contains a large
number of monkey species and other related primates
including lemurs, chimps, and gibbons.
Here we have demonstrated a method to produce large,
accurate, and challenging visual datasets for object
categories with widely varying appearance using only a
small amount of human supervision. Though we chose to
focus on animal categories here, this method could be
Fig. 3. We are able to correctly classify images from the Web depicting categories of animals despite wide variations in appearance.
Even though the relationship between words and pictures on a Web page is complex, we are able to very effectively rerank Google search
results by combining several image- and text-based cues. These graphs show classification performance for three of our animal classes:
‘‘monkey’’ (left), ‘‘frog’’ (center), and ‘‘giraffe’’ (right). Colored precision-recall curves show performance for the original Google Web search
classification (red), word-based classification (green), local shape feature-based classification (magenta), color-based classification (cyan), and
texture-based classification (yellow). Our final output classification (black) utilizes a combination of all of these cues. Even though each cue in
isolation may have poor ranking performance, levering the different sources of information in combination (black) tends to do quite well.
In particular, you can see that incorporating visual information increases classification performance enormously over using word-based
classification alone.
applied to most other natural object categories, providing one possible solution for collecting large labelled image
datasets.
1) What Can Be Done With Text and Images: Many other
computer vision researchers have explored the relationship
between words and pictures in various settings: exploring the semantics between words and pictures [57]–[59], image clustering [60], [61], or story illustration [62].
Others have attempted to leverage Internet image col-
lections to assist in image search [63], [64] or recognition
[65]–[70]. The most common strategy is to improve anno-
tation quality or filter spurious search results output by
one of the commercial search engines, gathering a new
Fig. 4. By levering different sources of information (image cues computed on the picture itself and text cues computed on the surrounding
Web page) we are able to effectively rerank Google search results. Here we show some ranked results from the "bear," "dolphin," "frog,"
"giraffe," "leopard," and "penguin" categories. Most of the top classified images for each category are correct and often display a wide variety of
poses ("giraffe"), depictions ("leopard": heads or whole bodies), and even multiple species ("penguin," "bear"). Notice that the highly
ranked false positives (dark red) are quite reasonable, since they display appearances or contexts similar to the true category: teddy bears for
the "bear" class, whale images for the "dolphin" class, and leopard frogs or leopard geckos for the "leopard" class. Drawings, even if they
depict the desired category, were counted as false positives for this task (e.g., "dolphin" and "leopard" categories).
collection of images that can be used for training
recognition systems [65]–[68].
Along these lines, we propose several other important goals for future research: labeling Web images with
relevant nouns extracted from the containing Web pages,
ranking nearby sentences according to how well they
describe the picture, and developing automated systems
for writing summaries about the content of photographs by
mining nearby textual information.
C. Case Study: The Internet as a Source of Features
So far, we have looked at ways of using Internet images
to build datasets. We can also build representations for an
input image based on the metadata that surrounds visually
similar Internet images. The metadata, such as surrounding
text or GPS coordinates, provides direct access to important scene information. Here, we discuss our recent
work [71] that builds text features from histograms of tags
and group names that are linked to the K most similar
images in a large set downloaded from Flickr (a popular
photo-sharing site). The text features can then be used to
improve prediction of object presence (Fig. 5 shows the
framework of our approach).
Our method is based on two key ideas. First, it is often easier to determine the content of an image using nearby text
than with currently available image features. State-of-the-art
methods in computer vision [72] are still not capable of
handling the unpredictability of object positions and sizes,
appearance, lighting, and unusual camera angles that are
common in consumer photographs, such as those found on
Flickr. Determining object presence from the text that
surrounds an image (tags, discussion, group names) is also far from trivial due to polysemy, synonymy, and incomplete or
spurious annotations. Still, as we saw in Section III-B, even
unstructured text can provide valuable information that is
not easy to extract from image features. The second
important idea is that, given a large enough dataset, we are
bound to find very similar images to an input image, even
when matching with simple image features. This idea has
been demonstrated by Torralba et al. [73], who showed that matching tiny (32 × 32) images using Euclidean distance of
intensity leads to surprisingly good object recognition
results if the dataset is large enough (tens of millions of
images). Likewise, Hays and Efros [74], [75] showed that
simple image matching can be used to complete images and
to infer world coordinates. Our approach is to infer likely
text for our input image based on similar images in a large
dataset and use that text to determine whether an object is present. Along these lines, Quattoni et al. [69] use captioned
images to learn a more predictive visual representation. Our
work is related to this in that we learn a distance metric
that causes images with similar surrounding text to be
similar in visual feature space.
The text features that we use are straightforward.
Flickr images possess tags and are often assigned to one or
more groups. When we collected the images, we made a record of the tags and group names associated with each
image. Each unique tag or group name forms a single item,
even though it may include multiple words. For example,
the group name "Dogs! Dogs! Dogs!" is treated as a single
item. We use only tags and group names that occur
frequently in the auxiliary dataset, resulting in a vocabu-
lary of about 6000 items. The text features are computed
by building a histogram of these items, counted over the K nearest neighbors of the test image. In our experiments,
K = 150. We can then train and evaluate a classifier based
on these text features in the standard way (we use a
support vector machine classifier with a chi-squared kernel
in our experiments).
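The feature construction itself is a few lines. In this sketch the vocabulary and neighbor tags are toy values (the real vocabulary has about 6000 items and K = 150); the resulting histogram is what would be fed to the chi-squared-kernel SVM.

```python
from collections import Counter

def text_feature(neighbor_tags, vocabulary):
    """Histogram of tags/group names over the K nearest Internet images,
    restricted to the frequent items kept in the vocabulary."""
    counts = Counter(t for tags in neighbor_tags for t in tags
                     if t in vocabulary)
    return [counts[item] for item in vocabulary]

# A multiword group name is a single vocabulary item.
vocabulary = ["cat", "kitten", "Dogs! Dogs! Dogs!", "beach"]
# Tags of K = 3 hypothetical nearest neighbors of a test image.
neighbor_tags = [["cat", "kitten"], ["cat", "sofa"], ["Dogs! Dogs! Dogs!"]]
feat = text_feature(neighbor_tags, vocabulary)  # [2, 1, 1, 0]
```

Note that "sofa" contributes nothing because it is outside the vocabulary; only the frequent items from the auxiliary dataset count.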
In the work described in this paper, we found that
these text features do not outperform standard constructions
of visual features for standard discriminative tasks. However, because the text features tend to make quite
different errors than the visual features (also observed in
our work on classifying animal images; see Section III-B),
predictions based on one can correct predictions based on
the other. In general, we find that, by combining the
features, we can obtain a small but significant improve-
ment in error rate across classes for a standard task
(classification in the PASCAL 2006 dataset; see [76]). Figs. 6 and 7 provide illuminating examples. In Fig. 6, we
show examples misclassified by the visual classifier but
correctly classified by the text classifier on the PASCAL
2006 dataset. In the first image, the cat is in a sleeping
pose, which is unusual in the PASCAL training set, and so
the visual classifier gets it wrong. However, we find many
such images in the auxiliary dataset, and there are several
sleeping cat images in the 25 nearest neighbors, so the text cue can make a correct prediction. When the nearest
neighbors are poor, the text feature is less helpful, as Fig. 7
shows. While one might worry that our text features are
powerful only because our images are tagged with category
labels, this is not the case. We have tested this by excluding
category names and their plural inflections from the text
features. This means that, for example, the words "cat"
Fig. 5. The framework of our approach. We have training and
test images (here we only show the test image part). We also have an
auxiliary dataset consisting of Internet images and associated text.
For each test image, we extract its visual features and find the K most
similar images from the Internet dataset. The text associated with
these nearest neighbor Internet images is summarized to build the text
features. Text classifiers that are trained with the same type of text
features are applied to predict the object labels. We can also train
visual classifiers with the visual features. The outputs from the two
classifiers are fused to do the final classification.
and "cats" would not appear in the features. The effect on
performance is extremely small. This suggests that text associated
with images is rich in secondary cues (perhaps
"mice" or "catnip" appear strongly with cats).
IV. THE INTERNET AS A MARKETPLACE
Another way to see the Internet is as a device to connect
willing buyers with willing sellers. We would like to buy
image annotations cheaply and in large quantities by
outsourcing annotation work to an online worker commu-
nity. There are now strong tools for doing so efficiently.
We have used Amazon’s Mechanical Turk (Section IV-A)
and have found the resulting annotations to be both good and cheap (Section IV-B).
A. How to Annotate on Mechanical Turk
Each annotation task is converted into a human intel-
ligence task (HIT). The tasks are submitted to Amazon
Mechanical Turk (MT). Online workers choose to work on
the submitted tasks. Every worker opens our Web page
with a HIT and does what we ask them to do. They
"submit" the result to Amazon. We then fetch all results from Amazon MT and convert them into annotations. The
core tasks for a researcher are 1) define an annotation
protocol and 2) determine what data need to be annotated.
To get good quality data, one must ensure that the
workers understand the requested task and try to perform
it well. We have found it to be extremely helpful to set up
examples showing how we would like the labeling task to
be performed. Additionally, we have three strategies to clean up occasional errors and detect and prevent cheating.
The basic strategy is to collect multiple annotations for
every image. This will account for natural variability of
human performance, reduce the influence of occasional
errors, and allow us to catch malicious users. However,
this increases the cost of annotation. The second strategy is
to perform a separate grading task. A worker looks at
several annotated images and scores every annotation. We get explicit quality assessments at a fraction of the cost
because grading is easy. The third strategy is to build a gold standard: a collection of images with trusted annotations.
Fig. 6. The left column shows the PASCAL 2006 images whose category labels cannot be predicted by the visual classifier but can be predicted
by the text classifier; the center column shows the 25 nearest neighbor images retrieved from the Internet dataset; the right column shows
the built text feature vectors. In the first image, the cat is in a sleeping pose, which is unusual in the PASCAL training set, so the visual classifier gets
it wrong. Some sleeping cat images are retrieved from the auxiliary dataset, and the text features then make a correct prediction.
Images from the gold standard are injected into the
annotation process and used to detect and correct workers
deviating significantly from the desired results. This
strategy is again cheap, as only a small fraction of images
come from the gold standard.
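The gold-standard strategy reduces to a simple filter over worker submissions. The sketch below is illustrative only: the label values, the error tolerance, and the data layout are all invented for the example.

```python
def flag_workers(submissions, gold, max_error_rate=0.25):
    """Flag workers whose answers on injected gold-standard images deviate
    from the trusted annotations more often than the tolerated rate."""
    flagged = set()
    for worker, answers in submissions.items():
        checked = [img for img in answers if img in gold]
        if not checked:
            continue  # this worker saw no gold-standard images
        errors = sum(answers[img] != gold[img] for img in checked)
        if errors / float(len(checked)) > max_error_rate:
            flagged.add(worker)
    return flagged

gold = {"img1": "person", "img2": "no_person"}  # trusted annotations
submissions = {
    "workerA": {"img1": "person", "img2": "no_person", "img9": "person"},
    "workerB": {"img1": "no_person", "img2": "no_person"},
}
bad = flag_workers(submissions, gold)  # {"workerB"}
```

Because only a small fraction of images come from the gold standard, the per-worker check stays cheap while still catching systematic deviation.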
B. Our Experience of Annotation on Mechanical Turk
We have detailed quality data for four annotation pro-
tocols (Fig. 8): two coarse object segmentation protocols,
polygonal labeling, and 14-point human landmark labeling.
The object segmentation protocols show an image to the
worker and a small image of the query (person). We ask
the worker to click on every circle (site) overlapping with
the query (person). The first protocol places sites on a regular grid, whereas the second places sites at the centers
of superpixels (computed with [77], [78]). In the third
protocol, polygonal labeling, we ask the worker to trace the
boundary of the person in the image (similar to the
LabelMe [31] task). The fourth protocol labels the land-
marks of the human body used for pose annotation [79],
asking the worker to click on locations of these points in a
specified order. So far, we have run five annotation experiments
using data collected from YouTube (experiments 1, 2, 5), a dataset of people [79] (experiments 3, 4), a
small sample of data from LabelMe [31], Weizmann [80],
and our own dataset (experiment 5). In all experiments,
we are interested in people. As shown in Table 1, we have
obtained a total of 3861 annotations for 982 distinct images
collected for a total cost of $59.
We present sample annotation results (Figs. 8 and 9) to
show the representative annotations and highlight the
most prominent failures. We are extremely satisfied with
the quality of the annotations, given that
workers receive no feedback from us.
1) Pricing: Pricing annotation work is difficult, and
throughput is quite sensitive to price. Even if the task is
underpriced, some workers participate out of curiosity or
for entertainment, but we do not expect to be able to
obtain good annotations at large scales like this. If the price
is too high, we could be wasting resources and possibly
attracting inefficient workers. We have no algorithm for
pricing but surveyed workers to get a sense of how to price work. As Table 1 shows, the hourly pay in experiments 4
and 5 was roughly $1/h. In these experiments, we had a
comments field, and some comments suggested that the
pay should be increased by a factor of three. From this, we
conclude that the perceived fair pricing is about $3/h,
though we expect that this depends on the nature of the
work. In further experiments, we have offered small
amounts of work at a given price, then raised or lowered pay depending on how quickly it was finished.
2) Annotation Quality: To understand the quality of
annotations, we use three simple consistency scores for a
pair of annotations (a1 and a2) of the same type. For protocols 1–3, we divide the area where the annotations
disagree by the area marked by either of the two annotations.
We can think of this as XOR(a1, a2)/OR(a1, a2). For
Fig. 7. The left column shows the PASCAL images whose category labels cannot be predicted by the text classifier but can be predicted by the
visual classifier; The center column shows the 25 nearest neighbor images retrieved from the Internet dataset; the right column shows the built
text features of the PASCAL images. The text features do not work here mainly because we fail to find good nearest neighbor images.
protocols 1 and 2, XOR counts the sites with different annotations, and OR counts the sites marked by either of the
two annotations a1 and a2. For protocol 3, XOR is the area
of the symmetric difference and OR is the area of the
union. For protocol 4, we measure the average distance
between the selected landmark locations. Ideally, the
locations coincide and the score is zero.
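These scores are straightforward to implement. A sketch for the site-clicking protocols (1 and 2), where an annotation is the set of marked sites, and for the landmark protocol (4), where it is a list of (x, y) points:

```python
def site_consistency(a1, a2):
    """XOR/OR score for protocols 1 and 2: sites on which the two
    annotations disagree, divided by sites marked in either annotation."""
    a1, a2 = set(a1), set(a2)
    union = a1 | a2
    if not union:
        return 0.0
    return len(a1 ^ a2) / float(len(union))

def landmark_consistency(points1, points2):
    """Protocol 4: average Euclidean distance between paired landmarks;
    zero means the two annotations coincide exactly."""
    dists = [((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
             for (x1, y1), (x2, y2) in zip(points1, points2)]
    return sum(dists) / len(dists)
```

For protocol 3 the same XOR/OR ratio is taken over polygon areas (symmetric difference over union) rather than over site sets.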
We then select the two best annotations for every
image by simply taking the pair with the lowest score, i.e., the most consistent pair. For protocol 3, we further assume that the
polygon with more vertices is a better annotation and we
put it first in the pair. The distribution of scores and a
detailed analysis appears in Fig. 9 with scores ordered
from best (lowest) on the left to the worst (highest) on
the right. Some of the errors come from sloppy annotations
(especially in the heavily underpaid experiment 3), but
most of the disagreements arise when the question we ask is difficult to answer. For example, in experiment 4, workers
were asked to label landmarks that are often obscured
by clothing. In Fig. 10, we show consistency of the anno-
tations of each landmark between the thirty-fifth and the
sixty-fifth percentile of Fig. 9, indicating that hips are
much more difficult to localize compared to other, less
often obscured joint positions.
More recently, in pursuit of an object-recognition research agenda [81], we have used Mechanical Turk to
label two large datasets with approximately 400 000 labels.
Our datasets are intended to explore the object description
problem. In particular, we want to describe objects in terms
of their attributes, properties like "made of wood," "made
of metal," "has a wheel," "has skin," and so on. We
collected attribute annotations for each of 20 object classes
in a standard object recognition dataset, PASCAL VOC 2008, created for classification and detection of visual
object classes in a variety of natural poses, viewpoints, and
orientations. These object classes cluster nicely into
"animals," "vehicles," and "things" and include object
classes such as people, bird, cat, boat, tv, etc., with between
150 and 5000 instances per category. To supplement this
dataset, we collected images for 12 additional object classes
from Yahoo! image search, selected to have objects similar to the PASCAL classes while having different correlations
between the attributes. For example, PASCAL has a "dog" category, so we collected "wolf" images, and so on. Objects
in this set include, for example: wolf, zebra, goat, donkey,
monkey, statue of people, centaur, etc.
Fig. 8. Example results obtained from the annotation experiments for two of our protocols. The first column is the
implementation of the protocol, the second column shows obtained results, and the third column shows some poor annotations we observed.
The user interfaces are similar, simple, and easy to implement.
Table 1 Collected Data. In Our Five Experiments, We Have Collected
3861 Labels for 982 Distinct Images for Only $59. In Experiments 4 and 5,
the Throughput Exceeds 300 Annotations Per Hour Even at a Low ($1/h)
Hourly Rate. We Expect a Further Increase in Throughput as We Increase the
Pay to Effective Market Rate
We made a list of 64 attributes to describe our objects
and collected annotations for semantic attributes for each
object using Amazon’s Mechanical Turk, a total of approximately half a million attribute labels for a total
cost of $600. Labeling objects with their attributes can
often be an ambiguous task. This can be demonstrated by
imperfect interannotator agreement among "experts" (authors) and Amazon Turk annotators. The agreement
among experts is 84.3%, between experts and Amazon Turk
annotators is 81.4%, and among Amazon Turk annotators is
84.1%. As we have worked with the dataset, we have found some important idiosyncrasies. For example, images of
people labelled as not having "skin" almost always do; there
just is not very much visible, or the skin that is visible is
unremarkable (hands or faces). This effect could be
ascribed to miscommunication between annotators and collectors. We have not found evidence that we need to
deploy quality control measures, though we may have been
lucky in the interface design.
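The agreement numbers above are just the fraction of matching labels between two annotators; a minimal sketch on toy binary attribute vectors:

```python
def agreement(labels_a, labels_b):
    """Fraction of attribute labels on which two annotators agree."""
    assert len(labels_a) == len(labels_b)
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / float(len(labels_a))

# Toy binary attribute vectors ("has wheel", "made of metal", ...).
expert = [1, 0, 1, 1, 0]
turker = [1, 0, 0, 1, 0]  # disagrees on one of five attributes
```

Averaging this score over all pairs of annotators within a group (experts, or Turk workers) gives group-level figures like the 84.3% and 84.1% reported above.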
V. CONCLUSION: DATASET RESEARCH ON THE INTERNET IS FERTILE
Recognition research needs datasets, and the Internet is a
valuable source of datasets. Collecting these datasets is not
just collation. Instead, the question of how to produce
Fig. 9. Quality details. We present detailed analysis of annotation quality for experiment 4. For every image, the best fitting pair of annotations
is selected. The score of the best pair is shown in the figure. For experiment 4, we compute the average distance between the marked points.
The scores are ordered low (best) to high (worst). For clarity, we render annotations at every 15th percentile of the score (the 5th, 20th, ..., 95th). Blue curve and dots show
annotation 1; yellow curve and dots show annotation 2 of the pair.
Fig. 10. Quality details per landmark. We present analysis of annotation quality per landmark in experiment 4. We show scores of the best
pair for all annotations between the thirty-fifth and sixty-fifth percentiles (between points C and E of experiment 4 in Fig. 9). All the plots
have the same scale: from image 100 to 200 on the horizontal axis and from 3 to 13 pixels of error on the vertical axis. These graphs show
annotators have greater difficulty choosing a consistent location for the hip than for any other landmark; this may be because some place the
hip at the point a tailor would use and others mark the waist or because the location of the hip is difficult to decide under clothing.
Berg et al. : It’s All About the Data
1448 Proceedings of the IEEE | Vol. 98, No. 8, August 2010
them has proven fertile. It has been the source of inspi-ration for a wide range of research activities and for new
research agendas. The question of how to attach a name to
a visual representation of a face using whatever context
seems likely to help has now produced novel applications,
datasets, and insights about context and its meaning
(Section III-A). Similarly, the general question of how to
exploit text near images has become a major research
agenda in computer vision (Section III-B1). But there are other, currently less well-developed lines
of enquiry. For example, we could use the view of the
Internet as a marketplace where annotation services are
available very cheaply to evaluate object detectors in a
realistic way. Current practice involves building datasets of
annotated positive and negative examples, then using part
for training and part for testing. Once such a dataset has
been used a lot, results reported are questionable because they are subject to a form of selection bias. Where cheap
annotation is available, we can adopt a different strategy.
We want to evaluate a detector on all possible images, but
annotation is not cheap enough to annotate all images. So
we must draw a sample, and we could do so by applying the
detector to a large pool of images (ideally, every image),
then sampling the images where the detectors respond and
sending them out for annotation. This argument is similar in flavor to the procedures used by active learning systems,
where a large pool of data is iteratively used to develop
classifiers with a human in the loop for interactive training
[82]. In particular, if one has two or more detectors, each
may serve as a guide to where other detectors should have
responded but did not. This view offers the prospect of
relatively cheap, very large-scale relative evaluations of
detectors that yield statistics close to the actual performance one expects on every image in the world.
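The sampling strategy described above fits in a few lines. This is a hedged sketch, not the authors' implementation: the detector functions, threshold, and annotation budget below are hypothetical placeholders for real detectors and a real annotation service.

```python
import random

def select_for_annotation(image_ids, detectors, threshold=0.5, budget=100, seed=0):
    """Sample images for annotation from those where any detector responded.

    `detectors` is a list of scoring functions (image -> float); only images
    on which at least one detector scores above `threshold` are candidates,
    and at most `budget` of them are sampled for human annotation.
    """
    responded = [
        img for img in image_ids
        if any(score(img) >= threshold for score in detectors)
    ]
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    k = min(budget, len(responded))
    return rng.sample(responded, k)

if __name__ == "__main__":
    pool = list(range(1000))
    # Two toy "detectors": deterministic functions of the image id.
    det_a = lambda img: (img % 7) / 6.0
    det_b = lambda img: (img % 11) / 10.0
    batch = select_for_annotation(pool, [det_a, det_b], threshold=0.9, budget=50)
    print(len(batch))  # at most 50 images to annotate instead of 1000
```

With two or more detectors, the union of their responses plays the role described in the text: each detector guides annotation toward images where the others should have responded but did not.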
As another example, if one has thousands (or millions)
of pictures of an object category, which pictures are iconic,
in the sense that they give a good representation of the
object? Some recent papers have explored methods for
choosing the most representative or canonical photographs
for a given location [83], monument [3], [84], or object
category [85] by looking at coherence within an image and across the set of search returns. But another strategy
observes that only well-composed and aesthetically
appealing images are likely to be the most relevant to a
query (returning a poor image is unlikely to be aligned with a user's needs) [86]–[88]. Recently, Flickr.com, a
popular photo-sharing site, has used a similar idea quite
successfully with their measure of "interestingness," related to user behavior around photographs. Photographs are considered "interesting" if they demonstrate a
great deal of user activity, in the form of user favorites,
clicks, comments, etc. This measure, though computed
from human behavior, is usually directly related to the photograph's intrinsic quality since people tend to mark
as a favorite or comment on only those photographs that
strike them as pleasing in some way. Though neither
aesthetic quality nor "interestingness" directly addresses
the question of relevance, utilizing these alternative
sources of information as part of search could help to
improve ranking results significantly.
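As a rough illustration of re-ranking by such activity signals (the weights and photo records below are invented for the example; Flickr's actual interestingness measure is not public):

```python
def interestingness(photo):
    """Weighted sum of user-activity signals; the weights are illustrative."""
    return (3.0 * photo["favorites"]
            + 2.0 * photo["comments"]
            + 0.1 * photo["clicks"])

def rerank(results):
    """Order search results from most to least interesting."""
    return sorted(results, key=interestingness, reverse=True)

if __name__ == "__main__":
    results = [
        {"id": "a", "favorites": 1, "comments": 0, "clicks": 40},
        {"id": "b", "favorites": 12, "comments": 5, "clicks": 300},
        {"id": "c", "favorites": 0, "comments": 2, "clicks": 400},
    ]
    print([p["id"] for p in rerank(results)])  # ['b', 'c', 'a']
```

In a search pipeline, a score like this would be combined with a text-relevance score rather than replace it, since interestingness alone says nothing about whether an image matches the query.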
As yet another example, if we want to allow users to search, browse, or otherwise access visual data in large
quantities, it will need to be organized in ways that make
sense. This organization will have to represent at least
some components of the "meaning" of the visual data.
Obtaining this meaning will, for the foreseeable future,
require an understanding of the complex relationships
between visual data and textual information that appears
nearby. The ultimate goal is to use methods of both computer vision and natural language understanding to produce a representation of meaning for visual data from that
data and all the text that might be relevant. Doing so is
much more difficult than, for example, attaching names to
faces because the tools in both vision and natural language
are less well developed for the general case. However, as
we have shown, words on pages surrounding images
contain really useful information about those pictures. In summary, what might appear to be a humdrum question (how to collect data on the Internet) is in fact a
fertile research area in computer vision that forces us to
address deep and important questions of visual representation and meaning.
Acknowledgment
The authors thank the anonymous referees for their
helpful comments. They thank Dolores Labs and L. Biewald
for their help managing data annotation.
REFERENCES
[1] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge, U.K.: Cambridge Univ. Press, 2003.
[2] N. Snavely, S. M. Seitz, and R. Szeliski, "Photo tourism: Exploring photo collections in 3D," in Proc. SIGGRAPH, 2006.
[3] X. Li, C. Wu, C. Zach, S. Lazebnik, and J.-M. Frahm, "Modeling and recognition of landmark image collections using iconic scene graphs," in Proc. ECCV, 2008.
[4] D. Forsyth and J. Ponce, Computer Vision: A Modern Approach. Englewood Cliffs, NJ: Prentice-Hall, 2002.
[5] L. Shapiro and G. Stockman, Computer Vision. Englewood Cliffs, NJ: Prentice-Hall, 2001.
[6] J. Ponce, M. Hebert, C. Schmid, and A. Zisserman, Eds., Toward Category-Level Object Recognition. Berlin, Germany: Springer, 2006.
[7] N. Ikizler and D. Forsyth, "Searching video for complex activities with finite state models," Int. J. Comput. Vision, 2008.
[8] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, "Face recognition: A literature survey," ACM Comput. Surv., vol. 35, no. 4, pp. 399–458, 2003.
[9] S. G. Kong, J. Heo, B. R. Abidi, J. Paik, and M. A. Abidi, "Recent advances in visual and infrared face recognition: A review," Comput. Vision Image Understand., vol. 97, no. 1, pp. 103–135, 2005.
[10] S. Z. Li and A. K. Jain, Eds., Handbook of Face Recognition. Berlin, Germany: Springer, 2005.
[11] P. J. Phillips and E. Newton, "Meta-analysis of face recognition algorithms," in Proc. Int. Conf. Autom. Face Gesture Recognit., 2002.
[12] B. Leibe, A. Leonardis, and B. Schiele, "Robust object detection with interleaved categorization and segmentation," Int. J. Comput. Vision, 2008.
[13] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. CVPR, 2005, vol. 1, pp. 886–893.
[14] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio, "Pedestrian detection using wavelet templates," in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 1997, pp. 193–199.
[15] P. Viola, M. Jones, and D. Snow, "Detecting pedestrians using patterns of motion and appearance," Int. J. Comput. Vision, vol. 63, no. 2, pp. 153–161, Jul. 2005.
[16] M.-H. Yang, D. Kriegman, and N. Ahuja, "Detecting faces in images: A survey," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, pp. 34–58, 2002.
[17] H. Rowley, S. Baluja, and T. Kanade, "Neural network-based face detection," in Proc. CVPR, 1996, pp. 203–208.
[18] T. Poggio and K.-K. Sung, "Finding human faces with a Gaussian mixture distribution-based face model," in Proc. Asian Conf. Comput. Vision, 1995, pp. 435–440.
[19] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proc. Int. Conf. Comput. Vision, 2001.
[20] L. Fei-Fei, R. Fergus, and P. Perona, "One-shot learning of object categories," IEEE Trans. Pattern Anal. Machine Intell., to be published.
[21] G. Griffin, A. Holub, and P. Perona, Caltech-256 object category dataset, California Inst. of Technology, Tech. Rep. 7694, 2007. [Online]. Available: http://authors.library.caltech.edu/7694
[22] M. Everingham, A. Zisserman, C. K. I. Williams, and L. Van Gool, The PASCAL visual object classes challenge 2006 (VOC2006) results. [Online]. Available: http://www.pascal-network.org/challenges/VOC/voc2006/results.pdf
[23] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, The PASCAL visual object classes challenge 2007 (VOC2007) results. [Online]. Available: http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html
[24] S. Agarwal, A. Awan, and D. Roth, "Learning to detect objects in images via a sparse, part-based representation," IEEE Trans. Pattern Anal. Machine Intell., vol. 26, pp. 1475–1490, Nov. 2004.
[25] K. Barnard, Q. Fan, R. Swaminathan, A. Hoogs, R. Collins, P. Rondot, and J. Kaufhold, "Evaluation of localized semantics: Data, methodology, and experiments," Int. J. Comput. Vision, 2008.
[26] C. Papageorgiou and T. Poggio, "A trainable system for object detection," Int. J. Comput. Vision, 2000.
[27] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. CVPR, 2005.
[28] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, "Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection," IEEE Trans. Pattern Anal. Machine Intell. (Special Issue on Face Recognition), vol. 19, no. 7, pp. 711–720, 1997.
[29] P. J. Phillips, A. Martin, C. Wilson, and M. Przybocki, "An introduction to evaluating biometric systems," Computer, vol. 33, no. 2, pp. 56–63, 2000.
[30] T. Sim, S. Baker, and M. Bsat, "The CMU pose, illumination, and expression (PIE) database," in Proc. AFGR, 2002.
[31] B. Russell, A. Torralba, K. Murphy, and W. T. Freeman, "LabelMe: A database and Web-based tool for image annotation," Int. J. Comput. Vision, 2007.
[32] L. von Ahn and L. Dabbish, "Labeling images with a computer game," in Proc. ACM CHI, 2004.
[33] L. von Ahn, R. Liu, and M. Blum, "Peekaboom: A game for locating objects in images," in Proc. ACM CHI, 2006.
[34] B. Yao, X. Yang, and S.-C. Zhu, "Introduction to a large scale general purpose ground truth dataset: Methodology, annotation tool, and benchmarks," in Proc. EMMCVPR, 2007.
[35] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. CVPR, 2009.
[36] C. Fellbaum, Ed., WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press, 1998.
[37] T. Berg, A. Berg, J. Edwards, M. Maire, R. White, E. Learned-Miller, Y. Teh, and D. Forsyth, "Names and faces," in Proc. CVPR, 2004.
[38] T. L. Berg, A. C. Berg, J. Edwards, and D. Forsyth, "Who's in the picture?" in Proc. NIPS, Dec. 2004.
[39] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, "GATE: A framework and graphical development environment for robust NLP tools and applications," in Proc. Assoc. Comput. Linguist., 2002.
[40] K. Mikolajczyk, "Face detector," Ph.D. dissertation, INRIA Rhône-Alpes.
[41] A. Berger, S. Pietra, and V. D. Pietra, "A maximum entropy approach to natural language processing," Comput. Linguist., vol. 22, no. 1, 1996.
[42] D. Ozkan and P. Duygulu, "A graph based approach for naming faces in news photos," in Proc. CVPR, 2006.
[43] A. Ferencz, E. Learned-Miller, and J. Malik, "Learning hyper-features for visual identification," in Proc. NIPS, 2005.
[44] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: A database for studying face recognition in unconstrained environments," Univ. of Massachusetts, Amherst, Tech. Rep., 2007.
[45] E. Nowak and F. Jurie, "Learning visual similarity measures for comparing never seen objects," in Proc. CVPR, 2007.
[46] G. B. Huang, V. Jain, and E. Learned-Miller, "Unsupervised joint alignment of complex images," in Proc. ICCV, 2007.
[47] A. Gallagher and T. Chen, "Estimating age, gender and identity using first name priors," in Proc. CVPR, 2008.
[48] M. Naaman, R. B. Yeh, H. Garcia-Molina, and A. Paepcke, "Leveraging context to resolve identity in photo albums," in Proc. Joint Conf. Digital Libraries, 2005.
[49] L. Zhang, Y. Hu, M. Li, W. Ma, and H. Zhang, "Efficient propagation for face annotation in family albums," in Proc. ACM Multimedia, 2004.
[50] S. Satoh, Y. Nakamura, and T. Kanade, "Name-It: Naming and detecting faces in news videos," IEEE Multimedia, 1999.
[51] R. Houghton, "Named faces: Putting names to faces," IEEE Intell. Syst., 1999.
[52] X. Song, C.-Y. Lin, and M.-T. Sun, "Cross-modality automatic face model training from large video databases," in Proc. CVPR Workshop, 2004.
[53] M. Everingham, J. Sivic, and A. Zisserman, "Hello! My name is... Buffy: Automatic naming of characters in TV video," in Proc. BMVC, 2006.
[54] D. Ramanan, S. Baker, and S. Kakade, "Leveraging archival video for building face datasets," in Proc. ICCV, 2007.
[55] T. L. Berg and D. A. Forsyth, "Animals on the web," in Proc. CVPR, 2006.
[56] D. Blei, A. Ng, and M. Jordan, "Latent Dirichlet allocation," J. Machine Learn. Res., 2003.
[57] K. Barnard and D. Forsyth, "Learning the semantics of words and pictures," in Proc. ICCV, 2001.
[58] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. I. Jordan, "Matching words and pictures," J. Machine Learn. Res., 2003.
[59] J. Li and J. Z. Wang, "Automatic linguistic indexing of pictures by a statistical modeling approach," IEEE Trans. Pattern Anal. Machine Intell., 2003.
[60] B. Gao, T. Liu, T. Qin, X. Zheng, Q. Cheng, and W. Ma, "Web image clustering by consistent utilization of visual features and surrounding texts," in Proc. ACM Multimedia, 2005.
[61] K. Barnard, P. Duygulu, and D. Forsyth, "Clustering art," in Proc. CVPR, Jun. 2001.
[62] D. Joshi, J. Z. Wang, and J. Li, "The story picturing engine: Finding elite images to illustrate a story using mutual reinforcement," in Proc. 6th ACM SIGMM Int. Workshop Multimedia Inf. Retrieval (MIR'04), 2004.
[63] R. Fergus, P. Perona, and A. Zisserman, "A visual category filter for Google images," in Proc. 8th Eur. Conf. Comput. Vision, 2004.
[64] N. Ben-Haim, B. Babenko, and S. Belongie, "Improving web-based image search via content based clustering," in Proc. Conf. Comput. Vision Pattern Recognit. Workshop, 2006, pp. 106–111.
[65] R. Fergus, P. Perona, and A. Zisserman, "A sparse object category model for efficient learning and exhaustive recognition," in Proc. Comput. Vision Pattern Recognit., 2005.
[66] K. Wnuk and S. Soatto, "Filtering Internet image search results towards keyword based category recognition," in Proc. Conf. Comput. Vision Pattern Recognit., 2008.
[67] L.-J. Li, G. Wang, and L. Fei-Fei, "OPTIMOL: Automatic Object Picture collecTion via Incremental MOdel Learning," in Proc. IEEE Comput. Vision Pattern Recognit., Minneapolis, MN, 2007, pp. 1–8.
[68] F. Schroff, A. Criminisi, and A. Zisserman, "Harvesting image databases from the web," in Proc. 11th Int. Conf. Comput. Vision, Rio de Janeiro, Brazil, 2007.
[69] A. Quattoni, M. Collins, and T. Darrell, "Learning visual representations using images with captions," in Proc. CVPR, Jun. 2007.
[70] B. Collins, J. Deng, L. Kai, and L. Fei-Fei, "Towards scalable dataset construction: An active learning approach," in Proc. ECCV, 2008.
[71] G. Wang, D. Hoiem, and D. Forsyth, "Building text features for object image classification," in Proc. CVPR, 2009.
[72] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman, The PASCAL visual object classes challenge 2007 (VOC2007) results, 2007.
[73] A. Torralba, R. Fergus, and W. T. Freeman, "Tiny images," Computer Science and Artificial Intelligence Lab, Massachusetts Inst. of Technology, Tech. Rep. MIT-CSAIL-TR-2007-024. [Online]. Available: http://dspace.mit.edu/handle/1721.1/37291
[74] J. Hays and A. Efros, "Scene completion using millions of photographs," in Proc. Int. Conf. Comput. Graph. Interact. Techniques, 2007.
[75] J. Hays and A. Efros, "IM2GPS: Estimating geographic information from a single image," in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2008, pp. 1–8.
[76] M. Everingham, A. Zisserman, C. K. I. Williams, and L. Van Gool, The PASCAL visual object classes challenge 2006 (VOC2006) results. [Online]. Available: http://www.pascal-network.org/challenges/VOC/voc2006/results.pdf
[77] G. Mori, X. Ren, A. Efros, and J. Malik, "Recovering human body configurations: Combining segmentation and recognition," in Proc. CVPR, 2004.
[78] D. Martin, C. Fowlkes, and J. Malik, "Learning to detect natural image boundaries using local brightness, color, and texture cues," IEEE Trans. Pattern Anal. Machine Intell., 2004.
[79] D. Ramanan, "Learning to parse images of articulated bodies," in Proc. NIPS, 2007, pp. 1129–1136.
[80] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," in Proc. ICCV, 2005, pp. 1395–1402.
[81] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, "Learning to describe objects," in Proc. CVPR, 2009.
[82] Y. Abramson and Y. Freund, "Semi-automatic visual learning (Seville): A tutorial on active learning for visual object recognition," in Tutorial, CVPR, 2005.
[83] I. Simon, N. Snavely, and S. Seitz, "Scene summarization for online image collections," in Proc. ICCV, 2007.
[84] T. L. Berg and D. A. Forsyth, "Automatic ranking of iconic images," Univ. of California, Berkeley, Tech. Rep., 2007.
[85] T. L. Berg and A. C. Berg, "Finding iconic images," in Proc. 2nd Internet Vision Workshop, IEEE Conf. Comput. Vision Pattern Recognit., 2009.
[86] R. Raguram and S. Lazebnik, "Computing iconic summaries for general visual concepts," in Proc. 1st Internet Vision Workshop, 2008.
[87] Y. Ke, X. Tang, and F. Jing, "The design of high-level features for photo quality assessment," in Proc. CVPR, 2006.
[88] R. Datta, D. Joshi, J. Li, and J. Z. Wang, "Studying aesthetics in photographic images using a computational approach," in Proc. ECCV, 2006.
ABOUT THE AUTHORS
Tamara L. Berg received the Ph.D. degree from
the Computer Science Department, University of
California, Berkeley, in 2007.
She was a member of the Berkeley Computer
Vision Group and worked under the advisorship of
Prof. D. Forsyth. She spent 2007–2008 as a
Postdoctoral Researcher with Yahoo! Research
developing various projects related to digital
media, including the automatic annotation of con-
sumer photographs. She is currently an Assistant
Professor at Stony Brook University. Her main research area is digital
media, specifically focused on organizing large collections of images with
associated text through the development of techniques in computer
vision and natural language processing. Past projects include automat-
ically identifying people in news photographs, classifying images from
the Web, and finding iconic images in consumer photo collections. She is
also generally interested in bringing together people and expertise from
various areas of digital media including digital art, music, and cultural
studies.
Alexander Sorokin (Member, IEEE) received the
B.S. degree (magna cum laude) from Lomonosov
Moscow State University, Russia, in 2003. He is
pursuing the Ph.D. degree at the University of
Illinois at Urbana-Champaign.
Mr. Sorokin is a member of Phi Kappa Phi. In
2007, he received a Best Paper Award from
Information Visualization 2007.
Gang Wang received the B.S. degree from Harbin
Institute of Technology, China, in 2005. He is
currently working towards the Ph.D. degree at
the University of Illinois at Urbana-Champaign,
Urbana.
His research interests include computer vision
and machine learning.
David Alexander Forsyth (Fellow, IEEE) received
the B.Sc. and M.Sc. degrees in electrical engineer-
ing from the University of the Witwatersrand,
Johannesburg, South Africa, and the D.Phil. degree
from Balliol College, Oxford, U.K.
He has published more than 100 papers on
computer vision, computer graphics, and machine
learning. He is a coauthor (with J. Ponce) of
Computer Vision: A Modern Approach (Englewood
Cliffs, NJ: Prentice-Hall, 2002). He was a Professor
at the University of California, Berkeley. He is currently a Professor at the
University of Illinois at Urbana-Champaign.
Prof. Forsyth was Program Cochair for IEEE Computer Vision and
Pattern Recognition (CVPR) in 2000, General Cochair for CVPR 2006, and
Program Cochair for the European Conference on Computer Vision 2008.
He will be Program Cochair for CVPR 2011. He received an IEEE Technical
Achievement Award in 2005.
Derek Hoiem received the Ph.D. degree from
Carnegie Mellon University, Pittsburgh, PA.
He is an Assistant Professor in computer
science at the University of Illinois at Urbana-
Champaign. His dissertation on automatically
transforming 2-D images into 3-D scenes was
featured in The Economist. His current research
focuses on physical scene understanding and
property-centric object recognition.
Dr. Hoiem received an Honorable Mention for
the ACM Doctoral Dissertation Award.
Ian Endres (Member, IEEE) received the B.S.
degree in computer science from the University
of Illinois at Urbana-Champaign in 2008, where he
is currently pursuing the Ph.D. degree. His re-
search interests include computer vision and
machine learning.
Ali Farhadi (Member, IEEE) is pursuing the Ph.D.
degree in the Computer Science Department,
University of Illinois at Urbana-Champaign.
His work is mainly focused on computer vision
and machine learning. More specifically, he is
interested in transfer learning and its application
to aspect issues in human activity and object
recognition, scene understanding, and attribute-
based representation of objects.
He recently received the inaugural Google
Fellowship in computer vision and image interpretation and the Uni-
versity of Illinois CS/AI 2009 Award.