INVITED PAPER
It's All About the Data
This paper explains how training data is important for many computer vision algorithms and presents case studies of how the Internet can be used to obtain high-quality data.
By Tamara L. Berg, Alexander Sorokin, Member IEEE, Gang Wang,
David Alexander Forsyth, Fellow IEEE, Derek Hoiem, Ian Endres, Member IEEE,
and Ali Farhadi, Member IEEE
ABSTRACT | Modern computer vision research consumes
labelled data in quantity, and building datasets has become
an important activity. The Internet has become a tremendous
resource for computer vision researchers. By seeing the
Internet as a vast, slightly disorganized collection of visual
data, we can build datasets. The key point is that visual data are
surrounded by contextual information like text and HTML tags,
which is a strong, if noisy, cue to what the visual data means. In
a series of case studies, we illustrate how useful this contextual
information is. It can be used to build a large and challenging
labelled face dataset with no manual intervention. With very
small amounts of manual labor, contextual data can be used
together with image data to identify pictures of animals. In fact,
these contextual data are sufficiently reliable that a very large
pool of noisily tagged images can be used as a resource to build
image features, which reliably improve on conventional visual
features. By seeing the Internet as a marketplace that can
connect sellers of annotation services to researchers, we can
obtain accurately annotated datasets quickly and cheaply. We
describe methods to prepare data, check quality, and set prices
for work for this annotation process. The problems posed by
attempting to collect very big research datasets are fertile for
researchers because collecting datasets requires us to focus on
two important questions: What makes a good picture? What is
the meaning of a picture?
KEYWORDS | Computer vision; Internet
I. INTRODUCTION
As the world moves online, Internet users are creating and
distributing more and more images and video. The number
of images indexed by search engines like Google and
Yahoo! is growing exponentially, and is currently in the
tens of billions. The same growth and proliferation is
experienced by community photo Web sites like Flickr,
PicasaWeb, Panoramio, Woophy, Fotki, and Facebook. Flickr alone currently has more than three billion photos,
with several million new pictures uploaded per day. In
recent years, the trend has been to make much of this
visual data publicly available on the Web, usually attached
to various forms of context such as captions, tags,
keywords, Web pages, and so on. By doing so, Internet
users have transformed computer vision by providing a
need for new applications to sort, search, browse, and interact with this content. This vast amount of visual data
also creates an interesting sampling of images and video on
which researchers can develop and evaluate theories.
The core intellectual problems in computer vision are
recognition and reconstruction, very broadly interpreted.
In recognition, one attempts to attach semantics to visual
data like images or video. The nature of the semantics and
of the attachment both vary widely. It is often valuable to mark particular instances of objects; for example, that image is a picture of my fluffy cat. Alternatively, one might want to mark categories; for example, that image is a picture of a cat. Finally, one might want to mark attributes of objects; for example, that image contains a fluffy thing.
There are strong relations between these ideas that
remain somewhat mysterious. Categories are hard to
define with any precision but represent a pool of instances linked by some visual and some semantic similarity. Visual
categories may or may not mirror semantic or taxonomic
categories. There are objects that are very different but
Manuscript received April 6, 2009; revised August 13, 2009; accepted
August 14, 2009. Date of publication May 17, 2010; date of current version July 21,
2010. This work was supported in part by the National Science Foundation under
Awards IIS-0803603 and IIS-0534837, in part by the Department of Homeland Security
under MIAS, a DHS-IDS Center for Multimodal Information Access and Synthesis at the
University of Illinois at Urbana-Champaign, in part by the Office of Naval Research
under Grant N00014-01-1-0890 as part of the MURI program, and in part by a
Beckman Institute postdoctoral fellowship.
T. L. Berg is with the Department of Computer Science, State University of New York
Stony Brook, Stony Brook, NY 11794 USA (e-mail: [email protected]).
A. Sorokin, G. Wang, D. A. Forsyth, D. Hoiem, I. Endres, and A. Farhadi are with the
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana,
IL 61801 USA (e-mail: [email protected]; [email protected]; [email protected];
[email protected]; [email protected]; [email protected]).
Digital Object Identifier: 10.1109/JPROC.2009.2032355
1434 Proceedings of the IEEE | Vol. 98, No. 8, August 2010 0018-9219/$26.00 ©2010 IEEE
share many visual attributes. For example, penguins are not fish, though they can look quite like fish and often
appear in the same context as fish. Alternatively, there are
categories that can vary widely in visual appearance yet
still carry the same semantic label (e.g., "chair" can refer to any object you can sit on, from beanbags to stools to rocking chairs). Adding to the confusion is the fact that a
particular object may belong to many categories; for example, my bicycle is a bicycle, but it is also a means of transportation, a wheeled object, and a metal object, and it
can be an obstacle.
In reconstruction, one attempts to build representations
of the geometric and photometric properties of the world
from visual data. The visual data could be single images,
multiple images, video, or even the product of exotic imaging
systems (e.g., thermal images or laser range scanner data).
The representations take many forms. One often wants to create models like those used in computer graphics to render
in virtual environments. Alternatives include complex data
structures linking images so that they can be presented to
viewers in a way that gives a strong sense of movement and
layout in space. Progress in this area has been driven by a
variety of practical problems and has spawned many applications; the key guide is [1]. An important recent trend involves applying multiview reconstruction methods to "found" images: if enough tourists have photographed the
Colosseum and one collects these photographs together, then
one can build compelling representations of that space [2],
[3]. We will not discuss reconstruction further in this paper,
but refer interested readers to other papers in this issue.
Recognition is a topic of intense research, and a comprehensive review would take us out of our way; instead we provide some entry points to the literature. Textbook accounts [4], [5] are now out of date. A good overview of
topics is provided by the edited proceedings of a recent
workshop [6]. Our definition of recognition is deliberately
broad to bring together discussion from several related
research areas. One important subtopic in recognition is
object recognition, where one builds models to recognize
object categories or instances. There are annual competitions culminating in workshops, and a good survey of the state of the art can be obtained by looking at the proceedings.1 Other important subtopics include activity recognition, where one builds descriptions of what people
are doing from visual data (we are not aware of an
extensive review; [7] has a fair sketch of recent literature);
face recognition, where one must attach identities to
pictures or video of faces (recent reviews in [8]–[10],
critical discussion in [11]); and detection, where one must localize all instances of a particular category in an image
(the recognition workshops, given above, deal with some
aspects of detection). A small number of object categories have been the subjects of intense study because they are
extremely common or particularly important to people.
Pedestrians are one such category [12]–[15]. Faces are the
most important such category, and detecting frontal faces
is now a fairly reliable and, as we shall see, extremely
useful technology (see Section III-A and [16]–[18]).
The central activity of object recognition research is
building systems to evaluate ideas about object representation. The main tools used are statistical in nature, and a
major part of the research involves learning from labelled
training data. Much of the research debate involves what
to learn and how to learn it. In the case of frontal face
detectors, the established method is to search all image
windows for a face by computing features from the window, then presenting these features to a classifier trained
to respond to faces only. Other object categories can be detected like this, but the details and success of such
approaches vary widely. This is because most object categories are extremely variable in appearance: dog species
range from the tiny Chihuahua to the large German
Shepherd; different cars can have different colors and
shapes; pedestrians wear a wide range of clothes and might
raise or lower their arms. All of this variability means that
learning a classifier that will effectively respond to an object category might require an unmanageable quantity of
examples to sample all possible appearances. This problem
can be simplified by constructing complex feature representations that suppress the worst of these sources of
variation; doing so usually involves a component of learning from data too.
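The window-scanning scheme just described (compute features for every image window, then classify each window) can be sketched in a few lines. The feature extractor (mean intensity) and the "classifier" (a fixed threshold) below are toy stand-ins for illustration only, not the trained face detectors the text refers to:

```python
# Sketch of sliding-window detection: every window is scored by a
# classifier trained to respond to the target category only. The
# feature and classifier here are hypothetical toy stand-ins.

def iter_windows(width, height, win=24, stride=8):
    """Yield top-left corners of all win x win windows at one scale."""
    for y in range(0, height - win + 1, stride):
        for x in range(0, width - win + 1, stride):
            yield x, y

def detect(image, features, classify, win=24, stride=8):
    """Return windows whose feature vector the classifier accepts."""
    h = len(image)
    w = len(image[0]) if h else 0
    hits = []
    for x, y in iter_windows(w, h, win, stride):
        patch = [row[x:x + win] for row in image[y:y + win]]
        if classify(features(patch)):
            hits.append((x, y))
    return hits

# Toy usage: a bright 24 x 24 "object" in the top-left corner of a
# dark 48 x 48 image; "features" = mean intensity, "classifier" =
# a simple brightness threshold.
image = [[200 if x < 24 and y < 24 else 10 for x in range(48)]
         for y in range(48)]
mean = lambda p: sum(map(sum, p)) / (len(p) * len(p[0]))
hits = detect(image, mean, lambda f: f > 128)
```

In a real detector the threshold classifier would be replaced by one trained on labelled windows, which is exactly where the demand for labelled data discussed here comes from.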
Recognition research and reconstruction research share
one important property: a strong demand for visual data in immense quantities. Humans, our best example of a
working recognition system, observe huge amounts of
visual data from which to learn to recognize objects in early
life. If one estimates the visual system as capturing 30 frames per second, then this translates to roughly 9 × 10^8 images per year. Most of these data will be unlabelled, but some
labelled data will be provided through parental teaching,
etc. A fair guess is that object recognition research will require hundreds of millions of images to build a similar
working recognition system. The Internet is exciting
because it has visual data on this scale. Furthermore,
data drawn from the Internet have a better chance of being
"fair" than data made in-house because they are produced by
people unrelated to the research problem. From the point
of view of a computer vision researcher, the Internet is a
device that has the potential to produce intriguing datasets. Unfortunately, these data have not yet been correctly
annotated. Current processes of hand annotation are
difficult and expensive to use at the kind of scale we want.
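The 9 × 10^8 figure follows from simple arithmetic, assuming the visual system captures frames around the clock:

```python
# Quick check of the estimate above: 30 frames per second, accumulated
# over every second of a year, is just under 10^9 images.
frames_per_second = 30
seconds_per_year = 60 * 60 * 24 * 365
images_per_year = frames_per_second * seconds_per_year  # 946 080 000
```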
In this paper, we demonstrate several fully or semiautomatic methods for producing large, well-labelled datasets
that are useful for computer vision research. We explore
two different but related ways to produce these datasets:
1 The 2007 workshop dealt with a dataset from Pascal and one from the California Institute of Technology; proceedings are at http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2007/workshop/index.html. The 2008 workshop dealt with a dataset from Pascal; proceedings are at http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2008/workshop/index.html.
by utilizing surrounding textual information in conjunction with image information (Section III-A to C) or by
using inexpensive human labeling resources (Section IV).
II. CURRENT METHODS FOR PREPARING ANNOTATED DATASETS
We do not really know what the ideal dataset would be for
object recognition research. However, some of the properties of a good dataset are apparent.
• Variety: a good dataset should be rich enough that
it represents all major visual phenomena we are
likely to encounter in applying the system we are
training and testing to all the images in the world.
• Scale: a good dataset will be big. We do not really
know what the major visual phenomena are, but
we have the best chance that all are represented well if the dataset is big.
• Precision: the labels in the dataset should be
reasonably accurate, though systems should be able
to handle some amount of noise in the labeling
process.
• Suitability: the dataset should have many examples
relevant to the problems we wish to solve.
• Cost: the dataset should not be unreasonably expensive.
• Representativeness: the dataset should represent visual
phenomena "fairly"; we should not be selecting data
to emphasize rare but interesting effects, or to
deemphasize common but difficult ones.
We can say with confidence that the perfect dataset
will be big. A rough estimate of the size of this dataset can
be obtained as follows. It should contain images of objects in the contexts in which they occur. The images should
cover most significant categories, and there should be
many images of each category. Within a category, the
images should show all variations important for training
and for testing, including the different shapes, textures,
and configurations the members can take. All important
viewing conditions should be represented by images taken
at different viewpoints, under different illumination conditions, and so on. These images should be labelled.
It is difficult to be crisp about what would be in the
labels (for example, should one label every book in a cluttered office? If not, which ones?), but the labels would
include the names of the categories of the important
objects, where those objects were in the image, informa-
tion about the context in which they lie, and the viewing
conditions. There might also be information about why those objects were worth labeling and others were not. A
fair guess at the number of categories that should be
represented is in the thousands; a fair guess of the number
of images needed to represent internal variations in each
category is thousands or more; and a fair guess of the
number of images needed to represent view variations
might be hundreds or more. All this means that we might
need hundreds of millions of images. This figure could be too small by several orders of magnitude if each guess is
too small; it could be too big by several orders of
magnitude if we discover processes of generalization
across examples that, for example, link internal variation
between categories or views. It is a fallacy to believe that,
because good datasets are big, then big datasets are good.
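The estimate above is the product of three order-of-magnitude guesses; taking the lower bound of each guess from the text:

```python
# Back-of-envelope version of the dataset-size estimate. Each figure is
# an order-of-magnitude guess from the text, not a measured quantity.
categories = 1_000            # "in the thousands"
images_per_category = 1_000   # internal variation: "thousands or more"
view_variations = 100         # viewpoint/illumination: "hundreds or more"
total_images = categories * images_per_category * view_variations
# 100 000 000 images: the "hundreds of millions" scale quoted above
```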
Building hand-labelled datasets has now become a
significant activity in computer vision. The most prominent examples here include: the Berkeley segmentation dataset [19]; the Caltech 5, Caltech 101 [20], and Caltech 256 [21]
datasets; the Pascal VOC datasets [22], [23]; the University
of Illinois at Urbana-Champaign (UIUC) car dataset [24];
the localized semantics dataset [25]; the Massachusetts
Institute of Technology (MIT) [26] and INRIA [27] pedes-
trian datasets; and the Yale [28], FERET [29], and CMU
PIE [30] face datasets. Every dataset above is a focused data collection targeted at a specific research problem: segmentation, car detection, pedestrian detection, face detection
and recognition, and object category recognition. Such
datasets will continue to be an important part of computer
vision research because they do well on the precision and
suitability criteria (above). The annotations are high-
quality, and the data represent what one wants to study.
But datasets like this will always be relatively small because the strategy does not scale well: it consumes significant amounts of highly skilled labor, requires much
management work, and is expensive.
Another strategy for building large scale datasets is to
get volunteers to label images for free. LabelMe [31] is a
public online image annotation tool that now has more than
11 845 images and 18 524 video frames with at least one
object labelled [31]. The current Web site counter displays 222 970 labelled objects. The annotation process is simple
and intuitive; users can browse existing annotations to get
the idea of what kind of annotations are required and then
label segmentation polygons within an image to denote an
object’s location. The dataset is freely available for
download and comes with a handy Matlab toolbox to browse
and search the dataset. The dataset is semicentralized. MIT
maintains a publicly accessible repository, accepts images to be added to the dataset, and distributes the source code
to allow interested parties to set up a similar repository. The
ESP game [32] and Peekaboom [33] are interactive games
that collect image annotations by entertaining people.
Players cooperate and receive points by providing textual
and location information that is likely to describe the
content of the image to their partner. These games have
produced a great deal of data (over 37 million2 and 1 million [33] annotations, respectively). The Peekaboom project
recently released a collection of 57 797 images annotated
through gameplay, but to date this collection has not been
widely used, for reasons we do not know. While datasets
produced by volunteers can be very large, there remain
2 See www.espgame.org.
difficulties with precision (volunteers may not do the right thing), with variety (volunteers may prefer to label a
particular kind of image), and with suitability (volunteers
may not wish to label the kind of data we want labelled).
The game-based approach has two inconveniences. The first is centralization: to achieve proper scale, it is necessary to have a well-attended game service that features the game, which constrains the publishing of a new game to obtain project-specific annotations. The second is the game itself: to achieve reasonable scale, one has to design a game that is entertaining, or else no one will play it. This requires creativity and experimentation to create an appropriate annotation interface.
Finally, dedicated annotation services can provide
quality and scale, but at a high price. ImageParsing.com
has built one of the world's largest annotated datasets [34].
With more than 49 357 images, 587 391 video frames, and 3 927 130 annotated physical objects, this is an invaluable
resource for vision scientists. At the same time, the cost of
entry is steep. Obtaining standard data would require at
least a $1000 investment, and custom annotations would
require at least $5000.3 ImageParsing.com provides high-
quality annotations and has a large number of images
available for free. Another large annotated dataset that has
recently been released is ImageNet [35]. This data collection provides whole-image-level category labels for more
than three million images collected using several Internet
search engines for about 5000 WordNet [36] synsets and
labelled via a large-scale annotation effort using Amazon’s
Mechanical Turk service.4 It is important to note that these
two datasets [34], [35] present probably the most rigorous
and the most varied collections to date for image labeling
and classification tasks.
One important issue that computer vision researchers
should be aware of when collecting and distributing datasets
from the Internet is copyright. Images posted on the Internet
are owned by the photographer, and usage rights are not
usually specified (except in the special case of examples like
the Creative Commons license on Flickr.com). Most use of
such images for educational and research purposes, in-
cluding republishing images in research papers, may be covered under fair-use policies, but the general issue of
copyright for Internet images is still a bit murky and has not
been satisfactorily resolved to our knowledge.
No current procedure for producing datasets is entirely
satisfactory. In what follows, we describe two mechanisms
to use the Internet to produce useful and interesting data-
sets. One can regard the Internet as a vast, disorganized
repository, and build methods to prowl through this repository to collect data that have the right form, or that
can be levered into the right form (Section III-A to C).
These methods alleviate or bypass the necessity of human
labeling to create data collections but also present other
challenges or inconveniences. The most prominent challenge here is how to design effective and accurate computer vision and natural language processing algorithms
for labeling data, something that can be quite difficult.
Alternatively, one can regard the Internet as a marketplace
where it is possible to find someone willing to help
improve data quickly and cheaply (Section IV).
III. THE INTERNET AS A REPOSITORY
It is not helpful to think of the Internet as a big pile of
images or videos because most visual data items on the
Internet are surrounded by complex and interesting
contexts that can be useful for guiding the labeling process. Generally, we shall talk about images, but much of
what we say applies to video as well. In the simplest case,
images are associated with captions, which contain explicit information about the image's content. In more complex
cases, images appear embedded in Web pages, which render text near the image. The relations between
this text and the image can be extremely revealing. Resources like Wikipedia contain highly structured information in both text and visual form but tend to have relatively
few images. Lastly, there are resources where one can
perform keyword searches for images, like Google's image search or Flickr; images obtained from these sites have, at
least implicitly, the context of the site’s index and the
search used to obtain the image attached to them.
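As a concrete illustration of mining the text that surrounds an image, the sketch below pairs each `img` tag in a page with the text that precedes it, using only the standard library. The page markup is invented, and a real pipeline would also harvest captions, alt text, keywords, and tags:

```python
# Sketch of extracting the "text near the image" context described
# above. ImageContextParser is a hypothetical minimal harvester, not
# any system from the paper.
from html.parser import HTMLParser

class ImageContextParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.images = []  # (src, nearby text) pairs
        self.text = []    # running text buffer

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src", "")
            # Associate the image with the text seen since the last image.
            self.images.append((src, " ".join(self.text).strip()))
            self.text = []

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

# Invented page fragment for illustration.
page = '<p>A penguin at the zoo.</p><img src="penguin.jpg"><p>More text.</p>'
parser = ImageContextParser()
parser.feed(page)
```

Resetting the buffer at each image is a crude proximity model; the case studies below rely on richer cues, but the basic move of levering nearby text against image content is the same.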
We describe three case studies that show how useful
that context can be. In the first, we show how news pictures, taken with their captions, can yield a rich and
demanding face dataset. In the second, we show how to
use text lying near pictures, taken with image features, to collect a complex dataset of animals. In the third, we show
that a very big collection of pictures with noisy labels can
improve the performance of an object recognizer, even
though both the pictures and the labels are noisy and
uncontrolled. Each of these case studies illustrates what we
believe are important principles in using Internet
resources to produce labelled datasets.
1) Throwing away data items that cannot be labelled for some reason or another is a valuable strategy if
one starts with an enormous collection. With care,
this produces data with biases no worse than other
methods of building datasets.
2) Levering sources of information against one
another is very valuable, and the opportunities to
do so appear to be quite widespread. For example,
the text that appears near an image, taken with that image, can be very helpful in deciding what
labels an image should carry.
Each of these case studies illustrates how useful it is to
start with Web-scale resources. In the first case study, we
rely on being able to throw away captioned images we
cannot work with; because the original collection is so
large, there are enough captioned images we can handle to
3 See www.ImageParsing.com.
4 See http://www.mturk.com/.
yield a large dataset. In the second case study, because the collection is so large, we can find enough data items where
the relations between the visual data and the context are
clear enough that we can label effectively. In the third case
study, we find we can obtain a useful signal for image
labeling from many images with quite noisy labels
attached. This works because we have so many images
that we can smooth this noise effectively. Related and
future work to these studies is discussed in Section V.
A. Case Study: Preparing a Labeled Dataset of Names and Faces
Many nontraditional kinds of image data are now being
made freely available online. One such source of data is
news photographs with associated captions posted by
various news sources (Associated Press, Yahoo! News,
CNN, etc.) at rates of thousands of new images per day. This particular data source opens up new problems for
exploration. In traditional face recognition, the question
posed is: "Given a face, whose face is this?" In general, the
answer to this question could be one of hundreds,
thousands, or even millions of possible people. However,
for news photographs with captions, one can, without losing a great deal of accuracy, reduce this problem to consider only those people whose names appear in the corresponding caption (usually fewer than about ten
names). This is an example of the power of levering
sources of information against one another.
In our work on automatically labeling face images in
photographs [37], [38], we show that a large and realistic
face dataset (or face dictionary) can be built from a
collection of news photographs and their associated
captions (one photo-caption pair is shown in Fig. 2). Our automatically constructed face dataset consists of
30 281 face images, obtained by applying a face finder to
approximately half a million captioned news images. The
faces are labelled using image information from the photographs and word information extracted from the corresponding caption. Here we take advantage of recent
computer vision successes on accurate face detection
algorithms to focus on parts of the image we care about: those that contain people's faces. The other 470 000 images
we collected were thrown away as not containing faces, not
containing faces that were big enough to identify reason-
ably accurately, or not having names in their captions. If
one has sufficient noisy data, then it is entirely reasonable
to throw away 90% of the data to get a clean dataset.
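A minimal sketch of this aggressive filtering step: keep a photo-caption pair only if a sufficiently large face is detected and the caption contains at least one name. The detector outputs and name lists below are invented stand-ins for the cited components, and the size threshold is illustrative, not the paper's value:

```python
# Hypothetical filtering sketch: most of a huge noisy collection is
# thrown away, keeping only pairs that pass both tests.

MIN_FACE_SIZE = 60  # pixels; illustrative threshold only

def keep_pair(face_sizes, names, min_size=MIN_FACE_SIZE):
    """face_sizes: detected face sizes (pixels); names: names in caption."""
    big_enough = [f for f in face_sizes if f >= min_size]
    return bool(big_enough) and bool(names)

# Toy collection: most items fail one test or the other, so most of the
# data is discarded, as in the text.
collection = [
    ([80], ["Colin Powell"]),   # kept
    ([20], ["Colin Powell"]),   # face too small
    ([90], []),                 # no names in caption
    ([], ["Sophia Loren"]),     # no face found
]
kept = [pair for pair in collection if keep_pair(*pair)]
```

The point is not the thresholds themselves but the strategy: with an enormous starting pool, discarding the unusable majority still leaves a large, clean dataset.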
This dataset is more realistic than usual face recogni-
tion datasets because it contains faces captured "in the wild" under a wide range of positions, poses, facial expressions, and illuminations. The data display a rich variety of
phenomena found in real-world face recognition tasks: significant variations in color, hairstyle, expression, etc.
Equally interesting is that it does not contain large numbers of faces in highly unusual and seldom-seen poses, such
as upside down. Rather than building a database of face
images by choosing arbitrary ranges of pose, lighting, expression, and so on, we simply let the properties of a "natural" data source determine these parameters. News
photographs tend to be well illuminated and captured with
nice high-end cameras. The faces we find also tend to be
mainly frontal, relatively unoccluded, and large due to the
face detector and how we select detections. Name frequencies have the long tails that occur in natural language
problems. We expect that face images follow roughly the same distribution. We have hundreds to thousands of
images of a few individuals (e.g., President Bush) and a
large number of individuals who appear only a few times or
in only one picture (e.g., Sophia Loren in Fig. 1). One expects
real applications to have this property. For example, in
airport security cameras, a few people (security guards, or
airline staff perhaps) might be seen often, but the majority
of people would appear infrequently. All of these factors lead to biases in our created dataset and will have
implications for recognition algorithms trained using such a dataset; e.g., algorithms may not be as good at recognition
on highly occluded faces or recognition under low lighting
conditions. However, we believe that in the long run, developing detectors, recognizers, and other computer vision tools
around such a database in addition to current laboratory
produced recognition datasets will help produce programs that work better in realistic everyday settings.
We describe our construction of a face dictionary as a
sequence of three steps. First, we detect names in captions
using an open source named entity recognizer [39]. Next, we
detect [40], align, and represent faces using standard face
representations with some modifications to handle the large
quantity of data encountered here. Finally, we associate
names with faces using either a purely appearance-based clustering method or an enhanced method that also incorporates text cues in a maximum entropy framework [41].
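As a rough illustration of the first step, the sketch below treats runs of capitalized words as candidate names. The real system uses an open source named entity recognizer [39]; this toy regex is only meant to show the idea, and its output illustrates the over-merging a trained NER is better at avoiding:

```python
# Crude stand-in for a named entity recognizer: runs of two or more
# capitalized words become candidate names. Note the noise: titles
# like "President" and "State" get absorbed into the match.
import re

NAME_PATTERN = re.compile(r"(?:[A-Z][a-z]+ )+[A-Z][a-z]+")

def candidate_names(caption):
    return NAME_PATTERN.findall(caption)

# Invented caption for illustration.
caption = ("President George Bush meets Secretary of State Colin Powell "
           "in Washington on Monday.")
names = candidate_names(caption)
# names: ['President George Bush', 'State Colin Powell']
```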
This fully automatic method alternates between updating the best assignment of names to faces given the current
appearance and language models and updating the models
given the current set of assignments (some predicted
assignments are shown in Fig. 1). This allows us to build an
appearance model for each individual in the collection and a language model of depiction that is general across all
captions without having to provide any hand-labelled
supervisory information. The intuition behind how we do
this is as follows: if we observe a face across multiple
photographs with varying appearance but some visual
characteristics in common, and notice that this face always
co-occurs with the name "George Bush," then by pooling information across multiple pictures, we can learn an appearance model of the individual with the name "George Bush." On the language side, we might observe that often
when a name is followed by, for example, a present-tense
verb, the face that this name refers to is likely to be depicted
in the corresponding photograph. This information can be
pooled across all captions and names to learn a general
model for depiction.
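The alternation can be sketched as a hard-EM style loop. The one-dimensional "appearance" values, the toy name models, and all the data below are invented stand-ins for the face descriptors and appearance/language models of the actual system:

```python
# Minimal sketch of the alternation described above: assign each face
# the best name from its own caption, then refit each name's model
# from its assigned faces, and repeat.

def assign(faces, models):
    """Give each face the candidate name whose model mean is closest."""
    out = []
    for value, candidates in faces:
        out.append(min(candidates, key=lambda n: abs(value - models[n])))
    return out

def update(faces, labels):
    """Refit each name's model as the mean appearance of its faces."""
    sums, counts = {}, {}
    for (value, _), name in zip(faces, labels):
        sums[name] = sums.get(name, 0.0) + value
        counts[name] = counts.get(name, 0) + 1
    return {n: sums[n] / counts[n] for n in sums}

# faces: (appearance value, candidate names from the caption)
faces = [(1.0, ["Bush", "Blair"]), (1.2, ["Bush"]),
         (5.0, ["Blair", "Bush"]), (5.1, ["Blair"])]
models = {"Bush": 0.0, "Blair": 10.0}  # deliberately rough start
for _ in range(5):
    labels = assign(faces, models)
    models = update(faces, labels)
```

Restricting each face to the names in its own caption is what makes the search tractable, exactly the reduction from millions of identities to about ten described above.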
One important discovery we have made is that the
context in which a name appears in a caption provides
powerful cues as to whether it is depicted in the associated
image. Our language model learns several useful indications
of depiction, including the fact that if a name appears in a
caption directly followed by various forms of present-tense
verbs, that person is likely to be found in the accompanying
photo; names that appear near the beginning of the caption
tend to be pictured; words like Bshown,[ Bpictured,[ or
contextual cues such as B(L)[ or B(C)[ are good indicators
Fig. 1. The figure shows a representative set of clusters, illustrating a series of important properties of both the dataset and the method.
1) Some faces are very frequent and appear in many different expressions and poses, with a rich range of illuminations (e.g., clusters labelled
Secretary of State Colin Powell or Donald Rumsfeld). 2) Some faces are rare or appear in either repeated copies of one or two pictures or
only slightly different pictures (e.g., cluster labelled Chelsea Clinton or Sophia Loren). 3) Some faces are not, in fact, photographs (M. Ali).
4) The association between proper names and face is still somewhat noisy, for example, Leonard Nemoy, which shows a name associated with
the wrong face, while other clusters contain mislabeled faces (e.g., Donald Rumsfeld or Angelina Jolie). 5) Occasionally faces are incorrectly
detected by the face detector (Strom Thurmond). 6) Some names are genuinely ambiguous (James Bond, two different faces naturally
associated with the name (the first is an actor who played James Bond, the second an actor who was a character in a James Bond film).
7) Some faces appear in black in white (Marilyn Monroe) while most are in color. 8) Our clustering is quite resilient in the presence of spectacles
(Hans Blix, Woody Allen), perhaps wigs (John Bolton) and mustaches (John Bolton).
Berg et al. : It’s All About the Data
Vol. 98, No. 8, August 2010 | Proceedings of the IEEE 1439
of depiction when they occur. By incorporating simple natural
language techniques, we show significant improvement
in our face labeling procedure over a purely appearance-based
model: an image-based model gives approximately
67% accuracy on this task, while an image plus text model
improves performance to about 78%. Additionally, by learning
the visual and textual models in concert, we boost the
amount of information available from either the images or
the text alone. This increases the performance of both learned
models. We have conclusively shown that by incorporating
language information we can improve a vision task,
namely, automatic labeling of faces in images.
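These caption cues are simple enough to sketch in code. The following is a toy illustration only: the verb list, cue tokens, window size, and weights are invented for the example, not the values our learned language model actually estimates.

```python
import re

# Illustrative cue lists; the real model learns such cues from data.
DEPICTION_VERBS = {"smiles", "waves", "speaks", "gestures", "arrives"}
POSITION_CUES = {"shown", "pictured", "(L)", "(C)", "(R)"}

def depiction_score(caption, name):
    """Score how likely `name` is depicted, using three textual cues:
    position near the start of the caption, a present-tense verb directly
    following the name, and nearby indicators like "pictured" or "(L)"."""
    score = 0.0
    idx = caption.find(name)
    if idx < 0:
        return score
    # Names near the beginning of the caption tend to be pictured.
    if idx < len(caption) * 0.25:
        score += 1.0
    # A present-tense verb directly after the name is a strong cue.
    after = caption[idx + len(name):].lstrip()
    first_word = re.split(r"\W+", after, maxsplit=1)[0].lower() if after else ""
    if first_word in DEPICTION_VERBS:
        score += 2.0
    # Indicator tokens in a small window around the name.
    window = caption[max(0, idx - 20): idx + len(name) + 20].lower()
    if any(cue.lower() in window for cue in POSITION_CUES):
        score += 1.0
    return score

caption = "President George Bush waves to reporters, with Colin Powell (L)."
```

Here the first name outranks the others because it opens the caption and is followed by a present-tense verb, while "Colin Powell" picks up only the "(L)" cue.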
Once our procedure is complete, we have a large,
accurately labelled set of faces, an appearance model for
each individual depicted, and a natural language model
that can produce accurate results on captions in isolation.
This language model can accurately label names in captions, without looking at a single pixel in the accompanying
photo, with an accuracy of 86%. The dataset we produce
consists of 30 281 faces with roughly 3000 different
individuals that can be used for further exploration of face
recognition algorithms. Another product of our system is a
Web interface that organizes the news in a novel way,
according to individuals present in news photographs.
Users are able to browse the news according to individual (Fig. 2), bring up multiple photographs of a person, and view
the original news photographs and associated captions
featuring that person.
1) What Can Be Done With Names and Faces: While
perfect automatic labeling is not yet possible, the dataset of
labelled faces we have produced (Section III-A) has already
proven useful because it is large, because it contains challenging phenomena, and because correcting labels for a
subset is relatively straightforward.
Because the dataset is large, it is possible to focus
inquiry on specific subsets of the data. For example, Ozkan
and Duygulu [42] present a clustering algorithm that
requires many images of each individual. They used the
most frequent 23 people in the database (each of whom
occurred more than 200 times) in experiments on refining person-name assignments. In that work, more than half of
the labels needed to be correct, a threshold easily exceeded
by the current dataset. Work by Ferencz et al. concentrates
on another subset of the data: they consider a large number
of individuals (more than 400) with at least two images
per person for work on modeling appearance for face
recognition [43]. That work required completely correct
labeling of the faces, which was easily accomplished in a couple of hours with a custom interface to clean up the
automatic labeling in the "names and faces" dataset.
Because the dataset exhibits challenging phenomena, it
has been used as the starting point for the fully labelled
"labelled faces in the wild" (LFW) dataset [44], designed to
test face recognition on images of faces with real-world
variation. Benchmark results have shown that these data
are indeed very challenging for traditional algorithms and have focused research on effective face-recognition
algorithms [45], [46].
Besides our work on labeling faces in news photo-
graphs, there have been several other related projects in
the domain of face labeling with contextual information.
Gallagher and Chen have further expanded on our notion
of context to introduce gender and age cues into the face-
labeling problem [47]. By estimating probabilities of names referring to men or women, and older or younger
people, they can constrain the labeling problem accordingly (female faces should only be matched to female
names, an older face is more likely to correspond to the
name "Mildred," etc.), leading to improved performance.
Similar problems have also been explored for labeling:
faces in consumer photo collections [48], [49], faces in
video with associated transcripts [50]–[53], or in video with a small amount of manual labeling [54].
The trend in this area is toward further understanding
and use of the connection between images of faces and
context, in the images themselves, associated text or meta-
data, and the world at large.
B. Case Study: Preparing a Labeled Dataset of Animals on the Web
Not every picture has an explicit caption; sometimes,
pictures just have text nearby. Nonetheless, we can lever this
text against image features to obtain very useful results. Text
is a natural source of information about the content of
images, but the relationship between free text and images on
a Web page is complex. In particular, there are no obvious
indicators linking particular text items with image content
(a problem that does not arise if one confines attention to captions, annotations, or image names). All this makes text a
noisy cue to image content if used alone. However, this
noisy cue can be helpful if combined appropriately with
good image descriptors and good examples.
In our work on identifying images containing categories
of animals [55], we develop a method to classify images
depicting animals in a wide range of aspects, configurations,
and appearances. In addition, the images typically portray multiple species that differ in appearance (e.g.,
uakaris, vervet monkeys, spider monkeys, rhesus monkeys,
etc.). Our method is accurate despite this variation and
relies on an integration of information from four simple
cues: text, color, shape, and texture.
We demonstrate one application by harvesting approx-
imately 14 000 pictures for ten animal categories from the
Web. Since there is little point in looking for, say, "alligator" in Web pages that do not have words like "alligator," "reptile," or "swamp," we use Google to focus the search.
Using Google text search, we retrieve the top 1000 Web
pages for each category and use our method to rerank the
images on the returned pages.
From the retrieved images, we first select a small set of
example images from each query using only text infor-
mation. We use latent Dirichlet allocation [56] to discover a set of ten latent topics from the words contained on the
retrieved Web pages. These latent topics give distributions
over words and are used to select highly likely words for
each topic. We rank images according to their nearby
word likelihoods and select a set of 30 exemplars for each
topic.
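The ranking step can be sketched as follows. The word likelihoods below are hand-set stand-ins; in our system they come from the relevant LDA topic, and we keep 30 exemplars rather than the toy value used here.

```python
from math import log

# Stand-in word likelihoods for a "relevant" topic; in the real system
# these come from a ten-topic LDA model fit to the retrieved pages [56].
relevant_topic = {"frog": 0.08, "pond": 0.05, "amphibian": 0.04, "green": 0.02}
FLOOR = 1e-6  # tiny likelihood for words the topic assigns no mass to

def word_score(nearby_words, topic):
    """Summed log-likelihood of an image's nearby words under a topic."""
    return sum(log(topic.get(w, FLOOR)) for w in nearby_words)

def select_exemplars(images, topic, k=30):
    """Rank images by nearby-word likelihood; keep the top k as exemplars."""
    ranked = sorted(images, key=lambda im: word_score(im["words"], topic),
                    reverse=True)
    return ranked[:k]

images = [
    {"url": "a.jpg", "words": ["frog", "pond", "green"]},
    {"url": "b.jpg", "words": ["boots", "clip", "sale"]},
]
top = select_exemplars(images, relevant_topic, k=1)
```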
Words and images can be ambiguous (e.g., "alligator" could refer to "alligator boots" or "alligator clips" as well
as the animal). Currently, there is no known method for
breaking this polysemy-like phenomenon automatically.
Therefore, at this point, we ask the user to identify which
topics are relevant to the concept they are searching for.
The user labels each topic as relevant or background,
Fig. 2. We have created a Web interface for organizing and browsing news photographs according to individual. Our dataset consists of
30 281 faces depicting approximately 3000 different individuals. Here we show a screen shot of our face dictionary (top), one cluster from
that face dictionary (actress Jennifer Lopez) (bottom left), and one of the indexed pictures with corresponding caption (bottom right).
This face dictionary allows a user to search for photographs of an individual and gives access to the original news photographs and
captions featuring that individual. It also provides a new way of organizing the news according to the individuals present in its photos.
depending on whether the associated images and words illustrate the category well. Given this labeling, we merge
selected topics into a single relevant topic and unselected
topics into a background topic (pooling their exemplars
and likely words). These exemplars often have high
precision, a fact that is not surprising given that most
successful commercial image search techniques rely on
textual information to index images. The word score for an
image is thus the summed likelihood of nearby words given the relevant topic model.
Visual cues are evaluated by a voting method that
compares local image phenomena with the relevant visual
exemplars for the category. For each image, we compute
features of three types: shape, color, and texture. For each
type of feature, we create two pools: one containing
positive features from the relevant exemplars and the other
negative features from the background exemplars. For each feature of a particular type in a query image, we apply
a 1-nearest neighbor classifier with similarity measured
using normalized correlation to label the feature as the
relevant topic or the background topic. For each visual cue
(color, shape, and texture), we compute the sum of the
similarities of features matching positive exemplars,
normalized so that scores range between zero and one.
The final score is the sum of the four individual cue scores. By combining these modalities, a much
better ranking is achieved than using any of the cues alone.
Precision-recall curves are shown in Fig. 3, where, as you
move down the ranked set of images, precision is measured
as the percentage of images returned that are good and
recall as the percentage of all good images that are
returned. Though in isolation each of the visual and textual
features is relatively weak, the combination performs quite
well at this ranking task and generally provides an image ranking that is far superior to the original Google ranking.
The resulting sets of animal images (Fig. 4) are quite
compelling and demonstrate that we can handle a broad
range of animals.
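A minimal sketch of the final scoring and of the precision measure behind the curves; the per-cue scores and ground-truth flags below are invented for illustration.

```python
def combined_score(cue_scores):
    """Final score: the sum of the four per-cue scores (text, color,
    shape, texture), each already normalized to [0, 1]."""
    return sum(cue_scores.values())

def precision_at_k(ranked_labels, k):
    """Fraction of the top-k returned images that are good (label True)."""
    top = ranked_labels[:k]
    return sum(top) / float(len(top))

# Toy data: per-image cue scores plus a ground-truth "good" flag.
images = [
    ({"text": 0.9, "color": 0.7, "shape": 0.8, "texture": 0.6}, True),
    ({"text": 0.2, "color": 0.9, "shape": 0.1, "texture": 0.2}, False),
    ({"text": 0.8, "color": 0.6, "shape": 0.7, "texture": 0.7}, True),
]
ranked = sorted(images, key=lambda im: combined_score(im[0]), reverse=True)
labels = [good for _, good in ranked]
```

Sweeping k from 1 to the length of the ranked list, together with the matching recall values, traces out curves of the kind shown in Fig. 3.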
Because we apply our system to Web pages with free
text, the word cue is extremely noisy; but, similarly to
the face labeling task (discussed in Section III-A), we show
unequivocal evidence that combining visual information with textual information improves performance on an
image-labeling task. By considering top-ranked images
using our text and image classification model, we can
produce labelled datasets where the majority of labels are
correct. For the ten original animal categories, this
corresponds to an average precision of 55% for the top
100 returned results, as compared to the original Google
ranking precision of 21%. The giraffe and frog classifiers are especially accurate, returning 74 and 83 true positives,
respectively. For one category, monkey, we have collected
an additional much larger set of about 13 000 images using
the original query plus related query words like "old
world" or "science." The dataset that we produce from this
set of images is startlingly accurate (81% precision for the
first 500 images and 69% for the first 1000). Not only does
the resulting collection contain monkeys in a variety of poses, aspects, and depictions, but it also contains a large
number of monkey species and other related primates
including lemurs, chimps, and gibbons.
Here we have demonstrated a method to produce large,
accurate, and challenging visual datasets for object
categories with widely varying appearance using only a
small amount of human supervision. Though we chose to
focus on animal categories here, this method could be
Fig. 3. We are able to correctly classify images from the Web depicting categories of animals despite wide variations in appearance.
Even though the relationship between words and pictures on a Web page is complex, we are able to very effectively rerank Google search
results by combining several image- and text-based cues. These graphs show classification performance for three of our animal classes:
‘‘monkey’’ (left), ‘‘frog’’ (center), and ‘‘giraffe’’ (right). Colored precision-recall curves show performance for the original Google Web search
classification (red), word-based classification (green), local shape feature-based classification (magenta), color-based classification (cyan), and
texture-based classification (yellow). Our final output classification (black) utilizes a combination of all of these cues. Even though each cue in
isolation may have poor ranking performance, levering the different sources of information in combination (black) tends to do quite well.
In particular, you can see that incorporating visual information increases classification performance enormously over using word-based
classification alone.
applied to most other natural object categories, providing one possible solution for collecting large labelled image
datasets.
1) What Can Be Done With Text and Images: Many other
computer vision researchers have explored the relationship
between words and pictures in various settings: exploring the semantics between words and pictures [57]–[59], image clustering [60], [61], or story illustration [62].
Others have attempted to leverage Internet image col-
lections to assist in image search [63], [64] or recognition
[65]–[70]. The most common strategy is to improve anno-
tation quality or filter spurious search results output by
one of the commercial search engines, gathering a new
Fig. 4. By levering different sources of information (image cues computed on the picture itself and text cues computed on the surrounding
Web page) we are able to effectively rerank Google search results. Here we show some ranked results from the "bear," "dolphin," "frog,"
"giraffe," "leopard," and "penguin" categories. Most of the top classified images for each category are correct and often display a wide variety of
poses ("giraffe"), depictions ("leopard": heads or whole bodies), and even multiple species ("penguin," "bear"). Notice that the highly
ranked false positives (dark red) are quite reasonable, since they display appearances or contexts similar to the true category: teddy bears for
the "bear" class, whale images for the "dolphin" class, and leopard frogs or leopard geckos for the "leopard" class. Drawings, even if they
depict the desired category, were counted as false positives for this task (e.g., "dolphin" and "leopard" categories).
collection of images that can be used for training
recognition systems [65]–[68].
Along these lines, we propose several other important goals for future research: labeling Web images with
relevant nouns extracted from the containing Web pages,
ranking nearby sentences according to how well they
describe the picture, and developing automated systems
for writing summaries about the content of photographs by
mining nearby textual information.
C. Case Study: The Internet as a Source of Features
So far, we have looked at ways of using Internet images
to build datasets. We can also build representations for an
input image based on the metadata that surrounds visually
similar Internet images. The metadata, such as surrounding
text or GPS coordinates, provides direct access to important scene information. Here, we discuss our recent
work [71] that builds text features from histograms of tags
and group names that are linked to the K most similar
images in a large set downloaded from Flickr (a popular
photo-sharing site). The text features can then be used to
improve prediction of object presence (Fig. 5 shows the
framework of our approach).
Our method is based on two key ideas. First, it is often easier to determine the content of an image using nearby text
than with currently available image features. State-of-the-art
methods in computer vision [72] are still not capable of
handling the unpredictability of object positions and sizes,
appearance, lighting, and unusual camera angles that are
common in consumer photographs, such as those found on
Flickr. Determining object presence from the text that
surrounds an image (tags, discussion, group names) is also far from trivial due to polysemy, synonymy, and incomplete or
spurious annotations. Still, as we saw in Section III-B, even
unstructured text can provide valuable information that is
not easy to extract from image features. The second
important idea is that, given a large enough dataset, we are
bound to find very similar images to an input image, even
when matching with simple image features. This idea has
been demonstrated by Torralba et al. [73], who showed that matching tiny (32 × 32) images using Euclidean distance of
intensity leads to surprisingly good object recognition
results if the dataset is large enough (tens of millions of
images). Likewise, Hays and Efros [74], [75] showed that
simple image matching can be used to complete images and
to infer world coordinates. Our approach is to infer likely
text for our input image based on similar images in a large
dataset and use that text to determine whether an object is present. Along these lines, Quattoni et al. [69] use captioned
images to learn a more predictive visual representation. Our
work is related to this in that we learn a distance metric
that causes images with similar surrounding text to be
similar in visual feature space.
The text features that we use are straightforward.
Flickr images possess tags and are often assigned to one or
more groups. When we collected the images, we made a record of the tags and group names associated with each
image. Each unique tag or group name forms a single item,
even though it may include multiple words. For example,
the group name "Dogs! Dogs! Dogs!" is treated as a single
item. We use only tags and group names that occur
frequently in the auxiliary dataset, resulting in a vocabu-
lary of about 6000 items. The text features are computed
by building a histogram of these items, counted over the K nearest neighbors of the test image. In our experiments,
K = 150. We can then train and evaluate a classifier based
on these text features in the standard way (we use a
support vector machine classifier with a chi-squared kernel
in our experiments).
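The feature construction itself is a few lines. In this sketch the vocabulary and neighbor tags are toy values (the real vocabulary has about 6000 items and K = 150); the resulting histogram is what would be fed to the chi-squared-kernel SVM.

```python
from collections import Counter

def text_feature(neighbor_tags, vocabulary):
    """Histogram of tags/group names over the K nearest Internet images,
    restricted to the frequent items kept in the vocabulary."""
    counts = Counter(t for tags in neighbor_tags for t in tags
                     if t in vocabulary)
    return [counts[item] for item in vocabulary]

# A multiword group name is a single vocabulary item.
vocabulary = ["cat", "kitten", "Dogs! Dogs! Dogs!", "beach"]
# Tags of K = 3 hypothetical nearest neighbors of a test image.
neighbor_tags = [["cat", "kitten"], ["cat", "sofa"], ["Dogs! Dogs! Dogs!"]]
feat = text_feature(neighbor_tags, vocabulary)  # [2, 1, 1, 0]
```

Note that "sofa" contributes nothing because it is outside the vocabulary; only the frequent items from the auxiliary dataset count.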
In the work described in this paper, we found that
these text features do not outperform standard constructions
of visual features for standard discriminative tasks. However, because the text features tend to make quite
different errors than the visual features (also observed in
our work on classifying animal images; see Section III-B),
predictions based on one can correct predictions based on
the other. In general, we find that, by combining the
features, we can obtain a small but significant improve-
ment in error rate across classes for a standard task
(classification in the PASCAL 2006 dataset; see [76]). Figs. 6 and 7 provide illuminating examples. In Fig. 6, we
show examples misclassified by the visual classifier but
correctly classified by the text classifier on the PASCAL
2006 dataset. In the first image, the cat is in a sleeping
pose, which is unusual in the PASCAL training set, and so
the visual classifier gets it wrong. However, we find many
such images in the auxiliary dataset, and there are several
sleeping cat images in the 25 nearest neighbors, so the text cue can make a correct prediction. When the nearest
neighbors are poor, the text feature is less helpful, as Fig. 7
shows. While one might worry that our text features are
powerful only because our images are tagged with category
labels, this is not the case. We have tested this by excluding
category names and their plural inflections from the text
features. This means that, for example, the words "cat"
Fig. 5. The framework of our approach. We have training and
test images (here we only show the test image part). We also have an
auxiliary dataset consisting of Internet images and associated text.
For each test image, we extract its visual features and find the K most
similar images from the Internet dataset. The text associated with
these nearest neighbor Internet images is summarized to build the text
features. Text classifiers that are trained with the same type of text
features are applied to predict the object labels. We can also train
visual classifiers with the visual features. The outputs from the two
classifiers are fused to do the final classification.
and "cats" would not appear in the features. The effect on
performance is extremely small. This suggests that text associated
with images is rich in secondary cues (perhaps
"mice" or "catnip" appear strongly with cats).
IV. THE INTERNET AS A MARKETPLACE
Another way to see the Internet is as a device to connect
willing buyers with willing sellers. We would like to buy
image annotations cheaply and in large quantities by
outsourcing annotation work to an online worker commu-
nity. There are now strong tools for doing so efficiently.
We have used Amazon’s Mechanical Turk (Section IV-A)
and have found the resulting annotations to be both good and cheap (Section IV-B).
A. How to Annotate on Mechanical Turk
Each annotation task is converted into a human intel-
ligence task (HIT). The tasks are submitted to Amazon
Mechanical Turk (MT). Online workers choose to work on
the submitted tasks. Every worker opens our Web page
with a HIT and does what we ask them to do. They
"submit" the result to Amazon. We then fetch all results from Amazon MT and convert them into annotations. The
core tasks for a researcher are 1) define an annotation
protocol and 2) determine what data need to be annotated.
To get good quality data, one must ensure that the
workers understand the requested task and try to perform
it well. We have found it to be extremely helpful to set up
examples showing how we would like the labeling task to
be performed. Additionally, we have three strategies to clean up occasional errors and detect and prevent cheating.
The basic strategy is to collect multiple annotations for
every image. This will account for natural variability of
human performance, reduce the influence of occasional
errors, and allow us to catch malicious users. However,
this increases the cost of annotation. The second strategy is
to perform a separate grading task. A worker looks at
several annotated images and scores every annotation. We get explicit quality assessments at a fraction of the cost
because grading is easy. The third strategy is to build a gold standard: a collection of images with trusted annotations.
Fig. 6. The left column shows the PASCAL 2006 images whose category labels cannot be predicted by the visual classifier but can be predicted
by the text classifier; the center column shows the 25 nearest neighbor images retrieved from the Internet dataset; the right column shows
the built text feature vectors. In the first image, the cat is in a sleeping pose, which is unusual in the PASCAL training set, so the visual classifier gets
it wrong. Some sleeping cat images are retrieved from the auxiliary dataset, and the text features then make a correct prediction.
Images from the gold standard are injected into the
annotation process and used to detect and correct workers
deviating significantly from the desired results. This
strategy is again cheap, as only a small fraction of images
come from the gold standard.
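The gold-standard strategy reduces to a simple filter over worker submissions. The sketch below is illustrative only: the label values, the error tolerance, and the data layout are all invented for the example.

```python
def flag_workers(submissions, gold, max_error_rate=0.25):
    """Flag workers whose answers on injected gold-standard images deviate
    from the trusted annotations more often than the tolerated rate."""
    flagged = set()
    for worker, answers in submissions.items():
        checked = [img for img in answers if img in gold]
        if not checked:
            continue  # this worker saw no gold-standard images
        errors = sum(answers[img] != gold[img] for img in checked)
        if errors / float(len(checked)) > max_error_rate:
            flagged.add(worker)
    return flagged

gold = {"img1": "person", "img2": "no_person"}  # trusted annotations
submissions = {
    "workerA": {"img1": "person", "img2": "no_person", "img9": "person"},
    "workerB": {"img1": "no_person", "img2": "no_person"},
}
bad = flag_workers(submissions, gold)  # {"workerB"}
```

Because only a small fraction of images come from the gold standard, the per-worker check stays cheap while still catching systematic deviation.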
B. Our Experience of Annotation on Mechanical Turk
We have detailed quality data for four annotation pro-
tocols (Fig. 8): two coarse object segmentation protocols,
polygonal labeling, and 14-point human landmark labeling.
The object segmentation protocols show an image to the
worker and a small image of the query (person). We ask
the worker to click on every circle (site) overlapping with
the query (person). The first protocol places sites on a regular grid, whereas the second places sites at the centers
of superpixels (computed with [77], [78]). In the third
protocol, polygonal labeling, we ask the worker to trace the
boundary of the person in the image (similar to the
LabelMe [31] task). The fourth protocol labels the land-
marks of the human body used for pose annotation [79],
asking the worker to click on locations of these points in a
specified order. So far, we have run five annotation experiments
using data collected from YouTube (experiments 1, 2, 5), a dataset of people [79] (experiments 3, 4), a
small sample of data from LabelMe [31], Weizmann [80],
and our own dataset (experiment 5). In all experiments,
we are interested in people. As shown in Table 1, we have
obtained a total of 3861 annotations for 982 distinct images
collected for a total cost of $59.
We present sample annotation results (Figs. 8 and 9) to
show the representative annotations and highlight the
most prominent failures. We are extremely satisfied with
the quality of the annotations, given that
workers receive no feedback from us.
1) Pricing: Pricing annotation work is difficult, and
throughput is quite sensitive to price. Even if the task is
underpriced, some workers participate out of curiosity or
for entertainment, but we do not expect to be able to
obtain good annotations at large scales like this. If the price
is too high, we could be wasting resources and possibly
attracting inefficient workers. We have no algorithm for
pricing but surveyed workers to get a sense of how to price work. As Table 1 shows, the hourly pay in experiments 4
and 5 was roughly $1/h. In these experiments, we had a
comments field, and some comments suggested that the
pay should be increased by a factor of three. From this, we
conclude that the perceived fair pricing is about $3/h,
though we expect that this depends on the nature of the
work. In further experiments, we have offered small
amounts of work at a given price, then raised or lowered pay depending on how quickly it was finished.
2) Annotation Quality: To understand the quality of
annotations, we use three simple consistency scores for a
pair of annotations (a1 and a2) of the same type. For protocols 1–3, we divide the area where the annotations
disagree by the area marked by either of the two annotations.
We can think of this as XOR(a1, a2)/OR(a1, a2). For
Fig. 7. The left column shows the PASCAL images whose category labels cannot be predicted by the text classifier but can be predicted by the
visual classifier; The center column shows the 25 nearest neighbor images retrieved from the Internet dataset; the right column shows the built
text features of the PASCAL images. The text features do not work here mainly because we fail to find good nearest neighbor images.
protocols 1 and 2, XOR counts the sites with different annotations, and OR counts the sites marked by either of the
two annotations a1 and a2. For protocol 3, XOR is the area
of the symmetric difference and OR is the area of the
union. For protocol 4, we measure the average distance
between the selected landmark locations. Ideally, the
locations coincide and the score is zero.
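These scores are straightforward to implement. A sketch for the site-clicking protocols (1 and 2), where an annotation is the set of marked sites, and for the landmark protocol (4), where it is a list of (x, y) points:

```python
def site_consistency(a1, a2):
    """XOR/OR score for protocols 1 and 2: sites on which the two
    annotations disagree, divided by sites marked in either annotation."""
    a1, a2 = set(a1), set(a2)
    union = a1 | a2
    if not union:
        return 0.0
    return len(a1 ^ a2) / float(len(union))

def landmark_consistency(points1, points2):
    """Protocol 4: average Euclidean distance between paired landmarks;
    zero means the two annotations coincide exactly."""
    dists = [((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
             for (x1, y1), (x2, y2) in zip(points1, points2)]
    return sum(dists) / len(dists)
```

For protocol 3 the same XOR/OR ratio is taken over polygon areas (symmetric difference over union) rather than over site sets.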
We then select the two best annotations for every
image by simply taking the pair with the lowest score, i.e., the most consistent pair. For protocol 3, we further assume that the
polygon with more vertices is a better annotation and we
put it first in the pair. The distribution of scores and a
detailed analysis appears in Fig. 9 with scores ordered
from best (lowest) on the left to the worst (highest) on
the right. Some of the errors come from sloppy annotations
(especially in the heavily underpaid experiment 3), but
most of the disagreements arise when the question we ask is difficult to answer. For example, in experiment 4, workers
were asked to label landmarks that are often obscured
by clothing. In Fig. 10, we show consistency of the anno-
tations of each landmark between the thirty-fifth and the
sixty-fifth percentile of Fig. 9, indicating that hips are
much more difficult to localize compared to other, less
often obscured joint positions.
More recently, in pursuit of an object-recognition research agenda [81], we have used Mechanical Turk to
label two large datasets with approximately 400 000 labels.
Our datasets are intended to explore the object description
problem. In particular, we want to describe objects in terms
of their attributes, properties like "made of wood," "made
of metal," "has a wheel," "has skin," and so on. We
collected attribute annotations for each of 20 object classes
in a standard object recognition dataset, PASCAL VOC 2008, created for classification and detection of visual
object classes in a variety of natural poses, viewpoints, and
orientations. These object classes cluster nicely into
"animals," "vehicles," and "things" and include object
classes such as people, bird, cat, boat, tv, etc., with between
150 and 5000 instances per category. To supplement this
dataset, we collected images for 12 additional object classes
from Yahoo! image search, selected to have objects similar to the PASCAL classes while having different correlations
between the attributes. For example, PASCAL has a "dog" category, so we collected "wolf" images, and so on. Objects
in this set include, for example: wolf, zebra, goat, donkey,
monkey, statue of people, centaur, etc.
Fig. 8. Example results obtained from the annotation experiments for two of our protocols. The first column is the
implementation of the protocol, the second column shows obtained results, and the third column shows some poor annotations we observed.
The user interfaces are similar, simple, and easy to implement.
Table 1 Collected Data. In Our Five Experiments, We Have Collected
3861 Labels for 982 Distinct Images for Only $59. In Experiments 4 and 5,
the Throughput Exceeds 300 Annotations Per Hour Even at a Low ($1/h)
Hourly Rate. We Expect a Further Increase in Throughput as We Increase the
Pay to Effective Market Rate
We made a list of 64 attributes to describe our objects
and collected annotations for semantic attributes for each
object using Amazon’s Mechanical Turk, a total of approximately half a million attribute labels for a total
cost of $600. Labeling objects with their attributes can
often be an ambiguous task. This can be demonstrated by
imperfect interannotator agreement among "experts" (authors) and Amazon Turk annotators. The agreement
among experts is 84.3%, between experts and Amazon Turk
annotators is 81.4%, and among Amazon Turk annotators is
84.1%. As we have worked with the dataset, we have found some important idiosyncrasies. For example, images of
people labelled as not having "skin" almost always do; there
just is not very much visible, or the skin that is visible is
unremarkable (hands or faces). This effect could be
ascribed to miscommunication between annotators and collectors. We have not found evidence that we need to
deploy quality control measures, though we may have been
lucky in the interface design.
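The agreement numbers above are just the fraction of matching labels between two annotators; a minimal sketch on toy binary attribute vectors:

```python
def agreement(labels_a, labels_b):
    """Fraction of attribute labels on which two annotators agree."""
    assert len(labels_a) == len(labels_b)
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / float(len(labels_a))

# Toy binary attribute vectors ("has wheel", "made of metal", ...).
expert = [1, 0, 1, 1, 0]
turker = [1, 0, 0, 1, 0]  # disagrees on one of five attributes
```

Averaging this score over all pairs of annotators within a group (experts, or Turk workers) gives group-level figures like the 84.3% and 84.1% reported above.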
V. CONCLUSION: DATASET RESEARCH ON THE INTERNET IS FERTILE
Recognition research needs datasets, and the Internet is a
valuable source of datasets. Collecting these datasets is not
just collation. Instead, the question of how to produce
Fig. 9. Quality details. We present detailed analysis of annotation quality for experiment 4. For every image, the best fitting pair of annotations
is selected. The score of the best pair is shown in the figure. For experiment 4, we compute the average distance between the marked points.
The scores are ordered low (best) to high (worst). For clarity, we render annotations at every 15th percentile of the score (the 5th, 20th, ..., 95th). Blue curve and dots show
annotation 1; yellow curve and dots show annotation 2 of the pair.
Fig. 10. Quality details per landmark. We present analysis of annotation quality per landmark in experiment 4. We show scores of the best
pair for all annotations between the thirty-fifth and sixty-fifth percentiles (between points C and E of experiment 4 in Fig. 9). All the plots
have the same scale: from image 100 to 200 on the horizontal axis and from 3 to 13 pixels of error on the vertical axis. These graphs show
annotators have greater difficulty choosing a consistent location for the hip than for any other landmark; this may be because some place the
hip at the point a tailor would use and others mark the waist or because the location of the hip is difficult to decide under clothing.
Berg et al. : It’s All About the Data
1448 Proceedings of the IEEE | Vol. 98, No. 8, August 2010
them has proven fertile. It has been the source of inspi-ration for a wide range of research activities and for new
research agendas. The question of how to attach a name to
a visual representation of a face using whatever context
seems likely to help has now produced novel applications,
datasets, and insights about context and its meaning
(Section III-A). Similarly, the general question of how to
exploit text near images has become a major research
agenda in computer vision (Section III-B1). But there are other, currently less well-developed lines
of enquiry. For example, we could use the view of the
Internet as a marketplace where annotation services are
available very cheaply to evaluate object detectors in a
realistic way. Current practice involves building datasets of
annotated positive and negative examples, then using part
for training and part for testing. Once such a dataset has
been used a lot, results reported are questionable because they are subject to a form of selection bias. Where cheap
annotation is available, we can adopt a different strategy.
We want to evaluate a detector on all possible images, but
annotation is not cheap enough to annotate all images. So
we must draw a sample, and we could do so by applying the
detector to a large pool of images (ideally, every image),
then sampling the images where the detectors respond and
sending them out for annotation. This argument is similar in flavor to the procedures used by active learning systems,
where a large pool of data is iteratively used to develop
classifiers with a human in the loop for interactive training
[82]. In particular, if one has two or more detectors, each
may serve as a guide to where other detectors should have
responded but did not. This view offers the prospect of
relatively cheap, very large-scale relative evaluations of
detectors that yield statistics close to the actual performance one expects on every image in the world.
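The sampling strategy described above fits in a few lines. This is a hedged sketch, not the authors' implementation: the detector functions, threshold, and annotation budget below are hypothetical placeholders for real detectors and a real annotation service.

```python
import random

def select_for_annotation(image_ids, detectors, threshold=0.5, budget=100, seed=0):
    """Sample images for annotation from those where any detector responded.

    `detectors` is a list of scoring functions (image -> float); only images
    on which at least one detector scores above `threshold` are candidates,
    and at most `budget` of them are sampled for human annotation.
    """
    responded = [
        img for img in image_ids
        if any(score(img) >= threshold for score in detectors)
    ]
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    k = min(budget, len(responded))
    return rng.sample(responded, k)

if __name__ == "__main__":
    pool = list(range(1000))
    # Two toy "detectors": deterministic functions of the image id.
    det_a = lambda img: (img % 7) / 6.0
    det_b = lambda img: (img % 11) / 10.0
    batch = select_for_annotation(pool, [det_a, det_b], threshold=0.9, budget=50)
    print(len(batch))  # at most 50 images to annotate instead of 1000
```

With two or more detectors, the union of their responses plays the role described in the text: each detector guides annotation toward images where the others should have responded but did not.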
As another example, if one has thousands (or millions)
of pictures of an object category, which pictures are iconic,
in the sense that they give a good representation of the
object? Some recent papers have explored methods for
choosing the most representative or canonical photographs
for a given location [83], monument [3], [84], or object
category [85] by looking at coherence within an image and across the set of search returns. But another strategy
observes that only well-composed and aesthetically
appealing images are likely to be the most relevant to a
query (returning a poor image is unlikely to be aligned with a user's needs) [86]–[88]. Recently, Flickr.com, a
popular photo-sharing site, has used a similar idea quite
successfully with their measure of "interestingness," related to user behavior around photographs. Photographs are considered "interesting" if they demonstrate a
great deal of user activity, in the form of user favorites,
clicks, comments, etc. This measure, though computed
from human behavior, is usually directly related to the photograph's intrinsic quality since people tend to mark
as a favorite or comment on only those photographs that
strike them as pleasing in some way. Though neither
aesthetic quality nor "interestingness" directly addresses
the question of relevance, utilizing these alternative
sources of information as part of search could help to
improve ranking results significantly.
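As a rough illustration of re-ranking by such activity signals (the weights and photo records below are invented for the example; Flickr's actual interestingness measure is not public):

```python
def interestingness(photo):
    """Weighted sum of user-activity signals; the weights are illustrative."""
    return (3.0 * photo["favorites"]
            + 2.0 * photo["comments"]
            + 0.1 * photo["clicks"])

def rerank(results):
    """Order search results from most to least interesting."""
    return sorted(results, key=interestingness, reverse=True)

if __name__ == "__main__":
    results = [
        {"id": "a", "favorites": 1, "comments": 0, "clicks": 40},
        {"id": "b", "favorites": 12, "comments": 5, "clicks": 300},
        {"id": "c", "favorites": 0, "comments": 2, "clicks": 400},
    ]
    print([p["id"] for p in rerank(results)])  # ['b', 'c', 'a']
```

In a search pipeline, a score like this would be combined with a text-relevance score rather than replace it, since interestingness alone says nothing about whether an image matches the query.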
As yet another example, if we want to allow users to search, browse, or otherwise access visual data in large
quantities, it will need to be organized in ways that make
sense. This organization will have to represent at least
some components of the "meaning" of the visual data.
Obtaining this meaning will, for the foreseeable future,
require an understanding of the complex relationships
between visual data and textual information that appears
nearby. The ultimate goal is to use methods of both computer vision and natural language understanding to produce a representation of meaning for visual data from that
data and all the text that might be relevant. Doing so is
much more difficult than, for example, attaching names to
faces because the tools in both vision and natural language
are less well developed for the general case. However, as
we have shown, words on pages surrounding images
contain really useful information about those pictures. In summary, what might appear to be a humdrum question (how to collect data on the Internet) is in fact a
fertile research area in computer vision that forces us to
address deep and important questions of visual representation and meaning.
Acknowledgment
The authors thank the anonymous referees for their
helpful comments. They thank Dolores Labs and L. Biewald
for their help managing data annotation.
REFERENCES
[1] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge, U.K.: Cambridge Univ. Press, 2003.
[2] N. Snavely, S. M. Seitz, and R. Szeliski, "Photo tourism: Exploring photo collections in 3D," in Proc. SIGGRAPH, 2006.
[3] X. Li, C. Wu, C. Zach, S. Lazebnik, and J.-M. Frahm, "Modeling and recognition of landmark image collections using iconic scene graphs," in Proc. ECCV, 2008.
[4] D. Forsyth and J. Ponce, Computer Vision: A Modern Approach. Englewood Cliffs, NJ: Prentice-Hall, 2002.
[5] L. Shapiro and G. Stockman, Computer Vision. Englewood Cliffs, NJ: Prentice-Hall, 2001.
[6] J. Ponce, M. Hebert, C. Schmid, and A. Zisserman, Eds., Toward Category-Level Object Recognition. Berlin, Germany: Springer, 2006.
[7] N. Ikizler and D. Forsyth, "Searching video for complex activities with finite state models," Int. J. Comput. Vision, 2008.
[8] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, "Face recognition: A literature survey," ACM Comput. Surv., vol. 35, no. 4, pp. 399–458, 2003.
[9] S. G. Kong, J. Heo, B. R. Abidi, J. Paik, and M. A. Abidi, "Recent advances in visual and infrared face recognition: A review," Comput. Vision Image Understand., vol. 97, no. 1, pp. 103–135, 2005.
[10] S. Z. Li and A. K. Jain, Eds., Handbook of Face Recognition. Berlin, Germany: Springer, 2005.
[11] P. J. Phillips and E. Newton, "Meta-analysis of face recognition algorithms," in Proc. Int. Conf. Autom. Face Gesture Recognit., 2002.
[12] B. Leibe, A. Leonardis, and B. Schiele, "Robust object detection with interleaved categorization and segmentation," Int. J. Comput. Vision, 2008.
[13] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. CVPR, 2005, vol. 1, pp. 886–893.
[14] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio, "Pedestrian detection using wavelet templates," in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 1997, pp. 193–199.
[15] P. Viola, M. Jones, and D. Snow, "Detecting pedestrians using patterns of motion and appearance," Int. J. Comput. Vision, vol. 63, no. 2, pp. 153–161, Jul. 2005.
[16] M.-H. Yang, D. Kriegman, and N. Ahuja, "Detecting faces in images: A survey," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, pp. 34–58, 2002.
[17] H. Rowley, S. Baluja, and T. Kanade, "Neural network-based face detection," in Proc. CVPR, 1996, pp. 203–208.
[18] T. Poggio and K.-K. Sung, "Finding human faces with a Gaussian mixture distribution-based face model," in Proc. Asian Conf. Comput. Vision, 1995, pp. 435–440.
[19] D. Martin, C. Fowlkes, D. Tal, and J. Malik, "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proc. Int. Conf. Comput. Vision, 2001.
[20] L. Fei-Fei, R. Fergus, and P. Perona, "One-shot learning of object categories," IEEE Trans. Pattern Anal. Machine Intell., to be published.
[21] G. Griffin, A. Holub, and P. Perona, Caltech-256 object category dataset, California Inst. of Technology, Tech. Rep. 7694, 2007. [Online]. Available: http://authors.library.caltech.edu/7694
[22] M. Everingham, A. Zisserman, C. K. I. Williams, and L. Van Gool, The PASCAL visual object classes challenge 2006 (VOC2006) results. [Online]. Available: http://www.pascal-network.org/challenges/VOC/voc2006/results.pdf
[23] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, The PASCAL visual object classes challenge 2007 (VOC2007) results. [Online]. Available: http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html
[24] S. Agarwal, A. Awan, and D. Roth, "Learning to detect objects in images via a sparse, part-based representation," IEEE Trans. Pattern Anal. Machine Intell., vol. 26, pp. 1475–1490, Nov. 2004.
[25] K. Barnard, Q. Fan, R. Swaminathan, A. Hoogs, R. Collins, P. Rondot, and J. Kaufhold, "Evaluation of localized semantics: Data, methodology, and experiments," Int. J. Comput. Vision, 2008.
[26] C. Papageorgiou and T. Poggio, "A trainable system for object detection," Int. J. Comput. Vision, 2000.
[27] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. CVPR, 2005.
[28] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, "Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection," IEEE Trans. Pattern Anal. Machine Intell. (Special Issue on Face Recognition), vol. 19, no. 7, pp. 711–720, 1997.
[29] P. J. Phillips, A. Martin, C. Wilson, and M. Przybocki, "An introduction to evaluating biometric systems," Computer, vol. 33, no. 2, pp. 56–63, 2000.
[30] T. Sim, S. Baker, and M. Bsat, "The CMU pose, illumination, and expression (PIE) database," in Proc. AFGR, 2002.
[31] B. Russell, A. Torralba, K. Murphy, and W. T. Freeman, "LabelMe: A database and Web-based tool for image annotation," Int. J. Comput. Vision, 2007.
[32] L. von Ahn and L. Dabbish, "Labeling images with a computer game," in Proc. ACM CHI, 2004.
[33] L. von Ahn, R. Liu, and M. Blum, "Peekaboom: A game for locating objects in images," in Proc. ACM CHI, 2006.
[34] B. Yao, X. Yang, and S.-C. Zhu, "Introduction to a large scale general purpose ground truth dataset: Methodology, annotation tool, and benchmarks," in Proc. EMMCVPR, 2007.
[35] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. CVPR, 2009.
[36] C. Fellbaum, Ed., WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press, 1998.
[37] T. Berg, A. Berg, J. Edwards, M. Maire, R. White, E. Learned-Miller, Y. Teh, and D. Forsyth, "Names and faces," in Proc. CVPR, 2004.
[38] T. L. Berg, A. C. Berg, J. Edwards, and D. Forsyth, "Who's in the picture?" in Proc. NIPS, Dec. 2004.
[39] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan, "GATE: A framework and graphical development environment for robust NLP tools and applications," in Proc. Assoc. Comput. Linguist., 2002.
[40] K. Mikolajczyk, "Face detector," Ph.D. dissertation, INRIA Rhône-Alpes.
[41] A. Berger, S. Pietra, and V. D. Pietra, "A maximum entropy approach to natural language processing," Comput. Linguist., vol. 22, no. 1, 1996.
[42] D. Ozkan and P. Duygulu, "A graph based approach for naming faces in news photos," in Proc. CVPR, 2006.
[43] A. Ferencz, E. Learned-Miller, and J. Malik, "Learning hyper-features for visual identification," in Proc. NIPS, 2005.
[44] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: A database for studying face recognition in unconstrained environments," Univ. of Massachusetts, Amherst, Tech. Rep., 2007.
[45] E. Nowak and F. Jurie, "Learning visual similarity measures for comparing never seen objects," in Proc. CVPR, 2007.
[46] G. B. Huang, V. Jain, and E. Learned-Miller, "Unsupervised joint alignment of complex images," in Proc. ICCV, 2007.
[47] A. Gallagher and T. Chen, "Estimating age, gender and identity using first name priors," in Proc. CVPR, 2008.
[48] M. Naaman, R. B. Yeh, H. Garcia-Molina, and A. Paepcke, "Leveraging context to resolve identity in photo albums," in Proc. Joint Conf. Digital Libraries, 2005.
[49] L. Zhang, Y. Hu, M. Li, W. Ma, and H. Zhang, "Efficient propagation for face annotation in family albums," in Proc. ACM Multimedia, 2004.
[50] S. Satoh, Y. Nakamura, and T. Kanade, "Name-It: Naming and detecting faces in news videos," IEEE Multimedia, 1999.
[51] R. Houghton, "Named faces: Putting names to faces," IEEE Intell. Syst., 1999.
[52] X. Song, C.-Y. Lin, and M.-T. Sun, "Cross-modality automatic face model training from large video databases," in Proc. CVPR Workshop, 2004.
[53] M. Everingham, J. Sivic, and A. Zisserman, "Hello! My name is... Buffy: Automatic naming of characters in TV video," in Proc. BMVC, 2006.
[54] D. Ramanan, S. Baker, and S. Kakade, "Leveraging archival video for building face datasets," in Proc. ICCV, 2007.
[55] T. L. Berg and D. A. Forsyth, "Animals on the web," in Proc. CVPR, 2006.
[56] D. Blei, A. Ng, and M. Jordan, "Latent Dirichlet allocation," J. Machine Learn. Res., 2003.
[57] K. Barnard and D. Forsyth, "Learning the semantics of words and pictures," in Proc. ICCV, 2001.
[58] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. I. Jordan, "Matching words and pictures," J. Machine Learn. Res., 2003.
[59] J. Li and J. Z. Wang, "Automatic linguistic indexing of pictures by a statistical modeling approach," IEEE Trans. Pattern Anal. Machine Intell., 2003.
[60] B. Gao, T. Liu, T. Qin, X. Zheng, Q. Cheng, and W. Ma, "Web image clustering by consistent utilization of visual features and surrounding texts," in Proc. ACM Multimedia, 2005.
[61] K. Barnard, P. Duygulu, and D. Forsyth, "Clustering art," in Proc. CVPR, Jun. 2001.
[62] D. Joshi, J. Z. Wang, and J. Li, "The story picturing engine: Finding elite images to illustrate a story using mutual reinforcement," in Proc. 6th ACM SIGMM Int. Workshop Multimedia Inf. Retrieval (MIR'04), 2004.
[63] R. Fergus, P. Perona, and A. Zisserman, "A visual category filter for Google images," in Proc. 8th Eur. Conf. Comput. Vision, 2004.
[64] N. Ben-Haim, B. Babenko, and S. Belongie, "Improving web-based image search via content based clustering," in Proc. Conf. Comput. Vision Pattern Recognit. Workshop, 2006, pp. 106–111.
[65] R. Fergus, P. Perona, and A. Zisserman, "A sparse object category model for efficient learning and exhaustive recognition," in Proc. Comput. Vision Pattern Recognit., 2005.
[66] K. Wnuk and S. Soatto, "Filtering Internet image search results towards keyword based category recognition," in Proc. Conf. Comput. Vision Pattern Recognit., 2008.
[67] L.-J. Li, G. Wang, and L. Fei-Fei, "OPTIMOL: Automatic Object Picture collecTion via Incremental MOdel Learning," in Proc. IEEE Comput. Vision Pattern Recognit., Minneapolis, MN, 2007, pp. 1–8.
[68] F. Schroff, A. Criminisi, and A. Zisserman, "Harvesting image databases from the web," in Proc. 11th Int. Conf. Comput. Vision, Rio de Janeiro, Brazil, 2007.
[69] A. Quattoni, M. Collins, and T. Darrell, "Learning visual representations using images with captions," in Proc. CVPR, Jun. 2007.
[70] B. Collins, J. Deng, L. Kai, and L. Fei-Fei, "Towards scalable dataset construction: An active learning approach," in Proc. ECCV, 2008.
[71] G. Wang, D. Hoiem, and D. Forsyth, "Building text features for object image classification," in Proc. CVPR, 2009.
[72] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman, The PASCAL visual object classes challenge 2007 (VOC2007) results, 2007.
[73] A. Torralba, R. Fergus, and W. T. Freeman, "Tiny images," Computer Science and Artificial Intelligence Lab, Massachusetts Inst. of Technology, Tech. Rep. MIT-CSAIL-TR-2007-024. [Online]. Available: http://dspace.mit.edu/handle/1721.1/37291
[74] J. Hays and A. Efros, "Scene completion using millions of photographs," in Proc. Int. Conf. Comput. Graph. Interact. Techniques, 2007.
[75] J. Hays and A. Efros, "IM2GPS: Estimating geographic information from a single image," in Proc. IEEE Conf. Comput. Vision Pattern Recognit., 2008, pp. 1–8.
[76] M. Everingham, A. Zisserman, C. K. I. Williams, and L. Van Gool, The PASCAL visual object classes challenge 2006 (VOC2006) results. [Online]. Available: http://www.pascal-network.org/challenges/VOC/voc2006/results.pdf
[77] G. Mori, X. Ren, A. Efros, and J. Malik, "Recovering human body configurations: Combining segmentation and recognition," in Proc. CVPR, 2004.
[78] D. Martin, C. Fowlkes, and J. Malik, "Learning to detect natural image boundaries using local brightness, color, and texture cues," IEEE Trans. Pattern Anal. Machine Intell., 2004.
[79] D. Ramanan, "Learning to parse images of articulated bodies," in Proc. NIPS, 2007, pp. 1129–1136.
[80] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," in Proc. ICCV, 2005, pp. 1395–1402.
[81] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, "Learning to describe objects," in Proc. CVPR, 2009.
[82] Y. Abramson and Y. Freund, "Semi-automatic visual learning (Seville): A tutorial on active learning for visual object recognition," in Tutorial, CVPR, 2005.
[83] I. Simon, N. Snavely, and S. Seitz, "Scene summarization for online image collections," in Proc. ICCV, 2007.
[84] T. L. Berg and D. A. Forsyth, "Automatic ranking of iconic images," Univ. of California, Berkeley, Tech. Rep., 2007.
[85] T. L. Berg and A. C. Berg, "Finding iconic images," in Proc. 2nd Internet Vision Workshop, IEEE Conf. Comput. Vision Pattern Recognit., 2009.
[86] R. Raguram and S. Lazebnik, "Computing iconic summaries for general visual concepts," in Proc. 1st Internet Vision Workshop, 2008.
[87] Y. Ke, X. Tang, and F. Jing, "The design of high-level features for photo quality assessment," in Proc. CVPR, 2006.
[88] R. Datta, D. Joshi, J. Li, and J. Z. Wang, "Studying aesthetics in photographic images using a computational approach," in Proc. ECCV, 2006.
ABOUT THE AUTHORS
Tamara L. Berg received the Ph.D. degree from
the Computer Science Department, University of
California, Berkeley, in 2007.
She was a member of the Berkeley Computer
Vision Group and worked under the advisorship of
Prof. D. Forsyth. She spent 2007–2008 as a
Postdoctoral Researcher with Yahoo! Research
developing various projects related to digital
media, including the automatic annotation of con-
sumer photographs. She is currently an Assistant
Professor at Stony Brook University. Her main research area is digital
media, specifically focused on organizing large collections of images with
associated text through the development of techniques in computer
vision and natural language processing. Past projects include automat-
ically identifying people in news photographs, classifying images from
the Web, and finding iconic images in consumer photo collections. She is
also generally interested in bringing together people and expertise from
various areas of digital media including digital art, music, and cultural
studies.
Alexander Sorokin (Member, IEEE) received the
B.S. degree (magna cum laude) from Lomonosov
Moscow State University, Russia, in 2003. He is
pursuing the Ph.D. degree at the University of
Illinois at Urbana-Champaign.
Mr. Sorokin is a member of Phi Kappa Phi. In
2007, he received a Best Paper Award from
Information Visualization 2007.
Gang Wang received the B.S. degree from Harbin
Institute of Technology, China, in 2005. He is
currently working towards the Ph.D. degree at
the University of Illinois at Urbana-Champaign,
Urbana.
His research interests include computer vision
and machine learning.
David Alexander Forsyth (Fellow, IEEE) received
the B.Sc. and M.Sc. degrees in electrical engineer-
ing from the University of the Witwatersrand,
Johannesburg, South Africa, and the D.Phil. degree
from Balliol College, Oxford, U.K.
He has published more than 100 papers on
computer vision, computer graphics, and machine
learning. He is a coauthor (with J. Ponce) of
Computer Vision: A Modern Approach (Englewood
Cliffs, NJ: Prentice-Hall, 2002). He was a Professor
at the University of California, Berkeley. He is currently a Professor at the
University of Illinois at Urbana-Champaign.
Prof. Forsyth was Program Cochair for IEEE Computer Vision and
Pattern Recognition (CVPR) in 2000, General Cochair for CVPR 2006, and
Program Cochair for the European Conference on Computer Vision 2008.
He will be Program Cochair for CVPR 2011. He received an IEEE Technical
Achievement Award in 2005.
Derek Hoiem received the Ph.D. degree from
Carnegie Mellon University, Pittsburgh, PA.
He is an Assistant Professor in computer
science at the University of Illinois at Urbana-
Champaign. His dissertation on automatically
transforming 2-D images into 3-D scenes was
featured in The Economist. His current research
focuses on physical scene understanding and
property-centric object recognition.
Dr. Hoiem received an Honorable Mention for
the ACM Doctoral Dissertation Award.
Ian Endres (Member, IEEE) received the B.S.
degree in computer science from the University
of Illinois at Urbana-Champaign in 2008, where he
is currently pursuing the Ph.D. degree. His re-
search interests include computer vision and
machine learning.
Ali Farhadi (Member, IEEE) is pursuing the Ph.D.
degree in the Computer Science Department,
University of Illinois at Urbana-Champaign.
His work is mainly focused on computer vision
and machine learning. More specifically, he is
interested in transfer learning and its application
to aspect issues in human activity and object
recognition, scene understanding, and attribute-
based representation of objects.
He recently received the inaugural Google
Fellowship in computer vision and image interpretation and the Uni-
versity of Illinois CS/AI 2009 Award.