
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 1466–1475, Portland, Oregon, June 19–24, 2011. ©2011 Association for Computational Linguistics

Unsupervised Discovery of Domain-Specific Knowledge from Text

Dirk Hovy, Chunliang Zhang, Eduard Hovy
Information Sciences Institute

University of Southern California
4676 Admiralty Way, Marina del Rey, CA 90292
{dirkh, czheng, hovy}@isi.edu

Anselmo Peñas
UNED NLP and IR Group

Juan del Rosal 16
28040 Madrid, Spain

[email protected]

Abstract

Learning by Reading (LbR) aims at enabling machines to acquire knowledge from and reason about textual input. This requires knowledge about the domain structure (such as entities, classes, and actions) in order to do inference. We present a method to infer this implicit knowledge from unlabeled text. Unlike previous approaches, we use automatically extracted classes with a probability distribution over entities to allow for context-sensitive labeling. From a corpus of 1.4m sentences, we learn about 250k simple propositions about American football in the form of predicate-argument structures like “quarterbacks throw passes to receivers”. Using several statistical measures, we show that our model is able to generalize and explain the data statistically significantly better than various baseline approaches. Human subjects judged up to 96.6% of the resulting propositions to be sensible. The classes and probabilistic model can be used in textual enrichment to improve the performance of LbR end-to-end systems.

1 Introduction

The goal of Learning by Reading (LbR) is to enable a computer to learn about a new domain and then to reason about it in order to perform such tasks as question answering, threat assessment, and explanation (Strassel et al., 2010). This requires joint efforts from Information Extraction, Knowledge Representation, and logical inference. All these steps depend on the system having access to basic, often unstated, foundational knowledge about the domain.

Most documents, however, do not explicitly mention this information in the text, but assume basic background knowledge about the domain, such as positions (“quarterback”), titles (“winner”), or actions (“throw”) for sports game reports. Without this knowledge, the text will not make sense to the reader, despite being well-formed English. Luckily, the information is often implicitly contained in the document or can be inferred from similar texts.

Our system automatically acquires domain-specific knowledge (classes and actions) from large amounts of unlabeled data, and trains a probabilistic model to determine and apply the most informative classes (quarterback, etc.) at appropriate levels of generality for unseen data. E.g., from sentences such as “Steve Young threw a pass to Michael Holt”, “Quarterback Steve Young finished strong”, and “Michael Holt, the receiver, left early” we can learn the classes quarterback and receiver, and the proposition “quarterbacks throw passes to receivers”.

We will thus assume that the implicit knowledge comes in two forms: actions in the form of predicate-argument structures, and classes as part of the source data. Our task is to identify and extract both. Since LbR systems must quickly adapt and scale well to new domains, we need to be able to work with large amounts of data and minimal supervision. Our approach produces simple propositions about the domain (see Figure 1 for examples of actual propositions learned by our system).

American football was the first official evaluation domain in the DARPA-sponsored Machine Reading program, and provides the background for a number of LbR systems (Mulkar-Mehta et al., 2010). Sports is particularly amenable, since it usually follows a finite, explicit set of rules. Due to its popularity, results are easy to evaluate with lay subjects, and game reports, databases, etc. provide a large amount of data. The same need for basic knowledge appears in all domains, though. In music, musicians play instruments; in electronics, components constitute circuits, circuits use electricity, etc.

Teams beat teams
Teams play teams
Quarterbacks throw passes
Teams win games
Teams defeat teams
Receivers catch passes
Quarterbacks complete passes
Quarterbacks throw passes to receivers
Teams play games
Teams lose games

Figure 1: The ten most frequent propositions discovered by our system for the American football domain

Our approach differs from verb-argument identification or Named Entity (NE) tagging in several respects. While previous work on verb-argument selection (Pardo et al., 2006; Fan et al., 2010) uses fixed sets of classes, we cannot know a priori how many and which classes we will encounter. We therefore provide a way to derive the appropriate classes automatically and include a probability distribution for each of them. Our approach is thus less restricted and can learn context-dependent, fine-grained, domain-specific propositions. While a NE-tagged corpus could produce a general proposition like “PERSON throws to PERSON”, our method enables us to distinguish the arguments and learn “quarterback throws to receiver” for American football and “outfielder throws to third base” for baseball. While in NE tagging each word has only one correct tag in a given context, we have hierarchical classes: an entity can be correctly labeled as a player or a quarterback (and possibly many more classes), depending on the context. By taking context into account, we are also able to label each sentence individually and account for unseen entities without using external resources.

Our contributions are:

• we use unsupervised learning to train a model that makes use of automatically extracted classes to uncover implicit knowledge in the form of predicate-argument propositions

• we evaluate the explanatory power, generalization capability, and sensibility of the propositions using both statistical measures and human judges, and compare them to several baselines

• we provide a model and a set of propositions that can be used to improve the performance of end-to-end LbR systems via textual enrichment.

2 Methods

INPUT: “Steve Young threw a pass to Michael Holt”

1. PARSE INPUT: dependency parse of the sentence

2. JOIN NAMES, EXTRACT PREDICATES:
NVN: Steve_Young throw pass
NVNPN: Steve_Young throw pass to Michael_Holt

3. DECODE TO INFER PROPOSITIONS:
QUARTERBACK throw pass
QUARTERBACK throw pass to RECEIVER

Figure 2: Illustrated example of the different processing steps

Our running example will be “Steve Young threw a pass to Michael Holt”. This is an instance of the underlying proposition “quarterbacks throw passes to receivers”, which is not explicitly stated in the data. A proposition is thus a more general statement about the domain than the sentences it derives. It contains domain-specific classes (quarterback, receiver), as well as lexical items (“throw”, “pass”). In order to reproduce the proposition, given the input sentences, our system has to not only identify the classes, but also learn when to abstract away from the lexical form to the appropriate class and when to keep it (cf. Figure 2, step 3). To facilitate extraction, we focus on propositions with the following predicate-argument structures: NOUN-VERB-NOUN (e.g., “quarterbacks throw passes”), or NOUN-VERB-NOUN-PREPOSITION-NOUN (e.g., “quarterbacks throw passes to receivers”). There is nothing, though, that prevents the use of other types of structures as well. We do not restrict the verbs we consider (Pardo et al., 2006; Ritter et al., 2010), which extracts a high number of hapax structures.

Given a sentence, we want to find the most likely class for each word and thereby derive the most likely proposition. Similar to Pardo et al. (2006), we assume the observed data was produced by a process that generates the proposition and then transforms the classes into a sentence, possibly adding additional words. We model this as a Hidden Markov Model (HMM) with bigram transitions (see Section 2.3) and use the EM algorithm (Dempster et al., 1977) to train it on the observed data, with smoothing to prevent overfitting.

2.1 Data

We use a corpus of about 33k texts on American football, extracted from the New York Times (Sandhaus, 2008). To identify the articles, we rely on the provided “football” keyword classifier. The resulting corpus comprises 1,359,709 sentences from game reports, background stories, and opinion pieces. In a first step, we parse all documents with the Stanford dependency parser (De Marneffe et al., 2006) (see Figure 2, step 1). The output is lemmatized (collapsing “throws”, “threw”, etc., into “throw”), and marked for various dependencies (nsubj, amod, etc.). This enables us to extract the predicate argument structure, like subject-verb-object, or additional prepositional phrases (see Figure 2, step 2). These structures help to simplify the model by discarding additional words like modifiers, determiners, etc., which are not essential to the proposition. The same approach is used by Brody (2007). We also concatenate multi-word names (identified by sequences of NNPs) with an underscore to form a single token (“Steve/NNP Young/NNP” → “Steve_Young”).
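For concreteness, the sketch below illustrates this extraction step. It is not the authors' pipeline: the paper uses the Stanford dependency parser, whereas the sketch uses spaCy as a stand-in, and the function names (`extract_structures`, `entity_form`) are illustrative only.

```python
# A minimal sketch of the Section 2.1 extraction, with spaCy standing in for the
# Stanford dependency parser used in the paper. Function names are illustrative.
import spacy

nlp = spacy.load("en_core_web_sm")

def entity_form(token):
    """Underscore-join a multi-word proper-noun name; otherwise return the lemma."""
    if token.tag_ != "NNP":
        return token.lemma_
    parts = [t.text for t in token.lefts if t.tag_ == "NNP"] + [token.text]
    return "_".join(parts)

def extract_structures(sentence):
    """Extract lemmatized NVN and NVNPN predicate-argument structures."""
    structures = []
    for tok in nlp(sentence):
        if tok.pos_ != "VERB":
            continue
        subjects = [c for c in tok.children if c.dep_ == "nsubj"]
        objects = [c for c in tok.children if c.dep_ == "dobj"]
        if not (subjects and objects):
            continue
        nvn = (entity_form(subjects[0]), tok.lemma_, entity_form(objects[0]))
        structures.append(nvn)  # NOUN-VERB-NOUN
        for prep in (c for c in tok.children if c.dep_ == "prep"):
            for pobj in (c for c in prep.children if c.dep_ == "pobj"):
                structures.append(nvn + (prep.lemma_, entity_form(pobj)))  # NVNPN
    return structures

print(extract_structures("Steve Young threw a pass to Michael Holt"))
# e.g. [('Steve_Young', 'throw', 'pass'),
#       ('Steve_Young', 'throw', 'pass', 'to', 'Michael_Holt')]
```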

2.2 Deriving Classes

To derive the classes used for entities, we do not restrict ourselves to a fixed set, but derive a domain-specific set directly from the data. This step is performed simultaneously with the corpus generation described above. We utilize three syntactic constructions to identify classes, namely nominal modifiers, copula verbs, and appositions, see below. This is similar in nature to Hearst’s lexico-syntactic patterns (Hearst, 1992) and other approaches that derive IS-A relations from text. While we find it straightforward to collect classes for entities in this way, we did not find similar patterns for verbs. Given a suitable mechanism, however, these could be incorporated into our framework as well.

Nominal modifiers are common nouns (labeled NN) that precede proper nouns (labeled NNP), as in “quarterback/NN Steve/NNP Young/NNP”, where “quarterback” is the nominal modifier of “Steve Young”. Similar information can be gained from appositions (e.g., “Steve Young, the quarterback of his team, said...”), and copula verbs (“Steve Young is the quarterback of the 49ers”). We extract those co-occurrences and store the proper nouns as entities and the common nouns as their possible classes. For each pair of class and entity, we collect counts over the corpus to derive probability distributions.
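A minimal sketch of the counting step for the nominal-modifier pattern is shown below; the apposition and copula patterns would add to the same counts analogously. The data structure and function name are hypothetical, not the authors' code.

```python
# Collect (entity, class) co-occurrence counts from the nominal-modifier pattern:
# a common noun (NN) immediately preceding a proper-noun sequence (NNP+).
from collections import defaultdict

class_counts = defaultdict(lambda: defaultdict(int))  # class_counts[entity][cls] = count

def collect_nominal_modifiers(tagged):
    """tagged: list of (word, POS) pairs for one sentence."""
    i = 0
    while i < len(tagged):
        word, pos = tagged[i]
        if pos == "NN" and i + 1 < len(tagged) and tagged[i + 1][1] == "NNP":
            j = i + 1
            name = []
            while j < len(tagged) and tagged[j][1] == "NNP":
                name.append(tagged[j][0])
                j += 1
            class_counts["_".join(name)][word.lower()] += 1
            i = j
        else:
            i += 1

collect_nominal_modifiers([("quarterback", "NN"), ("Steve", "NNP"), ("Young", "NNP")])
# class_counts["Steve_Young"]["quarterback"] == 1; summed over the corpus, these counts
# yield the per-entity class distributions used below.
```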

Entities for which we do not find any of the above patterns in our corpus are marked “UNK”. These entities are instantiated with the 20 most frequent classes. All other (non-entity) words (including verbs) have only their identity as class (i.e., “pass” remains “pass”).

The average number of classes per entity is 6.87. The total number of distinct classes for entities is 63,942. This is a huge number to model in our state space.1 Instead of manually choosing a subset of the classes we extracted, we defer the task of finding the best set to the model.

We note, however, that the distribution of classes for each entity is highly skewed. Due to the unsupervised nature of the extraction process, many of the extracted classes are hapaxes and/or random noise. Most entities have only a small number of applicable classes (a football player usually has one main position, and a few additional roles, such as star, teammate, etc.). We handle this by limiting the number of classes considered to 3 per entity. This constraint reduces the total number of distinct classes to 26,165, and the average number of classes per entity to 2.53. The reduction makes for a more tractable model size without losing too much information. The class alphabet is still several orders of magnitude larger than that for NE or POS tagging. Alternatively, one could use external resources such as Wikipedia, Yago (Suchanek et al., 2007), or WordNet++ (Ponzetto and Navigli, 2010) to select the most appropriate classes for each entity. This is likely to have a positive effect on the quality of the applicable classes and merits further research. Here, we focus on the possibilities of a self-contained system without recourse to outside resources.

1 NE taggers usually use a set of only a few dozen classes at most.
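The class-set reduction can be sketched as follows. Reading the retained counts as P(entity | class) is our interpretation of fixing the emission parameter P(s|p) (Section 2.3); the function names are illustrative, not the authors' implementation.

```python
# Keep each entity's three most frequent classes and normalize the remaining counts
# into emission probabilities.
from collections import Counter, defaultdict

MAX_CLASSES = 3  # per-entity limit used in the paper

def top_classes(class_counts, k=MAX_CLASSES):
    """Keep only the k most frequent classes per entity."""
    return {entity: dict(Counter(counts).most_common(k))
            for entity, counts in class_counts.items()}

def emission_probabilities(reduced):
    """Normalize the kept counts into P(entity | class)."""
    class_totals = defaultdict(float)
    for counts in reduced.values():
        for cls, c in counts.items():
            class_totals[cls] += c
    emit = defaultdict(dict)
    for entity, counts in reduced.items():
        for cls, c in counts.items():
            emit[cls][entity] = c / class_totals[cls]
    return emit
```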

The number of classes we consider for each entity also influences the number of possible propositions: if we consider exactly one class per entity, there will be little overlap between sentences, and thus no generalization is possible—the model will produce many distinct propositions. If, on the other hand, we used only one class for all entities, there would be similarities between many sentences—the model would produce very few distinct propositions.

2.3 Probabilistic Model

Figure 3: Graphical model for the running example. The class sequence p1 p2 p3 p4 p5 (“quarterback throw pass to receiver”) generates the observed words s1 s2 x1 s3 s4 s5 (“Steve_Young threw a pass to Michael_Holt”); the additional word x1 (“a”) is generated independently of the proposition.

We use a generative noisy-channel model to capture the joint probability of input sentences and their underlying proposition. Our generative story of how a sentence s (with words s1, ..., sn) was generated assumes that a proposition p is generated as a sequence of classes p1, ..., pn, with transition probabilities P(pi|pi−1). Each class pi generates a word si with probability P(si|pi). We allow additional words x in the sentence which do not depend on any class in the proposition and are thus generated independently with P(x) (cf. model in Figure 3).

Since we observe the co-occurrence counts of classes and entities in the data, we can fix the emission parameter P(s|p) in our HMM. Further, we do not want to generate sentences from propositions, so we can omit the step that adds the additional words x in our model. The removal of these words is reflected by the preprocessing step that extracts the structure (cf. Section 2.1).

Our model is thus defined as

\[
P(s, p) = P(p_1) \cdot \prod_{i=1}^{n} \big( P(p_i \mid p_{i-1}) \cdot P(s_i \mid p_i) \big) \tag{1}
\]

where s_i, p_i denote the ith word of sentence s and proposition p, respectively.
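The following is a minimal reading of Equation (1) in code: the log joint probability of a sentence and a candidate class sequence under the bigram HMM. The probability tables are assumed to come from EM training and the count-based estimates above, and the small floor value merely stands in for the smoothing mentioned earlier; this is a sketch, not the authors' implementation.

```python
# Log joint probability of a (sentence, proposition) pair per Equation (1).
import math

def log_joint(sentence, proposition, initial, trans, emit, floor=1e-10):
    """sentence and proposition are equal-length lists of words and classes;
    initial[cls], trans[prev][cls], emit[cls][word] are probability tables."""
    assert len(sentence) == len(proposition)
    logp = math.log(initial.get(proposition[0], floor))
    prev = None
    for word, cls in zip(sentence, proposition):
        if prev is not None:
            logp += math.log(trans.get(prev, {}).get(cls, floor))  # P(p_i | p_{i-1})
        logp += math.log(emit.get(cls, {}).get(word, floor))       # P(s_i | p_i)
        prev = cls
    return logp
```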

3 Evaluation

We want to evaluate how well our model predicts the data, and how sensible the resulting propositions are. We define a good model as one that generalizes well and produces semantically useful propositions.

We encounter two problems. First, since we derive the classes in a data-driven way, we have no gold standard data available for comparison. Second, there is no accepted evaluation measure for this kind of task. Ultimately, we would like to evaluate our model externally, such as measuring its impact on the performance of an LbR system. In the absence thereof, we resort to several complementary measures, as well as performing an annotation task. We derive evaluation criteria as follows. A model generalizes well if it can cover (‘explain’) all the sentences in the corpus with a few propositions. This requires a measure of generality. However, while a proposition such as “PERSON does THING” has excellent generality, it possesses no discriminating power. We also need the propositions to partition the sentences into clusters of semantic similarity, to support effective inference. This requires a measure of distribution. Maximal distribution, achieved by assigning every sentence to a different proposition, however, is not useful either. We need to find an appropriate level of generality within which the sentences are clustered into propositions for the best overall groupings to support inference.

To assess the learned model, we apply the measures of generalization, entropy, and perplexity (see Sections 3.2, 3.3, and 3.4). These measures can be used to compare different systems. We do not attempt to weight or combine the different measures, but present each in its own right.

Further, to assess label accuracy, we use Amazon’s Mechanical Turk annotators to judge the sensibility of the propositions produced by each system (Section 3.5). We reason that if our system learned to infer the correct classes, then the resulting propositions should constitute true, general statements about that domain, and thus be judged as sensible.2 This approach allows the effective annotation of sufficient amounts of data for an evaluation (first described for NLP by Snow et al. (2008)).

3.1 Evaluation Data

With the trained model, we use Viterbi decoding to extract the best class sequence for each example in the data. This translates the original corpus sentences into propositions. See steps 2 and 3 in Figure 2.
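A sketch of this decoding step is given below: standard Viterbi decoding over a bigram HMM, considering for each word only its candidate classes (an entity's extracted classes, or the word itself for non-entities). The tables and the `candidates` callback are hypothetical, following the earlier sketches rather than the authors' code.

```python
# Viterbi decoding: the most likely class sequence for one sentence.
import math

def viterbi(words, candidates, initial, trans, emit, floor=1e-10):
    """candidates(word) -> list of classes to consider for that word."""
    lp = lambda table, key: math.log(table.get(key, floor))
    best = [{c: (lp(initial, c) + lp(emit.get(c, {}), words[0]), None)
             for c in candidates(words[0])}]
    for i in range(1, len(words)):
        column = {}
        for c in candidates(words[i]):
            e = lp(emit.get(c, {}), words[i])
            score, prev = max((p + lp(trans.get(pc, {}), c) + e, pc)
                              for pc, (p, _) in best[i - 1].items())
            column[c] = (score, prev)
        best.append(column)
    # backtrace from the best final class
    cls = max(best[-1], key=lambda c: best[-1][c][0])
    path = [cls]
    for i in range(len(words) - 1, 0, -1):
        cls = best[i][cls][1]
        path.append(cls)
    return list(reversed(path))
```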

We create two baseline systems from the same corpus, one which uses the most frequent class (MFC) for each entity, and another one which uses a class picked at random from the applicable classes of each entity.

Ultimately, we are interested in labeling unseen data from the same domain with the correct class, so we evaluate separately on the full corpus and the subset of sentences that contain unknown entities (i.e., entities for which no class information was available in the corpus, cf. Section 2.2).

For the latter case, we select all examples containing at least one unknown entity (labeled UNK), resulting in a subset of 41,897 sentences, and repeat the evaluation steps described above. Here, we have to consider a much larger set of possible classes per entity (the 20 overall most frequent classes). The MFC baseline for these cases is the most frequent of the 20 classes for UNK tokens, while the random baseline chooses randomly from that set.

3.2 Generalization

Generalization measures how widely applicable the produced propositions are.

2 Unfortunately, if judged insensible, we cannot infer whether our model used the wrong class despite better options, or whether we simply have not learned the correct label.

Figure 4: Generalization of models on the data sets. Bar values (random, MFC, model): full data set 0.04, 0.12, 0.25; unknown entities 0.01, 0.09, 0.66.

A completely lexical approach, at one extreme, would turn each sentence into a separate proposition, thus achieving a generalization of 0%. At the other extreme, a model that produces only one proposition would generalize extremely well (but would fail to explain the data in any meaningful way). Both are of course not desirable.

We define generalization as

\[
g = 1 - \frac{|\text{propositions}|}{|\text{sentences}|} \tag{2}
\]
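Computed over the decoded corpus (one decoded proposition per sentence), Equation (2) amounts to the following one-liner; the example data is purely illustrative.

```python
# Equation (2): share of sentences not needed to introduce a new distinct proposition.
def generalization(decoded_propositions):
    return 1.0 - len(set(decoded_propositions)) / len(decoded_propositions)

# e.g. three sentences mapping to two distinct propositions -> 1 - 2/3 ≈ 0.33
print(generalization([("quarterback", "throw", "pass"),
                      ("quarterback", "throw", "pass"),
                      ("team", "win", "game")]))
```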

The results in Figure 4 show that our model is capable of abstracting away from the lexical form, achieving a generalization rate of 25% for the full data set. The baseline approaches do significantly worse, since they are unable to detect similarities between lexically different examples, and thus create more propositions. Using a two-tailed t-test, the difference between our model and each baseline is statistically significant at p < .001.

Generalization on the unknown entity data set is even higher (65.84%). The difference between the model and the baselines is again statistically significant at p < .001. MFC always chooses the same class for UNK, regardless of context, and performs much worse. The random baseline chooses between 20 classes per entity instead of 3, and is thus even less general.

3.3 Normalized Entropy

Entropy is used in information theory to measure how predictable data is. 0 means the data is completely predictable. The higher the entropy of our propositions, the less well they explain the data. We are looking for models with low entropy.


Figure 5: Entropy of models on the data sets. Bar values (random, MFC, model): full data set 1.00, 0.99, 0.89; unknown entities 1.00, 0.99, 0.50.

The extreme case of only one proposition has 0 entropy: we know exactly which sentences are produced by the proposition.

Entropy is directly influenced by the number of propositions used by a system.3 In order to compare different models, we thus define normalized entropy as

\[
H_N = \frac{-\sum_{i=1}^{n} P_i \cdot \log P_i}{\log n} \tag{3}
\]

where P_i is the coverage of the proposition, or the percentage of sentences explained by it, and n is the number of distinct propositions.
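Equation (3) over the decoded corpus can be sketched as follows; the coverage values P_i are simply the relative frequencies of the distinct propositions.

```python
# Equation (3): entropy of the proposition distribution, normalized by log n.
import math
from collections import Counter

def normalized_entropy(decoded_propositions):
    counts = Counter(decoded_propositions)
    total = sum(counts.values())
    n = len(counts)
    if n <= 1:
        return 0.0  # a single proposition: the data is completely predictable
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(n)
```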

The entropy of our model on the full data set is relatively high at 0.89, see Figure 5. The best entropy we can hope to achieve given the number of propositions and sentences is actually 0.80 (by concentrating the maximum probability mass in one proposition). The model thus does not perform as badly as the number might suggest. The entropy of our model on unseen data is better, at 0.50 (best possible: 0.41). This might be due to the fact that we considered more classes for UNK than for regular entities.

3.4 Perplexity

Since we assume that propositions are valid sentences in our domain, good propositions should have a higher probability than bad propositions in a language model. We can compute this using perplexity:4

3 Note that how many classes we consider per entity influences how many propositions are produced (cf. Section 2.2), and thus indirectly puts a bound on entropy.

Figure 6: Perplexity of models on the data sets. Bar values (random, MFC, model): full data set 59.52, 57.03, 56.84; unknown entities 57.03, 57.15, 54.92.

\[
\text{perplexity}(\text{data}) = 2^{-\frac{\log P(\text{data})}{n}} \tag{4}
\]

where P(data) is the product of the proposition probabilities, and n is the number of propositions. We use the uni-, bi-, and trigram counts of the GoogleGrams corpus (Brants and Franz, 2006) with simple interpolation to compute the probability of each proposition.
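A sketch of Equation (4) with a stand-in language model follows. The n-gram tables and interpolation weights are placeholders, not the authors' values; the paper scores each proposition with interpolated uni-/bi-/trigram probabilities from the Google Web 1T counts.

```python
# Equation (4) plus a simple interpolated n-gram scorer (placeholder values).
import math

def perplexity(propositions, logprob):
    """2 ** (-log2 P(data) / n), with P(data) the product of proposition probabilities."""
    total = sum(logprob(p) for p in propositions)
    return 2 ** (-total / len(propositions))

def interpolated_logprob(tokens, unigram, bigram, trigram, lambdas=(0.1, 0.3, 0.6)):
    """Per-token interpolation of n-gram probabilities, returned in log base 2."""
    l1, l2, l3 = lambdas
    logp = 0.0
    for i, w in enumerate(tokens):
        p = l1 * unigram.get(w, 1e-12)
        if i >= 1:
            p += l2 * bigram.get((tokens[i - 1], w), 0.0)
        if i >= 2:
            p += l3 * trigram.get((tokens[i - 2], tokens[i - 1], w), 0.0)
        logp += math.log2(p)
    return logp
```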

The results in Figure 6 indicate that the propositions found by the model are preferable to the ones found by the baselines. As would be expected, the sensibility judgements for MFC and model5 (Tables 1 and 2, Section 3.5) are perfectly anti-correlated (correlation coefficient −1) with the perplexity for these systems in each data set. However, due to the small sample size, this should be interpreted cautiously.

3.5 Sensibility and Label Accuracy

In unsupervised training, the model with the best data likelihood does not necessarily produce the best label accuracy. We evaluate label accuracy by presenting subjects with the propositions we obtained from the Viterbi decoding of the corpus, and ask them to rate their sensibility. We compare the different systems by computing sensibility as the percentage of propositions judged sensible for each system. Since the underlying probability distributions are quite different, we weight the sensibility judgement for each proposition by the likelihood of that proposition.

4 Perplexity also quantifies the uncertainty of the resulting propositions, where 0 perplexity means no uncertainty.

5 We did not collect sensibility judgements for the random baseline.


Data set: full     100 most frequent      random             combined
System             agg      maj           agg      maj       agg      maj
baseline           90.16    92.13         69.35    70.57     88.84    90.37
model              94.28    96.55         70.93    70.45     93.06    95.16

Table 1: Percentage of propositions derived from labeling the full data set that were judged sensible

Data set: unknown  100 most frequent      random             combined
System             agg      maj           agg      maj       agg      maj
baseline           51.92    51.51         32.39    28.21     50.39    49.66
model              66.00    69.57         48.14    41.74     64.83    67.76

Table 2: Percentage of propositions derived from labeling unknown entities that were judged sensible

We report results for both aggregate sensibility (using the total number of individual answers), and majority sensibility, where each proposition is scored according to the majority of annotators’ decisions.

The model and baseline propositions for the full data set are both judged highly sensible, achieving accuracies of 96.6% and 92.1% (cf. Table 1). While our model did slightly better, the differences are not statistically significant when using a two-tailed test. The propositions produced by the model from unknown entities are less sensible (67.8%), albeit still significantly above both chance level and the baseline propositions for the same data set (p < 0.01). Only 49.7% of the baseline propositions were judged sensible (cf. Table 2).

3.5.1 Annotation Task

Our model finds 250,169 distinct propositions, the MFC baseline 293,028. We thus have to restrict ourselves to a subset in order to judge their sensibility. For each system, we sample the 100 most frequent propositions and 100 random propositions found for both the full data set and the unknown entities6 and have 10 annotators rate each proposition as sensible or insensible. To identify and omit bad annotators (‘spammers’), we use the method described in Section 3.5.2, and measure inter-annotator agreement as described in Section 3.5.3. The details of this evaluation are given below; the results can be found in Tables 1 and 2.

The 200 propositions from each of the four systems (model and baseline on both the full and unknown data sets) contain 696 distinct propositions. We break these up into 70 batches (Amazon Turk annotation HIT pages) of ten propositions each. For each proposition, we request 10 annotators. Overall, 148 different annotators participated in our annotation. The annotators are asked to state whether each proposition represents a sensible statement about American Football or not. A proposition like “Quarterbacks can throw passes to receivers” should make sense, while “Coaches can intercept teams” does not. To ensure that annotators judge sensibility and not grammaticality, we format each proposition the same way, namely pluralizing the nouns and adding “can” before the verb. In addition, annotators can state whether a proposition sounds odd, seems ungrammatical, is a valid sentence but against the rules (e.g., “Coaches can hit players”), or whether they do not understand it.

6 We omit the random baseline here due to size issues, and because it is not likely to produce any informative comparison.

3.5.2 Spammers

Some (albeit few) annotators on Mechanical Turk try to complete tasks as quickly as possible without paying attention to the actual requirements, introducing noise into the data. We have to identify these spammers before the evaluation. One way is to include tests. Annotators that fail these tests will be excluded. We use a repetition (first and last question are the same), and a truism (annotators answering “no” either do not know about football or just answered randomly). Alternatively, we can assume that good annotators, who are the majority, will exhibit similar behavior to one another, while spammers exhibit a deviant answer pattern. To identify those outliers, we compare each annotator’s agreement to the others and exclude those whose agreement falls more than one standard deviation below the average overall agreement.

We find that both methods produce similar results. The first method requires more careful planning, and the resulting set of annotators still has to be checked for outliers. The second method has the advantage that it requires no additional questions. It includes the risk, though, that one selects a set of bad annotators solely because they agree with one another.
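The second, purely statistical check can be sketched as follows: compute each annotator's average agreement with the others on shared items and drop anyone more than one standard deviation below the mean. The data layout and function name are hypothetical.

```python
# Flag annotators whose agreement with the others is an outlier (> 1 std below mean).
from statistics import mean, stdev

def flag_outlier_annotators(answers):
    """answers[annotator][item] = label; returns the annotators to exclude."""
    annotators = list(answers)
    avg_agreement = {}
    for a in annotators:
        scores = []
        for b in annotators:
            if a == b:
                continue
            shared = set(answers[a]) & set(answers[b])
            if shared:
                scores.append(sum(answers[a][i] == answers[b][i] for i in shared) / len(shared))
        avg_agreement[a] = mean(scores) if scores else 0.0
    mu, sigma = mean(avg_agreement.values()), stdev(avg_agreement.values())
    return [a for a, s in avg_agreement.items() if s < mu - sigma]
```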

3.5.3 Agreement

measure      100 most frequent    random    combined
agreement    0.88                 0.76      0.82
κ            0.45                 0.50      0.48
G-index      0.66                 0.53      0.58

Table 3: Agreement measures for different samples

We use inter-annotator agreement to quantify the reliability of the judgments. Apart from the simple agreement measure, which records how often annotators choose the same value for an item, there are several statistics that qualify this measure by adjusting for other factors. One frequently used measure, Cohen’s κ, has the disadvantage that if there is prevalence of one answer, κ will be low (or even negative), despite high agreement (Feinstein and Cicchetti, 1990). This phenomenon, known as the κ paradox, is a result of the formula’s adjustment for chance agreement. As shown by Gwet (2008), the true level of actual chance agreement is realistically not as high as computed, resulting in the counterintuitive results. We include it for comparative reasons. Another statistic, the G-index (Holley and Guilford, 1964), avoids the paradox. It assumes that expected agreement is a function of the number of choices rather than chance. It uses the same general formula as κ,

\[
\frac{P_a - P_e}{1 - P_e} \tag{5}
\]

where P_a is the actual raw agreement measured, and P_e is the expected agreement. The difference with κ is that P_e for the G-index is defined as P_e = 1/q, where q is the number of available categories, instead of expected chance agreement. Under most conditions, G and κ are equivalent, but in the case of high raw agreement and few categories, G gives a more accurate estimation of the agreement. We thus report raw agreement, κ, and G-index.
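The chance-corrected statistics of Equation (5) reduce to a few lines; this is a simplified sketch of the definitions above, not the exact computation behind Table 3.

```python
# Chance-corrected agreement per Equation (5), with the G-index's choice of P_e = 1/q.
def chance_corrected(p_a, p_e):
    """Equation (5): (P_a - P_e) / (1 - P_e)."""
    return (p_a - p_e) / (1.0 - p_e)

def g_index(p_a, num_categories):
    """G-index: expected agreement is 1/q for q available categories."""
    return chance_corrected(p_a, 1.0 / num_categories)

# e.g. raw agreement 0.82 with two categories (sensible / not sensible):
print(g_index(0.82, 2))  # 0.64 under this simplification
```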

Despite early spammer detection, there are still outliers in the final data, which have to be accounted for when calculating agreement. We take the same approach as in the statistical spammer detection and delete outliers that are more than one standard deviation below the rest of the annotators’ average.

The raw agreement for both samples combined is 0.82, G = 0.58, and κ = 0.48. The numbers show that there is reasonably high agreement on the label accuracy.

4 Related Research

The approach we describe is similar in nature to unsupervised verb argument selection/selectional preferences and semantic role labeling, yet goes beyond it in several ways. For semantic role labeling (Gildea and Jurafsky, 2002; Fleischman et al., 2003), classes have been derived from FrameNet (Baker et al., 1998). For verb argument detection, classes are either semi-manually derived from a repository like WordNet, or from NE taggers (Pardo et al., 2006; Fan et al., 2010). This allows for domain-independent systems, but limits the approach to a fixed set of oftentimes rather inappropriate classes. In contrast, we derive the level of granularity directly from the data.

Pre-tagging the data with NE classes before training comes at a cost. It lumps entities together which can have very different classes (i.e., all people become labeled as PERSON), effectively allowing only one class per entity. Etzioni et al. (2005) resolve the problem with a web-based approach that learns hierarchies of the NE classes in an unsupervised manner. We do not enforce a taxonomy, but include statistical knowledge about the distribution of possible classes over each entity by incorporating a prior distribution P(class, entity). This enables us to generalize from the lexical form without restricting ourselves to one class per entity, which helps to better fit the data. In addition, we can distinguish several classes for each entity, depending on the context (e.g., winner vs. quarterback). Ritter et al. (2010) also use an unsupervised model to derive selectional predicates from unlabeled text. They do not assign classes at all, but group similar predicates and arguments into unlabeled clusters using LDA. Brody (2007) uses a very similar methodology to establish relations between clauses and sentences, by clustering simplified propositions.

Peñas and Hovy (2010) employ syntactic patterns to derive classes from unlabeled data in the context of LbR. They consider a wider range of syntactic structures, but do not include a probabilistic model to label new data.

5 Conclusion

We use an unsupervised model to infer domain-specific classes from a corpus of 1.4m unlabeled sentences, and apply them to learn 250k propositions about American football. Unlike previous approaches, we use automatically extracted classes with a probability distribution over entities to allow for context-sensitive selection of appropriate classes.

We evaluate both the model qualities and the sensibility of the resulting propositions. Several measures show that the model has good explanatory power and generalizes well, significantly outperforming two baseline approaches, especially where the possible classes of an entity can only be inferred from the context.

Human subjects on Amazon’s Mechanical Turk judged up to 96.6% of the propositions for the full data set, and 67.8% for data containing unseen entities, as sensible. Inter-annotator agreement was reasonably high (agreement = 0.82, G = 0.58, κ = 0.48).

The probabilistic model and the extracted propositions can be used to enrich texts and support post-parsing inference for question answering. We are currently applying our method to other domains.

Acknowledgements

We would like to thank David Chiang, Victoria Fossum, Daniel Marcu, and Stephen Tratz, as well as the anonymous ACL reviewers for comments and suggestions to improve the paper. Research supported in part by Air Force Contract FA8750-09-C-0172 under the DARPA Machine Reading Program.

References

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet Project. In Proceedings of the 17th International Conference on Computational Linguistics – Volume 1, pages 86–90. Association for Computational Linguistics, Morristown, NJ, USA.

Thorsten Brants and Alex Franz, editors. 2006. The Google Web 1T 5-gram Corpus Version 1.1. Number LDC2006T13. Linguistic Data Consortium, Philadelphia.

Samuel Brody. 2007. Clustering Clauses for High-Level Relation Detection: An Information-theoretic Approach. In Annual Meeting – Association for Computational Linguistics, volume 45, page 448.

Marie-Catherine De Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In LREC 2006. Citeseer.

Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38.

Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. 2005. Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence, 165(1):91–134.

James Fan, David Ferrucci, David Gondek, and Aditya Kalyanpur. 2010. Prismatic: Inducing knowledge from a large scale lexicalized relation resource. In Proceedings of the NAACL HLT 2010 First International Workshop on Formalisms and Methodology for Learning by Reading, pages 122–127, Los Angeles, California, June. Association for Computational Linguistics.

Alvan R. Feinstein and Domenic V. Cicchetti. 1990. High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43(6):543–549.

Michael Fleischman, Namhee Kwon, and Eduard Hovy. 2003. Maximum entropy models for FrameNet classification. In Proceedings of EMNLP, volume 3.

Daniel Gildea and Dan Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):245–288.

Kilem Li Gwet. 2008. Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61(1):29–48.

Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics – Volume 2, pages 539–545. Association for Computational Linguistics.

Jasper Wilson Holley and Joy Paul Guilford. 1964. A Note on the G-Index of Agreement. Educational and Psychological Measurement, 24(4):749.

Rutu Mulkar-Mehta, James Allen, Jerry Hobbs, Eduard Hovy, Bernardo Magnini, and Christopher Manning, editors. 2010. Proceedings of the NAACL HLT 2010 First International Workshop on Formalisms and Methodology for Learning by Reading. Association for Computational Linguistics, Los Angeles, California, June.

Thiago Pardo, Daniel Marcu, and Maria Nunes. 2006. Unsupervised Learning of Verb Argument Structures. Computational Linguistics and Intelligent Text Processing, pages 59–70.

Anselmo Peñas and Eduard Hovy. 2010. Semantic enrichment of text with background knowledge. In Proceedings of the NAACL HLT 2010 First International Workshop on Formalisms and Methodology for Learning by Reading, pages 15–23, Los Angeles, California, June. Association for Computational Linguistics.

Simone Paolo Ponzetto and Roberto Navigli. 2010. Knowledge-rich Word Sense Disambiguation rivaling supervised systems. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1522–1531. Association for Computational Linguistics.

Alan Ritter, Mausam, and Oren Etzioni. 2010. A latent dirichlet allocation method for selectional preferences. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 424–434, Uppsala, Sweden, July. Association for Computational Linguistics.

Evan Sandhaus, editor. 2008. The New York Times Annotated Corpus. Number LDC2008T19. Linguistic Data Consortium, Philadelphia.

Rion Snow, Brendan O’Connor, Dan Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 254–263. Association for Computational Linguistics.

Stephanie Strassel, Dan Adams, Henry Goldberg, Jonathan Herr, Ron Keesing, Daniel Oblinger, Heather Simpson, Robert Schrag, and Jonathan Wright. 2010. The DARPA Machine Reading Program – Encouraging Linguistic and Reasoning Research with a Series of Reading Tasks. In Proceedings of LREC 2010.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages 697–706. ACM.