Using the Web to Validate Lexico-Semantic Relations

Hernani Pereira Costa, Hugo Gonçalo Oliveira, and Paulo Gomes

Cognitive and Media Systems Group, CISUC, University of Coimbra, Portugal

{hpcosta,hroliv,pgomes}@dei.uc.pt

Abstract. The evaluation of semantic relations acquired automatically from text is a challenging task, which generally ends up being done by humans. Though less prone to errors, manual evaluation is hardly repeatable, time-consuming and sometimes subjective. In this paper, we evaluate relational triples automatically, exploiting popular similarity measures on the Web. After using these measures to quantify triples according to the co-occurrence of their arguments and textual patterns denoting their relation, some scores proved to be highly correlated with the correction rate of the triples. The measures were also used to select the correct triples in a set, with best F1 scores around 96%.

1 Introduction

During the last decades, there have been several attempts to discover knowledge automatically from text (see [16] [6] [12] [20] [9] [22]). Regardless of the kind of knowledge, information extraction (IE) systems generally acquire entities (e.g. e1, e2) and relations between them, represented as triples (e1, r, e2), where r identifies the type of relation. For instance, considering named entities such as people, places or organisations, born-in or headquarters-of are possible kinds of relations. As for lexico-semantic knowledge, hyponymy and part-of are typical types of relations held by word meanings.

Knowledge discovered automatically is useful to create or to enrich existing ontologies, such as WordNet [13] or Cyc [17]; however, its evaluation is a challenging task, especially when dealing with broad-coverage open-domain knowledge. Despite the existence of at least four distinct approaches for evaluating (domain) ontologies (see [5]) – manual, gold standard, comparison with a source of data, and indirect – most evaluations end up needing an excess of human intervention. The comparison with a collection of documents evaluates only the coverage of one or several domains and hardly applies to broad-coverage knowledge, while indirect evaluation does not depend exclusively on the quality of the knowledge but also on the way it is used. Also, most of the time there are no gold standards available for the needed evaluation, thus requiring the manual creation of one, which is an expensive task. Furthermore, once again, when it comes to broad-coverage knowledge it is more difficult to create gold standards, due to the huge quantity of knowledge these should contain, as well as

L. Antunes and H.S. Pinto (Eds.): EPIA 2011, LNAI 7026, pp. 597–609, 2011. © Springer-Verlag Berlin Heidelberg 2011


other structural issues. So, even though it is time-consuming, hardly repeatable and sometimes subjective, the manual evaluation of a representative set of the extracted knowledge is the most common choice (as in [6] [20] [22]).

An automatic alternative to the aforementioned evaluation approaches is to search, in large collections of text, for support for the knowledge to be evaluated. Since a semantic relation can be denoted by several textual patterns (e.g. "is a" in "car is a vehicle", for hyponymy, or "of the" in "the wheel of the car", for part-of), the quality of a relational triple can be extrapolated from the frequency of its entities connected by one or more patterns denoting its relation. Still, since some entities are more frequent than others and because natural language is ambiguous, we should not rely only on the latter frequency, and more data should be combined to validate relational triples.

The goal of this paper is to ascertain how well suited similarity measures based on the distribution of words on the Web are for evaluating relational triples, in a completely automatic fashion. After collecting a small set of patterns denoting the relations to validate, the similarity measures are adapted – instead of looking for occurrences of the entities alone, the measures are used to search for these entities preceded or followed by the patterns. Our assumption is that the scores given by the measures can be used to filter out incorrect or less probable triples.

Therefore, in order to verify how the similarity measures could be exploited, we have used WordNet to collect sets of correct and incorrect hyponymy and part-of triples and to calculate how their correctness correlates with the scores given by the measures. We then used the latter sets as a gold standard and selected the correct triples based on the same scores. Not only were some correlation coefficients very high, but some measures were capable of selecting the correct triples with F1 scores around 96%. This is very promising and establishes these measures as a new automatic approach for evaluating semantic relations.

In the rest of the paper, distributional similarity measures are introduced, our experiments are described, and their results presented. Before concluding, we refer to some related work.

2 Web-based Similarity Measures

Some of the most popular methods for computing the semantic similarity of words involve mathematical models based on the distribution of words in large corpora, or on the World Wide Web. The latter infrastructure is very attractive because it is a huge and heterogeneous source of knowledge, probably the largest available. Furthermore, search engines are efficient interfaces to interact with the contents of the Web. This provides easy access to information on the frequency and distribution of words, which can thus be used to infer similarities.

In this section, we present five common Web-based similarity measures. Some of them are simple adaptations of popular co-occurrence measures. In their expressions, we use q to denote a query and P(q) to denote the number of pages returned by a search engine (hereafter, page counts) for q. So, P(e1 ∩ e2) represents the page counts for the query consisting of the entities e1 and e2, more precisely "e1 AND e2".


The WebJaccard measure, in expression 1, is an adaptation of the Jaccard coefficient, given by the number of documents in which e1 and e2 co-occur, divided by the number of documents where each one occurs. The WebOverlap (expression 2) and WebDice (expression 3) measures are two variations of WebJaccard, respectively for measuring the overlap and the mean overlap of two sets. More precisely, the Overlap minimises the effect of comparing two objects of different sizes, so the number of co-occurrences is divided by the lowest number of page counts, min(P(e1), P(e2)).

WebJaccard(e1, e2) = P(e1 ∩ e2) / (P(e1) + P(e2) − P(e1 ∩ e2))   (1)

WebOverlap(e1, e2) = P(e1 ∩ e2) / min(P(e1), P(e2))   (2)

WebDice(e1, e2) = (2 · P(e1 ∩ e2)) / (P(e1) + P(e2))   (3)
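As an illustration, expressions (1)–(3) can be sketched in Python. This is a minimal sketch, not the authors' implementation: the page counts are passed in directly (obtaining them from a search engine is out of scope here), and the guards against zero denominators are our own assumption, since the paper does not specify that edge case.

```python
def web_jaccard(p1: float, p2: float, p12: float) -> float:
    """Expression (1): co-occurrences divided by the size of the union."""
    denom = p1 + p2 - p12
    return p12 / denom if denom > 0 else 0.0

def web_overlap(p1: float, p2: float, p12: float) -> float:
    """Expression (2): co-occurrences divided by the smaller page count."""
    denom = min(p1, p2)
    return p12 / denom if denom > 0 else 0.0

def web_dice(p1: float, p2: float, p12: float) -> float:
    """Expression (3): twice the co-occurrences over the sum of page counts."""
    denom = p1 + p2
    return 2 * p12 / denom if denom > 0 else 0.0

# Made-up page counts: P(e1)=1000, P(e2)=500, P(e1 ∩ e2)=100
print(web_jaccard(1000, 500, 100))  # 100 / 1400
print(web_overlap(1000, 500, 100))  # 100 / 500 = 0.2
print(web_dice(1000, 500, 100))     # 200 / 1500
```

Note that WebOverlap is the only one of the three that is insensitive to the size of the larger set.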

The WebPMI measure (expression 4) stands for Pointwise Mutual Information (PMI) and quantifies the statistical dependence between two entities [21]. In its expression, N is the total number of pages indexed by the search engine, which, for the Google search engine, can be roughly estimated at 10¹⁰ [4] [3]. If entities e1 and e2 are statistically independent, the probability that they co-occur is given by P(e1) · P(e2). On the other hand, if they tend to co-occur, P(e1 ∩ e2) will be higher than P(e1) · P(e2), and the PMI will thus be greater.

WebPMI(e1, e2) = log2( (P(e1 ∩ e2) · N) / (P(e1) · P(e2)) )   (4)

The Normalised Web Distance (NWD, expression 5) [7] is an approximation of the Normalised Information Distance [1] and measures the distance between two entities, based on their co-occurrences on the Web. Therefore, if the entities always co-occur, they are very similar and NWD is 0. On the other hand, although NWD ranges most of the time between 0 and 1, if the entities never co-occur, NWD is +∞.

If we invert NWD and bound it to the [0, 1] range [14], we can measure the similarity of two entities, in a measure that we will, from now on, call Normalised Web Similarity (NWS, expression 6).

NWD(e1, e2) = (max(log P(e1), log P(e2)) − log P(e1 ∩ e2)) / (log N − min(log P(e1), log P(e2)))   (5)

NWS(e1, e2) = e^(−2 · NWD(e1, e2))   (6)
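Expressions (4)–(6) can be sketched in the same style. Again a sketch under assumptions: page counts are given, N is the rough 10¹⁰ estimate cited above, and returning 0.0 from WebPMI when any count is zero is our own convention (PMI is undefined there).

```python
import math

N = 1e10  # rough number of pages indexed by the search engine, as cited in [4] [3]

def web_pmi(p1: float, p2: float, p12: float) -> float:
    """Expression (4): log2 of observed vs. chance co-occurrence."""
    if p12 == 0 or p1 == 0 or p2 == 0:
        return 0.0  # assumption: treat undefined PMI as no association
    return math.log2((p12 * N) / (p1 * p2))

def nwd(p1: float, p2: float, p12: float) -> float:
    """Expression (5): 0 when the entities always co-occur, +inf when never."""
    if p12 == 0:
        return float("inf")
    num = max(math.log(p1), math.log(p2)) - math.log(p12)
    den = math.log(N) - min(math.log(p1), math.log(p2))
    return num / den

def nws(p1: float, p2: float, p12: float) -> float:
    """Expression (6): NWD inverted and bounded to [0, 1]."""
    return math.exp(-2 * nwd(p1, p2, p12))

# If e1 and e2 always co-occur (p1 = p2 = p12), NWD is 0 and NWS is 1.
print(nwd(1000, 1000, 1000))  # 0.0
print(nws(1000, 1000, 1000))  # 1.0
```

When the entities never co-occur, `nwd` returns +∞ and `nws` correctly decays to 0, matching the behaviour described in the text.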


3 Experimentation

Our experimentation was performed to analyse how well the Web distributional measures presented in section 2 suit the task of validating semantic relations. Although these measures quantify the semantic similarity between two entities alone, they were adapted to quantify the similarity of the entities attached to textual patterns denoting semantic relations. In the first experiment, the correlation between the measures and the correction rate of the triples is calculated. The second is an information retrieval task, where the measures are used to identify the correct triples in a set. Looking at the results, we believe that some of the measures can be used in future evaluations of semantic relations and, eventually, replace manual evaluation.

3.1 Set-up

All the measures presented in section 2 were implemented. However, having in mind the validation of semantic relations, we followed previous hints [19] and adapted the measures to quantify the similarity between entities connected by patterns expressing semantic relations. Therefore, for validating the triple t = (e1, r, e2), a pattern πri, indicative of relation r, is selected. The expressions of the measures are changed according to the following:

– P(e1) = page counts for query: "e1 πri";
– P(e2) = page counts for query: "πri e2";
– P(e1 ∩ e2) = page counts for query: "e1 πri e2".

For instance, if e1={planet}, e2={Mars} and πri={such as}, we would have P(e1)={planet such as}, P(e2)={such as Mars}, and P(e1 ∩ e2)={planet such as Mars}. To this end, a set of indicative patterns for the relations we were validating, hyponymy and part-of, was created (tables 2 and 3). To increase the coverage of the patterns, we used not only those conveying the direct relation (e.g. e1 and other e2, for hyponymy), but also the indirect one (e.g. e2 such as e1, for hypernymy).
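The substitutions above amount to building three quoted phrase queries per triple and pattern. A minimal sketch, where the function name `pattern_queries` is our own:

```python
def pattern_queries(e1: str, pattern: str, e2: str) -> dict:
    """Build the three adapted phrase queries for a triple and one pattern."""
    return {
        "P(e1)": f'"{e1} {pattern}"',
        "P(e2)": f'"{pattern} {e2}"',
        "P(e1 ∩ e2)": f'"{e1} {pattern} {e2}"',
    }

queries = pattern_queries("planet", "such as", "Mars")
print(queries["P(e1 ∩ e2)"])  # "planet such as Mars"
```

Each query would then be submitted to the search engine and its page count fed into the measure expressions of section 2.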

As semantic relations can be expressed by several different textual patterns, we could not select one best pattern. So, we decided to use two sets, Πh and Πp, consisting of the most frequent hyponymy and part-of patterns. Furthermore, as the measures accept only one pattern at a time, we computed the final scores by four distinct methods. Considering that Πr has all the patterns for relation r, we sort a list, Sm : |Sm| = |Πr|, containing the scores given by a measure m with each pattern πri ∈ Πr, such that the best score is in Sm1. The final score is then given by:

– a baseline consisting of simple co-occurrence, without including the patterns in the expressions (NP);
– the score of the best pattern (B), Sm1;
– the average of the scores given by the two best patterns (2B), (Sm1 + Sm2)/2;
– the average of the scores given by all patterns (Av), (Σi Smi)/|Sm|.
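The three pattern-based aggregations can be sketched as follows; `scores` stands for the list Sm, already sorted so that the best score comes first, and the function names are our own.

```python
def best(scores):
    """B: the score of the best pattern, S_m1."""
    return scores[0]

def two_best(scores):
    """2B: the average of the scores of the two best patterns."""
    return sum(scores[:2]) / 2

def average(scores):
    """Av: the average of the scores given by all patterns."""
    return sum(scores) / len(scores)

s = sorted([0.9, 0.5, 0.1], reverse=True)  # hypothetical scores for 3 patterns
print(best(s), two_best(s), average(s))
```

The NP baseline needs no aggregation, since it queries the two entities without any pattern.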


3.2 Datasets

We have used WordNet 2.0¹ to collect sets of hyponymy and part-of triples. In order to reduce noise due to ambiguities, we took advantage of the organisation of WordNet, which has the synsets ordered by the most frequent senses of their words, and created the sets in the following way:

1. We selected all the relation instances between synsets which denote the first sense of their most frequent word.

2. For each of the latter, we defined relational triples held by the first word in the connected synsets. For example, the instance {corporation.1, corp.1} hyponym-of {firm.1, house.2, business firm.1} originates the triple {corporation, hyponym-of, firm}.

3. To create the final sets, we ranked the triples according to the frequency of their arguments in the Google web search engine² and selected the first 1,100 hyponymy triples and 1,100 part-of triples, respectively H and P.

Sets H and P contain only correct triples, but, given the need for incorrect triples, we created a third set, I, with 1,010 random pairs of words, which we made sure were related neither by hyponymy nor by part-of. In table 1, we present examples of triples in the datasets and their classification.
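Step 3 above ranks candidate triples by the frequency of their arguments, using score(t) = log(P(e1) + P(e2)) (footnote 2). A sketch with made-up page counts; the triple tuples and helper name are our own:

```python
import math

def frequency_score(p_e1: float, p_e2: float) -> float:
    """score(t) = log(P(e1) + P(e2)), the argument-frequency ranking score."""
    return math.log(p_e1 + p_e2)

# (e1, relation, e2, P(e1), P(e2)) with hypothetical page counts
triples = [
    ("corporation", "hyponym-of", "firm", 5_000_000, 2_000_000),
    ("boxer", "hyponym-of", "dog", 300_000, 4_000_000),
]
ranked = sorted(triples, key=lambda t: frequency_score(t[3], t[4]), reverse=True)
print([t[0] for t in ranked])  # most frequent arguments first
```

The first 1,100 triples of each ranked list would then form H and P.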

Table 1. Examples of triples in the datasets

Classification        Examples
Correct (C)           fight hyponym-of conflict;    hour part-of day
Incorrect (I)         towel hyponym-of engineer;    ibuprofen part-of light
Wrong Relation (WR)   eye hyponym-of face;          hometown part-of town

3.3 Preliminary Analysis

Before applying the measures, we analysed the page counts for the entities connected by the patterns. Both sets of patterns, Πh and Πp, for hyponymy and part-of respectively, were used with the sets of triples H, P and I.

Given that IE systems may extract triples held by related entities but fail to identify the relation, our correct triples were also searched with patterns for other relations, making it possible to compare the page counts of this kind of triple with those of completely correct and completely incorrect triples. Table 2 presents the average (Av) and the standard deviation (SD) of the page counts of the patterns in Πh connecting the entities in H (Correct), P (WrongRel) and I (Incorrect). Similarly, table 3 presents the same statistics, this time for the page counts of the patterns in Πp connecting the entities in P (Correct), H (WrongRel) and I (Incorrect).

In both tables (2 and 3), page counts tend to be higher for correct triples, a little lower for the ones with a wrong relation, and even lower or 0 for incorrect

¹ Available through http://wordnet.princeton.edu
² For t = (e1, r, e2), score(t) = log(P(e1) + P(e2)).


Table 2. Page counts for hyponymy patterns

                                            Correct           WrongRel          Incorrect
R          Textual Pattern (πh)             Av      SD        Av      SD        Av       SD
Y hypo X   is a|an|one|the kind of          0.46    73.38     0.01    5.46      9.90E−4  0.99
           is a|an|one|the                  274.7   44560.8   8.09    1007.66   0.53     175.74
           is a|an|one|the variety of       0.01    5.28      0.0     0.0       0.0      0.0
           is a|an|one|the type of          0.77    191.2     0.07    23.87     0.0      0.0
           is a|an|one|the form of          0.99    510.96    4.5E−3  3.31      0.0      0.0
           and|or other                     66.9    15512.9   15.15   2748.7    0.36     23.34
X hyper Y  such as                          27.48   6832.4    18.15   2620.2    0.16     6.3
           like                             42.60   6486.2    14.29   3264.8    0.02     11.41
           including                        26.47   7307.9    81.63   10414.3   9.90E−4  8.74
           especially                       2.79    570.8     21.1    4147.7    0.03     11.41

Table 3. Page counts for part-of patterns

                                                     Correct            WrongRel           Incorrect
R         Textual Pattern (πp)                       Av       SD        Av       SD        Av       SD
Y part X  of                                         208.28   49053.4   313.58   65352.7   26.13    15971.7
          of a|an|one|the                            546.28   119150.7  319.91   73777.7   3.99     1390.4
          from a|an|one|the                          43.07    18873.5   71.42    43370.6   0.21     100.90
          in                                         646.38   43269.8   152.86   45098.7   9.12     6785.4
          is part of                                 2.08     418.33    0.11     47.64     2.97E−3  2.23
          is member of                               2.72E−3  1.73      2.73E−3  2.99      0.0      0.0
          part of a|an|one|the                       1.45     251.7     1.72     161.95    0.13     120.98
          member of a|an|one|the                     0.27     122.73    0.89     17.12     4.95E−3  2.64
          is a|one|the part of                       0.99     290.33    0.06     19.94     1.98E−3  1.99
          is a|an|one|the member of                  1.33     1439.3    0.01     7.19      0.0      0.0
          is a|an|one|the part of a|one|the          0.91     218.01    0.08     13.92     1.98E−3  1.99
          is a|an|one|the member of a|one|the        0.42     301.84    0.12     64.59     0.0      0.0
X has Y   's                                         550.9    188279.1  243.92   159845.1  5.23     3233.0
          has a|an|one|the                           9.17     1809.0    9.48     2577.7    0.25     177.89
          contains a|an|one|the                      1.12     309.75    2.67     871.14    9.90E−4  4.12
          consists of                                0.61     111.26    2.36     1228.2    6.93E−3  0.99
          is made of                                 0.02     8.74      0.11     68.13     0.0      0.0

triples. There are, however, exceptions, especially for ambiguous patterns. For instance, the patterns "including" and "especially" were used as hypernymy indicators, but they can sometimes denote the part-of relation. Furthermore, for hyponymy, 26 correct triples, 135 triples with the wrong relation and 947 incorrect triples have no page counts with any pattern, as do 17 correct, 241 wrong-relation, and 859 incorrect part-of triples. Correct triples with no page counts are acceptable because we only selected a set of the most frequent patterns, such as the ones proposed by Hearst [16] for hyponymy. Yet, these relations can be expressed in many other ways.

3.4 Correlation Analysis

The parallelism between the correctness of the triples and the scores given by the similarity measures is quantified by Spearman's coefficient, ρ : −1 ≤ ρ ≤ 1 (expression 7), between these two variables.


Table 4. Correlations between the correctness of the triples and the similarity measures

Relation                      nHits     Jaccard  Overlap  Dice    PMI     NWS     hasHits
Hyponymy (C + I)       NP     0.11      0.14     0.15     0.14    0.13    0.63    0.77
                       B      0.16      0.17     0.34     0.18    0.93    0.87    0.92
                       2B     0.18      0.19     0.36     0.20    0.92    0.86    0.86
                       Av     0.18      0.20     0.35     0.22    0.78    0.72    -
Hyponymy (C + I + WR)  NP     −5.3E−3   −0.11    −0.11    −0.11   −0.17   −0.14   0.39
                       B      0.04      0.16     0.29     0.17    0.76    0.74    0.76
                       2B     0.04      0.17     0.32     0.19    0.73    0.74    0.69
                       Av     0.07      0.20     0.34     0.21    0.69    0.67    -
Part-of (C + I)        NP     0.19      0.22     0.35     0.23    0.29    0.71    0.76
                       B      0.16      0.18     0.21     0.23    0.89    0.85    0.85
                       2B     0.17      0.19     0.34     0.23    0.90    0.86    0.88
                       Av     0.18      0.21     0.26     0.25    0.78    0.72    -
Part-of (C + I + WR)   NP     0.13      0.24     0.33     0.25    0.33    0.65    0.39
                       B      0.16      0.17     0.21     0.20    0.82    0.69    0.72
                       2B     0.17      0.17     0.24     0.20    0.82    0.68    0.72
                       Av     0.18      0.15     0.25     0.16    0.57    0.42    -

ρ(m, x) = (Σi (mi − m̄)(xi − x̄)) / √(Σi (mi − m̄)² · Σi (xi − x̄)²)   (7)

Our references, x, consisted of arrays with 1, 0.5 and 0, respectively for correct triples, triples with a wrong relation and incorrect triples. Table 4 shows the correlations between x and the scores given by the measures, m, calculated by the four methods described in section 3.1, namely a baseline using no patterns (NP), using only the best pattern (B), the average of the two best patterns (2B), and the average of all patterns (Av). Besides the similarity measures, we have used two simpler measures – one considers the number of page counts (nHits); the other marks a triple as correct if there is at least one page count for P(e1 ∩ e2), using no pattern (NP), one pattern (B), or two patterns (2B) (hasHits). The results are shown for both studied relations, hyponymy and part-of, and, for each of them, we have calculated the coefficient using just the correct and incorrect triples (C + I), and then adding the triples with a wrong relation (C + I + WR).
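Spearman's ρ is the Pearson correlation of expression (7) computed over the ranks of the two variables (with average ranks for ties). A small self-contained sketch, not the paper's code:

```python
def ranks(values):
    """Rank the values ascending, assigning the average rank to tied blocks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(m, x):
    """Expression (7) applied to the ranks of m and x."""
    rm, rx = ranks(m), ranks(x)
    mean_m, mean_x = sum(rm) / len(rm), sum(rx) / len(rx)
    num = sum((a - mean_m) * (b - mean_x) for a, b in zip(rm, rx))
    den = (sum((a - mean_m) ** 2 for a in rm)
           * sum((b - mean_x) ** 2 for b in rx)) ** 0.5
    return num / den if den else 0.0

# Scores that decrease exactly with the 1 / 0.5 / 0 reference give ρ = 1.
print(spearman([0.9, 0.4, 0.1], [1, 0.5, 0]))  # 1.0
```

In practice, a library routine such as `scipy.stats.spearmanr` would serve the same purpose.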

All the measures, except for the baseline method with hyponymy, are positively correlated with the correctness of the triples. Still, only some are highly correlated. For both relations, WebPMI is the measure most correlated with the correctness of the triples. Other highly correlated measures are NWS and hasHits. It should be remarked that a measure as simple as hasHits outperforms all the measures except WebPMI. Also worth noticing is that, for the most highly correlated measures, using the average of the scores with all the patterns generally leads to lower correlations. Furthermore, while for hyponymy it seems to be enough to use only the best pattern, for part-of the correlation is higher when using the two best patterns.

As expected, all correlations drop when the triples with wrong relations (WR) are included. Still, the WebPMI and hasHits correlation coefficients are noteworthy.


3.5 Identification of Correct Triples

In the second experiment, we performed an information retrieval task, where the measures were used to filter out incorrect triples automatically from our dataset. According to the score of each measure, we tested several cut points (θ), used to select the triples with a score higher than θ. Then, to measure the quality and the quantity of the selected triples, we computed precision, recall and F1 in the following manner:

Precision = Selected correct triples / Selected triples

Recall = Selected correct triples / Total correct triples

F1 = (2 · Precision · Recall) / (Precision + Recall)
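The cut-point sweep can be sketched as follows; the scores, gold labels and candidate θ values are hypothetical, and the zero-denominator conventions are our own.

```python
def prf1(scored: dict, gold: dict, theta: float):
    """Precision, recall and F1 of the triples whose score exceeds theta."""
    selected = [t for t, s in scored.items() if s > theta]
    correct_selected = sum(1 for t in selected if gold[t])
    total_correct = sum(gold.values())
    p = correct_selected / len(selected) if selected else 0.0
    r = correct_selected / total_correct if total_correct else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

scored = {"t1": 0.9, "t2": 0.7, "t3": 0.2}    # hypothetical measure scores
gold = {"t1": True, "t2": True, "t3": False}   # correct / incorrect triples
best_theta, best_f1 = max(
    ((theta, prf1(scored, gold, theta)[2]) for theta in (0.0, 0.1, 0.5, 0.8)),
    key=lambda pair: pair[1],
)
print(best_theta, best_f1)
```

For each measure, the θ maximising F1 is the one reported in table 5.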

Table 5. Best F1 measures and respective θ

               Jaccard       Overlap       Dice          PMI         NWS          hasHits
R              F1    θ       F1    θ       F1    θ       F1    θ     F1    θ      F1
Hypo    NP     70.5  1E−4    68.5  1E−4    77.5  2E−4    68.5  0     68.5  0      89.0
        B      94.9  2E−4    68.5  0       95.4  2E−4    96.2  26    96.1  0.15   96.0
        2B     94.1  2E−4    68.5  0       95.1  2E−4    96.3  16    96.0  0.05   92.6
        Av     86.3  2E−4    68.5  0       90.9  2E−4    96.2  3     91.5  0.05   -
Part    NP     91.5  2E−4    68.5  1E−4    93.7  2E−4    68.5  0     91.6  0.05   88.9
        B      93.9  2E−4    80.7  2E−4    94.0  2E−4    94.3  32    94.7  0.25   92.8
        2B     93.8  2E−4    75.6  0.05    93.8  2E−4    94.7  33    94.9  0.2    94.1
        Av     86.7  2E−4    68.5  0       90.3  2E−4    94.5  4     87.1  0.05   -

Including triples with wrong relation (WR)

Hypo    NP     51.0  1E−4    51.0  1E−4    51.0  1E−4    51.0  0     51.0  0      61.8
        B      75.3  2E−4    54.7  2E−4    75.3  3E−4    69.9  28    75.2  0.25   69.5
        2B     74.8  2E−4    51.1  0       74.9  5E−4    69.9  16    74.1  0.25   68.2
        Av     72.7  2E−4    51.1  0       75.1  2E−4    71.9  4     74.8  0.05   -
Part    NP     70.6  2E−4    51.0  1E−4    70.7  4E−4    62.1  1     72.5  0.05   61.5
        B      65.5  2E−4    62.1  2E−4    65.2  2E−4    86.9  42    62.3  0.2    65.7
        2B     65.4  2E−4    59.4  0.05    65.6  2E−4    85.5  41    67.5  0.2    68.8
        Av     59.7  2E−4    51.0  0       62.3  2E−4    68.6  4     60.7  0.05   -

For each measure, table 5 presents the best F1 scores and the respective θ. These values are given, first, using only the correct and incorrect triples (C + I), and then adding the triples with the wrong relations (C + I + WR), which, for this task, were considered incorrect. This way, we compare how the similarity measures behave ideally and in a more realistic scenario, where extracted triples sometimes fail only in the identification of the type of the relation.

Even though some of the measures had low correlations in the previous experiment, most of them achieve high F1 and significantly outperform the baseline. The best F1 without the wrong-relation triples is achieved by WebPMI using the two best patterns, with θ=16 for hyponymy and θ=33 for part-of. When the triples with the wrong relation are added, WebPMI is still the best for part-of triples: using only the best pattern, with θ=42, it achieves 86.9% F1. WebPMI is, however, worse for hyponymy, where WebJaccard, WebDice and NWS outperform it, in this order.


[Figure: precision, recall and F1 curves as a function of θ; plot not recoverable from the transcript]

Fig. 1. Identification of the correct hyponymy triples from a set including correct and incorrect triples, with WebPMI using the two best patterns

In the latter experimentation scenario, the F1 scores are lower, around 75%. This is still good, considering that, sometimes, if we change the type of a relation from hyponymy to part-of, we still get acceptable relations, as in the following situations:

– (economy hyponym-of system) → (economy part-of system);
– (computer hyponym-of machine) → (computer part-of machine).

Figures 1 and 2 are examples of how precision, recall and F1 change with θ, in two situations where WebPMI is used in the identification of correct triples.

4 Related Work

Classical IE systems rely on textual patterns that frequently denote semantic relations (see [16]). However, some trade-off is often needed in the selection of patterns because, on the one hand, some of them occur rarely and, on the other hand, the most frequent are usually ambiguous. In order to increase recall and to minimise the effort needed to encode the patterns, state-of-the-art IE systems typically have a (weakly) supervised pattern learning component (e.g. [20] [22]), which, nevertheless, is prone to extract more noise.

Therefore, some mathematical models have been proposed to filter out incorrect triples [6] or to estimate the reliability of learned patterns [20]. This leads directly to higher precision and, eventually, by allowing more ambiguous patterns, to higher recall. Having in mind the distributional hypothesis [15], these filters are generally based on distributional similarity measures, which quantify the similarity of words according to their distribution in large corpora.


[Figure: precision, recall and F1 curves as a function of θ; plot not recoverable from the transcript]

Fig. 2. Identification of the correct part-of triples from a set including correct, incorrect and triples with a wrong relation, with WebPMI using the best pattern

In the last decade, the Web became an attractive target for the extraction of huge quantities of knowledge (e.g. in [12] [9] [2] [22]). It started as well to be seen as an interesting infrastructure for quantifying and validating knowledge extracted automatically, not only because of its size, variety of subjects and redundancy, but also because web search engines provide an efficient interface.

Distributional similarity measures were adapted for the Web and were used, for instance, to rank semantic relations [9] [11]. They have also been exploited as features and combined with lexical patterns indicating synonymy in a robust metric which was claimed [4] to outperform all web-based similarity metrics.

PMI-IR [21] is a popular measure for searching the Web for pairs of similar words. Variations of the PMI have been used to reduce the noise in information extracted from the Web [12] [2]. However, in the latter works, the similarity measures were adapted and, instead of searching for the entities alone, they were used to measure the similarity between the entities and indicative patterns. The obtained scores can thus be exploited to assess the likelihood of the triples or the quality of the patterns. For instance, Etzioni et al. [12] compute the correction likelihood of hyponymy triples held between classes and named entities. They rely on the ratio between the hits of the named entity in a hyponymy pattern connecting it to the class (e.g. Liege is a city) and the hits of the entity alone.

A very interesting work on this topic [11] defines a probabilistic model for evaluating the impact of redundancy, sample size and different extraction rules on the correctness of extracted information. This model is claimed to outperform


models based on PMI, but it was evaluated on four relations (Corporations, Countries, CEO of a company, and Capital of a Country) which are simpler and less ambiguous than lexico-semantic relations. For instance, all the former relations either have a static argument (X is-a country, Y is-a corporation) or are a one-to-one correspondence (a country has only one capital and a company one CEO). On the other hand, lexico-semantic relations such as hypernymy always have variable arguments (e.g. most lexical entities have hyponyms or hypernyms, and a lexical entity may have several hyponyms). Furthermore, hypernymy can hold between entities on different levels of a hierarchy (e.g. animal hypernym-of mammal, animal hypernym-of dog, animal hypernym-of boxer, mammal hypernym-of dog, dog hypernym-of boxer, etc.).

More than assigning probabilities, the similarity measures can take advantage of the redundancy of the Web to validate knowledge, including not only semantic triples but also question-answer pairs [18]. In order to verify how well automatic web-based validation performs, it has been put side by side with the manual evaluation of triples obtained after analysing the sequence of search engine queries in the same session [10].

Besides validation, web-based similarity measures have been used for other tasks, such as suggesting hyponymy relations between named entities and the concepts of an ontology [8] or identifying aliases of named entities [3].

5 Concluding Remarks

We have conducted several experiments to confirm whether several web-based distributional similarity measures are well suited to validate lexico-semantic relations. These measures were applied to sets of correct and incorrect triples from WordNet. First, we confirmed that the scores given by some measures are highly correlated with the correctness of the triples. Then, we performed an information retrieval task consisting of the identification of correct triples, based on the scores of the measures. All the measures had high F1 scores, some of them higher than 96% for hyponymy and higher than 94% for part-of.

These results are promising and we believe that the best performing measures can be used as an alternative to the manual evaluation of relational triples extracted automatically from textual resources. Even though our experiments were performed for English hyponymy and part-of triples, we intend to make our framework available for the validation of other kinds of relations (e.g. causation-of, purpose-of, headquarters-of, founded-by) and, eventually, in other languages.

Acknowledgements. Hernani Pereira Costa is supported by the FCT scholarship grant BII/FCTUC/C2008/CISUC/2ndPhase. Hugo Gonçalo Oliveira is supported by the FCT scholarship grant SFRH/BD/44955/2008, co-funded by FSE.


References

1. Bennett, C.H., Gács, P., Li, M., Vitányi, P.M.B., Zurek, W.H.: Information Distance. IEEE Transactions on Information Theory 44, 1407–1423 (1998)

2. Blohm, S., Cimiano, P., Stemle, E.: Harvesting relations from the web: quantifying the impact of filtering functions. In: Proc. 22nd National Conf. on Artificial Intelligence, pp. 1316–1321. AAAI (2007)

3. Bollegala, D., Honma, T., Matsuo, Y., Ishizuka, M.: Mining for personal name aliases on the web. In: Proc. 17th International Conf. on the World Wide Web, pp. 1107–1108. ACM (2008)

4. Bollegala, D., Matsuo, Y., Ishizuka, M.: Measuring semantic similarity between words using web search engines. In: Proc. 16th International Conf. on the World Wide Web, pp. 757–766. ACM, New York (2007)

5. Brank, J., Grobelnik, M., Mladenic, D.: A survey of ontology evaluation techniques. In: Proc. Conf. on Data Mining and Data Warehouses, SIKDD (2005)

6. Cederberg, S., Widdows, D.: Using LSA and Noun Coordination Information to Improve the Precision and Recall of Automatic Hyponymy Extraction. In: Proc. Conf. on Computational Natural Language Learning, pp. 111–118 (2003)

7. Cilibrasi, R., Vitanyi, P.M.B.: Normalized Web Distance and Word Similarity. Computing Research Repository, ArXiv e-prints (2009)

8. Cimiano, P., Staab, S.: Learning by googling. SIGKDD Explorations Newsletter 6(2), 24–33 (2004)

9. Cimiano, P., Wenderoth, J.: Automatic Acquisition of Ranked Qualia Structures from the Web. In: Proc. 45th Annual Meeting of the Association of Computational Linguistics, pp. 888–895. ACL, Prague (2007)

10. Costa, R.P., Seco, N.: Hyponymy extraction and web search behavior analysis based on query reformulation. In: Geffner, H., Prada, R., Machado Alexandre, I., David, N. (eds.) IBERAMIA 2008. LNCS (LNAI), vol. 5290, pp. 332–341. Springer, Heidelberg (2008)

11. Downey, D., Etzioni, O., Soderland, S.: A probabilistic model of redundancy in information extraction. In: Proc. 19th International Joint Conf. on Artificial Intelligence, pp. 1034–1041. Morgan Kaufmann Publishers Inc., San Francisco (2005)

12. Etzioni, O., Cafarella, M., Downey, D., Popescu, A.M., Shaked, T., Soderland, S., Weld, D.S., Yates, A.: Unsupervised named-entity extraction from the web: an experimental study. Artificial Intelligence 165(1), 91–134 (2005)

13. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database (Language, Speech, and Communication). MIT (May 1998)

14. Gracia, J.L., Mena, E.: Web-Based Measure of Semantic Relatedness. In: Bailey, J., Maier, D., Schewe, K.-D., Thalheim, B., Wang, X.S. (eds.) WISE 2008. LNCS, vol. 5175, pp. 136–150. Springer, Heidelberg (2008)

15. Harris, Z.: Distributional structure. In: Papers in Structural and Transformational Linguistics, pp. 775–794. D. Reidel Publishing Comp., Dordrecht (1970)

16. Hearst, M.A.: Automatic acquisition of hyponyms from large text corpora. In: Proc. 14th Conf. on Computational Linguistics, pp. 539–545. ACL, Morristown (1992)

17. Lenat, D.: CYC: A Large-Scale Investment in Knowledge Infrastructure. Communications of the ACM 38, 33–38 (1995)

18. Magnini, B., Negri, M., Prevete, R., Tanev, H.: Is It the Right Answer? Exploiting Web Redundancy for Answer Validation. In: Proc. 40th Annual Meeting of the Association for Computational Linguistics, pp. 425–432 (2002)


19. Oliveira, P.C.: Probabilistic Reasoning in the Semantic Web using Markov Logic, pp. 67–73. University of Coimbra, Faculty of Sciences and Technology, Department of Informatics Engineering (July 2009)

20. Pantel, P., Pennacchiotti, M.: Espresso: Leveraging Generic Patterns for Automatically Harvesting Semantic Relations. In: Proc. 21st International Conf. on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (COLING-ACL), pp. 113–120. ACL, Sydney (2006)

21. Turney, P.D.: Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL. In: Flach, P.A., De Raedt, L. (eds.) ECML 2001. LNCS (LNAI), vol. 2167, pp. 491–502. Springer, Heidelberg (2001)

22. Wu, F., Weld, D.S.: Open Information Extraction Using Wikipedia. In: Proc. 48th Annual Meeting of the Association for Computational Linguistics, pp. 118–127. ACL, Uppsala (2010)