
Evaluation of system measures for incomplete relevance judgment in IR

Shengli Wu and Sally McClean

School of Computing and Mathematics, University of Ulster, UK {s.wu1, si.mcclean}@ulster.ac.uk

Abstract. Incomplete relevance judgment has become the norm in major information retrieval evaluation events such as TREC, but its effect on some system measures is not well understood. In this paper, we evaluate four system measures, namely mean average precision, R-precision, normalized average precision over all documents, and normalized discount cumulative gain, under incomplete relevance judgment. Among them, the measure of normalized average precision over all documents is introduced in this paper, and both mean average precision and R-precision are generalized for graded relevance judgment. These four measures have a common characteristic: complete relevance judgment is required for the calculation of their accurate values. We empirically investigate these measures through extensive experimentation with TREC data, aiming to determine the effect of incomplete relevance judgment on them. From these experiments, we conclude that incomplete relevance judgment affects the values of all four measures significantly: when using the pooling method in TREC, the more incomplete the relevance judgment is, the higher the values of all these measures usually become. We also conclude that mean average precision is the most sensitive but least reliable measure, normalized discount cumulative gain and normalized average precision over all documents are the most reliable but least sensitive measures, while R-precision is in the middle.

1. Introduction

To evaluate the effectiveness of an information retrieval system, a test collection is required, which includes a set of documents, a set of topics, and a set of relevance judgments indicating which documents are relevant to which topics. Among these, “relevance” is an equivocal concept [3, 11, 12] and relevance judgment is a task which demands huge human effort. In some situations, such as evaluating search services on the World Wide Web, complete relevance judgment is not possible. It is also not affordable when large document collections are used for the evaluation of information retrieval systems. For example, in the Text REtrieval Conferences (TREC) held by the National Institute of Standards and Technology of the USA, only partial relevance judgment is conducted due to the large number of documents (from 0.5 to several million) in the whole collection. A pooling method [8] has been used in TREC: for every query (topic), a document pool is formed from the top 100 documents of all, or a subset of all, the runs submitted.

Only those documents in the pool are judged by human judges; documents not in the pool are left unjudged and are assumed to be irrelevant to the topic. Therefore, many relevant documents can be missed in such processing [17]. “Partial relevance judgment” or “incomplete relevance judgment” are the terms used to refer to such situations. TREC’s pooling method does not affect some measures, such as precision at a given cut-off document level. However, in the evaluation of information retrieval systems, both precision and recall are important aspects, and many measures concern both of them at the same time. In order to calculate accurate values for such measures, complete relevance judgment is required. Mean average precision (MAP) and R-precision are probably the two such measures most often used recently. There have been some papers [1, 2, 5, 13] which investigate the reliability and sensitivity of MAP and R-precision. In the context of TREC, Zobel [17] investigated the reliability of the pooling method. He found that in general the pooling method was reliable, but that recall was overestimated, since it was likely that 30% to 50% of the relevant documents had not been found. Buckley and Voorhees [5] conducted an experiment to investigate the stability of different measures when using different query formats. Results submitted to the TREC 8 query track were used. In their experiment, recall at 1000 document level had the lowest error rate, followed by precision at 1000 document level, R-precision, and mean average precision, while precision at 1, 10, and 30 document levels had the biggest error rates. Voorhees and Buckley [14] also investigated the effect of topic set size on retrieval results by examining the consistency of rankings when using two different sets of topics for the same group of retrieval systems. They found that the error rates incurred were larger than anticipated; therefore, researchers need to be careful when concluding that one method is better than another, especially if few topics are used. Their investigation also suggested that using precision at 10 document level incurred a higher error rate than using MAP. Considering the fact that some existing evaluation measures (such as MAP, R-precision, and precision at 10 document level) are not reliable under substantially incomplete relevance judgment, Buckley and Voorhees [6] introduced a new measure, which is related to the number of irrelevant documents occurring before a given number of relevant documents in a resultant list, to cope with such a situation. Sanderson and Zobel [13] reran Voorhees and Buckley’s experiment [14] and made similar observations, but they argued that precision at 10 document level was as good as MAP when considering both the error rate of ranking and the human judgmental effort. Järvelin and Kekäläinen [7] introduced cumulated gain-based evaluation measures. Among them, normalized discount cumulated gain (NDCG) concerns both precision and recall, and can be used as an alternative to MAP. Using cumulated gain-based evaluation measures, Kekäläinen [9] compared the effect of binary and graded relevance judgment on the rankings of information retrieval systems. She found that these measures correlated strongly under binary relevance judgment, but that the correlation became less strong when highly relevant documents were emphasised in graded relevance judgment.

However, the effect of incomplete relevance judgment on these measures is not well understood, and it is therefore interesting to evaluate them under such a condition. We include four measures (MAP, NAP, NDCG, and R-precision) in our investigation, since all of them concern precision and recall at the same time and can therefore be regarded as good system measures. Among them, normalized average precision over all documents (NAP) is introduced in this paper. In their original definitions, both MAP and R-precision can only be used under binary relevance judgment; therefore, their definitions are generalized for graded relevance judgment in this paper. Both binary and graded relevance judgment are used. The rest of this paper is organized as follows: in Section 2 we discuss the four measures involved in our experiments; in Sections 3, 4, and 5 we present the experimental results about different aspects of these four measures; Section 6 concludes the paper.

2. Four measures

In this section we discuss the four measures used in this paper. MAP and R-precision have been used many times in TREC [15]. Both of them are defined under binary relevance judgment and have been used widely by researchers to evaluate their information retrieval systems and algorithms (e.g., in [4, 10, 16]). MAP uses the formula

$map = \frac{1}{total\_n} \sum_{i=1}^{total\_n} \frac{i}{p_i}$

to calculate scores. Here total_n is the total number of relevant documents in the whole collection for the information need, and $p_i$ is the ranking position of the i-th relevant document in the resultant list. R-precision is defined as the percentage of relevant documents in the top total_n documents, where total_n is again the total number of relevant documents for the information need.
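To make these binary definitions concrete, the following sketch computes average precision for a single topic and R-precision from a ranked list of document identifiers and a set of judged-relevant documents. It is our own illustrative reading of the formulas above, not the authors' code, and the function and variable names are ours.

def average_precision(ranked, relevant):
    # ranked: list of document ids in ranked order
    # relevant: set of all relevant document ids (total_n = len(relevant))
    total_n = len(relevant)
    if total_n == 0:
        return 0.0
    found = 0
    score = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            found += 1             # this is the i-th relevant document found
            score += found / rank  # i / p_i
    return score / total_n         # relevant documents never retrieved contribute 0

def r_precision(ranked, relevant):
    total_n = len(relevant)
    if total_n == 0:
        return 0.0
    top = ranked[:total_n]
    return sum(1 for doc in top if doc in relevant) / total_n

MAP is then obtained by averaging average_precision over all topics in the test collection.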

Now we generalize their definitions to make them suitable for graded relevance judgment. Suppose there are n relevance grades ranging from 1 to n, where grade n denotes the most relevant state and grade 0 denotes the irrelevant state; then each document $d_i$ can be assigned a grade $g(d_i)$ according to its degree of relevance to the given topic. The primary assumption we make for documents in the various grades is: a document in grade n is regarded as 100% relevant and 100% useful to users, and a document in grade i (i < n) is regarded as i/n relevant and i/n useful to users. One natural derivation is that one document in grade i is equal in usefulness to i/n documents in grade n. Based on this assumption, we define MAP and R-precision under graded relevance judgment. Suppose there are total_n documents whose grades are above 0, and total_n = |r_1| + |r_2| + … + |r_n|, where |r_i| denotes the number of documents in grade i. Before discussing how to generalize MAP and R-precision, let us introduce the concept of the best resultant list. For the given information need, a resultant list l is best if it satisfies the following two conditions:

o all the documents whose grades are above 0 appear in the list;
o for any document pair $d_i$ and $d_j$, if $d_i$ is ranked in front of $d_j$, then $g(d_i) \geq g(d_j)$.


Many resultant lists can be best at the same time, since more than one document can be in the same grade and the documents in the same grade can be arranged in different orders. For any pair of best resultant lists $l_1$ and $l_2$, if $d_j$ is the document in ranking position j in $l_1$ and $d'_j$ is the document in the same ranking position in $l_2$, then $d_j$ and $d'_j$ must be in the same grade, i.e., $g(d_j) = g(d'_j)$. Therefore, we can use $g\_best(d_j)$ to refer to the grade of the document in ranking position j in any of these best resultant lists. We may also sum up the grades of the documents in the top $|r_n|$, top $(|r_n| + |r_{n-1}|)$, …, top $(|r_n| + |r_{n-1}| + \cdots + |r_1|)$ positions for any of the best resultant lists (these sums are the same for all the best resultant lists):

$s\_best_n = \sum_{i=1}^{|r_n|} g\_best(d_i), \quad s\_best_{n-1} = \sum_{i=1}^{|r_n|+|r_{n-1}|} g\_best(d_i), \quad \ldots, \quad s\_best_1 = s\_best = \sum_{i=1}^{|r_n|+|r_{n-1}|+\cdots+|r_1|} g\_best(d_i)$
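As a concrete illustration (our own sketch, not taken from the paper; names are ours), these prefix sums can be computed by sorting the judged documents by grade in descending order, which yields one best resultant list, and accumulating grades at the grade-group boundaries:

def s_best_sums(grades):
    # grades: dict mapping document id -> grade (0 = irrelevant, n = most relevant)
    judged = sorted((g for g in grades.values() if g > 0), reverse=True)  # one best resultant list
    n = max(judged) if judged else 0          # highest grade present
    counts = {i: judged.count(i) for i in range(1, n + 1)}                # |r_i|
    sums = {}
    cum_docs = 0
    for grade in range(n, 0, -1):             # cut-offs |r_n|, |r_n|+|r_{n-1}|, ...
        cum_docs += counts.get(grade, 0)
        sums[grade] = sum(judged[:cum_docs])  # s_best_grade: sum of g_best over the top positions
    return sums                               # sums[1] equals s_best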

These sums will be used later in this paper. One simple solution for calculating R-precision for a resultant list is to use the formula

$r\_p = \frac{1}{s\_best} \sum_{j=1}^{total\_n} g(d_j)$

However, this formula does not distinguish the different positions of the documents as long as they are located in the top total_n: a document occurring at ranking position 1 has the same effect as one occurring at ranking position total_n. To avoid this drawback, we can use a more sophisticated formula. First we only consider the top $|r_n|$ documents and use $\frac{1}{s\_best_n} \sum_{j=1}^{|r_n|} g(d_j)$ to evaluate their precision; next we consider the top $|r_n| + |r_{n-1}|$ documents and use $\frac{1}{s\_best_{n-1}} \sum_{j=1}^{|r_n|+|r_{n-1}|} g(d_j)$ to evaluate their precision; and we continue this process until finally we consider all top total_n documents using $\frac{1}{s\_best_1} \sum_{j=1}^{|r_n|+\cdots+|r_1|} g(d_j)$. Combining all these, we have

$r\_p = \frac{1}{n} \left\{ \frac{1}{s\_best_n} \sum_{j=1}^{|r_n|} g(d_j) + \frac{1}{s\_best_{n-1}} \sum_{j=1}^{|r_n|+|r_{n-1}|} g(d_j) + \cdots + \frac{1}{s\_best_1} \sum_{j=1}^{|r_n|+|r_{n-1}|+\cdots+|r_1|} g(d_j) \right\}$   (1)

Please note that in Equation 1 above, each addend inside the braces can vary from 0 to 1, and there are n addends; therefore, the final value of r_p is between 0 and 1 inclusive.

Next let us discuss MAP. MAP can be defined as

$map = \frac{1}{s\_best} \sum_{i=1}^{total\_n} \left\{ g(d_{p_i}) \cdot \frac{\sum_{j=1}^{i} g(d_{p_j})}{p_i} \right\}$

Here $p_j$ is the ranking position of the j-th document whose grade is above 0, and $\sum_{j=1}^{i} g(d_{p_j})$ is the total sum of grades for the documents up to rank $p_i$. Considering all total_n documents in the whole collection whose grades are above 0, MAP needs to calculate the precision at all of these document levels ($p_1, p_2, \ldots, p_{total\_n}$). At any $p_i$, precision is calculated as $\frac{1}{p_i} \sum_{j=1}^{i} g(d_{p_j})$, and then a weight of $g(d_{p_i})$ is applied. In this way the documents in higher grades make a bigger contribution to the final value of MAP.
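The following sketch (our own, written under the i/n usefulness assumption above; all names are illustrative) implements the generalized R-precision of Equation 1 and the generalized MAP, given the grade of each document in the evaluated ranking and the grade counts |r_1|, …, |r_n| for the topic.

def graded_r_precision(run_grades, grade_counts):
    # run_grades: grades g(d_j) of the evaluated ranking, in rank order (0 for irrelevant)
    # grade_counts: dict {grade i: |r_i|} over the whole collection, for grades 1..n
    n = max(grade_counts)
    best = sorted([g for g, c in grade_counts.items() for _ in range(c)], reverse=True)
    total, cutoff = 0.0, 0
    for grade in range(n, 0, -1):                 # cut-offs |r_n|, |r_n|+|r_{n-1}|, ...
        cutoff += grade_counts.get(grade, 0)
        s_best_grade = sum(best[:cutoff])         # normalizer from the best resultant list
        if s_best_grade > 0:
            total += sum(run_grades[:cutoff]) / s_best_grade
    return total / n

def graded_map(run_grades, grade_counts):
    s_best = sum(g * c for g, c in grade_counts.items())
    score, cum = 0.0, 0.0
    for rank, g in enumerate(run_grades, start=1):
        if g > 0:                                  # rank is some p_i
            cum += g                               # sum of grades up to rank p_i
            score += g * cum / rank                # weight g(d_{p_i}) times precision at p_i
    return score / s_best if s_best else 0.0

Under binary judgment (all grades equal to 1) both functions reduce to the standard R-precision and average precision, which is the consistency check behind the generalization.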

Normalized average precision over all documents (NAP) is a new measure introduced in this paper. It can be defined as

$nap = \frac{1}{nap\_best} \cdot \frac{1}{t} \sum_{i=1}^{t} \frac{\sum_{j=1}^{i} g(d_j)}{i}$

Here $\frac{1}{i} \sum_{j=1}^{i} g(d_j)$ is the precision at document level i, $\frac{1}{t} \sum_{i=1}^{t} \frac{\sum_{j=1}^{i} g(d_j)}{i}$ is the average precision over all t document levels, and 1/nap_best is the normalization coefficient, where nap_best is the corresponding value for one of the best resultant lists.

NDCG was introduced by Järvelin and Kekäläinen [7] for graded relevance judgment. Each ranking position in a resultant document list is assigned a given weight. The top ranked documents are assigned the highest weights since they are the most convenient ones for users to read. A logarithmic-function-based weighting schema was proposed in their paper, which takes a particular whole number b: the first b documents are assigned a weight of 1, and any document at rank k greater than b is assigned the weight w(k) = log b / log k. Considering a resultant document list of up to t documents, its discount cumulated gain (DCG) is $\sum_{i=1}^{t} w(i) \cdot g(i)$, where g(i) is the judged grade of the i-th document. DCG can be normalized using a normalization coefficient dcg_best, where dcg_best is the discount cumulated gain of the best resultant lists. Therefore, we have

$ndcg = \frac{1}{dcg\_best} \sum_{i=1}^{t} w(i) \cdot g(i)$
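A short sketch of NAP and NDCG as just defined (our own illustrative code; the helper names are ours, and b defaults to 2 as used later in the experiments):

import math

def nap(run_grades, best_grades):
    # run_grades: grades of the evaluated ranking, in rank order (0 for irrelevant/unjudged)
    # best_grades: grades of a best resultant list, padded or truncated to the same length t
    def avg_prec_over_all(grades):
        t = len(grades)
        cum, total = 0.0, 0.0
        for i, g in enumerate(grades, start=1):
            cum += g
            total += cum / i              # precision at document level i
        return total / t if t else 0.0
    nap_best = avg_prec_over_all(best_grades)
    return avg_prec_over_all(run_grades) / nap_best if nap_best else 0.0

def ndcg(run_grades, best_grades, b=2):
    def dcg(grades):
        return sum(g if i <= b else g * math.log(b) / math.log(i)
                   for i, g in enumerate(grades, start=1))
    dcg_best = dcg(best_grades)
    return dcg(run_grades) / dcg_best if dcg_best else 0.0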

3. Effect of incomplete relevance judgment on measure values

In this section we investigate how incomplete relevance judgment affects the values of these measures. Considering that the pooling method in TREC is a reasonable method for incomplete relevance judgment, we conduct an experiment to compare the values of these measures using pools of different depths. In every year, a pool of 100 documents in depth was used in TREC to generate its qrels (relevance judgment file). Shallower pools of 10, 20, …, 90 documents in depth were used in this experiment to generate additional qrels. For a resultant list and a measure, we calculate the measure's value $c_{100}$ using the 100-document qrels, then calculate its value $c_i$ using the i-document qrels (i = 10, 20, …, 90); an absolute difference can then be calculated as abs_diff = $|c_i - c_{100}| / c_{100}$. Nine groups of runs submitted to TREC (TREC 5-8: ad hoc track; TREC 9, 2001, and 2002: Web track; TREC 2003 and 2004: robust track) were used in the experiment. In some years, judged documents were divided into 2 categories: relevant and irrelevant; in other years, judged documents were divided into 3 categories: highly relevant, relevant, and irrelevant. First we treat highly relevant and relevant documents equally, and simply use a binary judgment for the evaluation of these runs. Figure 1 shows the difference in the four measure values when different qrels are used.

Every data point in Figure 1 is the average over all the submitted runs in all 9 year groups. One general tendency for all four measures is that the shallower the pool is, the bigger the difference is. However, MAP is the worst in terms of the difference rate. When using a pool of 10 documents in depth, the difference rate for MAP is as big as 44%; under the same condition, it is about 32% for R-precision, about 30% for NAP, and about 21% for NDCG_2 (2 was used as the base of its logarithmic function). For all four measures, it is generally true that the shallower the pool is, the bigger the measure value is.

Fig. 1. Absolute differences of the four measures when using pools of different depths (the pool of 100 documents in depth serves as the baseline; binary relevance judgment)
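For illustration, here is a minimal sketch of the pool-depth comparison described above (our own code, not the authors'; the run and qrels structures, and the measure signature, are assumptions): shallower qrels are derived by pooling only the top-k documents of each run, and the relative difference of a measure value against the depth-100 baseline is computed.

def build_qrels(runs, full_qrels, depth):
    # runs: dict run_id -> list of document ids in rank order (for one topic)
    # full_qrels: dict document id -> grade from the official depth-100 judgments
    pool = set()
    for ranked in runs.values():
        pool.update(ranked[:depth])           # top-'depth' documents of every run
    # documents outside the pool remain unjudged and are treated as irrelevant
    return {doc: grade for doc, grade in full_qrels.items() if doc in pool}

def abs_diff(measure, ranked, runs, full_qrels, depth):
    # measure: any function taking a ranked list and a qrels dict and returning a score
    c_100 = measure(ranked, build_qrels(runs, full_qrels, 100))
    c_i = measure(ranked, build_qrels(runs, full_qrels, depth))
    return abs(c_i - c_100) / c_100 if c_100 else 0.0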

Next we repeated the experiment with a different relevance judgment. This time we did not take an indiscriminate policy towards highly relevant documents: highly relevant documents were regarded as 100% relevant, and relevant documents were regarded as 50% relevant. Runs submitted to TREC 9 (all 50 topics), TREC 2001 (all 50 topics), and TREC 2003 (the second half: topics 601-650) were used. The experimental result is shown in Figure 2. All the curves in Figure 2 are very similar to those in Figure 1; therefore, the same conclusion can be drawn here as with binary relevance judgments. However, Figures 1 and 2 cannot be compared directly, since the data sets used for them are not identical. For a more reasonable comparison, the results of using both binary relevance judgment and graded relevance judgment with the same data set are presented in Figure 2. Comparing the corresponding curves for the same measure in Figure 2, we can observe that the curve for graded relevance judgments is always very close to its counterpart for binary relevance judgments.


Fig. 2. Absolute differences of the four measures when using pools of different depths (the pool of 100 documents in depth serves as the baseline; three-category relevance judgment)

4. Further investigation about these measure values

Zobel [17] estimated that 30% to 50% of the relevant documents might not be identified when using a pool of 100 documents in depth. His estimation method was based on the 50 topics (251-300) in TREC 5. As an average, the estimated figure is reasonable. However, one topic can be very different from another with regard to the number of relevant documents: some topics may have as few as 1 or 2 relevant documents, while others may have over 100. In this section, we go a step further and investigate the issue of missing relevant documents and the properties of the four measures across different topics under the pooling method. We divided all 699 topics in the 9 year groups (one topic in TREC 2004 was dropped since it did not include any relevant document) into 11 groups according to the number of relevant documents identified for them. Group 1 (G1) includes those topics with fewer than 10 relevant documents, group 2 (G2) includes those topics with between 10 and 19 relevant documents, …, and group 11 (G11) includes those topics with 100 or more relevant documents. The number of topics in each group is as follows:

G1   G2    G3   G4   G5   G6   G7   G8   G9   G10   G11   Total
74   116   79   75   49   33   39   27   25   17    165   699
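The grouping itself is straightforward; a small sketch (ours, with a hypothetical qrels_per_topic mapping) of how topics could be bucketed by the number of identified relevant documents:

from collections import Counter

def topic_group(num_relevant):
    # G1: fewer than 10, G2: 10-19, ..., G10: 90-99, G11: 100 or more
    if num_relevant >= 100:
        return "G11"
    return "G" + str(num_relevant // 10 + 1)

# qrels_per_topic: assumed dict topic id -> set of relevant document ids
# group_sizes = Counter(topic_group(len(rel)) for rel in qrels_per_topic.values())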

In this section, binary relevance judgment was used for the investigation. For every topic, we investigate the impact of the number of relevant documents identified for the topic on these four measures. Again, for the topic groups G1 to G11 defined above, we calculated the value differences of these measures using pools of different depths. Figure 3 shows the experimental result, in which each measure is drawn separately in panels (a), (b), (c), and (d). One common tendency for all four measures is: the fewer relevant documents are identified for a topic, the smaller the difference between the values of the same measure obtained with pools of different depths. For example, the curves of G1 are always below all other curves, while the curves of G10 and G11 are above all other curves. Comparing the curves of the different measures, we can observe that the biggest differences occur for MAP.

For groups G10 and G11, the value differences of MAP are 0.93 and 0.84 respectively between a pool of 10 documents and a pool of 100 documents, while the figures for NAP are 0.60 and 0.51, the figures for R-precision are 0.48 and 0.52, and the figures for NDCG are 0.34 and 0.37. This experiment demonstrates that MAP is the most sensitive of these four measures.

Fig. 3. Difference in performance values using pools of different depths (panels: (a) NAP score difference, (b) NDCG score difference, (c) R-precision score difference, (d) MAP score difference; each panel plots the score difference against pool depth for topic groups G1-G11)

5. Error and tie rates of the measures

In this section we present the results of an experiment whose methodology is similar to the one conducted by Buckley and Voorhees [5]. However, there are substantial differences between their experiment and ours. First, the experiment conducted by Buckley and Voorhees [5] only used the runs submitted to the TREC 8 query track, whereas here we use 9 groups of runs. Second, two measures (NAP and NDCG) are investigated in this paper that were not involved in Buckley and Voorhees's experiment. Third, graded relevance judgments are investigated in this paper but were not in Buckley and Voorhees's investigation. For a given measure, we evaluate all the results in a year group and obtain their average performance.

Then we count how many pairs have a performance difference above 5%. The tie rate is defined as the percentage of pairs whose performance difference is less than 5%. For those pairs whose performance difference is above 5%, we check whether this holds for all the topics. Suppose we have two results A and B such that A's average performance is better than B's average performance by over 5% over all l topics. We then consider these l topics one by one: if A is better than B by over 5% for m topics, and B is better than A by over 5% for n topics (l ≥ m+n), then the error rate is n/(m+n), since for m topics the conclusion is consistent with the average over all topics, and for n topics the conclusion is inconsistent with it. Tables 1 and 2 show the error rates and the tie rates of the experiment, respectively. As in Section 3, 9 groups of runs in TREC were used, with binary relevance judgments and pools of different depths. From Tables 1 and 2, we can observe that NAP and NDCG_2 are close in both rates; both NAP and NDCG_2 have the lowest error rates but the highest tie rates; MAP has the highest error rates but the lowest tie rates; and R-precision is in the middle in both tie rates and error rates. We can also observe that for all the measures, the error rates and tie rates are very close to each other when the pools are formed with different numbers of documents.

Table 1. Error rates of using different measures under binary relevance judgments (9 groups of runs in TREC)

Num. docs   MAP      R-precision   NAP      NDCG_2
10          0.2489   0.2087        0.1927   0.2014
20          0.2472   0.2123        0.1951   0.1985
30          0.2465   0.2140        0.1965   0.1974
40          0.2457   0.2147        0.1975   0.1970
50          0.2458   0.2157        0.1988   0.1975
60          0.2458   0.2158        0.1991   0.1977
70          0.2454   0.2159        0.1995   0.1978
80          0.2452   0.2160        0.1995   0.1979
90          0.2452   0.2157        0.1996   0.1980
100         0.2450   0.2157        0.1995   0.1980
Average     0.2461   0.2144        0.1978   0.1981
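A compact sketch of the error-rate and tie-rate computation described above (our own illustration; the per-topic score matrices are assumed, the 5% threshold is interpreted as a relative difference with respect to the better score, and m and n are aggregated over all qualifying pairs, which may differ from the paper's exact aggregation):

def error_and_tie_rates(scores, threshold=0.05):
    # scores: dict system id -> list of per-topic measure values (same topic order)
    systems = list(scores)
    pairs = tied = 0
    err_num = err_den = 0
    for i in range(len(systems)):
        for j in range(i + 1, len(systems)):
            a, b = scores[systems[i]], scores[systems[j]]
            mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
            pairs += 1
            hi, lo = max(mean_a, mean_b), min(mean_a, mean_b)
            if hi - lo < threshold * hi:             # average difference under 5%: a tie
                tied += 1
                continue
            better, worse = (a, b) if mean_a > mean_b else (b, a)
            m = sum(1 for x, y in zip(better, worse) if x - y > threshold * x)
            n = sum(1 for x, y in zip(better, worse) if y - x > threshold * y)
            err_num += n                             # per-topic decisions contradicting the average
            err_den += m + n
    tie_rate = tied / pairs if pairs else 0.0
    error_rate = err_num / err_den if err_den else 0.0
    return error_rate, tie_rate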

The above experiment was repeated with graded relevance judgments. Three groups of runs in TREC (TREC 9, TREC 2001, and the second half of TREC 2003) were used. Tables 3 and 4 show the results, which are very much like those in Tables 1 and 2. Please note that the results in Tables 1 & 3 and in Tables 2 & 4 are not directly comparable, since the data sets used are different. For a reasonable comparison, we also calculate the average error rates and tie rates using the same data set but binary relevance judgments (the last row of Tables 3 and 4). Comparing the last two rows in Tables 3 and 4, we can observe that there is not much difference between them. We can conclude that the experimental results are consistent between binary and graded relevance judgments; on the other hand, graded relevance judgments do not help to reduce the error rates or tie rates compared with binary relevance judgments.


Table 2. Tie rates of using different measures under binary relevance judgment (9 groups of runs in TREC)

Num. docs   MAP      R-precision   NAP      NDCG_2
10          0.1001   0.1223        0.1613   0.1635
20          0.0966   0.1222        0.1522   0.1597
30          0.0954   0.1204        0.1484   0.1565
40          0.0936   0.1219        0.1460   0.1536
50          0.0940   0.1208        0.1440   0.1509
60          0.0932   0.1211        0.1434   0.1502
70          0.0937   0.1210        0.1419   0.1495
80          0.0937   0.1206        0.1417   0.1480
90          0.0931   0.1210        0.1408   0.1471
100         0.0927   0.1208        0.1409   0.1467
Average     0.0946   0.1212        0.1461   0.1526

Error rates can be regarded as a good indicator of the reliability of the measure in question, while tie rates can be regarded as a good indicator of its sensitivity. MAP has the highest error rates and the lowest tie rates among the four measures, which indicates that MAP is the most sensitive but the least reliable measure. On the other hand, NAP and NDCG have the lowest error rates but the highest tie rates, which indicates that they are the two most reliable but the two least sensitive measures. R-precision is not as sensitive as MAP but is more sensitive than NAP and NDCG, and it is not as reliable as NAP and NDCG but is more reliable than MAP.

If we consider both tie rates and error rates at the same time and define a comprehensive measure (com) which sums them up, then for both binary and graded relevance judgment, R-precision is the best with the lowest com value (0.3356 and 0.3522), MAP is in second place (0.3407 and 0.3642), NAP is in third place (0.3439 and 0.3711), and NDCG is the worst (0.3507 and 0.3792).
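These com values can be reproduced directly from the average rows of Tables 1-4, as in the small check below (our own illustrative snippet):

# average error rates and tie rates taken from Tables 1-4, as (binary, graded) pairs
error = {"MAP": (0.2461, 0.2453), "R-precision": (0.2144, 0.2142),
         "NAP": (0.1978, 0.1877), "NDCG_2": (0.1981, 0.1972)}
tie = {"MAP": (0.0946, 0.1189), "R-precision": (0.1212, 0.1380),
       "NAP": (0.1461, 0.1834), "NDCG_2": (0.1526, 0.1820)}
for name in error:
    binary_com = error[name][0] + tie[name][0]   # e.g. MAP: 0.2461 + 0.0946 = 0.3407
    graded_com = error[name][1] + tie[name][1]   # e.g. MAP: 0.2453 + 0.1189 = 0.3642
    print(f"{name}: com(binary) = {binary_com:.4f}, com(graded) = {graded_com:.4f}")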

Table 3. Error rates of using different measures under graded relevance judgment (3 groups of runs in TREC)

Num. docs      MAP      R-precision   NAP      NDCG_2
10             0.2484   0.2085        0.1827   0.2028
20             0.2465   0.2136        0.1848   0.1986
30             0.2461   0.2137        0.1871   0.1971
40             0.2452   0.2141        0.1881   0.1972
50             0.2445   0.2146        0.1880   0.1963
60             0.2446   0.2149        0.1882   0.1963
70             0.2446   0.2156        0.1889   0.1960
80             0.2440   0.2152        0.1894   0.1959
90             0.2443   0.2161        0.1894   0.1961
100            0.2443   0.2159        0.1894   0.1959
Average        0.2453   0.2142        0.1877   0.1972
Ave (binary)   0.2433   0.2130        0.1878   0.1923


Table 4. Tie rates of using different measures under graded relevance judgment (3 groups of runs in TREC)

Num. docs      MAP      R-precision   NAP      NDCG_2
10             0.1158   0.1334        0.1939   0.1808
20             0.1167   0.1347        0.1886   0.1833
30             0.1165   0.1367        0.1861   0.1810
40             0.1184   0.1365        0.1821   0.1801
50             0.1198   0.1378        0.1822   0.1819
60             0.1196   0.1388        0.1822   0.1828
70             0.1194   0.1386        0.1812   0.1832
80             0.1215   0.1407        0.1798   0.1827
90             0.1208   0.1395        0.1789   0.1815
100            0.1209   0.1430        0.1789   0.1823
Average        0.1189   0.1380        0.1834   0.1820
Ave (binary)   0.1154   0.1433        0.1795   0.1848

6. Conclusions

In this paper we have investigated the properties of four measures, namely mean average precision (MAP), R-precision, normalized average precision over all documents (NAP), and normalized discount cumulative gain (NDCG), when relevance judgment is incomplete. All these measures have one common characteristic: both precision and recall are implicit in their definitions. Therefore, they are good candidates for the evaluation of the effectiveness of information retrieval systems and algorithms. 9 groups of runs submitted to TREC have been used in our experiments, and both binary and graded relevance judgment have been utilised. From these experimental results, we conclude that MAP is the most sensitive but least reliable measure, that NAP and NDCG are the most reliable but least sensitive measures, and that R-precision is in the middle. We believe that a good measure should strike a good balance between these two somewhat contradictory properties: sensitivity and reliability. R-precision achieves the best balance of them, MAP is in second place, NAP is in third place, and NDCG is the worst. Among these four measures, normalized average precision over all documents (NAP) is introduced in this paper. It can be used for both binary and graded relevance judgment, and it is arguably easier to justify than NDCG since no parameter is required. In their original definitions, mean average precision and R-precision can only be used with binary relevance judgment; a generalized form of both is provided in this paper. Our experimental results also show that the values of all four measures are very likely to be exaggerated when incomplete relevance judgment such as the pooling method is applied. Therefore, when interpreting the results of measures such as mean average precision and R-precision, which are used in TREC, care should be taken, since their real values depend on the percentage of relevant documents which have been identified. From this perspective, it is better to use reliable measures.

References

1. Aslam, J. A. and Yilmaz, E.: A geometric interpretation and analysis of R-precision. In Proceedings of ACM CIKM'2005, pages 664-671, Bremen, Germany, October-November.

2. Aslam, J. A. and Yilmaz, E. and Pavlu, V.: The maximum entropy method for analyzing retrieval measures. In Proceedings of ACM SIGIR'2005, pages 27-34, Salvador, Brazil.

3. Barry, C. L.: User-defined relevance criteria: an exploratory study. Journal of the American Society for Information Science, 45(3):149-159, 1994.

4. Bodoff, D. and Robertson, S.: A new unified probabilistic model. Journal of the American Society for Information Science and Technology, 55(6):471-487, 2004.

5. Buckley, C. and Voorhees, E. M.: Evaluating evaluation measure stability. In Proceedings of ACM SIGIR'2000, pages 33-40, Athens, Greece.

6. Buckley, C. and Voorhees, E. M.: Retrieval evaluation with incomplete information. In Proceedings of ACM SIGIR'2004, pages 25-32, Sheffield, United Kingdom.

7. Järvelin, K. and Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422-446, 2002.

8. Sparck Jones, K. and van Rijsbergen, C.: Report on the need for and provision of an "ideal" information retrieval test collection. Technical report, British Library Research and Development Report 5266, Computer Laboratory, University of Cambridge, Cambridge, UK, 1975.

9. Kekäläinen, J.: Binary and graded relevance in IR evaluations - comparison of the effects on ranking of IR systems. Information Processing & Management, 41(5):1019-1033, 2005.

10. Lee, C. and Lee, G. G.: Probabilistic information retrieval model for a dependency structured indexing system. Information Processing & Management, 41(2):161-175, 2005.

11. Saracevic, T.: Relevance: A review of and a framework for thinking on the notion in information science. Journal of the American Society for Information Science, 26(6):321-343, 1975.

12. Schamber, L. and Eisenberg, M. B. and Nilan, M. S.: A re-examination of relevance: toward a dynamic, situational definition. Information Processing & Management, 26(6):755-776, 1990.

13. Sanderson, M. and Zobel, J.: Information retrieval system evaluation: Effort, sensitivity, and reliability. In Proceedings of ACM SIGIR'2005, pages 162-169, Salvador, Brazil.

14. Voorhees, E. M. and Buckley C.: The effect of topic set size on retrieval experiment error. In Proceedings of ACM SIGIR'2002, pages 316-323, Tampere, Finland.

15. Voorhees, E. M. and Harman, D.: Overview of the sixth Text REtrieval Conference (TREC-6). Information Processing & Management, 36(1):3-35, 2000.

16. Xu, Y. and Benaroch, M.: Information retrieval with a hybrid automatic query expansion and data fusion procedure. Information Retrieval, 8(1):41-65, 2005.

17. Zobel, J.: How reliable are the results of large-scale information retrieval experiments? In Proceedings of ACM SIGIR'1998, pages 307-314, Melbourne, Australia.