
IJIRR: International Journal of Information Retrieval Research. In press, 2013.

Information Retrieval from Unstructured Web Text based on Automatic learning of the Threshold

Fethi Fkih and Mohamed Nazih Omri

MARS Research Unit, Faculty of Sciences of Monastir, University of Monastir, 5019 Monastir, Tunisia

ABSTRACT

Collocation is defined as a sequence of lexical tokens which habitually co-occur. This type of information is widely used in various applications such as Information Retrieval, document indexing, machine translation, lexicography, etc. Therefore, many techniques have been developed for the automatic retrieval of collocations from textual documents. These techniques use statistical measures based on a joint frequency calculation to quantify the connection strength between the tokens of a candidate collocation. The discrimination between relevant and irrelevant collocations is performed using a threshold fixed a priori. Generally, the discrimination threshold is estimated manually by a domain expert. This supervised estimation is considered an additional cost which reduces system performance. In this paper, we propose a new technique for the automatic learning of the threshold. This technique is mainly based on the usual performance evaluation measures (such as ROC and Precision-Recall curves). The results show the ability to automatically estimate a statistical threshold independently of the treated corpus.

Keywords: Collocations Retrieval, Statistical Threshold, Binary Classification, Performance Evaluation, ROC curves, Youden Index.

1. INTRODUCTION

Currently, the Web is considered as the most

used source of knowledge. It brings a huge

amount of heterogeneous information (text,

image, videos, etc.). Among this

information, the unstructured textual content

remains the most important. In fact, textual data are very rich in important kinds of information (such as named entities, terms, collocations, etc.) which are useful in several applications (such as indexing, machine translation, document classification, construction of linguistic resources, lexicography, parsing, etc.).

There are several tools for automatic

knowledge retrieval from the web. These

techniques often use linguistic, statistical or

hybrid approaches. Each approach tries to

exploit linguistic and statistical features of

the information to be retrieved.

In this paper we will focus on the

retrieval of a particular kind of knowledge,

i.e. word collocations, which are


characterized by specific linguistic and

statistical properties (Seretan, 2011).

According to Manning and Schütze (1999),

they can be characterized by three linguistic

properties:

Limited compositionality: the

meaning of the collocation is not a

composition of the meanings of its

parts. For example, the meaning of

the collocation “strong tea” is

different from the composition of the

meaning of “strong” and the

meaning of “tea”.

Limited substitutability: we cannot

substitute a part of a collocation by

its synonym. For example, “strong”

in “strong tea” cannot be substituted

by “muscular”.

Limited modifiability: many

collocations cannot be supplemented

by additional words. For example,

the collocation “to kick the bucket”

cannot be supplemented as “to kick

the {red/plastic/water} bucket”

(Wermter and Hahn, 2004).

Indeed, the retrieval of collocations

requires two main tasks: recognizing

interesting collocations in the text;

classifying them according to classes

predefined by the expert. Firth (1957)

asserts that “you shall know a word by the

company it keeps”. In this perspective, the

techniques used for the collocations retrieval

are often based on the calculation of the

joint frequency of a pair of words within a

sliding window of fixed size (Church et al.,

1989). In practice, the joint frequency is

used to calculate a score that measures the

attachment force between two words in a

given text. If this force exceeds a threshold

fixed a priori, we can judge in this case that

the couple can form a pertinent collocation.

The collocation retrieval problem can be

seen as a binary classification problem.

Collocations will be classified by the system

into two classes: relevant and irrelevant.

This classification depends mainly on two

parameters: the statistical value used to

weight collocations, and the threshold value

used for the discrimination. Estimation of

the discrimination threshold is a well-known problem in several scientific fields (such

as: signal processing, image processing,

information retrieval, etc.). In fact we can

find in the literature a wide range of

machine learning techniques for the

prediction of the threshold as, among other

things, Bayesian networks (Gustafson et al.,

2009), Perona-Malik Model (Shao and Zou,

2009), genetic algorithms (Li et al., 2012), etc.

As in other areas of knowledge, the choice of the ideal threshold is a recognized problem in the terminology retrieval field. However, we do not find in the literature any exact rules to justify this choice. Instead, a domain expert is tasked with determining the threshold value most suitable for retrieval. The manual estimation of the threshold imposes a significant cost on the performance of retrieval systems.

Thus, in this paper we try to shed light on

the threshold determination problem by

exploring, first, the techniques used in

several scientific areas (such as biomedical

and biometric) and applying them to the

statistical terminology field. Our approach is

mainly based on statistical techniques of

performance measurement of binary

classification systems, namely, ROC,

Precision-Recall, Accuracy and Cost

curves.

The remainder of this paper is structured

as follows. First, we present the theoretical

basis of the statistical approach for

collocations retrieval. Then, we identify the

main measures used to evaluate the

performance of binary classifiers. Finally we


conclude with an exposition of the obtained

results.

2. COLLOCATIONS RETRIEVAL

2.1 Definition

In the statistical approach, collocation is

considered as a sequence of words (n-gram)

among millions of other possible word

sequences. In Church and Hanks' (1990) work, a collocation is defined as a pair of

words that appear together more often than

expected.

Benson (1990) defines collocation as an

arbitrary and recurrent word combination.

This definition emerges from the statistical

interpretation of Firth (1957), which is based

on the following proposition: a collocation

consists of a number of lexemes appearing

often enough in a representative corpus at a

distance <n (see also Halliday (1966)).

Smadja (1993) considers that these

definitions don’t cover some aspects and

properties of collocations that affect a

number of machine applications. Therefore,

he enriches these definitions with four

properties:

Collocations are arbitrary:

collocations are difficult to produce

for second language learners; it’s

difficult to translate a collocation

word-for-word.

Collocations are domain-dependent:

they are related to the treated

knowledge area (medical, biological,

etc.).

Collocations are recurrent: these

combinations are not exceptions;

they are very often repeated in a

given context.

Collocations are cohesive lexical

clusters: the presence of one or

several words of the collocations

often implies or suggests the rest of

the collocation.

In the remainder of this paper, we adopt this definition. In fact, this definition is very appropriate to our application domain, namely the biomedical area, which is very rich in collocations that are considered highly relevant technical terms and can perfectly describe the content of the biomedical corpus. By way of example, we cite

some collocations belonging to the

biomedical field: bilateral hemianopia,

cobalt allergy, EDTA molecule, ribonucleic

acids, etc.

Subsequently, we present the statistical

technique used for collocations retrieval.

2.2 Statistical approach for collocations retrieval

The collocations retrieval technique is

based on a simple principle: if two words

frequently appear together, then there’s a

chance that they form a meaningful lexical

sequence (or a pertinent term). In practice,

we calculate a score that measures the

attachment force between two words in a

given text. If this force exceeds a threshold

fixed a priori, we can judge in this case that

the couple can form a pertinent term.

Before calculating the words' joint frequency, we must first reduce them to a

canonical form. Stemming (or

lemmatization) can solve the terminological

variation problem. The terminological

variation can be graphical, inflectional or

otherwise. In fact, terminology variation can

disrupt the results if we consider two

variations of the same token as two different

Figure 1. Example of a sliding window of size 5.


units. Example, the lexical units "acids" and

"Acid" are reduced to their canonical form

"acid" that will accumulate the frequencies

sum of all its variations.
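A minimal sketch of this normalization step (in Python), using NLTK's Porter stemmer as a readily available stand-in for the lemmatizer actually employed in this work (TreeTagger); the token list is illustrative only:

    # Reduce tokens to a canonical form so that variants such as "acids" and
    # "Acid" accumulate a single frequency count. NLTK's Porter stemmer is used
    # here only as a stand-in for the lemmatizer (TreeTagger) used in the paper.
    from collections import Counter
    from nltk.stem import PorterStemmer

    def normalize(tokens):
        stemmer = PorterStemmer()
        return [stemmer.stem(token.lower()) for token in tokens]

    frequencies = Counter(normalize(["Acid", "acids", "acid", "hepatitis"]))
    # the three variants of "acid" are merged into a single entry of the counter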

Next, we apply to the text a common statistical technique presented by Church and Hanks (1990). This technique consists in

moving a sliding window of size T over the

text (see Fig. 1).

For each pair of lemmas (w1,w2),

(w1,w3),…, (w1,wT) we increment its

occurrence frequency in the corpus. We

associate with each couple (wi,wj) a

contingency table (see Table 1).

a = number of occurrences of the pair of lemmas (wi, wj).

b = number of occurrences of pairs in which wi appears as the first item with a second item other than wj.

c = number of occurrences of pairs in which wj appears as the second item with a first item other than wi.

d = number of occurrences of pairs that contain neither of the two lemmas.
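A minimal Python sketch of how these counts could be accumulated; it pairs the first lemma of each window with every other lemma in it, following the (w1,w2), (w1,w3), …, (w1,wT) scheme above. The function names are illustrative, not the authors' implementation:

    from collections import Counter

    def pair_counts(lemmas, window_size):
        """Count co-occurrences of lemma pairs inside a sliding window."""
        counts = Counter()
        for start in range(len(lemmas) - window_size + 1):
            window = lemmas[start:start + window_size]
            head = window[0]
            # pair the first lemma of the window with every other lemma in it
            for other in window[1:]:
                counts[(head, other)] += 1
        return counts

    def contingency(counts, wi, wj):
        """Build the cells a, b, c, d of Table 1 for the pair (wi, wj)."""
        a = counts[(wi, wj)]
        b = sum(n for (x, y), n in counts.items() if x == wi and y != wj)
        c = sum(n for (x, y), n in counts.items() if x != wi and y == wj)
        d = sum(n for (x, y), n in counts.items() if x != wi and y != wj)
        return a, b, c, d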

In the literature, we find several statistical

criteria to determine the attachment force between two words in a text. These measures were initially used in the biology domain to detect possible relationships

between events that occur together. We cite,

inter alia, the mutual information (Shannon,

1948; Fano, 1961), and the likelihood

(Loglike) (Dunning, 1993).

The likelihood (Loglikelihood) (1) is

distinguished from other measures by the

consideration of cases where no occurrences

appear. This measure is frequently used in

the terminology retrieval field as in (Daille

et al., 1996). This measure is used to

calculate the connection force between two

words wi and wj; using our notation (Table 1), we define Loglike(wi, wj) as (1):

Loglike(wi, wj) = a log a + b log b + c log c + d log d − (a + b) log(a + b) − (a + c) log(a + c) − (b + d) log(b + d) − (c + d) log(c + d) + (a + b + c + d) log(a + b + c + d)    (1)

Roche in (Roche et al., 2004) defines a new

measure (OccL) that combines likelihood

with joint occurrence (2).

OccL(wi, wj) = Loglike(wi, wj) × a    (2)
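Both scores can be computed directly from the contingency cells; a minimal Python sketch following the reconstructions of (1) and (2) above, with the usual convention 0·log(0) = 0:

    import math

    def xlogx(x):
        # convention 0 * log(0) = 0, so empty contingency cells do not break the score
        return x * math.log(x) if x > 0 else 0.0

    def loglike(a, b, c, d):
        """Dunning's log-likelihood association score, eq. (1)."""
        return (xlogx(a) + xlogx(b) + xlogx(c) + xlogx(d)
                - xlogx(a + b) - xlogx(a + c)
                - xlogx(b + d) - xlogx(c + d)
                + xlogx(a + b + c + d))

    def occ_l(a, b, c, d):
        """OccL: log-likelihood weighted by the joint occurrence count a, eq. (2)."""
        return loglike(a, b, c, d) * a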

It should be noted that there are recent

works that combine (at the same time)

multiple statistical measures instead of using

a single measure (Lin et al., 2008).

Pecina and Schlesinger (2006) present a

comparative study of 82 association

measures and show that the "context cosine

similarity in boolean vector space" provides

better performance. In the same context,

Pecina shows that SVM (support vector

machines) can combine all association measures, offering optimal performance

(Pecina, 2010).

Ellis and Ferreira-Junior (2009) propose

to use a new association measure called ΔP.

This measure is based on conditional

probabilities (the outcome given the cue).

Other recent research, as in (Petrović et al., 2010; Fkih and Omri, 2013), has adapted measures used primarily to extract bigrams in order to retrieve collocations of arbitrary size.

The choice of the most effective

association measure is a relative task and

cannot be easily generalized to other

contexts as shown in (Evert and Krenn,

2005). In fact this task is related to several

TABLE 1 CONTINGENCY TABLE

                          wj          wj' with j' ≠ j
    wi                    a           b
    wi' with i' ≠ i       c           d


constraints such as the type and the size of

the corpus used and the target application.

Based on experimental studies, Roche (Roche et al., 2004) has shown that the OccL measure has a higher discriminative ability than other measures and offers, therefore, the best results. Roche made his experiments on a corpus very similar to our test corpus. That is why we choose this

measure to retrieve collocations in our work.

The classification of collocations into the two classes

"relevant" and "irrelevant" is based mainly

on the choice of an optimal decision

threshold that can provide maximum

performance to the retrieval system.

Thereafter, we make a survey of the work

concerning the measurement of binary

classifiers performance.

2.3 Statistical threshold estimation

Statistical thresholds are often used to

classify events having similar behavior

dependent on a statistical criterion. Indeed,

this similarity will induce an ambiguity in a

decision-making or classification process. In

a statistical test, the events are presented by

a weight that quantifies their importance in

the test based on statistical observations. In

the case of a binary classification, events are

classified into two classes: a class

representing the events with statistical indices above the threshold and a class that

represents events with indices below the

statistical threshold.

The problem of prediction of the

discrimination threshold lies at the intersection of multiple research disciplines. Indeed, this problem is

addressed by several researchers in different

areas. We cite, among others, econometrics

(Hansen, 1999), genetics (Churchill and

Doerge, 1994), biochemistry (Keller et al.,

2002).

There are several techniques for

predicting the discrimination threshold; these techniques are generally based on approaches that emerged from the fields of

Artificial Intelligence and Performance

Evaluation.

The threshold can be simply chosen as

some high percentile of the data

(DuMouchel, 1983) or estimated graphically

using the mean excess plot (Embrechts et

al., 1997) which is a tool widely used in the

study of risk, insurance and extreme values.

Behrens in (Behrens et al., 2004) determines

the threshold by proposing a parametric

form to fit the observations below it and a

generalized Pareto distribution (GPD) for

the observations beyond it.

In the next section, we present the basic

performance metrics used for the

performance evaluation of the binary

classifiers.

3. PERFORMANCE MEASURES IN BINARY CLASSIFICATION

In our case, our system (S) classifies

collocations into two classes: relevant and

irrelevant collocations. Collocations correctly classified by the system in the class "relevant" are called "true positives"; collocations wrongly placed in this class are called "false positives". Collocations correctly classified by the system in the class "irrelevant" are called "true negatives"; collocations wrongly placed in this class are called "false negatives".

The relevance of the extracted collocations

is reviewed by a domain expert (researcher,

linguist, engineer…) who will decide

whether the collocations make sense or not.

TABLE 2 Decision Matrix

                                                         Relevant collocations     Irrelevant collocations
    Collocations evaluated as relevant by the system     True Positives (TP)       False Positives (FP)
    Collocations evaluated as irrelevant by the system   False Negatives (FN)      True Negatives (TN)

The

ultimate goal of this technique is to

maximize the rates of true positives and true negatives, and to minimize the rates of false positives and false negatives.

Subsequently, we can build a Decision

Matrix (Table 2) that summarizes all the

necessary measures for the performance

evaluation of a binary classification system.

Thus, we can calculate the four following

performance indices:

True Positives (TP): number of

correctly classified positive

examples

False Negatives (FN): number of

incorrectly classified positive

examples

True Negatives (TN): number of

correctly classified negative

examples

False Positives (FP): number of

incorrectly classified negative

examples

If the system is perfectly discriminative

(i.e.: the knowledge of statistical values

allows it to predict without fail), we

would have: FP=FN=0.
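As an illustration, a small Python sketch of how the four indices can be obtained for one threshold value, assuming the system output is a dictionary mapping each candidate collocation to its statistical weight and the expert judgment is a set of relevant collocations (both names are hypothetical):

    def decision_matrix(scores, expert_relevant, threshold):
        """Return (TP, FP, TN, FN) for a {collocation: weight} dict and an expert set."""
        tp = fp = tn = fn = 0
        for collocation, weight in scores.items():
            predicted_relevant = weight >= threshold      # decision taken by the system
            actually_relevant = collocation in expert_relevant
            if predicted_relevant and actually_relevant:
                tp += 1
            elif predicted_relevant:
                fp += 1
            elif actually_relevant:
                fn += 1
            else:
                tn += 1
        return tp, fp, tn, fn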

In the following, we present the different

techniques used to select the optimal cut-point that can provide the maximum performance to a binary classification system.

3.1 ROC curves

Originally, ROC curves have long been

used in signal detection theory to depict the

tradeoff between hit rate and false alarm rate

of classifiers (Swets et al., 2000).

Subsequently, they are used in the statistics

field to evaluate the performance of binary

classifiers. Graphically, the ROC curve

shows the rate of correct classifications

(called true positive rate) as a function of the

number of incorrect classifications (false

positive rate) for a set of results provided by

the system to be evaluated (see Fig. 2).

ROC analysis has been extended for use

in several areas in the computer science

field. Bradley (1997) was the first to use the

ROC curves for evaluation and comparison

of learning algorithms. In the field of

statistical terminology, Fkih and Omri

(2012) used ROC curves for learning the

size of the sliding window used for

terminology retrieval. For a deeper

discussion on the use of ROC curves, you

can see Hand (2009) and Krzanowski and

Hand (2009).

Based on Table 2 notations, we define the

following terms (Fawcett 2004):

Sensitivity (3) (also known as

Fraction of True Positives):

proportion of positive collocations

detected by the test. In other words,

the sensitivity measures how effective the test is when used

on positive collocations. The test is

perfect for positive collocations

when sensitivity is 1, equivalent to a

random draw when sensitivity is 0.5.

If it is below 0.5, the test is counter-productive.

Se = TP / (TP + FN)    (3)

Specificity (4) (also known as

Fraction of True Negatives):

proportion of negative collocations

detected by the test. Specificity

measures how effective the test is when used on negative

collocations. The test is perfect for

negative collocations when the

specificity is 1, equivalent to a

random draw when specificity is 0.5.

If it is below 0.5, the test is counter-productive.

Sp = TN / (TN + FP)    (4)

In our work, we are interested in two

performance indices that can be deduced



from the ROC curve: Points on curve closest

to the (0,1) and Youden Index (J) (see Fig.

2). Each of the two criteria gives equal

weight to sensitivity and specificity.

3.1.1 Points on curve closest to the (0, 1)

According to Tilbury in (Tilbury et al.,

2000), if the distributions of the 2 classes

(relevant and irrelevant collocation) are well

separated, the curve will immediately rise to

the top left corner (0, 1), and then proceed

horizontally. If the distributions tend to

overlap, so that relevant and irrelevant

collocations cannot be distinguished by the

measurement, the curve will approach the

diagonal ((0,0) to (1,1)). As shown in Figure

2, the Points on curve closest to the (0, 1)

index measures the distance between a point

on the curve and the upper left corner of the

graph (point (0,1)). The goal is to minimize

this distance, then the point closest to the

point (0,1) is selected as "optimal point"

which ensures both a high sensitivity and

specificity to the system. In the following

we denote by DPCC(M) the distance

between the point (0,1) and a point M

belonging to the curve.

For each point M(1-Sp,Se) belonging to

the ROC curve, we have (5):

DPCC(M) = √((1 − Sp)² + (1 − Se)²)    (5)

Let T be the optimum point which ensures

both a high sensitivity and specificity to the

system. Thus, we have (6):

DPCC(T) = min_M DPCC(M)    (6)

3.1.2 Youden Index (J)

The Youden index (J) (Youden, 1950) is

a function of sensitivity (Se) and specificity

(Sp) and is a commonly used measure of overall

diagnostic effectiveness (Schisterman et al.,

2005). The goal is to maximize the vertical

distance from line of random chance to the

point M (as shown in Fig. 2).

There are several mathematical methods

for estimating the Youden index (J). For

more details you can see (Ruopp et al.,

2008; Martínez-Camblor, 2011).

For each point M(1-Sp,Se) belonging to

the ROC curve, we have (7):

J(M) = Se(M) + Sp(M) − 1    (7)

Let T be the optimum point; we have (8):

J(T) = max_M J(M)    (8)
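Given the points of a ROC curve, the two criteria reduce to a simple scan; a Python sketch, where roc_points is assumed to be a list of (threshold, sensitivity, specificity) triples computed beforehand:

    import math

    def dpcc(se, sp):
        """Distance from the ROC point (1 - Sp, Se) to the ideal corner (0, 1), eq. (5)."""
        return math.sqrt((1.0 - sp) ** 2 + (1.0 - se) ** 2)

    def youden(se, sp):
        """Youden index J = Se + Sp - 1, eq. (7)."""
        return se + sp - 1.0

    def roc_optimal_thresholds(roc_points):
        """roc_points: list of (threshold, Se, Sp); returns (DPCC-optimal, Youden-optimal) thresholds."""
        points = list(roc_points)
        t_dpcc = min(points, key=lambda p: dpcc(p[1], p[2]))[0]      # eq. (6)
        t_youden = max(points, key=lambda p: youden(p[1], p[2]))[0]  # eq. (8)
        return t_dpcc, t_youden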

3.2 Precision-Recall Curves

We define precision and recall as:

1. Precision (9): the ratio of the number of relevant extracted collocations to the total number of extracted collocations

precision = TP / (TP + FP)    (9)

2. Recall (10): the ratio of the number of relevant extracted collocations to the total number of relevant collocations

recall = TP / (TP + FN)    (10)

Figure 2. ROC curve, point on curve closest to (0,1), and Youden Index.


Precision-Recall (PR) curves are

commonly used in the Information Retrieval

field as in (Manning and Schütze 1999,

Raghavan et al. 1989). Also, they are used

by Pecina (2010) to evaluate the

performance of collocation retrieval

methods. This evaluation is based on

measuring the quality of candidates ranking

based on their chance to form collocations.

As shown in Figure 3, precision is a

decreasing function of the recall. In our

case, we give equal importance to precision

and recall. Then the optimal threshold is the

point (M) on the curve such that (11):

precision(M) = recall(M)    (11)
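A short Python sketch of this criterion, where pr_points is assumed to be a list of (threshold, precision, recall) triples; the break-even point is approximated by the point where the two values are closest:

    def precision(tp, fp):
        return tp / (tp + fp)            # eq. (9)

    def recall(tp, fn):
        return tp / (tp + fn)            # eq. (10)

    def break_even_threshold(pr_points):
        """pr_points: list of (threshold, precision, recall); pick the point where they cross, eq. (11)."""
        return min(pr_points, key=lambda p: abs(p[1] - p[2]))[0]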

3.3 Accuracy

Accuracy refers to the degree of

closeness or conformity to the true value of

the quantity under measurement. It is defined as (12):

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (12)

Accuracy is an indicator commonly used

to measure the performance of binary

classifiers as in (Mazurowski and Tourassi,

2009; Bradley, 1997; Provost et al., 1998;

Sokolova et al., 2006). The goal is to

maximize the Accuracy function. So to get

the optimal threshold, we must select the

peak of the curve (see Fig. 4).

Let T be the optimum point; we have (13):

Accuracy(T) = max_M Accuracy(M)    (13)
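The corresponding selection rule, as a short Python sketch over a hypothetical mapping from threshold values to their (TP, TN, FP, FN) counts:

    def accuracy(tp, tn, fp, fn):
        """Fraction of correctly classified collocations, eq. (12)."""
        return (tp + tn) / (tp + tn + fp + fn)

    def accuracy_optimal_threshold(counts_by_threshold):
        """counts_by_threshold: {threshold: (TP, TN, FP, FN)}; select the peak of the curve, eq. (13)."""
        return max(counts_by_threshold,
                   key=lambda t: accuracy(*counts_by_threshold[t]))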

3.4 Decision graph (Cost/OccL)

Each performance index (see Table 2)

gives an idea about the types of errors that

may occur during the process of

classification. These errors are mainly

caused by a misclassification of positive and

negative instances. In order to determine the

value of the classification threshold which

reduces the risk of errors, a function which

calculates the misclassification cost is used.

The optimal threshold in this case is the one

that provides a minimum cost.

Practically, the cost function depends on

four parameters: TP, TN, FP and FN. Each

parameter is weighted according to its

importance.

Figure 3. Precision-Recall curve.

Figure 4. Accuracy curve.

Figure 5. Cost curve.

Cost(M) = a·TP + b·TN + c·FP + d·FN, where a, b, c and d are positive numbers.


The weighting of the parameters is very useful

in many applications. For example, in a

cancer diagnosis test, the misclassification

of a "true positive" can induce the death of a

person. To avoid this serious error, the

"False Negative" parameter will be strongly

weighted in the cost function.

In our application (retrieval of

collocations), we want to strongly penalize

the misclassification of relevant collocations

(FN) and the misclassification of irrelevant

collocations (FP). Thus, we use the

following cost function:

Cost(M) = a·TP + b·TN + 2·FP + 4·FN

Graphically, we plot the cost by varying

the OccL. The resulting curve decreases until

reaching the optimal threshold and then

increases again (see Fig. 5).

Let T be the optimum point; we have:

Cost(T) = min_M Cost(M)
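A Python sketch of this last criterion; the paper fixes the FP and FN weights at 2 and 4 but leaves a and b symbolic, so all four weights are exposed as parameters here (the defaults a = b = 1 are an assumption made only for the example):

    def cost(tp, tn, fp, fn, a=1.0, b=1.0, c=2.0, d=4.0):
        """Weighted cost of a classification outcome; c and d penalize FP and FN, as in the text."""
        return a * tp + b * tn + c * fp + d * fn

    def min_cost_threshold(counts_by_threshold, **weights):
        """counts_by_threshold: {threshold: (TP, TN, FP, FN)}; pick the minimum-cost threshold."""
        return min(counts_by_threshold,
                   key=lambda t: cost(*counts_by_threshold[t], **weights))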

4. EXPERIMENTAL STUDY AND RESULTS EVALUATION

4.1 Corpus description

During this work, we used an extract from the medical corpus MEDLARS1 (75 documents) for the experimentation phase.

An expert in the medical field (university

1 ftp://ftp.cs.cornell.edu/pub/smart

professor in medicine) manually prepared a

list of relevant collocations extracted from

the corpus. This list contains 320 technical

collocations in the medical field. Examples

of collocations existing in the list: fetal

plasma, maternal level, diabetic syndrome,

histological examination, nucleic acid, etc.

Cleaning the corpus is a necessary task to

remove all empty words such as articles

(the, a, an, some, any), conjunctions (before,

when, so, if, etc...), pronouns (personal,

relative, possessive and demonstrative) and

punctuation. Indeed, these words have high

frequency in text documents without having real terminological importance. Thus it is essential to remove them to avoid hampering the statistical computation and introducing, subsequently, noise into the results.

The lemmatization task is performed by

TreeTagger2, a part-of-speech tagger with acceptable performance.

4.2 Results

We apply the sliding window technique to

the corpus. Each time, we vary the window

size starting with 2 until reaching 10. For

each window size, we get a list of

collocations ordered in descending order of their OccL value. Table 3 contains the first

20 collocations from a list obtained by the

use of a window of size 4.

Therefore, we get 9 lists of collocations

ordered in descending order (a list for each

window size). For each list, we select the top

400 collocations (hereinafter referred to as

list-S). List-S contains collocations

evaluated as relevant by the system; the

other collocations (not belonging to list-S)

are evaluated by the system as irrelevant. A

collocation belonging to the list-S is seen as

a true positive (TP) if it belongs to the

expert list, else it is considered as false

positive (FP). A collocation belonging

2 http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/


neither to the expert list nor to the list-S is counted as a true negative (TN). A collocation not belonging to the list-S but belonging to the expert list is considered as a false negative (FN).

Using XLSTAT3, a statistical and data

analysis software, we vary for each

collocation list the threshold value of

discrimination and we calculate for each

threshold value the basic measures of

performance (TP, TN, FP, FN). Thus, we

plot 9 ROC curves (a curve for each window

size from 2 to 10). As an example, Fig. 2

and 5 show, respectively, the curves

3 http://www.xlstat.com/fr/

corresponding to the windows of size: 4, 5

and 6.

TABLE 3 Sample of collocations extracted by a window of size 6

    Collocations                     OccL
    fatty Acid                       62.0202
    chronic hepatitis                27.0547
    nephrotic syndrome               26.5593
    non-esterified fatty             26.1130
    non-esterified acid              25.7643
    maternal fetal                   23.9892
    renal biopsy                     23.3108
    hepatitis mercaptopurine         22.6240
    active hepatitis                 21.0481
    Maternal Level                   20.8780
    Maternal Plasma                  20.3852
    mercaptopurine azothioprine      19.5423
    Blood glucose                    18.8333
    collagen disease                 18.5864
    Acid Adrenaline                  17.7783
    Fetal Plasma                     17.7101
    Lupus erythematosus              17.5947
    Total Lipid                      17.0910
    hepatitis Azothioprine           16.9174
    Cord blood                       16.0717


Figure 5. ROC curves corresponding to the windows of size 5 (a) and 6 (b).

Figure 6. Precision-Recall curves corresponding to the windows of size 2 (a), 3 (b), 4 (c) and 5 (d).

Figure 7. Accuracy curves corresponding to the windows of size 6 (a), 5 (b), 8 (c) and 9 (d).

Similarly, we plot 9 Precision-Recall and

Accuracy curves. Fig. 3 shows the

Precision-Recall curve corresponding to the

window of size 9. Also, Fig. 6 shows,

respectively, the Precision-Recall curve

corresponding to the windows of size 2 (Fig.

6.a), 3 (Fig. 6.b), 4 (Fig. 6.c) and 5 (Fig.

6.d). Figures 4 and 7 show the accuracy

curves corresponding to the windows of size

7 (Fig. 4), 6 (Fig. 7.a), 5 (Fig. 7.b), 8 (Fig.

7.c) and 9 (Fig. 7.d).

We summarize all the results in Table 4. Thereafter, we refer to each technique by an

acronym: Points on curve closest to the (0,

1) (DPCC); Youden Index (Youden);

Precision-Recall curves (PR) and Accuracy

(Accuracy).

For each window size we apply different

performance evaluation techniques (already

Figure 7. Accuracy curves corresponding to the windows of size 4 (a), 6 (b), 8 (c) and 10 (d).

TABLE 4 SUMMARY OF OBTAINED THRESHOLDS

    Window Size   DPCC     Youden   PR        Cost      Accuracy   Threshold_Moy
    2             6.420    6.420    8.1410    7.9328    16.0535    7.2284
    3             9.36     9.019    12.0147   9.0187    15.0471    9.8530
    4             9.848    9.711    11.9115   9.7107    14.0928    10.2953
    5             10.059   10.530   11.4739   10.5303   13.1906    10.6485
    6             10.174   9.406    11.3025   9.4058    12.6238    10.0720
    7             10.16    11.928   11.8556   11.8556   13.5917    11.4499
    8             11.209   11.337   12.2344   12.0605   12.8440    11.7104
    9             11.711   11.904   13.1063   11.9042   14.7669    12.1564
    10            11.659   11.659   14.4878   11.6586   16.6088    12.3660


detailed in the preceding paragraphs) on the

collocations extracted by the statistical

approach (see Table 3). For each

performance measure (DPCC, Youden, PR,

Accuracy) we give the value of the obtained

discrimination threshold (see Table 4).

4.3 Evaluation and discussion

We note that the three measures: DPCC,

Youden and PR have nearly similar

behaviour (Fig. 8). For each of the three

measures, the decision threshold value

increases with the size of the window. In

fact, statistics show that the retrieval by a

window of size 2 (the minimum window

size) requires a lower decision threshold compared to larger windows.

Similarly, the decision threshold required for

collocations retrieval with a window size of

10 is the highest compared to the narrower windows. From the above we can establish

the following relationship (14):

Let fi be an extraction window of size i, and Threshold(fi) the decision threshold of a window of size i. Then, for all i ≤ j, we have Threshold(fi) ≤ Threshold(fj).    (14)

Accuracy has a different behaviour from the other indices. It starts with a high threshold value; it decreases with the size of

the window until it reaches its minimum at

the window size of 6 and then increases

again.

This dependence between the size of the

window and the decision threshold is

intuitively logical. In fact, if we increase the

size of the window, the number of possible combinations between lexical units in the same window increases simultaneously, which increases, therefore,

the statistical weight of all collocations in

the corpus. So it is clear that the threshold

should be adjusted with the overall increase

in collocation weights.

We can observe the great similarity between

the curves representing the Youden Index

and the DPCC (see Fig. 8). Practically, the

two indices give (almost) the same results

for 7 different window sizes (window size:

2, 3, 4, 5, 8, 9 and 10). Similarly, for the window sizes 6 and 7, the discrepancy between the two measures is very small.

PR gives threshold values slightly higher

than those given by the Youden index and the

DPCC, but the values remain very close.

Figure 8. Variation of performance measures with the window size.



The Accuracy index is still the exception; it provides values that are very distant from the other

performance indices (e.g. threshold = 16 for

window size 2).

The results show that the Accuracy index

is not in harmony with the other

performance indices in its behaviour and

values. We judged that this index does not

provide any relevant knowledge to assist the

decision of the optimal threshold.

We may conclude that the indices: DPCC,

Youden and PR can provide interpretable

knowledge for the determination of the

optimal threshold for automatic collocation retrieval systems. For these reasons, we

choose in our retrieval system to use the

average of the values provided by these

three performance indices (15).

We consider the case of a list of

collocations extracted by a window of size

4. The graph of Figure 9 shows the evolution of the values of TP (true positives), TN (true negatives), FP (false

positives) and FN (false negatives)

depending on the selected threshold value.

The discrimination threshold divides the

graph into two parts (see the dotted lines in

Fig. 9.a and Fig. 9.b). The two parts describe

the evolution of the basic performance

measures (TP, TN, FP and FN) before and

after the selected threshold. We note a

decrease in rates of FP and TP and an

increase in TN and FN after the threshold

which induces, therefore, an increase in

precision and a decrease in recall. Our

chosen threshold (10.49), for a retrieval with a window of size 4, can be a good compromise; it tries to take into consideration the four basic evaluation measures.

5. CONCLUSION

In this paper, we have presented a

detailed theoretical study on techniques for measuring the performance of binary

classifiers. Then, we used these techniques

to determine the optimal threshold for

collocations retrieval from text documents.

To this end, we conducted a deep empirical study on a biomedical text corpus.

It is clear that the collocations retrieval

problem is a binary classification problem.

Indeed, we must classify collocations

(extracted by statistical method using a

sliding window) in two different classes:

“relevant” and “irrelevant” collocations.

Each collocation is weighted by a statistical

weight which measures the strength of

connection between its components.

Collocations classification relies, mainly, on

the use of a discrimination statistical

threshold. In fact, collocations which have a

statistical weight above the threshold are

classified as "relevant". The same,

collocations which have a statistical weight

below the threshold are classified as

"irrelevant."

The main goal of our work is to improve

the classification quality of collocations. Practically, we used four conventional statistical techniques, namely the Youden Index,

Points on curve closest to the (0, 1),

Precision-Recall curves and Accuracy

curves. These techniques attempt to estimate

the discrimination threshold that ensures

optimal performance for collocation

retrieval systems.


The results show the possibility of

benefiting from performance measures for

the automatic selection of an a priori decision

threshold that can bring gains in time and

resources to automatic collocations retrieval

systems.

References

Behrens, C.N., Lopes, H.F. and Gamerman, D. 2004. ‘Bayesian analysis of extreme events with threshold estimation.’ Statistical Modelling - STAT MODEL, vol. 4, no. 3, pp. 227-244.

Benson, M. 1990. ‘Collocations and general-purpose dictionaries.’ International Journal of Lexicography, vol. 3, no. 1, pp. 23-34.

Bradley, A.P. 1997. ‘The use of the area under the ROC curve in the evaluation of machine learning algorithms.’ J. Pattern Recognition, vol. 3, no. 3, pp. 1145-1159.

Church, K., Gale, W., Hanks, P. and Hindle, D. 1989. ‘Parsing, word associations and typical predicate-argument relations.’ Proc. The Workshop on Speech and Natural Language (HLT '89). Association for Computational Linguistics, Stroudsburg, PA, USA.

Church, K. and Hanks, P. 1990. ‘Word association norms, mutual information and lexicography.’ J. Computational Linguistics, vol. 16, no. 1, MIT Press, Cambridge, MA, USA, pp. 22-29.

Churchill, G.A. and Doerge, R.W. 1994. ‘Empirical threshold values for quantitative trait mapping.’ Genetics, vol. 138, no. 3, pp. 963-971.

Daille, B., Gaussier, E. and Langé, J. 1996. ‘An evaluation of statistical scores for word association.’ The Tbilisi Symposium on Logic, Language and Computation: Selected Papers. Studies in Logic, Language and Information. J. Ginzburg, Z. Khasidashvili, C. Vogel, J.-J. Lévy and E. Vallduví, eds., CSLI Publications, pp. 177-188.

DuMouchel, W.H. 1983. ‘Estimating the stable index a in order to measure tail thickness: a critique.’ The Annals of Statistics, vol. 11, no. 4, pp. 1019-1031.

Dunning, T. 1993. ‘Accurate Methods for the Statistics of Surprise and Coincidence.’ J. Computational Linguistics - Special issue on using large corpora, vol. 19, no. 1, MIT Press, Cambridge, MA, USA, pp. 61-74.

Figure 9. Evolution of the values of TP, TN, FP and FN depending on the selected threshold value: windows of size 4 (a) and 6 (b).


Ellis, N. C. & Ferreira-Junior, F. 2009.

‘Constructions and their acquisition: Islands

and the distinctiveness of their occupancy’.

Annual Review of Cognitive Linguistics, 7,

187–220.

Embrechts, P., Klüppelberg, C., Mikosch,

T. 1997. Modelling extremal events for

insurance and finance. New York: Springer.

Evert, S. and Krenn, B. 2005. ‘Using

small random samples for the manual

evaluation of statistical association

measures’. J. Computer Speech &

Language. Vol. 19, no. 4, pp. 450–466.

Elsevier.

Fano, R. 1961. ‘Transmission of

Information: A Statistical Theory of

Communications.’ MIT Press, Cambridge,

MA.

Fawcett, T. 2004. ‘ROC Graphs: Notes

and Practical Considerations for

Researchers.’ J. Pattern Recognition Letters,

vol. 27, n° 8, pp. 882-891.

Fkih, F. and Omri, M.N. 2012. ‘Learning

the Size of the Sliding Window for the

Collocations Extraction: a ROC-Based

Approach.’ Proc. The 2012 International

Conference on Artificial Intelligence

(ICAI'12), Las Vegas, USA, pp. 1071-1077.

Fkih, F. and Omri, M.N. 2013. ‘A

Statistical Classifier based Markov Chain

for Complex Terms Filtration’. Proc.

International Conference on Web and

Information Technologies (ICWIT'13),

Hammamet, Tunisia.

Firth, J. R. 1957. ‘A synopsis of

linguistic theory 1930-1955.’ Studies in

Linguistic Analysis, pp. l-32. Oxford:

Philological Society.

Gustafson, S.C., Costello, C.S., Like,

E.C., Pierce, S.J., Shenoy, K.N. 2009.

‘Bayesian Threshold Estimation.’ J. IEEE

Transactions on Education, vol. 52, no. 3,

pp. 400- 403.

Halliday, M.A.K. 1966. ‘Lexis as a

Linguistic Level.’ In Meomory of J.R. Firth,

C.E. Bazell, J.C. Catford, M.A.K. Halliday,

and R.H. Robins, eds., London: Longmans,

pp.148-162.

Hand, D.J. 2009. ‘Measuring classifier

performance: a coherent alternative to the

area under the ROC curve’. J. Machine

Learning, vol. 77, no. 1, pp. 103-123.

Springer.

Hansen, B.E. 1999. ‘Threshold effects in

non-dynamic panels: Estimation, testing,

and inference.’ J. Journal of Econometrics.

vol .93, no. 2, pp 345–368.

Keller, A., Nesvizhskii, A.I, Kolker, E.,

and Aebersold, R. 2002. ‘Empirical

Statistical Model To Estimate the Accuracy

of Peptide Identifications Made by MS/MS

and Database Search.’ J. Analytical

Chemistry, vol.74, no. 20, pp. 5383–5392.

Krzanowski, W. J. and Hand, D. J. 2009.

ROC curves for continuous data. Chapman

and Hall, London.

Li, J., Cheng, C., Jiang, T., Grzybowski,

S. 2012. ‘Wavelet de-noising of partial

discharge signals based on genetic adaptive

threshold estimation.’ J. IEEE Transactions

on Dielectrics and Electrical Insulation, vol.

19, no. 2, pp. 543- 549.

Lin, J.F., Li, S. and Cai, Y. 2008. ‘A new

collocation extraction method combining

multiple association measures,’ Proc.

International Conference on Machine

Learning and Cybernetics, pp. 12-17.

Manning, C.D. and Schütze, H. 1999.

‘Foundations of statistical natural language

processing.’ MIT Press Cambridge, MA,

USA.

Martínez-Camblor, P. 2011.

‘Nonparametric Cutoff Point Estimation for

Diagnostic Decisions with Weighted

Errors’. J. Revista Colombiana de

Estadística, vol. 34, no. 1, pp. 133-146.


Mazurowski, M.A. and Tourassi, G.D. 2009. ‘Evaluating classifiers: Relation between area under the receiver operator characteristic curve and overall accuracy’. Proc. International Joint Conference on Neural Networks, Atlanta, Georgia, USA, pp. 2045-2049.

Pecina, P. 2010. ‘Lexical association measures and collocation extraction’. J. Language Resources and Evaluation, vol. 44, no. 1-2, pp. 137-158. Springer.

Pecina, P. and Schlesinger, P. 2006. ‘Combining association measures for collocation extraction’. Proc. COLING-ACL '06, Proceedings of the COLING/ACL, pp. 651-658. Association for Computational Linguistics, Stroudsburg, PA, USA.

Petrović, S., Šnajder, J. and Bašić, B.D. 2010. ‘Extending lexical association measures for collocation extraction’. J. Computer Speech & Language, vol. 24, no. 2, pp. 383-394. Elsevier.

Provost, F. J., Fawcett, T. and Kohavi, R. 1998. ‘The case against accuracy estimation for comparing induction algorithms.’ In Proceedings of the Fifteenth International Conference on Machine Learning, pp. 445-453.

Raghavan, V., Bollmann, P. and Jung, G.S. 1989. ‘A critical investigation of recall and precision as measures of retrieval system performance.’ J. ACM Transactions on Information Systems (TOIS), vol. 7, no. 3, pp. 205-229.

Roche, M., Azé, J., Kodratoff, Y. and Sebag, M. 2004. ‘Learning Interestingness Measures in Terminology Extraction: A ROC-based approach.’ Proc. ROC Analysis in AI Workshop (ECAI 2004), Valencia, Spain.

Ruopp, M.D., Perkins, N.J., Whitcomb, B.W. and Schisterman, E.F. 2008. ‘Youden Index and Optimal Cut-Point Estimated from Observations Affected by a Lower Limit of Detection.’ J. Biometrical Journal, vol. 50, no. 3, pp. 419-430.

Schisterman, E.F., Perkins, N.J., Liu, A. and Bondell, H. 2005. ‘Optimal cut-point and its corresponding Youden Index to discriminate individuals using pooled blood samples.’ J. Epidemiology, vol. 16, no. 1, pp. 73-81.

Seretan, V. 2011. ‘Syntax-Based Collocation Extraction.’ Series: Text, Speech and Language Technology, vol. 44, Springer.

Shannon, C.E. 1948. ‘A mathematical theory of communication.’ J. Bell System Technical Journal, vol. 27, no. 3, pp. 379-423.

Smadja, F. 1993. ‘Retrieving Collocations from Text: Xtract.’ J. Computational Linguistics - Special issue on using large corpora, vol. 19, no. 1, MIT Press, Cambridge, MA, USA, pp. 143-177.

Sokolova, M., Japkowicz, N. and Szpakowicz, S. 2006. ‘Beyond Accuracy, F-Score and ROC: A Family of Discriminant Measures for Performance Evaluation’. In AI 2006: Advances in Artificial Intelligence, Abdul Sattar and Byeong-ho Kang, eds., Springer Berlin Heidelberg, pp. 1015-1021.

Swets, J.A., Dawes, R.M. and Monahan, J. 2000. ‘Better Decisions through Science.’ J. Scientific American, vol. 283, no. 4, pp. 82-87.

Shao, H. and Zou, H. 2009. ‘Threshold Estimation Based on Perona-Malik Model.’ Proc. International Conference on Computational Intelligence and Software Engineering (CiSE 2009), pp. 1-4.

Tilbury, J.B., Van Eetvelt, W.J., Garibaldi, J.M., Curnsw, J.S.H. and Ifeachor, E.C. 2000. ‘Receiver operating characteristic analysis for intelligent medical systems - a new approach for finding confidence intervals.’ J. IEEE Transactions on Biomedical Engineering, vol. 47, no. 7, pp. 952-963.

Wermter, J. and Hahn, U. 2004. ‘Collocation extraction based on modifiability statistics.’ Proc. The 20th International Conference on Computational Linguistics (COLING '04). Association for Computational Linguistics, Stroudsburg, PA, USA.

Youden, W.J. 1950. ‘Index for rating diagnostic tests.’ J. Cancer, vol. 3, pp. 32-35.