Using Roget's Thesaurus to Determine the Similarity of Texts


Using Roget's Thesaurus to Determine the

Similarity of Texts

Jeremy Ellman

A thesis submitted in partial fulfilment of the

requirements of the University of Sunderland

for the degree of Doctor of Philosophy

June 2000


Abstract

This thesis addresses the problem of extracting a representation of a text's meaning from its

content. The solution investigated is based on the use of Roget's thesaurus as an external

knowledge source and can be used to analyse texts of any length or complexity. The

resulting document representation can then be compared to others, producing a new

method for text similarity assessment.

All coherent texts contain embedded sequences of words that are related in meaning.

These sequences can be detected by identifying simple relationships between the relevant

thesaural entries in which the words are found. The identification of initial sequences

drives the addition of further related words into conceptually related 'lexical chains'.

Although texts differ in content, it is shown that the distribution of the links in these 'lexical chains' is independent of the type of text in which they are embedded, and that this technique is therefore of general applicability.

Every coherent text contains many lexical chains of different lengths and strengths. These

may be used to represent the broad subject matter of a text. By identifying the key concept of each chain and relating this to the chain's strength, we may produce an attribute-value vector of concepts and their strengths. This may then be used to identify other texts as

closer or further away in meaning.

This thesis describes the creation of a tool suitable for the detection of lexical chains in

large texts, and the design and implementation of algorithms to measure text similarity.

The performance of the algorithms has been compared with human judgements and

experimentally verified. The results show that lexical chain based similarity matching is

capable of producing a ranking between a source text and several examples equivalent to

that produced by human subjects. This illustrates the utility of Roget's thesaurus as a

resource for the determination of lexical chains.


Acknowledgements

I would like to thank Bill Black of UMIST and Mark Stairmand for first interesting me in

lexical chains, and for initial discussions on this thesis.

My former employers, The MARI Group, were most generous in encouraging me to start this research and in funding its first three years.

I am extremely grateful to the staff and students of the University of Sunderland who used

their class time to participate in my experiments.

I would like to acknowledge advice from Dr Sharon McDonald on experimental design, from Dr Malcolm Farrow on statistical analysis, and critical insights from my second supervisor, Prof. Gilbert Cockton.

I am also most grateful to Addison Wesley Longman Limited for permission to use The

Original Roget's Thesaurus of English Words and Phrases Copyright © 1987 by

Longman Group UK Ltd. in portions of this work.

I would also like to thank the Karpeles library for permission to include an image of Dr.

Peter Mark Roget's original work.

The work described in this thesis would not have been possible without software written

and supported by the open source community.

Finally, I owe a huge debt to my supervisor, Prof. John Tait.


Table of Contents

Acknowledgements

Chapter 1. Introduction
1.1 Introduction
1.2 Research Problem and Research Questions
1.3 Justification for the Research
1.4 Methodology
1.5 Thesis Overview
1.6 Definition of Terms
1.7 Delimitation of Scope and Key Assumptions
1.8 Summary

Chapter 2. Literature Review
2.1 Introduction
2.2 Information Retrieval
2.3 Case Based Reasoning
2.4 Natural Language Processing
2.5 Conclusions

Chapter 3. Hesperus: A System for Comparing the Similarity of Texts Using Lexical Chains
3.1 Introduction
3.2 Hesperus: A System for Comparing Text Similarity Using Lexical Chains
3.3 A Program to Analyse Lexical Chains in a Text Using Roget's Thesaurus
3.4 An Example
3.5 The Generic Document Profile
3.6 Using the Generic Document Profile to Determine the Similarity of Texts
3.7 Adherence to Zipf's Law
3.8 Visualisation of Results
3.9 Conclusion

Chapter 4. The General Nature of Lexical Links
4.1 Introduction
4.2 Selection of the Experimental Texts
4.3 Reading Complexity of the Texts
4.4 Determination of the Lexical Cohesive Relationships
4.5 Analysis 1: Link Distribution between Documents
4.6 Analysis 2: Link Distributions Change across Different Document Types
4.7 Related Work
4.8 Conformance to Zipf's Law
4.9 Conclusion

Chapter 5. Word Sense Disambiguation and Hesperus
5.1 Introduction
5.2 The Problem of Evaluating the Effects of Word Sense Disambiguation in Hesperus
5.3 The Motivation for Hesperus Participating in Senseval as SUSS
5.4 SUSS: The Sunderland University Senseval System
5.5 Local Disambiguator
5.6 Conclusion

Chapter 6. Evaluating Hesperus
6.1 Introduction
6.2 Hypotheses
6.3 Text Similarity Experiments
6.4 Discussion and Conclusion

Chapter 7. Conclusions and Further Work
7.1 Introduction
7.2 Conclusions about the Research Hypotheses
7.3 Contributions
7.4 Future Work
7.5 Summary

References
Bibliography
Appendix I. Experimental Examples on Rosetta Stone
Appendix II. Experimental Texts
Appendix III. Lexical Chain Visibility Algorithm
Appendix IV. Basic Statistics of the Experimental Data
Appendix V. Help Information Given to Experimental Subjects
Appendix VI. Roget's Thesaurus: A Brief Overview
Appendix VII. Papers Published Related to This Thesis


Index of Figures

Figure 1-1: Conceptual organisation of the chapters of the thesis
Figure 2-1: A classification of text retrieval techniques
Figure 2-2: Case Based Reasoning Cycle
Figure 3-2: Hesperus System Architecture
Figure 5-2: SUSS System design
Figure 5-3: Distribution of the senses of 'Shake'
Figure 6-1: Source text Topic Screen
Figure 6-2: Source and Example Text Comparison
Figure 7-1: Books on Chains in Hereford Cathedral Library
Figure VII-1: Major Headings in Roget's Thesaurus
Figure VII-2: Sub-divisions of 'Abstract Relations'
Figure VII-3: Sub-divisions of 'Space'
Figure VII-4: Sub-divisions of 'Matter'
Figure VII-5: Sub-divisions of 'Emotion'
Figure VII-6: Sub-divisions of 'Volition'
Figure VII-7: Sub-divisions relating to 'Existence'
Figure VII-8: An extract from Roget's thesaurus

Index of Algorithms

Algorithm 3-1: Creation of the E-Roget
Algorithm 3-2: Create Lexical Chains
Algorithm 5-1: SUSS Algorithm Processing Phase
Algorithm 5-2: Generate Links
Algorithm 5-3: Local Word Disambiguation
Algorithm 6-1: Reducing the number of example texts to five
Algorithm III-1: An Algorithm to make the lexical chains in a text visible


Index of Graphs

Graph 3-1: Zipf's Law: GDP Profile Values vs Rank
Graph 4-1: Link Type vs Book Title
Graph 4-2: Identical Links (%) vs Inter-word Distance
Graph 4-3: Percentage of Category Links vs Inter-word Distance
Graph 4-4: Percentage of Group Links vs Inter-word Distance
Graph 4-5: Non-Self Triggers (Beeferman et al. 1997)
Graph 4-6: Self Triggers (Beeferman et al. 1997)
Graph 4-7: Moby Dick: Number of Same Category words vs Rank
Graph 6-1: Copyright: S1 ratings
Graph 6-2: AI: S1 ratings
Graph 6-3: Rosetta: S1 ratings
Graph 6-4: Socialism: S1 ratings
Graph 6-5: Ballot: S1 ratings
Graph 6-6: Breakdance: S1 ratings
Graph VI-1: Distribution of Polysemic Words in Roget's Thesaurus
Graph VI-2: Frequency of Collocations in Roget's Thesaurus

Index of Tables

Table 2-1: Common steps in Information Retrieval (from Robertson 1994, p. 3)
Table 2-2: Comparison of IR and Textual CBR (reproduced from Lenz 1998)
Table 2-3: Sub-domains of NLP (reproduced from Liddy 1998)
Table 2-4: Correlation of similarity measurements
Table 3-1: Thesaural Relations vs Mean Words
Table 3-2: Value of the different lexical links
Table 3-3: Quotation from Einstein 1939 (cited by StOnge 1995)
Table 3-4: An example lexical chain embedded in a text
Table 3-5: Senses of 'TRAIN'
Table 3-6: An Example Generic Document Profile
Table 3-7: Link Type indications
Table 4-1: Texts Selected
Table 4-2: Reading Complexity of the Texts


Table 4-3: Samples of Trigger Pairs (Beeferman et al. 1997)
Table 5-1: Disambiguation Success (% accuracy) vs Word by method
Table 6-1: Rejection Criteria for example texts
Table 6-2: Copyright: Experimental Comparative Results
Table 6-3: Copyright: Numeric Rank of similarity scores
Table 6-4: Copyright: Hesperus and SWISH Spearman Rank Correlation
Table 6-5: AI: Experimental Comparative Results
Table 6-6: AI: Hesperus and SWISH Spearman Rank Correlation
Table 6-7: Rosetta: Experimental Comparative Results
Table 6-8: Rosetta: Hesperus and SWISH Spearman Rank Correlation
Table 6-9: Socialism: Experimental Comparative Results
Table 6-10: Socialism: Hesperus and SWISH Spearman Rank Correlation
Table 6-11: Ballot: Experimental Comparative Results
Table 6-12: Ballot: Hesperus and SWISH Spearman Rank Correlation
Table 6-13: Breakdance: Experimental Comparative Results
Table 6-14: Breakdance: Hesperus and SWISH Spearman Rank Correlation
Table 6-15: Hesperus: Table of significance of experimental results
Table II-1: Generic Document Profile for 'Rosetta Stone' from MS Encarta
Table II-2: Generic Document Profile for 'Copyright' from MS Encarta
Table II-3: Generic Document Profile for 'Socialism' from MS Encarta
Table II-4: Generic Document Profile for 'Ballot' from MS Encarta
Table II-5: Generic Document Profile for 'AI' from MS Encarta
Table II-6: Generic Document Profile for 'Breakdance' from MS Encarta


Figure 1: Extract from Roget's Thesaurus (reproduced with the kind permission of the Karpeles Library)

Chapter 1. Introduction

1.1 Introduction

This study describes a new method to determine the similarity of two texts based on the

words they contain that are related in meaning. These words may be linked into chains

that contribute towards the cohesion of the text. These 'lexical chains' (Morris and Hirst

1991) are identified using Roget's thesaurus. Roget has not been used previously in a

computer program to identify lexical chains. Neither has it been suggested that the

similarity of whole texts may be compared in this way. Thus, the study brings together

ideas of similarity judgements, text cohesion, and Roget's thesaurus.

Similarity judgements are an essential component of human thought and inference

processes (Sloman and Rips 1998). Since no situation is exactly like another, people must

be able to generalise their experiences in order to apply them in new circumstances (Hahn

and Chater 1998, Schank and Abelson 1977). These may range from basic responses to simple stimuli, as in classical Pavlovian conditioning, to legal judgements based on previous, similar cases. Similarity is at the core of the artificial intelligence problem solving method known as 'Case Based Reasoning' (Aamodt and Plaza 1994), which

seeks to identify problem solutions based on their similarity to past successes.

Cohesion is that property of a text that allows it to be read as a unified entity, as opposed

to a series of unconnected sentences. Halliday and Hasan (1976, 1989) have identified

many devices used to make text cohesive. These include linguistic phenomena, such as

anaphora, cataphora, ellipsis, co-extension, and chains of words. Lexical chains may be

composed of identical or similar words.

Roget's Thesaurus is a well-known scholarly work and writer's tool. It is used by authors

to find related and relevant words. It contains thousands of words organised by their

similarity to each other in a conceptual four-level hierarchy. Roget, and the semantic

information in its hierarchy, has been used in Information Retrieval (Spärck-Jones 1986;

Boyd et al. 1993), Word Sense Disambiguation (Yarowsky 1992), and in Text Cohesion

(Morris and Hirst 1991). Roget's Thesaurus is described further in Appendix VII.


Morris and Hirst (1991) proposed using Roget's thesaurus to identify the lexical chains in a text. This would suggest the text's structure, which is an essential step in recognising its deeper meaning. Morris and Hirst recognised that lexical chains provided a semantic context for the interpretation of words and sentences. This study defines a method of representing that semantic context, which may then be used for estimating the similarity of two texts.

1.2 Research problem and research questions

This study addresses the problem of extracting a representation of a text's meaning from

its content using Roget's thesaurus as an external knowledge source. The resulting

document representation can then be compared to others, giving rise to a new method for

text similarity assessment. Specifically, the study tries to ascertain:

1. Whether a text similarity measure may usefully be constructed from a text's

lexical chains as identified using Roget's Thesaurus.

2. If the text similarity measure defined provides a better approximation of human

judgements than purely statistical methods.

3. Whether the text representation considered is suitable for the analysis of texts of

different lengths and complexities.

4. Whether the measure may be improved by including word sense disambiguation

at current levels of accuracy.

There are three motivations for this work:

Firstly, natural language approaches to Information Retrieval have consistently performed

no better than statistical and heuristic methods (Strzalkowski 1999). Such statistical

methods consider documents to be described by a representative set of keywords (Baeza-

Yates and Ribeiro-Neto 1999). However, natural language texts contain sentences that

have a grammatical structure and use a varied vocabulary rich in synonyms. These

elements contribute to a text's meaning, which is not considered in statistical IR. Since people use meaning when considering text similarity, it is important to demonstrate a robust application of Natural Language Processing that considers meaning (even in the most superficial sense) and demonstrates improved performance on text similarity matching.

Secondly, the study considers the potential advantages and disadvantages associated with

the use of Roget's thesaurus in identifying lexical chains. Several other computer


implementations of lexical chains (see Chapter 2) have used Princeton's WordNet (Miller

et al. 1990, Fellbaum 1998). Since these have not performed better than rival statistical

measures it is important to decide whether the technique does not work well, or whether it

may be improved by a different knowledge source.

Thirdly, all coherent texts contain embedded sequences of words related in meaning.

These sequences can be detected by identifying simple relationships between the relevant

thesaural entries in which the words are found. The identification of initial sequences

drives the addition of further related words into conceptually related lexical chains.

Although texts differ in content, it is not known whether the distribution of the links in these lexical chains also depends on the type of text in which they are embedded. Consequently, this needs to be determined if the technique is to be of general applicability.

Every coherent text contains many lexical chains of different lengths and strengths. We

may use these to represent the broad subject matter of a text. This is done by identifying

the key concept of each chain, and relating this to its magnitude, giving an attribute value

vector of concepts and their strengths. We can then use this to identify other texts as

closer or further away in meaning.
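The representation described above can be sketched in outline. The code below is a minimal illustration, not the thesis implementation: the function names are hypothetical, chains are assumed to arrive as (key concept, strength) pairs, and cosine similarity stands in for whichever comparison measure Chapter 3 actually defines.

```python
from math import sqrt

def profile(chains):
    """Fold lexical chains into an attribute-value vector mapping
    each chain's key concept to its accumulated strength."""
    vec = {}
    for concept, strength in chains:
        vec[concept] = vec.get(concept, 0.0) + strength
    return vec

def similarity(a, b):
    """Cosine similarity between two concept vectors: texts whose
    chains share strong concepts score closer to 1.0."""
    dot = sum(a[c] * b[c] for c in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Under this sketch, two texts whose chains share no concepts score zero, while texts dominated by the same strong concepts score close to one.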

1.3 Justification for the research

The purpose of this study is to ascertain whether chains of words in texts that are related

by virtue of their position in Roget's thesaurus could be used to define a general measure

of a text's subject matter. This metric needs to be sufficient to allow the similarity of two

texts to be compared more accurately than measures that consider texts as composed of

unrelated terms.

This study is important for three reasons: firstly, it examines the suitability of Roget's

thesaurus as an external knowledge source. Other work on lexical chains (Stairmand

1996; StOnge 1995) expressed the view that the general relationships between related words included in Roget may be more useful than those found in Princeton's WordNet (Fellbaum 1998).

Secondly, it develops a robust, but shallow method of estimating a text's subject matter

from its content. Most Natural Language Processing (NLP) methods are not capable of


analysing unconstrained text (Lewis and Spärck-Jones 1996). Additional robust methods

could increase the practical utility of NLP, for example by making it applicable to real

world tasks, such as analysing on-line information from the World Wide Web (Berners-

Lee et al. 1994, Ellman and Tait 1996).

Thirdly, it has significant potential in textual case based reasoning (T-CBR). T-CBR

(Lenz et al. 1998) applies the problem solving methodology (Watson 1997) of case based

reasoning (CBR) to knowledge bases stored as texts (see Chapter 2). Current applications

in T-CBR all use purpose built techniques to analyse a text's contents. This study will

describe a method of determining text similarity that is applicable to any subject

domain. Since similarity assessment is a core element of CBR, a generic method could

significantly ease the burden of building a T-CBR system by avoiding writing a text

analyser specifically for each new application.

1.4 Methodology

This is an experimental study in Computational Linguistics with an emphasis on

comparative evaluation. The core research method used was the construction of a lexical

chaining program that uses Roget's thesaurus as a knowledge source. This program is

known as 'Hesperus'. Hesperus' performance was evaluated against judgements of text

similarity made by human subjects in an experiment based in a realistic setting. The

results were also compared to those given by a statistically based Information Retrieval

program. The approach to word sense ambiguity was analysed by participating in an

international word sense disambiguation benchmarking competition called Senseval

(Kilgarriff and Rosenzweig 2000).

Hesperus is made up of a lexical chaining program, and a computationally tractable

version of Roget's thesaurus as a knowledge source. The lexical chaining program is an

enhanced version of one described by StOnge (1995), which is in turn based on Morris and Hirst (1991). Enhancements include storing lexical chains according to their prominence in the

text. The procedure that calculates the importance of a lexical chain is based on

Stairmand's (1996) work. Computational complexity is also controlled using a sliding

window approach (Schütze 1992). Details are given in Chapter 3.
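As a rough sketch of the sliding-window idea (hypothetical names throughout; `related` stands in for the Roget-based relatedness tests described in Chapter 3), chaining might look like this: a word may only join a chain whose most recent member is nearby, which bounds the number of comparisons.

```python
def build_chains(words, related, window=50):
    """Greedy lexical chaining with a sliding window: a word may only
    join a chain whose most recent member occurred within `window`
    words of it, keeping the computation tractable on long texts."""
    chains = []  # each chain is a list of (position, word) pairs
    for pos, word in enumerate(words):
        for chain in chains:
            last_pos, last_word = chain[-1]
            if pos - last_pos <= window and related(word, last_word):
                chain.append((pos, word))
                break
        else:  # no open chain accepts the word: start a new chain
            chains.append([(pos, word)])
    return chains
```

The window keeps the cost roughly linear in text length, at the price of never linking words that are far apart; the actual thresholds and relatedness tests are those of Chapter 3.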

A machine-readable version of Roget's thesaurus was not available to Morris and Hirst

(1991). Since then the 1911 version of Roget's thesaurus has been made available by


Project Gutenberg (1999). Whilst this is machine-readable, it is not machine tractable, as it

is one large block of text. It was made machine tractable by splitting it into multiple files,

and then using an Information Retrieval program to make these accessible. An identical

procedure was applied to the 1987 Roget when permission had been granted to use this

for research purposes.
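The conversion step might look something like the sketch below. The entry-boundary pattern is an assumption made for illustration, not the actual format of the Gutenberg file, and the function name is hypothetical.

```python
import re
from pathlib import Path

def split_entries(raw_text, out_dir):
    """Split one large thesaurus text into a file per entry so that an
    IR engine can index and retrieve entries individually. Assumes
    (hypothetically) that each entry opens with 'number. headword'."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Split at newlines that are followed by a numbered entry heading.
    entries = re.split(r"\n(?=\d+\.\s)", raw_text)
    for i, entry in enumerate(entries):
        (out / f"entry_{i:04d}.txt").write_text(entry)
    return len(entries)
```

Once split, each file can be fed to any full-text indexer, making individual thesaurus entries directly retrievable.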

1.5 Thesis Overview

Following the introduction contained in this chapter, the remaining components of this

thesis are as follows.

Chapter 2 surveys the state of the art in areas related to this thesis. Following brief links into the parent disciplines of Natural Language Processing, Information Retrieval, and Textual Case Based Reasoning, attention is focussed on other work on lexical chains and on text and concept similarity assessment. Other possible approaches to the text similarity problem, such as those that are not knowledge based, are also considered.

Chapter 3 describes the lexical chaining program Hesperus, and its implementation.

Procedures are also given for converting a text-based thesaurus into a resource suitable

for lexical chaining, for text similarity assessment, and for visualising lexical chains in a

text.

Chapter 4 examines the general nature of the approach. That is, we raise the issue of text

genre, and whether lexical chains (and hence Generic Document Profiles) derived from

simple texts may be compared to those from ones that are more complex. This question is

answered by analysing several book length texts that differed in complexity according to

a standard readability metric and comparing the frequency and types of the thesaural

links.

Chapter 5 considers the problem of word sense ambiguity, and possible approaches to it

that could be used in Hesperus. These ideas were tested within the context of an

international contest known as Senseval (Kilgarriff and Rosenzweig 2000) which was

held to evaluate different approaches to word sense disambiguation. Appropriate concepts

were then migrated into the Hesperus system design, where their effectiveness could be

evaluated.


Chapter 6 describes a fully randomised experiment designed to evaluate Hesperus, the

text similarity program. This involved generating a benchmark set of data whose

similarity was assessed by people. Hesperus was then operated under several conditions,

and its similarity judgements compared against those of the human subjects. This gave an

indication of their efficacy.

The final chapter summarises findings, draws conclusions, and makes suggestions for

further research. The conceptual relationships between the chapters are shown in Figure 1-1 below.

Figure 1-1: Conceptual organisation of the chapters of the thesis

1.6 Definition of Terms

Definitions adopted by researchers are rarely uniform, so essential or unusual terms are defined here to eliminate future confusion.

Figure 1-1 legend:

1. Introduction.
2. Literature Review.
3. Hesperus: A system for comparing the similarity of texts using lexical chains.
4. The General Nature of Lexical Links.
5. Word Sense Disambiguation and Hesperus.
6. Evaluating Hesperus.
7. Conclusion.

Chapter 1 Introduction


Roget's Thesaurus

This study depends critically on Roget's thesaurus. However, since Roget's thesaurus was first published in 1852, there have been many different editions1. This study uses the version of 1911 in Chapters 3 and 4, and that of 1987 thereafter. Nonetheless, the lessons from the study are general, because of the common nature of Roget's thesauri. To understand this requires a brief introduction to the structure of Roget.

There are three essential components common to any Roget's thesaurus:

1. A four-level2 hierarchical structure that includes approximately one

thousand concepts or classes.

2. A large number of words organised into these classes.

3. An index that identifies to which classes a word belongs.

The index (3) is highly dependent on the body of words from which it is produced (2). This varies considerably, from pocket editions of Roget (such as that of 1911), which have approximately 60,000 words, to large desktop editions that have 250,000 words.
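A minimal sketch of these three components may make the relationship between them concrete. The class numbers and word lists below are invented for illustration and are not taken from any edition of Roget:

```python
# Sketch of the three Roget components: (1) numbered classes,
# (2) words organised into those classes, (3) an index derived from (2).
# All class numbers and word lists here are hypothetical.

classes = {
    266: ["journey", "travel", "voyage"],
    267: ["train", "rail", "line"],
}

# The index maps each word to the classes in which it appears,
# built mechanically from the class contents.
index = {}
for class_id, words in classes.items():
    for word in words:
        index.setdefault(word, []).append(class_id)

def lookup(word):
    """Return the Roget classes a word belongs to (empty list if absent)."""
    return index.get(word, [])

print(lookup("train"))   # [267]
print(lookup("quasar"))  # [] -- performance depends on index coverage
```

As the surrounding text notes, the behaviour of `lookup` on real text depends entirely on how many words the chosen edition's index contains.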

The performance of the algorithms developed here depends on the presence of words in the index. Consequently, the exact edition of Roget is important, and better performance is seen with larger editions of Roget.

The class structure of Roget varies little. Roget originally used 1000 classes, but subsequently added approximately 16 subclasses. Later lexicographers reduced this to 990 classes, a variation of ±1.5%.

The algorithms developed here exploit Roget's class structure. However, given the small variation in that structure, they are applicable to any version of Roget. Strictly, though, the work reported here may only be reproduced exactly with the 1911 edition for the experiments reported in Chapter 3, and with the 1987 edition subsequently3. Nonetheless, the

1 There were twenty-four in Roget's lifetime alone (Encarta 1997).
2 The four named levels in the hierarchy, plus up to three levels underneath these, identifiable either by syntactic category or by punctuation, especially semicolons.
3 However, the paper edition of 1962 is used for reference throughout.


principles identified are applicable to any Roget's thesaurus, albeit with some tuning. Further information about Roget's thesaurus is given in Appendix VII.

Similarity

This thesis is about 'similarity', but what does that mean? A standard definition would be 'of the same kind, nature, or amount; having a resemblance' (COD9), and most authors imply this (e.g. Lee 1997). Hahn and Chater (1998) offer the following intuition (which they note as being vague), and it will be used here:

1. Similarity is some function of common properties.

2. Similarity is graded.

3. Similarity is maximal for identity.
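These three properties can be illustrated with a simple set-overlap measure. The Jaccard coefficient (one of the coefficients discussed in Section 2.2.3) satisfies all three; the property sets below are invented for illustration:

```python
def jaccard(a, b):
    """Similarity as a function of common properties: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Two invented objects described by sets of properties.
x = {"four", "legs", "tail", "barks"}
y = {"four", "legs", "tail", "miaows"}

assert 0.0 <= jaccard(x, y) <= 1.0   # 2. similarity is graded
assert jaccard(x, x) == 1.0          # 3. maximal for identity
assert jaccard(x, y) == 3 / 5        # 1. a function of common properties
```

The measure returns 3/5 here because the objects share three of five distinct properties; any measure obeying Hahn and Chater's three points would behave analogously.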

Word Meaning and Word Sense

There is no universally accepted definition of word meaning. One view, that of the logical positivist, would be that the meaning of a word is determined by the truth value of the proposition of which it forms a part. The view of a pragmatist would be that a word's meaning is given by its context of use, and by the intention of whoever uttered or wrote it.

An everyday view would be that a word's meaning is given by its dictionary entry. This is possibly one of the least satisfactory definitions, as it implies that lexicographers dictate the meaning of words in a language, whereas their aim is to interpret it.

The dictionary definition of meaning does have the advantage of operational adequacy.

That is, it can be implemented in a computer program so as to offer users appropriate

interpretations as desired.

The view taken in this thesis is that word meanings and word senses are given by their entries in the dictionary specified. If no dictionary is mentioned, then Roget's thesaurus is implied. That is, we assume that a word sense corresponds to a Roget category, even though Roget only defines words by association with others, rather than giving formal explanations.

Now there are two classic problems that affect work with natural language texts: several

different words can often be used to express the same concept, and one word can be used

for several different concepts. Let us call these problems 'synonymy' and 'word sense ambiguity'.


In synonymy, a word is equivalent to another in some (but possibly not all) senses. An example would be 'wedding' and 'marriage' in the ceremony sense4, where both are equivalent in meaning, although 'wedding' cannot be substituted for 'marriage' in all its dictionary senses. For example, both words can be used interchangeably in (1), but not in (2).

1. The wedding took place in church.

2. The marriage had irretrievably broken down.

Synonymy and word sense ambiguity lead to the classic information retrieval 'vocabulary problem' (Furnas et al. 1987, Blair and Maron 1985), in which users enter different terms for desired objects or actions from those envisaged by a system's designer.

Ambiguous words (see Chapter 5) are often divided into 'homographic' and 'polysemous' senses. Homographic words have identical spellings, but completely different meanings that are often the result of different derivations. These correspond to major sub-entries in a dictionary. An example would be 'dram' in the sense of a small drink of spirits, as opposed to a computer memory specification, e.g. '24meg dram'.

Polysemous words have many meanings. Here 'polysemy' refers to different sense variations within one major dictionary entry. For example, the 'onion' in cheese and onion crisps is derived from, as opposed to identical to, the onion that may be grown in the garden.

Identifying which word sense(s) were intended by an author is known as the word sense

disambiguation problem.

Lexical Chains

The term 'lexical chain' is due to Morris and Hirst (1991), who used it to identify sequences of related words in a text. Their work was based on the stricter definitions of Halliday and Hasan (1976, 1989), who defined the earlier term 'cohesive chain' as a set of terms that are semantically related. Halliday and Hasan (1989) also specified two types of chains: 'identity chains', where every member refers to the same thing (that is, they are co-referential), and 'similarity chains'. In similarity chains, the terms are related by co-classification or co-extension; that is, they refer to members of the same class of things or events. Halliday and Hasan (1989, p84) stress that the

4 'wedding' is derived from Old English, and 'marriage' from Middle English through Old French (COD9).


distinction between identity and similarity chains is important. However, it is not possible to maintain this distinction computationally. Therefore, this study will follow Morris and Hirst (1991) in its use of the term lexical chain.

An Example Text with Lexical Chains Marked

To clarify the notion, we briefly consider an example text with its lexical chains indicated. The following quotation from Einstein was considered by StOnge (1995). It is sufficiently brief to examine in some detail, and it also permits some comparison of the two works (Section 3.4).

"We suppose a very long train travelling along the rails with the constant velocity t' and in the direction indicated in Figure 1. People travelling in this train will with advantage use the train as a rigid reference-body; they regard all events in reference to the train. Then every event which takes place along the line also takes place at a particular point of the train. Also, the definition of simultaneity can be given relative to the train in exactly the same way as with respect to the embankment." (Einstein 1939, cited by StOnge 1995)

StOnge (1995) manually identified three lexical chains in this text. These are indicated by

subscripts in the text5, and then listed below.

We suppose a very long train1 travelling2 along the rails1 with the constant velocity2 t' and in the direction2 indicated in Figure 1. People travelling2 in this train1 will with advantage use the train1 as a rigid reference-body3; they regard all events in reference3 to the train1. Then every event which takes place along the line1 also takes place at a particular point1 of the train1. Also, the definition of simultaneity can be given relative to the train1 in exactly the same way as with respect to the embankment1. (Einstein)

1. {train, rails, train, train, train, line, point, train, train, embankment}

2. {travelling, velocity, direction, travelling}

3. {reference-body, reference}
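A crude sketch of how such chains might be recovered mechanically: words sharing a (hypothetical) thesaurus class are gathered into one chain. The class assignments below are invented, and real chaining (Chapter 3) uses richer thesaural relations than exact class identity:

```python
# Toy word-to-class assignments; these numbers are invented, not Roget's.
toy_index = {
    "train": 267, "rails": 267, "line": 267, "point": 267, "embankment": 267,
    "travelling": 266, "velocity": 266, "direction": 266,
    "reference-body": 466, "reference": 466,
}

def chains(words):
    """Group the words of a text by shared thesaurus class."""
    groups = {}
    for w in words:
        if w in toy_index:
            groups.setdefault(toy_index[w], []).append(w)
    # Keep only genuine chains: two or more linked words.
    return [ws for ws in groups.values() if len(ws) > 1]

text = ["train", "travelling", "rails", "velocity", "direction",
        "travelling", "reference-body", "reference", "line", "point",
        "train", "embankment"]
for chain in chains(text):
    print(chain)
```

Run on the content words of the Einstein passage, this toy grouping recovers the three chains listed above, precisely because the toy class assignments were chosen to mirror StOnge's analysis.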

It is important to remember that there is no absolute truth in the selection of any particular

5 Reproduced from StOnge (1995). Colour has been added to the chains StOnge identifies for clarity of exposition.


chain of words in a text. The relationships between words that form part of a coherent theme in the text may or may not be detectable by reference to an external thesaurus (see Section 3.4). Furthermore, the presence of a word in a thesaurus may itself be problematic, due to its possible multiple interpretations (see Chapter 6). Nevertheless, the example shows the type of semantic relationship found in a lexical chain.

Lexical Chains and their component Links

Halliday and Hasan (1976, 1989) use the word 'tie'6 to indicate relations in a text that can be joined into a cohesive chain. This study consistently uses 'links' to describe elements of a lexical chain, since this is the primary sense, as judged by the position of the dictionary entry. Links in a lexical chain (hence 'lexical links') should be considered equivalent to Halliday and Hasan's ties.

Lexical Chains and the Generic Document Profile

Lexical chains are used to compute a semantic representation of a text that we call its 'Generic Document Profile' (Chapter 3). This is an attribute-value vector of Roget categories whose strengths are determined using the lexical chains identified in the text.
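The idea can be sketched as follows. This is not the Hesperus algorithm of Chapter 3; the category numbers, and the use of raw chain length as a strength, are invented for illustration:

```python
import math

def profile(chains):
    """Build an attribute-value vector: Roget category -> strength.
    Here strength is simply the chain's length (a placeholder choice)."""
    return {category: len(words) for category, words in chains.items()}

def cosine(p, q):
    """Compare two profiles as sparse vectors."""
    shared = set(p) & set(q)
    dot = sum(p[c] * q[c] for c in shared)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

# Hypothetical chains from two texts, keyed by invented category numbers.
doc_a = profile({267: ["train", "rails", "line"], 266: ["travelling", "velocity"]})
doc_b = profile({267: ["train", "embankment"], 466: ["reference"]})
print(round(cosine(doc_a, doc_b), 3))  # 0.744
```

Two texts whose chains fall into the same Roget categories thus receive a high similarity score, regardless of whether they use identical words.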

Ecological Validity

The term 'ecologically valid' is widely used in psychology to describe experiments that try to ensure that a task's content and features are representative of the larger circumstances of a person's activities. The term is due to Brunswik, and is discussed at length by Hammond (1998).

We consider the experiments described in Chapter 6 to be ecologically valid, as they are carried out in the subjects' usual environment, using computer hardware and software with which they are familiar.

1.7 Delimitation of Scope and Key Assumptions

There are inevitably some limitations inherent in the approach used. These are described here, as they may restrict the ability to extrapolate from the results. As with other work based on lexical chains, Hesperus can only process words that are found in the thesaurus (or that may be morphologically reduced to words that are). Thus, document themes that are tightly bound to proper names are not catered for in the program as it stands. This

6 This sense of 'tie' is that of something (typically a beam) holding parts of a structure together. It is a less common sense, as indicated by its sixth position in the dictionary entry (COD9).


might, for example, be addressed by incorporating work on named entities (Chinor 1998) being done within the MUC program by Humphreys et al. (1998) on LaSIE, or by Black et al. (1998) on FACILE.

This study is based on a particular interpretation of lexical chains that may be derived

using an external thesaurus. Halliday and Hasan (1976, 1989) identified other cohesive

relations, such as those based on pronouns. Being able to automatically identify the contribution these make to the sense of a text would clearly be useful, but its implementation (e.g. Azzam, Humphreys, and Gaizauskas 1999) is beyond the scope of this study. We claim that using thesaurally derived lexical chains is sufficient for crude similarity assessment.

Hesperus is a research tool. Its performance has not been optimised based on intermediate

results. The premise is that if the technique works then it may be improved in the future

by tuning system parameters based on empirical data.

There are inevitable limitations with the text similarity experiment reported in Chapter 6. The subjects were undergraduate and masters students at the University of Sunderland following courses in Computing and Information Systems. Their assessments of similarity may not be representative of those of the general population. Similar limitations of scope of course apply to the vast majority of work in the psychological literature.

1.8 Summary

This chapter has presented the major subject of the thesis: the use of thesaurally related lexical chains in determining text similarity. This has introduced lexical chains, especially those dependent on relations that may be determined using Roget's Thesaurus. The thesis methodology has also been specified. This is experimental, with an emphasis on comparative evaluation. We shall now proceed with a discussion of the background to the research, followed by a detailed description of the study.

Chapter 2. Literature Review

2.1 Introduction

The objective of this chapter is to provide a basis from the published literature for the

work outlined in Chapter 1. This will be done by reviewing the relevant areas to identify

germane research issues. These research issues come from Information Retrieval, Case

Based Reasoning, and Natural Language Processing, and include thesauri, similarity,

word sense ambiguity, and lexical chains. We now proceed to examine each of these

fields in turn.

2.2 Information Retrieval

2.2.1 Introduction

Information retrieval (IR) "deals with the representation, storage, organisation of and access to information items" (Baeza-Yates and Ribeiro-Neto 1999, p1). That is, texts,

images, or other forms of information are often collected together for later use. Then, as

the collection size grows, automatic means of identifying entries useful to the enquirer are

required. This section will only consider text retrieval since this is most directly related to

the topic of this thesis, although retrieval of images and multimedia are active research

areas (e.g. Bertino, Catania, and Ferrari 1999).

IR is a mature field1, and there are many general descriptions of its methods, philosophy, and origins (van Rijsbergen 1979, Salton and McGill 1983, Frakes and Baeza-Yates 1992, Baeza-Yates and Ribeiro-Neto 1999).

IR may be broadly divided into manual and automatic techniques for indexing and subsequent retrieval. Manual methods have the advantage of accuracy, but the disadvantage of cost, in terms of the person time required to assign information to an appropriate category. An example of the successful application of manual methods is the Internet search engine Yahoo, one of the most popular (Baeza-Yates and Ribeiro-Neto 1999). It uses a classification system into which Internet documents are manually classified; information seekers retrieve information by searching this classification hierarchy.

1 The Journal of Documentation, which is associated with IR, made its first appearance in 1946.


Automatic techniques have the advantage that a machine may analyse information for

storage and later retrieval. Thus, the volume of information processed in a given time far

exceeds human capabilities. The cost of processing information automatically is also

decreasing, as it is linked to the exponentially falling cost of computer time and storage.

Automatic methods also have the advantage of consistency, whereas human indexing is

subject to individual judgement.

The disadvantage of automatic methods is that they rarely achieve human levels of performance. They are far more likely to identify erroneous information as relevant to a person's query, or to ignore pertinent information.

Several measures exist for evaluating and comparing IR systems (see van Rijsbergen 1979 for formal definitions). The two measures most often cited are recall and precision. Recall is the proportion of relevant documents that are retrieved, and precision is the proportion of retrieved documents that are relevant at a given cut-off in a ranking.
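These two definitions translate directly into code. The sketch below computes both over sets of document identifiers; the document sets themselves are invented for illustration:

```python
def precision_recall(retrieved, relevant):
    """Precision: proportion of retrieved documents that are relevant.
    Recall: proportion of relevant documents that are retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Of 4 documents retrieved, 3 are among the 6 that are actually relevant.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7})
print(p, r)  # 0.75 0.5
```

The familiar trade-off is visible here: retrieving more documents can only raise recall, but typically lowers precision.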

Automatic techniques may be divided into methods that directly exploit knowledge about

the contents of the information in order to improve retrieval performance and those that

rely solely on its statistical characterisation. We shall term the former 'knowledge based' methods, and the latter 'statistical' methods.

FERRET (Mauldin 1991) is an example of a knowledge-based IR system. It exploited script-based language parsing, and used four levels of lexical knowledge. These included a hand-coded lexicon, extracts from Webster's 7th dictionary to account for synonyms, rules to recognise near synonyms, and special rules to identify names. In a limited domain of 1065 astronomy articles, Mauldin (1991) claimed performance increases of 30% in recall and 250% in precision when compared to a standard information retrieval technique (Boolean keyword query). The cost of this increase in precision was a processing time that Mauldin (1991) notes "was limited to eight minutes per page".

Automatic statistical methods have two tremendous advantages over knowledge based

methods. Firstly, they are capable of processing vast quantities of data in a limited time,


and secondly, they are largely language independent. Mauldin (1997) describes Lycos, an

Internet search engine that uses statistical methodology.

Lycos is unusual in that it provides automated summaries of Internet documents. Other

Internet search engines do not. Mauldin (1997) describes how these summaries are

created using statistical IR techniques that identify the 100 words that most characterise

the text. He also discusses how both computational and financial cost was a factor in

deciding which approaches could be used in Lycos, since a commercial web service needs

to be capable of searching millions of documents per day.

Internet search services such as AltaVista, Excite, and Lycos that use statistical

approaches are equally applicable to any language. This differentiates them from

knowledge based approaches that are largely restricted to English, and widens their

appeal to a global audience.

In giving an overall review of IR, Robertson (1994) points out that many steps are present

in nearly all free-text systems. These are summarised in Table 2-1, which is derived from

Robertson (1994, p3).

Table 2-1: Common steps in Information Retrieval (from Robertson 1994, p3)

Free-text indexing: creates an index entry from part of the item only (e.g. title or abstract). This exploits a previous manual selection of the important words to describe an item.

Word identification: there must be a set of rules for this, dealing not only with word separators such as blank characters and punctuation, but also with upper and lower case, embedded hyphens or hyphens at the ends of lines, numbers, etc.

Stop-lists: most systems identify and exclude certain common words (the list is usually manually prepared).

Stemming: stemming, or suffix stripping, refers to the abbreviation of index and query terms to reduce index size and increase retrieval effectiveness.

Dictionary operations: these identify phrases, acronyms, and synonyms, possibly using a thesaurus.

Inverted file generation: this is one of the principal data storage techniques. It allows the identification of the containing document from a recognised query term.
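Several of Robertson's steps (word identification, stop-lists, stemming, and inverted file generation) can be illustrated in a toy pipeline. The stop list and suffix rules below are placeholders, not those of any real system:

```python
import re

# Placeholder stop list; real systems use a manually prepared one.
STOP = {"the", "a", "of", "in", "is", "and", "to"}

def words(text):
    """Word identification: a crude rule keeping alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    """Naive suffix stripping; real stemmers use far richer rule sets."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_documents(docs):
    """Inverted file generation: term -> set of containing document ids."""
    inverted = {}
    for doc_id, text in docs.items():
        for w in words(text):
            if w in STOP:
                continue  # stop-list exclusion
            inverted.setdefault(stem(w), set()).add(doc_id)
    return inverted

docs = {1: "Indexing of the documents", 2: "The document index is stored"}
inv = index_documents(docs)
print(inv["document"])  # {1, 2}
```

Note how stemming lets "documents" and "document" share one index entry, and the inverted file answers the query "which documents contain this term?" directly.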


A classification of text retrieval techniques proposed by Belkin and Croft (1987) is shown

below in Figure 2-1.

Figure 2-1: A classification of text retrieval techniques. Reproduced from Belkin and Croft (1987)

Detailed descriptions of these methods are widely available (e.g., see van Rijsbergen

1979, Baeza-Yates and Ribeiro-Neto 1999).

At its broadest, text-based Information Retrieval consists of collecting a set of texts, indexing them, and then querying the index to identify relevant documents. It could be said that one of the principal objectives of IR is the optimisation of query-document similarity so as to maximise precision and recall. Enhancements to this approach are being explored, especially those based on visualisation (Nowell et al. 1996), knowledge discovery (Crimmins et al. 1999), and, within the Internet, citation counts (Brin and Page 1998). However, the following sections pursue query-document similarity, as it is most relevant to the work outlined in Chapter 1.

A fundamental issue in query-document similarity is known as 'The Vocabulary Problem' (Furnas et al. 1987). Simply put, this occurs when information seekers do not use exactly the same words as those employed by information providers. They may, for example, use synonyms or equivalent phrases. The converse of the vocabulary problem is the issue of word sense ambiguity. Here users find inappropriate documents that include the search terms they specified, but in a sense other than they intended.


The vocabulary problem may be addressed by exploiting thesauri (Qiu and Frei 1995), or by modifying queries (Xu and Croft 1996). In the following sections, we first consider lexical ambiguity in IR, to put the vocabulary problem in perspective. Next, we consider standard measures of similarity, as these are clearly related to the approach outlined in Chapter 1. We then discuss query modification and the use of thesauri. Finally, we consider IR evaluation. This is critical if we wish to know how a modified technique has affected precision and recall.

2.2.2 Lexical Ambiguity and Information Retrieval

Approximately one third of the words in Roget's thesaurus are found in more than one entry, and could be considered ambiguous (see Appendix VII and Section 2.4.2). This 33% ambiguity level is in broad accord with figures reported by Ide and Véronis (1998) from a variety of sources. Consequently, an IR query containing an ambiguous word such as plant (e.g. manufacturing plant, as compared to plant life) may identify as relevant documents that contain the term in a sense other than that in the query. Although Krovetz and Croft (1992) have argued that ambiguity is diminished in subject-specific document collections, it is a factor in heterogeneous document collections such as TREC and the Web. As such, there is a question as to whether automatic sense disambiguation is desirable in IR, since it remains an unresolved research problem.

Sanderson (1994) has carried out detailed experiments on the effects of lexical disambiguation on Information Retrieval performance. He used the artificial ambiguity technique that he credits to Yarowsky, in which unrelated word pairs are merged into pseudo-words. For example, if the word pair is 'kalashnikov' and 'banana', every occurrence of both in the document collection is replaced with the artificial term 'kalashnikov/banana'. The advantage of this approach is that the original text can be used to identify which word sense was intended (i.e. either 'kalashnikov' or 'banana').
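The pseudo-word construction can be sketched as follows; the tokenised sentence is invented for illustration:

```python
def pseudoword(tokens, pair=("kalashnikov", "banana")):
    """Merge every occurrence of either member of `pair` into one
    ambiguous token, keeping the original word as ground truth."""
    merged = "/".join(pair)
    out, answers = [], []
    for t in tokens:
        if t in pair:
            out.append(merged)
            answers.append(t)  # the intended sense, kept for evaluation
        else:
            out.append(t)
    return out, answers

tokens = ["a", "ripe", "banana", "beside", "a", "kalashnikov"]
ambiguous, truth = pseudoword(tokens)
print(ambiguous)  # ['a', 'ripe', 'kalashnikov/banana', 'beside', 'a', 'kalashnikov/banana']
print(truth)      # ['banana', 'kalashnikov']
```

Because the ground truth is retained, a disambiguator's accuracy on the artificial corpus can be scored exactly, which is what makes the technique attractive for controlled experiments.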

Sanderson (1994) tested the effect on IR performance of varying degrees of disambiguation. He did this by creating pseudo-words of between two and ten words in length. The corpus used was the Reuters 22713, a collection of 22713 articles from the Reuters news wire in 1986. Sanderson varied the accuracy of his artificial disambiguator, and examined the effect on the E measure2 (van Rijsbergen 1979). He

2 E is a compound precision-recall performance measure (van Rijsbergen 1979).


concluded that IR systems are insensitive to lexical ambiguity, but very sensitive to

erroneous disambiguation. This needs to be more than 90% accurate to improve IR

performance.

Information Retrieval is not the same problem as text similarity matching, as the latter is a subset of the former. Nonetheless, Sanderson's point that inaccurate disambiguation may degrade performance to a greater extent than no disambiguation still applies. A research question remains as to what that level is in text similarity matching, and whether disambiguation technology exceeds the threshold or not.

2.2.3 Similarity and its measurement

Calculation and measurement of similarity is often important in IR. Similarity measures allow documents to be ranked in order of relevance to a query, and are also used to cluster similar documents together. Many similarity formulae have been proposed, although none has gained universal acceptance (Bartell, Cottrel, and Belew 1998). Examples include the Dice, Jaccard, and Cosine coefficients (see van Rijsbergen 1979 and Salton and McGill 1983 for the original references).
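For binary (presence/absence) term sets, the three coefficients have simple set-based forms, shown below with an invented query and document; as the next paragraphs note, practical systems combine such coefficients with term weights:

```python
import math

# The three classic coefficients over binary term sets
# (standard formulations; see van Rijsbergen 1979).
def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    return len(a & b) / len(a | b)

def cosine(a, b):
    return len(a & b) / math.sqrt(len(a) * len(b))

query = {"roget", "thesaurus", "similarity"}
doc = {"roget", "thesaurus", "lexical", "chains"}
print(dice(query, doc), jaccard(query, doc), round(cosine(query, doc), 3))
```

All three reward term overlap and all reach their maximum when the two sets are identical; they differ only in how the overlap is normalised, which is one reason none has displaced the others.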

Similarity coefficients are combined with term weights for effective searching. Weights

exploit the relative frequency of search terms in the document collection, so that less

common terms contribute a greater weight to the similarity calculation. That is, they

exploit the heuristic that rare terms are better indicators of a relevant document than

common ones.

Zobel and Moffat (1998) report an assessment of eight query-document similarity measures, nine ways of choosing document weights, two methods of calculating document term weights, and six ways of setting relative term frequencies. These combine to give a considerable number of similarity formulae.

Zobel and Moffat (1998) report experiments on Disk 2 of the TREC collection to determine which similarity formula was most effective. Whilst the various combinations have been in common use for many years, they have rarely been compared on one large document collection. The results were inconclusive: some formulae performed better for some queries, but none was consistently superior.


Bartell, Cottrel, and Belew (1998) report experiments on a system that automatically

adjusts the parameters in a similarity matching formula. These adjustments are based on

initial user judgements of the desired ranking of a limited training document set. This

method is claimed to equal or exceed the performance of all 'classic' similarity measures.

Parameter adjustments to similarity formulae, such as Bartell, Cottrel, and Belew (1998),

may be compared to other methods of query modification. These may also alter the

weights of search terms, as we will see in the next section.

2.2.4 Query modification methods.

It is well known that retrieval performance is enhanced when, after a first retrieval attempt, the user indicates the most relevant documents and the query is repeated. This procedure is known as relevance feedback (van Rijsbergen 1979, Salton and McGill 1983, Baeza-Yates and Ribeiro-Neto 1999). One effect of relevance feedback is to augment the user's query with distinctive terms from the indicated relevant documents that do not occur in the original query.

Query expansion has long been suggested as a way of coping with the word mismatch, or

vocabulary, problem in IR (Xu and Croft 1996). Xu and Croft (1996) note that there is

little evidence that a general purpose thesaurus can improve search effectiveness, and

propose local and global methods of document analysis. In local document analysis, the

results of an initial query are analysed, whereas in global document analysis, the entire

corpus of documents is analysed. Xu and Croft (1996) report a global analysis to discover

phrases that co-occur within a window of one to three sentences. For example, airline

pilot might be associated with plane, air, and traffic. These concepts are then stored in an

INQUERY (e.g. Allan et al. 1998) database. Queries are expanded when they are run

against this database.

Xu and Croft (1996) also discuss a related approach based on local document analysis.

Here, INQUERY is used to retrieve the top n ranked paragraphs. Concepts (noun phrases)

are then ranked using a variant of the tf*idf3 measure. The top-ranked concepts may then be used to augment the user's query. Although this approach does require an analysis of the document collection, this is only needed once.

3 The tf*idf heuristic states that the frequency of a term in a text, times the inverse of its frequency of occurrence in the collection, indicates its importance. See Baeza-Yates and Ribeiro-Neto 1999, p29.
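The footnote's heuristic can be sketched as follows; the three-document collection is invented for illustration, and real systems typically add smoothing to this textbook form:

```python
import math

def tf_idf(term, doc_tokens, collection):
    """Term frequency in the document times the (log) inverse of the
    term's document frequency in the collection."""
    tf = doc_tokens.count(term)
    df = sum(1 for d in collection if term in d)
    idf = math.log(len(collection) / df) if df else 0.0
    return tf * idf

collection = [["plane", "air", "traffic"], ["air", "fare"], ["garden", "plant"]]
print(tf_idf("air", collection[0], collection))      # common term, lower weight
print(tf_idf("traffic", collection[0], collection))  # rarer term, higher weight
```

The rarer term receives the higher weight, which is exactly the heuristic exploited when ranking candidate expansion concepts.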


Xu and Croft (1996) found that local analysis is more effective than global. However, a

combination of global techniques on the local set of documents is more effective and

predictable than simple local feedback.

A related technique due to Gauch and Wang (1997) automatically generates a similarity

thesaurus (Section 2.2.5). They use linguistic corpus analysis techniques to produce a

matrix of term-term similarities. These are then used to automatically expand queries

within SMART (Buckley 1995). Performance improvements of up to 23% are claimed for

the TREC-5 data.

Wilbur and Coffee (1994) considered two kinds of queries that may be applied to a

database. The first was written by a searcher to express an information need. The second

was a request for documents most similar to a document that the searcher had already

judged as being relevant. They found the similarity based query to be more effective than

the one expressing the information need. This provided a justification for document

neighbouring procedures (pre-computation of closely related documents). Wilbur and

Coffee (1994) showed that this feedback-based method provides significant improvement

in recall over traditional linear searching methods, and even appears superior to

traditional feedback methods in overall performance.

2.2.5 Thesauri

Thesauri are valuable resources for Information Retrieval systems. They contain a

number of terms and synonyms organised into a classified hierarchy. The principal

purpose of an IR thesaurus is to provide a controlled vocabulary. That is, to limit the

number of words used in indexing documents by replacing equivalent terms by their

synonyms, or class identifiers. This reduces the size of the index, with a consequent effect

on storage and retrieval efficiency. A similar operation is also applied to users' queries, so that Information Retrieval performance is not compromised.

Roget's thesaurus (Appendix VII) is very different from a typical IR thesaurus (Srinivasan 1992), since it was designed as a general tool to help writers express themselves, whilst IR thesauri are usually domain specific and contain synonyms rather than the broader word relationships used in Roget.

Chapter 2 Literature Review


Thesauri have been of interest to IR for many years. Srinivasan writing in 1992 refers to a

then considerable literature on thesaurus construction. Like much of IR, methods may be

manual or automatic, and the automatic techniques may be divided into statistical and

knowledge based approaches.

A manual thesaurus is built by subject experts who collect terms and their synonyms, and

group them hierarchically. This is clearly costly in terms of person time to build and, once built, manual thesauri must be maintained as new terminology comes into use.

In principle, a less costly alternative would be an automatically constructed statistical thesaurus. In this method, the whole document collection is analysed for correlated

terms, which are terms that co-occur in relation to a common context. These make up the

thesaurus and are used subsequently to expand user queries. The objective here is to

collect terms that differentiate most amongst the candidate documents.

Grefenstette (1994) has described SEXTANT, a program that constructs thesauri

automatically using knowledge-poor techniques. The techniques are knowledge-poor in

that they do not depend on domain knowledge. Thesauri are constructed by analysing a

large number of texts in one domain. The text corpus is split into individual words or

tokens and then tagged with part of speech information. The word senses are

disambiguated using a separate statistical program. Dependencies between words are then

identified using local lexical-syntactic relationships, such as identifying nouns that have

modifying adjectives. Noun similarity between the dependent fragments is then calculated

using a weighted Jaccard measurement. This list of related nouns is then pruned, retaining only those which Grefenstette (1994) terms 'reciprocal near neighbors'.
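The weighted Jaccard measure itself is straightforward to sketch: each noun is described by a weighted bag of context features, and similarity is the sum of feature-wise minima over the sum of maxima. The nouns, features and weights below are invented; Grefenstette's actual attributes and weighting scheme are richer than this:

```python
# Weighted Jaccard similarity over weighted feature bags (dicts of
# feature -> weight). The feature weights here are invented for illustration.
def weighted_jaccard(a, b):
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    den = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return num / den if den else 0.0

# Hypothetical context profiles: weights of modifying adjectives seen
# with each noun in a corpus.
ship = {"fast": 2.0, "large": 3.0, "naval": 1.0}
boat = {"fast": 1.0, "large": 1.0, "small": 2.0}
print(round(weighted_jaccard(ship, boat), 3))  # 0.25
```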

Grefenstette (1994) presents several examples of thesauri produced using his methods.

These are claimed to resemble hand-built thesauri. However, whilst knowledge-poor,

Grefenstette's (1994) method requires some linguistic knowledge.

A technique that requires no knowledge of language is Latent Semantic Indexing (Dumais

et al. 1997). Latent Semantic Indexing (LSI) is a statistical technique designed to improve

Information Retrieval by addressing the vocabulary problem where different terms are

often used to refer to the same concept (Furnas et al. 1987). This is done by automatically


recognising that related terms are frequently found in similar contexts. Thus, for example, 'laptop' and 'portable' are often found near to 'computer'.

LSI works by constructing word co-occurrence matrices for a document set using a

technique called 'singular value decomposition'. This produces an emergent set of virtual

'concepts' and their strengths that is used for query-document and document-document

similarity in an extension of the vector-space model (Salton and McGill 1983). That is,

the document set is indexed using the concepts identified. This index is then searched by

mapping user queries into this reduced dimensionality concept space prior to using

conventional similarity matching.
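The core of this process can be sketched with a truncated singular value decomposition. The tiny term-document matrix, the choice of k = 2, and the fold-in step below are illustrative assumptions only, not a description of any particular LSI implementation:

```python
import numpy as np

# Minimal LSI sketch: a term-document matrix is decomposed by singular value
# decomposition and truncated to k latent "concepts"; queries are folded into
# the same reduced space before similarity matching. The matrix is invented.
terms = ["laptop", "portable", "computer", "ship"]
A = np.array([[1, 1, 0],   # laptop
              [0, 1, 1],   # portable
              [1, 1, 1],   # computer
              [0, 0, 0]],  # ship
             dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                  # keep the two strongest concepts
Uk, sk = U[:, :k], s[:k]

def fold_in(query_vec):
    """Map a term-space vector into the k-dimensional concept space."""
    return query_vec @ Uk / sk

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

docs_k = (np.diag(sk) @ Vt[:k]).T      # document vectors in concept space
q = fold_in(np.array([1.0, 0, 0, 0]))  # one-term query: "laptop"
sims = [cos(q, d) for d in docs_k]
```

Even though the query contains only 'laptop', the reduced space gives every document a graded similarity, because 'laptop', 'portable' and 'computer' share contexts.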

Success has been reported with LSI in several areas, including the TREC-3 information filtering task (Dumais 1995), and cross-language information retrieval (Dumais et al. 1997). This is particularly interesting, since term-term similarity is of little use in cross-language IR as the query term is unlikely to be identical to the document term if they are

in different languages.

The principal problem with LSI is that it is essentially a machine learning technique that

needs to analyse many documents for success. This may lead to a combinatorial explosion

whose time requirements would make LSI impractical for large document collections.

Rada and Bicknell (1989) report experiments on ranking documents with a thesaurus.

Their work was in the medical domain and used MeSH (Medical Subject Headings) as the

thesaurus to encode queries and documents from the MEDLINE bibliographic database.

Rada and Bicknell (1989) treated MeSH as a semantic net (Quillian 1968). Their particular

contribution was to introduce a procedure called DISTANCE that counted edges between

connected terms, differentially weighting those that were less specific ('broader than').

Rada and Bicknell noted that it was not practical to calculate distances between all

documents and the query, since MEDLINE contained five million documents. Rather,

they applied their procedure to the output from searching MEDLINE in the usual way.
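The edge-counting idea behind a procedure such as DISTANCE can be sketched as a cheapest-path search over a thesaurus graph. The miniature hierarchy, the 0.5 discount for 'broader than' links, and the direction of that discount are all invented for illustration; they are not Rada and Bicknell's actual weights:

```python
from heapq import heappush, heappop

# Edge-counting semantic distance: cheapest path between two terms in a
# thesaurus graph, with "broader than" links weighted differently from
# other links. Hierarchy and weights are invented for illustration.
EDGES = {
    ("disease", "infection"): 1.0,
    ("infection", "influenza"): 1.0,
    ("disease", "fever"): 1.0,
}
BROADER_WEIGHT = 0.5   # hypothetical discount for less specific links

def graph():
    g = {}
    for (broader, narrower), w in EDGES.items():
        g.setdefault(broader, []).append((narrower, w))
        # travelling towards the broader term gets the discounted weight
        g.setdefault(narrower, []).append((broader, w * BROADER_WEIGHT))
    return g

def distance(g, src, dst):
    """Cheapest-path semantic distance (Dijkstra's algorithm)."""
    seen, heap = set(), [(0.0, src)]
    while heap:
        d, node = heappop(heap)
        if node == dst:
            return d
        if node in seen:
            continue
        seen.add(node)
        for nxt, w in g.get(node, []):
            heappush(heap, (d + w, nxt))
    return float("inf")

g = graph()
print(distance(g, "influenza", "fever"))  # 2.0
```

A smaller distance would then be read as greater semantic similarity when ranking retrieved documents against the query.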

WordNet (Miller et al. 1990, 1993, Fellbaum, 1998) is an on-line lexical database. It is

hierarchically structured, and may be used as a general semantic thesaurus of English.


Richardson and Smeaton (1995) used a WordNet based distance function to re-rank a set

of documents retrieved by traditional means within the TREC context. Their results were

disappointing, possibly as a result of deficiencies within WordNet, or due to problems of

word sense ambiguity (see sections 2.2.6, 2.4.2).

Gonzalo et al. (1998) report that indexing with WordNet thesaural categories ('synsets')

can improve retrieval performance by up to 29%. Their experiments were based on

SEMCOR, a subset of the Brown corpus (Francis and Kucera 1979) that has been

manually tagged with WordNet senses.

2.2.6 IR Evaluation and Document Collections

Experimental evaluation is an essential aspect of IR. Much of this has focused on

experimental test collections where document and query relevance is marked up by hand.

This serves as a benchmark against which to evaluate systems.

Document collections are the main data source for IR evaluation and research. Collections provide a baseline against which to assess the performance of algorithms, since specialists in the subject of the collection manually code the characteristics of a sample of text. This permits systems and algorithms to be evaluated and mutually compared. Sanderson (1996) has produced a comparison of

many of the better known collections. The Cranfield collection for example is made up of

1400 abstracts relating to aeronautics (approx. 1.5Mb). There are 325 natural language

text queries included. Human experts have identified which of the articles are relevant to

which query.

Sanderson (1996) cites concerns that the small size of many IR test collections may

influence the applicability of IR findings. In particular Blair and Maron (1985) found that

retrieval effectiveness did vary with the size of document collection. To address the issue

of collection size, Sanderson (1996) used the 'Reuters 22713' for his work on lexical

ambiguity (Section 2.2.2). The Reuters 22713 is a collection of articles from the Reuters

newswire that was collected by Carnegie Group in 1988, and subsequently modified by

Lewis (1992) for his text categorisation research.

TREC (Text REtrieval Conference) is the modern test bed for IR system comparison. The

basis of TREC is that a central organisation builds the test collection, and researchers

around the world use it to test their own methods and systems, reporting back to the


conference with results presented in a standardised way. The TREC collection is far

larger than any previous test collection, comprising about 2.5Gb of heterogeneous data. This

includes the WSJ collection as a subset of the TREC-3 category B data. This consists of

550 megabytes of articles from the Wall Street Journal.

There have been eight TREC conferences covering a variety of tasks since 1991. Systems

that use automated statistical techniques have achieved consistent success in TREC.

Examples include SMART (e.g. Buckley et al. 1998), INQUERY (Allan et al. 1998), and

OKAPI (e.g. Robertson, Walker, and Beaulieu 1998).

Zobel (1998) has pointed out that, as no one has categorised the TREC collections for

relevance, precise calculations of recall and precision may not be made. The many

terabytes of information on the World Wide Web will pose problems that have

not been addressed in the past by Information Retrieval (Ellman and Tait 1996). One

possible help here would be an understanding of the statistical properties of language

known as Zipf's Law.

2.2.7 Zipf's Law

Zipf (1949) observed a regular power-law relationship between the frequency of some event and its rank. Zipf (1949) presented many areas where this

relationship can be observed including the size of cities, and, most importantly, word

frequency. Zipf (1949) noted that if the frequency of occurrence of each word in a text is

ranked, then the frequency of the second most common word will be a half that of the

most common word and that of the third most common will be a third of the most

common word, and so on. That is:-

Frequency of Rank N ≈ (Frequency of Rank 1) / N

From this it follows that if the Log of Word Frequency is plotted against Log of Rank a

straight line with a slope of minus one will be obtained. This implies that the frequency of

the most common word is equal to the rank of the word whose frequency of occurrence is

equal to one. Zipf (1949) showed the same phenomena occur for English and several

other languages. Furthermore, Zipf (1949) noted the same phenomena in many other

areas of language, such as the distance between identical words in a text (Zipf 1949, p. 41).

Data on Zipf's law are summarised by Li (2000).
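Checking the rank-frequency prediction on a text is simple to sketch. A text of realistic size would be needed for the law to hold; the one-line sample below is purely illustrative:

```python
from collections import Counter

# Sketch of testing Zipf's law: rank words by frequency and compare each
# observed frequency with the prediction freq(rank 1) / rank.
def zipf_table(text, top=10):
    counts = Counter(text.lower().split())
    ranked = counts.most_common(top)
    f1 = ranked[0][1]                       # frequency of the rank-1 word
    return [(rank, word, freq, f1 / rank)
            for rank, (word, freq) in enumerate(ranked, start=1)]

sample = "the cat sat on the mat and the dog sat on the log"
for rank, word, freq, predicted in zipf_table(sample):
    print(rank, word, freq, round(predicted, 2))
```

Plotting log frequency against log rank for the full table would give the straight line of slope minus one described above, to the extent that the text obeys the law.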


2.2.8 Summary

Information Retrieval is concerned with identifying documents in a collection that are

similar to a user�s query. Methods may be characterised as manual, or automatic. Manual

methods are costly, and there is considerable IR research in automatic methods. These

may be statistical, or knowledge based. Although the latter may give better results in

small document collections, they are too expensive to cope alone with large ones.

However, they may be used in combination as a post-processing step with purely

statistical methods. Such automatic query processing has been shown to improve IR

performance.

The vocabulary problem, where a searcher's word does not match that in a relevant

document, is fundamental in IR. This may be addressed using a thesaurus, and such

thesauri may be automatically created. Generic thesauri have not been shown to be of

value in IR, although domain specific ones may be useful.

2.3 Case Based Reasoning

2.3.1 Introduction

Case Based Reasoning is a problem solving method that seeks to solve new problems by

reference to previous successful solutions to related ones. CBR has been used in

applications such as fault diagnosis and building construction (Kolodner 1983, Aamodt and Plaza 1994, Watson 1997, Lenz et al. 1998).

CBR uses a particular development methodology whereby the system's performance improves as more appropriate cases are added to the store of known problems and solutions, known as the case base.

Aamodt and Plaza (1994) described this method as a CBR cycle, which is represented

graphically in Figure 2-2 below.


Figure 2-2: Case Based Reasoning Cycle

Aamodt and Plaza (1994) describe the CBR cycle as being made up of four steps:-

① RETRIEVE the most similar cases

② REUSE the knowledge in those cases to solve the problem

③ REVISE the proposed solution

④ RETAIN the parts of the experience likely to be useful for future work
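These four steps can be sketched over a toy case base. The case representation (sets of symptoms), the Jaccard-style retrieval and the crude revision rule are all invented for illustration, standing in for the domain-specific machinery a real CBR system would use:

```python
# A minimal sketch of the Aamodt and Plaza cycle over invented cases.
case_base = [
    ({"fever", "cough"}, "influenza"),
    ({"sneezing", "itchy eyes"}, "hay fever"),
]

def retrieve(problem):
    """RETRIEVE: most similar stored case by set overlap (Jaccard)."""
    return max(case_base,
               key=lambda c: len(problem & c[0]) / len(problem | c[0]))

def solve(problem):
    features, solution = retrieve(problem)          # RETRIEVE
    proposed = solution                             # REUSE the old solution
    if problem != features:                         # REVISE: flag imperfect match
        proposed = proposed + " (unconfirmed)"
    case_base.append((problem, proposed))           # RETAIN the new case
    return proposed

print(solve({"fever", "cough", "headache"}))  # influenza (unconfirmed)
```

Each call to `solve` grows the case base, which is the sense in which performance improves as more cases are retained.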

CBR is frequently applied to the development of Knowledge Based Systems, as it

potentially eliminates at least three problems with that area (Watson and Marir 1994).

Firstly, CBR does not require a model or understanding of the target domain. Secondly,

CBR does not require an explicit knowledge elicitation phase with its dependence on

skilled knowledge engineers, although it does require a set of appropriate cases to initially

populate its case base. Thirdly, CBR may be efficient, since collections of cases may be

stored using database technology (Shimazu et al. 1993). This is considerably more

efficient than the storage of large rule bases in flat files typically associated with

Knowledge Based Systems.

The key issues when building a CBR system are:

① REPRESENTING a case to capture its true meaning.

② INDEXING cases for rapid retrieval.

③ ASSESSING the similarity between the test case and the stored cases.

④ ADAPTING a previously successful solution to a new problem.

The similarity between the problems faced by CBR and IR has long been noted (e.g.

Callan and Croft 1993, Rissland and Daniels 1996).


2.3.2 Textual Case Based Reasoning

Lenz, Hübner and Kunze (1998) have coined the term 'Textual CBR' for systems that

seek to apply case based reasoning technology to textual documents as opposed to highly

structured cases. T-CBR systems face a number of problems associated with textual

representation that differentiate them from other CBR systems that represent problems

and their stored solutions as simple data structures. These problems include the well-known issues of structural and semantic ambiguity that apply to many areas of natural

language processing.

Example areas where text cases are used to derive solutions from previous successes

include case law, and medical reports.

The Law is a natural application area for Case Based Reasoning (Ashley and Rissland 1988), since previous cases are often used to supplement deductive reasoning with legal rules. Furthermore, lawyers and judges reason analogically with precedent cases.

Consequently, KBS-style rule predicates are simply not sufficiently well defined for the

inference of correct legal decisions.

Legal cases are encoded as text, so the law is a special candidate for text-based CBR. This has been explored by Ashley, who developed a program known as HYPO. HYPO explored adversarial case-based reasoning with cases and hypotheticals in the legal domain

(Ashley 1990, reviewed in Rissland and Daniels 1996).

Less specific approaches to T-CBR have been taken by, for example, Kunze and Hübner (1998) and Burke et al. (1997). Kunze and Hübner (1998) used a combination of shallow NLP techniques and semi-structured documents in the FallQ project, and for document management in the ExperienceBook project, which provided support for UNIX system administrators. Burke et al. (1997) exploit the question-answer format in FAQ Finder, a system which finds FAQs on the Internet that correspond to a user's question. FAQ Finder uses a combination of statistical methods and shallow WordNet-based semantics.

Lenz (1998) has compared Textual CBR with methods used in Information Retrieval. His

results are summarised in Table 2-2 below.


Table 2-2: Comparison of IR and Textual CBR (reproduced from Lenz 1998)

                              IR                                   Textual CBR
Representation of documents   sets of index terms obtained from    sets of features established
                              statistical evaluation               during knowledge acquisition
Similarity measure            based on term frequency weighting    based on domain theory
Application to new domains    easy                                 requires knowledge acquisition
Domain knowledge              not considered                       required
Non-textual information       cannot be used                       can be integrated
Evaluation                    well designed                        not yet addressed sufficiently

Lenz's (1998) view of IR is certainly a caricature of extreme positions, since document representations are not always statistical, similarity measures are not always based on term frequency, and non-textual information can be integrated. However, the view of IR shown in Table 2-2 does represent the majority of conventional Information Retrieval systems, which, for example, consistently perform well in TREC (Voorhees and Harman

1998). Lenz (1998) also recognises the mutual importance to IR and T-CBR of what he

terms the 'Ambiguity problem', and the 'Paraphrase problem'. In the ambiguity

problem, one or more keywords in compared documents may be highly similar, but the

actual meaning of the documents may differ, whereas in the paraphrase problem the same

meaning may be expressed using completely different expressions. These phenomena

were termed lexical ambiguity and the vocabulary problem in Section 2.2, which covered

IR.

2.3.3 Summary

Case Based Reasoning is a problem solving method that relies on identifying the

similarity between a problem and a previous one for which a solution has been identified.

T-CBR is a sub-domain that uses textual data. Generic approaches to text similarity may be useful in themselves, or may augment domain-specific knowledge.

2.4 Natural Language Processing

2.4.1 Introduction

Natural Language Processing (NLP) is also commonly known as computational

linguistics. It 'is a discipline between linguistics and computer science which is concerned with the computational aspects of the human language faculty' (Radev 1997).


NLP is commonly divided into different levels of analysis. Liddy (1998) gives the

following succinct breakdown of the field's sub-areas:

Table 2-3: Sub-domains of NLP (reproduced from Liddy 1998)

Phonological: interpretation of speech sounds within and across words

Morphological: componential analysis of words, including prefixes, suffixes and roots

Lexical: word level analysis including lexical meaning and part of speech analysis

Syntactic: analysis of words in a sentence in order to uncover the grammatical structure of the

sentence

Semantic: determining the possible meanings of a sentence, including disambiguation of words

in context

Discourse: interpreting structure and meaning conveyed by texts larger than a sentence

Pragmatic: understanding the purposeful use of language in situations, particularly those aspects

of language which require world knowledge

Like IR, NLP is a well-established research area complete with textbooks (e.g. Allen

1995, Charniak and Wilks 1976, and many more), collections of important papers (Grosz,

Spärck-Jones, and Webber 1986), and sets of Frequently Asked Questions (Radev 1996)

that point to many more resources.

NLP encompasses very many areas of active research, including tagging a text's grammatical parts of speech (e.g. Brill 1992); the derivation of a word's base form from inflected variants (morphology, e.g. Antworth 1993); identifying the syntactic structures of sentences (parsing); the generation of coherent natural language for program output; and related studies of semantics and pragmatics.

A comprehensive account of text processing must either address the component areas of

syntax, semantics, and pragmatics with their unresolved problems, or avoid them. Since

all of these problems are sufficiently difficult as to be areas of scientific inquiry in their

own right there have been few approaches to processing whole texts in NLP, as that

requires solutions to many unresolved problems.

There are two basic reasons for considering whole texts from an NLP perspective. One

reason would be to generate texts as program output, whilst the other would be to analyse

text to determine its meaning. Text generation is simpler than analysis, as it is possible to


output ambiguous words or phrases and assume that readers will interpret the correct

sense from context. This does not resolve the issues of style or argument structure, which

text generation approaches must address. Examples of such approaches include text

grammars (van Dijk 1977), and Rhetorical Structure Theory (Mann, and Thompson

1988). Text grammars are used in restricted domains such as business correspondence

(Tait and Ellman 1999) where ad hoc associations between text structures can be

encoded. Rhetorical Structure Theory, by contrast, looks at formal relationships between

classes of discourse elements that can be justified at the speech act level (e.g. Austin

1962, Searle 1969, Ellman 1983). This leads to discourse-based approaches to natural

language generation (Dale et al. 1998).

In the next section we will examine work in NLP that is relevant to the idea sketched in

Chapter 1. That is, determining text similarity using lexical chains. In outline, we will

proceed as follows: Firstly, in order to determine if there is a coherence relation between

words, we need to identify the sense in which a word is being used. This problem is

related to the earlier discussion on lexical ambiguity in Section 2.2.2. Once a word sense

has been identified, we need to calculate the strength of the connection to a candidate

lexical chain. This is addressed through a discussion of semantic similarity. Again this

discussion parallels that on text similarity in Section 2.2.3. Next we consider sense tagged

corpora, which are used for evaluation and comparison of alternate approaches to word

sense disambiguation. This has a clear relationship with the issue of evaluation and test

collections in Section 2.2.6. Finally we look more comprehensively at work on lexical

chains as this forms the backbone of this study.

2.4.2 Word Senses and Classification Schemes

Many words have multiple senses irrespective of which classification scheme is used.

This leads to the word sense disambiguation problem. Word sense disambiguation is an

active subfield of NLP. Recent overviews and insights into the literature are given by Ng and Zelle (1997), and especially Ide and Véronis (1998), who give approximately 200 references to work in the area.

Multiple word senses may be classified as homonymous or polysemous for which

definitions were given in Section 1.6. Collocate ambiguity is also important. We will look

briefly at examples of each.


For an example of collocate ambiguity consider the word 'cone' in phrases such as pine

cones, ice cream cones, and rods and cones. Collocate ambiguity can be resolved by

reference to immediate local context. Lesk (1987) used several machine-readable

dictionaries (e.g., Webster's 7th, Collins, OED) and looked for word overlaps with the

dictionary entries within a 10-word window of the target. Lesk (1987) reported a 50-70%

success rate in selecting the correct word sense using this technique.
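A much-simplified version of Lesk's overlap idea can be sketched as follows. The two glosses for 'cone' are paraphrased for illustration, not quoted from any of the dictionaries Lesk used:

```python
# Simplified sketch of Lesk's dictionary-overlap disambiguation: choose the
# sense whose gloss shares most words with the window around the target.
# The glosses below are paraphrased, invented examples.
GLOSSES = {
    "cone_botany": "fruit of the pine tree bearing seeds between woody scales",
    "cone_food": "thin crisp wafer rolled to hold ice cream",
}

def lesk(context_words, glosses):
    def overlap(gloss):
        return len(set(context_words) & set(gloss.split()))
    return max(glosses, key=lambda sense: overlap(glosses[sense]))

window = "she bought an ice cream cone at the beach stall".split()
print(lesk(window, GLOSSES))  # cone_food
```

A fuller implementation would stem the words and discount function words before counting overlaps, but the principle is the same.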

Homographic ambiguity presents few problems to readers, and this type of ambiguity is

closest to the notion of lexical ambiguity discussed in Section 2.2.2. For example,

consider the senses of homonyms such as 'rowing' as in:-

1. I saw a couple rowing outside the pub

2. I saw a couple rowing the boat

Polysemous ambiguity is problematic (e.g. Kilgarriff 1997), as terms such as 'business' and 'point' may have many discernible senses. Exactly how many depends on the

classification system used. Usually these are based on machine-readable dictionaries and

thesauri, since these are accessible, and generally independent of theoretical bias.

Common works are:-

• LDOCE: Longman's Dictionary of Contemporary English

• Webster's 7th

• Roget's Thesaurus

• Princeton's WordNet

Unfortunately, these sources are not guaranteed to contain the same words, or to

categorise their senses consistently with each other. The reasons for this are partly

functional, and partly economic. Kilgarriff (1997) points out that dictionary publishers

emphasise the number of entries in their dictionaries as a marketing ploy. Consequently

there is pressure to augment the number of sense distinctions identified. Conversely, a

dictionary aimed at language learners requires fewer, simpler entries than one aimed at

proficient native speakers.

WordNet (Miller et al. 1990) is often used in NLP systems since it is both large, and

easily available for research purposes. WordNet is a hierarchical lexical database that is

fully indexed. However, WordNet's quality is 'variable', and its hierarchy is uneven


(Hearst and Schütze 1996). In fact, it is not clear that WordNet is the best classification

system. Voorhees (1994) indicated that she had difficulty selecting the correct word sense

from WordNet. This may account for performance degradation she observed in the TREC

task.

Roget's thesaurus (Appendix VII) is also useful, since it has an implicit structured hierarchy that is quite evenly balanced. Although smaller than WordNet, it still has 350 pages of text with 1024 entries, which are known as 'heads'.

Although this number has varied over different editions (e.g. the 1987 edition has 990 heads), Roget's thesaurus has been refined over more than 200 years. It is also available electronically: Project Gutenberg distributes the 1911 edition, since it is out of copyright.

Yarowsky (1992) used Roget's Thesaurus (1977 edition) in a lexical disambiguator based

on a statistical model of the Roget categories. He achieved an accuracy of up to 92% on

12 polysemous words examined in a 50-word window, which demonstrates Roget's

potential utility for the word sense disambiguation problem.

2.4.3 Semantic Similarity

Deciding how close words are in meaning is known as the semantic similarity problem.

This is critical when considering whether to link them into lexical chains. There have

been two main approaches to the semantic similarity problem. The first is based on information content, and the second on calculating distance in a semantic hierarchy.

A psychological experiment by Miller and Charles (1991) has provided an evaluation

metric for word similarity studies. They presented subjects with a list of thirty word pairs and asked for them to be rated for 'similarity in meaning' on a scale from zero to four. A

rating of zero implied the words were completely dissimilar and four that the words were

perfect synonyms.

Resnik (1995) replicated this task (on twenty-eight word pairs4) finding a correlation of

r=0.9011 between the similarity judgements his experiment found, and those found by

Miller and Charles (1991). This is considered a reasonable upper bound on what to expect

from a computational procedure (Resnik 1995, Jiang and Conrath 1997, McHale 1998).

4 Two word pairs include the word 'Woodland', which was not then in WordNet.
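The correlations reported in these studies (e.g. r = 0.9011) are Pearson correlations between two sets of similarity ratings, which can be sketched as follows. The two rating lists below are invented for illustration and are not Miller and Charles's data:

```python
# Pearson correlation between two lists of similarity judgements, e.g.
# human ratings against a computed measure. The ratings are invented.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

human = [3.9, 3.5, 0.4, 2.7, 1.1]   # hypothetical 0-4 similarity ratings
model = [3.7, 3.2, 0.9, 2.9, 1.4]   # hypothetical computed scores
print(round(pearson(human, model), 4))
```

A value near 1 indicates that the computed measure ranks the word pairs in nearly the same order, and with nearly the same spacing, as the human judges.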


Distance Based Approaches

Distance based approaches to computing semantic similarity consider the information

source as a semantic net, and count the number of links (or edges) between two concepts

as a measure of semantic distance. This idea is based on a model of human memory

introduced by Collins and Quillian (1969). They hypothesised that people store concepts

within a hierarchical structure and tested this using reaction time experiments.

Rada and Bicknell (1989) used conceptual distance in the MeSH (Medical Subject

Headings) thesaurus to rank MeSH encoded queries against documents from the

MEDLINE bibliographic database. They found significant correlation between the human

rankings, and those generated by their edge-counting algorithm.

If semantic similarity is based on counting the shortest path between two concepts (Rada and

Bicknell 1989), the taxonomy needs to have edges of equal length and value. If not,

suitable weights need to be applied. For example, relative conceptual density in the

WordNet hierarchy has been used in a word sense disambiguation task (Agirre and Rigau

1996). Resnik (1995) replicated the Miller and Charles (1991) experiment using simple

WordNet edge counting (and other techniques), and found poor correlation (0.66, see

Table 2-4 below). Note however that McHale (1998) found far better correlation using

edge counting from Roget's Thesaurus. Thus it appears that Roget's thesaurus is at least

as well suited to the semantic similarity task as WordNet.

Information Based Approaches

The information content approach to semantic similarity considers the extent to which

two nodes in a hierarchical concept space share information. The information content of

each node is considered as:-

IC(c) = −log P(c), where P(c) is the probability of encountering that concept

The similarity between two concepts is then the information content of the most informative node that subsumes both concepts (their least upper bound in the hierarchy).

in (Jiang and Conrath 1997).
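The calculation can be sketched on a toy taxonomy. The hierarchy and the concept probabilities below are invented for illustration; in practice P(c) is estimated from corpus frequencies, as discussed next:

```python
import math

# Sketch of information-content similarity: IC(c) = -log P(c), and the
# similarity of two concepts is the IC of their most informative common
# subsumer. Taxonomy and probabilities are invented for illustration.
PARENT = {"cat": "mammal", "dog": "mammal", "mammal": "animal", "animal": None}
P = {"cat": 0.05, "dog": 0.05, "mammal": 0.2, "animal": 1.0}

def ic(concept):
    return -math.log(P[concept])

def ancestors(concept):
    """The concept itself plus every node above it in the hierarchy."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain

def resnik_sim(a, b):
    """IC of the most informative concept subsuming both a and b."""
    common = set(ancestors(a)) & set(ancestors(b))
    return max(ic(c) for c in common)

print(round(resnik_sim("cat", "dog"), 3))  # 1.609, i.e. -log(0.2) at "mammal"
```

Note that the root, with probability 1, carries zero information, so similarity through the root alone is zero.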

Resnik (1995) defined concept frequency as the sum of word frequencies in that concept.

This however takes no account of word sense ambiguity. Richardson and Smeaton (1995)

corrected for this by dividing word frequency by the number of classes in which it is


found (i.e. its degree of polysemy and homonymy). However this estimation may be

unsound, since the distribution of word senses in the Senseval sense tagged corpus is not

linear (see Chapter 6).

Jiang and Conrath (1997) have proposed a combined approach to semantic similarity that

adds information content to edge counting. They estimate concept frequency using noun

frequency from SemCor, a WordNet sense-tagged corpus (Miller et al. 1993). They note

that SemCor only includes half the words in WordNet, so it is unlikely that a word's sense

frequencies will model its general usage.

Table 2-4 below (reproduced from McHale 1998 page 2) summarises results on semantic

similarity that have replicated Miller and Charles (1991).

Table 2-4: Correlation of similarity measurements

Similarity Method                      Correlation
Human judgements (replication)         r = .9015
WordNet
    Information Content                r = .7911
    Edge Counting                      r = .6645
    Jiang and Conrath                  r = .8282
Roget's
    Information Content                r = .7900
    Edge Counting                      r = .8862
    Intervening Words                  r = .5734
    Jiang and Conrath                  r = .7911

These results should be interpreted cautiously since the number of word pairs used by

Miller and Charles (1991) is very small in comparison to the number of words in English

(roughly 100,0005). Consequently these data are subject to fluctuations in response to

5 The Concise Oxford Dictionary (9th edition) contains 140,000 definitions including collocations. 100,000 words is

consequently a conservative estimate for the number of individual words.


minor adjustments to the thesaural hierarchy. For example, Jiang and Conrath (1997)

increased the degree of correlation to r=0.8654 by removing a single questionable

classification of the word 'furnace' in WordNet.

Edge counting does seem to provide a good measure of semantic similarity when applied

to MeSH (Rada and Bicknell 1989) and Roget (McHale 1998), whilst information content

performs better in WordNet (Jiang and Conrath 1997). Consequently, conceptual distance

as measured by edge counting could be considered a viable technique for the formation of

lexical chains.

2.4.4 Sense Tagged Corpora

A problem with the various approaches to word sense disambiguation is that the results

are rarely comparable, since evaluation uses different data sets. These will also have been

tagged with word senses to different degrees of accuracy. One option here is to use a

sense tagged corpus as a comparative benchmark.

A sense tagged corpus is a collection of texts where word senses have been marked by

human assessors.6 There are two common variations on sense tagged corpora. In the first

the preferred sense of every word in a complete text is indicated. The second type of

corpus is made up of isolated individual sentences (or paragraphs) in which only the

preferred sense of some particular word is indicated. This is known as the 'lexical samples' technique.

The full-text technique has the advantage that the task is more ecologically valid (see

Section 1.6). It represents an accurate picture of how most readers will encounter

ambiguous words. It does have the disadvantage that it is far more demanding since each

individual word must be looked up in a dictionary (and the entry compared to the word

sense in use). This makes it somewhat unlikely that individual words will occur

frequently enough to statistically represent actual usage.

The variety of words in this method also makes it difficult for human sense taggers to

become familiar with the individual dictionary entries. This means that sense tagged full

texts are more error prone than in the lexical sampling technique (Kilgarriff and Rosenzweig 2000).

[6] Typically two assessors will mark up the samples independently. Where they disagree, a third arbitrates.


The lexical samples technique has the advantage that a corpus may contain a

representative sample of word meanings and usage frequencies. Its principal disadvantage

is that it is not a wholly natural task. Nonetheless, human assessors rarely have difficulty

identifying the intended sense, so the advantages of statistical validity and ease of

creation outweigh the disadvantages.

Few corpora are available for either sense-tagging task, since the activity of producing

such a corpus is both intellectually demanding, and time consuming. One of the best

available full text resources is known as SemCor (semantic concordance), which is

available as part of the WordNet distribution (Fellbaum et al. 1998). Senseval is a recent

corpus that has been created for the evaluation of sense tagging of lexical samples

(Kilgarriff and Rosenzweig 2000).

SemCor is a set of two hundred articles from the Brown corpus (Francis and Kucera

1979) that have been tagged with WordNet senses (Fellbaum et al. 1998). SemCor suffers

from two faults as a sense tagged corpus. Firstly, the sense tagging is not sufficiently

accurate, and secondly, it is not a random sample.

Ng and Lee (1996) report manually retagging SemCor. They found an agreement of

approximately 67% between the two tagged versions, which is disappointingly low.

Leacock (1998 personal communication) reports that SemCor was tagged by different

individual graduate students and not subsequently independently retagged. In fact, the

quality control technique was to randomly sample every tenth word, and check the

existing classification. This method has the advantage of speed, but the disadvantage that

whoever checks the existing classification is firstly primed that the existing classification

is reasonable, and secondly does not study the dictionary entry to consider whether other

sense categories might be more appropriate.

Leacock also notes that SemCor is a linear, not random, extract from the Brown corpus.

Funding constraints had prevented completely tagging the corpus as was originally

planned.

The Senseval corpus (Kilgarriff 1998) was a deliberate attempt to overcome the problems

with SemCor. Senseval used sense tags derived from an SGML encoded machine-readable dictionary called Hector. This was derived from an internal research project at the Oxford University Press. The entries in Hector are extremely detailed, and include:

1. Surface Forms (Including Collocations)

2. Text Definitions

3. Part Of Speech

4. Examples Of Usage

5. Idiomatic Phrases

Since the dictionary entries are detailed, they identify nuances of meaning as polysemous

senses. That is, entries are defined to a high level of granularity, distinguishing shades of meaning that would not often be considered individual senses. This made it difficult to identify in which precise sense and sub-sense a word was used.

The Senseval corpus of lexical samples was manually sense tagged by two professional

lexicographers. In case of disagreement, a third arbitrated. This led to an overall inter-tagger agreement of 90% at the finest level of granularity, and more than 99% at the

coarse level.

2.4.5 Lexical Chains

Lexical chains (Halliday and Hasan 1976, 1989, Morris and Hirst 1991) have been

applied to several different areas of computer based language processing. An example and

definition of lexical chains were given in Section 1.6. This section reports work on lexical

chains, paying particular attention to whether lexical cohesion based approaches perform better than alternative (term based) methods.

The first published implementation of a lexical chaining program was for Japanese

(Okumura and Honda 1994). They used lexical chaining for word sense disambiguation

in a speech recognition task, and for text segmentation. Okumura and Honda (1994)

reported an accuracy of 66% on the word sense disambiguation task. They showed that

the lexical chaining process implicitly provides word sense disambiguation, since it

associates words in a text using relationships derived from a thesaurus. If a word is

ambiguous, it will have more than one entry in the thesaurus. If only one of these senses

plays a part in an association with another word, then the first word is disambiguated with

respect to that association, and is assumed to be used in that sense in the text.
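The implicit disambiguation described above can be sketched as follows. The mini-thesaurus and its category numbers are hypothetical; the point is only that a sense is selected when exactly one of an ambiguous word's categories links it to its context.

```python
# Hypothetical mini-thesaurus: word -> set of thesaurus category numbers.
thesaurus = {
    "bank":  {802, 344},   # 802 a 'treasury' sense, 344 a 'land' sense (invented)
    "money": {802},
    "loan":  {802, 784},
}

def disambiguate(word, context):
    """Return the single thesaurus category of `word` shared with its
    context words, if exactly one of its senses participates in a link."""
    senses = thesaurus[word]
    linked = {cat for other in context for cat in senses & thesaurus[other]}
    return linked.pop() if len(linked) == 1 else None

print(disambiguate("bank", ["money", "loan"]))  # 802: the 'treasury' sense
```

Here only the 802 sense of 'bank' associates with 'money' and 'loan', so the chaining process has disambiguated 'bank' with respect to those associations.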


StOnge (1995), and StOnge and Hirst (1998) described a lexical chainer for English using

WordNet as the lexical database. Linking relations between nouns were determined by

finding relations from WordNet. These include membership of the same WordNet set of synonyms ('Synset'), hyponymy [7] (IS-A relations), meronymy [8] (part-whole relations), and other relations determined through WordNet's semantic net.

The chaining algorithm improved on the stack-based approach used by Okumura and Honda (1994), replacing salience-based ordering of the stack with one that uses recency.

It should be noted that StOnge and Hirst's (1998) chainer deals with nouns, or words that

may be morphologically reduced to nouns. This is because there is no clear relationship

between the noun, verb, adjective, and other hierarchies in WordNet.

StOnge and Hirst's (1998) chainer was applied to the detection of malapropisms in text.

These words are correctly spelled, but inappropriate in their context. For example in the

following sentences:

1. The bees were attracted to the flour.

2. The bees were attracted to the flower.

It is clear that the author probably intended (2). However, no current spelling checker

could determine this. StOnge's (1995) thesis was that words that could not be chained to

others could be inappropriate in their context. Since these words did not form chains, he

named them 'atomic chains'. Unfortunately, StOnge detected approximately ten times

more atomic chains than were malapropisms, thus rendering the technique impractical for

malapropism detection.

StOnge (1995) attributes his negative results to inaccuracy in the chaining process. In

particular, he identifies two major problems, under and over-chaining. In under-chaining,

words that should be joined to an existing chain are omitted, whilst in over-chaining,

spurious associations between words are identified that cause them to be incorrectly

joined to an existing chain.

[7] Hyponym: one of a group of terms whose meanings are included in the meaning of a more general term, e.g. spaniel and puppy in the meaning of dog. (Chambers Dictionary)

[8] Meronym: a word whose relation to another in meaning is that of part to whole, e.g. whisker in relation to cat. (Chambers Dictionary)


StOnge (1995) states that under-chaining may have four causes:

1. An inadequacy of WordNet's set of relations. For instance, child care and school

cannot be related using WordNet's relations.

2. A lack of connections in WordNet. For example, WordNet's relation set could properly link beef stew and beef with a single substance meronym/holonym relation, but no such link exists in WordNet's graph.

3. A lack of consistency in the semantic proximity expressed by WordNet's links.

For example, in WordNet's graph, the shortest path between stew and steak has

6 links while the shortest path between Australian and millionaire has 4 links.

4. A poor algorithm for chaining.

StOnge (1995) also states that over-chaining might be caused whenever two words are

very close to each other in WordNet's graph while being distant semantically. This lack of

consistency in the semantic proximity expressed by WordNet's links often results in the

merging of two chains.

Stairmand (1996) also used WordNet to create a lexical chainer. His approach is based on

spreading activation rather than the linear approach initially suggested by Morris and

Hirst (1991). Like StOnge, Stairmand's chainer checks nouns only with their near

neighbours for relationships that may be found in WordNet. However, rather than trying

to chain all the words in a text Stairmand aims to only identify chains that could reflect

the structure of a text as proposed by Morris and Hirst (1991). This makes Stairmand's

work applicable to the text segmentation problem.

Stairmand (1996) also reports experiments on the text segmentation task. That is, the

division of a text into paragraphs. He compared his technique against Hearst's (1994) TextTiling algorithm and found TextTiling had superior performance. Stairmand

(1996) does not explain this. Similarly, Hearst reports that an earlier version of TextTiling used a thesaurus, but found better performance without it.

Stairmand (1996) and Stairmand and Black (1996) report experiments on Information

Retrieval using WordNet derived lexical chains. They indexed 90,000 news articles using

their chainer, and compared retrieval performance to SMART (Salton and Buckley 1990)

on twelve very simple queries based on topics used in TREC (Harman 1993). The

evaluation results were positive, although Stairmand and Black (1996) note that the


approach would not be suitable for general information retrieval, since only terms that

appear in WordNet can form lexical chains, and hence be indexed. Consequently, whilst

this study may be applicable to local document similarity assessment (Section 2.2.4) it is

not proposed as a general solution for Information Retrieval.

Barzilay and Elhadad (1997) have also implemented a lexical chaining algorithm that is

used for text summarisation. They suggest that good text summaries may be produced by

extracting sentences from texts that contain the 'strongest' chains, where chain strength is

a function of chain length and homogeneity. They tag the text initially, and then segment

it using the TextTiling algorithm (Hearst 1994). Lexical chains are then derived using a

WordNet based algorithm similar to StOnge and Hirst's (1998). Barzilay and Elhadad

(1997) however claim superior performance by adopting a lazy disambiguation strategy.

This involves maintaining all possible word senses until there is clear evidence which

word sense is preferred. No data are given that compare this approach with other text

summarisation techniques (e.g. Alterman 1991).
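Barzilay and Elhadad report only that chain strength is a function of length and homogeneity, so the particular scoring formula below is an assumption chosen for illustration: a chain scores by its length weighted by the proportion of word repetition it contains.

```python
def chain_score(chain):
    """Illustrative chain strength: length weighted by homogeneity,
    where homogeneity is taken here as the proportion of the chain
    made up of repeated words.  The exact formula is an assumption,
    not Barzilay and Elhadad's published definition."""
    length = len(chain)
    homogeneity = 1.0 - len(set(chain)) / length
    return length * homogeneity

chains = [
    ["machine", "machine", "device", "machine"],  # long, repetitive
    ["cat", "dog"],                               # short, no repetition
]
strongest = max(chains, key=chain_score)
print(chain_score(strongest))  # 2.0
```

Sentences containing members of the strongest chains would then be extracted to form the summary.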

Kominek and Kazman (1997) report work on Jabber, an experimental system that allows

users to retrieve records of videoconferences based upon their (transcribed) verbal

contents. Jabber is able to summarise a set of related words, giving a name to each topic

using lexical chains. Users can then use this name to query or browse the stored

multimedia.

Jabber, or rather its subsystem Conceptfinder, uses nouns from WordNet to form clusters of concepts that are then merged according to the 'lowest common hypernym', in an

approach reminiscent of Stairmand's (1996). Kominek and Kazman (1997) report that

their Conceptfinder system is able to make distinctions among different senses of the

same words, and is able to summarise a set of related words. Kominek and Kazman

(1997) claim that early results are encouraging, but do not report any formal evaluation.

Green's work (1996, 1997, 1999) used lexical chains to generate hypertext links between

the different paragraphs contained within newspaper articles and a variant of this method

to generate links between different articles. Similarity between the paragraphs was based

on their semantic content as derived from their lexical chains. The program to create these

was based on that of StOnge (1995).


Green's (1997) method for calculating the similarity of paragraphs inside one article

('within article similarity') is based on the relative presence of lexical chains in the

different paragraphs (where the chain has components in more than one paragraph).

Green (1997) also describes a technique for calculating the similarity of paragraphs in

separate articles ('between article similarity'). This is based on the relatedness of their

lexical chains as measured by the relative presence of their component WordNet synsets.

That is, he identifies the synsets inside the paragraphs and then compares these using the

Dice coefficient, which is a commonly used IR similarity measure (e.g. Van Rijsbergen 1979).
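The Dice coefficient is simple to state: for two sets A and B it is 2|A ∩ B| / (|A| + |B|). A minimal sketch, applied to hypothetical synset identifiers:

```python
def dice(a, b):
    """Dice coefficient between two sets: 2|A intersect B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical synset identifiers drawn from two paragraphs.
para1 = {"synset.bank.1", "synset.river.1", "synset.water.1"}
para2 = {"synset.bank.1", "synset.water.1", "synset.fish.1"}
print(dice(para1, para2))  # 0.666...: two shared synsets out of three each
```

The coefficient ranges from 0.0 (no shared synsets) to 1.0 (identical synset sets), which makes it convenient for ranking candidate hypertext links.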

Green (1999) further considered comparing two documents based on their lexical chains.

Green (1999) highlights that due to the hierarchical nature of WordNet, it is common to

find documents that contain a large number of related words. His solution to this is to

restrict lexical chains to those containing identical words, words in the same WordNet

synset, or words in adjacent synsets. Next, he represents each document using two

vectors, each containing an element for each of WordNet's 60,577 noun synsets. The first vector contains the weight of each synset that occurs directly in the document, and the second contains the weights of synsets that are one link away in WordNet.
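A sparse sketch of this two-vector representation follows. The synset identifiers, the adjacency table, and the simple count weighting are all assumptions for illustration; Green's exact weighting scheme is not reproduced here.

```python
def synset_vectors(chains, neighbours):
    """Build two sparse document vectors from lexical chains:
    `direct` counts synsets appearing in a chain, and `linked`
    counts synsets one WordNet link away (counts stand in for
    whatever weighting scheme is actually used)."""
    direct, linked = {}, {}
    for chain in chains:
        for synset in chain:
            direct[synset] = direct.get(synset, 0) + 1
            for adj in neighbours.get(synset, ()):
                linked[adj] = linked.get(adj, 0) + 1
    return direct, linked

# Hypothetical synset identifiers and one-link adjacency.
neighbours = {"dog.n.01": ["canine.n.01"], "cat.n.01": ["feline.n.01"]}
chains = [["dog.n.01", "cat.n.01", "dog.n.01"]]

direct, linked = synset_vectors(chains, neighbours)
print(direct)  # {'dog.n.01': 2, 'cat.n.01': 1}
print(linked)  # {'canine.n.01': 2, 'feline.n.01': 1}
```

Storing only non-zero elements keeps the representation practical despite the 60,577-element conceptual vector space.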

Green (1999) experimentally evaluated the quality of the hypertext links generated.

Disappointingly, the results from his lexical chain based procedure were not significantly

better than those derived by a standard IR, term based, approach. He attributes this failure

to the limitation faced by any lexical chaining program that a word not in the thesaurus cannot contribute to the representation of its meaning, in addition to problems with word

sense ambiguity. Green (1999) reports that there is no way to tell whether the lexical

chainer has disambiguated a word correctly, and has no data on the average number of

incorrect disambiguations. The effect of word sense ambiguity on lexical chaining thus

remains an outstanding research issue.

In summary then, lexical chains have found application in areas as diverse as concept

identification in multimedia (Kominek and Kazman 1997), information retrieval and text

segmentation (Stairmand 1996), text summarisation (Barzilay and Elhadad 1997),

malapropism detection (StOnge 1995), and word sense disambiguation (Okumura and

Honda 1994).


The approaches to lexical chaining described above, for which there are computer

implementations using non-domain specific thesauri, have used nouns only. In English,

this is due to the difficulty of associating nouns and verbs in the WordNet hierarchy.

Modest success is generally reported although, where there has been formal comparison with an alternative approach to the problem (Stairmand 1996, IR; Green 1997, hypertext linking), the improvement is not statistically significant. Explanations offered for poor performance include deficiencies in WordNet's vocabulary and organisation (StOnge 1995, Kominek

and Kazman 1997, Green 1997), inadequacies in the chaining algorithm (StOnge 1995),

and word sense ambiguity (Green 1997). Stairmand (1996), and StOnge (1995) both

suggest using Roget's thesaurus as a method for circumventing the inadequacies of

WordNet for lexical chaining.

2.4.6 Summary

Lexical chaining is a method that has found application to a variety of problems in NLP.

Results have been generally disappointing though, possibly as a consequence of using

nouns only, or possibly due to word sense ambiguity. Researchers who have used

WordNet have suggested Roget's thesaurus as an alternative knowledge source. This has

been shown in initial experiments to give sensible results on semantic similarity.

2.5 Conclusions

This chapter has briefly overviewed work in IR, CBR, and NLP with the objective of

justifying the text similarity method developed in this thesis. We found that IR methods

could be divided into manual and automatic techniques, and that automatic techniques

could be divided into statistical or knowledge based approaches. Knowledge based

approaches have not been shown to be practical for large document collections, although

they may give impressive results in research studies. We also found that similarity

methods using thesauri were well known in IR (e.g. Rada and Bicknell 1989) in specific

subject areas. In particular, query-document similarity ranking was effective when used

following an initial query to analyse the documents returned in a procedure known as

local document analysis (Xu and Croft 1996). Since the local document set is small, an

expensive procedure based on NLP could be applicable.

T-CBR deals with case bases of texts that are small in comparison to IR document

collections, so again, the additional processing incurred by an NLP based similarity


measure would be acceptable if it provided improved similarity matching. T-CBR was

also shown to face similar problems to IR with respect to ambiguity.

A brief overview of NLP was also presented in Section 2.3. This focused on semantic

similarity where a simple edge counting technique using Roget's thesaurus was shown to

be a possible alternative to information content approaches using WordNet. Although

WordNet has formed the basis of much work on lexical chains, several problems with it

were identified. Principal amongst these was the lack of a relationship between the noun

and verb hierarchies. All work in lexical chains to date [9] has consequently been based on

nouns only.

There are consequently four research issues from the literature that motivate this research.

Firstly, whilst Morris and Hirst (1991) based their description of lexical chains on Roget's

thesaurus they did not provide a computer implementation due to a lack of a

machine-readable version. Thus, the relationships they identified in the thesaurus were

only verified by hand. A Roget based implementation would be a useful tool for assessing

their value, as an automatic system would be blind to the meaning of a text in a way that

is difficult or impossible for human readers. A Roget based lexical chainer will be

described in Chapter 3.

Secondly, Morris and Hirst (1991 p41) speculate that chain forming parameter settings

will vary according to an author's style. This suspicion has not previously been

investigated, and remains an outstanding research issue. It will be addressed in Chapter 4.

Thirdly, both Stairmand (1996), and StOnge (1995) separately observed that the

performance of their chainers could be improved by using Roget. This is because some

Roget relationships are simple to compute, but not possible with WordNet. For example,

the words 'blind' and 'rainbow' have an intuitive association concerned with sight and

visual phenomena that is reflected in their membership of the same group of Roget

categories.

[9] Loukachevitch and Dobrov (2000) have recently reported an all-words lexical chaining system for Russian that uses a hand-built thesaurus for the socio-political domain. The thesaurus excludes ambiguous relations. The system has been applied to text summarisation, categorisation, and conceptual indexing.


Finally, we note that Roget's thesaurus has a simpler, more balanced structure than

WordNet. If the main categories are not divided, the tree structure is only four ply deep.

Thus the whole tree structure, plus the pointers to related categories that are a feature of

Roget may be held in a computer's main memory. This makes it suitable for an efficient

implementation using all of the Morris and Hirst linking relationships.

With Roget, the same procedures can be used to identify relationships between any two

words in the thesaurus. StOnge (1995 p47) comments that some deficiencies in his

program may have arisen due to 'a lack of consistency in the semantic proximity expressed by WordNet's links'.

Roget's thesaurus naturally has some disadvantages with respect to WordNet. Roget's thesaurus gives no guidance as to word sense frequency. That is, whilst dictionaries order

their entries according to how frequently they are generally found, Roget gives equal

precedence to all senses of a word. This exacerbates problems of word sense ambiguity

(Chapter 5) and is an unresolved issue. By contrast, Princeton WordNet (version 1.5 and

up) lists word senses retrieved in order of their frequency of occurrence. Whilst the

derivation of this frequency information is suspect, it is preferable to the situation with

Roget's thesaurus. Consequently, it is not known whether the word sense ambiguity

problems noted by Stairmand (1996), StOnge (1995), and Green (1999) will be

intractable in a Roget based lexical chainer, due to lack of frequency information, or

ameliorated due to the different thesaurus structure. This issue of ambiguity is addressed

in Chapter 5, whilst the evaluation issue is addressed in Chapter 6.

We now go on to describe a lexical chaining program based on Roget's thesaurus which

uses all parts of speech. This is used to compare the similarity of texts.


Chapter 3. Hesperus: A System for Comparing the Similarity of Texts Using Lexical Chains

3.1. Introduction

We have hypothesised that the conceptual contents of texts may be used for similarity

judgements, and that these contents may be characterised with reference to an external

thesaurus.

This chapter presents supporting evidence for that hypothesis. This covers the

implementation of the lexical chainer and its various components leading to the derivation

of algorithms to compare the conceptual similarity of texts.

The relationship of this chapter to the thesis argument is shown in figure 3-1.

The purpose of the description of the system and its algorithms is threefold. Firstly, this

study is experimental in nature, and so the system constitutes an essential element of its

methodology. If the system design were conceptually flawed, work that is based upon it

in subsequent chapters would be built on weak foundations. A second reason for

describing the system is that this work needs to be understood in relation to previous work

on lexical chaining. This makes improvements explicit so that the techniques may be

understood or used by subsequent researchers. Finally, the description is essential if this

work is to be replicated.

Figure 3-1: Thesis chapter structure

1. Introduction.

2. Literature Review.

3. Hesperus: A system for comparing the similarity of texts using lexical chains.

4. The General Nature of Lexical Links.

5. Word Sense Disambiguation and Hesperus.

6. Evaluating Hesperus.

7. Conclusion.


An objective of this work is the development of matching techniques for documents based

on their concepts as represented by thesaural categories, rather than their constituent

words. This requires:

1. the selection of a structure to represent a document's contents,

2. a method to derive this structure from texts,

3. a technique to compare these structures.

These requirements are interdependent, since overly complex representations of different

texts may not be directly comparable. Inspiration here comes from the area of Case Based

Reasoning (CBR, e.g. Aamodt and Plaza 1994) that attempts to resolve queries based on

previous problem solutions (Chapter 2). As a field, CBR is involved in the use of generic

similarity matching techniques applicable to areas as diverse as printer fault diagnosis,

building design, and personal income tax planning.

In CBR, a query and examples are usually represented as attribute value pairs. Thus to

apply CBR to document comparison, both the text acting as a query, and the documents to

be compared against need to be represented equivalently. If this representation is based on

simple terms (i.e. words), the problem becomes hugely complex, since there are about

100,000 words in the English language. This would also be a fragile approach, since

semantically equivalent words would not count as equal. However, if documents may be

represented as sets of Roget categories, the problem becomes tractable. The purpose of

this chapter is to describe a system that performs this transformation.
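The intended representation can be sketched as an attribute value vector over Roget categories. The category numbers and the simple frequency weighting below are illustrative assumptions; the actual derivation of the Generic Document Profile is developed in this chapter.

```python
from collections import Counter

def generic_document_profile(chains):
    """Sketch of a Generic Document Profile: an attribute-value
    vector mapping each Roget category (the attribute) to a
    normalised strength (here simply the share of chained words
    belonging to that category, an assumed weighting)."""
    profile = Counter()
    for category, words in chains:
        profile[category] += len(words)
    total = sum(profile.values())
    return {cat: count / total for cat, count in profile.items()}

# Hypothetical lexical chains: (Roget category number, chained words).
chains = [(802, ["money", "bank", "loan"]), (267, ["ship", "sail"])]
print(generic_document_profile(chains))  # {802: 0.6, 267: 0.4}
```

With roughly 1,000 Roget categories instead of roughly 100,000 English words, two such vectors can be compared directly, and semantically equivalent words contribute to the same attribute.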

We firstly describe the implementation of a lexical chainer based on Roget's thesaurus

(Appendix VII). As described in Chapter 2, the identification of lexical chains in a text

supports several applications in Natural Language Processing. In this work, they are a

prerequisite for the derivation of the Generic Document Profile (GDP) that is used to

compute similarity between texts. Consequently, we describe how this GDP may be

derived from the lexical chains. As the word senses used in a lexical chain are difficult to

understand outside of the surrounding text, we present a method that allows their

visualisation in context. Finally, the derivation of the 'Electronic Roget' is presented.

Although Project Gutenberg (1999) has made the machine-readable text of Roget's

thesaurus available, it lacks an index. Consequently, the derivation of the index is

described. This description may be useful to researchers in languages other than English,


who are looking for methods to convert their thesauri into efficient supporting tools for

lexical chaining.

3.2 Hesperus: A System for comparing Text Similarity using Lexical

Chains.

Hesperus is a system designed to compare how similar texts are by measuring their

conceptual contents as determined by their thesaurally defined lexical chains. This

process is described in some detail in this section. In outline it is however as follows:

Texts are firstly processed individually to determine their lexical chains, and subsequently

their generic document profiles. These are stored in a database of cases, which is known

as a 'case base' in CBR. The profiles can then be clustered for similarity to the exemplar

texts using the nearest neighbour algorithm as is common in CBR.
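The nearest neighbour step can be sketched over such profiles as follows. The overlap measure and the example profiles are assumptions for illustration; the comparison actually used by Hesperus is developed later in this chapter.

```python
def similarity(p, q):
    """Overlap between two sparse profiles: the shared weight per
    Roget category, summed (an assumed measure for illustration)."""
    return sum(min(p.get(k, 0.0), q.get(k, 0.0)) for k in set(p) | set(q))

def nearest_neighbour(query, case_base):
    """Return the name of the stored profile most similar to the query."""
    return max(case_base, key=lambda name: similarity(query, case_base[name]))

# Hypothetical Generic Document Profiles keyed by Roget category number.
case_base = {
    "finance": {802: 0.7, 784: 0.3},
    "sailing": {267: 0.8, 343: 0.2},
}
print(nearest_neighbour({802: 0.5, 267: 0.1}, case_base))  # finance
```

As in conventional CBR, the query profile is matched against every stored case, and the best-scoring exemplar is retrieved.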

The architecture of the system is shown in figure 3-2 overleaf using a level 1 data flow

diagram and standard SSADM notation (e.g. Weaver 1993).

This chapter proceeds with a discussion of the Roget based chainer, and the algorithm

used. Subsequently, we look at the derivation of the Generic Document Profile.


Figure 3-2: Hesperus System Architecture

3.3 A program to analyse lexical chains in a text using Roget's Thesaurus

This work is based on Morris and Hirst's (1991) hypothesis that lexical chains may be

automatically identified in a text using Roget's thesaurus. Although we have seen in

Section 2.4 that lexical chainers have been written using WordNet (Fellbaum 1998) there

have been no published computer implementations based on Roget.

This section describes a lexical chainer based on Roget�s thesaurus. The techniques used

are general, and depend on the organisation of the thesaurus as a structured resource.

Thus, they should be applicable to languages for which WordNets have not been written,

but for which texts equivalent to Roget are available.


This lexical chainer depends on the availability of an 'Electronic Roget': a machine-readable version of Roget's thesaurus together with an indexing program. This makes it

possible for a program to identify the Roget categories of which a word (or words) is a

member.

Two versions of Roget's thesaurus have been used during the course of this study. Firstly,

the 1911 edition was used, as this is available electronically over the Internet (Project

Gutenberg 1911) and is out of copyright. This version however has several problems

before it can be used to find inter word thesaural relationships. It contains many obsolete,

literary, and foreign language terms that need to be filtered out, but most importantly, it

lacks an index. The editor of the 1962 Roget [1] (Dutch 1962) points out that the index was

carefully created so as not to contain all possible terms. Therefore, an automatic approach

is going to lead to a higher degree of lexical and term ambiguity, since an automatically created index includes all possible terms - including those whose relationship is too tenuous for a human editor to

include.

The second version of Roget's thesaurus used was that of 1987, 'The Original Roget's Thesaurus of English Words and Phrases' [2]. This is structurally similar to the 1911

version, but includes a considerably enlarged and modernised vocabulary. Again, a

machine-readable index was not available.

The structural difference between the 1987 and the 1911 versions of Roget is due to the

increased size of the vocabulary, which consequently increases the size of the thesaural

entries. Whilst these conform in structure to Roget's 1000 categories (see Section 1.6), each category is further subdivided into approximately six subcategories,

giving 6400 subcategories altogether. For the remainder of this chapter this subdivision

will be ignored. However, the 1987 edition of Roget does allow the possibility of working

at a finer level of granularity than that offered by the 1911 version. That is, we could

consider the thesaurus as having approximately 6400 smaller categories, rather than 1000

larger ones.

[1] The 1962 version is available on paper only.

[2] Copyright © 1987 by Longman Group UK Ltd. We are grateful to Longman's for permission to use this work for academic purposes.


The issue of word sense ambiguity inherent in an automatically created index is

connected to an observation in Morris and Hirst (1991) that of the visible connections

between words in a newspaper article, only 80% (approximately) could be detected with

Roget's thesaurus. They recognised the remaining relationships as between proper nouns,

anaphora, and knowledge of the environment, such as city districts. These are associations

familiar to Morris and Hirst (1991), and the article's author, but not included in Roget in

the appropriate sense. It is quite plausible that some of these terms, especially proper

nouns, will be present in Roget in other senses. These erroneous senses will then be used,

leading to misclassification, and subsequent identification of spurious lexical links and

chains.

Lexical chaining is then an approximate, error-prone procedure. The objective for this

work is to produce an implementation that is accurate enough to produce a usable Generic

Document Profile. That is, sufficiently accurate as to offer an improvement over term-

only similarity algorithms (Chapter 6), whilst being suitable for interactive use.

The next section proceeds as follows: firstly we consider the creation of an Electronic Roget; we then discuss the chaining algorithm itself.

The Creation of a Machine Readable Thesaurus

This thesis has used Roget's thesaurus as a knowledge base for the identification of

lexical chains in a text. This was done using machine-readable versions of the thesaurus

without having access to an index. The index is essential in identifying how a word is

classified in the thesaurus. Without an index, it is not possible to find to which entries a

word belongs, or where it fits in the thesaural hierarchy. Consequently, it was necessary

to create a machine-readable index. The procedure that does is described in Algorithm 3-1

below.

Algorithm 3-1, which creates a machine-readable index, may be applied to any similar
resource. It would be possible, for example, to apply Algorithm 3-1 to a non-English
thesaurus (whose text had been acquired, say, by scanning), and subsequently use that to
derive lexical chains in texts in that language3.

3 It follows too that if the thesauri have equivalent structures the procedures for text similarity matching (or other

operations possible with lexical chains) could be applied cross-lingually, to texts written in different languages.

Chapter 3 Hesperus


Algorithm 3-1 particularly applies to the 1911 edition of Roget's thesaurus. Project
Gutenberg (1999) have made the 1911 Roget publicly available, since its copyright has
expired. That edition of Roget is available as a single text (i.e. it is a book) with section
headings divided into the standard categories described in the preface to any Roget
(Dutch 1962). Each entry is indicated by a headword that describes the idea(s) of that
entry, independently of any part of speech. Each entry is numbered, and a feature
of Roget is that the reader is referred to related entries.

In the 1911 Roget the related entries are given by number at the end of each entry. Later

editions (e.g. Roget 1962) indicate pertinent categories next to the most relevant word.

Algorithm 3-1 does not exploit that enhancement.

The challenge is to identify the thesaural categories to which a word (or words) belongs,
starting from the book form. Algorithm 3-1 divides that book into separate entries in
the thesaurus, simplifies them, and then creates an index from them using an information
retrieval program5.

4 The purpose and role of minor procedures is described in appropriate comments.
5 This work uses a public domain program, FFW, though many other programs are available on the internet, as is FFW.


Algorithm 3-1: Creation of the E-Roget

PROCEDURE Create-E-Roget
    LET file-name   := NULL
    LET RogetPtrs   := NULL
    LET Roget_files := NULL
    FOREACH Roget-entry R1:
        Simplify-Text(R1)          // Remove obsolete, Latin, & foreign words
        Combine-Collocations(R1)   // Merge collocations into single hyphenated terms
        file-name := Compose(head_title, entry_number, part_of_speech)
        WRITE R1 AS file-name
        COLLECT pointers-to-related-entries INTO RogetPtrs
                                   // Array indexed by entry number
        COLLECT file-name INTO Roget_files
    WRITE RogetPtrs AS Roget.ptrs
    // Index the entry files using an Information Retrieval (IR) program:
    INDEX Roget_files INTO-FILE Roget.Index
END

When reloaded, the memory-based array RogetPtrs allows thesaural relations between
any two word categories to be determined. The search component of the IR program may
now be encapsulated as a separate function, Categorise.


Function Categorise(Word1 .. Wordn)
    LET Set1 := RETRIEVE File-Names FROM Roget.Index CONTAINING Word1
        ...
    LET Setn := RETRIEVE File-Names FROM Roget.Index CONTAINING Wordn
    RETURN (Set1 ∩ ... ∩ Setn)   // The set of thesaural entries containing Word1 .. Wordn
END
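As a minimal sketch, the index-and-intersect behaviour of Algorithm 3-1 and Categorise can be rendered in Python. The entry file names and word lists below are invented for illustration; the thesis itself used an external IR program (FFW) rather than code of this kind.

```python
from collections import defaultdict

def build_index(entries):
    """Build an inverted index: word -> set of entry file names."""
    index = defaultdict(set)
    for file_name, words in entries.items():
        for word in words:
            index[word.lower()].add(file_name)
    return index

def categorise(index, *words):
    """Return the thesaural entries that contain every given word."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

# Toy entries named head_entry_pos, as Compose() might produce (hypothetical)
entries = {
    "railway_624_n": ["train", "rails", "line", "embankment"],
    "procession_71_n": ["train", "retinue", "series"],
}
index = build_index(entries)
print(categorise(index, "train"))           # both entries
print(categorise(index, "train", "rails"))  # the railway entry only
```

Intersecting the retrieval sets narrows an ambiguous word ("train") to the entries compatible with its neighbours, which is exactly the role Categorise plays for the chainer.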

Efficiency Considerations

The Electronic Roget uses two techniques that support efficient use by the lexical chainer.
Firstly, thesaural entries are encoded as integers; secondly, the internal pointer structure
of the Roget is held in memory. Let us look at these techniques in turn.

Integer Encoding

In order to carry out lexical chaining, the function Categorise (above) is used. This
must return the three types of information that describe a word's thesaural category:

1) The thesaural category (e.g. Entry 1 .. Entry 1000)

2) Thesaural subdivision: that is, Noun, Verb, Adjective, Adverb & Phrase

3) Additional category if relevant (e.g. Entry 16a, 16b, or 16c). These refer to
class divisions and additions created after the original classification.

Since these numbers are small, they may be combined into one integer (two bytes). This

both makes maximum use of available memory, and permits several lexical comparison

operations to be carried out simultaneously.

For example, in the various lexical linking relations described in Morris and Hirst (1991),
the simplest comparison is to determine whether two words are members of the same
thesaural entry. This requires testing the equivalence of both items (1) and (2) above,
which may be done in one operation. Other operations are similarly reduced to integer
arithmetic. Of course, if specific information is required (such as whether an entry refers
to a 'Noun' subcategory), this may be done using bit mask operations. Since these are low-level
operations, they are also typically faster than the string matching operations that
would be required by other encoding schemes. This technique also supports the second
efficiency technique, the Roget pointer structure.
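The packing can be sketched as follows. The bit layout here is an assumption for illustration only: the thesis states that the three values fit in two bytes but does not specify field widths.

```python
# Illustrative layout: category (1..1000) shifted left of 3 bits of
# part-of-speech and 2 bits of additional sub-entry (none, a, b, c).
POS = {"n": 0, "v": 1, "adj": 2, "adv": 3, "phr": 4}

def pack(category, pos, additional=0):
    """Combine the three values into one small integer (< 2 bytes)."""
    return (category << 5) | (POS[pos] << 2) | additional

def same_entry(code1, code2):
    """CAT-style test: same category and subdivision in one comparison."""
    return (code1 >> 2) == (code2 >> 2)

def is_noun(code):
    """Bit-mask test for the noun subdivision."""
    return (code >> 2) & 0b111 == POS["n"]

train = pack(624, "n")   # railway, noun
rails = pack(624, "n")
assert same_entry(train, rails) and is_noun(train)
```

The key point mirrors the text: the commonest comparison (same entry, same subdivision) collapses to a single integer equality, and rarer queries use cheap shift-and-mask operations.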


The Roget Pointer Structure

Lexical chaining depends upon the detection of relationships between words. These

relationships may be membership of the same category in the thesaurus, or more complex

links, which may be determined by following the internal references in the thesaurus.

Consequently, the efficient operation of lexical chaining depends upon the rapid

identification of relations between thesaural categories.

Roget's thesaurus is distinctive in that the categories are ordered in a tree-like structure
(outlined in Appendix VII), where entries contain pointers to related entries. Since this
information is static, it may be determined when the electronic index is created.

Since the Roget pointer table is held in memory, all the thesaural-linking operations are
carried out without disk access. This is important, as accessing a computer disk is of the
order of one hundred times slower than accessing its memory, a consequence of the disk
being a mechanical device. Additionally, as word category information is held as integers,
lexical chaining using Roget will be more efficient.
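As a sketch, the in-memory pointer table amounts to a dictionary keyed by entry number, against which a cross-reference relation can be tested without disk access. The entry numbers and referents below are invented for illustration.

```python
# RogetPtrs as a dictionary: entry number -> entry numbers it refers to.
# Static data, loaded once from Roget.ptrs and then queried in memory.
roget_ptrs = {
    624: {267, 271},   # hypothetical: railway -> travel-related entries
    267: {624, 274},
}

def one_related(cat1, cat2):
    """ONE relation: either category's entry refers to the other."""
    return cat2 in roget_ptrs.get(cat1, set()) or \
           cat1 in roget_ptrs.get(cat2, set())

assert one_related(624, 267)      # direct reference
assert one_related(274, 267)      # reference in the other direction
assert not one_related(624, 999)  # unrelated entries
```

Each test is a constant-time set lookup on resident data, which is what makes repeated link checking during chaining cheap.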

In summary, it is a pre-requisite of any lexical chaining procedure that we can determine

the thesaural categories to which a word belongs. This section has described a procedure

to accomplish this using routinely available Information Retrieval programs supported by

basic text processing.

Chaining Algorithm

The algorithm described here is based on that given for WordNet in StOnge (1995), who

in turn followed Okumura and Honda (1994), and Morris and Hirst (1991). That is, a

linear pass is made through the text, and where words can be associated using

relationships derived by reference to an external thesaurus, a �link� is stored. If one of the

members of the link was previously linked elsewhere in the text, the two links form a

�chain�, to which further links may be added.

Several relations between word pairs can be used to decide if they are members of the
same lexical chain. These were suggested and described fully by Morris and Hirst (1991).
Only a subset of the Morris and Hirst (1991) relations were found to be useful in Hesperus,
since some are excessively prone to problems of word sense ambiguity.


The most important relation is word repetition, known as the ID6 or identical word
relation. Simply, if two words are the same, they may be linked with a high degree of
certainty. Although there is a mean of four thesaural entries per term, the discourse topic
acts to constrain sense usage. This means that a word used in one sense is frequently used
in that sense throughout the same text. For example, 'bond' has one common sense in
financial journals, a second usage in chemistry papers, and a further widely accepted
sense in sociology books. This is an instance of the 'one sense per discourse' effect noted
by Krovetz and Croft (1992).

Next, we consider whether two words are members of the same thesaural category (the
CAT relation). Again, due to one sense per discourse, this is mostly successful. It does
appear, however, more error prone than the ID method. The degree of error is related to
the type of text being analysed, with technical texts appearing to have fewer errors than
fictional texts, since the latter contain a greater proportion of polysemous words. Precise
error rates have not been calculated, as this would require a text corpus tagged with Roget
categories (see Section 2.4.4).

Since Roget�s thesaurus contains groups of categories, we also consider whether word

pairs are members of neighbouring and related thesaural categories. If so, and the two

categories are members of the same thesaural group, the words may be linked with a

GROUP relation.

Roget categories often refer to other categories. Words may therefore be related where a

word's entry refers to an entry that contains the second word. This relationship is

abbreviated as ONE, since there is one level of indirection from one thesaural category to

the second.

All lexical chains are stored in a common data structure called a ChainStore. Unlike
StOnge's Chainstack, where chains were ordered by recency of occurrence, or Okumura
and Honda (1994), where chains were ordered by their length, chains in the ChainStore
are ordered by their value, as calculated using the procedure Potential-Link-Value
(see below). This allows the concepts in a text to be considered according to their known
relative strengths as these emerge during the linear text 'context' determination process.

6 Courier font is used to differentiate link names from descriptive text.

The ChainStore is linked to the application of a variable width window within which
lexical links are considered. Furthermore, the type of link governs the region of text for
which links are considered. Identical word links are then considered within a region of
fifty non-stopwords. Other link types have this value reduced in proportion to their
relative weights.

The variable width window is motivated by ambiguity considerations. The progressively
weaker semantic relations consider increasing numbers of words, as shown in table 3-1
below.

Table 3-1: Thesaural Relations vs Mean Words

Relation   Mean Words           Comment
ID         4                    Mean number of word repetitions in the thesaurus
CAT        30                   Mean size of thesaural category
GROUP      4 * CAT = 120        Mean number of categories in a thesaural group
ONE        6 * CAT = 180        Mean number of categories in a ONE relation
TWO        6 * 6 * CAT = 1080   Mean number of categories in a TWO relation

A full discussion of this issue needs to be informed by word sense frequency data
(Chapter 5). To clarify table 3-1 we will assume that all word senses occur equally
frequently.

Table 3-1 shows that the GROUP relationship will consider possible links between two
words from a mean of four categories7 of Roget's 1000. The ONE relationship will
consider six categories, whilst the TWO relationship considers two levels of indirection,
or thirty-six categories.

As there are approximately 1000 categories in Roget, it is clear that there is a significant
probability that a TWO relationship will be found between unrelated words, as this
relation includes so many categories. The risk naturally increases with the greater number
of words that are considered. Indeed, there is a strong possibility that unrelated words
that should form new lexical chains could be erroneously linked to words in existing
chains by the TWO relation. As such, it appears to offer no positive contribution to
representing a text for text similarity assessment, and its use was not pursued.

7 Some groups contain two and others as many as ten categories. Four categories is an estimated mean.

The algorithm used to create chains is as follows:

Algorithm 3-2: Create Lexical Chains

PROCEDURE Create-Lexical-Chains
    LET NewLink   := GetLink();            // Get a new link from the input
    LET LinkTypes := [ID, CAT, GRP, ONE];  // Types of links in preferred order
    FOREACH LinkType IN LinkTypes
        FOREACH chain IN ChainStore
            FOREACH link IN chain
                IF CanLink(link, NewLink, LinkType)
                    Push(NewLink, chain)
                    IF Ambiguous(NewLink)  // Has multiple word senses
                        DeleteUnusedWordSenses(NewLink)
                    Calculate-Weight(chain)
                    SORT ChainStore
                    RETURN                 // Done
                IF Potential-Link-Value(link, NewLink) < 1
                    BREAK 2                // Leave two loops to try the next link type
    LET NewChain := NULL;                  // NewLink cannot be joined to an existing
    Push(NewLink, NewChain)                // chain, so form a new chain
    Push(NewChain, ChainStore)
    SORT ChainStore
END
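A highly simplified rendering of this control flow in Python, purely to show the chain-forming logic: CanLink is reduced to identical-word and same-category tests over a toy lexicon, and chains are kept in creation order rather than sorted by value as in Hesperus.

```python
# Toy word -> Roget category numbers (invented for illustration)
TOY_CATEGORIES = {"train": 624, "rails": 624, "line": 624,
                  "velocity": 265, "travelling": 265}

def can_link(word1, word2):
    """Reduced CanLink: ID (same word) or CAT (same toy category)."""
    if word1 == word2:
        return "ID"
    if word1 in TOY_CATEGORIES and \
            TOY_CATEGORIES.get(word1) == TOY_CATEGORIES.get(word2):
        return "CAT"
    return None

def chain_words(words):
    chains = []  # the ChainStore, here simply in creation order
    for word in words:
        for chain in chains:
            if any(can_link(word, member) for member in chain):
                chain.append(word)   # joins an existing chain
                break
        else:
            chains.append([word])    # no link found: start a new chain
    return chains

print(chain_words(["train", "travelling", "rails", "velocity", "people"]))
# → [['train', 'rails'], ['travelling', 'velocity'], ['people']]
```

Even this skeleton shows the essential behaviour: each incoming word either extends the first chain it can link to, or founds a new chain of its own.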

The function Potential-Link-Value restricts the potential chaining region

according to the link type being considered. Table 3-1 has shown that the different

thesaural linking relations proposed by Morris and Hirst (1991) cover increasing numbers

of words in the order CAT < GRP < ONE. Thus, as weaker relations are used the risk of a

spurious link being found increases.


Function Potential-Link-Value reduces the effect of erroneous links by restricting

the number of words being considered. This is done in two ways. Firstly, link types have

different weights, which are given in table 3-2 below. Secondly, the distance between two

links is used to determine the value of a link.

The values of the link weights were determined empirically. That is, different values were
used that reflected the links' relative importance, and which gave reasonable results for a
sample text (Section 3.4). There is no benchmark data against which alternative link values
could be independently assessed. This issue is addressed, however, in Chapter 6, which
develops baseline test data for the complete text similarity task.

Function Potential-Link-Value(NewLink, OldLink)
    RETURN LinkValue / (WordNumber(NewLink) - WordNumber(OldLink))
END

Table 3-2: Value of the different lexical links

Link Type   Value
IDENTITY    600
CATEGORY    500
GROUP       300
ONE-POINT   200
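The weight-over-distance calculation can be sketched directly. The exact relation between these weights and the fifty-word window quoted earlier is not spelled out in the text; the sketch simply implements weight divided by word distance, with the threshold of 1 used in Algorithm 3-2, so that stronger link types survive larger gaps.

```python
# Link weights from Table 3-2
WEIGHTS = {"ID": 600, "CAT": 500, "GRP": 300, "ONE": 200}

def potential_link_value(link_type, new_pos, old_pos):
    """Value of a candidate link: weight divided by word distance."""
    return WEIGHTS[link_type] / (new_pos - old_pos)

# A strong ID link survives a gap that defeats a weak ONE link:
assert potential_link_value("ID", 450, 50) >= 1   # 600 / 400 = 1.5
assert potential_link_value("ONE", 450, 50) < 1   # 200 / 400 = 0.5
```

Under this reading, the admissible window for each link type is simply its weight expressed as a distance in words, which reproduces the "reduced in proportion to their relative weights" behaviour described above.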

3.4 An Example

We now briefly consider an example of lexical chain creation using Hesperus. We will
use the quotation from Einstein considered by StOnge (1995), and introduced in Section
1.6 as an example to illustrate the notion of a lexical chain.

Recall that StOnge (1995) manually identified three lexical chains in this text, which are
indicated by text subscripts and listed below.


We suppose a very long train1 travelling2 along the rails1 with the constant velocity2 v
and in the direction2 indicated in Figure 1. People travelling2 in this train1 will
with advantage use the train1 as a rigid reference-body3; they regard all events in
reference3 to the train1. Then every event which takes place along the line1 also
takes place at a particular point1 of the train1. Also, the definition of simultaneity
can be given relative to the train1 in exactly the same way as with respect to the
embankment1. (Einstein)

1. {train, rails, train, train, train, line, point, train, train, embankment}

2. {travelling, velocity, direction, travelling}

3. {reference-body, reference}

Hesperus finds the chains shown below in the text. Table 3-4 illustrates their embedding.

(This notation is described in Section 3.8)

Table 3-3: Quotation from Einstein 1939 (cited by StOnge 1995)

We suppose4 a very long6 train0 travelling3 along the rails with a constant velocity v and
in the direction1 indicated in figure7. People travelling in this direction will with
advantage5 use the train as a rigid reference2-body; they regard all events in reference-to
to the train. Then every event which takes place along the line also takes place at a
particular point10 of the train. Also, the definition8 of simultaneity9 can be given
relative-to to the train in exactly the same way as with respect to embankment


The chains are given below numbered in importance from zero:

0. train, rails, train, train, line, train, train, embankment,

1. direction, people, direction,

2. reference, regard, relative-to, respect,

3. travelling, velocity, travelling, rigid

4. suppose, reference-to, place, place,

5. advantage, events, event

6. long, constant

7. figure, body

The remaining chains contain single words only, or are 'atomic' in Hirst and StOnge's
(1998) terminology.

The most important chain is given below. The word and its number in the text are shown,

followed by the word's link relationship in the chain. This is followed by the number of

the word preceding it in the chain that it is linked to. The thesaural sense(s) possible for

the relationships between those words are also given. Thus, in the following, word 6

(Train) at the head of the chain is not linked to anything (↓), whilst word 10 (rails) is

linked to it by CAT. That is, they are members of the same thesaural category.

Table 3-4: An example lexical chain embedded in a text

Word         Word Number   Link Type   Linked To   Thesaural Category
train        6             ↓           0           railway_624_4011_n
rails        10            CAT         6           railway_624_4011_n
train        33            ID          6           railway_624_4011_n
train        47            ID          33          railway_624_4011_n
line         56            CAT         47          railway_624_4011_n
train        66            ID          47          railway_624_4011_n
train        78            ID          66          railway_624_4011_n
embankment   88            CAT         78          railway_624_4011_n

This lexical chain illustrates the reduction in possible word senses for �train� from

twenty-three to one. The possible senses are shown in table 3-5 below.


Table 3-5: Senses of 'TRAIN'

Senses of 'TRAIN' from Roget's Thesaurus (1997)

The headword8 of the entry in the thesaurus is given, followed by the main category
number (there are about 1000 categories in Roget's thesaurus), the sub-group (of which
there are 6400), and the grammatical part of speech.

adjunct_40_271_n

break-in_369_2449_v

conveyance_267_1765_n

direct_689_4458_v

draw_288_1942_v

experiment_461_2965_v

follower_284_1915_n

garment_228_1495_n

habituate_610_3914_v

train_274_1843_n

train_534_3476_v

hanging-object_217_1409_n

learn_536_3483_v

make-conform_83_553_v

make-ready_669_4343_v

marching_267_1763_n

prepare-oneself_669_4345_v

procession_71_478_n

railway_624_4011_n

rear_238_1575_n

retainer_742_4821_n

retinue_67_453_n

series_71_477_n

We also see that 'direction' is mistakenly linked to 'people', since both are members of
the thesaural category 'government'. Nevertheless, the lexical chains produced by
Hesperus overlap well with those manually identified by StOnge (1995) above.

StOnge's Chainer Output

StOnge (1995) analysed the quotation from Einstein at the start of this section and his

program found the following lexical chains:

8 The headword is the title of the entry. No definition is given of what the category title means other than the words it contains. Thus, to clarify the 'learn' sense of train we need to find 'train' in the context of related words in that category: 'train, practice, exercise, be wont'.


[001] simultaneity(3), respect-to(3), train(3), train(2), reference-to(1),

reference(1), advantage(1), train(1), train(1), train(1), direction(0), velocity(0), train(0)

[002] travelling(1), travelling(0)

[003] line(2), rails(0)

[004] given(3), constant(0)

[005] body(1), people(1), figure(0)

[006] point(2), particular(2), regard(1)

[007] place(2), place(2), event(2), events(1)

[008] definition(3)

[009] embankment(3)

As with Hesperus, the initial number (in square brackets) shows the chain creation order,

and words appear in reverse order of insertion. StOnge (1995) notes that rails is read but

is not associated with train, the distance between these two words being too large in

WordNet.

Further problems are identified in StOnge (1995). Nevertheless, this example shows that a

lexical chainer based on Roget's thesaurus can produce results equivalent in quality to

those produced using WordNet.

3.5 The Generic Document Profile

The purpose of the Generic Document Profile is to represent any text in such a way that

the similarity in meaning between two texts may be compared. The Generic Document

Profile is simply a set of semantic (Roget) categories with associated weights. These

weights are based on chain length and strength attached to the thesaural categories. This

profile can be matched against that derived from another text in a Case Based Reasoning

approach using a Nearest Neighbor9 algorithm (Aamodt and Plaza 1994).

This representation is known as the 'Generic Document Profile', since it is not word
specific, and is derived from the whole text. Now all that remains is to describe how a text
may be analysed so that values in each particular category can be determined.

9 US spelling is used, since this is the common name for this class of algorithm.


Creating the Generic Document Profile

The Generic Document Profile is created from the lexical chains identified in a text. The
strength of every link is determined as described in the function Potential-Link-Value
(Section 3.3). This value is then summed into the appropriate profile category.
This gives the required attribute value representation. Thus, the strength σ of any concept
c, where C(n) gives the strength of concept c in link n, and where a text contains N lexical
chains, is:

Equation 3-1: The strength of a concept

\sigma_c = \sum_{n=1}^{N} C(n)

Table 3-6 below shows an example generic document profile. It is derived from the
example quotation from Einstein used in Section 3.4. For each thesaural category, the raw
score is converted into a percentage of the total score. Using these percentages normalises
for document length and is an intrinsic part of creating a GDP.

Table 3-6: An Example Generic Document Profile

Roget Class (Headword_Entry_Number_Sub_Group_POS)   Raw Score   Percent
Railway_624_4011_N                                   375        31.97
Government_733_4752_N                                200        17.05
Relation_9_54_N                                      176        15.00
Put-In-Front_64_442_V                                109         9.29
Chance_159_996_N                                      87         7.42
Unconformable_84_561_A                                76         6.48
Lasting_113_699_A                                     62         5.29
Motion_265_1743_N                                     42         3.58
Person_371_2463_N                                     31         2.64
Concerning_9_63_R                                      8         0.68
Attribution_158_989_N                                  7         0.60
Total                                               1173       100
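Building the profile from chain links and normalising it to percentages can be sketched as follows. The category names echo the table above, but the individual link values are invented for illustration.

```python
from collections import defaultdict

def build_gdp(links):
    """links: iterable of (roget_category, link_value) pairs.

    Sums the link values per category (Equation 3-1) and converts
    each total to a percentage of the grand total.
    """
    raw = defaultdict(float)
    for category, value in links:
        raw[category] += value
    total = sum(raw.values())
    return {c: 100.0 * v / total for c, v in raw.items()}

links = [("Railway_624_N", 300), ("Railway_624_N", 75),
         ("Government_733_N", 200), ("Relation_9_N", 176),
         ("Motion_265_N", 42)]
gdp = build_gdp(links)
print(round(gdp["Railway_624_N"], 2))  # → 47.29
```

The percentage step is what makes profiles of a short paragraph and a long article directly comparable.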


Note that the GDP approach to text described here assumes that meaning is
compositional: that is, that there is one meaning to a text that is summarised in the GDP.
This may be true of single paragraphs, but is rarely true of most whole texts. These
contain themes, or arguments, depending on the type of text, contained in component
sections which, linked together, form a coherent document. GDPs could be used to
represent themes in whole texts by partitioning documents into sub-structures such as
chapters or sections first. The approach described is, however, suitable for determining
the similarity of texts.

3.6 Using the Generic Document Profile to Determine the Similarity of Texts

This section is the technical zenith of the approach. So far, we have described how to
determine the lexical chains in a text, and then how to derive the Generic Document
Profile, which is an attribute value vector whose attributes are categories from Roget's
thesaurus, and whose values are derived from the lexical chains. Now we will describe
how texts may be compared to give a similarity score using a Nearest Neighbor
algorithm.

A CBR Nearest Neighbor algorithm takes two case descriptions, one input, and the other

retrieved from the database of cases and returns their similarity as a value between 0 and

1.0. Kolodner (1993) gives the following definition of a Nearest Neighbor algorithm:

Equation 3-2: Simple Nearest Neighbor Algorithm

\mathrm{MatchScore}(I, R) = \frac{\sum_{i=1}^{N} w_i \times \mathrm{sim}(f_i^I, f_i^R)}{\sum_{i=1}^{N} w_i}

where w_i is the importance weight of a feature or slot, sim is the similarity function, and
f_i^I, f_i^R are the values for feature i in the input and retrieved cases respectively.

Kolodner's (1993) algorithm needs to be modified slightly for use in Hesperus, since not
all Roget senses will occur in the input text I, or the retrieved case R. If the sets of Roget
categories in the input and retrieved cases are C_I and C_R respectively, then only those
categories found in both (that is, C_I ∩ C_R) are considered in the match.


Equation 3-3: Hesperus Nearest Neighbor formulation

\mathrm{MatchScore}(I, R) = \frac{\sum_{i \in C_I \cap C_R} w_i \times \mathrm{sim}(f_i^I, f_i^R)}{\sum_{i \in C_I \cap C_R} w_i}

The influence of any particular Roget category is governed by its weight w_i. The weight
used is the value of that feature in the Generic Document Profile of the input case. Since
we have no external metric to evaluate differential feature weighting, Hesperus weights
all Roget categories equally. This issue is discussed further in Chapter 7.

In common with many CBR packages such as ART*IM (Inference 1994), the feature
similarity function sim is defined as the proportional range of the feature value between
the two cases. If v_i^I and v_i^R are the values of feature i in the input and retrieved cases
respectively, then:

Equation 3-4: Hesperus Feature Similarity function

\mathrm{sim}(f_i^I, f_i^R) = \frac{\min(v_i^I, v_i^R)}{\max(v_i^I, v_i^R)}

This function has a range between 0.0 and 1.0.
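Equations 3-3 and 3-4 combine into a short matching function over the shared categories. The profiles here stand in for GDP percentage dictionaries, with invented values; following the text, all category weights are taken as equal.

```python
def feature_sim(v_input, v_retrieved):
    """Equation 3-4: min/max ratio, in the range 0.0 .. 1.0."""
    return min(v_input, v_retrieved) / max(v_input, v_retrieved)

def match_score(profile_i, profile_r):
    """Equation 3-3 with equal category weights (w_i = 1)."""
    shared = set(profile_i) & set(profile_r)   # C_I intersect C_R
    if not shared:
        return 0.0
    return sum(feature_sim(profile_i[c], profile_r[c])
               for c in shared) / len(shared)

# Hypothetical GDP fragments for two documents
doc_a = {"Railway_624_N": 32.0, "Motion_265_N": 4.0}
doc_b = {"Railway_624_N": 16.0, "Person_371_N": 10.0}
print(match_score(doc_a, doc_b))  # → 0.5: only the Railway category is shared
```

Restricting the sum to the intersection is the Hesperus modification: categories absent from either profile contribute nothing, rather than dragging the score down.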

In the case of several texts, the Generic Document Profile is determined for all the texts
and these profiles are then stored in the database of cases, or case base. Similarities may
then be calculated for any particular input text against those stored. Thus, the GDP,
which is derived with Roget's thesaurus, may be used to determine the similarity of two
or more texts.

3.7 Adherence to Zipf's Law

Since the GDP is derived from text, a necessary requirement for a text representation is
that it conforms to Zipf's law (Chapter 2). If the representation displayed another
frequency distribution, it would clearly be misrepresenting the text, since the text's
unprocessed contents do conform to Zipf's Law.

Graph 3-1 below shows the values of the thesaural categories vs their rank, plotted using
a log/log scale. These values represent the frequency of the categories.


The data in Table 3-6 may be fitted to a straight line. Since these data were derived from a
one-paragraph example text, conformance is inevitably imperfect10. Nonetheless, basic
correspondence with Zipf's law has been shown. This demonstrates that the GDP
derivation is representing the text's contents.
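The conformance check amounts to fitting value against rank on log/log axes; a roughly straight line with negative slope indicates Zipfian behaviour. A sketch, using the raw scores from the example profile above:

```python
import math

# Raw GDP scores from Table 3-6, already in rank order
scores = [375, 200, 176, 109, 87, 76, 62, 42, 31, 8, 7]
xs = [math.log(rank) for rank in range(1, len(scores) + 1)]
ys = [math.log(score) for score in scores]

# Ordinary least-squares slope of log(value) against log(rank)
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
print(round(slope, 2))  # negative: value falls steadily with rank
```

A tiny eleven-point sample like this cannot show more than the expected sign and rough linearity, which matches the chapter's caveat about one-paragraph texts.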

Many phenomena in natural language exhibit Zipf's law (Section 2.2.7). Consequently,
conformance to Zipf's Law is necessary but not sufficient proof that the lexical chaining
and GDP algorithms provide plausible text representations. This does, however, give
sufficient confidence in the methodology to proceed with testing the research hypotheses
given in Chapter 1.

Graph 3-1: Zipf Law: GDP Profile Values Vs Rank

3.8 Visualisation of Results

It was difficult to evaluate the relative success of initial implementations of the lexical
chaining algorithm, since data was hidden in large volumes of output. It was consequently
hard to recognise what associations had been found. Morris and Hirst (1991) used a
complicated system of subscripts and superscripts that encoded the link type and
associations; however, this required close attention to follow. The approach used in
Table 3-4 improved on that situation, since the link type and associations are specified
in normal text. However, the approach was still unsatisfactory, since the lexical chain was

10 Longer texts are used in experiments reported in Chapter 6. Their GDPs are given in Appendix II. These display better adherence to Zipf's law.


viewed in isolation from its surrounding context. This made it hard to interpret whether
the word sense selected was appropriate.

Since lexical chains essentially decompose a text into different 'threads' of meaning, an
approach was required that allows each thread to be visualised individually, with the
option of returning to the overall view. This is equivalent to seeing any text as hypertext
made up of lexical chains.

HTML (Berners-Lee et al. 1994) is a highly practical hypertext medium, since it is
supported by freely available viewers such as Netscape and Internet Explorer.

A module was consequently designed that produces an HTML version of the initial text.

This shows the lexical chains identified, using colour to indicate chain membership. The

first element of the chain contains a hyperlink to that chain only.

The link type is encoded in the character styles used to print the other members of the
chain. Thus, a chain and its various links can be seen distributed in the surface text. An
example of this mark-up is given in Table 3-3 earlier, whilst Table 3-7 shows how the
different link types are encoded as bold, underlined, italic, or plain text faces. Thus, for
the most dominant chain, we may see:

Table 3-7: Link Type indications

IDENTITY

CATEGORY

GROUP

ONE-POINT
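A minimal sketch of generating such mark-up is given below. The chain colours, the style-to-link-type mapping, and the data layout are all illustrative assumptions; the original module's exact output format is not reproduced here.

```python
# Hypothetical mapping of link types to HTML styling tags
STYLES = {"ID": "b", "CAT": "u", "GRP": "i"}   # ONE-POINT: plain text
COLOURS = ["red", "blue", "green"]             # one colour per chain

def mark_up(tokens, chain_of, link_type):
    """tokens: the words of the text.
    chain_of:  position -> chain id (positions not in a chain omitted).
    link_type: position -> link relation used to join the chain.
    """
    out = []
    for pos, word in enumerate(tokens):
        chain = chain_of.get(pos)
        if chain is None:
            out.append(word)                   # not chained: plain word
            continue
        colour = COLOURS[chain % len(COLOURS)]
        html = f'<font color="{colour}">{word}</font>'
        tag = STYLES.get(link_type.get(pos, ""))
        if tag:
            html = f"<{tag}>{html}</{tag}>"    # encode link type as a face
        out.append(html)
    return " ".join(out)

print(mark_up(["the", "train", "rails"], {1: 0, 2: 0}, {1: "ID", 2: "CAT"}))
```

Colour carries chain membership while the character style carries the link type, so one rendering shows both how chains thread through the text and how each word joined its chain.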

Each link in the chain also contains links to the Roget's thesaurus entry selected for that
word. This allows for rapid, qualitative checking of the entry's suitability to represent that
word sense. Since Roget's thesaurus does not give definitions of entry meanings, the
suitability of an entry has to be decided based on whether the other words in the entry
would make suitable synonyms for that usage.

Note that this qualitative evaluation is only suitable for monitoring coarse system
performance. Due to the often-noted sparseness of language phenomena, modifications to
improve performance with less frequent words may negatively affect more frequent


words, and consequently degrade system performance. Overall evaluation on a corpus is
therefore preferable (Chapter 6).

Several examples of Hesperus' demonstration output are included in Appendix II. The
algorithm to make the lexical chains in a text visible is given in Appendix III.

3.9 Conclusion

We have described a system known as 'Hesperus' that computes the similarity of pairs of
texts. The program identifies the lexical chains in a text using Roget's thesaurus as a
knowledge source. This is used to create an attribute value vector of thesaural categories
that we have called the Generic Document Profile, which has been shown to conform to
Zipf's law. Using this profile, the similarity between two texts based on their semantic
content can be calculated. We claim that this improves on much work on text similarity
assessment, since that is largely based on term repetition. This claim is experimentally
investigated in Chapter 6.

Two innovations have been introduced with respect to previous work on lexical chains.
Firstly, Roget's thesaurus has been shown to be a useful alternative to WordNet.
Secondly, the notion of a value ordered chain store was introduced. This allows a text's
concepts to be considered according to their prevalence, rather than following the linear
text analysis.

Subsequent work is focused on improving the accuracy of the Generic Document Profile.

This is problematic, as there are no text corpora for which Generic Document Profiles or

lexical chains have been defined. This means that implementation decisions have been

described in this chapter which were justified solely on the basis of informal experiments.

What is required is a representative benchmark test that is independent of Hesperus.

Performance modifications can then be evaluated against that standard. Chapter 6

develops such a standard.

One cause of inaccuracy in Hesperus is the word sense ambiguity problem. This is addressed by incorporating appropriate disambiguation techniques, which are described in Chapter 5.


Chapter 6 evaluates the GDP approach by comparing it to human judgements of the

similarity of randomly selected texts. That chapter also evaluates the claim that the GDP

method is superior to an approach based on term repetition. We also return in Chapter 6 to

the possibility of using a finer level of granularity. This was mentioned in Section 3.3,

and involves using an expanded GDP containing approximately 6400 thesaural

categories. This is technically trivial, but such a modification would only be justified if it

can be shown that it improves text similarity matching performance compared to human

judgements.

The applicability of the technique to different text complexities is addressed in the

following chapter.

Chapter 4. The General Nature of Lexical Links

4.1 Introduction

The Generic Document Profile (GDP) is designed to facilitate the comparison of texts and

measure how similar they are in content. It is derived from the lexical chains identified in

a text. The GDP is an attribute-value vector whose attributes are categories from Roget's

Thesaurus, and whose values are the cumulative weights of the links in the lexical chains

(Chapter 3).

Texts may differ in several ways including style, length, genre, and complexity. These

dimensions necessarily interact, so for example many conference papers tend to be about

five pages in length due to space restrictions. Authors compensate for this by adopting a more terse style. Journal papers are usually longer, and explain findings in more detail,

whilst successful textbook authors strive for explanations of the highest clarity.

For a procedure to represent the content of any text, it must not be sensitive to any of the

factors of style, genre, length, and complexity. If it were sensitive, then it would be a

measure of that aspect of the text. The objective of this chapter is to demonstrate that the

GDP is a general method.

The issue of document length is addressed mechanically in Hesperus using simple

mathematical methods. For example, the attribute-values in the document profile are

converted to percentages of the total value of the GDP. This normalises the value that a

particular attribute may take to between zero and one.
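A minimal sketch of this normalisation step, with invented category names and raw weights:

```python
def normalise_profile(profile):
    # Scale each attribute to its share of the profile's total weight,
    # so every value lies between zero and one and profiles of texts
    # of different lengths become directly comparable.
    total = sum(profile.values())
    if total == 0:
        return dict(profile)
    return {category: weight / total for category, weight in profile.items()}

raw = {"Travel": 30.0, "Ship": 15.0, "Ocean": 5.0}
print(normalise_profile(raw))  # {'Travel': 0.6, 'Ship': 0.3, 'Ocean': 0.1}
```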

The issue of text style and genre is somewhat more difficult to address. There are

different types of texts, and texts that are about the same subject may be in different

genres (Karlgren and Cutting 1994).

The GDP calculation is based upon the weight attached to each link in every lexical

chain. This is determined by the strength of the link type divided by the distance between

the linked words (Chapter 3). If different genres contain higher proportions of the

different links, or if the distance between thesaurally related words varies, this would alter

the distribution of attributes and values. This could arise if the threads of related words in


the simple texts are shorter, hence making the text easier to read, or, alternatively, denser text could have longer inter-word link distances. Indeed, Morris and Hirst (1991, p. 41) speculate that lexical chain-forming parameter settings will vary according to an author's

style. If this were to be the case, the lexical chaining approach would not be a general

tool, but would instead be some measure of document complexity. Lexical chaining could

only be made suitable as an enabling measure to determine text similarity if a text's genre

were first classified. This classification would then need to be included in the GDP

calculation.
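The link-weight calculation described above (the strength of the link type divided by the distance between the linked words) can be sketched as follows; the numeric strengths are illustrative placeholders, not the values used by Hesperus:

```python
LINK_STRENGTH = {"ID": 1.0, "CAT": 0.8, "GRP": 0.5, "ONE": 0.3}  # illustrative only

def link_weight(kind, distance):
    # Weight of one lexical link: the strength assigned to its type,
    # divided by the distance (in words) between the linked words.
    return LINK_STRENGTH[kind] / distance

print(link_weight("ID", 4))   # 0.25
print(link_weight("CAT", 2))  # 0.4
```

On this scheme a weak link between adjacent words can outweigh a strong link between distant words, which is why genre-dependent inter-word distances would matter if they existed.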

The objective of this chapter is then to demonstrate that lexical links have the same

characteristics in texts of different genre and complexity. That provides the basis for

considering the GDP as a general method. If it were not the case, we would need genre

identification and normalisation methods to be used before the GDP method could be

applied. This would reduce the attraction of the method for interactive applications.

In order to show that the GDP is a general method, we need to show that:

1. The proportion of links does not depend on the type of text.

2. Links have similar distributions in different text types.

3. This distribution is independent of the link type.

This is done by selecting a set of texts of varying complexity. The lexical chaining

component of Hesperus is then applied to these and analysed to test the assertions above.

At the same time, basic data about lexical links will be collected. Although lexical chains

have been used in a variety of applications (see Chapter 2) these were small-scale. Thus,

basic data about large volumes of lexical chains have not been reported. These will be

used subsequently to tune the performance of the algorithm.

This chapter is arranged as follows: Firstly, a collection (or 'mini-corpus') of appropriate texts is selected. These varied from children's books such as 'Alice in Wonderland' to more challenging works such as Kant's 'Critique of Pure Reason'. Next, we demonstrate

that these are of different reading complexities using the well-known Flesch-Kincaid

readability metric. Basic data about the types of lexical links found in the texts is then

reported. The principal finding is that identical words, and words in the same thesaural category, make up approximately 80% of the relations discovered. Next, we


move on to look at the distance between the words where a thesaural link can be found.

This uncovers a significantly similar distribution pattern for all the link types and in all

the texts. This data is very similar to that reported by Beeferman, Berger and Lafferty

(1997) using purely statistical analyses. This distribution is subsequently shown to conform to Zipf's (1949) law. The implications of these findings are then discussed and

conclusions drawn.

4.2 Selection of the Experimental Texts

The lexical chaining approach to text analysis is highly attractive, since it is both robust,

and deals with whole texts. It is, though, a heuristic approach. For example, we do not know whether it is an independent measure, or a reflection of a text's genre.

To answer this question, we decided to analyse a set of longer texts. Since most work in

lexical chaining has considered shorter texts, results, though interesting, may not be

general. Consequently, a mixed range of texts of differing complexity were chosen.

There were several constraints on the selection of texts for the experiments. They had to:

1. be analysable within the constraints of the current implementation;
2. be available electronically;
3. be several thousand words in length;
4. have demonstrably different complexities.

A range of texts of differing complexity was selected from those available on the Internet,

or CD-ROM. These varied from children's books such as 'Alice in Wonderland' to more challenging works such as Kant's 'Critique of Pure Reason'. The texts chosen are listed

in table 4-1 below.


Table 4-1: Texts Selected

Title | Author | Publication Date

Alice's Adventures In Wonderland | Lewis Carroll | 1867
Through The Looking Glass | Lewis Carroll | 1867
Pride And Prejudice | Jane Austen | 1813
Moby Dick | Herman Melville | 1851
Lectures on the Industrial Revolution in England | Arnold Toynbee | 1884
The Critique Of Pure Reason | Immanuel Kant (1) | 1781

4.3 Reading Complexity of the Texts

The books used in these experiments were selected as representing a range of literary

complexity. Books by Lewis Carroll are commonly read to junior school children, Austen

and Melville are high school texts, whilst Kant and Toynbee are not usually encountered

until University. Thus, we can expect intuitively that University level texts are harder to

read than those aimed at school children. Nonetheless, some independent confirmation of

their reading ease is desirable.

Readability is often measured by teachers to determine the suitability of books for pupils

of different reading abilities. Readability formulae (e.g. Harrison 1980) aim to predict the

level of a text's reading difficulty by calculating statistics, such as sentence length and

mean syllables per word, from the text. They do not consider content, so need to be

applied with caution.

Harrison (1980) describes ten readability measures, including the Flesch formula, and the

Gunning FOG formula. Harrison (1980) reports a study by Lunzer and Gardner that

shows that seven of the readability formulae are approximately correlated with pooled teachers' assessments of text reading levels.

Karlgren and Cutting (1994) showed that texts may be simply classified into fifteen

different genres. They used the statistical technique of discriminant analysis on twenty

parameters. These included sentence length, proportion of pronouns, average characters

1 translated by J. M. D. Meiklejohn


per word, and number of relative pronouns. They applied this method to classify the five hundred texts from the Brown corpus, which have been manually classified as belonging to

different genres. Karlgren and Cutting comment that readability measures work well to

discriminate text types since they include the most salient features of their experiments.

These include sentence length, word length, and characters per word.

The Flesch-Kincaid grade level measure computes readability based on the average

number of syllables per word and the average number of words per sentence. It is a widely used metric, included in both Microsoft Word and Corel's WordPerfect word processors, so it also has the advantage of convenience. The Flesch-Kincaid grade level was consequently calculated for the initial 1000 lines of the books in

table 4-1. The 1000 line limit was chosen since this represents a reasonable subset of the

book that is sufficient to capture its style, assuming that this is approximately uniform

throughout the text. The results are shown in table 4-2 below.

Table 4-2 shows that the books represent a range of reading complexity. They also demonstrate the internal consistency of the measure, as two books of a similar style by the same author ('Looking Glass' and 'Alice in Wonderland') have similar grade levels. We now move on to analyse the data produced from the lexical chains identified in the texts.

Table 4-2: Reading Complexity of the Texts

Book Title | Flesch-Kincaid Grade | Abbreviation

Alice's Adventures In Wonderland | 5.5 | Alice
Through The Looking Glass | 6.4 | Looking
Pride And Prejudice | 6.5 | Pride
Moby Dick | 7.8 | Moby
Lectures on The Industrial Revolution in England | 11.6 | Indrev
The Critique Of Pure Reason | 12.0 | Critique
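For reference, the Flesch-Kincaid grade level has a standard published formula based on the two averages mentioned above. A sketch, with invented word, sentence, and syllable counts for a simple and a dense text:

```python
def flesch_kincaid_grade(words, sentences, syllables):
    # Standard Flesch-Kincaid grade level:
    # 0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Invented counts: short sentences with few syllables vs long, polysyllabic prose.
simple = flesch_kincaid_grade(words=1000, sentences=100, syllables=1300)
dense = flesch_kincaid_grade(words=1000, sentences=40, syllables=1700)
print(round(simple, 2), round(dense, 2))
```

The result is interpreted as a US school grade, which is why the table's values for the children's books sit well below those for Kant and Toynbee.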


4.4 Determination of the Lexical Cohesive Relationships

We used an algorithm based on those of Morris and Hirst (1991), and StOnge (1995) to

identify the lexical cohesive relationships in the texts. This is described in Chapter 2. Four

relations were examined:

1. Links between identical words (hence ID).

2. Links between words that are not identical, but are members of the same Roget category (hence CAT).

3. Links between words that are members of the same group of categories in Roget, but not in the same category (hence GRP).

4. Links through one level of internal thesaural pointers (hence ONE).
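A minimal sketch of how the first three relations might be classified, given precomputed word-to-category and category-to-group tables. The words, category ids, and group names below are toy data, not real Roget entries, and the ONE relation (pointer traversal) is omitted:

```python
def link_type(word_a, word_b, categories, group_of):
    # categories: word -> set of Roget category ids it appears under.
    # group_of:   category id -> the group of categories it belongs to.
    if word_a == word_b:
        return "ID"
    cats_a = categories.get(word_a, set())
    cats_b = categories.get(word_b, set())
    if cats_a & cats_b:
        return "CAT"
    groups_a = {group_of[c] for c in cats_a}
    groups_b = {group_of[c] for c in cats_b}
    if groups_a & groups_b:
        return "GRP"
    return "NONE"

# Toy data: the category ids and group names are not real Roget entries.
categories = {"ship": {267}, "boat": {267}, "ocean": {341}}
group_of = {267: "motion", 341: "motion"}
print(link_type("ship", "boat", categories, group_of))   # CAT
print(link_type("ship", "ocean", categories, group_of))  # GRP
```

The checks are ordered from strongest to weakest relation, so each word pair is tagged with the tightest relationship it satisfies.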

4.5 Analysis 1: Link Distribution between Documents.

This first analysis presents unprocessed sums of the link types. That is, all the lexical

chains found in the documents were examined, and simple sums made of the types of

lexical linking relationships found. This is shown in Graph 4-1 below.

Our initial hypothesis was that there would be more "weaker" linking relationships (such as

GRP or ONE), since these can connect to a greater number of words than the identical

word or same category relations. However, this was not the case.

Simple word identity (ID) is the most common lexical linking relationship found.

Following that, we find Roget category entry (CAT), then Roget group membership

(GRP). The ONE relationship is least frequent. Words not contained in the thesaurus are

shown as NONE.

All the books in the experimental corpus show approximately the same total link

distribution. Since the books represent increasingly complex texts, we have shown that

the proportion of links of different types found in a text is broadly independent of the

complexity of that text. We also have reason to question the value of the more complex thesaural relationships. The ONE level of indirection relation is sufficiently rare that one may question whether it is worth calculating. Indeed, its cost of calculation, as measured in program run time, far outweighs its potential benefits. Consequently, it is not

considered after this chapter.


We may speculate as to why the ONE level of indirection is found so rarely. A possible

explanation could be that Roget is a tool to aid writers. As such, ONE pointers indicate

nuances of word senses in related categories. Possibly writers choose related words to

create a more cohesive document. If that is the case, this might be recognised through the

other relationships of CAT, GROUP, and ID since lexical chaining2 gives no insight into

the writing process.

[Graph omitted: stacked percentage bars (0% to 100%) of the link types Identity, Category, Group, OnePoint, and None for each book: alice, looking, pride, moby, indrev, critique.]

Graph 4-1: Link Type Vs Book Title

Looking at the proportions of link types in Graph 4-1, it is clear that an algorithm that

only uses identical words could give useful performance: it will usually discover the

majority of relationships in a text, and will do this accurately. Although one may expect

better performance as more relationships are added, word sense ambiguity comes into

effect, and this will cause inappropriate linking. Indeed, Sanderson (1994) has concluded

from Information Retrieval experiments that word sense disambiguation is likely to

degrade performance unless it is more than 90% accurate. The problem of word sense

ambiguity may explain why Hearst (1994) reports that her text segmentation algorithm

'TextTiling' performed better when it was not aided by thesaural relationships.

Stairmand (1996) compared a text segmentation algorithm that used lexical chains

(derived from WordNet; see Chapter 2) to TextTiling. He found that TextTiling gave

superior performance. Although Stairmand (1996) offers no explanation for this, it seems

2 as currently formulated!


likely that TextTiling could perform better with less (but perfect) data than Stairmand's approach with more (but less accurate) data.

Since the identical word relationship is so important, it is almost certainly an error to

eliminate words not found in the thesaurus during the pre-processing stage. The rationale

for this is that such words cannot form chains with other words. However, they will form chaining relationships with themselves, and this may form a significant aspect of a text. How this error may be corrected remains, however, a research problem.

4.6 Analysis 2: Link Distributions across Different Document Types

Now we need to consider whether link distributions change across the different document

types. This is done by calculating the distances between each pair of words for which

there is a lexical linking relationship in each of the six experimental texts shown in Table

4-1. Comparative analysis between the texts is only possible if we compensate for their

different lengths. This is done by normalising the number of lexical links that share a

range of interword distances as a percentage of the total.

We do not know whether the distribution of different links varies in the same way for

each link type. Thus, the percentage calculation was divided according to the type of link

considered. We can then plot the percentages of each link type against the distance

between the words in that link.
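The per-type percentage normalisation described above can be sketched as follows, using a handful of invented (link type, distance) pairs:

```python
from collections import Counter

def distance_distribution(links):
    # links: iterable of (link_type, distance) pairs for one text.
    # Returns, per link type, the share of that type's links found at
    # each inter-word distance, so texts of different lengths compare.
    by_type = {}
    for kind, distance in links:
        by_type.setdefault(kind, Counter())[distance] += 1
    shares = {}
    for kind, counts in by_type.items():
        total = sum(counts.values())
        shares[kind] = {d: n / total for d, n in sorted(counts.items())}
    return shares

links = [("ID", 1), ("ID", 1), ("ID", 3), ("CAT", 2)]
print(distance_distribution(links))
```

Because each type is normalised against its own total, the resulting curves for two texts of very different lengths can be plotted on the same axes, which is what Graphs 4-2 to 4-4 rely on.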

[Graph omitted: percentage of identical (ID) links against inter-word distance (1 to ~100) for each of the six texts.]

Graph 4-2: Identical Links (%) Vs Inter-word Distance


[Graph omitted: percentage of category (CAT) links against inter-word distance (1 to ~100) for each of the six texts.]

Graph 4-3: Percentage of Category Links Vs Inter-word Distance

[Graph omitted: percentage of group (GRP) links against inter-word distance (1 to ~100) for each of the six texts.]

Graph 4-4: Percentage of Group Links Vs Inter-word Distance

Graphs 4-2, 4-3, and 4-4 show the results of this analysis by link type3. As can be seen,

the percentage distributions are almost identical for all the texts, and for all the linkage

types. This means that the type of text does not affect link creation in lexical chains. It

3 The ONE link type has been excluded since this does not occur frequently enough to generate consistent data.


also follows that the distance between words in a text is independent of the thesaural

relationships sought between the words.

It can also be seen that Morris and Hirst had little justification for applying special status to the identical word relation, as it follows a similar distribution to the other thesaural links (Section 3.3).

4.7 Related Work

A mathematical model showing an exponentially decaying relationship between co-occurring words in English has been described by Beeferman et al. (1997). Their work is empirical and based upon a statistical analysis of 'trigger pairs' of words appearing in

five million words of the WSJ corpus (a collection of articles from the Wall St. Journal

newspaper).

Beeferman et al. (1997) divide trigger pairs into self and non-self triggers. Self triggers are identical words that are repeated in a text. Non-self triggers are non-identical words

for which a statistical pattern of co-occurrence can be identified. Graph 4-5 below

(reproduced from Beeferman et al. 1997) shows the observed distance distributions.

Graph 4-5: Non-Self Triggers (Beeferman et al. 1997)


Graph 4-6: Self Triggers (Beeferman et al. 1997)

It appears that the self triggers of Beeferman et al. (1997) are the same as the identical word relationship described by Morris and Hirst (1991). Therefore, they have a similar

distribution pattern. Of course, we have shown above that this relationship is found in

several texts of differing complexities.

Beeferman et al.'s (1997) non-self triggers are more interesting. The notion is more

powerful than thesaural linking, since it captures associations that could not have been

previously stored in a database. Examples of such relationships are given in Table 4-3 below.

Table 4-3: Samples of Trigger Pairs (Beeferman et al. 1997)

Trigger | Triggered Word

Ms. | Her
Changes | Revisions
Energy | Gas
Committee | Representative
Lieutenant | Colonel
Soviet | Missiles
Underwater | Diving
Patients | Drugs
Voyager | Neptune
Medical | Surgical

Non-self triggers display the same distribution characteristics as the CAT, GROUP, and ONE lexical linking relationships; however, they were derived in a completely different


way. Beeferman et al.'s (1997) work is based on statistical analysis of large corpora, whereas lexical linking uses a thesaurus to predict relationships without prior analysis. Beeferman et al.'s (1997) data support the hypothesis that the distance between related words in texts is independent of text genre. As Beeferman et al. (1997) used completely different methods, it is unlikely that Graphs 4-2 to 4-4 are artefacts of the algorithm used.

Consequently, we can conclude that inter-word relationships are independent of text style.

4.8 Conformance to Zipf�s Law

The data in Graphs 4-2 to 4-4, and Beeferman et al.'s (1997) data, display the characteristic power curve of Zipf's law that is often found in natural language (Chapter 2).

Graph 4-7 demonstrates Zipf's law in the experimental data of analysis two. Specifically, data for Moby Dick have been extracted from Graph 4-4, and re-plotted using double logarithmic scales. This shows the distance between words in the same thesaural category against rank. A straight line can be observed within the 95% confidence limits

drawn by the statistical package SPSS.

The data in Graphs 4-2 to 4-7 display the same power curve. We can conclude that this conforms to Zipf's law, as did the Generic Document Profile (Section 3.7).


Graph 4-7: Moby Dick, Number of Same-Category Words Vs Rank
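The straight-line check on double-logarithmic axes amounts to fitting a least-squares slope to log(count) against log(rank); a slope near -1 is the classic Zipf signature. A sketch using synthetic counts generated from an exact power law, rather than the thesis data:

```python
import math

def loglog_slope(ranked_counts):
    # Least-squares slope of log(count) against log(rank).
    xs = [math.log(rank) for rank in range(1, len(ranked_counts) + 1)]
    ys = [math.log(count) for count in ranked_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic counts from an exact power law: count = 1000 / rank.
counts = [1000 / rank for rank in range(1, 101)]
print(round(loglog_slope(counts), 6))  # -1.0
```

Real data will not yield an exact slope, which is why a confidence band around the fitted line (as drawn by SPSS here) is the appropriate test.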

4.9 Conclusion

Lexical cohesion is a property of the words in a text. Relationships that link words have

been termed lexical links. Links may be composed into chains, and such lexical chains

have great potential utility in text processing tasks, such as information retrieval, text

similarity detection, or text summarisation.

A major concern is that the types of lexical chains to be found in a text may depend on the style of that text. If this had been true, we would not have been able to base a measure of

text similarity directly on lexical chains: it would have needed to be mediated by a

determination of text genre.

This concern has been rejected experimentally by analysing several book-length texts. These were selected to be no more recent than Roget's 1911 thesaurus. This maximised

the applicability of the lexical chaining algorithm. In addition to intuition, the books were


shown to be of different reading difficulty by comparing them using the Flesch-Kincaid

grade level readability measure.

The distribution frequency of the lexical links found in the mini-corpus was strikingly similar for all the link types. This supports the hypothesis that text analysis

measures based upon lexical cohesive links will be applicable to different styles of texts.

Thus, the text similarity technique discussed in Chapter 3 is capable in principle of

determining the similarity of texts about the same subject, but written in different styles.

There are several implications that follow from these findings. Firstly, for the existing

work on lexical chains reported in Chapter 2 it follows that no special attention should be

paid to the texts used in the lexical chaining experiments. If lexical links are independent

of text genre, then the results of Stairmand (1996), StOnge (1995), Green (1997), and

Barzilay and Elhadad (1998) would also be replicated if texts of different genres had been

chosen. Secondly, regarding future work, no particular control need be applied to ensure

the uniformity of text genres. This is particularly important for the experiments reported

in Chapter 6, where texts are randomly selected from the Internet for comparative

experiments on human similarity judgements.

We now proceed to Chapter 5, which considers the issue of word sense ambiguity in

relation to Hesperus.

Chapter 5. Word Sense Disambiguation and Hesperus

5.1: Introduction

This chapter addresses the issue of word sense disambiguation (WSD) in relation to the

derivation of a text's Generic Document Profile. Word sense ambiguity arises since words may be used in more than one sense: for example, approximately 33% of the unique words in Roget's are found in more than one thesaural category, and some words are found in many categories (Appendix VII). This has considerable implications for a text's Generic Document Profile (GDP). It will be recalled from Section 3.6 that this involves analysing an input text word by word, identifying the words' thesaural categories and

linking them to chains of words related in meaning. Should an inappropriate word sense

be selected, two related problems arise: firstly it strengthens the wrong chain, and

secondly it correspondingly weakens the correct chain. This affects the overall accuracy

of the performance of the Generic Profile, and subsequently weakens the text similarity

algorithm.

The question then is whether we should try to choose a more valid word sense from amongst the possible candidates, or accept the inaccuracies of the lexical chaining approach. That is, since word sense disambiguation is an unsolved problem, it is quite possible that an attempt to solve it within Hesperus will introduce greater imprecision to the text similarity match than that caused by the problem itself. This chapter addresses

that question.


Figure 5-1: Structure of the Thesis

1. Introduction.

2. Literature Review.

3. Hesperus: A system for comparing the similarity of texts using

lexical chains.

4. The General Nature of Lexical Links.

5. Word Sense Disambiguation and Hesperus.

6. Evaluating Hesperus

7. Conclusion


This chapter proposes improvements to the text similarity process described in Chapter 3

by including a dedicated sense disambiguation phase. This is independently assessed, and

subsequently developed as an additional module for Hesperus. The improved system is shown graphically in Fig. 5-2 and subsequently evaluated in Chapter 6.

Okumura and Honda (1994) showed that the lexical chaining process implicitly provides

word sense disambiguation (Section 2.4.5). We may contrast this to explicit word sense

disambiguation, where we attempt to determine word senses prior to, and independent of,

lexical chaining. The objective here would be to avoid spurious word associations and

lexical links, by ensuring that words are only used in their intended sense.

Given the natural tendency of lexical chaining to disambiguate word senses, there is a

question as to whether explicit disambiguation should be attempted. As pointed out in Section 2.2, Sanderson (1996) has argued that WSD will negatively affect Information Retrieval (IR) performance unless it is more than 90% accurate (Chapter 2). Given that Kilgarriff (1998) has only demonstrated human sense-tagger agreement of 91%1, it seems

questionable that any algorithmic system can approach the indicated level of performance

for IR. Text similarity matching is not, however, IR. Consequently, the effect of WSD

needs to be specifically addressed.

The objective of this chapter is to examine whether an explicit WSD system compatible

with Hesperus can be produced. Even if it is not capable of disambiguation accuracy greater than 90%, it is certainly desirable that its performance be comparable to the current state of the

art. The impact of explicit WSD performance on a text similarity task can then be

addressed (in Section 6.3.4).

This chapter is arranged as follows: Firstly, the problem of evaluating word sense disambiguation in Hesperus is discussed. Secondly, plausible options for the 'local disambiguator algorithm' are described. Next, their evaluation is given within the context of 'Senseval', an international word sense disambiguation competition. Finally, a local sense

disambiguation module for Hesperus is described, and conclusions drawn.

1 This refers to agreement over fine sense distinctions. Human agreement is higher where only coarse, homographic,

distinctions are considered (Ng et al. 1999)


5.2. The Problem of evaluating the effects of Word Sense Disambiguation

in Hesperus

Explicit word sense disambiguation was proposed in Section 3.9 as a pre-processing

phase that would feed sense-disambiguated words as input to Hesperus. This would

improve the quality of the lexical chains produced from a text, by eliminating spurious

associations due to words being linked in other than their intended senses. The principal

problem with this approach is that word sense disambiguation is an unsolved problem

(Section 2.4.2), and the evaluation of possible approaches requires a sense-tagged corpus (Section 2.4.4). Senseval was selected for this purpose, as it was both timely and purpose-designed for comparative evaluation.

Senseval (Section 2.4.3) was set up as a competition to allow the evaluation of the relative

performance of different word sense disambiguation approaches. Potential participants first received (May 1998) a set of 'dry run' data that gave the format of the competition, and relevant sections of the Hector dictionary on which the Senseval competition was based (Kilgarriff and Palmer 2000). A set of training data was

circulated in June 1998 tagged with senses for the forty-one words that were to be the

focus of the competition. Data for evaluation were circulated at the end of July 1998, and

two weeks were allowed for participating systems to tag this data and return results.

Participation in Senseval allowed a detailed evaluation of the several word sense

disambiguation techniques compatible with Hesperus. Consequently, a separate module prototype was developed and entered in the Senseval competition as 'the Sunderland University Similarity System', or SUSS.

5.3. The motivation for Hesperus participating in Senseval as SUSS

SUSS's principal objective in Senseval was to evaluate different disambiguation

techniques suitable for use within Hesperus. These could then improve the performance of a future version of Hesperus as a local disambiguator (Section 5.5).

A derived objective was to maximise the number of successful disambiguations. This was

both a requirement for success in the competition, and an objective for Hesperus, where

incorrect disambiguation is a possible source of inaccuracy.


SUSS extensively exploited the Hector machine readable dictionary entries. There were

two reasons for this: firstly, Hector dictionary entries are extremely rich, and allowed us to consider disambiguation techniques that would not have been possible using Roget alone; secondly, Hector sense definitions were much finer-grained than those used in

Roget. A system that used Roget would consequently have been at a considerable

disadvantage since it would not have been able to propose exact Hector senses in the

competition. The use of Hector also allowed us to envisage the effect of better informed

WSD on future extensions to Hesperus.

Now we will look at the Hesperus paradigm, and the strategy used to develop SUSS.

The SUSS development strategy.

SUSS was developed using an iterative development strategy designed to maximise

performance as measured by the total number of successful disambiguations. The strategy

was as follows:

1 A basic system was implemented that processed the training data.

2 A statistics module was implemented that displayed disambiguation

effectiveness by word, word sense, and percentage precision.

3 As different disambiguation techniques were developed, effectiveness was

measured on the whole corpus.

Techniques that improved performance were further developed. Those that degraded

performance were dropped. Since the competition was time-limited, there was no time to pursue interesting but unsuccessful approaches.

5.4. SUSS: The Sunderland University Senseval System

SUSS was a multi-pass implementation that reduced the number of candidate word senses

by repeated filtering. Following an initialisation phase, different filters are applied to

select a preferred sense tag, or eliminate inappropriate ones.

The order of filter application is important. Word and sense specific techniques are

applied first, more general techniques are used if these fail. Specific techniques are not

likely to affect any other than their prospective targets, whereas general methods

introduce probable misinterpretation over the entire corpus. For example, a collocate such


as ‘brass band’ uniquely identifies that sense of ‘band’, with no impact on other word

senses. Other techniques required careful assessment to ensure that their overall effect

was positive. This was part of a structured development strategy.

A data flow diagram representing the overall system using standard SSADM notation

(e.g. Weaver 1993) is given in fig 5.2 overleaf, and descriptions of the filters are given in

Section 5.4.

SUSS Initialisation Phase

SUSS used a preparation phase that included dictionary processing and other preparations

that would otherwise be repeated for each lexical sample to be processed. The Hector

dictionary was loaded into memory using a public domain program that parses SGML

instances. This made the definition available as an array of homographs that is further

divided into an array of finer sense distinctions. Each of these contained fields, such as

the word sense definition, part of speech information, plus examples of usage.

The usage examples were used in the ‘example comparison filter’ and the ‘semantic

relations filter’ techniques (described below). They were reduced to narrow windows W

words wide centred on the word to be disambiguated, from which stopwords (Salton and

McGill 1983) have been eliminated. This facilitated comparison with identically

structured text windows produced from the test data. The main SUSS algorithm is as

follows.


Figure 5-2 : SUSS System design


Algorithm 5-1: SUSS Algorithm Processing Phase

FOREACH sample:

Filter possible entries as collocates.

DONE IF there is only one candidate sense

Filter remaining senses for

information extraction pattern.

DONE IF there is only one candidate sense

Filter remaining senses for idiomatic phrases.

DONE IF there is only one candidate sense

Eliminate-stopwords from sample.

Produce-window W words wide centred on target word

FOREACH example in the Hector dictionary entry

Match the sample window against the example window

Select the sense that has the highest

example matching score.

If no unique match found, return the most frequent sense

of those remaining from the training corpus (or first

remaining dictionary entry2).
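
The control flow above is essentially a cascade over a shrinking candidate set. A minimal Python sketch of that structure (the filter functions, sense names, and fallback here are hypothetical stand-ins, not SUSS code):

```python
# Illustrative sketch of the multi-pass filtering in Algorithm 5-1.
# The filters and the sense list are hypothetical stand-ins.

def run_filters(senses, filters, fallback):
    candidates = list(senses)
    for f in filters:
        candidates = f(candidates)
        if len(candidates) == 1:       # DONE IF there is only one candidate sense
            return candidates[0]
    return fallback(candidates)        # e.g. most frequent remaining sense

# Toy filters: senses are plain strings here.
drop_collocational = lambda cs: [s for s in cs if "/" not in s]
drop_rare = lambda cs: [s for s in cs if not s.endswith("3")]

senses = ["band.1", "band/brass", "band.3"]
print(run_filters(senses, [drop_collocational, drop_rare], fallback=lambda cs: cs[0]))
```

Ordering the list of filters from specific to general reproduces the design principle discussed above.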

We now go on to describe the specific techniques tested.

Collocation Filter

Collocations are short, set expressions which have undergone a process of lexicalisation.

For example, consider the collocation �brass band�. This expression, without context, is

understood to refer to a collection of musicians, playing together on a range of brass

instruments, rather than a band made of brass to be worn on the wrist. The Hector

dictionary encodes such collocate expressions as distinct senses of the word.

Given the set nature of collocations, we decided to look for these senses early in the

disambiguation process since this would be a simple method of identifying or eliminating

them from consideration.

2 The calculation of sense occurrence statistics was designed to counter a perceived deficiency in Hector, where the

ordering of senses did not appear to match that of sense frequency in the corpus.


The collocation identification module, therefore, worked as a filter using simple string

matching. If a word occurrence passing through the module corresponded to one of the

collocational senses defined in the dictionary, or could be morphologically reduced to

such an entry, it would be tagged as having that sense. If none of these senses were

applicable, however, all senses taking a collocational form were filtered out.
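
This string-matching behaviour can be sketched as follows, where the sense records and the one-letter suffix stripping are hypothetical simplifications of the Hector entries and of morphological reduction:

```python
# Sketch of the collocation filter (hypothetical sense records; crude
# suffix stripping stands in for morphological reduction).

def reduce_form(text):
    return text[:-1] if text.endswith("s") else text   # "brass bands" -> "brass band"

def collocation_filter(senses, occurrence):
    forms = {occurrence, reduce_form(occurrence)}
    hits = [s for s in senses if s.get("collocation") in forms]
    if hits:
        return hits                                       # tag with the collocational sense
    return [s for s in senses if "collocation" not in s]  # drop all collocational senses

senses = [{"id": "band/brass", "collocation": "brass band"}, {"id": "band/strip"}]
print([s["id"] for s in collocation_filter(senses, "brass bands")])
print([s["id"] for s in collocation_filter(senses, "rubber band")])
```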

Information Extraction Pattern Filter

The Information Extraction filter refers exclusively to enhancements made to the Hector

dictionary entries specifically to support word sense disambiguation. The Hector

dictionary is primarily intended for human readers. Many entries contain a clues field in a

restricted language that indicates typical usage. Examples include phrases such as ‘learn

at mother's knee, learn at father's knee, and variants’, or ‘usu. on or after’. Such phrases

have long been proposed as an important element of language understanding (Becker

1975). These phrases were manually converted into string matching patterns and

successfully used to identify individual senses.

For example, ‘shake’ contains the following:

<idi>shake in one's shoes, shake in one's boots</idi>

<clues>v/= prep/in pron-poss prep-obj/(shoes,boots,seat)</clues>

The idiom field can then be converted (using PERL patterns) as follows:

<idi>shake in \w* (shoes|boots|seat)</idi>

This may now be used to match against any of the idiomatic expressions ‘shake in her

boots’, ‘your boots’, etc., as morphological variants would previously have been reduced

to base forms.
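
The converted field behaves as an ordinary regular expression. A minimal sketch, using Python's re module in place of PERL:

```python
import re

# The pattern derived above from the Hector idiom field; \w* absorbs the
# possessive pronoun, so pronoun variants all match.
idiom = re.compile(r"shake in \w* (shoes|boots|seat)")

for phrase in ["shake in her boots", "shake in your shoes", "shake the tree"]:
    print(phrase, "->", bool(idiom.search(phrase)))
```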

We call a related method ‘phrasal patterns’. A phrasal pattern is a non-idiomatic multiple

word expression that strongly indicates use of a word in a particular sense. For example,

‘shaken up’ seems to occur only in past passive forms. Adding appropriate phrasal

patterns to a dictionary sense was found to increase disambiguation performance for that

sense. The majority of phrasal patterns were manually derived from the Hector dictionary

entries. Others were identified by observing usage patterns in the dictionary examples, or

the training data.


Collocation and other phrasal methods are important since they are tightly focused on one

word, and on one sense that word may be used in. They do not affect other word senses,

and cannot influence the interpretation of other words.

Idiomatic Filter

Idiomatic forms identify some word senses. Unlike collocations, however, idiomatic

expressions are not constant in their precise wording. This made it necessary to search for

content words in a given order, rather than looking for a fixed string. An idiom was

considered present in the text if a subset of the content words were found exceeding a

certain (heuristically determined) threshold value. For example, the meaning of ‘too many

cooks’ is clear, without giving the precise idiom.

Dictionary entries that contained idiomatic forms were processed as follows. Firstly,

two-word idioms were checked for specifically. If the idiom was longer, stopwords were

removed from the idiomatic form listed, and the remaining content words were compared

in order with words occurring in the text. If 60% of the content words were found in the

region of the target word, the idiomatic filter succeeded, and senses containing that idiom were selected.

Otherwise, senses containing that idiomatic form were excluded from further

consideration.
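
The threshold test can be sketched as below. The stopword list is hypothetical and the in-order matching is a simplified reconstruction; only the 60% threshold is taken from the text:

```python
# Sketch of the idiomatic filter's content-word test (hypothetical stopwords).
STOPWORDS = {"in", "the", "a", "of", "one's", "too"}

def idiom_present(idiom, text_words, threshold=0.6):
    content = [w for w in idiom.split() if w not in STOPWORDS]
    found, position = 0, 0
    for w in content:                       # content words must appear in order
        if w in text_words[position:]:
            position = text_words.index(w, position) + 1
            found += 1
    return found / len(content) >= threshold

print(idiom_present("shake in one's boots", "he began to shake in his old boots".split()))
print(idiom_present("too many cooks spoil the broth", "the cooks arrived".split()))
```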

Example Comparison Filter.

The Example Comparison Filter tries to match the examples given in the dictionary

against the word to be disambiguated, looking at the local usage context. It assigns a score

for each sense based on identical words occurring in the text and dictionary examples and

their relative positions. We take a window of words surrounding the target word, with a

specified width and specified position of the target, in the text and in a similar window

from each dictionary example.

For each example in each sense, all the words occurring in each window are compared

and, where identical words are found, a score, S, is assigned, where

S = Σ_{w ∈ W} dS(w) · dE(w)

Equation 5-1: Example Comparison Score

and w is a word in window W, and dS and dE are functions of the distance of the word

from the target word in the sample and example windows respectively, such that greater


distances result in lower scores. The size of the window was determined empirically.

Window sizes of 24, 14, and 10 words were tried. Larger window sizes increased the

probability of spurious associations, and a window size of ten words (which is five words

before and five words after the target word) was selected as optimal.

When all the example scores have been calculated for each word sense, the sense with the

highest example score is chosen as the correct sense of that occurrence.

In cases where this does not produce a result, the most frequently occurring sense (or first

dictionary sense) that has not been previously eliminated is chosen.
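
Equation 5-1 can be sketched directly. The inverse-distance weight below is an assumption, since the thesis specifies only that dS and dE decrease with distance from the target:

```python
# Sketch of the example comparison score (Equation 5-1).
# weight() is a hypothetical choice for the distance functions dS and dE.

def weight(offset):
    return 1.0 / (1 + abs(offset))

def example_score(sample_win, example_win, centre):
    score = 0.0
    for i, w in enumerate(sample_win):
        if i == centre:
            continue                       # skip the target word itself
        for j, e in enumerate(example_win):
            if j != centre and e == w:
                score += weight(i - centre) * weight(j - centre)
    return score

sample  = ["loud", "brass", "band", "played", "music"]   # target: "band"
example = ["the",  "brass", "band", "marched", "past"]
print(example_score(sample, example, centre=2))
```

The sense whose example window yields the highest score wins, as described above.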

Other Techniques Evaluated.

One of the objectives of SUSS was to evaluate different disambiguation techniques.

Below we describe two methods that were evaluated, but not used in the final system,

since they led to decreased overall performance.

Part of Speech Filter

Wilks and Stevenson (1996) have claimed that much of sense tagging may be reduced to

part-of-speech tagging. Consequently, we used the Brill (1992) Tagger on the subset of

the training data set that required part-of-speech discrimination. This should have

improved disambiguation performance by filtering out possible senses not appropriate to

the assigned part of speech. However, due to perceived tagging inaccuracy, this was just

as likely to eliminate the correct word sense too. Consequently, it did not make a positive

contribution (Klinke 1998).

Another routine that used the part-of-speech tags attempted to filter out the senses of

words marked as noun modifiers by the dictionary grammar labels where the following

word was not marked as a noun by the tagger. This routine also checked words that

contained an 'after' specification in the grammar tag and eliminated these senses where

the occurrence did not follow the word given. However, it gave no overall benefit to the

results either. One possible cause of this is in occurrences where there are two modifiers

joined by a conjunction so that the first is, legitimately, not followed immediately by a

noun.


Semantic Relations Filter

The Semantic Relations Filter is an extension of the example comparison filter that uses

overlapping categories and groups in Roget’s thesaurus, rather than identical word

matching. This technique was prompted by in-sentence thesaural based disambiguation

described in Wilks, Slator and Guthrie (1995), and attributed to Masterman in the late

1960s. It involves looking for strong, same category, or identical word relationships in

the window surrounding the current word. Where these are found, alternative senses are

eliminated. This may lead to partial disambiguation and should allow us to recognise that

‘accident’ is used in the same sense in ‘car accident’ and ‘motor-bike accident’, since

both are means of transport.

Appropriate scores are allocated for each category in Roget that the test sentence window

has in common with the dictionary example window. As in the example comparison, the

sense that contains the highest scoring example is selected as the best.

Disappointingly, this technique finds many spurious relations where words in the local

context are interpreted ambiguously. This led to an overall performance degradation over

the test set, and so the technique was not part of the final SUSS algorithm.

Results Of The SUSS Evaluation.

Introduction

Senseval was organised as a competition in order to allow the comparative evaluation of

different word sense disambiguation techniques. More importantly however, the

availability of the Senseval sense tagged (lexical samples) corpus permitted the

comparative assessment of the individual techniques possible within the Hesperus

paradigm.

The results given below consequently include both intra-system assessments of the

various SUSS methods, followed by relative assessments of performance in comparison

to other WSD systems that competed in Senseval. First, the performance of the various

techniques is given on the Senseval dry run data; then the performance of SUSS relative

to other word sense disambiguation systems is given.

Dry Run Results

These results demonstrate the performance of SUSS techniques on the sense-tagged

corpus distributed for system training. Following the development strategy


outlined earlier, several variations of the WSD techniques were evaluated across the

entire Senseval corpus. This was especially important, as techniques that improved

performance on disambiguating senses of one word could cause an overall reduction in

performance across the corpus if they have negative effects on other words.

In table 5.1 (below) results are given for several combinations of techniques across the

Senseval training corpus of five thousand sense tagged samples. These are divided into

rows that correspond to the twenty-eight words whose senses have been tagged3.

The column headings in Table 5.1 are given the mnemonics, trlogall, origdictlog, noidlog,

th-log, synbrilllog, nostatslog, and lastlog. These are described below.

trlogall

This version of the algorithm is that reported in Ellman, Klinke, and Tait (1998) at the

Senseval workshop.

origdictlog

This variation of the algorithm tested the contribution of sense specific information

extraction patterns added to the dictionary. This was done by using the original,

unaugmented Hector dictionary.

noidlog

This variation of the algorithm tested the contribution of the example comparison filter.

This was done by disabling the filter so that sentence fragments are not matched against

those from the Hector dictionary definitions.

th-log

This series of tests used the semantic relations filter. That is, example matching was tried

using the semantic relations filter.

synbrilllog

In this series of tests entries were pruned from the dictionary using the part of speech

filter if they did not correspond to parts of speech used in the competition.

3 That is, there were approximately two hundred samples of each word embedded in different sentences, where the

numbered sense of the test word has been manually identified and marked-up.


nostatslog

This test series did not calculate word sense frequency statistics from the training

data. Consequently the sense ordering used was that found in the dictionary. Since

dictionary entries are generally ordered by frequency of sense occurrence, calculating

statistics should have made no difference. This was the observation in a majority of cases.

However statistical calculation made a clear contribution to several words (e.g. shake,

bitter, and sanction).


Table 5-1: Disambiguation Success (% accuracy) vs Word by method

 n  Word        trlogall  origdictlog  noidlog  th-log  synbrilllog  nostatslog
 1  float           20.9         20.9     15.6    17.4         20.9        25.5
 2  bet             33.3         33.3     38.8    29.1         33.3        33.3
 3  generous        33.6         33.6     32.2    25.1         33.6        33.6
 4  bury            40.1         40.1     36.4    30.9         40.1        40.4
 5  bother          42.5         41.2     45.2    24.1         42.5        44.6
 6  seize           44.3         44.3     27.5    28.5         44.3        32.3
 7  derive          44.4         44.4     44.0    38.2         44.4        44.4
 8  invade          45.3         45.3     39.6    17.0         45.3        30.2
 9  consume         46.3         46.3     43.3    31.3         46.3         5.8
10  bitter          47.9         45.1     46.9    22.2         47.9        25.7
11  giant           50.6         50.6     44.1     9.4         50.6         8.5
12  promise         50.9         50.7     52.4    16.6         50.9        19.8
13  brilliant       51.6         51.6     47.5    28.7         51.6        51.6
14  knee            51.7         52.0     46.2    52.0         51.7        50.1
15  modest          59.9         59.9     60.2     9.4         59.9        16.3
16  calculate       61.0         61.0     62.7    49.8         61.0        61.0
17  shake           61.3         31.2     59.7    49.6         61.3        48.5
18  sanction        63.2         63.2     55.2    24.6         63.2        26.3
19  sack            64.8         64.8     64.4    51.0         64.8        15.5
20  slight          69.4         69.4     67.0    33.2         69.4        91.7
21  excess          76.5         52.2     78.1    62.2         76.5        61.4
22  accident        76.8         76.8     74.1    25.7         76.8        76.8
23  band            81.6         78.2     85.6    18.0         81.6        70.9
24  shirt           81.6         81.4     82.6    46.5         81.6        79.9
25  onion           92.3         92.3     92.3    30.8         92.3        92.3
26  behaviour       96.2         95.6     96.2    96.2         96.2        96.2
27  wooden          97.2         96.4     97.5    97.2         97.2        97.2
28  amaze           99.1         99.1     99.1    98.4         99.1        99.4


These results may be seen graphically as follows:

Graph 5.2: Comparative Disambiguation Performance

Results of Senseval

The results of the Senseval competition are described in Kilgarriff and Rosenzweig

(2000). This gives multiple system analyses broken down by recall, precision, and

granularity. Recall refers to the proportion of correct disambiguations divided by the size

of the sample set; precision refers to the proportion of correct disambiguations divided by

the number attempted. Granularity is concerned with whether polysemous senses are

viewed as distinct, or whether only homographic distinctions are important, polysemous

senses being considered equivalent.
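
The two measures can be stated directly (the counts below are toy values, not Senseval figures):

```python
# Recall and precision as defined above.
def recall(correct, sample_size):
    return correct / sample_size       # correct disambiguations / sample set size

def precision(correct, attempted):
    return correct / attempted         # correct disambiguations / number attempted

# A system attempting 80 of 100 samples and getting 60 right:
print(recall(60, 100), precision(60, 80))
```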

Our concern here is not to discuss the comparative performance of SUSS, but to

demonstrate that the simple techniques used, which are suitable for inclusion in Hesperus,

are at least at the level of the current international standard.

This is shown graphically in figs 5-3, 5-4, and 5-5. Fig 5-3 shows the distribution of word

senses of ‘Shake’ and its collocations. No sense contributes more than 25% of possible

usage found in the corpus. Consequently, the challenge the systems faced was

considerable.

The recall performance of all the systems for the word shake is given in fig 5-4. SUSS is

ranked 9th out of 35 competing systems on this assessment. The competitors from 1st to 8th

are machine learning based systems that extensively exploit the training data.


System recall over the whole corpus (at fine granularity) is shown in fig 5-5. On this

measure SUSS was ranked 14th out of 39. This is comparable to the current level of

performance for dictionary based systems.

Figure 5-3: Distribution of the senses of ‘Shake’


Figure 5-4: System Comparison: ‘Shake’


5.5. Local Disambiguator

The experience from the SUSS prototype was incorporated into Hesperus as a local

disambiguator module. This would be an add-on utility designed to improve the

performance of the lexical chaining program. The local disambiguator takes raw text as

input, and passes ‘links’ to the lexical chaining algorithm. Each link encapsulates no

more than one token from the input text, where a token contains one word or one

collocation. Each link contains information about the thesaural categories of which the

word may be a member, in addition to bookkeeping information. This includes the

sentence and paragraph number in which the word was found, and its surface form.

Text input and pre-processing

The local disambiguator carries out various pre-processing activities preceding its role in

word sense disambiguation. These include looking up candidate words in Roget’s

thesaurus, collocation identification, analysis of morphological roots, and stopword

elimination. These will be described in turn, followed by the algorithm that specifies

their order of application.

Figure 5-5: Overall System Precision


These pre-processing activities interact, and are subject to empirical limits imposed by the

presence (or absence) of word forms in Roget. Program efficiency is also a consideration,

since looking words up in Roget requires disk access and is consequently slow.

Collocations are frequent word combinations that may modify or completely alter the

meaning of the component words. Collocations are less subject to ambiguity than single

words (Section 5.4), and Roget includes many examples. These include ‘bitter struggle’,

where bitter modifies the meaning of struggle, and ‘bitter pill’, which is usually used

metaphorically as one concept (bitter does not modify ‘pill’). It is consequently

advantageous to check for two word collocations in the thesaurus, since these may more

accurately identify a text’s concepts.

Verbs are often collocated with prepositions as phrasal verbs to indicate their sense more

clearly. However, the preposition may not be adjacent to the verb, and a full syntactic

analysis of the sentence would be required to recognise all verb collocations present in

Roget. For example, (1) and (3) below demonstrate the greater ambiguity of prepositions

not directly adjacent to the verb.

1. The accident seized the engine up.

2. The accident seized the engine.

3. The thief seized the car up the road.

4. The accident seized up the engine.

Note that recent editions of Roget’s thesaurus (e.g. 1987) contain many, but not all,

inflected collocations. For example, ‘seizing up’ may be found in the index, whilst

‘seized up’ is not.

The cost of looking for multi-word collocations and the increased risk of ambiguous

interpretations outweigh the benefits of their use for Hesperus. Firstly, they are found far

less frequently than two word collocations (Appendix VII), and secondly longer

collocations are often stored in Roget in a generalised form (e.g. ‘afraid of one's own

shadow’) that could not be recognised without sophisticated and unreliable linguistic

processing. Collocations are consequently limited to adjacent word pairs.


A word’s morphological root is its simplest form without possible inflections. For

example, the root of ‘giving’ is ‘give’. Roget contains both inflected and uninflected

word forms. If a word is not present in Roget, its morphological root is determined using

the algorithm described in Winograd (1972). Programs that are more powerful are readily

available (such as PC-KIMMO; Antworth 1993); however, their sophistication incurs

increased cost, whilst Winograd’s (1972) algorithm may be implemented as a short

subroutine.

Stopwords are common function words (such as prepositions and conjunctions) that have

been found to add little to the content of a document for Information Retrieval purposes

(Salton and McGill 1983). Words may be quickly identified as stopwords from a list held

in memory, assuming that they are not part of collocations. The stopword list used was

that defined by the SMART Information Retrieval system (Buckley 1985), as this is

widely used.

The algorithm that generates links from ASCII text is given below. This is used to populate

the local disambiguator, which will be discussed next.


Algorithm 5-2: Generate Links.

LET W1:=NULL

LET W2:=NULL

WHILE input-file NOT end-of-file

IF W2 EQUALS NULL

READ next word AS W1

ELSE W1:=W2

UNLESS end-of-sentence //simple collocation ?

READ next word AS W2

Categorise collocation W1-W2

UNLESS NULL

RETURN Link(W1-W2)

Find Morphological-root(W1) AS R1

IF R1 NOT EQUALS W1

AND W2 NOT NULL //inflected collocation ?

Categorise collocation R1-W2

UNLESS NULL

RETURN Link(R1-W2)

ELSE UNLESS Stopword(W1) //simple word ?

Categorise W1

UNLESS NULL RETURN Link(W1)

ELSE Categorise R1 //morphological root ?

UNLESS NULL RETURN Link(R1)

RETURN NULL

The procedure Categorise returns the thesaural categories of a word (or collocation).

The procedure morphological-root returns the word without any inflections that it

may have. If no root can be found, the word itself is returned.
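
Algorithm 5-2 can be approximated as follows. The thesaurus, stopword list, and one-letter suffix stripping are hypothetical stand-ins for Roget lookup and the Winograd morphology routine, and the scan is simplified: unlike the original, the second word of a recognised collocation is not consumed:

```python
# Simplified sketch of Algorithm 5-2 (hypothetical thesaurus and stopwords).

THESAURUS = {"brass band": ["music"], "seize": ["taking"], "seize up": ["halt"],
             "engine": ["machinery"]}
STOPWORDS = {"the", "up"}

def categorise(form):
    return THESAURUS.get(form)

def root(word):
    return word[:-1] if word.endswith("d") else word   # crude: "seized" -> "seize"

def generate_links(words):
    links = []
    for i, w in enumerate(words):
        nxt = words[i + 1] if i + 1 < len(words) else None
        if nxt and categorise(f"{w} {nxt}"):             # simple collocation?
            links.append(f"{w} {nxt}")
            continue
        r = root(w)
        if nxt and r != w and categorise(f"{r} {nxt}"):  # inflected collocation?
            links.append(f"{r} {nxt}")
        elif w not in STOPWORDS and categorise(w):       # simple word?
            links.append(w)
        elif r != w and categorise(r):                   # morphological root?
            links.append(r)
    return links

print(generate_links("the engine seized up".split()))
```

Note how ‘seized up’ is only recognised after morphological reduction, mirroring the inflected-collocation branch of the algorithm.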

Explicit Disambiguation

The local disambiguator uses a circular data structure as a source of links for the lexical

chaining algorithm. This is known as the ‘data-ring’. The data-ring is populated with

links derived from plain text by calling algorithm 5-2 above. These may then be


disambiguated (to reduce the number of Roget categories that they refer to) before use in

lexical chaining.

The Local Disambiguator is designed specifically to counter the one pass, ‘greedy’

(Barzilay and Elhadad 1997) nature of the chaining algorithm. That is, a pure one-pass

algorithm could link one word (A) to another previously seen word using the weakest link

type. Alternative senses of that word would then be eliminated, since it had been assigned

to a chain. However it is often the case that a much better link could be formed with a

word as yet unseen. This ‘sliding window’ model of word sense disambiguation has been

described by Schütze (1992).

The algorithm is based on the SUSS Semantic Relations Filter (Section 5-4).

Algorithm 5-3: Local Word Disambiguation.

LET the word to be disambiguated be W

RETURN IF W has one thesaural category

IF an earlier word in the DATA-RING is identical

link their categories.

RETURN //Already disambiguated

Initialise vector V with thesaural-categories(W)

FOREACH E ∈ V

E := 0

FOREACH word W’ ∈ DATA-RING

LET S := thesaural-categories(W) ∩

thesaural-categories(W’)

FOREACH E’ ∈ S

increment V(E’)

IF all elements of V are 0 return. //No disambiguation

IF V contains one element higher in value than the

others, return that //full disambiguation

IF several senses share a value, remove the remainder, and

return those senses //partial disambiguation


Note that there are three possible outcomes to algorithm 5-3:

1. No Disambiguation Evidence

2. Partial Disambiguation

3. Full Disambiguation

In (1), no thesaural relationships are found, so no disambiguation may be done. In (2),

several word senses may form equally strong connections to neighbouring words. Weaker

word senses may be eliminated, which both simplifies the lexical chaining process, and

eliminates their spurious use in later processing. In (3), the number of candidate solutions

is reduced to one. This is optimal, providing the remaining sense is correct.
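
The voting that produces these three outcomes can be sketched as below, with hypothetical category sets standing in for Roget lookups over the data-ring:

```python
# Sketch of Algorithm 5-3's category voting (hypothetical thesaural categories).

CATEGORIES = {
    "accident": {"event", "disaster"},
    "car": {"vehicle"},
    "crash": {"disaster", "sound"},
}

def local_disambiguate(word, data_ring):
    votes = dict.fromkeys(CATEGORIES[word], 0)
    for other in data_ring:                    # neighbouring words in the ring
        for c in CATEGORIES[word] & CATEGORIES.get(other, set()):
            votes[c] += 1
    best = max(votes.values())
    if best == 0:
        return sorted(votes)                   # (1) no disambiguation evidence
    return sorted(c for c, v in votes.items() if v == best)  # (2) or (3)

print(local_disambiguate("accident", ["car", "crash"]))
print(local_disambiguate("accident", []))
```

A single surviving category corresponds to full disambiguation; a tie leaves a reduced set, i.e. partial disambiguation.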

5.6. Conclusion

This chapter has investigated the effects of word sense disambiguation on the

performance of Hesperus. In particular, techniques of local word sense disambiguation

were investigated in more detail. This was due to the availability of a large-scale sense

tagged corpus within the context of the Senseval word sense disambiguation evaluation

program.

From the above it is clear that the performance of SUSS in Senseval was comparable with

other systems that did not include a machine learning element. This gives confidence that

we can generalise the performance of Hesperus’ explicit word sense disambiguation to

any similar technique that could be applied instead of it.

An essential finding was that the composition of the Senseval corpus and relative

frequency of sense occurrences significantly affected the performance of the various

disambiguation techniques. For example, the intended sense of ‘wooden’ (where it is not

a component of the collocation ‘wooden spoon’) means ‘made of wood’ 345 times, and

describes ‘lacking liveliness, grace, or spirit’ on 5 occasions. Thus, by selecting the first

sense as default 98.5% accuracy can be achieved.

Simple word disambiguation techniques that conform to the Hesperus paradigm have

been shown to be as successful as most alternative approaches that do not use machine

learning techniques. Approaches that do use machine learning do, however, have superior

performance (even where explicit training data is not available).


Not all the approaches to word sense disambiguation implemented in SUSS could be

readily transported into Hesperus, as they relied upon the availability of detailed

dictionary data. Exploiting such data remains a possibility for further work.

Several methods were, however, used to develop the local disambiguator. Roget includes a

considerable number of collocations and phrases. It also includes many entries for verb-

particle combinations separate from the verb entry alone. The lesson from SUSS was that

such multi-word expressions have far lower ambiguity than their components when seen

individually.

A local disambiguator was consequently developed to exploit such expressions. This

included identifying the morphological roots of unrecognised words, and considering

whether they may be associated with a verb particle. The size of the input data-ring was

also restricted to seven words in accordance with the optimum found in Senseval. These

enhancements are included in the system used in the next chapter, which evaluates

Hesperus against human performance.

Chapter 6. Evaluating Hesperus

6.1 Introduction

This chapter investigates experimentally how well the Generic Document Profile (GDP)

performs at text similarity matching. Since similarity assessment is a matter of human

opinion, the principal method used in this chapter is to collect people’s judgements

experimentally, and compare these to Hesperus. These judgements also provide a baseline

measure against which the performance of possible enhancements to Hesperus may be

measured.

Two modifications to Hesperus are considered that may improve its efficacy. These are

the effects of explicit word sense disambiguation (Chapter 5), and altering the granularity

of the similarity matching (Section 3.3). Further modifications may be considered in the

future once the baseline measure is defined.

Section 3.5 describes the procedure for producing a similarity match score between two

or more texts. In this chapter we call the text that we wish to match against the ‘source

text’, and other texts ‘example texts’. If we calculate the GDPs of both the source and

example texts, we can calculate a similarity score for all the examples that may then be

ranked in order of their similarity to the source text. However, we do not know how

accurate this similarity ranking is.

Authors write for human readers. Consequently, people have to be the baseline arbiters of

how similar texts are, and need to provide judgements of text similarity. These may then

be used to evaluate the performance of the GDP procedure as that can be measured

against their assessment. This chapter describes experiments to firstly generate a set of

human judgements, and secondly to contrast the GDP against those judgements.

People’s opinions will naturally vary according to circumstance. People who are well

informed about a specific subject may be able to identify differences that are not apparent

to those who have had little exposure to it. For example, literary scholars dispute the

authorship of some of the works of Shakespeare based on differences in style that they

can identify.


The GDP is intended to be a general procedure. Consequently, it is important not to focus

on a specialist subject domain, as this would not fairly compare it against non-specialist

human judgements. Several areas should also be covered to ensure that the assessment is

not biased in favour of one particular subject. Therefore the texts should be general, and

their readers, the experimental subjects, should be non-specialists.

As the experimental subjects are to be non-specialists who will be asked to compare

random texts, we have to consider their motivation. Even the most well intentioned

subject will not be able to concentrate on differentiating between random texts if they can

not understand the task in principle. In other words, an ecologically valid design is most

important (see Section 1.6). That is, the experimental task needs to be similar to an

activity that the subjects routinely carry out, in an environment they are familiar with, and

using tools or equipment with which they are familiar.

Internet searching is now a routine part of student life. Given an assignment (or personal

interest), students are accustomed to sifting through the web pages identified by search

engines for those that may be useful. Recognising unrelated items is a component part

of this task, which will be routinely carried out using a web browser such as Netscape or

Explorer. Basing the experiment on a purpose-built web site will capitalise on this

activity, and satisfy the requirement for ecological validity.

There are a significant number of issues that need to be addressed to ensure that the

results are unbiased and hence of value. These start with three fundamental problems:

1. What constitutes a source text?

2. What texts are we trying to assess its similarity to?

3. What is the problem context?

We have claimed previously (Chapter 4) that the lexical chains found in texts of varying

lengths and styles are remarkably similar in distribution. Thus, our technique could be

applied in theory to any texts. This in turn gives rise to further concerns:


a) Are the source and example texts selected randomly?

b) Can the test data be made available electronically?

c) Is reliable human experimental verification possible?

We also need to compare the performance of Hesperus against a more standard method of

assessing text similarity, to gauge its relative effectiveness.

Of these concerns, (c) is most problematic. If one person provides a similarity assessment,

the results will reflect their opinion only, and that may be based on an idiosyncratic

interpretation of the text. This objection has to be countered by reference to statistics.

These may be collected by asking a number of experimental subjects to assess similarity

of several texts. This allows us to produce a group similarity assessment that is

independent of any one person's particular background and bias.

Text similarity assessment is nonetheless a difficult task. Unless the texts are short,

people find it difficult1 to read and compare several texts on the same subject. This sets a

practical upper bound on text length of between one and two pages.

A lower bound is imposed by the lexical chaining process. Since lexical chains are

derived from relationships between sets of words, there need to be sufficient words in the

set for several chains to emerge. This is unlikely to happen in one sentence, as word

repetition within one sentence violates English rules of style. Consequently, paragraph

length texts are a suitable minimum length for experimental purposes.

A great deal of text is now available electronically, either via the Internet or on

CD-ROM. The Internet contains huge quantities of texts (and graphics, sounds, etc.) on all

subjects, whereas CD-ROMs tend to be subject-specific.

Microsoft's 'Encarta'2 is an example of one such widely available CD-ROM. Encarta 97

is an electronic multimedia encyclopædia that contains 31,108 texts based on the 29-

volume Funk and Wagnall's New Encyclopaedia. The texts are aimed at a general

audience (Nadeau 1994) and are approximately one or two pages in length. They cover

1 As reported by several experimental subjects.

2 "Copyright," Microsoft® Encarta® 97 Encyclopedia. © 1993-1996 Microsoft Corporation. All rights reserved.


well-defined subjects, and the majority of the encyclopædia contributors and consultants

are university professors.

Encarta has several characteristics that make it an appropriate source of 'source texts'.

Firstly, its articles have wide coverage, and offer an excellent choice of experimental

topics. That is, it is not necessary for the experimenter to choose topics, since a random

procedure has a good probability of identifying them. It is not a certainty however, since

whilst Encarta, like all encyclopaedias, may aspire to cover all possible subjects, in reality

it does not. Most non-domain specific topics can be found by simply using the various

searching tools. Secondly, Encarta has been subject to a single standard of editorial

control so no source text will be too long, or difficult for the experimental subjects.

Thirdly, Encarta is widely available so the experiment may be replicated.

There are four mechanisms for finding an article in Encarta. These are known collectively

as the 'PinPointer'. The four techniques are as follows. Firstly, the 31,108 topics in the

encyclopaedia may be displayed in a pop-up window. Articles may be selected by

scrolling through this window and clicking on an article title, or an article title may be

typed into a text box: this then scrolls the topics list to a matching entry. Secondly,

the list of articles may be restricted to those that belong to one of nine major categories such

as Sport, History (European History), Life Science, and so on with up to fifteen sub-

categories. This may then be scrolled through as in the first method. The third search

mechanism uses an interface that supports a text-input mode as opposed to the previous

menu choice methods. The text mode locates articles by keyword, or sub-string contained

in that article. The fourth and final mechanism supports an advanced feature that allows

Boolean queries using AND, OR, (), NOT, NEAR. These Boolean queries may also be

combined with sub-string search.

Now that the question of source texts has been addressed, we return to the issue of example

texts. As there are multiple pages on almost every subject, the Internet is an ideal source

of example texts. It is also highly heterogeneous. Thus Internet derived example data will

(probably) use the same terms in different senses. Consequently, a procedure that

recognises conceptual similarity should be able to distinguish clearly between texts that

are on the same subject, as opposed to those that are on different subjects, but use

identical terms.


Single subject area test collections are commonly used in Information Retrieval for

evaluation purposes (see Baeza-Yates and Ribeiro-Neto 1999 Chapter 3). These are

readily available, but have two disadvantages. Firstly, terms tend to be used in one sense

(Krovetz and Croft 1992). An example would be use of the term �cancer� in the medical

literature, as opposed to texts on astrology. Thus, data derived from a single subject

collection would be an inferior source for text similarity experimentation as it would be

harder for the program to distinguish than Internet data that covers several domains. That

is, the single subject collection would tend to disambiguate the words it contains, making

its automatic differentiation more onerous.

The second disadvantage would be the difficulty that non-specialists would have in

discriminating between domain specific texts.

If the Internet is the source of example texts, the subjects (i.e. topics) used in the

experiment should correspond to Internet queries. Since we have decided that Microsoft's

Encarta is a good source of example texts to match against, these queries should also

correspond to index entries in Encarta.

Thus, random Internet queries that may be found in Encarta's index form our source

texts, whilst the Internet pages make up the example texts.

This chapter is organised as follows: firstly, we consider the questions, or hypotheses,

that the experiments in this chapter are designed to answer. Next, we describe how the

materials for the experiments were selected to be both unbiased, and ecologically valid.

The experiments with human subjects are described in Section 6.3. These provide a

similarity ordered baseline set of texts. This similarity ordering is then compared topic by

topic to that given by Hesperus, whilst a simple information retrieval program acts as a

control.

Section 6.3 also evaluates the effects of explicit word sense disambiguation using the

same test data. This involved activating the local disambiguator (Section 3.3) so that

explicit disambiguation is done, and then assessing changes to performance.

Finally the overall results are discussed and conclusions are drawn.


6.2 Hypotheses

There are a number of hypotheses to be tested. These are given below in order of their

relative importance with respect to this study, whose primary objective is to decide whether

lexical chains derived using Roget's thesaurus may be used to determine the similarity of

texts.

The hypotheses are:-

1. Lexical chain based similarity matching is able to produce a ranking

between a source text and several examples equivalent to that produced by

human subjects.

2. Lexical chain based similarity matching is not identical to a purely term

based approach.

3. The performance of Hesperus on text similarity matching will improve if

there is an explicit word sense disambiguation phase.

4. The performance of Hesperus will alter if the granularity level of GDP

matching is refined.

5. Articles written in an encyclopaedia style will be preferred by the subjects

over Internet web pages.

The first of these hypotheses is most important. It raises the question as to whether a

text's lexical chains, as determined by Hesperus, express its meaning sufficiently well to be a

useful tool for similarity matching. If this were the case, then it would be possible to place

texts in order, or rank, of similarity and use this in applications such as information

retrieval where text similarity matching is important. Note that we do not expect the

procedure to give the same numeric values as those determined by the human subjects, as

these values would be an artefact of the experimental procedure.

The second hypothesis follows from the first. We have indicated (Chapter 3) that a

majority of the strength of lexical chains derives from term repetition. It is possible that

Hypothesis One will be true, but the similarity matching effect is due solely to identical

terms in the source and example texts. Consequently, we need to enquire whether term

repetition alone will give equivalent or better results than Hesperus.

The third hypothesis tests whether word sense disambiguation (WSD) at current accuracy

levels (Chapter 5) can improve the performance of Hesperus on text similarity matching.


Sanderson's (1996) conclusion was that WSD needed to have accuracy greater than 90%

to improve performance of an information retrieval system. By contrast, SUSS achieved a

mean WSD precision of 67% (Chapter 5). Thus, it would seem that Hypothesis Three is

unlikely to hold if Sanderson's (1996) information retrieval task is equivalent to the text

similarity assessment proposed here. However, Sanderson (1996) used a corpus that

contained brief texts about finance only (the Reuters 22713 collection, Lewis 1991),

whilst we will use the random materials developed in looking at the earlier questions.

Thus, we have greater scope for word sense disambiguation at the homographic level, as

single subject corpora are known to favour particular word senses. For example, 'bond'3

in financial documents refers to a particular type of certificate, whilst in a random corpus

it could equally refer to family relations, or adhesive properties.

The fourth hypothesis tests the effect of altering the match granularity. Reduced

granularity could improve the performance of Hesperus through better precision since

there are more, smaller categories. Alternatively, it could reduce the GDP similarity score

since the smaller categories are less likely to match than larger ones. Consequently, this

could impair performance.

The fifth hypothesis is relatively minor. It concerns the fact that topic definitions found in

encyclopaedias are not routinely found on the Internet. Thus, the subjects may select

encyclopaedia texts as more similar to the Encarta text because of stylistic considerations.

For example, encyclopaedia texts may be preferred as they have been subject to

professional editing, whilst Internet texts will have been produced to varying quality

standards. Readers are sensitive to stylistic variations. For example, Karlgren and Cutting

(1994) have shown that text genre may be differentiated on stylistic grounds, whilst

Karlgren (1999) has suggested that stylistics may be used to augment information

retrieval performance. Thus, stylistics may be used by human readers, although it is not

used in lexical chain formation (Chapter 4), which is an essential step in determining a

text's GDP (Chapter 3). This hypothesis would be proven if the Infopedia texts were

consistently identified as more similar to the source text by the subjects than by Hesperus.

We now go on to look at the derivation of an appropriate random set of test materials, and

experimental comparison of performance with Hesperus.

3 This example is due to Ken Litkowski, writing on the 'Corpora' mailing list. Of course, 'bond' is ambiguous at the

polysemic level, since it may refer to a variety of financial instruments, e.g. T-Bills, war bond, longbund etc.


6.3 Text Similarity Experiments

The objective of the text similarity experiments is to evaluate the comparative

performance of the Hesperus derived GDP on a realistic task.

This section describes three experiments. Firstly, the experiment with human subjects is

intended to produce a set of texts whose similarity has been ordered by people. This will

serve as a baseline assessment measure for Hesperus. The second experiment applies

Hesperus to these texts to produce a second set of similarity assessments. A third

experiment applies an information retrieval program to the retrieved data to test the

hypothesis that the GDP performance is principally influenced by term repetition.

Since our thesis is that the GDP is applicable to any text type and any subject, the

selection of topics, and of texts on those topics, is critical4.

In the following section, we first discuss random query selection and then the acquisition

of example documents on these topics. Next, the methodology for each of the three

techniques of text similarity calculation is described. That is, firstly we discuss an

experiment in which people assess the similarity of the examples to a control text. Then

we discuss the generation of GDPs for this example set, and finally discuss an alternative

approach that uses a simple information retrieval application. The results are then

presented by topic.

6.3.1 Material Selection

Query Selection

Several Internet search engines publish (anonymously) current queries they are executing.

These would constitute random topics to be used in our experiments. For example,

Metaspy5 publishes ten current queries from MetaCrawler, a 'meta' search engine

(Selberg and Etzioni 1997) that searches several other search engines in parallel. Metaspy

automatically refreshes the display every fifteen seconds with ten further queries.

4 Judicious selection of texts would give better results, but obtaining texts and topics at random provides far greater

confidence in the results obtained.

5 http://www.metaspy.com/spy/filtered_b.html


Selberg and Etzioni (1995) note that there is no particular pattern to these queries.

Nonetheless, approximately 1% corresponds to general topics that intuitively could be

found in the index of an encyclopaedia such as Encarta. The remaining queries are mostly

highly specific (e.g. 'SNE 99x ROM'), proper names ('toys r us stores'), non-English,

misspellings, or phrased in the syntax of web-based information retrieval ('+nursery

+plants +florida'). Thus, one potential topic appears about every 2-3 minutes (some

queries are repeated, since there are not always ten new queries in fifteen seconds).

Twenty queries were recorded in a session conducted in May 1998. Of these, eight were

found in Encarta. These were:-

1. Socialism

2. Ballot

3. Copyright

4. AI

5. Rosetta

6. Breakdance

7. Welfare

8. Fishing

Corpus Gathering

Having selected the queries, we needed to retrieve the Internet pages that users would find

in response. This was done using a purpose-built application, 'Hesperus-Web'.

Hesperus-Web accepts single queries as identified above, and posts them to MetaCrawler

(Selberg and Etzioni 1995, 1997). This in turn queries AltaVista, Infoseek, WebCrawler,

Excite, and Yahoo! MetaCrawler then collates the links found into a common list of the

thirty or so most highly ranked. Hesperus-Web then retrieves these pages and stores them

on the local disk, renaming them where necessary.

Corpus Selection

There is still a problem once the pages have been retrieved, as MetaCrawler is

programmed to try to return thirty pages in response to any query. Pilot studies have

shown that five example texts are better for the experiment due to limitations on attention

span. Consequently, these must be selected from the thirty returned.


To avoid implicitly accepting the rank ordering imposed by MetaCrawler, an algorithmic

procedure was defined that could eliminate possible texts from the result set. This reduced

the possibility of experimenter bias in selecting example texts.

Algorithm 6-1: Reducing the number of example texts to five.

    Initialise rejection criteria R to first in Table 6-1
    While there are still rejection criteria
        If number of texts N is equal to five, EXIT
        Eliminate texts that satisfy R
        Increment R to next in Table 6-1
    End While
    If there are more than five texts
        select five examples at random
    EXIT
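Algorithm 6-1 can be rendered as a short Python sketch. The rejection criteria of Table 6-1 (which were applied manually in the study) are represented here as an ordered list of predicate functions; that representation, and the function name, are assumptions of this illustration.

```python
import random

def reduce_to_five(texts, rejection_criteria, rng=random):
    """Apply rejection criteria in order until exactly five texts remain,
    then, if more than five still survive all criteria, pick five at
    random.  `rejection_criteria` is an ordered list of predicates; a
    text is eliminated when a predicate returns True for it."""
    remaining = list(texts)
    for reject in rejection_criteria:
        if len(remaining) == 5:
            return remaining
        remaining = [t for t in remaining if not reject(t)]
    if len(remaining) > 5:
        remaining = rng.sample(remaining, 5)
    return remaining
```

The random final sampling mirrors the algorithm's last step, avoiding any implicit acceptance of the search engine's own ranking.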

Algorithm 6.1 was manually applied. The rejection criteria used are shown in table 6-1

below. The objective of these was to provide example texts suitable both for experiments

with human subjects, and suitable for Hesperus.

For the subjects, the texts needed to be short enough that they could be read in one or two

minutes. The experiment was planned to last one hour to ensure the subjects� attention

was maintained and for pragmatic reasons. Two minutes per text ensures that a sufficient

number may be read to produce data for several queries. Thus, they needed to be neither

too short nor too long. They also needed to be amenable for subsequent analysis with

Hesperus.

Table 6-1 below shows the criteria used to eliminate items from consideration:-


Table 6-1: Rejection Criteria for example texts

Rejection Criteria

1 Empty Files

2 Not Found/Illegal Access/Redirected Links

3 Pages longer than three screenfuls

4 Pages shorter than one screenful

5 Pages containing Images only

6 E-mail 'Threads'

7 Tables of Contents (otherwise containing little text)

8 Adverts

9 Book Reviews

10 Adult Material

In addition, an example was included from an electronic encyclopaedia, 'Infopedia 97'6.

This allows us to perform a limited test of the minor hypothesis that topic definitions

found in encyclopaedia type summaries are not routinely found on the Internet.

6.3.2 Experiment with Human Subjects

Method

The experiment was constructed as a dedicated Internet WWW site viewed with a Web

browser such as Microsoft Internet Explorer or Netscape. This ensured a high level of

ecological validity, since the texts were being seen in the medium and by the means for

which they were designed. The web pages used in the experiment were purposely

modified so that links other than those in the experiment could not be followed.

Approval for the experiment was sought from, and granted by, the University of

Sunderland ethics committee.

The subjects were twenty-five undergraduate and MSc conversion students from the

School of Computing and Information Systems at the University of Sunderland. They

participated as class groups with the support of the course tutors. Participation was

voluntary, and the subjects were not paid.

6 The Hutchinson New Century Encyclopaedia. Copyright 1996 Softkey Multimedia Inc.


The students were given a brief explanation that the experiment was about text similarity,

and were informed of the address of the experimental web site. They were shown the

questions and had the five point Likert scale explained to them. This explanation was

also given on help pages on the web site, and is reproduced in Appendix IV.

No detailed explanation was attempted of the concept of similarity other than 'means the

same thing'. It was felt that a technical explanation would have detracted from the

subjects' naïve views, and changed the nature of the task to that of evaluating similarity

based on the external definition. Support for this view comes from Hampton (1998)

writing in the psychological literature. Hampton (1998) reports a categorisation

experiment by Hampton and Dubois. Hampton and Dubois found little or no evidence

that the fuzziness of categorisation was reduced by providing a clear discourse context.

Subjects were provided with either an elaborate scenario before participating in a

categorisation experiment or no background scenario at all. Levels of disagreement, and

inconsistency in the experimental results were unaffected by whether subjects had

received the detailed explanation.

The experimental design was fully randomised. Its major component was a JavaScript

program that guided interaction through the web site. This consisted of six component

experiments that covered the selected topics. These experiments were presented in a

random order. The components of each experiment were also presented in random order.

This ensured that no part of the experiment was unduly influenced by fatigue, since some

subjects would have seen first those parts that others saw last.
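The fully randomised presentation order described above was produced by a JavaScript program embedded in the web site; the following Python version is an illustrative sketch only, and its function name and data layout are assumptions.

```python
import random

def presentation_order(topics, examples_per_topic, rng=random):
    """Build a fully randomised presentation schedule: the component
    experiments (topics) are shuffled, and the example texts within
    each topic are shuffled independently, so no part of the experiment
    is systematically seen first or last."""
    order = list(topics)
    rng.shuffle(order)
    schedule = []
    for topic in order:
        examples = list(examples_per_topic[topic])
        rng.shuffle(examples)
        schedule.append((topic, examples))
    return schedule
```

Shuffling at both levels counterbalances fatigue effects across subjects, as the text notes.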

Subject reports in initial experiments (Ellman 1998) indicated considerable difficulty

when faced with one question (whether one example text is more similar than another to a

source text). Consequently an experiment was designed in which subjects answered

several questions, only one of which was important to this study: how similar one text

was to another7.

After reading the initial instructions, subjects were shown the Encarta entry as a Source

text. This was shown in a Frame based layout with the text in the left of the screen (the

7 I would like to thank Dr. Sharon Macdonald for the suggestion to ask multiple questions.


right being initially blank). It was explained that each topic would be labelled as a

'Query', and numbered '0/6', with the comparison texts numbered from one.

This explanation was acceptable to the subjects, and did not require further elaboration.

Four general statements about the text were shown in the lower part of the frame (see

Figure 6-1 below). The purpose of these questions was to ensure that subjects read the source text

in sufficient detail to make subsequent decisions about similarity of meaning.

I1. The Source Text is a good explanation of the subject

I2. I am very familiar with the topic of the Query

I3. I would like to find the Source Text if I looked for this Query

I4. The Source Text is a good definition of the subject

Statements for the Initial Topic Page

Figure 6-1: Source text Topic Screen


Subjects were required to indicate whether they agreed or disagreed with the statements

using a five point Likert scale. When the responses had been completed and the

'SUBMIT'8 button clicked, the example texts were shown in random order in the right

pane (see Figure 6-2 below). Four further statements were shown in the lower pane.

These statements also required agreement to be indicated on a five point Likert scale.

Once all the replies had been completed, and the 'SUBMIT' button clicked, a further

example was shown. When all the examples for one topic had been seen, the experiment

proceeded to the next topic, until all the topics had been covered.

S1. The Example means the same as the Source Text

S2. The Example is relevant to the Query

S3. The Example is more specific than the Query

S4. The Example is a good definition of the subject

Statements for the Example Comparison Pages

Figure 6-2: Source and Example Text Comparison

8 The 'SUBMIT' button was positioned at the far right of the screen. This ensured that subjects saw all points of the

scale before confirming their choice.


The replies made by the subjects were collected from the Web site log, and then analysed.

The question to which we would like an answer is whether the source and example texts are

similar in meaning. This needs to be phrased as a statement to which subjects can express

agreement or disagreement using the Likert scale. The statement form 'the example text

means the same as the source text' was consequently taken as the principal human

similarity measure. This is known as statement S1. Their answers to statements S2-S4

were recorded, but the analysis is beyond the scope of this thesis.

The frequency distribution of the replies to statement S1 is shown in graphs 6-1 to

6-6 below. This shows the variance in the answers, and allows us to see how well the

subjects agreed with each other, and to determine whether their responses were random or

not.

It can be seen from graphs 6-1 to 6-6 that the subjects' responses to S1 approximately

conformed to a normal distribution. This means that we can use the mean of S1 as an

accurate summary measure of the subjects' judgement.

6.3.3 Hesperus Comparison Experiment

The purpose of comparing Hesperus with user judgements was to assess its usefulness as

a text similarity tool. This was discussed as Hypothesis 1 (Section 6.2). The similarity

ordered texts also provided a baseline measure against which modifications to Hesperus

can be evaluated. Two modifications to Hesperus were considered: explicit

disambiguation, and a fine-grained similarity match. These were Hypotheses 3 and 4

(Section 6.2).

Method

Hesperus was run under four settings, determining the GDP for both the source texts, and

the example texts (Chapter 3). This produces a similarity measure for each text derived

from its GDP (Section 3.6) that ranged from zero to one. The settings were:-

1. Standard grain.

2. Standard grain, with explicit word sense disambiguation.

3. Fine grained.

4. Fine grained, with explicit word sense disambiguation.

These settings have the following meanings: 'standard grain' uses the ~1000 Roget

categories as described in Chapter 3. Explicit word sense disambiguation refers to the


augmentation of HESPERUS with the local disambiguator described in Chapter 5.

Finally, 'fine grain' refers to the use of the smaller 6400 subcategories available

with the 1987 Roget (Section 3.3).

The rank order produced by Hesperus was compared to that derived manually. It is shown

below in the corresponding tables for each topic.

6.3.4 Information Retrieval Experiment

The purpose of the information retrieval experiment was to test Hypothesis 2: that lexical

chain based similarity matching is different to that produced by term based approaches.

SWISH (Simple Web Indexing System for Humans9) is an example of a classic

information retrieval program. It is based on the tf*idf10 heuristic that indicates the

relative importance of the text in a document collection (Salton and McGill 1983).

SWISH was selected for the information retrieval experiment as it is both simple to use,

and is adapted for www pages. Consequently, it can both analyse unprocessed html, and

differentially weight keyterms that occur in document titles. Thus, if titles are good

indicators of content, SWISH would be at an advantage with respect to Hesperus, which

processes text only.
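The tf*idf heuristic (see footnote 10) can be illustrated with a minimal sketch. This is not SWISH's actual scoring function, which also boosts terms found in HTML titles and headings; it shows only the basic idea of term frequency times inverse document frequency.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, collection):
    """tf*idf weight of `term` in one document: its frequency in the
    document, times the log inverse of the fraction of documents in
    the collection that contain it.  A term appearing in every
    document gets weight zero; a term absent from the document also
    scores zero (its tf is zero)."""
    tf = Counter(doc_tokens)[term]
    df = sum(1 for d in collection if term in d)
    if df == 0:
        return 0.0
    idf = math.log(len(collection) / df)
    return tf * idf
```

So a term like 'cancer' that occurs across both medical and astrology pages is weighted lower than a term confined to one document.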

Note that this experiment is not an exact comparison with the human one, or with that

using Hesperus, as both of those experiments compare the similarities of two texts. As

stated, this experiment compares the match between a query and several texts. Nonetheless,

it gives an indicative comparison of a term-based approach with little effort.

Method

The page set retrieved for each query was indexed separately using SWISH. The raw

query (i.e. 'Copyright', 'AI', 'Rosetta Stone', 'Socialism', 'Ballot', and 'Breakdance')

was then posed against this index. The relevance score returned for each example in the

test set was recorded. These are shown in tables 6-2, 6-5, 6-7, 6-9, 6-11, and 6-13 below.

9 SWISH is a freely available basic IR program that understands HTML format, and increases the rank of terms found

in HTML heading tags. URL: http://sunsite.berkeley.edu/SWISH-E/

10 The tf*idf heuristic states that the frequency of a term in a text, times the inverse of its occurrence in the collection,

indicates its importance.


As in the Hesperus experiment, the rankings obtained from SWISH were compared to

those derived from the human assessors. The results are shown under the topic headings

below.

6.3.5 Results

6.3.5.1 Introduction

This section presents the results of the experiments by topic. This requires the

comparison of three different types of data. Hesperus gives a similarity score from zero to

one, where one represents a perfect match. SWISH gives a relevance score from zero to one

thousand, where one thousand means highly relevant, whilst the human experiment gives

Likert data that range from 1 to 5.

For convenience, a percentage score was calculated from each of the three data sets. The

simple scaling techniques used are described below. Caution is required in comparing the

three percentage scores however as they are derived by different means, and are most

likely from different frequency distributions. For example, a human experimental score of

50% means 'neither agree nor disagree', whilst for Hesperus this represents quite strong

similarity.

For the human experiments the subject judgements are converted into percentages by

firstly coding the responses to statements S1 to S4 from 1 (disagree) to 5 (agree). The

arithmetic means of these ratings were then taken, giving a group measure of agreement

to each statement. These data are presented graphically below under each query heading

as graphs 6-1 to 6-6.

Hesperus and SWISH scores were converted to percentages using simple scaling. For

Hesperus, raw similarity measures range from zero to one, whereas SWISH scores range

from zero to one thousand. Conversion to percentages was by simple multiplication by

one hundred for Hesperus, or division by ten for SWISH.
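These three conversions can be sketched as follows. The Hesperus and SWISH scalings follow the text directly; the Likert rescaling shown is one plausible mapping, an assumption chosen so that a mean of 3 ('neither agree nor disagree') comes out at 50%, consistent with the interpretation given in the text.

```python
def hesperus_to_percentage(score):
    """Hesperus similarity in [0, 1] -> percentage."""
    return score * 100

def swish_to_percentage(score):
    """SWISH relevance in [0, 1000] -> percentage."""
    return score / 10

def likert_to_percentage(responses):
    """Group agreement for one statement: mean of Likert codes
    (1 = disagree ... 5 = agree) rescaled so that 1 -> 0% and
    5 -> 100% (and 3, 'neither', -> 50%)."""
    mean = sum(responses) / len(responses)
    return (mean - 1) / 4 * 100
```

As the text cautions, these percentages come from different distributions, so only the rankings they induce are compared.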

As the numeric data from the three experiments were not derived from the same

frequency distribution, they can not be compared directly. However, the rank (or order)

that they give to the example texts can be compared. That is, we can compare the most to

least similar example ordering derived from the experiment, with that given by Hesperus.


Spearman's rank correlation statistic is useful here (henceforth Spearman). It is a non-

parametric statistic that makes no assumptions about the underlying frequency distribution of

the data it is used on. It is used for determining the correlation and hence statistical

significance of ordinal data (e.g. see Kinnear and Gray 1997). That is, data ordered into

ranks or assigned to ordered categories. It is applied below to each of the queries to

determine how well the experimental techniques concur.
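As a sketch of the statistic, Spearman’s rho can be computed from rank differences as 1 − 6·Σd² / (n(n² − 1)). The code below assumes no tied ranks, the simplest case; statistical packages such as SPSS handle ties with averaged ranks.

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation for two score lists without ties.

    Each score is replaced by its rank (1 = highest); rho is then
    1 - 6 * sum(d_i^2) / (n * (n^2 - 1)) for the rank differences d_i.
    """
    def ranks(values):
        # Indices sorted by descending score give rank order.
        order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
        r = [0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Two rankings in the same order give rho = 1.0, exactly reversed rankings give −1.0, and intermediate agreement falls between the two.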

Kilgarriff and Rose (1998) have noted a disadvantage of Spearman for corpus studies.

That is, Spearman does not take account of the magnitude of differences between

classifications leading to different rankings, only that they differ in order. This is useful

for comparing scales that cannot be compared directly, but does not highlight when large

differences on one scale are matched by small alterations in the other ranking scale. Such

differences will be indicated in the description of the data, the format of which will now

be described.

6.3.5.2 Results Presentation Format

The evaluation of Hesperus included six experimental topics and three experiments, with

Hesperus operated under four different conditions. The success of the trials depended on

the topic considered. Consequently, the data generated are presented in a common format,

ordered by topic, which will now be described.

Firstly, there is a brief description of the topic that indicates the position of the source text

in Encarta. Any oddities of the data found for the example texts on the Internet are also

described.

Next, a histogram of the experimental subjects’ responses to question S1 (‘The Example means the same as the Source Text’) is presented. This shows the percentage of subjects

who agreed and disagreed with this question as applied to each of the six example texts.

The histogram gives a visual representation of how well the subjects agreed with each

other. As with any experiment with a group of human subjects, we would expect these

results to be approximately normally distributed around some mean. This mean is used as the assessment of similarity with which to evaluate Hesperus.

The mean assessment of similarity is then presented in tabular form. This table includes

the percentage scores from the Hesperus comparison and Information Retrieval

experiments. This is followed by a table that presents the results of the Spearman rank


correlation. Brief comments are included that highlight features of the analysis, whilst

overall conclusions and discussion are given in Section 6.4.

The data from the six topics covered in the experiments are presented below in the order

‘Copyright’, ‘AI’, ‘Rosetta Stone’, ‘Socialism’, ‘Ballot’, and ‘Breakdance’. This

sequence is derived from the statistical significance of the results, with the most

significant presented first.

6.3.5.3 Experimental Results Data

Copyright

‘Copyright’ is one of the major articles in Encarta. It was found in the Encarta ‘PinPointer’, which lists the principal encyclopaedia article titles. ‘Copyright’ describes

the general body of legal rights related to the protection of creative works.

Graph 6-1 below shows the association between the experimental question S1 (‘The Example means the same as the Source Text’) and the texts shown to the subjects. As

mentioned in Section 6.3.5.2 above, it is a stacked histogram which shows the proportion

of subjects who agreed with question S1, and how strongly.

The data show an approximately normal distribution, which supports the validity of the

experiment. The article ‘copytoc’ is an exception, which shows a bimodal distribution.

The explanation for this may be in the content of the article, which is the table of contents

of a web site covering copyright. ‘Copytoc’ could be considered to be a set of bullet

points about copyright, in which case it would be similar to the Encarta entry, or as a site

index list, which would not be similar. As can be seen in graph 6-1, the twenty-three

subjects are divided on this point.

Graph 6-1 below shows that there were differences between the ratings of documents.

‘Info’, ‘copyright’, and ‘copytoc’ received higher ratings, whilst ‘lawnet’ had the lowest

rating. The arithmetic mean of the subject scores is shown in table 6-2 below.


[Stacked histogram omitted: ‘Copyright: Document Vs Rating Vs Responses (%)’. X-axis: example text titles (copyright, copytoc, Info, lawnet, overview, topical); Y-axis: % responses (0-70), rated from Agree to Disagree.]

Graph 6-1: Copyright S1 ratings

The differences identified by the subjects are mirrored by the ratings from Hesperus, and to

some extent SWISH. This is shown in table 6-2 which gives the normalised percentage

similarity scores. SPSS (Kinnear and Gray 1997) calculates their numeric order, or rank,

to give the Spearman rank correlation in table 6-4. This has been calculated manually in

this first example only to clarify what data are compared in the Spearman rank correlation

(table 6-3).


Table 6-2: Copyright: Experimental Comparative Results. Percentage similarity as measured using the specified techniques. (FG = Hesperus fine grained; SG = Hesperus standard grain; +WSD = with explicit word sense disambiguation.)

Example Title | S1     | SWISH | FG+WSD | FG  | SG+WSD | SG
copyright     | 68.30% |  2%   |  9%    | 11% | 11%    | 13%
copytoc       | 66.80% | 13%   |  8%    |  8% |  5%    |  7%
Info          | 62.50% | 10%   |  9%    |  8% | 10%    |  9%
Lawnet        | 48.30% |  8%   |  3%    |  2% |  5%    |  7%
Overview      | 30.00% |  0%   |  3%    |  6% |  4%    |  7%
Topical       | 18.30% |  0%   |  0%    |  0% |  0%    |  0%

Table 6-3: Copyright: Numeric rank of similarity scores. ‘=’ denotes a tied rank. (FG = Hesperus fine grained; SG = Hesperus standard grain; +WSD = with explicit word sense disambiguation.)

Example Title | S1  | SWISH | FG+WSD | FG   | SG+WSD | SG
copyright     | 1st | 4th   | 1st=   | 1st  | 1st    | 1st
copytoc       | 2nd | 1st   | 3rd    | 2nd= | 3rd=   | 3rd=
Info          | 3rd | 2nd   | 1st=   | 2nd= | 2nd    | 2nd
Lawnet        | 4th | 3rd   | 4th=   | 5th  | 3rd=   | 3rd=
Overview      | 5th | 5th=  | 4th=   | 4th  | 5th    | 3rd=
Topical       | 6th | 5th=  | 6th    | 6th  | 6th    | 6th


Table 6-4: Copyright: Hesperus and SWISH Spearman Rank Correlation. (FG = Hesperus fine grained; SG = Hesperus standard grain; +WSD = with explicit word sense disambiguation.)

                | S1    | SWISH | FG+WSD | FG     | SG+WSD | SG
Spearman’s rho  | 1.000 | .638  | .883** | .928** | .899** | .820*
Sig. (1-tailed) | .     | .087  | .010   | .004   | .007   | .023
N               | 6     | 6     | 6      | 6      | 6      | 6

** Correlation is significant at the .01 level (1-tailed).
* Correlation is significant at the .05 level (1-tailed).

Table 6-4 shows that Hesperus achieved a significant Spearman rank correlation on all

measures, with three out of four highly significant (p <.01).

AI

‘AI’ is a main subject title in Encarta. It contains the single statement ‘See also Artificial Intelligence’, which describes the discipline that aims to create artefacts that mimic human

thought or intellectual performance. That article was used as the source text.

[Stacked histogram omitted: ‘AI: Document Vs Rating Vs Responses (%)’. X-axis: example text titles (aidef, Info, Welcome9, Welcome2, imkai, web); Y-axis: % responses (0-70), rated from Agree to Disagree.]

Graph 6-2: AI: S1 ratings


Table 6-5: AI: Experimental Comparative Results. Percentage similarity as measured using the specified techniques. (FG = Hesperus fine grained; SG = Hesperus standard grain; +WSD = with explicit word sense disambiguation.)

Example Title | S1  | SWISH | FG+WSD | FG  | SG+WSD | SG
aidef         | 77% |   1%  |  8%    | 10% |  9%    | 11%
Info          | 75% |  17%  |  9%    | 10% | 10%    | 11%
Welcome9      | 50% |   0%  | 14%    | 11% | 21%    | 20%
Welcome2      | 41% | 100%  |  5%    |  8% |  9%    | 12%
Imkai         | 30% |   8%  |  4%    |  3% |  8%    |  6%
Web-homepage  | 29% |   0%  |  2%    |  2% |  3%    |  6%

Table 6-6 below shows that Hesperus fine-grained found a significant correlation (p < 0.05)

in rank ordering both with and without explicit WSD. Explicit WSD had a negative effect

on both levels of granularity.

Table 6-6: AI: Hesperus and SWISH Spearman Rank Correlation. (FG = Hesperus fine grained; SG = Hesperus standard grain; +WSD = with explicit word sense disambiguation.)

                | S1    | SWISH | FG+WSD | FG    | SG+WSD | SG
Spearman’s rho  | 1.000 | .174  | .771*  | .812* | .696   | .500
Sig. (1-tailed) | .     | .371  | .036   | .025  | .062   | .156
N               | 6     | 6     | 6      | 6     | 6      | 6

* Correlation is significant at the .05 level (1-tailed).

Rosetta Stone

‘Rosetta Stone’ was found in the Encarta ‘PinPointer’. It describes the well-known Rosetta Stone held in the British Museum. The Rosetta Stone bears inscriptions in three languages, two of which were known (demotic and Greek), which led to the

decipherment of the third (hieroglyphic).


‘Rosetta stone’ is found in one entry in Roget’s thesaurus (Intellect: The exercise of the mind: Means of communicating ideas: Writing lettering). Consequently, it is not subject

to an ambiguous interpretation by Hesperus.

These example texts are identified by their web page titles, or suitable defaults. The text called ‘Info’ is the extract from the ‘Infopedia’ electronic encyclopaedia, included to test whether Internet-retrieved data are comparable to those from a published encyclopaedia.

[Stacked histogram omitted: ‘Rosetta: Document Vs Rating Vs Responses (%)’. X-axis: example text titles (index, Info, RosettaStone, stone, location, rosetta); Y-axis: % responses (0-70), rated from Agree to Disagree.]

Graph 6-3: Rosetta S1 ratings

Graph 6-3 above shows that the subjects found clear differences as to whether the

example texts meant the same as the source text. ‘Info’ and ‘index’ were most similar in meaning to the source text from Encarta, whilst ‘location’ and ‘rosetta’ were largely

judged dissimilar. This is shown numerically in table 6-7 below, alongside the scores

generated from Hesperus and SWISH.


Table 6-7: Rosetta: Experimental Comparative Results. Percentage similarity as measured using the specified techniques. (FG = Hesperus fine grained; SG = Hesperus standard grain; +WSD = with explicit word sense disambiguation.)

Example Title | S1     | SWISH   | FG+WSD | FG  | SG+WSD | SG
index         | 86.00% |  30.00% | 33%    | 24% | 47%    | 35%
Info          | 77.80% |  29.70% | 45%    | 41% | 47%    | 41%
RosettaStone  | 54.50% |   4.10% |  0%    |  0% | 10%    | 29%
stone         | 36.80% | 100.00% |  2%    |  2% | 21%    |  8%
location      | 17.30% |  76.30% |  1%    |  1% | 21%    | 22%
rosetta       | 13.30% |  28.30% | 14%    | 13% | 24%    | 22%

The Spearman rank correlation for the ordered percentages was calculated, and is given in

Table 6-8 below. This shows a significant (p < 0.05) correlation between Hesperus and

the subjects when standard grain matching is used and explicit disambiguation is not used.

Table 6-8: Rosetta: Hesperus and SWISH Spearman Rank Correlation. (FG = Hesperus fine grained; SG = Hesperus standard grain; +WSD = with explicit word sense disambiguation.)

                | S1   | SWISH | FG+WSD | FG   | SG+WSD | SG
Spearman’s rho  | 1.00 | -.029 | .429   | .429 | .441   | .754*
Sig. (1-tailed) | .    | .479  | .198   | .198 | .190   | .042
N               | 6    | 6     | 6      | 6    | 6      | 6

* Correlation is significant at the .05 level (1-tailed).

Note that all four Hesperus methods capture the strong difference identified by the subjects between ‘Index’ and ‘Info’, which were considered similar, as compared to ‘location’ and ‘rosetta’, which were not.


Socialism

‘Socialism’ is one of the major articles found in the Encarta PinPointer. It describes the

doctrine of state ownership and control of the fundamental means of production. The

article also refers to famous socialists, and their contribution to the movement.

There are four entries for ‘socialism’ in Roget, plus a further three entries for ‘socialist’, and two entries for ‘socialistic’. Consequently, there is a real possibility that Hesperus

may perform poorly due to word sense ambiguity on the fine grain match.

[Stacked histogram omitted: ‘Socialism: Document Vs Rating Vs Responses (%)’. X-axis: example text titles (Socialism, Info, kimsk_e, Welcome2, NOMARX4, Welcome7); Y-axis: % responses (0-70), rated from Agree to Disagree.]

Graph 6-4: Socialism: S1 ratings


Table 6-9: Socialism: Experimental Comparative Results. Percentage similarity as measured using the specified techniques. (FG = Hesperus fine grained; SG = Hesperus standard grain; +WSD = with explicit word sense disambiguation.)

Example Title | S1  | SWISH | FG+WSD | FG  | SG+WSD | SG
Socialism     | 78% |  9%   | 20%    | 23% | 31%    | 36%
Info          | 76% | 16%   |  9%    |  7% | 14%    | 11%
kimsk_e       | 56% |  5%   |  8%    |  9% |  9%    | 17%
Welcome2      | 38% | 15%   |  2%    |  3% |  4%    |  5%
NOMARX4       | 33% |  6%   |  4%    |  4% |  9%    |  9%
Welcome7      | 22% |  4%   | 11%    | 13% | 12%    | 15%

Table 6-10: Socialism: Hesperus and SWISH Spearman Rank Correlation. (FG = Hesperus fine grained; SG = Hesperus standard grain; +WSD = with explicit word sense disambiguation.)

                | S1   | SWISH | FG+WSD | FG   | SG+WSD | SG
Spearman’s rho  | 1.00 | .600  | .371   | .314 | .551   | .486
Sig. (1-tailed) | .    | .104  | .234   | .272 | .129   | .164
N               | 6    | 6     | 6      | 6    | 6      | 6

Table 6-10 shows no significant rank correlation at the 0.05 level between the Hesperus score on any measure and that given by the subjects. However, Hesperus rated the most similar text (‘Socialism’) as more similar than the least similar in all cases.

Ballot

‘Ballot’ is one of the major topics in Encarta, as it is found in the Encyclopaedia index.

The article describes both the sheet of paper used to cast votes in an electoral system, and

the general method and development of secret voting. This corresponds to a type/token distinction between a description of the voting process and casting a ballot in one particular election.


Several of the web pages retrieved on the ‘ballot’ topic were on-line electoral or voting forms. These pages were ‘Marking’, ‘03mba’ and ‘index’. Graph 6-5 shows that these

were not considered strongly similar in meaning (S1) to the source text.

The Infopedia article gave a brief general description of the voting process. The subjects

considered it similar to the source text.

[Stacked histogram omitted: ‘Ballot: Document Vs Rating Vs Responses (%)’. X-axis: example text titles (Info, STV5GIFs, Marking, vote_trace, 03mba, index); Y-axis: % responses (0-70), rated from Agree to Disagree.]

Graph 6-5: Ballot: S1 ratings


Table 6-11: Ballot: Experimental Comparative Results. Percentage similarity as measured using the specified techniques. (FG = Hesperus fine grained; SG = Hesperus standard grain; +WSD = with explicit word sense disambiguation.)

Example Title | S1     | SWISH | FG+WSD | FG  | SG+WSD | SG
Info          | 73.50% | 20%   |  8%    | 10% |  8%    | 10%
STV5GIFs      | 50.00% |  3%   |  6%    |  9% |  7%    | 11%
Marking       | 50.00% | 43%   |  9%    |  8% | 16%    | 15%
vote_trace    | 48.50% |  2%   | 14%    | 12% | 22%    | 15%
03mba         | 33.80% |  6%   |  6%    |  8% |  6%    |  8%
index         | 17.80% |  3%   |  3%    |  1% |  4%    |  2%

Table 6-12: Ballot: Hesperus and SWISH Spearman Rank Correlation. (FG = Hesperus fine grained; SG = Hesperus standard grain; +WSD = with explicit word sense disambiguation.)

                | S1    | SWISH | FG+WSD | FG   | SG+WSD | SG
Spearman’s rho  | 1.000 | .580  | .485   | .574 | .551   | .515
Sig. (1-tailed) | .     | .114  | .165   | .117 | .129   | .148
N               | 6     | 6     | 6      | 6    | 6      | 6

Table 6-12 shows no significant rank correlation between Hesperus and the experimental data at the 0.05 level. However, note that the two texts identified by subjects as most similar (‘Info’ and ‘STV5GIFs’) were preferred over the two lowest ranked items.

Breakdance

Breakdancing is a form of street dancing characterised by disjointed robotic movements

or acrobatic spins (Infopedia 96). ‘Breakdance’ does not appear as an article title in Encarta. The phrase ‘break AND dancing’ was found by the Encarta PinPointer, so the

topic did satisfy the criteria for inclusion given in algorithm 6-1 above.

Breakdance is included in the article ‘Rock Music and Its Dances’. Consequently, the source text for ‘Breakdance’ is not focused on break dancing.


‘Breakdance’ has two senses in Roget: firstly it occurs as ‘Space: Motion: leap (noun)’, and secondly as ‘Emotion, religion and morality: Personal emotion: Amusement dance (verb)’. Consequently, its interpretation is ambiguous.

[Stacked histogram omitted: ‘Breakdance: Document Vs Rating Vs Responses (%)’. X-axis: example text titles (info, highline, index, dance, Welcome2, frenz-e); Y-axis: % responses (0-70), rated from Agree to Disagree.]

Graph 6-6: Breakdance: S1 ratings

Graph 6-6 shows subjects found little similarity between the source and example texts.

This is reflected in table 6-13, which shows that neither SWISH nor Hesperus performed well

at this level of similarity.


Table 6-13: Breakdance: Experimental Comparative Results. Percentage similarity as measured using the specified techniques. (FG = Hesperus fine grained; SG = Hesperus standard grain; +WSD = with explicit word sense disambiguation.)

Example Title | S1  | SWISH | FG+WSD | FG  | SG+WSD | SG
Info          | 50% | 0%    | 0%     |  0% |  0%    |  0%
highline      | 48% | 0%    | 0%     |  0% |  0%    |  0%
index         | 28% | 0%    | 2%     |  5% |  7%    | 12%
dance         | 27% | 0%    | 4%     | 11% | 16%    | 26%
Welcome2      | 23% | 0%    | 7%     | 13% |  6%    | 11%
frenz-e       | 20% | 0%    | 0%     |  1% |  5%    |  9%

Table 6-14: Breakdance: Hesperus and SWISH Spearman Rank Correlation. (FG = Hesperus fine grained; SG = Hesperus standard grain; +WSD = with explicit word sense disambiguation.)

                | S1   | SWISH | FG+WSD | FG    | SG+WSD | SG
Spearman’s rho  | 1.00 | .     | -.395  | -.638 | -.464  | -.464
Sig. (1-tailed) | .    | .     | .219   | .087  | .177   | .177
N               | 6    | 6     | 6      | 6     | 6      | 6

Table 6-14 shows no significant correlation with the experimental data at the 0.05 level.

In particular, SWISH did not register any text as relevant, as the term ‘Breakdance’ was not used in any of the documents retrieved. This indicates that the example pages retrieved had been classified by their authors under this heading in an Internet catalogue such as YAHOO, or indexed using a synonym (such as ‘Hip-Hop’) before

being retrieved via MetaCrawler.

Hesperus also had difficulty with some of the texts, such as the Infopedia extract ‘Info’, as this was only one line long. Consequently, no lexical chain could be created from it.


6.3.5.4 Experimental Results Summary

Table 6-15 summarises the statistical probabilities shown in tables 6-4, 6-6, 6-8, 6-10, 6-12, and 6-14 above. These indicate whether the similarity rankings could have been rank correlated by chance alone. Since there are twenty-four Hesperus trials here, we may

expect one result where p < 0.05 by chance. However, there are eight values where p < 0.05. Furthermore, these eight values include two results where p < 0.01, and one in which p < 0.005. This last result alone would not be expected to occur by chance more than once in two hundred and fifty trials. Consequently, we may conclude that Hesperus, under some conditions, is capable of producing human-like text similarity assessments. We now go on

to discuss these findings.

Table 6-15: Hesperus: Significance of experimental results (1-tailed probabilities of the rank correlations). (FG = Hesperus fine grained; SG = Hesperus standard grain; +WSD = with explicit word sense disambiguation.)

Topic      | SWISH | FG+WSD | FG   | SG+WSD | SG
Copyright  | .087  | .010   | .004 | .007   | .023
AI         | .371  | .036   | .025 | .062   | .156
Rosetta    | .479  | .198   | .198 | .190   | .042
Socialism  | .104  | .234   | .272 | .129   | .164
Ballot     | .114  | .165   | .117 | .129   | .148
Breakdance | .     | .219   | .087 | .177   | .177
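The chance argument above, that roughly one of the twenty-four trials would reach p < 0.05 by chance alone, can be checked with a simple binomial calculation. This is a sketch only: it treats the trials as independent, which is an approximation.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more
    'significant' results in n independent trials at level p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Expected number of spuriously significant trials at the 0.05 level
expected = 24 * 0.05  # roughly one result by chance, as stated above

# Probability of seeing eight or more such results in 24 trials by chance
p_eight = prob_at_least(8, 24, 0.05)
```

Under these assumptions `p_eight` is of the order of 10⁻⁵, so eight results at p < 0.05 are very unlikely to have arisen by chance alone.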

6.4 Discussion and Conclusion

This chapter has assessed the effectiveness of the Generic Document Profile as a means of

comparing the similarity of different texts. This was done by creating a random, but

realistic, set of experimental texts, and having human subjects rank these in order of

similarity to texts on the same subject from the Encarta electronic Encyclopaedia. These

similarity judgements could then be used to examine the performance of Hesperus under

various operating conditions.

A number of experimental hypotheses were formulated in Section 6.2. We will now go on

to review these, and discuss their likelihood considering the experimental findings.


Hypothesis 1 proposed that ‘Lexical Chain based Similarity matching is able to produce a ranking between an example text and several examples equivalent to those produced by human subjects’.

The results have shown that GDP values ranked in order have produced highly significant

statistical correlation for one topic, Copyright, and significant correlation in two further

topics, AI and Rosetta. Consequently, the hypothesis is proven with the caveat that

although Lexical Chain based similarity matching is able to produce a ranking, it is not

guaranteed to do so, and did not in the remaining three topics.

This caveat is in part a consequence of the untuned nature of the algorithm (Section 3.6).

As discussed in Section 3.3 the weights of lexical links were determined empirically,

without reference to an external standard, since there is no document set readily available

that people have ranked in terms of similarity.

Hypothesis 2 proposed that ‘Lexical Chain based Similarity matching is not identical to a purely term based approach’.

Hypothesis 2 was tested using an IR program, SWISH, to rank the experimental texts for

presence of the key topic term. The rankings found using SWISH were marginally

correlated with those derived from the human subjects on three topics, and poorly

correlated on two others. This differs considerably from the subjects' rating, and that of

Hesperus.

The marginal correlation indicates that human similarity judgements are based on rather

more than term repetition.

Hypothesis 3 proposed that ‘The performance of Hesperus on text similarity matching will improve if there is an explicit word sense disambiguation phase’.

This hypothesis was tested by generating GDPs, at both levels of granularity, both with

and without explicit word sense disambiguation. In some cases, disambiguation improved

performance slightly, however, in others it decreased it. Of the three cases where

similarity ranking achieved statistical significance, all had better results in the case

without explicit disambiguation. This would seem to provide support for Sanderson’s


(1996) thesis that disambiguation accuracy needs to be high to uniformly improve

performance.

Hypothesis 4 proposed that ‘The performance of Hesperus will alter if the granularity level of GDP matching is refined’.

This hypothesis was tested by generating GDPs for all the experimental texts at two levels of granularity. The one thousand topic heads in Roget were considered the normal level of granularity, so that each GDP could contain up to 1000 categories. The fine-grained level of granularity used the further subdivisions possible with the 1987 Roget

that allowed each GDP to have up to 6400 categories.

The normal level of granularity led to higher similarity scores but reduced the accuracy of the matches in comparison to the human judgements. As a result, the best rank correlations were achieved with fine-grained granularity even though the match scores were lower. Consequently, it would seem to be the case that granularity altered the performance of Hesperus.

Hypothesis 5 proposed that ‘Articles written in an encyclopaedia style will be preferred by the subjects over Internet web pages’.

This hypothesis was tested by including an extract from a second encyclopaedia,

Infopedia, in the experimental texts. These extracts were consistently rated as similar to

the Encarta text. In all six topics, the Infopedia extract was rated in the top three of six. It was considered most similar in two topics, and second most similar in a further

three.

Hesperus usually matched the high rating of the Infopedia articles, ranking them in the top three in most cases. It would appear then that the content of the articles caused their

high rating, rather than stylistic considerations.

Of the five hypotheses described above, the first was most important. This supports the

central aspect of this thesis: that Roget’s thesaurus may be used to determine the similarity of texts. The remaining four hypotheses, especially hypotheses two and five,

provide supporting evidence that the results were not due to an artefact of the


experimental design. Hypotheses three and four provide evidence to determine the best

operating parameters for Hesperus. These appear to be fine-grained, without explicit word

sense disambiguation.

Two observations may also be made that are not directly connected to the hypotheses above.

These have general implications. Firstly, it was clear from Section 3.3 that a theoretical minimum of two words is required to make a lexical chain. In practice however, several

sentences may be needed before any links may be identified using Roget�s thesaurus.

Thus, in the Breakdance experiment, a one-sentence text was not comparable to the source text, as it did not form any chains. By contrast, Information Retrieval approaches are

capable of working with single word queries and texts. This makes SWISH-like programs

more widely applicable, if less accurate, than the GDP text similarity methods evaluated

here.

The second observation relates to the relationship between a query and the content found

in formal indices such as encyclopaedias, compared to informal knowledge sources such

as the Internet. As noted in the Breakdance experiment, the majority of the Internet-derived texts do not describe the activity of breakdancing, but related experiences whilst

breakdancing. Thus, the fundamental relationship between topic and text is different

between the two information sources.

A similar observation may also be made regarding ‘Fishing’, which was one of the rejected experimental topics. There, Encarta describes the process of fishing, whilst

Internet sites that mention fishing describe the quality of fishing.

These observations do not invalidate the experimental results described in this chapter.

They do however indicate some limitations on the wider application of encyclopaedias as

tools to improve Internet searching.

We now go on to present overall conclusions from the study and discuss further work.

Chapter 7. Conclusions and Further Work

Figure 7-1: Books on Chains in Hereford Cathedral Library

7.1 Introduction

This study has explored whether Roget’s Thesaurus may be used as a knowledge source to extract a representation of a text’s meaning from its content. This representation could

then be used to automatically identify whether, and by how much, two texts are similar.

The method was based on thesaurally derived lexical chains. These are sequences of not necessarily contiguous words that support a text as a coherent structure.

The structure of text is an emergent property of the nature of discourse. Text and writing

are evolutionary descendants of human verbal interaction. This is an intrinsically serial

process. Sounds follow each other to make up words and conversations. This may be

compared to visual perception, where an entire scene is seen in parallel.

An explicit structure is needed to manage the process of conversational interaction

(Schegloff and Sacks 1973). This can be approximated as a text grammar (van Dijk 1977) for building applications (e.g. Tait and Ellman 1999), although the structure is an

emergent property of the process of communication, rather than rule following.

Any process that flattens the structure of text is carrying out an information reducing

transformation. The structure of discourse itself contains useful information such as


announcing topics before discussing them, outlining arguments, re-expressing key points

and forming conclusions.

The reason for losing structure from text was to derive a document representation that

could be used to compare one text to another irrespective of the precise words they

contain. This document representation was based on the cohesive properties of text as

described by Halliday and Hasan (1976), and proposed for a computer implementation by

Morris and Hirst (1991).

Related implementations (StOnge 1995, Stairmand 1996, Okumura and Honda 1994,

Richardson and Smeaton 1995, Kominek and Kazman 1997, and Barzilay and Elhadad

1997) that derive the lexical chains in a text were discussed in Chapter 2. These studies

used only nouns to derive texts’ lexical chains. This was due to the lack of a simple

relationship between parts of speech in the source of knowledge used for lexical chaining.

This was principally WordNet (Fellbaum 1998).

This study was based on Roget’s thesaurus. One motivation for this was the clear

relationship between semantically related words that are different parts of speech. A

second reason was that its semantic structure contains far fewer categories than WordNet.

This simplifies the document representation used for comparing texts. The derivation of

this representation was described in Chapter 3.

Chapter 4 raised and addressed the issue that text genre would have a profound effect on

the GDP. Several book-length texts of classical literature were analysed. These were

shown to have different reading complexities, as found by readability statistics. The

lexical chains in these texts were determined, and the distribution of their lexical links

plotted. Furthermore, they were found to be equivalent across the different text

complexities, and to conform to a distribution frequency common in language phenomena

that was first observed by Zipf (1949). Chapter 4 concluded that lexical chains could be

used to determine the similarities of texts from different genres, as the analysis was genre

independent.

The issue of word sense disambiguation (WSD) was raised in Chapter 5. Several simple

WSD techniques suitable for use in Hesperus were presented as a system called SUSS.

These were evaluated in the context of the Senseval competition. Senseval provided a


large, manually sense-tagged corpus that provided a gold standard against which word sense disambiguation methods could be compared, both against human performance and against each other.

Chapter 6 described the experimental verification of Hesperus by comparing its

performance to that of human subjects. The experiment entailed human subjects ranking a

random set of texts in order of their similarity to texts of a known standard. Hesperus was

evaluated by comparing its ordering of the texts’ similarity to that given by the human subjects.

The experiments were statistically significant in two of the six topics used, and gave

indications of positive performance in a further two. This gives a good indication that

Hesperus was able to determine the similarity of texts using Roget�s thesaurus. Hesperus

did not produce significant results in the remaining two cases. We hypothesised in

Section 6.4 that this may have been because their topic classification was not derived

algorithmically from their contents, but by human editors.

7.2 Conclusions about the research hypotheses

Four research hypotheses were raised in Chapter 1. Here we look at each in turn and assess whether it has been proven.

1. Whether a text similarity measure may usefully be constructed from a text’s lexical chains as identified using Roget’s Thesaurus.

This hypothesis was demonstrated successfully in Chapter 6. Further improvements could

be made by tuning the link weights.

2. Whether the text similarity measure defined provides a better approximation

of human judgements than statistical methods used in Information Retrieval.

Within the scope of the limited experiments in this study, IR judgements were inferior,

where they did not have access to term/document distribution frequency over the whole

corpus. This was shown again in Chapter 6.


3. Whether the text representation considered is suitable for the analysis of texts

of different lengths and complexities.

Chapter 4 showed that texts of different complexities contain the same underlying

distribution of lexical links. Since the text representation is built upon this, there is reason

to believe that the text representation may be applied to further text types.

Chapter 3 also showed that the GDP conforms to Zipf's empirical laws.

4. Whether the measure may be improved by including word sense

disambiguation at current levels of accuracy.

This was demonstrated to be experimentally false in Chapter 6.

7.3 Contributions

This study has examined whether Roget's thesaurus may be used to determine the similarity of texts. The specific approach has involved the generation of text representations based on Roget categories that may then be compared.
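As a concrete illustration, a document representation of this kind can be sketched in a few lines of Python. The word-to-category lexicon, the category numbers, and the cosine comparison below are illustrative assumptions for this sketch, not the thesis's implementation.

```python
from collections import Counter
from math import sqrt

# Hypothetical lexicon mapping words to Roget category numbers (1..1000).
# A real implementation would index the full thesaurus.
ROGET = {
    "journey": [266], "travel": [266, 267], "road": [624],
    "voyage": [267], "path": [624], "trip": [266],
}

def generic_document_profile(words):
    """Build an attribute-value vector of Roget categories and strengths."""
    profile = Counter()
    for w in words:
        for category in ROGET.get(w, []):
            profile[category] += 1
    total = sum(profile.values())
    # Normalise strengths so profiles of different lengths are comparable.
    return {c: n / total for c, n in profile.items()} if total else {}

def similarity(p, q):
    """Cosine similarity between two category profiles."""
    dot = sum(p[c] * q.get(c, 0.0) for c in p)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

a = generic_document_profile("the journey followed the road".split())
b = generic_document_profile("a trip along the path".split())
print(round(similarity(a, b), 3))  # -> 1.0 (both map to the same two categories)
```

The two toy texts share no words, yet map to the same Roget categories, so their profiles compare as identical; this is the intuition behind comparing texts at the category level rather than the term level.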

The study has made a number of contributions:

1. We have shown that a lexical chaining program can be implemented based upon Roget's thesaurus, as suggested by Morris and Hirst (1991). Unlike other lexical chaining programs (StOnge 1995, Stairmand 1996, Okumura and Honda 1994), this lexical chaining system has used all parts of speech.

2. We have examined the lexical cohesive relationships proposed by Morris and Hirst (1991), based on the work of Halliday and Hasan (1976, 1989). It was shown that only three of Morris and Hirst's (1991) thesaural relationships are practically useful. Some relationships hardly occur in normal text, and others are found too frequently to be useful.

3. We have shown that Zipf�s law applies to the frequency distribution of lexical

coherence relations when their frequency is compared to the distance between

related words.

4. A novel document representation has been proposed based on categories in Roget's thesaurus. This method has been used in a new technique to assess the similarity of whole texts. Furthermore, it has been experimentally compared to human judgements and shown to be statistically significant in several, although not all, instances.

5. The effects of word sense disambiguation and sense granularity have been studied in relation to the text similarity task. Support has been presented for Sanderson's (1996) view in information retrieval that word sense disambiguation needs to be very accurate to contribute positively. This supports Voorhees' (1994) findings, where WordNet was used to disambiguate TREC queries, but with no overall benefit.

7.4 Future Work

There is a wealth of exciting work that naturally follows from this study. This may be

broadly characterised as developing and extending the basis of the work on lexical chains,

and further developing the approach to text similarity matching. Here we will look at

these in turn.

Word Sense Ambiguity

In this section, we consider some implications of word sense ambiguity and their effect on Roget-based lexical chaining.

Word Sense Frequency Information

Selecting an inappropriate word sense is a cause of inaccuracy in the creation of lexical chains. This was explored in Chapter 5, where an explicit word sense disambiguation (WSD) phase was investigated. Chapter 5's principal conclusion was that the contribution that different WSD methods make should be considered carefully, as they may underperform simple selection of the most common word sense where this is known.

Dictionaries order the entries for different word senses according to their frequency of use. Consequently, a baseline method of determining a word sense is to select the first dictionary entry. Wilks (1999) has stated that selecting the first LDOCE sense results in 62% correct sense assignment, whilst Ng and Zelle (1997) report Miller et al. as finding 58.2% using WordNet.
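This first-sense baseline can be sketched as follows; the sense inventory below is a hypothetical illustration, not LDOCE or WordNet data.

```python
# Baseline WSD: choose the first-listed sense, since dictionaries order
# entries by frequency of use. The sense lists here are illustrative.
SENSES = {
    "bank": ["financial institution", "river edge", "tilt of an aircraft"],
    "pine": ["evergreen tree", "to long for"],
}

def first_sense(word):
    """Return the most frequent (first-listed) sense, or None if unknown."""
    senses = SENSES.get(word)
    return senses[0] if senses else None

print(first_sense("pine"))  # -> evergreen tree
```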

Roget's thesaurus gives no indication of which entry corresponds to the most frequent word sense. The lexical chaining process must determine an appropriate word sense from context. This sometimes means that an association will be found between two words based on the selection of two inappropriate, rare, word senses.

WordNet's sense frequency information could be used to alleviate this problem. This presupposes that WordNet's sense inventory could be mapped against that of Roget. This should be possible, since WordNet's 70,000 synsets contain narrow sense distinctions, whereas Roget's 1,000 categories are broad.

A sense equivalence algorithm would need to calculate the best overlap between WordNet's synsets for a word and its Roget categories. Once this had been determined, the Roget categories could be ordered according to the order of their related synsets. This would give an ordering to Roget categories where currently there is none.
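A minimal sketch of such an overlap calculation follows, assuming illustrative synonym sets for one word; a real mapping would index the full resources and handle many edge cases.

```python
# Sketch of the proposed sense-equivalence step: order a word's Roget
# categories by the rank of their best-overlapping WordNet synset, since
# WordNet lists synsets in frequency order. All data are illustrative
# co-members of "bank" (the headword itself is excluded from the sets).
synsets = [
    {"depository", "deposit", "vault"},      # financial sense, most frequent
    {"slope", "incline", "riverside"},       # sloping-land sense
]

roget_categories = {
    784: {"treasury", "deposit", "depository"},  # hypothetical "Lending"
    209: {"slope", "incline", "hill"},           # hypothetical "Height"
}

def order_categories(synsets, categories):
    """Rank Roget categories by the rank of their best-overlapping synset,
    breaking ties by overlap size."""
    def key(item):
        cat_id, cat_words = item
        best = min(
            (rank for rank, syn in enumerate(synsets) if syn & cat_words),
            default=len(synsets),
        )
        overlap = max(len(syn & cat_words) for syn in synsets)
        return (best, -overlap)
    return [cat_id for cat_id, _ in sorted(categories.items(), key=key)]

print(order_categories(synsets, roget_categories))  # -> [784, 209]
```

Here the financial category is ranked first because it overlaps the most frequent synset, giving the Roget entries an approximate frequency ordering.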

Mapping between sense inventories has been explored by several researchers. Yarowsky (1992) suggested a method that would equate WordNet synsets with categories in Roget's thesaurus. Recently, Litkowski (1999) has proposed using the sense equivalences that were manually determined between Hector (Chapter 5) and WordNet for the Senseval competition as a baseline measure to evaluate sense equivalence algorithms. His suggested approach exploits the semantics of dictionary organisation, rather than simple term mapping.

However determined, it is likely that the sense ordering procedure would be approximate, since there is no exact mapping between any pair of word sense ordering schemes. It also depends on the accuracy of the word sense frequency information in WordNet (see Chapter 5). Whether it would improve Hesperus' disambiguation accuracy would need to be determined experimentally.

Word Sense Disambiguation, Chaining and Length

It is an experimental observation that the word sense ambiguity problem decreases in

importance as documents increase in length. This happens because the same concept will

occur sufficiently frequently (and with different synonyms) in a coherent and cohesive

text that the correct interpretation will eventually be chosen.


Accuracy Metrics for Word Sense Disambiguation

There are reasons to question Sanderson's (1996) results. His technique of term combination reduces the number of distinct terms in his document collection. This is a critical element in many IR systems that use the tf*idf (term frequency * inverse document frequency) heuristic. In principle, disambiguation will lead to an increased number of distinct word senses that should permit better discrimination on term frequency, and so improve performance.

Chains Theory

In this section, we consider future work relating to some theoretical questions in lexical chaining. Any improvements in lexical chaining should result in overall improvements in text similarity assessment.

The Interaction of Lexical Chaining and Part of Speech

The process of lexical chaining is relatively new, and has largely been confined to the possibilities allowed by the noun-only subset of WordNet, usually called NounNet. These studies (Stairmand 1996, Green 1999, StOnge 1995) have frequently found inferior performance to statistical approaches. The degree to which this degraded performance has been due to the information lost is not clear.

The Roget-based chainer may use all parts of speech. If a restricted subset were created from nouns only, its performance could be compared to that of a lexical chainer that uses all parts of speech.

The text similarity problem addressed here would provide an ideal benchmark. An 'all words' chainer could be compared on the baseline similarity task to a chainer that used nouns only. Unchanged or improved performance by the nouns-only chainer would support Stairmand's (1996) hypothesis that nouns contain the substance of a text.

Successful performance would indicate that WordNet-based chainers could be improved if the noun and verb hierarchies in WordNet could be integrated.

Unambiguous Chaining

Word sense disambiguation has been studied extensively in this thesis since it has a pronounced effect on the accuracy of lexical chaining. Since the WSD problem is known to be hard, it is unlikely to be adequately solved in the near future. An alternative approach to the applications of lexical chains would use a system that included unambiguous words only.

Sixty percent of the words in Roget occur in one entry only, and are hence unambiguous.

Since they have entries in the thesaurus, the algorithms described in this study would

continue to be applicable. A future line of investigation would be to pre-process the

thesaurus so that it contained only monosemous words. Text similarity performance could

then be assessed. The question to be answered is whether text similarity performance

would suffer more because of lost vocabulary, or benefit from the increased accuracy of

the lexical links.
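The pre-processing step could be sketched as follows, assuming the thesaurus has been flattened into a list of (word, category) entries; the entries shown are illustrative.

```python
from collections import Counter

# Hypothetical flat index of (word, category) entries from the thesaurus.
entries = [
    ("voyage", 267), ("bank", 784), ("bank", 209),
    ("treasury", 784), ("slope", 209),
]

# Count how many categories each word appears in, then keep only the
# monosemous words (those occurring in exactly one entry).
counts = Counter(word for word, _ in entries)
monosemous = [(w, c) for w, c in entries if counts[w] == 1]
print(monosemous)  # "bank" is excluded as ambiguous
```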

Other Chains

Anaphora add greatly to the meaning and focus of a text by allowing simpler expressions to be substituted for key terms. These anaphora do not contribute to the lexical chains identified by Hesperus. Consequently, they are excluded from its Generic Document Profile. Techniques have been described by Kennedy and Boguraev (1996) that identify the referents of anaphora. If these could be adapted for Hesperus, they would potentially increase its accuracy. Of course, there is a risk that, as with word sense disambiguation, inaccurate recognition of anaphora would decrease its performance.

Chain Applications

In this section, we consider possible applications that may be derived from this study and

its results.

Use of Roget in other Lexical Chaining Activities

Lexical chains have been used in several problem areas such as summarisation (Barzilay

and Elhadad 1997), IR (Stairmand 1996), malapropism detection (StOnge 1995), and

hypertext linking (Green 1997, 1999).

Hesperus' performance was crudely compared to that of StOnge's chainer (Chapter 3). However, both Stairmand (1996) and StOnge (1995) speculate that their work could have been improved if WordNet had had access to the relationships encapsulated in Roget.

A natural development of this study would be to apply Roget to other areas suitable for the application of lexical chaining. This would have the practical objective of improving on the known performance of the published systems above, which was obtained using WordNet.


Information Metric

Chapter 4 of this study reported data showing that the distance between thesaurally related words follows a common frequency distribution across text types. If this can be shown to hold in further research, it would be an interesting property of coherent texts. This relationship has been proposed by Ellman and Tait (1997) as the basis for an 'Information Metric': that is, a measure that would allow the automatic differentiation between coherent text and incoherent text possibly designed to confuse an Internet search engine (known colloquially as 'spam'). Whilst Ellman and Tait's (1997) measure was a crude mean value, the concept could be further developed based on the observed versus expected frequency data. As such, it would be appropriate for analysis with the chi-squared statistic. This would be an interesting application of lexical chains.
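Assuming expected link counts at each distance were available from the Zipf-like distribution reported in Chapter 4, such a metric might be sketched as follows; all frequency figures here are invented for illustration.

```python
# Sketch of the proposed Information Metric: compare the observed
# distribution of distances between thesaurally related words against the
# expected (Zipf-like) distribution using a chi-squared statistic. A large
# value would flag text whose cohesion pattern departs from coherent prose.

def chi_squared(observed, expected):
    """Pearson's chi-squared statistic over matched frequency bins."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

expected = [120, 60, 40, 30, 24]   # expected link counts at distances 1..5
coherent = [115, 63, 38, 32, 22]   # close to the expected distribution
spam = [30, 28, 30, 29, 33]        # flat distribution: little real cohesion

print(chi_squared(coherent, expected) < chi_squared(spam, expected))  # True
```

The coherent text's small statistic and the flat 'spam' distribution's large one illustrate how the threshold between them might be set empirically.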

Similarity

This section contains some suggestions to improve the accuracy of the similarity matching process.

Broader Validation of Genre and Text Length

Chapter 4 investigated the claim that lexical cohesive relations are impervious to the effects of document length and style. This work was based on several book-length texts. Since the claim was shown to hold for those texts, it is possible that text similarity performance is independent of genre and document length. Broader validation of this issue is required.

Unknown words

The problem of words that are not in the thesaurus affects all lexical chainers. Unknown words cannot form part of lexical chains and are excluded from any document representation. This work has shown that 60-80% of chain links identified were due to term repetition. If a repeated term is not in the thesaurus, Hesperus cannot currently recognise that it is a distinctive aspect of the document. If such terms could be incorporated into the chaining approach, this would improve Hesperus' general applicability.

Solutions to the unknown words problem have been proposed by Green (1999). These include the application of Dumais' (1995) techniques for latent semantic indexing (Chapter 2) for unknown function words, and a system for dealing with proper names.


The identification of named entities is a recognised task within MUC (Chinchor 1998). Named entities could be stored in a separate thesaural pseudo-category. This would allow items such as company or product names to contribute to the document representation. This would exploit the value and specificity of identical word links, without increasing the ambiguity within the GDP.

Weighting and Corpora

Text similarity matching is carried out in Hesperus as though all thesaural categories are equally important. However, some categories, like the terms they contain, are more indicative of content than others.

Further normalisation would be possible, for example by eliminating attributes whose value fell below a certain percentage. Recalculating percentage weights would then have the general effect of focusing the document representation to favour those concepts that are already strongly represented.
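That normalisation step might be sketched as follows; the threshold and the category strengths are illustrative assumptions.

```python
# Drop categories whose percentage strength falls below a threshold, then
# recompute percentages so the remaining, strongly represented concepts
# dominate the profile.
def prune(profile, threshold=5.0):
    kept = {c: v for c, v in profile.items() if v >= threshold}
    total = sum(kept.values())
    return {c: 100.0 * v / total for c, v in kept.items()}

gdp = {266: 40.0, 624: 35.0, 784: 21.0, 209: 4.0}  # illustrative strengths
print(prune(gdp))  # category 209 removed; the rest rescaled to sum to 100
```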

The presence of rare terms in a document has been reliably shown to improve the performance of information retrieval systems (Salton and McGill 1988, Baeza-Yates and Ribeiro-Neto 1999) using the tf*idf heuristic. This differentially weights those terms that are present in a text, but that are infrequent in a collection of documents. If the relative frequency of thesaural categories could be determined, it could be used to weight categories as more or less indicative of a text's content. Such frequency information could come, for example, from the British National Corpus, although other corpora would be needed for specialist domains.
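Applied to categories rather than terms, this would follow the familiar tf*idf pattern; a sketch, with corpus size and document-frequency figures invented for illustration.

```python
from math import log

# tf*idf-style weighting over thesaural categories. Document frequencies
# per category would come from a reference corpus such as the BNC; the
# figures below are illustrative only.
N = 1000                                     # documents in the corpus
category_df = {266: 400, 624: 150, 784: 20}  # docs containing each category

def weight(profile):
    """Reweight category strengths by inverse document frequency."""
    return {c: v * log(N / category_df[c]) for c, v in profile.items()}

gdp = {266: 0.5, 624: 0.3, 784: 0.2}
weighted = weight(gdp)
# The rare category 784 gains weight relative to the common category 266.
print(weighted[784] > weighted[266])  # True
```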

Representing GDPs

The representation of GDPs is a considerable problem. This study has drawn GDPs as tables, although graphical techniques are widely used for exploring the characteristics of data. If a graphical representation for GDPs were found, people would be able to see at a glance approximately what a text was about. This could be a useful tool for information browsing. Furthermore, two or more GDPs could be compared and their similarity determined visually. Users would be able to differentiate conceptual clusters in the representation that correspond to themes in a document, and identify the equivalents in similar texts. This would allow the possibility that users could alter a desired concept strength interactively, to weight it differentially, and so influence the system's similarity assessment.


7.5 Summary

It was recognised at the start of this study that the technique(s) being investigated would not be completely accurate. It was, however, anticipated that they would be robust. The essential issue was whether inaccurate use of the cohesive nature of text could simulate human performance on a natural language problem such as deciding how alike two texts are.

Evidence has been presented to show that even inexact use of thesaural relationships in analysing natural language may support a task such as determining text similarity. The challenge is now to increase the accuracy of the process by better exploiting the words that texts contain.

References

Aamodt A. and Plaza E. (1994) 'Case-Based Reasoning: Foundational Issues, Methodological Variations and System Approaches' AI Communications Vol. 7(1), March 1994

Agirre, E. and Rigau, G. (1996) 'Word Sense Disambiguation Using Conceptual Density' In Proceedings of the 16th International Conference on Computational Linguistics (COLING '96), Copenhagen, Denmark, 1996.

Allan J., Callan J., Sanderson M., Xu J., and Wegmann S. (1998) 'INQUERY and TREC-7' Proc. TREC 7, NIST

Allen, J. F. (1995) 'Natural Language Understanding', 2nd edition, The Benjamin/Cummings Publishing Company, Menlo Park, California, ISBN 0-8053-0330-8.

Alterman R. (1991) 'Understanding and Summarization' Artificial Intelligence Review Vol. 5, pp. 239-254.

Antworth, E. L. (1993) 'Glossing text with the PC-KIMMO morphological parser', Computers and the Humanities 26:475-484.

Ashley, K. D. and Rissland, E. L. (1988) 'A Case-Based Approach to Modelling Legal Expertise.' IEEE Expert, 1988, Vol. 3, No. 3, pp. 70-77.

Austin, J. L. (1962) 'How To Do Things With Words' Oxford University Press, Oxford, UK.

Azzam S., Humphreys K., and Gaizauskas R. (1999) 'Using Coreference Chains for Text Summarization' in Proc. ACL '99 Workshop on 'Coreference and Its Applications'

Baeza-Yates R. and Ribeiro-Neto B. (1999) 'Modern Information Retrieval' Addison-Wesley, Harlow, UK, ISBN 0-201-39829-X

Bartell, B. D., Cottrell, G. W., and Belew, R. K. (1998) 'Optimizing Similarity using Multi-Query Relevance Feedback'. Journal of the American Society for Information Science Vol. 49(8), pp. 742-761

Bartell, B. D., Cottrell, G. W., and Belew, R. K. (1995) 'Representing Documents using an Explicit Model of their Similarities' Journal of the American Society for Information Science Vol. 46(4), pp. 245-271


Barzilay, R. and Elhadad, M. (1997) 'Using Lexical Chains for Text Summarization.' In Proceedings of the Workshop on Intelligent Scalable Text Summarization at the ACL/EACL Conference, pp. 10-17, Madrid, Spain.

Becker J. D. (1975) 'The Phrasal Lexicon'. In Proceedings of the Conference on Theoretical Issues in Natural Language Processing, Cambridge, MA, pp. 70-77

Beeferman D., Berger A., and Lafferty J. (1997) 'A model of lexical attraction and repulsion.' In Proceedings of the ACL-EACL '97 Joint Conference, Madrid, Spain

Belkin, N. and Croft, W. B. (1987) 'Retrieval Techniques.' Annual Review of Information Sciences and Techniques (ARIST), Vol. 22, 1987, pp. 109-145.

Berners-Lee, T., Cailliau, R., Luotonen, A., Nielsen, H. F., and Secret, A. (1994) 'The World Wide Web', CACM Vol. 37(8), August 1994

Bertino E., Catania B. and Ferrari E. (1999) 'Multimedia IR: Models and Languages' in Baeza-Yates and Ribeiro-Neto (1999)

Black W. J., Rinaldi F., Mowatt D. (1998) 'FACILE: Description of the NE System Used for MUC-7' in Proc. MUC-7 http://www.muc.saic.com/proceedings/muc_7_toc.html accessed 25/10/1999

Blair, D. C. and Maron, M. E. (1985). 'An evaluation of retrieval effectiveness for a full-text document retrieval system.' Communications of the ACM, 28(3):289-299

Boguraev B. and Pustejovsky J. (eds.) (1996) 'Corpus Processing for Lexical Acquisition', MIT Press.

Boyd R., Driscoll J., Syu M. (1992) 'Incorporating Semantics Within a Connectionist Model and a Vector Processing Model' in Proc. TREC 2, available from http://trec.nist.gov/pubs/trec2/papers/txt/29.txt [15/06/2000]

Brill, E. (1992). 'A simple rule-based part-of-speech tagger.' Proceedings of the Third Conference on Applied Natural Language Processing, Trento, Italy.

Brin S. and Page L. (1998). 'The Anatomy of a Large-Scale Hypertextual Web Search Engine.' Proceedings of the Seventh World Wide Web Conference (WWW7), Brisbane; also in a special issue of Computer Networks and ISDN Systems, Volume 30, issues 1-7. (also http://google.stanford.edu/long321.htm accessed 27/10/99)


Buckley C. (1985) 'Implementation of the SMART Information Retrieval System' Cornell TR85-686, available as http://cs-tr.cs.cornell.edu:80/Dienst/UI/1.0/Display/ncstrl.cornell/TR85-686 accessed 21st July 1999

Buckley C., Salton G., Allan J., Singhal A. (1995) 'Automatic Query Expansion Using SMART: TREC-3' in Proc. TREC 3, NIST.

Burke R., Hammond K., Kulyukin V., Lytinen S., Tomuro N., and Schoenberg S. (1997) 'Question Answering from Frequently Asked Question Files.' AI Magazine, pp. 57-66, 1997.

Callan J. P. and Croft W. B. (1993) 'An Approach to Incorporating CBR Concepts in IR Systems.' In 'Case Based Reasoning and Information Retrieval: Exploring the Opportunities for Technology Sharing', Papers from the Spring Symposium, AAAI Press Technical Report SS-93-07

Charniak, E. and Wilks, Y. (eds.) (1976) 'Computational Semantics', North-Holland, Amsterdam, Netherlands.

Chen H. (1994) 'Collaborative Systems: Solving the Vocabulary Problem' IEEE Computer, May 1994

Chinchor N. (1998) 'MUC-7 Named Entity Task Definition (version 3.5)' Proc. MUC-7 http://www.muc.saic.com/proceedings/muc_7_toc.html accessed 25/10/1999

COD9 (1997) 'The Concise Oxford Dictionary', 9th Edition on CD-ROM, Oxford University Press.

Collins, A. M. and Quillian, M. R. (1969). 'Retrieval time from semantic memory.' Journal of Verbal Learning and Verbal Behavior, 8, 240-247.

Crimmins F., Smeaton A. F., Dkaki T. and Mothe J. (1999) 'TétraFusion: Information Discovery on the Internet' IEEE Intelligent Systems, July/August 1999.

Croft W. B. (1995a) 'What Do People Want from Information Retrieval?' D-LIB Magazine (http://www.dlib.org), November 1995

Croft W. B. (1995b) 'Effective Text Retrieval Based on Combining Evidence from the Corpus and Users' IEEE Expert Vol. 10(6), December 1995

Dale R., Oberlander J., Milosavljevic M. and Knott A. (1998) 'Integrating Natural Language Generation and Hypertext to Produce Dynamic Documents'. Interacting with Computers 11(2): 109-135 (1998).


Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K. and Harshman, R. (1990) 'Indexing by Latent Semantic Analysis.' Journal of the American Society for Information Science, 41(6), 1990, 391-407

Draper S. W. and Dunlop, M. D. (1997) 'New IR - New Evaluation: The impact of interactive multimedia on information retrieval and its evaluation', The New Review of Hypermedia and Multimedia, Vol. 3, pp. 107-122.

Dumais, S. T. (1995) 'Using LSI for information filtering: TREC-3 experiments.' In: D. Harman (ed.), The Third Text REtrieval Conference (TREC3), National Institute of Standards and Technology Special Publication

Dumais, S. T., Letsche, T. A., Littman, M. L. and Landauer, T. K. (1997) 'Automatic cross-language retrieval using Latent Semantic Indexing.' In AAAI Spring Symposium on Cross-Language Text and Speech Retrieval, March 1997.

Dutch R. A. (1962) 'Preface to Roget's Thesaurus' in Roget's Thesaurus, Longmans, London.

Ellman J. (1983) 'An Indirect Approach To Types Of Speech Acts.' Proc. IJCAI 1983

Ellman J. (1997) 'Using Information Density to Navigate the Web' IEE Colloquium on Intelligent World Wide Web Agents, March 1997, UK ISSN 0963-3308

Ellman J. (1998) 'Using the Generic Document Profile to Cluster Similar Texts' in Proc. Computational Linguistics UK (CLUK 97), Jan. 1998, University of Sunderland

Ellman J. and Tait J. (1996) 'INTERNET Challenges for Information Retrieval', Proc. BCS IRSG Conference, March 1996

Ellman J. and Tait J. (2000a) 'Roget's Thesaurus: An Additional Knowledge Source for Textual CBR?', in 'Research and Development in Intelligent Systems XVI: Proc. 19th SGES Intl. Conf. on Knowledge Based and Applied Artificial Intelligence', Bramer M., Macintosh A., and Coenen F. (eds), ISBN 1-85233-231-X, pp. 204-217, 2000.

Ellman J. and Tait J. (2000b) 'On the Generality of Thesaurally Derived Lexical Links' in Actes des 5es Journées Internationales d'Analyse Statistique des Données Textuelles (JADT 2000), March 2000, pp. 147-154, École Polytechnique Fédérale de Lausanne, Switzerland

Ellman J., Klincke I., and Tait J. (1998) 'SUSS: The Sunderland University Similarity System: Beneath the Glass Ceiling' in Proc. SENSEVAL Workshop, University of Brighton, 1998.


Ellman J., Klincke I., and Tait J. (2000) 'Word Sense Disambiguation by Information Filtering and Extraction' in Computers and the Humanities, Vol. 34(1-2), 2000, Special Issue on 'Senseval: Evaluating Word Sense Disambiguation Programs', Guest Editors Adam Kilgarriff and Martha Palmer

Fellbaum, C. (ed.) (1998) 'WordNet: An Electronic Lexical Database'. MIT Press, Cambridge, MA.

Frakes W. B. and Baeza-Yates R. (eds.) (1992) 'Information Retrieval: Data Structures and Algorithms' Prentice-Hall, ISBN 0-13-463837-9

Francis W. N. and Kucera H. (1979) 'Brown Corpus Manual' http://www.hd.uib.no/icame/brown/bcm.html accessed 5/7/99

Furnas, G. W., Landauer, T. K., Gomez, L. M., Dumais, S. T. (1987) 'The vocabulary problem in human-system communication.' Communications of the Association for Computing Machinery, 30(11), Nov. 1987, pp. 964-971.

Gaizauskas R., Wakao T., Humphreys K., Cunningham H. and Wilks Y. (1995) 'Description of the LaSIE System as Used for MUC-6', In Proc. of the Sixth Message Understanding Conference (MUC-6), Morgan Kaufmann, pp. 207-220.

Gauch S. and Wang J. (1997) 'A Corpus Analysis Approach for Automatic Query Expansion' in Proceedings of ACM CIKM '97.

Gonzalo J., Verdejo F., Chugur I. and Cigarrán J. (1998) 'Indexing with WordNet synsets can improve text retrieval' Proc. SIGIR 1998. Also cmp-lg/9808002.

Green S. (1996) 'Using Lexical Chains to Build Hypertext Links in Newspaper Articles' in Proc. AAAI Symposium

Green S. (1997) 'Automatically Generating Hypertext by Computing Semantic Similarity' University of Toronto PhD Thesis, Computing Systems Research Group Technical Report 366

Green S. (1999) 'Building Hypertext Links by Computing Semantic Similarity' IEEE Transactions on Knowledge and Data Engineering Vol. 11(8), September/October 1999.

Grefenstette G. (1994) 'Explorations in Automatic Thesaurus Discovery' Kluwer Academic Publishers, Boston, ISBN 0-7923-9468-2

Grosz, B. J., Spärck Jones, K. and Webber, B. L. (eds.) (1986) 'Readings in Natural Language Processing.' Los Altos, CA: Morgan Kaufmann. ISBN 0934613117.


Hahn U. and Chater N. (1998) 'Similarity and rules: distinct? exhaustive? empirically distinguishable?' Cognition Vol. 65, pp. 197-230.

Halliday, M. A. K. and Hasan, R. (1989) 'Language, Context, and Text'. Oxford University Press, Oxford, UK.

Halliday, M. A. K. and Hasan, R. (1976) 'Cohesion in English', Longman, London.

Hammond K. R. (1998) 'Ecological Validity: Then and Now' The Brunswik Society: Web Essays #2 http://www.albany.edu/cpr/brunswik/essay2.html accessed 10th June 1999.

Hampton J. A. (1998) 'Similarity-based categorization and fuzziness of natural categories' Cognition 65 (1998), pp. 137-165

Harrison C. (1980) 'Readability in the Classroom' Cambridge University Press, UK, ISBN 0-521-22713-7

Hearst M. A. (1994) 'Multi-Paragraph Segmentation of Expository Text.' Proceedings of the 32nd Meeting of the Association for Computational Linguistics, Las Cruces, NM, June 1994.

Hearst, M. and Schütze, H. (1996) 'Customizing a Lexicon to Better Suit a Computational Task', in Boguraev and Pustejovsky (1996)

Hirst G. and St Onge D. (1998) 'Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms' in Fellbaum (1998)

Humphreys K., Gaizauskas R., Azzam S., Huyck C., Mitchell B., Cunningham H., Wilks Y. (1998) 'University of Sheffield: Description of the LaSIE-II System as Used for MUC-7' in Proc. MUC-7 http://www.muc.saic.com/proceedings/muc_7_toc.html accessed 25/10/1999

Ide N. and Véronis J. (1998) 'Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art' Computational Linguistics Vol. 24(1), pp. 1-41

Inference (1994) 'ART*Enterprise: ARTScript Programming Guide, Chapter 19: Case Based Reasoning'. Inference Corp.

Infopedia 96. CD-ROM, Softkey Multimedia Inc.


Jiang, J. J. and Conrath, D. W. (1997) 'Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy', in Proceedings of ROCLING X (1997) International Conference on Research in Computational Linguistics, Taiwan, 1997.

Karlgren J. (1999) 'Stylistic Experiments' in Strzalkowski (1999)

Karlgren, J. and Cutting, D. (1994) 'Recognizing Text Genres with Simple Metrics Using Discriminant Analysis' in Proc. COLING 1994. (also http://xxx.lanl.gov/cmp-lg/9410008 accessed 30/6/99)

Kennedy C. and Boguraev B. (1996) 'Anaphora for Everyone: Pronominal Anaphora Resolution Without a Parser'. Proceedings of the 16th International Conference on Computational Linguistics (COLING '96), pp. 113-118, Copenhagen, Denmark.

Kilgarriff, A. (1997) 'I don't believe in word senses' Computers and the Humanities 31(2), pp. 91-113. (Also available at ftp://ftp.itri.bton.ac.uk/pub/reports/ITRI-97-12.ps.gz accessed 28/04/2000)

Kilgarriff, A. (1998) 'Gold Standard Datasets for Evaluating Word Sense Disambiguation Programs' Computer Speech and Language 12(3) (also University of Brighton Technical Report ITRI-98-08)

Kilgarriff, A. and Rose, T. (1998) 'Measures for Corpus Similarity and Homogeneity' in Proc. 3rd Conf. on Empirical Methods in Natural Language Processing, Granada, Spain, pp. 46-52. (also University of Brighton Technical Report ITRI-98-07)

Kilgarriff, A. and Rosenzweig, J. (2000) 'English SENSEVAL: Report and Results'. To appear in Proc. LREC, Athens, May-June 2000. Available as http://www.itri.brighton.ac.uk/events/senseval/athens.ps accessed 13 June 2000

Kilgarriff A. and Palmer M. (2000) 'Introduction to the Special Issue on SENSEVAL' Computers and the Humanities Vol. 34(1/2):1-13, April 2000.

Kinnear P. R. and Gray C. D. (1997) 'SPSS for Windows Made Simple' Psychology Press, Hove, East Sussex, ISBN 0-86377-827-5

Klincke I. (1998) 'Word Sense Disambiguation' MSc Thesis, School of Computing and Information Systems, University of Sunderland

Kolodner J. (1993) 'Case-Based Reasoning' Academic Press/Morgan Kaufmann, ISBN 1558602372


Kominek J. and Kazman R. (1997) 'Accessing Multimedia through Concept Clustering' Proc. CHI.

Krovetz R. and Croft W. B. (1992) 'Lexical Ambiguity and Information Retrieval', ACM Transactions on Information Systems, Vol. 10(2), pp. 115-141.

Kunze M. and Hübner A. (1998) 'CBR on Semi-structured Documents: The ExperienceBook and the FAllQ Project' in Proc. 6th German Workshop On Case-Based Reasoning. http://www.informatik.hu-berlin.de/~cbr-ws/GWCBR98/program.html accessed 18/6/99

Lee L. (1997) 'Similarity-Based Approaches to Natural Language Processing' PhD Thesis, Harvard University. http://xxx.lanl.gov/cmp-lg/9708011 accessed 19/8/97

Lenz M. (1998) 'Textual CBR and Information Retrieval - A Comparison.' In Proc. 6th German Workshop On Case-Based Reasoning, Berlin, March 6-8, 1998

Lenz M., Bartsch-Spörl B., Burkhard H-D., Wess S. (eds.) (1998) 'Case-Based Reasoning Technology: From Foundations to Applications.' Lecture Notes in Artificial Intelligence 1400, Springer Verlag, 1998, ISBN 3-540-64572-1

Lenz M., Hübner A., Kunze M. (1998) 'Textual CBR' in Lenz, Bartsch-Spörl, Burkhard and Wess (1998).

Lesk M. (1986) 'Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone' Proc. ACM SIGDOC, Toronto, Canada.

Lewis D. D. (1991) 'Representation and Learning in Information Retrieval' PhD Thesis, University of Massachusetts, TR91-93

Lewis D. D. and Spärck Jones K. (1996) 'Natural Language Processing for Information Retrieval' CACM Vol. 39(1)

Li W. (2000) 'Zipf's Law' http://linkage.rockefeller.edu/wli/zipf/ accessed 19 June 2000

Liddy E. D. (1998) 'Enhanced Text Retrieval Using Natural Language Processing' ASIS Bulletin, June 1998

Litkowski K. C. (1999) 'Towards a Meaning-Full Comparison of Lexical Resources' in Proceedings of the Association for Computational Linguistics Special Interest Group on the Lexicon (SIGLEX-99), June 21-22, College Park, MD


Loukachevitch N. V. and Dobrov B. V. (2000) “Thesaurus as a Tool for Automatic Detection of Lexical Cohesion in Texts” in JADT2000, 5es Journées Internationales d'Analyse Statistique des Données Textuelles. École Polytechnique Fédérale de Lausanne, Switzerland

Makuta M., Cohen R. and Donaldson T. “An integrated approach to detecting incoherence in texts with emphasis on the role of evaluating lexical cohesion” to appear

Mann W. C. and Thompson S. A. (1988) “Rhetorical Structure Theory: A theory of text organization” Text, 8(3), pp. 243-281

Mauldin M. L. (1991) “Conceptual Information Retrieval: A Case Study in Adaptive Partial Parsing” Kluwer Academic Publishers, Dordrecht, The Netherlands

Mauldin M. L. (1991) “Retrieval Performance in FERRET: A Conceptual Information Retrieval System” Proc. 14th SIGIR (1991)

Mauldin M. L. (1995) “Measuring the Web with Lycos” Proc. 3rd Int. World Wide Web Conference, April 1995 (http://lycos.cs.cmu.edu/lycos-websize.html)

Mauldin M. L. (1997) “Lycos: Design choices in an Internet search service” IEEE Intelligent Systems, Jan-Feb 1997, pp. 8-11

Mc Hale M. L. (1998) “A Comparison of WordNet and Roget's Taxonomy for Measuring Semantic Similarity” http://xxx.lanl.gov/cmp-lg/9809003 14 Sep 1998

Miller G., Beckwith R., Fellbaum C., Gross D. and Miller K. (1990) “Introduction to WordNet: An on-line lexical database” International Journal of Lexicography, 3(4), pp. 235-244

Miller G. and Charles W. G. (1991) “Contextual Correlates of Semantic Similarity” Language and Cognitive Processes, Vol. 6, No. 1, pp. 1-28

Miller G., Beckwith R., Fellbaum C., Gross D. and Miller K. (1990) “Five Papers on WordNet” Technical Report CSL-43, Cognitive Science Laboratory, Princeton University (available from ftp://ftp.cogsci.princeton.edu/wordnet/)

Morris J. and Hirst G. (1991) “Lexical cohesion computed by thesaural relations as an indicator of the structure of text” Computational Linguistics, 17(1), pp. 21-48

Nadeau M. (1994) “An Improved Encarta” BYTE, January 1994

Ng H. T. and Lee H. B. (1996) “Integrating multiple knowledge sources to disambiguate word senses: An exemplar-based approach” in Proc. 34th ACL, pp. 40-47


Ng H. T. and Zelle J. (1997) “Corpus-Based Approaches to Semantic Interpretation in NLP” AI Magazine, Vol. 18, No. 4, Winter 1997

Ng H. T., Lim C. Y. and Foo S. K. (1999) “A Case Study on Inter-Annotator Agreement for Word Sense Disambiguation” in Proceedings of the Association for Computational Linguistics Special Interest Group on the Lexicon (SIGLEX-99), June 21-22, College Park, MD

Nowell T. L., France R. K., Hix D., Heath L. S. and Fox E. A. (1996) “Visualizing Search Results: Some Alternatives to Query-Document Similarity” Proc. SIGIR 1996

Okumura M. and Honda T. (1994) “Word sense disambiguation and text segmentation based on lexical cohesion” Proc. COLING 1994, Vol. 2, pp. 755-761

Project Gutenberg (1999) “Official and Original Project Gutenberg Web Site and Home Page” http://www.promo.net/pg/ Accessed 27/08/1999

Qiu Y. and Frei H. P. (1995) “Concept Based Query Expansion” in Proc. ACM SIGIR 1995, pp. 160-169

Quillian M. R. (1968) “Semantic Memory” in Minsky M. (Ed.) “Semantic Information Processing” MIT Press, Cambridge, Mass.

Rada R. and Bicknell E. (1989) “Ranking Documents with a Thesaurus” Journal of the American Society for Information Science, 40(5), pp. 304-310

Radev D. R. (1997) “Natural Language Processing FAQ” http://www.cs.columbia.edu/~acl/nlpfaq.txt Accessed April 1999

Resnik P. (1995) “Using Information Content to Evaluate Semantic Similarity in a Taxonomy” Proceedings of the 14th International Joint Conference on Artificial Intelligence, Vol. 1, pp. 448-453, Montreal, August 1995

Richardson R. and Smeaton A. F. (1995) “Using WordNet in a Knowledge-Based Approach to Information Retrieval” Working Paper CA-0395, School of Computer Applications, Dublin City University, Ireland

Rissland E. L. and Daniels J. L. (1996) “The Synergistic Application of CBR to IR” Artificial Intelligence Review, Vol. 10, pp. 441-475

Robertson S. E., Walker S. and Beaulieu M. (1998) “Okapi at TREC-7: automatic ad hoc, filtering, VLC and interactive track” Proc. TREC-7, NIST


Robertson S. E. (1994) “Computer Retrieval: As Seen Through the Pages of Journal of Documentation” in B. C. Vickery (Ed.), Fifty Years of Information Progress. London: Aslib, pp. 119-146

Salton G. and Buckley C. (1990) “Improving Retrieval Performance by Relevance Feedback” Journal of the American Society for Information Science, 41(4), pp. 288-297

Salton G. and McGill M. (1983) “Introduction to Modern Information Retrieval” McGraw-Hill

Sanderson M. (1994) “Word Sense Disambiguation and Information Retrieval” in Proc. ACM SIGIR 1994, pp. 142-151

Sanderson M. (1996) “Word Sense Disambiguation and Information Retrieval” PhD Thesis, University of Glasgow

Schank R. C. and Abelson R. P. (1977) “Scripts, Plans, Goals, and Understanding” Hillsdale, NJ: Lawrence Erlbaum Associates

Schegloff E. and Sacks H. (1973) “Opening Up Closings” Semiotica, 8, pp. 289-327

Schütze H. (1992) “Dimensions of Meaning” Proceedings of Supercomputing, pp. 787-796, Minneapolis, MN

Searle J. (1969) “Speech Acts: An Essay in the Philosophy of Language” Cambridge: Cambridge University Press

Selberg E. and Etzioni O. (1995) “Multi-Service Search and Comparison Using the MetaCrawler” Proc. WWW4

Selberg E. and Etzioni O. (1997) “The MetaCrawler Architecture for Resource Aggregation on the Web” IEEE Expert, January/February 1997, Vol. 12, No. 1, pp. 8-14

Shimazu H., Kitano H. and Shibata G. (1993) “Retrieving Cases from Relational Databases: Another Stride to Corporate Wide Case-Based Systems” Proc. IJCAI 1993

Sloman S. A. and Rips L. J. (1998) “Similarity as an explanatory construct” Cognition, Vol. 65, pp. 87-101

Smeaton A. (1999) “Using NLP or NLP Resources for Information Retrieval Tasks” in Strzalkowski (1999)

Spärck Jones K. (1986) “Synonymy and Semantic Classification” Edinburgh University Press, Edinburgh, UK


Spärck Jones K. (1999) “What is the role of NLP in Text Retrieval?” in Strzalkowski (1999)

Spärck Jones K. and Willett P. (1997) “Readings in Information Retrieval” Morgan Kaufmann. ISBN 1-55860-454-5

Srinivasan P. (1992) “Thesaurus Construction” in Frakes and Baeza-Yates (1992)

Stairmand M. (1996) “A Computational Analysis of Lexical Cohesion with Applications in Information Retrieval” PhD Thesis, UMIST Computational Linguistics Laboratory

Stairmand M. and Black W. J. (1996) “Conceptual and Contextual Indexing using WordNet-derived Lexical Chains” in Proc. BCS IRSG

St-Onge D. (1995) “Detecting and Correcting Malapropisms with Lexical Chains” MSc Thesis, University of Toronto. Department of Computer Science Technical Report CSRI-319, March 1995 (available on-line at ftp://ftp.cs.toronto.edu/csri-technical-reports/319/319.ps.Z [21/08/2000])

Strzalkowski T. (1999) “Natural Language Information Retrieval” Kluwer Academic, Dordrecht, NL. ISBN 0-7923-5685-3

Sussna M. (1993) “Word Sense Disambiguation for Free-text Indexing Using a Massive Semantic Network” Proceedings of the Second International Conference on Information and Knowledge Management, pp. 67-74

Tait J. and Ellman J. (1999) “MABLe: a multilingual authoring tool for business letters” Proc. ASLIB 21st Conf. on Translating and the Computer, Nov. 1999

Van Dijk T. A. (1977) “Text and Context: Explorations in the semantics and pragmatics of discourse” London: Longman

van Rijsbergen C. J. (1979) “Information Retrieval” London: Butterworths (available on-line at http://www.dcs.gla.ac.uk/Keith/Preface.html, accessed 25th March 1999)

Voorhees E. (1994) “Query Expansion Using Lexical-Semantic Relations” Proceedings SIGIR '94, pp. 61-69

Voorhees E. and Harman D. (1998) “Overview of the Eighth Text REtrieval Conference (TREC-8)” in Proc. TREC-8, NIST. Available on-line http://trec.nist.gov/pubs/trec8/t8_proceedings.html [15-Jun-00]

Watson I. and Watson H. (1998) “Case-based content navigation” Knowledge-Based Systems, Vol. 11, No. 5-6, pp. 345-353


Watson I. (1997) “Applying Case-Based Reasoning: Techniques for Enterprise Systems” Morgan Kaufmann. ISBN 1558604626

Watson I. and Marir F. (1994) “Case-Based Reasoning: A Review” The Knowledge Engineering Review, Vol. 9, No. 4, pp. 327-354

Weaver P. L. (1993) “Practical SSADM Version 4” Pitman, London

Wilbur W. J. and Coffee L. (1994) “The effectiveness of document neighbouring in search enhancement” Information Processing and Management, 30(2), pp. 253-266

Wilks Y. (1999) “COM 334: Language Engineering Notes” http://www.dcs.shef.ac.uk/~yorick/le334.html Accessed 27/08/1999

Wilks Y., Slator B. and Guthrie L. (1995) “Electric Words: dictionaries, computers and meanings” MIT Press

Wilks Y. and Stevenson M. (1996) “The grammar of sense: Is word-sense tagging much more than part-of-speech tagging?” Technical Report CS-96-05, University of Sheffield. Also Proc. SIGLEX (1997)

Winograd T. (1972) “Understanding Natural Language” New York: Academic Press (191 pp.)

Xu J. and Croft B. W. (1996) “Query Expansion using Local and Global Document Analysis” in Proc. ACM SIGIR 1996

Yang Y. and Pederson J. (1999) “Intelligent Information Retrieval” IEEE Intelligent Systems, July/August 1999

Yarowsky D. (1992) “Word Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora” Proc. COLING 1992, pp. 454-460

Zipf G. K. (1949) “Human Behavior and the Principle of Least Effort” Addison-Wesley (republished by Hafner Publishing Co., New York, 1972)

Zobel J. (1998) “How reliable are large-scale information retrieval experiments?” Proceedings of the Twenty-First International ACM-SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, August 1998, pp. 307-314

Zobel J. and Moffat A. (1998) “Exploring the similarity space” SIGIR Forum, 32(1), pp. 18-34, Spring 1998

Bibliography

Armstrong R., Freitag D., Joachims T. and Mitchell T. (1995) “WebWatcher: A Learning Apprentice for the World Wide Web” AAAI Spring Symposium on Information Gathering in Heterogeneous, Distributed Environments, March 1995

Brüninghaus S. and Ashley K. D. (1998) “Evaluation of Textual CBR Approaches” in Proceedings of the AAAI-98 Workshop on Textual Case-Based Reasoning (AAAI Technical Report WS-98-12), pp. 30-34, Madison, WI

Catarci T., Chang S. K., Liu W. and Santucci G. (1998) “A light-weight Web-at-a-Glance system for intelligent information retrieval” Knowledge-Based Systems, Vol. 11, No. 2, pp. 115-124

Crestani F. and Van Rijsbergen C. J. (1998) “A study of probability kinematics in information retrieval” ACM Transactions on Information Systems, Vol. 16, No. 3, pp. 225-255

Davies J., Week R., Revett M. and McGrath A. (1996) “Using Clustering in a WWW Information Agent” Proc. BCS IRSG Conference, March 1996

Di Battista G., Eades P., Tamassia R. and Tollis I. (1994) “Algorithms for Drawing Graphs: an Annotated Bibliography” Computational Geometry: Theory and Applications, 4(5), pp. 235-282

Eichmann D. (1994) “Ethical Web Agents” Proc. WWW 94

Etzioni O. and Weld D. (1994) “A Softbot-Based Interface to the Internet” CACM, July 1994

Hunt J. (1997) “Case Based Diagnosis and Repair of Software Faults” Expert Systems: The International Journal of Knowledge Engineering and Neural Networks, Vol. 14, No. 1, pp. 15-23

Knott A. and Sanders T. (1998) “The Classification of Coherence Relations and their Linguistic Markers: An Exploration of Two Languages” Journal of Pragmatics, Vol. 30

Knott A. (1996) “A Data-Driven Methodology for Motivating a Set of Coherence Relations” PhD Thesis, Dept. of AI, University of Edinburgh

Koster M. (1994) “World Wide Web Wanderers, Spiders and Robots” http://web.nexor.co.uk/mak/doc/robots/robots.html


Lalmas M. and Bruza P. D. (1998) “The use of logic in information retrieval modelling” Knowledge Engineering Review, Vol. 13, No. 3, pp. 263-295

Lieberman H. (1995) “Letizia: An Agent That Assists Web Browsing” Proc. WWW4

Magennis M. (1995) “Expert rule-based query expansion” Proc. BCS IRSG (1995)

Manber U. and Wu S. (1993) “GLIMPSE: A Tool to Search Through Entire File Systems” University of Arizona Dept. of Computer Science Technical Report TR 93-34

Mladenic D. (1999) “Text-Learning and Related Intelligent Agents: A Survey” IEEE Intelligent Systems, July/August 1999

Obrazka K., Danzig P. B. and Li S-H. (1993) “Internet Resource Discovery Services” IEEE Computer, September 1993

Onyshkevych B. (1993) “Template Design for Information Extraction” in Proceedings of the Fifth Message Understanding Conference (MUC-5)

Paice C. D. (1991) “A Thesaural Model of Information Retrieval” Information Processing and Management, Vol. 27, No. 5, pp. 443-447

Pasi G. and Pereira R. A. M. (1999) “A decision making approach to relevance feedback in information retrieval: A model based on soft consensus dynamics” International Journal of Intelligent Systems, Vol. 14, No. 1, pp. 105-122

Resnick P. and Varian H. (1997) “Recommender Systems” Communications of the Association for Computing Machinery, 40(3), March 1997, pp. 56-58

Riloff E. and Lehnert W. (1994) “Information extraction as a basis for high-precision text classification” ACM Transactions on Information Systems, Vol. 12, No. 3, July 1994, pp. 296-333

Schwartz M. (1993) “Internet Resource Discovery at the University of Colorado” IEEE Computer, September 1993

StatSoft, Inc. (1999) “Electronic Statistics Textbook” Tulsa, OK: StatSoft. http://www.statsoft.com/textbook/stathome.html

Tait J., Sanderson H., Ellman J., Martinez A. M., Hellwig P. and Tsagheas P. (1997) “Practical Considerations in Building a Multi-Lingual Authoring System for Business Letters” Proc. ACL'97/EACL'97 Workshop on Commercial Applications of NLP


Willett P. (1988) “Recent trends in hierarchical document clustering: a critical review” Information Processing and Management, 24, pp. 577-597


Appendix I. Experimental Examples on Rosetta Stone

This appendix contains the texts for the text similarity experiment on the Rosetta Stone. The articles are named as they appear in Chapter 6. The source text from Encarta is known as “msRosetta”. The extract from Infopaedia is called “Info_Rosetta”. The other texts have names derived automatically from their URLs.

The pages are shown reduced to text only, as this was the form used in the experiments.

msRosetta.

Rosetta Stone, black basalt slab bearing an inscription that was the key to the deciphering

of Egyptian hieroglyphics and thus to the foundation of modern Egyptology. Found by

French troops in 1799 near the town of Rosetta in Lower Egypt, it is now in the British

Museum, London. The stone was inscribed in 196 BC with a decree praising the Egyptian

king Ptolemy V. Because the inscription appears in three scripts, hieroglyphic, demotic,

and Greek, scholars were able to decipher the hieroglyphic and demotic versions by

comparing them with the Greek version. The deciphering was chiefly the work of the

British physicist Thomas Young and the French Egyptologist Jean François Champollion.

index

The Rosetta Stone

The Rosetta Stone

Photo of the Rosetta Stone from British Museum (117k)

The Rosetta Stone led to the modern understanding of hieroglyphs. Made in Egypt around 200BC, it is a

stone tablet engraved with writing which celebrates the crowning of King Ptolemy V. It is a solid piece of

black Basalt and is 1m high by 70cm wide by 30cm deep. Quite heavy.

The interesting thing about the Rosetta Stone is that the writing is repeated three times in different

alphabets:

Hieroglyphic (top of stone)- used by ancient Egyptians

Demotic (centre of stone)- used by Arabs including modern Egyptians

Greek (base of stone)- used by, erm, Greeks, and other eastern Europeans


Simplified map of the world showing Egypt (10k)

Map of northern Egypt showing Rashid, the discovery location (44k)

The stone was re-discovered in 1799AD at Rosetta near Rashid, about 200km north of Cairo on the

Mediterranean coast. At that time, the meaning of hieroglyphs had been forgotten. Nobody could translate

any of the hieroglyphs found whilst raiding/exploring ancient Egyptian archeology.

However, the Rosetta Stone changed all that. Because people of the 19th century could understand the

Demotic and Greek parts of the engraving, a chap called Jean-Francois Champollion worked out which

words were represented by which hieroglyphs in 1821AD.

The Rosetta Stone now rests in the British Museum in London.

Here is an extract from the writing on the Rosetta Stone:

...whereas king PTOLEMY THE EVER-LIVING, THE BELOVED OF PTAH, THE GOD EPIPHANES

EUCHARISTO, the son of King Ptolemy and Queen Arsinoe, the Gods Philopatores, has been a benefactor

both to the temples and to those who dwell in them, as well as those who are his subjects, being a god

sprung from a god and goddess (like Horus the son of lsis and Osiris, who avenged his father Osiris) and

being benevolently disposed towards the gods, has dedicated to the temples revenues in money and corn

and has undertaken much outlay to bring Egypt into prosperity, and to establish the temples, and has been

generous with all his own means; and of the revenues and taxes levied in Egypt some he has wholly

remitted and others has lightened, in order that the people and the others might be in prosperity during his

reign: and whereas he has remitted the debts to the crown being many in number which they in Egypt and in

the rest of the kingdom owed: and whereas those who were in prison and those who were under accusation

for a long time, he has freed of the charges against them; and whereas he has directed that the gods shall

continue to enjoy the revenues of the temples and the yearly allowances given to them, both of corn and

money, likewise also the revenues assigned to the gods from vine land and from gardens and other

properties which belonged to the gods in his father's time...

Photo of the Place des Ecritures, Figeac, France (81k)

There is now a museum dedicated to the translator Jean-Francois Champollion, located in Champollion's

home town of Figeac near Cahors in southern France.

An Egyptian tablet in the Louvre Museum, Paris, France (26k)

Hieroglyphic name ring for King Ptolemy V

Champollion was born in December 1790 and could speak Greek, Latin, Hebrew, Arabic, Chaldean and

Syrian by the age of 14. By 19, Champollion was a History lecturer at Grenoble University. He had to make

his translations from a copy of the Rosetta Stone, since the stone itself had been stolen/seized by the English

during the Napoleonic war.Champollion visited Egypt only once- to put his new understanding of


hieroglyphs to the test. He returned to France to found the Egyptology Museum at the Louvre in Paris

(where you can still see many tablets and statues today). Champollion died in 1832 aged only 42.

Much of Champollion's work was based on that of Englishman Thomas Young who had already deciphered

names of people and places. Words like these, called Proper Nouns, are bordered by hieroglyphic name

rings, similar in shape to modern army name tags.

[ Cimmerii Index | [email protected] | Andrew Oakley ]

[ British Museum | Cambridge University Egyptology Dept. ]

[ Rosetta Stone- goth music band | Dark Horizons RS music reviews | Download RealAudio ]

Click here for hit statistics:

Info_Rosetta

Rosetta Stone Slab of basalt with inscriptions from 197 BC, found near the town of Rosetta, Egypt,

1799. Giving the same text in three versions - Greek, hieroglyphic, and demotic script - it

became the key to deciphering other Egyptian inscriptions.

Discovered during the French Revolutionary Wars by one of Napoleon's officers in the

town now called Rashid, in the Nile delta, the Rosetta Stone was captured by the British

1801, and placed in the British Museum 1802. Demotic is a cursive script (for quick

writing) derived from Egyptian hieratic, which in turn is a more easily written form of

hieroglyphic.

RosettaStone

The Rosetta Stone

The Rosetta Stone

Download at full size (127K)

© The Trustees of the British Museum

The Rosetta Stone was the key that unlocked the mysteries of Egyptian hieroglyphics. Napoleon's troops

discovered it in 1799 near the seaside town of Rosetta in lower Egypt, and it eventually made its way into

the British Museum in London where it resides today. It is a slab of black basalt dating from 196 BC.

inscribed by the ancient Egyptians with a royal decree praising their king Ptolemy V. The inscription is

written on the stone three times, once in hieroglyphic, once in demotic, and once in Greek. Thomas Young,

a British physicist, and Jean Francois Champollion, a French Egyptologist, collaborated to decipher the


hieroglyphic and demotic texts by comparing them with the known Greek text. From this meager starting

point a generation of Egyptologists eventually managed to read most everything that remains of the

Egyptians' ancient writings.

When I started my company in 1976 I was a new Ph.D. in programming languages and thought I'd write

compilers for a living, so I took the stone's name because of its famous association with language

translation. I did implement a few programming languages but turned out to spend most of my time doing

interactive graphical applications, and most recently computer games.

Rosetta home page

Copyright © 1998 by Rosetta, Inc. All rights reserved.

“Rosetta” is a registered trademark of Rosetta, Inc.

Sendcomments or questions about this site to the webmaster.

stone

Rosetta Technologies, Inc. - The Rosetta Stone


In 1799, soldiers in Napoleon's invading army

unearthed a flat basalt stone that had lain

concealed for nearly two thousand years near the

town of Rosetta, Egypt. The Rosetta Stone bore

inscriptions in three languages, and after years of

intense research, it gave linguists the key that

unlocked the secret of Egyptian hieroglyphics and

vastly expanded our knowledge of ancient Egyptian

history and culture.

In today's highly heterogeneous computing

environments, users who need to share data often

feel as if they might as well be dealing with

hieroglyphics: until they encounter Rosetta

Technologies. Rosetta unlocks the secrets of

diverse and incompatible computer documents and

breaks the barriers to communicating electronic

product data.


location

Rosetta Technologies Locations

The greatest power of computers is not as

number-

crunching machines but as platforms for

communication. Just as the Rosetta Stone broke the

barriers that kept Egyptian hieroglyphics a secret

for thousands of years, Rosetta Technologies breaks

the barriers that have made it difficult to exploit

the computer as a true communications vehicle. The

Rosetta Stone empowered us to communicate with the

past. Rosetta Technologies offers communication

capabilities that will carry us into the future.

Contact Rosetta Technologies today to see how to

take full advantage of your electronic product

data.

In North America

Address:

Rosetta Technologies, Inc.

15220 NW Greenbrier Pkwy.,

Suite 300

Beaverton, Oregon 97006

Telephone:

1 503 690-2500

Sales (US): 1 800 445-0300

Telefax: 1 503 531-0401

Email: [email protected]

In Europe

Address:

Rosetta Technologies

35, cours Michelet

92060 Paris La Défense

France

Telephone: +33 01 47 73 15 60

Telefax: +33 01 47 73 15 58


Email: [email protected]

About Rosetta |

Products |

Services |

Technical Support |

Evaluation Software |

Site Index

Rosetta

Rosetta Stone Language Software - Multilingual Books - Rosetta Stone Language Software

Multilingual Books and TapesRosetta Stone CD-ROM Courses

Ordering Information

Back to Software Page

Rosetta Stone CD-ROM Language Courses

The most extensive CD-ROM courses available! Equivalent to 2 years of college course study.

Available for Arabic, Chinese (Mandarin), English, French, German, Italian,

Latin, Japanese, Portuguese, Russian, Thai, and Vietnamese.

Seven language PowerPac sampler available for only $69

Full courses only $385

About Rosetta Stone -

Screen Shots -

Availability -

PowerPac

What is The Rosetta Stone CD-ROM language course?

The Rosetta Stone CD-ROM series is the premier choice for students seeking to master a foreign

language on their computer. Intended for all serious students, these courses are equivalent to 2 years of

college course study. In fact, these

packed CD-ROM courses for Windows and Macintosh are the most intensive computer courses you can

buy. Rosetta Stone teaches with an

extensive series of drills that associate text, spoken words, and pictures. Rosetta Stone teaches you not just

tourist phrases, but also lots of

vocabulary and grammar in a clear and simple manner. This is the only course you will need to learn your

choice of Arabic, Chinese (Mandarin),

English, French, German, Italian, Latin, Japanese, Portuguese, Russian, Thai, and Vietnamese. We love

Rosetta Stone courses and we sell them

below retail price at only $385.00. Get yours today.


The Rosetta Stone intensive courses assume no prior knowledge of a foreign language and will take the

student through the equivalent of

two years college study. Unlike learning in language classrooms, Rosetta Stone is designed to be

entertaining and self-paced. You can learn a foreign

language when you want, for as long or short a time as you wish. Each Rosetta Stone language CD-ROM

contains over 8,000 real life color

images, plus thousands of words and phrases spoken by native speakers. The emphasis is on both spoken

and written language, meaning that

Rosetta Stone teaches both the spoken and written foreign language. The interactive design is not a mere

“static” course that plays like a

cassette; constant student response is necessary and this holds the student's attention and improves retention

of new material. Built-in speech

recognition aids the student in pronouncing the words as they should be spoken. Dual versions allow for the

course to be used on either

Macintosh or Windows computers.

Who should buy The Rosetta Stone?

The Rosetta Stone intensive course is best suited for people who want a professional-quality language

course to learn more than just the

basics of a language. People who choose Rosetta Stone are getting the best and most extensive computer

course available today. But don't take

our word for it, check out these comments from major news sources:

“The premier foreign language CD-ROM”

The American School Board Journal, Sept. 1996

“Excellent teaching methods...beautiful photographs...a valuable educational tool and fun to use. Four stars.”

Joanna Pearlstein - MacWorld magazine

“It is rare to find a superbly designed instructional software program as this one. One of the top 10 CDs!”

CD-ROMs Rated - McGraw-Hill

“Top 100 CD-ROMs!” - PC magazine

“50 Best CD-ROMs!” - MacUser

Other outstanding features:

1 or 2 CD-ROMs with 92 lessons including 8 review lessons

Illustrated User's Guide

Language Book with curriculum text

Handbook for teachers

Icon-driven operation for ease of use


Instant scoring of exercises and tests

Timed modes for increased challenge

Wide scope of languages - intensive programs for Chinese, Russian, and Dutch are

rare

Available for Windows 95, 3.x and Mac

Pricing and Availability

All Rosetta Stone intensive courses are equivalent to 2 years of college courses!

AR-525 Arabic Rosetta Stone Level 1 $385.00

CH-525 Mandarin Chinese Rosetta Stone Level 1 $385.00

DU-525 Dutch Rosetta Stone Level 1 $385.00

EN-525 English Rosetta Stone Level 1 $385.00

FR-525 French Rosetta Stone Level 1 $385.00

IT-525 Italian Rosetta Stone Level 1 $385.00

JA-525 Japanese Rosetta Stone Level 1 $385.00

LA-525 Latin Rosetta Stone Level 1 $385.00

PR-525 Portuguese Rosetta Stone Level 1 $385.00

RS-525 Russian Rosetta Stone Level 1 $385.00

SP-525 Spanish Rosetta Stone Level 1 $385.00

TH-525 Thai Rosetta Stone Level 1 $385.00

VI-525 Vietnamese Rosetta Stone Level 1 $385.00

Interested in a special version that includes samples of many different languages?

Check out the Rosetta Stone PowerPac edition for only $69.00!

Minimum system requirements:

Mac: 256 colors, CD-ROM drive, 4 MB RAM, microphone for voice

recording

Win 3.x: 486, 4 MB RAM, 4MB hard drive space, CD-ROM drive, 256 colors,

Sound Blaster or 100% compatible, microphone for voice recording

Win 95: 486DX, 8 MB RAM, CD-ROM drive, 256 colors, Sound Blaster or 100%

compatible, microphone for voice recording

Ordering Information

Back to Top

Multilingual Books and Tapes

1205 E. Pike, Seattle, WA 98122

1-206-328-7922 - Fax: 328-7445

E-mail:[email protected]

© Copyright 1998 The Internet Language Company

Appendix II. Experimental Texts

Experimental Texts and their Generic Document Profiles

This appendix shows the experimental source texts from Microsoft's© Encarta. These

have been marked up by Hesperus to show the component lexical chains1. The texts are

followed by the Generic Document Profile as determined using the fine-grain, no word

sense disambiguation options (Chapter 6).
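The superscript chain indices in the marked-up texts below (e.g. copyright1, body5) can be recovered mechanically from the plain-text form. The following is a minimal illustrative sketch, not Hesperus itself: it assumes only the hypothetical convention, visible in this appendix, that a chain member is printed as the word immediately followed by its chain number.

```python
import re
from collections import defaultdict

def extract_chains(text):
    """Group annotated words by their trailing chain-index digits.

    Assumes the marked-up convention used in this appendix: a word
    belonging to lexical chain N is printed as wordN, e.g. 'copyright1'
    or 'body5'. Unannotated words carry no digit suffix and are ignored.
    """
    chains = defaultdict(list)
    for match in re.finditer(r"\b([a-zA-Z][a-zA-Z-]*?)(\d+)\b", text):
        word, chain_id = match.group(1), int(match.group(2))
        chains[chain_id].append(word)
    return dict(chains)

sample = "copyright1 , body5 of legal rights that protect creative0 works"
print(extract_chains(sample))
```

Running this on the opening of the Copyright text below recovers one singleton chain per annotated word; applied to a full text it collects every member of each numbered chain.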

Rosetta Stone

rosetta-stone0 Stone, black2 basalt slab bearing an inscription that was the key to the

deciphering of Egyptian hieroglyphics and thus to the foundation of modern Egyptology.

Found by French troops in near the town of Rosetta in Lower Egypt, it is now in the

British Museum, London. The stone was inscribed in BC with a decree praising the

Egyptian king Ptolemy V. Because the inscription appears in three scripts, hieroglyphic,

demotic, and greek1 , scholars were able to decipher the hieroglyphic and demotic

versions by comparing them with the Greek version. The deciphering was chiefly the

work of the British physicist Thomas Young and the French Egyptologist Jean Fran ois

Champollion.

Table II-1: Generic Document Profile for �Rosetta Stone� from MS Encarta

Rosetta Stone GDP

Roget Categories

Percentage

writing_586_3743_n 36.91

language_557_3608_n 17.14

indication_547_3541_n 12.63

record_548_3564_n 10.06

abode_192_1209_n 7.14

production_164_1050_n 4.74

beginning_68_455_n 4.29

interpretation_520_3359_n 4.23

revelation_974_6288_n 1.37

1 It is an artifact of the algorithm that non-ASCII and numeric text is omitted.
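A Generic Document Profile such as Table II-1 is an attribute-value vector of Roget categories and strengths, so two texts can be compared with a standard vector similarity measure. The sketch below uses cosine similarity over the top categories of the msRosetta profile against a hypothetical second profile; the thesis's actual comparison function is the one defined in Chapter 6, so this is only an illustration of the idea.

```python
import math

def gdp_similarity(gdp_a, gdp_b):
    """Cosine similarity between two Generic Document Profiles,
    each a mapping from Roget category label to a percentage weight."""
    cats = set(gdp_a) | set(gdp_b)
    dot = sum(gdp_a.get(c, 0.0) * gdp_b.get(c, 0.0) for c in cats)
    norm_a = math.sqrt(sum(v * v for v in gdp_a.values()))
    norm_b = math.sqrt(sum(v * v for v in gdp_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Top categories of the "Rosetta Stone" GDP from Table II-1.
ms_rosetta = {
    "writing_586_3743_n": 36.91,
    "language_557_3608_n": 17.14,
    "indication_547_3541_n": 12.63,
    "record_548_3564_n": 10.06,
}
# A hypothetical second profile, for illustration only.
other = {"writing_586_3743_n": 30.0, "language_557_3608_n": 20.0}
print(round(gdp_similarity(ms_rosetta, other), 3))
```

A profile compared with itself scores 1.0, and profiles sharing no Roget categories score 0.0, which matches the intuition that texts on the same broad subject should have strongly overlapping category vectors.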


Copyright

copyright1 , body5 of legal rights that protect creative0 works from being reproduced,

performed, displayed2 , or disseminated by others without permission10 . The owner of

copyright has the exclusive right to reproduce a protected work; to prepare14 other works

based on the protected work; to sell4 , rent, or lend copies of the protected work to the

public; to perform protected works in public; and to display copyrighted works publicly.

These basic exclusive rights of copyright owners are subject to exceptions depending on

the type of work and the type of use made by others. The term7 work used in copyright

law refers to any original creation of authorship fixed in a tangible13 medium. Thus,

works that can be protected by copyright include literary pieces, musical compositions,

dramatic selections, dances, photographs3 , drawings, paintings, sculpture, diagrams,

advertisements, maps, motion-pictures pictures, radio and television programs, sound

recordings11 , and computer12 software programs. copyright does not protect an idea or

concept; it only protects the way in which an author has expressed an idea or concept. If,

for example, a scientist publishes an article explaining a new process for making a

medicine, the copyright prevents others from copying the article, but it does not prevent

anyone from using the process described to prepare the medicine. In order to protect the

process, the scientist must obtain a patent. History of copyright The first real copyright

law, enacted in by the British Parliament, was the Statute of Anne. This law forbade the

unauthorized printing6 , reprinting, or importing of books for a limited number of years.

In the United States, the founding fathers recognized the need to encourage creativity by

protecting authors. They placed in the constitution of the United States a provision giving

Congress the power “to promote the progress of science and useful arts, by securing for

limited times to authors and inventors the exclusive right to their respective writings and

discoveries” ( art. I, Sect. ). This provision gave the federal government the power to

enact copyright and patent statutes. In , Congress passed the first U.S. copyright law.

Since then, the copyright statutes have been expanded and changed by Congress many

times. A major revision of U.S. law was made in the copyright Act, which remained the

basic framework for protection until January , , when the copyright Act of went into

effect. The act, which is the legal basis for copyright protection today, made substantial

and important changes in U.S. law. copyright in the United States The copyright Act

established a single system of federal statutory protection for all eligible works, both

published and unpublished. For works created after January , , copyright becomes the

property of the author the moment the work is created and lasts for the author's life plus

years. When a work is created by an employee in the normal course of a job, however,


the copyright becomes the property of the employer and lasts for years from publication

or years from creation, whichever is shorter. For works created before , the old act

provided that the copyright endured for years from the date the copyright was secured

and might be extended for another years, for a maximum term of protection of years. The

new act extended the renewal term for copyrights existing on January , , so that

copyright protection would last for years. However, for works produced in the United

States prior to , the owner must have filed a renewal application to obtain the benefit of

the renewal period. works that first obtained statutory copyright protection in or later

automatically receive the benefit of the renewal period. notice Although copyright

becomes effective on creation of a work, for works publicly distributed before march , ,

the copyright is potentially invalidated unless a prescribed copyright notice is placed on

all publicly distributed copies. For works published on or after march , , the use of a

copyright notice is optional, though recommended. This notice consists either of the

word copyright, the abbreviation Copr., or the symbol accompanied by the name of the

owner and the year of first publication (for example, John Doe ). In most printed books

the copyright notice appears on the reverse side of the title-page page. The use of the

notice is the responsibility of the copyright owner and does not require advance

permission from, or registration with, the copyright Office. A similar notice bearing the

symbol (for example, Doe record company) may be used to protect sound recordings

such as phonograph records and tapes. To enforce a copyright, the U.S. author or owner

must have applied to register with the copyright Office in Washington, D.C. To register,

the copyright owner must fill out the application, pay a fee, and send two complete

copies of the work, if published, which will be placed in the library of Congress. The

sooner the claim to copyright is registered, the more remedies the author may have in

any litigation to enforce the copyright. licensing copyright can be sold or licensed to

others. licenses of copyrights are normally granted in written contracts24 agreed-to to by

all parties involved. For example, an author of a novel can license one publisher to print

the work in hardbound copies, another publisher to produce paperback copies, and a

motion-picture company to make a movie based on the novel. A sale or license of

copyright made on or after January , , can be terminated by the author (or by the

author's family) years after the sale or license. The purpose of allowing such a

termination is to permit an author to obtain more financial reward if the work remains

commercially valuable over a long period of time. For the sale or license made before ,

the author has a similar right of termination years from the date the copyright was

originally secured or beginning on January , , whichever is later. The law sets up


conditions for reproduction of copies by libraries and archives and for transmission of

audiovisual and other programs and forbids unauthorized duplication of sound recordings.

It provides for royalty payments on recorded music, on public performance of sound

recordings by coin- operated phonographs, and on transmission of some television

programs. A radio station that broadcasts a recording of copyrighted music is

“performing” the work publicly and for profit and must be licensed to do so. In ,

however, the Supreme Court of the United states ruled that noncommercial use of

videocassette recorders does not violate copyright law. Infringement Copyright

infringement is any violation of the exclusive rights mentioned above-for example,

making an unauthorized copy of a copyrighted book. Infringement does not necessarily23

require word22 -for- word reproduction; “substantial similarity” to the copyright-

protected content of a work is sufficient. Generally, copyright infringements are dealt

with in civil lawsuits in federal court. If infringement is proved, the copyright owner has

several remedies available. The court may order an injunction against future

infringement; the destruction of infringing copies; reimbursement for any financial loss

incurred by the copyright owner; transfer of profits made from the sale of infringing

copies; and payment of fixed damages (usually between $ and $ , ) for each work

infringed, as well as court21 costs and attorney's fees. In copyright cases, a criminal

penalty of imprisonment and/or a fine can be imposed for knowingly infringing the

copyright for profit20 . fair Use An exception to the rule of copyright infringement is the

concept known as fair use, which permits the reproduction of copyrighted material for

purposes such as criticism9 , comment, teaching and research. In deciding whether a use

falls within the fair use exceptions, several factors are considered, including the purpose

of the use and the effect of the use on the value of the original-work work. Examples of

fair use include the quotation of excerpts from a book8 , poem, or play in a critical review

for purposes of illustration or comment; quotation of passages in a scholarly or technical

book to illustrate or clarify the author's observations; use in a parody of some of the

work being parodied; summary of a speech or article, with quotations, in a news report;

and reproduction by a teacher or student of a portion of a work to illustrate a lesson.

Because works created by the U.S. government cannot be copyrighted, material from the

many publications put-out out by the U.S. Government Printing Office may be

reproduced without fear of infringement. Advances in technology Technological

development has produced and will continue to produce new and different ways to store

information in smaller19 and smaller spaces, retrievable by electronic methods. Congress,

in passing the copyright Act, recognized that it could not foresee all the new methods of


fixing or storing information. Accordingly, it broadly defined the category of

copyrightable material to include all “original works of authorship fixed in any tangible

medium of expression, now known or later developed, from which they can be perceived,

reproduced, or otherwise communicated, either directly or with the aid of a machine or

device.” Thus, an author who types a story on a computer, which stores it on a tape or

disc in computer18 memory, has “fixed” the work in a “copy” sufficient for copyright

protection. International copyright Almost every nation has some form of copyright

protection for authors and artists. Most do not require marking published copies with a

formal copyright notice or registering the claim with the copyright Office, though use of

appropriate copyright notices is recommended to maximize international protection. The

United states is a member of the Universal copyright Convention (UCC), an

international treaty organization in effect since , designed to eliminate17 discrimination

against foreigners in copyright protection. More than nations belong to the UCC. Every

member nation must give foreign works that meet UCC requirements the same copyright

protection as that nation gives to domestic works and authors. An American who wishes

to secure copyright protection in the United states and in UCC member nations at the

same time can do so by marking all published copies with a copyright notice that

satisfies the provisions of both the UCC treaty and domestic U.S. law. This notice

includes the symbol , the name of the copyright owner, and the year of first publication.

Although no such thing as an “international copyright” exists, it is easy for an author to

obtain copyright protection in many nations. Several other international conventions also

provide copyright protection. As of march , , the United states became a member of the

Berne convention, which protects any works first published in a member nation, without

formalities such as a copyright notice. The Buenos Aires convention, a multilateral16

treaty of North and South American nations including the United states, requires a

statement such as “All Rights Reserved” to be printed in the copyright notice. In

February the United states and China signed an agreement to prevent companies in China

from illegally manufacturing items, such as compact discs and computer15 software, in

violation of American copyrights. The United states estimates that this piracy caused

American businesses to lose $ billion a year. To stop copyright violations, China agreed

to establish task forces and increase the power of customs officials.

“copyright,” Microsoft(R) Encarta(R) Encyclopedia. (c) - Microsoft Corporation. All

rights reserved.


Table II-2: Generic Document Profile for “Copyright” from MS Encarta

Copyright GDP

Roget Categories

Percentage

dueness_914_5874_n 13.39

production_164_1050_n 12.98

publication_528_3426_n 6.89

copy_22_134_n 5.5

numeration_86_572_n 3.2

transfer_780_5006_n 2.66

receptacle_194_1234_n 2.55

composition_56_391_n 2.51

time_108_672_n 2.45

reproduction_166_1069_n 2.2

causation_156_971_n 2.19

compact_765_4931_n 2.13

musical-instrument_414_2677_n 2.08

manifestation_522_3374_n 2.01

punishment_962_6194_n 1.95

judgment_480_3090_n 1.77

interpretation_520_3359_n 1.42

payment_804_5127_n 1.33

component_58_402_n 1.28

requirement_627_4025_n 1.21

imagination_513_3317_n 1.2

conditions_766_4936_n 1.18

prosperity_730_4728_n 1.06

smallness_33_219_n 1.06

exclusion_57_396_n 1.06


Socialism

socialism0 socialism, economic and social1 doctrine2 , political movement inspired by this

doctrine, and system or order established when this doctrine is organized in a society.

The socialist doctrine demands state-ownership ownership and control of the

fundamental5 means of production and distribution of wealth, to be achieved by

reconstruction of the existing capitalist or other political system of a country through

peaceful, democratic, and parliamentary means. The doctrine specifically advocates

nationalization of natural-resources4 resources, basic industries3 , banking and credit

facilities, and public utilities. It places special emphasis on the nationalization of

monopolized branches of industry and trade, viewing monopolies as inimical to the

public welfare. It also advocates state-ownership ownership of corporations in which the

ownership function has passed from stockholders to managerial personnel. Smaller and

less vital enterprises would be left under private ownership, and privately held

cooperatives would be encouraged. These are the tenets of the socialist party of the U.S.,

the labour party of Great Britain, and labor or social democratic parties of various other

countries. Therefore they constitute the centrist position held by most socialists. Some

political movements calling themselves socialist, however, insist on the complete

abolition of the capitalist system and of private profit, and at the other extreme are

socialist programs having objectives entailing even fewer changes in the social order

than those outlined above. The ultimate goal of all socialists, however, is a classless

cooperative commonwealth in every nation of the world. Comparison with communism

The terms socialism and communism were once used interchangeably. Today, however,

communism designates those theories and movements that, in accordance with one view

of the teachings of Karl Marx and Friedrich Engels, advocate the abolition of capitalism

and all private profit, by means of violent revolution if necessary. Marx organized the

international Workingmen's Association, or First international; when this congress met

at Geneva in , it was the first international forum for the promulgation of communist

doctrine. This doctrine was later explained by Lenin, who defined a socialist society as

one in which the workers, free from capitalist exploitation, receive the full product of

their labor. Most socialists deny the claim of communists to have achieved socialism in

the USSR, which they regarded as an authoritarian tyranny. But after World War II, many

communist-led political parties in the Soviet sphere of influence still used the

designation socialist in their names. In East Germany (now part of the united federal

republic of Germany), for example, the name adopted by the merged communist and


social democratic parties was the socialist Unity party. The modern socialist

movement, as distinguished from communism, had its origin largely in the revisionist

movement of the late th century. The worsening condition of the proletariat, or workers,

and the class war predicted by Marx for Western Europe had not come about. Many

socialist thinkers began to doubt the indispensability of revolution and to revise other

basic tenets of Marxism. Led by the German writer Eduard Bernstein, they declared that

socialism could best be attained by reformist, parliamentary, and evolutionary methods,

including the support of the bourgeoisie. Moderate socialism Such a view was held by

the founders of the fabian society, organized in by British social reformers Sidney and

Beatrice Webb and their associates. The fabians in turn6 helped to form the British

independent labour party in ; it became affiliated with the newly organized labour party

in . In the U.S. a socialist Labor party was founded in . This party, small as it was,

became fragmented in the s. In a moderate faction of the party under Morris Hillquit

joined with the social democratic party of Eugene V. Debs and the christian socialists

of George D. Herron to form the socialist party. The moderate, or revisionist, type of

socialism found its clearest expression in the organization in Paris in of the Second

international. This body differed7 from the First international in that it was merely a

coordinator of the activities of its affiliated political parties and trade unions. The

Second international also diverged in ideology; a majority of its members, led by Eduard

Bernstein, were revisionists. The left-wing-wing minority was led by Lenin and the

German revolutionist Rosa Luxemburg; a third element, Marxist but opposed to Lenin,

was led by the German theorist Karl Kautsky. The Second international declared its

opposition to the preparations for war being made by most European governments. rise of

the left-wing Wing When World War I began in , modern European socialist leaders

supported their respective governments. Leaders of the socialist party in the U.S. and of

the labour party of Great Britain did not. Spokespersons for the left-wing wing, led by

Lenin, labeled the war an imperialist struggle and urged the workers of the world to

convert the war into a proletarian revolution or to turn the imperialist war into a class

war. This ideological conflict resulted in the collapse of the Second international.

Revived after World War I, it was never again important. Despite the decline of the

Second international, socialist parties made substantial gains during the years following

World War I and during World War II. In Great Britain, the labour party under Ramsay

MacDonald was in power for ten months in and again from to , but it lacked

parliamentary majorities and accomplished little. In Australia the Labor party held office

from to , from to , and from to . The labour government of New Zealand, elected in ,


remained in power until . In Scandinavia, candidates of the social democratic parties of

Denmark, Norway, and Sweden were elected to high positions early in the s; these

parties subsequently became dominant in Scandinavia. socialism Versus fascism During

the s and ' s socialist and communist parties were in continuous conflict. One point of

contention was the question of support for the USSR. socialists castigated communists as

agents of the Soviet union and traitors to their own countries. Also during the ' s and ' s,

Fascist regimes in Germany and Italy caused both socialists and communists to develop

new tactics. Attempts were made in several countries to form a united front of all

working-class organizations opposed to fascism, but the movement had limited success,

even in France and Spain, where it did well in the elections. Failure of the communists

and socialists of Germany to unite is regarded as one cause of the success of the national

socialists. The fragile alliance that was achieved between socialists and communists in

some countries during this “popular-front Front” period was destroyed in by the

conclusion of a nonaggression pact between Germany and the USSR. socialists

condemned this act as a demonstration of the community of interest between two

totalitarian governments. In august , Germany invaded Poland, precipitating World War

II, and socialists in the allied countries immediately expressed full support for their

governments. After World War II An upsurge occurred in support of socialist parties

after the war, chiefly8 in Western Europe. The greatest advance was scored in great

Britain in ; the victorious labour party had in its campaign advocated the socialization of

the British economy. In ensuing years individual socialists won victories and in some

instances formed governments in France, Italy, Belgium, the Netherlands, Norway,

Sweden, and numerous other European countries. The socialist international, similar to

the Second international, was organized in in Frankfurt, West Germany (now part of the

united federal republic of Germany). In Asia, socialism made progress in India, Burma

(now known as Myanmar), and Japan; the Asian socialist Conference was formed as the

Eastern equivalent of the socialist international. The Soviet satellites, the “people's

democracies” of Eastern Europe, including Poland, Czechoslovakia (now the Czech

republic and Slovakia), Hungary, Bulgaria, and Romania, came under the control of

Communist- socialist parties, but these were dominated in all cases by communists.

China established a communist government, as did Albania and, later, Cuba. Emerging

nations of Africa, Asia, and Latin America frequently adopted social systems that were

largely socialist in orientation. In many instances, these nations took over properties held

by foreign owners. The influence of the socialist party of the U.S., led from to by

Norman Thomas, gradually declined, although much of its economic program became


law under the New Deal of president Franklin D. Roosevelt. The period following World

War II was also marked by intensification of the conflict between socialists and

communists. socialists approved such measures, initiated in the U.S. and supported by

the governments of Western Europe, as the European recovery Program and the North

Atlantic Treaty organization, declaring that the former would stem the tide of

totalitarian communism by raising living standards and that the latter would achieve the

same end by strengthening Western Europe militarily. communists denounced these

measures as imperialist preparations for war against the USSR. socialist political parties

have suffered occasional setbacks in elections in those countries in which they form half

of the two- party-system system, as in New Zealand in (they had been in power from to

and from to ) and in great Britain in (after five years in power). Nonetheless, extensive

and fundamental parts of the socialist program are permanent features of contemporary

economic and social life.

Contributed by: Robert E. Burke Norman Thomas

“socialism,” Microsoft(R) Encarta(R) Encyclopedia. (c) - Microsoft corporation. All

rights reserved.


Table II-3: Generic Document Profile for “Socialism” from MS Encarta

Socialism GDP

Roget Categories

Percentage

party_708_4561_n 17.11

joint-possession_775_4981_n 15.38

belief_485_3123_n 7.53

business_622_3986_n 6.47

authority_733_4749_n 6.03

means_629_4037_n 4.77

war_718_4637_n 2.83

form_243_1595_n 2.72

importance_638_4090_n 2.69

disagreement_25_152_n 2.52

improvement_654_4221_n 2.35

humankind_371_2461_n 2.28

part_53_366_n 2.2

circumstance_8_48_n 2

moderation_177_1127_n 1.68

state_7_43_n 1.61

end_69_466_n 1.47

property_777_4991_n 1.4

motive_612_3921_n 1.35

enquiry_459_2935_n 1.33

management_689_4454_n 1.24

adversity_731_4738_n 1.2

violence_176_1117_n 1.2

thought_449_2873_n 1.14


Ballot

ballot0

ballot in modern usage, a sheet of paper used in voting, usually in an electoral-system

system that allows the voter to make choices secretly. The term5 may also designate the

method and act of voting secretly by means of a mechanical device. Used in elections in

all democratic3 countries, the ballot method protects voters from coercion and reprisal in

the exercise of their vote. Wherever the practice of deciding questions by free vote has

prevailed, some form of secret voting has always been found6 necessary. History of

balloting

In ancient Greece, the dicasts ( members of high courts) voted secretly with balls, stones,

or marked shells. Legislation was enacted in Rome in BC establishing a system of secret

voting. Long before the passage of this law, however, questions sometimes were decided

in Rome in public meetings1 by means of the ballot. Colored balls were used as ballots

during the middle-ages Ages. This form has survived to modern-times times, particularly

in clubs or associations in which voting decides the question of admitting or rejecting

proposed new members. Each voter receives two balls, one white, indicating2

acceptance, and the other black, indicating rejection; they are then deposited secretly in

appropriate receptacles so as to indicate a favorable or unfavorable decision. In some

organizations, candidates for admission are rejected if any black balls are found among

the white balls. In modern-times times, the most common form of ballot has been the

written8 or printed ticket Although the ballot had been used previously by the British

parliament to conceal the voting record of its members, in the house of Lords rejected a

proposal7 of the house of Commons providing for secret voting on matters before

parliament. The French Chamber of deputies voted by ballot from to . With the

development of democracy the practice of voting secretly in legislative assemblies

responsible to the people was generally abandoned. Toward the end of the th century,

demands were made in Great Britain that elections to parliament be conducted by secret-

ballot ballot, but the first proposal of this kind was not introduced into parliament until .

The proposal was rejected, but subsequently advocates of Chartism incorporated the

demand in their petitions to parliament. Despite repeated attempts by proponents of the

legislation to secure its enactment, parliament took no effective action until . In that year

the ballot Act was approved providing for secret voting at all parliamentary elections,

except parliamentary elections held at universities, and at all municipal elections.


similar legislation had been previously adopted in France ( ) and Italy ( ). balloting in the

U.S.

Following the American Revolution, the secret-ballot ballot, used universally during the

period of British colonial rule, was adopted in most of the newly established states.

development of the political-party party system resulted in various abuses of the ballot

system in many states during the first half of the th century, when the law permitted the

printing and distribution of ballots to the voters both by candidates and by political

organizations. This system, which led to confusion and fraud at the polls, produced

widespread public sentiment for ballot reform. In the Massachusetts state legislature

initiated remedial action, adopting legislation that provided for the so- called Australian

ballot in state elections. The principal features of this method, first used in Australia in

and subsequently adopted by every state in the union, are the preparation, printing, and

distribution of the ballot by government4 agencies; the use of a blanket ballot listing the

names and party designations of all candidates for all offices to be filled; and secret

voting under government supervision. Formerly, most of the U.S. used the party-

column type of blanket ballot, in which the names of candidates are arranged in columns

allocated to their respective political parties. By , however, most states had adopted the

office- column type of listing, in which the names are arranged under the office sought,

either alphabetically or by party, with the party label appearing after the name in either

case. When the party- column ballot is used, the party emblem is often added to the

party column and the party circle. In some places a party emblem is used on the office-

column ballot, as in New York state. The purpose of the emblem and the party circle is

to make it easier for loyal but ill-informed party voters to vote a straight party ticket In

addition, some states, counties, and cities provide ballots with extra space for write-in

votes for candidates not listed. The preferential ballot, now rarely used, allows voters to

indicate with numerals the order of their preference among the candidates for the same

office. The long ballot, on which candidates for administrative as well as for legislative

office were listed, has gradually been replaced, through the efforts of such reformers as

president Woodrow Wilson, by the short ballot listing names of legislative candidates

only, administrative offices often being filled largely by appointment. Various methods

have been devised for the nomination of candidates to ensure that only the names of

authorized office seekers appear on the ballot. Many states and localities require a

candidate to file a petition before the name can appear on the ballot. The petition must

contain a certain minimum number of signatures of registered voters from a certain

minimum number of counties in the state, or districts in the locality. The validity of the


signatures may then be challenged by other candidates, with final adjudication of

disputes made by the appropriate board of elections, or in some cases by the courts. To

facilitate voting and to reduce the possibility of fraud, a mechanical device operated

either manually or electrically began to be adopted in various parts of the U.S. after ,

when New York state first authorized such use. The list of candidates is arranged on the

face of a voting machine according to the office- column model, either horizontally or

vertically. The voter indicates a preference by placing a pointer next to the name of the

candidate of his or her choice. Space is also provided for write-in votes. Each voting

machine is equipped with curtains, which the voter closes to form a complete, private

polling-booth booth. When the voter has finished voting, he or she pulls a special lever

that opens the curtains, returns the pointers to their original positions, and starts the

mechanical counting devices that record and add up the votes. The use of voting

machines in U.S. elections depends on state legislation. Despite the fact that the

Australian ballot system, or a modification of it, is used throughout the United states,

fraudulent voting, although greatly reduced, still occurs in some communities. This is

accomplished chiefly by “repeating,” an unlawful practice whereby citizens register and

vote at more than one polling place, and by “stuffing,” or putting extra votes into the

ballot-box box. All such frauds are generally accomplished with the connivance of

dishonest election officials, but may be counteracted in some cases by calling for a

recount after votes have been tallied. In some states, where voting machines are used

exclusively, it is claimed that virtually no fraudulence occurs, although efforts are

sometimes made to damage voting machines so as to reduce the number of votes given

to a favored candidate.

“ballot,” Microsoft(R) Encarta(R) Encyclopedia. (c) - Microsoft Corporation. All rights

reserved.


Table II-4: Generic Document Profile for “Ballot” from MS Encarta

Ballot GDP

Roget Categories

Percentage

choice_605_3881_n 22.6

assemblage_74_491_n 13.75

authority_733_4749_n 5.28

management_689_4454_n 4.32

indication_547_3541_n 4.01

request_761_4913_n 3.5

council_692_4469_n 3.41

record_548_3564_n 3.34

judgment_480_3090_n 2.65

meaning_514_3325_n 2.49

production_164_1050_n 2.25

enquiry_459_2935_n 2.12

friendship_879_5636_n 2.1

publication_528_3426_n 2.03

greatness_32_199_n 1.95

end_69_466_n 1.81

evidence_466_2993_n 1.48

precept_693_4473_n 1.46

use_673_4363_n 1.44

serial-place_73_488_n 1.35

repetition_106_661_n 1.33

instrumentality_628_4032_n 1.2

propagation_167_1072_n 1.15

beginning_68_455_n 1.1

negation_533_3464_n 1


AI

msAI

artificial-intelligence6 Intelligence or ai, a term that in its broadest1 sense would indicate

the ability of an artefact to perform3 the same kinds of functions that characterize human

thought. The possibility of developing some such artefact has intrigued human beings

since ancient times. With the growth of modern science, the search-for5 for ai has taken

two major directions: psychological and physiological research into the nature of human

thought, and the technological development of increasingly sophisticated computing

systems4 . In the latter sense, the term ai has been applied to computer systems and

programs capable of performing tasks more complex than straightforward programming,

although still far from the realm of actual thought. The most important fields of research

in this area are information processing, pattern recognition, game playing, and applied

fields such as medical diagnosis. Current research in information processing deals with

programs that enable a computer to understand written0 or spoken information and to

produce summaries, answer2 specific questions, or redistribute information to users

interested in specific areas of this information. Essential to such programs is the ability

of the system to generate grammatically correct sentences and to establish links between

words and ideas. Research has shown that whereas the logic of language structure, its syntax, submits to programming, the problem of meaning, or semantics, lies far deeper, in

the direction of true ai. In medicine, programs have been developed that analyse the

disease7 symptoms, medical history, and laboratory-test results of a patient, and then

suggest a diagnosis to the physician. The diagnostic program is an example of a so-called expert system: programs designed to perform tasks in specialized areas as a

human would. Expert systems take computers a step beyond straightforward

programming, being based on a technique called rule-based inference, in which

preestablished rule systems are used to process the data. Despite their sophistication,

expert systems still do not approach the complexity of true intelligent thought. Many

scientists remain doubtful that true ai can ever be developed. The operation of the human

mind is still little understood, and computer design may remain essentially incapable of

analogously duplicating those unknown, complex processes. Various routes are being

used in the effort to reach the goal of true ai. One approach is to apply the concept of

parallel processing interlinked and concurrent computer operations. Another is to create

networks of experimental computer chips, called silicon neurons, that mimic data-processing functions of brain cells. Using analogue technology, the transistors

in these chips emulate nerve-cell membranes in order to operate at the speed of neurons.


Table II-5: Generic Document Profile for "AI" from MS Encarta

AI GDP

Thesaural Category            Percentage
language_557_3608_n               15.82
remedy_658_4256_n                  9.67
intelligence_498_3214_n            8.93
enquiry_459_2935_n                 7.02
tool_630_4040_n                    5.95
action_676_4375_n                  5.43
ill-health_651_4185_n              4.95
relation_9_54_n                    4.37
arrangement_62_427_n               4.17
numeration_86_572_n                2.92
whole_52_356_n                     2.9
benevolence_896_5770_n             2.77
agency_173_1101_n                  2.13
substantiality_3_19_n              1.86
knowledge_490_3165_n               1.73
order_60_410_n                     1.73
veracity_540_3501_n                1.5
utility_640_4107_n                 1.5
production_164_1050_n              1.44
conversion_147_917_n               1.35
power_160_1004_n                   1.25
phrase_563_3646_n                  1.23
intention_617_3951_n               1.12
greatness_32_199_n                 1.09
latency_523_3384_n                 1.06


Breakdance

msBreakdance

Rock Music and Its Dances

In the s the quietly sensuous movements of the Latin dances

became the provocative hip rolls0 of the singer Elvis Presley, whose first major1 record

was released in . Also in the mid- s, rock and roll became a national phenomenon when

Bill Haley and His Comets were featured in the film Rock Around the Clock, and the

television show "American Bandstand" began its broadcasts of dancing teenagers.

American society underwent fundamental upheavals during this period and the following

decade with the civil rights movement, protests against the war in Vietnam, and such

events as the famous music festival at Woodstock, New York, in . In the rock musician

Chubby Checker ushered in the twist2 , performed with gyrating hips and torso and a body

attitude that seemed to express "doing your own thing". The dances of the s such as the

fish, the hitchhiker, the frug, and the jerk were free and individualistic. People danced en

masse, both sexes with long hair, all dancing by themselves and inventing as they went

along. Several contradictory trends appeared in the s and s. Couple dancing, enhanced by

the individuality of the s, returned in the s with the hustle and other elaborately

choreographed dances performed to disco music, a simple form of rock with strong

dance rhythms. Alongside the disco movement, which dominated the s and s, the more

outrageous punk rock movement brought in its wake slam dancing, which involved

leaping, jumping, and sometimes physical attack, and in the mid- s the acrobatic solo-dance form known as break dancing. The late s and s have seen the development of

rave culture in which people dance very energetically to electronically-based music with

a beat.


Table II-6: Generic Document Profile for "Breakdance" from MS Encarta

Breakdance GDP

Roget Categories              Percentage
amusement_836_5332_n               21.5
intrinsicality_5_28_n             15.12
action_676_4375_n                 15.03
music_412_2658_n                  10.04
agitation_318_2139_n               8.94
oscillation_317_2132_n             8.43
whole_52_356_n                     5.2
period_110_686_n                   3.71
musician_413_2666_n                3.06
drama_594_3803_n                   2.05
form_243_1595_n                    1.07
greatness_32_199_n                 1.07
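Each Generic Document Profile above is an attribute-value vector of Roget categories and their percentage strengths, and two texts can be compared by comparing their profiles. The following is a minimal sketch of such a comparison using cosine similarity; the dictionaries abbreviate Tables II-4 and II-6 to a few top entries, and the function is illustrative only, not the Hesperus implementation:

```python
import math

def cosine_similarity(gdp_a, gdp_b):
    """Compare two Generic Document Profiles held as
    {roget_category: percentage} dictionaries."""
    shared = set(gdp_a) & set(gdp_b)
    dot = sum(gdp_a[c] * gdp_b[c] for c in shared)
    norm_a = math.sqrt(sum(v * v for v in gdp_a.values()))
    norm_b = math.sqrt(sum(v * v for v in gdp_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Abbreviated profiles from Tables II-4 and II-6 (top entries only)
ballot = {"choice_605_3881_n": 22.6, "assemblage_74_491_n": 13.75,
          "authority_733_4749_n": 5.28, "greatness_32_199_n": 1.95}
breakdance = {"amusement_836_5332_n": 21.5, "intrinsicality_5_28_n": 15.12,
              "action_676_4375_n": 15.03, "greatness_32_199_n": 1.07}

print(round(cosine_similarity(ballot, breakdance), 4))  # 0.0025
```

The two profiles share only one category (greatness_32_199_n), so the similarity is close to zero, which is what one would expect for texts on such different subjects.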


Appendix III. Lexical Chain Visibility Algorithm

The following was designed to improve the visibility of lexical chains, by showing them

embedded in the text from which they are derived. This was described in Section 3-8.

Algorithm III-1: An Algorithm to make the lexical chains in a text visible.

Step 1.0: Output the Chains
1.1 For ALL the Chains in the ChainStore
    Derive a unique name "ChainFile" from the text and chain number
    Store "ChainFile" into the Chain
    Write the Chain into a file "ChainFile"

Step 2.0: Serialise Links
2.1 Initialise ALLChains to be NULL
2.2 For ALL the Chains in the ChainStore
    Append the Chain into ALLChains    // this is one long flat chain
2.3 SORT the links in ALLChains by link.WordNumber

Step 3.0: Output HTML Text
3.1 Re-open the INPUT file
3.2 Open and initialise the HTML file
3.3 Let WordNumber = 0
3.4 For all WORDs in the INPUT
    Increment WordNumber
    If the WordNumber is in ALLChains
        Print the WORD to the HTML file using MARKUP
    else copy the WORD from INPUT to HTML
3.5 Close the HTML file

Step 4.0: Procedure MARKUP(Link)
4.1 If the link type is 0
    Start Anchor
    Output the word
    Link to "ChainFile"
    Output the Chain Number as superscript
4.2 Else output the word    // Derive character face from link type
                            // Derive colour from the Chain number
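The serialisation and markup steps above can be sketched in a few lines of Python. The Chain and Link representations (a chain as a list of (word_number, link_type) pairs) and the markup conventions are simplifying assumptions for illustration, not the actual Hesperus code:

```python
import html

def make_chains_visible(words, chains, out_path):
    """words: list of tokens from the input text.
    chains: list of chains; each chain is a list of
            (word_number, link_type) pairs (a simplified Link).
    Writes an HTML file in which chained words are marked up."""
    # Step 2: flatten every chain into one lookup table, remembering
    # the chain number each link came from.
    all_links = {}
    for chain_no, chain in enumerate(chains):
        for word_number, link_type in chain:
            all_links[word_number] = (chain_no, link_type)

    # Step 3: walk the input; chained words get markup, everything
    # else is copied through unchanged.
    parts = []
    for word_number, word in enumerate(words):
        link = all_links.get(word_number)
        if link is None:
            parts.append(html.escape(word))
        else:
            chain_no, link_type = link
            # Step 4 (MARKUP): link type 0 anchors the word to its
            # chain file and superscripts the chain number.
            if link_type == 0:
                parts.append('<a href="chain%d.html">%s</a><sup>%d</sup>'
                             % (chain_no, html.escape(word), chain_no))
            else:
                parts.append('<b>%s</b>' % html.escape(word))
    with open(out_path, 'w') as f:
        f.write('<html><body>%s</body></html>' % ' '.join(parts))
```

This produces text of the kind reproduced in Appendix II, where chained words carry a superscript chain number.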

Appendix IV. Basic Statistics of the Experimental Data

Rosetta Stone

File:                     Info_Rosetta  index  location  stone  rosetta  rosettastone
Input Words:                        99    722       203    190      786           245
Stopwords:                          47    371        82     84      299           120
Not in Thesaurus:                    9     79        47     31      100            25
Chainable:                          41    256        71     73      352            94
Disambiguation Attempts:            30    205        59     61      277            69
Disambiguations Done:                9     49        11     11       29            10

Socialism

Table IV-1

File:                     Socialism  Info_Socialism  kimsk_e  Welcome2  NOMARX4  Welcome7
Input Words:                    902             408      269       408      217       313
Stopwords:                      439             194      121       200       93       130
Not in Thesaurus:                49              38        8        37       27        52
Chainable:                      401             171      135       165       96       125
Disambiguation Attempts:        352             148      109       148       79       109
Disambiguations Done:            57              35       10        21       13        17


Artificial Intelligence

Table IV-2

File:                     Info_Artificial_Intelligence  Aidef  Welcome9  Welcome2  web-homepage  imkai
Input Words:                                       130   1837       379       125           175    205
Stopwords:                                          51    867       126        44            81     72
Not in Thesaurus:                                    8     75        32        17            18     50
Chainable:                                          68    859       210        62            73     77
Disambiguation Attempts:                            57    731       187        52            68     61
Disambiguations Done:                                6    129        25        14            12     13

Breakdance

Table IV-3

File:                     Info_Break_Dance  highline  dance  Welcome2  index  frenz-e
Input Words:                            38       818    437       248    272      244
Stopwords:                              16       408    224        94    120      126
Not in Thesaurus:                        3        67     24        21     31       18
Chainable:                              17       337    178       126    114       97
Disambiguation Attempts:                14       303    149       105    105       81
Disambiguations Done:                    2        42     41        21     16       21


Appendix V. Help information given to Experimental Subjects

This appendix gives the explanatory "help" information given to the experimental

subjects.

Text Similarity Experiment Instructions

The Text Similarity Experiment is made up of a frame split

into a heading, and three further panes. The heading pane

contains a query ("Ballot") that has been used on the Internet. The left text pane below contains a sample text on

the subject of the query found in Microsoft's Encarta

encyclopedia.

The bottom pane contains a number of statements. There

are five radio buttons next to each question. If you agree


with the statement select the leftmost button. If you

completely disagree select the rightmost button. Select the

middle button if the statement is neither completely true nor completely false.

Once the initial questions have been asked we move onto

the text comparison part of the experiment. Some six

(random) texts retrieved from the Internet are shown in the

right pane.

Different questions are shown in the bottom pane. Answer

these questions as before.

If you are using a smaller screen, you may not see

the button. If so,

scroll across after you have answered the questions.


The experiment is completely anonymous. The results are

identified by IP number and time of submission. The results

are most useful if you complete the whole experiment.


Appendix VI. Roget's Thesaurus: A Brief Overview

Introduction and purpose

The purpose of this appendix is to describe Roget's thesaurus briefly for those who are not

familiar with it. This will be done by giving an overview of the structure and organisation

of the thesaurus, and by reporting some basic data about it.

What is Roget's thesaurus?

Roget's thesaurus is a tool designed to help writers find words. That is, a writer might

have in mind a word to express a certain idea. However, he or she might know that there

exists a more precise word or phrase that better captures the nuance of meaning they wish

to articulate.

Roget's thesaurus is made up of an index, and the thesaurus entries. To find alternatives to

the word the writer had in mind, he looks it up in the index. This refers to one or more

numbered entries in the thesaurus. To find the precise word he needs, the thesaurus user

needs to read the thesaurus entries to which the index refers.
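This two-stage lookup, index first and then the referenced entries, can be sketched with a toy fragment. The entry numbers, headings, and word lists below are illustrative stand-ins, not Roget's actual content:

```python
# Toy fragment of a Roget-style thesaurus: the entries themselves
# (a heading plus related words and phrases), and an index built
# from them mapping each word to its numbered entries.
entries = {
    605: ("choice", ["choice", "option", "selection", "vote", "ballot"]),
    480: ("judgment", ["judgment", "assessment", "verdict", "vote"]),
}

index = {}
for number, (heading, words) in entries.items():
    for word in words:
        index.setdefault(word, []).append(number)

def lookup(word):
    """Stage 1: look the word up in the index.
    Stage 2: read each thesaurus entry the index refers to."""
    return [(n,) + entries[n] for n in index.get(word, [])]

for number, heading, words in lookup("vote"):
    print(number, heading, words)
```

A polysemous word such as "vote" is referred to more than one entry, and the reader (or program) must then examine each entry to select the intended sense.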

Each entry in Roget's thesaurus is made up of words and phrases that are related in

meaning to each other. The title, or heading, of the entry gives a clue as to what the words

and phrases have in common, but no precise definition is given. Furthermore, there is no

fixed relationship between the words in an entry. Some are synonyms, others antonyms,

meronyms, or unspecified types of associations.

A writer selects a better word for his context based on his understanding of the word's

meaning. Roget's thesaurus does not attempt to give definitions of word meaning as a

dictionary does. Dutch's (1962) introduction to Roget's thesaurus gives more information

on the background to the thesaurus, and advice on how it should be used.

Organisation of the Thesaurus.

Roget's thesaurus is notable for the organisation of its entries, in addition to their contents.

These have been arranged into a semantic hierarchy: a nested structure of headings and sub-headings that is up to five levels of subdivision deep. The major headings are shown in Figure VI-1 below.

Using Roget�s Thesaurus to determine the similarity of texts Jeremy Ellman

206

Figure VI-1: Major Headings in Roget's Thesaurus

These category headings are further subdivided, as shown in Figures VI-2 to VI-7 below.

A boxed minus preceding a heading indicates that it is expanded by the indented

subheadings following it. A boxed inverted cross indicates that a heading or subheading

may be expanded further. An empty grey box indicates that a heading is fully expanded,

and refers to an actual entry in the thesaurus.

Figure VI-2: Subdivisions of "Abstract Relations"

Figure VI-3: Subdivisions of "Space"

Figure VI-4: Subdivisions of "Matter"

Figure VI-5: Subdivisions of "Emotion"


Figure VI-6: Subdivisions of "Volition"

Figure VI-7: Subdivisions relating to "Existence"

Figure VI-7 shows the category "Existence" fully expanded. Each of the terms it contains that has a part of speech as a suffix is the heading or title of an entry in the thesaurus. Figure VI-8 shows the actual entry for existence (noun).

Figure VI-8: An extract from Roget's thesaurus

Space precludes fully expanding the thesaural hierarchy, as in the 1987 edition there are

6400 entries. Earlier editions of the thesaurus that used approximately one thousand

entries commonly included a table that shows the ordering and arrangement of all the

categories (e.g. see Dutch 1962). There the "Existence" main category in Figure VI-7 would contain only four thesaural entries (Noun, Verb, Adjective, and Adverb), as

opposed to a sub-category, and then the entry titles seen here. Thus, the 1987 version of


the thesaurus adds a further level to the hierarchy, but is otherwise virtually identical in

structure. Note though that the entries remain approximately the same size as the

vocabulary has been expanded. We now go on to look at some data about this vocabulary.

Basic Data about Roget's Thesaurus

This section gives some basic data about Roget�s thesaurus. Its purpose is to support a

number of design decisions made in Hesperus. The data reported include the size of the

thesaurus in terms of the total number of words it contains, and the number of unique

terms, where a term may be a word, or collocation.

The data were collected from simple word frequency counts over the whole 1987 thesaurus, supplemented by some purpose-written Perl programs.

Words in the 1987 edition of Roget's Thesaurus

• Words and phrases (including duplicates): 224489

• Unique words and phrases (eliminating duplicates): 98357

Thus, it appears that each word in the thesaurus appears approximately 2.3 times (224489/98357). Since each occurrence is in a different category, this implies that each word has approximately 2.3 senses. This figure is misleading, however, since the distribution of word senses is highly skewed. This is shown in Graph VI-1 below.


[Bar chart omitted: "Roget's Thesaurus: Word Senses vs Word Frequencies", plotting the number of words (y-axis, 0-70000) against the number of word senses (x-axis, 1-73).]

Graph VI-1: Distribution of Polysemic Words in Roget's Thesaurus

Graph VI-1 shows that of the 98357 unique words and phrases in Roget's thesaurus, 60208 (i.e. 61.2%) have only one meaning, since they are found in only one entry in the thesaurus. The remaining words have a variable number of meanings, the most frequent being "cut", which has 73 thesaural entries (excluding its collocates).
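Figures such as these can be reproduced from a machine-readable thesaurus by counting, for each unique word, the number of entries in which it occurs. The sketch below assumes each entry is available as a plain word list, which is not the format of the 1987 source text:

```python
from collections import Counter

def sense_distribution(entries):
    """entries: iterable of word lists, one per thesaural entry.
    Returns (senses, distribution): senses maps each word to the
    number of entries it occurs in (its sense count); distribution
    maps a sense count to how many words have that many senses."""
    senses = Counter()
    for words in entries:
        for word in set(words):      # one sense per entry per word
            senses[word] += 1
    distribution = Counter(senses.values())
    return senses, distribution

# A toy thesaurus in which "cut" is polysemous
entries = [["cut", "slice", "wound"],
           ["cut", "reduce", "abridge"],
           ["cut", "snub", "slight"]]
senses, dist = sense_distribution(entries)
print(senses["cut"])   # 3
print(dist[1])         # 6 words occur in only one entry
```

Applied to the full thesaurus, `dist` gives exactly the skewed distribution plotted in Graph VI-1.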

Collocations are sequences of two or more words that are found together and have

acquired a lexical identity separate from their component words. Roget's thesaurus is rich

in collocations of various lengths.

Graph VI-2 below shows the lengths of collocations found in Roget's thesaurus and their frequency of occurrence. From this we can see that only 8598 collocations are longer than two words, which is 8.74% of the unique words and phrases. Their comparative rarity

supports the implementation decision (Section 5.5) to limit the search for collocations to

word pairs only.
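The restriction to word pairs can be sketched as a single left-to-right scan that greedily merges any adjacent pair found in the collocation list. The collocation set here is illustrative, not drawn from the thesaurus itself:

```python
def chunk_collocations(tokens, collocations):
    """Scan a token list, replacing any adjacent pair that forms a
    known two-word collocation with a single hyphenated term."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in collocations:
            out.append(tokens[i] + "-" + tokens[i + 1])
            i += 2                     # consume both words of the pair
        else:
            out.append(tokens[i])
            i += 1
    return out

pairs = {("polling", "booth"), ("ballot", "box")}
print(chunk_collocations("the polling booth and the ballot box".split(), pairs))
# ['the', 'polling-booth', 'and', 'the', 'ballot-box']
```

Because only adjacent pairs are tested, the scan is linear in the length of the text, which is part of the attraction of limiting the search to word pairs.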


[Bar chart omitted: "Roget's Thesaurus: Collocation Length vs Frequency", plotting frequency of occurrence (y-axis, 0-50000) against the number of words in the collocation (x-axis, 1-14).]

Graph VI-2: Frequency of Collocations in Roget's Thesaurus

Summary

This appendix has briefly described the hierarchical structure of Roget's thesaurus, and

shown an example entry. Basic statistical data about the thesaurus have also been

reported.

Appendix VII. Papers published relating to this thesis.

In accordance with the PhD regulations of the University of Sunderland, this appendix

includes refereed papers published as part of the work of this thesis. These were:

1. "On the Generality of Thesaurally Derived Lexical Links", Jeremy Ellman and John Tait, in Actes des 5es Journées Internationales d'Analyse Statistique des Données Textuelles (JADT 2000), March 2000, pp. 147-154, Ecole Polytechnique Fédérale de Lausanne, Switzerland.

2. "Word Sense Disambiguation by Information Filtering and Extraction", Jeremy Ellman, Ian Klincke and John Tait, in Computers and the Humanities, vol. 34, no. 1-2, 2000, Special Issue on "Senseval: Evaluating Word Sense Disambiguation Programs", guest editors Adam Kilgarriff and Martha Palmer.

3. "Roget's Thesaurus: An Additional Knowledge Source for Textual CBR?", Jeremy Ellman and John Tait, in Research and Development in Intelligent Systems XVI: Proc. 19th SGES Intl. Conf. on Knowledge Based and Applied Artificial Intelligence, Bramer M., Macintosh A., and Coenen F. (eds), ISBN 1-85233-231-X, pp. 204-217, 2000.

4. "SUSS: The Sunderland University Similarity System: Beneath the Glass Ceiling", Jeremy Ellman, Ian Klincke and John Tait, in Proc. SENSEVAL Workshop, University of Brighton, 1998.

5. "Using the Generic Document Profile to Cluster Similar Texts", Jeremy Ellman, in Proc. Computational Linguistics UK (CLUK 97), Jan. 1998, University of Sunderland.

6. "Using Information Density to Navigate the Web", Jeremy Ellman and John Tait, IEE Colloquium on Intelligent World Wide Web Agents, UK ISSN 0963-3308, March 1997.