Ontologies and Information Extraction


Transcript of Ontologies and Information Extraction

Ontologies and Information Extraction

International Workshop held as part of the:

Summer School on Semantic Web and Language Technologies

Organized by

Amalia Todirascu
Vincenzo Pallotta

Sponsored by

Faculty of Information and Communication Science
http://ic.epfl.ch

Laboratoire Tech-CICO / Dpt GSID, Université de Technologie de Troyes
http://tech-cico.utt.fr/

Bucharest, 28th July – 8th August 2003

Foreword

Information Extraction systems were designed to filter, select and classify the increasing amount of information available nowadays, mainly on the Web. Most of them were based on shallow natural language processing techniques, but semantics was not really used, owing to the unavailability of generic ontologies. In recent years, some generic ontologies have become available and many research projects have tried to take semantic aspects into account to obtain more precise results from IE systems. Meanwhile, important efforts concentrate on developing tools for the semi-automatic building of domain-specific ontologies, based on IE and text-mining techniques.

This workshop provides a forum for discussion between leading names in the field and researchers involved in the development of ontology-based IE systems or semi-automatic tools for building ontologies. Topics of interest include but are not limited to:

- semantic annotation;
- content-based indexing and retrieval;
- robust analysis of language data;
- text mining;
- platforms for semi-automatic ontology extraction;
- formalisms for ontology representation.

Program Committee

Nathalie Aussenac-Gilles (IRIT, Toulouse, France)
Roberto Basili (University of Rome 2 "Tor Vergata", Italy)
Bill Black (UMIST, Manchester, UK)
Paul Buitelaar (DFKI, Saarbrucken, Germany)
Amedeo Cappelli (University of Pisa, Italy)
Paola Merlo (University of Geneva, Switzerland)
Malvina Nissim (University of Edinburgh, Scotland, UK)
Fabio Rinaldi (University of Zurich, Switzerland)
Laurent Romary (LORIA, Nancy, France)
Patrick Ruch (EPFL, Lausanne, Switzerland)
Horacio Saggion (University of Sheffield, UK)
Manuela Speranza (IRST, Trento, Italy)
Steffen Staab (AIFB, University of Karlsruhe, Germany)
Valentin Tablan (University of Sheffield, UK)
Dan Tufis (Romanian Academy, Bucharest, Romania)
Ion Constantinescu (EPFL, Lausanne, Switzerland)
Violeta Seretan (University of Geneva, Switzerland)

Acknowledgements

This workshop has been organized with the support of the Swiss National Science Foundation project (IM)2 on Interactive Multimodal Information Management (http://www.im2.ch). More specifically, the workshop is part of the activity of the project's IP on Multimodal Dialogue Management.

We would like to thank Prof. Giovanni Coray and Prof. Dan Cristea for their support of our initiative, and Dr. Hatem Ghorbel for the final editing of these proceedings. We would also like to thank all the authors who submitted their articles and the members of the program committee for their contribution to ensuring the high standard of the accepted papers.

Amalia Todirascu
Vincenzo Pallotta


Ontologies and Information Extraction

International Workshop held as part of the EUROLAN'03 Summer School

July 28 - August 8, 2003
Bucharest, Romania

Call for Papers

Information Extraction systems were designed to filter, select and classify the increasing amount of information available nowadays, mainly on the Web. Most of them were based on shallow natural language processing techniques, but semantics was not really used, owing to the unavailability of generic ontologies.

In recent years, some generic ontologies have become available and many research projects have tried to take semantic aspects into account to obtain more precise results from IE systems. Meanwhile, important efforts concentrate on developing tools for the semi-automatic building of domain-specific ontologies, based on IE and text-mining techniques.

This workshop will provide a forum for discussion between researchers involved in the development of ontology-based IE systems or semi-automatic tools for building ontologies and leading names in the field.

We would like to invite all researchers to submit their original and unpublished work to the workshop. Topics of interest include but are not limited to:

- semantic annotation;

- content-based indexing and retrieval;

- robust analysis of language data;

- text mining;

- platforms for semi-automatic ontology extraction;

- formalisms for ontology representations.


Submission Requirements

Authors are invited to submit a 4-6 page extended abstract in electronic form (PostScript or PDF) by the 22nd of April 2003. Authors of accepted papers should submit the final version in electronic format no later than the 15th of June.

The documents must be in either PostScript or PDF format (PDF is encouraged, but PostScript documents are acceptable as well). If you have problems delivering your paper in one of these formats, please contact the organising committee. The maximum length of a paper should be about 10 pages. This workshop uses the same guidelines as EACL-2003. The instructions can be found at http://www.utt.fr/~amalia/OntoIE/. Please do not insert page numbers, headers or footers. If you have any problem following the style, please contact the organising committee as soon as possible.

===============================================================

Important Dates

Submission Deadline: 22nd April 2003

Notification of Acceptance: 10th May 2003

Camera-ready Papers: 15th June 2003

Demos of the presented systems are encouraged.

====================================

Programme Committee

Nathalie Aussenac-Gilles (IRIT, Toulouse, France)
Roberto Basili (University of Rome 2 "Tor Vergata", Italy) (to be confirmed)
Bill Black (UMIST, Manchester, UK)
Paul Buitelaar (DFKI, Saarbrucken, Germany)
Amedeo Cappelli (University of Pisa, Italy)
Paola Merlo (University of Geneva, Switzerland)
Malvina Nissim (University of Edinburgh, Scotland, UK)
Fabio Rinaldi (University of Zurich, Switzerland)
Laurent Romary (LORIA, Nancy, France)
Patrick Ruch (EPFL, Lausanne, Switzerland)
Horacio Saggion (University of Sheffield, UK)
Steffen Staab (AIFB, University of Karlsruhe, Germany)
Valentin Tablan (University of Sheffield, UK)
Dan Tufis (Romanian Academy, Bucharest, Romania)

Organising Committee

Amalia Todirascu (Technological University of Troyes, France)
Vincenzo Pallotta (EPFL, Lausanne, Switzerland)


Table of Contents

Foreword

Specific Domain Model Building for Information Extraction from poor quality corpus
Fabrice Even, Chantal Enguehard

Use of Ontologies for Cross-lingual Information Management in the Web
Ben Hachey, Claire Grover, Vangelis Karkaletsis, Alexandros Valarakos, Maria Teresa Pazienza, Michele Vindigni, Emmanuel Cartier, José Coch

Automatic Annotation of Multilingual Text Collections with a Conceptual Thesaurus
Bruno Pouliquen, Ralf Steinberger, Antonio Ribeiro, Camelia Ignat

Bridging the Word Disambiguation Gap with the Help of OWL and Semantic Web Ontologies
Steve Legrand, Pasi Tyrvainen, Harri Saarikoski

Using WordNet hierarchies to pinpoint differences in related texts
Ann Devitt, Carl Vogel

Domain Knowledge Articulation using "Integration Graphs"
Madalina Croitoru, Ernesto Compatangelo

Experiments in Ontology Construction from Linguistic Resources
Mariam Tariq, P. Manumaisupat, R. Al-Sayed, K. Ahmad

Ontology Aspects in Relation Extraction
Birte Lonneker

Enhancing Recall in Information Extraction through Ontological Semantics (abstract)
Sergei Nirenburg, Marjorie McShane, Stephen Beale


Specific Domain Model Building for Information Extraction from poor quality corpus

Fabrice Even IRIN

2 rue de la Houssinière 44322 Nantes, France

[email protected]

Chantal Enguehard IRIN

2 rue de la Houssinière 44322 Nantes, France

[email protected]

Abstract

This article presents an automatic information extraction method for poor-quality, specific-domain corpora. The method is based on building a semi-formal ontology in order to model the information present in the corpus and its relations. The approach takes place in four steps: corpus normalization by a correcting process, ontology building from texts and external knowledge, formalization of the model in a grammar, and the information extraction itself, which is carried out by a tagging process using the grammar rules. After a description of the different stages of our method, an experiment on a French bank corpus is presented.

1 Introduction

This research stems from the need for information extraction from a poor quality corpus (without punctuation, with poor syntax and a lot of abbreviations). Information extraction is defined as a two-step process: the modelling of the information needed and its identification in the corpus. The first step uses external knowledge sources to construct a semi-formal ontology. This ontology covers a part of the corpus domain; the modelling only concerns the knowledge actually described in the corpus. The second step uses this ontology to extract information.

After a brief presentation of different ontology-building methods, our method is presented through a description of its different stages: a pre-process for partial correction of the corpus, ontology building from this corrected corpus and external knowledge, the representation of the ontology by a grammar, and the information extraction engine based on this grammar. The results are evaluated and analysed.

We work on a French corpus composed of bank texts. These texts are compilations of interviews between bank employees and clients. Our goal is to automatically extract specific information about clients (future plans, evolution of their family situation, etc.) in order to expand a database.

2 Ontology Building Methods

There are many methods to build an ontology from corpora. Most of them are based on text content. In these approaches, texts are the main source for knowledge acquisition (Nobécourt, 2000): concepts and relations result only from a corpus analysis, without external knowledge. Aussenac-Gilles et al. (2000) follow this point of view but affirm that there can be other knowledge sources than the corpus. Such an approach includes two steps. The first one consists of the construction, by a terminological and linguistic analysis of the corpus, of a first set of concepts that corresponds to terms (conceptual primitives (Nobécourt, 2000)) and the extraction of lexical relations. The result of this stage is a base of primitive concepts and first concept relations (a terminological knowledge base (Aussenac-Gilles, 2000), (Lame, 2000)). In this stage, the designer has to select the terms and relations that will be modelled, those that are relevant, and, in the case of several meanings for one term or relation, which one must be kept. The second step is based on conceptual modelling through a study of the semantic relations between terms. This analysis produces new concept relations and new concepts, which are added to the first ones. This new set of concepts and relations is also structured into a concept semantic network. An expert of the corpus domain must validate this network to express which relations are relevant (normalization (Bouaud et al., 1995)). The result of the entire process is a hierarchical structure of a set of terms of the domain (Swartout et al., 1996). This model is also an ontology, which can be formalized by a formal or semi-formal representation.

Using such a method is powerful for syntactically correct texts but not for poor quality corpora. The terminological step is conceivable after a partial correction of the corpus, but the extraction of lexical and semantic relations is impossible: linguistic tools are highly ineffective on syntactically and lexically poor corpora.

So, for such poor quality corpora, our approach can be based on a terminological analysis for first concept identification, but a solution other than the classical methods must be found for the other modelling steps (terminological knowledge base and semantic network building).

3 Pre-process

The texts from our corpus are written manually and quickly. This corpus is distinguished by a lot of typographical errors, spelling mistakes and the use of non-standard abbreviations.

These characteristics make a correction and normalization step necessary: using the texts directly, without correction, produces a lot of undesirable information and causes poor coverage of the information. Therefore, modelling and information extraction start from the corrected corpus. This step concerns value formats (numbers with units), dates (a unique numeric representation and specific treatment of abbreviations such as days or months), abbreviation standardization (abbreviations specific to the writer are substituted by a unique one, resolving ambiguities thanks to the context) and orthographical correction of lexical and typographical errors. These pre-processes are carried out with regular expressions, which depend on the corpus (cf. Figure 1).
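The paper keeps its correction rules implicit; the following is a minimal Python sketch of such a regular-expression pre-process, assuming a small hand-made abbreviation table in the spirit of Figure 1 (the table entries and the amount pattern are hypothetical, not taken from the real system):

import re

# Hypothetical, corpus-dependent substitution table; the real system relies on
# abbreviation tables and regular expressions built for the bank corpus.
ABBREVIATIONS = {"PJCT": "project", "PURCH": "purchase", "SUMR": "summer"}

def normalize(record: str) -> str:
    """Rough normalization: abbreviation expansion, lower-casing, and
    rewriting of amounts such as '30 KEUR' into a compact form ('30ke')."""
    words = [ABBREVIATIONS.get(w, w) for w in record.split()]
    text = " ".join(words).lower()
    # Value format normalization: '<number> keur' -> '<number>ke'.
    return re.sub(r"(\d+)\s*keur\b", r"\1ke", text)

print(normalize("PJCT PURCH BMW FOR 30 KEUR NEXT SUMR"))
# -> 'project purchase bmw for 30ke next summer'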

4 Domain Model Building

According to Bachimont (2001), there are no concepts independent of context or of the current problem that would allow building the whole knowledge of a particular domain. An ontology works as a theoretical framework of a domain and is built according to a current problem. The modelling process described here is based on this definition. The ontology is built from knowledge found in the corpus and from external knowledge (experts).

4.1 Initial ontology definition

The information searched for is first expressed informally (in natural language) and then converted into predicates. These predicates are described by information patterns. This work must be done with domain experts who have a wide knowledge of the domain and who are able to express exactly what information must be extracted. This step produces a set of concept hierarchies. This set is the first sub-ontology (the initial ontology), which is composed of predicative relations between concepts (there is a relation between two concepts when one of them is an attribute of the other). These relations are in accordance with the Attribute Consistency Postulate (Guarino, 1992): in each predicative relation, any value of an attribute is also an instance of the concept corresponding to that attribute.
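As an illustration only (the paper gives no data structures), a predicative relation and the Attribute Consistency Postulate could be sketched as follows; the concept and instance names are borrowed from the examples used later in the paper:

# Minimal sketch, not the authors' implementation: a concept keeps its known
# instances, and the Attribute Consistency Postulate (Guarino, 1992) requires
# every value filling an attribute to be an instance of that attribute's concept.
class Concept:
    def __init__(self, name, instances=()):
        self.name = name
        self.instances = set(instances)

def satisfies_attribute_consistency(value, attribute_concept):
    return value in attribute_concept.instances

place = Concept("PLACE", {"paris", "nantes"})
# 'paris' may fill the 'location' attribute of a PURCHASE predicate only
# because it is an instance of PLACE:
assert satisfies_attribute_consistency("paris", place)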

4.2 Terminology definition

The terminology is built from the union of two sets of terms. The first is produced by a terminological study of the texts with the linguistic tool ANA (Enguehard and Pantera, 1995).

Here is a corpus extract: "PJCT PURCH BMW FOR 30 KEUR NEXT SUMR". After the pre-process, we obtain this new text: "project purchase bmw for 30ke next summer".

Figure 1: Example of pre-process


The other is built from a set of documents about the domain terminology, in which terms that can be used in the texts are found.

4.3 Normalization

There is not a direct correspondence between each term and a concept from the initial ontology. But these concepts have to be bound to corpus terms, so a normalization process is necessary. This process takes place in three steps: initial ontology extension, terminological knowledge base (TKB) building and unification of the models.

Initial ontology extension

The initial ontology is revised with the domain experts. Some new concepts stemming from this first set of concepts are defined and added to the hierarchy. This produces an extended initial ontology.

TKB building

With these same experts and domain-specific documents, a new set of concepts is built from the terminology: the basic concepts. From these basic concepts, others are defined recursively by inheritance. The result is a set of small hierarchies, each with a unique ancestor whose last heir is a basic concept. These hierarchies are normalized: each father is divided into sons by a unique criterion. They also respect the Guarino Rigid Property (Guarino, 2000).

Models Unification

The two preceding processes produce an ontology linked to the current problem and a hierarchical structure linked to the texts. We proceed to the unification of these two models. The extended initial ontology is unified with a hierarchy if its ancestor is a concept of the initial ontology or if a relation can be built between this ancestor and concepts from this ontology. A domain model is obtained which covers all the concepts relevant for the information searched. An oriented graph diagram first describes this model. This model defines a semi-formal ontology because it is not dependent on a representation language (Barry et al., 2001).

5 Formal representation

To make the model usable, it is formalized into a grammar. As seen in Section 4, two relation types are found in this model: hierarchical relations and predicative relations. The grammar must represent these two types of relation, so two sorts of rules are defined. On the one hand, constituent rules represent hierarchical relations: when we have a hierarchical relation between two concepts A and B, with B a son of A, we say that B constitutes A. On the other hand, predicative rules represent predicative relations: when we have a predicative relation between two concepts C and D, D is an attribute of C (the type of attribute depends on the relation). All these rules are written with a BNF-like description.

5.1 Constituent rules

A concept C is defined by a set of rules Def(C). These rules concern terms or concepts. For each X from Def(C), X only defines C, never another concept. The notation for these rules is C ::= Def(C). There are three sorts of constituent rules: select rules, conjunctive rules and disjunctive rules.

Select rules: C ::= B1 | B2 | ... . The concept C is defined by B1 or by B2 but not by both of them at the same time.

Examples:
<VEHICLE> ::= <CAR> | <MOTO> | <OTHER_VEH>
<CAR> ::= fiat | peugeot | ...

Conjunctive rules: C ::= B1 + B2 + ... . The concept C is defined by a set of concepts in which all the concepts are necessary to define C.

Example: <MORTGAGE> ::= <DC_LOAN> + <PROPERTY>

Disjunctive rules: C ::= B1 v B2 v ... . The concept C is defined by B1 or by B2 or by both of them.

Example: <PERSON> ::= <NAME> v <FIRST_NAME>
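The paper does not fix a concrete encoding for these rules; purely as a sketch, the three rule types could be kept as plain tuples, reusing the concept names from the examples above (the encoding itself is illustrative, not the authors' code):

# Illustrative encoding of constituent rules: (kind, defined concept, right-hand
# side). Terms are lower-case strings and concepts upper-case, mirroring the
# BNF-like notation of Section 5.
SELECT, CONJ, DISJ = "select", "conjunctive", "disjunctive"
CONSTITUENT_RULES = [
    (SELECT, "VEHICLE", ["CAR", "MOTO", "OTHER_VEH"]),
    (SELECT, "CAR", ["fiat", "peugeot"]),
    (CONJ, "MORTGAGE", ["DC_LOAN", "PROPERTY"]),
    (DISJ, "PERSON", ["NAME", "FIRST_NAME"]),
]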


5.2 Predicative rules

These rules describe predicative concepts (also called predicates). These are concepts with attributes. The links between these concepts and their attributes are defined by predicative relations. These rules define a predicate by one descriptor and one main attribute: the object. For a predicate, the descriptor is a unique concept; this means that a concept cannot be the descriptor of more than one rule. The object is one of a set of possible concepts (this set is defined by the model). These rules can have some optional attributes. These attributes give more information about the predicate but are neither necessary nor sufficient to define it. Predicative rules are written:

P ::= (descriptor = D; object = O1 | O2 | O3 | ...; option1 = A1 | A2 | A3 | ...; option2 = B1 | B2 | ...; ...)

Example: the predicate PURCHASE is described by Figure 2.

6 Extraction engine

The extraction engine is based on the grammar modelling of the domain. It proceeds in four steps: rules database creation, two tagging processes and information collection.

The rules database is composed of two sets of rules (constituent and predicative) inferred from the grammar. Constituent tagging is based on the constituent rules of the database (each term and concept is tagged according to the database rules). The second tagging is based on the predicative rules of the database and instantiates predicates. After these steps, the information is collected directly. This information will expand a database.

6.1 Constituent tagging

Constituent tagging finds terms, and then concepts, in the text by recursive applications of the constituent rules. Each time a concept is found, a tag marks it up.

Some specific concepts (with a known syntax) such as sums, rates or dates are tagged first. After that, tagging takes place in two steps: term tagging then concept propagation.

Some select rules define concepts from terms. In term tagging, these rules are applied to the corpus (for each rule C ::= t, tags of the concept C mark the term t in the text). When all these rules have been applied, every term in the grammar is tagged by concepts (cf. Figure 3).

With concept propagation, some new concepts are found. When a rule A ::= B exists, tags of A are added to the tags of B in the corpus, so the concept A is marked in the texts (cf. Figure 4).

These rules are applied until none of them is applicable. Then the corpus is completely tagged by the constituent rules.
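A minimal sketch of this two-step tagging, assuming term-level select rules and single-parent propagation rules like those in Figure 4 (the rule sets below are reduced to the running example and do not reflect the full grammar):

# Term tagging followed by concept propagation applied until no rule adds a
# new tag (a fixpoint), as described above.
TERM_RULES = {"buy": "DC_PURCHASE", "studio": "APPARTMENT",
              "paris": "CITY", "2003": "DATE"}
PROPAGATION_RULES = [("PROPERTY", "APPARTMENT"), ("PLACE", "CITY")]  # parent ::= child

def constituent_tagging(tokens):
    # Each token carries the set of concept tags found for it.
    tags = [{TERM_RULES[t]} if t in TERM_RULES else set() for t in tokens]
    changed = True
    while changed:  # propagate concepts until a fixpoint is reached
        changed = False
        for parent, child in PROPAGATION_RULES:
            for tagset in tags:
                if child in tagset and parent not in tagset:
                    tagset.add(parent)
                    changed = True
    return list(zip(tokens, tags))

print(constituent_tagging("buy studio paris in 2003".split()))
# 'studio' ends up tagged {APPARTMENT, PROPERTY} and 'paris' {CITY, PLACE}, as in extract [2].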

<PURCHASE> ::= (
  descriptor = <DC_PURCHASE>;
  object = <PROPERTY> | <VEHICLE> | <BANK_PRODUCT>;
  date = <DATE>;
  amount = <SUM>;
  location = <PLACE>
)

Figure 2: PURCHASE predicate

The text "buy studio paris in 2003" becomes after term tagging:
<DC_PURCHASE>buy</DC_PURCHASE> <APPARTMENT>studio</APPARTMENT> <CITY>paris</CITY> in <DATE>2003</DATE> [1]

Figure 3: Example of term tagging

Extract [1] becomes after concept propagation:
<DC_PURCHASE>buy</DC_PURCHASE> <PROPERTY><APPARTMENT>studio</APPARTMENT></PROPERTY> <PLACE><CITY>paris</CITY></PLACE> in <DATE>2003</DATE> [2]

with these rules:
<PROPERTY> ::= <APPARTMENT>
<PLACE> ::= <CITY> | <COUNTRY> | <REGION>

Figure 4: Example of concept propagation


6.2 Predicative tagging

The application of the predicative rules detects in the texts the instances of the grammar predicates. Each time a predicate descriptor is found, the process searches for one of the concepts defined as a possible object for this predicate.

Predicates are instantiated until it is no longer possible to do so. This proceeds as follows. The text is processed from left to right. When a predicate's descriptor is recognized, the process looks for a correct object (concept or predicate) for this predicate before the next concept that is a descriptor of another untreated predicate instance. If a correct object is found, the attribute object is given a value for this predicate instance. Next, the process tries to give a value to the optional attributes by looking for correct concepts in the text located between the descriptor of this predicate and the next one. After that, the system treats the next descriptor in the text.

If no correct object is found, this descriptor is left and the system immediately treats the next descriptor. This process is carried out right to the end of the text. At this point, if untreated descriptors (those that define a predicate instance without a found object) are left, the process is repeated from the beginning of the text. The operation is repeated until there are no descriptors left to treat, or only those that cannot be treated. If such descriptors remain, they are marked as defining empty predicate instances (instances without an object).

A predicate P1 can be the object of another one (P2). In this case, the attributes of P2 give values to the attributes of P1 when possible.

Predicate instantiation is made through a text tagging process. The system tags the descriptor with a predicate reference (the predicate name and an instance number to distinguish different instances of the same predicate). Each predicate attribute is tagged by the predicate reference and its type (Object, Date, Location, ...). Example: from extract [2], after applying the predicative rules, we obtain the tagged text and the predicate instance described by Figure 5.
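A much simplified, single-pass sketch of this instantiation loop, assuming the tagged text is available as (token, concept set) pairs; the PURCHASE rule mirrors Figure 2, but the search window bounded by the next descriptor, the repeated passes and the empty-instance marking of the real process are omitted:

# One left-to-right pass of predicative tagging: at each descriptor, look ahead
# for an allowed object, then fill optional attributes from later concepts.
PURCHASE = {
    "descriptor": "DC_PURCHASE",
    "object": {"PROPERTY", "VEHICLE", "BANK_PRODUCT"},
    "options": {"date": {"DATE"}, "location": {"PLACE"}, "amount": {"SUM"}},
}

def instantiate(tagged, rule=PURCHASE):
    instances = []
    for i, (token, concepts) in enumerate(tagged):
        if rule["descriptor"] not in concepts:
            continue
        inst = {"descriptor": token, "object": None}
        for t, cs in tagged[i + 1:]:  # the real system stops at the next descriptor
            if inst["object"] is None and cs & rule["object"]:
                inst["object"] = t
            for name, allowed in rule["options"].items():
                if name not in inst and cs & allowed:
                    inst[name] = t
        instances.append(inst)
    return instances

tagged = [("buy", {"DC_PURCHASE"}), ("studio", {"APPARTMENT", "PROPERTY"}),
          ("paris", {"CITY", "PLACE"}), ("in", set()), ("2003", {"DATE"})]
print(instantiate(tagged))
# -> [{'descriptor': 'buy', 'object': 'studio', 'location': 'paris', 'date': '2003'}]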

6.3 Information Retrieving

After constituent and predicative tagging, the tags make concepts and relations clearly readable in the corpus. In the retrieving step, all that has to be done is to specify the concepts to be searched for. With the tags, the system can easily locate these concepts and their different attributes. In this step, empty predicates are ignored. This information feeds a database in which the tables correspond to the grammar predicates.

7 Results

We have a corpus with around one million records. Each record is taken from an interview between a client and a bank employee. It is composed of a numerical heading and a text area. In the heading, there is an identification number and the recording date. The text area is filled with the interview report written by the employee. The text size varies from record to record: from a few words to thirty. Before text analysis, the text area is treated to make it conform with the Data Protection Act. Terminological extraction with ANA defines a first set of 15000 term candidates. After creaming off this set, 1300 remain. The terminological documents (which contain about 350 terms) give 200 new terms, so the terminological step gives us a set of 1500 terms.

<PURCHASE_1><DC_PURCHASE>buy</DC_PURCHASE></PURCHASE_1>
<PURCHASE_1 ARG=object><PROPERTY><APPARTMENT>studio</APPARTMENT></PROPERTY></PURCHASE_1 ARG=object>
<PURCHASE_1 ARG=location><PLACE><CITY>paris</CITY></PLACE></PURCHASE_1 ARG=location>
in
<PURCHASE_1 ARG=date><DATE>2003</DATE></PURCHASE_1 ARG=date>

Therefore we obtain this instance of the PURCHASE predicate:

<PURCHASE_1> [
  DESCRIPTOR = buy
  OBJECT = studio
  DATE = 2003
  LOCATION = paris
  AMOUNT = ∅
]

Figure 5: Example of predicative tagging


7.1 Evaluation method

The goal of this research is to extract client events. These events are client projects and proposition refusals (from the bank or from the client). The result is a set of instances of the searched predicates. As there are different attributes for a predicate, three degrees of validity are defined, which depend upon the way these attributes are given a value.

A predicate instance is called valid if the value is correct for the attributes that are given a value (not all the attributes need to be given a value). The validity rate is the number of valid instances per number of instances found.

A valid instance is called totally valid if all of its attributes are given a value and partially valid if one or more attributes are not given a value.

A partially valid instance is called incomplete when one or more attributes are not given a value because of a process mistake, and complete when every missing attribute lacks a value because of a lack of information in the corpus.

7.2 Experimentation

Our first experiment focuses on a representative sample of 4000 records taken at random from the corpus. The experiment was carried out with the aim of extracting the clients' projects from this sample. The results were validated by experts who aligned the sample with the PROJECT table of the database filled by our system. According to them, there are 265 projects in this sample. The system detects 253 instances of the predicate PROJECT. These results are described in Figure 6.

Figure 6: Experimentation results
Valid instances: 234
  Totally valid instances: 23
  Partially valid instances: 211
    Complete: 155
    Incomplete: 56

7.3 Analysis

The coverage rate is not revealing because a lot of records do not contain projects (95%). The recall rate (number of instances found per number of instances in the corpus) and the validity rate (respectively 95.4% and 92.5%) are both very satisfactory, but a lot of instances are partially valid (90% of the valid projects). 73.4% of these instances are partially valid because of missing information in the corpus, but the other 26.6% are imputable to the system. We are currently working to reduce the number of incomplete partially valid instances by improving the predicative tagging process.
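These percentages can be recomputed from the counts in Figure 6 and the 265 expert-identified projects; the quick check below is ours, not part of the original paper (small differences are rounding):

found, in_corpus = 253, 265
valid, partially_valid = 234, 211
incomplete, complete = 56, 155

print(f"recall     = {found / in_corpus:.1%}")            # ~95.5% (reported 95.4%)
print(f"validity   = {valid / found:.1%}")                # 92.5%
print(f"partial    = {partially_valid / valid:.1%}")      # ~90.2% ('90% of valid projects')
print(f"incomplete = {incomplete / partially_valid:.1%}") # ~26.5% (reported 26.6%)
print(f"complete   = {complete / partially_valid:.1%}")   # ~73.5% (reported 73.4%)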

8 Conclusion

As usual information extraction processes are unusable on poor quality texts, we have described a method to extract information from this type of corpus. The approach is based on ontology building guided by the type of information to be searched for in the texts. Good results have been obtained, with a very wide coverage of the information in each record, even for records containing very little information. This method can easily be applied to other corpora and other domains. The domain modelling and the pre-processes are bound to the texts, but the other parts of the system are generic and, except for small changes to the rules database building process, do not need modifications. Experiments are currently being carried out to that end.

References

Aussenac-Gilles N., Biébow B. and Szulman S. (2000) Corpus analysis for conceptual modelling. Proceedings of EKAW'2000, Juan-les-Pins, France, pp. 13-20.

Aussenac-Gilles N., Bourigault D., Condamines A. and Gross C. (1995) How can knowledge acquisition benefit from terminology. Proceedings of the Ninth Knowledge Acquisition for Knowledge-Based Systems Workshop (KAW'95), Banff, Canada.

Bachimont B. (2001) Modélisation linguistique et modélisation logique des ontologies : l'apport de l'ontologie formelle. Proceedings of IC2001, Grenoble, France, pp. 349-368.


Barry C., Cormier C., Kassel G. and Nobécourt J. (2001) Evaluation de langages opérationnels de représentation d'ontologies. Proceedings of IC'2001, Grenoble, France, pp. 309-327.

Biébow B. and Szulman S. (1998) Une approche terminologique pour catégoriser les concepts d'une ontologie. Proceedings of IC'98, Nancy, France, pp. 51-58.

Bouaud J., Bachimont B., Charlet J. and Zweigenbaum P. (1995) Methodological Principles for Structuring an Ontology. Proceedings of the IJCAI-95 Workshop on Basic Ontological Issues in Knowledge Sharing, Montreal, Canada.

Ciravegna F. (2000) Learning to Tag for Information Extraction from Text. ECAI Workshop on Machine Learning for Information Extraction, Berlin, Germany.

Enguehard C. (2000) Flexible-Equality of Terms: Definition and Evaluation. Proceedings of the Fourth International Conference on Flexible Query Answering Systems (FQAS'2000), Warsaw, Poland, pp. 289-300.

Enguehard C. and Pantéra L. (1995) Automatic Natural Acquisition of a Terminology. Journal of Quantitative Linguistics, Vol. 2, n°1, pp. 27-32.

Galhardas H., Florescu D., Shasha D., Simon E. and Saita C. (2001) Declarative Data Cleaning: Language Model and Algorithms. INRIA Technical Report n°4149.

Gomez-Perez A. et al. (1996) Towards a Method to Conceptualize Domain Ontologies. Proceedings of ECAI'96, Budapest, Hungary, pp. 41-52.

Guarino N. and Welty W. (2000) A Formal Ontology of Properties. Proceedings of the ICAI-00 Workshop on Applications of Ontologies and Problem-Solving Methods, Las Vegas, United States, pp. 12/1-12/8.

Guarino N. (1992) Concepts, Attributes and Arbitrary Relations: Some Linguistic and Ontological Criteria for Structuring Knowledge Bases. Data & Knowledge Engineering, Vol. 8(2), pp. 249-261.

Hahn U. and Romacker M. (2000) Content management in the SYNDIKATE system – How technical documents are automatically transformed to text knowledge bases. Data & Knowledge Engineering, Vol. 35(2), pp. 137-159.

Kassel G. and Perpette S. (1999) Cooperative ontology construction needs to carefully articulate terms, notions and objects. Proceedings of the International Workshop on Ontology Engineering on the Global Information Infrastructure, Germany.

Kassel G. (2002) OntoGec : une méthode de spécification semi-informelle d'ontologies. Proceedings of the Journées francophones d'Ingénierie des Connaissances (IC'2002), Rouen, France, pp. 75-87.

Lame G. (2000) Knowledge acquisition from texts towards an ontology of French law. Proceedings of EKAW'2000, Juan-les-Pins, France, pp. 53-62.

Madche A. (2000) Semi-automatic engineering of ontologies from text. International Conference on Software Engineering and Knowledge Engineering (SEKE'2000), Chicago, United States.

Nobécourt J. (2000) A method to build formal ontologies from texts. Proceedings of EKAW'2000, Juan-les-Pins, France, pp. 21-27.

Nestorov S. et al. (1997) Representative objects: concise representation of semistructured, hierarchical data. Proceedings of the International Conference on Data Engineering, Birmingham, United Kingdom, pp. 79-90.

Riloff E. (1996) Automatically Generating Extraction Patterns from Untagged Text. Proceedings of the Thirteenth National Conference on Artificial Intelligence, Portland, United States, pp. 1044-1049.

Soderland S. (1997) Learning Text Analysis Rules for Domain-specific Natural Language Processing. Ph.D. thesis, University of Massachusetts, Amherst.

Suguraman V. (2001) Creating and Managing Domain Ontologies for Database Design. Proceedings of NLDB'01, Madrid, Spain, pp. 17-26.

Swartout B., Patil R., Knight K. and Russ T. (1996) Towards distributed use of large-scale ontologies. Proceedings of the Tenth Knowledge Acquisition for Knowledge-Based Systems Workshop (KAW'96), Banff, Canada, pp. 32.1-32.19.

Uschold M. and King M. (1995) Towards a methodology for building ontologies. Proceedings of the Workshop on Basic Ontological Issues in Knowledge Sharing (IJCAI'95), Montreal, Canada.


Use of Ontologies for Cross-lingual Information Management in the Web

Ben Hachey*, Claire Grover*, Vangelis Karkaletsis†, Alexandros Valarakos†, Maria Teresa Pazienza‡, Michele Vindigni‡, Emmanuel Cartier§, Jose Coch§

* Division of Informatics, University of Edinburgh, {bhachey, grover}@ed.ac.uk
† Institute for Informatics and Telecommunications, NCSR "Demokritos", {vangelis, alexv}@iit.demokritos.gr
‡ D.I.S.P., Universita di Roma Tor Vergata, {pazienza, vindigni}@info.uniroma2.it
§ Lingway, {emmanuel.cartier, Jose.Coch}@lingway.com

Abstract

We present the ontology-based approach for cross-lingual information management of web content that has been developed by the EC-funded project CROSSMARC. CROSSMARC can be perceived as a meta-search engine which identifies domain-specific information from the Web. To achieve this, it employs agents for web crawling, spidering, information extraction from web pages, data storage, and data presentation to the user. Domain ontologies are exploited by each of these agents in different ways. The paper presents the ontology structure and maintenance before describing how domain ontologies are exploited by CROSSMARC agents.

1 Introduction

The EC-funded R&D project CROSSMARC (http://www.iit.demokritos.gr/skel/crossmarc) proposes a methodology for the management of information from web pages across languages. It is a full-scale approach starting with the identification of web sites in various languages that contain pages in a specific domain. Next, the system locates domain-specific web pages within the relevant sites and extracts specific product information from these pages. Finally, the end user interacts with the system through a search interface allowing them to select and view products according to the characteristics they deem important. A unique ontology structure is exploited throughout this process in different ways.

The CROSSMARC architecture is characterised by the following design points:

- machine learning methods to facilitate rapid tailoring of linguistic resources to new domains with minimal human intervention;
- a multi-agent architecture ensuring clear separation of module responsibilities, providing the system with a clear interface formalism, and providing robust and intelligent information processing capabilities;
- user modelling and localisation to adapt the information retrieved to the users' preferences and locale;
- domain-specific ontologies and the corresponding language-specific instances.

The focus of this paper is the exploitation of the ontology at the various processing stages of the CROSSMARC project.

The main functionality of the system is implemented in the agent modules, which appear in the centre column of Figure 1. Namely, these consist of domain-specific Web crawling, domain-specific spidering, information extraction, information storage and retrieval, and information presentation. We briefly describe the primary functionality of these agent modules below.

Domain-specific Web crawling is managed by the Crawling Agent. The Crawling Agent consults Web information sources such as search engines and Web directories to discover Web sites containing information about a specific domain (two domains are being implemented during the term of the project: laptops and job offers).


Domain-specific spidering is managed by the Spidering Agent. The Spidering Agent identifies domain-specific web pages grouped under the sites discovered by the Crawling Agent and feeds them to the Information Extraction Agent.

The Information Extraction Agent manages communication with remote information extraction systems (four such systems are employed for the four languages of the project). These systems process Web pages collected by the Spidering Agent and extract domain facts from them (Grover et al., 2002). The facts are stored in the system's database.

Information storage and retrieval is managed by the Data Storage Agent. Its tasks consist of maintaining a database of facts for each domain, adding new facts, updating already stored facts and performing queries on the database. Finally, information presentation is managed by the Personalisation Agent, which allows the presentation to be adapted to user preferences and locale.

CROSSMARC is a cross-lingual multi-domain system for product comparison. The goal is to cover a wide area of possible knowledge domains and a wide range of conceivable facts in each domain; hence the CROSSMARC model implements a shallow representation of knowledge for each domain in an ontology (Pazienza et al., 2003). A domain ontology reflects a degree of expert knowledge for that domain. Cross-linguality is achieved through the lexical layer of the ontology, which provides language-specific synonyms for all ontology entries. In the overall processing flow, the ontology plays several key roles:

- During Crawling & Spidering, it is used as a "bag of words", that is, a rough terminological description of the domain that helps CROSSMARC crawlers and spiders to identify the interesting web pages.
- During Information Extraction, it drives the identification and classification of relevant entities in textual descriptions. It is also used during fact extraction for the normalisation and matching of named entities.
- During Data Storage & Presentation, the lexical layer of the ontology makes possible an easy rendering of a product description from one language to another. User stereotypes maintained by the Personalisation Agent include ontology attributes in order to represent stereotype preferences according to the ontology. Thus, results can be adapted to the preferences of the end user, who can also compare uniform summaries of offer descriptions from Web sites written in different languages.


<ontology xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance" id="RTVD1-R1.2"
          xsi:noNamespaceSchemaLocation="../XSD/New_Ontology.xsd">
  <description>Laptops</description>
  <features>
    <feature id="OF-d0e5">
      <description>Brand</description>
      <attribute type="basic" id="OA-d0e7">
        <description>Manufacturer Name</description>
        <discrete_set type="open">
          <value id="OV-d0e3283">
            <description>Fujitsu-Siemens</description>
          </value>
          ...
        </discrete_set>
      </attribute>
      <attribute type="basic" id="OA-d0e349">
        <description>Model Name</description>
        <discrete_set type="open">
          <value id="EOV-d0e351">
            <description>Unknown Model</description>
          </value>
        </discrete_set>
      </attribute>
    </feature>
    ...
  </features>
</ontology>

Figure 2: Excerpt from XML export of concept instances for the laptop domain


This paper first presents the CROSSMARC ontology and discusses ontology management issues. It then details the manner in which CROSSMARC agents exploit domain-specific ontologies at the various processing stages of the multi-agent architecture. It next presents related work before concluding with a summary of the current status of the project and future plans.

2 The CROSSMARC Ontology

2.1 Ontology Structure

The structure of the CROSSMARC ontology has been designed, first, to be flexible enough to be applied to different domains and languages without changing the overall structure and, second, to be easily maintainable by modifying only the appropriate features. For this reason, we have constructed a three-layered structure. The ontology consists of a meta-conceptual layer, a conceptual layer, and an instances layer. The instances layer can be further divided into concept instances and lexical instances, which provide support for multilingual product information. For use by CROSSMARC agents, the concept instances and lexical instances are exported into XML (Figures 2 and 3).

The meta-conceptual layer defines the top-level commitments of the CROSSMARC ontology architecture, defining the language used in the conceptual layer. It denotes three meta-elements (features, attributes, and values), which are used in the conceptual level to assign computational semantics to elements of the ontology. This layer also defines the structure of the templates that will be used in the information extraction phase. In essence, the meta-conceptual layer specifies the top-level semantics of CROSSMARC across domains.

The conceptual layer is composed of the concepts that populate the specific domain of interest. These concepts follow the structure defined in the meta-conceptual layer for their internal representation and the relationships amongst them. Each concept element is discriminated by the use of a unique identity (ID) number, which is called an onto-reference. This conceptual layer defines the semantics of a given domain. An important aspect of this is the domain-specific information extraction template.

Finally, the instances layer represents domain-specific individuals. It consists of two types of instances: (1) concept instances, which act as the normalised values of each individual, and (2) lexical instances, which denote linguistic relationships between concepts or instances for each natural language.


<lexicon xmlns:xsi="http://www.w3.org/2000/10/XMLSchema-instance" idref="RTVD1-R1.2"
         lang="en" xsi:noNamespaceSchemaLocation="../XSD/New_Lexicon.xsd">
  <node idref="OV-d0e3283">
    <!-- Lexical entries for Fujitsu-Siemens -->
    <synonym>Fujitsu</synonym>
    <synonym>FUJITSU</synonym>
    <synonym>Siemens</synonym>
    <synonym>SIEMENS</synonym>
    <synonym>Fujitsu Siemens</synonym>
    <synonym>Fujitsu-Siemens</synonym>
    <synonym>FUJITSU-SIEMENS</synonym>
    <synonym>FUJISTU SIEMENS</synonym>
  </node>
  ...
</lexicon>

Figure 3: Excerpt from the XML export of English lexical instances for the laptop domain

Concepts are instantiated in this layer by populating their attribute(s) with appropriate values. Every instance is unique, and a unique identity number, named the onto-value, is attributed to it.

As previously mentioned, lexical instances support multilingual information. They are instantiated in a domain-specific lexicon for each natural language supported (currently English, Greek, French, and Italian). Here, possible instantiations of ontology concepts for each language are listed as synonyms. The "idref" attribute on synonym list nodes associates the lexical items with the ontology concept or instance that they correspond to. Also, regular expressions can be provided for each node of a lexicon for a broader coverage of synonyms.

We can illustrate the overall ontology structure with an example concept instantiation from the laptop domain. Again, the way we describe the structure of the domain is constrained by the meta-conceptual layer. The conceptual layer defines the items of interest in the domain; for laptops, these include information about the brand (e.g. manufacturer name, model), about the processor (e.g. brand, speed), about preinstalled software (e.g. OS, applications), and so on. Finally, in the instances layer, we declare instances of concepts and provide a list of possible lexical realisations. For example, the exported domain ontology in Figure 2 lists 'Fujitsu-Siemens' as an instance of the manufacturer name concept, and the exported English lexicon in Figure 3 lists alternative lexical instantiations of 'Fujitsu-Siemens'.

Though it is common knowledge that conceptual clustering differs from one language to the next, the ontology structure described is sufficient to deal with product comparison: firstly, because commercial products are fairly international, cross-cultural concepts; secondly, because the ontology design phase of adding a new domain provides a forum for discussing and addressing linguistic and cultural differences.

2.2 Ontology Maintenance

After a survey of existing ontology editors and tools, we decided to use Protege-2000 (http://protege.stanford.edu/) as the tool for ontology development and maintenance in CROSSMARC. We modified and improved the Protege model of representation and the user interface in order to fit CROSSMARC's user needs and to facilitate the process of editing CROSSMARC ontologies. This work has led to the development and release of a range of tab plug-ins dedicated to the editing of sections of the ontology related to specific steps in the Ontology Maintenance Process.

The default Protege editing Tabs are divided into Class, Slots and Instances. Although this organisation is quite logical, it was impractical for the purposes of CROSSMARC, as the Class view of the knowledge base puts together the Domain Model, the Lexicons, and the Meta layers. For this reason we developed several plug-in Tabs (described in Table 1) that focus attention on each different aspect of the knowledge base, allowing for more functional inspection and editing of the specific component under analysis. For more information on ontology maintenance in CROSSMARC, refer to (Pazienza et al., 2003).

3 Ontology Use in CROSSMARC

3.1 Crawling & Spidering

The CROSSMARC implementation of crawling exploits the topic-based website hierarchies used by various search engines to return web sites under given points in these hierarchies. It also takes a given set of queries, exploiting CROSSMARC domain ontologies and lexicons, submits them to a search engine, and then returns those sites that correspond to the pages returned. The list of web sites output from the crawler is filtered using a light version of the site-specific spidering tool (NEAC) implemented in CROSSMARC, which also exploits the ontology.


Protege Tab          Maintenance Task
Domain Model Editor  World modelling
Template Editor      Creation of a task-oriented model to be used as a template for purposes of fact extraction
Lexicon Editor       Upgrade of the lexicon for the ontology
Import/Export        Import and export of the Ontology and Lexicons in XML according to the Schema adopted in CROSSMARC

Table 1: CROSSMARC Protege Tabs with description of associated maintenance tasks.


The CROSSMARC web spidering tool explores each site's hierarchy starting at the top page of the site, scoring the links in the page and following "useful" links. Each visited page is evaluated and, if it describes one or more offers, it is classified as positive and is stored in order to be processed by the information extraction agent. Thus, the CROSSMARC web spidering tool integrates decision functions for page classification (filtering) and link scoring.

Supervised machine learning methods are used to create the page classification and link scoring tools. The development of these classifiers requires the construction of a representative training set that will allow the identification of important distinguishing characteristics for the various classes. This is not always a trivial task, particularly so for Web page classification. We devised a simple approach based on an interactive process between the user (the person responsible for corpus formation) and a simple nearest-neighbour classifier. The resulting Corpus Formation Tool presents difficult pages to the user for manual classification in order to build an accurate domain corpus with positive and negative examples.
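The paper does not say how "difficult" pages are selected; one plausible reading, sketched under that assumption, is that pages whose nearest labelled positive and negative examples are almost equally close are the ones shown to the user (every name and threshold below is illustrative):

from math import sqrt

def distance(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def most_difficult(unlabelled, labelled, k=5):
    """unlabelled: list of feature vectors; labelled: list of (vector, label)
    pairs with label 1 for positive and 0 for negative pages. Returns the k
    vectors on which a nearest-neighbour classifier is least certain."""
    def margin(vec):
        pos = min(distance(vec, v) for v, y in labelled if y == 1)
        neg = min(distance(vec, v) for v, y in labelled if y == 0)
        return abs(pos - neg)  # small margin = difficult page
    return sorted(unlabelled, key=margin)[:k]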

For the feature vector representation of the web pages, which is required both by the corpus formation tool and by the supervised learning methods, we use the domain ontology and lexicons. A specialised vectorisation module has been developed that translates the ontology and the lexicons into patterns to be matched in web pages. These patterns vary from simple key phrases and their synonyms to complex regular expressions that describe numerical ranges and associated text. The vectorisation module generates such a pattern file (the feature definition file), which is then used by an efficient pattern-matcher to translate a web page into a feature vector. In the resulting binary feature vector, each bit represents the existence of a specific pattern in the corresponding web page. A detailed discussion and evaluation of the CROSSMARC crawling and spidering agents can be found in (Stamatakis et al., 2003).
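As an illustration of this binary encoding (the three patterns stand in for the generated feature definition file and are not taken from the project):

import re

# Each ontology/lexicon-derived pattern becomes one bit of the page's vector.
FEATURE_PATTERNS = [
    re.compile(r"\blaptop\b", re.I),               # key phrase from the domain lexicon
    re.compile(r"\bfujitsu[- ]?siemens\b", re.I),  # manufacturer synonyms
    re.compile(r"\b\d{3,4}\s*(MHz|GHz)\b", re.I),  # numerical range with unit
]

def vectorise(page_text):
    return [1 if p.search(page_text) else 0 for p in FEATURE_PATTERNS]

print(vectorise("Fujitsu Siemens laptop, 1600 MHz"))  # -> [1, 1, 1]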

3.2 Information Extraction

Information Extraction from the domain-specific web pages collected by the crawling & spidering agents involves two main sub-stages. First, an entity recognition stage identifies named entities (e.g. product manufacturer name, company name) in descriptions inside the web page, written in any of the project's four languages (Grover et al., 2002). After this, a fact extraction stage identifies those named entities that fill the slots of the template specifying the information to be extracted from each web page. For this we combine wrapper-induction approaches for fact extraction with language-based information extraction in order to develop site-independent wrappers for the domain.

Although each monolingual information extraction system (four such systems are currently under development) employs different methodologies and tools, the ontology is exploited in roughly the same way. During the named-entity recognition stage, all the monolingual IE systems employ a gazetteer look-up process in order to annotate in the web page those words/phrases that belong to their gazetteers. These gazetteers are produced from the ontology and the corresponding language-specific lexicon through an automatic or semi-automatic process.

During the fact extraction stage, most of the IE systems employ a normalisation module. This runs after the identification of the named entities or expressions that fill a fact slot according to the information extraction template (i.e. the entities representing the product information that will eventually be presented to the end-user).


Figure 4: Screen shot of CROSSMARC search form.

The ontology and the language-dependent lexicons are used for the normalisation of the recognised names and expressions that fill fact slots. As a first step, names and expressions are matched against entries in the ontology. If a match is not found, names and expressions are matched against all the synonyms in the four lexicons. Whenever a match is found, the text is annotated with the characteristic "ontoval", which takes as value the ID of the corresponding node from the domain ontology. If no match is found for a name or expression belonging to a closed class, its "ontoval" characteristic takes the value of the ID of the corresponding "unknown" node. If the name or expression belongs to an open set, the ID of the category is returned. In the case of annotated numeric expressions, the module returns not only the corresponding ID of the ontology node but also the normalised value and unit.
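A compressed sketch of that lookup order (ontology entries, then lexicon synonyms, then the "unknown" fallback); the identifiers echo Figures 2 and 3, everything else is illustrative:

# Sketch of ontoval assignment, not the project's code.
ONTOLOGY = {"Fujitsu-Siemens": "OV-d0e3283", "Unknown Model": "EOV-d0e351"}
LEXICONS = {"en": {"FUJITSU SIEMENS": "OV-d0e3283", "FUJITSU": "OV-d0e3283"}}

def ontoval(surface, unknown_id="EOV-d0e351"):
    if surface in ONTOLOGY:
        return ONTOLOGY[surface]
    for lexicon in LEXICONS.values():   # try all language-specific lexicons
        if surface.upper() in lexicon:
            return lexicon[surface.upper()]
    return unknown_id                   # closed-class fallback: "unknown" node

print(ontoval("Fujitsu Siemens"))  # -> 'OV-d0e3283' (matched via the English lexicon)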

3.3 Information Storage & Presentation

The information extracted and normalised by the monolingual IE systems is stored in the CROSSMARC database by the Data Storage Agent. A separate database is constructed for each domain covered. The structure of the database is determined by the fact extraction schema, which is generated by the Template Editor Tab implemented in Protege.

The ontology is also exploited for the presentation of information in the CROSSMARC end-user interface. The User Interface design (see Figures 4 and 5) is based on a web server application which accesses the data source (i.e. the Data Storage output) and provides the end user with a web interface for querying data sets. This interface is customised according to a user profile or stereotype maintained by the Personalisation Agent and defined with respect to the domain ontology. Each query is forwarded to the Data Storage component and the query results are presented to the user after subsequent XSL transformation stages. These transformations select the relevant features according to the user's profile and apply appropriate lexical information onto them by accessing the normalised lexical representations corresponding to the user's language preferences.

4 Related Work

In recent years, the increasing importance of the Internet has somewhat re-oriented the information extraction community toward tasks involving texts such as e-mails, web pages, web logs and newsgroups. The main problems encountered by this generation of IE systems are the high heterogeneity and the sparseness of the data in such domains. Machine learning techniques and ontologies have been employed to overcome those problems and improve system performance.


Figure 5: Screen shot of CROSSMARC search results display.

RAPIER (Califf and Mooney, 1997) is such a system: it extracts information from computer job postings on a USENET newsgroup. It uses a lexical ontology, exploiting the hypernym relationship to generalise over a semantic class of a pre- or post-filler pattern. Following the same philosophy, CRYSTAL (Soderland et al., 1995) uses a domain ontology to relax the semantic constraints of its concept node definitions by moving up the semantic hierarchy or dropping constraints in order to broaden the coverage. The WAVE algorithm (Aseltine, 1999) exploits a semantic hierarchy restricted to a simple table look-up process to assign a semantic class to each term. And in (Vargas-Vera et al., 2001), an ontology is used to recognise the type of objects and to resolve ambiguities in order to choose the appropriate template for extraction.

IE systems have encountered another limitation with regard to the static nature of the background knowledge (i.e. the ontology) they use. For that reason, bootstrapping techniques for semantic lexicon and ontology extension during the IE process have been introduced. (Brewster et al., 2002) use an ontology to retrieve examples of the lexicalisation of relations amongst concepts in a corpus in order to discover new instances, which can be inserted into the ontology after the user's validation. In (Maedche et al., 2002) and (Roux et al., 2000), the initial IE model is improved through extension of the ontology's instances or concepts, exploiting syntactic resources.

Ontologies are also used to alleviate the lack of annotated corpora. (Poibeau and Dutoit, 2002) employ an ontology to overcome this limitation for an information extraction task. The use of ontologies in this work is twofold. First, the ontology is used to normalise the corpus by replacing instances with their corresponding semantic class, using a named entity recogniser to identify the instances. Second, it generates patterns exploiting the semantic proximity between two words (where one of them is the word that should be extracted) in order to propose new patterns for extraction. The ontology used in this work is a multilingual net over five languages with more than 100 different kinds of links.

Kavalec (2002) conducted an ontological analysis of web directories and constructed a meta-ontology of directory headings plus a collection of interpretation rules that accompany the meta-ontology. He treats the meta-ontology schema as a template for IE and uses the ontology's schema and interpretation rules to drive the information extraction process in the sense of filling a template. Another work (Craven et al., 1999) uses an ontology that describes classes and relationships of interest, in conjunction with labelled regions of hypertext representing instances of the ontology, to create an information extraction method for each desired type of knowledge and construct a knowledge base from the WWW.

5 Current Work and Conclusions

CROSSMARC is a novel, cross-lingual approach to e-retail comparison that is rapidly portable to new domains and languages. The system crawls the web for English, French, Greek, and Italian pages in a particular domain, extracting information relevant to product comparison.

We have recently performed a user evaluation of the CROSSMARC system in the first domain. This evaluation consisted of separate user tasks concerning the crawling, spidering, and information extraction agents as well as the end-user interface (Figures 4 and 5). We are in the process of analysing the results and are scheduling further user evaluations.

We are also currently porting the system into the domain of job offers. An important result of this will be the formalised customisation strategy. This will detail the engineering process for creating a product comparison system in a new domain, a task that consists broadly of developing a new domain ontology, filling lexicons, and training the crawling, spidering, and information extraction tools.

The CROSSMARC system benefits from a novel, multi-level ontology structure which constrains customisation to new domains. Furthermore, domain ontologies and lexicons provide an important knowledge resource for all component agents.

The resulting system deals automatically with issues that semantic web advocates hope to alleviate. Namely, the web is built for human consumption and thus uses natural language and visual layout to convey content, making it difficult for machines to effectively exploit Web content. CROSSMARC explores an approach to extracting and normalising product information that is adapted to new domains with minimal human effort.

References

J. H. Aseltine. 1999. WAVE: An incremental algorithm for information extraction. In Proceedings of the 16th National Conference on Artificial Intelligence (AAAI 1999).

C. Brewster, F. Ciravegna, and Y. Wilks. 2002. User centered ontology learning for knowledge management. In Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems.

M. E. Califf and R. J. Mooney. 1997. Relational Learning of Pattern-Match Rules for Information Extraction. In Proceedings of the 1st Workshop on Computational Natural Language Learning (CoNLL-97).

M. Craven, D. DiPasquo, D. Freitag, A. McCallum, K. Nigam, T. Mitchell, and S. Slattery. Learning to construct knowledge bases from the world wide web. Artificial Intelligence, 118:69–113.

C. Grover, S. McDonald, D. Nic Gearailt, V. Karkaletsis, D. Farmakiotou, G. Samaritakis, G. Petasis, M. Pazienza, M. Vindigni, F. Vichot and F. Wolinski. 2002. Multilingual XML-based named entity recognition for e-retail domains. In Proceedings of the 3rd International Conference on Language Resources and Evaluation.

M. Kavalec and V. Svatek. 2002. Information extraction and ontology learning guided by web directory. In Proceedings of the 15th European Conference on Artificial Intelligence.

A. Maedche, G. Neumann, and S. Staab. 2002. Bootstrapping an ontology-based information extraction system. In P. S. Szczepaniak, J. Segovia, J. Kacprzyk, and L. A. Zadeh (eds), Intelligent Exploration of the Web.

M. T. Pazienza, A. Stellato, M. Vindigni, A. Valarakos, and V. Karkaletsis. 2003. Ontology integration in a multilingual e-retail system. To appear in Proceedings of the Human Computer Interaction International (HCII'2003).

T. Poibeau and D. Dutoit. 2002. Generating extraction patterns from large semantic networks and an untagged corpus. In Proceedings of the 19th International Conference on Computational Linguistics.

C. Roux, D. Proux, F. Rechenmann, and L. Julliard. 2000. An ontology enrichment method for a pragmatic information extraction system gathering data on genetic interactions. In Proceedings of the ECAI 2000 Workshop on Ontology Learning.

S. Soderland, D. Fisher, J. Aseltine, and W. Lehnert. 1995. Issues in inductive learning of domain-specific text extraction rules. In Proceedings of the Workshop on New Approaches to Learning for Natural Language Processing.

K. Stamatakis, V. Karkaletsis, G. Paliouras, J. Horlock, C. Grover, J. Curran, and S. Dingare. 2003. Domain-specific web site identification: The CROSSMARC focused web crawler. To appear in Proceedings of the Second International Workshop on Web Document Analysis.

M. Vargas-Vera, J. Domingue, Y. Kalfoglou, E. Motta, and S. Shum. 2001. Template-driven information extraction for populating ontologies. In Proceedings of the IJCAI 2001 Workshop on Ontology Learning.


Automatic Annotation of Multilingual Text Collections with a Conceptual Thesaurus

Bruno Pouliquen, Ralf Steinberger, Camelia Ignat
European Commission - Joint Research Centre (JRC)
Institute for the Protection and Security of the Citizen (IPSC)
T.P. 267, 21020 Ispra (VA), Italy
http://www.jrc.it/langtech

[email protected]

Abstract

Automatic annotation of documents with controlled vocabulary terms (descriptors) from a conceptual thesaurus is not only useful for document indexing and retrieval. The mapping of texts onto the same thesaurus furthermore makes it possible to establish links between similar documents. This is also a substantial requirement of the Semantic Web. This paper presents an almost language-independent system that maps documents written in different languages onto the same multilingual conceptual thesaurus, EUROVOC. Conceptual thesauri differ from natural language thesauri in that they consist of relatively small controlled lists of words or phrases with a rather abstract meaning. To automatically identify which thesaurus descriptors describe the contents of a document best, we developed a statistical, associative system that is trained on texts that have previously been indexed manually. In addition to describing the large number of empirically optimised parameters of the fully functional application, we present the performance of the software according to a human evaluation by professional indexers.

1 Introduction

The process of assigning keywords to documents is called indexing. It is different from the process of producing an inverted index of all words occurring in a text, which is called full text indexing. Lancaster (1998) distinguishes the indexing tasks keyword extraction and keyword assignment. Keyword extraction is the task of identifying keywords present verbatim in a text, while keyword assignment is the identification of appropriate keywords from the controlled vocabulary of a reference list (a thesaurus). Controlled vocabulary keywords, which are usually referred to as descriptors, are therefore not necessarily present explicitly in the text.

We furthermore distinguish conceptual thesauri (CT) from natural language thesauri (NLT). In CT, most descriptors are relatively abstract, conceptual terms. An example of a CT is EUROVOC (Eurovoc, 1995; see section 1.2), whose approximately 6,000 descriptors describe the main concepts of a wide variety of subject fields by using high-level descriptor terms such as PROTECTION OF MINORITIES, FISHERY MANAGEMENT and CONSTRUCTION AND TOWN PLANNING (1). NLT, on the other hand, are more concrete in the sense that they usually aim at including an exhaustive list of the terminology of the covered field. Examples are MeSH in the medical field (NLM, 1986), DESY in particle physics (DESY, 1996), and AGROVOC in agriculture (AGROVOC, 1998). WordNet (Miller, 1995) is an NLT that is not specialised in any particular subject domain, but it does have the aim of being exhaustive (distinguishing approximately 95,000 synonym sets).

In this paper, we present work on automating the process of keyword assignment in several languages, using the CT EUROVOC. The challenge of this task is that the EUROVOC descriptor texts are not usually explicitly present in the documents. We therefore show in section 3 that treating EUROVOC descriptor identification as a keyword extraction task leads to very bad results.

(1) We write all EUROVOC descriptors in small caps.

We succeeded in making the big step from keyword extraction to keyword assignment by devising a statistical system that uses a training corpus of manually indexed documents to produce, for each descriptor, a list of associated natural language words whose presence in a text indicates that the descriptor may be appropriate for this text.

1.1 Contents

The structure of this paper is the following: we first present the EUROVOC thesaurus and explain why so many organisations use thesauri instead of, or in addition to, conventional full-text search engines. In section 2, we then distinguish our system from related work. In section 3, we give a high-level overview of the approach we adopted, without specifying the details. The reason for keeping the description general is that we experimented with many different formulae, parameters and parameter settings, and these will be listed in sections 4 to 6. Section 4 discusses the experiments concerning the linguistic pre-processing of the texts and the various results achieved. Section 5 is dedicated to those parameters that were used during the training phase of the process to produce the most efficient list of associated words for each descriptor. Section 6 then discusses the various experiments carried out to optimise the descriptor assignment results by matching the associated words against the text to which descriptors should be assigned.

Section 7 summarises the results achieved with the best parameter settings according to a manual evaluation by indexing professionals. The conclusion summarises the findings, shows possible uses of our system for other applications, and points to future work.

1.2 The Eurovoc thesaurus

EUROVOC (Eurovoc, 1995) is a wide-coverage conceptual thesaurus, covering diverse fields such as politics, law, finance, social questions, science, transport, environment, geography, organisations, etc. EUROVOC is used by the European Parliament, the European Commission's Publications Office and at least fifteen other (mostly parliamentary) institutions to catalogue their multilingual document collections for search and retrieval. It exists in one-to-one translations in eleven languages, with a further eleven language versions awaiting release. EUROVOC descriptors are defined precisely, using scope notes, so that each descriptor has exactly one translation into each language. A dedicated maintenance committee continuously updates the thesaurus.

EUROVOC is a thriving resource that will be used by more organisations in the future as it facilitates information and document exchange between parliamentary and other databases in the European Union, its Member States and other countries. It is also likely that more EUROVOC language versions will be developed.

EUROVOC organises its 6075 descriptors hierarchically into eight levels, using the relations Broader Term and Narrower Term (BT/NT), as well as Related Term (RT). RTs link nodes that are not related hierarchically. EUROVOC also provides a number of language-specific and optional non-descriptor terms that may help the indexing professional to find the appropriate descriptor. Non-descriptors typically are synonyms or hyponyms of the descriptor term (e.g. banana for TROPICAL FRUIT).

1.3 Motivation for thesaurus indexing

Most large organisations use thesauri for consistent indexing, storage and retrieval of electronic and hardcopy documents in their libraries and documentation centres. A list of carefully chosen descriptors gives users a quick summary of the document contents and enables them to navigate the document collection by subject field. The hierarchical nature of the thesaurus allows queries to be expanded when retrieving documents by subject field (e.g. 'radioactive materials') without having to enter a list of possible search terms (e.g. 'plutonium', 'uranium', etc.). When using multilingual thesauri such as EUROVOC, multilingual document collections can be searched monolingually by taking advantage of the fact that there are one-to-one translations of each descriptor.
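To make the query-expansion idea concrete, the following minimal sketch (in Python) expands a descriptor into its transitively narrower terms. The relation table and descriptor strings are invented for illustration and are not taken from EUROVOC.

```python
# Hypothetical narrower-term (NT) relations; real data would come from the thesaurus.
NARROWER = {
    "radioactive materials": ["plutonium", "uranium"],
    "uranium": ["enriched uranium"],
}

def expand_query(descriptor, narrower=NARROWER):
    """Return the descriptor together with all transitively narrower descriptors."""
    terms = [descriptor]
    for child in narrower.get(descriptor, []):
        terms.extend(expand_query(child, narrower))
    return terms

print(expand_query("radioactive materials"))
# ['radioactive materials', 'plutonium', 'uranium', 'enriched uranium']
```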

Manual assignment of thesaurus descriptors is time-consuming and expensive. Several organisations confirmed that their professional EUROVOC indexers assign, on average, less than thirty documents per day. Thus, automatic or at least semi-automatic solutions are sought. The JRC system takes about five seconds per document and could be used as a fully-automatic system or as input to machine-aided indexing.

Apart from supporting organisations that currently use manually assigned EUROVOC descriptors, the automatic descriptor assignment can be useful to catalogue other types of documents, and for several other purposes: representing document contents by a list of multilingual descriptors allows multilingual document classification and clustering, cross-lingual document similarity calculation (Steinberger et al., 2002), and the production of multilingual document maps (Steinberger, 2000). Lin and Hovy (2000) showed that the data produced in a similar process can also be useful for subject-specific summarisation. Last but not least, linking texts to meta-information such as established thesauri is a prerequisite for the realisation of the Semantic Web. EUROVOC is becoming more widely accepted as a standard for parliamentary documentation centres. Due to its wide coverage, its usage is in no way restricted to parliamentary texts. As it will also soon be available in 22 languages, EUROVOC has the potential to be a good standard reference to link documents on the Semantic Web.

2 Related work

Most previous work in the field concerns the indexing of texts with specialised natural language thesauri. These efforts come closer to the task of keyword extraction than keyword assignment because exhaustive terminology lists exist that can be matched against the words in the document to be indexed. Examples are Pouliquen et al. (2002) for the field of medicine, Montejo-Raez (2002) for particle physics and Haller et al. (2001) for economics. Jacquemin et al. (2002) additionally used tools to identify morphological and syntactic variations of the descriptors of the agricultural thesaurus AGROVOC. Gonzalo et al.'s (1998) effort to identify the most appropriate WordNet synsets for a text also differs from our own work: while the major challenge for WordNet indexing is to sense-disambiguate words found in the text that are part of several synsets, EUROVOC indexing is difficult because the descriptors are not present in the text.

Regarding indexing with conceptual thesauri (CT), both Hlava & Hainebach (1996) and Loukachevitch & Dobrov (2002) use rule-based approaches relying on vast, language-specific linguistic resources. Hlava & Hainebach's system for assigning English EUROVOC descriptors uses over 40,000 hand-crafted rules making use of text strings, synonym lists, vicinity operators and even tools to recognise and exploit legal references in text. Such an extensive usage of language-specific resources is out of our reach, as we aim at linguistics-poor methods so that we can adapt them to all EUROVOC languages.

The most similar application to ours was developed by Ferber (1997), whose aim was to use a multilingual thesaurus for the retrieval of English documents using search terms in languages other than English. Ferber trained his associative system on the titles of 80,000 bibliographic records, which were manually indexed using the OECD thesaurus. The OECD thesaurus is similar to EUROVOC, with the difference that it is smaller and exists only in four languages. Ferber achieved rather good results (a precision of 62% for a recall of 64%). However, we cannot compare our methods and our results directly with his, as the training data is of a rather different nature (corpus of titles vs. corpus of full texts with highly varying length).

Our approach of producing lists of associated words whose presence in a text indicates the appropriateness of the corresponding descriptor is not dissimilar to work on topic signatures, as described in Lin and Hovy (2000) and Agirre et al. (2000). However, our application requires a few additional steps because it is more complex, and there are also a number of differences regarding the creation of the lists. Lin and Hovy produced their topic signatures on documents that had been classified manually as being or not being relevant for one of four specific domains. Also, Lin and Hovy used the topic signatures to relevance-rank sentences in texts of the same domain for the purpose of summarisation. They were thus able to use positive and negative training examples, and they only had to decide, separately for each of the four subject domains, to what extent a sentence is similar to their topic signature. Agirre et al. did produce topic signatures for many more subject domains (for all WordNet synsets), but they used the signatures for word sense disambiguation, meaning that for each word they had only as many choices as there were word senses.

In the case of EUROVOC descriptor assignment, the situation is rather different, in that, for each document, it is possible to assign any of the 6075 descriptors and, in fact, multiple classification is the aim. Additionally, descriptor lists are required to be short, so that only the most relevant descriptors should be assigned while other appropriate, but less relevant, descriptors should not be assigned in order to keep the list concise.

Due to the complexity of this task, we introduced a large number of additional parameters that were not used by the authors mentioned above. Some of these parameters concern the pre-processing of the texts, some of them affect the creation of the topic signatures, and others were introduced to optimise the matching of the signatures against the texts for which EUROVOC descriptors are sought. Sections 4 to 6 explain these parameters in detail.

3 Overview of the process

3.1 Test and training corpus

Our English corpus consists of almost 60,000 texts of eight different types (2). Documentation specialists had indexed them manually with an average of 5.65 descriptors per text over a period of nine years. For some EU languages, the training corpus is slightly smaller. The average text size is about 5,500 characters, with a rather high standard deviation of 17,000. We randomly selected 587 texts to build a test set that is representative of this corpus regarding the various text types. The remainder was used for training.

3.2 ‘Extracting’ EUROVOC descriptors

The analysis of the training corpus showed that only 31% of the documents contain explicitly the manually assigned descriptor terms. At the same time, in nine out of ten cases where a descriptor text occurred explicitly in a text, this descriptor was not assigned manually. These facts indicate that identifying the most appropriate EUROVOC descriptors by keyword extraction (i.e. solely by searching for their verbatim occurrence in the text) will not yield good results.

(2) Types of document are 'Parliamentary Question', 'Council Regulation', 'Council Decision', 'Resolution', 'Protocol', 'Debate', 'Contract', etc.

To prove this, we launched an extraction experiment on our English test set. We automatically assigned all descriptors whose descriptor text occurred explicitly in the document. In order to evaluate the outcome, we compared the results with those EUROVOC descriptors that had previously been assigned manually to these texts. The experiment showed that a maximum Recall of 30.8% could be achieved, i.e. almost 70% of the manually assigned descriptors were not found. At the same time, this method achieved a Precision of 7.4%, meaning that over 92% of the automatically assigned descriptors had not been assigned manually. We also experimented with using a lemmatiser, stop words (as described in section 4) and EUROVOC's non-descriptors. These experiments never yielded better Precision values, but the maximum Recall could be raised to 39.8%.
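The verbatim-matching baseline just described can be pictured with a small sketch; the document, descriptor strings and gold assignments below are toy examples rather than data from the actual corpus.

```python
# Illustrative keyword-extraction baseline: assign every descriptor whose text
# occurs verbatim in the document, then compare with the manual assignment.

def extract_descriptors(text, descriptors):
    lowered = text.lower()
    return {d for d in descriptors if d.lower() in lowered}

def precision_recall(assigned, gold):
    correct = len(assigned & gold)
    precision = correct / len(assigned) if assigned else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

descriptors = {"fishery management", "public health", "veterinary inspection"}
gold = {"fishery management", "common fisheries policy"}   # manually assigned
text = "The regulation concerns fishery management and veterinary inspection."

assigned = extract_descriptors(text, descriptors)
print(precision_recall(assigned, gold))   # (0.5, 0.5)
```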

These are very poor results, which prove that keyword extraction is indeed not an option for the EUROVOC thesaurus. We take this performance as a lower-bound benchmark, assuming that our system has to perform better than this.

3.3 ‘Assigning’ EUROVOC descriptors, using an associative approach

As keyword extraction is not an option, we adopted a linguistics-poor statistical approach and trained a system on our corpus. As the only types of linguistic input, we experimented with normalising all texts of the training and test sets using lemmatisation, multi-word mark-up and stop word removal (see section 4).

During the training phase, we produce a ranked list of words (or lemmas) that are statistically (and often also semantically) related to each descriptor (see section 5). We refer to these lemmas as associates. These associate lists are rather similar to the topic signatures mentioned in section 2. Table 1 shows an example associate list for the EUROVOC descriptor FISHERY MANAGEMENT. The various columns will be explained in section 5.

Lemma                    Freq   Nb of texts   Weight
fishery_resource          317      160        54.47
fishing                   983      281        49.11
fish                     1766      281        46.19
common_fishery_policy     274      165        44.67
fishery                  1427      281        44.19
fishing_activity          295      124        43.37
fly_the_flag              403      143        42.87
aquaculture               242      171        39.27
conservation              759      183        38.34
vessel                   2598      230        37.91
…

Table 1. Top ten associated lemmas for the EUROVOC descriptor FISHERY MANAGEMENT. With reference to the discussion in section 5, columns 2, 3 and 4 show the absolute frequency of the lemma in all texts indexed with this descriptor, the number of texts indexed with this descriptor in which the lemma occurred, and the final weight of each lemma.

During the assignment phase, we normalise the new document in the same way and calculate the similarity between this document's lemma frequency list and each of the descriptor associate lists (see section 6). The descriptor associate lists that are most similar to the lemma frequency list of the new document indicate the most appropriate EUROVOC descriptors.

The EUROVOC descriptors can then be presented in a ranked list, according to the similarity of their associate lists with the document's lemma frequency list, as shown in Table 2. As the list of potential descriptors is very long, we must decide how many descriptors to present to the users, and for how many descriptors to calculate Precision and Recall values. We can compute Precision and Recall for any number of highest-ranking descriptors. If we say that the Precision at rank 5 is Y%, this means that an average of Y% of the top five descriptors were correct in all documents evaluated. During the training phase, we evaluated the automatically generated descriptor lists automatically, by comparing them to the previously manually assigned descriptors. The more manually assigned descriptors were found at the top of the ranked list, the better the results. The final evaluation of the assignment, as discussed in section 7, was carried out manually.

Rank  Descriptor               Similarity
1     VETERINARY LEGISLATION   42.4%
2     PUBLIC HEALTH            37.1%
3     VETERINARY INSPECTION    36.6%
4     FOOD CONTROL             35.6%
5     FOOD INSPECTION          34.8%
6     AUSTRIA                  29.5%
7     VETERINARY PRODUCT       28.9%
8     COMMUNITY CONTROL        28.4%

Table 2. Assignment results (8 top-ranking descriptors) for the document Food and veterinary Office mission to Austria, found on the internet at http://europa.eu.int/comm/food/fs/inspections/vi/reports/austria/vi_rep_oste_1074-1999_en.html.

For each formula and parameter, we identified the optimal parameter setting in an empirical way, by trying a range of parameters and then choosing the setting that yielded the best results. For parameter tuning and evaluation, we carried out over 1500 tests.

We will now focus on the description of the different parameters used in the pre-processing, training and assignment phases. Section 7 will then show the results according to a human evaluation of the descriptor assignment, using an optimised parameter setting.

4 Corpus pre-processing

We tried to keep the linguistic effort minimal in order to be able to apply the same algorithm to all eleven languages for which we have training material. Our initial assumption was that lemmatisation would be crucial (especially for languages that are more highly inflected than English), that marking up multi-word expressions would be useful as it helps disambiguate polysemous words such as 'plant' (power plant vs. green plant), and that stop words would help exclude words that are semantically poor or that can be considered corpus-specific 'noise'. Table 3 shows that, for both English and Spanish, using the combination of lemmatisation, multi-word mark-up and stop word lists does indeed produce the best results. However, only using the corpus-tuned stop word list containing 1533 words yields results that are almost as good (Spanish F = 47.4 vs. 48). This result was a big surprise for us. Tuning the stop word list to the domain clearly was useful, as the results achieved with a standard stop word list were less good (F = 46.6 vs. 47.4).
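As a rough illustration of the pre-processing options compared in Table 3, the following sketch applies lemmatisation, multi-word mark-up and stop word removal to a token list. The lemma table, stop list and multi-word list are tiny stand-ins for the full resources used in the experiments.

```python
# Minimal normalisation pipeline (toy resources, for illustration only).
LEMMAS = {"fisheries": "fishery", "vessels": "vessel", "flying": "fly"}
STOP_WORDS = {"the", "of", "and", "shall", "whereas"}
MULTI_WORDS = [("common", "fishery", "policy"), ("fishing", "activity")]

def normalise(tokens):
    # 1. lemmatise, 2. join known multi-word expressions, 3. drop stop words
    lemmas = [LEMMAS.get(t.lower(), t.lower()) for t in tokens]
    for mwe in MULTI_WORDS:
        joined = "_".join(mwe)
        i = 0
        while i <= len(lemmas) - len(mwe):
            if tuple(lemmas[i:i + len(mwe)]) == mwe:
                lemmas[i:i + len(mwe)] = [joined]
            i += 1
    return [l for l in lemmas if l not in STOP_WORDS]

print(normalise("The vessels flying the flag of the common fishery policy".split()))
# ['vessel', 'fly', 'flag', 'common_fishery_policy']
```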

We conclude that, at least if the amount of training material is similar, and for languages that are not more highly inflected than Spanish, the assignment results do not suffer much if no lemmatisation and multi-word treatment is carried out. This is good news, as it makes it easier to apply the algorithm to more languages for which fewer linguistic resources may be available.

LEM   SW       MW   Prec (Sp)   Recall (Sp)   F (Sp)   F (En)
–     –        –    40.3        43.4          41.8     45.6
–     –        +    40.4        43.6          41.9
–     +        –    45.6        49.3          47.4     49.1
–     strict   –    44.8        48.5          46.6
–     +        +    45.6        49.2          47.3
+     –        –    42.4        45.6          43.8
+     +        –    45.7        49.6          47.6     48.5
+     –        +    43.9        47.4          45.6
+     +        +    46.2        49.9          48.0     50.0

Table 3. Evaluation of the assignment (for the 6 top-ranking descriptors) on Spanish and English texts following various linguistic pre-processing steps. LEM: using lemmatisation, SW: using stop word list, MW: marking up multi-word expressions. "strict" indicates that we used a general, non-application-specific stop word list; missing results have not been computed.

5 Producing associate lists

The process of creating associate lists (or 'topic signatures') for each descriptor is the part where we experimented with most parameters. We can only mention the major parameters here, as explaining all details would require more space. The result of this process is, for each descriptor, a vector consisting of all associates and their weights, as shown in Table 1.

We did not have enough training material for all descriptors, as some descriptors were never used and others were used very rarely. We distinguish (I) basic minimum requirements that had to be met for us to produce associate lists, or basic decisions we took, and (II) parameters that had an impact on the choice of associates and their weights. Using the optimised minimum requirements, we managed to produce associate lists for 2893 English and 2912 Spanish descriptors.

(I) Basic requirements and decisions:

(a) minimum size and number of training texts available for each descriptor. We chose to require at least 5 texts with at least 2000 characters each (half a page).

(b) produce associate lists on the basis of one large meta-text per descriptor (concatenation of all texts indexed with this descriptor) vs. producing associate candidates for each text indexed with this descriptor and joining the results. As the training texts were of extremely varying length, the latter method produced much better results.

(c) choice of measure to identify associates in texts, such as pure frequency (TF), frequency normalised by the average frequency in the training corpus, TF.IDF, chi-square, or log-likelihood. Following Kilgarriff's (1996) study, we used log-likelihood (a sketch of this statistic is given after this list). We set the p-value as high as 0.15 so as to produce long associate lists.

(d) choice of reference corpus for the log-likelihood formula. We chose our training corpus as a reference corpus over using an independent corpus (like the British National Corpus or others).

(II) Parameters with an impact on the choice and weight of associates:

(e) the minimum number of texts per descriptor for which the lemma is an associate. We were surprised to learn that results were best when requiring the lemma to occur in a minimum of only two texts. Setting this threshold higher means getting more descriptor-specific, but also shorter, associate lists.

(f) deciding on the weight of an associate for a descriptor. Candidates were the frequency of the lemma in all texts indexed with this descriptor, the number of texts the lemma occurs in, the sum of the log-likelihood values, etc. Best results were achieved using the number of texts indexed with this descriptor, ignoring the absolute frequency and the log-likelihood values; the log-likelihood formula was thus only used to identify associate candidates.

(g) normalisation of the associate weight. We used a variation of the IDF formula (F3 below), dividing by the number of descriptors for which the lemma is an associate. This punishes the impact of lemmas that are associates of many descriptors. This proved to be so important that we punished common lemmas strongly: using a β of 10 in formula F3 cancelled the impact of all associates occurring at least 10% as often as the most common associate lemma.

(h) further normalisation of the associate weight. We considered either the length of each training text or the number of descriptors manually assigned to this text. Normalisation by text length yielded bad results, but normalisation by the number of other descriptors was important to reduce the interference of other descriptors that were assigned to the same training text (see formula F2).

(i) a minimum weight threshold for each associate. Experiments showed that results are better when not considering associates with a lower weight.

(j) a minimum requirement on the number of lemmas in the associate list of a descriptor for us to assign this descriptor. Setting this parameter high increases precision, as a lot of lexical evidence is needed to assign a descriptor, but it lowers recall, as the number of descriptors we can assign is low. A minimum of ten associates per descriptor produced the best F-measure results.
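The log-likelihood statistic mentioned in (c) can be computed as in the sketch below, which uses a common corpus-comparison formulation of the measure; the paper does not give its exact variant, and the frequencies and corpus sizes in the example are invented.

```python
from math import log

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood of a lemma in a target corpus vs. a reference corpus."""
    def ll_part(observed, expected):
        return observed * log(observed / expected) if observed > 0 else 0.0
    total = size_target + size_ref
    expected_target = size_target * (freq_target + freq_ref) / total
    expected_ref = size_ref * (freq_target + freq_ref) / total
    return 2 * (ll_part(freq_target, expected_target) + ll_part(freq_ref, expected_ref))

# e.g. a lemma seen 1427 times in the descriptor's texts (1M tokens)
# against 2000 times in the reference corpus (50M tokens)
print(round(log_likelihood(1427, 1_000_000, 2000, 50_000_000), 1))
```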

The final formula to establish the weight of lemma l as an associate of a descriptor d is:

Weight_{l,d} = W_{l,d} · IDF_l    (F1)

with l being a lemma, d a descriptor, W_{l,d} the weight of a lemma in a descriptor, and IDF_l the "Inverse Descriptor Frequency".

W_{l,d} = \sum_{t \in T_{l,d}} 1 / Nd_t    (see (h))    (F2)

with t being a text, Nd_t the number of manually assigned descriptors for text t, and T_{l,d} the texts that are indexed by descriptor d and contain lemma l.

IDF_l = \log( MaxDF / (β · DF_l) + 1 )    (see (g))    (F3)

with DF_l being the descriptor frequency, i.e. the number of descriptors the lemma appears in as an associate, and MaxDF the maximum value of DF_l over all lemmas. The parameter β is set to 10 in order to punish lemmas that occur in more than 10% of the MaxDF value.

The whole formula is thus:

Weight_{l,d} = ( \sum_{t \in T_{l,d}} 1 / Nd_t ) · \log( MaxDF / (β · DF_l) + 1 )    (F4)
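A minimal sketch of formulas F1-F4, assuming the reconstruction given above; the data structures, variable names and toy counts are ours, not the authors'.

```python
from math import log

BETA = 10  # punishes lemmas that are associates of many descriptors (see F3)

def associate_weight(texts_with_lemma, descriptors_per_text, df_lemma, max_df):
    """Weight of a lemma as an associate of a descriptor (formulas F1-F4).

    texts_with_lemma: ids of training texts indexed with the descriptor and
                      containing the lemma (T_{l,d})
    descriptors_per_text: text id -> number of manually assigned descriptors (Nd_t)
    df_lemma: number of descriptors for which the lemma is an associate (DF_l)
    max_df: maximum DF_l over all lemmas (MaxDF)
    """
    w_ld = sum(1.0 / descriptors_per_text[t] for t in texts_with_lemma)   # F2
    idf_l = log(max_df / (BETA * df_lemma) + 1)                           # F3
    return w_ld * idf_l                                                   # F1 / F4

# Toy example: the lemma occurs in three of the descriptor's training texts
nd = {"t1": 4, "t2": 2, "t3": 8}
print(round(associate_weight(["t1", "t2", "t3"], nd, df_lemma=12, max_df=600), 3))
```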

6 Assigning descriptors to a text

Once associate vectors such as that in Table 1 exist for all descriptors satisfying the basic requirements laid out in section 5, descriptors can be assigned to new texts by calculating the similarity between the text and the associate vectors. To this end, the text is pre-processed in the same way as the training material, and lemma frequency lists are produced for the new text. An experiment working with log-likelihood values for the lemmas of the new text instead of pure lemma frequencies gave bad results.

(III) We experimented with the following filters before calculating the similarity:

(k) we set a threshold for the minimum number of descriptor associates that had to be present in the text, to avoid a situation in which only a couple of associates with a high weight trigger wrong descriptors. This proved to be very important. The optimal minimal occurrence is four. Using a smoothing technique that adapts this parameter flexibly to either the text length or to the number of associates in the descriptor did not yield good results.

(l) we checked whether the occurrence of the descriptor text in the new document should be required (doing this produced bad results; see section 3.2).

(IV) We tried the following similarity measures to compare the text vector with the descriptor vectors:

(m) the Cosine formula (Salton, 1989);

(n) the Okapi formula (Robertson et al., 1994);

(o) the scalar product of vectors (cosine without normalisation);

(p) a linear combination of the three formulae, as recommended by Wilkinson (1994). In all cases, this combination produced the best results, but the optimal proportions varied depending on the languages and on the other parameter settings; an average good mix of weights turned out to be 40%-20%-40% for the formulae mentioned in (m)-(n)-(o).
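The following sketch shows one way such a 40%-20%-40% combination could be computed for a single text/descriptor pair. The cosine and scalar product follow the standard definitions; the Okapi component is only passed in as a pre-computed score, since the exact variant used is not spelled out here, and in practice the three scores would need to be scaled to comparable ranges before mixing.

```python
from math import sqrt

def scalar_product(text_vec, assoc_vec):
    """Unnormalised dot product of a lemma-frequency vector and an associate vector."""
    return sum(f * assoc_vec[l] for l, f in text_vec.items() if l in assoc_vec)

def cosine(text_vec, assoc_vec):
    norm = sqrt(sum(f * f for f in text_vec.values())) * \
           sqrt(sum(w * w for w in assoc_vec.values()))
    return scalar_product(text_vec, assoc_vec) / norm if norm else 0.0

def combined_score(text_vec, assoc_vec, okapi_score, weights=(0.4, 0.2, 0.4)):
    # 40%-20%-40% mix of cosine, Okapi and scalar product (scaling not shown)
    c, o, s = weights
    return (c * cosine(text_vec, assoc_vec)
            + o * okapi_score
            + s * scalar_product(text_vec, assoc_vec))

text = {"fishery": 5, "vessel": 3, "quota": 2}                 # lemma frequencies
assoc = {"fishery": 44.19, "vessel": 37.91, "aquaculture": 39.27}  # associate weights
print(round(combined_score(text, assoc, okapi_score=0.0), 2))
```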

7 Manual evaluation of the assignment

In addition to comparing our automatic assignment results to previously manually assigned descriptors, we asked two indexing specialists to evaluate the automatically generated results. The purpose of this second evaluation was (a) to get a second opinion, because indexers differ in their judgements on the appropriateness of descriptors, (b) to get feedback on the relevance of those descriptors that were assigned automatically but not manually, and (c) to produce an upper-bound benchmark for the performance of our system by setting the assignment overlap between human indexers as the maximum performance that can be achieved automatically.

As human assignment is extremely time-consuming, the specialists were only able to provide descriptors for 162 English and 98 Spanish texts of the test collection. They evaluated a total of 3706 automatically assigned descriptors. These are the basis of the evaluation measures given in Table 4.

In the evaluation, the evaluators were given the following choices: (a) good, (b) BT or (c) NT (rather good, but a broader or narrower term would have been better), (d) unknown, (e) bad, but semantically related, and (f) bad. The results in Table 4 count categories judged as (a) to (c) as correct and all others as incorrect.

The performance is expressed using Precision, Recall and F-measure. For the latter, we gave equal weight to Recall and Precision. All measurements are calculated separately for each rank. Precision for a given rank is defined as the number of descriptors judged as correct divided by the number of descriptors suggested up to this rank. Recall is defined as the number of correct descriptors found up to this rank, divided by all descriptors the evaluator found relevant for the text. The evaluator working on English judged an average of 8.15 descriptors as correct (of which 7.5 as (a); standard deviation = 2.5). The person working on the Spanish evaluation accepted a higher average of 11.6 descriptors per text as good (of which 10.4 as (a); standard deviation = 4.2).
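The rank-based Precision and Recall described here can be computed as in the following sketch; the ranked list and the set of relevant descriptors are invented for the example.

```python
def precision_recall_at_rank(ranked, relevant, rank):
    """Precision, Recall and F-measure over the top `rank` assigned descriptors."""
    suggested = ranked[:rank]
    correct = sum(1 for d in suggested if d in relevant)
    precision = correct / len(suggested)
    recall = correct / len(relevant)
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

ranked = ["VETERINARY LEGISLATION", "PUBLIC HEALTH", "VETERINARY INSPECTION",
          "FOOD CONTROL", "FOOD INSPECTION"]          # system output, best first
relevant = {"PUBLIC HEALTH", "FOOD CONTROL", "ANIMAL DISEASE"}  # evaluator's choices

for r in (1, 3, 5):
    p, rec, f = precision_recall_at_rank(ranked, relevant, r)
    print(r, round(p, 2), round(rec, 2), round(f, 2))
```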

7.1 Manual evaluation of manual assignment

It is well-known that human indexers do not always come to the same assignment and evaluation results. It is thus obvious that automatically generated results can never be 100% the same as those of a human indexer. In order to have an upper-bound benchmark for our system (i.e. a maximally achievable result), we subjected the previously manually assigned descriptors to the evaluation of our indexing professionals. The review was blind, meaning that the evaluators did not know which descriptors had been assigned automatically and which ones manually.

This manual evaluation of the manual assignment showed that the evaluators working on the English and Spanish texts judged, respectively, 74% and 84% of the previously manually assigned descriptors as good (a). They judged an additional 4% and 3% as rather good ((b) or (c)). Their total agreement with the previously manually assigned descriptors was thus 78% and 87%, respectively. It follows that they actively disagreed with the manual assignment in 22% and 13% of all cases. The differences between the two professional and well-trained human evaluators show that there is a difference in style regarding the number of descriptors assigned and regarding the generosity with which they accepted the manually or automatically assigned descriptors as correct.

We take the 74% and 84% overlap with the human judgements for English and Spanish as the maximally achievable benchmark for our system. These inter-annotator agreement results confirm previous studies (e.g. Ferber, 1997 and Jacquemin, 2002), which found an overlap between human indexers of between 20 and 80 percent; 80% can only be achieved by well-trained indexing professionals who are given clear indexing instructions.

7.2 Evaluation of automatic assignment

Table 4 shows human evaluation results for various ranks, i.e. looking at the top-ranking 1, 3, 5, 8, 10 and 11 automatically assigned descriptors. Looking at ranks 8 and 11 is most useful, as these are the average numbers of appropriate descriptors as judged by the human evaluators for English and Spanish. Setting the human assignment overlap of 78% and 87% as the benchmark (100%), the system achieved a precision of 86% (67/78) for English and of 80% (69/87) for Spanish. The indexing specialists judged that this is a very good result for a complex application like this one. Regarding the precision values, note that, for those documents for which the professional indexer only found 4 relevant descriptors, the maximally achievable automatic result at rank 8 would be 50% (4/8).

Nb of     English (162 texts)     Spanish (98 texts)
descr     P     R     F           P     R     F
1         94    12    21          88     8    15
3         83    31    45          86    24    37
5         75    46    57          82    37    51
8         67    63    65          75    54    63
10        58    68    63          71    64    67
11        55    71    62          69    75    68

Table 4. Precision, Recall and F-measure results for the manual evaluation of the English and Spanish documents of the test collection.

7.3 Performance across languages

The system has currently been trained and optimised for English, Spanish and French. According to the automatic comparison with previously manually assigned descriptors, the results were very similar for the three languages. For eight other European languages, assignment was carried out without any linguistic pre-processing and without fine-tuning the stop word lists. The results varied little between languages and were very similar to the results for English, Spanish and French without linguistic input and parameter tuning (which improved the results by six to eight percent). This similar performance across very different languages (including Finnish and German) shows that the approach as such is language-independent and that the application can easily be applied to further languages when training material becomes available.

8 Conclusion and Future Work

The manual evaluation of the automatic assignment of descriptors from the conceptual thesaurus EUROVOC using statistical methods and a high number of optimised parameters showed that the system performed 570% better than the lower-bound benchmark, which is the keyword extraction of descriptors present verbatim in the text (section 3.2; F-measure comparison). Furthermore, it performs only 14% (English) to 20% (Spanish) less well than the upper-bound benchmark, which is the percentage of overlap between two human indexing specialists (section 7.2). Results for English and Spanish assignment were very similar.

We showed that adding a number of parameters to more standard formulae, and identifying the best parameter settings empirically, improves the assignment results considerably. We have successfully applied the language-independent algorithm to more languages, and we believe that it can be applied to other applications such as the indexing of texts with other thesauri. However, the identification of the best parameter settings will have to be done anew. The optimised parameter settings for English, Spanish and French descriptor assignment were similar, but not entirely identical.

A problem we did not manage to solve with different formulae and parameters is the frequent assignment of descriptors that are wrong, but that are clearly part of the same semantic field. For instance, the descriptor NUCLEAR ACCIDENT was often assigned automatically to texts in which vocabulary such as 'plutonium' and 'radioactive' was abundant, even if the texts were not about nuclear accidents. Indeed, the descriptors NUCLEAR MATERIAL and NUCLEAR ACCIDENT have a large number of associates in common, which makes them hard to distinguish. To solve this problem, it is obvious that, for texts on nuclear accidents, the occurrence of at least one of the words 'accident', 'leak', or similar should be made obligatory. We have therefore started applying Machine Learning methods to infer such rules. First experiments with Support Vector Machines are encouraging.

In addition to the assignment in its own right, we use the automatic assignment of EUROVOC descriptors to texts for a variety of other applications. These include cross-lingual document similarity calculation, the automatic identification of document translations, multilingual clustering and classification, as well as subject-specific summarisation.

References

Agirre Eneko, Ansa Olatz, Hovy Eduard, Martínez David (2000) Enriching very large ontologies using the WWW. Proceedings of the Ontology Learning Workshop, ECAI. Berlin, Germany.

AGROVOC (1998) Multilingual agricultural thesaurus. World Agricultural Information Center. http://www.fao.org/scripts/agrovoc/frame.htm.

DESY (1996) The high energy physics index keywords. http://www-library.desy.de/schlagw2.html.

Eurovoc (1995) Thesaurus Eurovoc - Volume 2: Subject-Oriented Version. Ed. 3/English Language. Annex to the index of the Official Journal of the EC. Luxembourg, Office for Official Publications of the European Communities. http://europa.eu.int/celex/eurovoc.

Ferber R. (1997) Automated Indexing with Thesaurus Descriptors: A Co-occurrence Based Approach to Multilingual Retrieval. In Peters C. & Thanos C. (eds.), Research and Advanced Technology for Digital Libraries. 1st European Conf. (ECDL'97). Springer Lecture Notes, Berlin, pp. 232-255.

Gonzalo J., Verdejo F., Chugur I., Cigarrán J. (1998) Indexing with WordNet synsets can improve text retrieval. Proceedings of the COLING/ACL'98 Workshop on Usage of WordNet for NLP, Montreal.

Haller J., Ripplinger B., Maas D., Gastmeyer M. (2001) Automatische Indexierung von wirtschaftswissenschaftlichen Texten - Ein Experiment. Hamburgisches Welt-Wirtschafts-Archiv. Saarbrücken, Germany.

Hlava Marjorie & R. Hainebach (1996) Multilingual Machine Indexing. NIT'1996. Available at http://joan.simmons.edu/~chen/nit/NIT'96/96-105-Hava.html.

Jacquemin C., Daille B., Royaute J., Polanco X. (2002) In vitro evaluation of a program for machine-aided indexing. Information Processing and Management, Vol. 38, Elsevier Science B.V., Amsterdam, pp. 765-792.

Kilgarriff A. (1996) Which words are particularly characteristic of a text? A survey of statistical approaches. Proceedings of the AISB Workshop on Language Engineering for Document Analysis and Recognition, Sussex, April 1996, pp. 33-40.

Lancaster F.W. (1998) Indexing and Abstracting in Theory and Practice. Library Association Publishing, London.

Lin Chin-Yew, Hovy Eduard (2000) The Automated Acquisition of Topic Signatures for Text Summarization. Proceedings of CoLing. Strasbourg, France.

Loukachevitch Natalia & B. Dobrov (2002) Cross-lingual IR based on Multilingual Thesaurus specifically created for Automatic Text Processing. Proceedings of SIGIR'2002.

Miller G.A. (1995) WordNet: A Lexical Database for English. Communications of the ACM 11.

Montejo Raez A. (2002) Towards conceptual indexing using automatic assignment of descriptors. Proceedings of the Workshop on Personalization Techniques in Electronic Publishing on the Web, at the 2nd International Conference on Adaptive Hypermedia and Adaptive Web. Eds. S. Mizaro & C. Tasso, Malaga, Spain.

NLM - National Library of Medicine - (1986) Medical Subject Headings. Bethesda, Maryland, USA.

Pouliquen B., Delamarre D., Le Beux P. (2002) Indexation de textes médicaux par extraction de concepts, et ses utilisations. In "6th International Conference on the Statistical Analysis of Textual Data" (JADT'2002). St. Malo, France, pp. 617-628.

Robertson S. E., Walker S., Hancock-Beaulieu M., Gatford M. (1994) Okapi in TREC-3. Proceedings of the Text Retrieval Conference TREC-3, U.S. National Institute of Standards and Technology, Gaithersburg, USA. NIST Special Publication 500-225, pp. 109-126.

Salton G. (1989) Automatic Text Processing: the Transformation, Analysis and Retrieval of Information by Computer. Reading, Mass., Addison-Wesley.

Steinberger R. (2000) Using Thesauri for Information Extraction and for the Visualisation of Multilingual Document Collections. Proceedings of the Workshop on Ontologies and Lexical Knowledge Bases (OntoLex'2000), pp. 130-141. Sozopol, Bulgaria.

Steinberger R., B. Pouliquen & J. Hagman (2002) Cross-lingual Document Similarity Calculation Using the Multilingual Thesaurus Eurovoc. In: A. Gelbukh (ed.), Computational Linguistics and Intelligent Text Processing, Third International Conference, CICLing'2002. Lecture Notes in Computer Science 2276, pp. 415-424. Mexico-City, Mexico. Springer, Berlin Heidelberg.

Acknowledgements

We would like to thank the Documentation Centres of the European Parliament and of the European Commission's Publications Office OPOCE for providing us with the EUROVOC thesaurus and the training material. We thank Elisabet Lindkvist Michailaki from the Swedish Parliament and Victoria Fernández Mera from the Spanish Senate for their thorough evaluation of the automatic assignment results. We also thank the anonymous evaluators for their feedback given to our initial submission.


Bridging the Word Disambiguation Gap with the Help of OWL and Semantic Web Ontologies

Steve Legrand, Pasi Tyrväinen
Department of Computer Science, University of Jyväskylä
[email protected], [email protected]

Harri Saarikoski
Department of Linguistics, University of Helsinki
[email protected]

Abstract

Due to the complexity of natural language, sufficiently reliable Word Sense Disambiguation (WSD) systems have yet to see the light of day, in spite of years of work directed towards that goal in Artificial Intelligence, Computational Linguistics and other related disciplines. We describe how the goal could be approached by applying hybrid methods to information sources and knowledge types. The overall aim is to chart the shortfalls of present WSD systems related to the use of knowledge types and information sources. Real-world ontologies and other ontologies in the Semantic Web will make a useful contribution towards the WSD knowledge base envisaged here. The inference capabilities inherent in the Web Ontology Language (OWL) especially will have an important role to play in natural language disambiguation and knowledge acquisition. The emphasis is on ontologies as one of the important information sources for hybrid WSD.

1 Introduction

WSD methods have undergone changes and increased in number and variety in recent times, reflecting the requirements of the many different uses to which WSD is put. New types of information sources have appeared, enabling the utilization of the various types of knowledge incorporated in them. Nevertheless, there is no such thing as 100% WSD in any domain, however restricted. This, in fact, smacks of a goal that cannot be realized, taking into account that human WSD cannot reach that goal either.

1.1 Disambiguation Gap

However, we can try and get as close to the level of human WSD as possible. For this we need to specify the gap that currently exists between machine WSD and human WSD. This gap varies and is influenced by:

• application domain (machine translation, information retrieval, information extraction, knowledge acquisition, textual data mining, and natural language understanding among them).

• information sources available for those domains (machine-readable dictionaries, ontologies, corpora and their combinations among others), and

• knowledge types that these information sources incorporate (part-of-speech information, morphology, collocations, semantic associations, syntactic cues, sense frequencies, selectional preferences etc.).

Attempts to systemize the variety of sources and types have been made (Agirre and Martinez 2001, Ide and Véronis 1999) with some success, and attempts to unravel the knowledge types used for particular application domains, for example machine translation (Mowatt 1999), have also been made. The field now seems ripe for an approach that takes advantage of the potential of hybridisation in the multiplicity of these methods to optimise the effectiveness of WSD.

1.2 Dealing with the Disambiguation Gap

First, it is important to specify the disambiguation gap for various application domains by using our existing knowledge about information sources and knowledge types, and by experimenting with their combinations. The optimum mix varies from application to application: corpus statistical methods can support manual or semi-automatic knowledge methods. The relative weight of each method and knowledge type may be tested on corpora and defined. Hybrid methods (Ng and Lee 1996) and unsupervised methods (Yarowsky 1995) have proved their mettle in comparative studies.

Second, one needs to identify those sources and types whose disambiguation potential is currently under-utilised due to their poor availability, costly acquisition, or insufficient appreciation. In particular, the use of ontologies for NLP tasks must be investigated, as ontologies will play a dominant part in the creation of the Semantic Web. OWL seems like a very good choice for disambiguation in which dense ontologies and disambiguation rules are used. There has been a rapid increase in the inference capabilities of Semantic Web languages with each layer added on RDF (RDFS → OIL → DAML+OIL) (Antoniou, 2002). At the time of this writing, the OWL standard is in its Last Call Working Draft phase (McGuinness and van Harmelen, 2003); it may become a standard before the publication of this paper and can then be added on top of the stack of Semantic Web languages. As an indication of OWL's widespread acceptance at this early stage, one can cite some of the tools and technologies that have already been developed to take advantage of it, among them the knOWLer (2003) information management system and the OWL Converter (2003), which converts from DAML+OIL format to OWL.

A further advantage of using OWL together with Semantic Web ontologies lies in the distributed nature of those ontologies. A usable set of domain ontologies will take considerable time to create; the task will never be completed, because new words and concepts are entering the vocabulary constantly. Hundreds and thousands of experts are needed to make sense of the world. The Open Source community exemplifies the way the collaborative potential of like-minded people can be effectively harnessed, and there are already Open Source-style development groups active in the Semantic Web.

Before the Semantic Web came into being, Mikrokosmos, a WSD system for machine translation based on knowledge-dense ontologies represented by TMRs (Text Meaning Representations) (Mahesh and Nirenburg, 1995), had already seen daylight: its word sense disambiguation success rate of 97% for all the words in a corpus, for both training and unseen text, and 90% for ambiguous words (Mahesh et al. 1996) is encouraging, but the application lacks portability.

2 Hybrid Multilevel Disambiguation

In hybrid multilevel disambiguation, the idea is to disambiguate word senses using a mix of knowledge types and information sources, including real world knowledge in ontologies and their inference capabilities.

In the ontology-related method, the correct ontology corresponding to the domain of the document or document part is first detected. The domain itself can be automatically classified using a hybrid method, for example a committee approach, as outlined in Hammond et al. (2002). The domain ontology thus determined and located in the Semantic Web can then be used as the basis for the subsequent hybrid disambiguation.

For example, a paragraph such as: "I sat in my old buggy. It was very hot, so I turned on the engine, and drove under a tree to get cooler. Then I opened the window."

can be disambiguated at several levels:

1. Morpho-syntactic level
2. Semantic level
3. World knowledge level


All of these levels overlap to some extent. As the result of morpho-syntactic level disambiguation, a sentence is annotated with POS (Part-Of-Speech) and morphosyntactic feature tags. From these we derive the syntactic function labels indicating whether the word is the subject, object, predicate, modifier or complement (Figure 1).

The semantic level can use these annotations together with the selectional preferences of the words themselves to clarify the meaning further. For example, in the example paragraph above, 'buggy' is more likely to denote a kind of car than a horse-drawn carriage or a baby pram, the noun 'engine' regularly co-occurring with the noun 'car' in the same context. The part-of ontological hierarchy in Figure 2 confirms their close relationship. Naturally, a rule specifying this would need to span over the sentence boundaries.

I        i        @SUBJ
sat      sit      @+FMAINV
in       in       @ADVL
my       i        @A>
old      old      @A>
buggy    buggy    @<P
...      ...      ...
I        i        @SUBJ
turned   turn     @+FMAINV
on       on       @ADVL
the      the      @DN>
engine   engine   @<P
and      and      @CC
drove    drive    @+FMAINV
under    under    @ADVL
the      the      @DN>
tree     tree     @<P
...      ...      ...
I        i        @SUBJ
opened   open     @+FMAINV
the      the      @DN>
window   window   @OBJ

Figure 1. Syntactic functions (column 3) for the example paragraph according to FDG. Morphosyntactic feature tags are not shown here.

SUMO top ontology:
Entity
  Physical
    Object
      SelfConnectedObject
        ContentBearingObject
          LinguisticExpression
            Word
              Noun
              Verb
            Phrase
              NounPhrase
                Object
                Subject
              VerbPhrase
                Predicate
        CorpuscularObject
          Artifact
            Device
              TransportationDevice

Domain ontology fragments:
Vehicle
  MotorVehicleType: bus, car, Jeep, sedan, buggy, ...
  MotorlessVehicleType: bicycle, horse carriage, buggy, ..., baby carriage, pram, buggy, ...
  MotorVehiclePart: engine, transmission, electrical system, body, roof, bumper, floor, door, window

Figure 2: SUMO top ontology (bold) subsuming the syntactic function and transportation domain ontologies (lighter color).

Even though we now know that the protagonist has started the engine of his car, it would be difficult by morpho-syntactic and semantic disambiguation alone to reason that the following sentence, "Then I opened the window…", necessarily refers to the car window. Here a real-world ontology would be of great help. Apart from confirming that 'buggy' can be a type of 'car', it would confirm that the window, in this case, is a part of a car and not of a house, and that horse-drawn carriages or baby prams do not have engines. One could reason further that the car was a type of vehicle, that the selectionally preferred noun for the verb 'to drive' is 'a car', and so on. If, however, the context of the paragraph were established to be that of golf, then of course the selectional preference for the verb 'drive' in 'I drove under a tree' would most likely change. These types of inferences, drawn from ontologies built with the help of OWL, could be combined to minimize the word sense ambiguity.
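A toy sketch of the kind of subclass and part-of reasoning described above is given below. The ontology fragment, sense identifiers and scoring rule are invented for illustration; a real system would query OWL ontologies rather than Python dictionaries.

```python
# Invented toy ontology: two candidate senses for 'buggy' and two relation tables.
SUBCLASS_OF = {
    "buggy#car": "MotorVehicleType",
    "buggy#carriage": "MotorlessVehicleType",
    "MotorVehicleType": "Vehicle",
    "MotorlessVehicleType": "Vehicle",
}
PART_OF = {
    "engine": ["MotorVehicleType"],
    "window": ["MotorVehicleType", "Building"],
}

def is_a(concept, ancestor):
    """Walk up the subclass chain to test whether concept is a kind of ancestor."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = SUBCLASS_OF.get(concept)
    return False

def prefer_sense(ambiguous_senses, context_parts):
    """Prefer the sense whose class can contain the parts mentioned in the context."""
    def score(sense):
        return sum(1 for part in context_parts
                   if any(is_a(sense, host) for host in PART_OF.get(part, [])))
    return max(ambiguous_senses, key=score)

# 'engine' and 'window' in the context favour the motor-vehicle reading of 'buggy'
print(prefer_sense(["buggy#car", "buggy#carriage"], ["engine", "window"]))  # buggy#car
```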

The above gives a rough idea of the way word sense disambiguation can be handled if OWL's inferencing capabilities are combined with traditional means of disambiguation. The example is only an illustration and does not provide enough detail for a complete disambiguation of the paragraph. It is easy to find fault with it by insisting, for example, that 'buggy', according to a definition found in many dictionaries, is a small vehicle without windows and doors, with a roof mounted on the chassis. This would contradict what is said about the window above, unless one classified some off-road vehicles such as converted VWs (with windows) as buggies, which is also quite common. This further illustrates the importance of not relying excessively on any single information source or disambiguation method when trying to reduce the disambiguation gap.

Once the correct domain ontology and the position in it of the word denoting the concept are determined, the word can be matched with the foreign word in the same ontological structure if the purpose is to translate it. It may be that a house 'window' and a car 'window' are denoted by two entirely different words in another language, unlike in English, and it is therefore important to select the correct ontological concept.

3 FDG and Ontological Approach

We use a morpho-syntactic Functional Dependency Grammar (FDG, 2003) analyser as the baseline on which to found our research. The FDG analyser is based on the ENGCG parser which, when combined with the Xerox tagger, reached 98.5% structural disambiguation accuracy, outperforming all the other parsing combinations tested in the study of Tapanainen and Voutilainen (1995). FDG was selected for the present research mainly due to its accuracy. However, although its disambiguation error rate seems very small, it is still significant when considering natural language applications. Using the same formula as Abney (1996), it is easy to show that this word-based disambiguation rate, when applied at the sentence level, still needs some improvement to satisfy the requirements of natural language processing applications. If we assume that a sentence consists of 20 words on average, the 98.5% word disambiguation accuracy translates into a 26% error rate at the sentence level (1 − 0.985^20 ≈ 26%). For the purposes of machine translation this is clearly not yet adequate (a quarter of all sentences erroneous even in the ideal case where the domain is restricted).
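As a quick check of this arithmetic, the sentence-level figure can be reproduced directly; the snippet below is only an illustration and assumes, as the text does, 20-word sentences and independent word-level errors.

# Sentence-level error rate implied by per-word disambiguation accuracy,
# assuming ~20-word sentences and independent word-level errors.
word_accuracy = 0.985
words_per_sentence = 20
sentence_error_rate = 1 - word_accuracy ** words_per_sentence
print(f"{sentence_error_rate:.0%}")  # prints 26%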

The current morpho-syntactic word sense disambiguation in FDG will soon be augmented with a semantic disambiguation module, which is likely to further improve the parser's accuracy. This is not sufficient, however. In addition to morphological, syntactic and semantic word sense disambiguation, real world knowledge is required for optimal understanding of a natural language. In our approach, the gap remaining in the disambiguation that cannot be bridged using the mix of currently available methods and their modifications is subjected to ontological disambiguation using real-world distributed domain ontologies and the SUMO upper ontology (Pease et al 2002) in the Semantic Web. Currently, most of these ontologies are in the RDF-based DAML+OIL format, but they can be converted to OWL, the standard that is expected to replace DAML+OIL in the near future.


Farrar et al (2002) have suggested the addition of a general ontology for linguistic description (GOLD) to the SUMO upper ontology and published a draft version of it in the OWL format. They see it as useful as part of an expert system reasoning about language data, or as part of an interlingua for a machine translation system. We envisage using their linguistic ontology to hold disambiguated morphological and syntactic data. Syntactic functions for nouns could indirectly be indicated in the case system portion of GOLD. However, these can also be plugged directly into the SUMO upper ontology (Figure 2), although their positioning under SUMO's Phrase category may prove unsatisfactory in the long run.

The domain ontology holding the real-world data and relations for the words can then be aligned to the SUMO ontology (Figure 2), mainly through OWL subsumption, and to the linguistic ontology using OWL property relations (Figure 3) as the glue: inferences can be drawn from the real-world data to increase the disambiguation power of the linguistic ontology. For the purpose of alignment, SUMO also needs to be converted to OWL. An agent-based application can then be used to manipulate the structures created for linguistic disambiguation.

The coding and the property and class relations in Figure 3 are grossly simplified fragments forming part of an OWL document. The idea here is to show that the verb 'drive' selects a car rather than a baby carriage as its preferred noun phrase object. The word 'buggy' may subsequently be matched with 'car', and 'window' with the car body through their part-of relations. Similarly, both 'body' and 'engine' would be identified as parts of a motor vehicle. A comprehensive OWL statement about the verb 'drive' would have a set of preferentially weighted selectional preference entities to select from and a set of restrictions applied to it. Contextually closest (shortest arc distances in the ontology) selectional preferences would have the greatest preference weighting.
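The "shortest arc distance" preference can be illustrated with a small sketch; the sense labels, the toy ontology fragment and the function names below are invented for illustration and are not part of the FDG system or of the OWL fragments in Figure 3.

# Toy illustration: prefer the sense of 'buggy' that lies closest in the
# ontology to the selectional preference of 'drive' (here, 'car').
from collections import defaultdict, deque

# child -> parent links from an invented transportation fragment (cf. Figure 2);
# traversal below treats them as undirected to measure arc distance.
LINKS = [
    ("buggy(car)", "car"), ("car", "MotorVehicleType"),
    ("buggy(carriage)", "MotorlessVehicleType"),
    ("MotorVehicleType", "Vehicle"), ("MotorlessVehicleType", "Vehicle"),
]
ADJ = defaultdict(set)
for a, b in LINKS:
    ADJ[a].add(b)
    ADJ[b].add(a)

def arc_distance(start, goal):
    """Breadth-first search over the undirected ontology fragment."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in ADJ[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")

# 'drive' prefers 'car': the motor-vehicle sense of 'buggy' (distance 1)
# beats the horse-drawn sense (distance 4).
best_sense = min(["buggy(car)", "buggy(carriage)"],
                 key=lambda sense: arc_distance(sense, "car"))
print(best_sense)  # buggy(car)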

4 WSD Knowledge Base

The ontologies to be tested and designed for our optimal WSD include new and existing ontologies in the Semantic Web, suitably modified, and contain, for each concept, a dense network of subclass/superclass (e.g. car is-a motorVehicle) relationships, property rules (e.g. selectional preferences) and associative relations. Essentially, the WSD Knowledge Base will contain the differentiating factors between two senses of a word, which will disambiguate the sense of the target word. Synonym sets may be thought of as one differentiator, the sense's place in the Knowledge Base hierarchies and categories as another.

<owl:Class rdf:ID="Car">
  <rdfs:subClassOf rdf:resource="#MotorVehicleType" />
</owl:Class>
<owl:Class rdf:ID="Buggy">
  <rdfs:subClassOf rdf:resource="#Car" />
</owl:Class>
----------------------------------------------
<owl:Class rdf:ID="Engine">
  <rdfs:subClassOf rdf:resource="#MotorVehiclePart" />
</owl:Class>
<owl:Class rdf:ID="Body">
  <rdfs:subClassOf rdf:resource="#MotorVehiclePart" />
</owl:Class>
<owl:Class rdf:ID="Window">
  <rdfs:subClassOf rdf:resource="#Body" />
</owl:Class>
----------------------------------------------
<owl:Class rdf:ID="Predicate">
  <rdfs:subClassOf rdf:resource="#VerbPhrase" />
</owl:Class>
<Predicate rdf:ID="Drive">
  <selectionalPreferenceObject rdf:resource="#Car" />
</Predicate>

Figure 3. OWL fragments connecting ontologies with subsumption and relational properties. Namespace declarations, superclass definitions, and property definitions are omitted.


Selectional preferences of the senses and, of course, context word statistics, among other differentiators, can also be used for disambiguation.

It is in this density and multiplicity of knowledge types and "sub-atomicity" (concepts are defined rigorously and adequately from within) that it contrasts with traditional, atomic ontologies, in which concepts are only defined in terms of their few external relations to related terms in a network. The Mikrokosmos ontology holds 5000 concepts with an average of 16 attributes and relations per concept (Mahesh et al. 1996). Our WSD knowledge base starts from that density and increases or decreases it until WSD is optimised. The result will be referred to as the WSD Knowledge Base, for which we define each aspect of its construction and functioning. We will rigorously define the principles of designing such a knowledge base, both in terms of quality (knowledge types required) and quantity (number of concept-internal definitions, i.e. information from knowledge types). As such, this research will also provide a feasible requirements specification for an eventual implementation of the WSD system described.

One important application for our optimal WSD system is knowledge acquisition. It is precisely the lack of knowledge, and the high cost of acquiring dense knowledge bases and ontologies, that stands as the bottleneck in the way of knowledge-based NLP systems becoming more useful. The WSD Knowledge Base may deliver a solution to both structural and semantic disambiguation tasks, and can as such be utilised in a multitude of NLP applications.

The WSD system could then be tested using corpora and test cases (disambiguable target words) from earlier research. Such starting points, and also points of comparison, could be Ng and Lee (1996), who tested their hybrid system on the senses of the single noun 'interest', Bruce and Wiebe (1994), who worked with the same noun, or Towell and Voorhees (1998), who tested some highly ambiguous words such as 'line' (noun), 'serve' (verb), and 'hard' (adjective). Another possibility, offering an equal amount of comparability to previous research, would be to examine the systems from the SENSEVAL-2 (2001) competition to see what knowledge types and information sources would most naturally and effectively disambiguate the target words.

5 Conclusion

This paper outlines the three main aspects of bridging the current disambiguation gap in WSD: application domain, information sources and knowledge types. There is a multiplicity of different domains, sources and types. Methods dealing with them have their limitations, which can be partially overcome by combining the best of them in hybrid methods. It is important to determine the part of the disambiguation gap for language understanding that is dependent on knowledge acquisition.

Ours is an attempt to quantify the disambiguation potential of each information source and the knowledge types it contains for each target word type. For example, if we find that what differentiates one word from another is synonym sets, points are added to the knowledge type and information source involved. The idea is to get an overall view of the most useful differentiating and disambiguating factors, knowledge types, and information sources in each particular case. Efforts in the KA and NL communities can then be better directed toward acquiring these information sources and knowledge types and developing more reliable hybrid WSD systems.

The FDG parser that we use in our morpho-syntactic and semantic disambiguation provides a mix of knowledge types (POS, morphology etc) to which we add selectional preferences and other types for the purpose of semantic / world knowledge disambiguation.

Ontologies as information sources are gaining momentum thanks to the emerging Semantic Web language specifications such as RDFS, DAML+OIL, and the most recent arrival, OWL, with its enhanced inference capabilities suitable for knowledge-based NLP. The use of OWL ontologies further reduces the disambiguation gap by allowing word sense disambiguation with the help of real-world knowledge contained in Semantic Web domain ontologies.

These are still early days. However, OWL will soon become a standard, the SUMO upper ontology will be translated to OWL in due course, and linguistic ontologies and ontologies from other domains (knowledge-saturated and knowledge-optimized ontologies) will be added and aligned with it. It is our hope that this paper offers a glimpse of how the Semantic Web, saturated ontologies, and OWL can contribute as one of the disambiguation methods used in hybrid WSD.

References

Abney, S., Part-Of-Speech Tagging and Partial Parsing, In: Church, K., Young, S., Bloothoof, G., Methods in Language and Speech. An ELSENET book, Kluwer Academic Publishers, Dordrecht, 1996.

Antoniou, G., Nonmonotonic Rule Systems on Top of Ontology Layers, Lecture Notes in Computer Science, 2342, Online publication: May 29, 2002. Available in: http://link.springer.de/link/ service/series/0558/bibs/2342/23420394.htm

Agirre, E., and Martinez, D., Knowledge Sources for Word Sense Disambiguation, Lecture Notes in Computer Science 2166, Springer 2001.

Bruce, R., and Wiebe, J., Word-Sense Disambiguation Using Decomposable Models, In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, New Mexico, 1994.

Farrar, S., Lewis, W.D., Langendoen, D.T., A Common Ontology for Linguistic Concepts, Proceedings of the Knowledge Technologies Conference, March 10-13, Seattle, 2002.

FDG. Conexor Functional Dependency Grammar. In: http://www.conexoroy.com/fdg.htm, Last accessed: June 2003

Hammond, B., Sheth, A., Kochut, K., Semantic Enhancement Platform for Semantic Applications over Heterogeneous Content, To appear in Real World Semantic Web Applications, V. Kashyap and L. Shklar, Eds., IOS Press, 2002. Available in: http://lsdis.cs.uga.edu/lib/download/HSK02-SEE.pdf

Ide, N., and Véronis, J., Word Sense Disambiguation: The State of the Art, Computational Linguistics, Vol. 24, No. 1, March 1998, pp. 1–40.

knOWLer. Ontology based information management system. In: http://taurus.unine.ch/GroupHome/ knowler/wordnet.html. Last accessed: June 2003.

Mahesh, K., Nirenburg, S., Beale, S., Onyshkevych, B., Viegas, E., and Raskin, V., Word Sense Disambiguation: Why Statistics When We Have These Numbers? 1996.

Mahesh, K., and Nirenburg, S., A Situated Ontology for Practical NLP, in IJCAI-95 Workshop on Basic Ontological Issues in Knowledge Sharing, Aug. 19-21, Montreal, 1995.

McGuinness, D.L., van Harmelen, F., Eds., OWL Web Ontology Language Overview, W3C Working Draft 31 March 2003. Available in: http://www.w3.org/TR/owl-features/

Mowatt, D., Types of Semantic Information Necessary. In Machine Translation Lexicon, Conférence TALN 1999, Cargèse, 12-17 July 1999

Ng, H. T. and Lee, H.B., Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach, 1996. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, Santa Cruz, California, 1996.

OWL Converter. In: http://www.mindswap.org/2002/ owl.html. Last accessed: June 2003.

Pease, A., Niles, I., Li, J., (2002) The Suggested Upper Merged Ontology: A Large Ontology for the Semantic Web and its Applications. In Working Notes of the AAAI-2002 Workshop on Ontologies and the Semantic Web. Available in: http://reliant.teknowledge.com/AAAI-2002/Pease.ps

Rigau, G., Magnini, B., Agirre, E., Vossen, P., and Carroll, J., MEANING: a Roadmap to Knowledge Technologies, 2002.

SENSEVAL-2. Second International Workshop on Evaluating Word Sense Disambiguation Systems. 5-6 July 2001, Toulouse, France. In: http:// www.sle.sharp.co.uk/senseval2/

Tapanainen, P. and Voutilainen, A., Tagging accurately: don't guess if you don't know. Technical Report, Xerox Corporation, 1994.

Towell, G., Voorhees E.M., Leacock, C., Disambiguating Highly Ambiguous Words In Computational Linguistics Volume 24, Issue 1 / March 1998, p. 125 – 145

Yarowsky, D., Unsupervised word sense disambiguation methods rivaling supervised methods, ACL-95, 33rd Annual Meeting of the Association for Computational Linguistics, 26-30 June 1995, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA, 1995.


Using WordNet hierarchies to pinpoint differences in related texts

Ann Devitt
Computational Linguistics Group
Department of Computer Science
Trinity College Dublin
[email protected]

Carl Vogel
Computational Linguistics Group & CCLS
Department of Computer Science
Trinity College Dublin
[email protected]

Abstract

We present a means of comparing texts to highlight their informational differences. The system builds a Directed Acyclic Graph representation of the combined WordNet hypernym hierarchies of the nouns. Comparison of these yields a graph which distinguishes minor lexical & major content differences.

1 Introduction

As John Locke wrote in the seventeenth century:

No knowledge without discernment

In the 21st Century, the average human needs more than the average amount of discernment to get much knowledge from the vast profusion of information available to them. This paper addresses the question of distinguishing texts from each other. It is about similarities but also differences, detailing a structure by which differences of content may be distinguished, independent of surface differences.

The paper describes a system for modeling texts, starting from their lexical items and supplementing these with "world knowledge" from WordNet, and then comparing the content of these models. Section 2 discusses theoretical models of text which highlight the central role of lexis in text meaning and their use in Natural Language Processing applications. Section 3 details the representation built and the process of text comparison. Section 4 sets out the analysis for 3 pairs of texts. Finally there is a brief discussion of this work-in-progress and how it will progress.

2 Related Work

Much current research in the area of NLP is indebted to (Halliday and Hasan, 1976) for their exposition of the notion of cohesion as a combination of forces at work to knit a collection of disparate sentences into a text, a cohesive whole. They categorised the factors contributing to textual cohesion into grammatical devices—reference, substitution and ellipsis—conjunction and lexical cohesion.

In general terms, lexical cohesion relies on some similarity or relatedness among lexical items in a text. This assumption that the lexical items in a text are related systematically on more than just a grammatical level has fired research into means of text representation based on lexical items and also into the exploitation of such representations to tackle classic NLP problems.

Representations may be structured or unstructured. The analysis set out in (Halliday and Hasan, 1976) and developed in (Hasan, 1984) focuses on a structured representation: lexical chains of related items which through their interaction with each other make text coherent. (Hoey, 1991) further develops this idea as repetition nets which capture the relations between lexical items in a text. Many computational models take lexical chains as the premise for building representations of text, for example (Harabagiu and Moldovan, 1998).


Approaches such as Latent Semantic Analysis (Deerwester et al., 1990) do not attempt to build a structured representation of the relations between individual lexical items but give a global picture of text "meaning".

Given the availability of knowledge bases, computational approaches based on lexical cohesion can draw on more than just the lexical items to build a model of text. WordNet (Fellbaum, 1990) has been used extensively in this regard, see (Mihalcea and Moldovan, 1999), (Harabagiu, 1999), (Agirre and Rigau, 1996). The knowledge bases represented by large corpora are also used to supplement lexical information; systems include LSA (Deerwester et al., 1990) and Vectile (Kaufmann, 1999).

These representations have been exploited to tackle many classic NLP problems such as word sense disambiguation (Mihalcea and Moldovan, 1999), meronymy resolution (Markert and Hahn, 2002), pronominal resolution (Harabagiu and Maiorano, 1999), topic identification and text segmentation (Hearst, 1997), (Kaufmann, 1999), and textual inference (Harabagiu and Moldovan, 1998).

This paper follows in this vein, on the premise that just the collection of lexical items has much to tell about the structure and content of a text without taking other structural aspects into account. It presents a very simple representation of lexical items in a text, enriched with WordNet hypernym relations, which is used to highlight types of differences between texts. The tasks of text categorization and information retrieval deal with gauging similarities among texts: texts and corpora or texts and query strings with a high measure of similarity denoting success. Shifting the focus to text differences may yield interesting results for relevance feedback, where a measure of similarity has been calculated and a categorisation of in what ways individual texts differ from some desired target could be useful.

3 Approach

The approach used here is quick and dirty. The aim is to build some representation of a text, in this instance news text, then to compare representations of related stories to identify their differences.

The representation is built using just nouns. This is clearly a shortcoming; however, for a quick and dirty approach, it suffices for the moment.

In order to extract the nouns, the texts are first tagged using LT POS, a tagger and chunker from the Language and Technology Group at the University of Edinburgh. The hypernym hierarchy for all senses of each noun is then extracted from WordNet. These synset lists are then merged into a directed acyclic graph to represent the text, ambiguities and all. These graphs form the basis for a comparison of texts.

3.1 The Representation

The aim is to produce a useful representation which could be generated from free text, with minimal pre-processing and no disambiguation.

WordNet provides a rich knowledge base for English in which concepts, termed synsets or synonymy sets, are linked by semantic relations. The system detailed here uses the WordNet hypernym "IS-A" relation. For each lexical item in the text, the hypernym hierarchy from beginning to end is added to the representation of the whole. The final structure is a directed acyclic graph containing a subset of the WordNet hypernym linked structure. Each node in the graph represents a WordNet synset and a count of its frequency in the text. The intuition is that the main entities and topic areas are those linked by frequently traversed edges in the graph. This structure captures the lexical chains formed by reiteration—through general nouns, synonymy and superordinates or just repetition—as edges that have been traversed more than once.
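A minimal sketch of this construction, assuming NLTK's WordNet interface and a pre-extracted list of nouns (the function name is ours, not the authors'):

# Merge the hypernym paths of all senses of all nouns into one DAG,
# counting how often each node and edge is introduced (ambiguities and all).
from collections import Counter
from nltk.corpus import wordnet as wn   # requires the WordNet corpus to be installed

def hypernym_dag(nouns):
    edge_counts, node_counts = Counter(), Counter()
    for noun in nouns:
        for synset in wn.synsets(noun, pos=wn.NOUN):      # every sense, no disambiguation
            for path in synset.hypernym_paths():          # root ... -> synset
                for parent, child in zip(path, path[1:]):
                    edge_counts[(parent.name(), child.name())] += 1
                for node in path:
                    node_counts[node.name()] += 1
    return edge_counts, node_counts

# Using the nouns of the short example text that follows (Text1), the
# boy-child-idiot and tree-elm chains emerge as edges traversed more than once.
edges, nodes = hypernym_dag(["boy", "tree", "idiot", "elm", "child"])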

For example, Text1, the following short text, based on an extract from (Halliday and Hasan, 1976) p. 279

There's a boy climbing that tree. The idiot will fall if he's not careful. Elms are not very sturdy. The poor child might hurt himself.

yields, among other chains, the lexical chains connecting boy-child-idiot and tree-elm shown in figure 1. The strength of the line is directly proportional to the frequency of the particular arc. Different senses of words are written word#1, word#2, etc. The text graph also contains spurious paths, shown in figure 2.


Figure 1: Partial hypernym graph for Text1 (nodes include entity, causal_agency, organism, human, juvenile, male, relation, simpleton, offspring, male_offspring, man, idiot, child#1–3, boy#1–3, flora, tracheophyte, woody_plant, tree and elm; edge weights reflect arc frequency).

Figure 2: Path from hypernym graph for Text1 (abstraction, attribute, form, figure, plane_figure, tree).

While the flora-tree path is reinforced by the addition of the hypernym path for elms, the abstraction-tree path is not reinforced by connections with any other nodes and indeed remains the only node stemming from the root abstraction in the text.

In practice, graphs such as figure 1 are stored as adjacency matrices of synset identifiers to facilitate comparison with other texts.

3.2 Comparison

The comparison of the matrix representations of two texts, T1 and T2, yields another matrix of synset nodes. The comparisons and resulting matrices used in this approach are the following:

• T1−T2: the synset nodes present in T1 but not in T2

• T2−T1: the synset nodes present in T2 but not in T1

• T1∩T2: the intersection of the two texts, all synset nodes present in both texts
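As a sketch (our naming, not the authors' code), treating each text simply as the set of synset nodes produced by the hypernym_dag sketch above:

# The three comparison "matrices", reduced here to plain node sets.
def compare_node_sets(nodes_t1, nodes_t2):
    t1, t2 = set(nodes_t1), set(nodes_t2)
    return {
        "T1-T2": t1 - t2,    # synset nodes present only in T1
        "T2-T1": t2 - t1,    # synset nodes present only in T2
        "T1&T2": t1 & t2,    # synset nodes shared by both texts
    }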

If two texts are on completely different subjects, the difference matrices, T1−T2 and T2−T1, contain most of the original matrices, T1 and T2, respectively. If they are quite similar, the difference matrix will contain a very restricted subset of the input synset nodes and the similarity matrix contains the lion's share. New entities are represented by complete paths in the matrix; other lexical choices may appear as absence or presence of a few additional nodes. An empty or almost empty difference matrix can indicate two things: the texts are very similar or one text subsumes another.

If the intersection of two texts is significant, a comparison of the T1−T2 and T2−T1 matrices can give an indication of which of the two texts is more informative on a particular subject, i.e. whether they are sufficiently different to warrant reading both or whether one subsumes the other. Possible relevance feedback comments include "if you've read this, you probably don't need to read that" or "read one or the other but don't worry about both" or "whatever they're saying, it's something very different, read both".

The statistical test used to determine significance of the results of comparison is chi-square. (Kilgarriff, To appear) states this performs best among other tests evaluating similarity within and across corpora, and this is a related task.

Having computed T1∩T2, T1−T2 and T2−T1, we formulate a chi-square contingency table of the form:

        Shared Text & One Part    The Other Part
T1      T1∩T2 + T1−T2             T2−T1
T2      T1∩T2 + T2−T1             T1−T2

Table 1: Chi-square contingency table schema

Thus, the rows in each case sum to the total text size (words, in the case of word-based comparison, and total nodes for the two texts for the proposed WordNet-derived comparison). The chi-square contingency table is used to test whether the null hypothesis, that there is no impact of the influence of text choice (the row) on the distribution in the column, may be rejected. Statistical significance means that the hypothesis may be rejected, as an asymmetry exists in the contribution of one of the two texts to the overall total. Lack of significance indicates that the texts are very dissimilar to the point of not being informationally comparable. In other words, each row is bounded by the total set of types derived from the union of the texts (words or synsets), but the columns isolate differential contributions of the text pairs.


Thus, significance in the Chi-square test is directly related to differences in the individual contributions of the texts, albeit mitigated by the contribution of the T1∩T2 term, which is a frequency count of their joint contribution.

The method allows pairwise comparisons of individual texts, and requires round-robin comparison to identify overall uniqueness in information contribution from a larger body of texts. Although examples are not included in the abstract, the very programs that allow computation of textual and WordNet node overlap also admit indexing of the documents to supply exact 'pinpointing' of putative informational differences.

The Chi-Square test as outlined bounds the entire comparison by the cardinality of synsets invoked by the two compared texts together. The row effect is the differential effect of the unique contribution of each individual text on that total cardinality. The column effect is the total contribution of each text in an algebraically constructed isolation.

Suppose T1∪T2 is larger than T1∩T2. Then, even if there is a size difference between T1 and T2, it is possible to measure the larger contributor to T1∪T2, and hence measure information difference. If T1∪T2 is comparable to T1∩T2, then T1−T2 is comparable to T2−T1. If there is a disparity in T1∪T2 compared with T1∩T2, then it is possible to compare the relative contributions of T1 and T2, should the difference result from sample size. However, in the texts we experiment with, sample sizes are comparable.

The contingency table as constructed is bounded by the sizes of the overall set of unique concepts, that of the individual contributions, and that which is common to both.
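A hedged sketch of the test on this contingency table, using SciPy; the cell layout follows Table 1, and the computed values may differ slightly from those reported below because SciPy applies a continuity correction to 2x2 tables by default.

from scipy.stats import chi2_contingency

def node_comparison_significance(shared, only_t1, only_t2):
    """shared = |T1 & T2|, only_t1 = |T1 - T2|, only_t2 = |T2 - T1| (node counts)."""
    table = [
        [shared + only_t1, only_t2],   # row T1: shared text & one part vs. the other part
        [shared + only_t2, only_t1],   # row T2
    ]
    chi2, p, dof, _ = chi2_contingency(table)
    return chi2, p, dof

# Arsenic1 vs Arsenic2 node counts from Table 2:
print(node_comparison_significance(436, 314, 381))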

The evaluations presented here are based on the basic cardinality of the sets involved. Currently, we are investigating the effect of weighting the node count according to each node's position in a lexical chain. This weighting adds implicit structural information and makes a significant contribution to the comparison process. It provides a means of distinguishing DAGs that differ only in fragments of paths (potential vocabulary differences) from DAGs that contain complete paths or blocks that are different, indicating extra information content. Furthermore, a measure of graph density may be obtained in an analysis of the weighted count versus the unweighted node count, as the weights represent average path length. Experiments are on-going with different weightings and results are encouraging but inconclusive as yet.

The next section evaluates the original system, without weights, for 3 pairs of real news texts.

4 The Texts

This section provides examples from a corpus of news texts of how the system described in this paper can distinguish the degree of difference between texts. The texts, printed in Appendix A, are titled Arsenic1, Arsenic2, Bomb2, Shooting and Budget. The results discussed below are based on an evaluation of the comparison matrices using the chi-square significance test discussed above; for a significance level of p = 0.05, the chi-square value should be ≥ 3.84.

4.1 Same topic: Arsenic1 and Arsenic2

The two texts, Arsenic1 and Arsenic2, relate the same story of a mass poisoning at a church in Maine and the death of the main suspect. However, coming as they do from different sources, the story is told somewhat differently in each.

Table 2 details the results of a comparison of the adjacency matrices of the two texts. The chi-square value, 9.32, indicates that the distribution of nodes is statistically significant, that the texts are related, and that their individual contributions beyond the intersection are interesting.

Chi-square    9.32 (p < 0.01)
Measure       T1∩T2    T1−T2    T2−T1
Nodes         436      314      381
Percentage    39%      28%      33%

Table 2: Arsenic1 vs Arsenic2 node comparison

A comparison of the texts based purely on the lexical items in the texts yields the results set out in table 3. For significance at the .05 level at 1 df, chi-square should be greater than or equal to 3.84.


Here, chi-square is 0, indicating that the word distribution is not significant, that the texts are not related. This points to an aspect of the intended contribution of our work: given the small sample sizes, text-based (that is, explicit word or lemma based) measures of the informational contribution of two texts suffer from data-sparseness. While it is clear that POS-tagged comparisons suffer from excess noise, our intention is to use the intermediate level semantic tagging supplied by activated WordNet nodes. This intermediate representation on these two short texts, which are clearly related but which also clearly involve substantially distinct vocabularies, provides an initial indication that the approach is viable.

Chi-square    0 (p > 0.05)
Measure       T1∩T2    T1−T2    T2−T1
Words         96       108      108
Percentage    30%      35%      35%

Table 3: Arsenic1 vs Arsenic2 word comparison

4.2 Unrelated topics: Bomb2 and Budget

The last section demonstrated that our approach can generate useful information: "these two articles are related and both contribute distinct bits of information, so probably you should read both, with an eye to what the index mechanism flags". A contrasting statistic is also necessary if one is to receive advice of the form: "while noodle.news has classified these two articles as about the same topic, they actually contribute identically to their sum information" (i.e. either they are identical or utterly unrelated).

The chi-square value in table 4 indicates that any similarities between texts Bomb2 and Budget are not statistically significant. For significance at the 0.05 level, the chi-square value should be ≥ 3.84.

Chi-square    0.0020 (p > 0.05)
Measure       T1∩T2    T1−T2    T2−T1
Nodes         146      407      406
Percentage    15%      42.5%    42.5%

Table 4: Bomb2 vs Budget node comparison

Closer analysis of the intersection matrix, Bomb2∩Budget, yields an outline of what kind of similarities exist between the two texts. A large portion of the matrix contains paths ending on generic terms, such as:

skilled worker
electrical device
cognitive process

while the difference matrices contain the more specific terminal nodes of these paths:

• Bomb2: skilled worker to man / serviceman

• Budget: skilled worker to minister

This would suggest that all texts have a baseline similarity—being for the most part about entities, etc. An extended system should take into account the shape and nature of the intersection matrix, as well as its size, to determine whether the similarities are baseline or significant similarities.

4.3 Related topics: Bomb2 and Shooting

As a critical evaluation of our idea, we examine an intermediate comparison. Bomb2 and Shooting both recount a story of a shooting in Belfast. The events themselves are not related and happened at separate times. The subject matter, however, is similar. Again, the chi-square value in Table 5 is significant, indicating that the texts are related.

In this instance, the word comparison data also produces a significant result, with a chi-square value of 32.9, p < 0.001.

Chi-square    235.30 (p < 0.001)
Measure       T1∩T2    T1−T2    T2−T1
Nodes         161      392      119
Percentage    24%      58%      18%

Table 5: Bomb2 vs Shooting node comparison

This reveals that our notion of topic tracking is not refined enough to base comparisons on texts automatically identified to be about the same events. However, it correspondingly demonstrates success in identifying informationally distinct articles about related event types: someone interested in one of the events may well be interested in the other on the basis of their common thematic content.


5 Discussion

Conclusive results would require an analysis of far more data than has been presented here. This task is currently in progress.

However, these preliminary analyses would suggest that this approach can discriminate differences in texts at least better than a baseline word comparison approach. The system outputs a quite dependable measure of similarity or difference between texts and also what these differences or similarities are—the terms associated with the synset.

Both of these ends are of primary concern in applications related to Information Extraction. A system which could not only identify which texts are similar among a selection of texts but which could also highlight or extract any significantly different information in the texts could be used in many domains. For example, a news wire service could identify story updates and highlight any new information in the update or integrate the new into the old without replicating information. We currently have a project under way to investigate the viability of this idea using a corpus of news texts in which groups of texts form chains of updates.

6 Future Work

Our proposal is very much a work in progress, the aim being to develop the representation to provide a basis for comparison of texts against many criteria—content, difficulty, genre, etc. Some specific areas for development include:

• Parts of speech: The motivation for excluding all parts of speech besides nouns was to have a working model on which to begin experiments. The ultimate aim, however, is to incorporate all parts-of-speech in the text representation.

• WordNet relations: WordNet provides much more varied resources than just the hypernym hierarchy among nouns. The other relations—antonymy, meronymy and holonymy—should also be exploited, as in, for example, (Harabagiu, 1999).

• Dimension reduction or path-finding: The inclusion of all parts of speech and of more WordNet relations entails an explosion in the size of the matrix, bringing attendant issues of how to deal with such a large data structure.

• Ambiguity: The DAG is not disambiguated, so every ambiguous word introduces or strengthens spurious nodes and edges. This aspect can become a task in itself—a hypernym hierarchy similar to that described here has already been used with a certain success for ambiguity resolution (Agirre and Rigau, 1996). It could also be exploited to some extent. There are cases when the level of ambiguity of a text may be a significant feature, for example, in deciding on the suitability of a text for a particular readership.

• DAG structural analysis: As noted in section 4.3, the shape, as opposed to the contents, of the graph produced for an individual text is a telling characteristic. A comparison of structures across genre could yield interesting results. Some preliminary approaches have been made to factor graph structure into the comparison algorithm, see the end of Section 3.2, but much remains to do on this question.

References

Eneko Agirre and German Rigau. 1996. Word sense disambiguation using conceptual density. In Proceedings of the 16th International Conference on Computational Linguistics, Copenhagen.

Association for Computational Linguistics. 1999. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland.

S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391–407.

C. Fellbaum. 1990. WordNet, an electronic lexical database. The MIT Press.


J. Flood, editor. 1984. Understanding reading comprehension. International Reading Association, Delaware.

Michael A. K. Halliday and Ruqaiya Hasan. 1976. Cohesion in English. Longman.

Sanda M. Harabagiu and Steven J. Maiorano. 1999. Knowledge-lean coreference resolution and its relation to textual cohesion and coherence. In (Association for Computational Linguistics, 1999), pages 29–38. Workshop on the relation of discourse/dialogue structure and reference.

Sanda M. Harabagiu and Dan Moldovan. 1998. A parallel system for text inference using marker propagations. IEEE Transactions in Parallel and Distributed Systems, pages 729–747.

Sanda M. Harabagiu. 1999. From lexical cohesion to textual coherence: a data driven perspective. Journal of Pattern Recognition and Artificial Intelligence, 13(2):247–265.

Ruqaiya Hasan. 1984. Coherence and cohesive harmony. In (Flood, 1984), pages 181–219.

Marti Hearst. 1997. Text-tiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23(1):33–64.

Michael Hoey. 1991. Patterns of Lexis in text. Oxford University Press.

Stefan Kaufmann. 1999. Cohesion and collocation: Using context vectors in text segmentation. In (Association for Computational Linguistics, 1999), pages 591–595.

Adam Kilgarriff. To appear. Comparing corpora. International Journal of Corpus Linguistics.

John Locke. 1964. An essay concerning human understanding. Collins, 5th edition.

Katja Markert and Udo Hahn. 2002. Understanding metonymies in discourse. Artificial Intelligence, 135:145–198.

Rada Mihalcea and Dan Moldovan. 1999. A method for word sense disambiguation of unrestricted text. In Proceedings of ACL '99, pages 152–158, Maryland, June.

A Appendix: News Texts

The following news texts are from RTE Interactive, the Irish National Broadcaster's internet division, and from the Google News site.

A.1 Arsenic1 (text via Google News, 3 May 2003)

Maine Police Link Dead Man to Arsenic Case. Sat May 3, 2003 01:32 PM. CARIBOU, Maine (Reuters)—Maine police on Saturday linked a man who died of a possibly self-inflicted gunshot wound to the arsenic-tainted coffee poisoning at a local church that killed one parishioner and sickened at least 15 others.

State Police Col. Michael Sperry told reporters that Daniel Bondeson, 53, who died on Friday evening at a hospital in Caribou, was involved in the poisoning at the church that has rocked New Sweden, Maine, a town of about 600 people near the Canadian border.

"We have linked the shooting to the death at the church," Sperry said, explaining that police had found key information at Bondeson's farm house where he was found shot. Bondeson died after emergency surgery.

Sperry however declined to describe the information that police had found, and declined to say whether police had discovered a suicide note or arsenic. A search of Bondeson's Woodland, Maine home was expected to take several days.

"This was reported as a self-inflicted gunshot wound to us," Sperry said, adding however that police won't say for sure what happened until the body is autopsied on Monday.

The mystery began last Sunday when parishioners at the Gustaf Adolph Lutheran Church in New Sweden suddenly fell ill after services.

One man, 78-year-old Walter Morrill, died after drinking the coffee and several others were hospitalized. Bondeson, a bachelor, was a member of the church but did not attend last week's sermon. However his brother and sister were there but did not drink anything, parishioners said.

Police earlier this week said the concentration of arsenic led them to believe it had been put into the coffee deliberately. They called in the Federal Bureau of Investigation to gather fingerprints and DNA samples from as many as 50 parishioners to try and find the killer.

Arsenic can kill quickly if consumed in large quantities, although small, long-term exposure can lead to a much slower death. It can give a strong bitter taste to food or beverages it contaminates.

A.2 Arsenic2 (text via Google News, 3 May 2003)

Police: Shooting may be tied to poisoning. 5/3/2003 1:06 PM. NEW SWEDEN, Maine (UPI)—Investigators in the small northern Maine town of New Sweden searched Saturday for possible links between a fatal shooting and a church arsenic poisoning case.

Daniel Bondeson, 53, died Friday evening, shortly after he was found shot at his home in Woodland, adjacent to New Sweden. Police said Bondeson was not a suspect in the church poisoning, but searched his home for any connection to the laced church coffee that killed one elderly man and sickened at least 15 other people. Two of the victims were still in critical condition.

"We won't make a determination until the autopsy (on Bondeson's body) Monday," Col. Michael Sperry, chief of the Maine State Police, said Saturday outside the Caribou courthouse.

"This is an open investigation, and we are still looking at who is involved," he said. Investigators are treating the arsenic case as a homicide.


"The FBI is still very much involved," he added, saying the arsenic poisoning has had "a huge impact on this small community," about 20 miles south of the Canadian border.

The poisoning occurred Sunday after services at the Gustaf Adolph Lutheran Church in New Sweden, a community of about 620 in the potato-farming region in rural northern Maine.

Among the two dozen parishioners who drank the coffee was Walter Reid Morrill, 78, who died Monday. Morrill was a member who lived next to the church and served as caretaker and head usher.

The remaining coffee in the percolator contained "high levels" of arsenic, said Stephen McCausland, spokesman for the Maine Department of Public Safety.

He said tests confirmed the arsenic was not in the unbrewed coffee, in the tap water or in the sugar.

Word that someone apparently had deliberately put arsenic in the coffee frightened area residents.

"This is a small community where everyone knows each other," McCausland told United Press International. "We don't know whether (the perpetrator) was among them or someone from outside, but the focus of our investigation now is to find who is responsible for introducing the arsenic, and why."

A.3 Bomb2 (text from RTE Interactive, 27 Nov 2002)

Man due in court for bomb attempt. A 34-year-old man is due in court in the North charged in connection with an attempted firebomb attack in Belfast city centre at the weekend.

The man has been charged with possession of explosives with intent to endanger life and conspiracy to cause an explosion.

He was arrested in a major security operation in the city centre last Sunday night, during which a second man was shot twice by police.

Police discovered a device consisting of two gas cylinders and two pipe bombs linked to a number of containers of flammable liquid in a car abandoned outside the motor tax office at Upper Queen Street at the weekend.

A timer device had been attached and activated. The shooting of the second man by police is being investigated by the North's Police Ombudsman, Nuala O'Loan.

A.4 Shooting (text from RTE Interactive, 5 Dec 2002)

8SHOT. R1 SM 05-12-2002 07.51. Two men have been injured in paramilitary style attacks in Ardoyne in north Belfast. A 27 year old man was shot in the leg; he was found near Ardoyne Avenue shortly after eight o'clock last night. And a 20 year-old man was also shot in the leg in the garden of a house at Butler Walk in the city. Police said the incident happened at around midnight.

A.5 Budget (text from RTE Interactive, 4 Dec 2002)

Minister briefs Cabinet on Budget. 041202TM12.50. The Government's Budget for 2003 will be unveiled in the Dáil this afternoon by the Minister for Finance, Charlie McCreevy. It is expected that only very small tax reductions will be included, while increases in Social Welfare payments will generally be kept in line with the rate of inflation.

The Minister is expected to make a budgetary provision to pay the 25% backdated element of the Public Sector Benchmarking pay awards. A freeze on public sector numbers as well as a three-year programme to reduce staff levels in the public sector is also expected to be announced. Mr McCreevy briefed the Cabinet at an early morning meeting ahead of this afternoon's Dáil debate. Coverage of the Dáil debate will be broadcast during a special 5-7 Live programme on RTE Radio One from 3.30pm and on Raidió na Gaeltachta from 4.08pm. Special Budget programmes will also be broadcast on RTE One Television from 3.35pm, on Network Two from 4.30pm and on TG4 at 8.30pm this evening, during which there will be a phone-in with a panel of experts. RTE Radio One, 2FM, Raidió na Gaeltachta and Lyric news bulletins will provide reports of Budget announcements. And following the 9 o'clock news on RTE One Television, there will be further interviews, analysis and discussion about the Budget. The RTE website and Aertel will also provide up-to-the-minute reports.


Domain Knowledge Articulation using Integration Graphs

Madalina Croitoru Ernesto Compatangelo

Department of Computing Science, University of Aberdeen, UK
{mcroitor, compatan}@csd.abdn.ac.uk

Abstract

In order to be queried and reused, different domain knowledge bases must be shared using the same representation. In this paper, we introduce a graph-based knowledge model to articulate independently developed ontologies that describe the same application domain. Our knowledge representation formalism is based on a new type of graphs denoted as "hierarchical graphs". This representation provides a basis for the heuristic inferences and for the logical deductions that are used to detect semantic mismatches between domain knowledge bases whenever they must be composed. Integration graphs can also be viewed as a suitable data modelling formalism for information integration.

1 Introduction

Domain Knowledge describes the matter of interest in some context, such as entities and their relationships in a conceptual schema or taxonomies of shared concepts in an ontology. In any case, domain knowledge is considered separately from the problems or tasks that may arise in its application context (Uschold, 1998). In recent years, the problem of establishing links between different Domain Knowledge Bases (DKBs for short) has received considerable attention from both database and knowledge engineering communities (Beneventano and others, 2001; Maedcke and others, 2002). Specifically, the problem of creating a shared view that links two (or more) DKBs without modifying their respective contents is being actively investigated. The introduction of a shared view (denoted as an articulation) is considered as an effective way of enabling both queries to distributed data and knowledge bases and ontology sharing & reuse (Mitra et al., 2000; Compatangelo and Meisel, 2002).

In order to be queried and reused, different domain knowledge bases must be shared using the same representation. In other words, knowledge interoperability must be enabled first. In the past decade, research in Knowledge Representation (KR) led to the introduction of several KR languages and systems. These are based on diverse knowledge models, such as semantic networks, frames, description logics, and conceptual graphs.

• Semantic networks represent knowledge as labelled directed graphs, where nodes denote concepts, and arcs denote the various relations between concepts (Sowa, 2000).

• Frames represent knowledge (e.g. individuals, classes or general concepts) using an object-like structure with attached properties. Frames are typically arranged in a taxonomic hierarchy where each of them is linked to one (or more) parent frames (Karp, 1993).

• Description Logics (DLs) represent knowledge similarly to frames, explicitly distinguishing between a TBox and an ABox. While the former describes classes and their mutual links, the latter describes individual elements possibly belonging to classes explicitly defined in the TBox. The set theoretic semantics of DLs allows the introduction of specialised deductive services such as DKB consistency, concept normalisation and completion, subsumption between concepts, and instance retrieval (Baader and others, 2003).

• Conceptual graphs represent knowledge as bipartite labelled graphs where nodes denote entities or relationships between them. The knowledge model underpinning conceptual graphs is logically sound and complete w.r.t. the semantics of First-Order Predicate Logics (FOPL). Reasoning about conceptual graphs is based on graph-theoretical operations that rely on graph homomorphism (Sowa, 2000).

Despite the variety of available knowledge models, the representation of knowledge that specifies mappings between related domain knowledge bases is still an open research problem. Moreover, the complex structure of these mappings affects the development of automated reasoning algorithms to analyse and validate them. More specifically:

• Semantic Networks suffer from the lack of a clear formal semantics, even if some versions are formally defined within the framework of FOPL. Even in these approaches, the expressive power of Semantic Networks is not adequate to represent the structure of mappings between related concepts in partially overlapping DKBs.

• The semantics of frames is not completely formalised, which prevents non-trivial deductions about them. Moreover, the object-oriented structure of frames is not particularly suited to express mappings between concepts.

• Despite their set-theoretic semantics, DLs do not have specific primitives to express mappings other than C ⊑ D and D ⊑ C (i.e. subsumption relationships between concepts). Nevertheless, articulation-specific deductions can be effectively performed in a DL-based framework (Compatangelo and Meisel, 2002). However, the usage of purely terminological deductions provided by DL reasoners fails to detect links between related concepts or individuals expressed using different (but otherwise semantically equivalent) lexicons.

• Although conceptual graphs benefit from a solid logic foundation, their expressive power may not be completely adequate to express mappings between structured concepts such as those in frame-based ontologies. An extension that can capture the hierarchical aspect of articulation modelling might overcome this drawback.

Graph-based models have advantages over frame-based models in expressing certain kinds of mappings (e.g. mapping properties into concepts and vice versa). However, their expressiveness might not be adequate for modelling the DKBs to be articulated, integrated or merged, and their automated deductive capabilities are rather limited. In order to overcome the above drawbacks, in this paper we propose integration graphs, which combine the advantages of graph-based models with those of frame-based models. Integration graphs are hierarchical graphs used to articulate independently developed ontologies that describe the same application domain. The representation formalism of integration graphs provides a basis for the logical inferences needed to compose domain knowledge bases and to detect semantic mismatches between them. Integration graphs can also be viewed as a suitable data modelling formalism for information integration systems, allowing the end user to restrict their queries either to any individual source or to their shared view. Therefore, our graph-based model can provide internal navigation instructions to support the query execution plans needed to interact with the various knowledge sources.


2 The graph-based articulation of independent sources

Knowledge integration is the task of combining different domain knowledge bases. In order to accomplish this task, a suitable representation must be used, a methodological approach to knowledge combination must be introduced, and a set of automated inferential services to support the combination process must be provided.

However, the effectiveness of both the inferential services and the methodological approach used in the knowledge combination process explicitly depends on the formalism used to represent this knowledge. Two different kinds of models have been used to express mappings between partially overlapping domain knowledge bases, namely semantic networks and frames.

A graph-oriented model for the articulation of ontology interdependencies based on semantic networks has been used in the ONION approach (Mitra et al., 2000). Although graphs can model different kinds of mappings in a very flexible and efficient way, their usage is currently hampered by their limitations in effectively modelling the structure of overlapping concepts.

A frame-based model for the alignment of ontologies has been used in the PROMPT approach (Friedman Noy and Musen, 2000). Although frames effectively model the structure of overlapping concepts, their primitives for modelling mappings between these concepts are extremely limited. Moreover, because of their ill-defined semantics, frames do not support automated deductive services for detecting and/or for validating these mappings.

Description Logics (DLs) (i.e. a family of frame-based knowledge modelling languages with a formal semantics) have been used to model and analyse the articulation of ontology interdependencies (Compatangelo and Meisel, 2003). This approach overcomes the drawbacks of other frame-based approaches by making use of the specialised deductive services provided by DL-based engines.

Despite their high expressiveness, even languages like DL-OWL (W3C, ), which are adequate to model complex domain concepts, fail to provide suitable primitives for modelling complex mappings between overlapping concepts. Conversely, despite their lower expressiveness on the concept side, graph-based models capture mappings in a more effective way. Therefore, in this paper we introduce a graphical formalism, denoted as integration graphs, which can be used to describe knowledge combination (e.g. articulation, integration, alignment).

Integration graphs describe the knowledge in terms of nodes. The links between similar concepts in different knowledge domains are described in terms of edges. Integration graphs provide good navigational properties, as well as a logic foundation that allows automated semantic reasoning.

2.1 The domain knowledge articulation process

Knowledge interoperability explicitly depends on the adopted representation. However, the choice of a common knowledge model (and thus of the same representation) does not automatically imply that knowledge in different DKBs can be straightforwardly shared and reused. Another fundamental operation must be done to enable knowledge interoperability: all the diverse (but partially overlapping) DKBs must be "aligned", i.e. they must reach a condition of mutual agreement. This can be done by articulating them.

The introduction of an articulation establishes semantic bridges between concepts in the intersection of two partially overlapping DKBs, which remain otherwise independent and physically separated. An articulation creates a shared view, where each concept in the view is mapped into one or more concepts in the first DKB and into one or more concepts in the second DKB. The participating domain knowledge bases are completely unaffected by the introduction of this view (Compatangelo and Meisel, 2002). A graphical overview of the articulation of two domain knowledge bases is shown in Figure 1, where the dashed lines connect each concept in the articulation to a couple of related concepts, one in each source DKB. As shown in this figure, hierarchical relationships between articulation concepts can also be defined. These may or may not correspond to equivalent hierarchical relationships between related concepts in each source DKB.

Figure 1: Articulation of DKBs S1 and S2, giving rise to the shared view A.

The introduction of an articulation schema does not modify the schemas to be shared in any way. Moreover, an articulation considers all participating DKBs as "peers". In other words, it does not implicitly distinguish between a "more general" DKB, which should not be modified in any case, and a "more specific" DKB which may need to be modified in order to align it to the other one.

Following a recent approach to the articulation of conceptual schemas and ontologies (Mitra et al., 1999; Mitra et al., 2000), a graphical representation can be used to articulate two (or more) DKBs. The semantic bridges between their concepts can be modelled using both logical rules (e.g. the semantic implication between terms across ontologies) and functional rules (e.g. a conversion function between terms across ontologies). Articulation rules indicate which terms, individually or in conjunction, are related in two source ontologies. An articulation ontology contains these terms and their mutual relationships. The term articulation thus refers both to the articulation ontology and to the set of mappings that relate elements in the articulation to terms in the source ontologies. A unified ontology integrates two or more source ontologies using an articulation.

2.2 Supporting knowledge articulation with integration graphs

An integration graph is a graphical representation for the articulation of two or more domain knowledge bases. It formalises the integration process using hierarchical graphs (Busatto, 2002). Nodes in integration graphs describe domain concepts. Simple nodes denote new concepts, while complex nodes denote hidden, previously known ontologies. Each edge linking a simple node to a complex node has an articulation rule associated to it and instructions that describe its use.

In order to illustrate the notion of integration graph, we consider the two graphs in Figure 2, which represent the two source ontologies Carrier and Factory originally introduced in (Mitra et al., 2000). A possible articulation of these ontologies is described by the integration graph Transportation in Figure 3.

The articulation can be built using the following sequence of methodological steps, originally introduced in (Compatangelo and Meisel, 2003) within the framework of intelligent conceptual schema integration.

1. Each ontology is analysed to derive all its possible implications (e.g. further isa links) that can be used to validate it.

2. The participating ontologies are jointly analysed, e.g. using a DL-based inferential engine, identifying overlapping simple nodes and thus their correspondences.


Figure 2: The graphs describing two ontologies, Carrier and Factory

Figure 3: The articulation ontology Transportation

3. Either a human analyst or a heuristic reasoner validates correspondences between overlapping nodes. More complex correspondences involving the conjunction and/or the disjunction of nodes can also be defined.

4. A node in the articulation ontology can be generated for every couple of corresponding nodes in the source ontologies. Each of them is linked to the articulation entity by way of an articulation rule.

5. The articulation rules can be validated and the articulation ontology can be subsequently analysed to derive hidden implications and thus further knowledge.

These steps can be performed either manually by an analyst or automatically by an intelligent knowledge management environment such as ConcepTool (Compatangelo and Meisel, 2003), where a graph can be formally expressed using a Description Logic (DL). Lexical analysis can also be performed by ConcepTool to enhance the quantity and the quality of possible implications.


Figure 4: The articulation rules used in the Transportation graph

The articulation rules H1–H4 (see Figure 4) describe the edges that connect simple nodes in the integration graph (represented by circles or ovals) to complex nodes (represented by squares or rectangles). A simple node denotes a new concept, while a complex node denotes a known ontology (a source one in our example). Each edge linking a simple node to a complex one has an articulation rule associated to it, as well as instructions which describe its usage. In our case, all simple nodes are assumed to be visible. Each edge with (at least) one complex node at its extremity has both an associated rule Hi and substitutions that explain which simple node (inside a complex one) is involved. The edges are shown in Figure 4, where name ambiguities are solved by qualifications.

The articulation rules shown in Figure 4 are specified in Table 1. These rules offer a favourable approach when dealing with navigational problems inside a graph. They also support a logic-based formalism suitable for reasoning about them.

Transportation → Carrier: H1; a → T.Transportation, b → C.Transportation
CarsTrucks → Carrier: H2; a → T.CarsTrucks, b → C.Cars, c → C.Trucks
PassengerCar → Carrier: H1; a → PassengerCar, b → C.Cars
Owner → Carrier: H3; a → T.Owner, b → C.Owner
CarsTrucks → Factory: H1; a → CarsTrucks, b → Vehicle
Owner → Factory: H3; a → T.Owner, b → Buyer
Factory → PassengerCar: H1; a → Vehicle, b → PassengerCar
Carrier → VehicleCargoCarrier: H1; a → Trucks, b → VehicleCargoCarrier
Factory → VehicleCargoCarrier: H4; a → Vehicle, b → CargoCarrier, c → VehicleCargoCarrier

Table 1: Specification of the articulation rules shown in Figure 3

2.3 Formal definition of integration graphs: an overview

Our model introduces a graphical formalism that can be used to perform information extraction and integration. Here, we only propose a formal approach for describing data; substantial tests and further developments will be needed to fully develop our framework.

A (di)graph is a pair G = (N, E) where N is a finite nonempty set of nodes and E is a multiset on N × N.


Figure 5: Composition patterns

If e = (n1, n2) ∈ E is a (directed) edge, we simply write e = n1n2. The basic objects of our model are called source graphs. A source graph is a pair SG = (G, VN), where G is a graph and VN is a subset of the nodes of G. VN is called the set of visible nodes of G. A source graph describes a world by means of the graph G and offers an integration interface with other worlds by means of the set of its visible nodes VN. The graphs in Figure 5, namely SG1 = (G1, {a, b, c, d}) and SG2 = (G3, {m, n, p}), are examples of source graphs.

In order to describe the interaction between different worlds (i.e. source graphs) we define a bipartite (directed) graph H = (S, R; E) denoted as articulation rule. The set of nodes of this graph is S ∪ R (where S and R are disjoint, nonempty sets) and E is the set of directed edges joining nodes from S to nodes from R. In Figure 4, three articulation rules H1, H2 and H3 are described.

Consequently, we define an integration graph as a pair IG = (G, VN), where G is a graph and VN is a set of visible nodes. Each node n ∈ N(G) of G is either a simple or a complex node. The set of visible nodes associated to a simple node n is the singleton {n}.

If n is a complex node, then it has associated a description D(n) which is also an integration graph. If e = n1n2 ∈ E(G) is an edge of G, and if at least one of its extremities is a complex node, then the edge e has associated an articulation rule He and two injective functions se and re. The function re associates to each node of RHe a node from the visible set of nodes of n1. The function se associates to each node of SHe a node from the visible set of nodes of n2. The triple (He, se, re) means that the links between the two nodes are described by the pattern H(e) and can be obtained by using the two substitutions se and re.
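To make the preceding definitions concrete, the following Python sketch encodes graphs, source graphs, articulation rules and integration graphs as plain data structures. The class and attribute names are ours, and the encoding is only one possible reading of the definitions above, not the authors' implementation.

class Graph:
    """A digraph G = (N, E): a finite nonempty set of nodes and a multiset of directed edges."""
    def __init__(self, nodes, edges):
        self.nodes = set(nodes)        # N
        self.edges = list(edges)       # E, pairs (n1, n2); kept as a list since E is a multiset

class SourceGraph:
    """A source graph SG = (G, VN): a graph plus its set of visible nodes."""
    def __init__(self, graph, visible):
        assert set(visible) <= graph.nodes
        self.graph = graph
        self.visible = set(visible)    # VN, the integration interface with other worlds

class ArticulationRule:
    """A bipartite digraph H = (S, R; E) with edges running from S-nodes to R-nodes."""
    def __init__(self, s_nodes, r_nodes, edges):
        self.s_nodes = set(s_nodes)
        self.r_nodes = set(r_nodes)
        self.edges = list(edges)       # pairs (s, r) with s in S and r in R

class IntegrationGraph:
    """IG = (G, VN). Complex nodes carry a description; edges touching a complex
    node carry a triple (He, se, re) of rule and substitutions."""
    def __init__(self, graph, visible, descriptions, edge_rules):
        self.graph = graph
        self.visible = set(visible)
        self.descriptions = dict(descriptions)  # complex node -> SourceGraph (or nested IntegrationGraph)
        # edge (n1, n2) -> (He, se, re); se maps S-nodes of He to visible nodes of n2,
        # re maps R-nodes of He to visible nodes of n1
        self.edge_rules = dict(edge_rules)

# Toy usage with invented node names, for illustration only:
g = Graph({"a", "b", "c"}, [("a", "b"), ("b", "c")])
sg = SourceGraph(g, {"a", "b"})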

An example of integration graph is depicted in Figure 6, where Art is a simple node and G1, G3 are complex nodes that refer to the corresponding graphs.

Figure 6: Example of integration graph, with nodes Art (simple), G1 and G3 (complex), and edges labelled H2 (s1→Art; r1→a, r2→e), H2 (s1→Art; r1→n, r2→p) and H3 (s1→a, s2→f, s3→e, s4→d; r1→n, r2→p)


Figure 7: The flat graph represented in Figure 6

The flat graph that can be obtained from the integration graph after applying all our articulation rules and by only considering the visible specified nodes is illustrated in Figure 7.
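Continuing the sketch given after the definitions above, flattening could proceed roughly as follows: complex nodes are replaced by the visible nodes of their descriptions, and each edge carrying a rule He is expanded into the edges of He renamed through the substitutions se and re. This is only our reading of the construction, not the authors' algorithm.

def flatten(ig):
    """Expand an IntegrationGraph (see the sketch above) into a plain Graph over visible nodes."""
    nodes, edges = set(), []
    # Simple nodes are kept; complex nodes contribute the visible nodes of their descriptions.
    for n in ig.graph.nodes:
        if n in ig.descriptions:
            nodes |= ig.descriptions[n].visible
        else:
            nodes.add(n)
    for (n1, n2) in ig.graph.edges:
        if (n1, n2) in ig.edge_rules:
            # Expand the edge through its articulation rule and substitutions.
            rule, s_subst, r_subst = ig.edge_rules[(n1, n2)]
            for (s, r) in rule.edges:
                edges.append((s_subst[s], r_subst[r]))
        else:
            # An edge between two simple nodes is copied unchanged.
            edges.append((n1, n2))
    return Graph(nodes, edges)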

3 Discussion

Graph-based models such as semantic networks and conceptual graphs could potentially be effective to express mappings between concepts because of their overloaded constructors (e.g. nodes or edges). For instance, these could easily allow the mapping of properties into concepts and vice versa. Conversely, this kind of mapping cannot be naturally and easily introduced in frame-based models (including DL-based ones). However, the expressiveness of standard graph-based approaches is not adequate to model the DKBs to be articulated, integrated or merged. Moreover, the absence of a full mapping into a formal logical system may prevent non-trivial automated deductions. On the other side, frame-based approaches support hierarchical reasoning (e.g. subsumption) in a straightforward way.

In order to overcome the drawbacks of both knowledge models, we have introduced integration graphs (i.e. a type of hierarchical graphs) that combine the flexibility of traditional graph models with the hierarchical structure of frame models. Being based on FOPL, hierarchical graphs allow the introduction of automated semantic reasoning services based on their knowledge model. Our integration approach can easily capture simple semantic implication rules or simple Horn clauses such as the ones introduced in (Mitra et al., 2000), which are necessary for domain knowledge interoperability. We note that our approach assumes that a frame-based knowledge model is used to describe the DKBs to be articulated, integrated or unified, while integration graphs are used to describe the articulation of these DKBs.

The interface of complex nodes in integration graphs is an interesting new feature that can be used both to induce subgraphs in the hierarchical structure and to describe those links that are necessary to navigate in the nested structured worlds. In considering such graphs, we aimed to provide a formalism which describes relationships between data and highlights the navigation through paths in the (semi)structured data worlds. Low-level operations such as adding or deleting simple or complex nodes can be easily designed. If the source graphs of the integration graph are chosen to be conceptual graphs, the entire structure will provide an elegant mode to do fusion reasoning by a method similar to the one used in (Chein et al., 1998) for nested graphs.

A current limitation of our graph-based approach is the lack of automated support to reason about both an articulation and its mappings. This limitation could be partially overcome by rendering the hierarchical graphs as DL knowledge bases. These could be managed by a system for the articulation of domain knowledge such as ConcepTool (Compatangelo and Meisel, 2003). However, in doing so, the mappings should be substantially reduced and constrained.

The formalism described in this paper will be incorporated in the ongoing information integration project Infoshare, a formal approach for reusing the information contained in two or more databases or populated ontologies using the articulation of their conceptual schemas.

References

[Baader and others 2003] F. Baader et al., editors. 2003. The Description Logic Handbook. Cambridge University Press.

[Beneventano and others 2001] D. Beneventano et al. 2001. The MOMIS approach to Information Integration. In IEEE and AAAI Intl. Conf. on Enterprise Information Systems (ICEIS'01).

[Busatto 2002] G. Busatto. 2002. An Abstract Model of Hierarchical Graphs and Hierarchical Graph Transformation. Ph.D. thesis, Paderborn University, Germany.

[Chein et al. 1998] M. Chein, M.-L. Mugnier, and G. Simonet. 1998. A graph-based knowledge representation model with FOL semantics. In Proc. of the 6th Intl. Conf. on the Principles of Knowledge Representation and Reasoning (KR'98), pages 524–534.

[Compatangelo and Meisel 2002] E. Compatangelo and H. Meisel. 2002. EER-ConcepTool: a "reasonable" environment for schema and ontology sharing. In Proc. of the 14th IEEE Intl. Conf. on Tools with Artificial Intelligence (ICTAI'2002), pages 527–534.

[Compatangelo and Meisel 2003] E. Compatangelo and H. Meisel. 2003. "Reasonable" support to knowledge sharing through schema analysis and articulation. Intl. Jour. of Engineering Intelligent Systems, to appear.

[Friedman Noy and Musen 2000] N. Friedman Noy and M. A. Musen. 2000. PROMPT: Algorithm and Tool for Automated Ontology Merging and Alignment. In Proc. of the 17th Nat. Conf. on Artificial Intelligence (AAAI'00).

[Karp 1993] P. D. Karp. 1993. The Design Space of Frame Knowledge Representation Systems. Technical Note 520, Artificial Intelligence Center, SRI International.

[Maedcke and others 2002] A. Maedcke et al. 2002. MAFRA – A MApping FRAmework for Distributed Ontologies. In Proc. of the 13th Intl. Conf. on Knowledge Engineering and Knowledge Management (EKAW'2002), volume 2473 of Lecture Notes in Computer Science, pages 235–250. Springer-Verlag.

[Mitra et al. 1999] P. Mitra, G. Wiederhold, and J. Jannink. 1999. Semi-automatic integration of knowledge sources. In Proc. of the 2nd Intl. Conf. on Information Fusion (FUSION'99).

[Mitra et al. 2000] P. Mitra, G. Wiederhold, and M. L. Kersten. 2000. A Graph-Oriented Model for the Articulation of Ontology Interdependencies. In Proc. of the VII Conf. on Extending Database Technology (EDBT'2000), Lecture Notes in Computer Science, pages 86–100. Springer-Verlag.

[Sowa 2000] J. Sowa. 2000. Knowledge Representation: Logical, Philosophical, and Computational Foundations. Brooks Cole Publishing Co.

[Uschold 1998] M. Uschold. 1998. Knowledge level modelling: concepts and terminology. The Knowledge Engineering Review, 13(1):5–29.

[W3C] The World Wide Web Consortium – Semantic Web Activities. http://www.w3.org/2001/sw/.


Experiments in Ontology Construction from Specialist Texts

Mariam Tariq, Pensiri Manumaisupat, Rafif Al-Sayed, Khurshid Ahmad
Department of Computing, University of Surrey
{m.tariq, p.manumaisupat, r.sayed, k.ahmad}@surrey.ac.uk

Abstract

The identification of a domain ontology is usually a theoretical pursuit. However, the development of knowledge management systems and information extraction systems often requires an understanding of the ontology of the domain; the question of ontology then has a serious practical import. Knowledge is typically recorded in archives of texts: an examination of a sample of the archive may lead to the identification of a potential ontology of the domain. This is our claim.

1 Introduction

Text, whether in paper or electronic form, is a tangible source of information or knowledge that can be shared amongst a group of people in any domain. Almost all human enterprises are characterized by a text repository that is peer-reviewed and reflects the opinion of the majority of the enterprise; a repository may comprise books, learned papers, popular articles, dictionaries, manuals and handbooks. The advent of the World Wide Web has made it easier to collect, store and access such resources electronically.

The conceptual organization of any subject domain is based on a consensus amongst members of the domain. There are philosophical debates about how the concepts are organized and whether or not the organization is a social, psychological, or political phenomenon. The key for us is the existence of a consensus, and the consensus manifests itself in the speech and writing of the domain community. The community, scientific, technical, recreational, political, religious or otherwise, continues to evolve and change. The evolution and change manifests itself in the linguistic output of the domain. This is not to deny that other semiotic systems, other than language, are not at work, but at least for us, linguistic output is amongst the most tangible.

The discussions of ontology, ontologies, and different things ontological have a considerable psycho-philosophical overtone (Sowa 2000). This is perhaps due to the fact that unlike the building blocks of a subject domain – its terms – the relationship between the terms is not explicitly available. This is certainly true at the inception of a subject. When the subject matures the relationships are made more explicit and sometimes the conceptual organization is shown graphically as well. However, the relationship between terms is signaled in domain texts. There may be many complicated ways in which a natural language allows the description of the so-called semantic relationships; nevertheless it is not difficult to find rather straightforward ways used by the domain community to signal such relationships. For example, a forensic scientist describes a hyponymic relationship by the simple device of enumeration: body fluids like saliva, blood, urine ..., and this practice of describing relationships through the use of phrases like, including, such as, and/or other is quite common in a number of typologically distinct domains. Cancer specialists write that adjuvant therapies include chemotherapy and hormone therapy and that anthracyclines, docetaxel and other drugs are used in chemotherapy. Financial news wires contain examples of such signaling as in defensive stocks such as oil, confectionary and electricity. The semantic relationships are the basis of the ontological commitment of a domain community. The inter-relationships of terms in a domain, when made explicit, for example, diagrammatically or more formally through a graph, will tell us something about the ontology of the domain.

In this paper we describe a method for identifying the ontological commitment of a domain by examining a random selection of documents produced by the members of the domain. The use of the various phrases to encode such complex relationships appears to have its own rules of description – a kind of local grammar governs the behaviour of clauses where the authors describe their ontological commitment. We describe how such structures have been identified in three different subject domains: forensic science, breast cancer research and finance & business, by exploring the local grammar. The penultimate use of making the ontological commitment explicit in the three domains is different.

2 Ontology Construction and Special Language Texts

We aim to explore the ontological commitment of a specialist domain from a randomly sampled domain-specific text corpus – that which might be construed to be a set of 'representative' texts of the domain. This notion of representativeness is controversial at best but has not deterred corpus linguists from building representative corpora of general language texts such as the British National Corpus (BNC) (Leech et al. 2001). Paradoxically this approach has been very productive; such corpora have led to new insights into the structure of language. The sampling is random in that the corpora are usually constructed through the use of domain-specific keywords in a Web search engine and texts of different genres are chosen. Specialist languages, considered variants of natural language, are restricted lexically, syntactically and semantically (Harris, 1988). Open class words dominate specialist language texts, particularly noun phrases (NPs); phrases used to name objects, events, actions and states relevant to the domain. It has been suggested that not only can terms be extracted from a specialist corpus (Ahmad & Rogers 2000, Bourigault et al. 2001) but also semantic relations of hyponymy and meronymy (part-whole relations) between terms (Ahmad et al. 2003, Hearst, 1992).

The open class words, particularly the single open class words, reflect the lexical choice of the domain measured by way of frequency of occurrence. Indeed, it has been claimed that the specialist texts contain the so-called lexical signature that distinguishes them from the everyday or general language texts. This signature can be elicited by comparing the frequency distribution of the open class words in a specialist corpus with that of the distribution of the same words in a 'representative' corpus of general language words, for example, the BNC. This measure has been referred to as the measure of weirdness (Ahmad & Rogers 2001) – the frequent use of such terms will appear unusual to a native speaker of English using the BNC as a standard. Terms with a high frequency and weirdness are usually considered good candidate terms. Domains are distinguished by the productive use of certain terms and, apart from inflectional and derivational use of these terms, much of the productivity manifests itself in the frequently used compound noun phrases that comprise one or more single words that give the idiosyncratic lexical signature to a given specialist domain (see Table 1 below).

Forensic Science Corpus (Total tokens: 610,197)
  OCWs: evidence, crime, scene, forensic, analysis, blood, dna
  WC: 215338 471111233
  Compounds: crime scene; forensic science; law enforcement; crime scene investigator; workplace homicide; supreme court; crime scene photography

Breast Cancer Corpus (Total tokens: 226,464)
  OCWs: cancer, breast, women, risk, patient, treatment, therapy
  WC: 336749 16473022 119
  Compounds: breast cancer; breast cancer risk; metastatic breast cancer; breast carcinoma; ovarian cancer; tamoxifen therapy; adjuvant therapy

Finance & Business Corpus (Total tokens: 681,215)
  OCWs: percent, million, market, pounds, shares, reuters, company
  WC: 67131232 10411222 5
  Compounds: interest rate; million pounds; ftse index; percent rise; tax profit; recovery plan; central bank

Table 1. A comparison of high frequency OCWs (potential candidate terms) in three different domains showing the weirdness coefficient (WC) and frequent compounds
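As a rough sketch of how a weirdness coefficient of the kind reported in Table 1 could be computed, the ratio of a word's relative frequency in the specialist corpus to its relative frequency in a general reference corpus such as the BNC can be coded as below; the function and the illustrative counts are ours, not the authors' implementation.

def weirdness(word, spec_freq, spec_size, gen_freq, gen_size):
    """Ratio of relative frequencies: specialist corpus vs. general reference corpus.

    spec_freq and gen_freq map words to raw counts; spec_size and gen_size are the
    corpus sizes in tokens. Words that never occur in the general corpus receive an
    'infinite' weirdness, flagging candidate neologisms (e.g. 'toolmark', 'estrogen')."""
    rel_spec = spec_freq.get(word, 0) / spec_size
    rel_gen = gen_freq.get(word, 0) / gen_size
    if rel_gen == 0:
        return float("inf") if rel_spec > 0 else 0.0
    return rel_spec / rel_gen

# Hypothetical counts, for illustration only:
print(weirdness("forensic", {"forensic": 500}, 610_197, {"forensic": 300}, 100_000_000))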

Once the single and compound words are identified automatically, by comparing the distributions in the specialist corpus with a representative general language corpus (for example, forensic occurs 471 times more frequently in the forensic science corpus than it does in the BNC), one can then examine the ontological commitments by examining semantic relations amongst the terms. The terms in a domain are often related to each other through a range of semantic relations such as hyponymy and meronymy, which can be used to build hierarchies. These semantic relations are often exemplified in a language through the arrangement of certain terms in recurrent grammatical patterns that can be subsequently analyzed. Cruse (1986) has discussed the notion of semantic frames: a triplet of phrases X REL Y, where X and Y are noun phrases (NPs) and REL is a phrase generally expressed as IS A, IS A TYPE OF/KIND OF and PART OF for illustrating hyponymic and meronymic relationships respectively.

Apart from the signal cues mentioned above, which are used most commonly in biological classifications anyway, it has been suggested by Hearst (1992) that certain enumerative cues could also be used to identify such relationships through lexico-syntactic patterns occurring in texts, such as the frame (X1, ..., Xn) OR OTHER Y, where each X and Y are NPs and each Xi in the list (X1, ..., Xn) is a hyponym or subtype of Y. An example sentence here could be 'In the case of shootings or other fatal assaults', in which case a shooting is a type of fatal assault. The cue and other follows the same pattern as the or other cue, whereas cues like such as, including and like have the order reversed: the NP representing the supertype occurs on the left with the list of subtype NPs on the right side of the cue, the last NP having an and or an or preceding it: Y INCLUDING (X1, ..., {or/and} Xn). An example sentence here could be "Trace evidence including fibers, hair, glass and DNA was found at the crime scene," fibers, hair, glass and DNA being types of trace evidence.

The phrases where the superordinate and subordinate/instances are linked together by one of the signals described above appear to be structurally similar to 'collocations [..] and frozen sentences [.....] one often encounters that cannot be related by formal rules of either phrase structure or transformational type' (Gross 1993:26). For Gross, phrases used in telling time or idiomatic use of language require a finite state automaton based on the so-called local grammar. Examples of the local grammar used for signalling relationships between one term (NP) and a set of other terms (NPs) are shown in Figure 1a and Figure 1b:

Figure 1a. Finite State Graph for the and/or other signal cues

Figure 1b. Finite State Graph for the including/like/such as/especially signal cues
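A minimal sketch of how these enumerative cue patterns could be applied once a cue-bearing span has been reduced to a sequence of NP chunks and cue tokens is given below; the chunk format, cue lists and function are ours, offered as an illustration rather than as the local grammar automata of Figures 1a and 1b.

CUES_SUPER_FIRST = {"including", "like", "such as", "especially"}  # Y CUE X1, ..., Xn
CUES_SUPER_LAST = {"and other", "or other"}                        # X1, ..., Xn CUE Y
CONNECTORS = {",", "and", "or"}

def hyponym_pairs(chunks):
    """Extract (hypernym, hyponym) pairs from a cue-bearing span already segmented
    into NP chunks and cue/connector tokens."""
    pairs = []
    for i, token in enumerate(chunks):
        if token in CUES_SUPER_FIRST and i > 0:
            hypernym = chunks[i - 1]
            hyponyms = [t for t in chunks[i + 1:] if t not in CONNECTORS]
            pairs += [(hypernym, h) for h in hyponyms]
        elif token in CUES_SUPER_LAST and i + 1 < len(chunks):
            hypernym = chunks[i + 1]
            hyponyms = [t for t in chunks[:i] if t not in CONNECTORS]
            pairs += [(hypernym, h) for h in hyponyms]
    return pairs

# e.g. hyponym_pairs(["trace evidence", "including", "fibers", ",", "hair", ",",
#                     "glass", "and", "DNA"])
# -> [("trace evidence", "fibers"), ("trace evidence", "hair"),
#     ("trace evidence", "glass"), ("trace evidence", "DNA")]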

Compound words often convey a semantic relationship between the constituent lexical units as well. Compounding tends to specialize the meaning of the headword, each successive modifier specializing it further. This semantic relationship often signifies the hyponymic relation; for example, the compound term trace evidence suggests that trace evidence is a type of evidence.


This heuristic can also be exploited to extract semantic relationships.
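The head-modifier heuristic can be sketched as repeatedly stripping the leftmost modifier of an English noun compound, each shorter compound acting as a hypernym of the longer one; the snippet below is our illustration and not the authors' code.

def compound_pairs(compound):
    """Hypernym-hyponym pairs implied by an English noun compound, e.g.
    'metastatic breast cancer' -> [('breast cancer', 'metastatic breast cancer'),
                                   ('cancer', 'breast cancer')]."""
    words = compound.split()
    pairs = []
    while len(words) > 1:
        hyponym = " ".join(words)
        words = words[1:]              # drop the leftmost modifier
        pairs.append((" ".join(words), hyponym))
    return pairs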

Sentences containing lexico-syntactic cues are automatically extracted and tagged to indicate the grammatical category of each word. Regular expressions are used to detect whether the local grammar is followed. All correct sentences are subsequently parsed, based on this local grammar, to extract hypernym-hyponym pairs. Compound NPs are also parsed recursively to extract more hypernym-hyponym pairs. All these hypernym-hyponym pairs extracted from the corpus are then merged together using a tree data structure that finally constitutes a forest of (sub-)trees, which can then be used in the identification of the ontological commitments of the domain. The final step is the representation of the various sub-trees in XML. An example of the process of analysing a sentence is shown in Figure 2.
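The merging and formalization steps can be pictured as follows: the pairs obtained from the cue patterns and from the compound analysis (as in the two snippets above) are folded into a forest, which is then serialised as nested <Object> elements. The sketch below assumes that pair format and is not the SoCIS pipeline itself.

def build_forest(pairs):
    """Merge (hypernym, hyponym) pairs into a children map plus the list of roots."""
    children, has_parent = {}, set()
    for hyper, hypo in pairs:
        children.setdefault(hyper, set()).add(hypo)
        children.setdefault(hypo, set())
        has_parent.add(hypo)
    roots = [c for c in children if c not in has_parent]
    return children, roots

def to_xml(concept, children, indent=0):
    """Serialise one (sub-)tree as nested <Object> elements."""
    pad = "  " * indent
    kids = sorted(children.get(concept, ()))
    if not kids:
        return f'{pad}<Object Name="{concept}" />'
    inner = "\n".join(to_xml(k, children, indent + 1) for k in kids)
    return f'{pad}<Object Name="{concept}">\n{inner}\n{pad}</Object>'

# Pairs suggested by the cued sentence in Figure 2 plus the compound heuristic:
pairs = [("trace evidence", "fibre"), ("trace evidence", "paint"),
         ("trace evidence", "glass"), ("trace evidence", "DNA"),
         ("evidence", "trace evidence")]
children, roots = build_forest(pairs)
print("\n".join(to_xml(root, children) for root in roots))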

Cued sentence: All forms of trace evidence such as fibres, paint, glass and DNA.
Tagged cued sentence: All/DT forms/NNS of/DT trace/NN evidence/NN such/JJ as/IN fibres/NNS ,/, paint/NN ,/, glass/NN and/CC DNA/NNP ./.
Parsed sentence: trace evidence {fibre, paint, glass, DNA}; evidence {trace evidence}

Figure 2. Analysis of a sentence containing a semantic frame to generate XML

Hence not only terms but also the relationships between the terms can be gleaned from domain-specific texts. There are a number of other textual resources electronically available for certain specialist domains, including terminology databases or lexicons that can be used to validate the existence of the terms, and by the examination of term definitions provided, the relationships between different terms can be verified as well. This can be used as a part of an overall approach to extract terms and possible relationships. Furthermore, there has been some extensive research done on creating general semantic lexicons like WordNet (http://www.cogsci.princeton.edu/~wn/), as well as knowledge bases that claim to model world knowledge like CYC (http://www.cyc.com/), which is still being developed and not freely available. Though useful for generic applications, they are inadequate for use in specialized domains such as forensic science or medicine due to a lack of specialized terminology; knowledge will have to be acquired specifically for the specialized domain. As an example, highly salient terms in the forensic science domain such as forensic science and crime scene were not found in WordNet. However, such knowledge sources may help in providing some top-level categories for a domain ontology. A system architecture for automatically extracting a domain ontology is shown in Figure 3.

Figure 3. A proposed architecture for candidate ontology generation given a domain-specific text corpus

This architecture is based on a three-step method: (i) knowledge acquisition involves the corpus creation and analysis phase; (ii) conceptual modeling involves the extraction of terms and their interrelationships as well as the integration of other knowledge sources and the merging of partial knowledge structures into an intermediate representation; and finally (iii) formalization involves the mapping of the intermediate representation into a formalism such as XML. User interaction could be optional in each of the phases for validation purposes.

3 Randomly Sampled Domain-Specific Specialist Corpora and Ontology

In this section we shall provide some details of applying the method outlined in Section 2 on three domain-specific specialist corpora, namely: forensic science, breast cancer research, and finance and business. We shall briefly describe the corpus and then some of the outputs from our method. In the breast cancer and finance and business domains there are online terminology databases available, which enabled us to compare our results to the term definitions provided for verification purposes.

3.1 Case Study 1: Forensic Science Domain

Content-Based Image Retrieval Systems are increasingly using keywords in addition to visual features for image categorization purposes (Squire et al., 2000). Texts related to images, also known as collateral texts, can help in the indexing and retrieval of images. Whereas closely collateral texts such as captions can be used to extract keywords to directly index or categorize images, we propose that broadly collateral texts, such as encyclopedic descriptions of objects within the image, may be a good source to extract related terms that can subsequently be used to build a thesaurus or ontology for query expansion purposes (Ahmad et al., 2003, Foskett 1997, Efthimiadis 1996). The method outlined in Section 2 was actually developed within the context of the Scene of Crime Information System (SoCIS, http://www.computing.surrey.ac.uk/ai/socis/) project, which attempted to exploit the use of texts related to images for the indexing and retrieval of crime scene images.

For the purpose of analyzing broadly collateral texts, a forensic science corpus of over half a million words was created from 1470 English texts (610,197 words). A variety of text types were collected from the Web, such as journal papers, handbooks and advertisements ranging from 1990 to 2001, to ensure that the corpus was representative of the domain. Crime scene forms filled in by Scene of Crime Officers were also included in the corpus.

The corpus had 20 OCWs amongst the first hundred, with evidence and crime being most frequent. These terms are used productively to make compounds; for example, crime and scene are used together to form 90 different compounds including crime scene analysis and crime scene photography. Some interesting candidate neologisms, which have a weirdness of infinity since they are not found in the BNC, included: bite-mark, toolmark, handgun, polygraph, footprints (example of an unusual inflection) and rifling (example of an unusual derivation).

The forensic science corpus was then analysed to extract semantic relationships between the terms. Over 1200 sentences containing enumerative cues were extracted and it was observed that 60% of them incorporated the local grammar representative of hyponymic relationships between the phrases (X1, ..., {or/and} Xn) and Y. It was interesting to note that the more typical X IS A Y frame pattern only brought up 40 valid sentences out of 400. Frames depictive of the meronymic relationship were not very productive; only 60 sentences were extracted, with 40% of them being representative.

The diagram (Figure 4) shows a graphical view of some of the automatically extracted relationships. A partial hierarchy is shown of the concept evidence, which has trace evidence as a sub-concept. Blood, fibre and DNA are types of trace evidence, and fibre can be further classified as inorganic, dye or manufactured polymeric fibre.


Figure 4. A partial hierarchy of the concept evidence in the forensic science domain, automatically extracted from a randomly-selected corpus of texts

3.2 Case Study 2: Knowledge Maps for Breast Cancer Research

One can broadly divide the knowledge of an organization, or more specifically that of the people within it, into knowledge that is a result of innovation, either accidental or planned, and knowledge that comes about through the application of knowledge, the so-called best practice. The innovation is usually attributed to the R&D professionals (Huseman and Goodman, 1998) and best practice to the knowledge workers – professionals involved in running the day-to-day operations of complex organisations (Davenport and Laurence, 2000).

More often than not, clarity in language-based communication is cited as the principal impediment in the exchange of knowledge, both innovative and extant; the latter relates to best practice. There are best practices that have evolved in the care and treatment of virulent diseases like cancer. Early identification of symptoms by the potential patient is regarded as key to successful therapy; early notification of adverse effects, which could easily be reversed by changing the drug regimen, is crucial. In both these cases the knowledge of innovation and that of best practice has to be communicated across different registers: from the innovators to the professionals and practitioners and then on to the public at large.

One potential answer is to allow access to key documents that lead on from invention onto best practice; a document repository which can be searched through the use of terms in different registers and which is cross-referenced in a dynamic manner. Most importantly, the ontological basis of the repository should be open to inspection – what is included in the repository and why. One solution is the development of computer-based methods that can automatically index and cross-reference documents that make up the knowledge of a domain. We describe how knowledge within a given domain can be mapped so that it is available almost independently of the register. A knowledge map, a representation of concepts and their relationships, enables a user to navigate through the network and follow links to relevant knowledge sources (information or people) from any specific concept (Chou and Lin, 1998, Browne et al., 1997).

Our case study aims to describe a method for mapping knowledge in the sub-domain of breast cancer, addressed specifically at the health professional level. The first step was to create a representative text corpus of nearly a quarter million tokens, from a variety of 1000 English texts (226,464 words) collected from the Web; mainly professional abstracts and journal papers from the NCI website (http://www.cancer.gov/). An analysis of the corpus to determine the most frequent single and compound words can lead to the identification of a lexical signature, which can be used as a basis to develop the knowledge map of the domain. Amongst the 10 most frequent terms, chemotherapy (9th most frequent) appears nearly 756 times more frequently in the breast cancer corpus than it does in the BNC (recall Table 1, Section 2). Certain terms such as estrogen have a weirdness value of infinity, indicating that they may be candidate neologisms.

We have argued above that the constituent lexical units of a compound word or collocation frequently reflect a semantic relationship. For example, taking the compound term breast cancer, which has a high frequency of 1581 in the corpus, it can be deduced that breast cancer is a type of cancer, distinguishing it from, say, ovarian cancer. This heuristic was used to indicate relationships between terms (Figure 5).

This method will also be applied to map knowledge from corpora comprising texts addressed at the patient and practitioner level respectively. A comparison between these different levels of knowledge will perhaps lead to a common knowledge map for breast cancer. Such a map could be used for generating text automatically for multi-level knowledge workers (health professionals, nurses and patients). We feel one of the main advantages of this method is its ability to be fully automated, using a database system as a backend. However, further evaluation and experimentation will determine its full potential for use in extracting domain knowledge for different levels of audience.

Figure 5. A Knowledge Map of some of the frequent NPs in the breast cancer domain, shown in a triangle with the most frequent single-word terms at the base and higher weirdness towards the apex

The method outlined in Section 2 was applied to the breast cancer corpus, which can be considered highly specialized. Results from our analysis include (shown in XML):

<Object Name="adjuvant treatment"> <Object Name="radiation" /> <Object Name="chemotherapy"> <Object Name="docetaxel" /> <Object Name="anthracyclines" /> <Object Name="vinorelbine" /> </Object> <Object Name="hormone therapy" /></Object><Object Name="local therapy"> <Object Name="radiotherapy" /></Object>

Note that the above relationship in fact elaborates further on the definition of adjuvant treatment that is given in the NCI terminology database. The definition in the database is a rather limited one:

adjuvant therapy: Treatment given after the primary treatment to increase the chances of a cure. Adjuvant therapy may include chemotherapy, radiation therapy, hormone therapy, or biological therapy.

The following diagram (Figure 6) shows a partial hierarchy automatically extracted from the corpus using both the local grammar and the compound term analysis. For example, from the following XML output it can be elicited that "atherosclerotic disease" is a type of disease and "breast cancer" is also a type of disease. In all, approximately 70% of the extracted relationships were valid.

<Object Name= "atherosclerotic disease"> <Object Name= "stroke" /> <Object Name= "chd" /></Object><Object Name= "disease"> <Object Name= "breast cancer" /></Object>

Figure 6: A partial hierarchy showing disease and some of its sub-concepts automatically extracted from the breast cancer corpus

3.3 Case Study 3: Finance & Business Domain

The construction of a terminology database for a given enterprise requires the identification of a conceptual model of the enterprise. The term conceptual model emphasizes that the relationships and interdependencies between the key objects in the enterprise be clearly delineated. Most conceptual models of terminology databases typically address this question at the level of linguistic description. The discussion here is on the attributes of the various linguistic 'slots' including lemma, definition, related terms, grammar categories/codes, as well as on administrative 'slots' such as dates of entry and update, and the names of the terminologists (Kugler et al., 1995).

The conceptual models for terminology databases seldom explicitly include a conceptual model of the domain in which the term base will be used. There is considerable discussion on the concept-based approach to terminology studies. This Platonist, neo-positive approach to knowledge was pioneered by Eugene Wüster and is continued to date in terminology literature (Cabre, 1999) and in terminology standards (ISO 1087-1/2:2000, ISO 12616:2002). The conceptual model is usually produced by a group of experts, typically working for a standards or trade organization, for an established or mature subject. The model is based on a consensus achieved over a period of time. Such considerations do not apply in a straightforward manner to an emergent subject domain. Here the experts are few and far between and, as the subject is emerging, there is inevitably not as much consensus as one may have for a maturer subject. However, experts still communicate through the medium of writing.

There are instances where the development of the subject is of direct interest to the public, and newspapers and magazines tend to include the writings of the workers in an emergent domain. Financial market trading is a good example of an emergent subject where the subject involves academics and professionals publishing in journals, magazines and financial newspapers. Newer financial instruments – shares, currencies, bonds, good examples of a financial instrument – are being devised by financial traders while financial news reporters write about the state of these instruments. We describe how our 3-phase method of extracting terms and ontology structure may help in this emergent domain. The candidate ontology may be used as a basis of developing a fully-fledged conceptual model. The corpus was created by examining 1529 English texts (681,215 words) produced by Reuters (http://www.reuters.com/) on the topic of British financial trading.

The analysis phase showed that of the 100 most frequent tokens in the corpus, 42 were open class words, including percent, market, shares, company, bank, stock, and ftse. Out of these 42, 60% were found as single terms or as part of a compound term or phrase in the terminology base (http://www.investorwords.com/). Most frequent compound terms extracted include million pounds, ftse index, wall street, percent rise, tax profits and interest rates.

The term stock is an important term in that it appears amongst the 100 most frequent terms in our corpus of British financial trading. One of the interesting collocates of the term was defensive stock. This is defined in the terminology dictionary as:

'a stock that tends to remain stable under difficult economic conditions. Defensive stocks include food, tobacco, oil, and utilities. [...]'.

Our ontological analysis throws interesting light on that. First, our analysis extends the definition to cover other than the four industries listed in the definition, and also the sub-specialisms of defensive stocks or sectors (cf. confectionary > {chocolate, toffee}):

<Object Name="defensive stocks"><Object Name="electricity" /><Object Name="utilities" /><Object Name="oils" /><Object Name="soft drinks company" />

<Object Name="tobacco" /> <Object Name="water company" /> <Object Name="banks" /> <Object Name="utility" /> <Object Name="retailers" /> <Object Name="confectionery"> <Object Name="chocolate" /> <Object Name="toffees" /> </Object></Object>

Second, our system can find instances of the type defensive stock, as in the pharmaceutical company Astrazeneca.



<Object Name="defensive-type stocks"> <Object Name="astrazeneca" /></Object>

Third, the opposite of the defensive stock or sector is the so-called cyclical sector, defined as:

'The stock of a company which is sensitive to business cycles and whose performance is strongly tied to the overall economy. [...] gains by buying the stock at the bottom of a business cycle, just before a turnaround begins. opposite of defensive stock'.

Our system found one type of cyclical company (mining); two instances of such stocks were also found:

<Object Name="industrial cyclical companies"> <Object Name="mining" /></Object><Object Name="industrial cyclical stock">

<Object Name="ici" /><Object Name="invensys" />

</Object>

There are few systems that produce a basis for generating conceptual models, and our results encourage us to believe that our method will help here. The diagram below shows a partial hierarchy of the key term stock, the different types of stocks, defensive and cyclical, as well as two instances of industrial cyclical stocks.

Figure 7. A partial hierarchy automatically extracted from the corpus of finance and business, showing some sub-concepts and instances of stock

4 Afterword

The three case studies discussed above illustrate that the method outlined for ontology construction showed promising results in three very disparate specialisms. This is a good indication that the method may be applicable to any arbitrary specialist domain. The resulting ontology could subsequently be used for a variety of purposes: query expansion, term base construction or knowledge mapping, as discussed in our case studies.

If certain electronic lexical resources or term bases already exist for a certain domain then they can also be utilized to provide some additional information, such as synonyms and alternate terms, as well as to validate the candidate terms and relationships that have been automatically extracted. In return, our method can be used to periodically update the term base. The maintenance of term bases or domain ontologies, if done manually, is a difficult and time-consuming task. Since the introduction of neologisms as well as the obsolescence of certain terms is a common occurrence in many research-active domains, there is a need for this problem to be addressed. The method for ontology construction could help to update a term base or ontology by analyzing current texts, adding any new terms found frequently and depopulating a term base or ontology of terms that have not been used for a certain period of time.

Taking into consideration Guarino's (1998) suggestion on developing different levels of ontologies depending on their generality, we can suggest that existing general lexicons such as WordNet or knowledge bases such as CYC could be used to provide a top-level ontology with which our automatically constructed domain ontology could be merged. The analysis of current texts such as the financial news texts, which can help provide current and often ephemeral relationships as well as instances of concepts, can be used to build an application ontology, which might need to be updated frequently to reflect the changes in the state of the domain.


Acknowledgements
This work is related to two projects: SoCIS (GR/M89041), a three-year EPSRC sponsored project jointly undertaken by the Universities of Sheffield and Surrey and supported by five police forces in the UK; and GIDA (IST 2000-31123), a two-year EU sponsored project undertaken by the University of Surrey in collaboration with EU companies. Mariam Tariq gratefully acknowledges a student bursary provided by the EPSRC.

References

Khurshid Ahmad and Margaret Rogers. 2001. Corpus-based terminology extraction. In: Budin, G., Wright, S.A. (eds.): Handbook of Terminology Management, Vol. 2. John Benjamins Publishers, Amsterdam. 725-760.

Khurshid Ahmad, Mariam Tariq, Bogdan Vrusias and Chris Handy. 2003. Corpus-Based Thesaurus Construction for Image Retrieval in Specialist Domains. In: Fabrizio Sebastiani (ed.), Proc. of ECIR'03. LNCS-2633. Springer Verlag, Heidelberg. 502-510.

D. Bourigault, C. Jacquemin, M-C. L'Homme (eds.). 2001. Recent Advances in Computational Terminology. John Benjamins Publishers, Amsterdam.

G. Browne, S. Curley and P. Benson. 1997. Evoking Information in Probability Assessment: Knowledge Maps and Reasoning-Based Directed Questions. Management Science (43:1). 1-14.

M. T. Cabre. 1999. Terminology: Theory, Methods & Applications. John Benjamins Publishing Company, Amsterdam. (Tr. Janet Ann DeCesaris).

Chou and H. Lin. 1998. The Effects of Navigation Map Types and Cognitive Styles on Learners' Performance in Computer-Networked Hypertext Learning System. Journal of Educational Multimedia and Hypermedia (7). 151-176.

D. A. Cruse. 1986. Lexical Semantics. Cambridge University Press, Avon, Great Britain.

Thomas H. Davenport and Prusak Laurence. 2000. Working Knowledge: How Organizations Manage What They Know. Boston: Harvard Business School Press.

E. N. Efthimiadis. 1996. Query Expansion. In: M. E. Williams (ed.). Annual Review of Information Systems and Technology (ARIST). Vol. 31. 121-187.

D. J. Foskett. 1997. Thesaurus. In: Sparck Jones, K., Willet, P. (eds.): Readings in Information Retrieval. Morgan Kaufmann Publishers, San Francisco, California. 111-134.

Maurice Gross. 1993. Local grammars and their representation by finite automata. In: Hoey, M. P. (ed.): Data, Description, Discourse. HarperCollins, London. 26-38.

Nicola Guarino. 1998. Formal Ontology and Information Systems. In Proceedings of FOIS'98 – Formal Ontology and Information Systems. Trento, Italy, 6-8 June. IOS Press.

Z. S. Harris. 1988. Language and Information. In: Nevin, B. (ed.): Computational Linguistics Vol. 14, No. 4. Columbia University Press, New York. 87-90.

Marti Hearst. 1992. Automatic Acquisition of Hyponyms from Large Text Corpora. In Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING'92). Nantes, France. 539-545.

Richard C. Huseman and Jon P. Goodman. 1998. Leading with Knowledge: The Nature of Competition in the 21st Century. Thousand Oaks: SAGE Publications.

ISO 1087-1:2000. Terminology work -- Vocabulary -- Part 1: Theory and application.

ISO 1087-2:2000. Terminology work -- Vocabulary -- Part 2: Computer applications.

ISO 12616:2002. Translation-oriented terminography.

M. Kugler, K. Ahmad and G. Thurmair (eds.). 1995. Translators Workbench: Tools and Terminology for Translation and Text Processing. Springer Verlag.

G. Leech, P. Rayson, A. Wilson. 2001. Word Frequencies in Written and Spoken English: based on the British National Corpus. Pearson Education Limited, Great Britain.

John Sowa. 2000. Knowledge Representation: Logical, Philosophical, and Computational Foundations. Pacific Grove CA: Brooks/Cole.

McG. D. Squire, W. Muller, H. Muller, T. Pun. 2000. Content-Based Query of Image Databases: Inspirations from Text Retrieval. Pattern Recognition Letters 21. Elsevier Science B.V. 1193-1198.


Ontology Aspects in Relation Extraction

Birte Lonneker
Institute for Romance Languages
Von-Melle-Park 6, D-20146 Hamburg, Germany

[email protected]

Abstract

This paper presents a linguistic study in formulating French patterns for relation extraction. The patterns are based on an analysis of corpus annotations: each annotation consists in a labelled relation between one concept and another. The concept frames themselves are arranged in a hierarchy. This enables the observation of dependencies between their ontological status and patterns for relation extraction. However, patterns depend not only on the ontological status of the concepts they relate, but also on other factors which are themselves connected to the first one and can be combined into a dependency network for relation patterns. These findings should be taken into account when designing and tuning a pattern-based automatic system for relation extraction.

1 Introduction

Relation extraction plays a role in the acquisition of terminological data (cf. e.g. Condamines (2002), Marshman et al. (2002), Seguela (1999)), in ontology building and refinement (cf. e.g. Pustejovsky et al. (2002), Sure et al. (2002)), and in building and improving lexical or conceptual-lexical resources like WordNets (cf. e.g. Peters (2002)). The methods differ in the relations they consider and in the degree to which the extraction phase is automated.

In concept frames (Lonneker, 2002; Lonneker, 2003), a knowledge representation technology (Davis et al., 1993) used to represent large amounts of knowledge about a given concept, a propositional information unit basically consists in a labelled relation between the frame concept and a filler concept. Concept frames are acquired from corpus texts, in which information is annotated using a specially designed annotation tool (Lonneker, 2002). In order to study whether the work could be improved by automatisation, the corpus paragraphs underlying the annotations are inspected, and patterns for the linguistic expression of five common relations are detected. The results of this analysis, presented in this paper, can be generalised to other relation extraction tasks.

The remainder of this paper is organised as follows. Section 2 presents the information used in the analysis: after an introduction into concept frame structure (Subsection 2.1), the top level of the frame ontology (Subsection 2.2) and the acquisition of concept frames from corpora (Subsection 2.3) are described briefly. Section 3 deals with the method and results of the relation pattern analysis: Subsection 3.1 presents five conceptual relations and the method used to detect and describe their linguistic realisation; Subsection 3.2 gives examples of the encountered patterns, some of which depend on the ontological status of the related concepts; Subsection 3.3 summarises the discussion and presents a network of dependencies for relation patterns. The conclusion (Section 4) discusses consequences for building automatic systems using patterns for relation extraction.


2 Concept frames used in corpus annotation

This section explains how the information that a human annotator can find in a corpus is annotated in concept frame structure. In the first subsection (2.1), this structure is presented. Subsection 2.2 then briefly introduces the top-level frame hierarchy and compares it to other top-level ontologies. Subsection 2.3 describes how more specific frames – subframes of the top-level frames – are acquired through text annotation.

2.1 Concept frame structure

Concept frames are large conceptual entities named by natural language nouns, the frame names. A concept frame comprises the following main elements, illustrated by French examples:

- A frame name, which is a noun referring to the concept frame (e.g. LA MAISON 'THE HOUSE').

- A subslot, a labelled relation from the concept frame to another concept or to an individual ("instance"). The main component of a subslot is a natural language verb. (E.g. + etre l'une des parties de '+ be a part of'; + avoir '+ have'.)

- A filler, which is an adjective or a noun, possibly modified, referring to a concept or individual related to the frame name via the relation expressed in the subslot. (E.g. le village 'the village'; la ville 'the town'; la salle, de sejour 'the living room'.)

Combined together, these elements form a proposition, for example:

- La maison -- + etre l'une des parties de -- le village

- La maison -- + etre l'une des parties de -- la ville

- La maison -- + avoir -- la salle, de sejour

In the frames, subslots (relations) are grouped by aspects or slots. For example, the subslot + etre l'une des parties de is included in the slot Les relations avec le tout englobant 'The relations with the comprising whole'. We can say that in concept frames, the slot level of traditional Knowledge Representation frames (Karp et al., 1995) has been split into two levels ("subslots" and "slots"). This structure provides a better overview of a large number of subslots, and a mechanism for disambiguating subslots which have the same natural language label. To illustrate the overall frame structure, Figure 1 shows an excerpt from the concept frame LA MAISON. It can be seen from the figure that subslots are grouped by slots and fillers by subslots.

LA MAISON
  L'importance pour l'homme:
    + servir à
      la vie
      la vie – meilleur
    + causer
      le bien-être
  Les relations avec le tout englobant:
    + être l'une des parties de
      le village
      la ville
  Les parties:
    + avoir
      la salle – de séjour
      la cuisine
  Les relations avec les événements englobants:
    + être vendu, par
      l'agence – immobilier
      le propriétaire
      l'entreprise
      le vendeur
    + être acheté, par
      l'agence – immobilier
      l'acheteur
      le propriétaire
    + être concerné par
      l'incendie
      l'intempérie

Figure 1: Some annotations of a specific concept frame (in French).
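To make this slot/subslot/filler layering concrete, here is a minimal sketch (an illustration only, not the annotation tool's actual data model) of the Figure 1 excerpt as a plain Python structure; the type alias and function names are our own.

# A minimal illustration (not the authors' implementation) of the two-level
# slot/subslot structure of a concept frame, using data from Figure 1.
from typing import Dict, List

# frame name -> slot -> subslot -> list of fillers
ConceptFrame = Dict[str, Dict[str, Dict[str, List[str]]]]

la_maison: ConceptFrame = {
    "LA MAISON": {
        "Les relations avec le tout englobant": {
            "+ être l'une des parties de": ["le village", "la ville"],
        },
        "Les parties": {
            "+ avoir": ["la salle – de séjour", "la cuisine"],
        },
        "Les relations avec les événements englobants": {
            "+ être vendu, par": ["l'agence – immobilier", "le propriétaire"],
        },
    }
}

def propositions(frame: ConceptFrame):
    """Enumerate frame-name -- subslot -- filler propositions."""
    for name, slots in frame.items():
        for slot, subslots in slots.items():
            for subslot, fillers in subslots.items():
                for filler in fillers:
                    yield (name, subslot, filler)

for p in propositions(la_maison):
    print(" -- ".join(p))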

2.2 Top-level concept frame hierarchy

In our research on concept frames, a top-level ontology or frame hierarchy based on linguistic evidence is used. As pointed out in (Lonneker, 2002), an important point when building the hierarchy was to minimise redundancy and synonymy of relations. The hierarchy was built drawing on an existing lexicographic resource, the so-called "matrix frames" in (Konerding, 1993).


These matrix frames are the twelve topmost concepts except THING or ENTITY found as hyperonyms in conventional dictionary definitions. They consist of blocks of questions that might plausibly be asked about the concept represented (e.g. possible questions concerning ORGANISMS). The verbs of these questions are taken over as default subslots into our frame system. The question blocks have headlines indicating the aspects of the entity they describe, which can be taken over as slots. Because of the general level of the knowledge, there are nearly no answers to the questions in Konerding's work, which means that there are nearly no fillers in the top-level frames.

L'ENTITÉ CONCRÈTE
  La définition:
    + s'appeler également
    + ressembler à
    + être
    + inclure
    + être traité, dans
  L'importance pour l'homme:
    + servir à
    + causer
    + témoigner de
    + être
    - être
  Les relations avec le tout englobant:
    + être l'une des parties de
  Les parties:
    + se composer de
    + avoir
  L'existence/la vie:
    + avoir son origine dans
    + pouvoir exister comme
    + avoir
    + terminer son existence, lors de
    + terminer son existence, à cause de

Figure 2: A top-level frame (in French).

The slot names and subslot names found in Konerding (1993) show some similarities from frame to frame. In order to avoid redundancies, all slot-subslot combinations occurring in every frame were extracted and inserted into a new frame, the CONCRETE ENTITY frame (cf. Figure 2), a superframe of all other frames. Further analysis of the matrix frame information finally resulted in four additional frames with a higher level of abstraction: CONCRETE ENTITY, CONTINUANT, PRIMARY OBJECT and ROLE/VIEWPOINT ON AN ENTITY. The twelve frames treated by Konerding are subframes of these, as shown in an overview of the hierarchy in Figure 3.

Figure 3: Hierarchy of top-level frames. [The diagram shows the frames Concrete Entity; Continuant, Primary Object and Role/Viewpoint on an Entity; and, below them, Occurrent/Event, Object, Institution/Social Group, Matter, Person w. Role, Whole, State, Part, Artefact, Person w. Profession, Organism and Action.]

Concerning the status of the hierarchy, it has to be stressed that ontologies may have as different sources of evidence as common sense, psychological and sociological explanations, languages, and results of the natural sciences (cf. Wildgen (2001, 179)). Because of these different foundations, and because of the different goals their developers have in mind, ontologies are often very different in nature: Chandrasekaran et al. (1999, 23) illustrate how differently four selected ontologies arrange even the most general concepts into a taxonomy. The top-level frame ontology itself is often referred to as a "top-level hierarchy", in order to distinguish it from formal ontologies (cf. e.g. Guarino (1997)¹).

Some similarities between the top-level hierarchy and other ontologies, such as the Generalized Upper Model (Bateman et al., 1995), Mikrokosmos (Mahesh and Nirenburg, 1995) and DOLCE/WonderWeb (Masolo et al., 2002), are: the distinction between EVENTS and CONTINUANTS ("non-events"), and the existence of classes for properties (STATE), collections or sets (WHOLE), living things (ORGANISM) and SOCIAL GROUPS.² ORGANISMS can – provided they are persons – be assigned different roles: PERSON WITH A PROFESSION (for a professional role) and PERSON WITH A ROLE (for any other role). OBJECTS can be seen as ARTEFACTS if they have been produced by persons. These distinctions cover the one between LIVING and NON-LIVING (agentive and non-agentive) things, which is also very common in ontology and which will be the main distinction used when analysing relation patterns (cf. Section 3). Because of these commonalities between ontologies, the exact ordering of the top-level concepts is not the main focus of this paper. Rather, we believe that our results concerning ontology aspects in relation extraction will also be useful with a different taxonomic ordering of the general level.

¹ Cf. also Gangemi et al. (2001) for a comparison between formal ontology and the lexical hierarchy in WordNet (Miller et al., 1990).

² The top-level frame hierarchy does not provide suitable frames for all entities of the world. For example, TIME and SPACE do not have their own frames; Konerding (1993, 184–185) is aware of this problem and proposes to treat these entities as "WHOLES of PARTS". We call the topmost frame CONCRETE ENTITY in order to show that some abstract ideas are not covered.

2.3 Acquisition of subframes

The main purpose of the top-level hierarchy presented in the previous subsection is to provide an adequate set of "default relations", in the form of subslots ordered into slots, in each branch of the hierarchy. Subframes of top-level frames are acquired through text annotation, during which the relations inherited from the corresponding top-level superframes are provided with fillers. For example, Figure 1 shows that some of the subslots already defined in the top frame CONCRETE ENTITY (Figure 2) have been used in the frame representation of the more specific concept HOUSE, where they relate this concept to others (the fillers), according to information found in the annotated texts.

Subconcepts of top-level frames are acquired from concept-centered corpora, i.e. corpora containing dense information on the selected concept. These corpora are automatically collected from the Web and split into small portions easy to annotate (Subsection 2.3.1). Information about the concepts that is found in these corpora is annotated using an annotation tool specifically designed for this task (Subsection 2.3.2).

2.3.1 Corpus extraction

The Web has been chosen as a source for corpus extraction because it contains information about virtually all concepts of common world knowledge. Furthermore, Web texts show a large variety of sources, target audiences and genres, so that the collected knowledge will not be too biased by one of these factors.

Example texts as parts of each concept-centered corpus for a given concept are extracted from the Web using the system LingKonnet (Lonneker, 2003). The example extractor is an extended search-engine wrapper with special processing of the result pages. The page processing allows for the specification of any search string that should be contained in the extracted parts of the page; ideally, a lexical item referring to the selected concept (the frame name) is chosen as the search string for the page wrapping. Extracted parts are coherent text snippets or "contexts" (e.g. text paragraphs, list entries) long enough to contain information about the concept. Each "text part collection" for a given frame name can be considered as a concept-centered corpus and as a subcorpus of the whole collected corpus. For further information on the algorithm of LingKonnet and a more detailed discussion of the nature of Web texts cf. (Lonneker, 2003).
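The published description of LingKonnet gives no code, so the following is only a rough sketch of the general idea – keeping paragraphs of a fetched page that contain the search string and are long enough to carry information about the concept; the function name and the length threshold are assumptions, not details of the actual system.

# Rough sketch of the idea behind concept-centered corpus collection:
# keep only coherent text snippets (here: paragraphs) that contain the
# search string referring to the selected concept. Not the LingKonnet code.
import re
from typing import List

MIN_LENGTH = 200  # assumed threshold for "long enough to contain information"

def extract_contexts(page_text: str, search_string: str) -> List[str]:
    """Split a fetched page into paragraphs and keep those mentioning the concept."""
    paragraphs = re.split(r"\n\s*\n", page_text)
    contexts = []
    for para in paragraphs:
        para = " ".join(para.split())  # normalise whitespace
        if search_string.lower() in para.lower() and len(para) >= MIN_LENGTH:
            contexts.append(para)
    return contexts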

2.3.2 Text annotation

During the text annotation phase, the knowledge which is either explicitly or implicitly contained in the extracted texts is stored in frames, using the frame structure presented in Subsection 2.1. Using an annotation tool described in (Lonneker, 2002), subslots and fillers are chosen from a list of knowledge elements inherited from the top-level frames (superframes), or new ones can be created which apply to the more specific frames. For each text part, one or more propositions consisting of a frame name, a subslot (belonging to a certain slot) and a filler (cf. Subsection 2.1) can be entered.

It is important to note that in the method adopted for the annotation of corpora, general as well as specific information was encoded. The annotation of specific information means that predications concerning instances of the frame concept were also taken into account. The rationale behind this method is that from all annotations, prototypical knowledge can be inferred by retaining only the most often used subslots and fillers (i.e., relations and related concepts).

More than 9,000 single annotations (frame name–subslot–filler combinations) of French text segments were encoded, using more than twenty different concept frame names. Figure 1 above displays some annotations of the HOUSE frame.³

³ A preliminary RDF representation of both the top-level frames and the annotated subframes can be found at http://www.rrz.uni-hamburg.de/lingkonnet/RDF.

A database contains the extracted example texts as well as all used frame elements and the annotation for every text part. The benefit of the database lies in the combination of text data from the Web with annotation data from the human annotator. In the next section, we show how the data is used to formulate heuristics ("patterns") for the linguistic expression of relations (subslots) in texts.
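The paper does not describe the database layout; purely as an illustrative sketch, the combination of Web text parts and human annotations could be stored along the following lines (table and column names are assumptions). Retrieving all corpus paragraphs annotated with a given subslot – the starting point of the pattern analysis in Section 3 – then becomes a simple join.

# Illustrative only: one possible layout for a database that combines the
# extracted Web text parts with the human annotations (frame, slot, subslot,
# filler). The actual database layout is not described in the paper.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE text_part (
    id        INTEGER PRIMARY KEY,
    frame     TEXT NOT NULL,      -- concept the subcorpus was collected for
    source    TEXT,               -- URL of the Web page
    content   TEXT NOT NULL       -- the paragraph or list entry itself
);
CREATE TABLE annotation (
    id        INTEGER PRIMARY KEY,
    part_id   INTEGER REFERENCES text_part(id),
    frame     TEXT NOT NULL,      -- frame name
    slot      TEXT NOT NULL,      -- e.g. "Les parties"
    subslot   TEXT NOT NULL,      -- e.g. "+ avoir"
    filler    TEXT NOT NULL       -- e.g. "la cuisine"
);
""")

# All paragraphs annotated with a given subslot in a given slot:
query = """
SELECT t.content FROM text_part t
JOIN annotation a ON a.part_id = t.id
WHERE a.subslot = ? AND a.slot = ?
"""
rows = conn.execute(query, ("+ avoir", "Les parties")).fetchall()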

3 Analysis of patterns for conceptual relations

This section presents and discusses an analysis of French patterns which were found to express selected conceptual relations in a corpus. In Subsection 3.1, the method used in detecting and describing linguistic patterns in corpora is explained. Subsection 3.2 introduces the five conceptual relations for which patterns were compiled. Examples of patterns are mentioned for each relation. Special attention is given to three relations for which subcorpora containing information on concepts of different ontological status were analysed separately. Finally, Subsection 3.3 summarises the results of the individual analyses. It is shown that linguistic patterns expressing a conceptual relation depend on the ontological status of the involved concepts, on characteristics of the lexical items used to refer to them, as well as on corpus characteristics, and that these individual factors are themselves interrelated in a network of dependencies.

3.1 Detecting and describing patterns

A linguistic pattern can be any information of syntactic, lexical and/or "paralinguistic" (mainly punctuation) nature (cf. Marshman et al. (2002, 2–3)) which is used in the text to express a selected conceptual relation. In concept frame terminology, a conceptual relation is a subslot in a specific slot (cf. Subsection 2.1 above). Patterns for a selected conceptual relation can thus be found directly in the corpus paragraphs featuring an annotation with this subslot. The manual detection of linguistic characteristics in the text expressing a given relation is eased by two factors:

1. the coherent and relatively short contexts, i.e. annotated paragraphs (cf. Subsection 2.3.1 above);

2. the already-made choice of the relation and of the related concepts (frame name, filler name) by the annotator (cf. Subsection 2.3.2 above).

By spotting a linguistic expression for a given relation in an annotated corpus example, the choice that was already made by the annotator on the conceptual level is reconstructed or "re-interpreted" at a syntactico-lexical level. This method can be characterised as "bottom-up" (Condamines, 2002, 146):

    [. . . ] with a bottom up view, patterns are examined in real corpora in order to describe their function rather than being first described and then systematically searched within corpora and so considered as general.

For instance, the text paragraph containing Example (1) has been annotated with the subslot + avoir in the slot Les caractéristiques extérieures (+ have, The external characteristics), and this as a relation between the concept frame LA MAISON and the filler le plan, quadrangulaire:

(1) [. . . ] construire de vraies maisons, à plan quadrangulaire, [. . . ]

Following a convention introduced by (Marshman et al., 2002), we use a double underlining for the concept frame name (the concept from which the relation "departs") and a single underlining for the filler name (the concept to which the relation "goes"). It is easy to see that in Example (1), the given relationship is linguistically expressed by the preposition à, which will be regarded as a lexical pattern⁴ for the subslot + avoir in the slot Les caractéristiques extérieures. Whenever possible, in order to maintain information about the directionality of the relation, the "place" of the noun phrases containing the frame name ([FRAME-NP]) and the filler name ([FILLER-NP]) will be indicated in the pattern. Placing the governing noun phrase before and the prepositional complement phrase after the preposition, the following description (2) of the pattern corresponding to (1) above is achieved:

(2) [FRAME-NP] à [FILLER-NP]_∅

The index ∅ indicates that the FILLER-NP has no article. Similarly, the indices def and indef can be used to refer to NPs with a definite and an indefinite article, respectively.

⁴ Due to the potential ambiguity of lexical patterns, they should more precisely be called "pattern candidates".

If a verb is part of a lexical pattern, we analogously indicate the argument position of the frame and filler noun phrases; an argument referring neither to the frame nor to the filler concept is represented by [NP]. The following examples show more contexts in which the relation + avoir in the slot Les caractéristiques extérieures is expressed (3a; 4a), and the corresponding pattern descriptions (3b; 4b).

(3) a La terreur nocturne serait observée chez des enfants d'âge scolaire [. . . ]

    b [NP]_Arg0 [observer] [FILLER-NP]_Arg1 [FRAME-NP]_ArgChez

(4) a [. . . ] votre maison en Bois Rond prendra des tonalités différentes [. . . ]

    b [FRAME-NP]_Arg0 [prendre] [FILLER-NP]_Arg1

In these patterns, the syntactic position indicated by the variable [FRAME-NP] is instantiated by the concept frame name (enfant, maison), and the argument corresponding to [FILLER-NP] is instantiated by the filler, here an external characteristic (terreur nocturne, tonalité). [NP]-positions can be occupied by any other concept: in Pattern (3b) above, this is typically a person and can be left implicit in the text, as can be seen from Example (3a). Arguments of verbs are numbered for the first three arguments, which correspond to the semantic "subject" (actor, Arg0), the semantic "direct object" (patient, Arg1) and the semantic "indirect object" (beneficiary, Arg3). Prepositional arguments are indicated by the preposition used (e.g. ArgChez). Brackets around a lexical item indicate a canonical form, different inflected forms of which can occur in the text.
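As an illustration of how pattern descriptions such as (3b) and (4b) might be represented operationally, the following sketch encodes a lexical pattern as a verb lemma plus the argument positions expected for the frame and filler NPs; it presupposes an already-parsed clause with labelled arguments, which is an assumption and not part of the annotation work described here.

# Hedged sketch: one way to represent lexical patterns such as
#   [FRAME-NP]_Arg0 [prendre] [FILLER-NP]_Arg1
# and to test them against a clause whose verb arguments have already been
# identified by some parser (the parsed representation here is assumed).
from dataclasses import dataclass

@dataclass
class LexicalPattern:
    verb_lemma: str      # canonical form, e.g. "prendre"
    frame_arg: str       # argument position of the frame NP, e.g. "Arg0"
    filler_arg: str      # argument position of the filler NP, e.g. "Arg1"

@dataclass
class ParsedClause:
    verb_lemma: str
    arguments: dict      # e.g. {"Arg0": "votre maison", "Arg1": "des tonalités"}

def match(pattern: LexicalPattern, clause: ParsedClause):
    """Return (frame candidate, filler candidate) if the pattern applies."""
    if clause.verb_lemma != pattern.verb_lemma:
        return None
    frame = clause.arguments.get(pattern.frame_arg)
    filler = clause.arguments.get(pattern.filler_arg)
    if frame is None or filler is None:
        return None
    return frame, filler

# Pattern (4b) applied to Example (4a):
p4b = LexicalPattern("prendre", frame_arg="Arg0", filler_arg="Arg1")
clause = ParsedClause("prendre", {"Arg0": "votre maison en Bois Rond",
                                  "Arg1": "des tonalités différentes"})
print(match(p4b, clause))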

Patterns for relations between class concepts (for example, the relation + se trouver, à between la maison and la ville) and between individuals instantiating these concepts (for example, the relation + se trouver, à between la maison and Paris) are dealt with collectively, because the annotations do not distinguish between them (cf. Subsection 2.3.2 above). Patterns can be used in terminology extraction (Condamines, 2002; Marshman et al., 2002; Seguela, 1999) and in information extraction (for example in the "template relations" and "scenario templates" tasks (Chinchor, 1997b)). These two fields have slightly different aims: terminology extraction concentrates on generic information (e.g. relations between concept classes), while information extraction is only interested in specific information (e.g. relations between individuals) (Grishman, 2000). Most insights we gained from analysing patterns hold independently of whether generic or specific information is expressed; they can thus be useful for both terminology and information extraction.

3.2 Patterns for five conceptual relations

The annotated corpus examples were analysed for linguistic expression patterns of the following five conceptual relations:

1. + inclure (Slot: La définition); comparable to the hyponymy relation;

2. + ressembler à (Slot: La définition); this subslot can be compared to a relation of co-hyponymy, near-synonymy or to a literal-figurative relation;

3. + avoir (Slot: Les caractéristiques extérieures), relating an entity to one of its external characteristics⁵;

4. + se trouver, à (Slot: Le lieu et la répartition), relating an entity to its place;

5. + avoir (Slot: L'importance pour l'homme), relating an entity to its legal, social, or economic possession, or to characteristics (e.g. of ARTEFACTS) which influence these possessions of the humans who deal with them.

⁵ The transition between internal and external characteristics and processes is smooth (cf. e.g. (Wildgen, 2001, 182)); we consider a property external if it has clearly perceivable consequences on the outer appearance of an individual.

For the first two relations, the whole corpus containing examples for several different concept frames was analysed. The patterns found for these relations, + inclure and + ressembler à, confirm results of similar earlier research. In order to illustrate this fact, we give some examples of common patterns detected in our corpus:

(5) Common patterns for + inclure or hyponymy

    a Syntactic pattern: N + ADJ with N = frame name and N ADJ = filler name ("maison individuelle")

    b Lexical pattern: [FILLER-NP]_Arg0 [être] [FRAME-NP]_Arg1 (Le réduve masqué est une grande punaise [. . . ].)

(6) Common patterns for + ressembler à or co-hyponymy

    a Syntactic pattern: enumeration, e.g. N, N et N with N = frame name or filler name ("Nature et paysages"; "Livres, gravures et cartes postales")

    b Paralinguistic pattern: hyphen, e.g. N-N with N = frame name or filler name ("réalisateur-journaliste-caméraman")

The patterns mentioned in (5) are well-known hyponymy patterns and have been described and used by several authors, including (Pustejovsky et al., 2002) (for the first one) and (Borillo, 1996) (for the second one). The enumeration pattern mentioned in (6) for co-hyponymy relations corresponds to the second part of a hyponymy pattern like NX tel que Na, Nb [et] Nc mentioned by (Borillo, 1996). The overlap between the results of these analyses and common methods in terminology and information extraction supports the validity of the pattern detection method.
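As a simple illustration of how the lexical pattern in (5b) could be turned into an executable heuristic, a naive surface regular expression for the form "X est un(e) Y" might look as follows (a sketch only: it ignores inflection, NP structure and the ambiguity noted in footnote 4).

# Toy regular expression for the hyponymy pattern
#   [FILLER-NP] [être] [FRAME-NP]
# restricted to the present-tense surface form "X est un/une Y".
# Illustration only; real use would require NP chunking and lemmatisation.
import re

HYPONYMY = re.compile(
    r"(?P<filler>[A-Za-zÀ-ÿ][\w' -]{1,60}?)\s+est\s+une?\s+(?P<frame>[\w' -]{2,60})"
)

m = HYPONYMY.search("Le réduve masqué est une grande punaise.")
if m:
    print(m.group("filler"), "-->", m.group("frame"))
    # Le réduve masqué --> grande punaise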

When analysing the remaining three relations, only concepts of the branches PERSON IN A ROLE and ARTEFACT (cf. the top-level hierarchy in Subsection 2.2) were considered, and the corpus was subdivided into these two groups. It turned out that there are only a few general patterns for these relations, i.e. patterns which are independent of the ontological status of the frame concept. In fact, some patterns are very relevant for concepts in one of the branches, while uncommon or absent with concepts in the other branch, as illustrated by the following examples.

- Some common patterns for + avoir (Les caractéristiques extérieures) with a concept frame in the PERSON IN A ROLE-branch contain the preposition chez:

  (7) [FILLER-NP] chez [FRAME-NP]

  (8) [NP]_Arg0 [observer] [FILLER-NP]_Arg1 [FRAME-NP]_ArgChez

  (9) [NP]_Arg0 [décrire] [FILLER-NP]_Arg1 [FRAME-NP]_ArgChez

  These patterns are absent from the corpus examples annotated with the same relation for subconcepts of ARTEFACT.

- For + avoir (L'importance pour l'homme), a common syntactic pattern with a concept frame in the PERSON IN A ROLE-branch is the possessive pronoun. This pattern occurs only very rarely with subconcepts of ARTEFACT. However, the "explicit" variant of this pattern, using the possession-denoting preposition de, is common in both ontological branches. Relations between an ARTEFACT and one of its properties also seem to be left implicit more often⁶; it remains open whether this phenomenon is connected more to the nature of the frame or to that of the filler.

- Similarly, for + avoir (L'importance pour l'homme), a different position of the frame-denoting NP can be observed in lexical patterns involving verbs of arity 3:

  (10) PERSON WITH A ROLE-subconcepts

       a [NP]_Arg0 [reconnaître] [FILLER-NP]_Arg1 à [FRAME-NP]_Arg3

       b [NP]_Arg0 [offrir] [FILLER-NP]_Arg1 à [FRAME-NP]_Arg3

  (11) ARTEFACT-subconcepts

       [FRAME-NP]_Arg0 [offrir] [FILLER-NP]_Arg1⁷

⁶ As in J'aime cette chemise à cause de la couleur, where the filler is preceded by a definite article instead of a possessive pronoun. Thanks go to Salah Ait-Mokhtar for discussing examples.

⁷ In this pattern, the third argument is implicit. It could be mentioned as [NP]_Arg3.

For the relation + se trouver, à, another phenomenon was detected. Nearly all encountered patterns are prepositions, and these prepositions differ between the ontological branches. A closer look at the frame and filler names tells us that the choice of spatial prepositions depends not only on the frame concept, but also on the filler concept and/or the lexical item denoting it. For example, with a PERSON-subframe, we find nearly exclusively the prepositions à and dans governing the filler-NP, e.g. (12).

(12) a on rencontre des enfants [. . . ] dans les écoles [. . . ]

     b [FRAME-NP] dans [FILLER-NP]

With HOUSE, a big and immovable ARTEFACT, the prepositions en and sur are encountered much more often than à. Preceded by en, the filler can denote a geographic region or a kind of landscape (13), while filler-NPs governed by a sur-PP typically denote the ground on which the house stands (14a); in some cases, they refer to larger regions or cities (14b). In this regard, the choice of the spatial preposition also depends on selection patterns of the governed noun; on the lexical level, we can compare this phenomenon to the spatial lexical function Loc_in as defined by (Mel'cuk, 1984, 9), for example Loc_in(gare) = à [la gare], Loc_in(pièce) = dans [la pièce].

(13) a maison en campagne vallonnée [. . . ]

     b [FRAME-NP] en [FILLER-NP]_∅

(14) a Vends maison [. . . ] sur terrain de 3033 m² [. . . ]

     b Une maison en bois sur n'importe quelle commune ?

     c [FRAME-NP] sur [FILLER-NP]

All four prepositions mentioned as patterns for + se trouver, à are classified as "topological" or "internal localisation" prepositions by (Borillo, 1998, 32), indicating a full or partial inclusion of the located object into the location (site). Yet, as pointed out by (Cadiot, 1997, 36–39), French prepositions have different degrees of abstraction; the more abstract they are, the more difficult it is to find a "sense definition" for them, and most of the encountered spatial prepositions can be considered abstract (à and en as well as, less prominently, sur, cf. (Cadiot, 1997, 36)).⁸

⁸ Moreover, the prepositions enter into a complex "sense interaction" with the noun phrase they govern; for example, in être dans une/la prison, the noun phrase has a "concrete" local meaning and refers to a building, while in être en prison the preposition immaterialises the referent of the noun, so that the predication assigns a state or role (être prisonnier), cf. (Cadiot, 1997, 26).

Finally, some special patterns were detected which seem to depend not only on the ontological nature of the frame concept, but also on corpus characteristics. For example, for ARTEFACTS, patterns indicating their size and their price can be identified. These patterns typically include numeric expressions (Chinchor, 1997a) consisting of a value and a unit like Euros or m². The subcorpora in which they were encountered can be characterised as advertisements.
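As an illustration of what such numeric-expression patterns might look like operationally, a toy sketch for sizes in m² and prices in Euros could be written as follows; the surface variants covered are guesses, not an inventory derived from the corpus.

# Toy patterns for size and price expressions as found in sale advertisements,
# e.g. "terrain de 3033 m2" or "250 000 Euros". Surface variants are guesses.
import re

SIZE = re.compile(r"(?P<value>\d[\d\s.,]*)\s*(?:m2|m²|metres? carres?)", re.IGNORECASE)
PRICE = re.compile(r"(?P<value>\d[\d\s.,]*)\s*(?:euros?|eur|€)", re.IGNORECASE)

text = "Vends maison sur terrain de 3033 m2, prix 250 000 Euros."
print(SIZE.search(text).group("value").strip())   # 3033
print(PRICE.search(text).group("value").strip())  # 250 000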

3.3 Dependencies involved in patterns for relation extraction

Some studies in terminology extraction (Condamines, 2002; Marshman et al., 2002; Seguela, 1999) point out that general patterns for semantic and conceptual relations are rare; most patterns seem in fact to depend on the nature of the corpus. The findings outlined in Subsection 3.2 confirm that this dependency exists; however, specialised corpora like a natural sciences corpus with a didactic genre (Condamines, 2002, 151) or an expert-to-layperson corpus from the domain of computing (Marshman et al., 2002, 5–6) also define the content domain and thus the concepts that will be mentioned in the texts (for domain dependency, cf. (Marshman et al., 2002)).

Figure 4: Dependencies involved in relation patterns. [The figure shows a network with four nodes – concept frame/frame name, filler concept/filler name, corpus characteristics, and pattern for relation – connected by six numbered dependency arcs.]

The collection of corpora for concept frame annotation from the Web did not limit the nature of a corpus in any other way than by defining that every part of it should contain a given string used to identify the concept name; this means that, rather than collecting domain-centered corpora, "concept-centered" corpora were collected for concept frame annotation.

The patterns for each relation can thus be studied concept by concept (or by several concepts with the same superconcept), and it is possible to detect further dependencies of patterns. The dependencies that emerged during the detailed analysis of three conceptual relations (cf. Subsection 3.2) are displayed in Figure 4. In what follows, we give a brief explanation of each dependency relation within this dependency network, keeping in mind the examples mentioned in the previous subsection. The numbers in the enumeration correspond to the labels of the arcs in the network of Figure 4.

1. Corpus characteristics depend on frame name/concept frame. Certain subcorpora, which can be characterised as special text sorts like job offers or advertisements, depend on the nature of the concept for which the corpus is collected. For example, "situations wanted" only occur within corpora that are collected for PERSONS, while sale advertisements are only found in corpora collected for ARTEFACTS.

2. Filler names (names of related concepts) depend on frame name/concept frame. Different frames do not have the same fillers. For example, the filler "mother" can occur in one or more relationships to the CHILD frame, but will not occur (or much less often) in the BOOK frame. Information extraction systems try to find out dependencies of this kind automatically.

3. Patterns for a given relation depend on frame name/concept frame. Linguistic realisations of the same conceptual relation in texts vary with the branch in the concept hierarchy to which the frame concept belongs. This "place in the hierarchy" can be interpreted as the ontological status of the concept frame, or as the semantic class of the frame name (cf. also Condamines and Rebeyrolle (2001)). In the examples discussed above, this dependency is illustrated by the pattern chez occurring only with PERSON-frames, and by the different argument positions of filler and frame concepts in lexical patterns involving verbs, according to whether the frame concept is a PERSON or an ARTEFACT.

4. Filler names (names of related concepts) depend on corpus characteristics. According to the nature of the corpus, different fillers can be expected. For example, in a corpus containing medical advice, other concept names occur as candidates for filler concepts than in a corpus of "situations wanted", even if the frame concept is the same in both cases.

5. Patterns for a given relation depend on corpus characteristics. Some patterns can be found only in subcorpora with special characteristics. This is the case for the French preposition chez 'in'. It was found with a higher frequency as an indicator for the + avoir relation (Slot: Les caractéristiques extérieures) in a medical subcorpus than in any other subcorpus for PERSON concepts. Special patterns including measures and currency units like m² (square metre) and Euros were often found in subcorpora which can be characterised as sale advertisements.

6. Patterns for a given relation depend on filler name/concept. Finally, some lexical patterns occur only with certain filler names. This dependency, which is both conceptual and lexical, especially concerns prepositions, which change according to the noun (name of the concept) they precede. Good examples are the spatial prepositions à and dans in French.

4 Conclusion

A careful analysis of automatically collected and manually (with computer assistance) annotated corpora centered around concepts helps to detect not only patterns for relation extraction, but also general dependencies in which these patterns are involved. It turns out that relation patterns are seldom completely independent of factors like the ontological-semantic status and the lexical selection preferences of the concepts related via the studied relation, as well as of corpus characteristics. All factors are interrelated in a dependency network. These dependencies might be further used in pattern definition and pattern tuning, and should be taken into consideration when semi-automatically querying corpora using these patterns. However, for a fully automated process of relation extraction, the work of preprocessing the corpus syntactically, lexically and conceptually seems very intensive. It might be a better choice to label smaller corpora by hand, using specially designed annotation tools, or to use statistical methods for bigger corpora.

References

John A. Bateman, Renate Henschel, and Fabio Rinaldi. 1995. Generalized Upper Model 2.0: documentation. Technical report, GMD/Institut für Integrierte Publikations- und Informationssysteme, Darmstadt, Germany.

Andrée Borillo. 1996. Exploration automatisée de textes de spécialité : repérage et identification de la relation lexicale d'hyperonymie. Linx, 34/35:113–123.

Andrée Borillo. 1998. L'espace et son expression en français. Ophrys, Paris.

Pierre Cadiot. 1997. Les prépositions abstraites en français. Armand Colin, Paris.

Balakrishnan Chandrasekaran, John R. Stephenson, and V. Richard Benjamins. 1999. What are ontologies, and why do we need them? IEEE Intelligent Systems, 01/02:20–26.

Nancy A. Chinchor. 1997a. MUC-7 Named Entity task definition. In Proceedings of the Seventh Message Understanding Conference (MUC-7), Fairfax, Virginia.

Nancy A. Chinchor. 1997b. Overview of MUC-7/MET-2. In Proceedings of the Seventh Message Understanding Conference (MUC-7), Fairfax, Virginia.

Anne Condamines and Josette Rebeyrolle. 2001. Searching for and identifying conceptual relationships via a corpus-based approach to a Terminological Knowledge Base (CTKB). In Didier Bourigault, Christian Jacquemin, and Marie-Claude L'Homme, editors, Recent advances in computational terminology, pages 127–148. John Benjamins, Amsterdam/Philadelphia.

Anne Condamines. 2002. Corpus analysis and conceptual relation patterns. Terminology, 8(1):141–162.

Randall Davis, Howard Shrobe, and Peter Szolovits. 1993. What is a Knowledge Representation? AI Magazine, 14(1):17–33.

Aldo Gangemi, Nicola Guarino, and Alessandro Oltramari. 2001. Conceptual analysis of lexical taxonomies: The case of WordNet top-level. In Proceedings of the International Conference on Formal Ontology in Information Systems (FOIS 2001), Ogunquit, Maine.

Ralph Grishman. 2000. Entity annotation guidelines. In Detection and Tracking – Phase 1. ACE Pilot Study Task Definition. ACE.

Nicola Guarino. 1997. Semantic matching: Formal ontological distinctions for information organization, extraction, and integration. In Maria Teresa Pazienza, editor, Information extraction. A multidisciplinary approach to an emerging information technology. International summer school SCIE-97, pages 139–170. Springer, Berlin et al.

Peter D. Karp, Karen L. Myers, and Tom Gruber. 1995. The Generic Frame Protocol. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI 1995), Montreal, Canada, volume I, pages 768–774.

Klaus-Peter Konerding. 1993. Frames und lexikalisches Bedeutungswissen. Untersuchungen zur linguistischen Grundlegung einer Frametheorie und zu ihrer Anwendung in der Lexikographie. Niemeyer, Tübingen.

Birte Lonneker. 2002. Building concept frames based on text corpora. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC), Las Palmas, Gran Canaria, volume I, pages 216–223.

Birte Lonneker. 2003. Konzeptframes und Relationen. Extraktion, Annotation und Analyse französischer Corpora aus dem World Wide Web. Ph.D. thesis, Department of Linguistics, Literature, and Media Science, University of Hamburg. Submitted.

Kavi Mahesh and Sergei Nirenburg. 1995. Semantic classification for practical Natural Language Processing. In Proceedings of the 6th ASIS SIG/CR Classification Research Workshop: An Interdisciplinary Meeting, Chicago, IL.

Elizabeth Marshman, Tricia Morgan, and Ingrid Meyer. 2002. French patterns for expressing concept relations. Terminology, 8(1):1–29.

Claudio Masolo, Stefano Borgo, Aldo Gangemi, Nicola Guarino, Alessandro Oltramari, and Luc Schneider. 2002. WonderWeb Deliverable D17. The WonderWeb library of foundational ontologies. Preliminary report. V. 2.0. Technical report, ISTC-CNR, Padova.

Igor A. Mel'cuk. 1984. Un nouveau type de dictionnaire: Le dictionnaire explicatif et combinatoire du français contemporain. In Igor A. Mel'cuk, editor, Dictionnaire explicatif et combinatoire du français contemporain, pages 3–16. Les Presses de l'Université de Montréal, Montréal.

George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. 1990. Introduction to WordNet: an on-line lexical database. International Journal of Lexicography, 3(4):235–244.

Wim Peters. 2002. Extraction of implicit knowledge from WordNet. In Proceedings of the OntoLex-2002 workshop on Ontologies and Lexical Knowledge Bases, Las Palmas, Gran Canaria, pages 54–59.

James Pustejovsky, Anna Rumshinsky, and Jose Castano. 2002. Rerendering semantic ontologies: Automatic extensions to UMLS through corpus analytics. In Proceedings of the OntoLex-2002 workshop on Ontologies and Lexical Knowledge Bases, Las Palmas, Gran Canaria, pages 60–67.

Patrick Seguela. 1999. Adaptation semi-automatique d'une base de marqueurs de relations sémantiques sur des corpus spécialisés. Terminologies nouvelles, 19:52–60.

York Sure, Michael Erdmann, Juergen Angele, Steffen Staab, Rudi Studer, and Dirk Wenke. 2002. OntoEdit: Collaborative ontology development for the Semantic Web. In Ian Horrocks and James Hendler, editors, The Semantic Web – ISWC 2002. Proceedings of the First International Semantic Web Conference, Sardinia, Italy, pages 221–235. Springer, Berlin et al.

Wolfgang Wildgen. 2001. Natural ontologies and semantic roles in sentences. Axiomathes, 12:171–193.


Enhancing Recall in Information Extraction through Ontological Semantics

Sergei Nirenburg, Marjorie McShane and Stephen Beale

Institute for Language and Information Technologies, University of Maryland, Baltimore County

Baltimore, MD, USA

1. Introduction.

We proceed from the assumption that extracting and representing the meanings of texts that serve as sources for information extraction will enhance the latter's quality. In particular, we believe that resolving reference in these texts will lead to higher levels of recall in IE, because additional information will become available for extraction once it can be captured not simply by matching character strings in the IE template but by knowing that George W. Bush, President Bush, the current president of the US, the leader of the free world, and the winner of the 2000 national election all refer to the same entity. Therefore, whatever information in the text is introduced by any of the above (and by other means of reference, notably pronominalization and ellipsis) is relevant.

2. The Environment

At the core of our environment are general-purpose syntactic and semantic analyzers developed over the past 10 years at the Computing Research Lab of New Mexico State University and the University of Maryland Baltimore County. We will very briefly describe the semantic analysis process (a detailed description can be found in Nirenburg and Raskin 2003), including the treatment of reference, and then relate it to the task of enhancing recall in information extraction.

Ontological-semantic processing for text analysis relies on the results of a battery of pre-semantic text processing modules (see Figure 1). The output of these modules provides input to and background knowledge for semantic analysis.

Figure 1. Ontological-semantic processing for text analysis. [The figure shows the text analysis modules – Preprocessor, Morphological Analyzer, Lexical Look-Up, Syntactic Analyzer and Semantic Analyzer – producing a Text Meaning Representation (TMR), together with the static knowledge resources (lexicons and onomasticons, grammars, the ontology and the fact repository) that supply background knowledge to the processing modules; ontological knowledge is used in formulating lexicon and Fact Repository entries, and important information from TMRs is stored in the Fact Repository for future use.]

Semantic analysis takes as input results from the earlier stages of processing and produces a text meaning representation (TMR). The central task for semantic analysis is to construct an unambiguous propositional meaning by processing selectional restrictions, which are listed in the ontology and the semantic zones of lexicon entries. Other issues include treating such phenomena as aspect, modality and non-literal language (which, incidentally, is important for the treatment of reference as well), and building a discourse structure associated with the basic propositional structure of the text.

The major “static knowledge sources” for text analysis are: the TMR language, the ontology, the fact repository and a lexicon that includes an onomasticon.

The ontology provides a metalanguage for describing the meaning of lexical units of a language as well as for the specification of meaning encoded in TMRs. The ontology contains specifications of concepts corresponding to classes of things and events in the world. Formatwise, the ontology is a collection of frames, or named collections of property-value pairs. The ontology contains about 5,500 concepts, each of which has, on average, 16 properties defined for it. Figure 2 shows a portion of the description of the concept ROOM (not all inheritance is shown). Small caps are used to distinguish ontological concepts from English words.


Figure 2. Part of the description of the ontological concept ROOM (not all inheritance is shown).

This ontology has been shown to be able to represent the meanings of over 40,000 entries in a Spanish lexicon. We also have an English lexicon of about 45,000 entries and have developed an efficient methodology for the acquisition of the ontology and the lexicon (Nirenburg and Raskin 2003, Chapter 9).

The fact repository contains a list of remembered instances of ontological concepts. For example, whereas the ontology contains the concept CITY, the fact repository contains entries for London, Paris and Rome; and whereas the ontology contains the concept SPORTS-EVENT, the fact repository contains an entry for the Salt Lake City Olympics. A sample fact repository entry is shown in Figure 3.

HUMAN-33599
  NAME           George W. Bush
  ALIAS          George Bush, President Bush, the president of the United States, the US president, ...
  SOCIAL-ROLE    PRESIDENT
  GENDER         male
  NATIONALITY    NATION-1 (i.e., The United States of America)
  DATE-OF-BIRTH  July 6, 1946
  SPOUSE         HUMAN-33966 (i.e., Laura Bush)

Figure 3. An excerpt from a sample entry in the fact repository.

The ontological semantic lexicon not only contains semantic information, it also supports morphological and syntactic analysis. Semantically, it specifies what concept, concepts, property or properties of concepts defined in the ontology must be instantiated in the TMR to account for the meaning of a given lexical unit of input.

The entries in the onomasticon directly point to elements of the fact repository. Onomasticon entries are indexed by name (e.g., New York), while their corresponding entries in the fact repository are named by appending a unique number to the name of the ontological concept of which they are instances (e.g., Detroit might be listed as CITY-213).

3. Resolving Reference

Most NLP work in reference resolution focuses on finding textual antecedents (or postcedents) for pronouns using knowledge-lean methods. For us, by contrast, resolving reference involves linking every referring entity to its real-world anchor in the fact repository (FR) using a broad range of semantic knowledge and heuristic clues. We present just a sampling of reference issues with their required processing and expected output.

Pronouns. Resolving a reference to a pronoun like he requires not only linking this pronoun to a coreferential element in the text (e.g., the President) but further linking it to its real-world entity stored in the FR (e.g., George W. Bush). We use the same types of heuristics (e.g., text distance, syntactic structure) as most researchers but supplement them with ontological-semantic analysis of candidate coreferential entities.
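As a schematic illustration (not the analyzer's actual code) of the last step – linking a resolved textual mention to its fact repository anchor – one can picture an alias lookup over FR entries like the one excerpted in Figure 3:

# Illustrative sketch: linking a textual mention such as "President Bush" or a
# resolved pronoun's antecedent to its fact repository (FR) anchor via aliases.
# The entry mirrors Figure 3; the lookup code itself is an assumption.
fact_repository = {
    "HUMAN-33599": {
        "NAME": "George W. Bush",
        "ALIAS": ["George Bush", "President Bush",
                  "the president of the United States", "the US president"],
        "SOCIAL-ROLE": "PRESIDENT",
        "GENDER": "male",
    },
}

def anchor_of(mention: str):
    """Return the FR identifier whose name or aliases match the mention."""
    m = mention.strip().lower()
    for fr_id, entry in fact_repository.items():
        names = [entry["NAME"]] + entry.get("ALIAS", [])
        if any(m == n.lower() for n in names):
            return fr_id
    return None

print(anchor_of("President Bush"))   # HUMAN-33599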

Approximations. Resolving approximations requires positing a concrete range whose calculation depends upon semantic heuristics: e.g., around 8:00 might be 7:45–8:15, whereas around 8:06 will be 8:05–8:07. We have found that a 7% rule works quite well in most cases (i.e., expanding the range by 7% of the given number in each direction), but exceptions – like around 8:06 – must be detected and treated separately.
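A minimal sketch of the default 7% heuristic for plain numbers is given below; how exceptions such as around 8:06 are detected is only hinted at in the comment and is an assumption about the mechanism, not a description of it.

# Toy version of the default approximation heuristic: expand the range by 7%
# of the given number in each direction. Values given with high precision
# (e.g. "around 8:06") would be routed to a separate, narrower treatment.
def approximate_range(value: float, rate: float = 0.07):
    """Return the (low, high) interval for 'around <value>'."""
    delta = abs(value) * rate
    return value - delta, value + delta

print(approximate_range(100))   # (93.0, 107.0)
print(approximate_range(50))    # (46.5, 53.5)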

Relative Scalars. Resolving relative scalars (e.g., expensive) requires selecting the relevant range on the scale defined for the modified entity. For example, an expensive bomber costs far more than an expensive pistol, which can be reasoned based on the fact that the property COST (which indicates the range of typical cost) in the ontological frame for the concept MILITARY-JET has a numerical filler that is orders of magnitude higher than that of the same property for GUN.

Definite Descriptions. Resolving reference to definite descriptions (i.e., noun phrases with the) requires first determining whether the definite article signals coreference. Non-coreferential definite descriptions include always-definite NPs (the winter; on the other) and NPs used in certain constructions, like appositives (Bill Gates, the chairman of Microsoft) and restrictive modification (the hope of ensuing peace). All other definite descriptions require coreference resolution, be they identical to their coreferent (the conflict... the conflict), synonymous (the treaty... the pact), in a hypernym/hyponym relationship (the bank... the financial institution), in a meronym relation (I walked in the room and found the window open), etc. We have the conceptual infrastructure to carry out such analysis, as well as to automatically corefer, e.g., the move in (2) with the meaning of the entire preceding sentence (1); our current work focuses on improving our algorithms to best exploit and extend these resources.

(1) The Standard & Poor's Corporation, a leading credit rating agency, cut its ratings on the debt of United to “default,” its lowest ranking.

(2) The move by S. & P. helped fuel speculation that United, the world's second-biggest airline, was on the verge of seeking bankruptcy court protection from its creditors.

Syntactic and Semantic Ellipsis. Syntactic ellipsis is the non-representation of semantic information that is signaled by a syntactic gap: e.g., Italy voted against the proposal and France did [vote against the proposal] too. Semantic ellipsis is similar but without the syntactic gap to act as a trigger: The subcommittee started with [a discussion of, debate about] the gun issue. Ontological semantic analysis permits us to resolve ellipsis – sometimes quite specifically and other times more generally – based on the lexically stipulated selectional restrictions of text entities. For example, since we know that start regularly triggers semantic ellipsis (just like finish [the pizza], prefer [Hemingway], etc.), we created a lexical sense of this word that expects a PHYSICAL-OBJECT as a complement and explicitly calls a procedure that seeks to resolve the missing EVENT based on the semantic collocation between the overt text elements (subcommittee / gun). In other words, the given lexicon sense posits an EVENT whose agent is COMMITTEE (the mapping for subcommittee) and whose theme is GUN (the mapping for gun); the semantic analyzer then searches the ontology for the EVENT that best meets these selectional restrictions. Positing a lexical sense that expects a PHYSICAL-OBJECT as a complement is not strictly necessary: the semantic analyzer has recovery procedures that would be triggered when the selectional restrictions for the first sense of start (start + EVENT, 'start reading') were violated. However, encoding expectations about ellipsis in the lexicon, to the extent reasonable, helps the analysis process by reducing the search space for error recovery.
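The following sketch illustrates, with an invented mini-ontology, the kind of search described above – choosing the EVENT whose selectional restrictions best fit the overt arguments (COMMITTEE as agent, GUN as theme); the concept inventory and the scoring are made up for the example and are not taken from the actual ontology.

# Toy illustration of resolving semantic ellipsis by searching an ontology for
# the EVENT whose selectional restrictions best match the overt arguments.
# The mini-ontology and ancestor sets below are invented for illustration only.
toy_events = {
    "DISCUSS": {"AGENT": {"HUMAN", "ORGANIZATION"}, "THEME": {"ABSTRACT-OBJECT", "ARTIFACT"}},
    "FIRE":    {"AGENT": {"HUMAN"},                 "THEME": {"WEAPON"}},
    "EAT":     {"AGENT": {"ANIMAL"},                "THEME": {"FOOD"}},
}
ancestors = {
    "COMMITTEE": {"COMMITTEE", "ORGANIZATION"},
    "GUN":       {"GUN", "WEAPON", "ARTIFACT"},
}

def score(event: str, agent: str, theme: str) -> int:
    """Count how many selectional restrictions are satisfied by the arguments."""
    restr = toy_events[event]
    s = 0
    if ancestors[agent] & restr["AGENT"]:
        s += 1
    if ancestors[theme] & restr["THEME"]:
        s += 1
    return s

best = max(toy_events, key=lambda e: score(e, "COMMITTEE", "GUN"))
print(best)  # DISCUSS (both of its restrictions are satisfied)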

Resolving reference is arguably one of the most difficult aspects of text processing, alongside metaphor and metonymy. We have spread our net wide in attempting to treat reference issues not only because we believe we have the infrastructure to achieve some success but also because we consider this aspect of text processing an opportunity to improve the results of applications like extraction, summarization and question answering, where reference relations cannot simply be "carried over" – as is sometimes the case in machine translation – but must be explicitly resolved for each referring entity so that sentences containing those entities can be fully exploited.

4. IE in Ontological Semantics

Unlike other IE systems, information extraction that uses the mechanisms and knowledge sources of Ontological Semantics operates against the results of ontological-semantic text analysis, the TMRs, not against open text. In the TMRs, ambiguity and reference are resolved, to the best of the analyzer's ability; ontological and extra-ontological semantic information is encoded, and referring expressions are linked to their corresponding entities (typically, instances of ontological concepts). We are currently conducting experiments comparing IE against TMRs before and after reference resolution. We are using texts from the domain of business (specifically, bankruptcy reports), and our hypothesis is that the results of reference resolution should lead to an enhancement in the levels of recall in IE. We hope to present the initial results of our experimentation at the conference.


References.

Nirenburg, S. and V. Raskin. 2003. Ontological Semantics. MIT Press. Forthcoming.
