
Learning to Harvest Information for the Semantic Web

Fabio Ciravegna, Sam Chapman, Alexiei Dingli, and Yorick Wilks

Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, S1 4DP Sheffield, UK

N.Surname@dcs.shef.ac.uk
http://nlp.shef.ac.uk/wig/

Abstract. In this paper we describe a methodology for harvesting information from large distributed repositories (e.g. large Web sites) with minimum user intervention. The methodology is based on a combination of information extraction, information integration and machine learning techniques. Learning is seeded by extracting information from structured sources (e.g. databases and digital libraries) or from a user-defined lexicon. Retrieved information is then used to partially annotate documents. Annotated documents are used to bootstrap learning for simple Information Extraction (IE) methodologies, which in turn produce more annotation with which to annotate more documents, which are then used to train more complex IE engines, and so on. In this paper we describe the methodology and its implementation in the Armadillo system, compare it with the current state of the art, and describe the details of an implemented application. Finally we draw some conclusions and highlight some challenges and future work.

1 Introduction

The Semantic Web (SW) needs semantically-based document annotation both to enable better document retrieval and to empower semantically-aware agents. Most of the current technology is based on human-centered annotation, very often completely manual. The large majority of SW annotation tools address the problem of single-document annotation. Systems like COHSE [7], Ontomat [8] and MnM [14] all require presenting a document to a user in order to produce annotation, either in a manual or a (semi-)automatic way. Annotations can span from annotating portions of documents with concept labels, to identifying instances or concept mentions, to connecting sparse information (e.g. a telephone number and its owner). The process involves an important and knowledge-intensive role for the human user. Annotation is meant mainly to be statically associated with the documents. Static annotation can: (1) be incomplete or incorrect when the creator is not skilled enough; (2) become obsolete, i.e. not aligned with page updates; (3) be devious, e.g. for spamming or dishonest purposes; professional spammers could use manual annotation very effectively for their own purposes.

For these reasons, we believe that the Semantic Web needs methods for (nearly) completely automatic page annotation. In this way, the initial annotation associated with a document will lose its importance, because at any time it will be possible to automatically reannotate the document.


Systems like SemTag [4] are a first step in that direction. SemTag addresses the problem of annotating large document repositories (e.g. the Web) for retrieval purposes, using very large ontologies. Its task is annotating portions of documents with instance labels. The system can be seen as an extension of a search engine. The process is entirely automatic and the methodology is largely ontology/application independent. The kind of annotation produced is quite shallow when compared to the classic one introduced for the SW: for example, there is no attempt to discover relations among entities. AeroDaml [9] is an information extraction system aimed at generating draft annotation to be refined by a user, in a similar way to today's automated translation services. The kind of annotation produced is more sophisticated than SemTag's (e.g. it is also able to recognize relations among concepts), but, in order to cover new domains, it requires the development of application/domain-specific linguistic knowledge bases (an IE expert is required). The harvester of the AKT triple store (http://triplestore.aktors.org/SemanticWebChallenge/) is able to build large knowledge bases of facts for a specific application. Here the aim is both large-scale and deep ontology-based annotation. The process requires writing a large number of wrappers for information sources using Dome, a visual language which focuses on the manipulation of tree-structured data [11]. Porting requires a great deal of manual programming. Extraction is limited to highly regular and structured pages selected by the designer. Maintenance is complex because, as is well known in the wrapper community, when pages change their format it is necessary to re-program the wrapper [10]. The approach is not applicable to irregular pages or free-text documents. The manual approach also makes it very difficult to use very large ontologies (as in SemTag).

In this paper we propose a methodology for document annotation that was inspired by the latter approach, but (1) it does not require human intervention for programming wrappers, (2) it is not limited to highly regular documents, and (3) it is largely unsupervised. The methodology is based on adaptive information extraction and integration, and is implemented in Armadillo, a tool able to harvest domain information from large repositories. In the rest of the paper we describe and discuss the methodology, present experimental results on a specific domain and compare Armadillo with the current state of the art. Finally we outline some challenges that the methodology highlights.

2 Armadillo

Armadillo is a system for producing automatic domain-specific annotation on large repositories in a largely unsupervised way. It annotates by extracting information from different sources and integrating the retrieved knowledge into a repository. The repository can be used both to access the extracted information and to annotate the pages where the information was identified. Also, the link with the pages can be used by a user to verify the correctness and the provenance of the information.



Input:
  • an Ontology;
  • an Initial Lexicon;
  • a Repository of Documents;
Output: a set of triples representing the extracted information
  and to be used to annotate documents

do {
  – spot information using the lexicon
  – seek confirmation of the identified information
  – extend the lexicon using adaptive information extraction
    and seek confirmation of the newly extracted information
} while a stable set of information is found (e.g. the base
  does not grow any more)

• Integrate information from different documents
• Store information in the repository


Fig. 1. Armadillo's main algorithm
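The loop of Fig. 1 can be pictured with a short Python sketch; here spot, learn and confirm are placeholders for Armadillo's actual services and are not part of the paper:

from typing import Callable, Iterable, Set, Tuple

Fact = Tuple[str, str, str]   # (subject, verb, object) triple

def harvest(documents: Iterable[str],
            lexicon: Set[str],
            spot: Callable[[Iterable[str], Set[str]], Set[Fact]],
            learn: Callable[[Iterable[str], Set[Fact]], Set[Fact]],
            confirm: Callable[[Fact], bool]) -> Set[Fact]:
    """Fig. 1 as code: spot with the lexicon, confirm, learn, repeat until stable."""
    facts: Set[Fact] = set()
    while True:
        # Spot candidate annotations with the current lexicon; keep confirmed ones.
        confirmed = {f for f in spot(documents, lexicon) if confirm(f)}
        # Bootstrap adaptive IE from the confirmed annotations; newly extracted
        # facts must themselves be confirmed before they are trusted.
        learnt = {f for f in learn(documents, confirmed) if confirm(f)}
        new = (confirmed | learnt) - facts
        if not new:            # the base does not grow any more: stop
            return facts       # facts are then integrated and stored in the repository
        facts |= new
        # Newly confirmed facts extend the lexicon for the next iteration.
        lexicon = lexicon | {obj for (_, _, obj) in new}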

Armadillo's approach is illustrated in Figure 1. In the first step of the loop, possible annotations in a document are identified using an existing lexicon (e.g. the one associated with the ontology). These are just potential annotations and must be confirmed using some strategy (e.g. disambiguation or multiple evidence). Then other annotations not provided by the lexicon are identified, e.g. by learning from the context in which the known ones were found. All new annotations must be confirmed, can be used to learn further ones, and then become part of the lexicon. Finally all annotations are integrated (e.g. some entities are merged) and stored in a database. Armadillo employs the following methodologies:

– Adaptive Information Extraction from texts (IE): used for spotting information and for learning further new instances.

– Information Integration (II): used (1) to discover an initial set of information to seed learning for IE and (2) to confirm newly acquired (extracted) information, e.g. using multiple evidence from different sources. For example, a new piece of information is confirmed if it is found in different (linguistic or semantic) contexts.

– Web Services: the architecture is based on the concept of "services". Each service is associated with some part of the ontology (e.g. a set of concepts and/or relations) and works independently. Each service can use other services (including external ones) to perform some of its sub-tasks. For example, a service for recognizing researchers' names in a university Web site will use a Named Entity Recognition system as a sub-service; the latter recognizes potential names (i.e. generic people's names), which are then confirmed as real researchers' names (e.g. as opposed to secretaries' names) using internal strategies.

Fig. 2. The Armadillo Architecture

– RDF repository: where the extracted information is stored and the link with the pages is maintained.

A development environment allows architectures for new applications to be defined. Porting to new applications does not require knowledge of IE. All the methods used tend to be domain independent and are based on generic strategies to be composed for the specific case at hand. The only domain-dependent parts are: the initial lexicon, the ontology and the way the confirmation strategies are designed/composed.

2.1 Extracting Information

Most of the current tools (e.g. COHSE and SemTag) provide annotation using a static lexicon which contains the lexicalizations of the objects in the ontology. The lexicon does not grow while the computation goes on, unless the user adds terms to it. Armadillo continually and automatically expands the initial lexicon by learning to recognize regularities in the repository. As a matter of fact, Web resources (and in general all repositories) have a specific bias, i.e. there are a number of regularities, either internal to a single document or across a set of documents [13].


Regularities can either be very strong (e.g. in the case of pages generated by a database) or given by a style imposed by the designers. Armadillo is able to capture such regularities and use them to learn to expand its initial lexicon. There are two ways in which an object (e.g., an instance of a concept) can be identified in a set of documents: using its internal description (e.g. its name or the words describing it), or the context in which it appears. Systems like COHSE, SemTag and Magpie [6] use the former. MnM and Ontomat [8] use adaptive IE to learn from the context as well. They use the regularity in a collection of documents (e.g. a set of web pages about a specific topic from the same site) to derive corpus-wide rules. In those approaches, it is very important that the corpus is carefully chosen to be consistent in its regularity. This allows learning from the human annotation to converge quickly to a stable and effective situation. Armadillo uses an approach that can be seen as an extension of the one used in MnM and Ontomat, but where there is no human in the loop and where large, diverse repositories (e.g. whole portions of the Web) are annotated, so that such regularity cannot be taken for granted. The system has to find its own ways to identify those regularities and use them to learn without user support. When regularities in the context are found, they are used to learn other occurrences of the same (type of) object.

2.2 Gradually Acquiring Information

Armadillo exploits a key feature of the Web: the redundancy of information. Redundancy is given by the presence of multiple citations of the same information in different contexts and in different superficial formats. Redundancy is currently used to improve question answering systems [5]. When known information is present in different sources, it is possible to use its multiple occurrences to bootstrap recognizers that, when generalized, will retrieve other pieces of information, producing in turn more (generic) recognizers [1]. Armadillo uses redundancy in order to bootstrap learning beyond the initial user lexicon (if any) and even to acquire the initial lexicon. In particular, information can be present in different forms on the Web: in documents, in repositories (e.g. databases or digital libraries), via agents able to integrate different information sources, etc. From them or their output, it is possible to extract information with different degrees of reliability. Systems such as databases contain structured data that can be queried either via APIs or via web front ends (getting HTML output). In the latter case, wrappers can be induced to extract the information. Wrapper induction methodologies are able to model rigidly structured Web pages such as those produced by databases [10]. When the information is contained in textual documents, extracting it requires more sophisticated methodologies. Wrapper induction systems have been extended to cope with less rigidly structured pages, free texts and even a mixture of them [2]. There is an increasing degree of complexity in the extraction tasks mentioned above. As complexity increases, more training data is required: wrappers can be trained with a handful of examples, whereas full IE systems may require millions of words.


The whole IE process in Armadillo is based on integrating information from different sources to provide annotations which bootstrap learning, which in turn provides more annotation, and so on. The process starts with simple methodologies which require limited annotation, in order to produce further annotation to train more complex modules. The ontology provides the means for integrating information extracted from different sources. For example, simple wrappers can be used to extract information from web pages produced by databases containing papers from computer science departments. In order to avoid wrapping each database separately (i.e. providing examples of annotated input/output for each of them), Armadillo uses information from a database already wrapped in order to provide automatic annotation of examples for the other ones, as proposed in [13]. For example, if the goal is to extract bibliographic information about the Computer Science field, it is possible to use Citeseer (www.citeseer.com), a large (and largely incomplete) database, to learn how to query and understand another service, e.g. the NLDB bibliography at Unitrier (http://www.informatik.uni-trier.de/~ley/db/). This can be done by querying Citeseer and the NLDB with the same terms, producing two parallel pages of results. The one from Citeseer will have a known format and the information can be easily extracted using a predefined wrapper. Then some of the information contained in the NLDB output page can be automatically annotated (e.g. for the paper title an intelligent string match is generally sufficient). Using the annotated examples it is possible to induce wrappers that, given the high regularity of the information in the NLDB page, are also able to extract papers from the latter. Considering that training a wrapper generally requires just a handful of examples, it is possible to focus only on those examples where the match is very clear and reliable, discarding others that are more questionable, and therefore to produce a highly reliable wrapper. Facilities for defining wrappers are provided in our architecture by Amilcare (nlp.shef.ac.uk/amilcare/), an adaptive IE system based on a wrapper induction methodology able to cope with a whole range of documents, from rigidly structured documents to free texts [3].
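As an illustration of this seeding step, a minimal sketch of the "intelligent string matching" between titles obtained from the already-wrapped source and lines of the unwrapped result page; the normalisation and the 0.9 threshold are our assumptions, not values from the paper:

import re
from difflib import SequenceMatcher

def normalise(s: str) -> str:
    # Lower-case and strip punctuation so superficial format differences are ignored.
    return re.sub(r"[^a-z0-9 ]+", " ", s.lower()).strip()

def seed_annotations(known_titles, candidate_lines, threshold=0.9):
    """Return (line_index, title) pairs where a known title clearly matches a line
    of the unwrapped result page; questionable matches are simply discarded."""
    seeds = []
    for i, line in enumerate(candidate_lines):
        for title in known_titles:
            score = SequenceMatcher(None, normalise(title), normalise(line)).ratio()
            if score >= threshold:
                seeds.append((i, title))
                break
    return seeds

Only the very clear matches survive the threshold, which is what allows a highly reliable wrapper to be induced from a handful of examples.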

2.3 Web Services

Each task in Armadillo (e.g. discovering all the papers written by an author) is performed by a server, which in turn will use other servers to implement some of its parts (sub-tasks). Each server exposes a declaration of input and output, plus a set of working parameters. Servers are reusable in different contexts and applications. For example, one server could return all the papers written by a person by accessing Citeseer. Another one will do the same on another digital library. Another one will use the papers extracted by the other two in order to discover pages where they are cited and to bootstrap learning. Yet another will invoke these servers, integrate the evidence returned by each of them, and decide whether there is enough evidence to conclude that some newly discovered candidate string represents a real new object, or perhaps just a variant version of a known name. All the servers are defined in a resource pool and can be used in a user-defined architecture to perform specific tasks.


New servers can be defined and added to the pool by wrapping them in a standard format. The defined architecture works as a "Glass Box": all the steps performed by the system are shown to the user together with their input and output. The user can check the intermediate results and manually modify their output, or change their strategy (where possible, such as in the case of modules that integrate information). For example, if a piece of information is missed by the system, it can be added manually by the user. The modules working on the output of that module will then be re-run and further information will hopefully be retrieved. In this way the user is able both to check the results of each step and to improve the results of the system by manually providing some contributions (additions, corrections, deletions).
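One possible way to picture such a server declaration (input/output signature, working parameters, and an inspectable last result for the "Glass Box" behaviour described above); the class and field names are ours, since the paper does not specify a concrete interface:

from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Service:
    """A reusable server: declared inputs/outputs, working parameters, and the
    last result, kept so that the user can inspect or correct it (Glass Box)."""
    name: str
    inputs: List[str]
    outputs: List[str]
    run: Callable[[Dict], Dict]
    params: Dict = field(default_factory=dict)
    last_result: Dict = field(default_factory=dict)

    def __call__(self, data: Dict) -> Dict:
        self.last_result = self.run({**self.params, **data})
        return self.last_result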

2.4 The Triple Store

Facts extracted from the Web are stored in an RDF store in the form of triples which define relations in the form "Subject - Verb - Object", where the subject and object are elements and the verb describes the relation between them. For each element in the triples, the following information is stored: the string (e.g. "J. Smith"), the position where it was found (e.g. the document URL and the offsets) and the concept, instance or relation represented. The triples can also be used to derive aliases for the same object, i.e. a lexicon ("J. Smith" at <www-address1>:33:44 and "John Smith" at <www-address2>:21:35), and to recover dispersed information (e.g. the person JSMITH45 has names "J. Smith" at <www-address1>:33:44 and "John Smith" at <www-address2>:21:35, telephone number "+44.12.12.12.12" at <www-address3>:10:12, and homepage at <www-address4>). The triple store constitutes the resource used by the different services to communicate. Each server stores the extracted information in the form of signed triples. The other services will extract them and elaborate the information in order to store further information (or to confirm the existing one). Each piece of information is tagged with its provenance, both in terms of source document and in terms of extraction method, i.e. the service or agent that retrieved it. The provenance is used to assign reliability to the information itself: the more a piece of information is confirmed, the more reliable it is considered.
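A sketch of how an element occurrence and a signed triple might be represented; the class and field names are illustrative, not Armadillo's actual schema:

from dataclasses import dataclass

@dataclass(frozen=True)
class Mention:
    """One occurrence of an element: the surface string plus where it was found."""
    text: str      # e.g. "J. Smith"
    url: str       # source document
    start: int     # character offsets within the document
    end: int

@dataclass(frozen=True)
class Triple:
    """Subject - Verb - Object, signed with the extraction method that produced it."""
    subject: Mention
    verb: str          # e.g. "hasName", "hasTelephone"
    obj: Mention
    extractor: str     # provenance: the service or agent that retrieved the fact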

2.5 Confirming Information

A crucial issue in the cycle of seeding and learning is the quality of the information used to seed. Wrong selections can make the method diverge and produce spurious information. When a piece of information is acquired (e.g. a new paper is assigned to a specific author in the CS task mentioned in Section 3), Armadillo requires confirmation by different sources before it can be used for further seeding of learning. Again, exploiting the redundancy of the Web, we expect the same piece of information to be repeated somewhere in a different form and the system to find it. The strategy for evidence gathering is application dependent. Users have to identify task-specific strategies.


Such strategies can be defined declaratively in Armadillo by posing requirements on the provenance of the information, in terms of methods of extraction and sources (including the number of times the information was confirmed in the different sources).
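Building on the Triple sketch above, a minimal example of such a declarative requirement: accept a fact only if it is backed by a minimum number of distinct sources and extraction methods (the thresholds are illustrative):

def confirmed(evidence_triples, min_sources=2, min_methods=2):
    """evidence_triples: all signed triples asserting the same fact.
    Accept the fact only if it is backed by enough independent evidence."""
    triples = list(evidence_triples)
    sources = {t.subject.url for t in triples} | {t.obj.url for t in triples}
    methods = {t.extractor for t in triples}
    return len(sources) >= min_sources and len(methods) >= min_methods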

The next section describes how Armadillo was used in a specific application.

3 Armadillo in Action

Armadillo has so far been applied to three tasks: the CS website harvesting task, the Art Domain task and the discovery of geographical information (see http://www.dcs.shef.ac.uk/~sam/results/index.html). Here we describe Armadillo's application to mining the websites of Computer Science departments, an extension of its original task in the AKTive Space application that won the 2003 Semantic Web Challenge (http://challenge.semanticweb.org/). Armadillo's task is to discover who works for a specific department (name, position, homepage, email address, telephone number) and to extract for each person some personal data and a list of published papers larger than the one provided by services such as Citeseer.

3.1 Finding People Names

The goal of this subtask is to discover the names of all the people who work in a specific department. This task is more complex than generic Named Entity Recognition because many non-researchers' names are cited in a site, e.g. those of undergraduate students, clerics, secretaries, etc., as well as the names of researchers from other sites who, for example, participate in common projects or have co-authored papers with members of staff. Organizing the extraction around a generic Named Entity Recognizer (NER) is the most natural option. This does not finish the job, though, because a NER recognizes ALL the people's names in the site, without discriminating between relevant and irrelevant ones. Moreover, classic NERs tend to be quite slow if launched on large sites and can be quite imprecise on Web pages, as they are generally designed for newspaper-like articles. A two-step strategy is used here instead: initially a short list of highly reliable seed names is discovered; this constitutes the initial lexicon. Such a lexicon could also be provided by an existing list; here we suppose such a list does not exist. These seeds are then used to bootstrap learning for finding further names.

Finding Seed Names. To find seed names, a number of weak strategies are combined that integrate information from different sources. First of all, the web site is crawled looking for strings that are potential names of people (e.g. using a gazetteer of first names and a regular expression such as <first-name> (<capitalized-word>)+). Then the following web services are queried:

– Citeseer (www.citeseer.com): Input: the potential name; Output: a list of papers and a URL for a homepage (if any);

– The CS bibliography at Unitrier (http://www.informatik.uni-trier.de/~ley/db/): Input: the potential name; Output: a list of papers (if any);

– HomePageSearch (http://hpsearch.uni-trier.de/): Input: the potential name; Output: a URL for a homepage (if any);

– Annie (www.gate.ac.uk): Input: the potential name and the text surrounding it; Output: True/False;

– Google (www.google.co.uk): Input: the potential name and the URL of the site in order to restrict the search; Output: relevant pages that are hopefully homepages.

The digital libraries (Citeseer and Unitrier) are used as first filters to determine whether a string is the name of a known researcher. If they return reasonable results for a specific string (i.e. not too few and not too many), the name is processed further, otherwise it is discarded. A string is a potentially valid name if the digital libraries return a reasonable number of papers (between 5 and 50 in our experiments). Results not in line with these reasonability criteria are discarded as inappropriate seeds. This is to discard potential anomalies such as ambiguous names (e.g. Citeseer returns more than 10,000 papers for the term "Smith"; this cannot be a single researcher) and invalid names (e.g. the words "Fortune Teller" do not return any paper). We tend to use quite restrictive criteria to keep reliability high (i.e. it is quite possible that a person has written some 100 papers, but such a number could also hide name ambiguity). The results of the digital libraries are integrated with those of a classic Named Entity Recognizer run on a window of words around the candidates (so as to avoid the problem of slow processing). At this point a number of names are available that fall into three potential types: (1) correct (people working for the department); (2) wrong (not people: false positives); (3) people who do not work at the site but are cited because, for example, they have co-authored papers with some of the researchers of the department. For this reason, Citeseer, Google and HomepageSearch are used to look for a personal web page within the site. If such a page is not found, the name is discarded. From the results, personal web pages are recognized with simple heuristics such as looking for the name in the title or in "<H1>" tags. The process described above is meant to determine a small, highly reliable list of seed names to enable learning. Each of the strategies is, per se, weak: they all report high recall and low precision. Their combination is good enough to produce data with high accuracy.
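A sketch of the candidate spotting and "reasonability" filter described above; the gazetteer excerpt and the count_papers call are placeholders for the real resources and web services:

import re
from typing import Callable, Iterable, List, Set

FIRST_NAMES: Set[str] = {"Fabio", "Sam", "Alexiei", "Yorick"}   # gazetteer excerpt
NAME_PATTERN = re.compile(r"\b([A-Z][a-z]+)(?:\s+[A-Z][a-z]+)+\b")

def candidate_names(text: str) -> List[str]:
    """Strings starting with a known first name followed by capitalized words."""
    return [m.group(0) for m in NAME_PATTERN.finditer(text)
            if m.group(1) in FIRST_NAMES]

def seed_names(candidates: Iterable[str],
               count_papers: Callable[[str], int],
               low: int = 5, high: int = 50) -> List[str]:
    """Keep only candidates for which the digital libraries return a 'reasonable'
    number of papers: too few suggests a non-researcher, too many suggests ambiguity."""
    return [name for name in candidates if low <= count_papers(name) <= high]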

Learning Further Names. All the occurrences of seed names are then annotated in the site's documents, and learning is initiated only on documents where a reasonable quantity of known names is organized in structures such as lists and tables. Such structures generally have an intrinsic semantics: lists generally contain elements of the same type (e.g. names of people), while the semantics in tables is generally related to the position in rows or columns (e.g. all the elements of the first column are people, the second column represents addresses, etc.). When some seeds (at least four or five in our case) are identified in a list or in specific portions of a table, we train a classifier able to relate a large part of these examples, for example using linguistic and/or formatting criteria (e.g. relevant names are always the first element in each row).


If we succeed, we are able to reliably recognize other names in the structure. Every department generally has one or more pages listing its staff in some kind of list. These are the lists that we are mainly looking for, but tables assigning supervisors to students are also useful, provided that students and teachers can be discriminated. Each time new examples are identified, the site is further annotated and more patterns can potentially be learnt. New names can be cross-checked against the resources used to identify the seed list: we then have more evidence that these names are real names. In our experiments this is enough to discover a large part of the staff of an average CS website with very limited noise, even using a very strict strategy of multiple cross-evidence. We currently use combinations of the following evidence to accept a learnt name: (1) the name was recognized as a seed; (2) the name is included in an HTML structure where other known occurrences are found; (3) there is a hyperlink internal to the site that wraps the whole name; (4) there is evidence from generic patterns (as derived by recognizing people on other sites) that this is a person. The latter strategy was inspired by [12].
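A sketch of combining these four kinds of evidence into an acceptance decision; the predicate names and the "at least two pieces of evidence" policy are illustrative, as the paper does not give the exact combination rule:

def accept_name(evidence: dict, min_evidence: int = 2) -> bool:
    """evidence maps each evidence type to a boolean, e.g.
    {"seed": False, "known_structure": True, "internal_hyperlink": True, "generic_pattern": False}."""
    checks = ("seed", "known_structure", "internal_hyperlink", "generic_pattern")
    return sum(bool(evidence.get(k, False)) for k in checks) >= min_evidence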

Evaluation. The effectiveness of the CS task was evaluated on a number of sites. Here we report results from an evaluation done both on a specific web site (the Computer Science Department site of the University of Sheffield, www.dcs.shef.ac.uk) and by pointing the system to some pages containing information of interest for the task but distributed over random sites (the latter task is equivalent to applying Armadillo to the results of a document classifier providing interesting pages). Results on other sites are qualitatively largely equivalent. On the Sheffield department's website, the system initially discovers 51 seed names of people (either academics, researchers or PhD students) by integrating information from Citeseer and NLDB. Of these, 48 are correct and 3 wrong. These names are used to seed learning. Learning allows the discovery of another 57 names, 48 correct, 6 wrong. This increases the overall recall from 37% to 84% with a very limited loss in precision (see Table 1). Results obtained on the set of pages from random sites are given in Table 2. In this experiment we checked Armadillo's ability to improve results when the II step returns high recall. The AKT triple store was used as a source of information in addition to Citeseer and UniTrier. The gain from using the IE-based extraction is still considerable (recall grows from 73 to 85).

3.2 Discovering Paper Citations

Discovering which papers are written by which members of the departmental staff is a very difficult task. It requires recognizing the paper title and the authors, and then relating the authors to the people identified in the previous step. Authors are names in particular positions and particular contexts: they must not be confused with the editors of collections in which a paper is published, nor with other names mentioned in the surrounding text. A title is generally a random sequence of words (e.g. the title of [5]) and cannot be characterized in any way (i.e. we cannot write generic patterns for identifying candidate strings as we did for people).


                Possible  Actual  Correct  Wrong  Missing  Precision  Recall  F-Measure
Seed discovery       129      51       48      3        0         94      37         51
Adaptive IE          129     108       99      9       30         92      84         87

Table 1. Results in discovering people and associated homepages. First line: accuracy reached using Information Integration only (Citeseer+Google, etc.); second line: accuracy using adaptive IE. Possible represents the number of people working for the department, Actual the number of people returned by the system. Actual results are divided into Correct, Wrong and Missing.

                Possible  Actual  Correct  Wrong  Missing  Precision  Recall  F-Measure
Seed Discovery       331     243      242      1       89      99.59   73.11      84.32
Adaptive IE          331     288      284      4       47      98.61   85.80      91.76

Table 2. The results of discovering the names of people working at some random sites.

Moreover, paper titles must not be confused with the titles of the collections in which they are published. CS department sites typically contain lists of publications, either for the department as a whole or personal lists for each member of staff. Moreover, papers are co-authored, so it is very possible that each paper is cited more than once within a specific site. In rare cases personal lists of papers are produced using a departmental database (i.e. all the publication pages are formatted in the same way), but in most cases each person writes the list using a personal format; very often the style is quite irregular, as the list is compiled manually over time. This is a typical case in which the classic methodology of manually annotating some examples for each page for each member of staff is unfeasible, due to the large number of different pages. Also, irregularities in style produce noisy data, and classic wrappers are not able to cope with noise. A generic methodology is needed that does not require any manual annotation.

In order to bootstrap learning, we query the digital libraries (Citeseer and UniTrier) using staff names as keywords. The output for each name is hopefully a list of papers. Such lists will be incomplete because the digital libraries are largely incomplete. The titles in the list are then used to query a search engine to retrieve pages containing multiple paper citations. We focus on lists and tables where at least four papers are found. We use titles because they tend to be unique identifiers. We are looking for seed examples, so we can discard titles which return too many hits (so as to avoid titles which are very common strings, such as "Lost"). As for discovering new papers, the seed examples are annotated and page-specific patterns are induced. We favour examples contained in structures such as lists and tables for which we have multiple evidence.


                Possible  Actual  Correct  Wrong  Missing  Precision  Recall  F-Measure
Seed Discovery       320     151      152      1      168         99      47         64
Adaptive IE          320     217      214      3      103         99      67         80

Table 3. Paper title harvesting for 7 random people.

Note, however, that a citation is often not very structured internally. For example:

<li> Fabio Ciravegna, Alexiei Dingli, Daniela Petrelli and Yorick Wilks: <br>
User-System Cooperation in Document Annotation based on Information Extraction <br>
in Asuncion Gomez-Perez, V. Richard Benjamins (eds.): Knowledge Engineering and
Knowledge Management (Ontologies and the Semantic Web), <br>
Lecture Notes in Artificial Intelligence 2473, Springer Verlag <br> </li>

Simple wrappers would be ineffective here, as there is no way to discriminate, for example, between authors and editors, or between the title of the paper and the title of the collection, when relying on the HTML structure only. More sophisticated wrapper induction systems are needed, such as that provided by Amilcare, which uses both the HTML structure and (para-)linguistic information [3]. Using a cycle of annotation/learning/annotation we are able to discover a large number of new papers. Note that every time co-authorship between people is discovered on the publication page of one author, the paper is retained for annotation of the publication pages of the other authors (i.e. the redundancy is exploited again).
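Returning to the seed selection described at the start of this subsection, a sketch of the two filters involved (discarding over-common titles and keeping only pages that contain enough known titles); count_hits stands in for the search-engine call, and both thresholds are our assumptions:

from typing import Callable, Iterable, List, Tuple

def title_seeds(titles: Iterable[str],
                count_hits: Callable[[str], int],
                max_hits: int = 200) -> List[str]:
    """Keep only titles specific enough to act as near-unique identifiers:
    very common strings (e.g. "Lost") return too many hits and are discarded."""
    return [t for t in titles if count_hits(t) <= max_hits]

def pages_to_learn_from(pages: Iterable[Tuple[str, str]],
                        seed_titles: Iterable[str],
                        min_seeds: int = 4) -> List[Tuple[str, List[str]]]:
    """Keep only (url, text) pages in which at least min_seeds known titles occur,
    i.e. pages that look like publication lists worth inducing a wrapper on."""
    seeds = list(seed_titles)
    selected = []
    for url, text in pages:
        found = [t for t in seeds if t.lower() in text.lower()]
        if len(found) >= min_seeds:
            selected.append((url, found))
    return selected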

Evaluation. Discovering papers is a very complex task. We performed the task of associating papers to the people discovered during the previous step. A paper was considered correctly assigned to a person if it was authored by that person and the title was 100% correct. We did not use reseeding in the experiment, i.e. if a paper was co-authored by two researchers, the information returned for one person was not used to further annotate the publication pages of the second person. In this sense the redundancy of information was not fully exploited. Checking the correctness of papers for a hundred people is very labor intensive, therefore we randomly checked the papers extracted for 7 staff members for which the seed papers exceeded 6 examples; results are shown in Table 3. The use of IE significantly increases the overall recall, which grows from 47 for the seeds to 67 for the IE-based extraction, with precision 99 and 98 and F-measure 64 and 80 respectively.

4 Conclusions

In this paper we have described a methodology to extract information from large repositories (e.g. large Web sites) with minimum user intervention. Extracted information can then be used for document annotation. Information is initially extracted starting from highly reliable, easy-to-mine sources such as databases and digital libraries, and is then used to bootstrap more complex modules such as wrappers for extracting information from highly regular Web pages.


Information extracted by the wrappers is then used to train more sophisticated IE engines. All the training corpora for the IE engines are produced automatically. Experiments show that the methodology can produce high quality results. User intervention is limited to providing an initial URL and to adding information missed by the different modules when the computation is finished. No preliminary manual annotation is required. The information added or deleted by the user can then be reused to restart learning and therefore obtain more information (recall) and/or more precision. The type of user needed is a person able to understand the annotation task; no skills in IE are needed. The natural application of such a methodology is the Web, but large companies' repositories are also an option. In this paper we have focused on the use of the technology for mining web sites, an issue that can become very relevant for the Semantic Web, especially because annotation is provided largely without user intervention. It could potentially provide a partial solution to the outstanding problem of who will provide semantic annotation for the SW. It can potentially be used either by search engines associated with services/ontologies to automatically annotate/index/retrieve relevant documents, or by specific users to retrieve the information they need on the fly by composing an architecture.

Armadillo has been fully integrated into the AKT triple store and is constantly providing new triples to it. Its contribution to the architecture is the ability to re-index pages when they change format (in the classic architecture this step would require manual reprogramming of the wrapper) and the ability to extract information from sources that are not as highly structured and regular as those required by Dome [11]. Armadillo is compatible with SW tools like COHSE, Magpie, MnM, Ontomat, etc. In COHSE and Magpie it could provide a way to extend the automatic annotation step beyond the connection between simple terms and the concept descriptions stored in a lexicon, allowing a move towards relation identification; moreover, it could provide automatic extension of the initial lexicon. In MnM and Ontomat, Armadillo could provide a way to converge more rapidly towards an effective annotation service. As a matter of fact, learning in those tools is limited to the documents already annotated by the user and to the use of an initial lexicon. Armadillo could provide a way to integrate information from external repositories (e.g. digital libraries) into the corpus, so as to learn in an unsupervised way, for example from regularities found in documents that have not been annotated.

From the IE point of view there are a number of challenges in learning from automatic annotation instead of human annotation. On the one hand, not all the annotation is reliable: the use of multiple strategies and combined evidence reduces the problem, but there is still a strong need for methodologies that are robust with respect to noise. On the other hand, many IE systems are able to learn only from completely annotated documents, where all the annotated strings are considered positive examples and the rest of the text is used as a set of counterexamples. In our cycle of seeding and learning, we generally produce partially annotated documents.


This means that the system is presented with positive examples, but the rest of the text can never be considered a set of negative examples, because unannotated portions of text can contain instances that the system has still to discover, not counterexamples. This is a challenge for the learner. At the moment we present the learner with just the annotated portion of the text plus a window of words of context, not with the whole document. This is enough for the system to learn correctly: the number of unannotated instances that enter the training corpus as negative examples is generally low enough to avoid problems. In the future we will have to focus on machine learning methodologies that are able to learn from scattered annotation.
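A sketch of this workaround: only the annotated span plus a fixed window of surrounding tokens is turned into a training instance, so unannotated text elsewhere in the document never acts as negative evidence (the window size is illustrative):

from typing import Iterable, List, Tuple

def training_instances(tokens: List[str],
                       annotated_spans: Iterable[Tuple[int, int]],
                       window: int = 10) -> List[Tuple[List[str], Tuple[int, int]]]:
    """For each confirmed annotation, return only the span plus a window of
    surrounding tokens, with the span re-indexed relative to that context."""
    instances = []
    for start, end in annotated_spans:
        left = max(0, start - window)
        right = min(len(tokens), end + window)
        instances.append((tokens[left:right], (start - left, end - left)))
    return instances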

Many of the classic problems of integrating information have to be coped with in Armadillo. Information can be represented in different ways in different sources, from both a syntactic and a semantic point of view. The syntactic variation is coped with in the architecture definition step: when two modules are connected, a canonical form of the information is defined in the ontology, so that e.g. the classic problem of recognising film titles such as "The big chill" and "Big chill, the" can be addressed. More complex tasks have to be addressed, though. In the art domain it is quite common to report the title of an art work in different languages. For example, a number of Cezanne's paintings are referred to in different web sites as both "Apples and Oranges" and "Aepfel mit Orangen" (the same title, but in German), and Michelangelo's work can be referred to as "The Last Judgment" or "Il Giudizio Universale" (in Italian). Relating them can be very difficult, even for a human, without looking at the actual artwork. Also, a person's name can be cited in different ways in different documents: N. Weaver, Nick Weaver and Nicholas Weaver are potential variations of the same name. But do they identify the same person as well? This is the problem of intra- and inter-document coreference resolution, well known in Natural Language Processing. In many applications it is possible to identify some simple heuristics to cope with this problem. For example, when mining one specific CS website, N. Weaver, Nick Weaver and Nicholas Weaver are most of the time the same person, therefore it is possible to hypothesize coreference. Another potential problem concerns ambiguity in the external resources (e.g. in the digital libraries). When querying with very common names (e.g. "John Smith"), papers by different people are mixed. This is not a problem in Armadillo because the returned information is used to annotate the site in order both to seed more learning and to look for multiple confirmation. Papers by people from other departments or universities will not introduce any annotations and therefore will not be accepted. The same applies when multiple homepages are returned: if some of them do not have an address local to the current site, they are not used.
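A sketch of such a within-site heuristic for name variants; the prefix/initial test is our illustration of the idea, not Armadillo's actual rule:

def same_person(name_a: str, name_b: str) -> bool:
    """Within-site heuristic: same surname, and one first name is an initial or
    prefix of the other, so "N. Weaver", "Nick Weaver" and "Nicholas Weaver"
    are hypothesised to corefer."""
    if " " not in name_a or " " not in name_b:
        return False
    first_a, last_a = name_a.rsplit(" ", 1)
    first_b, last_b = name_b.rsplit(" ", 1)
    if last_a.lower() != last_b.lower():
        return False
    a = first_a.rstrip(".").lower()
    b = first_b.rstrip(".").lower()
    return a.startswith(b) or b.startswith(a)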

Acknowledgements

This work was carried out within the AKT project (www.aktors.org), sponsored by the UK Engineering and Physical Sciences Research Council (grant GR/N15764/01), and the Dot.Kom project (www.dot-kom.org), sponsored by the EU IST as part of Framework V (grant IST-2001-34038).


References

1. Sergey Brin. Extracting patterns and relations from the world wide web. In WebDB Workshop at 6th International Conference on Extending Database Technology, EDBT'98, 1998.

2. Fabio Ciravegna. Adaptive information extraction from text by rule induction and generalisation. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI), Seattle, 2001.

3. Fabio Ciravegna. Designing adaptive information extraction for the Semantic Web in Amilcare. In S. Handschuh and S. Staab, editors, Annotation for the Semantic Web, Frontiers in Artificial Intelligence and Applications. IOS Press, 2003.

4. S. Dill, N. Eiron, D. Gibson, D. Gruhl, R. Guha, A. Jhingran, T. Kanungo, S. Rajagopalan, A. Tomkins, J. A. Tomlin, and J. Y. Zien. SemTag and Seeker: Bootstrapping the semantic web via automated semantic annotation. In Proceedings of the World Wide Web Conference 2003, 2003.

5. Susan Dumais, Michele Banko, Eric Brill, Jimmy Lin, and Andrew Ng. Web question answering: Is more always better? In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2002), Tampere, Finland, 2002.

6. Martin Dzbor, John B. Domingue, and Enrico Motta. Magpie - towards a semantic web browser. In Proceedings of the 2nd Intl. Semantic Web Conference, Sanibel Island, Florida, October 2003.

7. C. Goble, S. Bechhofer, L. Carr, D. De Roure, and W. Hall. Conceptual Open Hypermedia = The Semantic Web? In The Second International Workshop on the Semantic Web, pages 44-50, Hong Kong, May 2001.

8. S. Handschuh, S. Staab, and F. Ciravegna. S-CREAM - Semi-automatic CREAtion of Metadata. In Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management, EKAW02. Springer Verlag, 2002.

9. P. Kogut and W. Holmes. Applying information extraction to generate DAML annotations from web pages. In Proceedings of the K-CAP 2001 Workshop Knowledge Markup & Semantic Annotation, Victoria B.C., Canada, 2001.

10. N. Kushmerick, D. Weld, and R. Doorenbos. Wrapper induction for information extraction. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 1997.

11. Thomas Leonard and Hugh Glaser. Large scale acquisition and maintenance from the web without source access. In Siegfried Handschuh, Rose Dieng-Kuntz, and Steffen Staab, editors, Proceedings Workshop 4, Knowledge Markup and Semantic Annotation, K-CAP 2001, 2001.

12. Tom Mitchell. Extracting targeted data from the web. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, 2001.

13. M. Perkowitz and O. Etzioni. Category translation: Learning to understand information on the internet. In International Joint Conference on Artificial Intelligence, IJCAI-95, pages 930-938, Montreal, Canada, 1995.

14. M. Vargas-Vera, Enrico Motta, J. Domingue, M. Lanzoni, A. Stutt, and F. Ciravegna. MnM: Ontology driven semi-automatic or automatic support for semantic markup. In Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management, EKAW02. Springer Verlag, 2002.