
Knowledge-Based Sense Disambiguation (Almost) For All Structures

Federica Mandreoli, Riccardo Martoglia

DII - University of Modena e Reggio Emilia, Via Vignolese 905/b, 41125 Modena, Italy

Abstract

Structural disambiguation is acknowledged as a very real and frequent problem for many semantic-aware applications. In this paper, we propose a unified answer to sense disambiguation on a large variety of structures both at data and metadata level such as relational schemas, XML data and schemas, taxonomies, and ontologies. Our knowledge-based approach achieves general applicability by converting the input structures into a common format and by allowing users to tailor the extraction of the context to the specific application needs and structure characteristics. Flexibility is ensured by supporting the combination of different disambiguation methods together with different information extracted from different sources of knowledge. Further, we support both assisted and completely automatic semantic annotation tasks, while several novel feedback techniques allow us to improve the initial disambiguation results without necessarily requiring user intervention. An extensive evaluation of the obtained results shows the good effectiveness of the proposed solutions on a large variety of structure-based information and disambiguation requirements.

Key words: semantic web, structure-based information, word sense disambiguation

1. Introduction

The ever increasing need of publishing and exchanging data on Web spaces together with the recent growth of the Semantic Web have favored the diffusion of a wide variety of online data structures that semantically describe their contents through meaningful markup terms and relationships. We refer to relational database schemas such as those used in biological applications [1], XML^1 documents where arbitrary elements and structural relationships describe what data is, RDF^2 graphs which represent information about things that can be identified on the Web, OWL^3 ontologies which formally describe the meaning of terminology used in Web documents, as well as taxonomies and directory trees used by search engines and on-line market places to classify their data.

Email addresses: [email protected] (Federica Mandreoli), [email protected] (Riccardo Martoglia)

^1 http://www.w3.org/XML/
^2 http://www.w3.org/RDF/

Preprint submitted to Elsevier, August 3, 2010

Figure 1: A small example showing the need for disambiguation in a schema matching context. [(a) PurchaseOrder relational schema: tables OrderDetail (PK OrderDetailId; FK1 PurchaseOrderId, FK2 ProductId; Quantity), Product (PK ProductId; Name, Producer, Line, UnitCost) and PurchaseOrder (PK PurchaseOrderId; Date, Customer, TaxRate, Discount, InvoiceAmount); arrows link each foreign key to the referenced table. (b) Portion of the DublinCore standard: metadata > description, with children title, identifier, creator, series, format, subject.]

There is a general agreement that annotating such online data structures with machine interpretable semantics would allow the development of much smarter applications for final users. In line with this view, many of the metadata-intensive applications dealing with this kind of structures actually ground their effectiveness in unambiguous markup lexicons. Moreover, a new generation of Semantic Web applications is emerging [2], focused on exploiting the semantic metadata available on the Web. The exact meaning of the involved terms is used, for instance, for the automatic classification of heterogeneous XML data [3] and ontologies [4], for XML query expansion [5], for Web crawling [6], for ontology matching [7, 8] and clustering [9], and to enhance enterprise search and interoperability [10, 2] on document collections and business processes.

On the other hand, markup lexicons are intrinsically ambiguous, that is, many of the terms used in these structures can be interpreted in multiple ways depending on the context in which they occur. Consider, for instance, the small example shown in Figure 1 involving DigiSell, a fictitious online digital media store selling e-books, songs and videos. Since the beginning, the store has managed basic product data (name, producer, line of product) and economic/accounting information (such as product cost, order and invoice details) through a SQL DDL relational schema called PurchaseOrder for short (Figure 1-a). DigiSell is now interested in enhancing the digital media description available on their website, by including, for instance, additional information such as the subject of a book or video, or the format of the media, while also performing validity checks on the already available data. Useful metadata about media products are already available as XML documents compliant with the Dublin Core^4 standard for resource description and cataloguing (Figure 1-b shows a portion we call DublinCore). In this case, matching techniques are needed in order to "link" the two schemas and derive the correct correspondences between them; these techniques should be automatic (remember that complete real-life examples often involve significantly larger structures), and should go beyond purely syntactical information in order to be effective. Let us examine the terminology of the two fragments in more detail to find out why.

^3 http://www.w3.org/TR/owl2-overview/

PurchaseOrder (Figure 1-a) is made up of three tables, PurchaseOrder, Product, and OrderDetail, each containing a list of table attributes. Prefixes PK and FK are used to distinguish primary and foreign keys, respectively, and each foreign key is associated with an arrow from the referencing table to the referenced one. With reference to the WordNet [11] lexical knowledge base, which is one of the most used external knowledge sources for term senses, term line can take on a very large variety of meanings (29 senses), ranging from "a formation of people or things one beside another", such as "line of soldiers", to "text consisting of a row of words", from "cable, transmission line" to "line of products, line of merchandise", which happens to be the right sense in our case. Further, an order could be a "purchase order" but also a "command given by a superior" or a "logical arrangement of elements" (13 different senses), a product could imply "commodities offered for sale" but also a "result of a chemical reaction" or a "quantity obtained by multiplication" (6 different senses), and so on. The same applies to the terms of DublinCore shown in Figure 1-b: 10 possible meanings for term title, 7 meanings for series, and so on. So, as we can see, the involved terms convey a clear meaning to humans but convey only a small fraction (if any) of their meaning to machines [12]. This is obviously a problem: though different in the adopted terminology, the two structures present many conceptual correspondences (such as name and title, producer and creator, line and series) which would be completely missed by merely exploiting syntactical matching techniques. Nor would it be possible to simply consider term synonyms, since these vary with the particular meaning each term is used in: for instance, considering term formation as a synonym of line could have very undesired consequences. Instead, as noted in [7, 13], Word Sense Disambiguation techniques could significantly enhance the precision of the obtained results by exploring the semantics of the ontological term senses involved in the matching process: more specifically, the result of the disambiguation, i.e. the annotations containing the right meaning of each term w.r.t. a reference thesaurus such as WordNet, could for instance allow the computation of semantic similarities between the terms and eventually the discovery of the right correspondences.

The issue of providing explicit semantics to already existing structure-based information is indeed the common denominator of all those applications relying on a machine-interpretable markup lexicon. This problem is mostly independent of the specific application goals, while the effectiveness of the proposed solutions greatly influences the application performance. We therefore argue that a key issue in effectively filling the semantic annotation needs of the current and future generations of applications is to develop robust and generic disambiguation methods. However, very few application-independent solutions exist [12, 10], while great efforts have been spent to produce partial solutions in specific application contexts where the main focus was on how to use semantic annotations rather than on how to effectively disambiguate the involved structures [3, 4, 5, 6, 7, 8, 9].

^4 http://dublincore.org/

On the other hand, Word Sense Disambiguation (WSD), i.e. the problem of computationally determining which "sense" of a word is activated by the use of the word in a particular context, is a core research topic in the computational linguistics field [14]. Long-standing research interest in WSD has produced in the last decades different approaches which show good performance on plain text [14, 15].

In this paper, we start from the lessons learnt in the WSD field [15, 14] and propose a unified answer to the sense disambiguation problems on a large variety of structures both at data and metadata level, such as relational schemas, XML data and schemas, taxonomies, and ontologies. The practical implication of our proposal is that we put STRIDER^5, a structural disambiguation service, at applications' disposal, which can be tailored to the specific requirements and structure characteristics so as to effectively support the disambiguation needs of almost all applications.

In the WSD field, the representation of a word in context, together with additional knowledge resources, is the main support for allowing automatic methods to choose the appropriate sense [15]. One of the most critical issues when manipulating structure-based information is that the effectiveness of the context extracted from the structure may depend on the specific application needs and structure characteristics. For instance, structures on specific topics, such as the DBLP bibliographic archive schema, usually benefit from "broad" contexts, containing almost all terms in the schema. On the other hand, the best strategy to disambiguate a given category in an online marketplace, such as the eBay category "string" meant in the musical-instrument sense, is to select only the nearby categories in the taxonomy, whereas including also distant ones, such as "women's clothing", introduces noise, thus leading to wrong results.

The sense disambiguation approach we propose aims at overcoming all the above mentioned problems in the following way:

• it achieves homogeneity in the disambiguation phase by converting the input structure into a common format that follows the Open Information Model (OIM) [16] and that makes the addition of new structure types easy;

• it introduces the notion of crossing setting as a means to tailor the extraction of the context to the specific application needs and structure characteristics. In particular, a crossing setting allows us to define an abstraction of the neighborhood of each term to be disambiguated, so as to extract context information from correlated terms;

^5 STRucture-based Information Disambiguation ExpeRt

• it achieves flexibility by supporting the combination of different disambiguation methods founded on WSD principles together with different information extracted from the two main sources of knowledge, i.e. a thesaurus and a corpus. For instance, it can exploit the sense definitions and usage examples that are often available in a thesaurus or the term frequencies in the corpus, and it integrates the most frequent sense heuristic for word meaning prediction [14];

• it produces a ranking of the plausible senses of each term in the input structure. In this way, it supports both the assisted semantic annotation and the completely automatic one. In the former case, the disambiguation task is committed to a human expert and the disambiguation service assists him/her by providing useful suggestions. In the latter case, there is no human intervention and the selected sense can be the top one;

• it can improve the disambiguation results by means of several feedback techniques which do not necessarily require user intervention.

The proposed approach has been implemented in a new and completely redesigned 2.0 version of our STRIDER system [17, 18]. A web version of STRIDER is also available online^6. Since no official benchmarks are available for evaluating structure-based information disambiguation systems, we also propose an evaluation method to thoroughly assess disambiguation effectiveness at different levels of quality and for structures presenting different specialization levels. An extensive evaluation of STRIDER shows the good effectiveness of the proposed solutions on a large variety of structure-based information and disambiguation requirements.

The rest of the paper is organized as follows: After a discussion of related works (Section 2) and an overview of our disambiguation approach (Section 3), Section 4 discusses our context extraction method, while the actual disambiguation algorithms are presented in Section 5. An in-depth experimental evaluation is provided in Section 6, while Section 7 concludes the paper and outlines directions for future work.

2. Literature review

Our structural disambiguation approach draws upon some of the ideas and techniques that have been proposed in the well-established free text disambiguation field. In this section we will first discuss the text disambiguation fundamentals on which we founded a large part of our work (Section 2.1); in this case, we will mainly emphasize the ideas and techniques which we think can be most useful in the new context. Then, we will discuss the systems and techniques actually offering structural disambiguation features (Section 2.2) and briefly compare our approach to them. As we will see, only few approaches are available, and most of them are indeed quite limited in features and designed for very specific applications, such as schema matching or web documents' annotation.

^6 http://www.strider.unimo.it/

2.1. A brief account of free text WSD

The free text disambiguation field is the most "classic" and well studied field of WSD. Most of the methods used to disambiguate free text can be adapted to the structural WSD problem, and an understanding of the way they work is fundamental in order to successfully devise and evaluate structural disambiguation mechanisms. In this Section we provide a brief account of free text WSD. Interested readers can refer to [14, 15] for detailed discussions on this topic.

WSD is the ability to computationally determine which sense of a word is activated by its use in a particular context. The possible senses of a word are usually drawn from a sense inventory, and the representation of the word in context together with the use of external knowledge resources are the main supports for allowing automatic methods to select the appropriate sense.

Approaches to WSD are often classified according to the main source of knowledge used in sense differentiation. Methods that rely primarily on dictionaries, thesauri, and lexical knowledge bases are termed knowledge-based, as opposed to methods that work directly from corpora, which are named corpus-based. In this paper, we will mainly investigate knowledge-based methods, which share most similarities with ours, while corpus-based proposals will be discussed only briefly.

Knowledge-based methods are generally applicable in conjunction with any lexical knowledge base that defines word senses (and relations among them). However, WordNet is without doubt one of the most used external knowledge sources in this context. A simple and intuitive knowledge-based approach is the gloss overlap approach originally proposed by Lesk in [19]. It essentially assumes that the correct senses of the target words are those whose glosses have the highest overlap (i.e. words in common). Unfortunately, while this approach can in some cases achieve satisfying accuracies, it is very sensitive to the exact wording of definitions. The revision presented in [20] corrects this problem by expanding the glosses with the glosses of related concepts; however, this still does not lead to state-of-the-art performance compared to other knowledge-based proposals [15]. The structure of the concept network made available by computational lexicons like WordNet is instead exploited by structural approaches. Among those, graph-based approaches are very common and they often rely on the notion of lexical chain [21, 22, 23, 24]: a lexical chain is a sequence of semantically related words, which creates a context and contributes to the continuity of meaning and the coherence of a discourse [25]. The above cited papers essentially find in the concept network the available lexical chains among all possible senses, and the semantically strongest chains are selected for disambiguation purposes. Other graph-based approaches, instead, rely on graph theory, link analysis and social network analysis: for instance, [26] proposes a variety of measures analyzing the connectivity of the graph structures and identifying the most relevant word senses, while [27] applies PageRank-style algorithms to the graphs extracted from natural language documents. While graph-based approaches, and in particular those based on lexical chains, are "global" approaches, i.e. they disambiguate the whole document and all its words at the same time, similarity-based methods are usually applied in local contexts [15] and are thus more flexible to adapt to a structural disambiguation setting, where each term can have a very specific and selected context on the basis of the structure characteristics. Similarity-based methods rely on the principle that words that share a common context are usually closely related in meaning. Therefore, the appropriate senses can be selected by choosing those meanings found within the smallest semantic distance. To this end, semantic similarity measures have been proposed for instance in [28, 29, 30, 31, 32] and experimentally compared in [33] for different tasks. Among those, as we will see in Section 5.1, STRIDER exploits a thesaurus-based similarity directly derived from one of the best known and most effective, the Leacock-Chodorow measure [29].

Recent research works have proposed term similarities based on term co-occurrence information; in this case the assumption is that the closeness of the words in text is indicative of some kind of relationship between them [34]. While these methods require very large textual corpora in order to show an acceptable effectiveness in disambiguation applications, they are becoming more and more studied and widespread since they can easily exploit the enormous amount of text information available from the web [35]. For instance, in [34] the authors test several "standard" term co-occurrence computation formulas for WSD, where the frequency statistics are derived from large corpora extracted with web mining techniques. In particular, the Pointwise Mutual Information (PMI) formula appears to be one of the most effective. This result is confirmed by the studies in [36]. In a recent study [37], the authors propose more complex term-similarity techniques based on the computation of the cosine similarity between vectors in a vector space model; they also demonstrate that similarities based on term co-occurrence information can perform as well as knowledge-based methods. Further, an original approach, proposed in [38], defines a new term similarity measure, named "Google Distance", based only on the page count statistics gathered by the Google search engine, and the tests the authors perform show good performance figures w.r.t. standard WordNet approaches. Some approaches even try to exploit information extracted from Wikipedia [39]. Since these newest approaches are significantly more complex than "standard" approaches, even if their effectiveness is not always significantly better, STRIDER's corpus-based similarity exploits the well known and effective PMI approach (see Section 5.1).

Differently from knowledge-driven proposals, corpus-based approaches solve word ambiguity by exploiting training texts. In particular, (semi)supervised methods [40] make use of annotated corpora to train from, or as seed data in a bootstrapping process, whereas unsupervised methods [41] work directly from raw unannotated corpora by clustering word occurrences and then classifying new occurrences into the induced clusters/senses.

Comparing and evaluating different WSD systems is extremely difficult, because of the different test sets, sense inventories, and knowledge sources adopted [15]. Senseval (now renamed Semeval) [42, 43] is an international word sense disambiguation competition whose objective is to compare WSD systems in vitro, that is as if they were stand-alone, independent applications, and in vivo, i.e. as modules embedded in applications. The in vitro evaluation usually relies on measures borrowed from the information retrieval field (coverage, precision, recall, F1, etc.). While the performance of the supervised methods usually exceeds that of the other approaches, they are applicable only to those words for which the system has been trained; further, relying on the availability of large training corpora for different domains, languages, and tasks is not a realistic assumption (see [44, 45] for an estimation of the amount of required sense-tagged text and human effort to construct such a corpus). On the other hand, knowledge-based methods seem to be most promising in the short-medium term, mainly because they have the advantage of a larger coverage. Indeed, they use knowledge resources which are increasingly enriched, and it has been shown that the more knowledge is available, the better their performance [26, 46]. Moreover, applications in the Semantic Web need knowledge-rich methods which can exploit the potential of domain ontologies and enable semantic interoperability between users, enterprises, and systems [15].

2.2. Structural disambiguation approaches and applications

Structural disambiguation is acknowledged as a very real and frequent problem for many semantic-aware applications. However, with a few exceptions which will be discussed later, up to now it has only been partially considered in very specific applications, such as schema matching [47, 8], automated construction of domain specific ontologies [48], XML/ontology clustering [9, 3, 49], or the semantic annotation of web pages [10]. In all these cases, disambiguation is usually not the main issue discussed by the authors, since their main focus is rather on how to use semantic annotations than on how to produce them in a general context. Therefore, the proposed disambiguation solutions are only partial solutions, designed to be effective in the very specific considered scenarios; further, experimental evaluation is limited to the benefits of disambiguation in the specific scenarios, while no in-depth evaluation of the effectiveness of the disambiguation techniques is performed. On the other hand, our main aim has been specifically to develop and extensively test a standalone and robust method for automatically producing such annotations for different kinds of structures, thus generalizing other partial structural disambiguation solutions and satisfying the needs of a very large variety of current and future applications. In the following we will first give an overview of some of the application-specific structural disambiguation approaches, then we will focus on the few general-purpose actual structural disambiguation methods that have been presented.


In many schema matching applications, the closeness between nodes relies on semantic information; for instance, [8] employs context-sensitive strategies and community information, such as user ratings and comments, for ontology matching (RDF-S and OWL only), while a good number of statistical WSD approaches have also been proposed (e.g. [47]). However, they have limited applicability (e.g. [47] only applies to relational data) and rely on data (training data, etc.) which may not always be available. In a different context, [48] presents a system aimed at the construction of domain specific ontologies; in this specific application, the disambiguation algorithm is tailored to the problem of finding the correct sense for ontology candidate terms, so as to append subtrees of concepts under the appropriate ontology node, and is based on a semantic interconnection measure which is designed ad hoc to work with WordNet. As to data clustering scenarios, in [3] the authors propose a technique for XML data clustering, where disambiguation is performed on the documents' tag names. The local context of a tag is captured as a bag of words containing the tag name itself, the textual content of the element and the text of the subordinate elements, and is then enlarged by including related terms retrieved with WordNet. This context is then compared to those associated with the different WordNet senses of the term to be disambiguated by means of standard vector model techniques. The technique appears convincing, but is only applicable to XML documents; further, the bag-of-words approach only provides a very rough similarity estimation, which would not be applicable to other finer-grained applications, such as schema matching and the semantic web. In a semantic web setting, [9] discusses a method to cluster terms (meanings); the exploited disambiguation technique is derived from a previous approach for disambiguating keywords in user queries [50]. The context of each meaning is extracted from OWL ontologies and is then compared to other contexts by means of the NGD Google co-occurrence similarity [38]. In this case, the meaning extraction appears applicable only to OWL graphs and no other formats, such as relational or tree schemas, are discussed or supported.

Closely related to structural disambiguation, a large number of semantic annotation approaches, providing annotation w.r.t. a reference ontology describing the domain of interest, are available in the literature [10]. Differently from proper general-purpose structural disambiguation approaches, such techniques annotate whole documents and do not allow finer annotation granularities, as required by structural disambiguation tasks. Further, while many frameworks and tools providing annotation facilities are currently available, such as Mangrove [51], the process of annotation is in many cases manual, since most tools limit themselves to assisting the user in selecting and inserting annotations. On the other hand, while fully-automated tools for annotating web documents are still missing, some semi-automatic systems exist which assist the user by suggesting both the subject of annotation and its annotation from a certain knowledge source. For instance, [52] and [53] are based on statistical and supervised learning approaches, while Armadillo [54] and KnowItAll [55] have automatic annotation features which are based on unsupervised techniques. The former systems require a large amount of work and frequent user intervention in the training phase, while the latter do not need explicit user supervision but suffer from limited accuracy. All annotation systems support HTML documents, while providing support for the semantic annotation of other formats (such as relational schemas, XML schemas, ontologies) is a much less studied problem [10].

Figure 2: STRIDER disambiguation system's architecture. [Diagram: the input model (relational schema, tree, or graph) is first unified into a common format (OIM-RDF) in the pre-processing phase; context extraction then produces the polysemous terms with their graph and sense contexts; disambiguation, drawing on a thesaurus (WordNet: sense inventory, sense frequency, hypernymy relation) and a corpus (WWW: term frequency), outputs the senses' confidences and ranking, i.e. the annotated terms list; a sense feedback loop refines the results.]

Finally, as we said, very few structural disambiguation approaches have been presented as independent applications and evaluated in vitro in the literature. Even if quite promising, these methods also often lack general-purpose applicability and flexibility in some ways. [12] presents a node disambiguation technique exploiting the hierarchical structure of a schema tree together with WordNet hierarchies. However, in order for this approach to be fully effective, the schema relations have to coincide, at least partially, with those of WordNet. In [17, 18] we presented a first release of the STRIDER system. The underlying approach laid the first bases for extracting and exploiting structural context information in the disambiguation process, including relational information between the nodes. However, the approach is still very WordNet dependent, since it only considers WordNet-related term-similarity computation methods, and is optimized and tested on tree-shaped XML Schemas only, thus not providing a uniform way to handle different structural formats. The approach presented in [4] allows the disambiguation of ontologies by exploiting an ad-hoc context extraction which is not generalised and is specifically tailored to RDF/OWL data (XML, relational and other formats are not supported), while the way the context similarities are computed is very dependent on WordNet relations. On the other hand, STRIDER's pre-processing and context extraction phases (see Section 4) and disambiguation algorithms (see Section 5) are completely orthogonal to the specific employed knowledge sources and type of structures considered (relational, tree models, graph models), thus effectively working in a much wider array of contexts.

3. Our Structural Disambiguation Approach: An Overview

In this section we present the functional architecture of the STRIDER disambiguation system and outline the information flow involved in the disambiguation process (see Figure 2). STRIDER can be used for disambiguating diverse data structures, i.e. structures used to describe data instances, called models [56]. The linguistic and semantic phenomena used to resolve word sense ambiguity are provided by two knowledge sources: a thesaurus, which is also used as sense inventory, and a corpus. The techniques STRIDER is founded on are generally applicable in conjunction with any thesaurus that defines word senses and a hypernymy relation among senses, and with any corpus where the relative term frequencies approximate their actual use in society. Practically, in its implementation STRIDER uses WordNet as thesaurus because, given its widespread diffusion within the research community, it can be considered a de facto standard for English WSD. Further, since its senses are ordered w.r.t. their frequency of use, WordNet also allows us to easily integrate the most frequent sense heuristic for word meaning prediction. Moreover, the used corpus is the WWW because it is the largest corpus on earth and the information entered by millions of independent users averages out to provide automatic semantics of useful quality [38].

The process follows three main steps:

• First, STRIDER translates the input model into a labeled, directed multigraph complying with the OIM-RDF metamodel [16] (pre-processing phase, Section 4.1);

• then, in the context extraction phase (Section 4.2), the list of polysemous terms to be disambiguated is extracted from the nodes of the graph. Each term is assigned context information. This includes both information contained within the graph (graph context) and information provided by the thesaurus (sense context);

• finally, disambiguation (Section 5) is performed through a similarity-based approach where similarities are computed by exploiting either the hypernymy relation provided by the thesaurus or word frequencies in the considered corpus.

The outcome of the overall process is a ranking of the plausible senses for each term contained in the model. Moreover, in order to maximize the effectiveness of the disambiguation process, several feedback techniques, both manual and automated, are available to refine the initial results.

In the following sections we will elaborate further on our approach, also by means of a reference example. To this end, we will focus on the PurchaseOrder relational schema already shown in the introduction (Figure 1-a) and choose it as our reference example.

4. Pre-processing and Context Extraction

This section describes how the models are prepared for the proper disambiguation process and how context is associated to each term to be disambiguated (pre-processing and context extraction in Figure 2).


4.1. Pre-processing

Our aim is to support the disambiguation of a wide range of models:

• relational schemas: tables and views;

• tree-structured models: trees that can be employed to describe, for instance, the structure of XML documents, web directories and taxonomies;

• graph-structured models: graphs describing data structures in, for instance, RDF-S files or OWL ontologies.

The pre-processing component is used to unify the input formats to a common one: It allows us not to care about the specific input format and to achieve homogeneity in the disambiguation phase. More precisely, STRIDER adopts the RDF encoding of the Open Information Model (OIM) [16] specification. The main aim of OIM is to provide sharing and reuse of metadata. In OIM-RDF [56], models are represented as directed labeled graphs complying with the RDF vocabulary description language and each supported model type is associated with a specific OIM meta-model. Some nodes of such graphs denote model elements, such as relations and attributes in relational schemas, categories in web directories, etc. Each element is uniquely identified by an object identifier (OID).

Definition 1 (OIM-RDF graph). An OIM-RDF graph is a set of RDF triples 〈s, p, o〉 where s is the source node, p is the edge label, and o is the target node. The node identifiers and edge labels are drawn from the set of OIDs, which can be implemented as integers, pointers, URIs, etc. The type of o is the union of OIDs and literals, which include strings, integers, floats, and other data types.

For instance, Figure 3 illustrates a small portion of the OIM-RDF graph encoding the relational schema of our reference example. Node &1 represents the Product table and is the source node of four triples: 〈&1, type, Table〉, 〈&1, name, Product〉, 〈&1, column, &2〉, 〈&1, column, &4〉. The involved edge labels come from the OIM relational meta-model [56].
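To make the triple representation of Definition 1 concrete, the following minimal sketch (ours, not part of the original system; OIDs are modeled as plain strings) builds the triples above and retrieves the columns of the Product table:

```python
# A toy OIM-RDF graph as a set of <s, p, o> triples (Definition 1).
# OIDs such as "&1" are modeled here as plain strings; targets may be
# OIDs or literals.
graph = {
    ("&1", "type", "Table"),
    ("&1", "name", "Product"),
    ("&1", "column", "&2"),
    ("&1", "column", "&4"),
    ("&2", "type", "Column"),
    ("&2", "name", "Line"),
    ("&4", "type", "Column"),
    ("&4", "name", "ProductId"),
}

def targets(triples, source, label):
    """All targets o of triples <source, label, o>."""
    return sorted(o for s, p, o in triples if s == source and p == label)

print(targets(graph, "&1", "column"))  # ['&2', '&4']
```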

For the format unification we enhanced the platform for generic model management presented in [56]: First, it converts the input model into an OIM-RDF graph, then it performs a mapping between the native and the converted models. Note that the followed approach makes the addition of new model types easy, as it is only a matter of defining conversion rules from the new format to the common one.

The disambiguation process will focus on all the nodes' strings as they capture all the informative content of the model we are interested in. Going back to our reference example (Figure 3), for the table Product we will disambiguate the strings Product, Line and ProductId, which are the names of nodes &1, &2 and &4, respectively. Similarly, we consider OrderDetail and ProductId from table OrderDetail.


Figure 3: A portion of the OIM-RDF graph of our reference example. [Graph: node &1 (type Table, name Product) has columns &2 (name Line, SqlType varchar with param 100) and &4 (name ProductId, SqlType int); node &13 (type Table, name OrderDetail) has column &14 (name ProductId, SqlType int); node &15 (type ReferentialConstraint) links the two ProductId columns through uniqueKeyRole and foreignKeyRole arcs.]

4.2. Context Extraction

Once the input model is converted into the corresponding OIM-RDF graph, the terms & senses extraction component depicted in Figure 4 extracts the polysemous terms from each node N's string and associates each of these terms, denoted by the tuple (t, N)^7, with a list of plausible senses Senses(t, N) = [s1, s2, . . . , sk]. In principle, such list is the complete list of senses provided by the dictionary, but it can also be a shrunk version suggested either by human or machine experts (see e.g. Sec. 5.5 on feedback). Notice that more than one term can be extracted from one string. See, for instance, the ProductId node in Figure 1-a which contains the terms product and id. Moreover, insignificant and frequent terms, including articles and conjunctions, can be a source of noise in disambiguating the other terms. Such words, usually referred to as stopwords, are filtered out.

Before going on to define our disambiguation task in detail, we would like to make a short note about the kinds of labels which are dealt with in STRIDER. As we have seen, the employed term identification approach works under the assumption that the nodes are described by meaningful labels. This is what usually happens for data which are published and exchanged on Web spaces, in particular the ones expressed in one of the W3C standards for data representation, XML, RDF and OWL, whose goal is to produce "human-legible and reasonably clear" documents. This is the kind of data structures which are used in almost all the applications STRIDER is aimed at and which are considered in this paper. However, it may be the case that some labels contain abbreviations or acronyms, making them syntactically different and difficult to understand. In this case, it is sufficient to apply a normalization technique to the labels prior to term identification. For instance, a first possibility is the one proposed in the schema matching tools [57, 58], where abbreviations and acronyms are expanded (e.g. a label PO would become PurchaseOrder) based on a thesaurus lookup. The thesaurus can include terms used in common language as well as domain-specific references. Another possible method is the one proposed in [59], which expands abbreviations and acronyms in a semi-automatic way by exploiting both external sources (such as user-defined / online abbreviation dictionaries, standard domain abbreviations, and so on) and internal ones (such as context information from the given schema, or complementary schemas about similar subjects).

^7 Notice that the same term t could be present more than once and that the disambiguation is strictly dependent on the node each instance belongs to. For this reason, each term is represented by the pair (t, N).

Figure 4: Context Extraction Component. [Diagram: from the OIM-RDF graph, the terms & senses extraction step produces the polysemous terms; graph context extraction derives the graph context; sense context extraction, drawing on the thesaurus, derives the sense context.]

Definition 2 (Structural disambiguation task). Let us consider the set of terms extracted from an OIM-RDF graph. For each term (t, N), the goal of the structural disambiguation task is to define a confidence function φ which associates each sense si ∈ Senses(t, N) with a score φ(i) ∈ [0, 1] expressing the confidence of choosing it as the right one for term (t, N).

It is worth noting that the confidence function φ induces a ranking of the senses in Senses(t, N). In this way STRIDER supports two types of disambiguation services: the assisted and the completely automatic one. In the former case, the disambiguation task is committed to a human expert and the disambiguation service assists him/her by providing useful suggestions. In the latter case, there is no human intervention and the selected sense is the top one, i.e. the one with the highest confidence value.

4.2.1. Graph Context Extraction

Identifying the correct sense of each node label can be done by analysing its context of use, given by the labels of the surrounding nodes. Graph context extraction (see Figure 4) is the phase devoted to the contextualization of each polysemous term (t, N) through the extraction of its context Gcont(t, N) from the OIM-RDF graph. In its broad form, Gcont(t, N) contains the terms of the nodes connected to N. However, depending on the graph characteristics, finer choices could be preferable. For instance, while disambiguating term line in table Product of our running example, terms that are "near" in the graph (such as product and cost) can be useful to understand its correct sense as a "kind of product or merchandise". Instead, distant terms such as customer could lead to wrong interpretations of term line, such as "a formation of people one beside another".

For the above reasons, we introduce the notion of crossing setting. Broadly speaking, a crossing setting associates each term with an abstraction of (a portion of) the OIM-RDF graph which represents its graph context. Such an abstraction is (a subset of) the OIM-RDF nodes connected through labeled arcs, which can be either native OIM-RDF arcs or arcs derived through the evaluation of SPARQL queries on the OIM-RDF graph. For instance, referring to the relational case, the following SPARQL query, which is depicted on the left side of Fig. 5, selects each column name (identified by the variable ?column) and the name of the table it belongs to (identified by the variable ?table):

select ?table ?column

where { ?tableURI sql:name ?table.

?tableURI rdf:type sql:Table.

?tableURI sql:column ?columnURI.

?columnURI rdf:type sql:Column.

?columnURI sql:name ?column }

The result is a set of label pairs such as (line, Product). The relationship between columns and tables is then made explicit by an arc labeled SR (Structural Relationship) which is added between each of these pairs.
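As an illustration, such arc-type queries can be evaluated with any off-the-shelf SPARQL engine. The sketch below uses the rdflib Python library; the file name and the namespace URI bound to the sql: prefix are our assumptions, as the paper does not fix them:

```python
from rdflib import Graph

g = Graph()
g.parse("purchase_order_oim.rdf")  # hypothetical OIM-RDF serialization

SR_QUERY = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX sql: <http://example.org/oim/sql#>
SELECT ?table ?column
WHERE { ?tableURI  sql:name   ?table .
        ?tableURI  rdf:type   sql:Table .
        ?tableURI  sql:column ?columnURI .
        ?columnURI rdf:type   sql:Column .
        ?columnURI sql:name   ?column }
"""

# Each solution yields a label pair such as (line, Product); an SR arc
# is then added between the corresponding nodes of the view.
sr_pairs = [(str(row.column), str(row.table)) for row in g.query(SR_QUERY)]
```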

A crossing setting σ is made up of a reachability threshold τ and a set of arc types, where each arc type has a name, is defined by a SPARQL query on the OIM-RDF graph, and is associated with two crossing costs in (0, 1], one for each crossing direction. Costs will be used to compute the distance between two nodes in the graph: in principle, the lower the cost of an arc crossing direction, the closer the two nodes connected by such an arc are. For each supported model type, we defined the default crossing setting shown in Appendix A (where the value of τ has been experimentally set to 3). The arc types of the relational default crossing setting are depicted in Figure 5. Beside SR, the Foreign Key arc type (FK) is an example of a referential constraint which can be made explicit. Note that the FK costs are appreciably lower than those of SR because FK represents a very strong integrity constraint between pairs of attributes. Generally speaking, the lowest crossing costs should be assigned to the strongest relationships whereas the highest values should be reserved for the weakest ones. Indeed, what matters in setting the crossing costs are not the absolute cost values but rather the arc type order which is induced by those values. The position of each arc type in this order and its costs should reflect its strength w.r.t. the other arc types.

Users are allowed to freely modify both the value of τ and the set of arc types of the default crossing settings, or to introduce their own crossing settings.


Figure 5: Arc types for relational schemas. [Diagram: the SR arc type connects ?table to ?column through the table-column pattern of the SPARQL query above; the FK arc type connects ?referencing to ?referenced through the uniqueKeyRole and foreignKeyRole arcs of a referential constraint. Crossing costs:

Arc Type | Direct Direction | Opposite Direction
SR       | 1                | 0.8
FK       | 0.1              | 0.1 ]

By applying the arc type definitions of a given crossing setting σ, we derive a view of the OIM-RDF graph on which the graph context Gcont(t, N) of each term (t, N) is computed. In particular, Gcont(t, N) contains all and only those nodes Ng in the OIM-RDF view which are reachable from N. A node Ng is said to be reachable if and only if there exists a path from N to Ng whose total cost, obtained by summing the single costs associated with each crossing setting's relationship w.r.t. the crossing direction, is smaller than or equal to the reachability threshold τ. Since we are dealing with graphs, different paths may be available that connect a pair of nodes; among all, we choose the one having the lowest path cost LPC(N, Ng).

Each reachable node Ng is associated with a weight weight(Ng) which essentially represents the weight that the terms of Ng should have in the disambiguation of the terms of N. The way node weights influence the proper disambiguation process will be shown in Sec. 5. In principle, the closer Ng is to N and the lower the costs of the arcs connecting them, the more its terms influence the disambiguation of (t, N). For this reason, we compute weight(Ng) by applying a gaussian distance decay function on LPC(N, Ng):

weight(Ng) = (2 / √(2π)) · e^(−LPC(N, Ng)² / 8) + 1 − 2 / √(2π)    (1)

Definition 3 (Graph context). The graph context Gcont(t, N) of a term (t, N) is a set of tuples ((tg, Ng), Senses(tg, Ng), weight(Ng)), one for each term (tg, Ng) of each reachable node Ng.

Example 4.1. Figure 6 shows the graph obtained by applying our default crossing setting to the OIM-RDF graph depicted in Figure 3. Each node is assigned a unique identifier which is used in the table to show the total costs and the corresponding weights. The graph context of term line, Gcont(line, 5), is made up of the terms (product,1), (producer,2), (name,3), (product,4), (id,4), (unit cost,6), (product,9), (id,9), (order,7) and (detail,7). Note that all the terms from the same node have the same weight and that terms can be composed of more than one word if the corresponding lemma is contained in the dictionary (i.e. unit cost is a lemma composed of two words). Finally, the Gcont(line, 5) tuple about (product,1) using WordNet [11] as inventory of senses is: ((product,1), [s1 = commodities offered for sale ("good business depends on having good merchandise"; "that store offers a variety of products"), ..., s5 = a quantity obtained by multiplication ("the product of 2 and 3 is 6"), ...], 0.939). □

Figure 6: Node selection and weight calculation for term Line. [Graph: node 5 (Line) is connected by an SR arc to node 1 (Product), which is in turn connected by SR arcs to nodes 2 (Producer), 3 (Name), 4 (ProductId) and 6 (UnitCost); an FK arc links node 4 to node 9 (ProductId), which belongs to node 7 (OrderDetail). Total costs and weights for the nodes reachable from node Line:

Node | LPC (total cost)       | Weight
1    | 0.8                    | 0.939
2    | 0.8+1.0 = 1.8          | 0.734
3    | 0.8+1.0 = 1.8          | 0.734
4    | 0.8+1.0 = 1.8          | 0.734
6    | 0.8+1.0 = 1.8          | 0.734
9    | 0.8+1.0+0.1 = 1.9      | 0.710
7    | 0.8+1.0+0.1+0.8 = 2.7  | 0.523 ]

Table 1: Sense explanation, examples and Scont(sk) of term (line,5)

sk   | Explanation                                         | Examples                                                                                                                              | Scont(sk)
s1   | a formation of people or things one beside another  | "the line of soldiers advanced with their bayonets fixed"; "they were arrayed in line of battle"; "the cast stood in line for the curtain call" | battle, bayonet, cast, curtain, formation, ...
...  | ...                                                 | ...                                                                                                                                   | ...
s22  | a particular kind of product or merchandise         | "a nice line of shoes"                                                                                                                | kind, line, merchandise, product, shoe, ...
...  | ...                                                 | ...                                                                                                                                   | ...

4.2.2. Sense Context Extraction

As most of the semantics is carried by noun words [15], the sense context extraction phase enriches the context of each term (t, N) to be disambiguated with the nouns used to explain its plausible senses, similarly to [19]. The sense context is particularly useful when the graph context provides too little information, for instance because the OIM-RDF graph is too small.

Definition 4 (Sense context). The sense context Scont(s) of each sense s ∈ Senses(t, N) of a given term (t, N) is a set of nouns extracted through a part-of-speech tagger from the definitions, the examples and any other explanation of the sense provided by the thesaurus.
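A possible rendering of Definition 4 with NLTK and WordNet (a sketch, not the system's actual code; the extracted noun set depends on the tokenizer, tagger and WordNet version in use):

```python
import nltk
from nltk.corpus import wordnet as wn

def sense_context(synset):
    """Scont(s): nouns occurring in the gloss and usage examples of a
    sense, extracted with a part-of-speech tagger (Definition 4)."""
    text = synset.definition() + " " + " ".join(synset.examples())
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return sorted({w.lower() for w, tag in tagged if tag.startswith("NN")})

# e.g. the "line of products" sense of line (sense numbering differs
# across WordNet versions, so we look it up by its gloss)
for s in wn.synsets("line", pos=wn.NOUN):
    if "merchandise" in s.definition():
        print(sense_context(s))  # kind, line, merchandise, product, shoe, ...
```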


Figure 7: Overview of the different contributions' computations for the disambiguation of a given term. [Diagram: the graph context and the sense contexts feed a term similarity computation, from which the graph context and sense context contributions are derived; together with the sense frequency contribution they are weighted by α, β and γ, respectively, and summed into the senses' confidences and ranking.]

Example 4.2. For each term sense s, WordNet provides a gloss that contains the sense explanation and a list of examples using that term with that meaning. Table 1 shows some of the 29 senses of the term line with sense explanations, examples and Scont(s) (the extracted nouns are in alphabetical order), respectively. □

5. Disambiguation Algorithms

STRIDER disambiguation algorithms follow a similarity-based approach: The key to their working is to semantically correlate each term to be disambiguated with the terms of its contexts, graph context and sense contexts, through the use of knowledge sources. As we said in the previous sections, the outcome of the overall process is a ranking of the plausible senses of each term (see Figure 7). In particular, the confidence function φ is composed of three different contributions: those given by the graph context, the sense context and the sense frequency. The first two are context-dependent and are computed with two confidence functions, named φG and φS, respectively; the latter uses the frequency of senses and is computed with φF. Each confidence function gives rise to a confidence vector bearing the same name; then, the final confidence vector φ is obtained as their linear combination: φ = α · φG + β · φS + γ · φF, with α + β + γ = 1, where α, β and γ are parameters that can be freely adjusted in order to conveniently weigh the contributions.
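In code, the final combination step is straightforward; the sketch below (ours) uses the default weights α = β = 0.4 and γ = 0.2 adopted in Example 5.1:

```python
def combine(phi_G, phi_S, phi_F, alpha=0.4, beta=0.4, gamma=0.2):
    """Final confidence vector (alpha + beta + gamma = 1) and the
    induced ranking of senses, best first (1-based sense indexes)."""
    phi = [alpha * g + beta * s + gamma * f
           for g, s, f in zip(phi_G, phi_S, phi_F)]
    ranking = sorted(range(1, len(phi) + 1), key=lambda k: -phi[k - 1])
    return phi, ranking
```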

Example 5.1. Starting from the current example, we will explain how STRIDER disambiguates terms by following the disambiguation of term line of our reference schema. As we anticipated in the introductory example, there are 29 different senses including (s1) "a formation of people or things one beside another", (s5) "text consisting of a row of words", (s9) "cable, transmission line" and (s22) "line of products, line of merchandise". We remind the reader that (s22) is the right one. In particular, in this example, we will just give a glimpse of how STRIDER chooses the sense for the given term by exploiting the relevant computed confidence values; then, further examples will clarify how the confidence values leading to this result are actually computed. By performing automatic disambiguation on the schema, our approach is able to correctly disambiguate this term, together with the large majority of the other schema terms. In particular, using the default setting of α = β = 0.4 and γ = 0.2, the highest value of the output confidence vector is φ(22) = 0.61, much higher than, for instance, φ(1) = 0.24, meaning that STRIDER is confident in s22 being the right sense. □

5.1. Computing term similarity

The confidence in choosing one of the senses associated with a given term is dependent on the similarity between that term and each term in the context. To this end, this section focuses on how the term similarity component works, whose output is at the basis of the computation of the graph and sense context contributions (see Figure 7).

In order to quantify the similarity between two terms tx and ty, we decided not to restrict our vision to a specific external source but to investigate the two alternative approaches we discussed in Sec. 2: thesauri and large textual corpora.

Specifically, in the thesaurus-based similarity we make use of the hypernymy-hyponymy hierarchy of the reference thesaurus through one of the most promising measures available in this field, the Leacock-Chodorow one [29], which we revised as follows:

sim(tx, ty) = −ln( len(tx, ty) / (2 · H) )  if ∃ a common ancestor;  0 otherwise    (2)

where len(tx, ty) is the minimum among the number of links connecting each sense in Senses(tx) and each sense in Senses(ty), and H is the height of the hypernymy hierarchy (e.g. 16 in WordNet). The set of lowest common ancestors of tx and ty in the hypernym hierarchies will be denoted as mch(tx, ty) (minimum common hypernyms).
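For illustration, Eq. 2 can be approximated on top of NLTK's WordNet interface; this is a sketch under our own assumptions (shortest_path_distance stands in for the link count len, and a distance of 0 between identical senses is clamped to 1 to keep the logarithm finite):

```python
from math import log
from nltk.corpus import wordnet as wn

H = 16  # height of the WordNet hypernymy hierarchy, as in Eq. 2

def thesaurus_sim(tx, ty):
    """Revised Leacock-Chodorow similarity of Eq. 2: len(tx, ty) is the
    minimum link count over all noun-sense pairs of the two terms."""
    dists = [sx.shortest_path_distance(sy)
             for sx in wn.synsets(tx, pos=wn.NOUN)
             for sy in wn.synsets(ty, pos=wn.NOUN)]
    dists = [d for d in dists if d is not None]
    if not dists:
        return 0.0  # no common ancestor
    return -log(max(min(dists), 1) / (2.0 * H))

print(thesaurus_sim("line", "product"))  # ≈ 3.47 with len = 1 (cf. Example 5.2)
```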

Alternatively, when a large document repository or simply a common web search engine is available, information on term co-occurrence can be successfully exploited to compute term similarity (or distance) by means of several functions, such as Jaccard (recently used in [39]), PMI [36] or NGD [38]. In STRIDER we propose a corpus-based similarity based on an exponential version of PMI:

sim(tx, ty) = e^(PMI(tx, ty)) = M · f(tx, ty) / (f(tx) · f(ty))    (3)

(3)

where M is a normalizing constant, f(t) is the frequency of term t and f(tx, ty) is the frequency of co-occurrence of terms tx and ty together. We performed several ad-hoc tests comparing the results obtained through such formula to those produced by applying different ones, such as Jaccard or NGD, and finally found our exponential PMI to be the most effective for our purposes, especially when, as in our case, f(t) is the aggregate WWW page-count estimate returned by search engines such as Yahoo or Google.

Example 5.2. Let us consider the two terms: line (the one to be disambiguated) and one of the terms of its graph context which will contribute in disambiguating it, product. The minimum path length is 1, since the senses of such nouns that join most rapidly are "line (line of products)" and "product (merchandise)" and the minimum common hypernym is "product (merchandise)" itself. Thus, the thesaurus-based similarity is 3.47. On the other hand, by exploiting Google's term frequencies, we have f(line) = 1.08 · 10^9, f(product) = 1.4 · 10^9, f(line, product) = 0.23 · 10^9. Thus, setting M to the total number of Google's indexed pages, the corpus-based similarity is 2.31. In the next section, we will see how such similarity values will be useful for disambiguating our term. □
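A sketch of Eq. 3 on the figures of Example 5.2; the concrete value of M is our assumption (roughly the number of pages indexed by Google at the time), chosen so that the reported similarity of 2.31 is reproduced:

```python
def epmi_similarity(f_xy, f_x, f_y, m):
    """Exponential PMI of Eq. 3: e^PMI = M * f(x, y) / (f(x) * f(y))."""
    return m * f_xy / (f_x * f_y)

# Page counts from Example 5.2; m = 1.52e10 is an assumed corpus size.
print(epmi_similarity(0.23e9, 1.08e9, 1.4e9, m=1.52e10))  # ≈ 2.31
```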

5.2. Graph context contribution

The graph context contribution for a term (t, N) is computed by means of Algorithm 5.1, which relies on the similarities between t itself and the terms in its graph context. Notice that, for the sake of simplicity of presentation, the algorithm takes one term at a time. This does not correspond to its actual implementation which, for efficiency reasons, works on entire models and performs computations on commutative functions only once.

Algorithm 5.1 GraphContextContr(t, N)

1: for each sk in Senses(t, N) do          // cycle on the senses of t
2:     φG[k] = 0
3:     norm = 0
4:     for each t^g_i in Gcont(t, N) do    // cycle on the graph context terms
5:         φG[k] = φG[k] + GCC[k][i]
6:         norm = norm + GCN[i]
7:     φG[k] = φG[k] / norm
8: return φG

The algorithm takes in input a term (t, N) to be disambiguated and produces a vector φG of confidence values in choosing each of the senses in Senses(t, N). The idea behind the algorithm, which derives from [60], is that when two polysemous terms (t, t_i^g) are similar, their most informative subsumer provides information about which sense of each term is the relevant one. Moreover, the contribution of each term t_i^g in Gcont(t, N) is proportional to its weight weight(N_i^g) (see Eq. 1). As also depicted in Figure 8-a, the algorithm computes the bi-dimensional GCC array (Graph Context Contribution) by means of two nested cycles: one on the senses of t (outer cycle at line 1, dimension "k") and the other on the terms in Gcont(t, N) (inner cycle at line 4, dimension "i").

[Figure 8: Graphical exemplification of the graph context contribution computation algorithm. (a) Disambiguation of a generic term t; (b) disambiguation of the example term line.]

Specifically, the contribution of each term t_i^g to the confidence of each sense s_k is:

\[
GCC[k][i] = sim(t, t^g_i) \cdot numHyp(mch(t, t^g_i), s_k) \cdot weight(N^g_i) \qquad (4)
\]

where sim(t, t_i^g) is the similarity between t and t_i^g (computed as shown in Section 5.1 using either Equation 2 or 3), and numHyp(mch(t, t_i^g), s_k) is the number of the minimum common hypernyms in mch(t, t_i^g) which are ancestors of s_k. Equation 4 shows that the similarity between each pair (t, t_i^g) is only considered as supporting evidence for those senses s_k which are descendants of at least one of the minimum common hypernyms between t and t_i^g. The φG[k] confidence value for each sense s_k is thus the normalized sum of these supports (lines 5-7). Normalization brings all scores into a range between 0 and 1 and is performed on the basis of the total number of minimum common hypernyms between t and t_i^g (GCN stands for Graph Context Normalization):

\[
GCN[i] = sim(t, t^g_i) \cdot |mch(t, t^g_i)| \cdot weight(N^g_i)
\]
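A minimal Java rendering of Algorithm 5.1 follows, assuming the GCC and GCN arrays of Eq. 4 have already been filled (the sim(), numHyp(), mch() and weight() computations are performed elsewhere); the method and parameter names are illustrative, not the actual STRIDER API.

public class GraphContextContribution {

    // gcc[k][i]: contribution of graph context term i to sense k (Eq. 4)
    // gcn[i]: normalization factor of graph context term i
    static double[] graphContextContr(double[][] gcc, double[] gcn) {
        int numSenses = gcc.length;        // |Senses(t,N)|
        int numContextTerms = gcn.length;  // |Gcont(t,N)|
        double[] phiG = new double[numSenses];
        for (int k = 0; k < numSenses; k++) {           // cycle on the senses of t
            double norm = 0.0;
            for (int i = 0; i < numContextTerms; i++) { // cycle on graph context terms
                phiG[k] += gcc[k][i];
                norm += gcn[i];
            }
            phiG[k] = (norm > 0) ? phiG[k] / norm : 0.0; // normalized sum of supports
        }
        return phiG;
    }
}

Since numHyp(mch(t, t_i^g), s_k) can never exceed |mch(t, t_i^g)|, each normalized confidence necessarily falls in [0, 1].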

Example 5.3. Continuing our running example, Figure 8-b graphically depicts some of the entries for the computation of the confidence φG for two senses of the term line, s_1 and s_22, where the latter is the right one. All values rely on the thesaurus-based similarity presented in Section 5.1. The contribution of the graph topology to the computation is evident from the two instances of product, which only differ in the weights of the nodes they belong to: the first occurrence refers to node 1 (see Figure 6), which is closer to line's node than node 4.

Notice that many significant contributions are available for s_22, since it is about merchandise lines and thus it is very close to most graph context terms, such as product. This is also clear from the φG values: the most confident sense (0.58) as to the graph context contribution is s_22, while the other values are generally much lower (0.05 for s_1). □


5.3. Sense context contribution

Besides the contribution of the graph context, we also exploit the sense context by quantifying the similarity between the graph context of each polysemous term and the context of each of its senses. To this end, in this subsection we propose two alternative algorithms for sense context contribution computation. It is worth noting that the algorithms presented here can be seen as a variant and/or a generalization of those employed for the graph context: while they share the same goal, i.e. quantifying the similarity between the senses of the term to be disambiguated and its graph context, in the graph context contribution algorithm such senses are represented by a single node in the thesaurus hierarchy, whereas now all the terms available in their sense contexts contribute to the final result.

The first algorithm shares the same basic ideas of the graph context one, with the main difference that, in this case, each sense s_k of the term t to be disambiguated is represented by the set of terms in Scont(s_k). Thus, it compares the given graph context Gcont(t, N) to the available sense contexts, and the sense context Scont(s_k) containing the terms most similar to those of Gcont(t, N) will be the most probable one. The flow of the algorithm is similar to Algorithm 5.1 and for this reason it will not be presented in its entirety. However, the computation is now performed on three dimensions (cycles) in total: on the senses s_k of t (dimension "k"), on the terms t_i^g in Gcont(t, N) (dimension "i") and now also on the terms t_j^s in each Scont(s_k) (dimension "j"). Supports are dealt with as in lines 5-6 of Algorithm 5.1, by replacing GCC and GCN with the three-dimensional Sense Context Contribution array (SCC) and the bi-dimensional Sense Context Normalization array (SCN), respectively:

\[
SCC[k][i][j] = sim(t^g_i, t^s_j) \cdot numHyp(mch(t, t^g_i), s_k) \cdot weight(N^g_i)
\]
\[
SCN[i][j] = sim(t^g_i, t^s_j) \cdot |mch(t, t^g_i)| \cdot weight(N^g_i)
\]

Notice that, instead of using the term t to be disambiguated, sim(t_i^g, t_j^s) compares the term t_i^g in Gcont(t, N) with the term t_j^s describing sense s_k. As Figure 9-a graphically exemplifies, this is done for all terms in the graph context against all terms describing all senses: in this bi-dimensional representation, each cell SCC[k][i] contains a vector of similarities, one for each t_j^s; the sum of these values represents t_i^g's contribution to s_k.

Example 5.4. Consider Figure 9: the similarities between Gcont(t, N) terms (above the matrix) and the terms from Scont(s_22), such as merchandise and product, give a good clue of this sense being the right one. This is captured by the algorithm and is evident from the high similarity values of the Scont(s_22) row: for instance, the sum of the similarities between product (the first term of Gcont(t, N)) and all terms in Scont(s_22) is 27.13, while it is 0 for those in Scont(s_1), since they involve a very distant military meaning. The φS values confirm this intuition: s_22 has the highest confidence (0.76), while the others are generally much lower (0.04 for s_1). □

[Figure 9: Graphical exemplification of the sense context contribution computation algorithm. (a) Disambiguation of a generic term t; (b) disambiguation of the example term line.]

The algorithm discussed above for sense context contribution computation is based on the same strengths as the graph context one and, from the tests performed, it delivers a high effectiveness. However, there are some cases where the hypernym structures may not be completely adequate or sufficient to describe a concept and thus to disambiguate it. While Section 5.1 shows that we can be completely independent from such structures as to the computation of term similarities, we also felt it important to provide STRIDER with an alternative confidence computation algorithm, Algorithm 5.2, which weighs the different similarity contributions directly on the basis of the Gcont(t, N) weights and independently from the thesaurus hierarchies. We did this for the computation of the sense contribution since, in this case, the senses are not involved in similarity computations as members of linguistic hierarchies but as sets of associated descriptive terms. Given a sense s_k (line 1) and a term t_i^g in Gcont(t, N) (line 4), the similarities sim(t_i^g, t_j^s) between t_i^g and every term t_j^s in Scont(s_k) are computed (lines 6-7), summarized in a single value by means of a parametric function f() and then added to the φS[k] confidence of the sense with weight weight(N_i^g) (line 8). As to f(), various alternatives can be employed, such as max() or mean(); among those, we found mean() to be particularly effective. Finally, normalization is performed in two steps: for each sense s_k, the φS[k] values are divided by the sum of all weights in order to obtain a weighted mean of the term similarities (line 10), then the values in φS are brought into the range [0,1] (line 11) in order to be compatible with the other contributions.

Finally, notice that both algorithms can be combined with the two alternative ways of computing the term similarity sim(t_x, t_y) shown in Section 5.1. This results in four variants which use the sense context: two "pure" solutions, one using the thesaurus both in the similarity computation and in the algorithm and another one only using the corpus-based similarity, and two "hybrid" solutions, which use the thesaurus in the similarity computation but not in the algorithm, and vice versa. All variants will be evaluated in Section 6.


Algorithm 5.2 SenseContextContr(t, N) (independent from hierarchies)

1: for k = 1 to |Senses(t, N)| do // cycle on the senses s_k of term t
2:   φS[k] = 0
3:   sumWeights = 0
4:   for each t_i^g in Gcont(t, N) do // cycle on graph context terms
5:     vsim_i = [0, . . . , 0]
6:     for each t_j^s in Scont(s_k) do // cycle on sense context terms
7:       vsim_i[j] = sim(t_i^g, t_j^s)
8:     φS[k] = φS[k] + f(vsim_i) * weight(N_i^g)
9:     sumWeights = sumWeights + weight(N_i^g)
10:   φS[k] = φS[k] / sumWeights
11: φS = φS / max(φS)
12: return φS

5.4. Sense frequency contribution

The last contribution we present is the one based on the frequencies of the senses in the English language. The underlying idea is to attempt to emulate the common sense of a human in choosing the right meaning of a term when the context gives little help. Indeed, among all possible meanings that a word may have, it is generally true that one meaning occurs more often than the other meanings, and it is worth noting that word meanings exhibit a Zipfian distribution.

Differently from the other contributions, this one is independent from the context and only relies on the knowledge about the a priori distribution provided by the thesaurus. In particular, WordNet incorporates the information extracted from the SemCor corpus, consisting of more than 200,000 sense-tagged terms [61].

In STRIDER we propose two ways of computing the sense frequency confidence contribution φF. The first method starts from the consideration that WordNet already orders the senses of each term t on the basis of the frequency of use (i.e. the first is the most common sense, and so on). In this case, similarly to our proposal in [17], we can easily compute a value φF[k] for each sense s_k by means of a linear decay function which, starting from φF[1] = 1, decrements the confidence in choosing each successive sense s_k proportionally to its position pos(s_k):

\[
\varphi_F[k] = linearDecay(s_k) =
\begin{cases}
1 - \rho \cdot \dfrac{pos(s_k) - 1}{|Senses(t)| - 1} & \text{if } |Senses(t)| > 1 \\
1 & \text{otherwise}
\end{cases}
\qquad (5)
\]

where 0 < ρ < 1 is a parameter we usually set at 0.8 (according to previous experimentation) and |Senses(t)| is the cardinality of Senses(t). In this way, we can provide a rough quantification of the frequency of the senses, where the first sense has full confidence and the last one still retains a non-null confidence (in our case, 1/5), thus allowing us to exploit the benefits of sense frequency for all the senses.

Extended versions of the WordNet thesaurus, such as the one which can be queried online at the official site, offer more detailed information about sense frequencies in the form of frequency counts fc(s_k), which are integers associated to each sense s_k. The larger fc(s_k), the more probable s_k will be for a given term. The second frequency function uses fc and is generally more precise and effective, since the decay function is not linear but is shaped differently for each term:

\[
\varphi_F[k] = freqCountDecay(s_k) =
\begin{cases}
1 - \rho \cdot \dfrac{fc(s_1) - fc(s_k)}{fc(s_1)} & \text{if } fc(s_1) > 0 \\
1 & \text{otherwise}
\end{cases}
\qquad (6)
\]

Example 5.5. The frequencies associated to the first 4 of the 29 senses of line are 51, 20, 15 and 15, respectively, meaning that the second sense is less than half as frequent as the first, that the third and fourth are indeed equally probable, and so on. By applying Equation 6 to this small example, we obtain φF = {1, 0.51, 0.43, 0.43, . . .}. □
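The two decay functions are straightforward to implement; the following sketch reproduces the values of Example 5.5 (up to rounding) with ρ = 0.8, the setting reported above.

public class SenseFrequency {

    static final double RHO = 0.8;

    // Eq. 5: pos is the 1-based position of the sense, totalSenses = |Senses(t)|
    static double linearDecay(int pos, int totalSenses) {
        if (totalSenses <= 1) return 1.0;
        return 1.0 - RHO * (pos - 1) / (double) (totalSenses - 1);
    }

    // Eq. 6: fc holds the frequency counts, ordered by sense position (0-based here)
    static double freqCountDecay(int[] fc, int k) {
        if (fc[0] <= 0) return 1.0;
        return 1.0 - RHO * (fc[0] - fc[k]) / (double) fc[0];
    }

    public static void main(String[] args) {
        int[] fc = {51, 20, 15, 15}; // first 4 of the 29 senses of "line"
        for (int k = 0; k < fc.length; k++) {
            System.out.printf("phiF[%d] = %.2f%n", k + 1, freqCountDecay(fc, k));
        }
        // prints 1.00, 0.51, 0.44, 0.44 (0.435..., truncated to 0.43 in Example 5.5)
    }
}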

5.5. Feedback

The presented disambiguation algorithms are able to achieve very effective results from a single execution. However, in order to get even better results, we propose several feedback techniques, which refine the initial results by performing successive disambiguation runs. At each run i following the first one (i = 2, . . . , n), the set of senses Senses(t, N)_i for some term (t, N) is a subset of Senses(t, N)_{i-1}. It is worth noting that very few solutions for automatically refining disambiguation results, based on different techniques, have been proposed in the literature, and they were tested only in open text disambiguation settings [24].

First of all, user feedback is available, in which the user plays an active role by deactivating/activating the influence of selected senses on the disambiguation process. For instance, suggesting the correct meaning of a difficult term in the model may help the system in choosing the right meaning of the others. However, since our work also focuses on completely automatic techniques, we devised automated feedback techniques which are actually able, in most situations, to significantly improve the results of successive runs without user intervention. We propose three gradually more complex methods, all based on the idea that removing the "bottom" senses in the first run has a positive effect on the disambiguation, since the "noise" produced by those senses which, most probably, are not right is eliminated:

• Method 1 - Simple auto-feedback: The results of the first run are refined by automatically disabling the contributions of all but the top X senses in the following run. Thus, only a second run is required. A very low X value, such as 2 or 3, can in principle be chosen, since in most cases the STRIDER algorithms ensure that the right sense is in the very top positions of the ranking;

• Method 2 - "Knockout" auto-feedback: At each run, the sense of each term with the lowest confidence, i.e. the one presumably bringing more "noise", is disabled. The process stops when for all terms only the top X senses are left. This method requires more runs than Method 1, but it is expected to be more effective because of its greater gradualness. In particular, the number of runs will depend on the maximum number of senses among the model's terms and on the value of X;

• Method 3 - Stabilizing "knockout" auto-feedback: It is a fixpoint version of Method 2. In this case a residual vector ∆ is maintained for each term, where each entry ∆(k) keeps the modulus of the confidence variation of s_k between the current step (i) and the previous one (i − 1): ∆(k) = |φ_i(k) − φ_{i−1}(k)|. When the confidence vector of a term becomes stable, that is, when the Euclidean length of its residual vector ∆ is less than a given ε, only the top X senses are kept (a sketch of this stabilization test follows the list). This method is expected to achieve the same effectiveness as Method 2, while requiring fewer runs.
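The stabilization test of Method 3 amounts to comparing the confidence vectors of a term across two successive runs; a minimal sketch, with illustrative names and an ε chosen by the caller, could look as follows.

public class StabilizingFeedback {

    // Returns true when the confidence vector of a term has stabilized, i.e.
    // when the Euclidean length of the residual vector Delta is below epsilon.
    static boolean isStable(double[] phiPrev, double[] phiCurr, double epsilon) {
        double sumSq = 0.0;
        for (int k = 0; k < phiCurr.length; k++) {
            double delta = Math.abs(phiCurr[k] - phiPrev[k]); // Delta(k)
            sumSq += delta * delta;
        }
        return Math.sqrt(sumSq) < epsilon;
    }
}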

In Section 6 we will show samples of the effectiveness of all methods on different types of schemas.

6. Experimental evaluation

In this section we present the results we obtained through an actual implementation of the STRIDER disambiguation system, including all the features presented in the previous sections.

Table 2: Features of the tested models

Model       Name            Spec.   Number of               #Senses         Sense
                            Level   terms   nodes   edges   mean    max     Simil.
SQL DDL     PurchaseOrders  2       26      18      68      5.115   29      2.807
            Computer        3       31      24      92      3.774   13      3.162
            Student         2       19      18      68      4       11      3.057
XML Schema  Yahoo           1       15      18      56      2.733   6       3.373
            eBay            1       17      18      56      2.941   8       3.420
            Shakespeare     3       14      17      52      8       29      2.152
            DBLP            3       14      17      52      5.429   11      2.601
OWL/RDF     Camera          3       30      125     384     4.533   12      2.785
            Travel          2       48      49      156     3.375   12      3.248
            Process         1       69      119     320     3.666   13      2.916

6.1. Experimental Setting

Tests were conceived in order to show the behavior of our disambiguation approach in different scenarios.


We performed several tests on a large number of models. Since no reference collections are officially available for evaluating structural disambiguation approaches, we will present a selection of the experiments on the most interesting and challenging models we found, for instance those containing terms that are not used with their most common meaning. Table 2 shows the 10 chosen models, their input formats and their features, both from a structural and a semantic point of view. The complete models can be found in Appendix B. The models are chosen so as to be representative of diverse disambiguation challenges and scenarios: in order to evaluate the effectiveness of our approach, the output of our system will be compared to the "gold standard" manually disambiguated models. To this end, two human taggers contributed to annotate the corpus w.r.t. the WordNet thesaurus. The taggers performed disambiguation without any external knowledge; in case of disagreement, a third human tagger chose which tag should apply. Notice that, in some particular cases where the target words present very fine-grained senses in WordNet, more than one sense is chosen as equally appropriate.

We consider 3 relational schemas, 4 simple trees and 3 ontologies. The relational schemas are examples commonly proposed in various database books: PurchaseOrder (see Figure 1-a for the complete schema description) is a simple schema for modelling commercial orders for suppliers, Computer is a schema for the assembly of PC components and Student contains the course-exam-student relational schema. Simple trees and ontologies are all available on the internet: we chose a small portion of Yahoo's web directories and eBay's catalog, a model extracted from Shakespeare's Plays in XML (footnote 8) and the entire DBLP XML schema, a scientific digital library for computer science journals and proceedings. The Camera and Travel models come from the OWL ontologies of the Protege Ontologies Library (footnote 9); the Camera ontology is also the Jena Framework (footnote 10) running example. The Process ontology is extracted from the "OWL-S Submission", an OWL-based Web service ontology which supplies a core set of markup language constructs for describing the properties and capabilities of Web services in unambiguous, computer-interpretable form (footnote 11).

The models we chose have various specialization levels (see Table 2), indicating how much a model is contextualized in a particular scope:

Low Specialization (1) Generic models can gather very heterogeneous concepts, such as a web directory: the portion of the eBay catalog (see [17] for the entire model) contains terms such as batteries as an electronic device and chair as a piece of furniture, which come from very different semantic areas, or very generic and abstract terms such as condition and control;

8 http://www.xml.com/pub/r/396
9 http://protege.stanford.edu/download/ontologies.html
10 http://jena.sourceforge.net/ontology
11 http://www.w3.org/Submission/2004/07/


Medium Specialization (2) Models with a medium specialization level can gather heterogeneous concepts that come from a specific semantic area, such as exams, students and courses from the Student model;

High Specialization (3) Finally, highly specific models gather concepts from a very specific area: for instance, the terms inside the Camera model are typically about photographic techniques and devices, such as lens and shutter.

In the following, we will present the obtained results by grouping them in three groups, corresponding to the different specialization levels. As we will see, different groups represent different but equally complex challenges: while the low and medium specialization groups (Groups 1 and 2) involve terms with very heterogeneous meanings, from a certain point of view the higher the specialization of the models, the harder their disambiguation, because the involved terms require an accurate collocation in a specialized context (Group 3). Notice that such groups are heterogeneous as to the input format: grouping by input format would not have been a significant choice, since our disambiguation system is completely independent from the input format, thanks to the format unification of the Pre-processing phase.

Table 2 also shows the structural features of the chosen models, such as the number of terms, nodes and edges. Relational schemas and trees have a few dozen nodes, from which a similar number of terms (depending on the number of nodes that contain more than one term and on the terms that are eliminated with the stopword list) is extracted. Ontologies are characterized by more complex graphs, having a higher number of nodes and edges (including blank nodes, which are preserved but not disambiguated). Note that the number of edges is more than 3 times higher than the number of nodes. It is also worth noting that the size of the models is not a discriminating feature per se: even smaller models provide complex and interesting disambiguation challenges (as will be shown in the following experiments), and we are not interested in efficiency evaluations. Moreover, Table 2 shows semantic features such as the mean and maximum number of terms' senses and the average similarity among the senses of each given term in the graph ("Sense Simil." in the Table, computed by using a variant of Eq. 2). Such similarity should give a coarse intuition of how hard it is to disambiguate the terms in a given graph: the higher the value, the more the senses of the term are similar to each other and, thus, possibly difficult to discriminate.

As a final implementation note, we developed STRIDER using Java JDK 1.6, the Jena2 framework and the Berkeley DB database for the persistent RDF storage.

6.2. Effectiveness Evaluation

In our experiments we evaluated the performance of our disambiguation algorithms mainly in terms of effectiveness. Efficiency evaluation is not crucial for a disambiguation approach and is beyond the goal of this article, so it will not be deepened (in any case, the disambiguation process for the analysed models required a few seconds at most).

[Figure 10: Mean precision levels for the three groups.]

      Group 1                            Group 2                            Group 3
      Graph  Sense  Freq   Total  (Rnd)  Graph  Sense  Freq   Total  (Rnd)  Graph  Sense  Freq   Total  (Rnd)
P(1)  0.886  0.847  0.745  0.857  0.461  0.852  0.847  0.815  0.902  0.462  0.750  0.696  0.699  0.757  0.344
P(2)  0.976  0.971  0.929  0.971  0.723  0.958  0.958  0.944  0.976  0.655  0.914  0.835  0.774  0.931  0.569
P(3)  0.981  0.981  0.976  0.990  0.856  0.982  0.976  0.944  0.976  0.784  0.956  0.886  0.904  0.931  0.683

In order to produce a complete analysis, we computed full precision figures synthesizing the disambiguation effectiveness of the system on the different models. We first analyze the performance of the system by employing the default settings, i.e. those we derived from a large number of preliminary tests on different models. These generally allow a good disambiguation effectiveness and are available to the user without any configuration or intervention; in particular, they include crossing settings for all common file formats (more on this later), and adopt as defaults the thesaurus-based similarity and the thesaurus-based sense context algorithm. The impact of different similarities and sense context algorithms will be analyzed in the following sections. The effectiveness gains achievable through our automated feedback techniques will also be presented in depth in a dedicated final section.

Before delving into the test results, we will briefly describe the rationale behind the STRIDER default crossing settings, which can be found in detail in Appendix A. In these settings for relational, tree and graph models, we set specific costs for each kind of edge w.r.t. the specific kind of relationship and for both crossing directions. For instance, we set the costs for navigating FK edges very low: this is to capture the strong relationship that links the terms representing the referencing columns and those referenced, and thus to obtain a high mutual influence in their disambiguation. Graphs originated from the conversion of SQL DDL models contain a limited number of edge kinds (such as primary keys, foreign keys and unique constraints), but ontologies have numerous kinds of relationships, including user-defined ones. We fixed a low cost for the most common ontology relationship types stating equivalent concepts (i.e. the OWL properties equivalentClass and equivalentProperty), a high cost for those relationship types concerning inequalities (i.e. the OWL properties disjointWith and differentFrom) and a default cost for any other. Finally, we derived two reachability threshold settings, one specifically devised for limiting the context (limited context setting) and thus useful for disambiguating lowly specialized and heterogeneous models, and the other not limiting the context (complete context setting), useful for more specific models. Following this intuition, in the next tests we will employ the limited context setting for Group 1 models and the complete context for Groups 2 and 3 (in the next section we will also specifically test the impact of the reachability threshold settings on the obtained precision figures).

In order to evaluate the precision of our disambiguation approach, we compared the fully automated disambiguation results with the manual "gold standard". In general, precision P is the number of terms correctly disambiguated divided by the number of terms in the models. Since our system is able to produce complete ranking results, we can compute the precision P(M) at different levels of quality, by considering the results up to the first M ranks: for instance, P(1) will be the percentage of terms for which the correct senses are at the top position of the ranking.

Figure 10 shows the mean precisions P of the models in each group at three levels of quality (P(1), P(2), and P(3)) and details the three contributions described in the previous section (the graph context contribution (Graph), the sense context one (Sense) and the sense frequency one (Freq), together with their combination (Total)). It is worth noting that two baselines are also provided: the random baseline is denoted as (Rnd) in the Figure, while the sense frequency contribution coincides with the well known most frequent sense baseline. Indeed, we decided to show the sense frequency contribution results even if they do not directly give effectiveness indications, as their computation is not context-dependent: they positively contribute to the final results and, as a baseline, they show the disambiguation difficulty of the considered groups: the lower the sense frequency contribution, the more the involved terms are used in unusual meanings.
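A minimal sketch of the P(M) measure as just defined; for simplicity it assumes a single gold-standard sense per term, whereas, as noted above, a few terms in our corpus admit more than one appropriate sense. Names are illustrative.

public class PrecisionAtM {

    // rankings[t] is the system's ranked list of sense ids for term t;
    // gold[t] is the gold-standard sense id of term t.
    static double precisionAtM(int[][] rankings, int[] gold, int m) {
        int correct = 0;
        for (int t = 0; t < gold.length; t++) {
            for (int r = 0; r < m && r < rankings[t].length; r++) {
                if (rankings[t][r] == gold[t]) { // correct sense within the first m ranks
                    correct++;
                    break;
                }
            }
        }
        return correct / (double) gold.length;
    }
}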

The combination of the three contributions produces good P(1) precision levels of 88% and 85% for Groups 1 and 2, respectively. Precision results for Group 3 are lower (nearly 75%), but we have to consider the high mean and maximum number of terms' senses; even in this difficult setting, the results are quite encouraging, particularly if we notice that P(2) is well above 91% for each Group. As to the effectiveness of the sense context, notice that its contribution alone (Sense) is generally very near to the graph context one, even in the complex Group 3 setting, indicating a good effectiveness of this approach too; further, in all three cases the combination of the three contributions (Total) produces better results than each of the contributions alone. Finally, note that, for all groups, the random baseline is very distant from the results we achieved, proving once again the goodness of the STRIDER results and the complexity of the considered disambiguation tasks.

6.2.1. Impact of the Crossing Setting Reachability Threshold

We will now specifically test whether disambiguating low specialization models with a limited context and those with medium/high specialization with a complete context setting really produces better precision results, as expected. Figure 11 shows two illustrative comparisons between limited and complete contexts for the eBay model (Group 1, Figure 11-a) and the Computer model (Group 3, Figure 11-b).

[Figure 11: Mean precision level comparison between limited and complete context. (a) Typical Group 1 behavior (eBay model); (b) typical Group 2 behavior (Computer model).]

(a) eBay model:
      Limited context        Complete context
      Graph  Sense  Total    Graph  Sense  Total
P(1)  0.941  0.941  0.941    0.882  0.882  0.882
P(2)  1.000  1.000  1.000    0.941  0.941  1.000
P(3)  1.000  1.000  1.000    1.000  1.000  1.000

(b) Computer model:
      Limited context        Complete context
      Graph  Sense  Total    Graph  Sense  Total
P(1)  0.774  0.774  0.871    0.774  0.806  0.903
P(2)  0.871  0.839  0.903    0.903  0.903  0.968
P(3)  0.968  0.968  0.968    0.968  0.968  0.968

These models are representative of the two main behaviors: Group 1 models respond better to the limited crossing setting, while Group 2 and Group 3 show an opposite trend. Notice that the sense frequency contribution is not presented, as it is not affected by context variations. In the first case, the precisions P(1) of the graph, sense and total context contributions decrease from 94% to 88% when the complete context is used. This is due to the fact that Group 1 concepts are very heterogeneous, and limiting the context only to directly related terms reduces the disambiguation "noise" produced by those that are completely uncorrelated. For instance, when the complete eBay model is used to disambiguate the term chair in the furniture category, the top sense is "the position of professor" and not "a seat for one person", as the process is wrongly influenced by terms like speaker and fan as "persons" ("someone who speaks" and an "enthusiastic devotee of sports", respectively). Instead, when the model terms are specific and more contextualized, such as in the other two groups and especially in Group 3, the result is the opposite: notice that the sense and total contributions increase beyond 90% for P(1) and 96% for P(2) when the complete context is used on the Computer model (Figure 11-b).

6.2.2. Evaluation of the Alternative Sense Context Contribution

The sense context contributions involved in the results presented in the previous sections have all been computed by means of the standard thesaurus-based sense context algorithm. In this section we will report the results obtained on our models by means of the alternative algorithm. In particular, we present a small selection of the performed tests, together with some examples, in order to show its performance. The standard algorithm appears to be the most effective in the majority of situations; however, we found out that in some cases the alternative algorithm was able to obtain even higher precision levels in disambiguating some of the models. Specifically, we noticed that in three models of our reference set the alternative method provides significant precision increments, typically in the range of 5%-10% (as in the Process or DBLP models), up to almost 13% in specifically complex situations (as in the Shakespeare model).

[Figure 12: Comparisons between sense context contributions (standard vs alternative formula) and similarities (thesaurus-based vs corpus-based). (a) Sense context algorithms comparison; (b) similarity comparison (Process model).]

(a) Sense context algorithms comparison:
      Process                Shakespeare
      Sense std  Sense alt   Sense std  Sense alt
P(1)  0.667      0.768       0.537      0.643
P(2)  0.913      0.870       0.786      0.786
P(3)  0.942      0.942       0.929      0.857

(b) Similarity comparison (Process model):
      Thesaurus-based sim.   Corpus-based sim.
      Graph  Sense  Total    Graph  Sense  Total
P(1)  0.783  0.667  0.696    0.812  0.754  0.754
P(2)  0.928  0.913  0.942    0.928  0.913  0.913
P(3)  0.942  0.942  0.971    0.942  0.942  0.971

Figure 12-a shows this comparison for the Process and the Shakespeare models. In these particular situations, the hypernym structures of WordNet are not completely adequate to describe some of the involved concepts and thus to disambiguate them. Since the standard algorithm is founded on the concept of minimum common hypernyms (only the confidences of the senses which are descendants of the mch are incremented), this may result in a non-optimal confidence computation. For instance, in the DBLP model, the sense chosen by the standard algorithm for the term pages is "US diplomat and writer Thomas Nelson Page" and not a "book or magazine page": this is due to the fact that many terms in the associated graph context (such as author) involve living things, and the derived minimum common hypernym may thus skew the successive computations. This is not true for the alternative computation algorithm, since the similarities between the terms' senses directly contribute to the sense confidences and no minimum common hypernym computation is needed.

6.2.3. Evaluation of the Corpus-based Similarity

The thesaurus-based similarity function described in Section 5.1, on which the results presented in the previous sections are based, appears very effective in most situations; however, we found out that in some cases the hypernym hierarchies are not as accurate as expected, worsening the overall disambiguation quality. On the other hand, large document repositories are a finer-grained representation of reality that we may exploit in order to compensate for the hierarchies' shortcomings. In particular, we noticed that when the model we want to disambiguate is very generic, such as the low specialization group's models, some of the involved hierarchies could be quite misleading.

The following tests show the contribution of the corpus-based similarity (Eq. 3). Since sim(t_x, t_y) is used both in graph and sense context evaluations, both contributions are affected, with a noticeable effect on the overall precision.

[Figure 13: Precision enhancements run by run of Method 2 and Method 3 feedback. (a) Method 2 feedback, Travel model; (b) Method 3 feedback, PurchaseOrder model.]

(a) Method 2 feedback - Travel model, P(1) precision level:
       run5   run6   run7   run8   run9   run10
Graph  0.854  0.854  0.854  0.875  0.917  0.917
Sense  0.854  0.875  0.875  0.917  0.938  0.938

(b) Method 3 feedback - PurchaseOrder model, P(1) precision level:
       run1   run2   run3   run4   run5   run6
Graph  0.808  0.885  0.885  0.885  0.885  0.885
Sense  0.731  0.731  0.731  0.769  0.808  0.885

In particular, we noticed a precision enhancement on some of the models, in particular Process, Yahoo and Computer. Figure 12-b shows, on the left, the values of the Graph, Sense and Total contributions using the thesaurus-based similarities and, on the right, the corpus-based ones for the Process model. P(1) is enhanced from 3% up to 9%. For instance, even if the right meaning of the term parameter is in the thesaurus, the thesaurus-based similarities would incorrectly disambiguate it as "a quantity that characterizes a statistical population" and not as a "constant in a function". This is because the huge amount of documents indexed by the search engine allows STRIDER to better discriminate the subtle shades of meaning that distinguish one sense from the others and that in some cases are not accurately represented in the thesaurus hypernym hierarchy.

6.2.4. Feedback Evaluation

As discussed in Section 5.5, STRIDER provides both user feedback and automated feedback. In this section, we get back to the default settings and focus on the impact on effectiveness achieved by the latter. Generally speaking, all our automated feedback methods improve the overall precision on about 70% of our models, without worsening it in any case. We will start by analysing each method on the most significant cases, then we will present a complete report of the achieved precisions on all models. Method 1 performs a successive run after the first one, disabling the contributions of all but the top X senses in the following run. For instance, by applying Method 1 with X = 3 to the Student model, the results of the second run show a precision increment of almost 5%, and similar results are generally obtainable on most of the considered models.

We achieved even better results using Method 2, which at each run knocks out from the ranking the sense of the term with the lowest confidence. Thus, successive runs are triggered with increasing precision, since the noise produced by the worst confidences in each ranking is eliminated run after run. Figure 13-a shows the behavior of the Method 2 feedback on the Travel model's P(1). Since the sense frequency contribution is not affected by the modifications to the context terms' senses made by the feedback, we only show the variations of the graph context and sense context contributions. We noticed that in the first runs, the confidences of the models' terms vary subtly without immediately modifying the corresponding rankings. For instance, in the example considered in Figure 13-a, both precisions remain unchanged for the first 4 runs (and thus are not shown in the figure), then they begin to increase gradually run after run. In particular, the graph context contribution increases from 85% at the first run to more than 91% at the ninth run; the sense context contribution achieves even better results, increasing from 85% at the first run up to more than 93% at the tenth run, which concludes the process. Moreover, the precision P(2) rises to a perfect 100% for both contributions from run 8 on, strengthening the performance of the Method 2 feedback.

Finally, Method 3 achieved results similar to Method 2 but proved to require a lower number of successive runs, witnessing that confidences stabilize on the right values and thus that monitoring the residual vector allows the system to shorten the transient phase without losing quality. For instance, Figure 13-b shows the behavior of Method 3 for the PurchaseOrder model: the precision P(1) of the graph context contribution increases from 80% at run 1 to almost 89% at run 2 and then remains stable for the following runs. Instead, the sense context contribution's P(1) needs 6 runs to raise its precision from 73% up to almost 89%, achieving an increase of more than 15%. We also obtained the same precision results of Figure 13-a by applying Method 3 on the Travel model, but with half the number of runs. Further, generally speaking, we noticed that the higher the mean and maximum number of terms' senses, the more considerable the contributions of Methods 2 and 3 on the overall precision: as we see in Figures 13-a and 13-b, the precision increment due to the feedback is substantial for both the Travel and the PurchaseOrder models, and it is more evident for the latter model than for the former (see Table 2 for the mean and maximum number of terms' senses).

Our analysis of the automated feedback performance is concluded by Table 3, which shows a complete report of the achieved precision levels on all the considered models for Method 3. In particular, the precision increments are significant for all groups of models and are clearly visible by comparing the "feedback off" figures (left part of the table) to the "feedback on" ones (right part of the table), proving that automated feedback can provide results as promising as those achieved in open-text WSD settings [24].

7. Conclusions

Structural disambiguation is acknowledged as a very real and frequent problem for many semantic-aware applications. The solution we proposed is STRIDER 2.0, a unified answer to the various sense disambiguation problems on a large variety of structure-based information types, both at data and metadata level, including relational schemas, XML data and schemas, taxonomies, and ontologies. The approach is knowledge-driven and leverages the combination and adaptation of some of the most effective free-text WSD techniques to the domain WSD context, also taking advantage of the best features of two worlds of knowledge sources, i.e. electronically oriented thesauri and text corpora.

Table 3: Performance of AutoFeedback techniques (Method 3)

                 AutoFeedback OFF          AutoFeedback ON
Models           P(1)   P(2)   P(3)        P(1)   P(2)   P(3)
Group 1  Graph   0.886  0.976  0.981       0.895  0.976  0.981
         Sense   0.847  0.971  0.981       0.898  0.971  0.981
         Total   0.857  0.971  0.990       0.898  0.971  0.990
Group 2  Graph   0.852  0.958  0.982       0.873  0.958  0.982
         Sense   0.847  0.958  0.976       0.860  0.958  0.976
         Total   0.902  0.976  0.976       0.902  0.976  0.976
Group 3  Graph   0.750  0.914  0.956       0.803  0.932  0.956
         Sense   0.696  0.835  0.886       0.749  0.871  0.922
         Total   0.757  0.931  0.931       0.793  0.931  0.931

Since different disambiguation tasks require different context extraction processes, we proposed flexible ways of navigating the structures in order to extract each term's context; the latter can thus be tailored to the specific disambiguation needs and can include a large variety of possibly available additional information, providing a significant improvement in the disambiguation effectiveness.

The extensive experimental evaluation we performed showed the good effectiveness of the presented solutions on a large variety of structure-based information and disambiguation requirements, even from a single disambiguation run. Moreover, we proved that such encouraging single-run results can be further improved by performing successive runs through the newly introduced automatic feedback techniques. This very satisfying level of effectiveness can be achieved without any user intervention; furthermore, it proves that our combination of WSD techniques can provide very good performance in a domain-constrained word context, even better than that achievable in standard open-text disambiguation settings. Ultimately, the achieved results can give a significant contribution towards the Semantic Web vision [62], making it possible for an ever growing number of applications to automatically select all and only the relevant knowledge in a more effective and accurate way.

Currently, STRIDER works on English-language human-readable labels, which characterize a very large portion of the structured information available on the web. In future work, we plan to enhance the system with label normalization techniques (such as the ones discussed in Section 4.2) so as to be able to "smartly" expand abbreviations and acronyms, possibly exploiting context information from the structure. Further, we plan to exploit the algorithms which are independent from the thesaurus hierarchies together with commonly available web dictionaries, so as to test the effectiveness of the proposed approach on different languages. To this end, we have already performed some initial but promising tests on the Italian language. Finally, we plan to devise a way to "hide" the wealth of settings and disambiguation possibilities from the user, thus achieving the best possible disambiguation results while avoiding the need for manual configuration depending on the model characteristics. The idea is to exploit some extraction and analysis techniques so as to be able to automatically detect the discriminating features of the involved models (such as the specialization level) and, consequently, to auto-configure the system so as to use the most suitable combination of algorithms and settings.

Acknowledgments

The authors would like to thank Enrico Ronchetti for his contribution to the design and implementation of the STRIDER disambiguation service, and the anonymous reviewers for their useful suggestions.

References

[1] L. Guo, J. Shanmugasundaram, G. Yona, Topology Search over Biological Databases, in: ICDE, 2007.

[2] M. d'Aquin, E. Motta, M. Sabou, S. Angeletou, L. Gridinoc, V. Lopez, D. Guidi, Toward a New Generation of Semantic Web Applications, IEEE Intelligent Systems 23 (3) (2008) 20–28.

[3] M. Theobald, R. Schenkel, G. Weikum, Exploiting Structure, Annotation, and Ontological Knowledge for Automatic Classification of XML Data, in: Proceedings of the International Workshop on Web and Databases (WebDB'03), 2003, pp. 1–6.

[4] L. Guo, X. Wang, J. Fang, Ontology clarification by using semantic disambiguation, in: Proceedings of the 12th International Conference on CSCW in Design (CSCWD), 2008, pp. 476–481.

[5] M. Theobald, H. Bast, D. Majumdar, R. Schenkel, G. Weikum, TopX: efficient and versatile top-k query processing for semistructured data, VLDB J. 17 (1) (2008) 81–115.

[6] M. Ehrig, A. Maedche, Ontology-Focused Crawling of Web Documents, in: Proceedings of the 2003 ACM Symposium on Applied Computing (ACM SAC'03), 2003, pp. 1174–1178.

[7] J. Gracia, V. Lopez, M. d'Aquin, M. Sabou, E. Motta, E. Mena, Solving Semantic Ambiguity to Improve Semantic Web based Ontology Matching, in: Proc. of the 2nd Int'l Workshop on Ontology Matching (OM), 2007.

[8] H. Paulheim, M. Rebstock, J. Fengel, Context-sensitive referencing for ontology mapping disambiguation, in: Proceedings of the International Workshop on Contexts and Ontologies: Representation and Reasoning (C&O:RR), 2007.

[9] J. Gracia, M. d'Aquin, E. Mena, Large scale integration of senses for the semantic web, in: Proceedings of the 18th International Conference on World Wide Web (WWW), 2009.

[10] V. Uren, P. Cimiano, J. Iria, S. Handschuh, M. Vargas-Vera, E. Motta, F. Ciravegna, Semantic annotation for knowledge management: Requirements and a survey of the state of the art, Journal of Web Semantics 4 (1) (2006) 14–28.

[11] G. A. Miller, WordNet: A Lexical Database for English, Communications of the ACM 38 (11) (1995) 39–41.

[12] P. Bouquet, L. Serafini, S. Zanobini, Bootstrapping Semantics on the Web: Meaning Elicitation from Schemas, in: Proceedings of the 15th International Conference on World Wide Web (WWW'06), 2006, pp. 505–512.

[13] F. Mandreoli, R. Martoglia, P. Tiberio, Approximate Query Answering for a Heterogeneous XML Document Base, in: Proceedings of the 5th Conference on Web Information Systems Engineering (WISE'04), 2004.

[14] E. Agirre, P. Edmonds, Word Sense Disambiguation: Algorithms and Applications, Springer, 2007.

[15] R. Navigli, Word sense disambiguation: A survey, ACM Comput. Surv. 41 (2).

[16] D. Shutt, P. Bernstein, T. Bergstraesser, J. Carlson, S. Pal, P. Sanders, Microsoft Repository Version 2 and the Open Information Model, Information Systems 24 (2) (1999) 71–98.

[17] F. Mandreoli, R. Martoglia, E. Ronchetti, Versatile Structural Disambiguation for Semantic-aware Applications, in: Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM'05), 2005, pp. 209–216.

[18] F. Mandreoli, R. Martoglia, E. Ronchetti, STRIDER: A Versatile System for Structural Disambiguation, in: Proceedings of the 10th International Conference on Extending Database Technology (EDBT'06), 2006, pp. 1194–1197.

[19] M. Lesk, Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone, in: Proceedings of the 5th Annual International Conference on Systems Documentation (SIGDOC '86), 1986, pp. 24–26.

[20] S. Banerjee, T. Pedersen, Extended Gloss Overlaps as a Measure of Semantic Relatedness, in: Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI'03), 2003, pp. 805–810.

[21] R. Mihalcea, Unsupervised Large-Vocabulary Word Sense Disambiguation with Graph-based Algorithms for Sequence Data Labeling, in: Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), 2005.

[22] R. Navigli, P. Velardi, Structural Semantic Interconnections: A Knowledge-Based Approach to Word Sense Disambiguation, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (7) (2005) 1075–1086.

[23] M. Galley, K. McKeown, Improving Word Sense Disambiguation in Lexical Chaining, in: Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI), 2003, pp. 1486–1488.

[24] R. Navigli, A Structural Approach to the Automatic Adjudication of Word Sense Disagreements, Natural Language Engineering 14 (4) (2008) 547–573.

[25] M. Halliday, R. Hasan, Cohesion in English, Longman Group Ltd., 1976.

[26] R. Navigli, M. Lapata, Graph Connectivity Measures for Unsupervised Word Sense Disambiguation, in: Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), 2007, pp. 1683–1688.

[27] R. Mihalcea, P. Tarau, E. Figa, PageRank on Semantic Networks, with Application to Word Sense Disambiguation, in: Proceedings of the 20th International Conference on Computational Linguistics (COLING), 2004.

[28] R. Rada, H. Mili, E. Bicknell, M. Blettner, Development and application of a metric on semantic nets, IEEE Transactions on Systems, Man and Cybernetics 19 (1) (1989) 17–30.

[29] C. Leacock, M. Chodorow, Combining local context and WordNet similarity for word sense identification, in: C. Fellbaum (Ed.), WordNet: An electronic lexical database, MIT Press, 1998, pp. 256–283.

[30] J. Jiang, D. Conrath, Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy, in: Proceedings of the 10th International Conference on Research in Computational Linguistics, 1997.

[31] P. Resnik, Using Information Content to Evaluate Semantic Similarity in a Taxonomy, in: Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI '95), 1995, pp. 448–453.

[32] E. Agirre, G. Rigau, Word Sense Disambiguation using Conceptual Density, in: Proceedings of the 16th International Conference on Computational Linguistics (COLING), 1996, pp. 16–22.

[33] E. Budanitsky, G. Hirst, Semantic distance in WordNet: An Experimental, Application-oriented Evaluation of Five Measures, in: Proceedings of the Workshop on WordNet and Other Lexical Resources, North American Chapter of the Association for Computational Linguistics (NAACL'01), 2001.

[34] E. Terra, C. L. A. Clarke, Frequency Estimates for Statistical Word Similarity Measures, in: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL'03), 2003, pp. 244–251.

[35] P. Cimiano, S. Handschuh, S. Staab, Towards the self-annotating web, in: Proceedings of the 13th WWW Conference (WWW'04), 2004, pp. 462–471.

[36] P. D. Turney, Word Sense Disambiguation by Web Mining for Word Co-occurrence Probabilities, in: Proceedings of the 3rd International Workshop on the Evaluation of Systems for the Semantic Analysis of Text (Senseval-3), 2004, pp. 239–242.

[37] E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Pasca, A. Soroa, A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches, in: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2009, pp. 19–27.

[38] R. L. Cilibrasi, P. M. B. Vitanyi, The Google Similarity Distance, IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE) 19 (3) (2007) 370–383.

[39] M. Strube, S. P. Ponzetto, WikiRelate! Computing Semantic Relatedness Using Wikipedia, in: Proceedings of the 21st National Conference on Artificial Intelligence (AAAI'06), 2006.

[40] L. Marquez, G. Escudero, D. Martinez, G. Rigau, Supervised Corpus-Based Methods for WSD, in: [14].

[41] T. Pedersen, Unsupervised Corpus-Based Methods for WSD, in: [14].

[42] A. Kilgarriff, M. Palmer, Introduction to the special issue on Senseval, Computers and the Humanities 34 (1-2) (2000) 1–13.

[43] A. Kilgarriff, P. Edmonds, Introduction to the Special Issue on Evaluating Word Sense Disambiguation Systems, Journal of Natural Language Engineering 8 (4) (2002) 279–291.

[44] T. H. Ng, Getting serious about word sense disambiguation, in: Proceedings of the ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What and How?, 1997, pp. 1–7.

[45] P. Edmonds, Designing a task for Senseval-2, Tech. rep., University of Brighton (2000).

[46] M. Cuadros, G. Rigau, Quality Assessment of Large Scale Knowledge Resources, in: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2006, pp. 534–541.

[47] J. Madhavan, P. A. Bernstein, A. Doan, A. Y. Halevy, Corpus-based Schema Matching, in: Proceedings of the 21st International Conference on Data Engineering (ICDE'05), 2005, pp. 57–68.

[48] R. Navigli, P. Velardi, A. Gangemi, Ontology learning and its application to automated terminology translation, IEEE Intelligent Systems 18 (1) (2003) 22–31.

[49] A. Tagarelli, S. Greco, Clustering Transactional XML Data with Semantically-Enriched Content and Structural Features, in: Proceedings of the 5th Conference on Web Information Systems Engineering (WISE'04), 2004, pp. 266–278.

[50] R. Trillo, J. Gracia, M. Espinoza, E. Mena, Discovering the semantics of user keywords, Journal of Universal Computer Science 13 (12) (2007) 1908–1935.

[51] L. McDowell, O. Etzioni, S. Gribble, A. Halevy, H. Levy, W. Pentney, D. Verma, S. Vlasseva, Mangrove: Enticing Ordinary People onto the Semantic Web via Instant Gratification, in: Proceedings of the 2nd International Semantic Web Conference (ISWC'03), 2003, pp. 754–770.

[52] F. Ciravegna, A. Dingli, D. Petrelli, Y. Wilks, User-System Cooperation in Document Annotation Based on Information Extraction, in: Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management (EKAW'02), 2002, pp. 122–137.

[53] V. Svatek, M. Labsky, M. Vacura, Knowledge Modelling for Deductive Web Mining, in: Proceedings of the 14th International Conference on Knowledge Engineering and Knowledge Management (EKAW'04), 2004, pp. 337–353.

[54] F. Ciravegna, S. Chapman, A. Dingli, Y. Wilks, Learning to harvest information for the semantic web, in: Proceedings of the 1st European Semantic Web Symposium (ESWS '04), 2004, pp. 312–326.

[55] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A. Popescu, T. Shaked, S. Soderland, D. S. Weld, A. Yates, Web-scale information extraction in KnowItAll: (preliminary results), in: Proceedings of the 13th International Conference on World Wide Web (WWW'04), 2004, pp. 100–110.

[56] S. Melnik, E. Rahm, P. A. Bernstein, Rondo: A Programming Platform for Generic Model Management, in: Proceedings of the 22nd ACM International Conference on Management Of Data (SIGMOD'03), 2003, pp. 193–204.

[57] J. Madhavan, P. A. Bernstein, E. Rahm, Generic Schema Matching with Cupid, in: Proceedings of the 27th International Conference on Very Large Data Bases (VLDB'01), 2001, pp. 49–58.

[58] H. Do, E. Rahm, COMA – A system for flexible combination of schema matching approaches, in: Proceedings of the 28th VLDB Conference, 2002.

[59] S. Sorrentino, S. Bergamaschi, M. Gawinecki, L. Po, Schema normalization for improving schema matching, in: ER, 2009, pp. 280–293.

[60] P. Resnik, Disambiguating Noun Groupings with Respect to WordNet Senses, in: Proceedings of the 3rd Workshop on Very Large Corpora, 1995, pp. 54–68.

[61] G. A. Miller, C. Leacock, R. Tengi, R. T. Bunker, A Semantic Concordance, in: Proceedings of the Workshop on Human Language Technology (HLT), 1993, pp. 303–308.

[62] T. Berners-Lee, J. Hendler, O. Lassila, The Semantic Web, Scientific American 284 (5).

Table 4: Default crossing setting for each input format

Edge                    Direct Dir.  Opposite Dir.  Applicability
SR                      1            0.8            relat., tree, graph
FK                      0.1          0.1            relat.
owl:differentFrom       1            0.8            graph
owl:disjointWith        1            0.8            graph
owl:complementOf        1            0.8            graph
owl:equivalentClass     0.1          0.1            graph
owl:equivalentProperty  0.1          0.1            graph
owl:someValuesFrom      0.1          0.1            graph
rdfs:domain             0.1          0.1            graph
rdfs:label              0.1          0.1            graph
rdfs:range              0.1          0.1            graph
rdfs:subPropertyOf      0.1          0.1            graph

A. Default crossing setting

Table 4 shows our default crossing setting for each input format; arc types are those we found most significant for our disambiguation purposes. The SPARQL queries we used to specify the SR and FK arc types are presented in Table 5. Notice that SR (structural relationship) is presented three times, once for each input format supported by our disambiguation approach. The used namespace prefix list is:

PREFIX sql: <http://www.oim-converter.com/vocabulary/sqlddl/1.0>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema>
PREFIX owl: <http://www.w3.org/2002/07/owl>

B. Schema descriptions

In this section we show the schemas we chose for our experimental evaluation. As to relational schemas, Figure 14 and Figure 15 depict the Computer and Student schemas, respectively, while the PurchaseOrder schema has already been presented in Section 3 as our running example.

Moreover, the trees we used in our tests, i.e. a small portion of Yahoo's web directories and eBay's catalog, the schema extracted from Shakespeare's Plays and the entire DBLP XML schema, are shown in Figures 16, 17, 18 and 19, respectively.

Finally, we present the graphs we used in our experimental evaluation: the Camera ontology is shown in Figure 20, while a selection of nodes and edges from the Travel and Process ontologies is shown in Figures 21 and 22, respectively. Notice that boxes represent ontology classes, while arrows represent relationships between classes (object properties have an asterisk beside their label).

Figure 14: Computer schema. [Diagram: relational tables communicationPort (identifier, name, transmissionRate), desktopPC (identifier, motherBoard, processor, communicationPort, hardDisk, fans, mouse, keyboard, monitor), hardDisk (identifier, storageCapacity, rotationSpeed), and motherBoard (identifier, bus, chipSet, controller, slots); desktopPC references motherBoard, communicationPort, and hardDisk through foreign keys.]


Figure 15: Student schema. [Diagram: relational tables STUDENT (identifier, name, sex, major, gradAssistant, university), CLASS (identifier, lectures, teacher, numStudents, credit), and EXAM (class, student, date, score); EXAM references CLASS and STUDENT through foreign keys.]

Figure 16: Yahoo schema. [Diagram: directory tree rooted at Concept, with a Health branch (Medicine, Neurology, Hygiene, Cardiology) and an Entertainment branch (Music, Genres, Jazz, Rock, Conferences, Awards, Lyrics).]


Table 5: SPARQL queries for SR and FK edges

SR (relational input format):

select ?table ?column ?tableURI ?columnURI
where { ?tableURI sql:name ?table .
        ?tableURI rdf:type sql:Table .
        ?tableURI sql:column ?columnURI .
        ?columnURI rdf:type sql:Column .
        ?columnURI sql:name ?column }

SR (tree input format):

select distinct ?parent ?child ?parentURI ?childURI
where {
  { ?c xsd:schema ?b .
    ?b xsd:complexType ?parentURI .
    ?parentURI xsd:name ?parent .
    ?parentURI ?middlePredicate ?x .
    ?x xsd:element ?childURI .
    ?childURI xsd:name ?child .
    FILTER ( ?middlePredicate = xsd:sequence || ?middlePredicate = xsd:all )
  }
  UNION
  { ?parentURI xsd:complexType ?y .
    ?parentURI xsd:name ?parent .
    ?y ?middlePredicate ?x .
    ?x xsd:element ?childURI .
    ?childURI xsd:name ?child .
    FILTER ( ?middlePredicate = xsd:sequence || ?middlePredicate = xsd:all )
  }
  UNION
  { ?parentURI xsd:type ?child .
    ?parentURI xsd:name ?parent .
    ?childURI xsd:name ?child .
    ?childURI ?middlePredicate ?c .
    ?c xsd:element ?d .
    ?d xsd:name ?e .
    FILTER ( ?middlePredicate = xsd:sequence || ?middlePredicate = xsd:all )
  }
}

SR (graph input format):

select ?parentURI ?childURI
where { ?childURI rdfs:subClassOf ?parentURI .
        ?parentURI rdf:type rdf:Class .
        ?childURI rdf:type owl:Class . }

FK (relational input format):

select ?referencing ?referenced ?referencedURI ?referencingURI
where { ?a sql:foreignKeyRole ?referencedURI .
        ?a sql:uniqueKeyRole ?referencingURI .
        ?referencingURI sql:name ?referencing .
        ?referencedURI sql:name ?referenced }
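
Purely as an illustration of how such queries can be run, the following Python sketch (an assumption on our side: it uses the rdflib library and a hypothetical file name, and is not the toolchain used in the paper) executes the SR query for the relational format, after prepending the prefix declarations of Appendix A:

from rdflib import Graph

# SR query for the relational format (Table 5), with the relevant
# prefix declarations of Appendix A prepended verbatim.
SR_RELATIONAL = """
PREFIX sql: <http://www.oim-converter.com/vocabulary/sqlddl/1.0>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns/>

select ?table ?column ?tableURI ?columnURI
where { ?tableURI sql:name ?table .
        ?tableURI rdf:type sql:Table .
        ?tableURI sql:column ?columnURI .
        ?columnURI rdf:type sql:Column .
        ?columnURI sql:name ?column }
"""

g = Graph()
# "computer_schema.rdf" is a hypothetical RDF serialization of the
# Computer schema of Figure 14, produced by the format conversion step.
g.parse("computer_schema.rdf")

for row in g.query(SR_RELATIONAL):
    # each result row corresponds to one SR arc from a table to a column
    print(f"SR: {row.table} -> {row.column}")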


Figure 17: eBay schema. [Diagram: catalog tree rooted at Buy, with branches for Computers (Desktop Pc Components: Memory, Speaker, Fan, Mouse), Cameras (Accessories: Batteries), and Antiques (Musical Instruments: String; Furniture: Chair).]

Figure 18: Shakespeare's plays schema. [Diagram: XML tree with the elements Play, Title, Subtitle, Personae, Persona, Group, Act, Scene, Speech, Speaker, Line, Number, and Character.]

Figure 19: DBLP schema (terms containing acronyms and abbreviations are replaced with the text within square brackets). [Diagram: tree rooted at Dblp [Bibliography], with a Proceedings branch (Id, Title, Volume, Number, Year) and an Article branch (Id, Author, Title, Pages, Ee [Hyperlink], Crossref [Reference]).]


Figure 20: Camera schema. [Diagram: ontology with classes including PurchaseableItem, Camera, Digital, Large-Format, Lens, Body, BodyWithNonAdjustableShutterSpeed, Money, Range, Window, Viewer, and SingleLensReflex, connected by isa links and by the object properties cost*, lens*, body*, shutter-speed*, viewFinder, and compatibleWith*.]

Figure 21: A portion of the Travel schema. [Diagram: classes Destination, Activity (with subclasses Sightseeing, Museums, Safari, Adventure, BunjeeJumping), Accommodation (with subclasses Hotel, LuxuryHotel, BudgetAccommodation, BedAndBreakfast, Campground), and AccommodationRating, connected by isa links and by the object properties hasActivity*, hasAccommodation*, hasRating*, hasPart*, and isOfferedAt*.]


Figure 22: A portion of the Process schema. [Diagram: classes Function (with subclasses AtomicFunction, CompositeFunction), Parameter (with subclasses Input, Output, ResultVariable), ControlConstruct, Condition, Result, Perform, Produce, Binding (with subclasses InputBinding, OutputBinding), ValueOf, and ServiceModel, connected by isa links and by object properties such as hasParameter*, hasInput*, hasOutput*, hasResult*, hasPrecondition*, hasResultVariable*, composedOf*, hasDataFrom*, inCondition*, withOutput*, and producedBinding*.]
