Comparing categories among geographic ontologies

ARTICLE IN PRESS

0098-3004/$ - see front matter © 2004 Elsevier Ltd. All rights reserved.

doi:10.1016/j.cageo.2004.07.010

*Corresponding author. Fax: +30 210 7722634. E-mail address: [email protected] (M. Kavouras).

Computers & Geosciences 31 (2005) 145–154

www.elsevier.com/locate/cageo

Comparing categories among geographic ontologies

Marinos Kavouras*, Margarita Kokla, Eleni Tomai

Cartography Laboratory, School of Rural and Surveying Engineering, National Technical University of Athens, Zografos Campus, Athens 15780, Greece

Received 16 June 2004; received in revised form 27 July 2004; accepted 27 July 2004

Abstract

Numerous attempts have been made to generate semantic “mappings” between different ontologies, or to create aligned or integrated ones. An essential step towards their success is the ability to compare the categories involved. This paper introduces a systematic methodology for comparing categories found in geographic ontologies. The methodology extracts the semantic information provided by the categories’ definitions. The first step is the recognition of syntactic and lexical patterns in definitions, which help to identify (a) semantic properties such as purpose, location, and cover, and (b) semantic relations such as hypernym, part-of, has-parts, etc. In the second step, a similarity measure among categories is applied in order to explore how the extracted properties and relations interrelate. This framework enables us to (a) better understand the impact of context in cross-ontology “mappings”, (b) evaluate the “quality” of definitions as to whether they respect mere ontological aspects (such as unambiguous taxonomies), and (c) deal more effectively with the problem of semantic translation among geographic ontologies.

© 2004 Elsevier Ltd. All rights reserved.

Keywords: Geographic ontologies; Semantic properties; Semantic relations; Similarity

1. Introduction

A close inspection of existing geographic categorizations or geographic data exchange standards shows that, although they often refer to apparently similar categories, they use different semantics due to different contexts. This “Babel Tower” makes the association process and the establishment of an aligned or integrated ontology (Sowa, 2000) very problematic. There have been numerous attempts to deal with the problem of ontology integration and semantic interoperability (Wache et al., 2001; Vckovski et al., 1999; Uitermark, 2001; Kokla and Kavouras, 2001). In this endeavor, it is essential to understand, reveal and resolve existing heterogeneities.

In earlier attempts, similarities and heterogeneities between geographic categories were identified on the basis of their common attributes (Kavouras and Kokla, 2002). Such attributes were defined based on experts’ knowledge of the features involved. This approach, however, involved a great degree of subjectivity, which may contradict the intentions of the original designer. Since definitions have been recognized as an important source of semantics, it was decided to exploit their potential. Besides their semantic value (Jensen and Binot, 1987; Klavans et al., 1993; Swartz, 1997), definitions are rich in different kinds of knowledge such as lexical, world, encyclopedic and semantic (see SIGLEX workshop, in Barriere, 1997). In addition, they are often the only objective source available, especially in existing geographic data collections. Furthermore, any findings concerning the semantic completeness of definitions in describing categories would greatly help to form better definitions in new classifications, as described by Tomai and Kavouras (2004).

The purpose of the present research is to identify semantic information from definitions and to enrich the representation of categories with semantic properties and relations, such as those reported by Kokla and Kavouras (2002), in order to disambiguate geographic categories. The ability to represent and visualize the degree of semantic similarity with concept mapping tools (Skupin, 2002; Tomai and Kavouras, 2002) can greatly facilitate the entire process. To tackle these semantic heterogeneities, we explore similarities and dissimilarities among two well-known geographic ontologies, CORINE LC and MEGRIN, and one lexical ontology, WordNet, which includes geographic categories.

Another aspect of the research is the representation of semantic similarity, in order to identify semantic heterogeneities and therefore facilitate interoperability. In the domain of geographic information science, few approaches have attempted to model similarity. As far as the similarity of geographic entity classes is concerned, Rodríguez et al. (1999) and Rodríguez and Egenhofer (2003) have proposed a computational method for assessing similarity between two ontologies by a similarity function that compares distinguishing features of the entities involved, such as parts, functions and attributes. Our approach to establishing semantic similarity among categories from different geographic ontologies uses exclusively the semantic information that can be derived from the categories’ definitions.

In summary, this paper compares category definitions, determines heterogeneities, portrays semantic similarity, and overall prepares the ground for the integration process. In the rest of the paper, Section 2 presents the characteristics of the ontologies employed. Semantic relations and properties are described in Section 3. Section 4 is concerned with the computation and visualization of semantic similarity. The results of this work are analyzed in Section 5, and finally some conclusions are drawn in Section 6.

2. Case-study ontologies

As mentioned above, several problems of association and integration are encountered when trying to compare categories from distinct repositories of geographic information. In this research, we identified semantic relations and properties from the following categorizations:

• CORINE LC¹ is a categorization intended to provide consistent localized geographical information on the land cover of the member states of the European Community, by using satellite data. CORINE Land Cover has a three-level hierarchy of categories. The upper level consists of 5 categories, the middle level of 15, and the lowest one of 44 categories.

• GDDD, the Geographical Data Description Directory (MEGRIN’s GDDD)², contains information on available digital geographic information from Europe’s National Mapping Agencies (NMAs). Layer names, feature type names, and feature attribute type names correspond to the nomenclature used in the DIGEST Feature and Attribute Coding Catalogue (FACC).

• WordNet³ is a lexical database for the English language, whose design was based on current psycholinguistic theories.

¹ European Environmental Agency: CORINE Land Cover Methodology and Nomenclature. http://reports.eea.eu.int/COR0-part1/en, http://reports.eea.eu.int/COR0-part2/en.

² MEGRIN’s PETIT project. http://www.eurogeographics.org/megrin/PROJECTS/PETIT/Prototyp_desc.html.

³ WordNet 1.7.1, a Lexical Database for the English Language, Cognitive Science Laboratory, Princeton University. http://www.cogsci.princeton.edu/~wn/.

At the context level, the first two categorizations are considered domain ontologies, while WordNet is a general lexical ontology. At the level of formality, all three categorizations can be considered terminological ontologies (Sowa, 2000), since they contain categories specified by definitions expressed in natural language and associated by subtype/supertype relations. Furthermore, the first two can be characterized as light ontologies or taxonomies since they establish classifications.

Categories at the lowest (and most detailed) hierarchical level were examined for CORINE LC. In addition, for the sake of simplicity and clarity, the study was restricted to a small yet representative set of categories from the three ontologies, selected to account for a range of heterogeneities encountered between geographic categorizations. The categories selected for the experiment were:

• CORINE LC’s categories 4 (wetlands) and 5 (water bodies),
• MEGRIN’s category hydrography,
• WordNet’s definitions for the related category terms.

Therefore, we ended up with definitions of 29 “category types” (Table 1). The term category type refers to categories that are found in different ontologies under the same term (name of the category) but exhibit differences in their definitions or the contexts under which they are used.

Table 1
Category types used in our approach

Ontology | Category types
---------|---------------
CORINE land cover | Peat bog, Water course, Water body, Salt marsh, Saline, Intertidal flat, Coastal lagoon, Estuary, Sea and ocean, Inland marsh
MEGRIN | Bog, Canal, Lake/pond, Salt marsh, Salt pan, Watercourse
WordNet | Body of water, Bog, Canal, Lake, Pond, Salt pan, Watercourse, Watercourse, Marsh, Estuary, Sea, Ocean, Lagoon

3. Determination of semantic relations and properties

The rationale behind this research is to determine semantic information from definitions and to enrich the representation of categories with semantic properties and relations in order to reveal similarities and heterogeneities. The field of natural language processing develops methodologies for the automatic extraction of semantic information from definitions. According to Jensen and Binot (1987), definitions include a wealth of knowledge expressed in natural language, which can be analyzed by natural language processing systems.

Definitions are a kind of text with special structure and content. They are rich sources of scientific knowledge of a domain. In geographic ontologies, definitions are the primary and usually the only descriptions of category terms, since other elements that could contribute to the semantic definition of geographic categories (e.g., properties, functions, axioms) are either missing or superficially described. Research on definitions seeks ways to exploit the wealth of information latent in this special kind of text.

Definitions in geographic categorizations usually comprise two parts: the genus and the differentiae. The genus, or hypernym, is the superordinate term of the defined category term. For example, in the definition “hotel: a building where travelers can pay for lodging and meals and other services”, “building” is the genus of the category “hotel”.

The differentiae are the other elements of the definition apart from the genus, which differentiate words with the same genus. Thus, in the definition “skyscraper: a very tall building with many storeys”, “skyscraper” has the same genus (i.e., “building”) as “hotel”, but the two are distinguished by their differentiae (e.g., “where travelers can pay for lodging and meals and other services” versus “tall” and “with many storeys”).

The methodology adopted here for analyzing definitions and extracting semantic information was introduced by Jensen and Binot (1987) and further pursued by Ravin (1993) and Vanderwende (1995). This approach consists of the syntactic analysis of definitions and the application of rules that examine the existence of certain syntactic and lexical patterns. Patterns take advantage of specific elements of definitions in order to identify a set of semantic properties and relations, and their values, based on syntactic analysis. Patterns applied to the genus part of the definition extract the hypernym or “is-a” relation. Patterns applied to the differentiae part extract other semantic relations such as “is-part-of”, “has-parts”, “adjacent-to”, etc., as well as semantic properties such as purpose, location, time, size, etc.

Therefore, it was necessary to specify the semantic relations and properties used in geographic definitions. For that reason, different geographic ontologies, standards and categorizations (e.g., CYC Upper Level Ontology, WordNet, CORINE Land Cover, DIGEST, SDTS, etc.) were analyzed in order to identify patterns that are systematically used to express specific semantic relations and properties. The most commonly used are shown in Table 2. Besides general semantic elements (e.g., PURPOSE, CAUSE, TIME, etc.), other context-specific semantic elements were also identified. For example, categories related to hydrography are described by semantic elements such as nature (natural or artificial) and flow (flowing or stagnant).

The pattern for the extraction of the semantic relation PURPOSE (Vanderwende, 1995) is:

If the verb used (created, intended, prepared, provided, etc.) is post-modified by a prepositional phrase with the preposition for, then create a PURPOSE relation with the head(s) of that prepositional phrase as the value.

For example, a PURPOSE property with value “transportation” is extracted from the definition “canal: a manmade or improved natural waterway used for transportation” (MEGRIN).
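As a rough illustration of how such a rule operates, the sketch below approximates the PURPOSE pattern with a regular expression instead of a syntactic parse. This is an assumption made for brevity and is not the parser-based procedure of the cited works: the names TRIGGER_VERBS, PURPOSE_PATTERN and extract_purpose are hypothetical, the verb list simply repeats the verbs named in the rule, and the function returns the whole phrase following "for" rather than a parsed head.

```python
import re

# Trigger verbs named in the rule above ("used" plus the listed alternatives).
TRIGGER_VERBS = ("used", "created", "intended", "prepared", "provided")

# Hypothetical surface pattern: a trigger verb, the preposition "for", and the
# rest of the prepositional phrase up to the next punctuation mark.
PURPOSE_PATTERN = re.compile(
    r"\b(?:" + "|".join(TRIGGER_VERBS) + r")\s+for\s+([^,;.()]+)",
    re.IGNORECASE,
)


def extract_purpose(definition):
    """Return the PURPOSE values found in a definition, or an empty list."""
    match = PURPOSE_PATTERN.search(definition)
    if match is None:
        return []
    phrase = match.group(1).strip()
    # Coordinated values ("boats or irrigation") yield one entry each.
    return [part.strip() for part in re.split(r"\s+(?:or|and)\s+", phrase) if part.strip()]


canal = "canal: a manmade or improved natural waterway used for transportation"
lake = "lake/pond: a body of water surrounded by land"
print(extract_purpose(canal))
print(extract_purpose(lake))
```

For the MEGRIN definition of canal the sketch yields the single value "transportation", while the lake/pond definition, which contains no trigger verb followed by "for", yields no PURPOSE value. A real implementation would work on the parse tree so that only the head of the prepositional phrase is kept.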

The methodology for extracting semantic information is used to decompose definitions of geographic categories into a set of semantic properties and relations and their corresponding values. This formalized semantic information is further used to disambiguate similar categories by explicitly and objectively identifying similarities and heterogeneities between them.

Table 2
Examples of semantic properties and relations

Type | Examples
-----|---------
Semantic properties | Purpose, Cause, Location, Time, Material-cover, Size
Semantic relations | Is-a, Is-part-of, Has-part, Adjacent-to, Surrounded-by, Associated-with

More specifically, if the methodology for extracting semantic information is used to analyze the category “lake” as defined by MEGRIN, “lake/pond: a body of water surrounded by land”, the following semantic properties and relations are determined: HYPERNYM with value “body”, MATERIAL with value “water” and SURROUNDED-BY with value “land”. Likewise, the analysis of the same category type as defined by WordNet, “lake: a body of (usually fresh) water surrounded by land”, determines the same semantic properties, relations and values. Therefore, it is evident that the two ontologies define the category “lake” equivalently (Table 3).

Table 3
Determination of semantic information for category “lake”

Category | HYPERNYM | MATERIAL | SURROUNDED BY
---------|----------|----------|--------------
Lake (MEGRIN) | Body | Water | Land
Lake (WordNet) | Body | Water (usually fresh) | Land

If, however, the above methodology is used for the analysis of the category “ditch” as defined by the same ontologies (MEGRIN and WordNet), the resulting semantic properties, relations and values reveal heterogeneities between the definitions of the homonymous categories (Table 4).

Table 4
Determination of semantic information for category “ditch”

Category | HYPERNYM | PURPOSE | SIZE | NATURE
---------|----------|---------|------|-------
Ditch (MEGRIN) | Channel | Irrigation or drainage | - | -
Ditch (WordNet) | Waterway | - | Small | Natural

Table 5 shows the complete set of semantic information (properties and values) that can be identified in the definitions of the 29 categories from the three different ontologies.
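The comparison of the “lake” and “ditch” definitions can be sketched in code by storing each decomposed definition as a mapping from property or relation name to value. This is only an illustrative representation, not one prescribed by the methodology: the function name compare_definitions and the dictionary layout are assumptions, and the values are those of Tables 3 and 4 with the lake material already reduced to “water”.

```python
def compare_definitions(a, b):
    """Split the properties of two categories into agreeing, conflicting and one-sided."""
    shared = set(a) & set(b)
    return {
        "same_value": sorted(p for p in shared if a[p] == b[p]),
        "different_value": sorted(p for p in shared if a[p] != b[p]),
        "only_in_first": sorted(set(a) - set(b)),
        "only_in_second": sorted(set(b) - set(a)),
    }


# Decomposed definitions, following Tables 3 and 4.
lake_megrin = {"HYPERNYM": "body", "MATERIAL": "water", "SURROUNDED-BY": "land"}
lake_wordnet = {"HYPERNYM": "body", "MATERIAL": "water", "SURROUNDED-BY": "land"}
ditch_megrin = {"HYPERNYM": "channel", "PURPOSE": "irrigation or drainage"}
ditch_wordnet = {"HYPERNYM": "waterway", "SIZE": "small", "NATURE": "natural"}

print(compare_definitions(lake_megrin, lake_wordnet))
print(compare_definitions(ditch_megrin, ditch_wordnet))
```

For “lake” all three properties fall under same_value, whereas for “ditch” the hypernyms conflict and the remaining properties appear in only one of the two definitions, which is the kind of heterogeneity the text describes.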

3.1. Findings

Table 5
Properties and values of categories as identified in their definitions in three ontologies

Ontology | Category | Hypernym | Nature | Use/purpose | Material-cover | Is part of | Form/morphology | Size | Location | Surrounded by | Condition-state (attribution)
---------|----------|----------|--------|-------------|----------------|------------|-----------------|------|----------|---------------|------------------------------
CORINE LC | Inland marsh | Land | - | - | - | - | - | - | Low-lying | - | Flooded (TIME: in winter); saturated (MATERIAL-CAUSE: water; TIME: all year round)
CORINE LC | Peat bog | (Peat) land | - | - | Decomposed moss and vegetable matter | - | - | - | - | - | -
CORINE LC | Salt marsh | Area | - | - | Vegetation | - | - | - | Low-lying, above the high-tide line | - | Susceptible to flooding (MATERIAL-CAUSE: sea water)
CORINE LC | Saline | Salt-pan | - | - | (Salt) | - | - | - | - | - | Active or in process of abandonment
CORINE LC | Intertidal flat | Expanse | - | - | Mud, sand or rock | - | - | - | Between high and low water marks | - | Generally unvegetated
CORINE LC | Water course | Water course | Natural or artificial | Water drainage channel | (Water) | - | - | - | - | - | -
CORINE LC | Water body | Stretch | Natural or artificial | - | Water | - | - | - | - | - | -
CORINE LC | Coastal lagoon | Stretch | - | - | Salt or brackish water | - | - | - | Coastal areas | - | -
CORINE LC | Estuary | Mouth | - | - | - | River | - | - | - | - | -
CORINE LC | Sea and ocean | Zones | - | - | - | - | - | - | Seaward of the lowest tide limit | - | -
MEGRIN | Bog | Area | - | - | Soil rich in plant residue | - | - | - | - | - | Poorly drained, periodically flooded
MEGRIN | Canal | Waterway | Manmade or improved natural | Transportation | (Water) | - | - | - | - | - | -
MEGRIN | Lake/pond | Body | - | - | Water | - | - | - | - | Land | -
MEGRIN | Salt marsh | Depression | Natural | - | Salt encrusted clayey soil | - | - | - | In arid/semi-arid regions | - | -
MEGRIN | Salt pan | Area | - | - | Natural surface salt deposits | - | Flat | - | - | - | -
MEGRIN | Watercourse | Course | Natural | - | (Water) | - | - | - | - | - | Flowing
WordNet | Body of water | Part | - | - | Water | Earth’s surface | - | - | - | - | -
WordNet | Bog | Ground | - | - | Decomposing vegetation | - | - | - | - | - | Wet spongy
WordNet | Canal | Strip | - | Boats or irrigation | Water | - | Long and narrow | - | - | - | -
WordNet | Lake | Body | - | - | (Usually fresh) water | - | - | - | - | Land | -
WordNet | Pond | Lake | - | - | - | - | - | Small | - | - | -
WordNet | Salt pan | Basin | - | - | Salt and gypsum | - | Shallow | - | In a desert region | - | -
WordNet | Watercourse | Channel | Natural or artificial | - | - | - | - | - | - | - | -
WordNet | Watercourse | Body | Natural | - | Running water | - | - | - | On or under the earth | - | -
WordNet | Marsh | Land | - | - | Grassy vegetation | - | - | - | Low-lying | - | -
WordNet | Lagoon | Body | - | - | Water | - | - | - | - | - | Cut off from land (ACTOR: a reef of sand or coral)
WordNet | Estuary | Part | - | - | Fresh or salt water | River | Wide | - | Near the sea | - | -
WordNet | Sea | Division | - | - | Salt water | Ocean | - | Large | - | - | (Partially) enclosed (ACTOR: land)
WordNet | Ocean | Body | - | - | Water | Hydrosphere | - | Large | - | - | -

• The presence of hypernyms in definitions may express the “is-a” relation, but the values of hypernyms in definitions differ significantly. A sample of 29 categories from two geospatial ontologies and one lexical database, which all refer to water bodies and watercourses and coincide in naming (11 naming terms for the 29 categories), presents 19 distinct hypernyms. Furthermore, as far as the hypernymic relation is concerned, we can state the following:

o CORINE’s 10 categories, which belong to 4 categories of the intermediate level, which in turn belong to 2 categories of the superordinate level, have 9 distinct hypernyms in their definitions. The definitions do not properly address the taxonomic structure of the hierarchy, i.e., the genera of category terms do not necessarily coincide with their superordinate category terms. Indicatively, we pinpoint two cases of inconsistency: (a) definitions are circular (water courses are water courses...); (b) distinct terms are used that could refer to the same hypernym, for instance the terms area, stretch, zone, expanse, etc.

o MEGRIN’s 6 category definitions have 5 distinct hypernyms; one definition is circular (water course is a course...).

o In WordNet this kind of inconsistency is absent (the hypernymic relation is correctly addressed in the definitions).

o CORINE’s hypernyms do not match those of WordNet at all. All water bodies (such as lagoon, estuary, sea and ocean, water body, watercourse of category 5) in CORINE LC are defined using terms that refer to two-dimensional hypernyms, while in WordNet they are defined using as hypernym the term body, which refers to three-dimensional physical objects. This distinction indicates that CORINE LC is taking a map view because it classifies land cover, not geographic entities.

• (a) CORINE LC is a land cover ontology; consequently, the semantic property “material-cover” is present in most definitions of its categories. Definitions in existing geospatial ontologies (especially task or domain ontologies) are therefore context driven. (b) The same semantic property, however, is also present in the remaining ontologies (only two of the category definitions do not contain lexical information for that semantic property).

• The semantic property “nature” (artificial/manmade) is addressed in only 7 definitions out of the total 29.

• The semantic property “purpose” is present in only 3 out of 29 definitions. This is because natural entities, in contrast to artificial ones, do not have purposes.

• Semantic properties such as “size” and “form/morphology” are not adequately included in definitions either: 3/29 and 4/29, respectively. This is very low, considering that geospatial categories are expected to possess significant properties of size and morphology.

• The meronymic semantic relation “has-part” or “is-part-of” is not widely addressed in definitions; only 5 of the total 29 definitions present such information, 4 of which belong to WordNet (out of a total of 13). The semantic properties “location” and “surrounded by” (both of them denoting topology) are met in only 12 definitions, which again seems low. Both observations are contrary to what is generally expected about the presence of mereotopologic relations in geographic ontologies (Casati et al., 1998).

• The semantic property “time” is also largely absent from definitions.
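Counts of this kind can be derived mechanically once the definitions have been decomposed. The sketch below does so for the six MEGRIN categories only, as an illustration; the dictionary MEGRIN reflects one reading of the MEGRIN rows of Table 5, and the property labels and function names are assumptions introduced here.

```python
from collections import Counter

# MEGRIN rows of Table 5 (one reading of the table); only the properties
# that are present in a definition are listed for each category.
MEGRIN = {
    "Bog": {"hypernym": "area", "material-cover": "soil rich in plant residue",
            "condition-state": "poorly drained, periodically flooded"},
    "Canal": {"hypernym": "waterway", "nature": "manmade or improved natural",
              "purpose": "transportation", "material-cover": "water"},
    "Lake/pond": {"hypernym": "body", "material-cover": "water",
                  "surrounded-by": "land"},
    "Salt marsh": {"hypernym": "depression", "nature": "natural",
                   "material-cover": "salt encrusted clayey soil",
                   "location": "in arid/semi-arid regions"},
    "Salt pan": {"hypernym": "area",
                 "material-cover": "natural surface salt deposits",
                 "form": "flat"},
    "Watercourse": {"hypernym": "course", "nature": "natural",
                    "material-cover": "water", "condition-state": "flowing"},
}


def distinct_hypernyms(categories):
    """Set of hypernym values used across the given definitions."""
    return {props["hypernym"] for props in categories.values() if "hypernym" in props}


def property_coverage(categories):
    """Number of definitions in which each property appears."""
    return Counter(prop for props in categories.values() for prop in props)


print(len(distinct_hypernyms(MEGRIN)))
print(property_coverage(MEGRIN))
```

On this data the six definitions use five distinct hypernyms, in line with the figure reported for MEGRIN above, and every definition carries a material-cover value while only one states a purpose.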

4. Determination and visualization of semantic similarity

In order to determine the similarity between two categories, we take into account the values of the properties/relations they possess. If the values of a given semantic property or relation coincide, then the two category types are similar in terms of that property/relation. If the values of a property/relation are distinct, then the similarity between the two categories in terms of that property/relation is equal to zero.

The similarity measure S between two categories a, b is set by the ratio model (based on Tversky’s similarity measure):

$$S(a, b) = \frac{C}{A + B + C},$$

where C is the number of properties/relations that categories a and b share and for which they also exhibit common values, A is the number of properties/relations of category a but not of b, and B is the number of properties/relations of category b but not of a (examples can be found in Table 6). The ratio is bounded between 0 and 1, the former denoting complete dissimilarity and the latter coincidence of entities.

Table 6
Similarity for categories: lake and (peat) bog based on Table 5

Categories | Similarity S
-----------|-------------
Lake (MEGRIN), Lake (WordNet) | S1,2 = 1.000
Peatbog (CORINE LC), Bog (WordNet) | S1,2 = 0.333
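A minimal sketch of the ratio model follows. The text does not state how a property that both categories possess with different values enters the denominator; the sketch assumes that such a property is counted once, which is an interpretation and not a documented rule. The function name ratio_similarity and the example dictionaries are hypothetical, and in the bog example the two material-cover values are assumed to have been grouped into one common value purely for illustration.

```python
def ratio_similarity(a, b):
    """Ratio-model similarity S = C / (A + B + C) for two property dictionaries."""
    names_a, names_b = set(a), set(b)
    shared = names_a & names_b
    common = sum(1 for name in shared if a[name] == b[name])  # C
    only_a = len(names_a - names_b)                           # A
    only_b = len(names_b - names_a)                           # B
    # Assumption: a shared property with conflicting values counts once
    # in the denominator and contributes nothing to C.
    conflicts = len(shared) - common
    denominator = only_a + only_b + common + conflicts
    return common / denominator if denominator else 0.0


lake_megrin = {"hypernym": "body", "material-cover": "water", "surrounded-by": "land"}
lake_wordnet = {"hypernym": "body", "material-cover": "water", "surrounded-by": "land"}

# Illustrative values: the material-cover values are assumed to be grouped.
peat_bog_corine = {"hypernym": "land", "material-cover": "decomposed vegetation"}
bog_wordnet = {"hypernym": "ground", "material-cover": "decomposed vegetation",
               "condition-state": "wet spongy"}

print(ratio_similarity(lake_megrin, lake_wordnet))
print(round(ratio_similarity(peat_bog_corine, bog_wordnet), 3))
```

Under these assumptions the two lake definitions give S = 1, and the bog example gives 1/3 (one agreeing property, one conflicting hypernym, one property found only in WordNet), which is consistent with the values listed in Table 6.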

In special cases, the following adjustments were made in order to assess similarity appropriately:

• Compound nouns, such as “peatland” (which does not exist in WordNet), the hypernym of peatbog (Table 6), were identified and treated as adjective + noun (peat land) and not as a compound word, so the hypernym was taken to be “land” instead.

• Similar terms were grouped to reduce the range of values of certain properties. Consider, for instance, the values “water, fresh water, brackish water, salt water” of the property “material-cover” for canal, lake, and coastal lagoon, respectively. When establishing similarity, the value for these categories was taken as “water”.
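These two adjustments amount to a normalization step applied to values before they are compared. The sketch below is one hypothetical way to express it: the lookup tables COMPOUND_SPLITS and GROUP_HEADS contain only the cases mentioned above and are assumptions, since the full set of groupings is not listed.

```python
import re

# Hypothetical lookup tables covering only the examples given in the text.
COMPOUND_SPLITS = {"peatland": "peat land"}   # compound noun -> modifier + noun
GROUP_HEADS = {"water", "land"}               # head nouns used as group values


def normalize_value(value):
    """Reduce a property value to its grouped form before comparison."""
    tokens = re.sub(r"[()]", " ", value.lower()).split()
    if not tokens:
        return ""
    phrase = " ".join(tokens)
    phrase = COMPOUND_SPLITS.get(phrase, phrase)
    head = phrase.split()[-1]
    # Values whose head noun is a known group collapse to that head.
    return head if head in GROUP_HEADS else phrase


for raw in ["(Usually fresh) water", "Salt or brackish water", "peatland",
            "(Peat) land", "Decomposing vegetation"]:
    print(raw, "->", normalize_value(raw))
```

With this step, the water-related values of lake and coastal lagoon all reduce to “water” and both spellings of the peat bog hypernym reduce to “land”, while values outside the listed groups are left unchanged apart from lower-casing.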

In order to visualize the different ontologies, we use multi-dimensional scaling (MDS) (Kruskal and Wish, 1978). The method uses a similarity/dissimilarity matrix to project the data into the projection space, which in our case is a two-dimensional space. MDS is a dimensionality reduction method that represents multi-dimensional data sets by minimizing a stress function, so that distances among data points reflect the corresponding (dis)similarities. The value of the stress function is an indicator of the goodness-of-fit of the result. The higher its value, the greater the distortion imposed on the visualization of the entities, i.e., the more the plotted distances deviate from the corresponding dissimilarities. The output is a scatter plot of the data where similar entities are close in the representation space while dissimilar ones are far apart. The visualization result is shown in Fig. 1.

Fig. 1. Visualization output for three ontologies.
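To make the role of the stress value concrete, the sketch below computes Kruskal's stress-1 for a given two-dimensional layout, taking the target dissimilarities directly as the disparities (the metric case). The function name kruskal_stress and the small numeric example are illustrative assumptions; the sketch evaluates a layout and does not perform the MDS optimization itself.

```python
import math


def kruskal_stress(dissimilarities, points):
    """Stress-1 between target dissimilarities and the distances of a layout."""
    numerator = 0.0
    denominator = 0.0
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            distance = math.dist(points[i], points[j])
            numerator += (distance - dissimilarities[i][j]) ** 2
            denominator += distance ** 2
    return math.sqrt(numerator / denominator) if denominator else 0.0


# Three items whose pairwise dissimilarities are 3, 4 and 5.
target = [[0, 3, 4],
          [3, 0, 5],
          [4, 5, 0]]

exact_layout = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]      # distances 3, 4, 5
collinear_layout = [(0.0, 0.0), (3.0, 0.0), (4.0, 0.0)]  # distances 3, 4, 1

print(kruskal_stress(target, exact_layout))
print(kruskal_stress(target, collinear_layout))
```

The first layout reproduces all three dissimilarities and has zero stress; the second misplaces one pair and has a clearly positive stress, which is the sense in which a higher value signals a more distorted picture. If similarities S are the starting point, a dissimilarity such as 1 - S would be one possible input, again as an assumption.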

5. Interpretation of results

As mentioned before, the output of the MDS is the set of coordinates for the examined category types. Subsequently, a clustering method is used to form groups of categories that share common properties, relations and values. We are then able to explore whether differences in naming nevertheless denote the same category, and whether identical names with differing definitions denote distinct categories. In the current approach, we used a hierarchical clustering method to examine in which way the three distinct ontologies contribute to the formation of common upper categories in a unified schema (Fig. 2).
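The specific hierarchical method is not named in the text. As one possible illustration, the sketch below applies single-linkage agglomerative clustering to two-dimensional coordinates and stops merging once the closest pair of clusters is farther apart than a threshold. The linkage choice, the threshold, the function name single_linkage and all coordinates are assumptions made up for the example; they are not the MDS output of this study.

```python
import math


def single_linkage(points, threshold):
    """Agglomerative clustering: merge the closest clusters until none is within threshold."""
    clusters = [[name] for name in points]

    def gap(c1, c2):
        # Single linkage: distance between the two closest members.
        return min(math.dist(points[a], points[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        pairs = [(gap(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        best, i, j = min(pairs)
        if best > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(cluster) for cluster in clusters]


# Made-up coordinates standing in for MDS output.
coordinates = {
    "Lake (MEGRIN)": (0.0, 0.0),
    "Lake (WordNet)": (0.0, 0.0),
    "Bog (WordNet)": (2.0, 1.0),
    "Peat bog (CORINE LC)": (2.5, 1.5),
    "Canal (MEGRIN)": (5.0, 5.0),
}

for cluster in single_linkage(coordinates, threshold=1.0):
    print(cluster)
```

With these illustrative coordinates the two lake categories form one group, the two bog categories another, and the canal remains on its own. Categories from different ontologies that land in the same group are candidates for a common upper category, while homonymous categories that land in different groups point to heterogeneous definitions.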

The analysis of properties/relations and their values (in the findings of Section 3) indicated whether and which ontological assertions (such as an unambiguous taxonomic structure) can be derived from definitions. As a result, the following guiding principles for the definitions of categories in geospatial ontologies could be useful:

• Basic ontological semantic relations (meronymy, hypernymy, hyponymy) should be present in definitions due to their expressiveness and rich semantics.

• Category definitions in ontologies should address the taxonomic structure of the categorization correctly. Any inconsistency between the definition’s hypernym and the superordinate category term itself (when the categorization is hierarchical) misrepresents the hierarchy.

• Definitions should account for the so-called special features of geospatial categories, such as morphology and location/topology, which, according to the previous analysis, does not seem to be the case.

• Domain and task ontologies have context-driven definitions, which is not a drawback. These definitions, however, should not contradict general knowledge of the given categories and should reflect to some extent the way they are construed by humans; otherwise they are superficial and not widely accepted.

Fig. 2. Resulting clusters showing heterogeneities among same category types (terms) in different ontologies.

6. Conclusions and further work

The research presented focuses on the determination of semantic information from definitions of geographic categories in order to identify and formalize similarities and heterogeneities. Visualization of semantic similarity proves to be a very useful tool for the association of similar categories. Portraying similarities/dissimilarities in a projection space gives us a concrete measure of the heterogeneity of distinct ontologies. We can then draw inferences

• as to what extent different ontologies can be integrated,
• about the associations between category types,
• concerning the comparison of categories’ definitions in a cross-ontological examination.

The purpose of the present work was to demonstrate
the difficulty of dealing with category semantics. It also
presents an alternative to customary approaches, which
manually determine similarities and heterogeneities
between category types, mainly on the basis of similarity
between category terms. However, similarity in terms
does not necessarily imply equivalent category types.
Besides dealing with categories superficially, such
approaches often misapprehend the
intentions of the original designers. In contrast,
the present work formalizes semantic information
immanent in definitions of category types. Definitions
are usually the main available, semantically rich
component of geographic data collections, and they reflect
the intentions of the original designers. The result of
semantic similarity determination and visualization can
be used as a pre-processing step for the semantic integration
of geographic categorizations.
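
The point that similar terms need not denote equivalent category types can be illustrated with a minimal sketch. It assumes that each definition has already been reduced to a set of (property or relation, value) pairs, and it uses a plain Jaccard ratio as a stand-in similarity measure, which is not necessarily the measure applied in this paper. The ontology names, categories, and values below are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical extracted semantics: each (ontology, term) key maps to a set
# of (property-or-relation, value) pairs. All values are illustrative only.
CATEGORIES = {
    ("ontology_A", "canal"): {
        ("hypernym", "watercourse"),
        ("purpose", "navigation"),
        ("nature", "artificial"),
    },
    ("ontology_A", "ditch"): {
        ("hypernym", "watercourse"),
        ("purpose", "irrigation"),
        ("nature", "artificial"),
    },
    ("ontology_B", "canal"): {
        ("hypernym", "watercourse"),
        ("purpose", "irrigation"),
        ("nature", "artificial"),
    },
}


def jaccard(a, b):
    """Share of (property, value) pairs common to both sets (assumed measure)."""
    if not a and not b:
        return 1.0  # two empty descriptions are treated as identical
    return len(a & b) / len(a | b)


def pairwise_similarity(categories):
    """Return the similarity of every unordered pair of categories."""
    return {
        (k1, k2): jaccard(categories[k1], categories[k2])
        for k1, k2 in combinations(sorted(categories), 2)
    }


similarities = pairwise_similarity(CATEGORIES)
for (k1, k2), score in similarities.items():
    print(k1, k2, round(score, 2))
```

In this toy data the two categories named "canal" share only half of their extracted pairs, whereas "ditch" in one ontology and "canal" in the other are described identically. A comparison based on terms alone would suggest the opposite association.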

Finally, it should be noted that the approach was
not intended to produce “perfect” results. Emphasis was
put on objectivity and automation (avoiding ad hoc
manual procedures and subjective expert knowledge).
Furthermore, any resulting “imperfect” results
have a value of their own, for they reveal (and provide
an opportunity to fix) imperfections in the original
taxonomy definitions and help engineer better
ontologies in the future.

Acknowledgments

This work has been partially supported by the

Heraclitus Research Programme 2.2.3.b of the Hellenic

Ministry of National Education. The authors are also

indebted to the anonymous reviewers for their very

constructive comments.

References

Barriere, C., 1997. From a children’s first dictionary to a lexical

knowledge base of conceptual graphs. Ph.D. Dissertation.

Simon Fraser University, Vancouver, BC, Canada, 339 pp.

Casati, R., Smith, B., Varzi, A., 1998. Ontological tools for

geographic representation. In: Guarino, N. (Ed.), Formal

Ontology in Information Systems. IOS Press, Amsterdam,

pp. 77–85.

Jensen, K., Binot, J.L., 1987. Disambiguating prepositional

phrase attachments by using on-line dictionary definitions.

Computational Linguistics 13 (3/4), 251–260.

Kavouras, M., Kokla, M., 2002. A method for the formalization
and integration of geographical categorizations. International
Journal of Geographical Information Science 16 (5), 439–453.

Klavans, J., Chodorow, M., Wacholder, N., 1993. Building a

knowledge base from parsed definitions. In: Jensen, K.,

Heidorn, G., Richardson, S. (Eds.), Natural Language

Processing: The PLNLP Approach. Kluwer Academic

Publishers, Dordrecht, The Netherlands.

Kokla, M., Kavouras, M., 2001. Fusion of top-level and
geographical domain ontologies based on context formation
and complementarity. International Journal of Geographical
Information Science 15 (7), 679–687.

Kokla, M., Kavouras, M., 2002. Extracting latent semantic
relations from definitions to disambiguate geographic
ontologies. In: GIScience 2002 Abstracts, Second International
Conference on Geographic Information Science.
Boulder, CO, pp. 87–90.

Kruskal, J.B., Wish, M., 1978. Multidimensional scaling. Sage
University Paper Series on Quantitative Applications in the
Social Sciences, Number 07-011. Sage Publications,
Newbury Park, CA, 96 pp.

Ravin, Y., 1993. Disambiguating and interpreting verb definitions.
In: Jensen, K., Heidorn, G.E., Richardson, S.D.
(Eds.), Natural Language Processing: The PLNLP Approach.
Kluwer Academic Publishers, Dordrecht, The
Netherlands, pp. 175–189.

Rodríguez, A., Egenhofer, M., 2003. Determining semantic
similarity among entity classes from different ontologies.
IEEE Transactions on Knowledge and Data Engineering 15
(2), 442–456.

Rodríguez, A., Egenhofer, M., Rugg, R., 1999. Assessing
semantic similarities among geospatial feature class definitions.
In: Vckovski, A., Brassel, K., Schek, H.-J. (Eds.),
Interoperating Geographic Information Systems, Second
International Conference, INTEROP’99, Zurich, Switzerland.
Lecture Notes in Computer Science, vol. 1580.
Springer, Berlin, pp. 189–202.

Skupin, A., 2002. A cartographic approach to visualizing
conference abstracts. IEEE Computer Graphics and Applications
22 (1), 50–58.

Sowa, J.F., 2000. Knowledge Representation: Logical, Philosophical,
and Computational Foundations. Brooks Cole
Publishing Co., Pacific Grove, CA, 594 pp.

Swartz, N., 1997. Definitions, dictionaries, and meanings,

http://www.sfu.ca/philosophy/swartz/definitions.htm.

Tomai, E., Kavouras, M., 2002. “Sharpening” vagueness:
identifying, measuring, and portraying its impact on
geographic categories. In: GIScience 2002 Abstracts,
Second International Conference on Geographic Information
Science. Boulder, CO, pp. 189–192.

Tomai, E., Kavouras, M., 2004. From “onto-geonoesis” to
“onto-genesis”: the design of geographic ontologies.
GeoInformatica 8 (3), 285–302.

Uitermark, H.T., 2001. Ontology-based geographic data set

integration. Ph.D. Dissertation. Deventer, The Netherlands,

139 pp.

Vanderwende, L., 1995. The analysis of noun sequences using

semantic information extracted from on-line dictionaries. Ph.D.

Dissertation. Faculty of the Graduate School of Arts and

Sciences, Georgetown University, Washington, DC, 312 pp.

Vckovski, A., Brassel, K., Schek, H.-J. (Eds.), 1999. Interoperating
geographic information systems. Second International Conference,
INTEROP’99, Zurich, Switzerland. Lecture Notes in
Computer Science, vol. 1580. Springer, Berlin, 327 pp.

Wache, H., Vögele, T., Visser, U., Stuckenschmidt, H., Schuster,
G., Neumann, H., Hübner, S., 2001. Ontology-based
integration of information—a survey of existing approaches.
In: Proceedings of IJCAI-01 Workshop: Ontologies and
Information Sharing, Seattle, WA, pp. 108–117.