Acquiring OWL Ontologies from Data-Intensive Web Sites


Sidi Mohamed Benslimane
LIRIS Laboratory, Claude Bernard University, Villeurbanne, France
[email protected]

Djamal Benslimane
LIRIS Laboratory, Claude Bernard University, Villeurbanne, France
[email protected]

Mimoun Malki
EEDIS Laboratory, University of Sidi-Bel-Abbes, Algeria
[email protected]

ABSTRACT

The availability and proliferation of ontologies are crucial for the success of the Semantic Web. As a consequence, many researchers are working on methods and techniques to build ontologies through automatic or semi-automatic processes that acquire knowledge from texts, dictionaries, and structured and semi-structured information sources. Reverse engineering, when applied to software engineering, uses a collection of theories, methodologies and techniques to support information abstraction and extraction from a piece of software. In this paper we present a semi-automatic reverse engineering approach for acquiring an OWL ontology that corresponds to the content of a relational database. Our approach is based on the idea that the semantics extracted by analyzing HTML forms can be used to restructure and enrich the relational schema. The OWL ontology is then constructed from the enriched schema through a set of transformation rules. The main purpose of this construction is to make the relational database information that is available on the Web machine-processable and to reduce the time-consuming task of ontology creation.

Categories and Subject Descriptors D.2.7 [SOFTWARE ENGINEERING]: Distribution, Maintenance, and Enhancement – Restructuring, reverse engineering, and reengineering; H.3.5 [INFORMATION STORAGE AND RETRIEVAL]: Online Information Services – Data sharing; I.2.4 [ARTIFICIAL INTELLIGENCE]: Knowledge Representation Formalisms and Methods – Representation languages.

General Terms Design, Languages, Theory.

Keywords Ontologies extraction, Relational Databases, OWL, HTML-forms, Reverse engineering.

1. INTRODUCTION

The principal challenge for the Semantic Web community is to make machine-readable much of the material that is currently human-readable, and thereby move Web operations from their current information-based state to a knowledge-centric form. This is done by adding machine-understandable content to Web resources. Such added content takes the form of ontologies. Lately, ontologies have become the focus of research in several areas, including knowledge engineering and management, information retrieval and integration, agent systems, the Semantic Web, and e-commerce. The availability and proliferation of formal ontologies are crucial for the success of the Semantic Web. Nevertheless, building ontologies is so costly that it hampers the progress of the Semantic Web activity. Manual construction of ontologies [1, 2] remains a tedious, time-consuming, error-prone and cumbersome task that easily causes a knowledge acquisition bottleneck. Automatic building of ontologies from existing information [3] is relevant, but fully automated tools are still at a very early stage of development. Therefore, semi-automatic ontology extraction is seen as the practical short-term solution and an important topic for ontology research. Reverse engineering appears to be an interesting way to reach this objective. It is defined as a process of analyzing a “legacy” system to identify all the system’s components and the relationships between them [4]. However, few approaches consider ontologies as the target of reverse engineering, and those that do usually require more input information than can be provided in practice. In particular, building an ontology from an analysis of the relational schema can be limited by the completeness and correctness of the input information:

• To improve performance, database designers often break the rules of good database design by optimizing and de-normalizing the relational schema.

• The complete information about the relational database, such as functional and inclusion dependencies, is usually not available [5].

• Since a relational schema does not support all constructs of a conceptual schema, some of the semantics captured in the conceptual schema – e.g. inheritance – will necessarily be lost when translating the schema from conceptual to relational [6].

• The names of relations and attributes in the relational schema are often abridged (e.g. CUST_NB, StuName, S_125_AZE), so it is difficult or even impossible to deduce the meaning of the data from those names [7].

As an attempt to fill the gap in the area of reverse engineering relational databases into ontologies, we propose a novel approach that semi-automatically creates an OWL ontology corresponding to the content of a relational database, based on an analysis of its HTML forms, and makes this ontology available to humans and machines. Our approach is based on the idea that the semantics extracted by analyzing HTML forms can be used to restructure and enrich the relational schema. The OWL ontology is then constructed from the enriched schema through a set of transformation rules.

This paper is organized as follows. Section 2 discusses related work on reverse engineering relational databases into ontologies. Section 3 gives an overview of our reverse-engineering framework. The extraction rules of the forms schema are presented in Section 4. Section 5 describes the enrichment process of the relational schema, whereas Section 6 details the rules for constructing the OWL ontology from the enriched schema. Finally, Section 7 contains concluding remarks and suggests some future work.

Copyright is held by the author/owner(s). ICWE'06, July 11-14, 2006, Palo Alto, California, USA. ACM 1-59593-352-2/06/0007.

2. RELATED WORKS

Much work has been done on reverse engineering relational databases, suggesting methods and rules for explicitly defining semantics in a database schema [8], extracting semantics out of a database schema [4], and transforming a relational model into an object-oriented model [9, 10, 11]. However, the semantics obtained by these methods cannot fully meet the requirements of ontology construction. Although the object-oriented model is close to an ontological theory, there are still some differences between them. For example, the object-oriented model has no hierarchies of properties and no cardinality constraints on properties. These methods therefore cannot be used directly to discover an ontology from a relational database. In recent years, some approaches that consider ontologies as the target for reverse engineering relational databases have been proposed. These approaches fall roughly into one of three categories:

1. Approaches based on an analysis of user queries. E.g. Kashyap’s approach [12] builds an ontology based on an analysis of the relational schema; the ontology is then refined by user queries. However, this approach does not create axioms, which are part of the ontology.

2. Approaches based on an analysis of the relational schema. E.g. Stojanovic et al’s approach [13] provides a set of rules for mapping constructs in the relational database to semantically equivalent constructs in the ontology. These rules are based on an analysis of relations, keys and inclusion dependencies (which are often not available). However, its target is an RDF(S) ontology, which has unclear semantics and no inference model and is therefore ill-suited to automated tasks. Dogan & Islamaj’s approach [14] provides simple and fully automatic reverse engineering: relations map to classes, attributes in the relations map to attributes in the classes, and tuples in the relational database map to instances in the ontology. However, this approach ignores inheritance, which leads to an ontology that looks like a “relational” model. Rubin et al’s approach [15] proposes automating the process of filling in the instances of an ontology and their attribute values using data extracted from external relational sources. This method uses a declarative interface between the ontology and the data source, modelled in the ontology and implemented in XML schema. This approach needs several components: the ontology, the XML schema, and an XML translator.

3. Approaches based on an analysis of tuples. E.g. Astrova’s approach [16]. Since the relational schema often has little explicit semantics [17], this approach also analyzes tuples in the relational database to discover additional “hidden” semantics (e.g. inheritance). However, this approach is very time-consuming, since its cost grows with the number of tuples in the relational database.

As an attempt to solve the common problems of reverse engineering, interest has turned to extracting the semantics of data in a relational database by analyzing HTML pages. This analysis has focused on the generation of wrappers [18, 19, 20]. The main advantage of wrappers is that they can reconstruct (part of) a relational database “hidden” behind HTML forms when the relational schema is unknown. The downside is that any change to the structure of the HTML can break the wrappers and thus the ontologies that are based on them. HTML pages are redesigned often, typically more than twice a year [21]. Recently, Astrova’s approach [22] constructs an ontology by analyzing the HTML forms to extract a form model schema, transforming the form model schema into an ontology, and creating ontological instances from the data contained in the pages. The drawback of this approach is that it offers no way to identify inheritance relationships, which are a significant aspect of ontology construction.

3. OUR APPROACH

To overcome this limitation, we propose a novel approach for reverse engineering relational databases into ontologies. The approach is organized in three phases: i) extract the forms schema by analyzing HTML pages, ii) restructure and enrich the relational schema using the semantics of the forms schema, and iii) construct the OWL ontology from the enriched relational schema using a set of transformation rules, without requiring an intermediate model as in [23]. Using information extracted both from the HTML forms used for sending user queries and from the HTML tables returned as query results is justified by the fact that HTML forms¹ are often the most popular and convenient interfaces for entering, changing and viewing data in current data-intensive Web pages; important information can therefore be obtained by analyzing them. The proposed architecture of our approach is depicted in Figure 1. Its main components are:

[Figure 1: diagram not reproduced. Recoverable labels: Data-intensive Web application (HTML pages, databases, instances); Filtering; the phases Identification, Generation, Enrichment, Ontologisation and Population; Extraction Engine (Identification Rules, Generation Rules); Enrichment Engine (Enrichment Rules); Transformation Engine (Construction Rules, Migration Rules); the artifacts Relational schema, Form model schema, XML Schema of Forms, Enriched Relational schema, Ontology structure and Ontology instances.]

Figure 1. Ontology Extraction Framework

- The Extraction Engine consists of two sets of extraction rules. The first set analyzes the HTML pages to identify constructs in the form model schema. The second set extracts a form XML schema from the constructs of the form model schema, and derives the domain semantics by extracting the sub-schemas of forms and their dependencies.

- The Enrichment Engine consists of a set of enrichment rules that integrate the semantics extracted from the schema of forms into the relational schema of the database.

- The Transformation Engine consists of two sets of translation rules. The first set provides an automatic translation from the enriched relational schema to OWL ontological constructs. Its rules are organized into four groups: rules for constructing classes, rules for constructing properties, rules for constructing the hierarchy and rules for constructing axioms. The second set is responsible for creating ontological instances from relational tuples.

¹ In what follows, “HTML forms” denotes both HTML forms and HTML tables.

To illustrate our approach, we use a Web site for booking flights at http://www.airalgerie.dz. Two of the application’s HTML pages are shown in Figure 2, namely a Booking-Form and a Program of flights Table. The underlying source is a relational database whose schema is shown in Table 1. Underlined attributes make up the primary key, while slanted attributes indicate a foreign key.

Figure 2. HTML pages along with HTML-Form and HTML-Table

4. FORMS SCHEMA EXTRACTION

4.1 Analysis of HTML pages structure

The main goal of this phase is to understand the meaning of a form and make its structure explicit. To do so, we analyze the HTML forms (both their structure and the data they contain) to identify their components and interrelationships, and extract a form model schema. The form model schema was originally proposed as a model suited to the task of database reverse engineering [24].

4.1.1 The form model

The model allows abstracting any database form, that is, making explicit its components (fields as well as objects) and their interrelationships. It is similar but not identical to the models presented in [25, 26]. Basically, the model consists of the following constructs:

- Form type: a structured collection of empty fields formatted to communicate with databases. A particular representation of a form type is called a form template. A form template defines the structure, constraints and presentation of the form fields, and represents the form’s intension as perceived by users. The three basic components of any template are the title, the captions and the entries.

- Structural unit: a group of homogeneous pieces of information, that is, an object that groups closely related form fields. Each structural unit is a logical sub-part of a form type and generally corresponds to an area of the form layout.

- Form instance: an occurrence of a form type. This is the extensional part obtained when a form template is filled in with data. Figure 2 shows instances of the “Booking form” and “Program of flight” form types.

- Form field: an aggregation of a caption with its associated entry. The caption is pre-displayed on the form and serves as a clue to what is to be filled in by the respondent, as well as a guide for entering or reading it on the form. Note that a form field can still be identified when an entry has no caption, or a caption has no entry. Each form field entry is generally linked to an attribute of one table in the underlying database; we use the term linked attribute to designate this attribute. Some form fields are computed; others are simply not linked to the relational database. We distinguish three types of fields:
  - Filling fields (e.g. INPUT elements of type TEXT, CHECKBOX or RADIO, and the TEXTAREA tag), which are an aggregation of a name and its associated entry.
  - Selection fields (e.g. the SELECT tag), which let the user select one choice, or more than one (MULTIPLE attribute).
  - Link fields (the HREF attribute of the A tag), which are used to link two or more forms (pages).

- Underlying source: the structure of the relational database, which defines relations and attributes along with their data types.

- Relationship: a connection between structural units that relates one structural unit to another (or back to itself).

- Constraint: a rule that defines what data is valid for a given form field. A cardinality constraint specifies, for an association relationship, the number of instances in which a structural unit can participate.

Table 1. Relational database schema

Passenger (PassengerID, FN, LN, Age)
City (CityID)
Departure-City (CityID, DC-Name)
Arrival-City (CityID, AC-Name)
Date (DepartureDate)
Hour (HourID)
Departure-Hour (HourID, type)
Arrival-Hour (HourID, type)
Company (CompagnyID, CompanyName)
Plane (PlaneID, CompID, Capacity)
Leaving-From (FlightID, DepartureCityID)
Going-To (FlightID, ArrivalCityID)
Flight (FlightID, Dep-CityID, Arr-CityID, Dep_HourID, Arr_HourID, PlaneID)
Book (PassengerID, FlightID, DepartureDate, Class)

4.1.2 Identification rules of form model schema

The rules below briefly summarise the transformations used to identify the form model constructs. They are part of the extraction engine component (Figure 1).

Rule 1: Identifying form instances. To clearly distinguish the different kinds of information in a document, Web pages are usually split into multiple areas, each created using specific tags. In our approach we perform a filtering process and consider both the section between the opening and closing <form> tags, which is used to access and update the relational database, and the sections between the opening and closing <table>, <td>, <tr>, <li> and <ul> tags, which are returned as query results and represent a particular view of the relational database.
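The filtering step of Rule 1 can be illustrated with a minimal sketch, assuming Python's standard `html.parser` module. The class name `FormInstanceFilter`, the helper `extract_form_instances` and the sample page are hypothetical and not part of the original framework; the sketch only tracks the outer `form`, `table` and `ul` containers and collects the text found inside them, discarding everything else.

```python
from html.parser import HTMLParser

# Container tags kept by the filter (a simplification: <td>, <tr> and <li>
# are only ever seen nested inside these containers).
KEPT_TAGS = {"form", "table", "ul"}


class FormInstanceFilter(HTMLParser):
    """Collects the text found inside <form>, <table> and <ul> sections."""

    def __init__(self):
        super().__init__()
        self._open = []      # stack of (tag, [text, ...]) for open sections
        self.sections = []   # finished sections, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in KEPT_TAGS:
            self._open.append((tag, []))

    def handle_endtag(self, tag):
        if self._open and self._open[-1][0] == tag:
            self.sections.append(self._open.pop())

    def handle_data(self, data):
        text = data.strip()
        # Text outside any kept section is filtered out.
        if text and self._open:
            self._open[-1][1].append(text)


def extract_form_instances(html):
    """Return the kept sections of a page as (tag, texts) pairs."""
    parser = FormInstanceFilter()
    parser.feed(html)
    parser.close()
    return parser.sections


# Invented sample page, for illustration only.
PAGE = """
<html><body>
<h1>Welcome</h1>
<form action="/book"><label>Departure City</label><input name="dep"></form>
<table><tr><th>Flight</th><th>Plane</th></tr>
<tr><td>F1</td><td>P1</td></tr></table>
<p>Footer</p>
</body></html>
"""
sections = extract_form_instances(PAGE)
```

On this sample, the heading and the footer are dropped, while the form caption and the table cells are kept as two separate candidate form instances.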

Rule 2: Identifying linked attributes. Linked attributes are identified by examining the HTML code for structural tags such as <thead> and <th> [27]. If the linked attributes are not separated by structural tags (merged data), we can use visual cues [21, 19]. This typically assumes that there are some separators (e.g. blank areas) that help users to split the merged data. We can also look for linked attributes among the attributes of the relational schema, because a given HTML page may contain only a part of the attributes of the relational schema.
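As a rough illustration of Rule 2, the sketch below collects the captions found in `<th>` cells and looks them up among the attributes of the relational schema. It assumes Python's standard `html.parser`; the names `HeaderCollector`, `normalize` and `match_linked_attributes`, as well as the name-normalization heuristic (lower-casing and dropping non-alphanumeric characters), are assumptions made for this example and are not taken from the paper.

```python
import re
from html.parser import HTMLParser


class HeaderCollector(HTMLParser):
    """Collects the text of every <th> cell in a page."""

    def __init__(self):
        super().__init__()
        self._in_th = False
        self.headers = []

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self._in_th = True
            self.headers.append("")

    def handle_endtag(self, tag):
        if tag == "th":
            self._in_th = False

    def handle_data(self, data):
        if self._in_th:
            self.headers[-1] += data


def normalize(name):
    """Lower-case a name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def match_linked_attributes(html, schema_attributes):
    """Map each <th> caption to a schema attribute, or None if none matches."""
    collector = HeaderCollector()
    collector.feed(html)
    by_key = {normalize(a): a for a in schema_attributes}
    return {h.strip(): by_key.get(normalize(h)) for h in collector.headers}


# Invented sample table header, for illustration only.
TABLE = "<table><tr><th>Flight Number</th><th>Plane ID</th><th>Price</th></tr></table>"
links = match_linked_attributes(TABLE, ["FlightNumber", "PlaneID", "Capacity"])
```

Captions that map to `None` (here "Price") are the ones an analyst would have to resolve by hand, for example as computed or unlinked fields.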

Rule 3: Identifying structural units. To determine the logical structure of an HTML page, we can use visual cues [21]. E.g. users might consider FirstName, LastName and Age in Figure 2 as a single group (passenger), simply because they are all specifications of the same object.

Rule 4: Identifying relationships. An association can be indicated by the fact that two structural units appear on the same page: if two structural units appear together, they might be logically related to each other. Since the relational database information typically does not reside on a single HTML page, we can also look for relationships in hyperlinks, which can in many cases be interpreted as semantic relations between structural units.

Rule 5: Identifying constraints. In addition to the structure of HTML pages, we also analyze the data in the pages to identify constraints. The data analysis uses a strategy of learning by examples, borrowed from machine learning techniques [28]. E.g. from Figure 2 we would identify a “Not Null” constraint on the linked attributes Departure-City and Arrival-City, because they contain non-null values in every “Booking-form” instance.
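The "Not Null" example of Rule 5 can be sketched as follows. This is only a simple learning-by-examples heuristic written for illustration, not the technique of [28]: a field is proposed as "Not Null" when it is filled in every observed form instance, so further instances could still refute the candidate. The function name `infer_not_null` and the sample data are invented.

```python
def infer_not_null(instances):
    """Return the fields that hold a non-empty value in every form instance."""
    if not instances:
        return set()
    # All field names seen in at least one instance.
    fields = set().union(*(inst.keys() for inst in instances))
    return {
        f for f in fields
        if all(inst.get(f) not in (None, "") for inst in instances)
    }


# Invented "Booking-form" instances, for illustration only.
booking_instances = [
    {"Departure-City": "Algiers", "Arrival-City": "Paris", "Age": "34"},
    {"Departure-City": "Oran", "Arrival-City": "Lyon", "Age": ""},
]
not_null = infer_not_null(booking_instances)
```

Here `Age` is left out because one instance leaves it empty, while Departure-City and Arrival-City become candidate "Not Null" attributes.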

4.2 Generation of form XML-schema

Once the structure of the form type has been extracted, the corresponding XML schema is generated using a set of translation rules between the concepts of the form model and those of XML Schema. As the form is a structured document, the translation is done systematically by the following transformation rules.

Rule 1: Each structural unit in the form type is translated into a complexType element in the corresponding XML schema. Example: the structural unit “passenger” is translated as follows:
<xsd:complexType name="passenger"> … </xsd:complexType>

Rule 1 is applied recursively to the components of a complex structural unit. Example: the complex field “Period from (Day, Month, Year)” in the “date of departure” structural unit is also translated into a complexType element:
<xsd:complexType name="PeriodFrom"> … </xsd:complexType>

Rule 2: Each form field of the structural unit is translated into a sub-element of the corresponding complexType element. The primitive type of the element is that of the field. Example: the field “FirstName” is translated as a string type:
<xsd:element name="firstname" type="xsd:string"/>

Rule 3: If the structural unit contains simple filling fields (e.g. an INPUT of type TEXT), the corresponding complexType element takes minOccurs="1" and maxOccurs="1" as occurrence.

Rule 4: If the structural unit contains multiple filling fields (e.g. fields with the MULTIPLE attribute), the corresponding complexType element takes maxOccurs="unbounded" (the XML Schema spelling of “*”) as maximum occurrence.

Rules 3 and 4 are applied recursively to the form fields of each structural unit.
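The four rules above can be sketched with Python's standard `xml.etree.ElementTree`. The dictionary layout used to describe a structural unit, the helper names `q` and `unit_to_complex_types`, and the choice of a named type per complex field are assumptions of this example. One deliberate deviation is worth noting: XML Schema only allows `minOccurs` and `maxOccurs` on element declarations and model groups, so the sketch attaches the occurrence constraints of Rules 3 and 4 to the generated sub-elements rather than to the complexType itself.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xsd", XSD_NS)


def q(tag):
    """Qualify a tag name with the XML Schema namespace."""
    return f"{{{XSD_NS}}}{tag}"


def unit_to_complex_types(unit):
    """Return the complexType elements for a unit and its nested units.

    `unit` is a hypothetical dict: {"name": ..., "fields": [...]}, where a
    field with its own "fields" key is a complex field.
    """
    ctype = ET.Element(q("complexType"), {"name": unit["name"]})
    seq = ET.SubElement(ctype, q("sequence"))
    result = [ctype]
    for field in unit["fields"]:
        attrs = {"name": field["name"]}
        if "fields" in field:
            # Rule 1, applied recursively: a complex field gets its own type.
            attrs["type"] = field["name"]
            result.extend(unit_to_complex_types(field))
        else:
            # Rule 2: a simple field becomes a sub-element of primitive type.
            attrs["type"] = "xsd:" + field.get("type", "string")
        # Rules 3 and 4: occurrence constraints.
        attrs["minOccurs"] = "1"
        attrs["maxOccurs"] = "unbounded" if field.get("multiple") else "1"
        ET.SubElement(seq, q("element"), attrs)
    return result


passenger = {
    "name": "passenger",
    "fields": [
        {"name": "firstname", "type": "string"},
        {"name": "lastname", "type": "string"},
    ],
}
types = unit_to_complex_types(passenger)
xml_text = ET.tostring(types[0], encoding="unicode")
```

Printing `xml_text` shows a single `xsd:complexType` named `passenger` whose sequence holds one `xsd:element` per form field.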

4.3 Construction of the hierarchical structure of forms

To obtain a precise view of the hierarchical relationships of a form, to understand its meaning clearly, and to facilitate the interpretation and extraction of the domain semantics, the form XML schema is transformed into a form hierarchical structure without loss of information. Formally, the hierarchical structure of a form is defined by the triplet (N, TN, L), where:

- N is the set of non-terminal nodes of the hierarchical structure,

- TN is the set of terminal nodes of the hierarchical structure,

- L is the set of parent-child links between nodes.

This process, which is automatic and transparent to the designer, constructs the hierarchical structure in four steps:

- Defining the root node (with level 0), whose name is the form’s title.

- Transforming all complex elements into non-terminal nodes (with level 1). This step recursively transforms the complex sub-elements into non-terminal sub-nodes (with level 2, 3, etc.).

- Transforming all simple elements and attributes into terminal nodes.

- Identifying the link type between two nodes of the tree according to the occurrence value (maxOccurs=1 or maxOccurs=n).
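The four construction steps can be sketched as a small recursive traversal. The nested-dictionary input format, the function name `build_hierarchy` and the assumption that node names are unique within a form are choices made for this illustration; the booking structure below is loosely based on the running example, with field names partly invented.

```python
def build_hierarchy(form_title, elements):
    """Build the triplet (N, TN, L) from a nested description of a form.

    `elements` is a list of dicts; a dict with a "children" key stands for a
    complex element, any other dict for a simple element or attribute.
    Node names are assumed to be unique within the form.
    """
    non_terminals = {form_title: 0}   # step 1: root node at level 0
    terminals = set()
    links = []                        # (parent, child, "1" or "n")

    def visit(parent, items, level):
        for item in items:
            # Step 4: the link type follows the occurrence value.
            link_type = "1" if item.get("maxOccurs", "1") == "1" else "n"
            links.append((parent, item["name"], link_type))
            if "children" in item:
                # Step 2: complex elements become non-terminal nodes.
                non_terminals[item["name"]] = level
                visit(item["name"], item["children"], level + 1)
            else:
                # Step 3: simple elements become terminal nodes.
                terminals.add(item["name"])

    visit(form_title, elements, 1)
    return non_terminals, terminals, links


booking = [
    {"name": "passenger", "children": [
        {"name": "FirstName"}, {"name": "LastName"}, {"name": "Age"},
    ]},
    {"name": "DateOfDeparture", "children": [
        {"name": "PeriodFrom", "children": [
            {"name": "Day"}, {"name": "Month"}, {"name": "Year"},
        ]},
    ]},
]
N, TN, L = build_hierarchy("Booking Form", booking)
```

In this example `N` maps each non-terminal node to its level, so `PeriodFrom` ends up at level 2 below `DateOfDeparture`.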

5. ENRICHMENT OF RELATIONAL SCHEMA

The goal of this phase is to augment the semantics of the relational schema. To gather additional information about the relational schema, we derive the sub-schemas of the forms from their hierarchical structure and their instances, in accordance with the physical schema of the underlying database.

5.1 Forms semantic extraction

First, the relations and their primary keys are identified by comparing the structural units (nodes) of the forms with the underlying database. Then the functional dependencies, inclusion dependencies and constraints are extracted from both the hierarchical structure of the forms and their instances. All of these semantics are used afterwards to enrich the relational schema.

5.1.1 Form relations extraction

Forms either permit the updating of relations in the underlying database or represent a view that is often a join of relations. Therefore, each field entry is generally linked to an attribute of one relation in the underlying database. Identifying the form relations and their primary keys consists of determining the equivalence and/or similarity between the structural units (nodes) of the hierarchical structure and the relations in the underlying database. This is a key point from a reverse engineering point of view [24]. A node of a form hierarchical structure may be:

- Equivalent to a relation in the underlying database, i.e. the two objects (node and relation) have the same set of attributes;

- Similar to a relation, i.e. its set of attributes is a subset of that of the relation;

- A set of relations, i.e. its set of attributes groups together several relations of the underlying database.
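The three cases above amount to set comparisons between the attributes of a node and those of the relations. The sketch below is one possible reading, assuming attribute names have already been reconciled (e.g. that the form caption and the column name are spelled the same); the function name `classify_node` and the "unknown" outcome, which stands for the cases handed over to the analyst, are hypothetical.

```python
def classify_node(node_attrs, relations):
    """Classify a form node against the relations of the underlying database.

    Returns ("equivalent", name), ("similar", name), ("set", [names]) or
    ("unknown", None) when the analyst has to decide.
    """
    node = set(node_attrs)
    # Case 1: same set of attributes.
    for name, attrs in relations.items():
        if node == set(attrs):
            return ("equivalent", name)
    # Case 2: the node's attributes are a proper subset of a relation's.
    for name, attrs in relations.items():
        if node < set(attrs):
            return ("similar", name)
    # Case 3: the node is exactly covered by several whole relations.
    covering = [n for n, attrs in relations.items() if set(attrs) <= node]
    covered = set().union(*(relations[n] for n in covering))
    if len(covering) > 1 and covered == node:
        return ("set", sorted(covering))
    return ("unknown", None)


# Three relations taken from Table 1.
relations = {
    "Passenger": ["PassengerID", "FN", "LN", "Age"],
    "Plane": ["PlaneID", "CompID", "Capacity"],
    "Company": ["CompagnyID", "CompanyName"],
}
plane_node = classify_node(["PlaneID", "Capacity"], relations)
```

With these inputs the node (PlaneID, Capacity) is classified as similar to the relation Plane, which matches the Plane sub-schema derived from the flight program form.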

In addition, for a dependent node (or form relation), the primary key is formed by concatenating the primary key of its parent with its local primary key. This identification process is semi-automated, because it requires interaction with the analyst to identify relations that do not satisfy the properties of equivalence and similarity. Applying this process to the hierarchical structure of the “Booking Form” and the physical relational schema of the database, we extract the following relational sub-schemas:

Passenger (PassengerID, FirstName, LastName, Age)
Departure-City (CityID, DepartureCityName)
Arrival-City (CityID, ArrivalCityName)
Date (DepartureDate)

From the “program flights” form we identify the following relational sub-schemas:

Departure-Hour (HourID, type)
Arrival-Hour (HourID, type)
Plane (PlaneID, Capacity)
Flight (FlightNumber, DepartureCityID, ArrivalCityID, Dep_HourID, Arr_HourID, PlaneID)

From the relationships between the hierarchical structures of the “Booking Form” and “program flight” forms we identify the following relational sub-schemas:

Book (PassengerID, FlightNumber, DepartureDate, Class)
Leaving-From (FlightNumber, DepartureCityID)
Going-To (FlightNumber, ArrivalCityID)

5.1.2 Functional dependencies extraction

The extraction of functional dependencies (FDs) from the extension of a database has received a great deal of attention [29, 30, 31]. In our approach we use the algorithm introduced in [24], which reduces the time needed to extract functional dependencies by replacing the database instances with a more compact representation, namely the form instances. Applying this algorithm to the sub-schema of the “program of flights” form and its instances yields the non-trivial FD PlaneID → FlightNumber, which means in this case that a plane serves only one flight.
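To make the notion concrete, the following sketch checks whether a candidate FD holds on a set of form instances. It is not the algorithm of [24], whose details are not given here; it only shows the basic test that any such algorithm has to perform on the instances. The function name `holds_fd` and the sample rows are invented.

```python
def holds_fd(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds in `rows`."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first value seen for a key; any later row
        # with the same key but a different value violates the FD.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Invented instances of the flight program form, for illustration only.
program_rows = [
    {"FlightNumber": "F1", "PlaneID": "P1", "Dep_HourID": "H1"},
    {"FlightNumber": "F2", "PlaneID": "P2", "Dep_HourID": "H1"},
    {"FlightNumber": "F3", "PlaneID": "P3", "Dep_HourID": "H2"},
]
plane_determines_flight = holds_fd(program_rows, ["PlaneID"], ["FlightNumber"])
hour_determines_flight = holds_fd(program_rows, ["Dep_HourID"], ["FlightNumber"])
```

On this sample, PlaneID → FlightNumber holds, whereas Dep_HourID → FlightNumber does not, because two different flights share the departure hour H1. An FD observed on a sample remains a candidate that more instances could refute.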

5.1.3 Inclusion dependencies extraction In our approach, we formulate possible inclusion dependencies between relations’ key of sub-schema of form. The time of this process is more optimized with regard to the other approaches [31, 4] because the possible inclusion dependencies are verified by analyzing the form extensions which are more compact representation with regard to the database extension. In this algorithm, attributes of dependencies are the primary keys and foreign keys. Thus, the time complexity is reduced to the test of the inclusion dependency on the form instances. While applying this algorithm on the sub-schema of “Booking-form” and their instances, one finds the set of the inclusion dependencies:

DepartureCity.CityID << City.CityID
ArrivalCity.CityID << City.CityID
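A candidate inclusion dependency can be tested on instances with a simple set-membership check. The sketch below is only an illustration under assumed data: `ind_holds` and the sample rows are hypothetical and do not reproduce the algorithm of [24].

```python
def ind_holds(sub_rows, sub_attrs, sup_rows, sup_attrs):
    """Return True if sub_rows[sub_attrs] << sup_rows[sup_attrs]."""
    sup_values = {tuple(r[a] for a in sup_attrs) for r in sup_rows}
    # Every projected tuple of the left side must appear on the right side.
    return all(tuple(r[a] for a in sub_attrs) in sup_values for r in sub_rows)


# Hypothetical instances.
city = [{"CityID": 1}, {"CityID": 2}, {"CityID": 3}]
departure_city = [{"CityID": 1, "DepartureCityName": "Lyon"}]
arrival_city = [{"CityID": 2, "ArrivalCityName": "Oran"}]

print(ind_holds(departure_city, ["CityID"], city, ["CityID"]))
print(ind_holds(arrival_city, ["CityID"], city, ["CityID"]))
```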

5.1.4 Integration of sub-schemas of forms Following [32], the schema integration process consists of two phases: comparison and conforming of schemas, and merging and restructuring of schemas. The comparison phase performs a pairwise comparison of the relations of the sub-schemas and finds pairs of relations that may be semantically similar with respect to some properties, such as synonymy (of attribute and relation names), equality of primary key attributes and equivalence of relations. Conforming covers a variety of analyst-assisted techniques used to resolve conflicts. The merging and restructuring phase generates an integrated schema from the two component schemas that have been compared. The intermediate results are analyzed and restructured in order to eliminate symmetrical and transitive relationships between relations. We consider only one pair of schemas at a time; the result of each integration is accumulated into a single schema, which evolves gradually towards the global schema of forms. For more details see [24].

5.2 Relational schema enrichment The relational schema is restructured and enriched (see Table 2) through the semantics extracted from the global schema of forms by:

- Clarifying the relational schema by retaining the names that appear in the schema of forms instead of those of the relational schema. For example, the attribute name DepartureCityName is adopted because it conveys the meaning of the data better than the original attribute name DC-Name. Similarly, we retain FirstName and FlightNumber instead of FN and FlightID.

- Restructuring the relational schema by transforming it into third normal form using the functional dependencies found.

- Adding the extracted inclusion dependencies to the relational schema.

- Adding the extracted constraints to the attributes in the relations.

In the next section, the added dependencies and constraints are used to derive, respectively, the hierarchy and the axioms of the target OWL ontology.

6. BUILDING THE ONTOLOGICAL STRUCTURE Before presenting the ontology construction method, we first analyze some semantic similarities and differences between the relational model and ontologies.

Relations
Passenger (PassengerID, FirstName, LastName, Age)
City (CityID)
Departure-City (CityID, DepartureCityName)
Arrival-City (CityID, ArrivalCityName)
Date (DepartureDate)
Hour (HourID)
Departure-Hour (HourID, type)
Arrival-Hour (HourID, type)
Company (CompagnyID, Address, Phone)
Plane (PlaneID, CompagnyID, Capacity)
Leaving-From (FlightNumber, DepartureCityID)
Going-To (FlightNumber, ArrivalCityID)
Flight (FlightNumber, DepartureCityID, ArrivalCityID, DepartureHourID, ArrivalHourID, PlaneID)
Book (PassengerID, FlightNumber, DepartureDate, Class)

Functional dependencies
PlaneID → FlightNumber

Inclusion dependencies
DepartureCity.CityID << City.CityID
ArrivalCity.CityID << City.CityID
Departure-Hour.HourID << Hour.HourID
Arrival-Hour.HourID << Hour.HourID
…

Constraints
Departure-City : NotNull.
Arrival-City : NotNull.
…

Table 2 Enriched Relational Database Schema

6.1 Relational schema VS Ontology The underlying model of a relational database is the relational model [6]. It consists of: a set of relations R, a set of attributes AR, a set of basic types TR, a function attr: R → AR×AR that returns the attributes of relations, a function dom: AR → TR that returns the types of attributes, a function PK: R → AR×AR that returns the primary keys of relations, and a function FK: R → AR×AR that returns the foreign keys of relations. In addition, there are dependency relationships between the data of attributes in the relational model. For relations Ri and Rj in a database, suppose that Ai ⊆ attr(Ri) and Aj ⊆ attr(Rj), and let ti(Ai) denote the values of tuple ti for attributes Ai and tj(Aj) the values of tuple tj for attributes Aj. If for each tuple ti in Ri there exists a tuple tj in Rj such that ti(Ai) = tj(Aj), then Ai and Aj are said to be in an inclusion dependency, denoted Ri(Ai) << Rj(Aj). Besides the above entities, the relational model includes further constraints, such as NOT NULL and UNIQUE. Together these constitute the relational schema, which describes the structure and association of data. The tuples in relations are the instances of the schema, and they form the content of the database.
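The components of the relational model listed above can be captured in a small data structure. The sketch below is one possible encoding, not the authors' implementation: the class name `Relation`, its field names and the column types are assumptions, since the paper lists attribute names only.

```python
from dataclasses import dataclass, field


@dataclass
class Relation:
    """One relation of the schema, mirroring attr, dom, PK and FK."""
    name: str
    attrs: dict                                 # attr/dom: attribute name -> basic type
    pk: tuple                                   # PK(R)
    fk: dict = field(default_factory=dict)      # FK(R): attribute -> referenced relation
    not_null: set = field(default_factory=set)  # NOT NULL constraints
    unique: set = field(default_factory=set)    # UNIQUE constraints


# Column types are assumed for illustration.
plane = Relation("Plane", {"PlaneID": "INTEGER", "Capacity": "INTEGER"}, ("PlaneID",))
flight = Relation(
    "Flight",
    {"FlightNumber": "VARCHAR", "PlaneID": "INTEGER"},
    pk=("FlightNumber",),
    fk={"PlaneID": "Plane"},
    not_null={"PlaneID"},
)
schema = {r.name: r for r in (plane, flight)}
```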

Ontological structure is a 5-tuple O= {C, R, Hc, rel, Ao} [33], where C is a finite set of concepts; R is a finite set of relations; Hc is called concept hierarchy or taxonomy, which is a directed relation Hc ⊆ C×C; rel relates concepts non-taxonomically; Ao is a set of axioms, which is expressed in an appropriate logical language, e.g. first order logic. Based on the ontological structure, ontology comprises a set of instances, which could be seen as the extension of concepts.
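As a rough illustration, the 5-tuple can be mirrored by a small container type. The class and method names below are hypothetical, and axioms are kept as plain strings rather than logical formulas.

```python
from dataclasses import dataclass, field


@dataclass
class OntologicalStructure:
    concepts: set = field(default_factory=set)    # C
    relations: set = field(default_factory=set)   # R
    hierarchy: set = field(default_factory=set)   # Hc: pairs (sub-concept, super-concept)
    rel: dict = field(default_factory=dict)       # rel: relation name -> (domain, range)
    axioms: list = field(default_factory=list)    # Ao, kept here as plain strings

    def add_relation(self, name, domain, range_):
        """Register a non-taxonomic relation between two known concepts."""
        if domain not in self.concepts or range_ not in self.concepts:
            raise ValueError("domain and range must be concepts in C")
        self.relations.add(name)
        self.rel[name] = (domain, range_)


onto = OntologicalStructure(concepts={"City", "Departure-City", "Flight"})
onto.hierarchy.add(("Departure-City", "City"))
onto.add_relation("Leaving-From", "Flight", "Departure-City")
```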

As a result of the ongoing process of defining a standard Web ontology language, a number of intermediate languages have been defined (OIL, DAML, DAML+OIL, etc.). This paper adopts OWL (Web Ontology Language) [34], the latest standard recommended by the W3C, as the ontology description language, instead of Frame Logic, which we used in [35].

Similarities do exist between relational models and ontological models with respect to abstracting and modelling the domain of discourse. However, their purposes are different.

6.2 Ontology construction rules Both the relational model and ontologies are models for organizing knowledge, and there are some semantic similarities between them. In what follows we present the full set of ontology construction rules. The rules are organized in five groups.

6.2.1 Rule for constructing classes Rule 1. An OWL class Ci can be created based on the relation Ri, if one of the following conditions can be satisfied: (i) |PK(Ri)| = 1; (ii) |PK(Ri)| > 1, and there exists Ai, where Ai ∈ PK(Ri) and Ai ∉ FK(Ri).
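Rule 1 reduces to a test on the primary and foreign key attribute sets of a relation. The following sketch shows that test; the function name `creates_class` is hypothetical, and the key assignments in the examples are assumptions, since the paper does not mark keys explicitly.

```python
def creates_class(pk, fk):
    """Rule 1: decide whether a relation gives rise to an OWL class.

    pk: primary-key attributes; fk: all foreign-key attributes of the relation.
    """
    pk, fk = set(pk), set(fk)
    # (i) single-attribute key, or (ii) composite key with at least one
    # attribute that is not a foreign key.
    return len(pk) == 1 or (len(pk) > 1 and bool(pk - fk))


# Assumed keys for three relations of the running example.
print(creates_class(["PassengerID"], []))
print(creates_class(["FlightNumber", "DepartureCityID"],
                    ["FlightNumber", "DepartureCityID"]))
print(creates_class(["PassengerID", "FlightNumber", "DepartureDate"],
                    ["PassengerID", "FlightNumber"]))
```

Under these assumptions Passenger becomes a class, while Leaving-From, whose key consists only of foreign keys, does not and is left to the property rules.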

6.2.2 Rules for constructing properties OWL distinguishes two kinds of properties: object properties and datatype properties. Properties can be functional, i.e. each individual may have at most one value for the property. Their domain is always a class. Object properties may additionally be inverse functional, transitive, symmetric or inverse to another property. Their range is a class, while the range of datatype properties is a datatype.

Rule 2. For relations Ri and Rj, if Ri(Ai) ⊆ Rj(Aj) and Ai ⊄ PK(Ri) are satisfied, then an object property P can be created based on Ai. Let’s assume that the classes corresponding to Ri and Rj are Ci and Cj respectively, the domain and range of P are Ci and Cj.
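A minimal sketch of Rule 2 follows. It reads Ai ⊄ PK(Ri) as "Ai is not contained in the primary key", which is an interpretation; the function name, the dictionary layout and the choice of naming the property after its attributes are all assumptions.

```python
def object_properties(relation, pk, foreign_keys):
    """Rule 2: one object property per foreign key lying outside the primary key.

    foreign_keys: dict mapping a tuple of attributes to the referenced relation.
    """
    props = []
    for attrs, target in foreign_keys.items():
        if not set(attrs) <= set(pk):
            props.append({"name": "_".join(attrs), "domain": relation, "range": target})
    return props


flight_props = object_properties(
    "Flight", ("FlightNumber",), {("PlaneID",): "Plane"}
)
print(flight_props)
```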

Rule 3. For relations Ri and Rj, two object properties “has-part” and “is-part-of” can be created if the following two conditions are satisfied:

(i) |PK(Ri)| > 1; (ii) FK(Ri) ⊂ PK(Ri), where FK(Ri) refers to Rj.

Suppose that the classes corresponding to Ri and Rj are Ci and Cj respectively. The domain and range of “is-part-of” are Ci and Cj, and the domain and range of “has-part” are Cj and Ci. The properties “has-part” and “is-part-of” are inverse properties.
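Rule 3 can be sketched as below. The relation Seat (PlaneID, SeatNumber) is a hypothetical example that does not appear in the paper's schema, and the function name and output layout are assumptions.

```python
def part_whole_properties(ri, pk, fk, rj):
    """Rule 3: derive is-part-of / has-part between Ri and the referenced Rj."""
    pk_set, fk_set = set(pk), set(fk)
    # (i) composite primary key; (ii) the foreign key referring to Rj is a
    # non-empty proper subset of that key.
    if len(pk_set) > 1 and fk_set and fk_set < pk_set:
        return [
            {"name": "is-part-of", "domain": ri, "range": rj, "inverse_of": "has-part"},
            {"name": "has-part", "domain": rj, "range": ri, "inverse_of": "is-part-of"},
        ]
    return []


# Hypothetical dependent relation Seat whose key embeds the key of Plane.
seat_props = part_whole_properties("Seat", ("PlaneID", "SeatNumber"), ("PlaneID",), "Plane")
print(seat_props)
```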

Rule 4. For relations Ri, Rj and Rk, if Ai = PK(Ri), Aj = PK(Rj), Ai ∪ Aj = FK(Rk) and Ai ∩ Aj =∅, then two object properties Pj’ and Pj’’ can be created based on the semantics of Rk. Suppose that the classes corresponding to Ri and Rj are Ci and Cj respectively, the domain and range of Pj’ are Ci and Cj and the domain and range of Pj’’ are Cj and Ci. Pj’ and Pj’’ are two inverse properties.
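The condition of Rule 4 can be checked with set operations, as in the sketch below. For simplicity it assumes that foreign-key attributes carry the same names as the primary keys they reference, so the key of Departure-City is written DepartureCityID here; the function name and the naming of the inverse property are also assumptions.

```python
def relationship_properties(rk, fk_rk, ri, pk_ri, rj, pk_rj):
    """Rule 4: turn a binary relationship relation Rk into two inverse properties."""
    ai, aj = set(pk_ri), set(pk_rj)
    # FK(Rk) must be exactly the union of the two keys, and the keys disjoint.
    if (ai | aj) != set(fk_rk) or (ai & aj):
        return []
    forward = {"name": rk, "domain": ri, "range": rj}
    backward = {"name": f"inverse-of-{rk}", "domain": rj, "range": ri}
    forward["inverse_of"] = backward["name"]
    backward["inverse_of"] = forward["name"]
    return [forward, backward]


leaving_from = relationship_properties(
    "Leaving-From", ("FlightNumber", "DepartureCityID"),
    "Flight", ("FlightNumber",),
    "Departure-City", ("DepartureCityID",),
)
print(leaving_from)
```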

Rule 5. For relations R1, R2, …, Ri and Rj, if A1 = PK(R1), A2 = PK(R2), …, Ai = PK(Ri), A1 ∪ A2 ∪…∪ Ai = FK(Rj) and A1 ∩ A2 ∩…∩ Ai =∅, then object properties P_i^1, P_i^2, …, P_i^{i*(i-1)} can be created. The semantics of P_i^1, P_i^2, …, P_i^{i*(i-1)} are based on the decomposition of the n-ary relationship provided by Rj.
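The count i*(i-1) corresponds to the number of ordered pairs of participating classes. The sketch below enumerates those pairs for an assumed ternary relationship Book between Passenger, Flight and Date; the function name, the property naming scheme and the key assignments are hypothetical.

```python
from itertools import permutations


def nary_properties(rj, fk_rj, participants):
    """Rule 5: decompose an n-ary relationship Rj into binary object properties.

    participants: dict mapping each participating class to its primary key.
    """
    keys = [set(pk) for pk in participants.values()]
    if len(keys) < 2:
        return []
    # The union of the keys must equal FK(Rj) and their intersection be empty.
    if set().union(*keys) != set(fk_rj) or set.intersection(*keys):
        return []
    return [
        {"name": f"{rj}_{d}_{r}", "domain": d, "range": r}
        for d, r in permutations(participants, 2)
    ]


book_props = nary_properties(
    "Book",
    ("PassengerID", "FlightNumber", "DepartureDate"),
    {"Passenger": ("PassengerID",), "Flight": ("FlightNumber",), "Date": ("DepartureDate",)},
)
print(len(book_props))
```

With three participants the sketch yields 3*2 = 6 properties, one for each ordered pair of classes.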

Rule 6. For an ontological class Ci and the datatype properties set of Ci denoted as DP(Ci), if Ci corresponds to relations R1, R2,…, Ri in database, then for every attribute in R1, R2, …, Ri, if it cannot be used to create object property by using Rule 2, then it can be used to create datatype property of Ci. The domain and range of each property Pi are Ci and dom(Ai) respectively, where Pi ∈ DP(Ci) and Ai ∈ attr(Ri).
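Rule 6 can be sketched as a filter over the attributes of a class followed by a type mapping. The SQL column types and the SQL-to-XSD mapping below are assumptions made for illustration; the paper only states that the range is dom(Ai).

```python
# Assumed mapping from relational basic types to XML Schema datatypes.
XSD = {"INTEGER": "xsd:integer", "VARCHAR": "xsd:string", "DATE": "xsd:date"}


def datatype_properties(cls, attrs, object_property_attrs=()):
    """Rule 6: attributes not used by Rule 2 become datatype properties.

    attrs: dict attribute name -> basic type, i.e. dom(Ai).
    """
    return [
        {"name": a, "domain": cls, "range": XSD.get(t, "xsd:string")}
        for a, t in attrs.items()
        if a not in object_property_attrs
    ]


plane_dp = datatype_properties(
    "Plane",
    {"PlaneID": "INTEGER", "CompagnyID": "INTEGER", "Capacity": "INTEGER"},
    object_property_attrs={"CompagnyID"},
)
print(plane_dp)
```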

6.2.3 Rule for constructing Inheritance In OWL, the classes and properties can be organized in a hierarchy. In our approach this hierarchy can be discovered through inclusion dependencies.

Rule 7. For relations Ri and Rj, suppose that Pi = PK(Ri) and Pj = PK(Rj). If Ri(Pi) << Rj(Pj) is satisfied, then the class corresponding to Ri is a subclass of the class corresponding to Rj.
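The inheritance rule can be sketched as a filter over the extracted inclusion dependencies: only those relating one primary key to another yield a subclass axiom. The function name, the tuple layout and the sample keys are assumptions.

```python
def subclass_axioms(primary_keys, inclusion_deps):
    """Rule 7: an inclusion dependency between primary keys yields a subclass axiom.

    inclusion_deps: list of ((Ri, attrs_i), (Rj, attrs_j)) meaning Ri(attrs_i) << Rj(attrs_j).
    """
    axioms = []
    for (ri, ai), (rj, aj) in inclusion_deps:
        if set(ai) == set(primary_keys[ri]) and set(aj) == set(primary_keys[rj]):
            axioms.append((ri, "rdfs:subClassOf", rj))
    return axioms


pks = {
    "City": ("CityID",),
    "Departure-City": ("CityID",),
    "Flight": ("FlightNumber",),
    "Plane": ("PlaneID",),
}
deps = [
    (("Departure-City", ("CityID",)), ("City", ("CityID",))),
    (("Flight", ("PlaneID",)), ("Plane", ("PlaneID",))),  # foreign key, not a primary key
]
print(subclass_axioms(pks, deps))
```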

6.2.4 Rules for constructing axioms In OWL, a property when applied to a class can be constrained by cardinality restrictions on the domain giving the minimum (minCardinality) and maximum (maxCardinality) number of instances which can participate in the relation. In addition, an OWL property can be globally declared as functional (functionalProperty) or inverse functional (inverseFunctional). A functional property has a maximum cardinality of 1 on its range, while an inverse functional property has a maximum cardinality of 1 on its domain. The rules for constructing axioms are shown in Rule 8, Rule 9, and Rule 10.

Rule 8. For relation Ri and Ai ∈ attr(Ri), if Ai = PK(Ri) or Ai = FK(Ri), then the minCardinality and maxCardinality of the property Pi corresponding to Ai is 1.

Rule 9. For relation Ri, and Ai ∈ attr(Ri), if Ai is declared as NOT NULL, the minCardinality of the property Pi corresponding to Ai is 1.

Rule 10. For relation Ri, and Ai ∈ attr(Ri), if Ai is declared as UNIQUE, the maxCardinality of the property Pi corresponding to Ai is 1. Alternatively the property Pi can be declared as functionalProperty. FunctionalProperty is shorthand for stating that the property's minimum cardinality is zero and its maximum cardinality is 1.
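Rules 8 to 10 can be combined into one function that derives the cardinality bounds of the property created from an attribute. The sketch interprets "Ai = FK(Ri)" as "Ai is one of the single-attribute foreign keys of Ri", which is an assumption, as are the function name and the sample constraints.

```python
def cardinality_restrictions(attr, pk, fks, not_null=(), unique=()):
    """Return the min/max cardinality implied by Rules 8, 9 and 10 (None = unconstrained)."""
    min_card = max_card = None
    if (attr,) == tuple(pk) or attr in fks:   # Rule 8: primary or foreign key
        min_card = max_card = 1
    if attr in not_null:                      # Rule 9: NOT NULL
        min_card = 1
    if attr in unique:                        # Rule 10: UNIQUE
        max_card = 1
    return {"minCardinality": min_card, "maxCardinality": max_card}


# Hypothetical constraints on the Company relation.
print(cardinality_restrictions("CompagnyID", ("CompagnyID",), set()))
print(cardinality_restrictions("Address", ("CompagnyID",), set(), not_null={"Address"}))
print(cardinality_restrictions("Phone", ("CompagnyID",), set(), unique={"Phone"}))
```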

6.2.5 Rules for constructing instances Once the ontology structure is created, the process of data migration can start. The objective of this task is the creation of ontological instances (that form a knowledge base) based on the tuples of the relational database. The data migration process has to be performed in two phases based on the following rules:

Rule 11: First, the instances are created. Each instance is assigned a unique identifier. This phase translates all attributes except foreign-key attributes, which are not needed in the metadata.

Rule 12: Second, relations between instances are established using the information contained in the foreign keys in the database tuples. This is accomplished using a mapping function that maps keys to ontological identifiers.

Rules 11 and 12 show that, for an ontological class, the instances consist of the tuples of the relations corresponding to that class, and that relations between instances are established using the information contained in the foreign keys of the database tuples.
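The two-phase migration can be sketched as follows. This is a simplified illustration under stated assumptions: single-attribute primary and foreign keys, identifiers of the form `Relation_n`, and hypothetical sample tuples; the dictionary `key_to_id` plays the role of the mapping function from keys to ontological identifiers.

```python
def migrate(tables, primary_keys, foreign_keys):
    """Create instances (Rule 11), then link them through foreign keys (Rule 12).

    tables: relation -> list of row dicts; primary_keys: relation -> attribute;
    foreign_keys: relation -> {attribute: referenced relation}.
    """
    instances, key_to_id = {}, {}
    # Phase 1: one instance per tuple, copying every non-foreign-key attribute.
    for name, rows in tables.items():
        fk_attrs = foreign_keys.get(name, {})
        for n, row in enumerate(rows, start=1):
            ident = f"{name}_{n}"
            key_to_id[(name, row[primary_keys[name]])] = ident
            instances[ident] = {a: v for a, v in row.items() if a not in fk_attrs}
    # Phase 2: resolve foreign-key values to the identifiers created in phase 1.
    links = []
    for name, rows in tables.items():
        for row in rows:
            subject = key_to_id[(name, row[primary_keys[name]])]
            for attr, target in foreign_keys.get(name, {}).items():
                links.append((subject, attr, key_to_id[(target, row[attr])]))
    return instances, links


tables = {
    "Flight": [{"FlightNumber": "AH100", "PlaneID": "P1"}],
    "Plane": [{"PlaneID": "P1", "Capacity": 150}],
}
instances, links = migrate(
    tables,
    {"Flight": "FlightNumber", "Plane": "PlaneID"},
    {"Flight": {"PlaneID": "Plane"}},
)
print(instances)
print(links)
```

Separating the two phases means every identifier exists before any link is created, so the order in which relations are processed does not matter.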

7. CONCLUSION AND FUTURE WORKS In this paper we focus on the problem of automating, at least partially, the generation of domain ontologies by applying reverse engineering techniques to acquire OWL ontologies from data-intensive Web sites. We present the details of a process that semi-automatically creates an OWL ontology corresponding to the content of a relational database, based on the analysis of its related HTML forms. Our approach can be used for migrating HTML pages (especially those that are dynamically generated from a relational database) to the ontology-based Semantic Web. The main reason for this migration is to make the relational database information that is available on the Web machine-processable and to reduce the time-consuming task of ontology creation. However, in most circumstances, the obtained ontological structure is coarse. In addition, some of the semantics of the obtained information need to be validated, so refining the obtained ontological structure is necessary. Because existing repositories of lexical knowledge usually include authoritative knowledge about some domains, we suggest as future work refining the obtained ontology with the help of such repositories, especially machine-readable dictionaries and thesauri (e.g. WordNet).

8. ADDITIONAL AUTHORS Youssef Amghar (LIRIS Laboratory, INSA of Lyon, Villeurbanne, France, e-mail: [email protected]), and Hamadou Saliah-Hassane (Quebec University of Montreal, Canada, e-mail: [email protected]).

9. REFERENCES [1] Erdmann, M., Maedche, A., Schnurr, H. and Staab, S. From Manual to Semi-automatic Semantic Annotation: About Ontology-based Text Annotation Tools, In: Proceedings of the Workshop on Semantic Annotation and Intelligent Content (COLING), (2000).

[2] Volz, R., Handschuh, S., Staab, S., Stojanovic, L., Stojanovic, N. Unveiling the hidden bride: deep annotation for mapping and migrating legacy data to the Semantic Web, Journal of Web Semantics: Science, Services and Agents on the World Wide Web 1 (2004) 187-206.

[3] Haustein, S. and Pleumann, J. Is participation in the Semantic Web too difficult? In First International Semantic Web Conference, number 2342 in LNCS, pages 448-453, Sardinia, Italy, June 2002. Springer-Verlag.

[4] Chiang, R.H.L., Barron, T.M., Story, V.C. Reverse engineering of relational databases: extraction of an EER model from a relational database. Data and Knowledge Engineering, 1994.

[5] Premerlani, W. and Blaha, M. An Approach for Reverse Engineering of Relational Databases, In: Communications of the ACM, Vol. 37. No. 5 (1994) 42–49

[6] Codd, E.F. A relational model of data for large shared data banks. CACM 13 No.6, 1970.

[7] Muller, R. Database Design for Smarties: Using UML for Data Modeling, Morgan Kaufmann (1999).

[8] Biskup, J. Achievements of relational database schema design theory revisited. Semantics in Database, Springer Verlag, 1998.

[9] Vermeer, M., Apers, P. Object-oriented views of relational databases incorporating behaviour, Proceedings of the 4th International Conference on Database Systems for Advanced Applications, Singapore, April 11-13, 1995, 26-35.

[10] Behm, A., Geppert, A., Dittrich, K. On the Migration of Relational Schemas and Data to Object-Oriented Database Systems. In Proceedings of the 5th Int. Conference on Re-Technologies for Information Systems (Klagenfurt, December 1997), pp. 13-33.

[11] Hainaut, J., Henrard, J., Hick, J.M., Roland, D., Englebert, V. Database Design Recovery, Proc. 8th Conference on Advanced Information Systems Engineering (CAiSE), Heraklion, Crete, Greece, LNCS, 1080, 1996, 272–300.

[12] Kashyap, V. Design and Creation of Ontologies for Environmental Information Retrieval, Proc. 12th Workshop on Knowledge Acquisition, Modeling and Management (KAW), Banff, Alberta, Canada, 1999.

[13] Stojanovic, L., Stojanovic, N., Volz, R. Migrating Data-intensive Web Sites into the Semantic Web, Proc. 17th ACM Symposium on Applied Computing, Madrid, 2002.

[14] Dogan, G., Islamaj, R. Importing Relational Databases into the Semantic Web, 2002 , URL: http://www.mindswap.org/Webai/2002/fall/Importing_20Relational_20Databases_20into_20the_20Semantic_20web.html

[15] Rubin, D.L., Hewett, M., Oliver, D.E., Klein, T.E., Altman, R.B. Automatic data acquisition into ontologies from pharmacogenetics relational data sources using declarative object definitions and XML. In: Proceedings of the Pacific Symposium on Biocomputing, Lihue, HI, Eds. R.B. et al. (2002).

[16] Astrova, I. Reverse Engineering of Relational Databases to Ontologies , Proc. 1st European Semantic Web Symposium (ESWS), Heraklion, Crete, Greece, LNCS, 2004, 327–341.

[17] Noy, N., Klein, M. Ontology Evolution: Not the same as Schema Evolution. Knowledge and Information Systems, 6(4):428-440, 2004.

[18] Sahuguet, A., Azavant, F. Building Intelligent Web Applications Using Lightweight Wrappers, Data Knowledge Engineering, March 2001, pp. 283-316.

[19] Wang, J., Lochovsky, F.: Data Extraction and Label Assignment for Web Databases, In: Proceedings of the 12th International Conference on World Wide Web (WWW), Budapest, Hungary, 2003, 187–196.

[20] Embley, D. Toward Semantic Understanding – An Approach Based on Information Extraction, In: Proceedings of the 15th Australasian Database Conference, 2004, 3–12.

[21] Yang, Y., Zhang, H. HTML Page Analysis Based on Visual Cues, In: Proceedings of the 6th International Conference on Document Analysis & Recognition (ICDAR), Seattle, WA, USA, 2001, 859–864.

[22] Astrova, I., Stantic, B. An HTML Forms driven Approach to Reverse Engineering of Relational Databases to Ontologies, In: Proceedings of the 23rd IASTED International Conference on Databases and Applications (DBA), eds. M. H. Hamza, Innsbruck, Austria, 2005, pp. 246-251.

[23] Benslimane, S.M., Malki, M., Rahmouni, M.K., Benslimane, D. Building domain-specific ontology from data-intensive Web site: An HTML forms-based reverse engineering approach, In: Proceedings of the International Conference on Signal-Image Technology & Internet-Based Systems (SITIS’05/IEEE), Yaoundé, Cameroon, 2005.

[24] Malki, M., Flory, A., Rahmouni, M.K. Extraction of Object-oriented Schemas from Existing Relational Databases: a Form-driven Approach, INFORMATICA, International Journal (Lithuanian Academy of Sciences) pp 47-72, Vol. 13(1), 2002.

[25] Choobineh, J. A form-based approach for database analysis and design. Communications of the ACM, 35(2) 1992.

[26] Mfourga, N. Extracting entity-relationship schemas from relational databases: a form-driven approach. In Proc. of Working Conf. on Reverse Engineering (WCRE’97), 1997.

[27] Tijerino, Y.A., Embley, D.W., Lonsdale, D.W., Ding, Y., Nagy, G. Towards Ontology Generation from Tables. Springer Science+Business Media B.V, Kluwer Academic publishers, September 2005, 261-285.

[28] Michalski, R. A Theory and Methodology of Inductive Learning. Machine Learning, vol.1, Eds. J.G. Carbonell, R.S. Michalski and T.M. Mitchell, Palo Alto (1983) 83-134.

[29] Andersson, M. Extracting an entity relationship schema from a relational database through reverse engineering. In Proc. of ER’94, LNCS, 403–419. Springer-Verlag, 1994.

[30] Mannila, H., Räihä, K.J. The Design of Relational Databases. Addison-Wesley publishing, England, 1992, 318 pages.

[31] Petit, J.M., Toumani, F., Kouloumdjian, J. Relational database reverse engineering: a method based on Query analysis. International Journal of Cooperative Information Systems, 4(2,3), 287–316, 1995.

[32] Batini, C., Lenzerini, M., Navathe, S.B. A Comparative Analysis of Methodologies for Database Schema Integration. ACM Computing Surveys, Vol. 18, pp. 323-364, Dec. 1986.

[33] Maedche, A. Ontology learning for the Semantic Web. Boston: Kluwer Academic Publishers. 2002

[34] Smith, M.K., Welty, C., McGuinness, D.L. Eds. OWL Web Ontology Language Guide. W3C Proposed Recommendation, December 2003.

[35] Benslimane, S.M., Malki, M., Amar Bensaber, D. Automated Migration of Data-Intensive Web Pages into Ontology-Based Semantic Web: A Reverse Engineering Approach. In, Meersman R., Tari Z. et al.,(eds.),ODBASE, vol. 2, LNCS 3761, pp. 1640 - 1649, 2005. Springer Verlag.
