An intelligent query processing for distributed ontologies

11
An intelligent query processing for distributed ontologies Jihyun Lee a , Jeong-Hoon Park a , Myung-Jae Park a , Chin-Wan Chung a, * , Jun-Ki Min b a Division of Computer Science, Department of Electronic Engineering and Computer Science, Korea Advanced Institute of Science and Technology (KAIST), Daejon 305-701, Republic of Korea b School of Internet-Media Engineering, Korea University of Technology and Education, Byeongcheon-Myeon, Cheonan, Chungnam 330-708, Republic of Korea article info Article history: Received 30 October 2008 Received in revised form 30 March 2009 Accepted 7 June 2009 Available online 12 June 2009 Keywords: Distributed query processing Semantic mapping Query optimization Distributed ontologies Semantic Web abstract In this paper, we propose an intelligent distributed query processing method considering the character- istics of a distributed ontology environment. We suggest more general models of the distributed ontology query and the semantic mapping among distributed ontologies compared with the previous works. Our approach rewrites a distributed ontology query into multiple distributed ontology queries using the semantic mapping, and we can obtain the integrated answer through the execution of these queries. Fur- thermore, we propose a distributed ontology query processing algorithm with several query optimization techniques: pruning rules to remove unnecessary queries, a cost model considering site load balancing and caching, and a heuristic strategy for scheduling plans to be executed at a local site. Finally, experi- mental results show that our optimization techniques are effective to reduce the response time. Ó 2009 Elsevier Inc. All rights reserved. 1. Introduction In the Semantic Web, the definitions of resources and the rela- tionship between resources are described by an ontology in order to automatically interpret the resources and retrieve useful infor- mation. The resources in the Web are independently generated in many locations. Thus, even if the ontologies describe resources in the same (similar) domain, they can use different representations (i.e., language and schema). Also, the ontologies are managed by various local ontology management systems which have different capabilities and strategies for storing and query processing. Under these environment, some Web applications want to access the ontologies without regard to the heterogeneity and the dispersion of the ontologies and the local systems. In order to support such a request, an efficient query processing over the distributed ontolo- gies is essential. Of course, existing distributed query processing techniques can be applied to query the distributed ontologies. However, they confront the limitations of the efficiency and the functionality since some important characteristics of a distributed ontology environment are not considered. Fig. 1 shows an example of the distributed ontology environ- ment. There are three kinds of ontologies, UNIV, COLLEGE, and PUB which are managed in three different sites and two types of local systems (i.e., LS 1 ; LS 2 ). UNIV and COLLEGE describe the infor- mation of the university and the college, respectively, and PUB describes the publication information. For the simplicity, we de- scribe only the schema and omit the instance part. These ontolo- gies are independently generated but related to each other even if they have different schemas. For example, let us suppose the fol- lowing conditions: first, the concept of Professor in UNIV is defined as the concept of Lecturer in COLLEGE. Second, the information of the authors in PUB can be found in UNIV and COLLEGE. In this dis- tributed ontology environment, consider the following example queries: Example 1. Q 1 : Find professors who teach ‘Algorithm’. Example 2. Q 2 : Find authors who wrote publications about ‘Semantic Web’ and also retrieve the name and the email addresses of the authors. In order to find the answer of query Q 1 , we should retrieve professors and lecturers who teach ‘Algorithm’ from UNIV and COLLEGE, respectively. For query Q 2 , UNIV and COLLEGE should be searched along with PUB to find the personal information of the authors who wrote papers about ‘Semantic Web’. For such a query, in order to efficiently find the answer dispersed in several ontologies and local sites, a distributed query processing method considering the heterogeneity of the ontologies is required. The use of the semantic mapping is a representative approach to deal with the heterogeneity among different ontologies (Borg- ida and Serafini, 2003; Haase and Motik, 2005; Motik et al., 2004; Serafini and Andrei, 2005). In (Borgida and Serafini, 2003; Serafini and Andrei, 2005), the semantic mapping is the semantic relationship (i.e., subsumption or equivalence) between concepts (i.e., classes or properties) in two different ontologies and it has 0164-1212/$ - see front matter Ó 2009 Elsevier Inc. All rights reserved. doi:10.1016/j.jss.2009.06.008 * Corresponding author. Tel.: +82 42 350 3537; fax: +82 42 350 3510. E-mail addresses: [email protected] (J. Lee), [email protected] (J.-H. Park), [email protected] (M.-J. Park), [email protected] (C.-W. Chung), [email protected] (J.-K. Min). The Journal of Systems and Software 83 (2010) 85–95 Contents lists available at ScienceDirect The Journal of Systems and Software journal homepage: www.elsevier.com/locate/jss

Transcript of An intelligent query processing for distributed ontologies

The Journal of Systems and Software 83 (2010) 85–95

Contents lists available at ScienceDirect

The Journal of Systems and Software

journal homepage: www.elsevier .com/locate / jss

An intelligent query processing for distributed ontologies

Jihyun Lee a, Jeong-Hoon Park a, Myung-Jae Park a, Chin-Wan Chung a,*, Jun-Ki Min b

a Division of Computer Science, Department of Electronic Engineering and Computer Science, Korea Advanced Institute of Science and Technology (KAIST),Daejon 305-701, Republic of Koreab School of Internet-Media Engineering, Korea University of Technology and Education, Byeongcheon-Myeon, Cheonan, Chungnam 330-708, Republic of Korea

a r t i c l e i n f o a b s t r a c t

Article history:Received 30 October 2008Received in revised form 30 March 2009Accepted 7 June 2009Available online 12 June 2009

Keywords:Distributed query processingSemantic mappingQuery optimizationDistributed ontologiesSemantic Web

0164-1212/$ - see front matter � 2009 Elsevier Inc. Adoi:10.1016/j.jss.2009.06.008

* Corresponding author. Tel.: +82 42 350 3537; faxE-mail addresses: [email protected] (J. L

(J.-H. Park), [email protected] (M.-J. Park)(C.-W. Chung), [email protected] (J.-K. Min).

In this paper, we propose an intelligent distributed query processing method considering the character-istics of a distributed ontology environment. We suggest more general models of the distributed ontologyquery and the semantic mapping among distributed ontologies compared with the previous works. Ourapproach rewrites a distributed ontology query into multiple distributed ontology queries using thesemantic mapping, and we can obtain the integrated answer through the execution of these queries. Fur-thermore, we propose a distributed ontology query processing algorithm with several query optimizationtechniques: pruning rules to remove unnecessary queries, a cost model considering site load balancingand caching, and a heuristic strategy for scheduling plans to be executed at a local site. Finally, experi-mental results show that our optimization techniques are effective to reduce the response time.

� 2009 Elsevier Inc. All rights reserved.

1. Introduction describes the publication information. For the simplicity, we de-

In the Semantic Web, the definitions of resources and the rela-tionship between resources are described by an ontology in orderto automatically interpret the resources and retrieve useful infor-mation. The resources in the Web are independently generated inmany locations. Thus, even if the ontologies describe resources inthe same (similar) domain, they can use different representations(i.e., language and schema). Also, the ontologies are managed byvarious local ontology management systems which have differentcapabilities and strategies for storing and query processing. Underthese environment, some Web applications want to access theontologies without regard to the heterogeneity and the dispersionof the ontologies and the local systems. In order to support such arequest, an efficient query processing over the distributed ontolo-gies is essential. Of course, existing distributed query processingtechniques can be applied to query the distributed ontologies.However, they confront the limitations of the efficiency and thefunctionality since some important characteristics of a distributedontology environment are not considered.

Fig. 1 shows an example of the distributed ontology environ-ment. There are three kinds of ontologies, UNIV, COLLEGE, andPUB which are managed in three different sites and two types oflocal systems (i.e., LS1; LS2). UNIV and COLLEGE describe the infor-mation of the university and the college, respectively, and PUB

ll rights reserved.

: +82 42 350 3510.ee), [email protected], [email protected]

scribe only the schema and omit the instance part. These ontolo-gies are independently generated but related to each other evenif they have different schemas. For example, let us suppose the fol-lowing conditions: first, the concept of Professor in UNIV is definedas the concept of Lecturer in COLLEGE. Second, the information ofthe authors in PUB can be found in UNIV and COLLEGE. In this dis-tributed ontology environment, consider the following examplequeries:

Example 1. Q1: Find professors who teach ‘Algorithm’.

Example 2. Q 2: Find authors who wrote publications about‘Semantic Web’ and also retrieve the name and the email addressesof the authors.

In order to find the answer of query Q 1, we should retrieveprofessors and lecturers who teach ‘Algorithm’ from UNIV andCOLLEGE, respectively. For query Q2, UNIV and COLLEGE shouldbe searched along with PUB to find the personal information ofthe authors who wrote papers about ‘Semantic Web’. For such aquery, in order to efficiently find the answer dispersed in severalontologies and local sites, a distributed query processing methodconsidering the heterogeneity of the ontologies is required.

The use of the semantic mapping is a representative approachto deal with the heterogeneity among different ontologies (Borg-ida and Serafini, 2003; Haase and Motik, 2005; Motik et al.,2004; Serafini and Andrei, 2005). In (Borgida and Serafini, 2003;Serafini and Andrei, 2005), the semantic mapping is the semanticrelationship (i.e., subsumption or equivalence) between concepts(i.e., classes or properties) in two different ontologies and it has

LS1 LS2LS1

UNIV

Person

PaperCourse

String

String

University

teachemail

name

write

Professor

Student degreeFrom

Member

PublicationCourse

String

String

College

lecture hasEmail

name

authorOf

Lecturer

Student workFor

Publication

String

String

isAbout

title

Author

hasAuthor

cite

COLLEGE PUB

Local Site S1 Local Site S2 Local Site S3

String

researchOn

: Class : Property : subClassOf

Fig. 1. An example of the distributed ontology environment.

86 J. Lee et al. / The Journal of Systems and Software 83 (2010) 85–95

been extended to that between views (i.e., queries) (Haase andMotik, 2005; Motik et al., 2004). However, the previous worksdo not support more general semantic mapping and distributedquery covering more than two ontologies. Besides, most of themhave focused on only the rewriting of the query using the seman-tic mapping, and do not make an issue of the efficient distributedquery evaluation (i.e., query rewriting, scheduling, and execution).

In this paper, we resolve issues of the distributed query process-ing over multiple heterogeneous ontologies. We extend the modelsof the distributed query and the semantic mapping to supportmore general distributed ontology query answering comparedwith previous works. Furthermore, we present a distributed ontol-ogy query processing algorithm with several query optimizationtechniques considering the characteristics of the distributed ontol-ogy environment.

The contributions of the paper are as follows:Extended models of the distributed ontology query and the

semantic mapping: We present a general distributed ontologyquery model to cover multiple different ontologies. We also pres-ent a general semantic mapping model in which more than twoontologies can be associated. The extension of query and semanticmapping models makes it possible to include relevant data whichcould not be accessed before in the query result. Also, our approachlogically integrates independently grown distributed ontologiesthrough the query rewriting based on the semantic mapping. Asa result, we can efficiently extract an integrated answer of a dis-tributed query over different ontologies.

Optimization techniques for an efficient query processing onthe distributed ontologies: Multiple distributed queries are gen-erated from an original distributed query to obtain results fromdispersed ontologies. In order to remove unnecessary operationsand to increase the parallelism among executions of the multiplequeries, we suggest several optimization techniques. First, wepresent pruning rules to remove invalid and redundant queries.Second, we suggest a heuristic strategy for scheduling plans to beexecuted at a local site. Third, we propose a cost model consideringsite load balancing and caching for processing multiple distributedqueries.

The remainder of the paper is organized as follows: In Section 2,we review related work. In Section 3, we present a distributedontology query model and a semantic mapping model. Section 4describes a distributed query processing technique with severalquery optimization techniques over distributed ontologies. Section5 contains the results of experiments. Finally, in Section 6, we con-clude this paper.

2. Related work

Recently, the research on a query processing over distributedontologies has been performed. Stuckenschmidt et al. (2005) sug-

gests a global data summary for locating data matching query an-swers in different sources and the query optimization. However,Stuckenschmidt et al. (2005) assumes that all distributed ontolo-gies can be accessed in a uniform way like a global schema. In otherwords, the heterogeneity of schemas of the distributed ontologiesis not considered. Besides, many tasks are concentrated on themediator. As well as query scheduling, the merge (i.e., join) of alllocal query results is also executed in the mediator. Thus, whenthe mediator receives requests for many queries at the same time,the bottleneck on the mediator is inevitable.

The most of research on the query answering over distributedontologies are based on the P2P architecture. Edutella (Nejdlet al., 2002) uses an unstructured P2P network which has no meth-od to route a query to the relevant ontologies. Instead, the query isbroadcasted in the entire network. Thus, a huge amount of unnec-essary network traffic incurs. As a successor of Edutella, to providebetter scalability, Nejdl et al. (2003) presents a schema-basedquery routing strategy in a hierarchical topology using the super-peer concept. Nejdl et al. (2003) also suggests a rule-based media-tion between two different schemas in order to collect results frommany peers using heterogeneous schemas. SomeRDFs (Adjimanet al., 2007) supports the semantic mapping between two atomicconcepts and between the domain (or range) of a property and aclass. Piazza (Halevy et al., 2003) proposes a language (heavily re-lies on XQuery/XPath) to describe the semantic mapping betweentwo different ontologies. In these works, for distributed queryanswering, a peer reformulates a query by using the semanticmapping and forwards the reformulated query to another peer re-lated by the semantic mapping.

DRAGO (Serafini and Andrei, 2005) focuses on a distributed rea-soning based on the P2P-like architecture. In DRAGO, every peermaintains a set of ontologies and the semantic mapping betweenits local ontologies and remote ontologies located in other peers.A reasoning service is performed by a local reasoner for the locallyregistered ontologies and the reasoning is propagated to the otherpeers when the local ontologies are semantically connected to theother remote ontologies. The semantic mapping supported in DRA-GO is only the subsumption relationship between two atomic con-cepts. Besides, it does not support the ABox reasoning.

KAONP2P (Haase and Wang, 2007) also suggests the P2P-likearchitecture for query answering over distributed ontologies.KAONP2P supports more extended semantic mapping which de-scribes the correspondence between views of two different ontolo-gies, where each view is represented by a conjunctive query. For thedistributed query answering, it generates a virtual ontology includ-ing a target ontology to which the query is issued and the semanticmapping between the target and the other ontologies. Then, thequery evaluation is performed against the virtual ontology.

OBSERVER (Mena et al., 2000a) does not consider the P2P envi-ronment. Non the less, the goal is also to find an answer of an

J. Lee et al. / The Journal of Systems and Software 83 (2010) 85–95 87

ontology query over heterogeneous ontologies. In OBSERVER, IRM(Inter-ontology relationship manager) manages the semantic map-ping among different ontologies for query rewriting. However, thescope of the semantic mapping is restricted to the equality amongconcepts (i.e., synonym relationship). As a follow-up of OBSERVER,(Mena et al., 2000b) considers the query rewriting using similarconcepts (i.e., hypernyms or hyponyms) when there is no synonymrelationship between ontologies. Such a rewriting cannot preservethe semantics of the original query. Thus, Mena et al. (2000b) ad-dresses the information loss of the retrieved query results causedby such a rewriting. In our research, we assume that the queryrewriting is based on the well-defined semantic mapping. Thus,the queries rewritten through the semantic mappings preservethe semantics of the original query.

The previous studies premise that a user makes a query basedon a target ontology. Thus, they do not consider queries associatedwith multiple ontologies such as query Q2 in Example 2. In addi-tion, the recent several studies based on the P2P architecture con-sider the semantic mapping between two different ontologies.However, the scope of the semantic mapping is still not sufficientfor the distributed ontology environment. They do not considerthe semantic mapping associated with more than two ontologieslike the following example:

COLLEGE : ðauthorOf ; ?x; ?yÞ ^ PUB : ðisAbout; ?y; ?zÞ@UNIV

: ðresearchOn; ?x; ?zÞ

This semantic mapping means if ?x wrote ?y about ?z, ?x re-searches on z, and the properties authorOf, isAbout and researchOnare defined in COLLEGE, PUB, and UNIV ontology, respectively. Be-sides, most of previous works have concentrated on only the queryreformulation, but not on the efficient evaluation of distributedontology queries.

Table 1An example of semantic mappings.

ID Semantic mapping

M1 C : ðtype; ?x; LecturerÞ ! U : ðtype; ?x; ProfessorÞM2 U : ðtype; ?x; PersonÞ ^ U : ðwrite; ?x; ?yÞ ! P : ðtype; ?x; AuthorÞM3 C : ðauthorOf ; ?x; ?yÞ ^ P : ðisAbout; ?y; ?zÞ ! U : ðresearchOn; ?x; ?zÞ

3. Preliminary

The ontology describes the definitions of resources and thesemantic relationships among the resources. An ontology consistsof a schema and instances. The schema defines concepts (i.e., classand property) and relationships between the concepts. In the in-stance part, the type (i.e., class) of a resource and the relationship(i.e., properties) between resources are declared according to theschema. The ontology is expressed in triples describing the rela-tionships between concepts, between instances, and between aclass and an instance.

Definition 1 (Triple). Triple t is ðp; s; oÞ where p is a property, s isthe subject of p and o is the object of p. p has the domain DðpÞ andthe range RðpÞ. s is a class or an instance of class DðpÞ and o is aclass or an instance of class RðpÞ. For a class c in a schema, ½c�denotes the set of instances belonging to c.

For example, UNIV in Fig. 1 includes the following triples: (sub-ClassOf, Professor, Person) and (write, Person, Paper). Also, the in-stance part of UNIV can contain (type, prof1, Professor) and (write,prof1, pub1).

Definition 2 (Ontology query). Given ontology K, query Q is a triplepattern like qt1 ^ � � � ^ qtm where qti ¼ ðpi; si; oiÞ is a query tripleover K. The subject, the object, and the property of the query tripleare either a variable or a literal. qti contains at least one variable.And, for each qti, there is qtj such that they share a variable.

For qti ¼ ðpi; si; oiÞ over ontology K; ½qti� denotes the set ofmatches for qti in K. ½piðsiÞ� and ½piðoiÞ� denote the set of distinctsubjects and the set of distinct objects in ½qti�, respectively. If pi istype, ½qti� ¼ ½oi� where the object oi is a class declared in K.

The query answering for Q over ontology K is to find matches forthe triple pattern of Q from K. Given ontology UNIV,ðtype; ?x; ProfessorÞ ^ ðwrite; ?x; ?pÞ ^ ðisAbout; ?p; ‘Semantic Web’Þfinds professors who wrote papers about ‘Semantic Web’ fromUNIV.

We assume that the schemas of multiple ontologies are pro-vided to a user, and the user can generate a distributed query cov-ering the multiple ontologies.

Definition 3 (Distributed ontology query). Given a set of ontologiesS, distributed ontology query DQS is a conjunctive query overontologies in S. DQS ¼ Q1 ^ . . . ^ Qn where Qi is a local ontologyquery for an ontology in S. And, for Qi, there is Qj such that theyshare a variable.

Query triples in a distributed query match triple instances ofvarious ontologies. Thus, in order to distinguish the query triplesaccording to the ontology, the identifier of the correspondingontology is assigned to each query triple. A query triple qti overontology Oj is represented by Oj : ðpi; si; oiÞ.

In the distributed ontology environment, there may be thesemantic mapping among some views of different ontologies rele-vant to each other. To define the semantic mapping, we use theconcept of the DL-safe rule (Haase and Motik, 2005; Motik et al.,2004). A rule is DL-safe if all the variables of the rule are restrictedto only objects explicitly named in the ontology. The DL-safe is ap-plied to the semantic mapping to avoid the undecidability in thequery answering.

Definition 4 (Semantic mapping). A semantic mapping is repre-sented in an assertion DQS ! DQT , where DQX ;X 2 fS; Tg, is adistributed ontology query. The assertion means that the answer ofthe left side query is contained by the answer of the right sidequery. Also, the semantic mapping is DL-safe.

The inconsistency and the containment of queries in the seman-tic mappings are important issues to construct the semantic map-ping. Many studies (Meilicke et al., published online; Calvaneseet al., 2008; Calvanese et al., 1998) have been conducted to effec-tively deal with those issues. Therefore, we can assume that thecorrect semantic mapping is constructed through those existingstudies.

Table 1 shows an example of semantic mappings over the dis-tributed ontologies in Fig. 1. We support the semantic mappingsrelated to more than two ontologies such as semantic mapping M3.

4. Distributed ontology query processing

The schemas of multiple ontologies are provided to the userwho wants to query over the ontololgies, and the user makes dis-tributed ontology queries covering the ontologies. We assume thatthe schemas are provided in the same representation format andthe user is capable of understanding the content of each schema.

A distributed ontology query can be rewritten to multiple ontol-ogy queries according to semantic mappings. The answer of theoriginal user query will be covered by the distributed answers ofthe multiple queries. The queries search other ontologies whichthe user does not specify in the original query, even does not knowabout them. In conclusion, our approach does not physically inte-

88 J. Lee et al. / The Journal of Systems and Software 83 (2010) 85–95

grate all local ontologies into a global ontology, but can find theintegrated answer of the user query through the query rewritingusing semantic mappings and the evaluation of queries.

In this paper, we focus on the distributed reasoning based onthe semantic mapping, and the distributed reasoning is handledby the query rewriting and the answering of the reformulated que-ries. We assume that the local reasoning required for the queryanswering on a local ontology is managed by the corresponding lo-cal ontology management system.

The optimization cost for processing a distributed query can beexpensive. Especially, in order to find the optimal plan for the dis-tributed query which can be processed in parallel at multiple sites,many parallel execution plans should be examined due to the var-ious sources of parallelism. Therefore, a heuristic to reduce thesearch space and to provide an efficient query plan is required. Inaddition, the reformulated queries as well as the original queryare also distributed ontology queries. If these queries are scheduledindependently, the workload can be concentrated on a few localsites. Thus, a query plan which distributes the workload evenlyamong all local sites is required.

For the query rewriting and the query optimization, we assumethat every local site in the distributed environment registers itsontologies in a metadata registry. The registry maintains the sche-mas and the statistics of all ontologies for query rewriting andscheduling in the distributed environment. In addition, the seman-tic mappings related to the ontologies are also stored in the regis-try. All local sites can remotely access the registry.

4.1. Query rewriting

In the query rewriting phase, multiple distributed ontologyqueries are generated from a distributed ontology query accordingto semantic mappings. Fig. 2 describes the query rewriting algo-rithm. Given a query qi, we find the set of all semantic mappingsMR such that the right side of a rule in MR matches a part of qi (line3). The ‘match’ means that the triple sets A and B satisfy the follow-ing three conditions. (1) The number of triples in A is equal to thenumber of triples in B, (2) Each triple in A has one correspondingtriple in B such that their corresponding non-variable componentsare the same, and (3) The join relationship among variables of thetriples in A is equal to that of corresponding triples in B. For eachrule rj in MR, a new query EQn is generated by replacing the partof qi matching the right side of rj with the left side of rj preserving

Fig. 2. Query rewri

the variables of the replaced part (line 5). This process is applied toall newly generated queries as well as the original query DQ. Ifthere is no more additional query, the process is terminated (line10).

Our rewriting method is similar to the query rewriting of glo-bal-as-view approach (GAV) (Halevy, 2001). Global-as-view gener-ates a global view to integrate heterogeneous data sources. Theglobal view is described in terms of the views of local sourcesand a query is rewritten by a simple view unfolding process. Ourquery rewriting which recursively replaces the query triplesaccording to the semantic mapping is similar to the query rewrit-ing of GAV. However, our rewriting method provides a particularmatching method to find an appropriate replacing part accordingto the characteristics of the ontology query.

Table 2 shows a part of the queries derived from DQ by usingthe semantic mappings in Table 1. Table 2 also presents the appliedsemantic mapping (SM) and the previous query from which thequery is derived (FROM). The query EQ 1 is generated from DQ byusing semantic mapping M1. U : ðtype; ?x; ProfessorÞ in DQ is re-placed by the left side of M1, C : ðtype; ?x; LecturerÞ. EQ2 is gener-ated from EQ 1 through M3.

Among the queries generated in the rewriting phase, there canbe some invalid (i.e., there is no answer) or redundant (i.e., the an-swer is contained in the answer of another query) queries. If weprune such queries, we can effectively reduce the workload. As aresult, we suggest two types of pruning rules to eliminate theunnecessary queries before the query scheduling.

Consider two query triples sharing a variable. If there is no over-lap between the sets of instances matched to the variable in eachontology, we can conclude that the result of the query does not ex-ist. Thus, such a condition can be used to prune the invalid queries.According to this condition, we define the following pruning rule:

Rule 1. Given a query EQ containing qti¼ Oi : ðpx; sx; oxÞ and

qtj¼ Oj : ðpy; sy; oyÞ, if sx ¼ sy and ½DðpxÞ� \ ½DðpyÞ� ¼ ; or if sx ¼ oy

and ½DðpxÞ� \ ½RðpyÞ� ¼ ;ðsy ¼ ox is symmetric) or if ox ¼ oy and½RðpxÞ� \ ½RðpyÞ� ¼ ; then EQ has no result.

There can be superfluous queries whose answers are containedin the answer of another query. Consider the distributed ontologiesin Fig. 1 and the following queries:

EQ 1 : U : ðtype; ?x; ProfessorÞ ^ U : ðwrite; ?x; ?yÞ.EQ 2 : C : ðtype; ?x; LecturerÞ ^ U : ðwrite; ?x; ?yÞ.

ting algorithm.

Table 2Reformulated queries.

ID EQ SM FROM

DQ U : ðtype; ?x; ProfessorÞ ^ U : ðresearchOn; ?x; ‘semanticweb0ÞEQ1 C : ðtype; ?x; LecturerÞ ^ U : ðresearchOn; ?x; ‘semanticweb0Þ M1 DQEQ2 C : ðtype; ?x; LecturerÞ ^ C : ðauthorOf ; ?x; ?yÞ M3 EQ 1

^P : ðisAbout; ?y; ‘semanticweb0Þ

J. Lee et al. / The Journal of Systems and Software 83 (2010) 85–95 89

Assume that there is a student who has some publications inUNIV(U) and the student is also a lecturer in COLLEGE(C). In thediagram of Fig. 3a, the student is included in the a area. However,the striped area which is the answer of EQ1 can not cover the aarea. Therefore, if we want to see all his/her publications, both EQ1

and EQ2 are needed. In contrast, consider the following queries:

EQ3 : U : ðtype; ?x; PersonÞ ^ U : ðwrite; ?x; ?yÞ.EQ4 : C : ðtype; ?x;MemberÞ ^ U : ðwrite; ?x; ?yÞ.

Since the domain of the property write is U : Person, allinstances matched to U : ðwrite; ?x; ?yÞ (i.e., the gray area inFig. 3b) are retrieved by EQ3. The answer of EQ4 (i.e., the b areain Fig. 3b) is included in the answer of EQ3. Therefore, EQ4 is aredundant query of EQ3. Consequently, we can define a newpruning rule as below:

Rule 2. Given a query EQ containing qti ¼ Oi : ðtype; sx; oxÞ andqtj ¼ Oj : ðpy; sy; oyÞ, if there is a query EQ 0 that containsqt0i ¼ Oj : ðtype; sx; o0xÞ instead of qti, where sx ¼ sy and DðpyÞ# o0xor sx ¼ oy and RðpyÞ# o0x, and other query triples in EQ 0 are the sameas those in EQ, then EQ is a redundant query since the result of EQ 0

contains the result of EQ.There can be an indirect semantic mapping between two

concepts in different ontologies. Given the semantic mappingsX : A! Y : B and Y : C ! Z : D, if the local subsumption relation-ship Y : B! Y : C is declared, a semantic mapping X : A! Z : D isimplied. In order to derive such an indirect semantic mapping, ourapproach stores the local subsumption relationships betweenconcepts as semantic mappings. We assume that the local reason-ing to find all subsumption relationships between concepts in anontology is performed by a local ontology management system.

4.2. Query optimization technique

We assume that the ontologies relevant to a query are distrib-uted at several local sites and all the local sites are capable of queryprocessing (i.e., join). Thus, the local site where a user submits aquery and all local sites where the ontologies relevant to the queryare stored can be potentially involved in the processing of thequery. We call these local sites ‘interesting sites’.

A query plan for processing a distributed ontology query has abinary tree structure. It describes a join order among the results

Fig. 3. The diagram representing the overlap of query answers: U and C denoteUNIV ontology and COLLEGE ontology, respectively.

of the local ontology queries, and locations where the joins shouldbe performed. The result of the root plan is the final result of thedistributed ontology query. Each leaf plan corresponds to a localontology query. Fig. 4 shows an example of the distributed ontol-ogy query DQ and its query plan. Table 3 presents notations usedto describe a plan. The answer of query DQ is dispersed at localsites S1 and S2, and the user issues DQ to S1. Thus, the interestingsites are fS1; S2g. p2; p4 and p5 are plans for the local ontology que-ries Q 1;Q 2, and Q 3, respectively. The execution site of p5 is S2 (i.e.,p5:site ¼ S2) and the result of p5 should be sent to the site wherethe upper plan (i.e., p3) is executed (i.e., p5:des ¼ S1). p2 and p3

are the left and right sub-plans of p1. The result of the root planp1 is the final result of DQ.

4.2.1. Cost modelIn a distributed ontology environment, each local site has the

capability to process distributed queries. Each local ontology queryis processed in the local ontology management system which per-forms the disk-based query processing. On the other hand, the dis-tributed query processing which needs to join the results of thelocal ontology queries is processed in memory.

The sub-queries (i.e., local ontology query and intermediate dis-tributed query) of a distributed ontology query can be executed inparallel. Thus, as the cost function, we use the response time whichis the amount of the time it takes for a user to receive the queryresult. To estimate the response time of the distributed query pro-cessing, the cost of data transmission and the cost of join should beconsidered.

For a join operation, we consider two kinds of methods: HashJoin and Nested Loop Join. The join cost according to the join meth-od can be estimated as follows:

4.2.1.1. Join cost. The join cost of a plan p,

JCðpÞ ¼HJCðp:l;p:rÞ ¼ I � jp:lj þ R � jp:rj for Hash JoinNJCðp:l;p:rÞ ¼ jp:lj � jp:rj � C for Nested Loop Join

– I: the cost of inserting an item in the hash table;– R: the cost of retrieving a bucket from the hash table;– jxj: the cardinality of the result of x;– C: the comparison cost of two items.

For each join operation, our optimizer chooses a cheaper meth-od between these two join methods.

Fig. 4. An example of the distributed ontology query DQ and its query plan.

Table 3Notations for information of plan p.

Notation Description

p.l The left sub-plan of pp.r The right sub-plan of pp.site The site of the root of pp.des The site where the result of p will be sent

90 J. Lee et al. / The Journal of Systems and Software 83 (2010) 85–95

During the distributed query processing, data (i.e., query plansand query results) transmissions among local sites occur a lot.We only consider the transmission of the query result since thesize of query plans is relatively very small. Thus, the transmissioncost of the query result is estimated as follows:

4.2.1.2. Transmission cost. The result transmission cost of a plan p,

TCðpÞ ¼d � lengðpÞ � jpj if p:des – p:site

0 if p:des ¼ p:site

– leng(p): the average length of tuples in the result of p;– jpj: the cardinality of the result of p;– d: the constant decided by the data transfer rate of the network

(in inverse proportion to the data transfer).

Each leaf plan in a plan tree is carried out by the local ontologymanagement system in its execution site. Thus, the cost of the leafplan is dependent on the query optimization strategy of the localontology management system. There are several approaches esti-mating the cost of the leaf plan: generic cost model approach, indi-vidual wrapper cost model approach, and learning curve approach(Kossmann, 2000). The first approach uses a generic cost modelwith an adjustable parameter denoting the performance of each lo-cal ontology management system. The second approach definesthe separate cost model for each local system. For this approach,the query scheduling module should know the cost models of allthe local systems. The last approach estimates the cost of the localquery plan based on the statistics of the past plans. Our approachuses a generic cost model for estimating the cost for the local queryplan since it is the simplest one and the cost models for all the localsystems are generally not known. The cost of the plan p for a localontology query LCðpÞ is estimated by the following generic costmodel:

4.2.1.3. Generic cost. The cost of a plan p for local ontology query,

LCðpÞ ¼ GCðpÞ

GCðpjÞ ¼ c �jpjj if pj is a leaf plan in p

JCðpjÞ þ GCðpj:lÞ þ GCðpj:rÞ otherwise

(

– pj: a sub-plan in p;– c: a constant denoting the performance of the local system.

In general, many ontology management systems preprocess theABox reasoning required for the query answering in order to re-duce the query processing time. Thus, in our approach, we assumethat the necessary local reasoning for the query answering hascompletely conducted in advance and no local reasoning is con-ducted during the query processing. Consequently, we ignore thelocal ontology reasoning cost in our cost model for the distributedontology query processing.

We use the same join cost model in both memory-based queryprocessing and disk-based query processing. In order to reflect thedifference between them in the cost estimation, we use differentvalues of the parameters (e.g., the cost of inserting an item in thehash table, I, and the cost of retrieving a bucket from the hash tablefor the hash join, R).

In the distributed query processing, the operands of a join oper-ation can be generated from different sites. Thus, the transmissioncost should be considered. However, each join operation is de-ferred until both operands (i.e., the results of left and right sub-plans) are prepared and other query plans to be executed beforethe plan at the same site finish their works. This is why the waitingtime should be also considered. The cost of the distributed queryplan is estimated as follows:

4.2.1.4. Query cost. The cost for a distributed query plan p,

QCðpÞ ¼ wþLCðpÞ þ TCðpÞ if p is a leaf planJCðpÞ þ TCðpÞ otherwise

�w ¼ maxfQCðp:lÞ;QCðp:rÞ;QCðLastPlanðp:siteÞÞg

– LastPlan(p.site): the last plan which should be executed before pat p:site.

The overall response time is the time taken to finish the pro-cessing for all distributed queries. The plans at a single site are exe-cuted sequentially, while the plans in different sites are executedin parallel. Consequently, the overall cost is decided by the costof the query plan which is to be lastly finished among the plansprocessed at local sites.

4.2.1.5. Overall cost. The overall cost of EP which is a set of queryplans for executing the distributed ontology queries,

OCðEPÞ ¼maxsi2SfQCðLastPlanðsiÞÞg

– S: the set of interesting sites;

– LastPlanðsiÞ: the distributed ontology query plan which is lastly

executed at site si.

4.2.2. Query scheduling algorithmFig. 5 shows the query scheduling algorithm. Given a distrib-

uted query DQ, the semantic mappings SM, and the interestingsites S, the algorithm returns plan lists to be executed in each inter-esting site. As mentioned in Section 4.1, multiple queries EQ is gen-erated from DQ based on the semantic mappings SM (line 1). Themultiple queries are sequentially scheduled. For each distributedquery, a query plan is generated and decomposed according tothe execution site. A decomposed query plan is added to the planlist of its operating site (lines 2–9). We purpose to make the max-imum use of the processing capacity of each local ontology man-agement system. Thus, at the beginning of generating a plan foreach distributed query, the query is partitioned into local ontologyqueries (line 3). In order to estimate the cost of execution plan ofeach local ontology query, our scheduling algorithm finds the opti-mal plan of the query by using the traditional dynamic program-ming algorithm (Kossmann, 2000) (line 5), and the cost model toselect the optimal execution plan is presented in Section 4.2.1(generic cost). The dynamic programming algorithm is used in al-most all commercial DBMSs. Also, since a single local ontologyquery is not too complex, we use the traditional dynamic program-ming algorithm. Then, a join plan among the local ontology queriesof an distributed query is generated by PlanGeneration algorithm inFig. 6 (line 8). PlanGeneration algorithm enumerates query plansby composing local query plans for the distributed ontology query.Then, it finds an efficient query plan by considering the currentload of each local site and the parallel execution among local sites.In order to find an efficient query plan for the distributed ontologyquery processing, our query scheduling algorithm uses the follow-ing query optimization techniques:

4.2.2.1. Deeper Plan First Order (DPFO). Note that plans at a local siteare sequentially executed. The execution order among the plans ata local site has an influence on the total response time. Fig. 7 showsthe response time (i.e., cost) of two different execution orders forthe same query plan. There are two kinds of possible execution or-ders among plans p2 and p4 in S2. If p2 is executed before p4, theexecution of p3 is delayed for the execution time of p2 since p3

needs the result of p4. Besides, the delay is propagated to the totalcost. On the other hand, if p4 is executed first, it is possible to exe-cute p2 and p3 in parallel, and ultimately the total cost can be re-

Fig. 5. Query scheduling algorithm.

Fig. 6. PlanGeneration algorithm.

Fig. 7. Two different execution orders of plans for DQ in Fig. 4. it d Denotes the depth of a plan in the plan tree.

J. Lee et al. / The Journal of Systems and Software 83 (2010) 85–95 91

duced such that Cd < Cr . In conclusion, we decide the execution or-der among plans at a single site where the deeper plan in the plantree is executed first. We call this ordering method ‘Deeper Plan

First Order (DPFO)’. The order of execution among plans with thesame depth is determined by selecting the one with the smallestcost.

Table 5Semantic mappings.

ID Semantic mapping

M1 C : ðtype; ?x; StudentÞ ! U : ðtype; ?x; StudentÞM2 U : ðtype; ?x; PersonÞ ^ U : ðpublicationAuthor; ?y; ?xÞ ! P : ðtype; ?x;AuthorÞM3 C : ðtype; ?x;MemberÞ ^ U : ðwrittenBy; ?y; ?xÞ ! P : ðtype; ?x;AuthorÞM4 C : ðname; ?x; ?yÞ ! U : ðname; ?x; ?yÞM5 C : ðtype; ?x; ProfessorÞ ! U : ðtype; ?x; ProfessorÞM6 C : ðtype; ?x; FullProfessorÞ ! U : ðtype; ?x; FullProfessorÞM7 U : ðPublicationAuthor; ?x; ?yÞ ! P : ðwriter; ?x; ?yÞM8 C : ðwrittenBy; ?x; ?yÞ ! P : ðwriter; ?x; ?yÞM9 U : ðtype; ?x; PublicationÞ ! P : ðtype; ?x; PublicationÞM10 C : ðtype; ?x; PaperÞ ! U : ðtype; ?x; ProfessorÞM11 C : ðteachesClass; ?x; ?yÞ ! U : ðteacherOf ; ?x; ?yÞM12 C : ðwrittenBy; ?x; ?yÞ ! U : ðpublicationAuthor; ?x; ?yÞM13 C : ðwrittenBy; ?x; ?yÞ ^ P : ðisAbout?y?tÞ ! U : ðresearchOn; ?x; ?tÞ

Table 6Query set.

ID Query

Q1 U : ðtype; ?s; StudentÞQ2 P : ðtype; ?a;AuthorÞQ3 U : ðtype; ?f ; FullProfessorÞ ^ U : ðname; ?f ; ?nÞQ4 P : ðtype; ?p; PublicationÞ ^ P : ðwriter; ?p; ?aÞQ5 U : ðtype; ?f ; ProfessorÞ ^ U : ðpublicationAuthor; ?s; ?f Þ ^ U : ðtype; ?s;

StudentÞ ^ U : ðemailAddress; ?s; ?eÞ ^ U : ðadvisor; ?s; ?f ÞQ6 U : ðtype; ?a; ProfessorÞ ^ U : ðworksFor; ?s; ?uÞ ^ U : ðemailAddress; ?a; ?eÞ

^P : ðtype; ?p; PublicationÞ ^ P : ðwriter; ?a; ?pÞQ7 U : ðtype; ?p; ProfessorÞ ^ U : ðresearchOn; ?p; ‘semanticweb0ÞQ8 U : ðtype; ?a; ProfessorÞ ^ U : ðpublicationAuthor; ?p; ?aÞ ^ U :

ðname; ?a; ?nÞ ^ U : ðteacherOf ; ?a; ?cÞ

92 J. Lee et al. / The Journal of Systems and Software 83 (2010) 85–95

4.2.2.2. Load balancing for distributed queries. For a single distrib-uted ontology query, multiple distributed ontology queries canbe generated and processed for the query result. Therefore, duringthe scheduling of a distributed ontology query, if we do not con-sider the workload of each local site where the operations of theother queries are also assigned, many operations can be concen-trated on several local sites, even one site. In order to deal with thisproblem, at the beginning of the scheduling for each query, thescheduler refers to the cost of the plan which is lastly executedamong the plans belonging to the previously scheduled query foreach local site, and decides the operating site for each join so asto minimize the overall response time. This is implemented as fol-lows: in the query scheduling algorithm of Fig. 5, PL is a set of listsof plans assigned to each local site. PL returned from the schedul-ing of the previous distributed ontology query eqi is an input to thescheduling of the next query eqiþ1 like PL :¼ PlanGenerationðLs; S; PLÞ. Since the list of plans for each site is accumulated, atthe beginning of the scheduling of each distributed ontology query,the scheduler can consider the last plan of previously scheduledontology query for each local site and the cost of the last plan is re-flected according to our cost model (QueryCost) in Section 4.2.1.

4.2.2.3. Caching. In the query rewriting phase, multiple distributedontology queries are generated from a user query in order to coverthe dispersed answer of the query. There can be common sub-que-ries among them. It means that the common sub-queries arerepeatedly performed during the entire query processing. Thecaching is an effective way to reduce the overhead due to the re-peated execution of the same query. Thus, during the executionof a distributed query, the result of the sub-query which will be ac-cessed again by subsequent other distributed queries is stored inthe cache at local sites. In addition, the caching can be consideredin the cost estimation for a query plan. If the result of the query iscached, the cost of the plan is estimated as:

QCðpÞ ¼ wþ jpj þ TCðpÞ if the result ofp is cached at p:site

Table 7The effectiveness of pruning.

QUERY EQ P.EQ P.RULE

Q1 1 1 –Q2 4 2 Rule 2Q3 4 4 –Q4 9 3 Rule 2Q5 16 8 Rule 1Q6 72 18 Rules 1, 2Q7 7 7 –Q8 24 24 –

5. Experimental analysis

The previous studies on the query processing over distributedontologies have focused on only the query rewriting for the distrib-uted query processing, but not on the efficiency of the query eval-uation. Thus, in order to show the efficiency of our queryprocessing technique, we implemented AIDOS (An Intelligent Dis-tributed Ontology query proceSsing), and empirically comparedthe performance of the distributed query processing of two ver-sions of AIDOS; AIDOS-I and AIDOS-II. AIDOS-I prunes unnecessaryqueries before the query scheduling, and uses DPFO heuristic in thescheduling of each query. AIDOS-I treats each distributed ontologyquery independently so that it does not consider the load balancingamong sites and the caching for the repeatedly accessed dataamong the multiple distributed ontology queries. AIDOS-II sup-ports all optimization techniques (pruning, DPFO, load balancing,and caching) mentioned in this paper. In addition, we present theeffectiveness of pruning invalid and redundant queries.

Table 4An experimental environment.

Site System CPU Memory Data

Site 1 ONTOMS 2.8 GHz 2 GB UNIV(U)Site 2 ONTOMS 2.4 GHz 1 GB COLLEGE(C)Site 3 JENA 2.4 GHz 768 MB PUB(P)

5.1. Environment

We implemented AIDOS-I(II) using a relational database man-agement system (RDBMS) and Java. The RDBMS is used for manag-ing metadata information such as semantic mappings andstatistics. To construct the heterogeneous distributed experimentalenvironment, ONTOMS (Park et al., 2007), JENA (Wilkinson et al.,2003) were used as local ontology management systems. A de-tailed information on local systems is described in Table 4.

5.1.1. Data setWe used Lehigh University Benchmark Data (LUB).1 LUBM (Guo

et al., 2004) is well-known OWL benchmark data containing the uni-versity ontology. In order to evaluate the performance of the distrib-uted query processing, we produced three different types of

1 Available in http://swat.cse.lehigh.edu/projects/lubm/index.htm.

Table 8The number of plans assigned to each site (U: 20 MB, C: 20 MB, P: 10 MB).

Query Total (T) Method S1 S2 S3

Q1 2 AIDOS-I 1 1 0AIDOS-II 1 1 0

Q2 3 AIDOS-I 1 1 1AIDOS-II 1 1 1

Q3 8 AIDOS-I 4 4 0AIDOS-II 4 4 0

Q4 3 AIDOS-I 1 1 1AIDOS-II 1 1 1

Q5 28 AIDOS-I 18 10 0AIDOS-II 18 10 0

Q6 68 AIDOS-I 36 26 6AIDOS-II 34 28 6

Q7 16 AIDOS-I 9 4 3AIDOS-II 5 8 3

Q8 40 AIDOS-I 21 15 4AIDOS-II 24 12 4

Table 9The statistics of caching.

Query Cached (C) C/T (%) U.CACHE (U) U/T (%)

Q5 5 17.8 7 25Q6 15 22.05 20 29.411Q7 3 18.75 3 18.75Q8 8 20 10 25

Fig. 8. Query exe

J. Lee et al. / The Journal of Systems and Software 83 (2010) 85–95 93

ontologies (based on LUBM) but related to each other (i.e., UNIV(U),COLLEGE(C), PUB(P)), as briefly described in Fig. 1. Several proper-ties (e.g., researchOn) are newly inserted, and some instances are de-clared in different ontologies as different types. According to thethree different ontology schemas, we generated variously sizedOWL data for each schema: 10 MB, 20 MB and 50 MB. Those gener-ated OWL data were stored at three local sites along with the corre-sponding schemas. Also, we defined the set of semantic mappingsamong the ontologies, as shown in Table 5.

5.1.2. Query setWe selected eight representative queries for the evaluation of

the performance. Table 6 shows those selected queries. The criteriaof selection are the complexity of the query and the semantic map-ping (i.e., Q1, Q2, Q7), the distribution degree of the answer (i.e.,Q2, Q5), and the effect of pruning and caching (i.e., Q2, Q4, Q5,Q6, Q8).

The query processing time was measured by executing eachquery five times and averaging the time for three of the runsexcluding the minimum and maximum times.

5.2. Experimental results

5.2.1. PruningTable 7 shows the effectiveness of our pruning technique in the

query rewriting. EQ presents the number of queries generated bythe rewriting algorithm in Fig. 2. P.EQ and P.RULE show the num-ber of queries after pruning and the applied pruning rule,respectively.

cution time.

94 J. Lee et al. / The Journal of Systems and Software 83 (2010) 85–95

For Q4, the queries which are generated by replacing P :

ðwriter; ?p; ?aÞ to U : ðpublicationAuthor; ?p; ?aÞ or C : ðwrittenBy;?p; ?aÞ through the semantic mapping M7 and M8 were removedby Rule 2, since, in UNIV, the domain of publicationAuthor is Publi-cation and all writers of publications in UNIV are retrieved by theoriginal query. Some queries derived from Q2 were also removedby Rule 2. In the data set used in our experiments, there is no inter-section between ½C : Student� and ½U : Person�. As a result, most que-ries including sub-queries such as C : ðtype; ?s; StudentÞ ^ U :

ðadvisor; ?s; ?f Þ where the domain of advisor is U : Person haveno answer. Thus, eight invalid queries for Q5 are eliminated byRule 1. For Q6, both Rule 1 and Rule 2 are effectively used to re-move the large number of invalid and redundant queries.

Consequently, the number of queries is in proportion to thecomplexity of query and in inverse proportion to the disjointnessamong sources related to the query.

5.2.2. Load balancingTable 8 shows the number of sub-plans (i.e., local query plans

and intermediate distributed query plans) executed at each localsite. S1, S2, and S3 present the number of plans assigned to eachlocal site, and TOTAL represents the total number of the plans.

Generally, S1 processes the most plans among local sites S1, S2and S3. The first reason is that S1 has the most powerful system.Second, for each query, most of reformulated queries require to ac-cess the ontology UNIV in S1.

The caching also has an influence on the load balancing amongsites. From Q6 to Q8, AIDOS-I and AIDOS-II assign the differentnumber of plans to each local site since AIDOS-II uses the caching,but AIDOS-I does not. For example, in case of Q8, AIDOS-II expectsthat much time would be reduced by the caching in S2. Thus, AI-DOS-II generates a plan where three more plans will be processedin S1, compared with AIDOS-I.

5.2.3. CachingAs mentioned before, AIDOS generates multiple extended que-

ries for a distributed ontology query, and there can be commonsub-queries among the extended queries. Table 9 shows the num-ber of sub-plans whose results are cached (CACHED) and the num-ber of sub-plans which use the cached result (U.CACHE) for eachquery. From Q1 to Q4, there are no cached sub-plans since theyhave no common sub-queries among the queries. Thus, we only in-clude the result of Q5 to Q8 in Table 9. From Q5 to Q8, the results ofabout 20% of sub-plans are cached and about 18–30% of sub-plansused the cached results without re-execution of the same query. Asthe query has the larger number of common sub-queries, the effec-tiveness of caching also increases.

5.2.4. Execution timeFig. 8 presents the response time for each query. Generally, AI-

DOS-II outperforms AIDOS-I by average 20.57% for Q5–Q8 whichused caching. During the query scheduling time, AIDOS-II consid-ers the degree of workload for each local site and which plan needsto be cached. However, the scheduling time of AIDOS-II is almostthe same with that of AIDOS-I. Also, this was tiny time enough tobe ignored, compared with the query execution time. The majorpart which spends the most time is the processing each local ontol-ogy query in the local ontology management system. As a result, asthe number of complex local query plans is larger, the responsetime also increases. In general, many local ontology queries areshared among queries. Consequently, AIDOS-II effectively reducesthe response time by caching the results of common queries andthe load balancing according to the caching. In conclusion, theeffectiveness of our optimization techniques are growing along

with the increase of the data size and the number of queries whichcontain common sub-queries.

6. Conclusion

In this paper, we introduce an intelligent distributed ontologyquery processing method. We suggest more general models ofthe distributed onotology query and the semantic mapping amongdistributed ontologies than those of previous works so as to beapplicable to more general environments and to retrieve the richerquery answer. Also, through the query rewriting using the seman-tic mapping, we can obtain the integrated answer of a query overdistributed ontologies without any global schema and the physicalintegration of different ontologies.

In addition, we propose several query optimization techniques.First, our approach eliminates invalid and redundant queries in thequery rewriting phase using pruning rules. Second, Deeper PlanFirst Order (DPFO) scheduling method increases the parallelismof executing sub-plans belonging to a query but independent eachother. Third, in order to distribute plans among queries, our ap-proach considers the plans of the previously scheduled queries.Furthermore, our approach uses the caching in order to removethe overhead due to the repeated execution of the same sub-que-ries among distributed queries. These load balancing and cachingfactors are reflected in the cost estimation for a query plan.

Finally, through the experimental results, we observed that ouroptimization techniques effectively reduced the response time.Especially, the effectiveness of the caching and the load balancingincreases along with the growing of the data size and the numberof distributed queries which access the common part of resources.

Acknowledgement

This research was supported in part by the Ministry of Knowl-edge Economy, Korea, under the Information Technology ResearchCenter support program supervised by the Institute of InformationTechnology Advancement. (grant number IITA-2008-C1090-0801-0031), and in part by Microsoft Research Asia.

References

Adjiman, P., Goasdoue, F., Rousset, M.-C., 2007. Somerdfs in the semantic web. J.Data Semantics VIII 4380, 158–181.

Borgida, A., Serafini, L., 2003. Distributed description logics: assimilatinginformation from peer sources. J. Data Semantics 1, 153–184.

Calvanese, D., Giacomo, G.D., Lenzerini, M., 1998. On the decidability of querycontainment under constraints. In: Proc. of the 17th ACM SIGACT SIGMODSIGART Symp. on Principles of Database Systems, pp. 149–158.

Calvanese, D., Giacomo, G.D., Lenzerini, M., 2008. Conjunctive query containmentand answering under description logic constraints. ACM Trans. Comput. Logic 9(3).

Guo, Y., Pan, Z., Heflin, J., 2004. An evaluation of knowledge base systems for largeOWL datasets. In: Proc. of ISWC, pp. 274–288.

Haase, P., Motik, B., 2005. A mapping system for the integration of owl-dlontologies. In: Proc. of IHIS 2005, pp. 9–16.

Haase, P., Wang, Y., 2007. A decentralized infrastructure for query answering overdistributed ontologies. In: Proc. of SAC 2007, pp. 1351–1356.

Halevy, A.Y., 2001. Answering queries using views: a survey. VLDB J. 10 (4).Halevy, A.Y., Ivesc, Z.G., Mork, P., Tatarinov, I., 2003. Piazza: data management

infrastructure for semantic web applications. In: Proc. of WWW 2003, pp. 556–567.

Kossmann, D., 2000. The state of the art in distributed query processing. ACMComput. Survey 32 (4), 422–469.

Meilicke, C., Stuckenschmidt, H., Tamilin, A. Reasoning support for mappingrevision. J. Logic Comput., Advance Access published online.

Mena, E., Illarramendi, A., Kashyap, V., Sheth, A.P., 2000a. OBSERVER: an approachfor query processing in global information systems based on interoperationacross pre-existing ontologies. Distrib. Parallel Databases 8 (2).

Mena, E., Kashyap, V., Illarramendi, A., Sheth, A.P., 2000b. Imprecise answers indistributed environments: estimation of information loss for multi-ontologybased query processing. Int. J. Cooperative Inf. Syst. 9 (4).

Motik, B., Sattler, U., Studer, R., 2004. Query answering for owl-dl with rules. In:Proc. of ISWC 2004, pp. 549–563.

J. Lee et al. / The Journal of Systems and Software 83 (2010) 85–95 95

Nejdl, W., Wolf, B., Qu, C., Decker, S., Sintek, M., Naeve, A., Nilsson, M., Palmer, M.,Risch, T., 2002. Edutella: a p2p networking infrastructure based on rdf. In: Proc.of WWW 2002, pp. 604–615.

Nejdl, W., Wolpers, M., Siberski, W., Schmitz, C., Schlosser, M.T., Brunkhorst, I.,Loser, A., 2003. Super-peer-based routing and clustering strategies for rdf-basedpeer-to-peer networks. In: Proc. of WWW 2003, pp. 536–543.

Park, M.-J., Lee, J., Lee, C.-H., Lin, J., Serres, O., Chung, C.-W., 2007. An efficient andscalable managment of ontology. In: Proc. of DASFAA, LNCS 4443, pp. 975–980.

Serafini, L., Andrei, 2005. Drago: distributed reasoning architecture for the semanticweb. In: Proc. of ESWC 2005, pp. 361–376.

Stuckenschmidt, H., Vdovjak, R., Broekstra, J., Houben, G.-J., 2005. Towardsdistributed processing of rdf path queries. Int. J. Web Eng. Technol. 2 (2),207–230.

Wilkinson, K., Sayers, C., Kuno, H.A., Reynolds, D., 2003. Efficient RDF Storage andretrieval in Jena2. In: Proc. of the 1st International Workshop on SWDB, pp.131–150.

Jihyun Lee is a Ph.D student in the Division of Computer Science at Korea AdvancedInstituted of Science and Technology (KAIST), South Korea. Her research interestsinclude XML data management, semantic Web, ontology data management, andinformation retrieval on the Web.

Jeong-Hoon Park is a Ph.D student in the Division of Computer Science at KoreaAdvanced Instituted of Science and Technology (KAIST), South Korea. His researchinterests include semantic Web, semantic annotation, ontology data management,and information retrieval on the Web.

Myung-Jae Park is a Ph.D. student in the Division of Computer Science at the KoreaAdvanced Institute of Science and Technology (KAIST), Korea. His research interestsinclude XML, ontology and the semantic Web, and publish/subscribe systems.

Chin-Wan Chung received a Ph.D. degree from the University of Michigan, AnnArbor in 1983. He was a Senior Research Scientist and a Staff Research Scientist inthe Computer Science Department at the General Motors Research Laboratories(GMR). While at GMR, he developed Dataplex, a heterogeneous distributed data-base management system integrating different types of databases. Since 1993, hehas been a professor in the Division of Computer Science at the Korea AdvancedInstitute of Science and Technology (KAIST), Korea. At KAIST, he developed a full-scale object-oriented spatial database management system called OMEGA, whichsupports ODMG standards. His current research interests include the semanticWeb, the mobile Web, sensor networks and stream data management, and multi-media databases.

Jun-Ki Min is a professor in the school of Internet-Media at the Korea University ofTechnology and Education (KUT) in Korea. He received a Ph.D. degree from theKorea Advanced Institute of Science and Technology (KAIST), Korea in 2002. He wasa senior researcher in Electronics and Telecommunications Research Institute(ETRI), Korea. While at ETRI, he developed UbiCore, which is a large volume streamdata management system. He has written and published several articles in inter-national journals and conference proceedings. His current research interests includeXML, the semantic Web, sensor network and stream data management.