Thesis Proposal: On PDMS support for extended heterogeneity, aggregation and advanced query processing

by

Jian Xu

A THESIS PROPOSAL SUBMITTED IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE STUDIES

(Department of Computer Science)

The University Of British Columbia

(Vancouver)

February 2009

© Jian Xu, 2009

Abstract

A Peer Data Management System (PDMS) allows the easy sharing of data between heterogeneous databases over a flexible, highly available peer-to-peer (P2P) network. In a PDMS, queries are translated among data sources and processed by the databases located at the physical peers in the network. This allows information rich in semantics to be exchanged, which distinguishes a PDMS from a P2P file exchange network. However, when looking at real-world applications, there prove to be other common requirements that current PDMSs do not support. In this proposal we present extensions to the current PDMS architecture that improve it in three new areas by answering the following questions. How can a PDMS support extremely heterogeneous databases? How can objects defined in different domains be used together effectively in query answering? How can we model and inform the users about the quality of the query answers, and keep the users updated with the best possible answers to their queries? These questions were raised in the context of the Joint Interdependent Infrastructure Project (JIIRP) for responding to large-scale disasters like earthquakes. We show case studies from the JIIRP project to explain why existing techniques are insufficient for JIIRP's requirements. We present our prototype of the extended PDMS that addresses the above questions, and show how JIIRP as well as other applications would benefit from the proposed extensions.


Table of Contents

Abstract

List of Tables

List of Figures

Glossary

1 Overview of the General Goals for the Thesis Work
    1.1 Introduction
    1.2 A case study
    1.3 Overview of the extended features
        1.3.1 Supporting extended heterogeneity
        1.3.2 Supporting deriving and managing new information
        1.3.3 Supporting quality information and continuous update
        1.3.4 Internal connections between the three proposed enhancements
    1.4 Optimizing the semantic overlay network topology
    1.5 Supporting PDMS aggregate queries
    1.6 VE and continuous query
    1.7 Related work

2 Optimizing the Topology of the Semantic Overlay Network
    2.1 Introduction
    2.2 Related work
    2.3 The acquaintance selection framework
        2.3.1 Acquaintance selection operations
        2.3.2 Mapping and mapping quality metric
        2.3.3 The acquaintance selection criteria
    2.4 Summary of results for acquaintance selection
    2.5 Empirical study

3 Processing Object Decomposition Aggregate Queries
    3.1 Motivation of PDMS aggregation query
        3.1.1 A query from JIIRP
        3.1.2 A single database version
        3.1.3 The PDMS as the query answering platform
    3.2 Related work for PDMS aggregation query (PAQ)
        3.2.1 Open world and closed world assumptions
        3.2.2 Aggregation in data integration
        3.2.3 Aggregation query answering in data exchange
        3.2.4 Schema mappings and data mappings
        3.2.5 A comparison with existing PDMS systems
    3.3 Problem definitions
        3.3.1 Relocating data sets to data sources in the PDMS
        3.3.2 Definition of PDMS aggregation query
        3.3.3 Answers to PDMS aggregate queries
        3.3.4 Challenges in processing PAQ
    3.4 Object decomposition aggregates
        3.4.1 The JIIRP application for object decomposition aggregate (ODA)
        3.4.2 Definition of the object decomposition aggregate (ODA)
    3.5 Research plan for processing ODA using AFOD for optimization

4 Value Estimation and Continuous Query Support in the PDMS
    4.1 Introduction
    4.2 Value estimation definitions
    4.3 Research plan for value estimation
    4.4 Related work for value estimation
    4.5 Continuous query definitions
    4.6 Research plan for continuous query
    4.7 Related work for continuous query

5 Research Plan and Time Table

Bibliography


List of Tables

2.1 Notations in empirical study

3.1 Terms used in SON and P2P network

5.1 Time table for the proposed research


List of Figures

1.1 Query answering in two SONs with the same set of peers but different topologies

2.1 An example of a PDMS and semantically mapped peers

2.2 An example of cordless paths

2.3 One-shot accuracy on D-regular topology

2.4 Two-hop vs. random on D-regular topology

3.1 The PDMS architecture: a logical diagram

3.2 The SON and PHY topology of a PDMS with 3 peers

3.3 Processing of the query in Section 3.1

4.1 An example of value estimation (VE)


Glossary

PDMS Peer Data Management System

PAQ PDMS aggregation query

ODA Object Decomposition Aggregate

JIIRP Joint Interdependent Infrastructure Project

CWA Closed World Assumption

OWA Open World Assumption

TGD tuple-generating dependency


AFOD accuracy first object decomposition

SPJ select-project-join

P2P peer-to-peer

SON semantic overlay network

VE value estimation

CQ continuous query

PCQ PDMS continuous query

CPAQ continuous PDMS aggregation query

CVE continuous value estimation


Chapter 1

Overview of the General Goals for the Thesis Work

1.1 Introduction

Real-life applications keep posing new requirements for data integration systems. A data integration system must support a large number of independently maintained data sources. We want the system to be able to answer queries using all its data sources, despite the semantic heterogeneity between them. We want to be able to organize data sources that reside in remote, distributed sites so that users of the system see only a unified interface, as if they were querying one single, consistent database.

A Peer Data Management System (PDMS) [15, 67] extends the autonomous data sharing of a peer-to-peer (P2P) system from file exchange to the exchange of semantically rich information. The P2P model allows peers to easily join and exit a PDMS, allows the owners of the data sources to fully control access to and sharing of their data, and does not rely on a centralized server that would be a bottleneck. These advantages make a PDMS the first choice in disaster management and other situations in which data sharing must be set up quickly and easily, with limited resources, and the availability of a centralized server cannot be guaranteed.

For the past three years, we have been working with the multi-disciplinary Joint Interdependent Infrastructure Project (JIIRP) [30] on the data engineering issues of disaster management. The primary goal of JIIRP is to study the interdependencies between infrastructure elements during a disaster. One component of JIIRP is a simulator which takes as input a disaster scenario (e.g., an earthquake), including the conditions of the infrastructures right after the disaster, and predicts loss and infrastructure conditions in the future. Obtaining accurate simulation results requires significant manipulation of the input sources. The simulator needs condition reports on the infrastructure being simulated. The data in JIIRP is structured data and we manage it in a PDMS that uses relational databases for the data sources. Throughout this proposal, when we refer to data and databases, we mean relational data and databases.

While a PDMS helps to integrate the data, its current form is not sufficient. For example, BC Hydro has the electricity supply information, while BCTC maintains the electricity transmission system. As in standard PDMSs, both databases are autonomous and not designed to work together. The requirements of JIIRP's application introduce several new wrinkles that current PDMSs do not handle. For example, BCTC has the electricity wiring scheme for each building, while the seismic damage assessments of the buildings are published by the school of civil engineering, which has no access to that information. Using standard PDMSs, the damage assessment could tell how much damage the "wires in a certain building will suffer from an earthquake", but it does not know which wire is affected or the possible aftermath of damage to that particular wire. The reason is that BCTC's data cannot be shared with the building damage assessment study, as the two belong to two different domains. Data sources belonging to separate knowledge domains usually do not have their schemas/data mapped before they are put into a PDMS for integration. A similar situation exists on the BCTC side: it can tell how damage to particular wires could affect the power system, but it does not know how to use the buildings' structural damage assessments to estimate the physical damage to the wires during an earthquake. An additional complication is that the same real-life object may have multiple data sources describing it. For example, the building electricity wiring scheme is also available from the school's utility office, and the public safety department of the city owns a seismic damage assessment conducted 10 years ago. Current PDMSs are not up to solving these problems. The best we see from an existing system is that it collects answers from all "possible" data sources and returns the union, leaving the burden of making the final choice to the user. Ironically, answers returned in this fashion usually carry no information beyond the values themselves, making it extremely hard for the user to perform this last step.

The above examples suggest that PDMSs need to improve in the following directions. A PDMS should allow data sources in disparate domains to exchange information; specifically, it should support processing queries that require information from data sources of multiple domains. The inclusion of data sources from multiple domains creates the possibility of deriving new information from their integration. Therefore, a PDMS should support such data processing and, more importantly, manage the derived data as well as the deriving logic itself. Moreover, when multiple data sources provide answers for one query, the PDMS should help the user choose one answer to use and possibly understand the quality of that answer.

We propose to improve the existing PDMS and provide solutions to the above problems. In short, we extend the current PDMS model in the following three aspects. First, we add support for extended heterogeneity among data sources. Second, we support deriving new information across data sources and managing it. Third, we propose to add value estimation and continuous query support to the PDMS. Details of these improvements are described in Section 1.3 and the succeeding chapters of this proposal.

This proposal is organized as follows. In the remaining sections of this chapter we continue to motivate the above-mentioned extensions to the PDMS and discuss in more detail specific requirements for the proposed improvements. In Section 1.2, we look at a case study for a query processing task from JIIRP. Section 1.3 describes in more detail the new features we propose to add to the PDMS and the internal connections between them. Then, we propose three solutions, each addressing one proposed point, in Section 1.4, Section 1.5 and Section 1.6 respectively. Section 1.7 presents work related to the proposed research topics at the end of this chapter.

In Chapter 2, the work of optimizing the semantic overlay network (SON) topology is described for the goal defined in Section 1.4. We describe our techniques to optimize the acquaintance selection process so that a peer in the PDMS can choose acquaintances that best help it to query the PDMS. Some preliminary experimental results are also presented. In Chapter 3, we describe the ongoing work on processing the PDMS aggregation query (PAQ) as the proposed solution to the questions raised in Section 1.5. In Chapter 4, we propose future work to address the requirements from Section 1.6: (1) when multiple answers are collected, how the final step of choosing the right answer to use can be automated, and (2) how updates to queries can be continuously pushed to the user of the PDMS. Finally, Chapter 5 gives a tentative time table for the proposed research.

1.2 A case study

We use the following example to illustrate the need for the new features in the PDMS. This case study is from JIIRP, where the simulator needs to know if a substation's remaining supply capacity can satisfy the electricity demand of the southern area of the university campus after an earthquake. Specifically, a query asks to compute the difference between the remaining supply and the electricity demand from the southern campus after an earthquake strikes.

In the data sources of the PDMS, we have a spatial database (as one data source) which has the "southern area" defined in longitude and latitude coordinates. The spatial database can also be queried for the buildings in the area. The data sources of building and infrastructure damage assessments are maintained and published by the school of civil engineering. They are published by two study groups as two data sources in the PDMS: one is an assessment performed 2 years ago and the other is a more recent re-assessment. BC Hydro publishes a data source that describes the working conditions of the substation on campus. The wiring scheme, which gives information about the panels and wires in the buildings and the way they are wired to the substation, is published in another data source by BCTC. To enable integration of the data, the above data sources are managed in a PDMS and mappings between the data sources are established by the JIIRP project group.

To answer the query, we need all three newly proposed features in the PDMS. All five data sources and aggregate functions (e.g., SUM) are needed to compute the remaining electricity supply to the southern area from the substation and the total demand from the buildings there. The simulator needs to track the changes in demand and supply while infrastructure on both sides is being fixed and brought back online after the earthquake. This query thus needs continuous updates reporting any changes that affect the difference.
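To make the case study concrete, the following is a minimal sketch of the computation this query asks for, assuming hypothetical query interfaces for the data sources; every schema element and function name here is illustrative, not part of an actual JIIRP system.

    # Hypothetical interfaces: spatial_db (campus regions), substation_db
    # (BC Hydro), wiring_db (BCTC wiring scheme), damage_db (a civil
    # engineering assessment; in practice two assessment sources exist and
    # would first be reconciled, e.g., by value estimation).

    def southern_buildings(spatial_db):
        """Ask the spatial source for buildings inside the southern area."""
        return spatial_db.buildings_in_region("southern_campus")

    def remaining_supply_kw(substation_db, substation_id):
        """Remaining supply capacity (kW) reported for the substation."""
        return substation_db.remaining_capacity(substation_id)

    def total_demand_kw(wiring_db, damage_db, buildings):
        """Sum post-earthquake demand, discounting damaged wiring."""
        demand = 0.0
        for b in buildings:
            panels = wiring_db.panels_of(b)    # BCTC wiring scheme
            damage = damage_db.damage_of(b)    # assessed damage in [0, 1]
            # Assumed model: a damaged building draws load only on the
            # fraction of its wiring that survives.
            demand += sum(p.load_kw for p in panels) * (1.0 - damage)
        return demand

    def supply_demand_gap(spatial_db, substation_db, wiring_db, damage_db, sub_id):
        buildings = southern_buildings(spatial_db)
        supply = remaining_supply_kw(substation_db, sub_id)
        return supply - total_demand_kw(wiring_db, damage_db, buildings)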

A hybrid system in which a PDMS is created from a number of domain-homogeneous data integration systems (e.g., Piazza [67]) is insufficient. A hybrid system usually does not maintain data integration paths between sources of different domains. So a query involving only one domain can be answered, but a query that requires information from multiple domains finds no answers from any of the single-domain subsystems. This problem is rooted in two issues. First, a hybrid system maintains no knowledge that brings together data in different subsystems. Second, subsystems are the boundaries of query processing in such a hybrid system, so no query can make use of data sources in more than one subsystem for query answering. In practice, it is also hard to determine the domains and classify the subsystems. Some data sources may contain cross-domain data, so it is hard to choose a subsystem to manage such data sources. In fact, data sources of this kind are very important glue for data integration and play crucial roles in cross-domain query answering.

1.3 Overview of the extended features

In this section, we present a more detailed overview of the new features we plan to support in a PDMS. We describe in more detail the motivations for each feature here and leave technical details to the chapters that follow.

4

1.3.1 Supporting extended heterogeneity

We introduce a new level of heterogeneity, which we call "domain heterogeneity" to distinguish it from the widely known and studied schema heterogeneity. We first compare the two concepts. Schema heterogeneity exists in the way that two databases describe the same type of objects. This includes the different representations of the object and the different associations the object has to other object types in the two databases. Domain heterogeneity, on the other hand, measures how different the sets of object types maintained in the two databases are. Databases which are domain homogeneous but schema heterogeneous are discussed intensively in existing data integration applications, e.g., the canonical examples of integrating bibliography records in different online publication databases. There are many fewer systems that focus on applications which require integration of domain heterogeneous data sources. JIIRP is one typical case that requires this level of integration. In JIIRP, we need data that describe physical properties of land, buildings and critical facilities; we need damage assessment databases for various lifeline infrastructures including electricity, water, gas and communication; and we need data that describe service information for security, public health and emergency response policies. All these data need to be integrated in order to be useful for the JIIRP study, e.g., to run an earthquake simulation. A query (e.g., from the JIIRP simulator) usually involves integrating knowledge across domains, like the example query we discussed in Section 1.2.

Query answering in the PDMS depends on the mappings between data sources that help translate queries across the local schemas of the data sources. One question to answer is how the introduction of domain heterogeneous data sources affects query answering in the PDMS. We observed that mappings are not evenly distributed among data sources: usually there are more mappings between data sources of the same domain than between those of different domains. Therefore, in a PDMS that supports domain heterogeneous data sources, the selection of a data source's acquaintances becomes important. If a good selection is made, more data sources can contribute to queries from that data source; on the other hand, a poor selection may block query answering. We propose a process called "acquaintance selection" to help a data source in the PDMS choose its acquaintances. This effort, from the view of the PDMS, optimizes the semantic topology and is our first step towards a PDMS supporting extended heterogeneity.

1.3.2 Supporting deriving and managing new information

A PDMS with databases from various domains needs to natively support additional operations to integrate domain heterogeneous databases and to answer queries that require cross-domain knowledge. Processing a query that brings domain-apart data sources together is a process of deriving new information. Such queries, unlike many well-studied queries in data integration that essentially reorganize the Cartesian product of the existing schemas, encode additional information in the query answering.

One observation from creating the PDMS to work with JIIRP is that the user-submitted queries, especially queries that derive new information, are valuable input to the system. Those queries give hints on how data sources could cooperate. Among the class of queries that derive new information, the aggregate query is one type that is typical and of great interest to us. An aggregate query provides rich information about (1) a data processing procedure that is of interest to the user; (2) a choice of an aggregate function that, when applied to some feature of a group of objects, carries the clear semantics that "this aggregation makes sense for this feature of the object"; and (3) the instructions on how objects are grouped for the aggregation. From this viewpoint, aggregate queries bring new semantics to the PDMS and we want the queries efficiently processed. Also, answers to aggregate queries often represent new features of newly created data objects, and it is important to maintain the aggregation process as well as the values computed. We propose to maintain both by using views that encode aggregations, so that both the values (as answers to view evaluation) and the data-deriving semantics (as the view definition) are properly preserved; the sketch below illustrates this idea. Motivated by the above, we propose to study aggregate queries in the PDMS environment as a first exploration in this direction.
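As a minimal sketch, an aggregation-encoding view could store both the deriving logic and a way to recompute the derived values; the representation below is an assumption for illustration, not the design the PDMS will necessarily use.

    from dataclasses import dataclass
    from typing import Callable, Dict, Iterable, List

    @dataclass
    class AggregateView:
        name: str
        source_query: str    # how the input tuples are obtained (deriving logic)
        group_by: str        # attribute that groups the objects
        value_attr: str      # attribute being aggregated
        agg: Callable[[List[float]], float]  # e.g., sum, max, statistics.mean

        def evaluate(self, tuples: Iterable[Dict]) -> Dict:
            """Materialize the view: group the tuples, apply the aggregate."""
            groups: Dict[str, List[float]] = {}
            for t in tuples:
                groups.setdefault(t[self.group_by], []).append(t[self.value_attr])
            return {key: self.agg(vals) for key, vals in groups.items()}

    # Both the deriving semantics (the definition) and the derived values
    # (the result of evaluate) are preserved and can be managed by the PDMS.
    demand_view = AggregateView("demand_per_area",
                                "select area, demand from buildings",
                                group_by="area", value_attr="demand", agg=sum)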

1.3.3 Supporting quality information and continuous update

The answer that a user receives for a query processed in an existing PDMS such as Piazza [68] is usually a union of answers from the data sources that each process the translated queries. For PDMSs that use relational databases at the data sources, each answer from a contributing data source is a relation that shares a common schema defined by the query. Users are usually not given any information about the quality of the tuples in the answer, nor are they told how up-to-date the tuples are (expected to be). In fact, it is very common under the current PDMS query processing architecture that a data source returns inaccurate answers or that answers from different data sources are mutually inconsistent. The reason is that a PDMS does not guarantee data consistency among its data sources. Conflicting tuples in an answer can therefore confuse the user and may, in extreme cases, cause the whole answer relation to be abandoned. If conflicting answers could be eliminated before returning to the user, and quality information about the tuples in an answer could be provided, it would greatly help the user make use of the query answers. In general, a user would expect the following quality information returned with the answers.

1. Value estimation (VE): one tuple representing the best estimate, computed from the multiple answers for the same queried object returned by different data sources.

2. Ranking: a ranked (and ordered) list of answer tuples.

3. Annotated answer: the answer relation with extra columns indicating a quality measure for the tuples in the relation.

The above three requirements meet different application needs. VE (item 1) is needed when the query answering is part of an automated workflow. For example, the PDMS performs VE for the query select damage from cells and ensures that only one damage assessment value is returned for a cell, although many different assessment values may be returned from the data sources. The VE procedure can be as simple as taking the mean, or it can be a complicated estimation process, depending on the VE algorithm used; a minimal example follows below. For ranking (item 2), a ranked answer is a natural requirement for applications like search engines. Note that the top answer is not necessarily the output of a VE operation, and search engines usually favor an ordered list of candidates over one "best estimate". While a ranked answer only gives an ordering of the tuples, the annotated answer approach (item 3) provides a quantitative quality measure to go with the queried values. For example, the JIIRP simulator could use the quality measures of the values to estimate errors for the predictions it bases on the query answers received. While the above three ways of providing quality information address different application needs, they share common features which make it convenient to investigate them in one framework: all three revolve around scoring answer tuples. In VE, the final estimate is usually based on the quality of all candidate answers, and ranking is always based on some scoring of the candidates. Additionally, the user may request combinations of quality information in their query answers.

We propose to support VE as the first approach in this direction. We choose to start with value estimation (VE) to meet the requirement from JIIRP to integrate query answering with the simulator's workflow. The value estimation operation will take into consideration quality measures associated with the query answers and provide the simulator with a quality evaluation for the final estimation it makes. We believe that a solution for the VE operation will include techniques that also benefit ranking (item 2) and annotating the query answers (item 3).
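As an illustration, the following sketch collapses multiple (value, quality) candidates into one estimate using a quality-weighted mean. This is just one possible VE procedure, as noted above, and the per-source quality scores are assumed to be given.

    def estimate_value(candidates):
        """Collapse multiple answers for one queried object into one estimate.

        candidates: list of (value, quality) pairs, one per contributing source.
        Returns (estimate, confidence) so the caller also gets a quality measure.
        """
        if not candidates:
            raise ValueError("no candidate answers to estimate from")
        total_quality = sum(q for _, q in candidates)
        estimate = sum(v * q for v, q in candidates) / total_quality
        confidence = max(q for _, q in candidates)  # a crude aggregate quality
        return estimate, confidence

    # E.g., three damage assessments for one cell, with source qualities:
    print(estimate_value([(0.35, 0.9), (0.50, 0.4), (0.42, 0.8)]))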

Another feature we plan to support in the PDMS is continuous query processing. Continuous query processing has been studied for data streams [17, 92, 94, 121] to provide proactive updates to streaming computations, but to the best of our knowledge it has not been seen in the context of PDMS query answering. Generally, a query to a PDMS is processed at the distributed data sources after it is translated to their local schemas. In the PDMS, the time for a query to be translated and for data sources to process the translated queries may vary, depending on several factors: the load of the data sources, the network delay and the semantic overlay network (SON) topology, which determines the number of semantic hops over which a query needs to be translated, all affect query processing performance in the PDMS. So unlike traditional query processing in a relational DBMS (RDBMS), where a user gets the query answer as a whole, query answers in a PDMS are better fed back to the user as they become available. Moreover, for the aggregate query processing we have proposed in Section 1.3.2, because an aggregate query usually costs more to process than an SPJ query, involving more complicated query rewriting and computation, this imbalance is even more easily observed, especially as the size of the PDMS grows. We propose to support continuous PDMS aggregate queries, which change the way query answers are returned. Benefits can be obtained in two aspects: (1) the computed values remain up-to-date without explicitly reprocessing the query; (2) the query processing cost is amortized over data updates, so fast response at query time can be achieved.

1.3.4 Internal connections between the three proposed enhancements

The three enhancements we propose for the PDMS are internally connected. Supporting more data sources for data exchange is a general goal for all PDMSs. The difficulty in doing so is that data sources from different domains usually do not have much of their schemas mapped. Thus, the problem of preventing domain heterogeneity from hurting the overall PDMS query answering performance requires immediate solutions. The approach proposed in Section 1.3.1, and detailed in Section 1.4 and Chapter 2, provides a solution for fitting domain heterogeneous data sources into one PDMS. After those data sources all find their positions in the PDMS, we want them to work together. Again, domain heterogeneity poses the second challenge: queries requiring objects from different domains to cooperate need to be supported. To automate the processing of such queries, not only does the data need to be properly maintained, but the logic for bringing data sources together also needs to be properly managed by the PDMS. This leads to our proposal for processing PDMS aggregate queries as described in Section 1.3.2 and detailed in Section 1.5 and Chapter 3. We also observe in Section 1.3.3 that the query answering model needs to be improved to help people better retrieve and understand the answers returned by the PDMS. Our proposal described in Section 1.6 and detailed in Chapter 4 aims to improve the answering of PDMS aggregate queries as well as general PDMS query answering. Particularly in processing PDMS aggregate queries, it is often very difficult for a user to tell which aggregated value to use when multiple aggregated values are returned without additional information. We propose to solve this problem by performing value estimation (VE), so that the PDMS proactively helps to make the best choice. Users are not only freed from choosing an answer among the many returned but can also expect higher quality answers. Finally, continuous query support is proposed to speed up PDMS query answering, especially for aggregate queries.

By studying the problems in this order, we first support a wider range of data sources in the PDMS, then we support a new type of query that makes use of those data sources, and finally we improve the PDMS architecture to get better efficiency and quality in query answering.

1.4 Optimizing the semantic overlay network topology

As described in Section 1.3.1, a PDMS welcomes domain heterogeneous data sources. Query answering in a PDMS is performed through rewriting queries via mappings established between pairs of data sources. The network topology formed by these semantic mappings is referred to as a semantic overlay network (SON). On a SON, it is often beneficial to discover and map data sources belonging to the same domain, for the following two reasons. First, often more mappings can be established between two data sources in the same domain than between those in different domains. More mappings create more chances for the data sources to contribute to query answering. The number and quality of the mappings determine the ability of the PDMS to use the data sources for query answering. Properly grouped data sources ensure good coverage, so that queries are processed at the data sources that contain answers. Second, the number of hops over which a query gets rewritten before it reaches a destination has both time and quality implications for query answering. On the time side, each rewriting takes time and computational resources, which directly contributes to delay and peer load in the PDMS. On the quality side, rewriting a query along a path of multiple hops is equivalent to a rewriting that uses the composition of the mappings on the path. As is widely known [54, 95], mapping composition is likely to lose information (or even become impossible), so the SON topology directly affects the quality of query answering. Therefore, a properly optimized SON helps keep the query answering of the PDMS efficient and scalable. Figure 1.1 illustrates how the SON topology can affect query answering.

We propose that the PDMS perform SON optimization via an "acquaintance selection" step. This operation helps the data sources in a PDMS find good neighbors on the SON. It optimizes the overall query answering ability of the system by finding, for a data source, acquaintances that map better to it and shorten the semantic hops in query answering. In a typical acquaintance selection procedure, a data source selects the acquaintances that best improve the chance that queries written in its schema can be translated to other data sources in the PDMS. This optimization is conducted when a new data source joins the PDMS and is periodically performed by the PDMS to give already-joined data sources chances to select better acquaintances. We discuss algorithms and other technical details of acquaintance selection in Chapter 2, where we present two algorithms to optimize acquaintance selection and show some preliminary experimental results.

Figure 1.1: Query answering in two SONs with the same set of peers but different topologies: triangles denote data sources that could contribute to the query and circles are ones that do not. The edges represent mappings between data sources that form query answering paths. In the example we can see that, because the SON topologies are different, the query answering "coverage" for SON A is lower than that on SON B.

Queries being processed in a PDMS also give hints for optimizing the SON topology. We plan to annotate returned results with their home data sources to track the average number of hops needed to process a query. Performance improvements can be obtained by minimizing the number of hops via optimizing the SON topology. This optimization is likely to bring two benefits. First, by getting the query issuer and the answer contributors closer on the SON, it is likely to reduce the query rewriting cost, so the response speed of the PDMS can be improved. Second, combined with value estimation techniques (Section 1.3.3), the optimization is likely to help early pruning of query translation paths that do not improve query answering. Optimizing the SON topology using query statistics is not covered by the techniques developed in Chapter 2; we plan it as future work to follow the ongoing work of supporting aggregate queries.

1.5 Supporting PDMS aggregate queries

With domain heterogeneous data sources included in the system, JIIRP requires that the PDMS do more than locate and retrieve data from individual data sources and return a union of the collected answers to the user. The fact that mappings are established between domain heterogeneous data sources demands that a PDMS be able to process data as instructed by queries that integrate cross-domain information. Motivated by JIIRP queries, we propose to support aggregate queries in the PDMS. We choose aggregate queries for two reasons. First, many JIIRP queries, such as the examples in Section 1.1 and Section 1.2, can be modeled as aggregate queries. Second, the data sources in the PDMS we study use relational databases as their local DBMSs. Standard aggregation semantics is well defined and supported by most relational DBMSs, so we can reuse the aggregate processing techniques at the local data sources and focus on aggregate processing at the PDMS level.

An aggregate query in a PDMS differs from a conventional aggregate query in an RDBMS. The major differences, illustrated in the sketch after this list, are:

1. The query does not specify the data sources that contribute to the aggregation; the system has to locate them.

2. The data sources may contain duplicated information, which needs to be taken care of.

3. There often exists more than one way to process an aggregate query; the PDMS needs to find the best plan to compute the aggregation.
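The following outline sketches how these three steps could fit together; locate_sources, dedupe, enumerate_plans and the cost model are hypothetical placeholders, not the algorithms developed in Chapter 3.

    def process_paq(pdms, query):
        # 1. The query names no data sources: locate the ones that can
        #    contribute, via a schema-level search.
        sources = pdms.locate_sources(query)

        # 2. Sources may overlap: partition out duplicated descriptions of
        #    the same real-world objects so nothing is aggregated twice.
        partitions = pdms.dedupe(sources, query)

        # 3. More than one plan may compute the aggregate: pick the best one.
        plans = pdms.enumerate_plans(partitions, query)
        best_plan = min(plans, key=pdms.estimated_cost)
        return pdms.execute(best_plan)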

We present the detailed model and propose solutions for processing PDMS aggregate queries in Chapter 3. We study a type of aggregate query called the PDMS aggregation query (PAQ). A PAQ is a typical query needed by JIIRP, and it unveils many interesting characteristics of general PDMS aggregate query answering.

1.6 VE and continuous query

In JIIRP, because the simulator which issues queries to the PDMS works in a fully unsupervised fashion, it usually expects to get one answer value for a query like "the damage to the CS Building under earthquake level 7." For such a query, a list of differently valued answers is usually returned from multiple data sources in the PDMS after query processing. We use value estimation (VE), as briefly discussed in Section 1.3.3, to support returning one answer instead of many, so that (1) the PDMS makes a final estimation from the candidate answers and frees the user from having to decide on the final answer to use; (2) the query answering process can be integrated into an unsupervised workflow; and (3) the final returned answer is expected to have good quality.

A straightforward way to perform VE is to collect query answers in the same way as in conventional query processing and then apply the VE logic to the result set. However, integrating the VE logic into the query processing of the PDMS is likely to bring several advantages:

1. some computations can be performed locally on the data sources in the PDMS, requiring less data transmission over the P2P network;

2. early pruning can save the cost of computing and network transmission;

3. because the entire VE algorithm is distributed, its time and space costs are amortized across the data sources in the PDMS; this potentially improves the load balancing of the PDMS;

4. continuous updates of VE results can be performed locally at data sources that witness updates, making continuous query processing more efficient.

Continuous query (CQ) processing in the PDMS shares some common features with CQ processing for streams [17, 94]. Both actively return updated answers to computations, and both work in a data-driven fashion: a user only needs to submit a query once, and the answers to the query are continuously updated until the query is de-registered by the user or terminates by itself. Updates to a query are triggered by new incoming data items in stream processing, and, in a PDMS, by newly joined data sources or new query rewritings. There are still substantial differences between CQ for streams and for a PDMS. The biggest difference is that CQ processing in a PDMS involves continuously maintaining query rewritings against updates not only to the data but also to the mappings in the PDMS. Moreover, supporting CQ in the distributed PDMS environment poses challenges for proper communication among the distributed data sources. We propose to support continuous query processing by developing new algorithms that minimize the cost of updating query answers in the PDMS, potentially speeding up the system and making the PDMS more scalable. Basically, we hope that, when queries are continuously processed, the additional updating cost occurs only at data sources that witness data updates and at newly joined data sources that contribute to the query answering. This cost will be much cheaper than re-processing the queries over and over again in the PDMS, especially when the size of the PDMS becomes large.

We plan to support continuous queries in PDMS aggregate query processing. Specifically, we will pick the PAQ as the query, a VE operation focusing on accuracy, and an approximate continuous query updating scheme focusing on minimizing communication cost with precision guarantees.

1.7 Related work

In this section we briefly describe some work related to the research proposed in this chapter. More detailed discussions of related techniques are presented in the corresponding chapters for acquaintance selection, PDMS aggregate query processing, and value estimation and continuous query processing, respectively.

Data sources in PDMSs rely on metadata management techniques to establish mappings (see [49] for a recent survey) and rewrite queries along semantic paths in query processing [95, 100]. Recent theoretical results discuss the feasibility and complexity of mapping operations (e.g., [54]).

The P2P community has focused on related issues involving neighbor selection (e.g., [36, 102, 113]). Recently reported work [106] optimizes the SON structure by constructing multiple SONs for a P2P network and distributing peers to SONs based on their content similarity. [144] focuses on obtaining a high-coverage schema summary, which can be useful in classifying data sources.

Aggregation and ranked query results are well-studied topics in many areas besides data integration. [37, 97] discuss general models for top-K and user-defined aggregations, and [9, 42] discuss answering top-K queries using views and an improvement to the well-known TA algorithm for top-K queries, respectively. In [5], aggregation complexity for data exchange is studied and both positive and negative results are given. Approximate and randomized algorithms for aggregates in the data integration context are discussed in [11, 140]. [3, 80] focus on minimizing communication cost in computing aggregates using peer gossiping. Ranking issues in data integration with text databases are discussed in a recent work [130].

Continuous queries (CQs) [17, 94] were first discussed for querying data streams (also see recent updates in [92, 121]). The "register once, continuously update" fashion of CQ has found more and more applications beyond stream processing; e.g., [127] discusses CQ processing in PDMSs. In [57], a quality-aware delivery service is suggested for P2P networks to support continuous queries, and in [76] the authors discuss adaptively processing continuous aggregate queries in conventional RDBMSs. In [73], the system is optimized by sharing load among multiple aggregation queries. A recent study on approximate continuous aggregate query processing is reported in [79].


VE operations usually involve tagging results from data sources with associated confidence measures. Therefore, models that work with uncertain data [117] become good candidates for value estimation models. Applying uncertainty models to data integration is also seen in [51]. An investigation of ranking and top-K query processing for data with uncertainty is reported in a recent work [143].


Chapter 2

Optimizing the Topology of the Semantic Overlay Network

2.1 Introduction

A Peer Data Management System (PDMS) (e.g., [23, 67, 115]) combines the flexibility of ad-hoc information sharing in a peer-to-peer network with the richer semantics of a database. In a PDMS, each source is assumed to have a database to share, rather than just files to exchange. This allows the users of the PDMS to exchange semantically rich data rather than simple files. Since these peers are autonomous, they are assumed to have their own schemas. To bridge the resulting heterogeneity, PDMSs require semantic mappings between the various schemas.

Consider the example PDMS in Figure 2.1. In response to a recent earthquake, four cell phone companies (CHEAP CELL PHONES (CCP), CELL PHONES FOREVER (CPF), CELL PHONE LAND (CPL), CELL PHONE EASY (CPE)), a land-based telephone company (LAND LINES R US (LRU)), an electric company (ELECTRIC COMPANY (EC)) and a cable company (HAPPY CABLE (HC)) have quickly formed a PDMS to share data, in order to see the global problems for their customers and shared infrastructures. To establish basic connectivity, they have created a small number of mappings (shown as lines between peers in Figure 2.1). Similarities may vary substantially between peers. For example, CCP has much in common with CPF, and a lot (but less) in common with LRU. HC and EC share under-water pipes for their wires, and EC and LRU share utility poles. For the network shown in Figure 2.1, a query from HC will need to be translated through the two CPE mappings before it can be processed at EC. Peers that are directly connected to each other through such a semantic mapping are called acquaintances; e.g., HC's only acquaintance is CPE.


Figure 2.1: An example of a PDMS and semantically mapped peers

Query answering in a PDMS involves translating queries along semantic mappings. There has been considerable work on decreasing the cost of constructing direct schema mappings between two peers in a PDMS. Since the mappings are inherently difficult to determine fully automatically, the best that can be done is to create such mappings semi-automatically. While semi-automatic schema matching techniques decrease the costs (see [49] for a recent survey), it still remains too expensive to create mappings between a newly joined peer and all other peers in the PDMS. In our example, though creating mappings between HC and all other peers would yield the optimal ability to answer queries, HC may not have the resources to do so, particularly in a disaster management situation, where time is critical. Therefore the goal of acquaintance selection, motivated by the difficulty of creating excessive pairwise mappings either by mapping composition or by full/semi-automatic schema matching, is to carefully choose a limited number of acquaintances so that a peer's ability to use the PDMS can be maximized.

In our continuing example, CPE is in a poor position because it lacks a cell phone company as an acquaintance. Although queries from CPE can still be translated along the established mappings, query answering is limited; e.g., cell-phone-specific queries may be blocked at peers EC and LRU, which lack cell-phone-specific schema elements, en route to the other cell phone peers.

Note that the best new acquaintance for a peer may not be the candidate with the best potential mapping quality. Consider our example in Figure 2.1, and assume that the mapping between CPL and CCP is predicted to have a higher quality than the mapping between CPL and CPE. If CPL is considering creating a new mapping, its best choice may be CPE instead of CCP, because it already has a high quality semantic path to CCP through the mappings via CPF, while a query would need to take 5 hops of translation to reach CPE. Thus, the selection criteria must consider the additional benefit when a candidate is chosen as an acquaintance.

Because the queries that can be translated across peers vary greatly depending on which peers are selected as acquaintances, it is imperative that a peer planning to add a new acquaintance can tell which peers, if chosen as acquaintances, are likely to be of the greatest help in query answering, without fully creating the mappings involved. We call this the acquaintance selection problem: given an existing PDMS, and a peer i which may already have some acquaintances in the PDMS, how can i choose new acquaintance(s) to maximize its ability to query the PDMS?

As shown above, there are two aspects to the acquaintance selection problem: (1) the ability of the new acquaintance to help answer queries (from the host peer) must be estimated, and (2) it must be estimated how well queries can be answered without the proposed acquaintance. This chapter describes two schemes for the acquaintance selection problem. The first, the "one-shot" scheme, classifies peers into a set of clusters and suggests new acquaintances based on the discovered clusters. Note that the best choice may not be in the same cluster. E.g., in our continuing example, suppose all cell phone companies are clustered together; CPL may choose EC, which is not in its cluster, over CCP, because it already has a good query answering path to the latter and the benefit of creating an extra mapping with CCP is therefore low. The "one-shot" scheme pre-processes all peers in the PDMS, so the selection process afterwards takes virtually no extra time. The second solution, the "two-hop" scheme, explores the network in multiple rounds and performs acquaintance selection using the information available locally at each round. Whereas one-shot is quite efficient when we know roughly the number of clusters that the peers form, two-hop can be used when this information is unavailable. In addition, the two-hop scheme is more adaptive: it refines its estimates more easily when new information becomes available.

One of the theoretical foundations of query translation is mapping composition [54, 95, 129], which has been shown, for general TGD formulas, to be computationally hard (NP-complete) and, in some cases, not even first-order definable. Hence the scale of a PDMS, regardless of how acquaintances are selected, is limited; unlike a typical file-exchange peer-to-peer network, which can easily scale up to thousands of peers, PDMSs are usually formed by no more than several hundred peers. Our empirical study shows that both our schemes, which do not rely on any specific mapping composition algorithm, effectively help acquaintance selection and scale to hundreds of peers.

We make the following specific contributions:


• We introduce acquaintance selection in peer data management systems and propose a first set of solutions to the acquaintance selection problem.
• We propose an acquaintance selection framework that allows multiple selection schemes to co-exist in one PDMS.
• We propose a "one-shot" scheme and a "two-hop" scheme, both of which solve the acquaintance selection problem.
• We empirically evaluate the effectiveness and efficiency of the two acquaintance selection schemes.

2.2 Related work

It has long been noted that selecting good neighbors can reduce networking costs in structured (e.g., Chord [125], Pastry [116]) and unstructured (e.g., Napster, Gnutella) P2P networks. Both [113, 134] discuss peer selection and grouping strategies to lower networking cost. Selecting peers to form groups and exploiting locality (by getting as much as possible from nearby neighbors) are also studied in [35, 102, 123, 131, 138]. In [93], clustering information is used to reduce large-scale flooding in the network. Some structured P2P works [31, 36, 40] study neighbor selection based on proximity information to enable efficient routing.

An approach to improving query routing quality for information retrieval on a clustering-based architecture is reported in [84]. A recent work [98] discusses efficient query routing for PDMSs in the WISDOM project; queries are passed from one peer to another following semantic routing indices, so that a good balance between query answering quality and networking cost can be achieved. In [64], ontology information is used for routing queries.

All of the above, while also focusing on peer selection, do not address the problem we face here. The goal of the previous neighbor selection techniques is to improve networking cost/efficiency or to better route queries, while our focus is maximizing the semantic querying ability of a peer.

There are also a number of related works in the machine learning and pattern recognition areas. Work on pairwise clustering algorithms [52, 104, 122] has proposed methods to learn the structure of a data set given pairwise distance information, which is directly related to our one-shot acquaintance selection scheme. Another pairwise clustering approach, reported in [114], also uses EM in clustering. Our approach differs substantially from theirs in the basic assumptions about variable distributions, the function to optimize and the detailed optimization method; theirs was developed for motion-segmentation applications. The works [8, 122] suggest looking at longer paths in the network than only the pairwise relations. This motivates our development of the two-hop acquaintance selection scheme.


2.3 The acquaintance selection framework

This section defines a framework for acquaintance selection. Acquaintance selection schemes implementing different optimization strategies follow this framework and use the metric and selection criteria defined in Section 2.3.2 and Section 2.3.3. This framework allows multiple acquaintance selection schemes to be integrated into one PDMS, and the PDMS can easily switch among them.

We formalize the general statement of the acquaintance selection problem in Section 2.1 into the following one, which we solve in the framework: given an existing PDMS, and a host peer p which already has some acquaintances in the PDMS, choose for p a new acquaintance that maximizes p's benefit with respect to a given selection criteria.

2.3.1 Acquaintance selection operations

During acquaintance selection, a host peer performs the following operations, in order (a sketch of one selection round follows the list):

1. Probe: The host peer collects information about other peers in the PDMS and the mappings among them.

2. Estimate: The host peer estimates the benefit of mapping to candidate peers. In this step it uses estimators to rank candidates according to the selection criteria.

3. Pick Acquaintance: The host peer picks the top-ranked peer as its acquaintance and establishes mappings with it. The true quality of the newly established mapping is then computed on the host peer (as opposed to the estimate created in the previous step: it is only after the mappings are fully built that the host peer can compute the true mapping quality).

4. Book Keeping: The host peer keeps track of the information disseminated to other peers. The book keeping operation ensures that no duplicated or redundant information is transmitted.
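A minimal sketch of one selection round following the four operations above; probe, estimate_benefit, establish_mapping and the other methods are hypothetical placeholders for the schemes described later, not a fixed API.

    def select_acquaintance(host, pdms):
        candidates = host.probe(pdms)                        # 1. Probe
        ranked = sorted(candidates,                          # 2. Estimate
                        key=host.estimate_benefit, reverse=True)
        chosen = ranked[0]                                   # 3. Pick acquaintance
        mapping = host.establish_mapping(chosen)
        true_quality = host.mapping_quality(mapping)         # computed, not estimated
        host.record_dissemination(candidates)                # 4. Book keeping
        return chosen, true_quality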

2.3.2 Mapping and mapping quality metric

To decide which peer to pick as an acquaintance, the host peer needs to estimate the quality of the mapping to a candidate peer, were it established. While the acquaintance selection schemes are not bound to any specific quality metric, we choose a reasonable metric to demonstrate the effectiveness of our approach. While we also want the acquaintance selection process to be fast and to incur low communication overhead, the PDMS does need to choose a quality metric that balances accuracy against time/space/communication complexity. Because the goal of a mapping is to allow query translation, one primary factor is the number of attributes that are mapped from the source schema to the target schema; e.g., the mappings studied in [50] deal with attribute correspondences between two relational schemas. For ease of illustrating the selection schemes, we use here a quality metric that serves as a first approximation: it measures the fraction of schema attributes that are mapped. Formally:

Definition 2.3.1. [Mapping Quality Metric] Let sch(i) denote peer i's schema and |sch(i)| the number of attributes in sch(i). Let x_{ij} be the number of distinct attributes in sch(i) that appear in mapping M_{ij}. Then the mapping quality S(M_{ij}) is defined as

    S(M_{ij}) = x_{ij} / |sch(i)|.

This quality measure is by definition asymmetric, i.e., S(M_{ij}) ≠ S(M_{ji}): the quality S(M_{ij}) is measured from peer i's perspective.
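As an illustration, the metric can be computed directly from a mapping given as attribute correspondences; this representation of a mapping is an assumption for the sketch, not the only possible one.

    def mapping_quality(sch_i, correspondences):
        """S(M_ij) = (# distinct attrs of sch(i) appearing in M_ij) / |sch(i)|.

        sch_i: set of attribute names in peer i's schema.
        correspondences: iterable of (attr_of_i, attr_of_j) pairs in M_ij.
        """
        mapped = {a for (a, _) in correspondences if a in sch_i}
        return len(mapped) / len(sch_i)

    # Asymmetry: measuring from peer j uses the right-hand attributes and
    # |sch(j)| instead, so S(M_ij) and S(M_ji) generally differ.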

We now define the "current aggregate mapping", whose quality will need to be estimated. The current aggregate mapping, as its name suggests, can be regarded as a virtual mapping that takes into consideration all the existing query translation paths from the host peer to a candidate peer in the PDMS. We use its quality as the measure of the candidate peer's impact on the host peer without a direct mapping being established. To define it properly, we first define the cordless path.

Definition 2.3.2. [Chordless Path] In a graph G, a chordless path p from vertex i to j is a path that satisfies:

1. (simple): p is acyclic;

2. (chordless): i and j, together with any proper subset of p's intermediate vertices, do not form a path from i to j in G.

Figure 2.2 shows an example of chordless paths in a graph where arrows indicate mappings between peers. The three solid paths are chordless paths from s to t while the two dashed ones are not (e.g., the upper path (s,a,b,c,t) is disqualified because of the chord (s,b)).

Observe the triangle (s, a, b) in the example and suppose queries are translated through paths (s,a,b) and (s,b). We say path (s,b) dominates path (s,a,b) if all queries that can be translated through (s,a,b) can also be translated through (s,b). If a path is dominated, then removing it from the query translation paths does not affect the answer of a query. We observe that such chords often imply path domination.



Figure 2.2: An example of chordless paths: (s,b,c,t), (s,d,e,c,t) and (s,d,e,f,t) are chordless paths from s to t.

To speed up estimating the current mapping impact, we therefore consider only chordless paths as effective query translation paths. To represent the contribution a query translation path offers, we use M(p) to denote an equivalent mapping that can translate exactly the queries translatable along path p between the two end peers of that path.
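For illustration, chordless paths can be enumerated by a depth-first search that rejects any extension joined by an edge to a non-adjacent path vertex. This naive sketch is not the linear-time algorithm summarized in Section 2.4, and the edge set below is only a plausible reconstruction of Figure 2.2:

# A simple DFS enumeration of chordless paths from s to t.
def chordless_paths(adj, s, t):
    results, path = [], [s]

    def has_chord(v):
        # v would create a chord if any non-adjacent path vertex links to it.
        return any(v in adj.get(w, ()) for w in path[:-1])

    def dfs(u):
        for v in adj.get(u, ()):
            if v in path or has_chord(v):
                continue  # reject cycles and chorded extensions
            path.append(v)
            if v == t:
                results.append(list(path))
            else:
                dfs(v)
            path.pop()

    dfs(s)
    return results

adj = {"s": ["a", "b", "d"], "a": ["b"], "b": ["c"], "c": ["t"],
       "d": ["e"], "e": ["c", "f"], "f": ["t"]}
print(chordless_paths(adj, "s", "t"))
# [['s', 'b', 'c', 't'], ['s', 'd', 'e', 'c', 't'], ['s', 'd', 'e', 'f', 't']]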

Hence, we define the “Current Aggregate Mapping (CM)” from peer i to peer j as the mapping created by the union of the equivalent mappings on all chordless paths from peer i to j. Formally:

Definition 2.3.3. [Current Aggregate Mapping] The current aggregate mapping from peer i to peer j is defined as

CM_ij = ⋃_{p ∈ P(i,j)} M(p)

where P(i,j) is the set of chordless paths from i to j. □

Computing M(p) is an involved operation; e.g., deriving M(p) using mapping composition may result in an infinite number of rules [54]. Because only the quality measure of CM (i.e., S(CM)) is required, a selection scheme needs only to estimate S(M(p)) instead of computing M(p) explicitly.

2.3.3 The acquaintance selection criterion

As the example in Section 2.1 shows, using only the direct mapping quality as the selection criterion is insufficient. A peer's goal is to maximize its query answering benefit for each acquaintance it chooses. This benefit can be quantified as the difference between the direct mapping quality S(M_ij) and the current aggregate mapping quality S(CM_ij).

Hence, we want a peer in the PDMS to choose an acquaintance that maximizes its query answering benefit. Using the concepts we have just defined, the selection criterion is:

Definition 2.3.4. [Selection Criterion] Peer i selects j as its acquaintance if and only if ∀j′ ≠ j, S(M_ij) − S(CM_ij) ≥ S(M_ij′) − S(CM_ij′), where S is the quality metric defined in Definition 2.3.1 and CM is defined in Definition 2.3.3. □

We call S(M_ij) − S(CM_ij) the “criterion value” of candidate j for peer i. A host peer i needs estimators for S(M_ij) and S(CM_ij) in order to rank candidates according to the selection criterion. We develop efficient estimators that let a peer quickly estimate the potential of mapping to another peer without creating actual mappings.
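A toy example of the criterion value (numbers invented): a candidate with a strong direct mapping but an equally strong existing translation path yields little benefit, so a less-connected candidate wins.

# Toy criterion values: a candidate that is already well reachable
# (high S(CM)) can lose to one with a weaker direct mapping.
candidates = {
    "j1": {"S_M": 0.9, "S_CM": 0.8},  # strong mapping, but mostly redundant
    "j2": {"S_M": 0.6, "S_CM": 0.1},  # weaker mapping, big marginal gain
}
best = max(candidates, key=lambda j: candidates[j]["S_M"] - candidates[j]["S_CM"])
print(best)  # -> j2 (benefit 0.5 beats j1's 0.1)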

2.4 Summary of results for acquaintance selection

We briefly summarize the results we obtained for acquaintance selection. For more details please refer to the full technical report [141].

We developed estimators for the direct mapping quality S(M_ij) and the current aggregate mapping quality S(CM_ij). The estimators for S(CM_ij) use a max-min approximation over the possible paths between peers i and j in the PDMS. As there can be exponentially many such paths between two peers, we only consider the dominating chordless paths of Definition 2.3.2. We developed an algorithm that computes all chordless paths in time linear in the number of such paths in the PDMS. Together with the max-min approximation, we are able to estimate S(CM_ij) very quickly.
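For intuition, the max-min approximation can be sketched as follows; this is a simplified illustration with invented numbers (each chordless path scored by its weakest edge, and the best path score taken as the estimate), and the exact formulation is in [141]:

# Sketch: approximate S(CM) as max over chordless paths of the
# minimum edge quality along each path.
def approx_s_cm(paths, edge_quality):
    """paths: lists of vertices; edge_quality: dict (u, v) -> S(M_uv)."""
    def path_quality(p):
        return min(edge_quality[(u, v)] for u, v in zip(p, p[1:]))
    return max((path_quality(p) for p in paths), default=0.0)

edges = {("s", "b"): 0.7, ("b", "c"): 0.9, ("c", "t"): 0.5,
         ("s", "d"): 0.8, ("d", "t"): 0.6}
print(approx_s_cm([["s", "b", "c", "t"], ["s", "d", "t"]], edges))  # 0.6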

Two acquaintance selection schemes were developed to estimate S(M_ij) and to make the selection decisions. In the first scheme, which we call “one-shot”, a clustering algorithm clusters peers based on the observed pairwise mappings in the PDMS and then estimates S(M_ij) based on the discovered clusters. The clustering algorithm uses expectation maximization (EM) to obtain a MAP solution; to speed up computation, a local search algorithm is applied. This scheme supports incrementally maintaining the clusters, but the algorithm requires the number of clusters to be known beforehand and is thus limited to applications whose domain categories are pre-determined. To resolve this, a second scheme, which we call “two-hop”, was developed. In the two-hop scheme, a peer uses only local information (peers within two hops of it on the SON) to estimate S(M_ij). Iteratively reusing the estimations made in previous steps, a peer explores the whole PDMS in multiple rounds. As estimates of S(M_ij) are collected in each round, a peer makes acquaintance decisions in a “pay-as-you-go” fashion. Two heuristics were also developed for the two-hop scheme for better estimations.

Analytical results are obtained for the one-shot scheme, stated in the followingtheorem:

Theorem 2.4.1. The complexity of the one-shot estimator for each EM iteration is O(B^K C^K N log C), where N is the number of peers in the PDMS, B is the maximal number of acquaintances a peer has, C is the number of clusters, and K is a constant.

Our empirical study shows that in most cases the EM-based clustering converges in a small number of iterations.

The following theorem asserts that a peer takes a small number of rounds to explore a typical PDMS.

Theorem 2.4.2. For a P2P network in which each peer has on average K (K ≥ 3) outbound mappings, it takes on average O(log log N) time for a peer to discover all peers in the network, where N is the number of peers in the network.

The quality of the estimator for S(M_ij) and the effectiveness of acquaintance selection by the two-hop scheme were tested empirically. We used two kinds of network topology to test the two-hop scheme, and the results show good performance in both settings.

2.5 Empirical study

In this section we very briefly present the results we obtained in the empirical study of the one-shot and two-hop acquaintance selection schemes. Due to space constraints of the proposal, we only present the accuracy test results for the two selection schemes. For a more comprehensive report of the empirical study, please refer to [141].

Table 2.1 lists the notations used in our tests.

Notation   Meaning
N          The size of the PDMS
D          Average connectivity of the PDMS
C          Number of potential clusters peers form

Table 2.1: Notations in the empirical study


Figure 2.3: one-shot accuracy on D-regular topology. (a) Estimation error: RMSE vs. N (N = 100 to 600) for D = 10 to 15. (b) Misclassification ratio vs. N.

Figure 2.3(a) shows the root mean square error (RMSE)¹ for a set of estimations of the direct mapping quality from the one-shot estimator. The results show that the one-shot estimator is in general more accurate when more initial mappings are observed (i.e., with larger D). Figure 2.3(b) also shows this trend. In general, a lower misclassification ratio² leads to a smaller estimation error. Note that we have some runs with no misclassification, as shown in Figure 2.3(b), but the corresponding RMSE in Figure 2.3(a) is above zero. This error comes from both the variance of the true quality distribution and the one-shot scheme's error in its inference on the parameters (θ). We can see that the one-shot scheme estimates θ with good precision.

We further measured the significance of two-hop acquaintance selection by comparing the actual mapping quality benefit gained against the benefit peers get if they choose their acquaintances randomly. The box plot [135] in Figure 2.4 shows the results for a PDMS of size 400 with a D-regular topology.

In Figure 2.4, the notched boxes represent the benefit achieved when two-hop selection is applied. Each box represents statistics for the 1200 selections in that round. The lower series of boxes shows the corresponding statistics of the benefit gained from random selection. It shows that, with strong statistical support, the benefit achieved by two-hop selection outperforms random selection in all rounds.

¹Defined as √((1/N) ∑ᵢ (ŷᵢ − yᵢ)²) over the N estimates, where ŷ is the estimate and y is the true value.
²Calculated as #misclassifications/N.


Figure 2.4: two-hop vs. random on D-regular topology: selection benefit statistics per round (rounds 1 to 9).


Chapter 3

Processing Object Decomposition Aggregate Queries

In Chapter 1 we defined the goal of natively supporting aggregations in the PDMS. In this chapter, we describe one type of aggregation that we set out to support in the PDMS. We use PAQ to name the general form of aggregate queries in the PDMS. Motivated by JIIRP, we study a specific subclass of PAQ which we call the object decomposition aggregate (ODA). As its name suggests, processing an ODA query requires decomposing an object and performing a special query rewriting that introduces aggregation.

3.1 Motivation of PDMS aggregation query

3.1.1 A query from JIIRP

Consider the following example from JIIRP, where several research groups cooperate in a disaster management study. The simulation group works to simulate the impact of earthquakes on the university campus and publishes a data source A which contains objects called “cells” used by a simulator. A cell in data source A is defined as a set of buildings on campus. A seismic assessment group publishes data source B which contains damage assessment information for buildings on campus. Data source B can evaluate an aggregation function B.overall() to compute the overall damage over a set of buildings in its database. There is another independent seismic damage assessment study from another lab in the department whose data is published in source C. In this study, small groups of buildings on campus are the unit of investigation, and the aggregation function avg() is used to give an estimate of the overall damage.


The above data sources are published in a PDMS. Now the simulation group working at data source A wants to find out the damage of the cells they have defined for an earthquake simulation. They want queries such as “Given a cell, what is the damage of this cell for earthquake level IX?” to be processed in the PDMS. Recall that we are using the relational data model, so the question can be expressed as a view definition: we define a view CellDmg(cellid, disasterlevel, dmglevel) and propagate this view with the damage assessment information from data sources B and C in the PDMS, for the cells and disaster levels defined in A.

Before we discuss the problem in the PDMS, we first discuss how this problem could be solved if all the data sets were loaded into one database. The single database scenario helps to show how aggregates are needed in processing such queries. It is the semantics of deriving new information for the defined objects (e.g., the cells) that requires the use of aggregation. Additionally, in the data sources of the PDMS, we assume that relational databases are used, so queries are rewritten in a language in the relational context (e.g., DataLog or SQL). It is thus convenient to use relations/schemas to represent data sets and the mappings among them.

3.1.2 A single database version

We assume that all the data is loaded and made consistent in one database. We use Source.Relation to represent the data from different sources so that it is straightforward to later move them back to distributed data sources. The data is loaded into the following relations:

A.Cell(cellid, cellname)
B.DmgBuilding(dmgbid, disasterlevel, dmglevel)
C.DmgBdnGroup(grpid, disasterlevel, damagelevel)

There is also a building relation containing all the buildings on campus:

building(bid, name)

The definitions for cell and building groups are represented in the followingrelations:

Map-Cell-Building(cellid, bid)
Map-DmgBuilding-Building(dmgbid, bid)
Map-DmgBdnGroup-Building(grpid, bid)

It is easy to observe that source B can be used to propagate the view CellDmg. A valid view definition in SQL using the damage assessment data from B.DmgBuilding is:


create view CellDmg(cellid, disasterlevel, dmglevel) as
SELECT AC.cellid as cellid,
       BD.disasterlevel as disasterlevel,
       B.overall(BD.dmglevel) as dmglevel
FROM B.DmgBuilding BD, building B, A.Cell AC,
     Map-Cell-Building MCB, Map-DmgBuilding-Building MDB
WHERE AC.cellid = MCB.cellid
  AND BD.dmgbid = MDB.dmgbid
  AND MDB.bid = B.bid
  AND B.bid = MCB.bid
GROUP BY AC.cellid, BD.disasterlevel

Here we also give a DataLog representation of the aggregate query, starting from the “SELECT” clause. Using a syntax introduced in [38] that supports aggregation in DataLog, the query is:

CellDmg(cid, dislev, overall(dmglev)) :-
    BD(dmgbid, dislev, dmglev), B(bid, X1),
    AC(cid, X2), MCB(cid, bid), MDB(dmgbid, bid)

We briefly explain the DataLog syntax for aggregate queries using this example. The body of the query is a normal conjunctive query that joins several relations. The aggregation appears in the head of the query, with “overall” as the aggregate function and “dmglev” as the attribute to aggregate. The two other variables in the head serve as grouping variables, so that tuples in the joined relation with the same “cid” and “dislev” values are grouped together for aggregation. For more details on the syntax, refer to [38].

We can see that no relation in the database contains damage assessments for the object “cell”, so the view asking for cell damage cannot be defined by a select-project-join (SPJ) query over the other relations. To work out the damage assessment, we have to use the component objects that form the cell object and aggregate their values to make that estimate. We know by the definition in the relation Map-Cell-Building that a cell IS a set of buildings, so it might be possible to derive the cell damage assessment values from building damages. We then observe that the aggregate function B.overall() can be used for this purpose to combine building damages. In this case, the above definition-aggregate pair serves to decompose a cell whose damage assessment we do not know into a set of buildings whose damages we do know, and then to combine the building damages to derive the cell damage. It is easy to see that using another definition-aggregate pair


(Map-Cell-Building, Map-DmgBdnGroup-Building)-C.avg, we can derive cell damages from the damage assessment data in the relation C.DmgBdnGroup.

This example shows an application where aggregation is required to propagate a view. It also unveils the fact that the semantics embedded in a definition-aggregate pair creates the possibility of data exchange among different types of objects. In the single database scenario, the view definer (usually the DBA) is responsible for working out a view definition like the one in the sample SQL above. The goal of our work is to investigate the problem in the PDMS environment and automate the process of retrieving information from data sources for such views involving aggregations. Specifically, we propose to tackle the following challenges:

1. to recognize and use the semantics in object definitions and aggregate func-tions;

2. to treat the view definition problem as a query processing problem and find answers in the PDMS; in this case, processing the query SELECT * FROM CellDmg is equivalent to finding appropriate definitions for the view CellDmg;

3. to properly evaluate the quality of answers returned from different query processing paths and to provide information for value estimation (VE)¹.

Next we describe the PDMS architecture in which aggregation is going to be supported.

3.1.3 The PDMS as the query answering platform

A PDMS in our context is a data integration system that uses a P2P network to transfer data between peers. Existing systems that generally use a similar P2P setting are Piazza [67] and Hyperion [115]. On top of the networking layer, one or more semantic overlay networks (SONs) [41] are defined to provide an abstraction from physical peers in the P2P network to data sources in the data integration layer.

In the PDMS, all query processing and data exchange happen on the SON. Different PDMS systems may adopt different SON models. In our SON model, physical peers carrying relational databases publish data sources onto one global SON. Data sources are inter-connected with pairwise semantic mappings, which serve as directed semantic channels for query rewriting. The data integration logic is hosted in each data source, where a query written in its local schema is processed and translated to be sent to other data sources for processing. Our SON model is different from the one described in [105]. The major difference is that in [105] (and also [41]), physical peers publish data sources into multiple SONs depending on the way

¹Refer to Chapter 4 for details on VE.


the data sources are classified. In our model, the decision to use only one SON to hold all data sources is made to enable co-operation of data sources from different domains and also to remove the restriction of needing to know the domain assignments a priori. In our PDMS, the domain classification procedure is handled by a dynamic acquaintance selection operation which helps to optimize the topology of the SON, as described in Chapter 2. The topology of the SON is usually different from the topology of the underlying P2P network. To clearly refer to objects on the two layers and avoid confusion, we distinguish objects on the SON and in the P2P network by different terms, listed in Table 3.1.

Semantics                    SON                 P2P Network
node in topology             data source         peer
contents in communication    query, relation     data
edge in topology             hop, mapping        connection
communication method         exchange            transmit
how edges are formed         pairwise mapping    physical links

Table 3.1: Terms used in the SON and the P2P network

We now formalize the SON model we use. First we introduce the definitions forelements that are in the SON. The pairwise semantic mapping makes connectionsbetween data sources in the PDMS. Formally,

Definition 3.1.1. [schema mapping] A schema mapping from schema σ to schema τ, denoted sm(σ, τ), is a first-order formula of the form ∀x̄ ∀ȳ ϕ_σ(x̄, ȳ) → ∃z̄ ψ_τ(x̄, z̄), where ϕ and ψ are first-order formulas [86, 89]. □

Definition 3.1.2. [data mapping] A data mapping from schema σ to schema τ, denoted dm(σ, τ), is a mapping table in the form of a relational table M(ᾱ, β̄) where ᾱ is a set of schema attributes from σ and β̄ is a set of attributes from τ [14].

Definition 3.1.3. [pairwise semantic mapping] A pairwise semantic mapping from schema σ to schema τ, denoted m(σ, τ), is either a schema mapping sm(σ, τ) following Definition 3.1.1, or a data mapping dm(σ, τ) following Definition 3.1.2.
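For a concrete (hypothetical) illustration of the two flavors: suppose σ contains a relation Bldg(bid, name) and τ contains a relation Asset(aid, label); neither relation is from our application, and both are invented for this example. A schema mapping sm(σ, τ) could then be the tgd

∀b ∀n (Bldg(b, n) → ∃a Asset(a, n)),

while a data mapping dm(σ, τ) is simply a stored table such as M(bid, aid) that explicitly pairs building identifiers with asset identifiers.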

We use “mapping” as the short term for pairwise semantic mapping. See Section 3.2.4 for a discussion of related work on mapping systems in PDMSs and data integration. By Definition 3.1.3, a pairwise semantic mapping is directional, from the source schema σ to the target schema τ. In our PDMS setting, a mapping


m(σ, τ) is owned by the data source with schema σ to translate queries to schema τ. Usually, σ is called the “source schema” and τ is called the “target schema.”

We define a directed semantic channel as a set of pairwise semantic mappings between the same source-target pair, capturing the fact that normally in the PDMS there is more than one mapping that can be used to translate queries from a source schema to a target schema.

Definition 3.1.4. [directed semantic channel] A directed semantic channel from schema σ to schema τ, denoted by C(σ, τ), is a set of pairwise semantic mappings from σ to τ. We call the data source with schema τ an acquaintance of the data source with schema σ. □

In the SON model, a semantic channel is always pairwise. A pairwise channel from a host schema to an acquaintance's is implemented by a set of mappings. Each mapping in the channel relates a set of relations in the host data source to a set of relations in an acquaintance and allows queries to be translated from the host to the acquaintance. For PDMS systems that adopt a “mediated schema” setting, where data sources all map to a mediated schema fs, we can treat a pairwise channel as the composition of two sets of mappings: the mappings from the host to fs and those from fs to the acquaintance. Therefore, the pairwise mappings and semantic channels defined in Definition 3.1.4 do not harm the semantic expressiveness.

The data source is another key element in the SON. Data sources are publishedby physical peers in the PDMS and are connected by directed semantic channels. Adata source is defined as follows:

Definition 3.1.5. [data source] A data source (ds) in a PDMS is published by a peer.It has:

1. a local relational schema σ ,

2. a database D that contains a set of relations and views D = {R_1, ..., R_m, V_1, ..., V_n} in schema σ;

3. the ability to process relational queries written in schema σ using databaseD;

4. a set of directed semantic channels (Definition 3.1.4) from itself to the ac-quaintances;

5. the ability to translate queries written in schema σ to an acquaintance’sschema using the directed semantic channels stored on ds.

32

Note that we always use “data source” in the context of data integration and use “peer” to refer to a physical node in the P2P network. A physical peer in the P2P network of the PDMS may publish multiple data sources, which means multiple databases are shared from this physical host. Also, the relations in a data source may actually come from multiple peers in the P2P network, which means those physically distributed relations use the same schema and are consistent.

With data source and directed semantic channel defined, we give the formal definition of the SON as a graph formed by data sources and directed semantic channels, although we have already been using the word “SON” to refer to this structure.

Definition 3.1.6. [semantic overlay network (SON)] A semantic overlay network (SON) in a PDMS is a directed graph G(V, E) where V is a set of data sources (Definition 3.1.5) and E is a set of directed semantic channels (Definition 3.1.4) between data sources in V. The SON is the layer where query translations are carried out. □
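As a data-structure sketch (the names are illustrative, not those of our prototype), the SON can be represented as a directed graph whose nodes carry schemas and whose edges carry the mapping sets of Definition 3.1.4:

# A hypothetical data-structure sketch of the SON: nodes carry schemas,
# edges carry sets of pairwise mappings (here only opaque identifiers).
from dataclasses import dataclass, field

@dataclass
class DataSource:                      # a SON node (Definition 3.1.5)
    name: str
    schema: set                        # attributes of the local schema
    channels: dict = field(default_factory=dict)  # acquaintance -> mappings

son = {
    "A": DataSource("A", {"cellid", "cellname"}),
    "B": DataSource("B", {"dmgbid", "disasterlevel", "dmglevel"}),
}
# A directed semantic channel C(A, B): a set of pairwise mappings.
son["A"].channels["B"] = {"m1_AB", "m2_AB"}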

Figure 3.1 shows the architecture of our PDMS.

Figure 3.1: The PDMS architecture: a logical diagram

We use the following example to explain why, in data integration, we care more about the SON topology than the underlying peer network. Figure 3.2 shows a PDMS


formed by 3 peers A, B and C. The topology of the SON is on the left and the peer network topology (PHY) is on the right. We compare the number of communication hops needed to translate a query (Q1) from A to B with those for a query (Q2) from A to C.


Figure 3.2: The SON and PHY topology of a PDMS with 3 peers. For query Q1 to be translated from A to B (follow the green line), one hop on the SON (left) and two hops on the PHY network (right) are needed. For query Q2 from A to C (red dotted line), two hops on the SON (via B; red lines 1, 2) and three hops on the PHY network (red lines 1.1, 1.2, 2.1) are required.

This example shows that the topology of the SON determines the way queries are translated and forwarded in the PDMS. We therefore regard the topology of the SON as the topology of the PDMS when we discuss translating and routing queries in the PDMS.

Next we present the aggregate query processing problem for the PDMS. We formally define the class of PDMS aggregate queries and the subclass of object decomposition aggregate (ODA) queries, which we set out to investigate in detail. To the best of our knowledge, this is the first time that aggregations are used in this kind of query rewriting in a fully automatic fashion. A query about an object o, like CellDmg(cellid, disasterlevel, dmglevel), is to be automatically processed using the aggregation of features from the component objects that form o. And for the first time in query answering, we propose to equip each query answer with a quality measure indicating how good the PDMS expects the answer to be.

3.2 Related work for PDMS aggregation query (PAQ)

In this section we discuss work related to PAQ processing, deferring the detailed definition to Section 3.3. This includes the Open World Assumption (OWA) and Closed World Assumption (CWA) and our decisions for PAQ query semantics (Section 3.2.1); the literature on aggregation in data integration (Section 3.2.2) and data exchange settings (Section 3.2.3), together with a comparison to PAQ processing; schema mapping and data mapping models and practice (Section 3.2.4); and finally a brief review of existing PDMS systems (Section 3.2.5). We also take the chance to go over widely adopted notations and settings used in the above related topics.

3.2.1 Open world and closed world assumptions

OWA and CWA are two fundamental concepts in asserting “truth” in query answering. They are important in defining how the system treats the statements made by data sources, and they directly affect the semantics of query answering. Basically, OWA is the assumption that the truth value of a statement is independent of whether or not it is stated by a data source, while under its opposite, CWA, it is assumed that any statement not asserted true by a data source is false; i.e., all the truth is covered by the data sources under CWA.

Recent research in [5, 61, 89, 90] argued that data exchange is better formalized and modeled under a certain CWA, for reasons summarized as follows: (1) OWA gives only trivial answers for aggregations, and (2) some fundamental operations are NP-complete or co-NP-complete under OWA while polynomial algorithms exist under CWA. In practice, OWA is widely used in existing data integration systems such as Piazza [68] and Hyperion [14], because OWA fits better a real-life data integration system in which answers from all or some of its data sources represent only an incomplete view of the whole world. When we introduce aggregations into the PDMS, the problem is that an aggregation function (such as sum) must work on a closed set of objects, while we cannot claim uniformly that the data sources in the system contain complete information; this makes it hard to define aggregation queries in a pure OWA system. Therefore, in the PDMS we decide to respect both OWA and CWA and apply an open-closed world assumption as follows. We require definitional mappings that define an object as a set of components to follow the CWA, and we allow relations that carry specific feature information to remain under OWA, which is known to be a good assumption used by existing systems. For example, the definition of a cell as a set of buildings is represented by a mapping table R(cellid, buildingid) to which we apply CWA: if tuple t(cid, bid) ∉ R then it is asserted that building bid is not in cell cid. In contrast, in a data source with a relation D(bid, level, damage) representing the damage estimate for a building bid at earthquake level level, OWA is used to allow that if we do not have a tuple (b, l, d) in D, it only means this data source does not know this particular damage assessment value. By allowing CWA and OWA to co-exist in the system, the semantics of PAQ queries is now well established; e.g., the set of buildings to compute over for a cell is determined, while multiple data sources may be needed to contribute the features of a building set.
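The division of labor between the two assumptions can be illustrated with a toy membership check (hypothetical relations mirroring the example above):

# CWA for the definitional mapping table, OWA for the feature relation.
cell_def = {("c1", "b1"), ("c1", "b2")}       # R(cellid, buildingid): CWA
damage = {("b1", "IX"): 0.7}                  # D(bid, level) -> damage: OWA

def in_cell(cid, bid):
    # CWA: absence from the table asserts the building is NOT in the cell.
    return (cid, bid) in cell_def

def damage_of(bid, level):
    # OWA: absence only means this source does not know the value.
    return damage.get((bid, level))           # None means "unknown", not 0

print(in_cell("c1", "b3"))    # False: asserted false under CWA
print(damage_of("b2", "IX"))  # None: merely unknown under OWA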

There is also research that lets both CWA and OWA co-exist in one theoretical framework [90].

3.2.2 Aggregation in data integration

In recent work [39] and its conference version [38], Cohen et al. discuss the problem of rewriting aggregate queries using views, thus bringing aggregate queries into data integration. Standard aggregate functions (min, max, sum, count) are discussed in [38] with both set and bag (multi-set) semantics. Conjunctive query rewriting is investigated in both works², and subclasses are identified in which the existence and equivalence of rewritings can be verified. The results of the above works show that, in general, the complexity of performing a complete query rewriting with aggregation (min, max, sum and count) is NP-complete, and polynomial algorithms for partial rewritings exist only for a very restricted class of queries (linear queries), basically because under that restriction the total number of rewriting candidates is limited. [39] also extends the set of aggregate functions to a bigger class of commutative-semigroup aggregation functions, and one contribution of that work is a theoretical framework unifying previous work on rewriting queries with aggregation functions, such as [4, 60, 124, 142].

While the existing works mainly focus on view selection and on using views for query rewriting, our approach differs in that we focus on discovering the aggregations and grouping functions needed to transform a query in the PDMS, and on optimizing the process. When the databases in the data sources are provided in the form of views as discussed in existing work such as [38], the rewriting complexity is lower-bounded by the existing complexity results. However, we hope to develop efficient algorithms to automatically discover, perform and optimize query translations involving aggregate functions. So far we have found no reports in this direction.

3.2.3 Aggregation query answering in data exchange

In a recent study [5], the authors discuss aggregation query answering in data exchange. They define the semantics of aggregation queries in a data exchange setting and observe that, to obtain non-trivial aggregation semantics, a strict closed world assumption (CWA) has to be applied. The data exchange setting used in [5], also adopted by other data exchange studies [55, 61, 89], is as follows: given a triple (σ, τ, Σ) where σ and τ are source and target schemas respectively and Σ is a set of

²In [38], the query body is a disjunction of conjunctive clauses.


source-to-target dependencies (STDs) of the form ψ_τ(x̄, z̄) :− ϕ_σ(x̄, ȳ) (or, in the equivalent FO formula form, ∀x̄ ∀ȳ ϕ_σ(x̄, ȳ) → ∃z̄ ψ_τ(x̄, z̄)), observe a source σ instance I and compute certain answers for schema τ. A certain answer is defined as a set T of answers to query Q such that ∀J ∈ Sol(I), ∀t ∈ T: t ∈ Q(J) [55], i.e., the intersection of the answers over all possible solutions.

The semantics of aggregation in data exchange is then defined as aggregation over certain answers, written agg-certain(f(Q), I, W(I)), where f is the aggregation function, I is a source instance and W(I) is the set of “possible worlds”. The authors of [5] argued, by comparing with other options, that an endomorphic image³ of the canonical universal solution CanSol(I), which adopts CWA, needs to be used. In fact, the endomorphic image is the intersection of the CWA-solution set and CanSol(I) (Section 3.3 of [5]).

In the solution of [5], a value interval is used to represent an aggregation result, indicating the lower and upper bounds of the aggregated value. This representation was seen earlier in [13] to handle aggregations for inconsistent databases. There is a connection between the two, although they seem to be unrelated topics: both data exchange and the repair of an inconsistent database create a set of possible worlds, and aggregations over those possible worlds return different values. Therefore, using an interval to represent an aggregation in such contexts is preferred.

The results in [5] show that, using the proposed endomorphic image as the possible worlds and using a range representation, on the positive side the complexity of answering aggregation queries for min, max, sum, count and avg is in PTIME (Theorems 4.1 and 5.12), and on the negative side, deciding whether there exists a target instance J such that avg() = r, for a given value r, is NP-complete (Theorem 5.14). The positive results use earlier results on computing the core in polynomial time, reported in [61]. Basically, the results show that it is easy to compute the range, but knowing the family or the distribution of the possible results is hard in general.

There are some fundamental differences between the data exchange setting and data integration. As also summarized in [55], first, only systems with GLAV tgds use the same mapping rule formalization as data exchange. Systems using GAV or LAV do not allow conjunctive queries over the target schema in the head of a tuple-generating dependency (TGD) rule, nor do they allow existential quantifiers (free variables) in the head. Also, in the data exchange setting, a target schema is created in its own right with its own constraints (represented using TGDs with both sides over the target schema). The goal of data exchange is to insert data from the source schema into the target schema [61]. So it either retrieves data from source schemas and then materializes answers in the target schemas, or explicitly states that the source and target schemas, which respectively hold their own data, satisfy a

³An endomorphic image is the image of a function that maps a set to itself.


set of source-to-target dependencies. On the data integration side, however, the target schemas are usually views defined over the source schemas, and their data comes solely from applying the view definitions (in LAV, GAV or GLAV formulas) to the source schemas; there are usually no dependency constraints between two view definitions. In a data integration setting, mapping rules are used for query rewriting, and the source schemas still evaluate the queries translated and sent to them.

The above differences between the general data exchange and data integration models also affect query answering for aggregations. In the context of data exchange, the aggregation semantics defined in [5] is to compute, given an instance I of source schema σ, an aggregation f (one of min, max, count, sum and avg) on target schema τ, written agg-certain(f(Q), I, W(I)), where Q is a query (usually a selection on one numerical attribute) over target schema τ and W(I) is the set of possible worlds for the target schema. It answers the question “Without generating or looking at a target instance, what is the answer to the aggregation, if it were evaluated over the target schema?” The existential quantifier in the STD head ψ_τ(x̄, z̄) naturally implies that more than one possible world of the target schema τ exists, and these possibly result in different values when the aggregations are evaluated. This semantics differs from what is desired in our PDMS query answering, and the processing steps also differ. Instead of “guessing” possible worlds at the target schema τ, aggregation queries in the PDMS are rewritten, sent and executed locally on the source schema σ. Therefore, we are not computing an aggregation over a target instance which has many “possible and undecided states” but retrieving information directly from the source schema. Accordingly, we do not care about the possible worlds of the target schema in terms of their effect on aggregation. The challenge is to find a way to process an aggregation expressed in a foreign schema (the target) using the data sources (the source) available to the PDMS. Correspondingly, our tasks are (1) to find the sources that could contribute to query processing, as the destinations for query rewriting; (2) to evaluate the quality of the aggregation results; and (3) to continuously maintain the aggregated values, respond to data updates and improve the estimates. Our PDMS therefore uses a setting different from the one used in the data exchange world, and the techniques discussed in, say, [5] do not trivially apply. New techniques focusing on query rewriting, value estimation and distributed query processing need to be developed.

3.2.4 Schema mappings and data mappings

Schema mappings are the crucial glue that brings heterogeneous data sources together to allow information exchange. The STDs in data exchange and data integration systems are the most intensively studied schema mappings [86]. A number of works are devoted to discovering and creating schema mappings [69, 72, 108, 119, 137] and to operations on schema mappings; e.g., mapping composition [54] and inversion [16] are both important operations.

Another mapping representation used to express the relationship between two data sets is the mapping table [14, 78, 81, 82]. A mapping table is a natural choice for expressing relations between schema atoms that cannot be expressed using mapping rules such as STDs. Indeed, when people use the entity-relationship (ER) model to design databases, they create relations to express the association between two or more entities. The use of mapping tables extends this idea to express associations between entities across multiple schemas. In our motivating example, a cell is defined as a collection of buildings, where “cell” and “building” are entities at two different data sources. This relationship is better represented using a mapping table than a formula in some mapping language. The reason is that a mapping is a mathematical function that relates entities in two schemas. Sometimes the function can be concisely encoded as first-order logic formulas, such as the STDs in some mapping systems; in other cases, and more often, the function is more complicated and cannot be concisely expressed. The case of defining cells as sets of buildings is of exactly this kind: the cell classification work is done manually, based on more parameters than those finally stored in the schema.

There are some connections between the semantics of tgd mapping rules and mapping tables. Recall that a tgd mapping rule is of the form ∀x̄ ∀ȳ ϕ_σ(x̄, ȳ) → ∃z̄ ψ_τ(x̄, z̄). In a mapping table M, let the attribute sets cS and cT denote attributes from the source and target schemas respectively, let t be a tuple in M, and let t(cS), t(cT) be the values of t on the two attribute sets. The semantics under CWA is then defined as follows: for any two entities u in schema S and v in schema T, if t(cS) holds for u and t(cT) holds for v, then entities u and v have association M; otherwise they are not associated. The CWA here implies that a mapping table is both sound and complete in describing the association. A mapping table does not encode existential quantifiers as tgd mapping rules do, so it always explicitly points out the mapped entities in the target schema (this is especially true when cT contains a key), and if a tgd “rule” can be derived from M then it will be a full tgd rule [86]. This leads to some good properties; e.g., adding a mapping table to an STD mapping system does not change the weak acyclicity of the underlying dependency graph, which is an assumption on which a number of complexity results are based [55, 61, 86]. Therefore it is a relatively easy decision to include mapping tables in a mapping system. A mapping table involving two schemas is called bi-directional if it takes the above semantics. From the symmetric definition we can see that it does not matter which schema is the source and which is the target: from a source valuation we can find all target entities that are M-associated, and vice versa.


Mapping tables can also be used to map more than two schemas together, creating a “star” topology in the dependency graph. An STD system would need a mediated schema to express this kind of association among multiple schemas. The semantics of mapping tables for two schemas can be extended to multiple (m) schemas as follows. For an entity u in source schema S, a vector of entities v̄ = {v_i, i ∈ 1..m} in target schemas T_i, i ∈ 1..m, and a tuple t in M: if t(cS) holds for u and t(cT_i) holds for all v_i, i ∈ 1..m, then entity u and the entities in v̄ are associated by M; otherwise at least one entity v_j fails to associate with u under M. Note that this semantics defines a star structure: it specifies the relationship between the source schema and the set of target schemas, but it does not specify any knowledge between any two target schemas. Note also that the “bi-directional” property does not extend to multiple target sources, because the star topology with more than one target schema is by definition asymmetric.

A semantic alternative: There is another semantics that can be applied to the mapping table to include existential quantifiers. To do this, the semantics of a mapping table M is modified to: for an entity u in the source schema and a tuple t in M, if t(cS) holds for u then there exists v in target schema T such that t(cT) holds for v. The two semantics coincide when the target attribute set cT contains a key for the entity in T. In this alternative semantics, the mapping is strictly directional; i.e., a target valuation for a tuple t in M does not provide any information on which entity in the source is matched via M. We have yet to decide whether to use one or both of these semantics in the PDMS.

Mapping tables have a significant distinction from tgd mapping rules. In a mapping table, the data at the source schema and the target schema are independently maintained data sets, unlike in tgd mapping rules (for data integration), where the target schema is a view defined over the source schema. From this point of view, mapping tables carry semantics closer to the data exchange setting than to the data integration setting.

3.2.5 A comparison with existing PDMS systems

We now briefly explain why PAQ processing needs special support in a PDMS and show that existing systems are incapable of processing this new type of query. The systems we look at are Piazza [68] and Hyperion [14].

The Piazza system uses a mediated schema on its SON. Peers all map their own schemas to that mediated schema, and query answering follows the pattern Q_S → Q_M → Q_T, where S, M, T are the source, mediated and target schemas respectively. In our applications, because of the extended heterogeneity, it is hard for data sources to agree on a common mediated schema or to maintain one. It is even hard to use a hybrid approach in which a mediated schema clusters part of the data sources on the


SON. The Hyperion approach does not use a mediated schema and relies solely on semantic channels built with pairwise mappings. One addition in Hyperion is that it uses mapping tables to represent those mappings that are not representable by the tgd mapping rules in Piazza. In our application, we need to use mapping tables to define an object as a set of components.

One important mechanism absent in both Piazza and Hyperion is the ability to process aggregate queries, especially PAQ queries. The reason for this inability is that aggregation is not in the query language they chose to support, and they simply do not perform aggregation rewriting. In our aggregate query processing, some data sources need to serve as “translators” while they themselves may not hold any of the features requested by the query. It can be seen from the examples in Section 3.1 that the existing systems do not have the query processing ability to relate a query on “cell” to a query on “building” or “building group”; neither can any existing system rewrite such a query into aggregations.

3.3 Problem definitions

3.3.1 Relocating data sets to data sources in the PDMS

First we migrate our problem (Section 3.1.1) from the single database setting to the PDMS setting, using the PDMS we just described in Section 3.1.3. The cell definition, building catalog, per-building damage assessment and per-group damage assessment relations of the previous example are now located in four data sources on the SON of the PDMS. The relations starting with “Map-” are now transformed into pairwise mappings forming the directed semantic channels between the data sources. These pairwise mappings are established automatically by mapping generation algorithms, or semi-automatically with human effort. The details of establishing and maintaining pairwise mappings are beyond the scope of this proposal; see [48, 112] for surveys of this area. Figure 3.3 illustrates the PDMS setting and the data flows involved in processing a query similar to that in Section 3.1.1. We start by defining the PDMS aggregate query we propose to study.

3.3.2 Definition of PDMS aggregation query

To properly define the PDMS aggregation query, we first define an aggregation rewriting. We build the definition of the aggregation rewriting on the general definition of query rewriting.

Definition 3.3.1. [query rewriting] Given a query Q_σ written in schema σ and a directed semantic channel C(σ, τ) which consists of a finite number of mappings from σ to an acquaintance's schema τ, a query rewriting of Q_σ using C is a query Q_τ written in schema τ, translated from Q_σ. □

The definition itself does not specify how the goodness of a rewriting is measured. The goodness of a rewriting usually depends on the type of queries to rewrite. Usually for conjunctive queries we require that the output query Q_τ satisfy Q_τ ⊆ Q_σ, in which case we call the rewriting a “contained rewriting”. A well studied problem is to find a maximally contained rewriting [110]. If at the same time Q_σ ⊆ Q_τ holds, then the two queries are considered semantically equivalent, and Q_τ is called an “equivalent rewriting” of Q_σ. Finding equivalent rewritings is often computationally hard [6, 7, 65, 109].

We define the aggregation rewriting as a rewriting whose output query is an aggregate query. First we formalize the aggregate function used in the definition.

Definition 3.3.2. [Grouping function] A grouping function g maps a vector in space I = I_1 × I_2 × ... × I_k to another space J = J_1 × J_2 × ... × J_l, i.e., g: I → J. The grouping function g takes a relational tuple, uses k of its attributes (with the i-th chosen attribute in domain I_i) as input, and “groups” such tuples into groups identified by the output, an l-ary vector j ∈ J. □

Definition 3.3.3. [Aggregate function] An aggregate function f(x, g(ȳ)) is a function that aggregates relational tuples on attribute x. An aggregate function uses a grouping function g to determine the groups for aggregation: tuples in the same g(ȳ) group are aggregated. In the case that g is not specified, we assume that the identity grouping function g(ȳ) = ȳ is used. □
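To make Definitions 3.3.2 and 3.3.3 concrete, the following toy sketch groups relational tuples with a grouping function g and aggregates an attribute within each group (illustrative data only):

# f(x, g(ybar)): aggregate attribute x within groups identified by g.
from collections import defaultdict

def aggregate(tuples, x_index, g, f):
    """Group tuples by g(t) and apply f to their x_index values."""
    groups = defaultdict(list)
    for t in tuples:
        groups[g(t)].append(t[x_index])
    return {key: f(vals) for key, vals in groups.items()}

# Tuples (cellid, disasterlevel, dmglevel), grouped by the first two fields.
rows = [("c1", "IX", 0.7), ("c1", "IX", 0.5), ("c2", "IX", 0.2)]
result = aggregate(rows, 2, lambda t: (t[0], t[1]), lambda vs: sum(vs) / len(vs))
print(result)  # {('c1', 'IX'): 0.6, ('c2', 'IX'): 0.2}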

Following a similar formalization in [39], a condition A is defined as “a conjunction of relational atoms and built-in predicates.” We define the aggregation rewriting as a subclass of query rewriting:

Definition 3.3.4. [aggregation rewriting] A query rewriting Q_τ is an aggregation rewriting if Q_τ is an aggregate query q(f(x, g(ȳ)), g(ȳ)) :− A, where f is an aggregate function (Definition 3.3.3), g is a grouping function (Definition 3.3.2) and A is a condition. □

Assessing the goodness of a query rewriting involving aggregation cannot directly use the containment or equivalence criteria. We plan to investigate new methods to measure the goodness of aggregation query rewritings.

The DataLog syntax we use in the above definition follows the formalization used in [38, 39], where aggregation operations are expressed in the head of a query. In the formalization we also require that the query body A contain no aggregations. We extend the syntax in [38] to generalize the group-by operator by introducing the grouping function g. For example, the query Q_T: q(f(x, g(y)), g(y)) :− R1(x,z), R2(z,y) is an aggregate query whose first head variable is an aggregate function over variable x, using g for grouping. Suppose the schemas of R1 and R2 are R1(a,b) and R2(c,d); then the query Q_T: q(sum(x), y) :− R1(x,z), R2(z,y) is equivalent to the SQL query:

SELECT sum(a), R2.d
FROM R1, R2
WHERE R1.b = R2.c
GROUP BY R2.d

We define the PDMS aggregation query as

Definition 3.3.5. [PDMS aggregation query (PAQ)] A PDMS aggregation query (PAQ) Q is a query in the PDMS whose processing involves an aggregation rewriting. Assume Q_σ is of the form Q_σ(x̄) :− A_σ, where A_σ is a condition written in schema σ; then after performing an aggregation rewriting to schema τ, Q_τ is of the form Q_τ(f(x, g(ȳ)), g(ȳ)) :− B_τ, where B_τ is a condition in schema τ. □

As illustrated in Section 3.1, a PAQ does not necessarily contain aggregations in the first place, but it requires the query processing engine to perform an aggregation rewriting (e.g., the view definition in Section 3.1.2 is an aggregation rewriting). Our query example in Section 3.1 suggests that a PAQ differs from a traditional aggregate query in the following aspects.

1. omitted sources: A PAQ does not specify the data sources or relations that contain the information to answer it. This is a common property of most data integration queries: a query does not know which data sources will contribute to the answers.

2. omitted aggregation: A PAQ does not specify the aggregation functions to be used for computing the answer. In fact, the aggregation function is usually provided by the data source and is unknown to the query issuer beforehand.

3. omitted rewriting method: A PAQ does not specify how an aggregation rewriting can be performed. This information is meant to be discovered from the semantic channels between data sources, and the aggregation rewriting is done behind the scenes by the PDMS.

4. omitted component object: A PAQ does not specify which component object to use for aggregation. The example in Section 3.1 suggests that an aggregation rewriting can use either buildings or building groups as component objects.


Existing PDMS systems are not built to process PAQ queries because they are not equipped with the intelligence to perform the aggregation rewriting. However, like a traditional data integration query (e.g., those in the Piazza [68] and Hyperion [14] systems), a PAQ omits data sources in the query specification. By contrast, an aggregate query in a traditional RDBMS always requires that the source relations, the aggregation function and the grouping method be explicitly specified.

We now define the general PAQ processing problem that we aim to solve.

Definition 3.3.6. [PAQ processing problem] Given a PAQ issued from one of the data sources in a PDMS, PAQ processing is to perform the aggregation rewriting and compute answers to the PAQ in the PDMS. □

Processing PAQ queries requires a combination of the Closed World Assumption (CWA) and the Open World Assumption (OWA). As discussed in Section 3.2.1, existing systems such as Piazza [68], which works on select-project-join (SPJ) queries, adopt OWA, while research on aggregations reported in [5] suggests applying CWA when aggregations are involved. We decided to use a combination of CWA and OWA to respect the following two facts: (1) a practical PDMS cannot assume that any one or any set of its data sources contains complete information; (2) in order to perform aggregation, we must represent an object using a complete and closed set of component objects. Based on the above observations, we apply OWA to data sources and use CWA when we determine the grouping function g for an aggregation rewriting.

3.3.3 Answers to PDMS aggregate queries

An answer tuple of the form q(f(x, g(y)), g(y)) for a PAQ consists of the aggregate value f() and one or more fields of the grouping attributes g(y). Unlike in SPJ queries, where tuples are retrieved from the source databases, the values of interest in a PAQ are computed by the aggregate function. Several factors may affect the accuracy of an aggregation. Take the JIIRP cell damage assessment query in Section 3.1.2 as an example: aggregating the damage assessments of the cell components (buildings and building groups) often returns an approximation. The buildings in a cell may receive different damage during the earthquake, and the “overall” measurement itself is an approximation. On the other hand, because the cell object is defined (arbitrarily) by the simulation, the only way we can compute a cell damage is through aggregation. Also, it often happens that the damage assessment database does not contain assessments for all the buildings defined in a cell, so in this case we can only aggregate over a subset of the component buildings as a best-effort approximation.


3.3.4 Challenges in processing PAQ

Some of the challenging subgoals in general PAQ processing are listed below:

1. to deal with data completeness assumptions in data-sources;

2. to decide the use of schema based and data based pairwise mappings;

3. to perform query rewriting and process the queries in a distributed fashion in the PDMS;

4. to perform optimizations on result accuracy, response time and communica-tion cost.

3.4 Object decomposition aggregates

Above we described the general PAQ processing problem as the goal of our study. In this section, we introduce a subclass of PAQ called the object decomposition aggregate (ODA). As its name suggests, an “object decomposition” operation is needed in the aggregation rewriting to determine the grouping conditions. The ODA is of particular interest to JIIRP; the example in Section 3.1 shows that we often need to decompose a feature of one object (e.g., the damage of a cell) into an aggregation of a feature of a set of other objects (e.g., the damage of the buildings). We propose to start our investigation with processing ODA queries in the PDMS.

3.4.1 The JIIRP application for object decomposition aggregate (ODA)

We motivate the object decomposition operation from JIIRP. We have two kinds of objects managed in the PDMS. One is a set of “physical” objects like the buildings and damage assessments described in the previous examples. These objects receive data for their features from their corresponding real-world entities; e.g., the utility office maintains the data for the buildings on campus, and the department of civil engineering maintains various versions of building damage assessments. The other is a set of “abstract” objects that are defined by researchers for particular kinds of needs (e.g., the simulation). An abstract object (e.g., a cell in the simulator) is defined as a set of physical objects (e.g., the buildings), and all the features (e.g., the damage assessment) of the abstract object need to be retrieved or derived from the physical objects it is defined over. It is thus a “decomposition” process when a feature of an abstract object needs to be broken down to the physical objects. Rewriting a query on an abstract object is thus an aggregation rewriting.


Our current solution to the above problem is to compute the features of abstract objects manually and store the computed values as ordinary attributes in the relation that represents the abstract object. For example, a student creates a relation “CellDamage” and populates, tuple by tuple, the values of the cell damage calculated manually using the cell definition and one of the building damage assessments. This approach has several obvious disadvantages: (1) it is an excessive workload to manually decide the best decomposition scheme, as there can be many choices for decomposition; (2) the definition of an abstract object maintains the link between itself and the component objects, but the manually curated values break this connection, so it becomes extremely difficult to update the features of an abstract object when its definition changes or the features of the underlying component objects change. The best way to solve this problem is to support ODA queries natively in the PDMS.

For aggregates, the two key parameters are the aggregate function itself and the way objects are grouped for aggregation. Normally, the choice of aggregate function is determined by the feature of the selected component object. The remaining free parameter is the choice of the grouping function g. The key element in ODA, the object decomposition operation, focuses on discovering and determining a good grouping for the components. We find that the most challenging part of ODA processing is to find a good grouping function, as it affects both the quality of the final aggregation and the overall query processing efficiency.

3.4.2 Definition of the object decomposition aggregate (ODA)

We first define the object decomposition operation, which helps to identify the grouping function needed by the aggregation. In the definition we use the two terms “object” and “feature”. In the relational context, an object refers to a relation, and a feature refers to an attribute in a relation (and also to the value of a tuple when the context is clear). A feature e in object T is denoted T.e; in the relational context, this is the same as relation T's attribute e. The definition of the object decomposition operation is general, so it is not limited to the relational data model.

Definition 3.4.1. [object decomposition operation] Given an object T and a feature e of T (denoted as T.e), an object decomposition operation determines a set of objects C = {c1, ..., ck} and a feature e′ of each ci, i = 1..k, so that T.e can be aggregated using a function f over C's feature set C.e′ = {ci.e′}. □

The objects in the set C identified by an object decomposition operation are called the "component objects". To connect with the aggregation rewriting definition (Definition 3.3.4), the object decomposition operation defines a grouping function g for aggregation rewriting. Let t be a tuple in schema σ identified by key attributes t.~k, and let t.e be the attribute to compute in a query q(e, t.~k) from ODA. The object decomposition operation determines a set of tuples C(t) that can be computed from a database in schema τ, and an attribute e′ of the tuples in C(t). Letting key attributes ~k′ identify a tuple c ∈ C(t), the header of the aggregation rewriting q(f(c.e′, g(c.~k′)), g(c.~k′)) satisfies g(c.~k′) = t.~k iff c ∈ C(t).
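To make the rewriting concrete, here is a minimal sketch in Python of how a grouping function induced by a cell definition turns a query on an abstract object into an aggregate over its components. The relation names, building names and the use of an average as the aggregate are illustrative assumptions, loosely following the flow of Figure 3.3, not part of the formal definitions.

    # A minimal sketch of an ODA evaluation. All names and values are
    # hypothetical; "agg" stands in for whatever aggregate function the
    # data source hosting the component objects provides.
    cell_def = {"UBC hospital": ["Purdy Pavilion", "Koerner"]}  # grouping g
    bldg_fl = {"Purdy Pavilion": 0.2, "Koerner": 0.25}          # feature e' (func. loss)

    def agg(values):
        # Placeholder aggregate f; an average is just one possible choice.
        return sum(values) / len(values)

    def cell_func(cell):
        # Decompose the cell into its component buildings (g), then
        # aggregate their functional-loss feature and derive func = 1 - fl.
        components = cell_def[cell]
        return 1 - agg([bldg_fl[b] for b in components])

    print(cell_func("UBC hospital"))  # -> 0.775, cf. "func=0.78" in Figure 3.3

The point of the sketch is that once the decomposition (cell_def) and the component feature are fixed, the aggregation rewriting is mechanical; the hard part, as discussed above, is choosing the grouping.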

We define the object decomposition aggregate (ODA) as a subclass of PAQ that uses the object decomposition operation in the aggregation rewriting. Formally,

Definition 3.4.2. [object decomposition aggregate (ODA)] An object decomposition aggregate is a PAQ (Definition 3.3.5) in which object decomposition operations (Definition 3.4.1) are performed in the aggregation rewriting (Definition 3.3.1). □

From the definition of ODA we can see that the two key factors in processing an ODA query are the decomposition operation and the use of an aggregate function. We observe that, once the set of component objects is determined, the corresponding aggregate function is largely determined: often, the data source hosting the component objects provides the aggregate function if it is not one of the standard ones (MIN, MAX, AVG, SUM). Therefore, optimizing the object decomposition operation becomes the key factor for aggregation rewriting in ODA processing; that is, to determine the "best" decomposition C, where "best" can be defined to meet various optimization objectives.

There are multiple optimization objectives we want to achieve for a better object decomposition: we want the decomposition to be more accurate, faster to compute and more resource efficient. Among these, accuracy is the most demanding factor for JIIRP research because, in the context of preparing for future disasters, time and computational resources are relatively abundant. We start our investigation by defining the accuracy first object decomposition (AFOD) to optimize for accuracy. Formally, with the assumption that the aggregate function is not a free parameter, the AFOD is defined as follows:

Definition 3.4.3. [accuracy first object decomposition (AFOD)] Given a PDMS and an object T defined on the source schema with a feature e of T being queried, an accuracy first object decomposition (AFOD) is an object decomposition of T such that the accuracy of the estimate for T.e = f(C.e′) is maximized, where C is the output of the object decomposition. □

Besides the assumption that the aggregate function is not a free parameter, we make the following assumptions for AFOD in our study.

1. We only consider set-based decomposition between objects: the object to be decomposed is always defined as a set of component objects.


2. Aggregate functions are pre-defined on the data sources hosting the component objects. There is no additional restriction on the aggregate function; i.e., given a set of objects C and a feature e of c ∈ C, the aggregation f(C.e) can always be computed.

3. Schema and data mappings are held by both data sources involved in the mappings.

4. An object, when decomposed, always contains finitely many components.

5. Data quality differs across data sources. There is a quality measure Q and a partial order ≤ defined so that, given data sources i and j, Q(i) ≤ Q(j) means the quality of data on data source j is better.

The above assumptions are all valid for the PDMS in our study. As indicated by the JIIRP application, as well as other applications, it is common to define a new type of object as a set of other (component) objects. The quality assumption reflects the fact that data sources in a PDMS are of different quality and that, in most cases, per-data-source quality is the finest granularity we can have for the quality measure.
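To illustrate assumption 5, the following Python sketch compares candidate decompositions by the quality of the data sources that host their components and keeps the best one under the order. The source names, quality values and candidate sets are hypothetical, and a real AFOD would of course also have to reason about the mappings and the decomposition itself.

    # A minimal accuracy-first selection sketch under assumption 5:
    # a per-data-source quality measure Q (here totally ordered for
    # simplicity). All source names, qualities and candidates are
    # hypothetical.
    Q = {"civil_eng": 0.9, "utility_office": 0.6}

    candidates = [
        {"source": "utility_office", "components": ["Purdy Pavilion", "Koerner"]},
        {"source": "civil_eng",      "components": ["Purdy Pavilion", "Koerner"]},
    ]

    def afod_pick(candidates):
        # Accuracy first: prefer the decomposition whose components are
        # hosted on the data source of the highest quality.
        return max(candidates, key=lambda c: Q[c["source"]])

    print(afod_pick(candidates)["source"])   # -> civil_eng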

3.5 Research plan for processing ODA using AFOD for optimization

We plan to start with the study of processing ODA queries using AFOD as the decomposition method. The goals we plan to achieve are:

1. to model the object decomposition aggregate (ODA) as a subclass of the general PAQ;

2. to develop ODA processing techniques in our PDMS architecture;

3. to devise a quality measure for the answers to ODA queries so that the user gets both the aggregated value and a quality evaluation;

4. to investigate optimizations for ODA processing, starting with AFOD;

5. to support continuous query answering for ODA queries (in Chapter 4).

The object decomposition operation will first be studied using the definitional mappings between an object in schema A and components in schema B. The goal is to develop techniques that perform a "best" decomposition. The next step is to perform aggregation rewriting so that an ODA query is correctly translated into aggregate queries in the schema where the components are defined. We assume that the decomposition may not be 100% accurate, so for each translated aggregate query, the answers will be returned together with a quality measure. While this completes the life of an ODA query, we plan to integrate value estimation (VE) and continuous query support (Chapter 4).


[Figure 3.3 appears here: a diagram of sample query processing for a typical JIIRP (I2DI) query. It shows object schemas (cell, building, damage) with their instances; mappings M1 (decomposing cells into buildings) and M2 (between features); reference reconciliation (Ref.Rec) on either the data or the schema level; backward pathfinding on the schema level followed by forward query processing on the data level; and the returning and aggregation of answers, e.g., fl-agg = agg(fl1, fl2, ...) and func = 1 - fl-agg, returning func = 0.78.]

Figure 3.3: Processing of the query in Section 3.1. Two logic flows exist as indicated in the legend. A schema-level search is conducted to locate data sources related to the query. After that, the query is translated along the paths found in the first step.


Chapter 4

Value Estimation and Continuous Query Support in the PDMS

4.1 Introduction

As briefly introduced in Chapter 1, value estimation in a PDMS is a process that takes multiple answers to the same query as input and outputs one answer that is returned to the user. In JIIRP, this helps to integrate the PDMS query answering process into a fully automated simulation run. For example, the JIIRP simulator needs the damage assessment value of a "cell". This value is computed by processing the PDMS aggregate query as described in Chapter 3. In the PDMS, different processing paths of the query may return different estimates of the damage assessment of the cell. The simulator does not accept a set of such estimates as query answers, nor can it work out the final value to use if the estimates are returned in this way. We propose the value estimation (VE) operation so that the answer finally returned to the simulator for a queried cell is just one value. In this way, the simulator can use the query answer directly in its workflow. By performing VE, the PDMS aims to achieve the following three goals: (1) to eliminate duplicate and/or conflicting answers to the query; (2) to work out an estimate t from the multiple answers computed in the PDMS so that the expected error of t is minimized; (3) to provide a quality measure for the final estimate.

We use the following example to illustrate the use of value estimation (VE). Figure 4.1 shows a PDMS where the query select damage from celldmg where cellid=1 and dmglevel=8 is issued by the simulator. Data sources B and C contain damage assessment data and contribute to the query. The answer from B is a tuple (0.6, 0.02) and that from C is (0.8, 0.04), where the tuple (dmg, var) represents the damage assessment and the uncertainty associated with the estimated value of the damage. The simulator does not accept the answer {0.6, 0.8} because this answer does not tell the simulator which value to use. A VE may take the mean of the two answers and return a ready-to-use value 0.7 to the simulator.

[Figure 4.1 appears here: data source A rewrites the query Q_A, SELECT id, damage FROM celldmg WHERE cellid = '1' AND dmglevel = '8', into Q_B and Q_C using mappings M(A,B) and M(A,C); B returns (dmg, var) = (0.6, 0.02), C returns (dmg, var) = (0.8, 0.04), and a VE at A decides the return value.]

Figure 4.1: An example of value estimation (VE): the PDMS consists of three data sources A, B and C. B and C answer the queries Q_B and Q_C translated from Q_A, and a VE is performed at data source A on the answers returned by B and C.

In this example, we see that VE helps to choose a value to use among multiple returned values. Note that the final value suggested by VE in the example is not one of the values returned by the data sources in the PDMS. The VE operation makes an estimate over the returned values and returns one that is, say, expected to be the most accurate. This distinguishes VE from a ranking approach.

Also in the example, the raw answers returned from the data sources are equipped with a quality measure (0.02, 0.04); it is desirable that the output of value estimation (VE) also contain quality information. Note that the two quality measures are used differently. The quality measure returned with the answers by each contributing data source is for the user to compare the "conflicting" answers: a user may pick the one with the highest quality and discard the others. The VE operation uses the raw quality measures in its computation. The output of VE is an estimate taking into consideration multiple "opinions" from the data sources, and its quality measure serves to indicate how diverse those opinions are. With continuous query processing enabled, the quality measure of a VE output also suggests how likely this answer is to be updated in the future.

We plan to study VE and closely associate it with PDMS aggregation query (PAQ) processing (Chapter 3) to achieve the following goals:

1. to develop a VE framework to perform value estimation in the PDMS;

2. to integrate the VE operation with PDMS aggregation query (PAQ) processing and optimize the overall performance;

3. to provide several options for VE methods that work in a uniform framework;

4. to develop a quality measure for value estimation;

5. to integrate VE with continuous query processing.

The first subgoal provides the PDMS with the ability to perform value estimation in the general sense. We plan to run VE with PDMS aggregate queries, but this framework should also be designed to work for other types of queries. Using this framework, we will provide several VE strategies for the user to choose from, so that the actual VE strategy can vary across queries to obtain the best performance for each. The VE operation we develop will return a quality measure together with the estimate. After continuous query processing is added to the PDMS, we plan to let VE automatically update its estimate as well as the quality measure.

Another important feature we plan to add to PDMS query processing is continuous query (CQ) support. Continuous query processing offers a different way of delivering query results to the user from traditional query processing, also known as "ad-hoc" query processing. In traditional query processing, the user issues a query to the PDMS and waits for the answer; once query answers are returned to the user, the life of the query ends. In continuous query processing, however, the PDMS continuously updates the answers to the user as new answers become available. Compared to ad-hoc query processing, the continuity is twofold: (1) the answers to a query are returned to the user continuously instead of in one batch; (2) updates to the previously returned answers are fed back by the PDMS in a continuous fashion. Because a PDMS is a distributed platform, the answers to a query take different amounts of time to process at the data sources. The difference lies both in the time needed to translate the original query into the queries to process on the data sources and in the time needed by an individual data source to process the query and send answers back. Continuous query updates therefore take advantage of the fact that partial answers to a query can be returned to the user as soon as they are sent back by the data sources, without waiting for all answers to be collected. Also, because the query is continuously processed, it does not need to be repeatedly translated and sent to data sources for re-evaluation; this also saves the cost of query rewriting. Finally, it is now the data updates in the PDMS that trigger the updates of the query answers. Compared to ad-hoc query processing, continuous query processing thus has the potential to update answers in a more timely fashion.

The purpose of supporting continuous query (CQ) in the PDMS can thus be summarized as follows:

1. to provide a data monitoring service in the PDMS;

2. to reduce the query-answer delay, especially for complex PDMS queries like aggregate queries;

3. to enable fast and timely answer updates triggered by data (source) updates in the PDMS.

In JIIRP, we assume that during an earthquake, data sources will update the status of infrastructure at the disaster site to the PDMS in real time. This information, after proper data transformation, is modeled as events and needs to be fed into the simulator as quickly as possible. A typical event can be a timely update to a cell's damage after the earthquake. Continuous query updates of the events work better than having the simulator periodically query for the events, say, every 5 seconds. Specific to this scenario, the continuous query (CQ) saves computational resources for both the simulator and the PDMS in the following ways. First, when queries for events are issued to the PDMS from a data source, the acquaintances of this data source quickly return the events they host, while the queries are translated to retrieve more events from faraway data sources at the same time. With CQ, the simulator gets reports on those quickly returned events and can work on them while it waits for more events to arrive. Second, queries are translated only once in the PDMS with CQ supported. When new data sources join the PDMS, queries will be translated from their acquaintances, and simulator events are directly returned. Compared to translating the original queries all the way from the initial data source to the newly joined data sources, CQ has apparent advantages.

We propose to support CQ in the PDMS and achieve the following subgoals:

1. a framework to perform continuous query processing in the PDMS,

2. algorithms to guarantee error-bounded continuous query processing with minimal processing overhead,

3. specific solutions for PDMS aggregation query (PAQ).


The research plan for continuous query (CQ) is similar to that for VE: we will first add this feature to PAQ processing in the PDMS. After that, we will look at how to support CQ for general queries in the same solution framework.

Next we formalize the proposed VE and CQ models.

4.2 Value estimation definitions

In this section, we formally introduce the value estimation (VE) operation and define the problem we plan to solve in the PDMS. VE is an operation that returns one estimate from many possibly different answers about the same object retrieved from multiple data sources during the processing of a query in the PDMS. In this context, a raw answer to a query is a relational tuple in the schema specified by the head of the query; e.g., a raw answer to a query Q(x,y,z) :- R1(x,y,a), R2(z,a) is a triple ans(x1,y1,z1). A VE operation typically requires more information than the raw answer. Formally,

Definition 4.2.1. [value estimation (VE)] Value estimation (VE) is an operation that takes as input a multi-set of query answers of the form (V, Q) and outputs one estimate of the form (E, Q), where V and E are of the form V = V(c1, c2, ..., cn), representing the answer to a query and the final estimate of the query answer, respectively; Q is a quality measure in one of two forms: a vector Q = Q(q1, q2, ..., qn), qi ∈ R ∪ {∅}, representing the quality evaluation for V (E), or a scalar Q = q ∈ R ∪ {∅} giving the quality of all values in V (E). □

In the definition of VE, we require the input to carry the quality information Q. The use of Q varies across specific VE algorithms, but in general a VE algorithm needs this information to make the final estimate E as accurate as possible. For a data source that does not support returning quality information, Q can be taken as a constant scalar for all values, representing the system's prior on the values returned from that data source. The output of VE takes the same syntactic form as the input. This enables recursive value estimations; indeed, a VE algorithm does not care whether an input (V, Q) is a raw answer from a data source or the output of an upstream VE operation.

The example in Section 4.1 can be used again to illustrate how VE following this definition works, while we defer technical and algorithmic details to further discussion. In the PDMS shown in Figure 4.1, the query select cellid, damage from celldmg where cellid=1 and dmglevel=8 receives answers in the format Ans((id, damage), (var_id, var_dmg)) from data source B as ansB((1, 0.6), (∅, 0.02)) and from C as ansC((1, 0.8), (∅, 0.04)). Here the symbol ∅ means that the cellids are accurate. A VE algorithm may return est((1, 0.7), (∅, 0.06)) as the final estimate. Regardless of the logic used by the VE algorithm, the VE process combines multiple assertions from different data sources in the PDMS for a query and returns one value as requested.
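As one concrete instance of a VE scheme under the Gaussian interpretation we plan to pursue (Section 4.3), the following Python sketch combines (value, variance) inputs by inverse-variance weighting. It is only one possible estimator, shown to fix the ideas; the unweighted mean used in the narrative above (0.7) is another valid choice.

    # A minimal VE sketch: treat each input as a Gaussian (value, variance)
    # and combine them by inverse-variance weighting. This is one possible
    # estimator, not the definitive VE algorithm of the proposal.
    def ve(answers):
        # answers: list of (value, variance) pairs, e.g. from sources B and C
        weights = [1.0 / var for (_, var) in answers]
        est = sum(w * v for w, (v, _) in zip(weights, answers)) / sum(weights)
        var = 1.0 / sum(weights)   # variance of the combined estimate
        return est, var

    print(ve([(0.6, 0.02), (0.8, 0.04)]))   # -> (0.666..., 0.0133...)

Note that the output (est, var) has the same form as each input, so the recursion property required by Definition 4.2.1 holds for this scheme as well.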

We notice that there is considerable analogy between a VE and an aggregation operation: both take a set of values as input and return one value as output. The VE can be viewed as a special class of aggregation in this sense, and we plan to investigate whether we can use this analogy to better integrate VE with aggregate query processing in the PDMS. We also notice that VE is still substantially different from general-purpose aggregate operations such as sum, avg, max or min. First, a VE is usually a more complicated process than an aggregate function: it takes both the value of a candidate and its quality measure into consideration, while aggregate functions work directly on a set of values. Second, the semantics of VE and aggregate functions are different. Usually the result of an aggregation carries the semantics defined by the aggregate function; it can be "total" or "maximum" or others, while in VE the output value has exactly the same semantics as the input candidates. Finally, the goal of VE is different from that of aggregation. An aggregate function is just a computation that, with a pre-defined logic, computes a new value out of a set of inputs; VE, however, is an estimation-based operation whose goal is to find a value that is a good approximation to the true value queried.

4.3 Research plan for value estimation

As briefly described in Chapter 1, the research plan for value estimation (VE) is to develop a VE framework in the PDMS and to integrate the VE process into the processing of PDMS aggregation queries (PAQ). We will develop several value estimation (VE) schemes under the same framework to fit different value estimation needs, and let the user choose the most suitable VE for their queries. As a first VE scheme, we plan to develop a VE algorithm taking its inputs as Gaussian distributions. After this, we plan to investigate the problem of continuously updating VE results for queries, to fit the continuous query processing architecture. Both the non-continuous and continuous versions of VE will involve distributing the VE computation around the PDMS to amortize the computational overhead and to optimize the networking cost.

4.4 Related work for value estimation

Provenance (aka lineage) in databases has recently received a lot of research attention [22, 26, 27, 29]. Provenance "describes the source and derivation of data" [27]. The VE operation we propose relies on a quality measure of the input data to perform the estimation, and in many cases data quality directly relates to the sources and the workflow that derives the data. Therefore, provenance information can be useful for the VE operation. Many methods have been developed to represent and manage provenance information. For example, annotations such as in [25] are widely used to represent provenance. In [28] the authors discuss the annotation propagation problem for views. In [58], color blocks are used to tag the data items in the curation of scientific databases. Although related in the above sense, solving the provenance problem does not immediately solve the value estimation problem. One reason is that provenance information does not always imply the quality of data or query answers; a more important reason is that the VE operation cares about the "quality" but not the originality of the data and query answers. In a VE operation, we do not distinguish inputs with the same quality, although they may come from different sources and/or be computed by different workflows. For example, one input can come from a very accurate data source but be distorted in data processing, while another may be directly retrieved from a less accurate data source. As long as both inputs carry the same quality measure as seen by the VE operation, they are treated exactly the same.

We propose to model the input to VE, a pair (value, quality), as a probability distribution. Interpreting inputs in this way, the VE operation makes estimations over a mixture model. A probability mixture model is defined as a convex combination of a set of component distributions [99, 133], which in our context is the set of inputs to the VE. A large branch of mixture model research focuses on identifying component distributions from the mixed data points observed. For example, the parametric mixture model asks to identify the parameters of the component distributions, for which expectation maximization (EM) is an important method to classify and cluster data points. It is worth noting that our VE operation works in just the reverse direction.

So far we have not found published work directly addressing the VE problem as stated in Section 4.2; we are continuing to survey the literature for related work on this topic.

4.5 Continuous query definitions

In Chapter 1 and Section 4.1, we described the basic ideas of the continuous query and its use in PDMS query processing. Now we give a preliminary model for PDMS continuous query processing. We define a PDMS continuous query as follows.

Definition 4.5.1. [PDMS continuous query (PCQ)] A PDMS continuous query (PCQ) is a tuple K(Q, D, E, T) where Q is a query with its head of the form Q(~a); D is a destination data source to which the query answers and updates are sent; E is a vector of the same size as ~a serving as an error threshold, i.e., E = ~e, |~e| = |~a|, ei ∈ (R ∪ {∅})^2; and T is a boolean formula that controls the termination of K. An answer or an update to a PCQ is a tuple A(ans(~a), ~q, s, t) where ans is the answer to query Q; ~q is a quality measure with qi ∈ R ∪ {∅}; s is the data source that issues this answer (update); and t is a timestamp for this update. □

The vector E in Definition 4.5.1 puts thresholds on each value pair (A.ai, A.qi) of the PCQ answer. It controls how the answers are updated during continuous processing.

In the definition of PCQ, the query Q is written in the schema of the data source that processes the query. Therefore, when query rewriting is performed in the PDMS, the query Q changes, and T needs to be changed accordingly so it can continue to control the termination of the translated query. For an answer to a PCQ, the source s and the timestamp t are used together to identify an update so that an update from one data source will not overwrite an answer from another. The timestamp is used to make sure that the newest update is always used, although due to unpredictable network delay, an outdated answer may arrive even later than a more up-to-date answer.

Here is a PCQ example K(Q, D, E, T) using the PDMS in Figure 4.1: K.Q: Q(id, damage) :- CellDmg(id, damage, 8); K.D = "A"; K.E = ((∅,∅), (0.1, 0.02)); K.T = "false". Basically, this query asks to return cell damage at earthquake level 8 to data source A. The query processing engine should send updates if an updated answer differs from a previous answer by ≥ 0.1 on the damage assessment value or by ≥ 0.02 on the quality measure. An example answer to this query is Ans((1, 0.3, 8), (∅, 0.02, ∅), "B", "1200"), which says "an answer from data source B, at time 1200, states that the cell with id 1 suffers damage 0.3 for an earthquake of level 8; the variance for the damage assessment value 0.3 in this answer is 0.02, while the other values are believed to be accurate."
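To fix the ideas, the Python sketch below implements the update logic implied by K.E, s and t in this example: an answer from a source is discarded if an answer with a newer timestamp has already been seen, and it is forwarded to the destination only when some component changes by at least its threshold. This is a hypothetical illustration of Definitions 4.5.1 and 4.5.2, not the engine's actual code.

    # A minimal sketch of PCQ answer maintenance for updates to one query.
    # E mirrors the thresholds on (damage, quality) in the K.E example above.
    E = (0.1, 0.02)

    newest = {}      # source -> timestamp of the newest answer seen so far
    reported = {}    # source -> values last reported to the destination D

    def on_update(source, values, timestamp):
        # Discard an answer older than one already seen from this source
        # (out-of-order arrival due to network delay).
        if timestamp <= newest.get(source, float("-inf")):
            return False
        newest[source] = timestamp
        prev = reported.get(source)
        # Report only if some component moved by at least its threshold.
        if prev is None or any(abs(v - p) >= e
                               for v, p, e in zip(values, prev, E)):
            reported[source] = values
            return True
        return False

    print(on_update("B", (0.30, 0.02), 1200))   # True: first report from B
    print(on_update("B", (0.35, 0.02), 1300))   # False: below both thresholds
    print(on_update("B", (0.42, 0.03), 1400))   # True: damage moved by >= 0.1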

PDMS continuous query processing is defined to process PCQ queries following Definition 4.5.1.

Definition 4.5.2. [PDMS continuous query processing] Given a PDMS continuous query (PCQ) K and a PDMS, PDMS continuous query processing is to compute and maintain updates to this continuous query in the PDMS. Specifically, it translates K to the data sources in the PDMS for processing, and it maintains the error thresholds K.E so that any change to the answer that exceeds K.E is reported to the destination D. Additionally, T is maintained so that the continuous processing terminates when the condition in T is met. □

Continuous query processing will be supported by the query processing engine in the PDMS. The query processing engine supports both ad-hoc queries and continuous queries. An ad-hoc query can be viewed as a continuous query with T set to true after the first set of query answers is returned.


4.6 Research plan for continuous query

We plan to support continuous query in the PDMS query processing engine for the PDMS aggregation query (PAQ) and value estimation (VE), so that eventually both VE and continuous query will work for PDMS aggregation query (PAQ) processing. We further define CPAQ as the continuous query version of PAQ, and CVE for continuous value estimation.

Definition 4.6.1. [continuous PDMS aggregation query (CPAQ)] A continuous PDMS aggregation query (CPAQ) is a PCQ K(Q, D, E, T) in which Q is a PDMS aggregation query (PAQ). □

CPAQ processing is defined as a special case of PDMS continuous query processing and follows Definition 4.5.2. We define CVE in a similar way to the PCQ definition, by replacing the PDMS query with a VE operation.

Definition 4.6.2. [continuous value estimation (CVE)] A CVE is a tuple K(P, D, E, T) where P is a VE operation, D is the destination data source, E is an error threshold of the form E = E(ev, eq), and T is the termination condition. □

We define CVE processing similarly to CPAQ processing.

Definition 4.6.3. [CVE processing] Given a continuous value estimation K and a PDMS, CVE processing is to compute and maintain updates to K.P in the PDMS. Specifically, whenever the new value estimate differs from the most recently reported estimate by more than K.E, a new value estimation result is reported to the destination D. Furthermore, T is maintained so that CVE processing terminates when T = true. □

CVE processing is closely related to CQ processing in that the input to the value estimation operator in a CVE is continuously updated. That is the reason we include both CVE and CPAQ in the research plan.

4.7 Related work for continuous query

The continuous query was introduced early in [132] to filter incoming data such as email messages and forum posts in append-only database settings. This new type of query received separate attention and was formalized in [20], where the applications of continuous queries are much widened. Research in data streams welcomes continuous queries as first-class citizens [19, 34, 45, 91, 101, 126] due to the fact that stream computations are mostly data driven. In a typical stream processing task, a computation is often defined over one or more streams, and a small-sized synopsis is maintained for the computation. This stream computation naturally becomes a continuous query, as the output is continuously updated upon the arrival of new stream items. Each time a data item is observed on the stream, the synopsis is updated and the system decides whether the computation results need to be updated.

Applications in monitoring [2] and alerting [118] also find the continuous query useful. In those applications, the monitoring and alerting tasks are predefined as continuous queries, and the goal of query processing is to quickly capture data changes and report answers back in time. Various monitoring and alerting missions are carried out by wireless sensor networks [70, 96, 111, 139] formed by battery-powered sensors with short-range wireless communication capability [10]. Each sensor processes one or more pre-defined continuous queries (usually hard-coded, such as probing the temperature) and stops functioning after depleting its battery. As listening to wireless channels and transmitting data over the air consume most of the energy in its operations, preserving energy becomes the primary concern for a sensor network. A sensor's lifetime can be extended by optimizing the data transmission [53, 88, 120, 128]. The peer-to-peer (P2P) network is a typical architecture used by sensor networks as well as in other scenarios involving distributed and autonomous data management. The works reported in [18, 56, 59, 74, 77, 145] discuss various scenarios that require efficient continuous query processing minimizing the result updates that rely on multi-hop communication in such a distributed environment.

To make a system scalable in the number of continuous queries, queries are treated as another class of "data" and are indexed so that they can be quickly looked up upon relevant data updates [77, 145]. This idea is generalized in a recent work [91], where the authors describe how data and queries can be modeled as multi-dimensional points and how a spatial join can efficiently find the queries relevant to an incoming data item. The PDMS we develop for JIIRP also needs to monitor data changes, e.g., the changing condition of an infrastructure after an earthquake strikes. Although we do not have an urgent need to save peer energy in our application, the limited networking resources, the potentially large volumes of data updates to the peers and the need for a quick response all demand optimized continuous query processing. Load balancing and scalability in both the size of the PDMS and the number of continuous queries are important to our PDMS.

Maintaining aggregates in a continuous fashion is studied in [18, 62, 63, 71]. In [71] the authors propose to let the user control the execution of aggregation queries. One of their solutions is to compute running confidence intervals that let the users decide whether the results obtained already meet their needs, so that query processing can stop. The works in [18, 62] consider different aggregation tasks, but they share the idea that the accuracy tolerance of a query is a resource that can be assigned to the distributed contributors to reduce the amount of message transmission in query answering. The work [103] uses a similar idea to adaptively adjust the time intervals at which peers send reports. In [46] the authors trade off the function (re)evaluation cost against the accuracy guarantees for continuous queries by adaptively selecting functions of different complexity. Our tasks of supporting continuous aggregate queries in the PDMS are distinguished from the existing approaches in that, in addition to some common optimizations, we also optimize the query translation process. Furthermore, in PAQ processing we need to incrementally maintain both the object decomposition and the computed aggregate values.

While the above works discuss supporting continuous queries for new applications in new architectures, the works in [126] and [12] aim to support continuous queries in traditional relational databases and to develop a declarative, SQL-like query language that captures continuous query semantics. In [126], the authors report the progress in Oracle on supporting continuous queries in an RDBMS. Special data structures called "clogs" and materialized aggregate views (MAV) are used to incrementally maintain the query results.

The works in [75] and [47] consider adaptive query processing in data integration. In [75], the authors argue that in a data integration system, where many statistics are unavailable, the system should be adaptive in producing and selecting query plans. Although the query processing setting in [75] substantially differs from continuous query processing, the observations about the data integration environment and the suggestions for adaptive query processing are still valid for long-running continuous queries. The query plan scheduling problem is considered in [47]. The authors observe that many query plans are often generated while reformulating a query in a data integration system, and argue that the query execution engine should first execute those plans that yield more answer tuples, in order to improve throughput. The authors do not discuss continuous query processing, but their observations implicitly suggest optimizing the query reformulation process.


Chapter 5

Research Plan and Time Table

We summarize the research plan in this chapter and give a tentative timetable for it. The order of the future investigation will be "aggregation support -> value estimation -> continuous query support." We will focus mainly on the first two; the level of CQ support will be decided based on the progress of the first two.

The following is the timetable for the proposed research.

Topic                                | Time               | Expected results                                   | Comments
Architecture (Section 3.1.3)         | Feb 09 - April 09  | Revise the PDMS architecture                       | Implement a PDMS prototype
PAQ query support (Chapter 3)        | March 09 - July 09 | Algorithms for PAQ; process PAQ queries from JIIRP | Implement a PAQ engine in the PDMS
VE support (Chapter 4)               | July 09 - Nov 09   | Support VE with PAQ query                          | Implement with the PAQ engine
Continuous query support (Chapter 4) | Nov 09 - Feb 10    | Continuous query support for PAQ and VE            | Implement with the PDMS prototype

Table 5.1: Time table for the proposed research


Bibliography

[1] Proceedings of the 24th International Conference on Data Engineering Workshops, ICDE 2008, April 7-12, 2008, Cancun, Mexico, 2008. IEEE Computer Society. → pages 72, 78

[2] D. J. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. B. Zdonik. Aurora: a new model and architecture for data stream management. VLDB J., 12(2):120–139, 2003. → pages 60

[3] K. Aberer, P. Cudre-Mauroux, and M. Hauswirth. A framework for semantic gossiping. SIGMOD Record, 31(4):48–53, 2002. → pages 13

[4] F. N. Afrati and R. Chirkova. Selecting and using views to compute aggregate queries (extended abstract). In T. Eiter and L. Libkin, editors, ICDT, volume 3363 of Lecture Notes in Computer Science, pages 383–397. Springer, 2005. ISBN 3-540-24288-0. → pages 36

[5] F. N. Afrati and P. G. Kolaitis. Answering aggregate queries in data exchange. In Lenzerini and Lembo [87], pages 129–138. ISBN 978-1-60558-108-8. → pages 13, 35, 36, 37, 38, 44

[6] F. N. Afrati, C. Li, and P. Mitra. Answering queries using views with arithmetic comparisons. In Popa [107], pages 209–220. → pages 42

[7] F. N. Afrati, C. Li, and P. Mitra. On containment of conjunctive queries with arithmetic comparisons. In Bertino et al. [24], pages 459–476. ISBN 3-540-21200-0. → pages 42

[8] S. Agarwal, J. Lim, L. Zelnik-Manor, P. Perona, D. J. Kriegman, and S. Belongie. Beyond pairwise clustering. In CVPR (2), 2005. → pages 18

[9] R. Akbarinia, E. Pacitti, and P. Valduriez. Best position algorithms for top-k queries. In Koch et al. [85], pages 495–506. ISBN 978-1-59593-649-3. → pages 13

[10] S. A. Aldosari and J. M. F. Moura. Fusion in sensor networks with communication constraints. In K. Ramchandran, J. Sztipanovits, J. C. Hou, and T. N. Pappas, editors, IPSN, pages 108–115. ACM, 2004. ISBN 1-58113-846-6. → pages 60

[11] B. Arai, G. Das, D. Gunopulos, and V. Kalogeraki. Approximating aggregation queries in peer-to-peer networks. In Barga and Zhou [21], page 42. → pages 13

[12] A. Arasu, S. Babu, and J. Widom. CQL: A language for continuous queries over streams and relations. In G. Lausen and D. Suciu, editors, DBPL, volume 2921 of Lecture Notes in Computer Science, pages 1–19. Springer, 2003. ISBN 3-540-20896-8. → pages 61

[13] M. Arenas, L. E. Bertossi, J. Chomicki, X. He, V. Raghavan, and J. Spinrad. Scalar aggregation in inconsistent databases. Theor. Comput. Sci., 3(296):405–434, 2003. → pages 37

[14] M. Arenas, V. Kantere, A. Kementsietsidis, I. Kiringa, R. J. Miller, and J. Mylopoulos. The hyperion project: from data integration to data coordination. SIGMOD Record, 32(3):53–58, 2003. → pages 31, 35, 39, 40, 44

[15] M. Arenas, V. Kantere, A. Kementsietsidis, I. Kiringa, R. J. Miller, and J. Mylopoulos. The hyperion project: From data integration to data coordination. SIGMOD Record, 32(3):53–58, 2003. → pages 1

[16] M. Arenas, J. Perez, and C. Riveros. The recovery of a schema mapping: bringing exchanged data back. In Lenzerini and Lembo [87], pages 13–22. ISBN 978-1-60558-108-8. → pages 39

[17] R. Avnur and J. M. Hellerstein. Eddies: Continuously adaptive query processing. In ACM SIGMOD International Conference on Management of Data (SIGMOD), pages 261–272, 2000. → pages 7, 12, 13

[18] B. Babcock and C. Olston. Distributed top-k monitoring. In Halevy et al. [66], pages 28–39. ISBN 1-58113-634-X. → pages 60

[19] S. Babu and J. Widom. Continuous queries over data streams. SIGMOD Record, 30(3):109–120, 2001. → pages 59

[20] D. Barbara. The characterization of continuous queries. Int. J. Cooperative Inf. Syst., 8(4):295–, 1999. → pages 59

[21] R. S. Barga and X. Zhou, editors. Proceedings of the 22nd International Conference on Data Engineering Workshops, ICDE 2006, 3-7 April 2006, Atlanta, GA, USA, 2006. IEEE Computer Society. → pages 66, 69, 70, 74, 75

[22] O. Benjelloun, A. D. Sarma, A. Y. Halevy, M. Theobald, and J. Widom. Databases with uncertainty and lineage. VLDB J., 17(2):243–264, 2008. → pages 56

[23] P. Bernstein, F. Giunchiglia, A. Kementsietsidis, J. Mylopoulos, L. Serafini, and I. Zaihrayeu. Data management for peer-to-peer computing: A vision. In WebDB, pages 89–94, 2002. → pages 15

[24] E. Bertino, S. Christodoulakis, D. Plexousakis, V. Christophides, M. Koubarakis, K. Bohm, and E. Ferrari, editors. Advances in Database Technology - EDBT 2004, 9th International Conference on Extending Database Technology, Heraklion, Crete, Greece, March 14-18, 2004, Proceedings, volume 2992 of Lecture Notes in Computer Science, 2004. Springer. ISBN 3-540-21200-0. → pages 65, 71

[25] D. Bhagwat, L. Chiticariu, W. C. Tan, and G. Vijayvargiya. An annotation management system for relational databases. VLDB J., 14(4):373–396, 2005. → pages 57

[26] R. Bose and J. Frew. Lineage retrieval for scientific data processing: a survey. ACM Comput. Surv., 37(1):1–28, 2005. → pages 56

[27] P. Buneman and W. C. Tan. Provenance in databases. In Chan et al. [32], pages 1171–1173. ISBN 978-1-59593-686-8. → pages 56

[28] P. Buneman, S. Khanna, and W. C. Tan. On propagation of deletions and annotations through views. In Popa [107], pages 150–158. → pages 57

[29] P. Buneman, A. Chapman, and J. Cheney. Provenance management in curated databases. In Chaudhuri et al. [33], pages 539–550. ISBN 1-59593-256-9. → pages 56

[30] P. S. Canada. Joint infrastructure interdependencies research program, 2005-2009. URL http://www.publicsafety.gc.ca/prg/em/jiirp/index-eng.aspx. → pages 1

[31] M. Castro, P. Druschel, Y. C. Hu, and A. Rowstron. Proximity neighbor selection in tree based structured peer-to-peer overlays. Technical report, Microsoft, 2003. → pages 18

[32] C. Y. Chan, B. C. Ooi, and A. Zhou, editors. Proceedings of the ACM SIGMOD International Conference on Management of Data, Beijing, China, June 12-14, 2007, 2007. ACM. ISBN 978-1-59593-686-8. → pages 67, 71, 74

[33] S. Chaudhuri, V. Hristidis, and N. Polyzotis, editors. Proceedings of the ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, USA, June 27-29, 2006, 2006. ACM. ISBN 1-59593-256-9. → pages 67, 68, 70, 73, 74

[34] J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. Niagaracq: A scalable continuous query system for internet databases. In W. Chen, J. F. Naughton, and P. A. Bernstein, editors, SIGMOD Conference, pages 379–390. ACM, 2000. ISBN 1-58113-218-2. → pages 59

[35] V. Cholvi, P. Felber, and E. W. Biersack. Efficient search in unstructured peer-to-peer networks. In SPAA, 2004. → pages 18

[36] B.-G. Chun, B. Y. Zhao, and J. Kubiatowicz. Impact of neighbor selection on performance and resilience of structured p2p networks. In IPTPS, 2005. → pages 13, 18

[37] S. Cohen. User-defined aggregate functions: bridging theory and practice. In Chaudhuri et al. [33], pages 49–60. ISBN 1-59593-256-9. → pages 13

[38] S. Cohen, W. Nutt, and A. Serebrenik. Rewriting aggregate queries using views. In PODS, pages 155–166. ACM Press, 1999. ISBN 1-58113-062-7. → pages 29, 36, 42

[39] S. Cohen, W. Nutt, and Y. Sagiv. Rewriting queries with arbitrary aggregation functions using views. ACM Trans. Database Syst., 31(2):672–715, 2006. → pages 36, 42

[40] C. Cramer and T. Fuhrmann. Proximity neighbor selection for a dht in wireless multi-hop networks. In Peer-to-Peer Computing, 2005. → pages 18

[41] A. Crespo and H. Garcia-Molina. Semantic overlay networks for p2p systems. In G. Moro, S. Bergamaschi, and K. Aberer, editors, AP2PC, volume 3601 of Lecture Notes in Computer Science, pages 1–13. Springer, 2004. ISBN 3-540-29755-3. → pages 30

[42] G. Das, D. Gunopulos, N. Koudas, and D. Tsirogiannis. Answering top-k queries using views. In Dayal et al. [43], pages 451–462. ISBN 1-59593-385-9. → pages 13

[43] U. Dayal, K.-Y. Whang, D. B. Lomet, G. Alonso, G. M. Lohman, M. L. Kersten, S. K. Cha, and Y.-K. Kim, editors. Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12-15, 2006, 2006. ACM. ISBN 1-59593-385-9. → pages 69, 76, 78

[44] A. Delis, C. Faloutsos, and S. Ghandeharizadeh, editors. SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadelphia, Pennsylvania, USA, 1999. ACM Press. ISBN 1-58113-084-8. → pages 70, 71

[45] M. Denny and M. J. Franklin. Predicate result range caching for continuous queries. In F. Ozcan, editor, SIGMOD Conference, pages 646–657. ACM, 2005. ISBN 1-59593-060-4. → pages 59

[46] M. Denny and M. J. Franklin. Operators for expensive functions in continuous queries. In Barga and Zhou [21], page 147. → pages 61

[47] A. Doan and A. Y. Halevy. Efficiently ordering query plans for data integration. In ICDE, pages 393–. IEEE Computer Society, 2002. ISBN 0-7695-1531-2. → pages 61

[48] A. Doan and A. Y. Halevy. Semantic-integration research in the database community. AI Mag., 26(1):83–94, 2005. ISSN 0738-4602. → pages 41

[49] A. Doan and A. Y. Halevy. Semantic integration research in the database community: A brief survey. AI Magazine, 26(1):83–94, 2005. → pages 13, 16

[50] X. Dong, A. Y. Halevy, and C. Yu. Data integration with uncertainty. In VLDB '07, pages 687–698. VLDB Endowment, 2007. ISBN 978-1-59593-649-3. → pages 20

[51] X. L. Dong, A. Y. Halevy, and C. Yu. Data integration with uncertainty. In Koch et al. [85], pages 687–698. ISBN 978-1-59593-649-3. → pages 14

[52] S. Dubnov, R. El-Yaniv, Y. Gdalyahu, E. Schneidman, N. Tishby, and G. Yona. A new nonparametric pairwise clustering algorithm based on iterative estimation of distance profiles. Machine Learning, 47(1):35–61, 2002. → pages 18

[53] A. Dunkels, F. Osterlind, and Z. He. An adaptive communication architecture for wireless sensor networks. In S. Jha, editor, SenSys, pages 335–349. ACM, 2007. ISBN 978-1-59593-763-6. → pages 60

[54] R. Fagin, P. G. Kolaitis, L. Popa, and W. C. Tan. Composing schema mappings: Second-order dependencies to the rescue. In PODS, pages 83–94, 2004. → pages 9, 13, 17, 21, 39

[55] R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data exchange: semantics and query answering. Theor. Comput. Sci., 336(1):89–124, 2005. → pages 36, 37, 39

[56] B. Gedik and L. Liu. Peercq: A decentralized and self-configuring peer-to-peer information monitoring system. In ICDCS, pages 490–499. IEEE Computer Society, 2003. ISBN 0-7695-1920-2. → pages 60

[57] B. Gedik and L. Liu. Quality-aware distributed data delivery for continuous query services. In Chaudhuri et al. [33], pages 419–430. ISBN 1-59593-256-9. → pages 13

[58] F. Geerts, A. Kementsietsidis, and D. Milano. Mondrian: Annotating and querying databases through colors and blocks. In Barga and Zhou [21], page 82. → pages 57

[59] H. G. Gok and O. Ulusoy. Transmission of continuous query results in mobile computing systems. Inf. Sci., 125(1-4):37–63, 2000. → pages 60

[60] J. Goldstein and P.-A. Larson. Optimizing queries using materialized views: A practical, scalable solution. In SIGMOD Conference, pages 331–342, 2001. → pages 36

[61] G. Gottlob and A. Nash. Data exchange: computing cores in polynomial time. In Vansummeren [136], pages 40–49. ISBN 1-59593-318-2. → pages 35, 36, 37, 39

[62] R. Gupta and K. Ramamritham. Optimized query planning of continuous aggregation queries in dynamic data dissemination networks. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 321–330. ACM, 2007. ISBN 978-1-59593-654-7. → pages 60

[63] P. J. Haas and J. M. Hellerstein. Ripple joins for online aggregation. In Delis et al. [44], pages 287–298. ISBN 1-58113-084-8. → pages 60

[64] P. Haase and R. Siebes. Peer selection in peer-to-peer networks with semantic topologies. In ICSNW'04, pages 108–125, 2004. → pages 18

[65] A. Y. Halevy. Theory of answering queries using views. SIGMOD Record, 29(4):40–47, 2000. → pages 42

[66] A. Y. Halevy, Z. G. Ives, and A. Doan, editors. Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 9-12, 2003, 2003. ACM. ISBN 1-58113-634-X. → pages 66, 72, 74

[67] A. Y. Halevy, Z. G. Ives, D. Suciu, and I. Tatarinov. Piazza: Data management infrastructure for semantic web applications. In ICDE, pages 505–516, 2003. → pages 1, 4, 15, 30

[68] A. Y. Halevy, Z. G. Ives, J. Madhavan, P. Mork, D. Suciu, and I. Tatarinov. The piazza peer data management system. IEEE Trans. Knowl. Data Eng., 16(7):787–798, 2004. → pages 6, 35, 40, 44

[69] B. He and K. C.-C. Chang. Statistical schema matching across web query interfaces. In ACM SIGMOD International Conference on Management of Data (SIGMOD), pages 217–228, 2003. → pages 38

[70] W. R. Heinzelman, A. Chandrakasan, and H. Balakrishnan. Energy-efficient communication protocol for wireless microsensor networks. In HICSS, 2000. → pages 60

[71] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In J. Peckham, editor, SIGMOD Conference, pages 171–182. ACM Press, 1997. → pages 60

[72] M. A. Hernández, R. J. Miller, and L. M. Haas. Clio: A semi-automatic tool for schema mapping. In SIGMOD, 2001. → pages 38

[73] R. Huebsch, M. N. Garofalakis, J. M. Hellerstein, and I. Stoica. Sharing aggregate computation for distributed queries. In Chan et al. [32], pages 485–496. ISBN 978-1-59593-686-8. → pages 13

[74] S. Idreos, M. Koubarakis, and C. Tryfonopoulos. P2p-diet: One-time and continuous queries in super-peer networks. In Bertino et al. [24], pages 851–853. ISBN 3-540-21200-0. → pages 60

[75] Z. G. Ives, D. Florescu, M. Friedman, A. Y. Levy, and D. S. Weld. An adaptive query execution system for data integration. In Delis et al. [44], pages 299–310. ISBN 1-58113-084-8. → pages 61

[76] N. Jain, M. Dahlin, Y. Zhang, D. Kit, P. Mahajan, and P. Yalagandula. Star: Self-tuning aggregation for scalable monitoring. In Koch et al. [85], pages 962–973. ISBN 978-1-59593-649-3. → pages 13

[77] J. Kannan, B. Yang, S. Shenker, P. Sharma, S. Banerjee, S. Basu, and S.-J. Lee. Smartseer: Using a dht to process continuous queries over peer-to-peer networks. In INFOCOM. IEEE, 2006. → pages 60

[78] V. Kantere, I. Kiringa, J. Mylopoulos, A. Kementsietsidis, and M. Arenas. Coordinating peer databases using eca rules. In K. Aberer, V. Kalogeraki, and M. Koubarakis, editors, DBISP2P, volume 2944 of Lecture Notes in Computer Science, pages 108–122. Springer, 2003. ISBN 3-540-20968-9. → pages 39

[79] F. B. Kashani and C. Shahabi. Fixed-precision approximate continuous aggregate queries in peer-to-peer databases. In ICDE DBL [1], pages 1427–1429. → pages 13

[80] S. R. Kashyap, S. Deb, K. V. M. Naidu, R. Rastogi, and A. Srinivasan. Efficient gossip-based aggregate computation. In Vansummeren [136], pages 308–317. ISBN 1-59593-318-2. → pages 13

[81] A. Kementsietsidis, M. Arenas, and R. J. Miller. Mapping data in peer-to-peer systems: Semantics and algorithmic issues. In Halevy et al. [66], pages 325–336. ISBN 1-58113-634-X. → pages 39

[82] A. Kementsietsidis, M. Arenas, and R. J. Miller. Managing data mappings in the hyperion project. In U. Dayal, K. Ramamritham, and T. M. Vijayaraman, editors, ICDE, pages 732–734. IEEE Computer Society, 2003. ISBN 0-7803-7665-X. → pages 39

[83] A. Kemper, P. Valduriez, N. Mouaddib, J. Teubner, M. Bouzeghoub, V. Markl, L. Amsaleg, and I. Manolescu, editors. EDBT 2008, 11th International Conference on Extending Database Technology, Nantes, France, March 25-29, 2008, Proceedings, volume 261 of ACM International Conference Proceeding Series, 2008. ACM. ISBN 978-1-59593-926-5. → pages 74, 75, 77

[84] I. A. Klampanos and J. M. Jose. An architecture for information retrieval over semi-collaborating peer-to-peer networks. In SAC, 2004. → pages 18

[85] C. Koch, J. Gehrke, M. N. Garofalakis, D. Srivastava, K. Aberer, A. Deshpande, D. Florescu, C. Y. Chan, V. Ganti, C.-C. Kanne, W. Klas, and E. J. Neuhold, editors. Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007, 2007. ACM. ISBN 978-1-59593-649-3. → pages 65, 69, 72, 76, 78

[86] P. G. Kolaitis. Schema mappings, data exchange, and metadata management. In C. Li, editor, PODS, pages 61–75. ACM, 2005. ISBN 1-59593-062-0. → pages 31, 38, 39

[87] M. Lenzerini and D. Lembo, editors. Proceedings of the Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2008, June 9-11, 2008, Vancouver, BC, Canada, 2008. ACM. ISBN 978-1-60558-108-8. → pages 65, 66, 73, 76

[88] L. E. Li and P. Sinha. Throughput and energy efficiency in topology-controlled multi-hop wireless sensor networks. In C. S. Raghavendra, K. M. Sivalingam, R. Govindan, and P. Ramanathan, editors, Wireless Sensor Networks and Applications, pages 132–140. ACM, 2003. ISBN 1-58113-764-8. → pages 60

[89] L. Libkin. Data exchange and incomplete information. In Vansummeren [136], pages 60–69. ISBN 1-59593-318-2. → pages 31, 35, 36

[90] L. Libkin and C. Sirangelo. Data exchange and schema mappings in open and closed worlds. In Lenzerini and Lembo [87], pages 139–148. ISBN 978-1-60558-108-8. → pages 35, 36

[91] H.-S. Lim, J.-G. Lee, M.-J. Lee, K.-Y. Whang, and I.-Y. Song. Continuous query processing in data streams using duality of data and queries. In Chaudhuri et al. [33], pages 313–324. ISBN 1-59593-256-9. → pages 59, 60

[92] H.-S. Lim, J.-G. Lee, M.-J. Lee, K.-Y. Whang, and I.-Y. Song. Continuous query processing in data streams using duality of data and queries. In Chaudhuri et al. [33], pages 313–324. ISBN 1-59593-256-9. → pages 7, 13

[93] A. Loser, F. Naumann, W. Siberski, W. Nejdl, and U. Thaden. Semantic overlay clusters within super-peer networks. In DBISP2P, pages 33–47, 2003. → pages 18

[94] S. Madden, M. A. Shah, J. M. Hellerstein, and V. Raman. Continuously adaptive continuous queries over streams. In M. J. Franklin, B. Moon, and A. Ailamaki, editors, SIGMOD Conference, pages 49–60. ACM, 2002. ISBN 1-58113-497-5. → pages 7, 12, 13

[95] J. Madhavan and A. Y. Halevy. Composing mappings among data sources. In VLDB 2003, 2003. → pages 9, 13, 17

[96] A. M. Mainwaring, D. E. Culler, J. Polastre, R. Szewczyk, and J. Anderson. Wireless sensor networks for habitat monitoring. In C. S. Raghavendra and K. M. Sivalingam, editors, WSNA, pages 88–97. ACM, 2002. ISBN 1-58113-589-0. → pages 60

[97] N. Mamoulis, K. H. Cheng, M. L. Yiu, and D. W. Cheung. Efficient aggregation of ranked inputs. In Barga and Zhou [21], page 72. → pages 13

[98] F. Mandreoli, R. Martoglia, S. Sassatelli, P. Tiberio, and W. Penzo. Using semantic mappings for query routing in a pdms environment. In SEBD, 2006. → pages 18

[99] G. McLachlan and D. Peel. Finite Mixture Models. Wiley-Interscience, 2000. ISBN 978-0471006268. → pages 57

[100] S. Melnik, A. Adya, and P. A. Bernstein. Compiling mappings to bridge applications and databases. In Chan et al. [32], pages 461–472. ISBN 978-1-59593-686-8. → pages 13

[101] K. Mouratidis, S. Bakiras, and D. Papadias. Continuous monitoring of top-k queries over sliding windows. In Chaudhuri et al. [33], pages 635–646. ISBN 1-59593-256-9. → pages 59

[102] C. H. Ng, K. C. Sia, and C.-H. Chan. Advanced peer clustering and firework query model in the peer-to-peer network. In WWW (Poster), 2003. → pages 13, 18

[103] C. Olston, J. Jiang, and J. Widom. Adaptive filters for continuous queries over distributed data streams. In Halevy et al. [66], pages 563–574. ISBN 1-58113-634-X. → pages 61

[104] M. Pavan and M. Pelillo. Dominant sets and pairwise clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 29(1):167–172, 2007. → pages 18

[105] W. Penzo, S. Lodi, F. Mandreoli, R. Martoglia, and S. Sassatelli. Semantic peer, here are the neighbors you want! In Kemper et al. [83], pages 26–37. ISBN 978-1-59593-926-5. → pages 30

[106] W. Penzo, S. Lodi, F. Mandreoli, R. Martoglia, and S. Sassatelli. Semantic peer, here are the neighbors you want! In Kemper et al. [83], pages 26–37. ISBN 978-1-59593-926-5. → pages 13

[107] L. Popa, editor. Proceedings of the Twenty-first ACMSIGACT-SIGMOD-SIGART Symposium on Principles of DatabaseSystems, June 3-5, Madison, Wisconsin, USA, 2002. ACM. → pages 65, 67

[108] R. Pottinger and P. A. Bernstein. Schema merging and mapping creationfor relational sources. In Kemper et al. [83], pages 73–84. ISBN978-1-59593-926-5. → pages 38

[109] R. Pottinger and A. Y. Levy. A scalable algorithm for answering queriesusing views. In A. E. Abbadi, M. L. Brodie, S. Chakravarthy, U. Dayal,N. Kamel, G. Schlageter, and K.-Y. Whang, editors, VLDB, pages 484–495.Morgan Kaufmann, 2000. ISBN 1-55860-715-3. → pages 42

[110] R. A. Pottinger and A. Y. Halevy. Minicon: A scalable algorithm foranswering queries using views. VLDB Journal, 10(2-3):182–198, 2001. →pages 42

[111] L. Prasad, S. S. Iyengar, R. L. Kashyap, and R. N. Madan. Functional characterization of sensor integration in distributed sensor networks. In IPPS, pages 186–193, 1991. → pages 60

[112] E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10(4):334–350, 2001. ISSN 1066-8888. doi:10.1007/s007780100057. → pages 41

[113] M. K. Ramanathan, V. Kalogeraki, and J. Pruyne. Finding good peers in peer-to-peer networks. In IPDPS, 2002. → pages 13, 18

[114] A. Robles-Kelly and E. R. Hancock. Pairwise clustering with matrix factorisation and the EM algorithm. In European Conference on Computer Vision (ECCV), pages 63–77, London, UK, 2002. Springer-Verlag. ISBN 3-540-43744-4. → pages 18

[115] P. Rodríguez-Gianolli, M. Garzetti, L. Jiang, A. Kementsietsidis, I. Kiringa, M. Masud, R. J. Miller, and J. Mylopoulos. Data sharing in the Hyperion peer database system. In VLDB, pages 1291–1294, 2005. → pages 15, 30

[116] A. I. T. Rowstron and P. Druschel. Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In Middleware, 2001. → pages 18

[117] A. D. Sarma, O. Benjelloun, A. Y. Halevy, and J. Widom. Working models for uncertain data. In Barga and Zhou [21], page 7. → pages 14

[118] U. Schreier, H. Pirahesh, R. Agrawal, and C. Mohan. Alert: An architecture for transforming a passive DBMS into an active DBMS. In G. M. Lohman, A. Sernadas, and R. Camps, editors, VLDB, pages 469–478. Morgan Kaufmann, 1991. ISBN 1-55860-150-3. → pages 60

[119] P. Senellart and G. Gottlob. On the complexity of deriving schema mappings from database instances. In Lenzerini and Lembo [87], pages 23–32. ISBN 978-1-60558-108-8. → pages 38

[120] M. A. Sharaf, J. Beaver, A. Labrinidis, and P. K. Chrysanthis. Balancing energy efficiency and quality of aggregate data in sensor networks. VLDB Journal, 13(4):384–403, 2004. → pages 60

[121] M. A. Sharaf, P. K. Chrysanthis, A. Labrinidis, and K. Pruhs. Efficient scheduling of heterogeneous continuous queries. In Dayal et al. [43], pages 511–522. ISBN 1-59593-385-9. → pages 7, 13

[122] N. Shental, A. Zomet, T. Hertz, and Y. Weiss. Pairwise clustering and graphical models. In Neural Information Processing Systems, 2003. → pages 18

[123] K. Sripanidkulchai, B. M. Maggs, and H. Zhang. Efficient content location using interest-based locality in peer-to-peer systems. In INFOCOM, 2003. → pages 18

[124] D. Srivastava, S. Dar, H. V. Jagadish, and A. Y. Levy. Answering queries with aggregation using views. In T. M. Vijayaraman, A. P. Buchmann, C. Mohan, and N. L. Sarda, editors, VLDB, pages 318–329. Morgan Kaufmann, 1996. ISBN 1-55860-382-4. → pages 36

[125] I. Stoica, R. Morris, D. Liben-Nowell, D. R. Karger, M. F. Kaashoek, F. Dabek, and H. Balakrishnan. Chord: a scalable peer-to-peer lookup protocol for internet applications. IEEE/ACM Transactions on Networking, 11(1):17–32, 2003. → pages 18

[126] S. Subramanian, S. Bellamkonda, H.-G. Li, V. Liang, L. Sheng, W. Smith, J. Terry, T.-F. Yu, and A. Witkowski. Continuous queries in Oracle. In Koch et al. [85], pages 1173–1184. ISBN 978-1-59593-649-3. → pages 59, 61

[127] S. Subramanian, S. Bellamkonda, H.-G. Li, V. Liang, L. Sheng, W. Smith, J. Terry, T.-F. Yu, and A. Witkowski. Continuous queries in Oracle. In Koch et al. [85], pages 1173–1184. ISBN 978-1-59593-649-3. → pages 13

[128] A. Talukder, R. Bhatt, T. Sheikh, R. Pidva, L. Chandramouli, and S. Monacos. Dynamic control and power management algorithm for continuous wireless monitoring in sensor networks. In LCN, pages 498–505. IEEE Computer Society, 2004. ISBN 0-7695-2260-2. → pages 60

[129] I. Tatarinov and A. Y. Halevy. Efficient query reformulation in peer data management systems. In SIGMOD, pages 539–550, 2004. → pages 17

[130] A. Telang, R. Mishra, and S. Chakravarthy. Ranking issues for information integration. In ICDE Workshops, pages 257–260. IEEE Computer Society, 2007. → pages 13

[131] C. Tempich, A. Löser, and J. Heizmann. Community based ranking in peer-to-peer networks. In ODBASE, 2005. → pages 18

[132] D. B. Terry, D. Goldberg, D. A. Nichols, and B. M. Oki. Continuous queries over append-only databases. In M. Stonebraker, editor, SIGMOD Conference, pages 321–330. ACM Press, 1992. → pages 59

[133] D. Titterington, A. F. Smith, and U. E. Makov. Statistical Analysis of Finite Mixture Distributions. John Wiley & Sons, 1986. ISBN 978-0471907633. → pages 57

[134] D. Tsoumakos and N. Roussopoulos. AGNO: An adaptive group communication scheme for unstructured P2P networks. In Euro-Par, 2005. → pages 18

[135] J. W. Tukey. Exploratory Data Analysis. Addison-Wesley Publishing, 1977. ISBN 0201076160. → pages 24

[136] S. Vansummeren, editor. Proceedings of the Twenty-Fifth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 26–28, Chicago, Illinois, USA, 2006. ACM. ISBN 1-59593-318-2. → pages 70, 72, 73

[137] T. Wang and R. Pottinger. SeMap: a generic mapping construction system. In Kemper et al. [83], pages 97–108. ISBN 978-1-59593-926-5. → pages 39

[138] X. Tong, D. Zhang, and Z. Yang. Efficient content location based on interest-cluster in peer-to-peer system. In ICEBE, 2005. → pages 18

[139] B. Xu, O. Wolfson, and S. Chamberlain. Spatially distributed databases on sensors. In K.-J. Li, K. Makki, N. Pissinou, and S. Ravada, editors, ACM-GIS, pages 153–160. ACM, 2000. ISBN 1-58113-319-7. → pages 60

[140] F. Xu and C. Jermaine. Randomized algorithms for data reconciliation in wide area aggregate query processing. In Koch et al. [85], pages 639–650. ISBN 978-1-59593-649-3. → pages 13

[141] J. Xu and R. Pottinger. Optimizing acquaintance selection in a PDMS. Technical Report TR-2007-18, Department of Computer Science, University of British Columbia, 2007. → pages 22, 23

[142] W. P. Yan and P.-A. Larson. Eager aggregation and lazy aggregation. In U. Dayal, P. M. D. Gray, and S. Nishio, editors, VLDB, pages 345–357. Morgan Kaufmann, 1995. ISBN 1-55860-379-4. → pages 36

[143] K. Yi, F. Li, G. Kollios, and D. Srivastava. Efficient processing of top-k queries in uncertain databases. In ICDE [1], pages 1406–1408. → pages 14

[144] C. Yu and H. V. Jagadish. Schema summarization. In Dayal et al. [43], pages 319–330. ISBN 1-59593-385-9. → pages 13

[145] Y. Zhu. Bandwidth-efficient continuous query processing over DHTs. In ICPP, pages 91–98. IEEE Computer Society, 2008. → pages 60
