Progressive skylining over Web-accessible databases

29
Progressive Skylining over Web-Accessible Database Eric Lo a Kevin Yip a King-Ip Lin b David W. Cheung a a Department of Computer Science, The University of Hong Kong b Division of Computer Science, The University of Memphis Abstract Skyline queries return a set of interesting data points that are not dominated by any other point on all dimensions. Most of the existing algorithms focus on skyline computation in centralized databases, and some of them can progressively return skyline points as they are identified rather than returning the answer in a batch. Processing skyline queries over the web is much challenging. One reason is that in many web applications, the target attributes are stored at different sites, and they might not be available other than through external web-accessible form interfaces. In this paper, we develop PDS (progressive distributed skylining), a progressive algorithm that evaluates skyline queries efficiently in this setting. PDS could also be applied to different types of web skyline queries. Furthermore, it supports a user-friendly progress indicator that allows users to monitor the progress of long running skyline queries. Our performance study further shows that PDS is efficient and robust to different data distributions and achieves its progressive goal with a minimal overhead. Key words: Distributed DBs, Query Optimaization, Inf. services on the Web, Web-based information systems 1 Introduction Many database users are interested in “best” queries – queries that return the objects that best fit a certain set of criteria. However, in many cases – Email addresses: [email protected] (Eric Lo), [email protected] (Kevin Yip), [email protected] (King-Ip Lin), [email protected] (David W. Cheung). Preprint submitted to Elsevier Science 17 September 2004

Transcript of Progressive skylining over Web-accessible databases

Progressive Skylining over Web-Accessible

Database

Eric Lo a Kevin Yip a King-Ip Lin b David W. Cheung a

aDepartment of Computer Science, The University of Hong Kong

bDivision of Computer Science, The University of Memphis

Abstract

Skyline queries return a set of interesting data points that are not dominated byany other point on all dimensions. Most of the existing algorithms focus on skylinecomputation in centralized databases, and some of them can progressively returnskyline points as they are identified rather than returning the answer in a batch.Processing skyline queries over the web is much challenging. One reason is that inmany web applications, the target attributes are stored at different sites, and theymight not be available other than through external web-accessible form interfaces.In this paper, we develop PDS (progressive distributed skylining), a progressivealgorithm that evaluates skyline queries efficiently in this setting. PDS could alsobe applied to different types of web skyline queries. Furthermore, it supports auser-friendly progress indicator that allows users to monitor the progress of longrunning skyline queries. Our performance study further shows that PDS is efficientand robust to different data distributions and achieves its progressive goal with aminimal overhead.

Key words:

Distributed DBs, Query Optimaization, Inf. services on the Web, Web-basedinformation systems

1 Introduction

Many database users are interested in “best” queries – queries that returnthe objects that best fit a certain set of criteria. However, in many cases –

Email addresses: [email protected] (Eric Lo), [email protected] (Kevin Yip),[email protected] (King-Ip Lin), [email protected] (David W.Cheung).

Preprint submitted to Elsevier Science 17 September 2004

2

4

6

8

10

2 4 6 8 10 0

Distance to beach

Price i

f

c

b

d

e

h

a

g

j

Fig. 1. Example dataset and skyline

especially in multi-criteria decision making – there is no single best answerto a query. Instead, we have a set of objects that satisfy the conditions butneither one can claim to be better than the other.

For example, suppose we want to book a hotel. We may specify the room price(the lower the better) and the distance of the hotel to the beach (the shorterthe better) as the evaluation criteria. A hotel is not interesting at all if thereexists another available hotel that is not more expensive, but closer to thebeach, or one that is not further away from the beach, but costs less. Figure 1shows ten hotel options, where the two axes represent the two evaluationcriteria in arbitrary scales. Following the argument above, options a, d, e, g,h and j (marked by unfilled circles) are all uninteresting. However, hotel cis interesting, as any other hotel is either more expensive, or is further awayfrom the beach. The same properties are also found in hotel b, f, i (markedby solid circles). While none of them can claim to be the ‘best’ hotel, eachone can lay claim that no other hotel is better than it. We call all the objectsthat satisfy this criteria (having no one better than it) a skyline object, andthe queries that retrieve all skyline objects, skyline queries [1]. We also useskyline to denote the set of all skyline objects.

Skylines are useful to multi-criteria decision making as they represent the setof all solutions that the user can safely take without fear that something betteris out there. It can act as a filter to discard sub-optimal objects. The user canthen interactively look at the (smaller) set of skyline objects and further selectthe ones that fit his/her needs. Thus effectively finding skylines is an importanttask for database systems. Moreover, as data are nowadays spread over theweb, it is possible that each criterion (represented by an attribute of an object)is stored separately. For our hotel example, the room prices and distancesto beach are stored at a broker site and a digital map site, respectively. Sodevising effective algorithms for this kind of distributed skyline queries are of

2

utmost importance.

As mentioned, skylines can be used in an interactive fashion, as a quick filter toestablish a smaller subset for the user to further decide on his/her choice. Forinstance, if a user specifies room price and distance to beach as the evaluationcriteria, and observes that the results contain some hotels that are too far awayfrom the airport, he may add the time to the airport as a new evaluationcriteria. To support this, an algorithm that allows quick response – whereskyline objects can be output as soon as they are found, not wait until all ofthem are discovered – is crucial.

Thus the goal of this paper is to devise methods that effectively answer dis-tributed skyline queries. Such methods must return skyline objects progres-sively: that is, the algorithm should determine whether an object is a skylinethe first time it sees an object, and reports result immediately, without wait-ing till the end. The algorithm should minimize the number of remote dataaccess (as cost of remote data access can be much higher then computationcost, see [2,3]). Such algorithm can be extremely useful in application such asinteractive search engines (for example, returning the best set of hotel rooms).

There has been pioneering algorithms on finding skylines in centralized data-bases [1], as well as finding them progressively [4–6]. Moreover, [2] extendsskyline queries on the web environment described above. However, we noticedthere has not been work on finding skylines progressively on a Web environ-ment. We believe our work is important as we bring the two aspects of skylinequeries together by providing an efficient (less data access cost) and effective(progressive, with quick response time) skyline algorithm on the Web.

Here are the contributions of this paper: (1) We present a novel algorithm,PDS (progressive distributed skylining), for evaluating skyline queries overweb data sources. It returns skyline progressively by using R-trees to identifyon the fly whether the latest encountered object belongs to the skyline (thenoutputs it immediately). (2) In addition to using R-trees, PDS also employsa new linear-regression based heuristic to enable faster discovery of skyline ina wide range of applications. (3) We extend PDS to evaluate top-K skylinequery over web databases. (4) We extend PDS to support a progress indicatorsuch that users know the evaluation progress of long-running skyline queries.(5) We have evaluated PDS through various experiments, which show that itoutperforms and is more robust than the distributed skyline algorithms in [2]in terms of both the number of source accesses and processing time.

The rest of the paper is organized as follows. Section 2 presents the relatedwork on skyline queries processing. In Section 3, we review the basic blockingskyline algorithms for the web, which solve a problem similar to ours, but donot return skyline objects progressively. Section 4 describes the details of PDS.

3

We discuss how to extend PDS to evaluate other types of skyline queries andto support a query progress indicator in Section 5. We report the experimentalresults in Section 6 and Section 7 concludes the paper with some directionsfor future work.

2 Definition and Related Work

In this section we define the skyline problem more formally. We describe themodel we use to model the web data access. We also briefly introduce therelated work in skyline query processing.

2.1 Definition and model

Given a database of objects, a user is interested in retrieving objects basedon a set of evaluation criteria. Each criteria is modeled as a total-orderedattribute of the objects. For simplicity’s sake we assume each attribute is anon-negative real number, and the user always treat smaller value as better.(In our hotel example, lower price corresponds to better choice).

Given a set of evaluation criteria, we say an object a is dominated by anotherobject b (or equivalently, b is said to dominate a) if b is not worse than a asevaluated by every criterion, and b is strictly better than a as evaluated by oneor more criteria. This means overall b is strictly better than a. Two objectsare said to be incomparable if both of them are not dominated by the other.For instance, all objects in the skyline are pairwise incomparable.

With the above definition, the skyline of a dataset is formally defined as theset of all data objects that are not dominated by any objects in the dataset.

In this paper, we assume the evaluation criteria (attributes) for objects aredistributed. Each attribute is stored on a separate data source Di. More-over, following [3], we assume each data source provides only two data accessmethods: Sorted Access (SA) and Random Access (RA). Sorted accesses areperformed through the getNext() function, which returns the best objectand its attribute value among those that have not been accessed. Referringto Figure 1, if the getNext() function of the digital map web site is calledmultiple times, it will return <i, 0> on the first call, <f , 1> on the second, <d,3> on the third, and so on. If multiple objects have the same attribute value,the returning order is arbitrary. Random accesses are performed through thegetScore() function, which takes an object as an input, and returns the at-tribute value of the object. The call getScore(h) to the digital map web site

4

will return the value 7. We use S(Di, O, VOi) to denote sorted access on datasource Di that returns object O and its attribute value VOi. The correspondingnotation for a random access is R(Di, O, VOi).

2.2 Related work

Skylines, and other directly related problems like maximum vector [7], contourproblem [8] and multi-objective optimization [9], have been well-studied in con-ventional datasets. However, the algorithms do not work when the dataset doesnot fit into main memory. In relational database context, Borzsonyi et al. [1]first proposed two algorithms, namely Divide-and-Conquer (D&C) and BlockNested Loops (BNL), to evaluate skyline queries. D&C divides the datasetinto several partitions such that each partition can fit into memory. Skylineobjects for each individual partition are then computed by a main-memoryskyline algorithm. The final skyline is obtained by merging the skyline objectsfor each partition. BNL compares each object in a nested-loop by keeping a listof candidate skyline objects in main memory. Dominated objects are discardedduring the comparisons. The efficiency of BNL is further improved in [10] bysorting the entire dataset according to a monotonic function a priori. Sinceobjects are sorted in a particular order, objects with lower scores are likely tobe pruned by objects that are already in the list.

Tan et al. presented two progressive skyline algorithms [6] that require pre-processing. The first algorithm encodes the dataset as a collection of bitmaps.Decision on whether an object belongs to the skyline is based on some bit op-erations. The second algorithm organizes a set of d-dimensional objects intod lists such that an object o is assigned to list i if and only if its value atattribute i is the best among all attributes of o. Each list is indexed by a B-tree, and the skyline is computed by scanning the B-tree until an object thatdominates the remaining entries in the B-trees is found. Kossmann et al. [4]observed that the skyline problem is closely related to the nearest neighbor(NN) search problem. They proposed a method that returns skyline objectsby applying nearest neighbor search on an R-tree indexed dataset recursively.Papadias et al. surveyed most of the secondary-memory algorithms for cen-tralized database and proposed an I/O optimal skyline algorithm that is alsobased on R-tree in [5].

The above methods are designed for centralized database and do not work inthe distributed web environment. Following the general query model [3] forthe web environment, Balke et al. [2] extended the skyline problem for the webin which the attributes of an object are distributed in different web-accessibleservers, as we have described in the last section. They proposed two algorithmsthat compute the skyline in such a distributed environment, which return all

5

skyline objects at the end rather than return them progressively. Since theproblem that they study is similar to the one of the current study, we willdiscuss them in detail in the next section as an important reference to ournew algorithm.

3 Skylining on the Web

3.1 Basic distributed skyline (BDS) algorithm

The basic distributed skyline (BDS) algorithm in [2] contains two phases. Thefirst phase identifies a subset of objects that includes the skyline. The secondphase filters away all the non-skyline objects in that subset.

In the first phase, data is retrieved by sorted access only, and each data sourceis invoked in a round-robin fashion. To illustrate the idea, we extend our hotelexample by adding an extra evaluation criterion, time to airport (obtainedfrom the web site of a shuttle bus company), and present it in tabular formatin Figure 2. Based on the round-robin scheme, BDS performs sorted accesson the database in the order S(D1, b, 0), S(D2, i, 0), S(D3, e, 1), S(D1, a, 1),S(D2, f, 1), and so on. The algorithm avoids reading an excessive amountof data by looking for a terminating object, which is the first object whoseattribute values at all sources have been retrieved (via sorted access). In Fig-ure 2, the terminating object is f , which is detected after performing 12 sortedaccesses. An important lemma regarding the terminating object is proved in[2], which we rephrase in terms of our notations as follows:

Lemma 1 When a terminating object T is detected, and no more sorted ac-cesses from each data source Di can return a value equal to VT i, all objectsnot yet encountered cannot be in the skyline.

If all values of each attribute are distinct, the lemma simply states that phaseone is complete when the terminating object is detected, and all objects notyet encountered cannot be in the skyline. In our example, when f is detectedas a terminating object, objects g and h have not been encountered by sortedaccesses. According to the lemma, they cannot be skyline objects. Intuitively,this is because they have all attribute values larger than those of f , so theyare dominated by f . The condition “and no more sorted access from each datasource Di can return a value equal to VT i” of the lemma covers the case wherean attribute is allowed to contain duplicated values. Therefore, if duplicatedvalues are allowed in our example, the following three more sorted accessesare needed to confirm f as the real terminating object: S(D1, d, 4), S(D2, c, 5)and S(D3, b, 5).

6

Distance Time to

Hotel Price Hotel to beach Hotel airport

b 0 i 0 e 1

a 1 f 1 c 2

c 2 d 3 j 3

f 3 e 4 f 4

d 4 c 5 b 5

g 5 b 6 g 6

j 6 h 7 h 7

e 7 a 8 i 8

h 8 g 9 d 9

i 9 j 10 a 10

Broker (site D1) Digital map (site D2) Bus company (site D3)

Fig. 2. Example dataset with three attributes distribute at three different sites

Hotel D1 D2 D3

b 0 5

a 1

c 2 5 2

f 3 1 4

d 4 3

K1

Hotel D1 D2 D3

i 0

f 3 1 4

d 4 3

e 4 1

c 2 5 2

K2

Hotel D1 D2 D3

e 4 1

c 2 5 2

j 3

f 3 1 4

b 0 5

K3

Fig. 3. The tables that store the data obtained from phase one of the BDS algorithm

For each attribute Di, we create a table Ki storing the attributes of the objectsretrieved by sorted access on Di. Notice that objects in each Ki contains all theattributes that are retrieved (not just attribute Di). Figure 3 shows the tablesfor the three attributes of our extended hotel example. These tables are passedto phase two, which filters out the non-skyline objects. Non-skyline objectscan be identified by retrieving the missing attribute values through randomaccesses (e.g., R(D2, b, 6)) and performing a pairwise comparison between allobjects in the tables. Objects found to be dominated by some others arefiltered. To reduce the number of comparisons, [2] observes that an object Ocan only be dominated by other objects that exist in all tables that contain O.For example, since table K1 contains object b but not e, b cannot be dominatedby e. It follows from the fact that the value of e at attribute D1 must not besmaller than that of b since it is not retrieved before the completion of phaseone. Therefore, pairwise comparisons need to be performed between objectsthat exist in the same table only. Using the same argument, a missing valuefor attribute Di in a table cannot be smaller than the largest Di value storedin the table. For example, according to table K1, the value of d at D3 shouldbe at least 5. This means d is proved to be dominated by f , without actuallygetting the value of d at D3 through a random access.

7

3.2 Improved distributed skyline (IDS) algorithm

It can be seen from Figure 2 that if a terminating object is known a priori, it ispossible to perform fewer sorted accesses in phase one and have fewer objectsstored in the Ki tables, which in turn may reduce the number of randomaccesses and pairwise comparisons in phase two. For example, if object f ispredicted to be a terminating object, once S(D2, d, 3) is performed, we couldstop doing sorted accesses on D2 according to lemma 1.

Based on the idea, an improved distributed skyline (IDS) algorithm, based on aheuristic to predict the “most probable” terminating object, is proposed in [2].IDS tries to find an object whose attribute values can be fully retrieved usingthe smallest number of sorted accesses. The heuristic works as follows: initiallya sorted access is performed on D1, and the object retrieved is treated as themost probable terminating object temporarily. In our example, the object is b.Unlike the original BDS algorithm that performs only sorted accesses in phaseone, the heuristic requires the other attribute values of the most probableterminating object to be retrieved right away through random accesses, whichinclude R(D2, b, 6) and R(D3, b, 5) in our example. Using the attribute valuesof b, a score is computed to estimate the remaining amount of sorted accessesrequired to retrieve all attribute values of b. The score is composed of the sumof the score component of each attribute, which is calculated as the differencebetween b’s value and the last value retrieved through a sorted access. Since noattribute values at D2 and D3 have been retrieved through sorted accesses, theminimum value 0 is used. The score for object b is thus (0−0)+(6−0)+(5−0) =11.

After calculating the score, the algorithm is ready to perform the next sortedaccess. Instead of doing it in the round-robin fashion, the heuristic randomlyselects one of the attributes that the most probable terminating object has notbeen encountered through sorted accesses. For object b, the attributes are D2

and D3. Suppose D2 is selected randomly, then the sorted access S(D2, i, 0)is performed. Again, the other attribute values of i are retrieved throughrandom accesses R(D1, i, 9) and R(D3, i, 8), and the score of it is computedas (9 − 0) + (0 − 0) + (8 − 0) = 17. Since the score is larger than that ofb, intuitively the “distance” to reach b through sorted accesses is shorter, sothe heuristic keep b as the current candidate and moves on to the next sortedaccess.

8

4 Progressive Distributed Skyline Algorithm

In this section we describe the details of PDS. For ease of presentation, wefirst assume that all the score values are distinct. Therefore, objects returnedby sorted access are sorted in a strictly ascending order like Figure 2. InSection 4.1, we present the shortcomings of BDS and IDS and the motivationsof PDS. Then we show how PDS could identify skyline objects progressivelyin Section 4.2 and how PDS improves the performance of distributed skyliningby rank estimation in Section 4.3. Section 4.4 describes the basic PDS and wegeneralized it to the case where duplicate values exist in Section 4.5. Finally,in Section 4.6, we informally evaluate our work with respect to the desirablefeatures of progressive algorithms.

4.1 Motivation

Section 3 presented the BDS and IDS algorithms from [2]. While they performwell in various cases, we believe there are grounds for improvement:

(1) In both algorithms, an object is confirmed as a skyline object only afterthe pairwise comparisons in phase two is completed. In real-time applicationsthat may result in a long response time before the first skyline objects to bereturned. In fact, it is possible to identify skyline objects much earlier. Forexample, the two sorted accesses S(D1, b, 0) and S(D1, a, 1) are sufficient toshow that b is a skyline object (since no other objects can have a value smallerthan b at D1), so it can be returned right away. Similarly, some objects can bedetected as dominated by others in an early stage, so they can be discardedimmediately so as to avoid including them in the pairwise comparisons inphase two.

(2) While IDS reduces the number of data access, its success depends on theheuristic accurately predicting the terminating object. Yet it only works insome specific situations. In some cases, as we are going to show in section 6,the prediction is not accurate and the number of data accesses is increased,rather than decreased, due to the extra random accesses performed in phaseone.

Thus, we propose a progressive distributed skylining (PDS) algorithm. PDSdoes not wait till the end of the algorithm to start return skyline objects. Infact, if there are no duplicate score values in each source, PDS very quicklydetermines whether an object is a skyline immediately after it is first retrievedby it and output the object if it is a skyline object, resulting in quick responsetime.

9

Moreover, we overcome the limitation of the existing heuristic by proposingto use the (estimated) rank of an object for terminating object prediction. Wefurther present a linear-regression-based method to estimate the rank of theobject. We argue in section 4.3 that rank is a more appropriate measure todetermine the terminating objects. Experiments shows that our estimate isvery robust over a wide range of distributions. Moreover, it also works wellwhen data are correlated or anti-correlated. Thus we believe rank estimationis an effective way to speed up the algorithms.

Below we discuss the two ideas: progressiveness and rank estimation.

4.2 Enable progressiveness

To enable progressiveness, we need to determine whether an object just re-trieved belongs to the final skyline as soon as the object is retrieved. Achievingthis requires the following lemma:

Lemma 2 If a data source Di returns data objects in a strictly monotonicorder, an object O retrieved from Di would only be dominated by objects thatare retrieved from Di before O.

Proof. By definition, if an object O is dominated by another object P , ev-ery attribute value of P must be smaller than or equal to the correspondingattribute value of O. Since Di returns no duplicate values, the value of P atDi must be smaller than that of O. This means sorted accesses on Di mustreturn P before O.

Lemma 2 shows that if an object O is retrieved from a data source by sortedaccess, we could only need to test if O is dominated by any objects that appearsbefore O in the same source only. If no such object exists, O is definitely partof the skyline – as all objects that appear after O cannot dominate O (as theyare worst than O in that data source).

Take data source D3 in Figure 2 as an example. Object e(7, 4, 1) is first re-trieved by invoking the getNext() function on D3 and is identified as partof the skyline because no objects can dominate e on data source D3 (the at-tribute values 7 and 4 of e are retrieved by random accesses 1 if necessary).Next, object c(2, 5, 2) is retrieved and compared with all objects that appearbefore it (i.e., object e) and found to be incomparable with e. By lemma 2,we know that c would not be dominated by objects after it and thus c can beoutput as part of the skyline immediately. Similarly, object j(6, 10, 3) is next

1 Following [2,3], here we assume an object O must exist in every data source Di;however in practice, we could discard O if it does not exist in all data sources.

10

2

4

6

8

10

2 4 6 8 10 0

y -axis (Distance)

x -axis (Price)

e

c

(a) Testing of c on R3

2

4

6

8

10

2 4 6 8 10 0

y -axis (Distance)

x -axis (Price)

e

j

c

(b) Testing of j on R3

2

4

6

8

10

2 4 6 8 10 0

y -axis (Distance)

x -axis (Price)

e

f

c

(c) Testing of f on R3

N 1

2

4

6

8

10

2 4 6 8 10 0

y -axis (Distance)

x -axis (Price)

e

a

b c

d

f g

h

N 2

N 3

N 4

N 5

N 6

x

(d) An example R-tree

Fig. 4. R-tree for dominance checking

retrieved from D3. We compare it with e and c and found that j is dominatedby e in all dimensions and therefore j is not part of the skyline.

Still, lemma 2 requires an object O to be compared to every object retrievedbefore it in the same source. We would like to cut down on the number ofpairwise comparison. To this end, we propose using a main-memory multidi-mensional index to speed up this process. We choose R*-tree [11] as the indexbecause of its efficiency and wide acceptance in skyline literature [4,5]. Ourapproach differs from the R*-tree algorithms that find skyline in centralizeddatabase [4,5] because the target data set is not ready before the evaluationof query.

The R*-tree method builds an R*-tree Ri for each attribute i involved inthe skyline query. The R*-tree for attribute i contains the skyline objectsdiscovered so far based on sorted access in Di. As a result, for an objectO retrieved from a data source Di, we can check to see if any object in Ri

dominates O. If we found no such objects, O is identified as a skyline objectimmediately and is inserted into Ri. In order to demonstrate the concept of

11

R*-tree, let us consider data source D3 again. Initially, the skyline object e isinserted into R3. Note that since the score values of the objects retrieved fromeach sources are strictly increasing, we do not need to insert the i-th attributevalue of an object into Ri. For instance, the skyline object e(7, 4, 1) is insertedinto R3 in the form of e(7, 4).

To test whether a newly retrieved object O is dominated by any object thathave been seen before in the same source Di, a containment query QO is con-structed with the origin set as the lower-left corner and O set as the upper-rightcorner and see if any object in Ri contained within QO. If QO is an emptyquery, this indicates that there is no objects appeared before O has bettervalues than O in the projected dimensions. In this case, O is a skyline objectand can be output. Conversely, if QO is a non-empty query, that means thereexists at least one object, say S, is better than O in all projected dimen-sions. Since S appears before O, this implies that S must also better than Oin attribute Di. Therefore, O is dominated by S in all dimensions and doesnot belong to the final skyline. Let us revisit data source D3 as an example.Firstly, recall that object e(7, 4, 1) is a skyline and thus inserted into R3 inthe form of e(7, 4) (only stores the projected dimensions). To test whether thenext retrieved object c(2, 5, 2) is a skyline, a containment query Qc[(0, 0)(2, 5)]that based on the projected dimension D1 and D2 of c is posed on R3 (seeFigure 4(a)). Since query Qc return no answers, object c is output as a sky-line and is inserted into R3. Next, to see if object j(6, 10, 3) belongs to theskyline, we again pose another containment query Qj[(0, 0)(6, 10)] on R3 (seeFigure 4(b)). Query Qj returns c as answer, thus it implies that object c dom-inates j at all three attributes and thus j can be discarded. Similarly, thecontainment query Qf [(0, 0)(3, 1)] that based on object f(3, 1, 4) is an emptyquery (see Figure 4(c)), therefore f belongs to the skyline and would be in-serted into R3. Notice that since f(3, 1) dominates e(7, 4) in R3, insertion off into R3 should delete objects that are dominated by f in order to make theR*-tree more compact and efficient. In this example, we should delete e in R3

after we insert f .

The benefits of using R*-tree to check domination of objects is two-fold. First,the R*-tree built for each attribute is very compact in size since it stores onlythe skyline objects that have highest pruning power (e.g., object e is deletedafter object f is inserted into R3). Second, the containment query operation isvery efficient because a containment query Q only visits objects with minimumbounding box that overlap with Q in the compact R*-tree. Figure 4(d) showsan example R*-tree index with node structure. We can see that the dominancechecking function does not need to access node N6 and its children for testingobject x.

Summing up the above ideas, we first present an R*-tree based dominancechecking function isDominated in Figure 5. This function will be employed

12

Function isDominated(Object o, Attribute i, R*-tree Ri)

1. //Ri:the R*-tree storing skyline objects appears in attribute i

2. let N as the number of dimensions involved in the skyline

3. construct a (N − 1)d containment query Qo

4. set the origin as the lower-left corner of Qo (ignore dimension i)

5. set o as the upper-right corner of Qo (ignore dimension i)

6. if Ri.contain(Qo) returns true //o is not a skyline

7. return true

8. else //o is a skyline

9. construct a (N − 1)d containment query Qo

10. set o be the lower-left corner of Qo (ignore dimension i)

11. set infinitiy as the upper-right corner of Qo (ignore dimension i)

12. delete all objects falls in Qo

13. Ri.insert(Qo)

13. return false

End isDominated

Fig. 5. Dominance checking by R*-tree

by the main algorithm PDS later and it returns true if an input object O isdominated by an existing skyline object that appears before O in the samedata source.

4.3 A linear-regression based heuristic

In order for the distributed skyline algorithm to be efficient, it is crucial thata good terminating object is found quickly. Here we argue that a good termi-nating object is characterized by its rank in each of the sources.

We denote ranki(O) as the rank of object O is source Di. We adopt theconvention that the first object retrieved from the source has rank 1, thesecond one has rank 2 and so on. We define sumrank(O) =

∑i ranki(O) as

the sum of the ranks in all attributes of O. We note the following:

• If T is a terminating object, then the minimum number of sorted accessrequired to locate it is sumrank(T ).

• The number of objects that need to be compared increases when sumrank(T )increases, though the increase is not necessarily linear. Increasing in num-ber of objects implies more random access (to retrieve all the attributes) isrequired.

Thus the smaller the sumrank(O), the more preferable O is as the terminatingobject. Thus we would like PDS to locate an object with the minimum sum ofall ranks as the terminating object. This implies that the minimum number ofsorted access (and likely, random access) is needed to pinpoint the terminating

13

object. As a result, PDS will likely outperform any other heuristics.

However, in our model the rank of an object in a data source Di is not readilyavailable. For instance, random access provides the value of all the sources foran object. However, it does not return how many objects appear before it.The only way to know the rank of O for Di is via sorted access – when O isretrieved by the k-th sorted access, then the rank of O in Di is k. This meansthat discovering the rank of every object encountered is infeasible (a objectwith an high rank will leads to an unnecessarily high number of sorted access,rendering the algorithm inefficient).

As a result, we need a good heuristic to estimate the rank of an object at eachdata source. If one have information about the distribution of the values for asource, one can tailor-make a function to accurately estimate the rank of anobject at that source. In this paper, however, we assume no such informationexists. Thus we have to devise a standard heuristic to estimate the rank. Whilewe would like the heuristic to be accurate, we also would like to avoid overlylaborious calculations. Thus we propose using linear regression as our heuristicof rank estimation. For each source, PDS has the information about the rankof objects via the sorted access it already made. For a source Di, we use theinformation to devise an equation of the form ranki = ai ∗ valuei + bi, wherevaluei of an object at source Di and ranki is the rank of it. Assume thatwe have made k sorted access on source Di, we can obtain ai and bi via thefollowing linear regression formula.

ai =k

∑rankivaluei − (

∑valuei)(

∑ranki)

k∑

valuei2− (

∑valuei)2

(1)

bi =(∑

ranki)(∑

valuei2) − (

∑valuei)(

∑rankivaluei)

k∑

valuei2− (

∑valuei)2

(2)

This formula is easy to evaluate. Moreover, we can maintain a current valueof all its components (sums), and when a new value is retrieved via sortedaccess, update all the sums and evaluate the new component. As we need atleast two values to evaluate the formula, initially we make two sorted accessfor each source, and evaluate the initial value for each ai, bi. While we do notclaim that linear regression is accurate in all cases, we will show by experimentthat it works very well in many distributions – skewed or not.

14

Algorithm PDS

1. Skyline = ∅ //list of skyline objects

2. Pruned = ∅ //list of pruned objects – object that are not skylines

3. Cand = ∅ //list of objects that are candidates for terminating object

4. for each data source Di

5. perform SA and retrieves the two topmost objects oi1, oi2

6. perform necessary RA on oi1, oi2

7. OutputSkyline(oi1, i)

8. initalize regression coefficients ai, bi

9. CheckSkyline(oi2, i)

10. add oi1, oi2 to Cand

11. end for

12. while no objects are fully probed by SA

13. T = PickNextTermObject()

14. let D′ = set of all data sources where T has not been probed by SA

15. Randomly pick one data source D′

ifrom D′

16. perform SA on D′

i, let oi be the object retrieved

17. add oi to Cand

18. update regression coefficients ai, bi

19. if oi 6∈ Pruned or Skyline

20. CheckSkyline(oi, i)

21. end while

End PDS

Fig. 6. PDS algorithm

4.4 Algorithm description

The pseudo-code of the main algorithm PDS is presented in Figure 6 and theauxiliary functions of PDS are presented in Figure 7. At the beginning, thealgorithm retrieves objects with the two best score from each source. The firstobject retrieved from each source is a skyline object as they are the best inat least one dimension. Consider Figure 2 again, initially, the algorithm per-forms two sorted accesses on each source: S(D1, b, 0), S(D1, a, 1); S(D2, i, 0),S(D2, f, 1); S(D3, e, 1), S(D3, c, 2). Since the data values are sorted in a strictlyascending order (this assumption would be dropped later), object b, i and eare the skyline from attribute D1,D2 and D3 respectively. By lemma 2, afterthe unknown attribute values of object b, i and e are probed by RA, theycould be reported as skyline by the OutputSkyline function immediately.

Now, the regression coefficient ai and bi can be calculated according to theretrieved information. For example, given object <e,1> (rank 1) and <c,2>(rank 2) from data source D3, the values of a3 and b3 are:

15

Function CheckSkyline(Object o, Attribute i) //check if o is a skyline point

1. if isDominated(o, i, Ri)

2. add o to Pruned

3. else

4. OutputSkyline(o, i)

End CheckSkyline

Function PickNextTermObject() //Find the most probable terminating object

1. for each o in Cand

2. ranko = 0;

3. for each data source Di

4. if o’s rank in Di is known

5. ranko += o’s rank in Di

6. else

7. ranko += ai * oi.value + bi

8. end for

9. return o with the minimum ranko

End PickNextTermObject()

Function OutputSkyline(Object o, Attribute i) //report a skyline to user

1. report o as a skyline

2. add o to Skyline

End OutputSkyline()

Fig. 7. Various function of PDS

a3 =2(1 · 1 + 2 · 2) − (1 + 2)(1 + 2)

2(12 + 22) − (1 + 2)2= 1

b3 =(1 + 2)(12 + 22) − (1 + 2)(1 · 1 + 2 · 2)

2(12 + 22) − (1 + 2)2= 0

After the initialization of the regression coefficient ai and bi, PDS invokes theCheckSkyline function to determine whether the second-ranked objects a, fand c belong to the skyline. Essentially, the CheckSkyline checks if the inputobject is dominated by any other objects by the isDominated function whichwe have presented in the previous section. If the input object is dominatedby others, it would be pruned; else, it would be reported as skyline by theOutputSkyline function. After the regression coefficients are computed, anobject is selected by the function PickNextTermObject as the “most probable”terminating object among all the skyline objects seen so far. Table 1 showsan instance of the ranks of objects after the initialization step. Note that theestimated rank of an object by the linear regression equation is underlined.Referring to the table, object f is chosen as the current terminating object T

16

Object Value Rank Sum of Ranks

b (0, 6, 5) (1 ,7, 5) 13

a (1, 8,10) (2 ,9, 10) 21

i (9, 0, 8) (10,1, 8) 19

f (3, 1, 4) (4 ,2, 4) 10

e (7, 4, 1) (8 ,5, 1) 14

c (2, 5, 2) (3 ,6, 2) 11

Table 1The ranks of objects in example

Distance Time to

Hotel Price Hotel to beach Hotel airport

b 0 i 0 e 1

a 1 f 2 h 2

i 2 j 2 c 3

f 3 c 4 j 3

d 3 e 5 f 4

g 5 b 6 g 5

j 5 h 7 b 7

c 6 a 8 i 8

h 8 g 9 d 8

e 9 d 9 a 10

Broker (site D1) Digital map (site D2) Bus company (site D3)

Fig. 8. Dataset with duplicate values

because its sum of ranks, 10, is the smallest.

In the iterative step PDS randomly picks an attribute of the current terminat-ing object that has not been probed by sorted access as the next source to beprobed. Assume that D3 is selected as the next source to be probed, object jis retrieved by sorted access S(D3, j, 3). Afterwards, the unknown dimensionvalue of j is probed by RA and j is checked in R3 to see if it is a skyline objectby invoking the isDominated function again. The process repeats until thereis an object that is fully probed by SA, i.e., the terminating object T is found.

4.5 Handling duplicates

So far our discussions assume the data returned by sorted access are distinctin values and sorted in a strictly monotonic order. Essentially, Lemma 2 statesthat an object O retrieved from data source Di can be output as a skylineif no objects retrieved before O from the same source dominate it. However,Lemma 2 cannot be directly applied when a source stores objects with samevalues. Consider the data instance in Figure 8, object c and j now shares thesame data value 3 in database D3. Assume that after the initialization phase

17

Function OutputSkyline(Object o, Attribute i) //handle duplicate values version

1. //Bi : an R*-tree buffer on dimension i

2. if value of object o > current value in buffer in dimension i

3. flush all objects in the buffer Bi into Skyline and report them as skyline

4. clear buffer Bi

5. if isDominated(o,i, Bi)

6. insert object o into Bi

7. else

8. discard o

End OutputSkyline

Fig. 9. Revised OutputSkyline function

(i.e., each database has been accessed by sorted access twice), PDS regardsf as the terminating object and selects data source D3 to perform the nextsorted access. Although object c(6, 4, 3) retrieved from D3 is strictly betterthan both e(9, 5, 1) and h(8, 7, 2) on the projected dimension D1 and D2, cis not a skyline! It is because there is an object j(5, 2, 3) after c which is asgood as c on dimension D3 (with value 3) and is better than c on all the otherdimensions D1 and D2. Clearly, j dominates c and c should not regard as askyline object.

This problem could be easily solved by keeping a buffer Bi for each data sourceDi. Whenever an object is retrieved from Di and is identified as a possibleskyline object by the isDominated function, it is inserted into Bi instead ofreported as a skyline immediately. Essentially, the buffer Bi is another R*-tree that holds possible skyline objects with the same value in attribute Di

at a particular instant. The steps of inserting a possible skyline object intothe buffer is identical to those of testing dominance of newly retrieved objectsand thus can re-use the isDominated function. Therefore, for each possibleskyline object s that is inserted into the buffer Bi, we check if s is dominatedby any objects in the buffer (discard s if this is the case), else we delete allobjects in the buffer that are dominated by s. When an object with a largervalue in data source Di is inserted into Bi, objects that are still in Bi areoutput as skyline and the buffer is emptied. Figure 9 presents the revisedOutputSkyline function that handles duplicate values.

4.6 Effectiveness evaluation

Now, we informally evaluate the effectiveness of PDS according to the criterialisted in [12,4].

1. Progressiveness: the first results should be returned to the users almostinstantaneously, and the rest of the results should be returned progressivelygiven more time.

18

2. Completeness: given enough time, the algorithm should return the full sky-line completely.

3. Correctness: the algorithm should return objects that are definitely in thefinal skyline. That is, the algorithm should not return objects that will belater replaced.

4. Fairness: the algorithm should not favor objects that are particularly goodin one dimension.

5. Support user preferences: users are allowed to control the order of skylineobjects returned according to their preferences.

6. Universality : the algorithm should be applicable to skyline queries withdifferent dimensionality and dataset, but require no special data structures.

By using the R*-tree approach, PDS returns the first results almost instantlyand fulfills criteria (1). Since PDS stops only if it finds the terminating objectT , therefore objects that are belongs to the skyline (i.e., incomparable with T )must have at least one dimension better than T and must have been visitedonce before T is reached, thus criteria (2) also satisfied. Lemma 2 guaranteescriteria (3). Since processing queries on the web has to deal with restrictiveaccessing interfaces (e.g., sorted access and random access only), the orderof data retrieved from each source is fixed. Criteria (4) that is suitable forthe centralized environment like [12,4] is inapplicable to the problem we arestudying. As we will show in the next section, PDS is able to support userpreference queries like top-K skyline queries, thus criteria (5) is also fulfilled.Finally, PDS satisfies criteria (6) because it supports skyline queries that in-volve any number of dimensions. And it is based on common index structures(e.g., R*-tree) that support dynamic updates and independent to the datadistribution.

5 Extensions on PDS

Next we propose two interesting extensions on PDS. In particular, Section 5.1discusses how PDS can be applied for processing top-K skyline queries andSection 5.2 shows that PDS can support a query progress indicator.

5.1 Evaluation of top-K skyline queries

While skyline queries return a set of interesting objects, it is somehow possiblethat the set consists of thousands of objects. Therefore one may want to rank

19

the skyline objects according to a preference function f . Consider the runningexample on Figure 1, a user may be interested on the top-2 skyline hotelsthat minimize the function f : score(price, distance) = price + distance. Theoutput skyline hotels should be <f ,4>, <b, 6> in this order (the number indi-cates the score of a hotel according to f). Here, we restrict f to be monotonicincreasing with each attribute (as smaller (better) attribute values results ina more desirable value for f).

PDS can easily handle such queries by keeping a priority queue P to keep thetop-K objects along the skyline evaluation process. A skyline object s wouldbe inserted into P if s has a lower aggregate score than an object o in P .Assume that the preference function f is same as above. Let boundi be thevalue retrieved by the last sorted access on data source Di, i.e., the lower boundvalue of objects that have not yet seen in Di. Originally, PDS terminates whenan object has been probed by SA in all data sources. Now, PDS could alsoterminate if the k-th object in P has a score less than the sum of lower boundin all attributes, i.e., score(k) <

∑i boundi. It is because we know that the

scores of all unseen objects must larger than the score of the k-th object. Forthe same reason, an object o in P could be output as a top-K skyline oncescore(o) <

∑i boundi during the evaluation process. The performance of PDS

could be further optimized in top-K queries. In particular, for each newly SAretrieved object o, o can be discarded immediately if its score is higher thanthe score of the k-th object in P , i.e., score(o) > score(k). Note that we couldcalculate the score of o without probing its unknown attribute values RA byusing the value boundi instead.

In fact, the evaluation strategy of top-K skyline by PDS can be extended tothe case when only one source supports sorted access (denote as S-Source)and the rest supports random access (denote as R-Source) [3]. It is a realisticconcern because the web abound sources support random access only (e.g.,the MapQuest site 2 ). While it is obvious that all distributed algorithms mustscan the S-Source once in order to compute the full skyline under this model,we could evaluate top-k skyline instead of the full skyline. To evaluate top-Kskyline under this model, we could disable the rank-based heuristic of PDSand perform sorted access on the S-Source only.

5.2 Progress indicator

A useful application of the terminating object prediction method is to estimatethe amount of remaining query time, so that users can determine whether towait for the completion of the query, or to make a decision based on thecurrent set of results [13,14] (see Figure 10). The estimation is as follows:

2 http://www.mapquest.com

20

Fig. 10. The user interface for reporting the current querying progress

suppose object O is the current most probable terminating object with rank(confirmed by sorted accesses or approximated by linear-regression) Roi onsource Di. Also let Ri be the rank of the last object retrieved from Di by sortedaccess. The values

∑i Roi and

∑i Ri indicate the estimated total number of

sorted accesses required by the whole query and the number of sorted accessesalready performed respectively. We estimate the completed portion of thequery, p, by the following formula:

p =∑

i

min(Roi, Ri)/Roi

Given p, the estimated remaining query time can be derived from p easily.The use of min(Roi, Ri) in the numerator is to handle the case Ri > Roi,which happens when more than Roi sorted accesses have been performed atDi before O is identified as the most probable terminating object. Noticethat the formula considers sorted access only. Since in the latter stage somevalues retrieved by sorted accesses belong to objects that have already beenencountered, the number of random accesses per sorted access is expected todecrease with time. In other words, the formula is a conservative one. In generalthe value of p has an increasing trend. But decreases could occur locally, dueto wrongly estimated ranks or a change of the choice of the most probableterminating object. Smoothing techniques can be applied to eliminate thelocal fluctuations, such as reporting the average value of p in the last fewestimations instead of reporting p directly.

6 Experimental Evaluation

To evaluate the effectiveness and efficiency of PDS, we evaluated PDS againstthree distributed skylining algorithms under both real and synthetic datasets.

21

The first two algorithms are BDS (Section 3.1) and its variant IDS thatbased on distance optimization (Section 3.2). The last algorithm Optimal, isan optimal version of PDS, built assuming complete knowledge on the tar-get datasets. Optimal determines the best terminating object T by scanningthe whole dataset once, and then calculates the number of necessary sortedaccesses and random accesses according to T . Of course Optimal is not a prac-tical algorithm. It is only of theoretical interest and serves as a lower boundfor this problem.

In Section 6.1, we first evaluate PDS over a real data set. Then we study theperformance of PDS in terms of the number of source accesses under differentsettings in Section 6.2. Section 6.3 studies the progressive behavior of PDSand, finally Section 6.4 evaluates the effectiveness of our proposed progressindicator. The programming language used to implement all algorithms isJava. R-trees built are using a page size of 4Kbytes. Since the size of themain-memory buffer pool gives marginally impact on Skyline algorithms [1],we follow [1] and set the size of the main-memory buffer pool as 10 percent ofthe size of the database. A Pentium III Xeon 700MHz PC with 1Gbytes Ramis used for all experiments.

6.1 Real data experiments

To validate PDS on real data distributions, we performed experiments overthe Cover data set [15]. The real data set is used for predicting forest covertypes from cartographic variables and was employed by [16] to experimentallyevaluate top-K algorithms over web-accessible data sources. The data containsinformation about various wilderness areas. Specifically, we consider three at-tributes: Elevation (in meters), aspect (in degrees azimuth), and horizontaldistance to roadways (in meters). We extracted a database of 60,000 objectsfrom the Cover data set. The skyline results consist of 182 objects. Table 2shows that PDS returns the first 10 skyline objects almost instantaneously.Further, it shows that PDS returns the full skyline before both PDS and IDSreport the first results. PDS also outperforms BDS and IDS in terms of sourceaccesses and is close to Optimal. Now, we observe that PDS indeed outper-forms the existing approaches on real applications. In the sequel, we evaluatePDS against different settings.

6.2 Number of source accesses

Following the common methodology in the literature, we first study the num-ber of source accesses under independent and anti-correlated dataset with 3–5

22

BDS IDS PDS Optimal

Time for first 10 objects (sec) 46 172 <1 n/a

Completion time (sec) 46 172 43 n/a

Source accesses (K) 35.3 37.1 28.0 24.2

Table 2Real data set

(a) Uniform dataset (b) Non-uniform dataset

Dimension Independent Anti-correlated Random Denormalized

10K 50K 10K 50K 50K 50K

3 50 64 367 812 61 68

4 167 266 914 2341 234 277

5 430 792 1886 5166 764 751

Table 3Average number of skyline

data sources 3 , having 10K and 50K objects. Table 3a shows the sizes of theskylines for databases in different dimensions (# of attributes) and databasesizes under uniform distribution. We can see the size of the skyline increaseswith the number of dimensions but not the database size. In addition, the sizeof the skyline for anti-correlated database is larger than independent data-base. Note that we do not perform experiments on 2-d skyline query becauseevaluating 2-d skyline query under the web model is trivial. Essentially, theoptimal solution is a hybrid of BDS and PDS. That is, accesses the two websources in round-robin like BDS, and performs dominance checking as PDS. Inthis way, we could return skyline objects progressively by accessing minimalnumber of sources.

The number of source accesses as a function of dimensionality for independentand anti-correlated datasets with 10K objects are shown in Figure 11(a) andFigure 11(b). IDS and PDS access less number of sources than BDS in bothindependent and anti-correlated dataset. The result is similar when we scaleup the size of data set up to 50K objects (see Figure 12(a) and Figure 12(b)).PDS and IDS accessed a significant less number of data sources than BDS be-cause they could reach the terminating object earlier than BDS. The figuresalso show the performances of IDS and PDS are close to Optimal, showing thatPDS and IDS estimate a set of very good terminating objects under uniformdistribution. In Figure 11(b) and Figure 12(b), we observe that the improve-ment of PDS and IDS is less obvious when the dataset is anti-correlated. It is

3 It is commonly accepted that skyline query with high dimension (say 6) yields nosensible results [1,4,2]

23

0

5

10

15

20

25

3 4 5 Dimensionality

Avg

. no.

of s

ourc

e ac

cess

es (K

)

BDS (10K)

IDS (10K)

PDS (10K)

Optimal

(a) Independent

0

5

10

15

20

25

30

35

40

45

50

3 4 5 Dimensionality

Avg

. no.

of s

ourc

e ac

cess

es (K

)

BDS (10K)

IDS (10K)

PDS (10K)

Optimal

(b) Anti-Correlated

Fig. 11. Average number of source accesses under uniform dataset (10K objects)

because the final terminating object in anti-correlated datasets usually locatesaround the middle potion of each source and the terminating object would beeasier to be reached by BDS in a round-robin manner.

To illustrate the robustness of all algorithms under different situations, we gen-erate another dataset with dimensionality in the range [3,5] of 50K objects.However, the distribution of each dimension is randomly picked from the fol-lowing distributions: Uniform; Normal (Gaussian); and Negative Exponential.Figure 13(a) shows the experimental results. Note that IDS performs signifi-cant worse than BDS in this experiment. Since the skyline size of the randomdistribution dataset is similar to that of independent uniform distribution (seeTable 3), the degradation of IDS is solely due to its error in estimating thewrong terminating objects. The error is mainly due to IDS assumes the dataare uniform, which is not always true in practice. On the other hand, PDSstill closes to Optimal and outperforms BDS and IDS in this experiment. Al-though the linear regression model employed by PDS would also be affectedby the data distribution, it at least takes the data distribution into accountwhen estimating the terminating object and thus more stable than IDS.

We now present the performance of the three algorithms when the valuesof each attribute are not normalized to the same range in Figure 13(b). Weobserve that IDS accesses more sources than PDS because it assumes all at-

24

0

20

40

60

80

100

120

3 4 5 Dimensionality

Avg

. no.

of s

ourc

e ac

cess

es (K

)

BDS (50K)

IDS (50K)

PDS (50K)

Optimal

(a) Independent

0

50

100

150

200

250

3 4 5 Dimensionality

Avg

. no.

of s

ourc

e ac

cess

es (K

)

BDS (50K)

IDS (50K)

PDS (50K)

Optimal

(b) Anti-Correlated

Fig. 12. Average number of source accesses under uniform dataset (50K objects)

tributes are normalized to the same data range, e.g., [0,10]. Again, since theskyline size in this setting is close to that of the independent dataset (seeTable 3), the degradation of IDS is caused by imprecise prediction of termi-nating object when the attributes have different domain ranges. In contrast,PDS is unaffected by the domain because it imposes no such assumptions onthe attribute domain.

6.3 Progressive behavior

Next we evaluate the speed of the algorithms in returning the skyline pointsprogressively. Figure 14 shows the time of PDS on evaluating a 4-d skylinequery as a function of the percentage of points returned for an anti-correlateddataset with 50K objects. We present results in this setting merely becauseit yields a long-running query and could effectively demonstrate the powerof progressiveness. Figure 14 shows that PDS returns the first few skylineobjects almost instantaneously. It returns the skyline continuously and half ofthe skyline are returned within a minute. Furthermore, PDS outputs the entireskyline before both BDS and IDS report the first results. The speedup of PDSis brought by its heuristic and the R*-tree structure. First, the heuristic of PDSreduces the number of source accesses effectively and thus it could terminate

25

0

20

40

60

80

100

120

140

160

180

3 4 5 Dimensionality

Avg

. no.

of s

ourc

e ac

cess

es (K

)

BDS (50K)

IDS (50K)

PDS (50K)

Optimal

(a) Random Distribution

0

20

40

60

80

100

120

3 4 5 Dimensionality

Avg

. no.

of s

ourc

e ac

cess

es (K

)

BDS (50K)

IDS (50K)

PDS (50K)

Optimal

(b) Denormalized Domain

Fig. 13. Average number of source accesses under various dataset combinations

0

100

200

300

400

500

600

0 10 20 30 40 50 60 70 80 90 100

% of answers output

time

(s)

IDS

BDS

PDS

Fig. 14. Progressive behavior (4-d 50k objects, anti-correlated data)

earlier. Second, while BDS and IDS perform pairwise comparisons after theterminating object is found, PDS employs R*-tree to reduce the number ofcomparisons. From the experimental result of both real and synthetic data,we assert that the operations of R*-tree in PDS is much efficient than thepairwise comparisons approach of BDS and IDS. The running time of IDS islonger than BDS even though it accesses less number of sources than BDS (seeFigure 11(b)) because BDS could discard non-skyline objects in the secondphase by using the upper bound information as discussed in Section 3.

26

0

10

20

30

40

50

60

70

80

90

100

0 50 100 150 200 250 300

time (1000s)

% c

ompl

ete

estimated % complete

actual % complete

Fig. 15. Estimated completion % vs time (4-d 50k objects, anti-correlated data)

6.4 Effectiveness of progress indicator

Finally, we evaluate the effectiveness of our proposed progress indicator bycomparing the estimated completion percentage reported by PDS with theactual percentage. The experiment setting is same as the previous experiment.In Figure 15, the actual completion percentage of the query over time is shownby the dotted line and the completion percentage estimated by PDS is shownby the solid line. We can see that the estimation of PDS is fairly close to theactual percentage at the beginning and closer in the later stage. The unstableestimation of PDS at the early stage is due to inadequate data for the linearregression method to provide a good fitting equation. As long as more objectsare visited by PDS, the linear regression equation continues to refine andestimates the terminating object more precisely and yield a better result.

7 Conclusion

Skyline queries are important for database applications like decision supportand customer information systems. In this paper, we proposed PDS, a novelprogressive algorithm that computes skyline queries over web-accessible data-bases. Users pose skyline queries over web data sources by PDS and retrievethe results incrementally. As a result, users could make their decisions inreal-time. By making use of a simple heuristic that based on linear-regressionestimation and a common index structure R*-tree, PDS evaluates skyline ef-ficiently in terms of both number of accesses and computation time over awide range of data distribution and with various data sources. Furthermore,it does not require any pre-computation (the R*-tree is created on the fly) onthe remote data and also easily extensible to evaluate top-K skyline under theweb environment. In addition, PDS is user-friendly by supporting a progressindicator such that users know the headway of the query evaluation process.

27

There are several avenues for future work. First, we plan to improve the us-ability of PDS by allowing the users to barter between progressiveness andefficiency. It is because we found that the number of source accesses could besaved by reporting skyline points less frequent.

Second, distributed query processing on mobile devices has also drawn muchattention in recent years [17]. Thus, we plan to evaluate distributed skylineon devices with severe memory requirement (e.g., PDA and cell phone). Itwould be useful and convenient if users could pose skyline queries through cellphones and PDAs.

Finally, we will extend PDS to compute skyline from real-time stream databecause basically PDS is an real-time algorithm that accepts input data andoutput results progressively.

Acknowledgment

We would like to thank Benjamin Kao and Nikos Mamoulis for their helpfulcomments and suggestions. We would also like to thank Marios Hadjielefthe-riou for providing his code library for this project.

References

[1] S. Borzsonyi, D. Kossmann, K. Stocker, The skyline operator, in: Proc. of ICDE,2001, pp. 421–430.

[2] W.-T. Balke, U. Guntzer, J. X. Zheng, Efficient distributed skylining for webinformation systems, in: Proc. of Extending Database Technology (EDBT),2004, pp. 256–273.

[3] N. Bruno, L. Gravano, A. Marian, Evaluating top-k queries over web-accessibledatabases, in: Proc. of ICDE, 2002, pp. 369–382.

[4] D. Kossmann, F. Ramsak, S. Rost, Shooting stars in the sky: An onlinealgorithm for skyline queries, in: Proc. of VLDB, 2002, pp. 275–286.

[5] D. Papadias, Y. Tao, G. Fu, B. Seeger, An optimal and progressive algorithmfor skyline queries, in: Proc. of ACM SIGMOD, 2003, pp. 467–478.

[6] K.-L. Tan, P.-K. Eng, B. C. Ooi, Efficient progressive skyline computation, in:Proc. of VLDB, 2001, pp. 301–310.

[7] J. Matousek, Computing dominances in En, Information Processing Letters38 (5).

28

[8] D. H. McLain, Drawing contours from arbitrary data points, The ComputerJournal 17 (4) (1974) 318–324.

[9] R. Steuer, Multiple criteria optimization, Wiley, 1986.

[10] J. Chomicki, P. Godfrey, J. Gryz, D. Liang, Skyline with presorting, in: Proc.of ICDE, 2003, pp. 717–816.

[11] N. Beckmann, H.-P. Kriegel, R. Schneider, B. Seeger, The R*-tree: An efficientand robust access method for points and rectangles, in: Proc. of ACM SIGMOD,1990, pp. 322–331.

[12] J. M. Hellerstein, R. Avnur, A. Chou, C. Hidber, C. Olston, V. Raman, T. Roth,P. J. Haas, Interactive data analysis: The control project, IEEE Computer 32 (8)(1999) 51–59.

[13] G. Luo, J. F. Naughton, C. J. Ellmann, M. W. Watzke, Toward a progressindicator for database queries, in: Proc. of ACM SIGMOD, 2004, pp. 791–802.

[14] S. Chaudhuri, V. R. Narasayya, R. Ramamurthy, Estimating progress of longrunning sql queries, in: Proc. of ACM SIGMOD, 2004, pp. 803–814.

[15] C. Blake, C. Merz, UCI repository of machine learning databases (1998).URL http://www.ics.uci.edu/∼mlearn/MLRepository.html

[16] A. Marian, N. Bruno, L. Gravano, Evaluating top-k queries over web-accessibledatabases, ACM Transaction on Database System 29 (2) (2004) 319–362.

[17] E. Lo, N. Mamoulis, D. W. Cheung, W. S. Ho, Processing ad-hoc joins on mobiledevices, in: Proc. of Database and Expert Systems Applications (DEXA), 2004.

29