
Customized Answers to Summary Queries via Aggregate Views

Francesco M. Malvestuto
Computer Science Department, “La Sapienza” University of Rome, Italy
[email protected]

Elaheh Pourabbas
IASI “Antonio Ruberti”, National Research Council, Rome, Italy
[email protected]

Abstract

Statistical users typically require summary tables and want fast and accurate answers to their queries. Usually, the query system keeps materialized aggregate views to speed up the evaluation of summary queries. If the summary table on the variable of interest to a statistical user is not derivable from the set of materialized aggregate views, the answer to his query will consist of an estimate and, if the user is a domain expert, he would like to participate in the estimation process. Therefore, he should be left the possibility of “tuning” the response to an auxiliary variable, for which either there is a materialized aggregate view or aggregate data can be externally provided by the user himself. In this framework, we solve the computational problems related to the estimation of summary queries, and propose efficient algorithms which make use of notions and techniques developed in the theory of acyclic database schemes.

1. Introduction

Traditional query processing deals with computing exact answers by possibly minimizing response time and maximizing throughput. However, a recent querying paradigm, called On-Line Analytical Processing (OLAP) [6], often involves complex queries over very large multidimensional relations (“data cubes”) with category (or dimensional) and measure (or summary) attributes, so that the computation of exact answers may require a huge amount of time and resources. As OLAP queries mainly deal with operations of aggregation (e.g., addition) of measure values on dimension ranges, an interesting approach to improving performance is to store some aggregate data and to query them rather than the original data, thus obtaining approximate answers. This approach is very useful when database users want fast answers without being forced to wait a long time to get exact answers, whose precision often is not necessary. The issue of estimating sums by never accessing the original data but only consulting aggregate data has very recently started receiving a great deal of attention, since it has several applications also in query optimization, data warehousing, and data integration [5].

In this paper, we address the problem of evaluating a summary query from a summary database, that is, a set of materialized views which consist of summary tables on the same variable. Summary tables are called, for the sake of brevity, “tables” and the measures associated with them are referred to as “variables”. We assume that every summary query is formulated by a statistical user and asks for the distribution of a variable of interest, called target variable (e.g., US Population 2001), by a set of category attributes (e.g., gender, state, ...). Suppose that the query system is endowed with a summary database containing not only tables on the target variable but also a list of auxiliary variables (e.g., US Population 2000, US Total-Income 2001), which are correlated in “some sense” to the target variable. If the requested distribution is not derivable from the summary database, the answer to the query will consist of an estimate, and the user is left the possibility of selecting an auxiliary variable in order to possibly refine the estimation result. If the user selects no auxiliary variable, then the answer to his query will be computed by the query system using its own estimation algorithm (see the two examples below). Otherwise, depending on whether the type of the selected auxiliary variable is or is not the same as the target variable (e.g., US Population 2000 and US Population 2001 are of the same type, but US Total-Income 2001 and US Population 2001 are not), two different estimation methods will be applied by the query system. Both methods are based on a “proportionality” criterion, and are described by the two examples below. The first example is borrowed from [15] and deals with a method of interpolation, called “linear indirect estimation”, known in the literature as small area estimation [9]. An indirect estimator uses data from surveys designed to produce estimates of the target variable at the national or regional level, and to obtain comparable estimates at more geographically disaggregated levels such as counties.

Example 1. A database contains information about a population of employees described by attributes such as gender, age, state, Income, etc. A (statistical) user asks for the distribution of employees by gender, age-class and state. Assume that the database system has materialized two aggregate views: one is the distribution p(gs) of employees by gender and state, and the other is the distribution q(ga) of the Total-Income by gender and age-class. Then, the query system informs the user that he may obtain a fast answer to his query in an approximate way and that he may tune the query evaluation to some auxiliary variable, e.g., Total-Income. If the user selects no auxiliary variable, his query will be answered by issuing the distribution p(gs)/N, where N is the number of age-classes. If the user selects Total-Income as auxiliary variable, the answer to his query will be the distribution p(gs) q(ga) / Σ_a q(ga).
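To make the two answers concrete, the following sketch computes both estimates (uniform spreading over age-classes vs. tuning to Total-Income); the domain sizes and the values of p(gs) and q(ga) are invented purely for illustration.

```python
import numpy as np

# Hypothetical data: 2 genders, 3 states, 4 age-classes (values invented).
p_gs = np.array([[0.10, 0.25, 0.15],        # p(g,s): employees by gender and state
                 [0.20, 0.10, 0.20]])
q_ga = np.array([[0.05, 0.15, 0.20, 0.10],  # q(g,a): Total-Income by gender and age-class
                 [0.10, 0.15, 0.15, 0.10]])
n_age = q_ga.shape[1]

# No auxiliary variable: spread p(g,s) uniformly over the N age-classes.
est_uniform = p_gs[:, :, None] / n_age                   # shape (g, s, a)

# Total-Income as auxiliary variable: allocate p(g,s) proportionally to
# the income share q(g,a) / sum_a q(g,a) within each gender.
shares = q_ga / q_ga.sum(axis=1, keepdims=True)          # shape (g, a)
est_income = p_gs[:, :, None] * shares[:, None, :]       # shape (g, s, a)

# Both estimates reproduce the materialized (g,s) marginal, namely p(g,s).
assert np.allclose(est_uniform.sum(axis=2), p_gs)
assert np.allclose(est_income.sum(axis=2), p_gs)
```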

The next example is a classical problem which arises in the analysis of frequency tables [8] as well as in the economic analysis of interindustry transactions [10], [16], [1].

Example 2. A database contains information about the population of a country in the years 2000 and 2001. A (statistical) user asks for the distribution of the population in 2001 by gender and state. Assume that the database system has materialized two aggregate views on the population in 2001: one is the distribution p1(g) by gender, and the other is the distribution p2(s) by state. Moreover, the database system also keeps other materialized aggregate views, one of which is the distribution q(gs) of the population in 2000 by gender and state. Then, the query system informs the user that he may obtain a fast answer to his query in an approximate way and that he may drive the query evaluation using an auxiliary variable to be selected from a list which contains the distribution q(gs) of the Population 2000 by gender and state. If the user selects no auxiliary variable, his query will be answered by issuing the distribution p1(g) p2(s) / N, where N = Σ_g p1(g) = Σ_s p2(s). Otherwise, if he selects Population 2000 as auxiliary variable, then the answer will consist of the estimate p(gs) of the distribution under the assumption that p(gs) is related “biproportionally” to q(gs) [1], [8]. Explicitly, p(gs) will be computed using the Deming-Stephan algorithm [8] as the limit of the distribution sequence p[0], p[1], p[2], . . . where:

p[0](gs) = q(gs)
p[1](gs) = ( p1(g) / Σ_s p[0](gs) ) p[0](gs)
p[2](gs) = ( p2(s) / Σ_g p[1](gs) ) p[1](gs)
p[3](gs) = ( p1(g) / Σ_s p[2](gs) ) p[2](gs)
p[4](gs) = ( p2(s) / Σ_g p[3](gs) ) p[3](gs)
. . .
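A minimal sketch of the Deming-Stephan iteration for this two-way case, alternately fitting the gender and state marginals starting from the seed q(gs); the marginals p1, p2 and the seed q below are invented numbers, chosen only so that p1 and p2 have equal totals.

```python
import numpy as np

def deming_stephan(q, p1, p2, n_cycles=50):
    """Biproportional fit: rescale q(g,s) until its row marginal is p1(g)
    and its column marginal is p2(s)."""
    p = q.astype(float).copy()
    for _ in range(n_cycles):
        p *= (p1 / p.sum(axis=1))[:, None]   # fit the gender marginal
        p *= (p2 / p.sum(axis=0))[None, :]   # fit the state marginal
    return p

# Invented data: Population 2000 by gender and state, and the 2001 marginals.
q  = np.array([[30.0, 50.0, 20.0],
               [25.0, 55.0, 25.0]])
p1 = np.array([102.0, 108.0])          # Population 2001 by gender
p2 = np.array([58.0, 104.0, 48.0])     # Population 2001 by state

p = deming_stephan(q, p1, p2)
assert np.allclose(p.sum(axis=1), p1) and np.allclose(p.sum(axis=0), p2)
```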

In this paper, we address the multi-proportional versions of the two above mentioned estimation models. We call them MEM1 and MEM2, and we solve their related computational problems, referred to as P1 and P2, by giving algorithms which combine the “divide-and-conquer” principle with computations on “join trees” borrowed from the theory of acyclic database schemes [4]. The paper is structured as follows. In the next section, we state the estimation models MEM1 and MEM2 and the computational problems P1 and P2. In Section 3, we recall the standard method, known as the Iterative Proportional Fitting Procedure (IPFP), used to solve P1 and P2. In Section 4, we review an efficient implementation of the IPFP for solving problem P1 given in [2]. In Section 5, we generalize it in view of solving problem P2. Finally, Section 6 contains some remarks and future research directions.

2. Statement of the problem

Henceforth, we only consider variables (also called “summary attributes” or “measure attributes”) of nonnegative-real type and of additive nature. Given such a variable A, let X be a set of (category) attributes. The domain of X, written dom(X), is the set of all semantically possible tuples on X; by size(X) we denote the cardinality of dom(X). Let p(x) be a nonnegative real-valued function defined on dom(X); the support of p(x) is the relation with scheme X containing all tuples x with p(x) ≠ 0. The couple T = 〈X, p(x)〉 defines a (summary) table on A, of which X is the scheme. Henceforth, without loss of generality, we always assume that the data reported in a table are normalized to one. Let T = 〈X, p(x)〉 be a table on A, and let Y be a subset of X. The marginal of p(x) with respect to Y is p(y) = Σ_x p(x), the summation being extended over all tuples x in dom(X) whose restrictions to Y coincide with y. We also admit the case Y = Ø; then, the marginal of p(x) with respect to Y is the unity. Note that the support of the marginal p(y) of p(x) with respect to Y is the projection onto Y of the support of p(x). By the marginal of T with respect to a subset Y of X, written T[Y], we mean the table 〈Y, p(y)〉. A set of tables T = {T1, ..., Tn} is consistent if there exists at least one table T with scheme X, where X is the union of the schemes of the tables Ti, such that the marginal of T with respect to the scheme of Ti coincides with Ti, for all i. Such a table is called a universal table of T. Moreover, the set X and the collection of schemes of the tables Ti will be referred to as the universal set of attributes in T and the scheme of T, respectively. Let T = {T1, ..., Tn} be a summary database on the target variable A, where Ti = 〈Xi, pi(xi)〉, and let 〈Y, q(y)〉 be the auxiliary table. Henceforth, as is natural, we assume that T is consistent. Let X be the universal set of attributes in T. Consider the following two multi-proportional estimation models, where p(x, y) denotes an unknown distribution with scheme X ∪ Y.

MEM1
Marginal constraints: p(xi) = pi(xi), i = 1, . . . , n.
Proportionality criterion: Let Z = X ∩ Y. There exist real-valued functions f1(x1), . . . , fn(xn) such that the factorization p(x, y) = f1(x1) · · · fn(xn) q(y)/q(z) holds for every tuple (x, y) in the support of p(x, y).

MEM2
Marginal constraints: p(xi) = pi(xi), i = 1, . . . , n.
Proportionality criterion: There exist real-valued functions f1(x1), . . . , fn(xn) such that the factorization p(x, y) = f1(x1) · · · fn(xn) q(y) holds for every tuple (x, y) in the support of p(x, y).

First of all, observe that MEM2 need not have a solution. To see it, let Z = X ∩ Y. Then, the proportionality criterion implies that, for every tuple z in the support of the marginal p(z) of p(x, y), p(z) = f(z) q(z), where f(z) is the marginal with respect to Z of the function f(x) = f1(x1) f2(x2) . . . fn(xn). Therefore, no solution of MEM2 exists if, for every distribution p(x, y) that satisfies the marginal constraints, there is a tuple z with q(z) = 0 and p(z) ≠ 0. In mathematical language, the condition that the support of q(z) contains the support of p(z) is expressed by saying that p(z) is absolutely continuous with respect to q(z) [7]. So, we have:

Fact 2.1 If p(x, y) is a solution to MEM2, then its marginal p(z) is absolutely continuous with respect to q(z).

We will reconsider the existence issue in the next section. In the rest of this section, we assume that both MEM1 and MEM2 admit at least one solution. We shall see in the next section that, if this is the case, their solutions are unique. We denote them by p̄(x, y) and p̂(x, y), respectively. At this point, we can state the two estimation problems, which we mentioned in the previous section.

Problem 1 (P1) Given a summary database T on the target variable and a table on an auxiliary variable, let X be the universal set of attributes in T, and let Y be the scheme of the auxiliary table. Find the solution p̄(x, y) of MEM1.

Problem 2 (P2) Given a summary database T on the target variable and a table on an auxiliary variable, let X be the universal set of attributes in T, and let Y be the scheme of the auxiliary table. Find the solution p̂(x, y) of MEM2.

We now give some useful formulas for solving problems P1 and P2. Let Z = X ∩ Y, and Y′ = Y − X. Then, by summing out y′ in the functional expressions of p̄(x, y) and p̂(x, y), we obtain

p̄(x) = f1(x1) . . . fn(xn)   (1)

for every tuple x in the support of p̄(x), and

p̂(x) = f1(x1) . . . fn(xn) q(z)   (2)

for every tuple x in the support of p̂(x). Formulas (1) and (2) lead to the following expressions for the solutions to MEM1 and MEM2:

p̄(x, y) = p̄(x) q(y) / q(z)   (3)

p̂(x, y) = p̂(x) q(y) / q(z)   (4)

Suppose that we know how to compute the distributions p̄(x) and p̂(x). Then, the procedure below solves problems P1 and P2, respectively.

Procedure
a) Compute the distribution p̄(x) (respectively, p̂(x)).
b) Find the marginal of q(y) with respect to Z.
c) Compute p̄(x, y) (respectively, p̂(x, y)) using formula (3) (respectively, formula (4)).
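Steps b and c of the Procedure amount to one marginalization and one pointwise product; a small sketch with tables held as dicts keyed by attribute-value tuples (the helper names and the attribute handling are illustrative, not taken from the paper):

```python
from collections import defaultdict

def marginalize(table, scheme, onto):
    """Sum a table {value-tuple: prob} with attribute list `scheme`
    down to the attributes in `onto`."""
    idx = [scheme.index(a) for a in onto]
    out = defaultdict(float)
    for t, v in table.items():
        out[tuple(t[i] for i in idx)] += v
    return dict(out)

def extend_with_auxiliary(p_x, x_scheme, q_y, y_scheme):
    """Formulas (3)/(4): p(x, y) = p(x) * q(y) / q(z), with Z = X ∩ Y."""
    z = [a for a in x_scheme if a in y_scheme]           # the common attributes Z
    q_z = marginalize(q_y, y_scheme, z)                  # step b
    answer = {}
    for x, px in p_x.items():                            # step c
        zx = tuple(x[x_scheme.index(a)] for a in z)
        for y, qy in q_y.items():
            if tuple(y[y_scheme.index(a)] for a in z) == zx and q_z.get(zx, 0) > 0:
                y_extra = tuple(v for a, v in zip(y_scheme, y) if a not in x_scheme)
                answer[x + y_extra] = px * qy / q_z[zx]
    return answer
```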

From a computational point of view, Steps b and c of the Procedure are a matter of routine; therefore, we focus on Step a. It should be noted that from equations (1) and (2) it follows that, if in MEM2 the set Z is empty or is contained in the scheme of some table in T, or if q(z) is a uniform distribution, then p̂(x) = p̄(x). Therefore, Problem 1 is a special case of Problem 2.

We shall show that the distributions p̄(x) and p̂(x) can be computed using a popular method, called the Iterative Proportional Fitting Procedure (IPFP) in the statistical literature [3], which is the generalized version of the Deming-Stephan algorithm and will be recalled in the next section. So, in order to solve problems P1 and P2, we can apply the IPFP. However, we can do better since, as shown in Sections 4 and 5, in most cases tree and local computation are viable.


3 The IPFP

Let T = {T1, ..., Tn} be our summary database, where Ti = 〈Xi, pi(xi)〉, and let X be the universal set of attributes in T. Let q(y) be the distribution of the auxiliary variable with scheme Y, and let q(z) be the marginal of q(y) with respect to Z = X ∩ Y. Formula (2) can be re-written in an equivalent way as

p̂(x) = f1(x1) . . . fn(xn) q(z) / size(X − Z)   (5)

for some real-valued functions f1(x1), . . . , fn(xn). So, p̂(x) is of the type

f1(x1) . . . fn(xn) p0(x)   (6)

Proposition 3.1 [7] For an arbitrary distribution p0(x), a necessary and sufficient condition for the existence of a universal table of T whose distribution is of type (6) is that there is a universal table of T whose distribution is absolutely continuous with respect to p0(x). If this is the case, then the universal table of T whose distribution is of type (6) is uniquely determined by the information-theoretic principle of minimum cross-entropy and its distribution can be computed as the limit of the sequence of distributions p[0](x), p[1](x), . . . where p[0](x) = p0(x) and, for each r > 0, p[r](x) is calculated using the following iterative procedure (IPFP):

first iteration cycle: p[1](x), . . . , p[n](x)
second iteration cycle: p[n+1](x), . . . , p[2n](x)
. . .
h-th iteration cycle: p[hn+1](x), . . . , p[hn+n](x)
. . .

where p[r](x), for r = hn + i with h ≥ 0 and 1 ≤ i ≤ n, is obtained from p[r−1](x) and from the distribution pi(xi) of the base table Ti as follows:

p[r](x) = ( pi(xi) / p[r−1](xi) ) p[r−1](x).
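A minimal sketch of this (blind) IPFP over dict-based tables; the schemes, the starting distribution p0 and the stopping rule (a fixed number of cycles instead of a convergence test) are illustrative assumptions.

```python
from collections import defaultdict
from itertools import product

def marginal(p, scheme, onto):
    """Marginal of the distribution p (dict: value-tuple -> prob) onto `onto`."""
    idx = [scheme.index(a) for a in onto]
    out = defaultdict(float)
    for t, v in p.items():
        out[tuple(t[i] for i in idx)] += v
    return out

def ipfp(base_tables, x_scheme, p0, n_cycles=100):
    """base_tables: list of (scheme_i, p_i) marginal constraints; p0: starting
    distribution on x_scheme. Each step rescales p to match one constraint."""
    p = dict(p0)
    for _ in range(n_cycles):
        for scheme_i, p_i in base_tables:
            idx = [x_scheme.index(a) for a in scheme_i]
            m = marginal(p, x_scheme, scheme_i)
            for t in p:
                key = tuple(t[i] for i in idx)
                p[t] *= p_i.get(key, 0.0) / m[key] if m[key] > 0 else 0.0
    return p

# Tiny illustration: X = (A, B), one constraint on A and one on B (invented numbers).
domA, domB = ("a0", "a1"), ("b0", "b1", "b2")
p0 = {t: 1.0 / (len(domA) * len(domB)) for t in product(domA, domB)}
pA = {("a0",): 0.4, ("a1",): 0.6}
pB = {("b0",): 0.2, ("b1",): 0.5, ("b2",): 0.3}
p_bar = ipfp([(("A",), pA), (("B",), pB)], ("A", "B"), p0)
```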

The existence condition stated in Proposition 3.1 can be checked by solving the following linear-programming problem:

minimize Σ_{x ∉ S} p(x)

subject to the constraints

p(xi) = pi(xi) (i = 1, . . . , n),
p(x) ≥ 0 for all x ∈ dom(X)

where S is the support of p0(x). The minimum of the objective function is zero if and only if there is a universal table of T whose distribution is absolutely continuous with respect to p0(x). We now state a sufficient condition, expressed in relational algebra, which is easier to test. To achieve this, it is sufficient to observe that the support of the distribution of every universal table of T is contained in the join of the supports of the distributions of the tables in T. Therefore, by Proposition 3.1, we have:

Lemma 3.1 For an arbitrary distribution p0(x), if the support of p0(x) contains the join of the supports of the distributions of the tables in T, then the universal table of T whose distribution is of type (6) exists and is unique.

We call such a universal table of T the minimum-cross-entropy universal table of T relative to p0(x). Accordingly, by (5), the table T̂ = 〈X, p̂(x)〉 is the minimum-cross-entropy universal table of T relative to the distribution q(z)/size(X−Z). Note that, if the set Z is empty or is contained in the scheme of some table in T, then minimizing the cross-entropy is the same as maximizing the entropy (see Section A of the Appendix), and T̂ coincides with the table T̄ = 〈X, p̄(x)〉, which is called the maximum-entropy universal table of T [11], [12], [5]. By Lemma 3.1, the maximum-entropy universal table T̄ always exists (since T was assumed to be consistent), but the minimum-cross-entropy universal table T̂ need not exist, and we now state a relational-algebraic condition, following from Lemma 3.1, which is sufficient for its existence.

Theorem 3.1 The minimum-cross-entropy universal table T̂ of T relative to q(z)/size(X−Z) exists if the support of q(z) contains the projection onto Z of the join of the supports of the distributions of the tables in T.

Before closing this section, we note that the i-th iteration in each cycle of the IPFP requires the execution of size(X) additions, size(X) multiplications and size(Xi) divisions so that, if R is the scheme of the summary database T, each cycle requires the execution of |R| size(X) additions, |R| size(X) multiplications and size(R) divisions, where size(R) = Σ_i size(Xi). To sum up, each iteration cycle requires O(|R| size(X) + size(R)) elementary operations. In Section 5, we shall present an efficient implementation of the IPFP for computing the distribution p̂(x). It is the generalization of the procedure, reported in Section 4 and borrowed from [2], for computing the distribution p̄(x) of the maximum-entropy table of T.


4 Computing the maximum-entropy universal table

In this section, we review the procedure for computing the distribution of the maximum-entropy universal table of T given in [2]. It combines a “tree-implementation” of the IPFP with an application of the principle of divide-and-conquer. Both techniques are based on the following propositions, where collections of subsets of a set of attributes are viewed as hypergraphs. Basic notions of hypergraph theory, as well as the definitions of an acyclic hypergraph and of a join tree of an acyclic hypergraph, are recalled in Section B of the Appendix.

Proposition 4.1 [11] Let X be a set of attributes, H a hypergraph with vertex set X and F a cover of H. For every distribution p(x), the maximum-entropy universal table of the table set {〈A, p(a)〉 : A ∈ H} is also the maximum-entropy universal table of the table set {〈B, p(b)〉 : B ∈ F}.

Proposition 4.2 [12] Let X be a set of attributes, and J a join tree of a connected, acyclic hypergraph with vertex set X. Let {A1, . . . , Am} be the set of the labels of nodes of J and {B1, . . . , Bk} the set of the labels of arcs of J. For each h, 1 ≤ h ≤ k, let dh be the number of the arcs of J that are labelled by Bh. For every distribution p(x), the distribution of the maximum-entropy universal table of the table set T = {〈Aj, p(aj)〉 : j = 1, . . . , m} has the following closed-form expression:

∏_{j=1,...,m} p(aj) / ∏_{h=1,...,k} [p(bh)]^dh
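The closed-form expression can be evaluated pointwise from the node and arc marginals; a small sketch under the same dict-based table representation as above (an arc label occurring on dh arcs is simply listed dh times):

```python
def joint_from_join_tree(node_tables, arc_tables, x_scheme, x_tuple):
    """Evaluate prod_j p(a_j) / prod_h [p(b_h)]^{d_h} at the tuple x_tuple.
    node_tables / arc_tables: lists of (scheme, dict) marginal tables."""
    def restrict(scheme):
        return tuple(x_tuple[x_scheme.index(a)] for a in scheme)
    num = 1.0
    for scheme, table in node_tables:
        num *= table.get(restrict(scheme), 0.0)
    den = 1.0
    for scheme, table in arc_tables:
        den *= table.get(restrict(scheme), 0.0)
    return num / den if den > 0 else 0.0
```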

We re-phrase Proposition 4.2 as follows. Let us weight each node j of J by the table 〈Aj, p(aj)〉, and each arc (j, l) of J by the table 〈Bh, p(bh)〉, where Bh is the label of (j, l). The resulting weighting of J is said to be induced by p(x). Let us denote by α the weighting of J induced by p(x); we call (J, α) a table tree associated with p(x). We can summarize the contents of Proposition 4.2 by saying that the distribution of the maximum-entropy universal table of the table set T is the joint distribution of the table tree (J, α). Consider now our summary database T = {T1, . . . , Tn}, where Ti = 〈Xi, pi(xi)〉. Let X be the universal set of attributes in T and let R be the scheme of T. Henceforth, without loss of generality, we assume that R is a connected hypergraph. Combining Propositions 4.1 and 4.2, we have that the distribution p̄(x) of the maximum-entropy universal table T̄ = 〈X, p̄(x)〉 of T is the joint distribution of the table tree (J, α), where J is a join tree of an acyclic cover of R and α is the weighting of J induced by p̄(x). The key point is that, given J, the weighting of J can be found without passing through the computation of p̄(x). This is trivially true if R is acyclic [11], [12]. If R is not acyclic, α can be obtained by applying the following tree implementation of the IPFP [2], henceforth referred to as tree-IPFP, with input (T, J, α0), where α0 is the weighting of J induced by the uniform distribution 1/size(X).

Tree-IPFP
Input: T, J, α0
Output: α
1) Set α := α0.
2) Until convergence is attained, repeat:
   For each i = 1, . . . , n do:
   begin
      UPDATE(J, α, Xi, pi(xi)) and set α := α′
   end

Update Procedure
Input: J, α, Xi, pi(xi)
Output: α′
1) Mark each arc of J.
2) Select a node j∗ of J such that the label Aj∗ of j∗ contains Xi. Set
      p(xi) = Σ_{Aj∗ − Xi} p(aj∗);
      p′(aj∗) = ( pi(xi) / p(xi) ) p(aj∗);
      α′(j∗) = 〈Aj∗, p′(aj∗)〉;
3) Perform a traversal of J (starting from j∗). During the traversal of J, when an edge (j, l) is traversed from j to l, do
   - let Bh be the label of (j, l);
   - if (j, l) is marked, then do:
         p′(bh) = Σ_{Aj − Bh} p′(aj);
         α′(j, l) = 〈Bh, p′(bh)〉;
         g(bh) = p′(bh) / p(bh);
         unmark (j, l);
   - set p′(al) = g(bh) p(al);
   - set α′(l) = 〈Al, p′(al)〉.
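A compact sketch of one UPDATE call, with the join tree given as an adjacency list and tables as dicts keyed by attribute-value tuples; since each arc of a tree is traversed exactly once when propagating outward from j∗, the marking bookkeeping is folded into a single breadth-first pass. All helper names are illustrative and not taken from [2].

```python
from collections import defaultdict, deque

def marginal(table, scheme, onto):
    """Sum a dict-table {value-tuple: prob} with attribute list `scheme` onto `onto`."""
    idx = [scheme.index(a) for a in onto]
    out = defaultdict(float)
    for t, v in table.items():
        out[tuple(t[i] for i in idx)] += v
    return dict(out)

def rescale(table, scheme, onto, new_marg, old_marg):
    """Multiply each entry by new_marg/old_marg of its restriction to `onto`."""
    idx = [scheme.index(a) for a in onto]
    out = {}
    for t, v in table.items():
        key = tuple(t[i] for i in idx)
        out[t] = v * new_marg.get(key, 0.0) / old_marg[key] if old_marg.get(key, 0.0) > 0 else 0.0
    return out

def update(tree, node_scheme, arc_label, alpha_nodes, alpha_arcs, Xi, pi):
    """One UPDATE(J, alpha, Xi, pi(xi)) call of the tree-IPFP.
    tree: adjacency dict {node: [neighbours]}; arc_label[frozenset((j,l))] = Bh;
    alpha_nodes[j] / alpha_arcs[arc]: the current node and arc tables."""
    # Step 2: rescale the table of a node j* whose label contains Xi.
    j_star = next(j for j, A in node_scheme.items() if set(Xi) <= set(A))
    old_xi = marginal(alpha_nodes[j_star], node_scheme[j_star], Xi)
    alpha_nodes[j_star] = rescale(alpha_nodes[j_star], node_scheme[j_star], Xi, pi, old_xi)
    # Step 3: propagate the change outward from j* (breadth-first traversal of J).
    queue, seen = deque([j_star]), {j_star}
    while queue:
        j = queue.popleft()
        for l in tree[j]:
            if l in seen:
                continue
            seen.add(l)
            arc = frozenset((j, l))
            B = arc_label[arc]
            new_b = marginal(alpha_nodes[j], node_scheme[j], B)   # p'(bh)
            old_b = alpha_arcs[arc]                               # p(bh), the current arc weight
            alpha_nodes[l] = rescale(alpha_nodes[l], node_scheme[l], B, new_b, old_b)
            alpha_arcs[arc] = new_b
            queue.append(l)
    return alpha_nodes, alpha_arcs
```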

Example 3. Let R = {AB, AF, BC, CD, CG, DH, DI, EF, FG, GH, HI} be the scheme of the summary database T = {T1, . . . , T11} (see Figure 1).

Figure 1. The summary database T


By taking an acyclic cover H = {ABCFG, CDGH, DHI, EF} of R, we have the join tree J of H shown in Figure 2 (nodes 1–4 labelled EF, ABCFG, CDGH, DHI, with arc labels F, CG, DH).

Figure 2. The join tree J

Let us distinguish the following four cases.

Case 1: if Xi is AB, BC, CG, FG, AF, then the join tree J is traversed as shown in Figure 3(1);

Case 2: if Xi is EF, then the join tree J is traversed as shown in Figure 3(2);

Case 3: if Xi is CD, DH, GH, CG, then the join tree J is traversed as shown in Figure 3(3);

Case 4: if Xi is DH, HI, DI, then the join tree J is traversed as shown in Figure 3(4).

Figure 3. Traversals of the join tree J

To sum up, we have the following proposition.

Proposition 4.3 [2] Let H be an acyclic cover of the scheme R of our summary database T. Let J be a join tree of H, and α0 the weighting of J induced by the distribution 1/size(X). Then, the distribution of the maximum-entropy universal table T̄ of T is the joint distribution of the table tree (J, α), where α is the weighting of J obtained by applying the tree-IPFP to (T, J, α0).

The cost of each iterative cycle of the tree-IPFP is O(|R| size(H) + size(R)) and, whenever size(H) < size(X), the procedure is more efficient than the blind implementation of the IPFP whose cost, as we said, is O(|R| size(X) + size(R)).

However, we can do better since, with an appropriate choice of the acyclic cover H of R, we can apply the principle of divide-and-conquer. As stated below (see Proposition 4.4), the best choice for H is given by the so-called “compaction” of R [13], [14], whose definition is now recalled. Let R be a connected hypergraph. Two vertices of R are tightly connected if they are separated by no partial edge. Sets of pairwise tightly connected vertices of R are called compacts. Of course, each edge is a compact of R. The compact components of R are the subhypergraphs of R induced by maximal compacts, and the compaction of R is the cover of R whose edges are its maximal compacts.

The compaction of R has a number of nice properties [13], [14]: the compaction of R is acyclic; R is acyclic if and only if R coincides with its compaction; in every join tree of the compaction of R, the arc labels are all partial edges of R; and the compaction of R can be computed in polynomial time. Finally, given a connected hypergraph R whose vertex set is compact, by a fill-in cover of R we mean an acyclic hypergraph F obtained as follows (a sketch of this construction is given after the list):

(1) Construct the 2-section [R]2 of R.

(2) Triangulate [R]2 with a zero fill-in algorithm [17].

(3) Set F to the clique hypergraph of the resulting chordal graph.
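A minimal sketch of steps (1)–(3) using networkx; its heuristic triangulation routine stands in for the zero fill-in algorithm of [17], so the returned cover may contain more fill-in than strictly necessary.

```python
from itertools import combinations
import networkx as nx

def fill_in_cover(hypergraph_edges):
    """hypergraph_edges: iterable of attribute sets (the edges of R).
    Returns the clique hypergraph of a triangulation of the 2-section of R."""
    # (1) 2-section: make every pair of attributes occurring together adjacent.
    g = nx.Graph()
    for edge in hypergraph_edges:
        g.add_nodes_from(edge)
        g.add_edges_from(combinations(edge, 2))
    # (2) Triangulate (heuristic; not guaranteed to be a zero fill-in).
    chordal, _ = nx.complete_to_chordal_graph(g)
    # (3) Clique hypergraph of the resulting chordal graph.
    return [set(c) for c in nx.chordal_graph_cliques(chordal)]

# The compact component R[ABCFG] of Example 3 below.
print(fill_in_cover([{"A", "B"}, {"A", "F"}, {"B", "C"}, {"C", "G"}, {"F", "G"}]))
```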

Proposition 4.4 [14] Let R be the scheme of our summary database T. The compaction of R is the minimal (with respect to covering) acyclic cover H of R such that, for each edge A of H, the marginal p̄(a) of p̄(x) with respect to A can be obtained by applying the IPFP to the table set T[A] = {T1[A], . . . , Tn[A]} with p[0](a) = 1/size(A).

A table set such as T[A] will be referred to as the projection of the summary database T onto A. So, by Proposition 4.3, p̄(x) is the joint distribution of the table tree (J, α) and, by Proposition 4.4, α can be determined with the following algorithm [2], where H = {A1, ..., Am} is the compaction of R, and J is a join tree of H.


ALGORITHM 1
1) For each arc (j, l) of J, find a minimum-size edge Xi of R containing the label of (j, l) and set α(j, l) to the marginal of the table Ti in T with respect to the label of (j, l).
2) For each node j of J such that its label Aj is an edge of R, say Xi, set α(j) = Ti.
3) For each node j of J such that its label Aj is not an edge of R, do:
(3.1) find a fill-in cover Fj of the compact component R[Aj] of R;
(3.2) find a join tree Lj of Fj;
(3.3) construct a table tree (Lj, βj), where βj is obtained by applying the tree-IPFP to (T[Aj], Lj, βj,0), βj,0 being the weighting of Lj induced by the distribution 1/size(Aj);
(3.4) set α(j) to the table with scheme Aj having as its distribution the joint distribution of the table tree (Lj, βj).

Example 3. (Continued) The compaction of R is H = {ABCFG, CDGH, DHI, EF}. The join tree J of H was shown in Figure 2. The table tree (J, α) is constructed using ALGORITHM 1 and assuming size(A) ≤ . . . ≤ size(I). The weights α(1, 2), α(2, 3) and α(3, 4) are set to T4[F], T5 and T6, respectively. The weight α(1) is set to T8. The weight α(2) is computed locally (see Step 3 of ALGORITHM 1). The compact component of R corresponding to the label of the node 2 is R[ABCFG] = {AB, AF, BC, CG, FG}. A fill-in cover of R[ABCFG] is F2 = {ABF, BCF, CFG} and the join tree L2 of F2 is shown in Figure 4 (nodes 1, 2, 3 labelled ABF, BCF, CFG, with arc labels BF and CF).

Figure 4. The join tree L2

Given L2 and the projection of T onto ABCFG, T[ABCFG] = {T1, T2, T4, T5, T9}, the table tree (L2, β2) is constructed, where β2 is obtained by applying the tree-IPFP to (T[ABCFG], L2, β2,0), β2,0 being the weighting of L2 induced by the distribution 1/size(ABCFG). Let

β2(1) = 〈ABF, p(abf)〉   β2(2) = 〈BCF, p(bcf)〉   β2(3) = 〈CFG, p(cfg)〉
β2(1, 2) = 〈BF, p(bf)〉   β2(2, 3) = 〈CF, p(cf)〉

At this point, the weight α(2) is set to the table with scheme ABCFG whose distribution is the joint distribution of the table tree (L2, β2), that is,

α(2) = 〈 ABCFG, p(abf) p(bcf) p(cfg) / ( p(bf) p(cf) ) 〉.

Also the weight α(3) is computed locally. The compact component of R corresponding to the label of the node 3 is R[CDGH] = {CD, CG, DH, GH}. A fill-in cover of R[CDGH] is F3 = {CDG, DGH} and the join tree L3 of F3 is shown in Figure 5 (nodes 1 and 2 labelled CDG and DGH, with arc label DG).

Figure 5. The join tree L3

Given L3 and the projection of T onto CDGH, T[CDGH] = {T3, T5, T6, T10}, the table tree (L3, β3) is constructed, where β3 is obtained by applying the tree-IPFP to (T[CDGH], L3, β3,0), β3,0 being the weighting of L3 induced by the distribution 1/size(CDGH). Let

β3(1) = 〈CDG, p(cdg)〉   β3(2) = 〈DGH, p(dgh)〉
β3(1, 2) = 〈DG, p(dg)〉

At this point, the weight α(3) is set to the table with scheme CDGH whose distribution is the joint distribution of the table tree (L3, β3), that is,

α(3) = 〈 CDGH, p(cdg) p(dgh) / p(dg) 〉.

Also the weight α(4) is computed locally. The compact component of R corresponding to the label of the node 4 is R[DHI] = {DH, DI, HI}. Its 2-section is chordal, so that its fill-in cover is F4 = {DHI} and the join tree L4 of F4, consisting of the single node 1 labelled DHI, is illustrated in Figure 6.

Figure 6. The join tree L4

Given L4 and the projection of T onto DHI, T[DHI] = {T6, T7, T11}, the table tree (L4, β4) is constructed, where β4 is obtained by applying the tree-IPFP to (T[DHI], L4, β4,0), β4,0 being the weighting of L4 induced by the distribution 1/size(DHI). Let β4(1) = 〈DHI, p(dhi)〉. At this point, the weight α(4) is set to β4(1).


Finally, the distribution p̄(abcdefghi) is taken to be the joint distribution of (J, α), that is,

p̄(abcdefghi) = p8(ef) p(abf) p(bcf) p(cfg) p(cdg) p(dgh) p(dhi) / ( p4(f) p(bf) p(cf) p(dg) p5(cg) p6(dh) )

5 Computing the minimum-cross-entropy universal table

In this section, we present a procedure for computing the distribution of the minimum-cross-entropy universal table T̂ = 〈X, p̂(x)〉 of T relative to the distribution q(z)/size(X−Z). Henceforth, we assume that Z is not a partial edge of the scheme of T; otherwise, as we said above, p̂(x) = p̄(x) and we can apply ALGORITHM 1. We first consider the minimum-cross-entropy universal table of T relative to an arbitrary distribution p0(x) and, then, discuss the case p0(x) = q(z)/size(X−Z). The following result, whose proof is given in Section C of the Appendix, generalizes Proposition 4.3.

Lemma 5.1 Let p(x) be the distribution of the minimum-cross-entropy universal table of T relative to p0(x). Let H be an acyclic cover of R, J a join tree of H, and α0 the weighting of J induced by the distribution p0(x). If p0(x) is the joint distribution of (J, α0), then p(x) is the joint distribution of the table tree (J, α), where α is the weighting of J obtained by applying the tree-IPFP to (T, J, α0).

By Lemma 5.1, we obtain an effective procedure for computing the distribution p̂(x). It is sufficient to observe that, for every acyclic cover H of R such that Z is a partial edge of H, the distribution p0(x) = q(z)/size(X−Z) is the distribution of the maximum-entropy universal table of the table set {〈A, p0(a)〉 : A ∈ H} and, hence, is the joint distribution of the table tree (J, α0), where J is a join tree of H and α0 is the weighting of J induced by p0(x).

Theorem 5.1 Let H be an acyclic cover of R ∪ {Z}. Let J be a join tree of H, and α0 the weighting of J induced by the distribution q(z)/size(X−Z). Then, the distribution p̂(x) is the joint distribution of the table tree (J, α), where α is the weighting of J obtained by applying the tree-IPFP to (T, J, α0).

We can do better. Let X0 = Z and T0 = T̂[Z]. Consider the (partially specified) table set T0 = {T0, T1, . . . , Tn} with scheme R0 = R ∪ {X0}. Since T̂ is a universal table of T, T̂ is a universal table of T0 too; furthermore, we can re-write (2) as

p̂(x) = f0(x0) f1(x1) . . . fn(xn) (1/size(X))

where f0(x0) = q(z) size(X). So, T̂ is also the maximum-entropy universal table of T0 and we can apply the divide-and-conquer principle as follows. Let H = {A1, . . . , Am, Am+1, . . . , As} be the compaction of R0, where A1, . . . , Am are the maximal compacts of R0 for which the set Z ∩ Aj is empty or a partial edge of R. Let A0 = ⋃_{j=m+1,...,s} Aj. Note that Z is contained in A0.

The hypergraph H0 = {A0, A1, . . . , Am} is a connected, acyclic cover of R0. Let J0 be a join tree of H0 with node set {0, 1, ..., m}, and let {B1, . . . , Bk} be the set of arc labels in J0. Then, p̂(x) is the joint distribution of the table tree (J0, α), where α is the weighting of J0 induced by p̂(x), that is,

p̂(x) = p̂(a0) ∏_{j=1,...,m} p̂(aj) / ∏_{h=1,...,k} [p̂(bh)]^dh.

By Proposition 4.4, each distribution p̂(aj), with 1 ≤ j ≤ m, equals the distribution obtained by applying the IPFP to T0[Aj] = T[Aj] ∪ {T0[Aj]} with initial distribution 1/size(Aj); but, since the table T0[Aj] is redundant in the table set T0[Aj], we have T0[Aj] = T[Aj], so that p̂(aj) coincides with p̄(aj) and can be computed from T[Aj] (see Steps 2 and 3 of ALGORITHM 1). Likewise, the distributions p̂(bh), 1 ≤ h ≤ k, can be computed from T since each Bh is a partial edge of R. So, except for the weight α(0) of the node 0 (which is labelled by A0), the weighting of J0 can be obtained exactly in the way ALGORITHM 1 computes α. The weight α(0) is computed using the following algorithm, with input the compact components R0[Am+1], . . . , R0[As] of R0 and the projection T[A0] of T onto A0.

ALGORITHM 2
1) For each j, m + 1 ≤ j ≤ s, find a fill-in cover Fj of R0[Aj].
2) Let F0 = ⋃_{j=m+1,...,s} Fj. Find a join tree L0 of F0.
3) Construct the table tree (L0, β0), where β0 is obtained by applying the tree-IPFP to (T[A0], L0, β0,0), β0,0 being the weighting of L0 induced by the distribution q(z)/size(A0 − Z);
4) Set α(0) to the table with scheme A0 having as its distribution the joint distribution of the table tree (L0, β0).

Once the table tree (J0, α) has been constructed, p̂(x) can be computed as the joint distribution of (J0, α).

Theorem 5.2 The procedure above correctly computes the distribution p̂(x).

Example 4. Consider again the summary database T, whose scheme is R = {AB, AF, BC, CD, CG, DH, DI, EF, FG, GH, HI}.


Let Z = BEF. Then R0 = R ∪ {Z} = {AB, AF, BC, CD, CG, DH, DI, BEF, FG, GH, HI} and its compaction is H = {ABF, BCFG, BEF, CDGH, DHI}. For each edge of H except CDGH and DHI, the intersection with Z is neither empty nor a partial edge of R. Let A0 = ABCEFG and H0 = {ABCEFG, CDGH, DHI}. The join tree J0 of H0 is shown in Figure 7 (nodes 0, 1, 2 labelled ABCEFG, CDGH, DHI, with arc labels CG and DH).

Figure 7. The join tree J0

The weights α(0, 1) and α(1, 2) are set to T5 and T6, respectively. The weights α(1) and α(2) are computed locally, that is,

α(1) = 〈 CDGH, p(cdg) p(dgh) / p(dg) 〉
α(2) = 〈DHI, p(dhi)〉

The weight α(0) is computed using ALGORITHM 2, with input the three compact components of R0 corresponding to ABF, BCFG and BEF, and the projection T[A0] of T onto A0. Explicitly, we have R0[ABF] = {AB, AF, BF}, R0[BCFG] = {BC, BF, CG, FG} and R0[BEF] = {BEF}, and T[A0] = {T1, T2, T4, T5, T8, T9}. By taking the union of the fill-in covers of R0[ABF], R0[BCFG] and R0[BEF], we find F0 = {ABF, BEF, BCF, CFG}. A join tree L0 of F0 is shown in Figure 8 (nodes 1–4 labelled ABF, BEF, BCF, CFG, with arc labels BF, BF, CF).

Figure 8. The join tree L0

Construct the table tree (L0, β0), where β0 is the weighting of L0 obtained by applying the tree-IPFP to (T[A0], L0, β0,0), β0,0 being the weighting of L0 induced by the distribution q(bef)/size(ACG). At this point, set α(0) to the table with scheme A0 whose distribution is the joint distribution of the table tree (L0, β0), that is,

p(abcefg) = p(abf) p(bcf) p(bef) p(cfg) / ( [p(bf)]^2 p(cf) ).

Finally, we can compute p̂(abcdefghi) as the joint distribution of (J0, α). We obtain

p̂(abcdefghi) = p(abf) p(bcf) p(bef) p(cfg) p(cdg) p(dgh) p(dhi) / ( [p(bf)]^2 p(cf) p(dg) p5(cg) p6(dh) ).

6 Conclusions

In the previous sections, we considered the following scenario. A statistical user asks for a table on a target variable and wishes a quick answer. Typically, the answer can be computed by the query system in an approximate way using the target summary database and an estimation model. If the statistical user wishes to participate in the estimation process, the query system leaves him the possibility of introducing an auxiliary variable to compute the query answer. We discussed the case in which the data on the auxiliary variable are reported in a single table. In the general case, the data on the auxiliary variable are spread over an auxiliary summary database. Future research will address the problem of answering queries in such a general case by combining the target summary database with the auxiliary summary database.

References

[1] M. Bacharach. Biproportional Matrices and Input-Output Change. University Press, Cambridge, 1970.

[2] J.-H. Badsberg and F. M. Malvestuto. An implementation of the iterative proportional fitting procedure by propagation trees. Computational Statistics and Data Analysis, 37:297–322, 2001.

[3] Y. M. M. Bishop, S. E. Fienberg, and P. W. Holland. Discrete Multivariate Analysis. MIT Press, 1975.

[4] C. Beeri, R. Fagin, D. Maier, and M. Yannakakis. On the desirability of acyclic database schemes. Journal of the ACM, 30:479–513, 1983.

[5] C. Faloutsos, H. V. Jagadish, and N. D. Sidiropoulos. Recovering information from summary data. Proceedings of the 23rd VLDB Conference, pages 36–45, 1997.

[6] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26:65–74, 1997.

[7] I. Csiszar. I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, 3:146–158, 1975.

[8] W. E. Deming and F. F. Stephan. On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Annals of Mathematical Statistics, 11:427–444, 1940.

[9] M. Ghosh and J. N. K. Rao. Small area estimation: An appraisal. Statistical Science, 9:55–93, 1994.

[10] W. W. Leontief and A. Strout. Multiregional input-output analysis. In T. Barna (Ed.), Structural Interdependence and Economic Development, pages 119–169, 1963.


[11] F. M. Malvestuto. Answering queries in categorical databases. Proc. of the ACM Symposium on Principles of Database Systems, pages 87–96, 1987.

[12] F. M. Malvestuto. A universal table model for categorical databases. Information Sciences, 49:203–223, 1989.

[13] F. M. Malvestuto and M. Moscarini. A fast algorithm for query optimization in universal-relation databases. Journal of Computer and System Sciences, 56:299–309, 1998.

[14] F. M. Malvestuto and M. Moscarini. Decomposition of a hypergraph by partial-edge separators. Theoretical Computer Science, 237:57–79, 2000.

[15] E. Pourabbas and A. Shoshani. Answering joint queries from multiple aggregate OLAP databases. Lecture Notes in Computer Science (Y. Kambayashi, M. Mohania, W. Wöß, Eds.), 2737:24–34, 2003.

[16] R. Stone and A. Brown. A Computable Model for Economic Growth, A Programme for Growth No. 1. Chapman and Hall, London, 1962.

[17] R. E. Tarjan and M. Yannakakis. Simple linear-time algorithms to test chordality of graphs, test acyclicity of hypergraphs, and selectively reduce acyclic hypergraphs. SIAM Journal on Computing, 13:566–579, 1984.

APPENDIX

Section A

The entropy of a distribution p(x) is the non-negative functional

H[p] = −Σ_x p(x) log(p(x))

the summation being extended over all tuples x in the support of p(x). It is well known that H[p] is always less than or equal to log size(X). Given a distribution p0(x) with respect to which p(x) is absolutely continuous, the cross-entropy (or “I-divergence” or “discrimination information” or “Kullback-Leibler distance”) between p(x) and p0(x) is the nonnegative functional

D[p, p0] = Σ_x p(x) log( p(x) / p0(x) )

the summation being extended over all tuples x in the support of p(x). It is well known that D[p, p0] = 0 if and only if p(x) = p0(x). For p0(x) = q(z)/size(X−Z), we have

D[p, p0] = log(size(X−Z)) − H[p] − Σ_z p(z) log(q(z)).

So, if Z is a (possibly empty) subset of the scheme of some table in T, say Xi, then

Σ_z p(z) log(q(z)) = Σ_z pi(z) log(q(z)) = const

and, hence, minimizing D[p, p0] is the same as maximizing H[p].
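For reference, a direct computation of H[p] and D[p, p0] over dict-based distributions, with the summations running over the support of p as in the definitions above:

```python
import math

def entropy(p):
    """H[p] = -sum_x p(x) log p(x), taken over the support of p."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def cross_entropy(p, p0):
    """D[p, p0] = sum_x p(x) log(p(x)/p0(x)); requires p absolutely continuous w.r.t. p0."""
    return sum(v * math.log(v / p0[x]) for x, v in p.items() if v > 0)
```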

Section B

A hypergraph with vertex set X is a nonempty collection H of nonempty subsets of X, which are called edges of H [4] and whose union recovers X. A partial edge of H is a set of vertices that is contained in some edge of H. A cover of H is a hypergraph S with vertex set X such that each edge of H is a partial edge of S. The two-section of H, denoted by [H]2, is the simple graph with node set X where two nodes are adjacent if and only if they appear together in some edge of H. A hypergraph H is acyclic [4] if each clique (i.e., each nonempty set of pairwise adjacent nodes) of [H]2 is a partial edge of H, and [H]2 is a chordal graph. Several other equivalent definitions of acyclicity exist [4]. We shall make use of the tree representation of a connected, acyclic hypergraph, which is called a “join tree” [4].

Let us assume that H is a connected hypergraph. Let G(H) be the intersection graph of H; that is, the nodes of G(H) correspond one-to-one to and are labelled by the edges of H, and two distinct nodes of G(H) are joined by an arc if their labels have a nonempty intersection. Finally, if (i, j) is an arc of G(H) and Ai and Aj are the labels of the nodes i and j, then the arc (i, j) is labelled by Ai ∩ Aj. A spanning tree J of G(H) is a join tree of H if, for every two nodes of J, the intersection of their labels is contained in the label of each node along the (unique) path in J that connects the two nodes. Then, H is acyclic if and only if there exists a join tree of H [4]. Such a join tree can be found by taking a maximum-weight spanning tree of G(H) after weighting each arc by the cardinality of its label.
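A minimal sketch of this construction with networkx; if the input hypergraph is not acyclic, the returned maximum-weight spanning tree need not satisfy the join-tree property, so it should be checked separately.

```python
import networkx as nx

def join_tree(edges):
    """edges: list of attribute sets (the hypergraph edges).
    Returns a maximum-weight spanning tree of the intersection graph,
    with each arc weighted by |Ai ∩ Aj| and labelled by Ai ∩ Aj."""
    g = nx.Graph()
    g.add_nodes_from(range(len(edges)))
    for i in range(len(edges)):
        for j in range(i + 1, len(edges)):
            common = edges[i] & edges[j]
            if common:
                g.add_edge(i, j, weight=len(common), label=common)
    return nx.maximum_spanning_tree(g, weight="weight")

# The acyclic cover H of Example 3: recovers the path EF - ABCFG - CDGH - DHI.
H = [set("EF"), set("ABCFG"), set("CDGH"), set("DHI")]
J = join_tree(H)
print(sorted(J.edges(data="label")))
```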

Section C

Proof. By Proposition 3.1, p(x) is the limit of the sequence of distributions p[0](x), p[1](x), . . . computed by the IPFP with p[0](x) = p0(x). Let (J, αr) be the r-th table tree, r ≥ 0, constructed by the tree-IPFP, and let πr(x) be the joint distribution of (J, αr). Then, for each r, r > 0, we have

πr(x) = ( pi(xi) / πr−1(xi) ) πr−1(x)

for some i, 1 ≤ i ≤ n. If p0(x) is the joint distribution of (J, α0), then not only p0(x) = π0(x) but also p[r](x) = πr(x) for each r, r ≥ 0, so that p(x) is the joint distribution of (J, α).
