Data Warehouse Clustering on the Web


European Journal of Operational Research 160 (2005) 353–364

www.elsevier.com/locate/dsw


Aristides Triantafillakis *, Panagiotis Kanellis, Drakoulis Martakos

Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, University Campus, Athens 15771, Greece

* Corresponding author. E-mail address: [email protected] (A. Triantafillakis).

Received 5 June 2002; accepted 24 July 2003

Available online 9 December 2003

Abstract

In collaborative e-commerce environments, interoperation is a prerequisite for data warehouses that are physically scattered along the value chain. Adopting system and information quality as success variables, we argue that what is required for data warehouse refreshment in this context is inherently more complex than the materialized view maintenance problem, and we offer an approach that addresses refreshment in a federation of data warehouses. Defining a special kind of materialized views, we propose an open multi-agent architecture for their incremental maintenance while considering referential integrity constraints on source data.

© 2003 Elsevier B.V. All rights reserved. doi:10.1016/j.ejor.2003.07.012

Keywords: Decision support systems; Multi-agent systems; Data warehouse refreshment; Materialized view maintenance

1. Introduction

A base requirement for the success of a data warehouse is the ability to provide decision makers with both accurate and timely information (information quality) as well as fast query response times (system quality) [3]. Common methods used in practice for achieving those are largely focused on the concept of materialized views (MVs), where a business question (i.e. query) is answered more quickly against the MV than by accessing the base data sources directly [10], which may potentially involve time-demanding operations such as large-table scans and joins. However, inevitable updates at the data sources introduce the information quality problem: how to keep the MVs at a certain level of consistency with the source data when update transactions take place [13,16]. In short, system and information quality are bound together: MVs can improve query performance if we can manage to update the views consistently [4].

Collaborative electronic commerce (Ce-commerce) simply augments this challenge, because these data sources are not only internal, as they largely were a mere few years ago. For example, the emergence of business communities in the form of Business-to-Business (B2B) exchanges means that the boundaries of organizations are more fluid than they used to be [14]. Ce-commerce poses new challenges to the MV maintenance problem, as 'extended enterprises' have to integrate far more data originating outside the organization into a single repository. A recent paper [7] by Hammer examines this trend by felicitously pointing out that the streamlining of cross-company processes "... is the next great frontier for reducing costs, enhancing quality, and speeding operations" (p. 83). In this way multiple enterprises within a shared market segment can collaboratively plan and manage the flow of goods, services, and information along the supply chain in a way that increases customer-perceived value and at the same time optimizes internal efficiencies [14].

Fig. 1. A generic workflow for DWR (adapted from [2]). [Figure: external and before/after events drive a loading sub-process (data extraction, data cleansing) followed by data integration, update propagation and customization steps.]

What this means for data warehouse information and system quality is that we should start viewing Data Warehouse Refreshment (DWR) as an operational process that must provide explicit support for cross-enterprise collaboration in terms of changing business constraints. Therefore, DWR should not only be limited to MV maintenance in the context of a single warehouse (as we are accustomed to), but should also support refreshment in a federation of data warehouses. This, in turn, translates to providing a new set of algorithms and techniques to materialize views from source data that may reside in remote sites, and to incrementally maintain these views. One should also note that separate DWR processes in separate data warehouses augment this challenge, because there might exist different maintenance policies on the MVs of interest for each DWR process. To the best of our knowledge, previous work on view maintenance has mainly considered the case of SPJ (Select-Project-Join) views in a single warehouse, while not providing insights for a data warehousing solution in such an environment. In short, the problem we address in this paper can be stated as follows: how to maintain MVs in environments where a data warehouse utilizes data from other data warehouses.

Considering system and information quality as success variables [5], we argue in the next section that for collaborative e-commerce there is more to DWR than the MV maintenance problem. In Section 3 we present an agent-based framework based on the eXtensible Markup Language (XML) standard for the incremental maintenance of a special kind of MVs, and in Section 4 we clarify the limitations of the empirical part of our research, providing at the same time some directions for future work.

2. Issues and challenges for data warehouse refreshment in collaborative e-commerce environments

In the case of a single data warehouse, DWR can be viewed as a process that involves a hierarchy of data stores accommodating a range from source data to highly aggregated data [2,16]. The Operational Data Store (ODS) stores source data in a uniform and clean representation, whilst the Corporate Data Warehouse (CDW) stores aggregate views of interest, or in other words, MVs. This hierarchy of data stores is a logical way to represent the flow of information, which goes from the data sources to the MVs.

DWR is a complex process composed of both asynchronous and parallel tasks, of which the main task is to capture the differential changes in the sources and to propagate them through this hierarchy of data stores in the data warehouse. Fig. 1 presents a generic workflow for the refreshment process, where for simplicity we have not considered archiving.

The information flow begins with the loading sub-process, which 'feeds' the data warehouse and is composed of the data extraction and data cleansing steps. Other sub-processes include data integration, update propagation and customization, which propagate the summarized data to the data marts. In the remainder of this section we will demonstrate why DWR in a federation of data warehouses is a complex issue and poses challenges that go beyond those normally associated with simple view maintenance.

An integrated value system, which may include a number of data warehouses, requires a solution where the system of each participating business in the chain should be able to communicate with all others. In such a context this communication translates into customizing and querying remote MVs in each participating business that contain information of interest to the collaborative-business workflow. The autonomy of the source warehouses augments this challenge, as we also have to take into account the different maintenance policies of the MVs of each participating warehouse. In such environments, we define an MV as a hyper-view, because it provides a higher level of aggregation and consolidation of internal as well as external data sources. Also, the data sources may be MVs and not solely base tables, as we are accustomed to. Fig. 2 depicts this situation, where for simplicity we have considered two enterprise systems, namely Site1 and Site2.

Hyper-views are dealt with in the customization sub-process of the DWR process depicted in Fig. 1 and may have various forms. For example, they may be defined as a union over the corresponding MVs of the participating warehouses, or as an aggregate query that provides decision makers with a higher level of consolidated information. In the former case a hyper-view is defined as

    HV = \bigcup_{i=1}^{n} MV_i,                              (1)

while in the latter case a hyper-view is defined as

    HV = MV_1 \bowtie MV_2 \bowtie \cdots \bowtie MV_n,       (2)

with (MV_1, MV_2, ..., MV_n) being the MVs of the n participating warehouses.

Fig. 2. Data warehouse architecture in Collaborative E-Commerce. [Figure: at each of Site1 and Site2, data sources pass through extraction/cleaning and integration into an Operational Data Store; each Corporate Data Warehouse holds MVs and highly aggregated data, consolidated into hyper-views.]

We should underline that a hyper-view augments the complexity quotient even in its simplest form, a hyper-view defined as a union over remote MVs. This is not as trivial as it may initially seem and is only partially addressed in practice by creating an MV in each database that does the aggregation for that database and then combining the results. It is obvious that the result set is not stored in a central repository, so the union-based view is not materialized but recomputed every time. Consequently, a hyper-view that stores union-based result sets has greater performance, which translates to better system quality. For decision makers along the collaborative business workflow, the information and system quality requirements dictate MV consistency and fast query response times.
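As an illustration of this workaround, consider the following sketch in SQL Server 2000 T-SQL. The database names Site1DW and Site2DW and the view name C_ALL are hypothetical, and C_MV stands for the per-database aggregation view (a concrete definition appears later, in Fig. 5). Because the union view cannot itself be indexed, its result set is recomputed at every access:

Create view C_ALL as
(
    -- each database materializes its own aggregation (an indexed view)...
    select p_name, availqty from Site1DW.dbo.C_MV
    union all
    -- ...but the combined result set is never stored: it is recomputed
    -- every time C_ALL is queried
    select p_name, availqty from Site2DW.dbo.C_MV
)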

We stated that for extended enterprises a number of processes can be identified, integrated and shared along the value chain, and then managed as a single common process without regard to physical and technological corporate boundaries (different technologies that may affect system quality) and with much less overhead and error in the information provided to the end user (information quality). To exemplify this, consider the simplified workflow for e-procurement presented in Fig. 3 (forget the critical path for now), where the supply-chain cycle begins with a customer's order.

Fig. 3. A workflow in a supply chain. [Figure: a customer order flows through the manufacturer's Corporate Data Warehouse and MVs (Site1) to vendors, the finance department and distributors (Site2); the Verify Status, Get Bids and Order sub-processes form the critical path, with solid lines showing information flow and database transactions.]

The manufacturer passes the order through the usual intra-enterprise activities and turns to outside support from suppliers and other providers of goods and services that are needed to make the product. The information exchange pertains to such matters as requests for quote, bids and purchase orders, order confirmations, invoices and payment confirmation. With respect to this example, assume that Site1 and Site2 have the 'Order Fulfillment' business process in common, which in turn is decomposed into the 'Verify Status', 'Get Bids' and 'Order' sub-processes. The 'Verify Status' process depends on 'Get Bids', which in turn depends on the 'Order' sub-process. Within the boundaries of a single enterprise, considering these sub-processes as MVs means that a delay in any step explicitly affects the other processes, which in turn affects the whole business workflow in terms of efficiency. In the case of an extended enterprise the obvious interdependencies augment this problem due to the necessary querying of the hyper-views that contain aggregations of interest consolidated from MVs of the participating warehouses. A delay in any step may also affect the other workflows that provide links in the value chain. Therefore, we can think of these three sub-processes (or MVs) as belonging to a critical path (Fig. 3), i.e., the sub-processes that the extended enterprises share for the collaborative-business workflow, and infer the following lemmas:

Lemma 1 (information quality). There is a critical path along the value chain that requires complete consistency.

Lemma 2 (information quality). There is a business transaction along the critical path that requires complete consistency.

Lemma 3 (system quality). The critical path is the bottleneck of the workflow and may explicitly affect system quality.

Lemma 4 (system and information quality). Efficient DWR along the critical path is the dependent success variable for effective decision support.

Lemmas 1 and 2 refer to information quality in terms of requiring complete consistency. In particular, Lemma 2 states that the three aforementioned processes should belong to a business transaction [14], in analogy to a database transaction, that groups both transactional as well as non-transactional processes together in a unit of work that reflects the semantics and behavior of its underlying business tasks. Lemma 3 refers to system quality in terms of fast query response times in conjunction with adopting different access frequency constraints and different maintenance policies and consistency requirements for different views according to the needs of the decision makers. Finally, Lemma 4 simply states that considering all the parameters that affect the refreshment process (i.e. efficiency) should provide the decision makers with the intended result (i.e. effective decision support).

Hence, from the above discussion we can infer that the DWR process is a complex and event-driven system that needs constant monitoring and evolves frequently, following the evolution of data sources and user requirements. It requires the ability to define different scenarios depending on user requirements, source constraints and data warehouse constraints. However, most of the prior work on warehouse refreshment deals with the problem of maintaining SPJ-type MVs incrementally in the case of a single warehouse. Existing algorithms found in the literature [1,6,9,11,13,15-17] should be revisited to take into account specific constraints such as the freshness of data, the space limitation of the ODS or CDW, referential integrity constraints on source data, and the access frequency to sources that users, data warehouse administrators and data source administrators may impose. In particular, the Strobe algorithm [16] has the potential for infinite waits, requiring quiescence of the sources to update the corresponding views. It also ensures strong consistency, but not complete consistency, since it incorporates the effects of several updates collectively. In the case of the ECA algorithm [15] the data warehouse model is restricted in that the number of data sources is limited to a single data source. It is obvious that in a Ce-commerce environment quiescence is unlikely (Lemma 1) and transactions involve multiple entities (Lemma 2). Although the SWEEP algorithm [1] guarantees complete consistency and has message complexity linear in the number of data sources for processing an update, it does not consider access frequency constraints to the data sources (Lemma 3). Again, the approaches described in [11,13] are restricted to a single warehouse (Lemma 3). Although the authors in [9] use key and referential integrity constraints in order to minimize auxiliary data and make a class of views self-maintainable, the class of views considered does not include aggregation. An architecture for multi-view consistency is proposed in [17], using an integrator, view managers and a merge process that collects changes to the views, holds them until all affected views can be modified together, and then forwards all of the views' changes to the warehouse in a single warehouse transaction; the authors provide algorithms for the merge process to decide when to hold and when to forward actions (Lemma 4). A work that is close to ours is described in [6], where the authors propose incremental view maintenance algorithms for data warehouses defined on multi-relation information sources. However, they do not explicitly refer to a warehousing solution in a Ce-commerce environment. Finally, the SDCC framework introduced in [18] provides a solution to the problem of concurrent view definition and evolution, but the authors make the assumption that each information source has only one relation, which is deemed unrealistic in the defined Ce-commerce environment (Lemma 2).

Regarding commercial database products, the issues defined above have not been adequately addressed either. For example, an MV in Microsoft's SQL Server 2000 (called an indexed view) must not reference any other views, but only base tables, and all base tables referenced by the view must be in the same database as the view. Besides, UNION is not allowed in an indexed view. A workaround to these issues is to create an indexed view in each database that does the aggregation for that database and then combine the results. Thus, the combined result set is not materialized, so we lose stored result sets, one of the fundamental properties of the data warehouse.

In summary, the case of a federation of data warehouses that are physically scattered along integrated value chains over the web necessitates an architecture where interoperation is a prerequisite, in terms of highly aggregated MVs that depend in turn on the remote MVs of the participating enterprises. Prior academic work on DWR has mainly considered the incremental maintenance of SPJ-type MVs in the case of a single warehouse, whereas in commercial products UNION-based views cannot be materialized.

Consequently, we believe that a different approach is needed that addresses the incremental MV maintenance problem in Ce-commerce type environments. In the next section we propose such an approach and architecture for the incremental maintenance of union-based hyper-views stored at the central data warehouse that depend on simple SPJG (Select-Project-Join-Group by) MVs of the participating data warehouses. We also consider the SUM function as a motivating example and treat the hyper-view as a table.

3. System architecture

The proposed architecture builds upon and extends the works of [6,8,9,17], contributing in the process a novel approach for DWR in Ce-commerce environments. In particular, we use a log L of the changes (base table deltas), modeled by installing triggers on MVs. The Agent [8] and Integrator [6,8,17] entities are responsible for the incremental maintenance of the hyper-views.

Specifically, the proposed architecture is applicable in cases where an appropriate MV already exists in a participating data warehouse, i.e. there is a complete rewriting of the original query over an MV. Thus, we pump data directly from the existing MV. We model the MV of each data warehouse as a table and install triggers on the base tables that maintain the MV incrementally. We also install triggers on the MV (considering it as a table) that replicate the changes to a temporary table at the participating data warehouse (the delta table, DR). This is because triggers on indexed views are not 'fired' as part of the RDBMS's internal view maintenance process. Moreover, updates to the HV are handled as deletions followed by insertions. In this approach we need one delta table at the data warehouse for every MV to be monitored.

3.1. Definitions

We refer to the data warehouse in this environment as a hyper-warehouse (HW), a super-set of the participating data warehouses with the hyper-view as a union over the corresponding MVs (MV_1, MV_2, ..., MV_n) of the n participating warehouses (Eq. (1)). Thereby, a hyper-view is a special kind of MV whose view definition is the super-set of the definition of a corresponding MV, together with an extra field that indicates the warehouse origin of each tuple. We also utilize agent technology [6,17] for the interoperation of the data warehouses: agent software can be installed on top of an RDBMS in every participating data warehouse, transmitting the base table deltas to an integrator entity at the HW, which is responsible for maintaining the hyper-view. The child-agent entity pumps data directly from the existing MV. At the next stage, an algorithm executed by an agent incrementally maintains the hyper-view at the HW by analyzing the base table deltas at the DR and posting the appropriate queries to the hyper-view, where updates are handled as deletions followed by insertions. In addition, we use the schema from the TPCD-1GB reference schema [12] and assume that schema mapping is possible using the architecture proposed in [8].

Fig. 4. Overall system architecture. [Figure: agents (Agent1 ... AgentN) mediate between the data sources at Site1 ... Siten and the Integrator at the Hyper Warehouse, which maintains the hyper-views.]

CREATE VIEW C_MV as
(
    SELECT c.p_name, sum(b.ps_availqty) as availqty
    FROM partsupp b, part c
    WHERE c.p_partkey = b.ps_partkey
    GROUP BY p_name
)

Fig. 5. MV used for performance evaluation.

In Fig. 4 we provide an illustration of the architecture of the proposed system. We use a modified version of the architecture initially proposed in [6,8]. The main components are:

• Data sources: technologies such as RDBMSs (e.g. MS SQL Server) or data warehouses (e.g. MVs).

• Integrator: an information broker that is responsible for receiving the update notifications from the agents and installing the updates to the hyper-view in a unit of work. The messages that are exchanged between the agents and the integrator are flat files marked up with XML.

• Agents: server processes (pre-spawned) that control the interaction with a data source and wait for requests from the integrator to monitor specific MVs or perform specific actions. They also transmit the updates to the integrator, marked up with XML.

Having this architecture in mind, we assume that an SPJG MV (e.g. C_MV) has been defined in each data warehouse that aggregates the available quantity of each product (refer to the example in the previous section). For the sake of simplicity we consider two joined tables, with the definition of this MV shown in Fig. 5.

From the above discussion, the hyper-view (e.g. C_HV), in analogy to the C_MV, will have the following fields: {P_NAME, PS_AVAILQTY, WH_ORIGIN}, and the combined key will be {P_NAME, WH_ORIGIN}, where the field PS_AVAILQTY is the aggregated amount (i.e. SUM) of the available quantity of each product in each site (i.e. supplier).
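Purely as an illustrative sketch, the C_HV could be realized as the following table in SQL Server 2000 T-SQL. The data types are assumptions (TPC-D declares P_NAME as VARCHAR(55)), and the quantity column is written as AVAILQTY to match Table 1 below, although the field list above spells it PS_AVAILQTY:

CREATE TABLE C_HV
(
    P_NAME     VARCHAR(55) NOT NULL,  -- product name, as exposed by each C_MV
    AVAILQTY   INT         NOT NULL,  -- SUM of the available quantity at the site
    WH_ORIGIN  INT         NOT NULL,  -- warehouse origin of the tuple
    CONSTRAINT PK_C_HV PRIMARY KEY (P_NAME, WH_ORIGIN)  -- the combined key
)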

3.2. Middle-tier components

In the following paragraphs we elaborate on the middle-tier components (i.e. the agents and the integrator) and on the communication protocol between them.

3.2.1. Integrator

The integrator sends a request to every agent in each data warehouse to monitor the corresponding MV (e.g. C_MV). When an agent detects a change in the source data, it marks up the changed records with XML and transmits this XML file to the integrator. The integrator then parses the XML file and installs the changes to the hyper-view on a FCFS (First-Come-First-Served) basis. These changes are installed by executing the appropriate query against the C_HV, as shown in Table 1. Upon successful installation, the integrator sends back a positive acknowledgment to the agent (the ROWIDs of the tuples that were installed to the C_HV); otherwise a negative acknowledgment (i.e. a 'ROLLBACK' action).

Table 1
Queries that must be issued by the integrator to the hyper-view for an MV and each action (assuming communication with Agent1)

Table       Action  Query definition
C_MV_TEMP   U       Delete C_HV where P_NAME = <P_NAME> and WH_ORIGIN = 1;
                    Insert into C_HV (P_NAME, AVAILQTY, WH_ORIGIN) values (<P_NAME>, <AVAILQTY>, 1)
            I       Insert into C_HV (P_NAME, AVAILQTY, WH_ORIGIN) values (<P_NAME>, <AVAILQTY>, 1)
            D       Delete C_HV where P_NAME = <P_NAME> and WH_ORIGIN = 1
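A minimal sketch, assuming SQL Server 2000 T-SQL inside a stored procedure, of how the integrator might install a 'U' action from Agent1 as one unit of work; @p_name and @availqty are hypothetical parameters carrying the values parsed from the agent's XML file:

BEGIN TRANSACTION
-- per Table 1, an update is a deletion followed by an insertion
DELETE C_HV WHERE P_NAME = @p_name AND WH_ORIGIN = 1
IF @@ERROR <> 0 GOTO undo
INSERT INTO C_HV (P_NAME, AVAILQTY, WH_ORIGIN) VALUES (@p_name, @availqty, 1)
IF @@ERROR <> 0 GOTO undo
COMMIT TRANSACTION    -- positive acknowledgment: report the installed ROWIDs
RETURN
undo:
ROLLBACK TRANSACTION  -- negative acknowledgment: a 'ROLLBACK' action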

3.2.2. Agents

An agent is a server process waiting for requests from the integrator. When a request to monitor a specific MV is present, the child agent operates under the following algorithm:

1. Parse the incoming request and extract the name of the view that is to be monitored (e.g. C_MV).

2. Analyse the definition of the view and extract the table names and fields that participate in the view (e.g. P_NAME, PS_AVAILQTY).

3. Create a temporary table, named after the MV, which keeps the base table deltas. The fields of the table are those that appear in the view definition plus a field that indicates the action that has been performed, a status field that indicates whether this tuple has been transmitted or not, and a row-id field that provides unique numbering (i.e. auto-increment). Thus, the definition of the temporary table is as follows: C_MV_TEMP {P_NAME, PS_AVAILQTY, ACTION, STATUS, ROWID}. For example, if a row in the C_MV_TEMP has been created as part of the 'on insert' trigger on the C_MV, then the ACTION field of this tuple has the value 'I', which stands for Insert. Similarly, the ACTION field may have the values 'U' for update and 'D' for delete. In addition, the STATUS field will be set to '1' for this tuple.

4. Install triggers on the C_MV that replicate the changes to the C_MV_TEMP (a sketch of such a trigger, together with the query Q of step 5, is given after this algorithm).

5. Periodically run a query Q that selects all tuples from the temporary table which are to be re-transmitted or have not been transmitted yet, in this order (i.e. Order by STATUS Desc). We use this ordering due to the fact that the records to be retransmitted have a higher priority over the new records. If no records are returned then wait and repeat; else, mark up the returned records with XML, transmit them to the integrator and set the STATUS field of these records to '0' (i.e. transmitted). Upon receiving a positive acknowledgment from the integrator that all of these records (i.e. using the ROWIDs of the tuples) were successfully installed in a unit of work to the C_HV, the transmitted records are deleted from the temporary table. Otherwise, a 'ROLLBACK' action is performed, i.e. these records are marked for future retransmission (STATUS := STATUS + 1).
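The following is a minimal sketch of steps 4 and 5 in SQL Server 2000 T-SQL, assuming C_MV is modeled as a table with columns p_name and availqty and that ROWID is an IDENTITY column; the trigger name is our own, and analogous FOR UPDATE and FOR DELETE triggers would record 'U' and 'D' actions:

-- step 4: replicate inserts on C_MV into the delta table C_MV_TEMP
CREATE TRIGGER trg_C_MV_ins ON C_MV
FOR INSERT
AS
    INSERT INTO C_MV_TEMP (P_NAME, PS_AVAILQTY, ACTION, STATUS)
    SELECT i.p_name, i.availqty, 'I', 1
    FROM inserted i
GO

-- step 5: the periodic query Q; records marked for retransmission carry
-- higher STATUS values and therefore sort ahead of new records
SELECT * FROM C_MV_TEMP WHERE STATUS >= 1 ORDER BY STATUS DESC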

3.3. Communication protocol

When a trigger on an MV is fired as part of the maintenance transaction, there are three cases (on insert/update/delete) for the MV that must be handled differently. Table 1 presents the queries that the integrator must issue to the hyper-view for each action and temporary table, assuming that the updates originate from Agent1 (i.e. WH_ORIGIN = 1).
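The paper specifies only that agent-integrator messages are flat files marked up with XML; purely as a hypothetical illustration (all element and attribute names are our own), a transmitted insert record from Agent1 might look as follows:

<delta view="C_MV" agent="1">
  <record rowid="42" action="I">
    <p_name>goldenrod lace</p_name>
    <availqty>120</availqty>
  </record>
</delta>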

Fig. 6. High-level view of the prototype's architecture. [Figure: Agent1 at Distributor 1 (DW1, SQL Server 2000) and Agent2 at Distributor 2 (DW2, SQL Server 2000) transmit MV deltas to the Integrator at the manufacturer's HW (SQL Server 2000), which hosts the hyper-views.]

3.4. Discussion

The presented framework has a straightforward implementation on top of an RDBMS and is useful in cases where an appropriate MV already exists at a participating data warehouse. Thus, we pump data directly from the stored result set while keeping the storage overhead at the participating data warehouse at low levels and letting the original MV be maintained by other existing algorithms. In particular, we use one temporary table at the participating data warehouse for every MV to be monitored, and once the transmitted records are successfully installed at the hyper-view they are deleted from the temporary table.

However, one issue that arose during the design of a prototype for the purpose of conducting experimental work (see next section) concerned the method of populating the base table deltas. There are two options. One is to define triggers on each base table R, so that updates, insertions, and deletions will trigger the insertion of change records into DR. The other option is to populate DR by extracting changes from the database engine's transaction log. We chose the trigger method because it does not require knowledge of the database engine's log format and is simpler to implement.
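A minimal sketch of the chosen trigger option in SQL Server 2000 T-SQL, with a hypothetical delta table DR_PARTSUPP for the base table partsupp; the deleted pseudo-table holds the rows removed by the triggering statement:

-- record deletions on the base table partsupp as change records in its DR
CREATE TRIGGER trg_partsupp_del ON partsupp
FOR DELETE
AS
    INSERT INTO DR_PARTSUPP (PS_PARTKEY, PS_AVAILQTY, ACTION)
    SELECT d.ps_partkey, d.ps_availqty, 'D'
    FROM deleted d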

Another issue was the fact that triggers on indexed views are not fired as part of the RDBMS's internal view maintenance process. Thus, we modeled the MV of interest of each data warehouse as a table and installed triggers on the referenced base tables that maintain the MV (considering it as a table) incrementally. However, it would be far better if there were a special kind of trigger that could be fired as part of the RDBMS's internal view maintenance process. In this case we would not model the MV as a table and we could apply our algorithm directly to the MV. Also, we mainly considered the SUM aggregate function. Other functions, such as AVG, can be handled similarly by noticing that the AVG function equals SUM/COUNT. Thus, the integrator should also have knowledge of the count number and increase/decrease this number upon insertion/deletion respectively.
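To make the AVG remark concrete, a sketch under the assumption that the hyper-view carries, alongside the stored sum, a hypothetical CNT column maintained by the integrator, so that the average is derived on demand rather than stored:

-- on a reported insertion: add to the stored sum and increase the count
UPDATE C_HV
SET AVAILQTY = AVAILQTY + @availqty,
    CNT = CNT + 1           -- a deletion would subtract and decrease instead
WHERE P_NAME = @p_name AND WH_ORIGIN = 1

-- AVG = SUM/COUNT, computed when queried (1.0 avoids integer division)
SELECT P_NAME, WH_ORIGIN, 1.0 * AVAILQTY / CNT AS avg_availqty FROM C_HV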

4. Experimental work and analysis

A prototype that implements the proposed algorithms has been developed using a set of external drivers around the Microsoft SQL Server 2000 relational database engine. The agent and integrator entities have been implemented using Borland's Delphi programming language and communicate using TCP/IP sockets. We also modeled the MVs as indexed views in Microsoft's SQL Server 2000 for three data warehouses, two client data warehouses and one central warehouse, connected in a 100 Mbps LAN, with the hyper-view being modeled as an ordinary table.
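For readers unfamiliar with SQL Server 2000 indexed views, the following sketch shows how the C_MV of Fig. 5 could be materialized at a client warehouse. WITH SCHEMABINDING, two-part table names and a COUNT_BIG(*) column are requirements the engine places on indexed views with GROUP BY, not part of the paper's design:

CREATE VIEW dbo.C_MV WITH SCHEMABINDING AS
    SELECT c.p_name, SUM(b.ps_availqty) AS availqty,
           COUNT_BIG(*) AS row_cnt   -- mandatory in GROUP BY indexed views
    FROM dbo.partsupp b JOIN dbo.part c ON c.p_partkey = b.ps_partkey
    GROUP BY c.p_name
GO
-- creating the unique clustered index is what materializes the view
CREATE UNIQUE CLUSTERED INDEX ix_C_MV ON dbo.C_MV (p_name)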

Fig. 6 gives a high-level view of the prototype's architecture. Solid arrow-lines represent data flow.

Performance is evaluated on the following aspects: answering queries over union-based non-materialized hyper-views that depend on MVs, and relating the cost of performing a query over a non-materialized view to the cost of propagating incremental updates to the corresponding hyper-view. In accordance with the evaluated aspects above, the hyper-views used are referred to as C_HV1 and C_HV3. Also, the source MVs used at the participating data warehouses are a variant of the C_MV given in Fig. 5.

In Table 2 we illustrate a sample of the evaluation queries for which a decision maker accessing the central data warehouse would be interested in having fast system response time.

Query Qa selects all the products from the participating MVs that, say, a manufacturer produces, where the available quantity of each product is less than or equal to 40 pieces. Query Qb selects the product name and the sum of the available quantity of each distinct product in each participating data warehouse, where the sum is less than or equal to 40 pieces.

In the first case, C_HV1 is defined as a union query that selects the product name and the available quantity of each product and participating remote MV. This is accomplished through the linked server and 'openquery' facilities of Microsoft's SQL Server 2000. Specifically, the definition of the C_HV1 is shown in Fig. 7.

As we will see next, the performance of this union query has an almost constant response time as updates take place at the participating data warehouses. Although the result set is always up-to-date with the source data, this case has the slowest response time of the presented cases, because the result set is not materialized at the HW but recomputed on demand. The only MVs are those at the participating data warehouses. In addition, this case is susceptible to network/link failures because the HW must query remote data warehouses.

Table 2
Queries used for the evaluation process

Query name   Query definition
Qa           Select * from c_hvx where availqty <= 40 order by p_name
Qb           Select p_name, sum(availqty) from c_hvx group by p_name having sum(availqty) <= 40

Create view C_HV1 as
(select p_name, availqty, 1 as mySUPPLIER from
    openquery(DW01, 'select p_name, availqty from C_MV')
 union all
 select p_name, availqty, 2 as mySUPPLIER from
    openquery(DW02, 'select p_name, availqty from C_MV'))

Fig. 7. The hyper-view used for answering queries over union-based MVs.

The hyper-view (C_HV3) is defined as a table at the HW, and we populate the C_HV3 by issuing a query similar to that of Fig. 7. Although this case has an initial materialization overhead, subsequent updates at the participating data warehouses are applied incrementally to the C_HV3. This is accomplished through the use of the agent and integrator entities, where the former transmits the base table deltas and the latter installs the updates to the HV by issuing the appropriate queries as shown in Table 1. In this way, the hyper-view is always up-to-date with the source MVs, and subsequent queries are run against this table only (i.e. C_HV3), providing enhanced system and information quality, i.e., fast system response times and up-to-date data. However, this case is susceptible to network/link failures because the agent and integrator entities communicate using TCP/IP.

We ran the evaluation queries (i.e. Qa and Qb) on every C_HVx and performed updates at the source MVs at different times, namely t1 and t2,


Table 3
Average system response time for the evaluation queries

         t1             t2 (update)    t3             t4 (update)    t5
         Qa     Qb      Qa     Qb      Qa     Qb      Qa     Qb      Qa     Qb
C_HV1    20.2   37.2    20.5   37.1    20.2   37.2    20.2   37.2    20.2   37.2
C_HV3    28.6   30      1      2.8     0.51   2.4     1      2.8     0.51   2.4

with t0 being the initialization phase. Table 3 compares the average system response times of the evaluation queries that retrieve the results of Qa and Qb, that is, the available quantity of each product and participating distributor, where it is less than or equal to 40 pieces.

As we can see, the case of answering queries over union-based non-materialized hyper-views that depend on MVs (i.e. C_HV1) has an almost constant system response time. This was expected, due to the fact that the C_HV1 is not materialized but recomputed on demand. With reference to the above table, the proposed architecture (i.e. the case of C_HV3) outperforms the other approach, due to the fact that updates are applied incrementally to the hyper-view. Thus, subsequent queries are evaluated only against the hyper-view, which is a local table/view at the HW.

5. Conclusions and further research

In this paper we have presented an original approach to view materialization in which an MV depends on remote MVs, identifying it as a hyper-view. Moreover, we proposed an algorithm for maintaining this kind of MV and presented a prototype which implemented the algorithm in the context of experimental work. As expected, our approach outperforms the approaches usually employed by commercial RDBMSs, since it installs the updates incrementally.

However, we modeled an MV as a table and provided a simple maintenance technique by installing triggers on source tables. This is due to the fact that a trigger on an indexed view is fired only upon an explicit insert/update/delete on the view itself. We believe it would be more convenient and less computationally expensive if there were a special kind of trigger on an indexed view that fired upon the RDBMS's internal view maintenance process.

Moreover, in our experiment we assumed that the base tables in each data warehouse are available, and we installed triggers on them. In this way, we bypassed an intermediate level, that of an MV. Further research should concentrate on installing triggers directly on the MVs instead of the base tables. In addition, we believe that a promising avenue of research is the definition of a cost model in order to decide whether the intermediate level (i.e. an MV) should be bypassed or not, and whether the intermediate level should be constructed (if it does not already exist) or not. Obviously, such decisions have an impact on the consistency of the views and consequently on the overall efficiency and effectiveness of the collaborative-business workflow, affecting directly system and information quality.

We also considered a hyper-view as a union over MVs. An interesting research direction would be to incrementally maintain hyper-views defined as a join between some MVs of the participating data warehouses, or as more complex queries related, for example, to annual projections of the return on investment that may potentially include sub-queries. Also, further research should deal with the consistency of data elements from different data sources having different semantics. This case was out of the scope of this paper, as we have assumed that this data will be consistent.

Concluding, we argue that for extended enterprises past research on DWR offers little or no insight, as it focuses mainly on update propagation through MVs for single warehouses. We believe that this paper draws attention to a new set of challenges that demand our attention, far beyond view maintenance, which is just one step of the complete refreshment process. Other steps may concern data cleansing and data merging, due to potential data and/or schema differences between the participating warehouses, and data customization through the concept of the hyper-views.

References

[1] Divyakant Agrawal, Amr El Abbadi, Ambuj Singh, Tolga Yurek, Efficient view maintenance at data warehouses, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, Tucson, AZ, USA, 13–15 May 1997, pp. 417–427.

[2] Mokrane Bouzeghoub, Françoise Fabret, Maja Matulovic-Broque, Modeling DWR process as a workflow application, in: Proceedings of the International Workshop on Design and Management of Data Warehouses, Heidelberg, Germany, 14–15 June 1999, pp. 6-1–6-12.

[3] Lei-da Chen, Khalid S. Soliman, En Mao, Mark N. Frolick, Measuring user satisfaction with data warehouses: An exploratory study, Information and Management 37 (2000) 103–110.

[4] Lyman Do, Pamela Drew, Wei Jin, Vish Jumani, David Van Rossum, Issues in developing very large data warehouses, in: Proceedings of the 24th International Conference on Very Large Databases, New York City, NY, USA, 24–27 August 1998, pp. 633–636.

[5] William H. DeLone, Ephraim R. McLean, Information systems success: The quest for the dependent variable, Information Systems Research 3 (1) (1992) 60–95.

[6] Lingli Ding, Xin Zhang, Elke A. Rundensteiner, The MRE wrapper approach: Enabling incremental view maintenance of data warehouses defined on multi-relation information sources, in: Proceedings of the ACM Second International Workshop on Data Warehousing and OLAP, Kansas City, MO, USA, 6 November 1999, pp. 30–35.

[7] Michael Hammer, The superefficient company, Harvard Business Review (2001) 82–91.

[8] Costas Petrou, Stathis Hadjiefthymiades, Drakoulis Martakos, An XML-based, 3-tier scheme for integrating heterogeneous information sources to the WWW, in: Proceedings of the International Workshop on Internet Data Management, 10th International Workshop on Database & Expert Systems Applications, Florence, Italy, 1–3 September 1999, pp. 706–710.

[9] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, Making views self-maintainable for data warehousing, in: Proceedings of the 4th International Conference on Parallel and Distributed Information Systems, Miami Beach, FL, USA, 18–20 December 1996, pp. 158–169.

[10] Nick Roussopoulos, Materialized views and data warehouses, ACM SIGMOD Record 27 (1) (1998) 21–26.

[11] Kenneth Salem, Kevin Beyer, How to roll a join: Asynchronous incremental view maintenance, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, TX, USA, 16–18 May 2000, pp. 129–140.

[12] Transaction Processing Performance Council, TPC Benchmark D (Decision Support), Working Draft 6.0, 1993. Available from <www.tpc.org>.

[13] Hui Wang, Maria Orlowska, Weifa Liang, Efficient refreshment of materialized views with multiple sources, in: Proceedings of the International Conference on Information and Knowledge Management, Kansas City, MO, USA, 2–6 November 1999, pp. 375–382.

[14] Jian Yang, Mike P. Papazoglou, Interoperation support for electronic business, Communications of the ACM 43 (6) (2000) 39–47.

[15] Yue Zhuge, Hector Garcia-Molina, Joachim Hammer, Jennifer Widom, View maintenance in a warehousing environment, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, San Jose, CA, USA, 22–25 May 1995, pp. 316–327.

[16] Yue Zhuge, Hector Garcia-Molina, Janet L. Wiener, The Strobe algorithms for multi-source warehouse consistency, in: Proceedings of the Conference on Parallel and Distributed Information Systems, Miami Beach, FL, USA, 18–20 December 1996, pp. 146–157.

[17] Yue Zhuge, Janet L. Wiener, Hector Garcia-Molina, Multiple view consistency for data warehousing, in: Proceedings of the 13th International Conference on Data Engineering, Birmingham, UK, 7–11 April 1997, pp. 289–300.

[18] Xin Zhang, Elke A. Rundensteiner, Integrating the maintenance and synchronization of data warehouses using a cooperative framework, Information Systems 27 (4) (2002) 219–243.