European Journal of Operational Research 160 (2005) 353–364
www.elsevier.com/locate/dsw
Data warehouse clustering on the web
Aristides Triantafillakis *, Panagiotis Kanellis, Drakoulis Martakos
Department of Informatics and Telecommunications, National and Kapodistrian University of Athens,
University Campus, Athens 15771, Greece
Received 5 June 2002; accepted 24 July 2003
Available online 9 December 2003
Abstract
In collaborative e-commerce environments, interoperation is a prerequisite for data warehouses that are physically
scattered along the value chain. Adopting system and information quality as success variables, we argue that what is
required for data warehouse refreshment in this context is inherently more complex than the materialized view
maintenance problem and we offer an approach that addresses refreshment in a federation of data warehouses. Defining
a special kind of materialized views, we propose an open multi-agent architecture for their incremental maintenance
while considering referential integrity constraints on source data.
© 2003 Elsevier B.V. All rights reserved.
Keywords: Decision support systems; Multi-agent systems; Data warehouse refreshment; Materialized view maintenance
1. Introduction
A base requirement for the success of a data
warehouse is the ability to provide decision makers
with both accurate and timely information
(information quality) as well as fast query response times (system quality) [3]. Common methods that
are used in practice for achieving those are largely
focused on the concept of the materialized views
(MVs) where a business question (i.e. query) is
more quickly answered against the MV than
accessing directly the base data sources [10], which
may potentially involve time-demanding opera-
tions such as large-table scans and joins. However,
* Corresponding author.
E-mail address: [email protected] (A. Triantafillakis).
0377-2217/$ - see front matter © 2003 Elsevier B.V. All rights reserved.
doi:10.1016/j.ejor.2003.07.012
inevitable updates at the data sources introduce
the information quality problem––how to keep the
MVs at a certain level of consistency with the
source data when update transactions take place
[13,16]. In short, system and information quality
are bound together – MVs can improve query performance if we can manage to update the views
consistently [4].
Collaborative electronic commerce (Ce-commerce) simply augments this challenge because
these data sources are not only internal, as they
largely were a mere few years ago. For example,
the emergence of business communities in the form
of Business-to-Business (B2B) exchanges means that the boundaries of organizations are more fluid
than they used to be [14]. Ce-commerce poses new
challenges to the MV maintenance problem, as
'extended enterprises' have to integrate far more
Fig. 1. A generic workflow for DWR (adapted from [2]).
data originating outside the organization into a single repository. A recent paper [7] by Hammer
examines this trend by felicitously pointing out
that the streamlining of cross-company processes
". . . is the next great frontier for reducing costs,
enhancing quality, and speeding operations" (p.
83). In this way multiple enterprises within a
shared market segment can collaboratively plan
and manage the flow of goods, services, and information along the supply chain in a way that
increases customer-perceived value and optimizes
at the same time internal efficiencies [14].
What this means for data warehouse informa-
tion and system quality is that we should start
viewing Data Warehouse Refreshment (DWR) as
an operational process that must provide explicit
support for cross-enterprise collaboration in terms of changing business constraints. Therefore, DWR
should not only be limited to MV maintenance in
the context of a single warehouse (as we are
accustomed to), but also support the refreshment
in a federation of data warehouses. This, in turn,
translates to providing a new set of algorithms and
techniques to materialize views from source data
that may reside in remote sites, and to incrementally maintain these views. One should also note
that separate DWR processes in separate data
warehouses augments this challenge, because there
might exist different maintenance policies on the
MVs of interest for each DWR process. To the
best of our knowledge, previous work on view
maintenance has mainly considered the case of SPJ
(Select-Project-Join) views in a single warehouse, while not providing insights for a data warehousing solution in such an environment. In short, the
problem we address in this paper can be stated as
follows: ‘‘how to maintain MVs in environments
where a data warehouse utilizes data from other
data warehouses’’.
Considering system and information quality as
success variables [5], we argue in the next section that for collaborative e-commerce there is more to
DWR than the MV maintenance problem. In
Section 3 we present an agent-based framework
based on the eXtensible Markup Language (XML)
standard for the incremental maintenance of a
special kind of MVs, and in Section 4 we clarify the
limitations of the empirical part of our research,
providing at the same time some directions for future work.
2. Issues and challenges for data warehouse refresh-
ment in collaborative e-commerce environments
In the case of a single data warehouse, DWR
can be viewed as a process that involves a hierarchy of data stores accommodating a range from
source data to highly aggregated data [2,16]. The
Operational Data Store (ODS) stores source data
in a uniform and clean representation whilst the
Corporate Data Warehouse (CDW) stores aggre-
gate views of interest, or in other words, MVs.
This hierarchy of data stores is a logical way to
represent the flow of information, which goes from the data sources to the MVs.
DWR is a complex process composed of both
asynchronous and parallel tasks of which the main
task is to capture the differential changes in the
sources and to propagate them through this hier-
archy of data stores in the data warehouse. Fig. 1
presents a generic workflow for the refreshment
process, where for simplicity we have not considered archiving.
The information flow begins with the loading
sub-process, which 'feeds' the data warehouse and
is composed of the data extraction and data
cleansing steps. Other sub-processes include data
integration, update propagation and customiza-
tion, which propagate the summarized data to the
data marts. In the remainder of this section we will demonstrate why DWR in a federation of data
warehouses is a complex issue and poses challenges that go beyond those normally associated with
simple view maintenance.
An integrated value system, which may include
a number of data warehouses, requires a solution
where the system of each participating business in
the chain should be able to communicate with all
others. In such a context this communication is
translated into customizing and querying remote MVs in each participating business that contain
information of interest to the collaborative-busi-
ness workflow. The atomicity of the source
warehouses augments this challenge, as we also
have to take into account the different mainte-
nance policies of the MVs of each participating
warehouse. In such environments, we define an
MV as a hyper-view, because it provides a higher level of aggregation and consolidation of internal
as well as external data sources. Also, the data
sources may be MVs and not solely base tables,
as we are accustomed to. Fig. 2 depicts this
situation, where for simplicity we have consid-
ered two enterprise systems, namely Site1 and
Site2.
Hyper-views are dealt with in the customization sub-process of the DWR process depicted in Fig. 1
and may have various forms. For example, they
may be defined as a union over the corresponding
MVs of the participating warehouses, or as an
Fig. 2. Data warehouse architecture in Collaborative E-Commerce.
aggregate query that provides decision makers with a higher level of consolidated information.
In the former case a hyper-view is defined as
HV = ⋃_{i=1}^{n} MV_i,    (1)

while in the latter case, a hyper-view is defined as

HV = MV_1 ⋈ MV_2 ⋈ ... ⋈ MV_n    (2)

with (MV_1, MV_2, ..., MV_n) being the MVs of the
n participating warehouses.
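The union case of Eq. (1) can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation; the helper name and the sample MV contents are invented:

```python
# Sketch of Eq. (1): a union-based hyper-view built from the MVs of
# hypothetical participating warehouses. Each tuple is tagged with its
# warehouse of origin so rows from different sites remain distinguishable.

def hyper_view(*member_views):
    """HV = union of MV_1 .. MV_n, with an extra origin field per tuple."""
    hv = []
    for origin, mv in enumerate(member_views, start=1):
        for row in mv:
            hv.append(row + (origin,))   # (p_name, availqty, wh_origin)
    return hv

mv1 = [("bolt", 120), ("nut", 40)]       # MV of warehouse 1 (invented data)
mv2 = [("bolt", 75)]                     # MV of warehouse 2 (invented data)

print(hyper_view(mv1, mv2))
# [('bolt', 120, 1), ('nut', 40, 1), ('bolt', 75, 2)]
```

The origin tag is what later allows incremental per-site maintenance of the hyper-view.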
We should underline that a hyper-view aug-
ments the complexity quotient, even in its simplest
form – a hyper-view as a union over remote MVs.
This is not as trivial as it may initially seem and is only partially addressed in practice by creating an MV in each database that does the aggregation for that database and then combining the results. It is
obvious that the result set is not stored in a central
repository, so the union-based view is not materi-
alized but recomputed every time. Consequently, a
hyper-view that stores union-based result sets has
greater performance, which translates to better system quality. For decision makers along the
collaborative business workflow, the information
and systems quality requirements dictate MV
consistency and fast query response times.
We stated that for extended enterprises a num-
ber of processes can be identified, integrated and
shared along the value chain, and then managed as a single common process without regard to physi-
cal and technological corporate boundaries (dif-
ferent technologies that may affect system quality)
and with much less overhead and error in the
information provided to the end user (information
quality). To exemplify this, consider the simplified
workflow for e-procurement presented in Fig. 3
(forget the critical path for now), where the supply-chain cycle begins with a customer's order.
The manufacturer passes the order through the
usual intra-enterprise activities and turns to out-
side support from suppliers and other providers of
goods and services that are needed to make the
product. The information exchange pertains to
such matters as requests for quote, bids and purchase orders, order confirmations, invoices and payment confirmation.

Fig. 3. A workflow in a supply chain.

With respect to this
example, assume that Site1 and Site2 have the
'Order Fulfillment' business process in common, which in turn is decomposed into the 'Verify Status', 'Get Bids' and 'Order' sub-processes. The 'Verify Status' process depends on 'Get Bids', which in turn depends on the 'Order' sub-process. Within the boundaries of a single enterprise, considering these sub-processes as MVs
means that a delay in any step explicitly affects the
other processes, which in turn affects the whole
business workflow in terms of efficiency. In the
case of an extended enterprise the obvious inter-
dependencies augment this problem due to the
necessary querying of the hyper-views that contain
aggregations of interest consolidated from MVs of the participating warehouses. A delay in any step
may also affect the other workflows that provide
links in the value-chain. Therefore, we can think of
these three sub-processes (or MVs) as belonging to
a critical path (Fig. 3), i.e., the sub-processes that
the extended enterprises share, for the collabora-
tive-business workflow and infer the following
lemmas:
Lemma 1 (information quality). There is a critical
path along the value chain that requires complete
consistency.
Lemma 2 (information quality). There is a busi-
ness transaction along the critical path that requires
complete consistency.
Lemma 3 (system quality). The critical path is the
bottleneck of the workflow and may explicitly affect
system quality.
Lemma 4 (system and information quality). Effi-
cient DWR along the critical path is the dependent
success variable for effective decision support.
Lemmas 1 and 2 refer to information quality in
terms of requesting complete consistency. In par-
ticular, Lemma 2 states that the three aforemen-
tioned processes should belong to a business
transaction [14], in analogy to a database trans-
action, that groups both transactional as well as
non-transactional processes together in a unit of work that reflects the semantics and behavior of its
underlying business tasks. Lemma 3 refers to sys-
tem quality in terms of fast query response times in
conjunction with adopting different access fre-
quency constraints and different maintenance
policies and consistency requirements for different
views according to the needs of the decision
makers. Finally, Lemma 4 simply states that considering all the parameters that affect the refreshment processes (i.e. efficiency) should provide the
decision makers with the intended result (i.e.
effective decision support).
Hence, from the above discussion we can infer
that the DWR process is a complex and event-
driven system that needs certain monitoring and
evolves frequently, following the evolution of data sources and user requirements. It refers to the
ability to define different scenarios depending on
user requirements, source constraints and data
warehouse constraints. However, most of the prior
work on warehouse refreshment deals with the
problem of maintaining SPJ-type MVs incremen-
tally in the case of a single warehouse. Existing
algorithms found in the literature [1,6,9,11,13,15–17] should be revisited to take into account specific constraints such as the freshness of data, the
cific constraints such as the freshness of data, the
space limitation of the ODS or CDW, referential
integrity constraints on source data and access
frequency to sources that users, data warehouse
administrators and data source administrators may
impose. In particular, the Strobe Algorithm [16] has
the potential of infinite waits requiring quiescence from the sources to update the corresponding
views. It also ensures strong consistency, but not
complete consistency since it incorporates the ef-
fects of several updates collectively. In the case of
the ECA algorithm [15] the data warehouse model
is restricted in that the number of data sources is
limited to a single data source. It is obvious that in
a Ce-commerce environment, quiescence is unlikely (Lemma 1) and transactions involve multiple entities (Lemma 2). Although the SWEEP algorithm
[1] guarantees complete consistency and has linear
message complexity with the number of data
sources for processing an update, it does not con-
sider access frequency constraints to the data
sources (Lemma 3). Again, approaches described
in [11,13] are restricted to a single warehouse (Lemma 3). Although the authors in [9] use key and
referential integrity constraints in order to mini-
mize auxiliary data and make a class of views self-
maintainable, the class of views considered does
not include aggregation. An architecture for multi-
view consistency is proposed in [17], using an
integrator, view managers and a merge process that
collects changes to the views, holds them until all affected views can be modified together, and then forwards all of the views' changes to the warehouse
in a single warehouse transaction. However, the
authors provide algorithms for the merge process
to decide when to hold and when to forward
actions (Lemma 4). A work that is close to ours
is described in [6] where the authors propose
incremental view maintenance algorithms for data warehouses defined on multi-relation information
sources. However, they do not explicitly refer to a
warehousing solution in a Ce-commerce environ-
ment. Finally, the SDCC framework introduced in
[18] provides a solution to the problem of concur-
rent view definition and evolution but the authors
make the assumption that each information source
has only one relation, which is deemed unrealistic in the defined Ce-commerce environment
(Lemma 2).
Regarding commercial database products, the
issues defined above have not been adequately
addressed either. For example, an MV in Microsoft's SQL Server 2000 (called an indexed view) must not reference any other views, but only base tables, and all base tables referenced by the view must be in the same database as the view. Besides, UNION
is not allowed in an indexed view. A workaround
to these issues is to create an indexed view in each
database that does the aggregation for that data-
base and then combine the results. Thus, the
combined result set is not materialized, so we lose
stored result sets – one of the fundamental properties of the data warehouse.

In summary, the case of a federation of data
warehouses that are physically scattered along
integrated value chains over the web, necessitates
an architecture where interoperation is a pre-
requisite in terms of highly aggregated MVs that
depend in turn on the remote MVs of the partici-
pating enterprises. Prior academic work on DWR
has mainly considered the incremental maintenance of SPJ-type MVs in the case of a single warehouse, whereas in commercial products UNION-based views cannot be materialized.
Consequently, we believe that a different ap-
proach is needed that addresses the incremental
MV maintenance problem in Ce-commerce type
environments. In the next section we propose
such an approach and architecture for the incremental maintenance of union-based hyper-
views stored at the central data warehouse that
depend on simple SPJG (Select-Project-Join-
Group by) MVs of the participating data ware-
houses. We also consider the SUM function as a
motivating example and treat the hyper-view as a
table.
3. System architecture
The proposed architecture builds upon and ex-
tends the works of [6,8,9,17], contributing in the
process a novel approach for DWR in Ce-com-
merce environments. In particular, we use a log L of the changes (base table deltas), modeled by installing triggers on MVs. The Agent [8] and Integrator [6,8,17] entities are responsible for the
incremental maintenance of the hyper-views.
Specifically, the proposed architecture is appli-
cable in cases where an appropriate MV already
exists in a participating data warehouse, i.e. there
is a complete rewriting of the original query over
an MV. Thus, we pump data directly from the
existing MV. We model the MV of each data warehouse as a table and install triggers on the
base tables that maintain the MV incrementally.
We also install triggers on the MV (considering it
as a table) that replicate the changes to a tempo-
rary table at a participating data warehouse (delta
table – DR). This is because triggers on indexed
views are not 'fired' as part of the RDBMS's view maintenance internal process. Moreover, updates to the HV are handled as deletions followed by
insertions. In this approach we need one delta ta-
ble at the data warehouse for every MV to be
monitored.
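The trigger-based replication into the delta table can be sketched as follows, using Python's sqlite3 as a stand-in for the SQL Server 2000 engine the prototype uses. The table and trigger names follow the paper's later C_MV/C_MV_TEMP example, and the sample rows are invented:

```python
import sqlite3

# Sketch: the MV is modeled as a plain table C_MV, and triggers replicate
# every change into a delta table C_MV_TEMP, since triggers on indexed
# views are not fired by the RDBMS's own view maintenance process.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE C_MV(p_name TEXT, availqty INTEGER);
CREATE TABLE C_MV_TEMP(
  p_name TEXT, availqty INTEGER,
  action TEXT,                         -- 'I', 'U' or 'D'
  status INTEGER DEFAULT 1,            -- 1 = not yet transmitted
  rowid_ INTEGER PRIMARY KEY AUTOINCREMENT
);
CREATE TRIGGER mv_ins AFTER INSERT ON C_MV BEGIN
  INSERT INTO C_MV_TEMP(p_name, availqty, action)
  VALUES (NEW.p_name, NEW.availqty, 'I');
END;
CREATE TRIGGER mv_upd AFTER UPDATE ON C_MV BEGIN
  INSERT INTO C_MV_TEMP(p_name, availqty, action)
  VALUES (NEW.p_name, NEW.availqty, 'U');
END;
CREATE TRIGGER mv_del AFTER DELETE ON C_MV BEGIN
  INSERT INTO C_MV_TEMP(p_name, availqty, action)
  VALUES (OLD.p_name, OLD.availqty, 'D');
END;
""")
con.execute("INSERT INTO C_MV VALUES ('bolt', 120)")
con.execute("UPDATE C_MV SET availqty = 95 WHERE p_name = 'bolt'")
print(con.execute(
    "SELECT p_name, availqty, action FROM C_MV_TEMP ORDER BY rowid_"
).fetchall())
# [('bolt', 120, 'I'), ('bolt', 95, 'U')]
```

Each monitored MV gets one such delta table, which the agent later drains toward the integrator.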
3.1. Definitions
We refer to the data warehouse in this envi-
ronment as a hyper-warehouse (HW) – a super-set
of the participating data warehouses with the hy-
per-view as a union over its corresponding MVs
(MV_1, MV_2, ..., MV_n) of the n participating warehouses (Eq. (1)). Thereby, a hyper-view is a special kind of MV whose view definition is the
super-set of the definition of a corresponding MV
together with an extra field that indicates the
warehouse origin of each tuple. We also utilize the
agent technology [6,17], for the interoperation of
the data warehouses. This is because agent soft-
ware can be installed on top of an RDBMS in every participating data warehouse that transmits
the base table deltas to an integrator entity at the
HW, which is responsible for maintaining the
hyper-view. The child-agent entity pumps data
Fig. 4. Overall system architecture.
CREATE VIEW C_MV as
(
SELECT c.p_name, sum(b.ps_availqty) as availqty
FROM partsupp b, part c
WHERE c.p_partkey = b.ps_partkey
GROUP BY p_name
)
Fig. 5. MV used for performance evaluation.
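To make the example concrete, the C_MV definition of Fig. 5 can be exercised against a tiny invented data set; sqlite3 here stands in for the SQL Server engine used by the prototype, and all table contents are assumptions for illustration:

```python
import sqlite3

# Sketch: the C_MV aggregation of Fig. 5 run over sample part/partsupp rows.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE part(p_partkey INTEGER, p_name TEXT);
CREATE TABLE partsupp(ps_partkey INTEGER, ps_availqty INTEGER);
INSERT INTO part VALUES (1, 'bolt'), (2, 'nut');
INSERT INTO partsupp VALUES (1, 100), (1, 20), (2, 40);
CREATE VIEW C_MV AS
  SELECT c.p_name, SUM(b.ps_availqty) AS availqty
  FROM partsupp b, part c
  WHERE c.p_partkey = b.ps_partkey
  GROUP BY c.p_name;
""")
print(con.execute("SELECT * FROM C_MV ORDER BY p_name").fetchall())
# [('bolt', 120), ('nut', 40)]
```

The view sums the available quantity per product, which is exactly the shape of data the hyper-view later consolidates across sites.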
directly from the existing MV. At the next stage,
an algorithm executed by an agent maintains
incrementally the hyper-view at the HW by ana-
lyzing the base table deltas at the DR and posting
the appropriate queries to the hyper-view, where updates are handled as deletions followed by
insertions. In addition, we use the schema from the
TPCD-1GB reference schema [12] and assume that
schema mapping is possible using the architecture
proposed in [8].
In Fig. 4 we provide an illustration of the
architecture of the proposed system. We use a
modified version of the architecture initially proposed in [6,8]. The main components are:
• Data sources: technologies such as RDBMSs
(e.g. MS SQL Server) or Data Warehouses
(e.g. MVs).
• Integrator: information broker that is responsi-
ble for receiving the update notifications from
the agents and installing the updates to the hyper-view in a unit of work. The messages that
are exchanged between the agents and the inte-
grator are flat files marked up with XML.
• Agents: server processes (pre-spawned) that control the interaction with a data source, and wait
trol the interaction with a data source, and wait
for requests from the integrator to monitor spe-
cific MVs or perform specific actions. They also
transmit the updates to the integrator marked up with XML.
Having this architecture in mind, we assume
that an SPJG MV (e.g. C_MV) has been defined in
each data warehouse that aggregates the available
quantity of each product (refer to the example in
the previous section). For the sake of simplicity we
consider two joining tables with the definition of
this MV shown in Fig. 5.
From the above discussion, the hyper-view (e.g.
C_HV) in analogy to the C_MV, will have the
following fields: {P_NAME, PS_AVAILQTY,
WH_ORIGIN} and the combined key will be:
{P_NAME, WH_ORIGIN}, where the field 'PS_AVAILQTY' is the aggregated amount (i.e.
SUM) of the available quantity of each product in
each site (i.e. supplier).
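A minimal sketch of the combined key's role follows, with sqlite3 as a stand-in engine and invented rows: the key {P_NAME, WH_ORIGIN} lets the same product appear once per site while rejecting duplicates from the same site:

```python
import sqlite3

# Sketch: the hyper-view C_HV modeled as a table with the combined
# primary key {P_NAME, WH_ORIGIN} described above.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE C_HV(
  p_name TEXT, ps_availqty INTEGER, wh_origin INTEGER,
  PRIMARY KEY (p_name, wh_origin))""")
con.executemany("INSERT INTO C_HV VALUES (?, ?, ?)",
                [("bolt", 120, 1), ("bolt", 75, 2)])  # same product, two sites: OK
try:
    con.execute("INSERT INTO C_HV VALUES ('bolt', 10, 1)")  # same site: rejected
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The per-site row identity is what makes the delete-then-insert update pattern of Table 1 safe: a change from one warehouse can never clobber another warehouse's row.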
3.2. Middle-tier components
In the following paragraphs we elaborate on the
middle-tier components (i.e. agents and integrator) and on the communication protocol between
them.
3.2.1. Integrator
The integrator sends a request to every agent in
each data warehouse to monitor the corresponding
Table 1
Queries that must be issued by the integrator to the hyper-view for an MV and each action (assuming communication with Agent1)

Table: C_MV_TEMP
  Action U: Delete C_HV where P_NAME = <P_NAME> and WH_ORIGIN = 1;
            Insert into C_HV (P_NAME, AVAILQTY, WH_ORIGIN) values (<P_NAME>, <AVAILQTY>, 1)
  Action I: Insert into C_HV (P_NAME, AVAILQTY, WH_ORIGIN) values (<P_NAME>, <AVAILQTY>, 1)
  Action D: Delete C_HV where P_NAME = <P_NAME> and WH_ORIGIN = 1
MV (e.g. C_MV). When an agent detects a change in the source data, it marks up with XML the records that have been changed and transmits this
XML file to the integrator. Then the integrator
parses the XML file and installs the changes to the
hyper-view on a FCFS (First-Come-First-Serve)
basis. These changes are installed by executing the
appropriate query to the C_HV as shown in Table
1. Upon successful installation, the integrator sends back a positive acknowledgment to the agent (the ROWID of the tuples that were installed to the C_HV); otherwise, a negative acknowledgment (i.e. a 'ROLLBACK' action).
3.2.2. Agents
An agent is a server process waiting for requests
from the integrator. When a request to monitor a specific MV is present, the child agent operates under the following algorithm:
1. Parse the incoming request and extract the
name of the view that is to be monitored (e.g.
C_MV).
2. Analyse the definition of the view and extract the table names and fields that participate in the view (e.g. P_NAME, PS_AVAILQTY).
3. Create a temporary table, named after the
MV that keeps the base table deltas. The fields of
the table are those that appear in the view defini-
tion plus a field that indicates the action that has
been performed, a status field that indicates if this
tuple has been transmitted or not, and a row-id
field that provides unique numbering (i.e. auto increment). Thus, the definition of the temporary
table is as follows: C_MV_TEMP {P_NAME,
PS_AVAILQTY, ACTION, STATUS, ROWID}.
For example, if a row in the C_MV_TEMP has been created as part of the 'on insert' trigger on the C_MV, then the ACTION field of this tuple has the value 'I', which stands for Insert. Similarly, the ACTION field may have the values 'U' for update and 'D' for delete. In addition, the STATUS field will be set to '1' for this tuple.

4. Install triggers on the C_MV that replicate the changes to the C_MV_TEMP.
5. Run periodically a query Q that selects all tuples from the temporary table which are to be retransmitted or have not been transmitted yet, in this order (i.e. Order by STATUS Desc). We use this ordering because records to be retransmitted have a higher priority than new records. If no records are returned, then wait and repeat; else, mark up the returned records with XML, transmit them to the integrator and set the STATUS field of these records to '0' (i.e. transmitted). Upon receiving a positive acknowledgment from the integrator that all of these records (i.e. using the ROWID of the tuples) were successfully installed in a unit of work to the C_HV, the transmitted records are deleted from the temporary table. Otherwise, a 'ROLLBACK' action is performed, i.e. these records are marked for future retransmission (STATUS := STATUS + 1).
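Step 5 can be sketched as a small in-memory loop. This is an illustrative model, not the authors' Delphi implementation; the record layout and the send_to_integrator callback are hypothetical:

```python
# Sketch of the agent's transmit round: pick pending records with
# retransmissions first (highest STATUS), then handle ack / rollback.

def transmit_round(temp_table, send_to_integrator):
    """temp_table: list of dicts with 'status' (0 = transmitted,
    >= 1 = pending/retransmit) and 'rowid' fields."""
    # Pending rows, ordered like "Order by STATUS Desc".
    batch = sorted((r for r in temp_table if r["status"] >= 1),
                   key=lambda r: r["status"], reverse=True)
    if not batch:
        return temp_table                  # nothing to do; caller waits, repeats
    for r in batch:
        r["status"] = 0                    # mark as transmitted
    if send_to_integrator(batch):          # positive acknowledgment
        ids = {r["rowid"] for r in batch}
        return [r for r in temp_table if r["rowid"] not in ids]
    for r in batch:                        # 'ROLLBACK': mark for retransmission
        r["status"] += 1
    return temp_table

# Hypothetical usage: the integrator rejects once, then accepts.
table = [{"rowid": 1, "p_name": "bolt", "action": "I", "status": 1}]
table = transmit_round(table, lambda b: False)   # nack -> marked for retry
table = transmit_round(table, lambda b: True)    # ack  -> record deleted
print(table)
# []
```

The rising STATUS value doubles as a retry counter, which is why the descending sort gives older failures priority over fresh changes.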
3.3. Communication protocol
When a trigger on an MV is fired as part of the maintenance transaction, there are three cases (on insert/update/delete) for the MV that must be handled differently. Table 1 presents the queries that the integrator must issue to the hyper-view for
each action and temporary table, assuming that
the updates are originating from Agent1 (i.e.
WH_ORIGIN = 1).
Fig. 6. High-level view of the prototype's architecture.
3.4. Discussion
The presented framework has a straightfor-
ward implementation on top of an RDBMS and
is useful in cases where an appropriate MV al-
ready exists at a participating data warehouse.
Thus, we pump data directly from the stored re-
sult set while keeping the storage overhead at the participating data warehouse at low levels and
letting the original MV be maintained by other
existing algorithms. In particular, we use one
temporary table at the participating data ware-
house for every MV to be monitored and once the
transmitted records are successfully installed at
the hyper-view they are deleted from the temporary table.

However, one issue that arose during the design
of a prototype for the purpose of conducting
experimental work (see next section), concerned
the method of populating the base table deltas.
There are two options. One is to define triggers on
each base table R, so that updates, insertions, and
deletions will trigger the insertion of change re-
cords into DR. The other option is to populate DR
by extracting changes from the database engine's transaction log. However, we chose the trigger method because it does not require knowledge of the database engine's log format and is simpler to
implement.
Another issue was the fact that triggers on indexed views are not fired as part of the RDBMS's view maintenance internal process. Thus, we modeled the MV of interest of each data warehouse as a table and installed triggers on the ref-
erenced base tables that maintain the MV
(considering it as a table) incrementally. However,
it would be far better if there were a special kind of trigger that could be fired as part of the RDBMS's view maintenance internal process. In this case we would not model the MV as a table and we could apply our algorithm directly to the MV. Also, we mainly considered the SUM aggregate function. Other functions, such as AVG, can be handled similarly by noticing that the AVG function equals SUM/COUNT. Thus, the inte-
grator should also have knowledge of the count
number and increase/decrease this number upon
insertion/deletion respectively.
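The SUM/COUNT bookkeeping for AVG can be sketched as follows (an illustrative in-memory model of the integrator-side state, not part of the paper's prototype; class and attribute names are invented):

```python
# Sketch: maintaining AVG incrementally by carrying SUM and COUNT,
# adjusting both on insertion/deletion and deriving AVG on demand.

class AvgState:
    def __init__(self):
        self.total, self.count = 0, 0

    def insert(self, value):
        self.total += value
        self.count += 1

    def delete(self, value):
        self.total -= value
        self.count -= 1

    @property
    def avg(self):
        # AVG = SUM / COUNT; undefined when no tuples remain.
        return self.total / self.count if self.count else None

s = AvgState()
for qty in (100, 20, 40):
    s.insert(qty)
s.delete(20)
print(s.avg)
# 70.0
```

Because both SUM and COUNT are self-maintainable under insertions and deletions, AVG never has to be recomputed from the base data.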
4. Experimental work and analysis
A prototype that implements the proposed
algorithms has been developed using a set of
external drivers around the Microsoft SQL Server
2000 relational database engine. The agents and
the integrator entities have been implemented
using Borland's Delphi programming language and communicate using TCP/IP sockets. We also modeled the MVs as indexed views in Microsoft's SQL Server 2000 for three data warehouses: two client data warehouses and one central warehouse connected in a 100 Mbps LAN with the hyper-
view being modeled as an ordinary table.
Fig. 6 gives a high-level view of the prototype's architecture. Solid arrow-lines represent data flow.
Performance is being evaluated on three as-
pects: answering queries over union-based non-
materialized hyper-views that depend on MVs and
relating the cost of performing a query over a non-
materialized view to the cost of propagating
incremental updates to the corresponding hyper-
view. In accordance with the evaluated aspects
above, the hyper-views used are referred to as C_HV1 and C_HV3. Also, the source MVs used at
the participating data warehouses are a variant of
the C_MV given in Fig. 5.
In Table 2 we illustrate a sample of the evalu-
ation queries for which a decision maker accessing
the central data warehouse would be interested in
having fast system response time.
Query Qa selects all the products from the participating MVs that, say, a manufacturer produces, where the available quantity of each prod-
uct is less than or equal to 40 pieces. Query Qb
selects the product name and the sum of the
available quantity of each distinct product in each
participating data warehouse, where the sum is less
than or equal to 40 pieces.
In the first case, C_HV1 is defined as a union query that selects the product name and the available quantity of each product and participating remote MV. This is accomplished through the linked server and 'openquery' features of Microsoft's SQL Server 2000. Specifically, the definition of the C_HV1 is shown in Fig. 7.
As we will see next, this union query has an almost constant response time as updates take place at the participating data warehouses. Although the result set is
always up-to-date with the source data, this case
Table 2
Queries used for the evaluation process

  Qa: Select * from c_hvx where availqty <= 40 order by p_name
  Qb: Select p_name, sum(availqty) from c_hvx group by p_name having sum(availqty) <= 40

Create view C_HV1 as
(select p_name, availqty, 1 as mySUPPLIER from
 openquery(DW01, 'select p_name, availqty from C_MV')
union all
select p_name, availqty, 2 as mySUPPLIER from
 openquery(DW02, 'select p_name, availqty from C_MV'))

Fig. 7. The hyper-view used for answering queries over union-based MVs.
has the slowest response time of the presented cases because the result set is not materialized at
the HW but recomputed on demand. The only
MVs are those at the participating data ware-
houses. In addition, this case is susceptible to
network/link failures because the HW must query
remote data warehouses.
The hyper-view (C_HV3) is defined as a table at
the HW and we populate the C_HV3 by issuing aquery similar to that of Fig. 7. Although this case
has an initial materialization overhead, consequent
updates to the participating data warehouses are
applied incrementally to the C_HV3. This is
accomplished through the use of the agent and
integrator entities where the former transmits the
base table deltas and the latter installs the updates to the HV by issuing the appropriate queries as shown in Table 1. In this way, the hyper-view is
always up-to-date with the source MVs and con-
sequent queries are run against this table only (i.e.
C_HV3) providing enhanced system and informa-
tion quality, i.e., fast system response time and up-
to-date data. However, this case is susceptible to network/link failures because the agent and integrator entities communicate using TCP/IP.
We ran the evaluation queries (i.e. Qa and Qb) on every C_HVx and performed updates at the source MVs at different times, namely t1 and t2,
Table 3
Average system response time for the evaluation queries

         t1           t2 (update)    t3           t4 (update)    t5
         Qa    Qb     Qa    Qb       Qa    Qb     Qa    Qb       Qa    Qb
C_HV1    20.2  37.2   20.5  37.1     20.2  37.2   20.2  37.2     20.2  37.2
C_HV3    28.6  30     1     2.8      0.51  2.4    1     2.8      0.51  2.4
with t0 being the initialization phase. Table 3 compares the average system response time of the evaluation queries that retrieve the result of Qa and
Qb, that is, the available quantity of each product
and participating distributor, where it is less than
or equal to 40 pieces.
As we can see, the case of answering queries
over union-based non-materialized hyper-views
that depend on MVs (i.e. C_HV1) has almost constant system response time. This was expected
due to the fact that the C_HV1 is not materialized
but recomputed on demand. With reference to the
above table, the proposed architecture (i.e. case of
C_HV3) outperforms the other approach, due to
the fact that updates are applied incrementally to
the hyper-view. Thus, subsequent queries are evaluated only against the hyper-view, which is a local table/view at the HW.
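The incremental path that makes C_HV3 fast can be sketched in miniature. The snippet below is a single-process approximation, assuming a trigger-based agent and a polling integrator; the TCP/IP transport and the exact delta queries of Table 1 are elided, and all names other than C_MV and C_HV3 are illustrative:

```python
import sqlite3

# Agent/integrator sketch: triggers on the source MV (a plain table, as
# in the prototype) log row deltas, and the "integrator" drains the log
# into the materialized hyper-view table C_HV3.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE C_MV  (p_name TEXT, availqty INT);
CREATE TABLE C_HV3 (p_name TEXT, availqty INT, mySUPPLIER INT);
CREATE TABLE delta (op TEXT, p_name TEXT, availqty INT);

-- the "agent": record base-table deltas instead of recomputing the view
CREATE TRIGGER mv_ins AFTER INSERT ON C_MV BEGIN
  INSERT INTO delta VALUES ('ins', NEW.p_name, NEW.availqty);
END;
CREATE TRIGGER mv_del AFTER DELETE ON C_MV BEGIN
  INSERT INTO delta VALUES ('del', OLD.p_name, OLD.availqty);
END;
""")

def integrate(supplier=1):
    """The "integrator": install pending deltas into C_HV3, then clear them."""
    for op, name, qty in con.execute("SELECT * FROM delta").fetchall():
        if op == "ins":
            con.execute("INSERT INTO C_HV3 VALUES (?, ?, ?)",
                        (name, qty, supplier))
        else:
            con.execute("DELETE FROM C_HV3 WHERE p_name=? AND availqty=? "
                        "AND mySUPPLIER=?", (name, qty, supplier))
    con.execute("DELETE FROM delta")

con.execute("INSERT INTO C_MV VALUES ('bolt', 30)")
con.execute("INSERT INTO C_MV VALUES ('nut', 10)")
integrate()                                    # install the two inserts
con.execute("DELETE FROM C_MV WHERE p_name = 'nut'")
integrate()                                    # install the delete delta
print(con.execute("SELECT * FROM C_HV3").fetchall())
```

Queries against C_HV3 then touch only the local table, which is the reason for the low t2–t5 figures in Table 3; only the small delta sets cross the network between refreshes.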
5. Conclusions and further research
In this paper we have presented an original
approach to view materialization in which an MV
depends on remote MVs, identifying it as a hyper-view. Moreover, we proposed an algorithm for
maintaining this kind of MV and presented a
prototype, which implemented the algorithm in the
context of experimental work. As expected, our
approach outperforms the approaches usually
employed by commercial RDBMSs, since it in-
stalls the updates incrementally.
However, we modeled an MV as a table and provided a simple maintenance technique by
installing triggers on source tables. This is due to
the fact that a trigger on an indexed view is fired
upon explicit insert/update/delete on the view it-
self. We believe it would be more convenient and
less computationally expensive if there was a spe-
cial kind of trigger on an indexed view that fired upon the RDBMS's internal view maintenance process.
Moreover, in our experiment we assumed that
the base tables in each data warehouse are avail-
able and we installed triggers on them. In this way,
we bypassed an intermediate level, that of an MV.
Further research should concentrate on installing
triggers directly on the MVs instead of the base
tables. In addition, we believe that a promising avenue of research is the definition of a cost model for deciding whether the intermediate level (i.e. an MV) should be bypassed, and whether it should be constructed if it does not already exist. Obviously, such decisions have an impact on the consistency of the views and consequently on the overall efficiency and effectiveness of the collaborative-business workflow, directly affecting system and
information quality.
We also considered a hyper-view as a union
over MVs. An interesting research direction would
be to incrementally maintain hyper-views defined
as a join between some MVs of the participating
data warehouses or more complex queries, related
to, for example, annual projections of the return on investment, which may potentially include sub-
queries. Also, further research should deal with the
consistency of data elements from different data
sources having different semantics. This case was outside the scope of this paper, as we assumed such data to be consistent.
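As a hedged illustration of the join-based direction suggested above (assuming the same Fig. 7 schemas, with invented sample rows and an invented view name), a hyper-view matching products held by both participating warehouses might look like:

```python
import sqlite3

# Attached in-memory databases again stand in for the remote linked
# servers; the join replaces the UNION ALL of Fig. 7.
con = sqlite3.connect(":memory:")
con.execute("ATTACH ':memory:' AS DW01")
con.execute("ATTACH ':memory:' AS DW02")
con.execute("CREATE TABLE DW01.C_MV (p_name TEXT, availqty INT)")
con.execute("CREATE TABLE DW02.C_MV (p_name TEXT, availqty INT)")
con.executemany("INSERT INTO DW01.C_MV VALUES (?, ?)",
                [("bolt", 30), ("nut", 5)])
con.executemany("INSERT INTO DW02.C_MV VALUES (?, ?)", [("bolt", 25)])

# Products stocked in both warehouses, with per-warehouse quantities.
con.execute("""CREATE TEMP VIEW C_HVJ AS
    SELECT a.p_name, a.availqty AS qty_dw01, b.availqty AS qty_dw02
    FROM DW01.C_MV a JOIN DW02.C_MV b ON a.p_name = b.p_name""")
rows = con.execute("SELECT * FROM C_HVJ").fetchall()
print(rows)
```

Incrementally maintaining such a join-based hyper-view is harder than the union case, since a single source delta can affect many joined rows.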
Concluding, we argue that for extended enter-
prises, past research on DWR offers little or no insight as it focuses mainly on update propagation
through MVs for single warehouses. We believe
that this paper draws attention to a new set of
challenges that demand our attention, far beyond
view maintenance, which is just one step of the
complete refreshment process. Other steps may
concern data cleansing and data merging due to
potential data and/or schema differences between the participating warehouses, and data custom-
ization through the concept of the hyper-views.
References
[1] Divyakant Agrawal, Amr El Abbadi, Ambuj Singh, Tolga
Yurek, Efficient view maintenance at data warehouses, in:
The Proceedings of the ACM SIGMOD International
Conference on Management of Data, Tucson, AZ, USA,
13–15 May 1997, pp. 417–427.
[2] Mokrane Bouzeghoub, Françoise Fabret, Maja Matulovic-
Broque, Modeling DWR Process as a Workflow Applica-
tion, in: The Proceedings of the International Workshop
on Design and Management of Data Warehouses, Heidel-
berg, Germany, 14–15 June 1999, pp. 6-1–6-12.
[3] Lei-da Chen, Khalid S. Soliman, En Mao, Mark N.
Frolick, Measuring user satisfaction with data warehouses:
An exploratory study, Information and Management 37
(2000) 103–110.
[4] Lyman Do, Pamela Drew, Wei Jin, Vish Jumani, David
Van Rossum, Issues in developing very large data ware-
houses, in: The Proceedings of the 24th International
Conference on Very Large Databases, New York City,
NY, USA, 24–27 August 1998, pp. 633–636.
[5] W.H. DeLone, E.R. McLean, Information systems success:
The quest for the dependent variable, Information Systems
Research 3 (1) (1992) 60–95.
[6] Lingli Ding, Xin Zhang, Elke A. Rundensteiner, The MRE
wrapper approach: Enabling incremental view mainte-
nance of data warehouses defined on multi-relation infor-
mation sources, in: The Proceedings of the ACM Second
International Workshop on Data Warehousing and
OLAP, Kansas City, MO, USA, 6 November 1999, pp.
30–35.
[7] Michael Hammer, The superefficient company, Harvard
Business Review (2001) 82–91.
[8] Costas Petrou, Stathis Hadjiefthymiades, Drakoulis
Martakos, An XML-based, 3-tier scheme for integrating
heterogeneous information sources to the WWW, in: The
Proceedings of the International Workshop on Internet
Data Management, 10th International Workshop on
Database & Expert Systems Applications, Florence, Italy,
1–3 September 1999, pp. 706–710.
[9] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick,
Making Views Self-Maintainable for Data Warehousing,
in: The Proceedings of the 4th International Conference on
Parallel and Distributed Information Systems, Miami
Beach, FL, USA, 18–20 December 1996, pp. 158–169.
[10] Nick Roussopoulos, Materialized views and data ware-
houses, ACM SIGMOD Record 27 (1) (1998) 21–26.
[11] Kenneth Salem, Kevin Beyer, How to roll a join: Asyn-
chronous incremental view maintenance, in: The Proceed-
ings of the ACM SIGMOD International Conference on
Management of Data, Dallas, TX, USA, 16–18 May 2000,
pp. 129–140.
[12] Transaction Processing Performance Council, TPC Benchmark D (Decision Support), Working Draft 6.0, Trans-
action Processing Performance Council. Available from
<www.tpc.org>, 1993.
[13] Hui Wang, Maria Orlowska, Weifa Liang, Efficient
refreshment of materialized views with multiple sources,
in: The Proceedings of the International Conference on
Information and Knowledge Management, Kansas City,
MO, USA, 2–6 November 1999, pp. 375–382.
[14] Jian Yang, Mike P. Papazoglou, Interoperation support
for electronic business, Communications of the ACM 43
(6) (2000) 39–47.
[15] Y. Zhuge, H. Garcia-Molina, J. Hammer, J. Widom, View
maintenance in a warehousing environment, in: The
Proceedings of the ACM SIGMOD International Confer-
ence on Management of Data, San Jose, CA, 22–25 May
1995, pp. 316–327.
[16] Yue Zhuge, Hector Garcia-Molina, Janet L. Wiener, The
Strobe Algorithms for multi-source warehouse consistency,
in: The Proceedings of the Conference on Parallel and
Distributed Information Systems, Miami Beach, FL, USA,
18–20 December 1996, pp. 146–157.
[17] Yue Zhuge, Janet L. Wiener, Hector Garcia-Molina,
Multiple view consistency for data warehousing, in: The
Proceedings of the 13th International Conference on Data
Engineering, Birmingham, UK, 7–11 April 1997, pp. 289–
300.
[18] Xin Zhang, Elke A. Rundensteiner, Integrating the main-
tenance and synchronization of data warehouses using a
cooperative framework, Information Systems 27 (4) (2002)
219–243.