An "Intelligent" Extraction and Aggregation Tool for Company Data Bases

145

An 'Intelligent' Extraction and Aggregation Tool for Company Data Bases Peter STECHER and Pekka HELLEMAA IBM Europe, F-92075 Paris-la D~fense, France

To ease the access to data bases for the casual user, a system called SMARTY has been developed as an intelligent front end to the relational data base management system SQL/DS. A user only needs to specify the data attributes and possibly their value ranges which he wants to be displayed. SMARTY will test the meaningfulness of this request for information and subsequently generate and execute the SQL statements for deriving the information from SQL relations. SMARTY is capable of extracting or aggregating detailed information in a company database. It can be regarded as an expert system on top of a relational DBMS to improve the user interface, or as a query system to a universal relation. SMAR'I~' was written in Prolog. Both Prolog and SQL/DS are running under the IBM operating system VM/SP 3.

Keyword~: Relational data base, SQL, Universal relation implementation, Query expert system, Logic programming application.

Peter Stecher has been with IBM since 1968. He initially worked as a systems engineer for manufacturing and air- lines industries in Germany and En- gland. In 1980 he joined IBM's European Headquarters in Paris where he currently is information systems ar- chitect. He holds the degree 'Diplom- mathematiker' from the University in Munich (1968) and a Ph.D. in Com- puting from the London School of Economics (1978). In his Ph.D. thesis he designed a real-time rule-based sys-

tem to support the execution of business processes.

Pekka Hellemaa joined IBM Finland 1971 and is currently on assignment at IBM's European Headquarters since 1983. All this time he has worked as application analyst/programmer. His major interests include developing program generators and application independent tools for end users.

North-Holland Decision Support Systems 2 (1986) 145-158

I. Introduction

1.1. Problem description

Lmge organisations in the public and business domains collect a large amount of detailed information about their customers, products, services, activities, resources, etc. for their day-to-day operations. This information which is often com- puterized is used later to analyse the past performance of the organisation and to plan for its future. In business this activity is the task of headquarters such as ours. Headquarters functions have little or no concern for detailed information on customers, products, t ransactions, resources. Deta i led in fo rmat ion represents the ' raw' da ta f rom which a large variety of in format ion is de-

rived for the purpose of business per formance

analysis a nd planning. The basic processes of de- r ivat ion consist of selection, l inking, project ion

and aggregat ion of the raw data. The k inds of in format ion which can be derived

from the raw data base, the so-called end user views, are in the range of billions. An example of

an end user view is Coun t ry - Year - N u m b e r of Inhabi tants , mean ing ' n u m b e r of inhabi tan ts by country and

year ' . For tunate ly , most of the theoretical ly feasible

end user views are not meaningful or not signifi-

cant to end users. The number of significant end

user views is far too large to be generated, in par t icu lar as many of them are not requested

regularly. Those which are requested by end users

depend very much on the business and its devel-

opment . This s i tuat ion lends itself qui te natural ly to a

system with the following features:

(1) The set of the most 'popular' end user views is produced on a regular basis. This set is small.

(2) Any other end user view which is requested is generated only on demand.

This is exactly what SMARTY, our extractor

0167-9236/86/$3.50 © 1986, Elsevier Science Publishers B.V. (North-Holland)

146 P. Stecher, P. Hellemaa /An Intelligent Extraction and Aggregation Tool for Data Bases

for large data bases, tries to achieve. Fig. 1 shows the dataflow around SMARTY. In addition the generation of end user views should happen in an optimum way (i.e., with the fastest response time), and automatically (i.e., an end user specifies the view he wants to see, and SMARTY does the job of producing it). The set of files from which an end user view on demand is derived, comprises the raw data base plus all files derived from it and which exist in the system at that time. This means, the set of files to be considered for derivation is dynamic. This is the crux of SMARTY.

Today end user views are often generated by means of 'hard-coded' computer programs. If end user views change, it is often difficult and time consuming to incorporate the changes in these programs.

There is another aspect of querying a data base, which SMARTY addresses: It is the ease of access for the casual user to a data base.

The introduction of nonprocedural query lan- guages such as SQL/DS [3] have brought about substantial progress for users who have limited

knowledge of how data is stored or retrieved by the computer. Nevertheless the knowledge required for querying a data base is still consider- able. In addition to the syntax of the language, an end user has to know

- the exact names of data attributes, - the names of relations, - the attribute composition of relations.

If we can, we should alleviate the task of the casual user and thereby extend the set of candi- date users of the data base system. The aim of SMARTY is to have an end user select from a menu only the data attributes he wants to see. He should not be required to indicate the name of any relation. SMARTY would take care of producing this end user view in an optimum way including joining relations.

Query by Example [20] as implemented in the Query Management Facility of SQL/DS goes a long way towards these goals: A user may ask for the attribute composition of a relation and specify

I l l ' ' ' l l l l l l ' ' ' | l l i I l ' ' ' t t l J ) ! ) P

: : : : : : : ! ! t ! t t

. t t t t t t ) . : . t t . " ) | ' ttttl ltttt

: ORDER : : : : : : & : : CUSTOMER : : PRODUCT : : INVENTORY : : : : :

! t t ! ! !

j,V,) 11 t!

! !

• ,llttlt ). " :

: RAW : : DATA BASE :

1 1 ~1111111 i

DATA BASE MANAGEMENT SYSTEM

I I E oJB=Lool

( V !

CTRY-PROD-ORDER [ ]

DATABASE CATALOGUE

>

EUV

V

PROGRAM TO SATISFY REQUEST

SI'{ARTY

A" I

EUV REQUEST

i

Fig. 1. Dataflow around SMARTY (EUV stands for End User View).

P. Stecher, P. Hellemaa / An Intelligent Extraction and Aggregation Tool for Data Bases 147

selection conditions by attribute. He, however, still has to indicate the conditions for joining relations.

1.2. Intelligent extractor approach

The objectives of SMARTY are therefore four- fold:

(1) Unique and stable source of information. The source from which data is extracted or aggregated should contain only the bare essential data. It should be free of redundancy and derivations.

(2) Flexibility regarding end user view generation. Only those end user views should be generated which are actually required to conduct the business. Further, if end user views change, no program should need modification.

(3) Optimization of end user view derivation. Usually the end user view requested can be generated from both the raw data base and from files derived from it. Among the many possible ways SMARTY should try to come as close as possible to the optimum way. This objective belongs to the subject of how to improve performance in relational data base systems. One class of approaches deals with access path optimization, e.g., [11,16]. Another class investigates equivalent query transfor- mation in order to reduce operations on relations, e.g., [1,13]. A third class suggests the materialization of joins frequently needed for satisfying data requests, e.g., [15]. Our approach by using relations derived from the raw data base for extraction and aggregation can be regarded as an extension to the latter class.

(4) User-friendly interface. An end user may specify his request for information without reference to specific files and without know- ing whether the information has to be provided through extraction or ag3regation of data. The first part of this objective expresses what the universal relation is set out to achieve [2,17]. Hence SMARTY can be seen as an attempt to attain goals of the universal relation interface in a large organisation. The knowledge which SMARTY needs to be friendly, the business semantics, should be stored as tables to facilitate modification.

Expert systems may be employed to allow for a wider range and more concise formulation of data base queries. The addition of general rules and inference mechanisms to a DBMS may improve the user interface [8,19]. These authors distinguish in this context between loose and tight coupling of expert systems with DBMSs. Loose coupling means extracting a snapshot of the required data from the data base and storing it in the internal format of the expert system before the expert system begins to work. Tight coupling refers to changing a user query into a form whose evalua- tion can be deferred while certain optimization is performed on the query. By these definitions we may regard SMARTY as an expert system which is tightly coupled to a relational DBMS.

1.3. System implementation

SMART¥ grew out of a study wlfich aimed at simplifying the organisation of company data and access to it. We started with a simple prototype system where we tested a number of concepts. We then gradually refined the prototype taking into account user reactions to system behaviour until we arrived at a production level system. SMARTY has always been an implementation project rather than a research project.

SMARTY runs under the operating system V M / S P 3. It is programmed using VM/Prolog [7], and for full screen support Restructured Ex- tended Executor (REXX) [4] and ISPF Release 2 [6]. The DBMS for the raw data base and derived files is SQL/DS.

The current version of SMARTY derives end user views from the raw data base. We are working on the extension to SMARTY to generate end user views in a more efficient way by taking into account all relations already derived from the raw data base.

SMARTY can simultaneously be employed by any number of users and for any number of applications. Each application is defined by its raw data base.

The rest of the paper is organized as follows: Section 2 compares the SMARTY syntax with the SQL syntax and defines features of the raw data base. In section 3 we describe the algorithms for validating and generating end user views. We then focus on the system implementation. Finally the aspect of SMARTY as a universal relation interface is discussed shortly.

148

2. System overview

2.1. System structure

P. Stecher, P. Hellemaa / An Intelligent Extraction and Aggregation Tool for Data Bases

" V! Fig. 2 shows the lay-out of the system. A user

specifies to SMARTY his request for information in terms of data attributes (the so-called end user view) and the corresponding value ranges desired. For instance

Country 7 : 11 Product 128 (1) Yearmonth 8301 : 8311 Order

meaning 'Display the order value by country, product and yearmonth. Country numbers should vary from 7 to 11, yearmonths from 8301 to 8311, while only one product number should be taken, namely 128'.

SMARTY looks at the set of the relations which exist at the moment. This set comprises the raw data base and any derived relations. Subse- quently SMARTY analyses ways of deriving the end user view and selects the one with the fastest response time. Then SMARTY generates the SQL statements to derive the end user view. For end

VM / SP

~ <'

I

user view (1) we get

END USER REQUEST

v VM/PROLOG

SMARTY

/ /

SQL / DS

I SQL STATEMENTS

I v

QHF

I OUTPUT TABLE

Fig. 2. System lay-out of SMARTY.

SELECT DISTINCT COUNTRY, PRODUCT, YEARMONTH, SUM(ORDER)

FROM ORD WHERE COUNTRY BETWEEN 7 AND 11 AND

128 = PRODUCT AND YEARMONTH BETWEEN 8301 AND 8311

GROUP BY COUNTRY, PRODUCT, YEARMONTH

END USER qEW

(2)

The SQL statements are passed on to the interactive SQL facility QMF [5] which executes them. The final result, namely the end user view requested, is presented to the end user in the form of an SQL table, see Table 1 as a possible output of query (1).

By comparing (1) with (2) we observe the following:

(1) All attributes are taken from one relation only called ORD.

(2) Though in (1) there is no hint to summarization, SMARTY decided that order values have to be aggregated by country, product and yearmonth in order to satisfy the request.

In the following, when value ranges of an end user

view request do not matter, we will write it hori- zontally rather than vertically. Thus (1) will be written

Country - Product - Yearmonth - Order.

2.2. Short description of the SQL syntax for queries

To position the SMARTY syntax we will com- pare it with the SQL syntax for queries. Before doing so we shall sketch the latter as it was implemented in SQL/DS [3].

The basic form of the SQL query command is

SELECT data attribute list FROM relation list WHERE search conditions and join conditions

GROUP BY data attribute list


Table 1 Example of an output table.

Country Product Yearmonth Order

7 128 8301 4 7 128 8302 0 7 128 8306 9 7 128 8311 5

10 128 8301 8 10 128 8303 1 10 128 8307 2 11 128 8302 5 11 128 8311 6

The attribute list of the SELECT clause defines the data attributes to be displayed in response to the query. An attribute in the list may be preceded by one of five built-in functions

AVG - Calculates the average value of the attribute to be selected,

SUM - Calculates the total value of the attribute to be selected,

MIN - Finds the smallest value of the attr ibute to be selected,

MAX - Finds the largest value of the attribute to be selected,

C N T - Counts ~he number of rows.

If any of these functions occur in a query then it may be desirable to apply them to groups of rows of relations that have matching values in some attributes. The definition of the groups is specified with a G R O U P BY clause, see for instance (2).

The F R O M clause identifies the relations from which data is to be selected. If a data attr ibute name occurs in more than one relation of the F R O M clause, the attribute name must be pre- fixed with the relation name of the relation desired to properly identify the attribute. For example O R D . C O U N T R Y identifies the attr ibute C O U N T R Y of the O R D relation.

The search conditions of the W H E R E clause specify a filter for the rows of the relations. These conditions are basically a comparison between two items. An item can be an attribute name, a constant or an expression, i.e., a combination o-~ attr ibute names and constants connec,ed by an arithmetic operator. Each condition may include one of the comparison operators - - , ~= > > = < , < = . Multiple conditions are connected by

the operators A N D and OR, pGssibly modified by NOT. Again, query (2) may serve as an example.

SQL provides four additional functions for search conditions: BETWEEN, IN, NULL, LIKE. BETWEEN enables us to select rows of relations if the value of a data attribute is between two limits, e.g., Country BETWEEN 7 AND 11. IN enables us to select rows of relations if the value of a data attribute is contained in a list of values, e.g., Country IN (7, 8, 9, 10, 11). The N U L L function provides a way to look for null values in a relation, e.g., Order IS NULL. T :: LIKE function enables us to search for character-string data that partially matches a given string, e.g., Year- month LIKE '83_ ' .

SQL also allows a value or set of values in a search condition to be computed by a query, called subquery. The result of the subquery is substituted directly into the search condition in which it appears. The subquery should return only one value, except when using the IN function. For instance

SELECT CUSTOMER, Y E A R M O N T H , O R D E R F R O M ORD W H E R E ORDER >

(SELECT AVG(ORDER)

F R O M ORD)

will find those customers and months where the o r d e r v a l u e is l a r g e r t h a n t h e a v e r a g e o r d e r v a l u e .

The join conditions of the WHERE clause specify some relationship between relations to be joined. Of particular interest are the cases where a relation is joined to itself, as it happens in recursive queries.

2.3. The S M A R T Y syntax

The SMARTY syntax compares to the SQL syntax as follows:

S E L E C T clause and GROUP B Y clause. Rather than typing the data attribute list a user selects attributes from a given list one by one or as a predefined end user view request. This will be described in section 'End user view definition' below. SQL allows a user to input an expression, i.e., a combination of data attributes and constants connected by arithmetic operators, as a data attribute, SMARTY does not.

The decision whether or not summarization of

150 P. Stecker, P. Hellemaa / A n Intelligent Extraction and Aggregation Tool for Data Bases

numeric attributes has to happen is left to SMARTY. If, however, a user wants AVG, MIN, MAX, C N T of numeric attributes, he has to specify it explicitly. The definition of the attributes by which to group is done in SQL in the G R O U P BY clause, whereas SMARTY does it automatically.

In the rest of the paper we shall treat summarization as the only form of aggregation. F R O M clause. Because SMARTY is a universal relation interface the user sees the data base as one relation. Therefore the F R O M clause has no correspondence in the SMARTY syntax. W H E R E clause. Join conditions need not be specified because there is no reference to relations.

As far as search condition are concerned there are some minor differences in notation which we introduced to facilitate the keying in of attr ibute values, see table 2.

Also, SMARTY accepts all values without quotes, e.g., M U N I C H instead of ' M U N I C H ' , except the ones containing % _ blank ' ", while SQL requires them for all non-numeric values. Note: If a value contains or % and is not in quotes it is interpreted as LIKE, see table 2.

Expressions in SQL such as Coun t ry> = 2 3 A N D Country'S=40, are shortened in SMARTY to Country > = 23 A N D a = 40.

Subqueries are not supported in SMARTY. We may say that on the one hand the syntax of

SMARTY is simpler than that of SQL mainly because there is no reference to relations and join variables. On the other hand the price we pay for that is a limitation in functions. In addition to the items we mentioned above SMARTY currently cannot handle

(1) queries which require more than one tuple variable, like 'Display the country numbers whose order inflow this month is greater than last month's ' .

(2) recursive queries. If recursive queries are re-

Table 2 Differences in search conditions.

SQL

BETWEEN value 1 A N D value 2 IN (value 1, value 2 . . . . ) IS N U L L LIKE '83 '

SMARTY

value 1: value 2 value 1, value 2 . . . . N U L L 83_

(3)

quired by the end user the part of the data base where recursion occurs has to be 'f lat- tened out' . This can be done by redefining relations in SQL. Thereby the depth of recursion which may be allowed for queries becomes bounded at application definition time and has to be given explicitly in the end user view request. certain combinations of values for data attributes. For instance the search condition

C o u n t r y = 12 AND Product = 128 OR Country = 14 A N D Product -- 200 (3)

cannot be defined in SMARTY. The closest one can come to (3) is

Country 12 OR 14 Product 128 OR 200 (4)

SMARTY will translate (4) to the ~QL search conditions

(Country = 12 OR Country = 14) A N D

(Product = 128 OR Product = 200)

which is different than (3). The reason is that in SMARTY values on a line are in an A N D relationship with values on any other line.

2.4. Raw data base

Data which is transmitted to our headquarters is normalized and integrated into the raw data base. This data base contains the bare essential information on business objects (or er, tity sets) such as products, customers, orders and invento- ries from which all other information can be derived. All instances of an entity set are gathered in one SQL relation which is in third normal form as defined in the relational data base model.

Even though one reason for keeping relations in third normal form is to avoid data redundancy and simplify file updating, the main motivation stems from business considerations: Since we do not know in advance which end user views will be required, we shall be best off in the long term if we model the real world by the basic objects which are used to run the business. The basic objects are customer, product, order, installed inventory, etc. As long as the basic objectives and processes of the business are not changed, the basic objects in the shape of the entity sets are not changed either. Experience has taught us that it is

P. Stecher, P. Heilemaa / An Intelligent Extraction and Aggregation Tool for Data Bases 151

the particular aspect of the business which is prone to changes. Focusing on a particular aspect is equivalent to defining a particular end user view. In case the basic business objects do change, then the raw data base concept allows us in a simple way to change an existing entity set or introduce a new one.

It should be noted that our recommendations to have data bases in third normal form is not of general nature. In most applications it is not feasible for performance reasons. In particular oper- ational applications like order entry or inventory control very often require data bases which are not in third normal form. Our recommendation applies to informational systems like business performance analysis and planning.

The relational schema of the raw data base can be represented as a graph whose nodes are the entity sets and whose arcs indicate "foreign key connectivity" between entity sets, see fig. 3. The direction of an arc specifies the direction of the 'foreign key connectivity': An arc, e.g., from O R D to CTY, indicates that ~he key of the entity set to which the arc is pointing, e.g., CTY, is contained in the entity set where it originates, e.g., ORD. Relations are joined only by 'foreign key connectivity'.

Note: Whenever we talk of joins, we mean natural joins.

Two basic assumptions for the raw data base are therefore

(1) Each relation is in third form. (2) Relations are joined only by 'foreign key con-

nectivity'.

We define a path between two entity sets as being a sequence of consecutive arcs crossed only in the direction of the arcs, and a cycle as a path where some entity set occurs twice. For instance, the path between CTY and GA2 in fig. 3 is CTY- > GA1- > GA2. There is, however, no path between CTY and TYP.

If any two entity sets are connected by more than one path we require that the business mean- ings of the paths are the same. This is the so-called 'one-flavour assumption' [17]. It says - loosely speaking - that the significance of data relationships should not depend on the path chosen in the data base. For example if we take the data relationship expressed by Country - Order (meaning

'Order value by country') it should not matter whether we select path O R D - > C T Y or path ORD- > CSC- > CTY in fig. 3 to generate it.

From these assumptions and definitions we can immediately conclude that there are no cycles in the raw data base (see [12] for a rigorous proof). If this is the case, then we can assign a number, called entity level number, to each entity set by the following algorithm:

(1) We pair each entity set with all entity sets where the 'foreign key connectivity' holds, e.g., CTY is paired with the list ORD, MAC, CSC.

(2) All entity sets with an empty list are assigned entity level number 1, e.g., ORD, MAC.

(3) We then remove all entity sets with level number 1 from the lists and repeat step 2. This time, however, level number 2 is assigned. And SO o n .

The concept of entity level number is used extensively in the SMARTY program. Entity sets of level I are special in so far that they contain the most detailed description of the business.

If there is more than one path connecting two entity sets, then the result of joining relations on these paths is independent of the path chosen (after projecting out those attributes which occur only on one path). This is due to the one-flavour assumption and the 'foreign key connectivity' (lossless join theorem, [14]). We may conclude: By assuming (i) third normal form, (ii) 'foreign key connectivity', and (iii) one-flavour links of relations in the raw data base we ensure path-independence of join operations. This will allow us to select those paths for joining which give the best response time. Path-independence of join operations is an essential feature, indeed, and should hold in any data base where users are allowed to navigate freely.

It is interesting to consider the question 'Can data dependencies on the level of end user views be construed in a straightforward way from data dependencies on the level of the raw data base?' If this was the case we could then define end user view semantics based on the semantics of the raw data base. Unfortunately this is not possible in general. For instance, a key attribute of an entity set can occur in an end user view as a non-key attribute, and conversely a non-key attribute of an entity set can occur in an end user view as a key attribute. Consider the example of the colour of a

152 P. Stecher, P. Heilemaa /An Intelligent Extraction and Aggregation Too/for Data Bases

CSC -

CTY-

GAI - GA2 -

IND - ISI - MAC -

HOD -

MRC -

ORD -

PC1 -

PC2 -

PC3 -

PC4 -

TYP -

YH0 -

Customer Country Geographical Classification 1

Geographical Classification 2

Industry Group International Industry Code Inventory Machine Model Marketing Unit Order Product Classification 1

Product Classification 2



Machine Type Yearmonth

A

A____~

CSC

h . . . . . . .

ORD I

I

A-A-A

I

Fig. 3. Entity graph of the SMARTY raw database.

~ PC4 A

I ' ,,

I

A-A-A

'1 I

MAC

I

I

serialized machine. In the entity set of the machine its colour will occur as a non-key attribute, while there may be an end user view which is asking for all machines of a particular colour. Colour then becomes an attribute of an all-key end user view.

2.5. End user view definition

There are three ways how a user can define his view

(1) He calls an end user view by name. SMARTY will then display the data attributes and their value ranges which make up this end user view. The name was given previously to the view by the user himself or by the SMARTY administrator. Before: execution a user may change the data attributes and value ranges.

(2) He selects a view from a list of end user view names. Again, SMARTY will display the data attributes and their value ranges which make up this view. Before execution a user may

change the data attributes and value ranges. (3) l l e calls an empty screen and defines an end

user view by inputting data attributes or selecting them from a list of attribute names, and then entering their value ranges. He can then save it by giving it a name.

The sequence of attributes in the end user view defines the sequence of columns in the output table. The way SMARTY interprets the end user view, however, is independent of the input sequence.

3. End user view generation

In response to an end user view request the current version of SMARTY proceeds in the following phases.

(1) Proof of derivability of the end user view. Given any end user view request, SMARTY proves whether or not this end user view is semantically


correct. If it is correct, SMARTY identifies the raw data base relations necessary to derive it. If not, SMARTY states the reason(s).

(2) Generation of the end user view from the raw data base. In this phase SMARTY uses subgraphs created in the course of the derivability proof to generate the SQL statements for deriving the given end user view from the raw data base. SMARTY minimizes the number of joins necessary to create the end user view.

The extension of SMARTY on which we are working will proceed with phase 2 only if the following phase has no positive result:

(3) Generation of the end user view in a more efficient fashion. In this phase SMARTY takes into account all existing relations to identify a derivation path that is more efficient than the derivation from the raw data base. Although we would like to find the opt imum derivation path, we are well aware that this often may be too costly to compute. Thus we contend ourselves with a kind of local optimum.

Below we shall focus on the algorithms used in phases 1 and 2, and sketch the basic problems of phase 3.

SMARTY has been developed in such a way that it is independent from any pecularities of its business applications. This was done by keeping all the business semantics in two semantic tables:

(1) End user view attributes. This table contains a line for each attribute. A line defines, among others, its name, the entity set to which it initially belongs (its 'home ' entity set), * whether it is a ' sum' attribute (i.e., whether it can be summarized), whether it may occur in an end user view as key o r / a n d non-key, and additional attributes which must be present in the end user view. Examples: Attr ibute Country has CTY as home entity. Country may occur as a key in an end user view and as a non-key in another end user view. Order is a sum attribute whose home enti ty is ORD. It may occur only as a non-key in an end user view. Customer number must be present if Customer name occurs in the end user view.

(2) Composition of entity sets. This table contains a line for each entity set. A line defines, among

others, its level number on the entity graph, * the attributes which make up the key, the non-key attributes, and the foreign key attributes. * As all entity sets are in third normal form, the non-key attributes are functionally dependent on the key. The attribute classes of the non-key attributes are also recorded here, e.g., sales, customer. This information is used to prompt users in case of failure, as we shall see later. This table is used to derive the arcs between entity sets as shown in fig. 3.

• This property can actually be derived from other properties of the semantic tables. Tables 1 and 2 incorporate all the business semantics SMARTY needs.

There is a third table which is used in phase 3:

(3) Data base catalogue. This table contains the column names and the value ranges of th,~ columns of existing relations. An extension of the system catalogue of S Q L / D S is used for this table.

3.1. Proof of derivability of the end user view

We mentioned previously that the sequence of attributes input by a user determines the sequence of columns in the output table and has no other significance. SMARTY validates each attribute, e.g. its name, its value range entered, and possibly complements the end user view with other attributes, e.g. customer name with customer number. Complementation of attributes is based on business semantics as expressed in the semantic tables.

Since an end user may input any sequence of valid attributes we have to make sure that the end user view is semantically correct before trying to generate an SQL expression for its derivation. We do it by interpreting the end user view. The basic idea is to try to connect all attributes of the view. This is done by applying the following algorithm:

(1) SMARTY assigns to each attribute of the end user view its 'home ' entity set in the raw data base. Then paths between these entity sets are established. Those paths which have a common bot tom entity set are consolidated into subgraphs. We call tilese subgraphs "common bottom subgraphs" or cb-subgraphs for short. Consider the following end user view:

154 P. Stecher, P. Hellemaa / An Intelligent Extraction and Aggregation Tool for Data Bases

Country - Product - Yearmonth - Order.

Their respective home entity sets are (see fig. 3) CTY, TYP, YMO, ORD. The paths are

ORD- > CSC- > CTY ORD- > CTY ORD- > MOD- > TYP ORD- > TYP ORD- > YMO

The cb-subgraph consists of these 5 paths. ORD is its common bot tom entity set.

There may be home entity sets which do not participate in any path. Each one of these becomes a cb-subgraph on its own.

Any end user view can be assigned a set of cb-subgraphs in this way, because there are no cycles in the raw data base, as we have seen in section 'Raw data base'.

If there is an isolated cb-subgraph i among two or more cb-subgraphs, we define the pertinent end user view as being semantically incorrect. To see why, let us look at an example: Assume an end user inputs the view

C o u n t r y - Product.

The home entity sets are CTY, TYP. There are no paths (remember, a path is a sequence of consecutive ares between two entity sets, crossed only in the direction of the arcs). Nevertheless we have 2 cb-subgraphs: CTY and TYP.

On the entity graph there are basically two ways how one can navigate between CTY and TYP- one is via ORD, the order entity set and the other is via MAC, the inventory entity set. By specifying Country - Product the end user is asking for all product types which are either ordered or installed in countries. This, at least, is the meaning of the end user view to a data theore- tician who is well aware of the relational schemes of the raw data base. A user on the other hand should not know the relational schemes at all. This is one raison d'etre of SMARTY. Then the request Country - Product is as meaningful as a customer

i A cb-subgraph is isolated, if two or more cb-subgraphs correspond to an end user view and the cb-subgraph has no entity set in common with any other cb-subgraph.

visiting a car dealer for the first t ime and asking the salesperson 'Where is the car?'. Both requests become unambiguous only if we qualify further the objects. We define the following rule:

An end user view is declared semantically incorrect, if it corresponds to an isolated cb-subgraph. We then say the end user view has attributes which have no well-defined relationships.

Semantically incorrect end user views are re- jected. SMARTY assists the end user to disambiguate his view by (i) identifying possible links between isolated cb-subgraphs and (ii) finding the data attribute classes on these links, e.g., sales, customer. It then suggests to the user to select attributes from these classes, e.g., gross sales.

(2) Each cb-subgraph is tagged either for 'selection, joining and projection' (extraction) or 'selection, joining, projection and summarization' (aggregation). Aggregation is done, if there is a sum attribute in the end user view, which occurs in the common bot tom entity set. There is, however, an exception to that rule: If the full key of this entity set is present in the end user view, then extraction takes place. For instance,

Product - Product Name,

meaning 'Display all products and their names', requires extraction.

Product - Order,

meaning 'Disp lay order value by products ' , requires aggregation, as Order is a sum attribute which occurs in the common bo t tom entity set ORD. While

Product - Order Reference Number - Order,

meaning 'Disp lay individual orders, identified by their order reference number, the order value and the product ordered' , will only need extraction, because Order Reference Number is the key of the common bot tom entity set ORD.

As ca~: be seen, the decision whether extraction or aggregation is needed, is made by SMARTY and does not burden an end user when he defines his view.

P. Stecher, P. Heilemaa / An Intelligent Extraction and Aggregation Tool for Data Bases 155

(3) If summarization has to happen the next problem to solve is finding the attributes in the end user view by which to summarize. We must keep in mind that summarization in the raw data base means (i) selecting appropriate detailed information in the raw data base in the form of tUl:,les of relations, (ii)joining them and (iii) aggregating 'sum' attributes by some attributes. These attributes make then up the key of the end user view. Let us consider for instance, the end user view

Product - Product Name - Order,

meaning 'Display order value by products and their names'. Since Order Reference Number is missing in this end user view, Order has to be summarized. The attribute by which to summarize is Product, while Product Name describes Product further, i.e., it is functionally dependent on Prod- uct.

The identification of the key is done for each sum attribute in turn by looking up the semantic table 1 'End user view attributes'. There we defined for each attribute whether or not it may occur in an end user view as key. Next, the keys of all sum attributes are compared. If they do not match, a warning is given to the user. The reason is that because the functional dependencies in an end user view are not always obvious, the final output table could be misinterpreted.

3.2. Generation of the end user view from the raw data base

by attributes which also occur in other relations of the cb-subgraph. For instance in the end user view (1) the home entity of Country is CTY, that of Product TYP, that of Yearmonth YMO and that of Order ORD. Thus the relations CTY, TYP, YMO, ORD make up the cb-subgraph of (1). However, because attributes Country, Product, Yearmonth also occur in ORD as foreign keys, no joining is required, as can be seen in (2).

Sometimes there are shortcuts for joining relations. For instance,

Product - Product N a m e - Order,

entails two feasible paths for joining: ORD- > MOD- > TYP and ORD- > TYP. SMARTY always selects the shortest paths.

SMARTY now translates the operations into a sequence of SQL statements. Sometimes SMARTY has to generate more than one SQL SELECT statement to derive a given end user view, i.e., intermediate relations have to be created in the data base. This, for instance, happens if there are two cb-subgraphs for summarization. Then the operations to be performed are selection, joining, projection, summarization with creation of two new relations, and subsequently joining of these two relations. In general intermediate relations have to be created if the operations on relations of the raw data base cannot be expressed by one SQL SELECT statement.

The resulting SQL statements are handed over to QMF for execution.

in this phase SMM' .3v uses the cb-subgraphs which were created in the course of the derivability proof to generate the SQL statements.

The result of the derivability proof above is as follows:

- The set of relations to be used for generating the end user view is identified in the form of cb-subgraphs.

- The kinds of SQL operations on these relations is determined, such as summarization, joining, projection, selection.

Before translating the kinds of operations into SQL statements SMARTY minimizes the number of joins. Sometimes a relation which participates in a cb-subgraph contributes to an end user view

3.3. Generation of the end user view in a more efficient fashion

For proving the derivability of the end user view we considered the relational schemes of the raw data base (semantic table 2) and the definition of end user view attributes (semantic table 1). For the generation of the end user view in a more efficient fashion, all relations already derived from the raw data base are taken into account, too (table 3). Basically two situations may arise:

(1) An existing relation matches an end user view requested, except for the value range. For instance, in the beginning of Yearrnonth 8401 a relation will most certainly exist for the end user view

156 P. Stecher, P. Hellemaa / An Intelligent Extraction and Aggregation Too/for Data Bases

Country Product Yearmonth Order

but not for

7:11 128 8301 : 8311

Country 7 : 11 Product 128 Yearmonth 8301 : 8312 Order

SMARTY should make use of the existing relation and go back to less aggregated relations only for Yearmonth 8312.

lations is granted beforehand. In the case of intermediate relations, access must be checked and granted dynamically. For this purpose we will have a Prolog program on a separate virtual machine which will be called from SMARTY.

The problem of updates to the raw data base is currently handled in a simple way: During updating no end user view requests are accepted. After updating the relation workspace is purged and a sequence of 'popular ' end user views is run over- night. Thereby a number of intermediate relations are created and available when users issue requests to the data base.

(2) An existing relation matches an end user view requested, except for a non-key attribute. For instance, the view requested is

Country 7:11 Product 128 Yearmonth 8301 : 8311 Order Installation

(meaning 'Display the order value and installation value by country, product and yearmonth. Country numbers should vary from 7 to 11, yearmonths from 8301 to 8311, while only one product number should be taken, namely 128') and a relation exists for

Country 7 : 11 Product 128 Yearmonth 8301:8311 Order

Again, SMARTY should go back to less aggregated relations for Installation only.

Usually there are many ways of generating the end user view requested. The problem is finding the one which will give the fastest response time. This problem is closely related to the logic inher- ent in the data base management system chosen. We are employing the optimizing facilities of SQL for this phase. We are also applying classical principles for minimizing costs in search spaces.

All derived relations are stored in a workspace on disks. This workspace is managed similar to a paging system of a processor: If it is full the relations least used are overwritten.

A particular problem of this phase is to prevent unauthorized access to data. When we use only the raw data base to derive views, access to re-

4. System implementation

VM/Prolog was selected as the programming language of SMARTY mainly for two reasons:

(1) Given the manpower and time available (initially 1 person for 1 year) a concise language was needed.

(2) The kind of reasoning to be done favoured a logic programming language.

S Q L / D S and the Prolog interpreter talk to one another. When information is needed about the columns and value ranges of relations SMARTY consults the system catalogue of SQL/DS. To this end complet e SQL commands are defined in the Prolog program and passed on to S Q L / D S via a built-in Prolog predicate. If more than one tuple satisfies a request, then all tuples can be read by Prologthrough backtracking. SQL data types such as integers, decimals, floating point numbers, variable character strings can be manipulated by Pro- log.

Full screen support for SMARTY input and output is provided by REXX and ISPF. There is an active interchange of data between our Prolog program and REXX. This interface is implemented through built-in Prolog predicates.

Our current program has approximately 380 Pro!og clauses. The CPU time spent on an IBM System 3081 for analysing an average end user view of 6 attributes and generating the SQL statements is approximately 170 milliseconds.

If we had written SMARTY in a programming language like Fortran or P L / 1 we estimate we would have needed 10 times more coding. In


numbers (including some 35% comment lines): 2600 versus 20 - 30 000 lines of code. The concise- ness of Prolog also allowed us to explore different ways of solving some crucial problems.

The fac t / ru le feature of Prolog has some practical applications: We may delay the solution of certain problems until the right moment comes. For instance, in the beginning some properties of the enti ty graph in Figure 3. were defined as Prolog facts, like arcs between entity sets. Later these facts were replaced by Prolog rules specifying proper algorithms.

relation interface would stipulate. A compromise may help: We allow a user to

specify his information request in a simple way, for instance

Country - Product - Order - Installation, (6)

rather than (5). The fact that we are interested in Products ordered or installed is shifted from the Product attribute to the Order and Installation attributes. Consequently, if we drop the latter two from (6) the end user view

Country - Product

5. SMARTY's universal relation interface

The universal relation interface is based on the universal relation assumption which says: For a given set S of relations under consideration, there exists in principle a single universal relation U such that each relation in S is a projection of U.

The pro's and con's of a universal relation have been the subject of some heated discussions, see for instance [10,18]. We don' t want to add to these. Our goal is a very practical one: How to facilitate lives of end users. The basic idea of the universal relation concept, namely to let a user express his information requirements in terms of attributes rather than files, is indeed most appeal- ing, but has sometimes to be paid for by naming attributes in an awkward way, which was rightly pointed out by Kent. For instance, if a user needs to know the kind of products which are ordered in a country he specifies an end user view like

Country - Product ordered.

If he requires the kinds of products which are installed in a country he defines

Country - Product installed.

becomes ambiguous, since it may mean the kinds of Products ordered in a country or installed in a country. By simplifying the definition of end user views, we deviate from the definition of a universal relation interface. This definition implies that the meaning of attribute relationships be somehow expressed by the names of these attributes. The approach we adopt is that we declare an end user view which is ambiguous to SMARTY as semantically incorrect, as we have demonstrated in section 'Proof of derivability of the end user view'. A user can disambiguate it by adding for example the attribute Order to indicate that he is interested in the products ordered in a country.

There is another reason why we deviate from the traditional definition of a universal relation interface. Certain end user views just don't make sense. For example, if a user requests country summary information and detail order information in the same end user view, we assume that he made a mistake. Therefore the semantic rules and the reasoning of the SMARTY program were introduced also to 'protect ' an end user from the power of a universal relation interface.

A third point of departure is that in addition to joining, projection and selection SMARTY is capable of automatically summarizing data.

But what is he to do if he needs the number of products ordered or installed? We have to introduce yet another variant name for product:

C o u n t r y - Product ordered or installed

- O r d e r - Installation. (5)

It is obvious that this approach is not very elegant nor user-friendly, it even may eventually make a system unmanageable. But it is what a universal

6. Summary

SMARTY is an implementation of a universal relation interface with additional features. The benefits of SMARTY's universal relation interface are obvious:

- It allows us to introduce as the unique source of information a raw data base which without a

158 P. Stecher, P. Hellemaa / A n Intelligent Extraction and Aggregation Tool for Data Bases

universal relation interface would be too heavy to use.

- It makes data bases and their structure trans- parent to end users.

- It allows us to change the data base scheme without affecting the users' perception of data.

The additional features which SMARTY provides are:

- Test of end user views for their meaningfulness and explanation of failure

- Aggregation of data - Business semantics is defined in semantic tables

which are easy to modify.

Currently there are some limitations of the SMARTY syntax compared to the SQL syntax, e.g., recursive queries or certain combinations of search conditions are not supported. The main advantage of SMARTY over SQL and Query by Example is that join conditions need not be given in a query. It is this feature which makes SMARTY - to our surprise - popular among people for which it was not designed: the community of SQL experts who want to save typing at the terminal. The conditions for path-independence of join operations can be weakened and SMARTY can thus be used for data bases whose relations are not in third normal form.

With the use of computers in more and more areas of the public and business domain where hitherto their use could not be justified, the amount of detail information available is increasing dramatically. We believe that for those applications where simple extraction and aggregation of data is required, tools such as SMARTY will become more and more important.

Acknowledgement

Thanks are due to Dipayan Gangopadhyay (IBM Research, Yorktown), Hubert Lehmann (IBM Sci- entific Centre, Heidelberg) and Anti Nigam (IBM Research, Yorktown) for pleasant and fruitful discussions on the theoretical aspects of SMARTY, to Marc Gillet for providing us with an efficient and semantically rich Prolog interpreter and its interfaces to S Q L / D S and REXX, and to Jean Pierre Adam and Jean Fargues, all of IBM Scien- tific Centre, Paris for getting us started with Pro- log. We wish to thank Jacques Cohen, Gerd Hart-

mann and Claude Hebert of IBM Europe for their continuous support of the SMARTY project. Also, comments by one referee are greatly appreciated.

References

[1] Aho, A., Y. Sagiv and J. Ullman, Equivalence of Rela- tional Expressions, SIAM Journal on Computing (1979) 218-246.

[2] Beeri, C., P. Bernstein and N. Goodman, A Sophisticate's Introduction to Database Normalization Theory, Proc. Int. Conf. on Very Large Data Bases-4 (1978) 113-124.

[3] IBM, SQL/Data System, Concepts and Facilities, Pro- gram No. 5748-XXJ, Document No. H24-5013 (1981).

[4] IBM, VM/SP System Product Interpreter User's Guide, Program No. 5664-167, Document No. C24-5238 (1983a) Restructured Extended Executor (REXX) is part of this program product.

[5] IBM, Query Management Facifity (QMF), Program No. 5668-972, Document No. C26-4101 (1983b).

[6] IBM, Interactive System Productivity Facility (ISPF), Ver- sion 2, Program No. 5664-285, Document No. C34-2181 (1984).

[7] IBM, VM/Programming in Logic (VM/Proiog), Program No. 5785-ABH, Document No. Bll-6374 (1985).

[8] Jarke, M. and Y. Vassiliou, Coupling Expert Systems with Data Base Management Systems, in: Artificial Intelligence Applications for Business, W. Reitman, ed., (Ablex, Norwood N J, 1984) 65-85.

[9] Jarke, M., J. Clifford and Y. Vassiliou, An Optimizing Prolog Front-End to a Relational Query System, Proc. A CM-SIGMOD Int. Conf. on Management of Data (1984) 296-306.

[10] Kent, W, Consequences of Assuming a Universal Rela- tion, ACM Trans. On Data Base Systems (1981) 539-55C

[11] Kim, M, Query Optimization for Relational Database Sys- tems, IBM Research, San Jose CA, Document Nr. ILl 3081 (1981).

[12] Osborn, S, Towards a Universal Relation Interface, Proc. Int. Conf. on Very Large Data Bases-5 (1979) 52-60.

[13] Ott, N. and K. Horlaender, Removing Redundant Join Operations in Queries Involving Views, Information Sys- tems, Vo.. 10, No. 3 (1985) 279-288.

[14] Rissanen, J, Independent Components of Relations, ACM Trans. On Data Base Systems (1977) 317-325.

[15] Schkolnick, M. and P. Sorenson, The Effects of Denor- realization on Database Performance, IBM Research, San Jose CA, Document Nr. RJ 3082 (1981).

[16] Selinger, P., M. Astrahan, D. Chamberlin, R. Lorie and T. Price, Access Path Selection in a Relational Database Management System, IBM Research, San Jose CA, Docu- ment Nr. RJ 2429 (1979).

[17] Ullman, J, Principles of Database Systems (Computer Science Press, Potomac MD, 1982a).

[18] Ullman, J, The U.R. Strikes Back, Proc. ACM Symposium on Principles of Data Base Systems (1982b) 10-22.

[19] Vassiliou, Y., J. Clifford and M. Jarke, Access to Specific Declarative Knowledge by Expert Systems: The Impact of Logic Programming, Decision Support Systems, Vol. 1, No. 2 (1985) 123-141.

[20] ZIoof, M, Query by Example, Proc. AFIPS National Computer Conference 44 (1975) 431-438.

An "Intelligent" Extraction and Aggregation Tool for Company Data Bases

Documents

Transcript of An "Intelligent" Extraction and Aggregation Tool for Company Data Bases