A rule-based approach for merging generalization hierarchies



Inform. Systems Vol. 13, No. 3, pp. 257-272, 1988. Printed in Great Britain

0306-4379/88 $3.00 + 0.00 Pergamon Press plc

A RULE-BASED APPROACH FOR MERGING GENERALIZATION HIERARCHIES

MICHAEL V. MANNINO,¹ SHAMKANT B. NAVATHE² and WOLFGANG EFFELSBERG³

¹Department of Management Science and Information Systems, The University of Texas at Austin, TX 78712, ²Computer and Information Sciences Department, University of Florida, Gainesville, FL 32611, U.S.A.

and ³IBM European Networking Center, Tiergartenstrasse 15, 6900 Heidelberg, F.R.G.

(Received 15 December 1986; in revised form 24 March 1988)

Abstract: We describe the underlying operators and rules of an interactive procedure for merging generalization hierarchies. This procedure assists a designer in defining a global view, which is a view over multiple databases. The small collection of operators permits: (1) connecting generalization hierarchies to form a new hierarchy, (2) adding and deleting subhierarchies, and (3) deleting intermediate levels. The merging procedure applies the rules in two phases. In the connecting phase, the input generalization hierarchies are connected to form a new hierarchy. The input hierarchies are typically connected at their roots but may be connected at nonroot nodes. In the subtree merging phase, the new generalization hierarchy is revised according to equivalence assertions about the attributes of the subtypes. The rules are described in detail for the case of binary merging. Extensions for the general m-way case are briefly outlined. This set of rules can form the basis of a rule-based expert system.

Keywords: View integration, database design, global views, generalization data models.

1. INTRODUCTION

A global view is an integrated view of independent databases. These databases can be heterogeneous: they can be defined in different data models and be managed by different database management systems. The definition of a global view describes (1) the objects in the view (e.g. entity types and attributes) and (2) the mapping between the global objects and the underlying objects in the local databases. The mapping part of a global view can be more complex than the mapping part of a single database view (e.g. a view in a relational database) because of the use of generalization and data conversion operations in a global view.

Generalization is an abstraction in which a class of individual objects can be considered generically as a single object [1]. The generic entity type is called the “supertype” and the individual entity types are called the “subtypes”. By definition, a subtype inherits the attributes of all of its supertypes. In global view definition, generalization is a necessary abstraction because of overlapping information in the local databases. Supertypes are defined to integrate the overlapping information among the local databases.

Global views are an integral element of multi-database technology. The goal of this technology is to provide uniform access to independent, distributed databases without displacing existing investments in hardware and software. A multi-database system serves as an interface layer above existing database and file management systems. In a multi-database environment, user queries and updates eventually affect objects in local databases. To achieve this purpose, the queries and updates reference objects in a global view. The multi-database system uses the global view definition when modifying, decomposing, and translating the original query into requests processable by the local database systems [2, 3]. Examples of multi-database systems are reported by Landers and Rosenthal [3], Gligor and Luckenbaugh [4], Heimbigner and McLeod [5], Brill and Templeton [6], and Breitbart, Olson and Thompson [7].

A global view design methodology structures the process of matching, merging and mapping the local schemas into a global view. This paper describes an interactive procedure which constitutes a major part of such a methodology. Our procedure can be incorporated into the merging step of a global view design methodology such as the one described in [8-10] or [11]. The procedure is based on a small collection of schema operators and a collection of rules which apply those operators to construct a new generalization hierarchy from the input generalization hierarchies. The schema operators add and delete entity types and attributes, connect generalization hierarchies to form a new hierarchy, and modify an existing generalization hierarchy. The rules are organized into two phases. In the connecting phase, the input generalization hierarchies are connected to form a new hierarchy. The connecting phase is driven by the overlap among the extensions of the input generalization hierarchies. In the subtree merging phase, the newly formed generalization hierarchy is revised according to equivalence assertions among the attributes of the input generalization hierarchies.


The first novel aspect of our approach is the treatment of generalization hierarchies. Previous approaches to the merging problem have addressed the merging of entity types rather than generalization hierarchies. In our case studies [12], we have found that generalization hierarchies frequently occur in schemas to be integrated. In fact, the general problem of integrating knowledge from one or more application domains ultimately reduces to an integration of generalization hierarchies. The nodes of the hierarchy may represent any object or concept.

The second novel aspect is the treatment of overlapping databases. The intersection of the extensions of two entity types can be classified into the following four classes: (1) equivalent, (2) subset, (3) overlapping without a subset relationship, and (4) null overlap. Typically, the third case is most common because entity types in different databases are defined over different application domains. For example, the engineering students of a branch campus typically overlap with, but are not a subset of, the students of the main campus. Our contribution is the capturing of further semantics of the third case to achieve a better merging.

The scope of our approach is limited to generalization hierarchies rather than lattices. We do not feel that this is a major limitation because most existing database systems only support generalization hierarchies (if at all). The extension of our work to lattices is a topic of further study.

This paper is organized as follows. Section 2 reviews the related work on schema merging. Section 3 overviews the methodology of Mannino and Effelsberg [8] as background for the discussion of the schema operators in Section 4, the binary merging procedure in Section 5, and the m-way merging in Section 6. Section 7 summarizes the paper and presents future directions.

2. RELATED WORK

Schema merging occurs in two different contexts. In global view design, two or more existing databases are integrated. In view integration, a collection of user views are integrated to form a conceptual schema satisfying all the requirements. Batini et al. [13] survey a number of methodologies for both purposes. The most significant of these are described in the remainder of this section.

One of the earliest significant works is that of Motro and Buneman [14-16]. They defined a collection of schema operators and several merging algorithms based on these operators. Their schema operators include the addition and deletion of subtypes, supertypes and attributes. Their Automatic Merge algorithm performs the merge without any user interaction, while their Cooperative Merge algorithm permits user interaction at each step. Their work is important because it is the first substantial effort on global view design. However, their results are limited because they neglected the problems of data incompatibility and partially overlapping databases that are very common to heterogeneous database environments. Their merging algorithms only address the problem of merging entity types, not generalization hierarchies. They also assume that the merging process can be optimized by minimizing the number of edges in a schema graph, which is equivalent to minimizing the number of direct (i.e. noninherited) attributes.

Elmasri and Navathe [11, 17, 18] developed matching and merging techniques for entities, attributes and relationships. They present a small number of rules to merge entities and relationships and a general algorithm to order the entities and relationships for merging. Their approach is similar to the methodology described in [8, 9] except that they deal with the problem of relationship merging, not just entity merging. The result of merging is a generalization lattice of entity and relationship types. Larson et al. [19] have discussed different types of attribute equivalences and how they can be treated during entity and relationship merging. However, their work does not address the issue of merging generalization hierarchies.

Dayal and Hwang [2] described a language to define a global view over a collection of DAPLEX [20] local schemas. Their language is roughly equivalent in expressive power to the global view definition language described in [21]. They do not present any design methodology or algorithms.

Kim et al. [22] defined a collection of schema operators for an object-oriented data model featuring multiple inheritance. They presented operators for changing classes and modifying subtypes and supertypes, and proved the consistency and completeness of their transformations. However, they did not address merging of independent databases, so some of our operators and our merging strategy are not addressed in their research.

3. OVERVIEW OF THE GLOBAL VIEW DESIGN METHODOLOGY

We must first introduce the methodology of Mannino and Effelsberg [8] so that the details of the merging algorithms can be put in perspective. For a more detailed description of our methodology, consult [9]. We also briefly discuss our representation of generalization hierarchies as a prelude to a description of the operators in Section 4.

3.1. The Methodology

The methodology consists of the following four steps.

(1) extended schema conversion: convert the local schemas into the target data model of the multi-database system;

(2) inter-schema matching: define and analyze assertions among entity types and attributes from different local schemas;

(3) schema merging: connect the local schemas by applying the schema operators;

(4) attribute mappings: define global attribute formats and data conversion operations.

In the first step, the local schemas are converted into equivalent schemas in the target data model. The target data model can be any model that supports generalization, such as the General Entity Manipulator [23] or the Entity Category Relationship Model [24]. The conversion process is typically straightforward. The types of the source data model are usually converted into the types of the target model based on a small number of rules. Variations of previously published conversion algorithms such as those described in [25-28] will usually suffice.

After a schema is initially converted, it is further refined along the following lines: (1) analysis of functional dependencies and possible further decomposition into higher normal forms; (2) discovery of generalization hierarchies; and (3) explicit representation of referential integrity constraints [29]. These changes make the semantics of the local schemas more explicit and thus easier to integrate. Kent [30] discusses numerous ways in which generalization hierarchies are represented in data models that do not support generalization. His ideas can be used as guidelines for the second type of refinement. After discovering generalization hierarchies, the ability to merge them becomes important in the merging step.

In the second step, assertions are made about the local entity types and attributes. Two entity types can be merged if they have a common identifier. A generalization assertion between two entity types indicates their common identifier and the degree of intersection between their extensions. The common identifiers of the local entity types are converted into a common domain so that comparisons can be made. The classes of intersection are named as follows: equivalent, subset, nonnull intersection but no subset, and null.
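The four classes of intersection can be made concrete with a small sketch. In the methodology the class is asserted by the designer; the function below computes it from sample extensions only for illustration. The names `Overlap` and `classify_overlap` and the identifier values are hypothetical, and the identifiers are assumed to be already converted to a common domain.

```python
from enum import Enum


class Overlap(Enum):
    EQUIVALENT = "equivalent"
    SUBSET = "subset"
    OVERLAPPING = "nonnull intersection but no subset"
    NULL = "null"


def classify_overlap(ext_a, ext_b):
    """Classify two extensions given as collections of identifier values."""
    a, b = set(ext_a), set(ext_b)
    if a == b:
        return Overlap.EQUIVALENT
    if a <= b or b <= a:
        return Overlap.SUBSET
    if a & b:
        return Overlap.OVERLAPPING
    return Overlap.NULL


# Hypothetical student identifiers: branch-campus engineering students
# overlap with the main-campus students without being a subset of them.
main_campus = {101, 102, 103, 104}
branch_engineering = {103, 104, 205}
print(classify_overlap(main_campus, branch_engineering).value)
```

The example pair falls into the third class, which is the case singled out in the introduction as the most common one.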

For each pair of entity types in a generalization assertion, equivalence assertions about their attributes are made. An equivalence assertion between two attributes indicates that they can be combined into a single attribute in a global view. We say the attributes are “common” if they are related in an equivalence assertion. Two attributes that are not domain compatible can be equivalent. Data conversion operations defined in the fourth step will force the common attributes into the same domain.

In the third step, the local target model schemas are merged on the basis of the assertions defined in the previous step. Since later sections describe this step in detail, we do not further elaborate on it here.

In the fourth step, global attribute domains and data conversion operations are defined. A domain specification minimally includes a data type and length. The conversion operations can involve type changes, arithmetic formulas, table lookups, and pre-compiled procedures.

Fig. 1. Example tree diagram.
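As an illustration of the fourth step, the following minimal sketch pairs a global domain specification (data type and length) with one conversion operation per local attribute. Everything named here is an assumption made for the example: the attribute names `DB1.GRADE` and `DB2.SCORE`, the grade thresholds, and the helper `to_global` do not come from the methodology itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    data_type: type   # global data type
    length: int       # maximum length of a value


def numeric_to_letter(score):
    """Table lookup from a 0-100 score to a letter grade (loses precision)."""
    for floor, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if score >= floor:
            return letter
    return "F"


# Hypothetical global attribute GRADE and its two local sources.
GRADE_DOMAIN = Domain(data_type=str, length=1)
GRADE_CONVERSIONS = {
    "DB1.GRADE": str.upper,          # same type, normalised case
    "DB2.SCORE": numeric_to_letter,  # type change via table lookup
}


def to_global(local_attr, value, domain=GRADE_DOMAIN, conversions=GRADE_CONVERSIONS):
    """Convert a local value and check it against the global domain."""
    converted = conversions[local_attr](value)
    if not isinstance(converted, domain.data_type) or len(converted) > domain.length:
        raise ValueError(f"{converted!r} does not fit the global domain")
    return converted


print(to_global("DB1.GRADE", "a"), to_global("DB2.SCORE", 85))
```

Keeping the conversions in a table keyed by local attribute keeps the domain specification separate from the per-database conversion logic, which is one possible design among several.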

3.2. Representation of Generalization Hierarchies

We use a diagram to represent generalization hierarchies (Fig. 1). The nodes of the tree are entity types and the branches are ISA relations. In Fig. 1, the STUDENT generalization hierarchy has STUDENT as its root and four subtypes: GRAD, UNDGRAD, EE, and CIS. Attributes, when shown, are attached to nodes, either below or on the side. Only direct attributes (i.e. noninherited) are shown. For example, in Fig. 1 GRAD has one direct attribute (ADVISOR) and two inherited attributes (NAME and ADDR).

Subtypes can be grouped into sublists which are groups of related subtypes. In the tree diagram, sublists are connected by a vertical line and the name of the sublist appears above. STATUS and MAJOR are two sublists in Fig. 1.

A sublist is characterized by two aspects of its membership. First, an occurrence of the supertype can be a member of one or more subtypes in a sublist. If a supertype occurrence can be a member of only one subtype, the sublist is disjoint; otherwise it is overlapping. A disjoint sublist name is enclosed within square brackets ([ ]), while an overlapping sublist name is enclosed within angle brackets (< >). In Fig. 1, both sublists are disjoint. For example, in the STATUS sublist a STUDENT can be a member of GRAD or UNDGRAD but not both.

Each disjoint sublist has a classification attribute. The values of a classification attribute map one-to-one to the subtypes in its corresponding sublist. For example, the classification attribute of the STATUS sublist has two values, while the attribute for MAJOR has three values. For the sake of brevity, we assume that the classification attribute is named identically to its corresponding sublist.

Second, every supertype occurrence may or may not be required to be a member of some subtype in a sublist. If required, the sublist is total; otherwise it is partial. In the tree diagram, a partial sublist is represented by adding an empty box to the list of members. In Fig. 1, STATUS is a total sublist: every student must be classified as GRAD or UNDGRAD. MAJOR is a partial sublist: not every student can necessarily be classified as EE or CIS.

Table 1. Summary of operators

Operator          Arguments                          Primitive   Purpose
GEN               NEWROOT, ENTLIST, NEWSLIST         Yes         Create a supertype
DELETE            ENTNAME                            Yes         Delete a subtree
ADDSUBTYPE        ENTITY, SUBLIST, SUPTYPE           Yes         Add a subtype
ADDSUBLIST        ENTLIST, SUBLIST, SUPTYPE          No          Add a sublist
MOVESUBLIST       SUBLIST, OLDSUPTYPE, NEWSUPTYPE    No          Move a subtree vertically
DELINTERMEDIATE   SUPTYPE                            No          Delete an intermediate level
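The tree-diagram conventions of Section 3.2 (entity types with direct attributes, sublists that are disjoint or overlapping and total or partial) can be captured in a small data structure. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, and only the attributes of Fig. 1 that the text names (NAME, ADDR, ADVISOR) are filled in.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EntityType:
    name: str
    direct_attrs: List[str] = field(default_factory=list)
    sublists: List["Sublist"] = field(default_factory=list)
    # Excluded from repr/compare to avoid recursion through the parent link.
    parent: Optional["EntityType"] = field(default=None, repr=False, compare=False)

    def add_sublist(self, sublist):
        """Attach a sublist and make this node the supertype of its members."""
        for member in sublist.members:
            member.parent = self
        self.sublists.append(sublist)

    def inherited_attrs(self):
        """Attributes inherited from all supertypes, root first."""
        attrs, node = [], self.parent
        while node is not None:
            attrs = node.direct_attrs + attrs
            node = node.parent
        return attrs


@dataclass
class Sublist:
    name: str
    members: List[EntityType]
    disjoint: bool = True   # a supertype occurrence is in at most one member
    total: bool = True      # every supertype occurrence is in some member

    def label(self):
        """Square brackets for disjoint, angle brackets for overlapping."""
        return f"[{self.name}]" if self.disjoint else f"<{self.name}>"


# The STUDENT hierarchy of Fig. 1, as far as the text describes it.
student = EntityType("STUDENT", ["NAME", "ADDR"])
grad, undgrad = EntityType("GRAD", ["ADVISOR"]), EntityType("UNDGRAD")
ee, cis = EntityType("EE"), EntityType("CIS")
status = Sublist("STATUS", [grad, undgrad], disjoint=True, total=True)
major = Sublist("MAJOR", [ee, cis], disjoint=True, total=False)
student.add_sublist(status)
student.add_sublist(major)

print(status.label(), grad.direct_attrs, grad.inherited_attrs())
```

Storing only direct attributes on each node and deriving inherited ones by walking to the root mirrors the diagram convention that only direct attributes are shown.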

4. SCHEMA OPERATORS

In this section, we define the operators used in the action part of the merging rules discussed in Section 5. The result of a merging rule is to rewrite one or more generalization hierarchies into an equivalent generalization hierarchy. Rewriting is accomplished by applying operators to connect two trees to form a new tree, add and delete a subtree, and add and delete a level of a tree. These operators obey the semantics of generalization regarding attribute inheritance and subtype membership. Operators for modifying the internal structure of a subtype such as adding and deleting attributes are not included but are assumed to exist.

We first explain the primitive operators and then the secondary or nonprimitive operators. The three primitive operators are GEN, DELETE and ADDSUBTYPE, while the secondary operators are ADDSUBLIST, MOVESUBLIST and DELINTERMEDIATE. Table 1 briefly summarizes the notation and purpose of each operator.

For each primitive operator, we provide a functional and, where possible, an algebraic definition, describe its information preserving properties, and present an example. A functional definition consists of a function name, an argument list, and a return value. An algebraic definition is based on the operators of relational algebra. The information preserving properties are related to the preservation of attributes, entity occurrences, and subtype memberships.

4.1. GEN

GEN creates a new supertype which is the generalization of one or more input entity types. It is a fundamental operator because generalization is the method of merging schemas. The first stage of merging rules applies GEN to entity types of different generalization hierarchies which results in a new generalization hierarchy. The second stage of merging rules applies GEN to entity types of the same hierarchy, which adds a new level to the existing hierarchy.

Functionally, it is defined as GEN (NEWROOT, ENTLIST, NEWSLIST) where NEWROOT is the new supertype name, ENTLIST is the list of entity types to be generalized, and NEWSLIST is the name of the new sublist containing ENTLIST. The common attributes (i.e. the attributes related by equivalence assertions) are assigned as attributes of NEWROOT. By definition, the set of entity occurrences in NEWROOT is the union of occurrences of ENTLIST. Therefore, NEWSLIST is always a total sublist.

Figure 2 illustrates GEN applied to two root-only generalization hierarchies. As a result of the operation, STATUS becomes a total sublist grouping UNDGRAD and GRAD and STUDENT acquires the common attributes of UNDGRAD and GRAD. In Fig. 2, we assume that attributes A and D, and B and E are equivalent. If UNDGRAD and GRAD were generalization hierarchies, both would be subtrees of the new generalization hierarchy with STUDENT as the root.
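The functional definition of GEN can be sketched at the schema level as follows. This is an illustration under several assumptions: the schema is a plain dictionary of direct attributes and sublists (a hypothetical representation), the equivalence assertions are passed in as tuples, a merged attribute is named by joining the names of its sources, and the common attributes are removed from the subtypes because they are now inherited. The attributes C and F are assumed to be the noncommon attributes of Fig. 2.

```python
def gen(schema, new_root, ent_list, new_slist, equivalences):
    """Sketch of GEN(NEWROOT, ENTLIST, NEWSLIST).

    `equivalences` holds one tuple per equivalence assertion, with one
    attribute per entity type, in ENTLIST order.
    """
    common = []
    for group in equivalences:
        for ent, attr in zip(ent_list, group):
            # The common attribute is now inherited, no longer direct.
            schema["attrs"][ent].remove(attr)
        common.append("-".join(group))  # naming convention is an assumption
    schema["attrs"][new_root] = common
    schema["sublists"][new_slist] = {
        "supertype": new_root,
        "members": list(ent_list),
        "total": True,  # NEWROOT is the union of ENTLIST, so always total
    }
    return schema


# Fig. 2: two root-only hierarchies, with A equivalent to D and B to E.
schema = {
    "attrs": {"UNDGRAD": ["A", "B", "C"], "GRAD": ["D", "E", "F"]},
    "sublists": {},
}
gen(schema, "STUDENT", ["UNDGRAD", "GRAD"], "STATUS", [("A", "D"), ("B", "E")])
print(schema["attrs"])
```

The sketch covers only the schema; the extension of the new supertype and the data conversions on the common attributes are outside its scope.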

Algebraically, GEN is the outerjoin of ENTLIST on a merge condition followed by data conversion operations on the common attributes of ENTLIST. The merge condition is always an equality of identifiers that specifies when an occurrence of one subtype is equivalent to an occurrence of another subtype. The data conversion operations are used to convert common attributes of different domains into the same domain. The common attributes and merge condition are defined in the matching step of the design methodology as explained in the previous section. For a detailed discussion of the outerjoin as applied to global views, consult [31].

GEN is an information preserving operator since all entity occurrences and attributes are preserved. The data conversion operations may lose information such as in converting from a fine to a coarse scale (e.g. numeric to alphabetic grades), but these are outside of the scope of the GEN operator.

Fig. 2. GEN example: GEN(STUDENT, (UNDGRAD, GRAD), STATUS).

4.2. DELETE

DELETE eliminates an entity type from a generalization hierarchy. It is used indirectly as a suboperator of DELINTERMEDIATE and MOVESUBLIST during operations to rearrange the interior nodes of a generalization hierarchy.

Functionally, it is defined as DELETE (ENTNAME) where ENTNAME denotes the entity to delete. All of the attributes of ENTNAME are implicitly deleted. When ENTNAME is a nonleaf node, the operation is recursively applied to its subtypes. Figure 3 depicts the result of a DELETE operation. STUDENT is removed along with its two subtypes, UNDGRAD and GRAD. The attributes of STUDENT, UNDGRAD and GRAD are also lost. TYPE becomes a partial sublist as shown by the blank box. DELETE is obviously not an information preserving operator since both the entity occurrences and attributes of the deleted entity are lost.

Algebraically, DELETE involves a selection, a difference, and a projection. The selection produces the occurrences of the deleted entity (sub)type. The difference operation removes those occurrences from the original generalization hierarchy. The projection removes the direct attributes of the deleted entity (sub)type. For example, in Fig. 3, the attributes of STUDENT, UNDGRAD and GRAD are no longer part of the UNIV-PERSON generalization hierarchy after the DELETE operation.

Fig. 3. DELETE example: DELETE(STUDENT).

4.3. ADDSUBTYPE/ADDSUBLIST

ADDSUBTYPE inserts one subtype into an existing sublist. ADDSUBLIST inserts a new sublist and all the sublist members as subtypes of an existing entity type. ADDSUBLIST is not a primitive operator because it uses a sequence of ADDSUBTYPE operations. ADDSUBLIST is used indirectly as part of the MOVESUBLIST and DELINTERMEDIATE operators. ADDSUBTYPE is used directly during the subtree merging stage.

Functionally, ADDSUBTYPE is defined as ADDSUBTYPE (ENTNAME, SUBLIST) where ENTNAME is inserted into the existing sublist named SUBLIST. ADDSUBLIST is defined as ADDSUBLIST (ENTLIST, SUBLIST, SUPTYPE) where ENTLIST is a list of entity types, SUBLIST is a new sublist, and SUPTYPE is the direct supertype of the members of ENTLIST. In Fig. 4a, FACULTY is added to the TYPE sublist. In Fig. 4b, the MAJOR sublist is added beneath STUDENT.

Fig. 4. ADDSUBTYPE/ADDSUBLIST examples: (a) ADDSUBTYPE(FACULTY, TYPE); (b) ADDSUBLIST((EE, CIS), MAJOR, STUDENT).

ADDSUBTYPE has one other effect. The inserted subtype inherits the attributes of its supertype. In Fig. 4a, FACULTY inherits the attribute of UNIV-PERSON. Attribute inheritance also applies to ADDSUBLIST because it implicitly uses ADDSUBTYPE. Therefore, EE and CIS inherit the attributes of STUDENT in Fig. 4b.

ADDSUBTYPE and ADDSUBLIST are information preserving operators since neither attributes nor entity occurrences are lost. Information is added about additional subtype classifications.

ADDSUBTYPE has no direct algebraic counterpart because it merely involves a reclassification of the existing entity occurrences in the generalization hierarchy. However, the extension of a subtype can be produced by a selection on the subtype’s membership condition followed by a projection of the direct and inherited attributes of the subtype.
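The schema-level effects of DELETE, ADDSUBTYPE and ADDSUBLIST described in Sections 4.2 and 4.3 can be sketched as follows. The dictionary representation, the helper names, the `attrs` and `total` parameters, and the attribute names NAME, GPA and RANK are all assumptions made for the illustration. Marking the containing sublist partial after a DELETE follows the description of Fig. 3.

```python
def supertype_of(schema, ent):
    """Direct supertype of `ent`, or None for a root."""
    for sub in schema["sublists"].values():
        if ent in sub["members"]:
            return sub["supertype"]
    return None


def inherited(schema, ent):
    """Attributes inherited from all supertypes, root first."""
    attrs, sup = [], supertype_of(schema, ent)
    while sup is not None:
        attrs = schema["attrs"][sup] + attrs
        sup = supertype_of(schema, sup)
    return attrs


def add_subtype(schema, ent_name, sublist, attrs=()):
    """ADDSUBTYPE: insert one subtype into an existing sublist."""
    schema["attrs"].setdefault(ent_name, list(attrs))
    schema["sublists"][sublist]["members"].append(ent_name)


def add_sublist(schema, ent_list, sublist, suptype, total=False):
    """ADDSUBLIST: a new sublist followed by a sequence of ADDSUBTYPEs."""
    schema["sublists"][sublist] = {"supertype": suptype, "members": [], "total": total}
    for ent in ent_list:
        add_subtype(schema, ent, sublist)


def delete(schema, ent_name):
    """DELETE: remove an entity type, its attributes and, recursively, its subtypes."""
    for name, sub in list(schema["sublists"].items()):
        if sub["supertype"] == ent_name:
            for member in list(sub["members"]):
                delete(schema, member)
            del schema["sublists"][name]
    for sub in schema["sublists"].values():
        if ent_name in sub["members"]:
            sub["members"].remove(ent_name)
            sub["total"] = False  # the containing sublist becomes partial
    del schema["attrs"][ent_name]


schema = {
    "attrs": {"UNIV-PERSON": ["NAME"], "STUDENT": ["GPA"]},
    "sublists": {"TYPE": {"supertype": "UNIV-PERSON", "members": ["STUDENT"], "total": True}},
}
add_subtype(schema, "FACULTY", "TYPE", attrs=["RANK"])    # as in Fig. 4a
add_sublist(schema, ["EE", "CIS"], "MAJOR", "STUDENT")    # as in Fig. 4b
faculty_inherits = inherited(schema, "FACULTY")
ee_inherits = inherited(schema, "EE")
delete(schema, "STUDENT")                                  # as in Fig. 3
print(faculty_inherits, ee_inherits, sorted(schema["attrs"]))
```

Because inherited attributes are derived from the supertype links rather than stored, the inheritance effect of ADDSUBTYPE needs no extra bookkeeping in this representation.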

4.4. MOVESUBLIST

MOVESUBLIST changes the level at which a sublist is placed in a generalization hierarchy. As a result, the sublist has a new supertype. This operator is used to rearrange a generalization hierarchy after the initial merging has occurred.

Functionally, it is defined as MOVESUBLIST (SUBLIST, OLDSUPTYPE, NEWSUPTYPE) where SUBLIST is the sublist to move, and OLDSUPTYPE and NEWSUPTYPE are the old and new supertypes, respectively. MOVESUBLIST is not a primitive operator. It is defined as an ADDSUBLIST operation to add the entity types in SUBLIST as subtypes of NEWSUPTYPE, followed by a sequence of DELETE operations applied to the original entity types in SUBLIST.

A sublist can be “raised” to a higher level in the generalization hierarchy or “lowered” to a deeper level. In Fig. 5a, the STATUS sublist has been moved from under ENGSTD to STUDENT. A side effect of this operation is the loss of inherited attributes. For example, after the “raise” operation, UNDGRAD and GRAD no longer inherit the B attribute of ENGSTD.

In the “lower” variation, the new supertype is a subtype of the old supertype. In Fig. 5b, ENGSTD (the new supertype) is the subtype of STUDENT (the old supertype), and COLLEGE is a total single entity sublist. In this case, a side effect is the addition of inherited attributes. For example, after the lower operation, EE and CIS inherit the B attribute of ENGSTD.

Fig. 5. MOVESUBLIST examples: (a) MOVESUBLIST(STATUS, ENGSTD, STUDENT); (b) MOVESUBLIST(STATUS, STUDENT, ENGSTD).

To prevent loss of information, the following restrictions are applied to MOVESUBLIST operations. In the “raise” variation, the old supertype is the only entity in a total sublist. For example in Fig. 5, ENGSTD is the only member of COLL. The inherited attributes of the old supertype have not been lost because the old and new supertypes are in a one to one correspondence. In Fig. 5a, there is a 1-1 mapping between STUDENT and ENGSTD because of the total and single entity restrictions.

Analogous restrictions apply to the “lower” variation. The new supertype must be the only entity in a total sublist. In Fig. 5b, ENGSTD is the only member of COLL. The inherited attributes of the new supertype are not gained because the old and new supertypes are in a one to one correspondence. In Fig. 5b, there is a 1-1 mapping between STUDENT and ENGSTD.
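The restrictions on MOVESUBLIST can be sketched as a guarded re-parenting of a sublist. The sketch makes several assumptions: the schema is a hypothetical dictionary representation, the two supertypes are taken to be on adjacent levels, and the move is done by changing the supertype link directly rather than by the ADDSUBLIST and DELETE sequence of the definition. The schema follows Fig. 5a as far as the text describes it.

```python
def supertype_of(schema, ent):
    for sub in schema["sublists"].values():
        if ent in sub["members"]:
            return sub["supertype"]
    return None


def inherited(schema, ent):
    attrs, sup = [], supertype_of(schema, ent)
    while sup is not None:
        attrs = schema["attrs"][sup] + attrs
        sup = supertype_of(schema, sup)
    return attrs


def link_between(schema, upper, lower):
    """The sublist that places `lower` directly beneath `upper`, if any."""
    for sub in schema["sublists"].values():
        if sub["supertype"] == upper and lower in sub["members"]:
            return sub
    return None


def move_sublist(schema, sublist, old_suptype, new_suptype):
    """MOVESUBLIST, allowed only across a total, single-entity sublist."""
    moved = schema["sublists"][sublist]
    if moved["supertype"] != old_suptype:
        raise ValueError(f"{sublist} is not a sublist of {old_suptype}")
    # "raise": old supertype sits under the new one; "lower": the reverse.
    link = (link_between(schema, new_suptype, old_suptype)
            or link_between(schema, old_suptype, new_suptype))
    if link is None or link is moved:
        raise ValueError("supertypes must be on adjacent levels")
    if not link["total"] or len(link["members"]) != 1:
        raise ValueError("old and new supertypes are not in 1-1 correspondence")
    moved["supertype"] = new_suptype


schema = {
    "attrs": {"STUDENT": ["A"], "ENGSTD": ["B"], "UNDGRAD": [], "GRAD": []},
    "sublists": {
        "COLL": {"supertype": "STUDENT", "members": ["ENGSTD"], "total": True},
        "STATUS": {"supertype": "ENGSTD", "members": ["UNDGRAD", "GRAD"], "total": True},
    },
}
move_sublist(schema, "STATUS", "ENGSTD", "STUDENT")   # "raise"
after_raise = inherited(schema, "UNDGRAD")
move_sublist(schema, "STATUS", "STUDENT", "ENGSTD")   # "lower"
after_lower = inherited(schema, "UNDGRAD")
print(after_raise, after_lower)
```

The guard rejects a move whenever the sublist connecting the two supertypes is partial or has more than one member, which is the situation in which inherited attributes would really be lost or gained.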

4.5. DELINTERMEDIATE

DELINTERMEDIATE removes an intermediate level of a generalization hierarchy. It is used to delete unneeded levels of a generalization hierarchy.

The functional definition is DELINTERMEDIATE (SUPTYPE) where the level beneath SUPTYPE is deleted. In Fig. 6, level 1 containing PERSON1 and PERSON2 is deleted. No attributes or entity occurrences are lost in this operation. The specialized attributes of PERSON1 and PERSON2 (i.e. B and E) are inherited by STD1 and FAC1, and STD2 and FAC2, respectively. STD1 has a new attribute named B1 which is the B attribute from PERSON1. A similar situation exists for FAC1, STD2, and FAC2. All occurrences of PERSON still exist. However, an entity can no longer be classified as PERSON1 or PERSON2.

Fig. 6. DELINTERMEDIATE example: DELINTERMEDIATE(PERSON).

DELINTERMEDIATE is not a primitive oper- ator. It is defined as a MOVESUBLIST operation applied to each sublist of each subtype of NEW- ROOT followed by a DELETE of each original subtype of NEWROOT.

DELINTERMEDIATE has two other effects. First, the specialized attributes of the deleted subtypes become direct attributes (rather than inherited) of their former subtypes. In Fig. 6, the attributes of PERSON1 become direct attributes of STD1 and FAC1 rather than inherited attributes. Second, the sublists beneath the deleted level are reclassified as partial if the extensions of the deleted subtypes are not equivalent. In Fig. 6, the TYPE1 and TYPE2 sublists become partial because the extensions of PERSON1 and PERSON2 are not equivalent. In other words, not every PERSON occurrence can be classified as STD1 or FAC1, or as STD2 or FAC2.
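The two effects described above can be illustrated with a minimal Python sketch. The `EntityType` and `Sublist` classes, the `extensions_equivalent` flag, and the placeholder sublist name `LEVEL1` are assumptions made for this illustration; they are not the paper's definitions. The sketch works directly on the tree instead of composing MOVESUBLIST and DELETE, omits the attribute renaming (B to B1), and assumes that every deleted entity type is nonterminal.

```python
from dataclasses import dataclass, field


@dataclass
class EntityType:
    name: str
    attributes: list = field(default_factory=list)  # direct attributes only
    sublists: list = field(default_factory=list)    # Sublist objects beneath this type


@dataclass
class Sublist:
    name: str
    subtypes: list      # EntityType objects classified by this sublist
    total: bool = True


def del_intermediate(suptype, extensions_equivalent=False):
    """Delete the level directly beneath suptype (illustrative only)."""
    raised = []
    for sublist in suptype.sublists:
        for deleted in sublist.subtypes:
            for lower in deleted.sublists:
                for entity in lower.subtypes:
                    # Formerly inherited attributes become direct attributes.
                    entity.attributes = deleted.attributes + entity.attributes
                if not extensions_equivalent:
                    # Not every occurrence of suptype falls under this sublist.
                    lower.total = False
                raised.append(lower)
    suptype.sublists = raised
    return suptype


def build_fig6():
    """Hierarchy shaped like Fig. 6; LEVEL1 is a placeholder sublist name."""
    person1 = EntityType("PERSON1", ["B"], [
        Sublist("TYPE1", [EntityType("STD1"), EntityType("FAC1")])])
    person2 = EntityType("PERSON2", ["E"], [
        Sublist("TYPE2", [EntityType("STD2"), EntityType("FAC2")])])
    return EntityType("PERSON", [], [Sublist("LEVEL1", [person1, person2])])
```

Calling `del_intermediate(build_fig6())` leaves PERSON with the TYPE1 and TYPE2 sublists directly beneath it, both marked partial, and with B and E stored as direct attributes of the former lower-level subtypes.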

5. BINARY MERGING

Merging generalization hierarchies is a subjective process, not amenable to optimization criteria. Optimization is not easy to address because merging is a representation problem. Two designers may have different preferences about combining generalization hierarchies and hence may reach different final solutions.

Our approach is a rule-driven merging editor. The editor suggests actions based on user-supplied assertions and the current merging state. Some of the assertions (e.g. attribute equivalence assertions) are collected during previous steps of the methodology, while others (e.g. generalization assertions) are collected during the merging step. The actions are applications of the schema operators. The designer can accept the suggestions of the editor, override the editor’s suggestions, or supplement the suggestions. Actions originated by the designer are subject to consistency constraints. The designer can also browse, undo, and redo actions from the design history.

Figure 7 displays a mockup of the proposed user interface. Pull down menus include assertion browsing, schema diagram views, schema operations, and design history manipulation. User initiated actions can be invoked through the schema operators menu. The bottom portion of the screen is reserved for conditions and actions of rules. Like many interfaces, windows can be moved, copied and changed in shape. In Fig. 7, only the current merging state is displayed.

The rules are organized into two phases. In the first phase, a pair of generalization hierarchies is connected with the GEN operator. In the second phase, the subtree of the new generalization hierarchy is revised according to attribute equivalence assertions. Both phases are interactive: they resemble a structured editing session in which the new generalization hierarchy is constructed in a top down manner. After each action in a phase, the designer is informed of the semantics of the actions, especially regarding possible information losses. The process terminates when the designer chooses not to apply any transformations to a level in the new generalization hierarchy. In the remainder of this section, we explain in detail both merging phases.

5.1. Connecting phase

The connecting phase has three cases which correspond to the possibilities for intersection between two generalization hierarchies. Each case is stated as an assertion:

GENASSERT(EQUIVALENT, GH1, GH2);
GENASSERT(SUBSET, GH1, GH2);
GENASSERT(OVERLAP, GH1, GH2);

The first assertion indicates that generalization hierarchies GH1 and GH2 have equivalent extensions at all times. The second assertion indicates that GH1’s extension is a subset of GH2’s extension at all times. The third indicates that their extensions overlap but that a subset relationship does not hold at all times.

We do not allow merging where the extensions have a null intersection because the purpose of merging is to integrate overlapping databases.
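As a rough illustration of the three cases, the following Python sketch classifies two extensions given as sets of entity identifiers. The function name `suggest_genassert` and the `SUPERSET` return value are assumptions of this sketch. A GENASSERT assertion states a relationship that holds at all times, so a check against one snapshot of the data can only suggest an assertion to the designer; it cannot establish it.

```python
def suggest_genassert(ext1, ext2):
    """Suggest a GENASSERT case from one snapshot of two extensions."""
    ext1, ext2 = set(ext1), set(ext2)
    if not ext1 & ext2:
        # A null intersection is excluded from merging.
        raise ValueError("extensions do not intersect; merging is not allowed")
    if ext1 == ext2:
        return "EQUIVALENT"
    if ext1 < ext2:
        return "SUBSET"      # GENASSERT(SUBSET, GH1, GH2)
    if ext1 > ext2:
        return "SUPERSET"    # would be stated as GENASSERT(SUBSET, GH2, GH1)
    return "OVERLAP"
```

For example, `suggest_genassert({1, 2}, {2, 3})` yields `"OVERLAP"`, while disjoint inputs raise an error because the case is not permitted.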

We refer to these three cases as equivalent, subset, and overlapping, respectively. When merging independent databases, the third case is the most common and important. The following subsections discuss these cases in detail.

Fig. 7. Screen layout of the merging editor.

5.1.1. Equivalent generalization hierarchies

The merging strategy for generalization hierarchies with equivalent extensions is simple. A new hierarchy is created using a GEN operation on the roots of the two input generalization hierarchies. Level one of the new hierarchy is removed through an application of the DELINTERMEDIATE operator because all occurrences of the new root are members of both subtypes.

These actions are captured in the following rule:

Equivalent rule:

if GENASSERT(EQUIVALENT, GH1, GH2)

then GETNAME(“root”, NEWROOT);
     GETNAME(“sublist”, SLNAME);
     GEN(NEWROOT, (GH1, GH2), SLNAME);
     DELINTERMEDIATE(NEWROOT).

The GETNAME actions prompt the user for the names of the new root entity type and the sublist. The results are placed in NEWROOT and SLNAME, respectively. According to the side effects of the DELINTERMEDIATE operator, the sublists beneath GH1 and GH2 are not reclassified as partial because the extensions of GH1 and GH2 are equivalent.

The semantics of this rule are defined by the GEN and DELINTERMEDIATE operators. In the resulting generalization hierarchy, no entity occurrences nor attributes are lost. As a result of the GEN operation, the common attributes are assigned to NEWROOT. As a result of the DELINTERMEDIATE operation, the attributes not assigned to NEWROOT become direct attributes of the former subtypes of GH1 and GH2.

Figure 8 demonstrates the use of this rule. ENGSTD-M1 and ENGSTD-M2 represent two identical collections of engineering students maintained by different administrative departments. If their intersection is declared as equivalent in a GENASSERT assertion, the new generalization hierarchy appears in the lower half of Fig. 8. ENGSTD-1 and ENGSTD-2 are initially created as intermediate nodes and later deleted.

It is rare but possible that two generalization hierarchies have identical extensions. Even if two databases seemingly did contain an identical collection, timing differences and administrative barriers may make the collection of objects different. For example, one list of engineering students may be managed by the registrar’s office, and the other list managed by individual department offices. Hence, the lists may not be identical because they are drawn from different sources. The registrar may only record currently registered engineering students, while the individual offices may track students who are on a temporary nonregistered status.

5.1.2. Subset generalization hierarchies

This case is also simple. The root of the new generalization hierarchy is the superset root. The subset generalization hierarchy becomes a subtree under the superset root. The following rule accomplishes these actions.

Subset rule:

if GENASSERT(SUBSET, GH1, GH2);

then COPY(GH2);
     GETNAME(“sublist”, SLNAME);
     ADDSUBTYPE(GH1, SLNAME, GH2).

The COPY action merely adds GH2 as a generalization hierarchy in the global view. The GETNAME action prompts the user for the sublist name and places the result in SLNAME. The ADDSUBTYPE then copies GH1 as a subtree under GH2.

The semantics of this rule are defined by the ADDSUBTYPE operator. The resulting generalization is enriched through the addition of the attributes and subtype classifications of GH1. Also, the new subtype (GH1) inherits the attributes of GH2. However, the entity population remains the same because of the subset assertion.

In Fig. 9, ENGSTD represents the collection of all engineering students, while MEMBER represents the engineering students who are society members. In a GENASSERT assertion, ENGSTD is declared as a superset of MEMBER. The SUBSET rule fires resulting in ENGSTD as the root of the new generalization hierarchy and MEMBER as a subtype. The MEMBERSHIP sublist is partial because of the subset relationship between ENGSTD and MEMBER. The subtrees of ENGSTD and MEMBER are identical to the input hierarchies.

Fig. 8. Merging of equivalent generalization hierarchies, under the assertion GENASSERT(EQUIVALENT, ENGSTD-1, ENGSTD-2).

Fig. 9. Merging of subset generalization hierarchies, under the assertion GENASSERT(SUBSET, MEMBER-M, ENGSTD-M).

The subset case is not common for the same reasons as the equivalent case. Even though one generalization hierarchy may appear to be a subset of the other, timing and administrative differences may violate the subset relationship.

5.1.3. Overlapping generalization hierarchies

Because of the reasons cited in equivalent and subset cases, the overlapping case is the most common for independently developed databases. A simple merging strategy would be to create a new generalization hierarchy by using a GEN operation on the roots of the two input hierarchies. This strategy does not always achieve a uniform generalization hierarchy, however. Applying this strategy to a hierarchy of all main campus students and a hierarchy of all branch campus engineering students would yield a new hierarchy with both of these on level one of the new hierarchy. A more appropriate solution would be to have all students of both databases on level one or all engineering students of both databases on level one.

To achieve a more uniform generalization hierarchy in the overlapping case, more information about the input generalization hierarchies is needed. The SUBCONCEPT assertion indicates that one generalization hierarchy represents a broader concept than another. Hierarchy A is a subconcept of hierarchy B if under the equivalent domains assumption, the extension of B contains the extension of A. The equivalent domain is defined as the union of the extensions of the two generalization hierarchies. The equivalent domains assumption disregards differences due to the scope of the local databases. If two generalization hierarchies are defined under the same domain, the only differences will be due to the concepts they represent, not the scope of the local databases. For example, the engineering students of the branch campus are not a subset of all students of the main campus. If the domains of these two hierarchies are the union of the main and branch campus databases, the engineering students are a subset of all the students. An assertion stating that ENGSTD-B is a subconcept of ALLSTD-M is stated as: SUBCONCEPT(ENGSTD-B, ALLSTD-M).

A subconcept assertion must be accompanied by a dominance assertion which indicates the generalization hierarchy to preserve in the merging process. An assertion to give priority to ENGSTD-B over ALLSTD-M is stated as: DOMINANCE(ENGSTD-B).

We divide the overlapping case into three subcases corresponding to subconcept and dominance assertions. The first case occurs when the generalization hierarchies are not involved in a SUBCONCEPT assertion. The second and third cases both involve a SUBCONCEPT assertion. The second case further requires the superconcept generalization hierarchy to be dominant over the subconcept hierarchy. The third requires the subconcept generalization hierarchy to be dominant. We refer to these three cases as no dominance, higher level dominant, and lower level dominant, respectively. The following subsections discuss each of these cases.

5.1.3.1. No dominance. GEN is used to connect the roots of the input generalization hierarchies resulting in a new generalization hierarchy that has branches that match the input hierarchies. The NO DOMINANCE rule is stated as:

No dominance rule:

if GENASSERT(OVERLAP, (ROOT1, ROOT2)) and
   NOT EXISTS (SUBCONCEPT(ROOT1, ROOT2)) and
   NOT EXISTS (SUBCONCEPT(ROOT2, ROOT1));

then GETNAME(“root”, NEWROOT);
     GETNAME(“sublist”, SLNAME);
     GEN(NEWROOT, (ROOT1, ROOT2), SLNAME).

This is an information preserving rule because its semantics are defined by the GEN operator. In the resulting hierarchy, no entity occurrences nor attributes are lost. The common attributes of ROOT1 and ROOT2 become direct attributes of NEWROOT.
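The effect of GEN on attributes can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the operator's definition: hierarchies are plain dictionaries, two attributes are treated as common when they carry the same name (the paper relies on attribute equivalence assertions instead), and the attribute names in the usage note are hypothetical.

```python
def gen(new_root, roots, sublist_name):
    """Connect roots beneath a new root; common attributes move to the new root."""
    common = set.intersection(*(set(r["attributes"]) for r in roots))
    subtypes = [
        # Copy each input root, keeping only its non-common attributes as direct.
        {**r, "attributes": [a for a in r["attributes"] if a not in common]}
        for r in roots
    ]
    return {
        "name": new_root,
        "attributes": sorted(common),
        "sublists": {sublist_name: subtypes},
    }
```

With inputs such as `{"name": "ENGSTD-M", "attributes": ["SSN", "NAME", "ADVISOR"]}` and `{"name": "ENGSTD-B", "attributes": ["SSN", "NAME"]}`, the call `gen("ENGSTD", [m, b], "CAMPUS")` places SSN and NAME on ENGSTD and leaves ADVISOR on ENGSTD-M, so no attribute is lost.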

Fig. 10. No dominance example (overlap case 1).

Figure 10 displays an example of merging of two generalization hierarchies representing the engineering students of a main and a branch campus database. The new generalization hierarchy is the generalization of the input hierarchies at the root level. The NO DOMINANCE rule fires the following GEN operation: GEN(ENGSTD, (ENGSTD-M, ENGSTD-B), CAMPUS).

5.1.3.2. Higher level dominant. In this case, information about both generalization hierarchies is preserved, but we give preference to the generalization hierarchy representing the broader concept. In the new hierarchy, we make the branch with the lower level supertype appear similar to the branch with the higher level supertype. We accomplish this by creating a “dummy” supertype above the lower level supertype. The “dummy” supertype is connected with the root of the other generalization hierarchy to form a new hierarchy.

The higher level case uses the following rule:

Higher dominant rule:

if GENASSERT(OVERLAP, (ROOT1, ROOT2)) and
   SUBCONCEPT(ROOT2, ROOT1) and
   DOMINANCE(ROOT1);

then COPY(ROOT2);
     GETNAME(“root”, DUMMYROOT);
     GETNAME(“sublist”, SLNAME);
     GEN(DUMMYROOT, (ROOT2), SLNAME);
     GETNAME(“root”, NEWROOT);
     GETNAME(“sublist”, SLNAME);
     GEN(NEWROOT, (ROOT1, DUMMYROOT), SLNAME).

In this rule, the lower root (ROOT2) is copied to the global view and a dummy supertype is created above it. The higher level root is then generalized with the dummy root.

This is an information preserving rule because its semantics are defined by the GEN operator. In the resulting hierarchy, no entity occurrences nor attributes are lost. The common attributes of ROOT1 and ROOT2 become direct attributes of NEWROOT. Information is added because ROOT2 has a new supertype, DUMMYROOT.

Fig. 11. Higher level root is dominant (overlap case 2), under the assertions GENASSERT(OVERLAP, (ENGSTD-M, ALLSTD-B)), SUBCONCEPT(ENGSTD-M, ALLSTD-B), and DOMINANCE(ALLSTD-B).

Figure 11 exemplifies the higher dominant case. ENGSTD-M (engineering students of the main campus) is a subconcept of ALLSTD-B (all branch campus students) because engineering students is a subconcept of all students. The dominance assertion states that ALLSTD-B has preference over ENGSTD-M. In the resulting hierarchy, ALLSTD-M (all students-main campus) is created as the dummy supertype of the lower root ENGSTD-M. ALLSTD-M and ALLSTD-B are then connected with the GEN operator.

5.1.4. Lower level dominant

In this case, we give preference to the lower level root (ROOT1) and lose information about the higher level root. The designer chooses a comparable subtype (say, SUBTYPE2) of the higher level root. The new generalization hierarchy is created with ROOT1 and SUBTYPE2 as subtypes. The other sublists on the same level as SUBTYPE2 are added as sublists of SUBTYPE2 in the new hierarchy because these sublists apply to SUBTYPE2. Information is lost if they are not added.

The following rule handles the lower level dominant case:

Lower level dominant rule:

if GENASSERT(OVERLAP, (ROOT1, ROOT2)) and
   SUBCONCEPT(ROOT1, ROOT2) and
   DOMINANCE(ROOT1)

then FINDSUBTYPE(ROOT2, SUBTYPE2);
     GETSLSAMELEVEL(SUBTYPE2, SLLIST);
     GETNAME(“root”, NEWROOT);
     GETNAME(“sublist”, SLNAME);
     GEN(NEWROOT, (ROOT1, SUBTYPE2), SLNAME);
     ADDSL(SLLIST, SUBTYPE2).

In this rule, the FINDSUBTYPE action prompts the user for the comparable subtype and places the result in SUBTYPE2. The GETSLSAMELEVEL action collects the sublists on the same level as SUBTYPE2 and places the resulting list of sublists in SLLIST. After the new generalization hierarchy is created by the GEN operator, the sublists in SLLIST are added beneath SUBTYPE2 by the ADDSL action, which uses ADDSUBLIST to add each sublist in SLLIST.

In Fig. 12, we reverse the situation of Fig. 11. ENGSTD-M is still a subconcept of ALLSTD-B, but ENGSTD-M is given priority in the new generalization hierarchy. This leads to information loss in the new hierarchy, but in some cases information loss may be desired. In the merging process, a comparable subtype (ENG) of ALLSTD-B is identified. The other subtypes in the COLLEGE sublist are deleted. ENG is then connected to the lower level root of the other generalization hierarchy. The higher level root is eliminated from the new generalization hierarchy as a result of connecting the ENG subtype.

In the lower level root case, information is lost in two ways. First, the attributes and membership of each deleted subtype are lost. In Fig. 12, the attributes of and membership in business (BUS) students are lost. Second, the occurrences of the higher level root which are not members of the comparable subtype are lost. In Fig. 12, the occurrences of non-engineering students are deleted. These types of information losses are easy to convey to the designer in an interactive design tool. The designer would only choose this alternative with full knowledge of the information loss.

Fig. 12. Lower level root is dominant (overlap case 3), under the assertions GENASSERT(OVERLAP, (ENGSTD-M, ALLSTD-B)), SUBCONCEPT(ENGSTD-M, ALLSTD-B), and DOMINANCE(ENGSTD-M).

5.1.5. Summary of the connecting phase cases

Table 2 summarizes the Connecting Phase cases by values of the GENASSERT and DOMINANCE assertions.
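The case analysis of the connecting phase can also be read as a small dispatch function. The Python sketch below is illustrative: the function name, the argument encoding (a `(narrower, broader)` pair for SUBCONCEPT, a root name for DOMINANCE), and the returned labels are assumptions of this sketch, not part of the editor described here.

```python
# Whether each connecting phase case loses information, per the five cases.
INFORMATION_LOSS = {
    "equivalent": False,
    "subset": False,
    "no dominance": False,
    "higher level dominant": False,
    "lower level dominant": True,
}


def connecting_case(genassert, subconcept=None, dominance=None):
    """Select the connecting phase case from the designer's assertions.

    genassert:  "EQUIVALENT", "SUBSET", or "OVERLAP"
    subconcept: (narrower_root, broader_root) or None
    dominance:  name of the dominant root or None
    """
    if genassert == "EQUIVALENT":
        return "equivalent"
    if genassert == "SUBSET":
        return "subset"
    if genassert != "OVERLAP":
        raise ValueError(f"unknown GENASSERT case: {genassert!r}")
    if subconcept is None:
        return "no dominance"
    narrower, broader = subconcept
    if dominance == broader:
        return "higher level dominant"
    if dominance == narrower:
        return "lower level dominant"
    raise ValueError("SUBCONCEPT must be accompanied by a DOMINANCE assertion "
                     "naming one of its two roots")
```

For the assertions of Fig. 11, `connecting_case("OVERLAP", ("ENGSTD-M", "ALLSTD-B"), "ALLSTD-B")` selects the higher level dominant case; changing the dominance argument to `"ENGSTD-M"` selects the lower level dominant case, the only one flagged in `INFORMATION_LOSS`.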

Table 2. Summary of the connecting phase cases

Genassert     Dominance     Description                                                 Information loss
Equivalent    N/A           Join of the input roots                                     No
Subset        N/A           The superset root is the root of the new hierarchy          No
Overlapping   Same          Generalization of the input roots                           No
Overlapping   Higher node   Generalization of the higher level root and a dummy root    No
Overlapping   Lower node    Generalization of the lower level root and a subtype        Yes

5.2. Subtree Merging Phase

After the new generalization hierarchy is created in the Connecting Phase, revisions may be needed to achieve a more uniform generalization hierarchy. An internal level can be deleted if it serves no purpose. Sublists can be moved vertically and merged if they match a sublist on another level. The revisions occur in a top down manner, and the process terminates when no revisions are made to a level. These revisions are based on attribute equivalence assertions and designer preferences.

In this section, we discuss the rules underlying these revisions and demonstrate their application

through examples. The first subsection discusses the deletion of internal levels, the second describes the raising of sublists, and the third presents the merging of sublists. The final subsection demonstrates these types of revisions in a larger example.

5.2.1. Deleting an intermediate level

In both merging phases, additional levels can be introduced into the new generalization hierarchy. In the Connecting Phase, a level is added in the three overlapping cases. For example in Fig. 10, the new generalization hierarchy is one level deeper than either one of the input hierarchies. Level 1 containing ENGSTD-M and ENGSTD-B groups engineering students by their respective campus. In the Subtree Merging Phase, additional levels can also be introduced during sublist merging (explained later).

The advantage of deleting a level is that additional merging can be done beneath it. In Fig. 10, if level 1 of the new generalization hierarchy is deleted, the sublists STATUS-M and STATUS-B can be merged. The following rule guides the level deletion decision. The conclusion of the rule is tentative. The designer can choose to ignore the conclusion if he/she desires.

Level deletion rule:

if SUPERTYPE and COUNT(SUBLISTS(X)) = 1 and CAM(SUBTYPES(X)) > Threshold

then MAYDELETE(

If the level to delete contains nonterminal entity types, the semantics of this rule are defined by the DELINTERMEDIATE operator. No attributes nor entity occurrences are lost. However, the ability to classify entities as members of the deleted level has been lost. If the deleted level contains terminal entity types, the DELETE operator is used and attributes can be lost unless the designer specifically adds them to the supertype.

The application of this rule is rather restrictive. In the first and second terms, a level qualifies if all subtypes of the level belong to the same sublist. Potentially, a level rooted anywhere in the new generalization hierarchy can be deleted. In the third term, a level qualifies if the common attribute measure (CAM) is larger than a threshold value defined by the designer. The CAM value will be greater than the threshold if most of the subtypes’ attributes are inherited. Loosely speaking, we measure this property by the ratio of the number of common attributes to the number of direct attributes of each subtype.

Recall that two attributes are common if they are related by an attribute equivalence assertion. More generally, there is a common attribute among N subtypes if there exists a transitive chain of equivalence assertions of length N - 1. It follows that there is one common attribute for each such transitive chain. In the case of two subtypes, a transitive chain is just a single equivalence assertion. Mannino and Effelsberg [8,9] discuss these results and describe algorithms to resolve inconsistencies among a chain of equivalence assertions.

Formally, we define the common attribute measure (CAM) using the number of direct and common attributes as follows: CAM = CA/(CA + DA) where CA is the number of common attributes and DA is the number of direct attributes.

For example, if two subtypes have 4 direct attributes each and there exist 4 equivalence assertions, the common attribute measure is CAM = 4/(4 + 8) = 1/3.

If the common attribute measure is close to 1 (no direct attributes), the third premise in the rule is true. The designer will then have the choice of deleting the level. The threshold value for the common attribute measure can be set by the designer.

5.2.2. Raising sublists

Sometimes, raising a sublist allows later merging of it with another sublist. The raise sublist rule lifts a sublist one level if there exists a compatible sublist on a higher level (lower level value) in the generalization hierarchy. Two sublists are compatible if there exists an attribute equivalence assertion relating their underlying classification attributes.

The rule for raising sublists can be stated as follows:

Raise sublist rule:

if sublist X is on level i and
   sublist Y is on level i+1 and
   the classification attribute of X is X’ and
   the classification attribute of Y is Y’ and
   equivalence assertion j involves X’ and Y’ and
   the supertype of X is X” and
   the supertype of Y is Y”

then MOVESUBLIST(Y, Y”, X”)

The first and second terms ensure that the sublists are on adjacent levels. The third through fifth terms ensure that their classification attributes are related in an attribute equivalence assertion. The last two terms bind the variables for the MOVESUBLIST operator.

The premises of the raising rule apply to two cases of the connecting phase: (1) subset and (2) higher level dominant. In both cases, compatible sublists can be on adjacent levels of a generalization hierarchy after the connecting phase. For example, the STATUS-M and STATUS-B sublists of Fig. 11 both describe the same property of a student, yet they are on different levels of the generalization hierarchy. If their classification attributes are defined as equivalent in an assertion, the raise sublist rule applies resulting in the STATUS-M sublist being lifted one level (Fig. 13).

Fig. 13. Application of the raise sublist rule.

The semantics of this rule are defined by the MOVESUBLIST operator. The subtypes of the raised sublist lose the inherited attributes of their former supertype. In this situation, however, the attributes are not lost because of the restrictions of the raise operation. Recall (see Section 4.4) that a raise operation is only valid if the old and new supertypes are in a one-to-one correspondence. In the previous example, ENGSTD-M and ALLSTD-M are in a one-to-one mapping since all students are engineers in the main campus database. This one-to-one mapping requirement ensures that attributes inherited from ENGSTD-M are not lost after the raise operation. The mapping integrity constraint is checked as a part of the MOVESUBLIST action.
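The premises of the raise sublist rule and the one-to-one check can be sketched as two small Python predicates. The dictionary keys, the level numbers, and the classification attribute names used in the usage note are hypothetical, and comparing extensions as sets of entity identifiers is only one possible reading of the one-to-one requirement.

```python
def raise_rule_applies(x, y, equivalences):
    """Check the premises of the raise sublist rule for sublists x and y.

    x, y:          dicts with "level" and "class_attr" keys (hypothetical encoding)
    equivalences:  set of frozensets, one per attribute equivalence assertion
    """
    adjacent = y["level"] == x["level"] + 1          # y sits one level below x
    related = frozenset({x["class_attr"], y["class_attr"]}) in equivalences
    return adjacent and related


def one_to_one(old_supertype_ext, new_supertype_ext):
    """Old and new supertypes must classify exactly the same occurrences."""
    return set(old_supertype_ext) == set(new_supertype_ext)
```

With `x = {"level": 2, "class_attr": "status_b"}`, `y = {"level": 3, "class_attr": "status_m"}` and a single assertion `{frozenset({"status_b", "status_m"})}`, the premises hold; the move would still be rejected if `one_to_one` failed for the extensions of the old and new supertypes.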

5.2.3. Merging sublists

The previous rule ensures that a raised sublist can be merged with another sublist on its new level. This provision is necessary because the only reason to raise a sublist is to merge it with another sublist. Therefore, the rules for raising and merging sublists are typically used in succession: first a sublist is raised and then it is merged. However, these rules are separate because two compatible sublists may already be on the same level without a raise operation.

The sublist merging rule is shown below. It requires two sublists to have the same supertype and an attribute equivalence assertion to relate them.

Sublist merging rule:

if the supertype of sublist X is Z and
   the supertype of sublist Y is Z and
   the classification attribute of X is X’ and
   the classification attribute of Y is Y’ and
   equivalence assertion j involves X’ and Y’

then MERGESUBLISTS(X, Y, SLNAME).

This rule applies to the generalization hierarchy in Fig. 13 if the CAMPUS sublist is deleted by the level deletion rule. This rule also applies to the MAJOR-B and MAJOR-M sublists in Fig. 9 if MAJOR-B is lifted by the raise sublist rule. The MERGESUBLISTS action combines the two sublists (X and Y) by deleting them and adding a new sublist (SLNAME) with the sum of the subtypes of the original sublists.

Sublist merging is information preserving as no attributes nor entity occurrences are lost. However, there has been a change in the hierarchy because two sublists have been combined. The new sublist is total and overlapping if either of the original sublists is total. If both original sublists are partial, the new sublist can be partial or total and disjoint or overlapping. By default, we assume partial and overlapping because they are looser, but the designer can override this with assertions about the sublist membership.
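The default classification of a merged sublist can be sketched as follows. The `Sublist` record and the function name are assumptions of this Python illustration; only the totality and overlap defaults are taken from the paragraph above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Sublist:
    name: str
    subtypes: tuple
    total: bool
    disjoint: bool


def merge_sublists(x, y, name):
    """Combine two sublists under one supertype into a new sublist."""
    # Total if either input is total; otherwise default to the looser "partial".
    total = x.total or y.total
    # Overlapping in both cases by default; the designer may override this.
    return Sublist(name, x.subtypes + y.subtypes, total=total, disjoint=False)


status_m = Sublist("STATUS-M", ("UNDGRAD-M", "GRAD-M"), total=False, disjoint=True)
status_b = Sublist("STATUS-B", ("UNDGRAD-B", "GRAD-B"), total=False, disjoint=True)
status = merge_sublists(status_m, status_b, "STATUS")

# A designer assertion that the merged sublist is total overrides the default.
status_total = replace(status, total=True)
```

Here `status` holds four subtypes and is partial and overlapping by default, while `status_total` models a designer assertion that every occurrence falls under the merged sublist.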

Figure 14 extends the example from Fig. 13 with the deletion of the CAMPUS sublist and the merging of the STATUS-B and STATUS-M sublists. After deleting the CAMPUS sublist, the sublists are reclassified as partial because of the DELINTERMEDIATE operation. After merging STATUS-B and STATUS-M, the new sublist STATUS contains four subtypes. STATUS is total and overlapping because every student must have an enrollment status and some students have an enrollment status in both main and branch campuses. In this case the designer had to make an assertion about the total sublist.

Fig. 14. Application of the sublist merging rule.

After combining sublists in this manner, it may be desirable to merge the subtypes in the new sublist. For example, in Fig. 14, it may be desirable to combine UNDGRAD-M and UNDGRAD-B into another sublist or to combine them into a single subtype. These additional transformations can be accomplished through GEN operations and the delete level rule. The designer must specify the GEN operations and the delete level rule will be triggered if its conditions are met.

5.2.4. Subtree merging example

We demonstrate the operations of the subtree merging phase through an extension of the “No Dominance” case of Fig. 10. Assume that ENGSTD-M has 1 direct (i.e. noninherited) attribute, ENGSTD-B has no direct attributes, and there are four equivalence assertions. If the common attribute measure threshold value is less than 0.8, the level deletion rule applies because level 1 of Fig. 10 only contains 1 sublist and the common attribute measure is 0.8. If the designer decides to delete this level, the resulting generalization hierarchy appears as shown in Fig. 15a.
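The arithmetic behind this decision can be checked with a short Python sketch of the measure CAM = CA/(CA + DA). The function names and the threshold of 3/4 are assumptions chosen for illustration; exact fractions are used so that the comparison with the threshold does not depend on floating point rounding.

```python
from fractions import Fraction


def cam(common, direct):
    """Common attribute measure CA / (CA + DA); undefined when both are zero."""
    return Fraction(common, common + direct)


def may_delete_level(num_sublists, common, direct, threshold):
    """Premises of the level deletion rule: one sublist and CAM above threshold."""
    return num_sublists == 1 and cam(common, direct) > threshold


# ENGSTD-M has 1 direct attribute, ENGSTD-B has none, 4 equivalence assertions.
example_cam = cam(common=4, direct=1)
decision = may_delete_level(1, common=4, direct=1, threshold=Fraction(3, 4))
```

`example_cam` equals 4/5, that is 0.8, so the rule's premises hold for any threshold below 0.8; the conclusion remains tentative and the designer may still decline to delete the level.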

If the classification attributes of the STATUS-M and STATUS-B sublists are related in an equivalence assertion, the sublist merging rule applies because the sublists are on the same level. The resulting sublist STATUS is total and overlapping because every student has a status and some students have a status in both campuses. Figure 15b shows a further merging of the subtypes within the STATUS sublist. There have been two additional GEN operations to create UNDGRAD and GRAD.

Fig. 15. Subtree merging example: (a) after the deletion of level 1; (b) merging of sublists.

After the merging is complete, the delete level rule would then be attempted on both sublists on level 2 of Fig. 15b. If the rule is used on USCHOOL, the specific attributes of UNDGRAD-M and UNDGRAD-B are lost unless the designer adds them back to UNDGRAD. The process terminates after the deletion rule is attempted because the generalization hierarchy is exhausted.

6. EXTENSIONS FOR M-WAY MERGING

The problem of merging M generalization hierarchies can occur at the initial design of a global view or at the modification of an existing global view. At initial design time, the problem can occur when there are more than two databases to integrate. At modification time, the problem can occur when there is more than one additional database to integrate.

A spectrum of approaches to M-way merging is possible. On the ends of the spectrum are the incremental and the simultaneous approaches. In the incremental approach, M hierarchies are merged two at a time. Initially, two hierarchies are merged, then the next hierarchy is merged with the result of the first merge. In the simultaneous approach, all generalization hierarchies are merged in one pass. This approach probably uses fewer merging operations than the incremental approach, but it is also more complex because assertions must be made about all relevant combinations of the M hierarchies.

For flexibility and ease of use, a design methodology should support both approaches. Some designers may prefer one approach over the other and there are situations when one approach is superior to the other. It is a challenging problem to support a spectrum of approaches with incremental and simultaneous merging being the end points of the spectrum. In the remainder of this section, we sketch the changes to our merging operators, rules, and assertion types needed for both approaches.

The incremental approach requires only minor changes to the operators, rules and assertion types described in Sections 4 and 5. Global object types must be permitted in an assertion in addition to local schema objects. The designer must specify a total ordering either before the merging process starts or incrementally. The final generalization hierarchy is highly dependent on the chosen order. The evolving global view hierarchy is treated as a special database to integrate. In the overlapping case of the connecting step, it is considered dominant by default. Any changes to the existing global hierarchy must be reflected in the mappings to the local schemas.
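The incremental approach amounts to folding a binary merge over the hierarchies in the designer's chosen order. The Python sketch below assumes a caller-supplied `merge_two(global_view, local)` function standing in for the whole binary merging procedure; the toy `nest` function merely records the order of merging and is not a real merge.

```python
from functools import reduce


def merge_incrementally(hierarchies, merge_two):
    """Merge an ordered, non-empty sequence of hierarchies two at a time.

    The evolving global view is always the first argument of merge_two,
    matching its treatment as the dominant hierarchy by default.
    """
    return reduce(merge_two, hierarchies)


def nest(global_view, local):
    """Toy stand-in for binary merging that only records the merge order."""
    return (global_view, local)


first_order = merge_incrementally(["A", "B", "C"], nest)
second_order = merge_incrementally(["C", "B", "A"], nest)
```

The two results differ, `(("A", "B"), "C")` versus `(("C", "B"), "A")`, which mirrors the observation that the final hierarchy depends on the chosen order.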

Merging generalization hierarchies 271

The simultaneous approach is not so easily accom-

modated. The assertion types should be extended to permit statements about N objects rather than just two objects. Otherwise, an exhaustive list of pairwise assertions is required. Some assertions are easy to extend such as the EQUIVALENT case of GEN- ASSERT because the assertion has the transitive property. Others such as the OVERLAP case are more difficult because they are not transitive. These differences may make the assertions less uniform and therefore, more confusing.

Regarding the operators and the rules, the GEN operator can be readily extended to M generalization hierarchies. Extensions to the other operators are not necessary. The existing merging rules must be revised to reflect changes to the format of assertions. Additional changes and new rules are necessary to handle M-way merging. The case where all hierarchies are EQUIVALENT or all are related as SUBSETS can be handled with small changes to the existing rules. The case where some hierarchies satisfy the EQUIVALENT case, some satisfy the SUBSET case, and some satisfy the OVERLAPPING case can be handled by a combination of the rules. The EQUIVALENT and SUBSET hierarchies can be merged with their respective rules. The resulting hierarchies can then be merged according to the rules of the OVERLAPPING case. The No Dominance subcase can be handled with minor revisions to the existing rules, but the higher and lower level subcases cannot. These cases require pairwise assertions and a sequence of binary merging operations. The order of the merging operations could be determined by a rank associated with each overlapping generalization hierarchy. The ranking replaces the DOMINANCE assertion described earlier.
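One way to read the ranking proposal is as a plan of binary merge steps in which the higher-ranked input of each step plays the dominant role. The sketch below is only an illustration under assumptions: the name `plan_binary_merges` is hypothetical, rank 1 is taken to be the most dominant, and ties are broken by name.

```python
from typing import Dict, List, Tuple


def plan_binary_merges(ranks: Dict[str, int]) -> List[Tuple[str, str]]:
    """Return (dominant, other) steps for a sequence of binary merges.

    Assumption: a smaller rank number means a more dominant hierarchy.
    """
    ordered = sorted(ranks, key=lambda name: (ranks[name], name))
    steps: List[Tuple[str, str]] = []
    if not ordered:
        return steps
    current = ordered[0]
    for name in ordered[1:]:
        steps.append((current, name))
        # The intermediate result inherits the dominant role for the next step.
        current = f"merge({current}, {name})"
    return steps


for dominant, other in plan_binary_merges({"H2": 2, "H1": 1, "H3": 3}):
    print(dominant, "dominates", other)
```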

The Subtree merging phase also requires extensions. The Level Deletion and Sublist Raising rules require no modification. The Sublist Merging and Add Subtype rules must be changed to accommodate N object assertions.

7. CONCLUSION

We described the underlying operators and rules of a procedure to merge generalization hierarchies. The operators manipulate generalization hierarchies in the following ways: (1) connecting hierarchies to form a new hierarchy; (2) adding and deleting subhierarchies; and (3) adding and deleting intermediate levels of a hierarchy. We defined three primitive operators (GEN, ADDSUBTYPE, and DELETE) and three secondary operators (ADDSUBLIST, DELINTERMEDIATE and MOVESUBLIST), which are based on the primitive operators. For each operator, we provided a functional and, where possible, an algebraic definition, and discussed its information preserving properties.

The merging procedure interactively constructs a generalization hierarchy in two phases. The interactive nature is necessary because of designer decisions regarding names, assertions and alternative actions. In the connecting phase, two generalization hierarchies are connected based on the intersection of their extensions. The intersection can be complete, a subset relationship, some overlap but no subset relationship, or no overlap. We further subdivide the last two cases based on the dominant generalization hierarchy. This results in a total of five cases in the Connecting Phase. In the subtree merging phase, the new generalization hierarchy is revised in the following ways: (1) intermediate levels can be added; (2) subtypes can be raised to higher levels; and (3) collections of subtypes can be merged. These revisions are primarily based on attribute equivalence assertions.
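The four intersection outcomes named above can be illustrated with ordinary set comparisons. This is a sketch only: in the procedure the designer asserts the relationship between intended extensions rather than computing it from stored instances, and the names `Intersection`, `classify_intersection` and the label DISJOINT are assumptions introduced here. The further subdivision by dominance is not modeled.

```python
from enum import Enum
from typing import AbstractSet, Hashable


class Intersection(Enum):
    EQUIVALENT = "complete intersection"
    SUBSET = "subset relationship"
    OVERLAP = "some overlap but no subset relationship"
    DISJOINT = "no overlap"  # label assumed for illustration


def classify_intersection(ext_a: AbstractSet[Hashable],
                          ext_b: AbstractSet[Hashable]) -> Intersection:
    """Classify how two root extensions intersect."""
    if ext_a == ext_b:
        return Intersection.EQUIVALENT
    if ext_a <= ext_b or ext_b <= ext_a:
        return Intersection.SUBSET
    if ext_a & ext_b:
        return Intersection.OVERLAP
    return Intersection.DISJOINT


print(classify_intersection({"e1", "e2"}, {"e1", "e2", "e3"}).name)  # SUBSET
```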

A preliminary version of the binary merging procedure as described in Section 5 was implemented at the University of Florida [32]. The merging procedure is intended as part of a global view design workbench. Ideally, a workbench should provide a common design database and graphical interface, and a variety of tools for schema conversion, assertion specification and analysis, and schema merging. Prototype versions of a graphical schema editor [33] and a view definition language translator [21] have been developed. Further development is underway in a joint project between Honeywell Corporation and the University of Florida. The formulation of integration rules in this paper lends itself to the development of a rule-based expert system for global view design.

Acknowledgements-We thank Bernice Buntaran for implementation of the binary merging algorithms and Celestine Blanton for an implementation of the graphical display functions. We also thank Aloysius Cornelio for useful discussions.

REFERENCES

[1] J. M. Smith and D. C. P. Smith. Database abstractions: aggregation and generalization. ACM Trans. Database Systems 2(3), 105-133 (June, 1977).

[2] U. Dayal and H. Hwang. View definition and generalization for database integration in multidatabase system. In IEEE Trans. Software Engineering SE-10, 6, pp. 628-644 (November, 1984).

[3] T. Landers and R. Rosenthal. An overview of multibase. In Proc. 2nd Int. Symp. on Distributed Databases, Berlin (Edited by H. J. Schneider), pp. 153-184 (September, 1982).

[4] V. Gligor and G. Luckenbaugh. Interconnecting heterogeneous database management systems. IEEE Comput. 17(1), 33-43 (January, 1984).

[5] D. Heimbigner and D. McLeod. A federated architecture for information management. ACM Trans. on Office Information Systems 3(3), 234-252 (July, 1985).

[6] D. Brill and M. Templeton. Distributed query processing strategies in mermaid, a frontend to data management systems. In Proc. Int. Conf. on Data Engineering, pp. 211-218 (April, 1984).

[7] Y. Breitbart, P. Olson and G. Thompson. Database integration in a distributed heterogeneous database system. In Proc. Int. Conf. on Data Engineering (February, 1986).

[8] M. Mannino. A methodology for global schema design. UF-CIS Technical Rep. TR-84-1, Computer and Information Systems Dept, Univ. Florida (September, 1984).

[9] M. Mannino and W. Effelsberg. Matching techniques in global schema design. In Int. Conf. on Data Engineering, IEEE, Los Angeles (April, 1984).

[10] W. Effelsberg and M. Mannino. Attribute equivalence in global schema design for heterogeneous distributed databases. Inform. Systems 9(3) (1984).

[11] S. Navathe, R. Elmasri and J. Larson. Integrating user views in database design. IEEE Comput. 19(1), 50-62 (January, 1986).

[12] M. Mannino. A methodology for global schema design. Ph.D. Dissertation, Dept of Management Information Systems, Univ. Arizona (June, 1983).

[13] C. Batini, M. Lenzerini and S. Navathe. A comparative analysis of methodologies for database schema integration. Technical Rep. TR-86-1, Computer and Information Sciences Dept, Univ. Florida (1986).

[14] A. Motro and P. Buneman. Constructing superviews. In Proc. ACM SIGMOD Conf., Ann Arbor, Michigan (May, 1981).

[15] A. Motro. Interrogating superviews. In Proc. Second Int. Conf. on Databases (ICOD), Churchill College, Cambridge, U.K., 107-126 (September, 1983).

[16] A. Motro. Superviews: virtual integration of multiple databases. IEEE Trans. Software Engineering SE-13, 7, 785-798 (July, 1987).

[17] R. Elmasri and S. Navathe. Object integration in database design. In Proc. 1st Int. Conf. Data Engineering (COMPDEC), IEEE, pp. 426-433 (April, 1984).

[18] S. Navathe, T. Sashidar and R. Elmasri. Relationship merging in schema integration. In Proc. 10th Int. Conf. on Very Large Data Bases, Singapore, pp. 67-79 (August, 1984).

[19] J. Larson, S. Navathe and R. Elmasri. Attribute equivalence and its role in database schema integration. IEEE Trans. Software Engineering (1988). To be published.

[20] D. Shipman. The functional data model and the data language DAPLEX. ACM Trans. Database Systems 6(1), 140-173 (March, 1981).

[21] M. Mannino and C. Karle. An extension of the general entity manipulator language for global view definition. Data Knowledge Engng. 1, 305-326 (1985).

[22] W. Kim, J. Banerjee, H. Kim and H. Korth. Semantics and implementation of schema evolution in object-oriented databases. In Proc. ACM SIGMOD Conf., San Francisco, pp. 311-322 (June, 1987).

[23] C. Zaniolo. The database language GEM. In Proc. ACM SIGMOD Conf. and SIGMOD Record 13(4), 207-218 (May, 1983).

[24] R. Elmasri, J. Weeldryer and A. Hevner. The category concept: an extension to the entity-relationship model. Data Knowledge Engng. 1(1) (June, 1985).

[25] S. Dumpala and S. Arora. Schema translation using the entity relationship approach. In Proc. 3rd E-R Conf., Elsevier, Amsterdam, pp. 337-356 (1983).

[26] J. Iossiphidis. A translator to convert the DDL of ERM to the DDL of system 2000. In Proc. 1st Int. Conf. on Entity Relationship Approach, pp. 55&578 (December, 1979).

[27] F. Lochovsky and D. Tsichritzis. Data Models. Prentice-Hall, Englewood Cliffs, NJ (1982).

[28] C. Zaniolo. Design of relational views over network schemas. In Proc. ACM SIGMOD Conf. (June, 1979).

[29] C. Date. An Introduction to Database Systems, Fourth Edn. Addison Wesley, Reading, Mass. (1985).

[30] W. Kent. Choices in practical data design. In Proc. 8th Int. Conf. on Very Large Data Bases, Mexico City, pp. 165-180 (September, 1983).

[31] U. Dayal. Processing queries over generalization hierarchies in a multidatabase system. In Proc. 9th Int. Conf. on Very Large Data Bases, Milan, Italy, pp. 342-353 (1983).

[32] B. Buntaran. A Procedure to Merge Hierarchically Structured Entity Types, Master’s Thesis, Computer and Information Sciences Dept, Univ. Florida (1985).

[33] C. Blanton. A schema definition tool, Master’s Thesis, Computer and Information Sciences Dept, Univ. Florida (1985).

[34] R. Elmasri, J. Larson and S. Navathe. A comprehensive methodology for database schema integration. Honeywell Computer Sciences Center Technical Rept, Golden Valley, MN (April, 1986).

[35] A. Rosenthal and D. Reiner. Extending the algebraic framework of query processing to handle outerjoins. In Proc. 10th Int. Conf. Very Large Data Bases, Singapore, pp. 334-343 (August, 1984).