Conceptual and Context based Combination of Schema Matchers

7
2008 International Conference on Emerging Technologies IEEE-ICET 2008 Rawalpindi, Pakistan, 18-19 October, 2008 Conceptual and Context based Combination of Schema Matchers Nayyer Masood Dept. of Computer Science M. A. J. University, Islamabad nayyer@jinnah. edu. ok. Abstract In this paper we are presenting a semi-automated schema matching approach named C3SM (Conceptual and Context based Combination of Schema Matchers). A single schema matching technique is unlikely to be successful to achieve high match accuracy for a large variety of schemas therefore C3SM combines different matchers. C3SM exploits various types of schema information, e.g. element names, data types and structural properties. It also utilizes auxiliary sources, such as taxonomies, dictionaries, thesauri or synonym tables. In order to manage different types of heterogeneities and to increase the accuracy of the similarity assertions generated, matching between elements is performed in the contexts in which they are modeled and similarity is established on the basis of the concepts that the elements model. 1. Introduction. Schema matching is a fundamental operation of schema integration process which is a basic problem in many database application domains, like database integration, E-business, data warehousing [1]. Schema Matching (SM) takes two schemas as input and produces correspondences between elements of two schemas that are modeling same or similar concepts. It is a process that can neither be performed manually due to huge efforts required nor can it be performed totally automatically due to lack of semantic content in the database schemas. The only viable approach, hence, is to perform the process semi- automatically, where the process is mainly driven 978-1-4244-2211-1/08/$25.00 ©2008 IEEE Orner Iqbal Dept. of Computer Science M. A. J. University, Islamabad [email protected] by a software tool requiring user input at different stages and in different forms. The initial research on SM was in the perspective of view integration; a process performed in database system design [10]. Later the focus of the same and the new approaches was shifted to schema integration mainly in the context of Multi/federated database system design [11]. In the recent past, new applications have emerged that are based on the SM, like semantic query processing, e-commerce, data warehousing. Schema matching now is studied as an independent research area providing input to numerous diversified applications [2, 3, 5, 7, 8]. The focus of research in SM is to provide affinity/similarityl assertions among schema elements belonging to different schemas with least possible human intervention and with maximum accuracy. Both of these targets are hampered mainly due to existence of semantic heterogeneities among schemas; a phenomenon that reflects a situation where same real-world concept is modeled differently in different schemas. Other issues associated with SM are representation of component schemas, auxiliary information required and cardinality of matcher (a particular SM approach) results. These issues are addressed differently in different SM approaches and hence introduce different matchers. Many of the recent SM approaches [2, 3, 5, 6, 7] are mainly linguistic based [1], that is, the matching is performed by comparing elements' names, description etc. Name matching works under the assumption that if two schema elements have same or similar names, they model same or similar 1 These two words are used interchangeably

Transcript of Conceptual and Context based Combination of Schema Matchers

2008 International Conference on Emerging TechnologiesIEEE-ICET 2008Rawalpindi, Pakistan, 18-19 October, 2008

Conceptual and Context based Combination ofSchemaMatchers

Nayyer MasoodDept. ofComputer Science

M. A. J. University, Islamabadnayyer@jinnah. edu. ok.

Abstract

In this paper we are presenting a semi-automatedschema matching approach named C3SM(Conceptual and Context based Combination ofSchema Matchers). A single schema matchingtechnique is unlikely to be successful to achievehigh match accuracy for a large variety of schemastherefore C3SM combines different matchers.C3SM exploits various types of schemainformation, e.g. element names, data types andstructural properties. It also utilizes auxiliarysources, such as taxonomies, dictionaries, thesaurior synonym tables. In order to manage differenttypes of heterogeneities and to increase theaccuracy of the similarity assertions generated,matching between elements is performed in thecontexts in which they are modeled and similarityis established on the basis of the concepts that theelements model.

1. Introduction.

Schema matching is a fundamental operation ofschema integration process which is a basicproblem in many database application domains,like database integration, E-business, datawarehousing [1]. Schema Matching (SM) takes twoschemas as input and produces correspondencesbetween elements of two schemas that aremodeling same or similar concepts. It is a processthat can neither be performed manually due to hugeefforts required nor can it be performed totallyautomatically due to lack of semantic content in thedatabase schemas. The only viable approach,hence, is to perform the process semi­automatically, where the process is mainly driven

978-1-4244-2211-1/08/$25.00 ©2008 IEEE

Orner IqbalDept. ofComputer Science

M. A. J. University, [email protected]

by a software tool requiring user input at differentstages and in different forms.

The initial research on SM was in the perspectiveof view integration; a process performed indatabase system design [10]. Later the focus of thesame and the new approaches was shifted toschema integration mainly in the context ofMulti/federated database system design [11]. In therecent past, new applications have emerged that arebased on the SM, like semantic query processing,e-commerce, data warehousing. Schema matchingnow is studied as an independent research areaproviding input to numerous diversifiedapplications [2, 3, 5, 7, 8]. The focus of research inSM is to provide affinity/similarityl assertionsamong schema elements belonging to differentschemas with least possible human interventionand with maximum accuracy. Both of these targetsare hampered mainly due to existence of semanticheterogeneities among schemas; a phenomenonthat reflects a situation where same real-worldconcept is modeled differently in differentschemas. Other issues associated with SM arerepresentation of component schemas, auxiliaryinformation required and cardinality of matcher (aparticular SM approach) results. These issues areaddressed differently in different SM approachesand hence introduce different matchers. Many ofthe recent SM approaches [2, 3, 5, 6, 7] are mainlylinguistic based [1], that is, the matching isperformed by comparing elements' names,description etc. Name matching works under theassumption that if two schema elements have sameor similar names, they model same or similar

1 These two words are used interchangeably

concepts. This assumption can result some verymisleading results, for example, account could be abank account or a library account, likewise namecould be of a department or of a customer. Suchflaws are recovered by applying multiple matchersjointly, so different matchers support or opposeeach others results. Hence the notions of hybridand composite matchers are introduced [1]; hybridmatchers are combination of different matcherswhich are simultaneously applied, whereascomposite matchers apply different matchersindividually and finally combine their results usingdifferent techniques.

In this paper, we are presenting a part of a semi­automatic schema matching approach C3SM thattackles the above mentioned problem of namematchers by using the concept of in-contextmatching [4]. The C3SM works in syntactic andsemantic modes. In syntactic mode it mainly usesthe schematic and auxiliary information andcombines them in both composite and hybridfashions. In semantic mode, it includes a mappingphase in which each schema element is mapped tothe real-world concept(s) that it models.Contextual matching follows the mapping phase.In this paper, we are discussing only the syntacticaspect of C3SM due to shortage of space.The paper is structured as follows: Section 2discusses different schema matching approaches.Section 3 presents characteristics of C3SM and itsmatch processing approach. Section 4 presents theconstituent matchers of C3SM. Section 5 presentsthe results of C3SM applied on example schemas.Finally, conclusion and future work is presented insection 6.

2.0. Literature Review.

Many approaches for SM have been proposed inthe literature. These techniques vary from manualto fully automatic. The range of elements identifiedas matching candidates also varies from one-to-oneto one-to-many.

The SemInt [10] and SKAT (Semantic KnowledgeArticulation Tool) [6] match prototype identifiesone-to-one matching candidates. SemInt is basedon both constraints based and content basedmatching criteria. Its uses metadata informationand instances to find similarity between elements.SKAT on the other hand is based on predefinedrules and requires user acceptance to find similaritybetween elements. These rules are applicationspecific and cannot be applied in different

domains. In practice it is found that there can bemultiple elements relating with a single element,but only one of them can be a true match.

COMA [2] identifies one-to-many matchcandidates. It supports both composite and hybridcombination of schema matchers. Like previousapproaches COMA does not involve heavy userinteraction but it is still a semiautomatic approach.It has number of advantages over other schemes,which include its rich library of matchers. Oneimportant aspect of COMA is reusability of itspreviously obtained results.

Each schema matching approach has somestrengths and weaknesses. However, due toexistence of heterogeneities and modeling of sameconcept in different contexts these approaches tendto generate many invalid similarity assertions.These two major issues need to be addressed.

3.0. Conceptual and Context basedCombination of Schema Matchers(C3SM).

The Conceptual and Context based Combination ofSchema Matchers (C3SM) is a novel schemamatching approach with following salient features.

It provides the option of concepts basedmatchingThe correspondence between elements isestablished based on the context(s) withinwhich an element exists in a schema

The architecture of C3SM is shown in figure 2.The input to the C3SM is the component schemasexpressed in the Unified Model [11] . However,rather than accepting it purely as a structuralartifact, we provide an optional mapping phasewhere user can map the schema elements to theconcepts that they are modeling. The concepts aretaken from the domain-specific concept hierarchies[4] forming the concepts-based models of thecomponent schemas on which the semanticmatcher is applied. Focus in this paper, however,is only the schema based part of the C3SM.

In schema based C3SM, we are mainly usingschema based matchers, that exploit theinformation available in database schemas plussome auxiliary information, like synonyms. Ourapproach transforms component schemas intounified representation Le. Unified Model [11],which provides a unified view of all schemas to ourmatcher. Figure 2 shows two such example graphs.Our Matcher will identify match candidates and

will estimate the degree of similarity by anormalized numeric value in the range 0-1, inorder to identify the best match candidates.

The match operation takes as input two schemasand produces a mapping indicating which elementsof the input schemas are logically related to eachother. The match results specify the matchingschema elements together with a similarity valuebetween 0 and 1. A 0 indicates total dissimilarityand 1 indicates strong similarity. Most ofpreviousapproaches including COMA have focused on one­to-one (1: 1) match relationships. But our approachtries to identify (1 :n) match relationships.

3.1 Match Processing in C3SM

The C3SM consists of four phases, some of themperformed in an iterative fashion. As a pre­matching phase, component schemas aretransformed into unified model. The schema basedmatching consists of execution of differentmatchers exploring intrinsic similarity betweenelements, aggregation of the results to establish(finalize) the intrinsic similarity and to establish thein-context similarity between elements and finallyoptional user feedback phase. Currently, in asemiautomatic fashion, C3SM applies differentmatchers in a predefined order, however later weplan to offer the selection of matchers and otheroptions to the users like in COMA [2, 3].

Context based MatcherExecution

Matc- Matc- Matc­her1 her2 her-n

r···_· -----,: User :

: Input i .1111111111111111111.'.....rm

' 110 Result Aggregation

Iteration

Fig. 1: Architecture of C3SM

. The four phases of C3SM are explained below:

1. User Input (Optional): In this phase user candefine his own criteria to match elements. Ifsuch information is not specified then defaultspecifications are used. In this phase user mayalso set some threshold to select matchcandidates automatically based on theiraffinity value.

2. Matcher Execution for Intrinsic Similarity:It is the main phase of match processing, inwhich multiple independent matchers chosenfrom the matcher library are executed. Theunified model DAGs representing componentschemas are traversed and each node(representing an element) in one graph iscompared with each node in the other graph todetermine the elements that have gotsimilarity/affinity between them. We offer two

categories of matchers Le. Simple and Hybridand combine the results generated by differentmarchers in a composite way. Matchersexploit different kinds of schema information,such as names, data types etc., and auxiliaryinformation, such as synonym andabbreviation tables. They produce mappingsbetween related schema elements consisting ofaffinity value between 0 and 1 and associatedmapping expressions. Later, results ofdifferent independently executed matchers arecombined together to produce more accuratematch results.

3. Result Aggregation for In-ContextSimilarity: Context of each schema element isestablished by the relationships/associations inwhich element is involved with otherelements. For this purpose, we consider

immediate higher or peer level relationships/associations of a schema element. Forexample, in the diagram 2(a), noderepresenting PhNo is associated with nodeDepartment; the latter provides a context forPhNo. In C3SM we do not establish similarityon the basis of intrinsic similarity, rather weuse the in-context similarity. For each pair ofelements that have non-zero intrinsic similarity(step 2), we traverse the graph to find theirrespective in-context elements, and check theintrinsic similarity between them. If in-contextelements also have non-zero similarity then wecombine (aggregate) the two similarity valuesand establish the in-context similarity.

4. User Feedback: Assertions having affinityvalue more than certain threshold value areshown to user for final decision (accept orreject). Moreover, situations where there arel:n mappings and the tool cannot resolve themultiplicity automatically, are also shown touser for final decision. In this phase user canalso manually define mappings betweenelements.

Phases of C3SM have been discussed. In thefollowing sections we are explaining how thesephases are implemented on the example schemas.

4.0 Constituent Matchers

In this section, we are presenting the constituentmatchers of C3SM.

1) Simple Matchers

Simple matchers are the ones that are based on asingle technique/approach for matching. In thefollowing, we have briefly given the underlyingconcepts of the simple matchers that we haveused in C3SM. A comprehensive list of matchershas been presented in [1].

• NameEquality: This matcher is based onthe assumption that same names usually bearthe same semantics [2, 5, 6, 7].

• Synonym: Domain specific synonym list isprepared and element name is comparedwith those in the synonym list to find anymatch. For example (perfume = cosmeticand car = automobile) [2, 5, 6, 7].

• Abbreviations: Domain specificabbreviations lists are prepared and are usedfor the element names matching. Forexample, Std for Student, Dep or Dept for

Department, Emp for Employee etc. [2, 5, 6,7]

• Data Type: Similarity can be based on theequivalence/compatibility of data types, e.g.(string=varchar) [2, 10].

• NamePath: Graph representing schema istraversed from top to a specific element todetermine the relationship in which theelement is involved or elements throughwhich it can be referenced [2, 4].

• Key: Key constraints can be useful tofind/strengenthen similarity betweenelements. For e.g. (StdID7 Primary Keyand StdN07PrimaryKey) [10].

2) Hybrid Matchers

Simple matchers applied in isolation can notproduce good results due to heterogeneitiesamong elements. For example, NameEqualitymatcher establishes similarity if two elementnames are exactly equal, it will not be able toidentify the similarity between attributesnames Address and Adr but the Abbreviationsmatcher can. So matchers are applied in ahybrid fashion, which is a fixed combinationof simple matchers, and results produced bythese matchers are combined to obtain moreaccurate affinity values. In the following, weare describing hybrid matchers used in C3SMand the way they are applied:

• Name: This hybrid name matcher combinesdifferent simple matchers. In C3SM wecombine NameEquality, Synonym, andAbbreviation matchers in the order ofdescription. Against each simple matcher,this hybrid matcher produces a similarityvalue of 1 if a match is found, otherwise itassigns a o.

Name matcher is applied twice in C3SM; firsttime it is applied taking the schema elementsnames as such. The elements for which matchis not found through this first run, we apply thetokenization and run the name matcher again.Tokenization, firstly, identifies differentcomponents in an element name, if any, forexample, the element name StudHomeAddressis tokenized into [Stud, Home, Address].Secondly, tokens are mapped to regular namesby looking into auxiliary information, likesynonyms and abbreviations. For example,'Stud' token identified above will be convertedto 'Student'. The affinities found through

Fig. 2: Example Schema Graphs

5.1 User Input

0.50

0.0 0.50

0.50

0.0 0.67

Table 1: Name Matcher Results

As explained previously, the Name (Hybrid)matcher is also applied after tokenization for theelements for which similarity is not found in thefirst phase. Some of such results are shown in inthe table 2. Gray cells show the concepts (words)obtained after tokenization, since abbreviations aretransformed into full words, that is why thismatcher has not been applied in table 2. Moreover,in this table, second column contains the elementsfor which tokenization has been applied.

Table 2: Name matcher applied afterTokenization

table 1, first two columns show the element namesobtained from nodes of unified schemas. Nextthree columns show the results obtained byapplying different simple matchers and last columncontains the aggregated results (Max).

Element AUSl

Element A ElementB Name Synonym Abbrev TotalUSl US2 Equality ~ity

MaxDepartment Dept 0.0 0.0 1.0 1.0Department. Dept. 0.0 0.0 0.0 0.0Name D NameDepartment. Dept. 0.0 0.1 0.0 1.0HOD InchargeDepartment Student 0.0 0.0 0.0 0.0Dept Stud Student 0.0 0.0 0.0 0.0Dept_Stud. Dept.PhNo 1.0 0.0 0.0 1.0PhNo

SlDlplt Grapll 1(a)

The graph representation of the example schemasis shown in the following figure.

Name matcher in the first run (withouttokenization) have more confidence level ascompared to those found in the second run.

• ConstraintName: This element levelmatcher matches the DataType and Keyconstraints of the attributes.

5.0 Experimental Results of C3SM

Application of different steps of C3SM isexplained below:

In C3SM, the results produced by this matcherare either used to enhance/strengthen thematching assertions produced by Name(hybrid matcher) or to produce the assertionsfor those elements for which matching is notfound by Name matcher. The assertionsgenerated by this matcher, however, will beconsidered of weaker strength as they are notsupported by the Name matcher.

In the next section, we are presenting the resultsobtained through C3SM on two example schemas.Due to shortage of space we are only presenting asmall part of the results.

We set the threshold value as 0.5 and let the systemrun in the default mode.

5.2 Name Matcher (Intrinsic Similarity)

After running two rounds of Name matcher toestablish intrinsic similarity (shown in table 1 &2),we calculate the in-context similarity betweenelements as explained below.

Name Matcher only utilizes element namesobtained regardless of their path and hierarchy. In

5.3 NamePath Matcher (In-contextSimilarity)

In C3SM, we establish the in-context similarityamong only those elements that have non-zeronaming similarity among themselves and alsoamong their respective in-context elementsidentified by NamePath matcher. The in-contextsimilarity is computed by taking the average ofthese two similarities; some of them are shown inthe table below:

Table 3: In-context similarity

ElementElements' In-context Elements'

Element AB

Similarity Elements' In-contextUSl

US2(a) Equality Similarity

(b) ~vR(~ b)Department Dept. 0.67 1.0 0.84

. Name D NameDepartment Dept. 1.0 1.0 1.0. Supervisors InchargeDepartment Dept. 0.67 1.0 0.84

Number Dept IDDept_Stud. Dept. 0.67 0.67 0.67

Name D NameDept_Stud. Student. 0.67 0.67 0.67

Name S Name

In the first row, Name and D_Name have 0.67intrinsic similarity, their respective in-contextelements, that is, Department and Dept, have 1.0intrinsic similarity, the average of two values is0.84 which is in-context similarity betweenDepartment.Name and Dept.D_Name. In the lasttwo rows, Dept_Stud.Name has 0.67 similaritywith each of the Dept.D_Name andStudent.S_Name. However, assertion involvingDept.D_Name is cancelled in this case as thisattribute has a higher affinity with anotherattribute, that is, Department.Name (0.84). SoDept_Stud.Name is declared similar to onlyStudent.S Name.The elements having in-context similarity arefurther compared using other matchers like,ConstraintName and Description. If these matchersalso find similarity between already similarelements, then previous assertions are assignedhigher confidence level, like Department.Numberand Dept.Id in Table 3.

5.4 User Feedback

Similarity assertions are presented to user whomakes final decision about them. The assertionshaving similarity value of 1 or closer and having a1:1 matching should be acceptable right away.However, lower affinity values or where there arel:n assertions need to be examined by the user tomake final decision.

We have briefly discussed the results generated byC3SM for two example schemas. The assertionsgetting verified by the user are then used for themerging of schema elements which makes possiblethe fetching of data from multiple resources in anintegrated way.

6. Conclusion

In this paper, we presented a schema matchingapproach that accepts two schemas in the unifiedmodel and finds corresponding elements betweenthem. The main features of our approach are1. It provides option for semantic and schematic

based SM2. It is a combination of multiple matchers each

one with capable of handling differentheterogeneities. Moreover, different matcherssupport (or reject) each others' results. Theoverall effect is the better results.

3. Similarity is established on the basis oncontexts in which elements are modeled. Thathelps to reduce the number of invalidassertions generated.

In future, we intend to include more matchers thatwill improve the accuracy of the results produced.

References:

.1 Rahm, E., Bernstein P.A.: A Survey ofApproachesto Automatic Schema Matching. VLDB Journal 10:4,2001

.2 H.H., Do, Rahm, E.:COMA - A system for flexiblecombination of schema matching approaches,Proceedings of the 28th VLDB Conference, HongKong, China 2002,

.3 Aumeller, D. et al: Schema and Ontology Matchingwith COMA++, SIGMOD, June 14-16,2005

.4 Masood, N.,: A Schema Comparison Approach toDetermine Semantic Similarity among SchemaElements, Infonnation Technology Journal, Vol. 3,No., 1, Jan.-Mar., 2004

.5 Doan, A.H., Domingos, P., Levy, A.: Leamingsource descriptions for data integration,Proceedings of WebDB Workshop, pp. 81-92, 2000

.6 Mitra P., Wiederhold, G., Jannink, 1.: Semi­automatic integration of knowledge sources,Proceedings of Fusion '99, Sunnyvale, USA, 1999

.7 Milo, T., Zohar, S.: Using schema matching tosimplify heterogeneous data translation,Proceedings of 24th Int Conf On Very Large DataBases, pp. 122-133, 1998

.8 Li, W., Clifton, C.: SemInt: a tool for identifyingattribute correspondences in heterogeneousdatabases using neural network, Data Knowl Eng33(1), pp 49-84, 2000

.9 Bouzeghoub, M., Comyn-Wattiau, I., 'Viewmtegration By Semantic Unification andTransformation of Data Structures', Proc. of theNinth International Conference on the Entity­Relationship Approach, Lausane, Switzerland, Oct.1990.

.10 Dayal, U., Hwang, H., 'View Definition andGeneralization for Database Integration in aMultidatabase System', IEEE Transactions onSoftware Engineering, p(629-645), Nov., 1984

.11 S.M. Chung, P.S. Mah, 'Schema Integration forMultidatabases Using the Unified Relational andObject-Oriented Model', Proc. of ACM ComputerScience Conference, ACM Press, pp 208-215, 1995