Analysis of An Inventory of Information Systems In the Public Administration

Requirements Eng (1996) 1 :47-62 �9 1996 Springer-Verlag London Limited Requirements

Engineering

Analysis of an Inventory of Information Systems in the Public Administration

C. Batini a, S. Castano b, V. De Antonellis c, M.G. Fugini d and B. Pernici d aAutorit& per I'lnformatica nella Pubblica Amministrazione, bUniversit& di Milano, cUniversit& di Ancona and dpolitecnico di Milano, Italy

The paper deals with the problem of building an inventory o f information systems for the public administration, with reference to an ongoing project in Italy. We describe the investigation techniques defined for collect- ing information and the techniques developed for a systematic analysis o f the large set o f conceptual schemas resulting from the investigation. These schemas describe the data used by the public administration work processes. In particular, we describe the conceptual schema of the inventory, which is the basis for discussing the methodology of investigation, the choice o f units o f investigation, the data collection and merging, and the access to information. Then, we present the schema analysis techniques developed to analyse semi-automatically the large set o f conceptual schemas resulting from the investigation. In particular, we illustrate indexing techniques for identifying representative descriptors o f schemas and similarity techniques to compare schemas for their classification into families. Finally, the tool developed to support the storage, analysis and classification of schemas is described and experimentation results are discussed.

Keywords: Conceptual schema analysis; Information system re-engineering; Reference components; Sim- ilarity measures; Schema clustering

1. Introduction

The entity relationship (ER) model [1] (and several extensions of it) is widely adopted in practice to

Correspondence and offprint requests to: Professor Barbara Pernici, Dipartimento di Elettronica e Informazione, Politecnico di Milano, via Ponzio 34/5, 20133 Milano, Italy. Email: [email protected] omal.it [castano,deantone,fugini,pernici]@elet.polimi.it

represent conceptual information. In fact, the ER model is not only used during information and software system design, for conceptual design of databases [2] and software engineering [3], but also for representing information in multiple databases for querying purposes [4] and to describe information resources in an organisation [5].

Considering the large amount of available schemas, several research areas are starting to derive interesting information from them. A first area of research is that of analysing instances of databases for data mining purposes [6]. Another approach deals with the analysis of the structure of available information, i.e., the conceptual schemas associated with existing applications. This analysis can have multiple purposes: to re- engineer existing systems [7], to derive reference components to be adopted in new developments [8], or to integrate data from different databases [9].

In the Italian Public Administration, a large inventory of existing information systems has been created, using guided questionnaires, by the Information System Authority for Public Administration (AIPA), founded in Italy in 1993. As a result of such effort a large collection of ER schemas associated with processes and databases in the Central Public Administration has been created. The purpose of this paper is to present the research results of the application of a set of analysis techniques to this large set of schemas. The goal of the analysis techniques considered is to derive semi-automatically synthetic information to be used for:

�9 indexing and abstracting schemas, for cataloguing purposes;

~ comparing schemas, to evaluate the potential for integration in information systems containing similar information in different divisions or Ministries;

48 C. Batini et al.

�9 abstracting and comparing similar information in different schemas to derive reference components to be adopted in future developments.

The techniques proposed in this paper are based on schema similarity evaluations and have been initially discussed as a result of the F 3 (From Fuzzy to Formal) ESPRIT 3 Project of the European Commission [10]. The application of the techniques in the Public Admini- stration domain has resulted in an adaptation of the techniques to the characteristics of the available schemas, and in the development of a computer-based tool for performing schema analysis.

The need for a semi-automatic computation emerges from the practical infeasibility of a complete semantic analysis of a very large set of schemas. In fact, in the context of our experience, given that the inventory has been collected during a six-month interviewing phase at all Italian Ministries, the amount of available data is very large, and, furthermore, the possibility of going back to each administration for requesting further information is not viable. As a consequence, we cannot assume the availability of detailed domain information, in addition to the information associated with the schemas themselves.

Given that the application domain is based in Italy, some of the problems concerning the creation of a thesaurus and of tools for the analysis of text contained in ER schemas are obviously based on the Italian language. In the present paper, examples are provided in English; therefore a translation of terms had to be performed for presentation purposes. We try to present general considerations about textual analysis in the following, although the interested Italian-speaking readers of the paper are referred to detailed documentation of the tool for further information [11].

In Section 2, we present the context in which the present study has been developed. In Section 3, the schema analysis techniques adopted in this research are illustrated. In Section 4, the repository developed to store ER schemas and the tools realised to support the analysis described in Section 3 are presented. In Section 5, we evaluate our approach with respect to other experiences conducted in AIPA on a more semantic and manual basis.

2. The Information System Inventory of the Italian Public Administration

AIPA has the task of promoting, co-ordinating, planning and controlling the public administration information systems through the standardisation, interconnection and integration of automated systems, with the

goal of improving service quality and achieving fairness in administrative activity.

In particular, AIPA has the following goals:

�9 to define criteria for planning, implementation, management and maintenance of information systems;

�9 to submit the three-year plan for the Public Admini- stration information system to the Prime Minister as part of the Italian three-year budget plan;

�9 to promote technology innovation in the Public Administration information systems;

�9 to define guidelines and programmes for personnel training;

�9 to set rules for purchasing goals and services related to information systems;

�9 to define procedures and criteria for monitoring large-scale contracts.

One of the first activities launched at AIPA has been the project to build an inventory of existing information systems within the Central Public Administrations in Italy. In the past years, Public Administrations have developed their own information systems on an unre- lated basis, yielding high replication and inconsistency of data, low interconnection and interoperability, and scarce accessibility by end users to information about their own interest. As a consequence, the information technology (IT) investment for services offered to end users (the citizens) was scarcely efficient and effective. Therefore, in order to prepare the 1995-1997 plan for IT investments, the decision was made to collect in a systematic way information about the actual state of the systems, in order to be able to assess the major problems and to introduce new technologies. In particular, some goals of interest that can be achieved with the development of the inventory are related to the following issues:

�9 Data and process integration. Due to the lack of standardised data and process structures across Public Administration information systems, process re-engineering activities are planned to reconstruct integrated reference conceptual architecture of data, describing all data circulating among processes and their relationships in an integrated way. Information available from the inventory can be accessed to point out replicated and/or inconsistent types of data, to be used as the basis for deriving integrated representa- tions for the involved data types and processes.

�9 Reuse o f existing applications. Due to the lack of co- ordination in developing information systems in the past years, applications pertaining to different Minis- tries have been implemented separately, pursuing local objectives, without a global vision of the

Analysis of an Inventory of Information Systems in the Public Administration 49

organisation. As a consequence, different Ministries are often characterised by the same or similar applications. Exploiting inventory information, proper reusable components can be defined to abstract commonalities between existing applications (e.g., data components). These components, properly organised into a library, can be used in developing new applications similar to existing applications.

�9 Greater availability and accessibility o f information. The systematic organisation of inventory information and its proper classification in a repository facilitate the user in querying and retrieval operations.

The design and organisation of the inventory has been influenced also by these aspects.

2.1. Inventory Organisation

In order to perform the investigation, AIPA has developed its own original methodology, tailoring existing approaches and methodologies on information system planning [5,12-14], and inventory organisation [15,16] to the particular characteristics and rules of the Italian public sector. The main project decisions characterising the investigation are the following:

�9 To focus the investigation not only on technological issues (i.e., the applications and the computer and networks on which applications operate), but also on the organisational aspects, to point out the structure and the characteristics of the organisational units, the human resources involved, and the work processes performed within organisational units. In particular, the typology of information employed, stored, and exchanged among organisational units is a relevant feature to be pointed out.

�9 To perform the investigation not only on the organisational units responsible for the computer-based information systems (referred to in the project as E D P manager organisational units), but also on the other organisational units participating in some work process (referred to in the project as user organisational units), which are important for the information flows in which they participate.

�9 To perform the investigation by means of questionnaires properly designed by AIPA, filled in by the various users of the organisational units, with the assistance of a team of experts trained for the purpose.

These decisions make the AIPA investigation peculiar with respect to previous investigations on the Italian Public Administration, which concentrated mainly on technological aspects and on EDP manager organisa-

tional units, and employed only pre-defined questionnaires to collect information.

Due to the huge complexity of the investigation, the inventory has been organised like a typical statistical census, splitting the population into different units of investigation, and designed for each of them a specific questionnaire, that has been filled in with the assistance of a specialist. The following units of investigation are considered (their number is indicated in parentheses):

1. EDP managers (250), who provide information on computer-based information systems;

2. Users who provide information on work processes; users are hierarchically organised into three levels: - information systems users responsible for Minis-

tries (40); - users responsible for Directorates (directly

depending on Ministries, 300); - u s e r s responsible for Divisions (depending on

Directorates, 2200).

With respect to a typical census, the questionnaire contains a higher percentage of questions with free text answer, due to the little awareness of the user of some information and the need for integrating information from different sources.

The information collected in the inventory through the questionnaires is described by means of the two ER meta-schemas shown in Figs 1 and 2.

The user questionnaire concerns:

�9 Organisational units, i.e., the types of users of the information systems, classified at three different levels, to reflect their hierarchical/functional place- ment in the organisation chart of the Public Admini- stration (Ministries, Directorates, Divisions).

�9 Processes, i.e., groups of activities (executed either manually or by applications) performed to provide services to internal and/or external users. Organisa- tional units can be responsible for clients or suppliers of work processes.

�9 Applications, which are computer-based and are managed by the EDP managers; they have client, supplier and user organisational units associated with them.

�9 Information objects exchanged and manipulated by processes, distinguishing between paper-based information objects and computer-based information objects. For each process, an ER schema representing input, output, and manipulated information is to be provided in the questionnaire, designed in co-operation by the experts and the users.

The EDP manager questionnaire concerns:

50 C. Batini et aL

�9 A p p l i c a t i o n s , with the meaning above (thus, applications are acquired both from users and from EDP managers);

�9 D a t a b a s e s used by the applications; ~ D a t a b a s e objects, which roughly correspond to the

computer-based information objects of the user questionnaire. For each database, an ER schema of database objects has to be provided.

Textual documentation is associated with the schemas, specifying the nature and the usage characteristics of data types represented in conceptual schemas. The ER model has been used 'to provide a conceptual level of description, independent of the support on which data are stored, to facilitate data and process analysis.

As a result of the investigation, the information collected consists of a very large set of ER schemas (240 database schemas with 3873 database objects and 2203 information schemas related to processes with 10123 information objects) describing the data circulating among various processes and/or stored in the databases.

In the paper, we focus on the analysis techniques developed at the Politecnico di Milano in a research project carried out in collaboration with AIPA. In

particular, indexing techniques for identifying the representative entities of schemas (see Section 3.1) and classification techniques for grouping schemas operat- ing on similar data in the same family (see Section 3.2) have been developed, with the objective of supporting the 'similarity-driven' analysis of schemas, and the reconstruction of information flows among processe~

3, Schema Analysis Techniques

In this section, we describe our techniques for analysing conceptual schemas associated with Public Administra- tion processes and databases. Due to the large amount of schemas to be analysed, the minimisation of the user intervention is a challenging requirement to be considered in developing the analysis techniques. For this purpose, we exploit as much as possible the information associated with the schemas to perform the analysis. The information available for each schema Sh is the following:

�9 T h e n a m e of the schema, nsh, which coincides with the name of the process or the database to which it refers, nsh uniquely identifies the process within the

Organisational Unit (OU)

l, I ir t' vello ec~ hird, ve,o

(O,n)

~ Typology o f ~ OU

Belongs to I

(1,n)[

Process Responsible for

Selected [ @ (0,n) process

Application

(1,n)

(1,n)

I I [Information[ [Information

(0,n)[ entity I [relationship

(l'l)l T (1,n)

I Information generalisation

Fig, 1. Entity relationship meta-schema of the user questionnaire.

Analysis of an Inventory of Information Systems in the Public Administration

organisational unit to which it pertains and has the form nSh = (OU_D, ID), where OU D is the organisational unit identifier, and ID is the process name acronym, derived from the extended process name, or the database name. The O U_D identifier specifies the complete path of the involved organisational unit, according to the administration organisation chart.

�9 The structure of the schema: a set of entities S h = {e l ,

. . . . e n h } , where an entity ej E Sh is generally characterised by a set of attributes and a set of links with other entities of Sh. An extended ER model is considered, including n-ary relationships and is-a hierarchies.

�9 Textual information, which is a natural language description of the process or database. It is a natural

51

language description of input/output data, starting/ ending events, possible supporting application pro- gram(s), and type of process (e.g., management, administration).

�9 For each entity, the number of elementary instances of the entity (in logical terms, the number of records of the file represented by the entity).

To correctly perform the analysis, it is also important to consider the meaning and the role of entity and schema names, especially when different schemas must be compared to discover their similarities. To take into account these semantic considerations, and to make the analysis more flexible and uniform, we refer to a thesaurus, where the semantic relationships existing between terms used in ER schemas of the Public Administration are properly stored (e.g., synonyms,

(1,n)

(l,n)

Supplier

Identifier t (1,1)

(O,n) I (O,n)

(O,n) T

EDPManage r (1,n) OU

(l,n)

(1,1) (1,n)

(~

Organizational Unit (ou)

Database generalization

Database object

attribute

(1,1)

(1,1)

Database Object

I Database I Database

entity relationship I (1,n) @ (0,n)

I

Fig. 2. Entity relationship meta-scherna of the EDP manager questionnaire.

(1,n)

Database

52 C. Batini et al.

acronyms, abbreviations). The structure and construction of the thesaurus will be illustrated in Section 3.2.1.

Before presenting the analysis techniques, a few comments have to be presented about the types of available schemas. As mentioned above, the schemas are of two different types:

�9 Schemas associated with processes: this type of schema is generally characterised by the fact that it collects summary information about data used within a process of an organisation unit; for such schemas, data characterising entities, relationships and is-a hierarchies have been defined, while no information about attributes associated with entities or relationships is available. In addition, the terms used to label entities are usually not uniform across several organisational units using the same data, giving rise, therefore, to a number of problems that we tackle with our techniques for evaluating similarities between entity names. In general, relationship names are not particularly significant and cannot be used during similarity analysis. These schemas are generally rather small (the average number of entities is five); the information objects represented can relate both to structured and unstructured data, such as laws, norms, and the like.

�9 Schemas associated with databases: this type of schema contains more uniform terms, derived from existing database schemas, although some simplifica- tions have been performed during the data collection process; in addition, also the most important attributes for each entity are available and can be used during the evaluation process. The average number of entities contained in such schemas is 10; the type of data represented in these schemas is usually rather structured.

A general characteristic of entity labels is their length, since in the name of the entity also information about its contents is included, in particular when attributes are not available. The consequence of this aspect of schemas is that no simple term comparison can be performed, but a similarity measure for entity labels has to be investigated according to the specific labelling structure of the available schemas, as illustrated in the following paragraphs.

The techniques presented in the following for schema indexing and comparison based on this information are:

�9 Derivation o f schema descriptors: schema descriptors are used to identify important parts in a schema, and can be used for classification purposes and for simplifying other schema analysis steps;

�9 Derivation ofschemafamilies: schema families group schemas with similar characteristics for facilitating the analysis of related schemas, both for identifying information flows, integration candidates, and for abstracting more generic schema elements.

3.1. Schema Descriptors

By means of this technique, we associate with the schema a set of descriptors, capable of describing the contents of the schema at a more abstract level.

Descriptors of a schema correspond to the labels of the representative entities of the schema, selected using the 'quantity of information' criterion. Descriptors have an associated weight, which is the measure of their relevance with respect to the whole set of indexed schemas. Weighted descriptors allow schemas to be indexed according to a flexible paradigm, and to be searched and retrieved using imprecise queries. Hence, the developer can navigate in the space of schemas by entering the target descriptors of his own interest, and by finding both perfectly matching candidate schemas, and similar candidate schemas that 'can fit' his current needs.

The selection of the representative entities for a given schema Sh is based on the following steps:

�9 computation of the 'quantity of information' Wsh (ej) for each entity ej E Sh;

�9 definition of a threshold T for automatic selection of representative entities on the basis of their quantity of information.

The computation of the quantity of information for the entities of a schema Sh is based on their structure in Sh; in particular, for each entity ej E Sh we compute the following quantities:

�9 Number of attributes of Q, Ato t (Q);

�9 Number of relationships in which ej participates, Rtot (e j);

�9 Number of 'is-a' links in which ej participates, [to t

(e j).

The rationale is that the number of attributes and links of an entity can be used as a (heuristic) measure of its relevance within the schema. The greater this is, the higher the relevance is, since the entity is characterised by several attributes and it is referred to by several other entities of the schema. According to link taxono- mies for database schema design, different types of links involve a different strength of the semantic connection among the involved elements [17]. As a consequence, different types of link carry a different


quantity of information. To numerically determine Wsh(ei), we assign a strength coefficient to attributes, denoted by Satt, a strength coefficient to relationships, denoted by Srel, and a strength coefficient to 'is-a' links, denoted by Sis_a, with Sat" Srel, Sis_ a ~ [0,1] and Sat t > Sis_ a >-- Sre l. Wsh(ej) is computed as a linear combination of previous quantities, as follows:

Wsh (ej) = SattAto t (ej) + Sre l Rto t (ej) + Sis_ a Itot(ei)

From experimental results, we adopted the values Sat t =

1, Srel = 0.4, and Sis-a = 0.6 as significant strength values, and we used the average quantity of information WSh in a schema as the selection threshold (i.e., given Sh = {el, . . . , enh,} we compute T = Wsh = Z Wsh(ej)/nh). The entities whose quantity of information is greater than or equal to T are selected as representative, and their labels become descriptors of Sh.

For instance, in Fig. 3, an example of a conceptual schema is shown, related to 'Accounting' management. By applying the formula previously illustrated for quantity of information computation to S~, we evaluate quantity of information as follows:

Wsl (Item) = 6.8 Wsl (Variation) = 5.4 W S 1 (Appropriation) = 4.6 Wsl (Order of Payment) = 9.4

Wsl (Credit Order) = 6.4 Wsl (Grantee) = 4.4

and an average quantity of information Wsl = 6.16. As a consequence, entities Item, Order of Payment and Credit Order are selected as representative, and become descriptors for $1 (representative entities are shown in the boldline boxes in the schema).

As a subsequent step, a manual inspection on automatically extracted descriptors has been performed, to correct some ambiguities or odd terms due to the special jargon used in the schema, and to evaluate their significance against the schema subject.

As a result, with each schema Sh the set D ( S h ) of its descriptors is associated:

D (Sh) = {dlh , d2h , ..., dnSh}

where nSh depends on the characteristics of the considered schema Sh (number of entities, attributes, and links).

For each descriptor we compute a weight w E [0,1], which represents the relevance of the descriptor with respect to the whole set of descriptors of all indexed schemas. The higher the weight for a descriptor is, the more relevant the descriptor is for characterising the schema with which it is associated. The weight of a descriptor may vary depending on the Ministry being

Amount Motivation

NotificationDate I NetificationNr.

CashAmount CompetenceAmount I Fin.Year

I Name item Balance

Variation . @ _ ~ +

Approp.Nr. ---] [ Date ~ Appropriation

Amount

Authoriz.info I ValidityPeriod ---I "~2~2~t I I

Motivation I ListNr. '1 II II TitleNr. I II II

Acc~ r I [ I ITAm2unt IncomingAcc.Year TreasuryLocation

I

Grantee ~ -- SSN

Name - - Bank ~--- AccountNr.

1--- ListNr. Credit ~ TitleNr. Order I-- AccountingYear

I---- IncomingAcc.Year [ TreasuryLocation

Amount

Fig. 3. An example of the Public Administration schema ($1).

54 C. Batini et al.

considered, and is the same for schemas related to the same Ministry. For descriptor weighting, we adapted a function from classical text retrieval systems [18]. The weight of the descriptor k with respect to the Ministry i is computed by means of the following weighting function:

Wik =

.

comparisons needed for this evaluation. The use of the descriptors as the first elements to be considered in the comparison can be helpful in reducing the number of needed comparisons and ruling out similarities between pairs of schemas. Descriptor-based similarity is recom- mended for large schemas or for schemas containing a great quantity of information in order to reduce the comparisons to be performed, while preserving preci- sion, since descriptors are the representative and most significant entities characterising a schema.

3.2. Schema Families

where vik is the frequency of the occurrence of the descriptor k in Ministry i, N is the total number of considered schemas, nk is the number of schemas exhibiting descriptor k, and F is the total number of distinct descriptors used across all the schemas.

As a result of applying the weighting function, the set of descriptors associated with a schema Sh is defined as follows:

D (Sh) = { (d lh;Wlh) , (d2h;W2h), . . . , (dnsh;WnSh) }

For example, by applying the weighting function to the descriptors previously extracted for the schema $1 in Fig. 3, we obtain the following set of descriptors D (&):

D ($1) = {((Item);0.5), ((Order of Payment);0.72), ((Credit Order);0.72)}

These weights have been computed on a set of N = 178 schemas, for a total of 677 descriptors, of which F = 503 are distinct, by considering an average value n~ = 2.

In the AIPA case, descriptors are useful for a variety of reasons. First of all, they can be used for indexing purposes. In fact, it is interesting both to list descriptors associated with a schema, and to list schemas associated with a descriptor. In addition, descriptors can be grouped according to the similarity criteria listed in the following paragraph, thus yielding a more general classification schema. For instance, the descriptors 'Order of Payment' and 'Credit Order' can both be related to the generic term 'Order', and help grouping entities and schemas referring to orders in general. Common attributes in different schemas can be identified and therefore a common representation for generic entities (e.g., generic orders) can be derived. Another possible use of descriptors is in the computation of schema similarities aimed at pointing out commonalities between schemas. One of the problems in the computation of schema similarity is the number of

In this section, we describe the techniques for classify- ing Public Administration conceptual schemas, i.e., to partition schemas into families on the basis of their similarities. Grouping schemas in families can be used for several purposes:

�9 to identify similar schemas associated with processes; these schemas can lead to the identification of hidden flows of data between administrations;

�9 to identify schemas of databases that are candidates for integration or for sharing data;

�9 to identify common portions in schemas, that can be analysed in order to construct reference components (e.g., a unique representation for the entity type 'Person' and all its possible specialisations).

The similarity of a pair of schemas depends on the number of their entities that are similar. Intuitively, the greater the number of similar entities, the higher the similarity between corresponding schemas. In order to determine similarity between entities in different schemas, we evaluate the similarity of the terms constituting their labels, on the basis of the information stored in the thesaurus, which has been constructed from the analysed conceptual schemas and their associated documentation (Section 3.2.1). The similarity of a pair of schemas is then computed by considering their similar entities, using 'ad hoc' metrics (Section 3.2.2). On the basis of the determined schema similarity coefficients, schemas are then grouped into families using hierarchical clustering techniques [19] (Section 3.2.3).

3.2.1. Thesaurus and Term Similarity

A thesaurus provides a classification of the terms characterising a certain domain, by storing in a structured way the semantic relationships existing among them [19]. To construct a thesaurus starting from a set of available documents, the following elements are to be provided [19]:


�9 a list of words not to be considered during document analysis, called stop-words, i.e., articles, prepositions, and the like;

�9 a set of relationships among terms;

�9 a set of criteria to decide the lexical similarity between terms to extract their common root.

In our project, the thesaurus has been constructed to take into account the specific characteristics and the role of the terms contained in the conceptual schemas of the inventory. In particular, the entity labels and the relationships between entities have been considered for thesaurus construction. The relationships included in the thesaurus are the ones between single terms included in the entity labels. The following single term relationships are considered:

�9 USE, defined between terms that are considered equivalent because they have the same meaning. In our schemas, we consider as equivalent terms: the acronym terms and their expansions, a term and its incorrect occurrences (incorrect spelling) in the examined schemas, and singular terms and their plural.

�9 BT/NT (broader term/narrower term), defined between single terms denoting entities participating in generalisation hierarchies in the analysed schemas (e.g., 'Person' and 'Employee!).

�9 RT (related terms), defined between terms presenting some kind of relationship. In our case, we consider lexical relationships, and we put in this category terms having the same root (e.g., 'Person' and 'Personnel').

�9 SYN (synonym terms), defined between terms considered synonyms. In our case, this relationship is defined for names of entities denoting the same class of real world objects.

To extract the use relationship, a spelling checker has been used; broader/narrower terms have been derived automatically from the schemas stored in the repository; related terms and synonyms have been inserted manually for the most frequent terms in the repository.

We assign a weight ~J to each type of relationship in the Public Administration thesaurus to operationally compute similarities. Larger weights are assigned to semantically stronger relationships. Specific values have been chosen after experiments performed on a sample of schemas. The weights are: us~ = 1, sYN = 0.8, BT/NT = 0.5, and RT = 0.5. For each pair of terms tk and h, if several relationships are defined between the two terms, the one with the larger weight cr~l is considered.

The thesaurus constructed in this way is used to determine entity similarity on the basis of the similarity of the terms contained in their labels. Multiple term similarity is computed on the basis of the relationships stored in the thesaurus. Before illustrating how entity similarity is computed, we describe the basic criterion to determine term similarity.

The similarity value Rel (tk,h) between two terms tk and tt is defined as follows:

Rel (tlc, tt) =

f l if tk = h

c~kl if tk ~ h and 30kl in the thesaurus for the pair (tk, h)

0 otherwise

The term-based similarity between a pair of entities e i and Q, denoted by TSim (ei,ej), is computed by comparing the terms contained in their labels, on the basis of their Rel values, as follows:

2 �9 ~ Rel (tik, tit) TSim (% ej) =

ni + nj

where ni and nj denote the number of terms in the labels of ei and ej, respectively.

Note that each term is considered in at most one similarity pair during entity similarity computation. If a term in an entity label has a similarity value with more than one term of the other entity's label, the pair with the greatest similarity value is considered, and the other pairs are discarded. For example, the similarity between the entities 'Order of Payment' and 'Order' is TSim (Order o f Payment, Order) = 2 • 1/3 = 0.66, being the term Order common to both the entity labels, and the total number of terms in the entity labels equal to 3.

We noted that more effective similarity coefficients can be obtained by taking into account the relevance of the terms in the context under consideration. For instance, when analysing schemas pertaining to the 'Labour' Ministry, we found that the term 'enterprise' appears in about 30% of the analysed schemas, thus carrying low information content. The most frequently occurring terms are then included in the thesaurus as stop-words too, or penalised in the computation of the term similarity values, assigning Rel (tk,h) < 1.

3.2.2. Schema Similarity

The similarity between a pair of schemas Sh and S~ is expressed by a coefficient Sire (Sh,Sk), and is based on the similarity of the entities contained in the schemas. To compute schema similarity, we use a weighted sum function. With this function, Sire (Sh,S~) is computed on

56 C. Batfni et aL

the basis of TSim evaluation for pairs of entities of Sh and Sk, as follows:

Sire (Sh, Sk) = 2" ~, TSim(eph, eql~)

nSh +flSk

where eph E Sh, eqk E Sk, and TSim (eph,eqk) is computed according to the thesaurus contents.

Also in this case, each entity is considered in at most one similarity pair. This function returns a similarity coefficient Sire E [0,1]. The value 0 indicates absence of similarity, i.e., no entities of Sh and S~ are similar. The value 1 indicates identity, i.e., all entities of S h and S~ are similar, with a similarity value equal to 1. Inter- mediate values indicate situations of more or less similarity, depending on the number of similar entities in both schemas.

As an example, let us consider again schema $1, shown in Fig. 3, and a second ER schema $2 shown in Fig. 4, related to 'Cash', whose descriptors are shown in boldline boxes in the figure.

The similarity coefficient Sire between St and $2 is computed as follows:

Sire (S1,Sz) - 2 x 3 . 6

6 + 8 - 0 . 5 1

In fact, the similarity between Item in $1 and Item in $2 is 1 because they are equally defined in both schemas, as well as between 'Credit Order' in $1 and 'Credit Order' in Sz. The similarity between 'Grantee' in $1 and 'Beneficiary' in $2 is 0.8 because their labels verify the

syn relationship in the thesaurus, while the similarity between 'Order of Payment' in Sa and 'Bank Order of Payment' in $2 is 2 • 2/5--0.8, computed according to the TSim metric.

3.2.3~, Schema Clustering

The similarity evaluation between schemas is used to identify families of similar schemas. For grouping schemas into families, we adopted the algorithms proposed for document clustering in the context of information retrieval, based on measures of similarity between documents. In particular, we tested two hierarchical agglomerative clustering techniques known as the 'single link' technique and the 'complete link' technique [19]. Both techniques are based on the construction of a similarity matrix M of rank n, where n is the number of elements to be clustered (in our case, schemas). An entry M[h,k] of the similarity matrix is the similarity value between schemas Sh and Sk. Both techniques produce a similarity tree, by iteratively grouping the most similar schemas, according to the contents of the similarity matrix M. In the similarity tree, the leaves correspond to schemas, and inter- mediate nodes correspond to numerical similarity values for families of schemas in the tree.

The two considered techniques differ in the strategy used to group schemas in clusters, and, consequently, in the characteristics of the produced families. The single link technique produces a small number of large families, while the complete link technique produces a larger number of small families, which point out more

I Nr.

Amount Cash Description Voucher

Date

Nr. ~ Customs Amount Warrant

Date

I

I-- ICode I-- Year

Item I-- Type 1---- Balance

~ 1 x C~

[~Nr. Bank Order Date. of Payment Register date

TreasuryType [ Amount

~ _ ~ Beneficiary

Operation

] , Nr. Credit Date Desk

] .A:l~ount - - Order

~ d (y/n) [

Fig. 4. An example of the Public Administration schema (S2).

~ Code Name Status Address

Accounting year Benef.Code

I-- ccountingPe.o


precisely the similarities and the differences of classified schemas.

A portion of the similarity tree constructed with the single link technique is shown in Fig. 5, while a portion of the similarity tree obtained by applying the complete link technique is shown in Fig. 6. Both portions illustrate the 'Outlay entries' family, which, in addition to schemas S~ and $2 (see Figs 3 and 4), contains other schemas describing information related to the type of expenses, the issued/paid orders of payments and the orders' beneficiaries, which share entities like 'Outlay entry', 'Item', 'Credit order', 'Beneficiary', and 'Order of payment'.

We observed that each technique alone does not guarantee a precise classification, due both to the typology of descriptors and to the grouping strategy of each technique. By manually analysing the involved schemas and associated descriptors, we decided that a mixed technique is preferable, consisting in applying the complete link technique first, to select a homogeneous set of schemas with respect to their specialised application domain and, upon resulting families, the single link technique to select similar schemas which

can be related to the family, although not strictly related to the same specialised application domain. For instance, in Fig. 6, we group in the indicated clusters systems in the 'accounting' domain. For similarity purposes, only the clusters with a similarity value greater than an established threshold are considered, which are evidenced in the figure.

In both Figs 5 and 6, the schemas are related to information systems both in the same and in different Ministries. The clustering captures the fact that the schemas contain common elements.

In complete link clustering, the similarity value captures the minimal level of similarity between all the schemas in a cluster, therefore giving indications to form a family of homogeneous schemas. In single link clustering, instead, the level of similarity of each schema with the other ones is considered in isolation.

In Fig. 5, we show that, although 'Balance and Cash' is not part of any clusters in Fig. 6, it has a significant similarity with some of the schema clusters, denoting a possible relationship with the elements of the schemas belonging to the clusters. It has to be noted that single link clustering can also lead to consideration of non-

0.28

0 3 0.3 ~ 0 33 0.3 MoS/d Analysis

MIBCA/ MI/ MLPS/ MAE/ Accounting (S1) Cash ($2) S.e.s.a.m. Foreign Location Items

Accounting

MAE Foreign Affairs Ministry MBPE Balance and Economical Planning MI Internal Affairs Ministry MIBCA Culutural Affairs Ministry MLPS Labour Ministry MS Health Ministry

Fig. 5. An example of single [ink schema clustering.

58 C. Batini et al.

0.051

Forests Experimentation

MAE Foreign Affaks Ministry MBPE Balance and Economical Planning Nil Internal Affairs Ministry MIBCA Cniutural Affairs Ministry MLPS Labour Ministry MRAAF Agriculture, Forest and Food Ministry MS Health Ministry

Fig. 6, An example of complete link schema clustering.

re levan t items (such as 'Food Analysis'), which are usually motivated by the presence of very generic terms in the schemas. It is the task of the schema analyst to evaluate the significance of such similarities to consider the inclusion of such schemas in a family.

By using the weighted sum function and the mixed clustering technique on the analysed sample of database schemas we obtained schema families, related, for instance, to 'Outlay entries', 'Revenue', 'Employees', 'Internal security', and 'Land register'.

4. Inventory Technological Platform

To store the very large amount of information collected through the questionnaires, AIPA decided to adopt the database technology.

Currently, an AIPA tool is available,, w.ith a repository storing data about processes in the different organisational units and their schemas. The implementation environment is PC-based, using Access 2.0. Retrieval and presentation functionalities have been realised using the development environment provided by Access. The schemas can be visualised both with textual information and with a simple graphical editor for ER schemas.

Concerning schema analysis support, the following functionalities have been implemented:

�9 extraction of a set of descriptors from each schema; �9 indexing of schemas based on descriptors;

�9 computation of similarity between multi-term entity labels, based on the relationship weights stored in the thesaurus;

�9 computation of schema similarity;

�9 clustering of schemas using both simple link and complete link techniques, in order to identify families of schemas. Such an analysis is presented to the user as a tree of schemas as shown in Figs 5 and 6, similarly to the interface presented in Salton et al. [18].

The schema analyst can also choose to produce reports retrieving all schemas related to a given schema in decreasing order of similarity, or produce similarity reports for entity labels. These functionalities are particularly useful to support manual semantic analysis of schemas and of their elements, given the high volume of available information.

In addition, all the information about processes and databases in the Central Public Administration collected with questionnaires is stored in the repository, converting it automatically from filled-in questionnaires.

5. Applicability of Techniques and Evaluation of Results

In previous sections we presented techniques for the analysis of a large set of ER schemas. We have


discussed automatic techniques for evaluating schema descriptors, similarity between entities and similarity between schemas. The similarity measures are at the basis of evaluation of families of schemas presenting similar characteristics. In this section we are interested in discussing two aspects related to the effectiveness of the techniques in the context of AIPA objectives, i.e.:

�9 to what extent such techniques can be applied to a variety of activities in progress at AIPA;

�9 the result of a comparison between such techniques and different techniques used at AIPA on the basis of semantic oriented and non-computer-based methods.

We discuss the two aspects together, in order to clarify primarily the type of activity performed at AIPA, and subsequently, the results of the comparison, when performed.

A first important field of application is the prepara- tion of the three-year IT plan for Public Administra- tion. The first step in the planning activity is an analysis of the state of the information system. Such analysis is performed on a quantitative basis, by comparing suitable quality indicators against reasonable reference levels. Once the expected target values for the indicators are established, the plans of the Administrations have to be prepared in order to match the target values within a fixed budget. Indicators relate different resources of the information system, such as organisation, expenses, processes, applications, information and data, technologies, and networks. We concentrate in the following on aspects related to information resources.

In order to understand the main characteristics of the information resources, such as completeness in the representation of phenomena, redundancy, distribution among administrations, intensity os use, and accessibility, the construction of an inventory has been of great importance, whose inner structure and modality of construction has been based on the methodology for the organisation of data repositories described in Batini et al. [9].

The methodology proceeds in four steps:

1. clustering/classification of database/process data schemas into families;

2. integration of families of schemas into a unique representative schema;

3. creation of an abstract version of the representative schema;

4. iteration of steps 2 and 3, until a unique final abstract schema is created.

We examine now the four steps in some detail in the AIPA context.

First of all, the large amount of schemas collected at AIPA needs a classification, in order to understand the actual extension of the information resources for each Information area and for each Ministry, and the coverage with respect to an ideal situation. Matrices are built at several clustering levels, which allow AIPA to identify stable 'business areas' in Public Administra- tion, i.e., groups of Information Areas and Ministries characterised by high internal cohesion and low external coupling.

In order to provide a suitable classification, a semantic method has been adopted at AIPA, which follows a top-down approach; it starts from a general and well-known classification of topics of interest in the Public Administration (e.g., Personnel, Environment, Land, Health), initially assigning schemas on the basis of manual inspection. During such assignment, a reasonable classification has not been found for approximately 30% of schemas: this situation leads to the introduction of new families, and in some cases to the splitting and merging of previous ones. Fur- thermore, the presence of highly populated families leads to refining them into more specific ones, finally achieving a stable assignment after two or three iterations of the whole process. Such time-consuming activity could be applied only to database schemas, which represent 10% of the totality of schemas; as a consequence, the process schemas presently remain unintegrated in the inventory.

The families obtained with the process described above have been compared with families resulting from the clustering methods presented in Section 3.2.3. We noted that the complete link technique is, in general, more precise than the single link technique, because a higher percentage o~similar target schemas is included in the correct family.

Once the families of schemas have been created, the integration process proceeds following well-known methodologies; the integration can be performed by examining the structure of similar entities, to identify commonalities and possible conflicting elements in ttieir structure. In particular, name conflicts between entities with synonym labels, and type conflicts between concepts modelled using different constructs, can arise [20]. The choice and analysis of similar entities can be performed either with a manual methodology or with the similarity techniques described in Section 3.2.3. Furthermore, the final decision can be made with the help of the thesaurus described in Section 3.2.1.

A good classification in terms of schema families is crucial also in this case; in fact, the experience suggests that, in the integration of a large amount of schemas, the effectiveness of the process is highly influenced by

60 C. Batini et aL

the order of integration of schemas. Such order is in turn characterised by three aspects:

1. choice of clusters;

2. choice of their dimension;

3. order of integration among clusters and inside clusters.

Intuitively, it seems preferable to integrate first schemas characterised by high similarity. According to the similarity techniques discussed in the paper, we put in a family all the schemas sharing entity types with similar characteristics, which are candidates to be integrated into a unique description.

We do not have for this step quantitative results of the comparison among the two methods. In general terms, the semantic method is more effective when the integration activity is critical and the domain knowledge plays a significant role in the decision, for example, when two identical concepts are described differently (e.g., they are synonyms) and, as a consequence, are not taken into account in the choice of descriptors. At the same time, the semantic method is much less efficient with respect to the semi-automatic technique, resulting in a human labour time that is an order of magnitude greater.

The third step of the inventory construction concerns the choice of abstract schemas. This activity can be positively influenced by semi-automatic techniques; the experience at AIPA suggests that, when applying a semantic method, in a high percentage of situations abstract concepts are chosen among objects that appear in the original schemas, so with a technique that is typical and well performed by the method of descriptors.

Once the repository has been created, the analysis has concentrated on specific characteristics of the information resources, In general, if the development of applications in the Public Administrations is not subject to careful planning, it may give rise to redun- dant applications or databases. Redundancies can be both in intensional (concepts) and in extensional terms (records). As an example, similarity criteria have been applied to the family of the 37 personnel databases. An integrated schema for personnel has been produced, with approximately 60 entities; each existing personnel schema has been compared to the integrated schema, resulting in an average similarity of 0.3, with a minimum of 0.12 and a maximum of 0.62 (associated with the database of the Ministry of Treasury). These values give an estimate for intensional redundancy.

The availability of schema families facilitates also the construction of reference components for the Public Administration. A reference component is defined as

an ER schema portion obtained from a set of similar entity types, by integrating all their common attributes, relationships, and specialisations, and by resolving possible conflicts [21,22]. Reference components constructed in this way can be used as reusable components, to be employed as the starting point in the development of future applications [23] to promote data standardisation across information systems of different Ministries. An interesting application of this approach occurred when an administration decided to implement its own personnel information system; the integrated schema of the Personnel family of schemas was made available to the administration, which chooses the schema as a specialisation of such an integrated schema. Another study is ongoing on the feasibility of a unique federated architecture for the personnel information system.

The experimentation has shown that the semi- automatic method and the semantic method should be used in combination, and that the only possibility to make the semantic method presently applied at AIPA feasible is to use the semi-automatic method proposed in this paper in a precomputation phase, and subsequently to perform a semantic adjustment in excep- tional situations.

The analysis of information resources has proved useful for another important activity at AIPA, i.e., process re-engineering. Process re-engineering is generally considered a challenging and critical issue for organisations in the Public Sector [8], since it has the objective of optimising work processes with respect to the efficiency of provided services and customer sat- isfaction. In our case, a basic requirement for re- engineering Public Administration processes is the capability of understanding and capturing them. The organisation of schemas into families facilitates schema analysis for process re-engineering, In fact, schemas in a family operate on similar data, and, by exploiting the information associated with ER schemas, AIPA experts can evaluate the existence of information flows among them. On the basis of identified information flows, they can have indications for reconstructing macroprocesses, i,e., complex activities: involving several processes possibly pertaining to several Offices and Divisions of Ministries ~which have a common objective of produc- ing a service to clients, and evaluate re-engineering interventions on them, for example, making the processes belonging t o t h e same macroprocess commu- nicate directly through the information system. It would be very difficult to perform these activities

complete ly manually, without any information on process similarities.


6. Concluding Remarks

In this paper we have presented the results obtained from the application of a set of schema analysis techniques to a large set of E R schemas. The focus of the paper is on discussing the application in the AIPA case of different types of analysis techniques proposed by the authors.

Semi-automatic techniques are effective on the large scale, where manual inspection-based methods are absolutely unfeasible. Used together, they provide useful precomputation and analysis, and allow the analyst to concentrate on important aspects. Although there is no experience yet, it seems these techniques could also be useful outside Public Administration, in all situations in which the organisation to be analysed is articulated in terms of a high number of units, characterised according to their role and mission, by a mixing of similarity/diversity/specialisation. This is cer- tainly true, for example, for local Public Administra- tions and several types of plant, while other areas such as banking show greater homogeneity among organisational units.

Finally, the practical experience obtained has been extremely useful for suggesting new investigations and improvements in the theoretical aspects of our research, in the following areas:

�9 new clustering techniques: based on the available similarity measures, other techniques for clustering schemas can be proposed, as an alternative to the clustering approach proposed in this paper; in particular, we are investigating the possibility of applying techniques based on Galois lattices that can be used to classify complex objects in knowledge bases [24] and the use of neural networks;

�9 analysis of natural language sentences: the information associated with schemas also includes descrip- tions in natural language, which can be used to complement the analysis of schemas based only on schema elements [25];

�9 use of domain knowledge: domain knowledge about existing Public Administration subjects could be exploited to initialise clusters in the classification of schemas, to obtain families based on closely related subjects [261;

�9 case-based reasoning: case-based reasoning techniques could be used to derive reference components from families of schemas, providing a way to compare analogous sets of elements;

�9 comparisons among parts of schemas: the techniques proposed in the paper provide a way to compare schemas as a whole; an interesting issue is the

identification of common parts of schemas, for a more effective classification based on subschemas.

Acknowledgements

This research has been partially funded by the AIPA (Autorit~ per l'Informatica nella Pubblica Amministrazione) under a contract between AIPA and Politecnico di Milano. The authors thank Dr Eng. E De Stefano and Dr Eng. M. Pellizzoni, who developed the repository and support tools, and Dr Eng. C. Francalanci, who participated in the first phases of the present research. Dr E Naggar of AIPA participated in discussions about the organisation of the repository.

References

1. Chen PP. The entity-relationship model: towards a unified view of data. ACM Trans Database Syst 1976; 1(1): 9-37

2. Batini C, Ceri S, Navathe SK. Conceptual database design. Benjamin Cummings, 1992

3. Yourdon E. Modern structured analysis. Prentice-Hall, Englewood Cliffs, NJ, 1989

4. Papazoglou MR Unraveling the semantics of conceptual schemas. Commun ACM 1995; 38(9): 80-94

5. US Department of Commerce, National Bureau of Standards. Guide to information resource dictionary system applications: general concepts and strategic systems planning. NBS special publication 500-152, US Government Printing Office, 1988

6. Srikant R, Agrawal R. Mining generalized association rules In Proceedings of VLDB'95. Zurich, September 1995, pp 407-419

7. Brodie ML, Stonebraker M. DARWIN: on the incre- mental migration of legacy information systems. DOM Technical Report, TM-0588-10-92-165, GTE Laborato- ries, November 1992

8. Aiken R Muntz A, Richards R. DoD legacy systems: reverse engineering data requirements. Commun ACM 1994; 37(5)

9. Batini C, Di Battista G, Santucci G. Structuring primitives for a dictionary of entity relationship data schemas. IEEE Trans Software Eng 1993; 19(4)

10. Bellinzona R, Castano S, De Antonellis V, Fugini MG, Pernici B Requirements reuse in the F 3 project. Eng Inform Syst 1994; 2(6)

11. Politecnico di Milano, Department of Electronic Engi- neering and Information Sciences, Information Systems Group. AIPA tool documentation, October 1995 (in Italian)

12. IBM. IBM business system planning: information system planning guide, 4th edn. July 1994

13. Martin J. Information engineering. Arthur Young, 1986 14. Strassmann E The business value of computers. Informa-

tion Economics Press, 1990 15. Thompson C. Living an enterprise model. Database

Programming Design 1993; March

62

16. Cross II JH. Speaker cites standard data sets as a major challenge facing software reverse engineering researches. Computer 1993; November

17. Teorey TJ, Wei G, BoRon DL, Koenig JA. ER model clustering as an aid for user communication and documentation in database design. Commun ACM 1989; 3(8)

18. Salton G, Allan J, Buckley C. Automatic structuring and retrieval of large text files. Commun ACM 1994; 37(2)

19. Salton G. Automatic text processing: the transformation, analysis and retrieval of information by computer. Addi- son-Wesley, Reading, MA, 1989

20. Batini C, Lenzerini M, Navathe S. A comprehensive analysis of methodologies for database schema integration. ACM Comput Surveys 1986; September

21. Castano S, De Antonellis V, Pernici B. Building reusable conceptual components in the public administration

C. Batini et aL

domain. In Proceedings of SSR'95, ACM SIGSOFT conference on software reuse, Seattle, April 1995

22. Castano S, De Antonellis V. Reference conceptual archi- tectures for re-engineering information systems. Int J Coop Inform Syst 1995; 4(2, 3)

23. Bellinzona R, Fugini MG, Pernici B. Reusing specifica- tions in OO applications. IEEE Software 1995; 12(2)

24. Mineau G, Godin R. Automatic structuring of knowledge bases by conceptual clustering. IEEE Trans Knowledge Data Eng 1995; 7(5)

25. De Antonellis V, Zonta B. A disciplined approach to office analysis. IEEE Trans Software Eng 1990; 16(8)

26. Frakes WB, Prieto-Diaz R, Fox C. DARE: domain analysis and reuse environment. In: Wentzel K, Latour L (eds). Proceedings of the workshop on institutionalizing software reuse. St Charles, IL, August 1995

Analysis of An Inventory of Information Systems In the Public Administration

Documents

Transcript of Analysis of An Inventory of Information Systems In the Public Administration