SUPPORTING TRUSTED DATA EXCHANGES IN COOPERATIVE INFORMATION SYSTEMS


Paola Bertolazzi1, Maria Grazia Fugini2, Massimo Mecella3, Barbara Pernici2, Pierluigi Plebani2, Monica Scannapieco1,3

1 Istituto di Analisi dei Sistemi ed Informatica

Consiglio Nazionale delle Ricerche (IASI-CNR) [email protected]

2 Dipartimento di Elettronica e Informazione Politecnico di Milano {fugini,pernici}@elet.polimi.it, [email protected]

3 Dipartimento di Informatica e Sistemistica Università di Roma “La Sapienza” {mecella,monscan}@dis.uniroma1.it

Contact author: Monica Scannapieco Dipartimento di Informatica e Sistemistica Università di Roma “La Sapienza”

Via Salaria 113 (2nd floor, room 231) I-00198 Roma, Italy Phone: +39 06 49918479 Fax: +39 06 85300849 E-mail: [email protected]

Abstract. In cooperative processes and in e-services, an evaluation of the quality of exchanged data is essential for building mutual trust among cooperating organizations and correctly performing cooperative activities. Several quality dimensions, related to the intrinsic nature of data and to the context of the cooperative process where data are used, must be taken into consideration. In addition, in order to accomplish a trusted cooperative environment, data sensitivity parameters must be taken into account. A model for data quality in cooperative information systems and e-applications is proposed, together with an architecture for trusted exchanges of data and the quality information associated with it. Strategic use of the model and of the architecture is discussed.

Keywords: Data quality, workflow systems, e-services, security


1 INTRODUCTION

Recently, the widespread use of information technology and the availability of

networking services have enabled new types of applications, characterized by

several geographically distributed interacting organizations. The term

Cooperative Information Systems (CIS) is used to denote distributed information

systems that are employed by users of different organizations under a common

goal (Mylopoulos and Papazoglou 1997, Brodie 1998). A recent extension of CIS allows e-services to be provided online in a cooperative context, by means of e-applications (VLDB-TES 2000, Mecella and Pernici 2001). In addition to

geographical distribution and inter-organization cooperation, in e-applications (i)

cooperating organizations may not know each other in advance and (ii) e-

services can be composed both at design and run-time. Whereas in traditional

“closed” CIS mutual knowledge and agreements upon design of applications are

the basis for the cooperation, the availability of a complex platform for e-services

(Mecella et al. 2001a) allows “open” cooperation among different organizations.

An approach towards e-applications can be found in UDDI, an initiative for

defining eXtensible Markup Language (XML) (W3C 1998) documents to publish

and discover services on the Web. The UDDI Business Registry stores different


types of information about a service, that is, business contact information (“white pages”), business category information (“yellow pages”) and technical service information (“green pages”) (UDDI 2000). In such a framework, organizations

willing to offer e-services give a description of the services informally, based on

free text, and other organizations willing to use these e-services interact with the

offering organization on the basis of agreed-upon interfaces.

Other proposals for architectures for e-services based on workflow

systems have been presented in the literature (VLDB-TES 2000, Casati et al.

2001, Mecella et al. 2001a). The starting point of all these approaches is the

concept of cooperative process, also referred to as macro process (Mecella and

Batini 2001) or multi-enterprise process (MEP, Schuster et al. 2000), defined as

a complex workflow involving different organizations; unlike traditional workflow

processes where all the activities concern the same enterprise, in a cooperative

process the activities involve different organizations, either because together they form a virtual enterprise or because they exchange services and information in a

coordinated way. The approach presented in (Mecella et al. 2001a), which

constitutes the underlying framework in this paper, assumes that a cooperative

process can be abstracted and modeled as a set of e-services exported from

cooperating organizations. The definition of a cooperative process as a set of e-

services constitutes a reference schema for the cooperation among

organizations; an e-service represents a “contract” on which an organization

involved in the cooperative process agrees.

Organizations that cooperate in CIS/e-applications can be of two types:

• trusted organizations: data transmission occurs among organizations which

trust each other in a network due to organizational reasons (e.g.,

homogeneous work groups in a departmental structure, or supply-chain

relationships among organizations forming a virtual enterprise);

• external organizations: data are transmitted among cooperating entities in

general, possibly accessing external data sources.

Every time mutual knowledge among organizations participating in CIS/e-

applications is not given in advance, new mechanisms are needed to ensure that


mutual trust is established during cooperative process executions. Trust regards

mainly two aspects: (i) the quality of data being exchanged, and (ii) a secure

environment for information exchange to protect sensitive information.

The properties indicating the quality of exchanged data are both intrinsic to the data themselves and process dependent, i.e., they depend on the activity in which data are used and on when they are used. We argue that organizations need

to specify and to exchange information explicitly oriented to describe the quality

of data circulating in CIS/e-applications. The availability of quality data allows

interacting organizations to assess the quality of received and of available data

before using them.

Sensitivity concerns both correct authentication of cooperating

organizations and guaranteeing that only authorized organizations can read, use,

and generate data in the cooperative process. To protect sensitive information, security technologies can be used, e.g., based on digital

certificates and signatures, to allow the cooperating organizations to establish a

secure communication environment and to ensure the needed level of

confidentiality.

The goal of the present paper is to propose a model for data quality,

including both traditional and original quality dimensions, and an architecture for

trusted data exchange supporting sensitivity among cooperating organizations.

Both quality and sensitivity are used to define the level of trust of the CIS/e-

application.

The paper is organized as follows. In Section 2, we first introduce a

running example, to be used for further illustration of our approach, and then we

discuss both classical data quality dimensions and additional information that

must be associated to data to build mutual trust in CIS/e-applications. The

running example stems from the experience of the Italian e-Government initiative

(Mecella and Batini 2001), which provides motivations for our work and the test

bed in which we will try our approach. In Section 3, the model for data quality is

presented in detail, whereas in Section 4 the cooperative framework is

described. Finally, in Section 5 we discuss the strategic use of trusted data


exchanges in CIS/e-applications. Section 6 discusses related work specifically

focused on data quality issues and data security aspects, and Section 7

concludes the paper with remarks on future work.

2 A RUNNING EXAMPLE AND THE DATA QUALITY DIMENSIONS

In this section we lay the basis for defining a conceptual framework for trusted

data exchanges in CIS/e-applications. First, we shortly introduce a running

example, then we define data quality dimensions for trusted cooperation in

environments such as the one described in the example.

2.1 A RUNNING EXAMPLE

An example taken from the Italian e-Government scenario (Mecella and Batini

2001) will be used throughout the paper. In Italy, the Unitary Network project and

the related Nationwide Cooperative Information System (Batini et al. 2001) are

currently undertaken, with the aim of implementing a “secure Intranet” able to

interconnect public administrations and of developing a Unitary Information

System of Italian Public Administrations in which each subject can participate by

providing services (e-services) to other subjects. Specifically each administration

has been represented as a domain, and each domain offers data and application

services, deployed and made accessible through cooperative gateways.

Similar initiatives are also under way in the United Kingdom,

where the e-Government Interoperability Framework (e-GIF) sets out the

government’s technical policies and standards for achieving interoperability and

information systems coherence across the UK public sector. For this purpose the

government has launched the UK GovTalk initiative (CITU 2000), a joint government and industry forum for generating and agreeing on standards, through

the definition of XML Document Type Definitions (DTDs) (Goldfarb and Prescod

2000) to be used for information exchange.

In this paper we use as a running example a simplified version of the

cooperative process for income management (see Figure 1).


Figure 1. UML Activity Diagram of the cooperative process

“Income Management” and the identified e-services.

Citizens send income-tax returns to the Department of Finance, which,

after executing some activities of its own competence, needs to access the

family composition of the citizen from other administrations with the purpose of

cross-checking data. The family composition of the citizen is checked against

data available from the City Council where the citizen is resident. Information

about retirement plans (in case some retired persons exist in the family) is

obtained from the Italian Social Security Service.

More in detail, the workflow consists of the Department of Finance receiving income-tax returns from citizens (sent by ordinary mail, nowadays


submitted also through a Web portal); the Department, in order to verify the

correct amount of taxes, needs to check the incomes of all people in the citizen's family; it requests the family composition from the City Council where

the citizen lives. After receiving the family status of the citizen, the Department

queries the Italian Social Security Service in order to know the amount of

pension received by retired persons in the citizen’s family; this activity is carried

out only if there are retired persons living with the citizen. After collecting all this

information, the Department has all the data needed to check income-tax

returns and possibly start further actions against fraudulent citizens.

Until recently, the information exchange described above has been

carried out using paper documents; the document exchange activated specific

processes in each organization aiming at producing response documents. Now,

on the basis of the Unitary Network and Nationwide CIS projects, each

administration can develop e-services (shown in Figure 1), allowing other

cooperating organizations to ask for and obtain requested data. In the present

paper we assume that data are exchanged as XML documents and described

through DTDs agreed upon by all the cooperating administrations.

The cooperation is effective if exchanged data are trusted, that is, their

quality is assessed and their security is guaranteed: if the quality of each exchanged data item is assessed, the receiving organization can set up appropriate measures to deal with poor-quality situations. As an example, if the citizen address provided by a City Council is found to be out of date, the Department of Finance can arrange different activities to validate updated data against other organizations (e.g., telecommunication companies maintain billing addresses for their customers).

Security requirements in this scenario regard the authentication of the

cooperating organizations, the determination of the sensitivity levels of data, and the

certification of the data transmission. Communication can be assumed to be

either trusted (e.g., between the Department of Finance and the City Council) or

untrusted (e.g., between the citizen and the City Council).


2.2 DATA QUALITY DIMENSIONS

We distinguish two kinds of data quality dimensions: intrinsic and process

specific. Intrinsic data quality dimensions characterize properties that are

inherent to data, i.e., depend on the very nature of data; an example is a

dimension specifying whether the data about the family composition of a citizen

is up to date or not. Process specific quality dimensions describe properties that

depend on the cooperative process in which data are exchanged; in our

reference example, the “timeliness” of exchanged data between the Department

of Finance and the City Council is a parameter that is fundamental to measure

the efficiency and effectiveness of the cooperative process.

Process specific parameters are an original contribution of this paper, as

we show how quality is related also to the usage of data and to its evaluation in a

cooperative framework. As regards intrinsic data quality dimensions, we refer to

a subset of the ones proposed in the literature, by considering the most

important ones (Wand and Wang 1996); we provide new definitions based on

the classical ones, to adopt them in the CIS/e-application context. We will refer

only to data quality dimensions concerning data values; conversely, we do not

deal with aspects concerning data quality of logical schemas and data format; in

the following definitions, we use the term schema element to mean, for instance, an entity in an Entity-Relationship schema or a class in an object-oriented schema

expressed in the Unified Modeling Language (UML 2000).

2.2.1 Intrinsic data quality dimensions

Our purpose is to associate data with those dimensions that are useful for

organizations receiving data to evaluate and validate them before further use.

We associate to data (i) syntactic and semantic accuracy, (ii) completeness, (iii)

currency, and (iv) internal consistency.

Syntactic and Semantic Accuracy. In (Redman 1996) accuracy refers to the

proximity of a value v to a value v′ considered as correct. Based on such a

definition, we introduce a further distinction between syntactic and semantic


accuracy.

Syntactic Accuracy: the distance between v and v’, where v’ is the value considered syntactically correct.

Semantic Accuracy: the distance between v and v’, where v’ is the value considered semantically correct.

Let us consider the following examples:

• Person is a schema element with Name as the attribute of interest, and p

an instance of Person. If p.Name1 has a value v = JON, while v′ =

JOHN, this is a case of low syntactic accuracy, as JON is not an

admissible value according to a dictionary of English names;

• if p.Name has a value v = ROBERT, while v′ = JOHN, this is a case of low semantic accuracy, as v is a syntactically admissible value but the

person whose name is stored as ROBERT has a name which is JOHN in

the real world.

Syntactic accuracy can be easily checked by comparing data values with

reference dictionaries. Semantic accuracy is more difficult to quantify since,

according to our definition, the terms of comparison have to be derived from

the real world, and so verification of semantic accuracy may be expensive.

Semantic accuracy can be checked through comparison of the information

related to the same instance stored in different databases. A typical process

that aims at identifying similar instances consists of two phases:

• A searching phase, in which possibly matching instances are identified

(Bitton and DeWitt 1983, Hernandez and Stolfo 1998, Monge and Elkan

1997);

• A matching phase, in which a decision about a match, a non-match or a possible match is taken (Hernandez and Stolfo 1998, Monge and Elkan 1997, Cochinwala et al. 1998). Usually, the decision is made in an automatic or semi-automatic way, on the basis of the database considered to store the correct values.

1 The dot notation refers to instances and their attributes, i.e., a.x indicates the value of the attribute x on a specific instance a of the schema element A.

As an example, all the attribute values related to p (with p.Name =

ROBERT), such as DateOfBirth and EmployeeNumber,

could be compared with another instance of Person from a different

database considered as correct. In such a case, the process of checking the

semantic accuracy requires the matching of < ROBERT, 11-20-1974,

1024 > and < JOHN, 11-20-74, 1024 >, that is (i) recognizing the two

instances as a potential match, (ii) deciding for a match of the two instances,

and (iii) then correcting ROBERT into JOHN.
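To make the two checks concrete, the following Python sketch is a minimal illustration, assuming a reference dictionary of admissible names and a reference database considered correct; both, as well as the use of SSN as matching key, are assumptions made here for illustration.

# Minimal sketch of the two accuracy checks (dictionary and reference
# database are assumed for illustration).

def edit_distance(a, b):
    # Classical Levenshtein distance via dynamic programming.
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[len(a)][len(b)]

NAME_DICTIONARY = {"JOHN", "ROBERT", "MARY"}       # assumed dictionary

def syntactic_accuracy(v):
    # Distance of v from the closest syntactically admissible value.
    return min(edit_distance(v, w) for w in NAME_DICTIONARY)

def semantic_accuracy(instance, reference_db):
    # Searching phase: find the instance with the same SSN; matching
    # phase: compare the name with the value considered correct.
    for ref in reference_db:
        if ref["SSN"] == instance["SSN"]:
            return edit_distance(instance["Name"], ref["Name"])
    return None                                    # no potential match

print(syntactic_accuracy("JON"))   # 1: JON is one edit away from JOHN
print(semantic_accuracy({"Name": "ROBERT", "SSN": "1024"},
                        [{"Name": "JOHN", "SSN": "1024"}]))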

Completeness. We define this dimension as:

Completeness: the degree to which values of a schema element are present in the schema element instance.

In evaluating completeness, it is important to consider the meaning of null

values of an attribute, depending on the attribute being mandatory, optional,

or inapplicable: a null value for a mandatory attribute is associated with a

lower completeness, whereas completeness is not affected by optional or

inapplicable null values. As an example, let us consider the attribute Email

of the Person schema element; a null value for the Email attribute may

have different meanings, that is (i) the specific person has no email address,

and therefore the attribute is inapplicable (this case does not impact on

completeness), or (ii) the specific person has an email address but it has not

been stored (in this case completeness is low).
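A minimal sketch of this computation, where the classification of attributes as mandatory or optional is an assumption made for illustration:

# Completeness as the fraction of mandatory attributes that are non-null;
# null optional or inapplicable attributes do not lower the value.
MANDATORY = ["Name", "Surname", "SSN"]     # assumed classification
OPTIONAL = ["Email"]

def completeness(instance):
    present = sum(1 for a in MANDATORY if instance.get(a) is not None)
    return present / len(MANDATORY)

person = {"Name": "JOHN", "Surname": None, "SSN": "1024", "Email": None}
print(completeness(person))   # about 0.67: the null Surname lowers it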

Currency. The currency dimension refers only to data values that may vary in


time; as an example, values of Address may vary in time, whereas

DateOfBirth can be considered invariant. Therefore currency can be

defined as the “age” of a value, namely:

Currency: the distance between the instant when a value is last updated and the instant when the value itself is used.

It can be measured by associating with each value either an “updating timestamp” (Missier et al. 2001) or a “transaction time” as in temporal databases (Tansell et al. 1993).
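Under the updating-timestamp approach, currency reduces to a time difference; a minimal sketch with illustrative timestamps:

# Currency as the distance between the last update of a value and the
# instant of use, assuming an updating timestamp is stored with it.
from datetime import datetime

def currency(last_updated, used_at):
    return used_at - last_updated

print(currency(datetime(2001, 1, 10), datetime(2001, 3, 1)))  # 50 days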

Internal Consistency. Consistency implies that two or more values do not

conflict with each other. By internal consistency we mean that all the

values that are compared in order to evaluate consistency are within a

specific instance of a schema element.

A semantic rule is a constraint that must hold among values of attributes of a

schema element, depending on the application domain modeled by the

schema element. On the basis of this definition, internal consistency can be

defined as:

Internal Consistency: the degree to which the values of the attributes of an instance of a schema element satisfy the specific set of semantic rules defined on the schema element.

As an example, if we consider Person with attributes Name, DateOfBirth,

Sex and DateOfDeath, some possible semantic rules to be checked as

satisfied are:

• the values of Name and Sex are consistent; if Name has a value v =

JOHN and the value of Sex is FEMALE, this is a case of internal

inconsistency;


• the value of DateOfBirth needs to precede the value of DateOfDeath.
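A minimal sketch of such a rule check, where the rule set and the dictionary of female names are assumptions mirroring the examples above:

# Internal consistency as the fraction of semantic rules satisfied by an
# instance of Person; the two rules mirror the examples in the text.
FEMALE_NAMES = {"MARY", "JANE"}            # assumed dictionary

RULES = [
    lambda p: (p["Sex"] == "FEMALE") == (p["Name"] in FEMALE_NAMES),
    lambda p: p["DateOfDeath"] is None or p["DateOfBirth"] < p["DateOfDeath"],
]

def internal_consistency(p):
    return sum(1 for rule in RULES if rule(p)) / len(RULES)

person = {"Name": "JOHN", "Sex": "FEMALE",
          "DateOfBirth": "1974-11-20", "DateOfDeath": None}
print(internal_consistency(person))   # 0.5: the Name/Sex rule is violated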

2.2.2 Process specific dimensions

The need for data quality dimensions dependent on the context is recognized in

(Wang and Strong 1996); we observe that in CIS/e-applications, the context is

the cooperative process and data quality dimensions are related to the evolution

of data over time and within the process. We have therefore chosen and

adapted some of the dimensions proposed in (Wang and Strong 1996)

(timeliness and source reliability), and in addition we propose new dimensions

dependent on cooperative processes (importance and confidentiality).

Process specific dimensions are tied to specific data exchanges within the

process, rather than to the whole process. Hence, in the following definitions, we

consider a data exchange as a triple < source organization i,

destination organization j, exchange id >, representing the

cooperating organizations involved in the data exchange and the specific

exchange2.

Timeliness. It can be defined as follows:

Timeliness: the availability of data on time, that is, within the time constraints specified by the destination organization.

For instance, we can associate a low timeliness value with the schedule of the lessons in a University, if such a schedule becomes available online only after the lessons have already started. For computing this dimension, each

organization has to indicate the due time, i.e., the latest time within which

data have to be received. According to our definition, the timeliness of a value

cannot be determined until it is received by the destination organization.

2 Two organizations may be involved in more than one exchange of the same data within

the same cooperative process.
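A minimal sketch of the destination-side check, with an illustrative due time:

# Timeliness evaluated by the destination organization: data are timely
# if received within the due time it has specified.
from datetime import datetime

def is_timely(received_at, due_time):
    return received_at <= due_time

due = datetime(2001, 7, 31, 12, 0)                      # assumed due time
print(is_timely(datetime(2001, 7, 31, 9, 30), due))     # True: on time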


Importance. This dimension can be defined as:

Importance: the significance of data for the destination organization.

As an example, we can consider an organization B (e.g., the Department of

Finance) that cannot start an internal process until an organization A (e.g.,

the City Council) transfers values of the schema element X (e.g. the family

composition of a citizen); in this case, the importance of X for B is high.

Importance is a complex dimension that can be defined based on specific indicators measuring, for a schema element: the number of instances managed by the destination organization per temporal unit; the number of processes internal to the destination organization in which the data are used; and the ratio between the number of core business processes using the data and the overall number of internal processes using the data. Therefore importance is:

Importance(data, destination org.) =
    f(# instances of data per temporal unit,
      # internal processes of destination org. using data,
      # core business processes of destination org. using data /
          # internal processes of destination org. using data)
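Since f is left unspecified, the sketch below instantiates it as a simple weighted combination purely for illustration; the weights and the indicator values are assumptions, not choices prescribed by the framework:

# Importance as a function f of the three indicators; the weighted sum
# and the weights are illustrative assumptions.
def importance(n_instances_per_day, n_internal_procs, n_core_procs,
               w=(0.2, 0.3, 0.5)):
    core_ratio = n_core_procs / n_internal_procs if n_internal_procs else 0.0
    return (w[0] * n_instances_per_day +
            w[1] * n_internal_procs +
            w[2] * core_ratio)

# Family composition data at the Department of Finance (values assumed).
print(importance(n_instances_per_day=500, n_internal_procs=4, n_core_procs=3))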

Source Reliability. It can be defined as:

Source Reliability: the credibility of a source organization with respect to provided data; it refers to the pair < source, data >.

The dependence on < source, data > can be clarified through an

example: a University (source) has a reputation of high reliability when

treating data regarding its students and offered courses, but it can have a low

reliability when releasing information regarding forthcoming commercial events related to companies that offer internships, since such information is not fully within the University's competence. As another example, the source

reliability of the Italian Department of Finance concerning Address of

citizens is lower than that of City Councils, whereas concerning the SocialSecurityNumber its source reliability is the highest among all Italian

administrations.

The values of source reliability may depend on the methods each

organization uses to clean its data and to measure their quality.

Confidentiality. In a cooperative process sensitivity concerns protecting data

from accidental and fraudulent misuse. In general, three dimensions are

associated to secure information exchange: confidentiality, integrity, and

authentication. Confidentiality means that data are not read during

transmission, integrity that they are not altered, and authentication that

sources and destinations are correct. In the following, we will assume that

integrity and authentication are in any case guaranteed by CIS/e-applications, as detailed later in this paper, and we associate with data additional information only about confidentiality.

Confidentiality: indicates whether data must be protected from access by non-authorized users.

As an example, let us consider the instance of Person with

Name="John", DateOfBirth="11-20-1974", Sex="M". Using the public key of the recipient's key pair, data can be ciphered, obtaining

the sequence:

D“å–àVÌÁÇ9•ûeÑÉÔ;ÿaˆäqÜdNÞוeYdXN}-çÊCª•éï$t

In this way, only the recipient can decrypt the message using the private key of the same key pair.
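A minimal sketch of this asymmetric step using the Python cryptography package; the key size, the OAEP padding, and the message are our assumptions:

# Ciphering data with the recipient's public key (RSA-OAEP), so that only
# the holder of the matching private key can read it.
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

recipient_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

ciphertext = recipient_key.public_key().encrypt(
    b'Name=John, DateOfBirth=11-20-1974, Sex=M', oaep)
plaintext = recipient_key.decrypt(ciphertext, oaep)   # recipient side only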


3 DATA AND QUALITY MODELS

3.1 DATA MODEL

In our framework, all the organizations involved in CIS/e-applications need to

export their data according to some specific schemas; we refer to these schemas

as cooperative data schemas.

They are class schemas defined in accordance with the ODMG Object

Model (Cattell and Barry 1997). Specifically they describe types of exchanged

data items, wherein types can be:

• classes, whose instances have their own identities;

• literals, whose instances do not have identities and are identified by their values.

It is possible to define new classes as collections of objects (instances are

objects) and also structured literals, as records of literals. As an example, in

Figure 2 a detail of the cooperative data schema exported by the City Council in

our reference example is shown.

This schema defines Citizen as a class, and Address as a structured literal (a record).

3.2 DATA QUALITY MODEL

This section presents the conceptual data quality model that each cooperating

organization has to define in order to export the quality of its own data. First we

define the notion of cooperative data quality schema, then we distinguish

between intrinsic and process specific data quality schemas and describe them

in detail.

A cooperative data quality schema is a UML Class Diagram associated to

a cooperative data schema, describing the data quality of each element of the

data schema.


struct Address {
    string street;
    string cityName;
    string state;
    string country;
    short ZIPCode;
}

… …

class Citizen {
    attribute string name;
    attribute string surname;
    attribute string SSN;
    attribute Date birthDate;
    attribute Address currentAddress;
    … …
}

Figure 2. The cooperative data schema exported by the City

Council (detail)

3.2.1 Intrinsic Data Quality Schemas

Intrinsic data quality dimensions can be modeled by considering classes, which we

call dimension classes, describing the data quality of the data schema elements

with reference to a specific dimension; therefore dimension classes represent

specific intrinsic data quality dimensions (e.g., completeness or currency).

We distinguish two types, according to whether they refer to a class or to a structured literal of a cooperative data schema, namely dimension classes and dimension structured literals. Each dimension

class represents the abstraction of the values of a specific data quality

dimension for each of the attributes of the class or of the structured literals to

which it refers, and to which it is associated by a one-to-one association.

A dimension class (or dimension structured literal) is represented by a

UML class labeled with the stereotype <<Dimension>> (<<Dimension_SL>>),

and the name of the class should be < DimensionName_ClassName > (<

DimensionName_SLName >).

As an example, the class Citizen may be associated with a dimension class, labeled with the stereotype <<Dimension>>, whose name is SyntacticAccuracy_Citizen; its attributes correspond to the syntactic accuracy of the attributes Name, Surname, SSN, etc. (see Figure 3

referring to Figure 2).

Figure 3. An example of dimension class.

3.2.2 Process specific data quality schemas

Tailoring UML in a way similar to that adopted for intrinsic data quality dimensions, we introduce process dimension classes, which represent process specific data quality dimensions just as dimension classes represent intrinsic data quality dimensions.

We introduce the exchange structured literal, necessary to characterize

process dimension classes. According to the definitions proposed in Section 2,

process specific data quality dimensions are tied to a specific exchange within a

cooperative process; in our framework, a cooperative process is modeled as the

interaction of different e-services provided by the different organizations, and we

introduce exchange structured literals to represent the dependence of process specific dimensions on source and destination e-services (and the organizations

exporting such e-services).

We distinguish two types of process dimension classes, process

dimension classes and process dimension structured literals; they include the

values of the attributes of the class or of the structured literals to which they

refer, and to which they are associated by a one-to-one association. We use the

stereotypes <<P_Dimension>> and <<P_Dimension_SL>> for process dimension classes and process dimension structured literals, respectively. The

name of the class should be < DimensionName_ClassName > (<


DimensionName_SLName >). See Figure 4 as an example.

Figure 4. An example of process dimension class.

An exchange structured literal is a structured literal associated to process

dimension classes. It includes the following mandatory attributes:

• source e-service,

• destination e-service,

• process identifier,

• exchange identifier.

Because within a cooperative process two e-Services may have more than one exchange, it is necessary to introduce an exchange identifier, which identifies the exchange univocally. Exchange structured literals

are labeled with the stereotype <<Exchange_SL>>.

The considerations exposed in this section are summarized in Figure 5, in

which the quality referring to both intrinsic and process specific dimensions for

the Citizen class is represented; the intrinsic data quality dimensions (syntactic

and semantic accuracy, completeness, currency, internal consistency) are

labeled with the stereotype <<Dimension>>, whereas the process specific data

quality dimensions (timeliness, importance, source reliability, confidentiality) are

labeled with the stereotype <<P_Dimension>>, and are associated to the

structured literal Exchange_Info, labeled with the stereotype

<<Exchange_SL>>.


Figure 5. Cooperative data quality schema (detail referring to the

Citizen class). All the associations are one-to-one.

4 THE FRAMEWORK FOR TRUSTED COOPERATION

4.1 THE ARCHITECTURE FOR TRUSTED E-SERVICES

Many approaches can be adopted to allow different organizations to cooperate

through the definition and development of CIS/e-applications, as described in the

introduction. The approach adopted in this paper is workflow-based, that is the

different organizations export data and services necessary to carry out specific

cooperative processes in which they participate. Such an approach requires an

agreement on the data and service models exported by different organizations

(Mecella et al. 2001a).


In this section, we describe the architecture enabling trusted CIS/e-

applications, by focusing on the two central elements, namely data quality and

security. The starting point of our framework is the definition of a conceptual

cooperative workflow specification, that is, an abstract workflow description that

hides the details of process execution in each of the cooperating organizations;

an example of this conceptual cooperative workflow specification for the running

example has been shown in Figure 1.

On the basis of such a schema, each organization defines its cooperative

data schemas, which specify the structure of exchanged data. Such schemas

are the static interfaces of e-services that implement the cooperative process

through exchanges of trusted data and service requests among different

cooperating organizations. As an example, in Figure 1 the areas limited by

dotted lines identify the e-services. In addition to data schemas, each

organization exports cooperative data quality schemas, described in Section 3.2,

in which information about the quality of the exported data is modeled.

The proposed architecture is shown in Figure 6.

Each cooperating organization exports e-services as application

components deployed on cooperative gateways; a cooperative gateway is the

computing server platform which hosts these components; different

technologies, such as OMG Common Object Request Broker Architecture (OMG

1998), SUN Enterprise JavaBeans (Monson-Haefel 2000), and Microsoft

Enterprise .NET (Trepper 2000) allow the effective development of such

architectural elements, as detailed in (Mecella and Batini 2000).

A cooperative process is therefore realized through the coordination of

different e-services, to be provided by e-applications. An e-application realizes

the “glue” interconnecting and orchestrating different e-services; such a “glue”

needs to be based on the cooperative schemas, regarding both data and their

quality.


Figure 6. The architecture for trusted CIS/e-applications.

Some elements provide infrastructure services needed for the correct and

effective deployment of trusted e-services in the context of this architecture:

• a repository, which stores e-service specifications, that is data schemas, data

quality schemas and application interfaces provided by each e-service; this

repository is accessed at run-time by e-applications to discover and compose

e-services that each organization makes available;

• a source reliability manager that, for each e-service and for each data item exported by such an e-service, certifies its source reliability (refer to Section

2.2.2); therefore the source reliability manager stores triples < e-service,

data, source reliability value >;

• a certification authority, providing digital certificates, a certificate repository


and a certificate revocation list (Housley et al. 1999); the roles of such

elements will be described in the next section, when security aspects

concerning data exchange are discussed.

4.2 EXCHANGE UNIT FORMAT

Different information needs to be associated with each data exchange in order to

support trust; we define an exchange unit as data:

• transmitted from one e-service to another in the cooperative process,

• associated with quality data, and

• transmitted according to security rules.

All data are exchanged according to the exchange unit format (shown in

Figure 7), in order to ensure that they all can be adequately validated by the

receiving organization (this concept will be further explained in Section 5).

Figure 7. Exchange Unit Format. [The exchange unit comprises information about data and process (Data, Quality Data, History, Sensitivity Information) and security aspects (Digital Certificate, Digital Signature).]
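As a minimal sketch, the exchange unit can be represented as a data structure whose fields mirror Figure 7; the Python types below are our assumptions:

# The exchange unit of Figure 7: information about data and process,
# plus the security aspects. Types are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class HistoryEntry:
    source_eservice: str
    destination_eservice: str
    operation: str               # e.g., read, clean, realign
    link_to_previous: str
    timeliness: bool

@dataclass
class ExchangeUnit:
    data: bytes                                   # the XML document
    quality_data: bytes                           # intrinsic quality values
    history: list = field(default_factory=list)   # list of HistoryEntry
    sensitivity_info: dict = field(default_factory=dict)  # flags, method, key
    digital_certificate: bytes = b""
    digital_signature: bytes = b""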

Data are exchanged as XML files; specifically, cooperative data schemas are described as DTDs. As an example, Figure 8 shows an XML document corresponding to the detail of the cooperative data schema exported by the


City Council (refer to Figure 2).

<Citizen>
  <FirstName>John</FirstName>
  <LastName>McLeod</LastName>
  <SSN>000111222333</SSN>
  <Date property="birthDate">
    <Day>10</Day>
    <Month>06</Month>
    <Year>1945</Year>
  </Date>
  <Address field="currentResidence">
    <Street>… …</Street>
    <CityName>New York</CityName>
    <State>NY</State>
    <Country>USA</Country>
    <ZIP>… …</ZIP>
  </Address>
</Citizen>

Figure 8. A possible XML document corresponding to the

cooperative data schema shown in Figure 2.

Quality data concerning intrinsic dimensions (i.e., syntactic and semantic

accuracy, completeness, currency and internal consistency) may be the result

of an assessment activity performed by each organization on the basis of

traditional methods for measuring data quality, e.g., the statistical methods proposed in (Morey 1982).

History. It can be defined as a list of tuples < source e-service, destination e-service, operation, link to previous data, timeliness >, describing the history of manipulations applied to data. For

the purpose of the present paper, we assume that:

• the history of data tracks the transfer of data among interacting

organizations (i.e., e-services) only if the nature of data is not changed

through processing executed by the destination organization;

• if a value is changed, it will be transferred in a new data exchange,

starting a new history list;


• operations that preserve the history are those that do not alter identities of

exchanged data, that is: read, clean (according to data cleaning

algorithms), realign operations (such as changing the format of dates from

the European to the American one).

Sensitivity information denotes the level of confidentiality of data being

transferred, and, according to this level, information useful for its encryption.

The confidentiality level can be assigned to data according to standard

security policies, e.g., using rules for data labeling (Castano et al. 1995).

Depending on the relevance level of exchanged data, confidentiality can be

ensured at different granularity levels. We can encrypt: (i) only the data

package, (ii) also quality data and history, (iii) no data parts, or (iv) any possible combination thereof. To cope with these requirements, sensitivity

information regards:

• Confidentiality: for each component of the exchange unit (i.e., data, quality

data, history), we define a boolean value (confidentiality flag) indicating

whether the component is confidential or not.

• Encryption method: indicates the asymmetric encryption algorithm (e.g.,

RSA), and the hash algorithm (e.g., SHA1) (Tanenbaum 1996) to be used

to generate the digital signature (see Figure 7).

• Session key: the key to be used to encrypt the relevant information using symmetric cryptography (e.g., Triple-DES) (Tanenbaum 1996) in order to improve transmission performance (see the sketch below).
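A minimal sketch of this hybrid scheme; the paper mentions Triple-DES, for which we substitute Fernet, the symmetric recipe of the Python cryptography package, purely for illustration:

# Hybrid confidentiality: data encrypted with a symmetric session key,
# the session key wrapped with the destination's public key (RSA-OAEP).
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

dest_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

session_key = Fernet.generate_key()
encrypted_data = Fernet(session_key).encrypt(b"<Citizen>...</Citizen>")
wrapped_key = dest_key.public_key().encrypt(session_key, oaep)

# Destination side: unwrap the session key, then decrypt the data.
plaintext = Fernet(dest_key.decrypt(wrapped_key, oaep)).decrypt(encrypted_data)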

Security aspects of the exchange unit need to be addressed, namely (i)

integrity, (ii) authentication and (iii) confidentiality.

As regards integrity, it is provided by creating a secure and efficient

transmission channel through the following components of the exchange unit:

• the digital certificate, owned by the source organization;

• the digital signature of both the listed components of the exchange unit

and the digital certificate.

The digital certificate is issued by a Certification Authority, basically according

to the X.509 format (some extensions can be possibly required but, as they


regard data contents and source rather than data exchange, they are not

further analyzed in this paper) (Housley et al. 1999). The digital signature is

created according to the PKCS#7 specification (RSA Laboratories 1993),

thus allowing the destination organization to verify the integrity of the data

and of the digital certificate. By also signing the certificate, we guarantee the

association between the data and its creator.
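A minimal sketch of signing and verification; we use plain RSA with SHA-256 from the Python cryptography package in place of a full PKCS#7 envelope, and the payload layout is an assumption:

# Signing the exchange unit components together with the certificate, so
# the destination can verify integrity and the data/creator association.
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes
from cryptography.exceptions import InvalidSignature

source_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
payload = b"data|quality|history|sensitivity|certificate-bytes"

signature = source_key.sign(payload, padding.PKCS1v15(), hashes.SHA256())

try:    # destination side, using the source's public key
    source_key.public_key().verify(signature, payload,
                                   padding.PKCS1v15(), hashes.SHA256())
    print("integrity verified")
except InvalidSignature:
    print("exchange unit was altered")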

Authentication can be weak or strong:

• Weak authentication, required for trusted organizations, means that the

destination e-service checks the signature of the source e-service using

the public key of the source e-service, but trusts the certificate of the

source e-service. The advantage is that data transmission is fast and

reliable: trusted organizations know each other by means of a list of

certificates (in the certificate repository); integrity and reliability of such

lists are under the responsibility of the Certification Authority.

• Strong authentication, required for untrusted/external organizations, uses

a Public Key Infrastructure (PKI) (Housley et al. 1999), specifically a

certificate revocation list, in order to validate the certificate of the source

e-service.

Finally, as regards confidentiality, data, quality data and history are encrypted

using the session key included in the sensitivity information part, according to

the value of the confidentiality flags. To avoid disclosure of the session key,

this is encrypted by the source e-service using the public key of the

destination one.

4.3 QUALITY DIMENSIONS VS. FRAMEWORK ELEMENTS

The information transmitted in the exchange unit does not describe all data

quality dimensions introduced in Section 2.2. Some of the dimensions are

associated to data during exchanges between source and destination

organizations, whereas other data quality dimensions are evaluated, or directly

associated to data, by the destination organization.

A summary table describing where the quality dimensions are elaborated


is shown in Table 1. All intrinsic quality data are transmitted together with data by

the source organization.

Process related quality data reflect instead the dynamic nature of data

within the processes. Timeliness is evaluated by the destination organization,

according to the expected arrival time of the data (i.e., due time). Importance is associated with data by the destination organization according to the activity to be performed on these data. The destination organization accesses

the source reliability manager to know the reliability of the source organization

with respect to the exchanged data; finally confidentiality is transferred by means

of sensitivity information in the exchange unit.

Intrinsic dimensions (Syntactic and Semantic Accuracy, Completeness, Currency, Internal Consistency): transferred with data within the exchange unit; evaluated by the source organization.

Process specific dimensions:
- Timeliness: not transferred; evaluated by the destination organization.
- Importance: not transferred; associated with data by the destination organization.
- Source Reliability: not transferred; provided by the Source Reliability Manager, which is accessed by the destination organization.
- Confidentiality: transferred with data; associated with data by the source organization.

Table 1. Quality dimensions in cooperative processes.

5 THE FRAMEWORK UNDER A STRATEGIC PERSPECTIVE

The framework proposed in the previous sections for data exchange in CIS/e-applications allows organizations to assess data upon receiving them. In the present section, we discuss methodological issues related


to interpretation and possible strategic uses of information about trust of the

exchanged data.

From a methodological point of view, we can examine different points

related to data quality evaluation by an organization:

• data creation;

• assessment of the quality of received data;

• evaluation of acceptable quality levels;

• actions to be taken when low quality data is received.

Data exchanged in the cooperative environment can originate internally in the organizations. Upon data creation, it is necessary to evaluate the

quality of newly created data, in particular with respect to accuracy,

completeness, and internal consistency. Accuracy can be assessed

according to statistical evaluations based on the type of data creation, e.g.,

manual data entry or capture using OCR systems. Corrections to

such assessments can be applied if data cleaning techniques are used on

created data to improve its quality.

As regards the assessment of the quality of received data by destination

organizations, it is important to note that quality is not an absolute value, but

it is mainly related to the intended use of the data by the destination

organization in that specific exchange in the process. Several conflicting

considerations can be made based on available quality parameters, yielding

different evaluations, and we discuss some examples in the following.

Let us suppose that a given organization B receives from an organization A

an exchange unit x. First of all, B must compute the timeliness for all the data

values of x, on the basis of their due time. Importance affects assessment of

timeliness; as an example, if importance is “high” but data are not delivered in

time, then B will consider them “poor quality” data during the evaluation

phase. All the intrinsic data quality values can be weighted on the basis of the

related values of importance and source reliability, by using some weighting


function chosen by the organization B. The values of importance of a given

data are chosen by the organization B, whereas the source reliability of A,

with respect to the specific data, is maintained by the Source Reliability

Manager. In many cases there is a trade-off between source reliability,

importance and other dimensions; as an example, B may consider that a

"low" source reliability for data within x may be balanced by a “high” accuracy

for them.

The assessment can be done either on single data values, or on the whole

exchange unit; it is a choice of the destination organization to aggregate and

elaborate received data to assess a global quality value. On the other hand, it

is not possible to disaggregate data which are being received as a single

value with respect to quality parameters; as an example, if an Address

instance is transmitted as composed of Street, ZIPCode, CityName,

State and Country, it is possible to evaluate both the quality of each value

and the global quality of the Address instance. Conversely if the Address

value is transmitted as a simple string, it is possible to evaluate only its quality

as such, and not the quality of each of its components.

Once the quality of received data has been assessed, it can be evaluated for

acceptability, by using a multi-argument function. The decision whether to

accept or reject incoming data depends on complex tradeoffs among quality

parameters; as an example, while in some cases timeliness of data is more

important than accuracy, in other cases the contrary is true, and the

organization B prefers receiving late but accurate data. The result of this step

is a general acceptance of the received exchange unit, e.g., if the importance

of x is “very high” whereas the overall quality is “very poor”, B can decide to

reject it. In this evaluation, some organizations may choose to examine the

complete history of data, basing acceptance not only on information

concerning the last exchange, but also evaluating all the manipulations and

timeliness information about previous data exchanges, as stored in the

history component of x. For instance, data cleaning operations already


applied to data by other organizations can support the assessment of a greater global data quality value.
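As a minimal sketch of such a multi-argument evaluation, with quality scales, weights, and threshold chosen purely for illustration (the paper leaves these choices to the destination organization):

# Acceptability of an exchange unit as a weighted, multi-argument
# function of its quality values; weights and threshold are assumptions.
def accept(quality, importance, source_reliability,
           weights=None, threshold=0.6):
    # quality: dict of dimension -> value in [0, 1], 1 being best
    weights = weights or {d: 1.0 for d in quality}
    score = (sum(weights[d] * v for d, v in quality.items())
             / sum(weights.values()))
    # Very important data with low overall quality are rejected; a low
    # source reliability may be balanced by high accuracy via the weights.
    if importance == "very high" and score < threshold:
        return False
    return score * source_reliability >= threshold

q = {"accuracy": 0.9, "completeness": 0.8, "timeliness": 0.3}
print(accept(q, importance="very high", source_reliability=0.9))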

After the decision to accept and to use data, it is possible to continue the

execution of the cooperative process, according to the cooperative workflow

specification. Conversely, if the quality of available data is insufficient, it is

necessary to take corrective actions to improve the quality of data. Several

actions are possible:

• Received data are rejected, and the source organization A is requested to

resend the same data with better quality parameters. This situation is

acceptable when low global quality is not related to lack of timeliness.

• An e-service can raise an exception to its normal execution. An

exception causes the activation of other e-services that are not part of the

normal workflow.

• A data cleaning or improvement action is undertaken inside the e-service

of the organization B in order to improve data quality.

From these basic considerations, we now derive suggestions for design,

implementation and strategic use of trust parameters, and for improvement and

possible restructuring interventions.

Framework design and management issues. The design and maintenance of

the cooperative environment supporting trusted data exchanges comprises several aspects:

• Granularity criteria for e-services design. An e-service can be designed to

cover a whole organization, or portions thereof, hence at different

granularities. We just mention criteria that can be adopted here, such as

criteria employed for workflow process design: homogeneity of activities,

manageability of problems, number of interfaces to be designed, number

of agents assigned to activities. Other criteria to be used here can be

taken from the literature on (distributed) data design: dimension of


exchanged data units, number of data values to be transmitted if the

designed exchange unit is too small/large, granularity of encryption and

signature/certificate mechanisms to ensure security and reliability.

Both classes of criteria can be applied to design the e-services at a

correct level. Obviously, the granularity deeply impacts on the efficiency

and the maintainability of the environment. As an example, let us consider

Figure 1 and the problems related to the introduction of a new e-service:

this can imply the substitution of a schema portion related to an e-service (i.e., a portion delimited by dotted lines in the figure) with the one of the new provider organization (which, for instance, can offer the same service under

competitive conditions), or the reorganization of the whole cooperative

workflow specification, if a new e-service is defined and added to the

existing cooperative process.

• Benchmarking. The quality parameters described in the paper can be regarded as strategic means for benchmarking the cooperative process design, since they help in monitoring e-services and help:
- to better define the granularity of the schemas;
- to restructure and re-engineer the schemas;
- destination organizations to improve their relationships with other entities (e.g., their business customers);
- source organizations to improve their services (e.g., balancing accuracy vs. timeliness vs. importance).

• Accounting and Monitoring. To help improve framework efficacy, a mechanism of e-monitoring can be set up to observe quality information, thus supporting tracing, analysis, and certification of data exchanges. Accounting information is a basic aspect of e-monitoring. It should also be accompanied by documentation about data flows, by testing and probing reports resulting from samples of the framework operation, and by trust reports that contain all security-relevant parameters (history of flows, of user behavior, of security violations, and so on).
Another way of verifying the quality of the design is the observation of exceptions to the normal flow. Frequent exceptions can be a symptom of malfunctioning of some e-services, due to various reasons. One is straightforward and generally concerns wrong design choices (wrong granularity being a prime example). A second type of cause can be the low quality of data provided by a given e-service; for example, data from one provider e-service (i.e., organization) always present "very low" timeliness, or are poorly secured. Triggers can be inserted in the cooperative workflow specification to monitor these anomalies (a sketch of such a trigger is given after this list) in order to:
- signal to the destination organization that a given provider e-service is unreliable;
- signal to the source organization that the quality of data provided by its e-services is low and that it risks being pushed out of the market.
Anomalies can therefore be regarded as a means to strategically monitor framework design and performance, and for organizations to improve their strategic orientations.

• Contractual aspects bound to e-service executions. Cooperating organizations should be able to obtain certification of the data exchanged, of their quality and sensitivity levels, and of user satisfaction as measured through quality parameters.

• Compliance between the cooperative data model and the organizational model. This aspect can be studied by observing the overall behavior of the e-services, the customer satisfaction, the percentage of discarded data, the exceptions that occurred during workflow executions, and so on; in particular, exceptions and their management are useful for deciding whether an e-service has to be corrected or redesigned.
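The anomaly trigger mentioned under Accounting and Monitoring can be sketched as follows, again reusing the hypothetical ExchangeUnit and Level types introduced earlier; the ten-exchange window and the 30% threshold are assumptions introduced purely for illustration.

    // Hypothetical anomaly trigger over observed exchanges.
    class AnomalyTrigger {
        private int lowQualityCount = 0;
        private int totalCount = 0;
        private static final double ANOMALY_RATE = 0.3; // assumed threshold

        void observe(ExchangeUnit x) {
            totalCount++;
            if (x.overallQuality == Level.VERY_POOR || x.overallQuality == Level.POOR)
                lowQualityCount++;
            // After enough observations, signal both ends of the exchange.
            if (totalCount >= 10
                    && (double) lowQualityCount / totalCount > ANOMALY_RATE) {
                notifyDestination("the provider e-service may be unreliable");
                notifySource("the quality of your e-service's data is low");
            }
        }
        void notifyDestination(String msg) { System.out.println("to destination: " + msg); }
        void notifySource(String msg)      { System.out.println("to source: " + msg); }
    }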

Implementation framework. Several elements are needed in order to realize the proposed framework for trusted cooperation; specifically, (i) a descriptive structure, consisting of models and languages able to describe e-services at a high level of abstraction, and (ii) tools for mapping such a structure onto a multi-technology platform and for initializing trust parameters.


E-services are described by using an abstract description language (Mecella et al. 2001a, Mecella et al. 2001b), which is an abstraction of technological component models (e.g., CORBA, EJB, .NET); each e-service needs to be effectively provided by an organization as a component in a specific implementation technology. The implementation interfaces of such a component can be generated from the e-service specification, based on specific generation rules and tools. The use of different component models, one for e-service description and many at the technological level, is due to the coexistence of different cooperative technologies and to the opportunity, in a multi-organization environment, of integrating all these components by adopting a technology-independent component model for them.

The coordination of the different e-services composing a cooperative process is carried out by the coordination "glue" inside e-applications; such glue is able to coordinate different components by generating specific service requests at run-time, according to the respective implementation technologies. The generation of these service requests relies on the e-service descriptions stored in the repository and on the availability of mapping modules which realize the transformation rules from e-service descriptions to technological component models.
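A minimal sketch of this mapping idea follows; the EServiceDescription structure, the adapter interface, and the registry keyed by technology name are all illustrative assumptions, while the actual description language and transformation rules are those of the cited repository work and are not reproduced here.

    import java.util.Map;

    // Hypothetical technology-independent e-service description.
    class EServiceDescription {
        final String name;
        final String technology; // e.g., "CORBA", "EJB", ".NET"
        final String endpoint;
        EServiceDescription(String name, String technology, String endpoint) {
            this.name = name; this.technology = technology; this.endpoint = endpoint;
        }
    }

    // One adapter per implementation technology.
    interface TechnologyAdapter {
        void invoke(EServiceDescription desc, String operation, Object payload);
    }

    // The coordination "glue": generates service requests at run-time by
    // delegating to the adapter that matches the description's technology.
    class CoordinationGlue {
        private final Map<String, TechnologyAdapter> adapters;
        CoordinationGlue(Map<String, TechnologyAdapter> adapters) {
            this.adapters = adapters;
        }
        void callService(EServiceDescription desc, String operation, Object payload) {
            TechnologyAdapter adapter = adapters.get(desc.technology);
            if (adapter == null)
                throw new IllegalStateException("no adapter for " + desc.technology);
            adapter.invoke(desc, operation, payload);
        }
    }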

The focus of this paper is on introducing trust in data exchanges among e-services; cooperative data schemas represent the interfaces of the e-services, that is, the specifications of the input and output data of e-services. In order to provide trust, we have also introduced cooperative data quality schemas, to be offered by e-services as part of their interfaces, as well as security aspects in the communication among e-services.

Finally, as far as the management of trust parameters is concerned, the framework is assumed to be initialized with fixed quality and sensitivity information and to be then updated using a Feedback and Monitoring Module (Bellettini et al. 1999), which observes the behavior of the framework and improves its performance over time, using feedback about quality parameters, triggers, user actions, and customer satisfaction (a minimal sketch of such an update follows).
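As a purely illustrative sketch, assuming that quality parameters are kept as numeric scores in [0, 1], the feedback update could be an exponential moving average of observed values; neither the scale nor this update rule is prescribed by the cited Feedback and Monitoring Module.

    // Hypothetical feedback update for one quality parameter (timeliness).
    class FeedbackModule {
        private double timelinessScore = 0.8;     // fixed value set at initialization
        private static final double ALPHA = 0.1;  // learning rate (assumed)

        // Blends each observed value into the current estimate.
        void onObservation(double observedTimeliness) {
            timelinessScore = (1 - ALPHA) * timelinessScore + ALPHA * observedTimeliness;
        }

        double currentTimeliness() { return timelinessScore; }
    }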


6 RELATED WORK

Since in our framework trust is obtained by introducing quality information and security, we briefly describe related work in these fields.

The notion of data quality has been widely investigated in the literature; among the many proposals we cite the definitions of data quality as "fitness for use" (Wang and Strong 1996), and as "the distance between the data views presented by an information system and the same data in the real world" (Orr 1998, Wand and Wang 1996). The former definition emphasizes the subjective nature of data quality, whereas the latter is an "operational" definition, although defining data quality on the basis of comparisons with the real world is a very difficult task. In this paper we have considered data quality as an implicit concept strictly dependent on a set of dimensions; these are usually defined in the data quality literature as quality properties or characteristics of data (e.g., accuracy, completeness, consistency, etc.).

Many definitions of data quality dimensions have been proposed; among them we cite: the classification given in (Wang and Strong 1996), in which four categories (i.e., intrinsic, contextual, representation and accessibility aspects of data) are identified for data quality dimensions, and the taxonomy proposed in (Redman 1996), in which more than twenty data quality dimensions are classified into three categories, namely conceptual view, values and format. A survey of data quality dimensions is given in (Wang et al. 1995); it is important to note that in the literature there is no agreement either on the set of dimensions strictly characterizing data quality or on the meaning of each of them.
We have defined some dimensions based on the ones proposed in the literature, and we have introduced some new quality dimensions that are specifically relevant in cooperative environments.

Data quality issues have been addressed in several research areas, namely data cleaning, quality management in information systems, data warehousing, and the integration of heterogeneous databases and web information sources. To our knowledge, many aspects concerning data quality in CIS/e-applications have not yet been addressed; however, when dealing with data quality issues in cooperative environments, some of the results already achieved for traditional and web information systems can be borrowed. In CIS/e-applications, the main data quality problems are:
• Assessment of the quality of the data exported by each organization;
• Methods and techniques for exchanging quality information;
• Improvement of quality;
• Heterogeneity, due to the presence of different organizations, in general with different semantics about data.

As regards the assessment of the quality of intra-organizational data, results achieved in the data cleaning area (Elmagarmid et al. 1996, Hernandez and Stolfo 1998, Galhardas et al. 2000), as well as in the data warehouse area (Vassiliadis et al. 1999, Jeusfeld et al. 1998), can be adopted.

Heterogeneity has been widely addressed in the literature, especially focusing on schema integration issues (Batini et al. 1984, Gertz 1998, Ullman 1997, Madnick 1999, Calvanese et al. 1998).

Improvement, and methods and techniques for exchanging quality information, have been only partially addressed in the literature (e.g., Mihaila et al. 1998) and are the main focus of this paper; we have proposed a conceptual model for exchanging such information in a cooperative framework, together with some hints for improvement based on the availability of quality information.

Finally, as regards security, the main problems tackled in the literature concern data protection during storage and transmission, with the associated aspects of confidentiality, integrity and authentication (Castano et al. 1995). Several solutions have been proposed, based on standards and specifications regarding the use of cryptography for signatures and certificates, such as PKCS#7 and RFC 2459 (RSA Laboratories 1993, Housley et al. 1999). We have relied on these standard proposals for data exchange.
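As an illustration of the signature mechanisms these standards build on, the following sketch uses the standard Java security API to sign and verify an exchanged payload; it is a minimal example and does not reproduce the full PKCS#7 message syntax or certificate handling.

    import java.security.*;

    // Minimal sketch: signing and verifying an exchanged payload with RSA.
    public class SignedExchange {
        public static void main(String[] args) throws GeneralSecurityException {
            KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
            gen.initialize(2048);
            KeyPair keys = gen.generateKeyPair();

            byte[] payload = "exchange unit: data + quality + history".getBytes();

            // The source organization signs the payload with its private key.
            Signature signer = Signature.getInstance("SHA256withRSA");
            signer.initSign(keys.getPrivate());
            signer.update(payload);
            byte[] signature = signer.sign();

            // The destination organization verifies with the source's public key.
            Signature verifier = Signature.getInstance("SHA256withRSA");
            verifier.initVerify(keys.getPublic());
            verifier.update(payload);
            System.out.println("signature valid: " + verifier.verify(signature));
        }
    }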


7 CONCLUDING REMARKS AND FUTURE WORK

In this paper an approach to trusted data exchange in cooperative processes has been presented. The main emphasis of this work has been on accompanying the information being exchanged with additional information that enables receiving organizations to assess the suitability of data before using them. In addition, a framework has been proposed to allow trusted data exchange with quality information in a secure environment.

The data quality problem in cooperative environments is, in general, still an open issue. Further work is still needed to precisely define the data quality dimensions proposed in the literature. In the context of cooperative processes, our approach is, to our knowledge, the first proposal of a comprehensive framework for trusted data exchange based on data quality information. Our approach will be validated on practical cases in the public administration domain, and the model will be refined based on these experiences.

In the present paper, we have concentrated our attention on data exchange within a cooperative process. Though we have identified the exact format of the exchange unit, we still need to explore possible ways of translating not only the data, but also the quality, history, and sensitivity information of the exchange unit into XML structures. Based on the proposed approach, future work will also concentrate on aspects related to process improvement based on the evaluation of the quality of exchanged data. In fact, the analysis of the quality of exchanged data, its evaluation by receiving organizations, and the compensating actions started when data quality is considered insufficient can be the basis for new techniques for process improvement. In addition, more work is needed to provide mechanisms to associate reliability information with data sources, to validate it, and to revise it according to a statistical analysis of process instances evaluated in the past.

Future work about sensitivity and security regards the extension of XML

DTDs to treat security properties at the needed level of data granularity (i.e., data

item, quality attributes, other detail levels).


ACKNOWLEDGEMENTS

The authors thank Carlo Batini for his discussions and suggestions about this work.

REFERENCES

Batini C., Cappadozzi E., Mecella M., Talamo M. (2001): Cooperative Architectures: The Italian Way Along e-Government. To appear in Elmagarmid A.K., McIver Jr W.J. (eds): Advances in Digital Government: Technology, Human Factors, and Policy, Kluwer Academic Publishers, 2001.

Batini C., Lenzerini M., Navathe S.B. (1984): A comparative analysis of methodologies for database schema integration. ACM Computing Surveys, vol. 15, no. 4, 1984.

Bellettini C., Damiani E., Fugini M.G. (1999): Design of an XML-based Trader for Dynamic Identification of Distributed Services. Proceedings of the 1st Symposium on Reusable Architectures and Components for Developing Distributed Information Systems (RACDIS'99), Orlando, FL, August 1999.

Bitton D., DeWitt D. (1983): Duplicate Record Elimination in Large Data Files. ACM Transactions on Database Systems, vol. 8, no. 2, 1983.

Brodie M.L. (1998): The Cooperative Computing Initiative. A Contribution to the Middleware and Software Technologies. GTE Laboratories Technical Publication, 1998, available on-line (link checked July, 1st 2001): http://info.gte.com/pubs/PITAC3.pdf.

Calvanese D., De Giacomo G., Lenzerini M., Nardi D., Rosati R. (1998): Information Integration: Conceptual Modeling and Reasoning Support. Proceedings of the 6th International Conference on Cooperative Information Systems (CoopIS'98), New York City, NY, USA, 1998.

Casati F., Sayal M., Shan M.C. (2001): Developing E-Services for Composing E-Services. Proceedings 13th International Conference on Advanced Information Systems Engineering (CAISE 2001), Interlaken, Switzerland, 2001.

Castano S., Fugini M.G., Martella G., Samarati P. (1995): Database Security, Addison Wesley, 1995.

Cattell, R.G.G., Barry D.K. (eds.) (1997): The Object Database Standard: ODMG 2.0. Morgan Kaufmann Publishers, 1997.

Central IT Unit (CITU) of the Cabinet Office (2000): The GovTalk initiative. http://www.govtalk.gov.uk/ (link checked July, 1st 2001).

Cochinwala M., Kurien V., Lalk G., Shasha D. (1998): Efficient Data Reconciliation. Bellcore Technical Report 1998.

Elmagarmid A., Horowitz B., Karabatis G., Umar A. (1996): Issues in Multisystem Integration for Achieving Data Reconciliation and Aspects of Solutions. Bellcore Research Technical Report, 1996.

Galhardas H., Florescu D., Shasha D., Simon E. (2000): An Extensible Framework for Data Cleaning. Proceedings of the 16th International Conference on Data Engineering (ICDE 2000), San Diego, CA, USA, 2000.

Gertz M. (1998): Managing Data Quality and Integrity in Federated Databases. Second Annual IFIP TC-11 WG 11.5 Working Conference on Integrity and Internal Control in Information Systems, Airlie Center, Warrenton, Virginia, 1998.

Hernandez M.A., Stolfo S.J. (1998): Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem. Journal of Data Mining and Knowledge Discovery, vol. 1, no. 2, 1998.

Housley R., Ford W., Polk W., Solo D. (1999): Internet X.509 Public Key Infrastructures Certificate and CRL Profile. Network Working Group Standards Track, 1999.

Jeusfeld M.A., Quix C., Jarke M. (1998): Design and Analysis of Quality Information for Data Warehouses. Proceedings of the 17th International Conference on Conceptual Modeling (ER'98), Singapore, 1998.

Madnick S. (1999): Metadata Jones and the Tower of Babel: The Challenge of Large-Scale Semantic Heterogeneity. Proceedings of the 3rd IEEE Meta-Data Conference (Meta-Data '99), Bethesda, MD, USA, 1999.


Mecella M., Batini C. (2000): Cooperation of Heterogeneous Legacy Information Systems: a Methodological Framework. Proceedings of the 4th International Enterprise Distributed Object Computing Conference (EDOC 2000), Makuhari, Japan, 2000.

Mecella M., Batini C. (2001): Enabling Italian e-Government Through a Cooperative Architecture. In Elmagarmid, A.K., McIver Jr, W.J. (eds.): Digital Government. IEEE Computer, vol. 34, no. 2, February 2001.

Mecella M., Pernici B. (2001): Designing Wrapper Components for e-Services in Integrating Heterogeneous Systems. To appear in VLDB Journal, Special Issue on e-Services, 2001.

Mecella M., Pernici B., Craca P. (2001b): Compatibility of Workflow e-Services in A Cooperative Multi-Platform Environment. To appear in Proceedings of the 2nd VLDB Workshop on Technologies for E-Services (VLDB-TES 2001), Roma, Italy, September 2001.

Mecella M., Pernici B., Rossi M., Testi A. (2001a): A Repository of Workflow Components for Cooperative e-Applications. Proceedings of the 1st IFIP TC8 Working Conference on E-Commerce/E-Business, Salzburg, Austria, 2001.

Mihaila G., Raschid L., Vidal M. (1998): Querying Quality of Data Metadata. Proceedings of the 6th International Conference on Extending Database Technology (EDBT’98), Valencia, Spain, 1998.

Missier P., Scannapieco M., Batini C. (2001): Cooperative Architectures. Introducing Data Quality. Technical Report 14-2001, Dipartimento di Informatica e Sistemistica, Università di Roma “La Sapienza”, Roma, Italy, 2001.

Monge A., Elkan C. (1997): An Efficient Domain Independent Algorithm for Detecting Approximate Duplicate Database Records. Proceedings of SIGMOD Workshop on Research Issues on DMKD, 1997.

Monson-Haefel R. (2000): Enterprise JavaBeans (2nd Edition). O'Reilly, 2000.
Morey R.C. (1982): Estimating and Improving the Quality of Information in the MIS. Communications of the ACM, vol. 25, no. 5, 1982.
Mylopoulos J., Papazoglou M. (eds.) (1997): Cooperative Information Systems. IEEE Expert Intelligent Systems & Their Applications, vol. 12, no. 5, September/October 1997.
Object Management Group (1998): The Common Object Request Broker Architecture and Specifications. Revision 2.3. Object Management Group, Document formal/98-12-01, Framingham, MA, 1998.

Orr K. (1998): Data Quality and Systems Theory. Communications of the ACM, vol. 41, no. 2, 1998.

Redman T.C. (1996): Data Quality for the Information Age. Artech House, 1996.
RSA Laboratories (1993): Cryptographic Message Syntax Standard. RSA Laboratories Technical Note, Version 1.5, 1993.
Schuster H., Georgakopoulos D., Cichocki A., Baker D. (2000): Modeling and Composing Service-based and Reference Process-based Multi-enterprise Processes. Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), Stockholm, Sweden, 2000.
Tanenbaum A.S. (1996): Computer Networks, Third Edition. Prentice Hall, 1996.
Tansel A., Snodgrass R., Clifford J., Gadia S., Segev A. (eds.) (1993): Temporal Databases. Benjamin-Cummings, 1993.
Trepper C. (2000): E-Commerce Strategies. Microsoft Press, 2000.
UDDI.org (2000): UDDI Technical White Paper, 2000. Available on-line (link checked July, 1st 2001): http://www.uddi.org/pubs/Iru_UDDI_Technical_White_Paper.pdf.
Ullman J.D. (1997): Information Integration using Logical Views. Proceedings of the International Conference on Database Theory (ICDT '97), Greece, 1997.
Vassiliadis P., Bouzeghoub M., Quix C. (1999): Towards Quality-Oriented Data Warehouse Usage and Evolution. Proceedings of the 11th International Conference on Advanced Information Systems Engineering (CAiSE '99), Heidelberg, Germany, 1999.

VLDB-TES (2000): Proceedings of the 1st VLDB Workshop on Technologies for E-Services (VLDB-TES 2000), Cairo, Egypt, 2000.

Wand Y., Wang R.Y. (1996): Anchoring data quality dimensions in ontological foundations. Communications of the ACM, vol. 39, no. 11, 1996.

Wang R.Y., Storey V.C., Firth C.P. (1995): A Framework for Analysis of Data Quality Research. IEEE Transactions on Knowledge and Data Engineering, vol. 7, no. 4, 1995.


Wang R.Y., Strong D.M. (1996): Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, vol. 12, no. 4, 1996.

World Wide Web Consortium (W3C) (1998): Extensible Markup Language (XML) Version 1.0. February 1998. http://www.w3.org

Goldfarb C.F., Prescod P. (2000): The XML Handbook. Prentice Hall, 2000.
Object Management Group (OMG) (2000): OMG Unified Modeling Language Specification. Version 1.3. Object Management Group, Document formal/2000-03-01, Framingham, MA, 2000.