SUPPORTING TRUSTED DATA EXCHANGES IN COOPERATIVE INFORMATION SYSTEMS
Paola Bertolazzi1, Maria Grazia Fugini2, Massimo Mecella3, Barbara Pernici2, Pierluigi Plebani2, Monica Scannapieco1,3
1 Istituto di Analisi dei Sistemi ed Informatica
Consiglio Nazionale delle Ricerche (IASI-CNR) bertola@iasi.rm.cnr.it
2 Dipartimento di Elettronica e Informazione Politecnico di Milano {fugini,pernici}@elet.polimi.it, plebani@fusberta.elet.polimi.it
3 Dipartimento di Informatica e Sistemistica Università di Roma “La Sapienza” {mecella,monscan}@dis.uniroma1.it
Contact author: Monica Scannapieco Dipartimento di Informatica e Sistemistica Università di Roma “La Sapienza”
Via Salaria 113 (2nd floor, room 231) I-00198 Roma, Italy Phone: +39 06 49918479 Fax: +39 06 85300849 E-mail: monscan@dis.uniroma1.it
Abstract. In cooperative processes and in e-services, an evaluation of the quality of exchanged data is essential for building mutual trust among cooperating organizations and for correctly performing cooperative activities. Several quality dimensions, related to the intrinsic nature of data and to the context of the cooperative process where data are used, must be taken into consideration. In addition, in order to accomplish a trusted cooperative environment, data sensitivity parameters must be taken into account. A model for data quality in cooperative information systems and e-applications is proposed, together with an architecture for trusted exchanges of data and the quality information associated with them. Strategic use of the model and of the architecture is discussed.
Keywords: Data quality, workflow systems, e-services, security
1 INTRODUCTION
Recently, the widespread use of information technology and the availability of
networking services have enabled new types of applications, characterized by
several geographically distributed interacting organizations. The term
Cooperative Information Systems (CIS) is used to denote distributed information
systems that are employed by users of different organizations under a common
goal (Mylopoulos and Papazoglou 1997, Brodie 1998). A recent extension of
CIS allows e-services to be provided on line in a cooperative context, by means
of e-applications (VLDB-TES 2000, Mecella and Pernici 2001). In addition to
geographical distribution and inter-organization cooperation, in e-applications (i)
cooperating organizations may not know each other in advance and (ii) e-
services can be composed both at design and run-time. Whereas in traditional
“closed” CIS mutual knowledge and agreements upon design of applications are
the basis for the cooperation, the availability of a complex platform for e-services
(Mecella et al. 2001a) allows “open” cooperation among different organizations.
An approach towards e-applications can be found in UDDI, an initiative for
defining eXtensible Markup Language (XML) (W3C 1998) documents to publish
and discover services on the Web. The UDDI Business Registry stores different
types of information about a service, that is, business contact information (“white
pages”), business category information (“yellow pages”) and technical service
information (“green pages”) (UDDI 2000). In such a framework, organizations
willing to offer e-services give a description of the services informally, based on
free text, and other organizations willing to use these e-services interact with the
offering organization on the basis of agreed upon interfaces.
Other proposals for architectures for e-services based on workflow
systems have been presented in the literature (VLDB-TES 2000, Casati et al.
2001, Mecella et al. 2001a). The starting point of all these approaches is the
concept of cooperative process, also referred to as macro process (Mecella and
Batini 2001) or multi-enterprise process (MEP, Schuster et al. 2000), defined as
a complex workflow involving different organizations; unlike traditional workflow
processes where all the activities concern the same enterprise, in a cooperative
process the activities involve different organizations, either because they
together form a virtual enterprise or because they exchange services and
information in a coordinated way. The approach presented in (Mecella et al. 2001a), which
constitutes the underlying framework in this paper, assumes that a cooperative
process can be abstracted and modeled as a set of e-services exported from
cooperating organizations. The definition of a cooperative process as a set of e-
services constitutes a reference schema for the cooperation among
organizations; an e-service represents a “contract” on which an organization
involved in the cooperative process agrees.
Organizations cooperating in CIS/e-applications can be of two types:
• trusted organizations: data transmission occurs among organizations which
trust each other in a network due to organizational reasons (e.g.,
homogeneous work groups in a departmental structure, or supply-chain
relationships among organizations forming a virtual enterprise);
• external organizations: data are transmitted among cooperating entities in
general, possibly accessing external data sources.
Every time mutual knowledge among organizations participating in CIS/e-
applications is not given in advance, new mechanisms are needed to ensure that
mutual trust is established during cooperative process executions. Trust mainly
regards two aspects: (i) the quality of the data being exchanged, and (ii) a secure
exchange environment that protects sensitive information.
The properties indicating the quality of exchanged data are both
intrinsic to the data themselves and process dependent, i.e., they depend on the
activity in which the data are used and on when they are used. We argue that
organizations need to specify and exchange information explicitly describing the
quality of the data circulating in CIS/e-applications. The availability of such
quality information allows interacting organizations to assess the quality of
received and of available data before using them.
Sensitivity concerns both the correct authentication of cooperating
organizations and the guarantee that only authorized organizations can read,
use, and generate data in the cooperative process. To protect sensitive
information, security technologies can be used, e.g., based on digital
certificates and signatures, allowing the cooperating organizations to establish
a secure communication environment and to ensure the needed level of
confidentiality.
The goal of the present paper is to propose a model for data quality,
including both traditional and original quality dimensions, and an architecture for
trusted data exchange supporting sensitivity among cooperating organizations.
Both quality and sensitivity are used to define the level of trust of the CIS/e-
application.
The paper is organized as follows. In Section 2, we first introduce a
running example, to be used for further illustration of our approach, and then we
discuss both classical data quality dimensions and additional information that
must be associated to data to build mutual trust in CIS/e-applications. The
running example stems from the experience of the Italian e-Government initiative
(Mecella and Batini 2001), which provides motivations for our work and the test
bed in which we will try our approach. In Section 3, the model for data quality is
presented in detail, whereas in Section 4 the cooperative framework is
described. Finally, in Section 5 we discuss the strategic use of trusted data
exchanges in CIS/e-applications. Section 6 discusses related work specifically
focused on data quality issues and data security aspects, and Section 7
concludes the paper with remarks on future work.
2 A RUNNING EXAMPLE AND THE DATA QUALITY DIMENSIONS
In this section we pose the basis for defining a conceptual framework for trusted
data exchanges in CIS/e-applications. First, we shortly introduce a running
example, then we define data quality dimensions for trusted cooperation in
environments such as the one described in the example.
2.1 A RUNNING EXAMPLE
An example taken from the Italian e-Government scenario (Mecella and Batini
2001) will be used throughout the paper. In Italy, the Unitary Network project and
the related Nationwide Cooperative Information System (Batini et al. 2001) are
currently under way, with the aim of implementing a “secure Intranet” able to
interconnect public administrations, and of developing a Unitary Information
System of Italian Public Administrations in which each subject can participate by
providing services (e-services) to other subjects. Specifically, each administration
is represented as a domain, and each domain offers data and application
services, deployed and made accessible through cooperative gateways.
Similar initiatives are also under way in the United Kingdom,
where the e-Government Interoperability Framework (e-GIF) sets out the
government’s technical policies and standards for achieving interoperability and
information systems coherence across the UK public sector. For this purpose the
government has launched the UK GovTalk initiative (CITU 2000), a joint
government and industry forum for generating and agreeing standards, through
the definition of XML Document Type Definitions (DTDs) (Goldfarb and Prescod
2000) to be used for information exchange.
In this paper we use as running example a simplified version of the
cooperative process for income management (see Figure 1).
Figure 1. UML Activity Diagram of the cooperative process
“Income Management” and the identified e-services.
Citizens send income-tax returns to the Department of Finance, which,
after executing some activities of its own competence, needs to access the
family composition of the citizen from other administrations with the purpose of
cross-checking data. The family composition of the citizen is checked against
data available from the City Council where the citizen is resident. Information
about retirement plans (in case some retired persons exist in the family) is
obtained from the Italian Social Security Service.
In more detail, the workflow consists of the Department of Finance
receiving income-tax returns from citizens (sent by ordinary mail, nowadays
submitted also through a Web portal); the Department, in order to verify the
correct amount of taxes, needs to check the incomes of all people belonging to
the same family as the citizen; it therefore requests the family composition from
the City Council where the citizen lives. After receiving the family status of the
citizen, the Department queries the Italian Social Security Service in order to
know the amount of pension received by retired persons in the citizen’s family;
this activity is carried out only if there are retired persons living with the citizen.
After collecting all this information, the Department has all the data needed to
check income-tax returns and possibly start further actions against fraudulent citizens.
Until recently, the information exchange described above has been
carried out using paper documents; the document exchange activated specific
processes in each organization aiming at producing response documents. Now,
on the basis of the Unitary Network and Nationwide CIS projects, each
administration can develop e-services (shown in Figure 1), allowing other
cooperating organizations to ask for and obtain requested data. In the present
paper we assume that data are exchanged as XML documents and described
through DTDs agreed upon by all the cooperating administrations.
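To make the exchange format concrete, the following sketch builds an XML document of the kind the City Council e-Service might return. The element names (FamilyStatusDocumentation, CitizenSSN, Member) are illustrative assumptions, not taken from any actual agreed-upon DTD:

```python
import xml.etree.ElementTree as ET

def build_family_status(ssn, members):
    # Hypothetical message structure; real element names would come from the
    # DTDs agreed upon by the cooperating administrations.
    root = ET.Element("FamilyStatusDocumentation")
    ET.SubElement(root, "CitizenSSN").text = ssn
    family = ET.SubElement(root, "Family")
    for name, relationship in members:
        member = ET.SubElement(family, "Member")
        ET.SubElement(member, "Name").text = name
        ET.SubElement(member, "Relationship").text = relationship
    return ET.tostring(root, encoding="unicode")

doc = build_family_status("ABC123", [("JOHN", "self"), ("MARY", "spouse")])
print(doc)
```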
The cooperation is effective if exchanged data are trusted, that is, their
quality is assessed and their security is guaranteed: if the quality of each
exchanged data item is assessed, the receiving organization can set up
appropriate measures to deal with poor-quality data. As an example, if the citizen
address provided by a City Council is assessed as not up to date, the Department
of Finance can arrange different activities to validate updated data against
other organizations (e.g., telecommunication companies maintain billing
addresses for their customers).
Security requirements in this scenario regard the authentication of the
cooperating organizations, the determination of the sensitivity levels of data, and
the certification of the data transmission. Communication can be assumed to be
either trusted (e.g., between the Department of Finance and the City Council) or
untrusted (e.g., between the citizen and the City Council).
2.2 DATA QUALITY DIMENSIONS
We distinguish two kinds of data quality dimensions: data intrinsic and process
specific. Intrinsic data quality dimensions characterize properties that are
inherent to data, i.e., depend on the very nature of data; an example is a
dimension specifying whether the data about the family composition of a citizen
is updated or not. Process specific quality dimensions describe properties that
depend on the cooperative process in which data are exchanged; in our
reference example, the “timeliness” of exchanged data between the Department
of Finance and the City Council is a parameter that is fundamental to measure
the efficiency and effectiveness of the cooperative process.
Process specific parameters are an original contribution of this paper, as
we show how quality is related also to the usage of data and to its evaluation in a
cooperative framework. As regards intrinsic data quality dimensions, we refer to
a subset of the ones proposed in the literature, by considering the most
important ones (Wand and Wang 1996); we provide new definitions based on
the classical ones, to adopt them in the CIS/e-application context. We will refer
only to data quality dimensions concerning data values; conversely, we do not
deal with aspects concerning data quality of logical schemas and data format; in
the following definitions, we refer to a schema element meaning, for instance, an
entity in an Entity-Relationship schema or a class in an object-oriented schema
expressed in the Unified Modeling Language (UML 2000).
2.2.1 Intrinsic data quality dimensions
Our purpose is to associate data with those dimensions that are useful for
organizations receiving data to evaluate and validate them before further use.
We associate to data (i) syntactic and semantic accuracy, (ii) completeness, (iii)
currency, and (iv) internal consistency.
Syntactic and Semantic Accuracy. In (Redman 1996) accuracy refers to the
proximity of a value v to a value v′ considered as correct. Based on such a
definition, we introduce a further distinction between syntactic and semantic
accuracy.
Syntactic Accuracy The distance between v and v′, where v′ is the value considered syntactically correct.
Semantic Accuracy The distance between v and v′, where v′ is the value considered semantically correct.
Let us consider the following examples:
• Person is a schema element with Name as the attribute of interest, and p
an instance of Person. If p.Name1 has a value v = JON, while v′ =
JOHN, this is a case of a low syntactic accuracy as JON is not an
admissible value according to a dictionary of English names;
• if p.Name has a value v = ROBERT, while v′ = JOHN, this is a case of a
low semantic accuracy, as v is a syntactical admissible value but the
person whose name is stored as ROBERT has a name which is JOHN in
the real world.
Syntactic accuracy can be easily checked by comparing data values with
reference dictionaries. Semantic accuracy is more difficult to quantify since,
according to our definition, the terms of comparison have to be derived from
real world, and so verification of semantic accuracy may be expensive.
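The dictionary-based syntactic check can be sketched as an edit-distance computation against a reference dictionary of admissible values; the dictionary contents below are illustrative:

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[m][n]

def syntactic_accuracy(value, dictionary):
    # Distance to the closest admissible value; 0 means syntactically accurate.
    return min(edit_distance(value, w) for w in dictionary)

names = {"JOHN", "ROBERT", "MARY"}
print(syntactic_accuracy("JON", names))     # small distance: likely a typo
print(syntactic_accuracy("ROBERT", names))  # 0: syntactically accurate
```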
Semantic accuracy can be checked through comparison of the information
related to the same instance stored in different databases. A typical process
that aims at identifying similar instances consists of two phases:
• A searching phase, in which possibly matching instances are identified
(Bitton and DeWitt 1983, Hernandez and Stolfo 1998, Monge and Elkan
1997);
• A matching phase, in which a decision about a match, a non-match or a
1 The dot notation refers to instances and their attributes, i.e., a.x indicates the value of
the attribute x on a specific instance a of the schema element A.
possible match is taken (Hernandez and Stolfo 1998, Monge and Elkan
1997, Cochinwala et al. 1998). Usually, the decision is made in an
automatic or semi-automatic way, on the basis of the database that is
considered to store the correct values.
As an example, all the attribute values related to p (with p.Name =
ROBERT), such as, for example, DateOfBirth and EmployeeNumber,
could be compared with another instance of Person from a different
database considered as correct. In such a case, the process of checking the
semantic accuracy requires the matching of < ROBERT, 11-20-1974,
1024 > and < JOHN, 11-20-74, 1024 >, that is (i) recognizing the two
instances as potential match, (ii) deciding for a match of the two instances,
and (iii) then correcting ROBERT into JOHN.
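The two-phase process above can be sketched as follows; the blocking key (EmployeeNumber) and the matching rule are illustrative assumptions about how the searching and matching phases might be instantiated:

```python
def candidate_pairs(records_a, records_b):
    # Searching phase: block on an identifier expected to be stable
    # (here EmployeeNumber) to limit the pairs that must be compared.
    return [(a, b) for a in records_a for b in records_b
            if a["emp_no"] == b["emp_no"]]

def is_match(a, b):
    # Matching phase: declare a match if the non-key fields agree once
    # formats are normalized (e.g. the two-digit year 74 becomes 1974).
    def norm(d):
        mm, dd, yy = d.split("-")
        return (mm, dd, yy if len(yy) == 4 else "19" + yy)
    return norm(a["birth"]) == norm(b["birth"])

ours = [{"name": "ROBERT", "birth": "11-20-1974", "emp_no": 1024}]
reference = [{"name": "JOHN", "birth": "11-20-74", "emp_no": 1024}]

pairs = candidate_pairs(ours, reference)
matches = [(a, b) for a, b in pairs if is_match(a, b)]
# The matched reference record supplies the semantically correct name, JOHN.
```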
Completeness. We define this dimension as:
Completeness The degree to which values of a schema element are present in the schema element instance.
In evaluating completeness, it is important to consider the meaning of null
values of an attribute, depending on the attribute being mandatory, optional,
or inapplicable: a null value for a mandatory attribute is associated with a
lower completeness, whereas completeness is not affected by optional or
inapplicable null values. As an example, let us consider the attribute Email
of the Person schema element; a null value for the Email attribute may
have different meanings, that is (i) the specific person has no email address,
and therefore the attribute is inapplicable (this case does not impact on
completeness), or (ii) the specific person has an email address but it has not
been stored (in this case completeness is low).
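A minimal completeness measure following this definition counts null values only for mandatory attributes; the classification of attributes below is an illustrative assumption:

```python
# Which attributes are mandatory vs. optional is an assumption for this sketch.
MANDATORY = {"Name", "DateOfBirth"}

def completeness(instance):
    # Fraction of mandatory attributes carrying a non-null value;
    # nulls on optional or inapplicable attributes do not lower the score.
    present = sum(1 for attr in MANDATORY if instance.get(attr) is not None)
    return present / len(MANDATORY)

p1 = {"Name": "JOHN", "DateOfBirth": "11-20-1974", "Email": None}
p2 = {"Name": "JOHN", "DateOfBirth": None, "Email": "john@example.org"}
print(completeness(p1))  # 1.0: the null Email is optional, so it does not count
print(completeness(p2))  # 0.5: a mandatory attribute is missing
```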
Currency. The currency dimension refers only to data values that may vary in
time; as an example, values of Address may vary in time, whereas
DateOfBirth can be considered invariant. Therefore currency can be
defined as the “age” of a value, namely:
Currency The distance between the instant when a value is last updated and the instant when the value itself is used.
It can be measured either by associating to each value an “updating
timestamp” (Missier et al. 2001) or a “transaction time” in temporal databases
(Tansell et al. 1993).
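With an updating timestamp attached to the value, currency reduces to a simple time difference; the record layout and the choice of days as the unit are illustrative:

```python
from datetime import datetime

def currency(value_record, at_time):
    # "Age" of the value: distance between the last update and the
    # instant of use, here measured in days.
    return (at_time - value_record["updated_at"]).days

address = {"value": "Via Salaria 113", "updated_at": datetime(2001, 1, 1)}
print(currency(address, datetime(2001, 3, 2)))  # the value is 60 days old
```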
Internal Consistency. Consistency implies that two or more values do not
conflict with each other. By internal consistency we mean that all the
values compared in order to evaluate consistency belong to a specific
instance of a schema element.
A semantic rule is a constraint that must hold among values of attributes of a
schema element, depending on the application domain modeled by the
schema element. On the basis of this definition, internal consistency can be
defined as:
Internal Consistency The degree to which the values of the attributes of an instance of a schema element satisfy the specific set of semantic rules defined on the schema element.
As an example, if we consider Person with attributes Name, DateOfBirth,
Sex and DateOfDeath, possible semantic rules to be checked are:
• the values of Name and Sex are consistent; if Name has a value v =
JOHN and the value of Sex is FEMALE, this is a case of internal
inconsistency;
• the value of DateOfBirth needs to precede the value of DateOfDeath.
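The two rules above can be sketched as executable checks, with internal consistency measured as the fraction of semantic rules an instance satisfies; the name-to-sex dictionary is an illustrative assumption:

```python
MALE_NAMES = {"JOHN", "ROBERT"}  # assumed reference dictionary

def rule_name_sex(p):
    # Name and Sex must be consistent.
    return not (p["Name"] in MALE_NAMES and p["Sex"] == "FEMALE")

def rule_birth_before_death(p):
    # DateOfBirth must precede DateOfDeath, when the latter is present.
    return p["DateOfDeath"] is None or p["DateOfBirth"] < p["DateOfDeath"]

RULES = [rule_name_sex, rule_birth_before_death]

def internal_consistency(p):
    return sum(1 for rule in RULES if rule(p)) / len(RULES)

p = {"Name": "JOHN", "Sex": "FEMALE",
     "DateOfBirth": "1974-11-20", "DateOfDeath": None}
print(internal_consistency(p))  # 0.5: the name/sex rule is violated
```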
2.2.2 Process specific dimensions
The need for data quality dimensions dependent on the context is recognized in
(Wang and Strong 1996); we observe that in CIS/e-applications, the context is
the cooperative process and data quality dimensions are related to the evolution
of data during time and within the process. We have therefore chosen and
adapted some of the dimensions proposed in (Wang and Strong 1996)
(timeliness and source reliability), and in addition we propose new dimensions
dependent on cooperative processes (importance and confidentiality).
Process specific dimensions are tied to specific data exchanges within the
process, rather than to the whole process. Hence, in the following definitions, we
consider a data exchange as a triple < source organization i,
destination organization j, exchange id >, representing the
cooperating organizations involved in the data exchange and the specific
exchange2.
Timeliness. It can be defined as follows:
Timeliness The availability of data on time, that is within the time constraints specified by the destination organization.
For instance, we can associate a low timeliness value for the schedule of the
lessons in a University, if such a schedule becomes available on line after
the lessons have already started. For computing this dimension, each
organization has to indicate the due time, i.e., the latest time within which
data have to be received. According to our definition, the timeliness of a value
cannot be determined until it is received by the destination organization.
2 Two organizations may be involved in more than one exchange of the same data within
the same cooperative process.
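The timeliness check against the due time declared by the destination organization can be sketched as below; returning a plain boolean is an illustrative choice (a graded measure could use the amount of delay instead):

```python
from datetime import datetime

def timely(received_at, due_time):
    # Data are timely if received within the destination's due time.
    return received_at <= due_time

due = datetime(2001, 9, 1)  # e.g. the schedule must be on line before lessons start
print(timely(datetime(2001, 8, 25), due))  # schedule arrived on time
print(timely(datetime(2001, 9, 15), due))  # lessons had already started
```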
Importance. This dimension can be defined as:
Importance The significance of data for the destination organization.
As an example, we can consider an organization B (e.g., the Department of
Finance) that cannot start an internal process until an organization A (e.g.,
the City Council) transfers values of the schema element X (e.g. the family
composition of a citizen); in this case, the importance of X for B is high.
Importance is a complex dimension that can be defined on the basis of
specific indicators measuring, for a schema element: the number of instances
managed by the destination organization per temporal unit, the number of
processes internal to the destination organization in which the data are used,
and the ratio between the number of core business processes using the data
and the overall number of internal processes using the data. Therefore:

Importance (data, destination org.) = f(# instances of data, # internal processes of destination org. using data, # core business processes of destination org. using data / # internal processes of destination org. using data)
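A direct transcription of this indicator might look as follows; since the paper leaves the combining function f unspecified, the plain product used here, and the sample figures, are illustrative assumptions:

```python
def importance(n_instances, n_internal_procs, n_core_procs):
    # f combines: instances handled per temporal unit, number of internal
    # processes using the data, and the core-process ratio. A product is
    # one possible (assumed) choice for f.
    core_ratio = n_core_procs / n_internal_procs if n_internal_procs else 0.0
    return n_instances * n_internal_procs * core_ratio

# Family-composition data heavily used by the Department of Finance:
high = importance(n_instances=100_000, n_internal_procs=8, n_core_procs=6)
# Data touched by a single marginal process:
low = importance(n_instances=500, n_internal_procs=1, n_core_procs=0)
print(high > low)
```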
Source Reliability. It can be defined as:
Source reliability The credibility of a source organization with respect to provided data; it refers to the pair < source, data >.
The dependence on < source, data > can be clarified through an
example: a University (source) has a reputation of high reliability when
treating data regarding its students and offered courses, but it can have a low
reliability when releasing information regarding forthcoming commercial
events related to companies that offer internships, since such information is
not entirely within the University’s competence. As another example, the
source reliability of the Italian Department of Finance concerning the
Address of citizens is lower than that of City Councils, whereas for the
SocialSecurityNumber its source reliability is the highest among all Italian
administrations.
The values of source reliability may depend on the methods each
organization uses to clean its data and to measure their quality.
Confidentiality. In a cooperative process, sensitivity concerns protecting data
from accidental and fraudulent misuse. In general, three dimensions are
associated with secure information exchange: confidentiality, integrity, and
authentication. Confidentiality means that data are not read during
transmission, integrity that they are not altered, and authentication that
sources and destinations are correct. In the following, we will assume that
integrity and authentication are in any case guaranteed by CIS/e-applications,
as detailed later in this paper, and we associate with data additional
information about confidentiality only.
Confidentiality Indicates whether data must be protected from access by unauthorized users.
As an example, let us consider the instance of Person with
Name=”John”, DateOfBirth=”11-20-1974”, Sex=”M”. Using the
public key of the recipient’s key pair, data can be ciphered, obtaining a
sequence such as:
D“å–àVÌÁÇ9•ûeÑÉÔ;ÿaˆäqÜdNÞוeYdXN}-çÊCª•éï$t
In this way, only the recipient can decrypt the message, using the
private key of the same key pair.
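The public-key exchange described above can be illustrated with a textbook RSA sketch: the destination publishes (e, n), the source encrypts with it, and only the holder of the private exponent d can decrypt. The tiny primes make the arithmetic visible; this is a toy for illustration only, not a secure implementation:

```python
# Key generation with deliberately tiny textbook primes (NOT secure).
p, q = 61, 53
n = p * q                  # modulus, 3233
phi = (p - 1) * (q - 1)    # 3120
e = 17                     # public exponent, coprime with phi
d = pow(e, -1, phi)        # private exponent (modular inverse, Python 3.8+)

def encrypt(m, public_key):
    e, n = public_key
    return pow(m, e, n)

def decrypt(c, private_key):
    d, n = private_key
    return pow(c, d, n)

m = 65                           # a small numeric message
c = encrypt(m, (e, n))
assert c != m                    # the ciphertext is unreadable in transit
assert decrypt(c, (d, n)) == m   # only the private key recovers the message
```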
3 DATA AND QUALITY MODELS
3.1 DATA MODEL
In our framework, all the organizations involved in CIS/e-applications need to
export their data according to some specific schemas; we refer to these schemas
as cooperative data schemas.
They are class schemas defined in accordance with the ODMG Object
Model (Cattell and Barry 1997). Specifically, they describe the types of
exchanged data items, wherein types can be:
• classes, whose instances have their own identities;
• literals, whose instances have no identity and are identified by their values.
objects) and also structured literals, as record of literals. As an example, in
Figure 2 a detail of the cooperative data schema exported by the City Council in
our reference example is shown.
This schema defines a Citizen as a class, and Address as structured
literal (e.g., records).
3.2 DATA QUALITY MODEL
This section defines the conceptual data quality model that each cooperating
organization has to define in order to export the quality of its own data. First we
define the notion of cooperative data quality schema, then we distinguish
between intrinsic and process specific data quality schemas and describe them
in detail.
A cooperative data quality schema is a UML Class Diagram associated to
a cooperative data schema, describing the data quality of each element of the
data schema.
struct Address {
string street;
string cityName;
string state;
string country;
short ZIPCode;
}
… …
class Citizen {
attribute string name;
attribute string surname;
attribute string SSN;
attribute Date birthDate;
attribute Address currentAddress;
… …
}
Figure 2. The cooperative data schema exported by the City
Council (detail)
3.2.1 Intrinsic Data Quality Schemas
Intrinsic data quality dimensions can be modeled by means of classes, which we
call dimension classes, describing the data quality of the data schema elements
with reference to a specific dimension; dimension classes therefore represent
specific intrinsic data quality dimensions (e.g., completeness or currency).
We distinguish two types, depending on whether they refer to a class or to
a structured literal of a cooperative data schema, namely dimension classes
and dimension structured literals. Each dimension class represents the
abstraction of the values of a specific data quality dimension for each of the
attributes of the class or of the structured literal to which it refers, and to which
it is associated by a one-to-one association.
A dimension class (or dimension structured literal) is represented by a
UML class labeled with the stereotype <<Dimension>> (<<Dimension_SL>>),
and the name of the class should be < DimensionName_ClassName > (<
DimensionName_SLName >).
As an example, the class Citizen may be associated with a dimension
class, labeled with the stereotype <<Dimension>>, named
SyntacticAccuracy_Citizen; its attributes correspond to the
syntactic accuracy of the attributes Name, Surname, SSN, etc. (see Figure 3,
referring to Figure 2).
Figure 3. An example of dimension class.
3.2.2 Process specific data quality schemas
Tailoring UML in a way similar to the one adopted for intrinsic data quality
dimensions, we introduce process dimension classes, which represent process
specific data quality dimensions, just as dimension classes represent
intrinsic data quality dimensions.
We introduce the exchange structured literal, necessary to characterize
process dimension classes. According to the definitions proposed in Section 2,
process specific data quality dimensions are tied to a specific exchange within a
cooperative process; in our framework, a cooperative process is modeled as the
interaction of different e-services provided by the different organizations, and we
introduce exchange structured literals to represent the dependence of process
specific dimensions on the source and destination e-services (and the
organizations exporting such e-services).
We distinguish two types of process dimension classes, namely process
dimension classes proper and process dimension structured literals; they
include the values of the attributes of the class or of the structured literal
to which they refer, and to which they are associated by a one-to-one
association. We use the stereotypes <<P_Dimension>> and <<P_Dimension_SL>>
for process dimension classes and process dimension structured literals,
respectively. The name of the class should be <DimensionName_ClassName>
(<DimensionName_SLName>); see Figure 4 for an example.
Figure 4. An example of process dimension class.
An exchange structured literal is a structured literal associated to process
dimension classes. It includes the following mandatory attributes:
• source e-service,
• destination e-service,
• process identifier,
• exchange identifier.
Since two e-services within a cooperative process may have more than one
exchange, an exchange identifier is needed to identify each exchange
uniquely. Exchange structured literals are labeled with the stereotype
<<Exchange_SL>>.
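The exchange structured literal and its four mandatory attributes can be sketched as follows (a Python illustration; the field and class names are ours, the four attributes are those listed above):

```python
from dataclasses import dataclass

# <<Exchange_SL>>: identifies one specific exchange within a cooperative process.
@dataclass(frozen=True)
class ExchangeInfo:
    source_eservice: str
    destination_eservice: str
    process_id: str
    exchange_id: str  # distinguishes multiple exchanges between the same pair

# Two exchanges between the same pair of e-services differ only by exchange_id.
x1 = ExchangeInfo("CityCouncil", "SocialSecurity", "P42", "E1")
x2 = ExchangeInfo("CityCouncil", "SocialSecurity", "P42", "E2")
```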
The considerations presented in this section are summarized in Figure 5, in
which the quality referring to both intrinsic and process specific dimensions for
the Citizen class is represented; the intrinsic data quality dimensions (syntactic
and semantic accuracy, completeness, currency, internal consistency) are
labeled with the stereotype <<Dimension>>, whereas the process specific data
quality dimensions (timeliness, importance, source reliability, confidentiality) are
labeled with the stereotype <<P_Dimension>>, and are associated to the
structured literal Exchange_Info, labeled with the stereotype
<<Exchange_SL>>.
Figure 5. Cooperative data quality schema (detail referring to the
Citizen class); all the associations are one-to-one.
4 THE FRAMEWORK FOR TRUSTED COOPERATION
4.1 THE ARCHITECTURE FOR TRUSTED E-SERVICES
Many approaches can be adopted to allow different organizations to cooperate
through the definition and development of CIS/e-applications, as described in the
introduction. The approach adopted in this paper is workflow-based, that is, the
different organizations export data and services necessary to carry out specific
cooperative processes in which they participate. Such an approach requires an
agreement on the data and service models exported by different organizations
(Mecella et al. 2001a).
In this section, we describe the architecture enabling trusted CIS/e-
applications, by focusing on the two central elements, namely data quality and
security. The starting point of our framework is the definition of a conceptual
cooperative workflow specification, that is, an abstract workflow description that
hides the details of process execution in each of the cooperating organizations;
an example of this conceptual cooperative workflow specification for the running
example has been shown in Figure 1.
On the basis of such a schema, each organization defines its cooperative
data schemas, which specify the structure of exchanged data. Such schemas
are the static interfaces of e-services that implement the cooperative process
through exchanges of trusted data and service requests among different
cooperating organizations. As an example, in Figure 1 the areas limited by
dotted lines identify the e-services. In addition to data schemas, each
organization exports cooperative data quality schemas, described in Section 3.2,
in which information about the quality of the exported data is modeled.
The proposed architecture is shown in Figure 6.
Each cooperating organization exports e-services as application
components deployed on cooperative gateways; a cooperative gateway is the
computing server platform which hosts these components; different
technologies, such as OMG Common Object Request Broker Architecture (OMG
1998), SUN Enterprise JavaBeans (Monson-Haefel 2000), and Microsoft
Enterprise .NET (Trepper 2000) allow the effective development of such
architectural elements, as detailed in (Mecella and Batini 2000).
A cooperative process is therefore realized through the coordination of
different e-services, to be provided by e-applications. An e-application realizes
the “glue” interconnecting and orchestrating different e-services; such a “glue”
needs to be based on the cooperative schemas, regarding both data and their
quality.
Figure 6. The architecture for trusted CIS/e-applications.
Some elements provide infrastructure services needed for the correct and
effective deployment of trusted e-services in the context of this architecture:
• a repository, which stores e-service specifications, that is data schemas, data
quality schemas and application interfaces provided by each e-service; this
repository is accessed at run-time by e-applications to discover and compose
e-services that each organization makes available;
• a source reliability manager, that, for each e-service and for each data
exported by such an e-service, certifies its source reliability (refer to Section
2.2.2); therefore the source reliability manager stores triples < e-service,
data, source reliability value >;
• a certification authority, providing digital certificates, a certificate repository
and a certificate revocation list (Housley et al. 1999); the roles of such
elements will be described in the next section, when security aspects
concerning data exchanges will be discussed.
4.2 EXCHANGE UNIT FORMAT
Different information needs to be associated to each data exchange in order to
support trust; we define an exchange unit as data:
• transmitted from one e-service to another in the cooperative process,
• associated with quality data, and
• transmitted according to security rules.
All data are exchanged according to the exchange unit format (shown in
Figure 7), in order to ensure that they all can be adequately validated by the
receiving organization (this concept will be further explained in Section 5).
Figure 7. Exchange Unit Format.
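The exchange unit of Figure 7 can be sketched as a record grouping data, quality data, history, and sensitivity information, plus the two security fields (an illustrative Python structure; the paper does not prescribe a concrete wire format beyond XML):

```python
from dataclasses import dataclass, field

@dataclass
class ExchangeUnit:
    # Information about data and process
    data: str                                          # XML document, as in Figure 8
    quality_data: dict = field(default_factory=dict)   # intrinsic dimension values
    history: list = field(default_factory=list)        # manipulation tuples
    sensitivity: dict = field(default_factory=dict)    # confidentiality flags, etc.
    # Security aspects
    digital_certificate: bytes = b""
    digital_signature: bytes = b""

unit = ExchangeUnit(data="<Citizen>...</Citizen>",
                    quality_data={"completeness": 0.95})
```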
Data are exchanged as XML files, specifically cooperative data schemas are
described as DTDs; as an example, in Figure 8 an XML document,
corresponding to the detail of the cooperative data schema exported by the
City Council (refer to Figure 2), is shown.

<Citizen>
  <FirstName>John</FirstName>
  <LastName>McLeod</LastName>
  <SSN>000111222333</SSN>
  <Date property="birthDate">
    <Day>10</Day>
    <Month>06</Month>
    <Year>1945</Year>
  </Date>
  <Address field="currentResidence">
    <Street>… …</Street>
    <CityName>New York</CityName>
    <State>NY</State>
    <Country>USA</Country>
    <ZIP>… …</ZIP>
  </Address>
</Citizen>
Figure 8. A possible XML document corresponding to the
cooperative data schema shown in Figure 2.
Quality data concerning intrinsic dimensions (i.e., syntactic and semantic
accuracy, completeness, currency and internal consistency) may be the result
of an assessment activity performed by each organization on the basis of
traditional methods for measuring data quality, such as the statistical
methods proposed in (Morey 1982).
History can be defined as a list of n-tuples < source e-service,
destination e-service, operation, link to previous data,
timeliness >, describing the history of manipulations applied to data. For
the purpose of the present paper, we assume that:
• the history of data tracks the transfer of data among interacting
organizations (i.e., e-services) only if the nature of data is not changed
through processing executed by the destination organization;
• if a value is changed, it will be transferred in a new data exchange,
starting a new history list;
• operations that preserve the history are those that do not alter the
identity of exchanged data, that is: read, clean (according to data
cleaning algorithms), and realign (such as changing the format of dates
from the European to the American one).
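These rules can be sketched as follows: identity-preserving operations append to the current history list, while a value change starts a new list. The operation names come from the text above; the function and variable names are ours:

```python
HISTORY_PRESERVING = {"read", "clean", "realign"}

def record(history, source, dest, operation, prev_link, timeliness):
    """Apply the history rules of the exchange unit.

    Returns the (possibly new) history list after the operation.
    """
    entry = (source, dest, operation, prev_link, timeliness)
    if operation in HISTORY_PRESERVING:
        return history + [entry]      # identity preserved: extend the list
    return [entry]                    # value changed: start a new history

h = record([], "A", "B", "read", None, "in time")
h = record(h, "B", "C", "clean", "A->B", "in time")
h2 = record(h, "C", "D", "update", "B->C", "late")  # new history starts
```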
Sensitivity information denotes the level of confidentiality of data being
transferred, and, according to this level, information useful for its encryption.
The confidentiality level can be assigned to data according to standard
security policies, e.g., using rules for data labeling (Castano et al. 1995).
Depending on the relevance level of exchanged data, confidentiality can be
ensured at different granularity levels. We can encrypt: (i) only the data
package, (ii) also quality data and history, (iii) no data parts, or (iv) any
possible combinations thereof. To cope with these requirements, sensitivity
information regards:
• Confidentiality: for each component of the exchange unit (i.e., data, quality
data, history), we define a boolean value (confidentiality flag) indicating
whether the component is confidential or not.
• Encryption method: indicates the asymmetric encryption algorithm (e.g.,
RSA), and the hash algorithm (e.g., SHA1) (Tanenbaum 1996) to be used
to generate the digital signature (see Figure 7).
• Session key: the key used to encrypt the relevant information with
symmetric cryptography (e.g., Triple-DES) (Tanenbaum 1996) in order to
improve transmission performance.
Security aspects of the exchange unit need to be addressed, namely (i)
integrity, (ii) authentication and (iii) confidentiality.
As regards integrity, it is provided by creating a secure and efficient
transmission channel through the following components of the exchange unit:
• the digital certificate, owned by the source organization;
• the digital signature of both the listed components of the exchange unit
and the digital certificate.
The digital certificate is issued by a Certification Authority, basically according
to the X.509 format (some extensions may possibly be required but, as they
regard data contents and source rather than data exchange, they are not
further analyzed in this paper) (Housley et al. 1999). The digital signature is
created according to the PKCS#7 specification (RSA Laboratories 1993),
thus allowing the destination organization to verify the integrity of the data
and of the digital certificate. By also signing the certificate, we guarantee the
association between the data and their creator.
Authentication can be weak or strong:
• Weak authentication, required for trusted organizations, means that the
destination e-service checks the signature of the source e-service using
the public key of the source e-service, but trusts the certificate of the
source e-service. The advantage is that data transmission is fast and
reliable: trusted organizations know each other by means of a list of
certificates (in the certificate repository); integrity and reliability of such
lists are under the responsibility of the Certification Authority.
• Strong authentication, required for untrusted/external organizations, uses
a Public Key Infrastructure (PKI) (Housley et al. 1999), specifically a
certificate revocation list, in order to validate the certificate of the source
e-service.
Finally, as regards confidentiality, data, quality data and history are encrypted
using the session key included in the sensitivity information part, according to
the value of the confidentiality flags. To avoid disclosure of the session key,
this is encrypted by the source e-service using the public key of the
destination one.
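The confidentiality mechanism, a symmetric session key for the bulk data with the key itself protected asymmetrically, can be illustrated with a deliberately toy cipher. The framework as described uses Triple-DES and RSA; we do not reimplement those here, and the XOR "cipher" below is a stand-in for illustration only:

```python
import hashlib
import os

def xor_cipher(data: bytes, key: bytes) -> bytes:
    # Toy symmetric cipher standing in for Triple-DES; NOT secure.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

payload = b"<Citizen>...</Citizen>"
session_key = os.urandom(16)

# Source e-service: encrypt the flagged components with the session key...
ciphertext = xor_cipher(payload, session_key)
# ...and compute the hash that feeds the digital signature (SHA1, per the text).
digest = hashlib.sha1(payload).hexdigest()

# Destination e-service: after recovering the session key (sent encrypted under
# its public RSA key, omitted here), it decrypts and verifies the digest.
recovered = xor_cipher(ciphertext, session_key)
assert recovered == payload
assert hashlib.sha1(recovered).hexdigest() == digest
```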
4.3 QUALITY DIMENSIONS VS. FRAMEWORK ELEMENTS
The information transmitted in the exchange unit does not describe all data
quality dimensions introduced in Section 2.2. Some of the dimensions are
associated to data during exchanges between source and destination
organizations, whereas other data quality dimensions are evaluated, or directly
associated to data, by the destination organization.
A summary table describing where the quality dimensions are elaborated
is shown in Table 1. All intrinsic quality data are transmitted together with data by
the source organization.
Process related quality data reflect instead the dynamic nature of data
within the processes. Timeliness is evaluated by the destination organization,
according to the expected arrival time of the data (i.e., due time). The importance
of data is associated to data by the destination organization according to the
activity to be performed on these data. The destination organization accesses
the source reliability manager to know the reliability of the source organization
with respect to the exchanged data; finally confidentiality is transferred by means
of sensitivity information in the exchange unit.
Quality dimension: Where elaborated

Intrinsic dimensions (syntactic and semantic accuracy, completeness,
currency, internal consistency): transferred with data within the exchange
unit; evaluated by the source organization.

Process specific dimensions:
• Timeliness: not transferred; evaluated by the destination organization.
• Importance: not transferred; associated to data by the destination
organization.
• Source reliability: not transferred; provided by the Source Reliability
Manager, which is accessed by the destination organization.
• Confidentiality: transferred with data; associated to data by the source
organization.

Table 1. Quality dimensions in cooperative processes.
5 THE FRAMEWORK UNDER A STRATEGIC PERSPECTIVE
The framework proposed in the previous sections for data exchange in CIS/e-
applications allows organizations to assess data upon receiving them. In the
present section, we discuss methodological issues related to the interpretation
and possible strategic uses of information about the trust of exchanged data.
From a methodological point of view, we can examine different points
related to data quality evaluation by an organization:
• data creation;
• assessment of the quality of received data;
• evaluation of acceptable quality levels;
• actions to be taken when low quality data is received.
Data exchanged in the cooperative environment can originate internally
within the organizations. Upon data creation, it is necessary to evaluate the
quality of newly created data, in particular with respect to accuracy,
completeness, and internal consistency. Accuracy can be assessed
according to statistical evaluations based on the type of data creation, e.g.,
whether it is manual data entry or capture using OCR systems. Corrections to
such assessments can be applied if data cleaning techniques are used on
created data to improve its quality.
As regards the assessment of the quality of received data by destination
organizations, it is important to note that quality is not an absolute value, but
it is mainly related to the intended use of the data by the destination
organization in that specific exchange in the process. Several conflicting
considerations can be made based on available quality parameters, yielding
different evaluations, and we discuss some examples in the following.
Let us suppose that a given organization B receives from an organization A
an exchange unit x. First of all, B must compute the timeliness for all the data
values of x, on the basis of their due time. Importance affects assessment of
timeliness; as an example, if importance is “high” but data are not delivered in
time, then B will consider them “poor quality” data during the evaluation
phase. All the intrinsic data quality values can be weighted on the basis of the
related values of importance and source reliability, by using some weighting
function chosen by the organization B. The importance values of a given
data item are chosen by the organization B, whereas the source reliability of A,
with respect to the specific data, is maintained by the Source Reliability
Manager. In many cases there is a trade-off between source reliability,
importance and other dimensions; as an example, B may consider that a
"low" source reliability for data within x may be balanced by a “high” accuracy
for them.
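One possible such weighting function is sketched below. The specific formula is our illustration; the framework leaves the choice to organization B:

```python
def weighted_quality(intrinsic, importance, source_reliability):
    """Weight intrinsic quality values by importance and source reliability.

    All inputs are assumed normalized to [0, 1]; higher is better.
    """
    weight = (importance + source_reliability) / 2
    return {dim: value * weight for dim, value in intrinsic.items()}

# "Low" source reliability (0.4) partially balanced by "high" accuracy (0.95):
scores = weighted_quality({"syntactic_accuracy": 0.95, "completeness": 0.8},
                          importance=1.0, source_reliability=0.4)
```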
The assessment can be done either on single data values, or on the whole
exchange unit; it is a choice of the destination organization to aggregate and
elaborate received data to assess a global quality value. On the other hand, it
is not possible to disaggregate data which are being received as a single
value with respect to quality parameters; as an example, if an Address
instance is transmitted as composed of Street, ZIPCode, CityName,
State and Country, it is possible to evaluate both the quality of each value
and the global quality of the Address instance. Conversely, if the Address
value is transmitted as a simple string, it is possible to evaluate only its quality
as a whole, and not the quality of each of its components.
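This granularity point can be illustrated as follows: component-wise quality values can be aggregated upward, but a flat string admits only a single quality value. The averaging rule is our own choice, not prescribed by the framework:

```python
def aggregate(component_quality: dict) -> float:
    # One possible aggregation: the average of the component quality values.
    return sum(component_quality.values()) / len(component_quality)

# Structured Address: per-component quality, plus a derivable global value.
address_quality = {"Street": 0.9, "ZIPCode": 1.0, "CityName": 1.0,
                   "State": 1.0, "Country": 1.0}
global_quality = aggregate(address_quality)

# Flat string: a single quality value, with no way to recover per-component ones.
flat_address_quality = 0.7
```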
Once the quality of received data has been assessed, it can be evaluated for
acceptability, by using a multi-argument function. The decision whether to
accept or reject incoming data depends on complex tradeoffs among quality
parameters; as an example, while in some cases timeliness of data is more
important than accuracy, in other cases the contrary is true, and the
organization B prefers receiving late but accurate data. The result of this step
is an overall accept/reject decision on the received exchange unit; e.g., if the
importance of x is “very high” whereas the overall quality is “very poor”, B can
decide to reject it. In this evaluation, some organizations may choose to examine the
complete history of data, basing acceptance not only on information
concerning the last exchange, but also evaluating all the manipulations and
timeliness information about previous data exchanges, as stored in the
history component of x. For instance, data cleaning operations already
applied to data by other organizations can support a higher evaluation of the
global data quality value.
After the decision to accept and to use data, it is possible to continue the
execution of the cooperative process, according to the cooperative workflow
specification. Conversely, if the quality of available data is insufficient, it is
necessary to take corrective actions to improve the quality of data. Several
actions are possible:
• Received data are rejected, and the source organization A is requested to
resend the same data with better quality parameters. This situation is
acceptable when low global quality is not related to lack of timeliness.
• An e-service can raise an exception to its normal execution. An
exception causes the activation of other e-services that are not part of the
normal workflow.
• A data cleaning or improvement action is undertaken inside the e-service
of the organization B in order to improve data quality.
From these basic considerations, we now derive suggestions for the design,
implementation and strategic use of trust parameters, and for improvement and
possible restructuring interventions.
Framework design and management issues. The design and maintenance of
the cooperative environment supporting trusted data exchanges comprise
several aspects:
• Granularity criteria for e-services design. An e-service can be designed to
cover a whole organization, or portions thereof, hence at different
granularities. We just mention criteria that can be adopted here, such as
criteria employed for workflow process design: homogeneity of activities,
manageability of problems, number of interfaces to be designed, number
of agents assigned to activities. Other criteria to be used here can be
taken from the literature on (distributed) data design: dimension of
exchanged data units, number of data values to be transmitted if the
designed exchange unit is too small/large, granularity of encryption and
signature/certificate mechanisms to ensure security and reliability.
Both classes of criteria can be applied to design the e-services at a
correct level. Obviously, the granularity deeply impacts on the efficiency
and the maintainability of the environment. As an example, let us consider
Figure 1 and the problems related to the introduction of a new e-service:
this can imply the substitution of a schema portion related to an e-service
(i.e., a portion delimited by dotted lines in the figure) with the one of the new
provider organization (which, for instance, can offer the same service under
competitive conditions), or the reorganization of the whole cooperative
workflow specification, if a new e-service is defined and added to the
existing cooperative process.
• Benchmarking. The quality parameters described in the paper can be
regarded as strategic means for benchmarking the cooperative process
design, since they support the monitoring of e-services and help:
- to better define the granularity of the schemas;
- to restructure and re-engineer the schemas;
- destination organizations to improve their relationships towards other
entities (e.g. their business customers);
- source organizations to ameliorate their services (e.g., balancing
accuracy vs. timeliness vs. importance).
• Accounting and Monitoring. To help improve framework efficacy, a
mechanism of e-monitoring can be set up to observe quality information,
thus supporting tracing, analysis, and certification of data exchanges.
Accounting information is a basic aspect of e-monitoring. It should also be
accompanied by documentation about data flows, about testing and
probing reports resulting from samples on the framework operation, and
by trust reports that contain all security relevant parameters (history of
flows, of user behavior, of security violations, and so on).
Another way of verifying the quality of the design is the observation of
exceptions to the normal flow. Frequent exceptions can be a symptom of
mis-functioning of some e-services, due to various reasons. One is
straightforward and generally concerns wrong design choices
(wrong granularity being a typical example). A second type of cause can be
the low quality of data provided by a given e-service; for example, data
from one provider e-service (i.e., organization) always present “very low”
timeliness, or are scarcely secure. Triggers can be inserted in the
cooperative workflow specification to monitor these anomalies in order to:
- signal to the destination organization that a given provider e-service is
unreliable;
- signal to the source organization that the quality of data provided by its
e-services is low and that it might become out-of-market.
Anomalies can therefore be regarded as a means to strategically monitor
framework design and performance and for organizations to improve their
strategic orientations.
• Contractual aspects bound to e-service executions. Cooperating
organizations should be able to obtain certification of exchanged data, of
their quality and sensitivity levels, and of user satisfaction measured
through quality parameters.
• Compliance between the cooperative data model and the organizational
model. This aspect can be studied by observing the overall behavior of
the e-services, the customer satisfaction, the percentage of discarded
data, the exceptions that occurred during workflow executions, and so on; in
particular, exceptions and their management are useful to decide whether
an e-service has to be corrected or redesigned.
Implementation framework. Several elements are needed in order to realize
the proposed framework for trusted cooperation; specifically, (i) a descriptive
structure, consisting of models and languages able to describe e-services at
a high level of abstraction, and (ii) tools for mapping such a structure onto a
multi-technology platform and for initializing trust parameters.
E-services are described by using an abstract description language (Mecella
et al. 2001a, Mecella et al. 2001b), which is an abstraction of technological
component models (e.g., CORBA, EJB, .NET); each e-service needs to be
effectively provided by an organization as a component in a specific
implementation technology. The implementation interfaces of such a
component can be generated from the e-service specification, based on
specific generation rules and tools. The use of different component models,
one for e-service description and many at the technological level, is due to the
coexistence of different cooperative technologies and to the opportunity, in a
multi-organization environment, of integrating all these components by
adopting a technology-independent component model for them.
The coordination of different e-services composing a cooperative process is
carried out by the coordination “glue” inside e-applications; such glue is able
to coordinate different components by generating at run-time specific service
requests, according to the specific implementation technologies; the
generation of the specific service requests is possible by using e-service
descriptions, stored in the repository, and the availability of mapping modules
which realize the transformation rules from e-service descriptions to
technological component models.
The focus of this paper is on introducing trust in data exchanges among e-
services; cooperative data schemas represent the interfaces of the e-
services, that is, the specifications of input and output data to e-services. In
order to provide trust, we have also introduced cooperative data quality
schemas, to be offered by e-services as part of their interfaces, and security
aspects in communication among e-services.
Finally, as far as the management of trust parameters is concerned, the
framework is assumed to be initialized with fixed quality and sensitivity
information, and to be then updated using a Feedback and Monitoring Module
(Bellettini et al. 1999), which observes the behavior and improves performance
over time using feedback about quality parameters, triggers, user actions, and
customer satisfaction.
6 RELATED WORK
As in our framework trust is obtained by introducing quality information and
security, we will briefly describe related work in these fields.
The notion of data quality has been widely investigated in the literature;
among the many proposals we cite the definitions of data quality as “fitness for
use” (Wang and Strong 1996), and as “the distance between the data views
presented by an information system and the same data in the real world” (Orr
1998, Wand and Wang 1996). The former definition emphasizes the subjective
nature of data quality, whereas the latter is an “operational” definition, although
defining data quality on the basis of comparisons with the real world is a very
difficult task. In this paper we have considered data quality as an implicit concept
strictly dependent on a set of dimensions; they are usually defined in the data
quality literature as quality properties or characteristics of data (e.g., accuracy,
completeness, consistency, etc.).
Many definitions of data quality dimensions have been proposed; among
them we cite: the classification given in (Wang and Strong 1996), in which four
categories (i.e., intrinsic, contextual, representation and accessibility aspects of
data) are identified for data quality dimensions, and the taxonomy proposed in
(Redman 1996), in which more than twenty data quality dimensions are
classified into three categories, namely conceptual view, values and format. A
survey of data quality dimensions is given in (Wang et al. 1995); it is important to
note that in the literature there is no agreement either on the set of
dimensions strictly characterizing data quality, or on the meaning of each
of them.
We have defined some dimensions based on the ones proposed in the
literature, and we have introduced some new quality dimensions, as they are
specifically relevant in cooperative environments.
Data quality issues have been addressed in several research areas, i.e.,
data cleaning, quality management in information systems, data warehousing,
integration of heterogeneous databases and web information sources. To the
best of our knowledge, many aspects concerning data quality in CIS/e-applications
have not yet been addressed; however, when dealing with data quality issues in
cooperative environments, some of the results already achieved for traditional
and web information systems can be borrowed. In CIS/e-applications, the main
data quality problems are:
• Assessment of the quality of the data exported by each organization;
• Methods and techniques for exchanging quality information;
• Improvement of quality;
• Heterogeneity, due to the presence of different organizations, in general with
different semantics about data.
As regards the assessment of the quality of intra-organizational data, results
achieved in the data cleaning area (Elmagarmid et al. 1996, Hernandez and
Stolfo 1998, Galhardas et al. 2000), as well as in the data warehouse area
(Vassiliadis et al. 1999, Jeusfeld et al. 1998), can be adopted.
Heterogeneity has been widely addressed in the literature, especially
focusing on schema integration issues (Batini et al. 1984, Gertz 1998, Ullman
1997, Madnick 1999, Calvanese et al. 1998).
Improvement and methods and techniques for exchanging quality
information have been only partially addressed in the literature (e.g., Mihaila et
al. 1998) and are the main focus of this paper; we have proposed a conceptual
model for exchanging such information in a cooperative framework and some
hints for improvement based on the availability of quality information.
Finally as regards security, the main problems tackled in the literature
regard data protection during storage and during transmission, with the
associated aspects of confidentiality, integrity and authentication (Castano et al.
1995). Several solutions have been proposed, based on standards and
specifications regarding the use of cryptography for signatures and certificates,
such as PKCS#7 and RFC2459 (RSA Laboratories 1993, Housley et al. 1999).
We have relied on these standard proposals for data exchange.
7 CONCLUDING REMARKS AND FUTURE WORK
In this paper an approach to trusted data exchange in cooperative processes has
been presented. The main emphasis of this work has been on accompanying
exchanged information with additional information that enables receiving
organizations to assess the suitability of data before using them. In addition, a
framework has been proposed to allow trusted data exchange with quality
information in a secure environment.
The data quality problem in cooperative environments is, in general, still an open issue. Further work is needed to precisely define the data quality dimensions proposed in the literature. In the context of cooperative processes, our approach is, to our knowledge, the first proposal of a comprehensive framework for trusted data exchange based on data quality information. Our approach will be validated on practical cases in the public administration domain, and based on these experiences the model will be refined.
In the present paper, we have concentrated our attention on data exchange within a cooperative process. Although we have defined the exact format of the exchange unit, we still need to explore possible ways of translating not only the data, but also the quality, history and sensitivity information of the exchange unit, into XML structures. Based on the proposed approach, future work will also concentrate on process improvement driven by the evaluation of the quality of exchanged data: the analysis of such quality, its evaluation by receiving organizations, and the compensating actions started when data quality is considered insufficient can be the basis for new process improvement techniques. In addition, more work is needed to provide mechanisms for associating reliability information with data sources, validating it, and revising it according to a statistical analysis of process instances evaluated in the past.
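One simple way to realize such a source-reliability mechanism (purely our illustrative assumption, not a design prescribed by the paper) is to score each source by the smoothed fraction of past process instances in which its data was judged acceptable, so that a source with no history starts from a neutral prior:

```python
def source_reliability(outcomes, prior=0.5, weight=2):
    """Estimate a data source's reliability from past evaluations.

    outcomes: list of booleans, one per past process instance, True when
    the exchanged data was judged acceptable by the receiving organization.
    The Laplace-style smoothing (prior, weight) keeps the estimate near the
    prior until enough history accumulates; both parameters are our
    illustrative choice.
    """
    return (prior * weight + sum(outcomes)) / (weight + len(outcomes))
```

The score can then be revised incrementally as new process instances are evaluated, which matches the statistical-revision idea sketched above.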
Future work on sensitivity and security concerns the extension of XML DTDs to treat security properties at the needed level of data granularity (i.e., data item, quality attributes, and other detail levels).
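To make the intended granularity concrete, the following Python sketch serializes one data item of an exchange unit to XML, carrying its quality dimensions and a sensitivity marking at item level. The element and attribute names are illustrative assumptions, not the DTD the paper proposes to develop.

```python
import xml.etree.ElementTree as ET

def build_exchange_unit(name, value, quality, sensitivity):
    """Serialize a single data item with item-level sensitivity and
    per-dimension quality metadata into an XML fragment.

    quality: dict mapping a quality dimension name (e.g. accuracy,
    currency) to its value for this item.
    """
    unit = ET.Element("ExchangeUnit")
    # Sensitivity is attached at data-item granularity, as an attribute.
    item = ET.SubElement(unit, "DataItem", name=name, sensitivity=sensitivity)
    ET.SubElement(item, "Value").text = value
    # Each quality dimension becomes a child element of the item.
    q = ET.SubElement(item, "Quality")
    for dim, val in quality.items():
        ET.SubElement(q, "Dimension", name=dim).text = str(val)
    return ET.tostring(unit, encoding="unicode")
```

Finer granularities (e.g., security properties on individual quality attributes) would add further attributes at the `Dimension` level, which is exactly where a DTD extension would have to operate.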
ACKNOWLEDGEMENTS
The authors thank Carlo Batini for discussions and suggestions about this work.
REFERENCES
Batini C., Cappadozzi E., Mecella M., Talamo M. (2001): Cooperative Architectures: The Italian Way Along e-Government. To appear in Elmagarmid A.K., McIver Jr W.J. (eds): Advances in Digital Government: Technology, Human Factors, and Policy, Kluwer Academic Publishers, 2001.
Batini C., Lenzerini M., Navathe S.B. (1984): A comparative analysis of methodologies for database schema integration. ACM Computing Surveys, vol. 15, no. 4, 1984.
Bellettini C., Damiani E., Fugini M.G. (1999): Design of an XML-based Trader for Dynamic Identification of Distributed Services, Proceedings of the 1st Symposium on Reusable Architectures and Components for Developing Distributed Information Systems, RACDIS’99 , Orlando, FL, August 1999.
Bitton D., DeWitt D. (1983): Duplicate Record Elimination in Large Data Files. ACM Transactions on Database Systems, vol. 8, no. 2, 1983.
Brodie M.L. (1998): The Cooperative Computing Initiative. A Contribution to the Middleware and Software Technologies. GTE Laboratories Technical Publication, 1998, available on-line (link checked July, 1st 2001): http://info.gte.com/pubs/PITAC3.pdf.
Calvanese D., De Giacomo G., Lenzerini M., Nardi D., Rosati R. (1998): Information Integration: Conceptual Modeling and Reasoning Support. In Proceedings of the 6th International Conference on Cooperative Information Systems (CoopIS'98), New York City, NY, USA, 1998.
Casati F., Sayal M., Shan M.C. (2001): Developing E-Services for Composing E-Services. Proceedings of the 13th International Conference on Advanced Information Systems Engineering (CAiSE 2001), Interlaken, Switzerland, 2001.
Castano S., Fugini M.G., Martella G., Samarati P. (1995): Database Security, Addison Wesley, 1995.
Cattell, R.G.G., Barry D.K. (eds.) (1997): The Object Database Standard: ODMG 2.0. Morgan Kaufmann Publishers, 1997.
Central IT Unit (CITU) of the Cabinet Office (2000): The GovTalk initiative. http://www.govtalk.gov.uk/ (link checked July, 1st 2001).
Cochinwala M., Kurien V., Lalk G., Shasha D. (1998): Efficient Data Reconciliation. Bellcore Technical Report 1998.
Elmagarmid A., Horowitz B., Karabatis G., Umar A. (1996): Issues in Multisystem Integration for Achieving Data Reconciliation and Aspects of Solutions. Bellcore Research Technical Report, 1996.
Galhardas H., Florescu D., Shasha D., Simon E. (2000): An Extensible Framework for Data Cleaning. Proceedings of the 16th International Conference on Data Engineering (ICDE 2000), San Diego, CA, USA, 2000.
Gertz M. (1998): Managing Data Quality and Integrity in Federated Databases. Second Annual IFIP TC-11 WG 11.5 Working Conference on Integrity and Internal Control in Information Systems, Airlie Center, Warrenton, Virginia, 1998.
Hernandez M.A., Stolfo S.J. (1998): Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem. Journal of Data Mining and Knowledge Discovery, vol. 1, no. 2, 1998.
Housley R., Ford W., Polk W., Solo D. (1999): Internet X.509 Public Key Infrastructures Certificate and CRL Profile. Network Working Group Standards Track, 1999.
Jeusfeld M.A., Quix C., Jarke M. (1998): Design and Analysis of Quality Information for Data Warehouses. Proceedings of the 17th International Conference on Conceptual Modeling (ER'98), Singapore, 1998.
Madnick S. (1999): Metadata Jones and the Tower of Babel: The Challenge of Large-Scale Semantic Heterogeneity. Proceedings of the 3rd IEEE Meta-Data Conference (Meta-Data '99), Bethesda, MD, USA, 1999.
Mecella M., Batini C. (2000): Cooperation of Heterogeneous Legacy Information Systems: a Methodological Framework. Proceedings of the 4th International Enterprise Distributed Object Computing Conference (EDOC 2000), Makuhari, Japan, 2000.
Mecella M., Batini C. (2001): Enabling Italian e-Government Through a Cooperative Architecture. In Elmagarmid, A.K., McIver Jr, W.J. (eds.): Digital Government. IEEE Computer, vol. 34, no. 2, February 2001.
Mecella M., Pernici B. (2001): Designing Wrapper Components for e-Services in Integrating Heterogeneous Systems. To appear in VLDB Journal, Special Issue on e-Services, 2001.
Mecella M., Pernici B., Craca P. (2001b): Compatibility of Workflow e-Services in a Cooperative Multi-Platform Environment. To appear in Proceedings of the 2nd VLDB Workshop on Technologies for E-Services (VLDB-TES 2001), Roma, Italy, September 2001.
Mecella M., Pernici B., Rossi M., Testi A. (2001a): A Repository of Workflow Components for Cooperative e-Applications. Proceedings of the 1st IFIP TC8 Working Conference on E-Commerce/E-Business, Salzburg, Austria, 2001.
Mihaila G., Raschid L., Vidal M. (1998): Querying Quality of Data Metadata. Proceedings of the 6th International Conference on Extending Database Technology (EDBT’98), Valencia, Spain, 1998.
Missier P., Scannapieco M., Batini C. (2001): Cooperative Architectures. Introducing Data Quality. Technical Report 14-2001, Dipartimento di Informatica e Sistemistica, Università di Roma “La Sapienza”, Roma, Italy, 2001.
Monge A., Elkan C. (1997): An Efficient Domain Independent Algorithm for Detecting Approximate Duplicate Database Records. Proceedings of SIGMOD Workshop on Research Issues on DMKD, 1997.
Monson-Haefel R. (2000): Enterprise JavaBeans (2nd Edition). O'Reilly, 2000.
Morey R.C. (1982): Estimating and Improving the Quality of Information in the MIS. Communications of the ACM, vol. 25, no. 5, 1982.
Mylopoulos J., Papazoglou M. (eds.) (1997): Cooperative Information Systems. IEEE Expert Intelligent Systems & Their Applications, vol. 12, no. 5, September/October 1997.
Object Management Group (1998): The Common Object Request Broker Architecture and Specifications. Revision 2.3. Object Management Group, Document formal/98-12-01, Framingham, MA, 1998.
Orr K. (1998): Data Quality and Systems Theory. Communications of the ACM, vol. 41, no. 2, 1998.
Redman T.C. (1996): Data Quality for the Information Age. Artech House, 1996.
RSA Laboratories (1993): Cryptographic Message Syntax Standard. RSA Laboratories Technical Note, Version 1.5, 1993.
Schuster H., Georgakopoulos D., Cichocki A., Baker D. (2000): Modeling and Composing Service-based and Reference Process-based Multi-enterprise Processes. Proceedings of the 12th International Conference on Advanced Information Systems Engineering (CAiSE 2000), Stockholm, Sweden, 2000.
Tanenbaum A.S. (1996): Computer Networks, Third Edition. Prentice Hall, 1996.
Tansel A., Snodgrass R., Clifford J., Gadia S., Segev A. (eds.) (1993): Temporal Databases. Benjamin-Cummings, 1993.
Trepper C. (2000): E-Commerce Strategies. Microsoft Press, 2000.
UDDI.org (2000): UDDI Technical White Paper, 2000. Available on-line (link checked July, 1st 2001): http://www.uddi.org/pubs/Iru_UDDI_Technical_White_Paper.pdf.
Ullman J.D. (1997): Information Integration using Logical Views. Proceedings of the International Conference on Database Theory (ICDT '97), Greece, 1997.
Vassiliadis P., Bouzeghoub M., Quix C. (1999): Towards Quality-Oriented Data Warehouse Usage and Evolution. Proceedings of the 11th International Conference on Advanced Information Systems Engineering (CAiSE'99), Heidelberg, Germany, 1999.
VLDB-TES (2000): Proceedings of the 1st VLDB Workshop on Technologies for E-Services (VLDB-TES 2000), Cairo, Egypt, 2000.
Wand Y., Wang R.Y. (1996): Anchoring data quality dimensions in ontological foundations. Communications of the ACM, vol. 39, no. 11, 1996.
Wang R.Y., Storey V.C., Firth C.P. (1995): A Framework for Analysis of Data Quality Research. IEEE Transactions on Knowledge and Data Engineering, vol. 7, no. 4, 1995.
Wang R.Y., Strong D.M. (1996): Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, vol. 12, no. 4, 1996.
World Wide Web Consortium (W3C) (1998): Extensible Markup Language (XML) Version 1.0. February 1998. http://www.w3.org
Goldfarb C.F., Prescod P. (2000): The XML Handbook. Prentice Hall, 2000.
Object Management Group (OMG) (2000): OMG Unified Modeling Language Specification. Version 1.3. Object Management Group, Document formal/2000-03-01, Framingham, MA, 2000.