Knowledge Discovery with Classification Rules
in a Cardiovascular Dataset
Vili Podgorelec(1), Peter Kokol(1), Milojka Molan Stiglic(2), Marjan Heričko(1), Ivan Rozman(1)
(1)University of Maribor – FERI, Smetanova 17, SI-2000 Maribor, Slovenia [email protected]
(2)Maribor Teaching Hospital, Department of Pediatric Surgery, Maribor, Slovenia
Abstract. In this paper we study an evolutionary machine learning approach to data mining and knowledge discovery based on classification rule induction. A method for automatic rule induction called AREX, using evolutionary induction of decision trees and automatic programming, is introduced. The proposed algorithm is applied to a cardiovascular dataset consisting of different groups of attributes which should possibly reveal the presence of some specific cardiovascular problems in young patients. A case study is presented that shows the use of AREX for the classification of patients and for discovering possible new medical knowledge from the dataset. The defined knowledge discovery loop comprises a medical expert’s assessment of induced rules to drive the evolution of rule sets towards more appropriate solutions. The final result is the discovery of possible new medical knowledge in the field of pediatric cardiology.

Index terms: machine learning, knowledge discovery, classification rules, pediatric cardiology, medical data mining

1. INTRODUCTION

Modern medicine generates huge amounts of data and there is an acute and widening gap
between data collection and data comprehension. Obviously it is very difficult for a human to make use of such an amount of information (i.e. hundreds of attributes, thousands of images, several channels of 24-hour ECG or EEG signals) and to be able to find basic patterns, relations or trends in the data. In this way data becomes less and less useful, the transformation data ⇒ information harder and harder, and the transformation data ⇒ information ⇒ knowledge almost impossible. Thus, there is a great need to find new methods
for data analysis to facilitate the creation of knowledge that can be used for clinical decision making. Intelligent systems for knowledge extraction are the tools that can help in achieving this goal.

Citation Reference: V. Podgorelec, P. Kokol, M. Molan Stiglic, M. Heričko, I. Rozman, Knowledge Discovery with Classification Rules in a Cardiovascular Database, Computer Methods and Programs in Biomedicine, Elsevier, vol. 80, suppl. 1, pp. S39-S49, 2005.
1.1. Objectives and scope of the paper

There are two main objectives of this paper. The first objective is to introduce a new
intelligent knowledge extraction paradigm based on evolutionary rule sets induction. We
present a new hybrid classification algorithm based on genetic algorithms (GAs) and genetic
programming (GP) – the AREX approach. AREX (Automatic Rules Extractor) is a general
hybrid method that incorporates two original, independent algorithms that together solve the
problem of automatic classification rule induction. The first algorithm is a multi-population self-adapting genetic algorithm for the induction of decision trees. The second is a system for the evolution of programs in an arbitrary programming language, which is used to evolve classification rules. Finally, an optimal set of classification rules is determined with a simple genetic algorithm.
The second objective is to present a case study of using the developed algorithm to discover
new knowledge in a problem of early and accurate identification of cardiovascular problems
in pediatric patients. It is shown how AREX can be used to extract medical knowledge and
the results obtained in this manner are evaluated by a medical expert. To objectively compare
the developed AREX approach with the existing methods the results are compared to those
obtained with other classification methods.
This paper is organized as follows. Section 2 presents a short overview of data mining and
knowledge discovery with the emphasis on the evolutionary induction of decision trees; it
indicates some reasons for the development of AREX. Section 3 presents the developed
AREX algorithm in detail. Section 4 presents a case study of using AREX upon a
cardiovascular database, where all the obtained results are evaluated and compared with the
existing classification algorithms. Finally, section 5 presents a discussion that concludes the
paper.
2. DATA MINING AND KNOWLEDGE DISCOVERY

Although a great deal of time and effort is spent building and maintaining all kinds of databases, it is nonetheless rare that the full potential of this valuable resource is realised. The principal reason for this paradox is that the majority of organisations lack the insight and/or expertise to effectively translate information into usable knowledge [1]. In light of these
conditions, there exists a clear need for automated methods and tools to assist in exploiting
the vast amount of available data. This requirement has led to the development of data mining
technology. Data mining is an umbrella term which describes the process of uncovering
patterns, associations, changes, anomalies and statistically significant structures and events in
data. Traditional data analysis is assumption driven in the sense that a hypothesis is manually
formed and validated (by statistical means) against the data. In contrast, data mining is
discovery driven in that useful patterns are automatically extracted from the data [2]. In order
to accomplish this task, data mining systems frequently utilise methods from disciplines such
as artificial intelligence, machine learning and pattern recognition [3].
The data mining algorithms usually operate on data sets composed of vectors (instances) of
independent variables (features, attributes). For example, a database may describe a group of
people in terms of their age, sex, income and occupation. In this case, age is an example of an
attribute and each instance corresponds to a distinct individual.
To discover the hidden patterns in data, it is essential to build a model consisting of
independent variables that can be used to determine a dependent variable (also known as class
or decision). Building such a model therefore consists of identifying the relevant independent
variables (attributes) and minimising the predictive error [4]. It is also highly desirable to find
the simplest possible model that fits the data, since these are typically the most meaningful
and easiest to interpret. This last requirement reflects the principle of Occam's Razor which
tells us to prefer the simplest model that fits the data [5].
Before we proceed further, it is important to distinguish data mining from pattern recognition
as these terms are sometimes confused with each other. Pattern recognition is primarily
concerned with the construction of accurate classifiers. A classifier is fundamentally a
mapping between a set of input variables x1, ..., xn to an output variable y whose value
represents the class label ω1, ..., ωm [5]. In general, representing the knowledge embodied
within the classifier structure is not a priority. Consequently, while there is no shortage of
extremely accurate classifiers, some of the best are akin to a black box; that is, they give little
or no insight into why they make decisions. Neural networks [5, 6] exemplify these types of systems because their classification rules are embedded in their structure. Since neural network components (node activation functions, connection weights, etc.) encode complex mathematical functions, articulating the rules they represent is a difficult problem [7]. In
contrast, the primary purpose of data mining is not simply classification, but to provide
meaningful knowledge to the user regarding the classification process. Thus, the models
produced by data mining algorithms should be in a form that lends itself to analysis by the
user. Decision rule sets which linearly partition the data space into class homogeneous regions
meet this requirement. Examples of techniques that accomplish decision rule induction from
data include decision trees [8, 9, 10, 11] and evolutionary algorithm based classifier systems
[12, 13, 14, 15].
2.1. Evolutionary induction of decision trees

Almost since their introduction decision trees (DTs) have been exhaustively used as a
classification method, showing a great potential in several domains. Their efficiency and
classification accuracy have surprised many experts, but their most important advantage is the
transparency of the classification process that one can easily interpret, understand and
criticize. However, the classical approach to DT induction, which has not changed much since its introduction, has several disadvantages, such as: 1) poor processing of incomplete, noisy data, 2) inability to build several trees for the same dataset, 3) inability to use preferred attributes, etc.
For all those reasons a need for an approach that would preserve the positive aspects of DTs
and avoid their disadvantages emerged. Encouraged by the success of evolutionary algorithms
for optimization tasks, a series of attempts occurred to induce DT-like models with
evolutionary methods [16, 17, 18, 19]. Despite their high classification accuracy, GPs have
proven extremely difficult to interpret – this is a major obstacle to their use in data mining
problems. Nevertheless, many researchers have tried to use the power of GAs/GPs for the
problem of data mining and knowledge discovery [20, 21].
As we learn from history, the ideal choice for an effective, accurate and efficient data mining
algorithm would be to combine the power of GA/GP with the interpretability of DTs (or rule
sets) in a successful way. In this paper we present one such approach that shows a great
potential both in data mining (classification) and knowledge discovery.
2.2. Data and knowledge representation

Knowledge discovered with data mining algorithms should ideally give an in-depth
explanation of a problem domain along with a good classification [22]. Experts are
performing exhaustive analyses of the results, which are the output of knowledge discovery
tools, in order to extract the useful knowledge. To make this part of their work as easy as possible, the best way is to present the results in the form of a set of classification rules, which are clear and straightforward to understand, accept or reject. An ideal system would therefore include:
- accuracy – classification with minimal error rate,
- compactness – use of a minimum number of rules, and
- simplicity – single rules are not complex.
The data used as the source for a knowledge discovery system are represented with a set of training objects o1, …, oN. Each training object oi is described with the values of attributes ai1, …, aik and the accompanying decision ωi from the set of m possible decisions {Ω1, …, Ωm}. Attributes can be either numeric (the value is a number from a continuous interval) or discrete (the value is one from the discrete set of all possible values). Algorithms can usually also work with missing values, in which case it is not necessary for all values to be known.
3. THE AREX ALGORITHM

Knowledge that is discovered with the help of our algorithm is represented with a set of classification rules. Each single rule in a set is of the following form:

if <condition> then <decision>

where <condition> := <c1> and … and <cd>, and <decision> := ω, ω ∈ {Ω1, …, Ωm}
In this manner the rules are clear and simple enough for further analysis, while at the same time their functional power is strong enough for successful classification. A set of classification rules can lose understandability if the number of rules in the set is too high. On the other hand, the classification accuracy would decrease if the number of classification rules in the set is too low. Our algorithm searches for a balanced solution between these two extremes.
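The rule form above maps directly onto a simple data structure. The following is a minimal sketch, assuming objects are represented as attribute dictionaries; the names Rule, Condition and matches are illustrative and not part of AREX itself.

```python
from dataclasses import dataclass

@dataclass
class Condition:
    attribute: str
    op: str        # "<", ">=" for numeric tests, "in" for discrete value sets
    value: object

    def holds(self, obj: dict) -> bool:
        v = obj.get(self.attribute)
        if v is None:                 # missing value: the condition cannot fire
            return False
        if self.op == "<":
            return v < self.value
        if self.op == ">=":
            return v >= self.value
        return v in self.value        # discrete attribute: set membership

@dataclass
class Rule:
    conditions: list                  # <c1> and ... and <cd>
    decision: str                     # ω from {Ω1, ..., Ωm}

    def matches(self, obj: dict) -> bool:
        return all(c.holds(obj) for c in self.conditions)

rule = Rule([Condition("blood_pressure", ">=", 140),
             Condition("chest_pain", "in", {"yes"})],
            "congenital heart disease")
print(rule.matches({"blood_pressure": 150, "chest_pain": "yes"}))  # True
```

An object with a missing attribute simply fails the corresponding condition, which mirrors the conjunctive semantics of the rule form.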
The complete rule extraction algorithm has been called AREX (Automatic Rules EXtractor). AREX takes a training set of objects as input and induces classification rules based on those objects. AREX is a hybrid system of two basic algorithms: 1) an evolutionary algorithm for the construction of decision trees [23, 19] that is used to build the initial set of classification rules, and 2) the proGenesys system, which allows automatic evolution of programs in an arbitrary programming language and is used for the construction and refinement of classification rules. The basic outline of the AREX algorithm is illustrated in Figure 1 and can be described in the following steps:
Figure 1. The diagram of AREX algorithm.
0. input: a set of training objects S, decision forest size nt, and classification tolerance ct = nt/2
1. build N evolutionary decision trees upon the objects from S
2. for all objects oi from S:
   a) classify object oi with nt randomly chosen trees from all N trees
   b) if the frequency of the most frequent decision class among the nt trees > nt − ct: copy oi from S to a set S* and remove oi from S
3. from all N decision trees create M initial classification rules
4. with the proGenesys system create another ⌊M/2+1⌋ classification rules randomly
5. using the proGenesys system evolve the final set of rules for classifying the objects from S*
6. find the optimal final set of rules using simple optimization
7. if S is not empty:
   a) add |S| randomly chosen objects from S* to S;
   b) increase the classification tolerance ct = ct + 1;
   c) repeat the whole procedure from step 1
8. finish if S is empty
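The committee-vote filtering of step 2 can be sketched as follows. This is an illustrative reading of the outline above, not AREX code: the evolved decision trees are stubbed as plain functions, and an object counts as "simple" when the most frequent class among nt sampled trees exceeds nt − ct votes.

```python
import random
from collections import Counter

def split_simple_complex(objects, trees, nt, ct, rng=random):
    """Split objects into a confidently classified set S* and a remainder S."""
    simple, complex_ = [], []
    for obj in objects:
        committee = rng.sample(trees, nt)          # nt randomly chosen trees
        votes = Counter(tree(obj) for tree in committee)
        top_class, top_count = votes.most_common(1)[0]
        # step 2b: strong enough agreement moves the object to S*
        (simple if top_count > nt - ct else complex_).append(obj)
    return simple, complex_

# toy example: three unanimous trees and one dissenting tree
trees = [lambda o: "murmur"] * 3 + [lambda o: "arrhythmia"]
s_star, s = split_simple_complex([{"id": 1}], trees, nt=4, ct=2)
print(len(s_star), len(s))  # prints "1 0": 3 votes > 4 - 2, so the object is "simple"
```

Raising ct in later iterations (step 7b) loosens this threshold, so objects that previously stayed in S eventually migrate to S*.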
3.1. Genetic algorithm for the construction of decision trees

The first step of the genetic algorithm is the creation of the initial population. A random decision tree is constructed based on the following algorithm, where the input is a randomly chosen number of attribute nodes that will compose the tree:
0. input: number of attribute nodes M that will be in the tree
1. select an attribute Xi from the set of all possible attributes and set it as the root node t
2. in accordance with the selected attribute's Xi type (discrete, continuous) define a test for this node t: 1) for continuous attributes in the form ƒt(Xi) < φi, where ƒt(Xi) is the attribute value for a data object and φi is a split constant; 2) for discrete attributes two disjunctive sets of all possible attribute values are randomly defined
3. connect empty leaves to both new branches from node t
4. randomly select an empty leaf node t (the probability of selecting an empty leaf decreases with the depth of the leaf in the growing tree)
5. randomly select an attribute Xi from the set of all possible attributes (the probability of choosing an attribute depends on the number of previous uses of that attribute in the tree – in this manner unused attributes have better chances to be selected)
6. replace the selected leaf node t with the attribute Xi and go to step 2
7. finish when M attribute nodes have been created
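A compact sketch of this randomized construction is given below. It simplifies two details of the outline: the depth-dependent leaf probability becomes a shallowest-slot preference, and node tests are stubbed as unit-interval splits. The inverse-usage weighting of attribute choice (step 5) is kept.

```python
import random

def random_tree(attributes, m, rng=random):
    """Grow a binary test tree with m internal (attribute) nodes."""
    counts = {a: 0 for a in attributes}          # usage counts per attribute

    def pick_attribute():
        # unused attributes get better chances (inverse-usage weighting, step 5)
        weights = [1.0 / (1 + counts[a]) for a in attributes]
        a = rng.choices(attributes, weights=weights)[0]
        counts[a] += 1
        return a

    def new_node():
        return {"attr": pick_attribute(), "split": rng.random(),
                "left": None, "right": None}     # two empty leaves (step 3)

    root = new_node()
    for _ in range(m - 1):
        # step 4, simplified: prefer the shallowest empty leaf slot
        slots = empty_slots(root, 0)
        node, side, _ = min(slots, key=lambda s: (s[2], rng.random()))
        node[side] = new_node()                  # step 6: replace leaf with attribute
    return root

def empty_slots(node, depth):
    slots = []
    for side in ("left", "right"):
        child = node[side]
        if child is None:
            slots.append((node, side, depth))
        else:
            slots.extend(empty_slots(child, depth + 1))
    return slots

tree = random_tree(["age", "pulse", "bp"], m=4)
print(len(empty_slots(tree, 0)))  # 5: a binary tree with 4 attribute nodes has 5 leaves
```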
For each empty leaf the following algorithm determines the appropriate decision class: let S be the training set of all N training objects with K possible decision classes d1, .., dK, and let Ni be the number of objects within S of class di. Let St be the sample set at node t (an empty leaf for which we are trying to select a decision class) with Nt objects; Nit is the number of objects within St of decision class di. Now we can define a function that measures the potential percentage of correctly classified objects of class di:

F(t, i) = Nit / Ni   (1)

The decision dt for the leaf node t is then set to the decision di for which F(t, i) is maximal.
The ranking of an individual DT within a population is based on the local fitness function:

LFF = Σi=1..K wi · (1 − acci) + Σi=1..N c(ti) + wu · nu   (2)
where K is the number of decision classes, N is the number of attribute nodes in a tree, acci is the accuracy of classification of objects of a specific decision class di, wi is the importance weight for classifying the objects of decision class di, c(ti) is the cost of using the attribute in node ti, nu is the number of unused decision (leaf) nodes, i.e. nodes into which no object from the training set falls, and wu is the weight of the presence of unused decision nodes in a tree.
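The LFF of Eq. (2) is a weighted sum of penalties, so lower values are better. A direct sketch, with illustrative weights and an assumed attribute-cost table:

```python
def lff(per_class_accuracy, class_weights, node_costs, n_unused_leaves, w_unused):
    """Local fitness of a decision tree, Eq. (2); lower is better."""
    misclassification = sum(w * (1.0 - acc)                 # Σ wi·(1 − acci)
                            for w, acc in zip(class_weights, per_class_accuracy))
    attribute_cost = sum(node_costs)                        # Σ c(ti)
    return misclassification + attribute_cost + w_unused * n_unused_leaves

score = lff(per_class_accuracy=[0.9, 0.8],   # acc per decision class
            class_weights=[1.0, 2.0],        # second class weighted as more important
            node_costs=[0.01, 0.02, 0.01],   # c(ti) for the three attribute nodes
            n_unused_leaves=1,
            w_unused=0.1)
print(round(score, 3))  # 0.1 + 0.4 + 0.04 + 0.1 = 0.64
```

The class weights wi let a physician make errors on a critical diagnosis cost more than errors on a benign one.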
3.2. System proGenesys for automatic evolution of programs

For constructing classification rules we used a system developed for the evolution of
programs in an arbitrary programming language, described with BNF productions –
proGenesys (program generation based on genetic systems) [24]. In our approach an
individual is represented with a syntax tree (a derivation tree), as it is usual for grammar-
based GP [25,26]. To get the final solution this tree (genotype) is transformed into a program
(phenotype) [27].
The first step of automatic programming is the initialization of the population. For the
successful continuation of the program evolution process the initial programs should be
evenly distributed [28]. Known program initialization procedures are theoretically correct but work well only for small problems [29]. The problem is that limitations regarding the
tree size are not considered during the induction. For this reason a lot of built trees have to be
rejected, which is time consuming. An initialization procedure based on dynamic grammar
pruning as proposed in [30] could solve the problem. In the proGenesys system a procedure is
used that allows the induction of a program tree of exactly specified size.
A good classification rule should simultaneously be clear (most of the objects covered by the
rule should fall into the same decision class) and general (it covers many objects – otherwise
it tends to be too specific). Those two criteria can be measured with the following formulas:
generality = (num. of classified objects − 1) / num. of objects of this decision class   (3)

clearness = 1 − (ω2 / Ω2) / (ω1 / Ω1)   (4)

where ω1 is the number of objects covered by the rule that belong to the most frequent decision class, ω2 is the number of objects covered by the rule that belong to the second most frequent decision class, Ω1 is the number of all objects in the training set that belong to the most frequent decision class of the rule, and Ω2 is the number of all objects in the training set that belong to the second most frequent class of the rule. Now a fitness function can be defined as
FF = clearness × generality + Σi=1..N c(ti)   (5)
where the last part represents a cost of the use of specific attributes, the same as in the local
fitness function LFF in building decision trees.
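A sketch of these rule-quality measures follows, assuming the reading that clearness compares the rule's relative coverage of the second most frequent class against its relative coverage of the most frequent class (the formula layout in the original is garbled, so this reconstruction should be treated as an assumption). Variable names follow the text.

```python
def generality(num_classified, num_in_class):
    """Eq. (3): how much of its decision class the rule covers."""
    return (num_classified - 1) / num_in_class

def clearness(omega1, omega2, Omega1, Omega2):
    """Eq. (4): 1 when the rule covers only its main class, lower otherwise."""
    return 1.0 - (omega2 / Omega2) / (omega1 / Omega1)

def rule_fitness(omega1, omega2, Omega1, Omega2, attribute_costs):
    """Eq. (5): combined rule fitness with an illustrative cost term."""
    g = generality(omega1 + omega2, Omega1)
    c = clearness(omega1, omega2, Omega1, Omega2)
    return c * g + sum(attribute_costs)

# a rule covering 20 of 40 objects of its main class and 2 of 30 of the runner-up
print(round(clearness(20, 2, 40, 30), 3))  # 1 - (2/30)/(20/40) ≈ 0.867
```

A rule that covers no second-class objects (ω2 = 0) gets clearness 1; a rule whose coverage is dominated by a second class is penalized toward 0 or below.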
3.3. Finding the optimal set of rules

The proGenesys system is used to evolve single rules, whereas for the classification of all objects a set of rules is required. For this purpose, among all the evolved rules a set of rules should be found that together classify all the objects – with high classification accuracy and a small number of rules. The problem is solved with a simple genetic algorithm that optimizes the following fitness function:
FF = Σi=1..K wi · (1 − acci) + Σi=1..N c(ti) + wm · nm + wu · nu   (6)
where the first two parts are the same as in the LFF for building decision trees, nm is the number of multiply classified objects, nu is the number of non-classified objects, and wm and wu are the corresponding weights. The appropriate coverage of the training set is thus achieved by reducing the number of non-classified and multiply classified objects. A more advanced way to achieve uniform coverage is discussed in [31].
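The coverage penalties of Eq. (6) can be sketched directly: count how many objects are covered by no rule (nu) or by more than one rule (nm) and add the weighted penalties to the LFF-style terms. The rules are stubbed here as predicates returning a class label or None; names are illustrative.

```python
def rule_set_fitness(rules, objects, class_weights, per_class_accuracy,
                     attribute_costs, w_m, w_u):
    """Rule-set fitness of Eq. (6); a simple GA would minimize this score."""
    nm = nu = 0
    for obj in objects:
        hits = sum(1 for rule in rules if rule(obj) is not None)
        if hits == 0:
            nu += 1                 # object not covered by any rule
        elif hits > 1:
            nm += 1                 # object covered by several rules
    penalty = sum(w * (1.0 - acc)
                  for w, acc in zip(class_weights, per_class_accuracy))
    return penalty + sum(attribute_costs) + w_m * nm + w_u * nu

# two toy rules partitioning the objects by age
r1 = lambda o: "murmur" if o["age"] < 10 else None
r2 = lambda o: "chest pain" if o["age"] >= 10 else None
score = rule_set_fitness([r1, r2], [{"age": 5}, {"age": 12}],
                         class_weights=[1.0], per_class_accuracy=[1.0],
                         attribute_costs=[], w_m=0.5, w_u=0.5)
print(score)  # 0.0: every object covered exactly once, perfect accuracy
```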
4. A CASE STUDY

The role of classification rules in medical decision making is very important, since they provide a very important feature: the possibility of explaining decisions in a way understandable by humans. Surprisingly, this feature has somehow shifted the use of classification rules away from their primary purpose, classification and decision making, towards more explanatory and statistical uses. In our long experience with introducing classification models, and intelligent systems more generally, into real-world medical applications, we noticed that they were rarely successful in supporting diagnosing/classifying, but when they were used, the use was very different. They were read in the opposite direction, from decisions
towards conditions, either to see the influence/relation of the attributes on the diagnosis, or the influence/relation of the diagnosis and attributes on a specific attribute in the rule. Auxiliary information, like the number of objects covered by a rule, suddenly became very important: the physicians were interested only in rules covering many objects. Consequently, information that is very important to us, like accuracy, sensitivity and specificity, was normally ignored, and the division into training and testing sets was disliked because it reduced the number of objects. So we changed our attitude and stopped introducing rule sets for classification only; instead we additionally presented to the medical staff their use in knowledge discovery, hypothesis generation, hypothesis testing, etc. Suddenly the approach became much more successful.
Interestingly, another weakness was noticed. In analyzing the rules in the above manner, at
first medical experts expected to find some revolutionary new knowledge, which normally did
not happen. But after the first disappointment they were surprised at how well the classification rules represented their own decision-making concepts. Furthermore, they wondered how to
generate rule sets that would teach them something new and unexpected. After some research
we found that the problem is in the classical information content method of decision tree
induction. So we decided to test our AREX method as a new iterative rule generation
approach for new knowledge discovery in data – and the aim of this case study is to present
its application in cardiology.
4.1. The knowledge discovery loop

Naturally, as a knowledge discovery system we used AREX, a rule extraction method that is
able to induce a set of classification rules based on the given dataset. For the purpose of
searching for new knowledge in medical datasets, the developed algorithm had to be modified
in some aspects.
In the first run the AREX algorithm generates a set of rules, which are then assessed by
physicians according to medical relevance and originality (see figure 2). The rules, which are
at the same time relevant and represent some new knowledge, are stored in the database of
good rules. In the next runs AREX gives a bonus preference to rules which use the same
attributes and parts of the rules stored in the database. After several runs the quality of the
rules produced by AREX is significantly improved regarding the preference criteria given by
physicians and the process eventually leads to some new knowledge. The outline of the
approach is presented in figure 2.
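The loop described above can be sketched as follows. This is an illustrative skeleton, not the actual implementation: induce_rules stands in for an AREX run and assess for the physician's judgment, and the "bonus preference" is reduced to collecting the attributes of stored good rules.

```python
def discovery_loop(induce_rules, assess, runs):
    """Iterate rule induction with expert assessment (Figure 2)."""
    good_rules = []                          # the base of good rules
    preferred_attributes = set()
    for _ in range(runs):
        rules = induce_rules(bonus_for=preferred_attributes)
        for rule in rules:
            # keep rules the expert marks as both relevant and novel
            if assess(rule) == "good" and rule not in good_rules:
                good_rules.append(rule)
        # bias the next run towards attributes of the stored good rules
        preferred_attributes = {attr for rule in good_rules for attr in rule}
    return good_rules

# toy stand-ins: rules are attribute tuples; the expert likes rules using "ecg"
induce = lambda bonus_for: [("ecg", "age"), ("pulse",)]
expert = lambda rule: "good" if "ecg" in rule else "no preference"
print(discovery_loop(induce, expert, runs=3))  # [('ecg', 'age')]
```

The loop terminates after a fixed number of runs here; in the case study below it is repeated until the base of good rules stabilizes.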
Figure 2. The outline of our approach to the discovery of new knowledge.
4.2. The cardiovascular problems of pediatric patients

The early and accurate identification of cardiovascular problems in pediatric patients is of vital importance. In order to help clinicians in diagnosing the type of cardiovascular problem in young patients, computerized data mining and decision support tools are used; these are able to process the huge amount of data available from previously solved cases and suggest the probable diagnosis based on the values of several important attributes. Clearly, black-box classification methods (neural networks for example) are not
appropriate for this kind of task, because the clinical experts need to evaluate and validate the
decision making process, induced by those tools, before there is enough trust to use the tools
in practice.
On the other hand, the evaluation of the induced classification rules produced by the
computerized tools by a clinical expert can be an important source of new knowledge on how
to make a diagnosis based on the available attributes. In order to achieve this goal, the
classification process should be easily understandable and straightforward. Different kinds of
knowledge discovery methods are therefore appropriate to do the job, and we decided to use
the developed knowledge discovery method presented above.
4.3. A dataset

Two cardiovascular datasets have been composed to be used for the knowledge discovery
process. Each of them contains data of 100 patients from Maribor Hospital. A protocol has
been defined to collect the important data. The attributes include general data (age, sex, etc.), health status data (from family history and the child’s previous illnesses), general cardiovascular data (blood pressure, pulse, chest pain, etc.) and more specialized cardiovascular data – data from the child’s cardiac history and clinical examinations (with findings of ultrasound, ECG, etc.).
In the first dataset three different diagnoses are possible: innocent heart murmur, congenital heart disease, and palpitations with chest pain. In the second dataset five different diagnoses are possible: innocent heart murmur, congenital heart disease with left to right shunt, aortic valve disease with aorta coarctation, arrhythmias, and chest pain. Because the first dataset is actually a simplified version of the second one, we decided to use only the second one, with five possible decision classes, in our study.
4.4. The results

The most common measure of efficiency when assessing a classification method is accuracy, defined as the percentage of correctly classified objects from all objects (correctly classified and not correctly classified). Accuracy can thus be calculated as:

ACC = T / (T + F)   (7)

where T stands for the number of “true” cases (i.e. correctly classified objects) and F stands for the number of “false” cases (i.e. incorrectly classified objects).
The above measure is used to determine the overall accuracy of the classification. In many
cases, especially in medicine, accuracy of each specific decision class is even more important
than the overall accuracy. When there are two decision classes possible (i.e. positive and
negative patients), the common measures in medicine are sensitivity and specificity:

Sens = TP / (TP + FN)   (8)

Spec = TN / (TN + FP)   (9)

where TP stands for “true positives” (positive cases correctly classified as positive), TN stands for “true negatives” (negative cases correctly classified as negative), FP stands for “false positives” (negative cases incorrectly classified as positive) and FN stands for “false negatives” (positive cases incorrectly classified as negative).
In our cardiovascular dataset, as in many other real-world medical datasets, more than two decision classes are possible (actually five). In this case the direct calculation of
sensitivity and specificity is not possible. Therefore, the separate accuracy of the i-th decision class is calculated as:

ACCk,i = Ti / (Ti + Fi)   (10)

and the average accuracy over all decision classes is calculated as:

ACCk = (1/v) · Σi=1..v Ti / (Ti + Fi)   (11)

where v represents the number of decision classes.
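Eqs. (7)–(11) translate into a few lines of code. The following sketch computes overall accuracy, the two-class sensitivity/specificity, and the per-class plus class-averaged accuracy for v classes; function names are illustrative.

```python
def overall_accuracy(true_count, false_count):
    return true_count / (true_count + false_count)              # Eq. (7)

def sensitivity(tp, fn):
    return tp / (tp + fn)                                       # Eq. (8)

def specificity(tn, fp):
    return tn / (tn + fp)                                       # Eq. (9)

def class_accuracies(per_class_true, per_class_false):
    """Per-class accuracies, Eq. (10), and their average, Eq. (11)."""
    accs = [t / (t + f)
            for t, f in zip(per_class_true, per_class_false)]
    return accs, sum(accs) / len(accs)

# three classes with 10 objects each; note how the weak middle class
# drags the class-averaged accuracy below the overall accuracy (20/30)
accs, avg = class_accuracies([9, 3, 8], [1, 7, 2])
print(accs, round(avg, 3))
```

The class average weights each class equally regardless of its frequency, which is why it is the more informative measure for rare but clinically important classes.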
4.4.1. Classification results
To evaluate the classification results of our method AREX the above mentioned measures of
efficiency were calculated based on 10 independent evolutionary runs. For the comparison
with other classification algorithms we calculated the classification results obtained with
algorithms ID3, C4.5, See5/C5 [9, 32], Naïve-Bayes (N-B), Naïve-Bayes tree (NB tree), and
instance-based classifier (IB) [33]. Based on 10-fold cross validation for each algorithm we
calculated overall accuracy on a training set (Table 1 and Figure 3) and a testing set (Table 2
and Figure 4), and average accuracy over decision classes on testing set (Table 2 and Figure
5).
Table 1. Average overall accuracy of classification on a training set.

Algorithm       overall accuracy [%]
                average    stddev
IB              100.00     0.00
ID3             100.00     0.00
C4.5             87.33     0.67
C5/See5          90.67     4.67
Naïve-Bayes      80.00     2.67
NB tree          82.00     0.67
AREX             95.33     3.33
Figure 3. Average overall accuracy of classification on a training set for different classification algorithms.
Table 2. Average overall accuracy and average accuracy of classification over decision classes on a testing set for different classification algorithms.

Algorithm       overall accuracy [%]     accuracy on classes [%]
                average    stddev        average    stddev
IB               77.08      2.08          74.72     31.93
ID3              70.83     12.50          62.67     30.77
C4.5             75.00      8.33          64.56     33.20
C5/See5          81.25      6.25          69.72     34.03
Naïve-Bayes      87.50      4.17          83.89     25.87
NB tree          83.33      0.00          80.56     24.91
AREX             87.50      4.17          86.39     14.93
Figure 4. Average overall accuracy of classification on a testing set for different classification algorithms.
When considering the results on the training set (Table 1 and Figure 3), the best overall accuracy was obviously obtained with the IB and ID3 algorithms; that is most probably the consequence of extreme overfitting. Of all the other algorithms, our method AREX achieved the highest accuracy (over 95%), followed by See5/C5 (90%) and C4.5 (87%); the lowest accuracy was obtained by both Naïve-Bayes algorithms (82% and 80%). The standard deviation of the results for AREX was between those for Naïve-Bayes and See5/C5.
On the testing set two types of accuracy were measured: the overall accuracy and the average accuracy of classification over decision classes. When considering the overall accuracy on the testing set (Table 2 and Figure 4), the best results were achieved by Naïve-Bayes and AREX (87.5%); all the other algorithms performed worse (from 70% up to 83%). The standard deviation of the results for AREX was not much higher than on the training set and one of the lowest of all the methods. What is especially encouraging is the fact that only AREX (and to some extent also Naïve-Bayes and See5/C5) achieved similarly high overall accuracy on both the training and the testing set. This fact confirms the good learning and generalization capability of our method AREX (at least on the used dataset).
When considering the average accuracy of classification over decision classes (Table 2 and Figure 5), the best result was achieved by AREX, followed by both Naïve-Bayes methods. At least as important as the good average result is the (by far) lowest standard deviation of results among all the algorithms. This result speaks in favor of AREX’s ability to classify different decision classes equally well, whether they represent the majority or the minority of the cases. In most classification algorithms the less frequent decision classes are often neglected or even ignored, because they do not contribute much to the overall accuracy. But of course those less frequent classes are very important, especially in medicine, since they usually represent specific patients, who have to be treated with even more care.
Furthermore, the correct classification of those rare cases has the highest potential to reveal
some new patterns to medical experts.
Figure 5. Average accuracy of classification over decision classes on a testing set for different classification algorithms.
For domain experts to understand induced rule sets clearly, it is important that the
number of classification rules is not very high. When using evolutionary methods it is also
important to guarantee the convergence of evolutionary runs with stable increase of fitness
scores. Figure 6 shows average convergence rates of 10 independent evolutionary runs for the
first 100 generations; a moderate and stable increase can be seen. Figure 7 shows the number
of induced classification rules for the average and best solution during the evolution of the first
100 generations; after some initial oscillation the number of rules remains stable at around 10.
[Figure: average and best fitness for the CECCH dataset, plotted over generations]
Figure 6. Change of fitness (best fitness and average for the whole population) through the first 100
generations.
[Figure: number of classification rules for the average and best solution for the CECCH dataset, plotted over generations]
Figure 7. Number of induced classification rules (for the best solution and the average of the whole
population) through the first 100 generations.
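The shape of these curves follows from the elitist evolutionary scheme. The following is a minimal, self-contained sketch, with a toy bit-string problem standing in for AREX's rule sets and all parameters chosen for illustration, that logs the best and average fitness per generation, the kind of trace plotted in Figure 6:

```python
import random

random.seed(0)

def fitness(individual):
    # Toy fitness: fraction of 1-bits. AREX instead scores the
    # accuracy and size of a candidate rule set.
    return sum(individual) / len(individual)

def evolve(pop_size=20, genes=16, generations=100):
    pop = [[random.randint(0, 1) for _ in range(genes)] for _ in range(pop_size)]
    history = []  # (generation, best fitness, average fitness)
    for g in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        history.append((g, fitness(scored[0]),
                        sum(map(fitness, pop)) / pop_size))
        nxt = [scored[0][:]]  # elitism: the best individual always survives
        while len(nxt) < pop_size:
            a, b = random.sample(scored[:10], 2)  # select among the fittest
            cut = random.randrange(1, genes)
            child = a[:cut] + b[cut:]             # one-point crossover
            if random.random() < 0.1:             # occasional mutation
                i = random.randrange(genes)
                child[i] ^= 1
            nxt.append(child)
        pop = nxt
    return history

history = evolve()
print(history[0][1], history[-1][1])  # best fitness: first vs. last generation
```

Because the elite individual is copied unchanged into every next generation, the best-fitness curve is non-decreasing, which is the stable increase the text asks evolutionary runs to guarantee.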
4.4.2. Knowledge discovered
The knowledge discovery process started from a rule set obtained with AREX. A medical
expert assessed all classification rules as “good”, “not good” or “no preference” (all rules are
initially marked as “no preference”) every ten generations. After one evolutionary run was
finished, the process started again from the induction of an initial rule set, but keeping all the
marks about specific rules (good and not good). The whole process was repeated 7 times, at
which point the base of good rules remained stable. Some interesting patterns have been
discovered that should provide information on how the diagnosis of cardiovascular
problems can be made. The most important result is the rule presented in
Figure 8 together with the medical expert's comment; it has been assessed by the
expert as "possible rule – potentially containing new knowledge". The discovery of this rule
speaks in favor of our hypothesis that new patterns can be found in the cardiovascular data
with the use of AREX.
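The iterative assessment cycle described above can be sketched structurally. The AREX internals are replaced here by hypothetical stand-ins (`induce_rules`, `evolve_ten_generations`, `expert_assess`); only the loop, which preserves expert marks across restarted evolutionary runs, follows the paper:

```python
import random

random.seed(1)

def induce_rules():
    # Hypothetical stand-in for AREX's initial rule induction.
    return [f"rule_{random.randrange(30)}" for _ in range(10)]

def evolve_ten_generations(rules, marks):
    # Stand-in for ten generations of evolution: rules marked
    # "not good" tend to be replaced by new candidates.
    kept = [r for r in rules if marks.get(r) != "not good"]
    while len(kept) < len(rules):
        kept.append(f"rule_{random.randrange(30)}")
    return kept

def expert_assess(rule, current):
    # Marks persist between runs; only unmarked rules are judged anew.
    if current != "no preference":
        return current
    return random.choice(["good", "not good", "no preference"])

def knowledge_discovery(runs=7, generations=100):
    marks = {}  # rule -> "good" | "not good" | "no preference"
    for _ in range(runs):
        rules = induce_rules()  # each run restarts from a fresh rule set
        for g in range(0, generations, 10):
            rules = evolve_ten_generations(rules, marks)
            for r in rules:  # expert assessment every ten generations
                marks[r] = expert_assess(r, marks.get(r, "no preference"))
    return sorted(r for r, m in marks.items() if m == "good")

print(knowledge_discovery())
```

The stable base of "good" rules emerges because marks, once given, are never discarded between runs; a rule is only ever assessed while it still carries "no preference".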
Rule: IF Stenocardy = no AND RR_Diastolic <= 55 AND ECHO = Pathological THEN class 1
Assessment: possible rule – potentially new knowledge.
Expert's comment: How do the hemodynamic changes in patients with congenital heart disease with left-to-right shunt influence the possible changes of systemic arterial blood pressure?
Figure 8. The most important rule found by AREX. It has been assessed by a medical expert as
potentially containing new medical knowledge.
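Reading the three conditions of the rule in Figure 8 as a conjunction, the rule can be rendered as an executable predicate. The attribute names and the patient-record format below are illustrative assumptions, not the dataset's actual schema:

```python
def rule_class1(patient):
    # Fires when all three conditions of the Figure 8 rule hold.
    return (patient["Stenocardy"] == "no"
            and patient["RR_Diastolic"] <= 55
            and patient["ECHO"] == "Pathological")

# A hypothetical patient record matching the rule.
patient = {"Stenocardy": "no", "RR_Diastolic": 52, "ECHO": "Pathological"}
print(rule_class1(patient))  # True -> the patient is assigned to class 1
```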
5. CONCLUSION
There are some specifics in the classification of medical data. The overall accuracy of
classification is only one of several important aspects that have to be fulfilled in order to
attract the attention of medical experts for further research on the given problem. Regarding
the classification results of our method AREX on a cardiovascular database, we can say
that it provides a promising basis for the analysis of other real-world medical data. One of the most
evident advantages of AREX is its simultaneous very good generalization capability (high and
similar overall accuracy on both the training and the testing set) and specialization capability
(high and very similar accuracy across all decision classes, including the least frequent ones).
The knowledge discovery results on cardiovascular data, obtained with the presented
approach with the help of physicians' assessment, also turned out to be
very good. The physician's evaluation of the final solution – a set of rules obtained through an
iterative rule induction process – shows that our method AREX produces mostly correct and
reliable rules. Especially important is the fact that among the rules in the induced
rule set there is one showing the potential to be new knowledge. We may say that the
obtained results satisfy our intentions and, more importantly, equip physicians with a
powerful technique to 1) confirm their existing knowledge about some medical problem, and
2) search for new facts, which may reveal new interesting patterns and
possibly improve the existing medical knowledge.
One of the patterns we were able to discover with the combination of AREX and MtDeciT, a
tool for the induction of decision trees developed in our laboratory [34], showed the possible
relation between the operation under a general anesthesia and arrhythmias even months after
the operation. At the 37th Annual Scientific Meeting of the AEPC (Association for European
Paediatric Cardiology), professor Marie-Christine Seghaye in her state-of-the-art lecture [35]
presented almost the same conclusion, as the result of a very comprehensive and expensive
European Union funded research project.
REFERENCES
[1] S.R. Hedberg, The Data Gold Rush, Byte, (October 1995).
[2] R.L. Grossman, Data Mining: Challenges and Opportunities for Data Mining During the Next Decade, Technical Report, (Laboratory for Advanced Computing, University of Illinois at Chicago, 1997).
[3] S.M. Weiss and N. Indurkhya, Predictive Data Mining: A Practical Guide, (Morgan-Kaufmann, 1998).
[4] R.D. Small, H.A. Edelstein, Building, Using and Managing the Data Warehouse, chapter Scalable Data Mining, pp. 151-172 (Prentice Hall, 1997).
[5] C.M. Bishop, Neural Network for Pattern Recognition (Oxford University Press, 1995).
[6] R.B. Macy, A.S. Pandya, Pattern Recognition with Neural Networks in C++, (CRC Press, 1996).
[7] H. Lu, et al., NeuroRule: A Connectionist Approach to Data Mining, in Proceedings of the 21st Very Large Data Base Conference (1995).
[8] L. Breiman, J.H. Friedman, R.A. Olsen, C.J. Stone, Classification and regression trees, (Wadsworth, USA, 1984).
[9] J.R. Quinlan, C4.5: Programs for Machine Learning, (Morgan Kaufmann, 1993).
[10] K.V.S. Murthy, On Growing Better Decision Trees from Data, PhD dissertation, (Johns Hopkins University, Baltimore, MD, 1997).
[11] V. Podgorelec, P. Kokol, B. Stiglic, I. Rozman, Decision trees: an overview and their use in medicine, Journal of Medical Systems, 26(5), pp. 445-463 (2002).
[12] D.E. Goldberg, Genetic algorithms in search, optimization and machine learning, (Addison Wesley, 1989).
[13] J.H. Holland, J.S. Reitman, Cognitive systems based on adaptive algorithms, Pattern-directed inference systems, (Academic Press, New York, 1978).
[14] S.F. Smith, Flexible learning of problem solving heuristics through adaptive search, in Proceedings of the 8th International Joint Conference on Artificial Intelligence, pp. 422-425 (1983).
[15] P.L. Lanzi, W. Stolzmann, S.W. Wilson (eds.), Learning Classifier Systems: From Foundations to Applications, Lecture Notes in Artificial Intelligence, vol. 1813, (Springer-Verlag, 2000).
[16] K.A. DeJong, et al., Hybrid learning using genetic algorithms and decision trees for pattern classification, in Proceedings of the IJCAI Conference (1995).
[17] P.D. Turney, Cost-sensitive classification: empirical evaluation of a hybrid genetic decision tree induction algorithm, Journal of Artificial Intelligence Research, 2, pp. 369-409 (1995).
[18] A. Papagelis, D. Kalles, Breeding decision trees using evolutionary techniques, in Proceedings of the ICML’01 (2001).
[19] V. Podgorelec, P. Kokol, Evolutionary induced decision trees for dangerous software modules prediction, Information Processing Letters, 82(1), pp. 31-38 (2002).
[20] V. Podgorelec, P. Kokol, Towards more optimal medical diagnosing with evolutionary algorithms, Journal of Medical Systems, 25(3), pp. 195-220 (2001).
[21] A.A. Freitas, A survey of evolutionary algorithms for data mining and knowledge discovery, in Advances in Evolutionary Computation , eds. A. Ghosh and S. Tsutsui, pp. 819-845, (Springer-Verlag, 2002).
[22] J. Han and M. Kamber, Data Mining: Concepts and Techniques, (Morgan Kaufmann Publishers, 2000).
[23] V. Podgorelec and P. Kokol, Evolutionary decision forests – decision making with multiple evolutionary constructed decision trees, in Problems in Applied Mathematics and Computational Intelligence, pp. 97-103 (WSES Press, 2001).
[24] V. Podgorelec, ProGenesys – program generation tool based on genetic systems, in Proceedings of the ICAI’99, pp. 299-302 (1999).
[25] F. Gruau, On using syntactic constraints with genetic programming, in Advances in Genetic Programming II, pp. 377-394, (MIT Press, 1996).
[26] P.A. Whigham, Inductive Bias and Genetic Programming, in IEE Conference Proceedings 414, pp. 461-466 (1995).
[27] C. Ryan, Shades – a polygenic inheritance scheme, in Proceedings of Mendel'97 Conference, pp. 140-147 (1997).
[28] A. Geyer-Schulz and W. Böhm, Exact uniform initialization for genetic programming, Foundations of Genetic Algorithms 4 (1996).
[29] H. Hörner, A C++ class library for genetic programming, M.Sc. thesis, (Vienna University of Economics, 1996).
[30] A. Rattle, M. Sebag, Genetic programming and domain knowledge: beyond the limitations of grammar-guided machine discovery, in Parallel Problem Solving from Nature VI, pp. 211-220, (Springer Verlag, 2000).
[31] C. Anglano, A. Giordana, G. Lo Bello, L. Saitta, An experimental evaluation of coevolutive concept learning, in Proceedings of the International Conference on Machine Learning ICML98, pp. 19-27 (1998).
[32] −, RuleQuest Research Data Mining Tools, http://www.rulequest.com, (2001).
[33] —, Machine Learning in C++, MLC++ library, http://www.sgi.com/tech/mlc, (2001).
[34] M. Zorman, S. Hleb, M. Sprogar, Advanced tool for building decision trees MtDeciT 2.0, in Proceedings of the ICAI'99, pp. 315-318 (1999).
[35] M.-C. Seghaye, Impact of the inflammatory reaction on organ dysfunction after cardiac surgery, in Cardiology for the young, (Association for European Paediatric Cardiology, XXXVII Annual Meeting, 2002).