
International Journal of Medical Informatics 63 (2001) 109–121

Finding the right decision tree’s induction strategy for a hard real world problem

Milan Zorman a,*, Vili Podgorelec a,*, Peter Kokol a, Margaret Peterson b, Matej Šprogar a, Milan Ojsteršek a

a Laboratory for System Design, Faculty of Electrical Engineering and Computer Science, University of Maribor, Smetanova 17, SI-2000 Maribor, Slovenia

b Hospital for Special Surgery, 535 70th East St., New York, NY 10021, USA

Abstract

Decision trees have already been successfully used in medicine, but as in traditional statistics, some hard real world problems cannot be solved successfully using the traditional way of induction. In our experiments we tested various methods for building univariate decision trees in order to find the best induction strategy. On a hard real world problem of the orthopaedic fracture data with 2637 cases, described by 23 attributes and a decision with three possible values, we built decision trees with four classical approaches, one hybrid approach where we combined neural networks and decision trees, and an evolutionary approach. The results show that all approaches had problems with either accuracy, sensitivity, or decision tree size. The comparison shows that the best compromise for building decision trees on hard real world problems is the evolutionary approach. © 2001 Elsevier Science Ireland Ltd. All rights reserved.

Keywords: Orthopaedic fractures; Neural networks; Decision trees


1. Introduction

As in many other areas, predictions play an important role in health care. What will be the status of the patient after the selected treatment? Will the healing process end with success or not? These are just some of the possible predictive questions faced by both patients and medical staff. Statistics can be of great help in many real world situations, but sometimes its use is limited. As an alternative we can use so-called intelligent data analysis techniques like neural networks, case-based reasoning, logic programming, or decision trees. The greatest weakness of the majority of these techniques is that they are black box approaches: normally giving good predictions but without explanation. One of the techniques overcoming this fault is the decision tree approach, which in addition to immense prediction power also explains the decisions leading to the predictions using a simple two-dimensional graphical model.

* Corresponding author. E-mail addresses: [email protected] (M. Zorman), [email protected] (V. Podgorelec).




Decision trees have already been successfully used in medicine [1–6], but as in traditional statistics, some hard real world problems cannot be solved successfully using the traditional way of induction. Considering the advantages of evolutionary methods in performing complex tasks, we decided to overcome these problems by inducing decision trees with neural networks and genetic algorithms. The aim of the paper is to present these novel induction approaches and compare them with traditional techniques.

2. Decision trees

Inductive inference is the process of moving from concrete examples to general models, where the goal is to learn how to classify (predict) objects by analyzing a set of instances (already solved cases) whose classes (predictions) are known. Instances are typically represented as attribute-value vectors. Learning input consists of a set of such vectors, each belonging to a known class (prediction), and the output consists of a mapping from attribute values to classes (predictions). This mapping should accurately classify/predict both the given instances and other unseen instances.

A decision tree [7–11] is a formalism for expressing such mappings and consists of tests or attribute nodes linked to two or more subtrees, and leaves or decision nodes labeled with a class, i.e. the decision. A test node computes some outcome based on the attribute values of an instance, where each possible outcome is associated with one of the subtrees. An instance is classified by starting at the root node of the tree. If this node is a test, the outcome for the instance is determined and the process continues using the appropriate subtree. When a leaf is eventually encountered, its label gives the predicted class of the instance.

3. Classical approach

The algorithm for inducing a decision tree is simple and the representation of the accumulated knowledge is easily understood. Namely, decision trees do not give us just the decision for an earlier unseen case; they also give us an explanation of the decision, and that is essential in medical and nursing applications.

The procedure of constructing a decision tree from a set of training objects is called induction of the tree. In the induction of a decision tree we start with an empty tree and the entire training set. The procedure of induction is recursive and has four steps:

1. If all the training objects in the training (sub)set have the same outcome, create a leaf labeled with that outcome and go to step 4.
2. With the help of a heuristic evaluation function, find the best attribute among all attributes that have not been used on the path from the root to the current node. Create an internal node with a split based on the selected attribute. Split the training set into subsets.
3. For each subset of training objects go to step 1.
4. Go one level up in the recursion.
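To make the four steps concrete, here is a minimal sketch of such a recursive induction procedure. The data representation (objects as attribute dictionaries paired with an outcome) and the simplified best_attribute heuristic are illustrative assumptions; the paper's own tool (MtDeciT 2.0) uses an entropy-based evaluation function.

```python
from collections import Counter

# A minimal sketch of the four-step recursive induction procedure above.
# Training objects are (attribute_dict, outcome) pairs; best_attribute()
# stands in for the heuristic evaluation function and is simplified here
# to "attribute with the most distinct values".
def best_attribute(objects, unused):
    return max(unused, key=lambda a: len({obj[a] for obj, _ in objects}))

def induce(objects, unused_attributes):
    outcomes = {outcome for _, outcome in objects}
    if len(outcomes) == 1 or not unused_attributes:            # step 1: pure subset -> leaf
        return {'decision': Counter(o for _, o in objects).most_common(1)[0][0]}
    attr = best_attribute(objects, unused_attributes)          # step 2: choose split attribute
    node = {'attr': attr, 'children': {}}
    for value in {obj[attr] for obj, _ in objects}:            # split into subsets
        subset = [(obj, o) for obj, o in objects if obj[attr] == value]
        node['children'][value] = induce(subset, unused_attributes - {attr})   # step 3: recurse
    return node                                                # step 4: return up one level
```

Calling induce(training_objects, set(attribute_names)) returns a nested dictionary with 'attr'/'children' test nodes and 'decision' leaves.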

3.1. Discretization of continuous attributes

Due to the nature of decision trees, all numeric attributes must be mapped into a discrete space. Many different techniques of discretization exist, but we will focus only on those used in our experiments. The purpose of this section is to present a novel discretization technique called dynamic discretization, which was used in our experiments.


Fig. 1. Equidistant discretization of a continuous attribute with values between 60 and 100.

Fig. 3. Dynamic discretization of a continuous attribute with values between 60 and 100.


The most primitive way is to split the interval into equidistant subintervals (Fig. 1). Typical equidistant discretizations are quartiles (four subintervals) and octiles (eight subintervals).

In many experiments claimed to be very successful, the chosen way of discretization is threshold discretization, which splits the interval into two subintervals by determining a threshold with the help of the information gain function and the training objects (Fig. 2).
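As an illustration of threshold discretization, the sketch below (an assumption about the mechanics, not the MtDeciT implementation) scans the candidate cut points between sorted training values and keeps the one with the highest information gain.

```python
import math
from collections import Counter

# Hedged sketch of threshold discretization via information gain.
def entropy(outcomes):
    counts = Counter(outcomes)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def best_threshold(values, outcomes):
    """Return the cut point that maximises the information gain of a binary split."""
    pairs = sorted(zip(values, outcomes))
    base = entropy([o for _, o in pairs])
    best_gain, best_cut = -1.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                    # no cut between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [o for v, o in pairs if v <= cut]
        right = [o for v, o in pairs if v > cut]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if gain > best_gain:
            best_gain, best_cut = gain, cut
    return best_cut
```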

Dynamic discretization of continuous attributes (developed by the authors of this paper) determines the subintervals dynamically during the process of building the decision tree. This technique first splits the interval into many subintervals, so that every training object’s value has its own subinterval. In the second step it merges neighbouring subintervals that are labeled with the same outcome into larger subintervals. In each of the following steps we merge together three subintervals: two ‘strong’ subintervals with one ‘weak’ subinterval, where the ‘weak’ subinterval lies between the two ‘strong’ ones. Here ‘strong’ and ‘weak’ refer to the number of training objects in the subinterval (see Fig. 3). In comparison to the earlier two approaches, dynamic discretization returns more ‘natural’ subintervals, which results in better and smaller decision trees.

We differentiate between two types of dynamic discretization: general dynamic discretization and nodal dynamic discretization.

General dynamic discretization uses all available training objects for the definition of subintervals. That is why we perform the general dynamic discretization before we start building the decision tree. All the subintervals of all attributes are memorized in order to be used later in the process of building the decision tree.

Nodal dynamic discretization performs the definition of subintervals for all continuous attributes that are available in the current node of the decision tree. Only those training objects that come into the current node are used for setting the subintervals of the continuous attributes.
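The following sketch illustrates the dynamic discretization idea described above. The strength threshold min_strong and the requirement that the two 'strong' neighbours share an outcome are illustrative assumptions; the exact merging criteria (e.g. the tolerance and z parameters mentioned in Section 7) are not spelled out here.

```python
from itertools import groupby

# Hedged sketch of dynamic discretization: one subinterval per training
# value, merge neighbours with the same outcome, then absorb 'weak'
# subintervals lying between two 'strong' ones.
def dynamic_discretize(values, outcomes, min_strong=5):
    points = sorted(zip(values, outcomes))
    # One subinterval per training value: (lo, hi, outcome, count)
    intervals = [(v, v, o, 1) for v, o in points]

    # Step 1: merge adjacent subintervals labelled with the same outcome.
    merged = []
    for outcome, group in groupby(intervals, key=lambda iv: iv[2]):
        group = list(group)
        merged.append((group[0][0], group[-1][1], outcome,
                       sum(iv[3] for iv in group)))

    # Step 2: absorb a 'weak' subinterval between two 'strong' ones.
    changed = True
    while changed:
        changed = False
        for i in range(1, len(merged) - 1):
            left, mid, right = merged[i - 1], merged[i], merged[i + 1]
            if (mid[3] < min_strong and left[3] >= min_strong
                    and right[3] >= min_strong and left[2] == right[2]):
                merged[i - 1:i + 2] = [(left[0], right[1], left[2],
                                        left[3] + mid[3] + right[3])]
                changed = True
                break
    return [(lo, hi) for lo, hi, _, _ in merged]
```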

Fig. 2. Threshold discretization of a continuous attribute with values between 60 and 100.


4. Neuro generated decision trees

If we compare decision trees and neural networks, we can see that their advantages and drawbacks are almost complementary. For instance, the knowledge representation of decision trees is easily understood by humans, which is not the case for neural networks; decision trees have trouble dealing with noise in the training data, which again is not a problem for neural networks; decision trees learn very fast while neural networks learn relatively slowly; etc. Our idea was to combine decision trees and neural networks in order to join their advantages. That is why we developed a combined approach for building decision trees.

First we build a decision tree that is then used to initialise a neural network. Such a network is then trained using the same training objects. After that, the neural network is converted back to a decision tree, which is better than the original decision tree [12] (Fig. 4).

4.1. Conversion decision tree-neural network

The source decision tree is converted to a disjunctive normal form, which is a set of normalised rules. The disjunctive normal form then serves as the source for determining the neural network’s topology and weights. The neural network has two hidden layers. The number of neurons in each hidden layer depends on the rules in the disjunctive normal form. The number of neurons in the output layer depends on how many outcomes are possible in the training set.

The conversion is described in the following steps:

1. Build a decision tree, using any available approach.
2. Every path from the root of the tree to every single leaf is represented as a rule.
3. The set of rules is transformed into disjunctive normal form (DNF), which is a minimal representation of the original set of rules.
4. In the input layer create as many neurons as there are attributes in the training set.
5. For each literal in the DNF a neuron is created in the first hidden layer (literal layer) of the neural network.
6. Set the weight of each neuron in the literal layer that represents a literal of the form (attribute > value) to w0 = -σ·value, and of each literal of the form (attribute ≤ value) to w0 = σ·value. Set all the remaining weights to +ε or -ε with equal probability. The constant σ is usually a number larger than 1 (usually 5) and the constant ε is a number close to 0 (usually 0.05).
7. For every conjunction of literals create a neuron in the second hidden layer (conjunctive layer).
8. Set the weights that link each neuron in the conjunctive layer with the appropriate neurons in the literal layer to w0 = -σ·(2n-1)/2, where n is the number of literals in the conjunct. Set all the remaining weights to +ε or -ε with equal probability.
9. For every possible class create a neuron in the output layer (disjunctive layer).
10. Set the weights that link each neuron in the disjunctive layer with the appropriate neurons in the conjunctive layer to w0 = -σ/2. Set all remaining weights to +ε or -ε with equal probability.
11. Train the neural network using the same training objects as were used for training the decision tree.

Fig. 4. Neuro generated decision tree.

Such a network is then trained using backpropagation. The mean square error of such a network converges toward 0 much faster than it would in the case of randomly set weights. Even if we used the neural network before the backpropagation stage, it would already give good results.
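A sketch of how such an initialisation could be coded is given below. The DNF rule encoding, the connection-weight choice of ±σ for the 'active' links, and the treatment of w0 as a bias term are assumptions made for illustration; the description above only fixes the bias values and the ±ε noise.

```python
import numpy as np

# Hedged sketch of the decision-tree -> neural-network initialisation.
SIGMA, EPS = 5.0, 0.05          # "large" and "near-zero" constants from step 6

def rand_eps(shape, rng):
    """Remaining weights: +EPS or -EPS with equal probability."""
    return rng.choice([EPS, -EPS], size=shape)

def init_from_dnf(n_attributes, n_classes, rules, rng=np.random.default_rng(0)):
    """rules: list of (class_index, conjunction); a conjunction is a list of
    literals (attr_index, op, value) with op in {'>', '<='} (assumed encoding)."""
    literals = sorted({lit for _, conj in rules for lit in conj})
    lit_index = {lit: j for j, lit in enumerate(literals)}

    # Literal layer: one neuron per distinct literal (steps 5-6).
    W1, b1 = rand_eps((len(literals), n_attributes), rng), np.zeros(len(literals))
    for (attr, op, value), j in lit_index.items():
        W1[j, attr] = SIGMA if op == '>' else -SIGMA      # assumed connection weight
        b1[j] = -SIGMA * value if op == '>' else SIGMA * value

    # Conjunctive layer: one neuron per rule (steps 7-8).
    W2, b2 = rand_eps((len(rules), len(literals)), rng), np.zeros(len(rules))
    for i, (_, conj) in enumerate(rules):
        for lit in conj:
            W2[i, lit_index[lit]] = SIGMA                 # assumed connection weight
        b2[i] = -SIGMA * (2 * len(conj) - 1) / 2

    # Disjunctive (output) layer: one neuron per class (steps 9-10).
    W3, b3 = rand_eps((n_classes, len(rules)), rng), np.full(n_classes, -SIGMA / 2)
    for i, (cls, _) in enumerate(rules):
        W3[cls, i] = SIGMA                                # assumed connection weight
    return (W1, b1), (W2, b2), (W3, b3)
```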

4.2. Conversion neural network-decision tree

The trained neural network is then examined in order to determine the most important attributes that influence the outcomes of the neural network. A list containing the most important attributes is then used to build the final decision tree, which in the majority of cases performs better than the source decision tree.

The conversion is described in the following steps:

1. Vary the values of each attribute for each training object at the input layer of the neural network and observe how the outputs of the network change.
2. Make a list of the attributes that influence the net’s outputs the most, in descending order (the most influential attributes come first in the list).
3. Run an algorithm for building decision trees, where a list of attributes provided by the neural network is used instead of a classic heuristic evaluation function. This approach is very fast, but it produces symmetric decision trees (the same attributes occur on the same level).

The last conversion usually causes the loss of some knowledge contained in the neural network, but even so most of the knowledge is transferred into the final decision tree. If the approach is successful, the final decision tree has better classification capabilities than the source decision tree.
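A sketch of steps 1 and 2 could look like the following. The perturbation size and the use of the mean absolute output change as the influence measure are assumptions, and predict stands in for the trained network's forward pass.

```python
import numpy as np

# Hedged sketch of the attribute ranking used in the NN -> decision-tree step.
def rank_attributes(predict, X, delta=0.1):
    """Return attribute indices ordered from most to least influential."""
    base = predict(X)                            # outputs for unmodified objects
    influence = np.zeros(X.shape[1])
    for a in range(X.shape[1]):
        X_mod = X.copy()
        X_mod[:, a] += delta * (X[:, a].std() or 1.0)    # small perturbation of one attribute
        influence[a] = np.abs(predict(X_mod) - base).mean()
    return list(np.argsort(-influence))          # descending influence
```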

5. Evolutionary decision trees

5.1. Genetic algorithms

Genetic algorithms are adaptive heuristic search methods which may be used to solve all kinds of complex search and optimisation problems [13–15]. They are based on the evolutionary ideas of natural selection and the genetic processes of biological organisms. As natural populations evolve according to the principles of natural selection and ‘survival of the fittest’, first laid down by Charles Darwin, so by simulating this process genetic algorithms are able to evolve solutions to real-world problems, if they have been suitably encoded [16]. They are often capable of finding optimal solutions even in the most complex of search spaces, or at least they offer significant benefits over other search and optimisation techniques.

A typical genetic algorithm operates on a set of solutions (a population) within the search space. The search space represents all the possible solutions that can be obtained for the given problem, and is usually very complex or even infinite. Every point of the search space is one of the possible solutions and, therefore, the aim of the genetic algorithm is to find an optimal point or at least come as close to it as possible.

The genetic algorithm consists of three genetic operators: selection, crossover (recombination), and mutation. Selection is the survival of the fittest individuals within the genetic algorithm, with the aim of giving preference to the best ones. For this purpose all solutions have to be evaluated, which is done with the use of the fitness function. Selection determines the individuals to be used for the second genetic operator, crossover or recombination, where from two good individuals a new, even better one is constructed. The crossover process is repeated until the whole new population is filled with the offspring produced by crossover. All constructed individuals have to preserve feasibility regarding the given problem; it is therefore important to coordinate the internal representation of individuals with the genetic operators. The last genetic operator is mutation, which is an occasional (low probability) alteration of an individual that helps to find an optimal solution to the given problem faster and more reliably.
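The generic loop can be sketched as follows; the callable parameters (fitness, select, crossover, mutate) and the mutation_rate default are placeholders for the problem-specific operators described in Section 5.3, not the genTrees implementation.

```python
import random

# Hedged sketch of a generic genetic algorithm loop (selection, crossover, mutation).
def evolve(population, fitness, select, crossover, mutate,
           generations=100, mutation_rate=0.05, rng=random):
    """Minimise `fitness`; return the best individual found."""
    for _ in range(generations):
        ranked = sorted(population, key=fitness)            # lower fitness is better
        offspring = []
        while len(offspring) < len(population):
            parent_a, parent_b = select(ranked, rng), select(ranked, rng)
            child = crossover(parent_a, parent_b, rng)
            if rng.random() < mutation_rate:                 # occasional alteration
                child = mutate(child, rng)
            offspring.append(child)
        population = offspring
    return min(population, key=fitness)
```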

5.2. Construction of decision trees using the evolutionary method

Decision trees have so far proven to be an efficient and reliable approach to support decisions in medical diagnostics [17]. However, their traditional induction method is not very appropriate for optimising and, most of all, for personalising the diagnostic procedures. Therefore, we developed an evolutionary algorithm for the induction of decision trees and in this manner profit from two very important features: the optimisation of the decision tree’s topology, which helps to reduce the number of needed examinations, and the plurality of evolved solutions, which provides us with the necessary alternatives from which the most appropriate diagnostic procedure can be determined for each individual patient.

Genetic algorithms are generally used for very complex optimisation tasks [14] for which no efficient heuristic method has been developed. The construction of decision trees is a complex task, but heuristic methods exist that usually work efficiently and reliably [8,9]. Nevertheless, there are some reasons justifying our evolutionary approach. Genetic algorithms provide a very general concept that can be used in all kinds of decision making problems. Due to their robustness they can also be used on incomplete, noisy data (which often happens in medicine because of measurement errors, unavailability of proper instruments, risk to the patient, etc.), which are not very successfully solvable by traditional techniques of decision tree construction. Furthermore, genetic algorithms use evolutionary principles to evolve solutions; therefore, solutions can be found that could easily be overlooked otherwise. Another important advantage of genetic algorithms is the possibility of optimising the decision tree’s topology and adapting the class intervals for numeric attributes simultaneously with the evolution process. And last but not least, by weighting different parameters the search can be directed to the situation that best applies to current needs, particularly in multi-class decision making processes, where we have to decide for which decision the reliability should be maximised and which attributes are more desirable/undesirable than the others.

5.3. Evolutionary process

When defining the internal representation of individuals within the population, together with the appropriate genetic operators that will work upon the population, it is important to assure the feasibility of all solutions during the whole evolution process. Therefore, we decided to represent individuals directly as decision trees. This approach has some important features: all intermediate solutions are feasible, no information is lost because of conversion between the internal representation and the decision tree, the fitness function can be straightforward, etc. Direct coding of solutions may, however, bring some problems in defining the genetic operators. As decision trees may be seen as a kind of simple computer program (with attribute nodes being conditional clauses and decision nodes being assignments), we decided to define genetic operators similar to those used in genetic programming, where individuals are computer program trees [18].

5.3.1. Selection and the fitness function

For the selection purposes a slightly modified linear ranking selection was used. The ranking of an individual decision tree within a population is based on the Local Fitness Function (LFF):

LFF = \sum_{i=1}^{K} w_i (1 - acc_i) + \sum_{i=1}^{N} c_{attr(t_i)} + w_u n_u

where K is the number of decision classes, N is the number of attribute nodes in a tree, acc_i is the accuracy of classification of objects of a specific decision class d_i, w_i is the importance weight for classifying the objects of decision class d_i, attr(t_i) is the attribute used for the test in node t_i, c_{attr(t_i)} is the cost of using that attribute, n_u is the number of unused decision (leaf) nodes, i.e. nodes into which no object from the training set falls, and w_u is the weight of the presence of unused decision nodes in a tree.

According to this LFF the best trees (the most fit ones) have the lowest function values; the aim of the evolutionary process is to minimise the value of the LFF for the best tree. A near optimal decision tree would classify all training objects with accuracy in accordance with the importance weights w_i (some decision classes may be more important than others), would have very few unused decision nodes (no evaluation regarding the training set is possible for such decision nodes), and would consist of low-cost attribute nodes (in this manner the desirable/undesirable attributes can be prioritised).
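Expressed as code, the LFF could be computed as in the sketch below; the argument structure (per-class accuracies, a cost per attribute, counts of unused leaves) is an assumption about how a tree would be summarised, not the genTrees data model.

```python
# Hedged sketch of the Local Fitness Function (LFF) described above.
def local_fitness(per_class_accuracy, class_weights,
                  attribute_nodes, attribute_costs,
                  n_unused_leaves, w_unused):
    """Lower is better: misclassification terms + attribute costs + unused-leaf penalty."""
    misclassification = sum(w * (1.0 - acc)
                            for w, acc in zip(class_weights, per_class_accuracy))
    cost = sum(attribute_costs[attr] for attr in attribute_nodes)
    return misclassification + cost + w_unused * n_unused_leaves
```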

5.3.2. Crossover and mutation

Crossover works on two selected individuals as an exchange of two randomly selected sub-trees. A randomly selected training object is used to determine a path (by finding a decision through the tree) in each of the two selected trees. Then an attribute node is randomly selected on the path in the first tree and an attribute node is randomly selected on the path in the second tree. Finally, the sub-tree rooted at the selected attribute node in the first tree is replaced with the sub-tree rooted at the selected attribute node in the second tree, and in this manner an offspring is created which is put into the new population.
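A sketch of this subtree exchange, under an assumed dict-based tree representation (attribute nodes with 'attr', 'split_value' and two 'children'; leaves with 'decision'), could look like this:

```python
import copy
import random

# Hedged sketch of the subtree-exchange crossover described above; the tree
# representation and traversal rule are assumptions, not the authors' code.
def decision_path(tree, obj):
    """Attribute nodes visited while classifying one training object."""
    path, node = [], tree
    while 'decision' not in node:
        path.append(node)
        branch = 0 if obj[node['attr']] <= node['split_value'] else 1
        node = node['children'][branch]
    return path

def crossover(parent_a, parent_b, training_object, rng=random):
    """Graft a random subtree from parent_b's path onto parent_a's path."""
    offspring = copy.deepcopy(parent_a)
    target = rng.choice(decision_path(offspring, training_object))
    donor = rng.choice(decision_path(parent_b, training_object))
    target.clear()
    target.update(copy.deepcopy(donor))      # replace the subtree in place
    return offspring
```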


Mutation consists of several parts: (1) a randomly selected attribute node is replaced with an attribute randomly chosen from the set of all attributes; (2) a test in a randomly selected attribute node is changed, i.e. the split constant is mutated; (3) a randomly selected decision (leaf) node is replaced by an attribute node; and (4) a randomly selected attribute node is replaced by a decision node.
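Under the same assumed tree representation as in the crossover sketch, the four mutation variants might be coded as follows; the Gaussian perturbation of the split constant and the two-leaf replacement node are illustrative choices, not the authors' exact operators.

```python
import random

# Hedged sketch of the four mutation variants listed above.  A tree is a dict:
# attribute nodes have 'attr', 'split_value', 'children'; leaves have 'decision'.
def collect_nodes(tree, leaves=False):
    """Flatten the tree into a list of attribute nodes or leaves."""
    if 'decision' in tree:
        return [tree] if leaves else []
    found = [] if leaves else [tree]
    for child in tree['children']:
        found.extend(collect_nodes(child, leaves))
    return found

def mutate(tree, attributes, classes, rng=random):
    variant = rng.randint(1, 4)
    if variant == 1:        # (1) replace the attribute of a random test node
        rng.choice(collect_nodes(tree))['attr'] = rng.choice(attributes)
    elif variant == 2:      # (2) mutate the split constant of a random test node
        rng.choice(collect_nodes(tree))['split_value'] += rng.gauss(0.0, 1.0)
    elif variant == 3:      # (3) replace a random leaf by a new attribute node
        leaf = rng.choice(collect_nodes(tree, leaves=True))
        leaf.pop('decision')
        leaf.update(attr=rng.choice(attributes), split_value=0.0,
                    children=[{'decision': rng.choice(classes)},
                              {'decision': rng.choice(classes)}])
    else:                   # (4) replace a random attribute node by a decision node
        node = rng.choice(collect_nodes(tree))
        node.clear()
        node['decision'] = rng.choice(classes)
    return tree
```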

With the combination of the presented crossover, which works as a constructive operator towards local optima, and mutation, which works as a destructive operator in order to keep the needed genetic diversity, the search tends to be directed toward the globally optimal solution, which is the most appropriate decision tree regarding our specific needs (expressed in the form of the LFF). As the evolution repeats, better solutions are obtained regarding the chosen fitness function. The evolution stops when an optimal or at least an acceptable solution is found, or if the fitness score of the best individual does not change for a predefined number of generations.

5.3.3. Self-adaptation by information spreading in multi-population model

The quality and reliability of the obtained results depend a great deal on the selected fitness function; therefore, it is reasonable to expect that automatic adaptation of the fitness function (the LFF presented earlier) would yield better results. One basic question has led us to the development of a multi-population model for the construction of decision trees: how to provide a method for automatic adaptation of the fitness function that would still assure the quality of evolved solutions? Namely, if the modification of the fitness function is unsupervised, the fitness function can easily become inappropriate, which of course gives bad solutions.

Fig. 5. The multi-population decision support model with information spreading.

Our idea was that the fitness function should evolve together with the decision trees, where solutions are regulated in the sense that they are compared with another set of decision trees for which the fitness function is predefined and tested. This brings us to our multi-population model (Fig. 5).

The model consists of three independent populations: one main population and two competing ones. The first population is a single decision tree, initially induced in the traditional way with the C4.5 algorithm [8], and the decision trees in the other two populations evolve in accordance with the evolutionary algorithm described in the earlier section. The best individual from the main population, which serves as the general solution, competes with the best individuals from the other populations in solving a given problem upon a training set. When it becomes dominant over the others (in the sense of the accuracy, sensitivity and specificity of the classification of training objects) it spreads its ‘knowledge’ to them. In this way the other solutions become equally successful and the main population has to improve further to reclaim its dominance. This can be done by adjusting the LFF; for this purpose the weights used in the LFF are modified:

\forall i \in [1 \ldots K]: \quad w_i = w_i + (1 - acc_i) \cdot rand(av)

where av is the adaptation value and rand(av) is a random value from [0, av].

One global fitness function, different from the local ones (those used in the evolutionary process to evaluate the fitness of individual decision trees), is used to determine the dominance of one population over another. It is predetermined as a weighted sum of accuracy, sensitivity and specificity (see the results in the next section) of the given solution, together with the cost of the chosen attributes:

GFF = w_{acc} (1 - acc) + w_{sens} (1 - sens) + w_{spec} (1 - spec) + \sum_{i=1}^{N} c_{attr(t_i)} + w_u n_u

where the last two terms are the same as in the LFF, acc is the accuracy on the classified training objects, sens is the sensitivity on the classified training objects, spec is the specificity on the classified training objects, w_acc is the importance weight of accuracy, w_sens is the importance weight of sensitivity and w_spec is the importance weight of specificity.

One of the recent and very promising aspects of genetic algorithms, which is still not fully exploited, is the use of the natural phenomenon of co-evolution. In nature it often happens that two or more species have to evolve with respect to each other to become more successful in fighting for resources in the environment as a whole. We used the principle of co-evolution to handle the simultaneous evolution of decision trees and the adaptation of the LFF in the main population of our multi-population model. By competing with other populations, the adaptation of the evolving fitness function is supervised, and in this way the problem of automatic adaptation is solved.

6. The orthopaedic fracture data

The data consisted of 2637 cases for which the data had been verified as correctly entered from the physician’s report. There was one record per case. The outcome was classed by the physician’s report as ‘successful’, ‘failed’ or ‘unverified failure’. Cases were classed as unverified failures if no record of success was ever recorded. The ‘unverified failures’ were classed as ‘failures’ for the statistical analysis, since this is the more rigorous route when examining the success rate.

The data were examined for associations between all the independent variables and the outcome. Where cell sizes were small, categories were collapsed. All variables which were associated with the outcome at an alpha level of 0.05 were then examined for collinearity. A logistic regression model with success as the outcome was then built. The best model had 63% accuracy, 49% sensitivity and 76% specificity to predict failure. Age, age of fracture, fracture type and whether the bone was the humerus or not were associated with success. Age correlated with many of the other possible predictors, but removing it from the model did not increase the precision. Selecting the cases with a fracture age of less than 120 days yielded a model with 95% specificity but only 17% sensitivity to predict failure, for an overall accuracy of 68%. The predictors were humerus (yes/no) and age of fracture.

Weighting the cases to improve the sensitivity did not improve the overall accuracy.

In the set of 2637 cases there were 2435 cases classified as ‘healed’, 116 cases as ‘failed’, and 87 cases as ‘unverified failure’. Each case was described with 23 attributes (six continuous and 17 discrete) and a decision. The set contained attributes with missing values, which made the problem even harder.


For training and testing purposes the set was split in an approximate ratio of 4:1 in favour of the training set. The split was made in such a way that the relative class frequencies in both parts were approximately the same. So each approach used 2095 objects for training and 543 objects for testing. On both sets we measured the accuracy of prediction for each decision, the sensitivity to failure, the specificity, and the overall accuracy. In the following section we present these results.
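As an illustration (not the authors' tooling), a stratified split with the same proportions could be obtained with scikit-learn:

```python
# Hedged sketch of the ~4:1 stratified train/test split described above.
from sklearn.model_selection import train_test_split

def split_fracture_data(X, y, test_size=543, seed=0):
    """Keep class frequencies approximately equal in both parts (stratify=y)."""
    return train_test_split(X, y, test_size=test_size,
                            stratify=y, random_state=seed)
```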

Our intention was to find a classifier that would have as high a sensitivity to failure as possible, in combination with high overall accuracy. Cases with different decisions were not equally represented in the training set, so it was expected that the decision tree algorithms would learn more about the cases that had the most representatives in the training set. Too much generalization could mean that no knowledge about the special cases is extracted, even though the overall accuracy is high.

7. Results

We used four different approaches to perform the tests on the orthopaedic data. For the classical decision trees with equidistant and dynamic discretization and for the neuro generated decision trees we used MtDeciT 2.0, a Java 1.2 based package that was developed at our institution [19]. Evolutionary decision trees were built with the genTrees 1.7 decision tree generator [20], written in C++, which was also developed in our laboratory. Tests were also performed using Quinlan’s See5 package. The decision trees were built using a training set of 2095 objects.

The success of the trees was tested by the:
• accuracy of predicting every single class in the testing set,
• sensitivity on the testing set,
• specificity on the testing set,
• accuracy on the testing set,
and also by the:
• accuracy of predicting every single class in the training set,
• sensitivity on the training set,
• specificity on the training set,
• accuracy on the training set.

Different settings of the build parameters were tested, and the results presented in Table 1 were achieved with the following settings:
• Classic decision trees with 4% prepruning, discretization of continuous attributes to quartiles (equidistant, four subintervals), and the entropy heuristic function for selection of attributes (MtDeciT 2.0).
• Classic decision trees with 4% prepruning, dynamic discretization of numerical attributes (40% tolerance, z=2), and the entropy heuristic function for selection of attributes (MtDeciT 2.0).
• Classic decision trees with 4% prepruning, dynamic discretization of numerical attributes (40% tolerance, z=2), the entropy heuristic function for selection of attributes, postpruned with static error estimate pruning (MtDeciT 2.0).
• Neuro generated decision tree (decision tree → neural network → decision tree). The source decision tree was a classic decision tree with 4% prepruning, dynamic discretization of numerical attributes (40% tolerance, z=2), the entropy heuristic function for selection of attributes, postpruned with static error estimate pruning (eight nodes). The neural network had two hidden layers, ten neurons in the first hidden layer and six neurons in the second hidden layer. The destination decision tree was built using a list of the attributes that influence the outputs of the neural network the most (MtDeciT 2.0).
• Evolutionary decision trees: genTrees 1.7 was run for 2400 generations. The training was stopped at that point since no significant improvements occurred if training was extended to more generations.
• Quinlan’s C5 is the successor of the more famous C4.5. Since it is considered a measure of success for decision tree approaches, we included it in our test. The tolerance for prepruning was set to 4%, but the tree had only one node, which classified all objects as ‘healed’. When we lowered the tolerance to 1%, the results did not change.

8. Discussion

Looking at Table 1, it can be seen that all approaches (except the Evolutionary DT) had very high specificity on both the training and the testing set. On the other hand, all approaches with the exception of the Evolutionary DT had a sensitivity equal to or very close to 0%.

The reason for that can be found in the training set, where the majority of cases were classified as ‘healed’ and only 4.59% were classified as ‘failed’. The classic approaches were unable to find knowledge that would classify all three classes with approximately the same accuracy. For the same reason, the overall accuracy is misleadingly high (again except for the Evolutionary DT).

Another criterion of a decision tree’s success is the size of the tree, i.e. the number of nodes in the tree. Decision trees generated by the classic approaches without postpruning (the first two cases) and those generated by the neuro approach are too large to be feasible: the accumulated knowledge cannot be understood because of its sheer quantity, a clear case of overfitting. On the other hand, the tree generated by C5 has only one leaf and contains no valuable knowledge.

Table 1
Comparison of results

                                                   Sensitivity         Specificity   Accuracy   Tree size
                                                   (to failure) (%)    (%)           (%)        (nodes)

Classic DT, equidistant discretization                                                           423
  Training set                                     53.26               98.05         96.08
  Testing set                                       8.33               90.75         87.1

Classic DT, dynamic discretization                                                               169
  Training set                                     25                  97.35         94.17
  Testing set                                       0                  94.79         90.6

Classic DT, dynamic discretization, postpruned                                                     8
  Training set                                      1.08               96.8          92.6
  Testing set                                       0                  95.95         91.71

Neuro generated DT                                                                              1273
  Training set                                     58.69               98.2          96.46
  Testing set                                       4.16               84.39         80.66

Evolutionary DT                                                                                   21
  Training set                                     48.9                50.97         50.88
  Testing set                                      20.8                51.63         49.35

C5                                                                                                 1
  Training set                                      0                  96.65         92.41
  Testing set                                       0                  96.14         91.89


If we compare the remaining two approaches (the classic DT with dynamic discretization and postpruning, and the evolutionary DT), there is no doubt about the winner. The evolutionary approach built a small, easy-to-understand decision tree that has the best capabilities to distinguish between cases with different decisions.

Most of the other methods have good results on the training set and poor performance on the testing set; again we are talking about overfitting.

But why do the evolutionary decision trees have results that are so different from the others? The secret is in the approach itself. In genTrees 1.7 there is a possibility to set weights for different decisions. In that way we can build decision trees that are tailored to our needs: we can raise the sensitivity, specificity or accuracy, whichever suits our problem best. In that way the decision trees can be ‘personalised’.

In the real world, useful parameters for guiding clinical decisions are not identified by the several methods tested in this paper. However, some models are certainly more suitable for solving real world problems where no ideal training set is available. Better models should give us the possibility to improve the chosen feature of the decision tree and in that sense compensate for the lack of training objects, the presence of missing values, etc.

9. Conclusion

In our experiments we tested various methods for building univariate decision trees in order to find the best induction strategy. On a hard real world problem of the orthopaedic fracture data with 2637 cases, described by 23 attributes and a decision with three possible values, we built decision trees with four classical approaches (classic DT with equidistant or dynamic discretization, with or without postpruning, and C5), a hybrid approach where we combined neural networks and decision trees, and an evolutionary approach.

The overall accuracy of all approaches was very high, but the real value of their classification capabilities was low. Sensitivity, our prime concern, was low in all cases; the best value was achieved by the approach that had the worst overall accuracy, namely the evolutionary decision trees. In addition, the evolutionary decision trees had a feasible size, neither too large nor too small.

In other words, if we had to decide which induction strategy to choose based on the results presented earlier, we would choose the evolutionary decision tree strategy. The deciding factors are the possibilities to personalise decision trees, to set the preferred decisions, and to build decision trees that are not limited by the constraints of classical approaches regarding numeric attributes and missing values.

Since the overall success of the tested approaches was not what we expected and wished for, we will continue our work on current and new hybrid and evolutionary approaches. In the near future we will start our tests on hybrids that include rough sets and on an extension of decision trees called decision nets.

References

[1] M. Zorman, P. Kokol, Decision trees and automatic learning in medical decision making, in: H. Adeli (Ed.), Intelligent Information Systems IIS'97, Grand Bahama Island, Bahamas, December 8–10, 1997, Proceedings, IEEE Computer Society, Los Alamitos, California, 1997, pp. 37–41.

[2] M. Zorman, M. Molan Štiglic, P. Kokol, I. Malcic, The limitations of decision trees and automatic learning in real world medical decision making, J. Med. Syst. 21 (6) (1997) 403–415.

[3] T. af Klercker, Effect of pruning of a decision-tree for the ear, nose and throat realm in primary health care based on case-notes, J. Med. Syst. 20 (4) (1996) 215–226.

[4] B. Zupan, D.S. Stokic, M. Bohanec, M.M. Priebe, A.M. Sherwood, Relating clinical and neurophysiological assessment of spasticity by machine learning, in: P. Kokol, B. Stiglic (Eds.), Proceedings of the Tenth IEEE Symposium on Computer-Based Medical Systems, IEEE Computer Society, Maribor, June 1997, pp. 190–194.

[5] A. McQuatt, P.J.D. Andrews, D. Sleeman, V. Corruble, P.A. Jones, The analysis of head injury data using decision tree techniques, Joint European Conference on Artificial Intelligence in Medicine and Medical Decision Making, Aalborg, Denmark, 20–24 June 1999.

[6] W.R. Shankle, M. Subramani, M.J. Pazzani, P. Smyth, Detecting very early stages of dementia from normal aging with machine learning methods, Lecture Notes in Artificial Intelligence 1211, Artificial Intelligence in Medicine, Proceedings of the 6th Conference on Artificial Intelligence in Medicine Europe, Grenoble, France, 1997, pp. 73–85.

[7] J.R. Quinlan, Induction of decision trees, Machine Learning 1 (1986) 81–106.

[8] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, 1993.

[9] J.R. Quinlan, Simplifying decision trees, Int. J. Man-Machine Studies 27 (1987) 221–234.

[10] J.R. Quinlan, Decision trees and instance-based classifiers, in: A.B. Tucker Jr (Ed.), CRC Computer Science and Engineering Handbook, CRC Press, Boca Raton, 1997, pp. 521–535.

[11] S.J. Russell, P. Norvig, et al., Artificial Intelligence: a Modern Approach, Prentice-Hall, Englewood Cliffs, 1995, pp. 525–562.

[12] A. Banerjee, Initializing neural networks using decision trees, Proceedings of the International Workshop on Computational Learning and Natural Learning Systems, 1994, pp. 3–15.

[13] T. Baeck, Evolutionary Algorithms in Theory and Practice, Oxford University Press, Oxford, 1996.

[14] D.E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison Wesley, Reading, MA, 1989.

[15] D.E. Goldberg, Genetic and evolutionary algorithms come of age, Commun. ACM 37 (3) (1994) 113–119.

[16] J.H. Holland, Adaptation in Natural and Artificial Systems, MIT Press, Cambridge, MA, 1975.

[17] P. Kokol, M. Mernik, J. Zavrsnik, K. Kancler, I. Malcic, Decision trees based on automatic learning and their use in cardiology, J. Med. Syst. 18 (4) (1994) 201–206.

[18] J.R. Koza, Genetic Programming: on the Programming of Computers by Natural Selection, MIT Press, Cambridge, MA, 1992.

[19] M. Zorman, Š. Hleb, M. Šprogar, Advanced tool for building decision trees MtDeciT 2.0, in: P. Kokol, T. Welzer-Družovec, H.R. Arabnia (Eds.), International Conference on Artificial Intelligence, 28 June–1 July 1999, Las Vegas, Nevada, USA, CSREA, Las Vegas, Vol. 1, 1999, pp. 315–318.

[20] V. Podgorelec, P. Kokol, Induction of medical decision trees with genetic algorithms, Proceedings of the International ICSC Congress on Computational Intelligence Methods and Applications CIMA 1999, Rochester, NY, USA, ICSC Academic Press, June 1999.
