Discrimination-Based Criteria for the Evaluation of Classifiers


Thanh Ha Dang 1,2, Christophe Marsala 1, Bernadette Bouchon-Meunier 1 and Alain Boucher 2

1 Université Pierre et Marie Curie - Paris 6, CNRS UMR 7606, DAPA, LIP6, 8 rue du Capitaine Scott, Paris, F-75015, France

2 Institut de la Francophonie pour l'Informatique, Equipe MSI, Bât D, 42 rue Ta Quang Buu, Hanoi, Vietnam

Abstract. Evaluating the performance of classifiers is a difficult task in machine learning. Many criteria have been proposed and used in such a process. Each criterion measures some facets of classifiers. However, none is good enough for all cases. In this communication, we justify the use of discrimination measures for evaluating classifiers. The justification is mainly based on a hierarchical model for discrimination measures, which was introduced and used in the induction of decision trees.

1 Introduction

Machine learning techniques have become increasingly popular in both academic and industrial domains. In classification problems, the use of such techniques often involves assessing how good a classifier is with respect to a dataset. The standard practice is to take a collection of examples in the domain of interest, randomly select a subset of these examples as a training set, and apply machine learning algorithms to it to obtain a classifier, also named a classification model, or model for short. This model is then used to classify the remaining test cases. The performance of classifiers is usually evaluated by the classification results on the test sets.

Most evaluation measures are designed under the hypothesis that all examples are equally important and that datasets are distributed in a balanced manner over their classes. In the classical case, each example in the test set is classified in a single class. But in more general settings, such as probabilistic classification, possibilistic classification and fuzzy classification, several classes may be assigned to an example.

A confusion matrix contains information about the actual and predicted classifications done by a classification model, and the performance of such a model is commonly evaluated using the data in this matrix. Most measures evaluate the relation between the predicted classes and the actual classes of the examples of a test set, and do not pay enough attention to the characterization of the classification problem; examples include accuracy, error rate, true/false positive rate, true/false negative rate, sensitivity, and specificity. So there is a bias on classification results related to the characterization of the problem, in particular the distribution of examples, as shown by many authors [7, 12]. Sometimes, the classes have very unequal frequencies. For instance, in e-commerce, 99% of visitors do not buy anything and only 1% of them buy something; in security, 99.99% of people are not terrorists; etc. The situation is similar with multiple classes. A majority-class classifier can have a very high accuracy (99% in the e-commerce situation, 99.99% in the security problem), yet it is useless.

In the last decade, ROC curves have been widely used in the machine learning community [2, 6]. An attractive property of ROC curves is their insensitivity to the class distribution. In many cases, such as rule sets, classifiers produce only a class decision for each example. The use of such a discrete classifier on a test set produces a single confusion matrix, from which only one point in ROC space is determined. In such cases, classifiers should be converted to generate a score for each example rather than just a class. Moreover, ROC analysis is not convenient for the choice of classifiers: to conclude that a classifier is better than another, the first one should be better than the second one over the whole performance space [11], i.e. it should have a higher true positive rate and a lower false positive rate. Several information-based measures [3, 7, 8], which will be recalled in the next section, have been proposed for a similar purpose. They usually try to exclude the effect of prior class probabilities, as in ROC analysis.

In a classification process, two partitions of the test set can be pointed out. The first one is natural: the examples are partitioned by their classes. The second one is the partition raised by the classifier. In this paper, we propose to consider the adequacy between these two partitions. This allows an evaluation of the discrimination power of classification models with regard to the classes, and a comparison of classifiers. The initial idea was introduced for the decision tree induction process [13], where it helps to select the most adequate attribute for partitioning a learning set: the selected attribute is the most discriminating one with regard to the classes.

The paper is organized as follows. In Section 2, several criteria for classifier evaluation, in particular the information-based criteria, are presented and formalized within a common formalism. In Section 3, a hierarchical model for measures of discrimination in inductive learning is recalled. In Section 4, the discrimination-based criteria are introduced based on this hierarchical model. In Section 5, several properties of the proposed criteria are presented. In Section 6, a set of experiments is reported to illustrate and validate the use of these criteria. In the last section, we conclude and propose future work.

2 Criteria for classifier evaluation

There are many performance criteria for a classification model. We cite here some usual ones: accuracy, error rate, true/false positive rate, true/false negative rate, sensitivity, specificity, positive/negative predictive value, fallout, precision/recall, F measure, ROC convex hull, area under the ROC curve, calibration error, mean cross-entropy, root mean squared error, expected cost, explicit cost, etc. In several cases, the interpretability and the sensitivity, based on a geometric analysis [1] of the model, are considered.

In the remainder of this section, several information-based criteria for classifier evaluation are presented and formalized within a common formalism. Most of these measures have been proposed for probabilistic classification, and they take the classical classification (the target is one class and the classifier assigns one class to each example) as a particular case. They usually evaluate the coherence between the probability distribution predicted by the classification model and the target probability distribution for each example, and then aggregate these values over the whole dataset.

Let $\xi_T = \{e_1, e_2, \ldots, e_N\}$ be a test set of examples. Suppose that an example $e$ belongs to class $c_e$ from a set of classes $C = \{c_1, c_2, \ldots, c_n\}$. Class $c_e$ is the target class (also named the natural class or the real class) of the classification process. In the more general case, each example is associated with a probability distribution $(p_{c_1}(e), p_{c_2}(e), \ldots, p_{c_n}(e))$, in which $p_{c_i}(e)$ is the probability that the example belongs to class $c_i$. This distribution is the target of classifiers. The probability distribution for the classical case is of the form $(0, \ldots, 1, \ldots, 0)$, where the 1 corresponds to class $c_e$.

Suppose that, a priori, the probability that an example $e$ belongs to class $c$ of $C$ is $p_e(c)$, so the prior probability distribution is $(p_e(c_1), p_e(c_2), \ldots, p_e(c_n))$. This distribution may be estimated from the frequency of each class in the dataset.

Suppose that the posterior probability that an example $e$ is classified in class $c$ is $p'_e(c)$. This probability is returned by the classifier, so the posterior probability distribution is $(p'_e(c_1), p'_e(c_2), \ldots, p'_e(c_n))$. It is also called the predicted probability distribution.

The cross-entropy error is described in [3] for the case of two classes, 0 and 1. It measures how close predicted values are to target values. In the simple case, for each example $e$ we have either $p_0(e) = 1$ and $p_1(e) = 0$, or $p_0(e) = 0$ and $p_1(e) = 1$. It assumes that the predicted values are probabilities $p'_e(1)$ in the interval $[0, 1]$ that indicate the probability that the example is in class 1. Obviously, if $p_0(e) = 1$ and $p_1(e) = 0$, i.e. the example belongs to class 0, a small predicted probability $p'_e(1)$ that the example belongs to class 1 is preferred; otherwise a large $p'_e(1)$ is preferred. The cross-entropy for an example $e$, which should be minimized, is defined as follows^3:

$$\text{cross-entropy}(e) = -p_1(e) \log p'_e(1) - (1 - p_1(e)) \log(1 - p'_e(1)) \qquad (1)$$

The cross-entropy for a test set is defined as the sum of the cross-entropy of each example belonging to the set. To make the cross-entropy independent of the dataset size, the mean cross-entropy is defined as this sum divided by the total number of examples in the test set.

3 In this paper, $\log x$ should be understood as $\log_2 x$.
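To make Eq. (1) concrete, here is a minimal Python sketch of the cross-entropy and the mean cross-entropy; the function names and the `eps` clipping (which keeps the logarithm finite at the boundaries) are our own additions rather than part of the original formulation.

```python
import math

def cross_entropy(p1_target, p1_predicted, eps=1e-12):
    """Cross-entropy of Eq. (1) for one example with two classes 0 and 1.

    p1_target is p1(e), the target probability of class 1 (0 or 1 in the
    classical case); p1_predicted is p'_e(1), returned by the classifier.
    Predictions are clipped to [eps, 1 - eps] to keep the log finite.
    """
    q = min(max(p1_predicted, eps), 1.0 - eps)
    return -(p1_target * math.log2(q) + (1.0 - p1_target) * math.log2(1.0 - q))

def mean_cross_entropy(targets, predictions):
    """Sum of per-example cross-entropies divided by the test-set size."""
    return sum(cross_entropy(t, q) for t, q in zip(targets, predictions)) / len(targets)

# Two well-classified examples and one poorly classified one.
print(mean_cross_entropy([1, 0, 1], [0.9, 0.2, 0.4]))
```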

The directed divergence, i.e. the Kullback-Leibler measure, is also used in this domain. It measures the Kullback-Leibler distance between the predicted probability distribution and the target probability distribution of an example.

$$d_{KL}\big((p_{c_1}(e), \ldots, p_{c_n}(e)), (p'_e(c_1), \ldots, p'_e(c_n))\big) = \sum_{i=1}^{n} p_{c_i}(e) \log \frac{p_{c_i}(e)}{p'_e(c_i)} \qquad (2)$$

In the case of two classes, where $(p_{c_1}(e), p_{c_2}(e))$ takes only one of the two distributions $(1, 0)$ and $(0, 1)$, the directed divergence reduces to the cross-entropy mentioned previously because:

$$\begin{aligned} \text{cross-entropy}(e) &= -\big(p_1(e) \log p'_e(1) + (1 - p_1(e)) \log(1 - p'_e(1))\big) \\ &= -\big(p_e(1) \log p'_e(1) + p_e(0) \log p'_e(0)\big) \\ &= p_e(1) \log \frac{p_e(1)}{p'_e(1)} + p_e(0) \log \frac{p_e(0)}{p'_e(0)} \end{aligned} \qquad (3)$$

with the convention $0 \log 0 = 0$.

A disadvantage of the above measures is that their value is infinite when an example is completely badly classified, for instance when the target distribution is $(1, 0)$ and the predicted distribution is $(0, 1)$. Moreover, they do not take into account the prior probability distribution.
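As an illustration of both the measure and the disadvantage just mentioned, a minimal sketch of Eq. (2) follows (the function name is ours); it returns an infinite value for a completely badly classified example.

```python
import math

def kl_divergence(target, predicted):
    """Directed divergence (Eq. 2) between a target and a predicted
    distribution over the classes, with the convention 0 log 0 = 0."""
    d = 0.0
    for p, q in zip(target, predicted):
        if p == 0.0:
            continue             # convention: 0 log 0 = 0
        if q == 0.0:
            return float('inf')  # target mass on a class predicted impossible
        d += p * math.log2(p / q)
    return d

# A completely badly classified example: the divergence is infinite.
print(kl_divergence((1.0, 0.0), (0.0, 1.0)))
```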

An information reward measure is proposed in [8]. It applies in the case where each example $e$ belongs to only one class $c_e$. In binomial classification, the information reward of each example is defined as $\text{reward}(e) = 1 + \log p'(c_e)$. It is 1 bit if the classification is correct ($p'(c_e) = 1$), 0 bits for complete ignorance in the classification task ($p'(c_e) = 0.5$), and it is negative if $p'(c_e) < 0.5$. Like the Kullback-Leibler distance, the information reward does not take into account the characterization of the problem, in particular the distribution of examples by their classes.
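A minimal sketch of this reward (our naming), using base-2 logarithms as in the rest of the paper:

```python
import math

def information_reward(p_true_class):
    """Information reward of one example [8]: 1 + log2 p'(ce)."""
    return 1.0 + math.log2(p_true_class)

# 1 bit for a confident correct prediction, 0 bits for complete
# ignorance (0.5), negative below 0.5 (and -inf as p'(ce) -> 0).
print(information_reward(1.0), information_reward(0.5), information_reward(0.25))
```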

Unlike the previous measures, I. Kononenko and I. Bratko [7] proposed a measure that explicitly takes the prior probabilities of classes into account. This interesting property is maintained in the discrimination-based measures introduced in the next sections. Kononenko and Bratko suggested calculating the amount of information gained or lost in the classification of each example, and then in the classification of the whole dataset. A priori, the amount of information necessary to confirm that $e$ is in class $c$ is $-\log p_e(c)$ bits. Analogously, the amount of information necessary to correctly decide that $e$ does not belong to class $c$ is $-\log(1 - p_e(c))$ bits. A posteriori, if $p'_e(c_e) \geq p_e(c_e)$ then the probability of class $c_e$ changes in the good direction, and the gain of information in this case is $-\log p_e(c_e) + \log p'_e(c_e)$ bits. If $p'_e(c_e) < p_e(c_e)$ then the probability of class $c_e$ changes in the bad direction, and the loss of information in this case is $-\log(1 - p_e(c_e)) + \log(1 - p'_e(c_e))$ bits. The information score is the difference between the quantity of gained information and the quantity of lost information over all examples of the test set. It can be normalized by dividing by the number of examples in the test set.
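A minimal sketch of the normalized information score follows; the representation of the input (one pair of prior and posterior probabilities of the real class per example) and the function name are our assumptions.

```python
import math

def information_score(examples):
    """Normalized information score of Kononenko and Bratko [7].

    examples: list of (prior, posterior) pairs for the real class of
    each test example, i.e. (p_e(c_e), p'_e(c_e)).
    """
    total = 0.0
    for prior, posterior in examples:
        if posterior >= prior:
            # good direction: information gained
            total += -math.log2(prior) + math.log2(posterior)
        else:
            # bad direction: information lost (subtracted from the score)
            total -= -math.log2(1.0 - prior) + math.log2(1.0 - posterior)
    return total / len(examples)

# A confident correct prediction gains 1 bit over a 0.5 prior.
print(information_score([(0.5, 1.0)]))
```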

3 Hierarchical model for measures of discrimination

Let $\xi$ be a training set. Suppose that each example $e$ in $\xi$ is described by means of a set of values for $K$ attributes in a set $A = \{A_1, A_2, \ldots, A_K\}$ and belongs to a class from a set of classes $C$. Each attribute $A_j$ takes its values in a set $\{v_1, v_2, \ldots, v_{m_{A_j}}\}$ where the values $v_i$ can be either symbolic or numerical. In inductive learning, a decision tree is built to point out the relation between the values of the attributes and the classes in $C$. This enables us to obtain a class from any forthcoming description of an example.

A decision tree corresponds to a ranking of the attributes according to their influence on the classes. The nodes of such a tree correspond to testing the value of an attribute, the edges correspond to a given value, and the leaves are associated with a class. The main method to construct a decision tree is the so-called Top Down Induction method. The tree is constructed from its root to its leaves by successive partitioning of the training set into subsets. Each partition is done according to the values of a selected attribute and leads to the definition of a node of the tree labelled by this attribute. The selected attribute is the one that has the greatest discrimination power with regard to the classes. A measure of discrimination helps this choice by estimating the information brought out by an attribute. For instance, in the ID3 algorithm [13], the Shannon entropy [16] is used to measure the quantity of information. A leaf is constructed from a set that contains only examples of the same class.

In the induction of decision trees, the measure used to select the best attribute must fulfill several properties. Such a measure is called a measure of discrimination. The hierarchical model of measures of discrimination has been proposed to validate a given function as a good measure of discrimination for the decision tree construction process [10]. A number of measures have been validated by this model [5]: the Rényi entropy, the Daróczy entropy, etc.

This hierarchical model, named the FGH-model, is based on the definition of three levels F, G, H of functions. The functions of each level are aggregated to construct a measure of discrimination. The F-level of functions is concerned with measuring the inclusion between the set of examples with class $c_j$ and the set of examples with value $v_i$ for attribute $A$. The G-level of functions is concerned with aggregating functions from the F-level to measure the information brought out by a value $v_i$ related to the whole set of classes $C$. The H-level of functions is concerned with aggregating functions from the G-level to measure the discrimination power of attribute $A$ related to $C$. At each level, a set of properties required for the function is given.

Given a set $X$, let $\mathcal{F}[X]$ be the set of all subsets of $X$ and $\mathbb{P}[X]$ be the set of all partitions of $X$.

Definition 1. (F-function). An F-function is a function $F : \mathcal{F}[X] \times \mathcal{F}[X] \to \mathbb{R}^+$ such that:

1. $F(U, V)$ is minimum when $U \subseteq V$.
2. $F(U, V)$ is maximum when $U \cap V = \emptyset$.
3. $F(U, V)$ is strictly decreasing with $U \cap V$, i.e. $U \cap V_1 \subset U \cap V_2$ implies $F(U, V_1) > F(U, V_2)$.

Given an F-function $F$ and a sequence of continuous functions $g_k : \mathbb{R}^k \to \mathbb{R}^+$, $k \in \mathbb{N}$.

Definition 2. (G-function). A G-function is a function $G : \mathcal{F}[X] \times \mathbb{P}[X] \to \mathbb{R}^+$ such that:

$$G(U, V) = g_n(F(U, V_1), F(U, V_2), \ldots, F(U, V_n)) \qquad (4)$$

where $V = \{V_1, V_2, \ldots, V_n\}$ is a partition of $X$, and:

1. $G(U, V)$ is minimum when there exists $V_j$ ($1 \leq j \leq n$) such that $U \subseteq V_j$.
2. $G(U, V)$ is maximum when $F(U, V_1) = F(U, V_2) = \ldots = F(U, V_n)$.

Given a G-function $G$ and a sequence of continuous functions $h_k : (\mathbb{R}^+)^k \to \mathbb{R}^+$, $k \in \mathbb{N}$.

Definition 3. (H-function). An H-function is a function $H : \mathbb{P}[X] \times \mathbb{P}[X] \to \mathbb{R}^+$ such that:

$$H(U, V) = h_m(G(U_1, V), G(U_2, V), \ldots, G(U_m, V)) \qquad (5)$$

where $U = \{U_1, U_2, \ldots, U_m\}$ is a partition of $X$.

In inductive learning, the definition of the H-function can be applied as follows: $X$ is a learning set, $V$ is the partition of the set of examples according to their classes, $U$ is the partition of the set of examples according to the values of an attribute, and $U_i$ is the set of examples such that the value of attribute $A$ is $v_i$. An H-function measures the discriminating power of attribute $A$ with regard to all the classes. For more details on this model, see [10, 14].
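To illustrate, here is a minimal sketch (our own naming) of the Shannon-entropy instance of an H-function applied to an attribute: the entropy of the classes conditioned by the attribute values, which ID3-style algorithms minimize when choosing the attribute to split on.

```python
from collections import Counter, defaultdict
import math

def discrimination_power(values, classes):
    """H-level measure for one attribute: the Shannon entropy of the
    classes conditioned by the attribute values (lower = more
    discriminating). values[k] and classes[k] describe example k.
    """
    n = len(values)
    by_value = defaultdict(list)
    for v, c in zip(values, classes):
        by_value[v].append(c)          # U_i: examples with attribute value v_i
    h = 0.0
    for cls in by_value.values():
        counts = Counter(cls)
        # G-level: entropy of the classes within this value's subset
        g = -sum((m / len(cls)) * math.log2(m / len(cls)) for m in counts.values())
        h += (len(cls) / n) * g        # H-level: aggregate weighted by |U_i| / |X|
    return h

# A perfectly discriminating attribute gives 0 bits.
print(discrimination_power(['a', 'a', 'b', 'b'], [0, 0, 1, 1]))
```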

4 Discrimination-based criteria

Inductive learning is a process to induce knowledge from cases or examples. Learning algorithms try to extract as much as possible of the information in the dataset to construct a classifier, which is then used to classify a set of new examples. The classifier should discriminate all examples by their classes, so it raises a partition of the set of examples. We propose to evaluate the discrimination power of the classifier by analysing the adequacy between the partition generated by the classifier and the partition with regard to the real classes of the examples, i.e. the natural partition. Like the information-based criterion proposed by I. Kononenko and I. Bratko [7], the discrimination-based criteria take into account the difference between the information given by the classifier (posterior) and the information available on the test set (prior). But we directly consider the probability distribution over the whole set of examples instead of the one for each individual.

We denote by $\xi_{T_c}$ the set of all examples in $\xi_T$ belonging to class $c$, and by $\xi_{T_{c'}}$ the set of all examples in $\xi_T$ classified in class $c$. So the proposed method evaluates the adequacy between the two partitions $\{\xi_{T_{c_1}}, \xi_{T_{c_2}}, \ldots, \xi_{T_{c_n}}\}$ and $\{\xi_{T_{c'_1}}, \xi_{T_{c'_2}}, \ldots, \xi_{T_{c'_n}}\}$.

Suppose that a classifier $M$ is induced from $\xi$ and that it classifies all examples of a test set $\xi_T$. The classifier $M$ labels each example $e$ in $\xi_T$ with a class in $C$. Thus, from $M$ we can introduce a new attribute $f_M$: for each example of $\xi_T$, the attribute takes as its value the label assigned by $M$ to the example (Table 1).

Example  Real class  A1   A2   ...  AK   fM1  fM2
e1       ce1         v11  v12  ...  v1K  c11  c12
e2       ce2         v21  v22  ...  v2K  c21  c22
...      ...         ...  ...  ...  ...  ...  ...
eN       ceN         vN1  vN2  ...  vNK  cN1  cN2

Table 1. Classification results by two classifiers: M1 and M2

As mentioned in the previous section, a measure validated by the hierarchical model can estimate the discrimination power of an attribute, in particular that of the new attribute $f_M$. Thus, it can measure the discrimination power of the classifier $M$. The measurement of the discrimination power of $M$ also consists of three levels.

F-level: The F-level is concerned with measuring the adequacy between the set of examples with class $c_j$ and the set of examples classified in class $c_i$, which we denote $F(\xi_{T_{c_j}}, \xi_{T_{c'_i}})$. It takes its minimum value when the set of examples classified in class $c_i$ is a subset of the set of examples with class $c_j$, and its maximum value when no example of class $c_j$ is classified in class $c_i$.

G-level: The G-level is concerned with aggregating functions from the F-level to measure the information brought out by classifying the examples in class $c_i$. It is noted $G(\{\xi_{T_{c_1}}, \xi_{T_{c_2}}, \ldots, \xi_{T_{c_n}}\}, \xi_{T_{c'_i}})$. It takes its minimum value when there exists a class $c_j$ such that the set of examples classified in class $c_i$ is a subset of the set of examples with class $c_j$ (for example, in the case of an ideal classifier). $G$ takes its maximum value when the adequacies between the set of examples classified in class $c_i$ and each set of examples of the same class are identical.

H-level: The H-level is concerned with aggregating functions from the G-level to measure the discrimination power of the model $M$ related to the whole set of classes $C$. It is noted $H(\{\xi_{T_{c_1}}, \xi_{T_{c_2}}, \ldots, \xi_{T_{c_n}}\}, \{\xi_{T_{c'_1}}, \xi_{T_{c'_2}}, \ldots, \xi_{T_{c'_n}}\})$.

This function measures the inadequacy between the partition by the classifier and the partition by the classes of the examples. In other words, it measures how good the partition according to classifier $M$ is. The smaller $H$ is, the more adequate the two partitions are, so a small value of $H$ is preferred. When $H = 0$, the two partitions are the same.

It has been shown [9] that the Shannon entropy [16], a very usual information measure, is a particular discrimination measure validated by the hierarchical model. In what follows, we establish an evaluation criterion based on the Shannon entropy as an illustration.

F-level:

$$F(\xi_{T_{c_j}}, \xi_{T_{c'_i}}) = -\log \frac{|\xi_{T_{c_j}} \cap \xi_{T_{c'_i}}|}{|\xi_{T_{c'_i}}|} = -\log p(c_j | c'_i) \qquad (6)$$

where $p(c_j | c'_i)$ is the probability that an example classified in class $c_i$ is in class $c_j$, and $|.|$ is the cardinality of a set.

G-level:

$$\begin{aligned} G(\{\xi_{T_{c_1}}, \ldots, \xi_{T_{c_n}}\}, \xi_{T_{c'_i}}) &= \sum_{j=1}^{n} \frac{|\xi_{T_{c_j}} \cap \xi_{T_{c'_i}}|}{|\xi_{T_{c'_i}}|} F(\xi_{T_{c_j}}, \xi_{T_{c'_i}}) \\ &= -\sum_{j=1}^{n} \frac{|\xi_{T_{c_j}} \cap \xi_{T_{c'_i}}|}{|\xi_{T_{c'_i}}|} \log \frac{|\xi_{T_{c_j}} \cap \xi_{T_{c'_i}}|}{|\xi_{T_{c'_i}}|} \\ &= -\sum_{j=1}^{n} p(c_j | c'_i) \log p(c_j | c'_i) \\ &= I(\xi_{T_{c'_i}}) \end{aligned} \qquad (7)$$

It is the entropy of the subset of examples classified in class $c_i$ with regard to their real classes.

H-level:

$$H(\{\xi_{T_{c_1}}, \ldots, \xi_{T_{c_n}}\}, \{\xi_{T_{c'_1}}, \ldots, \xi_{T_{c'_n}}\}) = \sum_{i=1}^{n} \frac{|\xi_{T_{c'_i}}|}{|\xi_T|} G(\{\xi_{T_{c_1}}, \ldots, \xi_{T_{c_n}}\}, \xi_{T_{c'_i}}) = \sum_{i=1}^{n} p(c'_i) I(\xi_{T_{c'_i}}) = I(\xi_T | M) \qquad (8)$$

where $p(c'_i)$ is the probability that an example is classified in class $c_i$, and $I(\xi_T | M)$ is the entropy of the test set conditioned by classifier $M$.

We denote by $I(\xi_T)$ the entropy of the test set $\xi_T$:

$$I(\xi_T) = I(p_1, p_2, \ldots, p_n) = -\sum_{i=1}^{n} p(c_i) \log p(c_i) \qquad (9)$$

where $p(c_i)$ is the probability that an example of $\xi_T$ is in class $c_i$.

The following formula allows us to measure the information brought out by classifier $M$:

$$\Delta I(M, \xi_T) = I(\xi_T) - I(\xi_T | M) \qquad (10)$$

We have $0 \leq \Delta I(M, \xi_T) \leq I(\xi_T)$.

In the learning process, the algorithms try to learn as much as possible about the dataset. The information obtained in such a process enables the induction of the classifier, so the classifier can be considered as an information holder. Therefore, the above formula measures how much information about the test set is held in the classifier. In other words, it measures the part of the information necessary for describing the test set that is gained by the learning process on the learning set. It also expresses the difference between the uncertainty of the whole test set and the average uncertainty of the subsets raised by the classifier from the whole test set.

The ratio of information gain is defined as follows:

$$\tau(M, \xi_T) = \frac{\Delta I(M, \xi_T)}{I(\xi_T)} \qquad (11)$$

We have $0\% \leq \tau(M, \xi_T) \leq 100\%$. $\tau(M, \xi_T)$ measures the ratio of the information of the test set held in the classifier. Of course, a large ratio, i.e. an important information gain, is preferred.
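To make the criterion concrete, here is a minimal sketch (our naming) computing $I(\xi_T)$, $I(\xi_T|M)$, $\Delta I(M, \xi_T)$ and $\tau(M, \xi_T)$ of Eqs. (8)-(11) directly from a confusion matrix.

```python
import math

def entropy(probs):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain_ratio(confusion):
    """Return (I(xi_T), I(xi_T|M), Delta I, tau) from a confusion matrix
    where confusion[i][j] counts examples of real class i classified in
    class j by the model M."""
    n = sum(map(sum, confusion))
    # Eq. (9): entropy of the test set, from the real class frequencies.
    i_t = entropy([sum(row) / n for row in confusion])
    # Eq. (8): entropy conditioned by the classifier's partition.
    i_t_m = 0.0
    for j in range(len(confusion[0])):
        column = [row[j] for row in confusion]   # examples classified in c_j
        size = sum(column)
        if size > 0:
            i_t_m += (size / n) * entropy([x / size for x in column])
    gain = i_t - i_t_m                           # Eq. (10)
    tau = gain / i_t if i_t > 0 else 0.0         # Eq. (11)
    return i_t, i_t_m, gain, tau
```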

5 Additional properties of the discrimination-based criteria

The discrimination-based criteria evaluate the adequacy between the natural partition of the examples and the one raised by the classification model. They eliminate the bias due to the distribution of examples by their classes, and could be useful in addition to other existing criteria. However, they only evaluate the discrimination capacity of the classifier and do not evaluate how correct the classification is. If we consider that a classification process is the succession of a process that discriminates the examples into a number of groups and a process that labels each group, the discrimination-based criteria evaluate the first one. Even in the case where the class assignment is bad (so accuracy is low) whereas the discrimination capacity is good, the classification model is not totally useless.

As argued in the previous section, if the classification model is perfect, i.e. the good classification rate is 100%, then the model is also perfect in terms of the discrimination measure, and $\tau(M, \xi_T) = 100\%$.

In the case of a one-class classifier, which classifies all examples in only one class, the two partitions are completely inadequate and $\tau(M, \xi_T) = 0\%$.

In the case where each group raised by the classification model has the same proportion of each class as the test set, we also have $\tau(M, \xi_T) = 0\%$.

As $\tau(M, \xi_T)$ is a normalized value, it can be a useful index when comparing the performance of a classification model across different datasets, or the performance of different classification models on the same dataset.

6 Some experimental results

A set of experiments has been conducted with several artificial datasets and with datasets from the UCI repository of machine learning [15] to illustrate and validate the discrimination-based criteria. The DTGen software (Decision Tree Generator Software) [17, 4] has been used. The following results are obtained by running the ID3 algorithm to construct decision trees on these datasets. Each dataset is divided into two subsets with similar numbers of examples in each class: one is used as the learning set in the training phase and the other is used as the test set. This process is repeated several times and the average results are reported. The criterion used is the one based on the Shannon entropy.

Let us consider the confusion matrices of Table 2, obtained for three artificial datasets. Each one contains two classes but with a different class distribution. The classification accuracies for the three datasets are identical: 80%. However, the classifier for the second dataset classifies all examples in the majority class and therefore does not give any useful information, so the information gain ratio in this case is 0%. The first dataset is more heterogeneous than the third one, which means that it is more difficult to discriminate its examples. With the same classification accuracy, the classifier obtained on the first dataset deserves more credit than the one obtained on the third dataset. The indicators related to the discrimination power validate this hypothesis.

          Dataset 1           Dataset 2           Dataset 3
          Positive  Negative  Positive  Negative  Positive  Negative
Positive  40        10        0         20        20        10
Negative  10        40        0         80        10        60

Table 2. Confusion matrices for three artificial datasets (rows: real class, columns: predicted class)

                           Dataset 1  Dataset 2  Dataset 3
Accuracy (%)               80.0       80.0       80.0
Dataset entropy (bit)      1          0.72       0.88
Conditional entropy (bit)  0.72       0.72       0.69
Information gain (bit)     0.28       0.0        0.19
τ (%)                      28         0          21.7

Table 3. Classification results for artificial datasets
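As a consistency check, feeding the matrices of Table 2 to the `information_gain_ratio` sketch from Section 4 reproduces the figures of Table 3 up to rounding (this assumes the matrix orientation given in Table 2: rows are real classes, columns are predicted classes).

```python
matrices = {
    "Dataset 1": [[40, 10], [10, 40]],
    "Dataset 2": [[0, 20], [0, 80]],
    "Dataset 3": [[20, 10], [10, 60]],
}
for name, m in matrices.items():
    i_t, i_t_m, gain, tau = information_gain_ratio(m)
    print(f"{name}: I={i_t:.2f} bit, I|M={i_t_m:.2f} bit, "
          f"gain={gain:.2f} bit, tau={100 * tau:.1f}%")
```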

Table 5 shows the results obtained by performing the experiments on several UCI datasets, which are described in Table 4. On the Pima Indians diabetes dataset, the information gain ratio is small because the contribution of the classifier to discriminating the examples is relatively small: it does not differ significantly from the one-class classifier. The classification accuracies for the Ecoli dataset and the Balance scale dataset are similar, but the information gain ratio for the first one is clearly better. Since the Ecoli dataset is more complex than the Balance scale dataset (8 classes vs. 3 classes, 2.18 bits vs. 1.31 bits of entropy), the classifier has to work better on the Ecoli dataset to obtain a close accuracy. The same situation holds between the Waveform dataset and the Glass Identification dataset, and between the Iris plant dataset and the Ionosphere dataset.

                        Pima Indian  Ecoli                  Balance           Waveform  Glass                  Iris     Ionosphere
Number of examples      768          336                    625               300       214                    150      351
Number of attributes    8            7                      4                 21        10                     4        24
Number of classes       2            8                      3                 3         6                      3        2
Class distribution (%)  65.1+34.9    42.6+22.9+15.5+...     46.8+46.8+7.4     3×33.3    35.5+32.7+13.5+...     3×33.3   64.2+35.8
% majority class        61.5         42.6                   46.8              33.3      35.5                   33.3     64.2

Table 4. UCI datasets description

                           Pima Indian  Ecoli  Balance  Waveform  Glass  Iris   Ionosphere
Accuracy (%)               69.37        78.24  76.52    66.61     65.11  96.0   87.21
Dataset entropy (bit)      0.93         2.18   1.31     1.58      2.14   1.58   0.94
Conditional entropy (bit)  0.85         0.99   0.79     1.25      1.38   0.22   0.55
Information gain (bit)     0.08         1.19   0.52     0.33      0.76   1.36   0.39
τ (%)                      8.32         54.75  39.58    21.02     35.50  86.27  41.49

Table 5. Classification results for UCI datasets

7 Conclusion and future work

In this paper, discrimination-based criteria have been proposed by considering the adequacy between the partition of the set of examples by their real classes and the partition by the classifier. The criteria give us an additional index for evaluating a classification model. These criteria are designed based on the hierarchical model for discrimination measures. A set of experiments has been done to illustrate and validate the criteria.

In the future, the discrimination-based criteria should be extended to more general cases such as fuzzy classification or probabilistic classification. It may also be interesting to take into account the importance of examples.

As a classifier $M$ can be regarded as a special attribute $f_M$, this suggests the possibility of aggregating several classification models to obtain a more powerful one. Suppose that $M_1, M_2, \ldots, M_s$ are induced from the learning set. These classification models generate a set of attributes $\{f_{M_1}, f_{M_2}, \ldots, f_{M_s}\}$ on the learning set. From this set of attributes, a decision tree may be induced. An empirical validation of such an aggregation process should be done.

References

1. I. Alvarez. Explaining the result of a decision tree to the end-user. In Proceedings of the 16th European Conference on Artificial Intelligence, ECAI'2004.
2. A. P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159, 1997.
3. R. Caruana, T. Joachims, and L. Backstrom. KDD-Cup 2004: results and analysis. SIGKDD Explorations, 6(2):95–108, 2004.
4. T. H. Dang. Entropies et leurs applications en apprentissage inductif (rapport de présoutenance). Technical report, France, 2005.
5. T. H. Dang, B. Bouchon-Meunier, and C. Marsala. Measures of information for inductive learning. In Proc. of Information Processing and Management of Uncertainty in Knowledge-Based Systems, IPMU'04, pages 1495–1502, Perugia, Italy, July 2004.
6. T. Fawcett. ROC graphs: Notes and practical considerations for researchers. Technical report, 2004.
7. I. Kononenko and I. Bratko. Information-based evaluation criterion for classifier's performance. Machine Learning, 6:67–80, 1991.
8. K. B. Korb, L. R. Hope, and M. J. Hughes. The evaluation of predictive learners: Some theoretical and empirical results. In EMCL '01: Proceedings of the 12th European Conference on Machine Learning, pages 276–287, London, UK, 2001. Springer-Verlag.
9. C. Marsala. Apprentissage inductif en présence de données imprécises: construction et utilisation d'arbres de décision flous. PhD thesis, Université Paris 6, France, 1998.
10. C. Marsala, B. Bouchon-Meunier, and A. Ramer. Hierarchical model for discrimination measures. In Proc. of the eighth IFSA'99 World Congress, pages 339–343, Taipei, Taiwan, August 1999.
11. F. J. Provost and T. Fawcett. Analysis and visualization of classifier performance: Comparison under imprecise class and cost distributions. In Knowledge Discovery and Data Mining, pages 43–48, 1997.
12. F. J. Provost, T. Fawcett, and R. Kohavi. The case against accuracy estimation for comparing induction algorithms. In ICML '98: Proceedings of the Fifteenth International Conference on Machine Learning, pages 445–453, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.
13. J. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
14. A. Ramer, B. Bouchon-Meunier, and C. Marsala. Analytical structure of hierarchical discrimination. In Proc. of the IEEE Int. Conf. on Fuzzy Systems, FUZZ-IEEE, pages 1050–1053, Seoul, Korea, August 1999.
15. C. Blake, S. Hettich, and C. Merz. UCI repository of machine learning databases, 1998.
16. C. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423 and 623–656, July and October 1948.
17. F. Stermann and N. Longuet. Document technique de DTGen, rapport de stage de fin d'étude, DESS IA. Technical report, Laboratoire d'Informatique de Paris 6, April 2003.