Confidence Transformation for Combining Classifiers


ORIGINAL ARTICLE

Cheng-Lin Liu · Hongwei Hao · Hiroshi Sako

Confidence Transformation for Combining Classifiers

Received: 23 May 2003 / Accepted: 24 October 2003 / Published online: 10 March 2004
© Springer-Verlag London Limited 2004

Abstract This paper investigates a number of confidence transformation methods for measurement-level combination of classifiers. Each confidence transformation method is the combination of a scaling function and an activation function. The activation functions correspond to different types of confidences: likelihood (exponential), log-likelihood, sigmoid, and the evidence combination of sigmoid measures. The sigmoid and evidence measures serve as approximations of class probabilities. The scaling functions are derived by Gaussian density modeling, logistic regression with variable inputs, etc. We test the confidence transformation methods in handwritten digit recognition by combining variable sets of classifiers: neural classifiers only, distance classifiers only, strong classifiers, and mixed strong/weak classifiers. The results show that confidence transformation is effective in improving the combination performance in all the settings. The normalization of class probabilities to unity of sum is shown to be detrimental to the combination performance. Comparing the scaling functions, the Gaussian method and the logistic regression perform well in most cases. Regarding the confidence types, the sigmoid and evidence measures perform well in most cases, and the evidence measure generally outperforms the sigmoid measure. We also show that the confidence transformation methods are highly robust to the validation sample size in parameter estimation.

Keywords Classifier combination · Confidence transformation · Evidence combination · Gaussian modeling · Logistic regression · Pattern classification

Introduction

The combination of multiple classifiers has been studied for many years with the aim of overcoming the limitations of individual classifiers [1, 2, 3, 4]. Classifiers with different structures, trained with different algorithms and data sources, as well as those using different pattern features, exhibit complementary classification behavior, and combining them can reduce the risk of misclassification. The combination methods can be divided into three groups according to the level of classifier outputs used in fusion: abstract level (crisp class), rank level (rank list of classes), and measurement level [2, 5]. Since the crisp class and rank list can be derived from the class scores (measurements), measurement-level combination is able to give better classification performance provided that an appropriate fusion strategy is used.

This paper concerns the measurement-level combination of classifiers with diverse outputs, typically similarity and dissimilarity/distance measures. To this end, the output measurements need to be transformed to a uniform scale which represents the confidence (or reliability) of classification, hopefully the likelihood or probability that the input pattern belongs to a class. Ideally, the confidence value equals the Bayesian likelihood or a posteriori probability. The transformed confidence measures are input to the combiner for generating the combined class measures. The confidence measures can also be used for rejection prior to decision combination [6]. In some cases, the confidence measures are used together with the raw classifier outputs in combination [7]. Confidence transformation is also called confidence evaluation or classifier output calibration [8]. In addition to classifier combination, confidence evaluation is also useful for pattern verification/rejection, candidate class selection [9], contextual processing, etc.

Confidence transformation was avoided in most works by assuming that the classifier outputs inherently represent the class likelihood or probabilities, as for neural networks [10, 11] and parametric statistical classifiers [12]. The calibration of classifier outputs, however, is necessary for the combination of classifiers with different scales of outputs. Even for neural networks or parametric classifiers, the re-scaling of outputs may benefit the combination performance. For combining different classifiers, some researchers have used simple empirical functions, such as the contrast between the class distance and the minimum distance or the reciprocal of the class distance, to represent the confidence [13, 14]. Histograms and confusion matrices are frequently used in confidence evaluation for combination [13, 15].

C.-L. Liu (✉) · H. Sako
Central Research Laboratory, Hitachi, Ltd.,
1-280 Higashi-koigakubo, Kokubunji-shi, Tokyo 185-8601, Japan
E-mail: [email protected]

H. Hao
Department of Computer Science,
University of Science and Technology Beijing,
Beijing 100083, P.R. China

Pattern Anal Applic (2004) 7: 2–17
DOI 10.1007/s10044-003-0199-5

For transforming classifier outputs to probabilistic confidence measures, Denker and LeCun have shown that the posterior probabilities of neural networks can be computed by soft-max under certain assumptions [16]. Hoekstra, et al., approximated the classification correctness of neural networks using k-nearest neighbor (k-NN), Parzen window, and logistic regression [17]. Duin and Tax used different confidence models for various classifiers [18]. For mapping distances into posterior probabilities, they used one-dimensional logistic regression. Logistic regression has also been used in speech recognition [19] and the posterior probability estimation of the support vector machine (SVM) [20], etc. The probability transformation method of Gorski [21] is similar to logistic regression. Wei, et al., proposed a histogram-based approach to re-map the sigmoid outputs of neural networks for improving the posterior probability estimate [22].

Viewing the classifier outputs as feature variables, the confidence evaluation problem is reduced to density estimation or regression in this feature space. Due to the restrictive distribution and the high independence of classifier outputs, either regression with one variable or one-dimensional density modeling may give good confidence estimates. In general, one classifier output (the measurement of one class) separates patterns into two meta-classes: the target class and the others. Assuming one-dimensional Gaussian densities for the two meta-classes, the posterior probability of the target class is shown to be a sigmoid function [23].

We have shown in previous experiments that confidence transformation can lead to high classification performance using simple combination rules [24]. In this paper, we investigate more confidence evaluation methods in the combination of various classifier sets. We will show that even for combining neural classifiers, whose outputs approximate the a posteriori probabilities, confidence transformation is beneficial to the combination performance. Various combination rules are available to fuse the transformed confidence measures, including fixed rules [4], weighted rules [21, 25, 26], as well as meta-classifiers on the feature space of classifier outputs or confidence measures [27, 28, 29].

In this paper, we focus on the confidence transformation of classifier outputs, while the combination performance is tested using the fixed sum-rule, which was shown to perform fairly well in various combination problems [4, 28]. In some sense, we can view the confidence values as dynamic combining weights because they are dependent on the input pattern and the underlying classifier. For weighted combination and classification-based combination, we expect that confidence evaluation will benefit the classification performance as well. Due to the limitation of space, the comparison of combination rules based on confidence transformation will be treated in another paper in the near future.

We decompose a confidence transformation method into the combination of a scaling function and an activation function. The four activation functions that we consider correspond to four confidence types: log-likelihood, exponential, sigmoid, and the evidence combination of sigmoid measures. The evidence measure is supposed to better represent multi-class probabilities, and its promise is justified in our experiments. The scaling functions are determined by the Gaussian modeling of Schurmann [23] and logistic regression with different numbers of variables, as well as some heuristic strategies. The sensitivities of the combination performance to the degree of parameter sharing and to the validation sample size are also studied.

The rest of this paper is organized as follows. Section 2 introduces the confidence transformation problem and Section 3 describes the confidence transformation methods. Section 4 describes the setting of the experiments; Section 5 shows some illustrative examples and Section 6 presents the combination results. Concluding remarks are provided in Section 7.

Problem Formulation

To classify a pattern $X$ into $M$ classes $\{\omega_1, \ldots, \omega_M\}$, assume we have $K$ classifiers (classification experts) $\{E_1, \ldots, E_K\}$, each using a feature vector $x_k$, $k = 1, \ldots, K$. On an input pattern, each classifier $E_k$ outputs discriminant measures (scores) to all classes: $d_{kj}(x_k)$, $j = 1, \ldots, M$. The measurements represent the class membership/similarity or dissimilarity/distance. A classifier classifies the input pattern to the class of maximum similarity or minimum dissimilarity.

By combining multiple classifiers at measurement level, the decisions of the constituent classifiers are deferred and the final decision is made after the outputs of multiple classifiers are fused to give combined class membership measures. Formally, the combined measures are computed by

$$D_j(X) = F\begin{pmatrix} d_{11}(x_1) & \cdots & d_{1M}(x_1) \\ \vdots & \ddots & \vdots \\ d_{K1}(x_K) & \cdots & d_{KM}(x_K) \end{pmatrix}, \quad j = 1, \ldots, M,$$

where $D_j(X)$ denotes the combined measure for class $\omega_j$. For many combination rules, the combined measure of one class considers only the corresponding outputs of the same class:

$$D_j(X) = F\bigl(d_{1j}(x_1), \ldots, d_{Kj}(x_K)\bigr), \quad j = 1, \ldots, M.$$

This class of rules includes the sum-rule, product-rule, median-rule, the weighted sum and product rules, etc.

On transforming the classifier outputs into confidence measures representing the class probabilities or likelihood, a simple combination rule is able to give high performance. In this case, the combined class measures are given by

$$D_j(X) = F\bigl(z_{1j}(x_1), \ldots, z_{Kj}(x_K)\bigr), \quad j = 1, \ldots, M, \qquad (1)$$

where $z_{kj}(x_k)$, $j = 1, \ldots, M$, denote the confidence measures transformed from the outputs of classifier $E_k$:

$$z_{kj}(x_k) = z_{kj}(d_k) = T(d_{k1}, \ldots, d_{kM}). \qquad (2)$$

To design the confidence transformation function $T(\cdot)$ is what we are concerned with in this paper.
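To make Eqs. 1 and 2 concrete, the following Python sketch combines per-classifier confidence vectors with the un-weighted sum rule. The transformation $T$ is stubbed out here as a global scaling followed by a sigmoid (one of the schemes described later); the array shapes, parameter values, and the `transform` stub are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def transform(d, mu0, sigma0):
    """Placeholder confidence transformation T: global scaling of the raw
    outputs followed by a sigmoid activation (one scheme from this paper)."""
    f = (d - mu0) / sigma0            # scaling function f_j(d)
    return 1.0 / (1.0 + np.exp(-f))   # sigmoid confidence z_j

def combine_sum_rule(outputs, scale_params):
    """outputs: list of K arrays of shape (M,), the raw scores d_kj.
    scale_params: list of (mu0, sigma0) per classifier (validation estimates).
    Returns the combined measures D_j and the decided class."""
    Z = np.stack([transform(d, *p) for d, p in zip(outputs, scale_params)])
    D = Z.mean(axis=0)                # D_j(X) = (1/K) sum_k z_kj
    return D, int(np.argmax(D))

# toy example: K = 2 classifiers, M = 3 classes
outputs = [np.array([2.1, -0.5, 0.3]), np.array([1.2, 0.8, -1.0])]
params = [(0.0, 1.0), (0.0, 1.0)]
D, label = combine_sum_rule(outputs, params)
print(D, label)
```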

Now we show that confidence transformation provides classifier-specific or data-dependent weights to classifier outputs though the confidence measures are combined by un-weighted rules. As a special case of confidence transformation, the confidence function is a linearly shifted and scaled form of the classifier outputs:

$$z_{kj} = a_k d_{kj} + b_k;$$

then by un-weighted averaging, the combined class measures are

$$D_j(X) = \frac{1}{K}\sum_{k=1}^{K} z_{kj} = \sum_{k=1}^{K} \frac{a_k}{K} d_{kj} + \frac{1}{K}\sum_{k=1}^{K} b_k,$$

wherein $a_k/K$ can be viewed as a weight of the classifier $E_k$.

On the other hand, assume the confidence measure approximates the Bayesian likelihood:

$$z_{kj} = p(x_k \mid \omega_j) P(\omega_j);$$

then the posterior probabilities are calculated by

$$P(\omega_j \mid x_k) = \frac{z_{kj}}{\sum_{i=1}^{M} z_{ki}}.$$

By averaging the likelihood measures, the combined class measures are

$$D_j(X) = \frac{1}{K}\sum_{k=1}^{K} z_{kj} = \frac{1}{K}\sum_{k=1}^{K}\left(\sum_{i=1}^{M} z_{ki}\right) P(\omega_j \mid x_k),$$

wherein the sum $\sum_{i=1}^{M} z_{ki}$ can be viewed as a data-dependent weight of classifier $E_k$. By averaging the normalized posterior probabilities, this weight will be lost. This may partially explain why normalizing the class confidences into unity of sum deteriorates the combination performance, as will be shown in our results.

A block diagram of a multiple classifier system with confidence evaluation is shown in Fig. 1. In this system, confidence transformation is performed independently for each constituent classifier, and the transformed confidences are fused to give the final class measures.

Confidence Transformation Methods

The outputs of a classifier are transformed to confidence measures by output re-scaling followed by activation. Though the activation function (corresponding to the confidence type) is used after the scaling function, it is of theoretical interest and is more relevant to the aim of confidence transformation, while the scaling function is connected to implementation. Hence, we introduce the confidence types prior to the scaling functions.

Confidence types

To achieve optimal combination performance by probabilistic fusion, the transformed confidence values are desired to equal the Bayesian likelihood $p(x \mid \omega_j)P(\omega_j)$ or the class posterior probability $P(\omega_j \mid x)$ (in this section, we consider the confidence transformation of one classifier, so the subscript $k$ of the classifier index is dropped). In this case, the posterior probabilities need not satisfy the axiom of probabilities (sum up to unity), because when an input pattern deviates from the underlying density model of the classifier, it is normal that the sum of class probabilities is smaller than one. This is because, though we assume $M$ classes in classification, the input pattern does not necessarily belong to the $M$ defined classes; rather, it may belong to an ''outlier'' class.

We approximate the class likelihood and posterior probability with exponential and sigmoid functions, respectively. For transforming the classifier outputs $d = [d_1 \ldots d_M]^T$, a scaling function is used to shift and re-scale the outputs to give a basic measure $f_j(d)$ for each class (the scaling functions will be described later). Then the exponential measure and sigmoid measure are given by

$$z_j^e = \exp\bigl(f_j(d)\bigr) \qquad (3)$$

and

$$z_j^s = \frac{1}{1 + \exp\bigl(-f_j(d)\bigr)}, \qquad (4)$$

respectively.

Fig. 1 Block diagram of a multiple classifier system (CT = confidence transformation)


The normalization of exponentials to unity of sum results in the soft-max of posterior probabilities. The logarithm of the Bayesian likelihood (log-likelihood) serves as another type of confidence. In the case of the exponential form, the log-likelihood is simply the scaling function

$$z_j^l = f_j(d). \qquad (5)$$

Since the scaling function is basically linear with respect to the classifier outputs, we also call the approximate log-likelihood a linear measure.
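A minimal sketch of the three activation functions introduced so far (Eqs. 3-5), applied to an already scaled output vector $f(d)$; the numerical inputs are made up for illustration.

```python
import numpy as np

def exponential_measure(f):
    # Eq. 3: z^e_j = exp(f_j(d)), an approximate Bayesian likelihood
    return np.exp(f)

def sigmoid_measure(f):
    # Eq. 4: z^s_j = 1 / (1 + exp(-f_j(d))), an approximate two-class probability
    return 1.0 / (1.0 + np.exp(-f))

def linear_measure(f):
    # Eq. 5: z^l_j = f_j(d), the approximate log-likelihood (linear measure)
    return f

f = np.array([1.5, -0.7, -2.0])   # illustrative scaled outputs for M = 3 classes
print(exponential_measure(f), sigmoid_measure(f), linear_measure(f))
```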

The sigmoid function is prevalently used for the neuronal outputs of neural networks, and it was proven that the neural network outputs approximate the class posterior probabilities, typically under the mean square error (MSE) and cross-entropy (CE) criteria [10, 11]. On the other hand, assuming that the output measure of one class separates the class from the others and the two meta-classes (the target class as one meta-class and the other classes in another one) have one-dimensional Gaussian densities, the class posterior probability is a sigmoid function [23]. Therefore, it is reasonable to estimate class posterior probabilities using the sigmoid.

For re-evaluating the confidences of neural network outputs, we take the outputs before sigmoid squashing (linear combinations of the outputs of the hidden units) as the inputs of the scaling function. The effect of re-scaling followed by sigmoid activation is similar to that of the posterior probability re-mapping of Wei, et al. [22]. The re-scaled neural network outputs can also be transformed by other activation functions, however.

The plausibility of approximating the Bayesian likelihood using an exponential is rooted in parametric classification with Gaussian density assumptions. In particular, the linear discriminant function (LDF) and the quadratic discriminant function (QDF) are the logarithm of the Bayesian likelihood under the Gaussian density assumption [12]. We can then recover the Bayesian likelihood as the exponential of the discriminant function, while the LDF and the QDF serve as log-likelihood measures. We generalize this connection to discriminant functions other than LDF and QDF and approximate the Bayesian likelihood as the exponential of the re-scaled outputs of the classifier, as in Eq. 3. The outputs of many statistical and neural classifiers can be viewed as generalized LDF or QDF. For example, viewing the hidden units of neural networks as feature extractors, the linear output layer is a linear classifier; the Mahalanobis distance and square Euclidean distance for prototype-based classifiers are special cases of the QDF.

Viewing the sigmoid measure as a two-class (one class versus the others) probability, the multi-class probabilities can be obtained by combining the sigmoid measures according to the Dempster-Shafer theory of evidence [30, 31], which has been applied to classifier combination [1, 2, 32, 33]. Mandler, et al. [1] and Rogova [32] also used it for computing the class confidences from distances or similarity measures. We calculate the class confidences from sigmoid measures as follows. In the frame of discernment, we have $2M$ focal elements (singletons and negations) $\{\omega_1, \bar\omega_1, \ldots, \omega_M, \bar\omega_M\}$ with basic probability assignments (BPAs)

$$m_j(\omega_j) = z_j^s, \qquad m_j(\bar\omega_j) = 1 - z_j^s;$$

then the combined evidence of $\omega_j$ (evidence measure) is given by

$$z_j^c = m(\omega_j) = A \cdot m_j(\omega_j) \prod_{i=1, i \neq j}^{M} m_i(\bar\omega_i) = A \cdot z_j^s \prod_{i=1, i \neq j}^{M} (1 - z_i^s), \qquad (6)$$

where

$$A^{-1} = \sum_{j=1}^{M} z_j^s \prod_{i=1, i \neq j}^{M} (1 - z_i^s) + \prod_{i=1}^{M} (1 - z_i^s).$$

The evidence measure serves as a multi-class probability estimate and satisfies

$$\sum_{j=1}^{M} z_j^c \leq 1.$$

The remaining probability

$$1 - \sum_{j=1}^{M} z_j^c = A \prod_{i=1}^{M} (1 - z_i^s)$$

represents the probability that the input pattern belongs to an ''outlier'' class.
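The evidence measure of Eq. 6 follows directly from the sigmoid measures; a sketch under the BPA assignments above (the clipping constant and the sample values are assumptions of this illustration):

```python
import numpy as np

def evidence_measure(zs):
    """Dempster-Shafer combination of per-class sigmoid measures (Eq. 6).
    zs: array of shape (M,) holding two-class probabilities z^s_j."""
    zs = np.clip(zs, 1e-12, 1.0 - 1e-12)      # numerical safety, not in the paper
    one_minus = 1.0 - zs
    prod_all = np.prod(one_minus)             # prod_i (1 - z^s_i)
    prod_except = prod_all / one_minus        # prod over i != j for each j
    unnormalized = zs * prod_except           # z^s_j * prod_{i != j} (1 - z^s_i)
    A_inv = unnormalized.sum() + prod_all     # A^{-1} as defined in the text
    zc = unnormalized / A_inv                 # evidence measures z^c_j
    outlier = prod_all / A_inv                # mass assigned to the ''outlier'' class
    return zc, outlier

zs = np.array([0.9, 0.2, 0.1])                # illustrative sigmoid measures, M = 3
zc, outlier = evidence_measure(zs)
print(zc, zc.sum() + outlier)                 # the masses sum to one
```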

Normalizing the exponential, sigmoid, and evidence measures into unity of sum, we obtain the class posterior probabilities satisfying the axiom of probabilities (classification in the closed world of $M$ defined classes):

$$P^x(\omega_j \mid d) = \frac{z_j^x}{\sum_{i=1}^{M} z_i^x}, \qquad (7)$$

where the superscript $x$ stands for $e$, $s$, or $c$. The normalization does not apply to the linear (log-likelihood) measure.

We can easily prove that, using the same scaling function,

$$P^e(\omega_j \mid d) = P^c(\omega_j \mid d), \qquad (8)$$

because

$$\begin{aligned}
z_j^c &= A \cdot \frac{1}{1 + \exp[-f_j(d)]} \prod_{i=1, i \neq j}^{M} \frac{\exp[-f_i(d)]}{1 + \exp[-f_i(d)]} \\
&= A \cdot \frac{\exp[f_j(d)]}{1 + \exp[f_j(d)]} \prod_{i=1, i \neq j}^{M} \frac{1}{1 + \exp[f_i(d)]} \\
&= \frac{A}{\prod_{i=1}^{M} \bigl(1 + \exp[f_i(d)]\bigr)} \exp\bigl(f_j(d)\bigr) \\
&= B \cdot \exp\bigl(f_j(d)\bigr) = B \cdot z_j^e.
\end{aligned}$$

Note that the constant $B$ is dependent on the classifier, so even though the normalized $z_j^e$ and $z_j^c$ are equivalent, the un-normalized measures are not necessarily equivalent.

Scaling functions

The scaling function plays a crucial role in determining the confidence values. An essential requirement for the scaling function is that the re-scaled classifier outputs distribute in a moderate range around zero. A good scaling function is desired such that the transformed confidence represents the probability that the input pattern belongs to a specific class. We categorize the scaling functions into two classes: those based on heuristics or statistical models (basic scaling functions) and those based on logistic regression.

Basic scaling functions

We have tested three scaling functions in our previous work [24]: global normalization, a function derived from multivariate Gaussian densities of classifier outputs, and a function from the one-dimensional Gaussian densities of Schurmann [23]. The global normalization (Global) simply re-scales the pooled outputs to zero mean and standard deviation 1:

$$f_j(d) = \frac{d_j - \mu_0}{\sigma_0}, \qquad (9)$$

where $\mu_0$ and $\sigma_0^2$ are the mean value and the variance of the classifier outputs pooled in one Gaussian distribution. The parameters are calculated on the classifier outputs of validation samples. For classifiers that output dissimilarity measures, the sign of the scaling function is reversed.

The parameters of the global scaling function do not consider the class information of the validation samples. It fulfills the most essential requirement of confidence transformation, i.e., transforming the output measurements of different classifiers to a similar scale. We intend to use the global normalization as the baseline to evaluate the performance of the other scaling functions that consider class information.
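As a small illustration of the Global scaling function (Eq. 9), the sketch below estimates the pooled mean and standard deviation on validation outputs and reverses the sign for dissimilarity outputs; the fabricated validation array and the helper name `fit_global_scaling` are assumptions of this sketch.

```python
import numpy as np

def fit_global_scaling(valid_outputs, dissimilarity=False):
    """Estimate mu_0 and sigma_0 by pooling all outputs of one classifier
    on the validation set (Eq. 9); reverse the sign for distance outputs."""
    mu0 = valid_outputs.mean()
    sigma0 = valid_outputs.std()
    sign = -1.0 if dissimilarity else 1.0
    return lambda d: sign * (d - mu0) / sigma0

rng = np.random.default_rng(0)
valid = rng.normal(loc=5.0, scale=2.0, size=(1000, 10))  # fabricated validation outputs
f_global = fit_global_scaling(valid)
print(f_global(np.array([8.0, 4.0, 5.0])))
```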

Assuming that for each class the classifier outputs undergo a multivariate Gaussian density with restrictions, it has been shown that the class posterior probabilities are calculated by soft-max [24], similar to the result of Denker and LeCun for neural networks, derived from different assumptions [16]. The Gaussian distribution of classifier outputs holds true for linear classifiers on Gaussian distributed classes. We generalize this assumption to other classifiers, including neural networks with a linear output layer. Assume that for a class $\omega_j$, the density of classifier outputs is a multivariate Gaussian with a common variance $\sigma_j^2$ in all dimensions:

$$p(d \mid \omega_j) = \frac{1}{(2\pi\sigma_j^2)^{M/2}} \exp\left[-\frac{\sum_{i=1}^{M}(d_i - m_{ji})^2}{2\sigma_j^2}\right],$$

where $m_{ji}$, $i = 1, \ldots, M$, are the mean values of the outputs over the patterns of class $\omega_j$. Considering that the outputs of a strong classifier are well ordered such that the genuine class generally has high similarity while the other classes have low similarities, we assume that all the classes share two distinct mean values, $\mu_+$ for the genuine class and $\mu_-$ for the other classes, such that for class $\omega_j$, $m_{jj} = \mu_+$ and $m_{ji} = \mu_-$, $i \neq j$. Further assuming equal variances $\sigma_j^2 = \sigma^2$ and equal prior probabilities among classes, the log-likelihood is (omitting the terms irrespective of class index)

$$f_j(d) = \frac{\mu_+ - \mu_-}{\sigma^2} d_j. \qquad (10)$$

Based on this, the class posterior probabilities are calculated by soft-max:

$$P(\omega_j \mid d) = \frac{\exp\left(\frac{\mu_+ - \mu_-}{\sigma^2} d_j\right)}{\sum_{i=1}^{M} \exp\left(\frac{\mu_+ - \mu_-}{\sigma^2} d_i\right)}.$$

The form of Eq. 10, however, is not a good scaling function, because the classifier output is not shifted. Intuitively, the output of class $\omega_j$ should be shifted such that the boundary between the samples of $\omega_j$ and those of the other classes (negative samples) becomes zero. Heuristically, we estimate the boundary as the midway between the sample mean $\mu_+$ on positive samples and the mean $\mu_r$ of $d_j$ on selected negative samples. When class $\omega_j$ is considered, the negative samples are selected such that on each selected sample from class $\omega_i$, $i \neq j$, the output $d_j$ is the closest runner-up to the output $d_i$. Eventually, the scaling function is given by

$$f_j(d) = \frac{\mu_+ - \mu_-}{\sigma^2}\left(d_j - \frac{\mu_+ + \mu_r}{2}\right). \qquad (11)$$

We refer to this scaling function as Mean-var, which uses the mean values and variances to shift and re-scale the classifier outputs.

Schurmann gave a sigmoid measure as the class posterior probability by assuming that the class output $d_j$ functions as a two-class discriminant that separates class $\omega_j$ from the others. Assuming one-dimensional Gaussian densities for the two meta-classes ($\omega_j$ and the others):

$$p(d_j \mid \omega_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j^+} \exp\left[-\frac{(d_j - \mu_j^+)^2}{2(\sigma_j^+)^2}\right], \qquad
p(d_j \mid \bar\omega_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j^-} \exp\left[-\frac{(d_j - \mu_j^-)^2}{2(\sigma_j^-)^2}\right],$$

the class posterior probability is

$$P(\omega_j \mid d_j) = \frac{p(d_j \mid \omega_j) P(\omega_j)}{p(d_j \mid \omega_j) P(\omega_j) + p(d_j \mid \bar\omega_j) P(\bar\omega_j)}
= \frac{1}{1 + \exp\left[-\left(a_2 d_j^2 + a_1 d_j - a_0 - \log\frac{P(\bar\omega_j)\,\sigma_j^+}{P(\omega_j)\,\sigma_j^-}\right)\right]}, \qquad (12)$$

where

$$a_2 = \frac{(\sigma_j^+)^2 - (\sigma_j^-)^2}{2(\sigma_j^+)^2(\sigma_j^-)^2}, \qquad
a_1 = \frac{\mu_j^+}{(\sigma_j^+)^2} - \frac{\mu_j^-}{(\sigma_j^-)^2}, \qquad
a_0 = \frac{1}{2}\left[\frac{(\mu_j^+)^2}{(\sigma_j^+)^2} - \frac{(\mu_j^-)^2}{(\sigma_j^-)^2}\right].$$

From the sigmoid form in Eq. 12 we extract the scaling function as

$$f_j(d) = a_2 d_j^2 + a_1 d_j - a_0 - \log\frac{P(\bar\omega_j)\,\sigma_j^+}{P(\omega_j)\,\sigma_j^-}, \qquad (13)$$

which is a quadratic function of the class output $d_j$. Assuming that the two Gaussians share a unique variance $\sigma_j^+ = \sigma_j^- = \sigma_j$, the scaling function becomes a linear form:

$$\begin{aligned}
f_j(d) &= \frac{\mu_j^+ - \mu_j^-}{\sigma_j^2} d_j - \frac{(\mu_j^+)^2 - (\mu_j^-)^2}{2\sigma_j^2} - \log\frac{P(\bar\omega_j)}{P(\omega_j)} \\
&= \frac{\mu_j^+ - \mu_j^-}{\sigma_j^2}\left[d_j - \left(\frac{\mu_j^+ + \mu_j^-}{2} + \log\frac{P(\bar\omega_j)}{P(\omega_j)} \cdot \frac{\sigma_j^2}{\mu_j^+ - \mu_j^-}\right)\right] \\
&= a\left[d_j - \left(b + \frac{c}{a}\right)\right]. \qquad (14)
\end{aligned}$$

We refer to the scaling functions of Eqs. 13 and 14 as Gaussian, with two variances and one variance, respectively. The parameters $\{\mu_j^+, \mu_j^-, \sigma_j^2\}$ are estimated on a validation sample set by maximum likelihood (ML). Treating the classification in an open world, where out-of-class patterns are present in addition to the $M$ defined classes, and assuming equal prior probabilities $P(\omega_j) = \frac{1}{M+1}$, the parameter $c = \log\frac{P(\bar\omega_j)}{P(\omega_j)} = \log M$.

Comparing Eqs. 14 and 11, we can see that, assuming shared class means and variances, the scaling coefficient $a$ of Eq. 14 is identical to that of Eq. 11. When calculating the posterior probabilities by soft-max (normalizing exponentials to unity of sum), the two scaling functions give the same (normalized) posterior probabilities. Consequently, the combination of classifiers gives the same classification results.
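One possible way to fit the one-variance Gaussian scaling function of Eq. 14 with shared class parameters is sketched below: $\mu_+$ is estimated from the outputs of the genuine classes, $\mu_-$ and a pooled variance from the remaining outputs, and $c = \log M$ under the open-world prior assumption stated above. The data layout (an N×M output matrix with integer labels) and the synthetic data are assumptions of this sketch, not details from the paper.

```python
import numpy as np

def fit_gaussian_scaling(valid_outputs, labels, num_classes):
    """Shared-parameter, one-variance Gaussian scaling (Eq. 14).
    valid_outputs: (N, M) classifier outputs on validation samples.
    labels: (N,) true class indices."""
    N, M = valid_outputs.shape
    target_mask = np.zeros((N, M), dtype=bool)
    target_mask[np.arange(N), labels] = True
    pos = valid_outputs[target_mask]           # outputs of the genuine class
    neg = valid_outputs[~target_mask]          # outputs of the other classes
    mu_pos, mu_neg = pos.mean(), neg.mean()
    # pooled ML variance of the two meta-classes (shared sigma)
    var = np.concatenate([pos - mu_pos, neg - mu_neg]).var()
    a = (mu_pos - mu_neg) / var
    b = (mu_pos + mu_neg) / 2.0
    c = np.log(num_classes)                    # open-world prior: P(omega_j) = 1/(M+1)
    return lambda d: a * (d - (b + c / a))     # f_j(d) = a (d_j - (b + c/a))

rng = np.random.default_rng(1)
M = 10
labels = rng.integers(0, M, size=500)
outputs = rng.normal(-1.0, 1.0, size=(500, M))
outputs[np.arange(500), labels] += 3.0         # genuine class gets higher scores
f_gauss = fit_gaussian_scaling(outputs, labels, M)
print(f_gauss(outputs[0]))
```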

Logistic regression

Logistic regression (LR) has been widely used in binary data analysis problems, and recently it has been applied to the confidence evaluation of classifiers [17, 18, 19] and to weighted combination [3, 34, 35]. In LR, a weighted linear combination of input variables is used to estimate the log-odds (logit) of the probability of response. Denoting the input variables by $x_i$, $i = 1, \ldots, m$, and the response by $Y$, the log-odds of the probability of the true class $p(x) = P(Y = 1 \mid x)$ is approximated by

$$\mathrm{logit}\, p(x) = \log\frac{p(x)}{1 - p(x)} = \sum_{i=1}^{m} \beta_i x_i + \beta_0.$$

Accordingly, the probability of correctness (PoC) is a sigmoid function:

$$p(x) = \frac{1}{1 + \exp\left[-\left(\sum_{i=1}^{m} \beta_i x_i + \beta_0\right)\right]}.$$

The weight parameters are estimated on a sample dataset by optimizing an objective function. Generalized to multi-class problems, the PoC is computed for each class and then classification or combination is based on the multiple class probabilities.

The objective function of multi-class LR is described as follows. Assume $m$ variables $x_1, \ldots, x_m$ are used to estimate the probabilities of each class. The PoC of $\omega_j$ is computed by

$$p_j(x) = \frac{1}{1 + \exp\left[-\left(\sum_{i=1}^{m} \beta_{ji} x_i + \beta_{j0}\right)\right]}.$$

On a validation sample set $(x^n, c^n)$, $n = 1, \ldots, N$, the target probability of the true class $c^n$ is $t_{c^n} = 1$ and that of the other classes is $t_j = 0$, $j \neq c^n$. The objective is to maximize the likelihood of the validation samples:

$$\max\; L = \prod_{n=1}^{N} \prod_{j=1}^{M} p_j^{t_j} (1 - p_j)^{1 - t_j},$$

and equivalently, to minimize the negative log-likelihood:

$$\min\; J = -\sum_{n=1}^{N} \sum_{j=1}^{M} \left[t_j \log p_j + (1 - t_j)\log(1 - p_j)\right]. \qquad (15)$$

The criterion in Eq. 15 is often referred to as cross-entropy (CE) [36].

In our implementation, we add a weight decay term to the criterion to alleviate overfitting to the validation data:

$$\min\; J = -\sum_{n=1}^{N} \sum_{j=1}^{M} \left[t_j \log p_j + (1 - t_j)\log(1 - p_j)\right] + \lambda \sum_{j=1}^{M} \sum_{i=1}^{m} \beta_{ji}^2, \qquad (16)$$

where $\lambda$ is a pre-specified coefficient for weight decay. The weights and biases are estimated by minimizing this criterion with stochastic gradient descent [37].

For the confidence evaluation of classifier outputs, the variables are selected from the classifier outputs, and the scaling function is extracted from the sigmoid form of the PoC:

$$f_j(d) = \sum_{i=1}^{m} \beta_{ji} x_i + \beta_{j0}.$$

The scaling function is combined with all the four activation functions to give four confidence measures.


Depending on the selected variables, we can generate various scaling functions in LR. We consider the following three schemes in our experiments. In the first scheme, which we call LR-1, the scaling function of each class uses its own output as the only variable:

$$f_j(d) = \beta_{j1} d_j + \beta_{j0}. \qquad (17)$$

This function has the identical form to that of the Gaussian scaling function in Eq. 14. To generate a scaling function corresponding to Eq. 13, we use the polynomial terms of the class output to give a function called LR-1P:

$$f_j(d) = \beta_{j1} d_j + \beta_{j2} d_j^2 + \beta_{j0}. \qquad (18)$$

In the third scheme, called LR-2, the scaling function $f_j(d)$ of class $\omega_j$ uses the output $d_j$ and the closest runner-up $d_r$:

$$f_j(d) = \beta_{j1} d_j + \beta_{j2} d_r + \beta_{j0}. \qquad (19)$$

Using multiple variables will surely improve the precision of the probability approximation, but it does not necessarily give higher combination performance due to overfitting. Considering that the classifier outputs distribute in a highly restricted subspace, a small number of variables may suffice to give precise probability estimates. In particular, when using one variable as in Eq. 17, the scaling function and the PoC are monotonic such that the order of classes is preserved. This property is desirable for high accuracy classifiers. The scaling functions also vary depending on whether the class parameters are shared or not. Parameter sharing also helps alleviate overfitting. We will experiment with shared as well as class-specific parameters in LR.

In our implementation of LR, the classifier outputs are first re-scaled by global normalization to transform the measurement values to a moderate range. In parameter estimation by stochastic gradient descent, the initial values of the parameters are set to $\beta_{j1} = 1$ and $\beta_{ji} = 0$, $i \neq 1$.
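A sketch of the simplest logistic-regression scaling function (LR-1 with shared parameters, Eqs. 16-17): after global normalization, a single weight and bias are fitted by stochastic gradient descent on the weight-decayed cross entropy, starting from $\beta_{j1} = 1$ and $\beta_{j0} = 0$ as stated above. The learning rate, epoch count, and synthetic data are assumptions of this sketch, not values from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_lr1_shared(valid_outputs, labels, lam=1e-4, lr=0.01, epochs=20, seed=0):
    """LR-1 with shared parameters: f_j(d) = b1 * d_j + b0 (Eq. 17),
    fitted by SGD on the weight-decayed cross entropy of Eq. 16.
    valid_outputs are assumed to be globally normalized already."""
    rng = np.random.default_rng(seed)
    N, M = valid_outputs.shape
    targets = np.zeros((N, M))
    targets[np.arange(N), labels] = 1.0        # t_j = 1 for the true class, else 0
    b1, b0 = 1.0, 0.0                          # initialization stated in the text
    for _ in range(epochs):
        for n in rng.permutation(N):
            d, t = valid_outputs[n], targets[n]
            p = sigmoid(b1 * d + b0)           # PoC of every class
            grad = p - t                       # gradient of CE w.r.t. each logit
            b1 -= lr * (np.dot(grad, d) + 2.0 * lam * b1)   # weight decay on b1 only
            b0 -= lr * grad.sum()
    return lambda d: b1 * d + b0               # the fitted scaling function

# synthetic, globally normalized outputs for illustration
rng = np.random.default_rng(2)
M = 10
labels = rng.integers(0, M, size=400)
outputs = rng.normal(-0.5, 1.0, size=(400, M))
outputs[np.arange(400), labels] += 2.0
f_lr1 = fit_lr1_shared(outputs, labels)
print(sigmoid(f_lr1(outputs[0])))
```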

Summary of scaling/activation functions

We described a number of activation functions and scaling functions above. Each activation function can be combined with a scaling function to form an instantiation of a confidence transformation method. For better understanding and comparison of the methods, the scaling functions and activation functions are summarized in Table 1 and Table 2, respectively.

Table 1 gives six scaling functions. The Global scaling function is based on the assumption that the classifier outputs are pooled in a one-dimensional Gaussian density. The Mean-var scaling function is based on a multivariate Gaussian density assumption with restrictions. The Gaussian scaling function is based on one-dimensional Gaussian densities of two meta-classes (one class and the others). It is linear when the two meta-classes share a common variance (1-variance), and nonlinear otherwise (2-variance). The parameters of the LR scaling functions are estimated by minimizing the cross entropy (CE) of Eq. 16. The Gaussian and LR scaling functions can use either shared or class-specific parameters.

Table 2 gives four activation functions corresponding to four confidence types. The linear activation function approximates the log-likelihood. When the scaling function is linear, it is also linear with respect to the classifier outputs. The exponential measure, sigmoid measure, and evidence measure approximate the Bayesian likelihood, two-class probability, and multi-class probability, respectively. They can either be normalized to unity of sum or not.

Experimental Setup

To evaluate the performance of the confidence transformation methods for classifier combination, we conducted experiments in handwritten digit recognition with variable classifier settings. Handwritten digit recognition is an important problem in document analysis applications and has been intensively studied by the pattern recognition community. In our previous works, we have made many efforts and have achieved high accuracies [38, 39, 40]. Nevertheless, after trying many algorithms, it is hard to further improve the accuracy using a single classifier. In particular, the accuracy on low quality images (large distortion, image degradation) remains at a low level (lower than 90%). So, we are now resorting to multiple classifier combination to break through this bottleneck.

We experiment with seven digit classifiers, including four neural classifiers and three dissimilarity-based classifiers. The specifications of the individual classifiers are given in Table 3.

Table 1 Summary of scaling functions (SFs)

SF type    Assumption               Equation   Linear/nonlinear   Param sharing
Global     pooled Gaussian          (9)        linear             shared
Mean-var   multivariate Gaussian    (11)       linear             shared
Gaussian   1D Gaussian, 1-variance  (14)       linear             shared or not
Gaussian   1D Gaussian, 2-variance  (13)       nonlinear          shared or not
LR-1       min CE (16)              (17)       linear             shared or not
LR-1P      min CE (16)              (18)       nonlinear          shared or not
LR-2       min CE (16)              (19)       linear             shared or not

Table 2 Summary of activation functions (AFs)

AF type       Approximate           Equation   Normalized
Linear        log-likelihood        (5)        N/A
Exponential   likelihood            (3)        (7)
Sigmoid       2-class probability   (4)        (7)
Evidence      M-class probability   (6)        (7)


The features used by the classifiers are the blurred chaincode feature (blr), deslant chaincode feature (des), normalization-cooperated chaincode feature (ncf, a feature extracted from the original image with normalization skipped [41]), and gradient direction feature (grd). The prefix ''e-'' denotes 8-direction features, while the other features are of 4-orientation; the suffix ''-p'' means that a profile structure feature is added to enhance the feature representation [42], and ''-g'' denotes feature extraction from gray-scale normalized images. Various aspect ratio functions are used in the coordinate mapping of image normalization [40, 43].

The classifier structures are the single-layer perceptron (SLP, trained by MSE minimization), radial basis function (RBF) classifier [44], polynomial classifier (PC) [45], learning vector quantization (LVQ) classifier [46], discriminative learning QDF (DLQDF) [47], and the nearest mean (N-Mean) classifier. The outputs of the DLQDF classifier represent negative log-likelihood, and the outputs of the LVQ and nearest mean classifiers are square Euclidean distances. For the neural classifiers, the output values before sigmoid squashing are used as the inputs of the scaling functions.

The seven classifiers were trained with two datasets, one collected by Hitachi, Ltd., and one extracted from the NIST special database 19 (NIST-SD19) [48]. The Hitachi dataset contains 164,158 digit samples. The NIST dataset contains 66,214 digit samples written by 600 writers [38]. The classification performance is evaluated on two test datasets: Test-1 contains 9,725 samples collected in Japan (ATM transfer forms) and Test-2 contains 36,473 difficult samples that were mis-classified or rejected by an old recognizer of Hitachi. The images in Test-2 are either highly distorted or degraded. The recognition accuracies of the individual classifiers are given in Table 4. We can see that the highest accuracy on Test-2 is lower than 90%.

To estimate the parameters of confidence transformation, we made a validation dataset containing 30,000 samples collected in Japan (Valid-1) and 10,000 samples from NIST-SD19 (Valid-2). In the validation dataset, each class has the same number of samples. The samples of Valid-1 were collected in a similar environment to Test-1. The samples of Valid-2 were the leading samples of each class in a previously generated validation dataset of 200 writers [38].

To investigate the performance of combination in variable classifier settings, we test four classifier subsets, CS1={E0,E1,E2,E3}, CS2={E4,E5,E6}, CS3={E1,E2,E3,E4,E5}, and CS4={E0,E1,E2,E3,E4,E5,E6}. CS1 contains four neural classifiers, whose (sigmoid) outputs approximate the class posterior probabilities in MSE training. We hope to see whether the re-evaluation of confidences benefits the combination performance or not. CS2 contains the classifiers that output distance measures, which distribute in much larger ranges than neural network outputs. CS3 contains the classifiers that give high accuracies (strong classifiers), while CS4 contains both high accuracy classifiers and low accuracy ones (weak classifiers, E0 and E6).

To evaluate the performance of combination based on confidence transformation, we give the accuracies of abstract-level combination in Table 5 as a baseline. In combination by the AND-rule, the classification is considered correct when all the individual classifiers give the correct result, while by the OR-rule, the classification result is correct when at least one individual classifier gives the correct result. We can take the accuracy of OR-combination as the upper bound that a practical combination method can achieve. The accuracies of the best individual classifiers and those of plurality vote are the targets to surpass.
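For reference, the abstract-level baselines reported in Table 5 depend only on crisp decisions; a sketch with fabricated decisions (the 90% per-classifier accuracy is an arbitrary assumption):

```python
import numpy as np
from collections import Counter

def abstract_level_rates(decisions, truth):
    """decisions: (K, N) crisp class labels from K classifiers on N samples;
    truth: (N,) true labels. Returns AND, OR, and plurality-vote accuracies."""
    correct = decisions == truth                        # (K, N) boolean
    and_acc = correct.all(axis=0).mean()                # all classifiers correct
    or_acc = correct.any(axis=0).mean()                 # at least one correct
    plurality = np.array([Counter(col).most_common(1)[0][0] for col in decisions.T])
    plu_acc = (plurality == truth).mean()
    return and_acc, or_acc, plu_acc

rng = np.random.default_rng(3)
truth = rng.integers(0, 10, size=1000)
# three fabricated classifiers, each right about 90% of the time
decisions = np.stack([np.where(rng.random(1000) < 0.9, truth,
                               rng.integers(0, 10, size=1000)) for _ in range(3)])
print(abstract_level_rates(decisions, truth))
```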

As to the combination rule, we only present the results of the sum-rule in this paper for saving space. We have tested other fixed rules in the combination of confidences but the performance was inferior to that of the sum-rule. Some results of the product-rule and nearest mean have been given in [24].

Table 3 Specifications of seven digit recognizers

Expert   Feature   Aspect   Dimen   Classifier   Training
E0       e-ncf-p   ARF0     233     SLP          Hitachi
E1       ncf-p     ARF6     133     RBF          Hitachi
E2       e-grd-g   ARF9     200     PC           Hitachi
E3       e-grd     ARF8     200     PC           NIST
E4       e-grd     ARF8     200     LVQ          Hitachi
E5       des       ARF8     100     DLQDF        NIST
E6       blr       ARF5     100     N-Mean       Hitachi

Table 4 Recognition rates (%) of individual classifiers

Expert   Validate   Test-1   Test-2
E0       96.19      99.01    77.55
E1       97.13      99.53    86.02
E2       98.92      99.77    89.93
E3       98.38      98.25    82.90
E4       98.39      99.64    87.58
E5       98.61      98.65    82.43
E6       88.83      92.81    60.44

Table 5 Accuracies (%) of abstract-level combination on four classifier settings

Expert          Combine     Test-1   Test-2
E{0-3} (CS1)    AND         97.38    64.86
                OR          99.94    97.25
                Best        99.77    89.93
                Plurality   99.74    88.45
E{4-6} (CS2)    AND         92.03    51.69
                OR          99.93    96.39
                Best        99.64    87.58
                Plurality   99.29    83.94
E{1-5} (CS3)    AND         97.23    66.28
                OR          99.96    97.67
                Best        99.77    89.93
                Plurality   99.83    91.42
E{0-6} (CS4)    AND         91.31    45.69
                OR          99.97    98.17
                Best        99.77    89.93
                Plurality   99.78    89.67


The comparison of various combination rules based on confidence transformation will be treated in the future.

Illustrative Examples

We look into the histograms of classifier outputs and the transformed confidence values to validate the relevance of the confidence transformation methods.

Histograms of classifier outputs

The confidence evaluation method of Schurmann assumes that for each class output, the one-dimensional densities of the two meta-classes are Gaussian. Though this assumption does not hold strictly, it provides an elegant way of confidence parameter estimation by maximum likelihood (ML). To validate the Gaussian assumption, we calculate the histograms of selected outputs (outputs of classes ''0'' and ''1'') of two neural classifiers (E0, E2) and two distance classifiers (E4, E6). The normalized (to unity of sum) histograms are shown in Fig. 2 and Fig. 3, respectively. We can see that the distributions of the two meta-classes (class 0 and non-0, or class 1 and non-1) are basically uni-modal, and are similar to Gaussian, though the variances are variable.

Confidence values of a pattern

To view the plausibility of the confidence measures, we calculate the sigmoid measures of four selected scaling functions (with shared class parameters) as well as the original sigmoid outputs of the neural classifiers (E0 and E2). For the other scaling functions, we observed that the confidence values of Mean-var are similar to those of Gaussian, and the values of LR-1P are similar to those of LR-1.

Figure 4 shows a digit pattern ''1''. The confidence values on this pattern are shown in Table 6.

Fig. 2 Histograms of outputs of classifiers E0 and E2

Fig. 3 Histograms of outputs of classifiers E4 and E6


First, let us see the confidence values of classifier E0, which mis-classifies the pattern to class ''2''. We can see that the class probabilities assigned by the scaling function Global are too ''soft'' in the sense that the probabilities of different classes are competing. Since the classifier outputs are not appropriately shifted, the confidence values do not represent the class probabilities well. The scaling functions Gaussian and LR-1 assign class probabilities similar to the original sigmoid outputs. The LR with two variables, LR-2, nevertheless, gives over-estimated probabilities, as is more evident for classifiers E4 and E6.

For classifiers E0, E4 and E6, the scaling functions Gaussian and LR-1 assign low probabilities to all the defined classes such that the sum of probabilities is much smaller than one. This is because the input pattern deviates from the density model or templates of the classifier. For classifier combination, this property is preferable in that when a classifier cannot reliably classify a pattern, it should assign low confidences to all classes. Meanwhile, if another classifier assigns high confidence to a certain class, the decision of this high confidence classifier will dominate in combination and lead to the correct final classification.

Combination Results

We compare the performance of confidence transformation methods (all with shared class parameters) in combining four classifier subsets. To investigate the effects of parameter sharing, we then compare the combination performance of shared and class-specific parameters on a classifier subset. Last, we investigate the effects of small validation sample sizes on confidence parameter estimation.

Combining variable sets of classifiers

In comparing the performance of confidence transformation methods in the four classifier settings, all the scaling functions use shared class parameters. Further, the scaling function Gaussian uses one unique variance. To justify the preference of parameter sharing, we will give the combination results of class-specific parameters later. For neural classifiers, the raw outputs (before sigmoid squashing) are also transformed by the four activation functions directly to give confidence measures. In combining the distance classifiers, we also use the reciprocal of the distance values as confidence.

The classification accuracies of combining the four neural classifiers (CS1) are listed in Table 7. We can see that by transforming the raw outputs using the sigmoid, the combination performance is improved. The accuracies of combining the three distance classifiers (CS2), combining five classifiers (CS3), and combining seven classifiers (CS4) are listed in Table 8, Table 9, and Table 10, respectively.

From the results of the four classifier settings, we can draw some common observations. The first observation is that normalizing the exponential, sigmoid, and evidence measures to unity of sum evidently deteriorates the combination performance. This is because the nature of the confidence measures as data-dependent weights is lost in normalization. When an input pattern deviates from the density model of a classifier, forcing the class probabilities to sum up to unity has a negative effect. The normalized confidence measures give lower combination accuracies in most cases, except that the accuracy on Test-1 is marginally higher occasionally. Nevertheless, the result on Test-2 is more reliable because Test-2 contains many more samples than Test-1.

Fig. 4 A sample pattern of digit ‘‘1’’

Table 6 Confidence values (sigmoid × 1000) of the digit pattern in Fig. 4

Classifier   Scaling    C0    C1    C2    C3    C4    C5    C6    C7    C8    C9
E0           Original   0     36    367   0     6     0     0     0     0     0
             Global     369   712   778   313   664   495   318   442   323   219
             Gaussian   0     88    303   0     36    1     0     0     0     0
             LR-1       0     65    399   0     16    0     0     0     0     0
             LR-2       0     130   758   0     52    2     0     0     0     0
E2           Original   0     967   29    0     12    47    1     0     0     0
             Global     401   869   654   453   616   674   519   471   412   260
             Gaussian   0     998   0     0     0     0     0     0     0     0
             LR-1       0     990   11    0     3     21    0     0     0     0
             LR-2       0     992   3     0     1     5     0     0     0     0
E4           Global     378   643   519   403   482   520   248   198   254   258
             Gaussian   0     37    6     1     3     6     0     0     0     0
             LR-1       0     6     0     0     0     0     0     0     0     0
             LR-2       0     952   23    1     8     24    0     0     0     0
E6           Global     140   474   356   110   370   462   233   152   237   150
             Gaussian   0     12    3     0     4     11    0     0     0     0
             LR-1       0     4     0     0     0     3     0     0     0     0
             LR-2       0     418   61    0     78    322   4     0     4     0


The second observation is that, comparing the four confidence types, the exponential, sigmoid, and evidence measures mostly outperform the linear (log-likelihood) measure. While the exponential measure outperforms the sigmoid measure and evidence measure in some cases, its performance is not stable.

Table 7 Accuracies (%) of combining four neural classifiers (CS1)

Scaling function   Activation   Un-normalized       Normalized
                                Test-1   Test-2     Test-1   Test-2
Original           Linear       99.76    90.10      -        -
                   Exponen      99.78    91.05      99.78    90.23
                   Sigmoid      99.77    91.27      99.75    89.98
                   Evidence     99.79    91.29      99.78    90.23
Global             Linear       99.77    90.55      -        -
                   Exponen      99.78    91.41      99.78    91.27
                   Sigmoid      99.75    90.13      99.74    90.12
                   Evidence     99.78    91.24      99.78    91.27
Mean-var           Linear       99.78    91.42      -        -
                   Exponen      99.80    91.16      99.81    90.48
                   Sigmoid      99.81    91.45      99.80    90.65
                   Evidence     99.81    91.38      99.81    90.48
Gaussian           Linear       99.78    91.42      -        -
                   Exponen      99.80    91.73      99.81    90.48
                   Sigmoid      99.80    91.81      99.80    90.58
                   Evidence     99.80    91.82      99.81    90.48
LR-1               Linear       99.78    90.66      -        -
                   Exponen      99.81    91.74      99.78    90.37
                   Sigmoid      99.78    91.34      99.77    90.27
                   Evidence     99.79    91.58      99.78    90.37
LR-1P              Linear       99.77    90.91      -        -
                   Exponen      99.80    91.68      99.79    90.44
                   Sigmoid      99.79    91.37      99.78    90.30
                   Evidence     99.79    91.46      99.79    90.44
LR-2               Linear       99.78    91.12      -        -
                   Exponen      99.80    91.65      99.79    90.31
                   Sigmoid      99.80    90.63      99.79    90.60
                   Evidence     99.81    90.76      99.79    90.31

Table 8 Accuracies (%) of combining three distance classifiers (CS2)

Scaling function   Activation   Un-normalized       Normalized
                                Test-1   Test-2     Test-1   Test-2
Reciprocal         Linear       98.75    82.87      99.54    87.96
Global             Linear       99.51    86.92      -        -
                   Exponen      99.61    88.40      99.53    88.29
                   Sigmoid      99.40    86.43      99.29    86.09
                   Evidence     99.54    88.36      99.53    88.29
Mean-var           Linear       99.62    88.34      -        -
                   Exponen      99.50    88.24      99.66    89.76
                   Sigmoid      99.61    89.01      99.60    88.94
                   Evidence     99.64    89.90      99.66    89.76
Gaussian           Linear       99.62    88.34      -        -
                   Exponen      99.59    89.09      99.66    89.76
                   Sigmoid      99.62    89.35      99.65    89.21
                   Evidence     99.65    89.97      99.66    89.76
LR-1               Linear       99.63    88.99      -        -
                   Exponen      99.53    88.92      99.54    88.71
                   Sigmoid      99.51    88.87      99.50    88.21
                   Evidence     99.52    89.46      99.54    88.71
LR-1P              Linear       99.66    89.54      -        -
                   Exponen      99.56    88.84      99.56    88.98
                   Sigmoid      99.52    88.82      99.50    88.57
                   Evidence     99.53    89.31      99.56    88.98
LR-2               Linear       99.65    89.49      -        -
                   Exponen      99.60    90.20      99.48    87.27
                   Sigmoid      99.56    88.23      99.53    88.37
                   Evidence     99.50    88.44      99.48    87.27

Table 9 Accuracies (%) of combining five strong classifiers (CS3)

Scaling function   Activation   Un-normalized       Normalized
                                Test-1   Test-2     Test-1   Test-2
Global             Linear       99.78    92.04      -        -
                   Exponen      99.81    92.68      99.79    92.50
                   Sigmoid      99.76    91.65      99.76    91.55
                   Evidence     99.80    92.52      99.79    92.50
Mean-var           Linear       99.80    92.04      -        -
                   Exponen      99.85    92.29      99.81    92.14
                   Sigmoid      99.84    92.75      99.83    91.91
                   Evidence     99.84    92.84      99.81    92.14
Gaussian           Linear       99.80    92.04      -        -
                   Exponen      99.83    92.23      99.81    92.14
                   Sigmoid      99.84    92.89      99.83    92.04
                   Evidence     99.81    92.87      99.81    92.14
LR-1               Linear       99.76    92.05      -        -
                   Exponen      99.81    92.75      99.83    92.42
                   Sigmoid      99.83    92.87      99.80    92.14
                   Evidence     99.83    92.97      99.83    92.42
LR-1P              Linear       99.81    92.29      -        -
                   Exponen      99.81    92.76      99.83    92.51
                   Sigmoid      99.83    92.87      99.83    92.25
                   Evidence     99.84    92.98      99.83    92.51
LR-2               Linear       99.79    92.43      -        -
                   Exponen      99.84    92.50      99.84    92.29
                   Sigmoid      99.83    92.61      99.83    92.60
                   Evidence     99.83    92.64      99.84    92.29

Table 10 Accuracies (%) of combining all the seven classifiers (CS4)

Scaling function   Activation   Un-normalized       Normalized
                                Test-1   Test-2     Test-1   Test-2
Global             Linear       99.77    90.52      -        -
                   Exponen      99.79    91.38      99.78    91.25
                   Sigmoid      99.72    90.06      99.72    89.82
                   Evidence     99.78    91.27      99.78    91.25
Mean-var           Linear       99.79    91.48      -        -
                   Exponen      99.79    91.32      99.83    91.18
                   Sigmoid      99.80    91.72      99.84    91.12
                   Evidence     99.80    91.89      99.83    91.18
Gaussian           Linear       99.79    91.48      -        -
                   Exponen      99.80    91.82      99.83    91.18
                   Sigmoid      99.81    92.14      99.80    91.13
                   Evidence     99.83    92.22      99.83    91.18
LR-1               Linear       99.79    91.09      -        -
                   Exponen      99.80    92.07      99.81    91.06
                   Sigmoid      99.81    92.06      99.81    90.98
                   Evidence     99.81    92.13      99.81    91.06
LR-1P              Linear       99.80    91.39      -        -
                   Exponen      99.79    91.94      99.81    91.16
                   Sigmoid      99.83    92.02      99.81    91.07
                   Evidence     99.81    92.14      99.81    91.16
LR-2               Linear       99.79    91.37      -        -
                   Exponen      99.78    91.60      99.79    90.72
                   Sigmoid      99.81    91.15      99.79    91.16
                   Evidence     99.81    91.23      99.79    90.72


The third observation is that the evidence measure mostly outperforms the sigmoid measure, especially in combining CS2. This is because the sigmoid measure, as a two-class posterior probability, does not represent multi-class probability very well. By combining the evidences of the sigmoid measures, the performance of classifier combination is improved.

In comparing the performances of the various scaling functions, let us focus on the accuracies of Test-2. In combining CS1 (Table 7), the highest accuracies of Test-2 are given by the scaling function Gaussian. As to logistic regression, high combination accuracies are given by the scaling functions LR-1 and LR-1P. It is noteworthy that the combination accuracies of Gaussian, LR-1, and LR-1P are evidently higher than those of the original classifier outputs without re-scaling. This indicates that the re-evaluation of confidences is beneficial for combining neural classifiers.

In combining CS2 (Table 8), the scaling functions Mean-var, Gaussian and logistic regression give high accuracies on the two datasets. The highest accuracy of Test-2 is given by the exponential measure with LR-2. With the LR scaling functions, the linear measure also gives high combination accuracies. This indicates that the distance measures are re-scaled appropriately. The reciprocal of distance values, on the other hand, gives very poor combination performance.

In combining CS3 (Table 9), the scaling functions Mean-var, Gaussian, LR-1 and LR-1P give high accuracies. In combining CS4 (Table 10), these four scaling functions also give high accuracies when compared to the other scaling functions.

The scaling functions that perform well in all the four classifier settings are Gaussian, LR-1, and LR-1P. Comparing the results of combining CS3 and combining CS4, it is evident that the combination of CS3 gives higher accuracies despite CS3 being a subset of CS4. This indicates that increasing the number of constituent classifiers does not necessarily improve the combination performance. When combining the classifiers using an un-weighted rule, the weak constituent classifiers bring a negative effect. We expect that a weighted combination rule can mitigate this effect.

In all the four settings of classifiers, the global scaling function performs fairly well, though it is outperformed by the other scaling functions that consider class information in scaling parameter estimation.

Comparing the accuracies of confidence-based combination to those of abstract-level combination by plurality vote (in Table 5), we can see that the accuracy of confidence-based combination is mostly higher, for either un-normalized or normalized confidence measures. The accuracies of confidence-based combination are also mostly higher than the accuracies of the best individual classifier. It is noteworthy that the confidence-based combination of the five strong classifiers (CS3) yields the highest accuracies on the two test datasets. Comparing the highest accuracies, 99.85% and 92.98%, to those of the best individual classifiers, 99.77% and 89.93%, the error rates were reduced by 34.8% and 30.3%, respectively. If we exclude the irreducible errors of OR-combination, the error reduction rates are

$$\frac{99.85 - 99.77}{99.96 - 99.77} = 42.1\% \qquad \text{and} \qquad \frac{92.98 - 89.93}{97.67 - 89.93} = 39.4\%,$$

respectively.

Effects of parameter sharing

To justify that sharing class parameters in confidence transformation is preferable, we present the results of combining the five strong classifiers (CS3) with variable degrees of parameter sharing.

Table 11 shows the combination accuracies of logistic regression with class-specific weights and biases. Comparing the results with those in Table 9, we can see that the accuracies of class-specific parameters do not differ significantly from those of shared parameters. On Test-2, the class-specific parameters give lower accuracies if we ignore the results of normalized confidences (which are inferior to those without normalization). This indicates that, regarding the combination performance, sharing class parameters in logistic regression is preferable.

Table 12 shows the results of the scaling function Gaussian with variable parameter sharing. We can see that the results of the third block (two-variance, shared means and variances) and the sixth block (class-specific variances) are comparable to those of totally shared parameters in Table 9, while the accuracies of the other paradigms are lower than or comparable to the accuracies of totally shared parameters. In summary, the combination performance of sharing all class parameters is among the best.

Table 11 Accuracies (%) of combining CS3: logistic regression without parameter sharing

Scaling function   Activation   Un-normalized       Normalized
                                Test-1   Test-2     Test-1   Test-2
LR-1               Linear       99.79    91.70      -        -
                   Exponen      99.78    92.45      99.79    92.06
                   Sigmoid      99.83    92.64      99.77    91.76
                   Evidence     99.85    92.80      99.79    92.06
LR-1P              Linear       99.79    92.06      -        -
                   Exponen      99.78    92.53      99.81    92.25
                   Sigmoid      99.83    92.69      99.78    91.94
                   Evidence     99.84    92.89      99.81    92.25
LR-2               Linear       99.81    92.42      -        -
                   Exponen      99.80    92.44      99.84    92.39
                   Sigmoid      99.83    92.52      99.84    92.50
                   Evidence     99.84    92.63      99.84    92.39


Effects of validation sample size

To investigate the sensitivity of the combination performance to the sample size used for scaling parameter estimation, we have experimented with variable numbers of validation samples. From the whole set of 40,000 validation samples (30,000 in Valid-1 and 10,000 in Valid-2), the leading samples of Valid-1 and Valid-2 are taken to compose a validation subset according to the ratio 3:1. We estimated the parameters of the scaling functions on 40,000, 10,000, 2,000, 500, and 100 samples, and found that the accuracies of classifier combination changed very slightly. As an extreme case, the combination accuracies with parameters estimated on 100 validation samples are shown in Table 13. For the scaling functions except Gaussian, the parameters are shared among classes. Two schemes were experimented with for Gaussian: totally shared parameters, and class-specific one-variance with shared means.

Comparing the results of Table 13 with the corresponding accuracies in Table 9 and Table 12, we can see that, except for LR-2, the changes in accuracy on the two test datasets are very small. For LR-2, the test accuracies even increase with the small validation sample size. This is perhaps because the distribution of the validation samples changed in a way that favors the test data.

Conclusion

This paper presented the results of classifier combination based on various confidence transformation methods. In our formulation, each confidence transformation method is decomposed into the combination of an activation function and a scaling function. The parameter estimation of the scaling functions is crucial in determining the confidence values. The combination results in various classifier settings show that the scaling function based on one-dimensional Gaussian density modeling and the logistic regression with one variable perform very well in most cases. For most scaling functions, sharing parameters among classes is preferable, and the combination performance is highly robust against the validation sample size in confidence parameter estimation.

Table 12 Accuracies (%) of combining CS3: Gaussian scaling with variable parameter sharing

Parameter sharing           Activation   Un-normalized        Normalized
                            function     Test-1    Test-2     Test-1    Test-2

Two-variance,               Linear       99.78     91.39
class means and variances   Exponen      99.79     91.89      99.81     92.01
                            Sigmoid      99.85     92.63      99.79     91.82
                            Evidence     99.84     92.65      99.81     92.01

One-variance,               Linear       99.75     91.82
class means and variances   Exponen      99.77     91.96      99.74     92.28
                            Sigmoid      99.78     92.58      99.73     92.06
                            Evidence     99.76     92.64      99.74     92.28

Two-variance,               Linear       99.75     91.69
shared means and variances  Exponen      99.80     92.18      99.80     92.24
                            Sigmoid      99.83     92.87      99.81     92.10
                            Evidence     99.81     92.80      99.80     92.24

Two-variance,               Linear       99.78     91.39
shared means,               Exponen      99.79     91.89      99.81     92.01
class variances             Sigmoid      99.85     92.63      99.79     91.82
                            Evidence     99.84     92.65      99.81     92.01

Two-variance,               Linear       99.73     91.71
shared variances,           Exponen      99.73     92.07      99.77     92.21
class means                 Sigmoid      99.79     92.70      99.78     92.07
                            Evidence     99.79     92.67      99.77     92.21

One-variance,               Linear       99.81     92.01
shared means,               Exponen      99.84     92.09      99.83     92.21
class variances             Sigmoid      99.83     92.90      99.81     92.08
                            Evidence     99.85     92.94      99.83     92.21

One-variance,               Linear       99.74     91.82
shared variances,           Exponen      99.74     91.84      99.75     92.00
class means                 Sigmoid      99.77     92.39      99.73     91.92
                            Evidence     99.77     92.34      99.75     92.00

Table 13 Accuracies (%) of combining CS3 with confidence parameters estimated on 100 samples

Scaling       Activation   Un-normalized        Normalized
function      function     Test-1    Test-2     Test-1    Test-2

Mean-var      Linear       99.80     92.07
              Exponen      99.84     92.30      99.80     92.17
              Sigmoid      99.84     92.75      99.83     91.81
              Evidence     99.84     92.84      99.80     92.17

Gaussian      Linear       99.80     92.07
              Exponen      99.81     92.29      99.80     92.17
              Sigmoid      99.84     92.91      99.81     91.98
              Evidence     99.84     92.94      99.80     92.17

Gaussian      Linear       99.79     92.06
(class var)   Exponen      99.81     92.02      99.81     92.19
              Sigmoid      99.85     92.87      99.80     92.04
              Evidence     99.83     92.83      99.81     92.19

LR-1          Linear       99.77     92.05
              Exponen      99.81     92.79      99.83     92.68
              Sigmoid      99.83     92.83      99.83     92.35
              Evidence     99.83     93.00      99.83     92.68

LR-1P         Linear       99.81     92.43
              Exponen      99.81     92.49      99.83     92.83
              Sigmoid      99.83     92.89      99.84     92.60
              Evidence     99.81     92.99      99.83     92.83

LR-2          Linear       99.79     92.39
              Exponen      99.84     92.58      99.84     92.69
              Sigmoid      99.84     92.76      99.84     92.76
              Evidence     99.84     92.81      99.84     92.69


As to confidence types, the evidence measure shows promise as a multi-class probability estimate, while the exponential measure and the sigmoid measure also perform well.

Remaining work includes trained combination based on confidence transformation, classifier selection, heterogeneous confidence transformation methods for the constituent classifiers, etc. We expect that trained combination can further improve the performance and that confidence transformation remains beneficial. The results of combining mixed strong and weak classifiers indicate that selecting a subset of classifiers is necessary to achieve high classification performance. The variation in performance of the confidence transformation methods across classifier settings implies that using different confidence transformation methods for the constituent classifiers may yield better combination performance. We still need to investigate, theoretically and experimentally, which confidence type and which scaling function are optimal for a given type of classifier.

Originality and Contributions

Previous work on classifier combination has mostly circumvented the confidence transformation of classifier outputs, although this transformation has been treated in other contexts. This paper presents a comprehensive comparison of confidence transformation methods for classifier combination. We decompose a confidence transformation method into a scaling function and an activation function such that the transformation strategy can be designed flexibly. The activation function corresponds to one of four confidence types, while the scaling function re-scales the classifier outputs and determines the confidence values. Among the four confidence types, the sigmoid measure is widely used as an estimate of class probability, yet we show that the evidence combination of sigmoid measures is more relevant to multi-class probability estimation and gives higher combination performance. The comparison of scaling functions favors the one-dimensional Gaussian density modeling of classifier outputs and the logistic regression.

The comparison of scaling functions and confidence types is significant for pattern recognition engineers designing an appropriate confidence transformation method for practical applications. The results in handwritten digit recognition demonstrate the efficiency of confidence transformation for classifier combination: with appropriate transformation of the classifier outputs, a simple combination rule can yield high classification accuracy. In our experiments, the error rates on two test datasets were reduced by around 40% compared to the best individual classifier. The confidence transformation methods are readily applicable to other applications as well.

About the Authors

Cheng-Lin Liu received the B.S. degree in electronic engineering from Wuhan University, the M.E. degree in electronic engineering from Beijing Polytechnic University, and the Ph.D. degree in pattern recognition and artificial intelligence from the Institute of Automation, Chinese Academy of Sciences, in 1989, 1992 and 1995, respectively. He was a postdoctoral fellow at the Korea Advanced Institute of Science and Technology (KAIST) and later at the Tokyo University of Agriculture and Technology from March 1996 to March 1999. He then became a research staff member at the Central Research Laboratory, Hitachi, Ltd., and is now a senior researcher. His research interests include pattern recognition, artificial intelligence, image processing, neural networks, machine learning, and especially their applications to character recognition and document processing.

Hongwei Hao received the Ph.D. degree in pattern recognition and artificial intelligence from the Institute of Automation, Chinese Academy of Sciences, in 1996. From August 1996 to August 1998, he was a postdoctoral fellow at the Chinese Academy of Sciences. Since August 1998, he has been an Associate Professor at the Department of Computer Science, University of Science and Technology Beijing. From September 2001 to September 2002, he was a visiting researcher at the Central Research Laboratory, Hitachi, Ltd., Tokyo, Japan. His research interests are mainly focused on handwritten character recognition and artificial neural networks.

Hiroshi Sako received the B.E. and M.E. degrees in mechanical engineering from Waseda University, Tokyo, Japan, in 1975 and 1977, respectively. In 1992, he received the Dr. Eng. degree in computer science from the University of Tokyo. From 1977 to 1991, he worked in the field of industrial machine vision at the Central Research Laboratory of Hitachi, Ltd., Tokyo, Japan (HCRL). From 1992 to 1995, he was a senior research scientist at Hitachi Dublin Laboratory, Ireland, where he did research in facial and hand gesture recognition. Since 1996, he has been with the HCRL, where he directs a research group on image recognition and character recognition. Currently, he is a chief researcher there. As concurrent posts, he has been a visiting professor at the Japan Advanced Institute of Science and Technology, Hokuriku (Postgraduate University) since 1998, and a visiting lecturer at Hosei University since 2003. Dr. Sako was a recipient of the 1988 Best Paper Award from the Institute of Electronics, Information, and Communication Engineers (IEICE) of Japan, and one of the recipients of the Industrial Paper Awards from the 12th ICPR, Jerusalem, Israel, 1994. He is a senior member of the IEEE, and a member of the IEICE, the JSAI (Japanese Society of Artificial Intelligence) and the IPSJ (Information Processing Society of Japan).

Acknowledgements The work of Hongwei Hao was done while he was working at the Hitachi Central Research Laboratory. The authors would like to thank Kazuki Nakashima and Ryuji Mine for providing the datasets.
