Information Analysis of Multiple Classifier Fusion

Jiří Grim¹, Josef Kittler², Pavel Pudil¹, and Petr Somol¹

¹ Institute of Information Theory and Automation, P.O. Box 18, CZ-18208 Prague 8, Czech Republic, [email protected], [email protected], [email protected]

² School of Electronic Engineering, Information Technology and Mathematics, University of Surrey, Guildford GU2 5XH, United Kingdom

Supported by the grant No. 402/01/0981 of the Czech Grant Agency and partially by the Complex research project No. K1019101 of the Czech Academy of Sciences.

Abstract. We consider a general scheme of parallel classifier combinations in the framework of statistical pattern recognition. Each statistical classifier defines a set of output variables in terms of a posteriori probabilities, i.e. it is used as a feature extractor. Unlike usual combining schemes, the output vectors of classifiers are combined in parallel. The statistical Shannon information is used as a criterion to compare different combining schemes from the point of view of the theoretically available decision information. By means of relatively simple arguments we derive a theoretical hierarchy between different schemes of classifier fusion in terms of information inequalities.

1 Introduction

A natural way to solve practical problems of pattern recognition is to try different classification methods, parameters and feature subsets with the aim of achieving the best performance. However, as different classifiers frequently make different recognition errors, it is often useful to combine multiple classifiers in order to improve the final recognition accuracy. In the last few years various combination methods have been proposed by different authors (cf. [12, 13] for extensive references).

The most widely used approach combines the classifier outputs directly by means of simple combining rules or functions. It includes techniques like majority vote, threshold voting, the averaged Bayes classifier, different linear combinations of a posteriori probabilities, maximum and minimum rules, and the product rule (cf. e.g. [11, 6, 14]), as well as more complex combining tools like fuzzy logic or the Dempster-Shafer theory of evidence (cf. e.g. [9, 10, 15]).

Another approach makes use of classifiers as feature extractors. The extracted features (e.g. a posteriori probabilities) are used simultaneously, in parallel, to define a new decision problem (cf. [16, 18, 7, 8]). Instead of a simple combining function, the new features are evaluated again by a classifier (e.g. by a neural network) to realize the compound classification. This approach is capable of very general solutions, but the potential advantages are achieved at the expense of the lost simplicity of the combining rules.

In the present paper we consider a general scheme corresponding to the latter type of parallel classifier combination in the framework of statistical pattern recognition. In particular, we assume that each statistical classifier defines a set of output variables in terms of a posteriori probabilities which are simply used in parallel as features. We use the term parallel classifier fusion to emphasize the difference from the combining functions.

The standard criterion for measuring the quality of classifier combination techniques is the recognition accuracy. In this paper we use the statistical Shannon information to compare different combining schemes, as it is more sensitive than the classification error and easily applicable. By means of relatively simple facts we derive a theoretical hierarchy between basic schemes of classifier fusion in terms of information inequalities. The results have general validity, but their meaning is rather theoretical since the compared decision information is only theoretically available in the new compound feature space.

In Section 2 we describe the framework of statistical pattern recognition and introduce the basic concept of an information preserving transform. In Section 3 we discuss the information properties of imprecise classifiers and show how the information loss of practical solutions can be reduced by parallel fusion (Sec. 4). In Sec. 5 we compare the parallel classifier fusion with a method based on general combining rules. The obtained hierarchy of different methods of multiple classifier fusion is discussed in Sec. 6 and finally summarized in the Conclusion (Sec. 7).

2 Information Preserving Transform

Considering the framework of statistical pattern recognition, we assume in this paper that some N-dimensional binary observations x have to be classified into one of mutually exclusive classes ω ∈ Ω:

x = (x_1, \dots, x_N) \in \mathcal{X}, \qquad \mathcal{X} = \{0,1\}^N, \qquad \Omega = \{\omega_1, \dots, \omega_K\}.

The observations x ∈ X are supposed to occur randomly according to some a priori probabilities p(ω) and class-conditional probability distributions

P(x|\omega)\,p(\omega), \qquad x \in \mathcal{X}, \ \omega \in \Omega.   (1)

Given the probabilistic description of classes, we can compute for any input vector x the a posteriori probabilities

p(\omega|x) = \frac{P(x|\omega)\,p(\omega)}{P(x)}, \qquad P(x) = \sum_{\omega \in \Omega} P(x|\omega)\,p(\omega), \qquad x \in \mathcal{X}   (2)

where P(x) is the unconditional joint probability distribution of x. The a posteriori probabilities p(ω|x) contain all statistical information about the set of classes Ω given x ∈ X and can easily be used to classify the input vector x uniquely, if necessary.
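
The computation (2) is easy to reproduce numerically. The following sketch (hypothetical toy numbers, not from the paper) evaluates the a posteriori probabilities p(ω|x) for K = 2 classes and N = 3 binary features, assuming for simplicity that each class-conditional distribution P(x|ω) is a product of independent Bernoulli components.

```python
# Toy Bayes posteriors p(omega|x), Eq. (2); all numbers are hypothetical.
import itertools
import numpy as np

priors = np.array([0.6, 0.4])              # a priori probabilities p(omega)
bern = np.array([[0.8, 0.3, 0.5],          # Bernoulli parameters of class omega_1
                 [0.2, 0.6, 0.9]])         # Bernoulli parameters of class omega_2

def cond_prob(x, k):
    """Class-conditional probability P(x|omega_k) of a binary vector x."""
    p = bern[k]
    return np.prod(np.where(np.array(x) == 1, p, 1.0 - p))

for x in itertools.product([0, 1], repeat=3):
    joint = np.array([cond_prob(x, k) * priors[k] for k in range(2)])  # P(x|omega) p(omega)
    posterior = joint / joint.sum()                                    # Eq. (2)
    print(x, posterior.round(3))
```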

For the sake of information analysis let us consider the following vector transform T of the original decision problem defined on the input space X:

T: \mathcal{X} \to \mathcal{Y}, \qquad \mathcal{Y} \subset \mathbb{R}^K, \qquad y = T(x) = (T_1(x), \dots, T_K(x)) \in \mathcal{Y}   (3)

y_k = T_k(x) = \varphi(p(\omega_k|x)), \qquad x \in \mathcal{X}, \ k = 1, \dots, K,   (4)

where ϕ is any bijective function. Let us remark that there are strong arguments for specifying the function ϕ as the logarithm in connection with neural networks [2, 4].

The transform (3), (4) naturally induces a partition of the space X

\mathcal{S} = \{S_y,\ y \in \mathcal{Y}\}, \qquad S_y = \{x \in \mathcal{X}: T(x) = y\}, \qquad \bigcup_{y \in \mathcal{Y}} S_y = \mathcal{X}   (5)

and transforms the original distributions P_X, P_{X|ω} on the input space X to the distributions Q_Y, Q_{Y|ω} on Y:

Q_{\mathcal{Y}}(y) = \sum_{x \in S_y} P_{\mathcal{X}}(x) = P_{\mathcal{X}}(S_y), \qquad y \in \mathcal{Y},   (6)

Q_{\mathcal{Y}|\omega}(y|\omega) = \sum_{x \in S_y} P_{\mathcal{X}|\omega}(x|\omega) = P_{\mathcal{X}|\omega}(S_y|\omega), \qquad \omega \in \Omega.   (7)

Throughout the paper we use, whenever tolerable, the simpler notation

P(x) = P_{\mathcal{X}}(x), \quad P(x|\omega) = P_{\mathcal{X}|\omega}(x|\omega), \quad Q(y) = Q_{\mathcal{Y}}(y), \quad Q(y|\omega) = Q_{\mathcal{Y}|\omega}(y|\omega).

In analogy with (2) we can write

p(\omega|y) = \frac{Q(y|\omega)\,p(\omega)}{Q(y)}, \qquad \omega \in \Omega, \ y \in \mathcal{Y}.   (8)

It is well known that, from the point of view of classification error, the a posteriori probabilities represent optimal features (cf. [1]). Moreover, it has been shown that the transform defined by Eqs. (3), (4) preserves the statistical decision information and minimizes the entropy of the output space (cf. [2, 17]). In particular, if we introduce the following notation for the unconditional and conditional Shannon entropies

H(p_\Omega) = \sum_{\omega \in \Omega} -p(\omega)\log p(\omega), \qquad H(p_{\Omega|x}) = \sum_{\omega \in \Omega} -p(\omega|x)\log p(\omega|x),   (9)

H(p_\Omega|P_{\mathcal{X}}) = \sum_{x \in \mathcal{X}} P(x)\,H(p_{\Omega|x}), \qquad H(p_\Omega|Q_{\mathcal{Y}}) = \sum_{y \in \mathcal{Y}} Q(y)\,H(p_{\Omega|y}),   (10)

then we can write

I(P_{\mathcal{X}}, p_\Omega) = H(p_\Omega) - H(p_\Omega|P_{\mathcal{X}}) = H(p_\Omega) - H(p_\Omega|Q_{\mathcal{Y}}) = I(Q_{\mathcal{Y}}, p_\Omega).   (11)
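
The quantities (9)-(11) are straightforward to compute for any finite problem. The sketch below (toy numbers assumed, not from the paper) evaluates the decision information I(P_X, p_Ω) from a full joint table P(x|ω)p(ω) via the prior entropy and the conditional entropy of (10).

```python
# Decision information I(P_X, p_Omega) from Eqs. (9)-(11); hypothetical toy table.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0.0]
    return float(-(p * np.log2(p)).sum())          # any logarithm base works in (9)-(11)

def decision_information(joint):
    """joint[x, omega] = P(x|omega) p(omega); rows index x in X, columns index omega."""
    p_omega = joint.sum(axis=0)                    # marginal p(omega)
    p_x = joint.sum(axis=1)                        # marginal P(x)
    posteriors = joint / p_x[:, None]              # p(omega|x), Eq. (2)
    h_cond = float(sum(p_x[i] * entropy(posteriors[i]) for i in range(len(p_x))))
    return entropy(p_omega) - h_cond               # Eq. (11)

# toy joint table over |X| = 4 inputs and K = 2 classes
joint = np.array([[0.20, 0.05],
                  [0.15, 0.10],
                  [0.05, 0.25],
                  [0.10, 0.10]])
print(decision_information(joint))
```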


Let us recall that

S_y = \{x \in \mathcal{X}: T(x) = y\} = \{x \in \mathcal{X}: p(\omega_k|x) = \varphi^{-1}(y_k),\ k = 1, \dots, K\}   (12)

and therefore the distributions p_{Ω|x} are identical for all x ∈ S_y. Thus, in view of the definition (8), we can write for any ω ∈ Ω, y ∈ Y and for all x ∈ S_y:

p(\omega|y) = \frac{Q(y|\omega)\,p(\omega)}{Q(y)} = \frac{P(S_y|\omega)\,p(\omega)}{P(S_y)} = \sum_{x \in S_y} \frac{P(x)}{P(S_y)}\,p(\omega|x) = p(\omega|x).   (13)

Consequently, we obtain the equation

H(p_\Omega|Q_{\mathcal{Y}}) = \sum_{y \in \mathcal{Y}} P(S_y)\,H(p_{\Omega|y}) = \sum_{y \in \mathcal{Y}} \sum_{x \in S_y} P(x)\,H(p_{\Omega|x}) = H(p_\Omega|P_{\mathcal{X}})   (14)

which implies Eq. (11). In words, the transform T preserves the statistical Shannon information I(P_X, p_Ω) about p_Ω contained in P_X (cf. [2, 17]).
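
The preservation property (11), (14) can be checked numerically: merging all inputs x that share the same posterior vector p_{Ω|x}, i.e. the cells S_y of the partition induced by T, leaves the mutual information unchanged. A minimal sketch with hypothetical numbers:

```python
# Numeric check of Eq. (11)/(14); the joint table and its values are hypothetical.
import numpy as np

def info(joint):
    """Mutual information between the row variable and the class variable of a joint table."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)
    pw = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ pw)[nz])).sum())

# joint table P(x, omega) in which x_0 and x_1 share the same posterior vector
joint_x = np.array([[0.10, 0.10],     # posterior (0.5, 0.5)
                    [0.20, 0.20],     # posterior (0.5, 0.5): same cell S_y as above
                    [0.30, 0.10]])    # posterior (0.75, 0.25)

# merge rows with identical posteriors: the partition induced by the transform T
posteriors = joint_x / joint_x.sum(axis=1, keepdims=True)
groups = {}
for row, post in zip(joint_x, posteriors.round(12)):
    key = tuple(post)
    groups[key] = groups.get(key, np.zeros(joint_x.shape[1])) + row
joint_y = np.array(list(groups.values()))

print(info(joint_x), info(joint_y))   # identical up to rounding: T preserves I
```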

3 Information Loss Caused by Imprecise Classifiers

Let P^(i)_{X|ω} be some estimates of the unknown conditional probability distributions P_{X|ω} obtained, e.g., in I different computational experiments:

P^{(i)}_{\mathcal{X}|\omega}, \qquad \omega \in \Omega, \ i \in \mathcal{I}, \qquad \mathcal{I} = \{1, \dots, I\}.   (15)

We denote by p^(i)(ω|x) the a posteriori probabilities which can be computed from the estimated distributions

p^{(i)}(\omega|x) = \frac{P^{(i)}(x|\omega)\,p(\omega)}{P^{(i)}(x)}, \qquad P^{(i)}(x) = \sum_{\omega \in \Omega} P^{(i)}(x|\omega)\,p(\omega), \qquad x \in \mathcal{X}.   (16)

Here and in the following Sections we assume the a priori probabilities p(ω) to be known and fixed. In analogy with (3), (4) we define

T^{(i)}: \mathcal{X} \to \mathcal{Y}^{(i)}, \qquad \mathcal{Y}^{(i)} \subset \mathbb{R}^K, \qquad y^{(i)} = T^{(i)}(x) = (T^{(i)}_1(x), \dots, T^{(i)}_K(x)) \in \mathcal{Y}^{(i)},

y^{(i)}_k = T^{(i)}_k(x) = \varphi(p^{(i)}(\omega_k|x)), \qquad x \in \mathcal{X}, \ k = 1, \dots, K, \ (i \in \mathcal{I}).   (17)

Again, the transform T^(i) induces a partition of the input space X:

\mathcal{S}^{(i)} = \{S_{y^{(i)}},\ y^{(i)} \in \mathcal{Y}^{(i)}\}, \qquad S_{y^{(i)}} = \{x \in \mathcal{X}: T^{(i)}(x) = y^{(i)}\}   (18)

and transforms the original true probability distributions P_X, P_{X|ω} on X to the corresponding distributions Q^(i)_{Y^(i)}, Q^(i)_{Y^(i)|ω} on Y^(i) (cf. (6), (7)):

Q^{(i)}_{\mathcal{Y}^{(i)}}(y^{(i)}) = P_{\mathcal{X}}(S_{y^{(i)}}), \qquad Q^{(i)}_{\mathcal{Y}^{(i)}|\omega}(y^{(i)}|\omega) = P_{\mathcal{X}|\omega}(S_{y^{(i)}}|\omega), \qquad y^{(i)} \in \mathcal{Y}^{(i)}.   (19)


In analogy with (13) we can write for any ω ∈ Ω and y^(i) ∈ Y^(i):

p(\omega|y^{(i)}) = p(\omega|S_{y^{(i)}}) = \frac{P(S_{y^{(i)}}|\omega)\,p(\omega)}{P(S_{y^{(i)}})} = \sum_{x \in S_{y^{(i)}}} \frac{P(x)}{P(S_{y^{(i)}})}\,p(\omega|x).   (20)

However, unlike Eq. (13), the probabilities p(ω|x) need not be identical for all x ∈ S_{y^(i)} because the partition S^(i) derives from the estimated distributions P^(i)_{X|ω}, which may differ from the unknown true distributions P_{X|ω}. Thus, for some S_{y^(i)} ∈ S^(i), there may be different vectors x, x′ ∈ S_{y^(i)} such that p(ω|x) ≠ p(ω|x′) for some ω ∈ Ω. Consequently, we can write the following general Jensen's inequality for the convex function ξ log ξ:

-p(\omega|S_{y^{(i)}})\log p(\omega|S_{y^{(i)}}) \;\ge\; \sum_{x \in S_{y^{(i)}}} \frac{P(x)}{P(S_{y^{(i)}})}\,\bigl[-p(\omega|x)\log p(\omega|x)\bigr].   (21)

Multiplying the inequality (21) by P(S_{y^(i)}) and summing over ω ∈ Ω and y^(i) ∈ Y^(i), we obtain the following inequality for the conditional entropies

H(p_\Omega|Q^{(i)}_{\mathcal{Y}^{(i)}}) = \sum_{y^{(i)} \in \mathcal{Y}^{(i)}} Q^{(i)}(y^{(i)})\,H(p_{\Omega|y^{(i)}}) \;\ge\; \sum_{x \in \mathcal{X}} P(x)\,H(p_{\Omega|x}) = H(p_\Omega|P_{\mathcal{X}}).   (22)

In view of the equation H(p_Ω|P_X) = H(p_Ω|Q_Y) (cf. (14)) it follows that

I(P_{\mathcal{X}}, p_\Omega) = I(Q_{\mathcal{Y}}, p_\Omega) \;\ge\; I(Q^{(i)}_{\mathcal{Y}^{(i)}}, p_\Omega), \qquad (i \in \mathcal{I}).   (23)

Thus, if the true probabilistic description P_{X|ω}, ω ∈ Ω, is unknown and we are given only some estimated distributions P^(i)_{X|ω}, then we may expect the transform T^(i) to be accompanied by some information loss. In other words, as is well known, the extracted features y^(i)_k(x) usually contain only a part of the original decision information.
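
The inequality (23) is easy to illustrate numerically. In the sketch below (hypothetical numbers, not from the paper), the partition S^(i) is the one an imprecise estimate would induce, here simply merging inputs whose estimated posteriors coincide, while the information is evaluated under the true joint distribution; the merged table can only lose information.

```python
# Illustration of the information loss (23) under assumed toy numbers.
import numpy as np

def info(joint):
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)
    pw = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ pw)[nz])).sum())

def merge_by(joint, labels):
    """Merge rows of the true joint table that fall into the same cell of a partition."""
    cells = sorted(set(labels))
    return np.array([joint[[lab == c for lab in labels]].sum(axis=0) for c in cells])

# true joint distribution P(x, omega) over |X| = 4 inputs and K = 2 classes
true_joint = np.array([[0.20, 0.05],
                       [0.15, 0.10],
                       [0.05, 0.25],
                       [0.10, 0.10]])

# an imprecise estimate assigns the same posterior vector to x_0, x_1 and to x_2, x_3,
# so its transform T^(i) merges them into two cells S_{y^(i)}
estimated_cells = [0, 0, 1, 1]

print(info(true_joint))                             # I(P_X, p_Omega) = I(Q_Y, p_Omega)
print(info(merge_by(true_joint, estimated_cells)))  # I(Q^(i), p_Omega), smaller or equal
```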

Remark 4.1 It should be emphasized at this point that there is an important difference between the present concept of information analysis and a practical problem of pattern recognition. In a practical situation we would use the estimated conditional distributions P^(i)_{X|ω} to compute the a posteriori probabilities p^(i)(ω|x) and finally the decision d^(i)(x) ∈ Ω given an input observation x ∈ X. However, in the case of the above information analysis, we use the estimated distributions P^(i)_{X|ω} only to define the related transform (feature extractor) T^(i). The resulting information inequality (23) compares the original "complete" decision information I(Q_Y, p_Ω) and the theoretically available information content I(Q^(i)_{Y^(i)}, p_Ω) of the new features y^(i)_k (cf. (17)). Note that the true statistical properties of the new features, expressed by the distributions Q^(i)_{Y^(i)|ω}, Q^(i)_{Y^(i)}, cannot be deduced from the estimated conditional distributions P^(i)_{X|ω}; it would be necessary to estimate them again from the training data. This remark applies in an analogous way to all transforms considered in the following Sections.


4 Parallel Classifier Fusion

The inequality (23) suggests a possible information loss which may occur in practical solutions. One possibility to reduce this information loss is to fuse multiple classifiers in parallel. In particular, considering multiple estimates (15) of the unknown conditional distributions P_{X|ω}, ω ∈ Ω, we can use the corresponding transforms T^(i), i ∈ I (cf. (17)) simultaneously to define a compound transform T̄:

\bar{T}: \mathcal{X} \to \bar{\mathcal{Y}}, \qquad \bar{\mathcal{Y}} \subset \mathbb{R}^{IK}, \qquad \bar{y} = (y^{(1)}, y^{(2)}, \dots, y^{(I)}) = \bar{T}(x),

\bar{y} = \bar{T}(x) = (T^{(1)}_1(x), \dots, T^{(1)}_K(x), \dots, T^{(I)}_1(x), \dots, T^{(I)}_K(x)) \in \bar{\mathcal{Y}}   (24)

y^{(i)}_k = T^{(i)}_k(x) = \varphi(p^{(i)}(\omega_k|x)), \qquad x \in \mathcal{X}, \ k = 1, \dots, K, \ i \in \mathcal{I}.   (25)

In this sense the joint transform T̄ can be viewed as a parallel fusion of the transforms T^(1), T^(2), ..., T^(I). Again, the transform T̄ induces the corresponding partition of the input space X:

\bar{\mathcal{S}} = \{\bar{S}_{\bar{y}},\ \bar{y} \in \bar{\mathcal{Y}}\}, \qquad \bar{S}_{\bar{y}} = \{x \in \mathcal{X}: \bar{T}(x) = \bar{y}\}   (26)

and generates the transformed distributions Q̄_Ȳ, Q̄_{Ȳ|ω} on Ȳ:

\bar{Q}_{\bar{\mathcal{Y}}}(\bar{y}) = P_{\mathcal{X}}(\bar{S}_{\bar{y}}), \qquad \bar{Q}_{\bar{\mathcal{Y}}|\omega}(\bar{y}|\omega) = P_{\mathcal{X}|\omega}(\bar{S}_{\bar{y}}|\omega), \qquad \bar{y} \in \bar{\mathcal{Y}}, \ \omega \in \Omega.   (27)

It is easy to see (cf. (18)) that the partition S̄ of X can be obtained by intersecting the sets of the partitions S^(1), S^(2), ..., S^(I):

\bar{S}_{\bar{y}} = \{x \in \mathcal{X}: T^{(i)}(x) = y^{(i)},\ i \in \mathcal{I}\} \;\Rightarrow\; \bar{S}_{\bar{y}} = \bigcap_{i \in \mathcal{I}} S_{y^{(i)}},   (28)

and therefore the partition S̄ is a refinement of any of the partitions S^(i), i ∈ I. Now we prove the following simple Lemma:

Lemma 4.1 Let Q_Y, Q_{Y|ω} be discrete probability distributions defined by the partition S (cf. (6), (7)) and Q̄_Ȳ, Q̄_{Ȳ|ω} discrete probability distributions defined by the partition S̄:

\bar{Q}_{\bar{\mathcal{Y}}}(\bar{y}) = P_{\mathcal{X}}(\bar{S}_{\bar{y}}), \qquad \bar{Q}_{\bar{\mathcal{Y}}|\omega}(\bar{y}|\omega) = P_{\mathcal{X}|\omega}(\bar{S}_{\bar{y}}|\omega), \qquad \bar{y} \in \bar{\mathcal{Y}}, \ \omega \in \Omega.   (29)

Further let S̄ be a refinement of the partition S. Then the statistical decision information about p_Ω contained in Q̄_Ȳ is greater than or equal to that contained in Q_Y, i.e. we can write the inequality

I(\bar{Q}_{\bar{\mathcal{Y}}}, p_\Omega) \;\ge\; I(Q_{\mathcal{Y}}, p_\Omega).   (30)

Proof. We use the notation (cf. (6), (7) and (29))

p(\omega|y) = \frac{P(S_y|\omega)\,p(\omega)}{P(S_y)}, \qquad p(\omega|\bar{y}) = \frac{P(\bar{S}_{\bar{y}}|\omega)\,p(\omega)}{P(\bar{S}_{\bar{y}})}   (31)

and recall that for any two subsets S_y ∈ S, S̄_ȳ ∈ S̄ it holds that either S̄_ȳ ⊂ S_y or their intersection is empty: S_y ∩ S̄_ȳ = ∅. It follows that we can write, for any y ∈ Y,

p(\omega|y) = \sum_{\bar{y} \in \bar{\mathcal{Y}}} \frac{P(S_y \cap \bar{S}_{\bar{y}})}{P(S_y)} \cdot \frac{P(\bar{S}_{\bar{y}}|\omega)\,p(\omega)}{P(\bar{S}_{\bar{y}})} \;=\; \sum_{\bar{y} \in \bar{\mathcal{Y}}} \frac{P(S_y \cap \bar{S}_{\bar{y}})}{P(S_y)}\,p(\omega|\bar{y}).   (32)

Applying Jensen's inequality to the concave function −ξ log ξ and considering (32) we obtain

-p(\omega|y)\log p(\omega|y) \;\ge\; \sum_{\bar{y} \in \bar{\mathcal{Y}}} \frac{P(S_y \cap \bar{S}_{\bar{y}})}{P(S_y)}\,\bigl[-p(\omega|\bar{y})\log p(\omega|\bar{y})\bigr].   (33)

Further, multiplying the inequality (33) by P(S_y) and summing over ω ∈ Ω and y ∈ Y, we obtain the following inequality for the conditional entropies

H(p_\Omega|Q_{\mathcal{Y}}) = \sum_{y \in \mathcal{Y}} Q(y)\,H(p_{\Omega|y}) \;\ge\; \sum_{\bar{y} \in \bar{\mathcal{Y}}} \bar{Q}(\bar{y})\,H(p_{\Omega|\bar{y}}) = H(p_\Omega|\bar{Q}_{\bar{\mathcal{Y}}})   (34)

which implies the inequality (30). •

Consequently, since the partition S̄ is a refinement of any of the partitions S^(i), i ∈ I, Lemma 4.1 implies the following information inequality

I(\bar{Q}_{\bar{\mathcal{Y}}}, p_\Omega) \;\ge\; I(Q^{(i)}_{\mathcal{Y}^{(i)}}, p_\Omega), \qquad i \in \mathcal{I}.   (35)

We can conclude that, as expected, the classifier fusion represented by the compound transform T̄ preserves more decision information than any of the component transforms T^(i). Let us remark, however, that the dimension of the feature space Ȳ produced by T̄ is I times higher than that of the spaces Y^(i). This computational aspect of the considered form of classifier fusion will be discussed later in Sec. 6.
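
A small numerical illustration of the fusion (24)-(28) under assumed toy numbers: each component classifier contributes a partition of X, the compound transform intersects them, and the information retained by the intersection is at least that of each component, in line with (35). The helpers repeat those of the previous sketches.

```python
# Parallel fusion as intersection of partitions, Eq. (28); all numbers are hypothetical.
import numpy as np

def info(joint):
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)
    pw = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ pw)[nz])).sum())

def merge_by(joint, labels):
    cells = sorted(set(labels))
    return np.array([joint[[lab == c for lab in labels]].sum(axis=0) for c in cells])

true_joint = np.array([[0.20, 0.05],
                       [0.15, 0.10],
                       [0.05, 0.25],
                       [0.10, 0.10]])

cells_1 = [0, 0, 1, 1]                         # partition S^(1) induced by classifier 1
cells_2 = [0, 1, 1, 0]                         # partition S^(2) induced by classifier 2
cells_fused = list(zip(cells_1, cells_2))      # intersection of the two partitions, Eq. (28)

print(info(merge_by(true_joint, cells_1)))     # I(Q^(1), p_Omega)
print(info(merge_by(true_joint, cells_2)))     # I(Q^(2), p_Omega)
print(info(merge_by(true_joint, cells_fused))) # fused information, at least as large as above
```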

5 Combining Functions

Now we return to the most widely used classifier combination scheme based on simple combining functions or combining rules. In particular, we assume that the a posteriori distributions p^(i)_{Ω|x}, i ∈ I, computed by different statistical classifiers are transformed to a single K-dimensional output vector by means of some combining functions. We denote by T̃ the resulting transform

\tilde{T}: \mathcal{X} \to \tilde{\mathcal{Y}}, \qquad \tilde{\mathcal{Y}} \subset \mathbb{R}^K, \qquad \tilde{y} = \tilde{T}(x) = (\tilde{T}_1(x), \dots, \tilde{T}_K(x)) \in \tilde{\mathcal{Y}},

\tilde{y}_k = \tilde{T}_k(x) = \Phi_k(p^{(1)}_{\Omega|x}, p^{(2)}_{\Omega|x}, \dots, p^{(I)}_{\Omega|x}), \qquad x \in \mathcal{X}, \ k = 1, \dots, K   (36)

whereby Φ_k : R^{IK} → R are arbitrary mappings which uniquely define the output variables ỹ_k as functions of the a posteriori distributions p^(i)_{Ω|x}, i ∈ I. Let us note that, in this way, we can express in a unified manner various combining rules for a posteriori distributions like the average, median, product, weighted average and others. More generally, the mapping Φ_k may be described e.g. by a simple procedure and, in this way, we can formally describe different voting schemes like majority voting, weighted voting, etc.

If we define the partition of X induced by the transform T̃

\tilde{\mathcal{S}} = \{\tilde{S}_{\tilde{y}},\ \tilde{y} \in \tilde{\mathcal{Y}}\}, \qquad \tilde{S}_{\tilde{y}} = \{x \in \mathcal{X}: \tilde{T}(x) = \tilde{y}\}   (37)

we can see that, for any particular type of the mappings Φ_k, the partition S̄ of Sec. 4 is a refinement of the partition S̃. To verify this property of S̄ we recall that the a posteriori distributions p^(i)_{Ω|x} are identical for any x ∈ C, C ∈ S̄ (cf. (25), (26)). Consequently, in view of the definition (36), we obtain identical vectors ỹ = T̃(x) for all x ∈ C and therefore C ⊂ S̃_ỹ. In other words, for each subset C ∈ S̄ there is a subset S̃_ỹ ∈ S̃ such that C ⊂ S̃_ỹ, i.e. the partition S̄ is a refinement of the partition S̃. If we denote by Q̃_Ỹ, Q̃_{Ỹ|ω} the transformed distributions on Ỹ defined by the partition S̃:

\tilde{Q}_{\tilde{\mathcal{Y}}}(\tilde{y}) = P_{\mathcal{X}}(\tilde{S}_{\tilde{y}}), \qquad \tilde{Q}_{\tilde{\mathcal{Y}}|\omega}(\tilde{y}|\omega) = P_{\mathcal{X}|\omega}(\tilde{S}_{\tilde{y}}|\omega), \qquad \tilde{y} \in \tilde{\mathcal{Y}}, \ \omega \in \Omega   (38)

then we can write the information inequality (cf. Lemma 4.1):

I(\bar{Q}_{\bar{\mathcal{Y}}}, p_\Omega) \;\ge\; I(\tilde{Q}_{\tilde{\mathcal{Y}}}, p_\Omega).   (39)

We can conclude that, as expected, the classifier fusion represented by the compound transform T̄ preserves more decision information than any transform of the type T̃ based on the combining functions (36).
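
The inequality (39) can be illustrated with the same toy machinery (hypothetical numbers; the average rule is used here only as one example of a combining function Φ_k). The averaged posterior vector may coincide for inputs that the parallel fusion still keeps apart, so the partition S̃ is coarser and retains less information.

```python
# Combining function vs. parallel fusion, cf. (39); all numbers are hypothetical.
import numpy as np

def info(joint):
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)
    pw = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ pw)[nz])).sum())

def merge_by(joint, labels):
    cells = sorted(set(labels))
    return np.array([joint[[lab == c for lab in labels]].sum(axis=0) for c in cells])

true_joint = np.array([[0.20, 0.05],
                       [0.15, 0.10],
                       [0.05, 0.25],
                       [0.10, 0.10]])

# estimated posterior vectors p^(1)(.|x), p^(2)(.|x) of two classifiers for the four inputs
post_1 = np.array([[0.9, 0.1], [0.5, 0.5], [0.3, 0.7], [0.3, 0.7]])
post_2 = np.array([[0.1, 0.9], [0.5, 0.5], [0.7, 0.3], [0.3, 0.7]])

# parallel fusion keeps the concatenated vectors apart; the average rule may merge them
fused = [tuple(np.round(np.concatenate([a, b]), 6)) for a, b in zip(post_1, post_2)]
averaged = [tuple(np.round((a + b) / 2.0, 6)) for a, b in zip(post_1, post_2)]

print(info(merge_by(true_joint, fused)))     # parallel fusion, Eqs. (24)-(25)
print(info(merge_by(true_joint, averaged)))  # average combining rule, Eq. (36), smaller
```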

6 Discussion

Summarizing the inequalities derived in the above Sections, we recall that the transform T(x) (cf. (3), (4)) based on the true probability distributions P_{X|ω} is information preserving in the sense of Eq. (11). Section 3 describes a practical situation when only some estimates P^(i)_{X|ω} of the true conditional distributions are available. We have shown that the transform T^(i) defined by means of the estimated distributions P^(i)_{X|ω} may be accompanied by some information loss, as expressed by the inequalities (cf. (23))

I(Q_{\mathcal{Y}}, p_\Omega) \;\ge\; I(Q^{(i)}_{\mathcal{Y}^{(i)}}, p_\Omega), \qquad i \in \mathcal{I}.   (40)

The potential information loss (40) can be partly avoided by combining classifiers. In particular, by parallel fusion of the transforms T^(i), i ∈ I, we obtain the compound transform T̄, and the corresponding transformed distribution Q̄_Ȳ satisfies the inequality (35). Consequently, as the general inequality (40) can be proved for the distribution Q̄_Ȳ without any change, we can write

I(Q_{\mathcal{Y}}, p_\Omega) \;\ge\; I(\bar{Q}_{\bar{\mathcal{Y}}}, p_\Omega) \;\ge\; I(Q^{(i)}_{\mathcal{Y}^{(i)}}, p_\Omega), \qquad i \in \mathcal{I}.   (41)


In other words, the compound transform T̄ preserves more decision information than any of the component transforms T^(i), but the dimension of the feature space Ȳ produced by T̄ is I times higher than the dimension of each of the subspaces Y^(i).

In view of the inequality (cf. (39))

I(\bar{Q}_{\bar{\mathcal{Y}}}, p_\Omega) \;\ge\; I(\tilde{Q}_{\tilde{\mathcal{Y}}}, p_\Omega)   (42)

the parallel classifier fusion preserves more decision information than the various methods based on combining functions. However, it should be emphasized that the inequality (42) compares the decision information theoretically available by means of parallel classifier fusion and by using combining functions, respectively. Moreover, the information advantage of parallel fusion is achieved at the expense of the lost simplicity of the combining rules. In order to exploit the available decision information it would be necessary to design a new classifier in the high-dimensional feature space Ȳ. On the other hand, in the feature space Ỹ of the combined classifier a large portion of the decision information may be lost irreversibly.

Let us finally remark that inequalities analogous to (41) can be obtained in connection with probabilistic neural networks (PNN) when the estimated conditional distributions P^(i)_{X|ω} have the form of distribution mixtures [5]. For each PNN(i) we have a transform defined in terms of component distributions [3] which preserves more decision information about p_Ω than the corresponding transform T^(i). Again, the underlying information loss connected with the individual neural networks can be reduced by means of parallel fusion. It can be shown that, theoretically, the parallel fusion of PNNs is potentially more efficient than the classifier fusion of Section 4.

7 Conclusion

For the sake of information analysis we considered a general scheme of parallel classifier combinations in the framework of statistical pattern recognition. Formally, each classifier defines a set of output variables (features) in terms of the estimated a posteriori probabilities. The extracted features are used in parallel to define a new higher-level decision problem. By means of relatively simple facts we derived a hierarchy between different schemes of classifier fusion in terms of information inequalities. In particular, we have shown that the parallel fusion of classifiers is potentially more efficient than the frequently used techniques based on simple combining rules or functions. However, the potential advantages are achieved at the expense of the lost simplicity of the combining functions and of the increased dimension of the new feature space. Thus, unlike combining functions, the most informative parallel combining schemes would require designing a new classifier in a high-dimensional feature space.


References

1. Fukunaga K., Ando S.: The optimum nonlinear features for a scatter criterion and discriminant analysis. In Proc. ICPR 1976, pp 50-54.

2. Grim J.: Design of multilayer neural networks by information preserving transforms. In: Pessa E., Penna M.P., Montesanto A. (Eds.), Proceedings of the Third European Congress on System Science, Roma: Edizioni Kappa, 1996, pp 977-982.

3. Grim J., Kittler J., Pudil P., Somol P.: Combining multiple classifiers in probabilistic neural networks. In: Kittler J., Roli F. (Eds.), Multiple Classifier Systems, Springer, 2000, pp 157-166.

4. Grim J., Pudil P.: On virtually binary nature of probabilistic neural networks. In: Amin A., Dori D., Pudil P., Freeman H. (Eds.), Advances in Pattern Recognition (Sydney, August 11-13, 1998), Springer: New York, Berlin, 1998, pp 765-774.

5. Grim J., Pudil P., Somol P.: Recognition of handwritten numerals by structural probabilistic neural networks. In: Bothe H., Rojas R. (Eds.), Proceedings of the Second ICSC Symposium on Neural Computation, Berlin, 2000. ICSC, Wetaskiwin, 2000, pp 528-534.

6. Hansen L.K., Salamon P.: Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence 1990; 12(10):993-1001.

7. Huang Y.S., Suen C.Y.: A method of combining multiple classifiers - a neural network approach. In Proc. 12th IAPR Int. Conf. on Pattern Recognition and Computer Vision, Los Alamitos: IEEE Comp. Soc. Press, 1994, pp 473-475.

8. Huang Y.S., Suen C.Y.: Combination of multiple experts for the recognition of unconstrained handwritten numerals. IEEE Transactions on Pattern Analysis and Machine Intelligence 1995; 17(1):90-94.

9. Cho S.B., Kim J.H.: Combining multiple neural networks by fuzzy integral for robust classification. IEEE Transactions on Systems, Man and Cybernetics 1995; 25(2):380-384.

10. Cho S.B., Kim J.H.: Multiple network fusion using fuzzy logic. IEEE Transactions on Neural Networks 1995; 6(2):497-501.

11. Ji C., Ma S.: Combinations of weak classifiers. IEEE Transactions on Neural Networks 1997; 8(1):32-42.

12. Kittler J.: Combining classifiers: A theoretical framework. Pattern Analysis and Applications 1998; 1:18-27.

13. Kittler J., Duin R.P.W., Hatef M., Matas J.: On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 1998; 20(3):226-239.

14. Lam L., Suen C.Y.: A theoretical analysis of the application of majority voting to pattern recognition. In Proc. 12th IAPR Int. Conf. on Pattern Recognition and Computer Vision, Los Alamitos: IEEE Comp. Soc. Press, 1994, pp 418-420.

15. Rogova G.: Combining the results of several neural network classifiers. Neural Networks 1994; 7(5):777-781.

16. Soulie F.F., Vinnet E., Lamy B.: Multi-modular neural network architectures: applications in optical character and human face recognition. Int. Journal of Pattern Recognition and Artificial Intelligence 1993; 5(4):721-755.

17. Vajda I., Grim J.: About the maximum information and maximum likelihood principles in neural networks. Kybernetika 1998; 34(4):485-494.

18. Wolpert D.H.: Stacked generalization. Neural Networks 1992; 5(2):241-260.
