Incorporation of a Regularization Term to Control Negative Correlation in Mixture of Experts


Neural Process Lett (2012) 36:31–47, DOI 10.1007/s11063-012-9221-5


Saeed Masoudnia · Reza Ebrahimpour · Seyed Ali Asghar Abbaszadeh Arani

Published online: 5 April 2012. © Springer Science+Business Media, LLC 2012

Abstract  Combining accurate neural networks (NNs) with negatively correlated errors in an ensemble greatly improves the generalization ability. Mixture of experts (ME) is a popular combining method that employs a special error function for the simultaneous training of NN experts to produce negatively correlated NN experts. Although ME can produce negatively correlated experts, it does not include a control parameter, as the negative correlation learning (NCL) method does, to adjust this correlation explicitly. In this study, an approach is proposed to introduce this advantage of NCL into the training algorithm of ME, namely the mixture of negatively correlated experts (MNCE). In the proposed method, the control-parameter capability of NCL is incorporated in the error function of ME, which enables its training algorithm to establish a better balance in the bias-variance-covariance trade-off and thus improves the generalization ability. The proposed hybrid ensemble method, MNCE, is compared with its constituent methods, ME and NCL, on several benchmark problems. The experimental results show that our proposed ensemble method significantly improves the performance over the original ensemble methods.

Keywords  Neural networks ensemble · Hybrid ensemble method · Mixture of experts · Negative correlation learning · Mixture of negatively correlated experts

S. Masoudnia: School of Mathematics, Statistics and Computer Science, University of Tehran, Tehran, Iran

R. Ebrahimpour · S. A. A. A. Arani: Brain & Intelligent Systems Research Laboratory, Department of Electrical and Computer Engineering, Shahid Rajaee Teacher Training University, 16785-163, Tehran, Iran

R. Ebrahimpour (corresponding author): School of Cognitive Sciences (SCS), Institute for Research in Fundamental Sciences (IPM), 19395-5746, Tehran, Iran. E-mail: [email protected]



1 Introduction

Combining classifiers is an approach to improving performance in classification [1,2] and prediction [3–5], particularly for complex problems such as those involving a limited number of patterns, high-dimensional feature sets, and highly overlapped classes [6–8]. Combining methods have two major components, i.e., a method for creating individual neural networks (NNs) and a method for combining the NNs. Both theoretical and experimental studies [9–11] have shown that the combining procedure is most effective when the experts' estimates are negatively correlated; it is moderately effective when the experts are uncorrelated and only mildly effective when the experts are positively correlated. Thus, improved generalization ability can be obtained by combining the outputs of NNs that are accurate and whose errors are negatively correlated [12].

There are a number of alternative approaches that can be used to produce negatively correlated NNs for the ensemble. These include varying the initial random weights of the NNs, varying the topology of the NNs, varying the algorithm employed for training the NNs, and varying the training sets of the NNs. It is argued that training NNs using different training sets is likely to produce more uncorrelated errors than the other approaches [13,14].

The two popular algorithms for constructing ensembles that train individual NNs independently and sequentially, respectively, using different training sets are the bagging [15] and boosting [16] algorithms. Negative correlation learning (NCL) [17] and mixture of experts (ME) [18], two leading combining methods, employ special error functions to simultaneously train NNs and produce negatively correlated NNs. While bagging and boosting create explicitly different training sets for different NNs by probabilistically changing the distribution of the original training data, NCL and ME implicitly create different training sets by encouraging different NN experts to learn different parts or aspects of the training data. As the explicit and implicit approaches to partitioning the training set between experts have complementary strengths and limitations [13,19,20], previous researchers have attempted to combine their features in integrated approaches, as briefly surveyed in the following.

Waterhouse and Cook [21,22] attempted to combine the features of boosting and ME. They proposed two approaches to address the limitations of each method and overcome them by combining elements of the other method. The first approach may be viewed as an improved ME that initializes the partitioning of the data set for assignment to different experts in a boost-like manner. Because boosting encourages classifiers to become experts on patterns that previous experts disagree on, it can be successfully used to split the data set into regions for each expert in the ME method, thus ensuring their localization. The second approach may be viewed as an improved variant of the boosting algorithm, whose main advantage is the use of a dynamic combination method for the outputs of the boosted networks.

Avnimelech and Intrator [23] extended Waterhouse's work and proposed a new dynamically boosted ME method. They analyzed the learning mechanisms of two ensemble algorithms, boosting and ME, discussed the advantages and weaknesses of each algorithm, and reviewed several ways in which the principles of these algorithms can be combined to achieve improved performance. Furthermore, they suggested a flexible procedure for constructing a dynamic ensemble based on the principles of these two algorithms. The proposed ensemble method employs a confidence measure as the gating function, which determines the contribution of each expert to the ensemble output. This method outperformed both static approaches previously proposed by Waterhouse et al. [22].

In the present work, the properties of the NCL and ME methods are investigated, and the advantages and disadvantages of each method are compared. As both methods have complementary strengths and limitations, the question arises as to whether their integration can lead to an improved and more powerful ensemble learning scheme. We studied this question and, based on the complementary features of both methods, we proposed an improved ensemble method.

The rest of this paper is organized as follows. Section 2 first presents the learning and combining procedures of the NCL and ME methods, respectively. Next, the various features of NCL and ME are investigated against each other in a comparative review. In Sect. 3, based on the complementary features of both methods, an improved hybrid ensemble method is proposed. Section 4 presents the results of our experimental study. Finally, Sect. 5 concludes the paper.

2 Investigation of NCL and ME

In this section, first the NCL and ME methods are reviewed. Next, the properties of NCL and ME are investigated and compared.

2.1 NCL

In NN combining methods, the individual NN experts are usually trained independently. One of the disadvantages of such an approach is the loss of interactions among the individual NN experts during the learning process. It is thus possible that some of the independently designed individual NN experts contribute little to the whole ensemble.

Liu and Yao [17,24] proposed the NCL method, which trains the NN experts in the ensemble simultaneously and interactively through correlation penalty terms in their error functions. In NCL, the error function of the ith NN is expressed by the equation:

E_i = \frac{1}{2}(O_i - y)^2 + \lambda P_i   (1)

where O_i and y are the actual and desired outputs of the ith NN, respectively. The first term in Eq. 1 is the empirical risk function of the ith NN. The second term, P_i, is the correlation penalty function, which can be expressed as:

P_i = -(O_i - O_{ens})^2   (2)

where O_{ens} is the average of the NN experts' outputs in the ensemble. Here, P_i can be regarded as a regularization term that is incorporated into the error function of each ensemble network. This regularization term provides a convenient way to balance the bias-variance-covariance trade-off [17,25]. The term is meant to quantify the amount of error correlation, so it can be minimized explicitly during training, which leads to negatively correlated NN experts. The training process of the base NN experts in NCL is shown in Fig. 1.

The term λ is a scaling coefficient that controls the trade-off between the objective and penalty functions. When λ = 0, the penalty function is removed, resulting in an ensemble in which each expert trains independently of the others, using simple back-propagation (BP). Therefore, the interaction and correlation among the NN experts of the ensemble is controlled explicitly by the value of λ. This penalty function encourages different individual experts in an ensemble to learn different parts or aspects of the training data so that the ensemble can better learn the whole training data set.
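To make the role of Eqs. 1 and 2 concrete, the following minimal sketch trains a small NCL ensemble of single-hidden-layer networks with plain gradient descent. It is an illustration only, not the authors' implementation: the toy data, network sizes, learning rate and the value of λ are assumptions, and the derivative of the penalty is taken as −2(1 − 1/N)(O_i − O_ens), matching Eq. 14 later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data (assumed for illustration only).
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.normal(size=200)

N_EXPERTS, N_HIDDEN, LR, LAMBDA = 5, 5, 0.05, 0.5

# Each expert: one hidden layer (tanh) followed by a linear output.
W1 = [rng.normal(scale=0.5, size=(1, N_HIDDEN)) for _ in range(N_EXPERTS)]
b1 = [np.zeros(N_HIDDEN) for _ in range(N_EXPERTS)]
W2 = [rng.normal(scale=0.5, size=N_HIDDEN) for _ in range(N_EXPERTS)]
b2 = [0.0 for _ in range(N_EXPERTS)]

def forward(i, x):
    h = np.tanh(x @ W1[i] + b1[i])        # hidden activations of expert i
    return h, h @ W2[i] + b2[i]           # hidden outputs, expert output O_i

for epoch in range(200):
    for x, t in zip(X, y):
        x = x[None, :]
        hid, out = zip(*(forward(i, x) for i in range(N_EXPERTS)))
        out = np.array(out).ravel()
        o_ens = out.mean()                 # simple-average ensemble output
        for i in range(N_EXPERTS):
            # dE_i/dO_i = (O_i - y) + lambda * dP_i/dO_i, with
            # dP_i/dO_i = -2 (1 - 1/N) (O_i - O_ens).
            dP = -2.0 * (1.0 - 1.0 / N_EXPERTS) * (out[i] - o_ens)
            delta_out = (out[i] - t) + LAMBDA * dP
            # Back-propagate through the expert's two layers.
            delta_hid = delta_out * W2[i] * (1.0 - hid[i].ravel() ** 2)
            W2[i] -= LR * delta_out * hid[i].ravel()
            b2[i] -= LR * delta_out
            W1[i] -= LR * np.outer(x.ravel(), delta_hid)
            b1[i] -= LR * delta_hid

ens_out = np.mean([forward(i, X)[1] for i in range(N_EXPERTS)], axis=0)
print("training MSE of the NCL ensemble:", np.mean((ens_out - y) ** 2))
```

Setting LAMBDA to 0 in this sketch reduces the update to independent BP training of each expert, which is exactly the special case described above.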

Brown et al. [26] showed that NCL can be viewed as a technique derived from the ambiguity decomposition [26,27]. The ambiguity decomposition, given by Eq. 3, has been widely recognized as one of the most important theoretical results obtained for ensemble learning.



Fig. 1 Diagram of the training process in NCL. The base NN experts are trained through the NCL error function.

It states that the mean-square error (MSE) of the ensemble is guaranteed to be less than or equal to the average MSE of the ensemble members, i.e.:

(O_{ens} - y)^2 = \sum_j g_j (O_j - y)^2 - \sum_j g_j (O_j - O_{ens})^2   (3)

where y is the target value of an arbitrary data point, g_i \geq 0, \sum_i g_i = 1, and O_{ens} is the convex combination of the ensemble members:

O_{ens} = \sum_j g_j O_j   (4)

The ambiguity decomposition provides a simple expression for the effect of error correlations in an ensemble. The decomposition is composed of two terms. The first, \sum_j g_j (O_j - y)^2, is the weighted average error of the individual NN experts. The second term, \sum_j g_j (O_j - O_{ens})^2, referred to as the ambiguity, measures the amount of variability among the ensemble members and can be considered a correlation of the individual NN experts. As mentioned above, utilizing the coefficient λ allows us to vary the emphasis on the correlation component to yield a near-optimum balance in the trade-off between these two terms; this can be regarded as an accuracy-diversity trade-off [28,29].
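As a quick numerical check of Eqs. 3 and 4 (with arbitrary made-up member outputs and weights, not values from the paper), the squared error of the convex combination equals the weighted average member error minus the ambiguity term:

```python
import numpy as np

rng = np.random.default_rng(1)
y = 0.7                                   # arbitrary target value
O = rng.normal(size=5)                    # arbitrary member outputs O_j
g = rng.random(5); g /= g.sum()           # convex weights: g_j >= 0, sum to 1

O_ens = np.sum(g * O)                     # Eq. 4
lhs = (O_ens - y) ** 2                    # ensemble error
avg_err = np.sum(g * (O - y) ** 2)        # weighted average member error
ambiguity = np.sum(g * (O - O_ens) ** 2)  # diversity (ambiguity) term
print(lhs, avg_err - ambiguity)           # identical up to rounding
```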

Hansen [30] showed that there is a relationship between the bias-variance-covariance decomposition and the ambiguity decomposition, in which portions of the terms of the former correspond with portions of the ambiguity decomposition terms. Therefore, any attempt to strike a balance between the two ambiguity decomposition terms also balances the three components of the other decomposition against each other, and the MSE tends toward a near-minimum condition.

2.2 ME

The ME method was introduced by Jacobs et al. [18,31]. The authors examined the use of different error functions in the learning process for the expert networks in the ME method. Jacobs et al. proposed making the base NNs into local experts for different distributions of the data space; as a result, the increased diversity among the experts led to improvements in the performance of this method. Various error functions were then investigated with respect to the performance criterion.



In the first test, the following error function was used for the NN experts:

E = \left( y - \sum_j g_j O_j \right)^2   (5)

where g_j is the proportional contribution of expert j to the combined output vector.

According to an analysis of the derivative of this error function, the weights of each NN expert are updated based on the overall ensemble error rather than the error of each expert. This strong coupling in the process of updating the weights of the experts engenders a high level of cooperation over the whole problem space and tends to employ almost all of the experts for each data sample. This situation is inconsistent with the localization of the experts to different data distributions.

The second error function analyzed was the following:

E = \sum_j g_j (y - O_j)^2   (6)

According to the derivative of this term, the weights of each NN expert are updated based on its own error in predicting the target, and each expert yields a complete output vector rather than a residual, in contrast with the first error function. Despite this advantage, this error function does not ensure the localization of the experts, which is the key factor in the efficiency of ME.

Therefore, Jacobs et al. introduced a new error function based on the negative log probability of generating the desired output vector, assuming a mixture of Gaussian models:

E = -\log \sum_{j=1}^{N} g_j \exp\left( -\frac{1}{2} (y - O_j)^2 \right)   (7)

To evaluate this error function, its derivative with respect to the ith expert is analyzed:

\frac{\partial E}{\partial O_i} = -\left[ \frac{ g_i \exp\left( -\frac{1}{2}(y - O_i)^2 \right) }{ \sum_j g_j \exp\left( -\frac{1}{2}(y - O_j)^2 \right) } \right] (y - O_i)   (8)

According to this derivative, similar to the previous error function, the learning of each NN expert is based on its individual error. Moreover, the weight-updating factor for each expert is proportional to the ratio of its error value to the total error. These two features of the proposed error function, which cause each expert to localize in its corresponding subspace, eliminate the deficiencies of the previous error functions. Thus, the ME method is more efficient with this error function.

In addition, a gating network is used to complete a system of competing local experts. The gating network allows the mixing proportions of the experts to be determined by learning a partition of the input space and trusts one or more expert(s) in each of these partitions. The learning rule for the gating network attempts to maximize the likelihood of the training set by assuming a Gaussian mixture model in which each expert is responsible for one component of the mixture. Thus, the network itself partitions the input space, so we refer to it as a self-directed partitioning network, and its experts are directed towards subspaces that are determined by the gating network and are in agreement with each expert's performance. The structure of ME is shown in Fig. 2. For further details on the implementation of this scheme, please refer to [32,33].
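The following small sketch illustrates the bracketed factor of Eq. 8, the posterior responsibility of each expert under the Gaussian-mixture view of Eq. 7, and the resulting per-expert gradient. The numerical values are arbitrary assumptions used only for illustration.

```python
import numpy as np

def me_posteriors(y, expert_outputs, gate_probs):
    """Posterior responsibility of each expert under the Gaussian-mixture
    view of Eq. 7: h_i proportional to g_i * exp(-0.5 * (y - O_i)^2)."""
    lik = gate_probs * np.exp(-0.5 * (y - expert_outputs) ** 2)
    return lik / lik.sum()

# Arbitrary example values (assumptions, not taken from the paper).
y = 1.0
O = np.array([0.9, 0.2, 1.4])        # expert outputs O_i
g = np.array([0.5, 0.3, 0.2])        # gating priors g_i

h = me_posteriors(y, O, g)
dE_dO = -h * (y - O)                 # per-expert gradient of Eq. 7, as in Eq. 8
print(h, dE_dO)
```

The expert whose output is closest to the target and whose gating prior is largest receives the largest responsibility, so its weights are pushed hardest towards the target, which is the localization mechanism discussed above.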

The ME method has special characteristics that distinguish it from the other combining methods; in particular, it differs from the others in its dynamic combination method.



Fig. 2 Block diagram of the mixture of experts. The scheme of the simultaneous functions of the experts and the gating network is shown in this figure. The final mixed output of ME (O_ens) is calculated as the weighted average of the ensemble members' outputs (O_i), using the weights (g_i) given by the gating network.

In the literature on combining methods, ME refers to methods in which a complex problem is partitioned, following a "divide and conquer" approach, into a set of simpler sub-problems that are distributed among the experts. In this method, instead of assigning a set of fixed combinational weights to the experts, as described previously, an extra gating network is used to compute these weights dynamically from the inputs.

Since Jacobs' proposal, and given the special features mentioned above relative to the various other types of combining methods [12,34], ME has attracted a great deal of attention in the literature on combining methods. Due to its good performance, ME has been employed for a variety of problems in the pattern-recognition field [13,35,36] and in machine learning applications [37,38].

2.3 NCL Versus ME

In this part, we compare the features of ME and NCL, discussing their advantages and disadvantages. First, the similar features of the two methods are discussed. Both of these ensemble algorithms train NN experts simultaneously and interactively. As mentioned before, the different and unique error functions of the two methods have specific properties that encourage the experts to learn different parts or aspects of the training data, so that the ensemble can learn the entire training data set efficiently. By implicitly assigning different distributions of the data space to different experts, these two methods produce biased individual experts with negatively correlated estimations [13,24].

Nevertheless, there are some differences between the ME and NCL methods that arise from their specific characteristics in comparison with other ensemble algorithms. One of the advantages of ME over other combining methods is its distinct technique for combining the outputs of the experts. ME uses a trainable combiner that, according to each input, dynamically selects the best expert(s) and combines their outputs to create the final output. The combining function of ME is a dynamic weighted average in which the local competence of the experts with respect to the input is estimated by the weights produced by the gating network. The outputs of all NN experts responsible for input x are then fused [12].

As mentioned before, combining NN systems have two major components. Regarding the first component, the creation of individual NN experts, NCL is the more effective. Its superiority comes from its use of a regularization term that provides a convenient way to balance the bias-variance-covariance trade-off and thus improves the generalization ability, whereas ME does not include such control over the trade-off [17]. In contrast, ME provides a better approach for the second component of combining systems, the combination of the base NN experts.

As is clear from the analysis of the features of both methods and their advantages and disadvantages, the two methods have complementary features. In the next section, we present a proposed approach that attempts to combine the features of both methods.

3 Proposed Hybrid Ensemble Method

In this section, based on the similar ensemble structures and strategies used in both the NCL and ME methods and due to their complementary features, an improved hybrid ensemble method is proposed.

3.1 Incorporation of a Regularization Term to Control Negative Correlation in ME

Based on the similar ensemble structures and strategies used in both the NCL and ME methods and due to their complementary features, any attempt to combine the principles of both should address their limitations and overcome them by combining elements of the other method. In this study, we propose an approach for combining the features of ME and NCL.

The ME and NCL methods use different error functions to incorporate negative correlation between the experts. Although ME can produce negatively correlated experts, it does not include a control parameter like that of NCL to adjust this correlation explicitly and so strike a near-optimal balance in the bias-variance-covariance trade-off [13,17]. To introduce this advantage of NCL into the training algorithm of ME, we incorporated this control parameter in the ME error function. The modified error function of ME was obtained by adding the penalty term from NCL to the error function. We therefore call the proposed method the mixture of negatively correlated experts (MNCE). The new error function takes the form of Eqs. 9 and 10:

E = -\log \sum_{j=1}^{N} g_j \exp\left( -\frac{1}{2} (y - O_j)^2 + \lambda P_j \right)   (9)

P_j = -(O_j - O_{ens})^2   (10)

where O_{ens} is the combined output of the ensemble and N is the number of NN experts in the ensemble.

The introduced penalty term, similarly to NCL, provides a control parameter leading to a near-optimal balance in the bias-variance-covariance trade-off. In this ensemble architecture, each expert network is a multi-layer perceptron (MLP) with one hidden layer that computes an output O_i as a function of the input vector x and the weights of the hidden and output layers, with a sigmoid activation function. To train the MLP experts based on the new error function using the BP training algorithm, the weights for each expert i are updated according to the following rules:

h_{MNCE,i} = \frac{ g_i \exp\left( -\frac{1}{2}(y - O_i)^2 + \lambda P_i \right) }{ \sum_{j=1}^{N} g_j \exp\left( -\frac{1}{2}(y - O_j)^2 + \lambda P_j \right) }   (11)

\Delta w_{y,i} = \eta_e h_{MNCE,i} \left[ (y - O_i) - \lambda \frac{\partial P_i}{\partial O_i} \right] \left( O_i (1 - O_i) \right) O_{h,i}^{T}   (12)



\Delta w_{h,i} = \eta_e h_{MNCE,i} w_{y,i}^{T} \left[ (y - O_i) - \lambda \frac{\partial P_i}{\partial O_i} \right] \left( O_i (1 - O_i) \right) O_{h,i} \left( 1 - O_{h,i} \right) x_i   (13)

\frac{\partial P_i}{\partial O_i} = -2 \left( 1 - \frac{1}{N} \right) \left( O_i - \overline{O} \right)   (14)

\overline{O} = \frac{1}{N} \sum_{i=1}^{N} O_i   (15)

where η_e is the learning rate, λ is the NCL control parameter, g_i is the ith output of the gating network after applying the softmax function, and h_{MNCE,i} is the learning factor, which can be interpreted as the proportional measure of competence for expert i. w_h and w_y are the weight vectors of the input-to-hidden and hidden-to-output layers of the expert networks, respectively, and O_{h,i}^{T} is the transpose of O_{h,i}, the outputs of the hidden layer of the ith expert network.

Similar to the original ME, the gate is composed of two layers: the first layer is an MLP network, and the second layer is a softmax nonlinear operator. Thus, the gating network computes O_g, the output of its MLP layer, and then applies the softmax function to obtain:

g_i = \mathrm{softmax}(O_{g,i})   (16)

\mathrm{softmax}(O_{g,i}) = \frac{\exp(O_{g,i})}{\sum_{j=1}^{N} \exp(O_{g,j})}, \quad i = 1, \ldots, N   (17)

where O_{g,i} is the ith output value of the gating network. Here, the g_i values are nonnegative and sum to unity, and they can be interpreted as estimates of the prior probability that expert i can generate the desired output y.

According to the MNCE error function (Eq. 9), the modified error function of the gating network can be written as:

E_{G,MNCE} = \frac{1}{2} \left( h_{MNCE} - g \right)^2   (18)

where h_{MNCE} = [h_{MNCE,i}]_{i=1}^{N} and g = [g_i]_{i=1}^{N} is the output vector of the gating network after applying the softmax function. Based on the modified error function, the weights of the gating network in the MNCE method are determined using the BP algorithm according to the following rules:

\Delta w_{y,g} = \eta_g \left( h_{MNCE} - g \right) \left( O_g (1 - O_g) \right) O_{h,g}^{T}   (19)

\Delta w_{h,g} = \eta_g w_{y,g}^{T} \left( h_{MNCE} - g \right) \left( O_g (1 - O_g) \right) O_{h,g} \left( 1 - O_{h,g} \right) x_i   (20)

where η_g is the learning rate, and w_{h,g} and w_{y,g} are the weights of the input-to-hidden and hidden-to-output layers of the gating network, respectively. O_{h,g}^{T} is the transpose of O_{h,g}, the outputs of the hidden layer of the gating network.

Finally, to combine the experts' outputs, the gate assigns a weight g_j, as a function of x, to each expert's output O_j, and the final mixed output of the ensemble is:

O_T = \sum_{j=1}^{N} g_j O_j   (21)

The structure of MNCE and its simultaneous training algorithm for the experts and the gating network are shown in Fig. 3.



Fig. 3 Diagram of the mixture of negatively correlated experts (MNCE). The MNCE is composed of the experts and a gating network. The expert training process and the gating network work simultaneously to minimize the modified error functions. The experts compete to learn the training patterns, and the gating network mediates the competition. The added control parameter λP_i provides an explicit control for efficiently adjusting the measure of negative correlation between the experts.
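The sketch below puts the pieces of Sect. 3.1 together for a single-output case: sigmoid MLP experts updated with Eqs. 12–15, a sigmoid-plus-softmax gate updated with Eqs. 19–20, and the competence factors of Eq. 11 taken as the normalized, non-negative quantities that the gating error of Eq. 18 targets. Everything concrete here (toy data, three experts, five hidden units, learning rates, λ = 0.8) is an assumption for illustration, not the authors' code or settings.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)

# Toy two-class data in [0,1]^2 (assumed for illustration only).
X = rng.random((300, 2))
Y = (X[:, 0] + X[:, 1] > 1.0).astype(float)

N, H, D = 3, 5, 2                    # experts, hidden units, input dimension
eta_e, eta_g, lam = 0.1, 0.05, 0.8   # learning rates and control parameter

w_h = rng.normal(scale=0.3, size=(N, D, H))   # expert input->hidden weights
w_y = rng.normal(scale=0.3, size=(N, H))      # expert hidden->output weights
w_hg = rng.normal(scale=0.3, size=(D, H))     # gate input->hidden weights
w_yg = rng.normal(scale=0.3, size=(H, N))     # gate hidden->output weights

for epoch in range(50):
    for x, t in zip(X, Y):
        # Forward pass for the experts and the gate.
        O_h = sigmoid(np.einsum('d,ndh->nh', x, w_h))   # per-expert hidden outputs
        O = sigmoid(np.einsum('nh,nh->n', O_h, w_y))    # expert outputs O_i
        O_hg = sigmoid(x @ w_hg)                        # gate hidden layer
        O_g = sigmoid(O_hg @ w_yg)                      # gate MLP outputs
        g = softmax(O_g)                                # gating weights (Eqs. 16-17)
        O_ens = g @ O                                   # combined output (Eq. 21)
        # MNCE quantities.
        P = -(O - O_ens) ** 2                           # penalty terms (Eq. 10)
        num = g * np.exp(-0.5 * (t - O) ** 2 + lam * P)
        h = num / num.sum()                             # competence factors (Eq. 11)
        dP = -2.0 * (1 - 1 / N) * (O - O.mean())        # Eq. 14 with O-bar of Eq. 15
        # Expert updates (Eqs. 12-13).
        delta = h * ((t - O) - lam * dP) * O * (1 - O)
        for i in range(N):
            d_hid = delta[i] * w_y[i] * O_h[i] * (1 - O_h[i])
            w_y[i] += eta_e * delta[i] * O_h[i]
            w_h[i] += eta_e * np.outer(x, d_hid)
        # Gating updates (Eqs. 18-20): pull g towards h.
        d_out = (h - g) * O_g * (1 - O_g)
        d_hid_g = (w_yg @ d_out) * O_hg * (1 - O_hg)
        w_yg += eta_g * np.outer(O_hg, d_out)
        w_hg += eta_g * np.outer(x, d_hid_g)

# Evaluate the mixed output O_T = sum_j g_j O_j on the training data.
correct = 0
for x, t in zip(X, Y):
    O = sigmoid(np.einsum('nh,nh->n', sigmoid(np.einsum('d,ndh->nh', x, w_h)), w_y))
    g = softmax(sigmoid(sigmoid(x @ w_hg) @ w_yg))
    correct += int((g @ O > 0.5) == t)
print("training accuracy:", correct / len(X))
```

Setting lam to 0 in this sketch recovers plain ME training; increasing it strengthens the decorrelation pressure described in the text.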

According to this MNCE training procedure, in the network's learning process the expert networks compete for each input pattern, and the gating network rewards the winner of each competition with stronger error feedback signals. This competitive learning procedure leads to a localization of the experts into possibly overlapping regions. Additionally, incorporating the control parameter of NCL into the error function of ME provides an explicit control for efficiently adjusting the measure of negative correlation between the experts. Thus, it yields a better balance in the bias-variance-covariance trade-off, which improves the generalization ability.

4 Experimental Results

4.1 Experiments

Several experiments were conducted to analyze the performance of our proposed method. The first experiment investigated the dependence of MNCE performance on the control parameter λ. The second experiment was conducted to compare the performance of MNCE with that of ME and NCL. To evaluate the performance of these methods, several classification problems using real data from the UCI Machine Learning Repository [39] were used in these experiments. A summary of the data sets is given in Table 1.

The third experiment was conducted to investigate the dependence of the diversity measure of the base experts on the control parameter λ in the MNCE method. In the previous experiments the compared ensemble methods were employed in classification tasks, but the fourth experiment analyzed the training process in MNCE versus that in ME in a regression task. In this fourth experiment, the training processes of the ME and MNCE methods were evaluated in terms of the bias-variance-covariance trade-off. This experiment was designed to show how MNCE, with the added control parameter, works better than ME.



Table 1 Summary of the ten classification data sets

Dataset                 Size     Attributes   Classes
Balance (a)             625      4            3
Glass identification    214      10           7
Ionosphere              351      34           2
Pen-digit (b)           10,992   16           10
Phoneme                 5,404    5            2
Liver disorders         345      7            2
Pima (c)                768      8            2
Vehicle (d)             946      18           4
Sat-image (e)           6,435    4 (f)        6
Yeast                   1,484    8            10

(a) Balance scale. (b) Pen-based recognition of handwritten digits. (c) Pima Indians diabetes. (d) Statlog (vehicle silhouettes). (e) Statlog (landsat satellite). (f) Features 17th–20th, as recommended by the database designers.

4.2 Parameters

In all experiments except the second, each ensemble system included five MLP networks with one hidden layer. In the second experiment, to investigate the effect of ensemble size on the performance of the compared methods, ensembles with three, five and seven expert networks were tested. The number of nodes in the hidden layer was set to 5. All methods were trained using the BP algorithm. In the MNCE and ME methods, the experts and the gating network were trained with learning-rate values of η_e = 0.1 and η_g = 0.05, respectively. The learning-rate value for the experts in NCL was also 0.1. For the NCL and MNCE methods, λ*, the optimum value of λ in terms of the minimum error rate, was determined using a trial-and-error procedure for each classification problem over the interval [0:0.1:2].

The classification performance was measured using the k-fold cross-validation technique, with k = 5. The data set was partitioned into five equally sized subsets, and each subset was taken in turn as the test set, making 5 trials in all. In each trial, one of the remaining subsets was chosen randomly as the validation set, while the remaining three subsets were combined to form the training set, and the system was trained on the training set. To ensure the generalization ability, the early stopping technique on the validation set was used: if the value of the MSE on the validation set increased for more than five consecutive epochs, training ceased. Also, λ* was determined in terms of the minimum error rate on the union of the training and validation sets. The obtained system was then tested on the test set.
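The following sketch mirrors this evaluation protocol: five folds, a randomly chosen validation fold, early stopping handled inside training, and λ chosen by trial and error on the union of the training and validation data. The functions `train_mnce` and `error_rate` are hypothetical placeholders (not code from the paper) that the reader would supply.

```python
import numpy as np

def evaluate(X, Y, train_mnce, error_rate,
             lambdas=np.arange(0.0, 2.0 + 1e-9, 0.1), k=5, patience=5, seed=0):
    """Hypothetical sketch of the protocol of Sect. 4.2: k-fold CV with a random
    validation fold, early stopping, and trial-and-error selection of lambda."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    test_errors = []
    for t in range(k):
        test_idx = folds[t]
        rest = [f for j, f in enumerate(folds) if j != t]
        val_idx = rest.pop(rng.integers(len(rest)))     # random validation fold
        train_idx = np.concatenate(rest)                # remaining three folds
        tv_idx = np.concatenate([train_idx, val_idx])   # train + validation union
        best_lam, best_err = None, np.inf
        for lam in lambdas:
            # train_mnce is assumed to stop when the validation MSE rises for
            # `patience` consecutive epochs and to return a fitted model.
            model = train_mnce(X[train_idx], Y[train_idx],
                               X[val_idx], Y[val_idx], lam, patience)
            err = error_rate(model, X[tv_idx], Y[tv_idx])
            if err < best_err:
                best_lam, best_err = lam, err
        model = train_mnce(X[train_idx], Y[train_idx],
                           X[val_idx], Y[val_idx], best_lam, patience)
        test_errors.append(error_rate(model, X[test_idx], Y[test_idx]))
    return np.mean(test_errors), np.std(test_errors)
```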

4.3 Effect of λ on the Performance of MNCE

In this experiment, the performance of MNCE with five expert networks and different values of the control parameter λ was investigated using the Vehicle and Pen-digit datasets. The effects of varying the control parameter in the range [0:0.2:2] on the test error rates (%) of MNCE are shown in Fig. 4a and b.

As mentioned before, the λ coefficient allows the method to vary the emphasis on the correlation component to yield a near-optimum balance in the trade-off between the accuracy and diversity terms (i.e., their correlation). Both figures showed that the error rates of MNCE decreased initially and then increased with increasing values of λ. This variation trend in the error rate also reflects the aforementioned trade-off. Figure 4a showed that MNCE reached its minimum error rate on the Pen-digit classification problem at λ = 0.8.



Fig. 4 Dependence of the test error rate (%) of the MNCE method on the control parameter λ. The variation trends of the (average and variance) error rates of the MNCE method for different λ values in the interval [0:0.2:2] in the classification of the pen-digit and vehicle datasets are shown in (a) and (b), respectively.

Table 2 The classification test error rates (%) of different ensemble methods for ten classification benchmarks. Error rates are reported as average (std); λ* is the optimum value of λ, determined using a trial-and-error procedure for the NCL and MNCE methods.

Data set               Bagging        Boosting       ME             λ* (NCL)   NCL            λ* (MNCE)   MNCE
Balance                7 (2E−1)       6.8 (3E−1)     6.7 (1E−1)     1.2        6.5 (6E−2)     0.9         5.2 (1E−1)
Ionosphere             6.1 (1E−1)     5.5 (2E−1)     5.2 (4E−2)     0.8        5.9 (2E−1)     1.7         3.6 (3E−2)
Glass identification   29.9 (3E−1)    32 (6E−1)      30.8 (3E−1)    1          31.4 (1E−1)    0.1         28.3 (4E−1)
Liver disorders        30.5 (4E−1)    29 (8E−1)      29.1 (5E−1)    1.1        28 (4E−1)      0.2         26.7 (2E−1)
Pen-digit              7 (2E−2)       6.2 (3E−1)     6.4 (2E−1)     1          5.5 (8E−2)     0.8         4.9 (3E−1)
Pima                   25.4 (1E−1)    24.1 (4E−1)    23.9 (2E−1)    1          24.7 (1E−1)    1           22 (2E−1)
Phoneme                19.2 (1E−1)    18 (2E−1)      17.9 (9E−2)    0.8        18.3 (5E−2)    1           17.1 (2E−1)
Vehicle                20.3 (2E−1)    19.8 (3E−1)    19.5 (1E−1)    0.8        18.6 (6E−2)    1.2         17.7 (2E−1)
Sat-image              17 (1E−1)      16.1 (3E−1)    16.2 (1E−1)    1          15.9 (7E−2)    1.1         13.8 (3E−1)
Yeast                  42.6 (7E−1)    42.1 (9E−1)    40.5 (7E−1)    0.9        41 (4E−1)      1.3         39 (5E−1)

The lowest error rate in each dataset is marked in boldface.

Figure 4b showed that the minimum error rate occurred at λ = 1.2 for the Vehicle classification problem. As also shown in this figure, when λ was too large, the error rates increased dramatically. In such cases, the learning procedure mainly minimized the correlation penalty terms in the error functions rather than the error of the ensemble.

4.4 MNCE in Comparison with ME and NCL

In this experiment, the performance of MNCE was compared with that of the common ensemble methods, including the ME, NCL, bagging and boosting methods, on each classification problem. For the NCL and MNCE methods, the optimum value of λ in the interval [0:0.1:2] was determined by trial and error for each classification problem. Based on these settings, in the first part of this experiment, the test error rates (%) of the studied ensemble methods with five expert networks on the ten classification problems are reported in Table 2.



In this study, the statistical t test [40] based on the error rate was used to determine whether there was a statistically significant performance improvement with the proposed method versus ME and NCL. We found that the error rate of MNCE with the optimum λ value was significantly lower than that of ME at the 95 % confidence level on all classification problems, with the exception of the Phoneme problem; on this benchmark, although MNCE had better performance, its superiority was not significant. In comparison with NCL, our proposed method had significantly lower error rates on all classification problems, with the exception of the Pen-digit and Vehicle problems. On these benchmarks, although MNCE had better performance, its superiority was not significant based on the t test.

In the second part of this experiment, the proposed method was additionally compared with the ME and NCL methods at different ensemble sizes, including three, five and seven experts. The effect of ensemble size on the test error rates (%) of the compared methods on five classification benchmarks is shown in Fig. 5(a–e).

As observed in these figures, MNCE had lower error rates than the ME and NCL methods on all of the classification benchmarks for all considered ensemble sizes, including three, five and seven expert networks.

4.5 Dependence of Diversity Measure on the λ Coefficient in MNCE

This experiment was conducted to investigate the dependence of the diversity measure of the base experts on the control parameter λ in the MNCE method. A classification task on the Vehicle dataset was used in this experiment. Two pairwise diversity measures, which are computed over pairs of classifiers, were investigated: the correlation coefficient and the disagreement measure [28]. These pairwise diversity measures rely on Table 3, where f_i and f_j are two classifiers and N^{ab} is the number of data points for which f_i is correct (a = 1) or wrong (a = 0) and f_j is correct (b = 1) or wrong (b = 0).

The correlation coefficient measure indicates the strength and direction of a linear relationship between two classifiers. In general, the correlation refers to the departure of two variables from independence. The correlation between two binary classifiers, f_i and f_j, is

CC_{i,j} = \frac{ N^{11} N^{00} - N^{10} N^{01} }{ \sqrt{ (N^{11} + N^{10})(N^{01} + N^{00})(N^{11} + N^{01})(N^{10} + N^{00}) } }   (22)

The correlation coefficient measure varies in the range [−1, 1] and has an inverse relation with the diversity between the classifiers: a more negative correlation coefficient indicates more diversity.

The disagreement measure [28] is the ratio of the number of observations on which one classifier is correct and the other is incorrect to the total number of observations.

Dis_{i,j} = \frac{ N^{01} + N^{10} }{ N^{11} + N^{10} + N^{01} + N^{00} }   (23)

The disagreement measure varies in the range [0, 1] and has a direct relation with the diversity between the classifiers.
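A small sketch of the two pairwise measures of Eqs. 22 and 23, computed from per-sample correctness indicators of two classifiers; the example correctness patterns at the end are arbitrary assumptions.

```python
import numpy as np

def pairwise_diversity(correct_i, correct_j):
    """Correlation coefficient (Eq. 22) and disagreement (Eq. 23) for two
    classifiers, given boolean per-sample correctness vectors."""
    ci = np.asarray(correct_i, dtype=bool)
    cj = np.asarray(correct_j, dtype=bool)
    n11 = np.sum(ci & cj)      # both correct
    n10 = np.sum(ci & ~cj)     # f_i correct, f_j wrong
    n01 = np.sum(~ci & cj)     # f_i wrong, f_j correct
    n00 = np.sum(~ci & ~cj)    # both wrong
    denom = np.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    cc = (n11 * n00 - n10 * n01) / denom
    dis = (n01 + n10) / (n11 + n10 + n01 + n00)
    return cc, dis

# Toy correctness patterns for two classifiers (assumed values).
print(pairwise_diversity([1, 1, 0, 1, 0, 1, 1, 0],
                         [1, 0, 1, 1, 0, 0, 1, 1]))
```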

In order to investigate the effect of the λ parameter on these diversity measures, our proposed method was tested with different values of λ, under the same experimental settings as before, and the diversity between the base experts of the ensemble was measured at the end of the training process; the results are shown in Fig. 6(a, b).

As shown in Fig. 6(a, b), with increasing values of λ in the range [0, 1.1], the variation trends of both diversity measures showed increasing diversity among the experts.



Fig. 5 The average and variance of the test error rates (%) of MNCE compared with the ME and NCL methods at different ensemble sizes, including three, five and seven experts, on the classification problems of the (a) phoneme, (b) sat-image, (c) pen-digit, (d) glass and (e) vehicle datasets.

Table 3 The relationship between a pair of classifiers f_i and f_j

                    f_j correct (1)   f_j wrong (0)
f_i correct (1)     N^11              N^10
f_i wrong (0)       N^01              N^00

However, in the range [1.2, 2], no direct relationship was observed between the variation of λ and the diversity measures.

Both figures showed that both diversity measures were maximized at λ = 1.2. Interestingly, according to Table 2, the error rate of MNCE was also minimized at this value of λ, which means that in this classification problem the maximum performance was achieved when there was maximum diversity between the experts of MNCE.



Fig. 6 Dependence of two diversity measures, the disagreement (Dis) and correlation coefficient (CC) measures, on the control parameter λ for the MNCE method in the classification of the vehicle dataset, shown in (a) and (b), respectively.

4.6 Analysis of MSE Generalization Terms in the MNCE and ME Training Processes

Jacobs [13] analyzed an ensemble system in which the outputs are based on weighted averages of the outputs of the base experts and proposed estimators for quantifying the bias, variance and covariance terms in this ensemble system. These estimators were called the integrated bias (IB), integrated variance (IV) and integrated covariance (IC), respectively; the integrated MSE (IMSE) was defined as the sum of the three previous terms:

IB = \frac{1}{|X|} \sum_{i=1}^{|X|} \left( O_{ens}(x_i) - y_i \right)^2   (24)

IV = \sum_{n=1}^{N} \frac{1}{|X|} \sum_{i=1}^{|X|} \frac{1}{S} \sum_{s=1}^{S} \left( g_n(x_i, s) O_n(x_i, s) - \overline{g_n(x_i) O_n(x_i)} \right)^2   (25)

IC = \sum_{n=1}^{N} \sum_{m \neq n} \frac{1}{|X|} \sum_{i=1}^{|X|} \frac{1}{S} \sum_{s=1}^{S} \left( g_n(x_i, s) O_n(x_i, s) - \overline{g_n(x_i) O_n(x_i)} \right) \left( g_m(x_i, s) O_m(x_i, s) - \overline{g_m(x_i) O_m(x_i)} \right)   (26)

where S is the number of simulations, N is the number of experts, g_n(x_i, s) is the nth output of the gating network on input x_i in simulation s, O_n(x_i, s) is the output of the nth expert on input x_i in simulation s, \overline{g_n(x_i) O_n(x_i)} is the average weighted output of expert n on input x_i, and |X| is the size of the data set.

Fig. 7 The variation trends of the IB, IV, IC and IMSE terms in the training processes of the ME and MNCE methods on a regression task are shown in (a), (b), (c) and (d), respectively.

The IMSE can be expressed as the sum of the IB, IV and IC:

IMSE = \frac{1}{|X|} \sum_{i=1}^{|X|} \frac{1}{N} \sum_{n=1}^{N} \left( O_{ens}(x_i, n) - y_i \right)^2   (27)
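The sketch below estimates IB, IV and IC from arrays of per-simulation gating weights and expert outputs. The array layout, the use of per-input averages over simulations for the barred quantities, and the random test data are assumptions for illustration; the closing line checks numerically that IB + IV + IC matches the mean squared ensemble error averaged over simulations, which is how we read the IMSE.

```python
import numpy as np

def integrated_terms(G, O, y):
    """Estimate IB, IV and IC (Eqs. 24-26).

    G, O : arrays of shape (S, n_points, N) with the gating weights and expert
           outputs of S independent simulations (assumed layout).
    y    : targets of shape (n_points,).
    """
    W = G * O                                # weighted expert outputs g_n * O_n
    O_ens = W.sum(axis=2)                    # ensemble output, per simulation
    dev = W - W.mean(axis=0)                 # deviation from per-input average
    ib = np.mean((O_ens.mean(axis=0) - y) ** 2)                   # Eq. 24
    iv = np.mean(dev ** 2, axis=(0, 1)).sum()                     # Eq. 25
    cov = np.einsum('spn,spm->nm', dev, dev) / (dev.shape[0] * dev.shape[1])
    ic = cov.sum() - np.trace(cov)                                # Eq. 26, m != n
    return ib, iv, ic

# Random data with assumed shapes, just to exercise the estimators.
rng = np.random.default_rng(0)
G = rng.dirichlet(np.ones(3), size=(20, 50))   # (S, n_points, N); rows sum to 1
O = rng.normal(size=(20, 50, 3))
y = rng.normal(size=50)

ib, iv, ic = integrated_terms(G, O, y)
imse = np.mean(((G * O).sum(axis=2) - y) ** 2)  # mean over inputs and simulations
print(ib + iv + ic, imse)                       # agree up to rounding error
```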

We investigated the variation trends of these error terms in the training processes of the ME and MNCE methods on a regression task. The regression function employed in this experiment was:

f(x) = \frac{1}{13} \left[ 10 \sin(\pi x_1 x_2) + 20 \left( x_3 - \frac{1}{2} \right)^2 + 10 x_4 + 5 x_5 \right] - 1   (28)

where x = [x_1, x_2, x_3, x_4, x_5] is an input vector with components lying between zero and one. The value of f(x) lies in the interval [−1, 1]. The experimental settings were similar to Jacobs' suggestions [13]. The variation trends of the IB, IV, IC and IMSE terms during the training processes of the ME and MNCE methods are plotted in Fig. 7a–d, respectively.
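As a small illustration, the following snippet generates a data set from Eq. 28; the sample size and random seed are arbitrary assumptions.

```python
import numpy as np

def target(x):
    """Regression target of Eq. 28; x has five components in [0, 1]."""
    return (10 * np.sin(np.pi * x[:, 0] * x[:, 1])
            + 20 * (x[:, 2] - 0.5) ** 2
            + 10 * x[:, 3] + 5 * x[:, 4]) / 13.0 - 1.0

rng = np.random.default_rng(0)
X = rng.random((1000, 5))        # sample size is an arbitrary assumption
y = target(X)
print(y.min(), y.max())
```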

The base experts in the ME and MNCE methods were localized during the training process. As described by Jacobs [13] (and as shown in Fig. 7b, c), through this localization process the IV of the experts' weighted outputs increases and the IC of the experts' weighted outputs decreases. Besides controlling the IV and IC terms, the added regularization term in MNCE decreased the IB of the ensemble, as shown in Fig. 7a. Due to the incorporation of a decorrelation term in the objective function of the MNCE method, the IC of MNCE was lower than that of ME, as shown in Fig. 7c; hence, MNCE incorporates more negative correlation among the experts than ME does. Despite the increasing IV term, the decreases in the IB and IC terms yielded a lower IMSE, as shown in Fig. 7d.



5 Conclusions

This study analyzed two of the more advanced frameworks for ensembles of learning machines: NCL and ME. We compared the algorithms and discussed the advantages and weaknesses of each. Based on their complementary features, an approach was proposed that combines their strengths to address their respective weaknesses and improve performance, by incorporating elements of one algorithm into the other.

This approach, MNCE, may be viewed as an improved version of ME. In this method, the tunable control parameter of the NCL training algorithm is integrated into the error function of ME to promote negative correlation explicitly. This control parameter can be regarded as a regularization term added to the error function of ME that provides a convenient way to balance the bias-variance-covariance trade-off.

The theoretical analysis and the experimental results on various regression and classification benchmark problems showed that the proposed hybrid method preserves the beneficial features of its components, overcomes their deficiencies, and displays significantly better performance than the ME and NCL methods. Additionally, an experiment was conducted to compare MNCE with ME in terms of the bias-variance-covariance trade-off, to demonstrate how and why MNCE performs better than ME. This experiment showed that MNCE controls not only the variance and covariance of the experts but also the bias of the ensemble system, thus reducing the bias and the MSE significantly.

References

1. Mu XY, Watta P, Hassoun M (2009) Analysis of a plurality voting-based combination of classifiers. Neural Process Lett 29(2):89–107. doi:10.1007/s11063-009-9097-1
2. Wang Z, Chen SC, Xue H, Pan ZS (2010) A novel regularization learning for single-view patterns: multi-view discriminative regularization. Neural Process Lett 31(3):159–175. doi:10.1007/s11063-010-9132-2
3. Valle C, Saravia F, Allende H, Monge R, Fernandez C (2010) Parallel approach for ensemble learning with locally coupled neural networks. Neural Process Lett 32(3):277–291. doi:10.1007/s11063-010-9157-6
4. Aladag CH, Egrioglu E, Yolcu U (2010) Forecast combination by using artificial neural networks. Neural Process Lett 32(3):269–276. doi:10.1007/s11063-010-9156-7
5. Gómez-Gil P, Ramírez-Cortes JM, Pomares Hernández SE, Alarcón-Aquino V (2011) A neural network scheme for long-term forecasting of chaotic time series. Neural Process Lett 33(3):215–233
6. Lorrentz P, Howells WGJ, McDonald-Maier KD (2010) A novel weightless artificial neural based multi-classifier for complex classifications. Neural Process Lett 31(1):25–44. doi:10.1007/s11063-009-9125-1
7. Ghaderi R (2000) Arranging simple neural networks to solve complex classification problems. Surrey University, Surrey
8. Ghaemi M, Masoudnia S, Ebrahimpour R (2010) A new framework for small sample size face recognition based on weighted multiple decision templates. Neural Inf Process Theory Algorithms 6643/2010:470–477. doi:10.1007/978-3-642-17537-4_58
9. Tresp V, Taniguchi M (1995) Combining estimators using non-constant weighting functions. Adv Neural Inf Process Syst:419–426
10. Meir R (1994) Bias, variance and the combination of estimators: the case of linear least squares. Technical report, Department of Electrical Engineering, Technion, Haifa
11. Tumer K, Ghosh J (1996) Error correlation and error reduction in ensemble classifiers. Connect Sci 8(3):385–404
12. Kuncheva LI (2004) Combining pattern classifiers: methods and algorithms. Wiley-Interscience, Hoboken
13. Jacobs RA (1997) Bias/variance analyses of mixtures-of-experts architectures. Neural Comput 9(2):369–383
14. Hansen JV (2000) Combining predictors: meta machine learning methods and bias/variance & ambiguity decompositions. Computer Science Department, Aarhus University, Aarhus
15. Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
16. Schapire RE (1990) The strength of weak learnability. Mach Learn 5(2):197–227
17. Liu Y, Yao X (1999) Ensemble learning via negative correlation. Neural Netw 12(10):1399–1404
18. Jacobs RA, Jordan MI, Nowlan SJ, Hinton GE (1991) Adaptive mixtures of local experts. Neural Comput 3(1):79–87
19. Islam MM, Yao X, Nirjon SMS, Islam MA, Murase K (2008) Bagging and boosting negatively correlated neural networks. IEEE Trans Syst Man Cybern B 38(3):771–784. doi:10.1109/TSMCB.2008.922055
20. Ebrahimpour R, Arani SAAA, Masoudnia S (2011) Improving combination method of NCL experts using gating network. Neural Comput Appl:1–7. doi:10.1007/s00521-011-0746-8
21. Waterhouse SR (1997) Classification and regression using mixtures of experts. Unpublished doctoral dissertation, Cambridge University
22. Waterhouse S, Cook G (1997) Ensemble methods for phoneme classification. Adv Neural Inf Process Syst:800–806
23. Avnimelech R, Intrator N (1999) Boosted mixture of experts: an ensemble learning scheme. Neural Comput 11(2):483–497
24. Liu Y, Yao X (1999) Simultaneous training of negatively correlated neural networks in an ensemble. IEEE Trans Syst Man Cybern B 29(6):716–725
25. Ueda N, Nakano R (1996) Generalization error of ensemble estimators. Proc Int Conf Neural Netw 91:90–95
26. Brown G, Wyatt JM (2003) Negative correlation learning and the ambiguity family of ensemble methods. Mult Classif Syst Proc 2709:266–275
27. Brown G (2004) Diversity in neural network ensembles. Unpublished doctoral thesis, University of Birmingham, Birmingham, UK
28. Brown G, Wyatt J, Harris R, Yao X (2005) Diversity creation methods: a survey and categorisation. Inf Fusion 6(1):5–20
29. Chen H (2008) Diversity and regularization in neural network ensembles. PhD thesis, School of Computer Science, University of Birmingham
30. Hansen JV (2000) Combining predictors: meta machine learning methods and bias/variance & ambiguity decompositions. Unpublished doctoral thesis, Computer Science Department, Aarhus University, Aarhus
31. Jacobs RA, Jordan MI, Barto AG (1991) Task decomposition through competition in a modular connectionist architecture: the what and where vision tasks. Cogn Sci 15(2):219–250
32. Dailey MN, Cottrell GW (1999) Organization of face and object recognition in modular neural network models. Neural Netw 12(7–8):1053–1074
33. Ebrahimpour R, Kabir E, Yousefi MR (2007) Face detection using mixture of MLP experts. Neural Process Lett 26(1):69–82. doi:10.1007/s11063-007-9043-z
34. Rokach L (2010) Pattern classification using ensemble methods, vol 75. World Scientific Pub Co Inc., Singapore
35. Ebrahimpour R, Nikoo H, Masoudnia S, Yousefi MR, Ghaemi MS (2011) Mixture of MLP-experts for trend forecasting of time series: a case study of the Tehran stock exchange. Int J Forecast 27(3):804–816
36. Xing HJ, Hua BG (2008) An adaptive fuzzy c-means clustering-based mixtures of experts model for unlabeled data classification. Neurocomputing 71(4–6):1008–1021. doi:10.1016/j.neucom.2007.02.010
37. Ebrahimpour R, Kabir E, Yousefi MR (2008) Teacher-directed learning in view-independent face recognition with mixture of experts using overlapping eigenspaces. Comput Vis Image Underst 111(2):195–206. doi:10.1016/j.cviu.2007.10.003
38. Ubeyli ED (2009) Modified mixture of experts employing eigenvector methods and Lyapunov exponents for analysis of electroencephalogram signals. Expert Syst 26(4):339–354. doi:10.1111/j.1468-0394.2009.00490.x
39. Asuncion A, Newman DJ (2007) UCI Machine Learning Repository [http://www.ics.uci.edu/~mlearn/MLRepository.html]. University of California, School of Information and Computer Science, Irvine, CA
40. Pepe MS (2004) The statistical evaluation of medical tests for classification and prediction. Oxford University Press, Oxford
