International Journal of Remote Sensing, 2013, Vol. 34, No. 12, 4224–4241, http://dx.doi.org/10.1080/01431161.2013.774099

An assessment of the effectiveness of a rotation forest ensemble for land-use and land-cover mapping

Taskin Kavzoglu* and Ismail Colkesen

Department of Geodetic and Photogrammetric Engineering, Gebze Institute of Technology, Kocaeli, Turkey

(Received 25 August 2011; accepted 23 September 2012)

Increasing the accuracy of thematic maps produced through the process of image classification has been a hot topic in remote sensing. For this aim, various strategies, classifiers, improvements, and their combinations have been suggested in the literature. Ensembles that combine the predictions of individual classifiers with weights based on the estimated prediction accuracies are strategies aiming to improve classifier performance. One of the recently introduced ensembles is the rotation forest, which is based on the idea of building accurate and diverse classifiers by applying feature extraction to the training sets and then reconstructing new training sets for each classifier. In this study, the effectiveness of the rotation forest was investigated for decision trees in land-use and land-cover (LULC) mapping, and its performance was compared with the performances of the six most widely used ensemble methods. The results verified the effectiveness of the rotation forest ensemble, as it produced the highest classification accuracies for the selected satellite data. When the statistical significance of the differences in performance was analysed using McNemar's tests based on normal and chi-squared distributions, it was found that the rotation forest method outperformed the bagging, Diverse Ensemble Creation by Oppositional Relabelling of Artificial Training Examples (DECORATE), and random subspace methods, whereas the performance differences with the other ensembles were statistically insignificant.

1. Introduction

Determination of land-use and land-cover (LULC) information about the Earth's surface through the classification of remotely sensed images has been one of the most intensively studied topics in remote sensing (Foody 2002). Producing accurate and up-to-date thematic maps is essential for many local- to global-scale studies. Thematic maps are usually produced through the use of classification techniques. Up to now, many techniques have been developed and applied in the literature (Lu and Weng 2007; Xie, Sha, and Yu 2008; Tso and Mather 2009). Owing to its high importance, the factors affecting classification accuracy have been studied by many researchers in the field of remote sensing (e.g. Foody 2004, 2009; Liu, Frazier, and Kumar 2007; Ben-David 2008; Kavzoglu 2009). In recent years, ensemble learning methods based on the combination of multiple classifiers have been suggested to improve the performance of an individual classifier (e.g. Breiman 1998; Webb 2000; Gislason, Benediktsson, and Sveinsson 2006; Zhang and Zhang 2010). Studies have shown that ensemble classifiers have considerable effects on prediction accuracy, and 'a combined

*Corresponding author. Email: [email protected]

© 2013 Taylor & Francis


classifier' could outperform individual classifiers (Dietterich 2000; Zhang, Zhang, and Wang 2008).

An ensemble is defined as a set of individually trained classifiers, called base classifiers, whose predictions are combined when classifying a new dataset. In other words, the individual classifiers are designed on different subsets of the training data in the classifier ensemble stage so that a set of diverse classifiers is built (Kuncheva and Whitaker 2003; Pal 2007). Diversity of the members of an ensemble is an important characteristic in determining generalization error (Melville and Mooney 2005). In fact, the power of ensemble methods comes from the diversity of the ensemble models, which reduces the variance of the combined prediction (Kim 2009). These methods create ensembles of individual classifiers and acquire classification decisions by voting among the individual classifiers. So far, a variety of ensemble methods has been proposed and applied in the literature, but the most popular ones have been the bagging, AdaBoost, MultiBoosting, Diverse Ensemble Creation by Oppositional Relabelling of Artificial Training Examples (DECORATE), random subspace, and random forest methods. A new ensemble method called rotation forest was introduced by Rodriguez et al. (2006). This method, which applies feature extraction to feature subsets and reconstructs a full feature set for each classifier in the ensemble, has been applied in several studies in machine learning (e.g. Stiglic and Kokol 2007; Zhang, Zhang, and Wang 2008; Stiglic, Rodriguez, and Kokol 2011).

In this study, the classification performance of the rotation forest ensemble learning method is compared with the performances of the bagging, AdaBoost, MultiBoosting, DECORATE, random subspace, and random forest methods for an LULC classification problem. A maximum likelihood classifier, which has been widely used in the literature, was also applied as a benchmark method. Decision trees with ensemble methods were applied to a Terra Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) image for the delineation of the LULC classes covering the bulk of the study area, the Gebze district of Kocaeli in Turkey. In addition to evaluations of classification performance using the standard accuracy measures of overall accuracy, the kappa coefficient, and individual accuracy measures (producer's accuracy, user's accuracy, and conditional kappa), statistical tests of significance were applied to make objective comparisons and draw sound conclusions about the ensemble method performances. For this purpose, McNemar's tests based on normal and chi-squared distributions, a widely used measure of the statistical significance of a difference, were employed in this study.

2. Test site and data

The study area chosen for this research covers an area of approximately 450 km² located in the Gebze district of Kocaeli province, Turkey. It is situated in the westernmost portion of Kocaeli province, with the neighbouring districts of Istanbul, Tuzla, and Pendik. Gebze is an industrial town located at the border of Istanbul, the most populated city of Turkey, on the northern shore of the Marmara Sea (Figure 1).

A Terra ASTER image of 743 × 673 pixels covering the study area was used to determine land-cover and land-use types. The image, which was acquired in October 2002, was registered to the Universal Transverse Mercator (UTM, zone 35, ED50 datum) projection system by applying a first-order polynomial transformation using the ERDAS IMAGINE software package (Intergraph Corporation, Huntsville, AL, USA). Coordinates of ground control points (GCPs) were obtained from 1:25,000 scale topographic maps. The image was registered with 40 well-distributed GCPs at a root-mean-square error of about 0.5 pixels. A nearest-neighbour resampling method was used to determine the new pixel values.


Figure 1. Location of the study area: Gebze district of Kocaeli, Turkey.

In this study, the Weka software package was used in the evaluation of the decision tree and ensemble learning methods.

Forestry maps and aerial photographs taken in 1999 and 2003 were used to create ground reference maps. In addition, field surveys were carried out in September and October 2002 using a handheld global positioning system unit to collect ground reference information. After careful consideration, the LULC types covering the bulk of the study area were determined to be water, pasture, coniferous forest, bare soil, deciduous forest, and urban land. To create training and test datasets, a random pixel-selection strategy was employed, considering the ground reference map, using an in-house program. As a result, training data with 400 samples per class and test data with 225 samples per class were formed from the available samples with a random-stratified sampling strategy. It should be noted that these datasets were employed without any change in all experiments conducted in this study to make objective comparisons of the performances of the ensemble methods.

3. Methodology

Decision trees, whose structures are similar to flowcharts or hierarchical decision schemes, are supervised learning algorithms that have recently been used in remote-sensing applications (e.g. Pal and Mather 2003; Friedl et al. 2010; Boulila et al. 2011). Decision trees, based on the 'divide and conquer' strategy, are defined as a classification procedure that recursively partitions a dataset into smaller subdivisions on the basis of a set of tests defined at each branch in the tree (Friedl and Brodley 1997). These techniques have substantial advantages for remote-sensing classification problems due to their flexibility, nonparametric nature, and ability to handle nonlinear relations between features and classes (Pal 2005).


A decision tree is composed of a root node (containing all data), a set of internal nodes (splits), and a set of terminal nodes (leaves). Each node of the decision tree structure makes a binary decision that separates either one class or some of the classes from the remaining classes. The process is usually carried out until a leaf node is reached (Friedl and Brodley 1997). Iterative dichotomiser 3 (ID3), the popular C4.5, and classification and regression tree (CART) are the commonly used tree-growing algorithms for classifier construction in the literature (Duda, Hart, and Stork 2000; Witten and Frank 2005; Wang and Li 2008). In this study, the C4.5 method was employed for the construction of decision trees. C4.5 builds decision trees from a set of training data using the concept of information entropy.
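At each node, C4.5 selects the attribute test that maximizes information gain, i.e. the reduction in entropy produced by the split. A minimal sketch of these two quantities (plain Python; the function names are ours, not Weka's):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy reduction when `labels` is partitioned into `groups` by a test."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder

# A perfectly separating test recovers the full parent entropy as gain.
parent = ["urban", "urban", "water", "water"]
print(entropy(parent))                                                     # 1.0
print(information_gain(parent, [["urban", "urban"], ["water", "water"]]))  # 1.0
```

C4.5 actually uses the gain ratio (gain divided by the split information) to avoid favouring many-valued attributes; that refinement is omitted here for brevity.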

3.1. Performance metrics for classification assessment

The assessment of thematic map accuracy has long been discussed by researchers (e.g. Stehman 1997; Foody 2002, 2004, 2008; Liu, Frazier, and Kumar 2007). Many performance metrics or measures have been given in the literature with the aim of providing a better measure of thematic map accuracy. Ferri, Hernandez-Orallo, and Modroiu (2009) experimentally analysed the behaviour of 18 performance metrics (e.g. mean F-measure, macro average arithmetic, LogLoss, mean probability rate, mean squared error) in several scenarios. Although many measures exist in the literature, overall and individual class accuracies derived from the error matrix are the simplest and have therefore been the most popular accuracy measures (Congalton 1991). In addition to these measures, the kappa coefficient, which measures the difference between the actual agreement in the confusion matrix and the chance agreement, was used. Although Congalton and Green (2009) underline that the kappa coefficient provides a better measure of the accuracy of a classifier than the overall accuracy, as it takes into account the whole confusion matrix rather than the diagonal elements alone, several studies have raised objections to the use of kappa as a measure of classification accuracy. For example, Foody (1992) argues that chance agreement is overestimated in the calculation of the kappa coefficient and suggests two alternative kappa-like coefficients. According to Stehman (1997), the kappa coefficient of agreement does not possess a probabilistic interpretation such as that of overall accuracy because of the adjustment for hypothetical chance agreement incorporated into this measure, and the strong dependence of kappa on the marginal proportions of the error matrix makes the utility of kappa for comparisons suspect. Also, Pontius and Millones (2008) highlight that the kappa coefficient fails to penalize large quantity disagreement and fails to reward small quantity disagreement.
In this study, both the overall accuracy and the kappa coefficient were estimated to evaluate ensemble performances. In addition to these accuracy measures, producer's accuracy, user's accuracy, and conditional kappa coefficients were calculated to estimate the accuracy of individual classes. Furthermore, the significance of the differences in classification performances was tested through statistical tests to avoid the debate on the validity and robustness of the accuracy metrics.
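For concreteness, the overall accuracy, kappa, and per-class producer's and user's accuracies can be computed from an error matrix as below. The orientation (rows = map labels, columns = reference labels) is an assumption for this sketch, and conditional kappa is omitted for brevity:

```python
import numpy as np

def accuracy_measures(cm):
    """Error-matrix measures; cm rows = map (predicted), columns = reference."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    diag = np.diag(cm)
    row = cm.sum(axis=1)                 # map totals per class
    col = cm.sum(axis=0)                 # reference totals per class
    overall = diag.sum() / n
    chance = (row * col).sum() / n ** 2  # hypothetical chance agreement
    kappa = (overall - chance) / (1.0 - chance)
    producers = diag / col               # 1 - omission error, per class
    users = diag / row                   # 1 - commission error, per class
    return overall, kappa, producers, users

oa, kappa, prod, user = accuracy_measures([[50, 10],
                                           [5, 35]])
print(round(oa, 3), round(kappa, 3))     # 0.85 0.694
```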

3.2. Performance comparison of classifiers

Theoretical and empirical validation is definitely essential to the process of designing and implementing new algorithms. To date, one-to-one comparison of accuracy metrics, mainly through the comparison of calculated overall accuracies, has been the common practice adopted by researchers. Lately, statistical tests have been utilized to analyse the differences in classifier performances to decide whether these differences are statistically significant or insignificant (Gao et al. 2006; Pal and Foody 2010; Gao et al. 2011; Yan and Shaker 2011). Only by applying such testing approaches can objective and sound evaluations be performed. As stated by Salzberg (1997), statistics offers many tests that are designed to measure the significance of any difference between two or more 'treatments'. These tests can be adapted for use in comparisons of classifiers, but the adaptation must be done carefully because the statistics were not designed with computational experiments in mind.

A higher accuracy of one method relative to that of another does not, however, necessarily prove the superiority of that method, because accuracy statistics are subject to sampling variability (De Leeuw et al. 2006). A variety of statistical tests, including the z-test, receiver-operating characteristic curves, and McNemar's test, have been suggested for comparing classifier performances (Dietterich 1998; Foody 2004; De Leeuw et al. 2006). Considering the debates given in Section 3.1, it was decided not to employ a kappa-based performance comparison measure (i.e. the z-test) in this study.

Foody (2009) states that the comparison of proportions is often only appropriate for use with datasets that may be considered independent. In comparative studies (e.g. assessment of the accuracy of thematic maps or of the classification performance of algorithms), the same testing set is commonly used, and the test sets are therefore considered related samples (Foody 2004, 2009), as in this research. Because the kappa coefficient does not satisfy the assumption of independence, a kappa-based z-test is not suitable for cases in which related samples are utilized. For related samples, McNemar's test can be useful for estimating the statistical significance of the difference between two proportions (Foody 2004). It is a nonparametric test based on confusion matrices of 2 × 2 dimension. It is based on the standardized normal test statistic shown in the following equation:

z = (n_ij − n_ji) / √(n_ij + n_ji),    (1)

where n_ij indicates the number of pixels misclassified by classifier i but not by classifier j, and n_ji indicates the number of pixels misclassified by classifier j but not by classifier i. The most frequently applied statistical test of the significance of the association indicated by the data is the classic chi-squared test. Because, by definition, the square of a quantity that has a standard normal distribution follows a chi-squared distribution with one degree of freedom, the square of z calculated from Equation (1) may be referred to tables of the chi-squared distribution with one degree of freedom (Agresti 1996; Fleiss, Levin, and Paik 2003). In this case, McNemar's test can be expressed as

χ² = (n_ij − n_ji)² / (n_ij + n_ji).    (2)

The calculated statistic value is compared against tabulated chi-squared values to assess its statistical significance. Because a normal distribution is being used to represent the discrete distribution of sample frequencies in a test based on z, it is sometimes recommended that a correction for continuity be undertaken (Foody 2004). McNemar's χ², with a correction for continuity, is given as

χ² = (|n_ij − n_ji| − 1)² / (n_ij + n_ji),    (3)


where −1 indicates the continuity correction. If the statistic χ² estimated from Equation (2) or (3) is greater than the critical table value (i.e. χ² = 3.84 at the 0.05 level of confidence), it can be stated that the two classifiers differ in their performances (Pal 2008). In other words, the difference in accuracy between classifiers i and j is said to be statistically significant.
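Both forms of the test reduce to a few lines of code once the two discordant counts are tallied; the counts below are hypothetical and serve only to illustrate the 3.84 threshold:

```python
import math

def mcnemar(n_ij, n_ji, corrected=False):
    """McNemar's statistics from the discordant counts: z as in Equation (1),
    chi-squared as in Equation (2), or Equation (3) when corrected=True."""
    z = (n_ij - n_ji) / math.sqrt(n_ij + n_ji)
    if corrected:
        chi2 = (abs(n_ij - n_ji) - 1) ** 2 / (n_ij + n_ji)
    else:
        chi2 = (n_ij - n_ji) ** 2 / (n_ij + n_ji)
    return z, chi2

# Hypothetical counts: classifier i errs on 45 pixels that j labels correctly,
# while j errs on 20 pixels that i labels correctly.
z, chi2 = mcnemar(45, 20)
_, chi2_c = mcnemar(45, 20, corrected=True)
print(round(z, 3), round(chi2, 3), round(chi2_c, 3))  # 3.101 9.615 8.862
print(chi2 > 3.84)   # True: the difference is significant at the 0.05 level
```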

4. Ensemble learning methods

Ensemble learning methods combining the results of base classifiers have received much attention in the literature and have demonstrated promising capabilities in improving classification accuracy (Wang et al. 2009). These methods work by constructing a set of predictive models according to a given algorithm called the base classifier. The prediction for a new observation is then a vote (weighted or not) of all the models in the set (Hamza and Larocque 2005). In other words, ensemble learning methods train a set of classifiers and combine their results through a voting process. Thus, they reduce the error level reached by a weak learning algorithm that produces classification results on various distributions over the training data. The rotation forest ensemble method, whose performance was investigated in this study, was compared to the most widely used ensembles, namely the bagging, AdaBoost, MultiBoosting, DECORATE, random subspace, and random forest methods.

4.1. Bagging ensemble

The bagging predictor, or bootstrap aggregating, introduced by Breiman (1996), has been a popular ensemble approach that generates an ensemble of individual classifiers by bootstrap sampling of the training data. Bagging improves classification accuracy by decreasing the variance of the classification errors. In other words, it requires that the learning system should not be stable, so that small changes to the training set lead to a different classifier. Breiman (1996) suggests that bagging can improve accuracy if perturbing the learning set causes a significant change in the predictor constructed. Multiple samples from the training set are generated by sampling with replacement from the training data (Kim 2009). Some examples may be selected several times, while others may not be selected at all. Bagging creates a classifier from each sample and combines all classifiers generated at different trials to produce the final classifier. Prediction of a test sample is performed by taking the majority vote of the ensemble (Banfield et al. 2007; Sun, Zhang, and Zhang 2007). Because the ensemble members are not exposed to the same set of examples, they differ from each other. By voting the predictions of each classifier, bagging seeks to reduce the error level due to the variance of the base classifier (Melville and Mooney 2005).
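The bootstrap-and-vote procedure can be sketched as follows. A 1-nearest-neighbour rule stands in for the C4.5 trees used in the paper, purely to keep the example self-contained; all names are illustrative.

```python
import numpy as np

def nn1_predict(X_train, y_train, X_test):
    """1-nearest-neighbour base classifier (a stand-in for decision trees)."""
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return y_train[d.argmin(axis=1)]

def bagging_predict(X_train, y_train, X_test, n_members=15, seed=0):
    """Bootstrap-sample the training set, train one member per sample,
    and combine the members by unweighted majority vote."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    votes = []
    for _ in range(n_members):
        idx = rng.integers(0, n, size=n)      # sampling with replacement
        votes.append(nn1_predict(X_train[idx], y_train[idx], X_test))
    votes = np.stack(votes)                   # (n_members, n_test)
    return np.array([np.bincount(v).argmax() for v in votes.T])

# Two well-separated toy classes
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])
print(bagging_predict(X, y, np.array([[0.05, 0.05], [5.05, 5.05]])))  # [0 1]
```

Because each member sees a different bootstrap sample, their errors are partly decorrelated, which is what the majority vote exploits.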

4.2. AdaBoost ensemble

In the AdaBoost ensemble, which has been the most widely used boosting algorithm in the literature, a series of individual classifiers is produced iteratively, and each classifier in the ensemble tries to classify the training data accurately (Freund and Schapire 1996). The training data chosen for a classifier are determined by the performance of the previous classifier. This is achieved using adaptive resampling; that is, a pixel misclassified by an earlier classifier is selected more often than a correctly classified one, so that the next classifier in the ensemble can perform well on the new dataset (Kim 2009). In each cycle of the training iteration for the members of the ensemble, a weight is assigned to each training pixel. In the next iteration, the classifier is obliged to concentrate on the reweighted pixels that were misclassified in the previous iteration. In each case, the AdaBoost ensemble creates a classifier that intends to correct the errors of the previous iteration. The final classifier is a weighted sum of the ensemble predictions.
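The reweighting loop can be sketched for the binary case as follows, with a decision stump standing in for the C4.5 base classifier used in the paper (an AdaBoost.M1-style sketch; names and details are ours):

```python
import numpy as np

def stump_fit(X, y, w):
    """Best weighted decision stump: (feature, threshold, polarity, error)."""
    best = (0, 0.0, 1, np.inf)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, f] - t) >= 0, 1, 0)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (f, t, pol, err)
    return best

def stump_predict(X, f, t, pol):
    return np.where(pol * (X[:, f] - t) >= 0, 1, 0)

def adaboost(X, y, n_rounds=10):
    """AdaBoost.M1-style loop for binary labels {0, 1}."""
    w = np.full(len(y), 1.0 / len(y))
    members = []
    for _ in range(n_rounds):
        f, t, pol, err = stump_fit(X, y, w)
        err = max(err, 1e-10)
        if err >= 0.5:                 # weak-learning condition violated
            break
        alpha = 0.5 * np.log((1 - err) / err)
        pred = stump_predict(X, f, t, pol)
        # misclassified pixels are up-weighted, correct ones down-weighted
        w *= np.exp(np.where(pred != y, alpha, -alpha))
        w /= w.sum()
        members.append((alpha, f, t, pol))
    return members

def adaboost_predict(members, X):
    """Weighted vote of the members (labels mapped to ±1 internally)."""
    score = sum(a * (2 * stump_predict(X, f, t, p) - 1) for a, f, t, p in members)
    return (score >= 0).astype(int)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
m = adaboost(X, y, n_rounds=5)
print(adaboost_predict(m, X))          # [0 0 1 1]
```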

4.3. MultiBoosting ensemble

MultiBoosting is an extension of the AdaBoost ensemble for forming decision committees. The method can be viewed as a combination of AdaBoost and a variant of the bagging ensemble. The effect of combining different classifiers can be explained with the theory of bias-variance decomposition: the total expected error of a classifier is the sum of its bias and variance (Webb and Zheng 2004). MultiBoosting is able to harness both AdaBoost's high bias and variance reduction and bagging's superior variance reduction (Webb 2000). Experimental results indicated that MultiBoosting produced decision committees with lower errors than either AdaBoost or bagging when decision trees were used as the base classifier.

4.4. DECORATE ensemble

DECORATE directly constructs diverse hypotheses using additional artificially constructed training examples. This is accomplished by adding randomly constructed examples to the training set when building new committee members (Nanni and Lumini 2006). DECORATE reduces the correlation between the weak classifiers in the ensemble by training them on oppositely labelled artificial examples. These artificially constructed examples are given category labels that disagree with the current decision of the committee, thereby easily and directly increasing diversity when a new classifier is trained on the augmented data and added to the committee (Melville and Mooney 2005). New examples are created using a data distribution (e.g. Gaussian) with the mean and standard deviation estimated in each training cycle, and they are labelled based on the predictions of the current ensemble. Therefore, diversity is increased when a new classifier is trained on the augmented data. At the end of each iteration, the artificial data are removed from the dataset, and the process is repeated with newly generated artificial data until the maximum number of iterations is reached.
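The artificial-example step can be sketched as below, assuming per-feature Gaussian models and labels drawn inversely to the ensemble's class probabilities as described above (a simplified fragment, not Weka's Decorate implementation):

```python
import numpy as np

def artificial_examples(X, class_probs, rng):
    """Draw artificial samples from per-feature Gaussians fitted to X, then
    label each one inversely to the current ensemble's class probabilities
    (`class_probs`: one row of probabilities per artificial sample)."""
    n = class_probs.shape[0]
    Xa = rng.normal(X.mean(axis=0), X.std(axis=0), size=(n, X.shape[1]))
    inv = 1.0 / (class_probs + 1e-12)
    inv /= inv.sum(axis=1, keepdims=True)   # inverse-probability labelling
    labels = np.array([rng.choice(len(p), p=p) for p in inv])
    return Xa, labels

rng = np.random.default_rng(42)
X = np.array([[0.0], [1.0], [2.0]])
probs = np.array([[1.0, 0.0]])  # the ensemble is certain of class 0 ...
Xa, lab = artificial_examples(X, probs, rng)
print(lab)                      # [1]  ... so the artificial label disagrees
```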

4.5. Random subspace ensemble

The random subspace ensemble consists of multiple trees constructed systematically by pseudo-random selection of subsets of the components of the feature vector; that is, trees are constructed in randomly chosen subspaces (Ho 1998). Subspace methods support diversity by training the classifiers in the ensemble using different subsets of the input data. The method creates a decision tree-based classifier that maintains the highest accuracy on training data and improves generalization accuracy as it grows in complexity. Thus, one obtains a random subspace of the original feature space and constructs classifiers inside the reduced subspace. The aggregation is usually performed using weighted voting on the basis of the accuracy of the base classifiers. The method is effective for classifiers having a decreasing learning curve constructed on small and critical training sample sizes (Pal 2007).
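A compact sketch of the idea, again with a 1-nearest-neighbour stand-in for the tree base classifier and, for simplicity, an unweighted majority vote rather than the weighted aggregation described above:

```python
import numpy as np

def nn1(X_train, y_train, X_test):
    """1-nearest-neighbour base classifier (a stand-in for decision trees)."""
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return y_train[d.argmin(axis=1)]

def random_subspace_predict(X_train, y_train, X_test, n_members=11,
                            subspace=0.5, seed=0):
    """Each member sees a random subset of the features (here, half of them);
    the member predictions are combined by majority vote."""
    rng = np.random.default_rng(seed)
    n_feat = X_train.shape[1]
    k = max(1, int(subspace * n_feat))
    votes = []
    for _ in range(n_members):
        feats = rng.choice(n_feat, size=k, replace=False)
        votes.append(nn1(X_train[:, feats], y_train, X_test[:, feats]))
    votes = np.stack(votes)
    return np.array([np.bincount(v).argmax() for v in votes.T])

# Four informative features, two well-separated classes
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 4)),
               rng.normal(3.0, 0.3, size=(20, 4))])
y = np.array([0] * 20 + [1] * 20)
print(random_subspace_predict(X, y, np.array([[0.0] * 4, [3.0] * 4])))  # [0 1]
```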

4.6. Random forest method

Random forests are a combination of tree classifiers such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest (Breiman 2001). Random forests utilize bagging to generate an ensemble of CART classifiers. A large number of trees (i.e. classifiers) are generated, and finally unweighted voting is used to assign an unknown pixel to a class (Pal 2005). Each tree is trained on a bootstrapped sample of the training data, and at each node the algorithm only searches across a random subset of the variables to determine a split. In other words, to classify an input vector, the vector is submitted as an input to each of the trees in the forest, and the classification is then determined by a majority vote (Gislason, Benediktsson, and Sveinsson 2004). In random forests, the Gini index, measuring the impurity of an attribute with respect to the classes, is used as the attribute selection measure.
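The Gini index used for attribute selection is straightforward to compute; a small illustrative sketch (not the Weka implementation):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def gini_of_split(groups):
    """Weighted Gini impurity of a candidate split; at each node, the
    attribute/threshold minimizing this value is chosen."""
    n = sum(len(g) for g in groups)
    return sum(len(g) / n * gini(g) for g in groups)

print(gini([0, 0, 1, 1]))               # 0.5 (maximally impure for 2 classes)
print(gini_of_split([[0, 0], [1, 1]]))  # 0.0 (a pure split)
```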

4.7. Rotation forest ensemble

The rotation forest ensemble is based on generating a classifier ensemble using a feature extraction technique (i.e. principal component analysis (PCA)). In this ensemble method, the feature set is randomly split into subsets, and feature extraction is applied to each subset. All features are retained in order to preserve the variability information in the data. Thus, diversity is promoted through the feature extraction performed for each classifier (Rodriguez et al. 2006). The framework of the rotation forest ensemble can be described as follows.

Let x = [x_1, ..., x_n]^T be a data point described by n features, and let X be the dataset containing the training objects in the form of an N × n matrix. Let Y be a vector with the class labels for the data, Y = [y_1, ..., y_N]^T, where y_i takes a value from the set of class labels {ω_1, ..., ω_c}. Denote the classifiers in the ensemble by D_1, ..., D_L and the feature set by F. As with most ensemble methods, it is required to pick L in advance. All classifiers can be trained in parallel, which is also the case with bagging and random forests (Rodriguez et al. 2006). To construct the training set for classifier D_i, the following steps are carried out (Han, Zhu, and Yao 2012).

Step 1: Split F randomly into K subsets (K is a parameter of the algorithm). The subsets may be disjoint or intersecting. To maximize the chance of high diversity, disjoint subsets are chosen. For simplicity, suppose that K is a factor of n, so that each feature subset contains M = n/K features.

Step 2: Denote the jth subset of features for the training set of classifier D_i as F_ij. For each subset, randomly select a nonempty subset of classes and then draw a bootstrap sample of objects of size 75% of the input dataset. PCA is applied using only the M features in F_ij and the selected subset of X. Store the coefficients of the principal components as a_ij^(1), ..., a_ij^(M_j), each of size M × 1. It is possible that some of the eigenvalues are zero; therefore, it may not be possible to obtain all M vectors, so M_j ≤ M. Running PCA on a subset of the dataset X instead of on the whole set is done in a bid to avoid identical coefficients if the same feature subset is chosen for different classifiers.

Step 3: Organize the obtained coefficient vectors in a sparse 'rotation' matrix A_i. In this matrix, the jth set of principal component coefficients is r_j = [a_ij^(1), a_ij^(2), ..., a_ij^(M_j)], i = 1, 2, ..., L; j = 1, 2, ..., K, and [0] denotes a zero matrix with the same dimensions as r_j:

A_i = | r_1   [0]  ...  [0] |
      | [0]   r_2  ...  [0] |
      | ...   ...  ...  ... |
      | [0]   [0]  ...  r_K |.    (4)


The success of the rotation forest method lies in the application of the rotation matrix constructed from the linearly transformed subsets. An effective transformation method is the core of the rotation matrix; therefore, the transformation method has a significant impact on the final classification accuracy (Liu and Huang 2008).
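Steps 1–3 can be sketched as follows for a single ensemble member. This is a simplified reading of the algorithm: PCA is done by eigendecomposition of the covariance matrix, all components are kept, and the block for each feature subset is written directly onto the original feature axes. Every name here is ours, not from the paper's implementation.

```python
import numpy as np

def rotation_matrix(X, y, K, rng):
    """Build one n x n rotation matrix R: disjoint feature subsets, PCA per
    subset on a 75% bootstrap of a random class subset, coefficient blocks
    assembled block-diagonally (Steps 1-3, simplified)."""
    n_feat = X.shape[1]
    subsets = np.array_split(rng.permutation(n_feat), K)  # disjoint subsets
    R = np.zeros((n_feat, n_feat))
    classes = np.unique(y)
    for feats in subsets:
        # random nonempty subset of classes
        chosen = rng.choice(classes, size=rng.integers(1, len(classes) + 1),
                            replace=False)
        pool = np.flatnonzero(np.isin(y, chosen))
        # bootstrap sample of size 75% of the selected objects
        boot = rng.choice(pool, size=max(1, int(0.75 * len(pool))), replace=True)
        sub = X[np.ix_(boot, feats)]
        cov = np.atleast_2d(np.cov(sub, rowvar=False))
        _, vecs = np.linalg.eigh(cov)          # PCA axes of this subset
        R[np.ix_(feats, feats)] = vecs         # place the block on its axes
    return R

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 6))
y = rng.integers(0, 3, size=60)
R = rotation_matrix(X, y, K=3, rng=rng)
X_rot = X @ R            # rotated training set for one ensemble member
print(X_rot.shape)       # (60, 6)
```

Each member D_i is then trained on X @ R with the original labels; a different random rotation per member is what creates the diversity.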

5. Results and discussion

The performance of the rotation forest in comparison to the six most widely used ensemble methods (i.e. bagging, AdaBoost, MultiBoosting, DECORATE, random subspace, and random forest) was analysed in this study on an LULC classification problem. In the ensemble learning experiments performed in this study, a decision tree classifier was used as the base classifier. All ensemble methods require the setting of some parameters by the user. The number of iterations is a critical parameter that needs careful consideration for all ensemble methods. It should be mentioned that the number of iterations directly corresponds to the number of trees for the random forest classifier. A cross-validation strategy was applied to determine the optimum iteration numbers using the training dataset. Classification performances were estimated for iterations ranging from 1 to 1000 for all methods considered, and the results are given in Figure 2. Optimum iterations of 70, 90, 250, 30, 150, 30, and 350 were determined for the bagging, AdaBoost, MultiBoosting, DECORATE, random subspace, and rotation forest ensembles and the random forest classifier, respectively (Table 1). By applying the optimum numbers of iterations producing the highest classification accuracies for the ensemble methods, it was ensured that all methods were compared at their optimal solutions, thus avoiding bias towards any one of the methods.

Some ensembles require additional parameters to be set by the analyst in the model construction stage. These parameters are summarized in Table 1. For MultiBoosting, the

[Figure 2: line chart of classification accuracy (%) (y-axis, 87-93%) against number of iterations (x-axis, 10-1000) for the seven methods.]

Figure 2. Relationship between the number of iterations and classification accuracy using ensemble methods. Note that the number of iterations indicates the number of trees to be grown for the random forest classifier.


Table 1. Parameter settings of the ensemble methods.

Ensemble method     Parameter setting
Bagging             Number of iterations: 70; bag size: 100%
AdaBoost            Number of iterations: 90
MultiBoosting       Number of iterations: 250; number of subcommittees: 3
DECORATE            Number of iterations: 30; number of desired members: 15; number of artificial examples: 0.5
Random subspace     Number of iterations: 150; percentage of the number of attributes: 0.5
Rotation forest     Number of iterations: 30; number of splits (K): 3
Random forest       Number of trees: 350; number of features: 2

number of subcommittees, which has a considerable impact on classification performance, was set to three after a few trials with the training dataset. For DECORATE, in addition to the number of iterations, there are two more user-defined parameters: the number of desired member classifiers and the number of artificial examples. Melville (2003) states that the accuracy of DECORATE increases with the desired ensemble size, that the performance levels out with ensemble sizes of 10-25, and that the number of artificial examples gives acceptable results in the range between 0.5 and 1. Values of 15 and 0.5 were therefore used for the number of desired member classifiers and the number of artificial examples, respectively. In addition, the number of features used at each node to generate a tree must be determined for the random forest classifier. A number of experiments were carried out using the training dataset, and two features at each node were found to be optimal. Finally, for the rotation forest ensemble, the number of splits (K) controlling the sizes of the feature subsets was set to three, as suggested by Rodriguez et al. (2006).
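For the methods that have stock scikit-learn implementations, the settings discussed above can be approximated as follows. This is illustrative only: DECORATE, MultiBoosting, and rotation forest have no built-in scikit-learn implementation and are omitted, and the iteration counts are those reported in the text.

```python
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.tree import DecisionTreeClassifier

base = DecisionTreeClassifier()
bagging = BaggingClassifier(base, n_estimators=70,
                            max_samples=1.0)           # bag size: 100%
adaboost = AdaBoostClassifier(base, n_estimators=90)
random_subspace = BaggingClassifier(base, n_estimators=150,
                                    max_features=0.5,  # 50% of the attributes
                                    bootstrap=False)   # sample features, not rows
random_forest = RandomForestClassifier(n_estimators=350,
                                       max_features=2)  # 2 features per node
```

The random subspace method falls out of `BaggingClassifier` by sampling features instead of rows, which is why `bootstrap=False` is combined with `max_features=0.5`.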

The final ensemble models constructed with optimal parameter settings were applied to the satellite imagery, and thematic maps of the study area were produced for each method (Figure 3). Although the subfigures show great resemblance to each other, there are some parts of the maps that were identified differently by the methods. These parts, which to some extent reveal the generalization capability of the methods, are highlighted on the subfigures with white circles. When the highlighted parts were analysed, it could be seen that urban and bare soil-rock pixels were classified more accurately by the rotation forest ensemble method than by the others. In addition, the decision tree classifier did not efficiently distinguish urban and bare soil-rock pixels, which have similar spectral characteristics. Moreover, it was observed that the misclassifications mainly resulted from the identification of three spectrally similar classes, i.e. deciduous, pasture, and urban pixels. In other words, these differences result from confusion in class boundary definitions in the feature space.

The results of the maximum likelihood classification and the decision tree classification with and without the ensemble methods are summarized in terms of accuracy measures in Table 2. It can be seen from the table that the maximum likelihood classifier produced an overall accuracy of 88.22%. Another important finding is that the ensemble methods help to improve the classification performance of the decision tree classifier (by up to about 5.6% in overall accuracy). In fact, the decision tree without the use of an ensemble method


[Figure 3: nine thematic maps, panels (a)-(i), with class legend (water, deciduous, coniferous, pasture, bare soil-rock, urban), a 0-10 km scale bar, and a north arrow.]

Figure 3. Thematic maps produced using (a) decision tree, (b) bagging, (c) AdaBoost, (d) MultiBoosting, (e) DECORATE, (f) random subspace, (g) random forest, (h) rotation forest ensemble learning, and (i) maximum likelihood methods. Note that circles indicate regions that the methods classify differently.

resulted in classification with an overall accuracy of 86.74% and a kappa coefficient of 0.840. While evaluating the ensemble performances, it was found that the highest classification accuracy (overall accuracy of 92.38% and kappa coefficient of 0.908) was produced by the rotation forest ensemble, whereas the lowest accuracy among the ensembles (overall accuracy of 88.89% and kappa coefficient of 0.867) was produced by the bagging ensemble. Among


Table 2. Classification accuracies and computational costs estimated for all methods considered in this study. Columns in each panel: class, user's accuracy (%), producer's accuracy (%), conditional kappa.

(a) Maximum likelihood (overall accuracy: 88.22%; kappa coefficient: 0.859; computational cost*: 0.11 s)
Water       100.00   99.56   0.995
Deciduous    81.94   78.67   0.746
Coniferous   84.19   87.56   0.849
Pasture      84.75   84.00   0.808
Soil-rock    87.39   92.44   0.908
Urban        91.16   87.11   0.847

(b) Decision tree (overall accuracy: 86.74%; kappa coefficient: 0.840; computational cost*: 0.13 s)
Water       100.00  100.00   1.000
Deciduous    85.65   82.22   0.788
Coniferous   85.78   88.44   0.860
Pasture      82.98   86.67   0.839
Soil-rock    85.59   84.44   0.814
Urban        80.45   78.67   0.745

(c) Bagging (overall accuracy: 88.89%; kappa coefficient: 0.867; computational cost*: 12.14 s)
Water       100.00  100.00   1.000
Deciduous    90.29   82.67   0.795
Coniferous   87.83   89.78   0.877
Pasture      82.87   92.44   0.907
Soil-rock    86.84   88.00   0.856
Urban        86.19   80.44   0.768

(d) AdaBoost (overall accuracy: 92.00%; kappa coefficient: 0.904; computational cost*: 19.48 s)
Water       100.00  100.00   1.000
Deciduous    91.08   86.22   0.836
Coniferous   91.27   92.89   0.914
Pasture      89.36   93.33   0.919
Soil-rock    91.03   90.22   0.883
Urban        89.33   89.33   0.872

(e) MultiBoosting (overall accuracy: 91.85%; kappa coefficient: 0.902; computational cost*: 18.48 s)
Water       100.00  100.00   1.000
Deciduous    91.63   87.56   0.852
Coniferous   92.00   92.00   0.904
Pasture      87.92   93.78   0.924
Soil-rock    90.18   89.78   0.877
Urban        89.59   88.00   0.857

(f) DECORATE (overall accuracy: 89.11%; kappa coefficient: 0.869; computational cost*: 6.28 s)
Water        99.56  100.00   1.000
Deciduous    89.25   84.89   0.820
Coniferous   88.44   88.44   0.861
Pasture      85.83   91.56   0.897
Soil-rock    84.75   88.89   0.865
Urban        87.08   80.89   0.774

(g) Random subspace (overall accuracy: 89.78%; kappa coefficient: 0.877; computational cost*: 10.47 s)
Water       100.00  100.00   1.000
Deciduous    88.89   81.78   0.785
Coniferous   89.33   89.33   0.872
Pasture      85.54   94.67   0.935
Soil-rock    86.96   88.89   0.866
Urban        88.32   84.00   0.810

(h) Random forest (overall accuracy: 91.63%; kappa coefficient: 0.900; computational cost*: 4.01 s)
Water       100.00  100.00   1.000
Deciduous    91.47   85.78   0.831
Coniferous   90.83   92.44   0.909
Pasture      88.02   94.67   0.935
Soil-rock    89.38   89.78   0.877
Urban        90.32   87.11   0.846

(i) Rotation forest (overall accuracy: 92.38%; kappa coefficient: 0.908; computational cost*: 23.03 s)
Water       100.00  100.00   1.000
Deciduous    92.79   85.78   0.832
Coniferous   90.87   92.89   0.914
Pasture      87.97   94.22   0.930
Soil-rock    91.15   91.56   0.899
Urban        91.82   89.78   0.878

Note: *Computational costs are measured runtimes for each method on a personal computer (AMD dual-core 2.11 GHz with 1 GB of RAM) with iterations fixed at 100.

the ensemble methods, four methods, namely AdaBoost, MultiBoosting, random forest, and rotation forest, gave results with over 90% overall accuracy, an improvement of at least 2% compared with the others. When the estimated user's and producer's accuracies were analysed, it was found that the highest user's accuracies were obtained for the deciduous (92.79%), bare soil-rock (91.15%), and urban (91.82%) classes by the rotation forest method. In addition, the highest producer's accuracies were obtained for the coniferous (92.89%) and urban (89.78%) classes using the rotation forest method.
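The per-class measures reported in Table 2 can be computed from a confusion matrix as sketched below. Note that the paper does not state which formulation of conditional kappa it uses; the version here is one common (user's-side) definition and the toy matrix is not the study's data.

```python
import numpy as np

def accuracy_measures(cm):
    # cm: confusion matrix, rows = reference classes, columns = map classes
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    producers = np.diag(cm) / cm.sum(axis=1)    # 1 - omission error
    users = np.diag(cm) / cm.sum(axis=0)        # 1 - commission error
    p_ref = cm.sum(axis=1) / n                  # reference proportions
    cond_kappa = (users - p_ref) / (1 - p_ref)  # user's-side conditional kappa
    overall = np.trace(cm) / n
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2  # chance agreement
    kappa = (overall - pe) / (1 - pe)
    return users, producers, cond_kappa, overall, kappa

# Toy two-class example
users, producers, cond_kappa, overall, kappa = accuracy_measures([[50, 10],
                                                                  [5, 35]])
```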

The computational cost of the training process is an important parameter in the evaluation of ensemble methods for the classification of remotely sensed images. The training time can be drastically longer for some classifiers (e.g. artificial neural networks) than for others (e.g. decision trees). The ensemble methods considered in this study were also compared in terms of the computational cost of the training phase (Table 2). It was found that the rotation forest ensemble required the most time (23.03 s), whilst the random forest was the fastest, with 4.01 s for the training dataset. Although the time taken to train decision trees with rotation forest was the longest, the duration was quite close to that of the AdaBoost and MultiBoosting ensembles. The main reason for this outcome is the complexity of the rotation forest ensemble, which employs two parameters and PCA in the modelling stage.

To evaluate whether the differences in the classification accuracies produced by the ensemble methods are statistically significant, McNemar's tests based on normal and chi-squared distributions were applied (Table 3). The McNemar's tests considered in this study are the standardized normal test statistic (z) and the chi-squared statistic with and without continuity correction. Statistical tests were conducted at the 90% and 95% confidence levels to analyse the results in more detail. It should be noted that all tests considered in this study were two-tailed; a difference between two classification results is significant at the 95% confidence level if z > 1.96 for the normal test statistic and χ2 > 3.84 for the chi-squared tests, and at the 90% confidence level if z > 1.64 and χ2 > 2.71, respectively. Test statistics greater than the critical values are shown as 'Yes' in Table 3. It was found that the rotation forest ensemble
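The three test statistics follow directly from the discordant counts, as in the sketch below. The counts 40 and 10 are purely illustrative, not values from the study.

```python
import math

def mcnemar(n12, n21):
    # n12: pixels correct only with classifier 1
    # n21: pixels correct only with classifier 2
    z = (n12 - n21) / math.sqrt(n12 + n21)
    chi2 = (n12 - n21) ** 2 / (n12 + n21)              # equals z**2
    chi2_cc = (abs(n12 - n21) - 1) ** 2 / (n12 + n21)  # continuity corrected
    return z, chi2, chi2_cc

# Two-tailed decision at the 95% confidence level
z, chi2, chi2_cc = mcnemar(40, 10)
significant_95 = abs(z) > 1.96 and chi2 > 3.84
```

Since chi2 is exactly z squared, the z test and the uncorrected chi-squared test always reach the same decision; only the continuity-corrected statistic, which is smaller, can disagree near a critical value.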


Table 3. Statistical comparison of the results produced by the classifiers using McNemar's tests based on the standardized normal test statistic (z) and the chi-squared distribution with and without continuity correction.

The test statistics are computed from the discordant counts n_ij (pixels classified correctly by classifier 1 only) and n_ji (pixels classified correctly by classifier 2 only):

z = (n_ij - n_ji) / sqrt(n_ij + n_ji),  χ2 = (n_ij - n_ji)^2 / (n_ij + n_ji),  χ2_c = (|n_ij - n_ji| - 1)^2 / (n_ij + n_ji).

Comparison of overall accuracies (%); columns: classification 1 (c1), classification 2 (c2), difference (c1 - c2), z, χ2, χ2_c, significant at 95%, significant at 90%.

Rotation forest vs decision tree     92.38   86.74   5.64   6.78   45.97   44.77   Yes   Yes
Rotation forest vs bagging           92.38   88.89   3.49   5.43   29.45   28.21   Yes   Yes
Rotation forest vs AdaBoost          92.38   92.00   0.38   0.65    0.42    0.27   No    No
Rotation forest vs MultiBoosting     92.38   91.85   0.53   0.93    0.85    0.63   No    No
Rotation forest vs DECORATE          92.38   89.11   3.27   4.86   23.61   22.55   Yes   Yes
Rotation forest vs random subspace   92.38   89.78   2.60   4.56   20.76   19.59   Yes   Yes
Rotation forest vs random forest     92.38   91.63   0.75   1.71    2.94    2.38   No    Yes

Note: All tests were two-tailed; critical values for the z and χ2 tests at the 95% confidence level are 1.96 and 3.84, and at the 90% confidence level 1.64 and 2.71, respectively. χ2_c denotes the continuity-corrected statistic.


with the decision tree classifier produced significantly higher classification accuracies than the decision tree itself and the decision tree with bagging, DECORATE, and random subspace ensembles for the dataset considered in this study. It was apparent that the large differences in accuracy, expressed in terms of proportions of correctly allocated pixels, were statistically significant at the 5% and 10% levels of significance. Although there is evidence that the rotation forest method produced better results than the AdaBoost, MultiBoosting, and random forest methods in terms of classification performance and overall accuracy, the differences were statistically insignificant at the 5% level of significance. In other words, these ensemble methods generated similar or comparable results at this level of significance.

Table 3 clearly indicates that all statistical tests, independent of distribution type, gave the same result in comparing the significance of ensemble performance differences at the 5% level of significance. However, when the level of significance was changed to 10% (critical values for z and χ2 are 1.64 and 2.71, respectively), the test values of z and χ2 without continuity correction showed that the difference in classification accuracies produced by the rotation forest and random forest ensembles became statistically significant. In this case, the rotation forest ensemble outperformed all other ensemble methods except for the AdaBoost and MultiBoosting ensembles. This finding also revealed that McNemar's χ2 with continuity correction cannot verify the outcomes of the other two McNemar's tests at the 10% level of significance (i.e. the 90% confidence level). This is to be expected: the uncorrected χ2 equals the square of z and therefore always agrees with the normal test, whereas the continuity-corrected statistic is more conservative.

6. Conclusions

In recent years, classification ensembles or classifier combinations aiming to improve the classification accuracy of individual or weak classifiers have become a popular research topic in remote sensing. Various ensemble learning algorithms, including bagging, AdaBoost, MultiBoosting, DECORATE, random subspace, and random forest, have been suggested in the literature for this purpose. In this study, the recently introduced rotation forest ensemble method with decision trees was described and then compared with the other ensemble methods in terms of classification performance. In addition to the analysis of classification accuracies alone, three McNemar's test statistics were applied to assess the statistical significance of the differences in classifier performances.

Some important conclusions can be drawn from the results produced in this study using multispectral image data. First, the results clearly indicated that all ensemble methods considered in this study outperformed the decision tree classifier when the estimated accuracies were compared; improvements reached up to about 5.6% in overall classification accuracy. Secondly, statistical tests revealed that the improvement of the rotation forest ensemble over the decision tree was statistically significant; that is, rotation forest significantly improved the performance of the decision trees. Thirdly, when the rotation forest performance was compared with the performances of the other ensembles through McNemar's tests (i.e. z, χ2, and χ2 with continuity correction), it was found that the method produced significantly better results than bagging, DECORATE, and random subspace, and results comparable to those of the AdaBoost, MultiBoosting, and random forest methods at the 5% level of significance. It should be pointed out that although these differences in performance were insignificant, the rotation forest method produced the highest classification accuracies in terms of overall accuracy and kappa coefficient. Another important finding is that the three McNemar's test statistics reached the same conclusion at the 95% confidence level in the performance comparison of ensemble methods. However, when the confidence level was reduced to 90%, the performance difference between the rotation and random forest ensembles became statistically significant according to z and


χ2 without continuity correction, whereas the continuity-corrected χ2 did not confirm the difference. Finally, the results indicate that more research is needed to establish the strengths and weaknesses of the McNemar's test statistics based on normal and chi-squared distributions.

References

Agresti, A. 1996. An Introduction to Categorical Data Analysis. New York: Wiley.
Banfield, R. E., L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer. 2007. "A Comparison of Decision Tree Ensemble Creation Techniques." IEEE Transactions on Pattern Analysis and Machine Intelligence 29: 173–180.
Ben-David, A. 2008. "Comparison of Classification Accuracy Using Cohen's Weighted Kappa." Expert Systems with Applications 34: 825–832.
Boulila, W., I. R. Farah, K. S. Ettabaa, B. Solaiman, and H. Ben Ghezala. 2011. "A Data Mining Based Approach to Predict Spatiotemporal Changes in Satellite Images." International Journal of Applied Earth Observation and Geoinformation 13: 386–395.
Breiman, L. 1996. "Bagging Predictors." Machine Learning 24: 123–140.
Breiman, L. 1998. "Arcing Classifiers." Annals of Statistics 26: 801–824.
Breiman, L. 2001. "Random Forests." Machine Learning 45: 5–32.
Congalton, R. G. 1991. "A Review of Assessing the Accuracy of Classifications of Remotely Sensed Data." Remote Sensing of Environment 37: 35–46.
Congalton, R. G., and K. Green. 2009. Assessing the Accuracy of Remotely Sensed Data. Boca Raton, FL: CRC.
De Leeuw, J., H. Jia, L. Yang, X. Liu, K. Schmidt, and A. K. Skidmore. 2006. "Comparing Accuracy Assessments to Infer Superiority of Image Classification Methods." International Journal of Remote Sensing 27: 223–232.
Dietterich, T. G. 1998. "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms." Neural Computation 10: 1895–1923.
Dietterich, T. G. 2000. "An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization." Machine Learning 40: 139–157.
Duda, R. O., P. E. Hart, and D. G. Stork. 2000. Pattern Classification. New York: John Wiley and Sons.
Ferri, C., J. Hernandez-Orallo, and R. Modroiu. 2009. "An Experimental Comparison of Performance Measures for Classification." Pattern Recognition Letters 30: 27–38.
Fleiss, J. L., B. Levin, and M. C. Paik. 2003. Statistical Methods for Rates and Proportions. New Jersey: Wiley.
Foody, G. M. 1992. "On the Compensation for Chance Agreement in Image Classification Accuracy Assessment." Photogrammetric Engineering and Remote Sensing 58: 1459–1460.
Foody, G. M. 2002. "Status of Land Cover Classification Accuracy Assessment." Remote Sensing of Environment 80: 185–201.
Foody, G. M. 2004. "Thematic Map Comparison: Evaluating the Statistical Significance of Differences in Classification Accuracy." Photogrammetric Engineering and Remote Sensing 70: 627–633.
Foody, G. M. 2008. "Harshness in Image Classification Accuracy Assessment." International Journal of Remote Sensing 29: 3137–3158.
Foody, G. M. 2009. "Classification Accuracy Comparison: Hypothesis Tests and the Use of Confidence Intervals in Evaluations of Difference, Equivalence and Non-Inferiority." Remote Sensing of Environment 113: 1658.
Freund, Y., and R. Schapire. 1996. "Experiments with a New Boosting Algorithm." In 13th International Conference on Machine Learning, edited by L. Saitta, 3–6 July, Bari, Italy, 148–156. San Francisco, CA: Morgan Kaufmann.
Friedl, M. A., and C. E. Brodley. 1997. "Decision Tree Classification of Land Cover from Remotely Sensed Data." Remote Sensing of Environment 61: 399–409.
Friedl, M. A., D. Sulla-Menashe, B. Tan, A. Schneider, N. Ramankutty, A. Sibley, and X. M. Huang. 2010. "MODIS Collection 5 Global Land Cover: Algorithm Refinements and Characterization of New Datasets." Remote Sensing of Environment 114: 168–182.
Gao, Y., J. F. Mas, N. Kerle, and J. A. N. Pacheco. 2011. "Optimal Region Growing Segmentation and Its Effect on Classification Accuracy." International Journal of Remote Sensing 32: 3747–3763.


Gao, Y., J. F. Mas, B. H. P. Maathuis, X. M. Zhang, and P. M. Van Dijk. 2006. "Comparison of Pixel-Based and Object-Oriented Image Classification Approaches – a Case Study in a Coal Fire Area, Wuda, Inner Mongolia, China." International Journal of Remote Sensing 27: 4039–4055.
Gislason, P. O., J. A. Benediktsson, and J. R. Sveinsson. 2004. "Random Forest Classification of Multisource Remote Sensing and Geographic Data." In IEEE International Geoscience and Remote Sensing Symposium Proceedings (IGARSS 2004), Anchorage, AK, September 20–24, vol. II, 1049–1052. Hoboken, NJ: IEEE International.
Gislason, P. O., J. A. Benediktsson, and J. R. Sveinsson. 2006. "Random Forests for Land Cover Classification." Pattern Recognition Letters 27: 294–300.
Hamza, M., and D. Larocque. 2005. "An Empirical Comparison of Ensemble Methods Based on Classification Trees." Journal of Statistical Computation and Simulation 75: 629–643.
Han, M., X. R. Zhu, and W. Yao. 2012. "Remote Sensing Image Classification Based on Neural Network Ensemble Algorithm." Neurocomputing 78: 133–138.
Ho, T. K. 1998. "The Random Subspace Method for Constructing Decision Forests." IEEE Transactions on Pattern Analysis and Machine Intelligence 20: 832–844.
Kavzoglu, T. 2009. "Increasing the Accuracy of Neural Network Classification Using Refined Training Data." Environmental Modelling & Software 24: 850–858.
Kim, Y. S. 2009. "Boosting and Measuring the Performance of Ensembles for Successful Database Marketing." Expert Systems with Applications 36: 2161–2176.
Kuncheva, L. I., and C. J. Whitaker. 2003. "Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy." Machine Learning 51: 181–207.
Liu, C. R., P. Frazier, and L. Kumar. 2007. "Comparative Assessment of the Measures of Thematic Classification Accuracy." Remote Sensing of Environment 107: 606–616.
Liu, K. H., and D. S. Huang. 2008. "Cancer Classification Using Rotation Forest." Computers in Biology and Medicine 38: 601–610.
Lu, D., and Q. Weng. 2007. "A Survey of Image Classification Methods and Techniques for Improving Classification Performance." International Journal of Remote Sensing 28: 823–870.
Melville, P. 2003. Creating Diverse Ensemble Classifiers. Technical Report UT-AI-TR-03-306, Artificial Intelligence Lab, University of Texas, Austin.
Melville, P., and R. J. Mooney. 2005. "Creating Diversity in Ensembles Using Artificial Data." Information Fusion 6: 99–111.
Nanni, L., and A. Lumini. 2006. "An Experimental Comparison of Ensemble of Classifiers for Biometric Data." Neurocomputing 69: 1670–1673.
Pal, M. 2005. "Random Forest Classifier for Remote Sensing Classification." International Journal of Remote Sensing 26: 217–222.
Pal, M. 2007. "Ensemble Learning with Decision Tree for Remote Sensing Classification." World Academy of Science, Engineering and Technology 36: 258–260.
Pal, M. 2008. "Artificial Immune-Based Supervised Classifier for Land-Cover Classification." International Journal of Remote Sensing 29: 2273–2291.
Pal, M., and G. M. Foody. 2010. "Feature Selection for Classification of Hyperspectral Data by SVM." IEEE Transactions on Geoscience and Remote Sensing 48: 2297–2307.
Pal, M., and P. M. Mather. 2003. "An Assessment of the Effectiveness of Decision Tree Methods for Land Cover Classification." Remote Sensing of Environment 86: 554–565.
Pontius, R. G., and M. Millones. 2008. "Problems and Solutions for Kappa-Based Indices of Agreement." Paper presented at the International Conference on Studying, Modeling and Sense Making of Planet Earth, University of the Aegean, Mytilene, June 1–6.
Rodriguez, J. J., L. I. Kuncheva, and C. J. Alonso. 2006. "Rotation Forest: A New Classifier Ensemble Method." IEEE Transactions on Pattern Analysis and Machine Intelligence 28: 1619–1630.
Salzberg, S. L. 1997. "On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach." Data Mining and Knowledge Discovery 1: 317–327.
Stehman, S. V. 1997. "Selecting and Interpreting Measures of Thematic Classification Accuracy." Remote Sensing of Environment 62: 77–89.
Stiglic, G., and P. Kokol. 2007. "Effectiveness of Rotation Forest in Meta-Learning Based Gene Expression Classification." Paper presented at the 20th IEEE International Symposium on Computer-Based Medical Systems, Maribor, June 20–22, 243–250. Washington, DC: IEEE Computer Society.
Stiglic, G., J. J. Rodriguez, and P. Kokol. 2011. "Rotation of Random Forests for Genomic and Proteomic Classification Problems." Software Tools and Algorithms for Biological Systems 696: 211–221.


Sun, S. L., C. S. Zhang, and D. Zhang. 2007. "An Experimental Evaluation of Ensemble Methods for EEG Signal Classification." Pattern Recognition Letters 28: 2157–2163.
Tso, B., and P. M. Mather. 2009. Classification Methods for Remotely Sensed Data. 2nd ed. London: Taylor & Francis.
Wang, S. J., A. Mathew, Y. Chen, L. F. Xi, L. Ma, and J. Lee. 2009. "Empirical Analysis of Support Vector Machine Ensemble Classifiers." Expert Systems with Applications 36: 6466–6476.
Wang, Y. Y., and J. Li. 2008. "Feature-Selection Ability of the Decision-Tree Algorithm and the Impact of Feature-Selection/Extraction on Decision-Tree Results Based on Hyperspectral Data." International Journal of Remote Sensing 29: 2993–3010.
Webb, G. I. 2000. "MultiBoosting: A Technique for Combining Boosting and Bagging." Machine Learning 40: 159–196.
Webb, G. I., and Z. J. Zheng. 2004. "Multistrategy Ensemble Learning: Reducing Error by Combining Ensemble Learning Techniques." IEEE Transactions on Knowledge and Data Engineering 16: 980–991.
Witten, I. H., and E. Frank. 2005. Data Mining: Practical Machine Learning Tools and Techniques. San Francisco, CA: Morgan Kaufmann.
Xie, Y. C., Z. Y. Sha, and M. Yu. 2008. "Remote Sensing Imagery in Vegetation Mapping: A Review." Journal of Plant Ecology 1: 9–23.
Yan, W. Y., and A. Shaker. 2011. "The Effects of Combining Classifiers with the Same Training Statistics Using Bayesian Decision Rules." International Journal of Remote Sensing 32: 3729–3745.
Zhang, C. X., and J. S. Zhang. 2010. "A Variant of Rotation Forest for Constructing Ensemble Classifiers." Pattern Analysis and Applications 13: 59–77.
Zhang, C. X., J. S. Zhang, and G. W. Wang. 2008. "An Empirical Study of Using Rotation Forest to Improve Regressors." Applied Mathematics and Computation 195: 618–629.
