
Feature Selection by Iterative Reweighting: An Exploration

of Algorithms for Linear Models and Random Forests.

by

Abhishek Jaiantilal

B.A., Dharamsinh Desai Institute of Technology, 2003

M.S., New Jersey Institute of Technology, 2006

M.S., University of Colorado, 2011

A thesis submitted to the

Faculty of the Graduate School of the

University of Colorado in partial fulfillment

of the requirements for the degree of

Doctor of Philosophy

Department of Computer Science

2013

This thesis entitled:
Feature Selection by Iterative Reweighting: An Exploration of Algorithms for Linear Models and Random Forests.
written by Abhishek Jaiantilal

has been approved for the Department of Computer Science

Prof. Michael Mozer

Prof. Aaron Clauset

Prof. Vanja Dukic

Prof. Qin Lv

Prof. Sriram Sankaranarayanan

Date

The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above-mentioned discipline.


Jaiantilal, Abhishek (Ph.D., Computer Science)

Feature Selection by Iterative Reweighting: An Exploration of Algorithms for Linear Models and Random

Forests.

Thesis directed by Prof. Michael Mozer

In many areas of machine learning and data science, the available data are represented as vectors

of feature values. Some of these features are useful for prediction, but others are spurious or redundant.

Feature selection is commonly used to determine the utility of a feature. Typically, features are selected

in an all-or-none fashion for inclusion in a model. We describe an alternative approach that has received

little attention in the literature: determining the relative importance of features via continuous weights, and

performing multiple iterations of model training to iteratively reweight features such that the least useful

features eventually obtain a weight of zero. We explore feature selection by employing iterative reweighting

for two classes of popular machine learning models: L1 penalized linear models and Random Forests.

Recent studies have shown that incorporating importance weights into L1 models leads to improvement

in predictive performance in a single iteration of training. In Chapter 3, we advance the state-of-the-art by

developing an alternative method for estimating feature importance based on subsampling. Extending the

approach to multiple iterations of training, employing the importance weights from iteration n to bias the

training on iteration n + 1 seems promising, but past studies yielded no benefit from iterative reweighting. In

Chapter 4, we obtain a significant reduction of 7.48% in the error rate over standard L1 penalized algorithms,

and nearly as large an improvement over alternative feature selection algorithms such as Adaptive Lasso,

Bootstrap Lasso, and MSA-LASSO using our improved estimates of feature importance.

In Chapter 5, we consider iterative reweighting in the context of Random Forests and contrast this with

a more standard backward-elimination technique that involves training models with the full complement of

features and iteratively removing the least important feature. In parallel with this contrast, we also compare

several measures of importance, including our own proposal based on evaluating models constructed with

and without each candidate feature. We show that our importance measure yields both higher accuracy and

greater sparsity than importance measures obtained without retraining models (including measures proposed

by Breiman and Strobl), though at a greater computational cost.

Dedication

Dedicated to my parents, sisters, and the memories of my friend Keyur.


Acknowledgements

The foundations of this thesis were laid the week I started graduate school at CU. During the first

week of school, I met both Greg Grudic and Mike Mozer; unbeknownst to each other, both suggested the

idea of exploring sparse linear models as initial research. Sparse models turned out to be the focus of my

research and the motivation of this thesis.

I am indebted to Mike for pushing me towards trying out ideas, without an a priori bias towards a

particular method, and for allowing me to pursue ideas out of my own interest.

I am indebted to Greg for the exchange of ideas over the years and for the initial push towards some of the algorithms discussed in this thesis; Greg and Flashback Technologies provided valuable opportunities concerning employment, computing equipment, and data that were essential for the completion of this thesis.

I would like to thank Prof. Sriram for his encouragement, support, and for providing me with some of the

data that proved essential to this thesis. I am thankful to the other committee members, Prof. Vanja, Prof.

Qin, and Prof. Aaron, for serving on my defense committee and their valuable insights. I am very thankful

to my other collaborators during my stay at CU, Prof. Shivakant Mishra and Yifei Jiang.

Though mentioned last, the greatest thanks go to my parents, my sisters, and my friends Miraaj Soni and Greg Theodore, for their support and guidance.

Contents

Chapter

1 Thesis Overview 1

1.1 Scope of this thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Thesis Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2.1 Weighted Linear Models and Iteratively Reweighted Linear Models . . . . . . . . . . . 3

1.2.2 Iteratively Reweighted Random Forests Models . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Arrangement of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Current Approaches to Supervised Learning with Feature Selection 6

2.1 Supervised Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.2 Overview of Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2.1 Feature Relevancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2.2 Type of Feature Selection Algorithms: Wrapper, Filter, and Embedded . . . . . . . . 11

2.3 Feature Selection Using Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.3.1 Regularization Using the Ridge Penalty Function . . . . . . . . . . . . . . . . . . . . . 13

2.3.2 Regularization Using the Lasso Penalty Function . . . . . . . . . . . . . . . . . . . . . 15

2.3.3 Regularization Using Other Penalty Functions . . . . . . . . . . . . . . . . . . . . . . 18

2.4 Feature Selection Using Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.4.1 Why Use Random Forest? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.4.2 Model Combination in an Ensemble of Classifiers . . . . . . . . . . . . . . . . . . . . . 21


2.4.3 The Random Forest Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.4.4 Implicit Feature Selection Using Random Forests . . . . . . . . . . . . . . . . . . . . . 27

2.4.5 Explicit Feature Selection Using Random Forests . . . . . . . . . . . . . . . . . . . . . 28

3 Single-Step Estimation and Feature Selection using L1 Penalized Linear Models 31

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.2 Background Information and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.2.1 Asymptotic properties of L1 penalty . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.2.2 Motivation for our Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.3 Randomized Sampling (RS) Method to Create Weight Vector . . . . . . . . . . . . . . . . . . 35

3.3.1 Randomized Sampling (RS) Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.3.2 Consistency of choosing a set of Features from Randomized Sampling (RS) Experiments 37

3.4 Algorithms and Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.4.1 Synthetic Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.4.2 Real World Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.5 Conclusions and Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4 Iterative Training and Feature Selection Using L1 Penalized Linear Models 46

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.1.1 Existing Single and Multiple Step Estimation Algorithms for L1 Penalized Algorithms 47

4.2 Subsampling/Bootstrap Adaptive Lasso Algorithm (SBA-LASSO) . . . . . . . . . . . . . . . 52

4.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4.3.1 Tuning Parameter and Parametric Search in MSA . . . . . . . . . . . . . . . . . . . . 54

4.3.2 Parametric Search in SBA, Adaptive Lasso, and Bootstrap Lasso . . . . . . . . . . . . 55

4.3.3 L1 Penalized Algorithms and Their Parameters . . . . . . . . . . . . . . . . . . . . . . 57

4.3.4 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.4.1 Artificial Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61


4.4.2 Results for Real World Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

4.5 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5 Iterative Training and Feature Selection using Random Forest 75

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

5.1.1 Existing Feature Importance Measures for Random Forest . . . . . . . . . . . . . . . . 76

5.2 A Challenge for Existing Feature Importance Measures in Random Forests . . . . . . . . . . 78

5.2.1 Description of the Artificial Dataset Models . . . . . . . . . . . . . . . . . . . . . . . . 78

5.2.2 Results for Artificial Dataset Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.3 Retraining-Based Feature Importance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5.4 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

5.4.1 Key Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

5.4.2 Details of Iterated Reweighting of Random Forest Models . . . . . . . . . . . . . . . . 85

5.5 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

5.5.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

5.5.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

5.5.3 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

5.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

5.6.1 Overall Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

5.6.2 Analysis Based on % Reduction of Error . . . . . . . . . . . . . . . . . . . . . . . . . . 95

5.6.3 Analysis Based on Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

5.6.4 Computation Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

5.7 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

6 Conclusion 106

6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107


Bibliography 109

Appendix

A L1-SVM2 Algorithm 117

A.1 Optimization of a Double Differentiable Loss Function Regularized with the L1 Penalty . . . 117

A.1.1 Necessary and Sufficient Conditions for Optimality for the Whole Regularization Path 117

A.2 L1-SVM2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

B Appendix for Chapter 4 122

B.1 Description of Real World Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

B.2 Error Results on Artificial and Real World Datasets . . . . . . . . . . . . . . . . . . . . . . . 125

B.3 Feature Selection Results for Artificial Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 129

B.4 Expanding Dataset Dictionary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

C Appendix for Chapter 5 131

C.1 Description of Real World Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

C.2 Results on Real World Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

C.3 Choice of ntree Parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

C.4 Choice of mtry parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

C.5 A Challenge for ML algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

C.6 Why Does Prob+Reweighting Perform Worse Than Other Retraining+Removal Algorithms? 142

C.7 Comparison to Boruta and rf-ace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

C.8 Random Forests - Computational Complexity and Software . . . . . . . . . . . . . . . . . . . 146

Tables

Table

3.1 Mean ± Std. Deviation of Error Rates in % on Models 1 & 4 by SVM2. The best algorithms

are in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.2 Mean ± Std. Deviation of Error Rates in % on Models 1 & 4 by 1-norm SVM. The best

algorithms are in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.3 Variable Selection Results on Models 1 & 4 using SVM2. The best algorithms are in bold. . 41

3.4 Mean Error rates in % for Models 2, 3 & 5 using SVM2. The best algorithms are in bold. . . 41

3.5 Mean ± Std. Deviation of Error Rates on Real World Datasets. The best-performing L1 and

L2 penalized algorithms are in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.6 Avg. Error rate on Robotic Datasets from [88]. The best algorithms are in bold. . . . . . . . 44

4.1 Average number of features discovered correctly and incorrectly by the MSA over four different

L1 penalized algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.2 List of real world datasets used in our experiments . . . . . . . . . . . . . . . . . . . . . . . . 69

5.1 Seventeen datasets used in our experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

B.1 Results on real world classification and regression datasets. . . . . . . . . . . . . . . . . . . . 126

B.2 Results on nine synthetic datasets using four different L1 penalized algorithms . . . . . . . . 127

B.3 Features (correctly/incorrectly) selected by four different L1 penalized algorithms . . . . . . . 128

C.1 Error rates on Individual Datasets, for all algorithms. . . . . . . . . . . . . . . . . . . . . . . 134


C.2 % Relative Improvement on Error Rate – over baseline – on Individual Datasets, for all

algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

C.3 Results of rf-ace and Boruta. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

C.4 Relative Execution Time of existing random forest software compared to developed software. 149

Figures

Figure

2.1 Feature relevancy according to Kohavi and John [74] . . . . . . . . . . . . . . . . . . . . . . . 9

2.2 Feature relevancy according to Yu and Liu [129] . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.3 The search for the best feature at each node in bagged trees and random forest trees. . . . . . 25

4.1 Graphical representation of Friedman ranks and Bergmann-Hommel procedure, for variations

of the MSA, using percent error results obtained using nine artificial dataset models × four

different L1 penalized algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.2 Graphical representation of Friedman ranks and Bergmann-Hommel procedure, for variations

of the SBA, using percent error results obtained using nine artificial dataset models × four

different L1 penalized algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

4.3 Graphical representation of Friedman ranks and Bergmann-Hommel procedure, for different

L1 penalized algorithms, using percent error results obtained using nine artificial dataset models 67

4.4 Graphical representation of Friedman ranks and results of the Bergmann-Hommel procedure

for L1-SVM and LARS using percent error results obtained using real world datasets . . . . . 70

4.5 Graphical representation of Friedman ranks and Bergmann-Hommel procedure, for variations

of the SBA, using percent error results obtained using real world datasets. . . . . . . . . . . . 73

4.6 Overall number of features discovered using L1 penalized algorithms for real world datasets. . 74

5.1 Feature importance results for random forests and conditional forests on 3 artificial dataset

models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82


5.2 Comparison between search of features at the nodes of trees in random forests and FIbRF . . 86

5.3 Percent error results and percent relative reduction of error for all algorithms. . . . . . . . . . 94

5.4 Analysis of four importance measures and two elimination schemes. . . . . . . . . . . . . . . . 96

5.6 Difference in percent reduction of error among algorithms that use retraining-based vs non-

retraining-based feature importance measures, collapsed across removal schemes. . . . . . . . 98

5.7 Difference for the same training/test split of a dataset is represented by a red circle (◦). We show:

(a) comparison of Err+Removal and Breiman+Removal and (b) comparison of Err+Removal

and Prob+Removal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

5.8 Comparison of various algorithms on the basis of the similarity score . . . . . . . . . . . . . . 101

5.9 Sparsity of models created by different algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 102

5.10 Sparsity of models vs % reduction in error over the baseline model for Prob+Removal and

Err+Removal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

5.11 Computational complexity relative to Strobl+Removal (1X). . . . . . . . . . . . . . . . . . . 103

B.1 Features correctly and incorrectly detected by four different L1 penalized algorithms . . . . . 129

B.2 Results by using expanded dictionary for real world datasets with the standard L1 penalized

algorithm for LARS and L1-SVM2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

C.1 What is an appropriate value of ntree? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

C.2 For Random Forests, do we need to parametrically search for the best mtry? . . . . . . . . . 135

C.3 Importance (without retraining) using various algorithms for the artificial dataset . . . . . . . 138

C.4 Comparison of Prob+Reweighting vs other retraining based algorithms . . . . . . . . . . . . . 144

C.5 Feature importance results for random forests and conditional forests on 3 artificial dataset

models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

Chapter 1

Thesis Overview

1.1 Scope of this thesis

We live in a data-intensive world, and in many domains—the internet, genetics, multimedia, finance,

among others—the amount of data generated is too much for a human to manually sift through for relevant

information. Algorithms providing insights into the data are in demand and have given rise to the fields of

machine learning and data mining, and the newly named field of data science. The contributions of this

thesis are towards two sub-fields within machine learning, namely supervised machine learning and feature

selection.

Supervised Machine Learning. Supervised machine learning involves building a model from data

provided by a teacher. Supervised problems involve both regression and classification. Regression involves

transforming an input to a continuous output value; classification involves transforming an input to a discrete

output label. The input to a supervised machine learning model, x, typically consists of a vector of p features.

The output, y, is a continuous value (regression) or discrete value (classification). The problem is supervised

in the sense that a teacher provides a set of {x, y} examples mapping a particular input to a target output.

In supervised machine learning, the inputs representing objects of interest are collected in an array X of size n × p, and the target values/labels are collected in a vector Y of size n; n is the number of examples and p is the number of features observed for individual examples. In regression, the target values are continuous values


and in classification, the target labels are class labels. The trained machine learning model is expected to

reliably predict the target values or labels of future data. In this thesis, we restrict our study to the following

supervised machine learning techniques: linear regression [59, 122], linear support vector machine (SVM)

[119], and random forests [19].

Feature Selection. The goal of accurately predicting the target values or labels of unseen data is

hindered if noise is present in the data used for training the models. An irrelevant1 feature is either a feature

with measurement noise or a feature that does not help in prediction and, depending on the sensitivity of

the classifier algorithm, such a feature can adversely affect the creation of accurate models as well as the

interpretation of the data. Feature selection is a sub-field within machine learning aimed at creating accurate

models by excluding noisy (or irrelevant) features and by including only the relevant features.

In this thesis, we discuss embedded feature selection algorithms, like lasso regression [42, 112], 1-norm

support vector machine (SVM) [131], and random forests [19], that include the feature selection process

embedded in their model training process; that is, the algorithms simultaneously train the model and perform

feature selection. The embedded feature selection algorithms, discussed in this thesis, consist of both linear

and nonlinear algorithms. Linear algorithms assume an underlying linear model generating the data, and thus

they can address models restricted to certain types of data. Nonlinear algorithms are universal approximators

and are able to model any kind of data.

1.2 Thesis Summary

In this thesis, we explore the hypothesis that iteratively reweighted models—based on either the

L1 penalized algorithm or the random forests algorithm—outperform standard models that do not utilize

reweighting. We summarize the contributions of this thesis for the two classes of machine learning models we

explore: linear models constructed using L1 penalized algorithms and nonlinear models constructed using

random forests.

1 We discuss features and their relevancy later in Section 2.2.1.


1.2.1 Weighted Linear Models and Iteratively Reweighted Linear Models

Feature-selecting linear models are created by using a linear loss function in conjunction with a sparsity-inducing penalty function. Popular linear loss functions, utilized in this thesis, include the squared

loss function for regression and the hinge loss function for classification [59]. The L1 penalty function is

a popular means of obtaining sparsity. It has the advantage that the resultant formulation is convex, and

convex optimization has been well studied [97, 132, 133].

Recently, weighted linear models [132, 133], trained with a linear loss function and a weighted L1 penalty, have been proposed that employ user-specified weights to bias linear models towards relevant features and away from irrelevant or noisy features; a minimal sketch of this weighting appears at the end of this subsection. A multi-step estimation procedure, which iteratively reweights the linear model, was suggested by [24, 27]. Existing weighted algorithms differ in the manner in which they derive

the weights. The weights in [132, 133] were derived from an L2 penalized algorithm, whereas the weights

used in [24, 27] were derived from an L1 penalized algorithm.
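As background for what follows, note that a weighted L1 penalty $\sum_j w_j|\beta_j|$ can be reduced to a standard lasso problem by rescaling each column of $X$ by $1/w_j$, fitting an ordinary lasso, and rescaling the resulting coefficients. The following is a minimal sketch of that reduction; the weights here are hypothetical, and scikit-learn's Lasso solver is used purely for illustration rather than the solvers studied in this thesis.

```python
import numpy as np
from sklearn.linear_model import Lasso

def weighted_lasso(X, y, weights, lam):
    """Solve min (y - X b)^2 + lam * sum_j weights[j] * |b_j| by the
    standard reduction to an unweighted lasso on rescaled features."""
    X_scaled = X / weights                 # column j becomes x_j / w_j
    model = Lasso(alpha=lam).fit(X_scaled, y)
    return model.coef_ / weights           # undo the rescaling

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = 3.0 * X[:, 0] + rng.standard_normal(200)   # only feature 0 is relevant
w = np.array([0.2, 5.0, 5.0, 5.0, 5.0])        # hypothetical importance-derived weights
print(weighted_lasso(X, y, w, lam=0.1))        # feature 0 survives; the rest shrink to zero
```

Our contributions to the field of L1 penalized linear models are as follows: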

• In Chapter 3, we propose a single-step procedure using weights obtained from subsampled datasets. We show that our results are comparable to an existing weighted algorithm (Adaptive Lasso [132, 133]) for two different types of linear models. Unlike existing algorithms, our proposed algorithm can also be used in an online setting.

• In Chapter 4, we extend the single-step procedure from Chapter 3 into an iteratively reweighted procedure. We show that our proposed method achieves higher prediction accuracy than existing methods, including Adaptive Lasso [132, 133], Bootstrap Lasso [6], and MSA-LASSO [24]. We also propose a tuning parameter for increasing the accuracy of MSA-LASSO [24]. Our results for weighted linear models show that the magnitudes of the weights, and thus the method for obtaining them, are essential to the performance of the weighted linear model.


1.2.2 Iteratively Reweighted Random Forests Models

Our other major contributions are for a popular nonlinear algorithm known as random forests [19].

A random forest is an ensemble of simple decision trees that can collectively perform either classification or regression. Decision trees recursively partition the input data [23, 90, 109, 114]. Typically, at each node in a decision tree, the partition is created by choosing, among all available features, the feature that best helps in partitioning the data; at each node in the decision trees utilized by random forests, by contrast, the partition is created by choosing the best feature among mtry < p randomly selected features, where p is the number of available features. Random forests are so named due to the random search over mtry < p features at each node of the decision tree.

field of random forests are as follows:

• The notion of iterated estimation is gradually becoming popular for linear models trained using the L1 penalty, but has not been explored for nonlinear models. We propose an approach to iterated reweighting for random forest models. One of our contributions is a weighted form of random forests that can be used to bias the random forest models towards relevant features and away from irrelevant features.

• Using validation data, we can quantify the increase in prediction error when a particular feature is removed. A large increase in validation error implies that the feature is necessary for prediction, and vice versa; feature importance is quantified as the relative increase in validation error when a feature is absent. Typically, feature importance is calculated by training a single model and evaluating the absence of features, one at a time, from validation data; note that, because features are not considered in isolation from one another in the training process, the trained model may overfit correlated (but spurious) features. We propose a new feature importance measure, based on retrained models, that evaluates features in isolation for both training and validation data (a generic sketch of retraining-based importance follows this list). Typically, feature importance values are uninterpretable, as the relative increase in error for individual features is data dependent; thus, we also propose an interpretable form of feature importance based on probabilities obtained from a significance test. We motivate the advantage of our importance measure using artificial data and show improved results, on real-world datasets, over existing random forests methods.
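The retraining scheme underlying such an importance measure can be sketched generically as follows; this is a simplified illustration using scikit-learn's RandomForestRegressor and mean squared error, not the exact procedure developed in Chapter 5.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def retraining_importance(X_tr, y_tr, X_va, y_va):
    """Importance of feature j = increase in validation error when the
    model is retrained with feature j removed."""
    def val_error(cols):
        rf = RandomForestRegressor(n_estimators=100, random_state=0)
        rf.fit(X_tr[:, cols], y_tr)
        return np.mean((rf.predict(X_va[:, cols]) - y_va) ** 2)

    p = X_tr.shape[1]
    base = val_error(list(range(p)))   # error of the model with all features
    return np.array([val_error([k for k in range(p) if k != j]) - base
                     for j in range(p)])

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 6))
y = X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.standard_normal(300)
print(retraining_importance(X[:200], y[:200], X[200:], y[200:]))
# features 0 and 1 should show the largest increases in validation error
```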


1.3 Arrangement of the thesis

Chapter 2 presents an overview of feature selection and feature selection through L1 penalized algo-

rithms and random forests. In Chapter 3, we propose a single-step procedure for linear models using weights

derived from subsampled datasets. In Chapter 4, we extend the single-step procedure, for linear models, into

an iterative procedure. In Chapter 5, we propose an iteratively reweighted procedure for random forests.

Chapters 3, 4, and 5 can be read independently of each other, though Chapter 4 is an extension of Chapter 3.

In Chapter 6, we present our conclusions and proposals for future work.

Chapter 2

Current Approaches to Supervised Learning with Feature Selection

In this chapter, we summarize the relevant literature and background information needed to under-

stand our research. This chapter is divided into four sections. In Section 2.1, we present an overview of

supervised learning. In Section 2.2, we present an overview of feature selection methods. In Section 2.3, we

discuss using an L1 or an L2 penalty function to train linear models, and in Section 2.4, we discuss random

forests, a nonlinear algorithm. We start with a discussion of supervised machine learning.

2.1 Supervised Machine Learning

In supervised machine learning, we assume that a training dataset $D = \{(x_i, y_i),\ i = 1, \dots, n,\ x_i \in \mathbb{R}^p,\ y_i \in \mathbb{R}\}$ is available, where $x$ is a vector containing information about $p$ features that are discrete or continuous, and $y$ is a target value that is either discrete or continuous. Features are also known as variables or attributes. We assume that there exists a joint probability distribution $P(x, y)$, and that a finite number of $n$ examples are drawn i.i.d. from $P(x, y)$ to construct the training dataset $D$.

A classifier or model is a function $h(x): \mathbb{R}^p \rightarrow y$, where $y$ is a discrete label. Regression involves transforming an input $x$ to a continuous output value, whereas classification involves transforming an input $x$ to a discrete output label. The non-negative loss function $L(y, \hat{y})$ is used to measure the difference between the prediction $\hat{y}$ (obtained from classifier $h$) and the target value or label $y$. The expectation of the loss function, also known as the risk function associated with $h$, is defined as

$$R(h) = E[L(y, h(x))] = \int L(y, \hat{y})\, dP(x, y). \quad (2.1)$$

The goal of supervised machine learning is to find a classifier $h^*$ from the set $\{h(x)\}, x \in \mathbb{R}^p$, for which the risk is minimized, that is,

$$h^* = \underset{h \in \{h(x)\},\, x \in \mathbb{R}^p}{\arg\min}\ R(h). \quad (2.2)$$

The classifier $h^*$ models $P(y|x)$, which is the probability of $y$ given $x$. In contrast to this discriminative model, a generative model attempts to model the joint probability distribution $P(x, y)$ and uses it to derive $P(y|x)$ via Bayes' rule. In real world datasets, the full joint distribution $P(x, y)$ is usually unknown, and only a finite set of examples is available. In this thesis, we restrict our discussion to discriminative models.

For discriminative models, we calculate the empirical risk function, which is an approximation of the risk function $R(h)$, as follows:

$$R_{emp}(h) = \sum_{i=1}^{n} L(y_i, \hat{y}_i). \quad (2.3)$$

The theory of empirical risk minimization (ERM) [119] states that a learning algorithm, also known as a classifier algorithm or an induction algorithm, should choose a classifier $\hat{h}$ such that the empirical risk is minimized, that is:

$$\hat{h} = \underset{h \in \{h(x)\},\, x \in \mathbb{R}^p}{\arg\min}\ R_{emp}(h) = \underset{h \in \{h(x)\},\, x \in \mathbb{R}^p}{\arg\min}\ \sum_{i=1}^{n} L(y_i, \hat{y}_i). \quad (2.4)$$
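As a concrete illustration of Equations 2.3 and 2.4, the following sketch computes the empirical risk under squared loss and performs ERM over a finite (randomly generated) candidate set of linear classifiers; the candidate set is purely illustrative.

```python
import numpy as np

def empirical_risk(beta, X, y):
    """R_emp(h) under squared loss, with h(x) = x . beta (Equation 2.3)."""
    return np.sum((X @ beta - y) ** 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.0]) + 0.1 * rng.standard_normal(50)

# ERM (Equation 2.4): choose the candidate with the lowest empirical risk.
candidates = [rng.standard_normal(3) for _ in range(1000)]
h_hat = min(candidates, key=lambda beta: empirical_risk(beta, X, y))
print(h_hat)   # the best random candidate; more candidates get closer to [1, -2, 0]
```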

The selection of a classifier $\hat{h}$ from a set of classifiers $\{h(x)\}, x \in \mathbb{R}^p$, is known as model selection.

In contrast, model combination is a procedure in which an ensemble of models, classifiers, or experts

is used together for predicting target values or labels of unseen data. The prediction of an ensemble

is a function of the individual classifiers in the ensemble and the individual classifiers are also known as

base classifiers. The motivation behind model combination is that a single base classifier may not be able

to accurately model the entire distribution of data due to biases in the base classifier. In such cases, an

ensemble whose prediction is a combination of outputs from individual base classifiers, with different model

biases (architectures), may perform better than any single base classifier.


In this thesis, we discuss linear algorithms that are trained using model selection, and random forests, a nonlinear algorithm that uses model combination for training. Before discussing these algorithms, we

digress and discuss feature selection.

2.2 Overview of Feature Selection

Feature selection is a subfield within machine learning aimed at creating accurate models by excluding

irrelevant features, and by including only relevant features.

We focus on the area of supervised learning, and ignore techniques for feature selection and dimen-

sionality reduction via unsupervised learning, where target values or labels are unavailable.

The goal of feature selection is to obtain good performance from a model, i.e., high accuracy or low

error. Good performance is defined over the entire distribution of the data. However, practically speaking,

such distributions are not available and thus held-out test sets are employed for evaluating classifier and

regression accuracy.

Why is feature selection necessary? Consider the Bayes classifier, which is a theoretical ideal classifier

that predicts the most probable class for a given input. The Bayes classifier is assumed to have access to the

distribution of the data. Practically, we cannot create Bayes classifiers, and we do not know the underlying

distribution of the data, except in trivial cases. The Bayes classifier is optimal in its use of input features.

If purely noise features are included in the input, accuracy of the Bayes classifier will not decrease. Thus,

from a theoretical standpoint, features need not be discarded. However, in contrast to this theoretical

ideal, supervised machine learning algorithms are susceptible to noisy features, and the best classification

accuracy is achieved only if the noisy features are ignored; we define a noisy feature as a feature with high

measurement noise. Such susceptibility to noise is due to the fact that any algorithm has to balance the

estimation of multiple parameters (for reducing the bias of the model) against accuracy in estimating the

parameters (for reducing the variance of the model).


2.2.1 Feature Relevancy

Intuitively, one has a sense of what a relevant or irrelevant feature is, but formally defining these terms

is tricky. We summarize some of the attempts at formalization that have been proposed in past research

[11, 22, 70, 74, 129].

• Kohavi and John [74] define the optimal feature subset as the subset of features for which the accuracy of the constructed classifier is maximized. They note that such a classifier might not be unique. Furthermore, Kohavi and John [74] and John et al. [70] categorize features into two types based on their relevance to the prediction: weak and strong features. The set of strongly relevant features, denoted S, consists of those features whose removal will cause a performance deterioration in the optimal Bayes classifier. In contrast, a feature X is a weakly relevant feature if it is not a strongly relevant feature and there exists a subset of features M such that the performance of the Bayes classifier on M is worse than the performance on $M \cup \{X\}$. Strongly relevant features are essential and cannot be removed without loss of prediction accuracy, whereas weakly relevant features can sometimes contribute to prediction accuracy. Irrelevant features are features that are neither strongly nor weakly relevant. In Figure 2.1, we depict Kohavi and John's [74] definition of the optimal set of features via the darkly shaded region; a Bayes classifier has to use all the strongly relevant features and some of the weakly relevant features.

Figure 2.1: The figure is based on the discussion in Kohavi and John [74]. The optimal set of features is defined by the darkly shaded region, which contains all strongly relevant features and some, but not all, of the weakly relevant features.

• Blum and Langley [11] define a strongly relevant feature as follows: a feature S is strongly relevant if there exist examples A and B that differ only in their assignment to S and have different target values or labels, whereas weakly relevant features are features that become strongly relevant when a subset of features is removed from the data. Though their definition of strongly and weakly relevant features differs a bit from [70, 74], the implications are similar; that is, strongly relevant features should never be discarded, as their removal increases the ambiguity of the data, thereby increasing the probability of prediction error on future data.

[Figure 2.2 legend: I: Irrelevant features; II: Weakly relevant and redundant features; III: Weakly relevant but non-redundant features; IV: Strongly relevant features; III + IV: Optimal subset.]

Figure 2.2: Reproduced from Yu and Liu [129]. According to Yu and Liu [129], redundancy of features should be considered within the sets of strongly and weakly relevant features (as defined by Kohavi and John [74]). Ideally, the optimal set of features is present in the set III + IV, that is, the set containing strongly relevant features and weakly relevant but non-redundant features.

• Yu and Liu [129] argue that redundancy of features should be considered in addition to the relevancy of

the features. Their definition of redundant features coexisting with weakly and strongly relevant features

is depicted in Figure 2.2; redundant features are considered to be a form of weakly relevant features. The

optimal set of features should consist of strongly relevant features and weakly relevant features that are

non-redundant.

• Breiman [22] relates the relevancy of features to the selection of the features in a decision tree. At each

node in a decision tree, a decision tree algorithm searches the entire feature space for the feature that best

separates/splits the data according to an information criterion. Breiman characterizes strong features as

features which have high probability of being selected for splitting at the nodes of the decision tree and

vice-versa for the weak features. Breiman does not define the irrelevancy of features, but assumes that

noisy features are a type of weak feature. Similarly, Breiman does not indicate a probability threshold


that differentiates a weak feature from a strong feature.

2.2.2 Type of Feature Selection Algorithms: Wrapper, Filter, and Embedded

Now, we give a brief overview of feature selection algorithms. According to Kohavi and John [74],

feature selection algorithms broadly fall into three categories: filter, wrapper, and embedded methods.

(1) Filter-based feature selection algorithms work by analyzing the characteristics of the data and evaluating

the features without including any learning algorithm in the process. Thus, results of the filter methods

are independent of the classifier.

(2) Wrapper-based feature selection algorithms use a learning algorithm to identify the relevant features and create the final model using only the relevant features. Wrapper-based algorithms use multiple search strategies in conjunction with a classifier. As the exploration of the entire feature set is an NP-hard1 problem, wrapper-based algorithms use the classifier as a black box and evaluate the performance of the classifier on certain feature sets. The search strategies include backward elimination (removing one feature at a time) and forward selection (adding one feature at a time); a minimal sketch of forward selection follows this list.

(3) Algorithms like lasso regression [112], the 1-norm Support Vector Machine (SVM) [131], random forests [19], and decision trees [23, 90] are embedded feature selection algorithms and include feature selection as a part of their model training process; that is, the algorithms simultaneously train the model and perform feature selection. The algorithms that we focus on in this thesis are embedded methods.
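The following is a minimal sketch of the forward-selection wrapper strategy, using scikit-learn's LogisticRegression as the black-box classifier and cross-validated accuracy as the evaluation; both choices are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, max_features):
    """Greedy wrapper: repeatedly add the feature that most improves
    cross-validated accuracy, treating the classifier as a black box."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(max_features):
        scores = {j: cross_val_score(LogisticRegression(max_iter=1000),
                                     X[:, selected + [j]], y, cv=5).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # only features 0 and 1 matter
print(forward_selection(X, y, max_features=2))   # expected: {0, 1} in some order
```

Backward elimination is the mirror image: start from all features and repeatedly drop the one whose removal hurts the cross-validated score the least.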

A disadvantage of the feature sets obtained from the wrapper and embedded methods is their dependence

on the classifier used. However, this dependency allows wrapper and embedded methods to obtain higher

predictive accuracy than filter methods because selection and evaluation of the features is done by the same

classifier.

In the next section, we discuss linear models, which are an interesting class of algorithms, as their linear

form facilitates analysis.

1 A dataset with p features will have $2^p$ possible feature sets.


2.3 Feature Selection Using Linear Models

Next, we discuss linear models, a type of classifier.

Linear Model. Assume that a training dataset $D = \{(x_i, y_i),\ i = 1, \dots, n,\ x_i \in \mathbb{R}^p,\ y_i \in \mathbb{R}\}$ is available. A linear model, for training example $i$, is defined as

$$y_i = \beta_0 + \sum_{j=1}^{p} x_{ij}\beta_j, \quad (2.5)$$

where $x_{ij}$ represents the value of the $j$th feature in the input $x_i$ for example $i$, $y_i$ is an output value, $\beta_0$ is known as the bias, and the $\beta_j$ are known as model parameters. We refer to the vector $\beta = (\beta_0, \beta_1, \dots, \beta_p)$ as the model parameter vector; for convenience, we include the constant variable 1 in $x_i$, which allows us to include $\beta_0$ inside the parameter vector $\beta$.
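In code, this bookkeeping amounts to prepending a constant-1 column to $X$ so that the prediction reduces to a single matrix-vector product; a minimal sketch:

```python
import numpy as np

# Fold the bias beta_0 into the parameter vector by prepending a column of
# ones to X, so that y_hat = X_aug @ beta with beta = (beta_0, beta_1, ..., beta_p).
X = np.array([[2.0, 3.0],
              [1.0, 5.0]])                      # n = 2 examples, p = 2 features
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
beta = np.array([0.5, 1.0, -1.0])               # (beta_0, beta_1, beta_2)
y_hat = X_aug @ beta                            # beta_0 + sum_j x_ij * beta_j for each i
print(y_hat)                                    # [-0.5, -3.5]
```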

Least Squares Method. The least squares method is a popular learning algorithm used to obtain the

model parameters β for the linear model shown in Equation 2.5.

Assume that a training dataset $D = \{(x_i, y_i),\ i = 1, \dots, n,\ x_i \in \mathbb{R}^p,\ y_i \in \mathbb{R}\}$ is available. We represent the inputs $x_i,\ i = 1, \dots, n$, in a matrix $X$ of size $n \times p$ and the target values $y_i,\ i = 1, \dots, n$, in a vector $y$ of size $n$. Then, the least squares method estimates are

$$\hat{\beta} = \underset{\beta}{\arg\min}\ (y - X\beta)^2. \quad (2.6)$$

If we assume that $\{h(x)\}, x \in \mathbb{R}^p$, is a set of linear models, then $\hat{\beta}$ is the parameter vector of the linear classifier $\hat{h}$ selected using the theory of empirical risk minimization (ERM), which was discussed earlier in Section 2.1.

The analytical solution for the least squares method in Equation 2.6 is

$$\hat{\beta} = (X^T X)^{-1} X^T y, \quad (2.7)$$

which can be obtained by setting the derivative of Equation 2.6 to zero, and is only defined when $X^T X$ is invertible. When $X^T X$ is singular or near-singular, its inversion is not stable and will cause large magnitudes in $\hat{\beta}$. In order to rectify this instability, a method known as regularization [61, 81, 85, 117] is employed.


Regularization. Regularization is a method for obtaining stable models. For regularization of the least squares method, we augment Equation 2.6 to form

$$\hat{\beta} = \underset{\beta}{\arg\min}\ (y - X\beta)^2 + \lambda\|\beta\|_2^2, \quad \lambda \geq 0, \quad (2.8)$$

where $\|\beta\|_2^2 = (\sqrt{\beta_1^2 + \dots + \beta_p^2})^2$. The analytical solution for Equation 2.8 is

$$\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y, \quad (2.9)$$

and the additional $\lambda I$ term helps in mitigating the effects of a singular or near-singular $X^T X$.
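A small numerical sketch contrasting Equations 2.7 and 2.9 on a nearly collinear design (the 1e-6 perturbation and λ = 1 are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.standard_normal(100)
# Second column nearly duplicates the first, so X^T X is near-singular.
X = np.column_stack([x1, x1 + 1e-6 * rng.standard_normal(100)])
y = x1 + 0.1 * rng.standard_normal(100)

# Least squares (Equation 2.7): the near-singular inversion is unstable,
# producing coefficients of huge magnitude.
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge (Equation 2.9): the lambda * I term stabilizes the inversion.
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print(beta_ls)      # e.g. enormous offsetting +/- values
print(beta_ridge)   # two moderate coefficients that split the shared signal
```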

A general form of regularization is defined as

$$\hat{\beta} = \underset{\beta}{\arg\min}\ L(y, h(X, \beta)) + \lambda J(\beta), \quad \lambda \geq 0, \quad (2.10)$$

where $L(y, h(X, \beta))$ is the loss function, $J(\beta)$ is called the penalty function, and $h(X, \beta)$ is a classifier or a model. $J(\beta)$ controls the complexity of the model, via the model parameter $\beta$, and the $\lambda$ parameter, also known as the regularization parameter, controls the amount of complexity allowed for the model. Typically, models trained using Equation 2.10 are known as regularized or penalized models and the learning algorithms are known as regularized or penalized algorithms.

In the case when least squares is regularized, as shown in Equation 2.8, $h(X, \beta) = X\beta$, $L(y, h(X, \beta)) = (y - h(X, \beta))^2 = (y - X\beta)^2$, and $J(\beta) = \|\beta\|_2^2$. The term $J(\beta) = \|\beta\|_2^2 = (\sqrt{\beta_1^2 + \dots + \beta_p^2})^2$ is known as the ridge penalty function or the L2-norm penalty function; for brevity, henceforth, we will refer to them as the ridge penalty or the L2 penalty, respectively. The L2 penalty gets its name from the Lp-norm function, which for a vector $x$ of size $n$ is $\|x\|_p = (\sum_{i=1}^{n} |x_i|^p)^{1/p}$.

The rest of this section is devoted to discussing the different types of functions that can be used for modeling the loss function $L(y, h(X, \beta))$ and the penalty function $J(\beta)$ in Equation 2.10. We limit our discussion of loss functions to convex functions [14, 95], for which any local minimum is a global minimum.

2.3.1 Regularization Using the Ridge Penalty Function

Ridge Regression. In Equation 2.8, we showed that the least squares method can be regularized by

using the ridge penalty or the L2 penalty; the resultant formulation is called ridge regression.


Support Vector Machine (SVM) Classification Using the L2 Penalty. Up until now, we have

primarily discussed regression using the least squares loss function where the targets y are continuous values.

When targets are discrete values or class labels, the learning problem is known as classification. A popular

loss function, used for classification, is the hinge loss function, which is defined as

$$\text{Hinge Loss: } L(y, h(X, \beta)) = \sum_{i=1}^{n} [1 - y_i(x_i\beta)]_+, \quad [v]_+ = \max(v, 0), \quad y_i \in \{+1, -1\}. \quad (2.11)$$

In Equation 2.11, we only consider the case of binary classification and assume that the possible class labels

for a target are either +1 or -1. Note that in regression β represents a p-dimensional hyperplane passing

through the data, whereas in classification β represents a p-dimensional hyperplane separating the two

classes, where p is the number of features. There are differences between the least squares loss function and

the hinge loss function. The least squares loss function awards a positive value if there is a difference between

an example’s target value and the predicted value, and zero otherwise. In contrast, the hinge loss function

awards a positive value only if an example is misclassified. The value of the positive loss for a misclassified

example is dependent on the distance of the example from the separating hyperplane represented via model

parameter β.
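A small numerical sketch of Equation 2.11, showing that correctly classified examples beyond the margin contribute zero loss:

```python
import numpy as np

def hinge_loss(beta, X, y):
    """Equation 2.11: sum_i [1 - y_i * (x_i . beta)]_+ with y_i in {+1, -1}."""
    margins = y * (X @ beta)
    return np.sum(np.maximum(0.0, 1.0 - margins))

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]])
y = np.array([+1, +1, -1])
beta = np.array([0.5, 0.5])
# Margins: 1.5 (loss 0), 0.5 (loss 0.5), 1.0 (loss 0); only the example
# inside the margin contributes to the total.
print(hinge_loss(beta, X, y))   # 0.5
```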

The formulation of the support vector machine (SVM) algorithm [59, 119] is

$$\text{L2 penalized SVM: } \hat{\beta} = \underset{\beta}{\arg\min} \left\{ \lambda\|\beta\|_2^2 + \sum_{i=1}^{n} [1 - y_i(x_i\beta)]_+ \right\}, \quad [v]_+ = \max(v, 0), \quad \lambda \geq 0. \quad (2.12)$$

The parameter vector β is discovered by optimizing the hinge loss function regularized with the L2 penalty.

For clarity, we refer to the SVM algorithm as L2 penalized SVM.

The name of the SVM algorithm is derived from the concept of support vectors, which are examples

that lie near the hyperplane separating the two classes and can be used to identify the separating hyperplane.

Typically, the number of support vectors is much less than the number of training examples. In the case of linear

models, including the linear SVM, we can identify the hyperplane either using the parameter vector β or the

support vectors. When the SVM algorithm is used to train nonlinear models, the model parameter β is not

available and the support vectors are used to define the nonlinear hyperplane separating the two classes. In

this thesis, we limit ourselves to the linear SVM.


Unlike ridge regression, an analytical solution does not exist for the L2 penalized SVM shown in

Equation 2.12. Though quadratic programming can be employed for solving L2 penalized SVM, more

efficient algorithms have been designed; for more information, refer to Schölkopf and Smola [101] and Platt

[86].

2.3.2 Regularization Using the Lasso Penalty Function

The lasso penalty or L1 penalty was considered by Tibshirani [112] for regularizing least squares

regression. This paper showed that the lasso penalty induces feature selection within the trained linear

model by producing a sparse β. That is, many of the values in the model parameter β may be set to zero

depending on the value of the regularization parameter λ. Before discussing the feature selection properties

of the lasso penalty, we present formulations of the lasso penalty that incorporate both the least squares and

hinge loss functions.

Lasso Regression. The lasso or L1 penalty, represented as $\|\beta\|_1 = |\beta_1| + \dots + |\beta_p|$, is used with least squares to form

$$\hat{\beta} = \underset{\beta}{\arg\min}\ (y - X\beta)^2 + \lambda\|\beta\|_1, \quad \lambda \geq 0, \quad (2.13)$$

and the resultant formulation is known as lasso regression.

Support Vector Machine (SVM) Classification Using the L1 Penalty. Just as we defined the L2 penalized SVM in Equation 2.12, we can define an L1 penalized form of the hinge loss function as

$$\text{L1 penalized SVM: } \hat{\beta} = \underset{\beta}{\arg\min} \left\{ \lambda\|\beta\|_1 + \sum_{i=1}^{n} [1 - y_i(x_i\beta)]_+ \right\}, \quad [v]_+ = \max(v, 0), \quad \lambda \geq 0. \quad (2.14)$$

Unlike ridge regression, an analytical solution does not exist for either lasso regression or the L1 penalized

SVM. But before we discuss algorithms for solving lasso regression and the L1 penalized SVM, we discuss

the feature selection properties that the L1 penalty induces in these formulations.

Feature Selection in Lasso Regression and Other Lasso Penalized Algorithms. The following

example was described in Tibshirani [112] to show that the lasso penalty performs a type of feature selection.


Assume that the input matrix $X$ is orthogonal ($X^T X = I$). $\hat{\beta}_{ls}$ is used to denote the least squares estimates for the orthogonal ($X^T X = I$) case, and thus $\hat{\beta}_{ls} = (X^T X)^{-1} X^T y = X^T y$, using Equation 2.7.

We digress a bit and first discuss the analytical solution for ridge regression. When the input matrix $X$ is orthogonal ($X^T X = I$), the analytical solution for ridge regression is

$$\hat{\beta}_{ridge} = (X^T X + \lambda I)^{-1} X^T y = (1 + \lambda)^{-1} \hat{\beta}_{ls}, \text{ using Equation 2.9, where } \hat{\beta}_{ls} = X^T y. \quad (2.15)$$

Note that the model parameter $\hat{\beta}_{ridge}$, obtained from ridge regression, is scaled down by a factor of $(1 + \lambda)$ when compared to the model parameter $\hat{\beta}_{ls}$, obtained from least squares.

In general, there is no analytical solution for lasso regression, but there exists an analytical solution for lasso regression when the input matrix $X$ is orthogonal ($X^T X = I$). We can differentiate Equation 2.13 and set the derivative to 0. We obtain

$$\hat{\beta}_{j(lasso)} = \text{sign}(\hat{\beta}_{j(ls)})\,(|\hat{\beta}_{j(ls)}| - \lambda)_+, \text{ where } (v)_+ = \max(v, 0), \quad (2.16)$$

where $\hat{\beta}_{j(lasso)}$ and $\hat{\beta}_{j(ls)}$ are used to represent the model parameters from lasso regression and least squares, respectively, for the $j$th feature. As shown in Equation 2.16, the model parameter $\hat{\beta}_{j(lasso)}$ obtained from lasso regression for the $j$th feature is nonzero only when $|\hat{\beta}_{j(ls)}| > \lambda$, and zero otherwise. Thus, for lasso regression, the lasso penalty thresholds features based on the amount of $\lambda$ specified, producing a sparse $\hat{\beta}_{lasso}$, the parameter vector obtained from lasso regression; a sparse $\hat{\beta}_{lasso}$ is still produced by lasso regression when the input matrix $X$ is not orthogonal.
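Equation 2.16 is the soft-thresholding operator; a minimal numerical sketch contrasting it with ridge's multiplicative shrinkage from Equation 2.15:

```python
import numpy as np

def soft_threshold(beta_ls, lam):
    """Equation 2.16: lasso solution for orthogonal X, given the least
    squares estimates beta_ls."""
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)

beta_ls = np.array([3.0, -1.5, 0.4, -0.2])
print(soft_threshold(beta_ls, lam=0.5))   # [2.5, -1.0, 0.0, -0.0]: exact zeros
print(beta_ls / (1.0 + 0.5))              # ridge (Equation 2.15): shrinks, never zeros
```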

When we consider other convex loss functions, including the hinge loss function shown in Equation 2.11

and used for L2 penalized SVM, sparsity in the model parameters is induced due to the lasso penalty. In

contrast, the model parameter obtained from optimizing a convex loss function regularized with the ridge or

L2 penalty is dense; that is, most of the values in β are non-zero.

Solving Lasso Penalized Formulations Using LARS and LARS-like Algorithms. Earlier, we

noted that an analytical solution does not exist for either lasso regression or the L1 penalized SVM. However,

because the L1 penalty is a convex penalty and both least squares and hinge loss functions are convex

functions, there exists a global minimum [14, 95] for the resultant formulations of both lasso regression


and the L1 penalized SVM. As the least squares loss function is quadratic, Tibshirani [112] proposed using

quadratic programming for solving lasso regression given a fixed value of the regularization parameter λ, λ ≥ 0

in Equation 2.13. Similarly, as both the loss function and the penalty function are linear, we can use linear

programming for solving the L1 penalized SVM, given a fixed value of the regularization parameter λ, λ ≥ 0

in Equation 2.14.

An efficient algorithm, known as Least Angle Regression (LARS), was proposed by Efron et al. [42]

for solving lasso regression. For lasso regression, the running time of LARS is less than that of a solver using

quadratic programming. Also, LARS can simultaneously solve for the entire range of the regularization

parameter λ (0 ≤ λ ≤ ∞) in Equation 2.13.

Before discussing the LARS algorithm, we make the following two observations for lasso regression

that were essential in the design of LARS: (1) For lasso regression, between any two regularization values

$\lambda_1 < \lambda_1 + \epsilon$, where $\epsilon > 0$ is a sufficiently small value, the change in the model parameter $\beta_j$ is linear, where $j$ is the index of an input feature and $\beta_j \neq 0$. (2) In Equation 2.16, we concluded that the lasso penalty thresholds

features using the λ parameter, and the larger the value of λ, the sparser is the model parameter β.

The working of the LARS algorithm can be described as follows: In the LARS algorithm, the initial

value of the regularization parameter is set to λ = ∞ and is gradually decreased to λ = 0. Due to the

thresholding property of the lasso penalty, as λ decreases, more features gain a nonzero value in the parameter

vector β. Also, for lasso regression, the change in the parameter vector β is linear for features that are

nonzero. LARS analytically calculates the change in λ required to go from k features to k + 1 features and

the corresponding change in the feature values represented in the model parameter β. Thus, the LARS

algorithm can calculate the parameter vector β for the entire regularization path 0 ≤ λ ≤ ∞ in p steps,

when p features are present; in practice, the number of steps taken are min(n, p), where n is the number of

examples and p is the number of features.

Recently, many papers have proposed algorithms that calculate the entire regularization paths for

different types of linear loss functions and these algorithms bear similarity to the LARS algorithm. Zhu

et al. [131] proposed an efficient algorithm for the L1 penalized SVM (shown in Equation 2.14). Later,

Rosset and Zhu [97] provided an algorithmic framework modeled around LARS that can calculate the entire


regularization path (0 ≤ λ ≤ ∞) for single and double differentiable loss functions using the L1 penalty.

An example of single differentiable loss functions is the hinge loss function as shown in Equation 2.11; an

example of double differentiable loss functions is least squares, as shown in Equation 2.6. Henceforth, we

refer to algorithms that follow the algorithmic framework described in Rosset and Zhu [97], and which also

includes LARS, as piecewise-linear solution-path-generating algorithms; we extensively use such algorithms

in this thesis.

Even though the algorithms discussed in this thesis are based on L1 and L2 penalty functions, we note

that the literature includes exploration of other penalty functions. We briefly discuss these other functions

in the next section.

2.3.3 Regularization Using Other Penalty Functions

Lγ Family of Penalty Functions. The most popular penalty functions are the ridge or L2 penalty and the lasso or L1 penalty. In this section, we describe other penalty functions, starting with the members of the Lγ family of penalty functions. The bridge penalty [50] is a family of penalty functions defined as

$$J(\beta) = \sum_j |\beta_j|^\gamma, \quad \gamma \geq 1. \quad (2.17)$$

The bridge family includes the L1 and L2 penalty functions. The penalty functions lying between the L1 and L2 penalties, that is, J(\beta) = \sum_j |\beta_j|^{\gamma} with 1 ≤ γ < 2 (including the L1 penalty but excluding the L2 penalty), are shown to induce sparsity in the model parameter β when used to regularize a convex loss function; the reason for the sparsity is that the penalty's derivative is singular at the origin, forcing certain model parameters to be set exactly to zero and yielding a sparse β vector. Among the Lγ (γ ≥ 1) penalty functions, the greatest sparsity in β is obtained with the L1 penalty, and the sparsity diminishes as γ approaches 2. Penalty functions J(\beta) = \sum_j |\beta_j|^{\gamma} with γ ≥ 2, which include the ridge or L2 penalty, do not produce sparse solutions, because their derivatives vanish smoothly at the origin and thus never force a coefficient exactly to zero.
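To make the thresholding contrast concrete, here is a minimal sketch (our own illustration; the function names are ours) that evaluates the bridge penalty and compares the one-dimensional shrinkage operators induced by the L1 and L2 penalties: soft thresholding sets small coefficients exactly to zero, whereas ridge shrinkage only rescales them.

```python
import numpy as np

def bridge_penalty(beta, gamma):
    """L_gamma (bridge) penalty J(beta) = sum_j |beta_j|^gamma."""
    return np.sum(np.abs(beta) ** gamma)

def prox_l1(b, lam):
    # Soft thresholding: exact zeros whenever |b| <= lam.
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def prox_l2(b, lam):
    # Ridge shrinkage: coefficients shrink toward zero but never reach it.
    return b / (1.0 + 2.0 * lam)

b = np.array([-1.5, -0.3, 0.1, 0.8, 2.0])
print(prox_l1(b, 0.5))  # small coefficients become exactly 0
print(prox_l2(b, 0.5))  # all coefficients merely scaled by 1/2
```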


The members of the Lγ family of penalty functions other than the bridge family are

J(\beta) = \sum_j |\beta_j|^{\gamma}, \qquad 0 \le \gamma < 1. \quad (2.18)

The L0 penalty function differs from all other norm penalty functions in that it only counts the number of nonzero values in the model parameter β. Optimizing a loss function regularized with the L0 penalty is NP-hard: finding the model with a fixed number of nonzero values in β requires evaluating up to 2^p different linear models, where p is the number of features. Moreover, the L0 penalty and the remaining members J(\beta) = \sum_j |\beta_j|^{\gamma}, 0 < γ < 1, are non-convex; when such penalty functions are used to regularize a convex loss function (Equation 2.10), a global minimum may not be found due to the existence of multiple local minima. All members of the family J(\beta) = \sum_j |\beta_j|^{\gamma}, 0 ≤ γ < 1, when used to regularize a loss function (Equation 2.10), impart sparsity to the model parameter β: the L0 penalty produces the sparsest β, and the remaining members produce sparser β than the L1 penalty. There is therefore interest in devising algorithms that can optimize such non-convex penalty functions. Friedman [48] proposed an algorithm that approximates the model parameter β along the entire regularization path (0 ≤ λ ≤ ∞) when either a convex (Lγ, 1 ≤ γ < ∞) or a non-convex (Lγ, 0 < γ < 1) penalty function is used to regularize a convex loss function.
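The combinatorial cost of the L0 penalty can be illustrated with a toy exhaustive search (a sketch of ours, assuming ordinary least squares as the loss; it fits C(p, k) models for a fixed k, and summing over all k gives the 2^p figure cited above):

```python
from itertools import combinations
import numpy as np

def best_subset_ols(X, y, k):
    """Exhaustive L0 search: best OLS model with exactly k nonzero coefficients.

    Evaluates C(p, k) candidate models; summed over all k this is 2^p fits,
    which is why L0-regularized estimation is NP-hard in general.
    """
    n, p = X.shape
    best_rss, best_subset = np.inf, None
    for subset in combinations(range(p), k):
        Xs = X[:, subset]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        resid = y - Xs @ beta
        rss = float(resid @ resid)
        if rss < best_rss:
            best_rss, best_subset = rss, subset
    return best_subset, best_rss

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = 3 * X[:, 0] - 2 * X[:, 5] + rng.normal(size=100)
print(best_subset_ols(X, y, k=2))  # expected: features (0, 5)
```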

Some Other Penalty Functions. We do not discuss other popular penalty functions in detail but acknowledge their usage in the literature. The Smoothly Clipped Absolute Deviation (SCAD) penalty is a non-convex function for which efficient algorithms have been proposed [44, 135]. The elastic-net penalty [134] combines the L1 and L2 penalty functions, inducing sparsity in the model parameters via the L1 penalty while controlling the amount of sparsity via the L2 penalty. The fused lasso penalty [113] was proposed as an extension of the L1 penalty for settings in which feature magnitudes are related through an a priori known feature ordering; in addition to the L1 penalty, it adds penalty terms that constrain neighboring coefficients in that ordering to take similar values.

This section primarily discussed linear models, which capture linear relationships between features and targets. Next, we discuss random forests, an algorithm that can model nonlinear relationships.


2.4 Feature Selection Using Random Forest

The random forest model has proven extremely popular in application, due to its excellent performance on a variety of problems, its ability to capture nonlinear relationships among input features, and the fact that it has been shown to be a universal approximator.

Next, we motivate random forest as a state-of-the-art supervised machine learning algorithm worthy of further research.

2.4.1 Why Use Random Forest?

The field of supervised machine learning would be simpler if we could find a single best supervised learning algorithm. Unfortunately, no such algorithm exists, a fact best elucidated by the 'No Free Lunch' theorems of Wolpert and Macready [126, 127]. According to these theorems, identifying the best supervised learning algorithm would require comparing algorithms over all possible datasets, averaged over which all algorithms perform equally well; the search for a universally best algorithm is therefore futile.

Instead, supervised learning algorithms are compared and analyzed on a fixed collection of datasets on the basis of their prediction error. When the prediction results are not significantly different, secondary properties, such as training and testing time and model interpretability, play a role in the choice of an algorithm. In certain cases, these secondary properties are more important than prediction error.

The current state-of-the-art supervised machine learning algorithms for modeling nonlinear data are the SVM, boosting, and the random forest; extensive studies show comparable accuracy among these algorithms. Our interest in random forest is due to two of its properties: implicit feature selection and computational complexity. Due to its use of decision trees, the random forest performs feature selection in a nonlinear setting. Additionally, the random forest has lower computational complexity for training than the SVM and boosting; we briefly compare the computational complexity of the SVM and the random forest in Appendix C.8.

Random Forest is a Popular Off-the-shelf Classifier. Like other popular supervised learning techniques, random forests have been extensively used in multiple domains, including bioinformatics and ecology; instead of providing an exhaustive list of studies, we refer the reader to the following notable ones: Saeys et al. [99] and Yang et al. [128] in bioinformatics, Watts and Lawrence [121] and Cutler et al. [33] in ecology, and Toole et al. [115] and Jun et al. [71] in general computing.

We can also quantify the general interest in random forests by their usage in a competitive setting. Kaggle (www.kaggle.com) is a platform where researchers compete on machine learning tasks provided by companies and other researchers. According to Jeremy Howard of Kaggle (http://strataconf.com/strata2012/public/schedule/detail/22658), "Ensembles of decision trees (often known as 'Random Forests') have been the most successful general-purpose algorithm in modern times. For instance, most Kaggle competitions have at least one top entry that heavily uses this approach. This algorithm is very simple to understand, and is fast and easy to apply." At the same hyperlink, Howard also notes the usefulness and popularity of L1 penalized algorithms, another focus of this thesis. Similarly, for the NIPS 2003 Feature Selection Challenge, a competition for feature selection algorithms, Guyon et al. [55] note random forests as one of the top performing algorithms.

This thesis addresses feature selection in random forests, but before we can discuss feature selection, we need to lay the groundwork by describing the ideas that led to their genesis: ensemble techniques and decision trees.

2.4.2 Model Combination in an Ensemble of Classifiers

In Section 2.1, we briefly discussed model combination, a procedure in which an ensemble of models, classifiers, or experts is used to predict the target values or labels of unseen data. In order to discuss model combination in depth, we first present a particular type of base classifier, the decision tree, which is the base classifier used in random forests and many other ensemble approaches.

2.4.2.1 Decision Trees: A Type Of Base Classifier

Decision trees are trained by recursively partitioning the feature space and can therefore model nonlinear data. Constructing an optimal decision tree is NP-complete [65], so greedy algorithms such as Classification and Regression Trees (CART) [23] and C4.5 [90] are used in practice. The decision trees we consider are binary decision trees.

A decision tree is trained as follows. At each node, the algorithm searches the feature space for the feature that best separates/splits the data according to an information criterion; information criteria include the Gini impurity, used for CART trees [23, 59, 109], and information gain, used for C4.5 [90]. Decision trees are typically grown fully, until each leaf node contains only a small number of examples; pruning, guided by a held-out set or by cross-validation, then removes nodes that provide no additional information, preventing overfitting.
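A minimal sketch of the per-node split search (our own simplified illustration, assuming binary labels and an exhaustive scan over observed thresholds) follows:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of binary labels."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)
    return 2.0 * p * (1.0 - p)

def best_split(X, y):
    """Exhaustively search every feature and threshold for the split that
    minimizes the weighted Gini impurity of the two child nodes."""
    n, p = X.shape
    best = (np.inf, None, None)  # (impurity, feature index, threshold)
    for j in range(p):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[0]:
                best = (score, j, t)
    return best

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X[:, 1] > 0.2).astype(int)   # class depends only on feature 1
print(best_split(X, y))           # should recover feature index 1
```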

Decision trees are popular because they are fairly interpretable. They are also versatile: they handle both classification and regression problems, and they accommodate different types of features, including continuous, discrete, and in general any kind of feature whose values can be compared.

2.4.2.2 Components of Model Combination

Model combination can roughly be segmented into the following two components: the method used

for training base classifiers and the method used for combining prediction results of base classifiers; we briefly

describe these two components.

Training Base Classifiers. Base classifiers can be trained either independently of one another or by using information from previously trained base classifiers to guide the training of subsequent ones. We illustrate the contrast with two popular methods: boosting [47, 100] and bagging [15].

Boosting is a generic method in which multiple base classifiers are repeatedly trained on different subsets of the training data, and the prediction of the ensemble is a weighted linear combination of the predictions of the base classifiers. AdaBoost [47, 100] is a boosting algorithm in which new base classifiers are trained by focusing on examples that are misclassified by the existing base classifiers; misclassified examples receive an increased weighting. Typically, the base classifier in AdaBoost is a weak classifier that is only slightly better than a random classifier; an example is the decision stump, a one-level decision tree. For more information on other boosting algorithms, refer to Hastie et al. [59].


Bagging [15], or bootstrap aggregating, is a generic method in which individual base classifiers, typically decision trees, are each trained on a different bootstrap sample of the original training dataset. The bootstrap [34, 36, 41] is a method for creating a new dataset from available data by repeatedly sampling the data, with or without replacement; its most popular form samples n examples with replacement from a training dataset of size n. For regression, the prediction of the ensemble is the average of the predictions of the individual base classifiers; for classification, the predicted label for a test example is the class with the highest number of votes among the base classifiers.
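A minimal sketch of bagging for regression (our illustration; scikit-learn's DecisionTreeRegressor stands in for the base classifier, and the prediction is the average described above):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, ntree=50, seed=0):
    """Train ntree regression trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    trees = []
    n = len(y)
    for _ in range(ntree):
        idx = rng.integers(0, n, size=n)   # sample n rows with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    """Ensemble prediction: average over the individual trees."""
    return np.mean([t.predict(X) for t in trees], axis=0)
```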

Combining Predictions From Base Classifiers. The method of combining predictions across base

classifiers can be roughly segmented into weighted methods and meta-learning methods [96].

Weighted methods weight the predictions of the individual base classifiers. AdaBoost [47, 100] and stacked regression [17] are weighted methods: the final prediction they yield is a weighted average of the predictions of the base classifiers. In contrast, for bagging [15], the final prediction is an unweighted average of the predictions of the base classifiers.

Stacked generalization [125], or stacking, is a meta-learning method for combining predictions from base classifiers; the base classifiers may be obtained from different learning algorithms. In stacking, a meta-dataset is created from the predictions of the base classifiers, and a meta-classifier is then trained on this meta-dataset. By using a meta-classifier, stacking can identify the reliability of the individual base classifiers. For more information on other weighted and meta-learning methods, refer to Rokach [96].

2.4.2.3 The Role of Resampling in the Evolution of Ensemble Methods

Now we briefly describe some studies on ensemble training that motivated Breiman to propose random

forest [19].

In 1994, Breiman [15] proposed the bagging algorithm, which uses an ensemble of decision trees trained on bootstrapped datasets; bootstrapping samples the space of available training examples. In 1996, Freund and Schapire [47, 100] proposed AdaBoost, a boosting algorithm in which base classifiers are trained so as to minimize the training misclassification, with misclassified examples receiving an increased penalization cost. In 1998, Breiman [18] characterized AdaBoost as a type of 'adaptive resampling and combining', or 'arcing', algorithm, in which the examples used to train a base classifier are sampled according to their misclassification rate. In 1997, Amit and Geman [3] proposed growing an ensemble of trees and, at each node of each tree, selecting the best feature from a subset of the available features; their work focused on character recognition, and they limited the depth of the trees. In 1998, Ho [60] proposed the random subspace method for creating an ensemble of decision trees, in which each tree is trained on a randomly selected subset of features from the available feature set. In 2000, Dietterich [39] compared bagging with a randomized version of C4.5 trees to bagging with standard C4.5 trees and to boosting; in the randomized version of C4.5, at each node of a decision tree, the feature used to split the data was selected at random from among the top 20 candidates.

A common theme among these ensemble methods is the use of resampling, either in the feature space,

example space, or both.

2.4.3 The Random Forest Algorithm

The random forest algorithm was proposed by Breiman [19] in 2001. Random forests are constructed from an ensemble of decision trees trained by resampling both the feature space and the example space. Like bagging, each decision tree is constructed on a bootstrapped dataset, but unlike bagging, at each node of a tree a random subset of features is searched for the best feature; the difference between bagging trees and random forest trees is illustrated in Figure 2.3. The prediction of the random forest ensemble is as in other ensemble methods: an average of the trees' predictions for regression, or a vote among the trees for classification.

Parameters of the Random Forest Algorithm. Bagging [15] has a single parameter, ntree, which controls the number of trees in the ensemble. The random forest algorithm, in contrast, has two parameters: mtry controls the number of randomly chosen features searched for the best feature at each node of the individual decision trees, and ntree is the number of decision trees to include in the forest.
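For concreteness, these two parameters map directly onto the arguments of common implementations; in scikit-learn, for example (one possible implementation, not the one used in this thesis), ntree corresponds to n_estimators and mtry to max_features:

```python
from sklearn.ensemble import RandomForestClassifier

# ntree = 500 trees; mtry = sqrt(p) features searched at each node,
# a common default for classification tasks.
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                oob_score=True,  # track out-of-bag error
                                random_state=0)
```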


Figure 2.3: We depict the search for the best feature at each node within bagged trees and random forest trees; the best feature is the feature with the highest information criterion value. For trees in bagging, the search considers all p features; for trees in a random forest, the search considers mtry < p features, where p is the number of features.

We now discuss the theory behind random forests and examine the effects of the ntree and mtry parameters on the generalization error (the probability of prediction error over the entire distribution of the data).

2.4.3.1 A Theoretical Motivation for Random Forest

Breiman, in his paper on random forests [19], gave a theoretical motivation for ensemble methods, including random forests; we briefly describe his derivations in the following paragraphs.

First, we define a maximum margin classifier. For data containing two separable classes, a maximum margin classifier constructs a hyperplane that divides the classes and is furthest away from both of them. Popular supervised learning algorithms such as the SVM [31, 119] and boosting algorithms [98] are also maximum margin classifiers. Next, we consider a maximum margin classifier for k classes, which separates a given class from all other classes.

The margin function for a maximum margin classifier is defined, over the entire ensemble, as

mr(x, y) = P_\Theta\big(h(x, \Theta) = y\big) - \max_{j \neq y} P_\Theta\big(h(x, \Theta) = j\big),

where y is the target class for the input x, Θ is a classifier distribution, h(x, Θ) is a classifier, and j is a class label. The margin is the difference between the confidence of the ensemble that x belongs to the correct class (represented by y) and its confidence that x belongs to the most likely other class (represented by j). When the value of the margin function mr(x, y) is greater than zero, more base classifiers vote for the correct class than for any other class; a value less than zero implies that more base classifiers vote for some incorrect class than for the correct one. We assume that the base classifiers are generated from the same distribution Θ.

The expected prediction error rate is defined as PE = P_{x,y}(mr(x, y) < 0) over the entire space of data and classifiers; the prediction error is the probability that the margin function is negative.

For a random variable m with finite mean E(m) and finite variance Var(m), the Chebyshev inequality states that P(|m − E(m)| ≥ t) ≤ Var(m)/t²; it implies that, given a large number of values, most values lie close to the mean. Setting t = E(m) in the Chebyshev inequality gives P(|m − E(m)| ≥ E(m)) ≤ Var(m)/E(m)², which can then be used to bound the probability that the margin function is negative (the probability of prediction error):

PE = P_{x,y}(mr(x, y) < 0) \le \frac{Var(mr)}{(E_{x,y}(mr))^2}. \quad (2.19)

Furthermore, Breiman bounds the variance of the margin function as Var(mr) ≤ ρ(1 − (E_{x,y}(mr))²), where ρ is the correlation among classifiers in the ensemble. Thus, the prediction error is bounded from above as

PE \le \frac{\rho\,\big(1 - (E_{x,y}(mr))^2\big)}{(E_{x,y}(mr))^2}. \quad (2.20)

Though the bound on the prediction error PE is loose, it motivates ensemble techniques that use classifiers sampled from the same classifier distribution. Based on this bound, Breiman argues that an ensemble should reduce the correlation ρ among the classifiers and increase the expected value E_{x,y}(mr) of the margin.

We briefly describe the effects of the random forest parameters on the prediction error PE. The expected value E_{x,y}(mr) of the margin should converge to a fixed value as the number of trees ntree goes to infinity. In contrast, mtry is responsible for reducing the correlation ρ among the classifiers: a large value of mtry creates correlated classifiers, whereas a very small value of mtry may not suffice to model the underlying distribution.


Breiman [15, 16, 19] notes that decision trees are unstable classifiers, where unstable classifiers are defined as those that produce vastly different models under a small change in the training data. In contrast, the Support Vector Machine (SVM) is a stable classifier. One reason to use unstable classifiers in an ensemble is that they yield less correlated base classifiers than a stable classifier like the SVM.

Studies on the Theory Behind Random Forest. Now, we briefly discuss some past studies trying

to explain the random forest algorithm.

Breiman [20] showed that random forests can be cast as a kernel-based algorithm that separates two classes, where the kernel is a nonlinear function; larger trees are shown to produce better kernels and thus better accuracy.

Breiman [22] proposed that, for random forests, the optimal value of mtry does not depend on the sample size but on the numbers of strong and weak features present in the data, where strong features are those with a high probability of being selected for splitting the nodes of a decision tree, and conversely for weak features. Furthermore, mtry is shown to be independent of ntree. Biau [8] showed that the convergence rate of random forests depends only on the number of strong features in the model and not on the number of noisy features present.

Lin and Jeon [79] described the similarity between random forests and an adaptively weighted form of k-nearest neighbors. The k-nearest-neighbor method is a nonlinear model that predicts a test example in terms of the k training examples closest to it. Biau and Devroye [9] showed the consistency of bagging an ensemble of k-nearest-neighbor classifiers and discussed connections with random forests.

Next, we discuss how random forests are used for feature selection.

2.4.4 Implicit Feature Selection Using Random Forests

Inherently, a decision tree is a feature selecting classifier, because at each node of the tree only the best feature among the available features is used to partition the data, according to an information criterion. We assume that the decision trees are trained in a manner that depends on the target labels; there are versions of decision trees and random forests [32, 53] that are trained independently of the target labels.


Next, we discuss how random forests are utilized for selecting features in an explicit manner.

2.4.5 Explicit Feature Selection Using Random Forests

In order to perform explicit feature selection using random forests, we first define feature importance,

which is a quantification of the intrinsic value of a feature for prediction.

2.4.5.1 Feature Importance in Random Forests

Before discussing the details of Breiman’s feature importance measure, we must explain a detail of

random forests.

Out-of-bag (OOB) Data as a Substitute for Validation Data. In random forests, each tree is trained on a bootstrapped dataset. If we sample with replacement (that is, sample N times from N examples), the limiting probability that a given example appears in a bootstrapped dataset is 0.632.² This means that, on average, only 63.2% of the original examples are represented in each bootstrapped dataset. Thus, each tree is trained on about 63.2% of the original data, and the remaining 36.8% of the data can serve as a substitute for validation data.
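The 0.632 figure is easy to verify numerically; the following sketch (our own) draws bootstrap samples and measures the fraction of distinct examples they contain:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
draws = 1_000
frac_in_bag = np.mean(
    [len(np.unique(rng.integers(0, N, size=N))) / N for _ in range(draws)]
)
print(frac_in_bag)               # ~0.632
print(1 - (1 - 1 / N) ** N)      # analytic value -> 1 - e^{-1} ~ 0.6321
```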

The out-of-bag or OOB error rate over the entire forest, computed on the OOB examples, is

Classification: \quad err_{oob} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{N_{oob}} \left( \sum_{k=1}^{ntree} I[(x_i, y_i) \notin dataset_k] \; I(y_i \neq \hat{y}_i) \right), \quad (2.21)

Regression: \quad err_{oob} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{N_{oob}} \left( \sum_{k=1}^{ntree} I[(x_i, y_i) \notin dataset_k] \; (y_i - \hat{y}_i)^2 \right), \quad (2.22)

where N_{oob} = \sum_{k=1}^{ntree} I[(x_i, y_i) \notin dataset_k] is the number of trees in which example i is OOB, the ntree bootstrapped datasets dataset_k are used to train the ntree trees, N is the number of examples, I(·) is an indicator function, err_{oob} is the OOB error rate, x is an input, y is a target, and \hat{y} is a prediction; the expression in parentheses in Equation 2.21 counts the number of times example i was misclassified when it was OOB across the ntree trees.

² The 0.632 value is derived as follows: if there are N samples, the probability of choosing a particular sample in one draw is 1/N; if we draw N times, the probability of never choosing that sample is (1 − 1/N)^N ≈ e^{−1} ≈ 0.368 (for N ≥ 1000), so the probability of choosing it at least once in N draws is approximately 0.632.


The procedure for calculating the OOB error rate for classification trees, using Equation 2.21, is as follows: once a random forest model is trained, the OOB examples of each tree in the ensemble are classified and the votes are stored; the label predicted for an example is the class that receives the most votes over all trees in the ensemble for which the example was OOB. The OOB error rate is then the proportion of times the OOB examples were misclassified. A similar procedure is employed for regression-based random forests using Equation 2.22.

Feature Importance Calculated Using the OOB Error Rate. Breiman [19, 21] proposed using the OOB error rate to calculate feature importance as follows: (1) a single random forest model is trained, (2) the OOB error rate is calculated, (3) for an individual feature, the values of that feature in the OOB data are permuted and the OOB error rate is recalculated, and (4) the increase in the OOB error rate due to the permutation is the feature's importance value.

Mathematically, Breiman's feature importance for classification tasks is calculated as

importance of feature m = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{N_{oob}} \left( \sum_{k=1}^{ntree} I[(x_i, y_i) \notin dataset_k] \; I(y_i \neq \tilde{y}_i) - \sum_{k=1}^{ntree} I[(x_i, y_i) \notin dataset_k] \; I(y_i \neq \hat{y}_i) \right), \quad (2.23)

and Breiman's feature importance for regression tasks is calculated as

importance of feature m = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{N_{oob}} \left( \sum_{k=1}^{ntree} I[(x_i, y_i) \notin dataset_k] \; (y_i - \tilde{y}_i)^2 - \sum_{k=1}^{ntree} I[(x_i, y_i) \notin dataset_k] \; (y_i - \hat{y}_i)^2 \right), \quad (2.24)

where N_{oob} = \sum_{k=1}^{ntree} I[(x_i, y_i) \notin dataset_k] is the number of trees in which example i is OOB, the ntree bootstrapped datasets dataset_k are used to train the ntree trees, N is the number of examples, I(·) is an indicator function, x is an input, y is a target, \hat{y} is a prediction before permuting feature m, and \tilde{y} is a prediction after permuting feature m.
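A compact sketch of the permutation-importance procedure (our illustration; for brevity it evaluates the error increase on a held-out validation set rather than on each tree's OOB examples as in Equations 2.23 and 2.24):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def permutation_importance(model, X_val, y_val, rng):
    """Importance of feature m = increase in error after permuting column m.

    The thesis's Equation 2.23 computes this on each tree's out-of-bag
    examples; here, for brevity, we use a held-out validation set instead.
    """
    base_err = np.mean(model.predict(X_val) != y_val)
    importances = np.empty(X_val.shape[1])
    for m in range(X_val.shape[1]):
        X_perm = X_val.copy()
        X_perm[:, m] = rng.permutation(X_perm[:, m])  # break feature-target link
        importances[m] = np.mean(model.predict(X_perm) != y_val) - base_err
    return importances

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 2] > 0).astype(int)   # only features 0 and 2 matter
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(permutation_importance(model, X_val, y_val, rng))
```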


2.4.5.2 Feature Selection in Random Forests Using Feature Importance and Backward Elimination

The feature importance measure described in the previous section has been used to rank features and perform feature selection in random forests. The variable selection for random forests (varSelRF) algorithm [37], proposed for classifying gene datasets, iteratively eliminates the features determined to be least important: at each iteration it trains a model and eliminates the 20% of features with the lowest importance values; finally, the random forest model with the lowest OOB or validation error across iterations is selected.
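A minimal sketch of this backward-elimination loop (our illustration of the idea, not the varSelRF package itself; scikit-learn's impurity-based importances stand in for the permutation importance of Equation 2.23):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def backward_elimination(X, y, drop_frac=0.20, min_features=2):
    """varSelRF-style loop: repeatedly drop the least important 20% of
    features and keep the model with the lowest OOB error."""
    features = np.arange(X.shape[1])
    best = (np.inf, features)
    while len(features) >= min_features:
        rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                                    random_state=0).fit(X[:, features], y)
        oob_err = 1.0 - rf.oob_score_
        if oob_err < best[0]:
            best = (oob_err, features.copy())
        n_drop = max(1, int(drop_frac * len(features)))
        # impurity-based importance stands in for Eq. 2.23 here
        order = np.argsort(rf.feature_importances_)   # ascending importance
        features = features[order[n_drop:]]           # drop the weakest
    return best  # (lowest OOB error, selected feature indices)
```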

This concludes our review of linear models and random forests. In the next chapter, we propose a

method for improving the accuracy of L1 penalized algorithms.

Chapter 3

Single-Step Estimation and Feature Selection using L1 Penalized Linear Models

This chapter was accepted and presented at Feature Selection in Data Mining ‘10 as “Increasing Fea-

ture Selection Accuracy for L1 regularized linear models in Large Datasets” [68]. Other than some minor

edits, the chapter is reproduced as-is from the paper. The next chapter “Iterative Training and Feature

Selection Using L1 Penalized Linear Models” is an extension to this chapter.

3.1 Introduction

Feature selection using the L1 penalty (also referred to as the 1-norm or lasso penalty) has been shown to perform well when spurious features are mixed with relevant features; this property has been extensively discussed in Efron et al. [42], Tibshirani [112], and Zhu et al. [131]. In this paper, we focus on feature selection using the L1 penalty for classification, addressing open problems related to feature selection accuracy and large datasets. The paper is organized as follows: Section 3.2 presents motivation and background, primarily focusing on the fact that, asymptotically, L1 penalty based methods might include spurious features. Based on the work in Zou [133], we show in Section 3.3 that random sampling can find a set of weights that improves accuracy over the unweighted L1 penalty methods that are normally used. In Section 3.4, we show results on two different classification algorithms and compare the weighted method proposed in Zou [132] with the random sampling method described in our paper. Our method differs from Zou's method in that it hinges on random sampling to find the weight vector instead of using the L2 penalty. The proposed method is shown to give a significant improvement in accuracy over a number of datasets.

The contributions of our work are as follows: we show that a fast pre-processing step can be used to increase the accuracy of L1 regularized models and is a good fit when the number of examples is large; and we connect theoretical results from Rocha et al. [94] showing the viability of our method for various L1 penalized algorithms, supported by empirical results.

3.2 Background Information and Motivation

Consider the following setup: information about n examples, each with p dimensions, is represented in an n × p design matrix X, with y ∈ R^n representing target values/labels and β ∈ R^p representing the set of model parameters to be estimated. In this paper, we consider classification-based linear models with a convex loss function and a penalty term (a regularizer). A regularized formulation that generally describes many machine learning algorithms is

\min_{\beta} \; L(X, y, \beta) + \lambda J(\beta), \quad (3.1)

where L(X, y, β) is a loss function, J(β) is a penalty function, and λ ≥ 0.

The loss L(X, y, β) may represent various loss functions, including the hinge loss for classification-based Support Vector Machines (SVM) and the squared error loss for regression. Popular forms of the penalty function J(β) apply the L2 or L1 norm to β and are termed the ridge and lasso penalties, respectively, in the literature (refer to Tibshirani [112]).

3.2.1 Asymptotic properties of L1 penalty

Many papers, including Tibshirani [112], Efron et al. [42], and Zhu et al. [131], discuss the merits of the L1 penalty. The L1 penalty has been shown to be efficient in producing sparse models (models with many of the β's set to 0), and this feature selecting ability makes it robust against noisy features. In addition, the L1 penalty is convex, so when it is used in conjunction with a convex loss function, the resultant formulation has a global minimum.

As the L1 penalty is used for simultaneous feature selection and estimation, a topic of interest is whether correct feature selection holds as n → ∞, where n is the number of examples. Intuitively, given enough samples, the estimated parameters β_n should approach the true parameters β_0.

Assume that the data are generated as

y = X\beta_0 + \varepsilon, \quad (3.2)

where ε is zero-mean Gaussian noise and β_0 is the true generating model parameter vector; β_k^j denotes the jth component of β_k. Let A_0 = {j | β_0^j ≠ 0} be the true model and A_n = {j | β_n^j ≠ 0} be the model found with n examples. For consistency in feature selection, we need lim_{n→∞} P(A_n = A_0) = 1; that is, we find the correct set of features A_0 asymptotically. Zou [133] showed that the lasso estimator is consistent in estimation (β_n → β_0) but can be inconsistent as a feature selecting estimator in the presence of correlated noisy features.

3.2.1.1 Hybrid SVM

Zou [133] showed that adaptive lasso regression, defined as

Adaptive Lasso Regression: \min_{\beta} \|y - X\beta\|^2 + \lambda \sum_j W_j |\beta_j| \quad \text{s.t.} \quad W_j = |\beta_j^{(OLS)}|^{-\gamma}, \; \gamma > 0, \quad (3.3)

can be used for simultaneous feature selection and for creating more accurate models than the standard L1 penalty. The main difference between the adaptive lasso penalty and the standard lasso penalty is the use of the user-defined weights W_j. Here β^{(OLS)} denotes the estimates found via least squares regression.

In Zou [132], the same properties were applied in the case of classification; the resultant algorithm is known as the 'Improved 1-norm SVM' or 'Hybrid SVM' and is defined as

Improved 1-norm SVM: \min_{\beta, \beta_0} \sum_i [1 - y_i(x_{:,i}\beta + \beta_0)]_+ + \lambda \sum_j W_j |\beta_j|, \quad (3.4)

where W_j = |\beta_j^{(l2)}|^{-\gamma}, \gamma > 0, and

\beta^{(l2)} = \arg\min_{\beta, \beta_0} \sum_i [1 - y_i(x_{:,i}\beta + \beta_0)]_+ + \lambda_2 \sum_j \|\beta_j\|_2^2;

{x_{:,i}, y_i} represents an example, λ and λ_2 are regularization parameters, and v_+ = max(v, 0).

In this paper, we also discuss an L1 penalized algorithm with the squared hinge loss function, defined as

Improved SVM2: \min_{\beta, \beta_0} \sum_i [1 - y_i(x_{:,i}\beta + \beta_0)]_+^2 + \lambda \sum_j W_j |\beta_j|, \quad (3.5)

where W_j = |\beta_j^{(l2)}|^{-\gamma}, \gamma > 0, and

\beta^{(l2)} = \arg\min_{\beta, \beta_0} \sum_i [1 - y_i(x_{:,i}\beta + \beta_0)]_+^2 + \lambda_2 \sum_j \|\beta_j\|_2^2;

{x_{:,i}, y_i} represents an example, λ and λ_2 are regularization parameters, and v_+ = max(v, 0).

For the adaptive lasso penalty, the formulations in Equation 3.3 and Equation 3.4 are convex and require almost no modification to standard L1 penalized algorithms; refer to Zou [133] for the modifications that are needed.

Intuitively, the weights found via the L2 penalty are inversely proportional to the magnitudes of the true model parameters β_0 in Equation 3.2. If the weight for a given feature is small (i.e., the true magnitude is high), then the adaptive lasso penalizes that feature less than a feature with a larger weight, thereby encouraging that feature to take a higher magnitude in the adaptive L1 model; the reverse holds for noisy features.
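In practice, a weighted lasso can be solved with any standard lasso solver by rescaling each column of X by 1/W_j and rescaling the resulting coefficients back. The following sketch (ours; scikit-learn's Lasso and a regression loss stand in for the formulations above, and the small constant guarding against zero OLS coefficients is our addition) illustrates the idea:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def adaptive_lasso(X, y, gamma=1.0, lam=0.1):
    """Weighted (adaptive) lasso via column rescaling.

    Minimizing ||y - X b||^2 + lam * sum_j W_j |b_j| is equivalent to a
    standard lasso on X_j / W_j, with coefficients rescaled back by 1/W_j.
    Here W_j = |b_j^{OLS}|^{-gamma}, as in Equation 3.3; note scikit-learn's
    Lasso scales the squared loss by 1/(2n), so lam plays the role of lambda
    only up to that constant.
    """
    ols = LinearRegression().fit(X, y)
    W = (np.abs(ols.coef_) + 1e-12) ** (-gamma)  # guard against zero OLS coefs
    X_scaled = X / W                             # column-wise rescaling
    lasso = Lasso(alpha=lam).fit(X_scaled, y)
    return lasso.coef_ / W                       # map back to original scale
```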

3.2.2 Motivation for our Method

The weighted lasso penalty depends on obtaining suitable weights W. Zou [132, 133] shows that ordinary least squares estimates and estimates from an SVM with the L2 norm penalty can be used to find the weights, as shown in Equation 3.3 and Equation 3.4. In our paper, we instead obtain these weights via feature selection on randomized subsets of the training data. If the resulting accuracy is higher than in the unweighted case, the features have been appropriately (and correctly) weighted.

One of our goals was to see whether the results of Zou [133] translate to other linear formulations, so we also experimented with the weighted SVM2 formulation shown in Equation 3.5 (the unweighted formulation is shown in Equation 3.8). The SVM2 formulation is referred to in the literature as the quadratic loss SVM or 2-norm SVM (but there with the L2 penalty; refer to [103]); here it is the squared hinge loss coupled with the L1 penalty.


3.2.2.1 Efficient Algorithms to solve formulations with L1 norm penalty

Efron et al. [42] introduced an efficient algorithm for lasso regression called Least Angle Regression (LARS) that can solve for all values of λ, that is, 0 ≤ λ ≤ ∞. Rosset and Zhu [97] document a generic algorithm, of which LARS is a special case, that can be used for all doubly differentiable losses with the L1 penalty. For our experiments, we use specific linear SVM based formulations for which entire regularization paths can be constructed.

The penalized formulation of the 1-norm SVM of Zhu et al. [131] is

1-norm SVM: \min_{\beta, \beta_0} \sum_i [1 - y_i(x_{:,i}\beta + \beta_0)]_+ + \lambda \sum_j |\beta_j|, \quad (3.6)

Equivalent to Equation 3.6: \min_{\beta, \beta_0} \|\beta\|_1 + C \sum_i \xi_i, \quad \text{s.t.} \quad y_i(x_{:,i}\beta + \beta_0) \ge 1 - \xi_i, \; \xi_i \ge 0. \quad (3.7)

Zhu et al. [131] showed a simple piecewise algorithm that solves the 1-norm SVM for 0 ≤ λ ≤ ∞. As the loss and the penalty function are both only singly differentiable, a piecewise path cannot be constructed as efficiently as in LARS, but linear programming can be employed to calculate the step size. Equation 3.7 is equivalent to Equation 3.6 and resembles the standard SVM formulation seen in the literature, with the L1 norm of β in place of the L2 norm.

The penalized formulation for the squared hinge loss (or quadratic loss) SVM with the L1 penalty is

SVM2: \min_{\beta, \beta_0} \sum_i [1 - y_i(x_{:,i}\beta + \beta_0)]_+^2 + \lambda \sum_j |\beta_j|, \qquad v_+ = \max(v, 0). \quad (3.8)

As the loss function is doubly differentiable, an efficient piecewise algorithm can be constructed, via the method described in [97], to solve for 0 ≤ λ ≤ ∞. Our interest in such piecewise algorithms is to understand whether better (entire) regularization paths are created for the weighted L1 penalty.

3.3 Randomized Sampling (RS) Method to Create Weight Vector

Our randomized sampling method depends on small random subsets of the training data. We assume that each subset is small enough that it is computationally cheap to process in a reasonable time. Such randomized sampling is performed multiple times.

3.3.1 Randomized Sampling (RS) Method

Our randomized sampling algorithm is described in Algorithm 1 below. It works as follows: we choose a subset of m examples out of the n presented examples, with m << n, and train an L1 penalty based algorithm (e.g., 1-norm SVM [131], SVM2, etc.) on the subset to find a set of relevant features, noting the features found in that particular experiment. After many such randomized experiments, the counts of the number of times each feature was found are summed and normalized, yielding a count vector denoted V. This count vector is then inverted to give the weights for the weighted version of the algorithm; i.e., the weights used in the weighted formulations are W = 1/V. Intuitively, if a feature is important and is found many times by the RS method, its weight is small and it is penalized less, encouraging a higher magnitude for that feature.

Algorithm 1: Randomized Sampling (RS) Method

Input: n examples, each with p features; K randomized experiments; B (block), the number of examples used to train the model in each randomized experiment.

Output: Count vector V (of size p) recording the number of times each feature was selected across the K randomized experiments.

Divide the n examples into K randomized sets, each of size B, denoted Ntrn_i, i = 1 ... K, and let V ← 0.

for i = 1 ... K do
    Get Ntrn_i; construct the Ntst_i and Nval_i sets.
    Train Model_i = L1_Algorithm(Ntrn_i, Ntst_i, Nval_i).
    S_i = features selected in Model_i via the validation data.
    V ← V + {x ∈ R^p | x_j = 1 if j ∈ S_i, else x_j = 0}.
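A runnable sketch of Algorithm 1 (our illustration; an L1 penalized logistic regression stands in for the 1-norm SVM / SVM2 path algorithms, and the validation-based feature selection within each block is simplified to the features with nonzero coefficients):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def rs_weights(X, y, K=20, B=100, seed=0):
    """Algorithm 1 sketch: count feature selections over K random subsets,
    then return adaptive-lasso weights W = 1 / V."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    V = np.zeros(p)
    for _ in range(K):
        idx = rng.choice(n, size=B, replace=False)     # one random block
        model = LogisticRegression(penalty="l1", C=1.0,
                                   solver="liblinear").fit(X[idx], y[idx])
        V += (model.coef_.ravel() != 0).astype(float)  # selected features
    V /= K                                             # normalize counts
    with np.errstate(divide="ignore"):
        W = 1.0 / V     # never-selected features get infinite penalty weight
    return W
```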


3.3.2 Consistency of Choosing a Set of Features from Randomized Sampling (RS) Experiments

Our method depends on finding a set of relevant features and their counts for a given dataset via the RS method. Our experimental results are restricted to the weighted and unweighted formulations of SVM2 and the 1-norm SVM, but our theoretical results apply to all linear models with a twice differentiable loss function and the L1 penalty. We next state results on the asymptotic consistency and normality properties, in n (the number of examples), of L1 penalized algorithms, which help in understanding the consistency of our method.

Lemma 1. This result is derived from Theorem 5 in Rocha et al. [94]. Suppose the loss function L(X, y, β) in Equation 3.1 is bounded, unique, and convex, with E|L(X, y, β)| < ∞, and that L(X, y, β) is twice differentiable with a positive definite Hessian matrix H. Consider the following consistency condition, defined for the L1 penalty when using the formulation in Equation 3.1 and the true model in Equation 3.2:

\| H_{A^c,A} \, [H_{A,A} - H_{A,\beta_0} H_{\beta_0,\beta_0}^{-1} H_{\beta_0,A}]^{-1} \, \mathrm{sign}(\beta_A) \|_\infty \le 1, \qquad \text{where } H_{x,y} = \frac{d^2 L(X, y, \beta)}{dx\,dy}, \quad (3.9)

A^c = {j ∈ 1...p | β_j = 0}, A = {j ∈ 1...p | β_j ≠ 0}, and β_0 is an intercept.

• If λ_n is a sequence of non-negative real numbers such that λ_n n^{−1} → 0 and λ_n n^{−(1+c)/2} → λ > 0 for some 0 < c < 1/2 as n → ∞, and the condition in Equation 3.9 is satisfied, then P[sign(β_n(λ_n)) = sign(β)] ≥ 1 − exp[−n^c], where β_n is the parameter vector found with n examples.

• If the condition in Equation 3.9 is not satisfied, then for any sequence of non-negative numbers λ_n,

\lim_{n \to \infty} P[\mathrm{sign}(\beta_n(\lambda_n)) = \mathrm{sign}(\beta)] < 1. \quad (3.10)

The probability of choosing incorrect variables is bounded by exp(−D n^c), where D is a positive constant (shown in the proof of Theorem 5 of [94]).

If the condition in Equation 3.9 is fulfilled, the interactions between relevant and noisy features are distinguishable and the L1 penalty can correctly identify the signs in β. If the condition in Equation 3.9 is not fulfilled, then noisy features will be included in the model with nonzero probability, that is, the probability of correct selection is bounded away from 1. Also, note that the above conditions apply to the 1-norm SVM, as shown in [94].

Lemma 2. Let b denote the size of a subset and assume b → ∞. From Lemma 1, when the consistency condition in Equation 3.9 is satisfied, P[sign(β_b(λ_b)) = sign(β)] ≥ 1 − exp[−b^c] ≈ 1, where β_b and λ_b are the parameters for a subset of size b. For k such subsets, the count vector V depicted in Algorithm 1 is bounded by k(1 − exp[−b^c]) ≈ k as b → ∞. When the condition in Equation 3.9 is not satisfied, the probability of choosing noisy variables in a subset is upper-bounded by exp(−D b^c), and for k subsets, sum(V_j) ≤ k · exp(−D b^c), so V_j ≈ 0 as b → ∞ (where the indices j range over the noisy variables). Thus, with high probability, the noisy variables have a low count in V and a large weight in W, and are therefore penalized heavily.

q    p     2-norm SVM2    1-norm SVM2    Hybrid (Zou)   RS(20%)       RS(30%)       RS(40%)
2    14    9.64 ± 2.30    7.92 ± 1.89    7.88 ± 2.09    7.69 ± 1.71   7.67 ± 1.66   7.68 ± 1.69
4    27    10.90 ± 2.41   8.01 ± 1.84    7.88 ± 2.09    7.73 ± 1.59   7.73 ± 1.59   7.71 ± 1.60
6    44    12.17 ± 2.64   7.93 ± 1.79    7.79 ± 1.69    7.64 ± 1.60   7.64 ± 1.59   7.64 ± 1.52
8    65    13.45 ± 2.96   8.13 ± 2.10    7.87 ± 1.84    7.82 ± 1.85   7.83 ± 1.85   7.81 ± 1.83
12   119   16.91 ± 3.24   8.11 ± 1.95    8.05 ± 1.94    7.78 ± 1.71   7.78 ± 1.70   7.76 ± 1.66
16   189   17.93 ± 3.32   7.87 ± 1.78    8.29 ± 2.41    7.66 ± 1.57   7.66 ± 1.63   7.66 ± 1.63
20   275   19.31 ± 3.32   8.06 ± 2.14    8.04 ± 2.01    7.69 ± 1.81   7.74 ± 1.89   7.77 ± 1.87

Table 3.1: Mean ± Std. Deviation of Error Rates in % on Models 1 & 4 by SVM2. The best algorithms are in bold.

Random Sampling is Subsampling. To better quantify our random sampling method, we explain it in terms of subsampling (refer to Politis et al. [87]). Subsampling is a method of sampling m examples from n total examples with m < n, unlike the bootstrap, which samples n times with replacement from n examples. Let the estimator θ̂ be a general function of i.i.d. data generated from some probability distribution P; in our case of feature selection, this estimator is the feature set. We are interested in finding an estimator, and its confidence region, based on the distribution P of the data, and we denote it θ(P). When n is large, we can construct an empirical estimator θ̂_n of θ(P) as θ̂_n = θ(P_n), where P_n is the empirical distribution; that is, we estimate the true feature set empirically. We define a root of the form τ_n(θ̂_n − θ), where τ_n is some sequence (such as √n or n) increasing with n (the number of examples); the root measures the difference between the empirical estimator θ̂_n and the true estimator θ. We define J_n(P) to be the sampling distribution of τ_n(θ̂_n − θ(P)) based on a sample of size n from P, with CDF

J_n(x, P) = \mathrm{Prob}_P\{\tau_n(\hat{\theta}_n - \theta) \le x\}, \qquad x \in \mathbb{R}. \quad (3.11)

q    p     2-norm SVM     1-norm SVM     Hybrid (Zou)   RS(20%)       RS(30%)       RS(40%)
2    14    8.74 ± 1.30    7.64 ± 0.09    7.64 ± 1.02    7.63 ± 1.02   7.64 ± 1.01   7.53 ± 0.09
4    27    9.76 ± 1.75    7.85 ± 1.14    7.95 ± 1.34    7.83 ± 1.28   7.79 ± 1.24   7.69 ± 1.19
6    44    10.57 ± 1.95   7.85 ± 1.01    7.92 ± 1.12    7.79 ± 1.12   7.77 ± 1.18   7.69 ± 1.23
8    65    11.47 ± 2.31   7.81 ± 0.99    7.99 ± 1.36    7.75 ± 1.13   7.74 ± 1.15   7.63 ± 1.09
12   119   13.27 ± 2.48   7.91 ± 0.98    8.04 ± 1.35    7.77 ± 1.16   7.82 ± 1.21   7.63 ± 1.00
16   189   15.58 ± 2.94   7.94 ± 1.15    7.87 ± 1.21    7.74 ± 1.31   7.75 ± 1.23   7.64 ± 1.14
20   275   17.14 ± 2.96   7.90 ± 1.00    7.85 ± 1.11    7.77 ± 1.20   7.80 ± 1.27   7.69 ± 1.19

Table 3.2: Mean ± Std. Deviation of Error Rates in % on Models 1 & 4 by 1-norm SVM. The best algorithms are in bold.

Lemma 3. From [87], for i.i.d. data there is a limiting law J(P) such that J_n(P) converges weakly (in probability) to J(P) and τ_b(θ̂_n − θ) → 0 as n → ∞, under the conditions τ_b/τ_n → 0, b → ∞, and b/n → 0, where b is the number of examples in a subsample experiment and n is the total number of available examples.

Lemma 3 has remarkably weak conditions for subsampling: it requires only that the root has some limiting distribution and that the subsample size b is not too large relative to n (while still going to infinity). In our case, the subsets have size b → ∞ with b << n, and with estimation rates τ_n ∝ n^c and τ_b ∝ b^c for 0 < c ≤ 1, we have τ_b/τ_n → 0. For the RS method, we create, for each experiment, a vector whose entry for a feature is nonzero if that feature was found in that experiment; θ̂_n is the sample mean of the weights over n such RS experiments, and its mean converges to θ(P) (by Lemma 2). Thus, estimating the true feature set on the basis of random sampling of data subsets is weakly convergent. Zou [133] used the weights of a root-n-consistent estimator (from the L2 penalty) but mentions that the conditions can be weakened further: if there is an a_n such that a_n → ∞ and a_n(θ̂ − θ) = O(1), then such an estimator can also be used. By Lemma 3, our RS estimator is one such consistent estimator and can therefore be used as a valid estimator with the weighted lasso penalty.

3.4 Algorithms and Experiments

In this paper, we limit ourselves to an empirical study of data block sizes for the RS estimator. We replicate the experiments from 'An Improved 1-norm SVM for Simultaneous Classification and Variable Selection' by Zou [132] and report results for the 1-norm SVM and SVM2.

Method for Choosing Weights (for Hybrid and RS) and Validation Data (for RS). For the Hybrid SVM, we find the optimal weights via the L2 penalty using the method described by Zou [132]: we first find the best SVM (or SVM2) model parameters β^(l2) with the L2 penalty via a parametric search over costs C = {0.1, 0.5, 1, 2, 5, 10}, then create entire piecewise paths for weight values |β^(l2)|^{−γ}, γ = {1, 2, 4}, choose the best performing model on validation data, and report results on the test dataset. How we chose the training set for the RS method is described in the individual experiments. The RS experiments need validation data to choose the relevant features for each RS training set. We do the following: if n is the size of the training set and we choose m of those examples for the current RS training set, we use the left-out n − m examples as validation data for choosing the best features from the piecewise paths generated by the L1 algorithm on the m examples. If a held-out validation set is available, we use that instead.

3.4.1 Synthetic Datasets

We simulate two kinds of synthetic datasets: one akin to the "orange data" described in [131], and another based on a Bernoulli distribution. The following notation is used in some of the tables: C and IC denote the mean numbers of correctly and incorrectly selected features, respectively. We report means and standard deviations because the median number of incorrectly selected features was 0 in many experiments. PPS stands for the probability of perfect selection, i.e., the probability of choosing exactly the correct feature set.


                     q = 6         q = 8         q = 12        q = 16        q = 20
                     p = 44        p = 65        p = 119       p = 189       p = 275
1-norm SVM2   IC     1.5 ± 2.59    1.42 ± 2.44   1.67 ± 3.4    1.58 ± 2.95   1.71 ± 3.52
              PPS    0.554         0.544         0.536         0.564         0.592
Hybrid SVM2   IC     1.05 ± 1.87   1.03 ± 1.79   1.19 ± 2.05   1.35 ± 2.51   1.13 ± 2.21
              PPS    0.596         0.598         0.554         0.576         0.596
RS(20%)       IC     0.65 ± 1.15   0.62 ± 1.15   0.8 ± 1.48    0.61 ± 1.17   0.54 ± 1.04
              PPS    0.636         0.646         0.600         0.686         0.666
RS(30%)       IC     0.69 ± 1.18   0.73 ± 1.15   0.70 ± 1.27   0.63 ± 1.25   0.56 ± 1.06
              PPS    0.626         0.604         0.626         0.666         0.662
RS(40%)       IC     0.62 ± 1.05   0.61 ± 1.01   0.68 ± 1.29   0.66 ± 1.25   0.55 ± 1.02
              PPS    0.644         0.636         0.650         0.668         0.672
RS(50%)       IC     0.67 ± 1.11   0.65 ± 1.19   0.69 ± 1.31   0.62 ± 1.36   0.59 ± 1.14
              PPS    0.628         0.628         0.630         0.670         0.670
*C (mean number of correct features) = 2 for all of the above experiments.

Table 3.3: Variable Selection Results on Models 1 & 4 using SVM2. The best algorithms are in bold.

Exp. Name   Correlation       Bayes   2-norm   1-norm   Hybrid   RS(20%)   RS(30%)   RS(40%)
Model 2     ρ = 0             6.04    9.77     8.14     7.46     7.51      7.53      7.55
            ρ = 0.5           4.35    7.74     6.43     5.96     5.97      5.86      5.86
Model 3     ρ = 0             8.48    11.04    9.79     9.54     9.46      9.46      9.45
            ρ = 0.5           7.03    8.49     9.51     8.45     8.17      8.17      8.20
Model 5     ρ = 0.5^|i−j|     6.88    31.31    9.32     8.5      8.6       8.56      8.22
*The range of std. deviations in accuracy for the above table was between 1.02 and 1.96.

Table 3.4: Mean Error Rates in % for Models 2, 3 & 5 using SVM2. The best algorithms are in bold.

Models 1 and 4 from [132]. The "orange data" has two classes, one inside the other like the core inside the skin of an orange. The first class consists of two independent standard normals x_1 and x_2. The second class also consists of two independent standard normals x_1 and x_2, but conditioned on 4.5 ≤ x_1² + x_2² ≤ 8. To simulate the effects of noise, there are q additional independent standard normals. The Bayes rule is 1 − 2·I(4.5 ≤ x_1² + x_2² ≤ 8), where I(·) is an indicator function, and the Bayes error is about 4%. Because the original space is not linearly separable, we use an enlarged dictionary D = {√2 x_j, √2 x_j x_k, x_j², for j, k = 1, 2, ..., 2 + q}. We have independent sets of 100 validation examples and 20,000 test examples. q is set to 2, 4, 6, 8, 12, 16, 20, and we report on 500 experiments.

For the RS method, block sizes were set to 20%, 30%, and 40% of the total training size, and we performed 10/(block fraction) experiments in total; i.e., for 20% blocks we generated 10/0.2 = 50 randomized training sets, each of size 0.2 × (total training data). The weight vector was created via the RS method described earlier and then used to train the weighted 1-norm SVM and SVM2 algorithms.

Error rates are shown in Table 3.1 and Table 3.2 for SVM2 and the 1-norm SVM, respectively; q denotes the number of noise features in the original space, and p the number of features in the new space induced by the dictionary D. The L2 algorithm, in the third column, shows increasing error rates as the number of noisy features increases. The L1 algorithm, in the fourth column, is much more robust to noise, and its error rates barely degrade. The Hybrid SVM usually performs better than the unweighted 1-norm SVM (with a couple of exceptions in Table 3.2). Across all block sizes, the RS method performs best. The feature selecting ability of the individual algorithms is depicted in Table 3.3 (1-norm SVM results are omitted for space; they were similar to those of SVM2). The probability of finding the best model is high for all the algorithms; Hybrid is better at this than the 1-norm SVM, and the RS method performs best.

Models 2, 3, and 5 from [132]. Models 2, 3, and 5 are simulated from the model y ∼ Bernoulli{p(u)}, where p(u) = exp(x^T β + β_0 + ε) / (1 + exp(x^T β + β_0 + ε)) and ε is a standard normal representing error. We create 100 training examples, 100 validation examples, and 20,000 test examples, and report on 500 randomized experiments.

Model 2 (Sparse Model): We set β_0 = 0 and β = {3, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 3}. The features x_1, ..., x_12 are standard normals, and experiments are done with the correlation between x_i and x_j set to ρ ∈ {0, 0.5}. The Bayes rule is to assign classes by 2·I(x_1 + x_6 + x_12 > 0) − 1.

Model 3 (Sparse Model): We use β_0 = 1 and β = {3, 2, 0, 0, 0, 0, 0, 0, 0}. The features x_1, ..., x_9 are standard normals, and experiments are done with the correlation between x_i and x_j set to ρ ∈ {0, 0.5^|i−j|}. The Bayes rule is to assign classes by 2·I(3x_1 + 2x_2 + 1 > 0) − 1.

Model 5 (Noisy features): We use β_0 = 1 and β = {3, 2, 0, 0, 0, 0, 0, 0, 0}. The features x_1, ..., x_9 are standard normals, and experiments are done with the correlation set to ρ = 0.5^|i−j|. We added 300 independent normal variables as noise features, for a total of 309 features.

For Models 2, 3, and 5, error rates are reported in Table 3.4 for SVM2 (results for the 1-norm SVM were similar and are omitted). Note that the weighted models are consistently better than both of their unweighted 1-norm and 2-norm counterparts, and the RS method has accuracy equal to or greater than that of the Hybrid version.

3.4.2 Real World Datasets

UCI datasets. Table 3.5 reports results on the Spam, WDBC, and Ionosphere datasets from the UCI repository [5]. For the WDBC and Ionosphere datasets, we split the data into 3 parts, with 2 parts used for training (and validation) and the remaining part for testing. For the Spam dataset, indicators for the test set (1536 examples) and training set can be obtained from http://www-stat.stanford.edu/~tibs/ElemStatLearn/. For our RS method, we generated smaller datasets from the training set as follows: if the training set size is N and the size of an individual RS set is K, then the number of datasets generated is 10 · N/K. The size of the RS training set is shown as Block in the table. For the Hybrid SVM, the best parameters γ and C are chosen as described earlier in Section 3.4. We report on 50 randomized experiments. Table 3.5 shows error rates for both SVM2 and the 1-norm SVM. Weighting via the Hybrid or RS method always increases accuracy over the unweighted case. Also, as seen on both the synthetic and real world datasets, the RS block size does not create much variability in the results.

Dataset      Algorithm (Block)   Without Weighting   RS Weighting    Hybrid SVM      2-norm SVM
WDBC         1-norm (100)        3.66 ± 1.17         2.79 ± 0.93     2.89 ± 0.79     4.05 ± 1.36
             1-norm (150)        3.66 ± 1.17         2.79 ± 0.90     3.16 ± 1.22     4.05 ± 1.36
             SVM2 (100)          3.55 ± 1.81         2.78 ± 1.03     2.73 ± 1.01     4.05 ± 1.36
             SVM2 (150)          3.55 ± 1.81         2.90 ± 1.15     2.91 ± 1.13     4.05 ± 1.36
SPAM         1-norm (200)        9.09 ± 0.878        8.18 ± 0.49     8.31 ± 0.61     7.06 ± 0.04
             1-norm (1000)       9.09 ± 0.878        7.53 ± 0.17     8.19 ± 0.73     7.06 ± 0.04
             SVM2 (200)          8.45 ± 3.43         7.38 ± 0.52     7.39 ± 0.30     7.06 ± 0.04
             SVM2 (1000)         8.45 ± 3.43         7.70 ± 2.73     7.48 ± 0.52     7.06 ± 0.04
Ionosphere   1-norm (50)         12.38 ± 2.04        11.52 ± 1.39    11.84 ± 1.38    13.03 ± 2.86
             1-norm (75)         12.38 ± 2.04        11.25 ± 1.98    11.56 ± 1.73    13.03 ± 2.86
             1-norm (100)        12.38 ± 2.04        11.29 ± 1.65    11.72 ± 1.23    13.03 ± 2.86
             SVM2 (50)           12.69 ± 2.82        11.43 ± 2.52    11.21 ± 2.66    13.03 ± 2.86
             SVM2 (75)           12.69 ± 2.82        11.61 ± 2.50    11.22 ± 2.58    13.03 ± 2.86
             SVM2 (100)          12.69 ± 2.82        11.37 ± 2.67    11.26 ± 2.68    13.03 ± 2.86
(Without-weighting and 2-norm SVM values are shared across the block sizes of the same algorithm.)

Table 3.5: Mean ± Std. Deviation of Error Rates on Real-World Datasets. The best performing L1 and L2 penalized algorithms are in bold.


Robotic Dataset. We now discuss a novel use of our subsampling method on robotic datasets [88]. These datasets were created by hand labeling 100 images obtained from running the DARPA LAGR robot in varied outdoor environments; the labeled classes are robot-traversable path and obstacle. The authors provide pre-extracted color histogram features for the dataset at [89]. We used a subset (12,000 examples) of the available data for each of the 100 frames; each example is 15-dimensional.

                    DS1A    DS2A    DS3A
Unweighted SVM2     8.92    4.36    1.24
Weighted SVM2       6.41    4.13    1.15

Table 3.6: Avg. Error Rate on Robotic Datasets from [88]. The best algorithms are in bold.

We set up our experiment as follows: for each frame F_i (i is the frame index), we divide the available examples (12K) into 8 folds (9.6K examples) for training, 1 fold (1.2K examples) for validation, and 1 fold (1.2K examples) for testing. We train/validate/test the unweighted SVM2 algorithm. For the weighted experiment, we train via our RS method by dividing the training data into 10 subsets (each of 960 examples) and finding the weight vector; this weight vector is then used to create the weighted SVM2 models, and we report results on the test set. Rather than discarding the weights when a new frame arrives, we reuse the weights found in frame F_i in frame F_{i+1}; i.e., if the weights in frame F_i are denoted W_i, then

W_{i+1} ← W_i + (weight results of RS for frame F_{i+1}).

This is one experimental environment in which creating L2 models for the entire dataset is not feasible, making the RS estimator a viable alternative; the ability to propagate feature importance between frames is a further advantage of the RS estimator. Table 3.6 shows overall results over the 100 frames for 3 datasets, each run 10 times, with the weights for the weighted SVM2 propagated between frames. As shown, error rates drop by 5-28% for the weighted SVM2 relative to the unweighted SVM2. The overhead of computing the weights via RS was less than 10% of the cost of computing a model on the entire training set.


3.5 Conclusions and Future work

A random sampling framework is presented and empirically shown to give effective feature weights for the lasso penalty, increasing both model accuracy and feature selection accuracy. The proposed framework is at least as effective as (and at times more effective than) the Hybrid SVM, with the added benefit of significantly lower computational cost. In addition, unlike the Hybrid SVM, which must see all the data at once, random sampling is shown to be effective in an online setting where predictions must be made from only partially available data (as in the robotics domain). In this paper, the framework is demonstrated on two types of linear classification algorithms, and theoretical support is presented showing its applicability, in general, to sparse algorithms.

Chapter 4

Iterative Training and Feature Selection Using L1 Penalized Linear Models

4.1 Introduction

L1 penalized linear models are extremely popular. These models use an L1 (or Lasso) penalty in

conjunction with a linear or quadratic loss function. Their popularity stems from the fact that the resultant

model implicitly performs feature selection: the weights for irrelevant features become small or zero [42, 112,

131].

Recent studies [130, 133] have shown that in the presence of correlated but irrelevant features, the

L1 penalty may select features inconsistently; feature inconsistency means that the L1 penalty selects both

relevant features and correlated but irrelevant features. Zou [133] showed that an alternative, the adaptive

lasso penalty works in cases in which the standard L1 penalty is inconsistent. The adaptive lasso penalized

algorithms use user-specified weights to penalize individual features and allow for consistent feature selection.

The adaptive lasso penalty is a single step algorithm that uses weights derived from an L2 or an unpenalized

algorithm. Adaptive lasso penalized algorithms have been proposed for both classification [132] and regression

[133]. We explain this algorithm in more detail shortly.

The adaptive lasso algorithm has been extended to operate over multiple iterations [24]. At each iteration, this algorithm, the multi-step adaptive lasso (hereafter, MSA), uses weights derived from an L1 penalized solution to bias the model for the next iteration. The accuracy results for MSA

are not promising [24].


For both the adaptive lasso and its multi-step variant, the MSA, weights play a central role in deter-

mining the sparsity and predictive abilities of the final model. In this study, we propose an alternative to

the MSA, which we call the subsampling/bootstrap adaptive lasso (SBA-LASSO), that uses weights obtained

from subsampled datasets instead of weights obtained from an L1 or L2 penalized algorithm. Our results

show that when used with both artificial and real world datasets, the SBA-LASSO significantly outperforms

existing single and multiple step algorithms in terms of accuracy for both regression and classification.

This paper is organized as follows: Section 4.1.1 presents the details of various single and multiple

step algorithms proposed for L1 penalized algorithms. Section 4.2 presents the details of our proposed

subsampling/bootstrap adaptive lasso (SBA-LASSO). Section 4.3 presents the details of our experimental

setup for various algorithms. Section 4.4 presents experimental results for both artificial and real world

datasets.

4.1.1 Existing Single and Multiple Step Estimation Algorithms for L1 Penalized Algorithms

We discuss existing single and multiple step algorithms proposed for L1 penalized algorithms. These

algorithms were developed as a means of reducing the number of irrelevant features incorrectly detected by

a standard L1 penalized algorithm.

To remind the reader, the general form of an L1 penalized algorithm is

\min_{\beta} L(X, y, \beta) + \lambda \sum_{j=1}^{p} |\beta_j|, \quad \lambda \geq 0, \qquad (4.1)

where L(X, y, β) is a loss function, X is a matrix of size n × p and y is a vector of size n representing

the training data {(xi, yi), i = 1, . . . , n, xi ∈ Rp, yi ∈ R}; β is the model parameter vector of size p that is

estimated for a given value of the regularization parameter λ.

4.1.1.1 Bootstrap Lasso

Bootstrap lasso [6] begins with the creation of multiple bootstrap datasets. For each of these datasets,

we find the L1 penalized model with the lowest error in validation data and obtain a list of features selected


by the model, i.e., features whose β coefficients are nonzero. The final bootstrap lasso model is trained using

an L1 penalized algorithm but only on those features that were selected by more than a given percentage of

L1 models. We use the notation τ for the threshold-selection parameter.
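To make the procedure concrete, the following is a minimal sketch of the bootstrap lasso under stated assumptions: scikit-learn's Lasso stands in for the L1 penalized algorithm, λ is fixed rather than selected on validation data as described above, and the helper name is illustrative.

    # A minimal sketch of bootstrap lasso; the fixed lambda and helper
    # name are illustrative assumptions, not the code used in this work.
    import numpy as np
    from sklearn.linear_model import Lasso

    def bootstrap_lasso(X, y, lam=0.1, n_boot=100, tau=0.9, seed=None):
        rng = np.random.default_rng(seed)
        n, p = X.shape
        counts = np.zeros(p)
        for _ in range(n_boot):
            idx = rng.choice(n, size=n, replace=True)      # bootstrap resample
            coef = Lasso(alpha=lam).fit(X[idx], y[idx]).coef_
            counts += (coef != 0)                          # tally selected features
        keep = (counts / n_boot) > tau                     # selection-rate threshold
        beta = np.zeros(p)
        if keep.any():                                     # refit on surviving features
            beta[keep] = Lasso(alpha=lam).fit(X[:, keep], y).coef_
        return beta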

The estimation algorithms we describe next use a weighted form of the lasso penalty.

4.1.1.2 Adaptive Lasso

Adaptive Lasso Penalty. Zou [132, 133] proposed an adaptive lasso penalized formulation:

\min_{\beta} L(X, y, \beta) + \lambda \sum_{j=1}^{p} W_j |\beta_j|, \qquad (4.2)

where W is a user-specified vector of size p, and other variables are as before: L(X, y, β) is a once- or twice-differentiable convex function, X is a matrix of size n × p and y is a vector of size n representing

the training data {(xi, yi), i = 1, . . . , n, xi ∈ Rp, yi ∈ R}, and β is the model parameter vector of size p that

is estimated for a given value of the regularization parameter λ.

The idea behind the adaptive lasso is to set W high for irrelevant features and low for relevant features

so that irrelevant features are penalized more than relevant features. This weighting allows the adaptive

lasso penalty to reduce the effects of irrelevant features further than the standard lasso penalty. Instead of

setting the weights W by hand, they may be obtained using an L2 penalized algorithm for classification and

an unpenalized algorithm for regression. We describe this approach next.

Adaptive Lasso Regression and Classification. Zou [133] proposed that the adaptive lasso penalty \sum_{j=1}^{p} W_j |\beta_j| with W_j = |\beta_j^{(OLS)}|^{-\gamma}, \gamma > 0, could be used to regularize least squares as follows:

\min_{\beta} \|y - X\beta\|^2 + \lambda \sum_{j=1}^{p} W_j |\beta_j| \quad \text{s.t. } W_j = |\beta_j^{(OLS)}|^{-\gamma}, \ \gamma > 0, \qquad (4.3)

where X is a matrix of size n × p and y is a vector of target values of size n, together representing the training data {(x_i, y_i), i = 1, . . . , n, x_i ∈ R^p, y_i ∈ R}; β^{(OLS)} denotes the model parameter vector found using ordinary least squares (OLS) regression; and β is the model parameter vector of size p that is estimated for

a given value of the regularization parameter λ. The resultant formulation is known as an adaptive lasso

regression.
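As a concrete illustration, the following is a minimal sketch of adaptive lasso regression under stated assumptions: it assumes n > p so that the OLS estimates exist, uses scikit-learn (whose alpha differs from λ by a constant scaling), and implements the weighted penalty via the column-rescaling trick described in a footnote in Section 4.3.3; the helper name and fixed parameter values are illustrative.

    # A minimal sketch of adaptive lasso regression (Eq. 4.3); assumes n > p.
    import numpy as np
    from sklearn.linear_model import Lasso, LinearRegression

    def adaptive_lasso(X, y, lam=0.1, gamma=1.0):
        beta_ols = LinearRegression().fit(X, y).coef_      # unpenalized estimates
        with np.errstate(divide="ignore"):
            W = np.abs(beta_ols) ** (-gamma)               # heavy penalty on small OLS coefs
        Xs = X / W                                         # rescaled design: X_j / W_j
        beta_scaled = Lasso(alpha=lam).fit(Xs, y).coef_    # standard lasso solver
        return beta_scaled / W                             # map back: beta_j = beta~_j / W_j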


Zou [132] applied the same properties to support vector machine (SVM) classification, and the resulting

formulation is called the Improved 1-norm SVM or Hybrid SVM, and is defined as:

\min_{\beta, \beta_0} \sum_{i=1}^{n} [1 - y_i(x_i\beta + \beta_0)]_+ + \lambda \sum_{j=1}^{p} W_j |\beta_j|, \quad v_+ = \max(v, 0), \quad \text{s.t. } W_j = |\beta_j^{(l2)}|^{-\gamma}, \ \gamma > 0, \qquad (4.4)

\beta^{(l2)} = \arg\min_{\beta, \beta_0} \sum_{i=1}^{n} [1 - y_i(x_i\beta + \beta_0)]_+ + \lambda_1 \sum_{j=1}^{p} \|\beta_j\|_2^2, \quad v_+ = \max(v, 0),

where λ and λ_1 are regularization parameters, and {(x_i, y_i), i = 1, . . . , n, x_i ∈ R^p, y_i ∈ R} is the training data; β^{(l2)} denotes the model parameter vector found using the L2 penalized SVM; β is the model parameter vector of

size p and β0 is an offset value that are estimated for a given value of the regularization parameter λ. Note

the inclusion of the parameter γ in the exponent of the formula for the weights. This parameter modulates

the scaling of the weights.

Henceforth, we will refer to both the adaptive lasso penalized least squares regression and the SVM

classification as adaptive lasso algorithms. Note that although the adaptive lasso penalty was proposed for

the L1-SVM and lasso regression, it can also be used for other loss functions.

For adaptive lasso, the weights of the relevant features are expected to be small (as the corresponding

L2 or unpenalized estimates are expected to be large), whereas the weights of irrelevant features are expected

to be large (as the corresponding L2 or unpenalized estimates are expected to be small). Thus, adaptive

lasso penalizes irrelevant features more than relevant features and in so doing is able to reduce the effects of

irrelevant features further than the standard lasso penalty.

The adaptive lasso has multiple free parameters, some associated with the L2 penalty as well as λ and

γ. Instead of performing a grid search over all parameters, Zou [132, 133] proposed a hierarchical approach

using validation data to select the best L2 penalized classification model or unpenalized regression model

and corresponding weights; then using an adaptive lasso to perform a grid search over λ and γ given the

selected weights. For our experiments, we adopted Zou’s method.

Next, we discuss an algorithm that is like adaptive lasso, but instead of using weights derived from

an L2 penalized algorithm, it uses weights derived from an L1 penalized algorithm.


4.1.1.3 Multi Step Estimation Using Multi-Step Adaptive Lasso (MSA)

The multi-step adaptive lasso algorithm (MSA) [24], summarized in Algorithm 2, converts the single

step adaptive lasso algorithm into an iterative algorithm.

In the MSA, the weights derived at step n are used to bias the model at step n+1. The MSA alternates

between estimating β (from previous weights) and the weights (from previous β estimates). The weights

used in the MSA are derived from an L1 penalized algorithm whereas the weights used in the adaptive lasso

algorithm are derived from an L2 penalized (or an unpenalized) algorithm.

Algorithm 2: Multi-Step Adaptive Lasso (MSA) Algorithm

Input: n training examples with p features, represented by input matrix X with target values/labels y. Validation data N_val are available. M steps are performed.
Output: Best model β^{(m)} (chosen using validation data) from the candidate models β^{(k)}, k = 1, 2, ..., M.

Step 1. Initialize the weights W_j^{(0)} = 1, where j = 1, 2, ..., p.

Step 2. For k = 1, 2, ..., M:

    Use the adaptive lasso penalty in conjunction with a loss function as

        \min_{\beta^{(k)}} L(X, y, \beta^{(k)}) + \lambda \sum_j W_j^{(k-1)} |\beta_j^{(k)}|. \qquad (4.5)

    Assume that, at the kth step, λ^{(k)*} is the optimal regularization parameter and β^{(k)} is the model parameter vector found using validation data N_val.

    Update the weights W_j^{(k)} (the weights at the kth step) as

        W_j^{(k)} = \frac{1}{|\beta_j^{(k)}|}, \qquad (4.6)

    where β_j^{(k)} is the model parameter value obtained for the jth feature and optimal regularization parameter λ^{(k)*} at the kth step.

In this algorithm, the number of features will monotonically decrease or remain the same with the

number of iterations, as only features that have a non-zero β value in the previous step will be considered in

subsequent steps; as weights are derived from an L1 penalized algorithm, the model parameter β is expected

to be sparse.
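A minimal sketch of the MSA loop follows, again assuming scikit-learn and the column-rescaling implementation of the weighted penalty; the per-step validation search over λ is elided, and the fixed λ is an illustrative simplification.

    # A minimal sketch of the MSA (Algorithm 2); infinite weights encode
    # features eliminated by a zero beta estimate in the previous step.
    import numpy as np
    from sklearn.linear_model import Lasso

    def msa(X, y, lam=0.1, M=5):
        n, p = X.shape
        W = np.ones(p)                                 # Step 1: uniform weights
        betas = []
        for k in range(M):                             # Step 2: reweighted steps
            active = np.isfinite(W)                    # zero-coef features drop out
            beta = np.zeros(p)
            if active.any():
                Xs = X[:, active] / W[active]          # weighted lasso via rescaling
                coef = Lasso(alpha=lam).fit(Xs, y).coef_
                beta[active] = coef / W[active]        # back to original scale
            betas.append(beta)
            with np.errstate(divide="ignore"):
                W = 1.0 / np.abs(beta)                 # Eq. 4.6: W_j = 1 / |beta_j|
        return betas                                   # pick the best via validation data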


When the number of iterations, k, is 2, the MSA is identical to the adaptive lasso algorithm except

for the fact that it uses weights derived from an L1 penalized algorithm rather than an L2 penalized (or

unpenalized) algorithm. Buhlmann and Meier [24] noted that results for MSAs with more than k = 3 steps

were not promising; when k > 3, models trained with MSAs are sparser and less accurate than models with

k = 3 steps. Note that once the β estimate for a feature is set to zero, the feature is eliminated and is not

considered in subsequent steps of the MSA.

Instead of performing a parametric search over λ and γ as in the adaptive lasso, Buhlmann and Meier [24] limited their experiments to a parametric search over the regularization parameter λ, setting γ = 1. As shown in Equation 4.4 and Equation 4.3, γ is used to tune the adaptive lasso. In Section 4.3.1, we briefly

discuss how γ can be used to tune the MSA.

Finally, we briefly describe an algorithm that is similar to the MSA but is used in the domain of

signal recovery. One of our goals is to explore whether modifications to the MSA allow for higher prediction

accuracy.

Similarity of MSA to an Algorithm in Compressed Sensing / Signal Recovery. An issue with

the MSA algorithm, noted by Buhlmann and Meier [24], is that as the number of steps increases, the models

generated are sparser and less accurate than previous models. When MSA was proposed, a similar algorithm

was concurrently suggested by Candes et al. [27] in which sparsity was controlled by including a small positive

offset ε > 0; Candes et al. [27] used W_j^{(k)} = 1/(|\beta_j^{(k)}| + ε) instead of W_j^{(k)} = 1/|\beta_j^{(k)}| in Equation 4.6. Due to ε, a zero-valued β estimate for a feature does not remove the feature from consideration in subsequent steps.

Candes et al. [27] designed their algorithm to be used in compressed sensing, and the value of ε is

based on the characteristics of the data in that domain. We briefly digress and describe the use of L1

penalized algorithms in compressed sensing: L1 penalized algorithms are used to obtain relevant features in

a high-dimensional setting in which the number of features greatly exceeds the number of examples (an underdetermined

linear system); in contrast to real world machine learning datasets, for data obtained in compressed sensing,

there exists a unique sparse solution for the underdetermined linear system [26].

In the algorithm proposed by Candes et al. [27], the weights are updated to W_j^{(k)} = 1/(|\beta_j^{(k)}| + ε), where ε is set as follows: we sort |β| in descending order, find the φth term [φ = n/(4 log(p/n))], and set ε = max{|β|_φ, 10^{-3}}; φ is linked to the anticipated accuracy of L1 penalty minimization for a sparse signal.

Only when p > n (more features than examples) is log(p/n) positive and φ a positive value. An argument against using this formula for ε in machine learning problems is that |β|_φ is grounded in compressed sensing [27, 40], in which the ratio of signal to noise can be estimated; |β|_φ cannot be estimated for real world machine learning datasets. We experimented with ε in order to determine whether we could prevent the MSA from producing very sparse models with low accuracy.
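For completeness, a small sketch of the ε rule just described, reading φ = n/(4 log(p/n)) and assuming p > n; the function name is illustrative.

    # Sketch of the epsilon rule of Candes et al. [27]; assumes p > n.
    import numpy as np

    def candes_epsilon(beta, n, p):
        phi = int(n / (4.0 * np.log(p / n)))       # p > n makes log(p/n) > 0
        mags = np.sort(np.abs(beta))[::-1]         # |beta| in descending order
        phi = min(phi, len(mags) - 1)              # guard the index
        return max(mags[phi], 1e-3)                # eps = max{|beta|_phi, 1e-3}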

We will describe experiments with the different variations of the MSA in Section 4.4.1.2. First,

however, we propose a novel multiple step estimation algorithm.

4.2 Subsampling/Bootstrap Adaptive Lasso Algorithm (SBA-LASSO)

To motivate our new algorithm, we contrast the approach taken by adaptive lasso and MSA with that

taken by bootstrap lasso. Both adaptive lasso and MSA directly use weights that are inversely proportional

to β estimates obtained from an L1 or L2 penalized algorithm. In contrast, the bootstrap lasso creates a

final model with only features that are selected in more than a certain proportion of L1 models trained on

bootstrapped datasets. Bootstrap lasso performs a threshold function on the selection counts of the features

and ignores the β estimates.

We propose the subsampling/bootstrap adaptive lasso algorithm (SBA-LASSO) to combine the idea

of the bootstrap lasso and the weighting used by the adaptive lasso and the MSA. The algorithm is specified

as Algorithm 3; for brevity, we refer to the SBA-LASSO algorithm as SBA. Like the MSA, the SBA is a

multiple step algorithm that uses weights. Unlike the MSA and the adaptive lasso, in which the weights are

inversely proportional to the β estimates obtained from an L1 or an L2 algorithm, the weights in the SBA are

inversely proportional to the number of times features are selected by an L1 algorithm trained on (multiple)

subsampled or bootstrapped datasets; the choice of subsampling or bootstrap is left to the user. The goal

of the SBA is to discourage penalization of features that are often selected by L1 penalized models trained on subsampled or bootstrapped datasets, and to encourage the penalization of features that are infrequently selected.

Algorithm 3: Subsampling/Bootstrap Adaptive Lasso (SBA-LASSO) Algorithm

Input: n training examples with p features, represented by input matrix X and target values/labels y; t controls the number of subsampled or bootstrapped datasets created, b controls the size of subsampled datasets, M is the number of steps performed, and validation data N_val are available.
Output: Best model β^{(m)} (chosen using validation data) from the candidate models β^{(k)}, k = 1, 2, ..., M.

Step 1. Initialize the weights W_j^{(0)} = 1, where j = 1, 2, ..., p.

Step 2. For k = 1, 2, ..., M, update the weights W^{(k)} (the weights at the kth step) as follows:

    For subsampling, sample b examples without replacement from the n training examples t times to create t datasets. For bootstrap, sample n examples with replacement from the n training examples t times to create t datasets. Denote these datasets as N_trn_i, i = 1, ..., t.

    V ← 0

    For i = 1, ..., t:

        Model_i is trained on N_trn_i using an adaptive lasso penalized algorithm and the weights W^{(k-1)} from the previous step; that is, estimate the optimal λ_{N_trn_i} and the corresponding β_{N_trn_i} estimates using validation data N_val via

            \min_{\beta_{N_{trn_i}}} L(X_{N_{trn_i}}, y_{N_{trn_i}}, \beta_{N_{trn_i}}) + \lambda \sum_j W_j^{(k-1)} |(\beta_j)_{N_{trn_i}}|. \qquad (4.7)

        Also, S_i = {j | (β_j)_{N_trn_i} ≠ 0} (the features selected in Model_i).

        V ← V + {x, x ∈ R^p | x_j = 1 if j ∈ S_i else x_j = 0}.

    V ← V / max(V, 1). \qquad (4.8)

    Update the weights W_j^{(k)} (the weights at the kth step) as follows:

        W_j^{(k)} ← 1 / V_j. \qquad (4.9)

    Estimate the optimal λ, γ and the corresponding β^{(k)} (using validation data N_val) for

        \min_{\beta^{(k)}} L(X, y, \beta^{(k)}) + \lambda \sum_{j=1}^{p} (W_j^{(k)})^\gamma |\beta_j^{(k)}|. \qquad (4.10)
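To make Step 2 concrete, the following is a minimal sketch of one SBA-LASSO weight update under stated assumptions: scikit-learn's Lasso as the L1 penalized algorithm, squared error on validation data for selecting λ, and the column-rescaling implementation of the weighted penalty; the final fit of Equation 4.10 (with the grid over γ) is elided, and the λ grid and helper name are illustrative.

    # A minimal sketch of one weight update of Algorithm 3 (subsampling variant).
    import numpy as np
    from sklearn.linear_model import Lasso

    def sba_weight_update(X, y, X_val, y_val, W_prev, t=100, b_frac=0.25,
                          lambdas=(1e-3, 1e-2, 1e-1, 1.0), seed=None):
        rng = np.random.default_rng(seed)
        n, p = X.shape
        b = int(b_frac * n)
        counts = np.zeros(p)
        for _ in range(t):
            idx = rng.choice(n, size=b, replace=False)    # subsample w/o replacement
            best_coef, best_err = None, np.inf
            for lam in lambdas:                           # lambda picked on validation data
                model = Lasso(alpha=lam).fit(X[idx] / W_prev, y[idx])
                err = np.mean((model.predict(X_val / W_prev) - y_val) ** 2)
                if err < best_err:
                    best_err, best_coef = err, model.coef_
            counts += (best_coef != 0)                    # S_i: features selected
        V = counts / max(counts.max(), 1.0)               # Eq. 4.8: normalize counts
        with np.errstate(divide="ignore"):
            return np.where(V > 0, 1.0 / V, np.inf)       # Eq. 4.9: W_j = 1 / V_j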

A single step version of the SBA-LASSO algorithm was proposed earlier in Chapter 3 and by Jaiantilal

and Grudic [68]. We extend the algorithm to multiple steps, thereby combining properties of various models

that have already been studied in the literature.

The SBA adaptively determines the threshold above which features should not be penalized. In

the SBA, features that are frequently selected in L1 penalized models trained using multiple subsampled

and bootstrapped datasets will have marginal or no penalization. This behavior is similar to that of the

bootstrap lasso, in which the final model is created by selecting only features that are selected by more

than a certain proportion of multiple L1 penalized models trained on bootstrapped datasets. Both the SBA

and the bootstrap lasso ignore feature magnitude |β|. In the SBA, features that are occasionally selected in

L1 penalized models trained using multiple subsampled and bootstrapped datasets are penalized in inverse

proportion to the number of times they were selected. In contrast, the degree of penalization in the adaptive

lasso and the MSA depends on weights derived using an L1 or L2 penalized algorithm, and all features are

penalized to a certain degree.

4.3 Methodology

In this section, we present the methodology for comparing our proposed algorithm against algorithms

in the literature. We discuss the determination of parameters of each algorithm and our evaluation metrics.

Our experimental setup is complicated because we have to consider different parameters that are used

by the MSA, the SBA, the adaptive lasso, and the bootstrap lasso, all of which augment an L1 penalized

algorithm; furthermore, an L1 penalized algorithm has its own regularization parameter (λ). In order to find

the optimal model, we have to jointly optimize over all parameters.

4.3.1 Tuning Parameter and Parametric Search in MSA

In order to put the MSA on equal footing with adaptive lasso and the SBA, we modify the MSA

objective function (Equation 4.5) to incorporate a tuning parameter γ:

\min_{\beta^{(k)}} L(X, y, \beta^{(k)}) + \lambda \sum_j (W_j^{(k-1)})^\gamma |\beta_j^{(k)}|, \quad \gamma > 0, \qquad (4.11)


and perform a parametric grid search over both λ and γ within each step of MSA. In Section 4.4.1.2 (page 63),

we show that results improve when the tuning parameter is incorporated into the MSA.

We briefly summarize the different variations of the MSA, and the parametric search in each variation, as follows:

(1) In Section 4.1.1.3, we discussed the MSA in which the weights are set to W_j^{(k)} = 1/|\beta_j^{(k-1)}|. We label this variation the untuned MSA. In an untuned MSA, at each step, we perform a search over the regularization parameter λ.

(2) In Section 4.1.1.3, we also discussed a related method proposed by Candes et al. [27] that introduced the ε parameter: W_j^{(k)} = 1/(|\beta_j^{(k-1)}| + ε); we label this variation the epsilon MSA. In an epsilon MSA, at each step, we perform a search over the regularization parameter λ. The variable φ is used in the calculation of ε; for artificial datasets, we calculated φ = n/(4 log(p/n)), where p is the number of features and n is the number of examples in the training data.

(3) In the current section, we introduced a tuning step for the MSA via γ as follows: W_j^{(k)} = 1/|\beta_j^{(k-1)}|^\gamma; we label this variation the tuned MSA. In a tuned MSA, at each step, we perform a grid search over (λ, γ), where λ is the regularization parameter.

4.3.2 Parametric Search in SBA, Adaptive Lasso, and Bootstrap Lasso

Now we discuss the parameters associated with the SBA, the adaptive lasso, and the bootstrap lasso.

The parametric search in the various algorithms is performed as follows:

• The regularization parameter λ depends on the solver used for the L1 penalized algorithm. We note the values of λ in Section 4.3.3.

• The parameter γ is used to tune the weights. For the artificial datasets used in our experiments, we parametrically searched over γ = {1, 2, 4}; in the case of real world datasets, we considered γ = {0.05, 0.5, 1, 2, 4}.


• For the single step bootstrap lasso, we created 100 bootstrapped datasets, and for each bootstrapped dataset, we trained an L1 model with an optimal regularization parameter value (λ) picked using validation data. We trained the final model (which uses an L1 penalized algorithm) using only those features that were selected by more than a given percentage of the L1 models trained on bootstrapped datasets; the final model was also trained with an optimal regularization parameter (λ) picked using validation data. For convenience, we refer to the parameter τ controlling this percentage as the 'threshold value for the bootstrap lasso'.

For the bootstrap lasso, Bach [6] reported results using τ = 90%; for the datasets used in this study, the bootstrap lasso initially did not perform competitively with the other algorithms due to this high threshold value. For our experiments, we considered τ = {40%, 50%, 60%, 70%, 80%, 90%}. The parameters for the final model were picked using validation data and a grid search over (τ, λ).

• In each step of Subsampling SBA, we considered φ = {10%, 25%, 40%, 50%}, where φ is the percentage of examples sampled from the original dataset to create a subsampled dataset; we considered 100 subsampled datasets for each φ value. Within each step of Subsampling SBA, we performed a grid search over (φ, γ, λ), using validation data to pick the parameters for the final model.

• In each step of Bootstrap SBA, we created 100 bootstrapped datasets using bootstrap sampling with replacement. Within each step of Bootstrap SBA, we performed a grid search over (γ, λ).

• As suggested by Zou [132, 133], for the adaptive lasso, we parametrically found the weights using the best model selected by an L2 penalized algorithm and used those weights to perform a grid search over (γ, λ).

• The parameter list for the MSA is complicated due to the multiple variations of MSAs. In Section 4.4.1.2, we discuss how we chose the best variation of the MSA. More information about the parameters used for the MSA is provided in that section.

• The MSA and the SBA are multiple step algorithms, and thus we have to consider a stopping criterion. We limit the total number of steps to 50. After termination, we choose the model with the best validation performance.

4.3.3 L1 Penalized Algorithms and Their Parameters

In this section, we describe the four linear L1 penalized algorithms used in our experiments, three

of which are used for binary classification and one of which is used for regression. We omit the weighted

formulations of the algorithms. We also mention the solvers used for both L1 penalized and L2 penalized algorithms, and we list the values from which individual solvers choose the best regularization parameters.

Some of the solvers use regularization parameter C (∝ 1/λ) instead of λ.

Note that the solver used for a standard L1 penalized algorithm can be adapted for a weighted L1 penalized algorithm by appropriately scaling¹ the input matrix before it is passed to the solver, as described by Zou [132, 133].

L1-SVM. The formulation of the L1-SVM algorithm [131] with the hinge loss function regularized using

the L1 penalty is defined as

\min_{\beta, \beta_0} \sum_{i=1}^{n} [1 - y_i(x_i\beta + \beta_0)]_+ + \lambda \sum_{j=1}^{p} |\beta_j|, \quad v_+ = \max(v, 0), \qquad (4.12)

where {(xi, yi), i = 1, . . . , n, xi ∈ Rp, yi ∈ R} is the training data; β is the model parameter vector of size p

and β0 is an offset value that are estimated for a given value of the regularization parameter λ.

Fung and Mangasarian [51] offer a solver for the L1-SVM through a software package called the LP

Support Vector Machine (LPSVM). LPSVM uses two parameters (ν and δ) for regularization; we parametrically searched over ν = 2^{(−12,−11,...,11,12)} and δ = 10^{(−3,−2,...,2,3)}. We also experimented on artificial datasets with the L1-SVM solver proposed by Zhu et al. [131], which provides the entire regularization path for λ, and obtained results similar to those obtained by LPSVM; we found that solver to be computationally slower than LPSVM, and thus we used LPSVM for our experiments.

The L2 penalized SVM algorithm is solved using LIBSVM [28]. We considered the following range of

values for the regularization parameter: C = {1e-2, 1e-1, 0.5, 1, 2, 10, 1e2, 1e4, 1e5}.

1 In an L1 penalized algorithm, each feature in the input matrix is scaled to mean zero and standard deviation one before being passed to the solver; we refer to this matrix as a scaled input matrix. To implement an adaptive L1 penalized algorithm using the solver for an L1 penalized algorithm, each feature in the scaled input matrix is divided by its corresponding weight value before passing it to the solver of an L1 penalized algorithm.


L1-SVM2. The SVM algorithm can also be formulated using a squared hinge loss function regularized

with the L1 penalty as follows:

\min_{\beta, \beta_0} \sum_{i=1}^{n} [1 - y_i(x_i\beta + \beta_0)]_+^2 + \lambda \sum_{j=1}^{p} |\beta_j|, \quad v_+ = \max(v, 0), \qquad (4.13)

where {(xi, yi), i = 1, . . . , n, xi ∈ Rp, yi ∈ R} is the training data; β is the model parameter vector of size p

and β0 is an offset value that are estimated for a given value of the regularization parameter λ.

In the literature, this SVM is sometimes referred to as the L2-SVM [29], because the squared hinge loss is an L2 loss function; that naming refers to the loss rather than the penalty. In order to be consistent with our terminology, which uses the Lp or lasso penalty as the prefix in an algorithm's name, we refer to Equation 4.13 as L1-SVM2; the suffix SVM2 denotes a squared hinge loss function.

The software we used to calculate the entire piecewise path for L1-SVM2 is based on the algorithmic

framework described by Rosset and Zhu [97]. Optimization routines are described in Appendix A.2. The

optimization of a squared hinge loss function is easier than that of a hinge loss function because a squared

hinge loss function is differentiable everywhere [29].

The L2 penalized SVM2 algorithm is solved using the LIBLINEAR [45, 77] software package. We considered the following range of values for the regularization parameter: C = 2^{(−10,−9,...,9,10)}.

Lasso Regression. The formulation for the lasso regression is

\min_{\beta} \sum_{i=1}^{n} (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|, \qquad (4.14)

where {(xi, yi), i = 1, . . . , n, xi ∈ Rp, yi ∈ R} is the training data and β is the model parameter vector of size

p that is estimated for a given value of the regularization parameter λ. The software we used to calculate

the entire piecewise solution path (0 ≤ λ ≤ ∞) for lasso regression is based on LARS [42, 106].

In performing adaptive lasso regression, Zou [133] considered only weights derived from estimates

made through the unpenalized ordinary least squares method; for our experiments, we also considered ridge

regression (L2 penalized least squares) estimates. We considered the following range of values for the ridge/L2

penalty regularization parameter: λ={0, 1e-2, 1e-1, 0.5, 1, 2, 10, 1e2, 1e3, 1e4, 1e5}.


L1 Logistic Regression. The formulation for the L1 logistic regression is defined as

\min_{\beta, \beta_0} -l(\beta, \beta_0) + \lambda \sum_{j=1}^{p} |\beta_j|, \quad G = \{+1, -1\}, \qquad (4.15)

P(G = y_i | X = x_i) = \frac{\exp(y_i(\beta_0 + x_i^T\beta))}{1 + \exp(y_i(\beta_0 + x_i^T\beta))}, \qquad l(\beta, \beta_0) = \sum_{i=1}^{n} \log P(G = y_i | X = x_i),

where {(xi, yi), i = 1, . . . , n, xi ∈ Rp, yi ∈ R} is the training data; β is the model parameter vector of size

p and β0 is an offset value that are estimated for a given value of the regularization parameter λ. The

LIBLINEAR [45] software package is used to estimate β at fixed values of the regularization parameter for

both L1 and L2 penalties. We considered the following range of values for the regularization parameters:

C ={0.1, 1, 1e1, 1e2, 1e3, 1e4, 1e5}.

4.3.4 Evaluation Metrics

In this section, we discuss the evaluation metrics used to compare and evaluate the performance of

algorithms.

Classification Error Percentage. For classification datasets, we calculate the classification error percentage as 100 \cdot \frac{1}{n} \sum_{i=1}^{n} I(\hat{y}_i \neq y_i) for n examples, where y_i is the target label for the ith example, \hat{y}_i is the predicted label for the ith example, and I(\cdot) is an indicator function.

Regression Relative Squared Error Percentage. For regression datasets, we calculate the relative squared error percentage, for n examples, as

\text{Regression relative squared error percentage} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \times 100, \qquad (4.16)

where y_i is the target label for the ith example, \hat{y}_i is the predicted label for the ith example, and \bar{y} is the mean target label.

Instead of using the mean squared error for regression problems, which has no intrinsic interpretation and is therefore difficult to compare across datasets, we use the relative squared error percentage. The relative squared error percentage is greater than 100 if the algorithm lacks the ability to utilize the features to make predictions, i.e., if it is worse than an algorithm that makes a constant prediction of \bar{y}.
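A minimal sketch of both metrics, assuming NumPy arrays of true and predicted values:

    import numpy as np

    def classification_error_pct(y_true, y_pred):
        return 100.0 * np.mean(y_true != y_pred)

    def relative_squared_error_pct(y_true, y_pred):
        # > 100 means worse than always predicting the mean target value
        num = np.sum((y_true - y_pred) ** 2)
        den = np.sum((y_true - np.mean(y_true)) ** 2)
        return 100.0 * num / den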

Comparison of Multiple Algorithms Using Significance testing. When comparing two or more

algorithms, we can calculate the average error across datasets. However, errors are not commensurate over

different datasets, and a large variation on a single dataset may be enough to give one of the algorithms an

unfair advantage over the others.

An alternative method is to use a paired Student's t-test or the Wilcoxon signed-rank test [124] to compare the significance of differences in error between individual algorithms, but such tests are overly optimistic when multiple algorithms are compared, as they do not account for the fact that, across many pairwise comparisons, some null hypotheses will be rejected by random chance alone.

To compare multiple algorithms, Demsar [35] suggested the Friedman test, which ranks algorithms’

performance using error results for individual datasets, from best to worst. If after performing the Friedman

test, the hypothesis that the algorithms are similar is rejected, one can use a post-hoc test, such as the

Bergmann-Hommel test, which adjusts the significance value based on the number of classifiers present; such

an adjustment takes into account the fact that many pairwise comparisons may be rejected due to random

chance. García and Herrera [52] provide software that lists the pairwise hypotheses that are rejected based

on the results of the Bergmann-Hommel test.
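As an illustration of the first stage of this procedure, the Friedman test is available in SciPy (the Bergmann-Hommel post-hoc step is not, and is handled by the software of García and Herrera [52]); the error values below are made up for the example.

    # Friedman test across algorithms; rows are datasets, columns algorithms.
    import numpy as np
    from scipy.stats import friedmanchisquare

    errors = np.array([[12.1, 11.4, 13.0],      # illustrative error values
                       [ 8.3,  8.1,  9.2],
                       [15.0, 14.2, 15.5],
                       [10.4,  9.8, 11.1]])
    stat, pval = friedmanchisquare(*errors.T)   # one sample per algorithm
    print(f"Friedman chi-square = {stat:.3f}, p = {pval:.4f}")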

Note that the comparison of multiple algorithms via the Friedman test and post-hoc testing has

disadvantages when the difference between algorithms is not significant. Assuming that the errors from the

algorithms are obtained through a cross validation technique such as 10 fold cross validation [4, 25, 38], the

ranking may be incorrect if the differences across algorithms applied to a dataset arise from random chance alone, as when all algorithms applied to a given dataset obtain the same or similar final models.

4.4 Results

We present results for both artificial datasets and real world datasets. We compare various forms

of feature selections and various prediction models, formed by the Cartesian product: {variations of the


MSA, variations of the SBA, adaptive lasso, bootstrap lasso, standard L1 penalized algorithm}×{L1-SVM,

L1-SVM2, lasso regression, L1 logistic regression}.

4.4.1 Artificial Datasets

4.4.1.1 Description

The artificial datasets we used were originally used by [24, 64]. The multiple goals of these datasets are

to simulate a variety of conditions, including correlation and non-correlation between relevant and irrelevant

features and the existence or nonexistence of groups within the relevant features.

We generate regression datasets from the following linear model

y = X\beta + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2). \qquad (4.17)

Classification datasets with two classes are created from the model via y \sim \text{Bernoulli}\{p(x)\}, where p(x) = \exp(x^T\beta)/(1 + \exp(x^T\beta)); p is the number of features in the model, and n is the number of

examples. The artificial datasets are generated by choosing a value for the β vector and then sampling from

Equation 4.17. In the following presentation, N(x,m) denotes a Gaussian random variable with mean x and

standard deviation m. The specifics for the different artificial datasets are as follows.

Artificial Dataset-1. We set p = 200 and σ = 1.5. The first 15 features are uncorrelated with the remaining 185 features. The pairwise correlation between features x_i and x_j among the first 15 features x_1, ..., x_15 is ρ = 0.5^{|i−j|}. The pairwise correlation among the remaining 185 features x_16, ..., x_200 is also ρ = 0.5^{|i−j|}. The coefficients of the linear model were set to β_1, ..., β_5 = 2.5; β_6, ..., β_10 = 1.5; and β_11, ..., β_15 = 0.5.

Artificial Dataset-1 simulates the case in which p > n and correlation exists among groups of relevant

features and groups of irrelevant features, but groups of relevant and irrelevant features are uncorrelated

with each other.

Artificial Dataset-2. Same as Artificial Dataset-1, but with higher correlation—ρ = 0.95.


Artificial Dataset-3. The dataset is generated, for p = 200 and σ = 1.5, as follows:

x_i = Z_1 + e_i, \quad Z_1 \sim N(0, 1), \quad i = 1, \ldots, 5,
x_i = Z_2 + e_i, \quad Z_2 \sim N(0, 1), \quad i = 6, \ldots, 10,
x_i = Z_3 + e_i, \quad Z_3 \sim N(0, 1), \quad i = 11, \ldots, 15,
x_i \sim N(0, 1) \text{ (uncorrelated)}, \quad i = 16, \ldots, 200; \quad e_i \text{ uncorrelated } N(0, 0.01), \quad i = 1, \ldots, 15.

The coefficients of the linear model were set to β_1, ..., β_15 = 1.5, and the remaining coefficients are zero. Artificial Dataset-3 simulates the case in which p > n and group structure exists among relevant features.

Artificial Dataset-4. Same as Artificial Dataset-1, but with p = 400 (lower signal to noise ratio).

Artificial Dataset-5. Same as Artificial Dataset-2, but with p = 400 (lower signal to noise ratio).

Artificial Dataset-6. Same as Artificial Dataset-3, but with p = 400 (lower signal to noise ratio).

Artificial Dataset-7. For p = 200 and σ = 1.5, we set the pairwise correlation among the 200 features x_1, ..., x_200 to ρ = 0.5^{|i−j|}. The coefficients were set to β_1, ..., β_5 = 2.5; β_11, ..., β_15 = 1.5; and β_21, ..., β_25 = 0.5. Artificial

Dataset-7 is similar to Artificial Dataset-1 but has correlation among the relevant and irrelevant features.

Artificial Dataset-8. Same as Artificial Dataset-7, but with higher correlation—ρ = 0.95.

Artificial Dataset-9. For p = 200 and σ = 0, we set the pairwise correlation among the 200 features x_1, ..., x_200 to ρ = 0.5^{|i−j|}. The coefficients of the linear model were set to β_1, ..., β_25 = 0.6. We eliminated the estimation noise variable ε in y = Xβ + ε and lowered the signal-to-noise ratio in comparison to Artificial Dataset-7.

For each of the nine artificial datasets, we generated 50 instances of each dataset, each consisting of

100 training examples, 1000 validation examples, and 20000 test examples. We present results which are the

mean over these 50 instances to reduce sampling variance.

Instead of presenting all results at once, we distill them to reduce the complexity of our analysis.


4.4.1.2 Distilling the results of MSA and SBA

Our model exploration is complicated by the existence of multiple variations of both the MSA and

the SBA. A direct comparison of all algorithms that includes the many variants of MSA and SBA may lead

to a complicated picture. Therefore, we first compare the different variants of MSA and SBA in isolation to

choose a single variant that we will put into competition with alternative models. We acknowledge that the

fact that MSA and SBA have many variants provides them with a greater opportunity to appear superior

to alternative models if the same test data is used both to select a variant of MSA (or SBA) and to compare

MSA (or SBA) against alternatives. Therefore, we use the artificial datasets to select the variants of MSA

and SBA that minimize squared error, and then compare the chosen variants of MSA and SBA against

alternatives on real world data sets.

Percent Error Results for Variations of MSA. In Section 4.3.1, we discussed the different variations

of MSA. Note: for our experiments, we also consider the single and multiple step versions of each variation

of the MSA.

In Figure 4.1, we show statistical results for the variations of the MSA, condensed over all four

L1 penalized algorithms; these are based on percent error results obtained for the nine artificial datasets

through Friedman ranks and the Bergmann-Hommel procedure [52]. For the sake of brevity, we omit results

for individual L1 penalized algorithms and refer to the single and multiple step algorithms as ‘single’ and

‘multiple’ respectively. Graphically, we arrange the algorithms from best to worst performing (top to bottom)

according to their average Friedman ranks across all datasets and L1 penalized algorithms, and we connect all

the algorithms whose pairwise hypotheses cannot be rejected (p = 0.05 according to the Bergmann-Hommel

procedure). Thus, algorithms that are significantly different from one another are not connected. We also

note the average percent error in parentheses next to the algorithm name.

Based on Figure 4.1, we can draw two main conclusions: (a) for all algorithms (tuned, untuned, and

epsilon MSAs), performing multiple steps significantly decreases the percent error compared to a single step,

and (b) among the studied variations of the MSA, a tuned MSA with multiple steps is the best performing

and has a significantly higher Friedman rank than the rest of the variations.


[Figure 4.1 shows a Friedman-rank diagram; average percent errors: Tuned + Multiple 11.44%, Untuned + Multiple 11.73%, Epsilon + Multiple 11.88%, Tuned + Single 12.58%, Untuned + Single 12.89%, Epsilon + Single 13.09%.]

Figure 4.1: Graphical representation of Friedman ranks and the Bergmann-Hommel procedure, for variations of the MSA, using percent error results obtained from the nine artificial dataset models × four different L1 penalized algorithms. We consider the tuned MSA (using γ), the epsilon MSA (using ε), and the untuned MSA (using neither ε nor γ), with results from both Single and Multiple step estimations; for brevity, we do not refer to the algorithms by their full names. The algorithms are arranged top (best performing) to bottom (worst performing) according to their Friedman test ranks. Algorithms that are insignificantly different from each other (p = 0.05 according to the Bergmann-Hommel procedure) are connected. The average percent error is noted in parentheses next to each algorithm name.

                      # features correctly detected    # features incorrectly detected
Tuned + Multiple      9.59 (4.15)                      5.04 (4.99)
Epsilon + Multiple    10.19 (4.28)                     10.83 (10.55)
Untuned + Multiple    9.94 (4.13)                      6.83 (5.45)
one-way ANOVA         p = 0.9448                       p = 0.0106

Table 4.1: Average number of features discovered correctly and incorrectly by the MSA over four different L1 penalized algorithms. We considered the tuned MSA (using γ), the epsilon MSA (using ε), and the untuned MSA (using neither ε nor γ), with results from multiple steps; for brevity, we do not refer to the algorithms by their full names. We performed two one-way ANOVAs on the following two hypotheses: (1) the average number of features discovered correctly by variations of the MSA is the same, and (2) the average number of features discovered incorrectly by variations of the MSA is the same.

Difference in the Feature Selection Results Between the Algorithm of Candes and Wakin [26]

and MSA. In Table 4.1, we briefly examine the feature selection results obtained for tuned, epsilon, and


untuned MSAs. Because the linear models that generate the artificial datasets are known, we can calculate

the average number of features correctly and incorrectly selected in the best models of tuned, epsilon, and

untuned MSA.

To establish the statistical significance of our results in Table 4.1, we propose two null hypotheses:

first, that the average number of features correctly detected by the different algorithms is the same, and

second, that the average number of features incorrectly detected by the different algorithms is the same. A

one-way analysis of variance (ANOVA) of the average number of features correctly detected over the various

artificial datasets was insignificant, and thus we cannot reject the first null hypothesis. In contrast, a one-way

analysis of variance of the average number of features incorrectly detected was found to be significant. We

can thus reject the second null hypothesis: that is, at least one of the algorithms had a mean number of incorrectly detected features significantly different from the rest of the algorithms. Compared to the tuned and untuned MSA, a larger

number of irrelevant features were detected by epsilon MSA; this phenomenon can be explained by the fact

that features are not explicitly removed from the model as ε > 0, even when the estimated model parameter

value for a feature is zero.

Based on the results presented here, we focus all subsequent evaluations of MSA using only the tuned

MSA with multiple steps.

Determining the best algorithm among variations of SBA. The SBA can use weights derived

from either bootstrap or subsampling. Likewise, the SBA can be used as a single or multiple step algorithm.

In Figure 4.2, we show statistical results, for variations of the SBA condensed over all four L1 penalized

algorithms, using percent error results obtained for nine artificial datasets via the Friedman test and the

Bergmann-Hommel procedure [52]. For the sake of brevity, we do not provide results for individual L1

penalized algorithms.

Based on Figure 4.2, we reach two main conclusions: (a) for both Subsampling SBA and Bootstrap

SBA, performing multiple steps rather than a single step significantly reduces the percent error, and (b)

among the variations of the SBA, Subsampling SBA with multiple steps is the best performing algorithm

and has a significantly higher Friedman rank than the rest of the algorithms. Based on these results with

artificial datasets, we report all subsequent results for SBA using the variant with subsampling and multiple steps.

[Figure 4.2 shows a Friedman-rank diagram; average percent errors: SBA(Subsampling) + Multiple 9.53%, SBA(Bootstrap) + Multiple 11.16%, SBA(Subsampling) + Single 11.81%, SBA(Bootstrap) + Single 13.11%.]

Figure 4.2: Graphical representation of Friedman ranks and the Bergmann-Hommel procedure, for variations of the SBA, using percent error results obtained from the nine artificial dataset models × four different L1 penalized algorithms. The algorithms are arranged top (best performing) to bottom (worst performing) according to their Friedman test ranks. Algorithms that are insignificantly different from each other (p = 0.05 according to the Bergmann-Hommel procedure) are connected. The average percent error is noted in parentheses after each algorithm name. Algorithms with multiple steps are referred to as Multiple and algorithms with a single step are referred to as Single.

4.4.1.3 Overall Results

Now that we have chosen the best performing SBAs and MSAs, we can compare these algorithms

to the bootstrap lasso and the adaptive lasso. As a baseline for comparison, we also include the results

obtained from a standard L1 penalized algorithm. In Figure 4.3, we present statistical results based on

percent error results, using the Friedman test and the Bergmann-Hommel procedure [52] for nine artificial

datasets segmented over four L1 penalized algorithms.

Based on Figure 4.3, we can draw the following main conclusion: the SBA is the best performing

algorithm for all four L1 penalized algorithms, whereas the MSA is the second best performing algorithm.

Note that even though the SBA always has a rank of one for both L1 logistic regression and L1-SVM2, the

Bergmann-Hommel procedure does not reject the hypothesis that the SBA is significantly different from the

MSA. When we consider results across all L1 penalized algorithms, the SBA’s performance is significantly

better than that of the MSA and the rest of the algorithms according to the Bergmann-Hommel procedure;

we do not show this result.

[Figure 4.3 shows four Friedman-rank diagrams; average percent errors: (a) L1-SVM: SBA 9.32%, MSA 12.76%, Bootstrap 13.58%, Adaptive 13.66%, Standard 14.28%. (b) L1 Logistic Regression: SBA 8.84%, MSA 10.51%, Adaptive 13.04%, Standard 14.44%, Bootstrap 15.32%. (c) L1-SVM2: SBA 11.93%, MSA 13.28%, Adaptive 13.97%, Bootstrap 14.37%, Standard 14.58%. (d) Lasso Regression: MSA 7.26%, SBA 8.01%, Adaptive 10.75%, Standard 11.40%, Bootstrap 12.25%.]

Figure 4.3: Graphical representation of Friedman ranks and the Bergmann-Hommel procedure, for the different L1 penalized algorithms, using percent error results obtained from the nine artificial dataset models. The algorithms are arranged top (best performing) to bottom (worst performing) according to their Friedman test ranks. Algorithms that are insignificantly different from each other (p = 0.05 according to the Bergmann-Hommel procedure) are connected. The average percent error is noted in parentheses after each algorithm name.

Our primary criterion for model selection is percent error: we showed that for artificial datasets,


models obtained using the SBA have significantly lower average percent error (reflected in higher Friedman

ranks) than other algorithms. In Appendix B.3, we present additional results for the artificial datasets

pertaining to accuracy of feature selection.

4.4.2 Results for Real World Datasets

4.4.2.1 Description

We could not find existing studies that performed an extensive comparison of the adaptive lasso,

the bootstrap lasso, and the MSA using real world datasets. Even published results for these individual

algorithms are sparse and usually limited to a few mostly domain-specific real world datasets; for example,

Buhlmann and Meier [24] presented results for a gene dataset and Zou [132] presented results for three real

world datasets (spambase, WDBC, and ionosphere). In this study, we hope to provide an extensive list of

datasets and comparisons among various algorithms.

We note the datasets used for our experiments in Table 4.2; a detailed description is presented in

Appendix B.1 (page 122). All of the datasets are publicly available at either UCI² [5], LIBSVM³ [28], or KEEL⁴ [2]. The datasets contain between 186 and roughly 22,000 examples and between 6 and 300 features. Our only criterion for choosing among datasets was to include those that contained at least a few hundred examples.

This ensured that enough validation data was available for all algorithms.

Many of the datasets had fewer than 35 input features and are thus not ideal candidates for feature

selection. To show the value of feature selection even for these low-dimensional data sets, we augmented

the feature vector with second-order features, i.e., features defined to be the product of all pairs of simple

features. Such augmentation yields the potential for a significant improvement in the performance of a

linear model, but this improvement will be highly dependent on feature selection to prevent overfitting.

Ideally, L1 penalized algorithms should be able to train models that suppress irrelevant⁵ features from the expanded dictionary without suffering a reduction in accuracy. In Appendix B.4 (page 130), we show that by augmenting the first-order features with second-order features, the error of L1 models can be significantly reduced (a sketch of this augmentation appears after the footnotes below).

2 http://archive.ics.uci.edu/ml/index.html
3 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
4 http://sci2s.ugr.es/keel/datasets.php
5 Irrelevant features: features irrelevant to an L1 penalized algorithm may contain essential nonlinear information not detectable by the algorithm.
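A minimal sketch of this augmentation, using scikit-learn's PolynomialFeatures; the shapes match, for example, the 8-feature datasets in Table 4.2 that expand to 44 features.

    # Second-order augmentation: squares and all pairwise products.
    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    X = np.random.randn(100, 8)                        # e.g., 8 original features
    poly = PolynomialFeatures(degree=2, include_bias=False)
    X_aug = poly.fit_transform(X)                      # 8 + 8 + C(8,2) = 44 features
    print(X_aug.shape)                                 # (100, 44)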


Classification

 #  Dataset                          Class %   # examples   # features   # features (originally)   Source
 1  a1a+                             75/25     1605         119          119                       [5, 28, 86]
 2  australian*                      56/44     690          119          14                        [28]
 3  breast-cancer                    65/35     683          65           10                        [5, 28]
 4  german.numer*                    70/30     1000         324          24                        [28]
 5  heart*                           56/44     270          104          13                        [28]
 6  hillvalley (noisy)               50/50     1212         100          100                       [5]
 7  ionosphere                       36/64     351          629          34                        [5, 28]
 8  Chess (kr vs kp)                 52/48     3196         36           36                        [5]
 9  BUPA liver-disorders             58/42     345          27           6                         [5, 28]
10  pimadiabetes                     35/65     768          44           8                         [5, 28]
11  spambase                         61/39     4597         57           57                        [2, 5]
12  splice+                          48/52     1000         60           60                        [1, 5, 28]
13  svmguide3                        76/24     1243         275          22                        [28, 63]
14  w1a∆                             97/3      5000         300          300                       [28, 86]
15  breast-cancer (diagnostic)Σ      63/37     569          495          30                        [5]

Regression

 #  Dataset                          # examples   # features   # features (originally)   Source
 1  cadata (houses.zip)              20640        44           8                         [28, 120]
 2  compactiv                        8192         252          21                        [1, 2]
 3  diabetes                         442          65           10                        [42]
 4  housing                          506          104          13                        [5, 28]
 5  mortgage                         1049         135          15                        [2]
 6  Auto-mpg                         392          35           7                         [5, 28]
 7  Pole telecommunications          14998        377          26                        [2]
 8  triazines%                       186          60           60                        [5, 28]
 9  census (house-price-16H)∇        22784        152          16                        [1, 2]
10  Concrete compressive strength    1030         44           8                         [2, 5]

Table 4.2: List of real world datasets used in our experiments. There are 10 regression datasets and 15 classification datasets containing 186–22,784 examples and 6–300 features. We included second-order features for datasets with ≤ 35 original features. *These datasets were formerly available at StatLog and are currently available at LIBSVM-DATASETS. +We used only the training examples specified at LIBSVM-DATASETS for our experiments. #We combine both training and test data. %The triazines dataset is part of "Qualitative Structure Activity Relationships" at UCI [5]. ∆We subsampled 5000 examples from the available training+test data. ∇The target value is long-tailed and thus is transformed via log2(value+1). ΣAlso known as WDBC. LIBSVM-DATASETS = http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.



4.4.2.2 Setup

We performed 10 fold cross validation, internally using six folds for training and three folds as val-

idation data (various parameters were selected using validation data); for our experiments with real world

datasets, we ensured that class proportions in each fold were kept the same.

We use 10 regression datasets with lasso regression and 15 classification datasets with L1-SVM2. We

ran into issues with the LIBLINEAR and LPSVM solvers for some of the datasets, and thus we omit results

from L1 logistic regression and L1-SVM.

4.4.2.3 Results and Analysis

[Figure 4.4 shows two Friedman-rank diagrams; (average percent error / average relative reduction in percent error over the standard L1 algorithm): (a) L1-SVM2: SBA 13.46%/9.75%, MSA 14.70%/1.41%, Adaptive 14.87%/2.18%, Bootstrap 15.02%/1.40%, Standard 15.18%/0.00%. (b) Lasso Regression: SBA 30.90%/4.07%, Bootstrap 31.52%/-5.75%, Adaptive 32.10%/0.68%, MSA 32.43%/0.65%, Standard 32.44%/0.00%.]

Figure 4.4: Graphical representation of Friedman ranks and results of the Bergmann-Hommel procedure for L1-SVM2 and lasso regression, using percent error results obtained from the real world datasets. The algorithms are arranged top (best performing) to bottom (worst performing) according to their Friedman test ranks. Algorithms that are insignificantly different from each other (p = 0.05 according to the Bergmann-Hommel procedure) are connected. In parentheses following each algorithm name are the average percent error and the average relative reduction in percent error over the standard L1 penalized algorithm.


Overall Results. In Figure 4.4, we show ranking results obtained for the real world data sets for lasso

regression and L1-SVM2. These results are based on the percent error for each feature-selection technique,

ranked according to the Friedman test and the Bergmann-Hommel procedure [52]. For each algorithm, we

also show (in parentheses) the average error across datasets and the percentage reduction in error relative

to the standard L1-SVM2 (Figure 4.4a) and standard lasso regression (Figure 4.4b).

From Figure 4.4(a), we can draw three main conclusions about the L1-SVM2: (a) the SBA outperforms

other algorithms, and according to the Bergmann-Hommel procedure, the difference is significant; (b) if we

consider the average reduction in percent error across individual datasets, the SBA had the highest relative

reduction in percent error (9.75%) over the standard L1-SVM2; and (c) if we consider the average percent

error across individual datasets, the SBA had the lowest average percent error (13.46%).

From Figure 4.4(b), we can draw three main conclusions about lasso regression: (a) the SBA outper-

forms both bootstrap lasso and standard lasso regression, and according to the Bergmann-Hommel procedure,

the difference is significant, but the SBA is not statistically different from the adaptive lasso and the MSA;

(b) if we consider the average reduction in percent error across individual datasets, of all the algorithms,

SBA had the highest relative reduction in percent error (4.07%) over standard lasso regression; and (c) if we

consider the average percent error across individual datasets, the SBA had the lowest percent error (30.90%)

of all the algorithms.

From Figure 4.4, we can conclude that the SBA outperforms the rest of the algorithms on all of the

three following metrics: Friedman rank, average reduction in error, and average error. The numerical tables

of the results for both real world and artificial datasets are presented in Appendix B.2 (page 125).

Is Subsampling SBA with Multiple Steps Still the Best Algorithm Among Variations of

SBA? In Figure 4.5, we briefly examine the statistical results obtained from variations of the SBA using

real world datasets. As shown in the figure, Subsampling SBA with multiple steps is the best performing

algorithm, but its performance was not significantly different from those of the rest of the algorithms, at

least for the L1-SVM2. Note that Subsampling SBA with multiple steps has the highest rank and lowest

average percent error of all the variations of the SBA, using both L1-SVM2 and lasso regression.


We omit a detailed presentation of results for the MSA, but the results are similar to those for the SBA: the tuned MSA variant beats out the alternatives on the real world datasets as it did on the artificial datasets.

Feature Selection Results. In Figure 4.6, we plot the number of features selected by LARS and L1-

SVM2. We arrange the algorithms left to right, sorted according to increasing percent error. For L1-SVM2,

we can draw the following conclusion: MSA and bootstrap lasso produce models that are sparser than SBA,

whereas adaptive lasso and the standard L1 penalized algorithm produce models that are denser than SBA.

For LARS, we can draw the following conclusion: the MSA produces the sparsest model, whereas the number

of features in the SBA, adaptive lasso, and bootstrap lasso were in the same range but lower than that of

the standard lasso regression algorithm.

Computational Cost. The computational costs of the various algorithms are not equivalent. The

computational cost of the SBA is an order of magnitude higher than that of the MSA algorithm because of

the parametric search over multiple subsampled or bootstrapped datasets. The time taken to perform all

of our experiments was about five days on a dual CPU computer; parallel programming was used for the

grid search over the parameters. Given the speed of modern machines and the ample availability of cycles,

reduction in test error and interpretability of results (via feature sparsity) seem to be more important criteria

than training time. Further, recent methods using coordinate descent [49] may reduce the computation time.

4.5 Conclusions and Future Work

Our empirical results support the following hypothesis: we can significantly improve prediction in

regression and classification problems using a multiple step estimation algorithm instead of a single step

estimation algorithm. We propose a multiple step algorithm, the SBA-LASSO, that uses weights derived

from either bootstrapped or subsampled datasets. We compare the SBA-LASSO with existing algorithms

including the adaptive lasso, the bootstrap lasso and the MSA-LASSO for different L1 penalized methods

(on both classification and regression problems). Subsampling SBA-LASSO with multiple steps is shown to


perform significantly better than existing algorithms both in ranking and in terms of reduction in percent error, using both artificial and real world datasets.

[Figure 4.5 shows two Friedman-rank diagrams; (average percent error / average relative reduction in percent error over the worst ranked algorithm): (a) L1-SVM2: SBA(Subsampling) + Multiple 13.46%/5.15%, SBA(Subsampling) + Single 14.00%/1.42%, SBA(Bootstrap) + Single 14.32%/0.84%, SBA(Bootstrap) + Multiple 14.46%/0.00%. (b) Lasso Regression: SBA(Subsampling) + Multiple 30.90%/5.87%, SBA(Bootstrap) + Multiple 31.49%/5.92%, SBA(Bootstrap) + Single 31.81%/0.00%, SBA(Subsampling) + Single 31.89%/3.53%.]

Figure 4.5: Graphical representation of Friedman ranks and the Bergmann-Hommel procedure, for variations of the SBA, using percent error results obtained from the real world datasets. The algorithms are arranged top (best performing) to bottom (worst performing) according to their Friedman test ranks. Algorithms that are insignificantly different from each other (p = 0.05 according to the Bergmann-Hommel procedure) are connected. In parentheses after each algorithm name, we note (average percent error / average relative reduction in percent error over the worst ranked algorithm).


[Figure 4.6 near here. Panel (a), L1-SVM2: average number of features (y-axis, 0 to 40) for SBA, Adaptive, MSA, Bootstrap, and Standard. Panel (b), Lasso Regression: average number of features (y-axis, 0 to 100) for SBA, MSA, Adaptive, Bootstrap, and Standard.]

Figure 4.6: Overall number of features discovered for real-world datasets. Algorithms are arranged left to right based on increasing Friedman ranks (obtained from Figure 4.4); the best algorithm is on the left and the worst is on the right. In the plots, we show the average number of features and the standard deviation. Normalization of the standard deviation was performed as suggested by Masson and Loftus [80].

The theoretical justification for Subsampling SBA and other multiple step algorithms is an open

research question. We hypothesize that the Subsampling SBA is able to adaptively find the threshold after

which a feature should not be penalized in the adaptive lasso penalty. We further conjecture that MSA-

LASSO and adaptive lasso may be improved by using a squashing function on the weights, which will push

them toward binary values and therefore make the algorithms behave more like SBA-LASSO.

Chapter 5

Iterative Training and Feature Selection using Random Forest

5.1 Introduction

The random forest algorithm is popular and widely used in machine learning [69, 99, 128]. A random

forest is an ensemble of bagged decision trees, whose aggregated results are used for prediction. A random

forest can be considered an implicit feature-selecting algorithm due to its use of decision trees.

Random forests perform explicit feature selection by using a feature importance measure in conjunction

with a backward elimination technique [37, 104]. A feature importance measure provides a way to rank

features in terms of their relevance to prediction. In previous studies [19, 111], feature importance values

were calculated by training a single model and evaluating the increase in error for out-of-bag (OOB) examples

by selectively permuting values for a feature in those examples; out-of-bag examples are examples that are

not used for training individual trees. Logically, if a feature is relevant to prediction, permuting a feature

will cause an increase in the OOB error rate, whereas if a feature is irrelevant, permuting the feature will

not increase the OOB error rate.

In this study, we explore a range of schemes for explicit feature selection using random forest. Existing

approaches start with the full complement of features and iteratively train models by removing features that

are deemed to be of least importance. The decision about how many features to remove is based on an

evaluation of held-out validation data or OOB data.

A key component of explicit feature selection is the feature importance measure. For our study, we


consider existing feature importance measures proposed by Breiman [19] and Strobl et al. [111]; we also propose a new feature importance measure that uses multiple retrained models, motivated by the performance of existing measures on artificial datasets. We show that our proposed feature importance measure yields both higher accuracy and greater sparsity than existing feature importance measures, though at a greater computational cost.

In previous work, features were considered for inclusion in individual decision trees in an all-or-none

fashion. In contrast, we propose a strategy in which continuous weights are assigned to individual features;

features with higher weights have a greater probability than those with lower weights of being used to train

a model. We perform multiple iterations of model training to iteratively reweight features such that the least

useful features eventually obtain a weight of zero. Weights are derived from feature importance values; in

iterated reweighting of random forest models, features with high feature importance values have a greater

probability than features with low feature importance values of being used to train a model.

This chapter is organized as follows. Section 5.2 presents the details of a set of artificial dataset

models that present a challenge to existing random forest feature importance measures. Section 5.3 presents

the details of our proposed feature importance measure. Section 5.4 presents an overview of our approach

to iterated reweighting of random forest models. Section 5.5 presents an overview of our experimental

methodology. In Section 5.6, we present our results.

5.1.1 Existing Feature Importance Measures for Random Forest

We describe feature importance measures for random forests that were previously proposed by Breiman

[19, 21] and Strobl et al. [111].

Breiman’s Feature Importance Measure. Before discussing the details of Breiman’s feature impor-

tance measure, we must explain a detail of random forests. In a random forest model, each tree is trained

using a bootstrap dataset. Examples in the original dataset that are not used to train a tree are known as

out-of-bag or OOB examples. OOB examples can be used to calculate an error rate for a random forest

model; this error rate is called the OOB error rate. This error rate can be used as a type of validation error.

Breiman [19, 21] proposed using OOB error rates to calculate feature importance values as follows: (1)


a single random forest model is trained; (2) the OOB error rate is calculated; (3) for an individual feature,

the values of the feature in the OOB data are permuted, and the OOB error rate is recalculated; and (4) the

increase in the OOB error rate due to permutation is the feature importance value of a particular feature.

Mathematically, Breiman's feature importance for classification tasks is calculated as follows:

$$\text{feature importance for feature } m = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{N_{oob}} \left( \sum_{k=1}^{ntree} I[(x_i, y_i) \notin dataset_k] \, I(y_i \neq \tilde{y}_i) - \sum_{k=1}^{ntree} I[(x_i, y_i) \notin dataset_k] \, I(y_i \neq \hat{y}_i) \right), \quad (5.1)$$

and Breiman's feature importance for regression tasks is calculated as

$$\text{feature importance for feature } m = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{N_{oob}} \left( \sum_{k=1}^{ntree} I[(x_i, y_i) \notin dataset_k] \, (y_i - \tilde{y}_i)^2 - \sum_{k=1}^{ntree} I[(x_i, y_i) \notin dataset_k] \, (y_i - \hat{y}_i)^2 \right), \quad (5.2)$$

where $N_{oob} = \sum_{k=1}^{ntree} I[(x_i, y_i) \notin dataset_k]$ is the number of trees for which example $i$ is OOB, the $ntree$ bootstrapped datasets $dataset_k$ are used to train $ntree$ trees, $N$ is the number of examples, $I(\cdot)$ is an indicator function, $x$ is an input, $y$ is a target, $\hat{y}$ is a prediction before permuting feature $m$, and $\tilde{y}$ is a prediction after permuting feature $m$.

Breiman's feature importance measure may be normalized by dividing the importance values obtained from the equations above by their standard error. However, a study by Strobl and Zeileis [110] suggests that unnormalized feature importance values have better statistical properties. Thus, in this study, we restrict our experiments to unnormalized feature importance values.
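To make the permutation mechanics of Equations 5.1 and 5.2 concrete, here is a minimal Python sketch (not the thesis implementation) that computes permutation importance against a held-out validation set in place of the OOB examples; the synthetic dataset, model settings, and seeds are placeholder assumptions:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Placeholder data; the thesis computes importance on OOB examples instead.
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
    base_err = np.mean(rf.predict(X_val) != y_val)   # error before permutation

    rng = np.random.default_rng(0)
    importance = np.empty(X_val.shape[1])
    for m in range(X_val.shape[1]):
        X_perm = X_val.copy()
        rng.shuffle(X_perm[:, m])                    # permute feature m only
        perm_err = np.mean(rf.predict(X_perm) != y_val)
        importance[m] = perm_err - base_err          # increase in error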

Strobl’s Feature Importance Measure. Strobl et al. [111] argued that Breiman’s feature importance

measure is biased and is prone to give correlated features an increased feature importance value. They propose

a new feature importance measure, which we refer to as the Strobl importance, that avoids the bias. Rather

than explain the details of their argument, we focus on differences in the calculation of Strobl’s and Breiman’s

feature importance measures.

The main difference between Strobl’s and Breiman’s feature importance measures is how the OOB

examples are permuted. To calculate Breiman’s feature importance, we permute all values for a feature


without considering the values of remaining features. In contrast, to calculate Strobl’s feature importance,

we permute values for a feature conditional on values of the remaining features.

5.2 A Challenge for Existing Feature Importance Measures in Random Forests

In this section, we discuss a family of artificial datasets whose features existing feature importance measures are unable to rank correctly. Our goal is to motivate a new importance measure that is able to rank the features correctly. First, we describe the nonlinear model used to generate the datasets.

5.2.1 Description of the Artificial Dataset Models

We create three different artificial regression datasets. Each data set is obtained from the following

nonlinear generative model:

$$y_i = \prod_{j=1}^{5} \big[\sin(x_{ij}) + 1\big], \quad (5.3)$$

where $y_i$ is the target value generated from input $x_i$ for example $i$. We label the five features used

to generate a target value relevant features. They are uncorrelated with one another. Additionally, we add

irrelevant features that are correlated with the relevant features.

Each artificial dataset model differs in the number of relevant features correlated with the irrelevant

features. Next, we describe the three different artificial dataset models using their covariance matrix.

Artificial Dataset 1 has five irrelevant features in addition to the five relevant features, and the


covariance matrix describing the relationship among pairs of features is as follows:

$$\mathrm{Cov}(X) = \begin{pmatrix}
0.5 & 0 & 0 & 0 & 0 & 0.1 & 0 & 0 & 0 & 0.1 \\
0 & 0.5 & 0 & 0 & 0 & 0.1 & 0.1 & 0 & 0 & 0.1 \\
0 & 0 & 0.5 & 0 & 0 & 0.1 & 0.1 & 0.1 & 0 & 0 \\
0 & 0 & 0 & 0.5 & 0 & 0 & 0.1 & 0.1 & 0.1 & 0 \\
0 & 0 & 0 & 0 & 0.5 & 0 & 0 & 0.1 & 0 & 0.1 \\
0.1 & 0.1 & 0.1 & 0 & 0 & 0.2 & 0 & 0 & 0 & 0 \\
0 & 0.1 & 0.1 & 0.1 & 0 & 0 & 0.2 & 0 & 0 & 0 \\
0 & 0 & 0.1 & 0.1 & 0.1 & 0 & 0 & 0.2 & 0 & 0 \\
0 & 0 & 0 & 0.1 & 0 & 0 & 0 & 0 & 0.2 & 0 \\
0.1 & 0.1 & 0 & 0 & 0.1 & 0 & 0 & 0 & 0 & 0.2
\end{pmatrix}, \quad (5.4)$$

(the first five rows/columns correspond to the relevant features and the last five to the irrelevant features),

where each of the relevant features is correlated with two or three irrelevant features. Note that adjustments were made to values in the covariance matrix to make it positive semi-definite.

Artificial Dataset 2 has the following covariance matrix:

$$\mathrm{Cov}(X) = \begin{pmatrix}
0.5 & 0 & 0 & 0 & 0 & 0.1 & 0 & 0.1 & 0.1 & 0.1 \\
0 & 0.5 & 0 & 0 & 0 & 0.1 & 0.1 & 0 & 0.1 & 0.1 \\
0 & 0 & 0.5 & 0 & 0 & 0.1 & 0.1 & 0.1 & 0 & 0.1 \\
0 & 0 & 0 & 0.5 & 0 & 0.1 & 0.1 & 0.1 & 0.1 & 0 \\
0 & 0 & 0 & 0 & 0.5 & 0 & 0.1 & 0.1 & 0.1 & 0.1 \\
0.1 & 0.1 & 0.1 & 0.1 & 0 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 \\
0 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 \\
0.1 & 0 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 \\
0.1 & 0.1 & 0 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 \\
0.1 & 0.1 & 0.1 & 0 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1
\end{pmatrix}, \quad (5.5)$$

(again, the first five rows/columns correspond to the relevant features and the last five to the irrelevant features),

where four irrelevant features are correlated with each relevant feature. There are a total of five relevant features and five irrelevant features.

Artificial Dataset 3 has the following covariance matrix:

$$\mathrm{Cov}(X) = \begin{pmatrix}
0.5 & 0 & 0 & 0 & 0 & 0.1 \\
0 & 0.5 & 0 & 0 & 0 & 0.1 \\
0 & 0 & 0.5 & 0 & 0 & 0.1 \\
0 & 0 & 0 & 0.5 & 0 & 0.1 \\
0 & 0 & 0 & 0 & 0.5 & 0.1 \\
0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1
\end{pmatrix}, \quad (5.6)$$

(the first five rows/columns correspond to the relevant features and the last to the single irrelevant feature),


where a single irrelevant feature is correlated with all relevant features. There are a total of five relevant features and one irrelevant feature.

After the input matrix X is generated, we apply a tan(tan(·)) transform to the irrelevant features to reduce the linear correlation between the relevant and irrelevant features, in order to simulate the case of nonlinear correlation.
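As a concrete illustration, a minimal Python sketch that draws one dataset from Artificial Dataset Model 3 (the covariance of Equation 5.6) might look as follows; the sample size and seed are placeholder assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    # Covariance of Equation 5.6: five relevant features plus one irrelevant
    # feature that is correlated (0.1) with every relevant feature.
    cov = np.zeros((6, 6))
    np.fill_diagonal(cov, 0.5)
    cov[5, :] = cov[:, 5] = 0.1              # last row/column, incl. cov[5, 5] = 0.1
    # This matrix is positive semi-definite (its smallest eigenvalue is zero).
    X = rng.multivariate_normal(np.zeros(6), cov, size=1000)
    X[:, 5] = np.tan(np.tan(X[:, 5]))        # nonlinear transform of the irrelevant feature
    y = np.prod(np.sin(X[:, :5]) + 1.0, axis=1)  # target of Equation 5.3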

5.2.2 Results for Artificial Dataset Models

Now, we present our method and results on the three artificial datasets created to evaluate existing

importance measures for random forests.

Setup. From each of the three artificial dataset models, we generated 100 datasets each of which con-

tained 1000 training examples and 1000 validation examples. For each of the 3x100 datasets, we obtained

values for various feature importance measures.

We considered results for the 3x100 datasets for the four combinations of {Breiman’s feature impor-

tance measure, Strobl’s feature importance measure} × {random forest, conditional forest}. A conditional

forest uses a different type of decision tree from a random forest; Strobl’s feature importance was proposed

originally for conditional forests, and thus we also included results obtained using conditional forests.

We used the following two software packages: (a) random forests [67, 107], and (b) conditional forests

[62]. For both software packages, we set the number of trees ntree = 1000 and parametrically selected mtry

using validation data. For the algorithm combination {Strobl’s feature importance, conditional forest}, we

modified our experimental setup as follows: for each dataset, we trained a conditional forest model using a subsampled set of 750 of the available 1000 training examples (see footnote 1).

Method. Ideally, when applied to artificial datasets, the feature importance values determined by a

feature importance measure will be higher for relevant features than for irrelevant features. In order to

quantify whether relevant features actually have higher feature importance values than irrelevant features,

Footnote 1: The conditional forest package did not complete successfully for a dataset with 1000 training examples when trained on a computer with 64 GB of memory. Even after subsampling the dataset, the execution time for calculating feature importance was almost a day per dataset.


we define the following scoring rule:

$$\text{correct detection} = I\big[\min\{\mathrm{importance}(m, dataset_k)_{i \in relevant}\} \;\geq\; \max\{\mathrm{importance}(m, dataset_k)_{i \in irrelevant}\}\big], \quad (5.7)$$

where $\mathrm{importance}(m, D)_{k}$ is the set of feature importance values calculated using feature importance measure $m$ for feature set $k$ in the dataset $D$; $relevant$ is the set of relevant features and $irrelevant$ is the set of irrelevant features, both known a priori for the artificial dataset models. When a feature importance measure ranks all relevant features higher than all irrelevant features, the correct detection scoring rule has a value of one; otherwise, it has a value of zero.
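A minimal Python rendering of this scoring rule (an illustration, not thesis code; the importance vector and index sets are assumed inputs) follows:

    import numpy as np

    def correct_detection(importance, relevant, irrelevant):
        # Equation 5.7: 1 iff every relevant feature scores at least as high
        # as every irrelevant feature under the given importance measure.
        imp = np.asarray(importance)
        return int(imp[list(relevant)].min() >= imp[list(irrelevant)].max())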

Results. In Figure 5.1, we plot the correct detection rate obtained by four combinations of the feature

importance measure {Breiman’s feature importance measure, Strobl’s feature importance measure} and

forest type {conditional forests, random forests} on the 3x100 artificial datasets. We also plot the results

obtained for a new importance measure which we term retraining-based feature importance; the importance

measure will be discussed further in Section 5.3.

Based on Figure 5.1, we can draw two conclusions. First, for the majority of replications of artificial dataset models 2 and 3, existing feature importance measures incorrectly rank an irrelevant feature higher than one of the relevant features. Second, the only feature importance measure that consistently ranks relevant features higher than irrelevant features is the retraining-based feature importance (which we describe shortly).

Many other machine learning algorithms have problems with these data sets. For more information

refer to Appendix C.5. Having presented the case that there is a class of situations for which existing feature-

importance measures fail for random forests (and other machine learning models), we now turn to proposing

a feature-importance measure that we hypothesize is more meaningfully related to the relevance of features

to a task.


[Figure 5.1 near here. Panels (top to bottom): difference in error rates (models with all features minus models with only relevant features) using random forests; correct/incorrect detection counts for Random Forests + Breiman's Feature Importance; Random Forests + Strobl's Feature Importance; Conditional Forests + Breiman's Feature Importance; Conditional Forests + Strobl's Feature Importance; and Random Forests + Retraining-based Feature Importance, each over Dataset Models 1-3.]

Figure 5.1: Feature importance results for random forests and conditional forests on 3 artificial dataset models. From each artificial dataset model, we generate 100 datasets. We plot the number of times correct and incorrect detection is performed by an importance measure using the correct detection scoring rule defined in Equation 5.7; incorrect detection = 1 − correct detection.

5.3 Retraining-Based Feature Importance

Breiman’s feature importance measure quantifies the importance of a feature as the increase in OOB

error when the feature is permuted; furthermore, both Breiman’s and Strobl’s feature importance measures


are calculated using a single model. We propose a different formulation of feature importance that is based

on training multiple models.

Retraining-based feature importance for feature i is obtained by permuting feature i so that the feature

value for training example e now corresponds to the feature value of some other example e′. Following training

on the data set with feature i permuted, the increase in error can be measured. To minimize variance due to

the particular permutation chosen, we repeat the permutation-and-retraining procedure k times. Formally,

we define feature importance for feature i as:

$$\text{Feature importance for feature } i = \frac{1}{k} \sum_{k} \Big[ \mathrm{Error}_{OOB}\big(\mathrm{TrainRF}(X_{\setminus i}, y)\big) - \mathrm{Error}_{OOB}\big(\mathrm{TrainRF}(X, y)\big) \Big], \quad (5.8)$$

where $X$ is the input matrix of size $n \times p$, $y$ is the vector of size $n$ containing target values or labels, $X_{\setminus i}$ is the matrix $X$ with feature $i$ permuted, $k$ is the number of resamples, $\mathrm{TrainRF}(X, y)$ parametrically trains a random forest model using the input data $X$ and target values or labels $y$, and $\mathrm{Error}_{OOB}(\mathrm{TrainRF}(X, y))$ returns the OOB error rate of the random forest model trained using $\mathrm{TrainRF}(X, y)$. The $k$ resamples are performed to improve confidence in the calculated feature importance values. For a single resample, we train two models, one with the original input matrix $X$ and the other with the input matrix $X_{\setminus i}$, which has feature $i$ either permuted or removed; the feature importance for a feature is quantified as the increase in OOB error. In order to make the OOB error rate repeatable, we ensure that each individual tree in both random forest models is trained with the same random seed and the same in-the-bag examples; note that an individual tree in a random forest model is trained using a bootstrap dataset obtained from the original dataset, and in-the-bag examples are the examples present in that bootstrap dataset.
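The following Python sketch approximates Equation 5.8; it is an illustration under stated assumptions, not the thesis implementation (which modifies the random forest package so that both models share per-tree seeds and in-bag examples). Here, giving both forests the same random_state at least pairs their bootstrap draws, X and y are assumed NumPy arrays, and (1 − oob_score_) serves as an OOB error proxy for sklearn regressors:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def retrain_importance(X, y, i, k=30, seed=0):
        # Equation 5.8: mean increase in OOB error after permuting feature i
        # in the training data and retraining the forest, over k resamples.
        rng = np.random.default_rng(seed)
        diffs = []
        for _ in range(k):
            rs = int(rng.integers(1 << 31))  # shared seed pairs the bootstraps
            base = RandomForestRegressor(n_estimators=200, oob_score=True,
                                         random_state=rs).fit(X, y)
            X_perm = X.copy()
            rng.shuffle(X_perm[:, i])        # permute feature i before training
            perm = RandomForestRegressor(n_estimators=200, oob_score=True,
                                         random_state=rs).fit(X_perm, y)
            # oob_score_ is R^2 for regressors, so 1 - R^2 acts as an error.
            diffs.append((1 - perm.oob_score_) - (1 - base.oob_score_))
        return float(np.mean(diffs))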

The retraining-based importance measure we just defined in Equation 5.8 is cast in terms of the scaleless measure of 'error', which has no intrinsic interpretation. Is an error of .45 good? No one can say. Consequently, we define a second retraining-based importance measure in a currency that is readily interpreted: the currency of probability.

We can reformulate feature importance in terms of the probability that the error will increase as a


result of permuting (or removing) a feature. We define

$$\text{Feature importance for feature } i = 1 - \text{Probability from } \mathrm{ttest}[perm, baseline], \quad (5.9)$$

where $perm \in \mathbb{R}^k$, $baseline \in \mathbb{R}^k$, and, for $m = 1, \ldots, k$,

$$perm_m = \mathrm{Error}_{OOB}(\mathrm{TrainRF}(X_{\setminus i}, y)), \qquad baseline_m = \mathrm{Error}_{OOB}(\mathrm{TrainRF}(X, y)).$$

As before, $X$ is the input matrix of size $n \times p$, $y$ is the vector of size $n$ containing target values or labels, $X_{\setminus i}$ is the matrix $X$ with feature $i$ permuted, $k$ is the number of resamples, $\mathrm{TrainRF}(X, y)$ parametrically trains a random forest model using the input data $X$ and target values or labels $y$, and $\mathrm{Error}_{OOB}(\mathrm{TrainRF}(X, y))$ returns the OOB error rate of that model. $\mathrm{ttest}(perm, baseline)$ is the one-tailed paired Student's t-test; it tests the null hypothesis that the random variable $perm - baseline$ has a normal distribution with mean equal to zero and unknown variance. Logically, we are trying to obtain a probability value for the alternative hypothesis: that the random variable $perm - baseline$ has a mean greater than zero. If the OOB error rate on permuting feature $i$ is not reliably worse, the one-tailed t-test will return a probability value near 1 and the feature importance value for feature $i$ will be near 0. In contrast, when there is a significant increase in the OOB error rate when feature $i$ is permuted, according to the one-tailed t-test, the feature importance value of feature $i$ will be near 1.
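A small Python sketch of Equation 5.9 (an illustration, not thesis code) using scipy's paired t-test with a one-tailed alternative:

    import numpy as np
    from scipy import stats

    def prob_importance(perm, baseline):
        # One-tailed paired t-test of the alternative mean(perm - baseline) > 0;
        # Equation 5.9 maps its p-value into an interpretable importance.
        _, p = stats.ttest_rel(perm, baseline, alternative="greater")
        return 1.0 - p

    # Example (made-up numbers): OOB errors over k = 5 resamples.
    perm = np.array([0.21, 0.22, 0.20, 0.23, 0.22])
    baseline = np.array([0.18, 0.19, 0.18, 0.18, 0.19])
    print(prob_importance(perm, baseline))  # close to 1: the feature matters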

5.4 Approach

In this section, we provide the details of our research. Our three key contributions are summarized in

Section 5.4.1. Two of our contributions were discussed in Section 5.3. Our third contribution is discussed in

Section 5.4.2.

5.4.1 Key Contributions

We briefly describe our three key contributions.

Retraining-Based Feature Importance. In Section 5.2, we discussed a set of artificial dataset models in which irrelevant features are correlated with the relevant features that generate the target values.


Existing random forest importance measures, including those proposed by Breiman [19] and Strobl et al.

[111], are unable to consistently rank relevant features higher than irrelevant features. In Section 5.3, we

presented a new feature importance measure that uses retrained models and is able to correctly rank relevant

features higher than irrelevant features. Our proposed feature importance measure uses relative increases in

OOB error.

Retraining-Based Feature Importance in Terms of Probability of Error. In Section 5.3, we

motivated the idea that feature-importance values obtained from existing measures are difficult to interpret

as they are based on calculating the relative increase in OOB error. We argued that a one-tailed paired

Student’s t-test can be utilized to obtain a probability value associated with the hypothesis that permuting

or removing a feature reliably increases the OOB error. Such probability values provide an interpretive

meaning to feature importance values.

Iterated Reweighting of Random Forest Models. Before we discuss our proposed algorithm for

iterated reweighting of random forests models, we briefly note how existing feature selection methods for

random forests use feature importance measures. Procedures such as variable selection using random forests

(varSelRF) [37] and related procedures begin with the full complement of features, and models are iteratively

created by removing the features with the lowest feature importance values. The final model is selected based

on results on held-out validation data or OOB data; logically, the features used to train models are selected

in an all-or-none fashion. In contrast, we propose a strategy in which continuous weights are assigned to

individual features; features with high weights have a greater probability than features with low weights of

being used to train a model. We perform multiple iterations of model training to reweight features such that

the least useful features eventually obtain a weight of zero.

Next, we discuss the details of the iterated reweighting of random forest models.

5.4.2 Details of Iterated Reweighting of Random Forest Models

To perform iterated reweighting of random forest models, we assign continuous weights between zero and one to individual features; features with large weights have a greater probability than features with


small weights of being used to train a model. We perform multiple iterations of model training to iteratively

reweight features such that the least useful features eventually obtain a weight of zero.

To explain the procedure for iterated reweighting, we must first explain where the weights come from,

and then we must explain how a random forest model is trained with biasing from the specified weights.

Weights. The weights are derived from feature importance values; in iterated reweighting of random

forest models, features with higher feature importance values have a greater probability of being used to train

a model.

Figure 5.2: Comparison between the search over features at the nodes of trees in random forests and in FIbRF. The number of available features is indicated by p.

Feature Importance Biased Random Forests (FIbRF). The other component for iterated

reweighting of random forest models is an algorithm that trains a random forest model biased by the

specified weights. We now make a proposal for how the biasing should be done.

In the standard random forest algorithm, at each node in a decision tree, mtry features are randomly

selected with uniform probability from p available features. We propose a modification to the random forest

algorithm called feature importance biased random forests (FIbRF), in which mtry features are selected not

with a uniform probability, but instead with probability dependent on pre-specified weights. The FIbRF


algorithm is explained in Algorithm 4 and in Figure 5.2.

Algorithm 4: FIbRF: Routine Called at Each Node of Individual Decision Trees in a Random Forest.

Input: Value of mtry; weight vector w of size p, where p is the number of available features.
Output: An array mtry_array containing the features to search over, instead of randomly sampling mtry features with uniform probability from the p available features.

    t <- mtry / p
    p_array <- {1, ..., p}
    mtry_array <- {}
    mtry_num <- 0
    prob is a vector of size p
    forall i in p_array do
        prob_i <- w_i / (sum over k in p_array of w_k)
    while t > 0 and mtry_num < mtry do
        Randomly select the feature labeled j from p_array based on the probabilities in prob
        /* Once a feature is added to mtry_array, it is not sampled again from p_array */
        mtry_num <- mtry_num + 1
        t <- t - prob_j
        p_array <- p_array \ {j}
        mtry_array <- mtry_array U {j}
    return mtry_array
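A Python sketch of Algorithm 4 (illustrative only; the thesis implements this inside the random forest package's node-splitting code, and the function name here is hypothetical) might read:

    import numpy as np

    def fibrf_mtry_features(w, mtry, rng=None):
        # Sample candidate split features without replacement, with probability
        # proportional to the weights w, following Algorithm 4.
        rng = np.random.default_rng() if rng is None else rng
        w = np.asarray(w, dtype=float)
        p = len(w)
        t = mtry / p
        pool = list(range(p))
        chosen = []
        while t > 0 and len(chosen) < mtry and pool:
            total = w[pool].sum()
            if total == 0:                         # remaining features all have zero weight
                break
            probs = w[pool] / total
            idx = rng.choice(len(pool), p=probs)   # position within the pool
            t -= probs[idx]
            chosen.append(pool.pop(idx))
        return chosen

    # e.g., fibrf_mtry_features(w=[1.0, 0.3, 0.0, 0.7], mtry=2)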

Tuning Feature Importance Values for Use as FIbRF Weights. Because the values obtained from

feature importance measures for different datasets are on different scales, we perform a scaling operation on

the feature importance values before using them as weights. Tuning parameter λ determines the mapping

from feature importance values to weights as follows:

$$\text{Weight } w_i \text{ in FIbRF for feature } i = \big(\text{feature importance value for feature } i \text{ [scaled between 0 and 1]}\big)^{\lambda}, \quad \lambda \geq 0. \quad (5.10)$$


Note that when λ = 0, the sampling of features at the nodes in individual decision trees is equivalent for

both FIbRF and random forest. For FIbRF, we perform a grid search over (mtry, λ). We recommend the

following range for λ: {0, 0.25, 0.5, 1}. As we discuss in Appendix C.3, the ntree parameter can be set to a

fixed value.

The tuning parameter is a power function. We scale the feature importance values between zero and

one before using the tuning parameter, and thus the weights are also bounded between zero and one for

λ ≥ 0; in Section 5.5, we discuss how to handle negative feature importance values that may be produced

by feature importance measures based on OOB error.
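In code, the weight mapping of Equation 5.10, combined with the adjustment for negative importance values discussed in Section 5.5, might look like the following sketch (an illustration, not thesis code; the function name is hypothetical):

    import numpy as np

    def importance_to_weights(imp, lam):
        # Clip negative importance values to zero (Section 5.5), scale to [0, 1],
        # then apply the power-function tuning parameter of Equation 5.10.
        imp = np.clip(np.asarray(imp, dtype=float), 0.0, None)
        scaled = imp / imp.max() if imp.max() > 0 else np.ones_like(imp)
        return scaled ** lam   # lam = 0 recovers uniform feature sampling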

5.5 Methodology

In this section, we discuss datasets, experimental setup, and evaluation metrics used for our study.

5.5.1 Datasets

The 17 classification datasets used in our experiments are listed in Table 5.1; a detailed description

of the datasets is presented in Appendix C.1. Our criteria for selecting these datasets is as follows:

- We select datasets that have a couple thousand examples and at least 25 but no more than 250 features. High-dimensional datasets (> 250 features) are omitted, as they may make experiments computationally intractable.

- Datasets with fewer examples (< 1000) are omitted, as the results are sensitive to the size of the training and testing data.

In addition to these standard data sets used to evaluate machine learning algorithms, we obtained

data from two additional domains. The description of the datasets is as follows:

(1) Medical data describing blood loss in patients [82]. The goal is to predict the amount of blood loss

induced in a patient based on inputs from various sensors such as pulse oximeters and heart rate


Dataset          # of Classes   Class Percentage    # of Examples               # of Features   Source
covertype        5              37/49/7/3/3         Sampled 4884 from 581K      54              [5, 10]
                                                    (classes 4 and 5 dropped)
ibn sina         2              62/38               Sampled 4000 from 20722     92              [57]
sylva (raw)      2              83/17               Sampled 4805 from 13086     108             [56]
zebra            2              equal               Sampled 4000 from 61488     152             [57]
musk             2              85/15               6598                        166             [5]
hemodynamic 1    5              24/15/16/13/32      4594                        35              [82]
hemodynamic 2    2              equal               4000                        43              [82]
hemodynamic 3    5              15/30/21/15/19      7000                        40              [82]
hemodynamic 4    6              equal               3000                        31              [82]
zernike          10             equal               2000                        47              [5, 118]
karhunen         10             equal               2000                        64              [5, 118]
fourier          10             equal               2000                        76              [5, 118]
factors          10             equal               2000                        216             [5, 118]
pixel            10             equal               2000                        240             [5, 118]
power 1          2              49/51               2000                        191             [54]
power 2          2              equal               2000                        191             [54]
power 3          2              equal               3000                        191             [54]

Table 5.1: Seventeen datasets used in our experiments. The number of examples ranges from 2000 to 7000, and the number of features ranges from 31 to 216.

monitors, among others. These datasets are listed in Table 5.1 as hemodynamic X, where X=1,2,3,

and 4.

(2) Data obtained from a power grid layout [54]. The goal is to discriminate between optimal and

sub-optimal power grid configurations, based on the connections present in the power grid and the

cost and reliability induced by those connections. These datasets are listed in Table 5.1 as power X,

where X=1,2, and 3.

Testing Methodology. For each of the 17 classification datasets, we randomly split the data into

two equal sets, using one set for training and the other for testing; we report our results averaged over 10

repetitions. Also, the same training-testing sets were used for all algorithm combinations.


Next, the different algorithm combinations of feature importance measures and iteration schemes are

discussed. For brevity, we use the term algorithm as a shorthand for the 8 algorithm combinations.

5.5.2 Experimental Setup

In our experiments, we explore both alternative approaches to computing feature importance and ap-

proaches to utilizing the importance. The four alternatives for computing feature importance are: Breiman

feature importance, Strobl feature importance, retraining-based feature importance using OOB error, and

retraining-based feature importance using probability of error. We earlier proposed an iterated reweighting

scheme for using feature importance. In order to evaluate this scheme, we compare it to the standard ap-

proach to utilizing importance: iterated removal. In iterated removal, the least important feature is removed

at each iteration of the feature selection process. Our goal is to identify the best algorithm combination of

a feature importance measure and an iterative scheme, based on performance of the algorithm combination

over many data sets.

Next, we list the various experimental parameters and conditions set for the eight algorithm combi-

nations. For brevity, we use the term algorithm as a shorthand for algorithm combination.

Number of Resamples for Calculating Feature Importance. The number of resamples is 30 for all algorithms. For feature importance values obtained using Breiman's method or Strobl's method, the number of resamples controls the number of times feature importance values are calculated and averaged; this is equivalent to setting the nperm parameter to 30 in the randomForest software package [78].

Criteria for Explicit Removal of Features. For algorithms that use iterated removal, at the end of

each iteration we remove the feature with the lowest feature importance value. Algorithms that use iterated

reweighting do not explicitly remove features.

Adjustment of Feature Importance Values. We adjust feature importance values for the following

algorithms:

- For algorithms that use iterated reweighting, we set all features with importance values less than zero to zero; note that feature importance values based on the OOB error rate may have values less than zero.

- For algorithms that use retraining-based feature importance based on probability of error, we set all probability values less than 0.05 to zero.

Parameters and Parametric Search. Parameter(s) and the range of values considered for those

parameter(s) are listed for the following algorithms:

- For all algorithms, we set the number of trees (ntree) to 1000. We present our justification in Appendix C.3. (For the datasets used in this study, increasing the number of trees from 1000 to 2000 did not significantly increase accuracy; in contrast, increasing the number of trees from 500 to 1000 produced a significant increase in accuracy.)

- For algorithms that use iterated removal, we consider the following range of values for mtry: {p/10, 2p/10, 3p/10, ..., p, sqrt(p)}, where p is the number of features.

- For algorithms that use iterated reweighting, we perform a grid search over (mtry, λ). We consider the following range of values: (1) for mtry: {p/10, 3p/10, 5p/10, ..., p, sqrt(p)}, where p is the number of features; and (2) for λ: {0, 0.25, 0.5, 1}.

Stopping Criterion. A general stopping criterion for all algorithms is an increase of 25% in the OOB

error rate over the lowest observed OOB error rate among all previous iterations. An additional stopping

criterion for algorithms that use iterated reweighting is maintenance of the same feature set for 25 iterations.

Selection of the Final Model. For all algorithms, at the end of each iteration, we save the values of the OOB error rate and the test set error rate, which are averaged over three runs using the best parameters selected via a parametric search. After termination due to satisfying a stopping criterion, the model with the lowest OOB error rate is selected as the final model, and the error rate on the test set is reported.

Software. Our experiments were performed using Breiman's random forest implementation [19] as provided by Jaiantilal [67]. We modified the package to use FIbRF and to allow the use of pre-specified random seeds and in-the-bag examples for individual trees of a random forest model. The extendedForest package [107] was used to implement Strobl's importance measure for random forests.

5.5.3 Evaluation Metrics

For our experiments, we relied primarily on the error metric to evaluate and compare the performance

of various algorithms.

Classification Error Percentage. For classification datasets, we calculate the classification error percentage as $100 \cdot \frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i)$ for $n$ examples, where $y_i$ is the target label for the $i$th example, $\hat{y}_i$ is the predicted label for the $i$th example, and $I(\cdot)$ is an indicator function.

Comparison of Multiple Algorithms Using Significance testing. When comparing two or more

algorithms, we can calculate the average error across datasets. However, errors are not commensurate over

different datasets, and a large variation on a single dataset may be enough to give one of the algorithms an

unfair advantage over the others.

An alternative method is to use a paired Student's t-test or the Wilcoxon signed-rank test [124] to compare the significance of differences in error between individual algorithms, but such tests are overly optimistic when multiple algorithms are compared, as they do not account for the fact that some pairwise comparisons may be rejected due to random chance alone.

To compare multiple algorithms, Demsar [35] suggested the Friedman test, which ranks algorithms’

performance using error results for individual datasets, from best to worst. If after performing the Friedman

test, the hypothesis that the algorithms are similar is rejected, one can use a post-hoc test, such as the Holm

test, which adjusts the significance value based on the number of classifiers present; such an adjustment

takes into account the fact that many pairwise comparisons may be rejected due to random chance. García

and Herrera [52] provide software that lists the pairwise hypotheses that are rejected based on the results of

the Holm test.
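As a concrete sketch of this pipeline (hypothetical numbers, not results from the thesis), the Friedman omnibus test is available in scipy, and the Holm step-down adjustment is simple enough to apply directly to the subsequent pairwise p-values:

    import numpy as np
    from scipy import stats

    # Hypothetical percent errors: rows are datasets, columns are algorithms.
    errs = np.array([[12.1, 12.5, 13.0],
                     [ 8.4,  8.2,  9.1],
                     [21.0, 22.3, 22.0],
                     [15.2, 15.9, 16.4]])
    stat, p = stats.friedmanchisquare(*errs.T)  # omnibus test across algorithms

    def holm_adjust(pvals):
        # Holm step-down: multiply the i-th smallest p-value (0-based i) by
        # (m - i), enforcing monotonicity, before comparing to the alpha level.
        pvals = np.asarray(pvals, dtype=float)
        order = np.argsort(pvals)
        m = len(pvals)
        adj = np.empty(m)
        running = 0.0
        for rank, idx in enumerate(order):
            running = max(running, (m - rank) * pvals[idx])
            adj[idx] = min(1.0, running)
        return adj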

Note that the comparison of multiple algorithms via the Friedman test and post-hoc testing has


disadvantages when the difference between algorithms is not significant. Assuming that the error values from the algorithms are obtained through a cross-validation technique such as Repeated Learning and Testing (RLT) [4, 25, 38], the ranking may be incorrect if the differences across algorithms applied to a dataset arise from random chance alone, as when all algorithms applied to a given dataset obtain the same or similar final models. (In RLT, we randomly split the data into two sets, using the larger set for training a model and the smaller set for evaluating it; this procedure is repeated multiple times to obtain an estimate of the test error.)

Next, we discuss our results.

5.6 Results

In this section, we present our results for the eight possible algorithm combinations of {Breiman's

feature importance measure, Strobl’s feature importance measure, a retraining-based feature importance

measure based on OOB error, a retraining-based feature importance measure based on probability of error}

× {iterated reweighting, iterated removal}.

For the sake of convenience, we refer to the retraining-based feature importance measure based on

OOB error as Err, the retraining-based feature importance measure based on probability of error as Prob,

Breiman’s feature importance measure as Breiman, Strobl’s feature importance measure as Strobl, iterated

reweighting as reweighting, and iterated removal as removal.

Note that numerical tables for the results are presented in Appendix C.2 (page 133). Next, we present

the overall results and provide an analysis of those results.

5.6.1 Overall Results

This section concerns results obtained using 17 real world datasets.

Note that for some of the figures, we also include the results from the baseline model, which is a

random forest model trained using all features and an mtry value that is parametrically selected.

In Figure 5.3(a), we show the average percent error across the 17 datasets. We can make the following main conclusion: the algorithms that use retraining-based feature importance measures beat out those based on the Breiman and Strobl measures.

[Figure 5.3 near here. Panel (a), average percent error (y-axis, 11.8 to 13.6), algorithms left to right: Strobl+Reweighting, Baseline, Strobl+Removal, Breiman+Reweighting, Breiman+Removal, Prob+Reweighting, Err+Reweighting, Prob+Removal, Err+Removal. Panel (b), % relative reduction of error (y-axis, −10 to 5), algorithms left to right: Strobl+Reweighting, Strobl+Removal, Breiman+Reweighting, Err+Reweighting, Breiman+Removal, Prob+Reweighting, Prob+Removal, Err+Removal.]

Figure 5.3: For all real-world datasets, we plot: (a) average percent error and standard error for the different algorithms, and (b) average percent relative reduction in error and standard error for the different algorithms. The relative reduction is calculated in comparison to a baseline model. Algorithms are presented left to right based on increasing improvement. Normalization of the standard deviation was performed as suggested by Masson and Loftus [80].

In Figure 5.3(b), we show the relative reduction in error compared to the baseline model. The relative

reduction in error is a better measure than absolute error as one can quantify improvement over a control

condition (in our case, the baseline model). Based on the results in the Figure, it is clear that the three top

performing algorithms use retraining-based feature importance measures.

5.6.2 Analysis Based on % Reduction of Error

We began our analysis with an examination of the following two factors: (1) importance measures,

one of {Breiman, Strobl, Err, Prob}, and (2) iterative schemes, one of {reweighting, removal}. We per-

formed an analysis of variance (ANOVA) on the percent relative reduction in error and found that there

is a marginally significant interaction between the factors of importance measure and the iterative scheme

[F(306,2)=2.975, p=0.053]; in addition, both importance measure [F(306,2)=48.385, p <1e-3] and iterative

schemes [F(153,1)=69.171, p <1e-3] are significant factors. In Figure 5.4, we plot the mean percent relative

reduction of error for the two factors across all datasets. These results allow for the following conclusion:

reweighting performs worse than removal. Henceforth, our analysis will be primarily based on iterated

removal.

Comparing Algorithms Using Removal Based on a Statistical Procedure. Next, we compared

algorithms that use iterated removal. In Figure 5.5, we show statistical results based on significance testing

using the Friedman rank test and the Holm procedure as suggested by García and Herrera [52]; these

statistical results were obtained using the percent error results across all datasets. Graphically, we arrange

the algorithms from top (best performing algorithm) to bottom (worst performing algorithm) according

to their average Friedman ranks across all datasets. Pairs of algorithms whose outcomes are statistically

indistinguishable (cannot be rejected at p = 0.05 according to the Holm procedure) are connected in the

Figure. Thus, algorithms that are significantly different from each other are not connected. We also note

the average percent error and the reduction in percent error from that of the baseline model in parentheses

next to the algorithm name.


[Figure 5.4 near here. Mean % relative reduction in error (y-axis, −12 to 8) for Reweighting vs. Removal, with one line each for Prob, Err, Breiman, and Strobl.]

Figure 5.4: Analysis of four importance measures and two elimination schemes.

For Figure 5.5, we can draw the following conclusion: Prob+Removal has the highest rank but is

insignificantly different from Breiman+Removal and Err+Removal.

Next, we analyze the differences between algorithms on the basis of their performance on individual

datasets.

Is There a Difference Between Retraining and Non-Retraining-Based Feature Importance?

In this analysis, we considered only combinations that use removal. For each dataset and each train-

ing/test split for that dataset, we calculated the average percent reduction in error across algorithms that

use retraining-based feature importance measures; similarly, we calculated the average percent reduction in error across algorithms that use non-retraining-based feature importance measures (Breiman and Strobl). In Figure 5.6, we compare these two conditions and show their results on individual datasets and individual training/test splits of those datasets. We use a red circle (◦) to indicate the difference in percent reduction in

error on the same training/test splits; a positive value on the y-axis implies that the retraining-based feature

importance measures perform better on average than the non-retraining-based feature importance measures.

As Figure 5.6 shows, there are 10 datasets for which retraining-based feature importance measures


[Figure 5.5 near here. Friedman ranks, best to worst: Prob+Removal (11.96%/6.72%); Err+Removal (11.90%/7.12%); Breiman+Removal (12.52%/2.77%); Strobl+Removal (12.66%/0.83%).]

Figure 5.5: Algorithms are arranged from top to bottom according to their Friedman test ranks. Algorithms that are insignificantly different from each other (p=0.05 according to the Holm procedure) are connected. (Average percent error over datasets and percent reduction of error over the baseline model) are noted in parentheses after the algorithm name.

perform better on average than non-retraining-based feature importance measures. There are 7 for which

the opposite is true, but the reduction in error is clearly larger for the 10 than the increase in error is for the

7. Thus, we conclude that retraining is a useful feature to include in random forest training when feature

selection is being performed.

Comparison of Err+Removal to Other Removal Algorithms. We observed that there are distinct datasets for which algorithms that use retraining-based feature importance measures outperform

algorithms that use non-retraining-based feature importance measures. Next, we examine the example

of Err+Removal, an algorithm that uses a retraining-based feature importance measure, and compare

it with Breiman+Removal and Prob+Removal. We do not show the comparison of Err+Removal to

Strobl+Removal, as the results were similar to the comparison to Breiman+Removal.

[Figure 5.6 near here: differences in % reduction of error (y-axis, −40 to 40) between retrained and non-retrained importance measures for individual datasets; red circles mark individual folds, and the mean difference per dataset is also shown.]

Figure 5.6: Difference in percent reduction of error between algorithms that use retraining-based and non-retraining-based feature importance measures, collapsed across removal schemes. The difference in percent reduction of error on the same training/test split of a dataset is represented by a red circle (◦).

In Figure 5.7(a), we compare Err+Removal to Breiman+Removal. We can draw the following two

conclusions: (1) Err+Removal performs better than Breiman+Removal on ten datasets and for those datasets

the difference in percent reduction in error is ≥ 5%; and (2) both algorithms perform differently depending

on the dataset.

In Figure 5.7(b), we compare Err+Removal to Prob+Removal. We can draw the following conclusion:

the two algorithms are similar to each other; on about ten of the datasets, the difference in percent reduction

in error between the algorithms is less than 2%.

5.6.3 Analysis Based on Feature Selection

Next, we analyze differences in feature discovery across the eight different algorithms.

Similarity of Algorithms Based on Features Selected. To compare features discovered by any

two algorithms, for each individual training/test split, we calculated a similarity score, which is defined as

follows:

$$\text{Similarity Score for Features Discovered}(a, b) = 100 \left[ 1 - \frac{1}{k \cdot t} \sum_{m=1}^{k} \frac{1}{D_m} \sum_{i=1}^{D_m} \big| f\_counts(m, a, i) - f\_counts(m, b, i) \big| \right], \quad (5.11)$$

where $m = 1, \ldots, k$ indexes the datasets, $i = 1, \ldots, D_m$ indexes the features of dataset $m$, $t$ randomized trials are performed per dataset, $a$ and $b$ are the algorithms being compared, and $f\_counts(m, a, i)$ returns the number of times feature $i$ was selected by algorithm $a$ for dataset $m$. Since $t$ randomized trials are run per dataset, $f\_counts(m, a, i)$ lies between 0 and $t$.

[Figure 5.7 near here. Panel (a): Err+Removal vs. Breiman+Removal; panel (b): Err+Removal vs. Prob+Removal. The y-axis is the difference in % reduction of error (−30 to 50) across datasets; red circles mark individual folds, and the mean difference per dataset is also shown.]

Figure 5.7: The difference for the same training/test split of a dataset is represented by a red circle (◦). We show: (a) comparison of Err+Removal and Breiman+Removal, and (b) comparison of Err+Removal and Prob+Removal.
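A direct Python transcription of Equation 5.11 (a sketch; the names are illustrative) follows:

    import numpy as np

    def similarity_score(counts_a, counts_b, t):
        # Equation 5.11: counts_a[m][i] is the number of times (0..t) feature i
        # was selected by algorithm a on dataset m over t randomized trials.
        k = len(counts_a)
        total = 0.0
        for fa, fb in zip(counts_a, counts_b):
            fa, fb = np.asarray(fa, dtype=float), np.asarray(fb, dtype=float)
            total += np.abs(fa - fb).mean()   # (1/D_m) * sum_i |difference|
        return 100.0 * (1.0 - total / (k * t))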

For the score to be valid, we have to ensure that the same training/test splits are used by both algorithms. If two algorithms discover similar feature sets, the similarity score will be high; otherwise, it will be low. A similarity score near 100 implies that, for individual training/test splits across multiple datasets, the algorithms selected very similar features.

In Figure 5.8, we make pairwise comparisons of various algorithms based on their similarity scores.

We shade the range of similar to dissimilar algorithms from white (similar) to black (dissimilar). We can

draw the following two conclusions:

- Err+Removal, Err+Reweighting, and Prob+Removal have high similarity scores with one another, whereas Prob+Reweighting has a lower similarity score with the others.

- Breiman+Removal and Strobl+Removal are similar to each other, but they have lower similarity scores with the retraining-based algorithms.

Sparsity of Models. In Figure 5.9, we plot the average number of features discovered by the eight

algorithms across all datasets. We can draw the following three conclusions:

(1) The sparsest models are created by the following retraining algorithms: Err+Reweighting,

Err+Removal and Prob+Removal. Err+Reweighting produced the sparsest model, but in terms

of the overall results, it performed worse than Prob+Removal and Err+Removal.


[Figure 5.8 near here: an 8 × 8 matrix of pairwise similarity scores, shaded from white (similar) to black (dissimilar). The recovered values are:

                             (1)     (2)     (3)     (4)     (5)     (6)     (7)     (8)
    (1) Prob+Removal          -    83.13   86.31   62.21   72.87   66.55   52.67   55.83
    (2) Err+Reweighting     83.13    -     84.81   61.35   74.39   65.51   53.68   57.06
    (3) Err+Removal         86.31  84.81    -      63.79   70.99   68.69   51.28   53.42
    (4) Prob+Reweighting    62.21  61.35   63.79    -      60.43   66.13   72.45   70.13
    (5) Breiman+Removal     72.87  74.39   70.99   60.43    -      79.97   71.62   71.97
    (6) Strobl+Removal      66.55  65.51   68.69   66.13   79.97    -      67.07   66.92
    (7) Breiman+Reweighting 52.67  53.68   51.28   72.45   71.62   67.07    -      88.13
    (8) Strobl+Reweighting  55.83  57.06   53.42   70.13   71.97   66.92   88.13    -    ]

Figure 5.8: Comparison of various algorithms on the basis of the similarity score discussed in Equation 5.11. Algorithms that are similar to each other are indicated with a lighter shade.

(2) Prob+Reweighting did not output a sparse model, unlike the other retraining-based algorithms.

(3) Err+Reweighting, Err+Removal, and Prob+Removal produced sparser models than Breiman+Removal,

as 31%-57% more features were discovered by Breiman+Removal.

If there is a cost associated with feature acquisition, then retraining-based algorithms are more promis-

ing than non-retraining-based (using Breiman or Strobl’s feature importance) algorithms, due to the lower

error rate and fewer features acquired.


[Figure 5.9 near here. Average number of features discovered (y-axis, 30 to 90), algorithms left to right: Err+Reweighting, Prob+Removal, Err+Removal, Breiman+Removal, Strobl+Reweighting, Prob+Reweighting, Strobl+Removal, Breiman+Reweighting.]

Figure 5.9: Sparsity of models created by different algorithms. Note that the average number of features available over all datasets is ≈ 114. Normalization of the standard deviation was performed as suggested by Masson and Loftus [80].

Sparsity of Models vs Reduction in Error for Retraining+Removal Algorithms. In Fig-

ure 5.10, we plot the percent reduction in error over the baseline model against percent features present

in the final model for Prob+Removal and Err+Removal. We can draw the following conclusion: there is a

linear relation between the percent features remaining in the final model and the percent reduction in error

obtained; that is, sparser models have a greater reduction in error than models trained with all features.

5.6.4 Computational Complexity

Our earlier analysis was based on the error percentage and feature selection abilities of various algo-

rithms. An important additional criterion to consider is the computational complexity of the algorithms.

[Figure 5.10 near here: scatter of % reduction of error (y-axis, −20 to 30) against % number of features in the final model (x-axis, 20 to 100); legend: pval-removal, err-removal.]

Figure 5.10: Sparsity of models vs. % reduction in error over the baseline model for Prob+Removal and Err+Removal.

In Figure 5.11, we compare the number of random forests trained relative to Strobl+Removal; we do not calculate the total running time, as the experiments were run on heterogeneous computers.

[Figure 5.11 near here: number of models trained relative to Strobl+Removal (x-axis, 0 to 120X) for Strobl+Removal, Breiman+Removal, Breiman+Reweighting, Strobl+Reweighting, Err+Reweighting, Prob+Reweighting, Prob+Removal, and Err+Removal.]

Figure 5.11: Computational complexity relative to Strobl+Removal (1X).

We

acknowledge that comparing the computational complexity of the algorithms based on the number of models

is approximate for the following two reasons: (1) the amount of time taken for real world datasets is dataset

specific, and (2) by only showing results for the number of models trained, we are ignoring the time that

is required to calculate Breiman’s and Strobl’s feature importance measures for algorithms that use those

measures.

From Figure 5.11, we can draw the following two conclusions: (1) the number of models trained

for Prob+Removal and Err+Removal is 100 times greater than the number of models required for

Strobl+Removal and Breiman+Removal, and (2) Prob+Reweighting and Err+Reweighting have a com-

putational cost that is an order of magnitude higher than Breiman+Removal and Strobl+Removal, but an

order of magnitude lower than Prob+Removal and Err+Removal.

A reasonable concern is the high computational complexity of the algorithms that use retraining-based feature importance measures, which have a computational complexity of $O(p^2)$ in the number of models created, where $p$ is the number of features. We argue that our results show that there are distinct datasets for which algorithms that use retraining-based feature importance measures show a significant reduction in error rate when compared to algorithms that use existing feature importance measures; thus, we can justify the use of a computationally expensive procedure. We can also consider the advantages of the parallel nature of training a random forest model [105] and of obtaining the retraining-based feature importance; we can train multiple models in parallel to calculate the retraining-based feature importance.
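As a sketch of that parallelization (assuming the hypothetical retrain_importance function from the Section 5.3 sketch and an in-memory NumPy dataset X, y), joblib can distribute one retraining job per feature:

    from joblib import Parallel, delayed

    # One retraining-based importance computation per feature, run in parallel.
    importances = Parallel(n_jobs=-1)(
        delayed(retrain_importance)(X, y, i) for i in range(X.shape[1]))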

We briefly note the computers used for our study. Our experiments were performed over a couple of months on commodity hardware consisting of ten computers, each of which contained either quad-core, hyper-threaded Intel processors or multicore AMD processors. (On a personal note, the performance of a quad-socket machine, obtainable at a cost of $10000 in 2011, is now matched in computational efficiency by a dual-socket machine obtainable at a cost of $3000 in 2012. In 2012, at a cost of $15000, we bought off-the-shelf computers whose total computational power was comparable to that of the fastest supercomputer in 1999, ASCI-RED ( http://www.top500.org/system/168753 ); ASCI-RED was constructed at a cost of $185 million. Note that we are comparing only advances in computing due to newer CPU architectures and manufacturing processes; even larger advances are possible with accelerators like GPUs.)


5.7 Conclusions and Future Work

In this study, we proposed an iterated reweighting scheme for random forests; we also proposed an

importance measure based on retrained models. Contrary to our initial intuition, an iterated reweighting

scheme did not perform as well as the existing iterated removal scheme. A combination of our proposed

feature importance measure and an iterated removal scheme showed improved results over existing feature

importance measures. Though our proposed method is computationally intensive, the sparsity in number of

features for the obtained models may offset costs associated with feature acquisition.

For the future, we suggest that it may be worth investigating a hybrid strategy using both iterated

reweighting and iterated removal to prevent an algorithm from getting stuck in a local optimum, as appears to have happened with iterated reweighting.

Chapter 6

Conclusion

6.1 Conclusions

In this thesis, we explored feature selection by employing iterative reweighting for two classes of

popular machine learning models: L1 penalized linear models and Random Forests.

Recent studies have shown that incorporating feature importance weights into L1 models leads to

improvement in predictive performance in a single iteration of training. In Chapter 3, we advanced the

state-of-the-art by developing an alternative method for estimating feature importance based on subsampling.

Extending the approach to multiple iterations of training, employing the importance weights from

iteration n to bias the training on iteration n + 1 seems promising, but past studies yielded no benefit to

iterative reweighting. In Chapter 4, we proposed a multiple step algorithm, the SBA-LASSO, that uses

weights derived from either bootstrapped or subsampled datasets. We tuned the characteristics of our

algorithm based on synthetic data sets, and then evaluated the final algorithm on 25 real-world data sets,

whose input dimension ranged from 6 to 300. For the lower-dimensional problems, we augmented the input

with second-order features, which not only allows the models to better exploit feature selection but yield

models with less bias. We obtained a significant reduction of 7.48% in the error rate over standard L1

penalized algorithms, and nearly as large an improvement over alternative feature selection algorithms such

as Adaptive Lasso, Bootstrap Lasso, and MSA-LASSO, on both classification and regression problems, using

our improved estimates of feature importance. Our empirical results also support the hypothesis that we can


significantly improve prediction in regression and classification problems using a multiple step estimation

algorithm instead of a single step estimation algorithm.
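To make the multiple-step idea concrete, here is a schematic sketch (our illustration, not the SBA-LASSO specification; it assumes scikit-learn's Lasso, and the weight-update rule shown is the generic adaptive-reweighting form rather than our subsampling-based estimate) of how weights from iteration n can bias the penalty at iteration n + 1:

```python
# Schematic iterative reweighting for an L1 model (illustrative only).
# In SBA-LASSO the weights come from subsampled datasets; here we use a
# generic |coefficient|-based update purely to show the loop structure.
import numpy as np
from sklearn.linear_model import Lasso

def iterative_reweighting(X, y, alpha=0.1, n_iters=5, eps=1e-6):
    w = np.ones(X.shape[1])                        # uniform importance to start
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        model = Lasso(alpha=alpha).fit(X * w, y)   # scaling column j by w_j
        beta = model.coef_ * w                     # penalizes beta_j by ~1/w_j
        w = np.abs(beta) + eps                     # weights for next iteration
    return beta, w
```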

In Chapter 5, we considered iterative reweighting in the context of Random Forests and contrasted this with a more standard backward-elimination technique that involves training models with the full complement of features and iteratively removing the least important feature; weights determined the probability of selecting a feature for inclusion in the trees of a Random Forest. In parallel with this contrast, we also compared several measures of importance, including our own proposal based on evaluating models constructed with and without each candidate feature. We showed that our importance measure yields both higher accuracy and greater sparsity than importance measures obtained without retraining models (including measures proposed by Breiman and Strobl), though at a greater computational cost. In contrast to the results we obtained with linear models, iterative reweighting of feature importance with the nonlinear Random Forests did not outperform a backward-elimination approach with all-or-none feature inclusion. Though our proposed method is computationally intensive, the sparsity in the number of features of the obtained models may offset costs associated with feature acquisition.
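To make the reweighting mechanism concrete, the following sketch (illustrative, not the thesis implementation; numpy is assumed, and mtry denotes the usual number of split candidates) shows how importance weights can set the probability that a feature is offered as a split candidate in a tree:

```python
import numpy as np

def sample_split_candidates(weights, mtry, rng=None):
    """Draw mtry distinct split-candidate features, biased by importance weights.

    A feature with weight zero has probability zero of being offered to a tree,
    which is how iterative reweighting can drive useless features out entirely.
    """
    rng = np.random.default_rng() if rng is None else rng
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    return rng.choice(len(p), size=mtry, replace=False, p=p)
```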

6.2 Future Work

L1 Penalized Algorithms. Chapter 4 provided empirical evidence that Subsampling SBA outperforms

existing algorithms. The theoretical justification for Subsampling SBA and other multiple step algorithms

is still an open research question. We hypothesize that the Subsampling SBA is able to adaptively find the

threshold after which a feature should not be penalized in the adaptive lasso penalty. We further conjecture

that MSA-LASSO and Adaptive Lasso may be improved by using a squashing function on the weights, which

will push them toward binary values and therefore make the algorithms behave more like SBA-LASSO.
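As a sketch of this conjecture (hypothetical; the steepness k and threshold t below are illustrative parameters, not tuned values), a logistic squashing of the weights could look like:

```python
import numpy as np

def squash(w, k=10.0, t=0.5):
    """Logistic squashing: pushes weights well above t toward 1 and weights
    well below t toward 0, making reweighting behave more like all-or-none
    feature inclusion. Parameters are illustrative, not tuned values."""
    return 1.0 / (1.0 + np.exp(-k * (np.asarray(w) - t)))
```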

The comparison of our SBA-LASSO algorithm with other sparsity-inducing penalty functions should also be investigated; a possible research avenue is combining importance weights with other sparsity-inducing penalties.


Random Forest. In Chapter 5, we showed that our proposed importance measure based on models constructed with and without each candidate feature provides large improvements; contrary to our initial intuition, iterated reweighting schemes do not perform as well as backward elimination, an all-or-none feature inclusion scheme. We have some initial evidence (Appendix C.6) to support the hypothesis that a reweighting scheme gets stuck in a local optimum, and for the future we suggest that it may be worth investigating a hybrid strategy using both iterated reweighting and iterated removal.

As our proposed importance measure is computationally intensive, a further analysis of the effects of removing more than a single feature at a time would be useful. Furthermore, an analysis of the effect of the number of resamples on the performance of the algorithms would also be useful.

Random Forests were the main focus of this thesis, and for the future we suggest experimenting with iterated retraining with other ensemble methods, like boosted trees [43].

We discussed an artificial dataset for which existing feature selection algorithms are unable to correctly

rank the relevant features higher than the irrelevant features. A thorough investigation of such datasets is an

essential step for understanding feature interactions and their influence on feature selection by an algorithm.

Bibliography

[1] Delve Datasets. Accessed 7-July-2013 from http://www.cs.toronto.edu/~delve/data/datasets.html.

[2] J Alcalá-Fdez, L Sánchez, S García, M J Del Jesus, S Ventura, J M Garrell, J Otero, C Romero, J Bacardit, V M Rivas, et al. KEEL: A Software Tool to Assess Evolutionary Algorithms for Data Mining Problems. Soft Computing, 13(3):307–318, 2009.

[3] Yali Amit and Donald Geman. Shape Quantization and Recognition with Randomized Trees. Neural Computation, 9(7):1545–1588, 1997.

[4] Sylvain Arlot and Alain Celisse. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79, 2010.

[5] A Asuncion and DJ Newman. UCI Machine Learning Repository, 2010. URL http://archive.ics.uci.edu/ml. University of California, Irvine, School of Information and Computer Sciences.

[6] Francis R Bach. Bolasso: Model Consistent Lasso Estimation through the Bootstrap. In Proceedings of the 25th International Conference on Machine Learning, pages 33–40. ACM, 2008.

[7] Lars Bergström. Measuring NUMA effects with the STREAM benchmark. arXiv preprint arXiv:1103.3225, 2011.

[8] Gérard Biau. Analysis of a Random Forests Model. Journal of Machine Learning Research, 13:1063–1095, June 2012. ISSN 1532-4435.

[9] Gérard Biau and Luc Devroye. On the layered nearest neighbour estimate, the bagged nearest neighbour estimate and the random forest method in regression and classification. Journal of Multivariate Analysis, 101(10):2499–2518, 2010.

[10] J A Blackard and D J Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24:131–151, 1999.

[11] Avrim L Blum and Pat Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1):245–271, 1997.

[12] Léon Bottou and Chih-Jen Lin. Support vector machine solvers. Large Scale Kernel Machines, pages 301–320, 2007.

[13] Anne-Laure Boulesteix, Silke Janitza, Jochen Kruppa, and Inke R König. Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(6):493–507, 2012. ISSN 1942-4795.

[14] Stephen P Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK; New York, 2004.

[15] Leo Breiman. Bagging Predictors. Machine Learning, 24(2):123–140, 1996.

[16] Leo Breiman. Heuristics of instability and stabilization in model selection. The Annals of Statistics, 24(6):2350–2383, 1996.

[17] Leo Breiman. Stacked regressions. Machine Learning, 24(1):49–64, 1996.

[18] Leo Breiman. Arcing Classifier. The Annals of Statistics, 26(3):801–849, 1998.

[19] Leo Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.

[20] Leo Breiman. Some Infinity Theory for Predictor Ensembles. Technical Report 579, Statistics Department, UC Berkeley, 2001. Accessed 7-July-2013 from http://www.stat.berkeley.edu/tech-reports/579.pdf.

[21] Leo Breiman. Looking inside the black box, 2002. Wald Lecture II, Department of Statistics, University of California, Berkeley. Accessed 30-March-2010 from http://stat-www.berkeley.edu/users/breiman/wald2002-2.pdf.

[22] Leo Breiman. Consistency for a simple model of random forests. Technical Report 670, Statistics Department, UC Berkeley, 2004. Accessed 7-July-2013 from www.stat.berkeley.edu/~breiman/RandomForests/consistencyRFA.pdf.

[23] Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone. Classification and Regression Trees. Wadsworth, Belmont, California, 1984.

[24] Peter Bühlmann and Lukas Meier. Discussion: One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1534–1541, 2008. Available at http://arxiv.org/pdf/0808.1013.

[25] Prabir Burman. A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika, 76(3):503–514, 1989.

[26] Emmanuel J Candès and Michael B Wakin. An introduction to compressive sampling. Signal Processing Magazine, IEEE, 25(2):21–30, 2008.

[27] Emmanuel J Candès, Michael B Wakin, and Stephen P Boyd. Enhancing sparsity by reweighted l1 minimization. Journal of Fourier Analysis and Applications, 14(5):877–905, 2008.

[28] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[29] Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin. Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines. The Journal of Machine Learning Research, 9:1369–1398, 2008.

[30] William H Cooke, Kathy L Ryan, and Victor A Convertino. Lower body negative pressure as a model to study progression to acute hemorrhagic shock in humans. Journal of Applied Physiology, 96(4):1249–1261, 2004.

[31] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research, 2:265–292, 2002.

[32] Adele Cutler and Guohua Zhao. PERT - perfect random tree ensembles. Computing Science and Statistics, 33:490–497, 2001.

[33] D Richard Cutler, Thomas C Edwards Jr, Karen H Beard, Adele Cutler, Kyle T Hess, Jacob Gibson, and Joshua J Lawler. Random forests for classification in ecology. Ecology, 88(11):2783–2792, 2007.

[34] A C Davison and D V Hinkley. Bootstrap Methods and Their Application (Cambridge Series in Statistical and Probabilistic Mathematics, No 1). Cambridge University Press, 1 edition, October 1997. ISBN 0521574714.

[35] Janez Demšar. Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 7:1–30, Dec 2006. ISSN 1532-4435.

[36] Persi Diaconis and Bradley Efron. Computer-intensive Methods in Statistics. Scientific American, pages 116–130, May 1983.

[37] Ramón Díaz-Uriarte. varSelRF: Variable selection using Random Forests, 2009. URL http://ligarto.org/rdiaz/Software/Software.html. R package version 0.7-1.

[38] Thomas G Dietterich. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7):1895–1923, Oct 1998. ISSN 0899-7667.

[39] Thomas G Dietterich. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization. Machine Learning, 40(2):139–157, 2000.

[40] David L Donoho. Compressed Sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006. ISSN 0018-9448.

[41] Bradley Efron and Robert Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, 1993.

[42] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Annals of Statistics, 32:407–499, 2004.

[43] Jane Elith, John R Leathwick, and Trevor Hastie. A working guide to boosted regression trees. Journal of Animal Ecology, 77(4):802–813, 2008.

[44] Jianqing Fan and Runze Li. Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.

[45] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.

[46] Agner Fog. Software Optimization Resources. Accessed 1-March-2013 from http://www.agner.org/optimize/.

[47] Yoav Freund and Robert E Schapire. Experiments with a New Boosting Algorithm. International Conference on Machine Learning, 96:148–156, 1996.

[48] Jerome Friedman. Fast Sparse Regression and Classification. International Journal of Forecasting, 28(3):722–738, 2012. Accessed 7-July-2013 from http://www-stat.stanford.edu/~jhf/ftp/GPSpub.pdf.

[49] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1):1, 2010.

[50] Wenjiang J Fu. Penalized Regressions: The Bridge versus the Lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.

[51] G Fung and O L Mangasarian. A Feature Selection Newton Method for Support Vector Machine Classification. Technical Report 02-03, Data Mining Institute, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, September 2002. Accessed 7-July-2013 from ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/02-01.ps.

[52] Salvador García and Francisco Herrera. An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons. Journal of Machine Learning Research, 9:2677–2694, 2008.

[53] Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.

[54] J Giraldez, A Jaiantilal, J Walz, S Suryanarayanan, S Sankaranarayanan, H E Brown, and E Chang. An evolutionary algorithm and acceleration approach for topological design of distributed resource islands. PowerTech, 2011 IEEE Trondheim, pages 1–8, June 2011.

[55] Isabelle Guyon, Steve Gunn, and Gideon Dror. Result Analysis of the NIPS 2003 Feature Selection Challenge. Advances in Neural Information Processing Systems, 17:545–552, 2005.

[56] Isabelle Guyon, Amir Saffari, Gideon Dror, and Gavin Cawley. Agnostic Learning vs. Prior Knowledge Challenge. International Joint Conference on Neural Networks (IJCNN), pages 829–834, 2007. ISSN 1098-7576. doi: 10.1109/IJCNN.2007.4371065.

[57] Isabelle Guyon, Gavin Cawley, Gideon Dror, and Vincent Lemaire. Results of the Active Learning Challenge. Active Learning and Experimental Design Workshop in conjunction with AISTATS 2010, 16:19–45, 2011.

[58] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H Witten. The WEKA data mining software: an update. SIGKDD Explor. Newsl., 11(1):10–18, Nov 2009. ISSN 1931-0145.

[59] Trevor Hastie, Robert Tibshirani, and Jerome H Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.

[60] Tin Kam Ho. The Random Subspace Method for Constructing Decision Forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:832–844, August 1998.

[61] A E Hoerl. Application of ridge analysis to regression problems. Chemical Engineering Progress, 58:54–59, 1962.

[62] Torsten Hothorn, Kurt Hornik, Carolin Strobl, and Achim Zeileis. party: A Laboratory for Recursive Partytioning. R package version 0.9-999, http://CRAN.R-project.org/package=party, 2009.

[63] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A Practical Guide to Support Vector Classification. Technical report, 2003.

[64] Jian Huang, Shuangge Ma, and Cun-Hui Zhang. Adaptive Lasso for Sparse High-Dimensional Regression Models. Technical Report 374, Department of Statistics and Actuarial Science, University of Iowa, Nov 2006. Accessed 7-July-2013 from http://statdev.divms.uiowa.edu/sites/default/files/techrep/tr374.pdf.

[65] Laurent Hyafil and Ronald L Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1):15–17, 1976.

[66] Ross Ihaka and Robert Gentleman. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3):299–314, 1996.

[67] Abhishek Jaiantilal. Classification and Regression via randomforest-matlab. Accessed 30-March-2010 from http://code.google.com/randomforest-matlab, 2010.

[68] Abhishek Jaiantilal and Gregory Grudic. Increasing Feature Selection Accuracy for L1 Regularized Linear Models. JMLR: Proceedings of the Fourth International Workshop on Feature Selection in Data Mining, 10:86–96, June 2010.

[69] Lars Juhl Jensen and Alex Bateman. The rise and fall of supervised machine learning techniques. Bioinformatics, 27(24):3331–3332, 2011.

[70] George H John, Ron Kohavi, and Karl Pfleger. Irrelevant Features and the Subset Selection Problem. In Eleventh International Conference on Machine Learning, volume 94, pages 121–129. Morgan Kaufmann, 1994.

[71] Li Jun, Zhang Shunyi, Xuan Ye, and Sun Yanfei. Identifying Skype traffic by Random Forest. In International Conference on Wireless Communications, Networking and Mobile Computing, pages 2841–2844. IEEE, 2007.

[72] R K Pace and Ronald Barry. Sparse Spatial Autoregressions. Statistics & Probability Letters, 33(3):291–297, 1997.

[73] Kenji Kira and Larry A Rendell. A Practical Approach to Feature Selection. In Derek H Sleeman and Peter Edwards, editors, Ninth International Workshop on Machine Learning, pages 249–256. Morgan Kaufmann, 1992.

[74] Ron Kohavi and George H John. Wrappers for Feature Subset Selection. Artificial Intelligence, 97(1):273–324, 1997.

[75] Igor Kononenko. Estimating Attributes: Analysis and Extensions of RELIEF. In Francesco Bergadano and Luc De Raedt, editors, European Conference on Machine Learning, pages 171–182. Springer, 1994.

[76] Miron B Kursa and Witold R Rudnicki. Feature Selection with the Boruta Package. Journal of Statistical Software, 36(11):1–13, 9 2010. ISSN 1548-7660. URL http://www.jstatsoft.org/v36/i11.

[77] Ching-Pei Lee and Chih-Jen Lin. Large-scale Linear RankSVM. Accessed 7-July-2013 from http://www.csie.ntu.edu.tw/~cjlin/papers/ranksvm/ranksvml2.pdf, 2013.

[78] Andy Liaw and Matthew Wiener. Classification and Regression by randomForest. R News, 2(3):18–22, 2002. URL http://CRAN.R-project.org/doc/Rnews/.

[79] Yi Lin and Yongho Jeon. Random Forests and Adaptive Nearest Neighbors. Technical Report 1055, Department of Statistics, University of Wisconsin, 2002.

[80] Michael E J Masson and Geoffrey R Loftus. Using Confidence Intervals for Graphically Based Data Interpretation. Canadian Journal of Experimental Psychology, 57(3):203–220, 2003.

[81] E H Moore. On the reciprocal of the general algebraic matrix. Bulletin of the American Mathematical Society, 26:394–395, 1920.

[82] Steven L Moulton, Stephanie Haley-Andrews, and Jane Mulligan. Emerging Technologies for Pediatric and Adult Trauma Care. Current Opinion in Pediatrics, 22(3):332, 2010.

[83] Radford M Neal. Software by Radford M Neal. Accessed 1-March-2013 from http://www.cs.toronto.edu/~radford/software-online.html.

[84] Radford M Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.

[85] Roger Penrose. A generalized inverse for matrices. Proceedings of the Cambridge Philosophical Society, 51:406–413, 1955.

[86] John C Platt. Fast Training of Support Vector Machines using Sequential Minimal Optimization. In Bernhard Schölkopf, Christopher J C Burges, and Alexander J Smola, editors, Advances in Kernel Methods: Support Vector Learning, Cambridge, MA, 1998. MIT Press.

[87] Dimitris N Politis, Joseph P Romano, and Michael Wolf. Subsampling. Springer, 1999.

[88] Michael J Procopio. Hand-labeled DARPA LAGR datasets. Used to be available at http://ml.cs.colorado.edu/~procopio/labeledlagrdata/, 2007.

[89] Michael J Procopio, Jane Mulligan, and Greg Grudic. Learning Terrain Segmentation with Classifier Ensembles for Autonomous Robot Navigation in Unstructured Environments. Journal of Field Robotics, 26(2):145–175, 2009.

[90] John Ross Quinlan. C4.5: Programs for Machine Learning, volume 1. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993. ISBN 1-55860-238-0.

[91] John Ross Quinlan. Combining Instance-Based and Model-Based Learning. In International Conference on Machine Learning, pages 236–243, June 1993.

[92] Ryan Rifkin and Aldebaro Klautau. In Defense of One-Vs-All Classification. The Journal of Machine Learning Research, 5:101–141, 2004.

[93] Marko Robnik-Šikonja and Igor Kononenko. An adaptation of Relief for attribute estimation in regression. In Douglas H Fisher, editor, Fourteenth International Conference on Machine Learning, pages 296–304. Morgan Kaufmann, 1997.

[94] Guilherme V Rocha, Xing Wang, and Bin Yu. Asymptotic distribution and sparsistency for l1-penalized parametric m-estimators with applications to linear svm and logistic regression, 2009. URL http://www.citebase.org/abstract?id=oai:arXiv.org:0908.1940.

[95] R Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1997.

[96] Lior Rokach. Ensemble-based classifiers. Artificial Intelligence Review, 33(1-2):1–39, 2010.

[97] Saharon Rosset and Ji Zhu. Piecewise linear regularized solution paths. Annals of Statistics, 35(3):1012–1030, 2007.

[98] Saharon Rosset, Ji Zhu, and Trevor Hastie. Boosting as a Regularized Path to a Maximum Margin Classifier. Journal of Machine Learning Research, 5:941–973, 2004.

[99] Yvan Saeys, Iñaki Inza, and Pedro Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507–2517, 2007.

[100] Robert E Schapire. The Boosting Approach to Machine Learning: An Overview. Lecture Notes in Statistics, Springer Verlag, New York, pages 149–172, 2003.

[101] Bernhard Schölkopf and Alexander J Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. The MIT Press, 2002.

[102] Daniel F Schwarz, Inke R König, and Andreas Ziegler. On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data. Bioinformatics, 26(14):1752–1758, 2010.

[103] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

[104] Kai-Quan Shen, Chong-Jin Ong, Xiao-Ping Li, Zheng Hui, and Einar P V Wilder-Smith. A Feature Selection Method for Multilevel Mental Fatigue EEG Classification. IEEE Transactions on Biomedical Engineering, 54(7):1231–1237, 2007.

[105] Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, and Andrew Blake. Real-Time Human Pose Recognition in Parts from Single Depth Images. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1297–1304. IEEE, 2011.

[106] K Sjöstrand. Matlab implementation of LASSO, LARS, the elastic net and SPCA. http://www2.imm.dtu.dk/pubdb/p.php?3897, June 2005. Version 2.0.

[107] Stephen J Smith, Nick Ellis, and C Roland Pitcher. Conditional variable importance in R package extendedForest. Accessed 7-July-2013 from https://r-forge.r-project.org/scm/viewvc.php/pkg/extendedForest/?root=gradientforest, 2011.

[108] Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra. MPI: The Complete Reference (Vol. 1): Volume 1 - The MPI Core, volume 1. MIT Press, 1998.

[109] K P Soman, S Diwakar, and V Ajay. Insight Into Data Mining: Theory and Practice. Prentice-Hall of India Private Limited, 2006. ISBN 9788120328976.

[110] Carolin Strobl and Achim Zeileis. Danger: High power! - Exploring the statistical properties of a test for random forest variable importance. Technical Report 017, Department of Statistics, University of Munich, 2008.

[111] Carolin Strobl, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin, and Achim Zeileis. Conditional variable importance for random forests. BMC Bioinformatics, 9(1):307, 2008. ISSN 1471-2105. doi: 10.1186/1471-2105-9-307.

[112] Robert Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58:267–288, 1996.

[113] Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(1):91–108, 2005.

[114] Roman Timofeev. Classification and Regression Trees (CART) Theory and Applications. Master's thesis, 2004.

[115] Jameson L Toole, Michael Ulm, Marta C Gonzalez, and Dietmar Bauer. Inferring land use from mobile phone activity. In Proceedings of the ACM SIGKDD International Workshop on Urban Computing, pages 1–8. ACM, 2012.

[116] Eugene Tuv, Alexander Borisov, George Runger, and Kari Torkkola. Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination. The Journal of Machine Learning Research, 10:1341–1366, 2009.

[117] Andrey Nikolayevich Tychonoff. On the stability of inverse problems. Doklady Akademii Nauk SSSR, 39:195–198, 1943.

[118] M Van Breukelen, Robert P W Duin, David M J Tax, and J E den Hartog. Handwritten digit recognition by combined classifiers. Kybernetika, 34(4):381–386, 1998.

[119] Vladimir N Vapnik. Statistical Learning Theory. Wiley-Interscience, September 1998. ISBN 0471030031.

[120] Pantelis Vlachos. StatLib. http://lib.stat.cmu.edu/datasets/.

[121] J D Watts and R L Lawrence. Merging random forest classification with an object-oriented approach for analysis of agricultural lands. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XXXVII (B7), 2008.

[122] Sanford Weisberg. Applied Linear Regression. Wiley, 2005.

[123] Jason Weston and Chris Watkins. Multi-class Support Vector Machines. Technical Report CSD-TR-98-04, Department of Computer Science, Royal Holloway, University of London, May 1998.

[124] Frank Wilcoxon. Individual Comparisons by Ranking Methods. Biometrics Bulletin, 1(6):80–83, 1945.

[125] David H Wolpert. Stacked Generalization. Neural Networks, 5:241–259, 1992.

[126] David H Wolpert. What the No Free Lunch Theorems Really Mean; How to Improve Search Algorithms. Available at http://www.santafe.edu/media/workingpapers/12-10-017.pdf, 2012.

[127] David H Wolpert and William G Macready. No Free Lunch Theorems For Optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, 1997.

[128] Pengyi Yang, Yee Hwa Yang, Bing B Zhou, and Albert Y Zomaya. A Review of Ensemble Methods in Bioinformatics. Current Bioinformatics, 5(4):296–308, 2010.

[129] Lei Yu and Huan Liu. Efficient Feature Selection via Analysis of Relevance and Redundancy. The Journal of Machine Learning Research, 5:1205–1224, 2004.

[130] Peng Zhao and Bin Yu. On Model Selection Consistency of Lasso. The Journal of Machine Learning Research, 7:2541–2563, 2006.

[131] Ji Zhu, Saharon Rosset, Trevor Hastie, and Rob Tibshirani. 1-norm Support Vector Machines. Advances in Neural Information Processing Systems, 16(1):49–56, 2004.

[132] Hui Zou. An Improved 1-norm SVM for Simultaneous Classification and Variable Selection. AISTATS, 2:675–681, 2007.

[133] Hui Zou. The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association, 101:1418–1429, December 2006.

[134] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.

[135] Hui Zou and Runze Li. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1509–1533, 2008.

Appendix A

L1-SVM2 Algorithm

In this chapter, we detail the L1-SVM2 algorithm. First, we discuss the optimization of a double

differentiable loss function regularized with the L1 penalty.

A.1 Optimization of a Double Differentiable Loss Function Regularized with the L1 Penalty

This section summarizes the method, proposed by Rosset and Zhu [97], that solves for the entire

regularization path for a double differentiable loss function and the lasso penalty.

A.1.1 Necessary and Sufficient Conditions for Optimality for the Whole Regularization Path

Lasso penalty regularized algorithms for a double differentiable loss function can be formulated as

$$\min_\beta \sum_{i=1}^{n} L(x_i, y_i, \beta) + \lambda \sum_{j=1}^{p} |\beta_j|, \qquad (A.1)$$

where $n$ is the number of examples and $p$ is the number of features. $\beta_j$ is the parameter for feature $j$ and is substituted as $\beta_j = \beta_j^+ - \beta_j^-$ with $\beta_j^+ \geq 0$, $\beta_j^- \geq 0$; such a decomposition helps in reducing the $2^p$ constraints, due to $|\beta|$, to $2p+1$ constraints. We also substitute $L(x_i, y_i, \beta) = L(y_i, (\beta^+, \beta^-)x_i)$ to obtain the primal and dual formulations of Equation A.1 as

$$\text{Primal:}\quad \min_{\beta^+,\beta^-} \sum_i L(y_i, (\beta^+, \beta^-)x_i) + \lambda \sum_j (\beta_j^+ + \beta_j^-), \ \text{such that } \forall j,\ \beta_j^+ \geq 0 \text{ and } \beta_j^- \geq 0, \qquad (A.2)$$

$$\text{Dual:}\quad \min_{\beta^+,\beta^-} \sum_i L(y_i, (\beta^+, \beta^-)x_i) + \lambda \sum_j (\beta_j^+ + \beta_j^-) - \sum_j \gamma_j^+ \beta_j^+ - \sum_j \gamma_j^- \beta_j^-. \qquad (A.3)$$

Henceforth, we use the shorthand notation $L(\beta)$ instead of $L(x_i, y_i, \beta)$.

We use the Karush-Kuhn-Tucker (KKT) conditions to examine the properties of the minima of Equation A.1. The primal formulation consists of the loss function and inequality constraints. Using the KKT conditions, we obtain $\gamma_j^+ \beta_j^+ = 0$ and $\gamma_j^- \beta_j^- = 0$. Then, differentiating Equation A.3 with respect to $\beta_j^+$ and $\beta_j^-$ and setting the gradient to 0, we obtain $[\nabla L(\beta)]_j + \lambda - \gamma_j^+ = 0$ and $-[\nabla L(\beta)]_j + \lambda - \gamma_j^- = 0$.

At the minima of Equation A.3, for fixed $\lambda > 0$ (for $\lambda = 0$, $\beta = 0_{1,\dots,p}$), we get

$$\text{if } \beta_j^+ > 0,\ \lambda > 0 \implies \gamma_j^+ = 0 \implies [\nabla L(\beta)]_j = -\lambda < 0 \implies \gamma_j^- = 2\lambda > 0 \implies \beta_j^- = 0, \qquad (A.4)$$

$$\text{if } \beta_j^- > 0,\ \lambda > 0 \implies \gamma_j^- = 0 \implies [\nabla L(\beta)]_j = \lambda > 0 \implies \gamma_j^+ = 2\lambda > 0 \implies \beta_j^+ = 0. \qquad (A.5)$$

From Equations A.4 and A.5, we surmise that either $\beta_j^+$ or $\beta_j^-$ (and similarly either $\gamma_j^+$ or $\gamma_j^-$) is non-zero, but not both simultaneously. When $|[\nabla L(\beta)]_j| = \lambda$, feature $j$ can have a non-zero coefficient $\beta_j$, but features with $|[\nabla L(\beta)]_j| < \lambda$ have coefficient $\beta_j = 0$. Thus, for each value of $\lambda$, the features can be divided on the basis of $|[\nabla L(\beta)]_j| = \lambda$ (the active set $A = \{j \in \{1,\dots,p\} : \beta_j \neq 0\}$) and $|[\nabla L(\beta)]_j| < \lambda$ (the complementary set $A^c$), formulated as

$$j \in A \implies |[\nabla L(\beta)]_j| = \lambda, \quad \operatorname{sign}([\nabla L(\beta)]_j) = -\operatorname{sign}(\beta_j), \qquad (A.6)$$

$$j \notin A \text{ (or } j \in A^c\text{)} \implies |[\nabla L(\beta)]_j| < \lambda. \qquad (A.7)$$

Now that we have examined the properties of the minima of Equation A.1, we utilize Newton's method for optimizing it. Instead of solving for a fixed value of $\lambda$, we start with an empty set $A$ and $\lambda = \infty$. We then gradually decrease the value of $\lambda$, which allows features to enter the set $A$; the KKT conditions are violated whenever a feature moves between the sets $A$ and $A^c$, at which point we require a recalculation of the gradients. We briefly note the calculation of the gradients and the conditions that violate the KKT conditions as follows:


$$\text{Initially, choose } \beta^0 = (0, \dots, 0) \in \mathbb{R}^p. \qquad (A.8)$$

$$\text{According to Newton's method, } \beta^{k+1} = \beta^k - [\nabla^2 f(\beta^k)]^{-1} \nabla f(\beta^k), \quad k = 0, 1, \dots, \qquad (A.9)$$

where $\nabla f(\beta^k) = -\nabla L(\beta)$ or $\lambda \nabla J(\beta)$, as $\nabla L(\beta) + \lambda \nabla J(\beta) = 0$; hence $\beta^{k+1} = \beta^k - [\nabla^2 L(\beta^k)]^{-1} \nabla J(\beta^k)$. Thus,

$$\frac{\partial \beta^k}{\partial \lambda} = -[\nabla^2 L(\beta^k)]^{-1} \nabla J(\beta^k) = -[\nabla^2 L(\beta^k)]^{-1} \operatorname{Sign}(\beta^k), \quad \text{restricted to the coordinates } j \in A. \qquad (A.10)$$

Note that the notation $\beta_j^k$ refers to the value of feature $j$ at iteration $k$, whereas $\beta^k$ refers to the parameter vector at iteration $k$.

The KKT conditions are violated:

(1) When a feature not in the active set $A$ achieves $|[\nabla L(\beta)]_j| = \lambda$.

(2) When a feature's coefficient reaches 0; the KKT conditions are then violated, since $\operatorname{sign}([\nabla L(\beta)]_j) = -\operatorname{sign}(\beta_j(\lambda))$.

(3) If the loss function is not differentiable everywhere, then the gradient direction becomes invalid at the points where the loss function is not differentiable; whenever we encounter a non-differentiable transition, the gradient direction is recalculated.

Next, we discuss an algorithm that calculates the entire regularization path by properly handling the

KKT conditions.

Algorithm for Optimizing Doubly Differentiable Loss Functions with the L1 Penalty. The algorithm is as follows:

(1) Initialize $\beta = 0_{1..p}$, $A = \{\arg\max_j |[\nabla L(\beta)]_j|\}$, $\gamma_A = -\operatorname{sign}(\nabla L(\beta))_A$, $\gamma_{A^c} = 0$, where $p$ is the number of features.

(2) While ($\max |\nabla L(\beta)| > 0$)

a) $d_1 = \min\{d > 0 : |\nabla L(\beta_j + d\gamma_j)|_{j \notin A} = |\nabla L(\beta_m + d\gamma_m)|_{m \in A}\}$ (add features to $A$).

b) $d_2 = \min\{d > 0 : (\beta_j + d\gamma_j)_{j \in A} = 0\}$ (delete features from $A$).

c) $d_3 = \min\{d > 0 : r(y_i, (\beta + d\gamma)x_i) = 0$ hits a `knot', $i = 1 \dots n$, where $n$ is the number of examples$\}$.

d) Set $d = \min(d_1, d_2, d_3)$ and $\beta = \beta + d\gamma$.

e) $\gamma_{j \in A} = |\nabla^2 L(\beta_{j \in A})|^{-1}(-\operatorname{sign}(\beta_{j \in A}))$, $\gamma_{j \in A^c} = 0$.

In step 2c), we use the $r(\cdot)$ function to define the distance of a training example to a knot point (a non-differentiable point). In step 2d), we take the smallest increment until a KKT condition is violated.
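As a small illustration of step 2b) (a sketch assuming numpy arrays beta_A and gamma_A restricted to the active set; this is not code from the thesis), the smallest positive step at which an active coefficient crosses zero is:

```python
import numpy as np

def deletion_step(beta_A, gamma_A):
    """Step 2b): smallest d > 0 at which some active coefficient hits zero."""
    with np.errstate(divide="ignore", invalid="ignore"):
        d = -beta_A / gamma_A           # coefficient j reaches 0 at d = -beta_j/gamma_j
    d = d[np.isfinite(d) & (d > 0)]     # only finite, forward steps are valid
    return d.min() if d.size else np.inf
```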

A.2 L1-SVM2

In this section, we present the derivation of the piecewise algorithm that calculates the entire regularization path for the squared hinge loss function with the L1 penalty (termed L1-SVM2), which is defined as

$$\text{L1-SVM2:}\quad \min_{\beta,\beta_0} \Big\{ \lambda\|\beta\|_1 + \sum_i \big(1 - y_i(x_i\beta + \beta_0)\big)_+^2 \Big\}, \qquad (v)_+ = \max(v, 0). \qquad (A.11)$$

Our notation will be as follows: we represent the dot product as `$*$' and $1_m$ is a vector of length $m$. The input examples are represented by an $n \times p$ matrix $X$, where $n$ is the number of examples and $p$ is the number of features. $x_{:,j}$ represents the $j$th column of $X$, the data for the $j$th feature; similarly, $x_{i,:}$ represents the $i$th row of $X$, the data for the $i$th example. We write the loss function $L(X, y, \beta)$ in shorthand as $L$. We derive the gradient of the loss function and then the step size $d_1$, for the algorithm discussed in the previous section, as

$$L = (1_m - y * (X\beta) - y * \beta_0)_+^2.$$

For each feature $j$,

$$\nabla L_j = -2\,(1_m - y * (X\beta + \beta_0))_+^T (y * x_{:,j}),$$

$$\nabla^2 L_{j,j} = 2(-y * x_{:,j})^T(-y * x_{:,j}) = 2\,x_{:,j}^T x_{:,j}, \qquad \nabla^2 L_{j,k} = 2(-y * x_{:,k})^T(-y * x_{:,j}) = 2\,x_{:,j}^T x_{:,k},$$

$$\nabla^2 L_{j,\beta_0} = 2(-y * x_{:,j})^T(-y) = 2\sum_k x_{k,j}, \qquad \nabla^2 L_{\beta_0,\beta_0} = 2\,y^T y.$$

For finding $d_1$,

$$|\nabla L(\beta_a + d_1\gamma_a)|_{a \in A} = |\nabla L(\beta_j + d_1\gamma_j)|_{j \in A^c} \implies \nabla L(\beta_a + d_1\gamma_a)_{a \in A} = \pm\,\nabla L(\beta_j + d_1\gamma_j)_{j \in A^c},$$

$$\implies (1_m - y * (X\beta + \beta_0 + d_1\gamma_0 + X d_1\gamma))^T (y * x_{:,a}) = \pm\,(1_m - y * (X\beta + \beta_0 + d_1\gamma_0 + X d_1\gamma))^T (y * x_{:,j}),$$

$$\implies (1_m - y * (X\beta + \beta_0))^T (y * x_{:,a}) \mp (1_m - y * (X\beta + \beta_0))^T (y * x_{:,j}) = \mp\,\big((\gamma_0 + X\gamma)^T(y * x_{:,j}) + (\gamma_0 + X\gamma)^T(y * x_{:,a})\big)\,d_1,$$

$$\implies \text{Step size } d_1 = \min\left\{ \frac{A_{\mathrm{Correlation}} - J_{\mathrm{Correlation}}}{A_{\mathrm{val}} - J_{\mathrm{val}}},\ \frac{A_{\mathrm{Correlation}} + J_{\mathrm{Correlation}}}{A_{\mathrm{val}} + J_{\mathrm{val}}} \right\}, \qquad (A.12)$$

where

$$A_{\mathrm{Correlation}} = (1_m - y * (X\beta + \beta_0))^T (y * x_{:,a}), \qquad J_{\mathrm{Correlation}} = (1_m - y * (X\beta + \beta_0))^T (y * x_{:,j}),$$

$$A_{\mathrm{val}} = (\gamma_0 + X\gamma)^T x_{:,a}, \qquad J_{\mathrm{val}} = (\gamma_0 + X\gamma)^T x_{:,j} \qquad (\because\ y * y = 1_m), \qquad (A.13)$$

and $\gamma$ is a $p \times 1$ vector and $\gamma_0$ is a scalar, containing the direction of the gradient for $\beta$ and $\beta_0$, respectively. Note that, for the L1-SVM2, the step size $d_1$ is the minimum over the ratios of the difference (and sum) of the generalized correlations to the difference (and sum) of the directional estimates.
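For concreteness, here is a sketch of the quantities in Equations A.12-A.13 for the squared hinge loss (numpy assumed; X is the n x p data matrix, y the vector of +/-1 labels, a an active and j a non-active feature index; the positive-part residual mirrors the (.)+ in the gradient):

```python
import numpy as np

def d1_candidates(X, y, beta, beta0, gamma, gamma0, a, j):
    """Candidate step sizes from Equation A.12 for active feature a and
    non-active feature j; d1 is the smallest positive candidate over all j."""
    r = np.maximum(1.0 - y * (X @ beta + beta0), 0.0)  # positive-part residual
    corr_a = r @ (y * X[:, a])                         # A_Correlation (A.13)
    corr_j = r @ (y * X[:, j])                         # J_Correlation (A.13)
    v = gamma0 + X @ gamma                             # directional estimate
    val_a, val_j = v @ X[:, a], v @ X[:, j]            # A_val, J_val (A.13)
    return ((corr_a - corr_j) / (val_a - val_j),
            (corr_a + corr_j) / (val_a + val_j))
```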

Finding $\gamma_0$. The algorithm discussed in the previous section keeps all the features in the active set $A$ at the same gradient value. Thus, we cannot directly solve for $\gamma_0$ (this is also complicated by the fact that the gradient of $\beta_0$ is always 0). We obtain the value of $\gamma_0$ in terms of $\gamma$ as

$$\nabla_{\beta_0} L = \nabla_{\beta_0} (1_m - y * (X\beta) - y * \beta_0)_+^2 = (1_m - y * (X\beta) - y * \beta_0)^T(-y). \qquad (A.14)$$

Since $\nabla_{\beta_0}(L(\beta) + \lambda J(\beta)) = 0$, we have $(1_m - y * (X\beta) - y * \beta_0)^T(-y) + \lambda(0) = 0$, i.e., $(1_m - y * (X\beta) - y * \beta_0)^T(-y) = 0$. For a small step $d_1$, $(1_m - y * (X\beta) - y * \beta_0 - y * (X d_1\gamma) - y * (d_1\gamma_0))^T(-y) = 0$, so that

$$(y * (X\gamma + \gamma_0))^T y = 0 \implies (y * (X\gamma))^T y + (y * \gamma_0)^T y = 0 \implies \sum X\gamma + n\gamma_0 = 0 \implies \gamma_0 = -\frac{\sum X\gamma}{n}. \qquad (A.15)$$

Appendix B

Appendix for Chapter 4

B.1 Description of Real World Datasets

We describe the datasets used in our experiments.

Classification Datasets.

(1) a1a - dataset derived from the adult dataset. In this dataset, the continuous features were discretized

into quantiles and each quantile was binarized. Categorical features with m categories were converted

to m binary features. Details presented in Platt [86].

(2) australian - dataset containing information about credit card applications. The feature information is

not available. Dataset obtained from Chang and Lin [28].

(3) german.numer - dataset containing information about credit card applications. The feature information includes the amount present in the checking account, credit history, and employment, among others. Dataset obtained from Chang and Lin [28].

(4) heart - dataset is used to predict the presence or absence of heart disease. Features include age, sex,

chest pain, among others. Dataset obtained from Chang and Lin [28].

(5) hillvalley (noisy) - each example in this dataset represents 100 points on a two-dimensional graph. The points create either a hill (a bump in the terrain) or a valley (a dip in the terrain). We used the noisy version of this dataset, where the terrain is uneven and the hills and valleys are not obvious. Dataset obtained from Asuncion and Newman [5].

(6) ionosphere - detection of ‘good’ vs. ‘bad’ radar targets; the ‘good’ radar targets show evidence of some

structure in the ionosphere and the ‘bad’ targets do not. Features are 17 radar pulses each with 2

attributes. Dataset can be obtained from either Chang and Lin [28] or Asuncion and Newman [5].

(7) Chess (kr vs kp) dataset - dataset represents the position of chess pieces in an end game, where a pawn on a7 is one square away from queening. The task is to determine whether the white player can win or not. Dataset obtained from Asuncion and Newman [5].

(8) BUPA liver-disorders - detection of liver disorders. Data consist of 6 features, 5 of which are blood tests; the 6th is the number of half-pint drinks consumed by the individual. Dataset can be obtained from either Chang and Lin [28] or Asuncion and Newman [5].

(9) pimadiabetes - detection of diabetes for female patients of Pima Indian heritage. Features include blood tests, blood pressure, age, among others. Dataset can be obtained from either Chang and Lin [28] or Asuncion and Newman [5].

(10) spambase - distinguishing between spam and non-spam emails. Dataset can be obtained from either Asuncion and Newman [5] or Alcala-Fdez et al. [2].

(11) splice - “Splice junctions are points on a DNA sequence at which ‘superfluous’ DNA is removed during the process of protein creation in higher organisms. The problem posed in this dataset is to recognize, given a sequence of DNA, the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out). This problem consists of two subtasks: recognizing exon/intron boundaries (referred to as EI sites), and recognizing intron/exon boundaries (IE sites). (In the biological community, IE borders are referred to as ‘acceptors’ while EI borders are referred to as ‘donors’.)” - description at Asuncion and Newman [5].

(12) svmguide3 - dataset was presented in Hsu et al. [63]. According to the authors, it represents some form

of vehicle information.


(13) w1a - the goal is to classify whether a web page belongs to a category or not. The input is 300 sparse binary keyword attributes extracted from each web page. Dataset was presented in Platt [86].

(14) breast-cancer (diagnostic) / WDBC - detection of malignant or benign breast cancer tumor. Features

were computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and describe

the characteristics of the cell nuclei present in the image including the radius, texture, among others.

Dataset is available at Asuncion and Newman [5].

(15) breast-cancer - detection of malignant or benign breast cancer tumor. This dataset has different features

than the breast-cancer (diagnostic) dataset that was discussed earlier. Dataset is available at Asuncion

and Newman [5].

Regression Datasets.

(1) cadata (houses.zip) - “These spatial data contain 20,640 observations on housing prices with 9 economic

covariates” - according to K Pace and Barry [72].

(2) compactiv - “The data was collected from a Sun Sparcstation 20/712 with 128 Mbytes of memory

running in a multi-user university department. Users would typically be doing a large variety of tasks

ranging from accessing the internet, editing files or running very cpu-bound programs. The data was

collected continuously on two separate occasions. On both occasions, system activity was gathered every

5 seconds. The final dataset is taken from both occasions with equal numbers of observations coming

from each collection epoch in random order.” - description at Delve [1].

(3) diabetes - this dataset was discussed in Efron et al. [42]. The target value is the measure of disease

progression a year after baseline. The 10 baseline features include age, sex, bmi, among other features.

(4) housing - predicting house values in suburbs of Boston [5]. Features include crime rate, tax, age, among

others.

(5) mortgage - “This file contains the Economic data information of USA from 01/04/1980 to 02/04/2000 on a weekly basis. From given features, the goal is to predict the 30 year conventional mortgage rate.” - description at Alcala-Fdez et al. [2].

(6) auto-mpg - “The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms

of 3 multi-valued discrete and 5 continuous attributes” - according to Quinlan [91]. Features include

cylinder, displacement, horsepower, among others.

(7) Pole communications - This is a commercial application described in Weiss and Indurkhya (1995) and

at http://www.cs.su.oz.au/~nitin. The data describes a telecommunication problem. No further

information is available.

(8) triazines - the dataset is part of “Qualitative Structure Activity Relationships” and available at Asuncion

and Newman [5].

(9) census - The data were collected as part of the 1990 US census. These are mostly counts cumulated at different survey levels. They are all concerned with predicting the median price of a house in a region based on the demographic composition and the state of the housing market in the region. We used house-price-16H from the datasets available at Delve [1].

(10) Concrete compressive strength - “Concrete is the most important material in civil engineering. The

concrete compressive strength is a highly nonlinear function of age and ingredients.” - description at

Asuncion and Newman [5]. Eight quantitative input features are available.

B.2 Error Results on Artificial and Real World Datasets

Numerical results for the classification and regression datasets are presented in Table B.1. Numerical

results for artificial datasets are presented in Table B.2. Numerical results for the number of features correctly

and incorrectly detected by various L1 penalized algorithms are presented in Table B.3.


Error % on Real World Regression Datasets using LARS
Dataset         MSA            SBA            Bootstrap      Adaptive       L1
cadata          33.68 (8.74)   32.79 (6.45)   33.80 (8.71)   33.98 (9.53)   33.75 (8.73)
compactiv       2.56 (0.21)    2.56 (0.27)    2.58 (0.28)    2.54 (0.22)    2.57 (0.27)
diabetes        55.37 (11.81)  54.35 (10.26)  53.93 (12.23)  56.15 (11.66)  55.76 (10.85)
housing         22.47 (21.80)  19.45 (19.07)  21.91 (20.44)  17.49 (11.88)  23.39 (23.62)
mortgage        0.07 (0.02)    0.07 (0.01)    0.13 (0.18)    0.09 (0.07)    0.08 (0.03)
mpg             14.18 (5.35)   13.03 (3.08)   13.37 (3.42)   13.05 (3.49)   13.05 (3.23)
pole            29.68 (0.86)   30.00 (0.85)   30.06 (0.85)   29.73 (0.93)   30.08 (0.90)
triazines       90.88 (23.33)  83.90 (25.16)  84.27 (17.56)  89.72 (22.60)  90.88 (18.01)
concrete        22.82 (3.53)   22.32 (3.74)   22.59 (3.75)   22.36 (3.52)   22.53 (3.71)
census          52.56 (18.92)  50.48 (14.22)  52.61 (18.65)  55.90 (28.93)  52.31 (17.16)
average         32.43          30.90          31.52          32.10          32.44
ranks           3.00           1.70           3.60           3.00           3.70

Error % on Real World Classification Datasets using L1-SVM2
Dataset         MSA            SBA            Bootstrap      Adaptive       L1
a1a             18.63 (1.69)   18.19 (2.17)   18.93 (1.89)   18.38 (2.07)   18.37 (1.70)
australian      15.36 (2.91)   14.49 (3.20)   14.35 (3.51)   14.78 (3.54)   14.93 (2.98)
breastcancer    4.06 (3.77)    3.48 (3.56)    3.77 (3.70)    3.91 (3.72)    4.21 (3.87)
germannumber    26.10 (2.77)   25.60 (1.65)   26.70 (2.11)   26.30 (1.49)   25.60 (1.71)
heart           18.52 (7.20)   14.44 (5.91)   17.04 (7.03)   17.78 (6.72)   17.78 (7.57)
hillvalley      19.03 (5.71)   13.27 (3.48)   24.68 (5.63)   24.50 (6.26)   28.36 (5.76)
ionosphere      10.09 (5.51)   7.50 (3.94)    9.63 (4.37)    7.78 (3.41)    7.78 (3.88)
krvskp          2.97 (1.12)    3.00 (0.90)    3.07 (1.04)    3.07 (1.18)    3.25 (0.92)
liverdisorders  30.81 (7.90)   29.24 (8.46)   32.38 (10.29)  31.33 (7.94)   31.67 (7.87)
pimadiabetes    23.94 (5.48)   23.29 (4.93)   23.42 (5.17)   23.30 (4.46)   23.55 (4.73)
spambase        8.14 (1.02)    8.09 (1.24)    8.22 (0.75)    8.33 (0.93)    8.20 (0.80)
splice          20.10 (3.21)   19.10 (3.11)   19.90 (4.51)   19.60 (3.31)   20.10 (3.78)
svmguide3       16.34 (1.60)   16.11 (3.13)   17.21 (2.10)   17.15 (2.22)   16.90 (2.42)
w1a             2.67 (0.67)    2.30 (0.47)    2.79 (0.49)    2.30 (0.47)    2.46 (0.62)
WDBC            3.69 (2.92)    3.86 (3.07)    3.16 (3.07)    4.56 (3.62)    4.56 (3.62)
average         14.70          13.46          15.02          14.87          15.18
ranks           3.37           1.30           3.47           3.17           3.70

Table B.1: Results on real world classification and regression datasets using L1-SVM2 and LARS, respectively. Algorithms with the lowest error rate are in bold; parentheses contain the standard deviation. We only show results from the multiple-step versions of MSA and SBA (Subsampling). We also omit the results from SBA (Bootstrap).


Error % for LARS (Regression)
Artificial   MSA     SBA(Boot)  SBA(Sub)  MSA       SBA(Boot)  SBA(Sub)  Bootstrap  Adaptive  L1
Dataset No.  Single  Single     Single    Multiple  Multiple   Multiple
1            5.63    5.59       5.46      5.54      5.39       5.29      5.68       5.88      6.34
2            1.51    1.49       1.45      1.50      1.48       1.45      1.50       1.51      1.54
3            19.33   23.70      23.39     16.89     20.76      18.80     27.80      26.59     29.11
4            5.72    5.74       5.56      5.63      5.52       5.37      5.76       6.07      6.70
5            1.52    1.49       1.46      1.50      1.48       1.45      1.50       1.49      1.56
6            24.76   34.46      31.47     19.14     26.43      25.21     50.95      38.51     37.94
7            6.57    6.66       6.62      6.42      6.44       6.10      6.70       7.27      7.74
8            1.67    1.66       1.63      1.66      1.64       1.61      1.66       1.72      1.73
9            7.30    7.60       7.57      7.05      6.73       6.81      8.68       7.70      9.91
average      8.22    9.82       9.40      7.26      8.43       8.01      12.25      10.75     11.40
ranks        5.22    5.56       3.56      3.56      2.78       1.44      6.89       7.22      8.78

Error % for L1-SVM2 (Classification)
Artificial   MSA     SBA(Boot)  SBA(Sub)  MSA       SBA(Boot)  SBA(Sub)  Bootstrap  Adaptive  L1
Dataset No.  Single  Single     Single    Multiple  Multiple   Multiple
1            11.05   10.75      9.52      10.58     10.31      8.72      10.82      11.13     11.77
2            4.59    4.37       3.51      4.22      4.01       3.52      4.49       4.15      4.68
3            25.46   25.85      23.62     23.64     24.47      22.24     26.96      26.08     26.20
4            11.71   11.27      10.18     11.14     10.71      9.51      11.77      11.65     12.90
5            4.41    4.51       3.62      4.33      4.50       3.61      4.66       3.88      5.08
6            30.27   30.20      28.46     28.74     29.16      26.85     31.70      30.73     30.78
7            12.30   11.98      11.09     11.75     11.38      10.37     12.08      12.52     13.33
8            5.51    5.50       4.33      5.15      5.23       4.28      5.67       5.42      5.77
9            20.84   20.17      18.92     19.93     19.79      18.31     21.20      20.16     20.70
average      14.02   13.84      12.58     13.28     13.28      11.93     14.37      13.97     14.58
ranks        6.67    5.67       1.89      3.78      3.67       1.11      7.78       5.89      8.56

Error % for LPSVM (Classification)
Artificial   MSA     SBA(Boot)  SBA(Sub)  MSA       SBA(Boot)  SBA(Sub)  Bootstrap  Adaptive  L1
Dataset No.  Single  Single     Single    Multiple  Multiple   Multiple
1            11.53   12.44      9.83      10.16     9.45       7.56      10.43      11.40     11.91
2            4.25    5.07       3.72      3.83      3.53       2.73      3.97       3.59      4.47
3            25.32   25.76      22.43     22.65     21.27      15.48     25.54      26.41     26.14
4            12.55   14.07      10.39     11.41     10.92      8.33      11.09      11.87     13.25
5            4.41    5.57       3.88      4.29      3.72       2.79      4.07       2.68      5.11
6            30.35   30.67      26.47     28.20     26.74      20.19     30.99      30.93     30.61
7            12.46   14.02      11.12     11.39     11.26      8.92      11.66      12.87     13.24
8            5.30    6.02       4.53      4.98      4.58       3.89      5.48       5.01      5.64
9            18.73   18.46      16.32     17.93     17.56      13.96     18.96      18.19     18.15
average      13.88   14.67      12.08     12.76     12.11      9.32      13.58      13.66     14.28
ranks        6.44    8.33       2.67      4.44      2.67       1.11      6.22       5.67      7.44

Error % for Logistic Regression (Classification)
Artificial   MSA     SBA(Boot)  SBA(Sub)  MSA       SBA(Boot)  SBA(Sub)  Bootstrap  Adaptive  L1
Dataset No.  Single  Single     Single    Multiple  Multiple   Multiple
1            8.90    11.59      10.14     7.98      7.99       6.61      10.57      9.79      11.76
2            4.82    3.82       3.23      3.50      3.36       2.65      4.21       2.90      4.15
3            20.84   26.78      26.09     17.59     21.65      17.35     25.57      25.97     26.65
4            11.02   11.84      10.58     10.07     8.43       6.45      12.85      10.43     13.07
5            6.42    3.46       3.12      3.98      2.99       2.41      4.71       2.94      4.46
6            24.12   31.13      30.53     22.19     24.63      20.42     33.71      30.38     31.13
7            11.45   13.25      11.89     9.79      10.32      8.22      16.60      11.57     13.49
8            5.36    4.95       4.17      3.91      4.01       3.25      6.54       4.28      5.12
9            17.68   19.93      18.93     15.59     13.80      12.23     23.09      19.08     20.15
average      12.29   14.08      13.19     10.51     10.80      8.84      15.32      13.04     14.44
ranks        5.56    6.89       5.11      3.00      3.11       1         8          4.44      7.89

Table B.2: Results on nine synthetic datasets using four different L1 penalized algorithms. Algorithms with the highest accuracy are in bold. Standard deviations are omitted.

LARS
Artificial   MSA        SBA(Boot)  SBA(Sub)   MSA        SBA(Boot)  SBA(Sub)  Bootstrap  Adaptive   L1
Dataset No.  Single     Single     Single     Multiple   Multiple   Multiple
1            12.4/1.3   12.9/2.3   12.9/3.1   12.4/1.4   13.3/1.6   12.9/0.5  13.0/3.1   12.4/5.7   13.9/18.2
2            11.0/0.2   12.1/0.7   13.1/1.0   11.6/1.1   12.1/0.6   12.6/0.8  12.5/0.0   11.8/2.2   12.9/5.6
3            14.9/3.1   14.7/13.1  14.9/24.1  14.9/0.4   14.8/5.6   14.6/3.5  13.8/9.6   14.7/18.0  15.0/40.0
4            12.2/1.5   12.3/1.6   12.4/3.1   12.1/1.2   12.6/1.1   12.4/0.4  12.4/1.9   12.1/5.9   13.7/23.7
5            11.0/0.2   12.0/0.6   13.0/1.4   11.6/1.7   12.1/0.7   12.4/0.8  12.2/0.0   12.2/2.0   12.8/6.1
6            14.6/7.4   14.2/27.1  14.3/31.8  14.5/0.3   14.3/7.8   13.9/8.3  10.2/4.7   14.2/33.3  14.8/54.6
7            12.5/1.3   12.8/2.6   13.4/6.8   12.5/0.9   13.0/1.5   12.9/0.4  13.1/4.4   12.5/7.7   14.0/21.5
8            10.8/1.8   11.5/3.0   11.7/3.4   10.8/1.7   11.4/2.3   11.5/2.0  12.0/3.2   10.7/3.9   12.8/9.1
9            23.8/3.1   24.3/10.3  24.5/18.2  23.8/2.0   24.6/5.1   24.1/8.2  22.7/2.9   24.5/15.4  24.7/36.5
average      13.7/2.2   14.1/6.8   14.5/10.3  13.8/1.2   14.3/2.9   14.1/2.8  13.5/3.3   13.9/10.5  15.0/23.9

L1-SVM2
Artificial   MSA        SBA(Boot)  SBA(Sub)   MSA        SBA(Boot)  SBA(Sub)  Bootstrap  Adaptive   L1
Dataset No.  Single     Single     Single     Multiple   Multiple   Multiple
1            8.5/2.9    8.9/3.8    10.0/4.5   8.2/2.5    8.6/2.2    9.3/1.8   8.3/2.0    9.4/7.3    10.3/12.2
2            4.4/0.5    5.4/0.8    6.9/0.7    5.0/1.5    4.9/0.6    6.3/0.5   5.0/0.0    6.3/2.9    6.5/4.0
3            11.2/10.9  11.7/14.5  12.0/16.7  11.6/13.1  10.7/6.4   11.1/7.5  9.4/7.2    12.2/20.1  13.1/34.1
4            7.8/2.3    8.7/3.8    9.1/3.8    8.0/3.2    8.3/2.2    8.6/1.7   7.4/1.0    9.1/8.3    9.4/9.1
5            4.4/0.3    5.1/1.2    6.5/0.8    4.5/0.4    4.5/0.8    5.8/0.5   4.5/0.1    6.6/2.1    6.1/2.9
6            8.5/11.0   9.0/13.3   9.8/17.0   8.9/13.5   7.9/5.4    8.2/4.9   6.3/4.6    9.8/23.4   10.9/35.6
7            7.9/2.9    8.6/4.6    9.5/5.5    7.9/2.7    8.0/2.3    8.8/2.6   7.8/2.2    8.6/7.9    9.8/13.5
8            4.2/1.6    5.0/2.6    6.1/3.4    4.6/2.5    4.3/1.8    5.6/2.5   4.4/1.2    5.4/4.3    6.7/8.2
9            10.8/5.4   12.5/6.8   14.1/9.2   12.1/10.1  10.8/3.8   12.4/4.9  10.0/2.2   14.3/13.7  15.1/18.3
average      7.5/4.2    8.3/5.7    9.3/6.8    7.9/5.5    7.6/2.8    8.5/3.0   7.0/2.3    9.1/10.0   9.8/15.3

LPSVM
Artificial   MSA        SBA(Boot)  SBA(Sub)   MSA        SBA(Boot)  SBA(Sub)   Bootstrap  Adaptive   L1
Dataset No.  Single     Single     Single     Multiple   Multiple   Multiple
1            8.0/1.9    10.0/12.0  10.8/11.9  8.0/2.6    9.1/3.0    9.8/3.1    8.4/1.9    10.2/13.1  10.0/10.7
2            4.7/0.3    8.0/21.9   7.8/9.5    5.2/0.9    8.6/18.9   11.5/39.8  5.5/0.1    11.0/5.3   6.7/2.8
3            10.9/9.3   13.0/26.5  13.8/33.9  10.6/5.7   12.3/9.6   13.3/7.8   9.9/5.8    12.6/28.4  12.7/29.3
4            7.6/2.9    9.3/18.6   10.2/12.0  7.6/2.2    8.7/6.4    9.6/3.6    7.9/1.8    10.3/28.9  9.6/13.2
5            4.5/0.4    7.5/22.5   7.6/9.2    4.4/0.8    7.9/20.7   12.0/58.6  5.2/0.1    13.0/5.3   6.6/3.6
6            8.6/11.4   11.0/32.5  12.7/45.6  8.2/9.2    10.3/14.5  11.9/9.8   6.6/4.2    10.9/51.1  10.8/32.4
7            8.0/2.5    9.7/13.5   10.7/14.9  7.5/1.1    9.0/4.5    9.8/5.5    8.0/2.2    9.9/16.4   9.6/11.6
8            4.0/1.2    5.7/11.8   7.1/12.1   4.1/1.3    6.2/14.9   9.0/39.6   4.7/1.0    9.9/12.9   6.3/5.9
9            12.3/4.5   16.4/15.1  18.8/19.3  13.5/8.5   13.6/7.0   16.5/7.1   12.7/2.9   18.5/23.8  16.5/15.2
average      7.6/3.8    10.1/19.4  11.0/18.7  7.7/3.6    9.5/11.0   11.5/19.4  7.7/2.2    11.8/20.6  9.9/13.9

Logistic Regression
Artificial   MSA         SBA(Boot)  SBA(Sub)   MSA         SBA(Boot)  SBA(Sub)   Bootstrap   Adaptive   L1
Dataset No.  Single      Single     Single     Multiple    Multiple   Multiple
1            10.2/13.6   11.1/18.5  11.1/19.5  9.8/9.2     11.5/25.1  11.1/17.3  9.6/4.3     10.3/9.0   10.8/22.9
2            7.2/3.5     10.1/9.5   10.4/6.7   7.2/2.6     10.0/15.4  10.8/14.2  7.5/0.1     10.6/2.4   8.8/9.2
3            12.9/27.6   13.5/43.9  13.6/48.2  12.6/17.8   13.9/50.6  13.3/28.9  12.6/29.7   12.4/26.4  13.6/49.2
4            8.7/7.9     10.4/19.2  10.8/19.2  8.6/5.8     11.1/28.2  11.1/19.5  8.6/1.8     9.7/7.6    9.2/17.2
5            6.0/2.0     9.6/8.0    10.1/4.8   6.7/1.4     10.4/14.6  11.1/14.8  6.3/0.1     10.7/2.1   7.5/5.7
6            10.6/32.9   12.0/56.6  12.4/67.4  10.3/22.3   12.3/62.1  12.0/41.4  10.1/25.0   10.6/36.4  11.4/54.8
7            9.8/13.6    10.9/22.3  11.2/23.3  9.5/8.9     11.4/29.2  10.9/19.1  9.1/4.8     9.4/8.3    10.7/25.6
8            7.7/8.1     9.6/13.2   9.5/10.7   7.6/6.5     9.7/24.1   9.8/23.9   8.1/2.7     9.1/7.7    9.5/16.7
9            16.3/26.2   19.2/38.8  19.6/41.6  15.2/19.6   20.5/48.3  19.8/40.8  16.6/12.3   16.3/18.4  18.0/43.4
average      9.9/15.0    11.8/25.5  12.1/26.8  9.7/10.5    12.3/33.1  12.2/24.4  9.8/9.0     11.0/13.1  11.1/27.2

Table B.3: Features (correctly/incorrectly) selected by four different L1 penalized algorithms. Algorithms with the highest number of correct features or the lowest number of incorrect features are in bold; algorithms with the lowest number of correct features or the highest number of incorrect features have a wavy underline.

B.3 Feature Selection Results for Artificial Datasets

Our primary criteria for model selection is error rate and we showed that weighted models obtained

from SBA outperform other algorithms. Now, we examine the features selected by individual algorithms. In

Figure B.1, we show the average number of feature, over all nine synthetic datasets, correctly and incorrectly

selected by the algorithms. Our observations are as follows:

� All algorithms have a lower false discovery rate compared to the standard L1 algorithm.

� Interestingly, adaptive and SBA show higher rate of correct detection, compared to MSA and Boot-

strap, but at the cost of including more incorrect features in the final models.

[Figure B.1: bar chart; y-axis: % of features detected (0%-80%); one pair of bars (Correct Detection, Incorrect Detection) for each of Bootstrap, MSA, L1, Adaptive, and SBA.]

Figure B.1: Features correctly and incorrectly detected by four different L1 penalized algorithms.


B.4 Expanding Dataset Dictionary

We expanded available real world datasets to include both first and second order features. We briefly

discuss the error rate results for models trained using only first order features compared to models trained

using both first and second order features. Our results are shown in Figure B.2 using 10 fold cross validation

and standard L1 penalized algorithms, with LARS for regression and L1-SVM2 for classification. The

difference in the error is significant at p = 0.03467, using the wilcoxon signrank test [124].
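As an illustration of the expanded dictionary (a sketch assuming scikit-learn and scipy; the error-rate arrays below are placeholders, not our measured results):

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.preprocessing import PolynomialFeatures

X = np.random.rand(100, 7)          # e.g. the 7 first-order features of mpg
expand = PolynomialFeatures(degree=2, include_bias=False)
X2 = expand.fit_transform(X)        # adds squares and pairwise products
print(X.shape[1], X2.shape[1])      # 7 -> 35, matching mpg (7/35) in Figure B.2

# Paired per-dataset error rates for the two dictionaries (placeholder values):
err_first = np.array([0.20, 0.15, 0.30, 0.12, 0.25])
err_second = np.array([0.18, 0.14, 0.28, 0.11, 0.24])
stat, p_value = wilcoxon(err_first, err_second)
```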

[Figure B.2: bar chart of error rate (0-0.7) per dataset, comparing models trained on first-order features against models trained on first- plus second-order features; dataset labels give the (first-order/expanded) feature counts: mortgage (15/135), WDBC (30/495), breastcancer (10/65), ionosphere (34/629), australian (14/119), heart (13/104), svmguide3 (22/275), mpg (7/35), pimadiabetes (8/44), german.numer (24/324), compactiv (21/252), housing (13/104), liverdisorders (6/27), cadata (8/44), concrete (8/44), pole (26/377), diabetes (10/65), census (16/152).]

Figure B.2: Results by using the expanded dictionary for real world datasets with the standard L1 penalized algorithm: LARS (regression) and L1-SVM2 (classification).

Appendix C

Appendix for Chapter 5

C.1 Description of Real World Datasets

We describe the datasets used for our experiments. Note that all the datasets used are classification datasets.

(1) covertype - prediction of forest cover type based only on cartographic features like elevation, slope,

distance from water, etc. The dataset has seven distinct forest cover type classes [10] and is available

at Asuncion and Newman [5].

(2) ibn sina - was a challenge dataset in the ‘Active Learning Challenge’ [57]. The task is to spot Arabic words in ancient manuscripts to facilitate indexing.

(3) sylva - classification of Ponderosa pine vs. other types of forest cover. This dataset was a challenge dataset in the ‘Agnostic Learning vs. Prior Knowledge Challenge’ [56].

(4) zebra - classification of whether the zebrafish embryo is in division (meiosis) or not. This dataset was a

challenge dataset in ‘Active Learning Challenge’ [57].

(5) musk - classification of whether new molecules will be musks or non-musks. This dataset is available at

Asuncion and Newman [5].


(6) hemodynamic 1 to hemodynamic 4 - classifying the level of vacuum pressure set in a Lower Body Negative Pressure (LBNP) chamber. An LBNP chamber is used to induce the effects of blood loss, in humans and pigs, by lowering the pressure on the lower part of the body, causing blood to flow towards the lower extremities; the amount of vacuum pressure is correlated with the amount of blood loss [30]. Each dataset was obtained from a different patient group, each containing between 30 and 40 patients. Time series data were obtained from various sensors, including heart rate monitors and pulse-oximeters, among others; features in the data are derived from time-frequency domain analysis of the sensor signals. Note that each dataset has features extracted in a different manner. Dataset is available upon request.

(7) factors, fourier, karhunen, pixel, and zernike - this set of datasets consists of features of handwritten numerals (‘0’–‘9’) extracted from a collection of Dutch utility maps. 200 patterns per class (for a total of 2,000 patterns) have been digitized in binary images. This dataset is available at Asuncion and Newman [5]. The digits are represented in terms of the following five feature sets:

- factors: 216 profile correlations.

- fourier: 76 Fourier coefficients of the character shapes.

- karhunen: 64 Karhunen-Loève coefficients.

- pixel: 240 pixel averages in 2 x 3 windows.

- zernike: 47 Zernike moments.

(8) power 1, power 2, and power 3 - the goal of these datasets is to classify between optimal and sub-optimal power configurations based on the connections present in a power grid and the cost/reliability induced by the connections [54]. The overall idea is to optimally collocate distributed generation of power and feeder interties between feeders at a feasible cost, to improve the supply reliability of power while also satisfying power flow constraints in an islanded mode of operation of the distribution system. The targets in the datasets differ on the basis of the Pareto-optimal front induced by the cost vs. reliability index and the feeder connections. Dataset is available upon request.


C.2 Results on Real World Datasets

We show the error rate in Table C.1 and the relative error rates in Table C.2.

C.3 Choice of ntree Parameter

The ntree parameter affects the generalization error of the random forest. According to Breiman,

a large value of ntree should be chosen. In this appendix, we show why our choice of ntree = 1000 is

appropriate for the datasets that were used for our experiments.

[Figure C.1 compares forests with ntree ∈ {100, 250, 500, 1000, 2000}; average improvement in test error over the ntree = 100 baseline: 100 trees (0.00%), 250 trees (1.93%), 500 trees (2.60%), 1000 trees (3.27%), 2000 trees (3.34%).]

Figure C.1: What is an appropriate value of ntree? Results depict significance testing via the Bergmann-Hommel procedure at p=0.05. The average improvement in test error from the baseline condition of ntree = 100 is noted in parentheses.

We examine the hypothesis that there is no difference in prediction abilities between forests of different sizes. For each dataset used in our experiments, we found the best model for each of the following ntree values: 100, 250, 500, 1000, and 2000 trees.


Dataset    Baseline  Prob      Prob      Err       Err       Breiman   Strobl    Breiman   Strobl
                     Reweight  Removal   Reweight  Removal   Reweight  Reweight  Removal   Removal
covtype    0.2311    0.2304    0.2320    *0.2326   0.2299    0.2303    0.2299    0.2299    0.2295
hemo 1     0.2819    0.2752    0.2695    0.2760    *0.2947   0.2939    0.2927    0.2893    0.2942
hemo 2     0.0820    0.0744    0.0772    0.0747    0.0855    0.0847    0.0799    0.0806    *0.0864
hemo 3     0.1196    0.1015    0.0969    0.1022    0.1235    0.1203    0.1248    0.1252    *0.1268
hemo 4     0.2706    0.2537    0.2512    0.2529    0.2546    0.2566    0.2514    0.2519    *0.2804
zernike    0.2029    0.2073    0.2189    0.2063    0.2206    0.2195    0.2183    *0.2218   0.2187
karhunen   0.0441    0.0432    *0.0483   0.0435    0.0428    0.0427    0.0423    0.0435    0.0423
fourier    0.1721    0.1724    *0.1799   0.1743    0.1695    0.1741    0.1672    0.1658    0.1774
factors    0.0333    0.0333    *0.0417   0.0357    0.0378    0.0384    0.0380    0.0378    0.0379
pow 1      0.0353    0.0352    0.0392    0.0360    0.0421    *0.1226   0.0452    0.0521    0.0463
pow 2      0.1791    0.1848    0.1956    0.1798    0.1887    0.1947    0.1896    *0.1968   0.1953
pow 3      *0.1664   0.1609    0.1653    0.1626    0.1574    0.1617    0.1559    0.1633    0.1650
sina       0.0367    0.0350    *0.0388   0.0365    0.0350    0.0362    0.0334    0.0342    0.0351
sylva      0.0090    0.0097    0.0090    0.0089    0.0094    *0.0102   0.0094    0.0095    0.0091
zebra      0.2049    0.1576    0.1594    0.1439    0.1862    0.1934    0.1918    0.1915    *0.2132
musk       0.0259    0.0257    0.0267    0.0267    0.0283    0.0281    *0.0286   0.0286    0.0281
pixel      0.0318    0.0333    *0.0339   0.0309    0.0293    0.0301    0.0292    0.0299    0.0294
average    0.1251    0.1196    0.1226    0.1190    0.1256    *0.1316   0.1252    0.1266    0.1303
ranks      4.882     3.529     5.500     3.971     4.971     *6.235    4.176     5.559     6.176

Table C.1: Error rates on individual datasets, for all algorithms. The worst value in each row is marked with an asterisk.

Dataset    Prob      Prob      Err       Err       Breiman   Strobl     Breiman   Strobl
           Reweight  Removal   Reweight  Removal   Reweight  Reweight   Removal   Removal
covtype    -0.69     -0.37     -1.06     *-1.32    -0.18     -0.32      -0.18     -0.16
hemo 1     4.18      6.45      8.41      6.18      *-0.18    0.11       0.53      1.67
hemo 2     5.18      13.96     10.72     13.62     *1.13     2.01       7.50      6.73
hemo 3     5.63      19.93     23.58     19.37     2.57      5.11       1.55      *1.27
hemo 4     *3.50     9.52      10.42     9.81      9.22      8.49       10.36     10.18
zernike    7.22      5.22      -0.09     5.65      -0.85     -0.37      0.18      *-1.43
karhunen   -4.27     -2.14     *-14.09   -2.71     -1.10     -0.82      0.14      -2.88
fourier    3.00      2.82      *-1.40    1.76      4.50      1.86       5.80      6.54
factors    12.22     12.15     *-10.16   5.72      0.24      -1.36      -0.20     0.17
pow 1      23.71     24.09     15.30     22.35     9.09      *-164.71   2.42      -12.60
pow 2      8.26      5.37      -0.17     7.93      3.36      0.27       2.90      *-0.81
pow 3      *-0.86    2.47      -0.21     1.44      4.61      2.00       5.48      1.04
sina       -4.34     0.40      *-10.28   -3.76     0.28      -2.95      4.97      2.55
sylva      0.97      -6.72     1.06      2.89      -2.61     *-12.19    -2.86     -4.41
zebra      *3.90     26.11     25.22     32.51     12.66     9.29       10.06     10.18
musk       7.78      8.46      4.89      4.89      -0.87     0.00       *-2.00    -2.00
pixel      -8.05     -13.43    *-15.30   -5.32     0.32      -2.45      0.50      -1.88
average    3.96      6.72      2.75      7.12      2.48      *-9.18     2.77      0.83
worst      -8.05     -13.43    -15.30    -5.32     -2.61     *-164.71   -2.86     -12.60
best       23.71     26.11     25.22     32.51     12.66     9.29       10.36     10.18
ranks      4.588     3.294     4.971     3.735     4.618     5.765      3.941     5.088

Table C.2: % Relative Improvement on Error Rate – over baseline – on Individual Datasets, for all algorithms. The worst value in each row is marked with an asterisk.
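A note on how Table C.2 relates to Table C.1: the relative improvement for an algorithm a on a dataset is the percent reduction in error over the baseline,

relative improvement(a) = 100 × (err_baseline − err_a) / err_baseline,

presumably computed per training/test split and then averaged, which is why the entries need not exactly match a direct calculation from the averaged error rates in Table C.1.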


The best model for a given ntree value and a particular dataset was found as follows: for each training fold, we parametrically searched for the best mtry value (over {p/10, 2p/10, . . . , p, √p}, where p is the number of features) by choosing the model with the lowest out-of-bag error rate. We then evaluated the performance of the chosen forest on the test set.
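As an illustration of this selection procedure, here is a minimal sketch using scikit-learn (the thesis experiments used our own Matlab/C++ implementation; the helper name select_mtry and the use of sklearn's forests are ours):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_mtry(X_train, y_train, ntree=1000):
    """Pick mtry by out-of-bag error over the grid {p/10, 2p/10, ..., p, sqrt(p)}."""
    p = X_train.shape[1]
    grid = sorted({max(1, round(k * p / 10)) for k in range(1, 11)} |
                  {max(1, round(np.sqrt(p)))})
    best_mtry, best_oob_error = None, np.inf
    for mtry in grid:
        rf = RandomForestClassifier(n_estimators=ntree, max_features=mtry,
                                    oob_score=True, n_jobs=-1, random_state=0)
        rf.fit(X_train, y_train)
        oob_error = 1.0 - rf.oob_score_   # out-of-bag error rate
        if oob_error < best_oob_error:
            best_mtry, best_oob_error = mtry, oob_error
    return best_mtry
```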

In Figure C.1, we show the relative improvement, over all real world datasets used in our experiments, compared to the baseline results obtained using forests of 100 trees. We connect all forests (with different ntree values) whose results are not significantly different from each other using the Bergmann-Hommel procedure [52]. As shown in the figure, there is a significant improvement (≈ 0.67%) when forests are constructed using 1000 trees compared to using 500 trees, and only a small (≈ 0.07%) improvement when the forest size is set to 2000 trees instead of 1000 trees.

C.4 Choice of mtry Parameter

In this appendix, we show why our choice of parametrically searching for the best mtry is appropriate for the datasets that were used for our experiments. We examine the hypothesis that there is no difference in prediction abilities between forests constructed by parametrically finding mtry vs. forests constructed with the suggested mtry = √p, where p is the number of features; as all the datasets used in our experiments are classification datasets, we use the suggested mtry = √p value for classification [19]. For all the datasets used in our experiments, we found the best model when the number of trees in the random forest is 1000. The best model for ntree = 1000 and a particular dataset was found as follows: for each training fold, we parametrically searched for the best mtry value (over {p/10, 2p/10, . . . , p, √p}) by choosing the model with the lowest out-of-bag error rate. We then evaluated the performance of the chosen forest on the test set. Note that we also recorded the test set error for mtry = √p.

[Figure C.2 compares two conditions: mtry parametrically searched (6.14%) vs. mtry = √p (0.00%).]

Figure C.2: For Random Forests, do we need to parametrically search for the best mtry? Results depict significance testing via the Bergmann-Hommel procedure at p=0.05. The average improvement in test error from the baseline condition of mtry = √p is noted in parentheses.

In Figure C.2, we show the relative improvement, over all datasets used in our experiments, compared to the baseline results obtained using mtry = √p, where p is the number of features in the dataset. We use the Bergmann-Hommel procedure [52] as a post-hoc test and depict the results graphically in the figure; algorithms that are significantly different from each other are not connected by a vertical line. As shown in the figure, there is a significant difference at p = 0.05 and an improvement (≈ 6.14%) when forests are constructed using a parametrically found mtry vs. the default suggested value of mtry = √p.

C.5 A Challenge for ML algorithms

In Section 5.2, our primary focus was random forests, and we showed that, on a family of artificial datasets, existing feature importance measures within random forests are unable to correctly distinguish between relevant and irrelevant features. We can design similar datasets that are problematic for other supervised classification algorithms. We now discuss an artificial dataset that is problematic across a variety of algorithms.

We generated the dataset using features x1, . . . , x6 and then appended the feature x7 = tan(x6).

X = [x1 x2 x3 x4 x5 x6 x7]    (C.1)

where x1, . . . , x5 are relevant features represented by column vectors of size n,
x6 and x7 are irrelevant features represented by column vectors of size n,
and x7 = tan(x6).


The data matrix X follows a multivariate normal distribution with the covariance matrix described as follows (the first five columns correspond to the relevant features, the last two to the irrelevant features):

cov(X) =
[ 0.50  0     0     0     0     0.10  0.12
  0     0.50  0     0     0     0.10  0.12
  0     0     0.50  0     0     0.10  0.12
  0     0     0     0.50  0     0.10  0.12
  0     0     0     0     0.50  0.10  0.12
  0.10  0.10  0.10  0.10  0.10  0.10  0.12
  0.12  0.12  0.12  0.12  0.12  0.12  0.12 ]    (C.2)

The output vector Y of size n is obtained from the following nonlinear generative model:

y_i = ∏_{j=1}^{5} [sin(x_ij) + 1]    (C.3)

We tuned the size of the dataset and the covariance matrix so that the dataset is problematic for a variety of algorithms. There is a higher correlation (almost double compared to the relevant features) between the irrelevant features and the target value Y. Note that one can apply multiple tan() transforms to the irrelevant features to reduce the amount of correlation between the relevant and the irrelevant features. A minimal sketch of this generative process is given below.
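The following Python sketch reflects one reading of this construction (an assumption on our part: we sample x1, . . . , x6 jointly Gaussian using the upper-left 6 × 6 block of Equation C.2 and then derive x7 = tan(x6); the sample sizes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_dataset(n):
    """Sample n examples: x1..x6 jointly Gaussian (upper-left block of
    Equation C.2), x7 = tan(x6) appended, and y from Equation C.3."""
    cov = 0.5 * np.eye(6)           # var(x1..x5) = 0.5, mutually uncorrelated
    cov[5, :] = cov[:, 5] = 0.1     # cov(xj, x6) = 0.1 and var(x6) = 0.1
    X6 = rng.multivariate_normal(np.zeros(6), cov, size=n)
    X = np.column_stack([X6, np.tan(X6[:, 5])])     # x7 = tan(x6)
    y = np.prod(np.sin(X[:, :5]) + 1.0, axis=1)     # Equation C.3
    return X, y

X_train, y_train = make_dataset(1000)
X_val, y_val = make_dataset(1000)
```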

For our experiments, we generated 1000 training examples according to Equation C.3. We also generated a separate validation set consisting of 1000 examples. Note that as this dataset is a single instance of the distribution in Equation C.3, similar, but not identical, results will be obtained for another instance of the distribution. Due to the large number of classifiers and parameters involved, we show results using a single sample of the artificial model. Note that for all discussed algorithms on this artificial dataset, the error rate on validation data is lower when only the relevant features are used to train the model than when an irrelevant feature is included in training. Next, we discuss feature importance values obtained from multiple algorithms including random forests [19], cforest [62], SVM [119], ReliefF [93] and BayesNN [84]; we refrain from an extensive discussion of these algorithms.

[Figure C.3: eight panels of feature importance for features 1–7 (features vs. importance).]

Figure C.3: Importance (without retraining) using various algorithms for the artificial dataset. The red color represents importance for the spurious features, whereas green represents importance for the relevant features. We show importance for features as bars for (a) Random Forest - Breiman importance, (b) Random Forest - Strobl importance, (c) cforest - conditional importance, (d) cforest - unconditional importance, and (f) SVM (held-out folds). (e) Importance over different numbers of nearest neighbors ({1, 5, 10, 50, 100}) is shown for ReliefF. We plot the hyperparameter values for BayesNN with 25 and 100 hidden units in (g) and (h), respectively.


Random Forests Importance Using Breiman's and Strobl's Importance Measures. We created 1000 trees and parametrically created models corresponding to mtry = 1, . . . , p. We then chose the best performing model (which was trained with mtry = 2) using validation data¹ and we plot the importance values obtained for all features from the best model in Figure C.3. Importance values obtained via Breiman's importance measure and Strobl's importance measure are shown; in both cases, at least one of the spurious features is assigned a higher importance value than the relevant features.

cforest - CART/RF Based Algorithm. We briefly discussed Strobl's importance in Section 5.1.1. Strobl's importance was first proposed for the conditional trees described in [62]. The 'party' package [62] for R [66] provides two ways to calculate importance: unconditional importance and conditional importance. Conditional importance is equivalent to Strobl's importance measure and unconditional importance is equivalent to Breiman's importance measure.

We created 1000 trees and parametrically created models corresponding to mtry = 1, . . . , p. We then chose the best performing model (which was trained with mtry = 2) using the validation data and we plot the importance values obtained from the best model in Figure C.3. We show importance obtained via both the conditional and unconditional measures; in both cases, at least one of the relevant features is assigned a lower importance value than the spurious features.

There are some observations regarding the cforest algorithm that may be pertinent to the reader. The minimum validation error (relative error) achieved using cforest was 0.2637, which was higher than the 0.2247 obtained for random forests. The authors of the package warn of excess computation time when conditional importance is calculated. We observed that calculating the conditional importance took upwards of a week on a 4 GHz Core-i7 processor when the number of permutations was set to 10; the algorithm also consumed more than 4 gigabytes of memory in the process. When the number of examples was increased to 2000, the party package [62] (which implements cforest) failed to work due to excess memory requirements, even on a machine with 128 GB of main memory.

¹ Note that we obtained a similar value of mtry when the out-of-bag error rate was used for validation.


ReliefF - k-NN based algorithm. The ReliefF algorithm [73, 75] estimates feature importance based on how well a feature distinguishes examples from their nearest neighbors of the same class vs. nearest neighbors from other classes. Robnik-Sikonja and Kononenko [93] adapted ReliefF for regression. ReliefF only provides a ranking, and that ranking is then utilized with a supervised classifier.

We plot the importance from ReliefF (using weka's [58] implementation) for different numbers of nearest neighbors in Figure C.3; we show the plot for ReliefF with {1, 5, 10, 50, 100} nearest neighbors. As shown in the figure, the importance of the irrelevant features is higher than that of at least one of the relevant features.

SVM - Importance calculated by a permutation test on validation dataset. The permutation based scheme used by random forests for evaluating feature importance, discussed in Section 2.4.5.1, can also be utilized for SVM. Permuting a feature in a separate test set, before evaluating the SVM on it, degrades the predictions; when a feature that is important for prediction is permuted, there is a corresponding degradation of the results. We use Algorithm 5 to calculate feature importance using SVMs. The feature importance can be calculated either using a held-out fold or separate validation data.

We show the results for the cost parameter² C=10 in Figure C.3, averaged over multiple held-out³ folds. As shown in the figure, the importance assigned to the spurious feature numbered 7 is higher than the importance assigned to the relevant features.

BayesNN. We used the Bayes Neural Network [84] available at [83]. The Bayes neural network has a certain number of hidden nodes and a single hyperparameter for each feature. Though we show results for BayesNN with both 25 and 100 hidden nodes, the network with 25 nodes performed better than the network with 100 nodes⁴.

According to the author, the hyperparameters are directly proportional to the importance of the features. We plot the hyperparameters over 200 iterations in Figure C.3. As shown in the figure, the hyperparameters (importance values) of the spurious features are higher than the hyperparameters of the relevant features, for both networks of size 25 and 100.

Thus, we showed that existing algorithms may incorrectly rank features.

² We found models with C ≥ 10 giving the same error results on held-out folds, so we only show results for C=10.
³ We found similar importance results for the separate validation data.
⁴ We also experimented with networks of size 5 and 10, and found results that were similar to the network with 25 nodes; thus, we only show our results with networks of size 25 and 100.

Algorithm 5: Calculating SVM importance via a permutation scheme

Input: 1000 examples for training and 1000 examples for validation. The SVM used is an RBF kernel with the epsilon-SVR algorithm. The best model is parametrically selected from the cost parameter array {1e-2, 1e-1, . . . , 1e6}.
Output: Importance values from held-out folds and from validation data.
Pseudo-code:
    number of folds = 10
    for i in number of folds:
        use 9 folds for training; internally use 1 of the 9 folds to find the best cost parameter
        EITHER train on the 9 folds with the best cost parameter, and find feature importance using the remaining fold,
        OR train on all training data with the best cost parameter, and find feature importance using the held-out and separate validation data
    collate the importance values over the entire loop

For the 'find feature importance' step, we first calculate the baseline error rate without permuting any feature. We then permute each feature and calculate the increase/decrease in error rate. This step is similar to the step performed for calculating importance for random forests in Section 5.1.1, except that instead of using out-of-bag examples we use a held-out validation set or a held-out fold.
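To make the permutation scheme concrete, here is a minimal sketch in Python using scikit-learn rather than the thesis's implementation; it follows the second branch of Algorithm 5 (train on all training data, score importance on the separate validation set), and assumes X_train, y_train, X_val, y_val from the generative sketch in Section C.5:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

def svm_permutation_importance(model, X_val, y_val, rng):
    """Increase in validation MSE when each feature is permuted in turn
    (the 'find feature importance' step of Algorithm 5)."""
    baseline = np.mean((model.predict(X_val) - y_val) ** 2)
    importance = np.empty(X_val.shape[1])
    for j in range(X_val.shape[1]):
        X_perm = X_val.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        importance[j] = np.mean((model.predict(X_perm) - y_val) ** 2) - baseline
    return importance

# Select the cost parameter C over the grid of Algorithm 5 by internal
# cross-validation, then score importance on the separate validation set.
search = GridSearchCV(SVR(kernel="rbf"),
                      {"C": [10.0 ** k for k in range(-2, 7)]}, cv=10)
search.fit(X_train, y_train)
importance = svm_permutation_importance(search.best_estimator_, X_val, y_val,
                                        np.random.default_rng(0))
```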


C.6 Why Does Prob+Reweighting Perform Worse Than Other Retraining+Removal Algorithms?

We briefly discuss why Prob+Reweighting performs worse than other retraining+removal algorithms. We define a metric, called the feature difference score, based on the number of features discovered by two algorithms a and b on dataset k:

diff_score(a, b, k) = 100 × (1/D_k) × |f_counts_num(k, a) − f_counts_num(k, b)|    (C.4)

where D_k is the number of features for dataset k, a and b are the algorithms being compared, and f_counts_num(k, a) = Σ_{i=1}^{D_k} I(f_counts(k, a, i)) returns the number of features discovered by algorithm a for dataset k over all folds; I(·) is an indicator function (refer to Equation 5.11 for f_counts(k, a, i)).

The difference score is based on the number of features discovered by any two algorithms, and it is calculated for individual datasets (rather than individual training/test splits of the datasets). This metric only takes into consideration the number of features discovered by the individual algorithms, rather than matching individual features as the previously discussed similarity score for features discovered (Equation 5.11) does. The score will be 0 if the two algorithms discover the same number of features; a minimal sketch of the computation follows.
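The helper below is a sketch (the function name and argument form are ours, assuming the f_counts_num values have already been computed as in Equation C.4):

```python
def diff_score(f_counts_num_a, f_counts_num_b, D_k):
    """Feature difference score of Equation C.4 for a single dataset k:
    the absolute difference in the number of features discovered by
    algorithms a and b, as a percentage of the dataset's D_k features."""
    return 100.0 * abs(f_counts_num_a - f_counts_num_b) / D_k
```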

From the overall results presented earlier in Figure 5.3 (page 94), we observed that the Prob+Reweighting scheme performs worse on average than both the Err+Removal and Prob+Removal schemes. Note that the Prob+Reweighting scheme performs better than the Err+Reweighting scheme on average % relative improvement in error, but worse on average absolute error. Using the feature similarity score discussed above, we found that Prob+Reweighting has a lower similarity score than the other retraining algorithms (Figure 5.8).

In Figure C.4, we plot the difference in relative improvement of error for the different retraining importance based algorithms vs. the difference score based on the number of features discovered (Equation C.4), for Prob+Reweighting vs. each of Prob+Removal, Err+Removal, and Err+Reweighting. Our goal is to observe whether there is a difference in the total number of features discovered when there is a large difference in the relative improvement of error between any two algorithms.

For Prob+Reweighting vs. the rest of the retraining schemes, we observe that there is a linear relation between the feature difference score and the difference in relative improvement from baseline. The datasets where this is observed are some of the hemodynamic datasets and covtype. We hypothesize that Prob+Reweighting had a higher error rate for certain datasets because it got stuck in a minimum and was not able to remove enough features compared to the other retraining+removal algorithms.

C.7 Comparison to Boruta and rf-ace

We briefly digress and describe two algorithms, Boruta [76] and rf-ace [116], that are popular for feature selection in random forests; we have not compared them directly to our discussed methods because the primary goal of these algorithms is to remove irrelevant features.

As the Boruta [76] and rf-ace [116] algorithms are similar to each other, we provide only a general overview of their working mechanisms. Both assume that a given input matrix X of size n × p and a target values/labels vector y of size n are available, where n is the number of examples and p is the number of features. Boruta [76] and rf-ace [116] create a new dataset X′ of size n × 2p, which contains the original dataset X and a permuted copy of each individual feature in X. Both are iterative algorithms and perform the following steps in each iteration (a sketch of one such iteration appears after this list):

(1) Both algorithms train multiple random forest models on (X′, y) and obtain values from Breiman's feature importance measure. Statistical tests are performed to analyze whether there is a significant difference in the feature importance values for a given feature and its permuted copy across the multiple models.

(2) Features whose importance values are not significantly different from those of their permuted copies are identified and removed from X′. The algorithm loops back to step (1) unless the number of features remains constant for a certain number of iterations.
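Below is a rough sketch of one such iteration in Python (our construction, not the authors' code): scikit-learn's impurity-based feature_importances_ stands in for Breiman's permutation importance, and a simple majority rule stands in for the formal statistical tests of Boruta and rf-ace:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def shadow_feature_iteration(X, y, rng, n_models=20):
    """One Boruta/rf-ace style iteration: append a permuted ('shadow') copy of
    every feature, train several forests, and keep only the features whose
    importance clearly exceeds that of their shadow copy."""
    n, p = X.shape
    real_imp = np.empty((n_models, p))
    shadow_imp = np.empty((n_models, p))
    for m in range(n_models):
        shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])
        X_aug = np.hstack([X, shadows])                # n x 2p augmented matrix
        rf = RandomForestClassifier(n_estimators=500, random_state=m)
        rf.fit(X_aug, y)
        real_imp[m] = rf.feature_importances_[:p]      # impurity importance here,
        shadow_imp[m] = rf.feature_importances_[p:]    # in place of Breiman's measure
    # Simple rule in place of the formal significance tests: keep a feature
    # if it beats its shadow in at least 95% of the models.
    return (real_imp > shadow_imp).mean(axis=0) >= 0.95

# keep = shadow_feature_iteration(X, y, np.random.default_rng(0))
# X = X[:, keep]   # loop until the set of kept features stabilizes
```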

We obtained percent error results for Boruta and rf-ace for the 17 datasets used in our experiments.


[Figure C.4: three scatter plots of % feature difference (x-axis) vs. accuracy difference (y-axis), one each for Err+Removal vs. Prob+Reweighting, Prob+Removal vs. Prob+Reweighting, and Err+Reweighting vs. Prob+Reweighting, with points labeled by dataset.]

Figure C.4: Comparison of Prob+Reweighting vs. other retraining based algorithms: difference in improvement over baseline vs. the difference score based on the number of features discovered. Note that Prob+Reweighting has a higher difference in improvement with the other retraining algorithms for a couple of the hemodynamic datasets and covtype. Observe that there is a linear relation between the feature difference score and the difference in relative improvement from baseline. We hypothesize that the higher error rate is caused because Prob+Reweighting did not remove enough features compared to the other retraining algorithms.


[Figure C.5: bar charts over three artificial dataset models showing (top) the difference in error rates (models with all features − models with only relevant features) using random forests, and the number of datasets (out of 100) with correct vs. incorrect detection for: Breiman importance (random forests), Strobl importance (random forests + extendedForest()), Gini importance (random forests), retraining importance (using random forests), Breiman importance (conditional forests, unconditional), and Strobl importance (conditional forests, conditional).]

Figure C.5: Feature importance results for random forests and conditional forests on 3 artificial dataset models. From each artificial dataset model, we generate 100 datasets. We plot the number of times correct and incorrect detection is performed by an importance measure using the correct detection scoring rule defined in Equation 5.7; incorrect detection = 1 − correct detection. Compared to Figure 5.1, we also add results using the Gini importance measure.


                 Percent Error   Avg. number of features   # of datasets where there is
                                                           no change in # of features
Baseline Model   13.03           113.94                    -
rf-ace           12.98           95.75                     10
Boruta           13.06           91.15                     10

Table C.3: Results of rf-ace and Boruta.

Our results are presented in Table C.3; the baseline model is a model trained with all features. We can draw the following three conclusions about both Boruta and rf-ace: (1) there is no significant reduction in the percent error rate compared to the baseline model; (2) the average number of features selected is higher than the average number of features selected by the other algorithms discussed in this study (Figure 5.9); and (3) for 10 out of 17 datasets, both Boruta and rf-ace failed to perform any feature selection.

C.8 Random Forests - Computational Complexity and Software

In this section, we discuss the computational complexity of random forests and compare it to SVM;

we also describe the properties of the software developed for the purpose of this thesis.

C.8.0.1 Computation Cost of a Random Forest Model

We calculate the computational cost of a random forest model as follows: at each split of a tree, we search for the best feature that splits the data; the computational cost of this procedure is upper bounded by O(n · p)⁵, where n is the number of examples and p is the number of features. Furthermore, the trees are fully grown, so the number of splits per tree is upper bounded by O(n). Thus, to grow an ensemble forest containing ntree trees, the total computational cost is O(ntree · n² · p).

Now, we compare the computational complexity of random forests to that of SVM. The computational cost of SVM is cited to be between O(n²) and O(n³) [12], but the authors do not consider the computational complexity associated with calculating part of the kernel matrix⁶, which is O(n · p); thus, the computational complexity of an SVM model is between O(n³ · p) and O(n⁴ · p). We can ignore ntree in O(ntree · n² · p), the computational complexity of random forests, as ntree is a constant factor; thus, the computational complexity of random forests is at least a factor of O(n) smaller than that of SVM.

⁵ Breiman's implementation of random forests contains a trick that allows the calculation of the gini impurity value with O(n) complexity.
⁶ SVM solvers do not calculate the entire kernel matrix, but only what is required.

Random forests hold a computational advantage over certain multi-class SVM methods. Note that for multi-class SVM on a k-class problem, popular techniques propose creating either (k − 1) classifiers for a one-vs-all strategy or k(k − 1)/2 classifiers for an all-vs-all strategy [59, 92, 123]; the decision trees in random forests can natively map to any number of classes and do not require creating additional classifiers via a one-vs-all or an all-vs-all strategy.

C.8.0.2 Random Forests are Parallelizable

We discuss the parallelizable nature of the random forests algorithm, which has made it a popular learning algorithm for large scale datasets. The random forests algorithm is inherently parallelizable as trees can be trained independently of one another; in contrast, boosting algorithms and SVM are not inherently parallelizable. A real world example of random forests is the Kinect [105] accessory for the XBox 360 console; gesture classification using image segmentation is performed in Kinect, and the random forest classifier was trained in parallel on thousands of computer cores.
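A minimal sketch of this tree-level parallelism follows (our illustration, using scikit-learn trees rather than our C++/Matlab package): giving each tree its own seed fixes both its bootstrap sample (and hence its out-of-bag set) and its split randomness, so the ensemble is repeatable regardless of how trees are distributed across workers:

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from sklearn.tree import DecisionTreeClassifier

def train_tree(args):
    """Train one tree of the ensemble from its own per-tree seed."""
    X, y, seed, mtry = args
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    in_bag = rng.integers(0, n, size=n)            # bootstrap sample indices
    oob = np.setdiff1d(np.arange(n), in_bag)       # out-of-bag indices
    tree = DecisionTreeClassifier(max_features=mtry, random_state=seed)
    tree.fit(X[in_bag], y[in_bag])
    return tree, oob

def train_forest(X, y, ntree=1000, mtry=2):
    """Trees are independent, so they can be trained in parallel across
    cores (or across machines, e.g. via MPI, with the same seeding)."""
    jobs = [(X, y, seed, mtry) for seed in range(ntree)]
    # Guard with `if __name__ == "__main__":` when using process pools.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(train_tree, jobs))
```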

Additionally, for the purpose of this thesis, we developed a parallel version of the random forests algorithm that is scalable to multiple cores and computers; our code is based on Breiman's reference code, and its properties are described next.

C.8.0.3 Software Used

Boulesteix et al. [13] present a good overview of the software packages available for RF. We considered the following packages: randomForest-R by Liaw and Wiener [78], randomforest-matlab by Jaiantilal [67], Random Jungle by Schwarz et al. [102], and Matlab's Treebagger. We briefly discuss the reasons why we had to develop our own RF software, and then we discuss the properties of the developed RF software.


A Critique of Existing Random Forests Software. As we saw earlier in Chapter 5 of this thesis, the amount of computation required to execute our experiments was quite high⁷; it was imperative to reduce the computational cost as much as possible. As we will see later, another desirable property was repeatable results from a random forests ensemble. Thus, we required the following properties from the packages:

• The package should be fast in both single-threaded and multi-threaded mode. randomForest-R [78], randomforest-matlab [67], and Random Jungle [102] were an order of magnitude faster than Matlab's Treebagger function (tested using Matlab 2012b).

• The package should be able to scale to multiple cores and preferably to multiple computers. Random Jungle [102] scales to multiple cores; the other packages do not.

• The package should allow setting the out-of-bag (OOB) examples and the random seed for individual trees in the random forest ensemble, for repeatable results. None of the packages supported this property.

• The package should be easily integrated into a computational language like Matlab or R [66]. Random Jungle [102] does not integrate into a computational language; the other packages integrate into either Matlab or R.

Developed Random Forest Software. As none of the available packages had the desired properties, we had to develop our own random forests package, based on randomForest-R [78] and randomforest-matlab [67]. The total number of lines of developed code in C++ and Matlab, for the random forest algorithm and our proposed feature selection algorithms utilizing random forests, is around 25,000. Our developed software package allows setting the out-of-bag (OOB) examples and the random seed for individual trees in a random forest ensemble. Furthermore, the developed software package can be used from Matlab. Next, we discuss the computational performance of the developed software, which has the ability to run on multiple cores on a single or multi-socket machine and on a cluster of machines.

⁷ About 350K CPU hours (40 years) of a single 1.8 GHz core of the Core-i7 processor.


                                Random Jungle  randomForest-R  randomforest-matlab  Matlab's Treebagger
Multi-threaded (Regression)     2.5X           75X             37.8X                59X
Multi-threaded (Classification) 6.6X           31.3X           3.7X                 17.8X

Table C.4: Relative execution time for existing software packages compared to our developed multi-threaded software package. Our code has an execution time of 1X. We enabled 4 threads of a 2.2 GHz i7 processor. Note that randomForest-R and randomforest-matlab are single-threaded programs; the execution time factors listed in the multi-threaded rows are for their single-threaded versions.

We compare the computational performance of the developed software package to existing packages in Table C.4. Before discussing the details of the table, note that when we use the phrase 'the software package scales linearly as the number of cores increases', we mean that the execution time drops linearly as the number of cores increases. Our developed software had the following properties:

(1) Multi-threaded random forest for single socket machines: we developed a multi-threaded random forest software package, for both regression and classification, that scales linearly as the number of cores increases (for a single socket machine); our software implementation is based on the code by Breiman, which is also the basis of the code in randomForest-R [78] and randomforest-matlab [67]. In Table C.4, we show relative execution times for different random forest packages on an artificial dataset consisting of 1000 examples and 100 features (with the default random forest parameters); the developed software outperforms the other packages in execution speed. Even in the worst case, multi-threaded regression, the developed software is 2.5 times faster than Random Jungle.

(2) Multi-threaded random forest for dual and quad socket machines: most of the experiments were performed in the year 2012; at that time, many of the available computers were single socket machines, though dual and quad socket machines were easily available and affordably priced (< 10,000 USD). The number of cores available per socket ranged between 4-8 cores (8-16 threads) for an Intel based system and between 8-16 cores for an AMD based system.

For multi-socket machines, main memory is not only shared between cores residing on the same socket but also between cores residing on different sockets. Such a memory arrangement creates a large slowdown when threads running on cores of different sockets have to access each other's memory. A technology called Non-Uniform Memory Access (NUMA) is available for multi-socket machines and helps in alleviating memory contention between cores on multiple sockets; via NUMA, memory can be controlled in a fine-grained manner by the threads running on the machine. Programs optimized for NUMA, also known as NUMA-aware programs, have threads running on different sockets in such a way that memory transfers are kept to a minimum.

Our developed random forest software package is NUMA-aware and scales linearly as the number of sockets increases; it was tested on AMD systems with 32 cores and 48 cores, which used dual and quad sockets, respectively. Note that existing random forest software packages make no claim of being NUMA-aware, and the penalty of using non NUMA-aware software on a NUMA system can be severe [7].

(3) Feature selection using random forests implemented via MPI: training a random forest can be done in parallel on multiple machines (each tree or set of trees created on a separate machine), and this has been utilized for very large datasets [105]. As the number of examples for the datasets used in this thesis is in the thousands, it makes sense to create entire forests on individual machines instead of creating trees on multiple machines and then aggregating the forest, as the overhead of network communication is higher than the computational⁸ overhead.

In Chapter 5, we proposed an iterated feature selection algorithm based on random forests. In each step of our proposed algorithm, we create a large number of random forest models that can be simultaneously trained on multiple machines. For implementing such a parallelized training procedure, we developed a Message Passing Interface (MPI) program [108]; MPI is a popular API used in clusters for parallel programming. The speedup obtained by using a cluster of many machines is bounded only by the number of features explored in each step; the network overhead in each step is negligible.

(4) CPU Optimizations: using CPU specific machine intrinsics for prefetching and vector calculations [46], we were able to increase the speed of our code by about 25%; intrinsics are assembly codes utilizing special CPU instructions. Prefetching gives explicit hints to the CPU that it needs to load remote memory into immediate memory, as that memory location is going to be utilized within a short while. Most CPUs have vector units that can be used to execute multiple⁹ floating point instructions in the same amount of time required to execute a single floating point operation. The speedup obtained by using vector instructions was not large¹⁰, as the ratio of the number of memory instructions to the number of floating point instructions is high; that is, the program is memory-bound.

⁸ For the datasets used in this thesis, most forests take around a second to be trained.
⁹ Usually 4 floating point operations using SSE2/3/4, or 8 floating point operations using AVX; SSE2/3/4 and AVX are types of x86 CPU instructions.
¹⁰ By using vector instructions, we would expect a reduction in execution time by a factor of 4 or 8, rather than the roughly 25% improvement we observed.