Feature Selection by Iterative Reweighting: An Exploration
of Algorithms for Linear Models and Random Forests.
by
Abhishek Jaiantilal
B.A., Dharamsinh Desai Institute of Technology, 2003
M.S., New Jersey Institute of Technology, 2006
M.S., University of Colorado, 2011
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Department of Computer Science
2013
This thesis entitled:
Feature Selection by Iterative Reweighting: An Exploration of Algorithms for Linear Models and Random Forests.
written by Abhishek Jaiantilal
has been approved for the Department of Computer Science
Michael Mozer
Prof. Aaron Clauset
Prof. Vanja Dukic
Prof. Qin Lv
Prof. Sriram Sankaranarayanan
Date
The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above-mentioned discipline.
Jaiantilal, Abhishek (Ph.D., Computer Science)
Feature Selection by Iterative Reweighting: An Exploration of Algorithms for Linear Models and Random
Forests.
Thesis directed by Prof. Michael Mozer
In many areas of machine learning and data science, the available data are represented as vectors
of feature values. Some of these features are useful for prediction, but others are spurious or redundant.
Feature selection is commonly used to determine the utility of a feature. Typically, features are selected
in an all-or-none fashion for inclusion in a model. We describe an alternative approach that has received
little attention in the literature: determining the relative importance of features via continuous weights, and
performing multiple iterations of model training to iteratively reweight features such that the least useful
features eventually obtain a weight of zero. We explore feature selection by employing iterative reweighting
for two classes of popular machine learning models: L1 penalized linear models and Random Forests.
Recent studies have shown that incorporating importance weights into L1 models leads to improvement
in predictive performance in a single iteration of training. In Chapter 3, we advance the state-of-the-art by
developing an alternative method for estimating feature importance based on subsampling. Extending the
approach to multiple iterations of training, employing the importance weights from iteration n to bias the
training on iteration n + 1, seems promising, but past studies yielded no benefit from iterative reweighting. In
Chapter 4, using our improved estimates of feature importance, we obtain a significant reduction of 7.48% in
the error rate over standard L1 penalized algorithms, and nearly as large an improvement over alternative
feature selection algorithms such as Adaptive Lasso, Bootstrap Lasso, and MSA-LASSO.
In Chapter 5, we consider iterative reweighting in the context of Random Forests and contrast this with
a more standard backward-elimination technique that involves training models with the full complement of
features and iteratively removing the least important feature. In parallel with this contrast, we also compare
several measures of importance, including our own proposal based on evaluating models constructed with
and without each candidate feature. We show that our importance measure yields both higher accuracy and
greater sparsity than importance measures obtained without retraining models (including measures proposed
by Breiman and Strobl), though at a greater computational cost.
Acknowledgements
The foundations of this thesis were laid the week I started graduate school at CU. During the first
week of school, I met both Greg Grudic and Mike Mozer; unbeknownst to each other, both suggested the
idea of exploring sparse linear models as initial research. Sparse models turned out to be the focus of my
research and the motivation of this thesis.
I am indebted to Mike for pushing me towards trying out ideas, without an a priori bias towards a
particular method, and for allowing me to pursue ideas out of my own interest.
I am indebted to Greg for exchange of ideas over the years and the initial push towards some of
the algorithms discussed in this thesis; Greg and Flashback Technologies provided valuable opportunities
concerning employment, computing equipment, and data that were essential for the completion of this thesis.
I would like to thank Prof. Sriram for his encouragement, support, and providing me some of the
data that proved essential to this thesis. I am thankful to the other committee members, Prof. Vanja, Prof.
Qin, and Prof. Aaron, for serving on my defense committee and their valuable insights. I am very thankful
to my other collaborators during my stay at CU, Prof. Shivakant Mishra and Yifei Jiang.
Though mentioned last, the greatest amount of thanks goes to the support and guidance of my parents,
sisters, and friends, Miraaj Soni and Greg Theodore.
Contents
Chapter
1 Thesis Overview 1
1.1 Scope of this thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Thesis Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Weighted Linear Models and Iteratively Reweighted Linear Models . . . . . . . . . . . 3
1.2.2 Iteratively Reweighted Random Forests Models . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Arrangement of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Current Approaches to Supervised Learning with Feature Selection 6
2.1 Supervised Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Overview of Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Feature Relevancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Type of Feature Selection Algorithms: Wrapper, Filter, and Embedded . . . . . . . . 11
2.3 Feature Selection Using Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.1 Regularization Using the Ridge Penalty Function . . . . . . . . . . . . . . . . . . . . . 13
2.3.2 Regularization Using the Lasso Penalty Function . . . . . . . . . . . . . . . . . . . . . 15
2.3.3 Regularization Using Other Penalty Functions . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Feature Selection Using Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.1 Why Use Random Forest? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.2 Model Combination in an Ensemble of Classifiers . . . . . . . . . . . . . . . . . . . . . 21
2.4.3 The Random Forest Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4.4 Implicit Feature Selection Using Random Forests . . . . . . . . . . . . . . . . . . . . . 27
2.4.5 Explicit Feature Selection Using Random Forests . . . . . . . . . . . . . . . . . . . . . 28
3 Single-Step Estimation and Feature Selection using L1 Penalized Linear Models 31
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Background Information and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.1 Asymptotic properties of L1 penalty . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2 Motivation for our Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3 Randomized Sampling (RS) Method to Create Weight Vector . . . . . . . . . . . . . . . . . . 35
3.3.1 Randomized Sampling (RS) Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.2 Consistency of choosing a set of Features from Randomized Sampling (RS) Experiments 37
3.4 Algorithms and Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4.1 Synthetic Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4.2 Real World Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5 Conclusions and Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4 Iterative Training and Feature Selection Using L1 Penalized Linear Models 46
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.1.1 Existing Single and Multiple Step Estimation Algorithms for L1 Penalized Algorithms 47
4.2 Subsampling/Bootstrap Adaptive Lasso Algorithm (SBA-LASSO) . . . . . . . . . . . . . . . 52
4.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.1 Tuning Parameter and Parametric Search in MSA . . . . . . . . . . . . . . . . . . . . 54
4.3.2 Parametric Search in SBA, Adaptive Lasso, and Bootstrap Lasso . . . . . . . . . . . . 55
4.3.3 L1 Penalized Algorithms and Their Parameters . . . . . . . . . . . . . . . . . . . . . . 57
4.3.4 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4.1 Artificial Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4.2 Results for Real World Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.5 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5 Iterative Training and Feature Selection using Random Forest 75
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.1.1 Existing Feature Importance Measures for Random Forest . . . . . . . . . . . . . . . . 76
5.2 A Challenge for Existing Feature Importance Measures in Random Forests . . . . . . . . . . 78
5.2.1 Description of the Artificial Dataset Models . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2.2 Results for Artificial Dataset Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.3 Retraining-Based Feature Importance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.4.1 Key Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.4.2 Details of Iterated Reweighting of Random Forest Models . . . . . . . . . . . . . . . . 85
5.5 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.5.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.5.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.5.3 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.6.1 Overall Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.6.2 Analysis Based on % Reduction of Error . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.6.3 Analysis Based on Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.6.4 Computation Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.7 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6 Conclusion 106
6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Bibliography 109
Appendix
A L1-SVM2 Algorithm 117
A.1 Optimization of a Double Differentiable Loss Function Regularized with the L1 Penalty . . . 117
A.1.1 Necessary and Sufficient Conditions for Optimality for the Whole Regularization Path 117
A.2 L1-SVM2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
B Appendix for Chapter 4 122
B.1 Description of Real World Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
B.2 Error Results on Artificial and Real World Datasets . . . . . . . . . . . . . . . . . . . . . . . 125
B.3 Feature Selection Results for Artificial Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 129
B.4 Expanding Dataset Dictionary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
C Appendix for Chapter 5 131
C.1 Description of Real World Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
C.2 Results on Real World Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
C.3 Choice of ntree Parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
C.4 Choice of mtry parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
C.5 A Challenge for ML algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
C.6 Why Does Prob+Reweighting Perform Worse Than Other Retraining+Removal Algorithms? 142
C.7 Comparison to Boruta and rf-ace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
C.8 Random Forests - Computational Complexity and Software . . . . . . . . . . . . . . . . . . . 146
Tables
Table
3.1 Mean ± Std. Deviation of Error Rates in % on Models 1 & 4 by SVM2. The best algorithms
are in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2 Mean ± Std. Deviation of Error Rates in % on Models 1 & 4 by 1-norm SVM. The best
algorithms are in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3 Variable Selection Results on Models 1 & 4 using SVM2. The best algorithms are in bold. . 41
3.4 Mean Error rates in % for Models 2, 3 & 5 using SVM2. The best algorithms are in bold. . . 41
3.5 Mean ± Std. Deviation of Error Rates on Real world Datasets. The best performing L1 and
L2 penalized algorithms are in bold. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.6 Avg. Error rate on Robotic Datasets from [88]. The best algorithms are in bold. . . . . . . . 44
4.1 Average number of features discovered correctly and incorrectly by the MSA over four different
L1 penalized algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2 List of real world datasets used in our experiments . . . . . . . . . . . . . . . . . . . . . . . . 69
5.1 Seventeen datasets used in our experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
B.1 Results on real world classification and regression datasets. . . . . . . . . . . . . . . . . . . . 126
B.2 Results on nine synthetic datasets using four different L1 penalized algorithms . . . . . . . . 127
B.3 Features (correctly/incorrectly) selected by four different L1 penalized algorithms . . . . . . . 128
C.1 Error rates on Individual Datasets, for all algorithms. . . . . . . . . . . . . . . . . . . . . . . 134
C.2 % Relative Improvement on Error Rate – over baseline – on Individual Datasets, for all
algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
C.3 Results of rf-ace and Boruta. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
C.4 Relative Execution Time of existing random forest software compared to developed software. 149
Figures
Figure
2.1 Feature relevancy according to Kohavi and John [74] . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Feature relevancy according to Yu and Liu [129] . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 The search for the best feature at each node in bagged trees and random forest trees. . . . . . 25
4.1 Graphical representation of Friedman ranks and Bergmann-Hommel procedure, for variations
of the MSA, using percent error results obtained using nine artificial dataset models × four
different L1 penalized algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2 Graphical representation of Friedman ranks and Bergmann-Hommel procedure, for variations
of the SBA, using percent error results obtained using nine artificial dataset models × four
different L1 penalized algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3 Graphical representation of Friedman ranks and Bergmann-Hommel procedure, for different
L1 penalized algorithms, using percent error results obtained using nine artificial dataset models 67
4.4 Graphical representation of Friedman ranks and results of the Bergmann-Hommel procedure
for L1-SVM and LARS using percent error results obtained using real world datasets . . . . . 70
4.5 Graphical representation of Friedman ranks and Bergmann-Hommel procedure, for variations
of the SBA, using percent error results obtained using real world datasets. . . . . . . . . . . . 73
4.6 Overall number of features discovered using L1 penalized algorithms for real world datasets. . 74
5.1 Feature importance results for random forests and conditional forests on 3 artificial dataset
models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.2 Comparison between search of features at the nodes of trees in random forests and FIbRF . . 86
5.3 Percent error results and percent relative reduction of error for all algorithms. . . . . . . . . . 94
5.4 Analysis of four importance measures and two elimination schemes. . . . . . . . . . . . . . . . 96
5.6 Difference in percent reduction of error among algorithms that use retraining-based vs non-
retraining-based feature importance measures, collapsed across removal schemes. . . . . . . . 98
5.7 Difference for the same training/test split of a dataset is represented by a red circle (◦). We show:
(a) comparison of Err+Removal and Breiman+Removal and (b) comparison of Err+Removal
and Prob+Removal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.8 Comparison of various algorithms on the basis of the similarity score . . . . . . . . . . . . . . 101
5.9 Sparsity of models created by different algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.10 Sparsity of models vs % reduction in error over the baseline model for Prob+Removal and
Err+Removal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.11 Computational complexity relative to Strobl+Removal (1X). . . . . . . . . . . . . . . . . . . 103
B.1 Features correctly and incorrectly detected by four different L1 penalized algorithms . . . . . 129
B.2 Results by using expanded dictionary for real world datasets with the standard L1 penalized
algorithm for LARS and L1-SVM2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
C.1 What is an appropriate value of ntree? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
C.2 For Random Forests, do we need to parametrically search for the best mtry? . . . . . . . . . 135
C.3 Importance (without retraining) using various algorithms for the artificial dataset . . . . . . . 138
C.4 Comparison of Prob+Reweighting vs other retraining based algorithms . . . . . . . . . . . . . 144
C.5 Feature importance results for random forests and conditional forests on 3 artificial dataset
models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Chapter 1
Thesis Overview
1.1 Scope of this thesis
We live in a data-intensive world, and in many domains—the internet, genetics, multimedia, finance,
among others—the amount of data generated is too much for a human to manually sift through for relevant
information. Algorithms providing insights into the data are in demand, and have given rise to the fields of
machine learning and data mining, and to the newly named field of data science. The contributions of this
thesis are towards two sub-fields within machine learning, namely supervised machine learning and feature
selection.
Supervised Machine Learning. Supervised machine learning involves building a model from data
provided by a teacher. Supervised problems involve both regression and classification. Regression involves
transforming an input to a continuous output value; classification involves transforming an input to a discrete
output label. The input to a supervised machine learning model, x, typically consists of a vector of p features.
The output, y, is a continuous value (regression) or discrete value (classification). The problem is supervised
in the sense that a teacher provides a set of {x, y} examples mapping a particular input to a target output.
In supervised machine learning, the inputs are collected as X, an array of size n×p, and the target values/labels
as Y , a vector of size n; n is the number of examples and p is the number of features observed for each
example. In regression, the target values are continuous values
and in classification, the target labels are class labels. The trained machine learning model is expected to
reliably predict the target values or labels of future data. In this thesis, we restrict our study to the following
supervised machine learning techniques: linear regression [59, 122], linear support vector machine (SVM)
[119], and random forests [19].
Feature Selection. The goal of accurately predicting the target values or labels of unseen data is
hindered if noise is present in the data used for training the models. An irrelevant¹ feature is either a feature
with measurement noise or a feature that does not help in prediction and, depending on the sensitivity of
the classifier algorithm, such a feature can adversely affect the creation of accurate models as well as the
interpretation of the data. Feature selection is a sub-field within machine learning aimed at creating accurate
models by excluding noisy (or irrelevant) features and by including only the relevant features.
In this thesis, we discuss embedded feature selection algorithms, like lasso regression [42, 112], 1-norm
support vector machine (SVM) [131], and random forests [19], that include the feature selection process
embedded in their model training process; that is, the algorithms simultaneously train the model and perform
feature selection. The embedded feature selection algorithms, discussed in this thesis, consist of both linear
and nonlinear algorithms. Linear algorithms assume an underlying linear model generating the data, and thus
are restricted to data well described by such models. Nonlinear algorithms are universal approximators
and can, in principle, model any kind of data.
1.2 Thesis Summary
In this thesis, we explore the hypothesis that iteratively reweighted models—based on either the
L1 penalized algorithm or the random forests algorithm—outperform standard models that do not utilize
reweighting. We summarize the contributions of this thesis for the two classes of machine learning models we
explore: linear models constructed using L1 penalized algorithms and nonlinear models constructed using
random forests.
¹ We discuss features and their relevancy later in Section 2.2.1.
1.2.1 Weighted Linear Models and Iteratively Reweighted Linear Models
Feature-selecting linear models are created by using a linear loss function in conjunction with a
sparsity inducing penalty function. Popular linear loss functions, utilized in this thesis, include the squared
loss function for regression and the hinge loss function for classification [59]. The L1 penalty function is
a popular means of obtaining sparsity. It has the advantage that the resultant formulation is convex and
convex optimization has been well studied [97, 132, 133].
Recently, weighted linear models [132, 133], trained with a linear loss function and a weighted L1
penalty, have been proposed that employ user-specified weights to bias linear models towards relevant features
and away from irrelevant or noisy features. A multi-step estimation procedure, which iteratively reweights
the linear model, was suggested by [24, 27]. Existing weighted algorithms differ in the manner they derive
the weights. The weights in [132, 133] were derived from an L2 penalized algorithm, whereas the weights
used in [24, 27] were derived from an L1 penalized algorithm. Our contributions to the field of L1 penalized
linear models are as follows:
• In Chapter 3, we propose a single-step procedure using weights obtained from subsampled datasets. We
show that our results are comparable to an existing weighted algorithm (Adaptive Lasso [132, 133]) for
two different types of linear models. Our proposed algorithm, unlike existing algorithms, can also be used
in an online setting.
• In Chapter 4, we extend the single-step procedure from Chapter 3 into an iteratively reweighted procedure.
We show that our proposed method achieves higher prediction accuracy than existing methods,
including Adaptive Lasso [132, 133], Bootstrap Lasso [6], and MSA-LASSO [24]. We also propose a tuning
parameter for increasing the accuracy of MSA-LASSO [24]. Our results for weighted linear models
show that the magnitudes of the weights, and thus the method for obtaining them, are essential to the
performance of the weighted linear model.
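To make the iterative-reweighting idea concrete, the sketch below is a minimal illustration, not the thesis's SBA-LASSO implementation: the dataset, the penalty strength, and the weight-update rule (setting each feature's weight to the magnitude of its previous coefficient) are illustrative assumptions. It relies on the standard equivalence that penalizing |b_j|/w_j is the same as fitting an ordinary lasso on feature columns rescaled by w_j and then rescaling the coefficients back.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, 1.5, 0, 0, 2.0, 0, 0, 0, 0, 0])  # 3 relevant, 7 noise
y = X @ beta_true + rng.normal(scale=0.5, size=n)

w = np.ones(p)  # initial importance weights: all features treated equally
for it in range(3):
    # Weighted L1 penalty via feature rescaling: fitting a lasso on X * w
    # and mapping coefficients back (beta = coef * w) penalizes |beta_j| / w_j,
    # so features with small weights are penalized heavily.
    model = Lasso(alpha=0.1).fit(X * w, y)
    beta = model.coef_ * w
    w = np.abs(beta) + 1e-6  # reweight: least useful features shrink toward zero

selected = np.flatnonzero(np.abs(beta) > 1e-3)
print(selected)
```

On this toy problem, the reweighting drives the coefficients of the noise features to zero across iterations while retaining the truly relevant features.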
1.2.2 Iteratively Reweighted Random Forests Models
Our other major contributions are for a popular nonlinear algorithm known as random forests [19].
A random forest is an ensemble of simple decision trees that can collectively perform either classification
or regression. Decision trees recursively partition the input data [23, 90, 109, 114]; typically, at each node
in a decision tree, the partitions are created by choosing the feature that best helps in partitioning the
data among all available features; whereas, at each node in the decision trees utilized for random forests,
the partitions are created by choosing the feature that best helps in partitioning the data among randomly
selected mtry < p features, where p is the number of available features. Random forests are so named due
to the random search over mtry < p features at each node in the decision tree. Our contributions to the
field of random forests are as follows:
• The notion of iterated estimation is gradually becoming popular for linear models trained using the L1
penalty, but has not been explored for nonlinear models. We propose an approach to iterated reweighting
for random forest models. One of our contributions is a weighted form of random forests that can be used
to bias random forest models towards relevant features and away from irrelevant features.
• Using validation data, we can quantify the increase in prediction error when a particular feature is
removed. A large increase in validation error implies the necessity of the feature for prediction and
vice-versa; feature importance is quantified as the relative increase in validation error when a feature is
absent. Typically, feature importance is calculated by training a single model and evaluating the absence
of features, one at a time, from validation data; note that, because features are not considered in isolation from one
another in the training process, the trained model may overfit correlated (but spurious) features. We
propose a new feature importance measure, based on retrained models, that evaluates features in isolation
for both training and validation data. Typically, feature importance values are uninterpretable as the
relative increase in error for individual features is data dependent; thus, we also propose an interpretable
form of feature importance based on probabilities obtained from a significance test. We motivate the
advantage of our importance measure using artificial data and show improved results, on real-world
datasets, over existing random forests methods.
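A minimal sketch of the retraining-based importance idea follows. It is an illustration under stated assumptions, not the thesis's algorithm: scikit-learn's forest stands in for the thesis's implementation, the dataset and all names are hypothetical, and the probability/significance-test form of the measure is omitted. The importance of feature k is taken as the increase in validation error when a forest is retrained without that feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 600
X = rng.normal(size=(n, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # only features 0 and 1 carry signal

X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

def val_error(cols):
    """Train a forest on the given feature columns; return validation error."""
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_tr[:, cols], y_tr)
    return 1.0 - rf.score(X_va[:, cols], y_va)

base = val_error(list(range(X.shape[1])))  # error with the full feature set
# Retraining-based importance: increase in validation error when feature k
# is absent from both training and validation.
importance = np.array([
    val_error([j for j in range(X.shape[1]) if j != k]) - base
    for k in range(X.shape[1])
])
print(importance.round(3))
```

Because each candidate feature is removed before retraining, the model cannot compensate by overfitting a correlated surrogate of that feature during training, which is the motivation given above for the retraining-based measure.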
1.3 Arrangement of the thesis
Chapter 2 presents an overview of feature selection, including feature selection through L1 penalized
algorithms and random forests. In Chapter 3, we propose a single-step procedure for linear models using weights
derived from subsampled datasets. In Chapter 4, we extend the single-step procedure, for linear models, into
an iterative procedure. In Chapter 5, we propose an iteratively reweighted procedure for random forests.
Chapters 3, 4, and 5 can be read independently of each other, though Chapter 4 is an extension of Chapter 3.
In Chapter 6, we present our conclusions and proposals for future work.
Chapter 2
Current Approaches to Supervised Learning with Feature Selection
In this chapter, we summarize the relevant literature and background information needed to understand
our research. This chapter is divided into four sections. In Section 2.1, we present an overview of
supervised learning. In Section 2.2, we present an overview of feature selection methods. In Section 2.3, we
discuss using an L1 or an L2 penalty function to train linear models, and in Section 2.4, we discuss random
forests, a nonlinear algorithm. We start with a discussion of supervised machine learning.
2.1 Supervised Machine Learning
In supervised machine learning, we assume that a training dataset D = {(xi, yi), i = 1, . . . , n, xi ∈
Rp, yi ∈ R} is available, where x is a vector containing information about p features that are discrete or
continuous, and y is a target value that is either discrete or continuous. Features are also known as variables
or attributes. We assume that there exists a joint probability distribution, P (x, y), and that a finite number
of n examples are drawn i.i.d. from P (x, y) to construct the training dataset D.
A classifier or model is a function h : Rp → Y, where Y is the set of target values or labels. Regression involves
transforming an input x to a continuous output value, whereas classification involves transforming an input
x to a discrete output label. The non-negative loss function L(y, ŷ) is used to measure the difference between
the prediction ŷ (obtained from classifier h) and the target value or label y. The expectation of the loss function,
also known as the risk function associated with h, is defined as

R(h) = E[L(y, h(x))] = ∫ L(y, h(x)) dP(x, y). (2.1)
The goal of supervised machine learning is to find a classifier h∗ from the set {h(x)}, x ∈ Rp, for which
the risk is minimized, that is,

h∗ = arg min_{h ∈ {h(x)}} R(h). (2.2)
The classifier h∗ models P (y|x) which is the probability of y given x. In contrast to this discriminative
model, a generative model attempts to model the joint probability distribution P (x, y) and use it to derive
P (y|x) via Bayes’ rule. In real world datasets, the full joint distribution P (x, y) is usually unknown, and
only a finite set of examples is available. In this thesis, we restrict our discussion to discriminative models.
For discriminative models, we calculate the empirical risk function, which is an approximation of the risk
function R(h), as follows:

R_emp(h) = Σ_{i=1}^{n} L(yi, ŷi). (2.3)

The theory of empirical risk minimization (ERM) [119] states that a learning algorithm, also known
as a classifier algorithm or an induction algorithm, should choose a classifier ĥ such that the empirical risk is
minimized, that is:

ĥ = arg min_{h ∈ {h(x)}} R_emp(h) = arg min_{h ∈ {h(x)}} Σ_{i=1}^{n} L(yi, ŷi). (2.4)
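As a concrete instance of empirical risk minimization, the sketch below (an illustrative assumption, not from the thesis; all names are hypothetical) minimizes the empirical risk under the squared loss L(y, ŷ) = (y − ŷ)² over the class of linear models h(x) = x·w, for which the minimizer has the familiar least-squares closed form:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=n)

# ERM with squared loss over linear models h(x) = x @ w: the empirical-risk
# minimizer is the ordinary least-squares solution.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
emp_risk = np.mean((y - X @ w_hat) ** 2)  # mean form of the empirical risk
print(w_hat.round(2), round(emp_risk, 4))
```

Note that the mean used here differs from the sum in Equation (2.3) only by the constant factor 1/n, which does not change the minimizer.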
The selection of classifier h from a set of classifiers {h(x)}, x ∈ Rp, is known as model selection.
In contrast, model combination is a procedure in which an ensemble of models, classifiers, or experts
are used together for predicting target values or labels of unseen data. The prediction of an ensemble
is a function of the individual classifiers in the ensemble and the individual classifiers are also known as
base classifiers. The motivation behind model combination is that a single base classifier may not be able
to accurately model the entire distribution of data due to biases in the base classifier. In such cases, an
ensemble whose prediction is a combination of outputs from individual base classifiers, with different model
biases (architectures), may perform better than any single base classifier.
In this thesis, we discuss linear algorithms that are trained using model selection, and random forests,
a nonlinear algorithm that uses model combination for training. Before discussing these algorithms, we
digress to discuss feature selection.
2.2 Overview of Feature Selection
Feature selection is a subfield of machine learning aimed at creating accurate models by including only
relevant features and excluding irrelevant ones.
We focus on the area of supervised learning, and ignore techniques for feature selection and dimen-
sionality reduction via unsupervised learning, where target values or labels are unavailable.
The goal of feature selection is to obtain good performance from a model, i.e., high accuracy or low
error. Good performance is defined over the entire distribution of the data. However, practically speaking,
such distributions are not available and thus held-out test sets are employed for evaluating classifier and
regression accuracy.
Why is feature selection necessary? Consider the Bayes classifier, which is a theoretical ideal classifier
that predicts the most probable class for a given input. The Bayes classifier is assumed to have access to the
distribution of the data. Practically, we cannot create Bayes classifiers, and we do not know the underlying
distribution of the data, except in trivial cases. The Bayes classifier is optimal in its use of input features.
If purely noise features are included in the input, accuracy of the Bayes classifier will not decrease. Thus,
from a theoretical standpoint, features need not be discarded. However, in contrast to this theoretical
ideal, supervised machine learning algorithms are susceptible to noisy features, and the best classification
accuracy is achieved only if the noisy features are ignored; we define a noisy feature as a feature with high
measurement noise. Such susceptibility to noise is due to the fact that any algorithm has to balance the
estimation of multiple parameters (for reducing the bias of the model) against accuracy in estimating the
parameters (for reducing the variance of the model).
2.2.1 Feature Relevancy
Intuitively, one has a sense of what a relevant or irrelevant feature is, but formally defining these terms
is tricky. We summarize some of the attempts at formalization that have been proposed in past research
[11, 22, 70, 74, 129].
• Kohavi and John [74] define the optimal feature subset as the subset of features for which the accuracy of
the constructed classifier is maximized. They note that such a classifier might not be unique. Furthermore,
Kohavi and John [74] and John et al. [70] categorize features into two types based on their relevance to the
prediction: weak and strong features. The set of strongly relevant features, denoted S, consists of those
features whose removal will cause a performance deterioration in the optimal Bayes classifier. In contrast,
a feature X is a weakly relevant feature if it is not a strongly relevant feature and there exists a subset
of features M such that the performance of the Bayes classifier on M is worse than the performance on
M ∪ {X}. Strongly relevant features are essential and cannot be removed without loss of prediction
accuracy, whereas weakly relevant features can sometimes contribute to prediction accuracy. Irrelevant
features are features that are neither strongly nor weakly relevant. In Figure 2.1, we depict Kohavi and
John's [74] definition of the optimal set of features via the darkly shaded region; a Bayes classifier
has to use all the strongly relevant features and some of the weakly relevant features.
Figure 2.1: The figure is based on the discussion in Kohavi and John [74]. The optimal set of features is defined by the darkly shaded region, which contains all strongly relevant features and some, but not all, of the weakly relevant features.
• Blum and Langley [11] define a strongly relevant feature as follows: a feature S is strongly relevant if
there exist examples A and B that differ only in their assignment to S and have different target values or
labels; weakly relevant features are features that become strongly relevant when a subset of
features is removed from the data. Though their definition of strongly and weakly relevant features is
a bit different from [70, 74], the implications are similar; that is, strongly relevant features should never
be discarded, as their removal increases the ambiguity of the data, thereby increasing the probability of
prediction error on future data.

Figure 2.2: Reproduced from Yu and Liu [129]. I: irrelevant features; II: weakly relevant and redundant features; III: weakly relevant but non-redundant features; IV: strongly relevant features; III + IV: optimal subset. According to Yu and Liu [129], redundancy of features should be considered within the sets of strongly and weakly relevant features (as defined by Kohavi and John [74]). Ideally, the optimal set of features is the set III + IV, that is, the set containing strongly relevant features and weakly relevant but non-redundant features.
• Yu and Liu [129] argue that redundancy of features should be considered in addition to the relevancy of
the features. Their definition of redundant features coexisting with weakly and strongly relevant features
is depicted in Figure 2.2; redundant features are considered to be a form of weakly relevant features. The
optimal set of features should consist of strongly relevant features and weakly relevant features that are
non-redundant.
• Breiman [22] relates the relevancy of features to their selection in a decision tree. At each
node in a decision tree, a decision tree algorithm searches the entire feature space for the feature that best
separates/splits the data according to an information criterion. Breiman characterizes strong features as
features that have a high probability of being selected for splitting at the nodes of the decision tree, and
weak features as those with a low probability of being selected. Breiman does not define the irrelevancy
of features, but assumes that noisy features are a type of weak feature. Similarly, Breiman does not
indicate a probability threshold that differentiates a weak feature from a strong feature.
2.2.2 Type of Feature Selection Algorithms: Wrapper, Filter, and Embedded
Now, we give a brief overview of feature selection algorithms. According to Kohavi and John [74],
feature selection algorithms broadly fall into three categories: filter, wrapper, and embedded methods.
(1) Filter-based feature selection algorithms work by analyzing the characteristics of the data and evaluating
the features without involving any learning algorithm in the process. Thus, the results of filter methods
are independent of the classifier.
(2) Wrapper-based feature selection algorithms use a learning algorithm to identify the relevant features and
create the final model using only those features. Wrapper-based algorithms use multiple search
strategies in conjunction with a classifier. As exhaustive exploration of all feature subsets is an NP-hard¹
problem, wrapper-based algorithms use the classifier as a black box and evaluate the performance of
the classifier on candidate feature sets. The search strategies include backward elimination (removing one
feature at a time) and forward selection (adding one feature at a time).
(3) Algorithms like lasso regression [112], the 1-norm Support Vector Machine (SVM) [131], random forests [19],
and decision trees [23, 90] are embedded feature selection algorithms and include feature selection as
part of their model training process; that is, these algorithms simultaneously train the model and perform
feature selection. The algorithms that we focus on in this thesis are embedded methods.
A disadvantage of the feature sets obtained from the wrapper and embedded methods is their dependence
on the classifier used. However, this dependency allows wrapper and embedded methods to obtain higher
predictive accuracy than filter methods because selection and evaluation of the features is done by the same
classifier.
In the next section, we discuss linear models which are an interesting class of algorithms as their linear
form facilitates analysis.
1 A dataset with p features has 2^p possible feature subsets.
2.3 Feature Selection Using Linear Models
Next, we discuss linear models, a type of classifier.
Linear Model. Assume that a training dataset D = {(xi, yi), i = 1, . . . , n, xi ∈ R^p, yi ∈ R} is available.
A linear model, for training example i, is defined as

yi = β0 + ∑_{j=1}^{p} xij βj , (2.5)

where xij represents the value of the jth feature in the input xi for example i, yi is the output value, β0
is known as the bias, and the βj are known as the model parameters. We refer to the vector β = (β0, β1, . . . , βp) as the
model parameter vector; for convenience, we include the constant variable 1 in xi, which allows us to include
β0 inside the parameter vector β.
Least Squares Method. The least squares method is a popular learning algorithm used to obtain the
model parameters β for the linear model shown in Equation 2.5.
Assume that a training dataset D = {(xi, yi), i = 1, . . . , n, xi ∈ R^p, yi ∈ R} is available. We represent
the inputs xi, i = 1, . . . , n, in a matrix X of size n × p and the target values yi, i = 1, . . . , n, in
a vector y of size n. Then, the least squares estimate is

β = arg min_β (y − Xβ)^T (y − Xβ). (2.6)

If we assume that {h(x)}, x ∈ R^p, is a set of linear models, then β is the parameter vector of the linear
classifier h selected using the theory of empirical risk minimization (ERM), which was discussed earlier in
Section 2.1.

The analytical solution for the least squares method in Equation 2.6 is

β = (X^T X)^{-1} X^T y, (2.7)

which can be obtained by setting the derivative of Equation 2.6 to zero, and is defined only when X^T X is
invertible. When X^T X is singular or near-singular, its inversion is not stable and will cause large magnitudes
in β. In order to rectify this instability, a method known as regularization [61, 81, 85, 117] is employed.
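The closed-form solution in Equation 2.7 can be checked numerically; the sketch below uses NumPy on a made-up dataset, with a constant column of ones appended so that the bias β0 is part of the parameter vector.

```python
# Numerical check of the least squares closed form (Equation 2.7):
# beta = (X^T X)^{-1} X^T y. The data are invented for illustration.
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])          # first column is the constant 1 (bias term)
y = np.array([3.0, 5.0, 7.0, 9.0])  # exactly y = 1 + 2x

beta = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta)  # close to [1.0, 2.0]
```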
Regularization. Regularization is a method for obtaining stable models. For regularization of the least
squares method, we augment Equation 2.6 to form

β = arg min_β (y − Xβ)^T (y − Xβ) + λ ||β||_2^2 , λ ≥ 0, (2.8)

where ||β||_2^2 = β1^2 + · · · + βp^2. The analytical solution for Equation 2.8 is

β = (X^T X + λI)^{-1} X^T y, (2.9)

and the additional λI term helps mitigate the effects of a singular or near-singular X^T X.
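The stabilizing effect of the λI term can be seen on a small made-up example with two nearly collinear features: the plain least squares solution (Equation 2.7) takes on huge, unstable magnitudes, while the ridge solution (Equation 2.9) stays small.

```python
# Ridge regularization on a near-singular X^T X. The two features below
# are nearly collinear (invented data), so least squares is unstable.
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = x1 + np.array([1e-6, -1e-6, 1e-6, -1e-6])  # nearly a copy of x1
X = np.column_stack([x1, x2])
y = np.array([1.1, 1.9, 3.1, 3.9])

# Plain least squares (Equation 2.7): X^T X is near-singular, and the
# solution has huge magnitudes of opposite sign on the two features.
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge (Equation 2.9): the lambda*I term keeps the system well
# conditioned and the coefficients small.
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print(np.abs(beta_ls).max())     # on the order of 1e5
print(np.abs(beta_ridge).max())  # less than 1
```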
A general form of regularization is defined as

β = arg min_β L(y, h(X, β)) + λ J(β), λ ≥ 0, (2.10)

where L(y, h(X, β)) is the loss function, J(β) is called the penalty function, and h(X, β) is a classifier or a
model. J(β) controls the complexity of the model via the model parameter β, and the λ parameter, also
known as the regularization parameter, controls the amount of complexity allowed for the model. Models
trained using Equation 2.10 are known as regularized or penalized models, and the corresponding learning
algorithms are known as regularized or penalized algorithms.
In the case when least squares is regularized, as shown in Equation 2.8, h(X, β) = Xβ, L(y, h(X, β)) =
(y − Xβ)^T (y − Xβ), and J(β) = ||β||_2^2. The term J(β) = ||β||_2^2 = β1^2 + · · · + βp^2 is known as the
ridge penalty function or the L2-norm penalty function; for brevity, we henceforth refer to it as the ridge
penalty or the L2 penalty. The L2 penalty gets its name from the Lp-norm function, which for a vector x of
size n is ||x||_p = (∑_{i=1}^{n} |xi|^p)^{1/p}.
The rest of this section is devoted to discussing different types of functions that can be used
to model the loss function L(y, h(X, β)) and the penalty function J(β) in Equation 2.10. We limit our
discussion of loss functions to convex functions [14, 95], for which any local minimum is a global minimum.
2.3.1 Regularization Using the Ridge Penalty Function
Ridge Regression. In Equation 2.8, we showed that the least squares method can be regularized by
using the ridge penalty or the L2 penalty; the resultant formulation is called ridge regression.
Support Vector Machine (SVM) Classification Using the L2 Penalty. Up until now, we have
primarily discussed regression using the least squares loss function where the targets y are continuous values.
When targets are discrete values or class labels, the learning problem is known as classification. A popular
loss function, used for classification, is the hinge loss function which is defined as
Hinge Loss: L(y, h(X, β)) = ∑_{i=1}^{n} [1 − yi (xi β)]_+ , [v]_+ = max(v, 0), yi ∈ {+1, −1}. (2.11)
In Equation 2.11, we only consider the case of binary classification and assume that the possible class labels
for a target are either +1 or -1. Note that in regression β represents a p-dimensional hyperplane passing
through the data, whereas in classification β represents a p-dimensional hyperplane separating the two
classes, where p is the number of features. There are differences between the least squares loss function and
the hinge loss function. The least squares loss function assigns a positive value whenever there is a difference
between an example's target value and the predicted value, and zero otherwise. In contrast, the hinge loss
function assigns a positive value only if an example is misclassified or falls within the margin of the separating
hyperplane. The value of the loss for such an example depends on the distance of the example from the
separating hyperplane represented via the model parameter β.
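A minimal sketch of the hinge loss of Equation 2.11 for a single example, with made-up values of the margin yi(xiβ): the loss is zero for a confidently correct prediction and grows linearly otherwise.

```python
# Hinge loss for one example with label y in {+1, -1} and score x.beta.
# The scores are made-up margins, chosen for illustration.

def hinge(y, score):
    return max(0.0, 1.0 - y * score)

print(hinge(+1, 2.5))   # correct and beyond the margin -> 0.0
print(hinge(+1, 0.4))   # correct but inside the margin -> 0.6
print(hinge(-1, 0.4))   # misclassified -> 1.4
```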
The formulation of the support vector machine (SVM) algorithm [59, 119] is

L2 penalized SVM: β = arg min_β { λ ||β||_2^2 + ∑_{i=1}^{n} [1 − yi (xi β)]_+ }, [v]_+ = max(v, 0), λ ≥ 0. (2.12)
The parameter vector β is discovered by optimizing the hinge loss function regularized with the L2 penalty.
For clarity, we refer to the SVM algorithm as L2 penalized SVM.
The name of the SVM algorithm is derived from the concept of support vectors, which are examples
that lie near the hyperplane separating the two classes and can be used to identify the separating hyperplane.
Typically, the number of support vectors is much less than the number of training examples. In case of linear
models, including the linear SVM, we can identify the hyperplane either using the parameter vector β or the
support vectors. When the SVM algorithm is used to train nonlinear models, the model parameter β is not
available and the support vectors are used to define the nonlinear hyperplane separating the two classes. In
this thesis, we limit ourselves to the linear SVM.
Unlike ridge regression, an analytical solution does not exist for the L2 penalized SVM shown in
Equation 2.12. Though quadratic programming can be employed to solve the L2 penalized SVM, more
efficient algorithms have been designed; for more information, refer to Scholkopf and Smola [101] and Platt
[86].
2.3.2 Regularization Using the Lasso Penalty Function
The lasso penalty or L1 penalty was considered by Tibshirani [112] for regularizing least squares
regression. This paper showed that the lasso penalty induces feature selection within the trained linear
model by producing a sparse β. That is, many of the values in the model parameter β may be set to zero
depending on the value of the regularization parameter λ. Before discussing the feature selection properties
of the lasso penalty, we present formulations of the lasso penalty that incorporate both the least squares and
hinge loss functions.
Lasso Regression. The lasso or L1 penalty, represented as ||β||_1 = |β1| + · · · + |βp|, is used with least
squares to form

β = arg min_β (y − Xβ)^T (y − Xβ) + λ ||β||_1 , λ ≥ 0, (2.13)

and the resultant formulation is known as lasso regression.
Support Vector Machine (SVM) Classification Using the L1 Penalty. Just as we defined the L2
penalized SVM in Equation 2.12, we can define an L1 penalized form of the hinge loss function as

L1 penalized SVM: β = arg min_β { λ ||β||_1 + ∑_{i=1}^{n} [1 − yi (xi β)]_+ }, [v]_+ = max(v, 0), λ ≥ 0. (2.14)
Unlike ridge regression, an analytical solution does not exist for either lasso regression or the L1 penalized
SVM. But before we discuss algorithms for solving lasso regression and the L1 penalized SVM, we discuss
feature selection properties induced in lasso regression and the L1 penalized SVM due to the L1 penalty.
Feature Selection in Lasso Regression and Other Lasso Penalized Algorithms. The following
example was described in Tibshirani [112] to show that the lasso penalty performs a type of feature selection.
Assume that the input matrix X is orthogonal (X^T X = I). We use βls to denote the least squares
estimate for the orthogonal case; thus, using Equation 2.7, βls = (X^T X)^{-1} X^T y = X^T y.

We digress a bit and first discuss the analytical solution for ridge regression. When the input matrix X
is orthogonal (X^T X = I), the analytical solution for ridge regression is

βridge = (X^T X + λI)^{-1} X^T y = (1 + λ)^{-1} βls , using Equation 2.9, where βls = X^T y. (2.15)
Note that the model parameter βridge, obtained from ridge regression, is shrunk by a factor of (1 + λ)
compared to the model parameter βls obtained from least squares.
In general, there is no analytical solution for lasso regression, but an analytical solution does exist
when the input matrix X is orthogonal (X^T X = I). Setting the (sub)derivative of Equation 2.13 to zero,
we obtain

βj(lasso) = sign(βj(ls)) (|βj(ls)| − λ)_+ , where (v)_+ = max(v, 0), (2.16)
where βj(lasso) and βj(ls) are used to represent model parameters from lasso regression and least squares,
respectively, for the jth feature. As shown in Equation 2.16, the model parameter βj(lasso) obtained from
lasso regression for the jth feature is nonzero only when |βj(ls)| > λ, and zero otherwise. Thus, for lasso
regression, the lasso penalty thresholds features based on the amount of λ specified, producing a sparse βlasso,
the parameter vector obtained from lasso regression; a sparse βlasso is still produced by lasso regression when
the input matrix X is not orthogonal.
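The thresholding rule in Equation 2.16 is known as soft thresholding and can be sketched as follows; the least squares coefficients below are made up for illustration.

```python
# Soft thresholding (Equation 2.16): each least squares coefficient is
# shrunk toward zero by lambda and set exactly to zero once
# |beta_ls| <= lambda. The coefficients are invented.
import math

def soft_threshold(b, lam):
    if abs(b) <= lam:
        return 0.0
    return math.copysign(abs(b) - lam, b)

beta_ls = [3.0, -1.5, 0.4, -0.2]
lam = 0.5
beta_lasso = [soft_threshold(b, lam) for b in beta_ls]
print(beta_lasso)  # [2.5, -1.0, 0.0, 0.0] -- two features are dropped
```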
When we consider other convex loss functions, including the hinge loss function shown in Equation 2.11
and used for the L2 penalized SVM, the lasso penalty likewise induces sparsity in the model parameters. In
contrast, the model parameter obtained by optimizing a convex loss function regularized with the ridge or
L2 penalty is dense; that is, most of the values in β are non-zero.
Solving Lasso Penalized Formulations Using LARS and LARS-like Algorithms. Earlier, we
noted that an analytical solution does not exist for either lasso regression or the L1 penalized SVM. However,
because the L1 penalty is a convex penalty and both least squares and hinge loss functions are convex
functions, there exists a global minimum [14, 95] for the resultant formulations of both lasso regression
and the L1 penalized SVM. As the least squares loss function is quadratic, Tibshirani [112] proposed using
quadratic programming to solve lasso regression for a fixed value of the regularization parameter λ ≥ 0
in Equation 2.13. Similarly, as both the hinge loss function and the L1 penalty are piecewise linear, we can
use linear programming to solve the L1 penalized SVM for a fixed value of the regularization parameter
λ ≥ 0 in Equation 2.14.
An efficient algorithm, known as Least Angle Regression (LARS), was proposed by Efron et al. [42]
for solving lasso regression. For lasso regression, the running time of LARS is less than that of a solver using
quadratic programming. Also, LARS can simultaneously solve for the entire range of the regularization
parameter λ (0 ≤ λ ≤ ∞) in Equation 2.13.
Before discussing the LARS algorithm, we make the following two observations about lasso regression
that were essential in the design of LARS: (1) For lasso regression, between any two regularization values
λ1 and λ1 + ε, where ε > 0 is sufficiently small, the change in a model parameter βj is linear in λ, where j is the
index of an input feature with βj ≠ 0. (2) In Equation 2.16, we concluded that the lasso penalty thresholds
features using the λ parameter, and the larger the value of λ, the sparser the model parameter β.
The working of the LARS algorithm can be described as follows. The initial
value of the regularization parameter is set to λ = ∞ and is gradually decreased to λ = 0. Due to the
thresholding property of the lasso penalty, as λ decreases, more features gain a nonzero value in the parameter
vector β. Also, for lasso regression, the change in the parameter vector β is linear in λ for features whose
parameters are nonzero. LARS analytically calculates the change in λ required to go from k features to k + 1 features and
the corresponding change in the feature values represented in the model parameter β. Thus, the LARS
algorithm can calculate the parameter vector β for the entire regularization path 0 ≤ λ ≤ ∞ in p steps
when p features are present; in practice, the number of steps taken is min(n, p), where n is the number of
examples and p is the number of features.
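Although LARS handles general X, its behavior is easiest to see in the orthogonal case of Equation 2.16, where the entire path is available in closed form: as λ decreases, features enter the model one at a time, and each nonzero coefficient changes linearly in λ. A sketch with made-up least squares coefficients:

```python
# Lasso regularization path for an orthogonal design (Equation 2.16).
# Features enter exactly at lambda = |beta_ls|, and active coefficients
# change linearly in lambda. The coefficients are invented.

def soft_threshold(b, lam):
    return 0.0 if abs(b) <= lam else (b - lam if b > 0 else b + lam)

beta_ls = [4.0, 2.0, 1.0]   # |beta_ls| determines the order of entry

for lam in [5.0, 3.0, 1.5, 0.5, 0.0]:
    beta = [soft_threshold(b, lam) for b in beta_ls]
    n_active = sum(b != 0.0 for b in beta)
    print(lam, beta, n_active)   # active set grows as lambda decreases
```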
Recently, many papers have proposed algorithms that calculate the entire regularization path for
different types of linear loss functions, and these algorithms bear similarity to the LARS algorithm. Zhu
et al. [131] proposed an efficient algorithm for the L1 penalized SVM (shown in Equation 2.14). Later,
Rosset and Zhu [97] provided an algorithmic framework modeled on LARS that can calculate the entire
regularization path (0 ≤ λ ≤ ∞) for once- and twice-differentiable loss functions using the L1 penalty.
An example of a once-differentiable loss function is the hinge loss function shown in Equation 2.11; an
example of a twice-differentiable loss function is the least squares loss shown in Equation 2.6. Henceforth, we
refer to algorithms that follow the algorithmic framework described in Rosset and Zhu [97], which also
includes LARS, as piecewise-linear solution-path-generating algorithms; we use such algorithms extensively
in this thesis.
Even though the algorithms discussed in this thesis are based on L1 and L2 penalty functions, we note
that the literature includes exploration of other penalty functions. We briefly discuss these other functions
in the next section.
2.3.3 Regularization Using Other Penalty Functions
Lγ Family of Penalty Functions. The most popular penalty functions are the ridge or L2
penalty and the lasso or L1 penalty. In this section, we describe other penalty functions, starting with the
members of the Lγ family of penalty functions. The bridge penalty [50] is a family of penalty functions
defined as

J(β) = ∑_j |βj|^γ , γ ≥ 1. (2.17)
The bridge family of penalty functions includes the L1 and L2 penalty functions. The family of
penalty functions lying between the L1 and L2 penalties, that is, J(β) = ∑_j |βj|^γ with 1 ≤ γ < 2
(including the L1 penalty but excluding the L2 penalty), is shown to induce sparsity in the model parameter
β when used to regularize a convex loss function; the reason for the sparsity is the discontinuity in the
derivatives of the penalty functions, which forces certain model parameters to be set to zero, thus producing a
sparse β vector. For Lγ (γ ≥ 1) penalty functions, the greatest sparsity in β is obtained with the
L1 penalty, whereas less sparsity in β is obtained using a penalty function with γ near 2. Penalty functions
with J(β) = ∑_j |βj|^γ, γ ≥ 2, which include the ridge or L2 penalty, do not produce sparse solutions,
because such penalty functions are continuously differentiable everywhere.
The members of the Lγ family of penalty functions, other than the bridge family of penalty functions, are

J(β) = ∑_j |βj|^γ , 0 ≤ γ < 1. (2.18)
The L0 penalty function is different from all other norm penalty functions in that it counts only the
number of non-zero values present in the model parameter β. The optimization of a loss function regularized
with the L0 penalty function is NP-hard, because in order to find a model with a fixed number of non-zero
values in β, we need to evaluate 2^p different linear models, where p is the number of features. Also, the L0
penalty and the rest of the penalty functions J(β) = ∑_j |βj|^γ, 0 < γ < 1, are non-convex in nature; thus,
when such penalty functions are used to regularize a convex loss function in Equation 2.10, a
global minimum may not be found due to the existence of multiple local minima. All members of
the family of penalty functions J(β) = ∑_j |βj|^γ, 0 ≤ γ < 1, when used to regularize a
loss function in Equation 2.10, impart sparsity to the model parameter β. The L0 penalty produces the sparsest
β, and the rest of the penalty functions in this family produce a sparser β than the L1 penalty. Thus, there is interest
in devising algorithms that can optimize such non-convex penalty functions. Friedman [48] proposed an
algorithm that approximates the model parameter β along the entire regularization path (0 ≤ λ ≤ ∞) when
either a convex (Lγ, 1 ≤ γ < ∞) or a non-convex (Lγ, 0 < γ < 1) penalty function is used to regularize a
convex loss function.
Some Other Penalty Functions. We do not discuss the details of other popular penalty functions, but
acknowledge their usage in the literature. The Smoothly Clipped Absolute Deviation (SCAD) penalty function is
a non-convex function for which efficient algorithms have been proposed [44, 135]. The elastic-net penalty
function [134] uses a combination of the L1 and L2 penalty functions, inducing sparsity in the model parameters
via the L1 penalty while controlling the amount of sparsity via the L2 penalty. The fused lasso penalty [113]
was proposed as an extension of the L1 penalty for cases where feature magnitudes are related via an a priori
known feature ordering; in the fused lasso, penalty terms are added to the L1 penalty that constrain the
model to respect the a priori known feature ordering.
This section primarily discussed linear models, which capture linear relationships among features. Next,
we discuss random forests, an algorithm that can model nonlinear relationships among features.
2.4 Feature Selection Using Random Forest
The random forest model has proven extremely popular in application due to its excellent performance
on a variety of problems, its ability to capture nonlinear relationships among input features, and the fact
that it has been shown to be a universal approximator.
Next, we try to motivate random forest as a state-of-the-art supervised machine learning algorithm
worthy of further research.
2.4.1 Why Use Random Forest?
The field of supervised machine learning would be simplified if we could find a single best supervised
machine learning algorithm. Unfortunately, finding the best supervised algorithm is not possible, and this
fact is best elucidated via the ‘No Free Lunch Theorem’ of Wolpert and Macready [126, 127]. According
to the ‘No Free Lunch Theorem’, in order to find the best supervised learning algorithm, we would have to
compare algorithms on all possible (infinitely many) datasets, and thus the search for the best algorithm is futile.
Instead, supervised learning algorithms are compared and analyzed on a fixed number of datasets
on the basis of their prediction error. When the prediction results are not significantly different, secondary
properties, such as the training and testing time of the models and their interpretability, among others,
play a role in the choice of an algorithm. In certain cases, the secondary properties are more important than
prediction error.
The current state-of-the-art supervised machine learning algorithms for modeling nonlinear data are
SVM, boosting, and random forest; extensive studies show comparable accuracy among these algorithms. Our
interest in random forest is due to the following two properties: implicit feature selection and computational
complexity. Due to their use of decision trees, random forest models perform feature selection in a nonlinear
setting. Additionally, random forests have lower computational complexity for training than SVM and
boosting; we briefly compare the computational complexity of SVM and random forest in Appendix C.8.
Random Forest is a Popular Off-the-shelf Classifier. Like other popular supervised learning tech-
niques, random forests have been used extensively in multiple domains, including bioinformatics and ecology,
among others; instead of providing an exhaustive list of studies, we refer the reader to the following notable
studies: Saeys et al. [99] and Yang et al. [128] in bioinformatics, Watts and Lawrence [121] and Cutler et al.
[33] in ecology, and Toole et al. [115] and Jun et al. [71] in general computing.
We can also quantify the general interest in random forests by their usage in a competitive set-
ting. Kaggle (www.kaggle.com) is a platform where researchers can compete on machine learning tasks
provided by companies and other researchers; according to Jeremy Howard, the founder of Kaggle at
http://strataconf.com/strata2012/public/schedule/detail/22658, “Ensembles of decision trees (of-
ten known as ‘Random Forests’) have been the most successful general-purpose algorithm in modern times.
For instance, most Kaggle competitions have at least one top entry that heavily uses this approach. This
algorithm is very simple to understand, and is fast and easy to apply.” At the same hyperlink, Jeremy also
notes the usefulness and popularity of L1 penalized algorithms, another focus of this thesis. Similarly, for
the NIPS 2003 Feature Selection Challenge, a competition for feature selection algorithms, Guyon et al. [55]
note random forests as one of the top-performing algorithms.
This thesis addresses feature selection in random forests, but before we can discuss feature selection,
we need to lay the groundwork by describing the ideas that led to their genesis: ensemble techniques and
decision trees.
2.4.2 Model Combination in an Ensemble of Classifiers
In Section 2.1, we briefly discussed model combination, a procedure in which an ensemble of models,
classifiers, or experts are used together for predicting target values or labels of unseen data. In order to
discuss model combination in depth, we present a particular type of base classifier, a decision tree, which is
the base classifier used in random forests and many other ensemble approaches.
2.4.2.1 Decision Trees: A Type Of Base Classifier
Decision trees are trained by recursively partitioning the feature space and thus can model nonlinear
data. The creation of an optimal decision tree is NP-complete [65], and greedy algorithms like Classification
and Regression Trees (CART) [23] and C4.5 [90], among others, are used to train decision trees. The decision
trees that we consider are binary decision trees.
The training of a decision tree proceeds as follows. At each node in a decision tree, the decision tree
algorithm searches the entire feature space for the feature that best separates/splits the data according
to an information criterion; information criteria include the Gini impurity, which is used for CART trees
[23, 59, 109], and information gain, which is used for C4.5 [90]. Decision trees are grown fully, until each leaf
node contains no more than a certain number of examples; then, using a held-out set or cross-validation, the
trees are pruned to remove nodes that do not provide additional information and to prevent overfitting.
Decision trees are popular because they are fairly interpretable. They are also a versatile modeling tool: they are used for both classification and regression problems, and they accommodate different types of features, including continuous features, discrete features, and in general any kind of feature whose values can be compared.
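As a minimal sketch of this per-node split search, assuming a single continuous feature and the Gini impurity as the information criterion (the function names here are ours, for illustration only):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(xs, ys):
    """Exhaustively search thresholds on one feature for the split that
    minimizes the weighted Gini impurity of the two children."""
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [0, 0, 0, 1, 1, 1]
print(best_split(xs, ys))  # threshold 3.0 separates the classes perfectly
```

A full tree-growing algorithm repeats this search over all (or, in a random forest, a random subset of) features at every node and recurses on the two child partitions.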
2.4.2.2 Components of Model Combination
Model combination can roughly be segmented into the following two components: the method used
for training base classifiers and the method used for combining prediction results of base classifiers; we briefly
describe these two components.
Training Base Classifiers. Base classifiers can be trained either independently of one another or by using information from previously trained base classifiers to guide the training of later ones. We illustrate the contrast with two popular methods: boosting [47, 100] and bagging [15].
Boosting is a generic method in which multiple base classifiers are trained, each on a different subset of the training data, and the ensemble's prediction is a weighted linear combination of the base classifiers' predictions. AdaBoost [47, 100] is a boosting algorithm in which new base classifiers are trained by focusing on examples misclassified by the existing base classifiers; misclassified examples receive an increased weighting. Typically, the base classifier in AdaBoost is a weak classifier, one only slightly better than a random classifier; an example of such a weak classifier is the decision stump, a one-level decision tree. For more information on other boosting algorithms, refer to Hastie et al. [59].
Bagging [15], or bootstrap aggregating, is a generic method in which individual base classifiers, typically decision trees, are each trained on a different bootstrap sample of the original training dataset. The bootstrap [34, 36, 41] creates a new dataset from the available data by repeated sampling, with or without replacement; its most popular form samples n examples with replacement from a training dataset of size n. The prediction of the ensemble is, for regression, the average of the individual base classifiers' predictions; for classification, the predicted label for a test example is the class with the highest number of votes among the base classifiers.
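The bootstrap sampling at the heart of bagging can be sketched in a few lines; as a well-known consequence of sampling with replacement, only about 63.2% of the distinct examples appear in each bootstrap sample (a self-contained sketch, not tied to any library):

```python
import random

# The standard bootstrap: draw n indices with replacement from n examples.
random.seed(0)
n = 10000
boot = [random.randrange(n) for _ in range(n)]

# The fraction of distinct examples that appear in the sample approaches
# 1 - 1/e ≈ 0.632 as n grows; the rest are "out-of-bag".
unique_frac = len(set(boot)) / n
print(round(unique_frac, 3))  # close to 0.632
```

Each base classifier in a bagged ensemble would be trained on one such index list, and the unweighted average (or vote) over classifiers forms the ensemble prediction.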
Combining Predictions From Base Classifiers. The method of combining predictions across base
classifiers can be roughly segmented into weighted methods and meta-learning methods [96].
Weighted methods weight the predictions of individual base classifiers. AdaBoost [47, 100] and stacked regression [17] are weighted methods; the final prediction they yield is a weighted average of the base classifiers' predictions. In contrast, for bagging [15], the final prediction is an unweighted average of the base classifiers' predictions.
Stacked generalization [125], or stacking, is a meta-learning method for combining predictions from
base classifiers; the base classifiers may be obtained from different learning algorithms. In stacking, a meta-
dataset is created from the predictions of the base classifiers, and then, a meta-classifier is trained on the
meta-dataset. By using a meta-classifier, stacking can identify the reliability of individual base classifiers.
For more information on other weighted and meta-learning methods refer to Rokach [96].
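Prediction-time stacking can be sketched in a few lines; the base models and meta-classifier below are toy stand-ins (in practice, the meta-classifier is trained on a meta-dataset built from base-model predictions on held-out data):

```python
def stack_predict(base_models, meta_model, x):
    """Sketch of stacked generalization at prediction time: the base
    models' outputs on x form a meta-example, which the trained
    meta-classifier then labels."""
    meta_example = [m(x) for m in base_models]
    return meta_model(meta_example)

# Toy ensemble: one reliable base classifier and one that always says True;
# a (hypothetical, already-trained) meta-classifier has learned to trust
# only the first, i.e., it has identified the reliable base classifier.
base_models = [lambda x: x > 0, lambda x: True]
meta_model = lambda preds: preds[0]
print(stack_predict(base_models, meta_model, -3))  # → False
```

The ability of the meta-classifier to down-weight the unreliable second model is exactly the "reliability identification" property described above.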
2.4.2.3 The Role of Resampling in the Evolution of Ensemble Methods
We now briefly describe some studies on ensemble training that motivated Breiman to propose random forests [19].
In 1994, Breiman [15] proposed the bagging algorithm, which uses an ensemble of decision trees trained on bootstrapped datasets; bootstrapped datasets sample the space of available training examples. In 1996, Freund and Schapire [47, 100] proposed AdaBoost, a boosting algorithm in which base classifiers are trained to minimize the training misclassification, with misclassified examples receiving an increased penalization cost; in 1998, Breiman [18] characterized AdaBoost as an ‘adaptive resampling and combining’, or ‘arcing’, algorithm, in which the examples used to train a base classifier are sampled according to their misclassification rate. In 1997, Amit and Geman [3] proposed growing an ensemble of trees and, at each node in the trees, selecting the best feature from a subset of the available features; their work focused on character recognition, and they limited the depth of the trees. In 1998, Ho [60] proposed the random subspace method for creating an ensemble of decision trees, in which each tree is trained using a randomly selected subset of the available features. In 2000, Dietterich [39] compared bagging with a randomized version of C4.5 trees to bagging with ordinary C4.5 trees and to boosting; in the randomized version of C4.5, at each node of a decision tree, the splitting feature was selected at random from the top 20 features.
A common theme among these ensemble methods is the use of resampling, either in the feature space,
example space, or both.
2.4.3 The Random Forest Algorithm
The random forest algorithm was proposed by Breiman [19] in 2001. Random forests are constructed as an ensemble of decision trees trained by resampling both the feature space and the example space. As in bagging, each decision tree is constructed using a bootstrapped dataset; unlike bagging, at each node in a tree, a random subset of the features is searched for the best feature. The difference between bagged trees and random forest trees is shown in Figure 2.3. The prediction of the random forest ensemble is as in other ensemble methods: the average of the trees' predictions in the case of regression, or a vote among the trees in the case of classification.
Parameters of the Random Forest Algorithm. Bagging [15] has a single parameter, ntree, which controls the number of trees in the ensemble. The random forest algorithm instead has two parameters: mtry, the number of randomly chosen features searched for the best feature at each node of each tree, and ntree, the number of decision trees to include in the forest.
Figure 2.3: We depict the search for the best feature at each node within bagged trees and random forest trees; the best feature is the one with the highest information criterion score. For trees in bagging, we consider a search over all p features for the best feature. For trees in a random forest, we search over mtry < p features for the best feature, where p is the number of features.
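The per-node contrast can be checked with a small simulation (a sketch of the subsampling step only, not of any random forest implementation): each feature becomes a split candidate at a given node with probability mtry/p.

```python
import random

# At each node, a random forest draws mtry of the p features without
# replacement and searches only that subset for the best split; bagged
# trees would search all p features instead.
rng = random.Random(0)
p, mtry, nodes = 10, 3, 20000
counts = [0] * p
for _ in range(nodes):
    for j in rng.sample(range(p), mtry):
        counts[j] += 1
frac = counts[0] / nodes
print(round(frac, 2))  # ≈ mtry / p = 0.3
```

This per-node subsampling is what decorrelates the trees relative to bagging, at the cost of sometimes missing the globally best feature at a node.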
Before proceeding, it is prudent to discuss the theory behind random forests and examine the effects of the ntree and mtry parameters on the generalization error (the probability of prediction error over the entire distribution of the data).
2.4.3.1 A Theoretical Motivation for Random Forest
In his paper on random forests [19], Breiman provided theoretical motivation for ensemble methods, including random forests; we briefly describe his derivations in the following paragraphs.
First, we define a maximum margin classifier. For data containing two separable classes, the maximum margin classifier constructs the hyperplane that divides the classes and is furthest from both of them. Other popular supervised learning algorithms, such as the SVM [31, 119] and boosting algorithms [98], are also maximum margin classifiers. Next, we consider a maximum margin classifier for k classes, which separates a given class from all other classes.
The margin function for a maximum margin classifier is defined, over the entire ensemble, as

mr(x, y) = P_Θ(h(x, Θ) = y) − max_{j ≠ y} P_Θ(h(x, Θ) = j),

where y is the target class for the input x, Θ is a classifier distribution, h(x, Θ) is a classifier, and j is a class label. The margin is the difference between the ensemble's confidence that x belongs to the correct class (represented via y) and its highest confidence that x belongs to any other class (represented via j). When the value of the margin function mr(x, y) for x is greater than zero, more base classifiers vote for the correct class than for any other class; a value less than zero implies that more base classifiers vote for an incorrect class than for the correct class. Also, we assume that the base classifiers are generated from the same distribution Θ.
The expected prediction error rate can be defined as PE = P_{x,y}(mr(x, y) < 0) over the entire space of data and classifiers; the prediction error is the probability that the margin function is negative.
For a random variable m with finite mean E(m) and finite variance Var(m), Chebyshev's inequality states that P(|m − E(m)| ≥ t) ≤ Var(m)/t²; informally, most of the probability mass lies close to the mean. Setting t = E(m) gives P(|m − E(m)| ≥ E(m)) ≤ Var(m)/E(m)², which can be used to bound the probability that the margin function is negative (the probability of prediction error):

PE = P_{x,y}(mr(x, y) < 0) \le \frac{Var(mr)}{(E_{x,y}(mr))^2}. \quad (2.19)
Furthermore, Breiman derives a bound on the variance of the margin function, Var(mr) ≤ ρ(1 − (E_{x,y} mr(X, Y))²), where ρ is the correlation among classifiers in the ensemble. Thus, the prediction error is bounded from above as

PE \le \frac{\rho (1 - (E_{x,y}(mr))^2)}{(E_{x,y}(mr))^2}. \quad (2.20)
Though the bound on the prediction error PE is loose, it motivates ensemble techniques that use classifiers sampled from the same classifier distribution. Based on this bound, Breiman argues that an ensemble should reduce the correlation ρ among classifiers and increase the expected value E_{x,y}(mr) of the margin.
We briefly describe the effects of the random forest parameters on the prediction error PE. The expected value E_{x,y}(mr) of the margin should converge to a fixed value as the number of trees ntree goes to infinity. In contrast, mtry is responsible for reducing the correlation ρ among classifiers: a large value of mtry creates correlated classifiers, whereas a very small value of mtry may not be enough to model the underlying distribution.
Breiman [15, 16, 19] notes that decision trees are unstable classifiers: classifiers that produce vastly different models given a small change in the training data. In contrast, the Support Vector Machine (SVM) is a stable classifier. Unstable classifiers are attractive for ensembles because they yield less correlated base classifiers than a stable classifier like the SVM.
Studies on the Theory Behind Random Forest. Now, we briefly discuss some past studies trying
to explain the random forest algorithm.
Breiman [20] showed that random forests can be cast as an algorithm, based on a kernel function (a nonlinear function), that separates two classes. Larger trees are shown to produce better kernels and thus better accuracy.
Breiman [22] proposed that, for random forests, the value of the optimal mtry does not depend on the
sample size, but on the number of strong and weak features present in the data, where strong features are
defined as those having a high probability of being selected for splitting the nodes of the decision tree and
vice-versa for the weak features. Furthermore, mtry is shown to be independent of ntree. Biau [8] showed
that the convergence rate of random forests only depends on the number of strong features in the model and
not on the number of noisy features present.
Lin and Jeon [79] described the similarity between random forests and an adaptively weighted form of
the k -nearest neighbor. The k -nearest neighbor is a nonlinear model that predicts a test example in terms
of k training examples closest to it. Biau and Devroye [9] showed the consistency of bagging an ensemble of
k -nearest neighbor and discussed connections with random forests.
Next, we discuss how random forests are used for feature selection.
2.4.4 Implicit Feature Selection Using Random Forests
Inherently, a decision tree is a feature selecting classifier: at each node of the tree, only the best of the available features, according to an information criterion, is used for partitioning the data. We assume that the decision trees are trained in a manner that depends on the target labels; there are versions of decision trees and random forests [32, 53] that are trained independently of the target labels.
Next, we discuss how random forests are utilized for selecting features in an explicit manner.
2.4.5 Explicit Feature Selection Using Random Forests
In order to perform explicit feature selection using random forests, we first define feature importance,
which is a quantification of the intrinsic value of a feature for prediction.
2.4.5.1 Feature Importance in Random Forests
Before discussing the details of Breiman’s feature importance measure, we must explain a detail of
random forests.
Out-of-bag (OOB) data as a Substitute for Validation Data. In random forests, each tree is trained using a bootstrapped dataset. If we sample with replacement (that is, sample N times from N examples), the limiting probability that a given example appears in a bootstrapped dataset is 0.632.² This means that, on average, only 63.2% of the original examples are reflected in each bootstrapped dataset. Thus, each tree is trained on about 63.2% of the original data, and the remaining 36.8% of the data can be used as a substitute for validation data.
The out-of-bag or OOB error rate over the entire forest, using OOB examples, is calculated as

Classification: err_{oob} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{N_{oob}} \left( \sum_{k=1}^{ntree} I[(x_i, y_i) \notin dataset_k] \, I(\hat{y}_i \neq y_i) \right), \quad (2.21)

Regression: err_{oob} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{N_{oob}} \left( \sum_{k=1}^{ntree} I[(x_i, y_i) \notin dataset_k] \, (\hat{y}_i - y_i)^2 \right), \quad (2.22)

where N_{oob} = \sum_{k=1}^{ntree} I[(x_i, y_i) \notin dataset_k] is the number of trees in which example i is OOB, the ntree bootstrapped datasets dataset_k are used to train the ntree trees, N is the number of examples, I() is an indicator function, err_{oob} is the OOB error rate, x is an input, y is a target, and \hat{y} is a prediction; the expression in parentheses in Equation 2.21 is the number of times example i was misclassified when the example was OOB, across the ntree trees.
2 The 0.632 value is derived as follows: if there are N samples, the probability of choosing a particular sample in one draw is 1/N; if we perform this procedure N times, the probability of not choosing that sample in N trials is (1 − 1/N)^N ≈ e^{−1} ≈ 0.368 (for N ≥ 1000), and the probability of choosing that particular sample in N trials is therefore ≈ 0.632.
The procedure for calculating the OOB error rate for classification trees, using Equation 2.21, can be described as follows: once a random forest model is trained, each tree in the ensemble classifies its OOB examples and the votes are stored; the label predicted for an example is the class that receives the most votes over all trees in the ensemble for which the example was OOB. The OOB error rate is then the proportion of OOB examples that were misclassified; a similar procedure, using Equation 2.22, is employed for regression based random forests.
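The OOB error computation can be sketched on a toy forest; following the per-tree form of Equation 2.21, each example's misclassification indicator is averaged over only the trees for which it was OOB (all data below are hand-made for illustration, not from a real forest):

```python
def oob_error(datasets, predictions, ys):
    """OOB error in the spirit of Equation 2.21: for each example, average
    the misclassification indicator over only those trees whose bootstrap
    sample excluded the example, then average over the N examples."""
    n = len(ys)
    total = 0.0
    for i in range(n):
        oob_trees = [k for k, d in enumerate(datasets) if i not in d]
        if oob_trees:  # an example that is never OOB contributes nothing
            total += sum(predictions[k][i] != ys[i] for k in oob_trees) / len(oob_trees)
    return total / n

# Toy forest: datasets[k] holds the example indices in tree k's bootstrap
# sample; predictions[k][i] is tree k's label for example i.
datasets = [{0, 1}, {1, 2}, {0, 2}, {0, 1}]
ys = [0, 1, 1]
predictions = [[0, 1, 0], [0, 1, 1], [1, 1, 1], [0, 0, 1]]
print(oob_error(datasets, predictions, ys))  # example 2 errs in 1 of its 2 OOB trees
```

Here example 2 is OOB in trees 0 and 3 and is misclassified by one of them, so it contributes 1/2, and the overall OOB error is (0 + 0 + 1/2)/3 = 1/6.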
Feature Importance Calculated Using the OOB Error Rate. Breiman [19, 21] proposed using the OOB error rate to calculate feature importance as follows: (1) a single random forest model is trained, (2) the OOB error rate is calculated, (3) for an individual feature, the values of that feature in the OOB data are permuted and the OOB error rate is recalculated, and (4) the increase in OOB error rate due to the permutation is the importance value of the feature.
Mathematically, Breiman's feature importance for classification tasks is calculated as

\text{feature importance for feature } m = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{N_{oob}} \left( \sum_{k=1}^{ntree} I[(x_i, y_i) \notin dataset_k] \, I(\tilde{y}_i \neq y_i) - \sum_{k=1}^{ntree} I[(x_i, y_i) \notin dataset_k] \, I(\hat{y}_i \neq y_i) \right), \quad (2.23)

and Breiman's feature importance for regression tasks is calculated as

\text{feature importance for feature } m = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{N_{oob}} \left( \sum_{k=1}^{ntree} I[(x_i, y_i) \notin dataset_k] \, (\tilde{y}_i - y_i)^2 - \sum_{k=1}^{ntree} I[(x_i, y_i) \notin dataset_k] \, (\hat{y}_i - y_i)^2 \right), \quad (2.24)

where N_{oob} = \sum_{k=1}^{ntree} I[(x_i, y_i) \notin dataset_k] is the number of trees in which example i is OOB, the ntree bootstrapped datasets dataset_k are used to train the ntree trees, N is the number of examples, I() is an indicator function, x is an input, y is a target, \hat{y} is a prediction before permuting feature m, and \tilde{y} is a prediction after permuting feature m.
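The permutation scheme of Equations 2.23 and 2.24 can be sketched generically; `predict` below stands in for a trained model (a hypothetical stand-in, not a real forest), and the importance is the increase in error rate after shuffling one feature's column:

```python
import random

def permutation_importance(predict, X, y, feature, rng):
    """Sketch of Breiman's permutation importance on held-out data:
    a feature's importance is the increase in error rate after its
    column is randomly permuted."""
    def error_rate(rows):
        return sum(predict(r) != t for r, t in zip(rows, y)) / len(y)
    base = error_rate(X)
    permuted = [row[:] for row in X]
    col = [row[feature] for row in permuted]
    rng.shuffle(col)
    for row, v in zip(permuted, col):
        row[feature] = v
    return error_rate(permuted) - base

# Hypothetical model that depends only on feature 0.
predict = lambda x: int(x[0] > 0.5)
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(500)]
y = [predict(x) for x in X]
imp0 = permutation_importance(predict, X, y, 0, random.Random(2))
imp1 = permutation_importance(predict, X, y, 1, random.Random(2))
print(imp0 > imp1)  # the informative feature gets the larger importance
```

Permuting the irrelevant feature leaves the predictions, and hence the error, unchanged, so its importance is zero; permuting the informative feature destroys the model's accuracy.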
2.4.5.2 Feature Selection in Random Forests using Feature Importance and Backward Elimination
The feature importance measure described in the previous section has been used to rank features and perform feature selection with random forests. The variable selection for random forests (varSelRF) algorithm [37], proposed for classifying gene datasets, iteratively eliminates the features determined to be least important: at each iteration, varSelRF trains a model and eliminates the 20% of features with the lowest feature importance values; finally, the random forest model with the lowest OOB error or validation error is selected.
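The varSelRF-style loop can be sketched with callback stand-ins for training, importance computation, and error estimation (all hypothetical interfaces here; varSelRF itself operates on real random forests):

```python
def backward_elimination(train, importances, error, features, drop_frac=0.2):
    """Sketch of varSelRF-style backward elimination: repeatedly train a
    model, drop the least important drop_frac of features, and keep the
    feature set whose model had the lowest error."""
    feats = list(features)
    best_feats, best_err = list(feats), float("inf")
    while len(feats) > 1:
        model = train(feats)
        err = error(model, feats)
        if err < best_err:
            best_err, best_feats = err, list(feats)
        imp = importances(model, feats)
        keep = max(1, int(len(feats) * (1 - drop_frac)))
        ranked = sorted(zip(imp, feats), reverse=True)  # most important first
        feats = [f for _, f in ranked[:keep]]
    return best_feats, best_err

# Toy problem: features 0-2 are relevant; the error penalizes both noise
# features kept and relevant features dropped (all callbacks are fakes).
relevant = {0, 1, 2}
train = lambda feats: feats
importances = lambda model, feats: [1.0 if f in relevant else 0.1 for f in feats]
error = lambda model, feats: (0.1 * sum(f not in relevant for f in feats)
                              + 0.5 * sum(f not in feats for f in relevant))
best, err = backward_elimination(train, importances, error, range(10))
print(sorted(best))  # the relevant feature set survives
```

In the real algorithm, `train` would fit a random forest on the surviving features, `importances` would return Breiman's permutation importances, and `error` the OOB or validation error.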
This concludes our review of linear models and random forests. In the next chapter, we propose a
method for improving the accuracy of L1 penalized algorithms.
Chapter 3
Single-Step Estimation and Feature Selection using L1 Penalized Linear Models
This chapter was accepted and presented at Feature Selection in Data Mining ‘10 as “Increasing Feature Selection Accuracy for L1 regularized linear models in Large Datasets” [68]. Other than some minor edits, the chapter is reproduced as-is from the paper. The next chapter, “Iterative Training and Feature Selection Using L1 Penalized Linear Models,” is an extension of this chapter.
3.1 Introduction
Feature selection using the L1 penalty (also referred to as the 1-norm or lasso penalty) has been shown to perform well when spurious features are mixed with relevant features; this property has been discussed extensively by Efron et al. [42], Tibshirani [112], and Zhu et al. [131]. In this paper, we focus on feature selection using the L1 penalty for classification, addressing open problems related to feature selection accuracy and large datasets. The paper is organized as follows. Section 2 presents motivation and background, primarily focusing on the fact that, asymptotically, L1 penalty based methods might include spurious features. Based on the work of Zou [133], we show in Section 3 that random sampling can find a set of weights that improves accuracy over the unweighted (and normally used) L1 penalty methods. In Section 4, we show results on two different classification algorithms and compare the weighted method proposed by Zou [132] with the random sampling method described in this paper. Our method differs from Zou's in that it hinges on random sampling, rather than the L2 penalty, to find the weight vector. The proposed method is shown to give significant improvements in accuracy over a number of datasets.
The contribution of our work is as follows: we show that a fast pre-processing step can be used to increase the accuracy of L1 regularized models and is a good fit when the number of examples is large; we connect our method to the theoretical results of Rocha et al. [94], showing its viability for various L1 penalized algorithms, and we present empirical results supporting our claims.
3.2 Background Information and Motivation
Consider the following setup, in which information about n examples, each with p dimensions, is represented in an n × p design matrix X, with y ∈ R^n representing target values/labels and β ∈ R^p representing the set of model parameters to be estimated. In this paper, we consider classification based linear models with a convex loss function and a penalty term (a regularizer). A regularized formulation that generally describes many machine learning algorithms is

\min_{\beta} L(X, y, \beta) + \lambda J(\beta), \quad (3.1)

where L(X, y, β) is the loss function, J(β) is the penalty function, and λ ≥ 0.
The loss, L(X, y, β), may represent various loss functions, including the hinge loss for classification based Support Vector Machines (SVMs) and the squared error loss for regression. Popular forms of the penalty function J(β) include the L2 and L1 norms of β, termed the ridge and lasso penalties, respectively, in the literature (refer to Tibshirani [112]).
3.2.1 Asymptotic properties of L1 penalty
Many papers including Tibshirani [112], Efron et al. [42], and Zhu et al. [131] discuss the merits of the
L1 penalty. The L1 penalty has been shown to be efficient in producing sparse models (models with many
of the β’s set to 0) and this feature selecting ability makes it robust against noisy features. In addition,
the L1 penalty is a convex penalty and when used in conjunction with convex loss functions, the resultant
formulation has a global minimum.
As the L1 penalty is used for simultaneous feature selection and estimation, a topic of interest is whether sparsity holds as n → ∞, where n is the number of examples. Intuitively, given enough samples, the estimated parameters β_n should approach the true parameters β_0.
Assume that the data are generated as

y = X\beta_0 + \varepsilon, \quad (3.2)

where ε is zero-mean Gaussian noise and β_0 is the true generating model parameter vector; β_k^j denotes the jth component of β_k. Let A_0 = {j | β_0^j ≠ 0} be the true model and A_n the model found with n examples. For consistency in feature selection, we need A_n = {j | β_n^j ≠ 0} with lim_{n→∞} P(A_n = A_0) = 1; that is, we find the correct set of features A_0 asymptotically. Zou [133] showed that the lasso estimator is consistent in estimation (β_n → β_0) but can be inconsistent as a feature selecting estimator in the presence of correlated noisy features.
3.2.1.1 Hybrid SVM
Zou [133] showed that adaptive lasso regression, defined as

\text{Adaptive Lasso Regression: } \min_{\beta} \|y - X\beta\|^2 + \lambda \sum_j W_j |\beta_j| \quad \text{s.t. } W_j = |\beta_j^{(OLS)}|^{-\gamma}, \; \gamma > 0, \quad (3.3)

can be used for simultaneous feature selection and creating more accurate models than the standard L1 penalty. The main difference between the adaptive lasso penalty and the standard lasso penalty is the use of user-defined weights W_j. Note that β^{(OLS)} denotes the estimates found via least squares regression.

In Zou [132], the same properties were applied in the case of classification, and the resultant algorithm, known as the ‘Improved 1-norm SVM’ or ‘Hybrid SVM’, is defined as

\text{Improved 1-norm SVM: } \min_{\beta, \beta_0} \sum_i [1 - y_i(x_{:,i}\beta + \beta_0)]_+ + \lambda \sum_j W_j |\beta_j|, \quad (3.4)

\text{where } W_j = |\beta_j^{(l2)}|^{-\gamma}, \; \gamma > 0, \quad \beta^{(l2)} = \arg\min_{\beta, \beta_0} \sum_i [1 - y_i(x_{:,i}\beta + \beta_0)]_+ + \lambda_2 \sum_j \|\beta_j\|_2^2,

{x_{:,i}, y_i} represents an example, λ and λ_2 are regularizing parameters, and v_+ = max(v, 0).
In this paper, we also discuss an L1 penalized algorithm with the squared hinge loss function, defined as

\text{Improved SVM2: } \min_{\beta, \beta_0} \sum_i [1 - y_i(x_{:,i}\beta + \beta_0)]_+^2 + \lambda \sum_j W_j |\beta_j|, \quad (3.5)

\text{where } W_j = |\beta_j^{(l2)}|^{-\gamma}, \; \gamma > 0, \quad \beta^{(l2)} = \arg\min_{\beta, \beta_0} \sum_i [1 - y_i(x_{:,i}\beta + \beta_0)]_+^2 + \lambda_2 \sum_j \|\beta_j\|_2^2,

{x_{:,i}, y_i} represents an example, λ and λ_2 are regularizing parameters, and v_+ = max(v, 0).
For the adaptive lasso penalty, the formulations in Equation 3.3 and Equation 3.4 are convex and require almost no modification to the standard L1 penalized algorithms; refer to Zou [133] for the modifications that are needed.
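One way to see why little modification is needed is the standard substitution argument, sketched here under the assumption that the weights W_j are fixed and strictly positive: rescaling the design matrix columns reduces the adaptive lasso to an ordinary lasso.

```latex
% Substituting \tilde{\beta}_j = W_j \beta_j (equivalently, rescaling the
% design matrix columns as \tilde{x}_j = x_j / W_j) turns the weighted
% problem of Equation 3.3 into a standard lasso:
\min_{\beta} \|y - X\beta\|^2 + \lambda \sum_j W_j |\beta_j|
  \;=\; \min_{\tilde{\beta}} \|y - \tilde{X}\tilde{\beta}\|^2
        + \lambda \sum_j |\tilde{\beta}_j|,
\qquad \hat{\beta}_j = \hat{\tilde{\beta}}_j / W_j .
```

Any solver for the unweighted L1 problem can therefore be reused: rescale the columns, solve, and rescale the solution back.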
Intuitively, the weights found via the L2 penalty are inversely proportional to the magnitudes of the true model parameters β_0 in Equation 3.2. If a feature's weight is small (i.e., its true magnitude is large), the adaptive lasso penalizes that feature less than a feature with a larger weight, encouraging it to have a larger magnitude in the adaptive L1 model; the reverse holds for noisy features.
3.2.2 Motivation for our Method
The weighted lasso penalty depends on obtaining suitable weights W. Zou [132, 133] shows that the ordinary least squares estimates, or the estimates from an SVM with the L2 norm penalty, can be used to find the weights, as shown in Equation 3.3 and Equation 3.4. In this paper, we instead obtain the weights via feature selection on randomized subsets of the training data. If the resulting accuracy is higher than in the unweighted case, the features have been appropriately (and correctly) weighted.
One of our goals was to see whether the results of Zou [133] translate to other linear formulations, and we therefore also experimented with the weighted SVM2 formulation shown in Equation 3.5 (the unweighted formulation is shown in Equation 3.8). The SVM2 formulation is referred to in the literature as the quadratic loss SVM (but with the L2 penalty) or the 2-norm SVM (refer to [103]); here it is the squared hinge loss coupled with the L1 penalty.
3.2.2.1 Efficient Algorithms to solve formulations with L1 norm penalty
Efron et al. [42] presented an efficient algorithm for lasso regression, Least Angle Regression (LARS), that can solve for all values of λ, that is, 0 ≤ λ ≤ ∞. Rosset and Zhu [97] documented a generic algorithm, of which LARS is a special case, that applies to any twice differentiable loss with the L1 penalty. For our experiments, we use specific linear SVM based formulations for which entire regularization paths can be constructed.
The penalized formulation for the 1-norm SVM of Zhu et al. [131] is defined as

\text{1-norm SVM: } \min_{\beta, \beta_0} \sum_i [1 - y_i(x_{:,i}\beta + \beta_0)]_+ + \lambda \sum_j |\beta_j|, \quad (3.6)

\text{Equivalent to Equation 3.6: } \min_{\beta, \beta_0} \|\beta\|_1 + C \sum_i \xi_i, \quad \text{s.t. } y_i(x_{:,i}\beta + \beta_0) \ge 1 - \xi_i, \; \xi_i \ge 0. \quad (3.7)
Zhu et al. [131] showed a simple piecewise algorithm to solve the 1-norm SVM for 0 ≤ λ ≤ ∞. Because the loss and the penalty function are both only singly differentiable, a piecewise path cannot be constructed as efficiently as in LARS, but linear programming can be employed to calculate the step size. Equation 3.7 is equivalent to Equation 3.6 and similar to the standard SVM formulation seen in the literature, except that the L1 norm takes the place of the L2 norm in the penalty.
The penalized formulation for the squared hinge loss (or quadratic loss SVM) with the L1 penalty is defined as

\text{SVM2: } \min_{\beta, \beta_0} \sum_i [1 - y_i(x_{:,i}\beta + \beta_0)]_+^2 + \lambda \sum_j |\beta_j|, \quad v_+ = \max(v, 0). \quad (3.8)
Because the loss function is twice differentiable, an efficient piecewise algorithm can be constructed, via the method of Rosset and Zhu [97], to solve for 0 ≤ λ ≤ ∞. Our interest in such piecewise algorithms is to understand whether the weighted L1 penalty produces better (entire) regularization paths.
3.3 Randomized Sampling (RS) Method to Create Weight Vector
Our randomized sampling method depends on small random subsets of the training data. We assume that each subset is small enough that it is computationally cheap to train on it in a reasonable time. Such randomized sampling is performed multiple times.
3.3.1 Randomized Sampling (RS) Method
Our randomized sampling method is described below in Algorithm 1 and can be explained as follows. We choose a subset of m examples out of the presented n examples such that m << n. We train an L1 penalty based algorithm (e.g., 1-norm SVM [131], SVM2, etc.) to find a set of relevant features, and we record the features found in that particular experiment. After many such randomized experiments, the counts of the number of times each feature was found are summed up and normalized; this count vector is denoted by V.

Algorithm 1: Randomized Sampling (RS) Method
  Input: n examples, each with p features; K randomized experiments; B (block), the number of examples used to train the model in each randomized experiment.
  Output: Count vector V (of size p) giving the number of times each feature was selected over the K randomized experiments.
  Divide the n examples into K randomized sets, each of size B, denoted Ntrn_i, i = 1 . . . K, and let V ← 0.
  for i = 1 . . . K do
    Get Ntrn_i; construct the Ntst_i and Nval_i sets.
    Train Model_i = L1 Algorithm(Ntrn_i, Ntst_i, Nval_i).
    S_i ← features selected in Model_i via the validation data.
    V ← V + {x ∈ R^p | x_j = 1 if j ∈ S_i, else x_j = 0}.

The count vector V is then inverted and used as the weights for the weighted version of the algorithm; that is, the weights used in the weighted formulations are W = 1/V. Intuitively, if a feature is important and is found many times by the RS method, then its corresponding weight is small and it is penalized less, encouraging a larger magnitude for that feature.
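The count-and-invert procedure of Algorithm 1 can be sketched as follows; `select_features` stands in for training the L1 model on a block and returning the selected feature indices (a hypothetical callback, not from any library):

```python
import random

def rs_weights(n, p, K, B, select_features, rng):
    """Sketch of the Randomized Sampling method: run the L1 model on K
    random blocks of B examples, count in V how often each feature is
    selected, and invert the counts to obtain the weights W = 1/V."""
    V = [0] * p
    for _ in range(K):
        block = rng.sample(range(n), B)
        for j in select_features(block):
            V[j] += 1
    # Features never selected receive an effectively infinite penalty.
    return [1.0 / v if v > 0 else float("inf") for v in V]

# Toy selector: always finds features 0 and 1, plus one random noise feature.
noise_rng = random.Random(1)
select = lambda block: {0, 1, noise_rng.randrange(2, 10)}
W = rs_weights(n=1000, p=10, K=50, B=20, select_features=select, rng=random.Random(0))
print(W[0], W[1])  # features selected in every block get the smallest weight
```

Consistently selected features end up with weight 1/K, the smallest possible, so they are penalized least in the weighted formulations; rarely selected (noisy) features receive large weights and are penalized heavily.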
3.3.2 Consistency of choosing a set of Features from Randomized Sampling (RS) Experiments
Our method depends on finding a set of relevant features, and their counts, for a given dataset via the RS method. Our experimental results are restricted to the weighted and unweighted formulations of SVM2 and the 1-norm SVM, but our theoretical results apply to all linear models with a twice differentiable loss function and the L1 penalty. We next state results on the asymptotic consistency and normality properties, in n (the number of examples), of L1 penalized algorithms, which help establish the consistency of our method.
Lemma 1. This result is derived from Theorem 5 of Rocha et al. [94]. Suppose the loss function L(X, y, β) in Equation 3.1 is bounded, unique, and convex, with E|L(X, y, β)| < ∞, and is twice differentiable with a positive definite Hessian matrix H. Define the following consistency condition for the L1 penalty, using the formulation in Equation 3.1 and the true model in Equation 3.2:

\|H_{A^c,A}[H_{A,A} - H_{A,\beta_0} H_{\beta_0,\beta_0}^{-1} H_{\beta_0,A}]^{-1} \, \text{sign}(\beta_A)\|_\infty \le 1, \quad \text{where } H_{x,y} = \frac{d^2 L(X, y, \beta)}{dx \, dy}, \quad (3.9)

A^c = {j ∈ 1...p | β_j = 0}, A = {j ∈ 1...p | β_j ≠ 0}, and β_0 is an intercept.

• If λ_n is a sequence of non-negative real numbers such that λ_n n^{-1} → 0 and λ_n n^{-(1+c)/2} → λ > 0 for some 0 < c < 1/2 as n → ∞, and the condition in Equation 3.9 is satisfied, then P[sign(β_n(λ_n)) = sign(β)] ≥ 1 − exp[−n^c], where β_n is the parameter vector found with n examples.

• If the condition in Equation 3.9 is not satisfied, then for any sequence of non-negative numbers λ_n,

\lim_{n \to \infty} P[\text{sign}(\beta_n(\lambda_n)) = \text{sign}(\beta)] < 1. \quad (3.10)

The probability of choosing incorrect variables is bounded by exp(−Dn^c), where D is a positive constant (shown in the proof of Theorem 5 of [94]).
If the condition in Equation 3.9 is fulfilled, the interactions between relevant and noisy features are distinguishable and the L1 penalty can correctly identify the signs in β. If the condition in Equation 3.9 is not fulfilled, the probability of recovering the correct feature set stays bounded away from 1, and noisy features may be added to the model. Also, note that the above conditions are applicable to the 1-norm SVM, as shown in [94].
Lemma 2. Let b denote the size of each subset and assume b → ∞. From Lemma 1, when the consistency condition in Equation 3.9 is satisfied, P[sign(β_b(λ_b)) = sign(β)] ≥ 1 − exp[−b^c] ≈ 1, where β_b and λ_b are the parameters for a subset of size b. Over k such subsets, the count vector V, as computed in Algorithm 1, is bounded by k(1 − exp[−b^c]) ≈ k as b → ∞. When the condition in Equation 3.9 is not satisfied, the probability of choosing noisy variables in a subset is upper bounded by exp(−Db^c); over k subsets, sum(V_j) ≤ k · exp(−Db^c) and V_j ≈ 0 as b → ∞ (where the indices j correspond to noisy variables). Thus, with high probability, the noisy variables have a low count in V and a large weight in W, penalizing the noisy features heavily.
q     p     2-norm SVM2    1-norm SVM2    Hybrid (Zou)   RS(20%)      RS(30%)      RS(40%)
2     14    9.64±2.30      7.92±1.89      7.88±2.09      7.69±1.71    7.67±1.66    7.68±1.69
4     27    10.90±2.41     8.01±1.84      7.88±2.09      7.73±1.59    7.73±1.59    7.71±1.60
6     44    12.17±2.64     7.93±1.79      7.79±1.69      7.64±1.60    7.64±1.59    7.64±1.52
8     65    13.45±2.96     8.13±2.10      7.87±1.84      7.82±1.85    7.83±1.85    7.81±1.83
12    119   16.91±3.24     8.11±1.95      8.05±1.94      7.78±1.71    7.78±1.70    7.76±1.66
16    189   17.93±3.32     7.87±1.78      8.29±2.41      7.66±1.57    7.66±1.63    7.66±1.63
20    275   19.31±3.32     8.06±2.14      8.04±2.01      7.69±1.81    7.74±1.89    7.77±1.87

Table 3.1: Mean ± std. deviation of error rates (%) on Models 1 & 4 by SVM2. The best algorithms are in bold.
Random Sampling is Subsampling. To better characterize our random sampling method, we describe
it in terms of subsampling (see Politis et al. [87]). Subsampling draws m examples from n total examples
with m < n, unlike the bootstrap, which samples n times with replacement from n examples.
Let the estimator θ be a general function of i.i.d. data generated from some probability distribution P. In our
case of feature selection, this estimator is the feature set. We are interested in the estimator and
its confidence region based on the distribution P of the data, and we denote it θ(P). Given n examples,
we can construct an empirical estimator θ̂_n of θ(P) as θ̂_n = θ(P_n), where P_n is the empirical
distribution; that is, we estimate the true feature set empirically. We define a root of the form τ_n(θ̂_n − θ(P)), where
τ_n is some sequence (such as √n or n) increasing with n (the number of examples), and we examine the
difference between the empirical estimator θ̂_n and the true value θ(P). We define J_n(P) to be the sampling
distribution of τ_n(θ̂_n − θ(P)) based on a sample of size n from P, and define its CDF as

    J_n(x, P) = Prob_P{ τ_n(θ̂_n − θ(P)) ≤ x },   x ∈ R.        (3.11)
 q    p   2-norm SVM    1-norm SVM    Hybrid (Zou)   RS(20%)     RS(30%)     RS(40%)
 2   14    8.74±1.30     7.64±0.09     7.64±1.02     7.63±1.02   7.64±1.01   7.53±0.09
 4   27    9.76±1.75     7.85±1.14     7.95±1.34     7.83±1.28   7.79±1.24   7.69±1.19
 6   44   10.57±1.95     7.85±1.01     7.92±1.12     7.79±1.12   7.77±1.18   7.69±1.23
 8   65   11.47±2.31     7.81±0.99     7.99±1.36     7.75±1.13   7.74±1.15   7.63±1.09
12  119   13.27±2.48     7.91±0.98     8.04±1.35     7.77±1.16   7.82±1.21   7.63±1.00
16  189   15.58±2.94     7.94±1.15     7.87±1.21     7.74±1.31   7.75±1.23   7.64±1.14
20  275   17.14±2.96     7.90±1.00     7.85±1.11     7.77±1.20   7.80±1.27   7.69±1.19
Table 3.2: Mean ± Std. Deviation of Error Rates in % on Models 1 & 4 by 1-norm SVM. The best algorithms are in bold.
Lemma 3. From [87], for i.i.d. data there is a limiting law J(P) such that J_n(P) converges
weakly (in probability) to J(P) and τ_b(θ̂_n − θ(P)) → 0 as n → ∞, under the conditions τ_b/τ_n → 0, b → ∞,
and b/n → 0, where b is the number of examples in the subsample experiment and n is the total number of
available examples.
Lemma 3 has remarkably weak conditions for subsampling: it requires only that the root has some
limiting distribution and that the subsample size b is not too large relative to n (while still going to infinity). In
our case, the subsets are of size b → ∞ with b ≪ n, and for estimation rates τ_n ∝ n^c and τ_b ∝ b^c, 0 < c ≤ 1,
we have τ_b/τ_n → 0. For the RS method, we create a weight vector whose entry for a feature is non-zero if that
feature was selected in a particular experiment. θ̂_n is the sample mean over n such RS experiment weight vectors,
and its mean converges to θ(P) (by Lemma 2). Thus, estimating the true feature set on the basis of random
sampling of subsets of the data is weakly convergent. [133] used weights from a root-n-consistent estimator (from
the L2 penalty) but notes that the conditions can be further weakened: if there is a sequence a_n such that
a_n → ∞ and a_n(θ̂ − θ) = O_p(1), then such an estimator can also be used. By Lemma 3, our RS estimator is
one such consistent estimator and can therefore serve as a valid estimator for the weighted lasso
penalty.
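The subsampling machinery of Lemma 3 can be illustrated numerically with a toy example (this is our illustration, not a thesis experiment): for a mean estimator, the subsampling roots τ_b(θ̂_b − θ̂_n), with τ_b = √b, should have the spread of the known limiting law when b ≪ n. All constants (n, b, the number of trials) are illustrative.

```python
# Toy illustration of the subsampling root in Lemma 3 (hypothetical example).
import numpy as np

rng = np.random.default_rng(0)
n, b, trials = 20000, 400, 500           # b << n, b "large"

data = rng.normal(loc=2.0, scale=1.0, size=n)
theta_n = data.mean()                    # empirical estimator theta_hat_n

# Subsampling roots tau_b * (theta_hat_b - theta_hat_n), tau_b = sqrt(b).
roots = np.array([
    np.sqrt(b) * (rng.choice(data, size=b, replace=False).mean() - theta_n)
    for _ in range(trials)
])

# For the sample mean, the limiting law is N(0, sigma^2) with sigma = 1 here,
# so the standard deviation of the roots should be close to 1.
print(round(float(roots.std()), 2))
```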
3.4 Algorithms and Experiments
In this chapter, we limit ourselves to an empirical study of data block sizes for the RS estimator. We
replicate the experiments from "An Improved 1-norm SVM for Simultaneous Classification and Variable
Selection" [132] and report results for the 1-norm SVM and SVM2.
Method for choosing Weights (for Hybrid and RS) and Validation data (for RS). For the
Hybrid SVM, to find the optimal weights via the L2 penalty, we use the method described by [132]. We
first find the best SVM (or SVM2) model weights β(l2) with the L2 penalty via a parametric
search over costs C = {0.1, 0.5, 1, 2, 5, 10}. We then create the entire piecewise path for each weight setting
|β(l2)|^{−γ}, γ = {1, 2, 4}; choose the best-performing model on validation data; and report results on the test
set. How the training set for the RS method was chosen is described with each experiment.
Our RS experiments need validation data to choose the relevant features for each RS training set.
We do the following: if n is the size of the training set and we choose m of those examples for the current
RS training set, we use the left-out n − m examples as validation data for choosing the best features from
the piecewise paths generated by the L1 algorithm on the m examples. If a held-out validation set
was available, we used that instead.
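The RS weighting procedure above can be sketched as follows. The inner "fit the L1 path on the block, pick the best model on the left-out examples" step is stubbed out with a simple correlation screen, and all sizes (n, p, the block fraction) are illustrative rather than the thesis settings.

```python
# Sketch of RS weight construction (the L1 path fit is replaced by a stub).
import numpy as np

rng = np.random.default_rng(1)
n, p, block_frac = 300, 10, 0.2
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=n)  # features 0, 1 relevant

m = int(block_frac * n)                    # size of each RS training block
n_blocks = int(10 / block_frac)            # 10 / 0.2 = 50 randomized blocks

def select_features(Xb, yb, k=2):
    """Stand-in for: fit the L1 path on the block, validate on left-out rows."""
    corr = np.abs(Xb.T @ yb) / len(yb)
    return set(np.argsort(corr)[-k:])      # indices of the top-k features

V = np.zeros(p)                            # per-feature selection counts
for _ in range(n_blocks):
    idx = rng.choice(n, size=m, replace=False)   # one RS training block
    for j in select_features(X[idx], y[idx]):
        V[j] += 1

# Rarely selected features receive a large penalty weight (cf. Lemma 2).
W = 1.0 / np.maximum(V / n_blocks, 1e-3)
```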
3.4.1 Synthetic Datasets
We simulate two synthetic datasets, one akin to “orange data” described in [131] and another a
bernoulli distribution based dataset. The following notation is used for some of the tables: We use C and
IC to denote the mean number of correctly and incorrectly selected features, respectively. Also, we resort
to reporting to mean and std. deviation as the median of the incorrectly selected features was 0 for many
experiments. PPS stands for the probability of perfect selection, i.e the probability of only choosing the
correct feature set.
                     q=6, p=44    q=8, p=65    q=12, p=119   q=16, p=189   q=20, p=275
1-norm SVM2   IC     1.5±2.59     1.42±2.44    1.67±3.4      1.58±2.95     1.71±3.52
              PPS    0.554        0.544        0.536         0.564         0.592
Hybrid SVM2   IC     1.05±1.87    1.03±1.79    1.19±2.05     1.35±2.51     1.13±2.21
              PPS    0.596        0.598        0.554         0.576         0.596
RS(20%)       IC     0.65±1.15    0.62±1.15    0.8±1.48      0.61±1.17     0.54±1.04
              PPS    0.636        0.646        0.600         0.686         0.666
RS(30%)       IC     0.69±1.18    0.73±1.15    0.70±1.27     0.63±1.25     0.56±1.06
              PPS    0.626        0.604        0.626         0.666         0.662
RS(40%)       IC     0.62±1.05    0.61±1.01    0.68±1.29     0.66±1.25     0.55±1.02
              PPS    0.644        0.636        0.650         0.668         0.672
RS(50%)       IC     0.67±1.11    0.65±1.19    0.69±1.31     0.62±1.36     0.59±1.14
              PPS    0.628        0.628        0.630         0.670         0.670
*C (mean of correct features) = 2 for all experiments above.
Table 3.3: Variable Selection Results on Models 1 & 4 using SVM2. The best algorithms are in bold.
Exp. Name   Correlation     Bayes   2-norm   1-norm   Hybrid   RS(20%)   RS(30%)   RS(40%)
Model 2     ρ = 0           6.04     9.77    8.14     7.46     7.51      7.53      7.55
            ρ = 0.5         4.35     7.74    6.43     5.96     5.97      5.86      5.86
Model 3     ρ = 0           8.48    11.04    9.79     9.54     9.46      9.46      9.45
            ρ = 0.5         7.03     8.49    9.51     8.45     8.17      8.17      8.20
Model 5     ρ = 0.5^|i−j|   6.88    31.31    9.32     8.5      8.6       8.56      8.22
*The std. deviations for this table ranged from 1.02 to 1.96.
Table 3.4: Mean Error rates in % for Models 2, 3 & 5 using SVM2. The best algorithms are in bold.
Models 1 and 4 from [132]. The "orange data" has two classes, one inside the other like the core inside
the skin of an orange. The first class consists of two independent standard normals x_1 and x_2. The second class
also has two independent standard normals x_1 and x_2 but is conditioned on 4.5 ≤ x_1^2 + x_2^2 ≤ 8. To simulate
the effect of noise, there are q additional independent standard normals. The Bayes rule is
1 − 2·I(4.5 ≤ x_1^2 + x_2^2 ≤ 8), where I() is an indicator function, and the Bayes error is about 4%. Because
the decision boundary is not linear in the original space, we use an enlarged dictionary
D = {√2 x_j, √2 x_j x_k, x_j^2, j, k = 1, 2, . . . , 2 + q}. We use independent sets
of 100 validation examples and 20,000 test examples. q is set to 2, 4, 6, 8, 12, 16, 20, and we report on
500 experiments.
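Generating the orange data and the enlarged dictionary D can be sketched as follows (sample sizes here are illustrative, not the thesis settings). As a consistency check, for q = 2 the dictionary yields p = 4 + 6 + 4 = 14 features, matching Table 3.1.

```python
# Sketch of orange-data generation (Models 1 & 4) and the dictionary mapping.
import numpy as np

rng = np.random.default_rng(2)

def sample_orange(n_per_class, q):
    d = 2 + q
    # Class +1: unconstrained standard normals.
    pos = rng.normal(size=(n_per_class, d))
    # Class -1: standard normals accepted only when 4.5 <= x1^2 + x2^2 <= 8.
    neg = []
    while len(neg) < n_per_class:
        x = rng.normal(size=d)
        if 4.5 <= x[0] ** 2 + x[1] ** 2 <= 8.0:
            neg.append(x)
    X = np.vstack([pos, np.array(neg)])
    y = np.hstack([np.ones(n_per_class), -np.ones(n_per_class)])
    return X, y

def dictionary(X):
    """Map to D = {sqrt(2) x_j, sqrt(2) x_j x_k (j<k), x_j^2}."""
    d = X.shape[1]
    cross = [np.sqrt(2) * X[:, j] * X[:, k]
             for j in range(d) for k in range(j + 1, d)]
    return np.column_stack([np.sqrt(2) * X] + cross + [X ** 2])

X, y = sample_orange(50, q=2)
Phi = dictionary(X)          # for q=2: p = 4 + 6 + 4 = 14 features
```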
For the RS method, block sizes were set to 20%, 30%, and 40% of the total training size, and we
performed 10/(block fraction) experiments; e.g., for 20% we generated 10/0.2 = 50 randomized
training sets, each of size 0.2 × (total training data). The weight vector was created via the RS method
described earlier and then used to train the weighted 1-norm and SVM2 algorithms.
We report error rates in Table 3.1 and Table 3.2 for SVM2 and the 1-norm SVM, respectively. q denotes
the number of noise features in the original space, and p the number of features in the new space induced by
the dictionary D. The L2 version, in the 3rd column, shows increasing error rates as the number
of noisy features increases. The L1 version, in the 4th column, is much more robust to noise, and
its error rates barely degrade. The Hybrid SVM usually performs better than the unweighted 1-norm SVM
(except in a few cases in Table 3.2). For all block sizes, the RS method performs best. The
feature-selection ability of the individual algorithms is shown in Table 3.3 (the 1-norm SVM results were
omitted for space; they were similar to those of SVM2). The probability
of finding the best model is high for all the algorithms; the Hybrid SVM is better at this than the 1-norm SVM,
and the RS method performs best.
Models 2, 3, and 5 from [132]. Models 2, 3, and 5 are simulated from the model y ∼ Bernoulli{p(u)},
where p(u) = exp(x^T β + β_0 + ε)/(1 + exp(x^T β + β_0 + ε)) and ε is a standard normal representing
error. We create 100 training examples, 100 validation examples, and 20,000 test examples, and report on 500
randomized experiments.
Model 2 (sparse model): We set β_0 = 0 and β = {3, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 3}. The features x_1, . . . , x_12
are standard normals, and experiments are run with the correlation between x_i and x_j set to ρ = {0, 0.5}. The
Bayes rule is to assign class 2·I(x_1 + x_6 + x_12 > 0) − 1.
Model 3 (sparse model): We use β_0 = 1 and β = {3, 2, 0, 0, 0, 0, 0, 0, 0}. The features are
standard normals, and experiments are run with the correlation between x_i and x_j set to ρ = {0, 0.5}^|i−j|. The
Bayes rule is to assign class 2·I(3x_1 + 2x_2 + 1 > 0) − 1.
Model 5 (noisy features): We use β_0 = 1 and β = {3, 2, 0, 0, 0, 0, 0, 0, 0}. The features are
standard normals with correlation ρ = 0.5^|i−j|. We add 300 independent
normal variables as noise features, for a total of 309 features.
For Models 2, 3, and 5, error rates are reported in Table 3.4 for SVM2 (results for the 1-norm SVM were
similar and hence omitted). Note that the weighted models are consistently better than both of their 1-norm and
2-norm unweighted counterparts. The RS method has accuracy equal to or greater than the Hybrid version.
3.4.2 Real World Datasets
UCI datasets. In Table 3.5, we report results on the Spam, WDBC, and Ionosphere datasets from the UCI
repository [5]. For the WDBC and Ionosphere datasets, we split the data into 3 parts, with 2 parts used
for training (and validation) and the remaining part for testing. For the Spam dataset, indicators for the
test set (1536 examples) and training set can be obtained from http://www-stat.stanford.edu/~tibs/
ElemStatLearn/. For our RS method, we generated smaller datasets from the training set as follows: if
the training set size is N and the size of an individual RS set is K, then the number of datasets generated is
10 · N/K. We also show the size of the RS training set as Block in the table. For the Hybrid SVM, the best
parameters γ and C are chosen as described earlier. We report on 50 randomized experiments.
Table 3.5 shows error rates for both SVM2 and the 1-norm SVM. The use of weights, via the Hybrid or RS
method, always improves accuracy over the unweighted case. Also, as on both the synthetic and real-world
datasets, the RS block size introduces little variability in the results.
Dataset      Algorithm (Block)   Without Weighting   Randomized Sampling   Hybrid SVM     2-norm SVM
WDBC         1-norm (100)        3.66 ± 1.17         2.79 ± 0.93           2.89 ± 0.79    4.05 ± 1.36
             1-norm (150)        3.66 ± 1.17         2.79 ± 0.90           3.16 ± 1.22
             SVM2 (100)          3.55 ± 1.81         2.78 ± 1.03           2.73 ± 1.01
             SVM2 (150)          3.55 ± 1.81         2.90 ± 1.15           2.91 ± 1.13
SPAM         1-norm (200)        9.09 ± 0.878        8.18 ± 0.49           8.31 ± 0.61    7.06 ± 0.04
             1-norm (1000)       9.09 ± 0.878        7.53 ± 0.17           8.19 ± 0.73
             SVM2 (200)          8.45 ± 3.43         7.38 ± 0.52           7.39 ± 0.30
             SVM2 (1000)         8.45 ± 3.43         7.70 ± 2.73           7.48 ± 0.52
Ionosphere   1-norm (50)         12.38 ± 2.04        11.52 ± 1.39          11.84 ± 1.38   13.03 ± 2.86
             1-norm (75)         12.38 ± 2.04        11.25 ± 1.98          11.56 ± 1.73
             1-norm (100)        12.38 ± 2.04        11.29 ± 1.65          11.72 ± 1.23
             SVM2 (50)           12.69 ± 2.82        11.43 ± 2.52          11.21 ± 2.66
             SVM2 (75)           12.69 ± 2.82        11.61 ± 2.50          11.22 ± 2.58
             SVM2 (100)          12.69 ± 2.82        11.37 ± 2.67          11.26 ± 2.68
(Unweighted and 2-norm SVM values are shared across block sizes; they are repeated here for readability.)
Table 3.5: Mean ± Std. Deviation of Error Rates on Real world Datasets. The best performing L1 and L2
penalized algorithms are in bold.
Robotic Dataset. We now discuss a novel use of our subsampling method on robotic datasets [88].
These datasets were created by hand-labeling 100 images obtained from running the DARPA LAGR robot in
varied outdoor environments. The labeled classes are robot-traversable path and obstacle. The authors
provide pre-extracted color-histogram features for the dataset at [89]. We used a subset (12,000 examples)
of the available data for each of the 100 frames. Each example is 15-dimensional.
                   DS1A   DS2A   DS3A
Unweighted SVM2    8.92   4.36   1.24
Weighted SVM2      6.41   4.13   1.15
Table 3.6: Avg. Error rate on Robotic Datasets from [88]. The best algorithms are in bold.
Our experimental setup is as follows: for each frame F_i (i indexes the frame), we divide
the 12K obtained examples into 8 folds (9.6K examples) for training, 1 fold (1.2K examples) for
validation, and 1 fold (1.2K examples) for testing. We train/validate/test the unweighted SVM2 algorithm.
For the weighted experiment, we train via our RS method by dividing the training data into 10 subsets (each
of 960 examples) and finding the weight vector. This weight vector is then used to create the weighted SVM2
models, and we report results on the test set. Instead of discarding the weights when a new frame arrives, we
reuse the weights found in frame F_i in frame F_{i+1}; i.e., if the weights in frame F_i are denoted W_i, then

    W_{i+1} ← W_i + (weight results of RS for frame F_{i+1}).
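The frame-to-frame update above can be sketched as follows; the per-frame RS selection counts here are hypothetical stand-ins for the output of the 10-subset RS procedure.

```python
# Sketch of propagating RS feature weights across frames: W_{i+1} <- W_i + new.
import numpy as np

p = 15                                   # each example is 15-dimensional
W = np.zeros(p)                          # accumulated weights across frames

def rs_weights_for_frame(counts, n_subsets=10):
    """Stand-in: turn per-frame selection counts into penalty weights."""
    frac = counts / n_subsets
    return 1.0 / np.maximum(frac, 1e-3)  # rarely selected -> heavily penalized

# Hypothetical counts: features 0-4 selected often, features 5-14 rarely.
for counts in [np.array([10] * 5 + [1] * 10),   # frame F_1
               np.array([9] * 5 + [0] * 10)]:   # frame F_2
    W = W + rs_weights_for_frame(counts)        # W_{i+1} <- W_i + new weights
```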
This is one experimental environment in which creating L2 models on the entire data is not feasible,
and the RS estimator is a potential approach. Propagating feature importance between frames is a further
advantage of the RS estimator. Table 3.6 shows overall results for the 100 frames on 3 datasets, repeated
10 times; the weights for the weighted SVM2 are propagated between frames. As shown, error rates drop by
5–28% for the weighted SVM2 relative to the unweighted SVM2. The overhead of
computing the weights via RS was less than 10% of the cost of computing a model on the entire training set.
3.5 Conclusions and Future work
A random sampling framework is presented and empirically shown to provide effective feature weights
for the lasso penalty, increasing both model accuracy and feature-selection accuracy. The proposed
framework is at least as effective as (and at times more effective than) the Hybrid SVM, with the added benefit
of significantly lower computational cost. In addition, unlike the Hybrid SVM, which must see all the data at
once, Random Sampling is shown to be effective in an on-line setting where predictions must be made based
on only partially available data (as with the robotics data). In this chapter the framework is
demonstrated on two types of linear classification algorithms, and theoretical support is presented showing
its applicability, in general, to sparse algorithms.
Chapter 4
Iterative Training and Feature Selection Using L1 Penalized Linear Models
4.1 Introduction
L1 penalized linear models are extremely popular. These models use an L1 (or lasso) penalty in
conjunction with a linear or quadratic loss function. Their popularity stems from the fact that the resulting
model implicitly performs feature selection: the weights of irrelevant features become small or zero [42, 112,
131].
Recent studies [130, 133] have shown that, in the presence of correlated but irrelevant features, the
L1 penalty may select features inconsistently; inconsistency here means that the L1 penalty selects both
relevant features and correlated but irrelevant ones. Zou [133] showed that an alternative, the adaptive
lasso penalty, works in cases in which the standard L1 penalty is inconsistent. Adaptive lasso penalized
algorithms use user-specified weights to penalize individual features, allowing consistent feature selection.
The adaptive lasso is a single-step algorithm that uses weights derived from an L2 penalized or unpenalized
algorithm. Adaptive lasso penalized algorithms have been proposed for both classification [132] and regression
[133]. We explain this algorithm in more detail shortly.
The adaptive lasso algorithm has been extended to operate over multiple iterations [24]. At each
iteration, this algorithm, the multi-step adaptive lasso (hereafter MSA), uses weights
derived from an L1 penalized solution to bias the model for the next iteration. The accuracy results for the MSA
are not promising [24].
For both the adaptive lasso and its multi-step variant, the MSA, weights play a central role in determining the sparsity and predictive abilities of the final model. In this study, we propose an alternative to
the MSA, which we call the subsampling/bootstrap adaptive lasso (SBA-LASSO), that uses weights obtained
from subsampled datasets instead of weights obtained from an L1 or L2 penalized algorithm. Our results
show that when used with both artificial and real world datasets, the SBA-LASSO significantly outperforms
existing single and multiple step algorithms in terms of accuracy for both regression and classification.
This chapter is organized as follows: Section 4.1.1 presents the details of various single and multiple
step algorithms proposed for L1 penalized algorithms. Section 4.2 presents the details of our proposed
subsampling/bootstrap adaptive lasso (SBA-LASSO). Section 4.3 presents the details of our experimental
setup for various algorithms. Section 4.4 presents experimental results for both artificial and real world
datasets.
4.1.1 Existing Single and Multiple Step Estimation Algorithms for L1 Penalized
Algorithms
We discuss existing single and multiple step algorithms proposed for L1 penalized algorithms. These
algorithms were developed as a means of reducing the number of irrelevant features incorrectly detected by
a standard L1 penalized algorithm.
To remind the reader, the general form of an L1 penalized algorithm is

    min_β L(X, y, β) + λ Σ_{j=1}^p |β_j|,   λ ≥ 0,        (4.1)
where L(X, y, β) is a loss function, X is a matrix of size n × p and y is a vector of size n representing
the training data {(xi, yi), i = 1, . . . , n, xi ∈ Rp, yi ∈ R}; β is the model parameter vector of size p that is
estimated for a given value of the regularization parameter λ.
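As a concrete instance of Equation 4.1, the sketch below solves the squared-error case (the lasso) by proximal gradient descent (ISTA). This generic solver and its parameter values are our illustration, not the specific optimizer used in the thesis.

```python
# Minimal ISTA solver for min_beta 0.5*||y - X beta||^2 + lam*||beta||_1.
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    beta = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2     # 1/L, L = Lipschitz constant
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)            # gradient of the smooth part
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Illustrative data: only the first two features are relevant.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 8))
beta_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0])
y = X @ beta_true + 0.1 * rng.normal(size=200)
beta = lasso_ista(X, y, lam=20.0)              # irrelevant weights shrink to 0
```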
4.1.1.1 Bootstrap Lasso
Bootstrap lasso [6] begins with the creation of multiple bootstrap datasets. For each of these datasets,
we find the L1 penalized model with the lowest error on validation data and obtain the list of features selected
by the model, i.e., features whose β coefficients are nonzero. The final bootstrap lasso model is trained using
an L1 penalized algorithm, but only on those features that were selected by more than a given percentage of
the L1 models. We use the notation τ for this threshold-selection parameter.
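The thresholding step of bootstrap lasso can be sketched as follows; the selection counts here are hypothetical, standing in for tallies over bootstrap L1 fits.

```python
# Sketch of the bootstrap lasso selection rule: keep features selected by
# more than a fraction tau of the bootstrap L1 models.
import numpy as np

def bootstrap_lasso_support(counts, n_boot, tau):
    """Indices whose selection frequency exceeds the threshold tau."""
    return np.flatnonzero(counts / n_boot > tau)

# Hypothetical counts over 100 bootstrap L1 fits for 6 features.
counts = np.array([97, 88, 41, 5, 2, 63])
support = bootstrap_lasso_support(counts, n_boot=100, tau=0.5)
# The final model is an L1 fit restricted to the columns in `support`.
```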
The estimation algorithms we describe next use a weighted form of the lasso penalty.
4.1.1.2 Adaptive Lasso
Adaptive Lasso Penalty. Zou [132, 133] proposed an adaptive lasso penalized formulation:

    min_β L(X, y, β) + λ Σ_{j=1}^p W_j |β_j|,        (4.2)
where W is a user-specified vector of size p, and the other variables are as before: L(X, y, β) is a once- or
twice-differentiable convex loss function, X is a matrix of size n × p and y is a vector of size n representing
the training data {(x_i, y_i), i = 1, . . . , n, x_i ∈ R^p, y_i ∈ R}, and β is the model parameter vector of size p that
is estimated for a given value of the regularization parameter λ.
The idea behind the adaptive lasso is to set W high for irrelevant features and low for relevant features
so that irrelevant features are penalized more than relevant features. This weighting allows the adaptive
lasso penalty to reduce the effects of irrelevant features further than the standard lasso penalty. Instead of
setting the weights W by hand, they may be obtained using an L2 penalized algorithm for classification and
an unpenalized algorithm for regression. We describe this approach next.
Adaptive Lasso Regression and Classification. Zou [133] proposed that the adaptive lasso penalty
could be used to regularize least squares as follows:

    min_β ||y − Xβ||^2 + λ Σ_{j=1}^p W_j |β_j|,   s.t.   W_j = |β(OLS)_j|^{−γ}, γ > 0,        (4.3)

where X is a matrix of size n × p and y is a vector of target values of size n, together representing the
training data {(x_i, y_i), i = 1, . . . , n, x_i ∈ R^p, y_i ∈ R}; β(OLS) denotes the model parameter vector found using
ordinary least squares (OLS) regression; and β is the model parameter vector of size p that is estimated for
a given value of the regularization parameter λ. The resulting formulation is known as adaptive lasso
regression.
Zou [132] applied the same idea to support vector machine (SVM) classification; the resulting
formulation, called the Improved 1-norm SVM or Hybrid SVM, is defined as:

    min_{β,β_0} Σ_{i=1}^n [1 − y_i(x_i β + β_0)]_+ + λ Σ_{j=1}^p W_j |β_j|,   s.t.   W_j = |β(l2)_j|^{−γ}, γ > 0,        (4.4)

    β(l2) = argmin_{β,β_0} Σ_{i=1}^n [1 − y_i(x_i β + β_0)]_+ + λ_1 Σ_{j=1}^p ||β_j||_2^2,

where [v]_+ = max(v, 0); λ and λ_1 are regularization parameters; {(x_i, y_i), i = 1, . . . , n, x_i ∈ R^p, y_i ∈ R}
is the training data; β(l2) denotes the model parameter vector found using the L2 penalized SVM; and β (the
model parameter vector of size p) and the offset β_0 are estimated for a given value of the regularization
parameter λ. Note the parameter γ in the exponent of the formula for the weights; it modulates
the scaling of the weights.
Henceforth, we will refer to both the adaptive lasso penalized least squares regression and the SVM
classification as adaptive lasso algorithms. Note that although the adaptive lasso penalty was proposed for
the L1-SVM and lasso regression, it can also be used for other loss functions.
For adaptive lasso, the weights of the relevant features are expected to be small (as the corresponding
L2 or unpenalized estimates are expected to be large), whereas the weights of irrelevant features are expected
to be large (as the corresponding L2 or unpenalized estimates are expected to be small). Thus, adaptive
lasso penalizes irrelevant features more than relevant features and in so doing is able to reduce the effects of
irrelevant features further than the standard lasso penalty.
The adaptive lasso has multiple free parameters, some associated with the L2 penalty as well as λ and
γ. Instead of performing a grid search over all parameters, Zou [132, 133] proposed a hierarchical approach
using validation data to select the best L2 penalized classification model or unpenalized regression model
and corresponding weights; then using an adaptive lasso to perform a grid search over λ and γ given the
selected weights. For our experiments, we adopted Zou’s method.
Next, we discuss an algorithm that is like adaptive lasso, but instead of using weights derived from
an L2 penalized algorithm, it uses weights derived from an L1 penalized algorithm.
4.1.1.3 Multi Step Estimation Using Multi-Step Adaptive Lasso (MSA)
The multi-step adaptive lasso (MSA) algorithm [24], summarized in Algorithm 2, converts the single-step
adaptive lasso into an iterative algorithm.
In the MSA, the weights derived at step k are used to bias the model at step k + 1. The MSA alternates
between estimating β (from the previous weights) and the weights (from the previous β estimates). The weights
used in the MSA are derived from an L1 penalized algorithm, whereas the weights used in the adaptive lasso
algorithm are derived from an L2 penalized (or unpenalized) algorithm.
Algorithm 2: Multi-Step Adaptive Lasso (MSA) Algorithm

Input: n training examples with p features, represented by input matrix X with target values/labels y;
validation data N_val; M steps are performed.
Output: Best model β(m) (chosen using validation data) from the candidate models β(k), k = 1, 2, ..., M.

Step 1. Initialize the weights W(0)_j = 1 for j = 1, 2, ..., p.
Step 2. For k = 1, 2, ..., M:
    Use the adaptive lasso penalty in conjunction with a loss function:

        min_{β(k)} L(X, y, β(k)) + λ Σ_j W(k−1)_j |β(k)_j|.        (4.5)

    At the kth step, let λ(k)* be the optimal regularization parameter and β(k) the model
    parameter vector found using the validation data N_val.
    Update the weights W(k)_j (the weights at the kth step) as

        W(k)_j = 1 / |β(k)_j|.        (4.6)

    β(k)_j is the model parameter value obtained for the jth feature at the optimal regularization
    parameter λ(k)* at the kth step.
In this algorithm, the number of features monotonically decreases or remains the same as the
iterations proceed, because only features with a non-zero β value in the previous step are considered in
subsequent steps; and because the weights are derived from an L1 penalized algorithm, the model parameter
vector β is expected to be sparse.
When the number of iterations k is 2, the MSA is identical to the adaptive lasso algorithm except
that it uses weights derived from an L1 penalized algorithm rather than an L2 penalized (or
unpenalized) one. Buhlmann and Meier [24] noted that results for MSAs with more than k = 3 steps
were not promising; when k > 3, models trained with the MSA are sparser and less accurate than models with
k = 3 steps. Note that once the β estimate for a feature is set to zero, the feature is eliminated and not
considered in subsequent steps of the MSA.
Instead of performing a parametric search over both λ and γ as in the adaptive lasso, Buhlmann and Meier
[24] limited their experiments to a parametric search over the regularization parameter λ, setting γ = 1. As
shown in Equations 4.3 and 4.4, γ is used to tune the adaptive lasso. In Section 4.3.1, we briefly
discuss how γ can be used to tune the MSA.
Finally, we briefly describe an algorithm that is similar to the MSA but is used in the domain of
signal recovery. One of our goals is to explore whether modifications to the MSA allow for higher prediction
accuracy.
Similarity of the MSA to an Algorithm in Compressed Sensing / Signal Recovery. An issue with
the MSA, noted by Buhlmann and Meier [24], is that as the number of steps increases, the generated models
become sparser and less accurate than earlier ones. When the MSA was proposed, a similar algorithm
was concurrently suggested by Candes et al. [27], in which sparsity is controlled by including a small positive
offset ε > 0: Candes et al. [27] used W(k)_j = 1/(|β(k)_j| + ε) instead of W(k)_j = 1/|β(k)_j| in
Equation 4.6. Due to ε, a zero-valued β estimate for a feature does not remove the feature from
consideration in subsequent steps.
Candes et al. [27] designed their algorithm for compressed sensing, and the value of ε is
based on the characteristics of data in that domain. We briefly digress to describe the use of L1
penalized algorithms in compressed sensing: they are used to recover relevant features in
a high-dimensional setting in which the number of features ≫ the number of examples (an underdetermined
linear system); in contrast to real-world machine learning datasets, data obtained in compressed sensing
admit a unique sparse solution to the underdetermined linear system [26].
In the algorithm proposed by Candes et al. [27], the weights are updated to W(k)_j = 1/(|β(k)_j| + ε),
where ε is set as follows: sort |β| in descending order, find the φth term, φ = n/(4 log(p/n)), and set
ε = max{|β|_φ, 10^{−3}}; φ is linked to the anticipated accuracy of L1 penalty minimization for a sparse signal.
Only when p > n (more features than examples) is log(p/n) positive and φ a positive value. An
argument against using this formula for ε in machine learning problems is that |β|_φ is grounded in compressed
sensing [27, 40], in which the signal-to-noise ratio can be estimated; |β|_φ cannot be estimated for real-world
machine learning datasets. We experimented with ε to determine whether we could prevent
the MSA from producing very sparse models with low accuracy.
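The ε-based update of Candes et al. can be sketched as follows; the β vector below is a hypothetical example, and the φ rule follows the description above (with the floor at 10^{−3}).

```python
# Sketch of the reweighting update W_j = 1/(|beta_j| + eps) of Candes et al.
import numpy as np

def candes_weights(beta, n, p):
    mags = np.sort(np.abs(beta))[::-1]        # |beta| in descending order
    phi = int(n / (4 * np.log(p / n)))        # meaningful only when p > n
    eps = max(mags[min(phi, len(mags) - 1)], 1e-3)
    return 1.0 / (np.abs(beta) + eps)         # zero betas stay in play

beta = np.array([2.5, 0.8, 0.0, 0.0, 0.05])   # hypothetical estimates
W = candes_weights(beta, n=100, p=500)        # p >> n, as in compressed sensing
```

Note that all entries of W are finite, so a feature with a zero coefficient is penalized heavily but not eliminated, unlike in the MSA update of Equation 4.6.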
We will describe experiments with the different variations of the MSA in Section 4.4.1.2. First,
however, we propose a novel multiple step estimation algorithm.
4.2 Subsampling/Bootstrap Adaptive Lasso Algorithm (SBA-LASSO)
To motivate our new algorithm, we contrast the approach taken by the adaptive lasso and the MSA with that
taken by the bootstrap lasso. Both the adaptive lasso and the MSA directly use weights that are inversely proportional
to the β estimates obtained from an L1 or L2 penalized algorithm. In contrast, the bootstrap lasso creates a
final model with only the features that are selected in more than a certain proportion of L1 models trained on
bootstrapped datasets. The bootstrap lasso applies a threshold to the selection counts of the features
and ignores the β estimates.
We propose the subsampling/bootstrap adaptive lasso algorithm (SBA-LASSO) to combine the idea
of the bootstrap lasso and the weighting used by the adaptive lasso and the MSA. The algorithm is specified
as Algorithm 3; for brevity, we refer to the SBA-LASSO algorithm as SBA. Like the MSA, the SBA is a
multiple step algorithm that uses weights. Unlike the MSA and the adaptive lasso, in which the weights are
inversely proportional to the β estimates obtained from an L1 or an L2 algorithm, the weights in the SBA are
inversely proportional to the number of times features are selected by an L1 algorithm trained on (multiple)
subsampled or bootstrapped datasets; the choice of subsampling or bootstrap is left to the user. The goal
of the SBA is to discourage penalization of features that are often selected by L1 penalized models trained
Algorithm 3: Subsampling/Bootstrap Adaptive Lasso (SBA-LASSO) Algorithm

Input: n training examples with p features, represented by input matrix X and target values/labels y;
t controls the number of subsampled or bootstrapped datasets created; b controls the size of
subsampled datasets; M is the number of steps performed; and validation data N_val are available.
Output: Best model β(m) (chosen using validation data) from the candidate models β(k), k = 1, 2, ..., M.

Step 1. Initialize the weights W(0)_j = 1 for j = 1, 2, ..., p.
Step 2. For k = 1, 2, ..., M, update the weights W(k) (the weights at the kth step) as follows:
    For subsampling, sample b examples without replacement from the n training examples, t times,
    to create t datasets. For bootstrap, sample n examples with replacement from the n training
    examples, t times, to create t datasets. Denote these datasets N_trn_i, i = 1 . . . t.
    V ← 0
    For i = 1 . . . t:
        Model_i is trained on N_trn_i using an adaptive lasso penalized algorithm and the
        weights W(k−1) from the previous step; that is, estimate the optimal λ_{N_trn_i} and the
        corresponding β_{N_trn_i} estimates using validation data N_val via

            min_{β_{N_trn_i}} L(X_{N_trn_i}, y_{N_trn_i}, β_{N_trn_i}) + λ Σ_j W(k−1)_j |(β_j)_{N_trn_i}|.        (4.7)

        Let S_i = {j | (β_j)_{N_trn_i} ≠ 0} (the features selected in Model_i).
        V ← V + {x ∈ R^p | x_j = 1 if j ∈ S_i else x_j = 0}.

        V ← V / max(V, 1).        (4.8)

    Update the weights W(k)_j (the weights at the kth step) as follows:

        W(k)_j ← 1 / V_j.        (4.9)

    Estimate the optimal λ, γ and the corresponding β(k) (using validation data N_val) for

        min_{β(k)} L(X, y, β(k)) + λ Σ_{j=1}^p (W(k)_j)^γ |β(k)_j|.        (4.10)
on subsampled/bootstrapped datasets and to encourage the penalization of features that are infrequently
selected.
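One SBA weighting step (the inner loop of Algorithm 3, with all previous weights equal to 1) can be sketched as follows. The subsample fits use a bare-bones lasso solver, and all sizes and parameter values are illustrative.

```python
# Sketch of one SBA weighting step: count selections over subsampled L1 fits,
# normalize (Eq. 4.8), and invert to get penalty weights (Eq. 4.9).
import numpy as np

def lasso(X, y, lam, n_iter=300):
    beta = np.zeros(X.shape[1])
    step = 1.0 / max(np.linalg.norm(X, 2) ** 2, 1e-12)
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - y))
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return beta

def sba_step(X, y, lam, t=20, b=60, rng=None):
    rng = np.random.default_rng(11) if rng is None else rng
    n, p = X.shape
    V = np.zeros(p)
    for _ in range(t):                        # t subsampled datasets of size b
        idx = rng.choice(n, size=b, replace=False)
        V += lasso(X[idx], y[idx], lam) != 0  # tally selected features
    V = V / max(V.max(), 1.0)                 # normalize selection counts
    return 1.0 / np.maximum(V, 1e-3)          # W_j = 1/V_j

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 8))
y = X @ np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0]) + 0.2 * rng.normal(size=300)
W = sba_step(X, y, lam=5.0)                   # feeds the final fit (Eq. 4.10)
```

Frequently selected features end up with weight 1 (marginal penalization), while rarely selected features receive large weights.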
A single step version of the SBA-LASSO algorithm was proposed earlier in Chapter 3 and by Jaiantilal
and Grudic [68]. We extend the algorithm to multiple steps, thereby combining properties of various models
that have already been studied in the literature.
The SBA adaptively determines the threshold above which features should not be penalized. In
the SBA, features that are frequently selected in L1 penalized models trained using multiple subsampled
and bootstrapped datasets will have marginal or no penalization. This behavior is similar to that of the
bootstrap lasso, in which the final model is created by selecting only features that are selected by more
than a certain proportion of multiple L1 penalized models trained on bootstrapped datasets. Both the SBA
and the bootstrap lasso ignore feature magnitude |β|. In the SBA, features that are occasionally selected in
L1 penalized models trained using multiple subsampled and bootstrapped datasets are penalized in inverse
proportion to the number of times they were selected. In contrast, the degree of penalization in the adaptive
lasso and the MSA depends on weights derived using an L1 or L2 penalized algorithm, and all features are
penalized to a certain degree.
4.3 Methodology
In this section, we present the methodology for comparing our proposed algorithm against algorithms
in the literature. We discuss the determination of parameters of each algorithm and our evaluation metrics.
Our experimental setup is complicated because we have to consider different parameters that are used
by the MSA, the SBA, the adaptive lasso, and the bootstrap lasso, all of which augment an L1 penalized
algorithm; furthermore, an L1 penalized algorithm has its own regularization parameter (λ). In order to find
the optimal model, we have to jointly optimize over all parameters.
4.3.1 Tuning Parameter and Parametric Search in MSA
In order to put the MSA on equal footing with adaptive lasso and the SBA, we modify the MSA
objective function (Equation 4.5) to incorporate a tuning parameter γ:
min_{β^(k)} L(X, y, β^(k)) + λ Σ_j (W_j^(k-1))^γ |β_j^(k)|,   γ > 0,        (4.11)
and perform a parametric grid search over both λ and γ within each step of MSA. In Section 4.4.1.2 (page 63),
we show that results improve when the tuning parameter is incorporated into the MSA.
We briefly summarize the different variations of the MSA, and the parametric search in the variations,
as follows:
(1) In Section 4.1.1.3, we discussed the MSA in which the weights are set to W_j^(k) = 1/|β_j^(k-1)|. We label this variation untuned MSA. In an untuned MSA, at each step, we perform a search over the regularization parameter λ.
(2) In Section 4.1.1.3, we also discussed a related method proposed by Candes et al. [27] that introduced the ε parameter, with W_j^(k) = 1/(|β_j^(k-1)| + ε); we label this variation epsilon MSA. In an epsilon MSA, at each step, we perform a search over the regularization parameter λ. The variable φ is used in the calculation of ε; for artificial datasets, we calculated φ = n/4 log(p/n), where p is the number of features and n is the number of examples in the training data.
(3) In the current section, we introduced a tuning step for the MSA via γ as follows: W_j^(k) = 1/|β_j^(k-1)|^γ; we label this variation tuned MSA. In a tuned MSA, at each step, we perform a grid search over (λ, γ), where λ is the regularization parameter.
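The three weight updates can be written compactly as follows. This is an illustrative sketch: gamma and eps are placeholder values here, whereas the thesis selects γ by grid search on validation data and computes ε via φ for the epsilon MSA.

```python
import numpy as np

def msa_weights(beta_prev, variant="tuned", gamma=1.0, eps=1e-3):
    """Weight updates for the three MSA variants discussed above."""
    b = np.abs(beta_prev)
    if variant == "untuned":              # W_j = 1 / |beta_j|
        return 1.0 / np.maximum(b, 1e-12)
    if variant == "epsilon":              # W_j = 1 / (|beta_j| + eps), Candes et al.
        return 1.0 / (b + eps)
    if variant == "tuned":                # W_j = 1 / |beta_j|^gamma
        return 1.0 / np.maximum(b, 1e-12) ** gamma
    raise ValueError(f"unknown variant: {variant}")
```

Note that in the epsilon variant a feature with a zero coefficient still receives a finite weight (1/ε), so it is never removed outright; this matters for the feature selection comparison in Section 4.4.1.2.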
4.3.2 Parametric Search in SBA, Adaptive Lasso, and Bootstrap Lasso
Now we discuss the parameters associated with the SBA, the adaptive lasso, and the bootstrap lasso.
The parametric search in the various algorithms is performed as follows:
• The regularization parameter λ is dependent on the solver used for the L1 penalized algorithm. We note the values of λ in Section 4.3.3.

• The parameter γ is used to tune the weights. For the artificial datasets used in our experiments, we parametrically searched over γ = {1, 2, 4}; for real world datasets, we considered γ = {0.05, 0.5, 1, 2, 4}.
• For the single step bootstrap lasso, we created 100 bootstrapped datasets, and for each bootstrapped
dataset, we trained an L1 model with an optimal regularization parameter value (λ) picked using valida-
tion data. We trained the final model (which uses an L1 penalized algorithm) using only those features
that were selected by more than a given percentage of L1 models trained on bootstrapped datasets; the
final model was also trained with an optimal regularization parameter (λ) picked using validation data.
For the sake of convenience, we refer to the parameter τ controlling the percentage as the ‘threshold value
for the bootstrap lasso’.
For the bootstrap lasso, Bach [6] reported results using τ = 90%; for the datasets used in this study,
the bootstrap lasso initially did not perform competitively with the other algorithms due to the high
threshold value, τ = 90%. For our experiments, we considered τ = {40%, 50%, 60%, 70%, 80%, 90%}.
The parameters for the final model were picked using validation data and a grid search over (τ ,λ).
• In each step of Subsampling SBA, we considered φ = {10%, 25%, 40%, 50%}, where φ is the percentage of
examples sampled from the original dataset to create a subsampled dataset; we considered 100 subsampled
datasets for each φ value. Within each step of Subsampling SBA, we performed a grid search over (φ,γ,λ),
using validation data to pick the parameters for the final model.
• In each step of Bootstrap SBA, we created 100 bootstrapped datasets using bootstrap sampling with
replacement. Within each step of Bootstrap SBA, we performed a grid search over (γ,λ).
• As suggested by Zou [132, 133], for the adaptive lasso, we parametrically found the weights using the
best model selected by an L2 penalized algorithm and used those weights to perform a grid search over
(γ,λ).
• The parameter list for the MSA is complicated due to the multiple variations of MSAs. In Section 4.4.1.2,
we discuss how we chose the best variation of the MSA. More information about the parameters used for
the MSA is provided in that section.
• The MSA and the SBA are multiple step algorithms, and thus we have to consider a stopping criterion.
We limit the total number of steps to 50. After termination, we chose the model with the best validation
performance.
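The bootstrap lasso's selection step described above can be sketched as follows. This is an illustration only: it uses scikit-learn's Lasso with a fixed λ, whereas the procedure above tunes λ per bootstrapped dataset on validation data and then refits a final L1 model on the retained features.

```python
import numpy as np
from sklearn.linear_model import Lasso

def bootstrap_lasso_support(X, y, tau=0.6, n_boot=100, lam=0.1, rng=None):
    """Keep features selected by more than a fraction tau of L1 models
    trained on bootstrapped datasets (tau plays the role of the threshold)."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.choice(n, size=n, replace=True)    # bootstrap: n with replacement
        counts += Lasso(alpha=lam, max_iter=5000).fit(X[idx], y[idx]).coef_ != 0
    keep = counts / n_boot > tau
    return keep   # the final L1 model would be refit on X[:, keep]
```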
4.3.3 L1 Penalized Algorithms and Their Parameters
In this section, we describe the four linear L1 penalized algorithms used in our experiments, three
of which are used for binary classification and one of which is used for regression. We omit the weighted
formulations of the algorithms. We also mention the solvers used for both L1 penalized and L2 penalized algorithms and list the values from which individual solvers choose the best regularization parameters.
Some of the solvers use regularization parameter C (∝ 1/λ) instead of λ.
Note that the solver used for a standard L1 penalized algorithm can be adapted for a weighted L1
penalized algorithm by appropriately scaling the input matrix before it is passed to the solver, as described
by Zou [132, 133].
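This scaling trick can be illustrated as follows (a sketch using scikit-learn's Lasso as the generic L1 solver, not the solvers listed below): dividing column j by w_j turns the standard penalty on the rescaled coefficients into the weighted penalty on the original ones, and the fitted coefficients are rescaled back.

```python
import numpy as np
from sklearn.linear_model import Lasso

def weighted_lasso(X, y, w, lam=0.05):
    """Solve min L(X, y, beta) + lam * sum_j w_j * |beta_j| with a standard
    L1 solver. With beta~_j = w_j * beta_j we have X beta = (X / w) beta~ and
    sum_j |beta~_j| = sum_j w_j |beta_j|, so a plain lasso fit on X / w works."""
    Xs = X / w                                     # column j scaled by 1 / w_j
    beta_tilde = Lasso(alpha=lam, max_iter=5000).fit(Xs, y).coef_
    return beta_tilde / w                          # undo the change of variables
```

With all weights equal to one this reduces exactly to the standard lasso, which is a convenient sanity check.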
L1-SVM. The formulation of the L1-SVM algorithm [131] with the hinge loss function regularized using
the L1 penalty is defined as
min_{β,β0} Σ_{i=1}^n [1 − y_i(x_iβ + β0)]_+ + λ Σ_{j=1}^p |β_j|,   v_+ = max(v, 0),        (4.12)
where {(xi, yi), i = 1, . . . , n, xi ∈ Rp, yi ∈ R} is the training data; β is the model parameter vector of size p
and β0 is an offset value that are estimated for a given value of the regularization parameter λ.
Fung and Mangasarian [51] offer a solver for the L1-SVM through a software package called the LP
Support Vector Machine (LPSVM). LPSVM uses two parameters (ν and δ) for regularization; we parametrically searched over ν = 2^{−12,−11,...,11,12} and δ = 10^{−3,−2,...,2,3}. We also experimented on artificial datasets
with the L1-SVM solver proposed by Zhu et al. [131] which provides the entire regularization path for λ,
and obtained results similar to those obtained by LPSVM; we found the solver to be computationally slower
than LPSVM and thus we use LPSVM for our experiments.
The L2 penalized SVM algorithm is solved using LIBSVM [28]. We considered the following range of values for the regularization parameter: C = {1e-2, 1e-1, 0.5, 1, 2, 10, 1e2, 1e4, 1e5}.

Footnote: In an L1 penalized algorithm, each feature in the input matrix is scaled to mean zero and standard deviation of one before being passed to the solver; we refer to this matrix as a scaled input matrix. To implement an adaptive L1 penalized algorithm using the solver for an L1 penalized algorithm, each feature in the scaled input matrix is divided by its corresponding weight value before passing it to the solver.
L1-SVM2. The SVM algorithm can also be formulated using a squared hinge loss function regularized
with the L1 penalty as follows:
min_{β,β0} Σ_{i=1}^n [1 − y_i(x_iβ + β0)]_+^2 + λ Σ_{j=1}^p |β_j|,   v_+ = max(v, 0),        (4.13)
where {(xi, yi), i = 1, . . . , n, xi ∈ Rp, yi ∈ R} is the training data; β is the model parameter vector of size p
and β0 is an offset value that are estimated for a given value of the regularization parameter λ.
In the literature, this SVM is sometimes referred to as L2-SVM [29], as it uses a squared hinge loss
function, an L2 loss function, with the L2 penalty function. In order to be consistent with our terminology
which uses the Lp or lasso penalty as the prefix in an algorithm’s name, we refer to Equation 4.13 as L1-SVM2;
the suffix SVM2 denotes a squared hinge loss function.
The software we used that calculates the entire piecewise path for L1-SVM2 is based on the algorithmic
framework described by Rosset and Zhu [97]. Optimization routines are described in Appendix A.2. The
optimization of a squared hinge loss function is easier than that of a hinge loss function because a squared
hinge loss function is differentiable everywhere [29].
The L2 penalized SVM2 algorithm is solved using the LIBLINEAR [45, 77] software package. We considered the following range of values for the regularization parameter: C = 2^{−10,−9,...,9,10}.
Lasso Regression. The formulation for the lasso regression is
min_β Σ_{i=1}^n (y_i − x_iβ)^2 + λ Σ_{j=1}^p |β_j|,        (4.14)
where {(xi, yi), i = 1, . . . , n, xi ∈ Rp, yi ∈ R} is the training data and β is the model parameter vector of size
p that is estimated for a given value of the regularization parameter λ. The software we used to calculate
the entire piecewise solution path (0 ≤ λ ≤ ∞) for lasso regression is based on LARS [42, 106].
In performing adaptive lasso regression, Zou [133] considered only weights derived from estimates
made through the unpenalized ordinary least squares method; for our experiments, we also considered ridge
regression (L2 penalized least squares) estimates. We considered the following range of values for the ridge/L2
penalty regularization parameter: λ={0, 1e-2, 1e-1, 0.5, 1, 2, 10, 1e2, 1e3, 1e4, 1e5}.
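As a point of reference, scikit-learn's lars_path computes the same kind of piecewise-linear lasso path; the thesis uses its own LARS-based software [42, 106], so this is only an illustrative stand-in.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.1, size=50)

# alphas: breakpoints (knots) of the piecewise-linear path, in decreasing order;
# coefs[:, k] is the full coefficient vector at regularization level alphas[k].
alphas, active, coefs = lars_path(X, y, method="lasso")
print(coefs.shape)  # (n_features, number of knots on the path)
```

The path starts at the all-zero solution (the largest λ) and adds or drops one feature per knot, which is what makes computing the entire solution path cheap.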
L1 Logistic Regression. The formulation for the L1 logistic regression is defined as
min_{β,β0} l(β, β0) + λ Σ_{j=1}^p |β_j|,   G = {+1, −1},        (4.15)

P(G = y_i | X = x_i) = exp(y_i(β0 + x_i^T β)) / (1 + exp(y_i(β0 + x_i^T β))),   l(β, β0) = −Σ_{i=1}^n log P(G = y_i | X = x_i),
where {(xi, yi), i = 1, . . . , n, xi ∈ Rp, yi ∈ R} is the training data; β is the model parameter vector of size
p and β0 is an offset value that are estimated for a given value of the regularization parameter λ. The
LIBLINEAR [45] software package is used to estimate β at fixed values of the regularization parameter for
both L1 and L2 penalties. We considered the following range of values for the regularization parameters:
C ={0.1, 1, 1e1, 1e2, 1e3, 1e4, 1e5}.
4.3.4 Evaluation Metrics
In this section, we discuss the evaluation metrics used to compare and evaluate the performance of
algorithms.
Classification Error Percentage. For classification based datasets, we calculate the classification error percentage as 100 · (1/n) Σ_{i=1}^n I(ŷ_i ≠ y_i), for n examples, where y_i is the target label for the ith example, ŷ_i is the predicted label for the ith example, and I(·) is an indicator function.
Regression Relative Squared Error Percentage. For regression based datasets, we calculate the relative squared error percentage as

Relative squared error percentage = [Σ_{i=1}^n (ŷ_i − y_i)^2 / Σ_{i=1}^n (y_i − ȳ)^2] · 100, for n examples,        (4.16)

where y_i is the target label for the ith example, ŷ_i is the predicted label for the ith example, and ȳ is the mean target label.
Instead of using the mean squared error for regression problems, which has no intrinsic interpretation
and is therefore difficult to compare across datasets, we use the relative squared error percentage. The
relative squared error (as a ratio, before multiplying by 100) is greater than one if the algorithm fails to utilize the features to make predictions, i.e., if it is worse than an algorithm that makes a constant prediction of ȳ.
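Both metrics are straightforward to compute; a minimal sketch:

```python
import numpy as np

def classification_error_pct(y, y_hat):
    """100 * (1/n) * sum_i I(y_hat_i != y_i)."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return 100.0 * np.mean(y_hat != y)

def relative_squared_error_pct(y, y_hat):
    """100 * sum_i (y_hat_i - y_i)^2 / sum_i (y_i - mean(y))^2  (Equation 4.16)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return 100.0 * np.sum((y_hat - y) ** 2) / np.sum((y - y.mean()) ** 2)
```

A value above 100 for the relative squared error percentage means the model is worse than always predicting the mean target.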
Comparison of Multiple Algorithms Using Significance testing. When comparing two or more
algorithms, we can calculate the average error across datasets. However, errors are not commensurate over
different datasets, and a large variation on a single dataset may be enough to give one of the algorithms an
unfair advantage over the others.
An alternate method is to use a paired Student’s t-test or the Wilcoxon signed-rank test [124] to
compare the significance of differences in error between individual algorithms, but such tests are overly optimistic when multiple algorithms are compared, as they do not account for the fact that some comparisons may be rejected due to random chance alone.
To compare multiple algorithms, Demsar [35] suggested the Friedman test, which ranks algorithms’
performance using error results for individual datasets, from best to worst. If after performing the Friedman
test, the hypothesis that the algorithms are similar is rejected, one can use a post-hoc test, such as the
Bergmann-Hommel test, which adjusts the significance value based on the number of classifiers present; such
an adjustment takes into account the fact that many pairwise comparisons may be rejected due to random
chance. García and Herrera [52] provide software that lists the pairwise hypotheses that are rejected based
on the results of the Bergmann-Hommel test.
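The omnibus Friedman test itself is available in SciPy; the Bergmann-Hommel post-hoc adjustment is not, so this sketch (with made-up error numbers, purely for illustration) computes only the omnibus test and the mean ranks.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# rows: datasets, columns: algorithms (percent error); illustrative numbers only
errors = np.array([
    [ 9.3, 12.8, 13.6, 13.7, 14.3],
    [ 8.8, 10.5, 15.3, 13.0, 14.4],
    [11.9, 13.3, 14.4, 14.0, 14.6],
    [ 8.0,  7.3, 12.3, 10.8, 11.4],
])

# The Friedman test ranks algorithms within each dataset, then compares mean ranks.
stat, p = friedmanchisquare(*errors.T)
mean_ranks = errors.argsort(axis=1).argsort(axis=1).mean(axis=0) + 1
print(p, mean_ranks)  # reject "all algorithms perform alike" if p is small
```

If the Friedman test rejects, one would then run a post-hoc procedure (e.g. Bergmann-Hommel, via the software of García and Herrera [52]) to decide which pairwise differences survive the multiple-comparison correction.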
Note that the comparison of multiple algorithms via the Friedman test and post-hoc testing has
disadvantages when the differences between algorithms are not significant. Assuming that the errors from the algorithms are obtained through a cross validation technique such as 10 fold cross validation [4, 25, 38], the ranking may be misleading if the differences across algorithms on a dataset arise from random chance alone, as when all algorithms applied to a given dataset obtain the same or similar final models.
4.4 Results
We present results for both artificial datasets and real world datasets. We compare various forms
of feature selections and various prediction models, formed by the Cartesian product: {variations of the
MSA, variations of the SBA, adaptive lasso, bootstrap lasso, standard L1 penalized algorithm}×{L1-SVM,
L1-SVM2, lasso regression, L1 logistic regression}.
4.4.1 Artificial Datasets
4.4.1.1 Description
The artificial datasets we used were originally used by [24, 64]. These datasets are designed to simulate a variety of conditions, including correlation and non-correlation between relevant and irrelevant features and the existence or nonexistence of groups within the relevant features.
We generate regression datasets from the following linear model
y = Xβ + ε,   ε ∼ N(0, σ^2).        (4.17)
Classification datasets with two classes are created from the model via y ∼ Bernoulli{p(u)}, where p(u) = exp(x^Tβ)/(1 + exp(x^Tβ)), p is the number of features in the model, and n is the number of
examples. The artificial datasets are generated by choosing a value for the β vector and then sampling from
Equation 4.17. In the following presentation, N(x,m) denotes a Gaussian random variable with mean x and
standard deviation m. The specifics for the different artificial datasets are as follows.
Artificial Dataset-1. We set p = 200 and σ = 1.5. The first 15 features are uncorrelated with the remaining 185 features. The pairwise correlation between the first 15 features x_1, ..., x_15 is ρ_{ij} = 0.5^{|i−j|}. The pairwise correlation between the remaining 185 features x_16, ..., x_200 is also ρ_{ij} = 0.5^{|i−j|}. The coefficients of the linear model were set to β_1, ..., β_5 = 2.5, β_6, ..., β_10 = 1.5, and β_11, ..., β_15 = 0.5.
Artificial Dataset-1 simulates the case in which p > n and correlation exists among groups of relevant
features and groups of irrelevant features, but groups of relevant and irrelevant features are uncorrelated
with each other.
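As an illustration, Artificial Dataset-1 can be generated with NumPy as follows; this is a sketch in which the block sizes, coefficients, and the 0.5^{|i−j|} correlation follow the description above, realized through a Cholesky factor of each block's covariance matrix.

```python
import numpy as np

def make_dataset1(n=100, p=200, rho=0.5, sigma=1.5, rng=None):
    """Artificial Dataset-1: two feature blocks (15 and p-15 features), each
    with pairwise correlation rho^|i-j|, uncorrelated across blocks."""
    rng = np.random.default_rng(rng)
    X = np.empty((n, p))
    for start, size in [(0, 15), (15, p - 15)]:
        idx = np.arange(size)
        cov = rho ** np.abs(idx[:, None] - idx[None, :])   # AR(1)-style correlation
        L = np.linalg.cholesky(cov)
        X[:, start:start + size] = rng.normal(size=(n, size)) @ L.T
    beta = np.zeros(p)
    beta[0:5], beta[5:10], beta[10:15] = 2.5, 1.5, 0.5
    y = X @ beta + rng.normal(scale=sigma, size=n)          # Equation 4.17
    return X, y, beta
```

The other artificial datasets differ only in the correlation structure, the noise level σ, and the placement of nonzero coefficients, so they can be generated with small variations on this sketch.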
Artificial Dataset-2. Same as Artificial Dataset-1, but with higher correlation—ρ = 0.95.
Artificial Dataset-3. The artificial dataset is generated, for p = 200 and σ = 1.5, as follows:

    x_i = Z_1 + e_i,  Z_1 ∼ N(0, 1),  i = 1, ..., 5,
    x_i = Z_2 + e_i,  Z_2 ∼ N(0, 1),  i = 6, ..., 10,
    x_i = Z_3 + e_i,  Z_3 ∼ N(0, 1),  i = 11, ..., 15,
    x_i ∼ N(0, 1), uncorrelated,  i = 16, ..., 200,
    e_i uncorrelated N(0, 0.01),  i = 1, ..., 15.

The coefficients of the linear model were set to β_1, ..., β_15 = 1.5, and the remaining coefficients are zero.
Artificial Dataset-3 simulates the case in which p > n and group structure exists among relevant features.
Artificial Dataset-4. Same as Artificial Dataset-1, but with p = 400 (lower signal to noise ratio).
Artificial Dataset-5. Same as Artificial Dataset-2, but with p = 400 (lower signal to noise ratio).
Artificial Dataset-6. Same as Artificial Dataset-3, but with p = 400 (lower signal to noise ratio).
Artificial Dataset-7. For p = 200 and σ = 1.5, we set the correlation among the 200 features x_1, ..., x_200 to ρ_{ij} = 0.5^{|i−j|}. The coefficients were set to β_1, ..., β_5 = 2.5, β_11, ..., β_15 = 1.5, and β_21, ..., β_25 = 0.5. Artificial Dataset-7 is similar to Artificial Dataset-1 but has correlation among the relevant and irrelevant features.
Artificial Dataset-8. Same as Artificial Dataset-7, but with higher correlation—ρ = 0.95.
Artificial Dataset-9. For p = 200 and σ = 0, we set the correlation among the 200 features x_1, ..., x_200 to ρ_{ij} = 0.5^{|i−j|}. The coefficients of the linear model were set to β_1, ..., β_25 = 0.6. We eliminated the estimation noise variable ε in y = Xβ + ε and lowered the signal-to-noise ratio in comparison to Artificial Dataset-7.
For each of the nine artificial datasets, we generated 50 instances of each dataset, each consisting of
100 training examples, 1000 validation examples, and 20000 test examples. We present results which are the
mean over these 50 instances to reduce sampling variance.
Instead of presenting all results at once, we distill them to reduce the complexity of our analysis.
4.4.1.2 Distilling the results of MSA and SBA
Our model exploration is complicated by the existence of multiple variations of both the MSA and
the SBA. A direct comparison of all algorithms that includes the many variants of MSA and SBA may lead
to a complicated picture. Therefore, we first compare the different variants of MSA and SBA in isolation to
choose a single variant that we will put into competition with alternative models. We acknowledge that the
fact that MSA and SBA have many variants provides them with a greater opportunity to appear superior
to alternative models if the same test data is used both to select a variant of MSA (or SBA) and to compare
MSA (or SBA) against alternatives. Therefore, we use the artificial datasets to select the variants of MSA
and SBA that minimize squared error, and then compare the chosen variants of MSA and SBA against
alternatives on real world data sets.
Percent Error Results for Variations of MSA. In Section 4.3.1, we discussed the different variations
of MSA. Note: for our experiments, we also consider the single and multiple step versions of each variation
of the MSA.
In Figure 4.1, we show statistical results for the variations of the MSA and condensed over all four
L1 penalized algorithms; these are based on percent error results obtained for the nine artificial datasets
through Friedman ranks and the Bergmann-Hommel procedure [52]. For the sake of brevity, we omit results
for individual L1 penalized algorithms and refer to the single and multiple step algorithms as ‘single’ and
‘multiple’ respectively. Graphically, we arrange the algorithms from best to worst performing (top to bottom)
according to their average Friedman ranks across all datasets and L1 penalized algorithms, and we connect all
the algorithms whose pairwise hypotheses cannot be rejected (p = 0.05 according to the Bergmann-Hommel
procedure). Thus, algorithms that are significantly different from one another are not connected. We also
note the average percent error in parentheses next to the algorithm name.
Based on Figure 4.1, we can make two main conclusions: (a) for all algorithms (tuned, untuned, and
epsilon MSAs), performing multiple steps significantly decreases the percent error compared to a single step,
and (b) among the studied variations of the MSA, a tuned MSA with multiple steps is the best performing
and has a significantly higher Friedman rank than the rest of the variations.
[Figure 4.1 reports the following average percent errors: Tuned + Multiple (11.44%), Epsilon + Multiple (11.88%), Untuned + Multiple (11.73%), Tuned + Single (12.58%), Epsilon + Single (13.09%), Untuned + Single (12.89%).]

Figure 4.1: Graphical representation of Friedman ranks and the Bergmann-Hommel procedure, for variations of the MSA, using percent error results obtained from nine artificial dataset models × four different L1 penalized algorithms. We consider tuned MSA (using γ), epsilon MSA (using ε), and untuned MSA (using neither ε nor γ), with results from both Single and Multiple step estimations; for brevity, we do not refer to the algorithms by their full names. The algorithms are arranged top (best performing) to bottom (worst performing) according to their Friedman test ranks. Algorithms that are not significantly different from each other (p = 0.05 according to the Bergmann-Hommel procedure) are connected. Average percent error is noted in parentheses next to the algorithm name.
                       # features correctly detected    # features incorrectly detected
Tuned + Multiple       9.59 (4.15)                      5.04 (4.99)
Epsilon + Multiple     10.19 (4.28)                     10.83 (10.55)
Untuned + Multiple     9.94 (4.13)                      6.83 (5.45)
one-way ANOVA          p = 0.9448                       p = 0.0106

Table 4.1: Average number of features discovered correctly and incorrectly by the MSA over four different L1 penalized algorithms. We considered tuned MSA (using γ), epsilon MSA (using ε), and untuned MSA (using neither ε nor γ) and their results from using multiple steps; for brevity, we do not refer to the algorithms by their full names. We performed two one-way ANOVAs on the following two hypotheses: (1) the average number of features discovered correctly by variations of the MSA is the same, and (2) the average number of features discovered incorrectly by variations of the MSA is the same.
Difference in the Feature Selection Results Between the Algorithm of Candes and Wakin [26]
and MSA. In Table 4.1, we briefly examine the feature selection results obtained for tuned, epsilon, and
untuned MSAs. Because the linear models that generate the artificial datasets are known, we can calculate
the average number of features correctly and incorrectly selected in the best models of tuned, epsilon, and
untuned MSA.
To establish the statistical significance of our results in Table 4.1, we propose two null hypotheses:
first, that the average number of features correctly detected by the different algorithms is the same, and
second, that the average number of features incorrectly detected by the different algorithms is the same. A
one-way analysis of variance (ANOVA) of the average number of features correctly detected over the various
artificial datasets was insignificant, and thus we cannot reject the first null hypothesis. In contrast, a one-way
analysis of variance of the average number of features incorrectly detected was found to be significant. We
can thus reject the second null hypothesis: that is, at least one of the algorithms had a mean number of incorrectly detected features significantly different from the rest of the algorithms. Compared to the tuned and untuned MSA, a larger
number of irrelevant features were detected by epsilon MSA; this phenomenon can be explained by the fact
that features are not explicitly removed from the model as ε > 0, even when the estimated model parameter
value for a feature is zero.
Based on the results presented here, we focus all subsequent evaluations of MSA using only the tuned
MSA with multiple steps.
Determining the best algorithm among variations of SBA. The SBA can use weights derived
from either bootstrap or subsampling. Likewise, the SBA can be used as a single or multiple step algorithm.
In Figure 4.2, we show statistical results, for variations of the SBA condensed over all four L1 penalized
algorithms, using percent error results obtained for nine artificial datasets via the Friedman test and the
Bergmann-Hommel procedure [52]. For the sake of brevity, we do not provide results for individual L1
penalized algorithms.
Based on Figure 4.2, we reach two main conclusions: (a) for both Subsampling SBA and Bootstrap
SBA, performing multiple steps rather than a single step significantly reduces the percent error, and (b)
among the variations of the SBA, Subsampling SBA with multiple steps is the best performing algorithm
and has a significantly higher Friedman rank than the rest of the algorithms. Based on these results with artificial datasets, we report all subsequent results for SBA using the variant with subsampling and multiple steps.

[Figure 4.2 reports the following average percent errors: SBA(Subsampling) + Multiple (9.53%), SBA(Bootstrap) + Multiple (11.16%), SBA(Subsampling) + Single (11.81%), SBA(Bootstrap) + Single (13.11%).]

Figure 4.2: Graphical representation of Friedman ranks and the Bergmann-Hommel procedure, for variations of the SBA, using percent error results obtained from nine artificial dataset models × four different L1 penalized algorithms. The algorithms are arranged top (best performing) to bottom (worst performing) according to their Friedman test ranks. Algorithms that are not significantly different from each other (p = 0.05 according to the Bergmann-Hommel procedure) are connected. Average percent error is noted in parentheses after the algorithm name. Algorithms with multiple steps are referred to as Multiple and algorithms with a single step are referred to as Single.
4.4.1.3 Overall Results
Now that we have chosen the best performing SBAs and MSAs, we can compare these algorithms
to the bootstrap lasso and the adaptive lasso. As a baseline for comparison, we also include the results
obtained from a standard L1 penalized algorithm. In Figure 4.3, we present statistical results based on
percent error results, using the Friedman test and the Bergmann-Hommel procedure [52] for nine artificial
datasets segmented over four L1 penalized algorithms.
Based on Figure 4.3, we can draw the following main conclusion: the SBA is the best performing
algorithm for all four L1 penalized algorithms, whereas the MSA is the second best performing algorithm.
Note that even though the SBA always has a rank of one for both L1 logistic regression and L1-SVM2, the
Bergmann-Hommel procedure does not reject the hypothesis that the SBA is significantly different from the
MSA. When we consider results across all L1 penalized algorithms, the SBA’s performance is significantly
better than that of the MSA and the rest of the algorithms according to the Bergmann-Hommel procedure;
we do not show this result.

[Figure 4.3 reports the following average percent errors per L1 penalized algorithm:
(a) L1-SVM: SBA (9.32%), MSA (12.76%), Bootstrap (13.58%), Adaptive (13.66%), Standard (14.28%).
(b) L1 Logistic Regression: SBA (8.84%), MSA (10.51%), Adaptive (13.04%), Standard (14.44%), Bootstrap (15.32%).
(c) L1-SVM2: SBA (11.93%), MSA (13.28%), Adaptive (13.97%), Bootstrap (14.37%), Standard (14.58%).
(d) Lasso Regression: MSA (7.26%), SBA (8.01%), Adaptive (10.75%), Standard (11.40%), Bootstrap (12.25%).]

Figure 4.3: Graphical representation of Friedman ranks and the Bergmann-Hommel procedure, for different L1 penalized algorithms, using percent error results obtained from nine artificial dataset models. The algorithms are arranged top (best performing) to bottom (worst performing) according to their Friedman test ranks. Algorithms that are not significantly different from each other (p = 0.05 according to the Bergmann-Hommel procedure) are connected. Average percent error is noted in parentheses after the algorithm name.
Our primary criterion for model selection is percent error: we showed that for artificial datasets,
models obtained using the SBA have significantly lower average percent error (reflected in higher Friedman
ranks) than other algorithms. In Appendix B.3, we present additional results for the artificial datasets
pertaining to accuracy of feature selection.
4.4.2 Results for Real World Datasets
4.4.2.1 Description
We could not find existing studies that performed an extensive comparison of the adaptive lasso,
the bootstrap lasso, and the MSA using real world datasets. Even published results for these individual
algorithms are sparse and usually limited to a few mostly domain-specific real world datasets; for example,
Buhlmann and Meier [24] presented results for a gene dataset and Zou [132] presented results for three real
world datasets (spambase, WDBC, and ionosphere). In this study, we hope to provide an extensive list of
datasets and comparisons among various algorithms.
We note the datasets used for our experiments in Table 4.2; a detailed description is presented in
Appendix B.1 (page 122). All of the datasets are publicly available at UCI [5], LIBSVM [28], or KEEL [2]. The datasets contain between 186 and 22,784 examples and between 6 and 300 features. Our only criterion for choosing among datasets was to include those that contained at least a few hundred examples.
This ensured that enough validation data was available for all algorithms.
Many of the datasets had fewer than 35 input features and are thus not ideal candidates for feature
selection. To show the value of feature selection even for these low-dimensional data sets, we augmented
the feature vector with second-order features, i.e., features defined to be the product of all pairs of simple
features. Such augmentation yields the potential for a significant improvement in the performance of a
linear model, but this improvement will be highly dependent on feature selection to prevent overfitting.
Ideally, L1 penalized algorithms should be able to train models that suppress irrelevant features from the
expanded dictionary without suffering a reduction in accuracy. In Appendix B.4 (page 130), we show that
Footnotes: UCI: http://archive.ics.uci.edu/ml/index.html. LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. KEEL: http://sci2s.ugr.es/keel/datasets.php. Irrelevant features: features irrelevant to an L1 penalized algorithm may contain essential nonlinear information not detectable by the algorithm.
Classification

#   Dataset                        Class %   # examples   # features   # features originally   Source
1   a1a+                           75/25     1605         119          119                     [5, 28, 86]
2   australian*                    56/44     690          119          14                      [28]
3   breast-cancer                  65/35     683          65           10                      [5, 28]
4   german.numer*                  70/30     1000         324          24                      [28]
5   heart*                         56/44     270          104          13                      [28]
6   hillvalley (noisy)             50/50     1212         100          100                     [5]
7   ionosphere                     36/64     351          629          34                      [5, 28]
8   Chess (kr vs kp) dataset       52/48     3196         36           36                      [5]
9   BUPA liver-disorders           58/42     345          27           6                       [5, 28]
10  pimadiabetes                   35/65     768          44           8                       [5, 28]
11  spambase                       61/39     4597         57           57                      [2, 5]
12  splice+                        48/52     1000         60           60                      [1, 5, 28]
13  svmguide3                      76/24     1243         275          22                      [28, 63]
14  w1a∆                           97/3      5000         300          300                     [28, 86]
15  breast-cancer (diagnostic)Σ    63/37     569          495          30                      [5]

Regression

#   Dataset                          # examples   # features   # features originally   Source
1   cadata (houses.zip)              20640        44           8                       [28, 120]
2   compactiv                        8192         252          21                      [1, 2]
3   diabetes                         442          65           10                      [42]
4   housing                          506          104          13                      [5, 28]
5   mortgage                         1049         135          15                      [2]
6   Auto-mpg                         392          35           7                       [5, 28]
7   Pole telecommunications          14998        377          26                      [2]
8   triazines%                       186          60           60                      [5, 28]
9   census (house-price-16H)∇        22784        152          16                      [1, 2]
10  Concrete compressive strength    1030         44           8                       [2, 5]

Table 4.2: List of real world datasets used in our experiments. There are 10 regression datasets and 15 classification datasets containing 186-22,784 examples and 6-300 features. We included second order features for datasets with ≤ 35 original features. *These datasets were formerly available at StatLog and are currently available at LIBSVM-DATASETS. +We used only the training examples specified at LIBSVM-DATASETS for our experiments. #We combine both training and test data. %The triazines dataset is part of "Qualitative Structure Activity Relationships" at UCI [5]. ∆We subsampled 5000 examples from the training+test data available. ∇The target value is long-tailed and thus is transformed via log2(value+1). ΣAlso known as WDBC. LIBSVM-DATASETS = http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
by augmenting the first-order features with second-order features, the error of L1 models can be significantly
reduced.
4.4.2.2 Setup
We performed 10-fold cross validation, internally using six folds for training and three folds as validation data (various parameters were selected using the validation data); for our experiments with real world datasets, we ensured that class proportions in each fold were kept the same.
We use 10 regression datasets with lasso regression and 15 classification datasets with L1-SVM2. We
ran into issues with the LIBLINEAR and LPSVM solvers for some of the datasets, and thus we omit results
from L1 logistic regression and L1-SVM.
4.4.2.3 Results and Analysis
[Figure 4.4 plots Friedman ranks 1–5. (a) L1-SVM2: MSA (14.70%/1.41%), SBA (13.46%/9.75%), Bootstrap (15.02%/1.40%), Adaptive (14.87%/2.18%), Standard (15.18%/0.00%). (b) Lasso Regression: MSA (32.43%/0.65%), SBA (30.90%/4.07%), Bootstrap (31.52%/-5.75%), Adaptive (32.10%/0.68%), Standard (32.44%/0.00%).]
Figure 4.4: Graphical representation of Friedman ranks and results of the Bergmann-Hommel procedure for L1-SVM and LARS using percent error results obtained using real world datasets. The algorithms are arranged top (best performing) to bottom (worst performing) according to their Friedman test ranks. Algorithms that are insignificantly different from each other (p=0.05 according to the Bergmann-Hommel procedure) are connected. In parentheses following the algorithm name is the average percent error as well as the average relative reduction in percent error over an L1 penalized algorithm.
Overall Results. In Figure 4.4, we show ranking results obtained for the real world data sets for lasso
regression and L1-SVM2. These results are based on the percent error for each feature-selection technique,
ranked according to the Friedman test and the Bergmann-Hommel procedure [52]. For each algorithm, we
also show (in parentheses) the average error across datasets and the percentage reduction in error relative
to the standard L1-SVM2 (Figure 4.4a) and standard lasso regression (Figure 4.4b).
From Figure 4.4(a), we can draw three main conclusions about the L1-SVM2: (a) the SBA outperforms
other algorithms, and according to the Bergmann-Hommel procedure, the difference is significant; (b) if we
consider the average reduction in percent error across individual datasets, the SBA had the highest relative
reduction in percent error (9.75%) over the standard L1-SVM2; and (c) if we consider the average percent
error across individual datasets, the SBA had the lowest average percent error (13.46%).
From Figure 4.4(b), we can draw three main conclusions about lasso regression: (a) the SBA outper-
forms both bootstrap lasso and standard lasso regression, and according to the Bergmann-Hommel procedure,
the difference is significant, but the SBA is not statistically different from the adaptive lasso and the MSA;
(b) if we consider the average reduction in percent error across individual datasets, of all the algorithms,
SBA had the highest relative reduction in percent error (4.07%) over standard lasso regression; and (c) if we
consider the average percent error across individual datasets, the SBA had the lowest percent error (30.90%)
of all the algorithms.
From Figure 4.4, we can conclude that the SBA outperforms the rest of the algorithms on all of the
three following metrics: Friedman rank, average reduction in error, and average error. The numerical tables
of the results for both real world and artificial datasets are presented in Appendix B.2 (page 125).
Is Subsampling SBA with Multiple Steps Still the Best Algorithm Among Variations of
SBA? In Figure 4.5, we briefly examine the statistical results obtained from variations of the SBA using
real world datasets. As shown in the figure, Subsampling SBA with multiple steps is the best performing
algorithm, but its performance was not significantly different from those of the rest of the algorithms, at
least for the L1-SVM2. Note that Subsampling SBA with multiple steps has the highest rank and lowest
average percent error of all the variations of the SBA, using both L1-SVM2 and lasso regression.
We omit detailed presentation of results for the MSA, but the results are similar to those for the
SBA: the tuning MSA variant beats out the alternatives on the real-world datasets as it did on the artificial
datasets.
Feature Selection Results. In Figure 4.6, we plot the number of features selected by LARS and L1-
SVM2. We arrange the algorithms left to right, sorted according to increasing percent error. For L1-SVM2,
we can draw the following conclusion: MSA and bootstrap lasso produce models that are sparser than SBA,
whereas adaptive lasso and the standard L1 penalized algorithm produce models that are denser than SBA.
For LARS, we can draw the following conclusion: the MSA produces the sparsest model, whereas the number
of features in the SBA, adaptive lasso, and bootstrap lasso were in the same range but lower than that of
the standard lasso regression algorithm.
Computational Cost. The computational costs of the various algorithms are not equivalent. The
computational cost of the SBA is an order of magnitude higher than that of the MSA algorithm because of
the parametric search over multiple subsampled or bootstrapped datasets. The time taken to perform all
of our experiments was about five days on a dual CPU computer; parallel programming was used for the
grid search over the parameters. Given the speed of modern machines and the ample availability of cycles,
reduction in test error and interpretability of results (via feature sparsity) seem to be more important criteria
than training time. Further, recent methods using coordinate descent [49] may reduce the computation time.
4.5 Conclusions and Future Work
Our empirical results support the following hypothesis: we can significantly improve prediction in
regression and classification problems using a multiple step estimation algorithm instead of a single step
estimation algorithm. We propose a multiple step algorithm, the SBA-LASSO, that uses weights derived
from either bootstrapped or subsampled datasets. We compare the SBA-LASSO with existing algorithms
including the adaptive lasso, the bootstrap lasso and the MSA-LASSO for different L1 penalized methods
(on both classification and regression problems). Subsampling SBA-LASSO with multiple steps is shown to
[Figure 4.5 plots Friedman ranks 1–4. (a) L1-SVM2: SBA(Bootstrap) + Single (14.32%/0.84%), SBA(Subsampling) + Single (14.00%/1.42%), SBA(Bootstrap) + Multiple (14.46%/0.00%), SBA(Subsampling) + Multiple (13.46%/5.15%). (b) Lasso Regression: SBA(Bootstrap) + Single (31.81%/0.00%), SBA(Subsampling) + Single (31.89%/3.53%), SBA(Bootstrap) + Multiple (31.49%/5.92%), SBA(Subsampling) + Multiple (30.90%/5.87%).]
Figure 4.5: Graphical representation of Friedman ranks and the Bergmann-Hommel procedure, for variations of SBA, using percent error results obtained using real world datasets. The algorithms are arranged top (best performing) to bottom (worst performing) according to their Friedman test ranks. Algorithms that are insignificantly different from each other (p=0.05 according to the Bergmann-Hommel procedure) are connected. In parentheses after the algorithm name, we note (average percent error / average relative reduction in percent error over the worst ranked algorithm).
perform significantly better than existing algorithms both in ranking and in terms of reduction in percent
error, using both artificial and real world datasets.
[Figure 4.6: bar plots of the average number of features. (a) L1-SVM2 (y-axis 0–40), algorithms left to right: SBA, Adaptive, MSA, Bootstrap, Standard. (b) Lasso Regression (y-axis 0–100), algorithms left to right: SBA, MSA, Adaptive, Bootstrap, Standard.]
Figure 4.6: Overall number of features discovered for real world datasets. Algorithms are arranged left to right based on increasing Friedman ranks (obtained from Figure 4.4); the best algorithm is on the left and the worst is on the right. In the plots, we show the average number of features and the standard deviation. Normalization of the standard deviation was performed as suggested by Masson and Loftus [80].
The theoretical justification for Subsampling SBA and other multiple step algorithms is an open
research question. We hypothesize that the Subsampling SBA is able to adaptively find the threshold after
which a feature should not be penalized in the adaptive lasso penalty. We further conjecture that MSA-
LASSO and adaptive lasso may be improved by using a squashing function on the weights, which will push
them toward binary values and therefore make the algorithms behave more like SBA-LASSO.
Chapter 5
Iterative Training and Feature Selection using Random Forest
5.1 Introduction
The random forest algorithm is popular and widely used in machine learning [69, 99, 128]. A random
forest is an ensemble of bagged decision trees, whose aggregated results are used for prediction. A random
forest can be considered an implicit feature selecting algorithm due to the use of decision trees.
Random forests can also perform explicit feature selection by using a feature importance measure in conjunction
with a backward elimination technique [37, 104]. A feature importance measure provides a way to rank
features in terms of their relevance to prediction. In previous studies [19, 111], feature importance values
were calculated by training a single model and evaluating the increase in error for out-of-bag (OOB) examples
by selectively permuting values for a feature in those examples; out-of-bag examples are examples that are
not used for training individual trees. Logically, if a feature is relevant to prediction, permuting a feature
will cause an increase in the OOB error rate, whereas if a feature is irrelevant, permuting the feature will
not increase the OOB error rate.
In this study, we explore a range of schemes for explicit feature selection using random forest. Existing
approaches start with the full complement of features and iteratively train models by removing features that
are deemed to be of least importance. The decision about how many features to remove is based on an
evaluation of held-out validation data or OOB data.
A key component of explicit feature selection is the feature importance measure. For our study, we
consider existing feature importance measures proposed by Breiman [19] and Strobl et al. [111]; we also propose a new feature importance measure based on the performance of multiple retrained models, which we evaluate using artificial datasets. We show that our proposed feature importance measure yields both higher accuracy
and greater sparsity than existing feature importance measures, though at a greater computational cost.
In previous work, features were considered for inclusion in individual decision trees in an all-or-none
fashion. In contrast, we propose a strategy in which continuous weights are assigned to individual features;
features with higher weights have a greater probability than those with lower weights of being used to train
a model. We perform multiple iterations of model training to iteratively reweight features such that the least
useful features eventually obtain a weight of zero. Weights are derived from feature importance values; in
iterated reweighting of random forest models, features with high feature importance values have a greater
probability than features with low feature importance values of being used to train a model.
This chapter is organized as follows. Section 5.2 presents the details of a set of artificial dataset
models that present a challenge to existing random forest feature importance measures. Section 5.3 presents
the details of our proposed feature importance measure. Section 5.4 presents an overview of our approach
to iterated reweighting of random forest models. Section 5.5 presents an overview of our experimental
methodology. In Section 5.6, we present our results.
5.1.1 Existing Feature Importance Measures for Random Forest
We describe feature importance measures for random forests that were previously proposed by Breiman
[19, 21] and Strobl et al. [111].
Breiman’s Feature Importance Measure. Before discussing the details of Breiman’s feature impor-
tance measure, we must explain a detail of random forests. In a random forest model, each tree is trained
using a bootstrap dataset. Examples in the original dataset that are not used to train a tree are known as
out-of-bag or OOB examples. OOB examples can be used to calculate an error rate for a random forest
model; this error rate is called the OOB error rate. This error rate can be used as a type of validation error.
Breiman [19, 21] proposed using OOB error rates to calculate feature importance values as follows: (1)
a single random forest model is trained; (2) the OOB error rate is calculated; (3) for an individual feature,
the values of the feature in the OOB data are permuted, and the OOB error rate is recalculated; and (4) the
increase in the OOB error rate due to permutation is the feature importance value of a particular feature.
Mathematically, Breiman’s feature importance for classification tasks is calculated as follows:
feature importance for featurem =
1
N
N∑i=1
1
Noob
(ntree∑k=1
I[(xi, yi) /∈ datasetk] [I(yi 6= yi)]−ntree∑k=1
I[(xi, yi) /∈ datasetk] [I(yi 6= yi)]
), (5.1)
and Breiman’s feature importance for regression tasks is calculated as
feature importance for featurem =
1
N
N∑i=1
1
Noob
(ntree∑k=1
I[(xi, yi) /∈ datasetk](yi − yi)2 −ntree∑k=1
I[(xi, yi) /∈ datasetk](yi − yi)2
), (5.2)
where Noob =∑ntreek=1 I[(xi, yi) /∈ datasetk] is the number of trees in which example i is OOB, ntree
bootstrapped datasets represented by datasetk are used to train ntree trees, N is the number of examples,
I() is an indicator function, x is an input, y is a target, and y is a prediction before permuting feature m
whereas y is a prediction after permuting feature m.
Breiman’s feature importance measure may be normalized by dividing the importance values obtained
from the equations above by the standard error of the feature value. However, a study by Strobl and Zeileis
[110] suggests that unnormalized feature importance values have better statistical properties. Thus, in this
study, we restrict our experiments to unnormalized feature importance values.
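To make the permutation logic concrete, the following minimal sketch captures the spirit of Equations 5.1 and 5.2 for the regression case; it collapses the per-tree OOB bookkeeping into a single OOB set, and `permutation_importance`, `predict`, and `oob_mask` are illustrative names of our own, not part of any library API.

```python
import numpy as np

def permutation_importance(predict, X, y, oob_mask, n_repeats=5, seed=0):
    """Permutation importance in the spirit of Breiman's measure:
    importance of feature m = average increase in OOB squared error
    when the values of feature m are permuted among the OOB examples."""
    rng = np.random.default_rng(seed)
    Xo, yo = X[oob_mask], y[oob_mask]
    base_err = np.mean((predict(Xo) - yo) ** 2)   # OOB error before permuting
    imp = np.zeros(X.shape[1])
    for m in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = Xo.copy()
            Xp[:, m] = rng.permutation(Xp[:, m])  # permute feature m only
            imp[m] += np.mean((predict(Xp) - yo) ** 2) - base_err
        imp[m] /= n_repeats
    return imp

# Toy check: the target depends only on feature 0, so permuting feature 0
# should raise the error while permuting features 1 and 2 should not.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0]
imp = permutation_importance(lambda Z: 2.0 * Z[:, 0], X, y,
                             oob_mask=np.ones(200, dtype=bool))
```

In a real forest, `predict` would be the aggregate of only those trees for which each example is OOB, which is exactly the bookkeeping that the indicator terms in Equation 5.1 perform.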
Strobl’s Feature Importance Measure. Strobl et al. [111] argued that Breiman’s feature importance
measure is biased and prone to give correlated features inflated feature importance values. They propose a new feature importance measure, which we refer to as the Strobl importance, that avoids this bias. Rather
than explain the details of their argument, we focus on differences in the calculation of Strobl’s and Breiman’s
feature importance measures.
The main difference between Strobl’s and Breiman’s feature importance measures is how the OOB
examples are permuted. To calculate Breiman’s feature importance, we permute all values for a feature
without considering the values of remaining features. In contrast, to calculate Strobl’s feature importance,
we permute values for a feature conditional on values of the remaining features.
5.2 A Challenge for Existing Feature Importance Measures in
Random Forests
In this section, we discuss a family of artificial datasets whose features existing feature importance measures are unable to rank correctly. These datasets motivate our proposal of a new importance measure that ranks the features correctly. First, we describe the nonlinear model used to generate
the datasets.
5.2.1 Description of the Artificial Dataset Models
We create three different artificial regression datasets. Each data set is obtained from the following
nonlinear generative model:
y_i = \prod_{j=1}^{5} [\sin(x_j) + 1], \quad (5.3)
where yi is the target value generated from input xi for example i. We label the five features used
to generate a target value relevant features. They are uncorrelated with one another. Additionally, we add
irrelevant features that are correlated with the relevant features.
Each artificial dataset model differs in the number of relevant features correlated with the irrelevant
features. Next, we describe the three different artificial dataset models using their covariance matrix.
Artificial Dataset 1 has five irrelevant features in addition to the five relevant features, and the
covariance matrix describing the relationship among pairs of features is as follows (columns 1–5 are the relevant features, columns 6–10 the irrelevant features):

Cov(X) =
\begin{bmatrix}
0.5 & 0 & 0 & 0 & 0 & 0.1 & 0 & 0 & 0 & 0.1 \\
0 & 0.5 & 0 & 0 & 0 & 0.1 & 0.1 & 0 & 0 & 0.1 \\
0 & 0 & 0.5 & 0 & 0 & 0.1 & 0.1 & 0.1 & 0 & 0 \\
0 & 0 & 0 & 0.5 & 0 & 0 & 0.1 & 0.1 & 0.1 & 0 \\
0 & 0 & 0 & 0 & 0.5 & 0 & 0 & 0.1 & 0 & 0.1 \\
0.1 & 0.1 & 0.1 & 0 & 0 & 0.2 & 0 & 0 & 0 & 0 \\
0 & 0.1 & 0.1 & 0.1 & 0 & 0 & 0.2 & 0 & 0 & 0 \\
0 & 0 & 0.1 & 0.1 & 0.1 & 0 & 0 & 0.2 & 0 & 0 \\
0 & 0 & 0 & 0.1 & 0 & 0 & 0 & 0 & 0.2 & 0 \\
0.1 & 0.1 & 0 & 0 & 0.1 & 0 & 0 & 0 & 0 & 0.2
\end{bmatrix}, \quad (5.4)
where each of the relevant features is correlated with two or three irrelevant features. Note that adjustments were made to values in the covariance matrix to make it positive semi-definite.
Artificial Dataset 2 has the following covariance matrix:
Cov(X) =
\begin{bmatrix}
0.5 & 0 & 0 & 0 & 0 & 0.1 & 0 & 0.1 & 0.1 & 0.1 \\
0 & 0.5 & 0 & 0 & 0 & 0.1 & 0.1 & 0 & 0.1 & 0.1 \\
0 & 0 & 0.5 & 0 & 0 & 0.1 & 0.1 & 0.1 & 0 & 0.1 \\
0 & 0 & 0 & 0.5 & 0 & 0.1 & 0.1 & 0.1 & 0.1 & 0 \\
0 & 0 & 0 & 0 & 0.5 & 0 & 0.1 & 0.1 & 0.1 & 0.1 \\
0.1 & 0.1 & 0.1 & 0.1 & 0 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 \\
0 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 \\
0.1 & 0 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 \\
0.1 & 0.1 & 0 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 \\
0.1 & 0.1 & 0.1 & 0 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1
\end{bmatrix}, \quad (5.5)

(columns 1–5 are the relevant features, columns 6–10 the irrelevant features),
where four irrelevant features are correlated with each relevant feature. There are a total of five relevant
features and five irrelevant features.
Artificial Dataset 3 has the following covariance matrix:
Cov(X) =
\begin{bmatrix}
0.5 & 0 & 0 & 0 & 0 & 0.1 \\
0 & 0.5 & 0 & 0 & 0 & 0.1 \\
0 & 0 & 0.5 & 0 & 0 & 0.1 \\
0 & 0 & 0 & 0.5 & 0 & 0.1 \\
0 & 0 & 0 & 0 & 0.5 & 0.1 \\
0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1
\end{bmatrix}, \quad (5.6)

(columns 1–5 are the relevant features, column 6 the irrelevant feature),
where a single irrelevant feature is correlated with all relevant features. There are a total of five relevant
features and one irrelevant feature.
After the input matrix X is generated, we apply a tan(tan(·)) transform to the irrelevant features to reduce the linear correlation between the relevant and irrelevant features, in order to simulate nonlinear correlation.
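As an illustration, the generative model of Artificial Dataset 3 (Equations 5.3 and 5.6) can be sketched as follows; the function name is ours, and since the covariance matrix is only positive semi-definite, we skip numpy's validity check as an implementation choice.

```python
import numpy as np

def make_artificial_dataset3(n=1000, seed=0):
    """Five mutually uncorrelated relevant features, one irrelevant
    feature correlated with all of them (Equation 5.6), with a
    tan(tan(.)) transform making the correlation nonlinear."""
    cov = np.zeros((6, 6))
    np.fill_diagonal(cov, 0.5)
    cov[5, :] = cov[:, 5] = 0.1   # feature 6 covaries with features 1-5
    cov[5, 5] = 0.1               # diagonal entry from Equation 5.6
    rng = np.random.default_rng(seed)
    # The matrix is semi-definite (one zero eigenvalue), so ignore the check.
    X = rng.multivariate_normal(np.zeros(6), cov, size=n, check_valid="ignore")
    y = np.prod(np.sin(X[:, :5]) + 1.0, axis=1)   # Equation 5.3
    X[:, 5] = np.tan(np.tan(X[:, 5]))             # nonlinear transform
    return X, y

X, y = make_artificial_dataset3()
```

Since each factor sin(x_j) + 1 lies in [0, 2], the target y is bounded between 0 and 32.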
5.2.2 Results for Artificial Dataset Models
Now, we present our method and results on the three artificial datasets created to evaluate existing
importance measures for random forests.
Setup. From each of the three artificial dataset models, we generated 100 datasets each of which con-
tained 1000 training examples and 1000 validation examples. For each of the 3x100 datasets, we obtained
values for various feature importance measures.
We considered results for the 3x100 datasets for the four combinations of {Breiman’s feature impor-
tance measure, Strobl’s feature importance measure} × {random forest, conditional forest}. A conditional
forest uses a different type of decision tree from a random forest; Strobl’s feature importance was proposed
originally for conditional forests, and thus we also included results obtained using conditional forests.
We used the following two software packages: (a) random forests [67, 107], and (b) conditional forests
[62]. For both software packages, we set the number of trees ntree = 1000 and parametrically selected mtry
using validation data. For the algorithm combination {Strobl’s feature importance, conditional forest}, we
modified our experimental setup as follows: for each dataset, we trained a conditional forest model using
a subsampled1 set of 750 of the available 1000 training examples.
Method. Ideally, when applied to artificial datasets, the feature importance values determined by a
feature importance measure will be higher for relevant features than for irrelevant features. In order to
quantify whether relevant features actually have higher feature importance values than irrelevant features,
1 The conditional forest package did not complete successfully for a dataset with 1000 training examples when trained on a computer with 64GB of memory. Even after subsampling the dataset, the execution time for calculating feature importance was almost a day per dataset.
we define the following scoring rule:

\text{correct detection} = I\left[ \min\{importance(m, dataset_k)_{i \in relevant}\} \geq \max\{importance(m, dataset_k)_{i \in irrelevant}\} \right], \quad (5.7)

where importance(m, D)_{S} denotes the set of feature importance values calculated using feature importance measure m for the feature set S in dataset D; relevant is the set of relevant features and irrelevant is the set of irrelevant features, both known a priori for the artificial dataset models. When a feature importance measure ranks all relevant features higher than all irrelevant features, the correct detection scoring rule has a value of one; otherwise, it has a value of zero.
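The scoring rule can be read directly off Equation 5.7; the short sketch below uses illustrative names of our own.

```python
import numpy as np

def correct_detection(importances, relevant, irrelevant):
    """Equation 5.7: score 1 iff the smallest importance among relevant
    features is at least the largest importance among irrelevant ones."""
    importances = np.asarray(importances, dtype=float)
    return int(importances[relevant].min() >= importances[irrelevant].max())

# Features 0-4 relevant, feature 5 irrelevant (as in Artificial Dataset 3).
det_good = correct_detection([0.9, 0.8, 0.7, 0.6, 0.5, 0.2], [0, 1, 2, 3, 4], [5])
det_bad = correct_detection([0.9, 0.8, 0.7, 0.1, 0.5, 0.2], [0, 1, 2, 3, 4], [5])
```

In the first call every relevant feature outranks the irrelevant one, so the rule scores 1; in the second a relevant feature (importance 0.1) falls below the irrelevant one, so it scores 0.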
Results. In Figure 5.1, we plot the correct detection rate obtained by four combinations of the feature
importance measure {Breiman’s feature importance measure, Strobl’s feature importance measure} and
forest type {conditional forests, random forests} on the 3x100 artificial datasets. We also plot the results
obtained for a new importance measure which we term retraining-based feature importance; the importance
measure will be discussed further in Section 5.3.
Based on Figure 5.1, we can draw two conclusions. First, for the majority of replications of artificial dataset models 2 and 3, existing feature importance measures incorrectly rank an irrelevant feature higher than one of the relevant features. Second, the only feature importance measure that consistently ranks relevant features higher than irrelevant features is the retraining-based feature importance, which we describe shortly.
Many other machine learning algorithms have problems with these data sets. For more information
refer to Appendix C.5. Having presented the case that there is a class of situations for which existing feature-
importance measures fail for random forests (and other machine learning models), we now turn to proposing
a feature-importance measure that we hypothesize is more meaningfully related to the relevance of features
to a task.
[Figure 5.1 panels, each over Dataset Model-1/2/3: difference in error rates (models with all features − models with only relevant features) using random forests; and counts (out of 100 datasets) of correct vs. incorrect detection for Random Forests + Breiman's Feature Importance, Random Forests + Strobl's Feature Importance, Conditional Forests + Breiman's Feature Importance, Conditional Forests + Strobl's Feature Importance, and Random Forests + Retraining based Feature Importance.]
Figure 5.1: Feature importance results for random forests and conditional forests on 3 artificial dataset models. From each artificial dataset model, we generate 100 datasets. We plot the number of times correct and incorrect detection is performed by an importance measure using the correct detection scoring rule defined in Equation 5.7; incorrect detection = 1 − correct detection.
5.3 Retraining-Based Feature Importance
Breiman’s feature importance measure quantifies the importance of a feature as the increase in OOB
error when the feature is permuted; furthermore, both Breiman’s and Strobl’s feature importance measures
are calculated using a single model. We propose a different formulation of feature importance that is based
on training multiple models.
Retraining-based feature importance for feature i is obtained by permuting feature i so that the feature
value for training example e now corresponds to the feature value of some other example e′. Following training
on the data set with feature i permuted, the increase in error can be measured. To minimize variance due to
the particular permutation chosen, we repeat the permutation-and-retraining procedure k times. Formally,
we define feature importance for feature i as:
\text{Feature importance for feature } i = \frac{1}{k} \sum_{j=1}^{k} \left[ Error_{OOB}(TrainRF(X_{\setminus i}, y)) - Error_{OOB}(TrainRF(X, y)) \right], \quad (5.8)
where X is the input matrix of size n × p , y is the vector of size n containing target values or labels, X\i is
the matrix X with feature i permuted, k is the number of resamples, TrainRF (X,Y ) parametrically trains a
random forest model using the input data X and target values or labels Y , and ErrorOOB(TrainRF (X,Y ))
returns the OOB error rate of the random forest model trained using TrainRF(X, Y). The k resamples are performed to improve the confidence in the calculated feature importance values. For a single resample, we train two models, one with the original input matrix X and the other with the input matrix X_{\i}, which has feature i either permuted or removed; the feature importance for a feature is quantified as the increase in OOB error. To make the OOB error rates repeatable, we ensure that corresponding trees in the two random forest models are trained with the same random seed and the same in-the-bag examples; note that an individual tree in a random forest model is trained using a bootstrap dataset obtained from the original dataset, and in-the-bag examples are the examples present in that bootstrap dataset.
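The resampling loop of Equation 5.8 might be sketched as follows, with scikit-learn's RandomForestRegressor standing in for TrainRF; this is an assumption on our part, and reusing the same `random_state` across the baseline and permuted fits only approximates the thesis's per-tree seed and in-the-bag sharing.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def retraining_importance(X, y, feature, k=3, seed=0):
    """Equation 5.8: average increase in OOB error over k resamples,
    each pairing a baseline fit with a fit on permuted data."""
    rng = np.random.default_rng(seed)
    deltas = []
    for r in range(k):
        params = dict(n_estimators=50, oob_score=True, random_state=r)
        base = RandomForestRegressor(**params).fit(X, y)
        Xp = X.copy()
        Xp[:, feature] = rng.permutation(Xp[:, feature])  # permute feature i
        perm = RandomForestRegressor(**params).fit(Xp, y)
        # oob_score_ is R^2 for regressors, so OOB error is 1 - R^2.
        deltas.append((1.0 - perm.oob_score_) - (1.0 - base.oob_score_))
    return float(np.mean(deltas))

# Toy target depending only on feature 0: permuting it should hurt,
# permuting the noise feature should not.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 2))
y = X[:, 0].copy()
imp_signal = retraining_importance(X, y, 0)
imp_noise = retraining_importance(X, y, 1)
```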
The retraining-based importance measure just defined in Equation 5.8 is cast in terms of the scaleless measure of ‘error’, which has no intrinsic interpretation. Is an error of .45 good? No one can say. Consequently, we define a second retraining-based importance measure in a currency that is readily interpreted: the currency of probability.
We can reformulate feature importance in terms of the probability that the error will increase as a
result of permuting (or removing) a feature. We define
\text{Feature importance for feature } i = 1 - \text{probability from } ttest(perm, baseline), \quad (5.9)

where perm, baseline \in \mathbb{R}^k and, for m = 1, \ldots, k,

perm_m = Error_{OOB}(TrainRF(X_{\setminus i}, Y)), \qquad baseline_m = Error_{OOB}(TrainRF(X, Y)).
As before, X is the input matrix of size n × p , y is the vector of size n containing target values or labels, X\i
is the matrix X with feature i permuted, k is the number of resamples, TrainRF (X,Y ) parametrically trains
a random forest model using the input data X and target values or labels Y , ErrorOOB(TrainRF (X,Y ))
returns the OOB error rate of the random forest model trained using TrainRF (X,Y ). ttest(perm, baseline)
is the one-tail paired Student’s t-test. ttest(perm, baseline) tests the null hypothesis that the random variable
perm− baseline has a normal distribution with mean equal to zero and unknown variance. Logically, we are
trying to get a probability value for the alternative hypothesis: that the random variable perm − baseline
has a mean greater than zero. If permuting feature i does not reliably worsen the OOB error rate, the one-tailed t-test returns a large p-value and the feature importance value for feature i is near 0. In contrast, when the OOB error rate increases significantly when feature i is permuted, the t-test returns a small p-value and the feature importance value of feature i is near 1.
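Assuming SciPy's paired t-test (with its one-sided `alternative` option) as a stand-in for the thesis's t-test routine, Equation 5.9 reduces to a few lines; the error arrays here are purely illustrative.

```python
import numpy as np
from scipy.stats import ttest_rel

def ttest_importance(perm, baseline):
    """Equation 5.9: one minus the p-value of a one-tailed paired t-test
    of 'permuted OOB error > baseline OOB error'."""
    return 1.0 - ttest_rel(perm, baseline, alternative="greater").pvalue

perm = np.array([0.31, 0.30, 0.32, 0.29, 0.31])       # OOB error after permuting
baseline = np.array([0.20, 0.21, 0.20, 0.19, 0.21])   # OOB error before
imp_strong = ttest_importance(perm, baseline)          # reliably worse: near 1
imp_weak = ttest_importance(np.array([0.21, 0.19, 0.20, 0.18, 0.22]), baseline)
```

The first feature's permuted errors are consistently higher than baseline, so its importance is near 1; the second feature's errors straddle the baseline, so its importance stays well below 1.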
5.4 Approach
In this section, we provide the details of our research. Our three key contributions are summarized in
Section 5.4.1. Two of our contributions were discussed in Section 5.3. Our third contribution is discussed in
Section 5.4.2.
5.4.1 Key Contributions
We briefly describe our three key contributions.
Retraining-Based Feature Importance. In Section 5.2, we discussed a set of artificial dataset models in which irrelevant features are correlated with the relevant features that generate the target.
Existing random forests importance measures, including those proposed by Breiman [19] and Strobl et al.
[111], are unable to consistently rank relevant features higher than irrelevant features. In Section 5.3, we
presented a new feature importance measure that uses retrained models and is able to correctly rank relevant
features higher than irrelevant features. Our proposed feature importance measure uses relative increases in
OOB error.
Retraining-Based Feature Importance in Terms of Probability of Error. In Section 5.3, we
motivated the idea that feature-importance values obtained from existing measures are difficult to interpret
as they are based on calculating the relative increase in OOB error. We argued that a one-tailed paired
Student’s t-test can be utilized to obtain a probability value associated with the hypothesis that permuting
or removing a feature reliably increases the OOB error. Such probability values provide an interpretive
meaning to feature importance values.
Iterated Reweighting of Random Forest Models. Before we discuss our proposed algorithm for
iterated reweighting of random forests models, we briefly note how existing feature selection methods for
random forests use feature importance measures. Procedures such as variable selection using random forests
(varSelRF) [37] and related procedures, begin with a full complement of features, and models are iteratively
created by removing the features with the lowest feature importance values. The final model is selected based
on results on held-out validation data or OOB data; logically, the features used to train models are selected
in an all-or-none fashion. In contrast, we propose a strategy in which continuous weights are assigned to
individual features; features with high weights have a greater probability than features with low weights of
being used to train a model. We perform multiple iterations of model training to reweight features such that
the least useful features eventually obtain a weight of zero.
Next, we discuss the details of the iterated reweighting of random forest models.
5.4.2 Details of Iterated Reweighting of Random Forest Models
To perform iterated reweighting of random forest models, we assign continuous weights between zero and one to individual features; features with large weights have a greater probability than features with
small weights of being used to train a model. We perform multiple iterations of model training to iteratively
reweight features such that the least useful features eventually obtain a weight of zero.
To explain the procedure for iterated reweighting, we must first explain where the weights come from,
and then we must explain how a random forest model is trained with biasing from the specified weights.
Weights. The weights are derived from feature importance values; in iterated reweighting of random
forest models, features with higher feature importance values have a greater probability of being used to train
a model.
Figure 5.2: Comparison between search of features at the nodes of trees in random forests and FIbRF. The number of available features is indicated by p.
Feature Importance Biased Random Forests (FIbRF). The other component for iterated
reweighting of random forests models is an algorithm that trains a random forest model biased with the
specified weights. We now make a proposal for how the biasing should be done.
In the standard random forest algorithm, at each node in a decision tree, mtry features are randomly
selected with uniform probability from p available features. We propose a modification to the random forest
algorithm called feature importance biased random forests (FIbRF), in which mtry features are selected not
with a uniform probability, but instead with probability dependent on pre-specified weights. The FIbRF
algorithm is explained in Algorithm 4 and in Figure 5.2.
Algorithm 4: FIbRF: Routine Called at Each Node of Individual Decision Trees in a Random Forest.

Input: Value of mtry. Weights w, a vector of size p. p is the number of available features.
Output: An array mtry_array containing the features to search, instead of randomly sampling mtry features with uniform probability from the p available features.

  t ← mtry / p
  p_array ← {1, . . . , p}
  mtry_array ← {}
  mtry_num ← 0
  prob is a vector of size p
  forall i in p_array do
      prob_i ← w_i / Σ_{k ∈ p_array} w_k
  while t > 0 and mtry_num < mtry do
      Randomly select feature labeled j from p_array based on the probabilities in prob
      /* Once a feature is added to mtry_array, it is not sampled again from p_array */
      mtry_num ← mtry_num + 1
      t ← t − prob_j
      p_array ← p_array \ {j}
      mtry_array ← mtry_array ∪ {j}
  return mtry_array
Tuning Feature Importance Values for Use as FIbRF Weights. Because the values obtained from
feature importance measures for different datasets are on different scales, we perform a scaling operation on
the feature importance values before using them as weights. Tuning parameter λ determines the mapping
from feature importance values to weights as follows:
Weight wi in FIbRF for feature i = (feature importance value for feature i [scaled between 0 and 1])λ, λ ≥ 0.
(5.10)
88
Note that when λ = 0, the sampling of features at the nodes in individual decision trees is equivalent for
both FIbRF and random forest. For FIbRF, we perform a grid search over (mtry, λ). We recommend the
following range for λ: {0, 0.25, 0.5, 1}. As we discuss in Appendix C.3, the ntree parameter can be set to a
fixed value.
The tuning parameter is a power function. We scale the feature importance values between zero and
one before using the tuning parameter, and thus the weights are also bounded between zero and one for
λ ≥ 0; in Section 5.5, we discuss how to handle negative feature importance values that may be produced
by feature importance measures based on OOB error.
5.5 Methodology
In this section, we discuss datasets, experimental setup, and evaluation metrics used for our study.
5.5.1 Datasets
The 17 classification datasets used in our experiments are listed in Table 5.1; a detailed description
of the datasets is presented in Appendix C.1. Our criteria for selecting these datasets is as follows:
� We select datasets that have a couple thousand examples and at least 25 but no more than 250
features. High-dimensional datasets (> 250 features) are omitted, as they may make experiments
computationally intractable.
� Datasets with fewer examples (< 1000) are omitted, as the results are sensitive to the size of training
and testing data.
In addition to these standard data sets used to evaluate machine learning algorithms, we obtained
data from two additional domains. The description of the datasets is as follows:
(1) Medical data describing blood loss in patients [82]. The goal is to predict the amount of blood loss
induced in a patient based on inputs from various sensors such as pulse oximeters and heart rate
89
Dataset # of Classes Class Percentage # of Examples # of Features Source
covertype 5 37/49/7/3/3 Sampled 4884 54 [5, 10]from 581K
(Class 4 and 5dropped)
ibn sina 2 62/38 Sampled 4000 92 [57]from 20722
sylva (raw) 2 83/17 Sampled 4805 108 [56]from 13086
zebra 2 equal Sampled 4000 152 [57]from 61488
musk 2 85/15 6598 166 [5]hemodynamic 1 5 24/15/16/13/32 4594 35 [82]hemodynamic 2 2 equal 4000 43 [82]hemodynamic 3 5 15/30/21/15/19 7000 40 [82]hemodynamic 4 6 equal 3000 31 [82]
zernike 10 equal 2000 47 [5, 118]karhunen 10 equal 2000 64 [5, 118]
fourier 10 equal 2000 76 [5, 118]factors 10 equal 2000 216 [5, 118]pixel 10 equal 2000 240 [5, 118]
power 1 2 49/51 2000 191 [54]power 2 2 equal 2000 191 [54]power 3 2 equal 3000 191 [54]
Table 5.1: Seventeen datasets used in our experiments. The number of examples ranges from 2000 to 7000,and number of features ranges from 31 and 216.
monitors, among others. These datasets are listed in Table 5.1 as hemodynamic X, where X=1,2,3,
and 4.
(2) Data obtained from a power grid layout [54]. The goal is to discriminate between optimal and
sub-optimal power grid configurations, based on the connections present in the power grid and the
cost and reliability induced by those connections. These datasets are listed in Table 5.1 as power X,
where X=1,2, and 3.
Testing Methodology. For each of the 17 classification datasets, we randomly split the data into
two equal sets, using one set for training and the other for testing; we report our results averaged over 10
repetitions. Also, the same training-testing sets were used for all algorithm combinations.
90
Next, the different algorithm combinations of feature importance measures and iteration schemes are
discussed. For brevity, we use the term algorithm as a shorthand for the 8 algorithm combinations.
5.5.2 Experimental Setup
In our experiments, we explore both alternative approaches to computing feature importance and ap-
proaches to utilizing the importance. The four alternatives for computing feature importance are: Breiman
feature importance, Strobl feature importance, retraining-based feature importance using OOB error, and
retraining-based feature importance using probability of error. We earlier proposed an iterated reweighting
scheme for using feature importance. In order to evaluate this scheme, we compare it to the standard ap-
proach to utilizing importance: iterated removal. In iterated removal, the least important feature is removed
at each iteration of the feature selection process. Our goal is to identify the best algorithm combination of
a feature importance measure and an iterative scheme, based on performance of the algorithm combination
over many data sets.
Next, we list the various experimental parameters and conditions set for the eight algorithm combi-
nations. For brevity, we use the term algorithm as a shorthand for algorithm combination.
Number of Resamples for Calculating Feature Importance. The number of resamples is 30 for
all algorithms. For feature importance values obtained using Breiman’s method or Strobl’s method, the
number of resamples2 controls the number of times feature importance values are calculated and averaged.
Criteria for Explicit Removal of Features. For algorithms that use iterated removal, at the end of
each iteration we remove the feature with the lowest feature importance value. Algorithms that use iterated
reweighting do not explicitly remove features.
Adjustment of Feature Importance Values. We adjust feature importance values for the following
algorithms:
� For algorithms that use iterated reweighting, we set all features with importance values less than
2 This is equivalent to setting the nperm parameter to 30 in the randomForest software package [78].
91
zero to zero; note that feature importance values based on the OOB error rate may have values less
than zero.
� For algorithms that use retraining-based feature importance based on probability of error, we set all
probability values less than 0.05 to zero.
Parameters and Parametric Search. Parameter(s) and the range of values considered for those
parameter(s) are listed for the following algorithms:
� For all algorithms, we set the number of trees (ntree) to 1000. We present our justification3 in
Appendix C.3.
� For algorithms that use iterated removal, we consider the following range of values for mtry : {p/10,
2p/10, 3p/10, . . ., p, sqrt(p)}, where p is the number of features.
� For algorithms that use iterated reweighting, we perform a grid search over (mtry, λ). We consider
the following range of values: (1) for mtry : {p/10, 3p/10, 5p/10, . . ., p, sqrt(p)}, where p is the
number of features; and (2) for λ: {0, 0.25, 0.5, 1}.
Stopping Criterion. A general stopping criterion for all algorithms is an increase of 25% in the OOB
error rate over the lowest observed OOB error rate among all previous iterations. An additional stopping
criterion for algorithms that use iterated reweighting is maintenance of the same feature set for 25 iterations.
Selection of the Final Model. For all algorithms, at the end of each iteration, we save the values of
the OOB error rate and the test set error rate, which are averaged three times using the best parameters
selected via a parametric search. After termination, due to satisfying a stopping criterion, the model with
the lowest OOB error rate is selected as the final model and the error rate for the test set is reported.
Software. Our experiments were performed using Breiman’s random forests Breiman [19] implementa-
tion using Jaiantilal [67]. We modified the package to use FIbRF. We modified the package to use FIbRF
3 For the datasets used in this study, increasing the number of trees from 1000 to 2000 did not significantly increase accuracy.In contrast, increasing the number of trees from 500 to a 1000 produced a significant increase in accuracy.
92
and to allow the use of pre-specified random seeds and in-the-bag examples for individual trees of a random
forest model. The extendedForest [107] package was used to implement Strobl’s importance measure for
random forests.
5.5.3 Evaluation Metrics
For our experiments, we relied primarily on the error metric to evaluate and compare the performance
of various algorithms.
Classification Error Percentage. For classification based datasets, we calculate the classification
error percentage as 100 ∗n∑i=1
I(yi 6= yi)/n, for n examples; yi is the target label for ith example, yi is the
predicted label for ith example, and I() is an indicator function.
Comparison of Multiple Algorithms Using Significance testing. When comparing two or more
algorithms, we can calculate the average error across datasets. However, errors are not commensurate over
different datasets, and a large variation on a single dataset may be enough to give one of the algorithms an
unfair advantage over the others.
An alternate method is to use a paired Student’s t-test or the Wilcoxon signed-rank test [124] to
compare the significance of differences in error between individual algorithms, but such tests are overly
optimistic when multiple algorithms are compared, as they do not consider the fact that multiple comparisons
can be rejected due to random chance.
To compare multiple algorithms, Demsar [35] suggested the Friedman test, which ranks algorithms’
performance using error results for individual datasets, from best to worst. If after performing the Friedman
test, the hypothesis that the algorithms are similar is rejected, one can use a post-hoc test, such as the Holm
test, which adjusts the significance value based on the number of classifiers present; such an adjustment
takes into account the fact that many pairwise comparisons may be rejected due to random chance. Garcıa
and Herrera [52] provide software that lists the pairwise hypotheses that are rejected based on the results of
the Holm test.
Note that the comparison of multiple algorithms via the Friedman test and post-hoc testing has
93
disadvantages when the difference between algorithms is not significant. Assuming that the error values
from the algorithms are obtained through a cross validation technique such as Repeated Learning and
Testing (RLT)4 [4, 25, 38], the ranking may be incorrect if the differences across algorithms applied to a
dataset differ due to random chance alone, as when all algorithms applied to a given dataset obtain the same
or similar final models.
Next, we discuss our results.
5.6 Results
In this section, we present our results for the eight possible algorithm combinations of {Breiman
feature importance measure, Strobl’s feature importance measure, a retraining-based feature importance
measure based on OOB error, a retraining-based feature importance measure based on probability of error}
× {iterated reweighting, iterated removal}.
For the sake of convenience, we refer to the retraining-based feature importance measure based on
OOB error as Err, the retraining-based feature importance measure based on probability of error as Prob,
Breiman’s feature importance measure as Breiman, Strobl’s feature importance measure as Strobl, iterated
reweighting as reweighting, and iterated removal as removal.
Note that numerical tables for the results are presented in Appendix C.2 (page-133). Next, we present
the overall results and provide an analysis of those results.
5.6.1 Overall Results
This section concerns results obtained using 17 real world datasets.
Note that for some of the figures, we also include the results from the baseline model, which is a
random forest model trained using all features and an mtry value that is parametrically selected.
In Figure 5.3(a), we show the average percent error across the 17 datasets. We can make the following
main conclusion: the algorithms that use retraining-based feature importance measures beat out those based
4 In RLT, we randomly split the data into two sets, and use the larger set for training a model and the smaller set forevaluating the model; this procedure is repeated multiple times to obtain an estimate of the test error.
94
11.8
12
12.2
12.4
12.6
12.8
13
13.2
13.4
13.6
Strobl+
Reweighting
Baseline
Strobl+
Removal
Breim
an+Reweighting
Breim
an+Removal
Pro
b+Reweighting
Err+Reweighting
Pro
b+Removal
Err+Removal
Per
cent
Err
or
(a) average percent error and standard error for all algorithms.
−10
−5
0
5
Strobl+
Reweighting
Strobl+
Removal
Breim
an+Reweighting
Err+Reweighting
Breim
an+Removal
Pro
b+Reweighting
Pro
b+Removal
Err+Removal
% R
elat
ive
Red
uctio
n of
Err
or
(b) mean % relative reduction of error and standard error for all algo-rithms.
Figure 5.3: For all real world datasets, we plot: (a) average percent error and standard error for differentalgorithms, and (b) average percent relative reduction in error and standard error for different algorithms.The relative reduction is calculated in comparison to a baseline model. Algorithms are presented left to rightbased on increasing improvement. Normalization of the standard deviation was performed as suggested byMasson and Loftus [80].
95
on Breiman and Strobl measures.
In Figure 5.3(b), we show the relative reduction in error compared to the baseline model. The relative
reduction in error is a better measure than absolute error as one can quantify improvement over a control
condition (in our case, the baseline model). Based on the results in the Figure, it is clear that the three top
performing algorithms use retraining-based feature importance measures.
5.6.2 Analysis Based on % Reduction of Error
We began our analysis with an examination of the following two factors: (1) importance measures,
one of {Breiman, Strobl, Err, Prob}, and (2) iterative schemes, one of {reweighting, removal}. We per-
formed an analysis of variance (ANOVA) on the percent relative reduction in error and found that there
is a marginally significant interaction between the factors of importance measure and the iterative scheme
[F(306,2)=2.975, p=0.053]; in addition, both importance measure [F(306,2)=48.385, p <1e-3] and iterative
schemes [F(153,1)=69.171, p <1e-3] are significant factors. In Figure 5.4, we plot the mean percent relative
reduction of error for the two factors across all datasets. These results allow for the following conclusion:
reweighting performs worse than removal. Henceforth, our analysis will be primarily based on iterated
removal.
Comparing Algorithms Using Removal Based on a Statistical Procedure. Next, we compared
algorithms that use iterated removal. In Figure 5.5, we show statistical results based on significance testing
using the Friedman rank test and the Holm procedure as suggested by Garcıa and Herrera [52]; these
statistical results were obtained using the percent error results across all datasets. Graphically, we arrange
the algorithms from top (best performing algorithm) to bottom (worst performing algorithm) according
to their average Friedman ranks across all datasets. Pairs of algorithms whose outcomes are statistically
indistinguishable (cannot be rejected at p = 0.05 according to the Holm procedure) are connected in the
Figure. Thus, algorithms that are significantly different from each other are not connected. We also note
the average percent error and the reduction in percent error from that of the baseline model in parentheses
next to the algorithm name.
96
Reweighting Removal−12
−10
−8
−6
−4
−2
0
2
4
6
8
% R
elat
ive
Red
uctio
n in
Err
or
ProbErrBreimanStrobl
Figure 5.4: Analysis of four importance measures and two elimination schemes.
For Figure 5.5, we can draw the following conclusion: Prob+Removal has the highest rank but is
insignificantly different from Breiman+Removal and Err+Removal.
Next, we analyze the differences between algorithms on the basis of their performance on individual
datasets.
Is There a Difference Between Retraining and Non-Retraining-Based Feature Importance?
In this analysis, we considered only combinations that use removal. For each dataset and each train-
ing/test split for that dataset, we calculated the average percent reduction in error across algorithms that
use retraining-based feature importance measures; similarly, we calculate the average percent reduction in
error across algorithms that use non-retraining-based feature importance measures (Breiman and Strobl).
In Figure 5.6, we compare these two conditions and show their results on individual datasets and individual
training/test splits of those dataset. We use a ◦ red circle to indicate the difference in percent reduction in
error on the same training/test splits; a positive value on the y-axis implies that the retraining-based feature
importance measures perform better on average than the non-retraining-based feature importance measures.
As Figure 5.6 shows, there are 10 datasets for which retraining-based feature importance measures
97
1
2
3
4
1
2
3
4
1
2
3
4
Prob+Removal (11.96%/6.72%)Err+Removal (11.90%/7.12%)
Breiman+Removal (12.52%/2.77%)
Strobl+Removal (12.66%/0.83%)
Figure 5.5: Algorithms are arranged from top to bottom according to their Friedman test ranks. Algorithmsthat are insignificantly different from each other (p=0.05 according to the Holm procedure) are connected.(Average percent error over datasets and percent reduction of error over the baseline model) are noted inparentheses after the algorithm name.
perform better on average than non-retraining-based feature importance measures. There are 7 for which
the opposite is true, but the reduction in error is clearly larger for the 10 than the increase in error is for the
7. Thus, we conclude that retraining is a useful feature to include in random forest training when feature
selection is being performed.
Comparison of Err+Removal to Other Removal Algorithms. We observed that there are dis-
tinct datasets for which algorithms that use a retraining-based feature importance measures outperform
algorithms that use non-retraining-based feature importance measures. Next, we examine the example
of Err+Removal, an algorithm that uses a retraining-based feature importance measure, and compare
it with Breiman+Removal and Prob+Removal. We do not show the comparison of Err+Removal to
Strobl+Removal, as the results were similar to the comparison to Breiman+Removal.
In Figure 5.7(a), we compare Err+Removal to Breiman+Removal. We can draw the following two
98
−40
−30
−20
−10
0
10
20
30
40
Datasets
Diff
eren
ce in
% R
educ
tion
of E
rror
% reduction of errordifference between retrained and non−retrained importance measures for individual datasets
mean difference in % reduction of errordifference in individual folds
Figure 5.6: Difference in percent reduction of error among algorithms that use retraining-based vs non-retraining-based feature importance measures, collapsed across removal schemes. Difference in percent re-duction of error on the same training/test split of a dataset is represented by a ◦ red circle.
conclusions: (1) Err+Removal performs better than Breiman+Removal on ten datasets and for those datasets
the difference in percent reduction in error is ≥ 5%; and (2) both algorithms perform differently depending
on the dataset.
In Figure 5.7(b), we compare Err+Removal to Prob+Removal. We can draw the following conclusion:
the two algorithms are similar to each other; on about ten of the datasets, the difference in percent reduction
in error between the algorithms is less than 2%.
5.6.3 Analysis Based on Feature Selection
Next, we analyze differences in feature discovery across the eight different algorithms.
Similarity of Algorithms Based on Features Selected. To compare features discovered by any
two algorithms, for each individual training/test split, we calculated a similarity score, which is defined as
99
−30
−20
−10
0
10
20
30
40
50Err+Removal vs. Breiman+Removal
Datasets
Diff
eren
ce in
% R
educ
tion
of E
rror
mean difference in % reduction of errordifference in individual folds
(a) Comparison of Err+Removal and Breiman+Removal.
−30
−20
−10
0
10
20
30
40
50Err+Removal vs. Prob+Removal
Datasets
Diff
eren
ce in
% R
educ
tion
of E
rror
mean difference in % reduction of errordifference in individual folds
(b) Comparison of Err+Removal and Prob+Removal.
Figure 5.7: Difference for the same training/test split of a dataset is represented by a ◦ red circle. Weshow: (a) comparison of Err+Removal and Breiman+Removal and (b) comparison of Err+Removal andProb+Removal.
100
follows:
Similarity Score for Features Discovered (a, b)
= 100 ∗
1− 1
k ∗ t
k datasets∑m=1
1
Dk
Dk features∑i=1
|f counts(k, a, i)− f counts(k, b, i)|
,
(5.11)
where, m = 1, . . . , k datasets, i = 1, . . . , Dk are the features for dataset k, t randomized trials are performed
per dataset, a and b are algorithms for comparison, and f counts(k, a, i) returns the number of times feature
i was selected by algorithm a for dataset k. If we assume that we run t randomized trials per dataset, thus
f counts(k, a, i) is between 0 and t.
For the score to be valid, we have to ensure that same training/test splits are used by both algorithms.
If two algorithms discover similar feature sets, the similarity score will be high; otherwise, it will be low.
A similarity score near 100 implies that for individual training/test splits across multiple datasets, the
algorithms selected very similar features.
In Figure 5.8, we make pairwise comparisons of various algorithms based on their similarity scores.
We shade the range of similar to dissimilar algorithms from white (similar) to black (dissimilar). We can
draw the following two conclusions:
� Err+Removal, Err+Reweighting, and Prob+Removal have high similarity scores with one another,
whereas Prob+Reweighting has a lower similarity score with the others.
� Breiman+Removal and Strobl+Removal are similar to each other, but they have lower similarity
scores with the retraining-based algorithms.
Sparsity of Models. In Figure 5.9, we plot the average number of features discovered by the eight
algorithms across all datasets. We can draw the following three conclusions:
(1) The sparsest models are created by the following retraining algorithms: Err+Reweighting,
Err+Removal and Prob+Removal. Err+Reweighting produced the sparsest model, but in terms
of the overall results, it performed worse than Prob+Removal and Err+Removal.
101
83.13
86.31
62.21
72.87
66.55
52.67
55.83
83.13
84.81
61.35
74.39
65.51
53.68
57.06
86.31
84.81
63.79
70.99
68.69
51.28
53.42
62.21
61.35
63.79
60.43
66.13
72.45
70.13
72.87
74.39
70.99
60.43
79.97
71.62
71.97
66.55
65.51
68.69
66.13
79.97
67.07
66.92
52.67
53.68
51.28
72.45
71.62
67.07
88.13
55.83
57.06
53.42
70.13
71.97
66.92
88.13
Prob+Rem
oval
Err+Rew
eighting
Err+Rem
oval
Prob+Rew
eighting
Breim
an+Rem
oval
Strobl+
Rem
oval
Breim
an+Rew
eighting
Strobl+
Rew
eighting
Prob+Removal
Err+Reweighting
Err+Removal
Prob+Reweighting
Breiman+Removal
Strobl+Removal
Breiman+Reweighting
Strobl+Reweighting
51.28
56.16
61.03
65.90
70.77
75.64
80.51
85.39
90.26
95.13
100.00
Figure 5.8: Comparison of various algorithms on basis of the similarity score discussed in Equation 5.11.Algorithms that are similar to each other are indicated with a lighter shade.
(2) Prob+Reweighting did not output a sparse model, unlike the other retraining-based algorithms.
(3) Err+Reweighting, Err+Removal, and Prob+Removal produced sparser models than Breiman+Removal,
as 31%-57% more features were discovered by Breiman+Removal.
If there is a cost associated with feature acquisition, then retraining-based algorithms are more promis-
ing than non-retraining-based (using Breiman or Strobl’s feature importance) algorithms, due to the lower
error rate and fewer features acquired.
102
30
40
50
60
70
80
90
Err+Reweighting
Pro
b+Removal
Err+Removal
Breim
an+Removal
Strobl+
Reweighting
Pro
b+Reweighting
Strobl+
Removal
Breim
an+Reweighting
Ave
rage
Num
ber
of F
eatu
res
Dis
cove
red
Figure 5.9: Sparsity of models created by different algorithms. Note that the average number of featuresavailable over all datasets is ≈ 114. Normalization of the standard deviation was performed as suggested byMasson and Loftus [80].
Sparsity of Models vs Reduction in Error for Retraining+Removal Algorithms. In Fig-
ure 5.10, we plot the percent reduction in error over the baseline model against percent features present
in the final model for Prob+Removal and Err+Removal. We can draw the following conclusion: there is a
linear relation between the percent features remaining in the final model and the percent reduction in error
obtained; that is, sparser models have a greater reduction in error than models trained with all features.
5.6.4 Computation Complexity
Our earlier analysis was based on the error percentage and feature selection abilities of various algo-
rithms. An important additional criterion to consider is the computational complexity of the algorithms.
In Figure 5.11, we compare the number of random forests trained relative to Strobl+Removal; we
103
20 30 40 50 60 70 80 90 100−20
−10
0
10
20
30
% number of features in final model
% r
educ
tion
of e
rror
pval−removalerr−removal
Figure 5.10: Sparsity of models vs % reduction in error over the baseline model for Prob+Removal andErr+Removal.
0 20 40 60 80 100 120
Strobl+Removal
Breiman+Removal
Breiman+Reweighting
Strobl+Reweighting
Err+Reweighting
Prob+Reweighting
Prob+Removal
Err+Removal
number of models trained relative to Strobl+Removal (1X)
Figure 5.11: Computational complexity relative to Strobl+Removal (1X).
do not calculate the total running time, as the experiments were run on heterogeneous computers. We
104
acknowledge that comparing the computational complexity of the algorithms based on the number of models
is approximate for the following two reasons: (1) the amount of time taken for real world datasets is dataset
specific, and (2) by only showing results for the number of models trained, we are ignoring the time that
is required to calculate Breiman’s and Strobl’s feature importance measures for algorithms that use those
measures.
From Figure 5.11, we can draw the following two conclusions: (1) the number of models trained
for Prob+Removal and Err+Removal is 100 times greater than the number of models required for
Strobl+Removal and Breiman+Removal, and (2) Prob+Reweighting and Err+Reweighting have a com-
putational cost that is an order of magnitude higher than Breiman+Removal and Strobl+Removal, but an
order of magnitude lower than Prob+Removal and Err+Removal.
A reasonable concern is the high computational complexity of the algorithms that use retraining-based
feature importance measures, which have a computational complexity of O(p2) for the number of models
created, where p is the number of features. We argue that our results show that there are distinct datasets
for which algorithms that use a retraining-based feature importance measures show a significant reduction in
error rate when compared to algorithms that use existing feature importance measures; thus, we can justify
the use of a computationally expensive procedure. We can also consider the advantages of the parallel nature
of training a random forest model [105] and obtaining the retraining-based feature importance; we can train
multiple models in parallel to calculate the retraining-based feature importance.
We briefly note the computers used for our study. Our experiments were performed over a couple of
months on commodity hardware consisting of ten computers5 each of which contained either quad-core and
hyper-threaded Intel processors or multicore AMD processors.
5 On a personal note, the performance of a quad socket machine, which can be obtained at a cost of $10000 (in 2011), isnow matched in computational efficiency by a dual socket machine, which can be obtained at a cost of $3000 (in 2012). In2012, at a cost of $15000, we bought off-the-shelf computers whose total computational power was comparable to the fastestsupercomputer in 1999, ASCI-RED ( http://www.top500.org/system/168753 ); ASCI-RED was constructed at a cost of $185Million. Note we are comparing only advances in computing due to newer CPU architecture/manufacturing processes; evenlarger advances are possible with accelerators like GPU’s.
105
5.7 Conclusions and Future Work
In this study, we proposed an iterated reweighting scheme for random forests; we also proposed an
importance measure based on retrained models. Contrary to our initial intuition, an iterated reweighting
scheme did not perform as well as the existing iterated removal scheme. A combination of our proposed
feature importance measure and an iterated removal scheme showed improved results over existing feature
importance measures. Though our proposed method is computationally intensive, the sparsity in number of
features for the obtained models may offset costs associated with feature acquisition.
For the future, we suggest that it may be worth investigating a hybrid strategy using both iterated
reweighting and iterated removal to prevent an algorithm getting stuck in a local optimum, as it appears to
happened with iterated reweighting.
Chapter 6
Conclusion
6.1 Conclusions
In this thesis, we explored feature selection by employing iterative reweighting for two classes of
popular machine learning models: L1 penalized linear models and Random Forests.
Recent studies have shown that incorporating feature importance weights into L1 models leads to
improvement in predictive performance in a single iteration of training. In Chapter 3, we advanced the
state-of-the-art by developing an alternative method for estimating feature importance based on subsampling.
Extending the approach to multiple iterations of training, employing the importance weights from
iteration n to bias the training on iteration n + 1 seems promising, but past studies yielded no benefit to
iterative reweighting. In Chapter 4, we proposed a multiple step algorithm, the SBA-LASSO, that uses
weights derived from either bootstrapped or subsampled datasets. We tuned the characteristics of our
algorithm based on synthetic data sets, and then evaluated the final algorithm on 25 real-world data sets,
whose input dimension ranged from 6 to 300. For the lower-dimensional problems, we augmented the input
with second-order features, which not only allows the models to better exploit feature selection but yield
models with less bias. We obtained a significant reduction of 7.48% in the error rate over standard L1
penalized algorithms, and nearly as large an improvement over alternative feature selection algorithms such
as Adaptive Lasso, Bootstrap Lasso, and MSA-LASSO, on both classification and regression problems, using
our improved estimates of feature importance. Our empirical results also support the hypothesis that we can
107
significantly improve prediction in regression and classification problems using a multiple step estimation
algorithm instead of a single step estimation algorithm.
In Chapter 5, we considered iterative reweighting in the context of Random Forests and contrasted it
with a more standard backward-elimination technique that involves training models with the full complement
of features and iteratively removing the least important feature; weights determined the probability of
selecting a feature for inclusion in the trees of a Random Forest. In parallel with this contrast, we also compared
several measures of importance, including our own proposal based on evaluating models constructed
with and without each candidate feature. We showed that our importance measure yields both higher accuracy
and greater sparsity than importance measures obtained without retraining models (including measures
proposed by Breiman and Strobl), though at a greater computational cost. In contrast to the results we
obtained with linear models, iterative reweighting of feature importance with the nonlinear Random Forests did
not outperform a backward-elimination approach with all-or-none feature inclusion. Though our proposed
method is computationally intensive, the sparsity of the resulting models may offset
costs associated with feature acquisition.
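The backward-elimination procedure described above can be sketched model-agnostically. For brevity, the sketch below uses an ordinary least-squares fit in place of a Random Forest, and its importance measure retrains the model with and without each candidate feature, in the spirit of our proposed measure; the function names and the training-error criterion are illustrative assumptions.

```python
import numpy as np

def fit_and_score(X, y):
    """Least-squares fit; returns training mean squared error.
    (A stand-in for training a Random Forest and scoring it out-of-bag.)"""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((y - X @ coef) ** 2)

def retraining_importance(X, y, features):
    """Importance of a feature = error increase when the model is retrained
    without it (a with/without-retraining importance measure)."""
    base = fit_and_score(X[:, features], y)
    imps = []
    for f in features:
        rest = [g for g in features if g != f]
        imps.append(fit_and_score(X[:, rest], y) - base)
    return np.array(imps)

def backward_elimination(X, y, n_keep):
    """Repeatedly drop the least important feature (all-or-none removal)."""
    features = list(range(X.shape[1]))
    while len(features) > n_keep:
        imps = retraining_importance(X, y, features)
        features.pop(int(np.argmin(imps)))
    return features
```

Because the irrelevant features barely change the refit error, they are removed first, and the relevant features survive to the end.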
6.2 Future Work
L1 Penalized Algorithms. Chapter 4 provided empirical evidence that Subsampling SBA outperforms
existing algorithms. The theoretical justification for Subsampling SBA and other multiple step algorithms
is still an open research question. We hypothesize that the Subsampling SBA is able to adaptively find the
threshold after which a feature should not be penalized in the adaptive lasso penalty. We further conjecture
that MSA-LASSO and Adaptive Lasso may be improved by applying a squashing function to the weights, which
would push them toward binary values and therefore make the algorithms behave more like SBA-LASSO.
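The squashing idea above could be realized, for instance, with a steep logistic applied to normalized weights; the slope k and threshold theta below are illustrative values, not settings from the thesis.

```python
import numpy as np

def squash(w, k=10.0, theta=0.5):
    """Push normalized importance weights toward binary (0/1) values."""
    w = np.asarray(w, dtype=float)
    w = w / w.max()                       # scale the largest weight to 1
    return 1.0 / (1.0 + np.exp(-k * (w - theta)))
```

Weights well above the threshold approach 1 (the feature is effectively unpenalized) and weights well below it approach 0, approximating the near-binary behavior of SBA-LASSO.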
A comparison of our SBA-LASSO algorithm with other sparsity-inducing penalty functions
should also be investigated; a possible research avenue is combining our importance weights with such
penalties.
Random Forest. In Chapter 5, we showed that our proposed importance measure based on models
constructed with and without each candidate feature provides large improvements; contrary to our initial
intuition, iterated reweighting schemes do not perform as well as backward elimination, an all-or-none feature
inclusion scheme. We have some initial evidence (Appendix C.6) supporting the hypothesis that a reweighting
scheme gets stuck in a local optimum, and we suggest that a hybrid strategy using both iterated reweighting
and iterated removal may be worth investigating.
As our proposed importance measure is computationally intensive, a further analysis of the effect of
removing more than a single feature at a time would be useful, as would an analysis of the effect of the
number of resamples on the performance of the algorithms.
Random Forests were the main focus of this thesis, and for the future we suggest experimenting with
iterated retraining with other ensemble methods, such as boosted trees [43].
We discussed an artificial dataset for which existing feature selection algorithms are unable to rank
the relevant features above the irrelevant features. A thorough investigation of such datasets is an
essential step toward understanding feature interactions and their influence on an algorithm's feature selection.
Bibliography
[1] Delve Datasets. Accessed 7-July-2013 from http://www.cs.toronto.edu/~delve/data/datasets.html.
[2] J Alcalá-Fdez, L Sánchez, S García, M J Del Jesus, S Ventura, J M Garrell, J Otero, C Romero, J Bacardit, V M Rivas, et al. KEEL: A Software Tool to Assess Evolutionary Algorithms for Data Mining Problems. Soft Computing, 13(3):307–318, 2009.
[3] Yali Amit and Donald Geman. Shape Quantization and Recognition with Randomized Trees. Neural Computation, 9(7):1545–1588, 1997.
[4] Sylvain Arlot and Alain Celisse. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79, 2010.
[5] A Asuncion and DJ Newman. UCI Machine Learning Repository, 2010. URL http://archive.ics.uci.edu/ml. University of California, Irvine, School of Information and Computer Sciences.
[6] Francis R Bach. Bolasso: Model Consistent Lasso Estimation through the Bootstrap. In Proceedings of the 25th International Conference on Machine Learning, pages 33–40. ACM, 2008.
[7] Lars Bergstrom. Measuring NUMA effects with the STREAM benchmark. arXiv preprint arXiv:1103.3225, 2011.
[8] Gérard Biau. Analysis of a Random Forests Model. Journal of Machine Learning Research, 13:1063–1095, June 2012. ISSN 1532-4435.
[9] Gérard Biau and Luc Devroye. On the layered nearest neighbour estimate, the bagged nearest neighbour estimate and the random forest method in regression and classification. Journal of Multivariate Analysis, 101(10):2499–2518, 2010.
[10] J A Blackard and D J Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24:131–151, 1999.
[11] Avrim L Blum and Pat Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1):245–271, 1997.
[12] Léon Bottou and Chih-Jen Lin. Support vector machine solvers. Large Scale Kernel Machines, pages 301–320, 2007.
[13] Anne-Laure Boulesteix, Silke Janitza, Jochen Kruppa, and Inke R. König. Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(6):493–507, 2012. ISSN 1942-4795.
[14] Stephen P Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK; New York, 2004.
[15] Leo Breiman. Bagging Predictors. Machine Learning, 24(2):123–140, 1996.
[16] Leo Breiman. Heuristics of instability and stabilization in model selection. The Annals of Statistics, 24(6):2350–2383, 1996.
[17] Leo Breiman. Stacked regressions. Machine Learning, 24(1):49–64, 1996.
[18] Leo Breiman. Arcing Classifier. The Annals of Statistics, 26(3):801–849, 1998.
[19] Leo Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.
[20] Leo Breiman. Some Infinity Theory for Predictor Ensembles. Technical Report 579, Statistics Department, UC Berkeley, 2001. Accessed 7-July-2013 from http://www.stat.berkeley.edu/tech-reports/579.pdf.
[21] Leo Breiman. Looking inside the black box, 2002. Wald Lecture II, Department of Statistics, University of California, Berkeley. Accessed 30-March-2010 from http://stat-www.berkeley.edu/users/breiman/wald2002-2.pdf.
[22] Leo Breiman. Consistency for a simple model of random forests. Technical Report 670, Statistics Department, UC Berkeley, 2004. Accessed 7-July-2013 from www.stat.berkeley.edu/~breiman/RandomForests/consistencyRFA.pdf.
[23] Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone. Classification and Regression Trees. Wadsworth, Belmont, California, 1984.
[24] Peter Bühlmann and Lukas Meier. Discussion: One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1534–1541, 2008. Available at http://arxiv.org/pdf/0808.1013.
[25] Prabir Burman. A comparative study of ordinary cross-validation, v-fold cross-validation and the repeated learning-testing methods. Biometrika, 76(3):503–514, 1989.
[26] Emmanuel J Candès and Michael B Wakin. An introduction to compressive sampling. Signal Processing Magazine, IEEE, 25(2):21–30, 2008.
[27] Emmanuel J Candès, Michael B Wakin, and Stephen P Boyd. Enhancing sparsity by reweighted l1 minimization. Journal of Fourier Analysis and Applications, 14(5):877–905, 2008.
[28] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[29] Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin. Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines. The Journal of Machine Learning Research, 9:1369–1398, 2008.
[30] William H Cooke, Kathy L Ryan, and Victor A Convertino. Lower body negative pressure as a model to study progression to acute hemorrhagic shock in humans. Journal of Applied Physiology, 96(4):1249–1261, 2004.
[31] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research, 2:265–292, 2002.
[32] Adele Cutler and Guohua Zhao. PERT-perfect random tree ensembles. Computing Science and Statistics, 33:490–497, 2001.
[33] D Richard Cutler, Thomas C Edwards Jr, Karen H Beard, Adele Cutler, Kyle T Hess, Jacob Gibson, and Joshua J Lawler. Random forests for classification in ecology. Ecology, 88(11):2783–2792, 2007.
[34] A C Davison and D V Hinkley. Bootstrap Methods and Their Application (Cambridge Series in Statistical and Probabilistic Mathematics, No 1). Cambridge University Press, 1 edition, October 1997. ISBN 0521574714.
[35] Janez Demšar. Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 7:1–30, Dec 2006. ISSN 1532-4435.
[36] Persi Diaconis and Bradley Efron. Computer-intensive Methods in Statistics. Scientific American, pages 116–130, May 1983.
[37] Ramón Díaz-Uriarte. varSelRF: Variable selection using Random Forests, 2009. URL http://ligarto.org/rdiaz/Software/Software.html. R package version 0.7-1.
[38] Thomas G Dietterich. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7):1895–1923, Oct 1998. ISSN 0899-7667.
[39] Thomas G Dietterich. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization. Machine Learning, 40(2):139–157, 2000.
[40] David L Donoho. Compressed Sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006. ISSN 0018-9448.
[41] Bradley Efron and Robert Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, 1993.
[42] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. Annals of Statistics, 32:407–499, 2004.
[43] Jane Elith, John R Leathwick, and Trevor Hastie. A working guide to boosted regression trees. Journal of Animal Ecology, 77(4):802–813, 2008.
[44] Jianqing Fan and Runze Li. Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
[45] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
[46] Agner Fog. Software Optimization Resources. Accessed 1-March-2013 from http://www.agner.org/optimize/.
[47] Yoav Freund and Robert E Schapire. Experiments with a New Boosting Algorithm. International Conference on Machine Learning, 96:148–156, 1996.
[48] Jerome Friedman. Fast Sparse Regression and Classification. International Journal of Forecasting, 28(3):722–738, 2012. Accessed 7-July-2013 from http://www-stat.stanford.edu/~jhf/ftp/GPSpub.pdf.
[49] Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1):1, 2010.
[50] Wenjiang J. Fu. Penalized Regressions: The Bridge versus the Lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.
[51] G Fung and O L Mangasarian. A Feature Selection Newton Method for Support Vector Machine Classification. Technical Report 02-03, Data Mining Institute, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, September 2002. Accessed 7-July-2013 from ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/02-01.ps.
[52] Salvador García and Francisco Herrera. An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons. Journal of Machine Learning Research, 9:2677–2694, 2008.
[53] Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.
[54] J Giraldez, A Jaiantilal, J Walz, S Suryanarayanan, S Sankaranarayanan, H E Brown, and E Chang. An evolutionary algorithm and acceleration approach for topological design of distributed resource islands. PowerTech, 2011 IEEE Trondheim, pages 1–8, June 2011.
[55] Isabelle Guyon, Steve Gunn, and Gideon Dror. Result Analysis of the NIPS 2003 Feature Selection Challenge. Advances in Neural Information Processing Systems, 17:545–552, 2005.
[56] Isabelle Guyon, Amir Saffari, Gideon Dror, and Gavin Cawley. Agnostic Learning vs. Prior Knowledge Challenge. International Joint Conference on Neural Networks (IJCNN), pages 829–834, 2007. ISSN 1098-7576. doi: 10.1109/IJCNN.2007.4371065.
[57] Isabelle Guyon, Gavin Cawley, Gideon Dror, and Vincent Lemaire. Results of the Active Learning Challenge. Active Learning and Experimental Design workshop in conjunction with AISTATS 2010, 16:19–45, 2011.
[58] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H Witten. The WEKA data mining software: an update. SIGKDD Explor. Newsl., 11(1):10–18, Nov 2009. ISSN 1931-0145.
[59] Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.
[60] Tin Kam Ho. The Random Subspace Method for Constructing Decision Forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:832–844, August 1998.
[61] A E Hoerl. Application of ridge analysis to regression problems. Chemical Engineering Progress, 58:54–59, 1962.
[62] Torsten Hothorn, Kurt Hornik, Carolin Strobl, and Achim Zeileis. party: A Laboratory for Recursive Partytioning. R package version 0.9-999, http://CRAN.R-project.org/package=party, 2009.
[63] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A Practical Guide to Support Vector Classification. Technical report, 2003.
[64] Jian Huang, Shuangge Ma, and Cun-Hui Zhang. Adaptive Lasso for Sparse High-Dimensional Regression Models. Technical Report 374, Department of Statistics and Actuarial Science, University of Iowa, Nov 2006. Accessed 7-July-2013 from http://statdev.divms.uiowa.edu/sites/default/files/techrep/tr374.pdf.
[65] Laurent Hyafil and Ronald L. Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1):15–17, 1976.
[66] Ross Ihaka and Robert Gentleman. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3):299–314, 1996.
[67] Abhishek Jaiantilal. Classification and Regression via randomforest-matlab. Accessed 30-March-2010 from http://code.google.com/randomforest-matlab, 2010.
[68] Abhishek Jaiantilal and Gregory Grudic. Increasing Feature Selection Accuracy for L1 Regularized Linear Models. JMLR: Proceedings of the Fourth International Workshop on Feature Selection in Data Mining, 10:86–96, June 2010.
[69] Lars Juhl Jensen and Alex Bateman. The rise and fall of supervised machine learning techniques. Bioinformatics, 27(24):3331–3332, 2011.
[70] George H John, Ron Kohavi, and Karl Pfleger. Irrelevant Features and the Subset Selection Problem. In Eleventh International Conference on Machine Learning, volume 94, pages 121–129. Morgan Kaufmann, 1994.
[71] Li Jun, Zhang Shunyi, Xuan Ye, and Sun Yanfei. Identifying Skype traffic by Random Forest. In International Conference on Wireless Communications, Networking and Mobile Computing, pages 2841–2844. IEEE, 2007.
[72] R K Pace and Ronald Barry. Sparse Spatial Autoregressions. Statistics & Probability Letters, 33(3):291–297, 1997.
[73] Kenji Kira and Larry A Rendell. A Practical Approach to Feature Selection. In Derek H. Sleeman and Peter Edwards, editors, Ninth International Workshop on Machine Learning, pages 249–256. Morgan Kaufmann, 1992.
[74] Ron Kohavi and George H John. Wrappers for Feature Subset Selection. Artificial Intelligence, 97(1):273–324, 1997.
[75] Igor Kononenko. Estimating Attributes: Analysis and Extensions of RELIEF. In Francesco Bergadano and Luc De Raedt, editors, European Conference on Machine Learning, pages 171–182. Springer, 1994.
[76] Miron B Kursa and Witold R Rudnicki. Feature Selection with the Boruta Package. Journal of Statistical Software, 36(11):1–13, 9 2010. ISSN 1548-7660. URL http://www.jstatsoft.org/v36/i11.
[77] Ching-Pei Lee and Chih-Jen Lin. Large-scale Linear RankSVM. Accessed 7-July-2013 from http://www.csie.ntu.edu.tw/~cjlin/papers/ranksvm/ranksvml2.pdf, 2013.
[78] Andy Liaw and Matthew Wiener. Classification and Regression by randomForest. R News, 2(3):18–22, 2002. URL http://CRAN.R-project.org/doc/Rnews/.
[79] Yi Lin and Yongho Jeon. Random Forests and Adaptive Nearest Neighbors. Technical Report 1055, Department of Statistics, University of Wisconsin, 2002.
[80] Michael EJ Masson and Geoffrey R Loftus. Using Confidence Intervals for Graphically Based Data Interpretation. Canadian Journal of Experimental Psychology, 57(3):203–220, 2003.
[81] E H Moore. On the reciprocal of the general algebraic matrix. Bulletin of the American Mathematical Society, 26:394–395, 1920.
[82] Steven L Moulton, Stephanie Haley-Andrews, and Jane Mulligan. Emerging Technologies for Pediatric and Adult Trauma Care. Current Opinion in Pediatrics, 22(3):332, 2010.
[83] Radford M Neal. Software by Radford M Neal. Accessed 1-March-2013 from http://www.cs.toronto.edu/~radford/software-online.html.
[84] Radford M Neal. Bayesian learning for neural networks. PhD thesis, University of Toronto, 1995.
[85] Roger Penrose. A generalized inverse for matrices. Proceedings of the Cambridge Philosophical Society, 51:406–413, 1955.
[86] John C Platt. Fast Training of Support Vector Machines using Sequential Minimal Optimization. In Bernhard Schölkopf, Christopher JC Burges, and Alexander J Smola, editors, Advances in Kernel Methods: Support Vector Learning, Cambridge, MA, 1998. MIT Press.
[87] Dimitris N Politis, Joseph P Romano, and Michael Wolf. Subsampling. Springer, 1999.
[88] Michael J Procopio. Hand-labeled DARPA LAGR datasets. Used to be available at http://ml.cs.colorado.edu/~procopio/labeledlagrdata/, 2007.
[89] Michael J Procopio, Jane Mulligan, and Greg Grudic. Learning Terrain Segmentation with Classifier Ensembles for Autonomous Robot Navigation in Unstructured Environments. Journal of Field Robotics, 26(2):145–175, 2009.
[90] John Ross Quinlan. C4.5: Programs for Machine Learning, volume 1. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993. ISBN 1-55860-238-0.
[91] John Ross Quinlan. Combining Instance-Based and Model-Based Learning. In International Conference on Machine Learning, pages 236–243, June 1993.
[92] Ryan Rifkin and Aldebaro Klautau. In Defense of One-Vs-All Classification. The Journal of Machine Learning Research, 5:101–141, 2004.
[93] Marko Robnik-Šikonja and Igor Kononenko. An adaptation of Relief for attribute estimation in regression. In Douglas H. Fisher, editor, Fourteenth International Conference on Machine Learning, pages 296–304. Morgan Kaufmann, 1997.
[94] Guilherme V Rocha, Xing Wang, and Bin Yu. Asymptotic distribution and sparsistency for l1-penalized parametric m-estimators with applications to linear svm and logistic regression, 2009. URL http://www.citebase.org/abstract?id=oai:arXiv.org:0908.1940.
[95] R Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1997.
[96] Lior Rokach. Ensemble-based classifiers. Artificial Intelligence Review, 33(1-2):1–39, 2010.
[97] Saharon Rosset and Ji Zhu. Piecewise linear regularized solution paths. Annals of Statistics, 35(3):1012–1030, 2007.
[98] Saharon Rosset, Ji Zhu, and Trevor Hastie. Boosting as a Regularized Path to a Maximum Margin Classifier. Journal of Machine Learning Research, 5:941–973, 2004.
[99] Yvan Saeys, Iñaki Inza, and Pedro Larrañaga. A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507–2517, 2007.
[100] Robert E Schapire. The Boosting Approach to Machine Learning: An Overview. Lecture Notes in Statistics, Springer Verlag, New York, pages 149–172, 2003.
[101] Bernhard Schölkopf and Alexander J Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. The MIT Press, 2002.
[102] Daniel F Schwarz, Inke R König, and Andreas Ziegler. On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data. Bioinformatics, 26(14):1752–1758, 2010.
[103] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[104] Kai-Quan Shen, Chong-Jin Ong, Xiao-Ping Li, Zheng Hui, and Einar PV Wilder-Smith. A Feature Selection Method for Multilevel Mental Fatigue EEG Classification. IEEE Transactions on Biomedical Engineering, 54(7):1231–1237, 2007.
[105] Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, and Andrew Blake. Real-Time Human Pose Recognition in Parts from Single Depth Images. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1297–1304. IEEE, 2011.
[106] K. Sjöstrand. Matlab implementation of LASSO, LARS, the elastic net and SPCA. http://www2.imm.dtu.dk/pubdb/p.php?3897, June 2005. Version 2.0.
[107] Stephen J Smith, Nick Ellis, and C Roland Pitcher. Conditional variable importance in R package extendedForest. Accessed 7-July-2013 from https://r-forge.r-project.org/scm/viewvc.php/pkg/extendedForest/?root=gradientforest, 2011.
[108] Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra. MPI: The Complete Reference, Volume 1: The MPI Core. MIT Press, 1998.
[109] K P Soman, S Diwakar, and V Ajay. Insight Into Data Mining: Theory and Practice. Prentice-Hall of India Private Limited, 2006. ISBN 9788120328976.
[110] Carolin Strobl and Achim Zeileis. Danger: High power! Exploring the statistical properties of a test for random forest variable importance. Technical Report 017, Department of Statistics, University of Munich, 2008.
[111] Carolin Strobl, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin, and Achim Zeileis. Conditional variable importance for random forests. BMC Bioinformatics, 9(1):307, 2008. ISSN 1471-2105. doi: 10.1186/1471-2105-9-307.
[112] Robert Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58:267–288, 1996.
[113] Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(1):91–108, 2005.
[114] Roman Timofeev. Classification and Regression Trees (CART) Theory and Applications. Master's thesis, 2004.
[115] Jameson L Toole, Michael Ulm, Marta C Gonzalez, and Dietmar Bauer. Inferring land use from mobile phone activity. In Proceedings of the ACM SIGKDD International Workshop on Urban Computing, pages 1–8. ACM, 2012.
[116] Eugene Tuv, Alexander Borisov, George Runger, and Kari Torkkola. Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination. The Journal of Machine Learning Research, 10:1341–1366, 2009.
[117] Andrey Nikolayevich Tychonoff. On the stability of inverse problems. Doklady Akademii Nauk SSSR, 39:195–198, 1943.
[118] M Van Breukelen, Robert PW Duin, David MJ Tax, and JE den Hartog. Handwritten digit recognition by combined classifiers. Kybernetika, 34(4):381–386, 1998.
[119] Vladimir N Vapnik. Statistical Learning Theory. Wiley-Interscience, September 1998. ISBN 0471030031.
[120] Pantelis Vlachos. StatLib. http://lib.stat.cmu.edu/datasets/.
[121] J D Watts and R L Lawrence. Merging random forest classification with an object-oriented approach for analysis of agricultural lands. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, XXXVII (B7), 2008.
[122] Sanford Weisberg. Applied Linear Regression. Wiley, 2005.
[123] Jason Weston and Chris Watkins. Multi-class Support Vector Machines. Technical Report CSD-TR-98-04, Department of Computer Science, Royal Holloway, University of London, May 1998.
[124] Frank Wilcoxon. Individual Comparisons by Ranking Methods. Biometrics Bulletin, 1(6):80–83, 1945.
[125] David H Wolpert. Stacked Generalization. Neural Networks, 5:241–259, 1992.
[126] David H Wolpert. What the No Free Lunch Theorems Really Mean; How to Improve Search Algorithms. Available at http://www.santafe.edu/media/workingpapers/12-10-017.pdf, 2012.
[127] David H Wolpert and William G. Macready. No Free Lunch Theorems For Optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, 1997.
[128] Pengyi Yang, Yee Hwa Yang, Bing B Zhou, and Albert Y Zomaya. A Review of Ensemble Methods in Bioinformatics. Current Bioinformatics, 5(4):296–308, 2010.
[129] Lei Yu and Huan Liu. Efficient Feature Selection via Analysis of Relevance and Redundancy. The Journal of Machine Learning Research, 5:1205–1224, 2004.
[130] Peng Zhao and Bin Yu. On Model Selection Consistency of Lasso. The Journal of Machine Learning Research, 7:2541–2563, 2006.
[131] Ji Zhu, Saharon Rosset, Trevor Hastie, and Rob Tibshirani. 1-norm Support Vector Machines. Advances in Neural Information Processing Systems, 16(1):49–56, 2004.
[132] Hui Zou. An Improved 1-norm SVM for Simultaneous Classification and Variable Selection. AISTATS, 2:675–681, 2007.
[133] Hui Zou. The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association, 101(476):1418–1429, December 2006.
[134] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
[135] Hui Zou and Runze Li. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1509–1533, 2008.
Appendix A
L1-SVM2 Algorithm
In this appendix, we detail the L1-SVM2 algorithm. First, we discuss the optimization of a doubly
differentiable loss function regularized with the L1 penalty.
A.1 Optimization of a Doubly Differentiable Loss Function Regularized with the L1 Penalty
This section summarizes the method, proposed by Rosset and Zhu [97], that solves for the entire
regularization path for a doubly differentiable loss function with the lasso penalty.
A.1.1 Necessary and Sufficient Conditions for Optimality for the Whole Regularization Path
Lasso-penalty-regularized algorithms for a doubly differentiable loss function can be formulated as

\min_{\beta} \; \sum_{i=1}^{n} L(x_i, y_i, \beta) + \lambda \sum_{j=1}^{p} |\beta_j|,    (A.1)

where n is the number of examples and p is the number of features. \beta_j, the parameter for feature j,
is substituted as \beta_j = \beta_j^+ - \beta_j^-, with \beta_j^+ \geq 0 and \beta_j^- \geq 0; this
decomposition reduces the 2^p constraints, due to |\beta|, to 2p + 1 constraints. We also substitute
L(x_i, y_i, \beta) = L(y_i, (\beta^+ - \beta^-) x_i) to obtain the primal and dual formulations of Equation A.1 as

Primal: \min_{\beta^+, \beta^-} \; \sum_i L(y_i, (\beta^+ - \beta^-) x_i) + \lambda \sum_j (\beta_j^+ + \beta_j^-), \quad \text{such that } \forall j,\; \beta_j^+ \geq 0 \text{ and } \beta_j^- \geq 0,    (A.2)

Dual: \min_{\beta^+, \beta^-} \; \sum_i L(y_i, (\beta^+ - \beta^-) x_i) + \lambda \sum_j (\beta_j^+ + \beta_j^-) - \sum_j \gamma_j^+ \beta_j^+ - \sum_j \gamma_j^- \beta_j^-.    (A.3)
Henceforth, we use the shorthand notation L(β) instead of L(xi, yi, β).
We use the Karush-Kuhn-Tucker (KKT) conditions to examine the properties of the minima of
Equation A.1. The primal formulation consists of the loss function and inequality constraints. From the KKT
complementary slackness conditions, we obtain \gamma_j^+ \beta_j^+ = 0 and \gamma_j^- \beta_j^- = 0. Then,
differentiating Equation A.3 with respect to \beta_j^+ and \beta_j^- and setting the gradients to 0, we obtain
[\nabla L(\beta)]_j + \lambda - \gamma_j^+ = 0 and -[\nabla L(\beta)]_j + \lambda - \gamma_j^- = 0.

At the minima of Equation A.3, for fixed \lambda > 0 (at \lambda = 0, the minimum is the unpenalized
solution with \nabla L(\beta) = 0), we get

if \beta_j^+ > 0, \lambda > 0 \Rightarrow \gamma_j^+ = 0 \Rightarrow [\nabla L(\beta)]_j = -\lambda < 0 \Rightarrow \gamma_j^- = 2\lambda > 0 \Rightarrow \beta_j^- = 0,    (A.4)

if \beta_j^- > 0, \lambda > 0 \Rightarrow \gamma_j^- = 0 \Rightarrow [\nabla L(\beta)]_j = \lambda > 0 \Rightarrow \gamma_j^+ = 2\lambda > 0 \Rightarrow \beta_j^+ = 0.    (A.5)

From Equations A.4 and A.5, we surmise that either \beta_j^+ or \beta_j^- (and similarly either \gamma_j^+
or \gamma_j^-) is non-zero, but not both simultaneously. When |[\nabla L(\beta)]_j| = \lambda, feature j can
have a non-zero coefficient \beta_j, whereas features with |[\nabla L(\beta)]_j| < \lambda have \beta_j = 0.
Thus, for each value of \lambda, the features divide into an active set A = \{j \in \{1, \dots, p\} : \beta_j \neq 0\}
and its complement A^c, formulated as

j \in A \Rightarrow |[\nabla L(\beta)]_j| = \lambda, \; \mathrm{sign}([\nabla L(\beta)]_j) = -\mathrm{sign}(\beta_j),    (A.6)

j \notin A \; (\text{i.e., } j \in A^c) \Rightarrow |[\nabla L(\beta)]_j| < \lambda.    (A.7)
Now that we have examined the properties of the minima of Equation A.1, we utilize Newton's method
for optimizing it. Instead of solving for a fixed value of \lambda, we start with \lambda = \infty, at which
the active set A is empty. We then gradually decrease \lambda, which allows features to enter A; the KKT
conditions are violated whenever a feature moves between A and A^c, which requires a recalculation of the
gradients. We briefly note the calculation of the gradients and the conditions that violate the KKT conditions
as follows:
Initially, choose \beta^0 = (0, \dots, 0) \in \mathbb{R}^p.    (A.8)

According to Newton's method, \beta^{k+1} = \beta^k - [\nabla^2 f(\beta^k)]^{-1} \nabla f(\beta^k), \quad k = 0, 1, \dots,    (A.9)

where f(\beta) = L(\beta) + \lambda J(\beta) with J(\beta) = \sum_j |\beta_j|. At a minimum, \nabla L(\beta) + \lambda \nabla J(\beta) = 0, so along the path the active coefficients move in the direction

\frac{\partial \beta^k_A}{\partial \lambda} = -[\nabla^2 L(\beta^k_A)]^{-1} \nabla J(\beta^k_A) = -[\nabla^2 L(\beta^k_A)]^{-1} \,\mathrm{sign}(\beta^k_A).    (A.10)

Note that the notation \beta^k_j refers to the value for feature j at iteration k, whereas \beta^k refers to the parameter
vector at iteration k.
The KKT conditions are violated:

(1) When a feature not in the active set A achieves |[\nabla L(\beta)]_j| = \lambda.

(2) When an active coefficient reaches 0, since the condition \mathrm{sign}([\nabla L(\beta)]_j) =
-\mathrm{sign}(\beta_j) no longer holds.

(3) If the loss function is not differentiable everywhere, then the gradient direction becomes invalid
at the points where the loss function is not differentiable; whenever we encounter such a non-differentiable
transition, the gradient direction is recalculated.
Next, we discuss an algorithm that calculates the entire regularization path by properly handling the
KKT conditions.
Algorithm for Optimizing Doubly Differentiable Loss Functions with the L1 Penalty. The
algorithm is as follows:

(1) Initialize \beta = 0_{1..p}, \lambda = \max_j |[\nabla L(\beta)]_j|, A = \arg\max_j |[\nabla L(\beta)]_j|,
\gamma_A = -\mathrm{sign}(\nabla L(\beta))_A, \gamma_{A^c} = 0, where p is the number of features.

(2) While (\max |\nabla L(\beta)| > 0):

a) d_1 = \min\{d > 0 : |\nabla L(\beta_j + d\gamma_j)|_{j \notin A} = |\nabla L(\beta_m + d\gamma_m)|_{m \in A}\} (add features to A).

b) d_2 = \min\{d > 0 : (\beta_j + d\gamma_j)_{j \in A} = 0\} (delete features from A).

c) d_3 = \min\{d > 0 : r(y_i, (\beta + d\gamma) x_i) = 0 \text{ hits a `knot'}, \; i = 1 \dots n\}, where n is the number of examples.

d) Set d = \min(d_1, d_2, d_3) and \beta = \beta + d\gamma.

e) \gamma_{j \in A} = [\nabla^2 L(\beta_{j \in A})]^{-1}(-\mathrm{sign}(\beta_{j \in A})), \quad \gamma_{j \in A^c} = 0.

In step 2c), the function r() defines the distance of a training example to a knot point (a non-differentiable
point). In step 2d), we take the smallest increment until a KKT condition is violated.
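Step (1) of the algorithm can be checked numerically. For the squared hinge loss of Section A.2 at β = 0 and β0 = 0, every margin equals 1, so the gradient for feature j reduces to −2 Σ_i y_i x_{ij}; the path therefore starts at λ = max_j |∇L_j|, with the maximizing feature entering the active set first. A small check on synthetic data (the data and the label-driving feature are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4
X = rng.standard_normal((n, p))
y = np.sign(X[:, 2] + 0.1 * rng.standard_normal(n))   # labels driven by feature 2

def grad_at_zero(X, y):
    """Gradient of sum_i (1 - y_i*(x_i beta + beta_0))_+^2 at beta = 0, beta_0 = 0."""
    margins = np.ones(len(y))              # all margins equal 1 at the origin
    return -2.0 * X.T @ (y * margins)      # -2 * sum_i y_i * x_{ij}

g = grad_at_zero(X, y)
lam_max = np.abs(g).max()                  # lambda at which the path starts
first_active = int(np.abs(g).argmax())     # first feature to enter A
```

As expected, the feature that drives the labels has the largest gradient magnitude and would enter the active set first.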
A.2 L1-SVM2
In this section, we present the derivations of the piecewise algorithm that calculates the entire
regularization path for the squared hinge loss function with the L1 penalty (termed L1-SVM2), which is
defined as

\text{L1-SVM2:} \quad \min_{\beta, \beta_0} \Big\{ \lambda \|\beta\|_1 + \sum_i (1 - y_i (x_i \beta + \beta_0))_+^2 \Big\}, \qquad (v)_+ = \max(v, 0).    (A.11)
Our notation is as follows: '∗' denotes the elementwise product, and \mathbf{1}_n is a vector of n ones.
The input examples form an n × p matrix X, where n is the number of examples and p is the number of
features; x_{:,j} denotes the jth column of X, the data for the jth feature, and x_{i,:} denotes the ith row,
the data for the ith example. We write the loss function L(X, y, \beta) in shorthand as L. We derive the
gradient of the loss function and then the step size d_1, for the algorithm discussed in the previous section,
as
L = (\mathbf{1}_n - y * (X\beta) - y\beta_0)_+^2,

For each feature j, \nabla L_j = -2\,(\mathbf{1}_n - y * (X\beta + \beta_0))_+^T (y * x_{:,j}),

\nabla^2 L_{j,j} = 2(-y * x_{:,j})^T(-y * x_{:,j}) = 2\,x_{:,j}^T x_{:,j}, \qquad \nabla^2 L_{j,k} = 2(-y * x_{:,k})^T(-y * x_{:,j}) = 2\,x_{:,j}^T x_{:,k},

\nabla^2 L_{j,\beta_0} = 2(-y * x_{:,j})^T(-y) = 2\sum_k x_{k,j}, \qquad \nabla^2 L_{\beta_0,\beta_0} = 2\,y^T y = 2n.

For finding d_1, \; |\nabla L(\beta_a + d_1\gamma_a)|_{a \in A} = |\nabla L(\beta_j + d_1\gamma_j)|_{j \in A^c} \;\Rightarrow\; \nabla L(\beta_a + d_1\gamma_a)_{a \in A} = \pm \nabla L(\beta_j + d_1\gamma_j)_{j \in A^c},

\Rightarrow (\mathbf{1}_n - y * (X\beta + \beta_0 + d_1\gamma_0 + X d_1\gamma))^T (y * x_{:,a}) = \pm (\mathbf{1}_n - y * (X\beta + \beta_0 + d_1\gamma_0 + X d_1\gamma))^T (y * x_{:,j}),

\Rightarrow (\mathbf{1}_n - y * (X\beta + \beta_0))^T (y * x_{:,a}) \mp (\mathbf{1}_n - y * (X\beta + \beta_0))^T (y * x_{:,j}) = d_1 \big( (y * (\gamma_0 + X\gamma))^T (y * x_{:,a}) \mp (y * (\gamma_0 + X\gamma))^T (y * x_{:,j}) \big),

\Rightarrow \text{Step size } d_1 = \min\left\{ \frac{A\_Correlation - J\_Correlation}{A\_val - J\_val},\; \frac{A\_Correlation + J\_Correlation}{A\_val + J\_val} \right\},    (A.12)

where A\_Correlation = (\mathbf{1}_n - y * (X\beta + \beta_0))^T (y * x_{:,a}), \; J\_Correlation = (\mathbf{1}_n - y * (X\beta + \beta_0))^T (y * x_{:,j}),
A\_val = (\gamma_0 + X\gamma)^T x_{:,a}, \; J\_val = (\gamma_0 + X\gamma)^T x_{:,j} \quad (\because y * y = \mathbf{1}_n),    (A.13)
where γ is a p × 1 vector and γ0 is a scalar, containing the direction of the gradient for β and β0,
respectively. Note that, for L1-SVM2, the step size d1 is the minimum over the ratios of differences (and sums)
of the generalized correlations to the corresponding differences (and sums) of the directional estimates.
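Equation A.12 can be computed directly. The sketch below assumes y ∈ {±1}, treats '∗' as the elementwise product, and uses illustrative variable names; it is not the thesis implementation.

```python
import numpy as np

def step_to_next_feature(X, y, beta, beta0, gamma, gamma0, a, j):
    """Step size d1 (Eq. A.12): the step at which inactive feature j reaches
    the generalized correlation of active feature a."""
    r = 1.0 - y * (X @ beta + beta0)        # (1_m - y * (X beta + beta0))
    corr_a = r @ (y * X[:, a])              # "A Correlation"
    corr_j = r @ (y * X[:, j])              # "J Correlation"
    v = gamma0 + X @ gamma                  # change in the fit along the step
    val_a = v @ X[:, a]                     # "A val"
    val_j = v @ X[:, j]                     # "J val"
    cands = [(corr_a - corr_j) / (val_a - val_j),
             (corr_a + corr_j) / (val_a + val_j)]
    return min(d for d in cands if d > 0)

# Tiny illustrative call: feature 0 active, feature 1 inactive.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
d1 = step_to_next_feature(X, y, np.zeros(2), 0.0, np.array([1.0, 0.0]), 0.0, a=0, j=1)
```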
Finding γ0. The algorithm discussed in the previous section keeps all the features in the active set A at
the same gradient value. Thus, we cannot directly solve for γ0 (this is further complicated by the fact that the
penalty gradient with respect to β0 is always 0).
We obtain the value of γ0 in terms of γ as
∇β0L = ∇β0(1m − y ∗ (Xβ) − y ∗ β0)2+ = −2(1m − y ∗ (Xβ) − y ∗ β0)T+ y,   (A.14)

Setting ∇β0(L(β) + λJ(β)) = 0 (the penalty J does not involve β0) gives

(1m − y ∗ (Xβ) − y ∗ β0)T y = 0,

and for a small step d1:  (1m − y ∗ (Xβ) − y ∗ β0 − y ∗ (X d1γ) − y ∗ (d1γ0))T y = 0,

⇒ (y ∗ (Xγ + γ0))T y = 0  ⇒  (y ∗ (Xγ))T y + (y ∗ γ0)T y = 0,

⇒ Σ(Xγ) + nγ0 = 0  ⇒  γ0 = −Σ(Xγ) / n.   (A.15)
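A quick numeric check of Equation A.15: with γ0 = −Σ(Xγ)/n, the intercept condition Σ(Xγ) + nγ0 = 0 holds along the step direction. The random matrix below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))       # illustrative n=6, p=3
gamma = rng.normal(size=3)

gamma0 = -(X @ gamma).sum() / X.shape[0]            # Eq. A.15
residual = (X @ gamma).sum() + X.shape[0] * gamma0  # should vanish
```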
Appendix B
Appendix for Chapter 4
B.1 Description of Real World Datasets
We describe the datasets used in our experiments.
Classification Datasets.
(1) a1a - dataset derived from the adult dataset. In this dataset, the continuous features were discretized
into quantiles and each quantile was binarized. Categorical features with m categories were converted
to m binary features. Details presented in Platt [86].
(2) australian - dataset containing information about credit card applications. The feature information is
not available. Dataset obtained from Chang and Lin [28].
(3) german.numer - dataset contains information about credit card applications. The feature information in-
cludes amount present in checking account, credit history, employment, among others. Dataset obtained
from Chang and Lin [28].
(4) heart - dataset is used to predict the presence or absence of heart disease. Features include age, sex,
chest pain, among others. Dataset obtained from Chang and Lin [28].
(5) hillvalley (noisy) - each example in this dataset represents 100 points on a two-dimensional graph. The
points will either create a hill (bump in the terrain) or a valley (dip in the terrain). We used the noisy
version of this dataset where the terrain is uneven and the hills and the valleys are not obvious. Dataset
obtained from Asuncion and Newman [5].
(6) ionosphere - detection of ‘good’ vs. ‘bad’ radar targets; the ‘good’ radar targets show evidence of some
structure in the ionosphere and the ‘bad’ targets do not. Features are 17 radar pulses each with 2
attributes. Dataset can be obtained from either Chang and Lin [28] or Asuncion and Newman [5].
(7) Chess (kr vs kp) - the dataset represents the positions of chess pieces in an end game where a pawn on a7
is one square away from queening. The task is to determine whether the white player can win. Dataset
obtained from Asuncion and Newman [5].
(8) BUPA liver-disorders - detection of liver disorders. The data consist of 6 features: 5 are blood
tests, and the 6th is the number of half-pint drinks drunk by the individual. Dataset can be
obtained from either Chang and Lin [28] or Asuncion and Newman [5].
(9) pimadiabetes - detection of diabetes for female patients of Pima Indian heritage. Features include
blood tests, blood pressure, age, among others. Dataset can be obtained from either Chang and Lin [28]
or Asuncion and Newman [5].
(10) spambase - distinguishing spam from non-spam emails. Dataset can be obtained from either
Asuncion and Newman [5] or Alcala-Fdez et al. [2].
(11) splice - “Splice junctions are points on a DNA sequence at which ‘superfluous’ DNA is removed during
the process of protein creation in higher organisms. The problem posed in this dataset is to recognize,
given a sequence of DNA, the boundaries between exons (the parts of the DNA sequence retained after
splicing) and introns (the parts of the DNA sequence that are spliced out). This problem consists of
two sub tasks: recognizing exon/intron boundaries (referred to as EI sites), and recognizing intron/exon
boundaries (IE sites). (In the biological community, IE borders are referred to a ‘acceptors’ while EI
borders are referred to as ‘donors’.) ” - description at Asuncion and Newman [5].
(12) svmguide3 - dataset was presented in Hsu et al. [63]. According to the authors, it represents some form
of vehicle information.
(13) w1a - the goal is to classify whether a web page belongs to a category or not. The input is 300 sparse
binary keyword attributes extracted from each web page. Dataset was presented in Platt [86].
(14) breast-cancer (diagnostic) / WDBC - detection of malignant or benign breast cancer tumor. Features
were computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and describe
the characteristics of the cell nuclei present in the image including the radius, texture, among others.
Dataset is available at Asuncion and Newman [5].
(15) breast-cancer - detection of malignant or benign breast cancer tumor. This dataset has different features
than the breast-cancer (diagnostic) dataset that was discussed earlier. Dataset is available at Asuncion
and Newman [5].
Regression Datasets.
(1) cadata (houses.zip) - “These spatial data contain 20,640 observations on housing prices with 9 economic
covariates” - according to K Pace and Barry [72].
(2) compactiv - “The data was collected from a Sun Sparcstation 20/712 with 128 Mbytes of memory
running in a multi-user university department. Users would typically be doing a large variety of tasks
ranging from accessing the internet, editing files or running very cpu-bound programs. The data was
collected continuously on two separate occasions. On both occasions, system activity was gathered every
5 seconds. The final dataset is taken from both occasions with equal numbers of observations coming
from each collection epoch in random order.” - description at Delve [1].
(3) diabetes - this dataset was discussed in Efron et al. [42]. The target value is the measure of disease
progression a year after baseline. The 10 baseline features include age, sex, bmi, among other features.
(4) housing - predicting house values in suburbs of Boston [5]. Features include crime rate, tax, age, among
others.
(5) mortgage - “This file contains the Economic data information of USA from 01/04/1980 to 02/04/2000
on a weekly basis. From given features, the goal is to predict the 30 year conventional mortgage rate.”
- description at Alcala-Fdez et al. [2].
(6) auto-mpg - “The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms
of 3 multi-valued discrete and 5 continuous attributes” - according to Quinlan [91]. Features include
cylinder, displacement, horsepower, among others.
(7) Pole communications - This is a commercial application described in Weiss and Indurkhya (1995) and
at http://www.cs.su.oz.au/~nitin. The data describes a telecommunication problem. No further
information is available.
(8) triazines - the dataset is part of “Qualitative Structure Activity Relationships” and available at Asuncion
and Newman [5].
(9) census - the data were collected as part of the 1990 US census. These are mostly counts accumulated
at different survey levels, all concerned with predicting the median price of a house in
a region based on its demographic composition and the state of the housing market in the region. We used
house-price-16H from the datasets available at Delve [1].
(10) Concrete compressive strength - “Concrete is the most important material in civil engineering. The
concrete compressive strength is a highly nonlinear function of age and ingredients.” - description at
Asuncion and Newman [5]. Eight quantitative input features are available.
B.2 Error Results on Artificial and Real World Datasets
Numerical results for the classification and regression datasets are presented in Table B.1. Numerical
results for artificial datasets are presented in Table B.2. Numerical results for number of features correctly
and incorrectly detected by various L1 penalized algorithms are presented in Table B.3.
Error % on Real World Regression Datasets using LARS

| Dataset   | MSA           | SBA           | Bootstrap     | Adaptive      | L1            |
| cadata    | 33.68 (8.74)  | 32.79 (6.45)  | 33.80 (8.71)  | 33.98 (9.53)  | 33.75 (8.73)  |
| compactiv | 2.56 (0.21)   | 2.56 (0.27)   | 2.58 (0.28)   | 2.54 (0.22)   | 2.57 (0.27)   |
| diabetes  | 55.37 (11.81) | 54.35 (10.26) | 53.93 (12.23) | 56.15 (11.66) | 55.76 (10.85) |
| housing   | 22.47 (21.80) | 19.45 (19.07) | 21.91 (20.44) | 17.49 (11.88) | 23.39 (23.62) |
| mortgage  | 0.07 (0.02)   | 0.07 (0.01)   | 0.13 (0.18)   | 0.09 (0.07)   | 0.08 (0.03)   |
| mpg       | 14.18 (5.35)  | 13.03 (3.08)  | 13.37 (3.42)  | 13.05 (3.49)  | 13.05 (3.23)  |
| pole      | 29.68 (0.86)  | 30.00 (0.85)  | 30.06 (0.85)  | 29.73 (0.93)  | 30.08 (0.90)  |
| triazines | 90.88 (23.33) | 83.90 (25.16) | 84.27 (17.56) | 89.72 (22.60) | 90.88 (18.01) |
| concrete  | 22.82 (3.53)  | 22.32 (3.74)  | 22.59 (3.75)  | 22.36 (3.52)  | 22.53 (3.71)  |
| census    | 52.56 (18.92) | 50.48 (14.22) | 52.61 (18.65) | 55.90 (28.93) | 52.31 (17.16) |
| average   | 32.43         | 30.90         | 31.52         | 32.10         | 32.44         |
| ranks     | 3.00          | 1.70          | 3.60          | 3.00          | 3.70          |

Error % on Real World Classification Datasets using L1-SVM2

| Dataset        | MSA           | SBA           | Bootstrap     | Adaptive      | L1            |
| a1a            | 18.63 (1.69)  | 18.19 (2.17)  | 18.93 (1.89)  | 18.38 (2.07)  | 18.37 (1.70)  |
| australian     | 15.36 (2.91)  | 14.49 (3.20)  | 14.35 (3.51)  | 14.78 (3.54)  | 14.93 (2.98)  |
| breastcancer   | 4.06 (3.77)   | 3.48 (3.56)   | 3.77 (3.70)   | 3.91 (3.72)   | 4.21 (3.87)   |
| germannumber   | 26.10 (2.77)  | 25.60 (1.65)  | 26.70 (2.11)  | 26.30 (1.49)  | 25.60 (1.71)  |
| heart          | 18.52 (7.20)  | 14.44 (5.91)  | 17.04 (7.03)  | 17.78 (6.72)  | 17.78 (7.57)  |
| hillvalley     | 19.03 (5.71)  | 13.27 (3.48)  | 24.68 (5.63)  | 24.50 (6.26)  | 28.36 (5.76)  |
| ionosphere     | 10.09 (5.51)  | 7.50 (3.94)   | 9.63 (4.37)   | 7.78 (3.41)   | 7.78 (3.88)   |
| krvskp         | 2.97 (1.12)   | 3.00 (0.90)   | 3.07 (1.04)   | 3.07 (1.18)   | 3.25 (0.92)   |
| liverdisorders | 30.81 (7.90)  | 29.24 (8.46)  | 32.38 (10.29) | 31.33 (7.94)  | 31.67 (7.87)  |
| pimadiabetes   | 23.94 (5.48)  | 23.29 (4.93)  | 23.42 (5.17)  | 23.30 (4.46)  | 23.55 (4.73)  |
| spambase       | 8.14 (1.02)   | 8.09 (1.24)   | 8.22 (0.75)   | 8.33 (0.93)   | 8.20 (0.80)   |
| splice         | 20.10 (3.21)  | 19.10 (3.11)  | 19.90 (4.51)  | 19.60 (3.31)  | 20.10 (3.78)  |
| svmguide3      | 16.34 (1.60)  | 16.11 (3.13)  | 17.21 (2.10)  | 17.15 (2.22)  | 16.90 (2.42)  |
| w1a            | 2.67 (0.67)   | 2.30 (0.47)   | 2.79 (0.49)   | 2.30 (0.47)   | 2.46 (0.62)   |
| WDBC           | 3.69 (2.92)   | 3.86 (3.07)   | 3.16 (3.07)   | 4.56 (3.62)   | 4.56 (3.62)   |
| average        | 14.70         | 13.46         | 15.02         | 14.87         | 15.18         |
| ranks          | 3.37          | 1.30          | 3.47          | 3.17          | 3.70          |

Table B.1: Results on real world classification and regression datasets using L1-SVM2 and LARS, respectively.
Algorithms with the lowest error rate are in bold; parentheses contain the standard deviation. We only
show results from the multiple-step versions of MSA and SBA (Subsampling). We also omit the results from SBA
(Bootstrap).
Error % for LARS (Regression)

| Dataset No. | MSA Single | SBA(Boot) Single | SBA(Sub) Single | MSA Multiple | SBA(Boot) Multiple | SBA(Sub) Multiple | Bootstrap | Adaptive | L1    |
| 1           | 5.63       | 5.59             | 5.46            | 5.54         | 5.39               | 5.29              | 5.68      | 5.88     | 6.34  |
| 2           | 1.51       | 1.49             | 1.45            | 1.50         | 1.48               | 1.45              | 1.50      | 1.51     | 1.54  |
| 3           | 19.33      | 23.70            | 23.39           | 16.89        | 20.76              | 18.80             | 27.80     | 26.59    | 29.11 |
| 4           | 5.72       | 5.74             | 5.56            | 5.63         | 5.52               | 5.37              | 5.76      | 6.07     | 6.70  |
| 5           | 1.52       | 1.49             | 1.46            | 1.50         | 1.48               | 1.45              | 1.50      | 1.49     | 1.56  |
| 6           | 24.76      | 34.46            | 31.47           | 19.14        | 26.43              | 25.21             | 50.95     | 38.51    | 37.94 |
| 7           | 6.57       | 6.66             | 6.62            | 6.42         | 6.44               | 6.10              | 6.70      | 7.27     | 7.74  |
| 8           | 1.67       | 1.66             | 1.63            | 1.66         | 1.64               | 1.61              | 1.66      | 1.72     | 1.73  |
| 9           | 7.30       | 7.60             | 7.57            | 7.05         | 6.73               | 6.81              | 8.68      | 7.70     | 9.91  |
| average     | 8.22       | 9.82             | 9.40            | 7.26         | 8.43               | 8.01              | 12.25     | 10.75    | 11.40 |
| ranks       | 5.22       | 5.56             | 3.56            | 3.56         | 2.78               | 1.44              | 6.89      | 7.22     | 8.78  |

Error % for L1-SVM2 (Classification)

| Dataset No. | MSA Single | SBA(Boot) Single | SBA(Sub) Single | MSA Multiple | SBA(Boot) Multiple | SBA(Sub) Multiple | Bootstrap | Adaptive | L1    |
| 1           | 11.05      | 10.75            | 9.52            | 10.58        | 10.31              | 8.72              | 10.82     | 11.13    | 11.77 |
| 2           | 4.59       | 4.37             | 3.51            | 4.22         | 4.01               | 3.52              | 4.49      | 4.15     | 4.68  |
| 3           | 25.46      | 25.85            | 23.62           | 23.64        | 24.47              | 22.24             | 26.96     | 26.08    | 26.20 |
| 4           | 11.71      | 11.27            | 10.18           | 11.14        | 10.71              | 9.51              | 11.77     | 11.65    | 12.90 |
| 5           | 4.41       | 4.51             | 3.62            | 4.33         | 4.50               | 3.61              | 4.66      | 3.88     | 5.08  |
| 6           | 30.27      | 30.20            | 28.46           | 28.74        | 29.16              | 26.85             | 31.70     | 30.73    | 30.78 |
| 7           | 12.30      | 11.98            | 11.09           | 11.75        | 11.38              | 10.37             | 12.08     | 12.52    | 13.33 |
| 8           | 5.51       | 5.50             | 4.33            | 5.15         | 5.23               | 4.28              | 5.67      | 5.42     | 5.77  |
| 9           | 20.84      | 20.17            | 18.92           | 19.93        | 19.79              | 18.31             | 21.20     | 20.16    | 20.70 |
| average     | 14.02      | 13.84            | 12.58           | 13.28        | 13.28              | 11.93             | 14.37     | 13.97    | 14.58 |
| ranks       | 6.67       | 5.67             | 1.89            | 3.78         | 3.67               | 1.11              | 7.78      | 5.89     | 8.56  |

Error % for LPSVM (Classification)

| Dataset No. | MSA Single | SBA(Boot) Single | SBA(Sub) Single | MSA Multiple | SBA(Boot) Multiple | SBA(Sub) Multiple | Bootstrap | Adaptive | L1    |
| 1           | 11.53      | 12.44            | 9.83            | 10.16        | 9.45               | 7.56              | 10.43     | 11.40    | 11.91 |
| 2           | 4.25       | 5.07             | 3.72            | 3.83         | 3.53               | 2.73              | 3.97      | 3.59     | 4.47  |
| 3           | 25.32      | 25.76            | 22.43           | 22.65        | 21.27              | 15.48             | 25.54     | 26.41    | 26.14 |
| 4           | 12.55      | 14.07            | 10.39           | 11.41        | 10.92              | 8.33              | 11.09     | 11.87    | 13.25 |
| 5           | 4.41       | 5.57             | 3.88            | 4.29         | 3.72               | 2.79              | 4.07      | 2.68     | 5.11  |
| 6           | 30.35      | 30.67            | 26.47           | 28.20        | 26.74              | 20.19             | 30.99     | 30.93    | 30.61 |
| 7           | 12.46      | 14.02            | 11.12           | 11.39        | 11.26              | 8.92              | 11.66     | 12.87    | 13.24 |
| 8           | 5.30       | 6.02             | 4.53            | 4.98         | 4.58               | 3.89              | 5.48      | 5.01     | 5.64  |
| 9           | 18.73      | 18.46            | 16.32           | 17.93        | 17.56              | 13.96             | 18.96     | 18.19    | 18.15 |
| average     | 13.88      | 14.67            | 12.08           | 12.76        | 12.11              | 9.32              | 13.58     | 13.66    | 14.28 |
| ranks       | 6.44       | 8.33             | 2.67            | 4.44         | 2.67               | 1.11              | 6.22      | 5.67     | 7.44  |

Error % for Logistic Regression (Classification)

| Dataset No. | MSA Single | SBA(Boot) Single | SBA(Sub) Single | MSA Multiple | SBA(Boot) Multiple | SBA(Sub) Multiple | Bootstrap | Adaptive | L1    |
| 1           | 8.90       | 11.59            | 10.14           | 7.98         | 7.99               | 6.61              | 10.57     | 9.79     | 11.76 |
| 2           | 4.82       | 3.82             | 3.23            | 3.50         | 3.36               | 2.65              | 4.21      | 2.90     | 4.15  |
| 3           | 20.84      | 26.78            | 26.09           | 17.59        | 21.65              | 17.35             | 25.57     | 25.97    | 26.65 |
| 4           | 11.02      | 11.84            | 10.58           | 10.07        | 8.43               | 6.45              | 12.85     | 10.43    | 13.07 |
| 5           | 6.42       | 3.46             | 3.12            | 3.98         | 2.99               | 2.41              | 4.71      | 2.94     | 4.46  |
| 6           | 24.12      | 31.13            | 30.53           | 22.19        | 24.63              | 20.42             | 33.71     | 30.38    | 31.13 |
| 7           | 11.45      | 13.25            | 11.89           | 9.79         | 10.32              | 8.22              | 16.60     | 11.57    | 13.49 |
| 8           | 5.36       | 4.95             | 4.17            | 3.91         | 4.01               | 3.25              | 6.54      | 4.28     | 5.12  |
| 9           | 17.68      | 19.93            | 18.93           | 15.59        | 13.80              | 12.23             | 23.09     | 19.08    | 20.15 |
| average     | 12.29      | 14.08            | 13.19           | 10.51        | 10.80              | 8.84              | 15.32     | 13.04    | 14.44 |
| ranks       | 5.56       | 6.89             | 5.11            | 3.00         | 3.11               | 1.00              | 8.00      | 4.44     | 7.89  |

Table B.2: Results on nine synthetic datasets for four different L1 penalized learners (LARS, L1-SVM2, LPSVM, and logistic regression). Algorithms with
the highest accuracy are in bold. Standard deviations are omitted.
LARS

| Dataset No. | MSA Single | SBA(Boot) Single | SBA(Sub) Single | MSA Multiple | SBA(Boot) Multiple | SBA(Sub) Multiple | Bootstrap | Adaptive  | L1        |
| 1           | 12.4/1.3   | 12.9/2.3         | 12.9/3.1        | 12.4/1.4     | 13.3/1.6           | 12.9/0.5          | 13.0/3.1  | 12.4/5.7  | 13.9/18.2 |
| 2           | 11.0/0.2   | 12.1/0.7         | 13.1/1.0        | 11.6/1.1     | 12.1/0.6           | 12.6/0.8          | 12.5/0.0  | 11.8/2.2  | 12.9/5.6  |
| 3           | 14.9/3.1   | 14.7/13.1        | 14.9/24.1       | 14.9/0.4     | 14.8/5.6           | 14.6/3.5          | 13.8/9.6  | 14.7/18.0 | 15.0/40.0 |
| 4           | 12.2/1.5   | 12.3/1.6         | 12.4/3.1        | 12.1/1.2     | 12.6/1.1           | 12.4/0.4          | 12.4/1.9  | 12.1/5.9  | 13.7/23.7 |
| 5           | 11.0/0.2   | 12.0/0.6         | 13.0/1.4        | 11.6/1.7     | 12.1/0.7           | 12.4/0.8          | 12.2/0.0  | 12.2/2.0  | 12.8/6.1  |
| 6           | 14.6/7.4   | 14.2/27.1        | 14.3/31.8       | 14.5/0.3     | 14.3/7.8           | 13.9/8.3          | 10.2/4.7  | 14.2/33.3 | 14.8/54.6 |
| 7           | 12.5/1.3   | 12.8/2.6         | 13.4/6.8        | 12.5/0.9     | 13.0/1.5           | 12.9/0.4          | 13.1/4.4  | 12.5/7.7  | 14.0/21.5 |
| 8           | 10.8/1.8   | 11.5/3.0         | 11.7/3.4        | 10.8/1.7     | 11.4/2.3           | 11.5/2.0          | 12.0/3.2  | 10.7/3.9  | 12.8/9.1  |
| 9           | 23.8/3.1   | 24.3/10.3        | 24.5/18.2       | 23.8/2.0     | 24.6/5.1           | 24.1/8.2          | 22.7/2.9  | 24.5/15.4 | 24.7/36.5 |
| average     | 13.7/2.2   | 14.1/6.8         | 14.5/10.3       | 13.8/1.2     | 14.3/2.9           | 14.1/2.8          | 13.5/3.3  | 13.9/10.5 | 15.0/23.9 |

L1-SVM2

| Dataset No. | MSA Single | SBA(Boot) Single | SBA(Sub) Single | MSA Multiple | SBA(Boot) Multiple | SBA(Sub) Multiple | Bootstrap | Adaptive  | L1        |
| 1           | 8.5/2.9    | 8.9/3.8          | 10.0/4.5        | 8.2/2.5      | 8.6/2.2            | 9.3/1.8           | 8.3/2.0   | 9.4/7.3   | 10.3/12.2 |
| 2           | 4.4/0.5    | 5.4/0.8          | 6.9/0.7         | 5.0/1.5      | 4.9/0.6            | 6.3/0.5           | 5.0/0.0   | 6.3/2.9   | 6.5/4.0   |
| 3           | 11.2/10.9  | 11.7/14.5        | 12.0/16.7       | 11.6/13.1    | 10.7/6.4           | 11.1/7.5          | 9.4/7.2   | 12.2/20.1 | 13.1/34.1 |
| 4           | 7.8/2.3    | 8.7/3.8          | 9.1/3.8         | 8.0/3.2      | 8.3/2.2            | 8.6/1.7           | 7.4/1.0   | 9.1/8.3   | 9.4/9.1   |
| 5           | 4.4/0.3    | 5.1/1.2          | 6.5/0.8         | 4.5/0.4      | 4.5/0.8            | 5.8/0.5           | 4.5/0.1   | 6.6/2.1   | 6.1/2.9   |
| 6           | 8.5/11.0   | 9.0/13.3         | 9.8/17.0        | 8.9/13.5     | 7.9/5.4            | 8.2/4.9           | 6.3/4.6   | 9.8/23.4  | 10.9/35.6 |
| 7           | 7.9/2.9    | 8.6/4.6          | 9.5/5.5         | 7.9/2.7      | 8.0/2.3            | 8.8/2.6           | 7.8/2.2   | 8.6/7.9   | 9.8/13.5  |
| 8           | 4.2/1.6    | 5.0/2.6          | 6.1/3.4         | 4.6/2.5      | 4.3/1.8            | 5.6/2.5           | 4.4/1.2   | 5.4/4.3   | 6.7/8.2   |
| 9           | 10.8/5.4   | 12.5/6.8         | 14.1/9.2        | 12.1/10.1    | 10.8/3.8           | 12.4/4.9          | 10.0/2.2  | 14.3/13.7 | 15.1/18.3 |
| average     | 7.5/4.2    | 8.3/5.7          | 9.3/6.8         | 7.9/5.5      | 7.6/2.8            | 8.5/3.0           | 7.0/2.3   | 9.1/10.0  | 9.8/15.3  |

LPSVM

| Dataset No. | MSA Single | SBA(Boot) Single | SBA(Sub) Single | MSA Multiple | SBA(Boot) Multiple | SBA(Sub) Multiple | Bootstrap | Adaptive  | L1        |
| 1           | 8.0/1.9    | 10.0/12.0        | 10.8/11.9       | 8.0/2.6      | 9.1/3.0            | 9.8/3.1           | 8.4/1.9   | 10.2/13.1 | 10.0/10.7 |
| 2           | 4.7/0.3    | 8.0/21.9         | 7.8/9.5         | 5.2/0.9      | 8.6/18.9           | 11.5/39.8         | 5.5/0.1   | 11.0/5.3  | 6.7/2.8   |
| 3           | 10.9/9.3   | 13.0/26.5        | 13.8/33.9       | 10.6/5.7     | 12.3/9.6           | 13.3/7.8          | 9.9/5.8   | 12.6/28.4 | 12.7/29.3 |
| 4           | 7.6/2.9    | 9.3/18.6         | 10.2/12.0       | 7.6/2.2      | 8.7/6.4            | 9.6/3.6           | 7.9/1.8   | 10.3/28.9 | 9.6/13.2  |
| 5           | 4.5/0.4    | 7.5/22.5         | 7.6/9.2         | 4.4/0.8      | 7.9/20.7           | 12.0/58.6         | 5.2/0.1   | 13.0/5.3  | 6.6/3.6   |
| 6           | 8.6/11.4   | 11.0/32.5        | 12.7/45.6       | 8.2/9.2      | 10.3/14.5          | 11.9/9.8          | 6.6/4.2   | 10.9/51.1 | 10.8/32.4 |
| 7           | 8.0/2.5    | 9.7/13.5         | 10.7/14.9       | 7.5/1.1      | 9.0/4.5            | 9.8/5.5           | 8.0/2.2   | 9.9/16.4  | 9.6/11.6  |
| 8           | 4.0/1.2    | 5.7/11.8         | 7.1/12.1        | 4.1/1.3      | 6.2/14.9           | 9.0/39.6          | 4.7/1.0   | 9.9/12.9  | 6.3/5.9   |
| 9           | 12.3/4.5   | 16.4/15.1        | 18.8/19.3       | 13.5/8.5     | 13.6/7.0           | 16.5/7.1          | 12.7/2.9  | 18.5/23.8 | 16.5/15.2 |
| average     | 7.6/3.8    | 10.1/19.4        | 11.0/18.7       | 7.7/3.6      | 9.5/11.0           | 11.5/19.4         | 7.7/2.2   | 11.8/20.6 | 9.9/13.9  |

Logistic Regression

| Dataset No. | MSA Single | SBA(Boot) Single | SBA(Sub) Single | MSA Multiple | SBA(Boot) Multiple | SBA(Sub) Multiple | Bootstrap | Adaptive  | L1        |
| 1           | 10.2/13.6  | 11.1/18.5        | 11.1/19.5       | 9.8/9.2      | 11.5/25.1          | 11.1/17.3         | 9.6/4.3   | 10.3/9.0  | 10.8/22.9 |
| 2           | 7.2/3.5    | 10.1/9.5         | 10.4/6.7        | 7.2/2.6      | 10.0/15.4          | 10.8/14.2         | 7.5/0.1   | 10.6/2.4  | 8.8/9.2   |
| 3           | 12.9/27.6  | 13.5/43.9        | 13.6/48.2       | 12.6/17.8    | 13.9/50.6          | 13.3/28.9         | 12.6/29.7 | 12.4/26.4 | 13.6/49.2 |
| 4           | 8.7/7.9    | 10.4/19.2        | 10.8/19.2       | 8.6/5.8      | 11.1/28.2          | 11.1/19.5         | 8.6/1.8   | 9.7/7.6   | 9.2/17.2  |
| 5           | 6.0/2.0    | 9.6/8.0          | 10.1/4.8        | 6.7/1.4      | 10.4/14.6          | 11.1/14.8         | 6.3/0.1   | 10.7/2.1  | 7.5/5.7   |
| 6           | 10.6/32.9  | 12.0/56.6        | 12.4/67.4       | 10.3/22.3    | 12.3/62.1          | 12.0/41.4         | 10.1/25.0 | 10.6/36.4 | 11.4/54.8 |
| 7           | 9.8/13.6   | 10.9/22.3        | 11.2/23.3       | 9.5/8.9      | 11.4/29.2          | 10.9/19.1         | 9.1/4.8   | 9.4/8.3   | 10.7/25.6 |
| 8           | 7.7/8.1    | 9.6/13.2         | 9.5/10.7        | 7.6/6.5      | 9.7/24.1           | 9.8/23.9          | 8.1/2.7   | 9.1/7.7   | 9.5/16.7  |
| 9           | 16.3/26.2  | 19.2/38.8        | 19.6/41.6       | 15.2/19.6    | 20.5/48.3          | 19.8/40.8         | 16.6/12.3 | 16.3/18.4 | 18.0/43.4 |
| average     | 9.9/15.0   | 11.8/25.5        | 12.1/26.8       | 9.7/10.5     | 12.3/33.1          | 12.2/24.4         | 9.8/9.0   | 11.0/13.1 | 11.1/27.2 |

Table B.3: Features (correctly/incorrectly) selected, for four different L1 penalized learners. Algorithms
with the highest number of correct features or the lowest number of incorrect features are in bold; algorithms
with the lowest number of correct features or the highest number of incorrect features are marked with a wavy
underline.
B.3 Feature Selection Results for Artificial Datasets
Our primary criterion for model selection is error rate, and we showed that weighted models obtained
from SBA outperform the other algorithms. We now examine the features selected by the individual algorithms. In
Figure B.1, we show the average number of features, over all nine synthetic datasets, correctly and incorrectly
selected by the algorithms. Our observations are as follows:
• All algorithms have a lower false discovery rate than the standard L1 algorithm.

• Interestingly, adaptive and SBA show a higher rate of correct detection than MSA and Bootstrap,
but at the cost of including more incorrect features in the final models.
Figure B.1: Features correctly and incorrectly detected by four different L1 penalized algorithms.
B.4 Expanding Dataset Dictionary
We expanded the available real world datasets to include both first- and second-order features. We briefly
discuss the error-rate results for models trained using only first-order features compared to models trained
using both first- and second-order features. Our results are shown in Figure B.2, using 10-fold cross-validation
and the standard L1 penalized algorithms, with LARS for regression and L1-SVM2 for classification. The
difference in error is significant at p = 0.03467, using the Wilcoxon signed-rank test [124].
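A paired comparison of this kind can be run with SciPy's signed-rank test; the per-dataset error rates below are illustrative placeholders, not the thesis results.

```python
from scipy.stats import wilcoxon

# Hypothetical paired per-dataset error rates (NOT the thesis numbers).
first_order = [0.20, 0.15, 0.30, 0.12, 0.25, 0.18, 0.22, 0.28]
first_plus_second = [0.18, 0.14, 0.27, 0.12, 0.21, 0.16, 0.20, 0.26]

# Two-sided paired test of whether the expanded dictionary changes the error.
stat, p = wilcoxon(first_order, first_plus_second)
print(f"W = {stat}, p = {p:.4f}")
```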
[Figure B.2 plots, per dataset, the error rate for models using first-order features vs. models using first+second-order features. The parenthesized counts give (number of first-order features / number of first+second-order features): mortgage (15/135), WDBC (30/495), breastcancer (10/65), ionosphere (34/629), australian (14/119), heart (13/104), svmguide3 (22/275), mpg (7/35), pimadiabetes (8/44), german.numer (24/324), compactiv (21/252), housing (13/104), liverdisorders (6/27), cadata (8/44), concrete (8/44), pole (26/377), diabetes (10/65), census (16/152).]
Figure B.2: Results using the expanded dictionary for real world datasets with the standard L1 penalized algorithms: LARS (regression) and L1-SVM2 (classification).
Appendix C
Appendix for Chapter 5
C.1 Description of Real World Datasets
We describe the datasets used for our experiments. Note that all of the datasets are classification
datasets.
(1) covertype - prediction of forest cover type based only on cartographic features like elevation, slope,
distance from water, etc. The dataset has seven distinct forest cover type classes [10] and is available
at Asuncion and Newman [5].
(2) ibn sina - a challenge dataset in the 'Active Learning Challenge' [57]. The task is to spot Arabic words
in ancient manuscripts to facilitate indexing.
(3) sylva - classification of Ponderosa pine vs. other types of forest cover. This dataset was a challenge dataset
in the 'Agnostic Learning vs. Prior Knowledge Challenge' [56].
(4) zebra - classification of whether the zebrafish embryo is in division (meiosis) or not. This dataset was a
challenge dataset in ‘Active Learning Challenge’ [57].
(5) musk - classification of whether new molecules will be musks or non-musks. This dataset is available at
Asuncion and Newman [5].
(6) hemodynamic 1 to hemodynamic 4 - classifying the level of vacuum pressure set in a Lower Body
Negative Pressure (LBNP) chamber. An LBNP chamber is used to induce the effects of blood loss, in
humans and pigs, by lowering pressure on the lower part of the body, causing blood to flow towards the
lower extremities; the amount of vacuum pressure is correlated with the amount of blood loss [30]. Each
dataset was obtained from a different patient group containing between 30 and 40 patients. Time-series
data were obtained from various sensors, including heart-rate monitors and pulse-oximeters, among others;
features in the data are derived from time-frequency domain analysis of the sensor signals. Note that
each dataset has features extracted in a different manner. Dataset is available upon request.
(7) factors, fourier, karhunen, pixel, and zernike - this set of datasets consists of features of handwritten
numerals (‘0’–‘9’) extracted from a collection of Dutch utility maps. 200 patterns per class (for a total of
2,000 patterns) have been digitized in binary images. This dataset is available at Asuncion and Newman
[5]. These digits are represented in terms of the following five feature sets:
• factors: 216 profile correlations.

• fourier: 76 Fourier coefficients of the character shapes.

• karhunen: 64 Karhunen-Loève coefficients.

• pixel: 240 pixel averages in 2 × 3 windows.

• zernike: 47 Zernike moments.
(8) power 1, power 2, and power 3 - the goal of these datasets is to distinguish optimal from sub-optimal
power configurations based on the connections present in a power grid and the cost/reliability
induced by those connections [54]. The overall idea is to optimally collocate distributed generation of
power and feeder interties between feeders at a feasible cost, so as to improve the supply reliability of power
while also satisfying power-flow constraints in an islanded mode of operation of the distribution system. The
targets in the datasets differ on the basis of the Pareto-optimal front induced by cost vs. the reliability index
and the feeder connections. Dataset is available upon request.
C.2 Results on Real World Datasets
We show the error rate in Table C.1 and the relative error rates in Table C.2.
C.3 Choice of ntree Parameter
The ntree parameter affects the generalization error of the random forest. According to Breiman,
a large value of ntree should be chosen. In this appendix, we show why our choice of ntree = 1000 is
appropriate for the datasets that were used for our experiments.
[Figure C.1 is a significance diagram over forest sizes, with the average improvement over the ntree = 100 baseline: 100 trees (0.00%), 250 trees (1.93%), 500 trees (2.60%), 1000 trees (3.27%), 2000 trees (3.34%).]
Figure C.1: What is an appropriate value of ntree? Results depict significance testing via the Bergmann-Hommel procedure at p = 0.05. The average improvement in test error from the baseline condition of ntree = 100 is noted in parentheses.
We examine the hypothesis that there is no difference in prediction ability between forests of different
sizes. For all the datasets used in our experiments, we found the best model when the random forest size
is one of the following ntree values: 100, 250, 500, 1000, and 2000 trees. The best model for a given ntree
| Dataset  | Prob Reweight | Prob Removal | Err Reweight | Err Removal | Breiman Reweight | Strobl Reweight | Breiman Removal | Strobl Removal | Baseline |
| covtype  | 0.2311        | 0.2304       | 0.2320       | 0.2326      | 0.2299           | 0.2303          | 0.2299          | 0.2299         | 0.2295   |
| hemo 1   | 0.2819        | 0.2752       | 0.2695       | 0.2760      | 0.2947           | 0.2939          | 0.2927          | 0.2893         | 0.2942   |
| hemo 2   | 0.0820        | 0.0744       | 0.0772       | 0.0747      | 0.0855           | 0.0847          | 0.0799          | 0.0806         | 0.0864   |
| hemo 3   | 0.1196        | 0.1015       | 0.0969       | 0.1022      | 0.1235           | 0.1203          | 0.1248          | 0.1252         | 0.1268   |
| hemo 4   | 0.2706        | 0.2537       | 0.2512       | 0.2529      | 0.2546           | 0.2566          | 0.2514          | 0.2519         | 0.2804   |
| zernike  | 0.2029        | 0.2073       | 0.2189       | 0.2063      | 0.2206           | 0.2195          | 0.2183          | 0.2218         | 0.2187   |
| karhunen | 0.0441        | 0.0432       | 0.0483       | 0.0435      | 0.0428           | 0.0427          | 0.0423          | 0.0435         | 0.0423   |
| fourier  | 0.1721        | 0.1724       | 0.1799       | 0.1743      | 0.1695           | 0.1741          | 0.1672          | 0.1658         | 0.1774   |
| factors  | 0.0333        | 0.0333       | 0.0417       | 0.0357      | 0.0378           | 0.0384          | 0.0380          | 0.0378         | 0.0379   |
| pow 1    | 0.0353        | 0.0352       | 0.0392       | 0.0360      | 0.0421           | 0.1226          | 0.0452          | 0.0521         | 0.0463   |
| pow 2    | 0.1791        | 0.1848       | 0.1956       | 0.1798      | 0.1887           | 0.1947          | 0.1896          | 0.1968         | 0.1953   |
| pow 3    | 0.1664        | 0.1609       | 0.1653       | 0.1626      | 0.1574           | 0.1617          | 0.1559          | 0.1633         | 0.1650   |
| sina     | 0.0367        | 0.0350       | 0.0388       | 0.0365      | 0.0350           | 0.0362          | 0.0334          | 0.0342         | 0.0351   |
| sylva    | 0.0090        | 0.0097       | 0.0090       | 0.0089      | 0.0094           | 0.0102          | 0.0094          | 0.0095         | 0.0091   |
| zebra    | 0.2049        | 0.1576       | 0.1594       | 0.1439      | 0.1862           | 0.1934          | 0.1918          | 0.1915         | 0.2132   |
| musk     | 0.0259        | 0.0257       | 0.0267       | 0.0267      | 0.0283           | 0.0281          | 0.0286          | 0.0286         | 0.0281   |
| pixel    | 0.0318        | 0.0333       | 0.0339       | 0.0309      | 0.0293           | 0.0301          | 0.0292          | 0.0299         | 0.0294   |
| average  | 0.1251        | 0.1196       | 0.1226       | 0.1190      | 0.1256           | 0.1316          | 0.1252          | 0.1266         | 0.1303   |
| ranks    | 4.882         | 3.529        | 5.500        | 3.971       | 4.971            | 6.235           | 4.176           | 5.559          | 6.176    |

Table C.1: Error rates on individual datasets, for all algorithms.
| Dataset  | Prob Reweight | Prob Removal | Err Reweight | Err Removal | Breiman Reweight | Strobl Reweight | Breiman Removal | Strobl Removal |
| covtype  | -0.69         | -0.37        | -1.06        | -1.32       | -0.18            | -0.32           | -0.18           | -0.16          |
| hemo 1   | 4.18          | 6.45         | 8.41         | 6.18        | -0.18            | 0.11            | 0.53            | 1.67           |
| hemo 2   | 5.18          | 13.96        | 10.72        | 13.62       | 1.13             | 2.01            | 7.50            | 6.73           |
| hemo 3   | 5.63          | 19.93        | 23.58        | 19.37       | 2.57             | 5.11            | 1.55            | 1.27           |
| hemo 4   | 3.50          | 9.52         | 10.42        | 9.81        | 9.22             | 8.49            | 10.36           | 10.18          |
| zernike  | 7.22          | 5.22         | -0.09        | 5.65        | -0.85            | -0.37           | 0.18            | -1.43          |
| karhunen | -4.27         | -2.14        | -14.09       | -2.71       | -1.10            | -0.82           | 0.14            | -2.88          |
| fourier  | 3.00          | 2.82         | -1.40        | 1.76        | 4.50             | 1.86            | 5.80            | 6.54           |
| factors  | 12.22         | 12.15        | -10.16       | 5.72        | 0.24             | -1.36           | -0.20           | 0.17           |
| pow 1    | 23.71         | 24.09        | 15.30        | 22.35       | 9.09             | -164.71         | 2.42            | -12.60         |
| pow 2    | 8.26          | 5.37         | -0.17        | 7.93        | 3.36             | 0.27            | 2.90            | -0.81          |
| pow 3    | -0.86         | 2.47         | -0.21        | 1.44        | 4.61             | 2.00            | 5.48            | 1.04           |
| sina     | -4.34         | 0.40         | -10.28       | -3.76       | 0.28             | -2.95           | 4.97            | 2.55           |
| sylva    | 0.97          | -6.72        | 1.06         | 2.89        | -2.61            | -12.19          | -2.86           | -4.41          |
| zebra    | 3.90          | 26.11        | 25.22        | 32.51       | 12.66            | 9.29            | 10.06           | 10.18          |
| musk     | 7.78          | 8.46         | 4.89         | 4.89        | -0.87            | 0.00            | -2.00           | -2.00          |
| pixel    | -8.05         | -13.43       | -15.30       | -5.32       | 0.32             | -2.45           | 0.50            | -1.88          |
| average  | 3.96          | 6.72         | 2.75         | 7.12        | 2.48             | -9.18           | 2.77            | 0.83           |
| worst    | -8.05         | -13.43       | -15.30       | -5.32       | -2.61            | -164.71         | -2.86           | -12.60         |
| best     | 23.71         | 26.11        | 25.22        | 32.51       | 12.66            | 9.29            | 10.36           | 10.18          |
| ranks    | 4.588         | 3.294        | 4.971        | 3.735       | 4.618            | 5.765           | 3.941           | 5.088          |

Table C.2: % relative improvement in error rate over the baseline on individual datasets, for all algorithms.
and a particular dataset was found as follows: for each training fold, we parametrically searched for the best
mtry value (over {p/10, 2p/10, . . . , p, √p}, where p is the number of features) by choosing the model with
the lowest out-of-bag error rate. We then evaluated the performance of the chosen forest on the test set.
In Figure C.1, we show the relative improvement, over all real world datasets used in our experiments,
compared to the baseline results obtained using forests of 100 trees. We connect all forests (with different
ntree values) whose results are not significantly different from each other, using the Bergmann-Hommel procedure [52].
As shown in the figure, there is a significant improvement (≈ 0.67%) when forests are constructed using 1000
trees rather than 500 trees, but only a small (≈ 0.07%) improvement when the forest size is set to 2000
trees instead of 1000 trees.
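The ntree sweep can be sketched with scikit-learn's random forest, whose warm_start flag grows the same forest incrementally and exposes an out-of-bag estimate at each size; the synthetic dataset and the grid values are illustrative, not the thesis data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# warm_start=True keeps previously grown trees, so each fit only adds trees.
clf = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)
oob_error = {}
for ntree in (100, 250, 500, 1000):
    clf.set_params(n_estimators=ntree)
    clf.fit(X, y)
    oob_error[ntree] = 1.0 - clf.oob_score_   # out-of-bag error at this size
print(oob_error)
```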
C.4 Choice of mtry parameter
In this appendix, we show why our choice of parametrically searching for the best mtry is appropriate
for the datasets that were used for our experiments. We examine the hypothesis that there is no difference
Figure C.2: For random forests, do we need to parametrically search for the best mtry? Results depict significance testing via the Bergmann-Hommel procedure at p = 0.05. The average improvement in test error from the baseline condition of mtry = √p is noted in parentheses.
136
in prediction ability between forests constructed by parametrically finding the best mtry vs. forests constructed
with the default mtry = √p, where p is the number of features; as all the datasets used in our experiments
are classification datasets, we use the suggested mtry = √p value for classification [19]. For all the
datasets used in our experiments, we found the best model when the number of trees in the random forest
is 1000. The best model for ntree = 1000 and a particular dataset was found as follows: for each
training fold, we parametrically searched for the best mtry value (over {p/10, 2p/10, . . . , p, √p}) by choosing
the model with the lowest out-of-bag error rate. We then evaluated the performance of the chosen forest on
the test set. We also recorded the test-set error for mtry = √p.
In Figure C.2, we show the relative improvement, over all datasets used in our experiments, compared
to the baseline results obtained using mtry = √p, where p is the number of features in the dataset. We use
the Bergmann-Hommel procedure [52] as a post-hoc test and depict the results graphically in the figure;
algorithms that are significantly different from each other are not connected by a vertical line. As shown
in the figure, there is a significant difference at p = 0.05 and an improvement (≈ 6.14%) when forests are
constructed using a parametrically found mtry vs. the default suggested value of mtry = √p.
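The parametric mtry search can be sketched in scikit-learn, where max_features plays the role of mtry; the grid mirrors {p/10, 2p/10, . . . , p, √p}, and the dataset is synthetic, not from the thesis.

```python
import math
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=30, random_state=1)
p = X.shape[1]
# Candidate mtry values: p/10, 2p/10, ..., p, plus the sqrt(p) default.
grid = sorted({max(1, p * k // 10) for k in range(1, 11)} | {round(math.sqrt(p))})

def oob_error(mtry):
    rf = RandomForestClassifier(n_estimators=200, max_features=mtry,
                                oob_score=True, random_state=1).fit(X, y)
    return 1.0 - rf.oob_score_

best_mtry = min(grid, key=oob_error)   # mtry with the lowest out-of-bag error
```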
C.5 A Challenge for ML algorithms
In Section 5.2, our primary focus was random forests, and we showed that existing feature importance
measures within random forests are unable to correctly distinguish between relevant and irrelevant features
on a family of artificial datasets. We can design similar datasets that are problematic for other supervised
classification algorithms. We now discuss an artificial dataset that is problematic for a variety of algorithms.
We generated the dataset using features x1, . . . , x6 and then appended the feature x7 = tan(x6).
X = [x1 x2 x3 x4 x5 x6 x7],   (C.1)

where x1, . . . , x5 are relevant features and x6, x7 are irrelevant features, each represented by a column vector of length N, with x7 = tan(x6).
The data matrix X is drawn from a multivariate normal distribution with the following covariance matrix, in which the first five rows/columns correspond to the relevant features and the last two to the irrelevant features:

cov(X) =
[ 0.5   0     0     0     0     0.1   0.12 ]
[ 0     0.5   0     0     0     0.1   0.12 ]
[ 0     0     0.5   0     0     0.1   0.12 ]
[ 0     0     0     0.5   0     0.1   0.12 ]
[ 0     0     0     0     0.5   0.1   0.12 ]
[ 0.1   0.1   0.1   0.1   0.1   0.1   0.12 ]
[ 0.12  0.12  0.12  0.12  0.12  0.12  0.12 ]   (C.2)
The output vector Y of size n is obtained from the following nonlinear generative model:

yi = ∏j=1..5 [sin(xi,j) + 1].   (C.3)
We tuned the size of the dataset and the covariance matrix so that the dataset is problematic for a
variety of algorithms. The correlation between the irrelevant features and the target value Y is higher
(almost double) than the correlation between the relevant features and Y . Note that one can apply multiple
tan() transforms to the irrelevant features to further reduce the correlation between the relevant and the
irrelevant features.
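A sketch of this generative model in NumPy (an assumption: the text states both the covariance of Equation C.2 and x7 = tan(x6), so here we draw x1, . . . , x6 from the normal distribution and append x7 deterministically, reading the x7 row of C.2 as the empirically observed correlations):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Covariance of (x1..x6) from Equation C.2: relevant features x1..x5 are
# mutually uncorrelated; irrelevant x6 correlates weakly with all of them.
cov6 = 0.5 * np.eye(6)
cov6[5, :] = cov6[:, 5] = 0.1
cov6[5, 5] = 0.1

Z = rng.multivariate_normal(np.zeros(6), cov6, size=n)
X = np.column_stack([Z, np.tan(Z[:, 5])])      # append x7 = tan(x6)

# Nonlinear target (Equation C.3): product over the five relevant features.
y = np.prod(np.sin(X[:, :5]) + 1.0, axis=1)
```

Since sin(x) + 1 lies in [0, 2], the target is always non-negative, and only the five relevant features enter the product.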
For our experiments, we generated 1000 training examples according to Equation C.3. We also generated
a separate validation set consisting of 1000 examples. Note that because this dataset is a single instance
of the distribution in Equation C.3, another instance will yield similar, but not identical, results. Due to
the large number of classifiers and parameters involved, we show results using a single sample of the
artificial model. Note that for all discussed algorithms on this artificial dataset, the error rate on
validation data is lower when only the relevant features are used to train the model than when an
irrelevant feature is included in training. Next, we discuss feature importance values obtained from
multiple algorithms, including random forests [19], cforest [62], SVM [119], ReliefF [93], and BayesNN [84];
we refrain from an extensive discussion of these algorithms.
Figure C.3: Importance (without retraining) assigned by various algorithms for the artificial dataset. Red
bars represent the importance of the spurious features, whereas green bars represent the importance of the
relevant features. We show importance for features as bars for (a) Random Forest - Breiman importance,
(b) Random Forest - Strobl importance, (c) cforest - conditional importance, (d) cforest - unconditional
importance, and (f) SVM (held-out folds). Panel (e) shows ReliefF importance for {1, 5, 10, 50, 100}
nearest neighbors. Panels (g) and (h) show BayesNN hyperparameter values over 200 iterations for 25 and
100 hidden units, respectively.
Random Forests Importance Using Breiman’s and Strobl’s Importance Measures. We created
1000 trees and parametrically created models corresponding to mtry = 1, ..., p. We then chose the best
performing model (which was trained with mtry = 2) using validation data1 and plot the importance
values obtained for all features from the best model in Figure C.3. Importance values obtained via Breiman’s
importance measure and Strobl’s importance measure are shown; in both cases, at least one of the spurious
features is assigned a higher importance value than the relevant features.
cforest - CART/RF Based Algorithm. We briefly discussed Strobl’s importance in Section 5.1.1.
Strobl’s importance was first proposed for the conditional trees described in [62]. The ‘party’ package [62]
for R [66] provides two ways to calculate importance: unconditional importance and conditional importance.
Conditional importance is equivalent to Strobl’s importance measure and unconditional importance is
equivalent to Breiman’s importance measure.

We created 1000 trees and parametrically created models corresponding to mtry = 1, ..., p. We then
chose the best performing model (which was trained with mtry = 2) using the validation data and plot
the importance values obtained from the best model in Figure C.3. We show importance obtained via both
conditional and unconditional importance; in both cases, at least one of the relevant features is assigned a
lower importance value than the spurious features.
A few observations regarding the cforest algorithm may be pertinent to the reader. The minimum
validation error (relative error) achieved using cforest was 0.2637, which was higher than the 0.2247
obtained for random forests. The authors of the package warn of excess computation time when conditional
importance is calculated. We observed that calculating the conditional importance took upwards of a week
on a 4 GHz Core-i7 processor when the number of permutations was set to 10; the algorithm also consumed
more than 4 gigabytes of memory in the process. When the number of examples was increased to 2000, the
party package [62] (which implements cforest) failed to work due to excess memory requirements, even on
a machine with 128 GB of main memory.
1 Note that we obtained a similar value of mtry when the out-of-bag error rate was used for validation.
ReliefF - k-NN Based Algorithm. The ReliefF algorithm [73, 75] estimates feature importance based
on how well the feature helps in distinguishing examples from their nearest neighbors of the same class
vs. their nearest neighbors from other classes. Robnik-Sikonja and Kononenko [93] adapted ReliefF for
regression. ReliefF only provides a ranking, and that ranking is then utilized with a supervised classifier.

We plot the importance from ReliefF (using Weka’s [58] implementation) for different numbers of
nearest neighbors in Figure C.3; we show the plot for {1, 5, 10, 50, 100} nearest neighbors. As shown in
the figure, the importance of the irrelevant features is higher than that of at least one of the relevant
features.
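The core Relief update can be sketched as follows (a simplified single-neighbor classification variant, not Weka’s ReliefF or the regression adaptation used in our experiments):

```python
import numpy as np

def relief(X, y):
    """Simplified Relief for classification (sketch).

    For each example, find its nearest hit (same class) and nearest miss
    (other class); features that differ more across misses than across
    hits receive higher weights.
    """
    n, p = X.shape
    # Scale each feature to [0, 1] so distances are comparable.
    span = X.max(axis=0) - X.min(axis=0)
    Xs = (X - X.min(axis=0)) / np.where(span == 0, 1, span)
    w = np.zeros(p)
    for i in range(n):
        d = np.abs(Xs - Xs[i]).sum(axis=1)   # L1 distance to every example
        d[i] = np.inf                        # exclude the example itself
        hit = np.argmin(np.where(y == y[i], d, np.inf))
        miss = np.argmin(np.where(y != y[i], d, np.inf))
        w += (np.abs(Xs[i] - Xs[miss]) - np.abs(Xs[i] - Xs[hit])) / n
    return w
```

On a toy dataset where one feature separates the classes and another is pure noise, the informative feature receives the larger weight.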
SVM - Importance Calculated by a Permutation Test on a Validation Dataset. The permutation-
based scheme used by random forests for evaluating feature importance, discussed in Section 2.4.5.1, can
also be utilized for SVM. Permuting a feature (in a separate test set before testing on the SVM) perturbs
the examples relative to the fixed SVM hyperplane; when a feature important for prediction is permuted,
there is a corresponding degradation of the results. We use Algorithm 5 to calculate feature importance
using SVMs. The feature importance can be calculated using either a held-out fold or separate validation
data.

We show the results for the cost parameter2 C=10 in Figure C.3, averaged over multiple held-out3
folds. As shown in the figure, the importance assigned to the spurious feature numbered 7 is higher than
the importance assigned to the relevant features.
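The permutation test of Algorithm 5 can be sketched as follows (using scikit-learn’s SVR as a stand-in; the dataset and parameters are illustrative, not the exact setup of our experiments):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

def permutation_importance_svr(model, X_val, y_val, rng):
    """Increase in validation MSE when each feature is permuted (sketch)."""
    base = np.mean((model.predict(X_val) - y_val) ** 2)   # baseline error
    imp = np.zeros(X_val.shape[1])
    for j in range(X_val.shape[1]):
        Xp = X_val.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])              # permute feature j
        imp[j] = np.mean((model.predict(Xp) - y_val) ** 2) - base
    return imp

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 5))
y = 2 * X[:, 0] + 0.1 * rng.standard_normal(400)   # only feature 0 matters
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
model = SVR(kernel="rbf", C=10).fit(X_tr, y_tr)
imp = permutation_importance_svr(model, X_val, y_val, rng)
```

A feature whose permutation barely changes the validation error receives an importance near zero; in a full run, these values would be collated over folds as in Algorithm 5.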
BayesNN. We used the Bayes neural network [84] available at [83]. The Bayes neural network has a
certain number of hidden nodes and a single hyperparameter for each feature. Though we show results
for BayesNN with both 25 and 100 hidden nodes, the network with 25 nodes performed better than the
network with 100 nodes4 .

According to the author, the hyperparameters are directly proportional to the importance of the
features. We plot the hyperparameters over 200 iterations in Figure C.3. As shown in the figure, the
2 We found models with C ≥ 10 giving the same error results on held-out folds, so we only show results for C=10.
3 We found similar importance results for the separate validation data.
4 We also experimented with networks of size 5 and 10, and found results similar to the network with 25 nodes;
thus, we only show our results with networks of size 25 and 100.
Algorithm 5: Calculating SVM importance via a permutation scheme

Input: 1000 examples for training and 1000 examples for validation. The SVM uses an RBF kernel
with the epsilon-SVR algorithm. The best model is parametrically selected from the cost
parameter array {1e-2, 1e-1, ..., 1e6}.
Output: Importance values from held-out folds and from validation data.
Pseudo-code:
number of folds = 10
for i in number of folds
    use 9 folds for training; internally use 1 of the 9 folds
    to find the best cost parameter
    EITHER train on the 9 folds with the best cost parameter,
    and find feature importance using the remaining fold
    OR train on all training data with the best cost parameter,
    and find feature importance using the held-out, separate validation data
collate the importance values over the entire loop

For the ‘evaluate the feature importance’ step, we first calculate the baseline error rate without
permuting any feature. We then permute each feature in turn and calculate the increase/decrease in error
rate. This step is similar to the one performed for calculating importance for random forests in
Section 5.1.1, except that instead of using out-of-bag examples we use a held-out fold or a separate
validation set.
hyperparameters (importance values) of the spurious features are higher than the hyperparameters of the
relevant features, for both networks of size 25 and 100.

Thus, we have shown that existing algorithms may incorrectly rank features.
C.6 Why Does Prob+Reweighting Perform Worse Than Other Retraining+Removal Algorithms?
We briefly discuss why Prob+Reweighting performs worse than other retraining+removal algorithms.
We define a metric called the feature difference score as

DiffScore(a, b, k) = 100 · (1/Dk) · |f_counts_num(k, a) − f_counts_num(k, b)|   (C.4)

where k indexes the datasets, i = 1, . . . , Dk indexes the features of dataset k, a and b are the
algorithms being compared, and f_counts_num(k, a) returns the number of features discovered by
algorithm a for dataset k over all folds:

f_counts_num(k, a) = Σ_{i=1}^{Dk} I(f_counts(k, a, i)),

where I(·) is an indicator function; refer to Equation 5.11 for f_counts(k, a, i).
The difference score is based on the number of features discovered by any two algorithms, and it is
calculated for individual datasets (rather than for individual training/test splits of the datasets). Unlike
the similarity score for features discovered (Equation 5.11), discussed earlier, which matches individual
features, this metric only takes into consideration the number of features discovered by each algorithm.
The score is 0 if the two algorithms discover the same number of features.
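Under this per-dataset reading of Equation C.4, the score can be computed as follows (the boolean selection vectors are hypothetical stand-ins for the thesis’s f_counts values):

```python
def f_counts_num(selected):
    """Number of features an algorithm discovered for one dataset, given a
    boolean indicator per feature (the role of I(f_counts(k, a, i)))."""
    return sum(1 for s in selected if s)

def difference_score(selected_a, selected_b):
    """Feature difference score (Equation C.4) for one dataset: 100 times the
    absolute difference in the number of features discovered, normalized by
    the number of features D_k. Zero means both algorithms discovered the
    same number of features."""
    d_k = len(selected_a)
    return 100.0 * abs(f_counts_num(selected_a) - f_counts_num(selected_b)) / d_k
```

For example, if algorithm a selects 2 of 4 features and algorithm b selects 1, the score is 25.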
From the overall results presented earlier in Figure 5.3 (page 94), we observed that the Prob+Reweighting
scheme performs worse on average than both the Err+Removal and Prob+Removal schemes. Note that the
Prob+Reweighting scheme performs better than the Err+Reweighting scheme on average % relative
improvement of error, but worse on average absolute error. Using the feature similarity score discussed in
the preceding subsection, we found that Prob+Reweighting has a lower similarity score than the other
retraining algorithms (Figure 5.8).

In Figure C.4, we plot the difference in relative improvement of error between retraining importance-
based algorithms against the difference score based on the number of features discovered (Equation C.4),
for Prob+Reweighting vs. each of Prob+Removal, Err+Removal, and Err+Reweighting. Our goal is to observe whether there is
any difference in the total number of features discovered when there is a large difference in the relative
improvement of error between two algorithms.

For Prob+Reweighting vs. the rest of the retraining schemes, we observe a linear relation between
the feature difference score and the difference in relative improvement from baseline. The datasets where
this is observed are some of the hemodynamic datasets and covtype. We hypothesize that Prob+Reweighting
had a higher error rate for certain datasets because it got stuck in a local minimum and was not able to
remove as many features as the other retraining+removal algorithms.
C.7 Comparison to Boruta and rf-ace
We briefly digress to describe two algorithms, Boruta [76] and rf-ace [116], that are popular for
feature selection with random forests; we have not compared them directly to our discussed methods because
the primary goal of these algorithms is to remove irrelevant features.
As the Boruta [76] and rf-ace [116] algorithms are similar to each other, we provide only a general
overview of their working mechanisms. Both assume that an input matrix X of size n × p and a target
values/labels vector y of size n are available, where n is the number of examples and p is the number of
features. Boruta [76] and rf-ace [116] create a new dataset X′ of size n × 2p, which contains the original
dataset X and a permuted copy of each individual feature in X. Both are iterative algorithms and perform
the following steps in each iteration:

(1) Both algorithms train multiple random forest models on (X′, y) and obtain values from Breiman’s
feature importance measure. Statistical tests are performed to analyze whether there is a significant
difference between the feature importance values for a given feature and its permuted copy across multiple
models.

(2) Features whose importance values do not differ significantly from those of their permuted copies
are identified and removed (along with their copies) from X′. The algorithm loops back to step (1) unless
the number of features remains constant for a certain number of iterations.
We obtained percent error results for Boruta and rf-ace for the 17 datasets used in our experiments.
Figure C.4: Comparison of Prob+Reweighting vs. other retraining-based algorithms: difference in improve-
ment over baseline vs. the difference score based on the number of features discovered, for (left) Err+Removal
vs. Prob+Reweighting, (middle) Prob+Removal vs. Prob+Reweighting, and (right) Err+Reweighting vs.
Prob+Reweighting, over the datasets used in our experiments. Note that Prob+Reweighting has a higher
difference in improvement against the other retrained algorithms for a couple of the hemodynamic datasets
and covtype. Observe that there is a linear relation between the feature difference score and the difference
in relative improvement from baseline. We hypothesize that the higher error rate arises because
Prob+Reweighting did not remove as many features as the other retraining algorithms.
Figure C.5: Feature importance results for random forests and conditional forests on 3 artificial dataset
models. Panels show the difference in error rates (models with all features − models with only relevant
features) using random forests, and correct/incorrect detection counts for Breiman importance (random
forests), Strobl importance (random forests + extendedForest()), Gini importance (random forests),
retraining importance (using random forests), Breiman importance (conditional forests, unconditional),
and Strobl importance (conditional forests, conditional). From each artificial dataset model, we generate
100 datasets. We plot the number of times correct and incorrect detection is performed by an importance
measure using the correct detection scoring rule defined in Equation 5.7; incorrect detection = 1 − correct
detection. Compared to Figure 5.1, we also add results using the Gini importance measure.
                   Percent Error   Avg. number of features   # of datasets where there is
                                                             no change in # of features
Baseline Model         13.03               113.94                         -
rf-ace                 12.98                95.75                        10
Boruta                 13.06                91.15                        10

Table C.3: Results of rf-ace and Boruta.
Our results are presented in Table C.3; the baseline model is a model trained with all features. We can
draw the following three conclusions about Boruta and rf-ace: (1) there is no significant reduction in
the percent error rate compared to the baseline model; (2) the average number of features selected is higher
than the average number of features selected by the other algorithms discussed in this study (Figure 5.9);
and (3) for 10 out of 17 datasets, both Boruta and rf-ace failed to perform any feature selection.
C.8 Random Forests - Computational Complexity and Software
In this section, we discuss the computational complexity of random forests and compare it to that of SVM;
we also describe the properties of the software developed for the purpose of this thesis.
C.8.0.1 Computational Cost of a Random Forest Model

We calculate the computational cost of a random forest model as follows: at each split of a tree, we
search for the best feature to split the data; the computational cost of this procedure is upper-bounded by
O(n · p)5 , where n is the number of examples and p is the number of features. Furthermore, the trees are
fully grown, so the depth of a tree is upper-bounded by O(n). Thus, to grow an ensemble forest containing
ntree trees, the total computational cost is O(ntree · n² · p).
Now, we compare the computational complexity of random forests to that of SVM. The computational cost
of SVM is cited to be between O(n²) and O(n³) [12], but the authors do not consider the computational
complexity associated with calculating part of the kernel matrix6, which is O(n · p); thus, the computational
5 Breiman’s implementation of random forests contains a trick that allows the calculation of the Gini impurity value with O(n) complexity.
6 SVM solvers do not calculate the entire kernel matrix but only what is required.
complexity of an SVM model is between O(n³ · p) and O(n⁴ · p). We can ignore ntree in O(ntree · n² · p),
the computational complexity of random forests, as ntree is a constant factor; thus, the computational
complexity of random forests is at least a factor of O(n) smaller than that of SVM.
Random forests also hold a computational advantage over certain multi-class SVM methods. For multi-class
SVM on a k-class problem, popular techniques propose creating either k classifiers for a one-vs-all
strategy or k(k − 1)/2 classifiers for an all-vs-all strategy [59, 92, 123]; the decision trees in random
forests natively map to any number of classes and do not require creating additional classifiers via a
one-vs-all or an all-vs-all strategy.
C.8.0.2 Random Forests Are Parallelizable

We now discuss the parallelizable nature of the random forests algorithm, which has made it a popular
learning algorithm for large-scale datasets. The random forests algorithm is inherently parallelizable, as
trees can be trained independently of one another; in contrast, boosting algorithms and SVM are not
inherently parallelizable. A real-world example of random forests is the Kinect [105] accessory for the
Xbox 360 console; Kinect performs gesture classification using image segmentation, and its random forest
classifier was trained in parallel on thousands of computer cores.
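The tree-level parallelism can be observed with any off-the-shelf implementation; a sketch using scikit-learn (a stand-in for our own software), where `n_jobs` controls the number of cores used:

```python
from time import perf_counter
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

# Trees are independent, so the forest can be built serially or with all
# available cores; with a fixed random_state the resulting model is the same.
for n_jobs in (1, -1):
    t0 = perf_counter()
    RandomForestClassifier(n_estimators=300, n_jobs=n_jobs,
                           random_state=0).fit(X, y)
    print(f"n_jobs={n_jobs}: {perf_counter() - t0:.2f}s")
```

Because each tree depends only on its own bootstrap sample and random seed, parallel and serial training produce identical forests; only the wall-clock time changes.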
Additionally, for the purpose of this thesis, we developed a parallel version of the random forests
algorithm that is scalable to multiple cores and computers; our code is based on Breiman’s reference code
and our code’s properties are described next.
C.8.0.3 Software Used
Boulesteix et al. [13] present a good overview of the software packages available for RF. We considered
the following packages: randomForest-R by Liaw and Wiener [78], randomforest-matlab by Jaiantilal [67],
Random Jungle by Schwarz et al. [102], and TreeBagger by Matlab. We briefly discuss the reasons why we
had to develop our own RF software, and then describe the properties of the software we developed.
A Critique of Existing Random Forests Software. As we saw earlier in Chapter 5 of this
thesis, the amount of computation required to execute our experiments was quite high7; it was imperative
to reduce the computational cost as much as possible. As we will see later, another desirable property was
repeatable results from a random forests ensemble. Thus, we required the following properties from the
packages:
• The package should be fast in both single-threaded and multi-threaded mode. randomForest-R
[78], randomforest-matlab [67], and Random Jungle [102] were an order of magnitude faster than
Matlab’s TreeBagger function (tested using Matlab 2012b).

• The package should be able to scale to multiple cores and preferably to multiple computers. Random
Jungle [102] scales to multiple cores; the other packages do not.

• The package should allow setting the out-of-bag (OOB) examples and the random seed for individual
trees in the random forest ensemble, for repeatable results. None of the packages supported this
property.

• The package should be easily integrated into a computational language like Matlab or R [66]. Random
Jungle [102] does not integrate into a computational language; the other packages integrate into
either Matlab or R.
Developed Random Forest Software. As none of the available packages had the desired properties,
we developed our own random forests package based on randomForest-R [78] and randomforest-matlab [67].
The total number of lines of developed code in C++ and Matlab, for the random forest algorithm and our
proposed feature selection algorithms utilizing random forests, is around 25,000. Our software package
allows setting the out-of-bag (OOB) examples and the random seed for individual trees in a random forest
ensemble, and it can be used from Matlab. Next, we discuss the computational performance of the developed
software, which can run on multiple cores of a single- or multi-socket machine and on a cluster of machines.
7 About 350K CPU hours (40 years) of a single 1.8 GHz core of a Core-i7 processor.
                                  Random Jungle   randomForest-R   randomforest-matlab   Matlab’s Treebagger
Multi Threaded (Regression)           2.5X             75X               37.8X                  59X
Multi Threaded (Classification)       6.6X            31.3X               3.7X                 17.8X

Table C.4: Relative execution time for existing software packages compared to our developed multi-threaded
software package, whose execution time is 1X. We enabled 4 threads of a 2.2 GHz i7 processor. Note that
randomForest-R and randomforest-matlab are single-threaded programs; the execution time factors listed
for them in the multi-threaded rows are for the single-threaded version.
We compare the computational performance of the developed software package to existing packages
in Table C.4. Before discussing the table in more detail, note that by the phrase ‘the software package
scales linearly as the number of cores increases’ we mean that the execution time drops linearly as the
number of cores increases. Our developed software has the following properties:
(1) Multi-threaded random forest for single-socket machines: we developed a multi-threaded random forest
software package, for both regression and classification, that scales linearly as the number of cores
increases (for a single-socket machine); our implementation is based on the code by Breiman, which is also
the basis of the code in randomForest-R [78] and randomforest-matlab [67]. In Table C.4, we show relative
execution times for different random forest packages on an artificial dataset consisting of 1000 examples
and 100 features (with the default random forest parameters); the developed software outperforms the other
packages in execution speed; even in the worst case, multi-threaded regression, the developed software is
2.5 times faster than Random Jungle.
(2) Multi-threaded random forest for dual- and quad-socket machines: most of the experiments were
performed in 2012; at that time, most available computers were single-socket machines, though dual- and
quad-socket machines were easily available and affordably priced (< 10,000 USD). The number of cores
available per socket ranged between 4-8 cores (8-16 threads) for Intel-based systems and between 8-16
cores for AMD-based systems.

For multi-socket machines, main memory is shared not only between cores residing on the same socket but
also between cores residing on different sockets. Such a memory arrangement creates a large slowdown when
threads running on cores of different sockets have to access each other’s memory. A technology called
Non-Uniform Memory Access (NUMA) is available for multi-socket machines and helps alleviate memory
contention between cores on multiple sockets; via NUMA, memory can be controlled in a fine-grained manner
by the threads running on the machine. Programs optimized for NUMA, also known as NUMA-aware programs,
run threads on different sockets in such a way that memory transfers are kept to a minimum.

Our developed random forest software package is NUMA-aware and scales linearly as the number of sockets
increases; it was tested on AMD systems with 32 cores and 48 cores, which used dual and quad sockets,
respectively. Note that existing random forest software packages make no claim of being NUMA-aware, and
the penalty of using non-NUMA-aware software on a NUMA system can be severe [7].
(3) Feature selection using random forests implemented via MPI: training a random forest can be done in
parallel on multiple machines (each tree or set of trees created on a separate machine), and this has been
utilized for very large datasets [105]. As the number of examples in the datasets used for this thesis is
in the thousands, it makes sense to create entire forests on individual machines instead of creating trees
on multiple machines and then aggregating the forest, as the overhead of network communication is higher
than the computational8 overhead.

In Chapter 5, we proposed an iterated feature selection algorithm based on random forests. In each step
of our proposed algorithm, we create a large number of random forest models that can be simultaneously
trained on multiple machines. To implement such a parallelized training procedure, we developed a Message
Passing Interface (MPI) program [108]; MPI is a popular API used in clusters for parallel programming.
The speedup obtained by using a cluster of many machines is bounded only by the number of features
explored in each step, and the network overhead in each step is negligible.
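The forest-per-machine pattern can be sketched as follows (our MPI program is C++/Matlab-based; here joblib worker processes stand in for cluster machines, and the dataset, seeds, and worker count are illustrative):

```python
from joblib import Parallel, delayed
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

def train_forest(seed):
    # Each worker (standing in for one cluster machine) trains a complete
    # forest on the full dataset with its own seed; only the small
    # importance vector travels back, keeping communication overhead low.
    rf = RandomForestRegressor(n_estimators=50, random_state=seed).fit(X, y)
    return rf.feature_importances_

# Train four candidate forests in parallel and collate their importances.
importances = Parallel(n_jobs=2)(delayed(train_forest)(s) for s in range(4))
```

In the MPI version, each rank plays the role of one worker and a gather operation collates the per-forest results on the root rank.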
(4) CPU optimizations: using CPU-specific machine intrinsics for prefetching and vector calculations [46],
we were able to increase the speed of our code by about 25%; intrinsics are assembly codes utilizing special
CPU instructions. Prefetching gives explicit hints to the CPU that it should load remote memory into
immediate memory, as that memory location is going to be utilized shortly. Most CPUs have vector units
that can execute multiple9 floating point instructions in the same amount of time required to execute a
single floating point operation. The speedup obtained by using vector instructions was not large10, as the
ratio of the number of memory instructions to the number of floating point instructions is high; that is,
the program is memory-bound.

8 For the datasets used in this thesis, most forests take around a second to be trained.
9 Usually 4 floating point operations using SSE2/3/4 or 8 floating point operations using AVX; SSE2/3/4 and AVX are types of x86 CPU instructions.
10 With vector instructions, one might expect a reduction in execution time by a factor of 4 or 8, rather than the 25% improvement observed.