Comparing different strategies for variable selection in large dimensions
B. Ghattas
Institut de Mathématiques de Luminy, Marseille, France
May 25, 2009, Gent, Belgium
Outline
Motivation.
Statistical Learning.
Feature Selection.
Experiments.
Extensions.
Motivation
We would like to learn a model from data and discover the features that are important for learning. Examples:
Predict exports of Uruguayan meat using information about competing countries and their interactions: exchange rates, production, annual imports and exports, ...
Predict whether a flow through a network router is an email, a sound, an image, a video, ...
Predict whether an individual will suffer a recurrent heart attack or not, knowing his history, his ECG analysis, his plasma analysis, ...
Sparse data
Particular attention is paid to situations where n << p, as with microarray data.
Which genes give the best discrimination between the presence and absence of a cancer?
A three-step approach
First, order the variables. Next, introduce them sequentially into the model, monitoring the evolution of its performance. Finally, localize the optimal number of variables to keep in the model.
[Figure: error curve (err.rs) as a function of the number of variables introduced (Index).]
Empirical Risk Minimization
(X_i, Y_i), i = 1, ..., n, are independent random variables drawn from an unknown distribution P(X, Y), with (X_i, Y_i) ∈ (X, Y) ⊆ (R^p, {1, ..., J}). We suppose that Y is related to X through an unknown function f belonging to a class C.
To estimate f we minimize a loss function L(Y, f(X)) and look for
f* = Argmin_{f ∈ C} E[L(Y, f(X))].
As P is unknown, we apply the Empirical Risk Minimization principle, computing:
f_n = Argmin_{f ∈ C} (1/n) Σ_{i=1}^n L(f(X_i), Y_i)
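The ERM principle can be sketched on a tiny finite class C (here, 1-D threshold classifiers with the 0-1 loss); the class and data are illustrative, not from the slides.

```python
import numpy as np

def erm(X, y, thresholds):
    """Empirical Risk Minimization over a finite class C of 1-D threshold
    classifiers f_t(x) = 1 if x > t else -1, with the 0-1 loss."""
    best_t, best_risk = None, np.inf
    for t in thresholds:
        preds = np.where(X > t, 1, -1)
        risk = np.mean(preds != y)      # (1/n) sum_i L(f(X_i), Y_i)
        if risk < best_risk:
            best_t, best_risk = t, risk
    return best_t, best_risk

# Toy 1-D sample: negatives near 0, positives near 2.
X = np.array([-0.5, 0.1, 0.3, 1.8, 2.1, 2.4])
y = np.array([-1, -1, -1, 1, 1, 1])
t_opt, risk = erm(X, y, thresholds=np.array([-1.0, 0.0, 1.0, 2.0, 3.0]))
```

Here t_opt = 1.0 separates the sample perfectly, so the empirical risk of f_n is zero.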
SVM
Linear Separation, binary case
S = an i.i.d. sample of size n from (X, Y) ⊆ (R^p, {−1, +1}):
S = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} ∈ (X × Y)^n.
We look for a function f(x) = sign(⟨w, x⟩ + b).
SVM
Margin optimization
The margin of a hyperplane is γ = 1/‖w‖.
The optimal hyperplane is the one that achieves the maximum margin over all separating hyperplanes.
SVM
The Optimization problem
Find (w, b) ∈ R^p × R such that:
Minimize_{w,b} ‖w‖²/2
subject to y_i(⟨w, x_i⟩ + b) ≥ 1, ∀i ∈ {1, ..., n}.  (1)
Solution:
w* = Σ_{i=1}^n α*_i y_i x_i = Σ_{i∈sv} α*_i y_i x_i
and
b* = −[ max_{y_i=−1}(⟨w*, x_i⟩) + min_{y_i=+1}(⟨w*, x_i⟩) ] / 2,
where α* = (α*_1, ..., α*_n) are the Lagrangian coefficients and
sv = {i ∈ {1, ..., n} ; α*_i ≠ 0}.
The decision function is:
f_n(x) = sign( Σ_{i∈sv} α*_i y_i ⟨x_i, x⟩ + b* )
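As a sanity check on these formulas (illustrative, not from the slides), here is a two-point toy problem whose hard-margin dual solution is known in closed form (α*_1 = α*_2 = 1/2), with w*, b* and the decision function computed exactly as above:

```python
import numpy as np

# Two-point separable toy set where the hard-margin dual solution is
# known in closed form: alpha*_1 = alpha*_2 = 1/2 (both points are SVs).
X_sv = np.array([[1.0, 0.0], [-1.0, 0.0]])
y_sv = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])            # Lagrangian coefficients

# w* = sum_{i in sv} alpha*_i y_i x_i
w = (alpha * y_sv) @ X_sv

# b* = -( max_{y_i=-1} <w*, x_i> + min_{y_i=+1} <w*, x_i> ) / 2
b = -(np.max(X_sv[y_sv == -1] @ w) + np.min(X_sv[y_sv == 1] @ w)) / 2

def decide(x):
    """f_n(x) = sign( sum_{i in sv} alpha*_i y_i <x_i, x> + b* )"""
    return np.sign((alpha * y_sv) @ (X_sv @ x) + b)
```

This yields w* = (1, 0) and b* = 0, the expected separating hyperplane between the two points.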
SVM
The Kernel Trick
Project the data into a high-dimensional space where linear separation is tractable, using a transformation Φ.
SVM
The Kernel Trick - 2
We do not need to know the explicit expression of Φ: the dot product between observations, defined by the kernel K(z, z′) = ⟨Φ(z), Φ(z′)⟩, is sufficient to compute the separating hyperplane.
f(x) = sign( Σ_{i∈sv} α*_i y_i ⟨Φ(x_i), Φ(x)⟩ + b* ) = sign( Σ_{i∈sv} α*_i y_i K(x_i, x) + b* )
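The kernel trick can be checked numerically on a small case (illustrative): for the degree-2 polynomial kernel K(z, z′) = ⟨z, z′⟩², the explicit feature map Φ(z) = (z1², √2 z1 z2, z2²) satisfies K(z, z′) = ⟨Φ(z), Φ(z′)⟩, so the hyperplane can be computed from K alone.

```python
import numpy as np

def K(z, zp):
    """Degree-2 polynomial kernel: K(z, z') = <z, z'>^2."""
    return (z @ zp) ** 2

def phi(z):
    """Explicit feature map for K on R^2: Phi(z) = (z1^2, sqrt(2) z1 z2, z2^2)."""
    return np.array([z[0] ** 2, np.sqrt(2.0) * z[0] * z[1], z[1] ** 2])

z, zp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
# K computes <Phi(z), Phi(z')> without ever forming Phi explicitly.
```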
SVM
Risk Bounds
• Radius-margin bound, for the LOO error estimate (Vapnik [9]):
L ≤ R²/γ² = R² ‖w*‖²,  (2)
where L is the number of observations misclassified by LOO, γ is the margin, and R is the radius of the smallest ball covering S.
• Span bound (Vapnik and Chapelle [10]):
L ≤ Σ_{i∈sv} α*_i S_i²,  (3)
where the span S_i is the distance between the support vector x_i and a set of constrained linear combinations of the other support vectors.
SVM
Scores
Three scores are commonly used:
The weight vector score: W = ‖w*‖²
The radius score: RW = R² ‖w*‖²
The span score: Spb = Σ_{i=1}^n α*_i S_i²
Each score may be computed at different orders:
"zero-order": the value of the score computed omitting the variable.
"difference-order": the difference between the score with the variable and its value without it.
"first-order": the derivative of the score with respect to artificial weights.
We use Bootstrap mean estimates for each score.
Random Forests
Random Forests, (L. Breiman, 2001)
Draw K bootstrap samples, keeping the out-of-bag (OOB) samples. Construct a maximal tree on each one, using the best split over very few randomly selected variables. Don't prune. Aggregate the trees by averaging (regression) or majority vote (classification).
"Random Input" uses one variable at each split. "Random Features" uses a linear combination of variables with randomly selected coefficients. Weak trees + weak correlation between trees (between their predictions) → a powerful learner.
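A minimal "Random Input"-style sketch (illustrative: real Random Forests grow maximal trees, while depth-1 stumps are used here to keep the code short): bootstrap samples, one randomly chosen variable per tree, no pruning, majority vote.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(X, y, j):
    """Best threshold split on feature j (a depth-1 tree, 0-1 loss)."""
    best = (0.0, 1, np.inf)                    # (threshold, sign, error)
    for t in np.unique(X[:, j]):
        for s in (1, -1):
            err = np.mean(np.where(X[:, j] > t, s, -s) != y)
            if err < best[2]:
                best = (t, s, err)
    return best

def fit_forest(X, y, K=25):
    """K bootstrap samples; each tree splits on one randomly chosen
    variable ("Random Input"); no pruning."""
    n, p = X.shape
    forest = []
    for _ in range(K):
        idx = rng.integers(0, n, n)            # bootstrap sample
        j = int(rng.integers(0, p))            # random variable for the split
        t, s, _ = fit_stump(X[idx], y[idx], j)
        forest.append((j, t, s))
    return forest

def predict(forest, X):
    """Aggregate the trees by majority vote (classification)."""
    votes = sum(np.where(X[:, j] > t, s, -s) for j, t, s in forest)
    return np.sign(votes)

# Toy check: feature 0 carries the signal, feature 1 is pure noise.
X = rng.normal(size=(80, 2))
y = np.where(rng.random(80) < 0.5, 1.0, -1.0)
X[:, 0] += 2.5 * y
forest = fit_forest(X, y)
acc = np.mean(predict(forest, X) == y)
```

Even though roughly half of the stumps split on the noise variable, the majority vote recovers the signal, which is the "weak trees + weak correlation → powerful learner" point above.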
Random Forests
Variable importance
Based on the OOB samples and on the difference in a tree's performance when the values of one variable are randomly permuted.
Consider the prediction error τ_k of the k-th tree of the forest over its OOB sample.
Randomly permute the values of X_j in the OOB sample and use the modified sample for prediction.
Measure the prediction error τ′_k(j) for the modified sample.
The importance measure for variable j is:
I(j) = (1/K) Σ_{k=1}^K (τ_k − τ′_k(j)) / τ_k
Random Forests
Variable importance: comments
Insensitive to the nature of the resampling used (bootstrap samples with or without replacement).
Stable in the presence of correlations between variables.
Invariant to normalization (using the standard deviation of Z_i(j)).
Stable with respect to data perturbations; bootstrapping VI is unnecessary.
GLMpath
The model and its estimation (McCullagh and Nelder [6])
The classification model used:
g(µ) = β_0 + β_1 x_1 + ... + β_p x_p
where Y ∈ {0, 1}, µ = E(Y) = P[Y = 1], and g is a link function.
g(µ) = log(µ/(1 − µ)) gives logistic regression.
The parameters β = (β_0, ..., β_p) are estimated by maximum likelihood.
GLMpath
Regularization (Park and Hastie [7])
Penalize the likelihood with an L1 constraint on the coefficients:
β(λ) = Argmin_β { −log L(x; β) + λ ‖β‖_1 }
where λ > 0 is a regularization parameter. The sequence β(λ), 0 < λ < ∞, is called the path. For λ = ∞ all the coefficients are equal to zero, and increasing λ sets more coefficients to zero.
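Park and Hastie compute the path with a predictor-corrector algorithm; as a dependency-free illustration (not their method), here is a simple proximal-gradient (ISTA) solver for the L1-penalized logistic likelihood, run over a small grid of λ to show coefficients being set to zero as λ grows. Data and solver settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic logistic data: only the first 2 of 6 features are active.
n, p = 200, 6
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

def l1_logistic(lam, n_iter=500, step=0.1):
    """ISTA: a gradient step on -logL/n followed by soft-thresholding
    (the proximal operator of the L1 penalty)."""
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (1.0 / (1.0 + np.exp(-X @ beta)) - y) / n
        z = beta - step * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return beta

# A coarse "path": larger lambda leaves fewer non-zero coefficients.
path = {lam: l1_logistic(lam) for lam in (0.01, 0.1, 1.0)}
```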
GLMpath
Estimation
At each step k of the algorithm, we have values λ_k, β_k:
1. Compute the increment needed to reach λ_{k+1}.
2. "Predictor step": compute a linear approximation β_k^+ of β_{k+1}.
3. "Corrector step": use convex optimization to compute β_{k+1}, with initial value β_k^+.
4. Test whether the set of active variables (those with non-zero coefficients) must change.
GLMpath
Variable importance
Use B = 500 bootstrap samples.
Compute the optimal penalized GLM and keep its coefficients.
The importance of variable j is the absolute value of its coefficient's bootstrap mean β^B_j.
Variables whose coefficient bootstrap mean is zero are not used for comparisons.
Stepwise selection
Stepwise, SFS, SBS, SFFS, ...
Advantage: does not require a specific model, only a monotonic criterion over a set of variables.
Drawbacks: computational complexity; dependence on the order of the variables in the data.
SVM-RFE
SVM-RFE (Guyon et al. [4], Rakotomamonjy [8])
While there are still variables:
Learn an SVM and sort the variables using the score ‖w‖² by differences.
Estimate its misclassification error.
Eliminate the least important half of the variables if more than 100 are kept.
For the last 100 variables, eliminate them recursively one by one.
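A sketch of the RFE loop (illustrative: the SVM is replaced by an ordinary least-squares linear model so the code needs only numpy, and the cutoff of 100 becomes 4 for the toy size; ranking still uses w_j² as in the slide):

```python
import numpy as np

rng = np.random.default_rng(3)

# y depends only on features 0 and 1; the other 14 are noise.
n, p = 100, 16
X = rng.normal(size=(n, p))
y = np.sign(X[:, 0] + 0.8 * X[:, 1] + 0.3 * rng.normal(size=n))

def rfe(X, y, cutoff=4):
    """RFE loop: refit a linear model, score variables by w_j^2, drop the
    least important half while more than `cutoff` remain, then 1 by 1."""
    active = list(range(X.shape[1]))
    eliminated = []                            # worst first
    while active:
        w, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        order = np.argsort(w ** 2)             # least important first
        k = len(active) // 2 if len(active) > cutoff else 1
        for i in sorted(order[:k], reverse=True):
            eliminated.append(active.pop(i))
    return eliminated[::-1]                    # most important first

hierarchy = rfe(X, y)
```

The two informative variables survive every halving round and end up at the top of the hierarchy.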
Our procedure
D = learning sample; B = 200 bootstrap samples.
Compute the score(D, B) to get a hierarchy X(1), ..., X(p).
For k = 1, ..., p:
  For l = 1, ..., 50:
    Randomly split D with stratification into D = A_l ∪ T_l,
    where A_l is the learning sample and T_l the test sample.
    M_l^k = f(X(1), ..., X(k), A_l)
    Er_l^k = Test(M_l^k, T_l)
  Er^k = (1/50) Σ_{l=1}^{50} Er_l^k
k_opt = Argmin_k {Er^k}.
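A compact sketch of this procedure (illustrative: a nearest-centroid classifier stands in for the learner f, the splits are plain random rather than stratified, and 5 splits replace the 50 of the slides):

```python
import numpy as np

rng = np.random.default_rng(4)

def centroid_error(Xa, ya, Xt, yt):
    """Nearest-centroid stand-in for the learner f: fit on (Xa, ya),
    return the misclassification rate on the test sample (Xt, yt)."""
    mu_pos = Xa[ya == 1].mean(axis=0)
    mu_neg = Xa[ya == -1].mean(axis=0)
    d = ((Xt - mu_neg) ** 2).sum(1) - ((Xt - mu_pos) ** 2).sum(1)
    return np.mean(np.where(d > 0, 1.0, -1.0) != yt)

def select_k(X, y, hierarchy, n_splits=5):
    """For each k, average the test error of a model built on the top-k
    variables of the hierarchy over random splits; return k_opt."""
    n = len(y)
    errs = []
    for k in range(1, X.shape[1] + 1):
        cols = hierarchy[:k]
        e = []
        for _ in range(n_splits):
            perm = rng.permutation(n)
            A, T = perm[: 2 * n // 3], perm[2 * n // 3:]
            e.append(centroid_error(X[A][:, cols], y[A], X[T][:, cols], y[T]))
        errs.append(float(np.mean(e)))
    k_opt = int(np.argmin(errs)) + 1
    return k_opt, errs

# Toy check: 2 informative variables out of 8, hierarchy given as 0..7.
X = rng.normal(size=(150, 8))
y = np.where(rng.random(150) < 0.5, 1.0, -1.0)
X[:, 0] += 2.0 * y
X[:, 1] += 1.5 * y
k_opt, errs = select_k(X, y, hierarchy=list(range(8)))
```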
Toys data [11]
Toys
Two classes {−1, 1} with equal probability. With probability 0.7:
x_i ∼ y N(i, 1), i = 1, 2, 3
x_i ∼ y N(0, 1), i = 4, 5, 6
otherwise:
x_i ∼ y N(0, 1), i = 1, 2, 3
x_i ∼ y N(i − 3, 1), i = 4, 5, 6
and x_i ∼ N(0, 20), i = 7, ..., p.
These data are linearly separable with high probability, which decreases with the sample size.
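The generator above can be written directly (a sketch; N(0, 20) is read here as variance 20, which is an assumption about the notation):

```python
import numpy as np

def toys(n, p, seed=5):
    """Toys data of Weston et al. [11]: 6 relevant variables, p - 6 noise."""
    rng = np.random.default_rng(seed)
    y = np.where(rng.random(n) < 0.5, 1, -1)
    # Noise variables x_i ~ N(0, 20), i = 7, ..., p (read as variance 20).
    X = rng.normal(0.0, np.sqrt(20.0), size=(n, p))
    for obs in range(n):
        if rng.random() < 0.7:
            means = np.array([1.0, 2.0, 3.0, 0.0, 0.0, 0.0])
        else:
            means = np.array([0.0, 0.0, 0.0, 1.0, 2.0, 3.0])
        X[obs, :6] = y[obs] * (means + rng.normal(size=6))
    return X, y

X, y = toys(50, 200)
```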
Toys data [11]
[Figure: pairwise scatterplots of variables X1 through X6.]
Toys data [11]
Hierarchy, varying n
Ranks at which the three important variables 4, 5 and 6 appeared in the hierarchy. We used p = 200, B = 200 and n = 50, 100, 200.
n/Score   FDS       ∂W        ∂RW       ∂Spb      RF        GLMpath
50        4, 6, 13  4, 5, 17  4, 5, 16  4, 5, 12  4, 6, 12  4, 5, 8
100       4, 5, 6   4, 5, 7   4, 5, 6   4, 5, 6   4, 5, 6   4, 5, 6
200       4, 5, 6   4, 5, 6   4, 5, 6   4, 5, 6   4, 5, 9   4, 5, 6
Toys data [11]
Hierarchy, varying p
n = 50, B = 200, p = 500, 1000.
p/Score   FDS         ∂W          ∂RW         ∂Spb        RF           GLMpath
500       4, 5, 18    4, 7, 13    4, 7, 12    4, 5, 11    5, 12, 42    4, 5, 6
1000      4, 34, 173  4, 33, 194  4, 32, 202  4, 31, 224  4, 205, 206  4, 35, 38
Toys data [11]
Rank Correlations
200 observations, 200 variables.
        ∂W      ∂RW     ∂Spb     RF      GLMpath
FDS     0.467   0.390   -0.216   0.180   0.542
∂W      1       0.685   -0.410   0.132   0.944
∂RW             1       -0.267   0.205   0.682
∂Spb                    1        0.056   -0.484
RF                               1       0.161
50 observations, 1000 variables.
        ∂W      ∂RW     ∂Spb    RF      GLMpath
FDS     0.918   0.873   0.604   0.093   0.705
∂W      1       0.925   0.664   0.074   0.725
∂RW             1       0.622   0.073   0.702
∂Spb                    1       0.083   0.567
RF                              1       0.086
Toys data [11]
Performances
Score/(n,p)   (50,200)    (100,200)   (200,200)    (50,500)     (50,1000)
FDS           0.0208(6)   0.0072(7)   0.0048(7)    0.0044(5)    0.0084(5)
∂W            0.0084(5)   0.012(6)    0.0048(7)    0.008(7)     0.0084(5)
∂RW           0.0084(5)   0.0072(7)   0.0048(7)    0.008(7)     0.0076(6)
∂Spb          0.0084(5)   0.0096(6)   0.0044(8)    0.0044(5)    0.0084(5)
SVM-RFE       0.0476(8)   0.016(8)    0.006(4)     0.0132(8)    0.0104(4)
GLMpath       0.0188(1)   0.0252(3)   0.0074(4)    0.008(4)     0.0192(2)
RF            0.044(3)    0.0272(6)   0.0064(25)   0.0252(12)   0.0656(4)

Table: errors estimated over 50 stratified test samples, or by CV (GLMpath).
Toys data [11]
Sample size effects: 50 stratified test samples, p = 200.
[Figure: error rate vs. number of selected variables (log scale) for n = 50, 100, 200 observations and p = 200 variables; panels for FDS, the SVM scores ∂W, ∂RW, ∂Spb, and FA.]
Toys data [11]
Effect of the number of variables: p = 500, 1000, n = 50.
[Figure: error rate vs. number of selected variables (log scale) for n = 50 observations and p = 500, 1000 variables; panels for FDS, the SVM scores ∂W, ∂RW, ∂Spb, and FA.]
Microarray datasets
Data sets
Data       p      learning  test  n (+1/-1)
Colon      2000   62        –     22/40
Lymphoma   4026   96        –     62/34
Prostate   12600  102       –     52/50
Leukemia   7129   38        34    27/11 – 20/14
Microarray datasets
Hierarchies comparison
Zero coefficients: Colon 999, Lymphoma 1376, Leukemia 1190, Prostate 2234. x-axis: normalized rank; y-axis: proportion of common variables for the compared methods.
[Figure: proportion of common variables vs. normalized rank for Colon and Lymphoma; curves for SVM, SVM/FA, SVM/GLMpath and FA/GLMpath.]
Microarray datasets
Hierarchies comparison 2
[Figure: proportion of common variables vs. normalized rank for Prostate and Leukemia; curves for SVM, SVM/FA, SVM/GLMpath and FA/GLMpath.]
Microarray datasets
Common Variables
Comparison/data  Colon  Lymphoma  Prostate  Leukemia
SVM              37     37        32        30
SVM/GLMpath      33     26        24        21
SVM/RF           4      9         12        9
RF/GLMpath       10     12        16        21
Microarray datasets
Results, real data sets
Score/Data   Colon        Lymphoma     Prostate      Leukemia
FDS          0.1219(3)    0.0436(200)  0.0371(315)   0.0882(7)
∂W           0.0009(31)   0(186)       0.0269(83)    0.1176(2)
∂RW          0.0029(33)   0(60)        0.0269(902)   0.0882(22)
∂Spb         0.0029(34)   0.0006(118)  0.0109(45)    0.1176(11)
SVM-RFE      0.0057(32)   0(64)        0(64)         0.0882(1)
GLMpath      0.064(2)     0(3)         0(3)          0(1)
RF           0.0962(55)   0.0588(73)   0.0554(7)     0.0588(103)

Colon: 0.17, Lymphoma: 0.06, Prostate: 0.075, Leukemia: 0.20588.
Microarray datasets
Selection bias
D = data set; B = number of bootstrap samples.
Partition D with stratification into D_1, ..., D_10, and set D_{−j} = D − D_j.
For j = 1, ..., 10:
  Score(D_{−j}, B) and use the hierarchy X(1), ..., X(p).
  For k = 1, ..., p:
    M^k = f(X(1), ..., X(k))
    Er^k = TestRS(M^k, D_{−j})
  kopt_j = Argmin_k {Er^k}
  er_j = mean error of M^{kopt_j} over D_j.
Compute er = (1/10) Σ_{j=1}^{10} er_j.
Microarray datasets
Results
Data      Colon          Lymphoma      Prostate
FDS       0.1595(15.1)   0.1233(83.7)  0.0882(126.4)
∂W        0.233(35.1)    0.051(86.5)   0.054(756.6)
∂RW       0.214(43.3)    0.042(71)     0.053(573.3)
∂Spb      0.197(31.8)    0.073(70.5)   0.052(95.5)
SVM-RFE   0.1452(26.4)   0.0878(16.8)  0.0582(43.2)
GLMpath   0.1809(1.3)    0.0522(2.8)   0.05909(1.6)
RF        0.106(49.8)    0.052(65.9)   0.059(81)
Non linear Separation
[Figure: error rate vs. number of selected variables (log scale), 80 observations, panels for 50, 100, 200 and 300 variables; curves for FDS, ∂W, ∂RW and ∂Spb.]
Multiclass separation
One versus others
Here Y ∈ {1, ..., J}. We construct J hyperplanes, each learning the simplified binary classification of one class against all the others.
Multiclass separation
One versus one
We construct the J(J − 1)/2 hyperplanes, each learning the simplified binary classification of one class against another.
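A sketch of the one-versus-one scheme with majority voting (illustrative: nearest-centroid classifiers stand in for the pairwise hyperplanes):

```python
import numpy as np

rng = np.random.default_rng(6)

def fit_ovo(X, y):
    """Train J(J-1)/2 pairwise models, one per unordered pair of classes
    (nearest-centroid stand-ins for the pairwise hyperplanes)."""
    classes = np.unique(y)
    models = {}
    for i, a in enumerate(classes):
        for b in classes[i + 1:]:
            models[(a, b)] = (X[y == a].mean(0), X[y == b].mean(0))
    return models

def predict_ovo(models, x):
    """Each pairwise model votes for one class; most votes wins."""
    votes = {}
    for (a, b), (mu_a, mu_b) in models.items():
        winner = a if np.sum((x - mu_a) ** 2) < np.sum((x - mu_b) ** 2) else b
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)

# Toy check: J = 3 well-separated classes -> 3 pairwise models.
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
y = np.repeat([0, 1, 2], 20)
X = centers[y] + 0.3 * rng.normal(size=(60, 2))
models = fit_ovo(X, y)
```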
Multiclass separation
Multiclass procedures
Multiclass separation
Multiclass linear separable data
[Figure: multiclass linearly separable data in the (x1, x2) plane, for m = 3, 4, 5, 6 classes.]
Multiclass separation
Multiclass linear feature selection
[Figure: error rate vs. number of selected variables (log scale) for m = 6 classes and 500, 1000 variables, under the RADAG, Hamming-decoding and Loss-decoding schemes.]
Multiclass separation
Application: Malaria discrimination
∂W        ∂RW            ∂Spb
Ccl25     Tagap1         Hmgn1
Tagap1    Ccl25          Ccl25
Hmgn1     Hmgn1          Tagap1
Ppp1r9b   Ppp1r9b        Rpl22
Itgb1     Itgb1          Frap1
Ndufa7    Zfpn1a4        Pmm1
Myo10     Chchd1         Bzw1
Ppp1r12c  0610037H22Rik  Itgb1
Frap1     Rpl22          Ppp1r9b
Rapgef6   Frap1          Cdc37l1
Tubb3     Myo10          0610037H22Rik
Rpl22     Rapgef6        Tubb3

Table: Optimal sets: selected genes in decreasing order of importance.
Multiclass separation
Perspectives
Thank you for your attention.
Bibliography
[1] L. Breiman. Random forests. Machine Learning, 45:5–32, 2001.
[2] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1-3):131–159, 2002.
[3] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.
[4] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389–422, 2002.
[5] P. Langley. Selection of relevant features in machine learning. In AAAI Fall Symposium on Relevance, pages 140–144, New Orleans, 1994.
[6] P. McCullagh and J. Nelder. Generalized Linear Models. Chapman & Hall/CRC, Boca Raton, 1989.
[7] M. Y. Park and T. Hastie. L1 regularization path algorithm for generalized linear models. Technical report, Stanford University, February 2006.
[8] A. Rakotomamonjy. Variable selection using SVM-based criteria. Journal of Machine Learning Research, 3:1357–1370, 2003.
[9] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.
[10] V. Vapnik and O. Chapelle. Bounds on error expectation for support vector machines. Neural Computation, 12(9), 2000.
[11] J. Weston, A. Elisseeff, B. Schölkopf, and M. Tipping. Use of the zero norm with linear models and kernel methods. Journal of Machine Learning Research, 3:1439–1461, 2003.