Comparing different strategies for variable selection in large dimensions


B. Ghattas

Institut de Mathématiques de Luminy, Marseille, France

May 25, 2009, Gent, Belgium


Outline

Motivation.

Statistical Learning.

Feature Selection.

Experiments.

Extensions.


Motivation

We would like to learn a model from data and discover the features that are important for learning. Examples:

Predict exports of Uruguayan meat using information about competing countries and their interactions: exchange rates, production, annual imports and exports, ...
Predict whether a flow within a network router is an email, a sound, an image, a video, ...
Predict whether or not an individual will have a recurrent heart attack, knowing his history, his ECG analysis, his plasma analysis, ...


Sparse data

Particular attention is paid to situations where n << p, as in microarray data.

Which genes best discriminate between the presence and absence of a cancer?


A three-step approach

First, order the variables. Next, introduce them sequentially into the model, monitoring the evolution of its performance. Finally, locate the optimal number of variables to keep in the model.

[Figure: test error (err.rs) versus the index of the introduced variable (0 to 600).]


Empirical Risk Minimization

(X_i, Y_i), i = 1,...,n, are independent random variables drawn from an unknown distribution P(X, Y), with (X_i, Y_i) ∈ (X, Y) ⊆ (R^p, {1, ..., J}). We suppose that Y is related to X through an unknown function f belonging to a class C.

To estimate f we choose a loss function L(Y, f(X)) and look for f* = Argmin_{f ∈ C} E[L]. As P is unknown, we use the Empirical Risk Minimization principle, computing:

f_n = Argmin_{f ∈ C} (1/n) ∑_{i=1}^{n} L(f(X_i), Y_i)
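As a concrete illustration of the ERM principle (my addition, not from the talk), here is a minimal Python sketch that minimizes the empirical 0-1 risk over a simple assumed class C of threshold classifiers; the data and the class are illustrative choices only.

```python
import numpy as np

# Minimal ERM sketch: C is the class of classifiers f_t(x) = sign(x - t),
# and we pick the threshold t minimizing the empirical 0-1 risk on the sample.
rng = np.random.default_rng(0)
n = 200
y = rng.choice([-1, 1], size=n)
x = y + rng.normal(0, 1, size=n)          # 1-D feature correlated with the label

def empirical_risk(t):
    predictions = np.where(x > t, 1, -1)
    return np.mean(predictions != y)       # average 0-1 loss over the sample

thresholds = np.linspace(x.min(), x.max(), 500)
risks = [empirical_risk(t) for t in thresholds]
t_star = thresholds[int(np.argmin(risks))]
print(f"ERM threshold: {t_star:.3f}, empirical risk: {min(risks):.3f}")
```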


SVM

Linear Separation, binary case

S = an i.i.d. sample of size n from (X, Y) ⊆ (R^p, {−1, +1}),

S = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} ⊆ (X × Y)^n.

We look for a function f(x) = sign(⟨w, x⟩ + b).


Margin optimization

The margin of a separating hyperplane is γ = 1/‖w‖.

The optimal hyperplane is the one that achieves the maximum margin over all separating hyperplanes.


The Optimization problem

Find (w, b) ∈ R^p × R such that:

Minimize_{w,b}  ‖w‖² / 2
subject to  y_i (⟨w, x_i⟩ + b) ≥ 1  for all i ∈ {1, ..., n}.   (1)

Solution:

w* = ∑_{i=1}^{n} α*_i y_i x_i = ∑_{i∈sv} α*_i y_i x_i

and

b* = − [ max_{y_i=−1} ⟨w*, x_i⟩ + min_{y_i=+1} ⟨w*, x_i⟩ ] / 2,

where α* = (α*_1, ..., α*_n) are the Lagrange multipliers and sv = {i ∈ {1, ..., n} : α*_i ≠ 0}.

The decision function is:

f_n(x) = sign( ∑_{i∈sv} α*_i y_i ⟨x_i, x⟩ + b* )
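A brief numerical sketch (my own, not part of the slides) of this primal/dual correspondence using scikit-learn's linear SVC: the dual coefficients returned by the solver are the products α*_i y_i, so w* can be rebuilt from the support vectors; a large C is assumed to approximate the hard-margin problem.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Two linearly separable Gaussian clouds in R^2
X = np.vstack([rng.normal(-2, 0.5, (30, 2)), rng.normal(2, 0.5, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)

clf = SVC(kernel="linear", C=1e6).fit(X, y)       # large C ~ hard margin

# dual_coef_ holds alpha_i * y_i for the support vectors
w_star = clf.dual_coef_ @ clf.support_vectors_    # = sum_i alpha_i y_i x_i
print("w* from duals:", w_star.ravel())
print("w* from coef_:", clf.coef_.ravel())        # identical up to numerics
print("b*:", clf.intercept_[0])
print("support vector indices:", clf.support_)
```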


The Kernel Trick

Project the data into a high-dimensional space where linear separation becomes tractable, using a transformation Φ.


The Kernel Trick - 2

We do not need to know the explicit expression of Φ: the dot product between observations defined by the kernel K(z, z′) = ⟨Φ(z), Φ(z′)⟩ is sufficient to compute the separating hyperplane.

f(x) = sign( ∑_{i∈sv} α*_i y_i ⟨Φ(x_i), Φ(x)⟩ + b* ) = sign( ∑_{i∈sv} α*_i y_i K(x_i, x) + b* )
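To make the kernel trick concrete, a short check (my addition, with an RBF kernel as an assumed example) that the SVM decision values can be recomputed from kernel evaluations on the support vectors alone:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

# Decision values rebuilt from K(x_i, x) evaluated on the support vectors only
K = rbf_kernel(X, clf.support_vectors_, gamma=1.0)          # shape (n, n_sv)
decision_manual = K @ clf.dual_coef_.ravel() + clf.intercept_[0]

print(np.allclose(decision_manual, clf.decision_function(X)))   # True
```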


Risk Bounds

• Radius-margin bound, for the leave-one-out (LOO) error estimation (Vapnik [9]):

L ≤ R² / γ² = R² ‖w*‖²,   (2)

where L is the number of observations misclassified by LOO, γ the margin, and R the radius of the smallest ball covering S.

• Span bound (Vapnik and Chapelle [10]):

L ≤ ∑_{i∈sv} α*_i S_i²,   (3)

where the span S_i is the distance between the support vector x_i and a set of constrained linear combinations of the other support vectors.
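A rough numerical illustration (mine, not from the talk) of the radius-margin quantity R²‖w*‖²; here R is crudely approximated by the largest distance to the data centroid, which over-estimates the minimal enclosing radius, so the value printed is only an upper approximation of the quantity in (2).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.5, 0.6, (40, 5)), rng.normal(1.5, 0.6, (40, 5))])
y = np.array([-1] * 40 + [1] * 40)

clf = SVC(kernel="linear", C=1e6).fit(X, y)
w_norm_sq = float(np.sum(clf.coef_ ** 2))        # ||w*||^2 = 1 / gamma^2

# Crude radius estimate: largest distance to the centroid (>= true minimal-ball radius)
R = np.max(np.linalg.norm(X - X.mean(axis=0), axis=1))
print("approximate radius-margin value R^2 * ||w*||^2 :", R**2 * w_norm_sq)
```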


Scores

Three scores are commonly used:

The weight vector score: W = ‖w*‖²
The radius score: RW = R² ‖w*‖²
The span score: Spb = ∑_{i=1}^{n} α*_i S_i²

Each score may be computed at different orders:

"zero-order": the value of the score computed omitting that variable.
"difference-order": the difference between the score computed with that variable and its value without it.
"first-order": the derivative of the score w.r.t. artificial weights.

We use bootstrap mean estimates for each score.
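A sketch (my own, under simplifying assumptions) of a bootstrap mean estimate of the difference-order W score: on each bootstrap sample a linear SVM is fitted with all variables and then with variable j deleted, and the score is the drop in ‖w‖². The default number of bootstrap samples and the deletion strategy are illustrative choices, not the talk's exact settings.

```python
import numpy as np
from sklearn.svm import SVC

def w_score(X, y, C=1.0):
    """W = ||w*||^2 for a linear SVM fitted on (X, y)."""
    clf = SVC(kernel="linear", C=C).fit(X, y)
    return float(np.sum(clf.coef_ ** 2))

def difference_order_W(X, y, B=50, seed=0):
    """Bootstrap mean of the difference-order W score for each variable."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    scores = np.zeros((B, p))
    for b in range(B):
        idx = rng.integers(0, n, size=n)              # bootstrap sample
        Xb, yb = X[idx], y[idx]
        full = w_score(Xb, yb)
        for j in range(p):
            reduced = w_score(np.delete(Xb, j, axis=1), yb)
            scores[b, j] = full - reduced             # drop in ||w||^2 without variable j
    return scores.mean(axis=0)
```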


Random Forests (L. Breiman, 2001)

Draw K bootstrap samples, keeping the out-of-bag (OOB) samples.
Construct a maximal tree on each one, choosing the best split among very few randomly selected variables.
Do not prune.
Aggregate the trees using the mean (regression) or a majority vote (classification).

"Random Input" uses one variable at each split. "Random Features" uses a linear combination of variables with randomly selected coefficients.
Weak trees + weak correlation between trees (between their predictions) → powerful learner.


Variable importance

Based on OOB samples, and on the difference in the performance of a tree when the values of one variable are randomly permuted.

Consider the prediction error τ_k of the k-th tree of the forest over its OOB sample.

Randomly permute the values of X_j in the OOB sample and use the modified sample for prediction.

Measure the prediction error τ′_k(j) on the modified sample.

The importance measure for variable j is:

I(j) = (1/K) ∑_{k=1}^{K} (τ_k − τ′_k(j)) / τ_k
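A compact sketch (my own) of this permutation importance; for simplicity it permutes on a single held-out validation set rather than on the per-tree OOB samples used in the slides, and reports the mean increase in error per variable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, n_informative=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
rng = np.random.default_rng(0)

def permutation_importance_rf(model, X_val, y_val, n_repeats=10):
    base_err = np.mean(model.predict(X_val) != y_val)
    importances = np.zeros(X_val.shape[1])
    for j in range(X_val.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X_val.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # break the link between X_j and y
            increases.append(np.mean(model.predict(X_perm) != y_val) - base_err)
        importances[j] = np.mean(increases)                # mean increase in error
    return importances

print(np.argsort(permutation_importance_rf(rf, X_val, y_val))[::-1][:5])
```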


Variable importance: comments

Insensitive to the nature of the resampling used (bootstrap samples with or without replacement).

Stable in the presence of correlations between variables.

Invariant to normalization (using the standard deviation of Z_i(j)).

Stable w.r.t. data perturbations; bootstrapping the variable importance is unnecessary.


GLMpath

The model and its estimation (McCullagh et al. [6])

The classification model used:

g(µ) = β_0 + β_1 x_1 + ... + β_p x_p

where Y ∈ {0, 1}, µ = E(Y) = P[Y = 1], and g is a link function.

g(µ) = log( µ / (1 − µ) ) gives logistic regression.
The parameters β = (β_0, ..., β_p) are estimated by maximum likelihood.


Regularization, Park et al. [7] (2006)

Penalize the likelihood with an L1 constraint on the coefficients:

β(λ) = argmin_β { −log L(x; β) + λ ‖β‖₁ }

where λ > 0 is a regularization parameter.
The sequence β(λ), 0 < λ < ∞, is called the path.
For λ = ∞ all the coefficients are equal to zero.
Increasing λ sets more coefficients to zero.
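The slides rely on the glmpath predictor-corrector algorithm; as a rough stand-in (my addition, not the same algorithm), a similar L1 path can be traced with scikit-learn's penalized logistic regression by sweeping the regularization strength (C = 1/λ) and watching variables enter the model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=30, n_informative=5, random_state=0)

# Sweep lambda from strong to weak penalization (C = 1/lambda).
for C in [0.01, 0.05, 0.1, 0.5, 1.0, 10.0]:
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
    n_active = np.sum(model.coef_ != 0)
    print(f"lambda = {1 / C:6.1f}  ->  {n_active} active variables")
```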


Estimation

At each step k of the algorithm, we have values for λ_k and β^k.

1. Compute the increment needed to reach λ_{k+1}.
2. "Predictor step": compute a linear approximation β^{k+} of β^{k+1}.
3. "Corrector step": use convex optimization to compute β^{k+1}, with initial value β^{k+}.
4. Test whether the set of active variables (those with non-zero coefficients) must change.


Variable importance

Use B = 500 bootstrap samples.

Compute the optimal GLM-penalized model and keep its coefficients.

The importance of variable j is the absolute value of its coefficient's bootstrap mean, β̄^B_j.

Variables whose coefficient bootstrap mean is zero are not used for the comparisons.
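A sketch of this bootstrap importance (mine; the per-sample model selection is simplified to a fixed penalty rather than the optimal one, and B is reduced for speed):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=150, n_features=40, n_informative=5, random_state=0)

def glm_bootstrap_importance(X, y, B=100, C=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    coefs = np.zeros((B, p))
    for b in range(B):
        idx = rng.integers(0, n, size=n)                 # bootstrap sample
        model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        coefs[b] = model.fit(X[idx], y[idx]).coef_.ravel()
    return np.abs(coefs.mean(axis=0))                    # importance = |bootstrap mean|

importance = glm_bootstrap_importance(X, y)
print("top 5 variables:", np.argsort(importance)[::-1][:5])
```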


Stepwise selection

Stepwise, SFS, SBS, SFFS, ...

Advantage: no specific model is needed, only a monotonic criterion over a set of variables.

Drawbacks: computational complexity; the result depends on the order of the variables in the data.


SVM-RFE (Guyon et al. [4], Rakotomamonjy [8])

While there are still variables:
Learn an SVM and sort the variables using the score ‖w‖² by differences.
Estimate its misclassification error.
Eliminate the least important half of the variables, as long as more than 100 are kept.

For the last 100 variables, eliminate them recursively one by one.
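A recursive-elimination sketch (my own simplification: it drops half of the remaining variables per round based on the squared weights of a linear SVM, without the switch to one-by-one elimination below 100 variables described above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=64, n_informative=5, random_state=0)

def svm_rfe(X, y, n_keep=4, C=1.0):
    """Return surviving column indices after recursive halving by |w_j|^2."""
    remaining = np.arange(X.shape[1])
    while len(remaining) > n_keep:
        clf = SVC(kernel="linear", C=C).fit(X[:, remaining], y)
        ranking = np.argsort(clf.coef_.ravel() ** 2)              # least important first
        n_drop = min(max(1, len(remaining) // 2), len(remaining) - n_keep)
        remaining = np.sort(np.delete(remaining, ranking[:n_drop]))
    return remaining

print("selected variables:", svm_rfe(X, y))
```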


Our procedure

D = learning sample. B = 200 bootstrap samples.
Compute Score(D, B) to obtain a hierarchy X(1), ..., X(p).
For k = 1, ..., p:
  For l = 1, ..., 50:
    Randomly split with stratification D = A_l ∪ T_l,
    A_l being the learning sample and T_l the test sample.
    M_l^k = f(X(1), ..., X(k), A_l)
    Er_l^k = Test(M_l^k, T_l)
  Er^k = (1/50) ∑_{l=1}^{50} Er_l^k
k_opt = Argmin_k {Er^k}.
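A compact sketch (my own) of the second and third steps, assuming a ranking of the variables is already available and taking a linear SVM as the learner f; the 50 splits follow the slide, while the 1/3 test fraction is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def select_k_opt(X, y, ranking, n_splits=50, test_size=1 / 3, seed=0):
    """Add ranked variables one by one and locate the error minimum."""
    errors = []
    for k in range(1, len(ranking) + 1):
        cols = ranking[:k]                                # first k variables of the hierarchy
        split_errors = []
        for l in range(n_splits):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X[:, cols], y, test_size=test_size, stratify=y, random_state=seed + l)
            model = SVC(kernel="linear").fit(X_tr, y_tr)
            split_errors.append(np.mean(model.predict(X_te) != y_te))
        errors.append(np.mean(split_errors))
    k_opt = int(np.argmin(errors)) + 1
    return k_opt, errors
```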


Toys data [11]

Toys

Two classes {−1, 1} with equal probability.

With probability 0.7:
x_i ∼ y N(i, 1), i = 1, 2, 3
x_i ∼ y N(0, 1), i = 4, 5, 6
otherwise:
x_i ∼ y N(0, 1), i = 1, 2, 3
x_i ∼ y N(i − 3, 1), i = 4, 5, 6

x_i ∼ N(0, 20), i = 7, ..., p

These data are linearly separable with high probability, which decreases with the sample size.
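A simulation sketch of this generator (my own reading of the definition above, taking the informative variables to be the first six):

```python
import numpy as np

def make_toys(n=100, p=200, seed=0):
    """Simulate the 'Toys' data of Weston et al. [11] as described above."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n)
    X = rng.normal(0, 20, size=(n, p))                  # noise variables 7..p
    for obs in range(n):
        if rng.random() < 0.7:
            for i in range(1, 4):                        # x_1..x_3 informative
                X[obs, i - 1] = y[obs] * rng.normal(i, 1)
            for i in range(4, 7):                        # x_4..x_6 uninformative
                X[obs, i - 1] = y[obs] * rng.normal(0, 1)
        else:
            for i in range(1, 4):
                X[obs, i - 1] = y[obs] * rng.normal(0, 1)
            for i in range(4, 7):
                X[obs, i - 1] = y[obs] * rng.normal(i - 3, 1)
    return X, y

X, y = make_toys(n=50, p=200)
```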


[Figure: pairwise scatterplots of the first six variables X1-X6 for a simulated Toys sample.]


Hierarchy, varying n

Ranks at which 4, 5 and 6 of the important variables have appeared in the hierarchy. We used p = 200, B = 200, n = 50, 100, 200.

n / Score   FDS        ∂W         ∂RW        ∂Spb       RF         GLMpath
50          4, 6, 13   4, 5, 17   4, 5, 16   4, 5, 12   4, 6, 12   4, 5, 8
100         4, 5, 6    4, 5, 7    4, 5, 6    4, 5, 6    4, 5, 6    4, 5, 6
200         4, 5, 6    4, 5, 6    4, 5, 6    4, 5, 6    4, 5, 9    4, 5, 6


Hierarchy, varying p

n = 50, B = 200, p = 500, 1000.

p / Score   FDS          ∂W           ∂RW          ∂Spb         RF            GLMpath
500         4, 5, 18     4, 7, 13     4, 7, 12     4, 5, 11     5, 12, 42     4, 5, 6
1000        4, 34, 173   4, 33, 194   4, 32, 202   4, 31, 224   4, 205, 206   4, 35, 38


Rank Correlations

200 observations, 200 variables.

          ∂W      ∂RW     ∂Spb     RF      GLMpath
FDS       0.467   0.390   -0.216   0.180   0.542
∂W        1       0.685   -0.410   0.132   0.944
∂RW               1       -0.267   0.205   0.682
∂Spb                      1        0.056   -0.484
RF                                 1       0.161

50 observations, 1000 variables.

          ∂W      ∂RW     ∂Spb    RF      GLMpath
FDS       0.918   0.873   0.604   0.093   0.705
∂W        1       0.925   0.664   0.074   0.725
∂RW               1       0.622   0.073   0.702
∂Spb                      1       0.083   0.567
RF                                1       0.086


Performances

Score/(n, p)   (50,200)    (100,200)   (200,200)    (50,500)     (50,1000)
FDS            0.0208(6)   0.0072(7)   0.0048(7)    0.0044(5)    0.0084(5)
∂W             0.0084(5)   0.012(6)    0.0048(7)    0.008(7)     0.0084(5)
∂RW            0.0084(5)   0.0072(7)   0.0048(7)    0.008(7)     0.0076(6)
∂Spb           0.0084(5)   0.0096(6)   0.0044(8)    0.0044(5)    0.0084(5)
SVM-RFE        0.0476(8)   0.016(8)    0.006(4)     0.0132(8)    0.0104(4)
GLMpath        0.0188(1)   0.0252(3)   0.0074(4)    0.008(4)     0.0192(2)
RF             0.044(3)    0.0272(6)   0.0064(25)   0.0252(12)   0.0656(4)

Table: errors estimated over 50 stratified test samples, or by CV for GLMpath.


Sample size effects, 50 stratified test samples, p = 200.

[Figure: error rate versus number of selected variables (log scale), for n = 50, 100 and 200 observations; panels for FDS, for ∂W/∂RW/∂Spb and for FA.]


Number of variables effect. p = 500, 1000, n = 50.

[Figure: error rate (50 observations) versus number of selected variables (log scale), for p = 500 and p = 1000; panels for FDS, for ∂W/∂RW/∂Spb and for FA.]


Microarray datasets

Data sets

Data       p       learning   test   n (+1/−1)
Colon      2000    62         –      22/40
Lymphoma   4026    96         –      62/34
Prostate   12600   102        –      52/50
Leukemia   7129    38         34     27/11 - 20/14


Hierarchies comparison

0-coefficients: Colon 999, Lymphoma 1376, Leukemia 1190, Prostate 2234.
x-axis: normalized rank. y-axis: proportion of common variables for the compared methods.

[Figure: relative frequency of common variables versus normalized rank for the Colon and Lymphoma data; curves for SVM, SVM/FA, SVM/GLMpath and FA/GLMpath.]


Hierarchies comparison 2

[Figure: relative frequency of common variables versus normalized rank for the Prostate and Leukemia data; curves for SVM, SVM/FA, SVM/GLMpath and FA/GLMpath.]


Common Variables

Comparison / data   Colon   Lymphoma   Prostate   Leukemia
SVM                 37      37         32         30
SVM/GLMpath         33      26         24         21
SVM/RF              4       9          12         9
RF/GLMpath          10      12         16         21

Table: Number of common variables within the top 50.


Results, real data sets

Score/Data   Colon        Lymphoma      Prostate      Leukemia
FDS          0.1219(3)    0.0436(200)   0.0371(315)   0.0882(7)
∂W           0.0009(31)   0(186)        0.0269(83)    0.1176(2)
∂RW          0.0029(33)   0(60)         0.0269(902)   0.0882(22)
∂Spb         0.0029(34)   0.0006(118)   0.0109(45)    0.1176(11)
SVM-RFE      0.0057(32)   0(64)         0(64)         0.0882(1)
GLMpath      0.064(2)     0(3)          0(3)          0(1)
RF           0.0962(55)   0.0588(73)    0.0554(7)     0.0588(103)

Colon: 0.17, Lymphoma: 0.06, Prostate: 0.075, Leukemia: 0.20588.


Bias Selection

D data set, B number of bootstrap samples.

Partition D with stratification into D_1, ..., D_10. Set D_{−j} = D − D_j.
For j = 1, ..., 10:
  Compute Score(D_{−j}, B) and use the hierarchy X(1), ..., X(p).
  For k = 1, ..., p:
    M^k = f(X(1), ..., X(k))
    Er^k = TestRS(M^k, D_{−j})
  kopt_j = Argmin_k {Er^k}
  er_j = mean error of M^{kopt_j} over D_j.
Compute er = (1/10) ∑_{j=1}^{10} er_j.
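An external cross-validation sketch (my own) of this scheme, reusing the select_k_opt helper sketched earlier; the rank_variables scoring function is a hypothetical placeholder, not something defined in the slides.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def unbiased_error(X, y, rank_variables, select_k_opt, n_folds=10, seed=0):
    """Estimate the error of the whole selection pipeline without selection bias."""
    outer = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    fold_errors = []
    for train_idx, test_idx in outer.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        ranking = rank_variables(X_tr, y_tr)                 # hierarchy built on D_{-j} only
        k_opt, _ = select_k_opt(X_tr, y_tr, ranking)          # optimal size chosen on D_{-j} only
        cols = ranking[:k_opt]
        model = SVC(kernel="linear").fit(X_tr[:, cols], y_tr)
        fold_errors.append(np.mean(model.predict(X[test_idx][:, cols]) != y[test_idx]))
    return float(np.mean(fold_errors))
```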


Results

Data      Colon          Lymphoma       Prostate
FDS       0.1595(15.1)   0.1233(83.7)   0.0882(126.4)
∂W        0.233(35.1)    0.051(86.5)    0.054(756.6)
∂RW       0.214(43.3)    0.042(71)      0.053(573.3)
∂Spb      0.197(31.8)    0.073(70.5)    0.052(95.5)
SVM-RFE   0.1452(26.4)   0.0878(16.8)   0.0582(43.2)
GLMpath   0.1809(1.3)    0.0522(2.8)    0.05909(1.6)
RF        0.106(49.8)    0.052(65.9)    0.059(81)


Non linear Separation

[Figure: error rate versus number of selected variables (log scale) for a non-linear problem, 80 observations and 50, 100, 200 and 300 variables; curves for FDS, ∂W, ∂RW and ∂Spb.]


Multiclass separation

One versus others

Here Y ∈ {1, ..., J}. We construct J hyperplanes, each learning the simplified binary classification of one class against all the others.


One versus one

We construct J(J − 1)/2 hyperplanes, each learning the simplified binary classification of one class against another.
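For illustration (my addition), both decompositions are available as meta-estimators in scikit-learn; the sketch below simply confirms the number of binary problems each one builds:

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)   # J binary problems
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)    # J(J-1)/2 binary problems
print(len(ovr.estimators_), len(ovo.estimators_))           # 4 and 6 for J = 4
```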


Multiclass procedures


Multiclass linear separable data

[Figure: scatterplots (x1, x2) of multiclass linearly separable data for m = 3, 4, 5 and 6 classes.]


Multiclass linear feature selection

[Figure: error rate versus number of selected variables (log scale) for 500 and 1000 variables, m = 6 classes; panels for the RADAG, Hamming-decoding and Loss-decoding multiclass procedures.]


Application: Malaria discrimination

∂W         ∂RW             ∂Spb
Ccl25      Tagap1          Hmgn1
Tagap1     Ccl25           Ccl25
Hmgn1      Hmgn1           Tagap1
Ppp1r9b    Ppp1r9b         Rpl22
Itgb1      Itgb1           Frap1
Ndufa7     Zfpn1a4         Pmm1
Myo10      Chchd1          Bzw1
Ppp1r12c   0610037H22Rik   Itgb1
Frap1      Rpl22           Ppp1r9b
Rapgef6    Frap1           Cdc37l1
Tubb3      Myo10           0610037H22Rik
Rpl22      Rapgef6         Tubb3

Table: Optimal sets: selected genes in decreasing order of importance.


Perspectives

Thank you for your attention.


Bibliography

L. Breiman. Random forests. Machine Learning, 45:5-32, 2001.

O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1-3):131-159, 2002.

I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157-1182, 2003.

I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46(1-3):389-422, 2002.

P. Langley. Selection of relevant features in machine learning. In AAAI Fall Symposium on Relevance, pages 140-144, New Orleans, 1994.

P. McCullagh and J. Nelder. Generalized Linear Models. Chapman & Hall/CRC, Boca Raton, 1989.

M. Y. Park and T. Hastie. L1 regularization path algorithm for generalized linear models. Technical report, Stanford University, February 2006.

A. Rakotomamonjy. Variable selection using SVM-based criteria. Journal of Machine Learning Research, 3:1357-1370, 2003.

V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.

V. Vapnik and O. Chapelle. Bounds on error expectation for support vector machines. Neural Computation, 12(9), 2000.

J. Weston, A. Elisseeff, B. Schölkopf, and M. Tipping. Use of the zero norm with linear models and kernel methods. Journal of Machine Learning Research, 3:1439-1461, 2003.
