
A Classification Method Incorporating Interactions

among Variables for High-dimensional Data

Haitian Wang†, Shaw-Hwa Lo‡, Tian Zheng‡, Inchi Hu† ∗

† Department of ISOM, Hong Kong University of Science and Technology,

Clear Water Bay, Hong Kong;‡ Department of Statistics, Columbia University, New York, USA;

∗ Corresponding Author, [email protected]

Abstract

The advances in technology have made available high-dimensional data involving thousands of variables at an ever-growing rate. These data can be used to answer significant questions. For instance, with DNA microarray technology, scientists can now measure the expression levels of thousands of genes simultaneously, while the number of patients is typically of the order of tens. Selecting a small number of genes based on microarray datasets for accurate prediction of clinical outcomes leads to a better understanding of the genetic signature of a disease and improves treatment strategies. However, in addition to the difficulty that the number of variables is much larger than the sample size, the classification task is further complicated by the need to consider interactions among variables. In this paper, we propose a classification method that incorporates interactions among variables. On several microarray datasets, our method outperforms existing methods that ignore variable interactions.

1 Introduction: why another method?

In recent years, novel statistical methods have been developed to meet the challenge of high-

dimensional classification problems. To explain the strength of our method, we will

focus on a specific problem—the use of gene expression data to predict clinical outcomes

of cancer, even though the method can be applied much more broadly. The problem is

challenging not just because the number of variables p is much greater than the number of

observations n. What’s even more challenging is that one needs to consider the interactive

effects among variables in addition to their individual marginal effects. For example, suppose

that metastasis is activated via one or more of several biological pathways and each pathway


involves a group of genes. In this case, the expression level of a single gene involved in

one of the biological pathways may show little or no marginal effect in the prediction of

metastasis status and its effect can only be detected by considering jointly with other genes

in the pathway. That is, the interaction among genes is critical to the accurate prediction of

metastasis outcomes.

In this paper, we propose a high-dimensional classification method that explicitly in-

corporates interactions among variables. The resulting classification rule is an ensemble of

classifiers and each classifier involves only one feature cluster. Each classifier explicitly in-

corporates interactions within its own feature cluster. The effort of including interactive

information in the classification rule has a clear reward in prediction. We demonstrate

through three microarray datasets that our classification rule significantly reduces the er-

ror rate and outperforms methods that ignore interactions among variables. Furthermore,

the feature cluster identified by our method is a plausible candidate for genes involved in a

biological pathway, which deserves further investigation.

There are two critical issues concerning high-dimensional classification methods: feature

selection and error-rate estimation. Regarding feature selection, the existing methods can be

broadly put into two categories. The first category of methods does feature selection through

simultaneously estimating the effect of all variables. We will argue that for the problem under

consideration p is too large and n too small for methods in the first category to work well; see

§2.1 for details. Methods of the second category do feature selection by assessing the effect

of each variable individually. Because the effect of each variable is estimated individually

ignoring the interaction among variables, methods in the second category are not expected to

perform well in problems with significant interaction effects or problems with weak marginal

effects; see §2.2 for details.

The error rate estimation is important to the evaluation of classification methods. How-

ever, the evaluation is sometimes clouded by tuning parameter selection and feature selection.

If done inappropriately, the error rate estimates can be too optimistic. To be more precise,

if error rate estimation is done using observations involved in tuning parameter selection

and/or feature selection, then the error rate estimates are biased and lack calibration. Our

classification rule does not contain any tuning parameter and thus there is no danger of sup-

plying overly optimistic error-rate estimates due to tuning-parameter selection. The error-rate

estimation of our method is free of feature selection bias because feature selection is done

without involving any information whatsoever from the testing sample.

The rest of the article is organized as follows. Section 2 contains discussion on feature

selection and error-rate estimation before a selective literature review. Section 3 contains

a preliminary illustration of the proposed method via a small toy example. In Section 4,

a detailed description of our method is given. Section 5 applies the proposed classification

method to three microarray datasets. Section 6 is a summary, which also includes a discussion

comparing our method with other classification methods. Some technical aspects of the

proposed method are dealt with in the Appendix.


2 Comparison with other methods

Before the literature survey, we first discuss two critical issues concerning high-dimensional clas-

sification methods. The purpose of the discussion is to gain insight on the performance of

various high-dimensional methods proposed in the literature and thus facilitate the compar-

ison of different methods.

2.1 Feature Selection

For high dimensional classification problems, where the number of variables p is much greater

than the number of observations n, it is imperative to derive solutions under sparse models.

That is, only a relatively small number s of p variables is relevant to the response variable

Y. It is well known that methods with $\ell_1$-regularization have the feature selection property and

they have desirable properties under sparse models. Meinshausen and Yu (2009) showed

that the LASSO estimator is $\ell_2$-consistent if

$$\frac{n}{s_n \log p_n} \to \infty. \qquad (1)$$

The preceding condition has an information-theoretic interpretation: the number of ob-

servations available for estimating each bit of minimum description length for the sparse

model tends to infinity. However, for the purpose of predicting the clinical outcome using

gene expression profiling, usually the number of variables is too large and the number of

observations is too few for condition (1) to hold. For instance, with p = 10000 variables,

only s = 10 relevant to the response variable Y , and n = 100 available observations, we have

$$\frac{n}{s \log p} = \frac{100}{10 \cdot \log(10000)} \approx 1.1.$$

Another remarkable method, the Dantzig selector proposed by Candes and Tao (2007), requires

a similar condition; see Section 2.1 of that paper.

These results from the high-dimensional regression literature shed light on classification

problems. Methods such as LASSO and Dantzig selector are designed to estimate the ef-

fects (coefficients) of all variables simultaneously. However, when p is too large and n too

small, (1) suggests that one cannot expect classification methods such as support vector ma-

chine (SVM) to perform effective feature selection because SVM estimates simultaneously

coefficients of all variables in a linear classification rule.

In sharp contrast to aforementioned methods, the other category of methods does feature

selection by estimating the effect of variables individually. One method in this category has

gained much attention recently. SIS (sure independence screening), introduced by Fan and

Lv (2008), measures the effect of each variable by its correlation with the response variable

Y . Then the feature selection is done via discarding low correlation variables. Fan and Lv

(2008) showed that SIS possessed the sure screening property, that is, the submodel obtained

by SIS contains the true sparse model with overwhelming probability. They proposed to first


perform SIS to reduce the number of variables to a moderate scale so that subsequent data

analysis can be done more easily. The sure screening property of SIS is retained at the cost

of ignoring the interactions among variables. An important variable will escape SIS if it is

marginally uncorrelated (or weakly correlated) with Y but jointly correlated with Y when

considered with other variables together. For complex disorders, it is likely that a gene by

itself has little effect but when combined with other genes has significant effect on disease

status. In this case, SIS will miss important genes.

In this paper, we propose a feature selection strategy based on ideas from Lo and Zheng

(2002, 2004) and Chernoff et al. (2009), which assesses the effects of variables neither si-

multaneously nor individually. The proposed feature selection procedure estimates the joint

effect of a randomly-selected subset of variables and reduces the subset to a smaller one by

removing irrelevant variables from the subset. This strategy avoids not only the difficulty

of methods that are too ambitious for the size of the problem because they estimate all coefficients

at the same time, but also the difficulty of methods that focus on a single variable at

a time and ignore joint effects. Our procedure may be called ‘feature-group’ selection because

the features selected come in groups. The interaction among variables in the same group will

be accommodated, while assuming no interactions for variables that are in different groups.

2.2 Error-rate estimation

There are two common pitfalls in error rate estimation for high dimensional classification

problems using cross validation. First, when error-rate estimation and tuning-parameter

selection are done at the same time, the error-rate estimate is biased downward. That is, the

error rate obtained by minimizing the cross-validation estimate with respect to the tuning

parameter is too optimistic. Zhu et al. (2008) referred to error-rate estimates of this type

as those based on ‘internal’ CV. Here ‘internal’ means the error-rate is assessed internally

by the same set of observations used for tuning parameter selection as opposed to the error

rate assessed externally by observations not involved in tuning parameter selection.

Zhu et al. (2008) also pointed out the other pitfall: feature selection bias. This kind

of bias exists in various types of feature selection procedures. It exists in fixed-size feature

selection procedures, where one selects a fixed number of variables, say 10, from a large

number of variables, say 10,000, in one round or in several rounds such as recursive feature

elimination (RFE) proposed by Guyon et al (2002). The feature selection bias is also present

in preliminary screening, where one selects a subset of variables for further analysis. In

addition, it exists in the optimal subset of unrestricted sizes, where the number of variables

involved in the final classification rule is determined by minimizing the error rate among

subsets of different sizes.

The reason for the existence of selection bias is similar to that for tuning-parameter selection.

Even though features usually are not selected by minimizing the error rate, they are selected

by some criterion that supposedly helps reduce the error rate. Thus the error rate

estimate is biased if it is computed internally to the feature selection procedure. To correct the


selection bias, it is necessary to estimate the error rate externally to the feature selection.

That is, the error rate needs to be estimated based on observations not involved in the

feature selection process. In view of the preceding discussion, the error rate estimation of

the proposed method is done externally to tuning parameter selection and feature selection

and thus free from both types of biases.

2.3 Literature review

One can view the vast literature on high-dimensional classification problems as a collection

indexed by methods and datasets because typically a paper in this collection advocates one or

more methods and applies them on a few datasets to report the results. Thus we can adopt a

‘cross-sectional’ approach to review the high-dimensional classification literature pertaining

to a particular gene expression dataset. If the dataset is well known and well studied, we

can hope to have a comprehensive view of high-dimensional classification methods and their

performance through this particular dataset.

The chosen dataset was first made available in the breast cancer study of van’t Veer

et al (2002), which has generated a great deal of discussions in the literature. Originally it

contains expression levels of approximately 25,000 genes and after filtering, some 5,000 genes

were kept for further analysis. Of the 97 female breast cancer patients, 46 relapsed (distant

metastasis within 5 years) and 51 did not (no distant metastasis for at least 5 years); 78 were used as

the training sample and 19 as the test sample.

The error rate estimates reported in the literature for a wide variety of classification

methods on the van’t Veer et al (2002) dataset are typically around 30%. Pochet et al (2004)

applied SVM and Fisher discriminant analysis (FDA) with error rates all above 25%. Michiels

et al (2005) used the correlation method and the error rate was over 30%. Diaz-Uriarte and

de Andres (2006) applied the bootstrap method to estimate the error rate. Their error rate

estimates for SVM, random forest (RF), k-nearest neighbor (KNN), and linear discriminant

analysis (LDA) were all higher than 30%. Yan and Zheng (2008) applied multigene profile

association method (MPAS) and used 13-fold CV to yield an error rate estimate of 29.5%.

Some papers reported error rates significantly lower than 30%, even as low as 1%. How-

ever, after careful examination, we found that all of them suffer from feature selection bias and/or

tuning parameter selection bias. A couple of them used leave-one-out cross validation

(LOOCV) to estimate the error rate. In addition to the two kinds of biases previously dis-

cussed, error-rate estimates using LOOCV have the additional problem of much larger variance

than, say, 5-fold CV, because the error rate estimates in each fold of LOOCV are highly

correlated. The details will be given in §5. Another set of papers reported around 10% error rates on the test sample only; see §5

for details. Incidentally, Alexe et al (2006), using logical analysis of data (LAD), found a

pair of genes that completely separates the training and testing samples of van’t Veer et al

(2002). Because this is an extremely rare situation, they concluded that the training and

testing samples of van’t Veer et al (2002) have very different characteristics. Thus it is

better for error rate estimates based only on the test sample of van’t Veer et al (2002) to

be cross-validated on other test sets as well.

From the foregoing literature review, we conclude that classification methods being re-

viewed either have relatively high error rate estimates, or the procedures employed to estimate

the error rates are biased or lack calibration due to tuning parameter selection and feature

selection. The proposed method yields an error rate of 8% from 10 CV test sets. Further-

more, the proposed method has no tuning parameter and selects features without using any

information from the test sets, and thus the error rate estimates are free from both types of

biases.

3 Preliminary illustration: two basic tools

To illustrate the effectiveness of the proposed method, which will be described in the next

section, here we provide a preliminary illustration via a toy example. The purpose of the toy

example is to demonstrate that two key ideas from Lo and Zheng (2002, 2004), an influence

measure I and a backward dropping algorithm (BDA), can elicit interaction information

difficult for other high-dimensional methods.

3.1 A toy example

Example 1: Let Y , the class membership, and X = (X1, X2, · · · , X200), the explanatory vari-

ables, all be binary taking values in {0, 1}. We generate 200 values for each Xi independently

with P{Xi = 0} = P{Xi = 1} = 0.5 and

$$Y = \begin{cases} X_1 + X_2 + X_3 \ (\mathrm{mod}\ 2) & \text{with probability } 0.5 \\ X_4 + X_5 \ (\mathrm{mod}\ 2) & \text{with probability } 0.5 \end{cases} \qquad (2)$$

Thus we have a 200×201 data matrix (n = 200 observations, p = 200 explanatory variables,

and one response variable Y ). Suppose we do not know the model and wish to predict Y

based on the explanatory variables X. To train the classification rule, we use 150 observations

as the training set, while the other 50 observations are used as the test set.
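For concreteness, the following minimal Python sketch generates data from model (2) and splits it as above; the variable names and the random seed are our illustrative choices, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 200, 200
# n observations of p independent Bernoulli(0.5) explanatory variables
X = rng.integers(0, 2, size=(n, p))

# Model (2): each Y is generated by one of two parity rules, chosen with probability 0.5
use_first_rule = rng.random(n) < 0.5
y = np.where(use_first_rule,
             (X[:, 0] + X[:, 1] + X[:, 2]) % 2,   # X1 + X2 + X3 (mod 2)
             (X[:, 3] + X[:, 4]) % 2)             # X4 + X5 (mod 2)

# 150 observations for training, 50 for testing
X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]
```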

The proposed method uses I to evaluate the strength of influence on Y for a small set

of explanatory variables. The BDA guided by the influence measure I is applied to a small

subset of variables to obtain a smaller subset, called the return set, by stepwise elimination

of irrelevant variables. The top ranked return sets generated by BDA are as follows.


Size   Influence score   Variables
2      1250.200          4, 5
3       517.522          1, 2, 3
2       455.182          124, 168
3       409.097          35, 178, 194
2       401.324          82, 121
3       399.887          17, 78, 112
3       393.876          6, 43, 93
2       381.318          47, 117
3       376.208          27, 40, 156
3       358.346          37, 95, 132

From these influence scores, we can easily identify the influential variables since the highest

I-score return set of size two, {X4, X5} and of size three, {X1, X2, X3}, have significantly

higher I-score than other return sets of the same size, respectively.

Note that any feature selection method based on marginal effects would fail in this ex-

ample because there is no marginal effect from the influential variables X1, · · · , X5 on the

response variable Y . Unless X1, X2, X3 are considered together, their effect on Y will go

undetected because corr(Xi, Y ) = 0, i = 1, 2, 3. The same is true for X4, X5.

After identifying the return sets, we then turn the return sets into classifiers. Because

the explanatory variables are all binary, classification trees are a good choice of classifier.

We shall classify the test cases according to the return sets, {X1, X2, X3} and {X4, X5}, by

growing a classification tree from each return set. Usually, we will have more than one return

set of influential variables. We can assemble these classifiers into a classification rule using

the boosting method; see, e.g., Hastie, Tibshirani and Friedman (2009). The performance of

the classification rule and other classification methods such as linear discriminant analysis

(LDA), support vector machine (SVM), and random forest (RF) are reported below.

                    Error rate
Methods      Training set   Test set
LDA          0.2            0.52
SVM          0.0            0.52
RF           0.49           0.46
Proposed     0.26           0.24

The proposed method yields the lowest error rate for the test set at 24%, which is

also close to the best possible error rate because we do not know whether {X1, X2, X3} or

{X4, X5} is used to generate the value of Y . Among the four methods, SVM produces the

largest gap between the training and test set error rates. This is probably due to tuning

parameter estimation required by SVM so that the error rate for the training set seriously

underestimates the true one. The proposed method does not contain any tuning parameter.


3.2 The influence measure

Here we describe the two basic ideas in Lo and Zheng (2002, 2004), an influence measure and

BDA, which we adopted to analyze the toy example. For easy illustration, we assume that the

response variable Y is binary (taking values 0 and 1) and all explanatory variables are discrete.

Consider the partition $\mathcal{P}_k$ generated by a subset of k explanatory variables $\{X_{b_1}, \cdots, X_{b_k}\}$

according to their values. If all variables in the subset are binary, then there are $2^k$ partition

elements. Let $n_1(j)$ be the number of observations with Y = 1 in partition element j and let

$\bar{n}_1(j) = n_j \pi_1$ be the expected number of 1-response observations in element j under the

null hypothesis of no association, where $n_j$ is the number of observations in element j and

$\pi_1$ is the proportion of 1-response observations in the sample. Then the influence measure

is a sum of squares over all partition elements, defined by

$$I(X_{b_1}, \cdots, X_{b_k}) = \sum_{j \in \mathcal{P}_k} \big[\, n_1(j) - \bar{n}_1(j) \big]^2.$$
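A minimal Python sketch of this computation for binary Y and discrete explanatory variables (the function name is ours):

```python
import numpy as np

def influence_score(X_subset, y):
    """Influence measure I: sum over partition elements of [n1(j) - n_j*pi_1]^2.

    X_subset: (n, k) array of discrete explanatory variables defining the partition.
    y: (n,) array of 0/1 responses.
    """
    pi_1 = y.mean()                                   # proportion of 1-responses overall
    _, cell_id = np.unique(X_subset, axis=0, return_inverse=True)
    score = 0.0
    for j in np.unique(cell_id):                      # loop over partition elements
        in_cell = cell_id == j
        n_j = in_cell.sum()                           # observations in element j
        n1_j = y[in_cell].sum()                       # 1-responses in element j
        score += (n1_j - n_j * pi_1) ** 2
    return score
```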

The statistic I measures the squared deviation of Y -observations from what’s expected

assuming the subset of explanatory variables has no influence on Y . There are two properties

of I that make it effective. First, the measure does not require one to specify the form of

$\{X_{b_1}, \cdots, X_{b_k}\}$'s joint influence on Y. Secondly, under the null hypothesis that the subset

has no influence on Y , the expected value of I remains unchanged when dropping variables

from the subset. The second property makes I fundamentally different from the Pearson

chi-square statistic whose expectation depends on the degrees of freedom and hence on the

number of variables used to define the partition. To see this, we rewrite I in its general form

when Y is not necessarily discrete

$$I = \sum_{j \in \mathcal{P}_k} n_j^2 (\bar{Y}_j - \bar{Y})^2,$$

where $\bar{Y}_j$ is the average of the Y-observations over the jth partition element and $\bar{Y}$ is the average

over all partition elements. It is shown in Chernoff et al (2009) that I has the same asymptotic

distribution as that of a weighted sum of $\chi^2$ random variables, each with one degree of

freedom. This very property also provides the theoretical justification for the following

algorithm.

3.3 The backward dropping algorithm (BDA)

BDA is a greedy algorithm to search for the feature combination that maximizes the I-score

through stepwise elimination of features from an initial subset randomly selected from the

feature space. The details are as follows.

1. Training Set: Consider a training set {(y1, x1), · · · , (yn, xn)} of n observations, where

xi = (x1i, · · · , xpi) is a p-dimensional vector of explanatory variables. Typically p is

very large. All explanatory variables are discrete.


2. Sample from Variable Space: Randomly select a starting subset of k explanatory vari-

ables Sb = {Xb1 , · · · , Xbk}, b = 1, · · · , B.

3. Compute I Statistic:

$$I(S_b) = \sum_{j \in \mathcal{P}_k} n_j^2 (\bar{Y}_j - \bar{Y})^2$$

4. Drop Variables: For each variable in Sb, tentatively drop it from Sb and recalculate

the corresponding I statistics. Then drop the one that gives the highest I-value after

dropping. Call this new set of variables S ′b, which has one variable less than Sb.

5. Return Set: Continue the next round of dropping on S′b until no variable is left to drop.

Keep the subset of variables which yields the highest I score in the whole dropping

process. Refer to this subset of variables as the return set Rb. Keep it for future use.
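A minimal Python sketch of one BDA run following steps 1-5 above, using the general form of I, $\sum_j n_j^2 (\bar Y_j - \bar Y)^2$; the function names are ours and the sketch is illustrative rather than the authors' implementation.

```python
import numpy as np

def i_score(X_sub, y):
    """General-form influence measure: sum_j n_j^2 (Ybar_j - Ybar)^2."""
    y_bar = y.mean()
    _, cell_id = np.unique(X_sub, axis=0, return_inverse=True)
    return sum((cell_id == j).sum() ** 2 * (y[cell_id == j].mean() - y_bar) ** 2
               for j in np.unique(cell_id))

def backward_dropping(X, y, start_vars):
    """One BDA run: repeatedly drop the variable whose removal gives the highest
    I-score, and return the best-scoring subset seen (the return set)."""
    current = list(start_vars)
    best_set, best_score = list(current), i_score(X[:, current], y)
    while len(current) > 1:
        trials = [(i_score(X[:, [v for v in current if v != d]], y), d)
                  for d in current]                   # tentatively drop each variable
        top_score, drop = max(trials)
        current.remove(drop)
        if top_score > best_score:
            best_set, best_score = list(current), top_score
    return best_set, best_score
```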

If no variable in the return set has influence on Y , then the values of I will not change

much in the dropping process; see Figure 1(b). On the other hand, when important variables

are included in the return set, then the I score will increase rapidly before reaching the

maximum and decrease drastically afterwards; see Figure 1(a).

[Figure 1: two panels plotting the influence measure against dropping rounds; (a) backward-dropping history with information, (b) backward-dropping history without information.]

Figure 1: Comparison of I-scores for return sets with and without information

4 Method Description

The influence measure and BDA have been successfully applied to discover influential genes

for complex disorders with impressive results; see, e.g., Lo and Zheng (2002, 2004). For high-

dimensional classification problems, here we develop a new method using these two ideas

as basic tools. The proposed classification method consists of four major parts: feature

selection, generating return sets from selected features, turning return sets into classifiers,

and assembling classifiers to form the final classification rule.

Figure 2: Flowchart of the proposed method (placeholder).

4.1 Feature selection

When the number of features is moderate with respect to the number of observations as

in the toy example, we can directly apply BDA to the feature space and generate return

sets without doing feature selection first. However, when the number of features is in the

thousands, directly applying BDA is not computationally efficient in identifying key feature

clusters and thus we need to do feature selection before return set generation. The feature

selection is based on the influence measure I . The I-score is shared by a subset (cluster)

of features instead of a single feature. This calls for a treatment different from the usual

approach of ranking the features according to their scores, choosing a cut-off value, and

keeping all features with scores exceeding the cut-off value.

It is necessary to first decide on the starting size of the clusters to apply BDA. Here

the issue is the trade-off between computation cost and the order of detectable interactions.

The larger the cluster size the higher order of interaction we can detect. However, the

computation time increases by a factor depending on the total number of features. For

example, suppose we have 5000 features. Then there are 12.5 million pairs and more than

20 billion triplets. Even though the I-scores of triplets can provide information on 3rd order

interaction among features, the computation cost is more than 1000 times that of pairs,

which provide us only information about 2nd order interaction.

Suppose we have determined the cluster size for a given problem after considering the

trade-off between computational cost and the order of detectable interaction. We then

calculate the I-scores for all clusters of chosen size. The question now is how to choose

a threshold value so that feature clusters with I-scores higher than the threshold will be

retained. In this regard, we found that usually the number of feature clusters increases

dramatically after a certain value of I-score, which is designated as the threshold. Please

see Figure 3.

We then look at the retention frequencies of the retained features with I-scores higher

than the threshold. The ordered retention frequencies typically show sharp drops in the

beginning and little differences after a certain point, which then determines a set of high-

frequency features; see Figure 4.

The high-frequency variables are candidates for influential features. This is because

an influential feature not only produces high I-scores but also does so more frequently

than non-influential ones when combined with other variables. Usually, there is only a

moderate number of high-frequency variables ready to be used in the next step.
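The following Python sketch illustrates this two-stage screening; the exact cut-off rules (largest second difference of block-averaged I-scores, largest drop in ordered retention frequencies) are our own illustrative approximations to the graphical criteria of Figures 3 and 4.

```python
import numpy as np
from collections import Counter

def screen_high_frequency_features(clusters, scores, block=1000):
    """clusters: list of tuples of feature indices; scores: matching I-scores."""
    order = np.argsort(scores)[::-1]                  # high I-score first
    sorted_scores = np.asarray(scores)[order]

    # Stage 1: threshold where the ordered I-scores change behavior (cf. Figure 3)
    block_means = [sorted_scores[i:i + block].mean()
                   for i in range(0, len(sorted_scores), block)]
    second_diff = np.diff(block_means, n=2)
    cut_block = (int(np.argmax(np.abs(second_diff))) + 1) if len(second_diff) else 1
    retained = [clusters[i] for i in order[:cut_block * block]]

    # Stage 2: keep features whose retention frequency lies before the big drop (cf. Figure 4)
    freq = Counter(f for cluster in retained for f in cluster)
    feats, counts = zip(*freq.most_common())
    drops = -np.diff(counts)
    cut = (int(np.argmax(drops)) + 1) if len(drops) else len(feats)
    return list(feats[:cut])
```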


[Figure 3: plot of the second difference of ordered I-scores against the number of return sets (in thousands).]

Figure 3: 2nd difference of ordered I-scores for every 1,000 triplets.

4.2 Generate return sets

We now apply the BDA to the high frequency variables obtained from the previous step.

There are two parameters we need to determine before applying the BDA: the starting size

and the number of repetitions.

The starting size refers to the initial number of features selected to which we apply the

BDA. The starting size depends on the number of cases in the training set. If the starting

size is too large, all partition elements contain no more than one training case, the dropping

is basically random and one can start with a smaller size, achieving the same result with less

computing time. The ideal starting size is such that some partition elements contain two or

more observations. Using Poisson approximation, we can calculate the expected number of

partition elements with two or more observations. Requiring it to be at least one, the

starting size k should satisfy

$$\frac{n^2}{2\,m_{k-1}} \geq 1, \qquad (3)$$

where n is the number of training cases and mk−1 is the number of partition elements

generated by a set of k−1 variables. Thus (3) provides an upper bound for the starting size.

Suppose that there are 100 cases in the training set and each variable assumes two

values. Because thirteen variables would generate a partition with $2^{13} = 8192$ elements and

$100^2/8192 > 1$, we can choose the starting size to be 13, which is the largest integer satisfying


[Figure 4: plot of the first difference of gene retention frequency against the number of genes.]

Figure 4: 1st difference of retention frequency from top 21,000 triplets.

(3).
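A small Python check of this calculation under condition (3), assuming binary explanatory variables (the function name is ours):

```python
def max_starting_size(n_cases, levels=2):
    """Largest starting size k with n^2 / (2 * m_{k-1}) >= 1, where m_{k-1} = levels^(k-1)."""
    k = 1
    # the loop condition is exactly condition (3) for starting size k + 1
    while n_cases ** 2 / (2 * levels ** k) >= 1:
        k += 1
    return k

print(max_starting_size(100))   # 13, since 100^2 / 8192 >= 1 but 100^2 / 16384 < 1
```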

The number of repetitions for the BDA depends on the number of training cases as well.

The number of training cases determines the size of return sets that can be supported by

the training set. For example, if we have 100 training cases, then they can support a return

set of size 4 assuming each explanatory variable is binary. Since each return set of size 4

has $2^4 = 16$ partition elements, each partition element on average contains $100/16 > 5$

training cases, indicating the adequacy of chi-square approximation following a common

rule for the chi-square goodness-of-fit test. Hence return sets of size 4 are well supported by the

training set of 100 cases. In this case, we would like to make sure that the BDA is repeated

a sufficient number of times so that the quadruplets are covered rather completely. Here we

encounter a variation of the coupon-collector problem.

Let p be the number of variables and let k be the starting size. Then there are $\binom{p}{4}$ quadruplets

from p variables. Each run of BDA can cover $\binom{k}{4}$ quadruplets. Consider the following

coverage problem. There are a total of $\binom{p}{4}$ boxes and each corresponds to a quadruplet. Each

time, $\binom{k}{4}$ balls, corresponding to the quadruplets contained in a starting set

of size k, are randomly placed into boxes so that each ball is in a different box. In Appendix

A2, it is shown that roughly we are expected to repeat the BDA

$$B \approx \frac{\binom{p}{4}}{\binom{k}{4}} \log \binom{p}{4} \qquad (4)$$


times to have a complete coverage of quadruplets among p features.

The preceding result does not take into account that each time we do not place one ball

but a group of balls into boxes. It is well known that grouping will increase the proportion

of vacant boxes; see e.g. Hall (1988). Therefore the expected number of repetitions required

to cover all quadruplets will be greater (the exact result is an open problem). We propose

to use 2B as an upper bound. In all simulated and real-data examples, we have never

missed a key cluster of chosen size after running 2B repetitions.
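A small Python sketch of the repetition budget implied by (4) and the 2B rule; the example values of p and k are illustrative only.

```python
from math import comb, log, ceil

def bda_repetitions(p, k, z=4):
    """Approximate number of BDA runs B from (4) for clusters of size z, and the 2B bound."""
    b = comb(p, z) / comb(k, z) * log(comb(p, z))
    return ceil(b), ceil(2 * b)

# e.g. p = 200 high-frequency features, starting size k = 13, quadruplets (z = 4)
print(bda_repetitions(200, 13))
```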

The return sets generated from BDA will undergo two filtering procedures to remove

between-cluster correlation and false positives. The first procedure is to filter out return

sets with overlapping variables. Because the return sets will be converted into classifiers in

the next step and it is desirable to have weakly correlated classifiers, we shall remove those

return sets containing common variables. This can be done by first sorting the return sets in

decreasing order according to their I-scores and then removing return sets from the list that

have variables in common with a return set with a higher I score. The return sets after

removing overlapping ones are then subjected to a forward-adding algorithm to remove false

positives. Please see Appendix A3 for details. The filtering procedures are very important,

as found in practice, for lowering error rates. Sometimes, the error rates are much improved

after the filtering procedures.
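A minimal Python sketch of the first filter, removing return sets that share variables with a higher-scoring one (names are ours):

```python
def remove_overlapping(return_sets, scores):
    """Keep return sets greedily in decreasing I-score order, dropping any set
    that shares a variable with an already-kept, higher-scoring set."""
    order = sorted(range(len(return_sets)), key=lambda i: scores[i], reverse=True)
    kept, used_vars = [], set()
    for i in order:
        variables = set(return_sets[i])
        if used_vars.isdisjoint(variables):
            kept.append(return_sets[i])
            used_vars |= variables
        # otherwise the set overlaps a higher-scoring return set and is discarded
    return kept
```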

4.3 Turning return sets into classifiers

After return sets have been generated, the next task is to classify Y ’s based on information

contained in each single return set. Because the number of variables contained in one return

set is quite small, the traditional setting of large n small p prevails here, and all classification

methods developed under such setting can be applied. In this paper, we would like to take

advantage of interactions among variables for the prediction of response variable Y , using

methods that can readily incorporate interactions among variables. Furthermore, methods

without tuning parameters are preferred. Because the sample size is only moderate, if the

classifier contains a tuning parameter, we need to use part of the sample as a validation

set for tuning parameter selection, and thus there will be fewer observations available for

training the classifier (training set) and for estimating the error rate (testing set). Another

reason for preferring classifiers without tuning parameters is that tuning parameter selection

is hard. Ghosh and Hall (2008) pointed out that cross validation is a poor method for

tuning parameter selection because “the part of the cross-validation estimator that depends

on the tuning parameter is very highly stochastically variable, so much so that it has an

unboundedly large number of local extrema which bear no important asymptotic relationship

to the parameters that actually minimize the risk.”

The other factor to consider is that a classifier with fewer distributional assumptions is

more desirable. In §4.4.4 of Hastie, Tibshirani, and Friedman (2009), it is asserted that Fisher’s lin-

ear discriminant analysis (LDA) makes more distributional assumptions than logistic regres-

sion, even though these two methods take similar forms in terms of classification boundary.


Actually, in many microarray datasets that we tried, logistic regression yields significantly

lower error rates than LDA. With all factors considered, logistic regression seems to be a

good choice.

In the logistic-regression classifier, we include all possible interaction terms based on a return

set. A return set of size 4 would give rise to 16 terms including up to 4-way interaction in

the logistic regression as the full model. We can then apply a model selection criterion to

select a good submodel. For the three real-data examples in §5, we used AIC.
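As an illustration, the sketch below builds the full interaction design for one return set (all $2^{|R|} - 1$ product terms plus an intercept) and fits a logistic regression with statsmodels, reporting its AIC; the helper names are ours, and the AIC-guided submodel search is only indicated, not implemented.

```python
import itertools
import numpy as np
import statsmodels.api as sm

def interaction_design(X, return_set):
    """Columns for all main-effect and interaction terms of a return set."""
    cols, names = [], []
    for r in range(1, len(return_set) + 1):
        for combo in itertools.combinations(return_set, r):
            cols.append(np.prod(X[:, list(combo)], axis=1))   # product = interaction term
            names.append("x" + "*x".join(map(str, combo)))
    return np.column_stack(cols), names

def fit_full_model(X, y, return_set):
    """Fit the full logistic model with all interactions and report its AIC."""
    design, _ = interaction_design(X, return_set)
    design = sm.add_constant(design)
    result = sm.Logit(y, design).fit(disp=0)
    return result, result.aic   # compare AICs across candidate submodels to choose one
```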

4.4 Assemble classifiers to form the classification rules

In the previous section, we described how to construct a classifier from a single return set. We

now describe how to combine these classifiers to form a classification rule. In the literature,

there are quite a few ways to do it: bagging, Bayesian model averaging, boosting, etc.

Except for boosting, most ensemble learning methods are based on the idea of averaging.

Bagging averages in a straightforward manner. Bayesian model averaging is a weighted

average with posterior probabilities of the classifiers as the weights. There are other methods

where the weight is proportional to the performance of the classifier; see the ensemble method

in Peng (2005). On the other hand, boosting is based on the idea of complementing. In

boosting, the next classifier to be added to the classification rule is one that can do well on

those cases misclassified by classifiers already in the current classification rule.

Boosting is most appropriate when there are several factors affecting the probability of

Y = 1 and each factor depends on a small subset of variables. In this case, the classifier

based on a return set (or feature cluster) represents the effect of one factor. The AdaBoost

method of Freund and Schapire (1997) can be viewed as one that constructs an additive

classification rule from a set of basis functions by minimizing the misclassification risk ac-

cording to an exponential loss function. Here the set of basis functions consists of classifiers

each corresponding to a return set. For this paper, the boosting algorithm is adapted as

follows.

1. Find H return sets with the highest HTD scores.

2. Initialize training-case weights wi = 1/n, i = 1, · · · , n.

3. For h = 1 to H do

(a) For j = h to H, fit a logistic regression classifier $L_j$ to the training data using return

set $R_j$. Calculate

$$\mathrm{err}_j = \frac{\sum_i w_i\, I\big(y_i \neq L_j(x_i)\big)}{\sum_i w_i}, \qquad \alpha_j = \frac{1}{2} \log \frac{1 - \mathrm{err}_j}{\mathrm{err}_j}.$$


(b) Let $j' = \arg\max_{h \leq j \leq H} \alpha_j$. Update

$$w_i \leftarrow w_i \times \exp\big(\alpha_{j'}\, I(y_i \neq L_{j'}(x_i))\big).$$

(c) Relabel $R_{j'}$ as $R_h$ with corresponding $\alpha_h$, and relabel the remaining H − h return sets as $\{R_{h+1}, \ldots, R_H\}$.

4. Output the classification rule $\mathrm{sign}\big\{\sum_{h=1}^{H} \alpha_h L_h\big\}$.

The final classification rule is one such that interactions among features are allowed within

each component classifier but not among features in different classifiers. As the classifiers are

added one by one to the classification rule, we expect the error rate for the training set

to drop continuously after each update, reflecting improved fit to the training sample. Hence

the error rate path for the testing sample, as classifiers are sequentially added to the final

rule, is an effective tool for detecting over-fitting, because information from the test sample is

not used in constructing the classification rule, i.e., the boosting procedure.
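A minimal Python sketch of the adapted boosting loop above; for brevity each return-set classifier here is a plain (main-effects) scikit-learn logistic regression rather than the interaction-including logistic model of §4.3, and all names are ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def boost_return_set_classifiers(X, y, return_sets):
    """Adapted boosting: at each round add the not-yet-chosen classifier with the
    largest alpha (smallest weighted error) and reweight misclassified cases."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    models = [LogisticRegression(max_iter=1000).fit(X[:, rs], y) for rs in return_sets]
    preds = [m.predict(X[:, rs]) for m, rs in zip(models, return_sets)]

    remaining = list(range(len(return_sets)))
    chosen, alphas = [], []
    while remaining:
        errs = np.clip([np.sum(w * (preds[j] != y)) / np.sum(w) for j in remaining],
                       1e-12, 1 - 1e-12)              # guard against err = 0 or 1
        alpha = 0.5 * np.log((1 - errs) / errs)
        best = int(np.argmax(alpha))
        j = remaining.pop(best)
        chosen.append(j)
        alphas.append(alpha[best])
        w = w * np.exp(alpha[best] * (preds[j] != y)) # up-weight misclassified cases
    return chosen, alphas, models

def boosted_predict(X, chosen, alphas, models, return_sets):
    """sign{sum_h alpha_h L_h}, mapping 0/1 predictions to -1/+1."""
    score = sum(a * (2 * models[j].predict(X[:, return_sets[j]]) - 1)
                for a, j in zip(alphas, chosen))
    return (score > 0).astype(int)
```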

5 Real-Data Examples

This section contains three real-data examples. The first two are gene expression datasets

for breast cancer. They are from van’t Veer et al (2002) and Sotiriou et al (2003), which

originally contain 24187 genes for 97 patients, and 7650 genes for 99 patients, respectively.

The purpose for the first study is to classify female breast cancer patients according to relapse

and non-relapse clinical outcomes using gene expression data, while that of Sotiriou (2003)

is to classify tumor subtypes.

In the van’t Veer dataset, we kept 4918 genes for the classification task, which were

obtained from Tibshirani and Efron (2002). Following van’t Veer et al, 78 cases out of 97

are used as the training sample (34 relapse and 44 non-relapse) and 19 (12 relapse and 7

non-relapse) as the test sample. Our method yields a 10.5% error rate on the test sample,

which matches the best result reported in the literature. To check the stability of our

method, we further ran ten external cross-validation experiments by randomly partitioning

the 97 subjects into a training sample of size 87 and a test sample of 10. The average test-set

error rate based on these ten experiments is 8.0%. The error rates of the 10 training sets

and 10 testing sets are given in Figures 5, 6 and 7.

Note that in all ten CV experiments, the error rate on the test set generally decreases

as more classifiers are added to the classification rule using the boosting method. Because

the classification rule used no information from test sets, this indicates that the proposed

method does not suffer from overfitting. Table 1 provides more details on the

performance of methods reviewed in §2.3. It is easy to see that the proposed method does

quite well against other methods in the literature.

The Sotiriou et al. dataset contains 7650 genes on 99 patients. The task is to classify

tumors according to their estrogen receptor (ER) status using gene expression information.


[Figure 5: training and test error rate paths (error versus number of return sets, rs.no) for CV groups G1–G3.]

Figure 5: 10GCV Voting result, a.


[Figure 6: training and test error rate paths (error versus number of return sets, rs.no) for CV groups G4–G6.]

Figure 6: 10GCV Voting result, b.


[Figure 7: training and test error rate paths (error versus number of return sets, rs.no) for CV groups G7–G10.]

Table 1: Error rates on van’t Veer dataset by various methods in the literature

Author                   Feature selection        Classifier evaluated   Error rate   Method
Grate et al (2002)       None                     ℓ1-SVM                 0.010*       LOOCV
Pochet et al (2004)      None                     LS-SVM(a),             0.310        LOOCV
                                                  linear kernel          0.321        Test set
                         None                     LS-SVM,                0.309        LOOCV
                                                  RBF(b) kernel          0.316        Test set
                         None                     LS-SVM,                0.478        LOOCV
                                                  no regularization      0.428        Test set
                         PCA(c) (unsupervised)    FDA(d)                 0.297        LOOCV
                                                                         0.426        Test set
                         PCA (supervised)         FDA                    0.265        LOOCV
                                                                         0.331        Test set
                         kPCA(e) linear kernel    FDA                    0.288        LOOCV
                         (unsupervised)                                  0.391        Test set
                         kPCA linear kernel       FDA                    0.264        LOOCV
                         (supervised)                                    0.346        Test set
                         kPCA RBF kernel          FDA                    0.251        LOOCV
                         (unsupervised)                                  0.486        Test set
                         kPCA RBF kernel          FDA                    0.000        LOOCV
                         (supervised)                                    0.632        Test set
Li and Yang (2005)       RFE                      SVM                    0.105(n)*    Test set
                         RFE                      Ridge Regression       0.158(n)*    Test set
                         RFE                      Rocchio                0.158(n)*    Test set
Michiels et al (2005)    Correlation              Correlation            0.310        500 rCV(p)
Peng (2005)              Golub(f)                 SVM                    0.247*       LOOCV
                         Golub                    Bagging SVM            0.226*       LOOCV
                         Golub                    Boosting SVM           0.226*       LOOCV
                         Golub                    Ensemble SVM           0.186*       LOOCV
Yeung et al (2005)       BMA(g)                   BMA                    0.158*       Test set
Alexe et al (2006)       LAD(h)                   LAD                    0.183*       CV
Diaz-Uriarte and         None                     Random forest          0.342        bootstrap
de Andres (2006)         None                     SVM                    0.325        bootstrap
                         None                     KNN(i)                 0.337        bootstrap
                         None                     DLDA(j)                0.331        bootstrap
                         Shrunken Centroid        Shrunken Centroid      0.324        bootstrap
                         NN+VS(k)                 NN+VS                  0.337        bootstrap
Wahde & Szallasi (2006)  Evolutionary Algorithm   LDA                    0.105*       Test set
Song et al (2007)        RFE                      SVM                    0.077*       10-fold CV
Zhu et al (2007)         RFE                      SVM                    0.29(o)      10-fold CV
Yan and Zheng (2008)     sMPAS(l)                 sMPAS                  0.295        13-fold CV
Liu et al (2009)         EICS(m)                  EICS                   0.219        10-fold rCV

(a) Least squares SVM; (b) radial basis function; (c) principal component analysis; (d) Fisher discriminant analysis; (e) kernel principal component analysis; (f) the feature selection method used in Golub et al (1999); (g) Bayesian model averaging; (h) logical analysis of data; (i) K-nearest neighbor; (j) diagonal linear discriminant analysis; (k) nearest neighbor with variable selection; (l) signed multigene association; (m) ensemble independent component system; (n) estimated from graphs; (o) the best bias-corrected error rate by SVM with RFE starting from 5422 genes; (p) random CV; (*) biased error rate estimates due to tuning parameter selection and/or feature selection.


This is different from the objective of van’t Veer et al (2002), where the goal is to discriminate

relapse patients from non-relapse ones. We compute the I-scores for all pairs of the 7650

genes, from which the top 5000 high-score genes were selected. Then we follow the same

procedure as that used in analyzing the van’t Veer dataset. The average error rate over 10 CV

groups is 5%. This matches the best results in the literature; see Zhang et al (2006).

The third gene expression dataset is from Golub et al (1999). The dataset consists of

7129 genes, 38 cases in the training set and 34 in the test set. The purpose is to classify

acute leukemia into two subtypes: acute lymphoblastic leukemia (ALL) and acute myeloid

leukemia (AML). Our classification rule consists of 11 gene clusters with total of 34 genes

and correctly classifies all test cases. This dataset was analyzed in the same way as Sotiriou

et al. ’s data except that we use two CV experiments here. In Golub et al (1999), perfect

classification was achieved on 29 out of 34 test cases. The signals contained in the other five

test cases were considered too weak to be classified. Among these five weak-signal test cases,

one has been consistently misclassified by nearly all follow-up reanalyses. We have included

all 5 weak-signal test cases in our analysis. Our classification method correctly classified all

34 test cases. The average test-set error rate on two CV test sets is 5.9%.

[Figure 8: training and test error rate paths (error versus number of return sets, rs.no) for the Golub dataset.]

Figure 8: Error rate paths for Golub dataset

6 Conclusion and discussion

In this paper, we make use of two key ideas from Lo and Zheng (2002, 2004) and Chernoff

et al. (2009) to develop a high-dimensional classification method. The method produces

better error rates than other methods in the literature that do not consider interaction

effects among variables. The feature-cluster selection procedure that we adopted from Lo

and Zheng (2002, 2004) was originally designed to discover influential genes responsible for

the disease under study instead of minimizing classification error rates. When applied to


analyze inflammatory bowel disease (IBD), on top of confirming many previously identified

susceptibility loci reported in the literature, Lo and Zheng (2004) found four new loci not

reported before, which were later replicated in a UK genome-wide association study (Barrett

et al, 2008). Therefore, the proposed method is intended to have two desirable properties:

1. The classification rules derived from the method have low error rate.

2. In the process of deriving the classification rule, influential variables for the response

will be identified.

In view of results presented in this paper, it seems that the feature clusters involved in

the classification rules contain information worth further investigation for understanding

the disease signature.

References

[1] Alexe, B., Alexe, S., Axelrod, D. et al (2006), Breast cancer prognosis by combinatorial

analysis of gene expression data, Breast Cancer Research, 8:R41.

[2] Barrett, J.C., Hansoul, S., Nicolae D.L. et al (2008), Genome-wide association defines

more than 30 distinct susceptibility loci for Crohn’s disease, Nat. Genet., 40, 955-962.

[3] Candes, E.J. and Tao, T. (2007), The Dantzig selector: Statistical estimation when p

is much larger than n (with discussion), Ann. Stat., 35, 2313-2404.

[4] Chernoff, H., Lo, S. H., and Zheng, T. (2009), Discovering influential variables: a

method of partitions, Annals of Applied Stat., 3, 1335-1369.

[5] Diaz-Uriarte, R. and de Andres, S. A. (2006), Gene selection and classification of mi-

croarray data using random forest, BMC Bioinformatics, 7:3.

[6] Fan, J. and Lv, J. (2008), Sure independence screening for ultrahigh dimensional feature

space (with discussion), J. Roy. Statist. Soc. Ser. B, 70, 849-911.

[7] Freund, Y. and Schapire, R. (1997), A decision-theoretic generalization of online learning

and an application to boosting, Journal of Computer and System Sciences, 55, 119-139.

[8] Golub, T. R. et al (1999), Molecular classification of cancer: class discovery and class

prediction by gene expression monitoring, Science, 286, 531-537.

[9] Ghosh, A. K. and Hall, P. (2008), On error-rate estimation in nonparametric classifica-

tion, Statistica Sinica, 18, 1081-1100.

[10] Grate, L. R. et al (2002), Simultaneous relevant feature identification and classification

in high dimensional spaces, Algorithm in Bioinformatics Proceedings, 2452, 1-9.


[11] Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. (2002), Gene selection for cancer

classification using support vector machines, Mach. Learning, 46, 389-422.

[12] Hastie, T., Tibshirani, R., and Friedman, J. (2009), The elements of statistical learning,

2nd Ed. Springer, New York.

[13] Li, F. and Yang, Y. M. (2005), Analysis of recursive gene selection approaches from

microarray data, Bioinformatics, 21(19), 3741-3747.

[14] Liu, K. H. et al (2009), Microarray data classification based on ensemble independent

component selection, Computers in Biology and Medicine, 39, 953-960.

[15] Lo, S. H. and Zheng, T. (2002), Backward haplotype transmission association (BHTA)

algorithm - a fast multiple-marker screening method , Human Heredity, 53, 197-215.

[16] Lo, S. H. and Zheng, T. (2004), A demonstration and findings of a statistical approach

through reanalysis of inflammatory bowel disease data, Proceedings National Academy

of Science, 101, 10386-10391.

[17] Meinshausen, N. and Yu, B. (2009), Lasso-type recovery of sparse representation for

high-dimensional data, Ann. Stat., 37, 246-270.

[18] Michiels, S. S. et al (2005), Prediction of cancer outcome with microarrays: a multiple

random validation strategy, Lancet, 365(9458), 488-492.

[19] Peng, Y. H. (2005), Robust ensemble learning for cancer diagnosis based on microarray

classification, Advanced Data Mining and Applications, Proceedings, 3584, 564-574.

[20] Pochet, N. F. et al (2004), Systematic benchmarking of microarray data classification:

assessing the role of non-linearity and dimensionality reduction, Bioinformatics, 20(17),

3185-3195.

[21] Song, L., Bedo, J., Borgwardt, K.M., et al (2007), Gene selection via the BAHSIC

family of algorithms, Bioinformatics (ISMB), 23(13), i490-i498.

[22] Sotiriou, C. et al (2003), Breast cancer classification and prognosis based on gene expres-

sion profiles from a population-based study, Proceedings National Academy of Science,

100, 10393-10398.

[23] Tibshirani, R. and Efron, B. (2002), Pre-validation and inference in microarray, Statis-

tical Application in Genetics and Molecular Biology, 1(1), Article 1.

[24] van’t Veer, L. J. et al (2002), Gene expression profiling predicts clinical outcome of

breast cancer. Nature 415, 530 - 536.

[25] Wahde, M. and Szallasi, Z. (2006), Improving the prediction of the clinical outcome of

breast cancer using evolutionary algorithms, Soft Computing, 10(4), 338-345.


[26] Yan, X. and Zheng, T. (2008), Selecting informative genes for discriminant analysis

using multigene expression profiles, BMC Genomics, 9(Suppl 2), 1471-2164-9-S2-S14.

[27] Yeung, K. Y., et al (2005), Bayesian model averaging: development of an improved

multi-class, gene selection and classification tool for microarray data, Bioinformatics,

21(10), 2394-2402.

[28] Zhang, H., et al (2006), Gene selection using support vector machine with non-convex

penalty, Bioinformatics, 22(1), 88-95.

[29] Zhu, J. X., McLachlan, G. J., Ben-Tovim Jones, L. and Wood, I. A. (2008), On selection

bias with prediction rules formed from gene expression data , J. Stat. Planning and

Infer., 138, 374-386.

Appendix A1: The starting cluster size

When choosing a size k for forming an initial set of variables to run the backward dropping

algorithm, the objective is to have a small chance of erroneously dropping an influential

variable, i.e., non-informative screening. If all cells in the partition generated by the k

variables contain either one or no training cases, the dropping is basically random. On the

other hand, if there are partition elements with two or more training cases, the influence of

the variable can then be informatively measured when it is dropped. Treating elements of

a partition as urns of the same size and subjects as balls, we evaluate in the following the

probability of having an element with at least 2 observations using an urn model.

Let m be the number of boxes and n the number of balls. Dropping balls into boxes

randomly, let p2 be the probability of two or more balls in a particular box. The number of

balls in a particular box follows the binomial distribution with probability of success 1/m

and the number of trials equals n. Thus

$$p_2 = 1 - \left(\frac{m-1}{m}\right)^{n} - n\,\frac{1}{m}\left(\frac{m-1}{m}\right)^{n-1}.$$

Therefore m× p2 is the expected number of boxes with 2 or more balls. We assume that

$n \to \infty$ and $m \to \infty$ such that $n/m = \lambda$.

Then

$$m p_2 = m\left[1 - \left(\frac{m-1}{m}\right)^{n} - n\,\frac{1}{m}\left(\frac{m-1}{m}\right)^{n-1}\right] = m\left[1 - \left(\frac{m-1}{m}\right)^{m\frac{n}{m}} - \frac{n}{m}\left(\frac{m-1}{m}\right)^{m\frac{n-1}{m}}\right] \to m\left(1 - e^{-\lambda} - \lambda e^{-\lambda}\right).$$

If we further assume that λ is small and close to zero, then


$$m\left(1 - e^{-\lambda} - \lambda e^{-\lambda}\right) = m\left[1 - \left(1 - \lambda + \frac{\lambda^2}{2} + o(\lambda^2)\right) - \lambda\left(1 - \lambda + \frac{\lambda^2}{2} + o(\lambda^2)\right)\right] = m\left[\frac{\lambda^2}{2} + o(\lambda^2)\right] \approx \frac{m}{2}\left(\frac{n}{m}\right)^{2} = \frac{n^2}{2m}.$$

Therefore if we let $m_{k-1}$ denote the number of elements in a partition generated by k − 1

variables, then

$$\frac{n^2}{2\,m_{k-1}} \geq 1, \qquad (5)$$

implies that the expected number of partition elements with two or more training cases, after

dropping one variable from the starting set of k variables, is at least one. Note that in reality the number of training cases in

different partition elements is non-uniform and thus the number of cells with two or more

training cases would be larger than 1. Therefore the starting size satisfying (3) represents a

minimum requirement under the uniform assumption.

A similar calculation for the expected number of boxes with three or more balls yields

$$m\left(1 - e^{-\lambda} - \lambda e^{-\lambda} - \frac{\lambda^2}{2} e^{-\lambda}\right) = m e^{-\lambda}\left[e^{\lambda} - 1 - \lambda - \frac{\lambda^2}{2}\right] = m\left[\frac{\lambda^3}{3!} + o(\lambda^3)\right] \approx \frac{m}{6}\left(\frac{n}{m}\right)^{3} = \frac{n^3}{6\,m_{k-1}^2}.$$

Thus the corresponding condition on the starting size k is

$$\frac{n^3}{6\,m_{k-1}^2} \geq 1. \qquad (6)$$

Appendix A2: The number of repetitions for backward

dropping algorithm

Assume the placement time follows a Poisson process of rate 1. Then

$$P(\text{box } j \text{ empty at time } t) \approx \exp\left(-t\,\frac{\binom{k}{4}}{\binom{p}{4}}\right)$$

for any particular box j. Poissonization makes the boxes independent. So $Q_t$, the number of

empty boxes at time t, satisfies

$$Q_t \approx \text{Poisson with mean } \binom{p}{4}\exp\left(-t\,\frac{\binom{k}{4}}{\binom{p}{4}}\right).$$

Let B be the first time all boxes are occupied. Then

$$P(B \leq t) = P(Q_t = 0) \approx \exp\left[-\binom{p}{4}\exp\left(-t\,\frac{\binom{k}{4}}{\binom{p}{4}}\right)\right].$$

This can be rearranged to

$$\frac{\binom{k}{4}}{\binom{p}{4}}\left[\,B - \frac{\binom{p}{4}}{\binom{k}{4}}\log\binom{p}{4}\right] \approx \xi,$$

where $P(\xi \leq x) = \exp(-e^{-x})$. Therefore, $B \approx \frac{\binom{p}{4}}{\binom{k}{4}}\log\binom{p}{4}$.

The generalization to clusters of size z is obtained by substituting 4 with z

$$B_z \approx \frac{\binom{p}{z}}{\binom{k}{z}}\log\binom{p}{z}.$$

Appendix A3: Forward adding algorithm to remove false

positive return sets

Let H be the total number of return sets after filtering out overlapping ones. The second

filter for false positives is carried out as follows. The filtering algorithm evaluates the effect

of adding variables to each return set Rh, h = 1, . . . , H. Let |Rh| be the number of variables

in Rh. Without loss of generality, assume Rh = {X1, . . . , X|Rh|}. The remaining variables,

X|Rh|+1, . . . , Xp are added to Rh one at a time to generate p − |Rh| forward-one sets. The I-score

is calculated for each of these forward-one sets, and we let Ah be the number of them with I-score

greater than that of Rh. Ah represents the number of size |Rh| + 1 initial sets that will not

lead to Rh if subjected to BDA.

Even though we use forward adding for filtering, the motivation still lies in the backward-

dropping mechanism. For a true return set containing only influential variables, it should

be difficult to increase I-scores by adding a variable, or equivalently we should consistently

obtain this true return set by applying BDA to random initial sets with one more randomly

chosen variable plus this set. The same should not be true for a false positive return set.

A return set with many such forward-one sets (a large Ah) implies that whether it will be returned by BDA

depends on other variables in the initial set, which indicates that the return set is likely a

false positive.
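A minimal Python sketch of this forward-adding check, using the general-form I-score; the function names are ours.

```python
import numpy as np

def i_score(X_sub, y):
    """General-form influence measure: sum_j n_j^2 (Ybar_j - Ybar)^2."""
    y_bar = y.mean()
    _, cell_id = np.unique(X_sub, axis=0, return_inverse=True)
    return sum((cell_id == j).sum() ** 2 * (y[cell_id == j].mean() - y_bar) ** 2
               for j in np.unique(cell_id))

def forward_one_count(X, y, return_set):
    """A_h: number of forward-one sets whose I-score exceeds that of the return set."""
    base = i_score(X[:, list(return_set)], y)
    in_set = set(return_set)
    return sum(i_score(X[:, list(return_set) + [v]], y) > base
               for v in range(X.shape[1]) if v not in in_set)

# return sets with a large forward-one count are flagged as likely false positives
```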

For the van’t Veer et al example, we selected two return sets generated from one of the

10 CV experiments. In Table 2, the ‘false positive’ one has 7 forward-one sets even though

its I-score is higher than the ‘true’ one, which has only 1 forward-one set. After we remove

the false positive return sets according to the forward-one results, the final classification rule

has considerably lower error rate in the testing sample.

Usually the frequency plot of Ah clearly shows a set of outliers, as in Figure 9

below. Therefore it is easy to determine a threshold on Ah for removing return sets.


Table 2: Examples of forward-one sets

                      False                                True
Original return set   {665, 2283, 2930}   I = 422.39       {108, 2400, 4208}   I = 410.778
Forward-one sets      {1451, 2930}        I = 523.035      {2930, 4208}        I = 427.351
                      {665, 2283, 2930}   I = 470.498
                      {665, 1668, 2930}   I = 450.050
                      {1946, 2930}        I = 438.888
                      {1885, 2283, 2930}  I = 438.298
                      {2283, 3291}        I = 426.516
                      {2283, 2900, 2930}  I = 423.930

[Figure 9: histogram of the appearance frequency of the number of forward-one sets.]

Figure 9: Frequency distribution for the number of forward-one sets
