A Classification Method Incorporating Interactions among Variables for High-dimensional Data
Haitian Wang†, Shaw-Hwa Lo‡, Tian Zheng‡, Inchi Hu† ∗
† Department of ISOM, Hong Kong University of Science and Technology,
Clear Water Bay, Hong Kong;‡ Department of Statistics, Columbia University, New York, USA;
∗ Corresponding Author, [email protected]
Abstract
Advances in technology have made available high-dimensional data involving thousands of variables at an ever-growing rate. These data can be used to answer significant questions. For instance, with DNA microarray technology, scientists can now measure the expression levels of thousands of genes simultaneously, while the number of patients is typically on the order of tens. Selecting a small number of genes based on microarray datasets for accurate prediction of clinical outcomes leads to a better understanding of the genetic signature of a disease and improves treatment strategies. However, in addition to the difficulty that the number of variables is much larger than the sample size, the classification task is further complicated by the need to consider interaction among variables. In this paper, we propose a classification method that incorporates interaction among variables. In several microarray datasets, our method outperforms existing methods that ignore variable interaction.
1 Introduction: why another method?
In recent years, novel statistical methods have been developed to meet the challenge of high-
dimensional classification problems. To help explain the strength of our method, we will
focus on a specific problem—the use of gene expression data to predict clinical outcomes
of cancer, even though the method can be applied much more broadly. The problem is
challenging not just because the number of variables p is much greater than the number of
observations n. What’s even more challenging is that one needs to consider the interactive
effects among variables in addition to their individual marginal effects. For example, suppose
that metastasis is activated via one or more of several biological pathways and each pathway
involves a group of genes. In this case, the expression level of a single gene involved in
one of the biological pathways may show little or no marginal effect in the prediction of
metastasis status, and its effect can be detected only by considering it jointly with other genes
in the pathway. That is, the interaction among genes is critical to the accurate prediction of
metastasis outcomes.
In this paper, we propose a high-dimensional classification method that explicitly in-
corporates interactions among variables. The resulting classification rule is an ensemble of
classifiers and each classifier involves only one feature cluster. Each classifier explicitly in-
corporates interactions within its own feature cluster. The effort of including interactive
information in the classification rule has a clear reward in prediction. We demonstrate
through three microarray datasets that our classification rule significantly improves the error rate and outperforms methods that ignore interactions among variables. Furthermore,
the feature cluster identified by our method is a plausible candidate for genes involved in a
biological pathway, which deserves further investigation.
There are two critical issues concerning high-dimensional classification methods: feature
selection and error-rate estimation. Regarding feature selection, the existing methods can be
broadly put into two categories. The first category of methods does feature selection through
simultaneously estimating the effect of all variables. We will argue that for the problem under
consideration p is too large and n too small for methods in the first category to work well; see
§2.1 for details. Methods of the second category do feature selection by assessing the effect
of each variable individually. Because the effect of each variable is estimated individually
ignoring the interaction among variables, methods in the second category are not expected to
perform well in problems with significant interaction effects or problems with little marginal
effects; see §2.2 for details.
The error rate estimation is important to the evaluation of classification methods. How-
ever, the evaluation is sometimes clouded by tuning parameter selection and feature selection.
If done inappropriately, the error rate estimates can be too optimistic. To be more precise,
if error rate estimation is done using observations involved in tuning parameter selection
and/or feature selection, then the error rate estimates are biased and lack calibration. Our
classification rule does not contain any tuning parameter, so there is no danger of supplying overly optimistic error-rate estimates due to tuning-parameter selection. The error-rate
estimation of our method is free of feature selection bias because feature selection is done
without involving any information whatsoever from the testing sample.
The rest of the article is organized as follows. Section 2 contains discussion on feature
selection and error-rate estimation before a selective literature review. Section 3 contains
a preliminary illustration of the proposed method via a small toy example. In Section 4,
a detailed description of our method is given. Section 5 applies the proposed classification
method to three microarray datasets. Section 6 is a summary, which also includes a discussion
comparing our method with other classification methods. Some technical aspects of the
proposed method are dealt with in the Appendix.
2 Comparison with other methods
Before the literature survey, we first discuss two critical issues concerning high-dimensional classification methods. The purpose of the discussion is to gain insight on the performance of
various high-dimensional methods proposed in the literature and thus facilitate the compar-
ison of different methods.
2.1 Feature Selection
For high dimensional classification problems, where the number of variables p is much greater
than the number of observations n, it is imperative to derive solutions under sparse models.
That is, only a relatively small number s of the p variables are relevant to the response variable Y. It is well known that methods with ℓ1-regularization have the feature selection property and
they have desirable properties under sparse models. Meinshausen and Yu (2009) showed
that the LASSO estimator is ℓ2-consistent if

n / (s_n log p_n) → ∞. (1)
The preceding equation has an information-theoretic interpretation: the number of observations available for estimating each bit of minimum description length for the sparse
model tends to infinity. However, for the purpose of predicting the clinical outcome using
gene expression profiling, usually the number of variables is too large and the number of
observations is too few for condition (1) to hold. For instance, with p = 10000 variables,
only s = 10 relevant to the response variable Y , and n = 100 available observations, we have
n / (s log p) = 100 / (10 · log 10000) ≈ 1.1.
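This arithmetic is easy to verify; a one-line check (not part of the original analysis, using natural log as in condition (1)):

```python
import math

# p = 10000 variables, s = 10 relevant, n = 100 observations
n, s, p = 100, 10, 10000
ratio = n / (s * math.log(p))
print(round(ratio, 2))  # nowhere near the "tends to infinity" regime of (1)
```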
Another remarkable method, the Dantzig selector proposed by Candes and Tao (2007), requires a similar condition; see Section 2.1 of that paper.
These results from the high-dimensional regression literature shed light on classification
problems. Methods such as LASSO and Dantzig selector are designed to estimate the ef-
fects (coefficients) of all variables simultaneously. However, when p is too large and n too
small, (1) suggests that one cannot expect classification methods such as support vector ma-
chine (SVM) to perform effective feature selection, because SVM simultaneously estimates the coefficients of all variables in a linear classification rule.
In sharp contrast to the aforementioned methods, the other category of methods does feature
selection by estimating the effect of variables individually. One method in this category has
gained much attention recently. SIS (sure independence screening), introduced by Fan and Lv (2008), measures the effect of each variable by its correlation with the response variable
Y . Then the feature selection is done via discarding low correlation variables. Fan and Lv
(2008) showed that SIS possessed the sure screening property, that is, the submodel obtained
by SIS contains the true sparse model with overwhelming probability. They proposed to first
perform SIS to reduce the number of variables to a moderate scale so that subsequent data
analysis can be done more easily. The sure screening property of SIS is retained at the cost
of ignoring the interactions among variables. An important variable will escape SIS if it is
marginally uncorrelated (or weakly correlated) with Y but jointly correlated with Y when
considered with other variables together. For complex disorders, it is likely that a gene by
itself has little effect but when combined with other genes has significant effect on disease
status. In this case, SIS will miss important genes.
In this paper, we propose a feature selection strategy based on ideas from Lo and Zheng
(2002, 2004) and Chernoff et al. (2009), which assesses the effects of variables neither si-
multaneously nor individually. The proposed feature selection procedure estimates the joint
effect of a randomly-selected subset of variables and reduces the subset to a smaller one by
removing irrelevant variables from the subset. This strategy avoids not only the difficulty of methods that are too ambitious for the size of the problem, estimating all coefficients at the same time, but also the difficulty of methods that focus on a single variable at a time and ignore joint effects. Our procedure may be called ‘feature-group’ selection because
the features selected come in groups. The interaction among variables in the same group will
be accommodated, while assuming no interactions for variables that are in different groups.
2.2 Error-rate estimation
There are two common pitfalls in error rate estimation for high dimensional classification
problems using cross validation. First, when error-rate estimation and tuning-parameter
selection are done at the same time, the error-rate estimate is biased downward. That is, the
error rate obtained by minimizing the cross-validation estimate with respect to the tuning
parameter is too optimistic. Zhu et al. (2008) referred to error-rate estimates of this type
as those based on ‘internal’ CV. Here ‘internal’ means the error-rate is assessed internally
by the same set of observations used for tuning parameter selection as opposed to the error
rate assessed externally by observations not involved in tuning parameter selection.
Zhu et al. (2008) also pointed out the other pitfall: feature selection bias. This kind
of bias exists in various types of feature selection procedures. It exists in fixed-size feature selection procedures, where one selects a fixed number of variables, say 10, from a large number of variables, say 10,000, in one round or in several rounds, such as recursive feature
elimination (RFE) proposed by Guyon et al (2002). The feature selection bias is also present
in preliminary screening, where one selects a subset of variables for further analysis. In
addition, it exists in the optimal subset of unrestricted sizes, where the number of variables
involved in the final classification rule is determined by minimizing the error rate among
subsets of different sizes.
The reason for the existence of selection bias is similar to that for tuning parameter selection. Even though features usually are not selected by minimizing the error rate, they are selected by some criterion that supposedly helps reduce the error rate. Thus the error rate estimation is biased if it is done internal to the feature selection procedure. To correct the
selection bias, it is necessary to estimate the error rate externally to the feature selection.
That is, the error rate needs to be estimated based on observations not involved in the
feature selection process. In view of the preceding discussion, the error rate estimation of
the proposed method is done externally to tuning parameter selection and feature selection
and thus free from both types of biases.
2.3 Literature review
One can view the vast literature on high-dimensional classification problems as a collection
indexed by methods and datasets because typically a paper in this collection advocates one or
more methods and applies them on a few datasets to report the results. Thus we can adopt a
‘cross-sectional’ approach to review the high-dimensional classification literature pertaining
to a particular gene expression dataset. If the dataset is well-known and well-studied, we
can hope to have a comprehensive view of high-dimensional classification methods and their
performance through this particular dataset.
The chosen dataset was first made available in the breast cancer study of van’t Veer
et al (2002), which has generated a great deal of discussion in the literature. Originally it contained expression levels of approximately 25,000 genes; after filtering, some 5,000 genes were kept for further analysis. Of the 97 female breast cancer patients, 46 relapsed (distant metastasis within 5 years) and 51 did not (no distant metastasis for 5 years or more); 78 were used as
the training sample and 19 as the test sample.
The error rate estimates reported in the literature for a wide variety of classification methods on the van’t Veer et al (2002) dataset are typically around 30%. Pochet et al (2004)
applied SVM and Fisher discriminant analysis (FDA) with error rates all above 25%. Michiels
et al (2005) used the correlation method and the error rate was over 30%. Díaz-Uriarte and de Andrés (2006) applied the bootstrap method to estimate the error rate. Their error rate
estimates for SVM, random forest (RF), k-nearest neighbor (KNN), and linear discriminant
analysis (LDA) were all higher than 30%. Yan and Zheng (2008) applied multigene profile
association method (MPAS) and used 13-fold CV to yield an error rate estimate of 29.5%.
Some papers reported error rates significantly lower than 30%, even as low as 1%. How-
ever, after careful examination, we found that all of them suffer from feature selection bias and/or tuning parameter selection bias. A couple of them used leave-one-out cross validation
(LOOCV) to estimate the error rate. In addition to the two kinds of biases previously discussed, an error-rate estimate using LOOCV has the additional problem of a much larger variance than, say, 5-fold CV, because the error rate estimates in each fold of LOOCV are highly
correlated. The details will be given in §5.

Another set of papers reported around a 10% error rate on the test sample only; see §5
for details. Incidentally, Alexe et al (2006), using logical analysis of data (LAD), found a pair of genes that completely separates the training and testing samples of van’t Veer et al
(2002). Because this is an extremely rare situation, they concluded that the training and
testing samples of van’t Veer et al (2002) have very different characteristics. Thus error rate estimates based only on the test sample of van’t Veer et al (2002) are better cross-validated on other test sets as well.
From the foregoing literature review, we conclude that the classification methods reviewed either have relatively high error rate estimates, or the procedures employed to estimate the error rates are biased or lack calibration due to tuning parameter selection and feature selection. The proposed method yields an error rate of 8% from 10 CV test sets. Furthermore, the proposed method has no tuning parameter and selects features without using any information from the test sets, and thus its error rate estimation is free from both types of biases.
3 Preliminary illustration: two basic tools
To illustrate the effectiveness of the proposed method, which will be described in the next
section, here we provide a preliminary illustration via a toy example. The purpose of the toy
example is to demonstrate that two key ideas from Lo and Zheng (2002, 2004), an influence
measure I and a backward dropping algorithm (BDA), can elicit interaction information
difficult for other high-dimensional methods.
3.1 A toy example
Example 1: Let Y , the class membership, and X = (X1, X2, · · · , X200), the explanatory vari-
ables, all be binary taking values in {0, 1}. We generate 200 values for each Xi independently
with P{Xi = 0} = P{Xi = 1} = 0.5 and
Y = X1 + X2 + X3 (mod 2) with probability 0.5,
Y = X4 + X5 (mod 2) with probability 0.5. (2)
Thus we have a 200×201 data matrix (n = 200 observations, p = 200 explanatory variables,
and one response variable Y ). Suppose we do not know the model and wish to predict Y
based on the explanatory variables X. To train the classification rule, we use 150 observations
as the training set, while the other 50 observations are used as the test set.
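Model (2) is straightforward to simulate. The sketch below (our own illustration in Python, not the authors' code) generates the 200 × 201 data matrix and confirms numerically that each influential variable has essentially zero marginal correlation with Y:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 200, 200

# Binary explanatory variables, each uniform on {0, 1}
X = rng.integers(0, 2, size=(n, p))

# Model (2): Y follows one of two parity rules, chosen by a fair coin
use_first = rng.random(n) < 0.5
y = np.where(use_first,
             (X[:, 0] + X[:, 1] + X[:, 2]) % 2,   # X1 + X2 + X3 (mod 2)
             (X[:, 3] + X[:, 4]) % 2)             # X4 + X5 (mod 2)

# Marginal effects vanish: each influential variable is nearly
# uncorrelated with Y, so per-variable screening finds nothing
for j in range(5):
    print(round(abs(np.corrcoef(X[:, j], y)[0, 1]), 2))
```

In the population these correlations are exactly zero, which is why a marginal screening method cannot detect X1, …, X5 here.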
The proposed method uses I to evaluate the strength of influence on Y for a small set
of explanatory variables. The BDA guided by the influence measure I is applied to a small
subset of variables to obtain a smaller subset, called the return set, by stepwise elimination
of irrelevant variables. The top ranked return sets generated by BDA are as follows.
Size Influence score Variables
2 1250.200 4, 5
3 517.522 1, 2, 3
2 455.182 124, 168
3 409.097 35, 178, 194
2 401.324 82, 121
3 399.887 17, 78, 112
3 393.876 6, 43, 93
2 381.318 47, 117
3 376.208 27, 40, 156
3 358.346 37, 95, 132
From these influence scores, we can easily identify the influential variables, since the highest I-score return set of size two, {X4, X5}, and of size three, {X1, X2, X3}, have significantly higher I-scores than the other return sets of the same size.
Note that any feature selection method based on marginal effects would fail in this ex-
ample because there is no marginal effect from the influential variables X1, · · · , X5 on the
response variable Y . Unless X1, X2, X3 are considered together, their effect on Y will go
undetected because corr(Xi, Y ) = 0, i = 1, 2, 3. The same is true for X4, X5.
After identifying the return sets, we then turn the return sets into classifiers. Because the explanatory variables are all binary, classification trees are a good choice of classifier.
We shall classify the test cases according to the return sets, {X1, X2, X3} and {X4, X5}, by
growing a classification tree from each return set. Usually, we will have more than one return set of influential variables. We can assemble these classifiers into a classification rule using
the boosting method; see, e.g., Hastie, Tibshirani and Friedman (2009). The performance of
the classification rule and other classification methods such as linear discriminant analysis
(LDA), support vector machine (SVM), and random forest (RF) are reported below.
Error rate
Methods Training set Test set
LDA 0.2 0.52
SVM 0.0 0.52
RF 0.49 0.46
Proposed 0.26 0.24
The proposed method yields the lowest error rate for the test set at 24%, which is
also close to the best possible error rate because we do not know whether {X1, X2, X3} or
{X4, X5} is used to generate the value of Y . Among the four methods, SVM produces the
largest gap between the training and test set error rates. This is probably due to tuning
parameter estimation required by SVM so that the error rate for the training set seriously
underestimates the true one. The proposed method does not contain any tuning parameter.
3.2 The influence measure
Here we describe the two basic ideas in Lo and Zheng (2002, 2004), an influence measure and
BDA, which we adopted to analyze the toy example. For easy illustration, we assume that the
response variable Y is binary (taking values 0 and 1) and all explanatory variables are discrete.
Consider the partition Pk generated by a subset of k explanatory variables {Xb1 , · · · , Xbk}
according to their values. If all variables in the subset are binary, then there are 2^k partition elements. Let n1(j) be the number of observations with Y = 1 in partition element j, and let n̄1(j) = nj π1 be the expected number of 1-response observations in element j under the null hypothesis of no association, where nj is the number of observations in element j and π1 is the proportion of 1-response observations in the sample. Then the influence measure is a sum of squares over all partition elements:

I(Xb1, · · · , Xbk) = Σ_{j∈Pk} [n1(j) − n̄1(j)]².
The statistic I measures the squared deviation of Y -observations from what’s expected
assuming the subset of explanatory variables has no influence on Y . There are two properties
of I that make it effective. First, the measure does not require one to specify the form of
{Xb1, · · · , Xbk}’s joint influence on Y. Secondly, under the null hypothesis that the subset
has no influence on Y , the expected value of I remains unchanged when dropping variables
from the subset. The second property makes I fundamentally different from the Pearson
chi-square statistic whose expectation depends on the degrees of freedom and hence on the
number of variables used to define the partition. To see this, we rewrite I in its general form
when Y is not necessarily discrete:

I = Σ_{j∈Pk} nj² (Ȳj − Ȳ)²,
where Ȳj is the average of the Y-observations over the jth partition element and Ȳ is the overall average. It is shown in Chernoff et al (2009) that I has the same asymptotic distribution as a weighted sum of χ² random variables, each with one degree of freedom. This very property also provides the theoretical justification for the following
algorithm.
3.3 The backward dropping algorithm (BDA)
BDA is a greedy algorithm to search for the feature combination that maximizes the I-score
through stepwise elimination of features from an initial subset randomly selected from the
feature space. The details are as follows.
1. Training Set: Consider a training set {(y1, x1), · · · , (yn, xn)} of n observations, where
xi = (x1i, · · · , xpi) is a p-dimensional vector of explanatory variables. Typically p is
very large. All explanatory variables are discrete.
2. Sample from Variable Space: Randomly select a starting subset of k explanatory vari-
ables Sb = {Xb1 , · · · , Xbk}, b = 1, · · · , B.
3. Compute I Statistic:
I(Sb) = Σ_{j∈Pk} nj² (Ȳj − Ȳ)²
4. Drop Variables: For each variable in Sb, tentatively drop it from Sb and recalculate the corresponding I-score. Then permanently drop the variable whose removal gives the highest I-score. Call this new set of variables S′b, which has one fewer variable than Sb.
5. Return Set: Continue the next round of dropping on S′b until no variable is left to drop. Keep the subset of variables that yields the highest I-score in the whole dropping process. Refer to this subset of variables as the return set Rb, and keep it for future use.
If no variable in the return set has influence on Y , then the values of I will not change
much in the dropping process; see Figure 1. On the other hand, when important variables
are included in the return set, then the I score will increase rapidly before reaching the
maximum and decrease drastically afterwards; see Figure 2.
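The steps above can be sketched in a few lines of Python (a minimal illustration of the I-score and greedy dropping, not the authors' implementation; the usage example uses a deterministic parity rule in the spirit of Example 1):

```python
import numpy as np

def i_score(X, y):
    """I = sum over partition elements j of n_j^2 * (ybar_j - ybar)^2,
    where the partition is generated by the columns of X."""
    ybar = y.mean()
    cells = {}
    # group observations by their joint value on the selected variables
    for row, yi in zip(map(tuple, X), y):
        cells.setdefault(row, []).append(yi)
    return sum(len(ys) ** 2 * (np.mean(ys) - ybar) ** 2
               for ys in cells.values())

def backward_dropping(X, y, start_vars):
    """Greedy BDA: drop one variable at a time, keeping the drop that
    leaves the highest I-score; return the best subset seen overall."""
    current = list(start_vars)
    best_set, best_score = current[:], i_score(X[:, current], y)
    while len(current) > 1:
        # tentatively drop each variable and score the remaining set
        candidates = [[v for v in current if v != d] for d in current]
        scores = [i_score(X[:, c], y) for c in candidates]
        k = int(np.argmax(scores))
        current = candidates[k]
        if scores[k] > best_score:
            best_set, best_score = current[:], scores[k]
    return best_set, best_score

# Usage: Y depends only on variables 0 and 1; 2, 3, 4 are noise
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10))
y = (X[:, 0] + X[:, 1]) % 2
best, score = backward_dropping(X, y, [0, 1, 2, 3, 4])
print(sorted(best))
```

Dropping an irrelevant variable merges cells with identical cell means and so increases I, while dropping an influential one collapses the signal, which is exactly the behavior the dropping histories illustrate.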
Figure 1: Comparison of I-scores (influence measure vs. dropping rounds) for return sets (a) with information and (b) without information.
4 Method Description
The influence measure and BDA have been successfully applied to discover influential genes
for complex disorders with impressive results; see, e.g., Lo and Zheng (2002, 2004). For high-dimensional classification problems, here we develop a new method using these two ideas
as basic tools. The proposed classification method consists of four major parts: feature
selection, generating return sets from selected features, turning return sets into classifiers,
and assembling classifiers to form the final classification rule.
Figure 2: Put the flowchart of the method here.
4.1 Feature selection
When the number of features is moderate with respect to the number of observations as
in the toy example, we can directly apply BDA to the feature space and generate return
sets without doing feature selection first. However, when the number of features is in the
thousands, directly applying BDA is not computationally efficient in identifying key feature
clusters and thus we need to do feature selection before return set generation. The feature
selection is based on the influence measure I. The I-score is shared by a subset (cluster)
of features instead of a single feature. This calls for a treatment different from the usual
approach of ranking the features according to their scores, choosing a cut-off value, and
keeping all features with scores exceeding the cut-off value.
It is necessary to first decide on the starting size of the clusters to apply BDA. Here
the issue is the trade-off between computation cost and the order of detectable interactions.
The larger the cluster size the higher order of interaction we can detect. However, the
computation time increases by a factor depending on the total number of features. For
example, suppose we have 5000 features. Then there are 12.5 million pairs and more than
20 billion triplets. Even though the I-scores of triplets can provide information on 3rd order
interaction among features, the computation cost is more than 1000 times that of pairs,
which provide us only information about 2nd order interaction.
Suppose we have determined the cluster size for a given problem after considering the
trade-off between computational cost and the order of detectable interaction. We then
calculate the I-scores for all clusters of chosen size. The question now is how to choose
a threshold value so that feature clusters with I-scores higher than the threshold will be
retained. In this regard, we found that usually the number of feature clusters increases dramatically after a certain value of the I-score, which is designated as the threshold; see Figure 3.
We then look at the retention frequencies of the retained features with I-scores higher
than the threshold. The ordered retention frequencies typically show sharp drops in the
beginning and little differences after a certain point, which then determines a set of high-
frequency features; see Figure 4.
The set of high-frequency variables are candidates for influential features. This is because an influential feature not only produces high I-scores but also does so more frequently than non-influential ones when combined with other variables. Usually, there is only a moderate number of high-frequency variables ready to be used in the next step.
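The retention-frequency step amounts to a simple count; a sketch (the clusters below are hypothetical, for illustration only):

```python
from collections import Counter

# Hypothetical high-I-score clusters retained after thresholding
retained_clusters = [(12, 47, 95), (12, 47, 130), (12, 88, 95),
                     (47, 95, 201), (5, 160, 233)]

# Count how often each feature appears across the retained clusters
freq = Counter(f for cluster in retained_clusters for f in cluster)
ordered = freq.most_common()
print(ordered[:3])  # features ranked by retention frequency

# A sharp drop in the ordered frequencies separates the high-frequency
# (candidate influential) features from the rest
drops = [ordered[i][1] - ordered[i + 1][1] for i in range(len(ordered) - 1)]
```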
Figure 3: 2nd difference of ordered I-scores for every 1,000 triplets.
4.2 Generate return sets
We now apply the BDA to the high frequency variables obtained from the previous step.
There are two parameters we need to determine before applying the BDA: the starting size
and the number of repetitions.
The starting size refers to the initial number of features selected to which we apply the
BDA. The starting size depends on the number of cases in the training set. If the starting size is so large that all partition elements contain no more than one training case, the dropping is basically random, and one can start with a smaller size, achieving the same result with less
computing time. The ideal starting size is such that some partition elements contain two or
more observations. Using Poisson approximation, we can calculate the expected number of
partition elements with two or more observations. Requiring this expected number to be at least one leads to a starting size k satisfying

n² / (2 m_{k−1}) ≥ 1, (3)
where n is the number of training cases and mk−1 is the number of partition elements
generated by a set of k−1 variables. Thus (3) provides an upper bound for the starting size.
Suppose that there are 100 cases in the training set and each variable assumes two values. Because thirteen variables generate a partition with 2^13 = 8192 elements and 100²/8192 > 1, we can choose the starting size to be 13, which is the largest integer satisfying (3).

Figure 4: 1st difference of retention frequency from top 21,000 triplets.
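Condition (3) can be checked mechanically. A small sketch, assuming m_{k−1} = levels^(k−1) for variables that all take the same number of values:

```python
def max_starting_size(n, levels=2):
    """Largest starting size k satisfying n^2 / (2 * m_{k-1}) >= 1,
    where m_{k-1} = levels**(k-1) is the number of partition elements
    generated by k-1 discrete variables (condition (3))."""
    k = 1
    # the loop test checks whether size k+1 still satisfies (3)
    while n**2 / (2 * levels**k) >= 1:
        k += 1
    return k

print(max_starting_size(100))  # binary variables, 100 training cases
```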
The number of repetitions for the BDA depends on the number of training cases as well.
The number of training cases determines the size of return sets that can be supported by
the training set. For example, if we have 100 training cases, then they can support a return set of size 4, assuming each explanatory variable is binary. Since each return set of size 4 has 2^4 = 16 partition elements, each partition element on average contains 100/16 > 5 training cases, indicating the adequacy of the chi-square approximation by a common rule for the chi-square goodness-of-fit test. Hence return sets of size 4 are well-supported by the
training set of 100 cases. In this case, we would like to make sure that the BDA is repeated a sufficient number of times so that the quadruplets are covered rather completely. Here we encounter a variation of the coupon-collector problem.
Let p be the number of variables and let k be the starting size, and write C(m, 4) for the binomial coefficient. Then there are C(p, 4) quadruplets from p variables, and each run of BDA covers C(k, 4) of them. Consider the following coverage problem. There are a total of C(p, 4) boxes, each corresponding to a quadruplet. Each time, C(k, 4) balls, corresponding to the quadruplets contained in a starting set of size k, are randomly placed into the boxes so that each ball is in a different box. In Appendix A2, it is shown that roughly we are expected to repeat the BDA

B ≈ [C(p, 4) / C(k, 4)] log C(p, 4) (4)

times to have a complete coverage of quadruplets among the p features.
The preceding result does not take into account that each time we do not place one ball
but a group of balls into boxes. It is well known that grouping will increase the proportion
of vacant boxes; see e.g. Hall (1988). Therefore the expected number of repetitions required
to cover all quadruplets will be greater (the exact result is an open problem). We propose
to use 2B as an upper bound. In all simulated and real-data examples, we have never missed a key cluster of the chosen size after running 2B repetitions.
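Formula (4) and the 2B bound are straightforward to evaluate; a sketch (the values of p and k below are illustrative, not from the paper):

```python
import math

def expected_repetitions(p, k):
    """B ~ (C(p,4)/C(k,4)) * log C(p,4): the coupon-collector
    approximation (4) for covering all quadruplets of p features
    with BDA runs that each start from k features."""
    n_quad = math.comb(p, 4)
    return n_quad / math.comb(k, 4) * math.log(n_quad)

B = expected_repetitions(200, 13)   # e.g. 200 high-frequency features
print(round(2 * B))                 # the proposed upper bound 2B
```

Larger starting sizes k cover more quadruplets per run and so require fewer repetitions, at higher per-run cost.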
The return sets generated from BDA will undergo two filtering procedures to remove
between-cluster correlation and false positives. The first procedure is to filter out return
sets with overlapping variables. Because the return sets will be converted into classifiers in
the next step and it is desirable to have weakly correlated classifiers, we shall remove those
return sets containing common variables. This can be done by first sorting the return sets in decreasing order of I-score and then removing from the list any return set that has variables in common with a higher-scoring return set. The return sets remaining after overlap removal are then subjected to a forward adding algorithm to remove false
positives. Please see Appendix A3 for details. The filtering procedures are very important,
as found in practice, for lowering error rates. Sometimes, the error rates are much improved
after the filtering procedures.
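The first filtering step, removing return sets that overlap a higher-scoring one, can be sketched as a single greedy pass (the I-scores and variable sets below are hypothetical):

```python
def filter_overlapping(return_sets):
    """Keep return sets in decreasing I-score order, discarding any
    set that shares a variable with an already-kept, higher-scoring set."""
    kept, used = [], set()
    for score, variables in sorted(return_sets, key=lambda t: t[0], reverse=True):
        if used.isdisjoint(variables):
            kept.append((score, variables))
            used.update(variables)
    return kept

# Hypothetical (I-score, variable set) pairs
sets_ = [(1250.2, frozenset({4, 5})),
         (517.5, frozenset({1, 2, 3})),
         (455.2, frozenset({3, 7})),    # overlaps {1, 2, 3}: removed
         (401.3, frozenset({8, 9}))]
print([sorted(v) for _, v in filter_overlapping(sets_)])
```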
4.3 Turning return sets into classifiers
After return sets have been generated, the next task is to classify Y ’s based on information
contained in each single return set. Because the number of variables contained in one return
set is quite small, the traditional setting of large n small p prevails here, and all classification
methods developed under such setting can be applied. In this paper, we would like to take
advantage of interactions among variables for the prediction of response variable Y , using
methods that can readily incorporate interactions among variables. Furthermore, methods
without tuning parameters are preferred. Because the sample size is only moderate, if the
classifier contains a tuning parameter, part of the sample must be set aside as a validation
set for tuning-parameter selection, leaving fewer observations for training the classifier
(training set) and for estimating the error rate (testing set). Another reason for preferring
classifiers without tuning parameters is that tuning-parameter selection is hard. Ghosh and
Hall (2008) pointed out that cross-validation is a poor method for tuning-parameter selection
because “the part of the cross-validation estimator that depends on the tuning parameter is
very highly stochastically variable, so much so that it has an unboundedly large number of
local extrema which bear no important asymptotic relationship to the parameters that
actually minimize the risk.”
Another factor to consider is that classifiers making fewer distributional assumptions are
more desirable. In §4.4.4 of Hastie, Tibshirani, and Friedman (2009), it is asserted that
Fisher’s linear discriminant analysis (LDA) makes more distributional assumptions than
logistic regression, even though the two methods take similar forms in terms of the
classification boundary.
Actually, in many microarray datasets that we tried, logistic regression yields significantly
lower error rates than LDA. With all factors considered, logistic regression seems to be a
good choice.
In the logistic-regression classifier, we include all possible interaction terms based on a
return set. A return set of size 4 gives rise to 16 terms, including up to the 4-way interaction,
in the logistic regression as the full model. We can then apply a model selection criterion to
select a good submodel. For the three real-data examples in §5, we used AIC.
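As a sketch of the full-model construction (not the authors' code), the following expands a return set of k variables into the 2^k-column interaction design: intercept, main effects, and all products up to the k-way interaction. Fitting the logistic regression on these columns and selecting a submodel by AIC would then proceed with standard software; the function name is illustrative.

```python
import numpy as np
from itertools import combinations

def interaction_design(X):
    """Expand an n-by-k matrix of return-set variables into the full
    interaction design: intercept, main effects, and all products up
    to the k-way interaction (2**k columns for k variables)."""
    n, k = X.shape
    cols = [np.ones(n)]                      # intercept
    for order in range(1, k + 1):
        for idx in combinations(range(k), order):
            # product of the selected variables, one column per subset
            cols.append(np.prod(X[:, idx], axis=1))
    return np.column_stack(cols)
```

For a size-4 return set this yields 1 + 4 + 6 + 4 + 1 = 16 columns, matching the count in the text.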
4.4 Assemble classifiers to form the classification rules
In the previous section, we described how to construct a classifier from a single return set.
We now describe how to combine these classifiers to form a classification rule. The literature
offers quite a few ways to do this: bagging, Bayesian model averaging, boosting, etc.
Except for boosting, most ensemble learning methods are based on the idea of averaging.
Bagging averages in a straightforward manner. Bayesian model averaging is a weighted
average with the posterior probabilities of the classifiers as the weights. There are other
methods where the weight is proportional to the performance of the classifier; see the
ensemble method in Peng (2005). Boosting, on the other hand, is based on the idea of
complementing. In boosting, the next classifier added to the classification rule is one that
does well on the cases misclassified by the classifiers already in the current rule.
Boosting is most appropriate when several factors affect the probability of Y = 1 and each
factor depends on a small subset of variables. In this case, the classifier based on a return
set (or feature cluster) represents the effect of one factor. The AdaBoost method of Freund
and Schapire (1997) can be viewed as constructing an additive classification rule from a set
of basis functions by minimizing the misclassification risk according to an exponential loss
function. Here the set of basis functions consists of classifiers, each corresponding to a
return set. For this paper, the boosting algorithm is adapted as follows.
1. Find H return sets with the highest HTD scores.
2. Initialize training-case weights wi = 1/n, i = 1, · · · , n.
3. For h = 1 to H do
(a) For j = h to H, fit a logistic-regression classifier L_j to the training data using return
set R_j. Calculate
\mathrm{err}_j = \frac{\sum_i w_i I(y_i \neq L_j(x_i))}{\sum_i w_i}, \qquad
\alpha_j = \frac{1}{2} \log \frac{1 - \mathrm{err}_j}{\mathrm{err}_j}.
(b) Let j' = \arg\max_{h \le j \le H} \alpha_j. Update
w_i \leftarrow w_i \times \exp\bigl( \alpha_{j'} I(y_i \neq L_{j'}(x_i)) \bigr).
(c) Relabel R_{j'} as R_h with corresponding \alpha_h, and relabel the remaining H − h return
sets as \{R_{h+1}, \ldots, R_H\}.
4. Output the classification rule \mathrm{sign}\{ \sum_{h=1}^{H} \alpha_h L_h \}.
The final classification rule is one in which interactions among features are allowed within
each component classifier but not among features in different classifiers. As the classifiers are
added one by one to the classification rule, we expect the error rate for the training set
to drop continuously after each update, reflecting an improved fit to the training sample.
Hence the error-rate path for the testing sample, as classifiers are sequentially added to the
final rule, is an effective tool for detecting over-fitting, because information from the test
sample is not used in constructing the classification rule, i.e., the boosting procedure.
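The adapted boosting loop in steps 1–4 can be sketched as below. This is a minimal sketch, not the authors' implementation: each round fits (here, evaluates) every remaining return-set classifier on the current weights, moves the one with the largest α into the rule, and up-weights the cases it misclassifies. The function names and the ±1 label coding are assumptions for illustration.

```python
import numpy as np

def adapted_boosting(classifiers, X, y):
    """Sketch of the adapted boosting loop.

    `classifiers` is a list of H fitted return-set classifiers, each a
    function mapping X to labels in {-1, +1}; `y` holds labels in the
    same coding.  At each round the classifier with the largest alpha
    on the current weights is moved into the rule.
    """
    n = len(y)
    w = np.full(n, 1.0 / n)                  # step 2: uniform case weights
    remaining = list(classifiers)
    rule = []                                # list of (alpha, classifier)
    while remaining:                         # step 3: one round per classifier
        best = None
        for L in remaining:                  # step (a): err_j and alpha_j
            miss = (L(X) != y)
            err = np.sum(w * miss) / np.sum(w)
            err = np.clip(err, 1e-12, 1 - 1e-12)   # guard log(0)
            alpha = 0.5 * np.log((1 - err) / err)
            if best is None or alpha > best[0]:
                best = (alpha, L, miss)
        alpha, L, miss = best                # step (b): argmax over alpha_j
        w = w * np.exp(alpha * miss)         # up-weight misclassified cases
        rule.append((alpha, L))              # step (c): move into the rule
        remaining.remove(L)
    # step 4: sign of the alpha-weighted sum of component classifiers
    return lambda Z: np.sign(sum(a * L(Z) for a, L in rule))
```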
5 Real-Data Examples
This section contains three real-data examples. The first two are gene expression datasets
for breast cancer. They are from van’t Veer et al (2002) and Sotiriou et al (2003), which
originally contain 24187 genes for 97 patients and 7650 genes for 99 patients, respectively.
The purpose of the first study is to classify female breast cancer patients according to relapse
and non-relapse clinical outcomes using gene expression data, while that of Sotiriou et al
(2003) is to classify tumor subtypes.
In the van’t Veer dataset, we kept 4918 genes for the classification task, which were
obtained from Tibshirani and Efron (2002). Following van’t Veer et al, 78 cases out of 97
are used as the training sample (34 relapse and 44 non-relapse) and 19 (12 relapse and 7
non-relapse) as the test sample. Our method yields a 10.5% error rate on the test sample,
which matches the best result reported in the literature. To check the stability of our
method, we further ran ten external cross-validation experiments by randomly partitioning
the 97 subjects into a training sample of size 87 and a test sample of size 10. The average
test-set error rate over these ten experiments is 8.0%. The error rates of the 10 training sets
and 10 testing sets are given in Figures 5, 6 and 7.
Note that in all ten CV experiments, the error rate on the test set generally decreases
as more classifiers are added to the classification rule by the boosting method. Because
the classification rule uses no information from the test sets, this indicates that the proposed
method does not suffer from overfitting. Table 1 provides more details on the performance
of the methods reviewed in §2.3. It is easy to see that the proposed method does quite well
against other methods in the literature.
The Sotiriou et al. dataset contains 7650 genes on 99 patients. The task is to classify
tumors according to their estrogen receptor (ER) status using gene expression information.
[Error-rate paths (error vs. number of return sets, rs.no) for CV groups G1–G3, training and test panels.]
Figure 5: 10GCV Voting result, a.
[Error-rate paths (error vs. number of return sets, rs.no) for CV groups G4–G6, training and test panels.]
Figure 6: 10GCV Voting result, b.
[Figure 7: error-rate paths (error vs. number of return sets, rs.no) for CV groups G7–G10, training and test panels.]
Table 1: Error rates on van’t Veer dataset by various methods in the literature

Author                    Feature selection          Classifier evaluated   Error rate   Method
Grate et al (2002)        None                       ℓ1-SVM                 0.010*       LOOCV
Pochet et al (2004)       None                       LS-SVM(a),             0.310        LOOCV
                                                     linear kernel          0.321        Test set
                          None                       LS-SVM,                0.309        LOOCV
                                                     RBF(b) kernel          0.316        Test set
                          None                       LS-SVM,                0.478        LOOCV
                                                     no regularization      0.428        Test set
                          PCA(c) (unsupervised)      FDA(d)                 0.297        LOOCV
                                                                            0.426        Test set
                          PCA (supervised)           FDA                    0.265        LOOCV
                                                                            0.331        Test set
                          kPCA(e), linear kernel     FDA                    0.288        LOOCV
                          (unsupervised)                                    0.391        Test set
                          kPCA, linear kernel        FDA                    0.264        LOOCV
                          (supervised)                                      0.346        Test set
                          kPCA, RBF kernel           FDA                    0.251        LOOCV
                          (unsupervised)                                    0.486        Test set
                          kPCA, RBF kernel           FDA                    0.000        LOOCV
                          (supervised)                                      0.632        Test set
Li and Yang (2005)        RFE                        SVM                    0.105(n)     Test set
                          RFE                        Ridge regression       0.158(n)     Test set
                          RFE                        Rocchio                0.158(n)     Test set
Michiels et al (2005)     Correlation                Correlation            0.310        500 rCV(p)
Peng (2005)               Golub(f)                   SVM                    0.247*       LOOCV
                          Golub                      Bagging SVM            0.226*       LOOCV
                          Golub                      Boosting SVM           0.226*       LOOCV
                          Golub                      Ensemble SVM           0.186*       LOOCV
Yeung et al (2005)        BMA(g)                     BMA                    0.158*       Test set
Alexe et al (2006)        LAD(h)                     LAD                    0.183*       CV
Diaz-Uriarte and          None                       Random forest          0.342        bootstrap
de Andres (2006)          None                       SVM                    0.325        bootstrap
                          None                       KNN(i)                 0.337        bootstrap
                          None                       DLDA(j)                0.331        bootstrap
                          Shrunken centroid          Shrunken centroid      0.324        bootstrap
                          NN+VS(k)                   NN+VS                  0.337        bootstrap
Wahde & Szallasi (2006)   Evolutionary algorithm     LDA                    0.105*       Test set
Song et al (2007)         RFE                        SVM                    0.077*       10-fold CV
Zhu et al (2007)          RFE                        SVM                    0.29(o)      10-fold CV
Yan and Zheng (2008)      sMPAS(l)                   sMPAS                  0.295        13-fold CV
Liu et al (2009)          EICS(m)                    EICS                   0.219        10-fold rCV

(a) Least-squares SVM; (b) radial basis function; (c) principal component analysis; (d) Fisher
discriminant analysis; (e) kernel principal component analysis; (f) the feature selection method
used in Golub et al (1999); (g) Bayesian model averaging; (h) logical analysis of data;
(i) K-nearest neighbor; (j) diagonal linear discriminant analysis; (k) nearest neighbor with
variable selection; (l) signed multigene association; (m) ensemble independent component
system; (n) estimated from graphs; (o) the best bias-corrected error rate by SVM with RFE
starting from 5422 genes; (p) random CV; * biased error-rate estimates due to tuning-parameter
selection and/or feature selection.
This is different from the objective of van’t Veer et al (2002), where the goal is to discriminate
relapse patients from non-relapse ones. We computed the I-scores for all pairs of the 7650
genes, from which the top 5000 high-scoring genes were selected. Then we followed the same
procedure as in the analysis of the van’t Veer dataset. The average error rate over the 10 CV
groups is 5%. This matches the best results in the literature; see Zhang et al (2006).
The third gene expression dataset is from Golub et al (1999). The dataset consists of
7129 genes, with 38 cases in the training set and 34 in the test set. The purpose is to classify
acute leukemia into two subtypes: acute lymphoblastic leukemia (ALL) and acute myeloid
leukemia (AML). This dataset was analyzed in the same way as the Sotiriou et al. data,
except that we used two CV experiments here. In Golub et al (1999), perfect classification
was achieved on 29 of the 34 test cases; the signals contained in the other five test cases
were considered too weak to be classified. Among these five weak-signal test cases, one has
been consistently misclassified by nearly all follow-up reanalyses. We included all 5 weak-signal
test cases in our analysis. Our classification rule, which consists of 11 gene clusters with a
total of 34 genes, correctly classified all 34 test cases. The average test-set error rate on the
two CV test sets is 5.9%.
[Error-rate paths (error vs. number of return sets, rs.no) for the Golub dataset, training and test panels.]
Figure 8: Error rate paths for Golub dataset
6 Conclusion and discussion
In this paper, we make use of two key ideas from Lo and Zheng (2002, 2004) and Chernoff
et al. (2009) to develop a high-dimensional classification method. The method produces
better error rates than other methods in the literature that do not consider interaction
effects among variables. The feature-cluster selection procedure that we adopted from Lo
and Zheng (2002, 2004) was originally designed to discover influential genes responsible for
the disease under study rather than to minimize classification error rates. When applied to
analyze inflammatory bowel disease (IBD), on top of confirming many previously identified
susceptibility loci reported in the literature, Lo and Zheng (2004) found four new loci not
reported before, which were later replicated in a UK genome-wide association study (Barrett
et al, 2008). Therefore, the proposed method is intended to have two desirable properties:
1. The classification rules derived from the method have low error rate.
2. In the process of deriving the classification rule, influential variables for the response
will be identified.
In view of the results presented in this paper, it seems that the feature clusters involved in
the classification rules contain information worth further investigation toward understanding
the disease signature.
References
[1] Alexe, B., Alexe, S., Axelrod, D. et al (2006), Breast cancer prognosis by combinatorial
analysis of gene expression data, Breast Cancer Research, 8:R41.
[2] Barrett, J.C., Hansoul, S., Nicolae D.L. et al (2008), Genome-wide association defines
more than 30 distinct susceptibility loci for Crohn’s disease, Nat. Genet., 40, 955-962.
[3] Candes, E.J. and Tao, T. (2007), The Dantzig selector: Statistical estimation when p
is much larger than n (with discussion), Ann. Stat., 35, 2313-2404.
[4] Chernoff, H., Lo, S. H., and Zheng, T. (2009), Discovering influential variables: a
method of partitions, Annals of Applied Stat., 3, 1335-1369.
[5] Diaz-Uriarte, R. and de Andres, S. A. (2006), Gene selection and classification of mi-
croarray data using random forest, BMC Bioinformatics, 7:3.
[6] Fan, J. and Lv, J. (2008), Sure independence screening for ultrahigh dimensional feature
space (with discussion), J. Roy. Statist. Soc. Ser. B, 70, 849-911.
[7] Freund, Y. and Schapire, R. (1997), A decision-theoretic generalization of online learning
and an application to boosting, Journal of Computer and System Science, 55, 119-139.
[8] Golub, T. R. et al (1999), Molecular classification of cancer: class discovery and class
prediction by gene expression monitoring, Science, 286, 531-537.
[9] Ghosh, A. K. and Hall, P. (2008), On error-rate estimation in nonparametric classifica-
tion, Statistica Sinica, 18, 1081-1100.
[10] Grate, L. R. et al (2002), Simultaneous relevant feature identification and classification
in high dimensional spaces, Algorithm in Bioinformatics Proceedings, 2452, 1-9.
[11] Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. (2002), Gene selection for cancer
classification using support vector machine, Mach. Learning, 46, 389-422.
[12] Hastie, T., Tibshirani, R., and Friedman, J. (2009), The elements of statistical learning,
2nd Ed. Springer, New York.
[13] Li, F. and Yang, Y. M. (2005), Analysis of recursive gene selection approaches from
microarray data, Bioinformatics,21(19), 3741-3747.
[14] Liu, K. H. et al (2009), Microarray data classification based on ensemble independent
component selection, Computers in Biology and Medicine, 39, 953-960.
[15] Lo, S. H. and Zheng, T. (2002), Backward haplotype transmission association (BHTA)
algorithm - a fast multiple-marker screening method, Human Heredity, 53, 197-215.
[16] Lo, S. H. and Zheng, T. (2004), A demonstration and findings of a statistical approach
through reanalysis of inflammatory bowel disease data, Proceedings National Academy
of Science, 101, 10386-10391.
[17] Meinshausen, N. and Yu, B. (2009), Lasso-type recovery of sparse representation for
high-dimensional data, Ann. Stat., 37, 246-270.
[18] Michiels, S. S. et al (2005), Prediction of cancer outcome with microarrays: a multiple
random validation strategy, Lancet, 365(9458), 488-492.
[19] Peng, Y. H. (2005), Robust ensemble learning for cancer diagnosis based on microarry
classification, Advanced Data Mining and Applications, Proceedings, 3584, 564-574.
[20] Pochet, N. F. et al (2004), Systematic benchmarking of microarray and classification:
assessing the role of non-linearity and dimensionality reduction, Bioinformatics, 20(17),
3185-3195.
[21] Song, L., Bedo, J., Borgwardt, K.M., et al (2007), Gene selection via the BAHSIC
family of algorithms, Bioinformatics (ISMB), 23(13), i490-i498.
[22] Sotiriou, C. et al (2003), Breast cancer classification and prognosis based on gene expres-
sion profiles from a population-based study, Proceedings National Academy of Science,
100, 10393-10398.
[23] Tibshirani, R. and Efron, B. (2002), Pre-validation and inference in microarray, Statis-
tical Application in Genetics and Molecular Biology, 1(1), Article 1.
[24] van’t Veer, L. J. et al (2002), Gene expression profiling predicts clinical outcome of
breast cancer. Nature 415, 530 - 536.
[25] Wahde, M. and Szallasi, Z. (2006), Improving the prediction of the clinical outcome of
breast cancer using evolutionary algorithms, Soft Computing, 10(4), 338-345.
[26] Yan, X. and Zheng, T. (2008), Selecting informative genes for discriminant analysis
using multigene expression profiles, BMC Genomics, 9(Suppl 2), 1471-2164-9-S2-S14.
[27] Yeung, K. Y., et al (2005), Bayesian model averaging: development of an improved
multi-class, gene selection and classification tool for microarray data, Bioinformatics,
21(10), 2394-2402.
[28] Zhang, H., et al (2006), Gene selection using support vector machine with non-convex
penalty, Bioinformatics, 22(1), 88-85.
[29] Zhu, J. X., McLachlan, G. J., Ben-Tovim Jones, L. and Wood, I. A. (2008), On selection
bias with prediction rules formed from gene expression data, J. Stat. Planning and
Infer., 138, 374-386.
Appendix A1: The starting cluster size
When choosing a size k for forming an initial set of variables to run the backward dropping
algorithm, the objective is to have a small chance of erroneously dropping an influential
variable, i.e., non-informative screening. If all cells in the partition generated by the k
variables contain either one or no training cases, the dropping is essentially random. On the
other hand, if there are partition elements with two or more training cases, the influence of
a variable can be informatively measured when it is dropped. Treating the elements of
a partition as urns of the same size and the subjects as balls, we evaluate below the
probability of having an element with at least 2 observations using an urn model.
Let m be the number of boxes and n the number of balls. Dropping balls into boxes
uniformly at random, let p_2 be the probability that a particular box contains two or more
balls. The number of balls in a particular box follows the binomial distribution with success
probability 1/m and n trials. Thus
p_2 = 1 - \left(\frac{m-1}{m}\right)^{n} - \frac{n}{m}\left(\frac{m-1}{m}\right)^{n-1}.
Therefore m \times p_2 is the expected number of boxes with 2 or more balls. We assume that
n \to \infty and m \to \infty such that n/m = \lambda. Then
m p_2 = m\left[1 - \left(\frac{m-1}{m}\right)^{n} - \frac{n}{m}\left(\frac{m-1}{m}\right)^{n-1}\right]
      = m\left[1 - \left(\frac{m-1}{m}\right)^{m \cdot \frac{n}{m}} - \frac{n}{m}\left(\frac{m-1}{m}\right)^{m \cdot \frac{n-1}{m}}\right]
      \to m\left(1 - e^{-\lambda} - \lambda e^{-\lambda}\right).
If we further assume that \lambda is small and close to zero, then
m\left(1 - e^{-\lambda} - \lambda e^{-\lambda}\right)
  = m\left[1 - \left(1 - \lambda + \tfrac{\lambda^2}{2} + o(\lambda^2)\right) - \lambda\left(1 - \lambda + \tfrac{\lambda^2}{2} + o(\lambda^2)\right)\right]
  = m\left[\tfrac{\lambda^2}{2} + o(\lambda^2)\right]
  \approx \frac{m}{2}\left(\frac{n}{m}\right)^2 = \frac{n^2}{2m}.
Therefore, if we let m_{k-1} denote the number of elements in a partition generated by k - 1
variables, then
\frac{n^2}{2 m_{k-1}} \ge 1, \qquad (5)
implies that the expected number of partition elements with two or more training cases, after
dropping one variable from the starting set of k variables, is at least one. Note that in reality
the number of training cases in different partition elements is non-uniform, so the number of
cells with two or more training cases would be larger than 1. Therefore a starting size
satisfying (5) represents a minimum requirement under the uniform assumption.
A similar calculation for the expected number of boxes with three or more balls yields
m\left(1 - e^{-\lambda} - \lambda e^{-\lambda} - \frac{\lambda^2}{2} e^{-\lambda}\right)
  = m e^{-\lambda}\left[e^{\lambda} - 1 - \lambda - \frac{\lambda^2}{2}\right]
  = m\left[\frac{\lambda^3}{3!} + o(\lambda^3)\right]
  \approx \frac{m}{6}\left(\frac{n}{m}\right)^3 = \frac{n^3}{6 m_{k-1}^2}.
Thus the corresponding condition on the starting size k is
\frac{n^3}{6 m_{k-1}^2} \ge 1. \qquad (6)
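The small-λ approximations above can be checked numerically against the exact binomial expectations. The sketch below (an illustration, not part of the paper) computes the exact expected number of boxes with at least r balls and compares it with the n²/(2m) and n³/(6m²) approximations; the function names are chosen here for clarity.

```python
import math

def expected_multi_boxes(n, m, r=2):
    """Exact expected number of boxes containing at least r balls when
    n balls are dropped uniformly at random into m boxes (the count in
    each box is Binomial(n, 1/m))."""
    p_lt = sum(math.comb(n, j) * (1 / m) ** j * (1 - 1 / m) ** (n - j)
               for j in range(r))
    return m * (1 - p_lt)

# The small-lambda approximations from Appendix A1 (lambda = n/m):
approx2 = lambda n, m: n ** 2 / (2 * m)       # boxes with >= 2 balls
approx3 = lambda n, m: n ** 3 / (6 * m ** 2)  # boxes with >= 3 balls
```

With n = 100 training cases and m = 10^5 partition cells (λ = 10^-3), the exact and approximate values agree to several decimal places, as the derivation predicts.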
Appendix A2: The number of repetitions for backward
dropping algorithm
Assume the placement times follow a Poisson process of rate 1. Then
P(\text{box } j \text{ empty at time } t) \approx \exp\left(-t \,\binom{k}{4} \Big/ \binom{p}{4}\right)
for any particular box j. Poissonization makes the boxes independent, so Q_t, the number of
empty boxes at time t, satisfies
Q_t \approx \text{Poisson with mean } \binom{p}{4} \exp\left(-t \,\binom{k}{4} \Big/ \binom{p}{4}\right).
Let B be the first time all boxes are occupied; then
P(B \le t) = P(Q_t = 0) \approx \exp\left[-\binom{p}{4} \exp\left(-t \,\binom{k}{4} \Big/ \binom{p}{4}\right)\right].
This can be rearranged to
\frac{\binom{k}{4}}{\binom{p}{4}}\left[B - \frac{\binom{p}{4}}{\binom{k}{4}} \log \binom{p}{4}\right] \approx \xi,
where P(\xi \le x) = \exp(-e^{-x}). Therefore, B \approx \frac{\binom{p}{4}}{\binom{k}{4}} \log \binom{p}{4}.
The generalization to clusters of size z is obtained by substituting z for 4:
B_z \approx \frac{\binom{p}{z}}{\binom{k}{z}} \log \binom{p}{z}.
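The B_z estimate is straightforward to compute. The sketch below (illustrative, with an assumed function name) treats each size-z subset as a coupon, with per-draw coverage probability C(k, z)/C(p, z), and returns the coupon-collector estimate of the number of BDA repetitions.

```python
import math

def repetitions_to_cover(p, k, z=4):
    """Approximate number of BDA repetitions B_z needed so that every
    size-z subset of the p variables appears in at least one random
    initial set of k variables (coupon-collector limit, Appendix A2)."""
    boxes = math.comb(p, z)            # number of size-z subsets ("boxes")
    hit = math.comb(k, z) / boxes      # chance one draw covers a given subset
    return math.log(boxes) / hit       # B_z ~ (1/hit) * log(boxes)
```

As expected, the required number of repetitions grows rapidly with p and shrinks as the initial-set size k grows, which is why the text uses 2B_z as a practical upper bound.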
Appendix A3: Forward adding algorithm to remove false
positive return sets
Let H be the total number of return sets after filtering out overlapping ones. The second
filter, for false positives, is carried out as follows. The algorithm evaluates the effect
of adding variables to each return set R_h, h = 1, . . . , H. Let |R_h| be the number of variables
in R_h. Without loss of generality, assume R_h = {X_1, . . . , X_{|R_h|}}. The remaining variables
X_{|R_h|+1}, . . . , X_p are added to R_h one at a time to generate p − |R_h| forward-one sets. The
I-score is calculated for each of these forward-one sets, and A_h denotes the number of them
with an I-score greater than that of R_h. A_h represents the number of size-(|R_h| + 1) initial
sets that would not lead to R_h if subjected to the BDA.
Even though we use forward adding for filtering, the motivation still lies in the backward-
dropping mechanism. For a true return set containing only influential variables, it should
be difficult to increase the I-score by adding a variable; equivalently, we should consistently
recover this true return set by applying the BDA to initial sets consisting of the set plus one
more randomly chosen variable. The same should not hold for a false-positive return set.
A return set with many forward-one sets implies that whether it is returned by the BDA
depends on the other variables in the initial set, which indicates that the return set is likely
a false positive.
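The count A_h can be sketched as follows. This is an illustration, not the paper's code: `i_score` stands in for the I-score statistic and is assumed to be a function from a tuple of variable indices to a number; a large count flags the return set as a likely false positive.

```python
def forward_one_count(return_set, all_vars, i_score):
    """Count A_h: the number of single-variable augmentations of a
    return set whose I-score exceeds the return set's own I-score.

    `i_score` is a placeholder for the I-score statistic (a function
    from a tuple of variable indices to a float)."""
    base = i_score(tuple(return_set))
    count = 0
    for v in all_vars:
        if v in return_set:
            continue                          # only add new variables
        if i_score(tuple(return_set) + (v,)) > base:
            count += 1
    return count
```

A threshold on this count (chosen from the outlier pattern in its frequency plot, as in Figure 9) then determines which return sets are discarded.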
For the van’t Veer et al example, we selected two return sets generated from one of the
10 CV experiments. In Table 2, the ‘false positive’ set has 7 forward-one sets, even though
its I-score is higher than that of the ‘true’ set, which has only 1 forward-one set. After we
remove the false-positive return sets according to the forward-one results, the final
classification rule has a considerably lower error rate in the testing sample.
Usually the frequency plot of A_h clearly shows a set of outliers, as in Figure 9 below.
Therefore it is easy to determine a threshold on A_h for removing return sets.
Table 2: Examples of forward-one sets

                          False                               True
Original return set       {665, 2283, 2930}   I = 422.390     {108, 2400, 4208}  I = 410.778
Forward-one sets          {1451, 2930}        I = 523.035     {2930, 4208}       I = 427.351
                          {665, 2283, 2930}   I = 470.498
                          {665, 1668, 2930}   I = 450.050
                          {1946, 2930}        I = 438.888
                          {1885, 2283, 2930}  I = 438.298
                          {2283, 3291}        I = 426.516
                          {2283, 2900, 2930}  I = 423.930
[Histogram: x-axis, number of forward-one sets; y-axis, appearance frequency.]
Figure 9: Frequency distribution for the number of forward-one sets