A simple procedure to detect non-central observations from a sample
Mylène DUVAL∗, Céline DELMAS∗, Béatrice LAURENT†, Christèle ROBERT-GRANIÉ∗
∗SAGA, INRA Toulouse, France
†Département de Mathématiques, INSA Toulouse, France
Summary
In this paper we propose a simple procedure to select the subset of non-central observations from a sample. This problem is motivated by the detection of differentially expressed genes between conditions in microarray experiments. Under mild conditions, we prove the consistency of the selected subset of observations. By simulations, we compare our procedure with the Benjamini and Hochberg procedure, a procedure based on mixture models, and a procedure based on model selection.
Some key words: microarray data; multiple testing procedure; ordered statistics; partial sums.
1. Introduction
The aim of this paper is to propose a simple procedure to detect non-central observations
from a sample. We will focus on the study of two samples $(\chi_i)_{i=1,\dots,n}$ and $(F_i)_{i=1,\dots,n}$. We write
$$\chi_i = \sum_{k=1}^{\nu_1} (\gamma_i N_{ik} + \delta_{ik})^2, \quad 1 \le i \le n, \qquad (1)$$
$$F_i = \frac{\sum_{k=1}^{\nu_1} (\gamma_i N_{ik} + \delta_{ik})^2}{\sum_{j=1}^{\nu_2} D_{ij}^2}\,\frac{\nu_2}{\nu_1}, \quad 1 \le i \le n, \qquad (2)$$
where $\nu_1$ and $\nu_2$ are two known integers, the $(\delta_{ik})_{1 \le i \le n,\, 1 \le k \le \nu_1}$ are unknown parameters, the $(N_{ik})_{1 \le i \le n,\, 1 \le k \le \nu_1}$ are independent identically distributed centered variables with unit variance, and the $(D_{ij})_{1 \le i \le n,\, 1 \le j \le \nu_2}$ are independent identically distributed variables independent of the $(N_{ik})$. We define the non-centrality parameter $\eta_i$ as $\eta_i = \sum_{k=1}^{\nu_1} \delta_{ik}^2$.
We assume that $\gamma_i = 1$ for all $i$ such that $\eta_i = 0$, and that $\gamma_i \ge 1$ is unknown otherwise. Our goal is to identify the set $J_n = \{i : \eta_i \neq 0\}$. Note that by taking the square of the observations of a Gaussian sample, we obtain a chi-square sample with one degree of freedom, which is a particular case of model (1) with $\nu_1 = 1$, $\gamma_i = 1$ and the $N_{i1}$ independent identically distributed standard Gaussian variables for $i \in \{1, \dots, n\}$. Similarly, a Student sample is a particular case of model (2).
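To fix ideas, models (1) and (2) are easy to simulate. The sketch below (ours, not the authors') draws Gaussian $N_{ik}$ and $D_{ij}$, one admissible choice of centered unit-variance variables; the sample sizes, offsets $\delta_{ik}$ and scale factors $\gamma_i$ are illustrative values we chose.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_samples(n, nu1, nu2, delta, gamma):
    """Simulate the chi-square sample (1) and the F sample (2).

    delta (shape (n, nu1)) holds the offsets delta_ik and gamma
    (shape (n,)) the scale factors gamma_i.
    """
    N = rng.standard_normal((n, nu1))        # the N_ik
    D = rng.standard_normal((n, nu2))        # the D_ij, independent of N
    chi = ((gamma[:, None] * N + delta) ** 2).sum(axis=1)   # model (1)
    F = chi / (D ** 2).sum(axis=1) * nu2 / nu1              # model (2)
    return chi, F

# Illustrative design: 5 non-central observations (eta_i = 9) out of 100.
n, nu1, nu2 = 100, 1, 10
delta = np.zeros((n, nu1)); delta[:5] = 3.0
gamma = np.ones(n)
chi, F = simulate_samples(n, nu1, nu2, delta, gamma)
```

The non-central observations have mean $1 + \eta_i = 10$ here, against 1 for the central ones.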
Our work is motivated by the problem of the detection of differentially expressed genes between conditions in microarray experiments. For this purpose a test statistic is built for each gene. In general it is a $T$ or an $F$ statistic comparing two or more conditions, but it can also be a Gaussian or a $\chi^2$ statistic when the variance of the gene expression is assumed to be known. Under the null hypothesis that the gene is not differentially expressed between the conditions, the non-centrality parameter of the test statistic is null, whereas it is non-null under the alternative hypothesis that the gene is differentially expressed. We observe the sample of all the test statistics corresponding to all the genes. To detect the genes that are differentially expressed, we have to separate the central from the non-central observations. This is the object of the paper.
To address this problem, the simplest procedure would be the Bonferroni correction. This procedure controls the familywise error rate (FWER), which is the probability of accumulating one or more false positives over all the tests. This criterion is very stringent and may affect the power when the number of tests is large. An alternative procedure consists in controlling the false discovery rate (FDR). Benjamini and Hochberg (1995) introduced an FDR-controlling procedure and proved that it controls the FDR for independent test statistics. Adaptive FDR-controlling procedures have been proposed to increase the power while still controlling the FDR (Benjamini and Hochberg, 2000; Benjamini, Krieger and Yekutieli, 2005, "Adaptive linear step-up procedures that control the false discovery rate", research paper 01-03). Genovese and Wasserman (2002, 2004) have proposed a method to minimize the false negative rate (FNR) while controlling the FDR. Mixture models have been studied to separate the central observations from the others (Delmas, 2006, "On mixture models for multiple testing in the differential analysis of gene expression data", submitted paper; Bordes et al., 2006). In the Gaussian case, some procedures based on model selection have been proposed (Birgé and Massart, 2001; Huet, 2006). Some authors have proposed procedures based on the partial sums of the absolute or squared ordered observations (Hoh et al., 2001; Lavielle and Ludeña, 2006, "Random thresholds for linear model selection", submitted paper). Zaykin et al. (2002) and Dudbridge and Koeleman (2003, 2004) have proposed methods based on the partial products of the ordered p-values. Meinshausen and Rice (2004) have proposed a method to estimate the proportion of false null hypotheses among a large number of independently tested hypotheses. This method is based on the distribution of the p-values of the hypothesis tests, which are uniform on [0, 1] under the null hypothesis.
The aim of our paper is to propose a simple procedure that offers good asymptotic
properties.
The paper is organized as follows. In Section 2 we introduce the proposed procedure. In Section 3 we state the consistency of the selected subset of observations for the two kinds of samples (1) and (2). In Section 4 we use simulations to compare our procedure with three other methods: the Benjamini and Hochberg procedure, a procedure based on mixture models, and the Birgé and Massart procedure. Section 5 is devoted to the proofs.
2. The procedure
Assume we observe the sample (1) or (2). We denote by $X_i$ the observation: $X_i = \chi_i$ or $X_i = F_i$. We denote by $k_n$ the number of non-central observations. Our procedure to separate the central from the non-central observations can be stated as follows:

(i) We order the $X_i$'s:
$$X_{\sigma(1)} \ge X_{\sigma(2)} \ge \cdots \ge X_{\sigma(n)}.$$
We define, for $0 \le k < n$,
$$\tau_k = \frac{1}{n-k} \sum_{i=k+1}^{n} X_{\sigma(i)}.$$

(ii) We estimate $k_n$ by
$$\hat{k}_n = \min_{0 \le k < n} \{k : \tau_k \le \tau\},$$
where $\tau = E[X_i]$ under the assumption that the non-centrality parameter $\eta_i$ is equal to 0. If $\tau_k > \tau$ for all $0 \le k < n$, then we set $\hat{k}_n = n$.

(iii) We decide that $X_{\sigma(1)}, \dots, X_{\sigma(\hat{k}_n)}$ are non-central observations.

This procedure is based on the idea that if the two populations of central and non-central observations are well separated, then $\tau_k$ is a good estimator of $\tau$ only for $k = k_n$. For $k < k_n$ the expression of $\tau_k$ includes non-central variables, hence $\tau_k$ tends to overestimate $\tau$. For $k > k_n$ the expression of $\tau_k$ excludes the largest observations of a sample of independent identically distributed variables with mean $\tau$, hence $\tau_k$ tends to underestimate $\tau$.
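Steps (i)–(iii) take only a few lines of code. Here is a sketch in Python; the toy data and the value $\tau = 1$ (the mean of a central $\chi^2(1)$ variable) are our illustrative choices.

```python
import numpy as np

def ddlr(X, tau):
    """Section 2 procedure: order the observations, find the smallest k
    such that the mean tau_k of the n-k smallest observations drops
    below tau = E[X_i | eta_i = 0], and declare the k largest
    observations non-central.

    Returns (k_hat, indices of the selected observations).
    """
    X = np.asarray(X, dtype=float)
    n = len(X)
    order = np.argsort(X)[::-1]      # sigma: X_sigma(1) >= ... >= X_sigma(n)
    Xs = X[order]
    for k in range(n):               # tau_k = mean of X_sigma(k+1), ..., X_sigma(n)
        if Xs[k:].mean() <= tau:
            return k, order[:k]
    return n, order                  # tau_k > tau for every k: take k_hat = n

# Toy sample with three clearly non-central observations.
X = [25.0, 16.0, 9.0, 0.5, 0.2, 0.8, 0.1, 0.4]
k_hat, selected = ddlr(X, tau=1.0)
```

On this toy sample $\tau_0, \tau_1, \tau_2 > 1$ while $\tau_3 = 0.4 \le 1$, so the procedure selects the three largest observations.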
3. Results
We first introduce some notation. We write
$$J_n = \{i : \eta_i \neq 0\}$$
and
$$\hat{J}_n = \{\sigma(1), \sigma(2), \dots, \sigma(\hat{k}_n)\}.$$
Let $V$ and $T$ respectively denote the number of false positives and the number of false negatives, defined as
$$V = \mathrm{Card}(\hat{J}_n \cap \bar{J}_n), \qquad T = \mathrm{Card}(\bar{\hat{J}}_n \cap J_n),$$
where, for any set $J$, $\bar{J}$ denotes the complement of $J$.
Before giving our main theorem, we state two lemmas showing that, under suitable assumptions, the variables satisfying $\eta_i \neq 0$ are "well separated" from the others. We introduce some hypotheses.

H1: Let $\phi_{(1)}$ be the cumulative distribution function of $L_{(1)} = \max_{i=1,\dots,n} \sum_{k=1}^{\nu_1} N_{ik}^2$. We assume that there is a sequence $(a_n, b_n)_{n \ge 0}$ with $b_n > 0$ for all $n \in \mathbb{N}$ such that $\phi_{(1)}(a_n + b_n x) \to F(x)$ as $n \to \infty$, where $F$ is a cumulative distribution function. We also assume that
$$\min_{i \in J_n} \left( \frac{\eta_i}{2\gamma_i^2} \right) \ge \alpha_n,$$
where the sequence $(\alpha_n)_{n \in \mathbb{N}}$ satisfies
$$\frac{\alpha_n/2 - a_n}{b_n} \to +\infty \quad \text{as } n \to \infty.$$

Example: Assume that the $N_{ik}$'s are independent identically distributed centered Gaussian random variables with unit variance. Then $\sum_{k=1}^{\nu_1} N_{ik}^2 \sim \chi^2(\nu_1)$ for $1 \le i \le n$. In that case we can prove that $F$ is the cumulative distribution function of a Gumbel variable, with
$$a_n = 2\log(n) + (\nu_1 - 2)\log(\log(n)) - 2\log\!\left(2^{\nu_1/2}\Gamma(\nu_1/2)\right) + \nu_1 \log(2) \quad \text{and} \quad b_n = 2 \quad \text{for all } n$$
(see Resnick (1987), Section 1.5, for similar calculations). We then have to choose $\alpha_n$ such that $(\alpha_n/2 - a_n)/b_n \to +\infty$, for example $\alpha_n = 2a_n + \epsilon_n$ for all $n$, where $\epsilon_n \to +\infty$.
H2: Let $\psi_{(1)}$ and $\xi_{(1)}$ respectively be the cumulative distribution functions of
$$D_{(1)} = \max_{i=1,\dots,n} \sum_{l=1}^{\nu_2} D_{il}^2 \,\frac{\nu_1}{\nu_2}$$
and
$$L_{(1)} = \max_{i=1,\dots,n} \frac{\sum_{k=1}^{\nu_1} N_{ik}^2}{\sum_{l=1}^{\nu_2} D_{il}^2}\,\frac{\nu_2}{\nu_1}.$$
We assume that there are two sequences $(c_n, d_n)_{n \ge 0}$ and $(e_n, f_n)_{n \ge 0}$ with $d_n > 0$ and $f_n > 0$ for all $n \in \mathbb{N}$ such that $\psi_{(1)}(c_n + d_n x) \to G(x)$ and $\xi_{(1)}(e_n + f_n x) \to H(x)$ as $n \to \infty$, where $G$ and $H$ are cumulative distribution functions. Let $(x_n)_{n \in \mathbb{N}}$ be a sequence satisfying
$$\frac{x_n - c_n}{d_n} \to +\infty \quad \text{as } n \to \infty.$$
We assume that
$$\min_{i \in J_n} \left( \frac{\eta_i}{\gamma_i^2} \right) \ge 4 x_n \beta_n,$$
where the sequence $(\beta_n)_{n \in \mathbb{N}}$ satisfies
$$\frac{\beta_n - e_n}{f_n} \to +\infty \quad \text{as } n \to \infty.$$

Example: Assume that the $D_{il}$'s and the $N_{ik}$'s are independent identically distributed centered Gaussian random variables with unit variance. Then $\sum_{l=1}^{\nu_2} D_{il}^2 \sim \chi^2(\nu_2)$ for $1 \le i \le n$, and in that case we can prove that $G$ is the cumulative distribution function of a Gumbel variable, with
$$c_n = \frac{\nu_1}{\nu_2}\left[ 2\log(n) + (\nu_2 - 2)\log(\log(n)) - 2\log\!\left(2^{\nu_2/2}\Gamma(\nu_2/2)\right) + \nu_2 \log(2) \right] \quad \text{and} \quad d_n = \frac{2\nu_1}{\nu_2} \quad \text{for all } n$$
(see Resnick (1987), Section 1.5, for similar calculations). We choose $x_n$ such that $(x_n - c_n)/d_n \to +\infty$, for example $x_n = c_n + \epsilon_n$ for all $n$, where $\epsilon_n \to +\infty$.
Moreover, in this Gaussian case, $L_{(1)}$ is the maximum of $n$ variables with an $F(\nu_1, \nu_2)$ distribution. We can prove that $H$ is the cumulative distribution function of a Fréchet variable, with
$$e_n = 0 \quad \text{and} \quad f_n = (Kn)^{2/\nu_2} \quad \text{for all } n, \quad \text{where } K = \frac{2\,\Gamma\!\left(\frac{\nu_1+\nu_2}{2}\right)}{\Gamma(\nu_1/2)\,\Gamma(\nu_2/2)}\,\frac{\nu_2^{\nu_2/2-1}}{\nu_1^{\nu_2/2}}$$
(see Resnick (1987), Section 1.5, for similar calculations). We choose $\beta_n$ such that $\beta_n / f_n \to +\infty$, for example $\beta_n = f_n^{1+\epsilon}$ for some $\epsilon > 0$.
Lemma 1. Let $U_1, \dots, U_n$ be $n$ independent random variables such that
$$U_i = \sum_{k=1}^{\nu_1} (\gamma_i N_{ik} + \delta_{ik})^2 \quad \text{for } 1 \le i \le k_n$$
and
$$U_i = \sum_{k=1}^{\nu_1} N_{ik}^2 \quad \text{for } k_n + 1 \le i \le n,$$
where for all $1 \le i \le k_n$, $\eta_i = \sum_{k=1}^{\nu_1} \delta_{ik}^2 \neq 0$ and $\gamma_i \ge 1$ are unknown parameters, and the $N_{ik}$, $1 \le i \le n$, $1 \le k \le \nu_1$, are independent identically distributed centered variables with unit variance. We assume that assumption H1 holds. We define
$$\Omega_n = \left\{ \min_{1 \le i \le k_n} U_i \ge \max_{k_n+1 \le i \le n} U_i \right\}.$$
Then
$$\mathrm{pr}(\Omega_n^c) \to 0 \quad \text{as } n \to \infty.$$
Lemma 2. Let $U_1, \dots, U_n$ be $n$ independent random variables such that
$$U_i = \frac{\sum_{k=1}^{\nu_1} (\gamma_i N_{ik} + \delta_{ik})^2}{\sum_{l=1}^{\nu_2} D_{il}^2}\,\frac{\nu_2}{\nu_1} \quad \text{for } 1 \le i \le k_n$$
and
$$U_i = \frac{\sum_{k=1}^{\nu_1} N_{ik}^2}{\sum_{l=1}^{\nu_2} D_{il}^2}\,\frac{\nu_2}{\nu_1} \quad \text{for } k_n + 1 \le i \le n,$$
where for all $1 \le i \le k_n$, $\eta_i = \sum_{k=1}^{\nu_1} \delta_{ik}^2 \neq 0$ and $\gamma_i \ge 1$ are unknown parameters; the $N_{ik}$, $1 \le i \le n$, $1 \le k \le \nu_1$, are independent identically distributed centered variables with unit variance; and the $D_{il}$, $1 \le i \le n$, $1 \le l \le \nu_2$, are independent identically distributed variables independent of the $N_{ik}$'s. We assume that assumption H2 holds. We define
$$\Omega_n = \left\{ \min_{1 \le i \le k_n} U_i \ge \max_{k_n+1 \le i \le n} U_i \right\}.$$
Then
$$\mathrm{pr}(\Omega_n^c) \to 0 \quad \text{as } n \to \infty.$$
Theorem 1. We consider the procedure described in Section 2. We assume that the cardinality of $J_n$ equals $k_n = \lambda n$ with $0 < \lambda < 1$. Let $W$ denote the cumulative distribution function of the variables $X_i/\tau$ under the assumption that $\eta_i = 0$. We assume that $W(2) < 1$. Then, under assumption H1 for sample (1), or assumption H2 for sample (2),
$$\mathrm{pr}\left[ \frac{V}{n} > u_n \right] \to 0 \quad \text{and} \quad \mathrm{pr}\left[ \frac{T}{n} > u_n \right] \to 0 \quad \text{as } n \to +\infty,$$
for any sequence $(u_n)$ such that $u_n \to 0$ and $\sqrt{n}\, u_n \to +\infty$ as $n$ tends to infinity.
4. Application and simulations
4.1 Application to microarray data
This procedure may be convenient for detecting differentially expressed genes in microarray experiments. DNA microarrays are a technology that enables molecular biologists to measure simultaneously the expression levels of thousands of genes (Brown and Botstein, 1999). Thousands of gene probes made of cDNA or oligonucleotides are
spotted on a small glass slide or a nylon membrane in a regular matrix pattern. A
basic experiment consists in comparing the expression levels in two different types of
conditions. More generally, we can study several conditions with one or more repeti-
tions. The intensity level on each spot on the microarray represents a measure of the
concentration of the corresponding mRNA in the biological sample. The detection of
differentially expressed genes in DNA microarray experiments is an important question
asked by biologists to statisticians. At this stage, we assume that intensity levels of
genes are correctly normalized.
Assume we study microarray data from an experiment involving $n$ genes, $J$ conditions and $R$ repetitions for each gene in each condition. We denote by $A_{ijr}$ the $r$th repetition of the expression level of gene $i$ in condition $j$. We assume that for all $i, j, r$, $A_{ijr} \sim \mathcal{N}(m_{ij}, \sigma_i^2)$, where $m_{ij} \in \mathbb{R}$ and $\sigma_i \in \mathbb{R}^+$ are unknown, and that the $A_{ijr}$'s are independent. We write
$$\bar{A}_{ij.} = \frac{1}{R}\sum_{r=1}^{R} A_{ijr}, \qquad \bar{A}_{i..} = \frac{1}{JR}\sum_{j=1}^{J}\sum_{r=1}^{R} A_{ijr}.$$
If $R$ is large enough, we can estimate $\sigma_i^2$ by
$$\hat{\sigma}_i^2 = \frac{1}{J(R-1)}\sum_{j=1}^{J}\sum_{r=1}^{R} \left(A_{ijr} - \bar{A}_{ij.}\right)^2.$$
In that case, for all $1 \le i \le n$,
$$X_i = \frac{1}{(J-1)\hat{\sigma}_i^2}\sum_{j=1}^{J} R\left(\bar{A}_{ij.} - \bar{A}_{i..}\right)^2 \sim F(\eta_i;\, J-1,\, J(R-1)),$$
where $\eta_i$ is the non-centrality parameter: $\eta_i = 0$ if gene $i$ is not differentially expressed between the $J$ conditions, and $\eta_i > 0$ otherwise. This model is similar to model (2).
When $R$ is small we cannot estimate $\sigma_i^2$ by the same estimator $\hat{\sigma}_i^2$. We therefore assume that $\sigma_i = \sigma$ for all the genes which are not differentially expressed, and that $\sigma$ is known. Write
$$X_i = \frac{1}{\sigma^2}\sum_{j=1}^{J} R\left(\bar{A}_{ij.} - \bar{A}_{i..}\right)^2.$$
If gene $i$ is not differentially expressed between the $J$ conditions, then $X_i \sim \chi^2(J-1)$; otherwise $\frac{\sigma^2}{\sigma_i^2} X_i \sim \chi^2(\eta_i;\, J-1)$, where $\eta_i > 0$ is the non-centrality parameter. This model is similar to model (1) with $\gamma_i = \sigma_i/\sigma$.
In the particular case where we compare only two conditions,
$$X_i = \frac{R\left(\bar{A}_{i1.} - \bar{A}_{i2.}\right)^2}{2\sigma^2}.$$
If gene $i$ is not differentially expressed between the two conditions, then $X_i \sim \chi^2(1)$; otherwise $\frac{\sigma^2}{\sigma_i^2} X_i \sim \chi^2(\eta_i; 1)$, where $\eta_i > 0$ is the non-centrality parameter. This model corresponds to model (1) with $\nu_1 = 1$.
Write
$$J_n = \{1 \le i \le n : \eta_i \neq 0\}.$$
$J_n$ is the set of the genes which are differentially expressed between the $J$ conditions; it contains $k_n$ elements. Our aim is to estimate the number $k_n$ and the set $J_n$ by means of the procedure presented in Section 2.
4.2 Simulations
To validate our procedure, denoted by DDLR, we present several simulation results and some comparisons with three other methods. We want to determine the number $k_n$ of genes that are actually differentially expressed between two conditions, with only one repetition for each gene and each condition. We suppose that all the genes have the same variance $\sigma^2$. We simulated $n$ independent observations $A_i$ as follows: $k_n$ from a normal distribution $\mathcal{N}(m_{i1} - m_{i2},\, 2\sigma^2)$ and $n - k_n$ from a normal distribution $\mathcal{N}(0,\, 2\sigma^2)$. $A_i$ represents the difference of expression of gene $i$ between the two conditions. We write
$$X_i = \frac{A_i^2}{2\sigma^2}.$$
Generally, in applications, the standard deviation $\sigma$ is unknown. We write, for all $i$, $s_i^2 = s^2 = \mathrm{var}(A_i) = 2\sigma^2$. We propose to use the estimator of $s$ presented by Haaland and O'Connell (1995).
Estimation of the variance
This estimator is defined as follows:
$$\hat{s} = 1.5 \times \mathrm{median}\{|A_i| : |A_i| \le 2.5\, s_0\}, \quad \text{where } s_0 = 1.5 \times \mathrm{median}\{|A_i|,\ i = 1, \dots, n\}.$$
Intuitively, this estimator may be consistent when the proportion of variables $A_i$ with non-null mean is small enough. Here are some heuristic ideas about the construction of this estimator.
Write $t_n = \mathrm{median}(|A_i|)$ for the empirical median. We know that $t_n \to m_n$ almost surely as $n \to \infty$, where $m_n$ is the theoretical median. Moreover, if we assume that $A_i \sim \mathcal{N}(0, s^2)$ for all $i \in \{1, \dots, n\}$, then
$$\mathrm{pr}(|A_i| \le m_n) = 2\int_0^{m_n} \frac{1}{s\sqrt{2\pi}}\, e^{-\frac{x^2}{2s^2}}\, dx = \frac{1}{2},$$
so
$$\int_0^{m_n/s} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}\, dx = \frac{1}{4}.$$
From the statistical tables we get $m_n/s \simeq 0.6745$, that is to say $s \simeq 1.4826\, m_n \simeq 1.5 \times \mathrm{median}\{|A_i|,\ i = 1, \dots, n\}$.
However, some of the random variables $A_i$ are not centered. In order to separate the centered $A_i$'s from those with non-zero mean, Haaland and O'Connell (1995) approximated the first set by $\{|A_i|,\ i = 1, \dots, n : |A_i| \le 2.5\, s_0\}$. This explains why they propose to estimate $s$ by
$$\hat{s} = 1.5 \times \mathrm{median}\{|A_i| : |A_i| \le 2.5\, s_0\}.$$
With large probability $s_0 \ge s$, so if $A_i \sim \mathcal{N}(0, s^2)$,
$$\mathrm{pr}(|A_i| > 2.5\, s_0) \le \mathrm{pr}(|A_i| > 2.5\, s) \simeq 2 \times (1 - 0.9938) = 0.0124.$$
That is to say, if $A_i$ is centered, the probability of discarding it from the estimation of the standard deviation is about 1%. In this application, we have supposed that the random variables $A_i$ are Gaussian (Kerr et al., 2000). The estimation of the variance can also be generalized to the case where the $A_i$'s have a known symmetric density.
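The estimator above is straightforward to code. In the following sketch (the sample sizes and means are illustrative values of ours), the estimate stays close to the true $s$ even though 10% of the observations are non-centered:

```python
import numpy as np

def haaland_oconnell(A):
    """Robust scale estimate of Haaland & O'Connell (1995):
    s0 = 1.5 * median|A_i| over all i, then recompute the median
    keeping only the |A_i| <= 2.5 * s0, which discards most of the
    non-centered observations."""
    absA = np.abs(np.asarray(A, dtype=float))
    s0 = 1.5 * np.median(absA)
    kept = absA[absA <= 2.5 * s0]
    return 1.5 * np.median(kept)

rng = np.random.default_rng(1)
A = np.concatenate([rng.normal(0.0, 2.0, 4500),    # centered genes, s = 2
                    rng.normal(10.0, 2.0, 500)])   # differentially expressed
s_hat = haaland_oconnell(A)                        # close to 2 despite the 10% shifted genes
```

The factor 1.5 follows the paper's rounding of 1.4826.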
Simulations
We have considered different values for $n$, $k_n$ and $\mu$, where $(m_{i1} - m_{i2})/s = \mu$ for all $i \in J_n$. For each value of $n$ ($n = 5000$ and $n = 10000$), we have considered $k_n = 0.1\, n$ and $\mu \in \{3, 5, 8\}$. We recall that $X_i = A_i^2 / s^2 \sim \chi^2(\mu; 1)$ for all $i \in J_n$, and $X_i \sim \chi^2(1)$ otherwise. In the expression of $X_i$, we replace $s$ by $\hat{s}$.
We use our procedure presented in Section 2 with the observations $\{X_i\}_{i=1,\dots,n}$ and $\tau = 1$. Hypothesis H1 is satisfied with $a_n = 2\log(n) - \log(\log(n)) - 2\log(\sqrt{2\pi}) + \log(2)$ and $b_n = 2$ for all $n$, and, for example, $\alpha_n = 2a_n + \log(\log(n))$ for all $n$. Theorem 1 is proved under the assumption that $\mu > \sqrt{2\alpha_n}$.
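These constants are elementary to evaluate; the arithmetic below is ours, following our reading of the formulas above. At $n = 5000$ it gives $\sqrt{2\alpha_n} \approx 7.7$, so among the simulated levels $\mu \in \{3, 5, 8\}$ only $\mu = 8$ meets the assumption of Theorem 1.

```python
import math

def h1_constants(n):
    """Constants of hypothesis H1 for the simulation design (nu_1 = 1),
    with the choice alpha_n = 2*a_n + log(log(n)) made in the text."""
    a_n = 2 * math.log(n) - math.log(math.log(n)) \
          - 2 * math.log(math.sqrt(2 * math.pi)) + math.log(2)
    alpha_n = 2 * a_n + math.log(math.log(n))
    return a_n, alpha_n

a_n, alpha_n = h1_constants(5000)
mu_min = math.sqrt(2 * alpha_n)   # Theorem 1 requires mu > sqrt(2 * alpha_n)
```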
We now briefly present the three methods with which we compare the performance of the DDLR procedure. All these methods assume that the standard deviation $s$ is known, so we estimated it with the threshold estimator $\hat{s}$ presented in Section 4.2.
BM method:
First, Birgé and Massart (2001) provided a general approach based on model selection via penalization for Gaussian random vectors with known variance. The penalty function presented by Birgé and Massart is
$$\mathrm{pen}(k) = \lambda \sigma^2 k \left(1 + \sqrt{2 L_k}\right)^2 \quad \text{for all } k \in \{1, \dots, n\},$$
where $\lambda > 1$ and $(L_k)_{k \ge 1}$ is a sequence of positive real numbers. In our practical setting, the procedure proposed by Birgé and Massart leads to defining
$$\hat{k}_n = \mathop{\mathrm{argmin}}_{k=1,\dots,n} \left[ -\sum_{i=1}^{k} X_{\sigma(i)} + \mathrm{pen}(k) \right],$$
where $X_{\sigma(1)} \ge \dots \ge X_{\sigma(n)}$. We have chosen $\mathrm{pen}(k) = Mk$, where $M$ is a constant which has been calibrated at $M = 8$ to obtain good results when $\mu = 5$.
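With the simplified penalty $\mathrm{pen}(k) = Mk$, the selection rule reduces to minimizing a penalized cumulative sum; here is a sketch (the toy data are ours):

```python
import numpy as np

def bm_select(X, M=8.0):
    """BM-style rule with the simplified penalty pen(k) = M*k:
    k_hat = argmin over k of [ -(sum of the k largest X_i) + M*k ].
    Equivalently, it keeps every ordered observation larger than M."""
    Xs = np.sort(np.asarray(X, dtype=float))[::-1]          # decreasing order
    crit = -np.cumsum(Xs) + M * np.arange(1, len(Xs) + 1)   # criterion at k = 1..n
    return int(np.argmin(crit)) + 1

X = [25.0, 16.0, 9.0, 0.5, 0.2, 0.8, 0.1, 0.4]
k_hat = bm_select(X, M=8.0)   # only the three observations above M = 8 lower the criterion
```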
BH method:
The second method is a testing procedure presented by Benjamini and Hochberg (1995). It controls the expected proportion of errors among the rejected hypotheses, called the false discovery rate (FDR). In our application, since we want to select the differentially expressed genes, the problem involves as many tests as there are genes on the microarray. The situation is summarized in Table 1.
The false discovery number is connected with the proportion $V/\hat{k}_n$ of rejected null hypotheses which are erroneously rejected. The FDR is then defined as
$$\mathrm{FDR} = E\left[\frac{V}{\max(1, \hat{k}_n)}\right],$$
and we define the false negative rate as
$$\mathrm{FNR} = E\left[\frac{T}{\max(1, n - \hat{k}_n)}\right].$$
When the number of tests $n$ is large, false discoveries accumulate over the tests; as a result, in microarray experiments the error in the estimation of the genes that are actually differentially expressed can be very large. This explains why Benjamini and Hochberg (1995) presented a method which controls the FDR.
The BH adaptive procedure (adapted from the classical procedure presented by Benjamini and Hochberg, 1995) is as follows; assume that $n - k_n$ of the $n$ null hypotheses are true:
- let $p_1, p_2, \dots, p_n$ be the $n$ p-values corresponding to the $n$ tests, sorted in increasing order: $p_{(1)} \le p_{(2)} \le \dots \le p_{(n)}$;
- let $H_0(1), H_0(2), \dots, H_0(n)$ be the corresponding null hypotheses;
- let $\pi_0 \in \mathbb{R}^{*}_{+}$ and let $\hat{k}_n$ be the largest integer $k \in \{1, \dots, n\}$ such that $p_{(k)} \le \frac{k}{n \pi_0}\, \alpha$, where $\alpha \in\, ]0, 1[$;
- reject $H_0(i)$ for $i = 1, \dots, \hat{k}_n$.
Benjamini and Hochberg proved that for this procedure $\mathrm{FDR} \le \frac{n - k_n}{n \pi_0}\, \alpha$.
In the application to microarray data, we fixed the coefficient $\alpha = 0.05$ and calibrated $\pi_0$ by simulations at $\pi_0 = 2.9$, so that the estimate of $k_n$ was as close as possible to $k_n$ in the case where $n = 5000$, $\mu = 5$ and $k_n = 500$. As a consequence, this procedure controls the FDR at the level $\frac{n - k_n}{n \pi_0}\, \alpha \approx 0.016$.
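The step-up rule, as we have written it above, can be sketched as follows (the p-values are illustrative; with $\pi_0 = 1$ the rule reduces to the classical Benjamini and Hochberg procedure):

```python
import numpy as np

def bh_adaptive(pvals, alpha=0.05, pi0=1.0):
    """Step-up rule: sort the p-values in increasing order and reject
    the k_hat smallest, where k_hat is the largest k such that
    p_(k) <= k * alpha / (n * pi0)."""
    p = np.sort(np.asarray(pvals, dtype=float))
    n = len(p)
    thresh = np.arange(1, n + 1) * alpha / (n * pi0)
    below = np.nonzero(p <= thresh)[0]          # indices where the condition holds
    return 0 if below.size == 0 else int(below[-1]) + 1

pvals = [0.0001, 0.0003, 0.002, 0.3, 0.5, 0.7, 0.9, 0.95]
k_hat = bh_adaptive(pvals, alpha=0.05, pi0=1.0)   # rejects the 3 smallest p-values
```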
MIXT method:
The third method is a classical method based on a mixture of two normal distributions, $p\,\mathcal{N}(\mu, s^2) + (1-p)\,\mathcal{N}(0, s^2)$, where $p$ and $\mu$ are unknown parameters and $s$ is known. Write $\alpha_i$ for the conditional probability that gene $i$ is differentially expressed, given the observation $A_i$ and the parameters $\mu$ and $p$. $A_i$ is Gaussian with variance $s^2$ and null mean if gene $i$ is not differentially expressed. The estimates of $p$, $\mu$ and $\alpha_i$ can be obtained by the EM algorithm (Titterington et al., 1985). Writing $\mathrm{iter}$ for the number of iterations of the EM algorithm, we estimate $k_n$ by
$$\hat{k}_n = \sum_{i=1}^{n} \mathbf{1}_{\left\{\alpha_i^{[\mathrm{iter}]} > 0.5\right\}}.$$
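The EM updates for this two-component mixture with known $s$ are standard (Titterington et al., 1985); the initialization and iteration count in the sketch below are our own choices, not taken from the paper, and the simulated data are illustrative.

```python
import numpy as np

def em_mixture(A, s, n_iter=200):
    """EM for the mixture p*N(mu, s^2) + (1-p)*N(0, s^2) with s known.
    Returns p, mu and the posterior probabilities alpha_i that each
    gene is differentially expressed."""
    A = np.asarray(A, dtype=float)
    p, mu = 0.5, A.max() / 2.0                     # crude starting values (ours)
    for _ in range(n_iter):
        # E step: posterior probability of the non-null component.
        f1 = np.exp(-(A - mu) ** 2 / (2 * s * s))
        f0 = np.exp(-A ** 2 / (2 * s * s))
        alpha = p * f1 / (p * f1 + (1 - p) * f0)
        # M step: update the mixture weight and the non-null mean.
        p = alpha.mean()
        mu = (alpha * A).sum() / alpha.sum()
    return p, mu, alpha

rng = np.random.default_rng(2)
A = np.concatenate([rng.normal(5.0, 1.0, 100),    # differentially expressed genes
                    rng.normal(0.0, 1.0, 900)])   # centered genes
p, mu, alpha = em_mixture(A, s=1.0)
k_hat = int((alpha > 0.5).sum())                  # estimate of k_n
```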
4.3 Results and discussion
Tables 2 and 3 present the simulation results for $n = 5000$ and $n = 10000$ respectively, for the four methods. The following notations are used:
1. $\hat{k}_n$ denotes the estimate of the number of differentially expressed genes;
2. $\widehat{\mathrm{FDR}}$ denotes the estimate (in percentage) of the false discovery rate;
3. $\widehat{\mathrm{FNR}}$ denotes the estimate (in percentage) of the false negative rate;
4. the rate RDR is defined as $\mathrm{RDR} = S/k_n$, in percentage.
The quantities 1, 2, 3 and 4 are computed on the basis of 1000 simulations.
For example, in Table 2, we simulated a microarray experiment in which $n = 5000$ genes were tested, with only $k_n = 500$ genes simulated as differentially expressed. We simulated three levels of difference of expression for the differentially expressed genes: $\mu = 3$, $5$ or $8$. In the case $\mu = 8$, the BM method estimates $k_n$ by $\hat{k}_n = 519.6$ on average. Among these $\hat{k}_n = 519.6$ genes, $\widehat{\mathrm{FDR}} = 3.8\%$ were not simulated as differentially expressed. Among the $n - \hat{k}_n = 4480.4$ genes found non-differentially expressed, $\widehat{\mathrm{FNR}} = 0\%$ were simulated as differentially expressed. With this method, $\mathrm{RDR} = 100\%$ of the genes which were simulated as differentially expressed are found.
The results given in Tables 2 and 3 suggest the following remarks:
• When we compare Tables 2 and 3, the proportions RDR, $\widehat{\mathrm{FNR}}$ and $\widehat{\mathrm{FDR}}$ are essentially the same. This suggests that the four methods may depend only on the proportion $k_n/n$, which is an important parameter in the estimator of the variance $s^2$. Therefore we only discuss the results presented in Table 3.
• When $\mu = 8$, all the methods give a good estimate of $k_n$, and all the genes which were simulated as differentially expressed are found: $\mathrm{RDR} = 100\%$ for almost all the methods. The BM method tends to overestimate $k_n$: the criterion does not penalize the high dimensions enough. The BH method also tends to overestimate $k_n$. We could change the constant $\pi_0$ in order to obtain a better estimate; however, the method would then tend to underestimate $k_n$ in the case $\mu = 5$ (see the description of the BH method). So this does not seem to be a suitable way to improve the BH method. The MIXT and DDLR methods give good results for the four criteria $\hat{k}_n$, $\widehat{\mathrm{FDR}}$, $\widehat{\mathrm{FNR}}$ and RDR.
• Consider the case $\mu = 5$. All the methods find more than 96% of the genes simulated as differentially expressed. Among the genes found differentially expressed, $\widehat{\mathrm{FDR}} = 3.8\%$ were not simulated as differentially expressed for the BM method, against 1.4% for the DDLR and BH methods and 1.5% for the MIXT method. The choice of the method depends on the objective of the user. If one prefers to find more differentially expressed genes in spite of a high $\widehat{\mathrm{FDR}}$ ($\widehat{\mathrm{FDR}} \ge 3.8\%$) but a better RDR ($\mathrm{RDR} > 98\%$), one can choose the BM method. If one prefers to control the error level $\widehat{\mathrm{FDR}}$, one may choose among the BH, DDLR and MIXT methods. It is worth noticing that the BM method does not return a good RDR in the case $\mu = 3$. In most applications to microarray data, $\mu$ and $\sigma$ are not known, and the ratio $\mu$ is often close to 3 or smaller.
• In the case $\mu = 5$, the MIXT method seems to be the best method in terms of $\widehat{\mathrm{FDR}}$ (1.5%) and RDR (98%). However, in the case $\mu = 3$, the $\widehat{\mathrm{FDR}}$ of the MIXT method is higher (10%), essentially because the mean $\mu$ of the differentially expressed genes is very weak, so the method cannot easily separate the genes which are not differentially expressed from the others. On the other hand, in the case $\mu = 3$, the DDLR method finds 57% of the genes which were simulated as differentially expressed, and its $\widehat{\mathrm{FDR}}$ remains quite low: 7%.
In real microarray data analysis, the mean of the differences of gene expression levels between two conditions is generally small ($\mu \le 3$), so the DDLR method seems well adapted to transcriptomic data.
As we said before, the BM method is difficult to use, because some constants must be calibrated in the penalty function. Since the ratio $\mu$ is generally close to 3 in microarray data analysis, we should calibrate the constant in the penalty function of the BM method by taking this into account. Unfortunately, in this case we could not find such a constant, because the method is unstable and does not give equivalent estimates from one simulation to another. So we decided to keep the constant $M = 8$ in the BM penalty function, calibrated in the case $n = 5000$, $k_n = 500$ and $\mu = 5$.
To conclude, for any level of $\mu$, our method finds a high proportion of the genes simulated as differentially expressed ($\mathrm{RDR} \ge 57\%$) while the $\widehat{\mathrm{FDR}}$ remains quite low ($\le 7\%$). The other methods tend to favour only one of the criteria $\widehat{\mathrm{FDR}}$, $\widehat{\mathrm{FNR}}$ or RDR. Moreover, the DDLR method is very easy to implement, and no constant needs to be calibrated.
In this approach, we assumed that the variance was known. It would be interesting to develop our method in the case where the variance is unknown, as proposed by Huet (2006), who generalized the BM method.
5. Proofs
5.1 Proof of Lemma 1
The variables $(U_i)_{k_n+1 \le i \le n}$ are denoted by $(V_i)_{1 \le i \le n-k_n}$. We write $U_{(1)} \ge U_{(2)} \ge \dots \ge U_{(k_n)}$ and $V_{(1)} \ge V_{(2)} \ge \dots \ge V_{(n-k_n)}$. The complementary set of $\Omega_n$ is the event $\{V_{(1)} > U_{(k_n)}\}$. Since $U_i = \sum_{k=1}^{\nu_1} (\gamma_i N_{ik} + \delta_{ik})^2$, we use the inequality $2ab \ge -a^2/2 - 2b^2$, which holds for all $a, b \in \mathbb{R}$, to obtain that for all $1 \le i \le k_n$,
$$\frac{U_i}{\gamma_i^2} \ge -\sum_{k=1}^{\nu_1} N_{ik}^2 + \frac{\eta_i}{2\gamma_i^2}.$$
Recalling that $L_{(1)} = \max_{1 \le i \le n} \sum_{k=1}^{\nu_1} N_{ik}^2$, this implies that
$$\frac{U_{(k_n)}}{\gamma_{(k_n)}^2} \ge \min_{1 \le i \le k_n} \frac{\eta_i}{2\gamma_i^2} - L_{(1)}.$$
Then, since $\gamma_i \ge 1$ for $1 \le i \le k_n$,
$$\mathrm{pr}(\Omega_n^c) = \mathrm{pr}(V_{(1)} > U_{(k_n)}) \le \mathrm{pr}\left( \frac{V_{(1)}}{\gamma_{(k_n)}^2} + L_{(1)} > \min_{1 \le i \le k_n} \frac{\eta_i}{2\gamma_i^2} \right) \le \mathrm{pr}\left( V_{(1)} + L_{(1)} > \min_{1 \le i \le k_n} \frac{\eta_i}{2\gamma_i^2} \right).$$
Since $V_{(1)} \le L_{(1)}$, we get $\mathrm{pr}(\Omega_n^c) \le \mathrm{pr}(2 L_{(1)} > \alpha_n)$. Since $(L_{(1)} - a_n)/b_n \xrightarrow{\mathcal{L}} F$ and $(\alpha_n/2 - a_n)/b_n \to +\infty$ under H1, we obtain that $\mathrm{pr}(\Omega_n^c) \to 0$ as $n \to \infty$. This concludes the proof of Lemma 1.
5.2 Proof of Lemma 2
As in the proof of Lemma 1, we obtain
$$\mathrm{pr}(\Omega_n^c) \le \mathrm{pr}\left( 2 L_{(1)} > \min_{1 \le i \le k_n} \left( \frac{\eta_i}{2\gamma_i^2} \right) \frac{1}{D_{(1)}} \right).$$
By H2,
$$\mathrm{pr}\left( D_{(1)} \le x_n \right) \to 1 \quad \text{as } n \to \infty.$$
This implies that
$$\mathrm{pr}(\Omega_n^c) \le \mathrm{pr}\left( L_{(1)} \ge \min_{1 \le i \le k_n} \left( \frac{\eta_i}{4 x_n \gamma_i^2} \right) \right) + \mathrm{pr}\left( D_{(1)} > x_n \right),$$
which tends to 0 as $n \to \infty$ by assumption H2.
5.3 Proof of Theorem 1
Let us first prove that the function $k \mapsto \tau_k$ is non-increasing. For $1 \le k \le n-1$,
$$\tau_{k-1} = \frac{1}{n-k+1} \sum_{i=k}^{n} X_{(i)} = \frac{X_{(k)}}{n-k+1} + \frac{1}{n-k+1} \sum_{i=k+1}^{n} X_{(i)} = \frac{X_{(k)}}{n-k+1} + \frac{n-k}{n-k+1}\, \tau_k.$$
Since $X_{(k)} \ge X_{(i)}$ for $i \ge k$, we get $X_{(k)} \ge \frac{1}{n-k} \sum_{i=k+1}^{n} X_{(i)} = \tau_k$. This implies that
$$\tau_{k-1} \ge \frac{1}{n-k+1}\, \tau_k + \frac{n-k}{n-k+1}\, \tau_k = \tau_k.$$
Hence the function $k \mapsto \tau_k$ is non-increasing.
Let us now prove that
$$\mathrm{pr}\left( \frac{T}{n} > u_n \right) \to 0 \quad \text{as } n \to +\infty.$$
We have
$$\mathrm{pr}\left( \frac{T}{n} > u_n \right) \le \mathrm{pr}\left( \left\{ \frac{\hat{k}_n}{n} - \frac{k_n}{n} < -u_n \right\} \cap \Omega_n \right) + \mathrm{pr}(\Omega_n^c).$$
Moreover, since $k \mapsto \tau_k$ is non-increasing,
$$\mathrm{pr}\left( \{\hat{k}_n < k_n - n u_n\} \cap \Omega_n \right) \le \mathrm{pr}\left( \{\tau_{k_n - n u_n} \le \tau\} \cap \Omega_n \right)$$
$$\le \mathrm{pr}\left( \left\{ \frac{1}{n - k_n + n u_n} \sum_{i=k_n - n u_n + 1}^{k_n} \frac{X_{(i)}}{\tau} + \frac{1}{n - k_n + n u_n} \sum_{i=k_n + 1}^{n} \frac{X_{(i)}}{\tau} \le 1 \right\} \cap \Omega_n \right).$$
We define
$$P_1 = \mathrm{pr}\left( \left\{ \frac{1}{n - k_n + n u_n} \sum_{i=k_n - n u_n + 1}^{k_n} \frac{X_{(i)}}{\tau} + \frac{1}{n - k_n + n u_n} \sum_{i=k_n + 1}^{n} \frac{X_{(i)}}{\tau} \le 1 \right\} \cap \Omega_n \right).$$
Then
$$P_1 \le \mathrm{pr}\left( \left\{ \frac{1}{n - k_n + n u_n} \sum_{i=k_n - n u_n + 1}^{k_n} \frac{X_{(i)}}{\tau} \le \frac{2 n u_n}{n - k_n + n u_n} \right\} \cap \Omega_n \right) + \mathrm{pr}\left( \frac{1}{n - k_n + n u_n} \sum_{i=1}^{n - k_n} Z_i \le 1 - \frac{2 n u_n}{n - k_n + n u_n} \right),$$
where $(Z_i)_{i=1,\dots,n-k_n}$ is a sample of independent identically distributed random variables with distribution $W$, which is the distribution of $X_i/\tau$ under the assumption $\eta_i = 0$. Note that $E(Z_i) = 1$; we denote by $v$ the standard deviation of the $Z_i$'s. We have
$$\mathrm{pr}\left( \frac{1}{n - k_n + n u_n} \sum_{i=1}^{n-k_n} Z_i \le 1 - \frac{2 n u_n}{n - k_n + n u_n} \right) = \mathrm{pr}\left( \frac{\sum_{i=1}^{n-k_n} Z_i - (n - k_n)}{v\sqrt{n - k_n}} \le \frac{-n u_n}{v\sqrt{n - k_n}} \right).$$
The central limit theorem implies that
$$\frac{\sum_{i=1}^{n-k_n} Z_i - (n - k_n)}{v\sqrt{n - k_n}} \xrightarrow{\mathcal{L}} \mathcal{N}(0, 1),$$
and since $\sqrt{n}\, u_n \to +\infty$, we obtain
$$\frac{-n u_n}{v\sqrt{n - k_n}} \to -\infty \quad \text{as } n \to \infty.$$
This implies that
$$\mathrm{pr}\left( \frac{1}{n - k_n + n u_n} \sum_{i=1}^{n-k_n} Z_i \le 1 - \frac{2 n u_n}{n - k_n + n u_n} \right) \to 0 \quad \text{as } n \to \infty.$$
Let us now control the other term appearing in the upper bound for $P_1$. We have
$$\mathrm{pr}\left( \left\{ \frac{1}{n - k_n + n u_n} \sum_{i=k_n - n u_n + 1}^{k_n} \frac{X_{(i)}}{\tau} \le \frac{2 n u_n}{n - k_n + n u_n} \right\} \cap \Omega_n \right) = \mathrm{pr}\left( \left\{ \frac{1}{n u_n} \sum_{i=k_n - n u_n + 1}^{k_n} \frac{X_{(i)}}{\tau} \le 2 \right\} \cap \Omega_n \right).$$
On the event $\Omega_n$,
$$\frac{1}{n u_n} \sum_{i=k_n - n u_n + 1}^{k_n} \frac{X_{(i)}}{\tau} \ge Z_{(1)},$$
where $(Z_1, \dots, Z_{n-k_n})$ are independent identically distributed with distribution $W$, and $Z_{(1)} = \max\{Z_i,\ 1 \le i \le n - k_n\}$. Hence
$$\mathrm{pr}\left( \left\{ \frac{1}{n u_n} \sum_{i=k_n - n u_n + 1}^{k_n} \frac{X_{(i)}}{\tau} \le 2 \right\} \cap \Omega_n \right) \le \mathrm{pr}\left( Z_{(1)} \le 2 \right) = (W(2))^{n - k_n}.$$
This tends to 0 as $n$ tends to infinity, since we assumed that $W(2) < 1$. We have proved that $P_1 \to 0$ as $n \to \infty$. Since $\mathrm{pr}(\Omega_n^c) \to 0$ by Lemma 1 for sample (1) and by Lemma 2 for sample (2), we obtain that
$$\mathrm{pr}\left( \frac{T}{n} > u_n \right) \to 0 \quad \text{as } n \to +\infty.$$
It remains to show that
$$\mathrm{pr}\left( \frac{V}{n} > u_n \right) \to 0 \quad \text{as } n \to +\infty.$$
We have
$$\mathrm{pr}\left( \frac{V}{n} > u_n \right) \le \mathrm{pr}\left( \left\{ \frac{\hat{k}_n}{n} - \frac{k_n}{n} > u_n \right\} \cap \Omega_n \right) + \mathrm{pr}(\Omega_n^c).$$
Hence we shall prove that
$$\mathrm{pr}\left( \left\{ \frac{\hat{k}_n}{n} - \frac{k_n}{n} > u_n \right\} \cap \Omega_n \right) \to 0 \quad \text{as } n \to \infty.$$
We have
$$\mathrm{pr}\left( \{\hat{k}_n - k_n > n u_n\} \cap \Omega_n \right) = \mathrm{pr}\left( \{\tau_{k_n + n u_n} > \tau\} \cap \Omega_n \right) = \mathrm{pr}\left( \left\{ \frac{1}{n - (k_n + n u_n)} \sum_{i=k_n + n u_n + 1}^{n} X_{(i)} > \tau \right\} \cap \Omega_n \right).$$
Let $(Z_i)_{i=1,\dots,n-k_n}$ be a sample of independent random variables with common cumulative distribution function $W$, and write $Z_{(1)} \ge Z_{(2)} \ge \dots \ge Z_{(n-k_n)}$. Then
$$\mathrm{pr}\left( \{\hat{k}_n - k_n > n u_n\} \cap \Omega_n \right) = \mathrm{pr}\left( \left\{ \frac{1}{n - (k_n + n u_n)} \sum_{i=n u_n + 1}^{n - k_n} Z_{(i)} > 1 \right\} \cap \Omega_n \right) \le \mathrm{pr}\left( \sum_{i=1}^{n-k_n} Z_{(i)} - \sum_{i=1}^{n u_n} Z_{(i)} > n - (k_n + n u_n) \right)$$
$$\le \mathrm{pr}\left( \sum_{i=1}^{n-k_n} Z_{(i)} > n - k_n + \sqrt{n - k_n}\,\sqrt{n}\, u_n \right) + \mathrm{pr}\left( -\sum_{i=1}^{n u_n} Z_{(i)} > -n u_n - \sqrt{n - k_n}\,\sqrt{n}\, u_n \right).$$
We set
$$P_2 = \mathrm{pr}\left( \sum_{i=1}^{n-k_n} Z_{(i)} > n - k_n + \sqrt{n - k_n}\,\sqrt{n}\, u_n \right), \qquad P_3 = \mathrm{pr}\left( -\sum_{i=1}^{n u_n} Z_{(i)} > -n u_n - \sqrt{n - k_n}\,\sqrt{n}\, u_n \right).$$
Since $\sum_{i=1}^{n-k_n} Z_{(i)} = \sum_{i=1}^{n-k_n} Z_i$,
$$P_2 = \mathrm{pr}\left( \frac{\sum_{i=1}^{n-k_n} Z_i - (n - k_n)}{v\sqrt{n - k_n}} > \frac{\sqrt{n}\, u_n}{v} \right).$$
The central limit theorem implies that
$$\frac{\sum_{i=1}^{n-k_n} Z_i - (n - k_n)}{v\sqrt{n - k_n}} \xrightarrow{\mathcal{L}} \mathcal{N}(0, 1),$$
and since $\sqrt{n}\, u_n \to +\infty$, we obtain $P_2 \to 0$ as $n \to \infty$.
Moreover,
$$P_3 = \mathrm{pr}\left( \frac{1}{n u_n} \sum_{i=1}^{n u_n} Z_{(i)} < 1 + \frac{\sqrt{n - k_n}}{\sqrt{n}} \right),$$
which implies that
$$P_3 \le \mathrm{pr}\left( \frac{1}{n u_n} \sum_{i=1}^{n u_n} Z_{(i)} \le 2 \right) \le \mathrm{pr}\left( Z_{(n u_n)} \le 2 \right) \le \mathrm{pr}\left( Z_{(n-k_n)} \le \dots \le Z_{(n u_n)} \le 2 \right) \le \mathrm{pr}\left( \sum_{i=1}^{n-k_n} \left( \mathbf{1}_{Z_i \le 2} - p \right) \ge n - k_n - n u_n - (n - k_n)\, p \right),$$
where $p = \mathrm{pr}(Z_i \le 2) \in\, ]0, 1[$.
We define $\zeta_i = \mathbf{1}_{Z_i \le 2} - p$; then $E(\zeta_i) = 0$ and $|\zeta_i| \le 1$. We get
$$P_3 \le \mathrm{pr}\left( \sum_{i=1}^{n-k_n} \zeta_i \ge (n - k_n)(1 - p) - n u_n \right) \le \mathrm{pr}\left( \frac{1}{n - k_n} \sum_{i=1}^{n-k_n} \zeta_i \ge 1 - p - \frac{n u_n}{n - k_n} \right).$$
We now use Hoeffding's inequality, which we first recall.

Lemma 3. Let $(W_i)_{i=1,\dots,n}$ be independent identically distributed random variables such that $a \le W_i \le b$ and $E(W_i) = 0$ for all $i$. Then, for $x > 0$,
$$\mathrm{pr}\left( \frac{1}{n} \sum_{i=1}^{n} W_i \ge x \right) \le \exp\left( -\frac{2 n x^2}{(b - a)^2} \right).$$

We apply Hoeffding's inequality to the sample $(\zeta_i)_{i=1,\dots,n-k_n}$ with
$$x = x_n = 1 - p - \frac{n u_n}{n - k_n}.$$
Since $u_n \to 0$ and $k_n = \lambda n$, there exists $n_0 \in \mathbb{N}$ such that for $n \ge n_0$,
$$x_n \ge \frac{1 - p}{2} > 0.$$
For $n \ge n_0$,
$$P_3 \le \exp\left( -\frac{n - k_n}{2} \left( \frac{1 - p}{2} \right)^2 \right).$$
Then $P_3 \to 0$ as $n \to \infty$.
This concludes the proof of Theorem 1.
References
Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and
powerful approach to multiple testing. J. Royal Statist. Soc. Ser. B, 57, 289–300.
Benjamini, Y. & Hochberg, Y. (2000). The adaptive control of the false discovery rate in
multiple comparison problems. The Journal of Educational and Behavioral Statistics, 25, 1,
60–83.
Birgé, L. & Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc., 3, 203–268.
Bordes, L., Delmas, C. & Vandekerkhove, P. (2006). Semiparametric Estimation of a
two-component mixture model when a component is known. Scandinavian Journal of Statis-
tics, to appear.
Brown, P. O. & Botstein, D. (1999). Exploring the new world of the genome with DNA microarrays. Nat. Genet. Suppl., 21, 33–37.
Dudbridge, F. & Koeleman, B. P. (2003). Rank truncated product of P-values, with ap-
plication to genomewide association scans. Genet. Epidemiol., 25, 360–366.
Dudbridge, F. & Koeleman, B. P. (2004). Efficient computation of significance levels for
multiple associations in large studies of correlated data, including genomewide association
studies. Am. J. Hum. Genet., 75, 424–435.
Genovese, C. R. & Wasserman, L. (2002). Operating characteristics and extensions of the
FDR procedure. J. Royal Statist. Soc. Ser. B, 64, 499–518.
Genovese, C. R. & Wasserman, L. (2004). A stochastic process approach to false discovery
control. Annals of Statistics, 32, 1035–1061.
Haaland, P. D. & O'Connell, M. A. (1995). Inference for effect-saturated fractional factorials. Technometrics, 37, 1.
Hoh, J., Wille, A. & Ott, J. (2001). Trimming, weighting, and grouping SNPs in human
case-control association studies. Genome Res., 11, 2115–2119.
Huet, S. (2006). Model selection for estimating the non zero components of a Gaussian vec-
tor. ESAIM: Probability and Statistics, to appear.
Kerr, M. K., Martin, M. & Churchill, G. A. (2000). Analysis of variance for gene expression microarray data. J. Comput. Biol., 7, 819–837.
Meinshausen, N. & Rice, J. (2004). Estimating the proportion of false null hypotheses
among a large number of independently tested hypotheses. Annals of Statistics, to appear.
Resnick, S. I. (1987). Extreme Values, Regular Variation, and Point Processes. New York: Springer-Verlag.
Titterington, D. M., Smith, A. F. M. & Makov, U. E. (1985). Statistical Analysis of Finite Mixture Distributions. Chichester, UK: John Wiley and Sons.
Zaykin, D. V., Zhivotovsky, L. A., Westfall, P. H. & Weir, B. S. (2002). Truncated
product method for combining P-values. Genet. Epidemiol., 22, 170–185.
                                       decision
                             H0 accepted    H0 rejected    total
true null hypotheses              U              V         n − kn
non-true null hypotheses          T              S         kn
total                          n − k̂n           k̂n          n
Table 1: Number of errors when testing n hypotheses.
method   µ    k̂n    F̂DR    F̂NR    RDR
3 261.9 4.7 5.3 49.9
BM 5 511.1 3.8 0.2 98.3
8 519.6 3.8 0.0 100.0
3 110.7 0.7 8.0 22
BH 5 490.4 1.5 0.4 96.6
8 507.4 1.5 0.0 100.0
3 408.3 9.9 2.9 73.5
MIXT 5 497.3 1.5 0.2 98.0
8 500 0.1 0.0 99.9
3 306.2 7.2 4.6 56.8
DDLR 5 487.0 1.5 0.5 95.9
8 500.3 0.6 0.1 99.5
Table 2: Comparison between BH, BM, MIXT and DDLR methods for different values
of µ, in the case n = 5000 genes and kn = 500 genes simulated differentially expressed.
method   µ    k̂n    F̂DR    F̂NR    RDR
3 521.5 4.6 5.3 49.8
BM 5 1023.0 3.8 0.2 98.4
8 1039.7 3.8 0.0 100.0
3 220.2 0.6 8.0 21.9
BH 5 980.2 1.4 0.4 96.6
8 1015.0 1.5 0.0 100.0
3 820.2 10.0 2.9 73.8
MIXT 5 994.9 1.5 0.2 98.0
8 999.9 0.0 0.0 100.0
3 613.2 7.0 4.6 57.0
DDLR 5 973.5 1.4 0.4 96.0
8 997.6 0.3 0.1 99.5
Table 3: Comparison between BH, BM, MIXT and DDLR methods for different values
of µ, in the case n = 10000 genes and kn = 1000 genes simulated differentially expressed.