
A simple procedure to detect non-central observations from a sample

Mylene DUVAL*, Celine DELMAS*, Beatrice LAURENT†, Christele ROBERT-GRANIE*

* SAGA, INRA Toulouse, France
  [email protected], [email protected], [email protected]
† Departement de Mathematiques, INSA Toulouse, France
  [email protected]

Summary

In this paper we propose a simple procedure to select the subset of non-central observations from a sample. This problem is motivated by the detection of differentially expressed genes between conditions in microarray experiments. We prove, under mild conditions, the consistency of the selected subset of observations. We compare our proposed procedure by simulations to the Benjamini and Hochberg procedure, a procedure based on mixture models, and a procedure based on model selection.

Some key words: microarray data; multiple testing procedure; ordered statistics; partial sums.


1. Introduction

The aim of this paper is to propose a simple procedure to detect non-central observations from a sample. We focus on the study of two samples (χ_i)_{i=1,...,n} and (F_i)_{i=1,...,n}. We write

    χ_i = Σ_{k=1}^{ν1} (γ_i N_{ik} + δ_{ik})²,   1 ≤ i ≤ n,   (1)

    F_i = [Σ_{k=1}^{ν1} (γ_i N_{ik} + δ_{ik})² / ν1] / [Σ_{j=1}^{ν2} D_{ij}² / ν2],   1 ≤ i ≤ n,   (2)

where ν1 and ν2 are two known integers; (δ_{ik}), k = 1,...,ν1, i = 1,...,n, are unknown parameters; (N_{ik}), k = 1,...,ν1, i = 1,...,n, are independent identically distributed centered variables with unit variance; and (D_{ij}), j = 1,...,ν2, i = 1,...,n, are independent identically distributed variables independent of the (N_{ik}). We define the non-centrality parameter η_i as η_i = Σ_{k=1}^{ν1} δ_{ik}². We assume that γ_i = 1 for all i such that η_i = 0, and that γ_i ≥ 1 is unknown otherwise. Our goal is to identify the set J_n = {i : η_i ≠ 0}. Note that by taking the squares of the observations of a Gaussian sample, we obtain a chi-square sample with one degree of freedom, which is a particular case of model (1) with ν1 = 1, γ_i = 1, and the N_{i1} independent identically distributed standard Gaussian variables for i ∈ {1,...,n}. Similarly, the squares of a Student sample form a particular case of model (2).

Our work is motivated by the problem of the detection of differentially expressed genes between conditions in microarray experiments. For this purpose a test statistic is built for each gene. In general it is a T or an F statistic comparing two or more conditions, but it can also be a Gaussian or a χ² statistic when the variance of the gene expression is assumed to be known. Under the null hypothesis that the gene is not differentially expressed between the conditions, the non-centrality parameter of the test statistic is null, whereas it is non-null under the alternative hypothesis that the gene is differentially expressed between the conditions. We observe the sample of all the test statistics corresponding to all the genes. To detect the genes that are differentially expressed, we have to separate the central from the non-central observations. This is the object of the paper.

To address this problem the simplest procedure would be the Bonferroni correction. This procedure controls the familywise error rate (FWER), which is the probability of making one or more false positives over all the tests. This criterion is very stringent and may affect the power when the number of tests is large. An alternative procedure consists in controlling the false discovery rate (FDR). Benjamini and Hochberg (1995) introduced an FDR controlling procedure and proved that it controls the FDR for independent test statistics. Adaptive FDR controlling procedures have been proposed to increase the power while controlling the FDR (Benjamini and Hochberg, 2000; Benjamini, Krieger and Yekutieli, 2005, Adaptive linear step-up procedures that control the false discovery rate, research paper 01-03). Genovese and Wasserman (2002, 2004) have proposed a method to minimize the false negative rate (FNR) while controlling the FDR. Mixture models have been studied to separate the central observations from the others (Delmas, 2006, On mixture models for multiple testing in the differential analysis of gene expression data, submitted; Bordes et al., 2006). In the Gaussian case some procedures based on model selection have been proposed (Birge and Massart, 2001; Huet, 2006). Some authors have proposed procedures based on the partial sums of the absolute or squared ordered observations (Hoh et al., 2001; Lavielle and Ludena, 2006, Random thresholds for linear model selection, submitted). Zaykin et al. (2002) and Dudbridge and Koeleman (2003, 2004) have proposed methods based on the partial products of the ordered p-values. Meinshausen and Rice (2004) have proposed a method to estimate the proportion of false null hypotheses among a large number of independently tested hypotheses; this method is based on the distribution of the p-values of the hypothesis tests, which are uniform on [0, 1] under the null hypothesis.

The aim of our paper is to propose a simple procedure that offers good asymptotic properties.

The paper is organized as follows. In Section 2 we introduce the proposed procedure. In Section 3 we state the consistency of the selected subset of observations for the two kinds of samples (1) and (2). In Section 4 we compare our proposed procedure by simulations with different methods: the Benjamini and Hochberg procedure, a procedure based on mixture models, and the Birge and Massart procedure. Section 5 is devoted to the proofs.

2. The procedure

Assume we observe the sample (1) or (2). We denote by X_i the observation: X_i = χ_i or X_i = F_i. We denote by k_n the number of non-central observations. Our procedure to separate the central from the non-central observations can be stated as follows:

(i) We order the X_i's:

        X_{σ(1)} ≥ X_{σ(2)} ≥ ... ≥ X_{σ(n)}.

    We define, for 0 ≤ k < n, τ̂_k = (1/(n−k)) Σ_{i=k+1}^{n} X_{σ(i)}.

(ii) We estimate k_n by

        k̂_n = min {0 ≤ k < n : τ̂_k ≤ τ},

    where τ = E[X_i] under the assumption that the non-centrality parameter η_i is equal to 0. If τ̂_k > τ for all 0 ≤ k < n, then we set k̂_n = n.

(iii) We decide that X_{σ(1)}, ..., X_{σ(k̂_n)} are non-central observations.

This procedure is based on the idea that if the two populations of central and non-central observations are well separated, then τ̂_k is a good estimator of τ only for k = k_n. For k < k_n the expression of τ̂_k includes non-central variables, hence τ̂_k tends to overestimate τ. For k > k_n the expression of τ̂_k excludes the largest observations of a sample of independent identically distributed variables with mean τ, so τ̂_k tends to underestimate τ.
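In code, steps (i)-(iii) amount to computing the running means of the tails of the ordered sample and stopping at the first one that falls below τ. A minimal NumPy sketch (the function name and interface are ours, not the paper's):

```python
import numpy as np

def ddlr_select(x, tau):
    """Sketch of the selection procedure of Section 2.

    x   : 1-D array of observations (chi-square or F statistics).
    tau : E[X_i] under the assumption eta_i = 0.
    Returns (k_hat, selected), the estimated number of non-central
    observations and their indices in x.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    order = np.argsort(x)[::-1]          # sigma: decreasing order
    xs = x[order]
    # tau_hat_k = mean of the n-k smallest observations, k = 0..n-1
    tail_means = np.cumsum(xs[::-1])[::-1] / np.arange(n, 0, -1)
    below = np.nonzero(tail_means <= tau)[0]
    k_hat = int(below[0]) if below.size else n   # k_hat = n if no k works
    return k_hat, order[:k_hat]
```

For the χ²(1) simulations of Section 4 one would pass tau = 1; for a central χ²(ν1) sample, tau = ν1.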

3. Results

We first introduce some notation. We write

    J_n = {i : η_i ≠ 0}   and   Ĵ_n = {σ(1), σ(2), ..., σ(k̂_n)}.

Let V and T denote respectively the number of false positives and the number of false negatives, defined as

    V = Card(Ĵ_n ∩ J_n^c),   T = Card(Ĵ_n^c ∩ J_n),

where, for any set J, J^c denotes the complementary set of J.

Before giving our main theorem, we state two lemmas showing that, under suitable assumptions, the variables satisfying η_i ≠ 0 are "well separated" from the others. We introduce some hypotheses.

H1: Let φ_(1) be the cumulative distribution function of L_(1) = max_{i=1,...,n} Σ_{k=1}^{ν1} N_{ik}². We assume that there is a sequence (a_n, b_n)_{n≥0} with b_n > 0 for all n ∈ N such that φ_(1)(a_n + b_n x) → F(x) as n → ∞, where F is a cumulative distribution function. We also assume that

    min_{i ∈ J_n} ( η_i / (2γ_i²) ) ≥ α_n,

where the sequence (α_n)_{n∈N} satisfies (α_n/2 − a_n)/b_n → +∞ as n → ∞.

Example: Assume that the N_{ik}'s are independent identically distributed centered Gaussian random variables with unit variance. Then for 1 ≤ i ≤ n, Σ_{k=1}^{ν1} N_{ik}² ∼ χ²(ν1). In that case we can prove that F is the cumulative distribution function of a Gumbel variable, with

    a_n = 2 log(n) + (ν1 − 2) log(log(n)) − 2 log(2^{ν1/2} Γ(ν1/2)) + ν1 log(2)   and   b_n = 2   for all n

(see Resnick (1987), Section 1.5, for similar calculations). We then have to choose α_n such that (α_n/2 − a_n)/b_n → +∞, for example α_n = 2a_n + ε_n for all n, where ε_n → +∞.

H2: Let ψ_(1) and ξ_(1) be respectively the cumulative distribution functions of

    D_(1) = max_{i=1,...,n} (ν1/ν2) Σ_{l=1}^{ν2} D_{il}²   and   L_(1) = max_{i=1,...,n} [Σ_{k=1}^{ν1} N_{ik}² / ν1] / [Σ_{l=1}^{ν2} D_{il}² / ν2].

We assume that there are two sequences (c_n, d_n)_{n≥0} and (e_n, f_n)_{n≥0}, with d_n > 0 and f_n > 0 for all n ∈ N, such that ψ_(1)(c_n + d_n x) → G(x) and ξ_(1)(e_n + f_n x) → H(x) as n → ∞, where G and H are cumulative distribution functions. Let (x_n)_{n∈N} be a sequence satisfying (x_n − c_n)/d_n → +∞. We assume that

    min_{1≤i≤k_n} ( η_i / γ_i² ) ≥ 4 x_n β_n,

where the sequence (β_n)_{n∈N} satisfies (β_n − e_n)/f_n → +∞ as n → ∞.

Example: Assume that the D_{il}'s and the N_{ik}'s are independent identically distributed centered Gaussian random variables with unit variance. Then for 1 ≤ i ≤ n, Σ_{l=1}^{ν2} D_{il}² ∼ χ²(ν2), and in that case we can prove that G is the cumulative distribution function of a Gumbel variable, with

    c_n = (ν1/ν2) [2 log(n) + (ν2 − 2) log(log(n)) − 2 log(2^{ν2/2} Γ(ν2/2)) + ν2 log(2)]   and   d_n = 2 ν1/ν2   for all n

(see Resnick (1987), Section 1.5, for similar calculations). We choose x_n such that (x_n − c_n)/d_n → +∞, for example x_n = c_n + ε_n for all n, where ε_n → +∞.

Moreover, for 1 ≤ i ≤ n, [Σ_{k=1}^{ν1} N_{ik}² / ν1] / [Σ_{l=1}^{ν2} D_{il}² / ν2] ∼ F(ν1, ν2). We can prove that H is the cumulative distribution function of a Frechet variable, with

    f_n = (Kn)^{2/ν2}   and   e_n = 0   for all n,   where K = [2 Γ((ν1+ν2)/2) / (Γ(ν1/2) Γ(ν2/2))] × ν2^{ν2/2−1} / ν1^{ν2/2}

(see Resnick (1987), Section 1.5, for similar calculations). We choose β_n such that β_n/f_n → +∞, for example β_n = f_n^{1+ε} with ε > 0.

Lemma 1. Let U_1, ..., U_n be n independent random variables such that

    U_i = Σ_{k=1}^{ν1} (γ_i N_{ik} + δ_{ik})²   for 1 ≤ i ≤ k_n

and

    U_i = Σ_{k=1}^{ν1} N_{ik}²   for k_n + 1 ≤ i ≤ n,

where, for all 1 ≤ i ≤ k_n, η_i = Σ_{k=1}^{ν1} δ_{ik}² ≠ 0 and γ_i ≥ 1 are unknown parameters, and the N_{ik}, 1 ≤ i ≤ n, 1 ≤ k ≤ ν1, are independent identically distributed centered variables with unit variance. We assume that assumption H1 holds. We define

    Ω_n = { min_{1≤i≤k_n} U_i ≥ max_{k_n+1≤i≤n} U_i }.

Then pr(Ω_n^c) → 0 as n → ∞.

Lemma 2. Let U_1, ..., U_n be n independent random variables such that

    U_i = [Σ_{k=1}^{ν1} (γ_i N_{ik} + δ_{ik})² / ν1] / [Σ_{l=1}^{ν2} D_{il}² / ν2]   for 1 ≤ i ≤ k_n

and

    U_i = [Σ_{k=1}^{ν1} N_{ik}² / ν1] / [Σ_{l=1}^{ν2} D_{il}² / ν2]   for k_n + 1 ≤ i ≤ n,

where, for all 1 ≤ i ≤ k_n, η_i = Σ_{k=1}^{ν1} δ_{ik}² ≠ 0 and γ_i ≥ 1 are unknown parameters; the N_{ik}, 1 ≤ i ≤ n, 1 ≤ k ≤ ν1, are independent identically distributed centered variables with unit variance; and the D_{il}, 1 ≤ i ≤ n, 1 ≤ l ≤ ν2, are independent identically distributed variables independent of the N_{ik}'s. We assume that assumption H2 holds. We define

    Ω_n = { min_{1≤i≤k_n} U_i ≥ max_{k_n+1≤i≤n} U_i }.

Then pr(Ω_n^c) → 0 as n → ∞.

Theorem 1. We consider the procedure described in Section 2. We assume that the cardinality of J_n equals k_n = λn with 0 < λ < 1. Let W denote the cumulative distribution function of the variables X_i/τ under the assumption that η_i = 0. We assume that W(2) < 1. Then, under assumption H1 for sample (1) or assumption H2 for sample (2),

    pr(V/n > u_n) → 0   and   pr(T/n > u_n) → 0   as n → +∞,

where u_n → 0 and √n u_n → +∞ as n tends to infinity.

4. Application and simulations

4.1 Application to microarray data

This procedure may be convenient for detecting differentially expressed genes in microarray experiments. DNA microarrays are a technology that enables molecular biologists to measure simultaneously the expression level of thousands of genes (Brown and Botstein, 1999). Thousands of gene probes made of cDNA or oligonucleotides are spotted on a small glass slide or a nylon membrane in a regular matrix pattern. A basic experiment consists in comparing the expression levels under two different types of conditions. More generally, we can study several conditions with one or more repetitions. The intensity level of each spot on the microarray represents a measure of the concentration of the corresponding mRNA in the biological sample. The detection of differentially expressed genes in DNA microarray experiments is an important question that biologists ask statisticians. At this stage, we assume that the intensity levels of the genes are correctly normalized.

Assume we study microarray data from an experiment including n genes, J conditions and R repetitions for each gene in each condition. We denote by A_{ijr} the r-th repetition of the expression level of gene i in condition j. We assume that for all i, j, r, A_{ijr} ∼ N(m_{ij}, σ_i²), where m_{ij} ∈ R and σ_i ∈ R+ are unknown. We assume that the A_{ijr}'s are independent. We write A_{ij.} = (1/R) Σ_{r=1}^{R} A_{ijr} and A_{i..} = (1/(JR)) Σ_{j=1}^{J} Σ_{r=1}^{R} A_{ijr}.

If R is large enough, we can estimate σ_i² by σ̂_i² = (1/(J(R−1))) Σ_{j=1}^{J} Σ_{r=1}^{R} (A_{ijr} − A_{ij.})². In that case,

    X_i = (1/((J−1) σ̂_i²)) Σ_{j=1}^{J} R (A_{ij.} − A_{i..})² ∼ F(η_i; J−1, J(R−1))   for all 1 ≤ i ≤ n,

where X_i ∈ R+ and η_i is the non-centrality parameter: η_i = 0 if gene i is not differentially expressed between the J conditions, and η_i > 0 otherwise. This model is similar to model (2).

When R is small, we cannot estimate σ_i² by the same estimator σ̂_i². We thus assume that σ_i = σ for all the genes which are not differentially expressed, and that σ is known. Write X_i = (1/σ²) Σ_{j=1}^{J} R (A_{ij.} − A_{i..})². If gene i is not differentially expressed between the J conditions, X_i ∼ χ²(J−1); otherwise (σ²/σ_i²) X_i ∼ χ²(η_i; J−1), where η_i > 0 is the non-centrality parameter. This model is similar to model (1) with γ_i = σ_i/σ.

In the particular case where we compare only two conditions, X_i = R (A_{i1.} − A_{i2.})² / (2σ²). If gene i is not differentially expressed between the two conditions, X_i ∼ χ²(1); otherwise (σ²/σ_i²) X_i ∼ χ²(η_i; 1), where η_i > 0 is the non-centrality parameter. This model corresponds to model (1) with ν1 = 1.

Write

    J_n = {1 ≤ i ≤ n : η_i ≠ 0}.

J_n is the set of the genes which are differentially expressed between the J conditions; it contains k_n elements. Our aim is to estimate the number k_n and the set J_n, thanks to the procedure presented in Section 2.
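Assuming the normalized expression levels are stored as an (n, J, R) array, the statistics X_i above can be computed in a vectorized way. This sketch (our own; the function name is hypothetical) follows the formulas for σ̂_i² and X_i:

```python
import numpy as np

def gene_f_statistics(A):
    """Per-gene one-way ANOVA F statistics, as in the model above.

    A : array of shape (n_genes, J, R) of normalized expression levels.
    Returns X of shape (n_genes,), where
    X_i = [sum_j R (A_ij. - A_i..)^2] / [(J - 1) * sigma_hat_i^2].
    """
    n, J, R = A.shape
    cond_means = A.mean(axis=2)                        # A_ij.
    grand_means = A.mean(axis=(1, 2))                  # A_i..
    # sigma_hat_i^2: within-condition variance on J(R-1) degrees of freedom
    sigma2 = ((A - cond_means[:, :, None]) ** 2).sum(axis=(1, 2)) / (J * (R - 1))
    between = R * ((cond_means - grand_means[:, None]) ** 2).sum(axis=1)
    return between / ((J - 1) * sigma2)
```

Each X_i is then a draw from the (possibly non-central) F(η_i; J−1, J(R−1)) distribution of the model.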

4.2 Simulations

To validate our procedure, denoted DDLR, we present several simulation results and some comparisons with three other methods. We want to determine the number k_n of genes actually differentially expressed between two conditions, with only one repetition for each gene and each condition. We suppose that all the genes have the same variance σ². We simulated n independent observations A_i as follows: k_n from a normal distribution N(m_{i1} − m_{i2}, 2σ²) and (n − k_n) from a normal distribution N(0, 2σ²). A_i represents the difference of expression of gene i between the two conditions. We write X_i = A_i²/(2σ²).

Generally in applications the standard deviation σ is unknown. We write, for all i, s_i² = s² = var(A_i) = 2σ². We propose to use the estimator of s presented by Haaland and O'Connell (1995).

Estimation of the variance

This estimator is defined as follows:

    ŝ = 1.5 × median{|A_i| : |A_i| ≤ 2.5 s_0},   where s_0 = 1.5 × median{|A_i|, i = 1,...,n}.

Intuitively, this estimator may be consistent if the proportion of variables A_i with non-null mean is small enough. Let us give some heuristic ideas about the construction of this estimator. Write t_n = median(|A_i|) for the empirical median. We know that t_n → m_n almost surely as n → ∞, where m_n is the theoretical median. Moreover, if we assume that A_i ∼ N(0, s²) for all i ∈ {1,...,n}, then

    pr(|A_i| ≤ m_n) = 2 ∫_0^{m_n} (1/(s√(2π))) e^{−x²/(2s²)} dx = 1/2,

so

    ∫_0^{m_n/s} (1/√(2π)) e^{−x²/2} dx = 1/4.

From the statistical tables we get m_n/s ≃ 0.6745, that is to say s ≃ 1.4826 m_n ≃ 1.5 × median{|A_i|, i = 1,...,n}.

However, some random variables A_i are not centered. In order to separate the centered A_i's from those with non-zero mean, Haaland and O'Connell (1995) approximated the first set by {|A_i|, i = 1,...,n : |A_i| ≤ 2.5 s_0}. This explains why they propose to estimate s by ŝ = 1.5 × median{|A_i| : |A_i| ≤ 2.5 s_0}. With large probability s_0 ≥ s, so if A_i ∼ N(0, s²),

    pr(|A_i| > 2.5 s_0) ≤ pr(|A_i| > 2.5 s) ≃ 2 × (1 − 0.9938) = 0.0124.

That is to say, if A_i is centered, the probability of not using it to estimate the standard deviation is about 1.2%. In this application we have supposed that the random variables A_i are Gaussian (Kerr et al., 2000). The estimation of the variance can also be generalized to the case where the A_i's have a known symmetric density.
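A direct transcription of this threshold estimator (the function name is ours):

```python
import numpy as np

def haaland_oconnell_s(a):
    """Sketch of the threshold estimator of s described above.

    a : 1-D array of observed differences A_i.
    s0 = 1.5 * median|A_i|; the final estimate keeps only the
    |A_i| <= 2.5 * s0, treating the rest as likely non-centered.
    """
    abs_a = np.abs(np.asarray(a, dtype=float))
    s0 = 1.5 * np.median(abs_a)
    kept = abs_a[abs_a <= 2.5 * s0]
    return 1.5 * np.median(kept)
```

On a purely centered Gaussian sample the estimate is close to the true s, with a slight downward bias caused by the truncation at 2.5 s_0.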

Simulations

We have considered different values for n, k_n and µ, where (m_{i1} − m_{i2})/s = µ for all i ∈ J_n. For each value of n, n = 5000 and n = 10000, we have considered k_n = 0.1 n and µ ∈ {3, 5, 8}. We recall that X_i = A_i²/s² ∼ χ²(µ; 1) for all i ∈ J_n, and X_i ∼ χ²(1) otherwise. In the expression of X_i, we replace s by ŝ. We use our procedure presented in Section 2 with the observations {X_i}_{i=1,...,n} and τ = 1. Hypothesis H1 is satisfied with a_n = 2 log(n) − log(log(n)) − 2 log(√(2π)) + log(2) and b_n = 2 for all n, and, for example, α_n = 2a_n + log(log(n)) for all n. Theorem 1 is proved under the assumption that µ > √(2α_n).
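For concreteness, the sufficient condition µ > √(2α_n) can be evaluated numerically with the choice α_n = 2a_n + log(log(n)) above (our computation, not the paper's; it gives roughly 7.7 for n = 5000 and 8.0 for n = 10000, so the theorem covers only the strongest simulated signal):

```python
import math

def alpha_n(n):
    # a_n and alpha_n for the chi-square(1) setting of the simulations
    a_n = (2 * math.log(n) - math.log(math.log(n))
           - 2 * math.log(math.sqrt(2 * math.pi)) + math.log(2))
    return 2 * a_n + math.log(math.log(n))

# smallest mu covered by the assumption mu > sqrt(2 * alpha_n)
for n in (5000, 10000):
    print(n, round(math.sqrt(2 * alpha_n(n)), 2))
```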

We now briefly present the three methods against which we compare the performance of the DDLR procedure. All these methods assume that the standard deviation s is known, so we estimated it with the threshold estimator ŝ presented in Section 4.2.

BM method:

First, Birge and Massart (2001) provided a general approach based on model selection via penalization for Gaussian random vectors with known variance. The penalty function presented by Birge and Massart is

    pen(k) = λ σ² k (1 + √(2 L_k))²   for all k ∈ {1,...,n},

where λ > 1 and (L_k)_{k≥1} is a sequence of positive real numbers. In our practical setting, the procedure proposed by Birge and Massart leads to defining

    k̂_n = argmin_{k=1,...,n} [ − Σ_{i=1}^{k} X_{σ(i)} + pen(k) ],

where X_{σ(1)} ≥ ... ≥ X_{σ(n)}. We have chosen pen(k) = Mk, where M is a constant that was calibrated at M = 8 to obtain good results when µ = 5.
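With the linear penalty pen(k) = Mk actually used here, the criterion can be minimized directly; a sketch (function name ours):

```python
import numpy as np

def bm_select(x, M=8.0):
    """Sketch of the penalized criterion used here for the BM method,
    with the linear penalty pen(k) = M * k (M = 8 as calibrated above).

    Returns k_hat = argmin_k [ -sum_{i<=k} X_sigma(i) + pen(k) ].
    """
    xs = np.sort(np.asarray(x, dtype=float))[::-1]   # decreasing order
    k = np.arange(1, xs.size + 1)
    crit = -np.cumsum(xs) + M * k
    return int(k[np.argmin(crit)])
```

Since the criterion decreases exactly while X_{σ(k)} > M, this rule essentially keeps the observations above the penalty slope M.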

BH method:

The second method is a testing procedure presented by Benjamini and Hochberg (1995). It controls the expected proportion of errors among the rejected hypotheses, named the false discovery rate (FDR). In our application, since we want to select the differentially expressed genes, the problem involves as many tests as there are genes on the microarray. The situation is summarized in Table 1. The false discovery proportion is the proportion of the rejected null hypotheses which are erroneously rejected, V/k̂_n. The FDR is then defined as FDR = E[V/max(1, k̂_n)]. We also define the false negative rate: FNR = E[T/max(1, n − k̂_n)]. When the number of tests n is large, false discoveries accumulate over the tests; as a result, in microarray experiments, the error on the estimation of the genes actually differentially expressed can be very large. This explains why Benjamini and Hochberg (1995) presented a method which controls the FDR.

The adaptive BH procedure (adapted from the classical procedure presented by Benjamini and Hochberg, 1995) is as follows, assuming that we test n − k_n true null hypotheses:

- let p_1, p_2, ..., p_n be the n p-values corresponding to the n tests, sorted in increasing order: p_(1) ≤ p_(2) ≤ ... ≤ p_(n);
- let H_0(1), H_0(2), ..., H_0(n) be the corresponding null hypotheses;
- let π_0 > 0 and let k̂_n be the biggest integer k ∈ {1,...,n} such that p_(k) ≤ (k/(n π_0)) α, where α ∈ ]0, 1[;
- H_0(i) is rejected for i = 1, ..., k̂_n.

Benjamini and Hochberg proved that for this procedure, FDR ≤ ((n − k_n)/n) (α/π_0). In the application to microarray data, we fixed the coefficient α = 0.05 and calibrated π_0 by simulations at π_0 = 2.9, such that the estimate of k_n was as near as possible to k_n in the case n = 5000, µ = 5 and k_n = 500. As a consequence this procedure controls the FDR by 0.016.
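A sketch of this step-up rule (function name ours; we take the threshold as p_(k) ≤ kα/(nπ_0), the reading consistent with the stated bound FDR ≤ ((n − k_n)/n)(α/π_0) and the value 0.016):

```python
import numpy as np

def bh_adaptive(pvalues, alpha=0.05, pi0=2.9):
    """Sketch of the adaptive step-up rule described above: reject the
    k_hat null hypotheses with the smallest p-values, where k_hat is
    the largest k with p_(k) <= k * alpha / (n * pi0).
    alpha = 0.05 and pi0 = 2.9 are the values calibrated in the text.
    """
    p = np.sort(np.asarray(pvalues, dtype=float))    # increasing order
    n = p.size
    ok = np.nonzero(p <= np.arange(1, n + 1) * alpha / (n * pi0))[0]
    return int(ok[-1]) + 1 if ok.size else 0         # k_hat
```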

MIXT method:

The third method is a classical method based on a mixture of two normal distributions, p N(µ, s²) + (1 − p) N(0, s²), where p and µ are unknown parameters and s is known. Write α_i for the conditional probability that gene i is differentially expressed, given the observation A_i, µ and p; A_i is Gaussian with variance s² and null mean if gene i is not differentially expressed. The estimates of p, µ and the α_i can be obtained by the EM algorithm (Titterington et al., 1985). Writing iter for the number of iterations of the EM algorithm, we estimate k_n by

    k̂_n = Σ_{i=1}^{n} 1{ α̂_i^[iter] > 0.5 }.
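A minimal EM sketch for this two-component mixture with known s (the initialization and function name are ours, not from the paper, which relies on Titterington et al., 1985):

```python
import numpy as np

def em_mixture(a, s, n_iter=200):
    """EM fit of p * N(mu, s^2) + (1 - p) * N(0, s^2) with s known.

    Returns (p, mu, alpha), where alpha[i] is the posterior probability
    that observation i comes from the N(mu, s^2) component.
    """
    a = np.asarray(a, dtype=float)
    p, mu = 0.5, a.mean() * 2.0            # crude starting point (ours)
    for _ in range(n_iter):
        # E step: posterior weight of the non-central component
        d1 = p * np.exp(-0.5 * ((a - mu) / s) ** 2)
        d0 = (1.0 - p) * np.exp(-0.5 * (a / s) ** 2)
        alpha = d1 / (d1 + d0)
        # M step: update the mixing proportion and the non-null mean
        p = alpha.mean()
        mu = (alpha * a).sum() / alpha.sum()
    return p, mu, alpha
```

The estimate k̂_n is then `(alpha > 0.5).sum()`.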

4.3 Results and discussion

Table 2 and Table 3 present the results of the simulations for n = 5000 and n = 10000 respectively, for the four methods. The following notations are used:

1. k̂_n denotes the estimate of the number of differentially expressed genes;
2. F̂DR denotes the estimate (in percentage) of the false discovery rate;
3. F̂NR denotes the estimate (in percentage) of the false negative rate;
4. the rate RDR is defined as RDR = S/k_n, in percentage.

The quantities 1, 2, 3 and 4 are computed on the basis of 1000 simulations.

For example, in Table 2 we simulated a microarray experiment in which n = 5000 genes were tested, with only k_n = 500 genes simulated as differentially expressed. We simulated three levels of difference of expression for the differentially expressed genes: µ = 3, 5 or 8. In the case µ = 8, the BM method estimates k_n by k̂_n = 519.6. Among these k̂_n = 519.6 genes, F̂DR = 3.8% were not simulated differentially expressed. Among the n − k̂_n = 4480.4 genes found non-differentially expressed, F̂NR = 0% were simulated differentially expressed. With this method, RDR = 100% of the genes simulated differentially expressed are found.

The results given in Tables 2 and 3 suggest the following remarks:

• When we compare Tables 2 and 3, the proportions RDR, F̂NR and F̂DR are essentially the same. This suggests that the four methods may depend only on the proportion k_n/n, which is an important parameter in the estimator of the variance s². Therefore we only discuss the results presented in Table 3.

• When µ = 8, all the methods give a good estimate of k_n, and all the genes which were simulated differentially expressed are found: RDR = 100% for almost all the methods. The BM method tends to overestimate k_n: the criterion does not penalize the high dimensions enough. The BH method also tends to overestimate k_n. We could change the constant π_0 to obtain a better estimate; however, if we did so, the method would tend to underestimate in the case µ = 5 (see the description of the BH method). So this does not seem to be a suitable way to improve the BH method. The MIXT and DDLR methods give good results for the four criteria k̂_n, F̂DR, F̂NR and RDR.

• Consider the case µ = 5. All the methods find more than 96% of the genes simulated differentially expressed. Among the genes found differentially expressed, 3.8% (F̂DR) were not simulated differentially expressed for the BM method, against 1.4% for the DDLR and BH methods and 1.5% for the MIXT method. The choice of the method depends on the objective of the user. If he prefers to find more differentially expressed genes in spite of a higher F̂DR (F̂DR ≥ 3.8%) but a better RDR (RDR > 98%), he can choose the BM method. If he prefers to control the error level F̂DR, he may choose among the BH, DDLR and MIXT methods. It is worth noticing that the BM method does not return a good RDR in the case µ = 3. In most applications to microarray data, µ and σ are not known and the ratio µ is often close to 3 or smaller.

• In the case µ = 5, the MIXT method seems to be the best method in terms of F̂DR (1.5%) and RDR (98%). However, in the case µ = 3, the F̂DR of the MIXT method is higher (10%), essentially because the mean µ of the differentially expressed genes is very small, so the method cannot easily separate the genes which are not differentially expressed from the others. On the other hand, in the case µ = 3, the DDLR method finds 57% of the genes which were simulated differentially expressed, and its F̂DR remains quite small: 7%.

In real microarray data analysis, the mean of the differences of gene expression levels between two conditions is generally small (µ ≤ 3). So the DDLR method seems well adapted to transcriptomic data.

As we said before, the BM method is difficult to use, because some constants must be calibrated in the penalty function. Since the ratio µ is generally close to 3 in microarray data analysis, we should calibrate the constant in the penalty function of the BM method accordingly. Unfortunately, in this case we could not find this constant because the method is unstable and we do not obtain equivalent estimates at each simulation. So we decided to keep the constant M = 8 in the BM penalty function, calibrated in the case n = 5000, k_n = 500 and µ = 5.

To conclude, for any level of µ, our method finds a high proportion of the genes simulated differentially expressed (RDR ≥ 57%) and the F̂DR (≤ 7%) stays quite low. The other methods tend to favour only one of the criteria F̂DR, F̂NR or RDR. Moreover, the DDLR method is very easy to implement and no constant needs to be calibrated.

In this new approach, we assumed that the variance was known. It would be interesting to develop our method in the case where the variance is unknown, as proposed by Huet (2006), who generalized the BM method.

5. Proofs

5.1 Proof of Lemma 1

The variables (U_i)_{k_n+1≤i≤n} are denoted (V_i)_{1≤i≤n−k_n}. We write U_(1) ≥ U_(2) ≥ ... ≥ U_(k_n) and V_(1) ≥ V_(2) ≥ ... ≥ V_(n−k_n). The complementary set of Ω_n is the event {V_(1) > U_(k_n)}. Since U_i = Σ_{k=1}^{ν1} (γ_i N_{ik} + δ_{ik})², we use the inequality 2ab ≥ −a²/2 − 2b², which holds for all a, b ∈ R, to obtain that for all 1 ≤ i ≤ k_n,

    U_i/γ_i² ≥ − Σ_{k=1}^{ν1} N_{ik}² + η_i/(2γ_i²).

Recalling that L_(1) = max_{1≤i≤n} Σ_{k=1}^{ν1} N_{ik}², this implies that

    U_(k_n)/γ_(k_n)² ≥ min_{1≤i≤k_n} η_i/(2γ_i²) − L_(1).

Then, since γ_i ≥ 1 for 1 ≤ i ≤ k_n,

    pr(Ω_n^c) = pr(V_(1) > U_(k_n))
             ≤ pr( V_(1)/γ_(k_n)² + L_(1) > min_{1≤i≤k_n} η_i/(2γ_i²) )
             ≤ pr( V_(1) + L_(1) > min_{1≤i≤k_n} η_i/(2γ_i²) ).

Since V_(1) ≤ L_(1), we get pr(Ω_n^c) ≤ pr(2 L_(1) > α_n). Since (L_(1) − a_n)/b_n converges in distribution to F and (α_n/2 − a_n)/b_n → +∞ under H1, we obtain that pr(Ω_n^c) → 0 as n → ∞. This concludes the proof of Lemma 1.

5.2 Proof of Lemma 2

As in the proof of Lemma 1, we obtain

    pr(Ω_n^c) ≤ pr( 2 L_(1) > min_{1≤i≤k_n} ( η_i/(2γ_i²) ) (1/D_(1)) ).

By H2, pr(D_(1) ≤ x_n) → 1 as n → ∞. This implies that

    pr(Ω_n^c) ≤ pr( L_(1) ≥ min_{1≤i≤k_n} η_i/(4 x_n γ_i²) ) + pr( D_(1) > x_n ),

which tends to 0 as n → ∞ by assumption H2.

5.3 Proof of Theorem 1

Let us first prove that the function k ↦ τ̂_k is non-increasing. For 1 ≤ k ≤ n − 1,

    τ̂_{k−1} = (1/(n−k+1)) Σ_{i=k}^{n} X_(i)
            = X_(k)/(n−k+1) + (1/(n−k+1)) Σ_{i=k+1}^{n} X_(i)
            = X_(k)/(n−k+1) + ((n−k)/(n−k+1)) τ̂_k.

Since X_(k) ≥ X_(i) for i ≥ k, we get X_(k) ≥ (1/(n−k)) Σ_{i=k+1}^{n} X_(i) = τ̂_k. This implies that

    τ̂_{k−1} ≥ (1/(n−k+1)) τ̂_k + ((n−k)/(n−k+1)) τ̂_k = τ̂_k.

Hence, the function k ↦ τ̂_k is non-increasing.

Let us now prove that pr(T/n > u_n) → 0 as n → +∞. We have

    pr(T/n > u_n) ≤ pr( {k̂_n/n − k_n/n < −u_n} ∩ Ω_n ) + pr(Ω_n^c),

and

    pr( {k̂_n < k_n − n u_n} ∩ Ω_n )
      ≤ pr( {τ̂_{k_n − n u_n} ≤ τ} ∩ Ω_n )
      ≤ pr( { (1/(n−k_n+n u_n)) Σ_{i=k_n−n u_n+1}^{k_n} X_(i)/τ + (1/(n−k_n+n u_n)) Σ_{i=k_n+1}^{n} X_(i)/τ ≤ 1 } ∩ Ω_n ).

We define

    P1 = pr( { (1/(n−k_n+n u_n)) Σ_{i=k_n−n u_n+1}^{k_n} X_(i)/τ + (1/(n−k_n+n u_n)) Σ_{i=k_n+1}^{n} X_(i)/τ ≤ 1 } ∩ Ω_n ).

Then

    P1 ≤ pr( { (1/(n−k_n+n u_n)) Σ_{i=k_n−n u_n+1}^{k_n} X_(i)/τ ≤ 2n u_n/(n−k_n+n u_n) } ∩ Ω_n )
       + pr( (1/(n−k_n+n u_n)) Σ_{i=1}^{n−k_n} Z_i ≤ 1 − 2n u_n/(n−k_n+n u_n) ),

where (Z_i)_{i=1,...,n−k_n} is a sample of independent identically distributed random variables with distribution W, the distribution of X_i/τ under the assumption η_i = 0. Note that E(Z_i) = 1; we denote by v the standard deviation of the Z_i's. We have

    pr( (1/(n−k_n+n u_n)) Σ_{i=1}^{n−k_n} Z_i ≤ 1 − 2n u_n/(n−k_n+n u_n) )
      = pr( (Σ_{i=1}^{n−k_n} Z_i − (n−k_n)) / (v√(n−k_n)) ≤ −n u_n/(v√(n−k_n)) ).

The central limit theorem implies that (Σ_{i=1}^{n−k_n} Z_i − (n−k_n)) / (v√(n−k_n)) converges in distribution to N(0, 1), and since √n u_n → +∞, we obtain −n u_n/(v√(n−k_n)) → −∞ as n → ∞. This implies that

    pr( (1/(n−k_n+n u_n)) Σ_{i=1}^{n−k_n} Z_i ≤ 1 − 2n u_n/(n−k_n+n u_n) ) → 0   as n → ∞.

Let us now control the other term appearing in the upper bound for P1:

    pr( { (1/(n−k_n+n u_n)) Σ_{i=k_n−n u_n+1}^{k_n} X_(i)/τ ≤ 2n u_n/(n−k_n+n u_n) } ∩ Ω_n )
      = pr( { (1/(n u_n)) Σ_{i=k_n−n u_n+1}^{k_n} X_(i)/τ ≤ 2 } ∩ Ω_n ).

On the event Ω_n,

    (1/(n u_n)) Σ_{i=k_n−n u_n+1}^{k_n} X_(i)/τ ≥ Z_(1),

where (Z_1, ..., Z_{n−k_n}) are independent identically distributed with distribution W and Z_(1) = max{Z_i, 1 ≤ i ≤ n−k_n}. Hence

    pr( { (1/(n u_n)) Σ_{i=k_n−n u_n+1}^{k_n} X_(i)/τ ≤ 2 } ∩ Ω_n ) ≤ pr( Z_(1) ≤ 2 ) = (W(2))^{n−k_n}.

This tends to 0 as n tends to infinity, since we assumed that W(2) < 1. We have proved that P1 → 0 as n → ∞. Since pr(Ω_n^c) → 0 by Lemma 1 for sample (1) and by Lemma 2 for sample (2), we obtain that pr(T/n > u_n) → 0 as n → +∞.

It remains to show that pr(V/n > u_n) → 0 as n → +∞. We have

    pr(V/n > u_n) ≤ pr( {k̂_n/n − k_n/n > u_n} ∩ Ω_n ) + pr(Ω_n^c).

Hence, we shall prove that pr( {k̂_n/n − k_n/n > u_n} ∩ Ω_n ) → 0 as n → ∞. We have

    pr( {k̂_n − k_n > n u_n} ∩ Ω_n ) = pr( {τ̂_{k_n + n u_n} > τ} ∩ Ω_n )
      = pr( { (1/(n−(k_n+n u_n))) Σ_{i=k_n+n u_n+1}^{n} X_(i) > τ } ∩ Ω_n ).

Let (Z_i)_{i=1,...,n−k_n} be a sample of independent random variables with common cumulative distribution function W, and write Z_(1) ≥ Z_(2) ≥ ... ≥ Z_(n−k_n). Then

    pr( {k̂_n − k_n > n u_n} ∩ Ω_n ) = pr( { (1/(n−(k_n+n u_n))) Σ_{i=n u_n+1}^{n−k_n} Z_(i) > 1 } ∩ Ω_n )
      ≤ pr( Σ_{i=1}^{n−k_n} Z_(i) − Σ_{i=1}^{n u_n} Z_(i) > n − (k_n + n u_n) )
      ≤ pr( Σ_{i=1}^{n−k_n} Z_(i) > n − k_n + √(n−k_n) √n u_n )
        + pr( − Σ_{i=1}^{n u_n} Z_(i) > −n u_n − √(n−k_n) √n u_n ).

We set

    P2 = pr( Σ_{i=1}^{n−k_n} Z_(i) > n − k_n + √(n−k_n) √n u_n ),
    P3 = pr( − Σ_{i=1}^{n u_n} Z_(i) > −n u_n − √(n−k_n) √n u_n ).

Since Σ_{i=1}^{n−k_n} Z_(i) = Σ_{i=1}^{n−k_n} Z_i,

    P2 = pr( (Σ_{i=1}^{n−k_n} Z_i − (n−k_n)) / (v√(n−k_n)) > √n u_n / v ).

The central limit theorem implies that (Σ_{i=1}^{n−k_n} Z_i − (n−k_n)) / (v√(n−k_n)) converges in distribution to N(0, 1); since √n u_n → +∞, we obtain P2 → 0 as n → ∞.

Moreover,

    P3 = pr( (1/(n u_n)) Σ_{i=1}^{n u_n} Z_(i) < 1 + √(n−k_n)/√n ),

which implies that

    P3 ≤ pr( (1/(n u_n)) Σ_{i=1}^{n u_n} Z_(i) ≤ 2 )
       ≤ pr( Z_(n u_n) ≤ 2 )
       ≤ pr( Z_(n−k_n) ≤ ... ≤ Z_(n u_n) ≤ 2 )
       ≤ pr( Σ_{i=1}^{n−k_n} (1{Z_i ≤ 2} − p) ≥ n − k_n − n u_n − (n−k_n) p ),

where p = pr(Z_i ≤ 2) ∈ ]0, 1[. We define ζ_i = 1{Z_i ≤ 2} − p; E(ζ_i) = 0 and |ζ_i| ≤ 1. Then

    P3 ≤ pr( Σ_{i=1}^{n−k_n} ζ_i ≥ (n−k_n)(1−p) − n u_n )
       ≤ pr( (1/(n−k_n)) Σ_{i=1}^{n−k_n} ζ_i ≥ 1 − p − n u_n/(n−k_n) ).

We use Hoeffding's inequality, which we first recall.

Lemma 3. Let (W_i)_{i=1,...,n} be independent identically distributed random variables such that a ≤ W_i ≤ b and E(W_i) = 0 for all i. Then for x > 0,

    pr( (1/n) Σ_{i=1}^{n} W_i ≥ x ) ≤ exp( −2n x² / (b−a)² ).

We apply Hoeffding's inequality to the sample (ζ_i)_{i=1,...,n−k_n} with x = x_n = 1 − p − n u_n/(n−k_n). Since u_n → 0 and k_n = λn, there exists n_0 ∈ N such that x_n ≥ (1−p)/2 > 0 for n ≥ n_0. For n ≥ n_0,

    P3 ≤ exp( − ((n−k_n)/2) ((1−p)/2)² ).

Then P3 → 0 as n → ∞. This concludes the proof of Theorem 1.


References

Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Royal Statist. Soc. Ser. B, 57, 289–300.

Benjamini, Y. & Hochberg, Y. (2000). The adaptive control of the false discovery rate in multiple comparison problems. The Journal of Educational and Behavioral Statistics, 25, 60–83.

Birge, L. & Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc., 3, 203–268.

Bordes, L., Delmas, C. & Vandekerkhove, P. (2006). Semiparametric estimation of a two-component mixture model when a component is known. Scandinavian Journal of Statistics, to appear.

Brown, P. & Botstein, D. (1999). Exploring the new world of the genome with DNA microarrays. Nat. Genet. Suppl., 21, 33–37.

Dudbridge, F. & Koeleman, B. P. (2003). Rank truncated product of P-values, with application to genomewide association scans. Genet. Epidemiol., 25, 360–366.

Dudbridge, F. & Koeleman, B. P. (2004). Efficient computation of significance levels for multiple associations in large studies of correlated data, including genomewide association studies. Am. J. Hum. Genet., 75, 424–435.

Genovese, C. R. & Wasserman, L. (2002). Operating characteristics and extensions of the FDR procedure. J. Royal Statist. Soc. Ser. B, 64, 499–518.

Genovese, C. R. & Wasserman, L. (2004). A stochastic process approach to false discovery control. Annals of Statistics, 32, 1035–1061.

Haaland, P. D. & O'Connell, M. A. (1995). Inference for effect-saturated fractional factorials. Technometrics, 37(1).

Hoh, J., Wille, A. & Ott, J. (2001). Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res., 11, 2115–2119.

Huet, S. (2006). Model selection for estimating the non zero components of a Gaussian vector. ESAIM: Probability and Statistics, to appear.

Kerr, M. K., Martin, M. & Churchill, G. A. (2000). Analysis of variance for gene expression microarray data. J. Comput. Biol., 7, 819–837.

Meinshausen, N. & Rice, J. (2004). Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses. Annals of Statistics, to appear.

Resnick, S. I. (1987). Extreme Values, Regular Variation, and Point Processes. Springer-Verlag, Applied Probability Trust.

Titterington, D. M., Smith, A. F. M. & Makov, U. E. (1985). Statistical Analysis of Finite Mixture Distributions. Chichester, UK: John Wiley and Sons.

Zaykin, D. V., Zhivotovsky, L. A., Westfall, P. H. & Weir, B. S. (2002). Truncated product method for combining P-values. Genet. Epidemiol., 22, 170–185.

                                      decision
                             H0 accepted   H0 rejected    total
true null hypotheses              U             V         n − k_n
non-true null hypotheses          T             S         k_n
total                          n − k̂_n        k̂_n          n

Table 1: Number of errors when testing n hypotheses.

method    µ      k̂_n    F̂DR    F̂NR     RDR
BM        3    261.9     4.7     5.3    49.9
          5    511.1     3.8     0.2    98.3
          8    519.6     3.8     0.0   100.0
BH        3    110.7     0.7     8.0    22.0
          5    490.4     1.5     0.4    96.6
          8    507.4     1.5     0.0   100.0
MIXT      3    408.3     9.9     2.9    73.5
          5    497.3     1.5     0.2    98.0
          8    500.0     0.1     0.0    99.9
DDLR      3    306.2     7.2     4.6    56.8
          5    487.0     1.5     0.5    95.9
          8    500.3     0.6     0.1    99.5

Table 2: Comparison between BH, BM, MIXT and DDLR methods for different values of µ, in the case n = 5000 genes and k_n = 500 genes simulated differentially expressed.

method    µ      k̂_n    F̂DR    F̂NR     RDR
BM        3    521.5     4.6     5.3    49.8
          5   1023.0     3.8     0.2    98.4
          8   1039.7     3.8     0.0   100.0
BH        3    220.2     0.6     8.0    21.9
          5    980.2     1.4     0.4    96.6
          8   1015.0     1.5     0.0   100.0
MIXT      3    820.2    10.0     2.9    73.8
          5    994.9     1.5     0.2    98.0
          8    999.9     0.0     0.0   100.0
DDLR      3    613.2     7.0     4.6    57.0
          5    973.5     1.4     0.4    96.0
          8    997.6     0.3     0.1    99.5

Table 3: Comparison between BH, BM, MIXT and DDLR methods for different values of µ, in the case n = 10000 genes and k_n = 1000 genes simulated differentially expressed.