Multivariate search for differentially expressed gene combinations

10
BioMed Central Page 1 of 10 (page number not for citation purposes) BMC Bioinformatics Open Access Methodology article Multivariate search for differentially expressed gene combinations Yuanhui Xiao 1 , Robert Frisina 2 , Alexander Gordon 1 , Lev Klebanov 1,3 and Andrei Yakovlev* 1 Address: 1 Department of Biostatistics and Computational Biology, University of Rochester, 601 Elmwood Avenue, Rochester, New York 14642, USA, 2 Departments of Otolaryngology, Neurobiology and Anatomy, and Biomedical Engineering, University of Rochester, 601 Elmwood Avenue, Rochester, New York 14642, USA and 3 Department of Probability and Statistics, Charls University, Sokolovska 83, Praha-8, CZ-18675, Czech Republic Email: Yuanhui Xiao - [email protected]; Robert Frisina - [email protected]; Alexander Gordon - [email protected]; Lev Klebanov - [email protected]; Andrei Yakovlev* - [email protected] * Corresponding author Abstract Background: To identify differentially expressed genes, it is standard practice to test a two- sample hypothesis for each gene with a proper adjustment for multiple testing. Such tests are essentially univariate and disregard the multidimensional structure of microarray data. A more general two-sample hypothesis is formulated in terms of the joint distribution of any sub-vector of expression signals. Results: By building on an earlier proposed multivariate test statistic, we propose a new algorithm for identifying differentially expressed gene combinations. The algorithm includes an improved random search procedure designed to generate candidate gene combinations of a given size. Cross- validation is used to provide replication stability of the search procedure. A permutation two- sample test is used for significance testing. We design a multiple testing procedure to control the family-wise error rate (FWER) when selecting significant combinations of genes that result from a successive selection procedure. A target set of genes is composed of all significant combinations selected via random search. Conclusions: A new algorithm has been developed to identify differentially expressed gene combinations. The performance of the proposed search-and-testing procedure has been evaluated by computer simulations and analysis of replicated Affymetrix gene array data on age-related changes in gene expression in the inner ear of CBA mice. Background The set of microarray expression data on p distinct genes is represented by a random vector X = X 1 ,..., X p with sto- chastically dependent components. The dimension of X is typically very high relative to the number of observations (replicates of experiment). The standard practice is to test the hypothesis of no differential expression for each gene. Formulated in terms of the marginal distributions of all components of X, this hypothesis means that the expres- sion levels of a particular gene are identically distributed under two (or more) experimental conditions. It is com- monly believed that the only challenging problem here is that of multiple statistical tests, because the correspond- ing test statistics computed for different genes are Published: 26 October 2004 BMC Bioinformatics 2004, 5:164 doi:10.1186/1471-2105-5-164 Received: 07 August 2004 Accepted: 26 October 2004 This article is available from: http://www.biomedcentral.com/1471-2105/5/164 © 2004 Xiao et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Transcript of Multivariate search for differentially expressed gene combinations

BioMed CentralBMC Bioinformatics

ss

Open AcceMethodology articleMultivariate search for differentially expressed gene combinationsYuanhui Xiao1, Robert Frisina2, Alexander Gordon1, Lev Klebanov1,3 and Andrei Yakovlev*1

Address: 1Department of Biostatistics and Computational Biology, University of Rochester, 601 Elmwood Avenue, Rochester, New York 14642, USA, 2Departments of Otolaryngology, Neurobiology and Anatomy, and Biomedical Engineering, University of Rochester, 601 Elmwood Avenue, Rochester, New York 14642, USA and 3Department of Probability and Statistics, Charls University, Sokolovska 83, Praha-8, CZ-18675, Czech Republic

Email: Yuanhui Xiao - [email protected]; Robert Frisina - [email protected]; Alexander Gordon - [email protected]; Lev Klebanov - [email protected]; Andrei Yakovlev* - [email protected]

* Corresponding author

AbstractBackground: To identify differentially expressed genes, it is standard practice to test a two-sample hypothesis for each gene with a proper adjustment for multiple testing. Such tests areessentially univariate and disregard the multidimensional structure of microarray data. A moregeneral two-sample hypothesis is formulated in terms of the joint distribution of any sub-vector ofexpression signals.

Results: By building on an earlier proposed multivariate test statistic, we propose a new algorithmfor identifying differentially expressed gene combinations. The algorithm includes an improvedrandom search procedure designed to generate candidate gene combinations of a given size. Cross-validation is used to provide replication stability of the search procedure. A permutation two-sample test is used for significance testing. We design a multiple testing procedure to control thefamily-wise error rate (FWER) when selecting significant combinations of genes that result from asuccessive selection procedure. A target set of genes is composed of all significant combinationsselected via random search.

Conclusions: A new algorithm has been developed to identify differentially expressed genecombinations. The performance of the proposed search-and-testing procedure has been evaluatedby computer simulations and analysis of replicated Affymetrix gene array data on age-relatedchanges in gene expression in the inner ear of CBA mice.

BackgroundThe set of microarray expression data on p distinct genesis represented by a random vector X = X1,..., Xp with sto-chastically dependent components. The dimension of X istypically very high relative to the number of observations(replicates of experiment). The standard practice is to testthe hypothesis of no differential expression for each gene.

Formulated in terms of the marginal distributions of allcomponents of X, this hypothesis means that the expres-sion levels of a particular gene are identically distributedunder two (or more) experimental conditions. It is com-monly believed that the only challenging problem here isthat of multiple statistical tests, because the correspond-ing test statistics computed for different genes are

Published: 26 October 2004

BMC Bioinformatics 2004, 5:164 doi:10.1186/1471-2105-5-164

Received: 07 August 2004Accepted: 26 October 2004

This article is available from: http://www.biomedcentral.com/1471-2105/5/164

© 2004 Xiao et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Page 1 of 10(page number not for citation purposes)

BMC Bioinformatics 2004, 5:164 http://www.biomedcentral.com/1471-2105/5/164

stochastically dependent. This problem is discussed in [2]in the context of microarray data analysis. Resamplingtechniques [3,4] provide a universal approach to the prob-lem of multiple dependent tests inherent in the most typ-ical study designs. However, there is another aspect of thestandard approach that warrants special attention. Anytest constructed solely in terms of marginal distributionsof gene expression levels disregards the multidimensional(dependence) information hidden in gene interactions,which is its most obvious deficiency.

In a recent paper, Szabo et al. [5] proposed to build a tar-get set of interesting genes from non-overlapping subsetsof genes of a given size (≥1) that have been declared dif-ferentially expressed in accordance with a pertinent statis-tical test. The size of each sought-for subset is naturallyconstrained by the available sample size. This approachstrives to preserve the dependence structure at least withineach of such building blocks, which is already a majorstep toward a more general methodology of microarraygene expression data analysis.

No matter what specific statistical techniques are chosento approach the problem of identifying differentiallyexpressed gene combinations rather than individualgenes, the hypothesis that the expression levels of a givenset of genes are identically distributed across the condi-tions under study is the most meaningful hypothesis to betested. However, this hypothesis is now formulated interms of the joint distribution of expression levels. Theissue of multiple testing is dramatically magnified withmultivariate methodology, because the total number oftests to be carried out at all steps of multivariate selectionmay be many orders of magnitude larger than with uni-variate methods. A constructive idea is to design a randomsearch procedure for identifying differentially expressedsets of genes followed by testing significance of a final set.Szabo et al. [5,6] proposed a search procedure based onmaximization of a new distance between multivariate dis-tributions of gene expression signals. They used permuta-tion techniques for hypotheses testing. To adjust formultiple testing, the null-distribution was estimated fromthe test statistics generated by each optimal (in terms ofthe adopted distance) set of genes found in each permuta-tion sample. The authors provided an illustrative exampleof clear advantages of multivariate methodology over uni-variate approaches. In the present paper, we improve thecross-validation and multiple testing components of theearlier proposed algorithm. This new combination of thesearch-and-testing procedures furnishes a sound statisticalmethodology for multivariate analysis of microarray data.

ResultsMathematical framework: measure of differential expressionTo compare gene expression signals in two different exper-imental conditions (states) one needs a pertinent distancebetween two random vectors. Such a distance is expectedto satisfy the following requirements: (1) it should have aclear probabilistic meaning; (2) it should accommodateboth continuous and categorical data; (3) its estimateshould be stable to random fluctuations and numericalerrors; (4) its computation should not be too time con-suming. A distance that meets all the above requirementswas proposed in [6].

Let X = X1,..., Xd and Y = Y1,..., Yd, d ≤ p, be two randomsub-vectors with probability measures µ and ν, respec-tively, defined on the Euclidean space Rd. Let K(x, y) be astrictly negative definite kernel, that is

for any x1,..., xs from Rd and any

real numbers h1,..., hs, , with equality if and

only if all hi = 0. Introduce the following expression

The quantity N(µ, ν) can be shown [7] to be a metric inthe space of all probability measures Rd, so that the nullhypothesis in two-sample comparisons can be formulatedas H0 : N(µ, ν) = 0. A normalized version of N can be

derived as , where

If K(x, y) = Ψ(x - y) and Ψ(·) is homogeneous of anyorder, then Nnorm is both location and scale invariant.

Consider two independent samples, consisting of n1andn2 observations respectively, represented by the d-dimen-

sional vectors and , and introduce an

empirical counterpart (nonparametric estimate) of N(µ,ν) as follows

A very important advantage of the empirical counterpart

of the distance N is that it does not involve numeri-cally unstable high-dimensional components (such ascovariance matrix or its inverse), thus it is expected to benumerically stable even for small sample sizes. This was

K h hi j i ji js ( , ),

x x ≤=∑ 01

hiis ==∑ 0

1

N N K d d

K d d

dd

dd

2 2 2= =

− −

∫∫

∫∫

( , ) ( , ) ( ) ( )

( , ) ( ) ( )

µ ν µ ν

µ µ

x y x y

x y x y

RR

RR

KK d ddd

( , ) ( ) ( ). ( )x y x yRR∫∫ ν ν 1

N N Anorm2 2= /

A K d d K d ddd dd

= + ( )∫∫ ∫∫( , ) ( ) ( ) ( , ) ( ) ( ).x y x y x y x yRR RR

µ µ ν ν 2

x x1 1,..., n y y1 2

,..., n

Nn n

K x yn

K x xn n i jj

n

i

n

i jj

n2

1 2 12

11 11 2

21 212

1ˆ , ˆ ( , ) ( , )µ ν( ) = −== =∑∑ ∑∑∑ ∑∑−

= ==

13

22

1 11

1 22

nK y y

i

n

i jj

n

i

n

( , ). ( )

Page 2 of 10(page number not for citation purposes)

BMC Bioinformatics 2004, 5:164 http://www.biomedcentral.com/1471-2105/5/164

corroborated by a computer simulation study [5], inwhich this distance demonstrated a much higher stabilitythan the Mahalanobis distance and the nearest neighborclassifier.

Another distinct advantage of the approach based on is a wide selection of negative definite kernels that are sen-sitive to various departures from the hypothesis: µ = ν. Letx and y denote observations in two samples on a particu-

lar set of variables. We consider either of these obser-vations to be points in Euclidean space Rd. One naturalchoice is the Euclidean distance between points represent-ing experimental measurements:

When this kernel is applied to logarithms of gene expres-

sion signals the corresponding distance is scale invari-ant. Another possible choice is a bounded kernelexemplified by

Yet another kernel based on the correlation coefficienttends to pick up sets of genes with separated means anddifferences in correlation in the two samples under com-parison [6]. One can also use a convex combination of theabove mentioned kernels with the weights chosen in sucha way as to make the distance more sensitive to particulartypes of the alternative hypothesis.

The search-and-testing algorithmOnce a multivariate distance between expression signalshas been selected, it can be employed in a search for dif-ferentially expressed genes with the target subset of genesbeing defined as a subset for which the distance betweenthe two groups under comparison attains its maximum.Unlike univariate testing, an exhaustive multivariatesearch is computationally prohibitive because the numberof possible subsets increases as the d-th power of the totalnumber of genes. The issue of computational complexitycan be resolved by applying random search methodology.Random search can be designed in a number of variousways. One simple algorithm was described in [6,8]. Weused this algorithm, hereafter designated as Simple Ran-dom Search (SRS), with multiple random starts and longsequences of search steps in the application reported inthe present paper. We also compared its performance withthat of simulated annealing [1].

To reduce the selection bias associated with choosing asmall number of variables from a large set [9], Szabo et al.

[5,6] suggested to use cross-validation techniques with thesearch for a target subset of genes running in each cross-validation cycle. The basic structure of our cross-valida-tion algorithm is as follows:

Algorithm A1: Cross-validated search for differentially expressed genes1. Randomly draw (without replacement) u1 samplesfrom one group of arrays and u2 samples from the othergroup.

2. Leave out the selected arrays and find the optimal (inaccordance with the chosen criterion) subset of genesusing only the data from the remaining arrays.

3. Repeat steps 1 and 2 in succession v times to obtain v"optimal" sets of genes.

The main problem here is that the algorithm results inmany overlapping sub-optimal sets, and one needs tosomehow combine them to report a single final set. Szaboet al. resorted to a somewhat unnatural way of forming afinal set by selecting single genes with the highest frequen-cies of occurrence in sub-optimal sets. In our new algo-rithm, this is accomplished through designing a second-stage cross-validated search limited to the union of thepreviously selected sets. In the second-stage search proce-dure, cross-validation is carried out at each step of random

search with the distance averaged over all cross-valida-tion samples. With this approach, the correlation struc-ture is better preserved when combining the results ofcross-validation. The foregoing description of the second-stage search may be summarized in the followingalgorithm:

Algorithm A2: The second-stage cross-validated random search1. Form the union of all sets resulted from Algorithm A1to represent an initial target set. Drop the data on all othergenes from the data set.

2. Initiate a random search algorithm.

3. At each step of the search algorithm, randomly draw(without replacement) l1 samples from one group ofarrays and l2 samples from the other group. Leave out theselected arrays and compute the N-statistic using only thedata from the remaining arrays. Perform this computationr times.

4. Compute the average (arithmetic mean) of the N-statis-

tics resulted from step 3. Denote this average by .

K x yg gg

12 4( , ) ( ) . ( )x y x y= − = −

∈∑

K rr2 1

1

10 2( , ) , .x y

x y= −

+ −< <

ˆ *Nr

Page 3 of 10(page number not for citation purposes)

BMC Bioinformatics 2004, 5:164 http://www.biomedcentral.com/1471-2105/5/164

5. Move to the next step of random search using the statis-

tic as a pertinent objective function to be maximized.

In the application discussed in the present paper, we usedAlgorithm A2 with 200 cross-validated samples in the sec-ond stage of the search algorithm. The two-stage searchalgorithm runs with multiple random starts and returnsthe most differentially expressed (in terms of the distance

) gene combination of a given size.

Once an optimal set has been found, all genes pertainingto this set are discarded and a search for the next set of dif-ferentially expressed genes is initiated. Szabo et al. [5] pro-posed a stopping rule based on a permutation significancetest. In the improved version of our algorithm, instead oftesting significance at each step of the successive selectionof subsets of genes, the selection procedure runs (withouttesting) for a preset number of steps, thereby forming areasonably long sequence of non-overlapping "maximal"subsets. The same cross-validated random search proce-dure is applied to each permutation sample, generated tomodel the complete null hypothesis for disjoint subsets ofgenes, and finally the step-down multiple testing resam-pling algorithm by Westfall and Young [4] is applied tothe subsets thus selected. If all the null hypotheses happento be rejected, the selection procedure goes on eliminatingsubsets of genes resulting from the search algorithm, oth-erwise the procedure stops. The heuristic procedure thusdesigned mimics its univariate multiple testing (marginalhypotheses testing) counterpart with known properties[4], thereby ensuring an approximate control of the fam-ily-wise error rate (FWER).

Suppose that all tests are two-tailed and utilize the same

test statistic , then the following resampling algorithmcan be proposed:

Algorithm A3: Successive selection of differentially expressed gene combinations1. Form m permutation samples of sizes n1and n2, respec-tively, from n1+ n2 replicated observations (arrays). Foreach of the m permutation samples, run (without testing)the successive selection algorithm to find a preset numberI of disjoint sets. At each step of successive selection, anoptimal k-element set is identified by the two-stage cross-validated search algorithm and the corresponding m

sequences of -values are stored.

2. Returning to the original two-sample setting, find asequence of I optimal sets of the same size k and compute

the respective test statistics for the selected sets.

3. Apply the step-down multiple testing resampling algo-rithm by Westfall and Young [4] to the N-statistics result-ing from Steps 1 and 2. If the number of rejectedhypotheses is less than I then stop and declare all therejected sets of genes differentially expressed, otherwisereturn to Step 1 and continue successively selecting sets ofgenes. A faster version of Step 3 uses the single-step resa-mpling adjustment [4].

The above algorithm can be reformulated in terms of p-values. The algorithm is computationally more expensivethan its prototype presented in [5]. We used a SunFireV480 station to implement the algorithm. This "bruteforce" approach is needed to extract more informationfrom multivariate gene expression profiles.

With the above approach, no distributional assumptions

are needed although the test statistic is not distribu-tion free. For this statistic, however, it can be proven thatpermutations produce samples from a distribution that is,in some sense, the least favorable for rejecting an underly-ing composite null hypothesis. In other words, permuta-tions provide an optimal choice of a null distribution.More precisely, this theoretical result is valid for the resa-mpling (with replacement) analog of permutations, butregular (without replacement) permutations may be agood approximation to this resampling procedure if bothsamples under comparison are not too small. This con-cept and its mathematical framework is discussed atlength in our previous report [10].

For efficient nonparametric estimation of adjusted p-val-ues associated with sets of genes resulting from randomsearch, it is also desirable that the test statistic be scaleinvariant for any sample size. A statistic that meets thisrequirement is an empirical counterpart of the normal-ized distance Nnorm with a properly chosen kernel func-tion, see formula (2) and the succeeding explanation. Yetanother possibility is to use the kernel K1 with log-intensi-ties of gene expressions. We employed the latter pivotingstructure of the N-statistic in the analysis of simulated andbiological data presented in the subsequent sections.

Simulation studiesWe first tested our methodology by computer simula-tions. To this end, we designed a simulation study asfollows.

Two sets of data on 1,000 genes were simulated. For con-venience we will label them as "control" and "treatment"samples, respectively. The size of each sample was equalto 10. In the treatment group, the first 12 genes were set tobe differentially expressed. To simulate these genes, loga-rithms of gene expression signals were generated from amultivariate normal distribution with an exchangeable

ˆ *Nr

Page 4 of 10(page number not for citation purposes)

BMC Bioinformatics 2004, 5:164 http://www.biomedcentral.com/1471-2105/5/164

correlation structure. The algorithm designed to simulatesuch data is presented in the Appendix. The correlationcoefficient for all pairs of gene log-intensities was setequal to 0.6, while the standard deviation was chosen tobe either σ = 0.5 or σ = 1 for all individual genes. Themean log-expression values τ for the genes assigned to thetarget set of genes were specified as follows: τ = 5 for thefirst 4 genes (Subset 1), τ = 4 for the second group of 4genes (Subset 2), τ = 3 for the third group of 4 genes (Sub-set 3). The remainder of the genes (not differentiallyexpressed) were simulated as log-normally distributedrandom variables with τ = 1 and the same standard devi-ation (either σ = 0.5 or σ = 1) and correlation coefficient.The 1,000 genes in the control group were simulated justlike those that were not differentially expressed in thetreatment group.

Our search-and-testing procedure was applied to the datasets thus generated in order to see whether (and how fre-quently) it can find all subsets, as well as all individualgenes, included in the target set of differentially expressedgenes. In each experiment, the SRS algorithm was run withmultiple random starts. At each step of the successiveselection of genes, the algorithm sought for a subset of 4genes. The parameter I in Algorithm A3 was set equal to 5.Since the sole purpose of our simulations was to checkhow well a given algorithm finds a maximum of the N-sta-tistic over gene sets, no recourse to cross-validation wasmade in this study. The number of permutations was setat 200. Because such simulations are very time consumingthe experiment was repeated only 100 times. Two samples(control and treatment) were generated in each of the 100experiments.

First we tested the SRS algorithm with 8 random starts and2,500 search steps. When σ = 0.5 for the treatment groupthe algorithm was able to correctly recover Subset 1 in

82%, Subset 2 in 72%, and Subset 3 in 76% of simulationruns. The proportion of cases where all 12 genes were cor-rectly recovered (irrespective of the order they entered theselected subsets) was 61%. The false discovery rate,defined as the mean proportion of falsely discoveredgenes among the true differentially expressed genes, wasequal to 0.02.

When σ = 1 the SRS algorithm recovers Subset 1 in 76%,Subset 2 in 56%, and Subset 3 in 39% of the simulationruns. The proportion of cases where all 12 genes were cor-rectly recovered was 53%. The false discovery rate wasequal to 0.04.

As one would expect, the SRS algorithm performed betterwith 16 random starts and 3,600 search steps. For σ = 0.5,the rate of correct discovery becomes 100% for all threesets. For σ = 1 the algorithm correctly recovers Subset 1 in81%, Subset 2 in 65%, and Subset 3 in 48% of simulationruns. The proportion of cases where all 12 genes are cor-rectly recovered is 62%. However, the false discovery rateremains essentially the same as when running the SRSalgorithm with 8 starts and 2,500 search steps. The resultson individual simulated genes are presented in Table 1.

By way of comparison, we ran the Westfall and Youngalgorithm with a univariate counterpart of the test statisticN at the same level of FWER. While the results for σ = 0.5were identical (100% correct recovery), the univariatemethod recovered less genes (45%) in the target set whenwe set σ = 1. In the latter case, the univariate algorithmhad a uniformly lower correct discovery rate for genes #9through #12 (69%, 71%, 70%, 71%, respectively) in com-parison to the multivariate method (Table 1). One shouldnot expect much discrepancy between the univariate andmultivariate methods in these simulations because thealternative hypotheses were modeled in a univariate way.

Table 1: Proportions of correct discoveries for each gene in the target set.

Gene SRS: 8 starts, 2500 steps SRS: 16 starts, 3600 steps

Correct discovery µ = 1, σ = 0.5 Correct discovery µ = 1, σ = 1 Correct discovery µ = 1, σ = 0.5 Correct discovery µ = 1, σ = 1

1 100% 100% 100% 100%2 100% 100% 100% 100%3 100% 100% 100% 100%4 100% 100% 100% 100%5 100% 97% 100% 99%6 100% 99% 100% 100%7 100% 100% 100% 100%8 100% 100% 100% 100%9 99% 78% 100% 76%10 99% 76% 100% 78%11 96% 72% 100% 74%12 97% 74% 100% 72%

Page 5 of 10(page number not for citation purposes)

BMC Bioinformatics 2004, 5:164 http://www.biomedcentral.com/1471-2105/5/164

In another experiment we studied the simulated anneal-ing optimization (SAO) with one random start and thesame parameters of the simulation model. Although com-putationally expensive, the SAO algorithm is easier tohandle when tuning its parameters in simulation experi-ments. Proceeding from the less favorable case of σ = 1, wedetermined parameters of the SAO algorithm that providecorrect selection of all three sets of differentially expressedgenes in all simulation runs.

Another way of testing the two algorithms is to applythem in a situation where the true global maximum of theN-distance is known. We randomly selected 2000 genesfrom the data set discussed in the next section. All possiblepairs were formed from the 2000 genes and the corre-sponding N-statistic between the two samples (young ver-sus old mice) was computed for each pair. The data werenormalized before the analysis (see Section "Results andDiscussion"). Having determined a maximum value ofthe N-statistic over all pairs, we ran the SRS and SAO algo-rithms (with parameters suggested by our simulationexperiments) to see whether they could find the actualmaximum. Both algorithms hit the target.

ResultsThe biological purpose of our experimental study was tobetter understand age-related changes in gene expressionthat occur in mouse inner ear (including the organ ofCorti and stria vascularis). Since we do not expect numer-ous genes to be involved in the process of aging of theauditory system, this experimental system seems to beespecially promising for the use of multivariate methods.

Hearing loss or deafness affects about 10% of the U.S.population, or about 30 million people, most of themover age 60. Presbycusis – age-related hearing loss – is aprimary sensory problem in the elderly population, thenumber one communicative disorder, and one of the topthree chronic medical conditions affecting the aged. It isoften described as difficulty in understanding speech,especially in conditions of high ambient backgroundnoise. Most elderly persons have a reduction in hearingacuity. For example, cross-sectional and longitudinalstudies have consistently demonstrated gradually decreas-ing pure tone thresholds by cohort groups of elderly[13,14]. The composite audiometric pattern is one of bet-ter hearing for low- and mid-speech frequencies thanhigher speech frequencies. The consequence of this pat-tern is difficulty in hearing and understanding, not onlyconversational speech, but in particular, speech that issoftly spoken. In fact, a similar gradual reduction inspeech recognition for words and phonemes in quiet hasbeen shown to accompany the pure tone thresholddecrease in cohort groups of the elderly [14-16].

Much progress had been made in the field of auditoryaging research regarding sensitivity deficits and metabolicproblems of the cochlea. As humans and animals age,they lose sensory hair cells, 8th cranial nerve (i.e., vestib-ulocochlear) fibers, and develop stria vascularis/potas-sium recycling metabolic problems that degradeaudibility and spectral tuning [17-21].

In addition, the differing roles of the ear and brain in pres-bycusis, and aging deficits in speech understanding inbackground noise, and their respective neural bases arebeginning to be understood. Age effects in these areas aredistinguishable and age-related problems in the brain canbe influenced by the peripheral etiologies of presbycusis[22-24]. Considering studies completed to date, presbycu-sis in humans, and corresponding age-related hearing lossin animal models such as the CBA mouse, have two majorfacets: 1) A peripheral hearing loss of cochlear origin,starting with sensitivity losses in the high pitches (highfrequencies), involving loss of sensory hair cells, spiralganglion neurons (8th nerve fibers) and metabolic mal-functions of the highly vascularized stria vascularis organsystem that produces the potassium rich endolymph ofthe inner ear [25,26]; and 2) An inability to comprehendspeech in background noise, that results from deficits inthe inner ear and the central auditory nervous system[23,24]. For the animal model studies of presbycusis, theCBA mouse strain has been quite useful to date.

The goal of the present study is to explore the underlyingcochlear gene expression changes that may predispose orcause presbycusis. Common neurodegenerative diseasessuch as presbycusis are likely to be caused by several fun-damental problems that interact with each other and withenvironmental factors, including genetic pre-dispositionsto environmental insults, noise and ototoxic medications[27]. Although over a hundred genes have been identifiedthat cause congenital deafness (e.g. [28-30]), no candidategenes have yet been identified that are involved in humanpresbycusis. The present report attempts to gain some ini-tial insights into gene expression changes related to innerear problems that may predispose or cause age-relatedneurosensory disorders, such as age-related hearing loss –presbycusis, utilizing the CBA mouse strain.

The two groups of arrays under comparison included 9and 12 arrays, respectively (see the next section). The datawere normalized using the quantile normalizationmethod [11,12] carried out at the probe feature level.Compared to our simulations, the number of permuta-tions was increased to 400. Each search cycle in the SRSalgorithm proceeded in 45,000 steps with 100 randomstarts. The algorithm was tuned to search for a set of 5genes at each step of the successive selection procedure.We also changed parameters that control the efficiency of

Page 6 of 10(page number not for citation purposes)

BMC Bioinformatics 2004, 5:164 http://www.biomedcentral.com/1471-2105/5/164

the SAO algorithm to account for an increased dimen-sionality of the problem. The latter algorithm also soughtfor sets consisting of 5 genes. We used the followingparameter values in the combined two-stage cross-vali-dated search algorithm: I = 5, u1 = 4 (out of 9 arrays), u2 =6 (out of 12 arrays), v = 10, l1 = 4 (out of 9 arrays), l2 = 6(out of 12 arrays), r = 200.

Although the lists of genes produced by both algorithmsare quite similar, there are still some discrepanciesbetween them which may be attributed to the choice ofparameters for each method. Since the SAO algorithm isless sensitive to the choice of the initial gene combination,we present only the results obtained with this algorithm.In the "young" versus "old" comparison, the procedureselected two sets of 5 genes with an adjusted p-value of lessthan 0.05. For comparison, we applied the Wesfall andYoung step-down multiple testing procedure with a uni-

variate counterpart of as the test statistic. This methodselects only 6 genes at the same FWER; all of them appearamong those genes that have been selected by the multi-variate search-and-testing procedure. The final list of 10genes was evaluated further for consistency with the exist-ing biological knowledge.

DiscussionOf the 10 identified genes (from 2 sets) exhibiting majorexpression changes with age, there are 6 differentiallyexpressed genes having to do with immune system func-tion. This is important from an aging point of view for tworeasons. First, immunoprecipitations or immunoproductscan be damaging to nerve cells, and have been implicatedas being responsible for age-related neurodegeneration inthe brain in general, and in Alzheimer's disease specifi-cally, but this is a new finding for the cochlea and age-related hearing loss – presbycusis. Second, autoimmuneproblems, where the immune system starts attacking itsown nerve cells, is another leading candidate for a causa-tive factor in neurodegenerative aging conditions. Theseimmune products are likely to come from the vascularsupply to the cochlea, yet may be a causative componentfor age-related hearing loss due to the resultant damage tothe cochlea sensory cells.

There are 3 genes having to do with post-translational pro-tein changes, including protein binding properties, withtwo of these genes involved in carbohydrate metabolism(sugar/glucose binding in mitochondria for cellular respi-ration). These genes are related to the production of reac-tive oxygen species (ROS), which damage nerve cells, andhave been implicated in age-related neurodegenerativedisorders, and in cases of cochlear sensorineural hearingloss. For example, problems in cellular respiration canlead to accumulation of toxic intracellular substances,causing damage to sensory cell structures and abnormal

metabolic processing along with increased levels of ROS[31-33].

The last gene, involved in mammary gland functioning,showed a significant increase with age. A closer inspectionof the expression levels for this gene have shown that theobserved effect cannot be attributed to the presence ofoutliers in the data. Although not directly involved in sen-sory functioning, this gene may change its espression aspart of general degenerative processes in inner ear. Anerror in this gene annotation cannot be ruled out as well.This observation is definitely worth another look.

The above-described initial observations are quite provoc-ative, in that we have several groupings of genes that haveimportant functional significance for aging and hearing,including important aspects of cochlear, inner ear func-tioning. These animal-model gene-array investigations arequite useful for guiding human genetics experimentsaimed at identifying candidate genes involved in the sus-ceptibility and progression of human age-related hearingloss and other age-dependent neurosensory disorders.

Regarding methodological aspects of this paper, we wouldlike to note that a pertinent multivariate method for selec-tion of differentially expressed genes should include twocomponents: finding subsets of candidate genes thatjointly separate the classes (states) under comparison andtesting statistical significance of this separation; the latterdoes not necessarily refer to characteristics of a classifica-tion (allocation) rule such as classification error rates. Wealso would like to stress that the problem of significancetesting in the multivariate formulation is not equivalentto the problem of statistical classification (supervisedlearning). While closely related, these problems are fun-damentally different. For example, the use of the classifi-cation error rate as a criterion for selection of importantvariables is appropriate where the aim is to form a discri-minant rule for the subsequent outright allocation ofunclassified samples to one of the known classes. A verygood separation between classes can sometimes be pro-vided by looking at a single feature variable (gene) so thatthe classification error rate is difficult to reduce further byincluding other (probably quite significant) variables inthe rule. However, one would like to keep the chance ofmissing other interesting variables to a minimum. Theproblem dealt with in this paper is not that of classifica-tion or prediction. Our method is designed to find genecombinations that change in concert (as a set) theirexpression due to some biological factors. The problemthus formulated reduces to that of significance testing.

It must be emphasized that our method is designed notonly to identify sets of genes whose interrelationships dif-fer but also those genes with marginal effects. More

Page 7 of 10(page number not for citation purposes)

BMC Bioinformatics 2004, 5:164 http://www.biomedcentral.com/1471-2105/5/164

importantly, the method seeks to provide an alternativeway of making a specific FWER-based multiple testingprocedure less conservative and, to some extent, lessdependent on the subset pivotality requirement (see [4]for definition), by extracting more information from thedata. In addition, this approach can be used for rankingand clustering those genes that have been declared differ-entially expressed by univariate methods.

ConclusionsA new algorithm for identifying differentially expressedgene combinations has been developed. This algorithm isbuilt on the earlier proposed multivariate test statistic [6]and successive selection of differentially expressed sets ofgenes [5]. The algorithm includes an improved randomsearch procedure designed to generate candidate genecombinations of a given size. Cross-validation is used toprovide replication stability of the search procedure. Apermutation two-sample test is used for significance test-ing. We design a multiple testing procedure to control thefamily-wise error rate when selecting significant combina-tions of genes that result from a successive selection pro-cedure. A target set of genes is composed of all significantcombinations selected via random search. The perform-ance of the proposed search-and-testing procedure hasbeen evaluated by computer simulations and analysis ofreplicated Affymetrix gene array data on age-relatedchanges in gene expression in the inner ear of CBA mice.

MethodsSubjectsCBA mice from the University of Rochester vivariumserved as subjects for this study who had similar environ-mental, non-ototoxic life histories. Subjects were mice ofthe following age groups: Young adult (N = 9, 3–4months) and old (N = 12, 24–33 months). All animalprocedures were approved the University of RochesterCommittee on Animal Resources.

Cochlear dissectionsSubject groups of the present report had extensive behav-ioral and neurophysiological hearing testing prior to sac-rifice, verifying that the old mice had age-related hearingloss. Mice were sacrificed by cervical dislocation. Thenboth cochleae for each mouse were immediately dissectedusing a Zeiss stereomicroscope. The cochleae were placedin cold saline for micro dissection of the cochlear parti-tion (basilar membrane, organ of Corti and spiral liga-ment), and were then placed in cold Trizol. A detailedprotocol for Trizol can be found at http://www.fgc.urmc.rochester.edu. All samples were stored at -80°C for microarray gene expression processing.

Gene expression microarraysThe RNA quality was assessed by electrophoresis using theAgilent Bioanalyzer 2100. Between 200 ng and 2 ug oftotal RNA from each sample was used to generate a highfidelity cDNA, which was modified at the 3' end to con-tain an initiation site for T7 RNA polymerase, while 1 ugof cDNA was used in an in vitro transcription (IVT). 20 ugof full-length cRNA, from each mouse (age groups asdescribed above), was fragmented. After fragmentation,the cDNA, full-length cRNA, and fragmented cRNA wereanalyzed by electrophoresis using the Agilent Bioanalyzer2100 to assess the appropriate size distribution prior tomicroarray hybridization. Detailed protocols for samplepreparation using the Ambion MessageAmp protocol canbe found at http://www.ambion.com. Affymetrix M430AHigh density oligonucleotide array set (A) which queried20,000 murine probe sets was used. Each gene on the sub-array is represented by 11 pairs of 25 mer oligonucleotidesthat span the coding region for the 20,000 genes and ESTsrepresented (clear overlapping of genes is evident). Eachprobe pair consists of a perfect match (PM) sequence thatis complementary to the cDNA target, and a miss-match(MM) sequence that has a single base pair mutation in aregion critical for target hybridization; this sequenceserves as a control for non-specific hybridization. Stainingand washing of all arrays was performed in the Affymetrixfluidics module per manufacturer's protocol. Streptavidinphycroerythrin stain (SAPE, Molecular Probes) was thefluorescent conjugate used to detect hybridized targetsequences. All arrays in this study were assessed for "arrayperformance" prior to data analysis.

Methods for data analysis and computer simulationsThe methodology of data analysis and design of computersimulations have been described at length in the preced-ing sections. The relevant software for data analysis andsimulations is included in the 1 [see the folder "Multivar-iateSearch"]. Here we supplement this information with adescription of the generator of multivariate exchangeablenormal random vectors which we used in oursimulations.

Suppose we want to generate a normal random vector Xin Rd with mean vector M ∈ Rd and covariance matrix Σwhose entries are σ2 and ρσ2 on and off diagonal,respectively. It is well-known that X can be represented inthe form

X = M + CZ,

where Z is the standard normal vector with mean 0 in Rd

and C is a d × d matrix with CCT = Σ. (Here CT denotes thetranspose of C.) The matrix C may be chosen symmetricand can be computed using well-known algebraic proce-dures. However, our matrix Σ has a special structure:

Page 8 of 10(page number not for citation purposes)

BMC Bioinformatics 2004, 5:164 http://www.biomedcentral.com/1471-2105/5/164

Σ = (1 - ρ)σ2Id + ρσ21d × d,

where Id is a unit matrix of size d and 1d × d is a squarematrix with all the d2 entries being equal to 1. Using thiswe look for C of the same form:

C = αId + β1d × d.

From the relations C2 = Σ and we have α2 =

σ2(1 - ρ), 2αβ = ρσ2, so that

Authors' contributionsYX is responsible for the computational component ofthis study. He also participated in the methodology devel-opment. LK, AG, and AY have equally contributed to var-ious methodological aspects of the proposed multivariateanalysis. RF provided experimental data and biologicalinterpretation of the net results of data analysis.

Additional material

AcknowledgementsWe thank Dr. Andrew Brooks, Dr. Mary D'Souza, Dr. Xiaoxia Zhu, Martha Erhardt, John Housel and Cristine Brower for technical assistance. Method-ological discussions with Dr. Anthony Almudevar are greatly appreciated. We are grateful to anonymous reviewers whose comments have helped us improve the manuscript. The research is supported by NIH Grants P01 AG09524 from the National Institute on Aging, P30 DC05409 from the National Institute on Deafness & Communication Disorders, and the Inter-national Center for Hearing & Speech Research, Rochester, NY.

References1. Almudevar A: A simulated annealing algorithm for maximum

likelihood pedigree reconstruction. Theoretical Population Biology2003, 63:63-75.

2. Dudoit S, Shaffer JP, Boldrick JC: Multiple hypothesis testing inmicroarray experiments. Statistical Science 2003, 18:71-103.

3. Pesarin F: Multivariate Permutation Tests: With Applicationsin Biostatistics. Wiley, Chichester; 2001.

4. Westfall PH, Young S: Resampling-Based Multiple Testing.Wiley, New York; 1993.

5. Szabo A, Boucher K, Jones D, Klebanov L, Tsodikov A, Yakovlev A:Multivariate exploratory tools for microarray data analysis.Biostatistics 2003, 4:555-567.

6. Szabo A, Boucher K, Carroll W, Klebanov L, Tsodikov A, Yakovlev A:Variable selection and pattern recognition with gene expres-sion data generated by the microarray technology. Mathemat-ical Biosciences 2002, 176:71-98.

7. Zinger AA, Klebanov LB, Kakosyan AV: Characterization of distri-butions by mean values of statistics in connection with someprobability metrics. In In: Stability Problems for Stochastic ModelsVNIISI Moscow; 1999:47-55.

8. Chilingaryan A, Gevorgyan N, Vardanyan A, Jones D, Szabo A: Mul-tivarite approach for selecting sets of differentiallyexpressed genes. Mathematical Biosciences 2002, 176(1):59-69.

9. Ambroise C, McLachlan GJ: Selection bias in gene extraction onthe basis of microarray gene-expression data. Proceedings of theNational Academy of Sciences USA 2002, 99:6562-6566.

10. Klebanov L, Gordon A, Xiao Y, Land H, Yakovlev A: A new test sta-tistic for testing two-sample hypotheses in microarray dataanalysis. Technical Report 2004 [http://www.urmc.rochester.edu/smd/biostat/people/faculty/andrei.htm]. Department of Biostatisticsand Computational Biology University of Rochester

11. Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison ofnormalization methods for high density oligonucleotidearray data based on variance and bias. Bioinformatics 2003,19(2):185-193.

12. Irizarry RA, Gautier L, Cope LM: An R package for analyses ofAffymetrix oligonucleotide arrays. In In: The Analysis of GeneExpression Data Edited by: Parmigiani G, Garrett ES, Irizarry RA, ZegerSL. Springer, New York; 2003:102-119.

13. Corso JF: Age correction factor in noise-induced hearing loss:a quantitative model. Audiology 1980, 19:221-232.

14. Gates GA, Caspary DM, Clark W, Pillsbury HC 3rd, Brown SC, DobieRA: Presbycusis. Otolaryngol Head Neck Surg 1989, 100:266-271.

15. Gelfand SA, Piper N, Silman S: Consonant recognition in quiet asa function of aging among normal hearing subjects. J AcoustSoc Am 1985, 78:1198-1206.

16. Gelfand SA, Piper N, Silman S: Consonant recognition in quietand in noise with aging among normal hearing listeners. JAcoust Soc Am 1986, 80:1589-1598.

17. Lonsbury-Martin BL, Cutler WM, Martin GK: Evidence for theinfluence of aging on distortion product otoacoustic emis-sions in humans. J Acoust Soc Am 1991, 89:1749-1759.

18. Lonsbury-Martin BL, Martin GK, Probst R, Coats AC: Acoustic dis-tortion products in rabbit ear canal. I. Basic features andphysiological vulnerability. Hear Res 1987, 28:173-189.

19. Probst R, Lonsbury-Martin BL, Martin GK: A review of otoacousticemissions. J Acoust Soc Am 1991, 89:2027-2067.

20. Willott JF: Effects of aging, hearing loss, and anatomical loca-tion on thresholds of inferior colliculus neurons in C57BL/6and CBA mice. J Neurophysiol 1986, 56:391-408.

21. Willott JF: Aging and the auditory system: Anatomy, physiol-ogy, and psychophysics. Singular Publishing Group, San Diego;1991.

22. Frisina DR, Frisina RD: Speech recognition in noise and presby-cusis: relations to possible neural mechanisms. Hear Res 1997,106:95-104.

23. Frisina DR, Frisina RD, Snell KB, Burkard R, Walton JP, Ison JR: Audi-tory temporal processing during aging. In In: Functional Neurobi-ology of Aging Edited by: Hof PR, Mobbs CV. Academic Press, SanDiego; 2001:565-579.

24. Frisina RD: Anatomical and neurochemical bases of presbycu-sis. In In: Functional Neurobiology of Aging Edited by: Hof PR, Mobbs CV.Academic Press, San Diego; 2001:531-547.

25. Jacobson M, Kim SH, Romney J, Zhu X, Frisina RD: Contralateralsuppression of distortion-product otoacoustic emissions

Additional File 1The additional folder "MultivariateSearch" includes the following three sub-folders: 1. SAO _Simulation 2. SRS_Simulation 3. TSSearch Each subfolder contains a Unix executable file. The executable file "SASearch" implements the algorithm based on simulated annealing optimization. The executable file "SRSearch" implement the version based on simple random search. The exectuable file "TSSearch" for the two-stage search is is located in the sub-folder "TSSearch". Each sub-folder also contains two input files. The file "simulation04_UI.txt" is an input file for data analy-sis. Suppose the data file is named xxxx.marr, then the input file should be named as xxxx_UI.txt. To analyze the data from the file xxxx.marr, type: [Executable file] xxxx or [Executable file] 0 xxxx. The input file "simulation04_ui.txt" is designed for simulation experiments. To conduct simulations, one has to prepare an input file with the name: XXX_simu_ui.txt, where XXX is a string that follows the naming conven-tion of computer files. An input file for data analysis with the name XXX_ui.txt is also needed. To run simulations, type: [executable file] 1 xxxx.Click here for file[http://www.biomedcentral.com/content/supplementary/1471-2105-5-164-S1.zip]

1 1d d d dd× ×=2

α σ ρ βρσ

α α ρσ= − =

+ +1

2

2 2, .

d

Page 9 of 10(page number not for citation purposes)

BMC Bioinformatics 2004, 5:164 http://www.biomedcentral.com/1471-2105/5/164

Publish with BioMed Central and every scientist can read your work free of charge

"BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime."

Sir Paul Nurse, Cancer Research UK

Your research papers will be:

available free of charge to the entire biomedical community

peer reviewed and published immediately upon acceptance

cited in PubMed and archived on PubMed Central

yours — you keep the copyright

Submit your manuscript here:http://www.biomedcentral.com/info/publishing_adv.asp

BioMedcentral

declines with age: A comparison of findings in CBA micewith human listeners. Laryngoscope 2003, 113:1707-1713.

26. Guimaraes P, Zhu X, Cannon T, Kim SH, Frisina RD: Sex differ-ences in distortion product otoacoustic emissions as a func-tion of age in CBA mice. Hear Res 2004 in press.

27. Gates GA, Couropmitree NN, Myers RH: Genetic associations inage-related hearing thresholds. Arch Otolaryngol Head Neck Surg1999, 125:654-659.

28. Kelley PM, Harris DJ, Comer BC, Askew JW, Fowler T, Smith SD,Kimberling WJ: Novel mutations in the connexin 26 gene(GJB2) that cause autosomal recessive (DFNB1) hearingloss. Am J Hum Genet 1998, 62:792-799.

29. Kelsell DP, Dunlop J, Stevens HP, Lench NJ, Liang JN, Parry G, MuellerRF, Leigh IM: Connexin 26 mutations in hereditary non-syn-dromic sensorineural deafness. Nature 1997, 387:80-83.

30. Kikuchi T, Adams JC, Miyabe Y, So E, Kobayashi T: Potassium ionrecycling pathway via gap junction systems in the mamma-lian cochlea and its interruption in hereditary nonsyndromicdeafness. Med Electron Microsc 2000, 33:51-56.

31. Manna SK, Zhang HJ, Yan T, Oberley LW, Aggarwal BB: Over-expression of manganese superoxide dismutase suppressestumor necrosis factor-induced apoptosis and activation ofNF- and activated protein-1. J Biol Chem 1998, 273:132-145.

32. Frisina ST, Mapes F, Kim SH, Frisina DR, Frisina RD: Comprehen-sive characterization of hearing loss in aged diabetics. Paperpresented at Society for Neuroscience 33rd Annual Meeting. New Orleans,LA 2003.

33. Gries A, Herr A, Kirsch S, Gunther C, Weber S, Szabo G, HolzmannA, Bottiger BW, Martin E: Inhaled nitric oxide inhibits platelet-leukocyte interactions in patients with acute respiratory dis-tress syndrome. Crit Care Med 2003, 31:1697-170.

Page 10 of 10(page number not for citation purposes)