Next generation modeling in GWAS: comparing different genetic architectures

Human Genetics (Hum Genet), ISSN 0340-6717, DOI 10.1007/s00439-014-1461-1


Original Investigation


Evangelina López de Maturana · Noelia Ibáñez‑Escriche · Óscar González‑Recio · Gaëlle Marenne · Hossein Mehrban · Stephen J. Chanock · Michael E. Goddard · Núria Malats

Received: 11 March 2014 / Accepted: 5 June 2014 © Springer-Verlag Berlin Heidelberg 2014

Abstract The continuous advancement in genotyping technology has not been accompanied by the application of innovative statistical methods, such as multi-marker methods (MMM), to unravel genetic associations with complex traits. Although the performance of MMM has been widely explored in a prediction context, little is known about their behavior in quantitative trait loci (QTL) detection under complex genetic architectures. We shed light on this still open question by applying Bayes A (BA) and Bayesian LASSO (BL) to simulated and real data. Both methods were compared to single marker regression (SMR). Simulated data were generated in the context of six scenarios differing in effect size, minor allele frequency (MAF) and linkage disequilibrium (LD) between QTLs. These were based on real SNP genotypes in chromosome 21 from the Spanish Bladder Cancer Study. We show how the genetic architecture dramatically affects the behavior of the methods in terms of power, type I error and accuracy of estimates. Markers with high MAF are easier to detect by all methods, especially if they have a large effect on the phenotypic trait. A high LD between QTLs with either large or small effects affects the power of the methods differently: it impairs QTL detection with BA, irrespective of the effect size, but boosts the detection of small effects with BL and SMR. We demonstrate the convenience of applying MMM rather than SMR because of their larger power and smaller type I error. Results from real data when applying MMM suggest novel associations not detected by SMR.

Electronic supplementary material The online version of this article (doi:10.1007/s00439-014-1461-1) contains supplementary material, which is available to authorized users.

E. López de Maturana (*) · G. Marenne · N. Malats
Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), C/Melchor Fernández Almagro, 3, 28029 Madrid, Spain
e-mail: [email protected]

N. Ibáñez-Escriche · H. Mehrban
Genética i Millora Animal, Research Institute and Agricultural Technology (IRTA), Avd. Alcalde Rovira Roure 191, 25198 Lleida, Spain

Ó. González-Recio · M. E. Goddard
Biosciences Research Division, Department of Environment and Primary Industries, AgriBio, 5 Ring Road, Bundoora, VIC 3083, Australia

Ó. González-Recio
Dairy Futures Cooperative Research Centre, Bundoora, VIC 3083, Australia

H. Mehrban
Department of Animal Science, University of Shahrekord, P.O. Box 115, 88186-34141 Shahrekord, Iran

S. J. Chanock
Division of Cancer Epidemiology and Genetics, Department of Health and Human Services, National Cancer Institute, Bethesda, MD, USA

M. E. Goddard
Department of Food and Agricultural Systems, University of Melbourne, Melbourne, Australia

Introduction

One of the goals in biomedical science is to identify and understand the relationship between phenotypes and genotypes. The technological advances in high-throughput genotyping and sequencing technologies allow the genotyping of hundreds of thousands of genetic markers in the genome. These resources have motivated human phenotype gene discovery through genome-wide association studies (GWAS). In less than a decade, GWAS have advanced from testing thousands to several millions of SNPs in increasingly large sample sizes. The most common analytical procedure used in GWAS is the so-called single marker regression (SMR), in which the association between the frequency of each of hundreds of thousands of common variants and a given phenotype is tested one by one. Only SNPs, or variants, that exceed a conservative genome-wide threshold for association (usually $p < 5 \times 10^{-8}$) are called significant, and they are then usually tested for evidence of validation in independent studies. However, much work needs to be done to understand the full contribution of the genome to the human phenotype, especially in complex traits that are mainly regulated by many genes with small effect. For these traits, the "common disease–common variant" model that dominated the GWAS approach has been reconsidered in the light of the "missing heritability problem", that is, loci detected by GWAS explain, almost without exception, a very small fraction of the inferred genetic variance (Gibson 2012; Maher 2008; Manolio et al. 2009). For example, over 180 loci have been identified as associated with height in Caucasian populations (Lango Allen et al. 2010; Weedon et al. 2008), which cumulatively explain 10 % of the variation, whereas the total estimated heritable variation is 80–90 % (Hirschhorn et al. 2001; Sale et al. 2005). When evidence suggests that many genetic variants drive phenotypic variation, SMR becomes under-powered to map the location of most of these causal variants. Yang et al. (2010) and Makowsky et al. (2011) demonstrated that the use of all markers simultaneously captured a much higher percentage of the genetic variance of height. Meuwissen et al. (2001) were pioneers in the joint use of whole-genome markers in a Bayesian linear regression model, and in dealing with the "large p small n" problem that arises when the number of markers (p) greatly exceeds the number of individuals (n).
In the past few years, whole-genome enabled prediction (WGP) has received much attention in animal and plant breeding (see de los Campos et al. 2013 for a literature review), and more recently in human genetics (Makowsky et al. 2011; Vazquez et al. 2012). Two of the methods frequently used in WGP are Bayes A and Bayesian LASSO (hereinafter BA and BL). Linkage disequilibrium (LD) span, trait heritability, marker density, size of the training data set, and the statistical model have been identified as the main factors affecting the prediction accuracy of WGP in simulation studies (de los Campos et al. 2013), although real data analyses have not always confirmed those differences. Bayesian shrinkage methods (also denoted as the Bayesian alphabet by Gianola et al. 2009) are common tools in genomic prediction. However, their use in rigorous decision-making in the context of quantitative trait loci (QTL) detection is not well established (Heaton and Scott 2010). Some studies have compared those methods to find genetic associations in simulations with a low number of QTLs with relatively large effects (e.g., Yi and Xu 2008), but a full comparison of their performance in detecting associations in scenarios mimicking the genetic architecture of complex traits is still lacking. In addition, decision-making regarding true and false signals remains an open problem (Mutshinda and Sillanpaa 2012), and it is unclear to what extent the genetic architecture (e.g., QTL effect size, LD span, allele frequency) may affect the behavior of the QTL detection methods in terms of power and type I error.

The objective of this paper is to compare the performances of SMR, BA and BL when facing different genetic architectures to detect polymorphisms associated with a given trait. The detection rule in the Bayesian shrinkage methods was the permutation within Markov chain Monte Carlo algorithm (McMC) (Che and Xu 2010). We used simulated data representing six scenarios with different genetic architectures regarding MAF, magnitude of the effect, and LD between QTLs with heterogeneous effects, and evaluated the performance of the selected methods in terms of power, type I error (T1E), positive predictive value (PPV), and accuracy of the estimates of the QTL effects. These methods were also applied to real data to identify associations with urothelial carcinoma of the bladder (UCB), a paradigm of a complex disease mostly environmentally driven, where low penetrance genetic variants may play an important role. Finally, we interpreted the results from real data according to those obtained in the simulation.

Materials and methods

Ethics statement

Informed consent was obtained from study participants in accordance with the Institutional Review Board of the US National Cancer Institute and the Ethics Committees of each participating hospital.

Real genetic data

We restricted the analysis to SNPs located on chromosome 21 from genotyping 2,229 unrelated individuals (1,134 cases and 1,095 controls) included in the SBC/EPICURO study. This is a hospital-based case–control study conducted during 1998–2001 in 18 hospitals in five Spanish areas (Asturias, Barcelona metropolitan area, Vallès/Bages, Alicante and Tenerife); see Garcia-Closas et al. (2005) for further details. Individuals were genotyped with the Illumina Infinium HumanHap1M array at the Core Genotyping Facility, National Cancer Institute, USA (Rothman et al. 2010), and passed the quality control procedures, which included checks for low genotyping rate (<95 %), MAF less than 0.02 in either UCB cases or controls, and Hardy–Weinberg equilibrium p value $<10^{-5}$ in controls. In order to eliminate redundant information, we selected one or more representative SNPs per LD block, taking $r^2 < 0.8$ as the threshold and prioritizing the SNPs with fewer missing data. Missing genotypes were imputed with the Random Forest method in the package randomForest version 4.6-2 in R 2.11 (Foulkes 2009). A total of 7,329 SNPs from chromosome 21 passed the quality control procedures and were used in the analyses.

Simulated data

A simulation study based on the genotypes previously described was designed to exploit the real genomic structure. We investigated the performance of the three statistical methods under different genetic architectures regarding the effect size, the MAF of the markers, and the LD between the large- and small-effect QTLs.

Although, in principle, a QTL is usually not located exactly at a marker position, for the sake of simplicity we selected markers to approximate each true QTL. Briefly, 320 additive QTLs were simulated, 20 with large effect (5 or −5 units) and 300 with small effect randomly generated from a gamma distribution Γ(0.5, 0.5) (see supplementary Fig. S1). Large and small effects were randomly assigned to markers according to the six scenarios described in Table 1. Thus, the MAF of the QTLs was defined as either high (MAF >0.3) or low (MAF <0.2), and the LD between the large- and small-effect QTLs was either null [linkage equilibrium (LE), $r^2 < 0.000000001$] or high ($r^2 > 0.7$) (see Fig. S2 for a scatterplot of the MAF of SNPs in high LD/LE).

Note that, although the effect size of the QTL ($b_k$) is the same, the proportion of the genetic variance explained by each large/tiny QTL ($V_{g_k}$) differed in each scenario (see Fig. 1) because of the different MAF ($p_k$). It is calculated as $V_{g_k} = \sigma^2_{g_k} / \sigma^2_g$, where $\sigma^2_{g_k} = 2p_k(1 - p_k)b_k^2$ and $\sigma^2_g = \sum_{k=1}^{p} 2p_k(1 - p_k)b_k^2$, assuming LE among markers.
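The dependence of the explained variance on MAF can be illustrated with a minimal sketch of the formula above. The function name and the toy MAF and effect values are assumptions made for illustration only; they are not taken from the study.

```python
def variance_proportions(mafs, effects):
    """Return V_gk = sigma2_gk / sigma2_g for every QTL, assuming LE.

    mafs    -- minor allele frequencies p_k (hypothetical inputs)
    effects -- additive effects b_k (hypothetical inputs)
    """
    if len(mafs) != len(effects):
        raise ValueError("mafs and effects must have the same length")
    # Per-QTL additive variance: 2 * p_k * (1 - p_k) * b_k^2
    per_qtl = [2.0 * p * (1.0 - p) * b * b for p, b in zip(mafs, effects)]
    total = sum(per_qtl)  # sigma2_g under linkage equilibrium
    return [v / total for v in per_qtl]


# Toy example: two QTLs with the same effect size but different MAF.
props = variance_proportions([0.4, 0.1], [5.0, 5.0])
```

With equal effects, the QTL with the higher MAF accounts for the larger share of the genetic variance, which is the point made in the paragraph above.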

For each scenario, we simulated a continuous phenotype to maintain the estimated genetic effects on the scale of the simulated values, so that we could appreciate the extent of the model-induced shrinkage on individual locus effects. The model was as follows:

$$y_i = x_i b + e_i,$$

where $b$ corresponds to the vector of p = 320 additive SNP effects, which was kept fixed across the replicates in each scenario; $x_i$ (i = 1, …, 2,229) is a vector containing the number of minor alleles for each of the p SNPs in individual i; and $e_i$ corresponds to the error term, which was assumed to be distributed as $e_i \sim N\left(0, \sigma^2_g / h^2 - \sigma^2_g\right)$, where $h^2$ is the heritability of the trait (assumed to be 0.5) and $\sigma^2_g$ is the additive genetic variance, calculated assuming LE as $\sigma^2_g = \sum_{k=1}^{p} 2p_k(1 - p_k)b_k^2$.
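As a rough illustration of the simulation model, the following sketch generates phenotypes from minor-allele counts with the error variance implied by the heritability. It is not the authors' code: the function names, the seed handling and the toy genotypes are assumptions, and the real study used the observed chromosome 21 genotypes rather than toy inputs.

```python
import random


def additive_variance(mafs, effects):
    """sigma2_g = sum_k 2 p_k (1 - p_k) b_k^2, assuming LE among markers."""
    return sum(2.0 * p * (1.0 - p) * b * b for p, b in zip(mafs, effects))


def residual_variance(sigma2_g, h2):
    """Error variance implied by heritability h2: sigma2_g / h2 - sigma2_g."""
    if not 0.0 < h2 <= 1.0:
        raise ValueError("h2 must lie in (0, 1]")
    return sigma2_g / h2 - sigma2_g


def simulate_phenotypes(genotypes, effects, mafs, h2=0.5, seed=1):
    """Return y_i = x_i b + e_i for each row x_i of minor-allele counts."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    sd_e = residual_variance(additive_variance(mafs, effects), h2) ** 0.5
    return [
        sum(x * b for x, b in zip(row, effects)) + rng.gauss(0.0, sd_e)
        for row in genotypes
    ]


# Toy example (hypothetical values): 3 individuals, 2 QTLs.
toy_genotypes = [[2, 0], [1, 1], [0, 2]]
toy_y = simulate_phenotypes(toy_genotypes, [5.0, -5.0], [0.4, 0.1])
```

With $h^2 = 0.5$ the residual variance equals the additive genetic variance, so half of the phenotypic variance is genetic by construction.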

Twenty-five replicates were generated under each scenario, keeping fixed the assignment of markers as causal SNPs in order to simulate fixed scenarios in terms of genetic architecture. Each replicate comprised a sample of 1,000 individuals from the initial simulated dataset of 2,229 individuals.

Statistical methods

The simulated data for each scenario were analyzed by fitting two different regularized Bayesian models using all markers simultaneously: Bayes A (BA) (Meuwissen et al. 2001) and Bayesian LASSO (BL) (Park and Casella 2008). The common approach, SMR, was used as a benchmark. The MMMs differ in the prior density of the marker effects, which results in different regularization of marker estimates. Both BA and BL are characterized by thick-tailed priors, the scaled t and the double exponential (DE), respectively. These densities have higher mass at 0, which shrinks toward 0 the estimates of markers with small effects, and thicker tails, which induce less shrinkage of markers with larger effects. The thick-tailed densities are commonly represented as infinite mixtures of scaled normal densities (Andrews and Mallows 1974) of the form:

$$p(\beta_j \mid \omega) = \int_0^{\infty} N\left(\beta_j \mid 0, \sigma^2_{\beta_j}\right) p\left(\sigma^2_{\beta_j} \mid \omega\right) d\sigma^2_{\beta_j},$$

where $p(\sigma^2_{\beta_j} \mid \omega)$ is a scaled inverse Chi-square density [$\text{Inv-}\chi^2(4.01, 0.016)$, scaled t prior for marker effects] in BA and an exponential density [double exponential prior for marker effects, $\prod_{j=1}^{m} N(\beta_j \mid 0, \tau_j^2 \sigma_\varepsilon^2) \times \prod_{j=1}^{m} \mathrm{Exp}(\tau_j^2 \mid \lambda)$] in BL. Parameter $\lambda$ in BL controls the shape of the prior distribution assigned to $\tau_j^{-2}$, assigning more density to small values of $\tau_j$ than to large ones, and follows, a priori, a Gamma distribution $G(\lambda^2 \mid 10, 0.75)$. With this representation, the fully conditional densities of marker effects and those of their conditional variances have closed forms (de los Campos et al. 2013).

Table 1 Description of the simulated scenarios regarding minor allele frequency (MAF), effect size and linkage disequilibrium (LD)

Scenario  MAF of large-effect QTLs  MAF of small-effect QTLs  LD
Sc. 1     High                      High                      High
Sc. 2     Low                       Low                       High
Sc. 3     High                      High                      Nil
Sc. 4     High                      Low                       Nil
Sc. 5     Low                       High                      Nil
Sc. 6     Low                       Low                       Nil

High MAF refers to markers with MAF >0.3, whereas low MAF corresponds to those with MAF <0.2

Real data were analyzed using BA, BL and SMR adapted to find associations with a binary outcome (disease status). Threshold models (Wright 1934) coupled with BA and BL for the multi-SNP approach (González-Recio et al. 2009) and logistic regression for the SMR were implemented. All models also included environmental factors that affect UCB risk: region (5 categories), gender, age (included as a continuous covariate) and smoking status (4 levels: never smokers, if they had smoked fewer than 100 cigarettes in their lifetime; occasional smokers, if they had smoked at least one cigarette per day for less than 6 months; former smokers, if they had smoked regularly but stopped smoking more than 1 year before the study inclusion date; and current smokers, if they had smoked regularly within a year of the inclusion date). No correction for population stratification was made in either SMR or MMM because, although participants were selected from different regions of Spain, no subpopulations were apparent after a principal component analysis (results not shown). In addition, population structure does not seem to affect the performance of MMM (Karkkainen and Sillanpaa 2012).

We fitted each Bayesian statistical model to the 25 replicates for each simulated scenario and to the real datasets. Posterior distributions of the SNP effects were approximated using a McMC algorithm with 20,000 samples, discarding the first 10,000 as burn-in.

Fig. 1 Box plots of the proportion of the genetic variance explained by a large- and b small-effect QTLs in each scenario (see Table 1 for a definition of the scenarios)

A permutation within McMC chain approach (Che and Xu 2010) was used to determine the markers that were associated with the phenotype for the Bayesian shrinkage models. This strategy speeds up the process of performing a permutation test and consists of obtaining the null distribution of each SNP effect by permuting the phenotypes in every iteration within the McMC algorithm before the next round of sampling. Markers were considered as associated with the trait if $\max(a, 1 - a) > t$, where $a = \int_{-\infty}^{\hat{\beta}_p} p\left(\beta_p \mid y_{permu}, \beta_{-p}\right) d\beta_p$, $\hat{\beta}_p$ is the posterior mean of SNP p after analyzing the original data, and $p(\beta_p \mid y_{permu})$ is the posterior distribution of the marker effect given the permuted data ($y_{permu}$). Four values for t were considered: 0.8, 0.85, 0.9, and 0.95. Although this approach may lack a theoretical justification (Mutshinda and Sillanpaa 2012), a good correspondence between the values of a from the classical permutation test and those from the approach proposed by Che and Xu (2010) (Pearson's correlation = 0.98) was obtained for one of the replicates in scenario 3 (results not shown).
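The decision rule can be sketched as follows, assuming that draws of a marker effect under permuted phenotypes have already been stored. This is an empirical approximation of a (the share of null draws at or below the posterior mean), not the implementation of Che and Xu (2010); the function names and the toy null draws are hypothetical.

```python
def permutation_tail(null_draws, beta_hat):
    """Empirical a: share of permuted-data draws at or below beta_hat."""
    if not null_draws:
        raise ValueError("null_draws must not be empty")
    return sum(1 for d in null_draws if d <= beta_hat) / len(null_draws)


def is_associated(null_draws, beta_hat, t=0.8):
    """Declare a marker associated when max(a, 1 - a) exceeds threshold t."""
    a = permutation_tail(null_draws, beta_hat)
    return max(a, 1.0 - a) > t


# Toy null distribution (hypothetical): 101 evenly spaced draws in [-5, 5].
toy_null = [i / 10 for i in range(-50, 51)]
flags = {t: is_associated(toy_null, 4.0, t) for t in (0.8, 0.85, 0.9, 0.95)}
```

Taking the maximum of a and 1 − a makes the rule two-sided, so that strongly negative and strongly positive posterior means are treated alike.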

Bonferroni's adjustment was used in the SMR. Thus, markers were declared as QTLs if p value $< 0.05 / 7{,}329$.
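For reference, the per-marker significance threshold implied by this adjustment can be computed directly. The function names below are illustrative assumptions; only the nominal level 0.05 and the 7,329 tested SNPs come from the text.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under Bonferroni's adjustment."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def declared_qtl(p_value, alpha=0.05, n_tests=7329):
    """True when a single-marker p value passes the adjusted threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


# Threshold for the 7,329 chromosome 21 SNPs, roughly 6.8e-6.
threshold = bonferroni_threshold(0.05, 7329)
```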

Model comparison

The performance of the different models, and of the thresholds considered in the Bayesian methods, was evaluated in each simulated scenario in terms of the type I error (T1E or 1 − specificity), statistical power, and the positive predictive value (PPV), each defined as follows:

$$T1E = \frac{FP}{FP + TN}, \qquad Power = \frac{TP}{TP + FN}, \qquad PPV = \frac{TP}{TP + FP},$$

where TP, TN, FP and FN refer to the numbers of true-positive findings (i.e., true detected QTLs), of true-negative findings (with null effect and not declared as associated), of false-positive findings (with null effect and declared as QTL), and of false-negative findings (with non-null effect and not declared as QTL), respectively. ROC spaces, defined by the false-positive rate (1 − specificity, or T1E) and the true-positive rate (sensitivity, or power), and the prediction results of the confusion matrix corresponding to each method and criterion in each scenario were also calculated. Please see supplementary Table S1 for a description of the relationships among the criteria used for the comparison of methods. The different statistical approaches were further compared in terms of accuracy through the mean squared error (MSE) of the estimated effect and its regression on the true values.

Results

We first present the comparison of the general behavior of the considered statistical methods and their performances according to the simulated genetic architectures regarding effect size (large/small), MAF (high/low) and LD (high/LE) between the large- and small-effect QTLs. Next, we report results corresponding to the real data analyses.
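The three criteria defined in the Model comparison section (T1E, power and PPV) can be obtained from the sets of declared and truly causal markers. The sketch below is illustrative only; the function names and the toy marker indices are assumptions, not values from the study.

```python
def confusion_counts(declared, causal, n_markers):
    """Return (TP, TN, FP, FN) from sets of marker indices."""
    declared, causal = set(declared), set(causal)
    tp = len(declared & causal)   # causal markers declared as QTL
    fp = len(declared - causal)   # null markers declared as QTL
    fn = len(causal - declared)   # causal markers that were missed
    tn = n_markers - tp - fp - fn
    return tp, tn, fp, fn


def detection_metrics(tp, tn, fp, fn):
    """T1E = FP/(FP+TN), power = TP/(TP+FN), PPV = TP/(TP+FP)."""
    return {
        "t1e": fp / (fp + tn),
        "power": tp / (tp + fn),
        "ppv": tp / (tp + fp),
    }


# Toy example (hypothetical): 220 markers, 20 causal, 24 declared.
toy_causal = range(20)
toy_declared = list(range(18)) + list(range(100, 106))
toy_metrics = detection_metrics(*confusion_counts(toy_declared, toy_causal, 220))
```

In this toy case 18 of the 20 causal markers are found and 6 null markers are declared, so the power is 0.9, the T1E is 0.03 and the PPV is 0.75.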

Simulated data

In general, low standard deviations were obtained for each statistic, showing the robustness of the measurements discussed next.

Methods' performance across scenarios

Methods and criteria (hereinafter BAt and BLt, where t = {0.8, 0.85, 0.9, 0.95} denotes the threshold used to declare an SNP as significant in the Bayesian analyses) were evaluated using some of the criteria described in Table S1.

Figure 2 displays the bar plots of the T1E. As expected, the more stringent the criterion, the lower the T1E for the Bayesian methods. The SNPs detected by both BA and BL at each t had the lowest T1E rate (<0.1 %). Bayes A was the method with the highest T1E, ranging from <1 % (BA0.95) to >4 % (BA0.8). Both SMR and BL had a T1E below 1 %, with SMR attaining the lower value.

Bar plots of the power to detect large/small-effect QTLs for each method and scenario are presented in Fig. 3. In general, BAt outperformed the other two methods regardless of the threshold above which a QTL is declared. BLt coupled with the least stringent criterion (BL0.80) followed BAt in terms of power, and the SMR analysis was able to identify a higher rate of positive results than BL0.95 in each scenario. As expected, the power of the Bayesian methods decreased as the threshold became more stringent; in particular, that of BLt decreased to a low rate in every scenario. The power obtained when considering the QTLs detected by both BAt and BLt was similar to or slightly lower than that of BLt alone, indicating that the majority of the SNPs declared as true QTLs by BLt were also detected by BAt for all t.

Bayes A, irrespective of the criterion, achieved nearly 100 % power to identify large-effect QTLs in every scenario, ranging from 96 % (BA0.8 in scenario 2) to 100 %. In contrast, BLt detected fewer, and its power decreased as the criterion became more stringent (the highest power was for BL0.8 in scenario 4: 89 %). The performance of SMR in detecting large-effect QTLs across scenarios followed the pattern of BLt. Considering as QTLs the SNPs detected by both BA0.8 and BL0.8 gave higher power than SMR in every scenario, except when the large-effect QTLs had low MAF (scenario 2).

For small effects, BA0.8 was also by far the method with the highest power, ranging from 43 to 64 % across scenarios, followed by BL0.8, which ranged from 1.5 to 13.4 %. A smaller number of QTLs were detected by both BA0.8 and BL0.8 (1.3–12.0 %). SMR was the method that detected the lowest number of small-effect QTLs across scenarios (0.8–5.6 %). The differences between the powers to detect large- vs. small-effect QTLs were larger for BL0.8, SMR and the combination of BA0.8 and BL0.8 than for BA0.8.


Contrary to power, and like T1E, the PPV (TP rate over the total of markers declared as QTLs) of the Bayesian shrinkage methods improved with t: it increased as the criterion became more stringent in all the scenarios (see Fig. 4 for further details). Considering only the SNPs detected by both BAt and BLt gave the highest PPV in every scenario (PPV ranged from 0.72 to 0.88 with t = 0.8). Except in scenarios 6 (large- and small-effect QTLs with low MAF) and 4 (large/small-effect QTLs with high/low MAF) with t = 0.8, the PPV of BLt was higher than that of BAt at the same t. The range of PPV of BAt was similar across scenarios, and that of BLt was similar only for scenarios where small-effect QTLs have high MAF. The PPV of SMR ranged from 0.29 (scenario 4) to 0.57 (scenario 2), and no pattern regarding MAF was detected.

Figure 5 shows the ROC spaces, defined by the false-positive rate (1 − specificity) and the true-positive rate (sensitivity), and the prediction results of the confusion matrix corresponding to each method and criterion in each scenario. The performance of every rule evaluated in each simulated genetic architecture was better than random. BAt outperformed the other two methods in every scenario, with BA0.8 achieving the best performance. BLt and SMR performed similarly in scenarios 2, 4 and 6 (markers with small effects and low MAF). Likewise, selecting SNPs detected by both Bayesian methods performed worse than each one independently (similar FPR but lower sensitivity).

Fig. 2 Type I error (%) of each method in the six scenarios simulated (see Table 1 for a definition of the scenarios). a Scenario 1, b scenario 2, c scenario 3, d scenario 4, e scenario 5 and f scenario 6

The accuracy of the estimates obtained with each method and criterion was evaluated through the averaged MSE between estimated and true effects of the detected QTLs in LE scenarios (see Table 2). Bayes A gave the most accurate estimates for both large and small QTLs in terms of MSE across scenarios. BL gave the largest error for large-effect estimates and SMR for small-effect ones. Large-effect QTLs were more accurately estimated by SMR than by BL.

The regressions of true QTL effects on their estimates were also computed in LE scenarios (see Fig. 6). Regression coefficients for estimates obtained by BA0.8 were close to 1, ranging from 0.99 to 1.04 and from 1.07 to 1.12 for large and small effects, respectively. The effects of both large and small QTLs were underestimated by BL, especially those with low MAF. In turn, SMR overestimated the effects (regression coefficients ranged from 0.74 to 0.92 and from 0.34 to 0.64 for large and small effects, respectively).
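Both accuracy measures can be sketched in a few lines. The function names and the toy effects are assumptions for illustration; in a regression of true effects on their estimates, a slope above 1 indicates estimates shrunk toward zero and a slope below 1 indicates inflated estimates.

```python
def mean_squared_error(true_effects, estimates):
    """Average squared difference between estimated and true effects."""
    n = len(true_effects)
    return sum((e - b) ** 2 for b, e in zip(true_effects, estimates)) / n


def regression_slope(y, x):
    """Least-squares slope of y regressed on x (with intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var = sum((xi - mean_x) ** 2 for xi in x)
    return cov / var


# Toy example (hypothetical): every estimate is half of the true effect.
toy_true = [5.0, -5.0, 1.0, 2.0]
toy_est = [2.5, -2.5, 0.5, 1.0]
toy_mse = mean_squared_error(toy_true, toy_est)
toy_slope = regression_slope(toy_true, toy_est)  # true effects on estimates
```

In the toy case the estimates are shrunk by one half, so the slope of true effects on estimates is 2 and the MSE is 3.4375.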

High LD vs. LE scenarios

To illustrate the effect of LD on the methods' performance, we focused on the QTLs that are in high LD (note that there are 24 small-effect QTLs in high LD with large-effect ones in scenarios 1 and 2). For the sake of simplicity, Fig. 7 shows the bar plots of the power obtained with the least stringent criterion in BA and BL (BA0.8 and BL0.8), as well as that from SMR, to detect large- and small-effect QTLs when they are in high LD or in LE. Results show that the effect of LD on the power depends on both the QTLs' effect size and MAF. A high LD between large- and small-effect QTLs worsens the power of BL0.8 and SMR to detect the large-effect ones, especially if they have low MAF (~30 % poorer power than if they are in LE). However, the power attained by BA0.8 is only (negatively) affected by high LD when MAF is low, and to a lesser extent (~8 %). The effect of LD on the power to detect the small-effect QTLs varies with the method: whereas it impairs the power of BA, especially when QTLs have low MAF (by 46 vs. 39 % if they have high MAF), it boosts that of BL and SMR. When the small QTL has low (high) MAF and is in high LD with a large-effect QTL, the probability of being detected by BL is ~11 (6) times higher than when it is in LE. The difference between LD and LE scenarios is even larger for SMR (17 times for high MAF ones vs. 48 times for low MAF ones).

A high LD also affects the methods' PPV similarly to the statistical power to detect small-effect QTLs. High LD slightly deteriorates the PPV of BAt (i.e., higher average rate of FP among the declared QTLs), whereas it improves that of BLt and SMR (see Fig. 8 for a graphical comparison between the PPV of BA0.8, BL0.8 and SMR in high LD vs. LE scenarios).

Fig. 3 Power (%) to detect large (in grey) and small (in black) effect QTLs by Bayes A (BA), Bayesian LASSO (BL), the combination of both Bayesian methods (BOTH) and single marker analysis (SMR) in the six scenarios simulated (see Table 1 for a definition of the scenarios). a Scenario 1, b scenario 2, c scenario 3, d scenario 4, e scenario 5 and f scenario 6

The effect of a high LD among QTLs of heterogeneous effect size on the accuracy of their estimates is further illustrated in Table 3. The table shows the averaged MSE of the estimates attained by each method for true QTLs that are either in high LD or in LE. A high LD between large- and small-effect QTLs impairs the accuracy of the estimates for BA0.8 and SMR. This negative effect is influenced by both the MAF and the effect size of the QTLs, although in a different manner depending on the method. The highest impairments of accuracy for BA0.8 were found for low MAF QTLs (6/2.7 times lower accuracy when they have large/small effect), whereas those for SMR were found for small-effect QTLs (8.2/13.5 times lower accuracy when they have high/low MAF). Although to a much lesser extent, a high LD also worsens the accuracy of BL0.8 estimates, except for large-effect QTLs with low MAF. The highest impairment was found for small-effect QTLs (~20 % less accuracy).

Fig. 4 Bar plots of the averaged PPV of the methods compared: Bayes A (BA), Bayesian LASSO (BL), the combination of both Bayesian methods (BOTH) and single marker analysis (SMR) in the scenarios simulated (see Table 1 for a definition of the scenarios). a Scenario 1, b scenario 2, c scenario 3, d scenario 4, e scenario 5 and f scenario 6

Fig. 5 Prediction results of the confusion matrix in the ROC space corresponding to the performance of Bayes A (BA), Bayesian LASSO (BL), considering different thresholds, and single marker analyses in each scenario simulated (see Table 1 for a definition of the scenarios). a Scenario 1, b scenario 2, c scenario 3, d scenario 4, e scenario 5 and f scenario 6

Table 2 Averaged mean squared error (MSE) (standard deviation) of the estimated value of the true QTLs that were detected by each statistical method in each scenario

QTL    MAF (scenario)  BA0.80       BL0.80        BA0.85       BL0.85        BA0.90       BL0.90        BA0.95       BL0.95        SMR
Large  H (sc. 3)       0.19 (0.33)  17.89 (0.51)  0.19 (0.33)  17.64 (0.46)  0.19 (0.33)  17.35 (0.50)  0.09 (0.03)  16.64 (0.69)  1.46 (0.35)
Large  H (sc. 4)       0.38 (0.45)  16.44 (0.66)  0.33 (0.42)  16.41 (0.68)  0.33 (0.42)  16.26 (0.76)  0.33 (0.42)  15.97 (0.70)  0.82 (0.25)
Large  L (sc. 5)       0.43 (0.58)  22.35 (0.35)  0.43 (0.58)  21.85 (0.53)  0.43 (0.58)  21.22 (0.50)  0.39 (0.38)  19.66 (1.11)  3.46 (1.34)
Large  L (sc. 6)       0.35 (0.28)  22.37 (0.21)  0.35 (0.28)  22.21 (0.26)  0.35 (0.28)  21.88 (0.30)  0.30 (0.10)  21.25 (0.52)  1.39 (0.48)
Small  H (sc. 3)       0.32 (0.03)  4.03 (0.42)   0.31 (0.03)  4.66 (0.63)   0.30 (0.03)  5.61 (1.50)   0.28 (0.03)  8.49 (2.75)   14.02 (0.12)
Small  L (sc. 4)       0.74 (0.07)  8.82 (2.16)   0.71 (0.08)  12.17 (3.23)  0.66 (0.08)  13.24 (2.09)  0.58 (0.07)  12.96 (0.48)  21.68 (6.45)
Small  H (sc. 5)       0.30 (0.03)  4.24 (0.20)   0.29 (0.03)  5.19 (0.36)   0.28 (0.03)  5.90 (0.50)   0.24 (0.03)  6.83 (1.34)   4.25 (0.80)
Small  L (sc. 6)       0.67 (0.08)  7.81 (1.48)   0.66 (0.09)  11.28 (2.04)  0.63 (0.10)  12.99 (3.27)  0.56 (0.08)  15.43 (0.46)  2.34 (3.30)

Fig. 6 Plots of the regressions of the estimated effect of large (in red) and small (in blue) QTLs obtained by Bayes A (BA), Bayesian LASSO (BL) and single marker regression (SMR) on their true values. a Scenario 1, b scenario 2, c scenario 3 and d scenario 4 (color figure online)


Detection of Qtls with high versus low MaF

to evaluate the influence of Qtl’s MaF on their detec-tion by each method, we will focus only on the scenarios where markers with heterogeneous effect size were in le (scenarios 3–6). as expected, results show that Qtls with high MaF are better detected by every method than those with low MaF (Fig. 9). except for Ba0.8, with a power to detect large-effect Qtls close to 100 %, the dif-ference between the powers of Bl0.8 or sMr to detect Qtls with high vs. low MaF was higher when the Qtl effect is large.

the MaF of Qtls also affects the PPv of Bayesian shrinkage methods (see Fig. 4). the PPvs of Ba0.8 and Bl0.8 are higher in the scenarios were the small-effect Qtls have high MaF, with higher differences between the PPvs of Bl0.8. On the contrary, the MaF does not seem to affect the PPv of sMr.

The MAF of QTLs also affects the accuracy of the effect estimates in terms of MSE (see the averaged MSE of truly detected QTLs in Table 3). The lower the MAF of the small-effect QTL, the less accurate the estimates from BA and BL. This pattern is also seen for the estimates of large-effect QTLs obtained by BL, although the difference in accuracy between high- and low-MAF QTLs was much smaller. The effect of MAF on the accuracy of large-effect QTL estimates when BA or SMR was applied is not clear.
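As a reminder of what the accuracy measure captures, the following minimal sketch computes the mean squared error between estimated and true effects for a set of detected QTLs. The effect values are made up for illustration, and the exact averaging over replicates used in the study is not reproduced here.

```python
def mean_squared_error(estimates, true_effects):
    """Average squared difference between estimated and true QTL effects."""
    if len(estimates) != len(true_effects) or not estimates:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((e - t) ** 2 for e, t in zip(estimates, true_effects)) / len(estimates)


# Hypothetical true effects and estimates for three detected QTLs.
true_effects = [1.0, 1.0, 0.5]
estimates = [0.9, 1.2, 0.4]
example_mse = mean_squared_error(estimates, true_effects)
print(example_mse)
```

Because the errors are squared, both systematic shrinkage towards zero and overestimation inflate the MSE, so the measure alone does not indicate the direction of the bias.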

Real data

Figure 10 shows the Manhattan plots obtained by applying each method to the UCB data. BA identified the highest number of SNPs associated with UCB, followed by BL. No SNP reached the significance level after Bonferroni correction when SMR was performed. Despite this, BL and SMR produced the most similar ‘skylines’. This similarity is reflected in a high (0.82) Pearson correlation between the percentile of the posterior distribution of permuted data corresponding to β̂p attained by BL and the p values of SMR. In contrast, the correlation between the percentiles of the posterior distributions of permuted data obtained by the two Bayesian methods was moderate (0.54). A

Fig. 7 Bar plots of the power to detect large- and small-effect QTLs with high/low MAF and in high LD/LE by BA0.8, BL0.8 and SMR. a High MAF, b low MAF, c high MAF and d low MAF


low correlation (0.29) was obtained between SMR p values and the percentiles corresponding to BA.
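The agreement between methods reported above rests on a Pearson correlation between per-SNP summaries. A minimal sketch of that computation follows. How the p values were oriented before correlating them with percentiles is not stated here, so the example uses 1 − p as an assumption, and all numbers are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-SNP summaries: percentiles from a Bayesian method
# and p values from single marker regression.
percentiles = [0.10, 0.50, 0.90, 0.95]
p_values = [0.80, 0.40, 0.05, 0.01]

# Assumption: small p values correspond to high percentiles, so 1 - p is used.
r = pearson(percentiles, [1.0 - p for p in p_values])
print(round(r, 3))
```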

Figure 11 shows the Venn diagrams of the number of SNPs detected by each statistical method (and criterion) as associated with UCB risk. As in the simulated data, BA0.8 detected the largest number of SNPs, followed by BL0.8. The more stringent the criterion, the fewer SNPs were detected by each (or both) method(s). When the least stringent criterion was applied (0.8), BA detected 187 SNPs, whereas BL detected 45. Twenty-seven SNPs were detected by both methods. Among those from BA0.8, three pairs were in high LD (r² > 0.7). The maximum pairwise LD among those from BL0.8 was 0.64, and that among those detected by both methods was 0.39. When t = 0.85, less than one-third of the SNPs previously detected by BA remained associated, whereas only 5 out of 45 remained for BL. When the most stringent criterion (t = 0.95) was applied, only two SNPs remained detected by BA, whereas BL did not identify any.
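The nesting of detected sets across criteria can be sketched as follows. The example assumes that a SNP is declared associated when its percentile in the posterior distribution of permuted data is at least t (whether the comparison is strict is an assumption); the SNP names and percentiles are hypothetical.

```python
def select(percentiles, t):
    """SNP ids whose percentile is at least the criterion t (assumed rule)."""
    return {snp for snp, pct in percentiles.items() if pct >= t}


# Hypothetical percentiles for four SNPs under each Bayesian method.
ba = {"snp1": 0.97, "snp2": 0.88, "snp3": 0.82, "snp4": 0.60}
bl = {"snp1": 0.91, "snp2": 0.70, "snp3": 0.84, "snp4": 0.81}

thresholds = (0.80, 0.85, 0.90, 0.95)
overlap = {}
for t in thresholds:
    ba_t, bl_t = select(ba, t), select(bl, t)
    overlap[t] = ba_t & bl_t  # SNPs selected by both methods at this criterion
    print(t, sorted(ba_t), sorted(bl_t), sorted(overlap[t]))
```

Because each selection is a simple threshold on a fixed set of percentiles, the set detected at a stricter criterion is always contained in the set detected at a looser one, which matches the shrinking Venn diagrams.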

The more stringent the criterion, the narrower the MAF spectrum of the markers detected by the Bayesian methods: BA0.8 and BA0.85 detected SNPs with MAF ranging 0.03–0.50, and BA0.9 and BA0.95 detected SNPs with higher MAFs (ranging 0.12–0.49 and 0.26–0.38, respectively). In contrast, BL only detected SNPs with high MAF (ranging 0.18–0.50, 0.27–0.46 and 0.40 for t = 0.8, 0.85 and 0.9, respectively). The MAF ranges of markers detected by both methods were the same as those obtained by BL.

Fig. 8 Positive predictive value (in %, PPV) of Bayes A (BA), Bayesian LASSO (BL) and single marker regression (SMR) in scenarios with high (LD) vs. nil linkage disequilibrium (LE) among the large- and small-effect QTLs with high/low minor allele frequency (MAF). a Large and small QTLs with high MAF and b large and small QTLs with low MAF

Table 3 Averaged mean squared error (MSE) (standard deviation) of estimates of QTLs in high linkage disequilibrium/equilibrium (LD/LE) obtained by Bayes A (BA), Bayesian LASSO (BL) and single marker regression (SMR) in scenarios with similar MAF

Effect size  MAF  BA (LD)      BA (LE)      BL (LD)       BL (LE)       SMR (LD)      SMR (LE)
Large        H    0.98 (0.54)  0.48 (0.58)  18.80 (0.44)  18.23 (0.46)  2.66 (0.61)   1.82 (0.29)
Large        L    2.78 (1.09)  0.40 (0.36)  22.81 (0.30)  23.19 (0.14)  3.53 (0.66)   1.66 (0.40)
Small        H    0.81 (0.36)  0.30 (0.02)  1.54 (0.07)   1.28 (0.01)   17.79 (1.37)  1.94 (0.14)
Small        L    1.99 (0.74)  0.54 (0.04)  1.64 (0.05)   1.40 (0.00)   20.66 (1.86)  1.42 (0.10)

H high, L low. All entries are MSE (standard deviation)


Discussion

The continuous advancement of genotyping and sequencing technologies in the last decade has not been accompanied by the use of more advanced statistical methods in GWAS, and the classical SMR approach coupled with the restrictive genome-wide p value threshold is by far the most widely used method. Although multi-marker

Fig. 9 Bar plots of the averaged power to detect large/small-effect QTLs with high vs. low MAF (in light and dark grey, respectively) by BA0.8, BL0.8 and SMR in scenarios where QTLs are in LE (scenarios 3–6). a Large QTLs and b small QTLs

Fig. 10 Manhattan plots from Bayes A (BA), Bayesian LASSO (BL) and single marker regression (SMR). Green and pink dotted lines correspond to the percentile 0.8 and the threshold α = 0.05/7329, respectively. Orange, green and dark red vertical lines point to the SNPs that were selected by both BAt and BLt (BOTHt) at t = 0.8, 0.85 and 0.90, respectively. *Percentile of the posterior distribution of permuted data corresponding to effect estimates obtained by BA and BL (color figure online)


methods have been widely used to estimate the genetic variance of complex traits (de los Campos et al. 2013; Yang et al. 2010) and to predict genetic merit (Gianola 2013; Gianola et al. 2009; Meuwissen et al. 2001), little is known about their performance in the QTL detection framework under complex genetic architectures. Several studies have examined the application of different detection rules in the context of Bayesian MMM, although their simulations comprised few causal markers with relatively large effects (Li et al. 2011; Mutshinda and Sillanpaa 2012; Xu 2003), which may not reflect the nature of complex traits. Our study sheds light on the behavior of two of the best-known Bayesian multi-marker methods, BA and BL, coupled with the within-chain permutation approach described by Che and Xu (2010), as well as that of SMR, in the QTL detection procedure using simulated data under different genetic architectures (LD, MAF, effect size or proportion of genetic variance explained). Because the genetic data used in the simulation study came from the EPICURO/SBC study, we applied the lessons learned from the simulations to explain the results obtained in the real data. We chose the polymorphisms in chromosome 21 because its relatively small size allowed us to simulate multiple scenarios and replicates in a reasonable time frame, and because no previous associations with UCB had been found there by GWAS.

Comparison of methods performance

Simulation results show that the methods perform differently as a consequence of the different priors adopted in the Bayesian shrinkage methods and the implicit assumptions of SMR (QTLs are independent and have large effects). In general, Bayesian shrinkage methods, irrespective of the stringency of the criterion applied, performed better in QTL detection than SMR. Although the shrinkage of both BA and BL depends on allele frequency and effect size, our results reflected different behaviors because the regularization via the priors differs: BA produces less shrinkage towards zero as the absolute value of the marker effect increases, whereas BL shrinks strongly, especially markers with small effects, which are effectively wiped out of the model (Gianola 2013). Thus, as a consequence of the lower degree of regularization, BA was by far the method with the highest power (it was able to detect nearly 100 % of large-effect QTLs), even though it reported the largest number of falsely associated markers. In addition, BA provided the most accurate estimates of the SNPs with both large and small effects, suggesting that BA fitted the data better. BL outperformed SMR in terms of power (unless the criteria used to declare a QTL are very strict) and PPV, although it presented a slightly higher T1E. The strong regularization of BL is also reflected in

Fig. 11 Venn diagrams of the number of SNPs detected by each Bayesian method corresponding to each selection criterion. a 0.8, b 0.85, c 0.9 and d 0.95


the underestimation of the effects, especially for SNPs with small effect. The t needed to be as low as 0.58–0.62 to obtain a false-positive rate in the ROC space similar to that of BA, although with a much lower power (see Fig. 5). The combination of BA and BL showed the lowest T1E (<0.1 %), as well as the highest performance in terms of PPV (>75 %). In terms of power, this combined approach is quite conservative (similar to BL, which shrinks strongly towards 0), although less so than SMR. Surprisingly, the T1E of SMR was much higher than expected (<1 % vs. 0.05/7329), which may be explained by the LD existing between markers. As a consequence of the implicit assumptions of SMR, the effects were overestimated, especially for QTLs with small effect.
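To make the gap between the nominal and the observed T1E of SMR concrete, the arithmetic behind the Bonferroni threshold can be spelled out. The sketch below only uses the two figures quoted in the text (α = 0.05 and 7329 tests); the 1 % value is taken as an upper bound on the observed per-marker rate, and the variable names are illustrative.

```python
import math

alpha = 0.05
n_tests = 7329  # number of tests in the threshold quoted in the text

# Per-marker significance threshold after Bonferroni correction.
bonferroni = alpha / n_tests            # roughly 6.8e-6
neg_log10 = -math.log10(bonferroni)     # height of the line on a Manhattan plot

# If the per-marker false-positive rate were as high as 1 %, this many
# false positives would be expected among n_tests null markers.
upper_bound_fp = 0.01 * n_tests

print(f"{bonferroni:.2e}", round(neg_log10, 2), round(upper_bound_fp, 1))
```

The Bonferroni bound treats the tests as independent, so correlated markers in LD with a true QTL can be flagged together, which is consistent with the explanation offered in the text.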

Impact of genetic architecture on QTL detection

Genetic architecture affects the performance of each method differently in terms of statistical power, PPV, T1E and accuracy of the estimates. This is also a consequence of the different priors of BA and BL, and the inherent assumptions of SMR. As expected, large-effect QTLs were better detected than small ones by all methods. The low power to detect small-effect QTLs using BL reflected the strong shrinkage that the double exponential (DE) prior places on them.
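The contrast between the two priors can be illustrated through the gradient of the negative log prior density, which acts as the pull towards zero exerted on an effect β. The sketch assumes the standard marginal forms, a scaled-t density for BA and a double exponential density for BL; the degrees of freedom, scale and rate used below are arbitrary illustrative values, not the hyperparameters of the study.

```python
def t_prior_pull(beta, nu=4.0, scale=1.0):
    """Gradient of the negative log scaled-t density at beta.

    For p(beta) proportional to (1 + beta^2 / (nu * scale^2))^(-(nu + 1) / 2),
    the gradient is (nu + 1) * beta / (nu * scale^2 + beta^2).
    """
    return (nu + 1.0) * beta / (nu * scale ** 2 + beta ** 2)


def laplace_prior_pull(beta, lam=1.0):
    """Gradient of the negative log double exponential density at beta.

    For p(beta) proportional to exp(-lam * |beta|) the gradient is
    lam * sign(beta); zero is returned at the kink by convention.
    """
    if beta == 0:
        return 0.0
    return lam if beta > 0 else -lam


# Illustrative hyperparameters: nu = 4, scale = 1, lam = 1.
for beta in (0.1, 1.0, 2.0, 10.0, 100.0):
    print(beta, round(t_prior_pull(beta), 4), laplace_prior_pull(beta))
```

Under these assumptions the pull of the t prior fades as the effect grows, whereas the double exponential prior pulls with the same strength regardless of effect size. That constant pull is large relative to a small effect, which is in line with small effects being wiped out under BL.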

The different impact of LD on the performance of the Bayesian methods is also a consequence of the different regularization via priors. Bayes A detected fewer large- and small-effect QTLs when they were in high LD: because the marginal prior for the marker effects in BA is the same for all markers, there is scant Bayesian learning for marker-specific variances, and this can dilute the effect of highly correlated markers (Gianola 2013). In the case of either BL or SMR, part of the genetic variance of large-effect QTLs seemed to be captured by small ones in LD. This results in a larger power to detect the small QTLs, at the expense of detecting fewer large-effect QTLs. The effect of LD on the detection of small-effect QTLs is also reflected in the rate of FP, which increased for BA but decreased for BL and SMR. Results also reflected the effect of LD on the bias of QTL effect estimation, which was modulated by the effect size of the QTL. Thus, estimates of small-effect QTLs in high LD with large QTLs were downward biased, whereas those of large ones were upward biased (results not shown). High LD also worsened the accuracy achieved by every method (especially BA and SMR) in terms of MSE. The impairment was driven by the allele frequency of the QTL in the case of BA, and by the effect size in the case of SMR.
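Pairwise LD of the kind discussed here is commonly summarized by r². As a rough illustration, the sketch below computes r² as the squared Pearson correlation between genotype codes (0, 1 or 2 copies of the minor allele). This genotype-based approximation is an assumption made for the example; the estimator actually used in the study is not described in this section, and the genotype vectors are hypothetical.

```python
def genotype_r2(g1, g2):
    """Squared correlation between two SNPs coded as 0/1/2 allele counts."""
    n = len(g1)
    if n != len(g2) or n < 2:
        raise ValueError("need two genotype vectors of equal length >= 2")
    m1, m2 = sum(g1) / n, sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    v1 = sum((a - m1) ** 2 for a in g1)
    v2 = sum((b - m2) ** 2 for b in g2)
    if v1 == 0 or v2 == 0:
        raise ValueError("monomorphic marker: r2 is undefined")
    return cov * cov / (v1 * v2)


# Hypothetical genotypes of eight individuals at two nearby SNPs.
snp_a = [0, 1, 2, 1, 0, 2, 1, 0]
snp_b = [0, 1, 2, 1, 0, 2, 0, 0]
r2_ab = genotype_r2(snp_a, snp_b)
print(round(r2_ab, 3))
```

When r² is close to 1 the two markers carry nearly the same information, so a model has little basis for attributing the signal to one rather than the other.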

The effect of allele frequency on the performance of the methods is clear, and it also affects each of them differently. Small-effect QTLs were better detected by all methods if they had high MAF. Very few small-effect QTLs with low allele frequency were identified by either BL or SMR. This was confirmed by the results of the real data analyses, where BA discovered markers with a much wider range of MAF than either BL or SMR, especially with the most flexible criterion. Accuracy of the estimates also depends on MAF: QTL effects were more accurately estimated when the allele frequency was high.

Real data analysis

All the statistical methods were also applied to real-world data. Unlike in the simulation study, neither the true effects of the QTLs nor the genetic architecture of UCB is known. Genetic factors are estimated to explain 7–31 % of bladder cancer susceptibility (Czene et al. 2002; Lichtenstein et al. 2000). Recent GWAS have identified 13 SNPs associated with UCB, although none of them were in chromosome 21 (Figueroa et al. 2014; Garcia-Closas et al. 2013; Rothman et al. 2010; Wu et al. 2009). This fact also motivated us to explore the performance of multi-marker methods in detecting potential associations with SNPs in chromosome 21.

As in the simulated data, BA was the method that identified the largest number of SNPs, and with a wider spectrum of MAF. The more stringent the criterion, the lower the proportion of the SNPs identified by BA that were detected by both BA and BL (14.4, 5.3, 5.5 and 0 % for t = 0.8, 0.85, 0.9 and 0.95, respectively), and the narrower the range of MAF detected. As expected from previous analyses, SMR did not detect any significant association after Bonferroni correction. In light of these results, the hypothesis that genetic variants with very small effects underlie the susceptibility to UCB seems more plausible.
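The first of these proportions can be recovered from the counts reported in the Results for t = 0.8 (187 SNPs from BA, 45 from BL and 27 shared). The short sketch below reproduces that arithmetic and the sizes of the remaining Venn regions; counts for the stricter criteria are not all given in the text, so they are not reconstructed here.

```python
# Counts at t = 0.8 as reported in the Results.
n_ba, n_bl, n_both = 187, 45, 27

share_of_ba = 100 * n_both / n_ba  # percentage of BA's SNPs also found by BL
share_of_bl = 100 * n_both / n_bl  # percentage of BL's SNPs also found by BA
ba_only = n_ba - n_both            # SNPs detected by BA alone
bl_only = n_bl - n_both            # SNPs detected by BL alone

print(round(share_of_ba, 1), round(share_of_bl, 1), ba_only, bl_only)
```

Seen from the BL side the overlap is much larger than the 14.4 % computed relative to BA, which reflects how many more SNPs BA declares at this criterion.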

At the expense of excluding many false-negative associations, we consider and discuss as associated with the disease only those SNPs that were selected by both BA and BL at each threshold, so as to keep a low rate of FP. Polymorphisms in twenty genes (see Table 4) were detected by both BA0.8 and BL0.8. Among them, three (UMODL1, SLC19A1 and C21orf91) and one (C21orf91) were selected when applying t = 0.85 and 0.9, respectively. With the most stringent criterion, no SNP was selected. Although none of the genetic variants has been previously associated with UCB risk through GWAS, some of the genes identified here have been related to UCB via expression analysis, or to other types of cancer via expression analysis or GWAS. Thus, human UMODL1 is dramatically up-regulated in cancer tissues originating from the lymph node, bladder, liver, pancreas and ovary (Wang et al. 2012). TIAM1 has been found to be overexpressed in most malignancies, and has been associated with metastasis in hepatocellular carcinoma (Huang et al. 2013) and with migration, invasion and viability of ovarian cancer cells (Li et al.


2012). The C21orf2 antigen showed cancer-associated reactivity and reacted preferentially with serum from colon cancer patients compared with normal human serum (Line et al. 2002). The relative expression of PDE9, which has been shown to be expressed in the urothelium of the lower urinary tract, was significantly higher in malignant tumors than in the respective normal breast tissues and benign tumors (5.5-fold, p < 0.001 and sixfold, p < 0.001, respectively) (Nagasaki et al. 2012). Furthermore, the DSCAM gene has been associated with non-small-cell lung carcinoma survival (Sato et al. 2011) and the SLC19A1 gene with breast and prostate cancer risk (Haiman et al. 2013).

Some of the genes identified in this study have been previously associated with inflammatory processes, supporting the link between inflammation and UCB (de Maturana et al. 2013; Fortuny et al. 2006; Murta-Nascimento et al. 2007). The C21orf34 gene has been reported to be associated with IL-6, a systemic inflammatory marker in patients with chronic obstructive pulmonary disease (Kim et al. 2012). The ERG gene is a key regulator of inflammation, and the RIP140 gene was associated with subclinical inflammation in type 2 diabetic patients (Xue et al. 2013).

Strengths and limitations

Although valuable findings are reported in this study, there are limitations that should also be discussed. First, it is possible that none of the simulated scenarios fully matches the real associations of genetic variants in chromosome 21 with UCB risk. Time and space limitations made it impractical to simulate all possible combinations of LD, MAF and effect size of the QTLs. For the same reason, we had to ignore possible associations of genetic variants outside chromosome 21 in the models. Second, a continuous trait with higher heritability than that of UCB risk (7–31 %) (Czene et al. 2002; Lichtenstein et al. 2000) was simulated to better illustrate the effect of the genetic architecture on the performance of the methods. Third, QTLs were assigned to genotyped SNPs, which may not be the case in real data analyses. For the sake of simplicity, the existence of epistasis or dominance was ignored in the simulation. Future studies may find it worthwhile to explore other scenarios where the genetic architecture also comprises gene–gene interaction, dominance or even gene–environment interaction.

Despite these limitations, this study reports noteworthy findings. To our knowledge, this is the first study comparing the performance of BA and BL in the detection of QTLs associated with complex traits, and the most comprehensive study of how different genetic architectures may affect the performance of the methods. This analysis disentangled the effect of LD among markers with heterogeneous effects on QTL detection, and shows that Bayesian shrinkage methods (especially BA) outperform SMR in terms of power, T1E, PPV or accuracy of the effect estimates. Although extrapolating the conclusions from the simulated to the real data is challenging because of the limitations pointed out above, the hypothesis that low-penetrance genetic variants with very small effects underlie UCB seems plausible and is supported by this study. The application of multi-marker methods also allowed the identification of genes that may play a role in the development of UCB and that were not detected by the standard method used in GWAS.

Implications for data analysis

Important implications can be derived from this study. We showed that the effectiveness of QTL detection methods in terms of power, T1E, PPV and accuracy of effect estimates depends dramatically on the genetic architecture of the trait. A high LD between markers with large and

Table 4 List of SNPs that were detected as associated with bladder cancer risk by both Bayesian methods with the least stringent criterion (t = 0.8)

SNP          Gene       0.85  0.9  0.95  p value
rs2070572    C21orf2                     0.0142
rs2245036    C21orf25                    0.0035
rs1074741    C21orf34                    0.0173
rs2829269    C21orf42                    0.0072
rs2834005    C21orf62                    0.0080
rs243619     C21orf91   ×     ×          0.0013
rs13052393   CRYAA                       0.0180
rs872331     CRYAA                       0.0047
rs1882760    DSCAM                       0.0175
rs12482444   DSCAM                       0.0144
rs9975719    DSCAM                       0.0008
rs2836336    ERG                         0.0103
rs2838281    FLJ41733                    0.0167
rs8129919    KCNJ6                       0.0097
rs915792     NCAM2                       0.0423
rs2826223    NCAM2                       0.0077
rs964035     NRIP1                       0.0035
rs2823233    NRIP1                       0.0080
rs228039     PDE9A                       0.0015
rs2824732    PRSS7                       0.0188
rs2825301    PRSS7                       0.0086
rs2838958    SLC19A1                     0.0051
rs4818789    SLC19A1    ×                0.0021
rs2096852    SNF1LK                      0.0247
rs1414193    TIAM1                       0.0060
rs7278065    TMPRSS2                     0.0190
rs2839471    UMODL1     ×                0.0005

Those that remained associated with more stringent criteria are ticked (×)


small effects affects the power of each method differently: whereas it impairs that of BA, it boosts that of BL and SMR when detecting small-effect QTLs. High LD also impairs the accuracy of the effect estimates obtained by all methods. The allele frequency of the QTLs also alters the behavior of the methods. In general, QTLs with higher MAF are better detected, especially if their effect is large, and their effect estimates are more accurate. In addition, lower T1E is expected when the MAF of small-effect QTLs is high. These results emphasize the difficulty of estimating the effect of a QTL on the phenotype of a given trait using current models, because the variance explained by it may be small and confounded by other QTLs in LD with it.

Multi-SNP methods outperformed the classical SMR in terms of power, type I error and accuracy of effect estimates, especially BA, which is also able to identify markers across a wider range of MAF. Combining both Bayesian methods provided a balance between power and T1E/PPV in the simulation study, and it was therefore the method of choice in the real data analysis. However, the estimated magnitude of the effect of genomic variants on the phenotype should be interpreted cautiously because it is not independent of the method, MAF, LD and other features of the genetic architecture.

Results from real-world data support the hypothesis that small-effect genetic variants underlie the genetic basis of UCB risk and suggest novel associations with variants in chromosome 21 not detected by SMR in previous studies. Our findings should encourage future GWAS to use multi-marker methods to detect QTL associations with complex traits.

Acknowledgments This study was partially conducted in the Biosciences Research Division, Department of Environment and Primary Industries (Melbourne, Australia). Many thanks to Phil Bowman for his help using the clusters. We also acknowledge the principal investigators, coordinators, field and administrative workers, technicians and study participants of the Spanish Bladder Cancer/EPICURO study.

References

Andrews DF, Mallows CL (1974) Scale mixtures of normal distributions. J R Stat Soc Ser B 36:99–102

Che X, Xu S (2010) Significance test and genome selection in Bayesian shrinkage analysis. Int J Plant Genomics 2010:893206

Czene K, Lichtenstein P, Hemminki K (2002) Environmental and heritable causes of cancer among 9.6 million individuals in the Swedish family-cancer database. Int J Cancer 99:260–266

de los Campos G, Hickey JM, Pong-Wong R, Daetwyler HD, Calus MP (2013) Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics 193:327–345

de Maturana EL, Ye Y, Calle ML, Rothman N, Urrea V, Kogevinas M, Petrus S, Chanock SJ, Tardon A, Garcia-Closas M, Gonzalez-Neira A, Vellalta G, Carrato A, Navarro A, Lorente-Galdos B, Silverman DT, Real FX, Wu X, Malats N (2013) Application of multi-SNP approaches Bayesian LASSO and AUC-RF to detect main effects of inflammatory-gene variants associated with bladder cancer risk. PLoS One 8:e83745

Figueroa JD, Ye Y, Siddiq A, Garcia-Closas M, Chatterjee N, Prokunina-Olsson L, Cortessis VK, Kooperberg C, Cussenot O, Benhamou S, Prescott J, Porru S, Dinney CP, Malats N, Baris D, Purdue M, Jacobs EJ, Albanes D, Wang Z, Deng X, Chung CC, Tang W, Bas Bueno-de-Mesquita H, Trichopoulos D, Ljungberg B, Clavel-Chapelon F, Weiderpass E, Krogh V, Dorronsoro M, Travis R, Tjonneland A, Brenan P, Chang-Claude J, Riboli E, Conti D, Gago-Dominguez M, Stern MC, Pike MC, Van Den Berg D, Yuan JM, Hohensee C, Rodabough R, Cancel-Tassin G, Roupret M, Comperat E, Chen C, De Vivo I, Giovannucci E, Hunter DJ, Kraft P, Lindstrom S, Carta A, Pavanello S, Arici C, Mastrangelo G, Kamat AM, Lerner SP, Barton Grossman H, Lin J, Gu J, Pu X, Hutchinson A, Burdette L, Wheeler W, Kogevinas M, Tardon A, Serra C, Carrato A, Garcia-Closas R, Lloreta J, Schwenn M, Karagas MR, Johnson A, Schned A, Armenti KR, Hosain GM, Andriole G Jr, Grubb R 3rd, Black A, Ryan Diver W, Gapstur SM, Weinstein SJ, Virtamo J, Haiman CA, Landi MT, Caporaso N, Fraumeni JF Jr, Vineis P, Wu X, Silverman DT, Chanock S, Rothman N (2014) Genome-wide association study identifies multiple loci associated with bladder cancer risk. Hum Mol Genet 23:1387–1398

Fortuny J, Kogevinas M, Garcia-Closas M, Real FX, Tardon A, Villanueva C, Dosemeci M, Malats N, Silverman D (2006) Use of analgesics and nonsteroidal anti-inflammatory drugs, genetic predisposition, and bladder cancer risk in Spain. Cancer Epidemiol Biomark Prev 16:1696–1702

Foulkes AS (2009) Applied statistical genetics with R for population-based association studies. Springer Science + Business Media, LLC, New York

Garcia-Closas M, Malats N, Silverman D, Dosemeci M, Kogevinas M, Hein DW, Tardon A, Serra C, Carrato A, Garcia-Closas R, Lloreta J, Castano-Vinyals G, Yeager M, Welch R, Chanock S, Chatterjee N, Wacholder S, Samanic C, Tora M, Fernandez F, Real FX, Rothman N (2005) NAT2 slow acetylation, GSTM1 null genotype, and risk of bladder cancer: results from the Spanish Bladder Cancer Study and meta-analyses. Lancet 366:649–659

Garcia-Closas M, Rothman N, Figueroa JD, Prokunina-Olsson L, Han SS, Baris D, Jacobs EJ, Malats N, De Vivo I, Albanes D, Purdue MP, Sharma S, Fu YP, Kogevinas M, Wang Z, Tang W, Tardon A, Serra C, Carrato A, Garcia-Closas R, Lloreta J, Johnson A, Schwenn M, Karagas MR, Schned A, Andriole G Jr, Grubb R 3rd, Black A, Gapstur SM, Thun M, Diver WR, Weinstein SJ, Virtamo J, Hunter DJ, Caporaso N, Landi MT, Hutchinson A, Burdett L, Jacobs KB, Yeager M, Fraumeni JF Jr, Chanock SJ, Silverman DT, Chatterjee N (2013) Common genetic polymorphisms modify the effect of smoking on absolute risk of bladder cancer. Cancer Res 73:2211–2220

Gianola D (2013) Priors in whole-genome regression: the Bayesian alphabet returns. Genetics 194:573–596

Gianola D, de los Campos G, Hill WG, Manfredi E, Fernando R (2009) Additive genetic variability and the Bayesian alphabet. Genetics 183:347–363

Gibson G (2012) Rare and common variants: twenty arguments. Nat Rev Genet 13:135–145

González-Recio O, López de Maturana E, Vega AT, Engelman CD, Broman KW (2009) Detecting single-nucleotide polymorphism by single-nucleotide polymorphism interactions in rheumatoid arthritis using a two-step approach with machine learning and a Bayesian threshold least absolute shrinkage and selection operator (LASSO) model. In: BMC proceedings, vol 3 (suppl 7)

Haiman CA, Han Y, Feng Y, Xia L, Hsu C, Sheng X, Pooler LC, Patel Y, Kolonel LN, Carter E, Park K, Le Marchand L, Van Den Berg D, Henderson BE, Stram DO (2013) Genome-wide testing of putative functional exonic variants in relationship with breast and prostate cancer risk in a multiethnic population. PLoS Genet 9:e1003419

Heaton MJ, Scott JG (2010) Bayesian computation and the linear model. In: Chen MH, Dey DK, Müller P, Sun D, Ye K (eds) Frontiers of statistical decision making and Bayesian analysis. Springer, New York, pp 527–545

Hirschhorn JN, Lindgren CM, Daly MJ, Kirby A, Schaffner SF, Burtt NP, Altshuler D, Parker A, Rioux JD, Platko J, Gaudet D, Hudson TJ, Groop LC, Lander ES (2001) Genomewide linkage analysis of stature in multiple populations reveals several regions with evidence of linkage to adult height. Am J Hum Genet 69:106–116

Huang J, Ye X, Guan J, Chen B, Li Q, Zheng X, Liu L, Wang S, Ding Y, Chen L (2013) Tiam1 is associated with hepatocellular carcinoma metastasis. Int J Cancer 132:90–100

Karkkainen HP, Sillanpaa MJ (2012) Robustness of Bayesian multilocus association models to cryptic relatedness. Ann Hum Genet 76:510–523

Kim DK, Cho MH, Hersh CP, Lomas DA, Miller BE, Kong X, Bakke P, Gulsvik A, Agusti A, Wouters E, Celli B, Coxson H, Vestbo J, MacNee W, Yates JC, Rennard S, Litonjua A, Qiu W, Beaty TH, Crapo JD, Riley JH, Tal-Singer R, Silverman EK (2012) Genome-wide association analysis of blood biomarkers in chronic obstructive pulmonary disease. Am J Respir Crit Care Med 186:1238–1247

Lango Allen H, Estrada K, Lettre G, Berndt SI, Weedon MN, Rivadeneira F, Willer CJ, Jackson AU, Vedantam S, Raychaudhuri S, Ferreira T, Wood AR, Weyant RJ, Segre AV, Speliotes EK, Wheeler E, Soranzo N, Park JH, Yang J, Gudbjartsson D, Heard-Costa NL, Randall JC, Qi L, Vernon Smith A, Magi R, Pastinen T, Liang L, Heid IM, Luan J, Thorleifsson G, Winkler TW, Goddard ME, Sin Lo K, Palmer C, Workalemahu T, Aulchenko YS, Johansson A, Zillikens MC, Feitosa MF, Esko T, Johnson T, Ketkar S, Kraft P, Mangino M, Prokopenko I, Absher D, Albrecht E, Ernst F, Glazer NL, Hayward C, Hottenga JJ, Jacobs KB, Knowles JW, Kutalik Z, Monda KL, Polasek O, Preuss M, Rayner NW, Robertson NR, Steinthorsdottir V, Tyrer JP, Voight BF, Wiklund F, Xu J, Zhao JH, Nyholt DR, Pellikka N, Perola M, Perry JR, Surakka I, Tammesoo ML, Altmaier EL, Amin N, Aspelund T, Bhangale T, Boucher G, Chasman DI, Chen C, Coin L, Cooper MN, Dixon AL, Gibson Q, Grundberg E, Hao K, Juhani Junttila M, Kaplan LM, Kettunen J, Konig IR, Kwan T, Lawrence RW, Levinson DF, Lorentzon M, McKnight B, Morris AP, Muller M, Suh Ngwa J, Purcell S, Rafelt S, Salem RM, Salvi E et al (2010) Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467:832–838

Li J, Das K, Fu G, Li R, Wu R (2011) The Bayesian lasso for genome-wide association studies. Bioinformatics 27:516–523

Li J, Liang S, Jin H, Xu C, Ma D, Lu X (2012) Tiam1, negatively regulated by miR-22, miR-183 and miR-31, is involved in migration, invasion and viability of ovarian cancer cells. Oncol Rep 27:1835–1842

Lichtenstein P, Holm NV, Verkasalo PK, Iliadou A, Kaprio J, Koskenvuo M, Pukkala E, Skytthe A, Hemminki K (2000) Environmental and heritable factors in the causation of cancer: analyses of cohorts of twins from Sweden, Denmark, and Finland. N Engl J Med 343:78–85

Line A, Slucka Z, Stengrevics A, Silina K, Li G, Rees RC (2002) Characterisation of tumour-associated antigens in colon cancer. Cancer Immunol Immunother 51:574–582

Maher B (2008) Personal genomes: the case of the missing heritability. Nature 456:18–21

Makowsky R, Pajewski NM, Klimentidis YC, Vazquez AI, Duarte CW, Allison DB, de los Campos G (2011) Beyond missing heritability: prediction of complex traits. PLoS Genet 7:e1002051

Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke M, Clark AG, Eichler EE, Gibson G, Haines JL, Mackay TF, McCarroll SA, Visscher PM (2009) Finding the missing heritability of complex diseases. Nature 461:747–753

Meuwissen TH, Hayes BJ, Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819–1829

Murta-Nascimento C, Schmitz-Drager BJ, Zeegers MP, Steineck G, Kogevinas M, Real FX, Malats N (2007) Epidemiology of urinary bladder cancer: from tumor development to patient’s death. World J Urol 25:285–295

Mutshinda CM, Sillanpaa MJ (2012) A decision rule for quantitative trait locus detection under the extended Bayesian LASSO model. Genetics 192:1483–1491

Nagasaki S, Nakano Y, Masuda M, Ono K, Miki Y, Shibahara Y, Sasano H (2012) Phosphodiesterase type 9 (PDE9) in the human lower urinary tract: an immunohistochemical study. BJU Int 109:934–940

Park T, Casella G (2008) The Bayesian LASSO. J Am Statist Assoc 103:681–686

Rothman N, Garcia-Closas M, Chatterjee N, Malats N, Wu X, Figueroa JD, Real FX, Van Den Berg D, Matullo G, Baris D, Thun M, Kiemeney LA, Vineis P, De Vivo I, Albanes D, Purdue MP, Rafnar T, Hildebrandt MA, Kiltie AE, Cussenot O, Golka K, Kumar R, Taylor JA, Mayordomo JI, Jacobs KB, Kogevinas M, Hutchinson A, Wang Z, Fu YP, Prokunina-Olsson L, Burdett L, Yeager M, Wheeler W, Tardon A, Serra C, Carrato A, Garcia-Closas R, Lloreta J, Johnson A, Schwenn M, Karagas MR, Schned A, Andriole G Jr, Grubb R 3rd, Black A, Jacobs EJ, Diver WR, Gapstur SM, Weinstein SJ, Virtamo J, Cortessis VK, Gago-Dominguez M, Pike MC, Stern MC, Yuan JM, Hunter DJ, McGrath M, Dinney CP, Czerniak B, Chen M, Yang H, Vermeulen SH, Aben KK, Witjes JA, Makkinje RR, Sulem P, Besenbacher S, Stefansson K, Riboli E, Brennan P, Panico S, Navarro C, Allen NE, Bueno-de-Mesquita HB, Trichopoulos D, Caporaso N, Landi MT, Canzian F, Ljungberg B, Tjonneland A, Clavel-Chapelon F, Bishop DT, Teo MT, Knowles MA, Guarrera S, Polidoro S, Ricceri F, Sacerdote C, Allione A, Cancel-Tassin G, Selinski S, Hengstler JG, Dietrich H, Fletcher T, Rudnai P, Gurzau E, Koppova K, Bolick SC, Godfrey A, Xu Z et al (2010) A multi-stage genome-wide association study of bladder cancer identifies multiple susceptibility loci. Nat Genet 42:978–984

Sale MM, Freedman BI, Hicks PJ, Williams AH, Langefeld CD, Gallagher CJ, Bowden DW, Rich SS (2005) Loci contributing to adult height and body mass index in African American families ascertained for type 2 diabetes. Ann Hum Genet 69:517–527

Sato Y, Yamamoto N, Kunitoh H, Ohe Y, Minami H, Laird NM, Katori N, Saito Y, Ohnami S, Sakamoto H, Sawada J, Saijo N, Yoshida T, Tamura T (2011) Genome-wide association study on overall survival of advanced non-small cell lung cancer patients treated with carboplatin and paclitaxel. J Thorac Oncol 6:132–138

Vazquez AI, de los Campos G, Klimentidis YC, Rosa GJ, Gianola D, Yi N, Allison DB (2012) A comprehensive genetic approach for improving prediction of skin cancer risk in humans. Genetics 192:1493–1502

Wang W, Tang Y, Ni L, Kim E, Jongwutiwes T, Hourvitz A, Zhang R, Xiong H, Liu HC, Rosenwaks Z (2012) Overexpression of uromodulin-like1 accelerates follicle depletion and subsequent ovarian degeneration. Cell Death Dis 3:e433

Weedon MN, Lango H, Lindgren CM, Wallace C, Evans DM, Mangino M, Freathy RM, Perry JR, Stevens S, Hall AS, Samani NJ, Shields B, Prokopenko I, Farrall M, Dominiczak A, Johnson T, Bergmann S, Beckmann JS, Vollenweider P, Waterworth DM, Mooser V, Palmer CN, Morris AD, Ouwehand WH, Zhao JH, Li S, Loos RJ, Barroso I, Deloukas P, Sandhu MS, Wheeler E, Soranzo N, Inouye M, Wareham NJ, Caulfield M, Munroe PB, Hattersley AT, McCarthy MI, Frayling TM (2008) Genome-wide association analysis identifies 20 loci that influence adult height. Nat Genet 40:575–583

Wright S (1934) An analysis of variability in number of digits in an inbred strain of Guinea pigs. Genetics 19:506–536

Wu X, Ye Y, Kiemeney LA, Sulem P, Rafnar T, Matullo G, Seminara D, Yoshida T, Saeki N, Andrew AS, Dinney CP, Czerniak B, Zhang ZF, Kiltie AE, Bishop DT, Vineis P, Porru S, Buntinx F, Kellen E, Zeegers MP, Kumar R, Rudnai P, Gurzau E, Koppova K, Mayordomo JI, Sanchez M, Saez B, Lindblom A, de Verdier P, Steineck G, Mills GB, Schned A, Guarrera S, Polidoro S, Chang SC, Lin J, Chang DW, Hale KS, Majewski T, Grossman HB, Thorlacius S, Thorsteinsdottir U, Aben KK, Witjes JA, Stefansson K, Amos CI, Karagas MR, Gu J (2009) Genetic variation in the prostate stem cell antigen gene PSCA confers susceptibility to urinary bladder cancer. Nat Genet 41:991–995

Xu S (2003) Estimating polygenic effects using markers of the entire genome. Genetics 163:789–801

Xue J, Zhao H, Shang G, Zou R, Dai Z, Zhou D, Huang Q, Xu Y (2013) RIP140 is associated with subclinical inflammation in type 2 diabetic patients. Exp Clin Endocrinol Diabetes 121:37–42

Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW, Goddard ME, Visscher PM (2010) Common SNPs explain a large proportion of the heritability for human height. Nat Genet 42:565–569

Yi N, Xu S (2008) Bayesian LASSO for quantitative trait locus mapping. Genetics 179:1045–1055
