Multivariable Mendelian randomization: the use of pleiotropic genetic variants to estimate causal...

47
Multivariable Mendelian randomization: the use of pleiotropic genetic variants to estimate causal effects Stephen Burgess * Simon G. Thompson Department of Public Health and Primary Care, University of Cambridge December 11, 2014 Running head: Multivariable Mendelian randomization [37 chars]. Key words: instrumental variables, causal inference, Mendelian randomization, pleiotropy, lipid fractions. Abbreviations: 2SLS, two-stage least squares; CHD, coronary heart disease; HDL-C, high-density lipoprotein cholesterol; LDL-C, low-density lipoprotein choles- terol. Acknowledgements: Dr. Stephen Burgess is supported by a fellowship from the Wellcome Trust (100114). The authors have no conflict of interest. * Address: Strangeways Research Laboratory, 2 Worts Causeway, Cambridge, CB1 8RN, UK. Telephone: +44 1223 748651. Fax: +44 1223 748658. Correspondence to: [email protected]. 1

Transcript of Multivariable Mendelian randomization: the use of pleiotropic genetic variants to estimate causal...

Multivariable Mendelian randomization: the use ofpleiotropic genetic variants to estimate causal effects

Stephen Burgess ∗ Simon G. Thompson

Department of Public Health and Primary Care,University of Cambridge

December 11, 2014

Running head: Multivariable Mendelian randomization [37 chars].

Key words: instrumental variables, causal inference, Mendelian randomization,

pleiotropy, lipid fractions.

Abbreviations: 2SLS, two-stage least squares; CHD, coronary heart disease;

HDL-C, high-density lipoprotein cholesterol; LDL-C, low-density lipoprotein choles-

terol.

Acknowledgements: Dr. Stephen Burgess is supported by a fellowship from the

Wellcome Trust (100114). The authors have no conflict of interest.

∗Address: Strangeways Research Laboratory, 2 Worts Causeway, Cambridge, CB1 8RN, UK.Telephone: +44 1223 748651. Fax: +44 1223 748658. Correspondence to: [email protected].

1

Abstract

A conventional Mendelian randomization analysis assesses the causal effect of a risk

factor on an outcome by using genetic variants solely associated with the risk factor of

interest as instrumental variables. However, in some cases, such as for triglycerides as

a cardiovascular risk factor, it may be difficult to find a relevant genetic variant that

is not also associated with related risk factors, such as other lipid fractions. Such a

variant is known as pleiotropic. In this paper, we propose an extension to Mendelian

randomization that uses multiple genetic variants associated with several measured

risk factors to simultaneously estimate the causal effects of each of the risk factors

on the outcome. This “multivariable Mendelian randomization” approach is similar

to the simultaneous assessment of several treatments in a factorial randomized trial.

Methods for estimating the causal effects are presented and compared using real and

simulated data, and the assumptions necessary for a valid multivariable Mendelian

randomization analysis are discussed. Subject to these assumptions, we demonstrate

that triglycerides-related pathways have a causal effect on the risk of coronary heart

disease independent of the effects of low-density lipoprotein cholesterol and high-

density lipoprotein cholesterol. (188 words)

2

Mendelian randomization employs genetic variants as instrumental variables to

estimate the causal effect of a risk factor on an outcome using observational data,

even in the presence of unmeasured confounding [1, 2]. A genetic variable is a valid

instrumental variable if:

i. the variant is associated with the risk factor of interest,

ii. the variant is not associated with any confounder of the risk factor–outcome

association,

iii. the variant is conditionally independent of the outcome given the risk factor and

confounders [3, 4].

These assumptions can be illustrated using a causal directed acyclic graph, dis-

playing a causal effect of one variable on another by an arrow and the absence of a

direct causal effect by the lack of an arrow (Figure 1) [5]. Although a genetic variant

need not be causally associated with the risk factor to be a valid instrumental vari-

able, we assume that there exists a causal variant for which the measured variant is

a proxy [6].

[Figure 1 should appear about here]

In order to avoid violations of the second and third instrumental variable assump-

tions, Mendelian randomization experiments have generally relied on genetic variants

which are associated with a single risk factor. In practice, however, many variants

are pleiotropic, that is associated with multiple risk factors. Indeed, in some cases,

there may be no variants which are solely associated with the risk factor of inter-

est, and a Mendelian randomization analysis cannot be performed without consid-

ering pleiotropic variants. In any case, it may desirable to include information on

pleiotropic variants in order to provide a more powerful analysis, provided that this

3

does not prejudice its validity. It may also be that multiple quantitative traits relat-

ing to the same risk factor are of interest; for example in cardiovascular disease, the

concentration of lipoprotein(a) and the size of lipoprotein(a) particles [7]. In this case,

the relative proportions of risk reduction associated with interventions separately tar-

geting lipoprotein(a) concentrations and the size of lipoprotein(a) particles may be of

interest, and the traits may be regarded as independent risk factors, even if the same

genetic variants influence both traits. The possibility of including multiple risk fac-

tors in an instrumental variable analysis is discussed in many econometric textbooks

[8], and applied instrumental variable analyses involving multiple risk factors have

been performed [9, 10], but we are unaware of application of the approach in genetic

epidemiology.

The context of this paper is that there are measurements on multiple genetic

variants and several associated risk factors, the causal effect of at least one of which

on the outcome is of interest. We assume that the genetic variants do not influence

the outcome via any pathway except those fully mediated by one of the measured

risk factors, or by some combination of the measured risk factors. Questions about

variants with potentially unmeasured or unknown pleiotropic associations are reserved

for the discussion. We initially discuss how pleiotropic associations may arise, and

the methods and necessary assumptions for estimating causal effects with several

risk factors. We demonstrate the use of these methods in an applied example, and

then construct a simulation study with parameters chosen to be similar to those in the

example to investigate how the methods perform. Finally, we discuss the application of

the methods in epidemiological practice and the interpretation of the applied example.

4

Methods

Mechanisms for association with multiple risk factors

There are several causal mechanisms by which a genetic variant may be associated

with multiple risk factors [11]. We divide the possible mechanisms into two cases

(Figure 2): (1) vertical pleiotropy, where a variant is associated with multiple risk

factors due to the causal effect of the primary risk factor on a secondary trait; and (2)

functional pleiotropy, where the genetic variant is associated with multiple pathways.

These two cases are not mutually exclusive; it is possible for both these to occur for

the same variant.

[Figure 2 should appear about here]

In the case of vertical pleiotropy, genetic variants associated with the primary risk

factor would be expected to show consistent associations with the secondary trait. For

example, genetic variants associated with higher body mass index would be expected

to show a consistent association with higher blood pressure. In this case, the causal

effect of body mass index on the outcome would include an indirect effect, mediated

through blood pressure, as well as a direct effect comprising all other pathways from

body mass index to the outcome that are not operating via blood pressure. If it is

assumed that the only pathway by which the genetic variants are associated with the

secondary trait is via the primary risk factor, then a simple Mendelian randomization

analysis would consistently estimate the causal effect of the primary risk factor on the

outcome in spite of the apparent pleiotropic association.

In the case of functional pleiotropy, we suppose there are multiple genetic variants

(at least as many variants as there are risk factors) which have different magnitudes

of effect on the risk factors. These genetic variants can be used to estimate the

causal effects of each risk factor even if none of the variants are specifically associated

5

with any one particular risk factor. Since Mendelian randomization is analogous to

a randomized trial [12], the use of genetic variants to assess the causal effects of

multiple risk factors in a single study is analogous to a factorial randomized trial

(Web Figure 1), where multiple randomized interventions are simultaneously assessed

[13]. We refer to such an analysis as “multivariable Mendelian randomization”.

Assumptions

For a multivariable Mendelian randomization analysis to be valid, the genetic variants

must satisfy a similar set of assumptions as a conventional instrumental variable, but

in this case the variants must be exclusively associated not with a single risk factor,

but with a set of measured risk factors. It is not necessary for each variant to be

associated with every risk factor in the set, but a variant cannot have associations

with the outcome except via the risk factors of interest. Specifically, for each variant,

we assume that:

i. the variant is associated with one or more of the risk factors,

ii. the variant is not associated with a confounder of any of the risk factor–outcome

associations,

iii. the variant is conditionally independent of the outcome given the risk factors

and confounders.

In order to define and interpret causal effect estimates, we initially assume that the

effect of each of the risk factors on the outcome is not mediated by another of the

risk factors: each risk factor can be intervened on independently of all the other

risk factors, and an intervention on one risk factor will not influence any other risk

factor. We refer to such risk factors as “causally independent”. A causal directed

acyclic graph corresponding to these assumptions with three genetic variants and two

6

risk factors is presented in Figure 3 (left). We later relax the assumption of causal

independence and allow causal effects between the risk factors, as in Figure 3 (right).

In this paper, we assume that all associations are linear.

[Figure 3 should appear about here]

Individual-level data: two-stage least squares method

If individual-level data are available on the genetic variants, risk factors and outcome,

causal effects of the risk factors on the outcome can be estimated using a two-stage

least squares (2SLS) approach [14]. The risk factors are regressed on the genetic vari-

ants in a multivariate linear regression (first stage; a multivariate multiple regression,

as there are multiple dependent variables and multiple independent variables), and

then the outcome is regressed linearly on the fitted values of each of the risk factors

(second stage; a univariate multiple regression, as there is one dependent variable

and multiple independent variables). An alternative model for the genetic association

with the risk factor could be proposed (such as one including interaction terms), but a

model which is additive and linear in the variants is used here for comparability with

the summarized data methods considered in the next section. Although a sequential

regression approach gives the correct point estimates, the use of 2SLS software (such

as the ivreg2 command in Stata [15]) is recommended for estimation in practice to

give correct standard errors [16]. Estimates from the method are valid even if the

genetic variants are in linkage disequilibrium.

Summarized data: likelihood-based method

If individual-level data are not available, but rather summarized (aggregated) data

on the beta-coefficients and standard errors for the associations between the genetic

variants and the risk factors and outcome from separate univariate regressions, then

7

the causal effects of the risk factors on the outcome can be estimated using a likelihood-

based method [17]. For example, if there are two risk factors X1 and X2, which each

have no causal effect on the other, a multivariate normal distribution can be assumed

for the beta-coefficients representing the genetic associations with each of the risk

factors X1, X2 and the outcome Y from univariate linear regressions. Specifically, we

assume that the estimate of association of genetic variant j, j = 1, . . . , J with X1 is

X1j with standard error σX1j, and similarly with X2 (X2j, standard error σX2j) and

with Y (Yj, standard error σY j):

X1j

X2j

Yj

∼ N3

ξ1j

ξ2j

β1ξ1j + β2ξ2j

,

σ2X1j ρ12σX1jσX2j ρ1Y σX1jσY j

ρ12σX1jσX2j σ2X2j ρ2Y σX2jσY j

ρ1Y σX1jσY j ρ2Y σX2jσY j σ2Y j

(1)

Estimates of the causal effects of X1 and X2 on Y (β1 and β2) can be obtained

by numerical maximization of this likelihood function, or by Bayesian methods [18].

If there are K risk factors, data are required on K + 1 beta-coefficients for each ge-

netic variant (X1j, . . . , XKj, Yj) and corresponding standard errors; there are K(J+1)

parameters in the model, and equation (1) is a (K + 1)-variate normal distribution.

If the outcome is binary, and the beta-coefficients for the genetic association with

the outcome represent log-relative risks or log-odds ratios, then the causal effect esti-

mates will represent log-relative risks or log-odds ratios respectively. The model for

the genetic associations with the outcome is linear in contributions from the genetic

associations with the risk factors.

The parameters ρ12, ρ1Y , ρ2Y represent the correlations between the beta-coefficients.

These will be non-zero if the beta-coefficients are derived from the same data. Al-

though these correlations can only be estimated from individual-level data, they

8

should be approximately equal to the observational correlations between the vari-

ables X1, X2, and Y . It is advisable to conduct a sensitivity analysis to assess the

impact of these parameter values on the causal estimates. If data on the associations

with the risk factors and outcome are obtained from separate data sources, then the

relevant correlations will be zero.

As the likelihood function comprises contributions from each variant, it is necessary

that the information on the causal parameters provided by each variant is indepen-

dent. Therefore the genetic variants used in a summarized data analysis must be

uncorrelated (not in linkage disequilibrium); otherwise confidence intervals estimated

by the method will be too narrow [17]. If the genetic variants are in linkage disequilib-

rium and the correlations between the variants are known, then these correlations can

be used in a modified likelihood-based model: the correlations between the genetic

variants are the same as the correlations between the beta-coefficients corresponding

to the genetic variants. If all the variants are correlated, then instead of a separate

(K +1)-variate normal distribution for each of the J genetic variants, we can employ

a J(K + 1)-variate normal distribution in (1) for all of the variants together.

Summarized data: regression-based method

A further method which has been proposed for the analysis of summarized data is

a linear regression-based approach, which gives estimates for each of the risk factors

separately [19]. This is performed in two stages. First, the beta-coefficients for the

genetic association with the outcome are regressed on the beta-coefficients for the

competing risk factors. Then, the residuals from the first regression are regressed on

the beta-coefficients for the risk factor of interest.

For example, if there are two risk factors X1 and X2 and we want to estimate the

effect of X1 on Y , the two stages are:

9

1. Regress the beta-coefficients Y1, Y2, . . . , YJ on the beta-coefficientsX21, X22, . . . , X2J

to obtain residuals Y = Y − β2X2.

2. Regress the residuals Y1, Y2, . . . , YJ on the beta-coefficients X11, X12, . . . , X1J .

The regression-based estimate is the regression coefficient for X1.

The intuitive rationale is that these residuals represent any causal effects not explained

by the alternative risk factors, but which are potentially explained by the risk factor

of interest. However, it is an ad hoc approach with no clear theoretical basis, and

which ignores the uncertainty in the beta-coefficients [20].

Example: causal effects of LDL-C, HDL-C and triglyc-

erides on CHD risk

The causal nature of the associations of various lipid fractions, including low-density

lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C) and

triglycerides, with the risk of coronary heart disease (CHD) is an issue with important

consequences for disease prevention and drug development strategies. Observational

studies have shown associations of LDL-C with increased CHD risk, of HDL-C with

decreased CHD risk, and a null association of triglycerides with CHD risk upon adjust-

ment for a number of risk factors including HDL-C and non-HDL-C concentrations,

systolic blood pressure and body mass index [21]. However, a causal interpretation to

these results may be misleading due to unmeasured confounding and the possibility

that some of the covariates adjusted for are on causal pathways, making their inclusion

in a regression model inappropriate. Efforts to elucidate causal relationships using ge-

netic variants in a Mendelian randomization approach have indicated that LDL-C has

a causal role in increasing the risk of CHD [22], and have suggested a null causal

effect of HDL-C on CHD risk [23]. However, the latter estimate had wide confidence

10

intervals, because only a few variants, those not associated with other lipid fractions,

were included in the analysis. The inability to find variants associated with triglyc-

erides and not associated with LDL-C or HDL-C has precluded reliable Mendelian

randomization investigations for triglycerides.

We address the question of the causal effects of LDL-C, HDL-C and triglycerides

on CHD risk by multivariable Mendelian randomization using published data. Wa-

terworth et al. reported genetic associations from univariate regressions for 28 genetic

variants with log-transformed LDL-C, HDL-C and triglycerides, and with the log odds

of CHD [24]. Details of the variants and the beta-coefficients for the associations are

given in Web Table 1. Figures 4 and 5 depict the associations of each of the variants

with the lipid fractions and CHD risk.

We combine these beta-coefficients in a multivariable Mendelian randomization

analysis using the likelihood-based and regression-based methods. Estimates using

the likelihood-based method were obtained in a Bayesian framework using WinBUGS

(http://www.mrc-bsu.cam.ac.uk/bugs). Technical details of the analysis and the code

used are provided in Web Appendix 1. A sensitivity analysis for the values of the

correlation parameters (ρ1Y , ρ2Y , . . .) is given in Web Table 2. Initially, we do not

account for linkage disequilibrium between the genetic variants, so that the analysis

methods can be more directly compared.

Using the likelihood-based method, the multivariable Mendelian randomization

analysis gives causal odds ratios for CHD of 0.50 (95% credible interval 0.40–0.62) per

30% reduction in LDL-C, 1.22 (95% credible interval 0.91–1.63) per 30% reduction in

HDL-C, and 0.77 (95% credible interval 0.68–0.87) per 30% reduction in triglycerides.

This suggests that reductions in LDL-C and in triglycerides are causally protective of

CHD. The causal effect for HDL-C is compatible with the null. The regression-based

method gives rather different results: the corresponding odds ratios with 95% confi-

11

dence intervals are 0.69 (0.51–0.92) for LDL-C, 1.25 (0.90–1.74) for HDL-C, and 0.92

(0.82–1.04) for triglycerides. In particular, the causal odds ratio for triglycerides from

the regression-based method does not reach the conventional threshold for statistical

significance. When linkage disequilibrium between the genetic variants is accounted

for, the likelihood-based method gives odds ratios with 95% credible intervals of 0.52

(0.42–0.65) for LDL-C, 1.28 (0.96–1.71) for HDL-C, and 0.78 (0.70–0.86) for triglyc-

erides.

[Figure 4 should appear about here]

[Figure 5 should appear about here]

Simulation study

In order to assess the statistical properties of the analysis methods used, we perform

a simulation study. The set-up corresponds to the example above.

We generate data for 30 000 individuals indexed by i on three risk factors (X1, X2, X3)

and an outcome (Y ) from the following data-generating model:

x1i =28∑j=1

αG1jgij + αU2u2i + αU3u3i + ϵX1i (2)

x2i =28∑j=1

αG2jgij + αU1u1i + αU3u3i + ϵX2i

x3i =28∑j=1

αG3jgij + αU1u1i + αU2u2i + ϵX3i

yi = βU1u1i + βU2u2i + βU3u3i + β1x1i + β2x2i + β3x3i + ϵY i

gij ∼ Binomial(2, 0.3) independently for each j = 1, . . . , 28

u1i, u2i, u3i ∼ N (0, 1) independently

ϵX1i, ϵX2i, ϵX3i, ϵY i ∼ N (0, 1) independently

12

We set the genetic association parameters αG1j, αG2j, αG3j for j = 1, . . . , 28 to

take the values in Web Table 1 for log-transformed LDL-C, HDL-C and triglycerides

respectively, to be similar to the applied example. The instrumental variables gij

are drawn from binomial distributions, representing independent single nucleotide

polymorphisms with minor allele frequencies of 0.3. The causal effects β1, β2, β3 are

set to 0.3, 0, and −0.1. The variables U1, U2, U3 represent confounders, leading

to correlations between X1, X2, X3 and Y . The parameters βU1, βU2, βU3 are each

fixed at 0.3 throughout, and the parameters αU1, αU2, αU3 are varied to take values

0.3 or −0.3, leading to 8 different scenarios. The mean R2 values, representing the

proportion of variation in each risk factor explained by the 28 instrumental variables

together, for X1, X2, and X3 were 0.6%, 0.5%, and 3.2% respectively, corresponding

to mean F statistics of 6.6, 5.2, and 35.4.

Estimates for 1000 datasets generated in each of the 8 scenarios considered were

derived using the 2SLS, likelihood-based, and regression-based methods. The Monte

Carlo standard errors for the mean estimates were around 0.003 and for the power

around 1%. The likelihood-based method was performed in a Bayesian framework

using WinBUGS; technical details of the analyses are provided in Web Appendix 2.

Table 1 reports for each scenario the mean estimate, the mean standard error, the

standard deviation of estimates, and the power to detect a non-zero effect at a nominal

5% significance level. For β2 = 0, the expected power is 5%. We see that the mean

estimates from the 2SLS and likelihood-based methods are close to the true values,

with some deviation depending on the direction of confounding. This may represent

the effect of weak instrument bias, corresponding to the low F-statistics above for

X1 and X2 [25]. The efficiency of the 2SLS and likelihood-based methods is similar,

despite the reliance of the likelihood-based method on only summarized data.

In contrast, estimates from the regression-based method are biased, although they

13

appear give approximately valid inference for the presence of a causal effect under the

null. However the power, especially to estimate β3, is much lower than from the other

methods. We therefore recommend the likelihood-based method for use in practice

when summarized data are available.

[Table 1 should appear about here]

Causal relationships between risk factors

In order to investigate the performance of the methods when there are causal rela-

tionships between the risk factors, we repeated the simulation but replaced the first

line by:

x1i =28∑j=1

αG1jgij + αU2u2i + αU3u3i + αX2x2i + αX3x3i + ϵX1i

The additional terms αX2x2i and αX3x3i represent causal effects of X2 and X3 (which

were evaluated first) on X1. We set αU1, αU2, αU3 = 0.3 (the first scenario consid-

ered above) and took 9 values of the parameters αX2 and αX3 (Table 2). All other

parameters were taken as in the original simulation study.

[Table 2 should appear about here]

Table 2 shows the mean estimates of the causal parameters from each of the

methods. Aside from the regression-based method, which gives widely varying results,

we see that the estimates do not change substantially as the parameters vary. This

indicates that the 2SLS and likelihood-based methods estimate the direct causal effect

of each risk factor on the outcome, not including paths via the other risk factors.

This can lead to misleading conclusions about the total effects of the variables. For

example, when αX2 = 0, αX3 = 0.5, the total causal effect of X3 on Y is β3+αX1β1 =

−0.1 + 0.5 × 0.3 = 0.05 (including the path via X1). The mean estimates from the

14

2SLS and likelihood-based methods are in the opposite direction to the true total

effect.

The differences between the estimated values of β1, β2, β3 in Table 2 and their

true values can be attributed to weak instrument bias. Weak instrument scenarios

will be common in multivariable Mendelian randomization, as it is necessary to use

multiple instrumental variables to estimate the different causal effects. If the genetic

associations with the risk factors and with the outcome are measured in the same

dataset, this will lead to bias in the direction of the observational association [25],

whereas if the genetic associations with the risk factors and with the outcome come

from different sources (known as a two-sample instrumental variable analysis), the

bias will be in the direction of the null [26]. To demonstrate this, we repeat the

simulation study in Web Appendix 3 with fewer non-weak instrumental variables.

The mean estimates of the causal effect parameters are very close to their true values

(Web Table 3). In this simulation context, we also explore the impact of interactions

between genetic variants in their effects on the risk factors. The likelihood-based

method is robust to these misspecifications of the analysis model (Web Table 4).

We also investigate a modification to the 2SLS method referred to as “sequential

adjustment” by Holmes et al. [27], in which the causal effects of each of the risk

factors are estimated in turn, and alternative risk factors are adjusted for as if they

are confounders. Web Tables 5 and 6 indicate that substantial bias in the sequential

adjustment method is evident even under the null, and its direction depends on the

unknown confounders.

A non-zero causal estimate from a multivariable Mendelian randomization ap-

proach when there are causal relationships between the risk factors implies that the

variable is an independent causal risk factor, in the sense that an intervention on the

variable keeping the other risk factors constant (the controlled direct effect) would

15

affect the outcome. However, the magnitude of the causal estimate may not represent

the total causal effect of the variable on the outcome.

Discussion

In this paper, we have introduced multivariable Mendelian randomization, an impor-

tant and practically relevant extension to the Mendelian randomization paradigm for

the estimation of causal effects using genetic variants associated with more than one

risk factor. For a valid analysis, the variants must satisfy a set of assumptions sim-

ilar to those for an instrumental variable in conventional Mendelian randomization,

but modified to take account of the multiple risk factors. A multivariable Mendelian

randomization analysis may be beneficial where genetic variants are associated with

several related risk factors, such as in the example with lipid fractions. It enables the

causal evaluation of a risk factor even if no variants are uniquely associated with it,

as for triglycerides.

There are several limitations to this approach, many of which are shared with

conventional Mendelian randomization [28, 29]. The specific association of a genetic

variant with a single risk factor may be a reasonable assumption if the function of

the genetic region where the variant is located is well-characterized. The assumption

of an exclusive association between genetic variants and a set of risk factors is un-

likely unless the risk factors have strong biological associations. However, if they are

strongly associated, an assumption that the risk factors are causally independent is

less plausible. Weak instrument bias, a phenomenon by which instrumental variable

estimates using variants not strongly associated with the risk factor of interest are

biased, may be substantial if large numbers of genetic variants are used [30], as may

be necessary in a multivariable Mendelian randomization experiment. The effects of

the risk factors on the outcome are assumed to be linear. While some researchers

16

do not view this as a crucial assumption, citing the interpretation of an instrumental

variable estimate as an average causal effect [6, 31], others have shown that depar-

tures from linearity can affect the findings of an instrumental variable analysis [32].

Our preference is to take a less literal view of causal estimates, and to emphasize the

outcome of a Mendelian randomization analysis as testing a causal effect, rather than

necessarily estimating a causal parameter. From this perspective, while the linearity

assumption is important, it is less important than the other instrumental variable

assumptions.

Although multivariable Mendelian randomization is able to allow for genetic vari-

ants with ‘measured’ pleiotropic associations, under the assumptions discussed in this

paper, it is unable to deal with unmeasured or unknown pleiotropy. If an apparent

causal finding is dependent on the association of a small number of variants with the

outcome, then the result may plausibly be due to pleiotropic variants rather than

being a true causal effect. However, if several variants in different genetic regions

all demonstrate consistent associations with the outcome, then it is perhaps unlikely

that all of these associations reflect pleiotropic mechanisms [33]. In the case of the

applied example in the paper, we constructed a lipid risk score for each variant by

multiplying the genetic associations with each lipid fraction by the estimate of the

lipid fraction’s causal effect on CHD risk; details are given in Web Appendix 4. Web

Figure 2 displays the lipid risk score plotted against the log odds ratio of CHD risk for

each variant. Aside from variant rs2304130, it seems that the estimated causal effect

of the lipid fractions on CHD risk is consistent across variants, and so unmeasured

pleiotropy is unlikely to explain the causal effects.

We have performed the applied analysis for the causal effects of lipid fractions

on CHD risk in this paper at face value, assuming that the instrumental variable

assumptions are satisfied. In reality, the assumptions of the division into only three

17

lipid categories, and the effects of the genetic variants being restricted to these lipid

fractions are over-simplifications. Some lipid fractions (for example, intermediate-

density lipoprotein cholesterol) are omitted in the analysis, and the variability of

particle size within the categories is ignored [34]. An assumption that the causal effect

of triglycerides on CHD risk is independent of those of LDL-C and HDL-C may not be

satisfied, particularly as evidenced by the attenuation of the observational association

of triglycerides with CHD risk upon adjustment for HDL-C and non-HDL-C [21] and

studies of the APOA5 gene [35]. Our simulations above have shown that, in the case

of causal effects between risk factors, estimates represent the direct causal effect of

each risk factor on the outcome by a pathway that is not operating via the other risk

factors. This may not equal the total causal effect of the risk factor, but it provides

important evidence on the independent causal effect of the risk factor. Finally, the

estimated causal odds ratio for a 30% decrease in LDL-C reported in this manuscript is

surprisingly large, compared to both estimates of the effect of statin usage, which also

reduces LDL-C by around 30% [22], and to a Mendelian randomization analysis using

variants solely associated with LDL-C [17]. A nuanced interpretation of Mendelian

randomization estimates, and of multivariable Mendelian randomization estimates in

particular, is required in the light of the uncertainty of the underlying assumptions in

any applied analysis. These aspects are considered in more detail elsewhere [20].

In conclusion, the findings of this manuscript provide some evidence of a causal

effect of triglyceride-related pathways on CHD risk independent of the effects of LDL-

C and HDL-C, but the weight of evidence attributed to the findings is a matter

of interpretation depending on the degree of validity attributed to the instrumental

variable assumptions.

18

References

[1] Davey Smith G, Ebrahim S. ‘Mendelian randomization’: can genetic epidemiol-

ogy contribute to understanding environmental determinants of disease? Inter-

national Journal of Epidemiology 2003;32(1):1–22.

[2] Lawlor D, Harbord R, Sterne J, Timpson N, Davey Smith G. Mendelian random-

ization: using genes as instruments for making causal inferences in epidemiology.

Statistics in Medicine 2008;27(8):1133–1163.

[3] Greenland S. An introduction to instrumental variables for epidemiologists. In-

ternational Journal of Epidemiology 2000;29(4):722–729.

[4] Martens E, Pestman W, de Boer A, Belitser S, Klungel O. Instrumental variables:

application and limitations. Epidemiology 2006;17(3):260–267.

[5] Didelez V, Sheehan N. Mendelian randomization as an instrumental variable

approach to causal inference. Statistical Methods in Medical Research 2007;

16(4):309–330.

[6] Hernan M, Robins J. Instruments for causal inference: an epidemiologist’s

dream? Epidemiology 2006;17(4):360–372.

[7] Clarke R, Peden J, Hopewell J, Kyriakou T, Goel A, Heath S, et al. Genetic vari-

ants associated with Lp(a) lipoprotein level and coronary disease. New England

Journal of Medicine 2009;361(26):2518–2528.

[8] Wooldridge J. Introductory econometrics: A modern approach. South-Western,

Nashville, TN, 2009.

19

[9] Angrist JD. Instrumental variables methods in experimental criminological re-

search: what, why and how. Journal of Experimental Criminology 2006;2(1):23–

44.

[10] Ludwig J, Kling JR. Is crime contagious? Journal of Law and Economics 2007;

50(3):491–518.

[11] Hodgkin J. Seven types of pleiotropy. International Journal of Developmental

Biology 1998;42(3):501–505.

[12] Hingorani A, Humphries S. Nature’s randomised trials. The Lancet 2005;

366(9501):1906–1908.

[13] Stampfer M, Buring J, Willett W, Rosner B, Eberlein K, Hennekens C. The 2

× 2 factorial design: Its application to a randomized trial of aspirin and U.S.

physicians. Statistics in Medicine 1985;4(2):111–116.

[14] Baum C, Schaffer M, Stillman S. Instrumental variables and GMM: Estimation

and testing. Stata Journal 2003;3(1):1–31.

[15] StataCorp. Stata Statistical Software: Release 12. College Station, TX, 2011.

[16] Angrist J, Pischke J. Mostly harmless econometrics: an empiricist’s companion.

Chapter 4: Instrumental variables in action: sometimes you get what you need.

Princeton University Press, 2009.

[17] Burgess S, Butterworth A, Thompson S. Mendelian randomization analysis with

multiple genetic variants using summarized data. Genetic Epidemiology 2013;

37(7):658–665.

20

[18] Thompson J, Minelli C, Abrams K, Tobin M, Riley R. Meta-analysis of genetic

studies using Mendelian randomization – a multivariate approach. Statistics in

Medicine 2005;24(14):2241–2254.

[19] Do R, Willer CJ, Schmidt EM, Sengupta S, Gao C, Peloso GM, et al. Common

variants associated with plasma triglycerides and risk for coronary artery disease.

Nature Genetics 2013;45:1345–1352.

[20] Burgess S, Freitag D, Khan H, Gorman D, Thompson S. Using multivariable

Mendelian randomization to disentangle the causal effects of lipid fractions. PLoS

One 2014;available online before print.

[21] Di Angelantonio E, Sarwar N, Perry P, Kaptoge S, Ray KK, Thompson A, et al.

Major lipids, apolipoproteins, and risk of vascular disease. Journal of the Amer-

ican Medical Association 2009;302(18):1993–2000.

[22] Burgess S, Butterworth A, Malarstig A, Thompson S. Use of Mendelian randomi-

sation to assess potential benefit of clinical intervention. British Medical Journal

2012;345:e7325.

[23] Voight B, Peloso G, Orho-Melander M, Frikke-Schmidt R, Barbalic M, Jensen

M, et al. Plasma HDL cholesterol and risk of myocardial infarction: a mendelian

randomisation study. The Lancet 2012;380(9841):572–580.

[24] Waterworth D, Ricketts S, Song K, Chen L, Zhao J, Ripatti S, et al. Genetic

variants influencing circulating lipid levels and risk of coronary artery disease.

Arteriosclerosis, Thrombosis, and Vascular Biology 2010;30(11):2264–2276.

[25] Burgess S, Thompson S. Bias in causal estimates from Mendelian randomization

studies with weak instruments. Statistics in Medicine 2011;30(11):1312–1323.

21

[26] Pierce B, Burgess S. Efficient design for Mendelian randomization studies: sub-

sample and two-sample instrumental variable estimators. American Journal of

Epidemiology 2013;178(7):1177–1184.

[27] Holmes MV, Asselbergs FW, Palmer TM, Drenos F, Lanktree MB, Nelson CP,

et al. Mendelian randomization of blood lipids for coronary heart disease. Eu-

ropean Heart Journal 2014;[Available online ahead of print, January 27, 2014]

DOI: 10.1093/eurheartj/eht571.

[28] Davey Smith G, Ebrahim S. Mendelian randomization: prospects, potentials,

and limitations. International Journal of Epidemiology 2004;33(1):30–42.

[29] Schatzkin A, Abnet C, Cross A, Gunter M, Pfeiffer R, Gail M, et al. Mendelian

randomization: how it can – and cannot – help confirm causal relations between

nutrition and cancer. Cancer Prevention Research 2009;2(2):104–113.

[30] Burgess S, Thompson S, CRP CHD Genetics Collaboration. Avoiding bias from

weak instruments in Mendelian randomization studies. International Journal of

Epidemiology 2011;40(3):755–764.

[31] Angrist J, Pischke JS. The credibility revolution in empirical economics: How

better research design is taking the con out of econometrics. The Journal of

Economic Perspectives 2010;24(2):3–30.

[32] Mogstad M, Wiswall M. Linearity in instrumental variables estimation: Problems

and solutions. Tech. rep., Forschungsinstitut zur Zukunft der Arbeit. Bonn,

Germany., 2010.

[33] Davey Smith G. Random allocation in observational data: how small but ro-

bust effects could facilitate hypothesis-free causal inference. Epidemiology 2011;

22(4):460–463.

22

[34] Wurtz P, Kangas AJ, Soininen P, Lehtimaki T, Kahonen M, Viikari JS, et al.

Lipoprotein subclass profiling reveals pleiotropy in the genetic variants of lipid

risk factors for coronary heart disease: a note on Mendelian randomization stud-

ies. Journal of the American College of Cardiology 2013;62(20):1906–1908.

[35] Triglyceride Coronary Disease Genetics Consortium, Emerging Risk Factors Col-

laboration. Triglyceride-mediated pathways and coronary disease: collaborative

analysis of 101 studies. The Lancet 2010;375(9726):1634–1639.

23

Tables

24

Tab

le1:

Resultsfrom

simulation

studywithou

tcausalrelation

shipsbetweenrisk

factors

Estim

ates

ofβ1=

0.3

Two-stageleastsquares

(2SLS)

Likelihood-basedmethod

Regression-basedmethod

αU1

αU2

αU3

Mean

SE

SD

Pow

er(%

)Mean

SE

SD

Pow

er(%

)Mean

SE

SD

Pow

er(%

)

0.3

0.3

0.3

0.318

0.090

0.091

93.0%

0.317

0.092

0.091

91.2%

0.224

0.078

0.070

85.1%

0.3

0.3

-0.3

0.290

0.090

0.091

88.8%

0.290

0.092

0.092

86.9%

0.208

0.078

0.070

77.7%

0.3

-0.3

0.3

0.296

0.090

0.089

90.2%

0.296

0.092

0.090

89.3%

0.211

0.078

0.069

81.7%

0.3

-0.3

-0.3

0.271

0.090

0.089

86.7%

0.272

0.092

0.090

85.3%

0.193

0.077

0.067

74.9%

-0.3

0.3

0.3

0.329

0.091

0.092

93.1%

0.328

0.092

0.093

91.7%

0.234

0.079

0.073

85.9%

-0.3

0.3

-0.3

0.305

0.090

0.090

92.1%

0.304

0.092

0.091

91.0%

0.220

0.078

0.069

85.0%

-0.3

-0.3

0.3

0.309

0.090

0.086

93.6%

0.309

0.092

0.086

92.1%

0.221

0.079

0.068

84.4%

-0.3

-0.3

-0.3

0.283

0.090

0.088

88.1%

0.283

0.092

0.089

86.7%

0.204

0.078

0.068

77.9%

Estim

ates

ofβ2=

0

0.3

0.3

0.3

0.055

0.110

0.111

7.7

0.054

0.113

0.111

7.8

0.035

0.085

0.069

4.2

0.3

0.3

-0.3

0.010

0.111

0.113

5.9

0.010

0.114

0.114

6.1

0.007

0.085

0.069

2.2

0.3

-0.3

0.3

0.040

0.111

0.112

6.2

0.040

0.113

0.113

6.4

0.027

0.085

0.070

2.8

0.3

-0.3

-0.3

0.001

0.111

0.111

4.1

0.001

0.114

0.112

4.8

0.002

0.085

0.069

2.2

-0.3

0.3

0.3

0.001

0.111

0.112

3.9

0.001

0.114

0.113

4.7

-0.001

0.085

0.069

1.7

-0.3

0.3

-0.3

-0.050

0.111

0.107

6.6

-0.049

0.113

0.108

6.8

-0.033

0.085

0.067

2.8

-0.3

-0.3

0.3

-0.008

0.111

0.113

5.6

-0.007

0.114

0.115

6.0

-0.006

0.086

0.071

2.6

-0.3

-0.3

-0.3

-0.045

0.110

0.111

7.6

-0.045

0.112

0.112

7.1

-0.031

0.085

0.070

2.9

Estim

ates

ofβ3=

−0.1

0.3

0.3

0.3

-0.087

0.047

0.045

47.4

-0.087

0.047

0.045

46.4

-0.039

0.033

0.023

13.0

0.3

0.3

-0.3

-0.090

0.047

0.049

49.3

-0.090

0.048

0.049

49.8

-0.041

0.033

0.024

16.7

0.3

-0.3

0.3

-0.090

0.047

0.045

49.2

-0.090

0.047

0.046

47.8

-0.041

0.033

0.023

14.3

0.3

-0.3

-0.3

-0.094

0.047

0.047

52.0

-0.094

0.048

0.047

49.7

-0.043

0.033

0.023

16.9

-0.3

0.3

0.3

-0.106

0.047

0.048

61.2

-0.106

0.048

0.048

58.6

-0.049

0.033

0.025

25.8

-0.3

0.3

-0.3

-0.111

0.047

0.046

66.2

-0.110

0.048

0.047

62.6

-0.052

0.033

0.024

28.2

-0.3

-0.3

0.3

-0.105

0.047

0.044

63.5

-0.105

0.048

0.045

59.7

-0.049

0.034

0.023

24.8

-0.3

-0.3

-0.3

-0.111

0.047

0.046

66.4

-0.110

0.047

0.047

63.1

-0.052

0.033

0.024

28.9

Pow

eristheem

pirical

pow

erto

detectacausaleff

ectat

anom

inal

5%sign

ificance

level.Results(m

eanestimates,meanstan

dard

errors,stan

dard

deviationsof

estimates,an

dpow

er)aregiven

usingthreean

alysismethods(two-stag

eleastsquares,likelihood-based

,an

dregression-basedmethods)

toestimatecausaleff

ects

ofX

1on

Y(β

1=

0.3),ofX

2on

Y(β

2=

0),an

dof

X3on

Y(β

3=

0.1)

Abbreviation

s:SD,stan

darddeviation

;SE,stan

darderror

25

Table 2: Results from simulation study with causal relationships between risk factorsTwo-stage least squares Likelihood-based Regression-based

αX2 αX3 β1 β2 β3 β1 β2 β3 β1 β2 β30 0 0.318 0.055 -0.087 0.317 0.054 -0.087 0.224 0.035 -0.0390.5 0 0.322 0.038 -0.090 0.321 0.038 -0.090 0.229 0.020 -0.041−0.5 0 0.318 0.059 -0.089 0.318 0.058 -0.089 0.164 0.034 -0.0410 0.5 0.317 0.046 -0.097 0.316 0.045 -0.097 0.064 0.030 -0.0160 −0.5 0.321 0.048 -0.079 0.320 0.048 -0.079 0.192 0.030 -0.0370.5 0.5 0.316 0.041 -0.097 0.315 0.041 -0.097 0.077 0.021 -0.016−0.5 0.5 0.318 0.060 -0.098 0.318 0.060 -0.098 0.049 0.036 -0.0160.5 −0.5 0.318 0.042 -0.080 0.317 0.042 -0.081 0.125 0.022 -0.038−0.5 −0.5 0.317 0.057 -0.079 0.316 0.056 -0.080 0.240 0.034 -0.037

26

Figure captions

Figure 1: Causal directed acyclic graph of Mendelian randomization assumptions for

variant G with risk factor X in confounded association with outcome Y . Confounders

represented by U are assumed to be unknown.

Figure 2: Causal directed acyclic graphs of associations between variant G, risk

factors X1, X2, and outcome Y illustrating A) vertical and B) functional pleiotropy.

Figure 3: Causal directed acyclic graph of associations between variantsG1, G2, G3,

risk factors X1, X2, and outcome Y illustrating multivariable Mendelian randomiza-

tion. Confounders U1 and U2 are assumed to be unknown. A) Risk factors are causally

independent (no causal effects between X1 and X2); B) Risk factors are causally de-

pendent (X1 has a causal effect on X2).

Figure 4: Association of coronary heart disease (CHD) risk-increasing alleles of

28 genetic variants with all possible pairings of low-density lipoprotein cholesterol

(LDL-C), high-density lipoprotein cholesterol (HDL-C), and triglycerides (color and

size of points: darker points correspond to stronger associations with CHD risk, larger

points correspond to more precise estimates). Note that some points are overlapping.

Figure 5: Association of coronary heart disease (CHD) risk-increasing alleles

of 28 genetic variants with odds of CHD, and with each of low-density lipoprotein

cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C) and triglycerides

in turn (color and size of points: darker points correspond to stronger associations

with CHD risk, larger points correspond to more precise estimates). Note that some

points are overlapping.

27

Figures

G X

U

YFigure 1: Causal directed acyclic graph of Mendelian randomization assumptions forvariant G with risk factor X in confounded association with outcome Y . Confoundersrepresented by U are assumed to be unknown.

X1G Y

G X1 X2 Y1)

2)

X2

Figure 2: Causal directed acyclic graphs of associations between variant G, risk factorsX1, X2, and outcome Y illustrating A) vertical and B) functional pleiotropy.

28

G1

X1

U1

YX2

G2

G3 U2

G1

X1

U1

YX2

G2

G3 U2

Figure 3: Causal directed acyclic graph of associations between variants G1, G2, G3,risk factors X1, X2, and outcome Y illustrating multivariable Mendelian randomiza-tion. Confounders U1 and U2 are assumed to be unknown. A) Risk factors are causallyindependent (no causal effects between X1 and X2); B) Risk factors are causally de-pendent (X1 has a causal effect on X2).

29

Percentage change in LDL−C per allele

Per

cent

age

chan

ge in

trig

lyer

ides

per

alle

le

−5%

0+

5%+

10%

+15

%

−2% 0 +2% +4% +6%

Percentage change in HDL−C per allele

Per

cent

age

chan

ge in

trig

lyer

ides

per

alle

le

−5%

0+

5%+

10%

+15

%

−4% −2% 0 +2%

Percentage change in LDL−C per allele

Per

cent

age

chan

ge in

HD

L−C

per

alle

le

−2% 0 +2% +4% +6%

−4%

−2%

0+

2%

Figure 4: Association of coronary heart disease (CHD) risk-increasing alleles of 28 ge-netic variants with all possible pairings of low-density lipoprotein cholesterol (LDL-C),high-density lipoprotein cholesterol (HDL-C), and triglycerides (brightness and sizeof points: brighter points correspond to stronger associations with CHD risk, largerpoints correspond to more precise estimates). Note that some points are overlapping.

30

Percentage change in LDL−C per allele

Log

odds

of C

HD

per

alle

le

1.0

1.05

1.1

1.15

1.2

−2% 0 +2% +4% +6%

Percentage change in HDL−C per allele

Log

odds

of C

HD

per

alle

le

−4% −2% 0 +2%

1.0

1.05

1.1

1.15

1.2

Percentage change in triglyerides per allele

Log

odds

of C

HD

per

alle

le

−5% 0 +5% +10% +15%

1.0

1.05

1.1

1.15

1.2

Figure 5: Association of coronary heart disease (CHD) risk-increasing alleles of 28genetic variants with odds of CHD, and with each of low-density lipoprotein choles-terol (LDL-C), high-density lipoprotein cholesterol (HDL-C) and triglycerides in turn(brightness and size of points: brighter points correspond to stronger associationswith CHD risk, larger points correspond to more precise estimates). Note that somepoints are overlapping.

31

���������

������ �������

������ �������

������ ���������

������ �������

�����������������

���������

�����������

����������

��������������������

����������

������� ����

�����������

���

��������� �������

������ ����

�������������

��������� ������� ��

������������� �������

��

����

� ! "�#�� �����

������$

����

� ! "� ���%

%�&%��#��

���&%���%�&%��$

�������

''

��''

� ! "� ���&%���

%�&%��#�����%

%�&%��$

��''

� ! "� ���%

%�&%��#�����%

%�&%��$

Web Figure 1: Analogy between a factorial randomized trial with two treatments (Aand B) and multivariable Mendelian randomization with two genetic variants (1 and2). The minor allele of genetic variant 1 (a) has a large effect on LDL-C and a modesteffect on triglycerides; the minor allele of genetic variant 2 (b) has a large effect ontriglycerides and a modest effect on LDL-C.

Abbreviations: LDL-C = low-density lipoprotein cholesterol, TG = triglycerides.

1

Web Appendix 1

Supplementary methods for applied example

Web Table 1 gives the estimates of the associations between each of 28 genetic variants

and low-density lipoprotein cholesterol (LDL-C), triglycerides, high-density lipopro-

tein cholesterol (HDL-C), and the log odds ratio of coronary heart disease (CHD)

reported by Waterworth et al. [1].

Estimates for the analysis from the likelihood-based method were obtained in a

Bayesian framework using WinBUGS (http://www.mrc-bsu.cam.ac.uk/bugs). Direct

maximization of the likelihood is impractical in this case, as there are 28× 3+3 = 87

parameters to optimize over. In the likelihood function, the correlation between the

beta-coefficients for LDL-C and HDL-C was taken as −0.1, for LDL-C and triglyc-

erides 0.2, for LDL-C and CHD risk 0.1, for HDL-C and triglycerides −0.1, for HDL-C

and CHD risk −0.1, and for triglycerides and CHD risk 0.1. A sensitivity analysis for

these values is given in Web Table 2. Normal priors with mean 0 and variance 102 are

used for each of the unknown parameters. The WinBUGS code used to implement

the likelihood-based analysis is:

model {

beta1 ~ dnorm(0, 0.01) # prior for causal effect of LDL-C

beta2 ~ dnorm(0, 0.01) # prior for causal effect of HDL-C

beta3 ~ dnorm(0, 0.01) # prior for causal effect of triglycerides

for (i in 1:28) { # i indexes the 28 variants

Tau0[i,1:4,1:4] <- inverse(Sigma0[i,1:4,1:4])

# Sigma0 is the variance-covariance matrix for the multivariate

# distribution of the beta-coefficients

2

x[i,1:4] ~ dmnorm(xi[i,1:4], Tau0[i,1:4,1:4])

# x[i, 1:4] are the beta-coefficients for genetic variant i

# with LDL-C, HDL-C, triglycerides, and CHD risk

xi[i,1] ~ dnorm(0, 0.01)

# prior for mean association of genetic variant i with LDL-C

xi[i,2] ~ dnorm(0, 0.01)

# prior for mean association of genetic variant i with HDL-C

xi[i,3] ~ dnorm(0, 0.01)

# prior for mean association of genetic variant i with triglycerides

xi[i,4] <- beta1*xi[i,1] + beta2*xi[i,2] + beta3*xi[i,3]

# mean association of genetic variant i with CHD risk

}

}

The R code for the regression-based based analysis is:

for (j in 1:28) {

bx1 = lm(x1~gene[,j])$coef[2]

bx2 = lm(x2~gene[,j])$coef[2]

bx3 = lm(x3~gene[,j])$coef[2]

by = lm(y ~gene[,j])$coef[2]

bx1se = summary(lm(x1~gene[,j]))$coef[2,2]

bx2se = summary(lm(x2~gene[,j]))$coef[2,2]

bx3se = summary(lm(x3~gene[,j]))$coef[2,2]

byse = summary(lm(y ~gene[,j]))$coef[2,2]

}

beta1 = lm(lm(by~bx2+bx3)$res~bx1)$coef[2]

beta2 = lm(lm(by~bx1+bx3)$res~bx2)$coef[2]

beta3 = lm(lm(by~bx1+bx2)$res~bx3)$coef[2]

beta1se = summary(lm(lm(by~bx2+bx3)$res~bx1))$coef[2,2]

beta2se = summary(lm(lm(by~bx1+bx3)$res~bx2))$coef[2,2]

beta3se = summary(lm(lm(by~bx1+bx2)$res~bx3))$coef[2,2]

Estimates of the correlations between genetic variants were taken from the 1000

Genomes Pilot 1 dataset and obtained from the SNP Annotation and Proxy Search

(SNAP; http://www.broadinstitute.org/mpg/snap/ldsearchpw.php).

3

Sensitivity analysis for correlation parameters in likelihood-

based method

In order to assess the sensitivity of the causal estimates in the applied example to

the choice of correlation parameters, we performed a sensitivity analysis, estimating

the causal effects of LDL-C, HDL-C and triglycerides on CHD risk for a number of

different choices of the parameters. Web Table 2 displays the causal estimates for the

parameter values used in the main analysis (shown in italics), as well as for values 2,

1.5, 0.5, 0, and −1 times these values. We see that the causal estimates and standard

errors are fairly robust to different values of these parameters.

4

Genetic variant XL (SE) XT (SE) XH (SE) Y (SE)

rs11206510 0.026 (0.004) 0.016 (0.008) 0.002 (0.004) 0.068 (0.029)rs660240 -0.044 (0.004) -0.004 (0.008) 0.005 (0.004) -0.162 (0.030)rs515135 -0.038 (0.004) -0.009 (0.008) 0.003 (0.004) -0.105 (0.031)rs12916 -0.023 (0.003) 0.003 (0.006) 0.001 (0.003) -0.062 (0.024)rs2954021 -0.017 (0.003) -0.039 (0.006) 0.011 (0.003) -0.083 (0.022)rs1558861 -0.031 (0.006) -0.142 (0.012) 0.031 (0.006) -0.128 (0.067)rs2738459 -0.018 (0.004) 0.007 (0.007) -0.003 (0.004) -0.041 (0.037)rs10401969 0.046 (0.007) 0.095 (0.013) -0.007 (0.006) 0.077 (0.054)rs4420638 0.059 (0.004) 0.042 (0.008) -0.021 (0.004) 0.157 (0.031)rs10489615 0.004 (0.003) -0.023 (0.006) 0.018 (0.003) -0.030 (0.024)rs11902417 0.011 (0.004) 0.036 (0.007) -0.017 (0.003) 0.010 (0.028)

rs325 -0.005 (0.005) 0.097 (0.010) -0.047 (0.005) 0.182 (0.040)rs3890182 0.004 (0.005) 0.013 (0.009) 0.022 (0.004) -0.041 (0.034)rs964184 0.022 (0.005) 0.142 (0.009) -0.029 (0.004) 0.199 (0.034)rs9943753 -0.005 (0.004) -0.005 (0.007) 0.016 (0.003) 0.010 (0.038)rs261334 -0.002 (0.004) 0.019 (0.007) 0.034 (0.004) 0.049 (0.029)rs9989419 -0.002 (0.003) 0.003 (0.006) 0.035 (0.003) 0.010 (0.025)rs12449157 0.004 (0.004) -0.018 (0.008) 0.019 (0.004) -0.041 (0.032)rs2156552 0.011 (0.004) -0.020 (0.008) 0.028 (0.004) -0.030 (0.032)rs1168013 0.009 (0.003) 0.035 (0.007) 0.001 (0.003) -0.041 (0.024)rs6544366 -0.011 (0.004) -0.036 (0.007) 0.016 (0.003) -0.020 (0.028)rs1260333 -0.003 (0.003) -0.054 (0.006) 0.005 (0.003) -0.062 (0.022)rs1178979 -0.012 (0.004) 0.054 (0.008) -0.010 (0.004) 0.030 (0.030)rs10105606 0.003 (0.003) 0.067 (0.006) -0.023 (0.003) 0.068 (0.024)rs2954029 -0.015 (0.003) -0.040 (0.006) 0.012 (0.003) -0.073 (0.022)rs4938303 -0.008 (0.004) -0.067 (0.007) 0.018 (0.003) -0.073 (0.025)rs16965220 0.009 (0.003) 0.028 (0.006) -0.006 (0.003) 0.000 (0.026)rs2304130 -0.036 (0.007) -0.070 (0.013) 0.004 (0.006) 0.020 (0.065)

Web Table 1: Association of 28 genetic variants with log-transformed low-densitylipoprotein cholesterol (XL), log-transformed triglycerides (XT ), log-transformedhigh-density lipoprotein cholesterol (XH) and log odds ratio of coronary heart dis-ease (Y ) with standard errors (SE) taken from Waterworth et al. [1]

5

ρLH ρLT ρLY ρHT ρHY ρTY βLDL−C (SE) βHDL−C (SE) βTG (SE)

-0.2 0.4 0.2 -0.2 -0.2 0.2 1.96 (0.31) -0.53 (0.39) 0.74 (0.16)-0.15 0.3 0.15 -0.15 -0.15 0.15 1.96 (0.31) -0.55 (0.40) 0.74 (0.17)-0.1 0.2 0.1 -0.1 -0.1 0.1 1.96 (0.32) -0.56 (0.41) 0.73 (0.17)-0.05 0.1 0.05 -0.05 -0.05 0.05 1.97 (0.32) -0.57 (0.42) 0.73 (0.17)0 0 0 0 0 0 1.97 (0.33) -0.58 (0.42) 0.73 (0.18)0.1 -0.2 -0.1 0.1 0.1 -0.1 1.97 (0.34) -0.59 (0.44) 0.73 (0.18)

Web Table 2: Sensitivity analysis for the correlation parameters between beta-coefficients for genetic associations with low-density lipoprotein cholesterol (L), high-density lipoprotein cholesterol (H), triglycerides (T ) and CHD risk (Y ) in likelihood-based method for multivariable Mendelian randomization (original analysis shown initalics)

6

Web Appendix 2

Supplementary methods for simulation study

For the simulation studies, in the likelihood-based method, we regard the mean and

standard deviation of the posterior distribution as the ‘estimate’ and ‘standard error’,

and the 2.5th to 97.5th percentile range as the ‘95% confidence interval’. The obser-

vational correlations between the variables estimated in the individual-level data were

used for the correlation parameters. Normal priors with mean zero and variance 102

were used for each of the unknown parameters.

7

Web Appendix 3

Additional simulation studies with non-weak instrumental vari-

ables

Using non-weak instrumental variables

In order to show that the differences between estimates in Tables 1 and 2 from the

main text and the true values of the parameters for the two-stage least squares (2SLS)

and likelihood-based methods are due to weak instruments, we repeat the simulation

but using only 5 instrumental variables having strong associations with the risk factor.

The data-generating model is:

x1i =5∑

j=1

αG1jgij + αU2u2i + αU3u3i + αX2x2i + αX3x3i + ϵX1i (3)

x2i =5∑

j=1

αG2jgij + αU1u1i + αU3u3i + ϵX2i

x3i =5∑

j=1

αG3jgij + αU1u1i + αU2u2i + ϵX3i

yi = βU1u1i + βU2u2i + βU3u3i + β1x1i + β2x2i + β3x3i + ϵY i

gij ∼ Binomial(2, 0.3) independently for each j = 1, . . . , 5

u1i, u2i, u3i ∼ N (0, 1) independently

ϵX1i, ϵX2i, ϵX3i, ϵY i ∼ N (0, 1) independently

The coefficients αGkj for risk factor k and genetic variant j are taken as:

8

Genetic variant (j) αG1j αG2j αG3j

1 0.1 0.3 0.5

2 0.2 0.4 0.1

3 0.3 0.5 0.2

4 0.4 0.1 0.3

5 0.5 0.2 0.4

We set αU1, αU2, αU3 = 0.3 and take 9 values of the parameters αX2 and αX3 as

in the second set of simulations in the main paper (the section ‘Causal relationships

between risk factors’). All other parameters are taken as in the main paper. The

average F statistics for the associations of the genetic variants with each of risk factors

are around 1180.

Results are displayed in Web Table 3. We see that there is much less difference

between the estimated and true values of the causal effect parameters for the two-stage

least squares and likelihood-based methods with non-weak instrumental variables.

Two-stage least squares Likelihood-based

αX2 αX3 β1 β2 β3 β1 β2 β30 0 0.299 0.000 -0.099 0.299 0.000 -0.0990.5 0 0.300 0.000 -0.099 0.300 0.000 -0.100−0.5 0 0.300 0.000 -0.100 0.300 0.000 -0.1000 0.5 0.299 0.001 -0.100 0.299 0.001 -0.1000 −0.5 0.301 0.000 -0.102 0.301 0.000 -0.1020.5 0.5 0.301 -0.001 -0.100 0.301 -0.001 -0.100−0.5 0.5 0.301 0.000 -0.100 0.301 0.000 -0.1000.5 −0.5 0.300 0.000 -0.100 0.300 0.000 -0.100−0.5 −0.5 0.299 0.000 -0.100 0.299 0.000 -0.100

Web Table 3: Mean estimates of the causal effects of X1 (true direct effect β1 = 0.3),X2 (β2 = 0), and X3 (β3 = −0.1) using two-stage least squares and likelihood-basedmethods from simulation study of multivariable Mendelian randomization with causalrelationships between the risk factors and non-weak genetic instrumental variables

9

Allowing interaction terms in the genetic associations with the risk factors

To investigate the impact of interactions between the effects of genetic variants on

estimates from the two-stage least squares and likelihood-based methods, we extend

the above simulation (3) by introducing interaction terms in the genetic associations

with the risk factors into the data-generating model:

x1i =5∑

j=1

αG1jgij +∑j1>j2

αG1j1j2gij1gij2 + αU2u2i + αU3u3i + ϵX1i

x2i =5∑

j=1

αG2jgij +∑j1>j2

αG2j1j2gij1gij2 + αU1u1i + αU3u3i + ϵX2i

x3i =5∑

j=1

αG3jgij +∑j1>j2

αG3j1j2gij1gij2 + αU1u1i + αU2u2i + ϵX3i

Effects between the risk factors (αX2, αX3) are set to zero so that the risk factors are

causally independent. The interaction terms αG1j1j2 , αG2j1j2 , αG3j1j2 are drawn from

a normal distribution with mean αG× and variance σαG× . We take σαG× = 0.1 and

consider four different values of αG× = 0.1, 0.05,−0.05,−0.1. We analyze the data

using the two-stage least squares and likelihood-based methods as previously, not

including the interaction terms in the analysis model.

Results are displayed in Web Table 4. Estimates do not change substantially

despite large interaction terms (compared with the main effect terms) resulting in

misspecification of the analysis models. We see that the two-stage least squares and

likelihood-based methods are robust to interactions between the genetic variants in

their associations with the risk factors, even if these interactions are ignored in the

analysis model.

10

Two-stage least squares Likelihood-based

αG× β1 β2 β3 β1 β2 β30.1 0.300 0.000 -0.100 0.301 0.000 -0.1000.05 0.298 0.000 -0.098 0.299 0.000 -0.098−0.05 0.300 0.002 -0.097 0.300 0.002 -0.0970.1 0.302 0.002 -0.100 0.302 0.002 -0.101

Web Table 4: Mean estimates of the causal effects of X1 (true direct effect β1 = 0.3),X2 (β2 = 0), and X3 (β3 = −0.1) using two-stage least squares and likelihood-basedmethods from simulation study of multivariable Mendelian randomization with non-weak genetic instrumental variables and interactions in the data-generating model forthe genetic associations with the risk factors

11

Sequential adjustment method

We additionally consider a method for dealing with multiple risk factors referred to

by Holmes et al. [2] as a sequential adjustment method. This is a 2SLS instrumental

variable analysis for each of the risk factors in turn with adjustment for the other

risk factors in the regression models. We perform two versions of this method. In

the first version of the method, adjustment for the other risk factors is made only in

the second-stage regression model; in the second version, adjustment is made in both

regression stages. For example, in the second version of the method, if there are three

risk factors X1, X2 and X3, then the estimate for X1 is obtained by first regressing

X1 on the IV(s), X2 and X3, and then regressing Y on the fitted values of X1 from

the first-stage regression, X2 and X3. Both versions of the analysis are performed as

it is not clear which version of the method was chosen by the original investigators

in Holmes et al. [2]. To more closely mimic the method performed by the original

authors, we also performed the methods using a weighted allele score in place of the

individual genetic variants [3]. Weights were taken as the true coefficients for the risk

factor under analysis in the data-generating model.

We consider 8 sets of parameter values for αU1, αU2, αU3 = ±0.3 and set αX2 = 0

and αX3 = 0 as in the initial set of simulations in the main paper. All other parameters

are taken as in the main paper.

Results are given in Web Table 5 for the sequential adjustment methods using the

individual variants as instrumental variables, and in Web Table 6 for the sequential

adjustment methods using the allele scores as instrumental variables, with estimates

from the 2SLS method using all of the risk factors as in the main body of the paper

presented in each case for comparison. We see that, while the 2SLS method seems

to give unbiased estimates of the causal parameters, that the estimates from the

sequential adjustment method are biased, and not consistent in direction even under

12

the null.

Two-stage least squares Sequential adjustment (1) Sequential adjustment (2)αU1 αU2 αU3 β1 β2 β3 β1 β2 β3 β1 β2 β3

0.3 0.3 0.3 0.299 0.000 -0.099 0.070 -0.181 -0.290 0.094 -0.179 -0.3060.3 0.3 -0.3 0.300 0.000 -0.100 0.270 -0.090 -0.198 0.300 -0.089 -0.2130.3 -0.3 0.3 0.300 0.000 -0.100 0.179 0.000 -0.181 0.207 0.000 -0.1930.3 -0.3 -0.3 0.299 0.001 -0.100 0.379 0.091 -0.090 0.414 0.090 -0.100-0.3 0.3 0.3 0.301 0.000 -0.102 0.161 -0.091 -0.091 0.186 -0.090 -0.101-0.3 0.3 -0.3 0.301 -0.001 -0.100 0.362 0.000 0.002 0.393 0.000 -0.007-0.3 -0.3 0.3 0.301 -0.001 -0.100 0.270 0.091 0.019 0.300 0.089 0.014-0.3 -0.3 -0.3 0.300 0.000 -0.100 0.470 0.181 0.110 0.506 0.178 0.106

Web Table 5: Mean estimates of the causal effects of X1 (true effect β1 = 0.3), X2

(β2 = 0), and X3 (β3 = −0.1) using three different methods from simulation studyusing individual genetic variants as instrumental variables

Two-stage least squares Sequential adjustment (1) Sequential adjustment (2)αU1 αU2 αU3 β1 β2 β3 β1 β2 β3 β1 β2 β3

0.3 0.3 0.3 0.299 0.000 -0.099 -0.039 -0.346 -0.406 -0.046 -0.432 -0.5140.3 0.3 -0.3 0.300 0.000 -0.100 0.251 -0.199 -0.205 0.300 -0.248 -0.2590.3 -0.3 0.3 0.300 0.000 -0.100 0.139 0.000 -0.279 0.166 0.000 -0.3530.3 -0.3 -0.3 0.299 0.001 -0.100 0.429 0.147 -0.078 0.514 0.183 -0.099-0.3 0.3 0.3 0.301 0.000 -0.102 0.071 -0.148 -0.080 0.085 -0.184 -0.101-0.3 0.3 -0.3 0.301 -0.001 -0.100 0.363 0.001 0.121 0.435 0.001 0.153-0.3 -0.3 0.3 0.301 -0.001 -0.100 0.251 0.200 0.047 0.300 0.249 0.060-0.3 -0.3 -0.3 0.300 0.000 -0.100 0.541 0.346 0.247 0.647 0.431 0.313

Web Table 6: Mean estimates of the causal effects of X1 (true effect β1 = 0.3), X2

(β2 = 0), and X3 (β3 = −0.1) using three different methods from simulation studyusing allele scores constructed from the genetic variants as instrumental variables

13

Web Appendix 4

Consistency of association of genetic variants with lipid frac-

tions and CHD risk

In order to assess the consistency of the association of genetic variants with lipid

fractions and CHD risk, we constructed a lipid risk score for each variant and assessed

the association of this score with CHD risk. This score was calculated using the causal

estimates βLDL−C , βHDL−C and βTG from Web Table 2 and the genetic associations

XL, XH and XT from Web Table 1 for each genetic variant j = 1, . . . , 28 as:

βLDL−CXLj + βHDL−CXHj + βTGXTj (4)

The association between the lipid risk score and the log odds of CHD is plotted in Web

Figure 2. If the associations of each lipid fraction are homogeneous across variants, we

expect the points to lie on a straight line through the origin. With the exception of the

point on the far left representing genetic variant rs2304130, this is approximately true.

This variant shows an association with LDL-C and triglycerides, but no association

with CHD risk and a point estimate in the opposite direction from that which would

be expected based on its associations with LDL-C, HDL-C and triglycerides alone.

Further investigation is needed to show whether the result for variant rs2304130, is

a chance result or reflects a pleiotropic association of this variant with another risk

factor.

The causal odds ratios for CHD based on 27 variants excluding rs2304130 are 0.48

(95% credible interval 0.39–0.60) per 30% reduction in LDL-C, 1.19 (95% credible

interval 0.89–1.58) per 30% reduction in HDL-C, and 0.77 (95% credible interval

0.68–0.87) per 30% reduction in triglycerides.

14

Lipid risk score per allele

Log

odds

of C

HD

per

alle

le

−10% −5% 0 +5% +10% +15%

1.0

1.05

1.1

1.15

1.2

Web Figure 2: Association of coronary heart disease (CHD) risk-increasing allelesof 28 genetic variants with lipid risk score and odds of CHD (brightness and sizeof points: brighter points correspond to stronger associations with CHD risk, largerpoints correspond to more precise estimates). Note that some points are overlapping.

15

References

[1] Waterworth D, Ricketts S, Song K, Chen L, Zhao J, Ripatti S, et al. Genetic

variants influencing circulating lipid levels and risk of coronary artery disease.

Arteriosclerosis, Thrombosis, and Vascular Biology 2010;30(11):2264–2276.

[2] Holmes MV, Asselbergs FW, Palmer TM, Drenos F, Lanktree MB, Nelson CP,

et al. Mendelian randomization of blood lipids for coronary heart disease. Euro-

pean Heart Journal 2014;[Available online ahead of print, January 27, 2014] DOI:

10.1093/eurheartj/eht571.

[3] Burgess S, Thompson S. Use of allele scores as instrumental variables

for Mendelian randomization. International Journal of Epidemiology 2013;

42(4):1134–1144.

16