Extensions to the Modeling of Initiation and Progression: Applications to Substance Use and Abuse


ORIGINAL PAPER

Extensions to the Modeling of Initiation and Progression: Applications to Substance Use and Abuse

Michael C. Neale · Eric Harvey · Hermine H.M. Maes · Patrick F. Sullivan · Kenneth S. Kendler

Received: 9 June 2005 / Accepted: 17 February 2006 / Published online: 8 June 2006

© Springer Science+Business Media, Inc. 2006

Abstract Twin data can provide valuable insight into the

relationship between the stages of phenomena such as

disease or substance abuse. Initiation of substance use may

be caused by factors that are the same as, partially shared

with, or completely independent of those that cause

progression from use to abuse. Comparison of rates of

progression among the cotwins of twins who do vs. do not

initiate provides indirect information about the relationship

between initiation and progression. Existing models for this

relationship have been difficult to extend because they are

usually expressed in terms of explicit integrals. In this

paper, the problem is overcome by regarding the analysis

of twin data on initiation and progression as a special case

of missing data, in which individuals who do not initiate

are regarded as having missing data on progression mea-

sures. Using the general framework for the analysis of

ordinal data with missing values available in Mx makes

extensions that include other variables much easier. Modeling the

effects of continuous covariates such as age on initiation

and progression becomes simple. Also facilitated are the

examination of initiation and progression in two or more

substances, and transition models with two or more steps.

The methods are illustrated with data on the effects of

cohort on liability to cannabis use and abuse, bivariate

analysis of tobacco use and dependence and cannabis use

and abuse, and the relationships between initiation of

smoking, regular smoking and nicotine dependence. Other

suitable applications include the relationship between

symptoms and diagnosis, such as fears and the progression

to phobia.

Keywords Age · Cannabis · Comorbidity · Dependence · Genetics · Initiation · Method · Missing data · Missingness · Model · Nicotine · Substance abuse · Substance use · Tobacco · Twins

Introduction

An important class of measurements describes two-stage

phenomena, such as initiation of substance use and pro-

gression to substance abuse or addiction. Many of these

phenomena involve conditional processes where an initial

Edited by Michael Stallings

Michael C. Neale (corresponding author) · Hermine H.M. Maes · Kenneth S. Kendler

Virginia Institute for Psychiatric and Behavioral Genetics,

Virginia Commonwealth University, 980126, Richmond, VA

23298-0126, USA

e-mail: [email protected]

Tel.: +1-804-8283369

Fax: +1-804-8281471

Michael C. Neale · Hermine H.M. Maes · Kenneth S. Kendler

Department of Psychiatry, Virginia Commonwealth University,

Richmond, VA, USA

Michael C. Neale · Kenneth S. Kendler

Department of Human Genetics, Virginia Commonwealth

University, Richmond, VA, USA

Michael C. Neale

Department of Psychology, Virginia Commonwealth University,

Richmond, VA, USA

E. Harvey

Department of Environmental Sciences and Engineering,

University of North Carolina at Chapel Hill, Chapel Hill 27599

NC, USA

Patrick F. Sullivan

Department of Genetics Psychiatry and Epidemiology,

University of North Carolina at Chapel Hill, Chapel Hill 27599

NC, USA

Behav Genet (2006) 36:507–524

DOI 10.1007/s10519-006-9063-x


“gateway” event necessarily precedes the development of

a subsequent outcome. This class is broad; it includes

seeking treatment given presence of a disorder, develop-

ment of a phobic disorder given an unreasonable fear, the

transition from mild symptom expression to clinical dis-

order, exposure to a risk factor such as combat stress, and

subsequent development of post-traumatic stress disorder

(PTSD) symptoms or diagnosis. Diagnostic criteria for

many psychiatric disorders have inherent hierarchies:

binging is a necessary but not sufficient condition for a

diagnosis of Bulimia Nervosa; and childhood conduct

disorder is a prerequisite for an adult diagnosis of antisocial

behavior. While the transitions from one point to the next

on a Likert scale (such as from “Never” to “Rarely” to

“Often”) are usually regarded as changes on the same

underlying continuum, it is possible that discontinuities

exist and that different factors are related to changes at

different points on the scale. The possible absence of a

single continuum may be more obvious in the case of

interview questions that have a stem and follow-up format,

such as “Have you ever...” with a set of subsequent items

that are asked only if the respondent replies in the affir-

mative to the stem question. For simplicity, we will refer to

these two stages of development as initiation and pro-

gression, but they clearly apply to a wide variety of mea-

surement contexts.

A popular approach to the examination of initiation and

progression is to use data collected from pairs of relatives

(Heath et al., 1998; Kendler et al., 1999). Essentially, the

method compares the rate of progression in the relatives of

initiators with the rate of progression in the relatives of

non-initiators. As long as there is some correlation between

relatives for initiation, it is possible to obtain an estimate of

the strength of the relationship between initiation and

progression within an individual. This proxy form of

information indirectly indexes the within-person correla-

tion between liability to initiate and liability to develop

dependence once initiation has begun.
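The comparison described above can be sketched in a few lines. The following is a minimal illustration rather than the analysis used in this article: the function name, the data layout and the toy pairs are all hypothetical. Each twin is coded as (initiation, progression), with progression set to None when the twin did not initiate.

```python
def cotwin_progression_rates(pairs):
    """Progression rate among initiators, split by whether their relative initiated.

    pairs: list of ((init1, prog1), (init2, prog2)); prog is None when init is 0.
    Returns {True: rate when the relative initiated, False: rate when not}.
    """
    counts = {True: [0, 0], False: [0, 0]}  # relative initiated? -> [progressed, at risk]
    for twin1, twin2 in pairs:
        # Each twin is treated in turn as the index person, the other as the relative.
        for (init, prog), (rel_init, _) in ((twin1, twin2), (twin2, twin1)):
            if init == 1:  # progression is only observable in initiators
                counts[bool(rel_init)][0] += prog
                counts[bool(rel_init)][1] += 1
    return {k: (v[0] / v[1] if v[1] else float("nan")) for k, v in counts.items()}


# Hypothetical toy data, not taken from any study.
pairs = [
    ((1, 1), (1, 1)),
    ((1, 1), (1, 0)),
    ((1, 0), (0, None)),
    ((1, 0), (0, None)),
    ((1, 1), (0, None)),
    ((0, None), (0, None)),
]
rates = cotwin_progression_rates(pairs)
```

In this toy example the progression rate is higher among initiators whose relative also initiated, which is the pattern expected when liability to initiation and liability to progression are positively related.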

In the classical twin study of MZ and DZ twins reared

together, the sources of variation are decomposed into

additive genetic, random environment and either domi-

nance genetic or common environment components. With

multivariate data it is also possible to decompose the

covariance between traits into the same (three) compo-

nents. In principle, therefore, one might expect to be able to

partition covariation between liability to initiation and

liability to progression into the same components. How-

ever, the lack of direct information on the within-person

resemblance for initiation and progression prevents a full

three-way decomposition. Those who do not initiate do not

have an opportunity to express the progression phenotype;

data on progression are missing in all those who do not

initiate. Nevertheless, it is possible to get an estimate of the

relationship between liability to initiation and to progres-

sion via structural equation modeling of twin data.

Figure 1 shows a path diagram of a model that specifies

a direct causal path from initiation to progression within

each member of a twin pair. This model is identified with

data collected from MZ and DZ twins. Applications to date

include alcohol, tobacco, cannabis and stimulants (Heath

and Martin, 1993; Heath et al., 1997; Kendler and Prescott,

1998; Kendler et al., 1999; Koopmans et al., 1999) and a

common finding is that there are additive genetic, shared

and specific environmental influences on initiation but

factors specific to progression appear to be influenced by

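[Figure 1: path diagram. Latent factors AI, CI and EI (each labeled 1.00) point to initiation (I) through paths ai, ci and ei; latent factors AP, CP and EP (each labeled 1.00) point to progression (P) through paths ap, cp and ep; path b runs from I to P.]

Fig. 1 Causal model of the relationship between initiation and progression for an individual. The model requires cotwins and specification for MZ and DZ twins for identification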


additive genetic and specific environment factors with little

role for the shared environment. The study of Kendler et al.

(1999) was something of an exception to

this pattern, finding little evidence for the shared environ-

ment for initiation. However, initiation in that study was

defined as initiation of regular smoking rather than any

tobacco use whatsoever.

Certain submodels of the causal model shown in Figure 1

have important substantive meaning. First, if the

pathway from initiation to progression is zero, risk factors

for initiation and progression are entirely independent of

each other. Second, if this path has a standardized value of

unity (no residual variation in progression) then initiation

and progression represent different thresholds on a single

continuum of liability. These two models correspond to the

independent liability dimensions model and the single

liability dimension model of Heath et al. (1993). The full

model described here is an alternative to the combined

liability dimension model of Heath et al. It may be termed

a ‘‘causal contingent model’’ as it is essentially a direction

of causation model (initiation liability causes progression

liability) (Neale and Cardon, 1992) but it is applied to data

where progression is contingent on initiation.
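The expected covariances implied by the causal contingent model can be written out by path tracing. The sketch below is our own illustration, not the Mx script of Appendix A: it assumes an ACE parameterization with unstandardized paths, variables ordered (I1, P1, I2, P2), and a coefficient r_g of 1.0 for MZ and 0.5 for DZ pairs applied to the additive genetic factors. The function and argument names are hypothetical.

```python
def expected_covariance(a_i, c_i, e_i, a_p, c_p, e_p, b, r_g):
    """Expected covariance of liabilities (I1, P1, I2, P2) under P = b*I + residual.

    r_g is the additive genetic correlation between relatives
    (1.0 for MZ twins, 0.5 for DZ twins).
    """
    var_i = a_i**2 + c_i**2 + e_i**2     # variance of initiation liability
    cov_i = r_g * a_i**2 + c_i**2        # cross-twin covariance for initiation
    var_res = a_p**2 + c_p**2 + e_p**2   # variance specific to progression
    cov_res = r_g * a_p**2 + c_p**2      # cross-twin covariance of that residual
    var_p = b**2 * var_i + var_res       # total variance of progression liability
    cov_p = b**2 * cov_i + cov_res       # cross-twin covariance for progression
    within = b * var_i                   # Cov(I, P) within a person
    cross = b * cov_i                    # Cov(I in one twin, P in the cotwin)
    return [
        [var_i, within, cov_i, cross],
        [within, var_p, cross, cov_p],
        [cov_i, cross, var_i, within],
        [cross, cov_p, within, var_p],
    ]


# Hypothetical parameter values for illustration only.
mz = expected_covariance(0.7, 0.5, 0.5, 0.4, 0.0, 0.4, 0.8, 1.0)
dz = expected_covariance(0.7, 0.5, 0.5, 0.4, 0.0, 0.4, 0.8, 0.5)
```

Setting b to zero removes every initiation-progression covariance (independent liabilities), whereas setting the progression-specific paths to zero makes the within-person correlation equal to one (a single liability dimension). The cross-twin, cross-trait elements differ between MZ and DZ pairs whenever b and a_i are non-zero.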

In this article we describe some extensions of this

important and widely applicable model using missing data

methods. The first of these new models is the addition of

external indices of risk for initiation or progression. The

second is the bivariate case, in which comorbidity for

initiation and progression between two substances may be

examined. The third is multiple stages, such as initiation,

addiction and recovery.

The Univariate Model

One way to consider data on initiation and progression

from pairs of relatives is as a 4 × 4 contingency table as

shown in Table 1a. As noted above, it is not possible to

discriminate progression from non-progression in those

who do not initiate. Therefore the 4 × 4 theoretical table of

outcomes (Table 1a) collapses to a 3 × 3 table of observable

outcomes (Table 1b). One approach to modeling the rela-

tionship between the liabilities to initiation and progression

in pairs of relatives is to assume that there is a multivariate

normal distribution underlying the process. Individuals

whose liability is above a threshold, tI, on the initiation

dimension initiate the process, e.g., they start to smoke

tobacco. Those who have initiated and are also above

threshold tP on the progression dimension progress, e.g.,

they become nicotine dependent. This model is attractive

because it allows for a variety of models of covariation

between initiation and progression (Heath, 1990; Heath and

Martin, 1993; Kendler et al., 1999; Koopmans et al., 1997;

Meyer et al., 1992; True et al., 1997), of which the causal

model (Kendler et al., 1999; Neale and Cardon, 1992) has

proven to be popular. The model generates a predicted

covariance matrix Σ which is a function of a vector of

parameters of additive genetic, common environment,

specific environment and the causal path from initiation to

progression. This model can be specified in a variety of

ways, including through the graphical user interface of Mx

(Neale et al., 1999, http://www.views.vcu.edu/mx).

Appendix A contains an Mx script which implements the

model specification in matrix form.

Table 1 Crosstabulation of possible outcomes for a pair of relatives assessed on dichotomous initiation and progression variables

a) Theoretical cells

Twin 1                     Twin 2
                           Initiation: No             Initiation: Yes
Initiation   Progression   Prog.: No    Prog.: Yes    Prog.: No    Prog.: Yes
No           No            1 (0000)     2 (0001)      3 (0010)     4 (0011)
No           Yes           5 (0100)     6 (0101)      7 (0110)     8 (0111)
Yes          No            9 (1000)     10 (1001)     11 (1010)    12 (1011)
Yes          Yes           13 (1100)    14 (1101)     15 (1110)    16 (1111)

b) Observable cells

Twin 1                     Twin 2
                           Initiation: No   Initiation: Yes
Initiation   Progression   Prog.: ?         Prog.: No    Prog.: Yes
No           ?             1+2+5+6          3+7          4+8
Yes          No            9+10             11           12
Yes          Yes           13+14            15           16
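The collapse from panel a to panel b can be reproduced mechanically. The sketch below is an illustration under the coding of Table 1, in which the binary code abcd gives initiation and progression for twin 1 (a, b) and twin 2 (c, d) and the cell number is the binary value plus one; the function name is hypothetical. Progression is replaced by None whenever the corresponding twin did not initiate, and theoretical cells that share an observed pattern are pooled.

```python
from collections import defaultdict


def observable_cells():
    """Group the 16 theoretical cells of Table 1a by what can actually be observed."""
    groups = defaultdict(list)
    for index in range(16):
        # Unpack the binary code abcd: a, b = twin 1; c, d = twin 2.
        a, b, c, d = (index >> 3) & 1, (index >> 2) & 1, (index >> 1) & 1, index & 1
        # Progression is unobservable (None) for a twin who did not initiate.
        observed = (a, b if a else None, c, d if c else None)
        groups[observed].append(index + 1)  # cells are numbered from 1
    return dict(groups)


cells = observable_cells()
```

The resulting dictionary has nine entries, one per cell of panel b; for example, the pattern (0, None, 1, 1) pools theoretical cells 4 and 8.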


Given the predicted covariance matrix Σ, the expected

cell frequencies for the 16 cells in Table 1 (panel a) may be

written as

$$E(\text{Cell } abcd) = \int_{t_a}^{t_{a+1}} \int_{t_b}^{t_{b+1}} \int_{t_c}^{t_{c+1}} \int_{t_d}^{t_{d+1}} \Phi(x_a, x_b, x_c, x_d) \, dx_d \, dx_c \, dx_b \, dx_a \qquad (1)$$

where a and b (scored 0 or 1) index the initiation and pro-

gression of relative 1, and c and d index initiation and pro-

gression for relative 2. Mathematical description of the

proportions under the normal distribution with a single

threshold can be achieved using three quantities for the limits

of integration: −∞, t and +∞, which we label as Threshold 0,

Threshold 1 and Threshold 2. For the initiation dimension in

both twins, the threshold is denoted tI, and for the progres-

sion dimensions it is tP. These definitions enable us to use the

binary identification code for a cell in Table 1 (panel a) to

define the integral that describes its predicted proportion. For

example, cell 8 of Table 1 (panel a) has a binary identifi-

cation of abcd=0111, so the expected cell frequency is

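$$E(\text{Cell } 0111) = \int_{t_0}^{t_1} \int_{t_1}^{t_2} \int_{t_1}^{t_2} \int_{t_1}^{t_2} \Phi(I_1, P_1, I_2, P_2) \, dP_2 \, dI_2 \, dP_1 \, dI_1 = \int_{-\infty}^{t_I} \int_{t_P}^{\infty} \int_{t_I}^{\infty} \int_{t_P}^{\infty} \Phi(I_1, P_1, I_2, P_2) \, dP_2 \, dI_2 \, dP_1 \, dI_1. \qquad (2)$$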

This cell cannot be observed directly, because progression cannot be measured in twin 1, who is below threshold for initiation. Pairs of this type will be indistinguishable from pairs where twin 1 is below threshold on both initiation and progression (cell 4, binary code 0011). Thus the predicted frequency for pairs where twin 1 is below threshold for initiation and twin 2 is above threshold on both dimensions (cell 4+8 in Table 1, panel b) is the sum of the expected frequencies for cells 4 and 8 in Table 1 (panel a). The sum of these two four-dimensional integrals may be expressed as a three-dimensional integral, which may be evaluated more rapidly:

$$E(\text{Cell } 0111) + E(\text{Cell } 0011) = \int_{-\infty}^{t_I} \int_{t_I}^{\infty} \int_{t_P}^{\infty} \Phi(I_1, I_2, P_2) \, dP_2 \, dI_2 \, dI_1 \qquad (3)$$

because

$$\int_{-\infty}^{t} \phi(x) \, dx + \int_{t}^{\infty} \phi(x) \, dx = \int_{-\infty}^{\infty} \phi(x) \, dx = 1. \qquad (4)$$
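The same reduction can be checked numerically in a lower-dimensional analogue. The sketch below is not the mnor function of Mx; it is a plain numerical integration written for illustration, with hypothetical thresholds and correlation, for one person's initiation (X) and progression (Y) liabilities treated as standard bivariate normal. Summing the two cells that differ only in the unobserved progression status recovers the univariate probability of being below the initiation threshold.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def bvn_rectangle(x_lo, x_hi, y_lo, y_hi, rho, steps=2000):
    """P(x_lo < X < x_hi, y_lo < Y < y_hi) for a standard bivariate normal, |rho| < 1."""
    lo, hi = max(x_lo, -8.0), min(x_hi, 8.0)  # mass beyond +/- 8 SD is negligible
    scale = math.sqrt(1.0 - rho * rho)

    def integrand(x):
        # Density of X times the conditional probability that Y lies in (y_lo, y_hi).
        upper = 1.0 if y_hi == math.inf else STD.cdf((y_hi - rho * x) / scale)
        lower = 0.0 if y_lo == -math.inf else STD.cdf((y_lo - rho * x) / scale)
        return STD.pdf(x) * (upper - lower)

    h = (hi - lo) / steps  # composite Simpson's rule; steps must be even
    total = integrand(lo) + integrand(hi)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(lo + k * h)
    return total * h / 3.0


# Hypothetical values: initiation threshold, progression threshold, correlation.
t_i, t_p, rho = 0.2, 1.0, 0.8
below_both = bvn_rectangle(-math.inf, t_i, -math.inf, t_p, rho)
below_i_above_p = bvn_rectangle(-math.inf, t_i, t_p, math.inf, rho)
marginal = STD.cdf(t_i)  # one-dimensional integral after summing over Y
```

Here below_both plus below_i_above_p agrees with marginal to within the accuracy of the integration, which is the content of Equation (4) applied to the progression dimension of a non-initiator.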

The Mx script in Appendix A evaluates these integrals

explicitly, using the mnor function, which takes a covari-

ance matrix, a vector of means, a vector of lower thresh-

olds, a vector of upper thresholds, and a vector of flags

indicating the type of integration required. mnor returns the

value of the integral requested, which is the area under the

curve in the case of a univariate function, a volume in the

case of a bivariate function, and a hypervolume in the case

of trivariate or higher order multivariate functions.

Equivalence of Missing Data Approach

Although it is possible to use the function mnor to evaluate

integrals of the type given in Equation (3) in Mx, it becomes

increasingly inconvenient to do so when we wish to extend

the model to the multivariate case, or when the number of

subcategories of progression is large. Much simpler speci-

fication can be obtained by recognizing that the situation is

a special case of missing data. Individuals below threshold

for initiation have unknown status for progression, so they

have a missing value for progression. The maximum like-

lihood approach for dealing with missing data implemented

in Mx does precisely the same thing as is done by the

manual addition of integrals described above. In Mx the

user specifies a model (a formula that yields an m × m

matrix) for the covariance structure, and another (1 × m)

matrix formula for the thresholds. These formulae describe

a general case that has no missing data. The likelihood of a

binary data vector is computed by evaluating an m-dimen-

sional integral of the form of Equation (3). When there are

missing data, the covariance matrix is “filtered” to yield a

covariance matrix containing only those elements corre-

sponding to the variables that are present (rows and columns

corresponding to variables that are missing are deleted).

Likewise, the threshold matrix is filtered to delete those

thresholds corresponding to the variables that are missing.

The likelihood, a reduced dimension integral, is equivalent

to summing over all the possibilities, as in Equation (4).
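The filtering step can be illustrated directly. The sketch below is a schematic of the idea, not Mx code: the function name, the correlation values and the thresholds are hypothetical, and missing values are represented by None.

```python
def filter_missing(cov, thresholds, data):
    """Keep only the rows/columns of cov and the thresholds for observed variables."""
    keep = [k for k, value in enumerate(data) if value is not None]
    reduced_cov = [[cov[r][c] for c in keep] for r in keep]
    reduced_thresholds = [thresholds[k] for k in keep]
    observed = [data[k] for k in keep]
    return reduced_cov, reduced_thresholds, observed


# Variables ordered (I1, P1, I2, P2); hypothetical correlations and thresholds.
cov = [
    [1.00, 0.80, 0.70, 0.56],
    [0.80, 1.00, 0.56, 0.60],
    [0.70, 0.56, 1.00, 0.80],
    [0.56, 0.60, 0.80, 1.00],
]
thresholds = [0.2, 1.0, 0.2, 1.0]
data = [0, None, 1, 1]  # twin 1 did not initiate, so P1 is missing

reduced_cov, reduced_thresholds, observed = filter_missing(cov, thresholds, data)
```

For this data vector the likelihood is a three-dimensional integral over (I1, I2, P2), using the 3 × 3 covariance matrix and the three remaining thresholds, as in Equation (3).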

It is therefore possible to specify a model for the covariances,

a model for the thresholds and to enter the data

in rectangular ordinal file format for raw ordinal maximum

likelihood analysis. This approach has tremendous advan-

tages in terms of simplicity and flexibility. First, we can

take advantage of being able to model the thresholds as a

function of covariates such as age. Second, we can exploit

the general treatment of ordinal variables in Mx, simply by

increasing the size of the threshold formula. For example,

if there are eight ordered forms of progression, changing

the model simply requires increasing the number of rows of

the threshold matrix formula to accommodate the new

thresholds. In contrast, the required modifications to the

mnor based script of Appendix A would be substantial,

involving 81 formulae for the relevant integrals. Third, it

becomes much easier to use existing Mx scripts for more

complicated forms of twin data analysis, such as G×E


interaction, sex limitation, rater bias, QTL effects and

multivariate models. Examples of these three classes of

extension are presented below.
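The count of 81 formulae quoted above can be reproduced by enumerating the observable response patterns for a pair. The sketch below rests on one reading of the text, which is an assumption on our part: each twin is either a non-initiator or falls in one of the ordered progression categories, so that eight categories give nine observable outcomes per twin. The function name is hypothetical.

```python
from itertools import product


def observable_patterns(n_progression_levels):
    """All observable (twin 1, twin 2) outcomes; None marks a non-initiator."""
    per_twin = [None] + list(range(n_progression_levels))
    return list(product(per_twin, repeat=2))


binary_case = observable_patterns(2)       # progression coded no/yes, as in Table 1b
eight_level_case = observable_patterns(8)  # eight ordered forms of progression
```

With two progression categories the enumeration gives the nine cells of Table 1b; with eight it gives 81 cells, each of which would need its own integral in an explicit specification.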

Analysis of Covariates via Means

A very simple extension to the causal contingent model is

to include the effects of subject age. One approach to this

problem was presented by Neale et al. (1989) in which age

was added as an observed variable that caused the pheno-

types of twins. Although the method appears to work rea-

sonably well, there can be a problem if age is not normally

distributed in the sample. The general assumption of

multivariate normality would be violated, which would

cause goodness-of-fit statistics to be biased. To circumvent

this problem, it is possible to remove the effects of age by

modeling the threshold of subject i, tI, as a simple linear

function:

$$t_I = t + \mathrm{age}_i \, t_a$$

where t is the population baseline threshold (for individuals

of age zero), ta models the regression of the threshold on

age, and agei is the age in years of individual i at assessment.

It is possible to implement this type of “multilevel”

analysis in Mx (Neale et al., 1999) via definition variables.

In this example the model for the effects of age at

assessment is linear, but more complex forms, such as

quadratic or logistic would be easy to specify.
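The way an individual-specific threshold enters the likelihood can be sketched for a single binary initiation item. This is a simplified univariate illustration, not the twin model fitted in this article: it ignores the correlation between relatives, assumes a standard normal liability, and uses hypothetical function names and parameter values.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # liability is assumed to be standard normal


def threshold(age, baseline, slope):
    """Linear threshold model: baseline threshold plus slope times age."""
    return baseline + age * slope


def log_likelihood(records, baseline, slope):
    """Log-likelihood of binary initiation data with age as a definition variable.

    records: iterable of (age, initiated) with initiated coded 0 or 1.
    """
    total = 0.0
    for age, initiated in records:
        p_below = STD.cdf(threshold(age, baseline, slope))  # P(no initiation)
        total += math.log(1.0 - p_below if initiated else p_below)
    return total


# Hypothetical records: (age in years, initiated).
records = [(25, 1), (32, 1), (41, 0), (55, 0), (60, 0)]
ll = log_likelihood(records, baseline=-0.8, slope=0.02)
```

Because the threshold is recomputed for every record, no assumption is made about the distribution of age in the sample. A quadratic form could be obtained by adding a term in age squared to the threshold function.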

An analogous approach was used to model the effects of

age in the continuous variable case in previous reports

(Neale et al., 2000; Zhu et al., 1999), where there is a

change in mean as a function of age. A more complex

approach to modeling thresholds as a function of age using

multiple groups was presented by Pickles et al. (1994). In

the present case, initiation and progression are treated as a

bivariate twin problem, and therefore thresholds for both

initiation and progression are allowed to vary as separate

linear functions of age.

Application: Age Effects on Cannabis Consumption

Sample and Measures

Illustrative data for age effects come from a study of ge-

netic and environmental risk factors for common psychi-

atric and substance use disorders in Caucasian female–

female twin pairs from the population-based Virginia Twin

Registry (Kendler et al., 1999). Telephone interviews were

completed on 1942 of 2293 (86.1% of eligible) individual

twins. Lifetime cannabis use and abuse was assessed by

modules modified from the Structured Clinical Interview

for DSM-IIIR (SCID; Spitzer et al., 1987). Use and abuse/

dependence were coded as binary variables. Cannabis use

was defined as lifetime use of cannabis (hashish or mari-

juana). Abuse was defined when subjects reported at least

one of the following criteria: use in dangerous situations;

legal problems arising from use; social problems arising

from use; or ignoring work or other significant obligations.

Dependence was defined when at least three of the following

symptoms were reported: use despite it causing physical

problems; use in larger quantities than intended; unsuc-

cessful attempts to cut down or quit; spending large amounts

of time obtaining the drug; tolerance to the drug’s effects; or

withdrawal symptoms experienced on cessation of use.

Results of fitting the causal model without age effects

were presented by Kendler et al. (1999). This earlier

analysis was restricted to pairs in which both twins par-

ticipated and provided responses. Here we use data from

incomplete pairs in which one twin participated and their

cotwin did not. If these incomplete twin pairs do not differ

from the complete pairs in their means or variances, then

the data are missing completely at random (MCAR) (Little

and Rubin, 1987). Under MCAR, the addition of these

pairs will not substantially affect the parameter estimates

but they will increase the precision of the estimate of the

threshold. A second possibility is that the unmatched pairs

are not representative of the population, but the values of

the non-missing twin predict the missingness of the cotwin,

along with some completely random missingness. This is

known as missing at random (MAR) (Little and Rubin,

1987). Here, use of unmatched twins will lead to changes

in the estimates of the parameters as well as increasing

their precision. In total, data from 647 MZ pairs (499

complete) and 450 DZ pairs (327 complete) were used.

Method

The original analyses by Kendler et al. (1999) were conducted

using a minimum χ² loss function which has some

advantages when cell frequencies are small (Agresti, 1990).

In this article we will use the maximum likelihood loss

function, and will compare the estimates obtained with the

two approaches. A third possibility is to use the power

divergence statistic with λ = 2/3 (Read and Cressie, 1988)

which has been shown to have advantageous properties when

cell frequencies are low (Read and Cressie, 1988). The

minimum χ² and maximum likelihood functions are special

cases of the power divergence statistic, with λ = 1 and λ → 0, respectively. The Mx script shown in Appendix B was

used to fit the model with age/cohort effects.
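The relationship between the three loss functions can be made concrete with a short sketch of the power divergence statistic, written here in its usual form, 2 / (λ(λ + 1)) Σ O[(O/E)^λ − 1], for observed (O) and expected (E) cell frequencies. The cell counts below are hypothetical, and the function is an illustration rather than the Mx implementation; the case λ = −1 is not handled.

```python
import math


def power_divergence(observed, expected, lam):
    """Power divergence statistic for observed and expected cell frequencies."""
    if lam == 0:
        # Limiting case: the likelihood-ratio statistic 2 * sum(O * ln(O / E)).
        return 2.0 * sum(o * math.log(o / e)
                         for o, e in zip(observed, expected) if o > 0)
    total = sum(o * ((o / e) ** lam - 1.0) for o, e in zip(observed, expected))
    return 2.0 * total / (lam * (lam + 1.0))


# Hypothetical observed and expected frequencies with equal totals.
observed = [30, 50, 20]
expected = [25, 50, 25]

pearson = power_divergence(observed, expected, 1.0)             # minimum chi-square
likelihood_ratio = power_divergence(observed, expected, 0.0)    # maximum likelihood
cressie_read = power_divergence(observed, expected, 2.0 / 3.0)  # lambda = 2/3
```

With λ = 1 the function reproduces the Pearson statistic Σ (O − E)² / E when observed and expected totals agree, and values of λ close to zero approach the likelihood-ratio statistic.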

Results

The first two rows of Table 2 show results of analysing the

exact same data with the maximum likelihood and minimum χ² loss functions. While the results are not identical,

they are very similar and suggest that the use of maximum

likelihood with these sample sizes and this model is not

likely to generate substantial bias in parameter estimates.

This argument is supported by the fact that estimates with

ML are well within the confidence intervals of those obtained

under minimum χ². The third row of Table 2 shows

that including pairs with missing data increases the esti-

mate of the causal path from use to abuse, with a con-

comitant reduction in the proportion of variance accounted

for by residual components. Addition of age effects on the

two thresholds substantially improves the fit of the model

(difference between the two models’ −2 ln L, asymptotically distributed as χ² with 2 df, = 26.30, p < .001). Parameter estimates show that the thresholds increase (and prevalence decreases) with age in years according to the formulae tI = −.797 + .023 × age (95% CI = .013; .029) and tP = 0.999 + .012 × age (95% CI = −.002; .024), indicating significant

effects of age for initiation only. The broader confidence

intervals on the age effect for persistence are consistent

with the lower frequency of persistence in the sample.

Mean bootstrap parameter estimates and their 95% CI’s, estimated directly from the 2.5th and 97.5th percentiles of 348 bootstrap estimates, are shown in the fifth and sixth rows of the Table. The lower confidence limits reach zero for the common environment effects on initiation and for all components of variance for abuse. This is not surprising

given that the value of .96 for the upper CI of the regression

of risk for abuse on initiation is close to unity. Results

from using the Power Divergence statistic (row PD in

Table 2) are generally consistent with those of minimum

χ², and differ only slightly from the maximum likelihood

parameter estimates, indicating that the ML estimation

procedure is relatively robust to the observed cell fre-

quencies in this case.
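Two of the quantities reported above are easy to reproduce. The sketch below uses only figures given in the text (the difference of 26.30 on 2 degrees of freedom and the estimated threshold formula for initiation); the ages 30 and 50 are arbitrary illustrative choices, and the conversion to a proportion assumes a standard normal liability. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def chi2_sf_2df(x):
    """Upper-tail probability of a chi-square variate with 2 df (closed form)."""
    return math.exp(-x / 2.0)


def proportion_above(baseline, slope, age):
    """Proportion of a standard normal liability above the age-specific threshold."""
    return 1.0 - NormalDist().cdf(baseline + slope * age)


# Likelihood-ratio test for adding the two age regressions.
p_value = chi2_sf_2df(26.30)

# Implied proportion above the initiation threshold at two illustrative ages.
init_30 = proportion_above(-0.797, 0.023, 30)
init_50 = proportion_above(-0.797, 0.023, 50)
```

The p-value falls well below .001, and the implied proportion of initiators is lower at age 50 than at age 30, in line with a threshold that rises with age.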

The addition of age effects on the prevalence of initiation and progression does not affect the estimate of b², the squared causal path from initiation to progression. This proportion of variance in progression varies in a narrow range (.66 to .74) across all the analyses, indicating good

agreement on the point estimate of this component. The

proportion of variance associated with the common environment (c²) is only slightly reduced by the addition of age, which implies that most c² is not due to changes in availability of cannabis across the cohort years assessed in this

sample. The confidence intervals on the transmission path

indicate that the hypothesis that the factors are independent

can be rejected, as can the hypothesis that they represent

points on a single liability dimension.

Multivariate Model

Often it is not sufficient to model external variables as

covariates because a deeper understanding of their rela-

tionship with substance use or abuse is desired. For

example, quite different psychobiological conclusions

would be drawn if the relationship between drug use and

the personality factor neuroticism was due to environ-

mental correlation than if it was due to shared genetic

factors. Furthermore we may wish to understand the rela-

tionship between the use of different substances, whether

initiation of one substance is a risk factor for initiation of a

second, or if initiation of one substance increases risk for

dependence on another. Development of a multivariate

model is necessary to test hypotheses of this type.

The simpler missing data approach extends directly to

the multivariate case, which makes possible the consider-

ation of alternative models for the covariance between

initiation and progression of two or more substances or

traits. Perhaps the most straightforward case is where we

estimate parameters that can be used to partition the

covariance between the traits into genetic and environ-

mental components. The Cholesky decomposition is useful

Table 2 Results of fitting the conditional causal model to data on initiation and subsequent abuse (progression) of cannabis. Parameter estimates were obtained using minimum χ² (min χ²), maximum likelihood (ML), and power divergence (PD) fit statistics on data from complete pairs. The effects of including pairs in which data on use are missing for one of the twins (ML miss) are also shown, together with estimates when the effects of age on the threshold are modeled (ML miss age). Averaged bootstrap estimates for this model (which were computed by sampling the dataset with replacement and re-analyzing, using the Mx bootstrap options) are denoted ML miss age bs. The parameters subscripted I and M refer to initiation and abuse, respectively; a is additive genetic, c is shared environment, e is random environment, and b is the causal path from initiation to progression.

                  Initiation                                   Progression
Method            a²I        c²I        e²I        b²         a²M        c²M        e²M
Min χ²            .46        .29        .25        .66        .17        .00        .17
ML                .48        .28        .25        .71        .14        .00        .15
ML miss           .48        .28        .25        .73        .17        .00        .16
ML miss age       .48        .26        .26        .73        .16        .00        .16
ML miss age bs    .49        .26        .26        .74        .12        .01        .15
95% CI’s          .18; .77   .00; .53   .18; .34   .43; .96   .00; .33   .00; .09   .00; .36
PD                .46        .29        .25        .68        .17        .00        .15


for this purpose as it is numerically robust and easy to

specify (Neale and Cardon, 1992). In common with the

univariate treatment, the genetic and environmental factors

for initiation and progression are assumed to be uncorrelated¹,

and only within-trait causal paths are specified, as

shown in Figure 2a. It is the job of model fitting to find the

combination of these components that best matches the

observed pattern of familial resemblance in the data.
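The role of the Cholesky decomposition can be illustrated for the additive genetic component of two traits. The sketch below is a generic illustration with hypothetical path values, not the script used for the analyses: a lower-triangular matrix of paths is multiplied by its transpose to give a covariance matrix that is symmetric and non-negative definite by construction, from which a genetic correlation can be read off.

```python
def cholesky_product(lower):
    """Return L times its transpose for a lower-triangular matrix (list of rows)."""
    n = len(lower)
    return [[sum(lower[r][k] * lower[c][k] for k in range(n)) for c in range(n)]
            for r in range(n)]


def correlation(cov):
    """Correlation between the first two variables of a covariance matrix."""
    return cov[0][1] / (cov[0][0] * cov[1][1]) ** 0.5


# Hypothetical additive genetic paths for two traits.
a_lower = [
    [0.6, 0.0],
    [0.3, 0.4],
]
a_cov = cholesky_product(a_lower)  # additive genetic covariance matrix
r_g = correlation(a_cov)           # genetic correlation between the traits
```

The same construction applied to common and specific environmental paths yields the other two components, and the three covariance matrices sum to the expected phenotypic covariance within a person.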

There are several alternative genetic models of comor-

bidity between initiation and progression of two sub-

stances. In principle, all the general comorbidity models of

Neale and Kendler (1995) could be applied to this type of

data. Here we use the reciprocal causal model as it provides

a simple examination of whether liability to use of one

substance is a risk factor for use or abuse of another, or vice

versa. Figure 2b shows this reciprocal causal model for an

individual. It is important to note that this model is at the

level of liability, not at the level of expression of the

phenotype per se, and is therefore not the best represen-

tation of the gateway hypothesis of substance abuse. To

clarify, suppose that an environmental factor causes a

change of +1 to the z-score of an individual’s liability to

use of substance X. In the causal model used here, this

environmental factor would have the same effect on lia-

bility to use and abuse of other substances regardless of the

individual’s initial liability. By contrast, under a gateway

hypothesis, it is assumed that it is only use of substance X

itself that causes an increase in the use or abuse of other

substances. Therefore, the environmental factor would only

have an effect on the use of substance X if the prior lia-

bility to use X was at most one unit less than the threshold

for use. The environmental factor would cause a change in

the use status of the individual, with possible concomitant

changes in liability to other substances. This more explicit

gateway model is not used here.
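
The contrast between the two formulations can be made concrete with a small sketch. The Python below is illustrative only: the threshold, the path coefficient `b`, and the size of the gateway effect are hypothetical values chosen for the example, not estimates from this paper.

```python
def shift_changes_use_status(liability, threshold, shift=1.0):
    """True if adding `shift` moves a non-user's liability to or above the threshold."""
    return liability < threshold <= liability + shift


def downstream_change_liability_model(shift, b):
    """Liability-level causal model: the shift is passed on through the
    path coefficient b, whatever the individual's starting liability."""
    return b * shift


def downstream_change_gateway_model(liability, threshold, shift, gateway_effect):
    """Gateway-style model: other substances are affected only when use of
    substance X itself is newly triggered by the shift."""
    if shift_changes_use_status(liability, threshold, shift):
        return gateway_effect
    return 0.0


if __name__ == "__main__":
    threshold, b, gateway_effect = 0.5, 0.6, 0.6  # hypothetical values
    for z in (-1.5, -0.2, 0.4, 1.0):
        print(z,
              downstream_change_liability_model(1.0, b),
              downstream_change_gateway_model(z, threshold, 1.0, gateway_effect))
```

In this sketch the liability-level model passes on the same change for every starting liability, whereas the gateway-style function returns a non-zero change only for individuals whose liability starts within one unit below the threshold.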

Application: Tobacco and Cannabis Smoking

Sample and Measures

The data for these analyses are taken from the third inter-

view wave of a population-based longitudinal study of

female twins (1,898 individuals, 851 complete pairs).

Sample ascertainment and the smoking measures are

described in detail elsewhere (Kendler et al., 1993, 1999).

Briefly, we defined smoking initiation as having ever

smoked a single cigarette. Regular smoking was defined as

a pattern of use in which the respondent smoked an average

of seven cigarettes per week for at least 4 weeks. Nicotine

dependence (ND) was defined according to scores on the

Fagerstrom Tolerance Questionnaire (FTQ) (Fagerstrom,

1978; Fagerstrom and Schneider, 1989). The FTQ is an

eight item scale (range 0–11) that is widely used in the

smoking literature and which assesses the degree of

dependence on nicotine. Scores of seven or more are

consistent with ND (Fagerstrom and Schneider, 1989). The

time frame for the FTQ was the subject’s lifetime period of

maximum cigarette use.

The current analyses differ from those previously published

(Kendler et al., 1999) in two ways. First, smoking

initiation in the prior report (Kendler et al., 1999) is re-

ferred to here as regular smoking. Second, ND in the prior

report was based on a continuous factor score based on 12

items (including the eight FTQ items) whereas for sim-

plicity ND in this report is a dichotomization of the FTQ

total score. Measurement of cannabis use and abuse was as

described above for the analysis of age effects.

Method

Data were prepared as rectangular files consisting of one

record per twin pair. Each record contained data on initi-

ation of and dependence on nicotine and initiation and

abuse of cannabis. The Cholesky and Reciprocal Interac-

tion models were fitted by maximum likelihood to the raw

data. The Mx script used for this purpose is shown in

Appendix C.

Results

Table 3 shows parameter estimates from fitting two multivariate

models to the twin data on nicotine initiation (NI)

and dependence (ND) and cannabis initiation (CI) and abuse

(CM). The Cholesky and the causal models provided

similar fit to the data. Using Akaike’s Information Crite-

rion, the causal model is slightly more parsimonious having

two more degrees of freedom and only fitting 2.8 χ² units

more poorly. Despite their different substantive interpre-

tations, the two models generate very similar predicted

within-person correlations (see Table 4). These correla-

tions show that liability to initiation and progression for

nicotine and cannabis are closely related, especially within

substance. The lowest correlations, of approximately .5,

are between ND and the two cannabis measures (initiation and abuse).
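
The model comparison above can be checked with a few lines of arithmetic. The sketch below is not part of the original analysis; it assumes that the AIC values in Table 3 follow the form –2lnL minus twice the degrees of freedom, which is inferred from the tabled numbers.

```python
def aic(minus2lnl, df):
    """AIC in the form that reproduces Table 3: -2lnL minus twice the df."""
    return minus2lnl - 2 * df


# Fit statistics copied from Table 3.
cholesky = {"minus2lnl": 6523.05, "df": 6015}
causal = {"minus2lnl": 6525.87, "df": 6017}

aic_cholesky = aic(**cholesky)
aic_causal = aic(**causal)
fit_difference = causal["minus2lnl"] - cholesky["minus2lnl"]  # chi-square units
df_difference = causal["df"] - cholesky["df"]

if __name__ == "__main__":
    print(round(aic_cholesky, 2), round(aic_causal, 2))
    print(round(fit_difference, 2), df_difference)
```

The lower (more negative) AIC belongs to the causal model, which gains two degrees of freedom for a loss of fit of about 2.8 χ² units.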

The Cholesky model indicates quite substantial corre-

lations between the genetic and environmental components

of NI and CI, and between ND and CM, although these

correlations must be viewed with scepticism when the

variance components are small. For example, the common

environment correlation of 1.0 for ND with CM is practi-

cally meaningless because only a tiny fraction of the

Footnote 1: This assumption is a consequence of partitioning the variation in

progression into components due to liability to initiate, and residual

components.


[Figure 2: path diagrams, not reproduced. Both panels show latent additive genetic (A), common environment (C) and specific environment (E) factors, each with variance fixed at 1.00, for nicotine initiation (IN), nicotine progression (PN), cannabis initiation (IC) and cannabis progression (PC), with loadings ain, cin, ein, apn, cpn, epn, aic, cic, eic, apc, cpc, epc and within-substance causal paths bn and bc. Panel (a) adds the cross-substance loadings aicn, cicn, eicn, apcn, cpcn, epcn; panel (b) adds the cross-substance causal paths bicn, binc, bpcn, bpnc.]

Fig. 2 Two bivariate models for initiation (I) and progression (P) of cannabis (C) and nicotine (N). Top (a): Cholesky decomposition of sources of covariance between IN and IC, and between PN and PC. Bottom (b): causal model of IN and IC as risk factors for each other and for PN and PC


variance of each of these traits is associated with the

common environment. By contrast, the substantial genetic

correlation of .82 between NI and CI is quite precise be-

cause the genetic variance for both is quite substantial. An

interesting feature of the Cholesky model is that the spe-

cific environment correlation for dependence is negative

while the genetic correlation is positive. This finding sug-

gests that the two traits have more risk factors in common

than their phenotypic correlation suggests, although some

specific environment risk factors act in opposite ways on

nicotine and cannabis use.

The causal model results indicate similar findings to

previous univariate analyses in that liability to initiate ac-

counts for a substantial proportion of variance in liability to

dependence or abuse. A novel finding is that for initiation

there appears to be a negative feedback loop such that

liability to initiate smoking increases liability to initiate

cannabis (path = .82), whereas liability to CI decreases

liability to initiate nicotine (path = –.63, shown as a squared

term binc² = (–).40 in Table 3). Another interesting result is

that the cross-paths, from initiation of one substance to

abuse/dependence on the other are estimated at zero. That

is, liability to initiation of nicotine (cannabis) does not

appear to influence the liability for progression to abuse of

cannabis (nicotine dependence). There is however an

indirect effect as the initiation liabilities appear to have

mutual influence.

Multiple Stage Model

Several traits of interest may have more than one pre-

requisite before they are observed. For example, cessation

of nicotine dependence cannot occur before both nicotine

initiation and nicotine dependence have occurred. In a

clinical setting, a surgery may not occur unless injury has

occurred and it has been deemed serious enough by the

primary care physician to warrant referral to the surgical

unit. Similarly, one might observe initiation of alcohol use,

progression to alcohol abuse, and clinical treatment for

alcohol abuse as three stages of interest.

The theoretically possible outcomes for a pair of relatives

for a three-stage (two transitions) model are shown in

Table 5. While there are in principle 2^6 = 64 possible outcomes,

many cannot be observed in practice, or at least

only as part of a heterogeneous outcome. The possible pairwise

combinations that may be observed are delineated by

solid lines in Table 5; for example, if Twin 1 does not

initiate, and Twin 2 initiates and progresses to stage 1 but not to stage

2, the cells 7, 15, 23 and 31 describe possible pair types

with this outcome. In this case, there are only four possible

outcomes for an individual: no initiation; initiation but no

progression; initiation and progression to the next stage

only; initiation and progression to the next stage and to the

final stage.
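
The collapse from 64 theoretical cells to the observable classes can be enumerated directly. The following Python is a minimal sketch rather than part of the original analysis; the function names and the 0 to 3 outcome codes are labels introduced here for illustration, and the cell numbering follows the layout of Table 5.

```python
from itertools import product

# Latent yes/no status on (initiation, stage 1, stage 2), in Table 5's order.
LATENT = list(product((0, 1), repeat=3))


def observed_class(init, stage1, stage2):
    """Collapse a latent pattern to what can be observed for one person.

    0 = no initiation, 1 = initiation only,
    2 = progression to stage 1 only, 3 = progression to stage 2.
    """
    if not init:
        return 0
    if not stage1:
        return 1
    return 3 if stage2 else 2


def observable_classes():
    """Map each observable (twin 1, twin 2) outcome to its Table 5 cell numbers."""
    classes = {}
    for row, twin1 in enumerate(LATENT):
        for col, twin2 in enumerate(LATENT):
            cell = 8 * row + col + 1
            key = (observed_class(*twin1), observed_class(*twin2))
            classes.setdefault(key, []).append(cell)
    return classes


if __name__ == "__main__":
    for key, cells in sorted(observable_classes().items()):
        print(key, cells)
```

With four observable outcomes per individual, the enumeration yields 4 × 4 = 16 pairwise classes, and the class in which Twin 1 does not initiate and Twin 2 reaches stage 1 only contains cells 7, 15, 23 and 31.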

Technically, it is very easy to implement the multiple

stage model with the raw missing data approach. A path

diagram of the model is shown in Figure 3. This diagram

can be drawn in the Mx graphical interface, or script lan-

guage can be used to implement the model using matrix

algebra. The raw ordinal data input consists of twin pair

records with six values per pair: initiation, stage 1 and

stage 2 for both twins. When there is no initiation, both

stage 1 and stage 2 are coded as missing. When there is

Table 3 Parameter estimates from fitting two multivariate models to
data on the initiation and progression of cannabis and nicotine.
Proportions of additive genetic (a²), common environment (c²), and
specific environment (e²) variance are proportions of variance
excluding the causal influence (→) of other variables. Note: rA, rC
and rE were computed from parameter estimates in Figure 2 using,
e.g., $a_{in} \times a_{icn} / \sqrt{a_{in}^2 (a_{icn}^2 + a_{ic}^2)}$

            Nicotine                    Cannabis
            Initiation   Progression    Initiation   Progression
Cholesky factor model (–2lnL=6523.05, df=6015, AIC=–5506.95)
a²          0.67         0.71           0.40         0.49
c²          0.11         0.02           0.33         0.02
e²          0.21         0.27           0.27         0.48
bn          –            .57            –            –
bc          –            –              –            .79
            Initiation (NI with CI)     Progression (ND with CM)
rA          .82                         .38
rC          .85                         1.00
rE          .45                         –.39
Causal model (–2lnL=6525.87, df=6017, AIC=–5508.132)
a²          0.81         0.76           0.15         0.43
c²          0.00         0.00           0.37         0.00
e²          0.19         0.24           0.47         0.57
bn²         –            .68            –            –
bc²         –            –              –            .64
binc²       (–).40       –              –            –
bicn²       –            –              .85          –
bpnc²       –            .00            –            –
bpcn²       –            –              –            .01

Notes: progression of cannabis is defined as DSM-IIIR abuse or
dependence; progression of nicotine is defined as nicotine dependence.
Fit statistics are: –2lnL, minus twice the logarithm of the
likelihood; df, degrees of freedom; and AIC, Akaike’s Information
Criterion (Akaike, 1987).

Table 4 Predicted within-person correlations between nicotine and
cannabis initiation, nicotine dependence and cannabis abuse from two
multivariate models. Results from the Cholesky factor model are
below the diagonal, and from the causal model are above the diagonal

                       Nicotine                   Cannabis
                       Initiation   Dependence    Initiation   Abuse
Nicotine Initiation    1.00         0.79          0.70         0.64
Nicotine Dependence    0.75         1.00          0.53         0.49
Cannabis Initiation    0.70         0.52          1.00         0.84
Cannabis Abuse         0.63         0.50          0.89         1.00


initiation but no progression to stage 1, stage 1 is coded as

zero and stage 2 is coded as missing. When there is initi-

ation and progression to stage 1 but not to stage 2, stage 2

is coded as zero.
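
The coding rules just described can be written as a short function. This is a sketch of the rules as stated in the text, not the authors' data-preparation code; `None` stands in for a missing value, and coding a positive outcome as 1 is an assumption made here for illustration.

```python
def code_stages(initiated, reached_stage1=None, reached_stage2=None):
    """Return (initiation, stage 1, stage 2) with None marking missing data."""
    if not initiated:
        # No initiation: both later stages are missing.
        return (0, None, None)
    if not reached_stage1:
        # Initiation without progression: stage 1 is zero, stage 2 is missing.
        return (1, 0, None)
    # Progression to stage 1: stage 2 is observed as zero or one.
    return (1, 1, 1 if reached_stage2 else 0)


def code_pair(twin1, twin2):
    """Six values per pair record: three for each twin."""
    return code_stages(*twin1) + code_stages(*twin2)


if __name__ == "__main__":
    print(code_pair((False,), (True, True, False)))
```

A pair in which Twin 1 never initiates and Twin 2 reaches stage 1 but not stage 2 is therefore stored as three values for Twin 1 (zero, missing, missing) followed by three for Twin 2 (one, one, zero).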

In practice, the effective sample sizes will generally

decline for more advanced stages. This will make estima-

tion of variance components specific to advanced stages

more difficult, as they will have larger standard errors.

Furthermore, if the path from previous stages is large there

will be relatively little specific variation to partition and

thus estimation will be even less precise.

Figure 3 shows a direct path b3 from Initiation to

Stage 2, which can be difficult to grasp conceptually. It

is important to remember that these paths reflect statis-

tical regression paths, and not pathways for transition.

Thus b3 represents some influence of liability to Initia-

tion that does not influence liability to Stage 1. For

example, suppose that the three stages are initiation of

drug use, drug use on more than five occasions, and drug

addiction. Imagine that there are regional differences in

availability such that the drug is always available in

some regions but not often available in others. This

variation in availability might cause individual differ-

ences in initiation, and also in addiction which cannot

occur without a regular supply. However, the effect of

erratic supply on trying the substance five or more times

might be relatively trivial and therefore transmission of

cause through this intermediate stage would underesti-

mate the strength of association between initiation and

addiction.

Application: Tobacco Initiation, Regular Smoking and

Nicotine Dependence

Sample and Measures

The sample used here is the same as described above for

the bivariate analysis of nicotine and cannabis initiation

and progression.

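Table 5 Crosstabulation of theoretically possible outcomes for a pair of relatives assessed on initiation and two stages of progression. Of the 64
possible cells, only 16 may be observed. The solid lines delineate the 16 possible observed classes

Twin 1                 Twin 2
                       Initiation:  No    No    No    No  | Yes   Yes | Yes | Yes
                       Stage 1:     No    No    Yes   Yes | No    No  | Yes | Yes
Init   Stg 1  Stg 2    Stage 2:     No    Yes   No    Yes | No    Yes | No  | Yes
----------------------------------------------------------+-----------+-----+-----
No     No     No                    1     2     3     4   | 5     6   | 7   | 8
No     No     Yes                   9     10    11    12  | 13    14  | 15  | 16
No     Yes    No                    17    18    19    20  | 21    22  | 23  | 24
No     Yes    Yes                   25    26    27    28  | 29    30  | 31  | 32
----------------------------------------------------------+-----------+-----+-----
Yes    No     No                    33    34    35    36  | 37    38  | 39  | 40
Yes    No     Yes                   41    42    43    44  | 45    46  | 47  | 48
----------------------------------------------------------+-----------+-----+-----
Yes    Yes    No                    49    50    51    52  | 53    54  | 55  | 56
----------------------------------------------------------+-----------+-----+-----
Yes    Yes    Yes                   57    58    59    60  | 61    62  | 63  | 64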

Fig. 3 Multiple stage model for initiation and progression to two possible further stages. Progression is contingent on being positive for the previous stage. Parameters may be estimated with data collected from relatives

[Figure 3: path diagram, not reproduced. Latent factors AIN, CIN, EIN (Initiation), AS1, CS1, ES1 (Stage 1) and AS2, CS2, ES2 (Stage 2), each with variance fixed at 1.00, load on their own stage through ain, cin, ein, as1, cs1, es1, as2, cs2, es2; the stages are connected by the paths b1, b2 and b3.]


Method

The multiple stage model was fitted by maximum likeli-

hood to the raw data files prepared for MZ and DZ twins.

The Mx script used for this purpose is shown in Appendix

D. Data from incomplete pairs were included in the analysis

to reduce possible bias that may accrue if data are MAR but

not MCAR.

Results

Maximum likelihood parameter estimates from fitting the

two transition models are shown in Table 6. The top half of

the table shows that the pattern of strong familial resemblance

for smoking initiation, comprising 56% additive

genetic and 22% common environment variance, is recovered

with these data. The vast majority (80%) of the variance of

liability to regular smoking is accounted for by liability to

initiation, with the remainder almost entirely due to addi-

tive genetic effects specific to regular smoking. The tran-

sition from regular smoking to ND is much less strong.

There appears to be residual familial resemblance for ND

that is not accounted for by variance in regular smoking. In

addition, the factors responsible for ND as opposed to

regular smoking appear to be less strongly related to ini-

tiation of any smoking than seemed to be the case when

initiation and dependence were analyzed without infor-

mation on regular smoking.

The lower half of Table 6 shows the results of fitting a

more elaborate model of the relationship between initia-

tion, regular smoking and ND. Here a direct path from

initiation to ND was included. However, very little

improvement in fit was observed and the value of the path

itself is small. These results indicate that there is little

relationship between factors that influence initiation and

factors that influence ND beyond those that are mediated

by regular smoking.
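
The size of the improvement in fit can be gauged with a likelihood-ratio comparison of the two models in Table 6. The sketch below is an illustration rather than a reported analysis; it assumes the usual χ² approximation with one degree of freedom for the added path, and uses only the fit statistics printed in Table 6.

```python
import math


def lrt_one_df(minus2lnl_restricted, minus2lnl_full):
    """Likelihood-ratio chi-square and p-value for one extra free parameter."""
    chi_square = minus2lnl_restricted - minus2lnl_full
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# -2lnL without and with the direct path from initiation to dependence.
chi_square, p_value = lrt_one_df(4014.62, 4014.49)

if __name__ == "__main__":
    print(round(chi_square, 2), round(p_value, 2))
```

The difference of 0.13 in –2lnL for one degree of freedom is far from conventional significance levels, in line with the description of the direct path as adding very little.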

Discussion

Methodological Development

Several extensions to the conditional causal model have

been described and applied. These extensions permit a

number of new hypotheses to be tested. At the simplest

level, the re-specification of the problem as one of missing

ordinal data permits a more general modeling of the

threshold. Covariates such as age, sex or genotype may be

specified to have direct effects on the mean and therefore

on the thresholds. Thus tests for the linear effect of age, or

the additive or non-additive association with measured

genotypes may be implemented without the need to

simultaneously model the distribution of age or the allele

frequencies.

Some covariates are inherently shared by members of

a twin pair reared together, such as socioeconomic status

or demography of the place of residence. Age is fre-

quently regarded as the same for both members of a

pair, but in practice there may be variation if the twins

are assessed at different times. Even when this is the

case, there is little interest in modeling the genetic and

environmental determinants of age, and age probably

will have a non-normal distribution. Therefore, modeling

age and shared environmental effects via their direct

effects on the mean (a ‘‘multilevel’’ model) is a practical

approach to removing their effects from the variable of

interest. No assumption about the distribution of these

variables is required; what remains is that the residuals

of the variables being analysed have a multivariate

normal distribution.

More subtle modeling of the effects of covariates is

possible when the covariate differs between twins in a

meaningful way. While such covariates might be treated in

the same way as age and other shared variables described

above, it should be understood that doing so simply

regresses out the effect of the covariate, which assumes

that the covariate causes the variables being analyzed. As

we have noted elsewhere (Heath et al., 1993; Neale and

Cardon, 1992; Neale et al., 1994a, b), this causal assump-

tion is empirically testable in twin data. It is also possible

to partition the covariance between the covariate and the

variables of interest into genetic, shared and specific

environmental components, via standard multivariate

genetic analysis.

There are some technical limitations to the analysis of

multiple ordinal variables. Integration of the multivariate

Table 6 Parameter estimates from fitting a multiple stage model to
data collected from female MZ and DZ twins on initiation of any
smoking, of regular smoking, and nicotine dependence

        Initiation   Regular Smoking   Nicotine Dependence
No direct effect from Initiation to Dependence (–2lnL=4014.62, df=3465)
a²      .56          .17               .23
c²      .22          .02               .21
e²      .21          .01               .38
b1²     –            .80               –
b2²     –            –                 .17
Direct effect from Initiation to Dependence (–2lnL=4014.49, df=3464)
a²      .56          .20               .08
c²      .22          .00               .32
e²      .21          .01               .39
b1²     –            .80               –
b2²     –            –                 .09
b3²     –            –                 .03


normal distribution is computationally demanding when

the number of variables is large. For the purposes of reli-

able optimization, numerical integration should be per-

formed at a high level of accuracy, but doing so makes it

run very slowly. There is thus a trade-off between opti-

mization performance and computational speed. This

problem is somewhat temporary; as computers increase in

speed and as parallel computer architecture is exploited

more effectively, the time taken for more precise integra-

tion will decrease and stable optimization will be more

frequently obtained. In the analyses to date, there can be a

tendency to find local rather than global minima and it is

therefore prudent to use a variety of starting values to en-

sure that the solution obtained is indeed the maximum

likelihood.
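
The advice to try several starting values can be illustrated on a toy problem. The Python below is only a sketch: the objective is an arbitrary one-dimensional function with two minima, standing in for a likelihood surface, and the plain gradient descent is not the optimizer used by Mx.

```python
def objective(x):
    """Toy function with a local minimum near x = 1 and a global one near x = -1."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def gradient(x):
    """Derivative of the toy objective."""
    return 4.0 * x * (x * x - 1.0) + 0.3


def local_search(start, step=0.01, iterations=5000):
    """Plain gradient descent; it settles in whichever basin it starts in."""
    x = start
    for _ in range(iterations):
        x -= step * gradient(x)
    return x


def multi_start(starts):
    """Run the local search from each start and keep the lowest objective value."""
    solutions = [local_search(s) for s in starts]
    return min(solutions, key=objective)


best = multi_start([-2.0, -0.5, 0.5, 2.0])

if __name__ == "__main__":
    print(round(local_search(0.5), 3), round(best, 3))
```

A single run started at 0.5 stops at the local minimum on the positive side, whereas comparing the solutions from several starts recovers the lower minimum on the negative side.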

Clearly these methods can provide valuable insight into

the scaling of variables that is difficult or impossible to

obtain from other sources. Modeling of data from relatives

provides a unique perspective within a particular variable.

Perhaps the closest parallel is the analysis of longitudinal

data. However, repeated measures taken across a long

interval might not be measuring the same construct. Con-

versely, a short interval between occasions may give rise to

response bias if there is interference from recent testing.

Only the study of relatives seems to be free of such diffi-

culties.

Substantive Findings

The illustrative applications of the methods described in

this paper offer a number of new insights to the etiology of

tobacco and cannabis initiation and progression to abuse or

dependence. First, including age in the model does reveal a

significant change in prevalence of cannabis use across the

18–50 year-old age range of the twins in this study. The

direction of the effect on the thresholds is positive for both initiation and

progression to abuse, according to the formulae

$t_I = -.797 + .023 \times \text{age}$ and $t_P = .999 + .012 \times \text{age}$. Therefore the

prevalence of both cannabis use and abuse is greater in

younger rather than older samples. This finding is consis-

tent with other epidemiological studies in the USA

(Department of Health and Human Services, 1999).

Despite being statistically significant, the effect is not large

enough to account for the substantial proportion of com-

mon environment variance typically found in studies of

twins. Thus failure to correct for age in previous studies has

most likely increased the estimate of common environment

variance by only one or two percent. The remainder may be

due to parental attributes, religion, availability, rural vs.

urban living, or other demographic factors that are shared

by twins.
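
The threshold formula for initiation can be turned into an implied proportion of users at a given age. The sketch below is illustrative and rests on assumptions not stated in this section: that age enters the formula in years without centering, and that liability is standard normal so that the proportion above the threshold is 1 − Φ(t).

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def initiation_threshold(age):
    """Threshold for cannabis initiation, using the coefficients given in the text."""
    return -0.797 + 0.023 * age


def proportion_above(threshold):
    """Share of a standard normal liability distribution above the threshold."""
    return 1.0 - normal_cdf(threshold)


if __name__ == "__main__":
    for age in (18, 30, 50):
        print(age, round(proportion_above(initiation_threshold(age)), 3))
```

Under these assumptions the implied proportion of initiators falls steadily with age, from roughly two thirds at age 18 to a little over one third at age 50.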

Second, the bivariate analysis of nicotine and cannabis

initiation and progression produced several new insights.

Within individuals, initiation for these two substances is

highly correlated (r=.70). Progression to nicotine depen-

dence is less highly correlated with progression to cannabis

abuse (r=.50) although this latter correlation has a broader

confidence interval because data on abuse are missing in

those subjects that have not initiated use. Consistent with

univariate analyses of these data, there are substantial

correlations between the liabilities to initiation and progression

for both nicotine and cannabis. There are also substantial

correlations across the substances; liability to

initiate nicotine correlates .63 with liability to cannabis

abuse and liability to cannabis use correlates .53 with lia-

bility to nicotine dependence. These results indicate that

the factors that cause individual differences in liability to

both licit and illicit substance use have much in common.

Given that both involve the reward system in the brain, this

result may not be surprising, particularly for dependence or

abuse. At the components of variance level, the shared

environmental correlation between nicotine and cannabis

initiation is high (rc=.85), suggesting that social determi-

nants of substance use may be common to both nicotine

and cannabis. The same is true for additive genetic factors

that predispose to initiation of both substances. For factors

specific to progression to ND or cannabis abuse, the results

are less consistent. There is little shared environmental

variation for progression for either substance so the rc is

not relevant. While the additive genetic correlation is positive

(ra=.38), the specific environmental correlation is

negative (re=–.39), suggesting that although there is little

within person correlation for factors specific to ND and

substance abuse, their causes may be quite substantially

correlated. These data therefore support the hypothesis that

there exist general neurobiological factors that predispose

to progression to substance abuse. However, the genetic

correlation is modest, indicating a substantial role for

factors specific to each substance as well. Apparently,

environmental factors not shared by twins may lead to

preference of one substance over the other, reflected by the

negative environment correlation.

The alternative, causal, model for the relationships

among nicotine and cannabis initiation and progression has

a provocative finding of a negative feedback loop between

nicotine and cannabis initiation. Liability to initiate nico-

tine increases liability to initiate cannabis, whereas liability

to initiate cannabis decreases liability to initiate nicotine.

Onset of tobacco use usually precedes onset of cannabis

use, which is somewhat consistent with this finding. The

findings do not answer the question of whether later onset

of cannabis use leads to a decrease in the use of tobacco. It

may be that they reflect a limitation of this modeling at the

population level, and mask a mixture of substance use

patterns. For example, one type of person might initiate

cannabis use and shun legal drugs such as tobacco and


alcohol, perhaps as a manifestation of anti-establishment

feelings. A second type might participate in legal substance

use only, and a third may be willing to try anything.

Mixture distribution modeling, perhaps combined with

longitudinal analysis, might provide an empirical test of

such hypotheses, although it would likely require a good

indicator of group membership.

Third, the multiple stage model of nicotine initiation,

regular use and dependence elucidated some novel findings

about the development of ND in women. Liability to be-

come a regular smoker is very closely related to liability to

initiate. The development of dependence is less closely

related to either liability to regular smoking or initiation.

Factors specific to the development of regular smoking

appear to be largely additive genetic in origin, whereas

those specific to nicotine dependence (excluding those in-

volved in regular smoking) appear to be more environ-

mental. However, confidence intervals are broad and

therefore these findings are not likely to be very robust.

Finally, we note that in this study, abuse or dependence

of cannabis, and nicotine dependence were coded as a

binary variable based on a symptom count. This aggrega-

tion of symptoms into a sum score may lead to biased

estimates of variance components of the latent trait

underlying responses to the symptoms (Neale et al., 2005).

We hope that fully multivariate analyses of the symptoms

will become technically feasible in the future, so that this

potential source of bias may be examined.

Acknowledgements Michael Neale is grateful for support from

PHS grants RR08123, MH01458, DA-18673. Eric Harvey was sup-

ported by NIMH training grant MH-20030, Hermine Maes by HL-

60688, Patrick Sullivan by MH-59160, and Kenneth Kendler by AA-

09095, MH/AA-49492 and DA-11287.

Appendix A

Mx script for fitting univariate model to data collected

from twins using explicit calls to the multivariate

normal distribution integration routine mnor

!
! Mx script for causal conditional model
!
#ngroup 6
Group 1 Compute MZ Correlations
Calculation
Begin Matrices;
A Di 2 2 Free
C Di 2 2 Free
E Di 2 2 Free
B Fu 2 2
I Id 2 2
End Matrices;
Specify B ! causal parameter from initiation to progression
0 0
7 0
! starting values
Matrix A .7 .7
Matrix C .5 .5
Matrix E .5 .5
Matrix B 0 0 .4 0
! parameter bounds
Bound 0 1 A 1 1 A 2 2 C 1 1 C 2 2
Begin algebra;
X = A*A' ;
Y = C*C' ;
Z = E*E' ;
R = (I@((I-B)~)) & (X+Y+Z | X+Y _ X+Y | X+Y+Z) ;
End algebra;
End group

Group 2 DZ Correlation matrix
Calculation
Begin Matrices=Group 1;
H Fu 1 1 ! .5
End Matrices;
Matrix H .5
Begin algebra;
R = (I@((I-B)~)) & (X+Y+Z | h@X+Y _ h@X+Y | X+Y+Z) ;
End algebra;
End

Fit model to MZ data with user-defined fit function (ML)
Data Ni=1 No=1
Begin Matrices;
d full 1 1 ! two
i zero 1 1
n full 1 1 ! scalar 2.0
o full 9 1
r computed =R1 ! correlation matrix A1B1A2B2
t full 1 4 ! thresholds abab
w zero 1 4 ! means
z unit 1 1
End matrices;
matrix d 2
matrix n 2
matrix o ! non, ex, current cell frequencies
214 53 5
55 117 17
1 20 18
! mnor function takes matrices with 4 more rows than columns:
! first n (=4) rows are correlation matrix
! row n+1 is mean vector
! row n+2 is upper thresholds
! row n+3 is lower threshold
! row n+4 is indicator, 0 = integrate from -infinity to upper threshold
!                       1 = integrate from lower threshold to +infinity
!
Begin algebra ;
e = -o . \ln(
\mnor(r_w_t_t_(i|i|i|i)) + \mnor(r_w_t_t_(i|i|i|z)) + \mnor(r_w_t_t_(i|z|i|i)) + \mnor(r_w_t_t_(i|z|i|z)) _
\mnor(r_w_t_t_(i|i|z|i)) + \mnor(r_w_t_t_(i|z|z|i)) _
\mnor(r_w_t_t_(i|i|z|z)) + \mnor(r_w_t_t_(i|z|z|z)) _
\mnor(r_w_t_t_(z|i|i|i)) + \mnor(r_w_t_t_(z|i|i|z)) _
\mnor(r_w_t_t_(z|i|z|i)) _
\mnor(r_w_t_t_(z|i|z|z)) _
\mnor(r_w_t_t_(z|z|i|i)) + \mnor(r_w_t_t_(z|z|i|z)) _
\mnor(r_w_t_t_(z|z|z|i)) _
\mnor(r_w_t_t_(z|z|z|z)) ) ;
End algebra ;
Compute d.\sum(e) ;
Option user rs
End

Fit model to DZ data with user-defined fit function (ML)
Data Ni=1 No=1
Begin Matrices = Group 3;
! re-declare o and r as they are different for DZ's
o full 9 1
r comp =R2 ! correlation matrix A1B1A2B2
End matrices;
Specify t 10 11 10 11 ! equate thresholds for twins 1 and 2 (and MZ=DZ)
Matrix t .2 .0 .2 .0
Matrix o
100 52 3
45 82 16
6 18 4
Bound -2 3 10 11
Begin algebra ;
e = -(o) . \ln(
\mnor(r_w_t_t_(i|i|i|i)) + \mnor(r_w_t_t_(i|i|i|z)) + \mnor(r_w_t_t_(i|z|i|i)) + \mnor(r_w_t_t_(i|z|i|z)) _
\mnor(r_w_t_t_(i|i|z|i)) + \mnor(r_w_t_t_(i|z|z|i)) _
\mnor(r_w_t_t_(i|i|z|z)) + \mnor(r_w_t_t_(i|z|z|z)) _
\mnor(r_w_t_t_(z|i|i|i)) + \mnor(r_w_t_t_(z|i|i|z)) _
\mnor(r_w_t_t_(z|i|z|i)) _
\mnor(r_w_t_t_(z|i|z|z)) _
\mnor(r_w_t_t_(z|z|i|i)) + \mnor(r_w_t_t_(z|z|i|z)) _
\mnor(r_w_t_t_(z|z|z|i)) _
\mnor(r_w_t_t_(z|z|z|z)) ) ;
End algebra ;
Compute d.\sum(e) ;
Option user
End

Group 5 constrain variances to 1
Constraint NI=1
Begin Matrices = Group 1;
U Unit 1 2
z izero 4 2
End matrices;
Constraint U = \d2v(r)*z ;
End

Group 6 - standardize estimates
Calculation
Begin Matrices = Group 1;
I Id 2 2
J iden 4 4
End matrices;
Begin algebra;
K = I@((I-B)~) ;
L = \v2d(\sqrt(\d2v((I@((I-B)~))*(X+Y+Z | X+Y _ X+Y | X+Y+Z)*(I@((I-B)~)')))) ;
M = L~*K*L ;
R = \d2v(X+Y+Z) ;
S = ((\d2v(X))%R) _ ((\d2v(Y))%R) _ ((\d2v(Z))%R) ;
End algebra;
Labels row S
A_A A_B C_A C_B E_A E_B
Labels row L
MZT1Abeta MZT1Bbeta
MZT1Abeta MZT1Bbeta
End
option func=1.e-10 ! function precision for optimization
option df=18 ! adjust df
option nd=4 ! 4 decimal places
option eps=.00000001 ! integration precision for mnor
option th=-2 ! retry optimization from final point twice


Appendix B

Mx script for fitting conditional causal model including

cohort/age effects

! CCC with age

#ngroup 6

Group 1 Compute MZ Correlations

Calculation

Begin Matrices;

A Di 2 2 Free

C Di 2 2 Free

E Di 2 2 Free

B Fu 2 2

I Id 2 2

End matrices;

Specify B ! causal parameter from

initiationtoprogression

0 0

7 0

! starting values

Matrix A .7 .7

Matrix C .5 .5

Matrix E .5 .5

Matrix B 0 0 .4 0

! parameter bounds

Bound 0 1 A 1 1 A 2 2 C 1 1 C 2 2

Begin algebra;

X = A*A’ ;

Y = C*C’ ;

Z = E*E’ ;

R = (I@((I-B)tf=}PSSym}e)) & (X+Y+Zj X+YX+Yj X+Y+Z);End algebra;

End group

Group 2 Compute DZ Correlation matrix

Calculation

Begin Matrices=Group 1;

H Fu 1 1 ! .5

End matrices;

Matrix H .5

Begin algebra;

R = (I@((I-B)tf=}PSSym}e)) &(X+Y+Zj h@X+Yh@X+Yj X+Y+Z);End algebra;

End

Fit model to MZ data

Data Ninput=10

Labels zyg agea nicusea nicpc2a canusea

canabua

nicuseb nicpc2b canuseb canabub

Ordinal file=ffpair2.rec

Select if zyg = 1

Select agea canusea canabua canuseb canabub ;

Definition agea ;

Begin Matrices;

a full 1 1

i zero 1 1

n full 1 1 ! scalar 2.0

o full 9 1

r full 4 4 =R1 ! correlation matrix A1B1A2B2

t full 1 4 ! thresholds abab

u full 1 4 ! thresholds abab

w zero 1 4 ! means

z unit 1 1

End matrices;

Specify a agea ! A gets updated with age for

each case

! during calculation of covariances and

thresholds

matrix n 2

covariance r ;

thresholds t+u@a ;

option rs

End

Fit model to DZ data

Data Ninput=10

Labels zyg agea nicusea nicpc2a canusea

canabua

nicuseb nicpc2b canuseb canabub

Ordinal file=ffpair2.rec

Select if zyg = 2

Select agea canusea canabua canuseb canabub ;

Definition agea ;

Begin matrices =Group 3;

o full 9 1

r full 4 4 =R2 ! correlation matrix A1B1A2B2

End matrices;

specify t 10 11 10 11

Matrix t .2 .0 .2 .0

specify a agea

specify u 12 13 12 13

matrix u .01 .01 .01 .01

bound -.05 .05 u 1 1 U 1 2

Bound -2 3 10 11

covariance r ;

Thresholds t + u@a;

End

Group 5 - constrain variances = 1

Constraint NI=1

Begin Matrices = Group 1

U Unit 1 2

V iz 4 2

End matrices;

Constraint U ¼ nd2v(R)*V ;

End

Behav Genet (2006) 36:507–524 521

123

Group 6 - standardize estimates

Data calc

Begin Matrices;

A Di 2 2 = A1

C Di 2 2 = C1

E Di 2 2 = E1

B Fu 2 2 = B1

I Id 2 2

H Fu 1 1 ! .5

J iden 4 4

End matrices;

Begin algebra;

X = A*A' ;
Y = C*C' ;
Z = E*E' ;
K = I@((I-B)~) ;
L = \v2d(\sqrt(\d2v((I@((I-B)~))*(X+Y+Z | X+Y _ X+Y | X+Y+Z)*(I@((I-B)~)')))) ;
M = L~*K*L ;
R = \d2v(X+Y+Z) ;
S = ((\d2v(X))%R) _ ((\d2v(Y))%R) _ ((\d2v(Z))%R) ;
End algebra;

Labels row S

A_A A_B C_A C_B E_A E_B

Labels row L

MZT1Abeta MZT1Bbeta

MZT1Abeta MZT1Bbeta

Interval B 1 2 1

option mu nd=4

option nag=10 db=1

option func=1.e-8

option th=-2

option multiple issat

End

!fit submodel without age effect

save cccage.mxs

drop 12 13

End
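The algebra that builds the expected covariance matrix in Groups 1 and 2 of the script above can be checked by hand outside Mx. What follows is a minimal sketch in Python, not part of the original scripts. It assumes the usual reading of the Mx operators: @ as the Kronecker product, ~ as the inverse, & as the quadratic product M V M', and | and _ as horizontal and vertical adhesion. All function and variable names are hypothetical, and the numbers are the starting values from the script rather than estimates.

```python
# Illustrative only: mirrors R = (I@((I-B)~)) & (X+Y+Z | h@X+Y _ h@X+Y | X+Y+Z)

def matmul(p, q):
    """Ordinary matrix product of two lists-of-lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]

def transpose(p):
    return [list(row) for row in zip(*p)]

def kron(p, q):
    """Kronecker product (the Mx @ operator)."""
    return [[p[i][j] * q[k][l]
             for j in range(len(p[0])) for l in range(len(q[0]))]
            for i in range(len(p)) for k in range(len(q))]

def diag_sq(paths):
    """A*A' for a diagonal path matrix: squared paths on the diagonal."""
    n = len(paths)
    return [[paths[i] ** 2 if i == j else 0.0 for j in range(n)]
            for i in range(n)]

def add(p, q, weight=1.0):
    """Elementwise p + weight*q."""
    return [[p[i][j] + weight * q[i][j] for j in range(len(p[0]))]
            for i in range(len(p))]

def expected_cov(a, c, e, b21, h):
    """4x4 expected covariance for twin 1 (use, abuse), twin 2 (use, abuse)."""
    X, Y, Z = diag_sq(a), diag_sq(c), diag_sq(e)
    within = add(add(X, Y), Z)      # X+Y+Z
    between = add(Y, X, h)          # h@X+Y, with h = 1 for MZ and .5 for DZ
    # Adhesion: (within | between) stacked on (between | within)
    block = ([within[i] + between[i] for i in range(2)] +
             [between[i] + within[i] for i in range(2)])
    # Closed-form inverse of I-B when B = [[0, 0], [b21, 0]]
    inv_i_minus_b = [[1.0, 0.0], [b21, 1.0]]
    m = kron([[1.0, 0.0], [0.0, 1.0]], inv_i_minus_b)
    return matmul(matmul(m, block), transpose(m))   # quadratic product

# Starting values from the script: A .7 .7, C .5 .5, E .5 .5, B[2,1] = .4
mz = expected_cov([0.7, 0.7], [0.5, 0.5], [0.5, 0.5], 0.4, 1.0)
dz = expected_cov([0.7, 0.7], [0.5, 0.5], [0.5, 0.5], 0.4, 0.5)
```

At these starting values the implied variances are 0.99 for initiation and about 1.148 for progression, so the diagonal of R is not unity until the constraint group forces it to be.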

Appendix C

Mx script for fitting bivariate conditional causal model

for initiation and progression in pairs of twins

! Bivariate Genetic Cholesky Model CCC

! Simulated ordinal data

#ngroups 4

#define nthresh1 1

#define nthresh2 1

#define nvar 4

Group 1: set up model

Calculation

Begin Matrices;
B Full nvar nvar Free ! causal pathways
J Iden 2 2
K Iden 4 4
X diag nvar nvar Free ! genetic structure
Y diag nvar nvar Free ! common environmental structure
Z diag nvar nvar Free ! specific environmental structure
W Lower nvar nvar ! dominance structure (set to zero)
T Full nthresh2 8 Free
End matrices;
Bound .0 2 X 1 1 X 2 2 X 3 3 X 4 4
Bound .0 2 Y 1 1 Y 2 2 Y 3 3 Y 4 4
Bound .1 2 Z 1 1 Z 2 2 Z 3 3 Z 4 4
Matrix T
0 0 0 0 0 0 0 0
Bound -3 3 T 1 1 - T 1 8
Specify B
0 0 104 0
101 0 105 0
102 0 0 0
103 0 106 0
Bound -.99 .99 B 1 1 to B 4 4
Labels Col B IS DS IC DC
Labels Row B IS DS IC DC
Matrix X .8 .7071 .6 .5
Matrix Z .6 .5 .8 .7071
Matrix B
0 0 .3 0
.5 0 .3 0
.3 0 0 0
.3 0 .5 0
Begin algebra;
A= X*X' ;
C= Y*Y' ;
E= Z*Z' ;
D= W*W' ;
F= (J@(K-B))~ ;
End algebra;
End

Group 2: Fit model to MZ twin pairs
Data Ninput=10
Labels zyg agea nicusea nicpc2a canusea
canabua
nicuseb nicpc2b canuseb canabub
Ordinal file=ffinits.rec
Select if zyg = 1
select nicusea nicpc2a canusea canabua
nicuseb nicpc2b canuseb canabub ;
Begin Matrices= Group 1;
End matrices;
Covariances F&(A+C+E+D | A+C+D _
A+C+D | A+C+E+D) /

Thresholds T ;

Option RSidual

End

Group 3: Fit model to DZ twin pairs

Data Ninput=10

Labels zyg agea nicusea nicpc2a canusea canabua

nicuseb nicpc2b canuseb canabub

Ordinal file=ffinits.rec

Select if zyg = 2

select nicusea nicpc2a canusea canabua

nicuseb nicpc2b canuseb canabub ;

Begin Matrices= Group 1;

H Full 1 1

Q Full 1 1

End matrices;

Matrix H .5

Matrix Q .25

Covariances F&(A+C+E+D | H@A+C+Q@D _
H@A+C+Q@D | A+C+E+D) /

Thresholds T ;

Option RSidual

Options NDecimals=4

option func=1.e-8

End

G4: Constrain variances

Constraint NI=1

Begin Matrices;

U unit 1 4

E symm 8 8 = %e2

Z iz 8 4

End matrices;

Constraint U = \d2v(E) * Z ;

Option

End

Appendix D

Mx script for fitting three stage/two transition causal

model for initiation and two progressions in pairs of

twins

! Bivariate Genetic Cholesky Model CCC

! Simulated ordinal data

!

#ngroups 4

#define nthresh1 1

#define nthresh2 1

#define nvar 3

G1: set up model

Calculation

Begin Matrices;
B Full nvar nvar Free ! causal pathways
J Iden 2 2
K Iden nvar nvar
I Lower nthresh2 nthresh2
X diag nvar nvar Free ! genetic structure
Y diag nvar nvar Free ! common environmental structure
Z diag nvar nvar Free ! specific environmental structure
W Lower nvar nvar ! dominance structure
T Full nthresh2 6 Free
End matrices;
Bound .0 1 X 1 1 X 2 2 X 3 3
Bound .0 1 Y 1 1 Y 2 2 Y 3 3
Bound .1 1 Z 1 1 Z 2 2 Z 3 3
Bound -3 3 T 1 1 - T 1 6
Specify B
0 0 0
101 0 0
0 102 0
Bound -.99 .99 B 2 1 B 3 2
Labels Col B IS RS ND
Labels Row B IS RS ND
Matrix X .8 .7071 .6
Matrix Z .6 .5 .8
Matrix B 0 0 0 .5 0 0 0 .5 0
Begin algebra;
A= X*X' ;
C= Y*Y' ;
E= Z*Z' ;
D= W*W' ;
F= (J@(K-B))~ ;
End algebra;
End

G2: MZ twin pairs
#include patccc1.dat
Select if zyg = 1
select evera rega nda everb regb ndb ;
Begin Matrices= Group 1;
End matrices;
Covariances F&(A+C+E+D | A+C+D _
A+C+D | A+C+E+D) /
Thresholds T ;
Option RSidual
End

G3: DZ twin pairs
#include patccc1.dat
Select if zyg = 2
select evera rega nda everb regb ndb ;
Begin Matrices= Group 1;
H Full 1 1
Q Full 1 1
End matrices;

Matrix H .5

Matrix Q .25

Covariances F&(A+C+E+D | H@A+C+Q@D _
H@A+C+Q@D | A+C+E+D) /

Thresholds T ;

Option RSidual

Options NDecimals=4

option func=1.e-8

End

G4: Constrain variances

Constraint NI=1

Begin Matrices;

U unit 1 nvar

E symm 6 6 = %e2

Z iz 6 nvar

End matrices;

Constraint U = \d2v(E) * Z ;

Option Multiple th=-2

End
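The matrix F = (J@(K-B))~ in the script above carries causal effects along the chain from smoking initiation (IS) through regular smoking (RS) to nicotine dependence (ND). The following Python sketch, which is not part of the original scripts, shows the single-twin block (K-B)~ at the starting values .5 and .5. It assumes B is strictly lower triangular, as the Specify statement implies, so that the inverse equals the finite sum I + B + B^2. The function names are hypothetical.

```python
# Illustrative only: (K-B)~ for the three-stage chain IS -> RS -> ND

def matmul(p, q):
    """Ordinary matrix product of two lists-of-lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]

def inverse_i_minus_b(b):
    """(I-B)^-1 for strictly lower-triangular B, via I + B + ... + B^(n-1)."""
    n = len(b)
    total = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    power = [row[:] for row in total]
    for _ in range(n - 1):
        power = matmul(power, b)   # next power of B
        total = [[total[i][j] + power[i][j] for j in range(n)]
                 for i in range(n)]
    return total

# Rows are effects, columns are causes, in the order IS, RS, ND.
# Starting values from the script: B[2,1] = .5 (IS -> RS), B[3,2] = .5 (RS -> ND)
B = [[0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 0.0]]

effects = inverse_i_minus_b(B)
# effects[2][0] is the indirect effect of IS on ND, the product of the two paths
```

Under this parameterization initiation affects nicotine dependence only through regular smoking, so the IS to ND entry is the product of the two free paths.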
