Use of unphased multilocus genotype data in indirect association studies


David Clayton, Juliet Chapman and Jason Cooper

Diabetes and Inflammation Laboratory, Cambridge Institute for Medical Research, University of Cambridge, Cambridge, United Kingdom

It is usually assumed that detection of a disease susceptibility gene via marker polymorphisms in linkage disequilibrium with it is facilitated by consideration of marker haplotypes. However, capture of the marker haplotype information requires resolution of gametic phase, and this must usually be inferred statistically. Recently, we questioned the value of the marker haplotype information, and suggested that certain analyses of multivariate marker data, not based on haplotypes explicitly and not requiring resolution of gametic phase, are often more powerful than analyses based on haplotypes. Here, we review this work and assess more carefully the situations in which our conclusions might apply. We also relate these analyses to alternative approaches to haplotype analysis, namely those based on haplotype similarity and those inspired by cladistics. Genet. Epidemiol. © 2004 Wiley-Liss, Inc.

Key words: association studies; haplotype analysis; gametic phase; linkage disequilibrium; perfect phylogeny

Contract grant sponsor: Wellcome Trust; Contract grant sponsor: Juvenile Diabetes Research Foundation.

Correspondence to: David Clayton, Cambridge Institute for Medical Research (CIMR), Wellcome Trust/MRC Building, Addenbrooke’s Hospital, Hills Road, Cambridge CB2 2XY, UK. E-mail: [email protected] Tel: (01223) 762669.

Received 23 June 2004; Accepted 29 June 2004

Published online 12 October 2004 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/gepi.20032

INTRODUCTION: HAPLOTYPES, GENOTYPES AND DIPLOTYPES

In recent years, mainly as a result of the availability of very large numbers of single nucleotide polymorphisms (SNPs), there has been increasing interest in genetic associations involving several closely linked loci. These are generally termed “haplotype associations” but, as we shall see, they do not always depend on haplotypes as such, but on the diplotype. Since terminology is not always consistently used, it is as well to define what we mean by these terms before proceeding to elaborate on the different types of multi-locus association in the context of direct and indirect studies.

Our terminology broadly follows that of Brumfield et al. [2003]. The term haplotype refers to a particular set of alleles at linked loci, found together on a single chromosome and inherited together. In the context of linkage analysis, a haplotype may be inherited intact (i.e., without mutation or recombination) only in a single meiosis but, for association studies, our interest is only in haplotypes that have remained intact for very many generations. Such haplotypes behave in many respects like alleles of a single, highly polymorphic locus. We shall refer to the pair of such haplotypes carried in an autosomal region of the genome as the genotype, even though the term is more often used in the context of single loci. So defined, haplotypes and genotypes are not directly observed, but must be inferred (usually not unambiguously) from the diplotype, the collection of single-locus genotypes whose phase is unspecified. For example, consider two loci, A and B, with alleles (A, a) and (B, b) respectively. For a subject whose diplotype is doubly heterozygous, i.e., (A/a, B/b), the genotype is ambiguous, since its phase is unknown; it could consist either of the pair of haplotypes A–B and a–b, or of the pair A–b and a–B. However, phase is not, by any means, always ambiguous; for several loci, the phase of the genotype is only completely unknown if the diplotype is heterozygous at every locus.

This report is concerned with the extent to which phase is relevant to association. However, this question is highly dependent on the context. Two major types of genetic association study can be distinguished:

1. direct studies of candidate, potentially causal, polymorphisms, and

2. indirect studies of neutral markers, with the aim of detecting causal variants via linkage disequilibrium.

Genetic Epidemiology 27: 415–428 (2004). © 2004 Wiley-Liss, Inc.

We shall consider direct associations first. If two loci, A and B, are causally related to a phenotype in such a way that phase is relevant, then we have cis and trans interactions; a combination of alleles at the two loci has a different effect according to whether it is found on the same chromosome (a cis effect) or on different ones (a trans effect). It is clear that, in this case, the relevance of phase has strong mechanistic implications. However, a direct cis effect of two loci can be difficult to distinguish from an indirect effect of a third locus; if the A–B haplotype confers high risk of a particular disease, this may be simply because it is likely also to carry a causal allele at a third (unobserved) locus C. Only functional studies can resolve this with certainty.

The above example of indirect association illustrates why we tend to assume that indirect mapping is intimately concerned with haplotypes; the A and B loci on their own could be poor predictors of the functional allele at the C locus, but together identify a haplotype that carries a high-risk C allele. Since it is the haplotype that carries this causal allele, it is natural to assume that the resolution of diplotype data into a pair of phased haplotypes is necessary for the demonstration of indirect associations. In the remainder of this report, we shall question the universal validity of this assumption.
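The two-locus example given in the introduction, in which a doubly heterozygous diplotype is compatible with two different pairs of haplotypes, can be made concrete with a small enumeration. The following is only an illustrative sketch: the function name `compatible_haplotype_pairs` and the representation of a diplotype as a list of allele pairs are assumptions made here, not notation from the paper.

```python
from itertools import product

def compatible_haplotype_pairs(diplotype):
    """Return the set of unordered haplotype pairs consistent with a diplotype.

    `diplotype` is a list of single-locus genotypes, each an unordered pair
    of alleles, e.g. [("A", "a"), ("B", "b")] (hypothetical representation).
    """
    pairs = set()
    # At each locus, choose which of the two alleles sits on the first chromosome.
    for choice in product((0, 1), repeat=len(diplotype)):
        hap1 = "".join(g[c] for g, c in zip(diplotype, choice))
        hap2 = "".join(g[1 - c] for g, c in zip(diplotype, choice))
        pairs.add(frozenset((hap1, hap2)))  # unordered: AB/ab equals ab/AB
    return pairs

# Doubly heterozygous diplotype (A/a, B/b): two possible phase resolutions.
double_het = compatible_haplotype_pairs([("A", "a"), ("B", "b")])

# Heterozygous at one locus only: phase is unambiguous.
single_het = compatible_haplotype_pairs([("A", "a"), ("B", "B")])
```

With one heterozygous locus the set has a single element, which matches the remark that phase is not always ambiguous.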

A MODEL FOR, AND TESTING FOR, INDIRECT ASSOCIATION

In this section we shall introduce some notation for the problem of indirect association with an autosomal causal variant via a marker genotype. We shall go on to discuss models for genotype–phenotype association in this context, derive efficient tests for association, and discuss the problem presented by unknown phase.

Let $Y$ denote the phenotype of interest and let $(Z^{(1)}, Z^{(2)})$ denote the genotype at the causal locus. For the present, the locus will be assumed to be diallelic with $Z^{(1)}$ and $Z^{(2)}$ coded as 0 or 1. This genotype can also be represented by the single variable, $Z^{(+)} = Z^{(1)} + Z^{(2)}$, coded 0, 1 or 2. R.A. Fisher introduced the idea of expressing the genotype–phenotype relationship as an additive model in which the additive component represents a sum of the “main effects” of the two alleles and the dominance component represents their interaction (this use of the term dominance being related to, but different from, its use in classical Mendelian genetics). In our notation,

$$E(Y) = \mu + \alpha Z^{(1)} + \alpha Z^{(2)} + \delta Z^{(1)} Z^{(2)}$$

where $\alpha$ and $\delta$ represent additive and dominance effects. Initially, we will assume no dominance effects (so that $\delta = 0$), but we extend the above model to the generalized linear model [Nelder and Wedderburn, 1972] in which the expected response is related to the linear predictor by a monotone “link” function, $g()$:

$$g\{E(Y)\} = \mu + \alpha Z^{(+)}.$$

Note that, for a binary phenotype, such as a disease, this model includes as special cases:

1. the probit model, in which response occurs when an underlying latent “liability” exceeds some threshold,

2. the multiplicative model for risk, in which the link function is the logarithm, and

3. the logistic model, which allows the multiplicative model for risk to be fitted using data from case-control studies [Breslow and Day, 1980].

However, despite this flexibility, these models share the property that phenotype is related to genotype via a linear function of $Z^{(+)}$, an important property for what follows.

If $Z^{(+)}$ were observed, the score test of the null hypothesis $H_0: \alpha = 0$ would have the form

$$T = \sum_{i=1}^{N} W_i (Y_i - \hat{\mu}_i) Z_i^{(+)}$$

where $i = 1, \ldots, N$ identifies subjects, $W_i$ are weights, and $\hat{\mu}_i$ is the estimate of $E(Y_i)$ under $H_0$. In the simplest case, $W_i$ are constant and can be omitted, and $\hat{\mu}_i$ can be replaced by the sample mean $\bar{Y} = \sum_{i=1}^{N} Y_i / N$. This covers most important cases, the exceptions being when the link function is not “canonical” for the response distribution considered, and when $H_0$ allows for the effect of another variable (such as a population stratification) upon mean phenotype. To keep notation simple, we will not discuss such cases further, and consider only the simpler form:

$$T = \sum_{i=1}^{N} (Y_i - \bar{Y}) Z_i^{(+)}.$$
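As a minimal sketch of the simplified score statistic, the sum of mean-centred phenotypes multiplied by genotype scores can be computed directly. The function name `score_statistic` and the toy data are hypothetical, chosen only for illustration; the statistic is left unstandardized, as in the formula above.

```python
def score_statistic(y, z):
    """T = sum_i (y_i - ybar) * z_i for phenotypes y and genotype scores z (0, 1 or 2)."""
    if len(y) != len(z):
        raise ValueError("y and z must have the same length")
    ybar = sum(y) / len(y)
    return sum((yi - ybar) * zi for yi, zi in zip(y, z))

# Hypothetical toy data: three cases (y = 1) followed by three controls (y = 0).
y = [1, 1, 1, 0, 0, 0]
z = [2, 1, 1, 0, 1, 0]
t = score_statistic(y, z)
```

A genotype score that is the same for every subject gives a statistic of zero, since the centred phenotypes sum to zero.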

This reduces to familiar tests in special cases. For example, when $Y$ is a binary response it becomes the Cochran-Armitage test, a test for association between binary phenotype and a diallelic polymorphism. This is recommended because it does not assume Hardy-Weinberg equilibrium in the study population [Devlin and Roeder, 1999].

However, in indirect association studies the genotype at the causal locus, $(Z^{(1)}, Z^{(2)})$, is not observed. Instead, we observe a set of markers. Initially, let us assume that we are able to observe the complete marker genotype, including phase, and that each marker haplotype provides an indirect measure of the allele carried at the causal locus on that chromosome. Let $X^{(1)}$ and $X^{(2)}$ represent vectors of binary indicators, each of which codes some attribute of the corresponding marker haplotype. We assume that the expected values of the corresponding allele at the causal locus are given by a classical linear model

$$E(Z^{(1)}) = \omega + \beta^T X^{(1)},$$

$$E(Z^{(2)}) = \omega + \beta^T X^{(2)}.$$

Since $Z^{(1)}$ and $Z^{(2)}$ are binary variables, it might have been thought that a logistic or probit regression model would be more appropriate but we shall show, in the next section, that the above is more natural. We further assume that the markers are unrelated to phenotype except via the causal genotype $(Z^{(1)}, Z^{(2)})$. The model is represented graphically in Figure 1. With this model, since $(Z^{(1)}, Z^{(2)})$ are not observed, the full likelihood is a missing data likelihood of the form discussed by Dempster et al. [1977]. The score function is obtained by replacing the complete data score function by its conditional expectation given available data. It follows that the score test for no association $(H_0: \alpha = 0)$ is now [Chapman et al., 2003]

$$
\begin{aligned}
T &= \sum_{i=1}^{N} (Y_i - \bar{Y})\, E(Z_i^{(+)} \mid X_i^{(1)}, X_i^{(2)}, Y, H_0) \\
  &= \sum_{i=1}^{N} (Y_i - \bar{Y}) (\beta^T X_i^{(1)} + \beta^T X_i^{(2)}) \\
  &= \beta^T \sum_{i=1}^{N} (Y_i - \bar{Y}) X_i^{(+)}.
\end{aligned}
$$
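The intercept of the linear model for the causal alleles does not appear in the second line of this derivation. A short sketch of why, writing $\omega$ for that intercept and $X_i^{(+)} = X_i^{(1)} + X_i^{(2)}$, and assuming (as the model states) that under $H_0$ the phenotype carries no information about the causal alleles once the markers are given:

$$
\begin{aligned}
E(Z_i^{(+)} \mid X_i^{(1)}, X_i^{(2)}, Y, H_0) &= 2\omega + \beta^T X_i^{(1)} + \beta^T X_i^{(2)} = 2\omega + \beta^T X_i^{(+)}, \\
\sum_{i=1}^{N} (Y_i - \bar{Y})\, 2\omega &= 2\omega \sum_{i=1}^{N} (Y_i - \bar{Y}) = 0,
\end{aligned}
$$

so only the term in $\beta^T X_i^{(+)}$ contributes to $T$.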

Since $\beta$ is unknown, we replace it by the value that maximizes $T$ subject to its having fixed (unit) variance. This leads to the Lagrange multiplier test [Hosking, 1983]:

$$T = U^T V^{-} U$$

where

$$U = \sum_{i=1}^{N} (Y_i - \bar{Y}) X_i^{(+)}, \qquad V = \mathrm{Var}(U) \ \ (\text{under } H_0)$$

and the superscript $-$ denotes a generalized inverse. This is asymptotically distributed as chi-squared with degrees of freedom equal to the rank of $V$. In the absence of linear dependencies, this is simply the length of the $X$ vectors. The variance-covariance matrix of $U$, $V$, can be estimated by

$$\hat{V} = \frac{1}{N - 1} \sum_{i=1}^{N} (Y_i - \bar{Y})^2 \sum_{i=1}^{N} (X_i^{(+)} - \bar{X}^{(+)}) (X_i^{(+)} - \bar{X}^{(+)})^T.$$

This is also the permutation variance of $U$ in random permutations of the phenotype vector $Y$. The same permutation argument can be used to obtain an “exact” distribution of $T$ under $H_0$.

In general, the pair of haplotype scores, $(X^{(1)}, X^{(2)})$ and, therefore, their sum $X^{(+)}$, are not directly observable from the diplotype. In this case phase, like the unobserved causal genotype $(Z^{(1)}, Z^{(2)})$, becomes missing data in the likelihood. Again, standard theory suggests that the score test is obtained by replacing $X^{(+)}$ by its expectation under $H_0$. This simply generalizes the approach of Schaid et al. [2002], who considered the case in which $X$ represents a set of $K - 1$ indicator variables coding contrasts between the $K$ haplotypes.
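The quadratic-form test of this section, built from the score vector $U$ and its permutation variance estimate, can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, exact rational arithmetic is used so the toy result can be checked by hand, and the variance matrix is assumed non-singular so that an ordinary linear solve stands in for the generalized inverse.

```python
from fractions import Fraction

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination (assumes a is non-singular)."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[n] for row in m]

def lagrange_multiplier_test(y, x):
    """Return (U, V_hat, T) for phenotypes y and per-subject score vectors x."""
    n, k = len(y), len(x[0])
    y = [Fraction(v) for v in y]
    ybar = sum(y) / n
    xbar = [sum(Fraction(row[j]) for row in x) / n for j in range(k)]
    # U_j = sum_i (y_i - ybar) x_ij
    u = [sum((yi - ybar) * row[j] for yi, row in zip(y, x)) for j in range(k)]
    # V_hat = (1 / (N - 1)) * sum_i (y_i - ybar)^2 * centred cross-products of x
    ss_y = sum((yi - ybar) ** 2 for yi in y)
    v = [[ss_y / (n - 1) * sum((row[j] - xbar[j]) * (row[l] - xbar[l]) for row in x)
          for l in range(k)] for j in range(k)]
    t = sum(ui * wi for ui, wi in zip(u, solve(v, u)))  # U^T V^{-1} U
    return u, v, t

# Hypothetical toy data: 3 cases then 3 controls, two marker loci scored 0/1/2.
y = [1, 1, 1, 0, 0, 0]
x = [[2, 0], [1, 1], [1, 0], [0, 1], [1, 2], [0, 1]]
u, v, t = lagrange_multiplier_test(y, x)
```

With two linearly independent marker scores, the resulting statistic would be referred to a chi-squared distribution on two degrees of freedom, or to its permutation distribution.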

Fig. 1. The model for indirect association. Phenotype is related to marker haplotypes only via the causal locus.

POWER OF INDIRECT ASSOCIATION STUDIES

The general arguments set out above can be used to investigate the power of indirect studies in a wide range of situations. Under small departures from $H_0$, the statistic $T$ is distributed according to a non-central chi-squared distribution with non-centrality parameter

$$\eta = E(U^T) V^{-} E(U).$$

In the case in which the generalized model for $Y$ has linear link function, this takes a particularly intuitive form:

$$\eta = (N - 1) R^2 H$$

where $H$ is the proportion of the variance “explained” by additive effects of the alleles $Z^{(1)}$ and $Z^{(2)}$ of the causal locus (i.e., that part of the heritability of $Y$ accounted for by the association), and $R^2$ represents the coefficient of determination (or proportion of variance explained) in the multiple linear regression of the causal allele, $Z$, on the marker haplotype indicators $X$.

For a case-control study of a binary response, assuming a multiplicative model in which each copy of the causal high-risk allele multiplies risk by $\theta$, the non-centrality parameter becomes

$$\eta = \frac{2 N_0 N_1 (\pi' - \pi)^2}{N \bar{\pi} (1 - \bar{\pi}) + \frac{N_0 N_1}{N} (\pi' - \pi)^2} R^2,$$

where $N_0$, $N_1$ are the numbers of controls and cases, $\pi$ is the frequency of the causal allele in controls (and in the general population), $\pi' = \theta\pi / (1 - \pi + \theta\pi)$ is its frequency in cases, and $\bar{\pi} = (N_0 \pi + N_1 \pi') / (N_0 + N_1)$ is its frequency in the complete study. As before, the strength of association between markers and causal variant is measured by $R^2$. Note that in this case, a simple algebraic rearrangement yields

$$U = \frac{N_{\text{cases}} N_{\text{controls}}}{N_{\text{cases}} + N_{\text{controls}}} \left( \bar{X}^{(+)}_{\text{cases}} - \bar{X}^{(+)}_{\text{controls}} \right)$$

so that the test may be regarded as a special case of Hotelling’s $T^2$ test [Hotelling, 1931].

For studies of trios of affected offspring and both parents, the likelihood arguments are rather more complicated. However, Chapman et al. [2003] show that a simple test is available using similar arguments. A case score, $X_{i1}^{(+)}$, is generated from the genotype of the affected offspring, while the genotype constructed from the two untransmitted haplotypes is scored to obtain a matched control score, $X_{i0}^{(+)}$. The test statistic is a paired Hotelling’s $T^2$ test for non-zero mean of the difference vectors $(X_{i1}^{(+)} - X_{i0}^{(+)})$.
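The case-control non-centrality parameter given earlier in this section is simple to evaluate numerically, which makes the role of the marker-to-causal-variant $R^2$ easy to see. The sketch below is illustrative only: the function and argument names are hypothetical, and the total sample size is taken to be the sum of the numbers of controls and cases.

```python
from fractions import Fraction

def case_control_noncentrality(n_controls, n_cases, pi, theta, r2):
    """Non-centrality parameter for a case-control study under a multiplicative model.

    pi    : causal allele frequency in controls
    theta : risk multiplier per copy of the causal allele
    r2    : R^2 of the regression of the causal allele on the marker scores
    """
    n = n_controls + n_cases  # assumption: N is the total number of subjects
    pi_cases = theta * pi / (1 - pi + theta * pi)
    pi_bar = (n_controls * pi + n_cases * pi_cases) / n
    diff_sq = (pi_cases - pi) ** 2
    numerator = 2 * n_controls * n_cases * diff_sq
    denominator = n * pi_bar * (1 - pi_bar) + n_controls * n_cases * diff_sq / n
    return numerator / denominator * r2

# Hypothetical design: 1000 controls, 1000 cases, allele frequency 0.2, theta = 1.5.
eta_direct = case_control_noncentrality(1000, 1000, Fraction(1, 5), Fraction(3, 2), 1)
eta_indirect = case_control_noncentrality(1000, 1000, Fraction(1, 5), Fraction(3, 2), Fraction(1, 2))
```

Because $R^2$ enters as a simple multiplier, markers that explain half the variance of the causal allele halve the non-centrality parameter relative to typing the causal variant itself.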

SCORING HAPLOTYPES

Up to this point, we have left unspecified the way in which a marker haplotype might be scored as a vector of indicators, $X$. The simplest possibility is to let the elements of $X$ be indicator variables corresponding to each of the possible haplotype configurations. To avoid linear dependency, the indicator corresponding to a “reference” configuration (perhaps the most common) would be omitted so that, if $K$ different haplotypes occur in the data, the length of the $X$ vectors would be $K - 1$.

It is widely recognised that this strategy is only useful when rather few marker loci are considered, so that $K$ is small. Otherwise the number of degrees of freedom in the test is large, leading to low statistical power. In effect, degrees of freedom are being wasted by testing rare haplotypes for which the study would have insufficient power to detect association even if it were present. This is usually avoided in practice by setting some minimum haplotype frequency for consideration and grouping all rarer haplotypes together. This is not altogether satisfactory since the haplotypes so grouped have nothing in common save their rarity.

A number of authors have suggested that, for case-control studies, this problem can be avoided by use of tests based upon pairwise comparisons between haplotypes [van der Meulen and te Meerman, 1997; Tzeng et al., 2003, 2004]. The general idea is that marker haplotypes around a causal locus will be more similar in case-case comparisons than in random pairs in the population. For simplicity of notation, consider the case of equal numbers of cases and controls and let $f_a$ be the $K$-vector of haplotype frequencies in affected subjects (cases) and $f_u$ the corresponding vector for unaffected persons (controls). Then, if $A$ is a similarity matrix, the proposed statistics have the form

$$D \propto f_a^T A f_a - f_u^T A f_u.$$

Typically $D$ is standardized by dividing by the square root of the sampling variance and compared with the standard normal distribution. Since this is equivalent to comparing its square with the chi-squared distribution on one degree of freedom, Tzeng et al. [2003] described this as a one-degree-of-freedom test, a description that implies resolution of the many-degrees-of-freedom problem and, therefore, increased power. However, this is misleading. If the chi-squared test statistic proposed above was on a large number of degrees of freedom, we might equally well have invoked a normal approximation for its distribution and described that as a chi-squared test on one degree of freedom.
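A minimal sketch of the unstandardized similarity statistic may help fix ideas. Two illustrative choices of similarity matrix are used: one in which haplotypes are similar only if identical, and one that counts the loci at which two haplotypes agree. The haplotype labels, frequencies and function names are hypothetical, and no sampling variance is computed.

```python
from fractions import Fraction

def quadratic_form(f, a):
    """Return f^T A f for a frequency vector f and a square matrix a."""
    k = len(f)
    return sum(f[i] * a[i][j] * f[j] for i in range(k) for j in range(k))

def similarity_statistic(f_cases, f_controls, a):
    """Unstandardized D: case-case similarity minus control-control similarity."""
    return quadratic_form(f_cases, a) - quadratic_form(f_controls, a)

def matching_matrix(haplotypes):
    """Similarity 1 only when two haplotypes are identical at every locus."""
    return [[int(h == g) for g in haplotypes] for h in haplotypes]

def counting_matrix(haplotypes):
    """Similarity = number of loci at which two haplotypes carry the same allele."""
    return [[sum(x == y for x, y in zip(h, g)) for g in haplotypes] for h in haplotypes]

# Hypothetical two-locus haplotypes and their frequencies in cases and controls.
haplotypes = ["11", "12", "21"]
f_cases = [Fraction(3, 4), Fraction(1, 4), Fraction(0)]
f_controls = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]

d_match = similarity_statistic(f_cases, f_controls, matching_matrix(haplotypes))
d_count = similarity_statistic(f_cases, f_controls, counting_matrix(haplotypes))
```

In this toy example the cases are concentrated on one haplotype, so both choices of matrix give a positive value.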

Haplotype similarity statistics were originally proposed in the context of long-range linkage disequilibrium mapping and measures of similarity typically incorporated the length of the two haplotypes shared identically by state. Tzeng et al. [2003] describe this as a length measure of similarity. However, over the shorter distances now usually considered for studies using SNP markers, physical distance becomes largely irrelevant. For such cases, Tzeng et al. [2003] describe two further measures: the matching measure, in which haplotypes are only similar if they are identical at every locus, or the counting measure, in which similarity is measured by the number of loci that are identical by state.

Clayton and Jones [1999] have pointed out that haplotype similarity tests and the Lagrange multiplier tests described here are not as different as at first they might appear. In the case-control study considered above, our $U$ vector of the $(K - 1)$ degree of freedom Lagrange multiplier test is a linear transformation of $(f_a - f_u)$, so that this test is of the form

$$(f_a - f_u)^T A (f_a - f_u) = f_a^T A f_a + f_u^T A f_u - 2 f_a^T A f_u,$$

which, if $A$ is interpreted as a similarity matrix, contrasts the sum of case-case and control-control similarities with twice the case-control similarity. Note that other, more parsimonious, methods of scoring marker haplotypes reduce to the same general form with suitable modifications to the $A$ matrix. We see that the main differences between the Lagrange multiplier chi-squared tests proposed by Chapman et al. [2003] and the haplotype similarity tests are (1) whether case-case similarities and control-control similarities are compared with each other, or with case-control similarity, and (2) the precise form of the matrix $A$. In the haplotype similarity methods, the similarity matrix is chosen on intuitive grounds, whilst in our approach it is determined by the method for scoring marker haplotypes and by the variance-covariance matrix of the resulting scores under $H_0$.

The method considered by Chapman et al. [2003] to achieve greater parsimony, as reflected in degrees of freedom of the chi-squared test, was to consider scorings based on the familiar hierarchical main effects and interactions in linear models. Thus, with four diallelic markers, A, B, C, and D, there are sixteen possible haplotypes and, if they all occur, the full coding leads to a fifteen-degree-of-freedom test. Reparametrizing these as conventional linear model contrasts leads to four main effects, six first-order interactions, four second-order interactions, and one third-order interaction. The scores, $X$, can now be thought of as rows of a design matrix and greater parsimony can be achieved by removing columns corresponding to high-order interaction terms. As we reduce the number of terms in the model, the $R^2$ measuring the propensity of the markers to predict the causal variant will decrease and, as a result, so will the non-centrality parameter of the distribution of the test statistic. This would lead to reduced power, were it not for the fact that the degrees of freedom of the distribution also decline. It is to be expected that, in any one case, there would be an optimal level of complexity.

Chapman et al. [2003] examined this empirically in 17 extensively resequenced genes by selecting, at random, one SNP as a causal variant and others as marker sets. Unexpectedly, they found that the most powerful tests were achieved with very low orders of complexity; specifically, in most cases, the best choice of scoring corresponded to the linear model with main effects and no interactions. Even when there was a gain in power by including one or two low-order interactions, the gain was very modest. These observations led Chapman et al. [2003] to conclude that the simplified test with $L$ degrees of freedom for $L$ marker loci is generally to be preferred to testing for differences between haplotypes. We shall refer to this test as a locus-based test, as opposed to a haplotype-based test. However, in one or two of the situations studied by Chapman et al. [2003], some modest gain in power could be obtained by scoring partial haplotype information. We shall see why this is the case later.

Scoring of marker haplotypes in terms of main effects alone has an additional attraction, namely that it removes the problem of unknown phase. In our example of four diallelic markers, $X^{(1)}$ and $X^{(2)}$ each has four elements, corresponding to the four marker loci. Each of these is coded 0 or 1 according to the allele present. Thus, the elements of $X^{(+)}$ code the single-locus genotypes as 0, 1, or 2, and there is no need to attempt to resolve haplotype phase. It is interesting to note in passing that Tzeng et al. [2003] observed that the counting method of scoring haplotype similarity in their approach was a “surprising exception” in not requiring resolution of haplotype phase. This is another demonstration that there are close parallels between the methods; the chi-squared tests on $K - 1$ and $L$ degrees of freedom, corresponding to full haplotype and main effects scorings of Chapman et al. [2003], are closely analogous to the matching and counting methods of scoring similarity in the approach of Tzeng et al. [2003], albeit with different weighting schemes. These two extremes correspond to methods that count haplotype similarities, and methods that count locus similarities. The intermediate scorings investigated by Chapman et al. [2003] also allow counting of similarities for shorter haplotypes based on subsets of markers.

It might be thought that the surprisingly modest contribution of haplotype information to the power of association tests could be due to erosion of its value by unknown phase. It could be that, by resolving haplotype phase either by an advance in genotyping technology or by typing family members of cases and controls, a rather different pattern might emerge. However, we have repeated the exercise reported by Chapman et al. [2003], assuming phase to be known. Although the power of all tests except that based solely on locus scoring was increased, the increase was extremely modest and did not substantially affect the conclusions.

Chapman et al. [2003] also showed that the locus-based test has a particularly simple form in case-parent trio studies since, in this case, it is not necessary to resolve phase to compute the score vectors $X_{i1}^{(+)}$ and $X_{i0}^{(+)}$ for transmitted and untransmitted “genotypes”; we need only the pairs of transmitted and untransmitted alleles at each locus. This considerably simplifies testing since we can use a permutation argument in which transmitted and untransmitted scores are randomly exchanged. Note that such an argument is not possible when the $X$ vectors involve imputation of haplotype phase, since the evidence available for such imputation varies according to transmission. This means that inference is computationally intensive and yields only asymptotic tests [Clayton, 1999].

Further attempts to devise more powerful tests for haplotype association have been based on the construction of phylogenetic trees to classify haplotypes [Templeton, 1995; Templeton et al., 1987, 1988, 1992; Templeton and Sing, 1993; Seltman et al., 2001]. The general idea is that the causal variant, if it exists, is embedded within the coalescent process describing the evolution of the study sample. This suggests testing for association by a series of one-degree-of-freedom tests guided by the cladogram.

To illustrate this approach, Figure 2 shows a cladogram representing evolutionary relationships between seven haplotypes numbered I to VII. Seltman et al. [2001] discuss the use of such a cladogram in the analysis of disease risk in case-parent trio studies, but the approach they describe could be applied more generally. In our example, the omnibus test for differences of phenotype distribution between haplotypes has six degrees of freedom, but Seltman et al. [2001] suggest carrying out a sequence of six one-degree-of-freedom tests, stopping when one of these achieves a pre-specified significance level. The sequence of tests corresponds to successive collapsing of the cladogram. The first four of these test each of the zero-step clades III, IV, VI, and VII. Thus, we would test for differences between (III and I), (IV and I), (VI and II), and (VII and V), respectively. If none of these differences achieves significance, the remaining two tests compare the one-step clades, i.e., (I+III+IV) versus (II+VI), and (V+VII) versus (II+VI). If neither of these tests achieves significance, we accept the global null hypothesis of no association. Seltman et al. [2001] recommend using a Bonferroni correction when choosing the significance level for interim tests. Thus, for a nominal $\alpha$ level and for $K$ haplotypes, the testing and collapsing of the cladogram continues until no further reduction is possible with significance level of $\alpha / (K - 1)$. Because tests are not independent, they recommend use of a Monte Carlo method to compute a corrected overall P value.

Although seemingly based entirely on analyses of haplotypes, cladistic analyses such as this are not unrelated to the topic of this report. An observation that is key to this was made by Seltman et al. [2001], who wrote:

“Implicit...is the requirement that recombinations between markers are rare, so that the history of the haplotypes can be described by a mutation tree. This assumption is not unduly restrictive; in general, only if the region is in tight linkage will there be substantial association between haplotypes and an embedded disorder-related mutation.”

In the next section, we consider the extreme case of “complete” linkage disequilibrium.

Fig. 2. Cladogram of seven haplotypes.

COMPLETE LINKAGE DISEQUILIBRIUM AND THE PERFECT PHYLOGENY

In retrospect, the empirical findings of Chapman et al. [2003] should not have been unexpected. They arise because that study considered association in small genomic regions in which linkage disequilibrium is extremely strong. Such situations approach the state of “complete” linkage disequilibrium, in which there have been no recombinations in the population history and each SNP has arisen as a result of a single ancestral mutation, so that the value of Lewontin’s $D'$ between every pair of markers is then 1.0. In this case, the number of haplotypes is one more than the number of SNPs and the degrees of freedom for haplotype-based and locus-based tests are equal. Indeed, the tests are identical. For the reasons given in the quotation at the end of the previous section, real indirect mapping applications may not differ by very much from this extreme situation; haplotype-based approaches are then using extra degrees of freedom that are expressing rare features of the haplotype distribution. Equally, such situations lead to little phase uncertainty so that it should not be surprising that little is gained by resolving phase exactly.

Table I illustrates complete linkage disequilibrium in the case of six loci, defining seven haplotypes. The evolutionary history of the haplotypes is shown in Figure 3, and this also corresponds to the cladogram of Figure 2. It should be noted that complete linkage disequilibrium such as is seen here does not necessarily suggest high pair-wise $R^2$ values between loci; only two of these exceed 0.25 and most are less than 0.1. This implies that hoping to pick up indirect genetic association by looking at single markers may not be a very powerful strategy unless markers are carefully chosen to represent different parts of the genealogy.

TABLE I. Six loci in complete linkage disequilibrium

| Haplotype | Locus A | Locus B | Locus C | Locus D | Locus E | Locus F | Probability |
|---|---|---|---|---|---|---|---|
| I | 1 | 1 | 1 | 1 | 1 | 1 | .30 |
| II | 2 | 1 | 1 | 1 | 1 | 1 | .20 |
| III | 1 | 2 | 1 | 1 | 1 | 1 | .15 |
| IV | 1 | 1 | 2 | 1 | 1 | 1 | .15 |
| V | 2 | 1 | 1 | 2 | 1 | 1 | .10 |
| VI | 2 | 1 | 1 | 1 | 2 | 1 | .05 |
| VII | 2 | 1 | 1 | 2 | 1 | 2 | .05 |

Fig. 3. Evolutionary history of the seven haplotypes of Table I.

Fig. 4. Conditional independence graph for the joint distribution of six SNPs shown in Table I.

The cladogram illustrates clustering of haplotypes, mirroring their evolutionary history. An alternative diagram is the conditional independence graph shown in Figure 4, which illustrates the multivariate interdependence of the six binary variables A to F. This graph expresses the fact that, for example, E and F are conditionally independent given D. This conditional independence structure has a one-to-one correspondence with the structure of the cladogram (Fig. 2), junctions on the cladogram corresponding to cliques in the conditional independence graph (a clique is a set of nodes that are fully connected). Thus, there are three cliques in the graph of Figure 4, namely (F, D), (A, D, E), and (A, B, C), corresponding to the junctions on the cladogram at haplotypes V, II, and I. A further property of this multivariate distribution is that the conditional expectations of each variable are given by linear regression equations:

\begin{align*}
E(A) &= 1 - 0.4B - 0.4C + 0.6D + 0.6E \\
E(B) &= 2 - 0.33A - 0.33C \\
E(C) &= 2 - 0.33A - 0.33B \\
E(D) &= 0.33 + 0.33A - 0.33E + 0.67F \\
E(E) &= 1 + 0.2A - 0.2D \\
E(F) &= 0.67 + 0.33D.
\end{align*}
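Two of the numerical statements made about Table I can be checked directly from the haplotype frequencies. The following is a minimal sketch, not the authors' code: the names `HAPLOTYPES`, `pairwise_r2`, `conditional_mean`, and `fitted_d` are introduced here for illustration only, and the regression for D is written with exact thirds on the assumption that the printed .33 and .67 are rounded values of 1/3 and 2/3.

```python
from itertools import combinations

LOCI = "ABCDEF"
# Table I: alleles (1 or 2) at loci A..F and the frequency of each haplotype.
HAPLOTYPES = {
    "I":   ((1, 1, 1, 1, 1, 1), 0.30),
    "II":  ((2, 1, 1, 1, 1, 1), 0.20),
    "III": ((1, 2, 1, 1, 1, 1), 0.15),
    "IV":  ((1, 1, 2, 1, 1, 1), 0.15),
    "V":   ((2, 1, 1, 2, 1, 1), 0.10),
    "VI":  ((2, 1, 1, 1, 2, 1), 0.05),
    "VII": ((2, 1, 1, 2, 1, 2), 0.05),
}


def pairwise_r2(i, j):
    """Squared correlation between the allele-2 indicators at loci i and j."""
    p_i = sum(f for h, f in HAPLOTYPES.values() if h[i] == 2)
    p_j = sum(f for h, f in HAPLOTYPES.values() if h[j] == 2)
    p_ij = sum(f for h, f in HAPLOTYPES.values() if h[i] == 2 and h[j] == 2)
    d = p_ij - p_i * p_j
    return d * d / (p_i * (1 - p_i) * p_j * (1 - p_j))


def conditional_mean(target, given):
    """E(target | given), where `given` maps locus letters to alleles."""
    t = LOCI.index(target)
    matching = [(h, f) for h, f in HAPLOTYPES.values()
                if all(h[LOCI.index(k)] == v for k, v in given.items())]
    total = sum(f for _, f in matching)
    return sum(h[t] * f for h, f in matching) / total


def fitted_d(a, e, f):
    """Fourth regression equation, with 1/3 and 2/3 in place of .33 and .67."""
    return (1 + a - e + 2 * f) / 3


# All 15 pairwise r-squared values, largest first.
r2 = {(LOCI[i], LOCI[j]): pairwise_r2(i, j)
      for i, j in combinations(range(len(LOCI)), 2)}
for pair, value in sorted(r2.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 3))

# The four observed (A, E, F) configurations: exact conditional mean of D
# next to the value given by the linear equation.
for a, e, f in [(1, 1, 1), (2, 2, 1), (2, 1, 2), (2, 1, 1)]:
    exact = conditional_mean("D", {"A": a, "E": e, "F": f})
    print((a, e, f), round(exact, 3), round(fitted_d(a, e, f), 3))
```

In this sketch the only pairs whose r² exceeds 0.25 are (A, D) and (D, F), and the linear equation for D agrees with the exact conditional mean at each of the four observed configurations.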

This justifies our assumption of linear regressions for the expectation of the functional variant Z given marker scores, X. To gain insight into these conditional independence relationships, let us consider the fourth of these equations. Firstly, A = 2 is a necessary condition for D to depart from 1 and, conditional upon A, both B and C are irrelevant to its value. Secondly, if there were any further mutations of haplotype VII, leading to further SNPs, then these would be irrelevant given the value of F. Thus, the conditional distribution of D depends only on A, E, and F. Only four configurations of these SNPs are observed. With the configurations (A = E = F = 1) and (A = E = 2, F = 1), D must take the value 1. Similarly, for (A = 2, E = 1, F = 2), D must take the value 2. But the configuration (A = 2, E = F = 1) identifies either haplotype II, for which D = 1, or haplotype V, for which D = 2. Since the relative frequencies of these haplotypes are 0.2 and 0.1, respectively, the conditional expectation of D in this case is 1.33. These four conditions determine the regression coefficients in the above equation, and their values are obtained by solving the corresponding four simultaneous equations. In general, the conditional distribution of the SNP corresponding to mutation of the “parent” haplotype X to the “offspring” haplotype Y depends on those SNPs that correspond to the branches of the phylogeny (1) from the parent of X to X, (2) from Y to any offspring nodes, and (3) from X to the “siblings” of Y.

The linear regressions that predict each SNP from the remaining ones do not, in the complete linkage disequilibrium case, involve “interaction” terms. The presence of recombination in the evolutionary history, however, changes this. Figure 5 illustrates the case where our complete disequilibrium pattern is disrupted by a recombination. It can be shown that the conditional independence graph for A to F is modified only by addition of edges between A and F and between B and D. However, some of the regression models for the conditional expectations now involve a first-order interaction term. The regression for A involves a B × D interaction, those for B, E, and F involve A × D interactions, and the regression for D requires an A × F interaction. Thus, ancestral recombination means that some haplotype information becomes relevant. However, with a modest amount of recombination, a main effects model can do nearly as well, though it may correspond to a less sparse conditional independence graph. It may also be that, if the simplified model is to be used, slightly more markers need to be typed to fully capture the information in a region.

Fig. 5. Addition of a recombinant haplotype to the history shown in Figure 3. Dashed arrows denote a recombination.

We now return to the complete linkage disequilibrium case in order to gain some insight into the relative merits of a multiple degree of freedom test over the sequence of one degree of freedom tests as suggested by Seltman et al. [2001]. Consider the regression (or logistic regression) of phenotype on the six loci A–F, each scored 0, 1, or 2; that is, the regression on the locus scoring for X(+). In the case of complete linkage disequilibrium, this is simply a reparametrization of the haplotype scoring, so that the six regression coefficients parametrize differences between the effects of the seven haplotypes on the distribution of phenotype. The step-wise testing procedure of Seltman et al. [2001] is then equivalent to step-wise dropping of loci from the regression. The first four tests (corresponding to the zero-step clades) are generated by dropping B, C, E, or F and, after all of these have been dropped, the remaining two tests (corresponding to the one-step clades) are generated by dropping A or D.

Why should the sequence of six one-degree-of-freedom tests, with Bonferroni correction, be preferred to a single six-degree-of-freedom test? In the simple case in which an L-degree-of-freedom test can be decomposed into the sum of L independent one-degree-of-freedom tests, the non-centrality parameter can be decomposed in exactly the same manner. If the total non-centrality concentrates largely in a single one of these tests, greater power is obtained by carrying out L separate one-degree-of-freedom tests, with a Bonferroni correction. This is because the Bonferroni correction has a less extreme effect than does increasing the degrees of freedom without increasing the non-centrality parameter. In contrast, if the non-centrality is spread throughout all the tests, greater power is obtained using the single test with L degrees of freedom. This argument suggests that the cladogram-collapsing approach will be advantageous if it leads to the association being largely captured by one of the one-degree-of-freedom tests. Let us consider the situation of our six markers defining the seven haplotypes portrayed in the cladogram of Figure 2. An unobserved causal variant, Z, in complete linkage disequilibrium with these markers, will create an additional haplotype, and this could fall anywhere on the cladogram. Figure 6 shows three possible positions. In the first, the causal mutation creates a variant of haplotype VII, and in the second case it creates a variant of haplotype II. In the third case, it creates a variant of haplotype I, and all haplotypes descended from I also carry the causal variant. These three possibilities give very different conditional independence graphs (Fig. 7).
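The power argument made above for Bonferroni-corrected one-degree-of-freedom tests versus a single L-degree-of-freedom test can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than a calculation taken from the paper: the L component tests are assumed independent, L is assumed even so that the chi-squared tail has a closed form, the values L = 6 and total non-centrality 10 are arbitrary, and "power" for the Bonferroni procedure is taken to mean the probability that at least one component test rejects. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def chi2_sf_even(x, df):
    """P(chi-squared with df degrees of freedom > x), valid for even df only."""
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= (x / 2) / j
        total += term
    return math.exp(-x / 2) * total


def chi2_critical_even(alpha, df):
    """Critical value found by bisection on the survival function."""
    lo, hi = 0.0, 1000.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if chi2_sf_even(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def power_joint(ncp, df, alpha=0.05, terms=200):
    """Power of a single df-degree-of-freedom test with non-centrality ncp.

    Uses the fact that a non-central chi-squared variable is a
    Poisson(ncp / 2) mixture of central chi-squareds on df + 2i
    degrees of freedom.
    """
    crit = chi2_critical_even(alpha, df)
    weight = math.exp(-ncp / 2)  # Poisson probability of i = 0
    power = 0.0
    for i in range(terms):
        power += weight * chi2_sf_even(crit, df + 2 * i)
        weight *= (ncp / 2) / (i + 1)
    return power


def power_bonferroni(ncps, alpha=0.05):
    """P(at least one of len(ncps) independent 1-df tests rejects at alpha / L)."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / (2 * len(ncps)))
    p_none = 1.0
    for ncp in ncps:
        shift = math.sqrt(ncp)
        reject = z.cdf(-crit - shift) + 1 - z.cdf(crit - shift)
        p_none *= 1 - reject
    return 1 - p_none


L, NCP = 6, 10.0  # illustrative values only
joint = power_joint(NCP, L)
concentrated = power_bonferroni([NCP] + [0.0] * (L - 1))
spread = power_bonferroni([NCP / L] * L)

print(f"single {L}-df test:                  {joint:.3f}")
print(f"Bonferroni, non-centrality in one:  {concentrated:.3f}")
print(f"Bonferroni, non-centrality spread:  {spread:.3f}")
```

With these illustrative values the Bonferroni procedure should come out ahead of the joint test when the non-centrality is concentrated in one component, and behind it when the non-centrality is spread evenly, which is the ordering the argument predicts.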

Fig. 6. Three alternative positions for the causal mutation.

Fig. 7. Conditional independence graph for six markers, A–F, and a causal variant, Z, as shown in Figure 6. a–c: The cases where the causal mutation creates haplotype VIIa, IIa, and Ia, respectively.


Figure 7a shows that, when the causal mutation occurs on haplotype VIIa, indirect association will be created only with marker F, and this single parameter, which compares effects of haplotypes VII and V, captures all of the association. However, the other two cases lead to conditional associations between phenotype and three of the loci, so that all association is no longer captured by single regression coefficients.

This illustration suggests that the sequence of tests proposed by Seltman et al. [2001] cannot be optimal in all cases. However, the idea that more careful consideration of the cladogram could lead to more powerful test strategies than a single multiple-degree-of-freedom chi-squared test deserves further investigation.

IMPLICATIONS FOR SELECTION OF “TAG” SNPs

It has been pointed out that a further consequence of strong linkage disequilibrium is to create considerable marker redundancy [Johnson et al., 2001] so that a subset of “tag” SNPs captures almost all the information. There is now a considerable literature on different algorithms for selecting tag SNPs from a larger (ideally exhaustive) set of known variants. Our theoretical treatment of the power of association studies suggests that the aim of such selection should be that any omitted variants should be predictable from the selected set of tags with some acceptable level of R² (we use 0.8). This recommendation is common to many approaches. However, there are variations on the theme. For example, while we focus on the prediction of omitted markers, Stram et al. [2003] consider the R² to predict the more common haplotypes. The discussion of the previous section shows that, when linkage disequilibrium is strong, there will be little difference between these approaches. A further difference is whether prediction should be achieved by a “main effects” regression model (locus-based prediction), by use of the full tag haplotype (haplotype-based prediction), or by any of the intermediate design-matrix based predictions considered by Chapman et al. [2003]. To achieve the same R² using locus-based prediction, slightly more tags will need to be used, but power will be increased by use of smaller degrees of freedom. Table II lists our experience of 34 candidate genes or regions for type 1 diabetes. It is striking how often locus-based predictions perform nearly as well as haplotype-based predictions, demonstrating how close these real situations are to the idealized complete linkage disequilibrium case considered in the previous section. The largest discrepancy in Table II is for the CTLA4-central region, where haplotype scoring increases R² from 0.82 to 0.97. This is attributable to the fact that there is a single recombinant haplotype that describes some 15% of chromosomes in this region. However, to capture this information, it should not be necessary to revert to scoring full haplotypes; scoring for partial haplotypes of perhaps two or three of the tags should be sufficient and incorporation of this in the test would not inflate degrees of freedom unduly. Improved algorithms for tag SNP selection should be able to detect such situations and suggest efficient scorings.

Another striking feature of Table II is the dependence of the number and density of tags required on the number and density of variants existing in the region. This dependence is shown in Figure 8. This has clear implications for the International HapMap project [International HapMap Consortium, 2003]. Wang and Todd [2003] came to a similar conclusion from their analysis of 73 autosomal gene segments from the UW-FHCRC Variation Discovery Resource database (http://pga.mbt.washington.edu).

Consideration of the relationship between the haplotype phylogeny and the locus conditional independence graph suggests that the main properties required of a set of tag SNPs are:

1. the set of tags should be well dispersed throughout the conditional independence graph, and

2. all parts of the graph should be represented.

In a situation that approaches complete linkage disequilibrium, all the information concerning the haplotype distribution is contained in the allele frequencies and the pairwise correlations, so that it should be possible to restate the above properties in these terms. Thus, Carlson et al. [2004] showed that efficient haplotype tagging can be achieved using a simple and fast algorithm that first clusters SNPs into “bins” such that pairwise r² values between SNPs within bins are high and then selects one tag SNP from each bin.

The selection of tag SNPs must usually be made based on sequencing, or genotyping, a relatively small panel of samples, and R² values are known to be exaggerated in small samples. The extent of this bias depends on the size of the sample in relation to the number of parameters which must be estimated in the predictive regression equations. Thus, the R² values based on haplotype scoring will be more biased than those based on locus scoring. This consideration renders the good performance of the latter even more impressive. In fact, the bias in the locus R² may be quite small in all but the smallest of studies. In the case of multivariate normal data, Wishart [1931] showed the upward bias to be

\[
(1 - R^2)\,\frac{p - 1}{n - 1} \;-\; 2R^2(1 - R^2)\,\frac{n - p}{n^2 - 1} \;+\; O\!\left(\frac{1}{n^2}\right)
\]

where n is the sample size and p is the number of predictor variables in the multiple regression. When R² is large, this bias is modest. Although genotype data are not multivariate normal, approximately the same order of bias seems to apply.
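To give a feel for the size of this bias, the two leading terms of the expression above can be evaluated for a few panel sizes. This is a minimal sketch: the function name `r2_upward_bias` and the values of n, p, and R² used below are illustrative assumptions, not figures from the study, and the O(1/n²) remainder is simply ignored.

```python
def r2_upward_bias(r2, n, p):
    """Leading terms of the upward bias in sample R-squared (Wishart, 1931).

    r2 is the population R-squared, n the sample size, and p the number
    of predictor variables, following the definitions in the text.
    """
    first = (1 - r2) * (p - 1) / (n - 1)
    second = 2 * r2 * (1 - r2) * (n - p) / (n * n - 1)
    return first - second


# Hypothetical scenario: true R-squared of 0.8, comparing a small number
# of predictors (locus scoring) with a larger number (haplotype scoring).
for n in (24, 48, 96, 192):
    few = r2_upward_bias(0.8, n, p=8)
    many = r2_upward_bias(0.8, n, p=20)
    print(f"n={n:4d}  p=8: {few:.4f}   p=20: {many:.4f}")
```

Under these assumed values the approximate bias shrinks as the panel grows and is larger for the model with more predictors, in line with the remark that R² based on haplotype scoring will be more biased than R² based on locus scoring.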

DISCUSSION

Throughout, we have assumed the generalized linear model for additivity of effects of the two alleles at a causal autosomal locus. It is this that ensures that the marker genotype score X(+) captures all information for testing for indirect association. We have seen that, when linkage disequilibrium is strong, an efficient scoring has each element of X(+) simply equal to the single locus genotype score (0, 1, or 2) for each marker so that the marker diplotype information alone is required.
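The statement that locus scoring needs only the diplotype can be made concrete with a small example. In this sketch the function `locus_scores` is a hypothetical helper, alleles are coded 1 and 2 as in Table I, and counting allele 2 is an arbitrary choice made for illustration.

```python
def locus_scores(hap1, hap2, variant=2):
    """Locus scoring of X(+): copies of the variant allele (0, 1, or 2) per marker."""
    return tuple((a == variant) + (b == variant) for a, b in zip(hap1, hap2))


# A double heterozygote at two markers has two possible phase resolutions.
cis = locus_scores((2, 2), (1, 1))    # variant alleles on the same chromosome
trans = locus_scores((2, 1), (1, 2))  # variant alleles on opposite chromosomes
print(cis, trans)

# A subject carrying haplotypes II and V of Table I.
print(locus_scores((2, 1, 1, 1, 1, 1), (2, 1, 1, 2, 1, 1)))
```

Both phase resolutions of the double heterozygote yield the same score vector, so under this scoring gametic phase never has to be resolved.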

TABLE II. Numbers of tag SNPs required for haplotype-based and locus-based prediction^a

                                                 Haplotype R² selection    Locus R² selection
Gene/region         kb    SNPs    Common SNPs    Tag SNPs    Min R²        Tag SNPs    Min R²
FRAP 1             160      56             25           8      0.91               8      0.82
CBLB               210      35             21           6      0.88               8      0.84
CTLA4-extended     110      78             76          10      0.85              13      0.82
CTLA4-central       24      32             30           8      0.97               8      0.82
IL2                 25      20             10           5      0.89               6      0.87
IL21                 8      15             10           4      1.00               4      0.95
IAN4L1              26      29             25           6      0.84               7      0.83
IFNB1                1      21             17           6      0.83               7      0.97
IFNW1                2      29             25           7      0.81              11      0.81
FCER1B (MS4A2)      10      34             15           4      0.81               5      0.81
TH5′ Region 1       10      12             12           3      0.84               4      0.84
TH5′ Region 2       20      11             11           5      0.90               6      0.84
TH-INS-IGF2AS       30      28             24           9      0.85              11      0.84
TRANCE              33      26             14           3      0.88               3      0.86
IL21R               48      38             21          11      0.85              17      0.83
ICSBP1              22      42             35           8      0.83               9      0.80
RANK                38      22             14           6      0.94               6      0.90
CD101               38      31             21           8      0.80              10      0.85
ACT1                37      15             10           5      0.82               6      0.83
IGF1                88      27             10           6      0.87               7      0.87
IL15RA              36     113            113          17      0.82              20      0.83
IL2RA               69      30             28          13      0.82              15      0.88
IL2RB               26      97             59          18      0.84              19      0.87
FYN                117      48             18          13      0.86              16      0.83
CREM                79      33             28           6      0.90               6      0.80
B2M                 28      13             10           5      0.90               8      0.90
NRAMP               38      20             13           4      0.88               4      0.84
SDF1                34      32             27           6      0.94               6      0.92
MHC2TA              54      88             55          24      0.94              30      0.82
TREM1               17      39             29           8      0.88               8      0.81
TLT1-TREM2          20      19              5           3      1.00               3      1.00
CRP                 30      20             15           5      0.80               7      0.85
CCL5                15      20             17           4      0.87               5      0.86
FGF2                78      36             22          10      0.85              11      0.81

^a “Common” SNPs are those with a minor allele frequency of at least 3%.


If there are strong dominance effects, homozygosity at the causal locus should be considered in the model. Chapman et al. [2003] suggest that one additional element should be added to the marker genotype score X(+), providing an indirect measure of homozygosity at Z (“one degree of freedom for dominance”). This suggestion requires more careful study.

We have also assumed a single diallelic causal mutation in the region tested. If there is more than one causal mutation, we must also consider the mode of their joint action. Two extreme models can be considered for the joint action of two mutations:

1. both mutations must be present on the same chromosome to confer increased risk (cis interaction), or

2. either mutation confers increased risk (heterogeneity).
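The difference between the two extreme models listed above lies in how the mutation indicators on each chromosome are combined into a causal score. The following sketch is illustrative only: both function names are hypothetical, each chromosome is represented by a pair of 0/1 indicators for the two mutations, and additive scoring over the two chromosomes is assumed.

```python
def composite_cis_score(chrom1, chrom2):
    """Cis interaction: count chromosomes that carry both mutations together."""
    return sum(int(m1 and m2) for m1, m2 in (chrom1, chrom2))


def heterogeneity_scores(chrom1, chrom2):
    """Heterogeneity: count copies of each mutation separately (0, 1, or 2)."""
    return tuple(a + b for a, b in zip(chrom1, chrom2))


# Each chromosome is (carries mutation 1, carries mutation 2).
same_chromosome = ((1, 1), (0, 0))
opposite_chromosomes = ((1, 0), (0, 1))

for label, (c1, c2) in [("same", same_chromosome),
                        ("opposite", opposite_chromosomes)]:
    print(label, composite_cis_score(c1, c2), heterogeneity_scores(c1, c2))
```

The two subjects carry the same mutations and so have identical per-mutation counts, but only the subject with both mutations on one chromosome carries the composite cis allele. The cis model therefore depends on phase at the causal sites, whereas the heterogeneity model does not.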

Chapman et al. [2003] considered the first possibility and showed that, whereas power was reduced, haplotype scoring and locus scoring of the marker set were affected more or less equally, so that their conclusions from the single causal variant case were unaffected. In effect, the pair of interacting mutations together define a causal allele at a composite locus. The reduced power results from the low frequency of this composite allele and, therefore, the low R² with which it can be predicted.

Fig. 8. Tag SNPs required versus SNPs known to exist. Top: the number of tags required per kb; bottom: the number per gene/region.

The alternative model of genetic heterogeneity can be closely modelled by assuming causal variants to act additively in the linear predictor of phenotype. The causal variant statuses of the two chromosomes, Z(1) and Z(2), then become vectors of binary indicators and the generalized linear model for additive locus heterogeneity with no dominance becomes

\[
g\{E(Y)\} = \mu + \alpha^{T} Z^{(+)}.
\]

This model leads to the same class of tests as before, but power is now determined by the R² for the multiple regression of $\alpha^{T} Z^{(+)}$ on $X^{(+)}$. In the simple case where the elements of $\alpha$ are equal and the elements of Z are uncorrelated, this is simply the mean of the R² for predicting each mutation in turn. Under the heterogeneity model, we expect each causal mutation to be relatively infrequent and consequently unpredictable, so that we would expect low R² values. Here, as in the case of cis interaction, the power of studies depends on the ability of markers to predict rare alleles or combinations of alleles. The work of Chapman et al. [2003] suggests that, when D′ values across the region of interest are high, and when R² values for the causal variants are large enough for the indirect approach to have sufficient power, then there is still relatively little increase in R² achieved by scoring haplotypes of markers, and that this small increase is more than offset by the increased degrees of freedom incurred by so doing. However, their study was limited in this respect.

We should sound a further note of caution. The conclusion that haplotype information, as such, does little to increase the power of association tests may not hold in all populations. For example, when studying population isolates in which the current population is descended from a relatively small founder population in the fairly recent past, linkage disequilibrium will extend over large distances and haplotype diversity will be limited. However, the haplotype phylogeny created by the bottleneck could conceivably be such that linear predictions based on locus scoring would not perform particularly well. This, too, requires further study, either by simulation or, empirically, in population isolates.

We would also not suggest that analysis of haplotypes is useless. Having found significant association with a given region, the charting of high- and low-risk haplotypes can be invaluable in informing further fine mapping problems.

ACKNOWLEDGMENTS

We are grateful to our colleagues in the Diabetes and Inflammation Laboratory. In particular, we thank John Todd for helpful comments on an earlier draft and Bryan Barratt, John Hulme, Christopher Lowe, Rebecca Pask, Felicity Payne, Deborah Smyth, and Rebecca Twells for sharing unpublished SNP sequence data summarized in Table II.

REFERENCES

Breslow N, Day N. 1980. Statistical methods in cancer research. Volume I: The analysis of case-control studies. IARC Scientific Publications. Lyon: IARC.

Brumfield R, Beerli P, Nickerson D, Edwards S. 2003. The utility of single nucleotide polymorphisms in inferences of population history. Trends Ecol Evol 18:249–256.

Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA. 2004. Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am J Hum Genet 74:106–120.

Chapman JM, Cooper JD, Todd JA, Clayton DG. 2003. Detecting disease associations due to linkage disequilibrium using haplotype tags: a class of tests and the determinants of statistical power. Hum Hered 56:18–31.

Clayton D. 1999. A generalization of the transmission/disequilibrium test for uncertain haplotype transmission. Am J Hum Genet 65:1170–1177.

Clayton D, Jones H. 1999. Transmission/disequilibrium tests for extended marker haplotypes. Am J Hum Genet 65:1161–1169.

Dempster A, Laird N, Rubin D. 1977. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39:1–22.

Devlin B, Roeder K. 1999. Genomic control for association studies. Biometrics 55:997–1004.

Hosking J. 1983. Lagrange multiplier tests. In: Encyclopedia of statistical sciences, vol. 4. New York: John Wiley & Sons. p 456–459.

Hotelling H. 1931. The generalization of Student’s ratio. Ann Math Stat 2:360–378.

International HapMap Consortium. 2003. The International HapMap project. Nature 426:789–796.

Johnson G, Esposito L, Barratt B, Smith A, Heward J, Di Genova G, Ueda H, Cordell H, Eaves I, Dudbridge F, Twells R, Payne F, Hughes W, Nutland S, Stevens H, Carr P, Tuomilehto-Wolf E, Tuomilehto J, Gough S, Clayton D, Todd J. 2001. Haplotype tagging for the identification of common disease genes. Nat Genet 29:233–237.

Nelder J, Wedderburn R. 1972. Generalized linear models. J R Stat Soc Ser A 135:370–384.

Schaid D, Rowland C, Tines D, Jacobson R, Poland G. 2002. Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet 70:425–434.

Seltman H, Roeder K, Devlin B. 2001. Transmission/disequilibrium test meets measured haplotype analysis: family-based association analysis guided by evolution of haplotypes. Am J Hum Genet 68:1250–1263.

Stram D, Haiman C, Altshuler D, Kolonel L, Henderson B, Pike M. 2003. Choosing haplotype tagging SNPs based on unphased genotype data using a preliminary sample of unrelated subjects with an example from the multiethnic cohort study. Hum Hered 55:27–36.

Templeton A. 1995. A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping or DNA sequencing. V. Analysis of case/control sampling designs: Alzheimer’s disease and the apoprotein E locus. Genetics 140:403–409.

Templeton A, Sing C. 1993. A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping. IV. Nested analyses with cladogram uncertainty and recombination. Genetics 134:659–669.

Templeton A, Boerwinkle E, Sing C. 1987. A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping. I. Basic theory and an analysis of estimation. Genetics 119:343–351.

Templeton A, Sing C, Kessling A, Humphries S. 1988. A cladistic analysis of phenotype associations with haplotypes inferred from restriction endonuclease mapping. II. The analysis of natural populations. Genetics 120:1145–1154.

Templeton A, Crandall K, Sing C. 1992. A cladistic analysis of phenotypic associations with haplotypes inferred from restriction endonuclease mapping and DNA sequencing data. III. Cladogram estimation. Genetics 132:619–633.

Tzeng JY, Devlin B, Roeder K, Wasserman L. 2003. On the identification of disease mutations by the analysis of haplotype matching and goodness-of-fit. Am J Hum Genet 72:891–902.

Tzeng JY, Byerley W, Devlin B, Roeder K, Wasserman L. 2004. Outlier detection and false discovery rates for whole-genome DNA matching. J Am Stat Assoc 98:236–247.

van der Meulen M, te Meerman G. 1997. Association and haplotype sharing due to identity by descent, with an application to genetic mapping. In: Pawlowitzki I, Edwards J, Thompson E, editors. Genetic mapping of disease genes. London: Academic Press; p 115–136.

Wang W, Todd J. 2003. The usefulness of different density SNP maps for disease association studies of common variants. Hum Mol Genet 12:3145–3149.

Wishart J. 1931. The mean and second moment coefficient of the multiple correlation coefficient in samples from a normal population. Biometrika 24:353–376.
