Unbiased Interpretation of Haplotypes at Duplicated Microsatellites

doi: 10.1111/j.1469-1809.2006.00311.x

P. Balaresque (1,2,∗), A. Sibert (1), E. Heyer (1) and B. Crouau-Roy (2)

(1) Eco-anthropologie et Ethnobiologie, UMR5145 Département Hommes Natures Sociétés, Musée de l’Homme - 17, Place du Trocadéro - 75116 Paris, France

(2) Laboratoire Evolution & Diversité Biologique, UMR 5174, Bâtiment 4R3, Université Paul Sabatier Toulouse III - 118, route de Narbonne - 31062 Toulouse cedex 4, France

Summary

The Y-chromosome is a powerful tool for population geneticists to study human evolutionary history. Haploid and largely non-recombining, it should contain a simple record of past mutational events. However, this apparent simplicity is compromised by Y-linked duplicons, which make up ∼35% of this chromosome; 25% of these duplicons are large inverted repeats (palindromes). For microsatellites lying in these palindromes, two loci cannot be easily distinguished due to PCR co-amplification, and this order misspecification of alleles generates an additional variance component. Due to this ambiguity, population geneticists have traditionally used an arbitrary method to assign the alleles (shorter allele to locus 1, larger allele to locus 2). Here, we simulate the posterior distributions of population estimates under three novel allele assignment priors and compare them with the original method. We use a sample of 33 human populations, typed for duplicated microsatellites lying within palindrome P8, to illustrate our approach. We show that both intra- and inter-population statistics can be dramatically affected by order misspecification. Surprisingly, matrices of pairwise F-statistics or distance estimates appear far less sensitive to order misspecification and remain relatively unchanged under the priors considered, suggesting that these microsatellites can be considered as useful markers for population genetic studies using an appropriate data treatment. Duplicated microsatellites represent an attractive source of information to investigate the extensive structural polymorphism observed among human Y chromosomes, as well as processes of intra-chromosomal gene conversion acting between duplicons.

Keywords: duplicated microsatellites, Y-chromosome, haplotypes order, assignment methods, palindromes, gene conversion, Bayesian approach.

Introduction

The Y-chromosome represents an attractive candidate to study human evolutionary history. Due to the absence of recombination, it contains a relatively simple record of past mutational events. However, this apparent simplicity is compromised by the existence of Y-linked duplicons (Hurles & Jobling 2003; Skaletsky et al. 2003). In the specific case of microsatellites, while the information carried by single copy microsatellites is unambiguous, variation at multi-copy microsatellites is much more complex to analyze and interpret (Butler et al. 2005). Nonetheless, multi-copy microsatellites represent an intrinsic part of the Y-chromosome, and are subject not only to the same evolutionary mechanisms operating at single copy microsatellites, but also to mechanisms specific to duplicons.

∗Corresponding author: Patricia Balaresque, Department of Genetics, University of Leicester, Adrian Building, University Road, LEICESTER, LE1 7RH, United Kingdom. Tel/Fax: +44 (0) 116 252 3377/78. E-mail: [email protected]

Most of these duplicated microsatellites (such as YCAII, YCAIII, DYS385I/II and DYS464) are located in the eight massive palindromes which represent 25% of the MSY euchromatin, and show an arm-to-arm identity ranging from 99.94% (P3) to 99.997% (P8) (Skaletsky et al. 2003). From a practical point of view this very high degree of similarity means that it is impossible to design PCR primers to amplify the different microsatellite copies independently, and they are therefore indistinguishable. The main consequence of such a co-amplification is that the corresponding multi-copy haplotypes cannot be determined properly: only “unordered haplotypes” are available from routine molecular analyses. The use of such multi-copy microsatellites for population genetic studies requires particular caution, because we do not know how co-amplified copies should be assigned to distinct loci. This assignment problem has previously been addressed by Mathias et al. (1994). They isolated a set of highly polymorphic duplicated microsatellites and described the most parsimonious method to assign the amplified alleles to distinct loci: when alleles differed in size, the longest was labelled allele A and the shortest allele B. This method is unbiased if the length ranges of both microsatellites do not overlap. However, in many studies haplotypes comprising two alleles of the same length (such as {18-18} or {25-25} repeats for example) are very common (Scozzari et al. 1997, 1999; Quintana-Murci et al. 1999; Malaspina et al. 2000; Cruciani et al. 2002). Their high frequencies suggest that in many cases both loci have similar allele ranges with partial or complete overlap.

© 2006 The Authors. Journal compilation © 2006 University College London. Annals of Human Genetics (2006) 71, 209–219

Due to their high level of polymorphism, duplicated microsatellites have been widely used in human population genetics to identify males and male lineages in forensic practices (Kayser et al. 1997, and see the YSTR database), and also to infer population diversity parameters and to estimate coalescence times of haplogroups defined by binary markers (Scozzari et al. 1999; Cruciani et al. 2002; Kayser et al. 2003). However, the population parameters published so far have been estimated only using one allele assignment method, which appears both risky and restrictive. The aim of the present paper is to evaluate the effect of misspecifying haplotype order on standard diversity estimates, and to determine how much information can still be recovered from unordered haplotypes. For this purpose we focus on the P8 palindrome, which exhibits the highest arm-to-arm identity (Skaletsky et al. 2003), and for which the misspecification in allele assignment is therefore maximal. We introduce three random priors (BP: binomial prior; UP: uniform prior; ZOP: zero-one prior) that assign co-amplified alleles to each locus with different probability models. We apply these priors and the arbitrary prior (AP) of Mathias et al. (1994) to a dataset of 33 population samples for YCAIII a/b (DYS413 a/b), duplicated dinucleotide microsatellites. We report the relative effect of random assignment priors on usual parameters at the intra-population (gene diversity) and inter-population (genetic distances FST, RST, (δμ)² and ASD) levels and discuss their implications.

Materials and Methods

Haplotype Information: Problematic

Haplotypes are usually defined as a combination of alleles carried by the same chromosome. In the case of co-amplified alleles, the genetic information can be represented by the ordered pair (i,j), where i and j are the states of allele 1 and allele 2 respectively. For a sample of individuals one can summarise the information contained in their haplotypes in a matrix, whose elements n_ij are the number of individuals with ordered haplotype (i,j). Since the numbers of (i,j) and (j,i) ordered pairs do not need to be equal, this matrix is not necessarily symmetrical. In the case of ordered haplotypic data we call (i,j) and (j,i) reverse haplotypes. For duplicated markers it is not possible to assign one allele to one specific locus. In consequence the resulting information is well represented by an unordered pair {i,j}, where i is the state of one of the alleles, and j the state of the other one. The haplotypic information contained in a sample can be summarised by a matrix whose elements m_ij are the number of individuals with unordered haplotype {i,j}. Since m_ij = m_ji the matrix is symmetrical. Practically, it is possible to draw the matrix of unordered haplotypes of a sample from the matrix of its ordered haplotypes through the following mapping:

$$
\begin{cases}
m_{ij} = m_{ji} = n_{ij} + n_{ji} & \forall\, i \neq j \\
m_{ii} = n_{ii} & \forall\, i
\end{cases}
\tag{1}
$$

Unfortunately the converse is not true: it is not possible to deduce the n_ij’s from the m_ij’s only. In other words, several sets of ordered haplotypes can yield the same set of unordered haplotypes under the above mapping (1). The haplotype frequencies remain unchanged only for the haplotypes (i,i), referred to as homo-allelic hereafter. The above considerations are equivalent to stating that there is less information in unordered than in ordered haplotypic data. Practically, a set of ordered haplotypes can be artificially reconstructed from unordered data, satisfying equations (1). We call this reconstruction scheme the assignment prior, since for each unordered haplotype one allele is assigned to locus 1 and the other to locus 2.
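As an illustration of mapping (1) and of why it cannot be inverted, the following minimal Python sketch collapses ordered haplotype counts into unordered ones. The function name `unordered_counts`, the dictionary representation and the allele sizes are assumptions made for this example; they are not taken from the paper.

```python
from collections import Counter

def unordered_counts(ordered):
    """Collapse ordered counts n[(i, j)] into unordered counts m[{i, j}].

    Unordered haplotypes are keyed by the sorted pair (short, long), so
    (i, j) and (j, i) share one key and their counts are summed, as in
    mapping (1). Homo-allelic haplotypes (i, i) keep their count.
    """
    m = Counter()
    for (i, j), n_ij in ordered.items():
        m[tuple(sorted((i, j)))] += n_ij
    return m

# Two different ordered samples (hypothetical allele sizes, in repeats)
sample_a = {(18, 21): 7, (21, 18): 3, (18, 18): 5}
sample_b = {(18, 21): 2, (21, 18): 8, (18, 18): 5}

# Both collapse to the same unordered sample: the n_ij cannot be recovered.
print(unordered_counts(sample_a) == unordered_counts(sample_b))
```

The two ordered samples differ only in how the ten {18, 21} haplotypes are split between the reverse haplotypes, which is exactly the information lost by co-amplification.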

Allele Assignment Prior

In order to compare the impact of different allele assignment priors, i.e. several ways of simulating ordered haplotypes when only unordered data are available by assigning alleles to loci, we describe four allele assignment priors, that of Mathias et al. (1994) and three new ones.

Arbitrary prior (AP): Mathias’s prior

Mathias et al. (1994) assigned the co-amplified alleles to each locus according to their sizes, the longest allele being assigned to locus 1 (allele A), the shortest to locus 2 (allele B). In the following section all parameters estimated with this arbitrary prior carry the index “AP”.

Binomial prior (BP)

Here we assume that the co-amplified alleles are randomly assigned to the two distinct loci independently for each haplotype. The rationale of this prior is the observation of allele range overlap at both loci (see Appendix I). For each haplotype in the sample the longer allele (A) is assigned to locus 1 with probability p and to locus 2 with probability 1 − p. Haplotype assignments are assumed to be independent from each other, and the prior is called binomial for this reason. When p = 1/2, the allele size is not taken into account for the simulation of ordered haplotypes, unlike the arbitrary prior which corresponds to p = 1. All the results obtained with the binomial prior are indexed with “BP”. Since the problem is symmetric with regard to both haplotypes, we can restrict the range of p to [0, 0.5].
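The binomial prior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `assign_binomial`, the `(short, long)` keying of unordered haplotypes and the example counts are all hypothetical.

```python
import random

def assign_binomial(unordered, p=0.5, rng=random):
    """Simulate ordered haplotype counts under the binomial prior (BP).

    unordered: dict mapping (short, long) allele pairs to counts m_ij.
    p: probability that the longer allele is assigned to locus 1.
    Returns a dict of counts keyed by (locus 1 allele, locus 2 allele).
    """
    ordered = {}
    for (short, long_), m in unordered.items():
        if short == long_:
            # Homo-allelic haplotypes are unaffected by the assignment.
            ordered[(short, long_)] = m
            continue
        # Each of the m haplotypes is assigned independently of the others.
        k = sum(rng.random() < p for _ in range(m))
        ordered[(long_, short)] = k        # longer allele on locus 1
        ordered[(short, long_)] = m - k    # longer allele on locus 2
    return ordered

sample = {(18, 18): 5, (18, 21): 10}       # hypothetical unordered counts
print(assign_binomial(sample, p=0.5, rng=random.Random(1)))
```

With p = 1 every longer allele goes to locus 1, which reproduces the arbitrary prior; with p = 1/2 allele size plays no role.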

Uniform prior (UP)

In fact, there is no obvious reason why assignments should be independent for all the individuals (i.e. haplotypes) in the sample. Indeed, due to genealogical relationships, the states of all haplotypes in the sample are correlated. Despite the absence of further information concerning the magnitude of this correlation, which is expected to vary among populations, we can consider assignment priors with higher variance than the binomial prior. For example, one can think of a prior for which all pairs (n_ij, n_ji) satisfying n_ij + n_ji = m_ij have an equal probability. We call this the uniform prior and index it with “UP”.
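A minimal sketch of the uniform prior follows, assuming the same hypothetical `(short, long)` keying of unordered counts as a plain dictionary; the function name `assign_uniform` is likewise an assumption. For one unordered haplotype with count m, drawing n_ij uniformly on {0, ..., m} gives a variance of m(m + 2)/12, compared with m/4 under the binomial prior with p = 1/2, so the uniform prior has the larger variance as soon as m > 1.

```python
import random

def assign_uniform(unordered, rng=random):
    """Simulate ordered haplotype counts under the uniform prior (UP).

    For each hetero-allelic unordered haplotype with count m, every split
    (k, m - k) between the two reverse haplotypes is equally likely.
    """
    ordered = {}
    for (a, b), m in unordered.items():
        if a == b:
            ordered[(a, b)] = m          # homo-allelic: nothing to assign
            continue
        k = rng.randint(0, m)            # uniform on 0, 1, ..., m (inclusive)
        ordered[(a, b)] = k
        ordered[(b, a)] = m - k
    return ordered

rng = random.Random(2)
print(assign_uniform({(18, 18): 5, (18, 21): 10}, rng))
```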

Zero-One prior (ZOP)

Finally, we consider the Zero-One prior (indexed ZOP), which assigns all unordered haplotypes sampled in a population to the same ordered haplotype, in other words (n_ij, n_ji) = (m_ij, 0) with probability p and (0, m_ij) with probability 1 − p, independently for each population sample. The rationale for this is that the prior of Mathias et al. (1994) yields unjustified correlation not only between individuals within populations but also between populations themselves. We call this prior zero-one since the frequency of ordered haplotypes in a given population sample is either 0 or 1. At the intra-population level, this prior is strictly equivalent to the scheme of Mathias et al. (1994). Therefore both will yield identical intra-population diversity parameter estimates. The difference between both methods lies at the inter-population level only. It must be stressed that the zero-one distribution, for which the variable equals 1 with probability p and 0 with probability 1 − p, maximises the variance on the set of all distributions with compact support [0, 1] and mean p. Therefore this prior maximises the variance in n_ij’s between population samples and is expected to minimise the correlation between samples.
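The variance claim can be checked in one line: for a variable X with values in [0, 1] and mean p, X² ≤ X, so Var(X) = E[X²] − p² ≤ p − p² = p(1 − p), with equality for the zero-one distribution. The sketch below illustrates the prior itself. It assumes one independent draw per hetero-allelic unordered haplotype within a population sample, which is one possible reading of the description above; `assign_zero_one` and the example counts are hypothetical.

```python
import random

def assign_zero_one(unordered, p=0.5, rng=random):
    """Simulate ordered haplotype counts under the zero-one prior (ZOP).

    All m copies of a hetero-allelic unordered haplotype {a, b} in one
    population sample receive the same orientation: (a, b) with
    probability p, (b, a) with probability 1 - p.
    Assumption: one independent draw per unordered haplotype.
    """
    ordered = {}
    for (a, b), m in unordered.items():
        if a == b or rng.random() < p:
            ordered[(a, b)] = m
        else:
            ordered[(b, a)] = m
    return ordered

# Two population samples are drawn independently, so the same unordered
# haplotype may end up with opposite orientations in the two samples.
rng = random.Random(4)
pop1 = assign_zero_one({(11, 14): 12, (11, 11): 3}, rng=rng)
pop2 = assign_zero_one({(11, 14): 9, (14, 14): 6}, rng=rng)
print(pop1, pop2)
```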

In all cases the a posteriori distribution was obtained by Monte Carlo integration. The rationale for doing this Monte Carlo integration under BP and UP priors is that it allows us to explore the whole range of ordered haplotypes corresponding to the same unordered haplotype data. Therefore, under these two prior assignments, the exact ordered haplotype state of the sample (which is actually unknown) has non-zero probability. In other words, for any sample statistics such as gene diversity, the actual value lies in the range of the a posteriori distribution under BP and UP. Contrary to the method of Mathias et al. (1994) (AP), which gives a point estimate without information about its closeness to the actual value, our results allow us to determine a range for the exact sample value by simulation.

Measures of Intra-population Diversity and Inter-population Differentiation

Among a variety of measures of genetic diversity at the intra-population level we use $h = \frac{n}{n-1}\left(1 - \sum_i f_i^2\right)$ in the following, where n is the sample size and f_i represents the observed frequency of the ith ordered haplotype (Nei, 1987). At the inter-population level we compute four of the most popular genetic distance estimators: FST, RST, ASD and (δμ)² (Wright, 1951; Slatkin, 1995; Goldstein et al. 1995). For each genetic distance and assignment prior, 1000 replicates of the sample are simulated and Mantel statistics used to compare them (Mantel, 1967).

Figure 1 Percentage of hetero-allelic haplotypes (represented in white) in all populations sampled.
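The gene diversity estimator h can be computed directly from ordered haplotype counts. The sketch below is illustrative only; the function name `gene_diversity` and the example sample are assumptions. It shows the extreme case of a sample made of a single hetero-allelic unordered haplotype, where the assignment alone moves h from 0 to about 0.5.

```python
def gene_diversity(counts):
    """Gene diversity h = n / (n - 1) * (1 - sum of squared frequencies).

    counts: dict mapping ordered haplotypes to their counts in the sample.
    """
    n = sum(counts.values())
    if n < 2:
        raise ValueError("gene diversity needs at least two chromosomes")
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq)

# Ten chromosomes carrying the unordered haplotype {18, 21} (hypothetical):
# all assigned to one ordered haplotype, as under the arbitrary prior
print(gene_diversity({(21, 18): 10}))
# split evenly between the two reverse haplotypes
print(gene_diversity({(21, 18): 5, (18, 21): 5}))
```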

Palindrome P8, the Microsatellite YCAIII and Population Sampling

We focus on palindrome P8, characterised by an arm-to-arm homology of 99.997%. The two copies of the YCAIII microsatellite have previously been located in silico on each arm (Balaresque et al. 2006). As Y-linked microsatellites can be subject to duplication/deletion events (see mutations/duplications section in http://www.ystr.org/), we checked by a semi-quantitative method (using peak height) whether the YCAIII microsatellites had been subject to copy number polymorphism. As we did not find any evidence for such polymorphism, we assumed that the two loci had been systematically amplified. We analyzed a dataset of 33 populations from Africa, Europe and Asia representing 1,413 Y-chromosomes (see Appendix I); these data were extracted from Scozzari et al. (1997) and Balaresque et al. (2006).

Results

Distribution of Hetero-allelic Haplotypes among Populations

The assignment hypotheses are relevant only to the hetero-allelic haplotypes (i, j) in each population. The percentage of such haplotypes is represented in Figure 1, and varies drastically among the samples. However, East and North African populations tend to show a higher percentage of hetero-allelic haplotypes than European or West Asian populations. In East Africa populations show a similar percentage of homo- and hetero-allelic haplotypes. In consequence, we expected these different populations to be differently affected by allele assignment methods.

[Figure 2: one histogram panel per population sample; x-axis: gene diversity scale; legend: BP, UP, UP & BP superposed, AP & ZOP. Panels, listed alphabetically with sample size in brackets: Akan (57), Amhara (48), Bamileke (51), Basque (55), Central Italian (84), Corsican (63), Daba (18), Danish (35), East Arabian (32), Egyptian (48), English (20), Ewondo (31), Fali (39), Fulbe Burkina (20), Fulbe Cameroon (17), Greek (51), Khwe (26), !Kung (63), Moroccan (38), Mossi (56), Mozabite (80), North Italian (20), Orcadian (27), Oromo (38), Pakistani (20), Rimaibe (42), Sardinian (97), South Italian (81), Spanish (36), Tali (15), Turkish (20), Uldeme (23), Yakuba (62).]

Figure 2 Distribution of sample gene diversity estimates under various assignment priors (AP and ZOP: vertical line with square on the top; UP: black bars; BP: white bars; UP and BP overlapping: grey bars). Populations are ordered from lowest to highest ΔH_BP (cf. text for explanations) and the sample size is given in brackets.

Intra-population Genetic Diversity Estimate

We estimated gene diversity h under all four assignment priors (AP, BP, UP and ZOP) applied to our set of 33 populations. Under AP and ZOP priors for the hetero-allelic haplotypes, only one of the two possible reverse haplotypes was considered. So, for each population the same point estimate was obtained for the two priors (represented in Figure 2 by a vertical line with a square on the top).

The main observation is that the AP-ZOP point estimates are uniformly lower than the estimates obtained with BP and UP priors. This is strongly related to the fact that random assignments increase the total number of haplotypes, by creating reverse haplotypes when hetero-allelic haplotypes exist. Moreover the mean estimate of gene diversity is almost always larger under BP than under UP as expected: gene diversity is highest when both reverse haplotypes (i, j) and (j, i) are found at equal frequency in the sample, and this occurs more frequently under BP than UP. Finally, it also appears that estimates are spread over a broader range under UP than BP, and this point can be explained by the fact that the probability distribution of the frequency of ordered haplotypes has a greater variance under UP.

To measure the extent to which the AP-ZOP assignment prior tends to underestimate h (corresponding to the results found in the literature for duplicated microsatellites to date), we defined the quantity $\Delta H_{BP} = \bar{H}_{BP} - H_{AP}$ (resp. $\Delta H_{UP} = \bar{H}_{UP} - H_{AP}$), where bars denote the average over simulated values. The parameter ΔH is not ‘a bias’ strictly speaking, but more the difference of two biases. It provides a measure of the expected difference in gene diversity estimates under two distinct assignment priors. We found that in both cases ΔH is highly variable among population samples. Figure 3 presents population samples ordered according to increasing ΔH_BP (from Southern Spaniards to Ewondo). It appears that in 13 samples out of 33 ΔH_BP lies above 0.1 (from the Tali sample onwards), so that the estimate obtained with the prior of Mathias et al. (1994) is more than 10% lower than the mean estimate under the binomial scheme. By contrast, in twelve samples the difference is less than 0.01. Moreover, we noticed that ΔH depends strongly on the level of gene diversity in the original sample: intuitively, when a hetero-allelic haplotype is very frequent in a population sample, the randomization prior raises the mean gene diversity estimate from almost 0 (arbitrary prior) to around 0.5 (random priors). We studied to what extent the sample composition in hetero- and homo-allelic haplotypes is correlated with ΔH (Figure 3). The least-squares regression of log(ΔH) on the frequency of hetero-allelic haplotypes shows a significant correlation (R² = 0.55, p < 0.001; Figure 3).

[Figure 3: scatter plot, one labelled point per population sample; x-axis: frequency of hetero-allelic haplotypes (0.2 to 1.0); y-axis: log10(ΔH_BP) (−2.5 to −0.5).]

Figure 3 Logarithm of mean difference ΔH_BP versus observed frequency of hetero-allelic haplotypes. The vertical line corresponds to 70% hetero-allelic haplotypes; the horizontal line to log (0.05); the oblique line to the least-squares regression (cf. text).
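The quantity ΔH_BP defined above can be approximated by simulation. The sketch below is a hedged illustration rather than the authors' code: it assumes p = 1/2 for the binomial prior, unordered counts keyed by `(short, long)` pairs, and hypothetical names (`delta_h_bp`, `gene_diversity`) and example samples.

```python
import random

def gene_diversity(counts):
    """h = n / (n - 1) * (1 - sum of squared haplotype frequencies)."""
    n = sum(counts.values())
    return n / (n - 1) * (1 - sum((c / n) ** 2 for c in counts.values()))

def delta_h_bp(unordered, replicates=1000, seed=0):
    """Monte Carlo estimate of mean(H_BP) - H_AP for one population sample."""
    rng = random.Random(seed)
    # Arbitrary prior: each unordered haplotype maps to one ordered haplotype.
    h_ap = gene_diversity({(b, a): m for (a, b), m in unordered.items()})
    total = 0.0
    for _ in range(replicates):
        ordered = {}
        for (a, b), m in unordered.items():
            if a == b:
                ordered[(a, b)] = m
            else:
                # Binomial prior with p = 1/2: independent coin flips.
                k = sum(rng.random() < 0.5 for _ in range(m))
                ordered[(b, a)] = k
                ordered[(a, b)] = m - k
        total += gene_diversity(ordered)
    return total / replicates - h_ap

# Hypothetical samples of 20 chromosomes with few or many hetero-allelic ones
few_hetero = {(18, 18): 18, (18, 21): 2}
many_hetero = {(18, 18): 2, (18, 21): 18}
print(delta_h_bp(few_hetero), delta_h_bp(many_hetero))
```

Both samples have the same gene diversity under the arbitrary prior, yet the sample rich in hetero-allelic haplotypes shows the larger ΔH_BP, in line with the trend reported in Figure 3.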

Figure 3 allows us to discriminate between two types of population samples according to the mean difference ΔH: if we fix a 5% arbitrary threshold under which ΔH is considered negligible, this threshold corresponds to a proportion of hetero-allelic haplotypes of around 70% in the sample. While samples carrying less than this proportion show negligible bias in gene diversity estimates (‘Low-Bias samples’), the bias is significant for samples carrying greater than 70% (‘High-Bias samples’). Notably, the 14 population samples in the latter category all belong to the African continent, although this does not apply to all the African samples (see Figure 1).

Inter-population Differentiation

In order to exemplify the range of the a posteriori distributions of F-statistics and distances under the various priors considered here, we chose five of the 528 possible pairs of population samples, each pair illustrating a different combination of low-bias (LB) and high-bias (HB) samples: Basque (LB) vs. Southern Spaniards (LB), Central Italians (LB) vs. Pakistani (LB), English (LB) vs. Ewondo (HB), Eastern Arabs (HB) vs. Bamileke (HB), and Fali (HB) vs. Mossi (HB).

[Figure 4: grid of histograms, one row per statistic (FST, RST, (δμ)², ASD) and one column per pair of population samples (Basque / S. Spanish, C. Italian / Pakistani, English / Ewondo, E. Arab / Bamileke, Fali / Mossi).]

Figure 4 Posterior distribution of F-statistics and distances for 5 pairs of population samples under various priors (AP: vertical lines; BP: open bars; UP: gray bars; ZOP: black bars). The range of the distribution is given on the abscissa axis.

Figure 4 shows that the five pairs present a surprising variety of situations that can arise. Firstly, the overall range, defined as the minimum and maximum of the values simulated under all four priors, varies dramatically, although it is consistent for the three parameters FST, RST and (δμ)². Large overall ranges can contain spectra of values either significantly different from 0 (English vs. Ewondo sample pair) or not (Eastern Arab vs. Bamileke, or Fali vs. Mossi). For FST, RST and (δμ)² the general patterns are that the overall lower boundary is always attained under BP (and corresponding values cluster near the boundary) and that the upper boundary is always and exclusively attained under ZOP. Apart from the first pair (Basque vs. S. Spanish), average estimates are lower under BP, intermediate under UP and larger under ZOP, a prior that appears to maximise the distances between samples. This can be easily understood since under ZOP, for hetero-allelic haplotypes one ordered haplotype (e.g. (11, 14)) may be found in the first population, and the reverse one ((14, 11)) in the second population. The overall ranges (within brackets in Figure 4) can be ordered consistently for all three parameters across sample pairs, and the parameter distributions for the three random priors are very similar between FST and RST on the one hand, and RST and (δμ)² on the other hand. The position of the point estimate obtained under the arbitrary prior (AP) always lies within the overall range, except for the FST in English vs. Ewondo, for which it attains the upper bound. Like gene diversity it does not always lie within the range of the posterior distribution under BP, but always within the common range of UP and ZOP posteriors. This is consistent with the fact that haplotypes predicted by the arbitrary prior are more likely under UP and ZOP than under BP (where frequencies of reverse haplotypes are more probable), unless few hetero-allelic haplotypes are present. Unlike gene diversity results there does not seem to be a general trend for the point-estimate position relative to means obtained under UP and ZOP.

By contrast, as far as ASD is concerned, AP’s point estimate is always close to the lower bound, and BP estimates cluster in the middle of the overall range. As in the case of the other parameters, ZOP estimates almost cover the overall range and always exclusively attain the upper boundary.

Correlation between F-statistics and/or Distances

We tested whether F-statistics and genetic distance (or dissimilarity) estimates (FST, RST, (δμ)², ASD) are significantly modified by the choice of assignment prior. For each combination of genetic distance and assignment prior, 1000 replicates of the sample were simulated and pairwise genetic distances were estimated. Mantel correlation statistics for the comparison of distance matrices were evaluated for pairs of simulated data with (i) identical assignment prior and (ii) identical genetic distance, in order to assess their effect on the characterisation of genetic structure.

Table 1 Mantel statistics mean and 95% confidence interval (within brackets) for pairs of F-statistics and/or genetic distances. Columns correspond to pairs compared; rows correspond to assignment priors (∗∗∗∗: p < 10⁻⁴; ∗∗∗: p < 10⁻³; ∗∗: p < 10⁻²; ∗: p < 5%).

Prior | FST vs. RST | FST vs. (δμ)² | FST vs. ASD | RST vs. (δμ)² | RST vs. ASD | (δμ)² vs. ASD
AP | 64∗∗∗∗ | 43∗∗∗ | −15 | 83∗∗∗∗ | −4 | 33∗∗
BP | 58∗∗∗∗ [57;59] | 46∗∗∗ [45;47] | 1 [−2;3] | 90∗∗∗∗ [90;91] | 8 [6;9] | 33∗∗ [33;34]
UP | 60∗∗∗∗ [56;63] | 44∗∗∗ [40;48] | −13 [−19;−6] | 88∗∗∗∗ [86;89] | −3 [−10;2] | 30∗ [27;32]
ZOP | 69∗∗∗∗ [66;72] | 46∗∗∗ [42;50] | −30 [−33;−26] | 84∗∗∗∗ [83;85] | −16 [−19;−13] | 23 [20;25]
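For readers unfamiliar with the Mantel statistic used here, the following sketch shows one common form of the test: the Pearson correlation between the off-diagonal entries of two distance matrices, with significance assessed by jointly permuting the rows and columns of one matrix. It is a generic illustration, not the procedure or software used in this study; the function name `mantel`, the number of permutations and the toy matrices are assumptions.

```python
import random
from itertools import combinations

def mantel(d1, d2, permutations=999, seed=0):
    """Mantel correlation between two square distance matrices.

    Returns (r, p): the Pearson correlation over the upper triangles and a
    one-sided permutation p-value for r being at least as large as observed.
    """
    n = len(d1)
    pairs = list(combinations(range(n), 2))
    x = [d1[i][j] for i, j in pairs]
    mx = sum(x) / len(x)
    sxx = sum((a - mx) ** 2 for a in x)

    def corr(order):
        # Relabel the populations of d2 according to `order`.
        y = [d2[order[i]][order[j]] for i, j in pairs]
        my = sum(y) / len(y)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        syy = sum((b - my) ** 2 for b in y)
        return sxy / (sxx * syy) ** 0.5

    identity = list(range(n))
    r_obs = corr(identity)
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        perm = identity[:]
        rng.shuffle(perm)
        if corr(perm) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)

# Toy example: distances between five points on a line, and a rescaled copy
pos = [0, 1, 3, 6, 10]
d_a = [[abs(p - q) for q in pos] for p in pos]
d_b = [[2 * abs(p - q) for q in pos] for p in pos]
print(mantel(d_a, d_b))
```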

Table 1 presents the empirical mean and 95% confidence intervals of Mantel correlation statistics for pairs of F-statistics and/or distances (FST, RST, (δμ)², ASD) estimated under the same assignment prior. Whatever the prior, the results in the first column show that FST and RST are strongly correlated, with Mantel statistics uniformly above 0.5 (p < 10⁻⁴). Similar conclusions are drawn from the comparison of FST and (δμ)² (Mantel statistics uniformly above 0.4) and of RST and (δμ)² (Mantel statistics uniformly above 0.8). In both cases the correlation is highly significant with p < 10⁻⁴. On average, (δμ)² and ASD are less positively correlated, though the correlation is still significant under AP, BP and UP (p < 0.01) and even under ZOP (p < 0.05); this is consistent with the fact that both measures are related by definition (Goldstein et al. 1995). Finally, it is impossible to reject the null hypothesis of independence for the pairs ASD vs. FST and ASD vs. RST at the 5% level, since the distribution of the statistic includes 0 or is even entirely negative (p > 0.2) (Table 1). The latter result is not surprising, since there is no a priori reason for F-statistics and distance measures to give consistent results.
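The relation between ASD and (δμ)² mentioned above can be checked numerically. The sketch below assumes the usual single-locus definitions: ASD is the mean squared difference in repeat number over all between-population pairs, and (δμ)² is the squared difference between the population mean repeat numbers. With these definitions, ASD equals (δμ)² plus the two within-population variances, which is one way to see why ASD can behave differently from the other measures. The allele sizes are hypothetical toy values.

```python
import numpy as np

def asd(x, y):
    """Average squared difference in repeat number over all pairs (one allele
    from population x, one from population y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean((x[:, None] - y[None, :]) ** 2))

def delta_mu_sq(x, y):
    """Squared difference between the mean repeat numbers of two populations."""
    return float((np.mean(x) - np.mean(y)) ** 2)

# Hypothetical repeat numbers at a single locus in two populations.
pop_a = [12, 13, 13, 14, 15]
pop_b = [15, 16, 16, 17]

lhs = asd(pop_a, pop_b)
# np.var uses the 1/n (population) variance, which makes the identity exact.
rhs = delta_mu_sq(pop_a, pop_b) + np.var(pop_a) + np.var(pop_b)
print(lhs, rhs)
```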

Whatever the pair of F-statistics and/or distances considered, Mantel statistics obtained under all four assignment priors are strongly consistent (Table 1). In particular, the values obtained under the scheme of Mathias et al. (1994) always lie within the range defined by the three other priors, and almost always within the range found under UP. Contrary to gene diversity estimates, the AP value does not systematically correspond to the lower boundary of simulated values, and it is not possible to conclude that AP results are biased in any direction. Moreover, it is noteworthy that the variances of the Mantel statistics increase from BP to ZOP and UP, in the same way as the variances of the corresponding assignment procedures (the variance of BP is proportional to the sample size, while it is proportional to the squared sample size for ZOP and UP). This result indicates that the general properties of distance estimators are independent of the assignment method used.

Table 2 Mantel statistics mean and 95% confidence interval (within brackets) for pairs of assignment priors. Columns correspond to pairs compared; rows correspond to F-statistics or distances (in all cases p < 10⁻⁴).

| Statistic | AP vs. BP | AP vs. UP | AP vs. ZOP | BP vs. UP | BP vs. ZOP | UP vs. ZOP |
|---|---|---|---|---|---|---|
| FST | 88 [87;89] | 87 [83;90] | 78 [75;82] | 88 [82;94] | 68 [64;71] | 74 [69;79] |
| RST | 94 [93;95] | 91 [87;94] | 80 [76;84] | 93 [89;97] | 75 [68;80] | 78 [72;86] |
| (δμ)² | 98 [97;99] | 97 [95;98] | 90 [87;93] | 98 [96;99] | 88 [82;93] | 88 [82;94] |
| ASD | 99 [99;99.1] | 98.9 [98.5;99.1] | 98.2 [97.7;98.6] | 99.7 [99.6;99.8] | 99 [98.2;99.4] | 98.7 [98;99.2] |

Correlation between Assignment Methods

Table 2 gives the empirical mean and 95% confidence interval of Mantel statistics for pairs of distance matrices generated under different assignment methods. All observed values yield significant Mantel tests (p < 10⁻⁴): whatever the F-statistic or distance considered, results are strongly correlated when pairs of assignment priors are compared. However, it is interesting to note that distances may be more conservative than F-statistics. The Average Squared Distance (ASD), for example, remains nearly unchanged under all four assignment priors considered, since the Mantel correlation is very high (above 0.97) for all simulated pairs of matrices under two different methods (Table 2).

(δμ)² estimates are also highly correlated for pairs of AP, BP and UP priors (Mantel statistics > 0.95) and less correlated when ZOP is compared with one of the previous priors. This pattern is more pronounced for both F-statistics. As for gene diversity estimates, this is likely to be due to the higher variance underlying the zero-one distribution, as mentioned in the description of the prior. Since all Mantel tests were highly significant, these differences are likely to have an effect on the description of fine-scale structures, but not on medium- or large-scale ones.

Discussion

In this paper, we ask how different assignment methods affect population diversity and differentiation estimates based on duplicated microsatellites. We assume that, for a given unordered hetero-allelic haplotype, both of the reverse-ordered haplotypes are likely to appear in the sample. We consider an a priori distribution for the ordered haplotypic data D, given the unordered data, so that we can express the a posteriori distribution for any statistic θ as

$$P(\theta \mid \text{unordered } D) = \sum_{\text{ordered } D} P(\theta \mid \text{ordered } D)\, P(\text{ordered } D \mid \text{unordered } D).$$
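Because a statistic is fully determined once the ordered data are fixed, the sum above can be approximated by Monte Carlo: draw ordered data sets from the assignment prior, compute the statistic on each, and summarise the resulting values. The sketch below illustrates this for Nei's gene diversity. It assumes a binomial-type prior in which each hetero-allelic haplotype is reversed independently with probability 1/2; this reading of the prior, the function names and the toy sample are illustrative assumptions rather than the implementation used for the results reported here.

```python
import random
from collections import Counter

def gene_diversity(haplotypes):
    """Nei's unbiased gene diversity: n/(n-1) * (1 - sum of squared frequencies)."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    return n / (n - 1) * (1.0 - sum((c / n) ** 2 for c in counts.values()))

def draw_ordered(unordered, rng):
    """One draw from P(ordered D | unordered D) under the assumed prior:
    each hetero-allelic pair is reversed independently with probability 1/2."""
    ordered = []
    for a, b in unordered:
        if a != b and rng.random() < 0.5:
            a, b = b, a
        ordered.append((a, b))
    return ordered

def posterior_summary(unordered, statistic, n_rep=1000, seed=0):
    """Monte Carlo mean and central 95% interval of a statistic over ordered data."""
    rng = random.Random(seed)
    values = sorted(statistic(draw_ordered(unordered, rng)) for _ in range(n_rep))
    mean = sum(values) / n_rep
    lower = values[int(0.025 * n_rep)]
    upper = values[int(0.975 * n_rep) - 1]
    return mean, lower, upper

# Hypothetical unordered sample: pairs of allele sizes at the duplicated locus.
sample = [(12, 13)] * 6 + [(13, 13)] * 3 + [(12, 14)] * 4 + [(14, 15)] * 2
print(posterior_summary(sample, gene_diversity))
```

Any other statistic that takes a list of ordered haplotypes, or any other prior, can be substituted by replacing `statistic` or `draw_ordered`.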

We investigate three a priori distributions: binomial (BP), uniform (UP) and zero-one (ZOP, identical to the prior of Mathias et al. (1994) at the intra-population level), and compare the a posteriori estimates with the results obtained under the scheme of Mathias et al. (1994) (AP).

Our first result is that gene diversity estimates can be strongly affected by order misspecification and by the way ordered haplotypes are drawn from unordered data. In any sample containing hetero-allelic haplotypes, the prior of Mathias et al. (1994) generates downward-biased estimates of the actual gene diversity. Although the extent of the bias ΔH may generally be small (especially in population samples with high gene diversity), it can reach values as high as 0.3. We have shown that the magnitude of the mean difference is strongly linked to the proportion of homo-allelic haplotypes in the sample. As this proportion varies drastically among populations, they are affected differently by the order misspecification. Although the regression analysis has shown that a proportion of such haplotypes above 30% ensures a bias of less than 5%, about one third of populations lie below this threshold and exhibit bias. While extreme bias values are unlikely, we recommend the use of a random method, such as our uniform assignment prior (UP), to compute confidence intervals instead of considering only the observed gene diversity in the sample.
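The direction of this bias can be illustrated on a toy sample. Under the arbitrary scheme (shorter allele to locus 1), each unordered haplotype maps to a single ordered haplotype, so any other ordering can only split haplotype classes and raise gene diversity. The sketch below compares the AP value with a simulated interval. It assumes one plausible reading of a uniform prior, in which the proportion of reversed haplotypes is drawn uniformly on [0, 1] for each replicate; that reading, the function names and the sample are illustrative assumptions.

```python
import random
from collections import Counter

def gene_diversity(haplotypes):
    """Nei's unbiased gene diversity: n/(n-1) * (1 - sum of squared frequencies)."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    return n / (n - 1) * (1.0 - sum((c / n) ** 2 for c in counts.values()))

def assign_ap(unordered):
    """Arbitrary scheme: shorter allele to locus 1, longer allele to locus 2."""
    return [(min(a, b), max(a, b)) for a, b in unordered]

def assign_uniform(unordered, rng):
    """Assumed uniform-type prior: draw a reversal proportion q once per
    replicate, then reverse each haplotype independently with probability q."""
    q = rng.random()
    return [(b, a) if rng.random() < q else (a, b) for a, b in assign_ap(unordered)]

# Hypothetical sample with few homo-allelic haplotypes (2 out of 20).
sample = [(12, 13)] * 8 + [(13, 13)] * 2 + [(12, 14)] * 6 + [(14, 15)] * 4

h_ap = gene_diversity(assign_ap(sample))
rng = random.Random(42)
h_sim = sorted(gene_diversity(assign_uniform(sample, rng)) for _ in range(1000))
h_mean = sum(h_sim) / len(h_sim)
ci_95 = (h_sim[25], h_sim[974])   # central 95% interval of the simulated values
print(f"AP: {h_ap:.3f}  simulated mean: {h_mean:.3f}  95% interval: {ci_95}")
```

Reversing a homo-allelic haplotype leaves it unchanged, which is consistent with the bias shrinking as the proportion of such haplotypes grows.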

Our second result is that order misspecification is not likely to have a significant effect on the estimates of F-statistics or distances among population samples (although it does affect them). Simulated results appear very close to each other (as judged by highly significant Mantel tests) under all random assignment methods considered. Therefore we can conclude that the actual values estimated from ordered data must also be close to the simulated values and can be satisfactorily approximated this way. As far as population structure is concerned, the absolute values of F-statistics and distances are not as important as their relative values and order. Comparing matrices of pairwise F-statistics and/or distances, we have noticed that in all cases our results seem independent of the assignment prior. When pairs of F-statistics and/or distances are compared, mean Mantel statistics are comparable but variances increase from BP to ZOP and UP, in the same way as do the variances of the corresponding assignment priors (the variance of BP is proportional to the sample size, while it is proportional to the squared sample size for ZOP and UP). When pairs of assignment priors are compared, F-statistics and distances are conserved to different degrees: Average Squared Distance (ASD) estimates are highly conserved, while F-statistics (more sensitive to the reconstruction methods) are more variable, though not significantly so.
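The scaling of the prior variances with sample size can be made concrete with a small exact calculation. The sketch below considers K, the number of hetero-allelic haplotypes (out of n) assigned to one of the two possible orders, and assumes the following forms for the priors: K is Binomial(n, 1/2) under BP, uniform on {0, ..., n} under UP, and equal to 0 or n with probability 1/2 each under ZOP. These forms are assumptions made for illustration; under them the variances are n/4, n(n + 2)/12 and n²/4 respectively.

```python
from fractions import Fraction
from math import comb

def variance(pmf):
    """Exact variance of a distribution given as {value: probability}."""
    mean = sum(k * p for k, p in pmf.items())
    return sum((k - mean) ** 2 * p for k, p in pmf.items())

def binomial_pmf(n):
    """Assumed BP: each haplotype is ordered independently with probability 1/2."""
    return {k: Fraction(comb(n, k), 2 ** n) for k in range(n + 1)}

def uniform_pmf(n):
    """Assumed UP: every count from 0 to n is equally likely."""
    return {k: Fraction(1, n + 1) for k in range(n + 1)}

def zero_one_pmf(n):
    """Assumed ZOP: all haplotypes share one order or the other."""
    return {0: Fraction(1, 2), n: Fraction(1, 2)}

for n in (10, 20, 40):
    print(n, variance(binomial_pmf(n)), variance(uniform_pmf(n)), variance(zero_one_pmf(n)))
```

Doubling n doubles the BP variance but roughly quadruples the UP and ZOP variances, which matches the ordering of the Mantel statistic variances described above.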

Importantly, the choice of assignment method does not appear to affect the estimates of relative F-statistics or genetic distances between populations. Therefore, when analysing population structure, one can expect little bias due to order misspecification in haplotypic data. Finally, we have also considered the fact that allelic states at both loci are correlated, based on Kittler et al. (2003). Since these duplicated microsatellites are located in palindromic sequences they probably undergo gene conversion, an idea supported by the observation that, for DYS385a/b, the mutational process at the first locus does not seem to be independent of the second locus (Kittler et al. 2003). To take into account the phylogenetic relationship existing between the individuals within each sample (as haplogroup information does), we have introduced moderate levels of correlation among loci (p < 0.5). As a consequence, when a {12, 13} haplotype is generated in population 1, the chance of simulating another {12, 13} is higher than that of generating a {13, 12}. When we estimate the F-statistics, we obtain results very close to the previous ones (not developed here).

We have shown that intra-population parameters are more likely to be affected by assignment priors than are inter-population parameters (F-statistics or genetic distances). The experimental determination of ordered haplotypes within large duplicons is not routine, and as long as this remains the case, it seems worthwhile taking the ambiguity of haplotype order into account. Computing confidence intervals using one of the methods presented here would both prevent misinterpretations and retain all potential information carried by duplicated microsatellites.

The results presented here are based on P8-linked microsatellites only, and do not consider that the ongoing molecular mechanisms may differ from one palindrome to another. We know, for example, that DYS385 is likely to undergo duplication/deletion events, as 29 duplication events (3 or 4 copies of DYS385) have been reported in the YHRD. The susceptibility of Y-STRs to gene conversion or to other molecular mechanisms may vary according to the palindromes present, or more generally to the large repeated sequences involved. The different molecular structures (Skaletsky et al. 2003) and different evolutionary histories (Rozen et al. 2003) of the eight palindromes may be crucial in that respect. For this reason it would be interesting to extend our results, both to other palindromes and to other population genetic parameters (for example, TMRCA estimates). Y-chromosomal haplogroups are strongly geographically differentiated, and can also be associated with different arrangements of palindromic sequences; it would therefore also be worth considering haplogroup information to identify population-specific effects on these estimates.

Acknowledgments

The authors thank Laurent Excoffier for providing valuable comments on a draft manuscript and Mark A. Jobling for stimulating discussions and his insightful comments. We are grateful to G. Corti for her invaluable assistance and N. Poulet for his help with graphics. The authors would like to thank the two reviewers for their helpful comments on the manuscript.

Electronic References

Y-chromosome Haplotype Reference Database (YHRD): http://ystr.org/index.html


References

Balaresque, P., Manni, F., Dugoujon, J. M., Crouau-Roy, B. & Heyer, E. (2006) Estimating sex-specific processes in human populations: Are XY-homologous markers an effective tool? Heredity 96, 214–21.

Butler, J. M., Decker, A. E., Kline, M. C. & Vallone, P. M. (2005) Chromosomal duplications along the Y-chromosome and their potential impact on Y-STR interpretation. J Forensic Sci 50, 853–859.

Cruciani, F., Santolamazza, P., Shen, P. D., Macaulay, V., Moral, P., Olckers, A., Modiano, D., Holmes, S., Destro-Bisol, G., Coia, V., Wallace, D. C., Oefner, P. J., Torroni, A., Cavalli-Sforza, L. L., Scozzari, R. & Underhill, P. A. (2002) A back migration from Asia to sub-Saharan Africa is supported by high-resolution analysis of human Y-chromosome haplotypes. Am J Hum Genet 70, 1197–1214.

Goldstein, D. B., Linares, A. R., Cavalli-Sforza, L. L. & Feldman, M. W. (1995) An evaluation of genetic distances for use with microsatellite loci. Genetics 139, 463–471.

Hurles, M. E. & Jobling, M. (2003) A singular chromosome. Nat Genet 34, 246–247.

Kayser, M., Brauer, S., Schadlich, H., Prinz, M., Batzer, M. A., Zimmerman, P. A., Boatin, B. A. & Stoneking, M. (2003) Y chromosome STR haplotypes and the genetic structure of U.S. populations of African, European, and Hispanic ancestry. Genome Res 13, 624–34.

Kayser, M., Caglia, A., Corach, D., Fretwell, N., Gehrig, C., Graziosi, G., Heidorn, F., Herrmann, S., Herzog, B., Hidding, M., Honda, K., Jobling, M., Krawczak, M., Leim, K., Meuser, S., Meyer, E., Oesterreich, W., Pandya, A., Parson, W., Penacino, G., Perez-Lezaun, A., Piccinini, A., Prinz, M., Schmitt, C., Roewer, L. et al. (1997) Evaluation of Y-chromosomal STRs: a multicenter study. Int J Legal Med 110, 125–33.

Kittler, R., Erler, A., Brauer, S., Stoneking, M. & Kayser, M. (2003) Apparent intrachromosomal exchange on the human Y chromosome explained by population history. Eur J Hum Genet 11, 304–314.

Malaspina, P., Cruciani, F., Santolamazza, P., Torroni, A., Pangrazio, A., Akar, N., Bakalli, V., Brdicka, R., Jaruzelska, J., Kozlov, A., Malyarchuk, B., Mehdi, S. Q., Michalodimitrakis, E., Varesi, L., Memmi, M. M., Vona, G., Villems, R., Parik, J., Romano, V., Stefan, M., Stenico, M., Terrenato, L., Novelletto, A. & Scozzari, R. (2000) Patterns of male-specific inter-population divergence in Europe, West Asia and North Africa. Ann Hum Genet 64, 395–412.

Mantel, N. (1967) The detection of disease clustering and a generalized regression approach. Cancer Res 27, 209–20.

Mathias, N., Bayes, M. & Tyler-Smith, C. (1994) Highly informative compound haplotypes for the human Y chromosome. Hum Mol Genet 3, 115–23.

Nei, M. (1987) Molecular Evolutionary Genetics. Columbia University Press, New York.

Quintana-Murci, L., Semino, O., Poloni, E. S., Liu, A., Van Gijn, M., Passarino, G., Brega, A., Nasidze, I. S., Maccioni, L., Cossu, G., Al-Zahery, N., Kidd, J. R., Kidd, K. K. & Santachiara-Benerecetti, A. S. (1999) Y-chromosome specific YCAII, DYS19 and YAP polymorphisms in human populations: a comparative study. Ann Hum Genet 63, 153–166.

Rozen, S., Skaletsky, H., Marszalek, J. D., Minx, P. J., Cordum, H. S., Waterston, R. H., Wilson, R. K. & Page, D. C. (2003) Abundant gene conversion between arms of palindromes in human and ape Y chromosomes. Nature 423, 873–876.

Scozzari, R., Cruciani, F., Malaspina, P., Santolamazza, P., Ciminelli, B. M., Torroni, A., Modiano, D., Wallace, D. C., Kidd, K. K., Olckers, A., Moral, P., Terrenato, L., Akar, N., Qamar, R., Mansoor, A., Mehdi, S. Q., Meloni, G., Vona, G., Cole, D. E., Cai, W. & Novelletto, A. (1997) Differential structuring of human populations for homologous X and Y microsatellite loci. Am J Hum Genet 61, 719–33.

Scozzari, R., Cruciani, F., Santolamazza, P., Malaspina, P., Torroni, A., Sellitto, D., Arredi, B., Destro-Bisol, G., De Stefano, G., Rickards, O., Martinez-Labarga, C., Modiano, D., Biondi, G., Moral, P., Olckers, A., Wallace, D. C. & Novelletto, A. (1999) Combined use of biallelic and microsatellite Y-chromosome polymorphisms to infer affinities among African populations. Am J Hum Genet 65, 829–46.

Skaletsky, H., Kuroda-Kawaguchi, T., Minx, P. J., Cordum, H. S., Hillier, L., Brown, L. G., Repping, S., Pyntikova, T., Ali, J., Bieri, T., Chinwalla, A., Delehaunty, A., Delehaunty, K., Du, H., Fewell, G., Fulton, L., Fulton, R., Graves, T., Hou, S. F., Latrielle, P., Leonard, S., Mardis, E., Maupin, R., McPherson, J., Miner, T., Nash, W., Nguyen, C., Ozersky, P., Pepin, K., Rock, S., Rohlfing, T., Scott, K., Schultz, B., Strong, C., Tin-Wollam, A., Yang, S. P., Waterston, R. H., Wilson, R. K., Rozen, S. & Page, D. C. (2003) The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes. Nature 423, 825–37.

Slatkin, M. (1995) A measure of population subdivision based on microsatellite allele frequencies. Genetics 139, 457–462.

Wright, S. (1951) The genetical structure of populations. Ann Eugenics 15, 323–354.

Appendix I: Haplotype frequencies of YCAIII (DYS413ab) polymorphism in 33 human populations

Received: 3 March 2006

Accepted: 7 July 2006

© 2006 The Authors. Journal compilation © 2006 University College London. Annals of Human Genetics (2006) 71, 209–219