Post on 22-Apr-2023
doi: 10.1111/j.1469-1809.2006.00311.x
Unbiased Interpretation of Haplotypes at Duplicated Microsatellites
P. Balaresque1,2,∗, A. Sibert1, E. Heyer1 and B. Crouau-Roy2
1Eco-anthropologie et Ethnobiologie, UMR5145 Departement Hommes Natures Societes, Musee de l’Homme - 17, Place du Trocadero - 75116 Paris, France
2Laboratoire Evolution & Diversite Biologique, UMR 5174, Batiment 4R3, Universite Paul Sabatier Toulouse III - 118, route de Narbonne - 31062 Toulouse cedex 4, France
Summary
The Y-chromosome is a powerful tool for population geneticists to study human evolutionary history. Haploid
and largely non-recombining, it should contain a simple record of past mutational events. However, this apparent
simplicity is compromised by Y-linked duplicons, which make up ∼35% of this chromosome; 25% of these duplicons
are large inverted repeats (palindromes). For microsatellites lying in these palindromes, two loci cannot be easily
distinguished due to PCR co-amplification, and this order misspecification of alleles generates an additional variance
component. Due to this ambiguity, population geneticists have traditionally used an arbitrary method to assign the
alleles (shorter allele to locus 1, larger allele to locus 2). Here, we simulate the posterior distributions of diversity estimates
under three novel allele assignment priors and compare them with the original method. We use a sample of
33 human populations, typed for duplicated microsatellites lying within palindrome P8, to illustrate our approach.
We show that both intra- and inter-population statistics can be dramatically affected by order misspecification.
Surprisingly, matrices of pairwise F-statistics or distance estimates appear far less sensitive to order misspecification
and remain relatively unchanged under the priors considered, suggesting that these microsatellites can be considered
as useful markers for population genetic studies using an appropriate data treatment. Duplicated microsatellites
represent an attractive source of information to investigate the extensive structural polymorphism observed among
human Y chromosomes, as well as processes of intra-chromosomal gene conversion acting between duplicons.
Keywords: duplicated microsatellites, Y-chromosome, haplotypes order, assignment methods, palindromes, gene
conversion, Bayesian approach.
Introduction
The Y-chromosome represents an attractive candidate to
study human evolutionary history. Due to the absence
of recombination, it contains a relatively simple record
of past mutational events. However, this apparent simplicity
is compromised by the existence of Y-linked duplicons
(Hurles & Jobling 2003; Skaletsky et al. 2003).
In the specific case of microsatellites, while the information carried by single copy microsatellites is unambiguous, variation at multi-copy microsatellites is much more complex to analyze and interpret (Butler et al. 2005). Nonetheless, multi-copy microsatellites represent an intrinsic part of the Y-chromosome, and are subject not only to the same evolutionary mechanisms operating at single copy microsatellites, but also to mechanisms specific to duplicons.
∗Corresponding author: Patricia Balaresque, Department of Genetics, University of Leicester, Adrian Building, University Road, LEICESTER, LE1 7RH, United Kingdom. Tel/Fax: +44 (0) 116 252 3377/78. E-mail: plb7@le.ac.uk
© 2006 The Authors. Journal compilation © 2006 University College London. Annals of Human Genetics (2006) 71, 209–219
Most of these duplicated microsatellites (such as YCAII, YCAIII, DYS385I/II and DYS464) are located in the eight massive palindromes which represent 25% of the MSY euchromatin, and show an arm-to-arm identity ranging from 99.94% (P3) to 99.997% (P8) (Skaletsky et al. 2003). From a practical point of view, this very high degree of similarity means that it is impossible to design PCR primers to amplify the different microsatellite copies independently, and they are therefore indistinguishable. The main consequence of such
a co-amplification is that the corresponding multi-copy
haplotypes cannot be determined properly: only “un-
ordered haplotypes” are available from routine molec-
ular analyses. The use of such multi-copy microsatel-
lites for population genetic studies requires particular
caution, because we do not know how co-amplified
copies should be assigned to distinct loci. This assign-
ment problem has previously been addressed by Mathias
et al. (1994). They isolated a set of highly polymorphic
duplicated microsatellites and described the most parsi-
monious method to assign the amplified alleles to dis-
tinct loci: when alleles differed in size, the longest was
labelled allele A and the shortest allele B. This method
is unbiased if the length ranges of both microsatellites
do not overlap. However, in many studies haplotypes
comprising two alleles of the same length (such as {18-
18} or {25-25} repeats for example) are very common
(Scozzari et al. 1997, 1999; Quintana-Murci et al. 1999;
Malaspina et al. 2000; Cruciani et al. 2002). Their high
frequencies suggest that in many cases both loci have
similar allele ranges with partial or complete overlap.
Due to their high level of polymorphism, duplicated
microsatellites have been widely used in human pop-
ulation genetics to identify males and male lineages in
forensic practices (Kayser et al. 1997, and see the YSTR
database), and also to infer population diversity param-
eters and to estimate coalescence times of haplogroups
defined by binary markers (Scozzari et al. 1999; Cruciani
et al. 2002; Kayser et al. 2003). However, the popula-
tion parameters published so far have been estimated
only using one allele assignment method, which ap-
pears both risky and restrictive. The aim of the present
paper is to evaluate the effect of misspecifying haplo-
type order on standard diversity estimates, and to de-
termine how much information can still be recovered
from unordered haplotypes. For this purpose we focus
on the P8 palindrome, which exhibits the highest arm-
to-arm identity (Skaletsky et al. 2003), and for which the
misspecification in allele assignment is therefore maximal. We introduce three random priors (BP: binomial
prior; UP: uniform prior; ZOP: zero-one prior) that assign
co-amplified alleles to each locus with different proba-
bility models. We apply these priors and Mathias et al.
(1994)’s arbitrary prior (AP) to a dataset of 33 popula-
tion samples for YCAIII a/b (DYS413 a/b), duplicated
dinucleotide microsatellites. We report the relative effect
of random assignment priors on usual parameters at the
intra-population (gene diversity) and inter-population
(for different genetic distances - FST, RST and δμ2, ASD)
levels and discuss their implications.
Materials and Methods
Haplotype Information: The Problem
Haplotypes are usually defined as a combination of al-
leles carried by the same chromosome. In the case of
coamplified alleles, the genetic information can be rep-
resented by the ordered pair (i,j), where i and j are the
states of allele 1 and allele 2 respectively. For a sample
of individuals one can summarise the information con-
tained in their haplotypes in a matrix, whose elements
nij are the number of individuals with ordered haplotype
(i,j). Since the number of (i,j) and (j,i) ordered pairs do
not need to be equal, this matrix is not necessarily sym-
metrical. In the case of ordered haplotypic data we call
(i,j) and (j,i) reverse haplotypes. For duplicated markers
it is not possible to assign one allele to one specific locus.
In consequence the resulting information is well repre-
sented by an unordered pair {i,j}, where i is the state
of one of the alleles, and j the state of the other one.
The haplotypic information contained in a sample can
be summarised by a matrix whose elements mij are the
number of individuals with unordered haplotype {i,j}.
Since mij = mji the matrix is symmetrical. Practically, it
is possible to draw the matrix of unordered haplotypes
of a sample from the matrix of its ordered haplotypes
through the following mapping:
$$\begin{cases} m_{ij} = m_{ji} = n_{ij} + n_{ji} & \forall\, i \neq j \\ m_{ii} = n_{ii} & \forall\, i \end{cases} \tag{1}$$
Unfortunately the converse is not true: it is not possi-
ble to deduce the nij’s from the mij’s only. In other words,
several sets of ordered haplotypes can yield the same set
of unordered haplotypes under the above mapping (1).
The only case where the haplotype frequencies remain
unchanged is for the haplotypes (i,i) only, referred to
as homo-allelic hereafter. The above considerations are
equivalent to stating that there is less information in un-
ordered than in ordered haplotypic data. Practically, a set
of ordered haplotypes can be artificially reconstructed
from unordered data, satisfying equations (1). We call
this reconstruction scheme the assignment prior, since
for each unordered haplotype one allele is assigned to
locus 1 and the other to locus 2.
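As a minimal sketch of mapping (1), the following Python snippet collapses a table of ordered haplotype counts into unordered counts and shows that two different ordered samples yield the same unordered data. The function name `unordered_counts` and the repeat counts are hypothetical and serve only as an illustration.

```python
from collections import Counter


def unordered_counts(ordered):
    """Collapse ordered counts n[(i, j)] into unordered counts m[{i, j}].

    An unordered haplotype {i, j} is stored as the sorted tuple (min, max).
    """
    m = Counter()
    for (i, j), n_ij in ordered.items():
        m[tuple(sorted((i, j)))] += n_ij
    return dict(m)


# Two different ordered samples (hypothetical repeat counts) ...
sample_a = {(11, 14): 5, (14, 11): 1, (12, 12): 3}
sample_b = {(11, 14): 2, (14, 11): 4, (12, 12): 3}

# ... collapse to the same unordered sample, so the mapping cannot be inverted.
print(unordered_counts(sample_a))
print(unordered_counts(sample_b))
```

Homo-allelic haplotypes such as (12, 12) keep their count unchanged, which is the only case where ordered and unordered frequencies coincide.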
Allele Assignment Prior
In order to compare the impact of different allele as-
signment priors, i.e. several ways of simulating ordered
haplotypes when only unordered data are available by
assigning alleles to loci, we describe four allele assign-
ment priors, that of Mathias et al. (1994) and three new
ones.
Arbitrary prior (AP): Mathias’s prior
Mathias et al. (1994) assigned the co-amplified alleles
to each locus according to their sizes, the longest al-
lele being assigned to locus 1 (allele A), the shortest to
locus 2 (allele B). In the following section all parame-
ters estimated with this arbitrary prior carry the index
“AP”.
Binomial prior (BP)
Here we assume that the co-amplified alleles are ran-
domly assigned to the two distinct loci independently
for each haplotype. The rationale of this prior is the ob-
servation of allele range overlap at both loci (see Appe-
ndix I). For each haplotype in the sample the longer
allele (A) is assigned to locus 1 with probability p and to
locus 2 with probability 1 − p. Haplotype assignments
are assumed to be independent from each other, and the
prior is called binomial for this reason. When p = 1/2,
the allele size is not taken into account for the simulation
of ordered haplotypes, unlike the arbitrary prior which
corresponds to p = 1. All the results obtained with the
binomial prior are indexed with “BP”. Since the prob-
lem is symmetric with regard to both haplotypes, we
can restrict the range of p to [0;0.5].
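The arbitrary and binomial priors described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the seed and the example sample are hypothetical, and an ordered haplotype is written as (allele at locus 1, allele at locus 2).

```python
import random


def assign_arbitrary(pair):
    """AP sketch: the longer allele goes to locus 1, the shorter to locus 2."""
    return (max(pair), min(pair))


def assign_binomial(pair, p=0.5, rng=random):
    """BP sketch: the longer allele goes to locus 1 with probability p,
    independently for each haplotype in the sample."""
    longer, shorter = max(pair), min(pair)
    return (longer, shorter) if rng.random() < p else (shorter, longer)


# Hypothetical unordered sample: six {11, 14} and three {12, 12} haplotypes.
unordered = [(11, 14)] * 6 + [(12, 12)] * 3

rng = random.Random(0)  # fixed seed so the sketch is reproducible
ordered_ap = [assign_arbitrary(h) for h in unordered]
ordered_bp = [assign_binomial(h, 0.5, rng) for h in unordered]

print(ordered_ap)
print(ordered_bp)
```

With p = 1 the binomial draw reduces to the arbitrary assignment, which is why the arbitrary prior can be seen as a degenerate case of the binomial one.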
Uniform prior (UP)
In fact, there is no obvious reason why assignments
should be independent for all the individuals (i.e. hap-
lotypes) in the sample. Indeed, due to genealogical re-
lationships, the states of all haplotypes in the sample are
correlated. Despite the absence of further information
concerning the magnitude of this correlation, which is
expected to vary among populations, we can consider
assignment priors with higher variance than the bino-
mial prior. For example, one can think of a prior for
which all pairs (nij, nji) satisfying nij + nji = mij have
an equal probability. We call this the uniform prior and
index it with “UP”.
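A possible way to draw from the uniform prior is sketched below: for an unordered haplotype observed mij times, every split (nij, nji) with nij + nji = mij is given the same probability. The function name `split_uniform`, the seed and the count of 4 are assumptions made for the example.

```python
import random
from collections import Counter


def split_uniform(m_ij, rng=random):
    """UP sketch: draw n_ij uniformly from 0..m_ij and set n_ji = m_ij - n_ij."""
    n_ij = rng.randint(0, m_ij)  # randint includes both end points
    return n_ij, m_ij - n_ij


rng = random.Random(0)
n_draws = 5000
draws = Counter(split_uniform(4, rng) for _ in range(n_draws))

# Each of the 5 possible splits (0, 4), (1, 3), ..., (4, 0) should appear
# with a relative frequency close to 1/5.
for split in sorted(draws):
    print(split, draws[split] / n_draws)
```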
Zero-One prior (ZOP)
Finally, we consider the Zero-One prior (indexed
ZOP), which assigns all unordered haplotypes sam-
pled in a population to the same ordered haplotype,
in other words (nij, nji) = (mij, 0) with probability p
and (0, mij) with a probability 1 − p, independently for
each population sample. The rationale for this is that
Mathias et al. (1994)’s prior yields unjustified correla-
tion not only between individuals within populations
but also between populations themselves. We call this
prior zero-one since the frequency of ordered haplo-
types in a given population sample is either 0 or 1. At
the intra-population level, this prior is strictly equiva-
lent to Mathias et al. (1994)’s scheme. Therefore both
will yield identical intra-population diversity parameter
estimates. The difference between both methods lies at
the inter-population level only. It must be stressed that
the zero-one-distribution, for which the variable equals
1 with probability p and 0 with probability 1 − p, max-
imises the variance on the set of all distributions with
compact support [0, 1] and mean p. Therefore this prior
maximises the variance in nij’s between population sam-
ples and is expected to minimise the correlation between
samples.
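The zero-one prior can be sketched in the same style: within one population sample all copies of an unordered haplotype receive the same order, and the choice is redrawn independently for each population sample. The function name, population labels and counts below are hypothetical.

```python
import random


def split_zero_one(m_ij, p=0.5, rng=random):
    """ZOP sketch: all m_ij copies get the same order within one population sample."""
    return (m_ij, 0) if rng.random() < p else (0, m_ij)


rng = random.Random(0)
# Hypothetical counts of the unordered haplotype {11, 14} in three population samples.
counts = {"pop1": 12, "pop2": 7, "pop3": 20}
ordered = {pop: split_zero_one(m, 0.5, rng) for pop, m in counts.items()}

for pop, (n_ij, n_ji) in ordered.items():
    print(pop, "n(11,14) =", n_ij, " n(14,11) =", n_ji)
```

Within each population the result looks like the arbitrary prior (a single ordered haplotype), but two populations may end up with opposite orders, which is what separates ZOP from AP at the inter-population level.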
In all cases the a posteriori distribution was obtained
by Monte Carlo integration. The rationale for doing
this Monte Carlo integration under BP and UP priors
is that it allows us to explore the whole range of or-
dered haplotypes corresponding to the same unordered
haplotype data. Therefore, under these two prior assign-
ments, the exact ordered haplotype state of the sample
(which is actually unknown) has non-zero probability.
In other words, for any sample statistics such as gene
diversity, the actual value lies in the range of the a poste-
riori distribution under BP and UP. Contrary to Mathias
et al. (1994)’s method (AP), that gives a point estimate
without information about its closeness to the actual
value, our results allow us to determine a range for the
exact sample value by simulation.
Measures of Intra-population Diversity and
Inter-population Differentiation
Among a variety of measures of genetic diversity at the intra-population level we use $h = \frac{n}{n-1}\left(1 - \sum_i f_i^2\right)$ in the following, where n is the sample size and fi represents the observed frequency of the ith ordered haplotype (Nei, 1987). At the inter-population level we compute four of the most popular F-statistics and genetic distance estimators: FST, RST, ASD and δμ2 (Wright, 1951; Slatkin, 1995; Goldstein et al. 1995). For each genetic distance and assignment prior, 1000 replicates of the sample are simulated and Mantel statistics used to compare them (Mantel, 1967).
Figure 1 Percentage of hetero-allelic haplotypes (represented in white) in all populations sampled.
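The gene diversity estimator h used in this section can be computed from a list of ordered haplotypes as in the sketch below. The function name and the two example samples are hypothetical; they only illustrate how the same unordered data can give different values of h depending on how alleles are assigned to loci.

```python
from collections import Counter


def gene_diversity(haplotypes):
    """Gene diversity h = n / (n - 1) * (1 - sum of squared haplotype frequencies)."""
    n = len(haplotypes)
    freqs = [count / n for count in Counter(haplotypes).values()]
    return n / (n - 1) * (1 - sum(f * f for f in freqs))


# Same unordered data (six {11, 14} and three {12, 12}), two possible orderings.
all_one_order = [(14, 11)] * 6 + [(12, 12)] * 3
mixed_orders = [(14, 11)] * 3 + [(11, 14)] * 3 + [(12, 12)] * 3

print(gene_diversity(all_one_order))  # approximately 0.5
print(gene_diversity(mixed_orders))   # approximately 0.75
```

Splitting the hetero-allelic class into its two reverse haplotypes adds a haplotype to the sample and raises h, here from about 0.5 to about 0.75.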
Palindrome P8, the Microsatellite YCAIII and
Population Sampling
We focus on palindrome P8, characterised by an arm-to-arm
homology of 99.997%. The two copies of
the YCAIII microsatellite have previously been lo-
cated in silico on each arm (Balaresque et al. 2006).
As Y-linked microsatellites can be subject to duplica-
tion/deletion events (see mutations/duplications section
in http://www.ystr.org/), we checked by a semi-
quantitative method (using peak height) whether the
YCAIII microsatellites had been subject to copy number
polymorphism. As we did not find any evidence for such
polymorphism, we assumed that the two loci had been
systematically amplified. We analyzed a dataset of 33
populations from Africa, Europe and Asia representing
1,413 Y-chromosomes (see Appendix I); these data were
extracted from Scozzari et al. (1997) and Balaresque et al.
(2006).
Results
Distribution of Hetero-allelic Haplotypes
among Populations
The assignment hypotheses are relevant only to the
hetero-allelic haplotypes (i, j) in each population. The
percentage of such haplotypes is represented in Figure 1,
and varies drastically among the samples. However, East and North African populations tend to show a higher percentage of hetero-allelic haplotypes than European or West Asian populations. In East Africa populations show a similar percentage of homo- and hetero-allelic haplotypes. In consequence, we expected these different populations to be differently affected by allele assignment methods.
[Figure 2: one histogram panel per population sample on a gene diversity scale; legend: BP, UP, UP & BP superposed, AP & ZOP. Panels (sample size in brackets): South Italian (81), Spanish (36), Basque (55), Sardinian (97), Turkish (20), Moroccan (38), Bamileke (51), Amhara (48), Fulbe Burkina (20), Ewondo (31), Rimaibe (42), Mossi (56), Fulbe Cameroon (17), !Kung (63), Egyptian (48), Orcadian (27), East Arabian (32), Uldeme (23), Oromo (38), Fali (39), Mozabite (80), Khwe (26), Daba (18), Tali (15), Yakuba (62), Pakistani (20), Greek (51), English (20), Akan (57), Central Italian (84), Danish (35), Corsican (63), North Italian (20).]
Figure 2 Distribution of sample gene diversity estimates under various assignment priors (AP and ZOP: vertical line with square on the top; UP: black bars; BP: white bars; UP and BP overlapping: grey bars). Populations are ordered from lowest to highest ΔHBP (cf. text for explanations) and the sample size is given in brackets.
Intra-population Genetic Diversity Estimate
We estimated gene diversity h under all four assignment
priors (AP, BP, UP and ZOP) applied to our set of 33
populations. Under AP and ZOP priors for the hetero-
allelic haplotypes, only one of the two possible reverse
haplotypes was considered. So, for each population the
same point estimate was obtained for the two priors
(represented in Figure 2 by a vertical line with a square
on the top).
The main observation is that the AP-ZOP point es-
timates are uniformly lower than the estimates obtained
with BP and UP priors. This is strongly related to the
fact that random assignments increase the total num-
ber of haplotypes, by creating reverse haplotypes when
hetero-allelic haplotypes exist. Moreover the mean es-
timate of gene diversity is almost always larger under
BP than under UP as expected: gene diversity is highest
when both reverse haplotypes (i, j) and (j, i) are found
at equal frequency in the sample, and this occurs more
frequently under BP than UP. Finally, it also appears
that estimates are spread over a broader range under UP
than BP, and this point can be explained by the fact that
the probability distribution of the frequency of ordered
haplotypes has a greater variance under UP.
To measure the extent to which the AP-ZOP assignment prior tends to underestimate h (corresponding to the results found in the literature for duplicated microsatellites to date), we defined the quantity $\Delta H_{BP} = \bar{H}_{BP} - H_{AP}$ (resp. $\Delta H_{UP} = \bar{H}_{UP} - H_{AP}$), where bars denote the average over simulated values. The parameter ΔH is not ‘a bias’ strictly speaking, but rather the difference of two biases. It provides a measure of the expected difference in gene diversity estimates under two distinct assignment priors. We found that in both cases ΔH is highly variable among population samples. Figure 3 presents population samples ordered according to increasing ΔHBP (from Southern Spaniards to Ewondo). It appears that in 13 samples out of 33 ΔHBP lies above 0.1 (from the Tali sample onwards), so that the estimate obtained with Mathias et al. (1994)’s prior is more than 10% lower than the mean estimate under the binomial scheme. By contrast, in twelve samples the difference is less than 0.01. Moreover, we noticed that ΔH depends strongly on the level of gene diversity in the original sample: intuitively, when a hetero-allelic haplotype is very frequent in a population sample, the randomization prior raises the mean gene diversity estimate from almost 0 (arbitrary prior) to around 0.5 (random priors). We studied to what extent the sample composition in hetero- and homo-allelic haplotypes is correlated with ΔH (Figure 3). The least-squares regression of log(ΔH) on the frequency of hetero-allelic haplotypes shows a significant correlation (R² = 0.55, p < 0.001; Figure 3).
[Figure 3: scatter plot of the 33 population samples; x-axis: frequency of hetero-haplotypes; y-axis: log10(ΔHBP).]
Figure 3 Logarithm of mean difference ΔHBP versus observed frequency of hetero-haplotypes. The vertical line corresponds to 70% hetero-allelic haplotypes; the horizontal line to log (0.05); the oblique line to the least-square regression (cf. text).
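The intuition given above for an extreme sample can be checked with a small Monte Carlo sketch. It assumes a hypothetical sample made of 20 copies of a single hetero-allelic haplotype, draws 1000 replicates under the binomial (p = 1/2) and uniform priors, and compares the mean gene diversity with the arbitrary-prior point estimate. All names, the seed and the sample size are illustrative choices.

```python
import random
from statistics import mean


def gene_diversity_two_classes(k, m):
    """h for m haplotypes split into k copies of (i, j) and m - k copies of (j, i)."""
    f1, f2 = k / m, (m - k) / m
    return m / (m - 1) * (1 - f1 * f1 - f2 * f2)


def simulate(m, draw_k, n_rep=1000):
    """Monte Carlo replicates of h, where draw_k(m) returns the count of (i, j)."""
    return [gene_diversity_two_classes(draw_k(m), m) for _ in range(n_rep)]


rng = random.Random(1)
m = 20  # hypothetical sample: 20 copies of one unordered haplotype {i, j}

h_ap = gene_diversity_two_classes(m, m)  # AP: a single ordered haplotype, h = 0
h_bp = simulate(m, lambda size: sum(rng.random() < 0.5 for _ in range(size)))
h_up = simulate(m, lambda size: rng.randint(0, size))

delta_bp = mean(h_bp) - h_ap  # expected value 1/2 for this sample
delta_up = mean(h_up) - h_ap  # expected value 1/3 for this sample

print("AP point estimate:", h_ap)
print("mean difference under BP:", round(delta_bp, 3), "range:", min(h_bp), max(h_bp))
print("mean difference under UP:", round(delta_up, 3), "range:", min(h_up), max(h_up))
```

For this one-haplotype sample the expected gene diversity is exactly 1/2 under the binomial prior and 1/3 under the uniform prior, while the uniform replicates span a wider range, in line with the pattern described for Figure 2.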
Figure 3 allows us to discriminate between two types
of population samples according to the mean difference
ΔH: if we fix a 5% arbitrary threshold under which ΔH
is considered negligible, this threshold corresponds to a
proportion of hetero-allelic haplotypes of around 70%
in the sample. While samples carrying less than this pro-
portion show negligible bias in gene diversity estimates
(‘Low-Bias samples’), the bias is significant for samples
carrying greater than 70% (‘High-Bias samples’). No-
tably, the 14 population samples in the latter category
all belong to the African continent, although this does
not apply to all the African samples (see Figure 1).
Inter-population Differentiation
In order to exemplify the range of the a posteriori distributions of F-statistics and distances under the various priors considered here, we chose five of the 528 possible pairs of population samples, each pair illustrating a different combination of low-bias (LB) and high-bias (HB) samples: Basque (LB) vs. Southern Spaniards (LB), Central Italians (LB) vs. Pakistani (LB), English (LB) vs. Ewondo (HB), Eastern Arabs (HB) vs. Bamileke (HB), and Fali (HB) vs. Mossi (HB).
[Figure 4: one panel per statistic (FST, RST, (δμ)2, ASD) and per sample pair (Basque / S. Spanish, C. Italian / Pakistani, English / Ewondo, E. Arab / Bamileke, Fali / Mossi).]
Figure 4 Posterior distribution of F-statistics and distances for 5 pairs of population samples under various priors (AP: vertical lines; BP: open bars; UP: gray bars; ZOP: black bars). The range of the distribution is given on the abscissa axis.
Figure 4 shows that the five pairs present a surpris-
ing variety of situations that can arise. Firstly, the overall
range, defined as the minimum and maximum of the val-
ues simulated under all four priors, varies dramatically,
although it is consistent for the three parameters FST,
RST and (δμ)2. Large overall ranges can contain spectra
of values either significantly different from 0 (English vs.
Ewondo sample pair) or not (Eastern Arab vs. Bamileke,
or Fali vs. Mossi). For FST, RST and (δμ)2 the general
patterns are that the overall lower boundary is always at-
tained under BP (and corresponding values cluster near
the boundary) and that the upper boundary is always and
exclusively attained under ZOP. Apart from the first pair
(Basque vs. S.Spanish), average estimates are lower un-
der BP, intermediate under UP and larger under ZOP,
a prior that appears to maximise the distances between
samples. This can be easily understood since under ZOP,
for hetero-allelic haplotypes one ordered haplotype, e.g.
(11, 14), may be found in the first population, and the
other one, (14, 11), in the second population. The
overall ranges (within brackets in Figure 4) can be or-
dered consistently for all three parameters across sample
pairs, and the parameter distributions for the three ran-
dom priors are very similar between FST and RST on
the one hand, and RST and (δμ)2 on the other hand.
The position of the point estimate obtained under the
arbitrary prior (AP) always lies within the overall range,
except for the FST in English vs. Ewondo, for which
it attains the upper bound. Like gene diversity it does
not always lie within the range of the posterior distri-
bution under BP, but always within the common range
of UP and ZOP posteriors. This is consistent with the
fact that haplotypes predicted by the arbitrary prior are
more likely under UP and ZOP than under BP (where
frequencies of reverse haplotypes are more probable),
unless few hetero-allelic haplotypes are present. Unlike
gene diversity results there does not seem to be a general
trend for the point-estimate position relative to means
obtained under UP and ZOP.
By contrast, as far as ASD is concerned, AP’s point
estimate is always close to the lower bound, and BP
estimates cluster in the middle of the overall range. As in
the case of the other parameters, ZOP estimates almost
cover the overall range and always exclusively attain the
upper boundary.
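The sensitivity of the two distances to allele order can be illustrated with the sketch below. It assumes the usual definitions for stepwise-mutating microsatellites (Goldstein et al. 1995): ASD as the mean squared difference in repeat number between chromosomes drawn from the two samples, and (δμ)2 as the squared difference between sample means, both averaged over loci. Any sample-size correction used in the paper is not given in this section, so the function names and the data are illustrative only.

```python
from statistics import mean


def asd(pop_a, pop_b):
    """Average squared difference in repeat number, averaged over loci."""
    n_loci = len(pop_a[0])
    per_locus = [
        mean((a[k] - b[k]) ** 2 for a in pop_a for b in pop_b)
        for k in range(n_loci)
    ]
    return mean(per_locus)


def delta_mu_sq(pop_a, pop_b):
    """Squared difference between mean repeat numbers, averaged over loci."""
    n_loci = len(pop_a[0])
    per_locus = [
        (mean(a[k] for a in pop_a) - mean(b[k] for b in pop_b)) ** 2
        for k in range(n_loci)
    ]
    return mean(per_locus)


# Hypothetical ordered samples built from the same unordered haplotype {11, 14}.
pop_1 = [(11, 14)] * 3
pop_2_same = [(11, 14)] * 3     # same order as pop_1
pop_2_reverse = [(14, 11)] * 3  # reverse order, as can happen under ZOP

print(asd(pop_1, pop_2_same), delta_mu_sq(pop_1, pop_2_same))
print(asd(pop_1, pop_2_reverse), delta_mu_sq(pop_1, pop_2_reverse))
```

Two samples with identical unordered data are at distance 0 when given the same order and at distance 9 when given opposite orders, which is the mechanism by which ZOP reaches the upper end of the simulated ranges. Under these definitions, for a single locus ASD equals (δμ)2 plus the two within-sample variances, so the two measures are related by construction.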
Correlation between F-statistics and/or
distances
We tested whether F-statistics and genetic distance (or
dissimilarity) estimates (FST, RST, δμ2, ASD) are signif-
icantly modified by the choice of assignment prior. For
each combination of genetic distance and assignment
prior, 1000 replicates of the sample were simulated and
pairwise genetic distances were estimated. Mantel correlation statistics for the comparison of distance matrices were evaluated for pairs of simulated data with (i) identical assignment prior and (ii) identical genetic distance, in order to assess their effect on the characterisation of genetic structure.
Table 1 Mantel statistics mean and 95% confidence interval (within brackets) for pairs of F-statistics and/or genetic distances. Columns correspond to pairs compared; rows correspond to assignment priors (∗∗∗∗: p < 10^-4; ∗∗∗: p < 10^-3; ∗∗: p < 10^-2; ∗: p < 5%).
Prior  FST vs. RST     FST vs. (δμ)2   FST vs. ASD      RST vs. (δμ)2   RST vs. ASD      (δμ)2 vs. ASD
AP     64∗∗∗∗          43∗∗∗           −15              83∗∗∗∗          −4               33∗∗
BP     58∗∗∗∗ [57;59]  46∗∗∗ [45;47]   1 [−2;3]         90∗∗∗∗ [90;91]  8 [6;9]          33∗∗ [33;34]
UP     60∗∗∗∗ [56;63]  44∗∗∗ [40;48]   −13 [−19;−6]     88∗∗∗∗ [86;89]  −3 [−10;2]       30∗ [27;32]
ZOP    69∗∗∗∗ [66;72]  46∗∗∗ [42;50]   −30 [−33;−26]    84∗∗∗∗ [83;85]  −16 [−19;−13]    23 [20;25]
Table 1 presents the empirical mean and 95% confidence intervals of Mantel correlation statistics for pairs of F-statistics and/or distances (FST, RST, δμ2, ASD), estimated under the same assignment prior. Whatever the prior, the results in the first column show that FST and RST are strongly correlated, with Mantel statistics uniformly above 0.5 (p < 10^-4). Similar conclusions are drawn from the comparison of FST and δμ2 (Mantel statistics uniformly above 0.4) and of RST and δμ2 (Mantel statistics uniformly above 0.8). In both cases the correlation is highly significant with p < 10^-4. On average, δμ2 and ASD are less positively correlated, though the correlation is still significant under AP, BP and UP (p < 0.01) and even under ZOP (p < 0.05); this is consistent with the fact that both measures are related by definition (Goldstein et al. 1995). Finally, it is impossible to reject the null hypothesis of independence for the pairs ASD-FST and ASD-RST at the 5% level, the distribution of the statistics comprising 0 or even being negative (p > 0.2) (Table 1). The latter result is not surprising since there is no a priori reason for F-statistics and distance measures to give consistent results.
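For readers unfamiliar with the Mantel statistic, the sketch below shows one common form of the test: the Pearson correlation between the off-diagonal entries of two distance matrices, with significance assessed by jointly permuting the rows and columns of one matrix (Mantel, 1967). The exact variant and software used for Tables 1 and 2 are not stated in this section, so the function names, the number of permutations and the toy matrices are assumptions.

```python
import random
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mantel(d1, d2, n_perm=999, rng=random):
    """Mantel statistic between two symmetric matrices and a one-sided
    permutation p-value."""
    n = len(d1)
    pairs = list(combinations(range(n), 2))  # upper-triangle index pairs
    x = [d1[i][j] for i, j in pairs]
    r_obs = pearson(x, [d2[i][j] for i, j in pairs])
    hits = 0
    for _ in range(n_perm):
        perm = list(range(n))
        rng.shuffle(perm)  # relabel the populations of the second matrix
        y = [d2[perm[i]][perm[j]] for i, j in pairs]
        if pearson(x, y) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)


# Toy distance matrices for four hypothetical populations; d2 is d1 rescaled.
d1 = [[0, 1, 2, 3], [1, 0, 4, 5], [2, 4, 0, 6], [3, 5, 6, 0]]
d2 = [[2 * v for v in row] for row in d1]

r, p_value = mantel(d1, d2, n_perm=999, rng=random.Random(0))
print(round(r, 6), p_value)
```

Permuting population labels rather than individual matrix entries preserves the dependence among distances that share a population, which is why the test is preferred over an ordinary correlation test for distance matrices.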
Whatever the pair of F-statistics and/or distances con-
sidered, Mantel statistics values obtained under all four
assignment priors are strongly consistent (Table 1). In
particular the values obtained from Mathias et al. (1994)’s
scheme always lie within the range defined by the three
other priors, and almost always within the range found
under UP. Contrary to gene diversity estimates, the
value does not systematically correspond to the lower
boundary of simulated values, and it is not possible to
conclude that there is a bias of AP results in any direction. Moreover it is noteworthy that the Mantel statistics variances increase from BP to ZOP and UP, in the same way as do variances of corresponding assignment procedures (the variance of BP is proportional to the sample size while it is proportional to the squared sample size for ZOP and UP). This result indicates that the general properties of distance estimators are independent of the assignment method used.
Table 2 Mantel statistics mean and 95% confidence interval (within brackets) for pairs of assignment priors. Columns correspond to pairs compared; rows correspond to F-statistics or distances (in all cases p < 10^-4).
Statistic  AP vs. BP     AP vs. UP         AP vs. ZOP        BP vs. UP         BP vs. ZOP      UP vs. ZOP
FST        88 [87;89]    87 [83;90]        78 [75;82]        88 [82;94]        68 [64;71]      74 [69;79]
RST        94 [93;95]    91 [87;94]        80 [76;84]        93 [89;97]        75 [68;80]      78 [72;86]
(δμ)2      98 [97;99]    97 [95;98]        90 [87;93]        98 [96;99]        88 [82;93]      88 [82;94]
ASD        99 [99;99.1]  98.9 [98.5;99.1]  98.2 [97.7;98.6]  99.7 [99.6;99.8]  99 [98.2;99.4]  98.7 [98;99.2]
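The statement that the variance of the assignment grows linearly with sample size under BP but quadratically under UP and ZOP can be verified exactly for the count nij of one ordered haplotype among m copies of an unordered haplotype. The sketch below is an illustration with hypothetical function names; it enumerates each distribution with exact fractions and prints the variances for p = 1/2.

```python
from fractions import Fraction
from math import comb


def variance(pmf):
    """Exact variance of a distribution given as {value: probability}."""
    mu = sum(k * w for k, w in pmf.items())
    return sum((k - mu) ** 2 * w for k, w in pmf.items())


def pmf_binomial(m, p):
    """BP: n_ij follows a binomial distribution with parameters m and p."""
    return {k: comb(m, k) * p**k * (1 - p) ** (m - k) for k in range(m + 1)}


def pmf_uniform(m):
    """UP: n_ij is uniform on 0..m."""
    return {k: Fraction(1, m + 1) for k in range(m + 1)}


def pmf_zero_one(m, p):
    """ZOP: n_ij equals m with probability p and 0 otherwise."""
    return {m: p, 0: 1 - p}


half = Fraction(1, 2)
for m in (10, 20, 40):
    print(
        m,
        variance(pmf_binomial(m, half)),  # m / 4
        variance(pmf_uniform(m)),         # m * (m + 2) / 12
        variance(pmf_zero_one(m, half)),  # m * m / 4
    )
```

Doubling m doubles the BP variance but multiplies the ZOP variance by four, and the UP variance by nearly four for large m, consistent with the ordering of Mantel statistic variances noted in the text.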
Correlation between Assignment Methods
Table 2 gives the empirical frequency distribution of
Mantel statistics for pairs of distance matrices generated
under various assignment methods. The values observed
yield significant Mantel test results with p < 10^-4: whatever
the F-statistics or distance considered, results are
strongly correlated when pairs of assignment priors are
compared. However it is interesting to note that dis-
tances may be more conservative than F-statistics. The
Average Squared Distance (ASD) for example remains
nearly unchanged under all four assignment priors con-
sidered, since the Mantel correlation is fairly high (0.97)
for all simulated pairs of matrices under two different
methods (Table 2).
(δμ)2 estimates are also highly correlated for pairs
of AP, BP and UP priors (Mantel statistics >0.95) and
less correlated when ZOP is compared with one of
the previous priors. This pattern is enhanced for both
F-statistics. As for gene diversity estimates, this is likely
to be due to the higher variance underlying the zero-
one distribution, as mentioned in the description of
the prior. Since all Mantel tests were highly significant,
these differences are likely to have an effect on the
description of fine-scale structures, but not on medium
or large scale ones.
Discussion
In this paper, we ask how different assignment methods
affect population diversity/differentiation estimates us-
ing duplicated microsatellites. We assume that for a given
unordered hetero-allelic haplotype the two reverse or-
dered haplotypes are likely to appear in the sample. We
consider an a priori distribution for the ordered hap-
lotypic data D, given unordered data, so that we can
express the a posteriori distribution for any statistics θ as
$$P(\theta \mid \text{unordered } D) = \sum_{\text{ordered } D} P(\theta \mid \text{ordered } D)\, P(\text{ordered } D \mid \text{unordered } D).$$
We investigate three a priori distributions: binomial
(BP), uniform (UP) and zero-one (ZOP, identical to
Mathias et al. (1994)’s prior at the intra-population level),
and compare the a posteriori estimates with the results
obtained under Mathias et al. (1994)’s scheme (AP).
Our first result is that gene diversity estimates can
be strongly affected by misspecification and the way or-
dered haplotypes are drawn from unordered data. In any
sample containing hetero-allelic haplotypes, Mathias et al.
(1994)’s prior generates downward-biased estimates of
the actual gene diversity. Although the extent of the
bias ΔH may generally be small (especially in population
samples with high gene diversity), it can reach
values as high as 0.3. We have shown that the magni-
tude of the mean difference is strongly linked to the
proportion of homo-allelic haplotypes in the sample.
As this proportion varies drastically among populations,
they are affected differently by the order misspecifica-
tion. Although the regression analysis has shown that a
proportion of such haplotypes above 30% ensures a bias
of less than 5%, about one third of populations lie be-
low this threshold, and exhibit bias. While extreme bias
values are unlikely, we recommend the use of a random
method, such as our uniform assignment prior (UP), to
compute confidence intervals instead of considering the
observed gene diversity in the sample.
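The downward bias of the arbitrary scheme can be illustrated with a short simulation. The sketch below computes Nei's (1987) gene diversity on ordered haplotypes, once under the shorter-allele-first scheme and repeatedly under a random orientation. The random scheme shown here orients each haplotype with an independent fair coin flip; this is an assumption made for illustration and may not match the exact definition of the priors used in the paper. All function names and the toy sample are hypothetical.

```python
import random
from collections import Counter


def gene_diversity(haplotypes):
    """Nei's (1987) unbiased gene diversity: n/(n-1) * (1 - sum of p_i^2)."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    return n / (n - 1) * (1 - sum((c / n) ** 2 for c in counts.values()))


def order_shortest_first(unordered):
    """Arbitrary scheme: shorter allele to locus 1, longer allele to locus 2."""
    return [tuple(sorted(h)) for h in unordered]


def order_at_random(unordered, rng):
    """Illustrative random scheme: orient each haplotype by a fair coin flip."""
    return [tuple(h) if rng.random() < 0.5 else tuple(reversed(h))
            for h in unordered]


def simulate_diversity(unordered, replicates=1000, seed=0):
    """Return the mean and an approximate 95% interval of simulated diversity."""
    rng = random.Random(seed)
    draws = sorted(gene_diversity(order_at_random(unordered, rng))
                   for _ in range(replicates))
    mean = sum(draws) / replicates
    lower = draws[int(0.025 * replicates)]
    upper = draws[int(0.975 * replicates) - 1]
    return mean, (lower, upper)


# Toy sample (hypothetical): six {12, 13}, two {12, 12} and two {11, 14}.
sample = [(12, 13)] * 6 + [(12, 12)] * 2 + [(11, 14)] * 2
h_arbitrary = gene_diversity(order_shortest_first(sample))
h_mean, h_interval = simulate_diversity(sample)
```

Because the shorter-allele-first scheme merges the two possible orientations of every hetero-allelic haplotype into a single class, any other orientation can only split classes further, so the simulated values are never lower than the value obtained under the arbitrary scheme.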
Our second result is that order misspecification is not
likely to have a significant effect on the estimates of
F-statistics or distances among population samples (al-
though it affects them). Simulated results appear very
close to each other (as judged by highly significant Man-
tel tests) under all possible random assignment methods
considered. Therefore we can conclude that the actual
values estimated from ordered data must also be close
to the simulated values and can be satisfactorily approx-
imated this way. As far as population structure is con-
cerned, the absolute values of F-statistics and distances
are not as important as their relative values and order.
Comparing matrices of pairwise F-statistics and/or dis-
tances, we have noticed that in all cases our results seem
independent of the assignment prior. When pairs of F-
statistics and/or distances are compared, mean Mantel
statistics are comparable but variances increase from BP
to ZOP and UP, in the same way as do variances of the
corresponding assignment priors (the variance of BP is
proportional to the sample size, while it is proportional
to the squared sample size for ZOP and UP). Compar-
ing pairs of assignment priors, F-statistics and distances
are more or less conserved: Average Squared Distance
(ASD) estimates are highly conserved, while F-statistics
(more sensitive to the reconstruction methods) are more
variable, though not significantly so.
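As a small worked illustration of how haplotype order enters a distance estimate, the sketch below computes an average squared distance between two samples of ordered two-locus haplotypes, taken here as the mean squared difference in repeat number over all between-population pairs of chromosomes, averaged over loci. This simplified form, the function name `average_squared_distance` and the toy samples are assumptions made for illustration.

```python
def average_squared_distance(pop_a, pop_b):
    """Mean over loci of the mean squared allele-size difference
    between all pairs of chromosomes drawn from the two samples."""
    n_loci = len(pop_a[0])
    n_pairs = len(pop_a) * len(pop_b)
    total = 0.0
    for locus in range(n_loci):
        total += sum((x[locus] - y[locus]) ** 2
                     for x in pop_a for y in pop_b) / n_pairs
    return total / n_loci


# Toy samples (hypothetical): the same unordered data, two possible orderings.
pop_1 = [(12, 13), (12, 13), (11, 14)]
pop_2_sorted = [(12, 13), (12, 14)]    # shorter allele assigned to locus 1
pop_2_flipped = [(13, 12), (12, 14)]   # first haplotype in reverse order

asd_sorted = average_squared_distance(pop_1, pop_2_sorted)
asd_flipped = average_squared_distance(pop_1, pop_2_flipped)
```

Comparing `asd_sorted` with `asd_flipped` shows that a single reversed haplotype changes the estimate, which is why the stability of the full distance matrices across priors has to be checked empirically, as done here with Mantel tests.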
Importantly, the choice of assignment method does
not appear to affect the estimates of relative F-statistics
or genetic distances between populations. Therefore
when analysing population structure, one can expect
little bias due to order misspecification in haplotypic
data. Finally, we have also considered the fact that al-
lelic states at both loci are correlated, based on Kit-
tler et al. (2003). Since these duplicated microsatellites
are located in palindromic sequences they probably un-
dergo gene conversion – an idea supported by the ob-
servation that, for DYS385a/b, the mutational process
at the first locus does not seem to be independent of the
second locus (Kittler et al. 2003). To take into account
the phylogenetic relationship existing between the indi-
viduals within each sample (as haplogroup information
does), we have introduced moderate levels of correlation
among loci (p < 0.5). In consequence, when a {12, 13}
haplotype is generated in population 1, the chance
of simulating another {12, 13} is higher than that of generating
a {13, 12}. When we estimate the F-statistics, we
obtain results very close to the previous ones (not de-
veloped here).
We have shown that intra-population parameters are
more likely to be affected by assignment priors than
are inter-population parameters (F-statistics or genetic
distances). The experimental determination of ordered
haplotypes within large duplicons is not routine, and as
long as this remains the case, it seems worthwhile taking
the ambiguity of haplotype order into account. Com-
puting confidence intervals using one of the methods
presented here would both prevent misinterpretations
and retain all potential information carried by dupli-
cated microsatellites.
The results presented here are based on P8-linked
microsatellites only, and do not consider that the on-
going molecular mechanisms may be different within
each palindrome. We know, for example, that DYS385
is likely to undergo duplication/deletion events, as 29
duplication events have been reported in the YHRD
(3 or 4 copies of DYS385). The susceptibility of Y-STRs
to gene conversion or to other molecular mechanisms
may vary according to the palindromes present – or more
generally to the large repeated sequences involved. The
different molecular structures (Skaletsky et al. 2003) and
different evolutionary histories (Rozen et al. 2003) of
the eight palindromes may be crucial in that respect.
For this reason it would be interesting to extend our re-
sults, both to other palindromes and to other population
genetic parameters (for example, TMRCA estimates).
Y-chromosomal haplogroups are strongly geographically
differentiated, and can also be associated with differ-
ent arrangements of palindromic sequences; it would
therefore also be worth considering haplogroup infor-
mation to identify population-specific effects on these
estimates.
Acknowledgments
The authors thank Laurent Excoffier for providing valuable
comments on a draft manuscript and Mark A. Jobling for
stimulating discussions and his insightful comments. We are
grateful to G. Corti for her invaluable assistance and N. Poulet
for his help with graphics. The authors would like to thank the
two reviewers for their helpful comments on the manuscript.
Electronic References
Y-chromosome Haplotype Reference Database
(YHRD): http://ystr.org/index.html Insightful
References
Balaresque, P., Manni, F., Dugoujon, J. M., Crouau-Roy, B. &
Heyer, E. (2006) Estimating sex-specific processes in human
populations: Are XY-homologous markers an effective tool?
Heredity 96, 214–21.
Butler, J. M., Decker, A. E., Kline, M. C. & Vallone,
P. M. (2005) Chromosomal duplications along the Y-
chromosome and their potential impact on Y-STR inter-
pretation. J Forensic Sci 50, 853–859.
Cruciani, F., Santolamazza, P., Shen, P. D., Macaulay,
V., Moral, P., Olckers, A., Modiano, D., Holmes, S.,
Destro-Bisol, G., Coia, V., Wallace, D. C., Oefner, P. J.,
Torroni, A., Cavalli-Sforza, L. L., Scozzari, R. & Under-
hill, P. A. (2002) A back migration from Asia to sub-Saharan
Africa is supported by high-resolution analysis of human
Y-chromosome haplotypes. Am J Hum Genet 70, 1197–
1214.
Goldstein, D. B., Linares, A. R., Cavalli-Sforza, L. L. &
Feldman, M. W. (1995) An Evaluation of Genetic Distances
for Use with Microsatellite Loci. Genetics 139, 463–471.
Hurles, M. E. & Jobling, M. (2003) A singular chromosome.
Nat Genet 34, 246–247.
Kayser, M., Brauer, S., Schadlich, H., Prinz, M., Batzer, M.
A., Zimmerman, P. A., Boatin, B. A. & Stoneking, M.
(2003) Y chromosome STR haplotypes and the genetic
structure of U.S. populations of African, European, and
Hispanic ancestry. Genome Res 13, 624–34.
Kayser, M., Caglia, A., Corach, D., Fretwell, N., Gehrig, C.,
Graziosi, G., Heidorn, F., Herrmann, S., Herzog, B., Hid-
ding, M., Honda, K., Jobling, M., Krawczak, M., Leim,
K., Meuser, S., Meyer, E., Oesterreich, W., Pandya, A.,
Parson, W., Penacino, G., Perez-Lezaun, A., Piccinini, A.,
Prinz, M., Schmitt, C., Roewer, L. et al. (1997) Evaluation
of Y-chromosomal STRs: a multicenter study. Int J Legal
Med 110, 125–33.
Kittler, R., Erler, A., Brauer, S., Stoneking, M. & Kayser,
M. (2003) Apparent intrachromosomal exchange on the
human Y chromosome explained by population history.
Eur J Hum Genet 11, 304–314.
Malaspina, P., Cruciani, F., Santolamazza, P., Torroni, A.,
Pangrazio, A., Akar, N., Bakalli, V., Brdicka, R., Jaruzelska,
J., Kozlov, A., Malyarchuk, B., Mehdi, S. Q., Michalodim-
itrakis, E., Varesi, L., Memmi, M. M., Vona, G., Villems,
R., Parik, J., Romano, V., Stefan, M., Stenico, M.,
Terrenato, L., Novelletto, A. & Scozzari, R. (2000) Pat-
terns of male-specific inter-population divergence in Eu-
rope, West Asia and North Africa. Ann Hum Genet 64,
395–412.
Mantel, N. (1967) The detection of disease clustering and a
generalized regression approach. Cancer Res 27, 209–20.
Mathias, N., Bayes, M. & Tyler-Smith, C. (1994) Highly in-
formative compound haplotypes for the human Y chromo-
some. Hum Mol Genet 3, 115–23.
Nei, M. (1987) Molecular Evolutionary Genetics. Columbia
University Press, New York.
Quintana-Murci, L., Semino, O., Poloni, E. S., Liu, A.,
Van Gijn, M., Passarino, G., Brega, A., Nasidze, I. S.,
Maccioni, L., Cossu, G., Al-Zahery, N., Kidd, J. R.,
Kidd, K. K. & Santachiara-Benerecetti, A. S. (1999) Y-
chromosome specific YCAII, DYS19 and YAP polymor-
phisms in human populations: a comparative study. Ann
Hum Genet 63, 153–166.
Rozen, S., Skaletsky, H., Marszalek, J. D., Minx, P. J.,
Cordum, H. S., Waterston, R. H., Wilson, R. K. & Page,
D. C. (2003) Abundant gene conversion between arms of
palindromes in human and ape Y chromosomes. Nature
423, 873–876.
Scozzari, R., Cruciani, F., Malaspina, P., Santolamazza, P.,
Ciminelli, B. M., Torroni, A., Modiano, D., Wallace, D. C.,
Kidd, K. K., Olckers, A., Moral, P., Terrenato, L., Akar, N.,
Qamar, R., Mansoor, A., Mehdi, S. Q., Meloni, G., Vona,
G., Cole, D. E., Cai, W. & Novelletto, A. (1997) Differential
structuring of human populations for homologous X and
Y microsatellite loci. Am J Hum Genet 61, 719–33.
Scozzari, R., Cruciani, F., Santolamazza, P., Malaspina, P.,
Torroni, A., Sellitto, D., Arredi, B., Destro-Bisol, G.,
De Stefano, G., Rickards, O., Martinez-Labarga, C.,
Modiano, D., Biondi, G., Moral, P., Olckers, A., Wallace,
D. C. & Novelletto, A. (1999) Combined use of biallelic
and microsatellite Y-chromosome polymorphisms to infer
affinities among African populations. Am J Hum Genet 65,
829–46.
Skaletsky, H., Kuroda-Kawaguchi, T., Minx, P. J.,
Cordum, H. S., Hillier, L., Brown, L. G., Repping, S., Pyn-
tikova, T., Ali, J., Bieri, T., Chinwalla, A., Delehaunty, A.,
Delehaunty, K., Du, H., Fewell, G., Fulton, L., Fulton, R.,
Graves, T., Hou, S. F., Latrielle, P., Leonard, S., Mardis, E.,
Maupin, R., McPherson, J., Miner, T., Nash, W., Nguyen,
C., Ozersky, P., Pepin, K., Rock, S., Rohlfing, T., Scott,
K., Schultz, B., Strong, C., Tin-Wollam, A., Yang, S. P.,
Waterston, R. H., Wilson, R. K., Rozen, S. & Page, D. C.
(2003) The male-specific region of the human Y chromo-
some is a mosaic of discrete sequence classes. Nature 423,
825–37.
Slatkin, M. (1995) A Measure of Population Subdivision Based
on Microsatellite Allele Frequencies. Genetics 139, 457–462.
Wright, S. (1951) The genetical structure of populations. Ann
Eugenics 15, 323–354.
Appendix I: Haplotype frequencies of YCAIII (DYS413ab) polymorphism in 33 Human populations
Received: 3 March 2006
Accepted: 7 July 2006
© 2006 The Authors. Journal compilation © 2006 University College London
Annals of Human Genetics (2006) 71, 209–219