Estimating population structure from AFLP amplification intensity

MATTHIEU FOLL,*† MARTIN C. FISCHER,*† GERALD HECKEL*† and LAURENT EXCOFFIER*†
*Computational and Molecular Population Genetics (CMPG), Institute of Ecology and Evolution, University of Bern, Baltzerstrasse 6, CH-3012 Bern, Switzerland, †Swiss Institute of Bioinformatics, Genopole, 1015 Lausanne, Switzerland

Abstract

In the last decade, amplified fragment length polymorphisms (AFLPs) have become one of the most widely used molecular markers to study the genetic structure of natural populations. Most of the statistical methods available to study the genetic structure of populations using AFLPs consider these markers as dominant and are thus unable to distinguish between individuals being heterozygous or homozygous for the dominant allele. Some attempts have been made to treat AFLPs as codominant markers by using AFLP band intensities to infer the most likely genotype of each individual. These two approaches have some drawbacks, the former discarding potentially valuable information and the latter being sometimes unable to correctly assign genotypes to individuals. In this study, we propose an alternative likelihood-based approach, which does not attempt to infer the genotype of each individual, but rather incorporates the uncertainty about genotypes into a Bayesian framework leading to the estimation of population-specific FIS and FST coefficients. We show with simulations that the accuracy of our method is much higher than that of a method using AFLPs as dominant markers and is generally close to what would be obtained by using the same number of single-nucleotide polymorphism (SNP) markers. The method is applied to a data set of four populations of the common vole (Microtus arvalis) from Grisons in Switzerland, for which we obtained 562 polymorphic AFLP markers. Our approach is very general and has the potential to make AFLP markers as useful as SNP data for nonmodel species.

Keywords: amplified fragment length polymorphism, Bayesian statistics, F-statistics, Markov chain Monte Carlo, Microtus arvalis, population structure

Received 7 May 2010; revision received 23 June 2010; accepted 23 June 2010

Correspondence: Matthieu Foll, Fax: +41 31 631 48 88; E-mail: [email protected]

© 2010 Blackwell Publishing Ltd

Molecular Ecology (2010) doi: 10.1111/j.1365-294X.2010.04820.x

Introduction

Many if not most species live in a subdivided habitat and cannot be considered as a single panmictic population (Waples & Gaggiotti 2006). In this case, their genetic diversity is structured into components within and between local populations, also called demes or subpopulations, and the global system is generally described as a metapopulation. Such spatial structuring has important implications for the evolution of species, and its study is fundamental for applications in conservation biology and genetic epidemiology. This problem was recognized early on by Sewall Wright, who proposed to quantify genetic structure using F-statistics (Wright 1951), namely FIS, FST and FIT. Loosely speaking, FIS represents the shared ancestry between alleles of an individual relative to the population, and it is also called the local inbreeding coefficient. FST represents the shared ancestry within the population relative to the whole metapopulation, and it is usually used to measure the degree of differentiation among populations (see Holsinger & Weir 2009 for a review). Finally, FIT represents the shared ancestry between alleles of an individual relative to the metapopulation, and it provides an overall measure of inbreeding. FST is also the proportion of the total genetic variance explained by differences among populations, which has become its prevailing interpretation among molecular ecologists. This has resulted in the adoption of a single estimate of FST as the standard approach to measuring genetic differentiation, but this approach ignores the fact that in most if not all realistic situations, local populations differ in their effective sizes and migration rates, which makes it worthwhile to estimate population-specific FSTs (see Gaggiotti & Foll 2010 for a detailed review on this subject). The idea of estimating population-specific FSTs was introduced by Balding & Nichols (1995), and Rannala & Hartigan (1996) as well as Balding (2003) proposed a general framework to rigorously define all F-statistics using the beta-binomial model proposed by Balding & Nichols (1995). This new formulation, also called the F-model (Falush et al. 2003), and in particular its multiallelic version, the multinomial Dirichlet, has been revisited many times in the recent statistical genetics literature (Balding 2003; Falush et al. 2003; Beaumont & Balding 2004; Foll & Gaggiotti 2006; Faubet & Gaggiotti 2008; Foll et al. 2008; Guillot 2008), but mostly in the context of methods aiming at identifying outlier loci or estimating migration rates.

Among the wide variety of molecular markers available, the use of codominant markers such as allozymes, microsatellites or SNPs leads to clearly distinguishable genotypes, which can be readily analysed using existing software (see Excoffier & Heckel 2006). Conversely, the use of dominant markers is problematic because of the difficulty in distinguishing between individuals that are heterozygous or homozygous for the dominant allele. Nevertheless, they have become very popular in the last decade, mostly because of the development of the amplified fragment length polymorphism (AFLP) technique, a relatively inexpensive and convenient way to quickly obtain a large number of genetic markers from a wide variety of organisms (Bensch & Akesson 2005; Meudt & Clarke 2007). In a given individual, an AFLP marker is assayed by measuring the concentration of PCR products of a specific DNA fragment. Consequently, the raw data consist of a quantitative measure of the degree of amplification of each fragment (called band intensity hereafter), but most of the time, AFLP markers are simply scored as being present or absent (if band intensity is lower than an arbitrary threshold) and therefore coded as a dominant marker. This leads to difficulties when inferring population genetics parameters of interest, especially those based on the estimation of allele frequencies (such as heterozygosity or F-statistics; see Bonin et al. 2007 for a review). One possible solution is to assume Hardy–Weinberg equilibrium and absence of inbreeding to estimate allele frequencies (Lynch & Milligan 1994; Zhivotovsky 1999; Hill & Weir 2004). Recent studies have tried to relax these assumptions and to co-estimate FIS and FST, but available methods are either biased or associated with a large variance (Holsinger et al. 2002; Foll et al. 2008).

In principle, band intensity is expected to be stronger for individuals homozygous for the dominant allele than for heterozygotes, because the amount of PCR products should be about twice as large. Using this property, a few attempts have been made to use AFLPs as codominant markers (Piepho & Koch 2000; Gort & Van Eeuwijk 2010). However, various stochastic factors lead to a large variance in band intensity among individuals of the same genotype. This variance can even be so large that the distributions of the three genotype classes overlap, making it difficult to determine the genotype of individuals. To overcome this problem, Piepho & Koch (2000) proposed a mixture model to infer for each individual the posterior probability of the three possible genotypes. This approach has been recently extended and studied in more detail by Gort & Van Eeuwijk (2010). However, for many individuals, the odds in favour of any genotype are very close to 1, leading to unreliable genotype assignments (see Table 4 in Piepho & Koch 2000 or Fig. 1 in Gort & Van Eeuwijk 2010). It therefore does not seem possible to accurately infer the genotype of every individual at every locus, and consequently to use AFLPs as codominant markers. However, using these markers as purely dominant, as in the majority of AFLP studies, discards some information that may be valuable.

Here, we propose a different approach: instead of trying to determine the genotype of each individual (as in Piepho & Koch 2000), we incorporate the distribution of band intensities in each population into a more general model. To do this, this distribution is modelled as a statistical mixture similar to the approach proposed by Piepho & Koch (2000), and it is then used directly in the Bayesian F-model mentioned earlier. In the following, we first present the implementation of our Bayesian formulation for the estimation of FIS and FST from AFLP data, and we then demonstrate the strength of this framework compared to previous approaches using simulations. We finally apply this method to estimate population-specific FIS and FST coefficients in four populations of the common vole (Microtus arvalis) from Grisons in Switzerland, based on a large set of AFLP markers.

Fig. 1 Directed acyclic graph (DAG) of the model given in Eqn 2. Square nodes denote known quantities (i.e. data) and circles represent parameters to be estimated. Lines between nodes represent direct stochastic relationships within the model. Lowercase letters at the end of the arrows correspond to the indices used for the corresponding vector in the text, namely i for loci and j for populations. The variables within each node correspond to the different model parameters discussed in the text.

Materials and methods

Bayesian model

We consider a set of J populations for which we scored I AFLP loci. Each population j is made up of $K_j$ individuals. For each individual k in population j, a band intensity $y_{ijk}$ is measured at each locus i. We denote by A and a the dominant and recessive alleles, respectively, and we assume that we observe a null band intensity ($y_{ijk} = 0$) for the genotype aa. In practice, we generally consider that an individual has genotype aa when $y_{ijk} < a_i$, with $a_i$ defined as a fraction (10% in our case, see below) of the highest observed value at locus i. For genotypes AA and Aa, we assume that band intensity values follow some continuous distribution. Scoring errors or homoplasy (the presence of more than one fragment with the same length but different sequence) can lead to individual values of band intensity that are extreme outliers, and the use of a normal distribution in our mixture model of band intensities would not allow us to accommodate these extreme values. We therefore propose to use Cauchy distributions, which are heavy-tailed, in our mixture model (see below). Like the normal distribution, the Cauchy distribution is bell-shaped and is defined with a location parameter corresponding to the mode and a scale parameter defining the half-width at half-maximum. More precisely, we take:

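$$y_{ijk} \mid Aa, \mu_i, \sigma_{1i} \sim \mathrm{Cauchy}(\mu_i, \sigma_{1i})$$

$$y_{ijk} \mid AA, \mu_i, \delta_i, \sigma_{2i} \sim \mathrm{Cauchy}(\mu_i + \delta_i, \sigma_{2i})$$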

Because the band intensity of dominant homozygotes should be about twice as large as that of heterozygotes, one could use $\delta_i = \mu_i$, but to allow for additional variability, we chose $\delta_i$ to follow the prior distribution

$$\delta_i \mid \mu_i \sim N\left(\mu_i, \frac{\mu_i}{10}\right)$$
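As a minimal sketch of why a Cauchy component copes better with outliers than a normal one, the Python snippet below checks two properties mentioned in the text: the scale parameter is the half-width at half-maximum, and an observation far from the mode keeps a non-negligible density. It is not part of the original C++ program; the function names and the numerical values (a mode of 0.5 and a scale of 0.1 on a normalized intensity scale, and a normal standard deviation equal to the Cauchy scale) are illustrative assumptions.

```python
import math

def cauchy_pdf(y, loc, scale):
    """Cauchy density with mode `loc` and half-width at half-maximum `scale`."""
    z = (y - loc) / scale
    return 1.0 / (math.pi * scale * (1.0 + z * z))

def normal_pdf(y, mean, sd):
    """Normal density, used here only for comparison."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

mu, sigma = 0.5, 0.1                      # illustrative values (assumption)
peak = cauchy_pdf(mu, mu, sigma)          # density at the mode
half = cauchy_pdf(mu + sigma, mu, sigma)  # density one scale unit away

# An outlier ten scale units from the mode: how much more density does the
# Cauchy component give it than a normal component with sd = sigma?
outlier = mu + 10 * sigma
tail_ratio = cauchy_pdf(outlier, mu, sigma) / normal_pdf(outlier, mu, sigma)
```

Under a normal component such an outlier would contribute an almost zero likelihood and could dominate the fit, whereas the Cauchy component absorbs it.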

The joint distribution of individuals with the three genotypes is then a mixture of the three distributions, with mixing proportions equal to the genotype frequencies. At the population level, we can view $F_{IS}^{j}$ as the probability of sampling an individual inbred for a particular locus i, and we can write the genotype frequencies as a function of the allele frequencies $\tilde{p}_{ij}$ as

$$\begin{aligned}
P_{AA}^{ij} &= \tilde{p}_{ij}^{2} + F_{IS}^{j}\,\tilde{p}_{ij}(1-\tilde{p}_{ij}) \\
P_{Aa}^{ij} &= (1-F_{IS}^{j})\,2\tilde{p}_{ij}(1-\tilde{p}_{ij}) \\
P_{aa}^{ij} &= (1-\tilde{p}_{ij})^{2} + F_{IS}^{j}\,\tilde{p}_{ij}(1-\tilde{p}_{ij})
\end{aligned} \tag{1}$$

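Then, the likelihood of $y_{ijk}$ can be written as:

$$\begin{aligned}
L_{ijk} &= f(y_{ijk} \mid F_{IS}^{j}, \tilde{p}_{ij}, \mu_i, \delta_i, \sigma_{1i}, \sigma_{2i}) \\
&= P_{aa}^{ij}\,\mathbf{1}_0(y_{ijk}) + P_{Aa}^{ij}\,\varphi(y_{ijk}; \mu_i, \sigma_{1i}) + P_{AA}^{ij}\,\varphi(y_{ijk}; \mu_i + \delta_i, \sigma_{2i}),
\end{aligned}$$

where $\varphi$ is the probability density function of the Cauchy distribution and

$$\mathbf{1}_0(y_{ijk}) = \begin{cases} 1 & \text{if } y_{ijk} = 0 \\ 0 & \text{otherwise} \end{cases}$$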

As band intensity measures at different loci and for different subpopulations are assumed to be independent, the joint likelihood is given by:

$$L = \prod_{i=1}^{I} \prod_{j=1}^{J} \prod_{k=1}^{K_j} L_{ijk}$$
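The likelihood above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' C++ implementation: the function names, the nested-list data layout (`y[i][j]` holds the intensities of population j at locus i) and the example values are hypothetical, and summing logarithms instead of multiplying the $L_{ijk}$ terms is a numerical choice made here to avoid underflow.

```python
import math

def genotype_frequencies(p, f_is):
    """Eqn 1: frequencies of AA, Aa and aa for allele frequency p and inbreeding f_is."""
    het = p * (1.0 - p)
    return (p * p + f_is * het,            # AA
            (1.0 - f_is) * 2.0 * het,      # Aa
            (1.0 - p) ** 2 + f_is * het)   # aa

def cauchy_pdf(y, loc, scale):
    z = (y - loc) / scale
    return 1.0 / (math.pi * scale * (1.0 + z * z))

def obs_likelihood(y, p, f_is, mu, delta, s1, s2):
    """L_ijk: indicator term for aa plus Cauchy components for Aa and AA."""
    p_AA, p_Aa, p_aa = genotype_frequencies(p, f_is)
    absent = 1.0 if y == 0 else 0.0
    return (p_aa * absent
            + p_Aa * cauchy_pdf(y, mu, s1)
            + p_AA * cauchy_pdf(y, mu + delta, s2))

def joint_log_likelihood(y, p_tilde, f_is, mu, delta, s1, s2):
    """log L: sum of log L_ijk over loci i, populations j and individuals k."""
    total = 0.0
    for i, locus in enumerate(y):
        for j, pop in enumerate(locus):
            for y_ijk in pop:
                total += math.log(obs_likelihood(
                    y_ijk, p_tilde[i][j], f_is[j], mu[i], delta[i], s1[i], s2[i]))
    return total

# Hypothetical example: one locus, two populations of three individuals each.
y = [[[0.0, 0.45, 0.9], [0.5, 0.0, 1.0]]]
ll = joint_log_likelihood(y, p_tilde=[[0.4, 0.6]], f_is=[0.2, 0.1],
                          mu=[0.5], delta=[0.5], s1=[0.1], s2=[0.1])
```

Note that no genotype is ever assigned to an individual: each observation contributes a weighted sum over the three possible genotypes.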

As our likelihood function depends on allele frequencies in each population at each locus, we used a Beta prior to model population structure (see Balding 2003 for more details):

$$\tilde{p}_{ij} \sim \mathrm{Beta}\left(\frac{1-F_{ST}^{j}}{F_{ST}^{j}}\,p_i,\; \frac{1-F_{ST}^{j}}{F_{ST}^{j}}\,(1-p_i)\right),$$

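where $F_{ST}^{j}$ measures the extent of differentiation between population j and a common migrant pool, and $p_i$ is the overall allele frequency at locus i. Under this prior, $\tilde{p}_{ij}$ has mean $p_i$ and variance $F_{ST}^{j}\,p_i(1-p_i)$.

The full Bayesian model represented by the directed acyclic graph (DAG) shown in Fig. 1 is given by

$$\pi(\tilde{p}, \mu, \delta, \sigma_1, \sigma_2, F_{IS}, p, F_{ST} \mid Y) \propto L \times \pi(\tilde{p} \mid p, F_{ST}) \times \pi(\delta \mid \mu) \times \pi(F_{IS}) \times \pi(\mu) \times \pi(\sigma_1) \times \pi(\sigma_2) \times \pi(F_{ST}) \times \pi(p) \tag{2}$$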

We used a uniform prior between 0 and 1 for each $F_{IS}^{j}$ and $F_{ST}^{j}$ coefficient, as well as for each allele frequency $p_i$. We normalized band intensity at each locus to have $\max_{j,k}(y_{ijk}) = 1$, and we also used a uniform prior between 0 and 1 for each $\mu_i$ parameter. Finally, for each $\sigma_{1i}$ and $\sigma_{2i}$ parameter, we used a gamma prior Gamma(1, 0.05).

The estimation of model parameters was performed with a Markov chain Monte Carlo (MCMC) algorithm. We evaluated the convergence of the method using the diagnostic tests implemented in the R BOA package (Smith 2005). The tests indicated that a burn-in of 100 000 iterations is necessary to reach convergence in most situations. We used a sample size of 10 000 and a thinning interval of 20, as suggested by the autocorrelation analysis. With these parameter values, the total length of the chain after burn-in was 200 000 iterations. The method has been implemented in a program written in C++. Proposal distributions need to be adjusted to have acceptance rates between 0.25 and 0.45. Indeed, if we propose values over a very wide interval, most moves will be rejected as they may correspond to areas of low posterior probability. On the other hand, if we propose values very close to the current one, the move will almost always be accepted, but the chain will take a long time to explore the whole parameter space (poor mixing). The proposal distributions are automatically tuned by our program by performing a series of 20 short pilot runs of 2000 iterations each and adjusting the proposal distribution for each parameter. The pilot runs are also used as a burn-in.
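The pilot-run tuning described above can be illustrated with a one-parameter toy example. The sketch below is not the authors' C++ code: the target density (a Beta(3, 7) shape standing in for a posterior on the unit interval), the uniform random-walk proposal, the multiplicative factor of 1.5 and all function names are assumptions made for illustration. Only the 20 pilot runs of 2000 iterations and the 0.25 to 0.45 acceptance window come from the text.

```python
import math
import random

def log_target(x):
    """Toy log-posterior on (0, 1): Beta(3, 7) up to a constant (stand-in only)."""
    if not 0.0 < x < 1.0:
        return -math.inf
    return 2.0 * math.log(x) + 6.0 * math.log(1.0 - x)

def run_chain(x, width, n_iter, rng):
    """Random-walk Metropolis; returns the final state and the acceptance rate."""
    accepted = 0
    lp = log_target(x)
    for _ in range(n_iter):
        cand = x + rng.uniform(-width, width)
        lp_cand = log_target(cand)
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < lp_cand - lp:
            x, lp = cand, lp_cand
            accepted += 1
    return x, accepted / n_iter

def adjust_width(width, acc_rate, low=0.25, high=0.45):
    """Narrow the proposal if too few moves are accepted, widen it if too many."""
    if acc_rate < low:
        return width / 1.5
    if acc_rate > high:
        return width * 1.5
    return width

def tune(x0=0.5, width=1.0, n_pilots=20, pilot_len=2000, seed=1):
    """Run the pilot runs in sequence, so that they also serve as burn-in."""
    rng = random.Random(seed)
    x = x0
    for _ in range(n_pilots):
        x, acc = run_chain(x, width, pilot_len, rng)
        width = adjust_width(width, acc)
    return x, width

state, tuned_width = tune()
```

In a multi-parameter model such as the one described here, one proposal width per parameter would be tuned in the same way.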

Simulation study

We evaluated the performance of our method to estimate F-statistics with simulated data. We used the same statistical model as assumed by our method (the inference model), which allowed us to study the effect of the quality of the samples (number of loci, sample sizes, number of populations, etc.) on the accuracy of the estimations. Four different data types were simulated: (i) binary AFLP matrix (absent/present), (ii) low-quality AFLP band intensity matrix with $\mu = \delta = 50$, $\sigma_1 = \sigma_2 = 20$ (before normalization), (iii) high-quality band intensity matrix with $\mu = \delta = 50$, $\sigma_1 = \sigma_2 = 10$ and (iv) codominant SNP matrix where the genotype of each individual is thus known without error. Figure 2 shows example distributions of a mixture of two Cauchy distributions for low-quality (dotted line) and high-quality (solid line) AFLP band intensities. For each scenario, 100 replicates were simulated, which allowed us to calculate bias and variance of the posterior mean for FIS and FST coefficients.

Fig. 2 Mixture of two Cauchy distributions used to simulate synthetic band intensity data in our sensitivity study (x-axis: band intensity, from 0 to 200; y-axis: density). Solid and dashed lines correspond to what we defined as 'high-quality' and 'low-quality' AFLP data, respectively. Mixture proportions are here set arbitrarily to 50% of Aa and 50% of AA genotypes.

We first chose a set of default values for the key parameters of the inference model (see Table 1) and used this first scenario to compare the accuracy of our method based on band intensity with previous approaches considering AFLPs as dominant binary data (presence or absence): (i) the ABC method proposed by Foll et al. (2008) and (ii) the MCMC method first proposed by Holsinger et al. (2002) and shown by Foll et al. (2008) to generally provide unreliable estimates of FIS but better estimates of FST. We also compared the results of the AFLP analyses to those obtained for SNP data, which can be considered as a particular case of our model where $\sigma_1$ and $\sigma_2 \to 0$.

Table 1 Parameters of the simulated reference scenario (scenario 1)

| Parameters | Values |
| --- | --- |
| Number of loci (I) | 50 |
| FIS | 0.2 |
| FST | 0.1 |
| Number of populations (J) | 5 |
| Number of individuals (Kj) | 30 |

We then studied the effect of key model parameters on the accuracy of the estimations. To do this, we modified the parameters of the reference scenario one by one and created seven additional scenarios (see Table 2). We studied the influence of the amount of data available by changing the number of loci (scenario 2), the number of populations (scenario 5) and the number of individuals (scenario 6). We also considered a scenario with a very high inbreeding coefficient (scenario 3) and a scenario with low population differentiation (scenario 4). We tested for the effect of ascertainment bias by imposing a minimum allele frequency of 10% at each marker (scenario 7), as it has been shown that FIS may be very sensitive to this bias with dominant AFLP markers (Foll et al. 2008). Finally, we tested the robustness of our choice of a Cauchy distribution of band intensity by simulating data from a mixture of normal distributions with means of 50 and 100 for heterozygote and dominant homozygote individuals, respectively (scenario 8). In this case, we used a variance of 100 to simulate high-quality AFLP markers and of 200 for low-quality data. For each of those seven additional scenarios, we also calculated the ratio of the variance compared to (i) the reference scenario under the same data type and (ii) the SNP data set under the same scenario.
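A minimal sketch of how one locus could be simulated under the inference model is given below, using the reference-scenario values of Table 1 and the high-quality setting as defaults. It is an illustration only: the function names are hypothetical, the ancestral frequency is drawn from a uniform distribution as in the prior, and redrawing non-positive intensities is an assumption of this sketch (the text does not say how such draws were handled).

```python
import math
import random

def draw_cauchy(rng, loc, scale):
    """Inverse-CDF draw from a Cauchy distribution."""
    return loc + scale * math.tan(math.pi * (rng.random() - 0.5))

def draw_intensity(rng, genotype, mu, delta, s1, s2):
    """Band intensity for one individual given its genotype."""
    if genotype == "aa":
        return 0.0                                   # no band
    loc, scale = (mu, s1) if genotype == "Aa" else (mu + delta, s2)
    y = draw_cauchy(rng, loc, scale)
    while y <= 0.0:                                  # assumption: redraw non-positive values
        y = draw_cauchy(rng, loc, scale)
    return y

def simulate_locus(rng, n_pops=5, n_ind=30, f_is=0.2, f_st=0.1,
                   mu=50.0, delta=50.0, s1=10.0, s2=10.0):
    """Return one list of band intensities per population for a single locus."""
    # Ancestral frequency, kept strictly inside (0, 1) so Beta parameters stay positive.
    p = min(max(rng.random(), 1e-9), 1.0 - 1e-9)
    theta = (1.0 - f_st) / f_st
    data = []
    for _ in range(n_pops):
        p_local = rng.betavariate(theta * p, theta * (1.0 - p))   # Beta prior
        het = p_local * (1.0 - p_local)
        weights = [p_local ** 2 + f_is * het,                     # AA (Eqn 1)
                   (1.0 - f_is) * 2.0 * het,                      # Aa
                   (1.0 - p_local) ** 2 + f_is * het]             # aa
        genotypes = rng.choices(["AA", "Aa", "aa"], weights=weights, k=n_ind)
        data.append([draw_intensity(rng, g, mu, delta, s1, s2) for g in genotypes])
    return data

rng = random.Random(42)
locus = simulate_locus(rng)
```

Setting `s1 = s2 = 20.0` would correspond to the low-quality setting, and replacing each positive intensity by 1 would give the binary presence/absence coding.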

Samples and AFLP analyses

Table 2 Parameters changed in alternative scenarios when compared to the reference scenario

| Scenario | Parameter changed |
| --- | --- |
| 1 | – |
| 2 | 200 loci |
| 3 | FIS = 0.8 |
| 4 | FST = 0.05 |
| 5 | 10 populations |
| 6 | 50 individuals per population |
| 7 | Ascertainment bias: 10% |
| 8 | Normal mixture |

To illustrate our method, we analysed the AFLP diversity in a small rodent, the common vole (Microtus arvalis), which is widespread over most of Europe (Heckel et al. 2005). We selected four vole populations from Grisons in Switzerland (Table 3). The distance between the four populations was on average 7.4 km (range: 4.1 to 11 km). AFLP marker analyses were performed by adopting the protocols established by Vos et al. (1995), Donaldson et al. (1998) and Fink et al. (2010). Six AFLP primer combinations were amplified for 86 individuals. The investigated primer combinations are referred to by the last two selective bases: e.g. the combination E01-ACT/M02-CAA is shortened to CTaa. The following primer pairs were amplified: CTaa, ACag, GCat, CGtt, AGtg and CCac.

Table 3 Localities and properties of Microtus arvalis samples from Grisons in Switzerland

| Population ID | Sample location | N | Latitude | Longitude | FIS 95% HPDI | FST 95% HPDI |
| --- | --- | --- | --- | --- | --- | --- |
| Ca | Calanda | 22 | 46°53′N | 9°29′E | [0.00; 0.04] | [0.13; 0.17] |
| AP | Alp di Plaun | 20 | 46°47′N | 9°29′E | [0.02; 0.07] | [0.10; 0.14] |
| DP | Domat/Ems | 22 | 46°50′N | 9°26′E | [0.00; 0.02] | [0.14; 0.18] |
| Bo | Bonaduz | 22 | 46°48′N | 9°24′E | [0.00; 0.03] | [0.16; 0.20] |

N, number of individuals per population; HPDI, highest posterior density interval.

Special care was taken to ensure reproducibility of AFLP marker analyses (for details see Fink et al. 2010). A liquid-handling robot (Microlab STAR, Hamilton Bonaduz AG) was used for the preparation of the following steps: selective amplification, multiplexing of PCR products and loading of the 96-well sequencer plates. To check the reproducibility of individual results, selective PCRs of six individuals (7%) were independently replicated for all six primer combinations. These replicates were distributed over different runs of the sequencer and over different 96-well plates. In addition, the position of the samples on the 96-well plates was altered, to account for potential positional effects (see Fink et al. 2010). The resulting reproducibility for the AFLP markers was 98% for the six reproduced individuals.

AFLP fragment scoring was performed with the GeneMapper software version 3.7 (Applied Biosystems). Bin sets were created automatically and extensively revised manually. Bins were accepted when bands in a bin fulfilled the following criteria: first, band intensity was clearly above the background noise of the electropherogram; second, band peaks were smoothly curved (i.e. no artifact); and third, bin width was not wider than 0.8 base pairs (bp) to ensure homology of bands in a bin. Scoring was performed automatically by the GeneMapper software with extensive posterior manual supervision. Fragments between 50 and 600 bp were scored in comparison with the internal GeneScan-600 LIZ size standard.

For the analyses, the matrix of marker band intensity was used as provided by GeneMapper. This matrix was scanned for potential artifacts of the sequencer. A particular band intensity value was discarded (considered as missing data) if it exceeded three times the 95% quantile of the distribution of the band intensity for the corresponding marker. For the present analysis, monomorphic markers (markers with all individuals having the same genotype) were discarded. As we were not using AFLPs as purely dominant here, the characterization of monomorphic markers differs from the commonly used definition. When using AFLPs as binary data, a marker with all individuals having the band is considered as monomorphic. However, in our model, such a marker is still considered as polymorphic if both heterozygote and homozygote individuals are present in the sample. Indeed, the distribution of band intensity across all individuals is expected to be bimodal in this case (as in Fig. 2). Therefore, we did not discard markers with all individuals having the band when unimodality was rejected using a dip test (Hartigan & Hartigan 1985) at the 5% level. The dip test measures multimodality of the band intensity by the maximum difference over all sample points between the empirical distribution function and the unimodal distribution function that minimizes the maximum difference (Harti-gan & Hartigan 1985). The final data set obtained after these quality check steps consisted of 562 polymorphic AFLP loci.
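The artifact screening rule (discard a value exceeding three times the 95% quantile for its marker), followed by the per-locus normalization used by the model, could be sketched as follows. The function names, the linear-interpolation quantile and the choice to include every non-missing value in the quantile are assumptions of this illustration; the example intensities are made up.

```python
def quantile(values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

def mask_artifacts(intensities, q=0.95, factor=3.0):
    """Replace values above factor x the q-quantile by None (missing data)."""
    present = [y for y in intensities if y is not None]
    cutoff = factor * quantile(present, q)
    return [y if (y is not None and y <= cutoff) else None for y in intensities]

def normalize(intensities):
    """Rescale one marker so that its largest remaining intensity equals 1."""
    top = max(y for y in intensities if y is not None)
    return [None if y is None else y / top for y in intensities]

# Made-up marker: 39 ordinary band intensities and one sequencer artifact.
marker = [100 + k for k in range(39)] + [5000]
cleaned = mask_artifacts(marker)
scaled = normalize(cleaned)
```

Screening before normalizing matters: a single artifact left in place would become the maximum and compress all genuine intensities towards zero.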

Results

Simulation study

Results for the reference scenario for FIS and FST are presented in Figs 3 and 4, respectively, and are based on the mean of the posterior distributions of these parameters. Note that we estimated population-specific F-statistics, but that we present results for only one of the populations as those of the other populations are very similar. First, we observe that all box plots for FIS are correctly centred on the true value of 0.2. We see an increase in variance from SNP data to high-quality and then to low-quality AFLP band intensity data. Note that we cannot estimate FIS from binary data with the MCMC method (see Foll et al. 2008), and the ABC method, the only other suitable method, has a much higher variance than our new model (more than 20 times larger than that estimated from low-quality band intensity AFLP data). Interestingly, the increase in FST variance from SNP data to high-quality and low-quality AFLP band intensity data is less pronounced, and the estimates are also correctly centred on the true value of 0.1. As previously shown (Foll et al. 2008), the analysis of AFLP markers considered as mere binary data leads to either biased FST (MCMC) or to a very high FST variance (ABC).

Fig. 3 Box plot of FIS estimates based on 100 replicates of four different data sets generated under the reference scenario described in Table 1 (from left to right: binary AFLPs with the ABC method, low-quality band intensity, high-quality band intensity and SNPs). The true value of FIS is here 0.2. See text for a description of the different methods used to estimate FIS. Note that FIS is not estimated by the MCMC method when using binary AFLP data. Boxes include 50% of the FIS values, and whiskers indicate the minimum and maximum values but only if they lie within 1.5 times the box height. Solid circles are outlier values.

Fig. 4 Box plot of FST estimates based on 100 replicates of four different data sets generated under the reference scenario described in Table 1 (from left to right: binary AFLPs with the MCMC method, binary AFLPs with the ABC method, low-quality band intensity, high-quality band intensity and SNPs). The true FST value is 0.1. See text for a description of the different methods used to estimate FST.

The results about the accuracy of the estimation of FIS and FST under the eight scenarios are presented in Tables 4 and 5, respectively. Considering all scenarios and when compared to SNP data, the variances of the estimators for FIS are overall 1.49 and 2.29 times larger for high-quality and low-quality band intensity data, respectively. This effect is much less pronounced for FST (1.09 and 1.11): the accuracy of our new method is almost identical to that obtained with SNP data, even with low-quality band intensity data. The bias of the estimates is very low in all cases, and neither ascertainment bias (scenario 7) nor the use of a normal mixture (scenario 8) seems to have a noticeable effect on the results. In contrast, increasing the number of loci is highly beneficial as it leads to a reduction in the variance for both FIS and FST: multiplying the number of loci by four divides the variance of the estimates by about four. Adding more individuals also greatly improves the estimates for FIS, but it has a weaker effect on FST. As we estimate population-specific F-statistics, the number of sampled populations has little effect on either FIS or FST accuracy.

Application to common vole populations

Figure 5 shows the posterior distribution of FIS coefficients for each population based on the 562 polymorphic AFLP loci (solid lines). Table 3 gives the 95% highest posterior density intervals (HPDI, the smallest interval that contains 95% of the values) of FIS and FST coefficients. For three of the four populations (Bo, Ca and DP), the lower bound of the 95% HPDI of FIS is smaller than $10^{-3}$, indicating very weak evidence against Hardy–Weinberg equilibrium. The range of the estimated FIS values is in agreement with estimates from other M. arvalis populations based on STRs (FIS: −0.04 to 0.09, Schweizer et al. 2007). For the AP population, a significant inbreeding coefficient was observed (95% HPDI: [0.02; 0.07]). This may be because of a Wahlund effect, as individuals from this population might have come from different subpopulations. Indeed, this population sample consisted of (i) 17 individuals from a main sampling site, (ii) one individual trapped 800 metres apart and (iii) two further individuals trapped 350 metres apart. To check whether this sampling may have induced a Wahlund effect and an increased value of FIS, we excluded those three individuals and repeated the analysis. The posterior distribution of FIS for this population is represented by the dashed line in Fig. 5 (labelled as 'AP*'). We no longer observe a significant inbreeding coefficient (95% HPDI: [0.00; 0.03]), supporting the hypothesis of a Wahlund effect caused by our sampling scheme. Additionally, we chose to artificially pool the Bo and DP populations into a single population, and we re-estimated FIS coefficients for the three resulting populations. As expected, the FIS coefficient for the artificially pooled population indicated a strong departure from Hardy–Weinberg proportions (95% HPDI: [0.05; 0.09]) (see Fig. 5, dotted line). FST coefficients are overall relatively large in the four populations, ranging from 0.12 (AP population) to 0.18 (Bo population), consistent with overall high genetic differentiation over relatively short distances in common voles (Heckel et al. 2005; Schweizer et al. 2007; Braaker & Heckel 2009).
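The HPDI definition used above (the smallest interval containing 95% of the values) translates directly into a short routine on MCMC output. The sketch below is a generic sample-based estimator, not necessarily the one used to produce Table 3; the function name and the toy draws are illustrative assumptions.

```python
import math

def hpdi(samples, mass=0.95):
    """Narrowest interval containing a fraction `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    # Number of samples the interval must contain; the small tolerance guards
    # against floating-point error in mass * n.
    k = max(1, math.ceil(mass * n - 1e-9))
    # Slide a window of k consecutive order statistics and keep the narrowest.
    best = min(range(n - k + 1), key=lambda s: xs[s + k - 1] - xs[s])
    return xs[best], xs[best + k - 1]

# Toy posterior draws: 19 values close to zero and one far in the upper tail.
draws = [k / 100 for k in range(19)] + [5.0]
interval = hpdi(draws)
```

Unlike an equal-tailed interval, the HPDI can include the boundary value 0, which is why a lower bound close to zero is read here as compatibility with the absence of inbreeding.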

Table 4 Precision of the estimation of FIS under different scenarios

| Data type | Statistic | 1: Reference | 2: 200 loci | 3: FIS = 0.8 | 4: FST = 0.05 | 5: 10 pops | 6: 50 inds | 7: 10% bias | 8: Normal |
|---|---|---|---|---|---|---|---|---|---|
| SNP | Variance | 1.00E-03 | 2.57E-04 | 4.18E-04 | 9.12E-04 | 8.92E-04 | 5.64E-04 | 8.93E-04 | 1.02E-03 |
| SNP | Bias | -5.02E-04 | -7.33E-04 | -1.89E-03 | 8.87E-06 | -7.05E-04 | 4.93E-04 | -1.05E-03 | -6.15E-04 |
| High-quality AFLPs | Variance / SNP* | 1.45 | 1.53 | 1.77 | 1.37 | 1.51 | 1.44 | 1.60 | 1.29 |
| High-quality AFLPs | Variance / Ref† | 1.00 | 0.27 | 0.50 | 0.85 | 0.92 | 0.55 | 0.97 | 0.90 |
| High-quality AFLPs | Bias | -4.77E-03 | -6.38E-03 | -2.37E-03 | -2.44E-03 | -3.94E-03 | -4.84E-03 | -8.60E-03 | -7.96E-03 |
| Low-quality AFLPs | Variance / SNP* | 2.28 | 1.91 | 3.44 | 2.14 | 2.25 | 2.40 | 2.44 | 1.50 |
| Low-quality AFLPs | Variance / Ref† | 1.00 | 0.22 | 0.63 | 0.94 | 0.88 | 0.60 | 0.96 | 0.67 |
| Low-quality AFLPs | Bias | -1.00E-02 | -1.40E-02 | 1.13E-03 | -4.22E-03 | -8.62E-03 | -8.72E-03 | -1.47E-02 | -1.71E-02 |

*Ratio of the FIS variance compared to the SNP data set under the same scenario.
†Ratio of the FIS variance compared to the reference scenario under the same data type.
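As a reading aid for Table 4, the two ratio rows can be tied back to absolute variances: multiplying the SNP variance by the Variance / SNP ratio gives the AFLP variance for a scenario, and dividing that by the AFLP variance of the reference scenario should recover the Variance / Ref entry. The sketch below does this for the high-quality AFLP rows of two columns. The variable and function names are illustrative, and because the published ratios are rounded the results are only approximate.

```python
# Values copied from the "1: Reference" and "2: 200 loci" columns of Table 4.
snp_variance = {"reference": 1.00e-3, "200 loci": 2.57e-4}
high_quality_ratio_to_snp = {"reference": 1.45, "200 loci": 1.53}


def aflp_variance(scenario):
    """Absolute FIS variance for high-quality AFLPs (approximate)."""
    return high_quality_ratio_to_snp[scenario] * snp_variance[scenario]


def variance_relative_to_reference(scenario):
    """Recompute the 'Variance / Ref' entry from the other two rows."""
    return aflp_variance(scenario) / aflp_variance("reference")


if __name__ == "__main__":
    for scenario in snp_variance:
        print(scenario,
              f"{aflp_variance(scenario):.2e}",
              f"{variance_relative_to_reference(scenario):.2f}")
```

For the 200-loci scenario the recomputed ratio rounds to 0.27, the value reported in the table. The same arithmetic applies to the other columns and to Table 5.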

Table 5 Precision of the estimation of FST under different scenarios

| Data type | Statistic | 1: Reference | 2: 200 loci | 3: FIS = 0.8 | 4: FST = 0.05 | 5: 10 pops | 6: 50 inds | 7: 10% bias | 8: Normal |
|---|---|---|---|---|---|---|---|---|---|
| SNP | Variance | 4.20E-04 | 1.09E-04 | 5.28E-04 | 2.01E-04 | 4.19E-04 | 3.86E-04 | 5.00E-04 | 4.63E-04 |
| SNP | Bias | 4.97E-03 | -1.50E-03 | 4.72E-03 | -4.19E-02 | 5.37E-03 | 5.09E-03 | 6.12E-03 | 4.21E-03 |
| High-quality AFLPs | Variance / SNP* | 1.03 | 1.10 | 1.04 | 1.15 | 1.09 | 1.08 | 1.08 | 1.12 |
| High-quality AFLPs | Variance / Ref† | 1.00 | 0.28 | 1.27 | 0.54 | 1.06 | 0.96 | 1.26 | 1.20 |
| High-quality AFLPs | Bias | 4.30E-03 | -2.09E-03 | 4.93E-03 | -4.12E-02 | 5.13E-03 | 4.70E-03 | 5.95E-03 | 9.80E-03 |
| Low-quality AFLPs | Variance / SNP* | 1.09 | 1.15 | 1.02 | 1.17 | 1.12 | 1.12 | 1.08 | 1.16 |
| Low-quality AFLPs | Variance / Ref† | 1.00 | 0.27 | 1.18 | 0.52 | 1.03 | 0.94 | 1.18 | 1.17 |
| Low-quality AFLPs | Bias | 2.96E-03 | -4.60E-03 | 4.57E-03 | -4.12E-02 | 3.72E-03 | 3.19E-03 | 4.05E-03 | 1.09E-02 |

*Ratio of the FST variance compared to the SNP data set under the same scenario.
†Ratio of the FST variance compared to the reference scenario under the same data type.


© 2010 Blackwell Publishing Ltd

To check the validity of the Cauchy mixture assumption for our M. arvalis data set, we obtained the posterior means of the parameters of the mixture model for the 562 AFLP loci (namely l, d, s1, s2). The overall genotype frequencies (Paa, PAa, PAA) were obtained as a population-size weighted average of the genotype frequencies (Eqn 1). Using those parameters, we performed a one-sample Kolmogorov–Smirnov test to compare the empirical distribution of band intensity with our mixture of Cauchy distributions. Overall, 25 markers showed a significant deviation from the estimated Cauchy mixture at the 1% level. This indicates that most of the markers (>95%) do not show a significant deviation from our model. As an example, Figure 6 shows the fit of the mixture model at a given locus. Posterior means of the parameters in this case were l = 0.30, d = 0.36, s1 = 0.27, s2 = 0.38, Paa = 0.07, PAa = 0.35 and PAA = 0.59.
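A one-sample Kolmogorov–Smirnov check of this kind can be sketched as follows. This is a generic illustration, not the code used for the analysis: the mixture is written with explicit (weight, location, scale) triples, so mapping the model parameters of this study onto those triples, and any handling of the recessive-homozygote threshold, is left to the reader. The function names and the numerical values in the usage lines are hypothetical, and the p-value uses the large-sample Kolmogorov series. A library routine such as `scipy.stats.kstest`, which accepts a callable CDF, could be used instead.

```python
import math
import random


def cauchy_cdf(x, loc, scale):
    """CDF of a Cauchy distribution with the given location and scale."""
    return 0.5 + math.atan((x - loc) / scale) / math.pi


def mixture_cdf(x, components):
    """CDF of a mixture; components is a list of (weight, loc, scale)
    triples whose weights are assumed to sum to 1."""
    return sum(w * cauchy_cdf(x, loc, scale) for w, loc, scale in components)


def ks_statistic(data, cdf):
    """Largest distance between the empirical CDF and the model CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_pvalue_asymptotic(d, n, terms=100):
    """Large-sample p-value from the Kolmogorov series (an approximation)."""
    t = math.sqrt(n) * d
    if t < 0.2:  # the series is numerically 1 in this range
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * t * t)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))


if __name__ == "__main__":
    random.seed(1)
    # Hypothetical two-component mixture, for illustration only.
    components = [(0.4, 0.30, 0.05), (0.6, 0.60, 0.05)]
    sample = []
    for _ in range(200):
        w, loc, scale = random.choices(
            components, weights=[c[0] for c in components])[0]
        # Inverse-CDF draw from the chosen Cauchy component.
        sample.append(loc + scale * math.tan(math.pi * (random.random() - 0.5)))
    d = ks_statistic(sample, lambda x: mixture_cdf(x, components))
    print(d, ks_pvalue_asymptotic(d, len(sample)))
```

Repeating such a test locus by locus and counting the loci with a p-value below 0.01 gives the kind of summary reported above.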

Discussion

In this study, we introduce an efficient way of using AFLP markers to study the genetic structure of populations. We propose an extension of the Bayesian F-model for population structure specifically designed for AFLP markers. The accuracy of our method is much higher than that of previous approaches and almost reaches the one obtained using SNP markers. Our method provides very accurate estimates of population-specific FIS and FST coefficients, but its treatment of band intensity is very general and could be incorporated into any program based on the F-model, such as STRUCTURE (Falush et al. 2003), GENELAND (Guillot 2008), GESTE (Foll & Gaggiotti 2006), BayeScan (Foll & Gaggiotti 2008) or BIMr (Faubet & Gaggiotti 2008).

Our simulation study shows that the accuracy of F-statistics estimates is much higher when information on band intensity is used than when AFLPs are treated as dominant binary markers. The precision of the estimates is very close to that obtained with SNPs, in particular for FST coefficients, where the variance of our estimator is only about 10% higher than that obtained with SNP markers (Table 5). Remarkably, this precision is maintained for low-quality AFLP data (when there is a strong overlap in band intensity distribution between homozygous and heterozygous individuals, see Fig. 2). In such a case, it would be impossible to accurately infer the genotypes of many individuals using Piepho & Koch's (2000) approach. This underlines the importance of fully taking uncertainty about genotypes into account when using band intensity data, rather than trying to decide on the genotype class before estimating population parameters.

Estimating accurate inbreeding (FIS) coefficients with standard binary AFLP data has been shown to be very difficult, and past estimates were either biased or associated with a very large variance (Foll et al. 2008). Here, we show that the use of band intensity data makes it possible to detect very small inbreeding coefficients (e.g. those caused by subpopulation structure, i.e. a Wahlund effect, or by a real departure from random mating).

Special care needs to be taken in the laboratory when using AFLP band intensity to infer genotype frequencies and FIS, because band intensity differences between individuals need to result from homozygous or heterozygous genotypes and not from other artifactual amplification factors. It is therefore advisable to check AFLP data for markers with a bimodal distribution (see Fig. 6) before using this approach. It should be noted that not all the markers are expected to present such distributions of band intensity. In our M. arvalis data set, it was the case for only a small fraction of the markers, owing to the strong genetic drift acting on those populations, but we still obtained narrow posterior distributions for FIS coefficients. A very precise titration of initial DNA concentration, the use of a liquid-handling robot, and PCR and fragment separation of each primer combination on the same day should guarantee high-quality AFLP data and reproducibility (see Fink et al. 2010).

[Figure 5: density plot. x-axis: FIS (0.00 to 0.12); y-axis: Density (0 to 60); curves: AP, Bo, Ca, DP, AP*, Bo + DP.]

Fig. 5 Posterior distributions of FIS coefficients obtained from 562 polymorphic AFLP markers in the Microtus arvalis data set. Solid lines correspond to the four sampled populations. The dashed line corresponds to the 17 individuals from the AP population that were caught at the main sampling site (marked as 'AP*'). The dotted line corresponds to the artificial pooling of the Bo and DP populations, used to check that our method can detect a Wahlund effect.

[Figure 6: histogram with fitted curve. x-axis: Band intensity (0.0 to 1.0); y-axis: Density (0 to 5).]

Fig. 6 Example of the fit of the mixture model at a given locus of the Microtus arvalis AFLP data set. The histogram represents the distribution of band intensities at this locus. The solid line represents the mixture of two Cauchy distributions, using the posterior means of the parameters inferred with our method: l = 0.30, d = 0.36, s1 = 0.27, s2 = 0.38, Paa = 0.07, PAa = 0.35 and PAA = 0.59.

In the development of our model, we have made two important assumptions that may not be valid in every study. First, we assumed that a double amount of PCR product for dominant homozygous individuals leads to band intensity values on average two times larger than those of heterozygous individuals. Second, we assumed that any band intensity below a given threshold ai corresponds to a recessive homozygote genotype. In practice, we defined this threshold as a fraction (10%) of the highest observed band intensity value at the same locus. In contrast, Gort & Van Eeuwijk (2010) found that a third component in the mixture model was necessary to account for variability in band intensities of recessive homozygous individuals (see Fig. 1 in Gort & Van Eeuwijk 2010). In our case, we did not observe this phenomenon, and the absence of a third component in the mixture model leads to faster calculations. In all our estimations, we defined ai as 10% of the highest observed value at locus i but, as shown in Fig. 2, we simulated data allowing band intensity values to reach 0 (in which case this threshold should be set to 0%). This allowed us to mimic a realistic situation where some individuals having the band may have band intensities below the threshold and therefore be mistakenly classified as recessive homozygotes. However, our results show that this has very little influence, as our estimates are all unbiased. Departures from these two assumptions could easily be accommodated depending on the data, and we again advise checking band intensity distributions before using our method. It might even be possible to extend our model to a polyploidy level P using P components and the generalization of the Wahlund inequality (Rosenberg & Calabrese 2004).

In this article, we introduce an efficient way of using AFLP markers to study the genetic structure of populations. Using band intensity data overcomes many drawbacks usually associated with the use of AFLP markers. The accuracy of our method is much higher than that of previous approaches and almost reaches that obtained with SNP markers. Additionally, the Bayesian model we propose is very flexible and could be incorporated in many existing estimation procedures. We think that our method may help AFLP markers to become more attractive again and allow one to reliably study population structure and inbreeding in nonmodel species.
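The thresholding assumption discussed above (band intensities below ai, set to 10% of the highest intensity observed at locus i, are treated as recessive homozygotes) can be sketched in a few lines. This is only an illustration of that single rule. The function names and the example intensities are hypothetical, and individuals above the threshold are deliberately left unassigned, since the method does not attempt to call their genotypes.

```python
def recessive_threshold(intensities, fraction=0.10):
    """Threshold a_i for one locus: a fraction of the highest intensity."""
    return fraction * max(intensities)


def split_by_threshold(intensities, fraction=0.10):
    """Separate the band intensities of one locus at the threshold a_i.

    Returns (a_i, below, above). Values in `below` are treated as
    recessive homozygotes. Values in `above` carry the band, and their
    genotype (heterozygote or dominant homozygote) stays uncertain.
    """
    a_i = recessive_threshold(intensities, fraction)
    below = [x for x in intensities if x < a_i]
    above = [x for x in intensities if x >= a_i]
    return a_i, below, above


if __name__ == "__main__":
    # Hypothetical normalised band intensities at a single locus.
    locus = [0.00, 0.05, 0.30, 0.62, 1.00]
    a_i, below, above = split_by_threshold(locus)
    print(a_i, below, above)
    # With fraction=0.0 no individual falls below the threshold.
    print(split_by_threshold(locus, fraction=0.0))
```

Setting `fraction` to 0 corresponds to the simulated case mentioned above, in which band intensities are allowed to reach 0.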

Acknowledgements

We thank Susanne Tellenbach for technical assistance. This study was supported by grants from the University of Bern Research Foundation and the Swiss National Science Foundation (grants No 3100A0-112072 and No 3100A0-126074 to LE and GH). The method described in this study has been implemented as a command line version under Linux and as a version with a user-friendly graphical user interface for Microsoft Windows. Both programs, as well as the R function we developed to identify monomorphic loci, are available from MF upon request.

References

Balding DJ (2003) Likelihood-based inference for genetic correlation coefficients. Theoretical Population Biology, 63, 221–230.

Balding DJ, Nichols RA (1995) A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica, 96, 3–12.

Beaumont MA, Balding DJ (2004) Identifying adaptive genetic divergence among populations from genome scans. Molecular Ecology, 13, 969–980.

Bensch S, Akesson M (2005) Ten years of AFLP in ecology and evolution: why so few animals? Molecular Ecology, 14, 2899–2914.

Bonin A, Ehrich D, Manel S (2007) Statistical analysis of amplified fragment length polymorphism data: a toolbox for molecular ecologists and evolutionists. Molecular Ecology, 16, 3737–3758.

Braaker S, Heckel G (2009) Transalpine colonisation and partial phylogeographic erosion by dispersal in the common vole (Microtus arvalis). Molecular Ecology, 18, 2518–2531.

Donaldson SL, Chopin T, Saunders GW (1998) Amplified Fragment Length Polymorphism (AFLP) as a source of genetic markers for red algae. Journal of Applied Phycology, 10, 365–370.

Excoffier L, Heckel G (2006) Computer programs for population genetics data analysis: a survival guide. Nature Reviews Genetics, 7, 745–758.

Falush D, Stephens M, Pritchard JK (2003) Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics, 164, 1567–1587.

Faubet P, Gaggiotti OE (2008) A new Bayesian method to identify the environmental factors that influence recent migration. Genetics, 178, 1491–1504.

Fink S, Fischer MC, Excoffier L, Heckel G (2010) Genomic scans support repetitive continental colonization events during the rapid radiation of voles (Rodentia: Microtus): the utility of AFLPs versus mitochondrial and nuclear sequence markers. Systematic Biology, doi: 10.1093/sysbio/syq042.

Foll M, Gaggiotti O (2006) Identifying the environmental factors that determine the genetic structure of populations. Genetics, 174, 875–891.

Foll M, Gaggiotti O (2008) A genome-scan method to identify selected loci appropriate for both dominant and codominant markers: a Bayesian perspective. Genetics, 180, 977–993.

Foll M, Beaumont MA, Gaggiotti O (2008) An approximate Bayesian computation approach to overcome biases that arise when using amplified fragment length polymorphism markers to study population structure. Genetics, 179, 927–939.

Gaggiotti O, Foll M (In press) Quantifying population structure using the F-model. Molecular Ecology Resources, 10, 821–830.

Gort G, Van Eeuwijk F (2010) Codominant scoring of AFLP in association panels. Theoretical and Applied Genetics, 121, 337–351.

Guillot G (2008) Inference of structure in subdivided populations at low levels of genetic differentiation - the correlated allele frequencies model revisited. Bioinformatics, 24, 2222–2228.

Hartigan JA, Hartigan PM (1985) The DIP test of unimodality. Annals of Statistics, 13, 70–84.

Heckel G, Burri R, Fink S, Desmet J-F, Excoffier L (2005) Genetic structure and colonization processes in European populations of the common vole Microtus arvalis. Evolution, 59, 2231–2242.

Hill WG, Weir BS (2004) Moment estimation of population diversity and genetic distance from data on recessive markers. Molecular Ecology, 13, 895–908.

Holsinger KE, Weir BS (2009) Genetics in geographically structured populations: defining, estimating and interpreting FST. Nature Reviews Genetics, 10, 639–650.

Holsinger KE, Lewis PO, Dey DK (2002) A Bayesian approach to inferring population structure from dominant markers. Molecular Ecology, 11, 1157–1164.

Lynch M, Milligan BG (1994) Analysis of population genetic structure with RAPD markers. Molecular Ecology, 3, 91–99.

Meudt HM, Clarke AC (2007) Almost forgotten or latest practice? AFLP applications, analyses and advances. Trends in Plant Science, 12, 106–117.

Piepho HP, Koch G (2000) Codominant analysis of banding data from a dominant marker system by normal mixtures. Genetics, 155, 1459–1468.

Rannala B, Hartigan JA (1996) Estimating gene flow in island populations. Genetical Research, 67, 147–158.

Rosenberg NA, Calabrese PP (2004) Polyploid and multilocus extensions of the Wahlund inequality. Theoretical Population Biology, 66, 381–391.

Schweizer M, Excoffier L, Heckel G (2007) Fine-scale genetic structure and dispersal patterns in the common vole Microtus arvalis. Molecular Ecology, 16, 2463–2473.

Smith BJ (2005) Bayesian Output Analysis program (BOA), version 1.1.5. http://www.public-health.uiowa.edu/boa.

Vos P, Hogers R, Bleeker M et al. (1995) AFLP - a new technique for DNA-fingerprinting. Nucleic Acids Research, 23, 4407–4414.

Waples RS, Gaggiotti O (2006) What is a population? An empirical evaluation of some genetic methods for identifying the number of gene pools and their degree of connectivity. Molecular Ecology, 15, 1419–1439.

Wright S (1951) The genetical structure of populations. Annals of Eugenics, 15, 323–354.

Zhivotovsky L (1999) Estimating population structure in diploids with multilocus dominant DNA markers. Molecular Ecology, 8, 907–913.

M.F. is interested in developing new statistical genetics models to address ecological and evolutionary questions, in particular concerning the spatial structure of genetic diversity and the role of natural selection in this structure. L.E. is a population geneticist with interests in estimating demographic parameters from genetic data under complex evolutionary models. G.H.'s research addresses the evolutionary consequences of migration and individual behaviour across various taxa, ranging from variation in dispersal patterns to the evolutionary genetics of mating systems. M.C.F.'s research focuses on the genetic basis of recent adaptation and selection in various biological models.
