Estimating Time to the Common Ancestor for a Beneficial Allele

15
Estimating Time to the Common Ancestor for a Beneficial Allele Joel Smith,* ,1 Graham Coop, 2 Matthew Stephens, 3,4 and John Novembre 1,3 1 Department of Ecology and Evolution, University of Chicago, Chicago, IL 2 Center for Population Biology, Department of Evolution and Ecology, University of California, Davis, Davis, CA 3 Department of Human Genetics, University of Chicago, Chicago, IL 4 Department of Statistics, University of Chicago, Chicago, IL *Corresponding author: E-mail: [email protected]. Associate editor: Yuseob Kim Abstract The haplotypes of a beneficial allele carry information about its history that can shed light on its age and the putative cause for its increase in frequency. Specifically, the signature of an allele’s age is contained in the pattern of variation that mutation and recombination impose on its haplotypic background. We provide a method to exploit this pattern and infer the time to the common ancestor of a positively selected allele following a rapid increase in frequency. We do so using a hidden Markov model which leverages the length distribution of the shared ancestral haplotype, the accumu- lation of derived mutations on the ancestral background, and the surrounding background haplotype diversity. Using simulations, we demonstrate how the inclusion of information from both mutation and recombination events increases accuracy relative to approaches that only consider a single type of event. We also show the behavior of the estimator in cases where data do not conform to model assumptions, and provide some diagnostics for assessing and improving inference. Using the method, we analyze population-specific patterns in the 1000 Genomes Project data to estimate the timing of adaptation for several variants which show evidence of recent selection and functional relevance to diet, skin pigmentation, and morphology in humans. Key words: haplotype, allele age, adaptation. Introduction A complete understanding of adaptation depends on a de- scription of the genetic mechanisms and selective history that underly heritable traits (Radwan and Babik 2012). Once a ge- netic variant underlying a putatively adaptive trait has been identified, several questions remain: What is the molecular mechanism by which the variant affects organismal traits and fitness (Dalziel et al. 2009)?; what is the selective mecha- nism responsible for allelic differences in fitness?; did the var- iant arise by mutation more than once (Elmer and Meyer 2011)?; and when did each unique instance of the variant arise and spread (Slatkin and Rannala 2000)? Addressing these ques- tions for numerous case studies of beneficial variants across multiple species will be necessary to gain insight into general properties of adaptation (Stinchcombe and Hoekstra 2008). Here, our focus is on the last of the questions given above; that is, when did a mutation arise and spread? Understanding these dates can give indirect evidence regarding the selective pressure that may underlie the adaptation. This is especially useful in cases where it is logistically infeasible to assess fitness consequences of a variant in the field directly (Barrett and Hoekstra 2011). In humans, for example, dispersal across the globe has resulted in the occupation of a wide variety of habitats, and in several cases, selection in response to specific ecological pressures appears to have taken place. There are well-documented cases of loci showing evidence of recent selection in addition to being functionally relevant to known phenotypes of interest (Jeong and Di Rienzo 2014). Nakagome et al. (2016) specify time intervals defined by the human dispersal out-of-Africa and the spread of agriculture to show the relative concordance among allele ages for several loci associated with autoimmune protection and risk, skin pigmentation, hair and eye color, and lactase persistence. When a putative variant is identified as the selected site, the nonrandom association of surrounding variants on a chromosome can be used to understand its history. This combination of surrounding variants is called a haplotype, and the nonrandom association between any pair of variants is called linkage disequilibrium (LD). Due to recombination, LD between the focal mutation and its initial background of surrounding variants follows a per-generation rate of decay. New mutations also occur on this haplotype at an average rate per generation. The focal mutation’s frequency follows a trajectory determined by the stochastic outcome of survival, mating success, and offspring number. If the allele’s selective benefit increases its frequency at a rate faster than the rate at which LD decays, the resulting signature is one of high LD and a reduction of polymorphism near the selected mutation (Smith and Haigh 1974). Many methods to exploit this pat- tern have been developed in an effort to identify loci under Article ß The Author(s) 2018. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected] Open Access Mol. Biol. Evol. 35(4):1003–1017 doi:10.1093/molbev/msy006 Advance Access publication January 18, 2018 1003 Downloaded from https://academic.oup.com/mbe/article/35/4/1003/4816405 by guest on 10 August 2022

Transcript of Estimating Time to the Common Ancestor for a Beneficial Allele

Estimating Time to the Common Ancestor fora Beneficial Allele

Joel Smith1 Graham Coop2 Matthew Stephens34 and John Novembre13

1Department of Ecology and Evolution University of Chicago Chicago IL2Center for Population Biology Department of Evolution and Ecology University of California Davis Davis CA3Department of Human Genetics University of Chicago Chicago IL4Department of Statistics University of Chicago Chicago IL

Corresponding author E-mail joelsmithuchicagoedu

Associate editor Yuseob Kim

Abstract

The haplotypes of a beneficial allele carry information about its history that can shed light on its age and the putativecause for its increase in frequency Specifically the signature of an allelersquos age is contained in the pattern of variation thatmutation and recombination impose on its haplotypic background We provide a method to exploit this pattern andinfer the time to the common ancestor of a positively selected allele following a rapid increase in frequency We do sousing a hidden Markov model which leverages the length distribution of the shared ancestral haplotype the accumu-lation of derived mutations on the ancestral background and the surrounding background haplotype diversity Usingsimulations we demonstrate how the inclusion of information from both mutation and recombination events increasesaccuracy relative to approaches that only consider a single type of event We also show the behavior of the estimator incases where data do not conform to model assumptions and provide some diagnostics for assessing and improvinginference Using the method we analyze population-specific patterns in the 1000 Genomes Project data to estimate thetiming of adaptation for several variants which show evidence of recent selection and functional relevance to diet skinpigmentation and morphology in humans

Key words haplotype allele age adaptation

IntroductionA complete understanding of adaptation depends on a de-scription of the genetic mechanisms and selective history thatunderly heritable traits (Radwan and Babik 2012) Once a ge-netic variant underlying a putatively adaptive trait has beenidentified several questions remain What is the molecularmechanism by which the variant affects organismal traitsand fitness (Dalziel et al 2009) what is the selective mecha-nism responsible for allelic differences in fitness did the var-iant arise by mutation more than once (Elmer and Meyer2011) and when did each unique instance of the variant ariseand spread (Slatkin and Rannala 2000) Addressing these ques-tions for numerous case studies of beneficial variants acrossmultiple species will be necessary to gain insight into generalproperties of adaptation (Stinchcombe and Hoekstra 2008)

Here our focus is on the last of the questions given abovethat is when did a mutation arise and spread Understandingthese dates can give indirect evidence regarding the selectivepressure that may underlie the adaptation This is especiallyuseful in cases where it is logistically infeasible to assess fitnessconsequences of a variant in the field directly (Barrett andHoekstra 2011) In humans for example dispersal across theglobe has resulted in the occupation of a wide variety ofhabitats and in several cases selection in response to specificecological pressures appears to have taken place There are

well-documented cases of loci showing evidence of recentselection in addition to being functionally relevant to knownphenotypes of interest (Jeong and Di Rienzo 2014)Nakagome et al (2016) specify time intervals defined by thehuman dispersal out-of-Africa and the spread of agricultureto show the relative concordance among allele ages for severalloci associated with autoimmune protection and risk skinpigmentation hair and eye color and lactase persistence

When a putative variant is identified as the selected sitethe nonrandom association of surrounding variants on achromosome can be used to understand its history Thiscombination of surrounding variants is called a haplotypeand the nonrandom association between any pair of variantsis called linkage disequilibrium (LD) Due to recombinationLD between the focal mutation and its initial background ofsurrounding variants follows a per-generation rate of decayNew mutations also occur on this haplotype at an averagerate per generation The focal mutationrsquos frequency follows atrajectory determined by the stochastic outcome of survivalmating success and offspring number If the allelersquos selectivebenefit increases its frequency at a rate faster than the rate atwhich LD decays the resulting signature is one of high LD anda reduction of polymorphism near the selected mutation(Smith and Haigh 1974) Many methods to exploit this pat-tern have been developed in an effort to identify loci under

Article

The Author(s) 2018 Published by Oxford University Press on behalf of the Society for Molecular Biology and EvolutionThis is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License(httpcreativecommonsorglicensesby-nc40) which permits non-commercial re-use distribution and reproduction in anymedium provided the original work is properly cited For commercial re-use please contact journalspermissionsoupcom Open AccessMol Biol Evol 35(4)1003ndash1017 doi101093molbevmsy006 Advance Access publication January 18 2018 1003

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

recent positive selection (reviewed in Nielsen [2005]) A par-allel effort has focused on quantifying specific properties ofthe signature to infer the age of the selected allele

The most commonly used methods to estimate allele agerely on summary statistics These approaches can be furtherclassified as either heuristic or model-based methodsHeuristic approximations rely on a point estimate of themean length of the selected haplotype (using the decay ofhomozygosity around the selected locus) or a count of de-rived mutations within an arbitrary cutoff distance from theselected site (Thomson et al 2000 Tang et al 2002Meligkotsidou and Fearnhead 2005 Hudson 2007 Coopet al 2008) These approaches ignore uncertainty in the ex-tent of the selected haplotype on each chromosome whichcan lead to inflated confidence in the point estimates

Alternative model-based approaches that also use sum-mary statistics employ an Approximate BayesianComputation (ABC) framework These methods use an ex-plicit model for simulation to identify a distribution of agesthat are consistent with the observed data (Tavare et al 1997Pritchard et al 1999 Beaumont et al 2002 Przeworski 2003Voight et al 2006 Tishkoff et al 2007 Peter et al 2012 BelezaSantos et al 2013 Nakagome et al 2016 Ormond et al 2016)This provides a measure of uncertainy induced by the ran-domness of recombination mutation and genealogical his-tory and produces an approximate posterior distribution onallele age Despite these advantages ABC approaches sufferfrom an inability to capture all relevant features of the sampledue to their reliance on summary statistics

As full-sequencing data become more readily availabledefining the summary statistics which capture the complexLD among sites and the subtle differences between haplo-types will be increasingly challenging For this reason effi-ciently computable likelihood functions that leverage thefull sequence data rather than low dimensional summariesof the data are increasingly favorable

Several approaches attempt to compute the full likelihoodof the data using an importance sampling framework (Slatkin2001 Coop and Griffiths 2004 Slatkin 2008 Chen and Slatkin2013) Conditioning on the current frequency of the selectedallele frequency trajectories and genealogies are simulatedand given weight proportional to the probability of their oc-currence under a population genetic model While theseapproaches aim to account for uncertainty in the allelersquos fre-quency trajectory and genealogy they remain computation-ally infeasible for large samples or do not considerrecombination across numerous loci

In a related problem early likelihood-based methods fordisease mapping have modelled recombination around theancestral haplotype providing information for the time to thecommon ancestor (TMRCA) rather than time of mutation(Rannala and Reeve 2001 2002 McPeek and Strahs 1999Morris et al 2000 2002) These models allowed for the treat-ment of unknown genealogies and background haplotypediversity before access to large data sets made computationat the genome-wide scale too costly Inference is performedunder Markov chain Monte Carlo (MCMC) to sample overthe unknown genealogy while ignoring LD on the background

haplotypes or approximating it using a first-order Markovchain In a similar spirit Chen et al (2015) revisit this classof models to estimate the strength of selection and time ofmutation for an allele under positive selection using a hiddenMarkov model (HMM)

HMMs have become a routine tool for inference in popu-lation genetics The Markov assumption allows for fast com-putation and has proven an effective approximation forinferring the population-scaled recombination rate the demo-graphic history of population size changes and the timing andmagnitude of admixture events among genetically distinctpopulations (Li and Stephens 2003 Price et al 2009 Hinchet al 2011 Li and Durbin 2011 Wegmann et al 2011) Theapproach taken by Chen et al (2015) is a special case of twohidden statesmdashthe ancestral and background haplotypes Theancestral haplotype represents the linked background that thefocal allele arose on while the background haplotypes repre-sent some combination of alleles that recombine with theancestral haplotype during its increase in frequency Chenet al (2015) compute maximum-likelihood estimates for thelength of the ancestral haplotype on each chromosome car-rying the selected allele Inference for the time of mutation isperformed on these fixed estimates assuming that they areknown The authors condition the probability of an ancestryswitch event on a logistic frequency trajectory for the selectedallele and assume independence among haplotypes leading tothe common ancestor The likelihood for background haplo-types is approximated using a first-order Markov chain toaccount for nonindependence among linked sites

Here we present an HMM that leverages both the length ofthe ancestral haplotype on each chromosome as well as de-rived mutations that have accumulated on the ancestral hap-lotype Our method implements an MCMC which samplesover the unknown ancestral haplotype to generate a sampleof the posterior distribution for the TMRCA Our emissionprobabilities account for the LD structure among backgroundhaplotypes using the Li and Stephens (2003) haplotype copyingmodel and a reference panel of haplotypes without the se-lected allele (fig 1b and c) In contrast to the first-order Markovchain employed by Chen et al (2015) the Li and Stephens(2003) model provides an approximation to the coalescentwith recombination by modelling a focal haplotype as an im-perfect mosaic of haplotypes in the reference panel

While Chen et al (2015) use a mutation parameter in theirHMM the count of derived mutations on the backgroundhaplotype does not directly influence their estimation oftime since mutation The probability of observing a mutationon the selected haplotypes of beneficial allele carriers dependson two parameters The per-generation mutation rate and thetime to the common ancestor (TMRCA) The Chen et almodel uses a compound parameter for these such that theobserved mutations do not directly inform their estimates oftiming In our model we separately include the TMRCA andmutation rate as parameters and thus incorporate informationfrom mutations directly into our inference of the TMRCA

Our approach also differs in that we do not presume toknow the true extent of the ancestral haplotype and insteadtreat it as a latent variable to be marginalized over This allows

Smith et al doi101093molbevmsy006 MBE

1004

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

our estimation of the TMRCA to reflect uncertainty in theprecise switch point off of the ancestral haplotype which inmany cases will be difficult to distinguish from the backgroundhaplotypes Another significant difference is that our modeldoes not make assumptions about the frequency trajectoryapart from that a sufficiently hard sweep occurred to incur astar-shaped genealogy Below we use simulations to show thesensitivity of our model to these simplified assumptions forvarying strengths of selection final allele frequencies and sam-pling regimes for the choice of reference panel An R package isavailable to implement this method on github (httpsgithubcomjhavsmithstartmrca last accessed January 23 2018)

New Approaches

Model DescriptionIn general the TMRCA for a sample of haplotypes carryingthe advantageous allele (hereafter referred to as tca) will bemore recent than the time of mutation (Kaplan et al 1989)We aim to estimate tca in the case where a selectively advan-tageous mutation occurred in an ancestor of our sample t1

generations ago (fig 1a in main text) Viewed backwards intime the selected variant decreases in frequency at a rateproportional to the selection strength During a rapid dropin allele frequency the coalescent rate among haplotypescarrying the selected variant is amplified The same effect

would be observed for population growth from a small initialsize forward in time (Hudson 1990 Slatkin and Hudson 1991)As a result the genealogy of a sample having undergoneselection andor population growth becomes more ldquostar-shapedrdquo This offers some convenience as it becomes moreappropriate to invoke an assumption of independenceamong lineages when selection is strong We would like toemphasize that this assumption necessarily implies that thebeneficial allele has a single ancestral haplotype that has in-creased in frequency This is in contrast to a scenario in whichthe beneficial allele has been present in the population forsome time prior to selection For that case multiple ancestralhaplotypes would increase in frequency simultaneouslyresulting in a genealogy that is not star-shaped

We assume no crossover interference between recombi-nation events within a haplotype and therefore treat eachside flanking the focal allele separately We define one side ofthe selected site within a window of some predeterminedlength to have L segregating sites such that an individualrsquossequence will be indexed from site s frac14 f1 Lg wheresfrac14 1 refers to the selected site (a notation reference is pro-vided in table 1) To simplify notation this description will bewritten for a window on one side flanking the selected siteNote that the opposing side of the selected site is modelled inan identical fashion after redefining L

Let X denote an n L data matrix for a sample of nchromosomes with the selected variant Xij is the observedallelic type in chromosome i at variant site j and is assumed tobe biallelic where Xij 2 f1 0g Let H denote an m L matrixcomprising m chromosomes that do not have the selectedvariant where Hij 2 f1 0g Let A denote the ancestral hap-lotype as a vector of length L where Aj is the allelic type on theancestral selected haplotype at segregating site j andAj 2 f1 0g We assume independence among lineages lead-ing to the most recent common ancestor of the selectedhaplotype This is equivalent to assuming a star-shaped ge-nealogy which as noted above is a reasonable assumption forsites linked to a favorable variant under strong selection Wecan then write the likelihood as

PrethXjtcaAHTHORN frac14Yn

i

PrethXijtcaAHTHORN (1)

In each individual haplotype Xi we assume the ancestralhaplotype extends from the selected allele until a recombina-tion event switches ancestry to a different genetic backgroundLet Wfrac14w indicate that the location of the first recombi-nation event occurs between sites w and wthorn 1 where W2 f1 Lg (wfrac14 L indicates no recombination up to siteL) We can then condition the probability of the data on theinterval where the first recombination event occurs and sumover all possible intervals to express the likelihood as

PrethXijtcaAHTHORN frac14XL

wfrac141

PrethXijtcaAHWi frac14 wTHORN

PrethWi frac14 wjtcaTHORN (2)

FIG 1 Visual descriptions of the model (a) An idealized illustration ofthe effect of a selectively favored mutationrsquos frequency trajectory(black line) on the shape of a genealogy at the selected locus Theorange lineages are chromosomes with the selected allele The bluelineages indicate chromosomes that do not have the selected alleleNote the distinction between the time to the common ancestor ofchromosomes with the selected allele tca and the time at which themutation arose t1 (b) The copying model follows the ancestral hap-lotype (orange) moving away from the selected site until recombina-tion events within the reference panel lead to a mosaic of nonselectedhaplotypes surrounding the ancestral haplotype (c) A demographichistory with two choices for the reference panel Local and divergedAfter the ancestral population at the top of the figure splits into twosister populations a beneficial mutation arises and begins increasingin frequency The orange and blue colors indicate frequency of theselected and nonselected alleles respectively

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1005

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Assuming haplotype lengths are independent and identi-cally distributed draws from an exponential distribution thetransition probabilities for a recombination event off of theancestral haplotype are

PrethWi frac14 wjtcaTHORN

frac14ertcadweth1 ertcaethdwthorn1dwTHORNTHORN if w frac14 f1 ethL 1THORNg

ertcadL if w frac14 L

(

(3)

where dw is the distance in base pairs of site w from theselected site and r is the local recombination rate per basepair per generation The data for each individual Xi can bedivided into two parts One indicating the portion of anindividualrsquos sequence residing on the ancestral haplotype (be-fore recombining between sites w and wthorn 1) XiethjwTHORN andthat portion residing off of the ancestral haplotype after arecombination event XiethjgtwTHORN We denote a separate likelihoodfor each portion

aiw frac14 PrethXiethjwTHORNjtcaAWi frac14 wTHORN (4)

biw frac14 PrethXiethjgtwTHORNjHethjgtwTHORNWi frac14 wTHORN (5)

Because the focal allele is on the selected haplotypeai1 frac14 1 Conversely we assume a recombination eventoccurs at some point beyond locus L such thatbiL frac14 1 We assume the waiting time to mutation ateach site on the ancestral haplotype is exponentiallydistributed with no reverse mutations and express thelikelihood as

aiw frac14 PrethXiethjwTHORNjtcaAWi frac14 wTHORN

frac14 etcalethdwwTHORNYw

jfrac142

PrethXij frac14 ajtcaATHORN (6)

PrethXij frac14 ajtcaATHORN frac14etcal if a frac14 Aj

1 etcal if a frac14 Aj

((7)

The term etcalethdwwTHORN on the right side of equation (6)captures the lack of mutation at invariant sites betweeneach segregating site Assuming tcal is small equation (6) isequivalent to assuming a Poisson number of mutations(with mean tcal) occurring on the ancestral haplotype

For biw the probability of observing a particular sequenceafter recombining off of the ancestral haplotype is dependenton standing variation in background haplotype diversity TheLi and Stephens (2003) haplotype copying model allows forfast computation of an approximation to the probability ofobserving a sample of chromosomes related by a genealogywith recombination Given a sample of m haplotypesH 2 fh1 hmg a population scaled recombination rateq and mutation rate h an observed sequence of alleles ismodeled as an imperfect copy of any one haplotype in thereference panel at each SNP Let Zij denote the reference panelhaplotype which Xi copies at the jth SNP and cj denote thenumber of base pairs between SNPs j and jthorn 1 Zij follows aMarkov process with transition probabilities

PrethZiethjthorn1THORN frac14 z0jZij frac14 zTHORN

frac14eqjcj=m thorn eth1 eqjcj=mTHORNeth1=mTHORN if z0 frac14 z

eth1 eqjcj=mTHORNeth1=mTHORN if z0 frac14 z

(

(8)

To include mutation the probability that the sampledhaplotype matches a haplotype in the reference panel ism=ethmthorn hTHORN and the probability of a mismatch (or muta-tion event) is h=ethmthorn hTHORN Letting a refer to an allele wherea 2 f1 0g the matching and mismatching probabilitiesare

PrethXij frac14 ajZij frac14 z h1 hmTHORN

frac14m=ethmthorn hTHORN thorn eth1=2THORNethh=ethmthorn hTHORNTHORN if hzj frac14 a

eth1=2THORNethh=ethmthorn hTHORNTHORN if hzj frac14 a

(

(9)

Equation (5) requires a sum over the probabilities of allpossible values of Zj using equations (8) and (9) This is com-puted using the forward algorithm as described in Rabiner(1989) and Appendix A of Li and Stephens (2003) It shouldbe noted that this formulation does not model the observa-tion of an invariant site among the background haplotypesWe tried an approach to model these sites but saw no im-provement in model performance (see supplementary ap-pendix S2 Supplementary Material online)

The complete likelihood for our problem can then beexpressed as

Table 1 Notation Used to Describe the Model

n Number of haplotypes with the selected allelem Number of haplotypes without the selected alleleL Number of SNPs flanking the selected site (one side considered

at a time)X n L matrix of haplotypes with the selected alleleH m L matrix of haplotypes without the selected alleleXij Allele in haplotype i at SNP j where i 2 f1 ng and

j 2 f1 LgHzj Allele in haplotype z at SNP j where z 2 f1 mg and

j 2 f1 LgAj Allele at site j on the ancestral haplotypeZij The reference panel haplotype from which Xi copies at site jtca Time to the most recent common ancestor (TMRCA)Wi The location of the first recombination event off of the ancestral

haplotyper Recombination rate per basepair per generationl Mutation rate per basepair per generationh Haplotype miscopying rate or population-scaled mutation rate

eth4NlTHORNq Haplotype switching rate or population-scaled recombination

rate eth4NrTHORNdw Physical distance of site w from the selected site where

w 2 f1 Lgcj Number of basepairs between sites j and j thorn 1aiw Likelihood of haplotype i for sites 1 wbiw Likelihood of haplotype i for SNPs ethwthorn 1THORN L

Smith et al doi101093molbevmsy006 MBE

1006

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

PrethXjtcaAHTHORN frac14Yn

ifrac141

XL

wfrac141

aiwbiwPrethWi frac14 wjtcaATHORN (10)

This computation is on the order 2Lnm2 and in practicefor mfrac14 20 nfrac14 100 and Lfrac14 4000 takes approximately3027 s to compute on an Intel Core i7-4750HQ CPU at200 GHz 8 with 156 GB RAM

InferencePerforming inference on tca requires addressing the latentvariables w and A in the model Marginalizing over possiblevalues of w is a natural summation per haplotype that is linearin L as shown above For A the number of possible values islarge (2L) and so we employ a MetropolisndashHastings algorithmto jointly sample the posterior of A and tca and then we takemarginal samples of tca for inference We assign a uniformprior density for both A and tca such that PrethATHORN frac14 1=2L andPrethtcaTHORN frac14 1=ethtmax tminTHORN where tmax and tmin are user-specified maximum and minimum values for tca ProposedMCMC updates of the ancestral haplotype A0 are generatedby randomly selecting a site in A and flipping to the alterna-tive allele For tca proposed values are generated by adding anormally distributed random variable centered at 0tca0 frac14 tca thorn Neth0 r2THORN To start the MetropolisndashHastings algo-

rithm an initial value of tca is uniformly drawn from a user-specified range of values (10ndash2000 in the applications here)To initialize the ancestral haplotype to a reasonable value weuse a heuristic algorithm which exploits the characteristicdecrease in variation near a selected site (see supplementaryappendix S2 Supplementary Material online)

For each haplotype in the sample of beneficial allele car-riers the Li and Stephens (2003) model uses a haplotypemiscopying rate h and switching rate q to compute a like-lihood term for loci following the recombination event off ofthe ancestral haplotype For our analyses we set q frac14 44104 using our simulated values of r frac14 11 108 per bpper generation and Nfrac14 10000 where q frac14 4Nr Following Li

and Stephens (2003) we fix h frac14 ethPn

mfrac1411=mTHORN1 as derived

from the expected number of mutation events on a geneal-ogy relating n chromosomes at a particular site We found nodiscernible effects on estimate accuracy when specifying dif-ferent values of q (supplementary fig 1 SupplementaryMaterial online)

ResultsBecause our model requires a sample (or ldquopanelrdquo) of referencehaplotypes without the selected allele we tested our methodfor cases in which the reference panel is chosen from the localpopulation in which the selected allele is found as well ascases where the panel is from a diverged population wherethe selected haplotype is absent (fig 1c) Regardless of sce-nario the estimates are on average within a factor of 2 of thetrue value and often much closer When using a local refer-ence panel point estimates of tca increasingly underestimatethe true value (TMRCA) as selection becomes weaker and thefinal allele frequency increases (fig 2) Put differently the age

of older TMRCAs tends to be underestimated with local ref-erence panels Using the mean posteriors as point estimatesmean values of log2(estimatetrue value) range from062 to014 Simulations using a diverged population for the refer-ence panel removed the bias though only in cases where thedivergence time was not large For a reference panel divergedby 05N generations mean log2(estimatetrue value) valuesrange from 021 to 018 As the reference panel becomestoo far diverged from the selected population estimates be-come older than the true value (036 to 094 log2(estimatetrue value)) In these cases the HMM is unlikely to infer aclose match between background haplotypes in the sampleand the reference panel leading to many more mismatchesbeing inferred as mutation events on the ancestral haplotypeand an older estimate of tca

The bottom panel of figure 2 shows the effect of selectionstrength and final allele frequency on the size of the 95credible interval around point estimates normalized by thetrue TMRCA for each simulated data set Before normalizingcredible interval sizes using a local reference panel range from73 to 213 generations for 2Ns frac14 100 versus 18 to 22 gener-ations when 2Ns frac14 2000 Using local and diverged referencepanels we found a minimal effect of the sample size on pointestimates (supplementary figs 2 and 3 SupplementaryMaterial online) As expected larger sample sizes for the car-rier panel improve estimate accuracy However higher allelefrequencies and weak selection are likely to induce more un-certainty due to the ancestral haplotype tracts recombiningwithin the sample We find this effect more pronounced withlarge sample sizes for the reference panel We speculate that alarge sample of reference haplotypes leads the focal selectedhaplotype to have increased probability of copying from thereference panel leading to a shorter selected haplotype andslight overestimate of the TMRCA

We also performed simulations under varying degrees ofmutation and recombination rate misspecification (supple-mentary fig 4 Supplementary Material online) In most casesmean values of log2(estimatetrue value) stay within an orderof magnitude of 0 As expected when both the mutation andrecombination rate are misspecified we find the most dis-crepancy To assess the convergence properties of theMCMC five replicate chains were run for each of 20 sim-ulated data sets produced under three 2Ns values (100200 and 2000) for frequency trajectories ending at 01(supplementary fig 5 Supplementary Material online)While care is always warranted with MCMC approacheswe find in practice that convergence among our replicatechains is attained relatively quickly (3 000 iterations forsimulated data and 3000ndash9000 iterations for appliedcases see supplementary fig 9 Supplementary Materialonline)

We compared the performance of our estimator withthree other model-based approaches for allele age estimationby matching the simulation scheme performed by Chen et al(2015) (supplementary table 5 Supplementary Material on-line) Our method shows improvement in accuracy (RMSE)andor lower bias for simulations with lower frequencies ofthe beneficial allele (40) regardless of the reference panel

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1007

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

used In cases where the final beneficial allele frequency ishigher (80) our methodrsquos accuracy remains as good or bet-ter than the other methods when using a diverged referencepanel with a two-orders-of-magnitude improvement of biasunder strong selection (sfrac14 005) Estimates when using a localreference panel and a high final beneficial allele frequencyremain comparable to the other methods for strong selectionbut tend to have more bias and decreased accuracy as selec-tion strength decreases

Assuming a star-genealogy among beneficial allele carriersmay result in underestimating the variance for the posteriordistribution when there is nonindependence in our sample

To measure this affect we computed TMRCA estimates on100 bootstrap replicates for 4 simulated data sets under 2selection strengths and 2 final allele frequencies (supplemen-tary table 7 Supplementary Material online) We find closeagreement between the 95 posterior credible intervals ofthe original data and the 95 confidence intervals computedon the bootstrap estimates for a selection strength of 01 forboth final allele frequencies of 04 and 08 As expected olderTMRCAs are likely to violate the star-genealogy assumptionand in these cases we find that estimates from our originaldata are more narrow than the bootstrap confidenceintervals

Frequency 01 Frequency 03 Frequency 06 Frequency 09

minus2

minus1

0

1

2

3

0005 001 01 0005 001 01 0005 001 01 0005 001 01

log2

(Est

imat

eTr

ue V

alue

)Local 02N 05N 10N

Frequency 01 Frequency 03 Frequency 06 Frequency 09

02

04

06

08

0005 001 01 0005 001 01 0005 001 01 0005 001 01Selection Strength

Nor

mal

ized

95

Cre

dibl

e In

terv

al R

ange

FIG 2 Accuracy results from simulated data Accuracy of TMRCA point estimates and 95 credible interval ranges from posteriors inferred fromsimulated data under different strengths of selection final allele frequencies and choice of reference panel Credible interval range sizes are in unitsof generations and are normalized by the true TMRCA for each simulated data set See Materials and Methods below for simulation details

Smith et al doi101093molbevmsy006 MBE

1008

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Recombination versus Mutation as a Source ofInformationWe compared our model-based inference with simpler esti-mates of the TMRCA using the number of derived mutationson the ancestral haplotype and the mean length of the an-cestral haplotype In addition to quantifying the improve-ment our method has over these calculations this alsoserves as an ad-hoc way to understand how the relativeweight of information from mutation and recombinationaffects the performance of our method One can model thehaplotype lengths as independent and exponentially distrib-uted to derive a recombination-based estimator tr as

tr frac141

wor (11)

where r is the recombination rate and wo is the observedmean ancestral haplotype length To leverage the count ofderived mutations on the ancestral haplotype we use theThomson et al (2000) estimator In a sample of n haplotypeswith the selected allele a mutation-based estimator tm canbe calculated as

tm frac141

n

Xn

i

yi

wil (12)

where yi is the number of derived mutations on the ith hap-lotype which has length wi basepairs See Hudson (2007) for aderivation of the estimate for the variance of the Thomsonestimator

When using derived mutations uncertainty in both theancestral haplotype sequence and the length of the ancestralhaplotype on each chromosome (wi) can lead to poor esti-mation To improve inference researchers typically define arestricted ldquononrecombiningrdquo region that may reliably containderived mutations on the ancestral haplotype This has twodisadvantages 1) There is more information available in thedata which cannot be used because excess caution is neces-sary to prevent overcounting of derived mutations and2) there may still be unobserved recombination events inthis restricted locus To minimize the use of heuristics for aderived mutation approach we used our model to findmaximum-likelihood estimates of the ancestral haplotypebreakpoints using equation (2) in the model descriptionWe also used the mean posterior estimate of the ancestralhaplotype from our model to identify derived mutations Tocalculate a recombination-based estimator of the TMRCA wecalculated wo using the same maximum-likelihood estimatesof the ancestral haplotype lengths inferred for the mutationestimator

When using a local reference panel the simple mutationestimator tm consistently underestimates the true TMRCAThe recombination estimate however remains accurate(supplementary fig 6 Supplementary Material online) Wesuspect this to be a result of poor estimation of the ancestralhaplotype and violation of the star-genealogy assumption Incases where selection is weaker and the genealogy is not star-shaped derived mutations occurring early in the genealogywill be overrepresented and incorrectly inferred to be the

ancestral allele In this way high-frequency-derived alleleswill not be counted As predicted increasing selectionstrength improves mutation estimator accuracy The recom-bination estimator appears robust to this effect as long asselection is not too strong For very strong selection andyoung TMRCA values maximum-likelihood estimates ofthe haplotype lengths become constrained by the size ofthe locus studied For example in simulations with a selectionstrength of 005 and frequency of 01 the mean TMRCA isaround 100 generations Using equation (11) above the meanlength of the ancestral haplotype for a TMRCA of 100 gen-erations is 2 Mb which is twice as large as the window size weuse to make computation for our simulations feasible Using alarger window around the selected locus would amelioratethis effect

When using a diverged reference panel we find an oppositeeffect In this case the count of derived mutations results inan overestimate and the haplotype lengths yield an under-estimate We suspect this to be driven by poor matchingbetween the reference panel and the background haplotypesamong beneficial allele carriers The low probability of match-ing between the reference and background haplotypes meansthat the lengths of the ancestral haplotype are inferred toextend further than their true lengths This also leads to anoverestimate for the mutation estimator because differencesbetween the ancestral and background haplotypes are incor-rectly inferred as derived mutations on the ancestralhaplotype

Application to 1000 Genomes DataWe applied our method to five variants previously identifiedas targets of recent selection in various human populations(fig 3) Using phased data from the 1000 Genomes Project wefocused on variants that are not completely fixed in any onepopulation so that we could use a local reference panel The Liand Stephens (2003) haplotype copying model is appropriatein cases where ancestry switches occur among chromosomeswithin a single population so we excluded populations in theAmericas for which high levels of admixture are known toexist

While the simulation results described above provide someintuition for the effects of selection strength final allele fre-quency and choice of reference panel we also performedsimulations using the demographic history inferred byTennessen et al (2012) to explore the effects of nonequili-brium demographic history on our estimation accuracy (sup-plementary fig 10 Supplementary Material online) We findsubtle differences in accuracy between the two demographichistories where the nonequilibrium histories lead to negligibledifferences in mean values for log2(estimatetrue value) andlarger credible interval ranges

ADH1BA derived allele at high frequency among East Asians at theADH1B gene (rs3811801) has been shown to be functionallyrelevant for alcohol metabolism (Osier et al 2002 Eng et al2007) Previous age estimates are consistent with the timing

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1009

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

of rice domestication and fermentation approximately10000 years ago (Li et al 2007 Peng et al 2010 Peter et al2012) However a more recent estimate by Peter et al (2012)pushes this time back several thousand years to 12876(2204ndash49764) years ago Our results are consistent with an

older timing of selection as our CHB sample (Han Chinese inBeijing China) TMRCA estimate is 15377 (13763ndash17281)years Replicate chains of the MCMC are generally consistentwith the oldest estimates in the CHB sample showing themost variation among resampled data sets and the youngest

FIG 3 Comparison of TMRCA estimates with previous results Violin plots of posterior distributions for the complete set of estimated TMRCAvalues for the five variants indicated in the legend scaled to a generation time of 29 years Each row indicates a population sample from the 1000Genomes Project panel Replicate MCMCs are plotted with transparency Points and lines overlaying the violins are previous point estimates and95 confidence intervals for each of the variants indicated by a color and rs number in the legend (see supplementary tables 3 and 4Supplementary Material online) The population sample abbreviations are defined in text

Smith et al doi101093molbevmsy006 MBE

1010

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

estimate of 10841 (9720ndash12147) in the KHV sample showingthe least When using a fine-scale recombination map all ofthe ADH1B TMRCAs are inferred to be slightly older (supple-mentary fig 7 Supplementary Material online)

EDARPopulation genomic studies have repeatedly identified thegene EDAR to be under recent selection in East Asians(Akey et al 2004 Williamson et al 2005 Voight et al 2006)with a particular site (rs3827760) showing strong evidence forbeing the putative target Functional assays and allele specificexpression differences at this position show phenotypic asso-ciations to a variety of phenotypes including hair thicknessand dental morphology (Bryk et al 2008 Fujimoto et al 2008Kimura et al 2009 Kamberov et al 2013)

Our estimate of 22192 (19683ndash25736) years for the EDARallele in the CHB sample is older than ABC-based estimates of12458 (1314ndash85835) and 13224 (4899ndash50692) years madeby Bryk et al (2008) and Peter et al (2012) respectivelyKamberov et al (2013) use spatially explicit ABC and maxi-mum likelihood approaches to compute older estimates of30925 (13175ndash39575) and 36490 (34775ndash38204) We in-cluded all populations for which the variant is present includ-ing the FIN and BEB samples where it exists at low frequencyOur results for the youngest TMRCAs are found in these twolow frequency populations where the estimate in FIN is17386 (13887ndash20794) and the estimate in BEB is 18370(14325ndash22872) Among East Asian populations the oldestand youngest TMRCA estimates are found in the KHV sam-ple (25683 23169ndash28380) and CHB sample (22192 19683ndash25736)

LCTArguably the best-studied signature of selection in humans isfor an allele at the LCT gene (rs4988235) which confers lactasepersistence into adulthoodmdasha trait unique among mammalsand which is thought to be a result of cattle domesticationand the incorporation of milk into the adult diets of severalhuman populations (Enattah et al 2002 Bersaglieri et al 2004Tishkoff et al 2007) There are multiple alleles that showassociation with lactase persistence (Tishkoff et al 2007)We focused on estimating the age of the T-13910 allele pri-marily found at high frequency among Northern Europeansbut which is also found in South Asian populations In addi-tion to association with the lactase persistence phenotypethis allele has been functionally verified by in vitro experi-ments (Olds and Sibley 2003 Troelsen et al 2003Kuokkanen et al 2006)

Mathieson et al (2015) use ancient DNA collected from83 human samples to get a better understanding of the fre-quency trajectory for several adaptive alleles spanning a timescale of 8000 years For the LCT persistence allele (rs4988235)they find a sharp increase in frequency in the past 4000 yearsago While this is more recent than previous estimates anearlier TMRCA or time of mutation is still compatible withthis scenario

Our estimates using European and South Asian samplesfall between the range from 5000 to 10000 years ago which isbroadly consistent with age estimates from modern data Thecredible intervals for estimates in all of the samples havesubstantial overlap which makes any ranking on the basisof point estimates difficult We infer the PJL (Punjabi fromLahore Pakistan) sample to have the oldest TMRCA estimateof 9514 (8596ndash10383) years Itan et al (2009) use spatialmodelling to infer the geographic spread of lactase allelefrom northern to southern Europe Consistent with theirresults the youngest estimate among European populationsis found in the IBS sample at 9341 (8688ndash9989) yearsAmong all samples the youngest estimate was found inBEB at 6869 (5143ndash8809)

KITLG and OCA2The genetic basis and natural history of human skin pigmen-tation is a well-studied system with several alleles of majoreffect showing signatures consistent with being targets ofrecent selection (Jablonski and Chaplin 2012 Beleza Santoset al 2013 Wilde et al 2014 Eaton et al 2015) We focused onan allele found at high-frequency world-wide among non-African populations at the KITLG locus (rs642742) whichshows significant effects on skin pigmentation differences be-tween Europeans and Africans (Miller et al 2007) althoughmore recent work fails to find any contribution of KITLGtoward variation in skin pigmentation in a Cape VerdeAfrican-European admixed population (Beleza Johnsonet al 2013) We also estimated the TMRCA for a melanin-associated allele at the OCA2 locus (rs1800414) which is onlyfound among East Asian populations at high frequency(Edwards et al 2010)

For the KITLG variant our estimates among differentpopulations vary from 18000 to 34000 years ago with theoldest age being in the YRI (Yoruba in Ibadan Nigeria)sample (33948 28861ndash39099) The youngest TMRCA isfound in FIN at 18733 years (16675ndash20816) The nexttwo youngest estimates are also found in Africa withthe TMRCA in the MSL (Mende in Sierra Leone) samplebeing 22340 (15723ndash28950) years old and that for LWK(Luhya in Webuye Kenya) being 22784 (17922ndash28012)years old suggesting a more complex history than amodel of a simple allele frequency increase outside ofAfrica due to pigmentation related selection pressuresPrevious point estimates using rejection samplingapproaches on a Portuguese sample (32277 6003ndash80683) and East Asian sample (32045 6032ndash98165)are again most consistent with our own results on theIBS (29731 26170ndash32813) and CHB samples (2677324297ndash30141) (Beleza Santos et al 2013 Chen et al2015) Among East Asians the oldest and youngest esti-mates are again found in the JPT (28637 24297ndash30141)and KHV (24544 21643ndash27193) samples respectivelyThe TMRCA for OCA2 alleles in the JPT (1859916110ndash20786) and KHV (16370 14439ndash18102) samplesare also the oldest and youngest respectively

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1011

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

DiscussionOur method improves estimation for the timing of selectionon a beneficial allele using a tractable model of haplotypeevolution This approach leverages detailed information inthe data while remaining amenable to large sample sizesUsing both carriers and noncarriers of the allele we canmore effectively account for uncertainty in the extent ofthe ancestral haplotype and derived mutations We showthe performance of our method using simulations of differentselection strengths beneficial allele frequencies and choicesof reference panel By applying our method to five variantspreviously identified as targets of selection in human popu-lations we provide a comparison among population-specificTMRCAs This gives a more detailed account of the order inwhich populations may have acquired the variant andorexperienced selection for the trait it underlies

In that regard it is hypothesized that local selection pres-sures and a cultural shift toward agrarian societies have in-duced adaptive responses among human populations aroundthe globe The data associated with some variants seem toindicate more recent selective events than others Our resultsfor variants associated with dietary traits at the LCT andADH1B genes both imply relatively recent TMRCAs(lt 15 000 years ago) consistent with hypotheses that selec-tion on these mutations results from recent changes in hu-man diet following the spread of agriculture (Simoons 1970Peng et al 2010) In contrast the inferred TMRCAs for EDARKITLG and OCA2 imply older adaptive events which mayhave coincided more closely with the habitation of new envi-ronments or other cultural changes

Several hypotheses have been suggested to describe theselective drivers of skin pigmentation differences among hu-man populations including reduced UV radiation at highlatitudes and vitamin D deficiency (Loomis 1967 Jablonskiand Chaplin 2000) Estimated TMRCAs for the variants atthe OCA2 and EDAR loci among East Asians appear to beas young or younger than the KITLG variant but older thanthe LCT and ADH1B locus This suggests a selective history inEast Asian populations leading to adaptive responses forthese traits occurring after an initial colonization In somecases the dispersion of replicate MCMC estimates makes itdifficult to describe the historical significance of an observedorder for TMRCA values However the consistency of esti-mates among different populations for particular variantsadds some confidence to our modelrsquos ability to reproducethe ages which are relevant to those loci or certain geographicregions

To assess the relative concordance of our estimates withthose from previous approaches we compared our resultswith a compilation of previously published estimates basedon the time of mutation time since fixation or TMRCA ofvariants associated with the genes studied here (supplemen-tary fig 8 Supplementary Material online) The range of con-fidence intervals for these studies is largely a reflection of theassumptions invoked or relaxed for any one method as wellas the sample size and quality of the data used In principleour method extracts more information than approaches that

use summary statistics such as ABC In our empirical appli-cation we found that our method provides a gain in accuracywhile accounting for uncertainty in both the ancestral hap-lotype and its length on each chromosome Notably ourmethod provides narrower credible intervals by incorporatingthe full information from ancestral haplotype lengths derivedmutations and a reference panel of noncarrier haplotypes

Another caveat of our method is its dependence on thereference panel which is intended to serve as a representativesample of nonancestral haplotypes in the population duringthe selected allelersquos increase in frequency Four possible chal-lenges can arise 1) Segments of the ancestral selected haplo-type may be present in the reference panel due torecombination (this is more likely for alleles that havereached higher frequency) 2) the reference panel may con-tain haplotypes that are similar to the ancestral haplotypedue to low levels of genetic diversity 3) the reference panelmay be too diverged from the focal population and 4) pop-ulation connectivity and turnover may lead the ldquolocalrdquo refer-ence panel to be largely composed of migrant haplotypeswhich were not present during the selected allelersquos initial in-crease in frequency

Under scenarios 1 and 2 the background haplotypes willbe too similar to the ancestral haplotype and it may be dif-ficult for the model to discern a specific ancestry switch lo-cation This leads to fewer differences (mutations) thanexpected between the ancestral haplotype and each benefi-cial allele carrier The simulation results are consistent withthis scenario Our method tends to underestimate the trueage across a range of selection intensities and allele frequen-cies when using a local reference panel

Conversely under scenarios 3 and 4 the model will fail todescribe a recombinant haplotype in the sample of beneficialallele carriers as a mosaic of haplotypes in the reference panelAs a result the model will infer more mutation events toexplain observed differences from the ancestral haplotypeOur simulation results show this to be the case with referencepanels diverged by N generations Posterior mean estimatesare consistently older than their true value Our simulationsare perhaps pessimistic thoughmdashwe chose reference paneldivergence times of N and 05N generations approximatelycorresponding to FST values of 04 and 02 respectively Forthe smaller FST values observed in humans we expect resultsfor diverged panels to be closer to those obtained with thelocal reference panel Nonetheless future extensions to incor-porate multiple populations within the reference panel wouldbe helpful and possible by modifying the approach of Priceet al (2009) Such an approach would also enable the analysisof admixed populations (we excluded admixed samples fromour analysis of the 1000 Genomes data above)

Aside from the challenges imposed by the choice of refer-ence panel another potential source of bias lies in our tran-sition probabilities which are not conditioned on thefrequency of the selected variant In reality recombinationevents at some distance away from the selected site will onlyresult in a switch from the ancestral to background haplo-types at a rate proportional to 1 pl where pl is the fre-quency of the ancestral haplotype alleles at locus l In this way

Smith et al doi101093molbevmsy006 MBE

1012

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

some recombination events may go unobservedmdashas thebeneficial allele goes to high frequency the probability of arecombination event leading to an observable ancestral tobackground haplotype transition decreases One solutionmay be to include the frequency-dependent transition prob-abilities derived by Chen et al (2015) Under their model themutation time is estimated by assuming a deterministic lo-gistic frequency trajectory starting at 1

2N An additional benefitof using frequency trajectories would be the ability to inferposterior distributions on selection coefficients While theselection coefficient is typically assumed to be related tothe time since mutation by t1 frac14 logethNsTHORN=s we do nothave an equivalent expression for time to the common an-cestor Rather than the initial frequency being 1

2N for a newmutation our initial frequency must correspond to the fre-quency at which the TMRCA occurs Griffiths and Tavare(1994) derive a framework to model a genealogy under arbi-trary population size trajectories which should be analogousto the problem of an allele frequency trajectory and addi-tional theory on intra-allelic genealogies may be useful here aswell (Griffiths and Tavare 1994 Wiuf and Donnelly 1999 Wiuf2000 Slatkin and Rannala 2000)

Our model also assumes independence among all haplo-types in the sample in a composite-likelihood frameworkwhich is equivalent to assuming a star-genealogy (Larribeand Fearnhead 2011 Varin et al 2011) This is unlikely tobe the case when sample sizes are large or the TMRCA isold It is also unlikely to be true if the beneficial allele existedon multiple haplotypes preceding the onset of selection wasintroduced by multiple migrant haplotypes from other pop-ulations or occurred by multiple independent mutationevents (Innan and Kim 2004 Hermisson and Pennings2005 Przeworski et al 2005 Pritchard et al 2010 Berg andCoop 2015) Methods for distinguishing selection from stand-ing variation versus de novo mutation are available thatshould make it easier distinguish these cases (Messer andNeher 2012 Peter et al 2012 Messer and Petrov 2013Garud et al 2015)

If the underlying allelic genealogy is not star-like onecan expect different estimates of the TMRCA for differentsubsets of the data Here we performed multiple MCMCson resampled subsets of the data to informally diagnosewhether there are violations from the star-shape geneal-ogy assumption We speculate that exactly how theTMRCAs vary may provide insight to the underlyinghistory In cases where the TMRCA estimates for a par-ticular population are old and more variable than otherpopulations the results may be explained by structure inthe genealogy whereby recent coalescent events haveoccurred among the same ancestral haplotype beforethe common ancestor When estimates are dispersedamong resampled data sets the presence of multiple an-cestral haplotypes prior to the variantrsquos increase in fre-quency may be a better explanation Further support forthis explanation might come from comparisons to otherpopulation samples which show little to no dispersion ofestimates from resampled data sets Future work mightmake it possible to formalize this inference process

A final caveat regards the misspecification of mutation andrecombination rates TMRCA estimates are largely deter-mined by the use of accurate measures for these two param-eters In a way this provides some robustness to our methodOur age estimates depend on mutation and recombinationrates so accurate specification for one of the values can com-pensate for slight misspecification in the other In cases wherea fine-scale recombination map is unavailable we suggest us-ing a uniform recombination rate specific to the locus ofinterest (see Materials and Methods and supplementaryfig 7 Supplementary Material online) Choosing an appropri-ate mutation rate will continue to depend on current andfuture work which tries to resolve discrepancies in publishedmutation rate estimates inferred by various approaches(Segurel et al 2014)

One future direction for our method may be to explicitlyincorporate the possibility of multiple ancestral haplotypeswithin the sample Under a disease mapping frameworkMorris et al (2002) implement a similar idea in the case whereindependent disease causing mutations arise at the same lo-cus leading to independent genealogies for which they cointhe term ldquoshattered coalescentrdquo For our case beneficialmutations may also be independently derived on differenthaplotypes Alternatively a single mutation may be oldenough to reside on different haplotypes due to a sufficientamount of linked variation existing prior to the onset of se-lection Berg and Coop (2015) model selection from standingvariation to derive the distribution of haplotypes that theselected allele is present on

While we have treated the TMRCA as a parameter of in-terest our method also produces a sample of the posteriordistribution on the ancestral haplotype This could provideuseful information to estimate the frequency spectrum ofderived mutations on the ancestral haplotype Similarly thefrequency of shared recombination breakpoints could shedlight on the genealogy and how well it conforms to the star-shape assumption The extent of the ancestral haplotype ineach individual may also prove useful for identifying delete-rious alleles that have increased in frequency as a result ofstrong positive selection on linked beneficial alleles (Chun andFay 2011 Hartfield and Otto 2011) For example Huff et al(2012) describe a risk allele for Crohnrsquos disease at high fre-quency in European populations which they suggest is linkedto a beneficial allele under recent selection Similar to anadmixture mapping approach our method could be usedto identify risk loci by testing for an association betweenthe ancestral haplotype and disease status As another appli-cation identifying the ancestral haplotype may be useful inthe context of identifying a source population (or species) fora beneficial allele prior to its introduction and subsequentincrease in frequency in the recipient population

In many cases the specific site under selection may beunknown or restricted to some set of putative sites Whileour method requires the position of the selected site to bespecified future extensions could treat the selected site as arandom variable to be estimated under the same MCMCframework This framework would also be amenable to mar-ginalizing over uncertainty on the selected site

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1013

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

While we focus here on inference from modern DNAdata the increased accessibility of ancient DNA has addeda new dimension to population genetic data sets (Lazaridiset al 2014 Skoglund et al 2014 Allentoft et al 2015 Haaket al 2015 Mathieson et al 2015) Because it will remaindifficult to use ancient DNA approaches in many specieswith poor sample preservation we believe methods basedon modern DNA will continue to be useful going forwardThat said ancient DNA is providing an interesting avenue forcomparative work between inference from modern and an-cient samples For example Nakagome et al (2016) usesimulations to assess the fit of this ancient DNA polymor-phism to data simulated under their inferred parametervalues for allele age and selection intensity and they findreasonable agreement Much work still remains to fullyleverage ancient samples into population genetic inferencewhile accounting for new sources of uncertainty and poten-tial for sampling bias

Despite these challenges it is clear that our understandingof adaptive history will continue to benefit from new com-putational tools which extract insightful information from adiverse set of data sources

Materials and MethodsWe generated data using the software mssel (Hudson R per-sonal communication) which simulates a sample of haplo-types conditioned on the frequency trajectory of a selectedvariant under the structured coalescent (Hudson and Kaplan1988 Kaplan et al 1988) Trajectories were first simulatedforwards in time under a WrightndashFisher model for an additivelocus with varying strengths of selection and different endingfrequencies of the selected variant Trajectories were thentruncated to end at the first time the allele reaches a specifiedfrequency See supplementary table 1 SupplementaryMaterial online for relative ages of simulated TMRCA val-ues for different end frequencies and selection strengthsFor the results in figure 2 100 simulations were performedfor each parameter combination MCMCs were run for5000 iterations with a burn-in excluding the first 3000iterations A standard deviation of 10 was used for theproposal distribution of tca The red boxplots indicate localreference panels The blue and green boxplots indicate ref-erence panels diverged by 5N generations and 1N gener-ations respectively Each data set was simulated for a 1-Mblocus with a mutation rate of 1 108 recombination rateof 1 108

and population size of 10000 Sample sizesfor the selected and reference panels were 100 and 20respectively

For more efficient run times of the MCMC we set a max-imum number of individuals to include in the selected andreference panels to be 100 and 20 respectively In cases wherethe true number of haplotypes for either panel was greaterthan this in the full data set we resampled a subset ofhaplotypes from each population for a total of five replicatesper population For simulation results supporting the use ofthis resampling strategy see supplementary figure 5Supplementary Material online The MCMCs were run for

15000 iterations with a standard deviation of 20 for theTMRCA proposal distribution The first 9000 iterationswere removed as a burn-in leading to 6000 iterations for asample of the posterior Convergence was assessed by com-parison of MCMC replicates Figure 3 and supplementaryfigure 8 Supplementary Material online show the resultsfor all five variants along with previous point estimates and95 confidence intervals assuming a generation time of29 years (Fenner 2005) Supplementary table 3Supplementary Material online lists the mean and 95 cred-ible intervals for estimates with the highest mean posteriorprobability which we refer to in the text belowSupplementary table 4 Supplementary Material online liststhe previous estimates and confidence intervals with addi-tional details of the different approaches taken

To model recombination rate variation we used recombi-nation rates from the Decode sex-averaged recombinationmap inferred from pedigrees among families in Iceland(Kong et al 2010) Because some populations may have re-combination maps which differ from the Decode map at finescales we used a mean uniform recombination rate inferredfrom the 1-Mb region surrounding each variant The motiva-tion for this arises from how recombination rates have beenpreviously shown to remain relatively consistent among re-combination maps inferred for different populations at themegabase-scale (Broman et al 1998 Baudat et al 2010 Konget al 2010 Auton and McVean 2012) Further we found ourestimates depend mostly on having the megabase-scale re-combination rate appropriately set with little difference inmost cases for estimates obtained by modeling fine-scale re-combination at each locus (supplementary fig 7Supplementary Material online) We specify the switchingrate among background haplotypes after recombining off ofthe ancestral haplotype to be 4Nr where Nfrac14 10000 and r isthe mean recombination rate for the 1-Mb locus

For modeling mutation a challenge is that previous mu-tation rate estimates vary depending on the approach used(Segurel et al 2014) Estimates using the neutral substitutionrate between humans and chimps are more than 2 108

per bp per generation while estimates using whole-genomesequencing are closer to 1 108 As a compromise wespecify a mutation rate of 16 108

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe would like to thank Hussein Al-Asadi Arjun BiddandaAnna Di Rienzo Dick Hudson Choongwon Jeong Evan KochJoseph Marcus Shigeki Nakagome Ben Peter Mark ReppelAlex White members of the Coop laboratory at UC Davismembers of the Przeworski laboratory at ColumbiaUniversity and members of the He and Stephens laboratoriesat the University of Chicago for helpful comments JS wassupported by an NSF Graduate Research Fellowship andNational Institute of General Medical Sciences of the

Smith et al doi101093molbevmsy006 MBE

1014

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

National Institutes of Health under award numbers DGE-1144082 and T32GM007197 respectively This work wasalso supported by the National Institute of General MedicalSciences of the National Institutes of Health under awardnumbers RO1GM83098 and RO1GM107374 to GC as wellas R01HG007089 to JN

ReferencesAkey JM Eberle MA Rieder MJ Carlson CS Shriver MD Nickerson DA

Kruglyak L 2004 Population history and natural selection shapepatterns of genetic variation in 132 genes PLoS Biol 21591ndash1599

Allentoft ME Sikora M Sjogren K-G Rasmussen S Rasmussen MStenderup J Damgaard PB Schroeder H Ahlstrom T Vinner L2015 Population genomics of Bronze Age Eurasia Nature522(7555) 167ndash172

Auton A McVean G 2012 Estimating recombination rates from geneticvariation in humans In Anisimova M ed Evolutionary genomicsMethods in Molecular Biology (Methods and Protocols) vol 856New York Humana Press

Barrett RDH Hoekstra HE 2011 Molecular spandrels tests of adaptationat the genetic level Nat Rev Genet 12(11) 767ndash780

Baudat F Buard J Grey C Fledel-Alon A Ober C Przeworski M Coop GMassy BD 2010 PRDM9 is a major determinant of meiotic recom-bination hotspots in humans and mice Science 327(5967) 836ndash840

Beaumont MA Zhang W Balding DJ 2002 Approximate Bayesian com-putation in population genetics Genetics 162(4) 2025ndash2035

Beleza S Johnson NA Candille SI Absher DM Coram MA Lopes JCampos J Araujo II Anderson TM Vilhjalmsson BJ et al 2013Genetic architecture of skin and eye color in an African-Europeanadmixed population PLoS Genet 9(3) e1003372

Beleza S Santos AM McEvoy B Alves I Martinho C Cameron E ShriverMD Parra EJ Rocha J 2013 The timing of pigmentation lightening inEuropeans Mol Biol Evol 30(1) 24ndash35

Berg JJ Coop G 2015 A coalescent model for a sweep of a uniquestanding variant Genetics 201(2) 707ndash725

Bersaglieri T Sabeti PC Patterson N Vanderploeg T Schaffner SF DrakeJA Rhodes M Reich DE Hirschhorn JN 2004 Genetic signatures ofstrong recent positive selection at the lactase gene Am J Hum Genet74(6) 1111ndash1120

Broman KW Murray JC Sheffield VC White RL Weber JL 1998Comprehensive human genetic maps individual and sex-specificvariation in recombination Am J Hum Genet 63(3) 861ndash869

Bryk J Hardouin E Pugach I Hughes D Strotmann R Stoneking M MylesS 2008 Positive selection in East Asians for an EDAR allele thatenhances NF-jB activation PLoS One 3(5) e2209

Chen H Slatkin M 2013 Inferring selection intensity and allele age frommultilocus haplotype structure G3 3(8) 1429ndash1442

Chen H Hey J Slatkin M 2015 A hidden Markov model for investigatingrecent positive selection through haplotype structure Theor PopulBiol 9918ndash30

Chun S Fay JC 2011 Evidence for hitchhiking of deleterious mutationswithin the human genome PLoS Genet 7(8) e1002240

Coop G Griffiths RC 2004 Ancestral inference on gene trees underselection Theor Popul Biol 66(3) 219ndash232

Coop G Bullaughey K Luca F Przeworski M 2008 The timing of selec-tion at the human FOXP2 gene Mol Biol Evol 25(7) 1257ndash1259

Dalziel AC Rogers SM Schulte PM 2009 Linking genotypes to pheno-types and fitness how mechanistic biology can inform molecularecology Mol Ecol 18(24) 4997ndash5017

Eaton K Edwards M Krithika S Cook G Norton H Parra EJ 2015Association study confirms the role of two OCA2 polymorphismsin normal skin pigmentation variation in East Asian populations AmJ Hum Biol 27 520ndash525

Edwards M Bigham A Tan J Li S Gozdzik A Ross K Jin L Parra EJ 2010Association of the OCA2 polymorphism His615Arg with melanin

content in east Asian populations further evidence of convergentevolution of skin pigmentation PLoS Genet 6(3) e1000867

Elmer KR Meyer A 2011 Adaptation in the age of ecological genomicsinsights from parallelism and convergence Trends Ecol Evol 26(6)298ndash306

Enattah NS Sahi T Savilahti E Terwilliger JD Peltonen L Jeuroarveleuroa I 2002Identification of a variant associated with adult-type hypolactasiaNat Genet 30(2) 233ndash237

Eng MY Luczak SE Wall TL 2007 ALDH2 ADH1B and ADH1C geno-types in Asians a literature review Alcohol Res Health 30(1) 22

Fenner JN 2005 Cross-cultural estimation of the human generationinterval for use in genetics-based population divergence studiesAm J Phys Anthropol 128(2) 415ndash423

Fujimoto A Ohashi J Nishida N Miyagawa T Morishita Y Tsunoda TKimura R Tokunaga K 2008 A replication study confirmed theEDAR gene to be a major contributor to population differentiationregarding head hair thickness in Asia Hum Genet 124(2) 179ndash185

Garud NR Messer PW Buzbas EO Petrov DA 2015 Recent selectivesweeps in North American Drosophila melanogaster show signaturesof soft sweeps PLoS Genet 11(2) e1005004

Griffiths RC Tavare S 1994 Sampling theory for neutral alleles in avarying environment Philos Trans R Soc B Biol Sci 344(1310)403ndash410

Haak W Lazaridis I Patterson N Rohland N Mallick S Llamas B BrandtG Nordenfelt S Harney E Stewardson K 2015 Massive migrationfrom the steppe was a source for Indo-European languages inEurope Nature 522 207ndash211

Hartfield M Otto SP 2011 Recombination and hitchhiking of deleteri-ous alleles Evolution 65(9) 2421ndash2434

Hermisson J Pennings PS 2005 Soft sweeps molecular population ge-netics of adaptation from standing genetic variation Genetics 169(4)2335ndash2352

Hinch AG Tandon A Patterson N Song Y Rohland N Palmer CD ChenGK Wang K Buxbaum SG Akylbekova EL et al 2011 The landscapeof recombination in African Americans Nature 476(7359) 170ndash175

Hudson RR 1990 Gene genealogies and the coalescent process Oxf SurvEvol Biol 7(1) 44

Hudson RR 2007 The variance of coalescent time estimates from DNAsequences J Mol Evol 64(6) 702ndash705

Hudson RR Kaplan NL 1988 The coalescent process in models withselection and recombination Genetics 120(3) 831ndash840

Huff CD Witherspoon DJ Zhang Y Gatenbee C Denson LA KugathasanS Hakonarson H Whiting A Davis CT Wu W et al 2012 Crohnrsquosdisease and genetic hitchhiking at IBD5 Mol Biol Evol 29(1)101ndash111

Innan H Kim Y 2004 Pattern of polymorphism after strong artificialselection in a domestication event Proc Natl Acad Sci U S A 101(29)10667ndash10672

Itan Y Powell A Beaumont MA Burger J Thomas MG 2009 The originsof lactase persistence in Europe PLoS Comput Biol 5(8) e1000491

Jablonski NG Chaplin G 2000 The evolution of human skin coloration JHum Evol 39(1) 57ndash106

Jablonski NG Chaplin G 2012 Human skin pigmentation migration anddisease susceptibility Philos Trans R Soc B Biol Sci 367(1590)785ndash792

Jeong C Di Rienzo A 2014 Adaptations to local environments in mod-ern human populations Curr Opin Genet Dev 291ndash8

Kamberov YG Wang S Tan J Gerbault P Wark A Tan L Yang Y Li STang K Chen H et al 2013 Modeling recent human evolution inmice by expression of a selected EDAR variant Cell 152(4) 691ndash702

Kaplan NL Darden T Hudson RR 1988 The coalescent process inmodels with selection Genetics 120(3) 819ndash829

Kaplan NL Hudson RR Langley CH 1989 The ldquohitchhiking effectrdquo revis-ited Genetics 123(4) 887ndash899

Kimura R Yamaguchi T Takeda M Kondo O Toma T Haneji KHanihara T Matsukusa H Kawamura S Maki K et al 2009 A com-mon variation in EDAR is a genetic determinant of shovel-shapedincisors Am J Hum Genet 85(4) 528ndash535

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1015

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Kong A Thorleifsson G Gudbjartsson DF Masson G Sigurdsson AJonasdottir A Walters GB Jonasdottir A Gylfason A KristinssonKT et al 2010 Fine-scale recombination rate differences betweensexes populations and individuals Nature 467(7319) 1099ndash1103

Kuokkanen M Kokkonen J Enattah NS Ylisaukko-oja T Komu H VariloT Peltonen L Savilahti E Jeuroarveleuroa I 2006 Mutations in the translatedregion of the lactase gene (LCT) underlie congenital lactase defi-ciency Am J Hum Genet 78(2) 339ndash344

Larribe F Fearnhead P 2011 On composite likelihoods in statisticalgenetics Stat Sin 21(1) 43ndash69

Lazaridis I Patterson N Mittnik A Renaud G Mallick S Kirsanow KSudmant PH Schraiber JG Castellano S Lipson M et al 2014Ancient human genomes suggest three ancestral populations forpresent-day Europeans Nature 513(7518) 409ndash413

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357) 493ndash496

Li H Mukherjee N Soundararajan U Tarnok Z Barta C Khaliq SMohyuddin A Kajuna SLB Mehdi SQ Kidd JR et al 2007Geographically separate increases in the frequency of the derivedADH1B 47His allele in eastern and western Asia Am J Hum Genet81(4) 842ndash846

Li N Stephens M 2003 Modeling linkage disequilibrium and identifyingrecombination hotspots using single-nucleotide polymorphismdata Genetics 165(4) 2213ndash2233

Loomis WF 1967 Skin-pigment regulation of vitamin-D biosynthesis inman Science 157(3788) 501ndash506

Mathieson I Lazaridis I Rohland N Mallick S Patterson N RoodenbergSA Harney E Stewardson K Fernandes D Novak M et al 2015Genome-wide patterns of selection in 230 ancient Eurasians Nature528(7583) 499ndash503

McPeek MS Strahs A 1999 Assessment of linkage disequilibrium by thedecay of haplotype sharing with application to fine-scale geneticmapping Am J Hum Genet 65(3) 858ndash875

Meligkotsidou L Fearnhead P 2005 Maximum-likelihood estimationof coalescence times in genealogical trees Genetics 171(4)2073ndash2084

Messer PW Neher RA 2012 Estimating the strength of selective sweepsfrom deep population diversity data Genetics 191(2) 593ndash605

Messer PW Petrov DA 2013 Population genomics of rapid adaptationby soft selective sweeps Trends Ecol Evol 28(11) 659ndash669

Miller CT Beleza S Pollen AA Schluter D Kittles RA Shriver MDKingsley DM 2007 cis-Regulatory changes in kit ligand expressionand parallel evolution of pigmentation in sticklebacks and humansCell 131(6) 1179ndash1189

Morris AP Whittaker JC Balding DJ 2000 Bayesian fine-scale mapping ofdisease loci by hidden Markov models Am J Hum Genet 67(1)155ndash169

Morris AP Whittaker JC Balding DJ 2002 Fine-scale mapping of diseaseloci via shattered coalescent modeling of genealogies Am J HumGenet 70(3) 686ndash707

Nakagome S Alkorta-Aranburu G Amato R Howie B Peter BMHudson RR Di Rienzo A 2016 Estimating the ages of selectionsignals from different epochs in human history Mol Biol Evol33(3) 657ndash669

Nielsen R 2005 Molecular signatures of natural selection Annu RevGenet 39 197ndash218

Olds LC Sibley E 2003 Lactase persistence DNA variant enhances lactasepromoter activity in vitro functional role as a cis regulatory elementHum Mol Genet 12(18) 2333ndash2340

Ormond L Foll M Ewing GB Pfeifer SP Jensen JD 2016 Inferring the ageof a fixed beneficial allele Mol Ecol 25(1) 157ndash169

Osier MV Pakstis AJ Soodyall H Comas D Goldman D Odunsi AOkonofua F Parnas J Schulz LO Bertranpetit J et al 2002 A globalperspective on genetic variation at the ADH genes reveals unusualpatterns of linkage disequilibrium and diversity Am J Hum Genet71(1) 84ndash99

Peng Y Shi H Qi X-B Xiao C-J Zhong H Ma Run-lin Z Su B 2010The ADH1B Arg47His polymorphism in East Asian populations

and expansion of rice domestication in history BMC Evol Biol10(1) 15

Peter BM Huerta-Sanchez E Nielsen R 2012 Distinguishing betweenselective sweeps from standing variation and from a de novo mu-tation PLoS Genet 8(10) e1003011

Przeworski M Coop G Wall JD 2005 The signature of positive selectionon standing genetic variation Evolution 59(11) 2312ndash2323

Price AL Tandon A Patterson N Barnes KC Rafaels N Ruczinski I BeatyTH Mathias R Reich D Myers S 2009 Sensitive detection of chro-mosomal segments of distinct ancestry in admixed populationsPLoS Genet 5(6) e1000519

Pritchard JK Seielstad MT Perez-Lezaun A Feldman MW 1999Population growth of human Y chromosomes a study of Y chro-mosome microsatellites Mol Biol Evol 16(12) 1791ndash1798

Pritchard JK Pickrell JK Coop G 2010 The genetics of human adapta-tion hard sweeps soft sweeps and polygenic adaptation Curr Biol20(4) R208ndashR215

Przeworski M 2003 Estimating the time since the fixation of a beneficialallele Genetics 164(4) 1667ndash1676

Rabiner LR 1989 A tutorial on hidden Markov models and selectedapplications in speech recognition Proc IEEE 77(2) 257ndash286

Radwan J Babik W 2012 The genomics of adaptation Proc R Soc Lond BBiol Sci 279(1749) 5024ndash5028

Rannala B Reeve JP 2001 High-resolution multipoint linkage-disequilibrium mapping in the context of a human genome se-quence Am J Hum Genet 69(1) 159ndash178

Rannala B Reeve JP 2002 Joint Bayesian estimation of mutationlocation and age using linkage disequilibrium Biocomputing2003 526ndash534

Segurel L Wyman MJ Przeworski M 2014 Determinants of mutationrate variation in the human germline Annu Rev Genomics HumGenet 1547ndash70

Simoons FJ 1970 Primary adult lactose intolerance and the milkinghabit a problem in biologic and cultural interrelations Am J DigDis 15(8) 695ndash710

Skoglund P Malmstrom H Omrak A Raghavan M Valdiosera CGunther T Hall P Tambets K Parik J Sjogren K-G et al 2014Genomic diversity and admixture differs for Stone-AgeScandinavian foragers and farmers Science 344(6185) 747ndash750

Slatkin M 2001 Simulating genealogies of selected alleles in a populationof variable size Genet Res 78(01) 49ndash57

Slatkin M 2008 A Bayesian method for jointly estimating allele age andselection intensity Genet Res 90(01) 129ndash137

Slatkin M Hudson RR 1991 Pairwise comparisons of mitochondrialDNA sequences in stable and exponentially growing populationsGenetics 129(2) 555ndash562

Slatkin M Rannala B 2000 Estimating allele age Annu Rev GenomicsHum Genet 1(1) 225ndash249

Smith JM Haigh J 1974 The hitch-hiking effect of a favourable geneGenet Res 23(01) 23ndash35

Stinchcombe JR Hoekstra HE 2008 Combining population genomicsand quantitative genetics finding the genes underlying ecologicallyimportant traits Heredity 100(2) 158ndash170

Tang H Siegmund DO Shen P Oefner PJ Feldman MW 2002Frequentist estimation of coalescence times from nucleotide se-quence data using a tree-based partition Genetics 161(1) 447ndash459

Tavare S Balding DJ Griffiths RC Donnelly P 1997 Inferring coalescencetimes from DNA sequence data Genetics 145(2) 505ndash518

Tennessen JA Bigham AW OrsquoConnor TD Fu W Kenny EE Gravel SMcGee S Do R Liu X Jun G et al 2012 Evolution and functionalimpact of rare coding variation from deep sequencing of humanexomes Science 337(6090) 64ndash69

Thomson R Pritchard JK Shen P Oefner PJ Feldman MW 2000Recent common ancestry of human Y chromosomes evidencefrom DNA sequence data Proc Natl Acad Sci U S A 97(13)7360ndash7365

Tishkoff SA Reed FA Ranciaro A Voight BF Babbitt CC Silverman JSPowell K Mortensen HM Hirbo JB Osman M et al 2007

Smith et al doi101093molbevmsy006 MBE

1016

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Convergent adaptation of human lactase persistence in Africa andEurope Nat Genet 39(1) 31ndash40

Troelsen JT Olsen J Moslashller J SjeuroOstreuroOm H 2003 An upstream polymor-phism associated with lactase persistence has increased enhanceractivity Gastroenterology 125(6) 1686ndash1694

Varin C Reid N Firth D 2011 An overview of composite likelihoodmethods Stat Sin 21(1) 5ndash42

Voight BF Kudaravalli S Wen X Pritchard JK Hurst L 2006 A map ofrecent positive selection in the human genome PLoS Biol 4(3) e72

Wegmann D Kessner DE Veeramah KR Mathias RA Nicolae DL YanekLR Sun YV Torgerson DG Rafaels N Mosley T et al 2011Recombination rates in admixed individuals identified by ancestry-based inference Nat Genet 43(9) 847ndash853

Wilde S Timpson A Kirsanow K Kaiser E Kayser M Unterleuroander MHollfelder N Potekhina ID Schier W Thomas MG et al 2014 Directevidence for positive selection of skin hair and eye pigmentation inEuropeans during the last 5000 years Proc Natl Acad Sci U S A111(13) 4832ndash4837

Williamson SH Hernandez R Fledel-Alon A Zhu L Nielsen RBustamante CD 2005 Simultaneous inference of selection and pop-ulation growth from patterns of variation in the human genomeProc Natl Acad Sci U S A 102(22) 7882ndash7887

Wiuf C 2000 On the genealogy of a sample of neutral rare alleles TheorPopul Biol 58(1) 61ndash75

Wiuf C Donnelly P 1999 Conditional genealogies and the age of aneutral mutant Theor Popul Biol 56(2) 183ndash201

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1017

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

recent positive selection (reviewed in Nielsen [2005]) A par-allel effort has focused on quantifying specific properties ofthe signature to infer the age of the selected allele

The most commonly used methods to estimate allele agerely on summary statistics These approaches can be furtherclassified as either heuristic or model-based methodsHeuristic approximations rely on a point estimate of themean length of the selected haplotype (using the decay ofhomozygosity around the selected locus) or a count of de-rived mutations within an arbitrary cutoff distance from theselected site (Thomson et al 2000 Tang et al 2002Meligkotsidou and Fearnhead 2005 Hudson 2007 Coopet al 2008) These approaches ignore uncertainty in the ex-tent of the selected haplotype on each chromosome whichcan lead to inflated confidence in the point estimates

Alternative model-based approaches that also use sum-mary statistics employ an Approximate BayesianComputation (ABC) framework These methods use an ex-plicit model for simulation to identify a distribution of agesthat are consistent with the observed data (Tavare et al 1997Pritchard et al 1999 Beaumont et al 2002 Przeworski 2003Voight et al 2006 Tishkoff et al 2007 Peter et al 2012 BelezaSantos et al 2013 Nakagome et al 2016 Ormond et al 2016)This provides a measure of uncertainy induced by the ran-domness of recombination mutation and genealogical his-tory and produces an approximate posterior distribution onallele age Despite these advantages ABC approaches sufferfrom an inability to capture all relevant features of the sampledue to their reliance on summary statistics

As full-sequencing data become more readily availabledefining the summary statistics which capture the complexLD among sites and the subtle differences between haplo-types will be increasingly challenging For this reason effi-ciently computable likelihood functions that leverage thefull sequence data rather than low dimensional summariesof the data are increasingly favorable

Several approaches attempt to compute the full likelihoodof the data using an importance sampling framework (Slatkin2001 Coop and Griffiths 2004 Slatkin 2008 Chen and Slatkin2013) Conditioning on the current frequency of the selectedallele frequency trajectories and genealogies are simulatedand given weight proportional to the probability of their oc-currence under a population genetic model While theseapproaches aim to account for uncertainty in the allelersquos fre-quency trajectory and genealogy they remain computation-ally infeasible for large samples or do not considerrecombination across numerous loci

In a related problem early likelihood-based methods fordisease mapping have modelled recombination around theancestral haplotype providing information for the time to thecommon ancestor (TMRCA) rather than time of mutation(Rannala and Reeve 2001 2002 McPeek and Strahs 1999Morris et al 2000 2002) These models allowed for the treat-ment of unknown genealogies and background haplotypediversity before access to large data sets made computationat the genome-wide scale too costly Inference is performedunder Markov chain Monte Carlo (MCMC) to sample overthe unknown genealogy while ignoring LD on the background

haplotypes or approximating it using a first-order Markovchain In a similar spirit Chen et al (2015) revisit this classof models to estimate the strength of selection and time ofmutation for an allele under positive selection using a hiddenMarkov model (HMM)

HMMs have become a routine tool for inference in popu-lation genetics The Markov assumption allows for fast com-putation and has proven an effective approximation forinferring the population-scaled recombination rate the demo-graphic history of population size changes and the timing andmagnitude of admixture events among genetically distinctpopulations (Li and Stephens 2003 Price et al 2009 Hinchet al 2011 Li and Durbin 2011 Wegmann et al 2011) Theapproach taken by Chen et al (2015) is a special case of twohidden statesmdashthe ancestral and background haplotypes Theancestral haplotype represents the linked background that thefocal allele arose on while the background haplotypes repre-sent some combination of alleles that recombine with theancestral haplotype during its increase in frequency Chenet al (2015) compute maximum-likelihood estimates for thelength of the ancestral haplotype on each chromosome car-rying the selected allele Inference for the time of mutation isperformed on these fixed estimates assuming that they areknown The authors condition the probability of an ancestryswitch event on a logistic frequency trajectory for the selectedallele and assume independence among haplotypes leading tothe common ancestor The likelihood for background haplo-types is approximated using a first-order Markov chain toaccount for nonindependence among linked sites

Here we present an HMM that leverages both the length ofthe ancestral haplotype on each chromosome as well as de-rived mutations that have accumulated on the ancestral hap-lotype Our method implements an MCMC which samplesover the unknown ancestral haplotype to generate a sampleof the posterior distribution for the TMRCA Our emissionprobabilities account for the LD structure among backgroundhaplotypes using the Li and Stephens (2003) haplotype copyingmodel and a reference panel of haplotypes without the se-lected allele (fig 1b and c) In contrast to the first-order Markovchain employed by Chen et al (2015) the Li and Stephens(2003) model provides an approximation to the coalescentwith recombination by modelling a focal haplotype as an im-perfect mosaic of haplotypes in the reference panel

While Chen et al (2015) use a mutation parameter in theirHMM the count of derived mutations on the backgroundhaplotype does not directly influence their estimation oftime since mutation The probability of observing a mutationon the selected haplotypes of beneficial allele carriers dependson two parameters The per-generation mutation rate and thetime to the common ancestor (TMRCA) The Chen et almodel uses a compound parameter for these such that theobserved mutations do not directly inform their estimates oftiming In our model we separately include the TMRCA andmutation rate as parameters and thus incorporate informationfrom mutations directly into our inference of the TMRCA

Our approach also differs in that we do not presume toknow the true extent of the ancestral haplotype and insteadtreat it as a latent variable to be marginalized over This allows

Smith et al doi101093molbevmsy006 MBE

1004

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

our estimation of the TMRCA to reflect uncertainty in theprecise switch point off of the ancestral haplotype which inmany cases will be difficult to distinguish from the backgroundhaplotypes Another significant difference is that our modeldoes not make assumptions about the frequency trajectoryapart from that a sufficiently hard sweep occurred to incur astar-shaped genealogy Below we use simulations to show thesensitivity of our model to these simplified assumptions forvarying strengths of selection final allele frequencies and sam-pling regimes for the choice of reference panel An R package isavailable to implement this method on github (httpsgithubcomjhavsmithstartmrca last accessed January 23 2018)

New Approaches

Model DescriptionIn general the TMRCA for a sample of haplotypes carryingthe advantageous allele (hereafter referred to as tca) will bemore recent than the time of mutation (Kaplan et al 1989)We aim to estimate tca in the case where a selectively advan-tageous mutation occurred in an ancestor of our sample t1

generations ago (fig 1a in main text) Viewed backwards intime the selected variant decreases in frequency at a rateproportional to the selection strength During a rapid dropin allele frequency the coalescent rate among haplotypescarrying the selected variant is amplified The same effect

would be observed for population growth from a small initialsize forward in time (Hudson 1990 Slatkin and Hudson 1991)As a result the genealogy of a sample having undergoneselection andor population growth becomes more ldquostar-shapedrdquo This offers some convenience as it becomes moreappropriate to invoke an assumption of independenceamong lineages when selection is strong We would like toemphasize that this assumption necessarily implies that thebeneficial allele has a single ancestral haplotype that has in-creased in frequency This is in contrast to a scenario in whichthe beneficial allele has been present in the population forsome time prior to selection For that case multiple ancestralhaplotypes would increase in frequency simultaneouslyresulting in a genealogy that is not star-shaped

We assume no crossover interference between recombi-nation events within a haplotype and therefore treat eachside flanking the focal allele separately We define one side ofthe selected site within a window of some predeterminedlength to have L segregating sites such that an individualrsquossequence will be indexed from site s frac14 f1 Lg wheresfrac14 1 refers to the selected site (a notation reference is pro-vided in table 1) To simplify notation this description will bewritten for a window on one side flanking the selected siteNote that the opposing side of the selected site is modelled inan identical fashion after redefining L

Let X denote an n L data matrix for a sample of nchromosomes with the selected variant Xij is the observedallelic type in chromosome i at variant site j and is assumed tobe biallelic where Xij 2 f1 0g Let H denote an m L matrixcomprising m chromosomes that do not have the selectedvariant where Hij 2 f1 0g Let A denote the ancestral hap-lotype as a vector of length L where Aj is the allelic type on theancestral selected haplotype at segregating site j andAj 2 f1 0g We assume independence among lineages lead-ing to the most recent common ancestor of the selectedhaplotype This is equivalent to assuming a star-shaped ge-nealogy which as noted above is a reasonable assumption forsites linked to a favorable variant under strong selection Wecan then write the likelihood as

PrethXjtcaAHTHORN frac14Yn

i

PrethXijtcaAHTHORN (1)

In each individual haplotype Xi we assume the ancestralhaplotype extends from the selected allele until a recombina-tion event switches ancestry to a different genetic backgroundLet Wfrac14w indicate that the location of the first recombi-nation event occurs between sites w and wthorn 1 where W2 f1 Lg (wfrac14 L indicates no recombination up to siteL) We can then condition the probability of the data on theinterval where the first recombination event occurs and sumover all possible intervals to express the likelihood as

PrethXijtcaAHTHORN frac14XL

wfrac141

PrethXijtcaAHWi frac14 wTHORN

PrethWi frac14 wjtcaTHORN (2)

FIG 1 Visual descriptions of the model (a) An idealized illustration ofthe effect of a selectively favored mutationrsquos frequency trajectory(black line) on the shape of a genealogy at the selected locus Theorange lineages are chromosomes with the selected allele The bluelineages indicate chromosomes that do not have the selected alleleNote the distinction between the time to the common ancestor ofchromosomes with the selected allele tca and the time at which themutation arose t1 (b) The copying model follows the ancestral hap-lotype (orange) moving away from the selected site until recombina-tion events within the reference panel lead to a mosaic of nonselectedhaplotypes surrounding the ancestral haplotype (c) A demographichistory with two choices for the reference panel Local and divergedAfter the ancestral population at the top of the figure splits into twosister populations a beneficial mutation arises and begins increasingin frequency The orange and blue colors indicate frequency of theselected and nonselected alleles respectively

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1005

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Assuming haplotype lengths are independent and identi-cally distributed draws from an exponential distribution thetransition probabilities for a recombination event off of theancestral haplotype are

PrethWi frac14 wjtcaTHORN

frac14ertcadweth1 ertcaethdwthorn1dwTHORNTHORN if w frac14 f1 ethL 1THORNg

ertcadL if w frac14 L

(

(3)

where dw is the distance in base pairs of site w from theselected site and r is the local recombination rate per basepair per generation The data for each individual Xi can bedivided into two parts One indicating the portion of anindividualrsquos sequence residing on the ancestral haplotype (be-fore recombining between sites w and wthorn 1) XiethjwTHORN andthat portion residing off of the ancestral haplotype after arecombination event XiethjgtwTHORN We denote a separate likelihoodfor each portion

aiw frac14 PrethXiethjwTHORNjtcaAWi frac14 wTHORN (4)

biw frac14 PrethXiethjgtwTHORNjHethjgtwTHORNWi frac14 wTHORN (5)

Because the focal allele is on the selected haplotypeai1 frac14 1 Conversely we assume a recombination eventoccurs at some point beyond locus L such thatbiL frac14 1 We assume the waiting time to mutation ateach site on the ancestral haplotype is exponentiallydistributed with no reverse mutations and express thelikelihood as

aiw frac14 PrethXiethjwTHORNjtcaAWi frac14 wTHORN

frac14 etcalethdwwTHORNYw

jfrac142

PrethXij frac14 ajtcaATHORN (6)

PrethXij frac14 ajtcaATHORN frac14etcal if a frac14 Aj

1 etcal if a frac14 Aj

((7)

The term etcalethdwwTHORN on the right side of equation (6)captures the lack of mutation at invariant sites betweeneach segregating site Assuming tcal is small equation (6) isequivalent to assuming a Poisson number of mutations(with mean tcal) occurring on the ancestral haplotype

For biw the probability of observing a particular sequenceafter recombining off of the ancestral haplotype is dependenton standing variation in background haplotype diversity TheLi and Stephens (2003) haplotype copying model allows forfast computation of an approximation to the probability ofobserving a sample of chromosomes related by a genealogywith recombination Given a sample of m haplotypesH 2 fh1 hmg a population scaled recombination rateq and mutation rate h an observed sequence of alleles ismodeled as an imperfect copy of any one haplotype in thereference panel at each SNP Let Zij denote the reference panelhaplotype which Xi copies at the jth SNP and cj denote thenumber of base pairs between SNPs j and jthorn 1 Zij follows aMarkov process with transition probabilities

PrethZiethjthorn1THORN frac14 z0jZij frac14 zTHORN

frac14eqjcj=m thorn eth1 eqjcj=mTHORNeth1=mTHORN if z0 frac14 z

eth1 eqjcj=mTHORNeth1=mTHORN if z0 frac14 z

(

(8)

To include mutation the probability that the sampledhaplotype matches a haplotype in the reference panel ism=ethmthorn hTHORN and the probability of a mismatch (or muta-tion event) is h=ethmthorn hTHORN Letting a refer to an allele wherea 2 f1 0g the matching and mismatching probabilitiesare

PrethXij frac14 ajZij frac14 z h1 hmTHORN

frac14m=ethmthorn hTHORN thorn eth1=2THORNethh=ethmthorn hTHORNTHORN if hzj frac14 a

eth1=2THORNethh=ethmthorn hTHORNTHORN if hzj frac14 a

(

(9)

Equation (5) requires a sum over the probabilities of allpossible values of Zj using equations (8) and (9) This is com-puted using the forward algorithm as described in Rabiner(1989) and Appendix A of Li and Stephens (2003) It shouldbe noted that this formulation does not model the observa-tion of an invariant site among the background haplotypesWe tried an approach to model these sites but saw no im-provement in model performance (see supplementary ap-pendix S2 Supplementary Material online)

The complete likelihood for our problem can then beexpressed as

Table 1 Notation Used to Describe the Model

n Number of haplotypes with the selected allelem Number of haplotypes without the selected alleleL Number of SNPs flanking the selected site (one side considered

at a time)X n L matrix of haplotypes with the selected alleleH m L matrix of haplotypes without the selected alleleXij Allele in haplotype i at SNP j where i 2 f1 ng and

j 2 f1 LgHzj Allele in haplotype z at SNP j where z 2 f1 mg and

j 2 f1 LgAj Allele at site j on the ancestral haplotypeZij The reference panel haplotype from which Xi copies at site jtca Time to the most recent common ancestor (TMRCA)Wi The location of the first recombination event off of the ancestral

haplotyper Recombination rate per basepair per generationl Mutation rate per basepair per generationh Haplotype miscopying rate or population-scaled mutation rate

eth4NlTHORNq Haplotype switching rate or population-scaled recombination

rate eth4NrTHORNdw Physical distance of site w from the selected site where

w 2 f1 Lgcj Number of basepairs between sites j and j thorn 1aiw Likelihood of haplotype i for sites 1 wbiw Likelihood of haplotype i for SNPs ethwthorn 1THORN L

Smith et al doi101093molbevmsy006 MBE

1006

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

PrethXjtcaAHTHORN frac14Yn

ifrac141

XL

wfrac141

aiwbiwPrethWi frac14 wjtcaATHORN (10)

This computation is on the order 2Lnm2 and in practicefor mfrac14 20 nfrac14 100 and Lfrac14 4000 takes approximately3027 s to compute on an Intel Core i7-4750HQ CPU at200 GHz 8 with 156 GB RAM

InferencePerforming inference on tca requires addressing the latentvariables w and A in the model Marginalizing over possiblevalues of w is a natural summation per haplotype that is linearin L as shown above For A the number of possible values islarge (2L) and so we employ a MetropolisndashHastings algorithmto jointly sample the posterior of A and tca and then we takemarginal samples of tca for inference We assign a uniformprior density for both A and tca such that PrethATHORN frac14 1=2L andPrethtcaTHORN frac14 1=ethtmax tminTHORN where tmax and tmin are user-specified maximum and minimum values for tca ProposedMCMC updates of the ancestral haplotype A0 are generatedby randomly selecting a site in A and flipping to the alterna-tive allele For tca proposed values are generated by adding anormally distributed random variable centered at 0tca0 frac14 tca thorn Neth0 r2THORN To start the MetropolisndashHastings algo-

rithm an initial value of tca is uniformly drawn from a user-specified range of values (10ndash2000 in the applications here)To initialize the ancestral haplotype to a reasonable value weuse a heuristic algorithm which exploits the characteristicdecrease in variation near a selected site (see supplementaryappendix S2 Supplementary Material online)

For each haplotype in the sample of beneficial allele car-riers the Li and Stephens (2003) model uses a haplotypemiscopying rate h and switching rate q to compute a like-lihood term for loci following the recombination event off ofthe ancestral haplotype For our analyses we set q frac14 44104 using our simulated values of r frac14 11 108 per bpper generation and Nfrac14 10000 where q frac14 4Nr Following Li

and Stephens (2003) we fix h frac14 ethPn

mfrac1411=mTHORN1 as derived

from the expected number of mutation events on a geneal-ogy relating n chromosomes at a particular site We found nodiscernible effects on estimate accuracy when specifying dif-ferent values of q (supplementary fig 1 SupplementaryMaterial online)

ResultsBecause our model requires a sample (or ldquopanelrdquo) of referencehaplotypes without the selected allele we tested our methodfor cases in which the reference panel is chosen from the localpopulation in which the selected allele is found as well ascases where the panel is from a diverged population wherethe selected haplotype is absent (fig 1c) Regardless of sce-nario the estimates are on average within a factor of 2 of thetrue value and often much closer When using a local refer-ence panel point estimates of tca increasingly underestimatethe true value (TMRCA) as selection becomes weaker and thefinal allele frequency increases (fig 2) Put differently the age

of older TMRCAs tends to be underestimated with local ref-erence panels Using the mean posteriors as point estimatesmean values of log2(estimatetrue value) range from062 to014 Simulations using a diverged population for the refer-ence panel removed the bias though only in cases where thedivergence time was not large For a reference panel divergedby 05N generations mean log2(estimatetrue value) valuesrange from 021 to 018 As the reference panel becomestoo far diverged from the selected population estimates be-come older than the true value (036 to 094 log2(estimatetrue value)) In these cases the HMM is unlikely to infer aclose match between background haplotypes in the sampleand the reference panel leading to many more mismatchesbeing inferred as mutation events on the ancestral haplotypeand an older estimate of tca

The bottom panel of figure 2 shows the effect of selectionstrength and final allele frequency on the size of the 95credible interval around point estimates normalized by thetrue TMRCA for each simulated data set Before normalizingcredible interval sizes using a local reference panel range from73 to 213 generations for 2Ns frac14 100 versus 18 to 22 gener-ations when 2Ns frac14 2000 Using local and diverged referencepanels we found a minimal effect of the sample size on pointestimates (supplementary figs 2 and 3 SupplementaryMaterial online) As expected larger sample sizes for the car-rier panel improve estimate accuracy However higher allelefrequencies and weak selection are likely to induce more un-certainty due to the ancestral haplotype tracts recombiningwithin the sample We find this effect more pronounced withlarge sample sizes for the reference panel We speculate that alarge sample of reference haplotypes leads the focal selectedhaplotype to have increased probability of copying from thereference panel leading to a shorter selected haplotype andslight overestimate of the TMRCA

We also performed simulations under varying degrees ofmutation and recombination rate misspecification (supple-mentary fig 4 Supplementary Material online) In most casesmean values of log2(estimatetrue value) stay within an orderof magnitude of 0 As expected when both the mutation andrecombination rate are misspecified we find the most dis-crepancy To assess the convergence properties of theMCMC five replicate chains were run for each of 20 sim-ulated data sets produced under three 2Ns values (100200 and 2000) for frequency trajectories ending at 01(supplementary fig 5 Supplementary Material online)While care is always warranted with MCMC approacheswe find in practice that convergence among our replicatechains is attained relatively quickly (3 000 iterations forsimulated data and 3000ndash9000 iterations for appliedcases see supplementary fig 9 Supplementary Materialonline)

We compared the performance of our estimator withthree other model-based approaches for allele age estimationby matching the simulation scheme performed by Chen et al(2015) (supplementary table 5 Supplementary Material on-line) Our method shows improvement in accuracy (RMSE)andor lower bias for simulations with lower frequencies ofthe beneficial allele (40) regardless of the reference panel

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1007

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

used In cases where the final beneficial allele frequency ishigher (80) our methodrsquos accuracy remains as good or bet-ter than the other methods when using a diverged referencepanel with a two-orders-of-magnitude improvement of biasunder strong selection (sfrac14 005) Estimates when using a localreference panel and a high final beneficial allele frequencyremain comparable to the other methods for strong selectionbut tend to have more bias and decreased accuracy as selec-tion strength decreases

Assuming a star-genealogy among beneficial allele carriersmay result in underestimating the variance for the posteriordistribution when there is nonindependence in our sample

To measure this affect we computed TMRCA estimates on100 bootstrap replicates for 4 simulated data sets under 2selection strengths and 2 final allele frequencies (supplemen-tary table 7 Supplementary Material online) We find closeagreement between the 95 posterior credible intervals ofthe original data and the 95 confidence intervals computedon the bootstrap estimates for a selection strength of 01 forboth final allele frequencies of 04 and 08 As expected olderTMRCAs are likely to violate the star-genealogy assumptionand in these cases we find that estimates from our originaldata are more narrow than the bootstrap confidenceintervals

Frequency 01 Frequency 03 Frequency 06 Frequency 09

minus2

minus1

0

1

2

3

0005 001 01 0005 001 01 0005 001 01 0005 001 01

log2

(Est

imat

eTr

ue V

alue

)Local 02N 05N 10N

Frequency 01 Frequency 03 Frequency 06 Frequency 09

02

04

06

08

0005 001 01 0005 001 01 0005 001 01 0005 001 01Selection Strength

Nor

mal

ized

95

Cre

dibl

e In

terv

al R

ange

FIG 2 Accuracy results from simulated data Accuracy of TMRCA point estimates and 95 credible interval ranges from posteriors inferred fromsimulated data under different strengths of selection final allele frequencies and choice of reference panel Credible interval range sizes are in unitsof generations and are normalized by the true TMRCA for each simulated data set See Materials and Methods below for simulation details

Smith et al doi101093molbevmsy006 MBE

1008

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Recombination versus Mutation as a Source ofInformationWe compared our model-based inference with simpler esti-mates of the TMRCA using the number of derived mutationson the ancestral haplotype and the mean length of the an-cestral haplotype In addition to quantifying the improve-ment our method has over these calculations this alsoserves as an ad-hoc way to understand how the relativeweight of information from mutation and recombinationaffects the performance of our method One can model thehaplotype lengths as independent and exponentially distrib-uted to derive a recombination-based estimator tr as

tr frac141

wor (11)

where r is the recombination rate and wo is the observedmean ancestral haplotype length To leverage the count ofderived mutations on the ancestral haplotype we use theThomson et al (2000) estimator In a sample of n haplotypeswith the selected allele a mutation-based estimator tm canbe calculated as

tm frac141

n

Xn

i

yi

wil (12)

where yi is the number of derived mutations on the ith hap-lotype which has length wi basepairs See Hudson (2007) for aderivation of the estimate for the variance of the Thomsonestimator

When using derived mutations uncertainty in both theancestral haplotype sequence and the length of the ancestralhaplotype on each chromosome (wi) can lead to poor esti-mation To improve inference researchers typically define arestricted ldquononrecombiningrdquo region that may reliably containderived mutations on the ancestral haplotype This has twodisadvantages 1) There is more information available in thedata which cannot be used because excess caution is neces-sary to prevent overcounting of derived mutations and2) there may still be unobserved recombination events inthis restricted locus To minimize the use of heuristics for aderived mutation approach we used our model to findmaximum-likelihood estimates of the ancestral haplotypebreakpoints using equation (2) in the model descriptionWe also used the mean posterior estimate of the ancestralhaplotype from our model to identify derived mutations Tocalculate a recombination-based estimator of the TMRCA wecalculated wo using the same maximum-likelihood estimatesof the ancestral haplotype lengths inferred for the mutationestimator

When using a local reference panel the simple mutationestimator tm consistently underestimates the true TMRCAThe recombination estimate however remains accurate(supplementary fig 6 Supplementary Material online) Wesuspect this to be a result of poor estimation of the ancestralhaplotype and violation of the star-genealogy assumption Incases where selection is weaker and the genealogy is not star-shaped derived mutations occurring early in the genealogywill be overrepresented and incorrectly inferred to be the

ancestral allele In this way high-frequency-derived alleleswill not be counted As predicted increasing selectionstrength improves mutation estimator accuracy The recom-bination estimator appears robust to this effect as long asselection is not too strong For very strong selection andyoung TMRCA values maximum-likelihood estimates ofthe haplotype lengths become constrained by the size ofthe locus studied For example in simulations with a selectionstrength of 005 and frequency of 01 the mean TMRCA isaround 100 generations Using equation (11) above the meanlength of the ancestral haplotype for a TMRCA of 100 gen-erations is 2 Mb which is twice as large as the window size weuse to make computation for our simulations feasible Using alarger window around the selected locus would amelioratethis effect

When using a diverged reference panel we find an oppositeeffect In this case the count of derived mutations results inan overestimate and the haplotype lengths yield an under-estimate We suspect this to be driven by poor matchingbetween the reference panel and the background haplotypesamong beneficial allele carriers The low probability of match-ing between the reference and background haplotypes meansthat the lengths of the ancestral haplotype are inferred toextend further than their true lengths This also leads to anoverestimate for the mutation estimator because differencesbetween the ancestral and background haplotypes are incor-rectly inferred as derived mutations on the ancestralhaplotype

Application to 1000 Genomes DataWe applied our method to five variants previously identifiedas targets of recent selection in various human populations(fig 3) Using phased data from the 1000 Genomes Project wefocused on variants that are not completely fixed in any onepopulation so that we could use a local reference panel The Liand Stephens (2003) haplotype copying model is appropriatein cases where ancestry switches occur among chromosomeswithin a single population so we excluded populations in theAmericas for which high levels of admixture are known toexist

While the simulation results described above provide someintuition for the effects of selection strength final allele fre-quency and choice of reference panel we also performedsimulations using the demographic history inferred byTennessen et al (2012) to explore the effects of nonequili-brium demographic history on our estimation accuracy (sup-plementary fig 10 Supplementary Material online) We findsubtle differences in accuracy between the two demographichistories where the nonequilibrium histories lead to negligibledifferences in mean values for log2(estimatetrue value) andlarger credible interval ranges

ADH1BA derived allele at high frequency among East Asians at theADH1B gene (rs3811801) has been shown to be functionallyrelevant for alcohol metabolism (Osier et al 2002 Eng et al2007) Previous age estimates are consistent with the timing

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1009

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

of rice domestication and fermentation approximately10000 years ago (Li et al 2007 Peng et al 2010 Peter et al2012) However a more recent estimate by Peter et al (2012)pushes this time back several thousand years to 12876(2204ndash49764) years ago Our results are consistent with an

older timing of selection as our CHB sample (Han Chinese inBeijing China) TMRCA estimate is 15377 (13763ndash17281)years Replicate chains of the MCMC are generally consistentwith the oldest estimates in the CHB sample showing themost variation among resampled data sets and the youngest

FIG 3 Comparison of TMRCA estimates with previous results Violin plots of posterior distributions for the complete set of estimated TMRCAvalues for the five variants indicated in the legend scaled to a generation time of 29 years Each row indicates a population sample from the 1000Genomes Project panel Replicate MCMCs are plotted with transparency Points and lines overlaying the violins are previous point estimates and95 confidence intervals for each of the variants indicated by a color and rs number in the legend (see supplementary tables 3 and 4Supplementary Material online) The population sample abbreviations are defined in text

Smith et al doi101093molbevmsy006 MBE

1010

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

estimate of 10841 (9720ndash12147) in the KHV sample showingthe least When using a fine-scale recombination map all ofthe ADH1B TMRCAs are inferred to be slightly older (supple-mentary fig 7 Supplementary Material online)

EDARPopulation genomic studies have repeatedly identified thegene EDAR to be under recent selection in East Asians(Akey et al 2004 Williamson et al 2005 Voight et al 2006)with a particular site (rs3827760) showing strong evidence forbeing the putative target Functional assays and allele specificexpression differences at this position show phenotypic asso-ciations to a variety of phenotypes including hair thicknessand dental morphology (Bryk et al 2008 Fujimoto et al 2008Kimura et al 2009 Kamberov et al 2013)

Our estimate of 22192 (19683ndash25736) years for the EDARallele in the CHB sample is older than ABC-based estimates of12458 (1314ndash85835) and 13224 (4899ndash50692) years madeby Bryk et al (2008) and Peter et al (2012) respectivelyKamberov et al (2013) use spatially explicit ABC and maxi-mum likelihood approaches to compute older estimates of30925 (13175ndash39575) and 36490 (34775ndash38204) We in-cluded all populations for which the variant is present includ-ing the FIN and BEB samples where it exists at low frequencyOur results for the youngest TMRCAs are found in these twolow frequency populations where the estimate in FIN is17386 (13887ndash20794) and the estimate in BEB is 18370(14325ndash22872) Among East Asian populations the oldestand youngest TMRCA estimates are found in the KHV sam-ple (25683 23169ndash28380) and CHB sample (22192 19683ndash25736)

LCTArguably the best-studied signature of selection in humans isfor an allele at the LCT gene (rs4988235) which confers lactasepersistence into adulthoodmdasha trait unique among mammalsand which is thought to be a result of cattle domesticationand the incorporation of milk into the adult diets of severalhuman populations (Enattah et al 2002 Bersaglieri et al 2004Tishkoff et al 2007) There are multiple alleles that showassociation with lactase persistence (Tishkoff et al 2007)We focused on estimating the age of the T-13910 allele pri-marily found at high frequency among Northern Europeansbut which is also found in South Asian populations In addi-tion to association with the lactase persistence phenotypethis allele has been functionally verified by in vitro experi-ments (Olds and Sibley 2003 Troelsen et al 2003Kuokkanen et al 2006)

Mathieson et al (2015) use ancient DNA collected from83 human samples to get a better understanding of the fre-quency trajectory for several adaptive alleles spanning a timescale of 8000 years For the LCT persistence allele (rs4988235)they find a sharp increase in frequency in the past 4000 yearsago While this is more recent than previous estimates anearlier TMRCA or time of mutation is still compatible withthis scenario

Our estimates using European and South Asian samplesfall between the range from 5000 to 10000 years ago which isbroadly consistent with age estimates from modern data Thecredible intervals for estimates in all of the samples havesubstantial overlap which makes any ranking on the basisof point estimates difficult We infer the PJL (Punjabi fromLahore Pakistan) sample to have the oldest TMRCA estimateof 9514 (8596ndash10383) years Itan et al (2009) use spatialmodelling to infer the geographic spread of lactase allelefrom northern to southern Europe Consistent with theirresults the youngest estimate among European populationsis found in the IBS sample at 9341 (8688ndash9989) yearsAmong all samples the youngest estimate was found inBEB at 6869 (5143ndash8809)

KITLG and OCA2The genetic basis and natural history of human skin pigmen-tation is a well-studied system with several alleles of majoreffect showing signatures consistent with being targets ofrecent selection (Jablonski and Chaplin 2012 Beleza Santoset al 2013 Wilde et al 2014 Eaton et al 2015) We focused onan allele found at high-frequency world-wide among non-African populations at the KITLG locus (rs642742) whichshows significant effects on skin pigmentation differences be-tween Europeans and Africans (Miller et al 2007) althoughmore recent work fails to find any contribution of KITLGtoward variation in skin pigmentation in a Cape VerdeAfrican-European admixed population (Beleza Johnsonet al 2013) We also estimated the TMRCA for a melanin-associated allele at the OCA2 locus (rs1800414) which is onlyfound among East Asian populations at high frequency(Edwards et al 2010)

For the KITLG variant our estimates among differentpopulations vary from 18000 to 34000 years ago with theoldest age being in the YRI (Yoruba in Ibadan Nigeria)sample (33948 28861ndash39099) The youngest TMRCA isfound in FIN at 18733 years (16675ndash20816) The nexttwo youngest estimates are also found in Africa withthe TMRCA in the MSL (Mende in Sierra Leone) samplebeing 22340 (15723ndash28950) years old and that for LWK(Luhya in Webuye Kenya) being 22784 (17922ndash28012)years old suggesting a more complex history than amodel of a simple allele frequency increase outside ofAfrica due to pigmentation related selection pressuresPrevious point estimates using rejection samplingapproaches on a Portuguese sample (32277 6003ndash80683) and East Asian sample (32045 6032ndash98165)are again most consistent with our own results on theIBS (29731 26170ndash32813) and CHB samples (2677324297ndash30141) (Beleza Santos et al 2013 Chen et al2015) Among East Asians the oldest and youngest esti-mates are again found in the JPT (28637 24297ndash30141)and KHV (24544 21643ndash27193) samples respectivelyThe TMRCA for OCA2 alleles in the JPT (1859916110ndash20786) and KHV (16370 14439ndash18102) samplesare also the oldest and youngest respectively

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1011

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

DiscussionOur method improves estimation for the timing of selectionon a beneficial allele using a tractable model of haplotypeevolution This approach leverages detailed information inthe data while remaining amenable to large sample sizesUsing both carriers and noncarriers of the allele we canmore effectively account for uncertainty in the extent ofthe ancestral haplotype and derived mutations We showthe performance of our method using simulations of differentselection strengths beneficial allele frequencies and choicesof reference panel By applying our method to five variantspreviously identified as targets of selection in human popu-lations we provide a comparison among population-specificTMRCAs This gives a more detailed account of the order inwhich populations may have acquired the variant andorexperienced selection for the trait it underlies

In that regard it is hypothesized that local selection pres-sures and a cultural shift toward agrarian societies have in-duced adaptive responses among human populations aroundthe globe The data associated with some variants seem toindicate more recent selective events than others Our resultsfor variants associated with dietary traits at the LCT andADH1B genes both imply relatively recent TMRCAs(lt 15 000 years ago) consistent with hypotheses that selec-tion on these mutations results from recent changes in hu-man diet following the spread of agriculture (Simoons 1970Peng et al 2010) In contrast the inferred TMRCAs for EDARKITLG and OCA2 imply older adaptive events which mayhave coincided more closely with the habitation of new envi-ronments or other cultural changes

Several hypotheses have been suggested to describe theselective drivers of skin pigmentation differences among hu-man populations including reduced UV radiation at highlatitudes and vitamin D deficiency (Loomis 1967 Jablonskiand Chaplin 2000) Estimated TMRCAs for the variants atthe OCA2 and EDAR loci among East Asians appear to beas young or younger than the KITLG variant but older thanthe LCT and ADH1B locus This suggests a selective history inEast Asian populations leading to adaptive responses forthese traits occurring after an initial colonization In somecases the dispersion of replicate MCMC estimates makes itdifficult to describe the historical significance of an observedorder for TMRCA values However the consistency of esti-mates among different populations for particular variantsadds some confidence to our modelrsquos ability to reproducethe ages which are relevant to those loci or certain geographicregions

To assess the relative concordance of our estimates withthose from previous approaches we compared our resultswith a compilation of previously published estimates basedon the time of mutation time since fixation or TMRCA ofvariants associated with the genes studied here (supplemen-tary fig 8 Supplementary Material online) The range of con-fidence intervals for these studies is largely a reflection of theassumptions invoked or relaxed for any one method as wellas the sample size and quality of the data used In principleour method extracts more information than approaches that

use summary statistics such as ABC In our empirical appli-cation we found that our method provides a gain in accuracywhile accounting for uncertainty in both the ancestral hap-lotype and its length on each chromosome Notably ourmethod provides narrower credible intervals by incorporatingthe full information from ancestral haplotype lengths derivedmutations and a reference panel of noncarrier haplotypes

Another caveat of our method is its dependence on thereference panel which is intended to serve as a representativesample of nonancestral haplotypes in the population duringthe selected allelersquos increase in frequency Four possible chal-lenges can arise 1) Segments of the ancestral selected haplo-type may be present in the reference panel due torecombination (this is more likely for alleles that havereached higher frequency) 2) the reference panel may con-tain haplotypes that are similar to the ancestral haplotypedue to low levels of genetic diversity 3) the reference panelmay be too diverged from the focal population and 4) pop-ulation connectivity and turnover may lead the ldquolocalrdquo refer-ence panel to be largely composed of migrant haplotypeswhich were not present during the selected allelersquos initial in-crease in frequency

Under scenarios 1 and 2 the background haplotypes willbe too similar to the ancestral haplotype and it may be dif-ficult for the model to discern a specific ancestry switch lo-cation This leads to fewer differences (mutations) thanexpected between the ancestral haplotype and each benefi-cial allele carrier The simulation results are consistent withthis scenario Our method tends to underestimate the trueage across a range of selection intensities and allele frequen-cies when using a local reference panel

Conversely under scenarios 3 and 4 the model will fail todescribe a recombinant haplotype in the sample of beneficialallele carriers as a mosaic of haplotypes in the reference panelAs a result the model will infer more mutation events toexplain observed differences from the ancestral haplotypeOur simulation results show this to be the case with referencepanels diverged by N generations Posterior mean estimatesare consistently older than their true value Our simulationsare perhaps pessimistic thoughmdashwe chose reference paneldivergence times of N and 05N generations approximatelycorresponding to FST values of 04 and 02 respectively Forthe smaller FST values observed in humans we expect resultsfor diverged panels to be closer to those obtained with thelocal reference panel Nonetheless future extensions to incor-porate multiple populations within the reference panel wouldbe helpful and possible by modifying the approach of Priceet al (2009) Such an approach would also enable the analysisof admixed populations (we excluded admixed samples fromour analysis of the 1000 Genomes data above)

Aside from the challenges imposed by the choice of refer-ence panel another potential source of bias lies in our tran-sition probabilities which are not conditioned on thefrequency of the selected variant In reality recombinationevents at some distance away from the selected site will onlyresult in a switch from the ancestral to background haplo-types at a rate proportional to 1 pl where pl is the fre-quency of the ancestral haplotype alleles at locus l In this way

Smith et al doi101093molbevmsy006 MBE

1012

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

some recombination events may go unobservedmdashas thebeneficial allele goes to high frequency the probability of arecombination event leading to an observable ancestral tobackground haplotype transition decreases One solutionmay be to include the frequency-dependent transition prob-abilities derived by Chen et al (2015) Under their model themutation time is estimated by assuming a deterministic lo-gistic frequency trajectory starting at 1

2N An additional benefitof using frequency trajectories would be the ability to inferposterior distributions on selection coefficients While theselection coefficient is typically assumed to be related tothe time since mutation by t1 frac14 logethNsTHORN=s we do nothave an equivalent expression for time to the common an-cestor Rather than the initial frequency being 1

2N for a newmutation our initial frequency must correspond to the fre-quency at which the TMRCA occurs Griffiths and Tavare(1994) derive a framework to model a genealogy under arbi-trary population size trajectories which should be analogousto the problem of an allele frequency trajectory and addi-tional theory on intra-allelic genealogies may be useful here aswell (Griffiths and Tavare 1994 Wiuf and Donnelly 1999 Wiuf2000 Slatkin and Rannala 2000)

Our model also assumes independence among all haplo-types in the sample in a composite-likelihood frameworkwhich is equivalent to assuming a star-genealogy (Larribeand Fearnhead 2011 Varin et al 2011) This is unlikely tobe the case when sample sizes are large or the TMRCA isold It is also unlikely to be true if the beneficial allele existedon multiple haplotypes preceding the onset of selection wasintroduced by multiple migrant haplotypes from other pop-ulations or occurred by multiple independent mutationevents (Innan and Kim 2004 Hermisson and Pennings2005 Przeworski et al 2005 Pritchard et al 2010 Berg andCoop 2015) Methods for distinguishing selection from stand-ing variation versus de novo mutation are available thatshould make it easier distinguish these cases (Messer andNeher 2012 Peter et al 2012 Messer and Petrov 2013Garud et al 2015)

If the underlying allelic genealogy is not star-like onecan expect different estimates of the TMRCA for differentsubsets of the data Here we performed multiple MCMCson resampled subsets of the data to informally diagnosewhether there are violations from the star-shape geneal-ogy assumption We speculate that exactly how theTMRCAs vary may provide insight to the underlyinghistory In cases where the TMRCA estimates for a par-ticular population are old and more variable than otherpopulations the results may be explained by structure inthe genealogy whereby recent coalescent events haveoccurred among the same ancestral haplotype beforethe common ancestor When estimates are dispersedamong resampled data sets the presence of multiple an-cestral haplotypes prior to the variantrsquos increase in fre-quency may be a better explanation Further support forthis explanation might come from comparisons to otherpopulation samples which show little to no dispersion ofestimates from resampled data sets Future work mightmake it possible to formalize this inference process

A final caveat regards the misspecification of mutation andrecombination rates TMRCA estimates are largely deter-mined by the use of accurate measures for these two param-eters In a way this provides some robustness to our methodOur age estimates depend on mutation and recombinationrates so accurate specification for one of the values can com-pensate for slight misspecification in the other In cases wherea fine-scale recombination map is unavailable we suggest us-ing a uniform recombination rate specific to the locus ofinterest (see Materials and Methods and supplementaryfig 7 Supplementary Material online) Choosing an appropri-ate mutation rate will continue to depend on current andfuture work which tries to resolve discrepancies in publishedmutation rate estimates inferred by various approaches(Segurel et al 2014)

One future direction for our method may be to explicitlyincorporate the possibility of multiple ancestral haplotypeswithin the sample Under a disease mapping frameworkMorris et al (2002) implement a similar idea in the case whereindependent disease causing mutations arise at the same lo-cus leading to independent genealogies for which they cointhe term ldquoshattered coalescentrdquo For our case beneficialmutations may also be independently derived on differenthaplotypes Alternatively a single mutation may be oldenough to reside on different haplotypes due to a sufficientamount of linked variation existing prior to the onset of se-lection Berg and Coop (2015) model selection from standingvariation to derive the distribution of haplotypes that theselected allele is present on

While we have treated the TMRCA as a parameter of in-terest our method also produces a sample of the posteriordistribution on the ancestral haplotype This could provideuseful information to estimate the frequency spectrum ofderived mutations on the ancestral haplotype Similarly thefrequency of shared recombination breakpoints could shedlight on the genealogy and how well it conforms to the star-shape assumption The extent of the ancestral haplotype ineach individual may also prove useful for identifying delete-rious alleles that have increased in frequency as a result ofstrong positive selection on linked beneficial alleles (Chun andFay 2011 Hartfield and Otto 2011) For example Huff et al(2012) describe a risk allele for Crohnrsquos disease at high fre-quency in European populations which they suggest is linkedto a beneficial allele under recent selection Similar to anadmixture mapping approach our method could be usedto identify risk loci by testing for an association betweenthe ancestral haplotype and disease status As another appli-cation identifying the ancestral haplotype may be useful inthe context of identifying a source population (or species) fora beneficial allele prior to its introduction and subsequentincrease in frequency in the recipient population

In many cases the specific site under selection may beunknown or restricted to some set of putative sites Whileour method requires the position of the selected site to bespecified future extensions could treat the selected site as arandom variable to be estimated under the same MCMCframework This framework would also be amenable to mar-ginalizing over uncertainty on the selected site

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1013

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

While we focus here on inference from modern DNAdata the increased accessibility of ancient DNA has addeda new dimension to population genetic data sets (Lazaridiset al 2014 Skoglund et al 2014 Allentoft et al 2015 Haaket al 2015 Mathieson et al 2015) Because it will remaindifficult to use ancient DNA approaches in many specieswith poor sample preservation we believe methods basedon modern DNA will continue to be useful going forwardThat said ancient DNA is providing an interesting avenue forcomparative work between inference from modern and an-cient samples For example Nakagome et al (2016) usesimulations to assess the fit of this ancient DNA polymor-phism to data simulated under their inferred parametervalues for allele age and selection intensity and they findreasonable agreement Much work still remains to fullyleverage ancient samples into population genetic inferencewhile accounting for new sources of uncertainty and poten-tial for sampling bias

Despite these challenges it is clear that our understandingof adaptive history will continue to benefit from new com-putational tools which extract insightful information from adiverse set of data sources

Materials and MethodsWe generated data using the software mssel (Hudson R per-sonal communication) which simulates a sample of haplo-types conditioned on the frequency trajectory of a selectedvariant under the structured coalescent (Hudson and Kaplan1988 Kaplan et al 1988) Trajectories were first simulatedforwards in time under a WrightndashFisher model for an additivelocus with varying strengths of selection and different endingfrequencies of the selected variant Trajectories were thentruncated to end at the first time the allele reaches a specifiedfrequency See supplementary table 1 SupplementaryMaterial online for relative ages of simulated TMRCA val-ues for different end frequencies and selection strengthsFor the results in figure 2 100 simulations were performedfor each parameter combination MCMCs were run for5000 iterations with a burn-in excluding the first 3000iterations A standard deviation of 10 was used for theproposal distribution of tca The red boxplots indicate localreference panels The blue and green boxplots indicate ref-erence panels diverged by 5N generations and 1N gener-ations respectively Each data set was simulated for a 1-Mblocus with a mutation rate of 1 108 recombination rateof 1 108

and population size of 10000 Sample sizesfor the selected and reference panels were 100 and 20respectively

For more efficient run times of the MCMC we set a max-imum number of individuals to include in the selected andreference panels to be 100 and 20 respectively In cases wherethe true number of haplotypes for either panel was greaterthan this in the full data set we resampled a subset ofhaplotypes from each population for a total of five replicatesper population For simulation results supporting the use ofthis resampling strategy see supplementary figure 5Supplementary Material online The MCMCs were run for

15000 iterations with a standard deviation of 20 for theTMRCA proposal distribution The first 9000 iterationswere removed as a burn-in leading to 6000 iterations for asample of the posterior Convergence was assessed by com-parison of MCMC replicates Figure 3 and supplementaryfigure 8 Supplementary Material online show the resultsfor all five variants along with previous point estimates and95 confidence intervals assuming a generation time of29 years (Fenner 2005) Supplementary table 3Supplementary Material online lists the mean and 95 cred-ible intervals for estimates with the highest mean posteriorprobability which we refer to in the text belowSupplementary table 4 Supplementary Material online liststhe previous estimates and confidence intervals with addi-tional details of the different approaches taken

To model recombination rate variation we used recombi-nation rates from the Decode sex-averaged recombinationmap inferred from pedigrees among families in Iceland(Kong et al 2010) Because some populations may have re-combination maps which differ from the Decode map at finescales we used a mean uniform recombination rate inferredfrom the 1-Mb region surrounding each variant The motiva-tion for this arises from how recombination rates have beenpreviously shown to remain relatively consistent among re-combination maps inferred for different populations at themegabase-scale (Broman et al 1998 Baudat et al 2010 Konget al 2010 Auton and McVean 2012) Further we found ourestimates depend mostly on having the megabase-scale re-combination rate appropriately set with little difference inmost cases for estimates obtained by modeling fine-scale re-combination at each locus (supplementary fig 7Supplementary Material online) We specify the switchingrate among background haplotypes after recombining off ofthe ancestral haplotype to be 4Nr where Nfrac14 10000 and r isthe mean recombination rate for the 1-Mb locus

For modeling mutation a challenge is that previous mu-tation rate estimates vary depending on the approach used(Segurel et al 2014) Estimates using the neutral substitutionrate between humans and chimps are more than 2 108

per bp per generation while estimates using whole-genomesequencing are closer to 1 108 As a compromise wespecify a mutation rate of 16 108

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe would like to thank Hussein Al-Asadi Arjun BiddandaAnna Di Rienzo Dick Hudson Choongwon Jeong Evan KochJoseph Marcus Shigeki Nakagome Ben Peter Mark ReppelAlex White members of the Coop laboratory at UC Davismembers of the Przeworski laboratory at ColumbiaUniversity and members of the He and Stephens laboratoriesat the University of Chicago for helpful comments JS wassupported by an NSF Graduate Research Fellowship andNational Institute of General Medical Sciences of the

Smith et al doi101093molbevmsy006 MBE

1014

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

National Institutes of Health under award numbers DGE-1144082 and T32GM007197 respectively This work wasalso supported by the National Institute of General MedicalSciences of the National Institutes of Health under awardnumbers RO1GM83098 and RO1GM107374 to GC as wellas R01HG007089 to JN

ReferencesAkey JM Eberle MA Rieder MJ Carlson CS Shriver MD Nickerson DA

Kruglyak L 2004 Population history and natural selection shapepatterns of genetic variation in 132 genes PLoS Biol 21591ndash1599

Allentoft ME Sikora M Sjogren K-G Rasmussen S Rasmussen MStenderup J Damgaard PB Schroeder H Ahlstrom T Vinner L2015 Population genomics of Bronze Age Eurasia Nature522(7555) 167ndash172

Auton A McVean G 2012 Estimating recombination rates from geneticvariation in humans In Anisimova M ed Evolutionary genomicsMethods in Molecular Biology (Methods and Protocols) vol 856New York Humana Press

Barrett RDH Hoekstra HE 2011 Molecular spandrels tests of adaptationat the genetic level Nat Rev Genet 12(11) 767ndash780

Baudat F Buard J Grey C Fledel-Alon A Ober C Przeworski M Coop GMassy BD 2010 PRDM9 is a major determinant of meiotic recom-bination hotspots in humans and mice Science 327(5967) 836ndash840

Beaumont MA Zhang W Balding DJ 2002 Approximate Bayesian com-putation in population genetics Genetics 162(4) 2025ndash2035

Beleza S Johnson NA Candille SI Absher DM Coram MA Lopes JCampos J Araujo II Anderson TM Vilhjalmsson BJ et al 2013Genetic architecture of skin and eye color in an African-Europeanadmixed population PLoS Genet 9(3) e1003372

Beleza S Santos AM McEvoy B Alves I Martinho C Cameron E ShriverMD Parra EJ Rocha J 2013 The timing of pigmentation lightening inEuropeans Mol Biol Evol 30(1) 24ndash35

Berg JJ Coop G 2015 A coalescent model for a sweep of a uniquestanding variant Genetics 201(2) 707ndash725

Bersaglieri T Sabeti PC Patterson N Vanderploeg T Schaffner SF DrakeJA Rhodes M Reich DE Hirschhorn JN 2004 Genetic signatures ofstrong recent positive selection at the lactase gene Am J Hum Genet74(6) 1111ndash1120

Broman KW Murray JC Sheffield VC White RL Weber JL 1998Comprehensive human genetic maps individual and sex-specificvariation in recombination Am J Hum Genet 63(3) 861ndash869

Bryk J Hardouin E Pugach I Hughes D Strotmann R Stoneking M MylesS 2008 Positive selection in East Asians for an EDAR allele thatenhances NF-jB activation PLoS One 3(5) e2209

Chen H Slatkin M 2013 Inferring selection intensity and allele age frommultilocus haplotype structure G3 3(8) 1429ndash1442

Chen H Hey J Slatkin M 2015 A hidden Markov model for investigatingrecent positive selection through haplotype structure Theor PopulBiol 9918ndash30

Chun S Fay JC 2011 Evidence for hitchhiking of deleterious mutationswithin the human genome PLoS Genet 7(8) e1002240

Coop G Griffiths RC 2004 Ancestral inference on gene trees underselection Theor Popul Biol 66(3) 219ndash232

Coop G Bullaughey K Luca F Przeworski M 2008 The timing of selec-tion at the human FOXP2 gene Mol Biol Evol 25(7) 1257ndash1259

Dalziel AC Rogers SM Schulte PM 2009 Linking genotypes to pheno-types and fitness how mechanistic biology can inform molecularecology Mol Ecol 18(24) 4997ndash5017

Eaton K Edwards M Krithika S Cook G Norton H Parra EJ 2015Association study confirms the role of two OCA2 polymorphismsin normal skin pigmentation variation in East Asian populations AmJ Hum Biol 27 520ndash525

Edwards M Bigham A Tan J Li S Gozdzik A Ross K Jin L Parra EJ 2010Association of the OCA2 polymorphism His615Arg with melanin

content in east Asian populations further evidence of convergentevolution of skin pigmentation PLoS Genet 6(3) e1000867

Elmer KR Meyer A 2011 Adaptation in the age of ecological genomicsinsights from parallelism and convergence Trends Ecol Evol 26(6)298ndash306

Enattah NS Sahi T Savilahti E Terwilliger JD Peltonen L Jeuroarveleuroa I 2002Identification of a variant associated with adult-type hypolactasiaNat Genet 30(2) 233ndash237

Eng MY Luczak SE Wall TL 2007 ALDH2 ADH1B and ADH1C geno-types in Asians a literature review Alcohol Res Health 30(1) 22

Fenner JN 2005 Cross-cultural estimation of the human generationinterval for use in genetics-based population divergence studiesAm J Phys Anthropol 128(2) 415ndash423

Fujimoto A Ohashi J Nishida N Miyagawa T Morishita Y Tsunoda TKimura R Tokunaga K 2008 A replication study confirmed theEDAR gene to be a major contributor to population differentiationregarding head hair thickness in Asia Hum Genet 124(2) 179ndash185

Garud NR Messer PW Buzbas EO Petrov DA 2015 Recent selectivesweeps in North American Drosophila melanogaster show signaturesof soft sweeps PLoS Genet 11(2) e1005004

Griffiths RC Tavare S 1994 Sampling theory for neutral alleles in avarying environment Philos Trans R Soc B Biol Sci 344(1310)403ndash410

Haak W Lazaridis I Patterson N Rohland N Mallick S Llamas B BrandtG Nordenfelt S Harney E Stewardson K 2015 Massive migrationfrom the steppe was a source for Indo-European languages inEurope Nature 522 207ndash211

Hartfield M Otto SP 2011 Recombination and hitchhiking of deleteri-ous alleles Evolution 65(9) 2421ndash2434

Hermisson J Pennings PS 2005 Soft sweeps molecular population ge-netics of adaptation from standing genetic variation Genetics 169(4)2335ndash2352

Hinch AG Tandon A Patterson N Song Y Rohland N Palmer CD ChenGK Wang K Buxbaum SG Akylbekova EL et al 2011 The landscapeof recombination in African Americans Nature 476(7359) 170ndash175

Hudson RR 1990 Gene genealogies and the coalescent process Oxf SurvEvol Biol 7(1) 44

Hudson RR 2007 The variance of coalescent time estimates from DNAsequences J Mol Evol 64(6) 702ndash705

Hudson RR Kaplan NL 1988 The coalescent process in models withselection and recombination Genetics 120(3) 831ndash840

Huff CD Witherspoon DJ Zhang Y Gatenbee C Denson LA KugathasanS Hakonarson H Whiting A Davis CT Wu W et al 2012 Crohnrsquosdisease and genetic hitchhiking at IBD5 Mol Biol Evol 29(1)101ndash111

Innan H Kim Y 2004 Pattern of polymorphism after strong artificialselection in a domestication event Proc Natl Acad Sci U S A 101(29)10667ndash10672

Itan Y Powell A Beaumont MA Burger J Thomas MG 2009 The originsof lactase persistence in Europe PLoS Comput Biol 5(8) e1000491

Jablonski NG Chaplin G 2000 The evolution of human skin coloration JHum Evol 39(1) 57ndash106

Jablonski NG Chaplin G 2012 Human skin pigmentation migration anddisease susceptibility Philos Trans R Soc B Biol Sci 367(1590)785ndash792

Jeong C Di Rienzo A 2014 Adaptations to local environments in mod-ern human populations Curr Opin Genet Dev 291ndash8

Kamberov YG Wang S Tan J Gerbault P Wark A Tan L Yang Y Li STang K Chen H et al 2013 Modeling recent human evolution inmice by expression of a selected EDAR variant Cell 152(4) 691ndash702

Kaplan NL Darden T Hudson RR 1988 The coalescent process inmodels with selection Genetics 120(3) 819ndash829

Kaplan NL Hudson RR Langley CH 1989 The ldquohitchhiking effectrdquo revis-ited Genetics 123(4) 887ndash899

Kimura R Yamaguchi T Takeda M Kondo O Toma T Haneji KHanihara T Matsukusa H Kawamura S Maki K et al 2009 A com-mon variation in EDAR is a genetic determinant of shovel-shapedincisors Am J Hum Genet 85(4) 528ndash535

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1015

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Kong A Thorleifsson G Gudbjartsson DF Masson G Sigurdsson AJonasdottir A Walters GB Jonasdottir A Gylfason A KristinssonKT et al 2010 Fine-scale recombination rate differences betweensexes populations and individuals Nature 467(7319) 1099ndash1103

Kuokkanen M Kokkonen J Enattah NS Ylisaukko-oja T Komu H VariloT Peltonen L Savilahti E Jeuroarveleuroa I 2006 Mutations in the translatedregion of the lactase gene (LCT) underlie congenital lactase defi-ciency Am J Hum Genet 78(2) 339ndash344

Larribe F Fearnhead P 2011 On composite likelihoods in statisticalgenetics Stat Sin 21(1) 43ndash69

Lazaridis I Patterson N Mittnik A Renaud G Mallick S Kirsanow KSudmant PH Schraiber JG Castellano S Lipson M et al 2014Ancient human genomes suggest three ancestral populations forpresent-day Europeans Nature 513(7518) 409ndash413

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357) 493ndash496

Li H Mukherjee N Soundararajan U Tarnok Z Barta C Khaliq SMohyuddin A Kajuna SLB Mehdi SQ Kidd JR et al 2007Geographically separate increases in the frequency of the derivedADH1B 47His allele in eastern and western Asia Am J Hum Genet81(4) 842ndash846

Li N Stephens M 2003 Modeling linkage disequilibrium and identifyingrecombination hotspots using single-nucleotide polymorphismdata Genetics 165(4) 2213ndash2233

Loomis WF 1967 Skin-pigment regulation of vitamin-D biosynthesis inman Science 157(3788) 501ndash506

Mathieson I Lazaridis I Rohland N Mallick S Patterson N RoodenbergSA Harney E Stewardson K Fernandes D Novak M et al 2015Genome-wide patterns of selection in 230 ancient Eurasians Nature528(7583) 499ndash503

McPeek MS Strahs A 1999 Assessment of linkage disequilibrium by thedecay of haplotype sharing with application to fine-scale geneticmapping Am J Hum Genet 65(3) 858ndash875

Meligkotsidou L Fearnhead P 2005 Maximum-likelihood estimationof coalescence times in genealogical trees Genetics 171(4)2073ndash2084

Messer PW Neher RA 2012 Estimating the strength of selective sweepsfrom deep population diversity data Genetics 191(2) 593ndash605

Messer PW Petrov DA 2013 Population genomics of rapid adaptationby soft selective sweeps Trends Ecol Evol 28(11) 659ndash669

Miller CT Beleza S Pollen AA Schluter D Kittles RA Shriver MDKingsley DM 2007 cis-Regulatory changes in kit ligand expressionand parallel evolution of pigmentation in sticklebacks and humansCell 131(6) 1179ndash1189

Morris AP Whittaker JC Balding DJ 2000 Bayesian fine-scale mapping ofdisease loci by hidden Markov models Am J Hum Genet 67(1)155ndash169

Morris AP Whittaker JC Balding DJ 2002 Fine-scale mapping of diseaseloci via shattered coalescent modeling of genealogies Am J HumGenet 70(3) 686ndash707

Nakagome S Alkorta-Aranburu G Amato R Howie B Peter BMHudson RR Di Rienzo A 2016 Estimating the ages of selectionsignals from different epochs in human history Mol Biol Evol33(3) 657ndash669

Nielsen R 2005 Molecular signatures of natural selection Annu RevGenet 39 197ndash218

Olds LC Sibley E 2003 Lactase persistence DNA variant enhances lactasepromoter activity in vitro functional role as a cis regulatory elementHum Mol Genet 12(18) 2333ndash2340

Ormond L Foll M Ewing GB Pfeifer SP Jensen JD 2016 Inferring the ageof a fixed beneficial allele Mol Ecol 25(1) 157ndash169

Osier MV Pakstis AJ Soodyall H Comas D Goldman D Odunsi AOkonofua F Parnas J Schulz LO Bertranpetit J et al 2002 A globalperspective on genetic variation at the ADH genes reveals unusualpatterns of linkage disequilibrium and diversity Am J Hum Genet71(1) 84ndash99

Peng Y Shi H Qi X-B Xiao C-J Zhong H Ma Run-lin Z Su B 2010The ADH1B Arg47His polymorphism in East Asian populations

and expansion of rice domestication in history BMC Evol Biol10(1) 15

Peter BM Huerta-Sanchez E Nielsen R 2012 Distinguishing betweenselective sweeps from standing variation and from a de novo mu-tation PLoS Genet 8(10) e1003011

Przeworski M Coop G Wall JD 2005 The signature of positive selectionon standing genetic variation Evolution 59(11) 2312ndash2323

Price AL Tandon A Patterson N Barnes KC Rafaels N Ruczinski I BeatyTH Mathias R Reich D Myers S 2009 Sensitive detection of chro-mosomal segments of distinct ancestry in admixed populationsPLoS Genet 5(6) e1000519

Pritchard JK Seielstad MT Perez-Lezaun A Feldman MW 1999Population growth of human Y chromosomes a study of Y chro-mosome microsatellites Mol Biol Evol 16(12) 1791ndash1798

Pritchard JK Pickrell JK Coop G 2010 The genetics of human adapta-tion hard sweeps soft sweeps and polygenic adaptation Curr Biol20(4) R208ndashR215

Przeworski M 2003 Estimating the time since the fixation of a beneficialallele Genetics 164(4) 1667ndash1676

Rabiner LR 1989 A tutorial on hidden Markov models and selectedapplications in speech recognition Proc IEEE 77(2) 257ndash286

Radwan J Babik W 2012 The genomics of adaptation Proc R Soc Lond BBiol Sci 279(1749) 5024ndash5028

Rannala B Reeve JP 2001 High-resolution multipoint linkage-disequilibrium mapping in the context of a human genome se-quence Am J Hum Genet 69(1) 159ndash178

Rannala B Reeve JP 2002 Joint Bayesian estimation of mutationlocation and age using linkage disequilibrium Biocomputing2003 526ndash534

Segurel L Wyman MJ Przeworski M 2014 Determinants of mutationrate variation in the human germline Annu Rev Genomics HumGenet 1547ndash70

Simoons FJ 1970 Primary adult lactose intolerance and the milkinghabit a problem in biologic and cultural interrelations Am J DigDis 15(8) 695ndash710

Skoglund P Malmstrom H Omrak A Raghavan M Valdiosera CGunther T Hall P Tambets K Parik J Sjogren K-G et al 2014Genomic diversity and admixture differs for Stone-AgeScandinavian foragers and farmers Science 344(6185) 747ndash750

Slatkin M 2001 Simulating genealogies of selected alleles in a populationof variable size Genet Res 78(01) 49ndash57

Slatkin M 2008 A Bayesian method for jointly estimating allele age andselection intensity Genet Res 90(01) 129ndash137

Slatkin M Hudson RR 1991 Pairwise comparisons of mitochondrialDNA sequences in stable and exponentially growing populationsGenetics 129(2) 555ndash562

Slatkin M Rannala B 2000 Estimating allele age Annu Rev GenomicsHum Genet 1(1) 225ndash249

Smith JM Haigh J 1974 The hitch-hiking effect of a favourable geneGenet Res 23(01) 23ndash35

Stinchcombe JR Hoekstra HE 2008 Combining population genomicsand quantitative genetics finding the genes underlying ecologicallyimportant traits Heredity 100(2) 158ndash170

Tang H Siegmund DO Shen P Oefner PJ Feldman MW 2002Frequentist estimation of coalescence times from nucleotide se-quence data using a tree-based partition Genetics 161(1) 447ndash459

Tavare S Balding DJ Griffiths RC Donnelly P 1997 Inferring coalescencetimes from DNA sequence data Genetics 145(2) 505ndash518

Tennessen JA Bigham AW OrsquoConnor TD Fu W Kenny EE Gravel SMcGee S Do R Liu X Jun G et al 2012 Evolution and functionalimpact of rare coding variation from deep sequencing of humanexomes Science 337(6090) 64ndash69

Thomson R Pritchard JK Shen P Oefner PJ Feldman MW 2000Recent common ancestry of human Y chromosomes evidencefrom DNA sequence data Proc Natl Acad Sci U S A 97(13)7360ndash7365

Tishkoff SA Reed FA Ranciaro A Voight BF Babbitt CC Silverman JSPowell K Mortensen HM Hirbo JB Osman M et al 2007

Smith et al doi101093molbevmsy006 MBE

1016

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Convergent adaptation of human lactase persistence in Africa andEurope Nat Genet 39(1) 31ndash40

Troelsen JT Olsen J Moslashller J SjeuroOstreuroOm H 2003 An upstream polymor-phism associated with lactase persistence has increased enhanceractivity Gastroenterology 125(6) 1686ndash1694

Varin C Reid N Firth D 2011 An overview of composite likelihoodmethods Stat Sin 21(1) 5ndash42

Voight BF Kudaravalli S Wen X Pritchard JK Hurst L 2006 A map ofrecent positive selection in the human genome PLoS Biol 4(3) e72

Wegmann D Kessner DE Veeramah KR Mathias RA Nicolae DL YanekLR Sun YV Torgerson DG Rafaels N Mosley T et al 2011Recombination rates in admixed individuals identified by ancestry-based inference Nat Genet 43(9) 847ndash853

Wilde S Timpson A Kirsanow K Kaiser E Kayser M Unterleuroander MHollfelder N Potekhina ID Schier W Thomas MG et al 2014 Directevidence for positive selection of skin hair and eye pigmentation inEuropeans during the last 5000 years Proc Natl Acad Sci U S A111(13) 4832ndash4837

Williamson SH Hernandez R Fledel-Alon A Zhu L Nielsen RBustamante CD 2005 Simultaneous inference of selection and pop-ulation growth from patterns of variation in the human genomeProc Natl Acad Sci U S A 102(22) 7882ndash7887

Wiuf C 2000 On the genealogy of a sample of neutral rare alleles TheorPopul Biol 58(1) 61ndash75

Wiuf C Donnelly P 1999 Conditional genealogies and the age of aneutral mutant Theor Popul Biol 56(2) 183ndash201

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1017

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

our estimation of the TMRCA to reflect uncertainty in theprecise switch point off of the ancestral haplotype which inmany cases will be difficult to distinguish from the backgroundhaplotypes Another significant difference is that our modeldoes not make assumptions about the frequency trajectoryapart from that a sufficiently hard sweep occurred to incur astar-shaped genealogy Below we use simulations to show thesensitivity of our model to these simplified assumptions forvarying strengths of selection final allele frequencies and sam-pling regimes for the choice of reference panel An R package isavailable to implement this method on github (httpsgithubcomjhavsmithstartmrca last accessed January 23 2018)

New Approaches

Model DescriptionIn general the TMRCA for a sample of haplotypes carryingthe advantageous allele (hereafter referred to as tca) will bemore recent than the time of mutation (Kaplan et al 1989)We aim to estimate tca in the case where a selectively advan-tageous mutation occurred in an ancestor of our sample t1

generations ago (fig 1a in main text) Viewed backwards intime the selected variant decreases in frequency at a rateproportional to the selection strength During a rapid dropin allele frequency the coalescent rate among haplotypescarrying the selected variant is amplified The same effect

would be observed for population growth from a small initialsize forward in time (Hudson 1990 Slatkin and Hudson 1991)As a result the genealogy of a sample having undergoneselection andor population growth becomes more ldquostar-shapedrdquo This offers some convenience as it becomes moreappropriate to invoke an assumption of independenceamong lineages when selection is strong We would like toemphasize that this assumption necessarily implies that thebeneficial allele has a single ancestral haplotype that has in-creased in frequency This is in contrast to a scenario in whichthe beneficial allele has been present in the population forsome time prior to selection For that case multiple ancestralhaplotypes would increase in frequency simultaneouslyresulting in a genealogy that is not star-shaped

We assume no crossover interference between recombi-nation events within a haplotype and therefore treat eachside flanking the focal allele separately We define one side ofthe selected site within a window of some predeterminedlength to have L segregating sites such that an individualrsquossequence will be indexed from site s frac14 f1 Lg wheresfrac14 1 refers to the selected site (a notation reference is pro-vided in table 1) To simplify notation this description will bewritten for a window on one side flanking the selected siteNote that the opposing side of the selected site is modelled inan identical fashion after redefining L

Let X denote an n L data matrix for a sample of nchromosomes with the selected variant Xij is the observedallelic type in chromosome i at variant site j and is assumed tobe biallelic where Xij 2 f1 0g Let H denote an m L matrixcomprising m chromosomes that do not have the selectedvariant where Hij 2 f1 0g Let A denote the ancestral hap-lotype as a vector of length L where Aj is the allelic type on theancestral selected haplotype at segregating site j andAj 2 f1 0g We assume independence among lineages lead-ing to the most recent common ancestor of the selectedhaplotype This is equivalent to assuming a star-shaped ge-nealogy which as noted above is a reasonable assumption forsites linked to a favorable variant under strong selection Wecan then write the likelihood as

PrethXjtcaAHTHORN frac14Yn

i

PrethXijtcaAHTHORN (1)

In each individual haplotype Xi we assume the ancestralhaplotype extends from the selected allele until a recombina-tion event switches ancestry to a different genetic backgroundLet Wfrac14w indicate that the location of the first recombi-nation event occurs between sites w and wthorn 1 where W2 f1 Lg (wfrac14 L indicates no recombination up to siteL) We can then condition the probability of the data on theinterval where the first recombination event occurs and sumover all possible intervals to express the likelihood as

PrethXijtcaAHTHORN frac14XL

wfrac141

PrethXijtcaAHWi frac14 wTHORN

PrethWi frac14 wjtcaTHORN (2)

FIG 1 Visual descriptions of the model (a) An idealized illustration ofthe effect of a selectively favored mutationrsquos frequency trajectory(black line) on the shape of a genealogy at the selected locus Theorange lineages are chromosomes with the selected allele The bluelineages indicate chromosomes that do not have the selected alleleNote the distinction between the time to the common ancestor ofchromosomes with the selected allele tca and the time at which themutation arose t1 (b) The copying model follows the ancestral hap-lotype (orange) moving away from the selected site until recombina-tion events within the reference panel lead to a mosaic of nonselectedhaplotypes surrounding the ancestral haplotype (c) A demographichistory with two choices for the reference panel Local and divergedAfter the ancestral population at the top of the figure splits into twosister populations a beneficial mutation arises and begins increasingin frequency The orange and blue colors indicate frequency of theselected and nonselected alleles respectively

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1005

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Assuming haplotype lengths are independent and identi-cally distributed draws from an exponential distribution thetransition probabilities for a recombination event off of theancestral haplotype are

PrethWi frac14 wjtcaTHORN

frac14ertcadweth1 ertcaethdwthorn1dwTHORNTHORN if w frac14 f1 ethL 1THORNg

ertcadL if w frac14 L

(

(3)

where dw is the distance in base pairs of site w from theselected site and r is the local recombination rate per basepair per generation The data for each individual Xi can bedivided into two parts One indicating the portion of anindividualrsquos sequence residing on the ancestral haplotype (be-fore recombining between sites w and wthorn 1) XiethjwTHORN andthat portion residing off of the ancestral haplotype after arecombination event XiethjgtwTHORN We denote a separate likelihoodfor each portion

aiw frac14 PrethXiethjwTHORNjtcaAWi frac14 wTHORN (4)

biw frac14 PrethXiethjgtwTHORNjHethjgtwTHORNWi frac14 wTHORN (5)

Because the focal allele is on the selected haplotypeai1 frac14 1 Conversely we assume a recombination eventoccurs at some point beyond locus L such thatbiL frac14 1 We assume the waiting time to mutation ateach site on the ancestral haplotype is exponentiallydistributed with no reverse mutations and express thelikelihood as

aiw frac14 PrethXiethjwTHORNjtcaAWi frac14 wTHORN

frac14 etcalethdwwTHORNYw

jfrac142

PrethXij frac14 ajtcaATHORN (6)

PrethXij frac14 ajtcaATHORN frac14etcal if a frac14 Aj

1 etcal if a frac14 Aj

((7)

The term etcalethdwwTHORN on the right side of equation (6)captures the lack of mutation at invariant sites betweeneach segregating site Assuming tcal is small equation (6) isequivalent to assuming a Poisson number of mutations(with mean tcal) occurring on the ancestral haplotype

For biw the probability of observing a particular sequenceafter recombining off of the ancestral haplotype is dependenton standing variation in background haplotype diversity TheLi and Stephens (2003) haplotype copying model allows forfast computation of an approximation to the probability ofobserving a sample of chromosomes related by a genealogywith recombination Given a sample of m haplotypesH 2 fh1 hmg a population scaled recombination rateq and mutation rate h an observed sequence of alleles ismodeled as an imperfect copy of any one haplotype in thereference panel at each SNP Let Zij denote the reference panelhaplotype which Xi copies at the jth SNP and cj denote thenumber of base pairs between SNPs j and jthorn 1 Zij follows aMarkov process with transition probabilities

PrethZiethjthorn1THORN frac14 z0jZij frac14 zTHORN

frac14eqjcj=m thorn eth1 eqjcj=mTHORNeth1=mTHORN if z0 frac14 z

eth1 eqjcj=mTHORNeth1=mTHORN if z0 frac14 z

(

(8)

To include mutation the probability that the sampledhaplotype matches a haplotype in the reference panel ism=ethmthorn hTHORN and the probability of a mismatch (or muta-tion event) is h=ethmthorn hTHORN Letting a refer to an allele wherea 2 f1 0g the matching and mismatching probabilitiesare

PrethXij frac14 ajZij frac14 z h1 hmTHORN

frac14m=ethmthorn hTHORN thorn eth1=2THORNethh=ethmthorn hTHORNTHORN if hzj frac14 a

eth1=2THORNethh=ethmthorn hTHORNTHORN if hzj frac14 a

(

(9)

Equation (5) requires a sum over the probabilities of allpossible values of Zj using equations (8) and (9) This is com-puted using the forward algorithm as described in Rabiner(1989) and Appendix A of Li and Stephens (2003) It shouldbe noted that this formulation does not model the observa-tion of an invariant site among the background haplotypesWe tried an approach to model these sites but saw no im-provement in model performance (see supplementary ap-pendix S2 Supplementary Material online)

The complete likelihood for our problem can then beexpressed as

Table 1 Notation Used to Describe the Model

n Number of haplotypes with the selected allelem Number of haplotypes without the selected alleleL Number of SNPs flanking the selected site (one side considered

at a time)X n L matrix of haplotypes with the selected alleleH m L matrix of haplotypes without the selected alleleXij Allele in haplotype i at SNP j where i 2 f1 ng and

j 2 f1 LgHzj Allele in haplotype z at SNP j where z 2 f1 mg and

j 2 f1 LgAj Allele at site j on the ancestral haplotypeZij The reference panel haplotype from which Xi copies at site jtca Time to the most recent common ancestor (TMRCA)Wi The location of the first recombination event off of the ancestral

haplotyper Recombination rate per basepair per generationl Mutation rate per basepair per generationh Haplotype miscopying rate or population-scaled mutation rate

eth4NlTHORNq Haplotype switching rate or population-scaled recombination

rate eth4NrTHORNdw Physical distance of site w from the selected site where

w 2 f1 Lgcj Number of basepairs between sites j and j thorn 1aiw Likelihood of haplotype i for sites 1 wbiw Likelihood of haplotype i for SNPs ethwthorn 1THORN L

Smith et al doi101093molbevmsy006 MBE

1006

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

PrethXjtcaAHTHORN frac14Yn

ifrac141

XL

wfrac141

aiwbiwPrethWi frac14 wjtcaATHORN (10)

This computation is on the order 2Lnm2 and in practicefor mfrac14 20 nfrac14 100 and Lfrac14 4000 takes approximately3027 s to compute on an Intel Core i7-4750HQ CPU at200 GHz 8 with 156 GB RAM

InferencePerforming inference on tca requires addressing the latentvariables w and A in the model Marginalizing over possiblevalues of w is a natural summation per haplotype that is linearin L as shown above For A the number of possible values islarge (2L) and so we employ a MetropolisndashHastings algorithmto jointly sample the posterior of A and tca and then we takemarginal samples of tca for inference We assign a uniformprior density for both A and tca such that PrethATHORN frac14 1=2L andPrethtcaTHORN frac14 1=ethtmax tminTHORN where tmax and tmin are user-specified maximum and minimum values for tca ProposedMCMC updates of the ancestral haplotype A0 are generatedby randomly selecting a site in A and flipping to the alterna-tive allele For tca proposed values are generated by adding anormally distributed random variable centered at 0tca0 frac14 tca thorn Neth0 r2THORN To start the MetropolisndashHastings algo-

rithm an initial value of tca is uniformly drawn from a user-specified range of values (10ndash2000 in the applications here)To initialize the ancestral haplotype to a reasonable value weuse a heuristic algorithm which exploits the characteristicdecrease in variation near a selected site (see supplementaryappendix S2 Supplementary Material online)

For each haplotype in the sample of beneficial allele car-riers the Li and Stephens (2003) model uses a haplotypemiscopying rate h and switching rate q to compute a like-lihood term for loci following the recombination event off ofthe ancestral haplotype For our analyses we set q frac14 44104 using our simulated values of r frac14 11 108 per bpper generation and Nfrac14 10000 where q frac14 4Nr Following Li

and Stephens (2003) we fix h frac14 ethPn

mfrac1411=mTHORN1 as derived

from the expected number of mutation events on a geneal-ogy relating n chromosomes at a particular site We found nodiscernible effects on estimate accuracy when specifying dif-ferent values of q (supplementary fig 1 SupplementaryMaterial online)

ResultsBecause our model requires a sample (or ldquopanelrdquo) of referencehaplotypes without the selected allele we tested our methodfor cases in which the reference panel is chosen from the localpopulation in which the selected allele is found as well ascases where the panel is from a diverged population wherethe selected haplotype is absent (fig 1c) Regardless of sce-nario the estimates are on average within a factor of 2 of thetrue value and often much closer When using a local refer-ence panel point estimates of tca increasingly underestimatethe true value (TMRCA) as selection becomes weaker and thefinal allele frequency increases (fig 2) Put differently the age

of older TMRCAs tends to be underestimated with local ref-erence panels Using the mean posteriors as point estimatesmean values of log2(estimatetrue value) range from062 to014 Simulations using a diverged population for the refer-ence panel removed the bias though only in cases where thedivergence time was not large For a reference panel divergedby 05N generations mean log2(estimatetrue value) valuesrange from 021 to 018 As the reference panel becomestoo far diverged from the selected population estimates be-come older than the true value (036 to 094 log2(estimatetrue value)) In these cases the HMM is unlikely to infer aclose match between background haplotypes in the sampleand the reference panel leading to many more mismatchesbeing inferred as mutation events on the ancestral haplotypeand an older estimate of tca

The bottom panel of figure 2 shows the effect of selectionstrength and final allele frequency on the size of the 95credible interval around point estimates normalized by thetrue TMRCA for each simulated data set Before normalizingcredible interval sizes using a local reference panel range from73 to 213 generations for 2Ns frac14 100 versus 18 to 22 gener-ations when 2Ns frac14 2000 Using local and diverged referencepanels we found a minimal effect of the sample size on pointestimates (supplementary figs 2 and 3 SupplementaryMaterial online) As expected larger sample sizes for the car-rier panel improve estimate accuracy However higher allelefrequencies and weak selection are likely to induce more un-certainty due to the ancestral haplotype tracts recombiningwithin the sample We find this effect more pronounced withlarge sample sizes for the reference panel We speculate that alarge sample of reference haplotypes leads the focal selectedhaplotype to have increased probability of copying from thereference panel leading to a shorter selected haplotype andslight overestimate of the TMRCA

We also performed simulations under varying degrees ofmutation and recombination rate misspecification (supple-mentary fig 4 Supplementary Material online) In most casesmean values of log2(estimatetrue value) stay within an orderof magnitude of 0 As expected when both the mutation andrecombination rate are misspecified we find the most dis-crepancy To assess the convergence properties of theMCMC five replicate chains were run for each of 20 sim-ulated data sets produced under three 2Ns values (100200 and 2000) for frequency trajectories ending at 01(supplementary fig 5 Supplementary Material online)While care is always warranted with MCMC approacheswe find in practice that convergence among our replicatechains is attained relatively quickly (3 000 iterations forsimulated data and 3000ndash9000 iterations for appliedcases see supplementary fig 9 Supplementary Materialonline)

We compared the performance of our estimator withthree other model-based approaches for allele age estimationby matching the simulation scheme performed by Chen et al(2015) (supplementary table 5 Supplementary Material on-line) Our method shows improvement in accuracy (RMSE)andor lower bias for simulations with lower frequencies ofthe beneficial allele (40) regardless of the reference panel

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1007

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

used In cases where the final beneficial allele frequency ishigher (80) our methodrsquos accuracy remains as good or bet-ter than the other methods when using a diverged referencepanel with a two-orders-of-magnitude improvement of biasunder strong selection (sfrac14 005) Estimates when using a localreference panel and a high final beneficial allele frequencyremain comparable to the other methods for strong selectionbut tend to have more bias and decreased accuracy as selec-tion strength decreases

Assuming a star-genealogy among beneficial allele carriersmay result in underestimating the variance for the posteriordistribution when there is nonindependence in our sample

To measure this affect we computed TMRCA estimates on100 bootstrap replicates for 4 simulated data sets under 2selection strengths and 2 final allele frequencies (supplemen-tary table 7 Supplementary Material online) We find closeagreement between the 95 posterior credible intervals ofthe original data and the 95 confidence intervals computedon the bootstrap estimates for a selection strength of 01 forboth final allele frequencies of 04 and 08 As expected olderTMRCAs are likely to violate the star-genealogy assumptionand in these cases we find that estimates from our originaldata are more narrow than the bootstrap confidenceintervals

Frequency 01 Frequency 03 Frequency 06 Frequency 09

minus2

minus1

0

1

2

3

0005 001 01 0005 001 01 0005 001 01 0005 001 01

log2

(Est

imat

eTr

ue V

alue

)Local 02N 05N 10N

Frequency 01 Frequency 03 Frequency 06 Frequency 09

02

04

06

08

0005 001 01 0005 001 01 0005 001 01 0005 001 01Selection Strength

Nor

mal

ized

95

Cre

dibl

e In

terv

al R

ange

FIG 2 Accuracy results from simulated data Accuracy of TMRCA point estimates and 95 credible interval ranges from posteriors inferred fromsimulated data under different strengths of selection final allele frequencies and choice of reference panel Credible interval range sizes are in unitsof generations and are normalized by the true TMRCA for each simulated data set See Materials and Methods below for simulation details

Smith et al doi101093molbevmsy006 MBE

1008

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Recombination versus Mutation as a Source ofInformationWe compared our model-based inference with simpler esti-mates of the TMRCA using the number of derived mutationson the ancestral haplotype and the mean length of the an-cestral haplotype In addition to quantifying the improve-ment our method has over these calculations this alsoserves as an ad-hoc way to understand how the relativeweight of information from mutation and recombinationaffects the performance of our method One can model thehaplotype lengths as independent and exponentially distrib-uted to derive a recombination-based estimator tr as

tr frac141

wor (11)

where r is the recombination rate and wo is the observedmean ancestral haplotype length To leverage the count ofderived mutations on the ancestral haplotype we use theThomson et al (2000) estimator In a sample of n haplotypeswith the selected allele a mutation-based estimator tm canbe calculated as

tm frac141

n

Xn

i

yi

wil (12)

where yi is the number of derived mutations on the ith hap-lotype which has length wi basepairs See Hudson (2007) for aderivation of the estimate for the variance of the Thomsonestimator

When using derived mutations uncertainty in both theancestral haplotype sequence and the length of the ancestralhaplotype on each chromosome (wi) can lead to poor esti-mation To improve inference researchers typically define arestricted ldquononrecombiningrdquo region that may reliably containderived mutations on the ancestral haplotype This has twodisadvantages 1) There is more information available in thedata which cannot be used because excess caution is neces-sary to prevent overcounting of derived mutations and2) there may still be unobserved recombination events inthis restricted locus To minimize the use of heuristics for aderived mutation approach we used our model to findmaximum-likelihood estimates of the ancestral haplotypebreakpoints using equation (2) in the model descriptionWe also used the mean posterior estimate of the ancestralhaplotype from our model to identify derived mutations Tocalculate a recombination-based estimator of the TMRCA wecalculated wo using the same maximum-likelihood estimatesof the ancestral haplotype lengths inferred for the mutationestimator

When using a local reference panel the simple mutationestimator tm consistently underestimates the true TMRCAThe recombination estimate however remains accurate(supplementary fig 6 Supplementary Material online) Wesuspect this to be a result of poor estimation of the ancestralhaplotype and violation of the star-genealogy assumption Incases where selection is weaker and the genealogy is not star-shaped derived mutations occurring early in the genealogywill be overrepresented and incorrectly inferred to be the

ancestral allele In this way high-frequency-derived alleleswill not be counted As predicted increasing selectionstrength improves mutation estimator accuracy The recom-bination estimator appears robust to this effect as long asselection is not too strong For very strong selection andyoung TMRCA values maximum-likelihood estimates ofthe haplotype lengths become constrained by the size ofthe locus studied For example in simulations with a selectionstrength of 005 and frequency of 01 the mean TMRCA isaround 100 generations Using equation (11) above the meanlength of the ancestral haplotype for a TMRCA of 100 gen-erations is 2 Mb which is twice as large as the window size weuse to make computation for our simulations feasible Using alarger window around the selected locus would amelioratethis effect

When using a diverged reference panel we find an oppositeeffect In this case the count of derived mutations results inan overestimate and the haplotype lengths yield an under-estimate We suspect this to be driven by poor matchingbetween the reference panel and the background haplotypesamong beneficial allele carriers The low probability of match-ing between the reference and background haplotypes meansthat the lengths of the ancestral haplotype are inferred toextend further than their true lengths This also leads to anoverestimate for the mutation estimator because differencesbetween the ancestral and background haplotypes are incor-rectly inferred as derived mutations on the ancestralhaplotype

Application to 1000 Genomes DataWe applied our method to five variants previously identifiedas targets of recent selection in various human populations(fig 3) Using phased data from the 1000 Genomes Project wefocused on variants that are not completely fixed in any onepopulation so that we could use a local reference panel The Liand Stephens (2003) haplotype copying model is appropriatein cases where ancestry switches occur among chromosomeswithin a single population so we excluded populations in theAmericas for which high levels of admixture are known toexist

While the simulation results described above provide someintuition for the effects of selection strength final allele fre-quency and choice of reference panel we also performedsimulations using the demographic history inferred byTennessen et al (2012) to explore the effects of nonequili-brium demographic history on our estimation accuracy (sup-plementary fig 10 Supplementary Material online) We findsubtle differences in accuracy between the two demographichistories where the nonequilibrium histories lead to negligibledifferences in mean values for log2(estimatetrue value) andlarger credible interval ranges

ADH1BA derived allele at high frequency among East Asians at theADH1B gene (rs3811801) has been shown to be functionallyrelevant for alcohol metabolism (Osier et al 2002 Eng et al2007) Previous age estimates are consistent with the timing

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1009

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

of rice domestication and fermentation approximately10000 years ago (Li et al 2007 Peng et al 2010 Peter et al2012) However a more recent estimate by Peter et al (2012)pushes this time back several thousand years to 12876(2204ndash49764) years ago Our results are consistent with an

older timing of selection as our CHB sample (Han Chinese inBeijing China) TMRCA estimate is 15377 (13763ndash17281)years Replicate chains of the MCMC are generally consistentwith the oldest estimates in the CHB sample showing themost variation among resampled data sets and the youngest

FIG 3 Comparison of TMRCA estimates with previous results Violin plots of posterior distributions for the complete set of estimated TMRCAvalues for the five variants indicated in the legend scaled to a generation time of 29 years Each row indicates a population sample from the 1000Genomes Project panel Replicate MCMCs are plotted with transparency Points and lines overlaying the violins are previous point estimates and95 confidence intervals for each of the variants indicated by a color and rs number in the legend (see supplementary tables 3 and 4Supplementary Material online) The population sample abbreviations are defined in text

Smith et al doi101093molbevmsy006 MBE

1010

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

estimate of 10841 (9720ndash12147) in the KHV sample showingthe least When using a fine-scale recombination map all ofthe ADH1B TMRCAs are inferred to be slightly older (supple-mentary fig 7 Supplementary Material online)

EDARPopulation genomic studies have repeatedly identified thegene EDAR to be under recent selection in East Asians(Akey et al 2004 Williamson et al 2005 Voight et al 2006)with a particular site (rs3827760) showing strong evidence forbeing the putative target Functional assays and allele specificexpression differences at this position show phenotypic asso-ciations to a variety of phenotypes including hair thicknessand dental morphology (Bryk et al 2008 Fujimoto et al 2008Kimura et al 2009 Kamberov et al 2013)

Our estimate of 22192 (19683ndash25736) years for the EDARallele in the CHB sample is older than ABC-based estimates of12458 (1314ndash85835) and 13224 (4899ndash50692) years madeby Bryk et al (2008) and Peter et al (2012) respectivelyKamberov et al (2013) use spatially explicit ABC and maxi-mum likelihood approaches to compute older estimates of30925 (13175ndash39575) and 36490 (34775ndash38204) We in-cluded all populations for which the variant is present includ-ing the FIN and BEB samples where it exists at low frequencyOur results for the youngest TMRCAs are found in these twolow frequency populations where the estimate in FIN is17386 (13887ndash20794) and the estimate in BEB is 18370(14325ndash22872) Among East Asian populations the oldestand youngest TMRCA estimates are found in the KHV sam-ple (25683 23169ndash28380) and CHB sample (22192 19683ndash25736)

LCTArguably the best-studied signature of selection in humans isfor an allele at the LCT gene (rs4988235) which confers lactasepersistence into adulthoodmdasha trait unique among mammalsand which is thought to be a result of cattle domesticationand the incorporation of milk into the adult diets of severalhuman populations (Enattah et al 2002 Bersaglieri et al 2004Tishkoff et al 2007) There are multiple alleles that showassociation with lactase persistence (Tishkoff et al 2007)We focused on estimating the age of the T-13910 allele pri-marily found at high frequency among Northern Europeansbut which is also found in South Asian populations In addi-tion to association with the lactase persistence phenotypethis allele has been functionally verified by in vitro experi-ments (Olds and Sibley 2003 Troelsen et al 2003Kuokkanen et al 2006)

Mathieson et al (2015) use ancient DNA collected from83 human samples to get a better understanding of the fre-quency trajectory for several adaptive alleles spanning a timescale of 8000 years For the LCT persistence allele (rs4988235)they find a sharp increase in frequency in the past 4000 yearsago While this is more recent than previous estimates anearlier TMRCA or time of mutation is still compatible withthis scenario

Our estimates using European and South Asian samplesfall between the range from 5000 to 10000 years ago which isbroadly consistent with age estimates from modern data Thecredible intervals for estimates in all of the samples havesubstantial overlap which makes any ranking on the basisof point estimates difficult We infer the PJL (Punjabi fromLahore Pakistan) sample to have the oldest TMRCA estimateof 9514 (8596ndash10383) years Itan et al (2009) use spatialmodelling to infer the geographic spread of lactase allelefrom northern to southern Europe Consistent with theirresults the youngest estimate among European populationsis found in the IBS sample at 9341 (8688ndash9989) yearsAmong all samples the youngest estimate was found inBEB at 6869 (5143ndash8809)

KITLG and OCA2The genetic basis and natural history of human skin pigmen-tation is a well-studied system with several alleles of majoreffect showing signatures consistent with being targets ofrecent selection (Jablonski and Chaplin 2012 Beleza Santoset al 2013 Wilde et al 2014 Eaton et al 2015) We focused onan allele found at high-frequency world-wide among non-African populations at the KITLG locus (rs642742) whichshows significant effects on skin pigmentation differences be-tween Europeans and Africans (Miller et al 2007) althoughmore recent work fails to find any contribution of KITLGtoward variation in skin pigmentation in a Cape VerdeAfrican-European admixed population (Beleza Johnsonet al 2013) We also estimated the TMRCA for a melanin-associated allele at the OCA2 locus (rs1800414) which is onlyfound among East Asian populations at high frequency(Edwards et al 2010)

For the KITLG variant our estimates among differentpopulations vary from 18000 to 34000 years ago with theoldest age being in the YRI (Yoruba in Ibadan Nigeria)sample (33948 28861ndash39099) The youngest TMRCA isfound in FIN at 18733 years (16675ndash20816) The nexttwo youngest estimates are also found in Africa withthe TMRCA in the MSL (Mende in Sierra Leone) samplebeing 22340 (15723ndash28950) years old and that for LWK(Luhya in Webuye Kenya) being 22784 (17922ndash28012)years old suggesting a more complex history than amodel of a simple allele frequency increase outside ofAfrica due to pigmentation related selection pressuresPrevious point estimates using rejection samplingapproaches on a Portuguese sample (32277 6003ndash80683) and East Asian sample (32045 6032ndash98165)are again most consistent with our own results on theIBS (29731 26170ndash32813) and CHB samples (2677324297ndash30141) (Beleza Santos et al 2013 Chen et al2015) Among East Asians the oldest and youngest esti-mates are again found in the JPT (28637 24297ndash30141)and KHV (24544 21643ndash27193) samples respectivelyThe TMRCA for OCA2 alleles in the JPT (1859916110ndash20786) and KHV (16370 14439ndash18102) samplesare also the oldest and youngest respectively

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1011

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

DiscussionOur method improves estimation for the timing of selectionon a beneficial allele using a tractable model of haplotypeevolution This approach leverages detailed information inthe data while remaining amenable to large sample sizesUsing both carriers and noncarriers of the allele we canmore effectively account for uncertainty in the extent ofthe ancestral haplotype and derived mutations We showthe performance of our method using simulations of differentselection strengths beneficial allele frequencies and choicesof reference panel By applying our method to five variantspreviously identified as targets of selection in human popu-lations we provide a comparison among population-specificTMRCAs This gives a more detailed account of the order inwhich populations may have acquired the variant andorexperienced selection for the trait it underlies

In that regard it is hypothesized that local selection pres-sures and a cultural shift toward agrarian societies have in-duced adaptive responses among human populations aroundthe globe The data associated with some variants seem toindicate more recent selective events than others Our resultsfor variants associated with dietary traits at the LCT andADH1B genes both imply relatively recent TMRCAs(lt 15 000 years ago) consistent with hypotheses that selec-tion on these mutations results from recent changes in hu-man diet following the spread of agriculture (Simoons 1970Peng et al 2010) In contrast the inferred TMRCAs for EDARKITLG and OCA2 imply older adaptive events which mayhave coincided more closely with the habitation of new envi-ronments or other cultural changes

Several hypotheses have been suggested to describe theselective drivers of skin pigmentation differences among hu-man populations including reduced UV radiation at highlatitudes and vitamin D deficiency (Loomis 1967 Jablonskiand Chaplin 2000) Estimated TMRCAs for the variants atthe OCA2 and EDAR loci among East Asians appear to beas young or younger than the KITLG variant but older thanthe LCT and ADH1B locus This suggests a selective history inEast Asian populations leading to adaptive responses forthese traits occurring after an initial colonization In somecases the dispersion of replicate MCMC estimates makes itdifficult to describe the historical significance of an observedorder for TMRCA values However the consistency of esti-mates among different populations for particular variantsadds some confidence to our modelrsquos ability to reproducethe ages which are relevant to those loci or certain geographicregions

To assess the relative concordance of our estimates withthose from previous approaches we compared our resultswith a compilation of previously published estimates basedon the time of mutation time since fixation or TMRCA ofvariants associated with the genes studied here (supplemen-tary fig 8 Supplementary Material online) The range of con-fidence intervals for these studies is largely a reflection of theassumptions invoked or relaxed for any one method as wellas the sample size and quality of the data used In principleour method extracts more information than approaches that

use summary statistics such as ABC In our empirical appli-cation we found that our method provides a gain in accuracywhile accounting for uncertainty in both the ancestral hap-lotype and its length on each chromosome Notably ourmethod provides narrower credible intervals by incorporatingthe full information from ancestral haplotype lengths derivedmutations and a reference panel of noncarrier haplotypes

Another caveat of our method is its dependence on thereference panel which is intended to serve as a representativesample of nonancestral haplotypes in the population duringthe selected allelersquos increase in frequency Four possible chal-lenges can arise 1) Segments of the ancestral selected haplo-type may be present in the reference panel due torecombination (this is more likely for alleles that havereached higher frequency) 2) the reference panel may con-tain haplotypes that are similar to the ancestral haplotypedue to low levels of genetic diversity 3) the reference panelmay be too diverged from the focal population and 4) pop-ulation connectivity and turnover may lead the ldquolocalrdquo refer-ence panel to be largely composed of migrant haplotypeswhich were not present during the selected allelersquos initial in-crease in frequency

Under scenarios 1 and 2 the background haplotypes willbe too similar to the ancestral haplotype and it may be dif-ficult for the model to discern a specific ancestry switch lo-cation This leads to fewer differences (mutations) thanexpected between the ancestral haplotype and each benefi-cial allele carrier The simulation results are consistent withthis scenario Our method tends to underestimate the trueage across a range of selection intensities and allele frequen-cies when using a local reference panel

Conversely under scenarios 3 and 4 the model will fail todescribe a recombinant haplotype in the sample of beneficialallele carriers as a mosaic of haplotypes in the reference panelAs a result the model will infer more mutation events toexplain observed differences from the ancestral haplotypeOur simulation results show this to be the case with referencepanels diverged by N generations Posterior mean estimatesare consistently older than their true value Our simulationsare perhaps pessimistic thoughmdashwe chose reference paneldivergence times of N and 05N generations approximatelycorresponding to FST values of 04 and 02 respectively Forthe smaller FST values observed in humans we expect resultsfor diverged panels to be closer to those obtained with thelocal reference panel Nonetheless future extensions to incor-porate multiple populations within the reference panel wouldbe helpful and possible by modifying the approach of Priceet al (2009) Such an approach would also enable the analysisof admixed populations (we excluded admixed samples fromour analysis of the 1000 Genomes data above)

Aside from the challenges imposed by the choice of refer-ence panel another potential source of bias lies in our tran-sition probabilities which are not conditioned on thefrequency of the selected variant In reality recombinationevents at some distance away from the selected site will onlyresult in a switch from the ancestral to background haplo-types at a rate proportional to 1 pl where pl is the fre-quency of the ancestral haplotype alleles at locus l In this way

Smith et al doi101093molbevmsy006 MBE

1012

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

some recombination events may go unobservedmdashas thebeneficial allele goes to high frequency the probability of arecombination event leading to an observable ancestral tobackground haplotype transition decreases One solutionmay be to include the frequency-dependent transition prob-abilities derived by Chen et al (2015) Under their model themutation time is estimated by assuming a deterministic lo-gistic frequency trajectory starting at 1

2N An additional benefitof using frequency trajectories would be the ability to inferposterior distributions on selection coefficients While theselection coefficient is typically assumed to be related tothe time since mutation by t1 frac14 logethNsTHORN=s we do nothave an equivalent expression for time to the common an-cestor Rather than the initial frequency being 1

2N for a newmutation our initial frequency must correspond to the fre-quency at which the TMRCA occurs Griffiths and Tavare(1994) derive a framework to model a genealogy under arbi-trary population size trajectories which should be analogousto the problem of an allele frequency trajectory and addi-tional theory on intra-allelic genealogies may be useful here aswell (Griffiths and Tavare 1994 Wiuf and Donnelly 1999 Wiuf2000 Slatkin and Rannala 2000)

Our model also assumes independence among all haplo-types in the sample in a composite-likelihood frameworkwhich is equivalent to assuming a star-genealogy (Larribeand Fearnhead 2011 Varin et al 2011) This is unlikely tobe the case when sample sizes are large or the TMRCA isold It is also unlikely to be true if the beneficial allele existedon multiple haplotypes preceding the onset of selection wasintroduced by multiple migrant haplotypes from other pop-ulations or occurred by multiple independent mutationevents (Innan and Kim 2004 Hermisson and Pennings2005 Przeworski et al 2005 Pritchard et al 2010 Berg andCoop 2015) Methods for distinguishing selection from stand-ing variation versus de novo mutation are available thatshould make it easier distinguish these cases (Messer andNeher 2012 Peter et al 2012 Messer and Petrov 2013Garud et al 2015)

If the underlying allelic genealogy is not star-like onecan expect different estimates of the TMRCA for differentsubsets of the data Here we performed multiple MCMCson resampled subsets of the data to informally diagnosewhether there are violations from the star-shape geneal-ogy assumption We speculate that exactly how theTMRCAs vary may provide insight to the underlyinghistory In cases where the TMRCA estimates for a par-ticular population are old and more variable than otherpopulations the results may be explained by structure inthe genealogy whereby recent coalescent events haveoccurred among the same ancestral haplotype beforethe common ancestor When estimates are dispersedamong resampled data sets the presence of multiple an-cestral haplotypes prior to the variantrsquos increase in fre-quency may be a better explanation Further support forthis explanation might come from comparisons to otherpopulation samples which show little to no dispersion ofestimates from resampled data sets Future work mightmake it possible to formalize this inference process

A final caveat regards the misspecification of mutation andrecombination rates TMRCA estimates are largely deter-mined by the use of accurate measures for these two param-eters In a way this provides some robustness to our methodOur age estimates depend on mutation and recombinationrates so accurate specification for one of the values can com-pensate for slight misspecification in the other In cases wherea fine-scale recombination map is unavailable we suggest us-ing a uniform recombination rate specific to the locus ofinterest (see Materials and Methods and supplementaryfig 7 Supplementary Material online) Choosing an appropri-ate mutation rate will continue to depend on current andfuture work which tries to resolve discrepancies in publishedmutation rate estimates inferred by various approaches(Segurel et al 2014)

One future direction for our method may be to explicitlyincorporate the possibility of multiple ancestral haplotypeswithin the sample Under a disease mapping frameworkMorris et al (2002) implement a similar idea in the case whereindependent disease causing mutations arise at the same lo-cus leading to independent genealogies for which they cointhe term ldquoshattered coalescentrdquo For our case beneficialmutations may also be independently derived on differenthaplotypes Alternatively a single mutation may be oldenough to reside on different haplotypes due to a sufficientamount of linked variation existing prior to the onset of se-lection Berg and Coop (2015) model selection from standingvariation to derive the distribution of haplotypes that theselected allele is present on

While we have treated the TMRCA as a parameter of in-terest our method also produces a sample of the posteriordistribution on the ancestral haplotype This could provideuseful information to estimate the frequency spectrum ofderived mutations on the ancestral haplotype Similarly thefrequency of shared recombination breakpoints could shedlight on the genealogy and how well it conforms to the star-shape assumption The extent of the ancestral haplotype ineach individual may also prove useful for identifying delete-rious alleles that have increased in frequency as a result ofstrong positive selection on linked beneficial alleles (Chun andFay 2011 Hartfield and Otto 2011) For example Huff et al(2012) describe a risk allele for Crohnrsquos disease at high fre-quency in European populations which they suggest is linkedto a beneficial allele under recent selection Similar to anadmixture mapping approach our method could be usedto identify risk loci by testing for an association betweenthe ancestral haplotype and disease status As another appli-cation identifying the ancestral haplotype may be useful inthe context of identifying a source population (or species) fora beneficial allele prior to its introduction and subsequentincrease in frequency in the recipient population

In many cases the specific site under selection may beunknown or restricted to some set of putative sites Whileour method requires the position of the selected site to bespecified future extensions could treat the selected site as arandom variable to be estimated under the same MCMCframework This framework would also be amenable to mar-ginalizing over uncertainty on the selected site

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1013

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

While we focus here on inference from modern DNAdata the increased accessibility of ancient DNA has addeda new dimension to population genetic data sets (Lazaridiset al 2014 Skoglund et al 2014 Allentoft et al 2015 Haaket al 2015 Mathieson et al 2015) Because it will remaindifficult to use ancient DNA approaches in many specieswith poor sample preservation we believe methods basedon modern DNA will continue to be useful going forwardThat said ancient DNA is providing an interesting avenue forcomparative work between inference from modern and an-cient samples For example Nakagome et al (2016) usesimulations to assess the fit of this ancient DNA polymor-phism to data simulated under their inferred parametervalues for allele age and selection intensity and they findreasonable agreement Much work still remains to fullyleverage ancient samples into population genetic inferencewhile accounting for new sources of uncertainty and poten-tial for sampling bias

Despite these challenges it is clear that our understandingof adaptive history will continue to benefit from new com-putational tools which extract insightful information from adiverse set of data sources

Materials and MethodsWe generated data using the software mssel (Hudson R per-sonal communication) which simulates a sample of haplo-types conditioned on the frequency trajectory of a selectedvariant under the structured coalescent (Hudson and Kaplan1988 Kaplan et al 1988) Trajectories were first simulatedforwards in time under a WrightndashFisher model for an additivelocus with varying strengths of selection and different endingfrequencies of the selected variant Trajectories were thentruncated to end at the first time the allele reaches a specifiedfrequency See supplementary table 1 SupplementaryMaterial online for relative ages of simulated TMRCA val-ues for different end frequencies and selection strengthsFor the results in figure 2 100 simulations were performedfor each parameter combination MCMCs were run for5000 iterations with a burn-in excluding the first 3000iterations A standard deviation of 10 was used for theproposal distribution of tca The red boxplots indicate localreference panels The blue and green boxplots indicate ref-erence panels diverged by 5N generations and 1N gener-ations respectively Each data set was simulated for a 1-Mblocus with a mutation rate of 1 108 recombination rateof 1 108

and population size of 10000 Sample sizesfor the selected and reference panels were 100 and 20respectively

For more efficient run times of the MCMC we set a max-imum number of individuals to include in the selected andreference panels to be 100 and 20 respectively In cases wherethe true number of haplotypes for either panel was greaterthan this in the full data set we resampled a subset ofhaplotypes from each population for a total of five replicatesper population For simulation results supporting the use ofthis resampling strategy see supplementary figure 5Supplementary Material online The MCMCs were run for

15000 iterations with a standard deviation of 20 for theTMRCA proposal distribution The first 9000 iterationswere removed as a burn-in leading to 6000 iterations for asample of the posterior Convergence was assessed by com-parison of MCMC replicates Figure 3 and supplementaryfigure 8 Supplementary Material online show the resultsfor all five variants along with previous point estimates and95 confidence intervals assuming a generation time of29 years (Fenner 2005) Supplementary table 3Supplementary Material online lists the mean and 95 cred-ible intervals for estimates with the highest mean posteriorprobability which we refer to in the text belowSupplementary table 4 Supplementary Material online liststhe previous estimates and confidence intervals with addi-tional details of the different approaches taken

To model recombination rate variation we used recombi-nation rates from the Decode sex-averaged recombinationmap inferred from pedigrees among families in Iceland(Kong et al 2010) Because some populations may have re-combination maps which differ from the Decode map at finescales we used a mean uniform recombination rate inferredfrom the 1-Mb region surrounding each variant The motiva-tion for this arises from how recombination rates have beenpreviously shown to remain relatively consistent among re-combination maps inferred for different populations at themegabase-scale (Broman et al 1998 Baudat et al 2010 Konget al 2010 Auton and McVean 2012) Further we found ourestimates depend mostly on having the megabase-scale re-combination rate appropriately set with little difference inmost cases for estimates obtained by modeling fine-scale re-combination at each locus (supplementary fig 7Supplementary Material online) We specify the switchingrate among background haplotypes after recombining off ofthe ancestral haplotype to be 4Nr where Nfrac14 10000 and r isthe mean recombination rate for the 1-Mb locus

For modeling mutation a challenge is that previous mu-tation rate estimates vary depending on the approach used(Segurel et al 2014) Estimates using the neutral substitutionrate between humans and chimps are more than 2 108

per bp per generation while estimates using whole-genomesequencing are closer to 1 108 As a compromise wespecify a mutation rate of 16 108

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe would like to thank Hussein Al-Asadi Arjun BiddandaAnna Di Rienzo Dick Hudson Choongwon Jeong Evan KochJoseph Marcus Shigeki Nakagome Ben Peter Mark ReppelAlex White members of the Coop laboratory at UC Davismembers of the Przeworski laboratory at ColumbiaUniversity and members of the He and Stephens laboratoriesat the University of Chicago for helpful comments JS wassupported by an NSF Graduate Research Fellowship andNational Institute of General Medical Sciences of the

Smith et al doi101093molbevmsy006 MBE

1014

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

National Institutes of Health under award numbers DGE-1144082 and T32GM007197 respectively This work wasalso supported by the National Institute of General MedicalSciences of the National Institutes of Health under awardnumbers RO1GM83098 and RO1GM107374 to GC as wellas R01HG007089 to JN

ReferencesAkey JM Eberle MA Rieder MJ Carlson CS Shriver MD Nickerson DA

Kruglyak L 2004 Population history and natural selection shapepatterns of genetic variation in 132 genes PLoS Biol 21591ndash1599

Allentoft ME Sikora M Sjogren K-G Rasmussen S Rasmussen MStenderup J Damgaard PB Schroeder H Ahlstrom T Vinner L2015 Population genomics of Bronze Age Eurasia Nature522(7555) 167ndash172

Auton A McVean G 2012 Estimating recombination rates from geneticvariation in humans In Anisimova M ed Evolutionary genomicsMethods in Molecular Biology (Methods and Protocols) vol 856New York Humana Press

Barrett RDH Hoekstra HE 2011 Molecular spandrels tests of adaptationat the genetic level Nat Rev Genet 12(11) 767ndash780

Baudat F Buard J Grey C Fledel-Alon A Ober C Przeworski M Coop GMassy BD 2010 PRDM9 is a major determinant of meiotic recom-bination hotspots in humans and mice Science 327(5967) 836ndash840

Beaumont MA Zhang W Balding DJ 2002 Approximate Bayesian com-putation in population genetics Genetics 162(4) 2025ndash2035

Beleza S Johnson NA Candille SI Absher DM Coram MA Lopes JCampos J Araujo II Anderson TM Vilhjalmsson BJ et al 2013Genetic architecture of skin and eye color in an African-Europeanadmixed population PLoS Genet 9(3) e1003372

Beleza S Santos AM McEvoy B Alves I Martinho C Cameron E ShriverMD Parra EJ Rocha J 2013 The timing of pigmentation lightening inEuropeans Mol Biol Evol 30(1) 24ndash35

Berg JJ Coop G 2015 A coalescent model for a sweep of a uniquestanding variant Genetics 201(2) 707ndash725

Bersaglieri T Sabeti PC Patterson N Vanderploeg T Schaffner SF DrakeJA Rhodes M Reich DE Hirschhorn JN 2004 Genetic signatures ofstrong recent positive selection at the lactase gene Am J Hum Genet74(6) 1111ndash1120

Broman KW Murray JC Sheffield VC White RL Weber JL 1998Comprehensive human genetic maps individual and sex-specificvariation in recombination Am J Hum Genet 63(3) 861ndash869

Bryk J Hardouin E Pugach I Hughes D Strotmann R Stoneking M MylesS 2008 Positive selection in East Asians for an EDAR allele thatenhances NF-jB activation PLoS One 3(5) e2209

Chen H Slatkin M 2013 Inferring selection intensity and allele age frommultilocus haplotype structure G3 3(8) 1429ndash1442

Chen H Hey J Slatkin M 2015 A hidden Markov model for investigatingrecent positive selection through haplotype structure Theor PopulBiol 9918ndash30

Chun S Fay JC 2011 Evidence for hitchhiking of deleterious mutationswithin the human genome PLoS Genet 7(8) e1002240

Coop G Griffiths RC 2004 Ancestral inference on gene trees underselection Theor Popul Biol 66(3) 219ndash232

Coop G Bullaughey K Luca F Przeworski M 2008 The timing of selec-tion at the human FOXP2 gene Mol Biol Evol 25(7) 1257ndash1259

Dalziel AC Rogers SM Schulte PM 2009 Linking genotypes to pheno-types and fitness how mechanistic biology can inform molecularecology Mol Ecol 18(24) 4997ndash5017

Eaton K Edwards M Krithika S Cook G Norton H Parra EJ 2015Association study confirms the role of two OCA2 polymorphismsin normal skin pigmentation variation in East Asian populations AmJ Hum Biol 27 520ndash525

Edwards M Bigham A Tan J Li S Gozdzik A Ross K Jin L Parra EJ 2010Association of the OCA2 polymorphism His615Arg with melanin

content in east Asian populations further evidence of convergentevolution of skin pigmentation PLoS Genet 6(3) e1000867

Elmer KR Meyer A 2011 Adaptation in the age of ecological genomicsinsights from parallelism and convergence Trends Ecol Evol 26(6)298ndash306

Enattah NS Sahi T Savilahti E Terwilliger JD Peltonen L Jeuroarveleuroa I 2002Identification of a variant associated with adult-type hypolactasiaNat Genet 30(2) 233ndash237

Eng MY Luczak SE Wall TL 2007 ALDH2 ADH1B and ADH1C geno-types in Asians a literature review Alcohol Res Health 30(1) 22

Fenner JN 2005 Cross-cultural estimation of the human generationinterval for use in genetics-based population divergence studiesAm J Phys Anthropol 128(2) 415ndash423

Fujimoto A Ohashi J Nishida N Miyagawa T Morishita Y Tsunoda TKimura R Tokunaga K 2008 A replication study confirmed theEDAR gene to be a major contributor to population differentiationregarding head hair thickness in Asia Hum Genet 124(2) 179ndash185

Garud NR Messer PW Buzbas EO Petrov DA 2015 Recent selectivesweeps in North American Drosophila melanogaster show signaturesof soft sweeps PLoS Genet 11(2) e1005004

Griffiths RC Tavare S 1994 Sampling theory for neutral alleles in avarying environment Philos Trans R Soc B Biol Sci 344(1310)403ndash410

Haak W Lazaridis I Patterson N Rohland N Mallick S Llamas B BrandtG Nordenfelt S Harney E Stewardson K 2015 Massive migrationfrom the steppe was a source for Indo-European languages inEurope Nature 522 207ndash211

Hartfield M Otto SP 2011 Recombination and hitchhiking of deleteri-ous alleles Evolution 65(9) 2421ndash2434

Hermisson J Pennings PS 2005 Soft sweeps molecular population ge-netics of adaptation from standing genetic variation Genetics 169(4)2335ndash2352

Hinch AG Tandon A Patterson N Song Y Rohland N Palmer CD ChenGK Wang K Buxbaum SG Akylbekova EL et al 2011 The landscapeof recombination in African Americans Nature 476(7359) 170ndash175

Hudson RR 1990 Gene genealogies and the coalescent process Oxf SurvEvol Biol 7(1) 44

Hudson RR 2007 The variance of coalescent time estimates from DNAsequences J Mol Evol 64(6) 702ndash705

Hudson RR Kaplan NL 1988 The coalescent process in models withselection and recombination Genetics 120(3) 831ndash840

Huff CD Witherspoon DJ Zhang Y Gatenbee C Denson LA KugathasanS Hakonarson H Whiting A Davis CT Wu W et al 2012 Crohnrsquosdisease and genetic hitchhiking at IBD5 Mol Biol Evol 29(1)101ndash111

Innan H Kim Y 2004 Pattern of polymorphism after strong artificialselection in a domestication event Proc Natl Acad Sci U S A 101(29)10667ndash10672

Itan Y Powell A Beaumont MA Burger J Thomas MG 2009 The originsof lactase persistence in Europe PLoS Comput Biol 5(8) e1000491

Jablonski NG Chaplin G 2000 The evolution of human skin coloration JHum Evol 39(1) 57ndash106

Jablonski NG Chaplin G 2012 Human skin pigmentation migration anddisease susceptibility Philos Trans R Soc B Biol Sci 367(1590)785ndash792

Jeong C Di Rienzo A 2014 Adaptations to local environments in mod-ern human populations Curr Opin Genet Dev 291ndash8

Kamberov YG Wang S Tan J Gerbault P Wark A Tan L Yang Y Li STang K Chen H et al 2013 Modeling recent human evolution inmice by expression of a selected EDAR variant Cell 152(4) 691ndash702

Kaplan NL Darden T Hudson RR 1988 The coalescent process inmodels with selection Genetics 120(3) 819ndash829

Kaplan NL Hudson RR Langley CH 1989 The ldquohitchhiking effectrdquo revis-ited Genetics 123(4) 887ndash899

Kimura R Yamaguchi T Takeda M Kondo O Toma T Haneji KHanihara T Matsukusa H Kawamura S Maki K et al 2009 A com-mon variation in EDAR is a genetic determinant of shovel-shapedincisors Am J Hum Genet 85(4) 528ndash535

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1015

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Kong A Thorleifsson G Gudbjartsson DF Masson G Sigurdsson AJonasdottir A Walters GB Jonasdottir A Gylfason A KristinssonKT et al 2010 Fine-scale recombination rate differences betweensexes populations and individuals Nature 467(7319) 1099ndash1103

Kuokkanen M Kokkonen J Enattah NS Ylisaukko-oja T Komu H VariloT Peltonen L Savilahti E Jeuroarveleuroa I 2006 Mutations in the translatedregion of the lactase gene (LCT) underlie congenital lactase defi-ciency Am J Hum Genet 78(2) 339ndash344

Larribe F Fearnhead P 2011 On composite likelihoods in statisticalgenetics Stat Sin 21(1) 43ndash69

Lazaridis I Patterson N Mittnik A Renaud G Mallick S Kirsanow KSudmant PH Schraiber JG Castellano S Lipson M et al 2014Ancient human genomes suggest three ancestral populations forpresent-day Europeans Nature 513(7518) 409ndash413

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357) 493ndash496

Li H Mukherjee N Soundararajan U Tarnok Z Barta C Khaliq SMohyuddin A Kajuna SLB Mehdi SQ Kidd JR et al 2007Geographically separate increases in the frequency of the derivedADH1B 47His allele in eastern and western Asia Am J Hum Genet81(4) 842ndash846

Li N Stephens M 2003 Modeling linkage disequilibrium and identifyingrecombination hotspots using single-nucleotide polymorphismdata Genetics 165(4) 2213ndash2233

Loomis WF 1967 Skin-pigment regulation of vitamin-D biosynthesis inman Science 157(3788) 501ndash506

Mathieson I Lazaridis I Rohland N Mallick S Patterson N RoodenbergSA Harney E Stewardson K Fernandes D Novak M et al 2015Genome-wide patterns of selection in 230 ancient Eurasians Nature528(7583) 499ndash503

McPeek MS Strahs A 1999 Assessment of linkage disequilibrium by thedecay of haplotype sharing with application to fine-scale geneticmapping Am J Hum Genet 65(3) 858ndash875

Meligkotsidou L Fearnhead P 2005 Maximum-likelihood estimationof coalescence times in genealogical trees Genetics 171(4)2073ndash2084

Messer PW Neher RA 2012 Estimating the strength of selective sweepsfrom deep population diversity data Genetics 191(2) 593ndash605

Messer PW Petrov DA 2013 Population genomics of rapid adaptationby soft selective sweeps Trends Ecol Evol 28(11) 659ndash669

Miller CT Beleza S Pollen AA Schluter D Kittles RA Shriver MDKingsley DM 2007 cis-Regulatory changes in kit ligand expressionand parallel evolution of pigmentation in sticklebacks and humansCell 131(6) 1179ndash1189

Morris AP Whittaker JC Balding DJ 2000 Bayesian fine-scale mapping ofdisease loci by hidden Markov models Am J Hum Genet 67(1)155ndash169

Morris AP Whittaker JC Balding DJ 2002 Fine-scale mapping of diseaseloci via shattered coalescent modeling of genealogies Am J HumGenet 70(3) 686ndash707

Nakagome S Alkorta-Aranburu G Amato R Howie B Peter BMHudson RR Di Rienzo A 2016 Estimating the ages of selectionsignals from different epochs in human history Mol Biol Evol33(3) 657ndash669

Nielsen R 2005 Molecular signatures of natural selection Annu RevGenet 39 197ndash218

Olds LC Sibley E 2003 Lactase persistence DNA variant enhances lactasepromoter activity in vitro functional role as a cis regulatory elementHum Mol Genet 12(18) 2333ndash2340

Ormond L Foll M Ewing GB Pfeifer SP Jensen JD 2016 Inferring the ageof a fixed beneficial allele Mol Ecol 25(1) 157ndash169

Osier MV Pakstis AJ Soodyall H Comas D Goldman D Odunsi AOkonofua F Parnas J Schulz LO Bertranpetit J et al 2002 A globalperspective on genetic variation at the ADH genes reveals unusualpatterns of linkage disequilibrium and diversity Am J Hum Genet71(1) 84ndash99

Peng Y Shi H Qi X-B Xiao C-J Zhong H Ma Run-lin Z Su B 2010The ADH1B Arg47His polymorphism in East Asian populations

and expansion of rice domestication in history BMC Evol Biol10(1) 15

Peter BM Huerta-Sanchez E Nielsen R 2012 Distinguishing betweenselective sweeps from standing variation and from a de novo mu-tation PLoS Genet 8(10) e1003011

Przeworski M Coop G Wall JD 2005 The signature of positive selectionon standing genetic variation Evolution 59(11) 2312ndash2323

Price AL Tandon A Patterson N Barnes KC Rafaels N Ruczinski I BeatyTH Mathias R Reich D Myers S 2009 Sensitive detection of chro-mosomal segments of distinct ancestry in admixed populationsPLoS Genet 5(6) e1000519

Pritchard JK Seielstad MT Perez-Lezaun A Feldman MW 1999Population growth of human Y chromosomes a study of Y chro-mosome microsatellites Mol Biol Evol 16(12) 1791ndash1798

Pritchard JK Pickrell JK Coop G 2010 The genetics of human adapta-tion hard sweeps soft sweeps and polygenic adaptation Curr Biol20(4) R208ndashR215

Przeworski M 2003 Estimating the time since the fixation of a beneficialallele Genetics 164(4) 1667ndash1676

Rabiner LR 1989 A tutorial on hidden Markov models and selectedapplications in speech recognition Proc IEEE 77(2) 257ndash286

Radwan J Babik W 2012 The genomics of adaptation Proc R Soc Lond BBiol Sci 279(1749) 5024ndash5028

Rannala B Reeve JP 2001 High-resolution multipoint linkage-disequilibrium mapping in the context of a human genome se-quence Am J Hum Genet 69(1) 159ndash178

Rannala B Reeve JP 2002 Joint Bayesian estimation of mutationlocation and age using linkage disequilibrium Biocomputing2003 526ndash534

Segurel L Wyman MJ Przeworski M 2014 Determinants of mutationrate variation in the human germline Annu Rev Genomics HumGenet 1547ndash70

Simoons FJ 1970 Primary adult lactose intolerance and the milkinghabit a problem in biologic and cultural interrelations Am J DigDis 15(8) 695ndash710

Skoglund P Malmstrom H Omrak A Raghavan M Valdiosera CGunther T Hall P Tambets K Parik J Sjogren K-G et al 2014Genomic diversity and admixture differs for Stone-AgeScandinavian foragers and farmers Science 344(6185) 747ndash750

Slatkin M 2001 Simulating genealogies of selected alleles in a populationof variable size Genet Res 78(01) 49ndash57

Slatkin M 2008 A Bayesian method for jointly estimating allele age andselection intensity Genet Res 90(01) 129ndash137

Slatkin M Hudson RR 1991 Pairwise comparisons of mitochondrialDNA sequences in stable and exponentially growing populationsGenetics 129(2) 555ndash562

Slatkin M Rannala B 2000 Estimating allele age Annu Rev GenomicsHum Genet 1(1) 225ndash249

Smith JM Haigh J 1974 The hitch-hiking effect of a favourable geneGenet Res 23(01) 23ndash35

Stinchcombe JR Hoekstra HE 2008 Combining population genomicsand quantitative genetics finding the genes underlying ecologicallyimportant traits Heredity 100(2) 158ndash170

Tang H Siegmund DO Shen P Oefner PJ Feldman MW 2002Frequentist estimation of coalescence times from nucleotide se-quence data using a tree-based partition Genetics 161(1) 447ndash459

Tavare S Balding DJ Griffiths RC Donnelly P 1997 Inferring coalescencetimes from DNA sequence data Genetics 145(2) 505ndash518

Tennessen JA Bigham AW OrsquoConnor TD Fu W Kenny EE Gravel SMcGee S Do R Liu X Jun G et al 2012 Evolution and functionalimpact of rare coding variation from deep sequencing of humanexomes Science 337(6090) 64ndash69

Thomson R Pritchard JK Shen P Oefner PJ Feldman MW 2000Recent common ancestry of human Y chromosomes evidencefrom DNA sequence data Proc Natl Acad Sci U S A 97(13)7360ndash7365

Tishkoff SA Reed FA Ranciaro A Voight BF Babbitt CC Silverman JSPowell K Mortensen HM Hirbo JB Osman M et al 2007

Smith et al doi101093molbevmsy006 MBE

1016

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Convergent adaptation of human lactase persistence in Africa andEurope Nat Genet 39(1) 31ndash40

Troelsen JT Olsen J Moslashller J SjeuroOstreuroOm H 2003 An upstream polymor-phism associated with lactase persistence has increased enhanceractivity Gastroenterology 125(6) 1686ndash1694

Varin C Reid N Firth D 2011 An overview of composite likelihoodmethods Stat Sin 21(1) 5ndash42

Voight BF Kudaravalli S Wen X Pritchard JK Hurst L 2006 A map ofrecent positive selection in the human genome PLoS Biol 4(3) e72

Wegmann D Kessner DE Veeramah KR Mathias RA Nicolae DL YanekLR Sun YV Torgerson DG Rafaels N Mosley T et al 2011Recombination rates in admixed individuals identified by ancestry-based inference Nat Genet 43(9) 847ndash853

Wilde S Timpson A Kirsanow K Kaiser E Kayser M Unterleuroander MHollfelder N Potekhina ID Schier W Thomas MG et al 2014 Directevidence for positive selection of skin hair and eye pigmentation inEuropeans during the last 5000 years Proc Natl Acad Sci U S A111(13) 4832ndash4837

Williamson SH Hernandez R Fledel-Alon A Zhu L Nielsen RBustamante CD 2005 Simultaneous inference of selection and pop-ulation growth from patterns of variation in the human genomeProc Natl Acad Sci U S A 102(22) 7882ndash7887

Wiuf C 2000 On the genealogy of a sample of neutral rare alleles TheorPopul Biol 58(1) 61ndash75

Wiuf C Donnelly P 1999 Conditional genealogies and the age of aneutral mutant Theor Popul Biol 56(2) 183ndash201

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1017

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Assuming haplotype lengths are independent and identi-cally distributed draws from an exponential distribution thetransition probabilities for a recombination event off of theancestral haplotype are

PrethWi frac14 wjtcaTHORN

frac14ertcadweth1 ertcaethdwthorn1dwTHORNTHORN if w frac14 f1 ethL 1THORNg

ertcadL if w frac14 L

(

(3)

where dw is the distance in base pairs of site w from theselected site and r is the local recombination rate per basepair per generation The data for each individual Xi can bedivided into two parts One indicating the portion of anindividualrsquos sequence residing on the ancestral haplotype (be-fore recombining between sites w and wthorn 1) XiethjwTHORN andthat portion residing off of the ancestral haplotype after arecombination event XiethjgtwTHORN We denote a separate likelihoodfor each portion

aiw frac14 PrethXiethjwTHORNjtcaAWi frac14 wTHORN (4)

biw frac14 PrethXiethjgtwTHORNjHethjgtwTHORNWi frac14 wTHORN (5)

Because the focal allele is on the selected haplotypeai1 frac14 1 Conversely we assume a recombination eventoccurs at some point beyond locus L such thatbiL frac14 1 We assume the waiting time to mutation ateach site on the ancestral haplotype is exponentiallydistributed with no reverse mutations and express thelikelihood as

aiw frac14 PrethXiethjwTHORNjtcaAWi frac14 wTHORN

frac14 etcalethdwwTHORNYw

jfrac142

PrethXij frac14 ajtcaATHORN (6)

PrethXij frac14 ajtcaATHORN frac14etcal if a frac14 Aj

1 etcal if a frac14 Aj

((7)

The term etcalethdwwTHORN on the right side of equation (6)captures the lack of mutation at invariant sites betweeneach segregating site Assuming tcal is small equation (6) isequivalent to assuming a Poisson number of mutations(with mean tcal) occurring on the ancestral haplotype

For biw the probability of observing a particular sequenceafter recombining off of the ancestral haplotype is dependenton standing variation in background haplotype diversity TheLi and Stephens (2003) haplotype copying model allows forfast computation of an approximation to the probability ofobserving a sample of chromosomes related by a genealogywith recombination Given a sample of m haplotypesH 2 fh1 hmg a population scaled recombination rateq and mutation rate h an observed sequence of alleles ismodeled as an imperfect copy of any one haplotype in thereference panel at each SNP Let Zij denote the reference panelhaplotype which Xi copies at the jth SNP and cj denote thenumber of base pairs between SNPs j and jthorn 1 Zij follows aMarkov process with transition probabilities

PrethZiethjthorn1THORN frac14 z0jZij frac14 zTHORN

frac14eqjcj=m thorn eth1 eqjcj=mTHORNeth1=mTHORN if z0 frac14 z

eth1 eqjcj=mTHORNeth1=mTHORN if z0 frac14 z

(

(8)

To include mutation the probability that the sampledhaplotype matches a haplotype in the reference panel ism=ethmthorn hTHORN and the probability of a mismatch (or muta-tion event) is h=ethmthorn hTHORN Letting a refer to an allele wherea 2 f1 0g the matching and mismatching probabilitiesare

PrethXij frac14 ajZij frac14 z h1 hmTHORN

frac14m=ethmthorn hTHORN thorn eth1=2THORNethh=ethmthorn hTHORNTHORN if hzj frac14 a

eth1=2THORNethh=ethmthorn hTHORNTHORN if hzj frac14 a

(

(9)

Equation (5) requires a sum over the probabilities of allpossible values of Zj using equations (8) and (9) This is com-puted using the forward algorithm as described in Rabiner(1989) and Appendix A of Li and Stephens (2003) It shouldbe noted that this formulation does not model the observa-tion of an invariant site among the background haplotypesWe tried an approach to model these sites but saw no im-provement in model performance (see supplementary ap-pendix S2 Supplementary Material online)

The complete likelihood for our problem can then beexpressed as

Table 1 Notation Used to Describe the Model

n Number of haplotypes with the selected allelem Number of haplotypes without the selected alleleL Number of SNPs flanking the selected site (one side considered

at a time)X n L matrix of haplotypes with the selected alleleH m L matrix of haplotypes without the selected alleleXij Allele in haplotype i at SNP j where i 2 f1 ng and

j 2 f1 LgHzj Allele in haplotype z at SNP j where z 2 f1 mg and

j 2 f1 LgAj Allele at site j on the ancestral haplotypeZij The reference panel haplotype from which Xi copies at site jtca Time to the most recent common ancestor (TMRCA)Wi The location of the first recombination event off of the ancestral

haplotyper Recombination rate per basepair per generationl Mutation rate per basepair per generationh Haplotype miscopying rate or population-scaled mutation rate

eth4NlTHORNq Haplotype switching rate or population-scaled recombination

rate eth4NrTHORNdw Physical distance of site w from the selected site where

w 2 f1 Lgcj Number of basepairs between sites j and j thorn 1aiw Likelihood of haplotype i for sites 1 wbiw Likelihood of haplotype i for SNPs ethwthorn 1THORN L

Smith et al doi101093molbevmsy006 MBE

1006

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

PrethXjtcaAHTHORN frac14Yn

ifrac141

XL

wfrac141

aiwbiwPrethWi frac14 wjtcaATHORN (10)

This computation is on the order 2Lnm2 and in practicefor mfrac14 20 nfrac14 100 and Lfrac14 4000 takes approximately3027 s to compute on an Intel Core i7-4750HQ CPU at200 GHz 8 with 156 GB RAM

InferencePerforming inference on tca requires addressing the latentvariables w and A in the model Marginalizing over possiblevalues of w is a natural summation per haplotype that is linearin L as shown above For A the number of possible values islarge (2L) and so we employ a MetropolisndashHastings algorithmto jointly sample the posterior of A and tca and then we takemarginal samples of tca for inference We assign a uniformprior density for both A and tca such that PrethATHORN frac14 1=2L andPrethtcaTHORN frac14 1=ethtmax tminTHORN where tmax and tmin are user-specified maximum and minimum values for tca ProposedMCMC updates of the ancestral haplotype A0 are generatedby randomly selecting a site in A and flipping to the alterna-tive allele For tca proposed values are generated by adding anormally distributed random variable centered at 0tca0 frac14 tca thorn Neth0 r2THORN To start the MetropolisndashHastings algo-

rithm an initial value of tca is uniformly drawn from a user-specified range of values (10ndash2000 in the applications here)To initialize the ancestral haplotype to a reasonable value weuse a heuristic algorithm which exploits the characteristicdecrease in variation near a selected site (see supplementaryappendix S2 Supplementary Material online)

For each haplotype in the sample of beneficial allele car-riers the Li and Stephens (2003) model uses a haplotypemiscopying rate h and switching rate q to compute a like-lihood term for loci following the recombination event off ofthe ancestral haplotype For our analyses we set q frac14 44104 using our simulated values of r frac14 11 108 per bpper generation and Nfrac14 10000 where q frac14 4Nr Following Li

and Stephens (2003) we fix h frac14 ethPn

mfrac1411=mTHORN1 as derived

from the expected number of mutation events on a geneal-ogy relating n chromosomes at a particular site We found nodiscernible effects on estimate accuracy when specifying dif-ferent values of q (supplementary fig 1 SupplementaryMaterial online)

ResultsBecause our model requires a sample (or ldquopanelrdquo) of referencehaplotypes without the selected allele we tested our methodfor cases in which the reference panel is chosen from the localpopulation in which the selected allele is found as well ascases where the panel is from a diverged population wherethe selected haplotype is absent (fig 1c) Regardless of sce-nario the estimates are on average within a factor of 2 of thetrue value and often much closer When using a local refer-ence panel point estimates of tca increasingly underestimatethe true value (TMRCA) as selection becomes weaker and thefinal allele frequency increases (fig 2) Put differently the age

of older TMRCAs tends to be underestimated with local ref-erence panels Using the mean posteriors as point estimatesmean values of log2(estimatetrue value) range from062 to014 Simulations using a diverged population for the refer-ence panel removed the bias though only in cases where thedivergence time was not large For a reference panel divergedby 05N generations mean log2(estimatetrue value) valuesrange from 021 to 018 As the reference panel becomestoo far diverged from the selected population estimates be-come older than the true value (036 to 094 log2(estimatetrue value)) In these cases the HMM is unlikely to infer aclose match between background haplotypes in the sampleand the reference panel leading to many more mismatchesbeing inferred as mutation events on the ancestral haplotypeand an older estimate of tca

The bottom panel of figure 2 shows the effect of selectionstrength and final allele frequency on the size of the 95credible interval around point estimates normalized by thetrue TMRCA for each simulated data set Before normalizingcredible interval sizes using a local reference panel range from73 to 213 generations for 2Ns frac14 100 versus 18 to 22 gener-ations when 2Ns frac14 2000 Using local and diverged referencepanels we found a minimal effect of the sample size on pointestimates (supplementary figs 2 and 3 SupplementaryMaterial online) As expected larger sample sizes for the car-rier panel improve estimate accuracy However higher allelefrequencies and weak selection are likely to induce more un-certainty due to the ancestral haplotype tracts recombiningwithin the sample We find this effect more pronounced withlarge sample sizes for the reference panel We speculate that alarge sample of reference haplotypes leads the focal selectedhaplotype to have increased probability of copying from thereference panel leading to a shorter selected haplotype andslight overestimate of the TMRCA

We also performed simulations under varying degrees ofmutation and recombination rate misspecification (supple-mentary fig 4 Supplementary Material online) In most casesmean values of log2(estimatetrue value) stay within an orderof magnitude of 0 As expected when both the mutation andrecombination rate are misspecified we find the most dis-crepancy To assess the convergence properties of theMCMC five replicate chains were run for each of 20 sim-ulated data sets produced under three 2Ns values (100200 and 2000) for frequency trajectories ending at 01(supplementary fig 5 Supplementary Material online)While care is always warranted with MCMC approacheswe find in practice that convergence among our replicatechains is attained relatively quickly (3 000 iterations forsimulated data and 3000ndash9000 iterations for appliedcases see supplementary fig 9 Supplementary Materialonline)

We compared the performance of our estimator withthree other model-based approaches for allele age estimationby matching the simulation scheme performed by Chen et al(2015) (supplementary table 5 Supplementary Material on-line) Our method shows improvement in accuracy (RMSE)andor lower bias for simulations with lower frequencies ofthe beneficial allele (40) regardless of the reference panel

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1007

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

used In cases where the final beneficial allele frequency ishigher (80) our methodrsquos accuracy remains as good or bet-ter than the other methods when using a diverged referencepanel with a two-orders-of-magnitude improvement of biasunder strong selection (sfrac14 005) Estimates when using a localreference panel and a high final beneficial allele frequencyremain comparable to the other methods for strong selectionbut tend to have more bias and decreased accuracy as selec-tion strength decreases

Assuming a star-genealogy among beneficial allele carriersmay result in underestimating the variance for the posteriordistribution when there is nonindependence in our sample

To measure this affect we computed TMRCA estimates on100 bootstrap replicates for 4 simulated data sets under 2selection strengths and 2 final allele frequencies (supplemen-tary table 7 Supplementary Material online) We find closeagreement between the 95 posterior credible intervals ofthe original data and the 95 confidence intervals computedon the bootstrap estimates for a selection strength of 01 forboth final allele frequencies of 04 and 08 As expected olderTMRCAs are likely to violate the star-genealogy assumptionand in these cases we find that estimates from our originaldata are more narrow than the bootstrap confidenceintervals

Frequency 01 Frequency 03 Frequency 06 Frequency 09

minus2

minus1

0

1

2

3

0005 001 01 0005 001 01 0005 001 01 0005 001 01

log2

(Est

imat

eTr

ue V

alue

)Local 02N 05N 10N

Frequency 01 Frequency 03 Frequency 06 Frequency 09

02

04

06

08

0005 001 01 0005 001 01 0005 001 01 0005 001 01Selection Strength

Nor

mal

ized

95

Cre

dibl

e In

terv

al R

ange

FIG 2 Accuracy results from simulated data Accuracy of TMRCA point estimates and 95 credible interval ranges from posteriors inferred fromsimulated data under different strengths of selection final allele frequencies and choice of reference panel Credible interval range sizes are in unitsof generations and are normalized by the true TMRCA for each simulated data set See Materials and Methods below for simulation details

Smith et al doi101093molbevmsy006 MBE

1008

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Recombination versus Mutation as a Source ofInformationWe compared our model-based inference with simpler esti-mates of the TMRCA using the number of derived mutationson the ancestral haplotype and the mean length of the an-cestral haplotype In addition to quantifying the improve-ment our method has over these calculations this alsoserves as an ad-hoc way to understand how the relativeweight of information from mutation and recombinationaffects the performance of our method One can model thehaplotype lengths as independent and exponentially distrib-uted to derive a recombination-based estimator tr as

tr frac141

wor (11)

where r is the recombination rate and wo is the observedmean ancestral haplotype length To leverage the count ofderived mutations on the ancestral haplotype we use theThomson et al (2000) estimator In a sample of n haplotypeswith the selected allele a mutation-based estimator tm canbe calculated as

tm frac141

n

Xn

i

yi

wil (12)

where yi is the number of derived mutations on the ith hap-lotype which has length wi basepairs See Hudson (2007) for aderivation of the estimate for the variance of the Thomsonestimator

When using derived mutations uncertainty in both theancestral haplotype sequence and the length of the ancestralhaplotype on each chromosome (wi) can lead to poor esti-mation To improve inference researchers typically define arestricted ldquononrecombiningrdquo region that may reliably containderived mutations on the ancestral haplotype This has twodisadvantages 1) There is more information available in thedata which cannot be used because excess caution is neces-sary to prevent overcounting of derived mutations and2) there may still be unobserved recombination events inthis restricted locus To minimize the use of heuristics for aderived mutation approach we used our model to findmaximum-likelihood estimates of the ancestral haplotypebreakpoints using equation (2) in the model descriptionWe also used the mean posterior estimate of the ancestralhaplotype from our model to identify derived mutations Tocalculate a recombination-based estimator of the TMRCA wecalculated wo using the same maximum-likelihood estimatesof the ancestral haplotype lengths inferred for the mutationestimator

When using a local reference panel the simple mutationestimator tm consistently underestimates the true TMRCAThe recombination estimate however remains accurate(supplementary fig 6 Supplementary Material online) Wesuspect this to be a result of poor estimation of the ancestralhaplotype and violation of the star-genealogy assumption Incases where selection is weaker and the genealogy is not star-shaped derived mutations occurring early in the genealogywill be overrepresented and incorrectly inferred to be the

ancestral allele In this way high-frequency-derived alleleswill not be counted As predicted increasing selectionstrength improves mutation estimator accuracy The recom-bination estimator appears robust to this effect as long asselection is not too strong For very strong selection andyoung TMRCA values maximum-likelihood estimates ofthe haplotype lengths become constrained by the size ofthe locus studied For example in simulations with a selectionstrength of 005 and frequency of 01 the mean TMRCA isaround 100 generations Using equation (11) above the meanlength of the ancestral haplotype for a TMRCA of 100 gen-erations is 2 Mb which is twice as large as the window size weuse to make computation for our simulations feasible Using alarger window around the selected locus would amelioratethis effect

When using a diverged reference panel we find an oppositeeffect In this case the count of derived mutations results inan overestimate and the haplotype lengths yield an under-estimate We suspect this to be driven by poor matchingbetween the reference panel and the background haplotypesamong beneficial allele carriers The low probability of match-ing between the reference and background haplotypes meansthat the lengths of the ancestral haplotype are inferred toextend further than their true lengths This also leads to anoverestimate for the mutation estimator because differencesbetween the ancestral and background haplotypes are incor-rectly inferred as derived mutations on the ancestralhaplotype

Application to 1000 Genomes DataWe applied our method to five variants previously identifiedas targets of recent selection in various human populations(fig 3) Using phased data from the 1000 Genomes Project wefocused on variants that are not completely fixed in any onepopulation so that we could use a local reference panel The Liand Stephens (2003) haplotype copying model is appropriatein cases where ancestry switches occur among chromosomeswithin a single population so we excluded populations in theAmericas for which high levels of admixture are known toexist

While the simulation results described above provide someintuition for the effects of selection strength final allele fre-quency and choice of reference panel we also performedsimulations using the demographic history inferred byTennessen et al (2012) to explore the effects of nonequili-brium demographic history on our estimation accuracy (sup-plementary fig 10 Supplementary Material online) We findsubtle differences in accuracy between the two demographichistories where the nonequilibrium histories lead to negligibledifferences in mean values for log2(estimatetrue value) andlarger credible interval ranges

ADH1BA derived allele at high frequency among East Asians at theADH1B gene (rs3811801) has been shown to be functionallyrelevant for alcohol metabolism (Osier et al 2002 Eng et al2007) Previous age estimates are consistent with the timing

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1009

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

of rice domestication and fermentation approximately10000 years ago (Li et al 2007 Peng et al 2010 Peter et al2012) However a more recent estimate by Peter et al (2012)pushes this time back several thousand years to 12876(2204ndash49764) years ago Our results are consistent with an

older timing of selection as our CHB sample (Han Chinese inBeijing China) TMRCA estimate is 15377 (13763ndash17281)years Replicate chains of the MCMC are generally consistentwith the oldest estimates in the CHB sample showing themost variation among resampled data sets and the youngest

FIG 3 Comparison of TMRCA estimates with previous results Violin plots of posterior distributions for the complete set of estimated TMRCAvalues for the five variants indicated in the legend scaled to a generation time of 29 years Each row indicates a population sample from the 1000Genomes Project panel Replicate MCMCs are plotted with transparency Points and lines overlaying the violins are previous point estimates and95 confidence intervals for each of the variants indicated by a color and rs number in the legend (see supplementary tables 3 and 4Supplementary Material online) The population sample abbreviations are defined in text

Smith et al doi101093molbevmsy006 MBE

1010

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

estimate of 10841 (9720ndash12147) in the KHV sample showingthe least When using a fine-scale recombination map all ofthe ADH1B TMRCAs are inferred to be slightly older (supple-mentary fig 7 Supplementary Material online)

EDARPopulation genomic studies have repeatedly identified thegene EDAR to be under recent selection in East Asians(Akey et al 2004 Williamson et al 2005 Voight et al 2006)with a particular site (rs3827760) showing strong evidence forbeing the putative target Functional assays and allele specificexpression differences at this position show phenotypic asso-ciations to a variety of phenotypes including hair thicknessand dental morphology (Bryk et al 2008 Fujimoto et al 2008Kimura et al 2009 Kamberov et al 2013)

Our estimate of 22192 (19683ndash25736) years for the EDARallele in the CHB sample is older than ABC-based estimates of12458 (1314ndash85835) and 13224 (4899ndash50692) years madeby Bryk et al (2008) and Peter et al (2012) respectivelyKamberov et al (2013) use spatially explicit ABC and maxi-mum likelihood approaches to compute older estimates of30925 (13175ndash39575) and 36490 (34775ndash38204) We in-cluded all populations for which the variant is present includ-ing the FIN and BEB samples where it exists at low frequencyOur results for the youngest TMRCAs are found in these twolow frequency populations where the estimate in FIN is17386 (13887ndash20794) and the estimate in BEB is 18370(14325ndash22872) Among East Asian populations the oldestand youngest TMRCA estimates are found in the KHV sam-ple (25683 23169ndash28380) and CHB sample (22192 19683ndash25736)

LCTArguably the best-studied signature of selection in humans isfor an allele at the LCT gene (rs4988235) which confers lactasepersistence into adulthoodmdasha trait unique among mammalsand which is thought to be a result of cattle domesticationand the incorporation of milk into the adult diets of severalhuman populations (Enattah et al 2002 Bersaglieri et al 2004Tishkoff et al 2007) There are multiple alleles that showassociation with lactase persistence (Tishkoff et al 2007)We focused on estimating the age of the T-13910 allele pri-marily found at high frequency among Northern Europeansbut which is also found in South Asian populations In addi-tion to association with the lactase persistence phenotypethis allele has been functionally verified by in vitro experi-ments (Olds and Sibley 2003 Troelsen et al 2003Kuokkanen et al 2006)

Mathieson et al (2015) use ancient DNA collected from83 human samples to get a better understanding of the fre-quency trajectory for several adaptive alleles spanning a timescale of 8000 years For the LCT persistence allele (rs4988235)they find a sharp increase in frequency in the past 4000 yearsago While this is more recent than previous estimates anearlier TMRCA or time of mutation is still compatible withthis scenario

Our estimates using European and South Asian samplesfall between the range from 5000 to 10000 years ago which isbroadly consistent with age estimates from modern data Thecredible intervals for estimates in all of the samples havesubstantial overlap which makes any ranking on the basisof point estimates difficult We infer the PJL (Punjabi fromLahore Pakistan) sample to have the oldest TMRCA estimateof 9514 (8596ndash10383) years Itan et al (2009) use spatialmodelling to infer the geographic spread of lactase allelefrom northern to southern Europe Consistent with theirresults the youngest estimate among European populationsis found in the IBS sample at 9341 (8688ndash9989) yearsAmong all samples the youngest estimate was found inBEB at 6869 (5143ndash8809)

KITLG and OCA2The genetic basis and natural history of human skin pigmen-tation is a well-studied system with several alleles of majoreffect showing signatures consistent with being targets ofrecent selection (Jablonski and Chaplin 2012 Beleza Santoset al 2013 Wilde et al 2014 Eaton et al 2015) We focused onan allele found at high-frequency world-wide among non-African populations at the KITLG locus (rs642742) whichshows significant effects on skin pigmentation differences be-tween Europeans and Africans (Miller et al 2007) althoughmore recent work fails to find any contribution of KITLGtoward variation in skin pigmentation in a Cape VerdeAfrican-European admixed population (Beleza Johnsonet al 2013) We also estimated the TMRCA for a melanin-associated allele at the OCA2 locus (rs1800414) which is onlyfound among East Asian populations at high frequency(Edwards et al 2010)

For the KITLG variant our estimates among differentpopulations vary from 18000 to 34000 years ago with theoldest age being in the YRI (Yoruba in Ibadan Nigeria)sample (33948 28861ndash39099) The youngest TMRCA isfound in FIN at 18733 years (16675ndash20816) The nexttwo youngest estimates are also found in Africa withthe TMRCA in the MSL (Mende in Sierra Leone) samplebeing 22340 (15723ndash28950) years old and that for LWK(Luhya in Webuye Kenya) being 22784 (17922ndash28012)years old suggesting a more complex history than amodel of a simple allele frequency increase outside ofAfrica due to pigmentation related selection pressuresPrevious point estimates using rejection samplingapproaches on a Portuguese sample (32277 6003ndash80683) and East Asian sample (32045 6032ndash98165)are again most consistent with our own results on theIBS (29731 26170ndash32813) and CHB samples (2677324297ndash30141) (Beleza Santos et al 2013 Chen et al2015) Among East Asians the oldest and youngest esti-mates are again found in the JPT (28637 24297ndash30141)and KHV (24544 21643ndash27193) samples respectivelyThe TMRCA for OCA2 alleles in the JPT (1859916110ndash20786) and KHV (16370 14439ndash18102) samplesare also the oldest and youngest respectively

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1011

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

DiscussionOur method improves estimation for the timing of selectionon a beneficial allele using a tractable model of haplotypeevolution This approach leverages detailed information inthe data while remaining amenable to large sample sizesUsing both carriers and noncarriers of the allele we canmore effectively account for uncertainty in the extent ofthe ancestral haplotype and derived mutations We showthe performance of our method using simulations of differentselection strengths beneficial allele frequencies and choicesof reference panel By applying our method to five variantspreviously identified as targets of selection in human popu-lations we provide a comparison among population-specificTMRCAs This gives a more detailed account of the order inwhich populations may have acquired the variant andorexperienced selection for the trait it underlies

In that regard it is hypothesized that local selection pres-sures and a cultural shift toward agrarian societies have in-duced adaptive responses among human populations aroundthe globe The data associated with some variants seem toindicate more recent selective events than others Our resultsfor variants associated with dietary traits at the LCT andADH1B genes both imply relatively recent TMRCAs(lt 15 000 years ago) consistent with hypotheses that selec-tion on these mutations results from recent changes in hu-man diet following the spread of agriculture (Simoons 1970Peng et al 2010) In contrast the inferred TMRCAs for EDARKITLG and OCA2 imply older adaptive events which mayhave coincided more closely with the habitation of new envi-ronments or other cultural changes

Several hypotheses have been suggested to describe theselective drivers of skin pigmentation differences among hu-man populations including reduced UV radiation at highlatitudes and vitamin D deficiency (Loomis 1967 Jablonskiand Chaplin 2000) Estimated TMRCAs for the variants atthe OCA2 and EDAR loci among East Asians appear to beas young or younger than the KITLG variant but older thanthe LCT and ADH1B locus This suggests a selective history inEast Asian populations leading to adaptive responses forthese traits occurring after an initial colonization In somecases the dispersion of replicate MCMC estimates makes itdifficult to describe the historical significance of an observedorder for TMRCA values However the consistency of esti-mates among different populations for particular variantsadds some confidence to our modelrsquos ability to reproducethe ages which are relevant to those loci or certain geographicregions

To assess the relative concordance of our estimates withthose from previous approaches we compared our resultswith a compilation of previously published estimates basedon the time of mutation time since fixation or TMRCA ofvariants associated with the genes studied here (supplemen-tary fig 8 Supplementary Material online) The range of con-fidence intervals for these studies is largely a reflection of theassumptions invoked or relaxed for any one method as wellas the sample size and quality of the data used In principleour method extracts more information than approaches that

use summary statistics such as ABC In our empirical appli-cation we found that our method provides a gain in accuracywhile accounting for uncertainty in both the ancestral hap-lotype and its length on each chromosome Notably ourmethod provides narrower credible intervals by incorporatingthe full information from ancestral haplotype lengths derivedmutations and a reference panel of noncarrier haplotypes

Another caveat of our method is its dependence on thereference panel which is intended to serve as a representativesample of nonancestral haplotypes in the population duringthe selected allelersquos increase in frequency Four possible chal-lenges can arise 1) Segments of the ancestral selected haplo-type may be present in the reference panel due torecombination (this is more likely for alleles that havereached higher frequency) 2) the reference panel may con-tain haplotypes that are similar to the ancestral haplotypedue to low levels of genetic diversity 3) the reference panelmay be too diverged from the focal population and 4) pop-ulation connectivity and turnover may lead the ldquolocalrdquo refer-ence panel to be largely composed of migrant haplotypeswhich were not present during the selected allelersquos initial in-crease in frequency

Under scenarios 1 and 2 the background haplotypes willbe too similar to the ancestral haplotype and it may be dif-ficult for the model to discern a specific ancestry switch lo-cation This leads to fewer differences (mutations) thanexpected between the ancestral haplotype and each benefi-cial allele carrier The simulation results are consistent withthis scenario Our method tends to underestimate the trueage across a range of selection intensities and allele frequen-cies when using a local reference panel

Conversely under scenarios 3 and 4 the model will fail todescribe a recombinant haplotype in the sample of beneficialallele carriers as a mosaic of haplotypes in the reference panelAs a result the model will infer more mutation events toexplain observed differences from the ancestral haplotypeOur simulation results show this to be the case with referencepanels diverged by N generations Posterior mean estimatesare consistently older than their true value Our simulationsare perhaps pessimistic thoughmdashwe chose reference paneldivergence times of N and 05N generations approximatelycorresponding to FST values of 04 and 02 respectively Forthe smaller FST values observed in humans we expect resultsfor diverged panels to be closer to those obtained with thelocal reference panel Nonetheless future extensions to incor-porate multiple populations within the reference panel wouldbe helpful and possible by modifying the approach of Priceet al (2009) Such an approach would also enable the analysisof admixed populations (we excluded admixed samples fromour analysis of the 1000 Genomes data above)

Aside from the challenges imposed by the choice of refer-ence panel another potential source of bias lies in our tran-sition probabilities which are not conditioned on thefrequency of the selected variant In reality recombinationevents at some distance away from the selected site will onlyresult in a switch from the ancestral to background haplo-types at a rate proportional to 1 pl where pl is the fre-quency of the ancestral haplotype alleles at locus l In this way

Smith et al doi101093molbevmsy006 MBE

1012

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

some recombination events may go unobservedmdashas thebeneficial allele goes to high frequency the probability of arecombination event leading to an observable ancestral tobackground haplotype transition decreases One solutionmay be to include the frequency-dependent transition prob-abilities derived by Chen et al (2015) Under their model themutation time is estimated by assuming a deterministic lo-gistic frequency trajectory starting at 1

2N An additional benefitof using frequency trajectories would be the ability to inferposterior distributions on selection coefficients While theselection coefficient is typically assumed to be related tothe time since mutation by t1 frac14 logethNsTHORN=s we do nothave an equivalent expression for time to the common an-cestor Rather than the initial frequency being 1

2N for a newmutation our initial frequency must correspond to the fre-quency at which the TMRCA occurs Griffiths and Tavare(1994) derive a framework to model a genealogy under arbi-trary population size trajectories which should be analogousto the problem of an allele frequency trajectory and addi-tional theory on intra-allelic genealogies may be useful here aswell (Griffiths and Tavare 1994 Wiuf and Donnelly 1999 Wiuf2000 Slatkin and Rannala 2000)

Our model also assumes independence among all haplo-types in the sample in a composite-likelihood frameworkwhich is equivalent to assuming a star-genealogy (Larribeand Fearnhead 2011 Varin et al 2011) This is unlikely tobe the case when sample sizes are large or the TMRCA isold It is also unlikely to be true if the beneficial allele existedon multiple haplotypes preceding the onset of selection wasintroduced by multiple migrant haplotypes from other pop-ulations or occurred by multiple independent mutationevents (Innan and Kim 2004 Hermisson and Pennings2005 Przeworski et al 2005 Pritchard et al 2010 Berg andCoop 2015) Methods for distinguishing selection from stand-ing variation versus de novo mutation are available thatshould make it easier distinguish these cases (Messer andNeher 2012 Peter et al 2012 Messer and Petrov 2013Garud et al 2015)

If the underlying allelic genealogy is not star-like onecan expect different estimates of the TMRCA for differentsubsets of the data Here we performed multiple MCMCson resampled subsets of the data to informally diagnosewhether there are violations from the star-shape geneal-ogy assumption We speculate that exactly how theTMRCAs vary may provide insight to the underlyinghistory In cases where the TMRCA estimates for a par-ticular population are old and more variable than otherpopulations the results may be explained by structure inthe genealogy whereby recent coalescent events haveoccurred among the same ancestral haplotype beforethe common ancestor When estimates are dispersedamong resampled data sets the presence of multiple an-cestral haplotypes prior to the variantrsquos increase in fre-quency may be a better explanation Further support forthis explanation might come from comparisons to otherpopulation samples which show little to no dispersion ofestimates from resampled data sets Future work mightmake it possible to formalize this inference process

A final caveat regards the misspecification of mutation andrecombination rates TMRCA estimates are largely deter-mined by the use of accurate measures for these two param-eters In a way this provides some robustness to our methodOur age estimates depend on mutation and recombinationrates so accurate specification for one of the values can com-pensate for slight misspecification in the other In cases wherea fine-scale recombination map is unavailable we suggest us-ing a uniform recombination rate specific to the locus ofinterest (see Materials and Methods and supplementaryfig 7 Supplementary Material online) Choosing an appropri-ate mutation rate will continue to depend on current andfuture work which tries to resolve discrepancies in publishedmutation rate estimates inferred by various approaches(Segurel et al 2014)

One future direction for our method may be to explicitlyincorporate the possibility of multiple ancestral haplotypeswithin the sample Under a disease mapping frameworkMorris et al (2002) implement a similar idea in the case whereindependent disease causing mutations arise at the same lo-cus leading to independent genealogies for which they cointhe term ldquoshattered coalescentrdquo For our case beneficialmutations may also be independently derived on differenthaplotypes Alternatively a single mutation may be oldenough to reside on different haplotypes due to a sufficientamount of linked variation existing prior to the onset of se-lection Berg and Coop (2015) model selection from standingvariation to derive the distribution of haplotypes that theselected allele is present on

While we have treated the TMRCA as a parameter of in-terest our method also produces a sample of the posteriordistribution on the ancestral haplotype This could provideuseful information to estimate the frequency spectrum ofderived mutations on the ancestral haplotype Similarly thefrequency of shared recombination breakpoints could shedlight on the genealogy and how well it conforms to the star-shape assumption The extent of the ancestral haplotype ineach individual may also prove useful for identifying delete-rious alleles that have increased in frequency as a result ofstrong positive selection on linked beneficial alleles (Chun andFay 2011 Hartfield and Otto 2011) For example Huff et al(2012) describe a risk allele for Crohnrsquos disease at high fre-quency in European populations which they suggest is linkedto a beneficial allele under recent selection Similar to anadmixture mapping approach our method could be usedto identify risk loci by testing for an association betweenthe ancestral haplotype and disease status As another appli-cation identifying the ancestral haplotype may be useful inthe context of identifying a source population (or species) fora beneficial allele prior to its introduction and subsequentincrease in frequency in the recipient population

In many cases the specific site under selection may beunknown or restricted to some set of putative sites Whileour method requires the position of the selected site to bespecified future extensions could treat the selected site as arandom variable to be estimated under the same MCMCframework This framework would also be amenable to mar-ginalizing over uncertainty on the selected site

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1013

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

While we focus here on inference from modern DNAdata the increased accessibility of ancient DNA has addeda new dimension to population genetic data sets (Lazaridiset al 2014 Skoglund et al 2014 Allentoft et al 2015 Haaket al 2015 Mathieson et al 2015) Because it will remaindifficult to use ancient DNA approaches in many specieswith poor sample preservation we believe methods basedon modern DNA will continue to be useful going forwardThat said ancient DNA is providing an interesting avenue forcomparative work between inference from modern and an-cient samples For example Nakagome et al (2016) usesimulations to assess the fit of this ancient DNA polymor-phism to data simulated under their inferred parametervalues for allele age and selection intensity and they findreasonable agreement Much work still remains to fullyleverage ancient samples into population genetic inferencewhile accounting for new sources of uncertainty and poten-tial for sampling bias

Despite these challenges it is clear that our understandingof adaptive history will continue to benefit from new com-putational tools which extract insightful information from adiverse set of data sources

Materials and MethodsWe generated data using the software mssel (Hudson R per-sonal communication) which simulates a sample of haplo-types conditioned on the frequency trajectory of a selectedvariant under the structured coalescent (Hudson and Kaplan1988 Kaplan et al 1988) Trajectories were first simulatedforwards in time under a WrightndashFisher model for an additivelocus with varying strengths of selection and different endingfrequencies of the selected variant Trajectories were thentruncated to end at the first time the allele reaches a specifiedfrequency See supplementary table 1 SupplementaryMaterial online for relative ages of simulated TMRCA val-ues for different end frequencies and selection strengthsFor the results in figure 2 100 simulations were performedfor each parameter combination MCMCs were run for5000 iterations with a burn-in excluding the first 3000iterations A standard deviation of 10 was used for theproposal distribution of tca The red boxplots indicate localreference panels The blue and green boxplots indicate ref-erence panels diverged by 5N generations and 1N gener-ations respectively Each data set was simulated for a 1-Mblocus with a mutation rate of 1 108 recombination rateof 1 108

and population size of 10000 Sample sizesfor the selected and reference panels were 100 and 20respectively

For more efficient run times of the MCMC we set a max-imum number of individuals to include in the selected andreference panels to be 100 and 20 respectively In cases wherethe true number of haplotypes for either panel was greaterthan this in the full data set we resampled a subset ofhaplotypes from each population for a total of five replicatesper population For simulation results supporting the use ofthis resampling strategy see supplementary figure 5Supplementary Material online The MCMCs were run for

15000 iterations with a standard deviation of 20 for theTMRCA proposal distribution The first 9000 iterationswere removed as a burn-in leading to 6000 iterations for asample of the posterior Convergence was assessed by com-parison of MCMC replicates Figure 3 and supplementaryfigure 8 Supplementary Material online show the resultsfor all five variants along with previous point estimates and95 confidence intervals assuming a generation time of29 years (Fenner 2005) Supplementary table 3Supplementary Material online lists the mean and 95 cred-ible intervals for estimates with the highest mean posteriorprobability which we refer to in the text belowSupplementary table 4 Supplementary Material online liststhe previous estimates and confidence intervals with addi-tional details of the different approaches taken

To model recombination rate variation we used recombi-nation rates from the Decode sex-averaged recombinationmap inferred from pedigrees among families in Iceland(Kong et al 2010) Because some populations may have re-combination maps which differ from the Decode map at finescales we used a mean uniform recombination rate inferredfrom the 1-Mb region surrounding each variant The motiva-tion for this arises from how recombination rates have beenpreviously shown to remain relatively consistent among re-combination maps inferred for different populations at themegabase-scale (Broman et al 1998 Baudat et al 2010 Konget al 2010 Auton and McVean 2012) Further we found ourestimates depend mostly on having the megabase-scale re-combination rate appropriately set with little difference inmost cases for estimates obtained by modeling fine-scale re-combination at each locus (supplementary fig 7Supplementary Material online) We specify the switchingrate among background haplotypes after recombining off ofthe ancestral haplotype to be 4Nr where Nfrac14 10000 and r isthe mean recombination rate for the 1-Mb locus

For modeling mutation a challenge is that previous mu-tation rate estimates vary depending on the approach used(Segurel et al 2014) Estimates using the neutral substitutionrate between humans and chimps are more than 2 108

per bp per generation while estimates using whole-genomesequencing are closer to 1 108 As a compromise wespecify a mutation rate of 16 108

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe would like to thank Hussein Al-Asadi Arjun BiddandaAnna Di Rienzo Dick Hudson Choongwon Jeong Evan KochJoseph Marcus Shigeki Nakagome Ben Peter Mark ReppelAlex White members of the Coop laboratory at UC Davismembers of the Przeworski laboratory at ColumbiaUniversity and members of the He and Stephens laboratoriesat the University of Chicago for helpful comments JS wassupported by an NSF Graduate Research Fellowship andNational Institute of General Medical Sciences of the

Smith et al doi101093molbevmsy006 MBE

1014

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

National Institutes of Health under award numbers DGE-1144082 and T32GM007197 respectively This work wasalso supported by the National Institute of General MedicalSciences of the National Institutes of Health under awardnumbers RO1GM83098 and RO1GM107374 to GC as wellas R01HG007089 to JN

ReferencesAkey JM Eberle MA Rieder MJ Carlson CS Shriver MD Nickerson DA

Kruglyak L 2004 Population history and natural selection shapepatterns of genetic variation in 132 genes PLoS Biol 21591ndash1599

Allentoft ME Sikora M Sjogren K-G Rasmussen S Rasmussen MStenderup J Damgaard PB Schroeder H Ahlstrom T Vinner L2015 Population genomics of Bronze Age Eurasia Nature522(7555) 167ndash172

Auton A McVean G 2012 Estimating recombination rates from geneticvariation in humans In Anisimova M ed Evolutionary genomicsMethods in Molecular Biology (Methods and Protocols) vol 856New York Humana Press

Barrett RDH Hoekstra HE 2011 Molecular spandrels tests of adaptationat the genetic level Nat Rev Genet 12(11) 767ndash780

Baudat F Buard J Grey C Fledel-Alon A Ober C Przeworski M Coop GMassy BD 2010 PRDM9 is a major determinant of meiotic recom-bination hotspots in humans and mice Science 327(5967) 836ndash840

Beaumont MA Zhang W Balding DJ 2002 Approximate Bayesian com-putation in population genetics Genetics 162(4) 2025ndash2035

Beleza S Johnson NA Candille SI Absher DM Coram MA Lopes JCampos J Araujo II Anderson TM Vilhjalmsson BJ et al 2013Genetic architecture of skin and eye color in an African-Europeanadmixed population PLoS Genet 9(3) e1003372

Beleza S Santos AM McEvoy B Alves I Martinho C Cameron E ShriverMD Parra EJ Rocha J 2013 The timing of pigmentation lightening inEuropeans Mol Biol Evol 30(1) 24ndash35

Berg JJ Coop G 2015 A coalescent model for a sweep of a uniquestanding variant Genetics 201(2) 707ndash725

Bersaglieri T Sabeti PC Patterson N Vanderploeg T Schaffner SF DrakeJA Rhodes M Reich DE Hirschhorn JN 2004 Genetic signatures ofstrong recent positive selection at the lactase gene Am J Hum Genet74(6) 1111ndash1120

Broman KW Murray JC Sheffield VC White RL Weber JL 1998Comprehensive human genetic maps individual and sex-specificvariation in recombination Am J Hum Genet 63(3) 861ndash869

Bryk J Hardouin E Pugach I Hughes D Strotmann R Stoneking M MylesS 2008 Positive selection in East Asians for an EDAR allele thatenhances NF-jB activation PLoS One 3(5) e2209

Chen H Slatkin M 2013 Inferring selection intensity and allele age frommultilocus haplotype structure G3 3(8) 1429ndash1442

Chen H Hey J Slatkin M 2015 A hidden Markov model for investigatingrecent positive selection through haplotype structure Theor PopulBiol 9918ndash30

Chun S Fay JC 2011 Evidence for hitchhiking of deleterious mutationswithin the human genome PLoS Genet 7(8) e1002240

Coop G Griffiths RC 2004 Ancestral inference on gene trees underselection Theor Popul Biol 66(3) 219ndash232

Coop G Bullaughey K Luca F Przeworski M 2008 The timing of selec-tion at the human FOXP2 gene Mol Biol Evol 25(7) 1257ndash1259

Dalziel AC Rogers SM Schulte PM 2009 Linking genotypes to pheno-types and fitness how mechanistic biology can inform molecularecology Mol Ecol 18(24) 4997ndash5017

Eaton K Edwards M Krithika S Cook G Norton H Parra EJ 2015Association study confirms the role of two OCA2 polymorphismsin normal skin pigmentation variation in East Asian populations AmJ Hum Biol 27 520ndash525

Edwards M Bigham A Tan J Li S Gozdzik A Ross K Jin L Parra EJ 2010Association of the OCA2 polymorphism His615Arg with melanin

content in east Asian populations further evidence of convergentevolution of skin pigmentation PLoS Genet 6(3) e1000867

Elmer KR Meyer A 2011 Adaptation in the age of ecological genomicsinsights from parallelism and convergence Trends Ecol Evol 26(6)298ndash306

Enattah NS Sahi T Savilahti E Terwilliger JD Peltonen L Jeuroarveleuroa I 2002Identification of a variant associated with adult-type hypolactasiaNat Genet 30(2) 233ndash237

Eng MY Luczak SE Wall TL 2007 ALDH2 ADH1B and ADH1C geno-types in Asians a literature review Alcohol Res Health 30(1) 22

Fenner JN 2005 Cross-cultural estimation of the human generationinterval for use in genetics-based population divergence studiesAm J Phys Anthropol 128(2) 415ndash423

Fujimoto A Ohashi J Nishida N Miyagawa T Morishita Y Tsunoda TKimura R Tokunaga K 2008 A replication study confirmed theEDAR gene to be a major contributor to population differentiationregarding head hair thickness in Asia Hum Genet 124(2) 179ndash185

Garud NR Messer PW Buzbas EO Petrov DA 2015 Recent selectivesweeps in North American Drosophila melanogaster show signaturesof soft sweeps PLoS Genet 11(2) e1005004

Griffiths RC Tavare S 1994 Sampling theory for neutral alleles in avarying environment Philos Trans R Soc B Biol Sci 344(1310)403ndash410

Haak W Lazaridis I Patterson N Rohland N Mallick S Llamas B BrandtG Nordenfelt S Harney E Stewardson K 2015 Massive migrationfrom the steppe was a source for Indo-European languages inEurope Nature 522 207ndash211

Hartfield M Otto SP 2011 Recombination and hitchhiking of deleteri-ous alleles Evolution 65(9) 2421ndash2434

Hermisson J Pennings PS 2005 Soft sweeps molecular population ge-netics of adaptation from standing genetic variation Genetics 169(4)2335ndash2352

Hinch AG Tandon A Patterson N Song Y Rohland N Palmer CD ChenGK Wang K Buxbaum SG Akylbekova EL et al 2011 The landscapeof recombination in African Americans Nature 476(7359) 170ndash175

Hudson RR 1990 Gene genealogies and the coalescent process Oxf SurvEvol Biol 7(1) 44

Hudson RR 2007 The variance of coalescent time estimates from DNAsequences J Mol Evol 64(6) 702ndash705

Hudson RR Kaplan NL 1988 The coalescent process in models withselection and recombination Genetics 120(3) 831ndash840

Huff CD Witherspoon DJ Zhang Y Gatenbee C Denson LA KugathasanS Hakonarson H Whiting A Davis CT Wu W et al 2012 Crohnrsquosdisease and genetic hitchhiking at IBD5 Mol Biol Evol 29(1)101ndash111

Innan H Kim Y 2004 Pattern of polymorphism after strong artificialselection in a domestication event Proc Natl Acad Sci U S A 101(29)10667ndash10672

Itan Y Powell A Beaumont MA Burger J Thomas MG 2009 The originsof lactase persistence in Europe PLoS Comput Biol 5(8) e1000491

Jablonski NG Chaplin G 2000 The evolution of human skin coloration JHum Evol 39(1) 57ndash106

Jablonski NG Chaplin G 2012 Human skin pigmentation migration anddisease susceptibility Philos Trans R Soc B Biol Sci 367(1590)785ndash792

Jeong C Di Rienzo A 2014 Adaptations to local environments in mod-ern human populations Curr Opin Genet Dev 291ndash8

Kamberov YG Wang S Tan J Gerbault P Wark A Tan L Yang Y Li STang K Chen H et al 2013 Modeling recent human evolution inmice by expression of a selected EDAR variant Cell 152(4) 691ndash702

Kaplan NL Darden T Hudson RR 1988 The coalescent process inmodels with selection Genetics 120(3) 819ndash829

Kaplan NL Hudson RR Langley CH 1989 The ldquohitchhiking effectrdquo revis-ited Genetics 123(4) 887ndash899

Kimura R Yamaguchi T Takeda M Kondo O Toma T Haneji KHanihara T Matsukusa H Kawamura S Maki K et al 2009 A com-mon variation in EDAR is a genetic determinant of shovel-shapedincisors Am J Hum Genet 85(4) 528ndash535

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1015

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Kong A Thorleifsson G Gudbjartsson DF Masson G Sigurdsson AJonasdottir A Walters GB Jonasdottir A Gylfason A KristinssonKT et al 2010 Fine-scale recombination rate differences betweensexes populations and individuals Nature 467(7319) 1099ndash1103

Kuokkanen M Kokkonen J Enattah NS Ylisaukko-oja T Komu H VariloT Peltonen L Savilahti E Jeuroarveleuroa I 2006 Mutations in the translatedregion of the lactase gene (LCT) underlie congenital lactase defi-ciency Am J Hum Genet 78(2) 339ndash344

Larribe F Fearnhead P 2011 On composite likelihoods in statisticalgenetics Stat Sin 21(1) 43ndash69

Lazaridis I Patterson N Mittnik A Renaud G Mallick S Kirsanow KSudmant PH Schraiber JG Castellano S Lipson M et al 2014Ancient human genomes suggest three ancestral populations forpresent-day Europeans Nature 513(7518) 409ndash413

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357) 493ndash496

Li H Mukherjee N Soundararajan U Tarnok Z Barta C Khaliq SMohyuddin A Kajuna SLB Mehdi SQ Kidd JR et al 2007Geographically separate increases in the frequency of the derivedADH1B 47His allele in eastern and western Asia Am J Hum Genet81(4) 842ndash846

Li N Stephens M 2003 Modeling linkage disequilibrium and identifyingrecombination hotspots using single-nucleotide polymorphismdata Genetics 165(4) 2213ndash2233

Loomis WF 1967 Skin-pigment regulation of vitamin-D biosynthesis inman Science 157(3788) 501ndash506

Mathieson I Lazaridis I Rohland N Mallick S Patterson N RoodenbergSA Harney E Stewardson K Fernandes D Novak M et al 2015Genome-wide patterns of selection in 230 ancient Eurasians Nature528(7583) 499ndash503

McPeek MS Strahs A 1999 Assessment of linkage disequilibrium by thedecay of haplotype sharing with application to fine-scale geneticmapping Am J Hum Genet 65(3) 858ndash875

Meligkotsidou L Fearnhead P 2005 Maximum-likelihood estimationof coalescence times in genealogical trees Genetics 171(4)2073ndash2084

Messer PW Neher RA 2012 Estimating the strength of selective sweepsfrom deep population diversity data Genetics 191(2) 593ndash605

Messer PW Petrov DA 2013 Population genomics of rapid adaptationby soft selective sweeps Trends Ecol Evol 28(11) 659ndash669

Miller CT Beleza S Pollen AA Schluter D Kittles RA Shriver MDKingsley DM 2007 cis-Regulatory changes in kit ligand expressionand parallel evolution of pigmentation in sticklebacks and humansCell 131(6) 1179ndash1189

Morris AP Whittaker JC Balding DJ 2000 Bayesian fine-scale mapping ofdisease loci by hidden Markov models Am J Hum Genet 67(1)155ndash169

Morris AP Whittaker JC Balding DJ 2002 Fine-scale mapping of diseaseloci via shattered coalescent modeling of genealogies Am J HumGenet 70(3) 686ndash707

Nakagome S Alkorta-Aranburu G Amato R Howie B Peter BMHudson RR Di Rienzo A 2016 Estimating the ages of selectionsignals from different epochs in human history Mol Biol Evol33(3) 657ndash669

Nielsen R 2005 Molecular signatures of natural selection Annu RevGenet 39 197ndash218

Olds LC Sibley E 2003 Lactase persistence DNA variant enhances lactasepromoter activity in vitro functional role as a cis regulatory elementHum Mol Genet 12(18) 2333ndash2340

Ormond L Foll M Ewing GB Pfeifer SP Jensen JD 2016 Inferring the ageof a fixed beneficial allele Mol Ecol 25(1) 157ndash169

Osier MV Pakstis AJ Soodyall H Comas D Goldman D Odunsi AOkonofua F Parnas J Schulz LO Bertranpetit J et al 2002 A globalperspective on genetic variation at the ADH genes reveals unusualpatterns of linkage disequilibrium and diversity Am J Hum Genet71(1) 84ndash99

Peng Y Shi H Qi X-B Xiao C-J Zhong H Ma Run-lin Z Su B 2010The ADH1B Arg47His polymorphism in East Asian populations

and expansion of rice domestication in history BMC Evol Biol10(1) 15

Peter BM Huerta-Sanchez E Nielsen R 2012 Distinguishing betweenselective sweeps from standing variation and from a de novo mu-tation PLoS Genet 8(10) e1003011

Przeworski M Coop G Wall JD 2005 The signature of positive selectionon standing genetic variation Evolution 59(11) 2312ndash2323

Price AL Tandon A Patterson N Barnes KC Rafaels N Ruczinski I BeatyTH Mathias R Reich D Myers S 2009 Sensitive detection of chro-mosomal segments of distinct ancestry in admixed populationsPLoS Genet 5(6) e1000519

Pritchard JK Seielstad MT Perez-Lezaun A Feldman MW 1999Population growth of human Y chromosomes a study of Y chro-mosome microsatellites Mol Biol Evol 16(12) 1791ndash1798

Pritchard JK Pickrell JK Coop G 2010 The genetics of human adapta-tion hard sweeps soft sweeps and polygenic adaptation Curr Biol20(4) R208ndashR215

Przeworski M 2003 Estimating the time since the fixation of a beneficialallele Genetics 164(4) 1667ndash1676

Rabiner LR 1989 A tutorial on hidden Markov models and selectedapplications in speech recognition Proc IEEE 77(2) 257ndash286

Radwan J Babik W 2012 The genomics of adaptation Proc R Soc Lond BBiol Sci 279(1749) 5024ndash5028

Rannala B Reeve JP 2001 High-resolution multipoint linkage-disequilibrium mapping in the context of a human genome se-quence Am J Hum Genet 69(1) 159ndash178

Rannala B Reeve JP 2002 Joint Bayesian estimation of mutationlocation and age using linkage disequilibrium Biocomputing2003 526ndash534

Segurel L Wyman MJ Przeworski M 2014 Determinants of mutationrate variation in the human germline Annu Rev Genomics HumGenet 1547ndash70

Simoons FJ 1970 Primary adult lactose intolerance and the milkinghabit a problem in biologic and cultural interrelations Am J DigDis 15(8) 695ndash710

Skoglund P Malmstrom H Omrak A Raghavan M Valdiosera CGunther T Hall P Tambets K Parik J Sjogren K-G et al 2014Genomic diversity and admixture differs for Stone-AgeScandinavian foragers and farmers Science 344(6185) 747ndash750

Slatkin M 2001 Simulating genealogies of selected alleles in a populationof variable size Genet Res 78(01) 49ndash57

Slatkin M 2008 A Bayesian method for jointly estimating allele age andselection intensity Genet Res 90(01) 129ndash137

Slatkin M Hudson RR 1991 Pairwise comparisons of mitochondrialDNA sequences in stable and exponentially growing populationsGenetics 129(2) 555ndash562

Slatkin M Rannala B 2000 Estimating allele age Annu Rev GenomicsHum Genet 1(1) 225ndash249

Smith JM Haigh J 1974 The hitch-hiking effect of a favourable geneGenet Res 23(01) 23ndash35

Stinchcombe JR Hoekstra HE 2008 Combining population genomicsand quantitative genetics finding the genes underlying ecologicallyimportant traits Heredity 100(2) 158ndash170

Tang H Siegmund DO Shen P Oefner PJ Feldman MW 2002Frequentist estimation of coalescence times from nucleotide se-quence data using a tree-based partition Genetics 161(1) 447ndash459

Tavare S Balding DJ Griffiths RC Donnelly P 1997 Inferring coalescencetimes from DNA sequence data Genetics 145(2) 505ndash518

Tennessen JA Bigham AW OrsquoConnor TD Fu W Kenny EE Gravel SMcGee S Do R Liu X Jun G et al 2012 Evolution and functionalimpact of rare coding variation from deep sequencing of humanexomes Science 337(6090) 64ndash69

Thomson R Pritchard JK Shen P Oefner PJ Feldman MW 2000Recent common ancestry of human Y chromosomes evidencefrom DNA sequence data Proc Natl Acad Sci U S A 97(13)7360ndash7365

Tishkoff SA Reed FA Ranciaro A Voight BF Babbitt CC Silverman JSPowell K Mortensen HM Hirbo JB Osman M et al 2007

Smith et al doi101093molbevmsy006 MBE

1016

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Convergent adaptation of human lactase persistence in Africa andEurope Nat Genet 39(1) 31ndash40

Troelsen JT Olsen J Moslashller J SjeuroOstreuroOm H 2003 An upstream polymor-phism associated with lactase persistence has increased enhanceractivity Gastroenterology 125(6) 1686ndash1694

Varin C Reid N Firth D 2011 An overview of composite likelihoodmethods Stat Sin 21(1) 5ndash42

Voight BF Kudaravalli S Wen X Pritchard JK Hurst L 2006 A map ofrecent positive selection in the human genome PLoS Biol 4(3) e72

Wegmann D Kessner DE Veeramah KR Mathias RA Nicolae DL YanekLR Sun YV Torgerson DG Rafaels N Mosley T et al 2011Recombination rates in admixed individuals identified by ancestry-based inference Nat Genet 43(9) 847ndash853

Wilde S Timpson A Kirsanow K Kaiser E Kayser M Unterleuroander MHollfelder N Potekhina ID Schier W Thomas MG et al 2014 Directevidence for positive selection of skin hair and eye pigmentation inEuropeans during the last 5000 years Proc Natl Acad Sci U S A111(13) 4832ndash4837

Williamson SH Hernandez R Fledel-Alon A Zhu L Nielsen RBustamante CD 2005 Simultaneous inference of selection and pop-ulation growth from patterns of variation in the human genomeProc Natl Acad Sci U S A 102(22) 7882ndash7887

Wiuf C 2000 On the genealogy of a sample of neutral rare alleles TheorPopul Biol 58(1) 61ndash75

Wiuf C Donnelly P 1999 Conditional genealogies and the age of aneutral mutant Theor Popul Biol 56(2) 183ndash201

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1017

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

PrethXjtcaAHTHORN frac14Yn

ifrac141

XL

wfrac141

aiwbiwPrethWi frac14 wjtcaATHORN (10)

This computation is on the order 2Lnm2 and in practicefor mfrac14 20 nfrac14 100 and Lfrac14 4000 takes approximately3027 s to compute on an Intel Core i7-4750HQ CPU at200 GHz 8 with 156 GB RAM

InferencePerforming inference on tca requires addressing the latentvariables w and A in the model Marginalizing over possiblevalues of w is a natural summation per haplotype that is linearin L as shown above For A the number of possible values islarge (2L) and so we employ a MetropolisndashHastings algorithmto jointly sample the posterior of A and tca and then we takemarginal samples of tca for inference We assign a uniformprior density for both A and tca such that PrethATHORN frac14 1=2L andPrethtcaTHORN frac14 1=ethtmax tminTHORN where tmax and tmin are user-specified maximum and minimum values for tca ProposedMCMC updates of the ancestral haplotype A0 are generatedby randomly selecting a site in A and flipping to the alterna-tive allele For tca proposed values are generated by adding anormally distributed random variable centered at 0tca0 frac14 tca thorn Neth0 r2THORN To start the MetropolisndashHastings algo-

rithm an initial value of tca is uniformly drawn from a user-specified range of values (10ndash2000 in the applications here)To initialize the ancestral haplotype to a reasonable value weuse a heuristic algorithm which exploits the characteristicdecrease in variation near a selected site (see supplementaryappendix S2 Supplementary Material online)

For each haplotype in the sample of beneficial allele car-riers the Li and Stephens (2003) model uses a haplotypemiscopying rate h and switching rate q to compute a like-lihood term for loci following the recombination event off ofthe ancestral haplotype For our analyses we set q frac14 44104 using our simulated values of r frac14 11 108 per bpper generation and Nfrac14 10000 where q frac14 4Nr Following Li

and Stephens (2003) we fix h frac14 ethPn

mfrac1411=mTHORN1 as derived

from the expected number of mutation events on a geneal-ogy relating n chromosomes at a particular site We found nodiscernible effects on estimate accuracy when specifying dif-ferent values of q (supplementary fig 1 SupplementaryMaterial online)

ResultsBecause our model requires a sample (or ldquopanelrdquo) of referencehaplotypes without the selected allele we tested our methodfor cases in which the reference panel is chosen from the localpopulation in which the selected allele is found as well ascases where the panel is from a diverged population wherethe selected haplotype is absent (fig 1c) Regardless of sce-nario the estimates are on average within a factor of 2 of thetrue value and often much closer When using a local refer-ence panel point estimates of tca increasingly underestimatethe true value (TMRCA) as selection becomes weaker and thefinal allele frequency increases (fig 2) Put differently the age

of older TMRCAs tends to be underestimated with local ref-erence panels Using the mean posteriors as point estimatesmean values of log2(estimatetrue value) range from062 to014 Simulations using a diverged population for the refer-ence panel removed the bias though only in cases where thedivergence time was not large For a reference panel divergedby 05N generations mean log2(estimatetrue value) valuesrange from 021 to 018 As the reference panel becomestoo far diverged from the selected population estimates be-come older than the true value (036 to 094 log2(estimatetrue value)) In these cases the HMM is unlikely to infer aclose match between background haplotypes in the sampleand the reference panel leading to many more mismatchesbeing inferred as mutation events on the ancestral haplotypeand an older estimate of tca

The bottom panel of figure 2 shows the effect of selectionstrength and final allele frequency on the size of the 95credible interval around point estimates normalized by thetrue TMRCA for each simulated data set Before normalizingcredible interval sizes using a local reference panel range from73 to 213 generations for 2Ns frac14 100 versus 18 to 22 gener-ations when 2Ns frac14 2000 Using local and diverged referencepanels we found a minimal effect of the sample size on pointestimates (supplementary figs 2 and 3 SupplementaryMaterial online) As expected larger sample sizes for the car-rier panel improve estimate accuracy However higher allelefrequencies and weak selection are likely to induce more un-certainty due to the ancestral haplotype tracts recombiningwithin the sample We find this effect more pronounced withlarge sample sizes for the reference panel We speculate that alarge sample of reference haplotypes leads the focal selectedhaplotype to have increased probability of copying from thereference panel leading to a shorter selected haplotype andslight overestimate of the TMRCA

We also performed simulations under varying degrees ofmutation and recombination rate misspecification (supple-mentary fig 4 Supplementary Material online) In most casesmean values of log2(estimatetrue value) stay within an orderof magnitude of 0 As expected when both the mutation andrecombination rate are misspecified we find the most dis-crepancy To assess the convergence properties of theMCMC five replicate chains were run for each of 20 sim-ulated data sets produced under three 2Ns values (100200 and 2000) for frequency trajectories ending at 01(supplementary fig 5 Supplementary Material online)While care is always warranted with MCMC approacheswe find in practice that convergence among our replicatechains is attained relatively quickly (3 000 iterations forsimulated data and 3000ndash9000 iterations for appliedcases see supplementary fig 9 Supplementary Materialonline)

We compared the performance of our estimator withthree other model-based approaches for allele age estimationby matching the simulation scheme performed by Chen et al(2015) (supplementary table 5 Supplementary Material on-line) Our method shows improvement in accuracy (RMSE)andor lower bias for simulations with lower frequencies ofthe beneficial allele (40) regardless of the reference panel

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1007

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

used In cases where the final beneficial allele frequency ishigher (80) our methodrsquos accuracy remains as good or bet-ter than the other methods when using a diverged referencepanel with a two-orders-of-magnitude improvement of biasunder strong selection (sfrac14 005) Estimates when using a localreference panel and a high final beneficial allele frequencyremain comparable to the other methods for strong selectionbut tend to have more bias and decreased accuracy as selec-tion strength decreases

Assuming a star-genealogy among beneficial allele carriersmay result in underestimating the variance for the posteriordistribution when there is nonindependence in our sample

To measure this affect we computed TMRCA estimates on100 bootstrap replicates for 4 simulated data sets under 2selection strengths and 2 final allele frequencies (supplemen-tary table 7 Supplementary Material online) We find closeagreement between the 95 posterior credible intervals ofthe original data and the 95 confidence intervals computedon the bootstrap estimates for a selection strength of 01 forboth final allele frequencies of 04 and 08 As expected olderTMRCAs are likely to violate the star-genealogy assumptionand in these cases we find that estimates from our originaldata are more narrow than the bootstrap confidenceintervals

Frequency 01 Frequency 03 Frequency 06 Frequency 09

minus2

minus1

0

1

2

3

0005 001 01 0005 001 01 0005 001 01 0005 001 01

log2

(Est

imat

eTr

ue V

alue

)Local 02N 05N 10N

Frequency 01 Frequency 03 Frequency 06 Frequency 09

02

04

06

08

0005 001 01 0005 001 01 0005 001 01 0005 001 01Selection Strength

Nor

mal

ized

95

Cre

dibl

e In

terv

al R

ange

FIG 2 Accuracy results from simulated data Accuracy of TMRCA point estimates and 95 credible interval ranges from posteriors inferred fromsimulated data under different strengths of selection final allele frequencies and choice of reference panel Credible interval range sizes are in unitsof generations and are normalized by the true TMRCA for each simulated data set See Materials and Methods below for simulation details

Smith et al doi101093molbevmsy006 MBE

1008

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Recombination versus Mutation as a Source ofInformationWe compared our model-based inference with simpler esti-mates of the TMRCA using the number of derived mutationson the ancestral haplotype and the mean length of the an-cestral haplotype In addition to quantifying the improve-ment our method has over these calculations this alsoserves as an ad-hoc way to understand how the relativeweight of information from mutation and recombinationaffects the performance of our method One can model thehaplotype lengths as independent and exponentially distrib-uted to derive a recombination-based estimator tr as

tr frac141

wor (11)

where r is the recombination rate and wo is the observedmean ancestral haplotype length To leverage the count ofderived mutations on the ancestral haplotype we use theThomson et al (2000) estimator In a sample of n haplotypeswith the selected allele a mutation-based estimator tm canbe calculated as

tm frac141

n

Xn

i

yi

wil (12)

where yi is the number of derived mutations on the ith hap-lotype which has length wi basepairs See Hudson (2007) for aderivation of the estimate for the variance of the Thomsonestimator

When using derived mutations uncertainty in both theancestral haplotype sequence and the length of the ancestralhaplotype on each chromosome (wi) can lead to poor esti-mation To improve inference researchers typically define arestricted ldquononrecombiningrdquo region that may reliably containderived mutations on the ancestral haplotype This has twodisadvantages 1) There is more information available in thedata which cannot be used because excess caution is neces-sary to prevent overcounting of derived mutations and2) there may still be unobserved recombination events inthis restricted locus To minimize the use of heuristics for aderived mutation approach we used our model to findmaximum-likelihood estimates of the ancestral haplotypebreakpoints using equation (2) in the model descriptionWe also used the mean posterior estimate of the ancestralhaplotype from our model to identify derived mutations Tocalculate a recombination-based estimator of the TMRCA wecalculated wo using the same maximum-likelihood estimatesof the ancestral haplotype lengths inferred for the mutationestimator

When using a local reference panel the simple mutationestimator tm consistently underestimates the true TMRCAThe recombination estimate however remains accurate(supplementary fig 6 Supplementary Material online) Wesuspect this to be a result of poor estimation of the ancestralhaplotype and violation of the star-genealogy assumption Incases where selection is weaker and the genealogy is not star-shaped derived mutations occurring early in the genealogywill be overrepresented and incorrectly inferred to be the

ancestral allele In this way high-frequency-derived alleleswill not be counted As predicted increasing selectionstrength improves mutation estimator accuracy The recom-bination estimator appears robust to this effect as long asselection is not too strong For very strong selection andyoung TMRCA values maximum-likelihood estimates ofthe haplotype lengths become constrained by the size ofthe locus studied For example in simulations with a selectionstrength of 005 and frequency of 01 the mean TMRCA isaround 100 generations Using equation (11) above the meanlength of the ancestral haplotype for a TMRCA of 100 gen-erations is 2 Mb which is twice as large as the window size weuse to make computation for our simulations feasible Using alarger window around the selected locus would amelioratethis effect

When using a diverged reference panel we find an oppositeeffect In this case the count of derived mutations results inan overestimate and the haplotype lengths yield an under-estimate We suspect this to be driven by poor matchingbetween the reference panel and the background haplotypesamong beneficial allele carriers The low probability of match-ing between the reference and background haplotypes meansthat the lengths of the ancestral haplotype are inferred toextend further than their true lengths This also leads to anoverestimate for the mutation estimator because differencesbetween the ancestral and background haplotypes are incor-rectly inferred as derived mutations on the ancestralhaplotype

Application to 1000 Genomes DataWe applied our method to five variants previously identifiedas targets of recent selection in various human populations(fig 3) Using phased data from the 1000 Genomes Project wefocused on variants that are not completely fixed in any onepopulation so that we could use a local reference panel The Liand Stephens (2003) haplotype copying model is appropriatein cases where ancestry switches occur among chromosomeswithin a single population so we excluded populations in theAmericas for which high levels of admixture are known toexist

While the simulation results described above provide someintuition for the effects of selection strength final allele fre-quency and choice of reference panel we also performedsimulations using the demographic history inferred byTennessen et al (2012) to explore the effects of nonequili-brium demographic history on our estimation accuracy (sup-plementary fig 10 Supplementary Material online) We findsubtle differences in accuracy between the two demographichistories where the nonequilibrium histories lead to negligibledifferences in mean values for log2(estimatetrue value) andlarger credible interval ranges

ADH1BA derived allele at high frequency among East Asians at theADH1B gene (rs3811801) has been shown to be functionallyrelevant for alcohol metabolism (Osier et al 2002 Eng et al2007) Previous age estimates are consistent with the timing

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1009

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

of rice domestication and fermentation approximately10000 years ago (Li et al 2007 Peng et al 2010 Peter et al2012) However a more recent estimate by Peter et al (2012)pushes this time back several thousand years to 12876(2204ndash49764) years ago Our results are consistent with an

older timing of selection as our CHB sample (Han Chinese inBeijing China) TMRCA estimate is 15377 (13763ndash17281)years Replicate chains of the MCMC are generally consistentwith the oldest estimates in the CHB sample showing themost variation among resampled data sets and the youngest

FIG 3 Comparison of TMRCA estimates with previous results Violin plots of posterior distributions for the complete set of estimated TMRCAvalues for the five variants indicated in the legend scaled to a generation time of 29 years Each row indicates a population sample from the 1000Genomes Project panel Replicate MCMCs are plotted with transparency Points and lines overlaying the violins are previous point estimates and95 confidence intervals for each of the variants indicated by a color and rs number in the legend (see supplementary tables 3 and 4Supplementary Material online) The population sample abbreviations are defined in text

Smith et al doi101093molbevmsy006 MBE

1010

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

estimate of 10841 (9720ndash12147) in the KHV sample showingthe least When using a fine-scale recombination map all ofthe ADH1B TMRCAs are inferred to be slightly older (supple-mentary fig 7 Supplementary Material online)

EDARPopulation genomic studies have repeatedly identified thegene EDAR to be under recent selection in East Asians(Akey et al 2004 Williamson et al 2005 Voight et al 2006)with a particular site (rs3827760) showing strong evidence forbeing the putative target Functional assays and allele specificexpression differences at this position show phenotypic asso-ciations to a variety of phenotypes including hair thicknessand dental morphology (Bryk et al 2008 Fujimoto et al 2008Kimura et al 2009 Kamberov et al 2013)

Our estimate of 22192 (19683ndash25736) years for the EDARallele in the CHB sample is older than ABC-based estimates of12458 (1314ndash85835) and 13224 (4899ndash50692) years madeby Bryk et al (2008) and Peter et al (2012) respectivelyKamberov et al (2013) use spatially explicit ABC and maxi-mum likelihood approaches to compute older estimates of30925 (13175ndash39575) and 36490 (34775ndash38204) We in-cluded all populations for which the variant is present includ-ing the FIN and BEB samples where it exists at low frequencyOur results for the youngest TMRCAs are found in these twolow frequency populations where the estimate in FIN is17386 (13887ndash20794) and the estimate in BEB is 18370(14325ndash22872) Among East Asian populations the oldestand youngest TMRCA estimates are found in the KHV sam-ple (25683 23169ndash28380) and CHB sample (22192 19683ndash25736)

LCTArguably the best-studied signature of selection in humans isfor an allele at the LCT gene (rs4988235) which confers lactasepersistence into adulthoodmdasha trait unique among mammalsand which is thought to be a result of cattle domesticationand the incorporation of milk into the adult diets of severalhuman populations (Enattah et al 2002 Bersaglieri et al 2004Tishkoff et al 2007) There are multiple alleles that showassociation with lactase persistence (Tishkoff et al 2007)We focused on estimating the age of the T-13910 allele pri-marily found at high frequency among Northern Europeansbut which is also found in South Asian populations In addi-tion to association with the lactase persistence phenotypethis allele has been functionally verified by in vitro experi-ments (Olds and Sibley 2003 Troelsen et al 2003Kuokkanen et al 2006)

Mathieson et al (2015) use ancient DNA collected from83 human samples to get a better understanding of the fre-quency trajectory for several adaptive alleles spanning a timescale of 8000 years For the LCT persistence allele (rs4988235)they find a sharp increase in frequency in the past 4000 yearsago While this is more recent than previous estimates anearlier TMRCA or time of mutation is still compatible withthis scenario

Our estimates using European and South Asian samplesfall between the range from 5000 to 10000 years ago which isbroadly consistent with age estimates from modern data Thecredible intervals for estimates in all of the samples havesubstantial overlap which makes any ranking on the basisof point estimates difficult We infer the PJL (Punjabi fromLahore Pakistan) sample to have the oldest TMRCA estimateof 9514 (8596ndash10383) years Itan et al (2009) use spatialmodelling to infer the geographic spread of lactase allelefrom northern to southern Europe Consistent with theirresults the youngest estimate among European populationsis found in the IBS sample at 9341 (8688ndash9989) yearsAmong all samples the youngest estimate was found inBEB at 6869 (5143ndash8809)

KITLG and OCA2The genetic basis and natural history of human skin pigmen-tation is a well-studied system with several alleles of majoreffect showing signatures consistent with being targets ofrecent selection (Jablonski and Chaplin 2012 Beleza Santoset al 2013 Wilde et al 2014 Eaton et al 2015) We focused onan allele found at high-frequency world-wide among non-African populations at the KITLG locus (rs642742) whichshows significant effects on skin pigmentation differences be-tween Europeans and Africans (Miller et al 2007) althoughmore recent work fails to find any contribution of KITLGtoward variation in skin pigmentation in a Cape VerdeAfrican-European admixed population (Beleza Johnsonet al 2013) We also estimated the TMRCA for a melanin-associated allele at the OCA2 locus (rs1800414) which is onlyfound among East Asian populations at high frequency(Edwards et al 2010)

For the KITLG variant our estimates among differentpopulations vary from 18000 to 34000 years ago with theoldest age being in the YRI (Yoruba in Ibadan Nigeria)sample (33948 28861ndash39099) The youngest TMRCA isfound in FIN at 18733 years (16675ndash20816) The nexttwo youngest estimates are also found in Africa withthe TMRCA in the MSL (Mende in Sierra Leone) samplebeing 22340 (15723ndash28950) years old and that for LWK(Luhya in Webuye Kenya) being 22784 (17922ndash28012)years old suggesting a more complex history than amodel of a simple allele frequency increase outside ofAfrica due to pigmentation related selection pressuresPrevious point estimates using rejection samplingapproaches on a Portuguese sample (32277 6003ndash80683) and East Asian sample (32045 6032ndash98165)are again most consistent with our own results on theIBS (29731 26170ndash32813) and CHB samples (2677324297ndash30141) (Beleza Santos et al 2013 Chen et al2015) Among East Asians the oldest and youngest esti-mates are again found in the JPT (28637 24297ndash30141)and KHV (24544 21643ndash27193) samples respectivelyThe TMRCA for OCA2 alleles in the JPT (1859916110ndash20786) and KHV (16370 14439ndash18102) samplesare also the oldest and youngest respectively

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1011

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

DiscussionOur method improves estimation for the timing of selectionon a beneficial allele using a tractable model of haplotypeevolution This approach leverages detailed information inthe data while remaining amenable to large sample sizesUsing both carriers and noncarriers of the allele we canmore effectively account for uncertainty in the extent ofthe ancestral haplotype and derived mutations We showthe performance of our method using simulations of differentselection strengths beneficial allele frequencies and choicesof reference panel By applying our method to five variantspreviously identified as targets of selection in human popu-lations we provide a comparison among population-specificTMRCAs This gives a more detailed account of the order inwhich populations may have acquired the variant andorexperienced selection for the trait it underlies

In that regard it is hypothesized that local selection pres-sures and a cultural shift toward agrarian societies have in-duced adaptive responses among human populations aroundthe globe The data associated with some variants seem toindicate more recent selective events than others Our resultsfor variants associated with dietary traits at the LCT andADH1B genes both imply relatively recent TMRCAs(lt 15 000 years ago) consistent with hypotheses that selec-tion on these mutations results from recent changes in hu-man diet following the spread of agriculture (Simoons 1970Peng et al 2010) In contrast the inferred TMRCAs for EDARKITLG and OCA2 imply older adaptive events which mayhave coincided more closely with the habitation of new envi-ronments or other cultural changes

Several hypotheses have been suggested to describe theselective drivers of skin pigmentation differences among hu-man populations including reduced UV radiation at highlatitudes and vitamin D deficiency (Loomis 1967 Jablonskiand Chaplin 2000) Estimated TMRCAs for the variants atthe OCA2 and EDAR loci among East Asians appear to beas young or younger than the KITLG variant but older thanthe LCT and ADH1B locus This suggests a selective history inEast Asian populations leading to adaptive responses forthese traits occurring after an initial colonization In somecases the dispersion of replicate MCMC estimates makes itdifficult to describe the historical significance of an observedorder for TMRCA values However the consistency of esti-mates among different populations for particular variantsadds some confidence to our modelrsquos ability to reproducethe ages which are relevant to those loci or certain geographicregions

To assess the relative concordance of our estimates withthose from previous approaches we compared our resultswith a compilation of previously published estimates basedon the time of mutation time since fixation or TMRCA ofvariants associated with the genes studied here (supplemen-tary fig 8 Supplementary Material online) The range of con-fidence intervals for these studies is largely a reflection of theassumptions invoked or relaxed for any one method as wellas the sample size and quality of the data used In principleour method extracts more information than approaches that

use summary statistics such as ABC In our empirical appli-cation we found that our method provides a gain in accuracywhile accounting for uncertainty in both the ancestral hap-lotype and its length on each chromosome Notably ourmethod provides narrower credible intervals by incorporatingthe full information from ancestral haplotype lengths derivedmutations and a reference panel of noncarrier haplotypes

Another caveat of our method is its dependence on thereference panel which is intended to serve as a representativesample of nonancestral haplotypes in the population duringthe selected allelersquos increase in frequency Four possible chal-lenges can arise 1) Segments of the ancestral selected haplo-type may be present in the reference panel due torecombination (this is more likely for alleles that havereached higher frequency) 2) the reference panel may con-tain haplotypes that are similar to the ancestral haplotypedue to low levels of genetic diversity 3) the reference panelmay be too diverged from the focal population and 4) pop-ulation connectivity and turnover may lead the ldquolocalrdquo refer-ence panel to be largely composed of migrant haplotypeswhich were not present during the selected allelersquos initial in-crease in frequency

Under scenarios 1 and 2 the background haplotypes willbe too similar to the ancestral haplotype and it may be dif-ficult for the model to discern a specific ancestry switch lo-cation This leads to fewer differences (mutations) thanexpected between the ancestral haplotype and each benefi-cial allele carrier The simulation results are consistent withthis scenario Our method tends to underestimate the trueage across a range of selection intensities and allele frequen-cies when using a local reference panel

Conversely under scenarios 3 and 4 the model will fail todescribe a recombinant haplotype in the sample of beneficialallele carriers as a mosaic of haplotypes in the reference panelAs a result the model will infer more mutation events toexplain observed differences from the ancestral haplotypeOur simulation results show this to be the case with referencepanels diverged by N generations Posterior mean estimatesare consistently older than their true value Our simulationsare perhaps pessimistic thoughmdashwe chose reference paneldivergence times of N and 05N generations approximatelycorresponding to FST values of 04 and 02 respectively Forthe smaller FST values observed in humans we expect resultsfor diverged panels to be closer to those obtained with thelocal reference panel Nonetheless future extensions to incor-porate multiple populations within the reference panel wouldbe helpful and possible by modifying the approach of Priceet al (2009) Such an approach would also enable the analysisof admixed populations (we excluded admixed samples fromour analysis of the 1000 Genomes data above)

Aside from the challenges imposed by the choice of refer-ence panel another potential source of bias lies in our tran-sition probabilities which are not conditioned on thefrequency of the selected variant In reality recombinationevents at some distance away from the selected site will onlyresult in a switch from the ancestral to background haplo-types at a rate proportional to 1 pl where pl is the fre-quency of the ancestral haplotype alleles at locus l In this way

Smith et al doi101093molbevmsy006 MBE

1012

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

some recombination events may go unobservedmdashas thebeneficial allele goes to high frequency the probability of arecombination event leading to an observable ancestral tobackground haplotype transition decreases One solutionmay be to include the frequency-dependent transition prob-abilities derived by Chen et al (2015) Under their model themutation time is estimated by assuming a deterministic lo-gistic frequency trajectory starting at 1

2N An additional benefitof using frequency trajectories would be the ability to inferposterior distributions on selection coefficients While theselection coefficient is typically assumed to be related tothe time since mutation by t1 frac14 logethNsTHORN=s we do nothave an equivalent expression for time to the common an-cestor Rather than the initial frequency being 1

2N for a newmutation our initial frequency must correspond to the fre-quency at which the TMRCA occurs Griffiths and Tavare(1994) derive a framework to model a genealogy under arbi-trary population size trajectories which should be analogousto the problem of an allele frequency trajectory and addi-tional theory on intra-allelic genealogies may be useful here aswell (Griffiths and Tavare 1994 Wiuf and Donnelly 1999 Wiuf2000 Slatkin and Rannala 2000)

Our model also assumes independence among all haplo-types in the sample in a composite-likelihood frameworkwhich is equivalent to assuming a star-genealogy (Larribeand Fearnhead 2011 Varin et al 2011) This is unlikely tobe the case when sample sizes are large or the TMRCA isold It is also unlikely to be true if the beneficial allele existedon multiple haplotypes preceding the onset of selection wasintroduced by multiple migrant haplotypes from other pop-ulations or occurred by multiple independent mutationevents (Innan and Kim 2004 Hermisson and Pennings2005 Przeworski et al 2005 Pritchard et al 2010 Berg andCoop 2015) Methods for distinguishing selection from stand-ing variation versus de novo mutation are available thatshould make it easier distinguish these cases (Messer andNeher 2012 Peter et al 2012 Messer and Petrov 2013Garud et al 2015)

If the underlying allelic genealogy is not star-like onecan expect different estimates of the TMRCA for differentsubsets of the data Here we performed multiple MCMCson resampled subsets of the data to informally diagnosewhether there are violations from the star-shape geneal-ogy assumption We speculate that exactly how theTMRCAs vary may provide insight to the underlyinghistory In cases where the TMRCA estimates for a par-ticular population are old and more variable than otherpopulations the results may be explained by structure inthe genealogy whereby recent coalescent events haveoccurred among the same ancestral haplotype beforethe common ancestor When estimates are dispersedamong resampled data sets the presence of multiple an-cestral haplotypes prior to the variantrsquos increase in fre-quency may be a better explanation Further support forthis explanation might come from comparisons to otherpopulation samples which show little to no dispersion ofestimates from resampled data sets Future work mightmake it possible to formalize this inference process

A final caveat regards the misspecification of mutation andrecombination rates TMRCA estimates are largely deter-mined by the use of accurate measures for these two param-eters In a way this provides some robustness to our methodOur age estimates depend on mutation and recombinationrates so accurate specification for one of the values can com-pensate for slight misspecification in the other In cases wherea fine-scale recombination map is unavailable we suggest us-ing a uniform recombination rate specific to the locus ofinterest (see Materials and Methods and supplementaryfig 7 Supplementary Material online) Choosing an appropri-ate mutation rate will continue to depend on current andfuture work which tries to resolve discrepancies in publishedmutation rate estimates inferred by various approaches(Segurel et al 2014)

One future direction for our method may be to explicitlyincorporate the possibility of multiple ancestral haplotypeswithin the sample Under a disease mapping frameworkMorris et al (2002) implement a similar idea in the case whereindependent disease causing mutations arise at the same lo-cus leading to independent genealogies for which they cointhe term ldquoshattered coalescentrdquo For our case beneficialmutations may also be independently derived on differenthaplotypes Alternatively a single mutation may be oldenough to reside on different haplotypes due to a sufficientamount of linked variation existing prior to the onset of se-lection Berg and Coop (2015) model selection from standingvariation to derive the distribution of haplotypes that theselected allele is present on

While we have treated the TMRCA as a parameter of in-terest our method also produces a sample of the posteriordistribution on the ancestral haplotype This could provideuseful information to estimate the frequency spectrum ofderived mutations on the ancestral haplotype Similarly thefrequency of shared recombination breakpoints could shedlight on the genealogy and how well it conforms to the star-shape assumption The extent of the ancestral haplotype ineach individual may also prove useful for identifying delete-rious alleles that have increased in frequency as a result ofstrong positive selection on linked beneficial alleles (Chun andFay 2011 Hartfield and Otto 2011) For example Huff et al(2012) describe a risk allele for Crohnrsquos disease at high fre-quency in European populations which they suggest is linkedto a beneficial allele under recent selection Similar to anadmixture mapping approach our method could be usedto identify risk loci by testing for an association betweenthe ancestral haplotype and disease status As another appli-cation identifying the ancestral haplotype may be useful inthe context of identifying a source population (or species) fora beneficial allele prior to its introduction and subsequentincrease in frequency in the recipient population

In many cases the specific site under selection may beunknown or restricted to some set of putative sites Whileour method requires the position of the selected site to bespecified future extensions could treat the selected site as arandom variable to be estimated under the same MCMCframework This framework would also be amenable to mar-ginalizing over uncertainty on the selected site

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1013

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

While we focus here on inference from modern DNAdata the increased accessibility of ancient DNA has addeda new dimension to population genetic data sets (Lazaridiset al 2014 Skoglund et al 2014 Allentoft et al 2015 Haaket al 2015 Mathieson et al 2015) Because it will remaindifficult to use ancient DNA approaches in many specieswith poor sample preservation we believe methods basedon modern DNA will continue to be useful going forwardThat said ancient DNA is providing an interesting avenue forcomparative work between inference from modern and an-cient samples For example Nakagome et al (2016) usesimulations to assess the fit of this ancient DNA polymor-phism to data simulated under their inferred parametervalues for allele age and selection intensity and they findreasonable agreement Much work still remains to fullyleverage ancient samples into population genetic inferencewhile accounting for new sources of uncertainty and poten-tial for sampling bias

Despite these challenges it is clear that our understandingof adaptive history will continue to benefit from new com-putational tools which extract insightful information from adiverse set of data sources

Materials and MethodsWe generated data using the software mssel (Hudson R per-sonal communication) which simulates a sample of haplo-types conditioned on the frequency trajectory of a selectedvariant under the structured coalescent (Hudson and Kaplan1988 Kaplan et al 1988) Trajectories were first simulatedforwards in time under a WrightndashFisher model for an additivelocus with varying strengths of selection and different endingfrequencies of the selected variant Trajectories were thentruncated to end at the first time the allele reaches a specifiedfrequency See supplementary table 1 SupplementaryMaterial online for relative ages of simulated TMRCA val-ues for different end frequencies and selection strengthsFor the results in figure 2 100 simulations were performedfor each parameter combination MCMCs were run for5000 iterations with a burn-in excluding the first 3000iterations A standard deviation of 10 was used for theproposal distribution of tca The red boxplots indicate localreference panels The blue and green boxplots indicate ref-erence panels diverged by 5N generations and 1N gener-ations respectively Each data set was simulated for a 1-Mblocus with a mutation rate of 1 108 recombination rateof 1 108

and population size of 10000 Sample sizesfor the selected and reference panels were 100 and 20respectively

For more efficient run times of the MCMC we set a max-imum number of individuals to include in the selected andreference panels to be 100 and 20 respectively In cases wherethe true number of haplotypes for either panel was greaterthan this in the full data set we resampled a subset ofhaplotypes from each population for a total of five replicatesper population For simulation results supporting the use ofthis resampling strategy see supplementary figure 5Supplementary Material online The MCMCs were run for

15000 iterations with a standard deviation of 20 for theTMRCA proposal distribution The first 9000 iterationswere removed as a burn-in leading to 6000 iterations for asample of the posterior Convergence was assessed by com-parison of MCMC replicates Figure 3 and supplementaryfigure 8 Supplementary Material online show the resultsfor all five variants along with previous point estimates and95 confidence intervals assuming a generation time of29 years (Fenner 2005) Supplementary table 3Supplementary Material online lists the mean and 95 cred-ible intervals for estimates with the highest mean posteriorprobability which we refer to in the text belowSupplementary table 4 Supplementary Material online liststhe previous estimates and confidence intervals with addi-tional details of the different approaches taken

To model recombination rate variation we used recombi-nation rates from the Decode sex-averaged recombinationmap inferred from pedigrees among families in Iceland(Kong et al 2010) Because some populations may have re-combination maps which differ from the Decode map at finescales we used a mean uniform recombination rate inferredfrom the 1-Mb region surrounding each variant The motiva-tion for this arises from how recombination rates have beenpreviously shown to remain relatively consistent among re-combination maps inferred for different populations at themegabase-scale (Broman et al 1998 Baudat et al 2010 Konget al 2010 Auton and McVean 2012) Further we found ourestimates depend mostly on having the megabase-scale re-combination rate appropriately set with little difference inmost cases for estimates obtained by modeling fine-scale re-combination at each locus (supplementary fig 7Supplementary Material online) We specify the switchingrate among background haplotypes after recombining off ofthe ancestral haplotype to be 4Nr where Nfrac14 10000 and r isthe mean recombination rate for the 1-Mb locus

For modeling mutation a challenge is that previous mu-tation rate estimates vary depending on the approach used(Segurel et al 2014) Estimates using the neutral substitutionrate between humans and chimps are more than 2 108

per bp per generation while estimates using whole-genomesequencing are closer to 1 108 As a compromise wespecify a mutation rate of 16 108

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe would like to thank Hussein Al-Asadi Arjun BiddandaAnna Di Rienzo Dick Hudson Choongwon Jeong Evan KochJoseph Marcus Shigeki Nakagome Ben Peter Mark ReppelAlex White members of the Coop laboratory at UC Davismembers of the Przeworski laboratory at ColumbiaUniversity and members of the He and Stephens laboratoriesat the University of Chicago for helpful comments JS wassupported by an NSF Graduate Research Fellowship andNational Institute of General Medical Sciences of the

Smith et al doi101093molbevmsy006 MBE

1014

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

National Institutes of Health under award numbers DGE-1144082 and T32GM007197 respectively This work wasalso supported by the National Institute of General MedicalSciences of the National Institutes of Health under awardnumbers RO1GM83098 and RO1GM107374 to GC as wellas R01HG007089 to JN

ReferencesAkey JM Eberle MA Rieder MJ Carlson CS Shriver MD Nickerson DA

Kruglyak L 2004 Population history and natural selection shapepatterns of genetic variation in 132 genes PLoS Biol 21591ndash1599

Allentoft ME Sikora M Sjogren K-G Rasmussen S Rasmussen MStenderup J Damgaard PB Schroeder H Ahlstrom T Vinner L2015 Population genomics of Bronze Age Eurasia Nature522(7555) 167ndash172

Auton A McVean G 2012 Estimating recombination rates from geneticvariation in humans In Anisimova M ed Evolutionary genomicsMethods in Molecular Biology (Methods and Protocols) vol 856New York Humana Press

Barrett RDH Hoekstra HE 2011 Molecular spandrels tests of adaptationat the genetic level Nat Rev Genet 12(11) 767ndash780

Baudat F Buard J Grey C Fledel-Alon A Ober C Przeworski M Coop GMassy BD 2010 PRDM9 is a major determinant of meiotic recom-bination hotspots in humans and mice Science 327(5967) 836ndash840

Beaumont MA Zhang W Balding DJ 2002 Approximate Bayesian com-putation in population genetics Genetics 162(4) 2025ndash2035

Beleza S Johnson NA Candille SI Absher DM Coram MA Lopes JCampos J Araujo II Anderson TM Vilhjalmsson BJ et al 2013Genetic architecture of skin and eye color in an African-Europeanadmixed population PLoS Genet 9(3) e1003372

Beleza S Santos AM McEvoy B Alves I Martinho C Cameron E ShriverMD Parra EJ Rocha J 2013 The timing of pigmentation lightening inEuropeans Mol Biol Evol 30(1) 24ndash35

Berg JJ Coop G 2015 A coalescent model for a sweep of a uniquestanding variant Genetics 201(2) 707ndash725

Bersaglieri T Sabeti PC Patterson N Vanderploeg T Schaffner SF DrakeJA Rhodes M Reich DE Hirschhorn JN 2004 Genetic signatures ofstrong recent positive selection at the lactase gene Am J Hum Genet74(6) 1111ndash1120

Broman KW Murray JC Sheffield VC White RL Weber JL 1998Comprehensive human genetic maps individual and sex-specificvariation in recombination Am J Hum Genet 63(3) 861ndash869

Bryk J Hardouin E Pugach I Hughes D Strotmann R Stoneking M MylesS 2008 Positive selection in East Asians for an EDAR allele thatenhances NF-jB activation PLoS One 3(5) e2209

Chen H Slatkin M 2013 Inferring selection intensity and allele age frommultilocus haplotype structure G3 3(8) 1429ndash1442

Chen H Hey J Slatkin M 2015 A hidden Markov model for investigatingrecent positive selection through haplotype structure Theor PopulBiol 9918ndash30

Chun S Fay JC 2011 Evidence for hitchhiking of deleterious mutationswithin the human genome PLoS Genet 7(8) e1002240

Coop G Griffiths RC 2004 Ancestral inference on gene trees underselection Theor Popul Biol 66(3) 219ndash232

Coop G Bullaughey K Luca F Przeworski M 2008 The timing of selec-tion at the human FOXP2 gene Mol Biol Evol 25(7) 1257ndash1259

Dalziel AC Rogers SM Schulte PM 2009 Linking genotypes to pheno-types and fitness how mechanistic biology can inform molecularecology Mol Ecol 18(24) 4997ndash5017

Eaton K Edwards M Krithika S Cook G Norton H Parra EJ 2015Association study confirms the role of two OCA2 polymorphismsin normal skin pigmentation variation in East Asian populations AmJ Hum Biol 27 520ndash525

Edwards M Bigham A Tan J Li S Gozdzik A Ross K Jin L Parra EJ 2010Association of the OCA2 polymorphism His615Arg with melanin

content in east Asian populations further evidence of convergentevolution of skin pigmentation PLoS Genet 6(3) e1000867

Elmer KR Meyer A 2011 Adaptation in the age of ecological genomicsinsights from parallelism and convergence Trends Ecol Evol 26(6)298ndash306

Enattah NS Sahi T Savilahti E Terwilliger JD Peltonen L Jeuroarveleuroa I 2002Identification of a variant associated with adult-type hypolactasiaNat Genet 30(2) 233ndash237

Eng MY Luczak SE Wall TL 2007 ALDH2 ADH1B and ADH1C geno-types in Asians a literature review Alcohol Res Health 30(1) 22

Fenner JN 2005 Cross-cultural estimation of the human generationinterval for use in genetics-based population divergence studiesAm J Phys Anthropol 128(2) 415ndash423

Fujimoto A Ohashi J Nishida N Miyagawa T Morishita Y Tsunoda TKimura R Tokunaga K 2008 A replication study confirmed theEDAR gene to be a major contributor to population differentiationregarding head hair thickness in Asia Hum Genet 124(2) 179ndash185

Garud NR Messer PW Buzbas EO Petrov DA 2015 Recent selectivesweeps in North American Drosophila melanogaster show signaturesof soft sweeps PLoS Genet 11(2) e1005004

Griffiths RC Tavare S 1994 Sampling theory for neutral alleles in avarying environment Philos Trans R Soc B Biol Sci 344(1310)403ndash410

Haak W Lazaridis I Patterson N Rohland N Mallick S Llamas B BrandtG Nordenfelt S Harney E Stewardson K 2015 Massive migrationfrom the steppe was a source for Indo-European languages inEurope Nature 522 207ndash211

Hartfield M Otto SP 2011 Recombination and hitchhiking of deleteri-ous alleles Evolution 65(9) 2421ndash2434

Hermisson J Pennings PS 2005 Soft sweeps molecular population ge-netics of adaptation from standing genetic variation Genetics 169(4)2335ndash2352

Hinch AG Tandon A Patterson N Song Y Rohland N Palmer CD ChenGK Wang K Buxbaum SG Akylbekova EL et al 2011 The landscapeof recombination in African Americans Nature 476(7359) 170ndash175

Hudson RR 1990 Gene genealogies and the coalescent process Oxf SurvEvol Biol 7(1) 44

Hudson RR 2007 The variance of coalescent time estimates from DNAsequences J Mol Evol 64(6) 702ndash705

Hudson RR Kaplan NL 1988 The coalescent process in models withselection and recombination Genetics 120(3) 831ndash840

Huff CD Witherspoon DJ Zhang Y Gatenbee C Denson LA KugathasanS Hakonarson H Whiting A Davis CT Wu W et al 2012 Crohnrsquosdisease and genetic hitchhiking at IBD5 Mol Biol Evol 29(1)101ndash111

Innan H Kim Y 2004 Pattern of polymorphism after strong artificialselection in a domestication event Proc Natl Acad Sci U S A 101(29)10667ndash10672

Itan Y Powell A Beaumont MA Burger J Thomas MG 2009 The originsof lactase persistence in Europe PLoS Comput Biol 5(8) e1000491

Jablonski NG Chaplin G 2000 The evolution of human skin coloration JHum Evol 39(1) 57ndash106

Jablonski NG Chaplin G 2012 Human skin pigmentation migration anddisease susceptibility Philos Trans R Soc B Biol Sci 367(1590)785ndash792

Jeong C Di Rienzo A 2014 Adaptations to local environments in mod-ern human populations Curr Opin Genet Dev 291ndash8

Kamberov YG Wang S Tan J Gerbault P Wark A Tan L Yang Y Li STang K Chen H et al 2013 Modeling recent human evolution inmice by expression of a selected EDAR variant Cell 152(4) 691ndash702

Kaplan NL Darden T Hudson RR 1988 The coalescent process inmodels with selection Genetics 120(3) 819ndash829

Kaplan NL Hudson RR Langley CH 1989 The ldquohitchhiking effectrdquo revis-ited Genetics 123(4) 887ndash899

Kimura R Yamaguchi T Takeda M Kondo O Toma T Haneji KHanihara T Matsukusa H Kawamura S Maki K et al 2009 A com-mon variation in EDAR is a genetic determinant of shovel-shapedincisors Am J Hum Genet 85(4) 528ndash535

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1015

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Kong A Thorleifsson G Gudbjartsson DF Masson G Sigurdsson AJonasdottir A Walters GB Jonasdottir A Gylfason A KristinssonKT et al 2010 Fine-scale recombination rate differences betweensexes populations and individuals Nature 467(7319) 1099ndash1103

Kuokkanen M Kokkonen J Enattah NS Ylisaukko-oja T Komu H VariloT Peltonen L Savilahti E Jeuroarveleuroa I 2006 Mutations in the translatedregion of the lactase gene (LCT) underlie congenital lactase defi-ciency Am J Hum Genet 78(2) 339ndash344

Larribe F Fearnhead P 2011 On composite likelihoods in statisticalgenetics Stat Sin 21(1) 43ndash69

Lazaridis I Patterson N Mittnik A Renaud G Mallick S Kirsanow KSudmant PH Schraiber JG Castellano S Lipson M et al 2014Ancient human genomes suggest three ancestral populations forpresent-day Europeans Nature 513(7518) 409ndash413

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357) 493ndash496

Li H Mukherjee N Soundararajan U Tarnok Z Barta C Khaliq SMohyuddin A Kajuna SLB Mehdi SQ Kidd JR et al 2007Geographically separate increases in the frequency of the derivedADH1B 47His allele in eastern and western Asia Am J Hum Genet81(4) 842ndash846

Li N Stephens M 2003 Modeling linkage disequilibrium and identifyingrecombination hotspots using single-nucleotide polymorphismdata Genetics 165(4) 2213ndash2233

Loomis WF 1967 Skin-pigment regulation of vitamin-D biosynthesis inman Science 157(3788) 501ndash506

Mathieson I Lazaridis I Rohland N Mallick S Patterson N RoodenbergSA Harney E Stewardson K Fernandes D Novak M et al 2015Genome-wide patterns of selection in 230 ancient Eurasians Nature528(7583) 499ndash503

McPeek MS Strahs A 1999 Assessment of linkage disequilibrium by thedecay of haplotype sharing with application to fine-scale geneticmapping Am J Hum Genet 65(3) 858ndash875

Meligkotsidou L Fearnhead P 2005 Maximum-likelihood estimationof coalescence times in genealogical trees Genetics 171(4)2073ndash2084

Messer PW Neher RA 2012 Estimating the strength of selective sweepsfrom deep population diversity data Genetics 191(2) 593ndash605

Messer PW Petrov DA 2013 Population genomics of rapid adaptationby soft selective sweeps Trends Ecol Evol 28(11) 659ndash669

Miller CT Beleza S Pollen AA Schluter D Kittles RA Shriver MDKingsley DM 2007 cis-Regulatory changes in kit ligand expressionand parallel evolution of pigmentation in sticklebacks and humansCell 131(6) 1179ndash1189

Morris AP Whittaker JC Balding DJ 2000 Bayesian fine-scale mapping ofdisease loci by hidden Markov models Am J Hum Genet 67(1)155ndash169

Morris AP Whittaker JC Balding DJ 2002 Fine-scale mapping of diseaseloci via shattered coalescent modeling of genealogies Am J HumGenet 70(3) 686ndash707

Nakagome S Alkorta-Aranburu G Amato R Howie B Peter BMHudson RR Di Rienzo A 2016 Estimating the ages of selectionsignals from different epochs in human history Mol Biol Evol33(3) 657ndash669

Nielsen R 2005 Molecular signatures of natural selection Annu RevGenet 39 197ndash218

Olds LC Sibley E 2003 Lactase persistence DNA variant enhances lactasepromoter activity in vitro functional role as a cis regulatory elementHum Mol Genet 12(18) 2333ndash2340

Ormond L Foll M Ewing GB Pfeifer SP Jensen JD 2016 Inferring the ageof a fixed beneficial allele Mol Ecol 25(1) 157ndash169

Osier MV Pakstis AJ Soodyall H Comas D Goldman D Odunsi AOkonofua F Parnas J Schulz LO Bertranpetit J et al 2002 A globalperspective on genetic variation at the ADH genes reveals unusualpatterns of linkage disequilibrium and diversity Am J Hum Genet71(1) 84ndash99

Peng Y Shi H Qi X-B Xiao C-J Zhong H Ma Run-lin Z Su B 2010The ADH1B Arg47His polymorphism in East Asian populations

and expansion of rice domestication in history BMC Evol Biol10(1) 15

Peter BM Huerta-Sanchez E Nielsen R 2012 Distinguishing betweenselective sweeps from standing variation and from a de novo mu-tation PLoS Genet 8(10) e1003011

Przeworski M Coop G Wall JD 2005 The signature of positive selectionon standing genetic variation Evolution 59(11) 2312ndash2323

Price AL Tandon A Patterson N Barnes KC Rafaels N Ruczinski I BeatyTH Mathias R Reich D Myers S 2009 Sensitive detection of chro-mosomal segments of distinct ancestry in admixed populationsPLoS Genet 5(6) e1000519

Pritchard JK Seielstad MT Perez-Lezaun A Feldman MW 1999Population growth of human Y chromosomes a study of Y chro-mosome microsatellites Mol Biol Evol 16(12) 1791ndash1798

Pritchard JK Pickrell JK Coop G 2010 The genetics of human adapta-tion hard sweeps soft sweeps and polygenic adaptation Curr Biol20(4) R208ndashR215

Przeworski M 2003 Estimating the time since the fixation of a beneficialallele Genetics 164(4) 1667ndash1676

Rabiner LR 1989 A tutorial on hidden Markov models and selectedapplications in speech recognition Proc IEEE 77(2) 257ndash286

Radwan J Babik W 2012 The genomics of adaptation Proc R Soc Lond BBiol Sci 279(1749) 5024ndash5028

Rannala B Reeve JP 2001 High-resolution multipoint linkage-disequilibrium mapping in the context of a human genome se-quence Am J Hum Genet 69(1) 159ndash178

Rannala B Reeve JP 2002 Joint Bayesian estimation of mutationlocation and age using linkage disequilibrium Biocomputing2003 526ndash534

Segurel L Wyman MJ Przeworski M 2014 Determinants of mutationrate variation in the human germline Annu Rev Genomics HumGenet 1547ndash70

Simoons FJ 1970 Primary adult lactose intolerance and the milkinghabit a problem in biologic and cultural interrelations Am J DigDis 15(8) 695ndash710

Skoglund P Malmstrom H Omrak A Raghavan M Valdiosera CGunther T Hall P Tambets K Parik J Sjogren K-G et al 2014Genomic diversity and admixture differs for Stone-AgeScandinavian foragers and farmers Science 344(6185) 747ndash750

Slatkin M 2001 Simulating genealogies of selected alleles in a populationof variable size Genet Res 78(01) 49ndash57

Slatkin M 2008 A Bayesian method for jointly estimating allele age andselection intensity Genet Res 90(01) 129ndash137

Slatkin M Hudson RR 1991 Pairwise comparisons of mitochondrialDNA sequences in stable and exponentially growing populationsGenetics 129(2) 555ndash562

Slatkin M Rannala B 2000 Estimating allele age Annu Rev GenomicsHum Genet 1(1) 225ndash249

Smith JM Haigh J 1974 The hitch-hiking effect of a favourable geneGenet Res 23(01) 23ndash35

Stinchcombe JR Hoekstra HE 2008 Combining population genomicsand quantitative genetics finding the genes underlying ecologicallyimportant traits Heredity 100(2) 158ndash170

Tang H Siegmund DO Shen P Oefner PJ Feldman MW 2002Frequentist estimation of coalescence times from nucleotide se-quence data using a tree-based partition Genetics 161(1) 447ndash459

Tavare S Balding DJ Griffiths RC Donnelly P 1997 Inferring coalescencetimes from DNA sequence data Genetics 145(2) 505ndash518

Tennessen JA Bigham AW OrsquoConnor TD Fu W Kenny EE Gravel SMcGee S Do R Liu X Jun G et al 2012 Evolution and functionalimpact of rare coding variation from deep sequencing of humanexomes Science 337(6090) 64ndash69

Thomson R Pritchard JK Shen P Oefner PJ Feldman MW 2000Recent common ancestry of human Y chromosomes evidencefrom DNA sequence data Proc Natl Acad Sci U S A 97(13)7360ndash7365

Tishkoff SA Reed FA Ranciaro A Voight BF Babbitt CC Silverman JSPowell K Mortensen HM Hirbo JB Osman M et al 2007

Smith et al doi101093molbevmsy006 MBE

1016

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Convergent adaptation of human lactase persistence in Africa andEurope Nat Genet 39(1) 31ndash40

Troelsen JT Olsen J Moslashller J SjeuroOstreuroOm H 2003 An upstream polymor-phism associated with lactase persistence has increased enhanceractivity Gastroenterology 125(6) 1686ndash1694

Varin C Reid N Firth D 2011 An overview of composite likelihoodmethods Stat Sin 21(1) 5ndash42

Voight BF Kudaravalli S Wen X Pritchard JK Hurst L 2006 A map ofrecent positive selection in the human genome PLoS Biol 4(3) e72

Wegmann D Kessner DE Veeramah KR Mathias RA Nicolae DL YanekLR Sun YV Torgerson DG Rafaels N Mosley T et al 2011Recombination rates in admixed individuals identified by ancestry-based inference Nat Genet 43(9) 847ndash853

Wilde S Timpson A Kirsanow K Kaiser E Kayser M Unterleuroander MHollfelder N Potekhina ID Schier W Thomas MG et al 2014 Directevidence for positive selection of skin hair and eye pigmentation inEuropeans during the last 5000 years Proc Natl Acad Sci U S A111(13) 4832ndash4837

Williamson SH Hernandez R Fledel-Alon A Zhu L Nielsen RBustamante CD 2005 Simultaneous inference of selection and pop-ulation growth from patterns of variation in the human genomeProc Natl Acad Sci U S A 102(22) 7882ndash7887

Wiuf C 2000 On the genealogy of a sample of neutral rare alleles TheorPopul Biol 58(1) 61ndash75

Wiuf C Donnelly P 1999 Conditional genealogies and the age of aneutral mutant Theor Popul Biol 56(2) 183ndash201

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1017

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

used In cases where the final beneficial allele frequency ishigher (80) our methodrsquos accuracy remains as good or bet-ter than the other methods when using a diverged referencepanel with a two-orders-of-magnitude improvement of biasunder strong selection (sfrac14 005) Estimates when using a localreference panel and a high final beneficial allele frequencyremain comparable to the other methods for strong selectionbut tend to have more bias and decreased accuracy as selec-tion strength decreases

Assuming a star-genealogy among beneficial allele carriersmay result in underestimating the variance for the posteriordistribution when there is nonindependence in our sample

To measure this affect we computed TMRCA estimates on100 bootstrap replicates for 4 simulated data sets under 2selection strengths and 2 final allele frequencies (supplemen-tary table 7 Supplementary Material online) We find closeagreement between the 95 posterior credible intervals ofthe original data and the 95 confidence intervals computedon the bootstrap estimates for a selection strength of 01 forboth final allele frequencies of 04 and 08 As expected olderTMRCAs are likely to violate the star-genealogy assumptionand in these cases we find that estimates from our originaldata are more narrow than the bootstrap confidenceintervals

Frequency 01 Frequency 03 Frequency 06 Frequency 09

minus2

minus1

0

1

2

3

0005 001 01 0005 001 01 0005 001 01 0005 001 01

log2

(Est

imat

eTr

ue V

alue

)Local 02N 05N 10N

Frequency 01 Frequency 03 Frequency 06 Frequency 09

02

04

06

08

0005 001 01 0005 001 01 0005 001 01 0005 001 01Selection Strength

Nor

mal

ized

95

Cre

dibl

e In

terv

al R

ange

FIG 2 Accuracy results from simulated data Accuracy of TMRCA point estimates and 95 credible interval ranges from posteriors inferred fromsimulated data under different strengths of selection final allele frequencies and choice of reference panel Credible interval range sizes are in unitsof generations and are normalized by the true TMRCA for each simulated data set See Materials and Methods below for simulation details

Smith et al doi101093molbevmsy006 MBE

1008

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Recombination versus Mutation as a Source ofInformationWe compared our model-based inference with simpler esti-mates of the TMRCA using the number of derived mutationson the ancestral haplotype and the mean length of the an-cestral haplotype In addition to quantifying the improve-ment our method has over these calculations this alsoserves as an ad-hoc way to understand how the relativeweight of information from mutation and recombinationaffects the performance of our method One can model thehaplotype lengths as independent and exponentially distrib-uted to derive a recombination-based estimator tr as

tr frac141

wor (11)

where r is the recombination rate and wo is the observedmean ancestral haplotype length To leverage the count ofderived mutations on the ancestral haplotype we use theThomson et al (2000) estimator In a sample of n haplotypeswith the selected allele a mutation-based estimator tm canbe calculated as

tm frac141

n

Xn

i

yi

wil (12)

where yi is the number of derived mutations on the ith hap-lotype which has length wi basepairs See Hudson (2007) for aderivation of the estimate for the variance of the Thomsonestimator

When using derived mutations uncertainty in both theancestral haplotype sequence and the length of the ancestralhaplotype on each chromosome (wi) can lead to poor esti-mation To improve inference researchers typically define arestricted ldquononrecombiningrdquo region that may reliably containderived mutations on the ancestral haplotype This has twodisadvantages 1) There is more information available in thedata which cannot be used because excess caution is neces-sary to prevent overcounting of derived mutations and2) there may still be unobserved recombination events inthis restricted locus To minimize the use of heuristics for aderived mutation approach we used our model to findmaximum-likelihood estimates of the ancestral haplotypebreakpoints using equation (2) in the model descriptionWe also used the mean posterior estimate of the ancestralhaplotype from our model to identify derived mutations Tocalculate a recombination-based estimator of the TMRCA wecalculated wo using the same maximum-likelihood estimatesof the ancestral haplotype lengths inferred for the mutationestimator

When using a local reference panel the simple mutationestimator tm consistently underestimates the true TMRCAThe recombination estimate however remains accurate(supplementary fig 6 Supplementary Material online) Wesuspect this to be a result of poor estimation of the ancestralhaplotype and violation of the star-genealogy assumption Incases where selection is weaker and the genealogy is not star-shaped derived mutations occurring early in the genealogywill be overrepresented and incorrectly inferred to be the

ancestral allele In this way high-frequency-derived alleleswill not be counted As predicted increasing selectionstrength improves mutation estimator accuracy The recom-bination estimator appears robust to this effect as long asselection is not too strong For very strong selection andyoung TMRCA values maximum-likelihood estimates ofthe haplotype lengths become constrained by the size ofthe locus studied For example in simulations with a selectionstrength of 005 and frequency of 01 the mean TMRCA isaround 100 generations Using equation (11) above the meanlength of the ancestral haplotype for a TMRCA of 100 gen-erations is 2 Mb which is twice as large as the window size weuse to make computation for our simulations feasible Using alarger window around the selected locus would amelioratethis effect

When using a diverged reference panel we find an oppositeeffect In this case the count of derived mutations results inan overestimate and the haplotype lengths yield an under-estimate We suspect this to be driven by poor matchingbetween the reference panel and the background haplotypesamong beneficial allele carriers The low probability of match-ing between the reference and background haplotypes meansthat the lengths of the ancestral haplotype are inferred toextend further than their true lengths This also leads to anoverestimate for the mutation estimator because differencesbetween the ancestral and background haplotypes are incor-rectly inferred as derived mutations on the ancestralhaplotype

Application to 1000 Genomes DataWe applied our method to five variants previously identifiedas targets of recent selection in various human populations(fig 3) Using phased data from the 1000 Genomes Project wefocused on variants that are not completely fixed in any onepopulation so that we could use a local reference panel The Liand Stephens (2003) haplotype copying model is appropriatein cases where ancestry switches occur among chromosomeswithin a single population so we excluded populations in theAmericas for which high levels of admixture are known toexist

While the simulation results described above provide someintuition for the effects of selection strength final allele fre-quency and choice of reference panel we also performedsimulations using the demographic history inferred byTennessen et al (2012) to explore the effects of nonequili-brium demographic history on our estimation accuracy (sup-plementary fig 10 Supplementary Material online) We findsubtle differences in accuracy between the two demographichistories where the nonequilibrium histories lead to negligibledifferences in mean values for log2(estimatetrue value) andlarger credible interval ranges

ADH1BA derived allele at high frequency among East Asians at theADH1B gene (rs3811801) has been shown to be functionallyrelevant for alcohol metabolism (Osier et al 2002 Eng et al2007) Previous age estimates are consistent with the timing

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1009

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

of rice domestication and fermentation approximately10000 years ago (Li et al 2007 Peng et al 2010 Peter et al2012) However a more recent estimate by Peter et al (2012)pushes this time back several thousand years to 12876(2204ndash49764) years ago Our results are consistent with an

older timing of selection as our CHB sample (Han Chinese inBeijing China) TMRCA estimate is 15377 (13763ndash17281)years Replicate chains of the MCMC are generally consistentwith the oldest estimates in the CHB sample showing themost variation among resampled data sets and the youngest

FIG 3 Comparison of TMRCA estimates with previous results Violin plots of posterior distributions for the complete set of estimated TMRCAvalues for the five variants indicated in the legend scaled to a generation time of 29 years Each row indicates a population sample from the 1000Genomes Project panel Replicate MCMCs are plotted with transparency Points and lines overlaying the violins are previous point estimates and95 confidence intervals for each of the variants indicated by a color and rs number in the legend (see supplementary tables 3 and 4Supplementary Material online) The population sample abbreviations are defined in text

Smith et al doi101093molbevmsy006 MBE

1010

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

estimate of 10841 (9720ndash12147) in the KHV sample showingthe least When using a fine-scale recombination map all ofthe ADH1B TMRCAs are inferred to be slightly older (supple-mentary fig 7 Supplementary Material online)

EDARPopulation genomic studies have repeatedly identified thegene EDAR to be under recent selection in East Asians(Akey et al 2004 Williamson et al 2005 Voight et al 2006)with a particular site (rs3827760) showing strong evidence forbeing the putative target Functional assays and allele specificexpression differences at this position show phenotypic asso-ciations to a variety of phenotypes including hair thicknessand dental morphology (Bryk et al 2008 Fujimoto et al 2008Kimura et al 2009 Kamberov et al 2013)

Our estimate of 22192 (19683ndash25736) years for the EDARallele in the CHB sample is older than ABC-based estimates of12458 (1314ndash85835) and 13224 (4899ndash50692) years madeby Bryk et al (2008) and Peter et al (2012) respectivelyKamberov et al (2013) use spatially explicit ABC and maxi-mum likelihood approaches to compute older estimates of30925 (13175ndash39575) and 36490 (34775ndash38204) We in-cluded all populations for which the variant is present includ-ing the FIN and BEB samples where it exists at low frequencyOur results for the youngest TMRCAs are found in these twolow frequency populations where the estimate in FIN is17386 (13887ndash20794) and the estimate in BEB is 18370(14325ndash22872) Among East Asian populations the oldestand youngest TMRCA estimates are found in the KHV sam-ple (25683 23169ndash28380) and CHB sample (22192 19683ndash25736)

LCTArguably the best-studied signature of selection in humans isfor an allele at the LCT gene (rs4988235) which confers lactasepersistence into adulthoodmdasha trait unique among mammalsand which is thought to be a result of cattle domesticationand the incorporation of milk into the adult diets of severalhuman populations (Enattah et al 2002 Bersaglieri et al 2004Tishkoff et al 2007) There are multiple alleles that showassociation with lactase persistence (Tishkoff et al 2007)We focused on estimating the age of the T-13910 allele pri-marily found at high frequency among Northern Europeansbut which is also found in South Asian populations In addi-tion to association with the lactase persistence phenotypethis allele has been functionally verified by in vitro experi-ments (Olds and Sibley 2003 Troelsen et al 2003Kuokkanen et al 2006)

Mathieson et al (2015) use ancient DNA collected from83 human samples to get a better understanding of the fre-quency trajectory for several adaptive alleles spanning a timescale of 8000 years For the LCT persistence allele (rs4988235)they find a sharp increase in frequency in the past 4000 yearsago While this is more recent than previous estimates anearlier TMRCA or time of mutation is still compatible withthis scenario

Our estimates using European and South Asian samplesfall between the range from 5000 to 10000 years ago which isbroadly consistent with age estimates from modern data Thecredible intervals for estimates in all of the samples havesubstantial overlap which makes any ranking on the basisof point estimates difficult We infer the PJL (Punjabi fromLahore Pakistan) sample to have the oldest TMRCA estimateof 9514 (8596ndash10383) years Itan et al (2009) use spatialmodelling to infer the geographic spread of lactase allelefrom northern to southern Europe Consistent with theirresults the youngest estimate among European populationsis found in the IBS sample at 9341 (8688ndash9989) yearsAmong all samples the youngest estimate was found inBEB at 6869 (5143ndash8809)

KITLG and OCA2The genetic basis and natural history of human skin pigmen-tation is a well-studied system with several alleles of majoreffect showing signatures consistent with being targets ofrecent selection (Jablonski and Chaplin 2012 Beleza Santoset al 2013 Wilde et al 2014 Eaton et al 2015) We focused onan allele found at high-frequency world-wide among non-African populations at the KITLG locus (rs642742) whichshows significant effects on skin pigmentation differences be-tween Europeans and Africans (Miller et al 2007) althoughmore recent work fails to find any contribution of KITLGtoward variation in skin pigmentation in a Cape VerdeAfrican-European admixed population (Beleza Johnsonet al 2013) We also estimated the TMRCA for a melanin-associated allele at the OCA2 locus (rs1800414) which is onlyfound among East Asian populations at high frequency(Edwards et al 2010)

For the KITLG variant our estimates among differentpopulations vary from 18000 to 34000 years ago with theoldest age being in the YRI (Yoruba in Ibadan Nigeria)sample (33948 28861ndash39099) The youngest TMRCA isfound in FIN at 18733 years (16675ndash20816) The nexttwo youngest estimates are also found in Africa withthe TMRCA in the MSL (Mende in Sierra Leone) samplebeing 22340 (15723ndash28950) years old and that for LWK(Luhya in Webuye Kenya) being 22784 (17922ndash28012)years old suggesting a more complex history than amodel of a simple allele frequency increase outside ofAfrica due to pigmentation related selection pressuresPrevious point estimates using rejection samplingapproaches on a Portuguese sample (32277 6003ndash80683) and East Asian sample (32045 6032ndash98165)are again most consistent with our own results on theIBS (29731 26170ndash32813) and CHB samples (2677324297ndash30141) (Beleza Santos et al 2013 Chen et al2015) Among East Asians the oldest and youngest esti-mates are again found in the JPT (28637 24297ndash30141)and KHV (24544 21643ndash27193) samples respectivelyThe TMRCA for OCA2 alleles in the JPT (1859916110ndash20786) and KHV (16370 14439ndash18102) samplesare also the oldest and youngest respectively

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1011

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

DiscussionOur method improves estimation for the timing of selectionon a beneficial allele using a tractable model of haplotypeevolution This approach leverages detailed information inthe data while remaining amenable to large sample sizesUsing both carriers and noncarriers of the allele we canmore effectively account for uncertainty in the extent ofthe ancestral haplotype and derived mutations We showthe performance of our method using simulations of differentselection strengths beneficial allele frequencies and choicesof reference panel By applying our method to five variantspreviously identified as targets of selection in human popu-lations we provide a comparison among population-specificTMRCAs This gives a more detailed account of the order inwhich populations may have acquired the variant andorexperienced selection for the trait it underlies

In that regard it is hypothesized that local selection pres-sures and a cultural shift toward agrarian societies have in-duced adaptive responses among human populations aroundthe globe The data associated with some variants seem toindicate more recent selective events than others Our resultsfor variants associated with dietary traits at the LCT andADH1B genes both imply relatively recent TMRCAs(lt 15 000 years ago) consistent with hypotheses that selec-tion on these mutations results from recent changes in hu-man diet following the spread of agriculture (Simoons 1970Peng et al 2010) In contrast the inferred TMRCAs for EDARKITLG and OCA2 imply older adaptive events which mayhave coincided more closely with the habitation of new envi-ronments or other cultural changes

Several hypotheses have been suggested to describe theselective drivers of skin pigmentation differences among hu-man populations including reduced UV radiation at highlatitudes and vitamin D deficiency (Loomis 1967 Jablonskiand Chaplin 2000) Estimated TMRCAs for the variants atthe OCA2 and EDAR loci among East Asians appear to beas young or younger than the KITLG variant but older thanthe LCT and ADH1B locus This suggests a selective history inEast Asian populations leading to adaptive responses forthese traits occurring after an initial colonization In somecases the dispersion of replicate MCMC estimates makes itdifficult to describe the historical significance of an observedorder for TMRCA values However the consistency of esti-mates among different populations for particular variantsadds some confidence to our modelrsquos ability to reproducethe ages which are relevant to those loci or certain geographicregions

To assess the relative concordance of our estimates withthose from previous approaches we compared our resultswith a compilation of previously published estimates basedon the time of mutation time since fixation or TMRCA ofvariants associated with the genes studied here (supplemen-tary fig 8 Supplementary Material online) The range of con-fidence intervals for these studies is largely a reflection of theassumptions invoked or relaxed for any one method as wellas the sample size and quality of the data used In principleour method extracts more information than approaches that

use summary statistics such as ABC In our empirical appli-cation we found that our method provides a gain in accuracywhile accounting for uncertainty in both the ancestral hap-lotype and its length on each chromosome Notably ourmethod provides narrower credible intervals by incorporatingthe full information from ancestral haplotype lengths derivedmutations and a reference panel of noncarrier haplotypes

Another caveat of our method is its dependence on thereference panel which is intended to serve as a representativesample of nonancestral haplotypes in the population duringthe selected allelersquos increase in frequency Four possible chal-lenges can arise 1) Segments of the ancestral selected haplo-type may be present in the reference panel due torecombination (this is more likely for alleles that havereached higher frequency) 2) the reference panel may con-tain haplotypes that are similar to the ancestral haplotypedue to low levels of genetic diversity 3) the reference panelmay be too diverged from the focal population and 4) pop-ulation connectivity and turnover may lead the ldquolocalrdquo refer-ence panel to be largely composed of migrant haplotypeswhich were not present during the selected allelersquos initial in-crease in frequency

Under scenarios 1 and 2 the background haplotypes willbe too similar to the ancestral haplotype and it may be dif-ficult for the model to discern a specific ancestry switch lo-cation This leads to fewer differences (mutations) thanexpected between the ancestral haplotype and each benefi-cial allele carrier The simulation results are consistent withthis scenario Our method tends to underestimate the trueage across a range of selection intensities and allele frequen-cies when using a local reference panel

Conversely under scenarios 3 and 4 the model will fail todescribe a recombinant haplotype in the sample of beneficialallele carriers as a mosaic of haplotypes in the reference panelAs a result the model will infer more mutation events toexplain observed differences from the ancestral haplotypeOur simulation results show this to be the case with referencepanels diverged by N generations Posterior mean estimatesare consistently older than their true value Our simulationsare perhaps pessimistic thoughmdashwe chose reference paneldivergence times of N and 05N generations approximatelycorresponding to FST values of 04 and 02 respectively Forthe smaller FST values observed in humans we expect resultsfor diverged panels to be closer to those obtained with thelocal reference panel Nonetheless future extensions to incor-porate multiple populations within the reference panel wouldbe helpful and possible by modifying the approach of Priceet al (2009) Such an approach would also enable the analysisof admixed populations (we excluded admixed samples fromour analysis of the 1000 Genomes data above)

Aside from the challenges imposed by the choice of refer-ence panel another potential source of bias lies in our tran-sition probabilities which are not conditioned on thefrequency of the selected variant In reality recombinationevents at some distance away from the selected site will onlyresult in a switch from the ancestral to background haplo-types at a rate proportional to 1 pl where pl is the fre-quency of the ancestral haplotype alleles at locus l In this way

Smith et al doi101093molbevmsy006 MBE

1012

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

some recombination events may go unobservedmdashas thebeneficial allele goes to high frequency the probability of arecombination event leading to an observable ancestral tobackground haplotype transition decreases One solutionmay be to include the frequency-dependent transition prob-abilities derived by Chen et al (2015) Under their model themutation time is estimated by assuming a deterministic lo-gistic frequency trajectory starting at 1

2N An additional benefitof using frequency trajectories would be the ability to inferposterior distributions on selection coefficients While theselection coefficient is typically assumed to be related tothe time since mutation by t1 frac14 logethNsTHORN=s we do nothave an equivalent expression for time to the common an-cestor Rather than the initial frequency being 1

2N for a newmutation our initial frequency must correspond to the fre-quency at which the TMRCA occurs Griffiths and Tavare(1994) derive a framework to model a genealogy under arbi-trary population size trajectories which should be analogousto the problem of an allele frequency trajectory and addi-tional theory on intra-allelic genealogies may be useful here aswell (Griffiths and Tavare 1994 Wiuf and Donnelly 1999 Wiuf2000 Slatkin and Rannala 2000)

Our model also assumes independence among all haplo-types in the sample in a composite-likelihood frameworkwhich is equivalent to assuming a star-genealogy (Larribeand Fearnhead 2011 Varin et al 2011) This is unlikely tobe the case when sample sizes are large or the TMRCA isold It is also unlikely to be true if the beneficial allele existedon multiple haplotypes preceding the onset of selection wasintroduced by multiple migrant haplotypes from other pop-ulations or occurred by multiple independent mutationevents (Innan and Kim 2004 Hermisson and Pennings2005 Przeworski et al 2005 Pritchard et al 2010 Berg andCoop 2015) Methods for distinguishing selection from stand-ing variation versus de novo mutation are available thatshould make it easier distinguish these cases (Messer andNeher 2012 Peter et al 2012 Messer and Petrov 2013Garud et al 2015)

If the underlying allelic genealogy is not star-like onecan expect different estimates of the TMRCA for differentsubsets of the data Here we performed multiple MCMCson resampled subsets of the data to informally diagnosewhether there are violations from the star-shape geneal-ogy assumption We speculate that exactly how theTMRCAs vary may provide insight to the underlyinghistory In cases where the TMRCA estimates for a par-ticular population are old and more variable than otherpopulations the results may be explained by structure inthe genealogy whereby recent coalescent events haveoccurred among the same ancestral haplotype beforethe common ancestor When estimates are dispersedamong resampled data sets the presence of multiple an-cestral haplotypes prior to the variantrsquos increase in fre-quency may be a better explanation Further support forthis explanation might come from comparisons to otherpopulation samples which show little to no dispersion ofestimates from resampled data sets Future work mightmake it possible to formalize this inference process

A final caveat regards the misspecification of mutation andrecombination rates TMRCA estimates are largely deter-mined by the use of accurate measures for these two param-eters In a way this provides some robustness to our methodOur age estimates depend on mutation and recombinationrates so accurate specification for one of the values can com-pensate for slight misspecification in the other In cases wherea fine-scale recombination map is unavailable we suggest us-ing a uniform recombination rate specific to the locus ofinterest (see Materials and Methods and supplementaryfig 7 Supplementary Material online) Choosing an appropri-ate mutation rate will continue to depend on current andfuture work which tries to resolve discrepancies in publishedmutation rate estimates inferred by various approaches(Segurel et al 2014)

One future direction for our method may be to explicitlyincorporate the possibility of multiple ancestral haplotypeswithin the sample Under a disease mapping frameworkMorris et al (2002) implement a similar idea in the case whereindependent disease causing mutations arise at the same lo-cus leading to independent genealogies for which they cointhe term ldquoshattered coalescentrdquo For our case beneficialmutations may also be independently derived on differenthaplotypes Alternatively a single mutation may be oldenough to reside on different haplotypes due to a sufficientamount of linked variation existing prior to the onset of se-lection Berg and Coop (2015) model selection from standingvariation to derive the distribution of haplotypes that theselected allele is present on

While we have treated the TMRCA as a parameter of in-terest our method also produces a sample of the posteriordistribution on the ancestral haplotype This could provideuseful information to estimate the frequency spectrum ofderived mutations on the ancestral haplotype Similarly thefrequency of shared recombination breakpoints could shedlight on the genealogy and how well it conforms to the star-shape assumption The extent of the ancestral haplotype ineach individual may also prove useful for identifying delete-rious alleles that have increased in frequency as a result ofstrong positive selection on linked beneficial alleles (Chun andFay 2011 Hartfield and Otto 2011) For example Huff et al(2012) describe a risk allele for Crohnrsquos disease at high fre-quency in European populations which they suggest is linkedto a beneficial allele under recent selection Similar to anadmixture mapping approach our method could be usedto identify risk loci by testing for an association betweenthe ancestral haplotype and disease status As another appli-cation identifying the ancestral haplotype may be useful inthe context of identifying a source population (or species) fora beneficial allele prior to its introduction and subsequentincrease in frequency in the recipient population

In many cases the specific site under selection may beunknown or restricted to some set of putative sites Whileour method requires the position of the selected site to bespecified future extensions could treat the selected site as arandom variable to be estimated under the same MCMCframework This framework would also be amenable to mar-ginalizing over uncertainty on the selected site

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1013

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

While we focus here on inference from modern DNAdata the increased accessibility of ancient DNA has addeda new dimension to population genetic data sets (Lazaridiset al 2014 Skoglund et al 2014 Allentoft et al 2015 Haaket al 2015 Mathieson et al 2015) Because it will remaindifficult to use ancient DNA approaches in many specieswith poor sample preservation we believe methods basedon modern DNA will continue to be useful going forwardThat said ancient DNA is providing an interesting avenue forcomparative work between inference from modern and an-cient samples For example Nakagome et al (2016) usesimulations to assess the fit of this ancient DNA polymor-phism to data simulated under their inferred parametervalues for allele age and selection intensity and they findreasonable agreement Much work still remains to fullyleverage ancient samples into population genetic inferencewhile accounting for new sources of uncertainty and poten-tial for sampling bias

Despite these challenges it is clear that our understandingof adaptive history will continue to benefit from new com-putational tools which extract insightful information from adiverse set of data sources

Materials and MethodsWe generated data using the software mssel (Hudson R per-sonal communication) which simulates a sample of haplo-types conditioned on the frequency trajectory of a selectedvariant under the structured coalescent (Hudson and Kaplan1988 Kaplan et al 1988) Trajectories were first simulatedforwards in time under a WrightndashFisher model for an additivelocus with varying strengths of selection and different endingfrequencies of the selected variant Trajectories were thentruncated to end at the first time the allele reaches a specifiedfrequency See supplementary table 1 SupplementaryMaterial online for relative ages of simulated TMRCA val-ues for different end frequencies and selection strengthsFor the results in figure 2 100 simulations were performedfor each parameter combination MCMCs were run for5000 iterations with a burn-in excluding the first 3000iterations A standard deviation of 10 was used for theproposal distribution of tca The red boxplots indicate localreference panels The blue and green boxplots indicate ref-erence panels diverged by 5N generations and 1N gener-ations respectively Each data set was simulated for a 1-Mblocus with a mutation rate of 1 108 recombination rateof 1 108

and population size of 10000 Sample sizesfor the selected and reference panels were 100 and 20respectively

For more efficient run times of the MCMC we set a max-imum number of individuals to include in the selected andreference panels to be 100 and 20 respectively In cases wherethe true number of haplotypes for either panel was greaterthan this in the full data set we resampled a subset ofhaplotypes from each population for a total of five replicatesper population For simulation results supporting the use ofthis resampling strategy see supplementary figure 5Supplementary Material online The MCMCs were run for

15000 iterations with a standard deviation of 20 for theTMRCA proposal distribution The first 9000 iterationswere removed as a burn-in leading to 6000 iterations for asample of the posterior Convergence was assessed by com-parison of MCMC replicates Figure 3 and supplementaryfigure 8 Supplementary Material online show the resultsfor all five variants along with previous point estimates and95 confidence intervals assuming a generation time of29 years (Fenner 2005) Supplementary table 3Supplementary Material online lists the mean and 95 cred-ible intervals for estimates with the highest mean posteriorprobability which we refer to in the text belowSupplementary table 4 Supplementary Material online liststhe previous estimates and confidence intervals with addi-tional details of the different approaches taken

To model recombination rate variation we used recombi-nation rates from the Decode sex-averaged recombinationmap inferred from pedigrees among families in Iceland(Kong et al 2010) Because some populations may have re-combination maps which differ from the Decode map at finescales we used a mean uniform recombination rate inferredfrom the 1-Mb region surrounding each variant The motiva-tion for this arises from how recombination rates have beenpreviously shown to remain relatively consistent among re-combination maps inferred for different populations at themegabase-scale (Broman et al 1998 Baudat et al 2010 Konget al 2010 Auton and McVean 2012) Further we found ourestimates depend mostly on having the megabase-scale re-combination rate appropriately set with little difference inmost cases for estimates obtained by modeling fine-scale re-combination at each locus (supplementary fig 7Supplementary Material online) We specify the switchingrate among background haplotypes after recombining off ofthe ancestral haplotype to be 4Nr where Nfrac14 10000 and r isthe mean recombination rate for the 1-Mb locus

For modeling mutation a challenge is that previous mu-tation rate estimates vary depending on the approach used(Segurel et al 2014) Estimates using the neutral substitutionrate between humans and chimps are more than 2 108

per bp per generation while estimates using whole-genomesequencing are closer to 1 108 As a compromise wespecify a mutation rate of 16 108

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe would like to thank Hussein Al-Asadi Arjun BiddandaAnna Di Rienzo Dick Hudson Choongwon Jeong Evan KochJoseph Marcus Shigeki Nakagome Ben Peter Mark ReppelAlex White members of the Coop laboratory at UC Davismembers of the Przeworski laboratory at ColumbiaUniversity and members of the He and Stephens laboratoriesat the University of Chicago for helpful comments JS wassupported by an NSF Graduate Research Fellowship andNational Institute of General Medical Sciences of the

Smith et al doi101093molbevmsy006 MBE

1014

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

National Institutes of Health under award numbers DGE-1144082 and T32GM007197 respectively This work wasalso supported by the National Institute of General MedicalSciences of the National Institutes of Health under awardnumbers RO1GM83098 and RO1GM107374 to GC as wellas R01HG007089 to JN

ReferencesAkey JM Eberle MA Rieder MJ Carlson CS Shriver MD Nickerson DA

Kruglyak L 2004 Population history and natural selection shapepatterns of genetic variation in 132 genes PLoS Biol 21591ndash1599

Allentoft ME Sikora M Sjogren K-G Rasmussen S Rasmussen MStenderup J Damgaard PB Schroeder H Ahlstrom T Vinner L2015 Population genomics of Bronze Age Eurasia Nature522(7555) 167ndash172

Auton A McVean G 2012 Estimating recombination rates from geneticvariation in humans In Anisimova M ed Evolutionary genomicsMethods in Molecular Biology (Methods and Protocols) vol 856New York Humana Press

Barrett RDH Hoekstra HE 2011 Molecular spandrels tests of adaptationat the genetic level Nat Rev Genet 12(11) 767ndash780

Baudat F Buard J Grey C Fledel-Alon A Ober C Przeworski M Coop GMassy BD 2010 PRDM9 is a major determinant of meiotic recom-bination hotspots in humans and mice Science 327(5967) 836ndash840

Beaumont MA Zhang W Balding DJ 2002 Approximate Bayesian com-putation in population genetics Genetics 162(4) 2025ndash2035

Beleza S Johnson NA Candille SI Absher DM Coram MA Lopes JCampos J Araujo II Anderson TM Vilhjalmsson BJ et al 2013Genetic architecture of skin and eye color in an African-Europeanadmixed population PLoS Genet 9(3) e1003372

Beleza S Santos AM McEvoy B Alves I Martinho C Cameron E ShriverMD Parra EJ Rocha J 2013 The timing of pigmentation lightening inEuropeans Mol Biol Evol 30(1) 24ndash35

Berg JJ Coop G 2015 A coalescent model for a sweep of a uniquestanding variant Genetics 201(2) 707ndash725

Bersaglieri T Sabeti PC Patterson N Vanderploeg T Schaffner SF DrakeJA Rhodes M Reich DE Hirschhorn JN 2004 Genetic signatures ofstrong recent positive selection at the lactase gene Am J Hum Genet74(6) 1111ndash1120

Broman KW Murray JC Sheffield VC White RL Weber JL 1998Comprehensive human genetic maps individual and sex-specificvariation in recombination Am J Hum Genet 63(3) 861ndash869

Bryk J Hardouin E Pugach I Hughes D Strotmann R Stoneking M MylesS 2008 Positive selection in East Asians for an EDAR allele thatenhances NF-jB activation PLoS One 3(5) e2209

Chen H Slatkin M 2013 Inferring selection intensity and allele age frommultilocus haplotype structure G3 3(8) 1429ndash1442

Chen H Hey J Slatkin M 2015 A hidden Markov model for investigatingrecent positive selection through haplotype structure Theor PopulBiol 9918ndash30

Chun S Fay JC 2011 Evidence for hitchhiking of deleterious mutationswithin the human genome PLoS Genet 7(8) e1002240

Coop G Griffiths RC 2004 Ancestral inference on gene trees underselection Theor Popul Biol 66(3) 219ndash232

Coop G Bullaughey K Luca F Przeworski M 2008 The timing of selec-tion at the human FOXP2 gene Mol Biol Evol 25(7) 1257ndash1259

Dalziel AC Rogers SM Schulte PM 2009 Linking genotypes to pheno-types and fitness how mechanistic biology can inform molecularecology Mol Ecol 18(24) 4997ndash5017

Eaton K Edwards M Krithika S Cook G Norton H Parra EJ 2015Association study confirms the role of two OCA2 polymorphismsin normal skin pigmentation variation in East Asian populations AmJ Hum Biol 27 520ndash525

Edwards M Bigham A Tan J Li S Gozdzik A Ross K Jin L Parra EJ 2010Association of the OCA2 polymorphism His615Arg with melanin

content in east Asian populations further evidence of convergentevolution of skin pigmentation PLoS Genet 6(3) e1000867

Elmer KR Meyer A 2011 Adaptation in the age of ecological genomicsinsights from parallelism and convergence Trends Ecol Evol 26(6)298ndash306

Enattah NS Sahi T Savilahti E Terwilliger JD Peltonen L Jeuroarveleuroa I 2002Identification of a variant associated with adult-type hypolactasiaNat Genet 30(2) 233ndash237

Eng MY Luczak SE Wall TL 2007 ALDH2 ADH1B and ADH1C geno-types in Asians a literature review Alcohol Res Health 30(1) 22

Fenner JN 2005 Cross-cultural estimation of the human generationinterval for use in genetics-based population divergence studiesAm J Phys Anthropol 128(2) 415ndash423

Fujimoto A Ohashi J Nishida N Miyagawa T Morishita Y Tsunoda TKimura R Tokunaga K 2008 A replication study confirmed theEDAR gene to be a major contributor to population differentiationregarding head hair thickness in Asia Hum Genet 124(2) 179ndash185

Garud NR Messer PW Buzbas EO Petrov DA 2015 Recent selectivesweeps in North American Drosophila melanogaster show signaturesof soft sweeps PLoS Genet 11(2) e1005004

Griffiths RC Tavare S 1994 Sampling theory for neutral alleles in avarying environment Philos Trans R Soc B Biol Sci 344(1310)403ndash410

Haak W Lazaridis I Patterson N Rohland N Mallick S Llamas B BrandtG Nordenfelt S Harney E Stewardson K 2015 Massive migrationfrom the steppe was a source for Indo-European languages inEurope Nature 522 207ndash211

Hartfield M Otto SP 2011 Recombination and hitchhiking of deleteri-ous alleles Evolution 65(9) 2421ndash2434

Hermisson J Pennings PS 2005 Soft sweeps molecular population ge-netics of adaptation from standing genetic variation Genetics 169(4)2335ndash2352

Hinch AG Tandon A Patterson N Song Y Rohland N Palmer CD ChenGK Wang K Buxbaum SG Akylbekova EL et al 2011 The landscapeof recombination in African Americans Nature 476(7359) 170ndash175

Hudson RR 1990 Gene genealogies and the coalescent process Oxf SurvEvol Biol 7(1) 44

Hudson RR 2007 The variance of coalescent time estimates from DNAsequences J Mol Evol 64(6) 702ndash705

Hudson RR Kaplan NL 1988 The coalescent process in models withselection and recombination Genetics 120(3) 831ndash840

Huff CD Witherspoon DJ Zhang Y Gatenbee C Denson LA KugathasanS Hakonarson H Whiting A Davis CT Wu W et al 2012 Crohnrsquosdisease and genetic hitchhiking at IBD5 Mol Biol Evol 29(1)101ndash111

Innan H Kim Y 2004 Pattern of polymorphism after strong artificialselection in a domestication event Proc Natl Acad Sci U S A 101(29)10667ndash10672

Itan Y Powell A Beaumont MA Burger J Thomas MG 2009 The originsof lactase persistence in Europe PLoS Comput Biol 5(8) e1000491

Jablonski NG Chaplin G 2000 The evolution of human skin coloration JHum Evol 39(1) 57ndash106

Jablonski NG Chaplin G 2012 Human skin pigmentation migration anddisease susceptibility Philos Trans R Soc B Biol Sci 367(1590)785ndash792

Jeong C Di Rienzo A 2014 Adaptations to local environments in mod-ern human populations Curr Opin Genet Dev 291ndash8

Kamberov YG Wang S Tan J Gerbault P Wark A Tan L Yang Y Li STang K Chen H et al 2013 Modeling recent human evolution inmice by expression of a selected EDAR variant Cell 152(4) 691ndash702

Kaplan NL Darden T Hudson RR 1988 The coalescent process inmodels with selection Genetics 120(3) 819ndash829

Kaplan NL Hudson RR Langley CH 1989 The ldquohitchhiking effectrdquo revis-ited Genetics 123(4) 887ndash899

Kimura R Yamaguchi T Takeda M Kondo O Toma T Haneji KHanihara T Matsukusa H Kawamura S Maki K et al 2009 A com-mon variation in EDAR is a genetic determinant of shovel-shapedincisors Am J Hum Genet 85(4) 528ndash535

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1015

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Kong A Thorleifsson G Gudbjartsson DF Masson G Sigurdsson AJonasdottir A Walters GB Jonasdottir A Gylfason A KristinssonKT et al 2010 Fine-scale recombination rate differences betweensexes populations and individuals Nature 467(7319) 1099ndash1103

Kuokkanen M Kokkonen J Enattah NS Ylisaukko-oja T Komu H VariloT Peltonen L Savilahti E Jeuroarveleuroa I 2006 Mutations in the translatedregion of the lactase gene (LCT) underlie congenital lactase defi-ciency Am J Hum Genet 78(2) 339ndash344

Larribe F Fearnhead P 2011 On composite likelihoods in statisticalgenetics Stat Sin 21(1) 43ndash69

Lazaridis I Patterson N Mittnik A Renaud G Mallick S Kirsanow KSudmant PH Schraiber JG Castellano S Lipson M et al 2014Ancient human genomes suggest three ancestral populations forpresent-day Europeans Nature 513(7518) 409ndash413

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357) 493ndash496

Li H Mukherjee N Soundararajan U Tarnok Z Barta C Khaliq SMohyuddin A Kajuna SLB Mehdi SQ Kidd JR et al 2007Geographically separate increases in the frequency of the derivedADH1B 47His allele in eastern and western Asia Am J Hum Genet81(4) 842ndash846

Li N Stephens M 2003 Modeling linkage disequilibrium and identifyingrecombination hotspots using single-nucleotide polymorphismdata Genetics 165(4) 2213ndash2233

Loomis WF 1967 Skin-pigment regulation of vitamin-D biosynthesis inman Science 157(3788) 501ndash506

Mathieson I Lazaridis I Rohland N Mallick S Patterson N RoodenbergSA Harney E Stewardson K Fernandes D Novak M et al 2015Genome-wide patterns of selection in 230 ancient Eurasians Nature528(7583) 499ndash503

McPeek MS Strahs A 1999 Assessment of linkage disequilibrium by thedecay of haplotype sharing with application to fine-scale geneticmapping Am J Hum Genet 65(3) 858ndash875

Meligkotsidou L Fearnhead P 2005 Maximum-likelihood estimationof coalescence times in genealogical trees Genetics 171(4)2073ndash2084

Messer PW Neher RA 2012 Estimating the strength of selective sweepsfrom deep population diversity data Genetics 191(2) 593ndash605

Messer PW Petrov DA 2013 Population genomics of rapid adaptationby soft selective sweeps Trends Ecol Evol 28(11) 659ndash669

Miller CT Beleza S Pollen AA Schluter D Kittles RA Shriver MDKingsley DM 2007 cis-Regulatory changes in kit ligand expressionand parallel evolution of pigmentation in sticklebacks and humansCell 131(6) 1179ndash1189

Morris AP Whittaker JC Balding DJ 2000 Bayesian fine-scale mapping ofdisease loci by hidden Markov models Am J Hum Genet 67(1)155ndash169

Morris AP Whittaker JC Balding DJ 2002 Fine-scale mapping of diseaseloci via shattered coalescent modeling of genealogies Am J HumGenet 70(3) 686ndash707

Nakagome S Alkorta-Aranburu G Amato R Howie B Peter BMHudson RR Di Rienzo A 2016 Estimating the ages of selectionsignals from different epochs in human history Mol Biol Evol33(3) 657ndash669

Nielsen R 2005 Molecular signatures of natural selection Annu RevGenet 39 197ndash218

Olds LC Sibley E 2003 Lactase persistence DNA variant enhances lactasepromoter activity in vitro functional role as a cis regulatory elementHum Mol Genet 12(18) 2333ndash2340

Ormond L Foll M Ewing GB Pfeifer SP Jensen JD 2016 Inferring the ageof a fixed beneficial allele Mol Ecol 25(1) 157ndash169

Osier MV Pakstis AJ Soodyall H Comas D Goldman D Odunsi AOkonofua F Parnas J Schulz LO Bertranpetit J et al 2002 A globalperspective on genetic variation at the ADH genes reveals unusualpatterns of linkage disequilibrium and diversity Am J Hum Genet71(1) 84ndash99

Peng Y Shi H Qi X-B Xiao C-J Zhong H Ma Run-lin Z Su B 2010The ADH1B Arg47His polymorphism in East Asian populations

and expansion of rice domestication in history BMC Evol Biol10(1) 15

Peter BM Huerta-Sanchez E Nielsen R 2012 Distinguishing betweenselective sweeps from standing variation and from a de novo mu-tation PLoS Genet 8(10) e1003011

Przeworski M Coop G Wall JD 2005 The signature of positive selectionon standing genetic variation Evolution 59(11) 2312ndash2323

Price AL Tandon A Patterson N Barnes KC Rafaels N Ruczinski I BeatyTH Mathias R Reich D Myers S 2009 Sensitive detection of chro-mosomal segments of distinct ancestry in admixed populationsPLoS Genet 5(6) e1000519

Pritchard JK Seielstad MT Perez-Lezaun A Feldman MW 1999Population growth of human Y chromosomes a study of Y chro-mosome microsatellites Mol Biol Evol 16(12) 1791ndash1798

Pritchard JK Pickrell JK Coop G 2010 The genetics of human adapta-tion hard sweeps soft sweeps and polygenic adaptation Curr Biol20(4) R208ndashR215

Przeworski M 2003 Estimating the time since the fixation of a beneficialallele Genetics 164(4) 1667ndash1676

Rabiner LR 1989 A tutorial on hidden Markov models and selectedapplications in speech recognition Proc IEEE 77(2) 257ndash286

Radwan J Babik W 2012 The genomics of adaptation Proc R Soc Lond BBiol Sci 279(1749) 5024ndash5028

Rannala B Reeve JP 2001 High-resolution multipoint linkage-disequilibrium mapping in the context of a human genome se-quence Am J Hum Genet 69(1) 159ndash178

Rannala B Reeve JP 2002 Joint Bayesian estimation of mutationlocation and age using linkage disequilibrium Biocomputing2003 526ndash534

Segurel L Wyman MJ Przeworski M 2014 Determinants of mutationrate variation in the human germline Annu Rev Genomics HumGenet 1547ndash70

Simoons FJ 1970 Primary adult lactose intolerance and the milkinghabit a problem in biologic and cultural interrelations Am J DigDis 15(8) 695ndash710

Skoglund P Malmstrom H Omrak A Raghavan M Valdiosera CGunther T Hall P Tambets K Parik J Sjogren K-G et al 2014Genomic diversity and admixture differs for Stone-AgeScandinavian foragers and farmers Science 344(6185) 747ndash750

Slatkin M 2001 Simulating genealogies of selected alleles in a populationof variable size Genet Res 78(01) 49ndash57

Slatkin M 2008 A Bayesian method for jointly estimating allele age andselection intensity Genet Res 90(01) 129ndash137

Slatkin M Hudson RR 1991 Pairwise comparisons of mitochondrialDNA sequences in stable and exponentially growing populationsGenetics 129(2) 555ndash562

Slatkin M Rannala B 2000 Estimating allele age Annu Rev GenomicsHum Genet 1(1) 225ndash249

Smith JM Haigh J 1974 The hitch-hiking effect of a favourable geneGenet Res 23(01) 23ndash35

Stinchcombe JR Hoekstra HE 2008 Combining population genomicsand quantitative genetics finding the genes underlying ecologicallyimportant traits Heredity 100(2) 158ndash170

Tang H Siegmund DO Shen P Oefner PJ Feldman MW 2002Frequentist estimation of coalescence times from nucleotide se-quence data using a tree-based partition Genetics 161(1) 447ndash459

Tavare S Balding DJ Griffiths RC Donnelly P 1997 Inferring coalescencetimes from DNA sequence data Genetics 145(2) 505ndash518

Tennessen JA Bigham AW OrsquoConnor TD Fu W Kenny EE Gravel SMcGee S Do R Liu X Jun G et al 2012 Evolution and functionalimpact of rare coding variation from deep sequencing of humanexomes Science 337(6090) 64ndash69

Thomson R Pritchard JK Shen P Oefner PJ Feldman MW 2000Recent common ancestry of human Y chromosomes evidencefrom DNA sequence data Proc Natl Acad Sci U S A 97(13)7360ndash7365

Tishkoff SA Reed FA Ranciaro A Voight BF Babbitt CC Silverman JSPowell K Mortensen HM Hirbo JB Osman M et al 2007

Smith et al doi101093molbevmsy006 MBE

1016

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Convergent adaptation of human lactase persistence in Africa andEurope Nat Genet 39(1) 31ndash40

Troelsen JT Olsen J Moslashller J SjeuroOstreuroOm H 2003 An upstream polymor-phism associated with lactase persistence has increased enhanceractivity Gastroenterology 125(6) 1686ndash1694

Varin C Reid N Firth D 2011 An overview of composite likelihoodmethods Stat Sin 21(1) 5ndash42

Voight BF Kudaravalli S Wen X Pritchard JK Hurst L 2006 A map ofrecent positive selection in the human genome PLoS Biol 4(3) e72

Wegmann D Kessner DE Veeramah KR Mathias RA Nicolae DL YanekLR Sun YV Torgerson DG Rafaels N Mosley T et al 2011Recombination rates in admixed individuals identified by ancestry-based inference Nat Genet 43(9) 847ndash853

Wilde S Timpson A Kirsanow K Kaiser E Kayser M Unterleuroander MHollfelder N Potekhina ID Schier W Thomas MG et al 2014 Directevidence for positive selection of skin hair and eye pigmentation inEuropeans during the last 5000 years Proc Natl Acad Sci U S A111(13) 4832ndash4837

Williamson SH Hernandez R Fledel-Alon A Zhu L Nielsen RBustamante CD 2005 Simultaneous inference of selection and pop-ulation growth from patterns of variation in the human genomeProc Natl Acad Sci U S A 102(22) 7882ndash7887

Wiuf C 2000 On the genealogy of a sample of neutral rare alleles TheorPopul Biol 58(1) 61ndash75

Wiuf C Donnelly P 1999 Conditional genealogies and the age of aneutral mutant Theor Popul Biol 56(2) 183ndash201

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1017

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Recombination versus Mutation as a Source ofInformationWe compared our model-based inference with simpler esti-mates of the TMRCA using the number of derived mutationson the ancestral haplotype and the mean length of the an-cestral haplotype In addition to quantifying the improve-ment our method has over these calculations this alsoserves as an ad-hoc way to understand how the relativeweight of information from mutation and recombinationaffects the performance of our method One can model thehaplotype lengths as independent and exponentially distrib-uted to derive a recombination-based estimator tr as

tr frac141

wor (11)

where r is the recombination rate and wo is the observedmean ancestral haplotype length To leverage the count ofderived mutations on the ancestral haplotype we use theThomson et al (2000) estimator In a sample of n haplotypeswith the selected allele a mutation-based estimator tm canbe calculated as

tm frac141

n

Xn

i

yi

wil (12)

where yi is the number of derived mutations on the ith hap-lotype which has length wi basepairs See Hudson (2007) for aderivation of the estimate for the variance of the Thomsonestimator

When using derived mutations uncertainty in both theancestral haplotype sequence and the length of the ancestralhaplotype on each chromosome (wi) can lead to poor esti-mation To improve inference researchers typically define arestricted ldquononrecombiningrdquo region that may reliably containderived mutations on the ancestral haplotype This has twodisadvantages 1) There is more information available in thedata which cannot be used because excess caution is neces-sary to prevent overcounting of derived mutations and2) there may still be unobserved recombination events inthis restricted locus To minimize the use of heuristics for aderived mutation approach we used our model to findmaximum-likelihood estimates of the ancestral haplotypebreakpoints using equation (2) in the model descriptionWe also used the mean posterior estimate of the ancestralhaplotype from our model to identify derived mutations Tocalculate a recombination-based estimator of the TMRCA wecalculated wo using the same maximum-likelihood estimatesof the ancestral haplotype lengths inferred for the mutationestimator

When using a local reference panel the simple mutationestimator tm consistently underestimates the true TMRCAThe recombination estimate however remains accurate(supplementary fig 6 Supplementary Material online) Wesuspect this to be a result of poor estimation of the ancestralhaplotype and violation of the star-genealogy assumption Incases where selection is weaker and the genealogy is not star-shaped derived mutations occurring early in the genealogywill be overrepresented and incorrectly inferred to be the

ancestral allele In this way high-frequency-derived alleleswill not be counted As predicted increasing selectionstrength improves mutation estimator accuracy The recom-bination estimator appears robust to this effect as long asselection is not too strong For very strong selection andyoung TMRCA values maximum-likelihood estimates ofthe haplotype lengths become constrained by the size ofthe locus studied For example in simulations with a selectionstrength of 005 and frequency of 01 the mean TMRCA isaround 100 generations Using equation (11) above the meanlength of the ancestral haplotype for a TMRCA of 100 gen-erations is 2 Mb which is twice as large as the window size weuse to make computation for our simulations feasible Using alarger window around the selected locus would amelioratethis effect

When using a diverged reference panel we find an oppositeeffect In this case the count of derived mutations results inan overestimate and the haplotype lengths yield an under-estimate We suspect this to be driven by poor matchingbetween the reference panel and the background haplotypesamong beneficial allele carriers The low probability of match-ing between the reference and background haplotypes meansthat the lengths of the ancestral haplotype are inferred toextend further than their true lengths This also leads to anoverestimate for the mutation estimator because differencesbetween the ancestral and background haplotypes are incor-rectly inferred as derived mutations on the ancestralhaplotype

Application to 1000 Genomes DataWe applied our method to five variants previously identifiedas targets of recent selection in various human populations(fig 3) Using phased data from the 1000 Genomes Project wefocused on variants that are not completely fixed in any onepopulation so that we could use a local reference panel The Liand Stephens (2003) haplotype copying model is appropriatein cases where ancestry switches occur among chromosomeswithin a single population so we excluded populations in theAmericas for which high levels of admixture are known toexist

While the simulation results described above provide someintuition for the effects of selection strength final allele fre-quency and choice of reference panel we also performedsimulations using the demographic history inferred byTennessen et al (2012) to explore the effects of nonequili-brium demographic history on our estimation accuracy (sup-plementary fig 10 Supplementary Material online) We findsubtle differences in accuracy between the two demographichistories where the nonequilibrium histories lead to negligibledifferences in mean values for log2(estimatetrue value) andlarger credible interval ranges

ADH1BA derived allele at high frequency among East Asians at theADH1B gene (rs3811801) has been shown to be functionallyrelevant for alcohol metabolism (Osier et al 2002 Eng et al2007) Previous age estimates are consistent with the timing

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1009

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

of rice domestication and fermentation approximately10000 years ago (Li et al 2007 Peng et al 2010 Peter et al2012) However a more recent estimate by Peter et al (2012)pushes this time back several thousand years to 12876(2204ndash49764) years ago Our results are consistent with an

older timing of selection as our CHB sample (Han Chinese inBeijing China) TMRCA estimate is 15377 (13763ndash17281)years Replicate chains of the MCMC are generally consistentwith the oldest estimates in the CHB sample showing themost variation among resampled data sets and the youngest

FIG 3 Comparison of TMRCA estimates with previous results Violin plots of posterior distributions for the complete set of estimated TMRCAvalues for the five variants indicated in the legend scaled to a generation time of 29 years Each row indicates a population sample from the 1000Genomes Project panel Replicate MCMCs are plotted with transparency Points and lines overlaying the violins are previous point estimates and95 confidence intervals for each of the variants indicated by a color and rs number in the legend (see supplementary tables 3 and 4Supplementary Material online) The population sample abbreviations are defined in text

Smith et al doi101093molbevmsy006 MBE

1010

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

estimate of 10841 (9720ndash12147) in the KHV sample showingthe least When using a fine-scale recombination map all ofthe ADH1B TMRCAs are inferred to be slightly older (supple-mentary fig 7 Supplementary Material online)

EDARPopulation genomic studies have repeatedly identified thegene EDAR to be under recent selection in East Asians(Akey et al 2004 Williamson et al 2005 Voight et al 2006)with a particular site (rs3827760) showing strong evidence forbeing the putative target Functional assays and allele specificexpression differences at this position show phenotypic asso-ciations to a variety of phenotypes including hair thicknessand dental morphology (Bryk et al 2008 Fujimoto et al 2008Kimura et al 2009 Kamberov et al 2013)

Our estimate of 22192 (19683ndash25736) years for the EDARallele in the CHB sample is older than ABC-based estimates of12458 (1314ndash85835) and 13224 (4899ndash50692) years madeby Bryk et al (2008) and Peter et al (2012) respectivelyKamberov et al (2013) use spatially explicit ABC and maxi-mum likelihood approaches to compute older estimates of30925 (13175ndash39575) and 36490 (34775ndash38204) We in-cluded all populations for which the variant is present includ-ing the FIN and BEB samples where it exists at low frequencyOur results for the youngest TMRCAs are found in these twolow frequency populations where the estimate in FIN is17386 (13887ndash20794) and the estimate in BEB is 18370(14325ndash22872) Among East Asian populations the oldestand youngest TMRCA estimates are found in the KHV sam-ple (25683 23169ndash28380) and CHB sample (22192 19683ndash25736)

LCTArguably the best-studied signature of selection in humans isfor an allele at the LCT gene (rs4988235) which confers lactasepersistence into adulthoodmdasha trait unique among mammalsand which is thought to be a result of cattle domesticationand the incorporation of milk into the adult diets of severalhuman populations (Enattah et al 2002 Bersaglieri et al 2004Tishkoff et al 2007) There are multiple alleles that showassociation with lactase persistence (Tishkoff et al 2007)We focused on estimating the age of the T-13910 allele pri-marily found at high frequency among Northern Europeansbut which is also found in South Asian populations In addi-tion to association with the lactase persistence phenotypethis allele has been functionally verified by in vitro experi-ments (Olds and Sibley 2003 Troelsen et al 2003Kuokkanen et al 2006)

Mathieson et al (2015) use ancient DNA collected from83 human samples to get a better understanding of the fre-quency trajectory for several adaptive alleles spanning a timescale of 8000 years For the LCT persistence allele (rs4988235)they find a sharp increase in frequency in the past 4000 yearsago While this is more recent than previous estimates anearlier TMRCA or time of mutation is still compatible withthis scenario

Our estimates using European and South Asian samplesfall between the range from 5000 to 10000 years ago which isbroadly consistent with age estimates from modern data Thecredible intervals for estimates in all of the samples havesubstantial overlap which makes any ranking on the basisof point estimates difficult We infer the PJL (Punjabi fromLahore Pakistan) sample to have the oldest TMRCA estimateof 9514 (8596ndash10383) years Itan et al (2009) use spatialmodelling to infer the geographic spread of lactase allelefrom northern to southern Europe Consistent with theirresults the youngest estimate among European populationsis found in the IBS sample at 9341 (8688ndash9989) yearsAmong all samples the youngest estimate was found inBEB at 6869 (5143ndash8809)

KITLG and OCA2The genetic basis and natural history of human skin pigmen-tation is a well-studied system with several alleles of majoreffect showing signatures consistent with being targets ofrecent selection (Jablonski and Chaplin 2012 Beleza Santoset al 2013 Wilde et al 2014 Eaton et al 2015) We focused onan allele found at high-frequency world-wide among non-African populations at the KITLG locus (rs642742) whichshows significant effects on skin pigmentation differences be-tween Europeans and Africans (Miller et al 2007) althoughmore recent work fails to find any contribution of KITLGtoward variation in skin pigmentation in a Cape VerdeAfrican-European admixed population (Beleza Johnsonet al 2013) We also estimated the TMRCA for a melanin-associated allele at the OCA2 locus (rs1800414) which is onlyfound among East Asian populations at high frequency(Edwards et al 2010)

For the KITLG variant our estimates among differentpopulations vary from 18000 to 34000 years ago with theoldest age being in the YRI (Yoruba in Ibadan Nigeria)sample (33948 28861ndash39099) The youngest TMRCA isfound in FIN at 18733 years (16675ndash20816) The nexttwo youngest estimates are also found in Africa withthe TMRCA in the MSL (Mende in Sierra Leone) samplebeing 22340 (15723ndash28950) years old and that for LWK(Luhya in Webuye Kenya) being 22784 (17922ndash28012)years old suggesting a more complex history than amodel of a simple allele frequency increase outside ofAfrica due to pigmentation related selection pressuresPrevious point estimates using rejection samplingapproaches on a Portuguese sample (32277 6003ndash80683) and East Asian sample (32045 6032ndash98165)are again most consistent with our own results on theIBS (29731 26170ndash32813) and CHB samples (2677324297ndash30141) (Beleza Santos et al 2013 Chen et al2015) Among East Asians the oldest and youngest esti-mates are again found in the JPT (28637 24297ndash30141)and KHV (24544 21643ndash27193) samples respectivelyThe TMRCA for OCA2 alleles in the JPT (1859916110ndash20786) and KHV (16370 14439ndash18102) samplesare also the oldest and youngest respectively

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1011

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

DiscussionOur method improves estimation for the timing of selectionon a beneficial allele using a tractable model of haplotypeevolution This approach leverages detailed information inthe data while remaining amenable to large sample sizesUsing both carriers and noncarriers of the allele we canmore effectively account for uncertainty in the extent ofthe ancestral haplotype and derived mutations We showthe performance of our method using simulations of differentselection strengths beneficial allele frequencies and choicesof reference panel By applying our method to five variantspreviously identified as targets of selection in human popu-lations we provide a comparison among population-specificTMRCAs This gives a more detailed account of the order inwhich populations may have acquired the variant andorexperienced selection for the trait it underlies

In that regard it is hypothesized that local selection pres-sures and a cultural shift toward agrarian societies have in-duced adaptive responses among human populations aroundthe globe The data associated with some variants seem toindicate more recent selective events than others Our resultsfor variants associated with dietary traits at the LCT andADH1B genes both imply relatively recent TMRCAs(lt 15 000 years ago) consistent with hypotheses that selec-tion on these mutations results from recent changes in hu-man diet following the spread of agriculture (Simoons 1970Peng et al 2010) In contrast the inferred TMRCAs for EDARKITLG and OCA2 imply older adaptive events which mayhave coincided more closely with the habitation of new envi-ronments or other cultural changes

Several hypotheses have been suggested to describe theselective drivers of skin pigmentation differences among hu-man populations including reduced UV radiation at highlatitudes and vitamin D deficiency (Loomis 1967 Jablonskiand Chaplin 2000) Estimated TMRCAs for the variants atthe OCA2 and EDAR loci among East Asians appear to beas young or younger than the KITLG variant but older thanthe LCT and ADH1B locus This suggests a selective history inEast Asian populations leading to adaptive responses forthese traits occurring after an initial colonization In somecases the dispersion of replicate MCMC estimates makes itdifficult to describe the historical significance of an observedorder for TMRCA values However the consistency of esti-mates among different populations for particular variantsadds some confidence to our modelrsquos ability to reproducethe ages which are relevant to those loci or certain geographicregions

To assess the relative concordance of our estimates withthose from previous approaches we compared our resultswith a compilation of previously published estimates basedon the time of mutation time since fixation or TMRCA ofvariants associated with the genes studied here (supplemen-tary fig 8 Supplementary Material online) The range of con-fidence intervals for these studies is largely a reflection of theassumptions invoked or relaxed for any one method as wellas the sample size and quality of the data used In principleour method extracts more information than approaches that

use summary statistics such as ABC In our empirical appli-cation we found that our method provides a gain in accuracywhile accounting for uncertainty in both the ancestral hap-lotype and its length on each chromosome Notably ourmethod provides narrower credible intervals by incorporatingthe full information from ancestral haplotype lengths derivedmutations and a reference panel of noncarrier haplotypes

Another caveat of our method is its dependence on thereference panel which is intended to serve as a representativesample of nonancestral haplotypes in the population duringthe selected allelersquos increase in frequency Four possible chal-lenges can arise 1) Segments of the ancestral selected haplo-type may be present in the reference panel due torecombination (this is more likely for alleles that havereached higher frequency) 2) the reference panel may con-tain haplotypes that are similar to the ancestral haplotypedue to low levels of genetic diversity 3) the reference panelmay be too diverged from the focal population and 4) pop-ulation connectivity and turnover may lead the ldquolocalrdquo refer-ence panel to be largely composed of migrant haplotypeswhich were not present during the selected allelersquos initial in-crease in frequency

Under scenarios 1 and 2 the background haplotypes willbe too similar to the ancestral haplotype and it may be dif-ficult for the model to discern a specific ancestry switch lo-cation This leads to fewer differences (mutations) thanexpected between the ancestral haplotype and each benefi-cial allele carrier The simulation results are consistent withthis scenario Our method tends to underestimate the trueage across a range of selection intensities and allele frequen-cies when using a local reference panel

Conversely under scenarios 3 and 4 the model will fail todescribe a recombinant haplotype in the sample of beneficialallele carriers as a mosaic of haplotypes in the reference panelAs a result the model will infer more mutation events toexplain observed differences from the ancestral haplotypeOur simulation results show this to be the case with referencepanels diverged by N generations Posterior mean estimatesare consistently older than their true value Our simulationsare perhaps pessimistic thoughmdashwe chose reference paneldivergence times of N and 05N generations approximatelycorresponding to FST values of 04 and 02 respectively Forthe smaller FST values observed in humans we expect resultsfor diverged panels to be closer to those obtained with thelocal reference panel Nonetheless future extensions to incor-porate multiple populations within the reference panel wouldbe helpful and possible by modifying the approach of Priceet al (2009) Such an approach would also enable the analysisof admixed populations (we excluded admixed samples fromour analysis of the 1000 Genomes data above)

Aside from the challenges imposed by the choice of refer-ence panel another potential source of bias lies in our tran-sition probabilities which are not conditioned on thefrequency of the selected variant In reality recombinationevents at some distance away from the selected site will onlyresult in a switch from the ancestral to background haplo-types at a rate proportional to 1 pl where pl is the fre-quency of the ancestral haplotype alleles at locus l In this way

Smith et al doi101093molbevmsy006 MBE

1012

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

some recombination events may go unobservedmdashas thebeneficial allele goes to high frequency the probability of arecombination event leading to an observable ancestral tobackground haplotype transition decreases One solutionmay be to include the frequency-dependent transition prob-abilities derived by Chen et al (2015) Under their model themutation time is estimated by assuming a deterministic lo-gistic frequency trajectory starting at 1

2N An additional benefitof using frequency trajectories would be the ability to inferposterior distributions on selection coefficients While theselection coefficient is typically assumed to be related tothe time since mutation by t1 frac14 logethNsTHORN=s we do nothave an equivalent expression for time to the common an-cestor Rather than the initial frequency being 1

2N for a newmutation our initial frequency must correspond to the fre-quency at which the TMRCA occurs Griffiths and Tavare(1994) derive a framework to model a genealogy under arbi-trary population size trajectories which should be analogousto the problem of an allele frequency trajectory and addi-tional theory on intra-allelic genealogies may be useful here aswell (Griffiths and Tavare 1994 Wiuf and Donnelly 1999 Wiuf2000 Slatkin and Rannala 2000)

Our model also assumes independence among all haplo-types in the sample in a composite-likelihood frameworkwhich is equivalent to assuming a star-genealogy (Larribeand Fearnhead 2011 Varin et al 2011) This is unlikely tobe the case when sample sizes are large or the TMRCA isold It is also unlikely to be true if the beneficial allele existedon multiple haplotypes preceding the onset of selection wasintroduced by multiple migrant haplotypes from other pop-ulations or occurred by multiple independent mutationevents (Innan and Kim 2004 Hermisson and Pennings2005 Przeworski et al 2005 Pritchard et al 2010 Berg andCoop 2015) Methods for distinguishing selection from stand-ing variation versus de novo mutation are available thatshould make it easier distinguish these cases (Messer andNeher 2012 Peter et al 2012 Messer and Petrov 2013Garud et al 2015)

If the underlying allelic genealogy is not star-like onecan expect different estimates of the TMRCA for differentsubsets of the data Here we performed multiple MCMCson resampled subsets of the data to informally diagnosewhether there are violations from the star-shape geneal-ogy assumption We speculate that exactly how theTMRCAs vary may provide insight to the underlyinghistory In cases where the TMRCA estimates for a par-ticular population are old and more variable than otherpopulations the results may be explained by structure inthe genealogy whereby recent coalescent events haveoccurred among the same ancestral haplotype beforethe common ancestor When estimates are dispersedamong resampled data sets the presence of multiple an-cestral haplotypes prior to the variantrsquos increase in fre-quency may be a better explanation Further support forthis explanation might come from comparisons to otherpopulation samples which show little to no dispersion ofestimates from resampled data sets Future work mightmake it possible to formalize this inference process

A final caveat regards the misspecification of mutation andrecombination rates TMRCA estimates are largely deter-mined by the use of accurate measures for these two param-eters In a way this provides some robustness to our methodOur age estimates depend on mutation and recombinationrates so accurate specification for one of the values can com-pensate for slight misspecification in the other In cases wherea fine-scale recombination map is unavailable we suggest us-ing a uniform recombination rate specific to the locus ofinterest (see Materials and Methods and supplementaryfig 7 Supplementary Material online) Choosing an appropri-ate mutation rate will continue to depend on current andfuture work which tries to resolve discrepancies in publishedmutation rate estimates inferred by various approaches(Segurel et al 2014)

One future direction for our method may be to explicitlyincorporate the possibility of multiple ancestral haplotypeswithin the sample Under a disease mapping frameworkMorris et al (2002) implement a similar idea in the case whereindependent disease causing mutations arise at the same lo-cus leading to independent genealogies for which they cointhe term ldquoshattered coalescentrdquo For our case beneficialmutations may also be independently derived on differenthaplotypes Alternatively a single mutation may be oldenough to reside on different haplotypes due to a sufficientamount of linked variation existing prior to the onset of se-lection Berg and Coop (2015) model selection from standingvariation to derive the distribution of haplotypes that theselected allele is present on

While we have treated the TMRCA as a parameter of in-terest our method also produces a sample of the posteriordistribution on the ancestral haplotype This could provideuseful information to estimate the frequency spectrum ofderived mutations on the ancestral haplotype Similarly thefrequency of shared recombination breakpoints could shedlight on the genealogy and how well it conforms to the star-shape assumption The extent of the ancestral haplotype ineach individual may also prove useful for identifying delete-rious alleles that have increased in frequency as a result ofstrong positive selection on linked beneficial alleles (Chun andFay 2011 Hartfield and Otto 2011) For example Huff et al(2012) describe a risk allele for Crohnrsquos disease at high fre-quency in European populations which they suggest is linkedto a beneficial allele under recent selection Similar to anadmixture mapping approach our method could be usedto identify risk loci by testing for an association betweenthe ancestral haplotype and disease status As another appli-cation identifying the ancestral haplotype may be useful inthe context of identifying a source population (or species) fora beneficial allele prior to its introduction and subsequentincrease in frequency in the recipient population

In many cases the specific site under selection may beunknown or restricted to some set of putative sites Whileour method requires the position of the selected site to bespecified future extensions could treat the selected site as arandom variable to be estimated under the same MCMCframework This framework would also be amenable to mar-ginalizing over uncertainty on the selected site

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1013

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

While we focus here on inference from modern DNAdata the increased accessibility of ancient DNA has addeda new dimension to population genetic data sets (Lazaridiset al 2014 Skoglund et al 2014 Allentoft et al 2015 Haaket al 2015 Mathieson et al 2015) Because it will remaindifficult to use ancient DNA approaches in many specieswith poor sample preservation we believe methods basedon modern DNA will continue to be useful going forwardThat said ancient DNA is providing an interesting avenue forcomparative work between inference from modern and an-cient samples For example Nakagome et al (2016) usesimulations to assess the fit of this ancient DNA polymor-phism to data simulated under their inferred parametervalues for allele age and selection intensity and they findreasonable agreement Much work still remains to fullyleverage ancient samples into population genetic inferencewhile accounting for new sources of uncertainty and poten-tial for sampling bias

Despite these challenges it is clear that our understandingof adaptive history will continue to benefit from new com-putational tools which extract insightful information from adiverse set of data sources

Materials and MethodsWe generated data using the software mssel (Hudson R per-sonal communication) which simulates a sample of haplo-types conditioned on the frequency trajectory of a selectedvariant under the structured coalescent (Hudson and Kaplan1988 Kaplan et al 1988) Trajectories were first simulatedforwards in time under a WrightndashFisher model for an additivelocus with varying strengths of selection and different endingfrequencies of the selected variant Trajectories were thentruncated to end at the first time the allele reaches a specifiedfrequency See supplementary table 1 SupplementaryMaterial online for relative ages of simulated TMRCA val-ues for different end frequencies and selection strengthsFor the results in figure 2 100 simulations were performedfor each parameter combination MCMCs were run for5000 iterations with a burn-in excluding the first 3000iterations A standard deviation of 10 was used for theproposal distribution of tca The red boxplots indicate localreference panels The blue and green boxplots indicate ref-erence panels diverged by 5N generations and 1N gener-ations respectively Each data set was simulated for a 1-Mblocus with a mutation rate of 1 108 recombination rateof 1 108

and population size of 10000 Sample sizesfor the selected and reference panels were 100 and 20respectively

For more efficient run times of the MCMC we set a max-imum number of individuals to include in the selected andreference panels to be 100 and 20 respectively In cases wherethe true number of haplotypes for either panel was greaterthan this in the full data set we resampled a subset ofhaplotypes from each population for a total of five replicatesper population For simulation results supporting the use ofthis resampling strategy see supplementary figure 5Supplementary Material online The MCMCs were run for

15000 iterations with a standard deviation of 20 for theTMRCA proposal distribution The first 9000 iterationswere removed as a burn-in leading to 6000 iterations for asample of the posterior Convergence was assessed by com-parison of MCMC replicates Figure 3 and supplementaryfigure 8 Supplementary Material online show the resultsfor all five variants along with previous point estimates and95 confidence intervals assuming a generation time of29 years (Fenner 2005) Supplementary table 3Supplementary Material online lists the mean and 95 cred-ible intervals for estimates with the highest mean posteriorprobability which we refer to in the text belowSupplementary table 4 Supplementary Material online liststhe previous estimates and confidence intervals with addi-tional details of the different approaches taken

To model recombination rate variation we used recombi-nation rates from the Decode sex-averaged recombinationmap inferred from pedigrees among families in Iceland(Kong et al 2010) Because some populations may have re-combination maps which differ from the Decode map at finescales we used a mean uniform recombination rate inferredfrom the 1-Mb region surrounding each variant The motiva-tion for this arises from how recombination rates have beenpreviously shown to remain relatively consistent among re-combination maps inferred for different populations at themegabase-scale (Broman et al 1998 Baudat et al 2010 Konget al 2010 Auton and McVean 2012) Further we found ourestimates depend mostly on having the megabase-scale re-combination rate appropriately set with little difference inmost cases for estimates obtained by modeling fine-scale re-combination at each locus (supplementary fig 7Supplementary Material online) We specify the switchingrate among background haplotypes after recombining off ofthe ancestral haplotype to be 4Nr where Nfrac14 10000 and r isthe mean recombination rate for the 1-Mb locus

For modeling mutation a challenge is that previous mu-tation rate estimates vary depending on the approach used(Segurel et al 2014) Estimates using the neutral substitutionrate between humans and chimps are more than 2 108

per bp per generation while estimates using whole-genomesequencing are closer to 1 108 As a compromise wespecify a mutation rate of 16 108

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe would like to thank Hussein Al-Asadi Arjun BiddandaAnna Di Rienzo Dick Hudson Choongwon Jeong Evan KochJoseph Marcus Shigeki Nakagome Ben Peter Mark ReppelAlex White members of the Coop laboratory at UC Davismembers of the Przeworski laboratory at ColumbiaUniversity and members of the He and Stephens laboratoriesat the University of Chicago for helpful comments JS wassupported by an NSF Graduate Research Fellowship andNational Institute of General Medical Sciences of the

Smith et al doi101093molbevmsy006 MBE

1014

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

National Institutes of Health under award numbers DGE-1144082 and T32GM007197 respectively This work wasalso supported by the National Institute of General MedicalSciences of the National Institutes of Health under awardnumbers RO1GM83098 and RO1GM107374 to GC as wellas R01HG007089 to JN

ReferencesAkey JM Eberle MA Rieder MJ Carlson CS Shriver MD Nickerson DA

Kruglyak L 2004 Population history and natural selection shapepatterns of genetic variation in 132 genes PLoS Biol 21591ndash1599

Allentoft ME Sikora M Sjogren K-G Rasmussen S Rasmussen MStenderup J Damgaard PB Schroeder H Ahlstrom T Vinner L2015 Population genomics of Bronze Age Eurasia Nature522(7555) 167ndash172

Auton A McVean G 2012 Estimating recombination rates from geneticvariation in humans In Anisimova M ed Evolutionary genomicsMethods in Molecular Biology (Methods and Protocols) vol 856New York Humana Press

Barrett RDH Hoekstra HE 2011 Molecular spandrels tests of adaptationat the genetic level Nat Rev Genet 12(11) 767ndash780

Baudat F Buard J Grey C Fledel-Alon A Ober C Przeworski M Coop GMassy BD 2010 PRDM9 is a major determinant of meiotic recom-bination hotspots in humans and mice Science 327(5967) 836ndash840

Beaumont MA Zhang W Balding DJ 2002 Approximate Bayesian com-putation in population genetics Genetics 162(4) 2025ndash2035

Beleza S Johnson NA Candille SI Absher DM Coram MA Lopes JCampos J Araujo II Anderson TM Vilhjalmsson BJ et al 2013Genetic architecture of skin and eye color in an African-Europeanadmixed population PLoS Genet 9(3) e1003372

Beleza S Santos AM McEvoy B Alves I Martinho C Cameron E ShriverMD Parra EJ Rocha J 2013 The timing of pigmentation lightening inEuropeans Mol Biol Evol 30(1) 24ndash35

Berg JJ Coop G 2015 A coalescent model for a sweep of a uniquestanding variant Genetics 201(2) 707ndash725

Bersaglieri T Sabeti PC Patterson N Vanderploeg T Schaffner SF DrakeJA Rhodes M Reich DE Hirschhorn JN 2004 Genetic signatures ofstrong recent positive selection at the lactase gene Am J Hum Genet74(6) 1111ndash1120

Broman KW Murray JC Sheffield VC White RL Weber JL 1998Comprehensive human genetic maps individual and sex-specificvariation in recombination Am J Hum Genet 63(3) 861ndash869

Bryk J Hardouin E Pugach I Hughes D Strotmann R Stoneking M MylesS 2008 Positive selection in East Asians for an EDAR allele thatenhances NF-jB activation PLoS One 3(5) e2209

Chen H Slatkin M 2013 Inferring selection intensity and allele age frommultilocus haplotype structure G3 3(8) 1429ndash1442

Chen H Hey J Slatkin M 2015 A hidden Markov model for investigatingrecent positive selection through haplotype structure Theor PopulBiol 9918ndash30

Chun S Fay JC 2011 Evidence for hitchhiking of deleterious mutationswithin the human genome PLoS Genet 7(8) e1002240

Coop G Griffiths RC 2004 Ancestral inference on gene trees underselection Theor Popul Biol 66(3) 219ndash232

Coop G Bullaughey K Luca F Przeworski M 2008 The timing of selec-tion at the human FOXP2 gene Mol Biol Evol 25(7) 1257ndash1259

Dalziel AC Rogers SM Schulte PM 2009 Linking genotypes to pheno-types and fitness how mechanistic biology can inform molecularecology Mol Ecol 18(24) 4997ndash5017

Eaton K Edwards M Krithika S Cook G Norton H Parra EJ 2015Association study confirms the role of two OCA2 polymorphismsin normal skin pigmentation variation in East Asian populations AmJ Hum Biol 27 520ndash525

Edwards M Bigham A Tan J Li S Gozdzik A Ross K Jin L Parra EJ 2010Association of the OCA2 polymorphism His615Arg with melanin

content in east Asian populations further evidence of convergentevolution of skin pigmentation PLoS Genet 6(3) e1000867

Elmer KR Meyer A 2011 Adaptation in the age of ecological genomicsinsights from parallelism and convergence Trends Ecol Evol 26(6)298ndash306

Enattah NS Sahi T Savilahti E Terwilliger JD Peltonen L Jeuroarveleuroa I 2002Identification of a variant associated with adult-type hypolactasiaNat Genet 30(2) 233ndash237

Eng MY Luczak SE Wall TL 2007 ALDH2 ADH1B and ADH1C geno-types in Asians a literature review Alcohol Res Health 30(1) 22

Fenner JN 2005 Cross-cultural estimation of the human generationinterval for use in genetics-based population divergence studiesAm J Phys Anthropol 128(2) 415ndash423

Fujimoto A Ohashi J Nishida N Miyagawa T Morishita Y Tsunoda TKimura R Tokunaga K 2008 A replication study confirmed theEDAR gene to be a major contributor to population differentiationregarding head hair thickness in Asia Hum Genet 124(2) 179ndash185

Garud NR Messer PW Buzbas EO Petrov DA 2015 Recent selectivesweeps in North American Drosophila melanogaster show signaturesof soft sweeps PLoS Genet 11(2) e1005004

Griffiths RC Tavare S 1994 Sampling theory for neutral alleles in avarying environment Philos Trans R Soc B Biol Sci 344(1310)403ndash410

Haak W Lazaridis I Patterson N Rohland N Mallick S Llamas B BrandtG Nordenfelt S Harney E Stewardson K 2015 Massive migrationfrom the steppe was a source for Indo-European languages inEurope Nature 522 207ndash211

Hartfield M Otto SP 2011 Recombination and hitchhiking of deleteri-ous alleles Evolution 65(9) 2421ndash2434

Hermisson J Pennings PS 2005 Soft sweeps molecular population ge-netics of adaptation from standing genetic variation Genetics 169(4)2335ndash2352

Hinch AG Tandon A Patterson N Song Y Rohland N Palmer CD ChenGK Wang K Buxbaum SG Akylbekova EL et al 2011 The landscapeof recombination in African Americans Nature 476(7359) 170ndash175

Hudson RR 1990 Gene genealogies and the coalescent process Oxf SurvEvol Biol 7(1) 44

Hudson RR 2007 The variance of coalescent time estimates from DNAsequences J Mol Evol 64(6) 702ndash705

Hudson RR Kaplan NL 1988 The coalescent process in models withselection and recombination Genetics 120(3) 831ndash840

Huff CD Witherspoon DJ Zhang Y Gatenbee C Denson LA KugathasanS Hakonarson H Whiting A Davis CT Wu W et al 2012 Crohnrsquosdisease and genetic hitchhiking at IBD5 Mol Biol Evol 29(1)101ndash111

Innan H Kim Y 2004 Pattern of polymorphism after strong artificialselection in a domestication event Proc Natl Acad Sci U S A 101(29)10667ndash10672

Itan Y Powell A Beaumont MA Burger J Thomas MG 2009 The originsof lactase persistence in Europe PLoS Comput Biol 5(8) e1000491

Jablonski NG Chaplin G 2000 The evolution of human skin coloration JHum Evol 39(1) 57ndash106

Jablonski NG Chaplin G 2012 Human skin pigmentation migration anddisease susceptibility Philos Trans R Soc B Biol Sci 367(1590)785ndash792

Jeong C Di Rienzo A 2014 Adaptations to local environments in mod-ern human populations Curr Opin Genet Dev 291ndash8

Kamberov YG Wang S Tan J Gerbault P Wark A Tan L Yang Y Li STang K Chen H et al 2013 Modeling recent human evolution inmice by expression of a selected EDAR variant Cell 152(4) 691ndash702

Kaplan NL Darden T Hudson RR 1988 The coalescent process inmodels with selection Genetics 120(3) 819ndash829

Kaplan NL Hudson RR Langley CH 1989 The ldquohitchhiking effectrdquo revis-ited Genetics 123(4) 887ndash899

Kimura R Yamaguchi T Takeda M Kondo O Toma T Haneji KHanihara T Matsukusa H Kawamura S Maki K et al 2009 A com-mon variation in EDAR is a genetic determinant of shovel-shapedincisors Am J Hum Genet 85(4) 528ndash535

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1015

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Kong A Thorleifsson G Gudbjartsson DF Masson G Sigurdsson AJonasdottir A Walters GB Jonasdottir A Gylfason A KristinssonKT et al 2010 Fine-scale recombination rate differences betweensexes populations and individuals Nature 467(7319) 1099ndash1103

Kuokkanen M Kokkonen J Enattah NS Ylisaukko-oja T Komu H VariloT Peltonen L Savilahti E Jeuroarveleuroa I 2006 Mutations in the translatedregion of the lactase gene (LCT) underlie congenital lactase defi-ciency Am J Hum Genet 78(2) 339ndash344

Larribe F Fearnhead P 2011 On composite likelihoods in statisticalgenetics Stat Sin 21(1) 43ndash69

Lazaridis I Patterson N Mittnik A Renaud G Mallick S Kirsanow KSudmant PH Schraiber JG Castellano S Lipson M et al 2014Ancient human genomes suggest three ancestral populations forpresent-day Europeans Nature 513(7518) 409ndash413

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357) 493ndash496

Li H Mukherjee N Soundararajan U Tarnok Z Barta C Khaliq SMohyuddin A Kajuna SLB Mehdi SQ Kidd JR et al 2007Geographically separate increases in the frequency of the derivedADH1B 47His allele in eastern and western Asia Am J Hum Genet81(4) 842ndash846

Li N Stephens M 2003 Modeling linkage disequilibrium and identifyingrecombination hotspots using single-nucleotide polymorphismdata Genetics 165(4) 2213ndash2233

Loomis WF 1967 Skin-pigment regulation of vitamin-D biosynthesis inman Science 157(3788) 501ndash506

Mathieson I Lazaridis I Rohland N Mallick S Patterson N RoodenbergSA Harney E Stewardson K Fernandes D Novak M et al 2015Genome-wide patterns of selection in 230 ancient Eurasians Nature528(7583) 499ndash503

McPeek MS Strahs A 1999 Assessment of linkage disequilibrium by thedecay of haplotype sharing with application to fine-scale geneticmapping Am J Hum Genet 65(3) 858ndash875

Meligkotsidou L Fearnhead P 2005 Maximum-likelihood estimationof coalescence times in genealogical trees Genetics 171(4)2073ndash2084

Messer PW Neher RA 2012 Estimating the strength of selective sweepsfrom deep population diversity data Genetics 191(2) 593ndash605

Messer PW Petrov DA 2013 Population genomics of rapid adaptationby soft selective sweeps Trends Ecol Evol 28(11) 659ndash669

Miller CT Beleza S Pollen AA Schluter D Kittles RA Shriver MDKingsley DM 2007 cis-Regulatory changes in kit ligand expressionand parallel evolution of pigmentation in sticklebacks and humansCell 131(6) 1179ndash1189

Morris AP Whittaker JC Balding DJ 2000 Bayesian fine-scale mapping ofdisease loci by hidden Markov models Am J Hum Genet 67(1)155ndash169

Morris AP Whittaker JC Balding DJ 2002 Fine-scale mapping of diseaseloci via shattered coalescent modeling of genealogies Am J HumGenet 70(3) 686ndash707

Nakagome S Alkorta-Aranburu G Amato R Howie B Peter BMHudson RR Di Rienzo A 2016 Estimating the ages of selectionsignals from different epochs in human history Mol Biol Evol33(3) 657ndash669

Nielsen R 2005 Molecular signatures of natural selection Annu RevGenet 39 197ndash218

Olds LC Sibley E 2003 Lactase persistence DNA variant enhances lactasepromoter activity in vitro functional role as a cis regulatory elementHum Mol Genet 12(18) 2333ndash2340

Ormond L Foll M Ewing GB Pfeifer SP Jensen JD 2016 Inferring the ageof a fixed beneficial allele Mol Ecol 25(1) 157ndash169

Osier MV Pakstis AJ Soodyall H Comas D Goldman D Odunsi AOkonofua F Parnas J Schulz LO Bertranpetit J et al 2002 A globalperspective on genetic variation at the ADH genes reveals unusualpatterns of linkage disequilibrium and diversity Am J Hum Genet71(1) 84ndash99

Peng Y Shi H Qi X-B Xiao C-J Zhong H Ma Run-lin Z Su B 2010The ADH1B Arg47His polymorphism in East Asian populations

and expansion of rice domestication in history BMC Evol Biol10(1) 15

Peter BM Huerta-Sanchez E Nielsen R 2012 Distinguishing betweenselective sweeps from standing variation and from a de novo mu-tation PLoS Genet 8(10) e1003011

Przeworski M Coop G Wall JD 2005 The signature of positive selectionon standing genetic variation Evolution 59(11) 2312ndash2323

Price AL Tandon A Patterson N Barnes KC Rafaels N Ruczinski I BeatyTH Mathias R Reich D Myers S 2009 Sensitive detection of chro-mosomal segments of distinct ancestry in admixed populationsPLoS Genet 5(6) e1000519

Pritchard JK Seielstad MT Perez-Lezaun A Feldman MW 1999Population growth of human Y chromosomes a study of Y chro-mosome microsatellites Mol Biol Evol 16(12) 1791ndash1798

Pritchard JK Pickrell JK Coop G 2010 The genetics of human adapta-tion hard sweeps soft sweeps and polygenic adaptation Curr Biol20(4) R208ndashR215

Przeworski M 2003 Estimating the time since the fixation of a beneficialallele Genetics 164(4) 1667ndash1676

Rabiner LR 1989 A tutorial on hidden Markov models and selectedapplications in speech recognition Proc IEEE 77(2) 257ndash286

Radwan J Babik W 2012 The genomics of adaptation Proc R Soc Lond BBiol Sci 279(1749) 5024ndash5028

Rannala B Reeve JP 2001 High-resolution multipoint linkage-disequilibrium mapping in the context of a human genome se-quence Am J Hum Genet 69(1) 159ndash178

Rannala B Reeve JP 2002 Joint Bayesian estimation of mutationlocation and age using linkage disequilibrium Biocomputing2003 526ndash534

Segurel L Wyman MJ Przeworski M 2014 Determinants of mutationrate variation in the human germline Annu Rev Genomics HumGenet 1547ndash70

Simoons FJ 1970 Primary adult lactose intolerance and the milkinghabit a problem in biologic and cultural interrelations Am J DigDis 15(8) 695ndash710

Skoglund P Malmstrom H Omrak A Raghavan M Valdiosera CGunther T Hall P Tambets K Parik J Sjogren K-G et al 2014Genomic diversity and admixture differs for Stone-AgeScandinavian foragers and farmers Science 344(6185) 747ndash750

Slatkin M 2001 Simulating genealogies of selected alleles in a populationof variable size Genet Res 78(01) 49ndash57

Slatkin M 2008 A Bayesian method for jointly estimating allele age andselection intensity Genet Res 90(01) 129ndash137

Slatkin M Hudson RR 1991 Pairwise comparisons of mitochondrialDNA sequences in stable and exponentially growing populationsGenetics 129(2) 555ndash562

Slatkin M Rannala B 2000 Estimating allele age Annu Rev GenomicsHum Genet 1(1) 225ndash249

Smith JM Haigh J 1974 The hitch-hiking effect of a favourable geneGenet Res 23(01) 23ndash35

Stinchcombe JR Hoekstra HE 2008 Combining population genomicsand quantitative genetics finding the genes underlying ecologicallyimportant traits Heredity 100(2) 158ndash170

Tang H Siegmund DO Shen P Oefner PJ Feldman MW 2002Frequentist estimation of coalescence times from nucleotide se-quence data using a tree-based partition Genetics 161(1) 447ndash459

Tavare S Balding DJ Griffiths RC Donnelly P 1997 Inferring coalescencetimes from DNA sequence data Genetics 145(2) 505ndash518

Tennessen JA Bigham AW OrsquoConnor TD Fu W Kenny EE Gravel SMcGee S Do R Liu X Jun G et al 2012 Evolution and functionalimpact of rare coding variation from deep sequencing of humanexomes Science 337(6090) 64ndash69

Thomson R Pritchard JK Shen P Oefner PJ Feldman MW 2000Recent common ancestry of human Y chromosomes evidencefrom DNA sequence data Proc Natl Acad Sci U S A 97(13)7360ndash7365

Tishkoff SA Reed FA Ranciaro A Voight BF Babbitt CC Silverman JSPowell K Mortensen HM Hirbo JB Osman M et al 2007

Smith et al doi101093molbevmsy006 MBE

1016

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Convergent adaptation of human lactase persistence in Africa andEurope Nat Genet 39(1) 31ndash40

Troelsen JT Olsen J Moslashller J SjeuroOstreuroOm H 2003 An upstream polymor-phism associated with lactase persistence has increased enhanceractivity Gastroenterology 125(6) 1686ndash1694

Varin C Reid N Firth D 2011 An overview of composite likelihoodmethods Stat Sin 21(1) 5ndash42

Voight BF Kudaravalli S Wen X Pritchard JK Hurst L 2006 A map ofrecent positive selection in the human genome PLoS Biol 4(3) e72

Wegmann D Kessner DE Veeramah KR Mathias RA Nicolae DL YanekLR Sun YV Torgerson DG Rafaels N Mosley T et al 2011Recombination rates in admixed individuals identified by ancestry-based inference Nat Genet 43(9) 847ndash853

Wilde S Timpson A Kirsanow K Kaiser E Kayser M Unterleuroander MHollfelder N Potekhina ID Schier W Thomas MG et al 2014 Directevidence for positive selection of skin hair and eye pigmentation inEuropeans during the last 5000 years Proc Natl Acad Sci U S A111(13) 4832ndash4837

Williamson SH Hernandez R Fledel-Alon A Zhu L Nielsen RBustamante CD 2005 Simultaneous inference of selection and pop-ulation growth from patterns of variation in the human genomeProc Natl Acad Sci U S A 102(22) 7882ndash7887

Wiuf C 2000 On the genealogy of a sample of neutral rare alleles TheorPopul Biol 58(1) 61ndash75

Wiuf C Donnelly P 1999 Conditional genealogies and the age of aneutral mutant Theor Popul Biol 56(2) 183ndash201

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1017

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

of rice domestication and fermentation approximately10000 years ago (Li et al 2007 Peng et al 2010 Peter et al2012) However a more recent estimate by Peter et al (2012)pushes this time back several thousand years to 12876(2204ndash49764) years ago Our results are consistent with an

older timing of selection as our CHB sample (Han Chinese inBeijing China) TMRCA estimate is 15377 (13763ndash17281)years Replicate chains of the MCMC are generally consistentwith the oldest estimates in the CHB sample showing themost variation among resampled data sets and the youngest

FIG 3 Comparison of TMRCA estimates with previous results Violin plots of posterior distributions for the complete set of estimated TMRCAvalues for the five variants indicated in the legend scaled to a generation time of 29 years Each row indicates a population sample from the 1000Genomes Project panel Replicate MCMCs are plotted with transparency Points and lines overlaying the violins are previous point estimates and95 confidence intervals for each of the variants indicated by a color and rs number in the legend (see supplementary tables 3 and 4Supplementary Material online) The population sample abbreviations are defined in text

Smith et al doi101093molbevmsy006 MBE

1010

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

estimate of 10841 (9720ndash12147) in the KHV sample showingthe least When using a fine-scale recombination map all ofthe ADH1B TMRCAs are inferred to be slightly older (supple-mentary fig 7 Supplementary Material online)

EDARPopulation genomic studies have repeatedly identified thegene EDAR to be under recent selection in East Asians(Akey et al 2004 Williamson et al 2005 Voight et al 2006)with a particular site (rs3827760) showing strong evidence forbeing the putative target Functional assays and allele specificexpression differences at this position show phenotypic asso-ciations to a variety of phenotypes including hair thicknessand dental morphology (Bryk et al 2008 Fujimoto et al 2008Kimura et al 2009 Kamberov et al 2013)

Our estimate of 22192 (19683ndash25736) years for the EDARallele in the CHB sample is older than ABC-based estimates of12458 (1314ndash85835) and 13224 (4899ndash50692) years madeby Bryk et al (2008) and Peter et al (2012) respectivelyKamberov et al (2013) use spatially explicit ABC and maxi-mum likelihood approaches to compute older estimates of30925 (13175ndash39575) and 36490 (34775ndash38204) We in-cluded all populations for which the variant is present includ-ing the FIN and BEB samples where it exists at low frequencyOur results for the youngest TMRCAs are found in these twolow frequency populations where the estimate in FIN is17386 (13887ndash20794) and the estimate in BEB is 18370(14325ndash22872) Among East Asian populations the oldestand youngest TMRCA estimates are found in the KHV sam-ple (25683 23169ndash28380) and CHB sample (22192 19683ndash25736)

LCTArguably the best-studied signature of selection in humans isfor an allele at the LCT gene (rs4988235) which confers lactasepersistence into adulthoodmdasha trait unique among mammalsand which is thought to be a result of cattle domesticationand the incorporation of milk into the adult diets of severalhuman populations (Enattah et al 2002 Bersaglieri et al 2004Tishkoff et al 2007) There are multiple alleles that showassociation with lactase persistence (Tishkoff et al 2007)We focused on estimating the age of the T-13910 allele pri-marily found at high frequency among Northern Europeansbut which is also found in South Asian populations In addi-tion to association with the lactase persistence phenotypethis allele has been functionally verified by in vitro experi-ments (Olds and Sibley 2003 Troelsen et al 2003Kuokkanen et al 2006)

Mathieson et al (2015) use ancient DNA collected from83 human samples to get a better understanding of the fre-quency trajectory for several adaptive alleles spanning a timescale of 8000 years For the LCT persistence allele (rs4988235)they find a sharp increase in frequency in the past 4000 yearsago While this is more recent than previous estimates anearlier TMRCA or time of mutation is still compatible withthis scenario

Our estimates using European and South Asian samplesfall between the range from 5000 to 10000 years ago which isbroadly consistent with age estimates from modern data Thecredible intervals for estimates in all of the samples havesubstantial overlap which makes any ranking on the basisof point estimates difficult We infer the PJL (Punjabi fromLahore Pakistan) sample to have the oldest TMRCA estimateof 9514 (8596ndash10383) years Itan et al (2009) use spatialmodelling to infer the geographic spread of lactase allelefrom northern to southern Europe Consistent with theirresults the youngest estimate among European populationsis found in the IBS sample at 9341 (8688ndash9989) yearsAmong all samples the youngest estimate was found inBEB at 6869 (5143ndash8809)

KITLG and OCA2The genetic basis and natural history of human skin pigmen-tation is a well-studied system with several alleles of majoreffect showing signatures consistent with being targets ofrecent selection (Jablonski and Chaplin 2012 Beleza Santoset al 2013 Wilde et al 2014 Eaton et al 2015) We focused onan allele found at high-frequency world-wide among non-African populations at the KITLG locus (rs642742) whichshows significant effects on skin pigmentation differences be-tween Europeans and Africans (Miller et al 2007) althoughmore recent work fails to find any contribution of KITLGtoward variation in skin pigmentation in a Cape VerdeAfrican-European admixed population (Beleza Johnsonet al 2013) We also estimated the TMRCA for a melanin-associated allele at the OCA2 locus (rs1800414) which is onlyfound among East Asian populations at high frequency(Edwards et al 2010)

For the KITLG variant our estimates among differentpopulations vary from 18000 to 34000 years ago with theoldest age being in the YRI (Yoruba in Ibadan Nigeria)sample (33948 28861ndash39099) The youngest TMRCA isfound in FIN at 18733 years (16675ndash20816) The nexttwo youngest estimates are also found in Africa withthe TMRCA in the MSL (Mende in Sierra Leone) samplebeing 22340 (15723ndash28950) years old and that for LWK(Luhya in Webuye Kenya) being 22784 (17922ndash28012)years old suggesting a more complex history than amodel of a simple allele frequency increase outside ofAfrica due to pigmentation related selection pressuresPrevious point estimates using rejection samplingapproaches on a Portuguese sample (32277 6003ndash80683) and East Asian sample (32045 6032ndash98165)are again most consistent with our own results on theIBS (29731 26170ndash32813) and CHB samples (2677324297ndash30141) (Beleza Santos et al 2013 Chen et al2015) Among East Asians the oldest and youngest esti-mates are again found in the JPT (28637 24297ndash30141)and KHV (24544 21643ndash27193) samples respectivelyThe TMRCA for OCA2 alleles in the JPT (1859916110ndash20786) and KHV (16370 14439ndash18102) samplesare also the oldest and youngest respectively

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1011

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

DiscussionOur method improves estimation for the timing of selectionon a beneficial allele using a tractable model of haplotypeevolution This approach leverages detailed information inthe data while remaining amenable to large sample sizesUsing both carriers and noncarriers of the allele we canmore effectively account for uncertainty in the extent ofthe ancestral haplotype and derived mutations We showthe performance of our method using simulations of differentselection strengths beneficial allele frequencies and choicesof reference panel By applying our method to five variantspreviously identified as targets of selection in human popu-lations we provide a comparison among population-specificTMRCAs This gives a more detailed account of the order inwhich populations may have acquired the variant andorexperienced selection for the trait it underlies

In that regard it is hypothesized that local selection pres-sures and a cultural shift toward agrarian societies have in-duced adaptive responses among human populations aroundthe globe The data associated with some variants seem toindicate more recent selective events than others Our resultsfor variants associated with dietary traits at the LCT andADH1B genes both imply relatively recent TMRCAs(lt 15 000 years ago) consistent with hypotheses that selec-tion on these mutations results from recent changes in hu-man diet following the spread of agriculture (Simoons 1970Peng et al 2010) In contrast the inferred TMRCAs for EDARKITLG and OCA2 imply older adaptive events which mayhave coincided more closely with the habitation of new envi-ronments or other cultural changes

Several hypotheses have been suggested to describe theselective drivers of skin pigmentation differences among hu-man populations including reduced UV radiation at highlatitudes and vitamin D deficiency (Loomis 1967 Jablonskiand Chaplin 2000) Estimated TMRCAs for the variants atthe OCA2 and EDAR loci among East Asians appear to beas young or younger than the KITLG variant but older thanthe LCT and ADH1B locus This suggests a selective history inEast Asian populations leading to adaptive responses forthese traits occurring after an initial colonization In somecases the dispersion of replicate MCMC estimates makes itdifficult to describe the historical significance of an observedorder for TMRCA values However the consistency of esti-mates among different populations for particular variantsadds some confidence to our modelrsquos ability to reproducethe ages which are relevant to those loci or certain geographicregions

To assess the relative concordance of our estimates withthose from previous approaches we compared our resultswith a compilation of previously published estimates basedon the time of mutation time since fixation or TMRCA ofvariants associated with the genes studied here (supplemen-tary fig 8 Supplementary Material online) The range of con-fidence intervals for these studies is largely a reflection of theassumptions invoked or relaxed for any one method as wellas the sample size and quality of the data used In principleour method extracts more information than approaches that

use summary statistics such as ABC In our empirical appli-cation we found that our method provides a gain in accuracywhile accounting for uncertainty in both the ancestral hap-lotype and its length on each chromosome Notably ourmethod provides narrower credible intervals by incorporatingthe full information from ancestral haplotype lengths derivedmutations and a reference panel of noncarrier haplotypes

Another caveat of our method is its dependence on thereference panel which is intended to serve as a representativesample of nonancestral haplotypes in the population duringthe selected allelersquos increase in frequency Four possible chal-lenges can arise 1) Segments of the ancestral selected haplo-type may be present in the reference panel due torecombination (this is more likely for alleles that havereached higher frequency) 2) the reference panel may con-tain haplotypes that are similar to the ancestral haplotypedue to low levels of genetic diversity 3) the reference panelmay be too diverged from the focal population and 4) pop-ulation connectivity and turnover may lead the ldquolocalrdquo refer-ence panel to be largely composed of migrant haplotypeswhich were not present during the selected allelersquos initial in-crease in frequency

Under scenarios 1 and 2 the background haplotypes willbe too similar to the ancestral haplotype and it may be dif-ficult for the model to discern a specific ancestry switch lo-cation This leads to fewer differences (mutations) thanexpected between the ancestral haplotype and each benefi-cial allele carrier The simulation results are consistent withthis scenario Our method tends to underestimate the trueage across a range of selection intensities and allele frequen-cies when using a local reference panel

Conversely under scenarios 3 and 4 the model will fail todescribe a recombinant haplotype in the sample of beneficialallele carriers as a mosaic of haplotypes in the reference panelAs a result the model will infer more mutation events toexplain observed differences from the ancestral haplotypeOur simulation results show this to be the case with referencepanels diverged by N generations Posterior mean estimatesare consistently older than their true value Our simulationsare perhaps pessimistic thoughmdashwe chose reference paneldivergence times of N and 05N generations approximatelycorresponding to FST values of 04 and 02 respectively Forthe smaller FST values observed in humans we expect resultsfor diverged panels to be closer to those obtained with thelocal reference panel Nonetheless future extensions to incor-porate multiple populations within the reference panel wouldbe helpful and possible by modifying the approach of Priceet al (2009) Such an approach would also enable the analysisof admixed populations (we excluded admixed samples fromour analysis of the 1000 Genomes data above)

Aside from the challenges imposed by the choice of refer-ence panel another potential source of bias lies in our tran-sition probabilities which are not conditioned on thefrequency of the selected variant In reality recombinationevents at some distance away from the selected site will onlyresult in a switch from the ancestral to background haplo-types at a rate proportional to 1 pl where pl is the fre-quency of the ancestral haplotype alleles at locus l In this way

Smith et al doi101093molbevmsy006 MBE

1012

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

some recombination events may go unobservedmdashas thebeneficial allele goes to high frequency the probability of arecombination event leading to an observable ancestral tobackground haplotype transition decreases One solutionmay be to include the frequency-dependent transition prob-abilities derived by Chen et al (2015) Under their model themutation time is estimated by assuming a deterministic lo-gistic frequency trajectory starting at 1

2N An additional benefitof using frequency trajectories would be the ability to inferposterior distributions on selection coefficients While theselection coefficient is typically assumed to be related tothe time since mutation by t1 frac14 logethNsTHORN=s we do nothave an equivalent expression for time to the common an-cestor Rather than the initial frequency being 1

2N for a newmutation our initial frequency must correspond to the fre-quency at which the TMRCA occurs Griffiths and Tavare(1994) derive a framework to model a genealogy under arbi-trary population size trajectories which should be analogousto the problem of an allele frequency trajectory and addi-tional theory on intra-allelic genealogies may be useful here aswell (Griffiths and Tavare 1994 Wiuf and Donnelly 1999 Wiuf2000 Slatkin and Rannala 2000)

Our model also assumes independence among all haplo-types in the sample in a composite-likelihood frameworkwhich is equivalent to assuming a star-genealogy (Larribeand Fearnhead 2011 Varin et al 2011) This is unlikely tobe the case when sample sizes are large or the TMRCA isold It is also unlikely to be true if the beneficial allele existedon multiple haplotypes preceding the onset of selection wasintroduced by multiple migrant haplotypes from other pop-ulations or occurred by multiple independent mutationevents (Innan and Kim 2004 Hermisson and Pennings2005 Przeworski et al 2005 Pritchard et al 2010 Berg andCoop 2015) Methods for distinguishing selection from stand-ing variation versus de novo mutation are available thatshould make it easier distinguish these cases (Messer andNeher 2012 Peter et al 2012 Messer and Petrov 2013Garud et al 2015)

If the underlying allelic genealogy is not star-like onecan expect different estimates of the TMRCA for differentsubsets of the data Here we performed multiple MCMCson resampled subsets of the data to informally diagnosewhether there are violations from the star-shape geneal-ogy assumption We speculate that exactly how theTMRCAs vary may provide insight to the underlyinghistory In cases where the TMRCA estimates for a par-ticular population are old and more variable than otherpopulations the results may be explained by structure inthe genealogy whereby recent coalescent events haveoccurred among the same ancestral haplotype beforethe common ancestor When estimates are dispersedamong resampled data sets the presence of multiple an-cestral haplotypes prior to the variantrsquos increase in fre-quency may be a better explanation Further support forthis explanation might come from comparisons to otherpopulation samples which show little to no dispersion ofestimates from resampled data sets Future work mightmake it possible to formalize this inference process

A final caveat regards the misspecification of mutation andrecombination rates TMRCA estimates are largely deter-mined by the use of accurate measures for these two param-eters In a way this provides some robustness to our methodOur age estimates depend on mutation and recombinationrates so accurate specification for one of the values can com-pensate for slight misspecification in the other In cases wherea fine-scale recombination map is unavailable we suggest us-ing a uniform recombination rate specific to the locus ofinterest (see Materials and Methods and supplementaryfig 7 Supplementary Material online) Choosing an appropri-ate mutation rate will continue to depend on current andfuture work which tries to resolve discrepancies in publishedmutation rate estimates inferred by various approaches(Segurel et al 2014)

One future direction for our method may be to explicitlyincorporate the possibility of multiple ancestral haplotypeswithin the sample Under a disease mapping frameworkMorris et al (2002) implement a similar idea in the case whereindependent disease causing mutations arise at the same lo-cus leading to independent genealogies for which they cointhe term ldquoshattered coalescentrdquo For our case beneficialmutations may also be independently derived on differenthaplotypes Alternatively a single mutation may be oldenough to reside on different haplotypes due to a sufficientamount of linked variation existing prior to the onset of se-lection Berg and Coop (2015) model selection from standingvariation to derive the distribution of haplotypes that theselected allele is present on

While we have treated the TMRCA as a parameter of in-terest our method also produces a sample of the posteriordistribution on the ancestral haplotype This could provideuseful information to estimate the frequency spectrum ofderived mutations on the ancestral haplotype Similarly thefrequency of shared recombination breakpoints could shedlight on the genealogy and how well it conforms to the star-shape assumption The extent of the ancestral haplotype ineach individual may also prove useful for identifying delete-rious alleles that have increased in frequency as a result ofstrong positive selection on linked beneficial alleles (Chun andFay 2011 Hartfield and Otto 2011) For example Huff et al(2012) describe a risk allele for Crohnrsquos disease at high fre-quency in European populations which they suggest is linkedto a beneficial allele under recent selection Similar to anadmixture mapping approach our method could be usedto identify risk loci by testing for an association betweenthe ancestral haplotype and disease status As another appli-cation identifying the ancestral haplotype may be useful inthe context of identifying a source population (or species) fora beneficial allele prior to its introduction and subsequentincrease in frequency in the recipient population

In many cases the specific site under selection may beunknown or restricted to some set of putative sites Whileour method requires the position of the selected site to bespecified future extensions could treat the selected site as arandom variable to be estimated under the same MCMCframework This framework would also be amenable to mar-ginalizing over uncertainty on the selected site

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1013

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

While we focus here on inference from modern DNAdata the increased accessibility of ancient DNA has addeda new dimension to population genetic data sets (Lazaridiset al 2014 Skoglund et al 2014 Allentoft et al 2015 Haaket al 2015 Mathieson et al 2015) Because it will remaindifficult to use ancient DNA approaches in many specieswith poor sample preservation we believe methods basedon modern DNA will continue to be useful going forwardThat said ancient DNA is providing an interesting avenue forcomparative work between inference from modern and an-cient samples For example Nakagome et al (2016) usesimulations to assess the fit of this ancient DNA polymor-phism to data simulated under their inferred parametervalues for allele age and selection intensity and they findreasonable agreement Much work still remains to fullyleverage ancient samples into population genetic inferencewhile accounting for new sources of uncertainty and poten-tial for sampling bias

Despite these challenges it is clear that our understandingof adaptive history will continue to benefit from new com-putational tools which extract insightful information from adiverse set of data sources

Materials and MethodsWe generated data using the software mssel (Hudson R per-sonal communication) which simulates a sample of haplo-types conditioned on the frequency trajectory of a selectedvariant under the structured coalescent (Hudson and Kaplan1988 Kaplan et al 1988) Trajectories were first simulatedforwards in time under a WrightndashFisher model for an additivelocus with varying strengths of selection and different endingfrequencies of the selected variant Trajectories were thentruncated to end at the first time the allele reaches a specifiedfrequency See supplementary table 1 SupplementaryMaterial online for relative ages of simulated TMRCA val-ues for different end frequencies and selection strengthsFor the results in figure 2 100 simulations were performedfor each parameter combination MCMCs were run for5000 iterations with a burn-in excluding the first 3000iterations A standard deviation of 10 was used for theproposal distribution of tca The red boxplots indicate localreference panels The blue and green boxplots indicate ref-erence panels diverged by 5N generations and 1N gener-ations respectively Each data set was simulated for a 1-Mblocus with a mutation rate of 1 108 recombination rateof 1 108

and population size of 10000 Sample sizesfor the selected and reference panels were 100 and 20respectively

For more efficient run times of the MCMC we set a max-imum number of individuals to include in the selected andreference panels to be 100 and 20 respectively In cases wherethe true number of haplotypes for either panel was greaterthan this in the full data set we resampled a subset ofhaplotypes from each population for a total of five replicatesper population For simulation results supporting the use ofthis resampling strategy see supplementary figure 5Supplementary Material online The MCMCs were run for

15000 iterations with a standard deviation of 20 for theTMRCA proposal distribution The first 9000 iterationswere removed as a burn-in leading to 6000 iterations for asample of the posterior Convergence was assessed by com-parison of MCMC replicates Figure 3 and supplementaryfigure 8 Supplementary Material online show the resultsfor all five variants along with previous point estimates and95 confidence intervals assuming a generation time of29 years (Fenner 2005) Supplementary table 3Supplementary Material online lists the mean and 95 cred-ible intervals for estimates with the highest mean posteriorprobability which we refer to in the text belowSupplementary table 4 Supplementary Material online liststhe previous estimates and confidence intervals with addi-tional details of the different approaches taken

To model recombination rate variation we used recombi-nation rates from the Decode sex-averaged recombinationmap inferred from pedigrees among families in Iceland(Kong et al 2010) Because some populations may have re-combination maps which differ from the Decode map at finescales we used a mean uniform recombination rate inferredfrom the 1-Mb region surrounding each variant The motiva-tion for this arises from how recombination rates have beenpreviously shown to remain relatively consistent among re-combination maps inferred for different populations at themegabase-scale (Broman et al 1998 Baudat et al 2010 Konget al 2010 Auton and McVean 2012) Further we found ourestimates depend mostly on having the megabase-scale re-combination rate appropriately set with little difference inmost cases for estimates obtained by modeling fine-scale re-combination at each locus (supplementary fig 7Supplementary Material online) We specify the switchingrate among background haplotypes after recombining off ofthe ancestral haplotype to be 4Nr where Nfrac14 10000 and r isthe mean recombination rate for the 1-Mb locus

For modeling mutation a challenge is that previous mu-tation rate estimates vary depending on the approach used(Segurel et al 2014) Estimates using the neutral substitutionrate between humans and chimps are more than 2 108

per bp per generation while estimates using whole-genomesequencing are closer to 1 108 As a compromise wespecify a mutation rate of 16 108

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe would like to thank Hussein Al-Asadi Arjun BiddandaAnna Di Rienzo Dick Hudson Choongwon Jeong Evan KochJoseph Marcus Shigeki Nakagome Ben Peter Mark ReppelAlex White members of the Coop laboratory at UC Davismembers of the Przeworski laboratory at ColumbiaUniversity and members of the He and Stephens laboratoriesat the University of Chicago for helpful comments JS wassupported by an NSF Graduate Research Fellowship andNational Institute of General Medical Sciences of the

Smith et al doi101093molbevmsy006 MBE

1014

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

National Institutes of Health under award numbers DGE-1144082 and T32GM007197 respectively This work wasalso supported by the National Institute of General MedicalSciences of the National Institutes of Health under awardnumbers RO1GM83098 and RO1GM107374 to GC as wellas R01HG007089 to JN

ReferencesAkey JM Eberle MA Rieder MJ Carlson CS Shriver MD Nickerson DA

Kruglyak L 2004 Population history and natural selection shapepatterns of genetic variation in 132 genes PLoS Biol 21591ndash1599

Allentoft ME Sikora M Sjogren K-G Rasmussen S Rasmussen MStenderup J Damgaard PB Schroeder H Ahlstrom T Vinner L2015 Population genomics of Bronze Age Eurasia Nature522(7555) 167ndash172

Auton A McVean G 2012 Estimating recombination rates from geneticvariation in humans In Anisimova M ed Evolutionary genomicsMethods in Molecular Biology (Methods and Protocols) vol 856New York Humana Press

Barrett RDH Hoekstra HE 2011 Molecular spandrels tests of adaptationat the genetic level Nat Rev Genet 12(11) 767ndash780

Baudat F Buard J Grey C Fledel-Alon A Ober C Przeworski M Coop GMassy BD 2010 PRDM9 is a major determinant of meiotic recom-bination hotspots in humans and mice Science 327(5967) 836ndash840

Beaumont MA Zhang W Balding DJ 2002 Approximate Bayesian com-putation in population genetics Genetics 162(4) 2025ndash2035

Beleza S Johnson NA Candille SI Absher DM Coram MA Lopes JCampos J Araujo II Anderson TM Vilhjalmsson BJ et al 2013Genetic architecture of skin and eye color in an African-Europeanadmixed population PLoS Genet 9(3) e1003372

Beleza S Santos AM McEvoy B Alves I Martinho C Cameron E ShriverMD Parra EJ Rocha J 2013 The timing of pigmentation lightening inEuropeans Mol Biol Evol 30(1) 24ndash35

Berg JJ Coop G 2015 A coalescent model for a sweep of a uniquestanding variant Genetics 201(2) 707ndash725

Bersaglieri T Sabeti PC Patterson N Vanderploeg T Schaffner SF DrakeJA Rhodes M Reich DE Hirschhorn JN 2004 Genetic signatures ofstrong recent positive selection at the lactase gene Am J Hum Genet74(6) 1111ndash1120

Broman KW Murray JC Sheffield VC White RL Weber JL 1998Comprehensive human genetic maps individual and sex-specificvariation in recombination Am J Hum Genet 63(3) 861ndash869

Bryk J Hardouin E Pugach I Hughes D Strotmann R Stoneking M MylesS 2008 Positive selection in East Asians for an EDAR allele thatenhances NF-jB activation PLoS One 3(5) e2209

Chen H Slatkin M 2013 Inferring selection intensity and allele age frommultilocus haplotype structure G3 3(8) 1429ndash1442

Chen H Hey J Slatkin M 2015 A hidden Markov model for investigatingrecent positive selection through haplotype structure Theor PopulBiol 9918ndash30

Chun S Fay JC 2011 Evidence for hitchhiking of deleterious mutationswithin the human genome PLoS Genet 7(8) e1002240

Coop G Griffiths RC 2004 Ancestral inference on gene trees underselection Theor Popul Biol 66(3) 219ndash232

Coop G Bullaughey K Luca F Przeworski M 2008 The timing of selec-tion at the human FOXP2 gene Mol Biol Evol 25(7) 1257ndash1259

Dalziel AC Rogers SM Schulte PM 2009 Linking genotypes to pheno-types and fitness how mechanistic biology can inform molecularecology Mol Ecol 18(24) 4997ndash5017

Eaton K Edwards M Krithika S Cook G Norton H Parra EJ 2015Association study confirms the role of two OCA2 polymorphismsin normal skin pigmentation variation in East Asian populations AmJ Hum Biol 27 520ndash525

Edwards M Bigham A Tan J Li S Gozdzik A Ross K Jin L Parra EJ 2010Association of the OCA2 polymorphism His615Arg with melanin

content in east Asian populations further evidence of convergentevolution of skin pigmentation PLoS Genet 6(3) e1000867

Elmer KR Meyer A 2011 Adaptation in the age of ecological genomicsinsights from parallelism and convergence Trends Ecol Evol 26(6)298ndash306

Enattah NS Sahi T Savilahti E Terwilliger JD Peltonen L Jeuroarveleuroa I 2002Identification of a variant associated with adult-type hypolactasiaNat Genet 30(2) 233ndash237

Eng MY Luczak SE Wall TL 2007 ALDH2 ADH1B and ADH1C geno-types in Asians a literature review Alcohol Res Health 30(1) 22

Fenner JN 2005 Cross-cultural estimation of the human generationinterval for use in genetics-based population divergence studiesAm J Phys Anthropol 128(2) 415ndash423

Fujimoto A Ohashi J Nishida N Miyagawa T Morishita Y Tsunoda TKimura R Tokunaga K 2008 A replication study confirmed theEDAR gene to be a major contributor to population differentiationregarding head hair thickness in Asia Hum Genet 124(2) 179ndash185

Garud NR Messer PW Buzbas EO Petrov DA 2015 Recent selectivesweeps in North American Drosophila melanogaster show signaturesof soft sweeps PLoS Genet 11(2) e1005004

Griffiths RC Tavare S 1994 Sampling theory for neutral alleles in avarying environment Philos Trans R Soc B Biol Sci 344(1310)403ndash410

Haak W Lazaridis I Patterson N Rohland N Mallick S Llamas B BrandtG Nordenfelt S Harney E Stewardson K 2015 Massive migrationfrom the steppe was a source for Indo-European languages inEurope Nature 522 207ndash211

Hartfield M Otto SP 2011 Recombination and hitchhiking of deleteri-ous alleles Evolution 65(9) 2421ndash2434

Hermisson J Pennings PS 2005 Soft sweeps molecular population ge-netics of adaptation from standing genetic variation Genetics 169(4)2335ndash2352

Hinch AG Tandon A Patterson N Song Y Rohland N Palmer CD ChenGK Wang K Buxbaum SG Akylbekova EL et al 2011 The landscapeof recombination in African Americans Nature 476(7359) 170ndash175

Hudson RR 1990 Gene genealogies and the coalescent process Oxf SurvEvol Biol 7(1) 44

Hudson RR 2007 The variance of coalescent time estimates from DNAsequences J Mol Evol 64(6) 702ndash705

Hudson RR Kaplan NL 1988 The coalescent process in models withselection and recombination Genetics 120(3) 831ndash840

Huff CD Witherspoon DJ Zhang Y Gatenbee C Denson LA KugathasanS Hakonarson H Whiting A Davis CT Wu W et al 2012 Crohnrsquosdisease and genetic hitchhiking at IBD5 Mol Biol Evol 29(1)101ndash111

Innan H Kim Y 2004 Pattern of polymorphism after strong artificialselection in a domestication event Proc Natl Acad Sci U S A 101(29)10667ndash10672

Itan Y Powell A Beaumont MA Burger J Thomas MG 2009 The originsof lactase persistence in Europe PLoS Comput Biol 5(8) e1000491

Jablonski NG Chaplin G 2000 The evolution of human skin coloration JHum Evol 39(1) 57ndash106

Jablonski NG Chaplin G 2012 Human skin pigmentation migration anddisease susceptibility Philos Trans R Soc B Biol Sci 367(1590)785ndash792

Jeong C Di Rienzo A 2014 Adaptations to local environments in mod-ern human populations Curr Opin Genet Dev 291ndash8

Kamberov YG Wang S Tan J Gerbault P Wark A Tan L Yang Y Li STang K Chen H et al 2013 Modeling recent human evolution inmice by expression of a selected EDAR variant Cell 152(4) 691ndash702

Kaplan NL Darden T Hudson RR 1988 The coalescent process inmodels with selection Genetics 120(3) 819ndash829

Kaplan NL Hudson RR Langley CH 1989 The ldquohitchhiking effectrdquo revis-ited Genetics 123(4) 887ndash899

Kimura R Yamaguchi T Takeda M Kondo O Toma T Haneji KHanihara T Matsukusa H Kawamura S Maki K et al 2009 A com-mon variation in EDAR is a genetic determinant of shovel-shapedincisors Am J Hum Genet 85(4) 528ndash535

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1015

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Kong A Thorleifsson G Gudbjartsson DF Masson G Sigurdsson AJonasdottir A Walters GB Jonasdottir A Gylfason A KristinssonKT et al 2010 Fine-scale recombination rate differences betweensexes populations and individuals Nature 467(7319) 1099ndash1103

Kuokkanen M Kokkonen J Enattah NS Ylisaukko-oja T Komu H VariloT Peltonen L Savilahti E Jeuroarveleuroa I 2006 Mutations in the translatedregion of the lactase gene (LCT) underlie congenital lactase defi-ciency Am J Hum Genet 78(2) 339ndash344

Larribe F Fearnhead P 2011 On composite likelihoods in statisticalgenetics Stat Sin 21(1) 43ndash69

Lazaridis I Patterson N Mittnik A Renaud G Mallick S Kirsanow KSudmant PH Schraiber JG Castellano S Lipson M et al 2014Ancient human genomes suggest three ancestral populations forpresent-day Europeans Nature 513(7518) 409ndash413

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357) 493ndash496

Li H Mukherjee N Soundararajan U Tarnok Z Barta C Khaliq SMohyuddin A Kajuna SLB Mehdi SQ Kidd JR et al 2007Geographically separate increases in the frequency of the derivedADH1B 47His allele in eastern and western Asia Am J Hum Genet81(4) 842ndash846

Li N Stephens M 2003 Modeling linkage disequilibrium and identifyingrecombination hotspots using single-nucleotide polymorphismdata Genetics 165(4) 2213ndash2233

Loomis WF 1967 Skin-pigment regulation of vitamin-D biosynthesis inman Science 157(3788) 501ndash506

Mathieson I Lazaridis I Rohland N Mallick S Patterson N RoodenbergSA Harney E Stewardson K Fernandes D Novak M et al 2015Genome-wide patterns of selection in 230 ancient Eurasians Nature528(7583) 499ndash503

McPeek MS Strahs A 1999 Assessment of linkage disequilibrium by thedecay of haplotype sharing with application to fine-scale geneticmapping Am J Hum Genet 65(3) 858ndash875

Meligkotsidou L Fearnhead P 2005 Maximum-likelihood estimationof coalescence times in genealogical trees Genetics 171(4)2073ndash2084

Messer PW Neher RA 2012 Estimating the strength of selective sweepsfrom deep population diversity data Genetics 191(2) 593ndash605

Messer PW Petrov DA 2013 Population genomics of rapid adaptationby soft selective sweeps Trends Ecol Evol 28(11) 659ndash669

Miller CT Beleza S Pollen AA Schluter D Kittles RA Shriver MDKingsley DM 2007 cis-Regulatory changes in kit ligand expressionand parallel evolution of pigmentation in sticklebacks and humansCell 131(6) 1179ndash1189

Morris AP Whittaker JC Balding DJ 2000 Bayesian fine-scale mapping ofdisease loci by hidden Markov models Am J Hum Genet 67(1)155ndash169

Morris AP Whittaker JC Balding DJ 2002 Fine-scale mapping of diseaseloci via shattered coalescent modeling of genealogies Am J HumGenet 70(3) 686ndash707

Nakagome S Alkorta-Aranburu G Amato R Howie B Peter BMHudson RR Di Rienzo A 2016 Estimating the ages of selectionsignals from different epochs in human history Mol Biol Evol33(3) 657ndash669

Nielsen R 2005 Molecular signatures of natural selection Annu RevGenet 39 197ndash218

Olds LC Sibley E 2003 Lactase persistence DNA variant enhances lactasepromoter activity in vitro functional role as a cis regulatory elementHum Mol Genet 12(18) 2333ndash2340

Ormond L Foll M Ewing GB Pfeifer SP Jensen JD 2016 Inferring the ageof a fixed beneficial allele Mol Ecol 25(1) 157ndash169

Osier MV Pakstis AJ Soodyall H Comas D Goldman D Odunsi AOkonofua F Parnas J Schulz LO Bertranpetit J et al 2002 A globalperspective on genetic variation at the ADH genes reveals unusualpatterns of linkage disequilibrium and diversity Am J Hum Genet71(1) 84ndash99

Peng Y Shi H Qi X-B Xiao C-J Zhong H Ma Run-lin Z Su B 2010The ADH1B Arg47His polymorphism in East Asian populations

and expansion of rice domestication in history BMC Evol Biol10(1) 15

Peter BM Huerta-Sanchez E Nielsen R 2012 Distinguishing betweenselective sweeps from standing variation and from a de novo mu-tation PLoS Genet 8(10) e1003011

Przeworski M Coop G Wall JD 2005 The signature of positive selectionon standing genetic variation Evolution 59(11) 2312ndash2323

Price AL Tandon A Patterson N Barnes KC Rafaels N Ruczinski I BeatyTH Mathias R Reich D Myers S 2009 Sensitive detection of chro-mosomal segments of distinct ancestry in admixed populationsPLoS Genet 5(6) e1000519

Pritchard JK Seielstad MT Perez-Lezaun A Feldman MW 1999Population growth of human Y chromosomes a study of Y chro-mosome microsatellites Mol Biol Evol 16(12) 1791ndash1798

Pritchard JK Pickrell JK Coop G 2010 The genetics of human adapta-tion hard sweeps soft sweeps and polygenic adaptation Curr Biol20(4) R208ndashR215

Przeworski M 2003 Estimating the time since the fixation of a beneficialallele Genetics 164(4) 1667ndash1676

Rabiner LR 1989 A tutorial on hidden Markov models and selectedapplications in speech recognition Proc IEEE 77(2) 257ndash286

Radwan J Babik W 2012 The genomics of adaptation Proc R Soc Lond BBiol Sci 279(1749) 5024ndash5028

Rannala B Reeve JP 2001 High-resolution multipoint linkage-disequilibrium mapping in the context of a human genome se-quence Am J Hum Genet 69(1) 159ndash178

Rannala B Reeve JP 2002 Joint Bayesian estimation of mutationlocation and age using linkage disequilibrium Biocomputing2003 526ndash534

Segurel L Wyman MJ Przeworski M 2014 Determinants of mutationrate variation in the human germline Annu Rev Genomics HumGenet 1547ndash70

Simoons FJ 1970 Primary adult lactose intolerance and the milkinghabit a problem in biologic and cultural interrelations Am J DigDis 15(8) 695ndash710

Skoglund P Malmstrom H Omrak A Raghavan M Valdiosera CGunther T Hall P Tambets K Parik J Sjogren K-G et al 2014Genomic diversity and admixture differs for Stone-AgeScandinavian foragers and farmers Science 344(6185) 747ndash750

Slatkin M 2001 Simulating genealogies of selected alleles in a populationof variable size Genet Res 78(01) 49ndash57

Slatkin M 2008 A Bayesian method for jointly estimating allele age andselection intensity Genet Res 90(01) 129ndash137

Slatkin M Hudson RR 1991 Pairwise comparisons of mitochondrialDNA sequences in stable and exponentially growing populationsGenetics 129(2) 555ndash562

Slatkin M Rannala B 2000 Estimating allele age Annu Rev GenomicsHum Genet 1(1) 225ndash249

Smith JM Haigh J 1974 The hitch-hiking effect of a favourable geneGenet Res 23(01) 23ndash35

Stinchcombe JR Hoekstra HE 2008 Combining population genomicsand quantitative genetics finding the genes underlying ecologicallyimportant traits Heredity 100(2) 158ndash170

Tang H Siegmund DO Shen P Oefner PJ Feldman MW 2002Frequentist estimation of coalescence times from nucleotide se-quence data using a tree-based partition Genetics 161(1) 447ndash459

Tavare S Balding DJ Griffiths RC Donnelly P 1997 Inferring coalescencetimes from DNA sequence data Genetics 145(2) 505ndash518

Tennessen JA Bigham AW OrsquoConnor TD Fu W Kenny EE Gravel SMcGee S Do R Liu X Jun G et al 2012 Evolution and functionalimpact of rare coding variation from deep sequencing of humanexomes Science 337(6090) 64ndash69

Thomson R Pritchard JK Shen P Oefner PJ Feldman MW 2000Recent common ancestry of human Y chromosomes evidencefrom DNA sequence data Proc Natl Acad Sci U S A 97(13)7360ndash7365

Tishkoff SA Reed FA Ranciaro A Voight BF Babbitt CC Silverman JSPowell K Mortensen HM Hirbo JB Osman M et al 2007

Smith et al doi101093molbevmsy006 MBE

1016

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Convergent adaptation of human lactase persistence in Africa andEurope Nat Genet 39(1) 31ndash40

Troelsen JT Olsen J Moslashller J SjeuroOstreuroOm H 2003 An upstream polymor-phism associated with lactase persistence has increased enhanceractivity Gastroenterology 125(6) 1686ndash1694

Varin C Reid N Firth D 2011 An overview of composite likelihoodmethods Stat Sin 21(1) 5ndash42

Voight BF Kudaravalli S Wen X Pritchard JK Hurst L 2006 A map ofrecent positive selection in the human genome PLoS Biol 4(3) e72

Wegmann D Kessner DE Veeramah KR Mathias RA Nicolae DL YanekLR Sun YV Torgerson DG Rafaels N Mosley T et al 2011Recombination rates in admixed individuals identified by ancestry-based inference Nat Genet 43(9) 847ndash853

Wilde S Timpson A Kirsanow K Kaiser E Kayser M Unterleuroander MHollfelder N Potekhina ID Schier W Thomas MG et al 2014 Directevidence for positive selection of skin hair and eye pigmentation inEuropeans during the last 5000 years Proc Natl Acad Sci U S A111(13) 4832ndash4837

Williamson SH Hernandez R Fledel-Alon A Zhu L Nielsen RBustamante CD 2005 Simultaneous inference of selection and pop-ulation growth from patterns of variation in the human genomeProc Natl Acad Sci U S A 102(22) 7882ndash7887

Wiuf C 2000 On the genealogy of a sample of neutral rare alleles TheorPopul Biol 58(1) 61ndash75

Wiuf C Donnelly P 1999 Conditional genealogies and the age of aneutral mutant Theor Popul Biol 56(2) 183ndash201

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1017

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

estimate of 10841 (9720ndash12147) in the KHV sample showingthe least When using a fine-scale recombination map all ofthe ADH1B TMRCAs are inferred to be slightly older (supple-mentary fig 7 Supplementary Material online)

EDARPopulation genomic studies have repeatedly identified thegene EDAR to be under recent selection in East Asians(Akey et al 2004 Williamson et al 2005 Voight et al 2006)with a particular site (rs3827760) showing strong evidence forbeing the putative target Functional assays and allele specificexpression differences at this position show phenotypic asso-ciations to a variety of phenotypes including hair thicknessand dental morphology (Bryk et al 2008 Fujimoto et al 2008Kimura et al 2009 Kamberov et al 2013)

Our estimate of 22192 (19683ndash25736) years for the EDARallele in the CHB sample is older than ABC-based estimates of12458 (1314ndash85835) and 13224 (4899ndash50692) years madeby Bryk et al (2008) and Peter et al (2012) respectivelyKamberov et al (2013) use spatially explicit ABC and maxi-mum likelihood approaches to compute older estimates of30925 (13175ndash39575) and 36490 (34775ndash38204) We in-cluded all populations for which the variant is present includ-ing the FIN and BEB samples where it exists at low frequencyOur results for the youngest TMRCAs are found in these twolow frequency populations where the estimate in FIN is17386 (13887ndash20794) and the estimate in BEB is 18370(14325ndash22872) Among East Asian populations the oldestand youngest TMRCA estimates are found in the KHV sam-ple (25683 23169ndash28380) and CHB sample (22192 19683ndash25736)

LCTArguably the best-studied signature of selection in humans isfor an allele at the LCT gene (rs4988235) which confers lactasepersistence into adulthoodmdasha trait unique among mammalsand which is thought to be a result of cattle domesticationand the incorporation of milk into the adult diets of severalhuman populations (Enattah et al 2002 Bersaglieri et al 2004Tishkoff et al 2007) There are multiple alleles that showassociation with lactase persistence (Tishkoff et al 2007)We focused on estimating the age of the T-13910 allele pri-marily found at high frequency among Northern Europeansbut which is also found in South Asian populations In addi-tion to association with the lactase persistence phenotypethis allele has been functionally verified by in vitro experi-ments (Olds and Sibley 2003 Troelsen et al 2003Kuokkanen et al 2006)

Mathieson et al (2015) use ancient DNA collected from83 human samples to get a better understanding of the fre-quency trajectory for several adaptive alleles spanning a timescale of 8000 years For the LCT persistence allele (rs4988235)they find a sharp increase in frequency in the past 4000 yearsago While this is more recent than previous estimates anearlier TMRCA or time of mutation is still compatible withthis scenario

Our estimates using European and South Asian samplesfall between the range from 5000 to 10000 years ago which isbroadly consistent with age estimates from modern data Thecredible intervals for estimates in all of the samples havesubstantial overlap which makes any ranking on the basisof point estimates difficult We infer the PJL (Punjabi fromLahore Pakistan) sample to have the oldest TMRCA estimateof 9514 (8596ndash10383) years Itan et al (2009) use spatialmodelling to infer the geographic spread of lactase allelefrom northern to southern Europe Consistent with theirresults the youngest estimate among European populationsis found in the IBS sample at 9341 (8688ndash9989) yearsAmong all samples the youngest estimate was found inBEB at 6869 (5143ndash8809)

KITLG and OCA2The genetic basis and natural history of human skin pigmen-tation is a well-studied system with several alleles of majoreffect showing signatures consistent with being targets ofrecent selection (Jablonski and Chaplin 2012 Beleza Santoset al 2013 Wilde et al 2014 Eaton et al 2015) We focused onan allele found at high-frequency world-wide among non-African populations at the KITLG locus (rs642742) whichshows significant effects on skin pigmentation differences be-tween Europeans and Africans (Miller et al 2007) althoughmore recent work fails to find any contribution of KITLGtoward variation in skin pigmentation in a Cape VerdeAfrican-European admixed population (Beleza Johnsonet al 2013) We also estimated the TMRCA for a melanin-associated allele at the OCA2 locus (rs1800414) which is onlyfound among East Asian populations at high frequency(Edwards et al 2010)

For the KITLG variant our estimates among differentpopulations vary from 18000 to 34000 years ago with theoldest age being in the YRI (Yoruba in Ibadan Nigeria)sample (33948 28861ndash39099) The youngest TMRCA isfound in FIN at 18733 years (16675ndash20816) The nexttwo youngest estimates are also found in Africa withthe TMRCA in the MSL (Mende in Sierra Leone) samplebeing 22340 (15723ndash28950) years old and that for LWK(Luhya in Webuye Kenya) being 22784 (17922ndash28012)years old suggesting a more complex history than amodel of a simple allele frequency increase outside ofAfrica due to pigmentation related selection pressuresPrevious point estimates using rejection samplingapproaches on a Portuguese sample (32277 6003ndash80683) and East Asian sample (32045 6032ndash98165)are again most consistent with our own results on theIBS (29731 26170ndash32813) and CHB samples (2677324297ndash30141) (Beleza Santos et al 2013 Chen et al2015) Among East Asians the oldest and youngest esti-mates are again found in the JPT (28637 24297ndash30141)and KHV (24544 21643ndash27193) samples respectivelyThe TMRCA for OCA2 alleles in the JPT (1859916110ndash20786) and KHV (16370 14439ndash18102) samplesare also the oldest and youngest respectively

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1011

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

DiscussionOur method improves estimation for the timing of selectionon a beneficial allele using a tractable model of haplotypeevolution This approach leverages detailed information inthe data while remaining amenable to large sample sizesUsing both carriers and noncarriers of the allele we canmore effectively account for uncertainty in the extent ofthe ancestral haplotype and derived mutations We showthe performance of our method using simulations of differentselection strengths beneficial allele frequencies and choicesof reference panel By applying our method to five variantspreviously identified as targets of selection in human popu-lations we provide a comparison among population-specificTMRCAs This gives a more detailed account of the order inwhich populations may have acquired the variant andorexperienced selection for the trait it underlies

In that regard it is hypothesized that local selection pres-sures and a cultural shift toward agrarian societies have in-duced adaptive responses among human populations aroundthe globe The data associated with some variants seem toindicate more recent selective events than others Our resultsfor variants associated with dietary traits at the LCT andADH1B genes both imply relatively recent TMRCAs(lt 15 000 years ago) consistent with hypotheses that selec-tion on these mutations results from recent changes in hu-man diet following the spread of agriculture (Simoons 1970Peng et al 2010) In contrast the inferred TMRCAs for EDARKITLG and OCA2 imply older adaptive events which mayhave coincided more closely with the habitation of new envi-ronments or other cultural changes

Several hypotheses have been suggested to describe theselective drivers of skin pigmentation differences among hu-man populations including reduced UV radiation at highlatitudes and vitamin D deficiency (Loomis 1967 Jablonskiand Chaplin 2000) Estimated TMRCAs for the variants atthe OCA2 and EDAR loci among East Asians appear to beas young or younger than the KITLG variant but older thanthe LCT and ADH1B locus This suggests a selective history inEast Asian populations leading to adaptive responses forthese traits occurring after an initial colonization In somecases the dispersion of replicate MCMC estimates makes itdifficult to describe the historical significance of an observedorder for TMRCA values However the consistency of esti-mates among different populations for particular variantsadds some confidence to our modelrsquos ability to reproducethe ages which are relevant to those loci or certain geographicregions

To assess the relative concordance of our estimates withthose from previous approaches we compared our resultswith a compilation of previously published estimates basedon the time of mutation time since fixation or TMRCA ofvariants associated with the genes studied here (supplemen-tary fig 8 Supplementary Material online) The range of con-fidence intervals for these studies is largely a reflection of theassumptions invoked or relaxed for any one method as wellas the sample size and quality of the data used In principleour method extracts more information than approaches that

use summary statistics such as ABC In our empirical appli-cation we found that our method provides a gain in accuracywhile accounting for uncertainty in both the ancestral hap-lotype and its length on each chromosome Notably ourmethod provides narrower credible intervals by incorporatingthe full information from ancestral haplotype lengths derivedmutations and a reference panel of noncarrier haplotypes

Another caveat of our method is its dependence on thereference panel which is intended to serve as a representativesample of nonancestral haplotypes in the population duringthe selected allelersquos increase in frequency Four possible chal-lenges can arise 1) Segments of the ancestral selected haplo-type may be present in the reference panel due torecombination (this is more likely for alleles that havereached higher frequency) 2) the reference panel may con-tain haplotypes that are similar to the ancestral haplotypedue to low levels of genetic diversity 3) the reference panelmay be too diverged from the focal population and 4) pop-ulation connectivity and turnover may lead the ldquolocalrdquo refer-ence panel to be largely composed of migrant haplotypeswhich were not present during the selected allelersquos initial in-crease in frequency

Under scenarios 1 and 2 the background haplotypes willbe too similar to the ancestral haplotype and it may be dif-ficult for the model to discern a specific ancestry switch lo-cation This leads to fewer differences (mutations) thanexpected between the ancestral haplotype and each benefi-cial allele carrier The simulation results are consistent withthis scenario Our method tends to underestimate the trueage across a range of selection intensities and allele frequen-cies when using a local reference panel

Conversely under scenarios 3 and 4 the model will fail todescribe a recombinant haplotype in the sample of beneficialallele carriers as a mosaic of haplotypes in the reference panelAs a result the model will infer more mutation events toexplain observed differences from the ancestral haplotypeOur simulation results show this to be the case with referencepanels diverged by N generations Posterior mean estimatesare consistently older than their true value Our simulationsare perhaps pessimistic thoughmdashwe chose reference paneldivergence times of N and 05N generations approximatelycorresponding to FST values of 04 and 02 respectively Forthe smaller FST values observed in humans we expect resultsfor diverged panels to be closer to those obtained with thelocal reference panel Nonetheless future extensions to incor-porate multiple populations within the reference panel wouldbe helpful and possible by modifying the approach of Priceet al (2009) Such an approach would also enable the analysisof admixed populations (we excluded admixed samples fromour analysis of the 1000 Genomes data above)

Aside from the challenges imposed by the choice of refer-ence panel another potential source of bias lies in our tran-sition probabilities which are not conditioned on thefrequency of the selected variant In reality recombinationevents at some distance away from the selected site will onlyresult in a switch from the ancestral to background haplo-types at a rate proportional to 1 pl where pl is the fre-quency of the ancestral haplotype alleles at locus l In this way

Smith et al doi101093molbevmsy006 MBE

1012

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

some recombination events may go unobservedmdashas thebeneficial allele goes to high frequency the probability of arecombination event leading to an observable ancestral tobackground haplotype transition decreases One solutionmay be to include the frequency-dependent transition prob-abilities derived by Chen et al (2015) Under their model themutation time is estimated by assuming a deterministic lo-gistic frequency trajectory starting at 1

2N An additional benefitof using frequency trajectories would be the ability to inferposterior distributions on selection coefficients While theselection coefficient is typically assumed to be related tothe time since mutation by t1 frac14 logethNsTHORN=s we do nothave an equivalent expression for time to the common an-cestor Rather than the initial frequency being 1

2N for a newmutation our initial frequency must correspond to the fre-quency at which the TMRCA occurs Griffiths and Tavare(1994) derive a framework to model a genealogy under arbi-trary population size trajectories which should be analogousto the problem of an allele frequency trajectory and addi-tional theory on intra-allelic genealogies may be useful here aswell (Griffiths and Tavare 1994 Wiuf and Donnelly 1999 Wiuf2000 Slatkin and Rannala 2000)

Our model also assumes independence among all haplo-types in the sample in a composite-likelihood frameworkwhich is equivalent to assuming a star-genealogy (Larribeand Fearnhead 2011 Varin et al 2011) This is unlikely tobe the case when sample sizes are large or the TMRCA isold It is also unlikely to be true if the beneficial allele existedon multiple haplotypes preceding the onset of selection wasintroduced by multiple migrant haplotypes from other pop-ulations or occurred by multiple independent mutationevents (Innan and Kim 2004 Hermisson and Pennings2005 Przeworski et al 2005 Pritchard et al 2010 Berg andCoop 2015) Methods for distinguishing selection from stand-ing variation versus de novo mutation are available thatshould make it easier distinguish these cases (Messer andNeher 2012 Peter et al 2012 Messer and Petrov 2013Garud et al 2015)

If the underlying allelic genealogy is not star-like onecan expect different estimates of the TMRCA for differentsubsets of the data Here we performed multiple MCMCson resampled subsets of the data to informally diagnosewhether there are violations from the star-shape geneal-ogy assumption We speculate that exactly how theTMRCAs vary may provide insight to the underlyinghistory In cases where the TMRCA estimates for a par-ticular population are old and more variable than otherpopulations the results may be explained by structure inthe genealogy whereby recent coalescent events haveoccurred among the same ancestral haplotype beforethe common ancestor When estimates are dispersedamong resampled data sets the presence of multiple an-cestral haplotypes prior to the variantrsquos increase in fre-quency may be a better explanation Further support forthis explanation might come from comparisons to otherpopulation samples which show little to no dispersion ofestimates from resampled data sets Future work mightmake it possible to formalize this inference process

A final caveat regards the misspecification of mutation andrecombination rates TMRCA estimates are largely deter-mined by the use of accurate measures for these two param-eters In a way this provides some robustness to our methodOur age estimates depend on mutation and recombinationrates so accurate specification for one of the values can com-pensate for slight misspecification in the other In cases wherea fine-scale recombination map is unavailable we suggest us-ing a uniform recombination rate specific to the locus ofinterest (see Materials and Methods and supplementaryfig 7 Supplementary Material online) Choosing an appropri-ate mutation rate will continue to depend on current andfuture work which tries to resolve discrepancies in publishedmutation rate estimates inferred by various approaches(Segurel et al 2014)

One future direction for our method may be to explicitlyincorporate the possibility of multiple ancestral haplotypeswithin the sample Under a disease mapping frameworkMorris et al (2002) implement a similar idea in the case whereindependent disease causing mutations arise at the same lo-cus leading to independent genealogies for which they cointhe term ldquoshattered coalescentrdquo For our case beneficialmutations may also be independently derived on differenthaplotypes Alternatively a single mutation may be oldenough to reside on different haplotypes due to a sufficientamount of linked variation existing prior to the onset of se-lection Berg and Coop (2015) model selection from standingvariation to derive the distribution of haplotypes that theselected allele is present on

While we have treated the TMRCA as a parameter of in-terest our method also produces a sample of the posteriordistribution on the ancestral haplotype This could provideuseful information to estimate the frequency spectrum ofderived mutations on the ancestral haplotype Similarly thefrequency of shared recombination breakpoints could shedlight on the genealogy and how well it conforms to the star-shape assumption The extent of the ancestral haplotype ineach individual may also prove useful for identifying delete-rious alleles that have increased in frequency as a result ofstrong positive selection on linked beneficial alleles (Chun andFay 2011 Hartfield and Otto 2011) For example Huff et al(2012) describe a risk allele for Crohnrsquos disease at high fre-quency in European populations which they suggest is linkedto a beneficial allele under recent selection Similar to anadmixture mapping approach our method could be usedto identify risk loci by testing for an association betweenthe ancestral haplotype and disease status As another appli-cation identifying the ancestral haplotype may be useful inthe context of identifying a source population (or species) fora beneficial allele prior to its introduction and subsequentincrease in frequency in the recipient population

In many cases the specific site under selection may beunknown or restricted to some set of putative sites Whileour method requires the position of the selected site to bespecified future extensions could treat the selected site as arandom variable to be estimated under the same MCMCframework This framework would also be amenable to mar-ginalizing over uncertainty on the selected site

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1013

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

While we focus here on inference from modern DNAdata the increased accessibility of ancient DNA has addeda new dimension to population genetic data sets (Lazaridiset al 2014 Skoglund et al 2014 Allentoft et al 2015 Haaket al 2015 Mathieson et al 2015) Because it will remaindifficult to use ancient DNA approaches in many specieswith poor sample preservation we believe methods basedon modern DNA will continue to be useful going forwardThat said ancient DNA is providing an interesting avenue forcomparative work between inference from modern and an-cient samples For example Nakagome et al (2016) usesimulations to assess the fit of this ancient DNA polymor-phism to data simulated under their inferred parametervalues for allele age and selection intensity and they findreasonable agreement Much work still remains to fullyleverage ancient samples into population genetic inferencewhile accounting for new sources of uncertainty and poten-tial for sampling bias

Despite these challenges it is clear that our understandingof adaptive history will continue to benefit from new com-putational tools which extract insightful information from adiverse set of data sources

Materials and MethodsWe generated data using the software mssel (Hudson R per-sonal communication) which simulates a sample of haplo-types conditioned on the frequency trajectory of a selectedvariant under the structured coalescent (Hudson and Kaplan1988 Kaplan et al 1988) Trajectories were first simulatedforwards in time under a WrightndashFisher model for an additivelocus with varying strengths of selection and different endingfrequencies of the selected variant Trajectories were thentruncated to end at the first time the allele reaches a specifiedfrequency See supplementary table 1 SupplementaryMaterial online for relative ages of simulated TMRCA val-ues for different end frequencies and selection strengthsFor the results in figure 2 100 simulations were performedfor each parameter combination MCMCs were run for5000 iterations with a burn-in excluding the first 3000iterations A standard deviation of 10 was used for theproposal distribution of tca The red boxplots indicate localreference panels The blue and green boxplots indicate ref-erence panels diverged by 5N generations and 1N gener-ations respectively Each data set was simulated for a 1-Mblocus with a mutation rate of 1 108 recombination rateof 1 108

and population size of 10000 Sample sizesfor the selected and reference panels were 100 and 20respectively

For more efficient run times of the MCMC we set a max-imum number of individuals to include in the selected andreference panels to be 100 and 20 respectively In cases wherethe true number of haplotypes for either panel was greaterthan this in the full data set we resampled a subset ofhaplotypes from each population for a total of five replicatesper population For simulation results supporting the use ofthis resampling strategy see supplementary figure 5Supplementary Material online The MCMCs were run for

15000 iterations with a standard deviation of 20 for theTMRCA proposal distribution The first 9000 iterationswere removed as a burn-in leading to 6000 iterations for asample of the posterior Convergence was assessed by com-parison of MCMC replicates Figure 3 and supplementaryfigure 8 Supplementary Material online show the resultsfor all five variants along with previous point estimates and95 confidence intervals assuming a generation time of29 years (Fenner 2005) Supplementary table 3Supplementary Material online lists the mean and 95 cred-ible intervals for estimates with the highest mean posteriorprobability which we refer to in the text belowSupplementary table 4 Supplementary Material online liststhe previous estimates and confidence intervals with addi-tional details of the different approaches taken

To model recombination rate variation we used recombi-nation rates from the Decode sex-averaged recombinationmap inferred from pedigrees among families in Iceland(Kong et al 2010) Because some populations may have re-combination maps which differ from the Decode map at finescales we used a mean uniform recombination rate inferredfrom the 1-Mb region surrounding each variant The motiva-tion for this arises from how recombination rates have beenpreviously shown to remain relatively consistent among re-combination maps inferred for different populations at themegabase-scale (Broman et al 1998 Baudat et al 2010 Konget al 2010 Auton and McVean 2012) Further we found ourestimates depend mostly on having the megabase-scale re-combination rate appropriately set with little difference inmost cases for estimates obtained by modeling fine-scale re-combination at each locus (supplementary fig 7Supplementary Material online) We specify the switchingrate among background haplotypes after recombining off ofthe ancestral haplotype to be 4Nr where Nfrac14 10000 and r isthe mean recombination rate for the 1-Mb locus

For modeling mutation a challenge is that previous mu-tation rate estimates vary depending on the approach used(Segurel et al 2014) Estimates using the neutral substitutionrate between humans and chimps are more than 2 108

per bp per generation while estimates using whole-genomesequencing are closer to 1 108 As a compromise wespecify a mutation rate of 16 108

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe would like to thank Hussein Al-Asadi Arjun BiddandaAnna Di Rienzo Dick Hudson Choongwon Jeong Evan KochJoseph Marcus Shigeki Nakagome Ben Peter Mark ReppelAlex White members of the Coop laboratory at UC Davismembers of the Przeworski laboratory at ColumbiaUniversity and members of the He and Stephens laboratoriesat the University of Chicago for helpful comments JS wassupported by an NSF Graduate Research Fellowship andNational Institute of General Medical Sciences of the

Smith et al doi101093molbevmsy006 MBE

1014

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

National Institutes of Health under award numbers DGE-1144082 and T32GM007197 respectively This work wasalso supported by the National Institute of General MedicalSciences of the National Institutes of Health under awardnumbers RO1GM83098 and RO1GM107374 to GC as wellas R01HG007089 to JN

ReferencesAkey JM Eberle MA Rieder MJ Carlson CS Shriver MD Nickerson DA

Kruglyak L 2004 Population history and natural selection shapepatterns of genetic variation in 132 genes PLoS Biol 21591ndash1599

Allentoft ME Sikora M Sjogren K-G Rasmussen S Rasmussen MStenderup J Damgaard PB Schroeder H Ahlstrom T Vinner L2015 Population genomics of Bronze Age Eurasia Nature522(7555) 167ndash172

Auton A McVean G 2012 Estimating recombination rates from geneticvariation in humans In Anisimova M ed Evolutionary genomicsMethods in Molecular Biology (Methods and Protocols) vol 856New York Humana Press

Barrett RDH Hoekstra HE 2011 Molecular spandrels tests of adaptationat the genetic level Nat Rev Genet 12(11) 767ndash780

Baudat F Buard J Grey C Fledel-Alon A Ober C Przeworski M Coop GMassy BD 2010 PRDM9 is a major determinant of meiotic recom-bination hotspots in humans and mice Science 327(5967) 836ndash840

Beaumont MA Zhang W Balding DJ 2002 Approximate Bayesian com-putation in population genetics Genetics 162(4) 2025ndash2035

Beleza S Johnson NA Candille SI Absher DM Coram MA Lopes JCampos J Araujo II Anderson TM Vilhjalmsson BJ et al 2013Genetic architecture of skin and eye color in an African-Europeanadmixed population PLoS Genet 9(3) e1003372

Beleza S Santos AM McEvoy B Alves I Martinho C Cameron E ShriverMD Parra EJ Rocha J 2013 The timing of pigmentation lightening inEuropeans Mol Biol Evol 30(1) 24ndash35

Berg JJ Coop G 2015 A coalescent model for a sweep of a uniquestanding variant Genetics 201(2) 707ndash725

Bersaglieri T Sabeti PC Patterson N Vanderploeg T Schaffner SF DrakeJA Rhodes M Reich DE Hirschhorn JN 2004 Genetic signatures ofstrong recent positive selection at the lactase gene Am J Hum Genet74(6) 1111ndash1120

Broman KW Murray JC Sheffield VC White RL Weber JL 1998Comprehensive human genetic maps individual and sex-specificvariation in recombination Am J Hum Genet 63(3) 861ndash869

Bryk J Hardouin E Pugach I Hughes D Strotmann R Stoneking M MylesS 2008 Positive selection in East Asians for an EDAR allele thatenhances NF-jB activation PLoS One 3(5) e2209

Chen H Slatkin M 2013 Inferring selection intensity and allele age frommultilocus haplotype structure G3 3(8) 1429ndash1442

Chen H Hey J Slatkin M 2015 A hidden Markov model for investigatingrecent positive selection through haplotype structure Theor PopulBiol 9918ndash30

Chun S Fay JC 2011 Evidence for hitchhiking of deleterious mutationswithin the human genome PLoS Genet 7(8) e1002240

Coop G Griffiths RC 2004 Ancestral inference on gene trees underselection Theor Popul Biol 66(3) 219ndash232

Coop G Bullaughey K Luca F Przeworski M 2008 The timing of selec-tion at the human FOXP2 gene Mol Biol Evol 25(7) 1257ndash1259

Dalziel AC Rogers SM Schulte PM 2009 Linking genotypes to pheno-types and fitness how mechanistic biology can inform molecularecology Mol Ecol 18(24) 4997ndash5017

Eaton K Edwards M Krithika S Cook G Norton H Parra EJ 2015Association study confirms the role of two OCA2 polymorphismsin normal skin pigmentation variation in East Asian populations AmJ Hum Biol 27 520ndash525

Edwards M Bigham A Tan J Li S Gozdzik A Ross K Jin L Parra EJ 2010Association of the OCA2 polymorphism His615Arg with melanin

content in east Asian populations further evidence of convergentevolution of skin pigmentation PLoS Genet 6(3) e1000867

Elmer KR Meyer A 2011 Adaptation in the age of ecological genomicsinsights from parallelism and convergence Trends Ecol Evol 26(6)298ndash306

Enattah NS Sahi T Savilahti E Terwilliger JD Peltonen L Jeuroarveleuroa I 2002Identification of a variant associated with adult-type hypolactasiaNat Genet 30(2) 233ndash237

Eng MY Luczak SE Wall TL 2007 ALDH2 ADH1B and ADH1C geno-types in Asians a literature review Alcohol Res Health 30(1) 22

Fenner JN 2005 Cross-cultural estimation of the human generationinterval for use in genetics-based population divergence studiesAm J Phys Anthropol 128(2) 415ndash423

Fujimoto A Ohashi J Nishida N Miyagawa T Morishita Y Tsunoda TKimura R Tokunaga K 2008 A replication study confirmed theEDAR gene to be a major contributor to population differentiationregarding head hair thickness in Asia Hum Genet 124(2) 179ndash185

Garud NR Messer PW Buzbas EO Petrov DA 2015 Recent selectivesweeps in North American Drosophila melanogaster show signaturesof soft sweeps PLoS Genet 11(2) e1005004

Griffiths RC Tavare S 1994 Sampling theory for neutral alleles in avarying environment Philos Trans R Soc B Biol Sci 344(1310)403ndash410

Haak W Lazaridis I Patterson N Rohland N Mallick S Llamas B BrandtG Nordenfelt S Harney E Stewardson K 2015 Massive migrationfrom the steppe was a source for Indo-European languages inEurope Nature 522 207ndash211

Hartfield M Otto SP 2011 Recombination and hitchhiking of deleteri-ous alleles Evolution 65(9) 2421ndash2434

Hermisson J Pennings PS 2005 Soft sweeps molecular population ge-netics of adaptation from standing genetic variation Genetics 169(4)2335ndash2352

Hinch AG Tandon A Patterson N Song Y Rohland N Palmer CD ChenGK Wang K Buxbaum SG Akylbekova EL et al 2011 The landscapeof recombination in African Americans Nature 476(7359) 170ndash175

Hudson RR 1990 Gene genealogies and the coalescent process Oxf SurvEvol Biol 7(1) 44

Hudson RR 2007 The variance of coalescent time estimates from DNAsequences J Mol Evol 64(6) 702ndash705

Hudson RR Kaplan NL 1988 The coalescent process in models withselection and recombination Genetics 120(3) 831ndash840

Huff CD Witherspoon DJ Zhang Y Gatenbee C Denson LA KugathasanS Hakonarson H Whiting A Davis CT Wu W et al 2012 Crohnrsquosdisease and genetic hitchhiking at IBD5 Mol Biol Evol 29(1)101ndash111

Innan H Kim Y 2004 Pattern of polymorphism after strong artificialselection in a domestication event Proc Natl Acad Sci U S A 101(29)10667ndash10672

Itan Y Powell A Beaumont MA Burger J Thomas MG 2009 The originsof lactase persistence in Europe PLoS Comput Biol 5(8) e1000491

Jablonski NG Chaplin G 2000 The evolution of human skin coloration JHum Evol 39(1) 57ndash106

Jablonski NG Chaplin G 2012 Human skin pigmentation migration anddisease susceptibility Philos Trans R Soc B Biol Sci 367(1590)785ndash792

Jeong C Di Rienzo A 2014 Adaptations to local environments in mod-ern human populations Curr Opin Genet Dev 291ndash8

Kamberov YG Wang S Tan J Gerbault P Wark A Tan L Yang Y Li STang K Chen H et al 2013 Modeling recent human evolution inmice by expression of a selected EDAR variant Cell 152(4) 691ndash702

Kaplan NL Darden T Hudson RR 1988 The coalescent process inmodels with selection Genetics 120(3) 819ndash829

Kaplan NL Hudson RR Langley CH 1989 The ldquohitchhiking effectrdquo revis-ited Genetics 123(4) 887ndash899

Kimura R Yamaguchi T Takeda M Kondo O Toma T Haneji KHanihara T Matsukusa H Kawamura S Maki K et al 2009 A com-mon variation in EDAR is a genetic determinant of shovel-shapedincisors Am J Hum Genet 85(4) 528ndash535

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1015

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Kong A Thorleifsson G Gudbjartsson DF Masson G Sigurdsson AJonasdottir A Walters GB Jonasdottir A Gylfason A KristinssonKT et al 2010 Fine-scale recombination rate differences betweensexes populations and individuals Nature 467(7319) 1099ndash1103

Kuokkanen M Kokkonen J Enattah NS Ylisaukko-oja T Komu H VariloT Peltonen L Savilahti E Jeuroarveleuroa I 2006 Mutations in the translatedregion of the lactase gene (LCT) underlie congenital lactase defi-ciency Am J Hum Genet 78(2) 339ndash344

Larribe F Fearnhead P 2011 On composite likelihoods in statisticalgenetics Stat Sin 21(1) 43ndash69

Lazaridis I Patterson N Mittnik A Renaud G Mallick S Kirsanow KSudmant PH Schraiber JG Castellano S Lipson M et al 2014Ancient human genomes suggest three ancestral populations forpresent-day Europeans Nature 513(7518) 409ndash413

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357) 493ndash496

Li H Mukherjee N Soundararajan U Tarnok Z Barta C Khaliq SMohyuddin A Kajuna SLB Mehdi SQ Kidd JR et al 2007Geographically separate increases in the frequency of the derivedADH1B 47His allele in eastern and western Asia Am J Hum Genet81(4) 842ndash846

Li N Stephens M 2003 Modeling linkage disequilibrium and identifyingrecombination hotspots using single-nucleotide polymorphismdata Genetics 165(4) 2213ndash2233

Loomis WF 1967 Skin-pigment regulation of vitamin-D biosynthesis inman Science 157(3788) 501ndash506

Mathieson I Lazaridis I Rohland N Mallick S Patterson N RoodenbergSA Harney E Stewardson K Fernandes D Novak M et al 2015Genome-wide patterns of selection in 230 ancient Eurasians Nature528(7583) 499ndash503

McPeek MS Strahs A 1999 Assessment of linkage disequilibrium by thedecay of haplotype sharing with application to fine-scale geneticmapping Am J Hum Genet 65(3) 858ndash875

Meligkotsidou L Fearnhead P 2005 Maximum-likelihood estimationof coalescence times in genealogical trees Genetics 171(4)2073ndash2084

Messer PW Neher RA 2012 Estimating the strength of selective sweepsfrom deep population diversity data Genetics 191(2) 593ndash605

Messer PW Petrov DA 2013 Population genomics of rapid adaptationby soft selective sweeps Trends Ecol Evol 28(11) 659ndash669

Miller CT Beleza S Pollen AA Schluter D Kittles RA Shriver MDKingsley DM 2007 cis-Regulatory changes in kit ligand expressionand parallel evolution of pigmentation in sticklebacks and humansCell 131(6) 1179ndash1189

Morris AP Whittaker JC Balding DJ 2000 Bayesian fine-scale mapping ofdisease loci by hidden Markov models Am J Hum Genet 67(1)155ndash169

Morris AP Whittaker JC Balding DJ 2002 Fine-scale mapping of diseaseloci via shattered coalescent modeling of genealogies Am J HumGenet 70(3) 686ndash707

Nakagome S Alkorta-Aranburu G Amato R Howie B Peter BMHudson RR Di Rienzo A 2016 Estimating the ages of selectionsignals from different epochs in human history Mol Biol Evol33(3) 657ndash669

Nielsen R 2005 Molecular signatures of natural selection Annu RevGenet 39 197ndash218

Olds LC Sibley E 2003 Lactase persistence DNA variant enhances lactasepromoter activity in vitro functional role as a cis regulatory elementHum Mol Genet 12(18) 2333ndash2340

Ormond L Foll M Ewing GB Pfeifer SP Jensen JD 2016 Inferring the ageof a fixed beneficial allele Mol Ecol 25(1) 157ndash169

Osier MV Pakstis AJ Soodyall H Comas D Goldman D Odunsi AOkonofua F Parnas J Schulz LO Bertranpetit J et al 2002 A globalperspective on genetic variation at the ADH genes reveals unusualpatterns of linkage disequilibrium and diversity Am J Hum Genet71(1) 84ndash99

Peng Y Shi H Qi X-B Xiao C-J Zhong H Ma Run-lin Z Su B 2010The ADH1B Arg47His polymorphism in East Asian populations

and expansion of rice domestication in history BMC Evol Biol10(1) 15

Peter BM Huerta-Sanchez E Nielsen R 2012 Distinguishing betweenselective sweeps from standing variation and from a de novo mu-tation PLoS Genet 8(10) e1003011

Przeworski M Coop G Wall JD 2005 The signature of positive selectionon standing genetic variation Evolution 59(11) 2312ndash2323

Price AL Tandon A Patterson N Barnes KC Rafaels N Ruczinski I BeatyTH Mathias R Reich D Myers S 2009 Sensitive detection of chro-mosomal segments of distinct ancestry in admixed populationsPLoS Genet 5(6) e1000519

Pritchard JK Seielstad MT Perez-Lezaun A Feldman MW 1999Population growth of human Y chromosomes a study of Y chro-mosome microsatellites Mol Biol Evol 16(12) 1791ndash1798

Pritchard JK Pickrell JK Coop G 2010 The genetics of human adapta-tion hard sweeps soft sweeps and polygenic adaptation Curr Biol20(4) R208ndashR215

Przeworski M 2003 Estimating the time since the fixation of a beneficialallele Genetics 164(4) 1667ndash1676

Rabiner LR 1989 A tutorial on hidden Markov models and selectedapplications in speech recognition Proc IEEE 77(2) 257ndash286

Radwan J Babik W 2012 The genomics of adaptation Proc R Soc Lond BBiol Sci 279(1749) 5024ndash5028

Rannala B Reeve JP 2001 High-resolution multipoint linkage-disequilibrium mapping in the context of a human genome se-quence Am J Hum Genet 69(1) 159ndash178

Rannala B Reeve JP 2002 Joint Bayesian estimation of mutationlocation and age using linkage disequilibrium Biocomputing2003 526ndash534

Segurel L Wyman MJ Przeworski M 2014 Determinants of mutationrate variation in the human germline Annu Rev Genomics HumGenet 1547ndash70

Simoons FJ 1970 Primary adult lactose intolerance and the milkinghabit a problem in biologic and cultural interrelations Am J DigDis 15(8) 695ndash710

Skoglund P Malmstrom H Omrak A Raghavan M Valdiosera CGunther T Hall P Tambets K Parik J Sjogren K-G et al 2014Genomic diversity and admixture differs for Stone-AgeScandinavian foragers and farmers Science 344(6185) 747ndash750

Slatkin M 2001 Simulating genealogies of selected alleles in a populationof variable size Genet Res 78(01) 49ndash57

Slatkin M 2008 A Bayesian method for jointly estimating allele age andselection intensity Genet Res 90(01) 129ndash137

Slatkin M Hudson RR 1991 Pairwise comparisons of mitochondrialDNA sequences in stable and exponentially growing populationsGenetics 129(2) 555ndash562

Slatkin M Rannala B 2000 Estimating allele age Annu Rev GenomicsHum Genet 1(1) 225ndash249

Smith JM Haigh J 1974 The hitch-hiking effect of a favourable geneGenet Res 23(01) 23ndash35

Stinchcombe JR Hoekstra HE 2008 Combining population genomicsand quantitative genetics finding the genes underlying ecologicallyimportant traits Heredity 100(2) 158ndash170

Tang H Siegmund DO Shen P Oefner PJ Feldman MW 2002Frequentist estimation of coalescence times from nucleotide se-quence data using a tree-based partition Genetics 161(1) 447ndash459

Tavare S Balding DJ Griffiths RC Donnelly P 1997 Inferring coalescencetimes from DNA sequence data Genetics 145(2) 505ndash518

Tennessen JA Bigham AW OrsquoConnor TD Fu W Kenny EE Gravel SMcGee S Do R Liu X Jun G et al 2012 Evolution and functionalimpact of rare coding variation from deep sequencing of humanexomes Science 337(6090) 64ndash69

Thomson R Pritchard JK Shen P Oefner PJ Feldman MW 2000Recent common ancestry of human Y chromosomes evidencefrom DNA sequence data Proc Natl Acad Sci U S A 97(13)7360ndash7365

Tishkoff SA Reed FA Ranciaro A Voight BF Babbitt CC Silverman JSPowell K Mortensen HM Hirbo JB Osman M et al 2007

Smith et al doi101093molbevmsy006 MBE

1016

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Convergent adaptation of human lactase persistence in Africa andEurope Nat Genet 39(1) 31ndash40

Troelsen JT Olsen J Moslashller J SjeuroOstreuroOm H 2003 An upstream polymor-phism associated with lactase persistence has increased enhanceractivity Gastroenterology 125(6) 1686ndash1694

Varin C Reid N Firth D 2011 An overview of composite likelihoodmethods Stat Sin 21(1) 5ndash42

Voight BF Kudaravalli S Wen X Pritchard JK Hurst L 2006 A map ofrecent positive selection in the human genome PLoS Biol 4(3) e72

Wegmann D Kessner DE Veeramah KR Mathias RA Nicolae DL YanekLR Sun YV Torgerson DG Rafaels N Mosley T et al 2011Recombination rates in admixed individuals identified by ancestry-based inference Nat Genet 43(9) 847ndash853

Wilde S Timpson A Kirsanow K Kaiser E Kayser M Unterleuroander MHollfelder N Potekhina ID Schier W Thomas MG et al 2014 Directevidence for positive selection of skin hair and eye pigmentation inEuropeans during the last 5000 years Proc Natl Acad Sci U S A111(13) 4832ndash4837

Williamson SH Hernandez R Fledel-Alon A Zhu L Nielsen RBustamante CD 2005 Simultaneous inference of selection and pop-ulation growth from patterns of variation in the human genomeProc Natl Acad Sci U S A 102(22) 7882ndash7887

Wiuf C 2000 On the genealogy of a sample of neutral rare alleles TheorPopul Biol 58(1) 61ndash75

Wiuf C Donnelly P 1999 Conditional genealogies and the age of aneutral mutant Theor Popul Biol 56(2) 183ndash201

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1017

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

DiscussionOur method improves estimation for the timing of selectionon a beneficial allele using a tractable model of haplotypeevolution This approach leverages detailed information inthe data while remaining amenable to large sample sizesUsing both carriers and noncarriers of the allele we canmore effectively account for uncertainty in the extent ofthe ancestral haplotype and derived mutations We showthe performance of our method using simulations of differentselection strengths beneficial allele frequencies and choicesof reference panel By applying our method to five variantspreviously identified as targets of selection in human popu-lations we provide a comparison among population-specificTMRCAs This gives a more detailed account of the order inwhich populations may have acquired the variant andorexperienced selection for the trait it underlies

In that regard it is hypothesized that local selection pres-sures and a cultural shift toward agrarian societies have in-duced adaptive responses among human populations aroundthe globe The data associated with some variants seem toindicate more recent selective events than others Our resultsfor variants associated with dietary traits at the LCT andADH1B genes both imply relatively recent TMRCAs(lt 15 000 years ago) consistent with hypotheses that selec-tion on these mutations results from recent changes in hu-man diet following the spread of agriculture (Simoons 1970Peng et al 2010) In contrast the inferred TMRCAs for EDARKITLG and OCA2 imply older adaptive events which mayhave coincided more closely with the habitation of new envi-ronments or other cultural changes

Several hypotheses have been suggested to describe theselective drivers of skin pigmentation differences among hu-man populations including reduced UV radiation at highlatitudes and vitamin D deficiency (Loomis 1967 Jablonskiand Chaplin 2000) Estimated TMRCAs for the variants atthe OCA2 and EDAR loci among East Asians appear to beas young or younger than the KITLG variant but older thanthe LCT and ADH1B locus This suggests a selective history inEast Asian populations leading to adaptive responses forthese traits occurring after an initial colonization In somecases the dispersion of replicate MCMC estimates makes itdifficult to describe the historical significance of an observedorder for TMRCA values However the consistency of esti-mates among different populations for particular variantsadds some confidence to our modelrsquos ability to reproducethe ages which are relevant to those loci or certain geographicregions

To assess the relative concordance of our estimates withthose from previous approaches we compared our resultswith a compilation of previously published estimates basedon the time of mutation time since fixation or TMRCA ofvariants associated with the genes studied here (supplemen-tary fig 8 Supplementary Material online) The range of con-fidence intervals for these studies is largely a reflection of theassumptions invoked or relaxed for any one method as wellas the sample size and quality of the data used In principleour method extracts more information than approaches that

use summary statistics such as ABC In our empirical appli-cation we found that our method provides a gain in accuracywhile accounting for uncertainty in both the ancestral hap-lotype and its length on each chromosome Notably ourmethod provides narrower credible intervals by incorporatingthe full information from ancestral haplotype lengths derivedmutations and a reference panel of noncarrier haplotypes

Another caveat of our method is its dependence on thereference panel which is intended to serve as a representativesample of nonancestral haplotypes in the population duringthe selected allelersquos increase in frequency Four possible chal-lenges can arise 1) Segments of the ancestral selected haplo-type may be present in the reference panel due torecombination (this is more likely for alleles that havereached higher frequency) 2) the reference panel may con-tain haplotypes that are similar to the ancestral haplotypedue to low levels of genetic diversity 3) the reference panelmay be too diverged from the focal population and 4) pop-ulation connectivity and turnover may lead the ldquolocalrdquo refer-ence panel to be largely composed of migrant haplotypeswhich were not present during the selected allelersquos initial in-crease in frequency

Under scenarios 1 and 2 the background haplotypes willbe too similar to the ancestral haplotype and it may be dif-ficult for the model to discern a specific ancestry switch lo-cation This leads to fewer differences (mutations) thanexpected between the ancestral haplotype and each benefi-cial allele carrier The simulation results are consistent withthis scenario Our method tends to underestimate the trueage across a range of selection intensities and allele frequen-cies when using a local reference panel

Conversely under scenarios 3 and 4 the model will fail todescribe a recombinant haplotype in the sample of beneficialallele carriers as a mosaic of haplotypes in the reference panelAs a result the model will infer more mutation events toexplain observed differences from the ancestral haplotypeOur simulation results show this to be the case with referencepanels diverged by N generations Posterior mean estimatesare consistently older than their true value Our simulationsare perhaps pessimistic thoughmdashwe chose reference paneldivergence times of N and 05N generations approximatelycorresponding to FST values of 04 and 02 respectively Forthe smaller FST values observed in humans we expect resultsfor diverged panels to be closer to those obtained with thelocal reference panel Nonetheless future extensions to incor-porate multiple populations within the reference panel wouldbe helpful and possible by modifying the approach of Priceet al (2009) Such an approach would also enable the analysisof admixed populations (we excluded admixed samples fromour analysis of the 1000 Genomes data above)

Aside from the challenges imposed by the choice of refer-ence panel another potential source of bias lies in our tran-sition probabilities which are not conditioned on thefrequency of the selected variant In reality recombinationevents at some distance away from the selected site will onlyresult in a switch from the ancestral to background haplo-types at a rate proportional to 1 pl where pl is the fre-quency of the ancestral haplotype alleles at locus l In this way

Smith et al doi101093molbevmsy006 MBE

1012

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

some recombination events may go unobservedmdashas thebeneficial allele goes to high frequency the probability of arecombination event leading to an observable ancestral tobackground haplotype transition decreases One solutionmay be to include the frequency-dependent transition prob-abilities derived by Chen et al (2015) Under their model themutation time is estimated by assuming a deterministic lo-gistic frequency trajectory starting at 1

2N An additional benefitof using frequency trajectories would be the ability to inferposterior distributions on selection coefficients While theselection coefficient is typically assumed to be related tothe time since mutation by t1 frac14 logethNsTHORN=s we do nothave an equivalent expression for time to the common an-cestor Rather than the initial frequency being 1

2N for a newmutation our initial frequency must correspond to the fre-quency at which the TMRCA occurs Griffiths and Tavare(1994) derive a framework to model a genealogy under arbi-trary population size trajectories which should be analogousto the problem of an allele frequency trajectory and addi-tional theory on intra-allelic genealogies may be useful here aswell (Griffiths and Tavare 1994 Wiuf and Donnelly 1999 Wiuf2000 Slatkin and Rannala 2000)

Our model also assumes independence among all haplo-types in the sample in a composite-likelihood frameworkwhich is equivalent to assuming a star-genealogy (Larribeand Fearnhead 2011 Varin et al 2011) This is unlikely tobe the case when sample sizes are large or the TMRCA isold It is also unlikely to be true if the beneficial allele existedon multiple haplotypes preceding the onset of selection wasintroduced by multiple migrant haplotypes from other pop-ulations or occurred by multiple independent mutationevents (Innan and Kim 2004 Hermisson and Pennings2005 Przeworski et al 2005 Pritchard et al 2010 Berg andCoop 2015) Methods for distinguishing selection from stand-ing variation versus de novo mutation are available thatshould make it easier distinguish these cases (Messer andNeher 2012 Peter et al 2012 Messer and Petrov 2013Garud et al 2015)

If the underlying allelic genealogy is not star-like onecan expect different estimates of the TMRCA for differentsubsets of the data Here we performed multiple MCMCson resampled subsets of the data to informally diagnosewhether there are violations from the star-shape geneal-ogy assumption We speculate that exactly how theTMRCAs vary may provide insight to the underlyinghistory In cases where the TMRCA estimates for a par-ticular population are old and more variable than otherpopulations the results may be explained by structure inthe genealogy whereby recent coalescent events haveoccurred among the same ancestral haplotype beforethe common ancestor When estimates are dispersedamong resampled data sets the presence of multiple an-cestral haplotypes prior to the variantrsquos increase in fre-quency may be a better explanation Further support forthis explanation might come from comparisons to otherpopulation samples which show little to no dispersion ofestimates from resampled data sets Future work mightmake it possible to formalize this inference process

A final caveat regards the misspecification of mutation andrecombination rates TMRCA estimates are largely deter-mined by the use of accurate measures for these two param-eters In a way this provides some robustness to our methodOur age estimates depend on mutation and recombinationrates so accurate specification for one of the values can com-pensate for slight misspecification in the other In cases wherea fine-scale recombination map is unavailable we suggest us-ing a uniform recombination rate specific to the locus ofinterest (see Materials and Methods and supplementaryfig 7 Supplementary Material online) Choosing an appropri-ate mutation rate will continue to depend on current andfuture work which tries to resolve discrepancies in publishedmutation rate estimates inferred by various approaches(Segurel et al 2014)

One future direction for our method may be to explicitlyincorporate the possibility of multiple ancestral haplotypeswithin the sample Under a disease mapping frameworkMorris et al (2002) implement a similar idea in the case whereindependent disease causing mutations arise at the same lo-cus leading to independent genealogies for which they cointhe term ldquoshattered coalescentrdquo For our case beneficialmutations may also be independently derived on differenthaplotypes Alternatively a single mutation may be oldenough to reside on different haplotypes due to a sufficientamount of linked variation existing prior to the onset of se-lection Berg and Coop (2015) model selection from standingvariation to derive the distribution of haplotypes that theselected allele is present on

While we have treated the TMRCA as a parameter of in-terest our method also produces a sample of the posteriordistribution on the ancestral haplotype This could provideuseful information to estimate the frequency spectrum ofderived mutations on the ancestral haplotype Similarly thefrequency of shared recombination breakpoints could shedlight on the genealogy and how well it conforms to the star-shape assumption The extent of the ancestral haplotype ineach individual may also prove useful for identifying delete-rious alleles that have increased in frequency as a result ofstrong positive selection on linked beneficial alleles (Chun andFay 2011 Hartfield and Otto 2011) For example Huff et al(2012) describe a risk allele for Crohnrsquos disease at high fre-quency in European populations which they suggest is linkedto a beneficial allele under recent selection Similar to anadmixture mapping approach our method could be usedto identify risk loci by testing for an association betweenthe ancestral haplotype and disease status As another appli-cation identifying the ancestral haplotype may be useful inthe context of identifying a source population (or species) fora beneficial allele prior to its introduction and subsequentincrease in frequency in the recipient population

In many cases the specific site under selection may beunknown or restricted to some set of putative sites Whileour method requires the position of the selected site to bespecified future extensions could treat the selected site as arandom variable to be estimated under the same MCMCframework This framework would also be amenable to mar-ginalizing over uncertainty on the selected site

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1013

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

While we focus here on inference from modern DNAdata the increased accessibility of ancient DNA has addeda new dimension to population genetic data sets (Lazaridiset al 2014 Skoglund et al 2014 Allentoft et al 2015 Haaket al 2015 Mathieson et al 2015) Because it will remaindifficult to use ancient DNA approaches in many specieswith poor sample preservation we believe methods basedon modern DNA will continue to be useful going forwardThat said ancient DNA is providing an interesting avenue forcomparative work between inference from modern and an-cient samples For example Nakagome et al (2016) usesimulations to assess the fit of this ancient DNA polymor-phism to data simulated under their inferred parametervalues for allele age and selection intensity and they findreasonable agreement Much work still remains to fullyleverage ancient samples into population genetic inferencewhile accounting for new sources of uncertainty and poten-tial for sampling bias

Despite these challenges it is clear that our understandingof adaptive history will continue to benefit from new com-putational tools which extract insightful information from adiverse set of data sources

Materials and MethodsWe generated data using the software mssel (Hudson R per-sonal communication) which simulates a sample of haplo-types conditioned on the frequency trajectory of a selectedvariant under the structured coalescent (Hudson and Kaplan1988 Kaplan et al 1988) Trajectories were first simulatedforwards in time under a WrightndashFisher model for an additivelocus with varying strengths of selection and different endingfrequencies of the selected variant Trajectories were thentruncated to end at the first time the allele reaches a specifiedfrequency See supplementary table 1 SupplementaryMaterial online for relative ages of simulated TMRCA val-ues for different end frequencies and selection strengthsFor the results in figure 2 100 simulations were performedfor each parameter combination MCMCs were run for5000 iterations with a burn-in excluding the first 3000iterations A standard deviation of 10 was used for theproposal distribution of tca The red boxplots indicate localreference panels The blue and green boxplots indicate ref-erence panels diverged by 5N generations and 1N gener-ations respectively Each data set was simulated for a 1-Mblocus with a mutation rate of 1 108 recombination rateof 1 108

and population size of 10000 Sample sizesfor the selected and reference panels were 100 and 20respectively

For more efficient run times of the MCMC we set a max-imum number of individuals to include in the selected andreference panels to be 100 and 20 respectively In cases wherethe true number of haplotypes for either panel was greaterthan this in the full data set we resampled a subset ofhaplotypes from each population for a total of five replicatesper population For simulation results supporting the use ofthis resampling strategy see supplementary figure 5Supplementary Material online The MCMCs were run for

15000 iterations with a standard deviation of 20 for theTMRCA proposal distribution The first 9000 iterationswere removed as a burn-in leading to 6000 iterations for asample of the posterior Convergence was assessed by com-parison of MCMC replicates Figure 3 and supplementaryfigure 8 Supplementary Material online show the resultsfor all five variants along with previous point estimates and95 confidence intervals assuming a generation time of29 years (Fenner 2005) Supplementary table 3Supplementary Material online lists the mean and 95 cred-ible intervals for estimates with the highest mean posteriorprobability which we refer to in the text belowSupplementary table 4 Supplementary Material online liststhe previous estimates and confidence intervals with addi-tional details of the different approaches taken

To model recombination rate variation we used recombi-nation rates from the Decode sex-averaged recombinationmap inferred from pedigrees among families in Iceland(Kong et al 2010) Because some populations may have re-combination maps which differ from the Decode map at finescales we used a mean uniform recombination rate inferredfrom the 1-Mb region surrounding each variant The motiva-tion for this arises from how recombination rates have beenpreviously shown to remain relatively consistent among re-combination maps inferred for different populations at themegabase-scale (Broman et al 1998 Baudat et al 2010 Konget al 2010 Auton and McVean 2012) Further we found ourestimates depend mostly on having the megabase-scale re-combination rate appropriately set with little difference inmost cases for estimates obtained by modeling fine-scale re-combination at each locus (supplementary fig 7Supplementary Material online) We specify the switchingrate among background haplotypes after recombining off ofthe ancestral haplotype to be 4Nr where Nfrac14 10000 and r isthe mean recombination rate for the 1-Mb locus

For modeling mutation a challenge is that previous mu-tation rate estimates vary depending on the approach used(Segurel et al 2014) Estimates using the neutral substitutionrate between humans and chimps are more than 2 108

per bp per generation while estimates using whole-genomesequencing are closer to 1 108 As a compromise wespecify a mutation rate of 16 108

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe would like to thank Hussein Al-Asadi Arjun BiddandaAnna Di Rienzo Dick Hudson Choongwon Jeong Evan KochJoseph Marcus Shigeki Nakagome Ben Peter Mark ReppelAlex White members of the Coop laboratory at UC Davismembers of the Przeworski laboratory at ColumbiaUniversity and members of the He and Stephens laboratoriesat the University of Chicago for helpful comments JS wassupported by an NSF Graduate Research Fellowship andNational Institute of General Medical Sciences of the

Smith et al doi101093molbevmsy006 MBE

1014

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

National Institutes of Health under award numbers DGE-1144082 and T32GM007197 respectively This work wasalso supported by the National Institute of General MedicalSciences of the National Institutes of Health under awardnumbers RO1GM83098 and RO1GM107374 to GC as wellas R01HG007089 to JN

ReferencesAkey JM Eberle MA Rieder MJ Carlson CS Shriver MD Nickerson DA

Kruglyak L 2004 Population history and natural selection shapepatterns of genetic variation in 132 genes PLoS Biol 21591ndash1599

Allentoft ME Sikora M Sjogren K-G Rasmussen S Rasmussen MStenderup J Damgaard PB Schroeder H Ahlstrom T Vinner L2015 Population genomics of Bronze Age Eurasia Nature522(7555) 167ndash172

Auton A McVean G 2012 Estimating recombination rates from geneticvariation in humans In Anisimova M ed Evolutionary genomicsMethods in Molecular Biology (Methods and Protocols) vol 856New York Humana Press

Barrett RDH Hoekstra HE 2011 Molecular spandrels tests of adaptationat the genetic level Nat Rev Genet 12(11) 767ndash780

Baudat F Buard J Grey C Fledel-Alon A Ober C Przeworski M Coop GMassy BD 2010 PRDM9 is a major determinant of meiotic recom-bination hotspots in humans and mice Science 327(5967) 836ndash840

Beaumont MA Zhang W Balding DJ 2002 Approximate Bayesian com-putation in population genetics Genetics 162(4) 2025ndash2035

Beleza S Johnson NA Candille SI Absher DM Coram MA Lopes JCampos J Araujo II Anderson TM Vilhjalmsson BJ et al 2013Genetic architecture of skin and eye color in an African-Europeanadmixed population PLoS Genet 9(3) e1003372

Beleza S Santos AM McEvoy B Alves I Martinho C Cameron E ShriverMD Parra EJ Rocha J 2013 The timing of pigmentation lightening inEuropeans Mol Biol Evol 30(1) 24ndash35

Berg JJ Coop G 2015 A coalescent model for a sweep of a uniquestanding variant Genetics 201(2) 707ndash725

Bersaglieri T Sabeti PC Patterson N Vanderploeg T Schaffner SF DrakeJA Rhodes M Reich DE Hirschhorn JN 2004 Genetic signatures ofstrong recent positive selection at the lactase gene Am J Hum Genet74(6) 1111ndash1120

Broman KW Murray JC Sheffield VC White RL Weber JL 1998Comprehensive human genetic maps individual and sex-specificvariation in recombination Am J Hum Genet 63(3) 861ndash869

Bryk J Hardouin E Pugach I Hughes D Strotmann R Stoneking M MylesS 2008 Positive selection in East Asians for an EDAR allele thatenhances NF-jB activation PLoS One 3(5) e2209

Chen H Slatkin M 2013 Inferring selection intensity and allele age frommultilocus haplotype structure G3 3(8) 1429ndash1442

Chen H Hey J Slatkin M 2015 A hidden Markov model for investigatingrecent positive selection through haplotype structure Theor PopulBiol 9918ndash30

Chun S Fay JC 2011 Evidence for hitchhiking of deleterious mutationswithin the human genome PLoS Genet 7(8) e1002240

Coop G Griffiths RC 2004 Ancestral inference on gene trees underselection Theor Popul Biol 66(3) 219ndash232

Coop G Bullaughey K Luca F Przeworski M 2008 The timing of selec-tion at the human FOXP2 gene Mol Biol Evol 25(7) 1257ndash1259

Dalziel AC Rogers SM Schulte PM 2009 Linking genotypes to pheno-types and fitness how mechanistic biology can inform molecularecology Mol Ecol 18(24) 4997ndash5017

Eaton K Edwards M Krithika S Cook G Norton H Parra EJ 2015Association study confirms the role of two OCA2 polymorphismsin normal skin pigmentation variation in East Asian populations AmJ Hum Biol 27 520ndash525

Edwards M Bigham A Tan J Li S Gozdzik A Ross K Jin L Parra EJ 2010Association of the OCA2 polymorphism His615Arg with melanin

content in east Asian populations further evidence of convergentevolution of skin pigmentation PLoS Genet 6(3) e1000867

Elmer KR Meyer A 2011 Adaptation in the age of ecological genomicsinsights from parallelism and convergence Trends Ecol Evol 26(6)298ndash306

Enattah NS Sahi T Savilahti E Terwilliger JD Peltonen L Jeuroarveleuroa I 2002Identification of a variant associated with adult-type hypolactasiaNat Genet 30(2) 233ndash237

Eng MY Luczak SE Wall TL 2007 ALDH2 ADH1B and ADH1C geno-types in Asians a literature review Alcohol Res Health 30(1) 22

Fenner JN 2005 Cross-cultural estimation of the human generationinterval for use in genetics-based population divergence studiesAm J Phys Anthropol 128(2) 415ndash423

Fujimoto A Ohashi J Nishida N Miyagawa T Morishita Y Tsunoda TKimura R Tokunaga K 2008 A replication study confirmed theEDAR gene to be a major contributor to population differentiationregarding head hair thickness in Asia Hum Genet 124(2) 179ndash185

Garud NR Messer PW Buzbas EO Petrov DA 2015 Recent selectivesweeps in North American Drosophila melanogaster show signaturesof soft sweeps PLoS Genet 11(2) e1005004

Griffiths RC Tavare S 1994 Sampling theory for neutral alleles in avarying environment Philos Trans R Soc B Biol Sci 344(1310)403ndash410

Haak W Lazaridis I Patterson N Rohland N Mallick S Llamas B BrandtG Nordenfelt S Harney E Stewardson K 2015 Massive migrationfrom the steppe was a source for Indo-European languages inEurope Nature 522 207ndash211

Hartfield M Otto SP 2011 Recombination and hitchhiking of deleteri-ous alleles Evolution 65(9) 2421ndash2434

Hermisson J Pennings PS 2005 Soft sweeps molecular population ge-netics of adaptation from standing genetic variation Genetics 169(4)2335ndash2352

Hinch AG Tandon A Patterson N Song Y Rohland N Palmer CD ChenGK Wang K Buxbaum SG Akylbekova EL et al 2011 The landscapeof recombination in African Americans Nature 476(7359) 170ndash175

Hudson RR 1990 Gene genealogies and the coalescent process Oxf SurvEvol Biol 7(1) 44

Hudson RR 2007 The variance of coalescent time estimates from DNAsequences J Mol Evol 64(6) 702ndash705

Hudson RR Kaplan NL 1988 The coalescent process in models withselection and recombination Genetics 120(3) 831ndash840

Huff CD Witherspoon DJ Zhang Y Gatenbee C Denson LA KugathasanS Hakonarson H Whiting A Davis CT Wu W et al 2012 Crohnrsquosdisease and genetic hitchhiking at IBD5 Mol Biol Evol 29(1)101ndash111

Innan H Kim Y 2004 Pattern of polymorphism after strong artificialselection in a domestication event Proc Natl Acad Sci U S A 101(29)10667ndash10672

Itan Y Powell A Beaumont MA Burger J Thomas MG 2009 The originsof lactase persistence in Europe PLoS Comput Biol 5(8) e1000491

Jablonski NG Chaplin G 2000 The evolution of human skin coloration JHum Evol 39(1) 57ndash106

Jablonski NG Chaplin G 2012 Human skin pigmentation migration anddisease susceptibility Philos Trans R Soc B Biol Sci 367(1590)785ndash792

Jeong C Di Rienzo A 2014 Adaptations to local environments in mod-ern human populations Curr Opin Genet Dev 291ndash8

Kamberov YG Wang S Tan J Gerbault P Wark A Tan L Yang Y Li STang K Chen H et al 2013 Modeling recent human evolution inmice by expression of a selected EDAR variant Cell 152(4) 691ndash702

Kaplan NL Darden T Hudson RR 1988 The coalescent process inmodels with selection Genetics 120(3) 819ndash829

Kaplan NL Hudson RR Langley CH 1989 The ldquohitchhiking effectrdquo revis-ited Genetics 123(4) 887ndash899

Kimura R Yamaguchi T Takeda M Kondo O Toma T Haneji KHanihara T Matsukusa H Kawamura S Maki K et al 2009 A com-mon variation in EDAR is a genetic determinant of shovel-shapedincisors Am J Hum Genet 85(4) 528ndash535

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1015

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Kong A Thorleifsson G Gudbjartsson DF Masson G Sigurdsson AJonasdottir A Walters GB Jonasdottir A Gylfason A KristinssonKT et al 2010 Fine-scale recombination rate differences betweensexes populations and individuals Nature 467(7319) 1099ndash1103

Kuokkanen M Kokkonen J Enattah NS Ylisaukko-oja T Komu H VariloT Peltonen L Savilahti E Jeuroarveleuroa I 2006 Mutations in the translatedregion of the lactase gene (LCT) underlie congenital lactase defi-ciency Am J Hum Genet 78(2) 339ndash344

Larribe F Fearnhead P 2011 On composite likelihoods in statisticalgenetics Stat Sin 21(1) 43ndash69

Lazaridis I Patterson N Mittnik A Renaud G Mallick S Kirsanow KSudmant PH Schraiber JG Castellano S Lipson M et al 2014Ancient human genomes suggest three ancestral populations forpresent-day Europeans Nature 513(7518) 409ndash413

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357) 493ndash496

Li H Mukherjee N Soundararajan U Tarnok Z Barta C Khaliq SMohyuddin A Kajuna SLB Mehdi SQ Kidd JR et al 2007Geographically separate increases in the frequency of the derivedADH1B 47His allele in eastern and western Asia Am J Hum Genet81(4) 842ndash846

Li N Stephens M 2003 Modeling linkage disequilibrium and identifyingrecombination hotspots using single-nucleotide polymorphismdata Genetics 165(4) 2213ndash2233

Loomis WF 1967 Skin-pigment regulation of vitamin-D biosynthesis inman Science 157(3788) 501ndash506

Mathieson I Lazaridis I Rohland N Mallick S Patterson N RoodenbergSA Harney E Stewardson K Fernandes D Novak M et al 2015Genome-wide patterns of selection in 230 ancient Eurasians Nature528(7583) 499ndash503

McPeek MS Strahs A 1999 Assessment of linkage disequilibrium by thedecay of haplotype sharing with application to fine-scale geneticmapping Am J Hum Genet 65(3) 858ndash875

Meligkotsidou L Fearnhead P 2005 Maximum-likelihood estimationof coalescence times in genealogical trees Genetics 171(4)2073ndash2084

Messer PW Neher RA 2012 Estimating the strength of selective sweepsfrom deep population diversity data Genetics 191(2) 593ndash605

Messer PW Petrov DA 2013 Population genomics of rapid adaptationby soft selective sweeps Trends Ecol Evol 28(11) 659ndash669

Miller CT Beleza S Pollen AA Schluter D Kittles RA Shriver MDKingsley DM 2007 cis-Regulatory changes in kit ligand expressionand parallel evolution of pigmentation in sticklebacks and humansCell 131(6) 1179ndash1189

Morris AP Whittaker JC Balding DJ 2000 Bayesian fine-scale mapping ofdisease loci by hidden Markov models Am J Hum Genet 67(1)155ndash169

Morris AP Whittaker JC Balding DJ 2002 Fine-scale mapping of diseaseloci via shattered coalescent modeling of genealogies Am J HumGenet 70(3) 686ndash707

Nakagome S Alkorta-Aranburu G Amato R Howie B Peter BMHudson RR Di Rienzo A 2016 Estimating the ages of selectionsignals from different epochs in human history Mol Biol Evol33(3) 657ndash669

Nielsen R 2005 Molecular signatures of natural selection Annu RevGenet 39 197ndash218

Olds LC Sibley E 2003 Lactase persistence DNA variant enhances lactasepromoter activity in vitro functional role as a cis regulatory elementHum Mol Genet 12(18) 2333ndash2340

Ormond L Foll M Ewing GB Pfeifer SP Jensen JD 2016 Inferring the ageof a fixed beneficial allele Mol Ecol 25(1) 157ndash169

Osier MV Pakstis AJ Soodyall H Comas D Goldman D Odunsi AOkonofua F Parnas J Schulz LO Bertranpetit J et al 2002 A globalperspective on genetic variation at the ADH genes reveals unusualpatterns of linkage disequilibrium and diversity Am J Hum Genet71(1) 84ndash99

Peng Y Shi H Qi X-B Xiao C-J Zhong H Ma Run-lin Z Su B 2010The ADH1B Arg47His polymorphism in East Asian populations

and expansion of rice domestication in history BMC Evol Biol10(1) 15

Peter BM Huerta-Sanchez E Nielsen R 2012 Distinguishing betweenselective sweeps from standing variation and from a de novo mu-tation PLoS Genet 8(10) e1003011

Przeworski M Coop G Wall JD 2005 The signature of positive selectionon standing genetic variation Evolution 59(11) 2312ndash2323

Price AL Tandon A Patterson N Barnes KC Rafaels N Ruczinski I BeatyTH Mathias R Reich D Myers S 2009 Sensitive detection of chro-mosomal segments of distinct ancestry in admixed populationsPLoS Genet 5(6) e1000519

Pritchard JK Seielstad MT Perez-Lezaun A Feldman MW 1999Population growth of human Y chromosomes a study of Y chro-mosome microsatellites Mol Biol Evol 16(12) 1791ndash1798

Pritchard JK Pickrell JK Coop G 2010 The genetics of human adapta-tion hard sweeps soft sweeps and polygenic adaptation Curr Biol20(4) R208ndashR215

Przeworski M 2003 Estimating the time since the fixation of a beneficialallele Genetics 164(4) 1667ndash1676

Rabiner LR 1989 A tutorial on hidden Markov models and selectedapplications in speech recognition Proc IEEE 77(2) 257ndash286

Radwan J Babik W 2012 The genomics of adaptation Proc R Soc Lond BBiol Sci 279(1749) 5024ndash5028

Rannala B Reeve JP 2001 High-resolution multipoint linkage-disequilibrium mapping in the context of a human genome se-quence Am J Hum Genet 69(1) 159ndash178

Rannala B Reeve JP 2002 Joint Bayesian estimation of mutationlocation and age using linkage disequilibrium Biocomputing2003 526ndash534

Segurel L Wyman MJ Przeworski M 2014 Determinants of mutationrate variation in the human germline Annu Rev Genomics HumGenet 1547ndash70

Simoons FJ 1970 Primary adult lactose intolerance and the milkinghabit a problem in biologic and cultural interrelations Am J DigDis 15(8) 695ndash710

Skoglund P Malmstrom H Omrak A Raghavan M Valdiosera CGunther T Hall P Tambets K Parik J Sjogren K-G et al 2014Genomic diversity and admixture differs for Stone-AgeScandinavian foragers and farmers Science 344(6185) 747ndash750

Slatkin M 2001 Simulating genealogies of selected alleles in a populationof variable size Genet Res 78(01) 49ndash57

Slatkin M 2008 A Bayesian method for jointly estimating allele age andselection intensity Genet Res 90(01) 129ndash137

Slatkin M Hudson RR 1991 Pairwise comparisons of mitochondrialDNA sequences in stable and exponentially growing populationsGenetics 129(2) 555ndash562

Slatkin M Rannala B 2000 Estimating allele age Annu Rev GenomicsHum Genet 1(1) 225ndash249

Smith JM Haigh J 1974 The hitch-hiking effect of a favourable geneGenet Res 23(01) 23ndash35

Stinchcombe JR Hoekstra HE 2008 Combining population genomicsand quantitative genetics finding the genes underlying ecologicallyimportant traits Heredity 100(2) 158ndash170

Tang H Siegmund DO Shen P Oefner PJ Feldman MW 2002Frequentist estimation of coalescence times from nucleotide se-quence data using a tree-based partition Genetics 161(1) 447ndash459

Tavare S Balding DJ Griffiths RC Donnelly P 1997 Inferring coalescencetimes from DNA sequence data Genetics 145(2) 505ndash518

Tennessen JA Bigham AW OrsquoConnor TD Fu W Kenny EE Gravel SMcGee S Do R Liu X Jun G et al 2012 Evolution and functionalimpact of rare coding variation from deep sequencing of humanexomes Science 337(6090) 64ndash69

Thomson R Pritchard JK Shen P Oefner PJ Feldman MW 2000Recent common ancestry of human Y chromosomes evidencefrom DNA sequence data Proc Natl Acad Sci U S A 97(13)7360ndash7365

Tishkoff SA Reed FA Ranciaro A Voight BF Babbitt CC Silverman JSPowell K Mortensen HM Hirbo JB Osman M et al 2007

Smith et al doi101093molbevmsy006 MBE

1016

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Convergent adaptation of human lactase persistence in Africa andEurope Nat Genet 39(1) 31ndash40

Troelsen JT Olsen J Moslashller J SjeuroOstreuroOm H 2003 An upstream polymor-phism associated with lactase persistence has increased enhanceractivity Gastroenterology 125(6) 1686ndash1694

Varin C Reid N Firth D 2011 An overview of composite likelihoodmethods Stat Sin 21(1) 5ndash42

Voight BF Kudaravalli S Wen X Pritchard JK Hurst L 2006 A map ofrecent positive selection in the human genome PLoS Biol 4(3) e72

Wegmann D Kessner DE Veeramah KR Mathias RA Nicolae DL YanekLR Sun YV Torgerson DG Rafaels N Mosley T et al 2011Recombination rates in admixed individuals identified by ancestry-based inference Nat Genet 43(9) 847ndash853

Wilde S Timpson A Kirsanow K Kaiser E Kayser M Unterleuroander MHollfelder N Potekhina ID Schier W Thomas MG et al 2014 Directevidence for positive selection of skin hair and eye pigmentation inEuropeans during the last 5000 years Proc Natl Acad Sci U S A111(13) 4832ndash4837

Williamson SH Hernandez R Fledel-Alon A Zhu L Nielsen RBustamante CD 2005 Simultaneous inference of selection and pop-ulation growth from patterns of variation in the human genomeProc Natl Acad Sci U S A 102(22) 7882ndash7887

Wiuf C 2000 On the genealogy of a sample of neutral rare alleles TheorPopul Biol 58(1) 61ndash75

Wiuf C Donnelly P 1999 Conditional genealogies and the age of aneutral mutant Theor Popul Biol 56(2) 183ndash201

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1017

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

some recombination events may go unobservedmdashas thebeneficial allele goes to high frequency the probability of arecombination event leading to an observable ancestral tobackground haplotype transition decreases One solutionmay be to include the frequency-dependent transition prob-abilities derived by Chen et al (2015) Under their model themutation time is estimated by assuming a deterministic lo-gistic frequency trajectory starting at 1

2N An additional benefitof using frequency trajectories would be the ability to inferposterior distributions on selection coefficients While theselection coefficient is typically assumed to be related tothe time since mutation by t1 frac14 logethNsTHORN=s we do nothave an equivalent expression for time to the common an-cestor Rather than the initial frequency being 1

2N for a newmutation our initial frequency must correspond to the fre-quency at which the TMRCA occurs Griffiths and Tavare(1994) derive a framework to model a genealogy under arbi-trary population size trajectories which should be analogousto the problem of an allele frequency trajectory and addi-tional theory on intra-allelic genealogies may be useful here aswell (Griffiths and Tavare 1994 Wiuf and Donnelly 1999 Wiuf2000 Slatkin and Rannala 2000)

Our model also assumes independence among all haplo-types in the sample in a composite-likelihood frameworkwhich is equivalent to assuming a star-genealogy (Larribeand Fearnhead 2011 Varin et al 2011) This is unlikely tobe the case when sample sizes are large or the TMRCA isold It is also unlikely to be true if the beneficial allele existedon multiple haplotypes preceding the onset of selection wasintroduced by multiple migrant haplotypes from other pop-ulations or occurred by multiple independent mutationevents (Innan and Kim 2004 Hermisson and Pennings2005 Przeworski et al 2005 Pritchard et al 2010 Berg andCoop 2015) Methods for distinguishing selection from stand-ing variation versus de novo mutation are available thatshould make it easier distinguish these cases (Messer andNeher 2012 Peter et al 2012 Messer and Petrov 2013Garud et al 2015)

If the underlying allelic genealogy is not star-like onecan expect different estimates of the TMRCA for differentsubsets of the data Here we performed multiple MCMCson resampled subsets of the data to informally diagnosewhether there are violations from the star-shape geneal-ogy assumption We speculate that exactly how theTMRCAs vary may provide insight to the underlyinghistory In cases where the TMRCA estimates for a par-ticular population are old and more variable than otherpopulations the results may be explained by structure inthe genealogy whereby recent coalescent events haveoccurred among the same ancestral haplotype beforethe common ancestor When estimates are dispersedamong resampled data sets the presence of multiple an-cestral haplotypes prior to the variantrsquos increase in fre-quency may be a better explanation Further support forthis explanation might come from comparisons to otherpopulation samples which show little to no dispersion ofestimates from resampled data sets Future work mightmake it possible to formalize this inference process

A final caveat regards the misspecification of mutation andrecombination rates TMRCA estimates are largely deter-mined by the use of accurate measures for these two param-eters In a way this provides some robustness to our methodOur age estimates depend on mutation and recombinationrates so accurate specification for one of the values can com-pensate for slight misspecification in the other In cases wherea fine-scale recombination map is unavailable we suggest us-ing a uniform recombination rate specific to the locus ofinterest (see Materials and Methods and supplementaryfig 7 Supplementary Material online) Choosing an appropri-ate mutation rate will continue to depend on current andfuture work which tries to resolve discrepancies in publishedmutation rate estimates inferred by various approaches(Segurel et al 2014)

One future direction for our method may be to explicitlyincorporate the possibility of multiple ancestral haplotypeswithin the sample Under a disease mapping frameworkMorris et al (2002) implement a similar idea in the case whereindependent disease causing mutations arise at the same lo-cus leading to independent genealogies for which they cointhe term ldquoshattered coalescentrdquo For our case beneficialmutations may also be independently derived on differenthaplotypes Alternatively a single mutation may be oldenough to reside on different haplotypes due to a sufficientamount of linked variation existing prior to the onset of se-lection Berg and Coop (2015) model selection from standingvariation to derive the distribution of haplotypes that theselected allele is present on

While we have treated the TMRCA as a parameter of in-terest our method also produces a sample of the posteriordistribution on the ancestral haplotype This could provideuseful information to estimate the frequency spectrum ofderived mutations on the ancestral haplotype Similarly thefrequency of shared recombination breakpoints could shedlight on the genealogy and how well it conforms to the star-shape assumption The extent of the ancestral haplotype ineach individual may also prove useful for identifying delete-rious alleles that have increased in frequency as a result ofstrong positive selection on linked beneficial alleles (Chun andFay 2011 Hartfield and Otto 2011) For example Huff et al(2012) describe a risk allele for Crohnrsquos disease at high fre-quency in European populations which they suggest is linkedto a beneficial allele under recent selection Similar to anadmixture mapping approach our method could be usedto identify risk loci by testing for an association betweenthe ancestral haplotype and disease status As another appli-cation identifying the ancestral haplotype may be useful inthe context of identifying a source population (or species) fora beneficial allele prior to its introduction and subsequentincrease in frequency in the recipient population

In many cases the specific site under selection may beunknown or restricted to some set of putative sites Whileour method requires the position of the selected site to bespecified future extensions could treat the selected site as arandom variable to be estimated under the same MCMCframework This framework would also be amenable to mar-ginalizing over uncertainty on the selected site

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1013

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

While we focus here on inference from modern DNAdata the increased accessibility of ancient DNA has addeda new dimension to population genetic data sets (Lazaridiset al 2014 Skoglund et al 2014 Allentoft et al 2015 Haaket al 2015 Mathieson et al 2015) Because it will remaindifficult to use ancient DNA approaches in many specieswith poor sample preservation we believe methods basedon modern DNA will continue to be useful going forwardThat said ancient DNA is providing an interesting avenue forcomparative work between inference from modern and an-cient samples For example Nakagome et al (2016) usesimulations to assess the fit of this ancient DNA polymor-phism to data simulated under their inferred parametervalues for allele age and selection intensity and they findreasonable agreement Much work still remains to fullyleverage ancient samples into population genetic inferencewhile accounting for new sources of uncertainty and poten-tial for sampling bias

Despite these challenges it is clear that our understandingof adaptive history will continue to benefit from new com-putational tools which extract insightful information from adiverse set of data sources

Materials and MethodsWe generated data using the software mssel (Hudson R per-sonal communication) which simulates a sample of haplo-types conditioned on the frequency trajectory of a selectedvariant under the structured coalescent (Hudson and Kaplan1988 Kaplan et al 1988) Trajectories were first simulatedforwards in time under a WrightndashFisher model for an additivelocus with varying strengths of selection and different endingfrequencies of the selected variant Trajectories were thentruncated to end at the first time the allele reaches a specifiedfrequency See supplementary table 1 SupplementaryMaterial online for relative ages of simulated TMRCA val-ues for different end frequencies and selection strengthsFor the results in figure 2 100 simulations were performedfor each parameter combination MCMCs were run for5000 iterations with a burn-in excluding the first 3000iterations A standard deviation of 10 was used for theproposal distribution of tca The red boxplots indicate localreference panels The blue and green boxplots indicate ref-erence panels diverged by 5N generations and 1N gener-ations respectively Each data set was simulated for a 1-Mblocus with a mutation rate of 1 108 recombination rateof 1 108

and population size of 10000 Sample sizesfor the selected and reference panels were 100 and 20respectively

For more efficient run times of the MCMC we set a max-imum number of individuals to include in the selected andreference panels to be 100 and 20 respectively In cases wherethe true number of haplotypes for either panel was greaterthan this in the full data set we resampled a subset ofhaplotypes from each population for a total of five replicatesper population For simulation results supporting the use ofthis resampling strategy see supplementary figure 5Supplementary Material online The MCMCs were run for

15000 iterations with a standard deviation of 20 for theTMRCA proposal distribution The first 9000 iterationswere removed as a burn-in leading to 6000 iterations for asample of the posterior Convergence was assessed by com-parison of MCMC replicates Figure 3 and supplementaryfigure 8 Supplementary Material online show the resultsfor all five variants along with previous point estimates and95 confidence intervals assuming a generation time of29 years (Fenner 2005) Supplementary table 3Supplementary Material online lists the mean and 95 cred-ible intervals for estimates with the highest mean posteriorprobability which we refer to in the text belowSupplementary table 4 Supplementary Material online liststhe previous estimates and confidence intervals with addi-tional details of the different approaches taken

To model recombination rate variation we used recombi-nation rates from the Decode sex-averaged recombinationmap inferred from pedigrees among families in Iceland(Kong et al 2010) Because some populations may have re-combination maps which differ from the Decode map at finescales we used a mean uniform recombination rate inferredfrom the 1-Mb region surrounding each variant The motiva-tion for this arises from how recombination rates have beenpreviously shown to remain relatively consistent among re-combination maps inferred for different populations at themegabase-scale (Broman et al 1998 Baudat et al 2010 Konget al 2010 Auton and McVean 2012) Further we found ourestimates depend mostly on having the megabase-scale re-combination rate appropriately set with little difference inmost cases for estimates obtained by modeling fine-scale re-combination at each locus (supplementary fig 7Supplementary Material online) We specify the switchingrate among background haplotypes after recombining off ofthe ancestral haplotype to be 4Nr where Nfrac14 10000 and r isthe mean recombination rate for the 1-Mb locus

For modeling mutation a challenge is that previous mu-tation rate estimates vary depending on the approach used(Segurel et al 2014) Estimates using the neutral substitutionrate between humans and chimps are more than 2 108

per bp per generation while estimates using whole-genomesequencing are closer to 1 108 As a compromise wespecify a mutation rate of 16 108

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe would like to thank Hussein Al-Asadi Arjun BiddandaAnna Di Rienzo Dick Hudson Choongwon Jeong Evan KochJoseph Marcus Shigeki Nakagome Ben Peter Mark ReppelAlex White members of the Coop laboratory at UC Davismembers of the Przeworski laboratory at ColumbiaUniversity and members of the He and Stephens laboratoriesat the University of Chicago for helpful comments JS wassupported by an NSF Graduate Research Fellowship andNational Institute of General Medical Sciences of the

Smith et al doi101093molbevmsy006 MBE

1014

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

National Institutes of Health under award numbers DGE-1144082 and T32GM007197 respectively This work wasalso supported by the National Institute of General MedicalSciences of the National Institutes of Health under awardnumbers RO1GM83098 and RO1GM107374 to GC as wellas R01HG007089 to JN

ReferencesAkey JM Eberle MA Rieder MJ Carlson CS Shriver MD Nickerson DA

Kruglyak L 2004 Population history and natural selection shapepatterns of genetic variation in 132 genes PLoS Biol 21591ndash1599

Allentoft ME Sikora M Sjogren K-G Rasmussen S Rasmussen MStenderup J Damgaard PB Schroeder H Ahlstrom T Vinner L2015 Population genomics of Bronze Age Eurasia Nature522(7555) 167ndash172

Auton A McVean G 2012 Estimating recombination rates from geneticvariation in humans In Anisimova M ed Evolutionary genomicsMethods in Molecular Biology (Methods and Protocols) vol 856New York Humana Press

Barrett RDH Hoekstra HE 2011 Molecular spandrels tests of adaptationat the genetic level Nat Rev Genet 12(11) 767ndash780

Baudat F Buard J Grey C Fledel-Alon A Ober C Przeworski M Coop GMassy BD 2010 PRDM9 is a major determinant of meiotic recom-bination hotspots in humans and mice Science 327(5967) 836ndash840

Beaumont MA Zhang W Balding DJ 2002 Approximate Bayesian com-putation in population genetics Genetics 162(4) 2025ndash2035

Beleza S Johnson NA Candille SI Absher DM Coram MA Lopes JCampos J Araujo II Anderson TM Vilhjalmsson BJ et al 2013Genetic architecture of skin and eye color in an African-Europeanadmixed population PLoS Genet 9(3) e1003372

Beleza S Santos AM McEvoy B Alves I Martinho C Cameron E ShriverMD Parra EJ Rocha J 2013 The timing of pigmentation lightening inEuropeans Mol Biol Evol 30(1) 24ndash35

Berg JJ Coop G 2015 A coalescent model for a sweep of a uniquestanding variant Genetics 201(2) 707ndash725

Bersaglieri T Sabeti PC Patterson N Vanderploeg T Schaffner SF DrakeJA Rhodes M Reich DE Hirschhorn JN 2004 Genetic signatures ofstrong recent positive selection at the lactase gene Am J Hum Genet74(6) 1111ndash1120

Broman KW Murray JC Sheffield VC White RL Weber JL 1998Comprehensive human genetic maps individual and sex-specificvariation in recombination Am J Hum Genet 63(3) 861ndash869

Bryk J Hardouin E Pugach I Hughes D Strotmann R Stoneking M MylesS 2008 Positive selection in East Asians for an EDAR allele thatenhances NF-jB activation PLoS One 3(5) e2209

Chen H Slatkin M 2013 Inferring selection intensity and allele age frommultilocus haplotype structure G3 3(8) 1429ndash1442

Chen H Hey J Slatkin M 2015 A hidden Markov model for investigatingrecent positive selection through haplotype structure Theor PopulBiol 9918ndash30

Chun S Fay JC 2011 Evidence for hitchhiking of deleterious mutationswithin the human genome PLoS Genet 7(8) e1002240

Coop G Griffiths RC 2004 Ancestral inference on gene trees underselection Theor Popul Biol 66(3) 219ndash232

Coop G Bullaughey K Luca F Przeworski M 2008 The timing of selec-tion at the human FOXP2 gene Mol Biol Evol 25(7) 1257ndash1259

Dalziel AC Rogers SM Schulte PM 2009 Linking genotypes to pheno-types and fitness how mechanistic biology can inform molecularecology Mol Ecol 18(24) 4997ndash5017

Eaton K Edwards M Krithika S Cook G Norton H Parra EJ 2015Association study confirms the role of two OCA2 polymorphismsin normal skin pigmentation variation in East Asian populations AmJ Hum Biol 27 520ndash525

Edwards M Bigham A Tan J Li S Gozdzik A Ross K Jin L Parra EJ 2010Association of the OCA2 polymorphism His615Arg with melanin

content in east Asian populations further evidence of convergentevolution of skin pigmentation PLoS Genet 6(3) e1000867

Elmer KR Meyer A 2011 Adaptation in the age of ecological genomicsinsights from parallelism and convergence Trends Ecol Evol 26(6)298ndash306

Enattah NS Sahi T Savilahti E Terwilliger JD Peltonen L Jeuroarveleuroa I 2002Identification of a variant associated with adult-type hypolactasiaNat Genet 30(2) 233ndash237

Eng MY Luczak SE Wall TL 2007 ALDH2 ADH1B and ADH1C geno-types in Asians a literature review Alcohol Res Health 30(1) 22

Fenner JN 2005 Cross-cultural estimation of the human generationinterval for use in genetics-based population divergence studiesAm J Phys Anthropol 128(2) 415ndash423

Fujimoto A Ohashi J Nishida N Miyagawa T Morishita Y Tsunoda TKimura R Tokunaga K 2008 A replication study confirmed theEDAR gene to be a major contributor to population differentiationregarding head hair thickness in Asia Hum Genet 124(2) 179ndash185

Garud NR Messer PW Buzbas EO Petrov DA 2015 Recent selectivesweeps in North American Drosophila melanogaster show signaturesof soft sweeps PLoS Genet 11(2) e1005004

Griffiths RC Tavare S 1994 Sampling theory for neutral alleles in avarying environment Philos Trans R Soc B Biol Sci 344(1310)403ndash410

Haak W Lazaridis I Patterson N Rohland N Mallick S Llamas B BrandtG Nordenfelt S Harney E Stewardson K 2015 Massive migrationfrom the steppe was a source for Indo-European languages inEurope Nature 522 207ndash211

Hartfield M Otto SP 2011 Recombination and hitchhiking of deleteri-ous alleles Evolution 65(9) 2421ndash2434

Hermisson J Pennings PS 2005 Soft sweeps molecular population ge-netics of adaptation from standing genetic variation Genetics 169(4)2335ndash2352

Hinch AG Tandon A Patterson N Song Y Rohland N Palmer CD ChenGK Wang K Buxbaum SG Akylbekova EL et al 2011 The landscapeof recombination in African Americans Nature 476(7359) 170ndash175

Hudson RR 1990 Gene genealogies and the coalescent process Oxf SurvEvol Biol 7(1) 44

Hudson RR 2007 The variance of coalescent time estimates from DNAsequences J Mol Evol 64(6) 702ndash705

Hudson RR Kaplan NL 1988 The coalescent process in models withselection and recombination Genetics 120(3) 831ndash840

Huff CD Witherspoon DJ Zhang Y Gatenbee C Denson LA KugathasanS Hakonarson H Whiting A Davis CT Wu W et al 2012 Crohnrsquosdisease and genetic hitchhiking at IBD5 Mol Biol Evol 29(1)101ndash111

Innan H Kim Y 2004 Pattern of polymorphism after strong artificialselection in a domestication event Proc Natl Acad Sci U S A 101(29)10667ndash10672

Itan Y Powell A Beaumont MA Burger J Thomas MG 2009 The originsof lactase persistence in Europe PLoS Comput Biol 5(8) e1000491

Jablonski NG Chaplin G 2000 The evolution of human skin coloration JHum Evol 39(1) 57ndash106

Jablonski NG Chaplin G 2012 Human skin pigmentation migration anddisease susceptibility Philos Trans R Soc B Biol Sci 367(1590)785ndash792

Jeong C Di Rienzo A 2014 Adaptations to local environments in mod-ern human populations Curr Opin Genet Dev 291ndash8

Kamberov YG Wang S Tan J Gerbault P Wark A Tan L Yang Y Li STang K Chen H et al 2013 Modeling recent human evolution inmice by expression of a selected EDAR variant Cell 152(4) 691ndash702

Kaplan NL Darden T Hudson RR 1988 The coalescent process inmodels with selection Genetics 120(3) 819ndash829

Kaplan NL Hudson RR Langley CH 1989 The ldquohitchhiking effectrdquo revis-ited Genetics 123(4) 887ndash899

Kimura R Yamaguchi T Takeda M Kondo O Toma T Haneji KHanihara T Matsukusa H Kawamura S Maki K et al 2009 A com-mon variation in EDAR is a genetic determinant of shovel-shapedincisors Am J Hum Genet 85(4) 528ndash535

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1015

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Kong A Thorleifsson G Gudbjartsson DF Masson G Sigurdsson AJonasdottir A Walters GB Jonasdottir A Gylfason A KristinssonKT et al 2010 Fine-scale recombination rate differences betweensexes populations and individuals Nature 467(7319) 1099ndash1103

Kuokkanen M Kokkonen J Enattah NS Ylisaukko-oja T Komu H VariloT Peltonen L Savilahti E Jeuroarveleuroa I 2006 Mutations in the translatedregion of the lactase gene (LCT) underlie congenital lactase defi-ciency Am J Hum Genet 78(2) 339ndash344

Larribe F Fearnhead P 2011 On composite likelihoods in statisticalgenetics Stat Sin 21(1) 43ndash69

Lazaridis I Patterson N Mittnik A Renaud G Mallick S Kirsanow KSudmant PH Schraiber JG Castellano S Lipson M et al 2014Ancient human genomes suggest three ancestral populations forpresent-day Europeans Nature 513(7518) 409ndash413

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357) 493ndash496

Li H Mukherjee N Soundararajan U Tarnok Z Barta C Khaliq SMohyuddin A Kajuna SLB Mehdi SQ Kidd JR et al 2007Geographically separate increases in the frequency of the derivedADH1B 47His allele in eastern and western Asia Am J Hum Genet81(4) 842ndash846

Li N Stephens M 2003 Modeling linkage disequilibrium and identifyingrecombination hotspots using single-nucleotide polymorphismdata Genetics 165(4) 2213ndash2233

Loomis WF 1967 Skin-pigment regulation of vitamin-D biosynthesis inman Science 157(3788) 501ndash506

Mathieson I Lazaridis I Rohland N Mallick S Patterson N RoodenbergSA Harney E Stewardson K Fernandes D Novak M et al 2015Genome-wide patterns of selection in 230 ancient Eurasians Nature528(7583) 499ndash503

McPeek MS Strahs A 1999 Assessment of linkage disequilibrium by thedecay of haplotype sharing with application to fine-scale geneticmapping Am J Hum Genet 65(3) 858ndash875

Meligkotsidou L Fearnhead P 2005 Maximum-likelihood estimationof coalescence times in genealogical trees Genetics 171(4)2073ndash2084

Messer PW Neher RA 2012 Estimating the strength of selective sweepsfrom deep population diversity data Genetics 191(2) 593ndash605

Messer PW Petrov DA 2013 Population genomics of rapid adaptationby soft selective sweeps Trends Ecol Evol 28(11) 659ndash669

Miller CT Beleza S Pollen AA Schluter D Kittles RA Shriver MDKingsley DM 2007 cis-Regulatory changes in kit ligand expressionand parallel evolution of pigmentation in sticklebacks and humansCell 131(6) 1179ndash1189

Morris AP Whittaker JC Balding DJ 2000 Bayesian fine-scale mapping ofdisease loci by hidden Markov models Am J Hum Genet 67(1)155ndash169

Morris AP Whittaker JC Balding DJ 2002 Fine-scale mapping of diseaseloci via shattered coalescent modeling of genealogies Am J HumGenet 70(3) 686ndash707

Nakagome S Alkorta-Aranburu G Amato R Howie B Peter BMHudson RR Di Rienzo A 2016 Estimating the ages of selectionsignals from different epochs in human history Mol Biol Evol33(3) 657ndash669

Nielsen R 2005 Molecular signatures of natural selection Annu RevGenet 39 197ndash218

Olds LC Sibley E 2003 Lactase persistence DNA variant enhances lactasepromoter activity in vitro functional role as a cis regulatory elementHum Mol Genet 12(18) 2333ndash2340

Ormond L Foll M Ewing GB Pfeifer SP Jensen JD 2016 Inferring the ageof a fixed beneficial allele Mol Ecol 25(1) 157ndash169

Osier MV Pakstis AJ Soodyall H Comas D Goldman D Odunsi AOkonofua F Parnas J Schulz LO Bertranpetit J et al 2002 A globalperspective on genetic variation at the ADH genes reveals unusualpatterns of linkage disequilibrium and diversity Am J Hum Genet71(1) 84ndash99

Peng Y Shi H Qi X-B Xiao C-J Zhong H Ma Run-lin Z Su B 2010The ADH1B Arg47His polymorphism in East Asian populations

and expansion of rice domestication in history BMC Evol Biol10(1) 15

Peter BM Huerta-Sanchez E Nielsen R 2012 Distinguishing betweenselective sweeps from standing variation and from a de novo mu-tation PLoS Genet 8(10) e1003011

Przeworski M Coop G Wall JD 2005 The signature of positive selectionon standing genetic variation Evolution 59(11) 2312ndash2323

Price AL Tandon A Patterson N Barnes KC Rafaels N Ruczinski I BeatyTH Mathias R Reich D Myers S 2009 Sensitive detection of chro-mosomal segments of distinct ancestry in admixed populationsPLoS Genet 5(6) e1000519

Pritchard JK Seielstad MT Perez-Lezaun A Feldman MW 1999Population growth of human Y chromosomes a study of Y chro-mosome microsatellites Mol Biol Evol 16(12) 1791ndash1798

Pritchard JK Pickrell JK Coop G 2010 The genetics of human adapta-tion hard sweeps soft sweeps and polygenic adaptation Curr Biol20(4) R208ndashR215

Przeworski M 2003 Estimating the time since the fixation of a beneficialallele Genetics 164(4) 1667ndash1676

Rabiner LR 1989 A tutorial on hidden Markov models and selectedapplications in speech recognition Proc IEEE 77(2) 257ndash286

Radwan J Babik W 2012 The genomics of adaptation Proc R Soc Lond BBiol Sci 279(1749) 5024ndash5028

Rannala B Reeve JP 2001 High-resolution multipoint linkage-disequilibrium mapping in the context of a human genome se-quence Am J Hum Genet 69(1) 159ndash178

Rannala B Reeve JP 2002 Joint Bayesian estimation of mutationlocation and age using linkage disequilibrium Biocomputing2003 526ndash534

Segurel L Wyman MJ Przeworski M 2014 Determinants of mutationrate variation in the human germline Annu Rev Genomics HumGenet 1547ndash70

Simoons FJ 1970 Primary adult lactose intolerance and the milkinghabit a problem in biologic and cultural interrelations Am J DigDis 15(8) 695ndash710

Skoglund P Malmstrom H Omrak A Raghavan M Valdiosera CGunther T Hall P Tambets K Parik J Sjogren K-G et al 2014Genomic diversity and admixture differs for Stone-AgeScandinavian foragers and farmers Science 344(6185) 747ndash750

Slatkin M 2001 Simulating genealogies of selected alleles in a populationof variable size Genet Res 78(01) 49ndash57

Slatkin M 2008 A Bayesian method for jointly estimating allele age andselection intensity Genet Res 90(01) 129ndash137

Slatkin M Hudson RR 1991 Pairwise comparisons of mitochondrialDNA sequences in stable and exponentially growing populationsGenetics 129(2) 555ndash562

Slatkin M Rannala B 2000 Estimating allele age Annu Rev GenomicsHum Genet 1(1) 225ndash249

Smith JM Haigh J 1974 The hitch-hiking effect of a favourable geneGenet Res 23(01) 23ndash35

Stinchcombe JR Hoekstra HE 2008 Combining population genomicsand quantitative genetics finding the genes underlying ecologicallyimportant traits Heredity 100(2) 158ndash170

Tang H Siegmund DO Shen P Oefner PJ Feldman MW 2002Frequentist estimation of coalescence times from nucleotide se-quence data using a tree-based partition Genetics 161(1) 447ndash459

Tavare S Balding DJ Griffiths RC Donnelly P 1997 Inferring coalescencetimes from DNA sequence data Genetics 145(2) 505ndash518

Tennessen JA Bigham AW OrsquoConnor TD Fu W Kenny EE Gravel SMcGee S Do R Liu X Jun G et al 2012 Evolution and functionalimpact of rare coding variation from deep sequencing of humanexomes Science 337(6090) 64ndash69

Thomson R Pritchard JK Shen P Oefner PJ Feldman MW 2000Recent common ancestry of human Y chromosomes evidencefrom DNA sequence data Proc Natl Acad Sci U S A 97(13)7360ndash7365

Tishkoff SA Reed FA Ranciaro A Voight BF Babbitt CC Silverman JSPowell K Mortensen HM Hirbo JB Osman M et al 2007

Smith et al doi101093molbevmsy006 MBE

1016

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Convergent adaptation of human lactase persistence in Africa andEurope Nat Genet 39(1) 31ndash40

Troelsen JT Olsen J Moslashller J SjeuroOstreuroOm H 2003 An upstream polymor-phism associated with lactase persistence has increased enhanceractivity Gastroenterology 125(6) 1686ndash1694

Varin C Reid N Firth D 2011 An overview of composite likelihoodmethods Stat Sin 21(1) 5ndash42

Voight BF Kudaravalli S Wen X Pritchard JK Hurst L 2006 A map ofrecent positive selection in the human genome PLoS Biol 4(3) e72

Wegmann D Kessner DE Veeramah KR Mathias RA Nicolae DL YanekLR Sun YV Torgerson DG Rafaels N Mosley T et al 2011Recombination rates in admixed individuals identified by ancestry-based inference Nat Genet 43(9) 847ndash853

Wilde S Timpson A Kirsanow K Kaiser E Kayser M Unterleuroander MHollfelder N Potekhina ID Schier W Thomas MG et al 2014 Directevidence for positive selection of skin hair and eye pigmentation inEuropeans during the last 5000 years Proc Natl Acad Sci U S A111(13) 4832ndash4837

Williamson SH Hernandez R Fledel-Alon A Zhu L Nielsen RBustamante CD 2005 Simultaneous inference of selection and pop-ulation growth from patterns of variation in the human genomeProc Natl Acad Sci U S A 102(22) 7882ndash7887

Wiuf C 2000 On the genealogy of a sample of neutral rare alleles TheorPopul Biol 58(1) 61ndash75

Wiuf C Donnelly P 1999 Conditional genealogies and the age of aneutral mutant Theor Popul Biol 56(2) 183ndash201

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1017

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

While we focus here on inference from modern DNAdata the increased accessibility of ancient DNA has addeda new dimension to population genetic data sets (Lazaridiset al 2014 Skoglund et al 2014 Allentoft et al 2015 Haaket al 2015 Mathieson et al 2015) Because it will remaindifficult to use ancient DNA approaches in many specieswith poor sample preservation we believe methods basedon modern DNA will continue to be useful going forwardThat said ancient DNA is providing an interesting avenue forcomparative work between inference from modern and an-cient samples For example Nakagome et al (2016) usesimulations to assess the fit of this ancient DNA polymor-phism to data simulated under their inferred parametervalues for allele age and selection intensity and they findreasonable agreement Much work still remains to fullyleverage ancient samples into population genetic inferencewhile accounting for new sources of uncertainty and poten-tial for sampling bias

Despite these challenges it is clear that our understandingof adaptive history will continue to benefit from new com-putational tools which extract insightful information from adiverse set of data sources

Materials and MethodsWe generated data using the software mssel (Hudson R per-sonal communication) which simulates a sample of haplo-types conditioned on the frequency trajectory of a selectedvariant under the structured coalescent (Hudson and Kaplan1988 Kaplan et al 1988) Trajectories were first simulatedforwards in time under a WrightndashFisher model for an additivelocus with varying strengths of selection and different endingfrequencies of the selected variant Trajectories were thentruncated to end at the first time the allele reaches a specifiedfrequency See supplementary table 1 SupplementaryMaterial online for relative ages of simulated TMRCA val-ues for different end frequencies and selection strengthsFor the results in figure 2 100 simulations were performedfor each parameter combination MCMCs were run for5000 iterations with a burn-in excluding the first 3000iterations A standard deviation of 10 was used for theproposal distribution of tca The red boxplots indicate localreference panels The blue and green boxplots indicate ref-erence panels diverged by 5N generations and 1N gener-ations respectively Each data set was simulated for a 1-Mblocus with a mutation rate of 1 108 recombination rateof 1 108

and population size of 10000 Sample sizesfor the selected and reference panels were 100 and 20respectively

For more efficient run times of the MCMC we set a max-imum number of individuals to include in the selected andreference panels to be 100 and 20 respectively In cases wherethe true number of haplotypes for either panel was greaterthan this in the full data set we resampled a subset ofhaplotypes from each population for a total of five replicatesper population For simulation results supporting the use ofthis resampling strategy see supplementary figure 5Supplementary Material online The MCMCs were run for

15000 iterations with a standard deviation of 20 for theTMRCA proposal distribution The first 9000 iterationswere removed as a burn-in leading to 6000 iterations for asample of the posterior Convergence was assessed by com-parison of MCMC replicates Figure 3 and supplementaryfigure 8 Supplementary Material online show the resultsfor all five variants along with previous point estimates and95 confidence intervals assuming a generation time of29 years (Fenner 2005) Supplementary table 3Supplementary Material online lists the mean and 95 cred-ible intervals for estimates with the highest mean posteriorprobability which we refer to in the text belowSupplementary table 4 Supplementary Material online liststhe previous estimates and confidence intervals with addi-tional details of the different approaches taken

To model recombination rate variation we used recombi-nation rates from the Decode sex-averaged recombinationmap inferred from pedigrees among families in Iceland(Kong et al 2010) Because some populations may have re-combination maps which differ from the Decode map at finescales we used a mean uniform recombination rate inferredfrom the 1-Mb region surrounding each variant The motiva-tion for this arises from how recombination rates have beenpreviously shown to remain relatively consistent among re-combination maps inferred for different populations at themegabase-scale (Broman et al 1998 Baudat et al 2010 Konget al 2010 Auton and McVean 2012) Further we found ourestimates depend mostly on having the megabase-scale re-combination rate appropriately set with little difference inmost cases for estimates obtained by modeling fine-scale re-combination at each locus (supplementary fig 7Supplementary Material online) We specify the switchingrate among background haplotypes after recombining off ofthe ancestral haplotype to be 4Nr where Nfrac14 10000 and r isthe mean recombination rate for the 1-Mb locus

For modeling mutation a challenge is that previous mu-tation rate estimates vary depending on the approach used(Segurel et al 2014) Estimates using the neutral substitutionrate between humans and chimps are more than 2 108

per bp per generation while estimates using whole-genomesequencing are closer to 1 108 As a compromise wespecify a mutation rate of 16 108

Supplementary MaterialSupplementary data are available at Molecular Biology andEvolution online

AcknowledgmentsWe would like to thank Hussein Al-Asadi Arjun BiddandaAnna Di Rienzo Dick Hudson Choongwon Jeong Evan KochJoseph Marcus Shigeki Nakagome Ben Peter Mark ReppelAlex White members of the Coop laboratory at UC Davismembers of the Przeworski laboratory at ColumbiaUniversity and members of the He and Stephens laboratoriesat the University of Chicago for helpful comments JS wassupported by an NSF Graduate Research Fellowship andNational Institute of General Medical Sciences of the

Smith et al doi101093molbevmsy006 MBE

1014

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

National Institutes of Health under award numbers DGE-1144082 and T32GM007197 respectively This work wasalso supported by the National Institute of General MedicalSciences of the National Institutes of Health under awardnumbers RO1GM83098 and RO1GM107374 to GC as wellas R01HG007089 to JN

ReferencesAkey JM Eberle MA Rieder MJ Carlson CS Shriver MD Nickerson DA

Kruglyak L 2004 Population history and natural selection shapepatterns of genetic variation in 132 genes PLoS Biol 21591ndash1599

Allentoft ME Sikora M Sjogren K-G Rasmussen S Rasmussen MStenderup J Damgaard PB Schroeder H Ahlstrom T Vinner L2015 Population genomics of Bronze Age Eurasia Nature522(7555) 167ndash172

Auton A McVean G 2012 Estimating recombination rates from geneticvariation in humans In Anisimova M ed Evolutionary genomicsMethods in Molecular Biology (Methods and Protocols) vol 856New York Humana Press

Barrett RDH Hoekstra HE 2011 Molecular spandrels tests of adaptationat the genetic level Nat Rev Genet 12(11) 767ndash780

Baudat F Buard J Grey C Fledel-Alon A Ober C Przeworski M Coop GMassy BD 2010 PRDM9 is a major determinant of meiotic recom-bination hotspots in humans and mice Science 327(5967) 836ndash840

Beaumont MA Zhang W Balding DJ 2002 Approximate Bayesian com-putation in population genetics Genetics 162(4) 2025ndash2035

Beleza S Johnson NA Candille SI Absher DM Coram MA Lopes JCampos J Araujo II Anderson TM Vilhjalmsson BJ et al 2013Genetic architecture of skin and eye color in an African-Europeanadmixed population PLoS Genet 9(3) e1003372

Beleza S Santos AM McEvoy B Alves I Martinho C Cameron E ShriverMD Parra EJ Rocha J 2013 The timing of pigmentation lightening inEuropeans Mol Biol Evol 30(1) 24ndash35

Berg JJ Coop G 2015 A coalescent model for a sweep of a uniquestanding variant Genetics 201(2) 707ndash725

Bersaglieri T Sabeti PC Patterson N Vanderploeg T Schaffner SF DrakeJA Rhodes M Reich DE Hirschhorn JN 2004 Genetic signatures ofstrong recent positive selection at the lactase gene Am J Hum Genet74(6) 1111ndash1120

Broman KW Murray JC Sheffield VC White RL Weber JL 1998Comprehensive human genetic maps individual and sex-specificvariation in recombination Am J Hum Genet 63(3) 861ndash869

Bryk J Hardouin E Pugach I Hughes D Strotmann R Stoneking M MylesS 2008 Positive selection in East Asians for an EDAR allele thatenhances NF-jB activation PLoS One 3(5) e2209

Chen H Slatkin M 2013 Inferring selection intensity and allele age frommultilocus haplotype structure G3 3(8) 1429ndash1442

Chen H Hey J Slatkin M 2015 A hidden Markov model for investigatingrecent positive selection through haplotype structure Theor PopulBiol 9918ndash30

Chun S Fay JC 2011 Evidence for hitchhiking of deleterious mutationswithin the human genome PLoS Genet 7(8) e1002240

Coop G Griffiths RC 2004 Ancestral inference on gene trees underselection Theor Popul Biol 66(3) 219ndash232

Coop G Bullaughey K Luca F Przeworski M 2008 The timing of selec-tion at the human FOXP2 gene Mol Biol Evol 25(7) 1257ndash1259

Dalziel AC Rogers SM Schulte PM 2009 Linking genotypes to pheno-types and fitness how mechanistic biology can inform molecularecology Mol Ecol 18(24) 4997ndash5017

Eaton K Edwards M Krithika S Cook G Norton H Parra EJ 2015Association study confirms the role of two OCA2 polymorphismsin normal skin pigmentation variation in East Asian populations AmJ Hum Biol 27 520ndash525

Edwards M Bigham A Tan J Li S Gozdzik A Ross K Jin L Parra EJ 2010Association of the OCA2 polymorphism His615Arg with melanin

content in east Asian populations further evidence of convergentevolution of skin pigmentation PLoS Genet 6(3) e1000867

Elmer KR Meyer A 2011 Adaptation in the age of ecological genomicsinsights from parallelism and convergence Trends Ecol Evol 26(6)298ndash306

Enattah NS Sahi T Savilahti E Terwilliger JD Peltonen L Jeuroarveleuroa I 2002Identification of a variant associated with adult-type hypolactasiaNat Genet 30(2) 233ndash237

Eng MY Luczak SE Wall TL 2007 ALDH2 ADH1B and ADH1C geno-types in Asians a literature review Alcohol Res Health 30(1) 22

Fenner JN 2005 Cross-cultural estimation of the human generationinterval for use in genetics-based population divergence studiesAm J Phys Anthropol 128(2) 415ndash423

Fujimoto A Ohashi J Nishida N Miyagawa T Morishita Y Tsunoda TKimura R Tokunaga K 2008 A replication study confirmed theEDAR gene to be a major contributor to population differentiationregarding head hair thickness in Asia Hum Genet 124(2) 179ndash185

Garud NR Messer PW Buzbas EO Petrov DA 2015 Recent selectivesweeps in North American Drosophila melanogaster show signaturesof soft sweeps PLoS Genet 11(2) e1005004

Griffiths RC Tavare S 1994 Sampling theory for neutral alleles in avarying environment Philos Trans R Soc B Biol Sci 344(1310)403ndash410

Haak W Lazaridis I Patterson N Rohland N Mallick S Llamas B BrandtG Nordenfelt S Harney E Stewardson K 2015 Massive migrationfrom the steppe was a source for Indo-European languages inEurope Nature 522 207ndash211

Hartfield M Otto SP 2011 Recombination and hitchhiking of deleteri-ous alleles Evolution 65(9) 2421ndash2434

Hermisson J Pennings PS 2005 Soft sweeps molecular population ge-netics of adaptation from standing genetic variation Genetics 169(4)2335ndash2352

Hinch AG Tandon A Patterson N Song Y Rohland N Palmer CD ChenGK Wang K Buxbaum SG Akylbekova EL et al 2011 The landscapeof recombination in African Americans Nature 476(7359) 170ndash175

Hudson RR 1990 Gene genealogies and the coalescent process Oxf SurvEvol Biol 7(1) 44

Hudson RR 2007 The variance of coalescent time estimates from DNAsequences J Mol Evol 64(6) 702ndash705

Hudson RR Kaplan NL 1988 The coalescent process in models withselection and recombination Genetics 120(3) 831ndash840

Huff CD Witherspoon DJ Zhang Y Gatenbee C Denson LA KugathasanS Hakonarson H Whiting A Davis CT Wu W et al 2012 Crohnrsquosdisease and genetic hitchhiking at IBD5 Mol Biol Evol 29(1)101ndash111

Innan H Kim Y 2004 Pattern of polymorphism after strong artificialselection in a domestication event Proc Natl Acad Sci U S A 101(29)10667ndash10672

Itan Y Powell A Beaumont MA Burger J Thomas MG 2009 The originsof lactase persistence in Europe PLoS Comput Biol 5(8) e1000491

Jablonski NG Chaplin G 2000 The evolution of human skin coloration JHum Evol 39(1) 57ndash106

Jablonski NG Chaplin G 2012 Human skin pigmentation migration anddisease susceptibility Philos Trans R Soc B Biol Sci 367(1590)785ndash792

Jeong C Di Rienzo A 2014 Adaptations to local environments in mod-ern human populations Curr Opin Genet Dev 291ndash8

Kamberov YG Wang S Tan J Gerbault P Wark A Tan L Yang Y Li STang K Chen H et al 2013 Modeling recent human evolution inmice by expression of a selected EDAR variant Cell 152(4) 691ndash702

Kaplan NL Darden T Hudson RR 1988 The coalescent process inmodels with selection Genetics 120(3) 819ndash829

Kaplan NL Hudson RR Langley CH 1989 The ldquohitchhiking effectrdquo revis-ited Genetics 123(4) 887ndash899

Kimura R Yamaguchi T Takeda M Kondo O Toma T Haneji KHanihara T Matsukusa H Kawamura S Maki K et al 2009 A com-mon variation in EDAR is a genetic determinant of shovel-shapedincisors Am J Hum Genet 85(4) 528ndash535

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1015

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Kong A Thorleifsson G Gudbjartsson DF Masson G Sigurdsson AJonasdottir A Walters GB Jonasdottir A Gylfason A KristinssonKT et al 2010 Fine-scale recombination rate differences betweensexes populations and individuals Nature 467(7319) 1099ndash1103

Kuokkanen M Kokkonen J Enattah NS Ylisaukko-oja T Komu H VariloT Peltonen L Savilahti E Jeuroarveleuroa I 2006 Mutations in the translatedregion of the lactase gene (LCT) underlie congenital lactase defi-ciency Am J Hum Genet 78(2) 339ndash344

Larribe F Fearnhead P 2011 On composite likelihoods in statisticalgenetics Stat Sin 21(1) 43ndash69

Lazaridis I Patterson N Mittnik A Renaud G Mallick S Kirsanow KSudmant PH Schraiber JG Castellano S Lipson M et al 2014Ancient human genomes suggest three ancestral populations forpresent-day Europeans Nature 513(7518) 409ndash413

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357) 493ndash496

Li H Mukherjee N Soundararajan U Tarnok Z Barta C Khaliq SMohyuddin A Kajuna SLB Mehdi SQ Kidd JR et al 2007Geographically separate increases in the frequency of the derivedADH1B 47His allele in eastern and western Asia Am J Hum Genet81(4) 842ndash846

Li N Stephens M 2003 Modeling linkage disequilibrium and identifyingrecombination hotspots using single-nucleotide polymorphismdata Genetics 165(4) 2213ndash2233

Loomis WF 1967 Skin-pigment regulation of vitamin-D biosynthesis inman Science 157(3788) 501ndash506

Mathieson I Lazaridis I Rohland N Mallick S Patterson N RoodenbergSA Harney E Stewardson K Fernandes D Novak M et al 2015Genome-wide patterns of selection in 230 ancient Eurasians Nature528(7583) 499ndash503

McPeek MS Strahs A 1999 Assessment of linkage disequilibrium by thedecay of haplotype sharing with application to fine-scale geneticmapping Am J Hum Genet 65(3) 858ndash875

Meligkotsidou L Fearnhead P 2005 Maximum-likelihood estimationof coalescence times in genealogical trees Genetics 171(4)2073ndash2084

Messer PW Neher RA 2012 Estimating the strength of selective sweepsfrom deep population diversity data Genetics 191(2) 593ndash605

Messer PW Petrov DA 2013 Population genomics of rapid adaptationby soft selective sweeps Trends Ecol Evol 28(11) 659ndash669

Miller CT Beleza S Pollen AA Schluter D Kittles RA Shriver MDKingsley DM 2007 cis-Regulatory changes in kit ligand expressionand parallel evolution of pigmentation in sticklebacks and humansCell 131(6) 1179ndash1189

Morris AP Whittaker JC Balding DJ 2000 Bayesian fine-scale mapping ofdisease loci by hidden Markov models Am J Hum Genet 67(1)155ndash169

Morris AP Whittaker JC Balding DJ 2002 Fine-scale mapping of diseaseloci via shattered coalescent modeling of genealogies Am J HumGenet 70(3) 686ndash707

Nakagome S Alkorta-Aranburu G Amato R Howie B Peter BMHudson RR Di Rienzo A 2016 Estimating the ages of selectionsignals from different epochs in human history Mol Biol Evol33(3) 657ndash669

Nielsen R 2005 Molecular signatures of natural selection Annu RevGenet 39 197ndash218

Olds LC Sibley E 2003 Lactase persistence DNA variant enhances lactasepromoter activity in vitro functional role as a cis regulatory elementHum Mol Genet 12(18) 2333ndash2340

Ormond L Foll M Ewing GB Pfeifer SP Jensen JD 2016 Inferring the ageof a fixed beneficial allele Mol Ecol 25(1) 157ndash169

Osier MV Pakstis AJ Soodyall H Comas D Goldman D Odunsi AOkonofua F Parnas J Schulz LO Bertranpetit J et al 2002 A globalperspective on genetic variation at the ADH genes reveals unusualpatterns of linkage disequilibrium and diversity Am J Hum Genet71(1) 84ndash99

Peng Y Shi H Qi X-B Xiao C-J Zhong H Ma Run-lin Z Su B 2010The ADH1B Arg47His polymorphism in East Asian populations

and expansion of rice domestication in history BMC Evol Biol10(1) 15

Peter BM Huerta-Sanchez E Nielsen R 2012 Distinguishing betweenselective sweeps from standing variation and from a de novo mu-tation PLoS Genet 8(10) e1003011

Przeworski M Coop G Wall JD 2005 The signature of positive selectionon standing genetic variation Evolution 59(11) 2312ndash2323

Price AL Tandon A Patterson N Barnes KC Rafaels N Ruczinski I BeatyTH Mathias R Reich D Myers S 2009 Sensitive detection of chro-mosomal segments of distinct ancestry in admixed populationsPLoS Genet 5(6) e1000519

Pritchard JK Seielstad MT Perez-Lezaun A Feldman MW 1999Population growth of human Y chromosomes a study of Y chro-mosome microsatellites Mol Biol Evol 16(12) 1791ndash1798

Pritchard JK Pickrell JK Coop G 2010 The genetics of human adapta-tion hard sweeps soft sweeps and polygenic adaptation Curr Biol20(4) R208ndashR215

Przeworski M 2003 Estimating the time since the fixation of a beneficialallele Genetics 164(4) 1667ndash1676

Rabiner LR 1989 A tutorial on hidden Markov models and selectedapplications in speech recognition Proc IEEE 77(2) 257ndash286

Radwan J Babik W 2012 The genomics of adaptation Proc R Soc Lond BBiol Sci 279(1749) 5024ndash5028

Rannala B Reeve JP 2001 High-resolution multipoint linkage-disequilibrium mapping in the context of a human genome se-quence Am J Hum Genet 69(1) 159ndash178

Rannala B Reeve JP 2002 Joint Bayesian estimation of mutationlocation and age using linkage disequilibrium Biocomputing2003 526ndash534

Segurel L Wyman MJ Przeworski M 2014 Determinants of mutationrate variation in the human germline Annu Rev Genomics HumGenet 1547ndash70

Simoons FJ 1970 Primary adult lactose intolerance and the milkinghabit a problem in biologic and cultural interrelations Am J DigDis 15(8) 695ndash710

Skoglund P Malmstrom H Omrak A Raghavan M Valdiosera CGunther T Hall P Tambets K Parik J Sjogren K-G et al 2014Genomic diversity and admixture differs for Stone-AgeScandinavian foragers and farmers Science 344(6185) 747ndash750

Slatkin M 2001 Simulating genealogies of selected alleles in a populationof variable size Genet Res 78(01) 49ndash57

Slatkin M 2008 A Bayesian method for jointly estimating allele age andselection intensity Genet Res 90(01) 129ndash137

Slatkin M Hudson RR 1991 Pairwise comparisons of mitochondrialDNA sequences in stable and exponentially growing populationsGenetics 129(2) 555ndash562

Slatkin M Rannala B 2000 Estimating allele age Annu Rev GenomicsHum Genet 1(1) 225ndash249

Smith JM Haigh J 1974 The hitch-hiking effect of a favourable geneGenet Res 23(01) 23ndash35

Stinchcombe JR Hoekstra HE 2008 Combining population genomicsand quantitative genetics finding the genes underlying ecologicallyimportant traits Heredity 100(2) 158ndash170

Tang H Siegmund DO Shen P Oefner PJ Feldman MW 2002Frequentist estimation of coalescence times from nucleotide se-quence data using a tree-based partition Genetics 161(1) 447ndash459

Tavare S Balding DJ Griffiths RC Donnelly P 1997 Inferring coalescencetimes from DNA sequence data Genetics 145(2) 505ndash518

Tennessen JA Bigham AW OrsquoConnor TD Fu W Kenny EE Gravel SMcGee S Do R Liu X Jun G et al 2012 Evolution and functionalimpact of rare coding variation from deep sequencing of humanexomes Science 337(6090) 64ndash69

Thomson R Pritchard JK Shen P Oefner PJ Feldman MW 2000Recent common ancestry of human Y chromosomes evidencefrom DNA sequence data Proc Natl Acad Sci U S A 97(13)7360ndash7365

Tishkoff SA Reed FA Ranciaro A Voight BF Babbitt CC Silverman JSPowell K Mortensen HM Hirbo JB Osman M et al 2007

Smith et al doi101093molbevmsy006 MBE

1016

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Convergent adaptation of human lactase persistence in Africa andEurope Nat Genet 39(1) 31ndash40

Troelsen JT Olsen J Moslashller J SjeuroOstreuroOm H 2003 An upstream polymor-phism associated with lactase persistence has increased enhanceractivity Gastroenterology 125(6) 1686ndash1694

Varin C Reid N Firth D 2011 An overview of composite likelihoodmethods Stat Sin 21(1) 5ndash42

Voight BF Kudaravalli S Wen X Pritchard JK Hurst L 2006 A map ofrecent positive selection in the human genome PLoS Biol 4(3) e72

Wegmann D Kessner DE Veeramah KR Mathias RA Nicolae DL YanekLR Sun YV Torgerson DG Rafaels N Mosley T et al 2011Recombination rates in admixed individuals identified by ancestry-based inference Nat Genet 43(9) 847ndash853

Wilde S Timpson A Kirsanow K Kaiser E Kayser M Unterleuroander MHollfelder N Potekhina ID Schier W Thomas MG et al 2014 Directevidence for positive selection of skin hair and eye pigmentation inEuropeans during the last 5000 years Proc Natl Acad Sci U S A111(13) 4832ndash4837

Williamson SH Hernandez R Fledel-Alon A Zhu L Nielsen RBustamante CD 2005 Simultaneous inference of selection and pop-ulation growth from patterns of variation in the human genomeProc Natl Acad Sci U S A 102(22) 7882ndash7887

Wiuf C 2000 On the genealogy of a sample of neutral rare alleles TheorPopul Biol 58(1) 61ndash75

Wiuf C Donnelly P 1999 Conditional genealogies and the age of aneutral mutant Theor Popul Biol 56(2) 183ndash201

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1017

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

National Institutes of Health under award numbers DGE-1144082 and T32GM007197 respectively This work wasalso supported by the National Institute of General MedicalSciences of the National Institutes of Health under awardnumbers RO1GM83098 and RO1GM107374 to GC as wellas R01HG007089 to JN

ReferencesAkey JM Eberle MA Rieder MJ Carlson CS Shriver MD Nickerson DA

Kruglyak L 2004 Population history and natural selection shapepatterns of genetic variation in 132 genes PLoS Biol 21591ndash1599

Allentoft ME Sikora M Sjogren K-G Rasmussen S Rasmussen MStenderup J Damgaard PB Schroeder H Ahlstrom T Vinner L2015 Population genomics of Bronze Age Eurasia Nature522(7555) 167ndash172

Auton A McVean G 2012 Estimating recombination rates from geneticvariation in humans In Anisimova M ed Evolutionary genomicsMethods in Molecular Biology (Methods and Protocols) vol 856New York Humana Press

Barrett RDH Hoekstra HE 2011 Molecular spandrels tests of adaptationat the genetic level Nat Rev Genet 12(11) 767ndash780

Baudat F Buard J Grey C Fledel-Alon A Ober C Przeworski M Coop GMassy BD 2010 PRDM9 is a major determinant of meiotic recom-bination hotspots in humans and mice Science 327(5967) 836ndash840

Beaumont MA Zhang W Balding DJ 2002 Approximate Bayesian com-putation in population genetics Genetics 162(4) 2025ndash2035

Beleza S Johnson NA Candille SI Absher DM Coram MA Lopes JCampos J Araujo II Anderson TM Vilhjalmsson BJ et al 2013Genetic architecture of skin and eye color in an African-Europeanadmixed population PLoS Genet 9(3) e1003372

Beleza S Santos AM McEvoy B Alves I Martinho C Cameron E ShriverMD Parra EJ Rocha J 2013 The timing of pigmentation lightening inEuropeans Mol Biol Evol 30(1) 24ndash35

Berg JJ Coop G 2015 A coalescent model for a sweep of a uniquestanding variant Genetics 201(2) 707ndash725

Bersaglieri T Sabeti PC Patterson N Vanderploeg T Schaffner SF DrakeJA Rhodes M Reich DE Hirschhorn JN 2004 Genetic signatures ofstrong recent positive selection at the lactase gene Am J Hum Genet74(6) 1111ndash1120

Broman KW Murray JC Sheffield VC White RL Weber JL 1998Comprehensive human genetic maps individual and sex-specificvariation in recombination Am J Hum Genet 63(3) 861ndash869

Bryk J Hardouin E Pugach I Hughes D Strotmann R Stoneking M MylesS 2008 Positive selection in East Asians for an EDAR allele thatenhances NF-jB activation PLoS One 3(5) e2209

Chen H Slatkin M 2013 Inferring selection intensity and allele age frommultilocus haplotype structure G3 3(8) 1429ndash1442

Chen H Hey J Slatkin M 2015 A hidden Markov model for investigatingrecent positive selection through haplotype structure Theor PopulBiol 9918ndash30

Chun S Fay JC 2011 Evidence for hitchhiking of deleterious mutationswithin the human genome PLoS Genet 7(8) e1002240

Coop G Griffiths RC 2004 Ancestral inference on gene trees underselection Theor Popul Biol 66(3) 219ndash232

Coop G Bullaughey K Luca F Przeworski M 2008 The timing of selec-tion at the human FOXP2 gene Mol Biol Evol 25(7) 1257ndash1259

Dalziel AC Rogers SM Schulte PM 2009 Linking genotypes to pheno-types and fitness how mechanistic biology can inform molecularecology Mol Ecol 18(24) 4997ndash5017

Eaton K Edwards M Krithika S Cook G Norton H Parra EJ 2015Association study confirms the role of two OCA2 polymorphismsin normal skin pigmentation variation in East Asian populations AmJ Hum Biol 27 520ndash525

Edwards M Bigham A Tan J Li S Gozdzik A Ross K Jin L Parra EJ 2010Association of the OCA2 polymorphism His615Arg with melanin

content in east Asian populations further evidence of convergentevolution of skin pigmentation PLoS Genet 6(3) e1000867

Elmer KR Meyer A 2011 Adaptation in the age of ecological genomicsinsights from parallelism and convergence Trends Ecol Evol 26(6)298ndash306

Enattah NS Sahi T Savilahti E Terwilliger JD Peltonen L Jeuroarveleuroa I 2002Identification of a variant associated with adult-type hypolactasiaNat Genet 30(2) 233ndash237

Eng MY Luczak SE Wall TL 2007 ALDH2 ADH1B and ADH1C geno-types in Asians a literature review Alcohol Res Health 30(1) 22

Fenner JN 2005 Cross-cultural estimation of the human generationinterval for use in genetics-based population divergence studiesAm J Phys Anthropol 128(2) 415ndash423

Fujimoto A Ohashi J Nishida N Miyagawa T Morishita Y Tsunoda TKimura R Tokunaga K 2008 A replication study confirmed theEDAR gene to be a major contributor to population differentiationregarding head hair thickness in Asia Hum Genet 124(2) 179ndash185

Garud NR Messer PW Buzbas EO Petrov DA 2015 Recent selectivesweeps in North American Drosophila melanogaster show signaturesof soft sweeps PLoS Genet 11(2) e1005004

Griffiths RC Tavare S 1994 Sampling theory for neutral alleles in avarying environment Philos Trans R Soc B Biol Sci 344(1310)403ndash410

Haak W Lazaridis I Patterson N Rohland N Mallick S Llamas B BrandtG Nordenfelt S Harney E Stewardson K 2015 Massive migrationfrom the steppe was a source for Indo-European languages inEurope Nature 522 207ndash211

Hartfield M Otto SP 2011 Recombination and hitchhiking of deleteri-ous alleles Evolution 65(9) 2421ndash2434

Hermisson J Pennings PS 2005 Soft sweeps molecular population ge-netics of adaptation from standing genetic variation Genetics 169(4)2335ndash2352

Hinch AG Tandon A Patterson N Song Y Rohland N Palmer CD ChenGK Wang K Buxbaum SG Akylbekova EL et al 2011 The landscapeof recombination in African Americans Nature 476(7359) 170ndash175

Hudson RR 1990 Gene genealogies and the coalescent process Oxf SurvEvol Biol 7(1) 44

Hudson RR 2007 The variance of coalescent time estimates from DNAsequences J Mol Evol 64(6) 702ndash705

Hudson RR Kaplan NL 1988 The coalescent process in models withselection and recombination Genetics 120(3) 831ndash840

Huff CD Witherspoon DJ Zhang Y Gatenbee C Denson LA KugathasanS Hakonarson H Whiting A Davis CT Wu W et al 2012 Crohnrsquosdisease and genetic hitchhiking at IBD5 Mol Biol Evol 29(1)101ndash111

Innan H Kim Y 2004 Pattern of polymorphism after strong artificialselection in a domestication event Proc Natl Acad Sci U S A 101(29)10667ndash10672

Itan Y Powell A Beaumont MA Burger J Thomas MG 2009 The originsof lactase persistence in Europe PLoS Comput Biol 5(8) e1000491

Jablonski NG Chaplin G 2000 The evolution of human skin coloration JHum Evol 39(1) 57ndash106

Jablonski NG Chaplin G 2012 Human skin pigmentation migration anddisease susceptibility Philos Trans R Soc B Biol Sci 367(1590)785ndash792

Jeong C Di Rienzo A 2014 Adaptations to local environments in mod-ern human populations Curr Opin Genet Dev 291ndash8

Kamberov YG Wang S Tan J Gerbault P Wark A Tan L Yang Y Li STang K Chen H et al 2013 Modeling recent human evolution inmice by expression of a selected EDAR variant Cell 152(4) 691ndash702

Kaplan NL Darden T Hudson RR 1988 The coalescent process inmodels with selection Genetics 120(3) 819ndash829

Kaplan NL Hudson RR Langley CH 1989 The ldquohitchhiking effectrdquo revis-ited Genetics 123(4) 887ndash899

Kimura R Yamaguchi T Takeda M Kondo O Toma T Haneji KHanihara T Matsukusa H Kawamura S Maki K et al 2009 A com-mon variation in EDAR is a genetic determinant of shovel-shapedincisors Am J Hum Genet 85(4) 528ndash535

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1015

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Kong A Thorleifsson G Gudbjartsson DF Masson G Sigurdsson AJonasdottir A Walters GB Jonasdottir A Gylfason A KristinssonKT et al 2010 Fine-scale recombination rate differences betweensexes populations and individuals Nature 467(7319) 1099ndash1103

Kuokkanen M Kokkonen J Enattah NS Ylisaukko-oja T Komu H VariloT Peltonen L Savilahti E Jeuroarveleuroa I 2006 Mutations in the translatedregion of the lactase gene (LCT) underlie congenital lactase defi-ciency Am J Hum Genet 78(2) 339ndash344

Larribe F Fearnhead P 2011 On composite likelihoods in statisticalgenetics Stat Sin 21(1) 43ndash69

Lazaridis I Patterson N Mittnik A Renaud G Mallick S Kirsanow KSudmant PH Schraiber JG Castellano S Lipson M et al 2014Ancient human genomes suggest three ancestral populations forpresent-day Europeans Nature 513(7518) 409ndash413

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357) 493ndash496

Li H Mukherjee N Soundararajan U Tarnok Z Barta C Khaliq SMohyuddin A Kajuna SLB Mehdi SQ Kidd JR et al 2007Geographically separate increases in the frequency of the derivedADH1B 47His allele in eastern and western Asia Am J Hum Genet81(4) 842ndash846

Li N Stephens M 2003 Modeling linkage disequilibrium and identifyingrecombination hotspots using single-nucleotide polymorphismdata Genetics 165(4) 2213ndash2233

Loomis WF 1967 Skin-pigment regulation of vitamin-D biosynthesis inman Science 157(3788) 501ndash506

Mathieson I Lazaridis I Rohland N Mallick S Patterson N RoodenbergSA Harney E Stewardson K Fernandes D Novak M et al 2015Genome-wide patterns of selection in 230 ancient Eurasians Nature528(7583) 499ndash503

McPeek MS Strahs A 1999 Assessment of linkage disequilibrium by thedecay of haplotype sharing with application to fine-scale geneticmapping Am J Hum Genet 65(3) 858ndash875

Meligkotsidou L Fearnhead P 2005 Maximum-likelihood estimationof coalescence times in genealogical trees Genetics 171(4)2073ndash2084

Messer PW Neher RA 2012 Estimating the strength of selective sweepsfrom deep population diversity data Genetics 191(2) 593ndash605

Messer PW Petrov DA 2013 Population genomics of rapid adaptationby soft selective sweeps Trends Ecol Evol 28(11) 659ndash669

Miller CT Beleza S Pollen AA Schluter D Kittles RA Shriver MDKingsley DM 2007 cis-Regulatory changes in kit ligand expressionand parallel evolution of pigmentation in sticklebacks and humansCell 131(6) 1179ndash1189

Morris AP Whittaker JC Balding DJ 2000 Bayesian fine-scale mapping ofdisease loci by hidden Markov models Am J Hum Genet 67(1)155ndash169

Morris AP Whittaker JC Balding DJ 2002 Fine-scale mapping of diseaseloci via shattered coalescent modeling of genealogies Am J HumGenet 70(3) 686ndash707

Nakagome S Alkorta-Aranburu G Amato R Howie B Peter BMHudson RR Di Rienzo A 2016 Estimating the ages of selectionsignals from different epochs in human history Mol Biol Evol33(3) 657ndash669

Nielsen R 2005 Molecular signatures of natural selection Annu RevGenet 39 197ndash218

Olds LC Sibley E 2003 Lactase persistence DNA variant enhances lactasepromoter activity in vitro functional role as a cis regulatory elementHum Mol Genet 12(18) 2333ndash2340

Ormond L Foll M Ewing GB Pfeifer SP Jensen JD 2016 Inferring the ageof a fixed beneficial allele Mol Ecol 25(1) 157ndash169

Osier MV Pakstis AJ Soodyall H Comas D Goldman D Odunsi AOkonofua F Parnas J Schulz LO Bertranpetit J et al 2002 A globalperspective on genetic variation at the ADH genes reveals unusualpatterns of linkage disequilibrium and diversity Am J Hum Genet71(1) 84ndash99

Peng Y Shi H Qi X-B Xiao C-J Zhong H Ma Run-lin Z Su B 2010The ADH1B Arg47His polymorphism in East Asian populations

and expansion of rice domestication in history BMC Evol Biol10(1) 15

Peter BM Huerta-Sanchez E Nielsen R 2012 Distinguishing betweenselective sweeps from standing variation and from a de novo mu-tation PLoS Genet 8(10) e1003011

Przeworski M Coop G Wall JD 2005 The signature of positive selectionon standing genetic variation Evolution 59(11) 2312ndash2323

Price AL Tandon A Patterson N Barnes KC Rafaels N Ruczinski I BeatyTH Mathias R Reich D Myers S 2009 Sensitive detection of chro-mosomal segments of distinct ancestry in admixed populationsPLoS Genet 5(6) e1000519

Pritchard JK Seielstad MT Perez-Lezaun A Feldman MW 1999Population growth of human Y chromosomes a study of Y chro-mosome microsatellites Mol Biol Evol 16(12) 1791ndash1798

Pritchard JK Pickrell JK Coop G 2010 The genetics of human adapta-tion hard sweeps soft sweeps and polygenic adaptation Curr Biol20(4) R208ndashR215

Przeworski M 2003 Estimating the time since the fixation of a beneficialallele Genetics 164(4) 1667ndash1676

Rabiner LR 1989 A tutorial on hidden Markov models and selectedapplications in speech recognition Proc IEEE 77(2) 257ndash286

Radwan J Babik W 2012 The genomics of adaptation Proc R Soc Lond BBiol Sci 279(1749) 5024ndash5028

Rannala B Reeve JP 2001 High-resolution multipoint linkage-disequilibrium mapping in the context of a human genome se-quence Am J Hum Genet 69(1) 159ndash178

Rannala B Reeve JP 2002 Joint Bayesian estimation of mutationlocation and age using linkage disequilibrium Biocomputing2003 526ndash534

Segurel L Wyman MJ Przeworski M 2014 Determinants of mutationrate variation in the human germline Annu Rev Genomics HumGenet 1547ndash70

Simoons FJ 1970 Primary adult lactose intolerance and the milkinghabit a problem in biologic and cultural interrelations Am J DigDis 15(8) 695ndash710

Skoglund P Malmstrom H Omrak A Raghavan M Valdiosera CGunther T Hall P Tambets K Parik J Sjogren K-G et al 2014Genomic diversity and admixture differs for Stone-AgeScandinavian foragers and farmers Science 344(6185) 747ndash750

Slatkin M 2001 Simulating genealogies of selected alleles in a populationof variable size Genet Res 78(01) 49ndash57

Slatkin M 2008 A Bayesian method for jointly estimating allele age andselection intensity Genet Res 90(01) 129ndash137

Slatkin M Hudson RR 1991 Pairwise comparisons of mitochondrialDNA sequences in stable and exponentially growing populationsGenetics 129(2) 555ndash562

Slatkin M Rannala B 2000 Estimating allele age Annu Rev GenomicsHum Genet 1(1) 225ndash249

Smith JM Haigh J 1974 The hitch-hiking effect of a favourable geneGenet Res 23(01) 23ndash35

Stinchcombe JR Hoekstra HE 2008 Combining population genomicsand quantitative genetics finding the genes underlying ecologicallyimportant traits Heredity 100(2) 158ndash170

Tang H Siegmund DO Shen P Oefner PJ Feldman MW 2002Frequentist estimation of coalescence times from nucleotide se-quence data using a tree-based partition Genetics 161(1) 447ndash459

Tavare S Balding DJ Griffiths RC Donnelly P 1997 Inferring coalescencetimes from DNA sequence data Genetics 145(2) 505ndash518

Tennessen JA Bigham AW OrsquoConnor TD Fu W Kenny EE Gravel SMcGee S Do R Liu X Jun G et al 2012 Evolution and functionalimpact of rare coding variation from deep sequencing of humanexomes Science 337(6090) 64ndash69

Thomson R Pritchard JK Shen P Oefner PJ Feldman MW 2000Recent common ancestry of human Y chromosomes evidencefrom DNA sequence data Proc Natl Acad Sci U S A 97(13)7360ndash7365

Tishkoff SA Reed FA Ranciaro A Voight BF Babbitt CC Silverman JSPowell K Mortensen HM Hirbo JB Osman M et al 2007

Smith et al doi101093molbevmsy006 MBE

1016

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Convergent adaptation of human lactase persistence in Africa andEurope Nat Genet 39(1) 31ndash40

Troelsen JT Olsen J Moslashller J SjeuroOstreuroOm H 2003 An upstream polymor-phism associated with lactase persistence has increased enhanceractivity Gastroenterology 125(6) 1686ndash1694

Varin C Reid N Firth D 2011 An overview of composite likelihoodmethods Stat Sin 21(1) 5ndash42

Voight BF Kudaravalli S Wen X Pritchard JK Hurst L 2006 A map ofrecent positive selection in the human genome PLoS Biol 4(3) e72

Wegmann D Kessner DE Veeramah KR Mathias RA Nicolae DL YanekLR Sun YV Torgerson DG Rafaels N Mosley T et al 2011Recombination rates in admixed individuals identified by ancestry-based inference Nat Genet 43(9) 847ndash853

Wilde S Timpson A Kirsanow K Kaiser E Kayser M Unterleuroander MHollfelder N Potekhina ID Schier W Thomas MG et al 2014 Directevidence for positive selection of skin hair and eye pigmentation inEuropeans during the last 5000 years Proc Natl Acad Sci U S A111(13) 4832ndash4837

Williamson SH Hernandez R Fledel-Alon A Zhu L Nielsen RBustamante CD 2005 Simultaneous inference of selection and pop-ulation growth from patterns of variation in the human genomeProc Natl Acad Sci U S A 102(22) 7882ndash7887

Wiuf C 2000 On the genealogy of a sample of neutral rare alleles TheorPopul Biol 58(1) 61ndash75

Wiuf C Donnelly P 1999 Conditional genealogies and the age of aneutral mutant Theor Popul Biol 56(2) 183ndash201

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1017

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Kong A Thorleifsson G Gudbjartsson DF Masson G Sigurdsson AJonasdottir A Walters GB Jonasdottir A Gylfason A KristinssonKT et al 2010 Fine-scale recombination rate differences betweensexes populations and individuals Nature 467(7319) 1099ndash1103

Kuokkanen M Kokkonen J Enattah NS Ylisaukko-oja T Komu H VariloT Peltonen L Savilahti E Jeuroarveleuroa I 2006 Mutations in the translatedregion of the lactase gene (LCT) underlie congenital lactase defi-ciency Am J Hum Genet 78(2) 339ndash344

Larribe F Fearnhead P 2011 On composite likelihoods in statisticalgenetics Stat Sin 21(1) 43ndash69

Lazaridis I Patterson N Mittnik A Renaud G Mallick S Kirsanow KSudmant PH Schraiber JG Castellano S Lipson M et al 2014Ancient human genomes suggest three ancestral populations forpresent-day Europeans Nature 513(7518) 409ndash413

Li H Durbin R 2011 Inference of human population history from indi-vidual whole-genome sequences Nature 475(7357) 493ndash496

Li H Mukherjee N Soundararajan U Tarnok Z Barta C Khaliq SMohyuddin A Kajuna SLB Mehdi SQ Kidd JR et al 2007Geographically separate increases in the frequency of the derivedADH1B 47His allele in eastern and western Asia Am J Hum Genet81(4) 842ndash846

Li N Stephens M 2003 Modeling linkage disequilibrium and identifyingrecombination hotspots using single-nucleotide polymorphismdata Genetics 165(4) 2213ndash2233

Loomis WF 1967 Skin-pigment regulation of vitamin-D biosynthesis inman Science 157(3788) 501ndash506

Mathieson I Lazaridis I Rohland N Mallick S Patterson N RoodenbergSA Harney E Stewardson K Fernandes D Novak M et al 2015Genome-wide patterns of selection in 230 ancient Eurasians Nature528(7583) 499ndash503

McPeek MS Strahs A 1999 Assessment of linkage disequilibrium by thedecay of haplotype sharing with application to fine-scale geneticmapping Am J Hum Genet 65(3) 858ndash875

Meligkotsidou L Fearnhead P 2005 Maximum-likelihood estimationof coalescence times in genealogical trees Genetics 171(4)2073ndash2084

Messer PW Neher RA 2012 Estimating the strength of selective sweepsfrom deep population diversity data Genetics 191(2) 593ndash605

Messer PW Petrov DA 2013 Population genomics of rapid adaptationby soft selective sweeps Trends Ecol Evol 28(11) 659ndash669

Miller CT Beleza S Pollen AA Schluter D Kittles RA Shriver MDKingsley DM 2007 cis-Regulatory changes in kit ligand expressionand parallel evolution of pigmentation in sticklebacks and humansCell 131(6) 1179ndash1189

Morris AP Whittaker JC Balding DJ 2000 Bayesian fine-scale mapping ofdisease loci by hidden Markov models Am J Hum Genet 67(1)155ndash169

Morris AP Whittaker JC Balding DJ 2002 Fine-scale mapping of diseaseloci via shattered coalescent modeling of genealogies Am J HumGenet 70(3) 686ndash707

Nakagome S Alkorta-Aranburu G Amato R Howie B Peter BMHudson RR Di Rienzo A 2016 Estimating the ages of selectionsignals from different epochs in human history Mol Biol Evol33(3) 657ndash669

Nielsen R 2005 Molecular signatures of natural selection Annu RevGenet 39 197ndash218

Olds LC Sibley E 2003 Lactase persistence DNA variant enhances lactasepromoter activity in vitro functional role as a cis regulatory elementHum Mol Genet 12(18) 2333ndash2340

Ormond L Foll M Ewing GB Pfeifer SP Jensen JD 2016 Inferring the ageof a fixed beneficial allele Mol Ecol 25(1) 157ndash169

Osier MV Pakstis AJ Soodyall H Comas D Goldman D Odunsi AOkonofua F Parnas J Schulz LO Bertranpetit J et al 2002 A globalperspective on genetic variation at the ADH genes reveals unusualpatterns of linkage disequilibrium and diversity Am J Hum Genet71(1) 84ndash99

Peng Y Shi H Qi X-B Xiao C-J Zhong H Ma Run-lin Z Su B 2010The ADH1B Arg47His polymorphism in East Asian populations

and expansion of rice domestication in history BMC Evol Biol10(1) 15

Peter BM Huerta-Sanchez E Nielsen R 2012 Distinguishing betweenselective sweeps from standing variation and from a de novo mu-tation PLoS Genet 8(10) e1003011

Przeworski M Coop G Wall JD 2005 The signature of positive selectionon standing genetic variation Evolution 59(11) 2312ndash2323

Price AL Tandon A Patterson N Barnes KC Rafaels N Ruczinski I BeatyTH Mathias R Reich D Myers S 2009 Sensitive detection of chro-mosomal segments of distinct ancestry in admixed populationsPLoS Genet 5(6) e1000519

Pritchard JK Seielstad MT Perez-Lezaun A Feldman MW 1999Population growth of human Y chromosomes a study of Y chro-mosome microsatellites Mol Biol Evol 16(12) 1791ndash1798

Pritchard JK Pickrell JK Coop G 2010 The genetics of human adapta-tion hard sweeps soft sweeps and polygenic adaptation Curr Biol20(4) R208ndashR215

Przeworski M 2003 Estimating the time since the fixation of a beneficialallele Genetics 164(4) 1667ndash1676

Rabiner LR 1989 A tutorial on hidden Markov models and selectedapplications in speech recognition Proc IEEE 77(2) 257ndash286

Radwan J Babik W 2012 The genomics of adaptation Proc R Soc Lond BBiol Sci 279(1749) 5024ndash5028

Rannala B Reeve JP 2001 High-resolution multipoint linkage-disequilibrium mapping in the context of a human genome se-quence Am J Hum Genet 69(1) 159ndash178

Rannala B Reeve JP 2002 Joint Bayesian estimation of mutationlocation and age using linkage disequilibrium Biocomputing2003 526ndash534

Segurel L Wyman MJ Przeworski M 2014 Determinants of mutationrate variation in the human germline Annu Rev Genomics HumGenet 1547ndash70

Simoons FJ 1970 Primary adult lactose intolerance and the milkinghabit a problem in biologic and cultural interrelations Am J DigDis 15(8) 695ndash710

Skoglund P Malmstrom H Omrak A Raghavan M Valdiosera CGunther T Hall P Tambets K Parik J Sjogren K-G et al 2014Genomic diversity and admixture differs for Stone-AgeScandinavian foragers and farmers Science 344(6185) 747ndash750

Slatkin M 2001 Simulating genealogies of selected alleles in a populationof variable size Genet Res 78(01) 49ndash57

Slatkin M 2008 A Bayesian method for jointly estimating allele age andselection intensity Genet Res 90(01) 129ndash137

Slatkin M Hudson RR 1991 Pairwise comparisons of mitochondrialDNA sequences in stable and exponentially growing populationsGenetics 129(2) 555ndash562

Slatkin M Rannala B 2000 Estimating allele age Annu Rev GenomicsHum Genet 1(1) 225ndash249

Smith JM Haigh J 1974 The hitch-hiking effect of a favourable geneGenet Res 23(01) 23ndash35

Stinchcombe JR Hoekstra HE 2008 Combining population genomicsand quantitative genetics finding the genes underlying ecologicallyimportant traits Heredity 100(2) 158ndash170

Tang H Siegmund DO Shen P Oefner PJ Feldman MW 2002Frequentist estimation of coalescence times from nucleotide se-quence data using a tree-based partition Genetics 161(1) 447ndash459

Tavare S Balding DJ Griffiths RC Donnelly P 1997 Inferring coalescencetimes from DNA sequence data Genetics 145(2) 505ndash518

Tennessen JA Bigham AW OrsquoConnor TD Fu W Kenny EE Gravel SMcGee S Do R Liu X Jun G et al 2012 Evolution and functionalimpact of rare coding variation from deep sequencing of humanexomes Science 337(6090) 64ndash69

Thomson R Pritchard JK Shen P Oefner PJ Feldman MW 2000Recent common ancestry of human Y chromosomes evidencefrom DNA sequence data Proc Natl Acad Sci U S A 97(13)7360ndash7365

Tishkoff SA Reed FA Ranciaro A Voight BF Babbitt CC Silverman JSPowell K Mortensen HM Hirbo JB Osman M et al 2007

Smith et al doi101093molbevmsy006 MBE

1016

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Convergent adaptation of human lactase persistence in Africa andEurope Nat Genet 39(1) 31ndash40

Troelsen JT Olsen J Moslashller J SjeuroOstreuroOm H 2003 An upstream polymor-phism associated with lactase persistence has increased enhanceractivity Gastroenterology 125(6) 1686ndash1694

Varin C Reid N Firth D 2011 An overview of composite likelihoodmethods Stat Sin 21(1) 5ndash42

Voight BF Kudaravalli S Wen X Pritchard JK Hurst L 2006 A map ofrecent positive selection in the human genome PLoS Biol 4(3) e72

Wegmann D Kessner DE Veeramah KR Mathias RA Nicolae DL YanekLR Sun YV Torgerson DG Rafaels N Mosley T et al 2011Recombination rates in admixed individuals identified by ancestry-based inference Nat Genet 43(9) 847ndash853

Wilde S Timpson A Kirsanow K Kaiser E Kayser M Unterleuroander MHollfelder N Potekhina ID Schier W Thomas MG et al 2014 Directevidence for positive selection of skin hair and eye pigmentation inEuropeans during the last 5000 years Proc Natl Acad Sci U S A111(13) 4832ndash4837

Williamson SH Hernandez R Fledel-Alon A Zhu L Nielsen RBustamante CD 2005 Simultaneous inference of selection and pop-ulation growth from patterns of variation in the human genomeProc Natl Acad Sci U S A 102(22) 7882ndash7887

Wiuf C 2000 On the genealogy of a sample of neutral rare alleles TheorPopul Biol 58(1) 61ndash75

Wiuf C Donnelly P 1999 Conditional genealogies and the age of aneutral mutant Theor Popul Biol 56(2) 183ndash201

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1017

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022

Convergent adaptation of human lactase persistence in Africa andEurope Nat Genet 39(1) 31ndash40

Troelsen JT Olsen J Moslashller J SjeuroOstreuroOm H 2003 An upstream polymor-phism associated with lactase persistence has increased enhanceractivity Gastroenterology 125(6) 1686ndash1694

Varin C Reid N Firth D 2011 An overview of composite likelihoodmethods Stat Sin 21(1) 5ndash42

Voight BF Kudaravalli S Wen X Pritchard JK Hurst L 2006 A map ofrecent positive selection in the human genome PLoS Biol 4(3) e72

Wegmann D Kessner DE Veeramah KR Mathias RA Nicolae DL YanekLR Sun YV Torgerson DG Rafaels N Mosley T et al 2011Recombination rates in admixed individuals identified by ancestry-based inference Nat Genet 43(9) 847ndash853

Wilde S Timpson A Kirsanow K Kaiser E Kayser M Unterleuroander MHollfelder N Potekhina ID Schier W Thomas MG et al 2014 Directevidence for positive selection of skin hair and eye pigmentation inEuropeans during the last 5000 years Proc Natl Acad Sci U S A111(13) 4832ndash4837

Williamson SH Hernandez R Fledel-Alon A Zhu L Nielsen RBustamante CD 2005 Simultaneous inference of selection and pop-ulation growth from patterns of variation in the human genomeProc Natl Acad Sci U S A 102(22) 7882ndash7887

Wiuf C 2000 On the genealogy of a sample of neutral rare alleles TheorPopul Biol 58(1) 61ndash75

Wiuf C Donnelly P 1999 Conditional genealogies and the age of aneutral mutant Theor Popul Biol 56(2) 183ndash201

Estimating Time to the Common Ancestor doi101093molbevmsy006 MBE

1017

Dow

nloaded from httpsacadem

icoupcomm

bearticle35410034816405 by guest on 10 August 2022