Positive Selection during the Evolution of the Blood ... - CORE

17
Article Positive Selection during the Evolution of the Blood Coagulation Factors in the Context of Their Disease-Causing Mutations Pavithra M. Rallapalli, 1 Christine A. Orengo, 1 Romain A. Studer,* ,1 and Stephen J. Perkins* ,1 1 Department of Structural and Molecular Biology, University College London, London, United Kingdom *Corresponding author: E-mail: [email protected]; [email protected]. Associate editor: James McInerney Abstract Blood coagulation occurs through a cascade of enzymes and cofactors that produces a fibrin clot, while otherwise maintaining hemostasis. The 11 human coagulation factors (FG, FII–FXIII) have been identified across all vertebrates, suggesting that they emerged with the first vertebrates around 500 Ma. Human FVIII, FIX, and FXI are associated with thousands of disease-causing mutations. Here, we evaluated the strength of selective pressures on the 14 genes coding for the 11 factors during vertebrate evolution, and compared these with human mutations in FVIII, FIX, and FXI. Positive selection was identified for fibrinogen (FG), FIII, FVIII, FIX, and FX in the mammalian Primates and Laurasiatheria and the Sauropsida (reptiles and birds). This showed that the coagulation system in vertebrates was under strong selective pressures, perhaps to adapt against blood-invading pathogens. The comparison of these results with disease-causing mutations reported in FVIII, FIX, and FXI showed that the number of disease-causing mutations, and the probability of positive selection were inversely related to each other. It was concluded that when a site was under positive selection, it was less likely to be associated with disease-causing mutations. In contrast, sites under negative selection were more likely to be associated with disease-causing mutations and be destabilizing. A residue-by-residue comparison of the FVIII, FIX, and FXI sequence alignments confirmed this. This improved understanding of evolutionary changes in FVIII, FIX, and FXI provided greater insight into disease-causing mutations, and better assessments of the codon sites that may be mutated in applications of gene therapy. Key words: positive selection, coagulation, hemostasis, evolution. Introduction Blood coagulation involves a complex yet regulated cascade of over two dozen proteins in blood (Doolittle 2009). Most of these proteins are serine protease enzymes and circulate in blood as inactive zymogens waiting for an activation trigger such as proteolytic cleavage. For coagulation, this trigger is usually some form of vascular injury, followed by activation. In the classical waterfall model, each activated protein goes on to activate the next protein in a rapidly expanding cascade of reactions which quickly results in the local formation of a fibrin clot to seal the injury (Spronk et al. 2003). The 11 human coagulation factor proteins in blood are usually indi- cated by F and a Roman numeral and followed by a lowercase “a” to indicate their active form, namely FG, FII, FIII, FV, FVII, FVIII, FIX, FX, FXI, FXII, and FXIII (Kawthalkar 2013). FG is comprised of the , , and genes of fibrinogen (FG) (FGA, FGB, and FGG) while FXIII is produced from two genes F13A and F13B. Thus these 11 coagulation factor pro- teins are produced by 14 genes (table 1). The two central processes during coagulation are the conversion of prothrom- bin (FII) to thrombin (FIIa) that cleaves FG to form fibrin, followed by the polymerization of fibrin to form the fibrin clot. The classic waterfall cascade model involved three path- ways (intrinsic, extrinsic, and common) where the intrinsic pathway is first triggered upon injury through FXII and the extrinsic pathway is triggered by the exposure of intracellular tissue factor (FIII) to FVII in serum after which tissue factor binds to and activates FVII. Recent advances in molecular biology have revealed that the waterfall model does not prop- erly account for the roles of tissue factor and FVII (Broze 1995). In the revised waterfall model (fig. 1), thrombin gener- ation occurs in two phases. The initiation phase caused by tissue damage results in relatively low thrombin activation, followed by the amplification (propagation) phase where the bulk of activated thrombin is formed (Butenas et al. 2000). Although the classical model remains useful as a laboratory model of coagulation, the revised model is more effective and logical in laboratory-based screening for coagulation factor abnormalities in bleeding disorders. Many of the coagulation proteins are related to each other via gene duplications that occurred early in vertebrate evolu- tion between the appearance of protochordates and the jaw- less fish (Doolittle et al. 2008; Doolittle 2009). Two rounds of whole-genome duplications occurred at the beginning of ver- tebrates, and a third one occurred at the beginning of teleost fishes (Meyer and Van de Peer 2005). In all vertebrates during evolution, blood coagulation retained a central mechanism in which the generation of thrombin resulted in fibrin clot for- mation. During evolution, several coagulation factors that depend on others for their activity have been altered in a ß The Author 2014. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons. org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Open Access 3040 Mol. Biol. Evol. 31(11):3040–3056 doi:10.1093/molbev/msu248 Advance Access publication August 25, 2014 at UCL Library Services on November 21, 2014 http://mbe.oxfordjournals.org/ Downloaded from brought to you by CORE View metadata, citation and similar papers at core.ac.uk provided by UCL Discovery

Transcript of Positive Selection during the Evolution of the Blood ... - CORE

Article

Positive Selection during the Evolution of theBlood Coagulation Factors in the Context of TheirDisease-Causing MutationsPavithra M Rallapalli1 Christine A Orengo1 Romain A Studer1 and Stephen J Perkins1

1Department of Structural and Molecular Biology University College London London United Kingdom

Corresponding author E-mail rstuderebiacuk sperkinsuclacuk

Associate editor James McInerney

Abstract

Blood coagulation occurs through a cascade of enzymes and cofactors that produces a fibrin clot while otherwisemaintaining hemostasis The 11 human coagulation factors (FG FIIndashFXIII) have been identified across all vertebratessuggesting that they emerged with the first vertebrates around 500 Ma Human FVIII FIX and FXI are associated withthousands of disease-causing mutations Here we evaluated the strength of selective pressures on the 14 genes coding forthe 11 factors during vertebrate evolution and compared these with human mutations in FVIII FIX and FXI Positiveselection was identified for fibrinogen (FG) FIII FVIII FIX and FX in the mammalian Primates and Laurasiatheria and theSauropsida (reptiles and birds) This showed that the coagulation system in vertebrates was under strong selectivepressures perhaps to adapt against blood-invading pathogens The comparison of these results with disease-causingmutations reported in FVIII FIX and FXI showed that the number of disease-causing mutations and the probability ofpositive selection were inversely related to each other It was concluded that when a site was under positive selection itwas less likely to be associated with disease-causing mutations In contrast sites under negative selection were more likelyto be associated with disease-causing mutations and be destabilizing A residue-by-residue comparison of the FVIII FIXand FXI sequence alignments confirmed this This improved understanding of evolutionary changes in FVIII FIX and FXIprovided greater insight into disease-causing mutations and better assessments of the codon sites that may be mutatedin applications of gene therapy

Key words positive selection coagulation hemostasis evolution

IntroductionBlood coagulation involves a complex yet regulated cascadeof over two dozen proteins in blood (Doolittle 2009) Most ofthese proteins are serine protease enzymes and circulate inblood as inactive zymogens waiting for an activation triggersuch as proteolytic cleavage For coagulation this trigger isusually some form of vascular injury followed by activation Inthe classical waterfall model each activated protein goes onto activate the next protein in a rapidly expanding cascade ofreactions which quickly results in the local formation of afibrin clot to seal the injury (Spronk et al 2003) The 11human coagulation factor proteins in blood are usually indi-cated by F and a Roman numeral and followed by a lowercaseldquoardquo to indicate their active form namely FG FII FIII FV FVIIFVIII FIX FX FXI FXII and FXIII (Kawthalkar 2013) FG iscomprised of the and genes of fibrinogen (FG)(FGA FGB and FGG) while FXIII is produced from twogenes F13A and F13B Thus these 11 coagulation factor pro-teins are produced by 14 genes (table 1) The two centralprocesses during coagulation are the conversion of prothrom-bin (FII) to thrombin (FIIa) that cleaves FG to form fibrinfollowed by the polymerization of fibrin to form the fibrinclot The classic waterfall cascade model involved three path-ways (intrinsic extrinsic and common) where the intrinsicpathway is first triggered upon injury through FXII and the

extrinsic pathway is triggered by the exposure of intracellulartissue factor (FIII) to FVII in serum after which tissue factorbinds to and activates FVII Recent advances in molecularbiology have revealed that the waterfall model does not prop-erly account for the roles of tissue factor and FVII (Broze1995) In the revised waterfall model (fig 1) thrombin gener-ation occurs in two phases The initiation phase caused bytissue damage results in relatively low thrombin activationfollowed by the amplification (propagation) phase where thebulk of activated thrombin is formed (Butenas et al 2000)Although the classical model remains useful as a laboratorymodel of coagulation the revised model is more effective andlogical in laboratory-based screening for coagulation factorabnormalities in bleeding disorders

Many of the coagulation proteins are related to each othervia gene duplications that occurred early in vertebrate evolu-tion between the appearance of protochordates and the jaw-less fish (Doolittle et al 2008 Doolittle 2009) Two rounds ofwhole-genome duplications occurred at the beginning of ver-tebrates and a third one occurred at the beginning of teleostfishes (Meyer and Van de Peer 2005) In all vertebrates duringevolution blood coagulation retained a central mechanism inwhich the generation of thrombin resulted in fibrin clot for-mation During evolution several coagulation factors thatdepend on others for their activity have been altered in a

The Author 2014 Published by Oxford University Press on behalf of the Society for Molecular Biology and EvolutionThis is an Open Access article distributed under the terms of the Creative Commons Attribution License (httpcreativecommonsorglicensesby40) which permits unrestricted reuse distribution and reproduction in any medium provided the original work isproperly cited Open Access3040 Mol Biol Evol 31(11)3040ndash3056 doi101093molbevmsu248 Advance Access publication August 25 2014

at UC

L L

ibrary Services on Novem

ber 21 2014httpm

beoxfordjournalsorgD

ownloaded from

brought to you by COREView metadata citation and similar papers at coreacuk

provided by UCL Discovery

complex fashion starting from the first vertebrates Sequenceanalyses have revealed the order in which the factors evolved(Doolittle 2009) There is considerable interest in the evolu-tionary development of the complexity of coagulation inmammals This is driven by the importance of understandingpathogenic disease-causing mutations in humans as well asunderstanding how a well-regulated cascade of enzymaticreactions is developed and obtaining new insights into itsmolecular mechanism Analyses of the known mutations inpatients and comparison with the mutations toleratedduring evolution will clarify which codons are stable andwhich are not

The coagulation system overlaps with the innate immunesystem and the complement proteins through their commonproperties involving vascular permeability Deficiencies in thecoagulation proteins mostly due to genetic variations areassociated with a spectrum of genetic disorders that range

from life-threatening ones such as severe Hemophilia A (as-sociated with FVIII) to milder variants (table 1) Hemophilia Aand B are more common than the others while some are rareand these diseases prevail because their underlying geneticmutations are passed on from generation to generationReplacement therapies using recombinant coagulation pro-teins are expensive More recently FIX gene therapy trials forHemophilia B patients have been successful (Nathwani et al2011) and this method is yet to be applied to FVIII and othercoagulation proteins A major problem faced in FIX genetherapy is the low level of protein expression in part due toa different codon usage in the organism that produces theprotein (Thomas et al 2003) An evolution-driven study ofthe codons in the FIX gene may help identify which codonscould be altered to increase protein expressions in the pro-ducing organism and in the meantime be tolerated in thehost

Table 1 The 14 Coagulation Genes and Their Protein Products

Gene ENSEMBL ID Protein Product Other Name Function Genetic Disorder Amino AcidLength

No ofDomains

Domain Organization

FGA ENSG00000171560 CoagulationFactor I

Fibrinogen alphachain

Forms fibrin clotcofactor in

plateletaggregation

Congenital afibro-genmia famil-

ial renalamyloidosis

866 1 Fibrinogen C-terminaldomain

FGB ENSG00000171564 Fibrinogen betachain

491 1 Fibrinogen C-terminaldomain

FGG ENSG00000171557 Fibrinogen gammachain

453 1 FG C-terminal domain

F2 ENSG00000180210 Coagulationfactor II

Prothrombin Activates FG tofibrin

Thrombophilia 622 4 GlandashKringle 1ndashKringle2ndashSP

F3 ENSG00000117525 Coagulationfactor III

Tissue factor Activates FVIIa mdash 295 3 Transmembranendashtrans-membrane helixndashtransmembrane

F5 ENSG00000198734 Coagulationfactor V

Proaccelerin Combines withFX to activateprothrombin

Activated proteinC resistance

2224 6 F58 type A1ndashF58 typeA2ndashBndashF58 type A3ndash

F58 type C1ndashF58type C2

F7 ENSG00000057593 Coagulationfactor FVII

Proconvertin Activates FIX FX Congenital pro-convertinfactor VIIdeficiency

466 4 GlandashEGF1ndashEGF2ndashSP

F8 ENSG00000185010 CoagulationFactor FVIII

Antihemophilicfactor A (AHF-

A)

Combines withFIX and FIV to

activate FX

Hemophilia A 2351 6 F58 type A1ndashF58 typeA2ndashBndashF58 type A3ndash

F58 type C1ndashF58type C2

F9 ENSG00000101981 Coagulationfactor FIX

Christmas factor Combines withFVIII and FIVto activate FX

Hemophilia B 461 4 GlandashEGF1ndashEGF2ndashSP

F10 ENSG00000126218 Coagulationfactor FX

Stuart-prowerfactor

Converts pro-thrombin to

thrombin

Congenital factorX deficiency

488 4 GlandashEGF1ndashEGF2ndashSP

F11 ENSG00000088926 Coagulationfactor FXI

Plasma thrombo-plastin

antecedent

Combines withFIV to activate

FIX

Factor XIdeficiency

625 4 Apple 1ndashapple 2ndashapple3ndashSP

F12 ENSG00000131187 Coagulationfactor FXII

Hageman factor Activates FXI ac-tivates plasmin

Hereditaryangioedema

type III

615 6 Fibronectin type-IIndashEGF1ndashfibronectin

type-I - EGF2ndashkrin-glendashSP

F13A ENSG00000124491 Coagulationfactor FXIII

Fibrin-stabilizingfactor A chain

Stabilizes fibrin Congenital factorXIIIa deficiency

732 mdash mdash

F13B ENSG00000143278 Fibrin-stabilizingfactor B chain

Stabilizes FXIIIAregulatesthrombin

Congenital factorXIIIb deficiency

661 10 Short complement regu-lator domains 1ndash10

3041

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

In order to interpret the effect of evolutionary and muta-tional changes in coagulation we have examined FVIII FIXand FXI For FVIII 5474 disease-causing mutations havebeen recently compiled (Rallapalli PM Kemball-Cook GTuddenham EG Gomez K Perkins SJ unpublished datahttpwwwfactorviii-dborg last accessed January 2014)3713 mutations were recently published for FIX (Rallapalliet al 2013 httpwwwfactorixorg last accessed January2014) and 487 mutations are known for FXI (Saunders et al2009 httpwwwfactorxiorg last accessed January 2014) Weidentified 47 full genomes in the Ensembl database thatshowed good sequence coverage This enabled the assessmentof positive or negative selection pressures (Yang 2006) in theevolution of coagulation protein sequences The identificationof positive selection in at least nine of the 11 coagulationproteins during different periods of evolution showed thatan effective coagulation pathway was under considerableadaptive constraints even in primates The mutational analy-ses for FVIII FIX and FXI showed that disease-causing muta-tions primarily affected highly conserved residues undernegative selection and damaged protein stabilities This jointevolutionary-mutational study provides novel clarifications ofthe observed data on pathogenic mutations and may facilitatenew gene therapy approaches for their treatments

Results

Selection of 47 Genomes

A data set of 14 coagulation factor genes (table 1) that codefor 11 proteins across different groups of vertebrate genomes(fig 2) was identified from the Ensembl database TheEnsembl database currently holds 78 genomes Of theseonly 47 vertebrate genomes were chosen based on their se-quence coverage and assembly quality (fig 2) Ensembl genesets are built from assemblies of DNA sequences and proteininformation hence it was important to select only good-quality assemblies for the gene trees and sequence data setsused in this study The 47 vertebrate genomes were classifiedinto five major clades based on the Ensembl classificationnamely Primates Glires Laurasiatheria Sauropsida andFishes The mammalian clade of 30 genomes consisted ofthree clades used in our study (Primates Glires andLaurasiatheria) and other clades (Atlantogenata andAustralian mammals) (fig 2) The codon substitution site-models M1a M2a M7 M8 and M8a from CodeML(Materials and Methods) were used to calculate the selectionpressures on the five major clades each with at least sevensequences to ensure that the CodeML calculations were rea-sonably powerful (Anisimova et al 2002) By following this

FIG 1 Schema of the blood coagulation pathway leading to fibrin The relationships between the 11 coagulation factors that are coded by 14 genes areshown in the modern revised coagulation pathway (blue enzymes red cofactors) Tissue factor (TF also known as FIII) initiates coagulation when itbinds and activates FVII The activated TF-FVIIa complex then activates FX which then generates activated thrombin (FIIa) FVIII FIX and FXI enable theamplification of the coagulation pathway to maximize FIIa production and possess the largest number of pathogenic human mutations Activatedthrombin cleaves FG (FI green) to form fibrin polymers (blood clots) that are cross-linked by FXIIIa

3042

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

criterion the Atlantogenata clade with only five sequenceswas eliminated from the selection pressure calculations Themultiple sequence alignments from Ensembl were realignedusing a strict quality control procedure (supplementary figS1 Supplementary Material online) which was performed inorder to minimize the number of false positive results below

Presence and Absence of Coagulation Factors

The 47 genomes were searched for the 14 genes coding forthe 11 coagulation factors In cases where the Ensembl genetree for a given coagulation factor did not show the presenceof a particular genome we searched for putative genes acrossthat genomic sequence to establish whether the organismhad lost them The purpose of the coagulation cascade is toconvert FG into the fibrin clot and unsurprisingly FG (theFGA FGB and FGG genes) was found in all 47 genomes Theother central genes thrombin (F2) tissue factor (F3) the

coagulation cascade cofactors (F5 and F8) F7 F9 and F10were present across all 47 genomes from fishes to humansF11 is present in all Sarcopterygii (tetrapods and coelacanth)but seems to have been lost or highly diverged in teleostfishes because it is only recorded in the spotted garLepisosteus oculatus which belongs to Holostei a sisterclade of teleost fishes F12 appeared in the ancestor of tetra-pods and is absent in fishes but present in amphibians(Xenopus tropicalis) reptiles (Anolis carolinensis) and mam-mals For F13 the alpha chain (F13A) was observed to appearfrom the time of vertebrates because this gene has beenfound in the sea lamprey fishes and tetrapods The betachain (F13B) appears to be present at the origin of vertebratesbecause F13B sequences have been recorded in cave fish(Astyanax mexicanus) and medaka (Oryzias latipes)Occasionally a gene was not reported this was most likelyto arise from incomplete or unidentified sequences

FIG 2 Phylogenetic tree of 47 vertebrate genomes The tree shows the 47 genomes studied here alongside their clade and taxonomic groups The 47vertebrate genomes showed good sequence quality and coverage in the Ensembl database and were grouped into five major clades (Primates GliresLaurasiatheria Sauropsida and Fishes) and three minor clades The Mammals comprised of the Primates Glires and Laurasiatheria clades together withAtlantogenata (African origin mammals and Xenartha) and four Australian species totaling 30 out of 47 genomes

3043

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Selective Pressures Using Codon SubstitutionSite-Models

Codon substitution site-models from the CodeML packagewere used to analyze the protein-coding sequences in the 14genes following the quality control of their alignments Themodels assume that selective pressure does not act equally onthe entire protein sequence but varies between amino acidsites in a gene family We calculated the log likelihood values(lnL) of five different selection models (M1a M2a M7 M8and M8a) in order to perform the likelihood ratio test (LRT)The five models differed from each other in the number ofparameters (np) considered and their variable values Thevalue is the ratio of nonsynonymous and synonymous sub-stitution rates (dNdS) (Materials and Methods) The compar-ison between a null model that does not consider 4 1(M1a M7 or M8a) and an alternate model that considers4 1 (M2a or M8) is a measure of positive selectionIf the null (neutral) model is rejected in favor of thealternate (selection) model positive selection is inferred(Yang 2006)

To evaluate the five models the LRT value was calculatedas follows

LRT frac14 2ethlnLalternate lnLnullTHORN

where lnL is the maximum-likelihood estimate of the proba-bilities of selection The LRT is asymptotically distributed as a2

k function where k (also known as the degrees of freedom) isthe difference between the np in the null and alternatemodels The np values were calculated along with the lnLvalues in CodeML

k frac14 npalternate npnull

The LRT values for which the 2k distribution shows a P

valuelt 005 indicated that some sites were significantlyunder positive selection For this study the comparison ofthe M8a null model with the M8 alternate model (referred toas M8a-M8) was considered optimal to infer positive selection(table 2) The LRT M1a-M2a is too conservative whereas M7ndashM8 was considered to be more problem prone and less ac-curate compared with M8a-M8 (Wong et al 2004 Studer andRobinson-Rechavi 2009b) Because the P values result frommultiple tests performed using different genes at differentclade levels it was necessary to control for the expected pro-portion of false positives False positives were accounted forby calculating the q value correction over the P value for thedifferent genes in each clade this approach being both pow-erful and specific (Studer et al 2008)

The overall picture is that a total of nine coagulation factorproteins (out of the 11 of the cascade) underwent positiveselection at one or more stages in evolution (fig 3) Thehighest number of genes under positive selection within anindividual clade is five (for vertebrates and mammals) ForVertebrates using the LRT M8a-M8 positive selection wasfound in as many as five of the 14 coagulation genes (table2) In figure 3a these included F2 F5 F7 F10 and F13A for all47 vertebrates In figure 3b these included FGA F3 F5 F8 andF9 for the 30 mammals In figure 3cndashe these included FGB F5

and F8 (Primates) FGA and F13B (Laurasiatheria) and F3(Sauropsida) FG (FG = FGA + FGB + FGG) and F5 were ob-served to undergo the most extensive positive selection com-pared with any other coagulation genes The selectivepressure on all the genes in different clades was identifiedfrom the LRT values (table 2) except for those of F11 inLaurasiatheria Sauropsida and Fishes F12 in Sauropsida andFishes and F13B in Fishes because these genes wereabsent (supplementary table S1 Supplementary Materialonline)

Estimation of Selective Pressures Using Branch-SiteModel of Codon Substitution

The above site models estimated selective pressures that varyamong amino acid positions across multiple species But se-lective pressures can also vary among species (Studer andRobinson-Rechavi 2009a) and codon substitution branch-site models have been developed to detect such episodicpositive selection (Zhang et al 2005) Here we have usedthe results of the branch-site model from two differentsources one from the Selectome database and the otherfrom our own calculations Both approaches employedCodeML on the two data sets of this study (Materials andMethods) in order to check for consistency with the positiveselection identified in the site-model calculations (fig 3)Similar to the site model the false positives in our branch-site model calculations were accounted for by calculating theq value correction over the P value for the different genes ineach clade For both branch-site calculations we followed thetaxonomic grouping in the Selectome database (supplemen-tary table S2b Supplementary Material online) The highestorder of classification is the Sarcopterygii (coelacanths lung-fish and tetrapods) followed by its largest subgroupTetrapoda (mammals sauropsids and frogs) Other majortaxonomic subtrees were those of amninots sauropsidsmammals theria and eutheria The Actinopterygii (fishes)and Sauropsida (birds) are common to both the calculationsThe Eutheria and mammalian subtree calculations from theSelectome database were considered to be equivalent to themammalian clade of our analyses (fig 2) Positive selectionacross the vertebrates was observed for all genes except FGGF3 F7 and F11 in our branch-site model calculations (supple-mentary table S2b Supplementary Material online) We ob-tained similar results with our site-model calculations(supplementary table S2a Supplementary Material online)except for the exclusion of F3 and F7 (fig 3a) Our branch-site calculations identified positive selection in mammalianFGB F2 F8 and F9 whereas the site model identified positiveselection in mammalian FGA F3 F5 F8 and F9 Overall FG(FGA FGB) F8 and F9 were consistently identified to be pos-itively selected in mammals whereas F5 shows positive selec-tion up to amniotes after which the mammals and thesauropsids clade split In conclusion the two branch-siteand the site models all suggested signatures of positive selec-tions across F8 and F9 whereas no such selection wasobserved across and between F11

3044

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Identification of Codon Sites under Positive Selection

For up to 11 genes out of 14 identified to be under positiveselection (table 3) further analyses were performed to identifywhich codon sites were under such selection This was doneusing the Bayes Empirical Bayes (BEB) method from the M8site-model (Yang et al 2005) The number of codons poten-tially under positive selection (BEB 450) in the positivelyselected genes ranged between 3 and 64 But the number ofcodons accurately predicted under positive selection (BEB495) is actually much lower (between 0 and 4) The FGAgene in Laurasiatheria showed the highest percentage of pos-itively selected codons (1329 as estimated by CodeML and64 codons detect with two at BEB495) followed by the F5gene in Primates (1321 with 54 codons all BEBlt95 ) Thesmallest percentage of positively selected codons was ob-served in the F10 gene of Vertebrates (001 with threecodons all with BEB lt95) No positive selection was ob-served in any of the codon sites for the FGG F11 and F12genes in the taxa groups of this study

Table 2 LRT Statistics for the Site Model Comparisons of Model 8with Model 8a for the 14 Coagulation Factors among the Five Clades

Gene Clade No ofSequences

No ofCodons

(M8a-M8)

LRT P value q value

FGA Vertebrates 44 236 139 024 035Mammals 29 288 775 001 004Primates 7 865 335 007 016

Glires 7 670 289 009 019Laurasiatheria 9 267 728 001 004

Sauropsida 6 464 033 057 062Fishes 7 592 016 100 072

FGB Vertebrates 40 429 145 023 035Mammals 26 456 278 100 072Primates 6 488 744 001 004

Glires 6 471 069 041 051Laurasiatheria 8 479 144 023 035

Sauropsida 5 473 002 100 072Fishes 7 476 099 100 072

FGG Vertebrates 44 386 174 100 072Mammals 29 427 319 007 017Primates 7 433 398 005 013

Glires 7 433 344 006 016Laurasiatheria 8 432 009 076 072

Sauropsida 5 427 003 086 072Fishes 8 395 542 100 072

F2 Vertebrates 38 475 1691 000 000Mammals 23 571 464 003 011Primates 7 621 425 004 012

Glires 5 611 007 100 072Laurasiatheria 7 620 158 021 034

Sauropsida 5 593 368 005 014Fishes 8 562 005 100 072

F3 Vertebrates 47 166 209 015 027Mammals 26 240 1337 000 000Primates 7 295 001 090 072

Glires 6 287 307 008 018Laurasiatheria 7 292 097 032 044

Sauropsida 5 254 865 000 002Fishes 14 192 082 036 047

F5 Vertebrates 37 1182 1495 000 000Mammals 22 1741 3192 000 000Primates 6 2116 536 002 008

Glires 6 1847 048 049 056Laurasiatheria 6 2043 483 003 010

Sauropsida 5 1550 145 023 035Fishes 8 1534 212 015 027

F7 Vertebrates 34 330 593 001 007Mammals 29 344 306 008 018Primates 7 443 016 069 072

Glires 7 417 03 100 072Laurasiatheria 7 443 057 045 054

Sauropsida 3 425 055 046 054Fishes NA

F8 Vertebrates 38 1028 485 003 010Mammals 24 1985 1119 000 001Primates 7 2351 683 001 004

Glires 5 2257 395 005 013Laurasiatheria 7 2344 025 062 067

Sauropsida 5 1480 003 086 072Fishes 7 1413 09 034 045

F9 Vertebrates 44 329 063 043 052Mammals 24 443 1857 000 000Primates 7 461 128 026 036

Glires 5 468 185 017 030

(continued)

Table 2 Continued

Gene Clade No ofSequences

No ofCodons

(M8a-M8)

LRT P value q value

Laurasiatheria 7 459 434 004 012Sauropsida 5 435 003 100 072

Fishes 13 368 036 100 072

F10 Vertebrates 41 353 1455 000 000Mammals 26 424 023 063 067Primates 7 485 001 091 072

Glires 6 464 206 015 027Laurasiatheria 7 463 23 013 025

Sauropsida 5 385 003 087 072Fishes 8 391 071 040 051

F11 Vertebrates 23 614 034 056 062Mammals 23 614 034 056 062Primates 7 625 005 082 072

Glires 5 619 007 079 072Laurasiatheria 6 625 002 089 072

Sauropsida NAFishes NA

F12 Vertebrates 27 512 398 100 072Mammals 25 533 018 100 072Primates 7 615 002 100 072

Glires 6 557 007 100 072Laurasiatheria 8 513 46 003 011

Sauropsida NAFishes NA

F13A Vertebrates 47 631 907 000 002Mammals 26 684 258 011 022Primates 7 732 162 020 034

Glires 6 730 147 100 072Laurasiatheria 9 730 027 100 072

Sauropsida 5 732 001 092 072Fishes 14 643 259 011 022

F13B Vertebrates 31 609 13 025 036Mammals 25 640 157 021 034Primates 7 660 122 027 037

Glires 5 634 021 100 072Laurasiatheria 7 656 708 001 004

Sauropsida 5 573 093 100 072Fishes NA

NOTEmdashPositive selection is denoted in bold in column 2

Plt 005 Plt 001

3045

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Disease-Causing Missense Mutations and Relationshipto Evolutionary Pressures

The relationship between selective pressure and the proba-bility of disease-causing mutations at a codon site is of greatinterest Disease-causing missense mutations have been re-ported in all 14 coagulation genes in particular forHemophilia A (F8) Hemophilia B (F9) factor XI deficiency(F11) and thrombophilia (F2) (table 1) Extensive compila-tions of 487ndash5474 disease-causing mutations are available forthree of these coagulation factors FVIII FIX and FXI(Saunders et al 2009 Rallapalli et al 2013 Rallapalli et al2014 [unpublished data]) There is a sufficient number ofpathogenic mutation and evolutionary selection data fromthe FVIII FIX and FXI proteins to enable us to compare instatistical detail the correlation between selective pressureand disease-causing mutations

For the comparison of the FVIII FIX and FXI proteinchanges selective pressures were identified using the BEB in-ference method from model M2a (fig 4) While powerful a

disadvantage of Model M8 is that it classifies sites into 11categories (ten in negativeneutral and one positive selec-tion) These 11 categories can vary between different genefamilies making the comparison difficult Thus model M2a isadvantageous in that this has only three distinct categoriesnamely negative selection (dNdSlt 1) neutral evolution (dNdS = 1) and positive selection (dNdS4 1) Under the M1a-M2a test in the primates clade two genes (F8 and F9) showedpositive selection whereas F11 did not show positive selec-tion which differs from that shown in figure 3c which is basedon the M8 model The posterior probabilities of neutral neg-ative and positive selection for the FVIII FIX and FXI proteinsshowed that positive selection occurred in specific codonsthe most being for F8 and the fewest for F11 For completionthe same analysis for the other 11 genes is shown in supple-mentary table S2 Supplementary Material online These showsimilar outcomes to those for FVIII FIX and FXI

For each amino acid position in FVIII FIX and FXI wecompared its selective pressure (positive neutral or negative)

FIG 3 Positively selected genes of the coagulation cascade in vertebrates mammals and three major clades The layout of this figure is a simplifiedrepresentation of figure 1 Genes showing significant positive selection in the LRT calculations are highlighted in red (a) Positive selection was observedin five genes when all 47 genome sequences (vertebrates) were analyzed as a single group (b) In the Mammals group (fig 2) where positive selection isseen for five genes only the (c) Primates and (d) Laurasiatheria clades are shown with positive selection for two or three genes only because no positiveselection was seen in the Glires clade whereas insufficient sequences were available in the Atlantogenata clade for conclusions to be drawn (e) Positiveselection was seen in the Sauropsida clade but no positive selection was observed in Fishes (fig 2)

3046

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

with the reported number of disease-causing mutations Theoutcome was presented using regression plots (fig 5)The regression plots and their correlation coefficients revealedthe relationship between the selection probability values atamino acid sites with the disease-causing mutations at thosesites The average probability of a site to be under positiveselection and its number of disease-causing mutationsshowed a negative relationship with correlation coefficientsof085067 and095 in FVIII FIX and FXI respectively(fig 5 top row) The correlation coefficients for neutral selec-tion followed that of positive selection with values of 087064 and 096 for FVIII FIX and FXI respectively (fig 5middle row) In distinction to these a positive relationshipwas seen between the average probability of negative selec-tion and the disease-causing mutations with correlation co-efficients of +086 +066 and +095 for FVIII FIX and FXIrespectively (fig 5 bottom row) To summarize the fewer thenumber of disease mutations at a given site the higher is theprobability for positive selection or neutral evolution (fig 5

top row and middle row) The greater the number of disease-causing mutations at a given site the higher the probabilityfor negative selection (fig 5 bottom row) The comparisonbetween the positively selected proteins FVIII and FIX withthe nonpositively selected protein FXI is noteworthyIrrespective of the selective pressure all three proteins ex-hibited similar regression lines in each of the three rowsThese comparisons show that there is a general inverse pro-portionality relationship between the probability of positiveselection (or neutral evolution) and the number of disease-causing mutations at each site This confirms previous obser-vations that disease-causing mutations tend to occur at crit-ical sites in the protein and that these highly conserved sitesare less tolerant to new mutations (negative selection)

Stability Effect of Mutations

In order to explain the consequences of disease-causing mu-tations on the protein structure FoldX software was used to

Table 3 Positively Selected Sites in the M8a-M8 Comparison

Gene Clade No ofCodons

Analyzed

No of Codonsunder positive

selection

PercentageEstimated byCodeML ()

Sites under Positive Selection (BEB4 50)

FGA Mammals 288 13 717 71832353951101108178212219246273

FGA Laurasiatheria 267 64 1329 467121929303336405293949697100102103108109116117118123136138141143155171180182195208210213214216218219220222223225227229231232233234236239242244245250253254

256259 260261264265

FGB Primates 488 17 822 394155808182130134135137173290367388413415448

F2 Vertebrates 475 5 099 5257151162381

F3 Mammals 240 11 663 55727691115177179185195230240

F3 Sauropsida 254 13 1060 340414474798081110123170208249

F5 Vertebrates 1182 10 087 1032755745785936027779269951010

F5 Mammals 1741 54 272 237243660134389392656663667669 676680718720731745749765768782809819822824875

88089489590791191692893094596797698910161040107610801089113611391324147114791515

152215481564

F5 Primates 2116 54 1321 74052129180211336341408409434 59266070370772075478487787988889290794195997810331039

10591131116812051209121812201222123012311232123612441254126212651279128514981504

151915261667168218491918

F7 Vertebrates 330 12 522 56526688162167178179209258262

F8 Mammals 1985 27 398 7245281339342521667718765784785792807817889912113011931197

12201229128813081333168619671984

F8 Primates 2351 33 063 42275579181485086088692796496697998899610191071118012821312

136114141421143614381459152715821613162216501689173023402349

F9 Mammals 443 15 440 45100114140150177181199202206209270309348

F10 Vertebrates 353 3 001 14470

F13A Vertebrates 631 6 126 398489497502550613

F13B Laurasiatheria 656 18 152 5567378107341349350374443459461497522525539540624

P4 95 P4 99

3047

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

estimate the stability effect (G) of residue changes on thethree-dimensional structures of human FVIII (chains A and B)FIX and FXI (fig 6) FoldX was designed to predict the stabilityeffect of a mutation This is an empirical method that relies onthe estimation of physical parameters from more than 1000mutations in protein structures and has been evaluated with21 proteins (Schymkowitz et al 2005 Tokuriki et al 2007)First each amino acid at each position was replaced with eachof the other 19 amino acids in order to provide a baseline forcomparison As expected the prediction of all virtual muta-tions shows a distribution skewed toward destabilization (me-dians G range between 051 and 123 kcalmol) Next theeffect of changing the human amino acids with those seen inevolution for the other 46 sequences of this study was calcu-lated in order to examine the effect of evolutionary change onprotein stabilities This produced a narrow distribution pre-dominantly centered on a difference of stability G of0 kcalmol and showing a slight skew toward deleterious mu-tations (medians G range between 014 and 030 kcalmol) This outcome is as expected because evolutionarychanges in the amino acids are selected in order to preservethe stability of the protein structures Finally the effect ofchanging the human amino acids to those observed inunique pathogenic missense mutations was calculated Thisanalysis produced a clearly skewed distribution (mediansG range between 101 and 295 kcalmol) of G

values that ranged as high as 10 kcalmol (and even higherbut these higher values were clipped for easier visualization)this agrees with the fact that these mutations affect the pro-tein structure The first distribution for the virtual mutationsis intermediate between the latter two distributions asexpected

The same outcome of stability effects was also observed atthe individual amino acid level (fig 7) When the effect of 19amino acid replacements were computed at each position forFVIII FIX and FXI and the majority of replacements showed adestabilizing effect (410 kcalmol in red) A small propor-tion of replacements stabilized the structure (lt10 kcalmol in blue) indicating the ability of the protein to compen-sate deleterious mutations by further mutation Although thisappears to occur more strongly for the FVIII chains A and B(fig 7a and b) the proportion of allowed replacements is infact similar for all four proteins when their different sizes areconsidered When the effect of replacements by disease-caus-ing missense mutations was calculated and the G valuesshowed that these almost invariably destabilized the threeprotein structures

DiscussionIn this study we have assembled sequence data for 14 coag-ulation genes across 47 genomes in order to analyze the evo-lutionary changes in amino acids that take place between five

FIG 4 Probability of evolutionary selection across the Primates clade for the three proteins FVIII FIX and FXI The probabilities of negative neutral andpositive selection were plotted against the amino acid position in the three proteins The probability values were obtained from CodeML analyses of theseven primate sequences (gray negative selection yellow neutral selection green positive selection)

3048

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

major clades and several minor ones These changes werecompared directly with disease-causing mutation data forthree coagulation proteins FVIII FIX and FXI Good-qualitypathogenic mutational data are publicly available for thesethree proteins and the bleeding disorders caused by thesethree sets of mutational data correspond to the three mostprevalent bleeding disorders The outcome of these detailedanalyses confirms that positive selection occurred during dif-ferent stages in the evolution of several coagulation proteinsIt was also observed when comparing the protein structurestabilities that the pathogenic mutations have a destabilizingeffect on the protein structure as opposed to the stable mu-tations that are tolerated during evolution (fig 6) The com-bination of positive selection analyses with those forpathogenic mutations provides information about predictingdisease causing mutations and clarifies fresh approaches fortherapeutic interventions in bleeding disorders

Selective Pressures in Vertebrate Evolution

Evolutionary changes combine random mutational eventswith natural selection The probability of a mutation to befixed in a population depends on its consequence on thefitness of an organism New adaptive mutations that are ben-eficial in terms of fitness are more likely to be retained bypositive selection than neutral mutation Neutral mutations

that are neither beneficial nor detrimental are randomly re-tained or removed by genetic drift depending on the popu-lation size (neutral selection) Detrimental mutations aremore likely to be removed from the population by negative(or purifying) selection There are no strict patterns and adeleterious mutation can sometimes be beneficial to oneaspect of the organism The textbook example of this issickle-cell anaemia (drepanocytosis) which is a bleeding dis-order caused by the pGlu6Val mutation in the hemoglobin gene Mutations responsible for sickle-cell anaemia are gen-erally removed by natural selection In malaria-infested pop-ulations particularly in tropical or subtropical regions thesemutations are selected for in one allele because they offerpartial protection against the pathogen (Lopez et al 2010)The malaria-infected sickle cells inhibit infection by the ma-laria plasmodium in heterozygous cases (beneficial) but causesickle-cell anaemia in homozygous cases (deleterious)

Genes corresponding to proteins in direct contact with theenvironment are more likely to be under adaptive selectionThus genes involved in host-pathogen interactions are com-monly associated with positive selection Large-scale analysesrevealing positive selection in vertebrates or arthropods havebeen reported for genes involved with immunity (Nielsenet al 2005 Studer et al 2008 Montoya-Burgos 2011 Rouxet al 2014) In the present detailed study positive selection inseveral genes of the coagulation cascade was identified at

FIG 5 Relationship between disease-causing mutations and evolutionary pressures The correlation between evolutionary selection and the number oftimes a disease-causing mutation occurs at each amino acid position is shown In the columns (a) FVIII (b) FIX and (c) FXI correspond to the threecoagulation factors for which pathogenic mutational information is available in sufficient quantity from our three mutational databases The three rowscorrespond to the posterior probability (based on the BEB model of CodeML) of positive selection neutral evolution and negative selection forprimates

3049

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

different stages in evolution (fig 3) An earlier study identifiedpositive selection across six mammalian genomes (humanchimpanzee macaque mouse rat and dog) using LRTsbased on codon substitution models and false discoveryrates to evaluate multiple LRT comparisons (Kosiol et al2008) Although they found strong positive selection(Plt 005 and false discovery ratelt 005) in nine complementgenes only nominal positive selection (Plt 005) was identi-fied in four coagulation genes (F2 F12 TFPI PROC) none ofwhich were identified in figure 3b A limitation of this earlieranalysis may be the low number of six genomes used Notethat complement is related to coagulation in that the latterhas the ability to activate complement of innate immunitythat mediates host-pathogen interactions (Amara et al 2008)

Positive Selection of the Coagulation Proteins throughHost-Pathogen Interactions

As a consequence of the immune response blood coagula-tion is often exploited by pathogens for reason of infectiveand septic processes For example FG and FV are targeted bybacteria thus offering a straightforward explanation of posi-tive selection (fig 3) We discuss this in terms of FG and FIIfollowed by FIII and FVII and then FV and FX

By taking advantage of the increased number of availablegenome sequences positive selection has been identified inseveral coagulation genes during various stages in evolutionIn particular all three subunits FGA FGB and FGG of FG haveshown positive selection at periods during vertebrate evolu-tion according to both the site model and the branch-sitemodels Similarly prothrombin (FII) has shown positive selec-tion but only across all the vertebrate genomes and notacross any smaller clades Although the immune system isunder strong adaptive constraints because it needs to evolveconstantly against pathogens the coagulation system alsocomprises the first line of immune defense against bacteriato form a clot to block bacterial invasion (Sun 2006) Both FGand prothrombin (FII) are targeted by certain bacteria Forexample FG is targeted during the course of infection bycysteine protease a virulence factor of Streptococcus pyogenes(Matsuka et al 1999) and FG binding protein Efb ofStaphylococcus aureus Similarly FG-like proteins exist inAnopheles gambiae and are involved in the function of theimmune system against malaria or bacterial parasites (Dongand Dimopoulos 2009 Lombardo et al 2013) Thrombin in-teracts with staphylocoagulase a protein secreted by thehuman pathogen S aureus and activates prothrombin with-out proteolysis to form thrombin The resulting

FIG 6 The effect of amino acid replacements on the overall protein stability change G Data are shown for (a and b) FVIII A and B chains (c) FIX (d)FXI Three calculations were performed each being based on the sequences for which crystal structures are known (PDB codes 2R7E [FVIII] 2WPH [FIX]and 2F83 [FXI]) The FVIII A and B chains correspond to the N-terminal residues 19ndash760 and the C-terminal residues 1582ndash2351 that are observed in itscrystal structure The calculation represented by the dashed line indicates the distribution calculated for all 19 possible amino acid replacements Thatrepresented by the dotted line indicates the distribution calculated for all the evolutionary occurring replacements that were observed in our data set of47 genomes That represented by the continuous line indicates the distribution calculated for the disease-causing mutations from the FVIII FIX and FXImutation databases

3050

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

prothrombinmdashstaphylocoagulase complex binds to FG tocleave this into self-polymerizing fibrin This process is centralto the molecular pathology of S aureus endocarditis becausestaphylocoagulase bypasses physiological blood coagulationthrough the evasion of the recognition of prothrombin bycirculating thrombin inhibitors (Panizzi et al 2005)

Positive selection for FIII and FVII is attributed to a host-pathogen interaction with the herpes virus family These cellsurface molecules do not necessarily participate directly in animmune response but can still serve as a convenient gatewayfor pathogen entry (Vallender and Lahn 2004) Herpes virusespromote cellular surface infections by coevolving with theirhuman host indicating a mode by which pathogens exploitthe coagulation system (Sutherland et al 2012) A study of thehost-pathogen driven coevolution of herpes virus revealedthat this virus has existed from the time of mammals anddivided into the HSV-1 and HSV-2 types recently during

human evolution (Kolb et al 2013) These observations sug-gest that the pressure for FIII to evolve in early mammalsespecially through primates could be the reason for the pos-itive selection of FIII across the mammalian clade (fig 3) Wepropose that the FIIIndashFVII interaction is a significant drivingforce for positive selection in FVII

The FV and FVIII cofactors are homologous to each otherin their sequence and domain organization hence it is notsurprising to see similar selection pressures acting on them(fig 3c) A difference arises in the entire vertebrate genome(fig 3a) where FV shows positive selection whereas FVIII doesnot This difference between FV and FVIII across vertebratesmay result from the interaction of FV with Escherichia coliThe extracellular serine protease EspP from E coli cleaves FVto lead to its degradation and thus reduces the activity of thecoagulation cascade (Brunder et al 1997) The activated FVacofactor forms a FVaFXa complex with activated FX (fig 1)

FIG 7 The effect of amino acid replacements at the sequence level on the protein stability change G The residue stability changes are depicted ona seven point scale from highly destabilizing (red) to highly stabilizing (blue) for each of (a) FVIII chain A (b) FVIII chain B (c) FIX and (d) FXI Their PDBcodes are indicated in figure 6 In each panel the upper half shows the G values for all the possible 19 amino acid replacements at each residueposition that is part of the crystal structure and the lower half shows the G values for the disease-causing mutations taken from the mutationdatabases

3051

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

which is important for thrombin formation the most impor-tant step for fibrin clot formation Because both FVa and FXaare central to coagulation and are associated with bacterialproteolysis this offers an explanation for the signatures ofpositive selection in both proteins

FVIII FIX and FXIII are also associated with positive selec-tion however it is not clear what selective pressures are in-volved Although these proteins may also be targeted bypathogens no direct interactions have been reported so far

Absence of Positive Selection for FXI and FXIIFXII is unique in that it does not show positive selection eventhough it is involved in host-pathogen interactions with long-chain inorganic phosphatases (polyP) from microorganisms(Morrissey et al 2012) Small-chain polyP are produced in thehuman brain and by blood platelets Hence even during theabsence of pathogen-originated polyP the human host doesnot lack polyP and there is no pressure to evolve for this (Puyet al 2013) A F12 deficiency in patients does not cause bleed-ing disorders This means that the coagulation cascade can beinitiated even in the absence of F12 For similar reasons FXIshows no positive selection We hypothesize that this is be-cause first no interaction with the environment has beendemonstrated to date and second with respect to the polyPmechanism the coagulation cascade is capable of activatingin a FXI-independent manner

Relationship between Disease-Causing Mutations andSelection

The mechanisms of genetic drift mutational change andevolutionary change are closely related to each otherEvolutionary change is not only dependent on the geneticvariability from mutational change but also on genetic driftwhich describes the random fluctuations in the number ofgenetic variants in an organism Genetic drift and mutationsproduce random variations that selection can act upon Thenew changes incurred in one generation may be cancelled inanother generation It may therefore be important to deter-mine whether a new mutation is lost or becomes commonenough for selection to determine its fate (Masel 2011) In thisstudy we have examined the relationship between naturalselection acting on an amino acid and the disease-causingmutations reported at that amino acid position for thethree prominent coagulation factors FVIII FIX and FXI Wehave shown that the disease-causing mutations are morelikely to occur at a higher frequency at sites under negativeselection in all three proteins (fig 5) As sites under strongselective pressure are more likely to be critical to proteinfunction such a mutation can readily affect protein functionand trigger disease Lethal (or deleterious) mutations thatcause death are de facto not observed in our data set Bycontrast disease-causing mutations occur at a reduced fre-quency in sites under neutral or positive selection suggestingthat these sites are more tolerant to mutations than sitesunder negative selection Interestingly we have shown thatFVIII and FIX have undergone much positive selectionwhereas FXI has no such signatures (fig 4) Despite this dif-ference our results for three proteins show no change in the

relationship between evolutionary selection and disease-causing mutations (figs 5 and 6) In summary the reporteddisease-causing mutations show randomness and the muta-bility of a given amino acid position depends on the selectionpressure acting at that position and not on the entire protein

Effect of Disease-Causing Mutations on the Stability ofProtein Structure

Mutations that change the amino acid sequence can havestriking effects on protein stability Point mutations are re-lated to pathological and genetic conditions by reducing orabolishing coagulation protein function (table 1) Proteinfunction is related to its stability which is often measuredby the change in folding energy G of the protein structure(Stefl et al 2013) We have identified the effect of mutationson the stability of FVIII FIX and FXI in terms of G valuesThe stability effects of polymorphic amino acid changes inhuman sequences follow a Gaussian distribution (de Beeret al 2013) In that study disease-causing mutations alsoshowed the same distribution but slightly skewed towarddestabilization In this study the same trend for aminoacids substitutions in vertebrates were observed but this ismuch stronger for disease-causing mutations (fig 6) The dif-ference between the two studies may be explained by thedifferent methods used de Beer et al (2013) used the DiscreteOptimized Protein (DOPE) score of Modeller as a monitor ofglobal stability whereas we used FoldX (Materials andMethods) as an empirical method to infer directly the stabilityeffect G of a point mutation

Evolutionarily observed mutations are the most stable interms of their protein structure whereas disease-causing mu-tations destabilized the protein structure (figs 6 and 7) Eachsite may or may not tolerate a particular amino acid changedepending on the change in stability When each amino acidwas converted to each of the other 19 possible amino acids(fig 7 top panels) the G values showed that most aminoacids destabilize the structure whereas several may stabilize itThe disease-causing effect of a mutation may be affected notonly by the selection pressure acting on it but also by the typeof the amino acid replaced For the FVIII FIX and FXI bleedingdisorders destabilization was observed for most of the dis-ease-causing mutations

Utility of Evolutionary and Disease-CausingMutational Analyses for Clinical Diagnosis

Interactive locus-specific databases are useful tools for pre-senting patient mutational data for patient diagnosis andcare We plan to incorporate the above data sets on selectivepressures and stability effects into our mutational databasesfor FVIII FIX and FXI This information will provide insight forthe prediction of the effect of disease-causing mutations Thelevel of selective pressure was estimated under a maximum-likelihood framework Such methods are useful for geneticstudies but their results are sometimes difficult to interpretby a nonspecialist The presentation of this information needsto be intuitive to the viewer As shown in supplementaryfigure S3 Supplementary Material online the use of multiple

3052

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

sequence alignments and presentation of stability effectsG structural locations and selective pressures can helpin interpreting the impacts of the mutations This type ofinformation will therefore clarify the interpretation of newlydiscovered disease-causing mutations for clinicians evolu-tionary biologists and protein biochemists This combinedevolutionary-driven study of bleeding disorder-causing muta-tions is the most detailed of its kind as far as we are aware andmay be usefully extended to other biological processes andtheir proteins

Materials and Methods

Genomic Sequences

The flowchart of operations is summarized in supplementaryfigure S1 Supplementary Material online First the completemultiple sequence alignments were retrieved together withtheir gene trees for each of the 14 genes corresponding to the11 coagulation factors (table 1) from the Ensembl Comparadatabase (Vilella et al 2009) using their Ensembl IDs (table 1)and a Perl script Ensembl release 73 was used accessed onSeptember 30 2013 (Flicek et al 2013) These amino acidsequences came from Ensembl as they are generally ofgood quality and contained 74 vertebrate genomes in atotal of 80 genomes We then excluded any sequences thatcame from genomes of low coverage (2 or less) This re-sulted in 47 vertebrate genomes with alignment sizes be-tween 23 and 47 genomes These 47 sets of vertebratesequences provided sufficient genomic coverage to performphylogenetic analysis The 47 vertebrate organisms were clas-sified into five major clades (fig 2) namely Primates Glires(rodents rabbits and pika) Laurasiatheria Sauropsida (rep-tiles and birds) and Fishes (ray-finned fishes) together withminor clades including Atlantogenata and Australianmammals

Sequence Retrieval and Multiple Alignments

The multiple sequence alignments and gene trees fromEnsemblCompara included all the coagulation sequences indifferent species irrespective of their genome coverage Themultiple sequence alignments of the downloaded genes fromthe EnsemblCompara database were constructed using aglobal alignment algorithm provided within this databasefor significant number of sequences of arbitrary lengthsThe data set required for this study comprised 14 genesfrom 47 genomes A thorough quality control procedure ofthe multiple sequence alignments was performed in order togenerate minimum false positive results Thus sequences wereremoved if they showed 41 low-complexity regions orunknown nucleotides (indicated as ldquoNrdquo in a sequence)From these the phylogenetic trees were pruned usingNewick Utilities software (Junier and Zdobnov 2010) to iden-tify the tips corresponding to the retrieved species (fig 2) Thefinal alignments of these sequences were made with PRANKwhich is conservative in that it tends to align only amino acidsderived by substitution and introduces insertions and dele-tions (indels) instead of aligning sequence fragments thatappear too divergent (Loytynoja and Goldman 2008)

PRANK is an advanced probabilistic alignment algorithmthat incorporates phylogenetic information This has beenshown to reduce the number of false-positives when evalu-ating positive selection (Fletcher and Yang 2010 Loytynoja2014) All the PRANK alignments were made using defaultparameters for the empirical codon model and the prunedtrees as guides The resulting alignments are composed ofhighly conserved blocks surrounded by noise due to indelsFinally Gblocks (Castresana 2000) was used to remove thenoise in the unreliable regions of the multiple sequence align-ments which may cause false positives when testing for pos-itive selection to keep only the well-aligned parts The defaultGblocks parameters for the codontype model (parameter ndasht = c) were used with the exception of tolerating half gaps perposition (parameter -b5=h) This procedure resulted in cleanand accurate multiple sequence alignments for the final com-putations of the site and branch-site models using CodeMLtogether with their corresponding gene trees

Analyses of Evolutionary Pressure

For each of the 14 coagulation genes within each of the fivemajor clades the selective pressures were estimated using thedifferent codon substitution site models implemented inCodeML of the phylogenetic analysis with maximum-likeli-hood software (PAML release 47) (Yang 2007) By analyzingthe protein-coding genes at the nucleotide level the synon-ymous or silent changes (nucleotide substitutions that do notchange the encoded amino acid) are distinguished from thenonsynonymous or missense changes (substitutions thatchange the encoded amino acid) Because natural selectionoperates at the protein level and can take any direction ofpositive neutral or negative selection the synonymous andnonsynonymous substitutions are subjected to different se-lective pressures which occur at the rates dS and dN respec-tively Hence the ratio of nonsynonymoussynonymoussubstitution rates (= dNdS) measures the selective pressureat the protein level The value of reveals the direction andstrength of natural selection acting on the protein Values oflt 1 = 1 and 4 1 indicate negative selection neutralevolution and positive selection respectively If an existingfunction or phenotype is evolutionarily favorable then nega-tive selection (or purifying selection) favors the conservationof that function or phenotype (conservation of amino acids)The other is positive selection (or Darwinian selection) whichfavors the promotion of a new function or phenotypeNeutral evolution describes the tolerance to amino acid mu-tations which are neither deleterious nor beneficial and arefixed according to genetic drift

The following five models were used to account for thevariation of dNdS among codon sites M1a (two dNdS ratioswith negative selection [0lt 1] and neutral evolution[1 = 1]) M2a (three dNdS ratios with negative selection[0lt 1] neutral evolution [1 = 1] and positive selection[2 1]) M7 (beta) M8 (beta and 24 1) and M8a (betaand 2 = 1) The alignments contain several positions inwhich one of the sequences is missing and the ldquocleandata = 0rdquooption was selected to retain all sites The inference of positive

3053

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

selection was conducted by performing LRTs in which thefollowing two pairs of models were compared namely M1awith M2a and M8a with M8 The principle is to compare anull model that does not allow24 1 (M1a or M8a) with analternative model that does (M2a or M8) Because the alter-nate model has more parameters than the null model theLRT follows a 2 distribution The LRT is a powerful methodfor comparing two hypotheses where one is the special caseof the other The BEB method (Yang et al 2005) was used tocalculate the posterior probabilities for site classes and theBEB value was used to identify sites under positive selectionwhere the 2 distribution of the LRT was significant withPlt 005 As many different LRT comparisons (differentgenes different clades) were conducted a control for thefalse-discovery rate (expected proportion of false positives)was applied using the q value package in R (Storey andTibshirani 2003) We adjusted the parameters of the q valueto perform a bootstrap analysis (pi0meth = ldquobootstraprdquo) andto tolerate 10 of false positives in our significant results(ldquofdrlevel = 01rdquo) A bootstrap analysis is necessary as the Pvalue distribution is bimodal (Studer et al 2008)

Two sources for the branch-site model were used First theSelectome database at httpselectomeunilch (last accessedJanuary 2014) (Proux et al 2009 Moretti et al 2014) providesthe precomputed results of the branch-site model (supple-mentary table S2b Supplementary Material online) based onthe multiple sequence alignments from the EnsemblCompara database (Vilella et al 2009) The last release ofSelectome (release 6) contains an optimized procedure ofalignment refinement and quality control similar to theone used for our site-model analyses as described aboveThe branch-site model allows the dNdS value to vary notonly between sites but also between branches (Zhang et al2005) The advantage of this model is its ability to detecttraces of episodic positive selection (Studer and Robinson-Rechavi 2009a) Second a branch-site model was determinedbased on our data set (see above) We used the vertebratetrees and their multiple sequence alignments as inputs forbranch-site model calculations The multiple alignments weremanually rechecked to remove barren sequences and thecorresponding vertebrate trees were edited accordinglyusing Newick Utilities Each taxonomic unit also known asthe branch or node of the tree was tagged using an auto-mated python script The CodeML program was run on eachof these tagged trees from all the genes for this branch-sitemodel (supplementary table S2b Supplementary Materialonline) For some alignments we found misaligned short in-sertions up to a maximum of ten amino acids in the con-served blocks probably due to gene model prediction errorsIn our whole data set these errors affect 0ndash4 sequences peralignment As the F2 gene family contained most of thesesmall errors albeit at a very low level this was reanalyzed byremoving four sequences with potential problems from thealignment The recalculation of the likelihood ratio deviatedonly slightly from the initial results (P value of 393e5 for 38sequences vs P value of 217e4 for 34 sequences) This sug-gested that these errors would only have a minor impact ifany on the detection of positive selection especially as the

coagulation factor sequences were long and the model ofcodon substitutions was averaged over all columns (manyhundreds) and sequences (between 23 and 47)

Stability Effect of Mutations

When possible for each coagulation factor the molecularstructures of the domain(s) were retrieved directly fromthe CATH database (Sillitoe et al 2013) Prior to the muta-tional analyses each structure was repaired with the FoldXprogram (Schymkowitz et al 2005) to remove potential stericclashes FoldX utilizes an empirical force field model to esti-mate the stability of a protein (Gwildtype in kcalmol) FoldXwas used to estimate the stability effect on each residueposition when the residue was replaced by any of the 19other amino acids The stability effect is determined bythe difference in the wildtype and mutant energies(G = GmutantGwildtype) This resulted in a stabilitylandscape of all potential mutations including the Gvalues both for the disease mutations and for all the aminoacid substitutions observed across vertebrate evolution

Statistical Analyses

All statistical analyses were performed with R v302 (R CoreTeam 2013) and SigmaPlot (Systat Software San Jose CA)The pipeline was developed with Biopython scripts (Cocket al 2009) The visualization of the multiple sequence align-ments was performed using Jalview 28 (Waterhouse et al2009)

Data Availability

The final sequence alignments are provided on a webpagehttpwwwbiochemuclacukpavithraevolutioncoagulationindexphp (last accessed January 2014) Pathogenic mu-tation data for three human coagulation factors FVIII FIX andFXI were retrieved from our web databases at httpwwwfactorviii-dborg httpwwwfactorixorg and httpwwwfactorxiorg (last accessed January 2014) (Saunders et al2009 Rallapalli et al 2013)

Supplementary MaterialSupplementary tables S1ndashS2 and figures S1ndashS3 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

PMR and SJP thank Pfizer UK for a medical educationalgrant that generously funded this project and the SpecialTrustees of the Royal Free Hospital the KatharineDormandy Trust for Hemophilia and Related Disorders andthe European Association for Hemophilia and AssociatedDisorders for additional support The authors also thank DrKeith Gomez Dr Geoffrey Kemball-Cook and ProfessorEdward G Tuddenham for their generous collaborative sup-port of our studies including access to the FVIII mutations atthe Hemophilia A Database at httpwwwfactorviii-dborgRAS acknowledges funding from the Fondation du 450eme

3054

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

anniversaire de lrsquoUniversite de Lausanne and Swiss NationalScience Foundation grants 132476 and 136477

ReferencesAmara U Rittirsch D Flierl M Bruckner U Klos A Gebhard F Lambris

JD Huber-Lang M 2008 Interaction between the coagulation andcomplement system Adv Exp Med Biol 63271ndash79

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19950ndash958

Broze GJ Jr 1995 Tissue factor pathway inhibitor and the revised theoryof coagulation Annu Rev Med 46103ndash112

Brunder W Schmidt H Karch H 1997 EspP a novel extracellular serineprotease of enterohaemorrhagic Escherichia coli O157H7 cleaveshuman coagulation factor V Mol Microbiol 24767ndash778

Butenas S vanrsquot Veer C Cawthern K Brummel KE Mann KG 2000Models of blood coagulation Blood Coagul Fibrinolysis 11(Suppl 1)S9ndashS13

Castresana J 2000 Selection of conserved blocks from multiple align-ments for their use in phylogenetic analysis Mol Biol Evol 17540ndash552

Cock PJ Antao T Chang JT Chapman BA Cox CJ Dalke A Friedberg IHamelryck T Kauff F Wilczynski B et al 2009 Biopython freelyavailable Python tools for computational molecular biology andbioinformatics Bioinformatics 251422ndash1423

de Beer TA Laskowski RA Parks SL Sipos B Goldman N Thornton JM2013 Amino Acid changes in disease-associated variants differ rad-ically from variants observed in the 1000 genomes project datasetPLoS Comput Biol 91ndash15

Dong Y Dimopoulos G 2009 Anopheles fibrinogen-related proteinsprovide expanded pattern recognition capacity against bacteriaand malaria parasites J Biol Chem 2849835ndash9844

Doolittle RF 2009 Step-by-step evolution of vertebrate blood coagula-tion Cold Spring Harb Symp Quant Biol 7435ndash40

Doolittle RF Jiang Y Nand J 2008 Genomic evidence for a simplerclotting scheme in jawless vertebrates J Mol Evol 66185ndash196

Fletcher W Yang Z 2010 The effect of insertions deletions and align-ment errors on the branch-site test of positive selection Mol BiolEvol 272257ndash2267

Flicek P Ahmed I Amode MR Barrell D Beal K Brent S Carvalho-SilvaD Clapham P Coates G Fairley S et al 2013 Ensembl 2013 NucleicAcids Res 41D48ndashD55

Junier T Zdobnov EM 2010 The Newick utilitieshigh-throughput phy-logenetic tree processing in the UNIX shell Bioinformatics 261669ndash1670

Kawthalkar S 2013 Overview of Physiology of blood In essentials ofhaematology New Delhi (India) Jaypee Brothers Medical Publishers(P) Ltd p 1ndash51

Kolb AW Ane C Brandt CR 2013 Using HSV-1 genome phylogeneticsto track past human migrations PLoS One 81ndash9

Kosiol C Vinar T da Fonseca RR Hubisz MJ Bustamante CD Nielsen RSiepel A 2008 Patterns of positive selection in six Mammalian ge-nomes PLoS Genet 4e1000144

Lombardo F Ghani Y Kafatos FC Christophides GK 2013Comprehensive genetic dissection of the hemocyte immune re-sponse in the malaria mosquito Anopheles gambiae PLoS Pathog9e1003145

Lopez C Saravia C Gomez A Hoebeke J Patarroyo MA 2010Mechanisms of genetically-based resistance to malaria Gene 4671ndash12

Loytynoja A 2014 Phylogeny-aware alignment with PRANK MethodsMol Biol 1079155ndash170

Loytynoja A Goldman N 2008 Phylogeny-aware gap placement pre-vents errors in sequence alignment and evolutionary analysisScience 3201632ndash1635

Masel J 2011 Genetic drift Curr Biol 21R837ndashR838Matsuka YV Pillai S Gubba S Musser JM Olmsted SB 1999 Fibrinogen

cleavage by the Streptococcus pyogenes extracellular cysteine

protease and generation of antibodies that inhibit enzyme proteo-lytic activity Infect Immun 674326ndash4333

Meyer A Van de Peer Y 2005 From 2R to 3R evidence for a fish-specificgenome duplication (FSGD) Bioessays 27937ndash945

Montoya-Burgos JI 2011 Patterns of positive selection and neutral evo-lution in the protein-coding genes of Tetraodon and Takifugu PLoSOne 6e24800

Moretti S Laurenczy B Gharib WH Castella B Kuzniar A Schabauer HStuder RA Valle M Salamin N Stockinger H et al 2014 Selectomeupdate quality control and computational improvements to a data-base of positive selection Nucleic Acids Res 42D917ndashD921

Morrissey JH Choi SH Smith SA 2012 Polyphosphate an ancient mol-ecule that links platelets coagulation and inflammation Blood 1195972ndash5979

Nathwani AC Tuddenham EG Rangarajan S Rosales C McIntosh JLinch DC Chowdary P Riddell A Pie AJ Harrington C et al 2011Adenovirus-associated virus vector-mediated gene transfer in hemo-philia B N Engl J Med 3652357ndash2365

Nielsen R Bustamante C Clark AG Glanowski S Sackton TB Hubisz MJFledel-Alon A Tanenbaum DM Civello D White TJ et al 2005 Ascan for positively selected genes in the genomes of humans andchimpanzees PLoS Biol 3e170

Panizzi P Friedrich R Fuentes-Prior P Richter K Bock PE Bode W 2005Fibrinogen substrate recognition by staphylocoagulase(pro)throm-bin complexes J Biol Chem 131179ndash1187

Proux E Studer RA Moretti S Robinson-Rechavi M 2009 Selectome adatabase of positive selection Nucleic Acids Res 37D404ndashD407

Puy C Tucker EI Wong ZC Gailani D Smith SA Choi SH Morrissey JHGruber A McCarty OJ 2013 Factor XII promotes blood coagulationindependent of factor XI in the presence of long-chain polypho-sphates J Thromb Haemost 111341ndash1352

R Core Team 2013 R a language and environment for statistical com-puting Vienna Austria R Foundation for Statistical Computing

Rallapalli PM Kemball-Cook G Tuddenham EG Gomez K Perkins SJ2013 An interactive mutation database for human coagulationfactor IX provides novel insights into the phenotypes and geneticsof hemophilia B J Thromb Haemost 111329ndash1340

Roux J Privman E Moretti S Daub JT Robinson-Rechavi M Keller L2014 Patterns of positive selection in seven ant genomes Mol BiolEvol 311661ndash1685

Saunders RE Shiltagh N Gomez K Mellars G Cooper C Perry DJTuddenham EG Perkins SJ 2009 Structural analysis of eight noveland 112 previously reported missense mutations in the interactiveFXI mutation database reveals new insight on FXI deficiencyThromb Haemost 102287ndash301

Schymkowitz J Borg J Stricher F Nys R Rousseau F Serrano L 2005 TheFoldX web server an online force field Nucleic Acids Res 33W382ndashW388

Sillitoe I Cuff AL Dessailly BH Dawson NL Furnham N Lee D Lees JGLewis TE Studer RA Rentzsch R et al 2013 New functionalfamilies (FunFams) in CATH to improve the mapping of con-served functional sites to 3D structures Nucleic Acids Res 41D490ndashD498

Spronk HM Govers-Riemslag JW ten CH 2003 The blood coagulationsystem as a molecular machine Bioessays 251220ndash1228

Stefl S Nishi H Petukh M Panchenko AR Alexov E 2013 Molecularmechanisms of disease-causing missense mutations J Mol Biol 4253919ndash3936

Storey JD Tibshirani R 2003 Statistical significance for genomewidestudies Proc Natl Acad Sci U S A 1009440ndash9445

Studer RA Penel S Duret L Robinson-Rechavi M 2008 Pervasive pos-itive selection on duplicated and nonduplicated vertebrate proteincoding genes Genome Res 181393ndash1402

Studer RA Robinson-Rechavi M 2009a Evidence for an episodic modelof protein sequence evolution Biochem Soc Trans 37783ndash786

Studer RA Robinson-Rechavi M 2009b Large-scale analyses of positiveselection using codon models In Pontarotti P editor Evolutionarybiology Berlin (Germany) Springer-Verlag Berlin Heidelbergp 217ndash235

3055

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Sun H 2006 The interaction between pathogens and the host coagu-lation system Physiology 21281ndash288

Sutherland MR Ruf W Pryzdial EL 2012 Tissue factor and glycoproteinC on herpes simplex virus type 1 are protease-activated receptor 2cofactors that enhance infection Blood 1193638ndash3645

Thomas CE Ehrhardt A Kay MA 2003 Progress and problemswith the use of viral vectors for gene therapy Nat Rev Genet 4346ndash358

Tokuriki N1 Stricher F Schymkowitz J Serrano L Tawfik DS 2007 Thestability effects of protein mutations appear to be universally dis-tributed J Mol Biol 3691318ndash1332

Vallender EJ Lahn BT 2004 Positive selection on the human genomeHum Mol Genet 13R245ndashR254

Vilella AJ Severin J Ureta-Vidal A Heng L Durbin R Birney E 2009EnsemblCompara GeneTrees complete duplication-aware phyloge-netic trees in vertebrates Genome Res 19327ndash335

Waterhouse AM Procter JB Martin DM Clamp M Barton GJ 2009Jalview Version 2mdasha multiple sequence alignment editor and anal-ysis workbench Bioinformatics 251189ndash1191

Wong WS Yang Z Goldman N Nielsen R 2004 Accuracy and power ofstatistical methods for detecting adaptive evolution in proteincoding sequences and for identifying positively selected sitesGenetics 1681041ndash1051

Yang Z 2006 Computational molecular evolution Oxford UniversityPress Oxford

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 241586ndash1591

Yang Z Wong WS Nielsen R 2005 Bayes empirical Bayes inference ofamino acid sites under positive selection Mol Biol Evol 221107ndash1118

Zhang J Nielsen R Yang Z 2005 Evaluation of an improved branch-sitelikelihood method for detecting positive selection at the molecularlevel Mol Biol Evol 222472ndash2479

3056

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

complex fashion starting from the first vertebrates Sequenceanalyses have revealed the order in which the factors evolved(Doolittle 2009) There is considerable interest in the evolu-tionary development of the complexity of coagulation inmammals This is driven by the importance of understandingpathogenic disease-causing mutations in humans as well asunderstanding how a well-regulated cascade of enzymaticreactions is developed and obtaining new insights into itsmolecular mechanism Analyses of the known mutations inpatients and comparison with the mutations toleratedduring evolution will clarify which codons are stable andwhich are not

The coagulation system overlaps with the innate immunesystem and the complement proteins through their commonproperties involving vascular permeability Deficiencies in thecoagulation proteins mostly due to genetic variations areassociated with a spectrum of genetic disorders that range

from life-threatening ones such as severe Hemophilia A (as-sociated with FVIII) to milder variants (table 1) Hemophilia Aand B are more common than the others while some are rareand these diseases prevail because their underlying geneticmutations are passed on from generation to generationReplacement therapies using recombinant coagulation pro-teins are expensive More recently FIX gene therapy trials forHemophilia B patients have been successful (Nathwani et al2011) and this method is yet to be applied to FVIII and othercoagulation proteins A major problem faced in FIX genetherapy is the low level of protein expression in part due toa different codon usage in the organism that produces theprotein (Thomas et al 2003) An evolution-driven study ofthe codons in the FIX gene may help identify which codonscould be altered to increase protein expressions in the pro-ducing organism and in the meantime be tolerated in thehost

Table 1 The 14 Coagulation Genes and Their Protein Products

Gene ENSEMBL ID Protein Product Other Name Function Genetic Disorder Amino AcidLength

No ofDomains

Domain Organization

FGA ENSG00000171560 CoagulationFactor I

Fibrinogen alphachain

Forms fibrin clotcofactor in

plateletaggregation

Congenital afibro-genmia famil-

ial renalamyloidosis

866 1 Fibrinogen C-terminaldomain

FGB ENSG00000171564 Fibrinogen betachain

491 1 Fibrinogen C-terminaldomain

FGG ENSG00000171557 Fibrinogen gammachain

453 1 FG C-terminal domain

F2 ENSG00000180210 Coagulationfactor II

Prothrombin Activates FG tofibrin

Thrombophilia 622 4 GlandashKringle 1ndashKringle2ndashSP

F3 ENSG00000117525 Coagulationfactor III

Tissue factor Activates FVIIa mdash 295 3 Transmembranendashtrans-membrane helixndashtransmembrane

F5 ENSG00000198734 Coagulationfactor V

Proaccelerin Combines withFX to activateprothrombin

Activated proteinC resistance

2224 6 F58 type A1ndashF58 typeA2ndashBndashF58 type A3ndash

F58 type C1ndashF58type C2

F7 ENSG00000057593 Coagulationfactor FVII

Proconvertin Activates FIX FX Congenital pro-convertinfactor VIIdeficiency

466 4 GlandashEGF1ndashEGF2ndashSP

F8 ENSG00000185010 CoagulationFactor FVIII

Antihemophilicfactor A (AHF-

A)

Combines withFIX and FIV to

activate FX

Hemophilia A 2351 6 F58 type A1ndashF58 typeA2ndashBndashF58 type A3ndash

F58 type C1ndashF58type C2

F9 ENSG00000101981 Coagulationfactor FIX

Christmas factor Combines withFVIII and FIVto activate FX

Hemophilia B 461 4 GlandashEGF1ndashEGF2ndashSP

F10 ENSG00000126218 Coagulationfactor FX

Stuart-prowerfactor

Converts pro-thrombin to

thrombin

Congenital factorX deficiency

488 4 GlandashEGF1ndashEGF2ndashSP

F11 ENSG00000088926 Coagulationfactor FXI

Plasma thrombo-plastin

antecedent

Combines withFIV to activate

FIX

Factor XIdeficiency

625 4 Apple 1ndashapple 2ndashapple3ndashSP

F12 ENSG00000131187 Coagulationfactor FXII

Hageman factor Activates FXI ac-tivates plasmin

Hereditaryangioedema

type III

615 6 Fibronectin type-IIndashEGF1ndashfibronectin

type-I - EGF2ndashkrin-glendashSP

F13A ENSG00000124491 Coagulationfactor FXIII

Fibrin-stabilizingfactor A chain

Stabilizes fibrin Congenital factorXIIIa deficiency

732 mdash mdash

F13B ENSG00000143278 Fibrin-stabilizingfactor B chain

Stabilizes FXIIIAregulatesthrombin

Congenital factorXIIIb deficiency

661 10 Short complement regu-lator domains 1ndash10

3041

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

In order to interpret the effect of evolutionary and muta-tional changes in coagulation we have examined FVIII FIXand FXI For FVIII 5474 disease-causing mutations havebeen recently compiled (Rallapalli PM Kemball-Cook GTuddenham EG Gomez K Perkins SJ unpublished datahttpwwwfactorviii-dborg last accessed January 2014)3713 mutations were recently published for FIX (Rallapalliet al 2013 httpwwwfactorixorg last accessed January2014) and 487 mutations are known for FXI (Saunders et al2009 httpwwwfactorxiorg last accessed January 2014) Weidentified 47 full genomes in the Ensembl database thatshowed good sequence coverage This enabled the assessmentof positive or negative selection pressures (Yang 2006) in theevolution of coagulation protein sequences The identificationof positive selection in at least nine of the 11 coagulationproteins during different periods of evolution showed thatan effective coagulation pathway was under considerableadaptive constraints even in primates The mutational analy-ses for FVIII FIX and FXI showed that disease-causing muta-tions primarily affected highly conserved residues undernegative selection and damaged protein stabilities This jointevolutionary-mutational study provides novel clarifications ofthe observed data on pathogenic mutations and may facilitatenew gene therapy approaches for their treatments

Results

Selection of 47 Genomes

A data set of 14 coagulation factor genes (table 1) that codefor 11 proteins across different groups of vertebrate genomes(fig 2) was identified from the Ensembl database TheEnsembl database currently holds 78 genomes Of theseonly 47 vertebrate genomes were chosen based on their se-quence coverage and assembly quality (fig 2) Ensembl genesets are built from assemblies of DNA sequences and proteininformation hence it was important to select only good-quality assemblies for the gene trees and sequence data setsused in this study The 47 vertebrate genomes were classifiedinto five major clades based on the Ensembl classificationnamely Primates Glires Laurasiatheria Sauropsida andFishes The mammalian clade of 30 genomes consisted ofthree clades used in our study (Primates Glires andLaurasiatheria) and other clades (Atlantogenata andAustralian mammals) (fig 2) The codon substitution site-models M1a M2a M7 M8 and M8a from CodeML(Materials and Methods) were used to calculate the selectionpressures on the five major clades each with at least sevensequences to ensure that the CodeML calculations were rea-sonably powerful (Anisimova et al 2002) By following this

FIG 1 Schema of the blood coagulation pathway leading to fibrin The relationships between the 11 coagulation factors that are coded by 14 genes areshown in the modern revised coagulation pathway (blue enzymes red cofactors) Tissue factor (TF also known as FIII) initiates coagulation when itbinds and activates FVII The activated TF-FVIIa complex then activates FX which then generates activated thrombin (FIIa) FVIII FIX and FXI enable theamplification of the coagulation pathway to maximize FIIa production and possess the largest number of pathogenic human mutations Activatedthrombin cleaves FG (FI green) to form fibrin polymers (blood clots) that are cross-linked by FXIIIa

3042

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

criterion the Atlantogenata clade with only five sequenceswas eliminated from the selection pressure calculations Themultiple sequence alignments from Ensembl were realignedusing a strict quality control procedure (supplementary figS1 Supplementary Material online) which was performed inorder to minimize the number of false positive results below

Presence and Absence of Coagulation Factors

The 47 genomes were searched for the 14 genes coding forthe 11 coagulation factors In cases where the Ensembl genetree for a given coagulation factor did not show the presenceof a particular genome we searched for putative genes acrossthat genomic sequence to establish whether the organismhad lost them The purpose of the coagulation cascade is toconvert FG into the fibrin clot and unsurprisingly FG (theFGA FGB and FGG genes) was found in all 47 genomes Theother central genes thrombin (F2) tissue factor (F3) the

coagulation cascade cofactors (F5 and F8) F7 F9 and F10were present across all 47 genomes from fishes to humansF11 is present in all Sarcopterygii (tetrapods and coelacanth)but seems to have been lost or highly diverged in teleostfishes because it is only recorded in the spotted garLepisosteus oculatus which belongs to Holostei a sisterclade of teleost fishes F12 appeared in the ancestor of tetra-pods and is absent in fishes but present in amphibians(Xenopus tropicalis) reptiles (Anolis carolinensis) and mam-mals For F13 the alpha chain (F13A) was observed to appearfrom the time of vertebrates because this gene has beenfound in the sea lamprey fishes and tetrapods The betachain (F13B) appears to be present at the origin of vertebratesbecause F13B sequences have been recorded in cave fish(Astyanax mexicanus) and medaka (Oryzias latipes)Occasionally a gene was not reported this was most likelyto arise from incomplete or unidentified sequences

FIG 2 Phylogenetic tree of 47 vertebrate genomes The tree shows the 47 genomes studied here alongside their clade and taxonomic groups The 47vertebrate genomes showed good sequence quality and coverage in the Ensembl database and were grouped into five major clades (Primates GliresLaurasiatheria Sauropsida and Fishes) and three minor clades The Mammals comprised of the Primates Glires and Laurasiatheria clades together withAtlantogenata (African origin mammals and Xenartha) and four Australian species totaling 30 out of 47 genomes

3043

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Selective Pressures Using Codon SubstitutionSite-Models

Codon substitution site-models from the CodeML packagewere used to analyze the protein-coding sequences in the 14genes following the quality control of their alignments Themodels assume that selective pressure does not act equally onthe entire protein sequence but varies between amino acidsites in a gene family We calculated the log likelihood values(lnL) of five different selection models (M1a M2a M7 M8and M8a) in order to perform the likelihood ratio test (LRT)The five models differed from each other in the number ofparameters (np) considered and their variable values Thevalue is the ratio of nonsynonymous and synonymous sub-stitution rates (dNdS) (Materials and Methods) The compar-ison between a null model that does not consider 4 1(M1a M7 or M8a) and an alternate model that considers4 1 (M2a or M8) is a measure of positive selectionIf the null (neutral) model is rejected in favor of thealternate (selection) model positive selection is inferred(Yang 2006)

To evaluate the five models the LRT value was calculatedas follows

LRT frac14 2ethlnLalternate lnLnullTHORN

where lnL is the maximum-likelihood estimate of the proba-bilities of selection The LRT is asymptotically distributed as a2

k function where k (also known as the degrees of freedom) isthe difference between the np in the null and alternatemodels The np values were calculated along with the lnLvalues in CodeML

k frac14 npalternate npnull

The LRT values for which the 2k distribution shows a P

valuelt 005 indicated that some sites were significantlyunder positive selection For this study the comparison ofthe M8a null model with the M8 alternate model (referred toas M8a-M8) was considered optimal to infer positive selection(table 2) The LRT M1a-M2a is too conservative whereas M7ndashM8 was considered to be more problem prone and less ac-curate compared with M8a-M8 (Wong et al 2004 Studer andRobinson-Rechavi 2009b) Because the P values result frommultiple tests performed using different genes at differentclade levels it was necessary to control for the expected pro-portion of false positives False positives were accounted forby calculating the q value correction over the P value for thedifferent genes in each clade this approach being both pow-erful and specific (Studer et al 2008)

The overall picture is that a total of nine coagulation factorproteins (out of the 11 of the cascade) underwent positiveselection at one or more stages in evolution (fig 3) Thehighest number of genes under positive selection within anindividual clade is five (for vertebrates and mammals) ForVertebrates using the LRT M8a-M8 positive selection wasfound in as many as five of the 14 coagulation genes (table2) In figure 3a these included F2 F5 F7 F10 and F13A for all47 vertebrates In figure 3b these included FGA F3 F5 F8 andF9 for the 30 mammals In figure 3cndashe these included FGB F5

and F8 (Primates) FGA and F13B (Laurasiatheria) and F3(Sauropsida) FG (FG = FGA + FGB + FGG) and F5 were ob-served to undergo the most extensive positive selection com-pared with any other coagulation genes The selectivepressure on all the genes in different clades was identifiedfrom the LRT values (table 2) except for those of F11 inLaurasiatheria Sauropsida and Fishes F12 in Sauropsida andFishes and F13B in Fishes because these genes wereabsent (supplementary table S1 Supplementary Materialonline)

Estimation of Selective Pressures Using Branch-SiteModel of Codon Substitution

The above site models estimated selective pressures that varyamong amino acid positions across multiple species But se-lective pressures can also vary among species (Studer andRobinson-Rechavi 2009a) and codon substitution branch-site models have been developed to detect such episodicpositive selection (Zhang et al 2005) Here we have usedthe results of the branch-site model from two differentsources one from the Selectome database and the otherfrom our own calculations Both approaches employedCodeML on the two data sets of this study (Materials andMethods) in order to check for consistency with the positiveselection identified in the site-model calculations (fig 3)Similar to the site model the false positives in our branch-site model calculations were accounted for by calculating theq value correction over the P value for the different genes ineach clade For both branch-site calculations we followed thetaxonomic grouping in the Selectome database (supplemen-tary table S2b Supplementary Material online) The highestorder of classification is the Sarcopterygii (coelacanths lung-fish and tetrapods) followed by its largest subgroupTetrapoda (mammals sauropsids and frogs) Other majortaxonomic subtrees were those of amninots sauropsidsmammals theria and eutheria The Actinopterygii (fishes)and Sauropsida (birds) are common to both the calculationsThe Eutheria and mammalian subtree calculations from theSelectome database were considered to be equivalent to themammalian clade of our analyses (fig 2) Positive selectionacross the vertebrates was observed for all genes except FGGF3 F7 and F11 in our branch-site model calculations (supple-mentary table S2b Supplementary Material online) We ob-tained similar results with our site-model calculations(supplementary table S2a Supplementary Material online)except for the exclusion of F3 and F7 (fig 3a) Our branch-site calculations identified positive selection in mammalianFGB F2 F8 and F9 whereas the site model identified positiveselection in mammalian FGA F3 F5 F8 and F9 Overall FG(FGA FGB) F8 and F9 were consistently identified to be pos-itively selected in mammals whereas F5 shows positive selec-tion up to amniotes after which the mammals and thesauropsids clade split In conclusion the two branch-siteand the site models all suggested signatures of positive selec-tions across F8 and F9 whereas no such selection wasobserved across and between F11

3044

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Identification of Codon Sites under Positive Selection

For up to 11 genes out of 14 identified to be under positiveselection (table 3) further analyses were performed to identifywhich codon sites were under such selection This was doneusing the Bayes Empirical Bayes (BEB) method from the M8site-model (Yang et al 2005) The number of codons poten-tially under positive selection (BEB 450) in the positivelyselected genes ranged between 3 and 64 But the number ofcodons accurately predicted under positive selection (BEB495) is actually much lower (between 0 and 4) The FGAgene in Laurasiatheria showed the highest percentage of pos-itively selected codons (1329 as estimated by CodeML and64 codons detect with two at BEB495) followed by the F5gene in Primates (1321 with 54 codons all BEBlt95 ) Thesmallest percentage of positively selected codons was ob-served in the F10 gene of Vertebrates (001 with threecodons all with BEB lt95) No positive selection was ob-served in any of the codon sites for the FGG F11 and F12genes in the taxa groups of this study

Table 2 LRT Statistics for the Site Model Comparisons of Model 8with Model 8a for the 14 Coagulation Factors among the Five Clades

Gene Clade No ofSequences

No ofCodons

(M8a-M8)

LRT P value q value

FGA Vertebrates 44 236 139 024 035Mammals 29 288 775 001 004Primates 7 865 335 007 016

Glires 7 670 289 009 019Laurasiatheria 9 267 728 001 004

Sauropsida 6 464 033 057 062Fishes 7 592 016 100 072

FGB Vertebrates 40 429 145 023 035Mammals 26 456 278 100 072Primates 6 488 744 001 004

Glires 6 471 069 041 051Laurasiatheria 8 479 144 023 035

Sauropsida 5 473 002 100 072Fishes 7 476 099 100 072

FGG Vertebrates 44 386 174 100 072Mammals 29 427 319 007 017Primates 7 433 398 005 013

Glires 7 433 344 006 016Laurasiatheria 8 432 009 076 072

Sauropsida 5 427 003 086 072Fishes 8 395 542 100 072

F2 Vertebrates 38 475 1691 000 000Mammals 23 571 464 003 011Primates 7 621 425 004 012

Glires 5 611 007 100 072Laurasiatheria 7 620 158 021 034

Sauropsida 5 593 368 005 014Fishes 8 562 005 100 072

F3 Vertebrates 47 166 209 015 027Mammals 26 240 1337 000 000Primates 7 295 001 090 072

Glires 6 287 307 008 018Laurasiatheria 7 292 097 032 044

Sauropsida 5 254 865 000 002Fishes 14 192 082 036 047

F5 Vertebrates 37 1182 1495 000 000Mammals 22 1741 3192 000 000Primates 6 2116 536 002 008

Glires 6 1847 048 049 056Laurasiatheria 6 2043 483 003 010

Sauropsida 5 1550 145 023 035Fishes 8 1534 212 015 027

F7 Vertebrates 34 330 593 001 007Mammals 29 344 306 008 018Primates 7 443 016 069 072

Glires 7 417 03 100 072Laurasiatheria 7 443 057 045 054

Sauropsida 3 425 055 046 054Fishes NA

F8 Vertebrates 38 1028 485 003 010Mammals 24 1985 1119 000 001Primates 7 2351 683 001 004

Glires 5 2257 395 005 013Laurasiatheria 7 2344 025 062 067

Sauropsida 5 1480 003 086 072Fishes 7 1413 09 034 045

F9 Vertebrates 44 329 063 043 052Mammals 24 443 1857 000 000Primates 7 461 128 026 036

Glires 5 468 185 017 030

(continued)

Table 2 Continued

Gene Clade No ofSequences

No ofCodons

(M8a-M8)

LRT P value q value

Laurasiatheria 7 459 434 004 012Sauropsida 5 435 003 100 072

Fishes 13 368 036 100 072

F10 Vertebrates 41 353 1455 000 000Mammals 26 424 023 063 067Primates 7 485 001 091 072

Glires 6 464 206 015 027Laurasiatheria 7 463 23 013 025

Sauropsida 5 385 003 087 072Fishes 8 391 071 040 051

F11 Vertebrates 23 614 034 056 062Mammals 23 614 034 056 062Primates 7 625 005 082 072

Glires 5 619 007 079 072Laurasiatheria 6 625 002 089 072

Sauropsida NAFishes NA

F12 Vertebrates 27 512 398 100 072Mammals 25 533 018 100 072Primates 7 615 002 100 072

Glires 6 557 007 100 072Laurasiatheria 8 513 46 003 011

Sauropsida NAFishes NA

F13A Vertebrates 47 631 907 000 002Mammals 26 684 258 011 022Primates 7 732 162 020 034

Glires 6 730 147 100 072Laurasiatheria 9 730 027 100 072

Sauropsida 5 732 001 092 072Fishes 14 643 259 011 022

F13B Vertebrates 31 609 13 025 036Mammals 25 640 157 021 034Primates 7 660 122 027 037

Glires 5 634 021 100 072Laurasiatheria 7 656 708 001 004

Sauropsida 5 573 093 100 072Fishes NA

NOTEmdashPositive selection is denoted in bold in column 2

Plt 005 Plt 001

3045

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Disease-Causing Missense Mutations and Relationshipto Evolutionary Pressures

The relationship between selective pressure and the proba-bility of disease-causing mutations at a codon site is of greatinterest Disease-causing missense mutations have been re-ported in all 14 coagulation genes in particular forHemophilia A (F8) Hemophilia B (F9) factor XI deficiency(F11) and thrombophilia (F2) (table 1) Extensive compila-tions of 487ndash5474 disease-causing mutations are available forthree of these coagulation factors FVIII FIX and FXI(Saunders et al 2009 Rallapalli et al 2013 Rallapalli et al2014 [unpublished data]) There is a sufficient number ofpathogenic mutation and evolutionary selection data fromthe FVIII FIX and FXI proteins to enable us to compare instatistical detail the correlation between selective pressureand disease-causing mutations

For the comparison of the FVIII FIX and FXI proteinchanges selective pressures were identified using the BEB in-ference method from model M2a (fig 4) While powerful a

disadvantage of Model M8 is that it classifies sites into 11categories (ten in negativeneutral and one positive selec-tion) These 11 categories can vary between different genefamilies making the comparison difficult Thus model M2a isadvantageous in that this has only three distinct categoriesnamely negative selection (dNdSlt 1) neutral evolution (dNdS = 1) and positive selection (dNdS4 1) Under the M1a-M2a test in the primates clade two genes (F8 and F9) showedpositive selection whereas F11 did not show positive selec-tion which differs from that shown in figure 3c which is basedon the M8 model The posterior probabilities of neutral neg-ative and positive selection for the FVIII FIX and FXI proteinsshowed that positive selection occurred in specific codonsthe most being for F8 and the fewest for F11 For completionthe same analysis for the other 11 genes is shown in supple-mentary table S2 Supplementary Material online These showsimilar outcomes to those for FVIII FIX and FXI

For each amino acid position in FVIII FIX and FXI wecompared its selective pressure (positive neutral or negative)

FIG 3 Positively selected genes of the coagulation cascade in vertebrates mammals and three major clades The layout of this figure is a simplifiedrepresentation of figure 1 Genes showing significant positive selection in the LRT calculations are highlighted in red (a) Positive selection was observedin five genes when all 47 genome sequences (vertebrates) were analyzed as a single group (b) In the Mammals group (fig 2) where positive selection isseen for five genes only the (c) Primates and (d) Laurasiatheria clades are shown with positive selection for two or three genes only because no positiveselection was seen in the Glires clade whereas insufficient sequences were available in the Atlantogenata clade for conclusions to be drawn (e) Positiveselection was seen in the Sauropsida clade but no positive selection was observed in Fishes (fig 2)

3046

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

with the reported number of disease-causing mutations Theoutcome was presented using regression plots (fig 5)The regression plots and their correlation coefficients revealedthe relationship between the selection probability values atamino acid sites with the disease-causing mutations at thosesites The average probability of a site to be under positiveselection and its number of disease-causing mutationsshowed a negative relationship with correlation coefficientsof085067 and095 in FVIII FIX and FXI respectively(fig 5 top row) The correlation coefficients for neutral selec-tion followed that of positive selection with values of 087064 and 096 for FVIII FIX and FXI respectively (fig 5middle row) In distinction to these a positive relationshipwas seen between the average probability of negative selec-tion and the disease-causing mutations with correlation co-efficients of +086 +066 and +095 for FVIII FIX and FXIrespectively (fig 5 bottom row) To summarize the fewer thenumber of disease mutations at a given site the higher is theprobability for positive selection or neutral evolution (fig 5

top row and middle row) The greater the number of disease-causing mutations at a given site the higher the probabilityfor negative selection (fig 5 bottom row) The comparisonbetween the positively selected proteins FVIII and FIX withthe nonpositively selected protein FXI is noteworthyIrrespective of the selective pressure all three proteins ex-hibited similar regression lines in each of the three rowsThese comparisons show that there is a general inverse pro-portionality relationship between the probability of positiveselection (or neutral evolution) and the number of disease-causing mutations at each site This confirms previous obser-vations that disease-causing mutations tend to occur at crit-ical sites in the protein and that these highly conserved sitesare less tolerant to new mutations (negative selection)

Stability Effect of Mutations

In order to explain the consequences of disease-causing mu-tations on the protein structure FoldX software was used to

Table 3 Positively Selected Sites in the M8a-M8 Comparison

Gene Clade No ofCodons

Analyzed

No of Codonsunder positive

selection

PercentageEstimated byCodeML ()

Sites under Positive Selection (BEB4 50)

FGA Mammals 288 13 717 71832353951101108178212219246273

FGA Laurasiatheria 267 64 1329 467121929303336405293949697100102103108109116117118123136138141143155171180182195208210213214216218219220222223225227229231232233234236239242244245250253254

256259 260261264265

FGB Primates 488 17 822 394155808182130134135137173290367388413415448

F2 Vertebrates 475 5 099 5257151162381

F3 Mammals 240 11 663 55727691115177179185195230240

F3 Sauropsida 254 13 1060 340414474798081110123170208249

F5 Vertebrates 1182 10 087 1032755745785936027779269951010

F5 Mammals 1741 54 272 237243660134389392656663667669 676680718720731745749765768782809819822824875

88089489590791191692893094596797698910161040107610801089113611391324147114791515

152215481564

F5 Primates 2116 54 1321 74052129180211336341408409434 59266070370772075478487787988889290794195997810331039

10591131116812051209121812201222123012311232123612441254126212651279128514981504

151915261667168218491918

F7 Vertebrates 330 12 522 56526688162167178179209258262

F8 Mammals 1985 27 398 7245281339342521667718765784785792807817889912113011931197

12201229128813081333168619671984

F8 Primates 2351 33 063 42275579181485086088692796496697998899610191071118012821312

136114141421143614381459152715821613162216501689173023402349

F9 Mammals 443 15 440 45100114140150177181199202206209270309348

F10 Vertebrates 353 3 001 14470

F13A Vertebrates 631 6 126 398489497502550613

F13B Laurasiatheria 656 18 152 5567378107341349350374443459461497522525539540624

P4 95 P4 99

3047

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

estimate the stability effect (G) of residue changes on thethree-dimensional structures of human FVIII (chains A and B)FIX and FXI (fig 6) FoldX was designed to predict the stabilityeffect of a mutation This is an empirical method that relies onthe estimation of physical parameters from more than 1000mutations in protein structures and has been evaluated with21 proteins (Schymkowitz et al 2005 Tokuriki et al 2007)First each amino acid at each position was replaced with eachof the other 19 amino acids in order to provide a baseline forcomparison As expected the prediction of all virtual muta-tions shows a distribution skewed toward destabilization (me-dians G range between 051 and 123 kcalmol) Next theeffect of changing the human amino acids with those seen inevolution for the other 46 sequences of this study was calcu-lated in order to examine the effect of evolutionary change onprotein stabilities This produced a narrow distribution pre-dominantly centered on a difference of stability G of0 kcalmol and showing a slight skew toward deleterious mu-tations (medians G range between 014 and 030 kcalmol) This outcome is as expected because evolutionarychanges in the amino acids are selected in order to preservethe stability of the protein structures Finally the effect ofchanging the human amino acids to those observed inunique pathogenic missense mutations was calculated Thisanalysis produced a clearly skewed distribution (mediansG range between 101 and 295 kcalmol) of G

values that ranged as high as 10 kcalmol (and even higherbut these higher values were clipped for easier visualization)this agrees with the fact that these mutations affect the pro-tein structure The first distribution for the virtual mutationsis intermediate between the latter two distributions asexpected

The same outcome of stability effects was also observed atthe individual amino acid level (fig 7) When the effect of 19amino acid replacements were computed at each position forFVIII FIX and FXI and the majority of replacements showed adestabilizing effect (410 kcalmol in red) A small propor-tion of replacements stabilized the structure (lt10 kcalmol in blue) indicating the ability of the protein to compen-sate deleterious mutations by further mutation Although thisappears to occur more strongly for the FVIII chains A and B(fig 7a and b) the proportion of allowed replacements is infact similar for all four proteins when their different sizes areconsidered When the effect of replacements by disease-caus-ing missense mutations was calculated and the G valuesshowed that these almost invariably destabilized the threeprotein structures

DiscussionIn this study we have assembled sequence data for 14 coag-ulation genes across 47 genomes in order to analyze the evo-lutionary changes in amino acids that take place between five

FIG 4 Probability of evolutionary selection across the Primates clade for the three proteins FVIII FIX and FXI The probabilities of negative neutral andpositive selection were plotted against the amino acid position in the three proteins The probability values were obtained from CodeML analyses of theseven primate sequences (gray negative selection yellow neutral selection green positive selection)

3048

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

major clades and several minor ones These changes werecompared directly with disease-causing mutation data forthree coagulation proteins FVIII FIX and FXI Good-qualitypathogenic mutational data are publicly available for thesethree proteins and the bleeding disorders caused by thesethree sets of mutational data correspond to the three mostprevalent bleeding disorders The outcome of these detailedanalyses confirms that positive selection occurred during dif-ferent stages in the evolution of several coagulation proteinsIt was also observed when comparing the protein structurestabilities that the pathogenic mutations have a destabilizingeffect on the protein structure as opposed to the stable mu-tations that are tolerated during evolution (fig 6) The com-bination of positive selection analyses with those forpathogenic mutations provides information about predictingdisease causing mutations and clarifies fresh approaches fortherapeutic interventions in bleeding disorders

Selective Pressures in Vertebrate Evolution

Evolutionary changes combine random mutational eventswith natural selection The probability of a mutation to befixed in a population depends on its consequence on thefitness of an organism New adaptive mutations that are ben-eficial in terms of fitness are more likely to be retained bypositive selection than neutral mutation Neutral mutations

that are neither beneficial nor detrimental are randomly re-tained or removed by genetic drift depending on the popu-lation size (neutral selection) Detrimental mutations aremore likely to be removed from the population by negative(or purifying) selection There are no strict patterns and adeleterious mutation can sometimes be beneficial to oneaspect of the organism The textbook example of this issickle-cell anaemia (drepanocytosis) which is a bleeding dis-order caused by the pGlu6Val mutation in the hemoglobin gene Mutations responsible for sickle-cell anaemia are gen-erally removed by natural selection In malaria-infested pop-ulations particularly in tropical or subtropical regions thesemutations are selected for in one allele because they offerpartial protection against the pathogen (Lopez et al 2010)The malaria-infected sickle cells inhibit infection by the ma-laria plasmodium in heterozygous cases (beneficial) but causesickle-cell anaemia in homozygous cases (deleterious)

Genes corresponding to proteins in direct contact with theenvironment are more likely to be under adaptive selectionThus genes involved in host-pathogen interactions are com-monly associated with positive selection Large-scale analysesrevealing positive selection in vertebrates or arthropods havebeen reported for genes involved with immunity (Nielsenet al 2005 Studer et al 2008 Montoya-Burgos 2011 Rouxet al 2014) In the present detailed study positive selection inseveral genes of the coagulation cascade was identified at

FIG 5 Relationship between disease-causing mutations and evolutionary pressures The correlation between evolutionary selection and the number oftimes a disease-causing mutation occurs at each amino acid position is shown In the columns (a) FVIII (b) FIX and (c) FXI correspond to the threecoagulation factors for which pathogenic mutational information is available in sufficient quantity from our three mutational databases The three rowscorrespond to the posterior probability (based on the BEB model of CodeML) of positive selection neutral evolution and negative selection forprimates

3049

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

different stages in evolution (fig 3) An earlier study identifiedpositive selection across six mammalian genomes (humanchimpanzee macaque mouse rat and dog) using LRTsbased on codon substitution models and false discoveryrates to evaluate multiple LRT comparisons (Kosiol et al2008) Although they found strong positive selection(Plt 005 and false discovery ratelt 005) in nine complementgenes only nominal positive selection (Plt 005) was identi-fied in four coagulation genes (F2 F12 TFPI PROC) none ofwhich were identified in figure 3b A limitation of this earlieranalysis may be the low number of six genomes used Notethat complement is related to coagulation in that the latterhas the ability to activate complement of innate immunitythat mediates host-pathogen interactions (Amara et al 2008)

Positive Selection of the Coagulation Proteins throughHost-Pathogen Interactions

As a consequence of the immune response blood coagula-tion is often exploited by pathogens for reason of infectiveand septic processes For example FG and FV are targeted bybacteria thus offering a straightforward explanation of posi-tive selection (fig 3) We discuss this in terms of FG and FIIfollowed by FIII and FVII and then FV and FX

By taking advantage of the increased number of availablegenome sequences positive selection has been identified inseveral coagulation genes during various stages in evolutionIn particular all three subunits FGA FGB and FGG of FG haveshown positive selection at periods during vertebrate evolu-tion according to both the site model and the branch-sitemodels Similarly prothrombin (FII) has shown positive selec-tion but only across all the vertebrate genomes and notacross any smaller clades Although the immune system isunder strong adaptive constraints because it needs to evolveconstantly against pathogens the coagulation system alsocomprises the first line of immune defense against bacteriato form a clot to block bacterial invasion (Sun 2006) Both FGand prothrombin (FII) are targeted by certain bacteria Forexample FG is targeted during the course of infection bycysteine protease a virulence factor of Streptococcus pyogenes(Matsuka et al 1999) and FG binding protein Efb ofStaphylococcus aureus Similarly FG-like proteins exist inAnopheles gambiae and are involved in the function of theimmune system against malaria or bacterial parasites (Dongand Dimopoulos 2009 Lombardo et al 2013) Thrombin in-teracts with staphylocoagulase a protein secreted by thehuman pathogen S aureus and activates prothrombin with-out proteolysis to form thrombin The resulting

FIG 6 The effect of amino acid replacements on the overall protein stability change G Data are shown for (a and b) FVIII A and B chains (c) FIX (d)FXI Three calculations were performed each being based on the sequences for which crystal structures are known (PDB codes 2R7E [FVIII] 2WPH [FIX]and 2F83 [FXI]) The FVIII A and B chains correspond to the N-terminal residues 19ndash760 and the C-terminal residues 1582ndash2351 that are observed in itscrystal structure The calculation represented by the dashed line indicates the distribution calculated for all 19 possible amino acid replacements Thatrepresented by the dotted line indicates the distribution calculated for all the evolutionary occurring replacements that were observed in our data set of47 genomes That represented by the continuous line indicates the distribution calculated for the disease-causing mutations from the FVIII FIX and FXImutation databases

3050

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

prothrombinmdashstaphylocoagulase complex binds to FG tocleave this into self-polymerizing fibrin This process is centralto the molecular pathology of S aureus endocarditis becausestaphylocoagulase bypasses physiological blood coagulationthrough the evasion of the recognition of prothrombin bycirculating thrombin inhibitors (Panizzi et al 2005)

Positive selection for FIII and FVII is attributed to a host-pathogen interaction with the herpes virus family These cellsurface molecules do not necessarily participate directly in animmune response but can still serve as a convenient gatewayfor pathogen entry (Vallender and Lahn 2004) Herpes virusespromote cellular surface infections by coevolving with theirhuman host indicating a mode by which pathogens exploitthe coagulation system (Sutherland et al 2012) A study of thehost-pathogen driven coevolution of herpes virus revealedthat this virus has existed from the time of mammals anddivided into the HSV-1 and HSV-2 types recently during

human evolution (Kolb et al 2013) These observations sug-gest that the pressure for FIII to evolve in early mammalsespecially through primates could be the reason for the pos-itive selection of FIII across the mammalian clade (fig 3) Wepropose that the FIIIndashFVII interaction is a significant drivingforce for positive selection in FVII

The FV and FVIII cofactors are homologous to each otherin their sequence and domain organization hence it is notsurprising to see similar selection pressures acting on them(fig 3c) A difference arises in the entire vertebrate genome(fig 3a) where FV shows positive selection whereas FVIII doesnot This difference between FV and FVIII across vertebratesmay result from the interaction of FV with Escherichia coliThe extracellular serine protease EspP from E coli cleaves FVto lead to its degradation and thus reduces the activity of thecoagulation cascade (Brunder et al 1997) The activated FVacofactor forms a FVaFXa complex with activated FX (fig 1)

FIG 7 The effect of amino acid replacements at the sequence level on the protein stability change G The residue stability changes are depicted ona seven point scale from highly destabilizing (red) to highly stabilizing (blue) for each of (a) FVIII chain A (b) FVIII chain B (c) FIX and (d) FXI Their PDBcodes are indicated in figure 6 In each panel the upper half shows the G values for all the possible 19 amino acid replacements at each residueposition that is part of the crystal structure and the lower half shows the G values for the disease-causing mutations taken from the mutationdatabases

3051

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

which is important for thrombin formation the most impor-tant step for fibrin clot formation Because both FVa and FXaare central to coagulation and are associated with bacterialproteolysis this offers an explanation for the signatures ofpositive selection in both proteins

FVIII FIX and FXIII are also associated with positive selec-tion however it is not clear what selective pressures are in-volved Although these proteins may also be targeted bypathogens no direct interactions have been reported so far

Absence of Positive Selection for FXI and FXIIFXII is unique in that it does not show positive selection eventhough it is involved in host-pathogen interactions with long-chain inorganic phosphatases (polyP) from microorganisms(Morrissey et al 2012) Small-chain polyP are produced in thehuman brain and by blood platelets Hence even during theabsence of pathogen-originated polyP the human host doesnot lack polyP and there is no pressure to evolve for this (Puyet al 2013) A F12 deficiency in patients does not cause bleed-ing disorders This means that the coagulation cascade can beinitiated even in the absence of F12 For similar reasons FXIshows no positive selection We hypothesize that this is be-cause first no interaction with the environment has beendemonstrated to date and second with respect to the polyPmechanism the coagulation cascade is capable of activatingin a FXI-independent manner

Relationship between Disease-Causing Mutations andSelection

The mechanisms of genetic drift mutational change andevolutionary change are closely related to each otherEvolutionary change is not only dependent on the geneticvariability from mutational change but also on genetic driftwhich describes the random fluctuations in the number ofgenetic variants in an organism Genetic drift and mutationsproduce random variations that selection can act upon Thenew changes incurred in one generation may be cancelled inanother generation It may therefore be important to deter-mine whether a new mutation is lost or becomes commonenough for selection to determine its fate (Masel 2011) In thisstudy we have examined the relationship between naturalselection acting on an amino acid and the disease-causingmutations reported at that amino acid position for thethree prominent coagulation factors FVIII FIX and FXI Wehave shown that the disease-causing mutations are morelikely to occur at a higher frequency at sites under negativeselection in all three proteins (fig 5) As sites under strongselective pressure are more likely to be critical to proteinfunction such a mutation can readily affect protein functionand trigger disease Lethal (or deleterious) mutations thatcause death are de facto not observed in our data set Bycontrast disease-causing mutations occur at a reduced fre-quency in sites under neutral or positive selection suggestingthat these sites are more tolerant to mutations than sitesunder negative selection Interestingly we have shown thatFVIII and FIX have undergone much positive selectionwhereas FXI has no such signatures (fig 4) Despite this dif-ference our results for three proteins show no change in the

relationship between evolutionary selection and disease-causing mutations (figs 5 and 6) In summary the reporteddisease-causing mutations show randomness and the muta-bility of a given amino acid position depends on the selectionpressure acting at that position and not on the entire protein

Effect of Disease-Causing Mutations on the Stability ofProtein Structure

Mutations that change the amino acid sequence can havestriking effects on protein stability Point mutations are re-lated to pathological and genetic conditions by reducing orabolishing coagulation protein function (table 1) Proteinfunction is related to its stability which is often measuredby the change in folding energy G of the protein structure(Stefl et al 2013) We have identified the effect of mutationson the stability of FVIII FIX and FXI in terms of G valuesThe stability effects of polymorphic amino acid changes inhuman sequences follow a Gaussian distribution (de Beeret al 2013) In that study disease-causing mutations alsoshowed the same distribution but slightly skewed towarddestabilization In this study the same trend for aminoacids substitutions in vertebrates were observed but this ismuch stronger for disease-causing mutations (fig 6) The dif-ference between the two studies may be explained by thedifferent methods used de Beer et al (2013) used the DiscreteOptimized Protein (DOPE) score of Modeller as a monitor ofglobal stability whereas we used FoldX (Materials andMethods) as an empirical method to infer directly the stabilityeffect G of a point mutation

Evolutionarily observed mutations are the most stable interms of their protein structure whereas disease-causing mu-tations destabilized the protein structure (figs 6 and 7) Eachsite may or may not tolerate a particular amino acid changedepending on the change in stability When each amino acidwas converted to each of the other 19 possible amino acids(fig 7 top panels) the G values showed that most aminoacids destabilize the structure whereas several may stabilize itThe disease-causing effect of a mutation may be affected notonly by the selection pressure acting on it but also by the typeof the amino acid replaced For the FVIII FIX and FXI bleedingdisorders destabilization was observed for most of the dis-ease-causing mutations

Utility of Evolutionary and Disease-CausingMutational Analyses for Clinical Diagnosis

Interactive locus-specific databases are useful tools for pre-senting patient mutational data for patient diagnosis andcare We plan to incorporate the above data sets on selectivepressures and stability effects into our mutational databasesfor FVIII FIX and FXI This information will provide insight forthe prediction of the effect of disease-causing mutations Thelevel of selective pressure was estimated under a maximum-likelihood framework Such methods are useful for geneticstudies but their results are sometimes difficult to interpretby a nonspecialist The presentation of this information needsto be intuitive to the viewer As shown in supplementaryfigure S3 Supplementary Material online the use of multiple

3052

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

sequence alignments and presentation of stability effectsG structural locations and selective pressures can helpin interpreting the impacts of the mutations This type ofinformation will therefore clarify the interpretation of newlydiscovered disease-causing mutations for clinicians evolu-tionary biologists and protein biochemists This combinedevolutionary-driven study of bleeding disorder-causing muta-tions is the most detailed of its kind as far as we are aware andmay be usefully extended to other biological processes andtheir proteins

Materials and Methods

Genomic Sequences

The flowchart of operations is summarized in supplementaryfigure S1 Supplementary Material online First the completemultiple sequence alignments were retrieved together withtheir gene trees for each of the 14 genes corresponding to the11 coagulation factors (table 1) from the Ensembl Comparadatabase (Vilella et al 2009) using their Ensembl IDs (table 1)and a Perl script Ensembl release 73 was used accessed onSeptember 30 2013 (Flicek et al 2013) These amino acidsequences came from Ensembl as they are generally ofgood quality and contained 74 vertebrate genomes in atotal of 80 genomes We then excluded any sequences thatcame from genomes of low coverage (2 or less) This re-sulted in 47 vertebrate genomes with alignment sizes be-tween 23 and 47 genomes These 47 sets of vertebratesequences provided sufficient genomic coverage to performphylogenetic analysis The 47 vertebrate organisms were clas-sified into five major clades (fig 2) namely Primates Glires(rodents rabbits and pika) Laurasiatheria Sauropsida (rep-tiles and birds) and Fishes (ray-finned fishes) together withminor clades including Atlantogenata and Australianmammals

Sequence Retrieval and Multiple Alignments

The multiple sequence alignments and gene trees fromEnsemblCompara included all the coagulation sequences indifferent species irrespective of their genome coverage Themultiple sequence alignments of the downloaded genes fromthe EnsemblCompara database were constructed using aglobal alignment algorithm provided within this databasefor significant number of sequences of arbitrary lengthsThe data set required for this study comprised 14 genesfrom 47 genomes A thorough quality control procedure ofthe multiple sequence alignments was performed in order togenerate minimum false positive results Thus sequences wereremoved if they showed 41 low-complexity regions orunknown nucleotides (indicated as ldquoNrdquo in a sequence)From these the phylogenetic trees were pruned usingNewick Utilities software (Junier and Zdobnov 2010) to iden-tify the tips corresponding to the retrieved species (fig 2) Thefinal alignments of these sequences were made with PRANKwhich is conservative in that it tends to align only amino acidsderived by substitution and introduces insertions and dele-tions (indels) instead of aligning sequence fragments thatappear too divergent (Loytynoja and Goldman 2008)

PRANK is an advanced probabilistic alignment algorithmthat incorporates phylogenetic information This has beenshown to reduce the number of false-positives when evalu-ating positive selection (Fletcher and Yang 2010 Loytynoja2014) All the PRANK alignments were made using defaultparameters for the empirical codon model and the prunedtrees as guides The resulting alignments are composed ofhighly conserved blocks surrounded by noise due to indelsFinally Gblocks (Castresana 2000) was used to remove thenoise in the unreliable regions of the multiple sequence align-ments which may cause false positives when testing for pos-itive selection to keep only the well-aligned parts The defaultGblocks parameters for the codontype model (parameter ndasht = c) were used with the exception of tolerating half gaps perposition (parameter -b5=h) This procedure resulted in cleanand accurate multiple sequence alignments for the final com-putations of the site and branch-site models using CodeMLtogether with their corresponding gene trees

Analyses of Evolutionary Pressure

For each of the 14 coagulation genes within each of the fivemajor clades the selective pressures were estimated using thedifferent codon substitution site models implemented inCodeML of the phylogenetic analysis with maximum-likeli-hood software (PAML release 47) (Yang 2007) By analyzingthe protein-coding genes at the nucleotide level the synon-ymous or silent changes (nucleotide substitutions that do notchange the encoded amino acid) are distinguished from thenonsynonymous or missense changes (substitutions thatchange the encoded amino acid) Because natural selectionoperates at the protein level and can take any direction ofpositive neutral or negative selection the synonymous andnonsynonymous substitutions are subjected to different se-lective pressures which occur at the rates dS and dN respec-tively Hence the ratio of nonsynonymoussynonymoussubstitution rates (= dNdS) measures the selective pressureat the protein level The value of reveals the direction andstrength of natural selection acting on the protein Values oflt 1 = 1 and 4 1 indicate negative selection neutralevolution and positive selection respectively If an existingfunction or phenotype is evolutionarily favorable then nega-tive selection (or purifying selection) favors the conservationof that function or phenotype (conservation of amino acids)The other is positive selection (or Darwinian selection) whichfavors the promotion of a new function or phenotypeNeutral evolution describes the tolerance to amino acid mu-tations which are neither deleterious nor beneficial and arefixed according to genetic drift

The following five models were used to account for thevariation of dNdS among codon sites M1a (two dNdS ratioswith negative selection [0lt 1] and neutral evolution[1 = 1]) M2a (three dNdS ratios with negative selection[0lt 1] neutral evolution [1 = 1] and positive selection[2 1]) M7 (beta) M8 (beta and 24 1) and M8a (betaand 2 = 1) The alignments contain several positions inwhich one of the sequences is missing and the ldquocleandata = 0rdquooption was selected to retain all sites The inference of positive

3053

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

selection was conducted by performing LRTs in which thefollowing two pairs of models were compared namely M1awith M2a and M8a with M8 The principle is to compare anull model that does not allow24 1 (M1a or M8a) with analternative model that does (M2a or M8) Because the alter-nate model has more parameters than the null model theLRT follows a 2 distribution The LRT is a powerful methodfor comparing two hypotheses where one is the special caseof the other The BEB method (Yang et al 2005) was used tocalculate the posterior probabilities for site classes and theBEB value was used to identify sites under positive selectionwhere the 2 distribution of the LRT was significant withPlt 005 As many different LRT comparisons (differentgenes different clades) were conducted a control for thefalse-discovery rate (expected proportion of false positives)was applied using the q value package in R (Storey andTibshirani 2003) We adjusted the parameters of the q valueto perform a bootstrap analysis (pi0meth = ldquobootstraprdquo) andto tolerate 10 of false positives in our significant results(ldquofdrlevel = 01rdquo) A bootstrap analysis is necessary as the Pvalue distribution is bimodal (Studer et al 2008)

Two sources for the branch-site model were used First theSelectome database at httpselectomeunilch (last accessedJanuary 2014) (Proux et al 2009 Moretti et al 2014) providesthe precomputed results of the branch-site model (supple-mentary table S2b Supplementary Material online) based onthe multiple sequence alignments from the EnsemblCompara database (Vilella et al 2009) The last release ofSelectome (release 6) contains an optimized procedure ofalignment refinement and quality control similar to theone used for our site-model analyses as described aboveThe branch-site model allows the dNdS value to vary notonly between sites but also between branches (Zhang et al2005) The advantage of this model is its ability to detecttraces of episodic positive selection (Studer and Robinson-Rechavi 2009a) Second a branch-site model was determinedbased on our data set (see above) We used the vertebratetrees and their multiple sequence alignments as inputs forbranch-site model calculations The multiple alignments weremanually rechecked to remove barren sequences and thecorresponding vertebrate trees were edited accordinglyusing Newick Utilities Each taxonomic unit also known asthe branch or node of the tree was tagged using an auto-mated python script The CodeML program was run on eachof these tagged trees from all the genes for this branch-sitemodel (supplementary table S2b Supplementary Materialonline) For some alignments we found misaligned short in-sertions up to a maximum of ten amino acids in the con-served blocks probably due to gene model prediction errorsIn our whole data set these errors affect 0ndash4 sequences peralignment As the F2 gene family contained most of thesesmall errors albeit at a very low level this was reanalyzed byremoving four sequences with potential problems from thealignment The recalculation of the likelihood ratio deviatedonly slightly from the initial results (P value of 393e5 for 38sequences vs P value of 217e4 for 34 sequences) This sug-gested that these errors would only have a minor impact ifany on the detection of positive selection especially as the

coagulation factor sequences were long and the model ofcodon substitutions was averaged over all columns (manyhundreds) and sequences (between 23 and 47)

Stability Effect of Mutations

When possible for each coagulation factor the molecularstructures of the domain(s) were retrieved directly fromthe CATH database (Sillitoe et al 2013) Prior to the muta-tional analyses each structure was repaired with the FoldXprogram (Schymkowitz et al 2005) to remove potential stericclashes FoldX utilizes an empirical force field model to esti-mate the stability of a protein (Gwildtype in kcalmol) FoldXwas used to estimate the stability effect on each residueposition when the residue was replaced by any of the 19other amino acids The stability effect is determined bythe difference in the wildtype and mutant energies(G = GmutantGwildtype) This resulted in a stabilitylandscape of all potential mutations including the Gvalues both for the disease mutations and for all the aminoacid substitutions observed across vertebrate evolution

Statistical Analyses

All statistical analyses were performed with R v302 (R CoreTeam 2013) and SigmaPlot (Systat Software San Jose CA)The pipeline was developed with Biopython scripts (Cocket al 2009) The visualization of the multiple sequence align-ments was performed using Jalview 28 (Waterhouse et al2009)

Data Availability

The final sequence alignments are provided on a webpagehttpwwwbiochemuclacukpavithraevolutioncoagulationindexphp (last accessed January 2014) Pathogenic mu-tation data for three human coagulation factors FVIII FIX andFXI were retrieved from our web databases at httpwwwfactorviii-dborg httpwwwfactorixorg and httpwwwfactorxiorg (last accessed January 2014) (Saunders et al2009 Rallapalli et al 2013)

Supplementary MaterialSupplementary tables S1ndashS2 and figures S1ndashS3 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

PMR and SJP thank Pfizer UK for a medical educationalgrant that generously funded this project and the SpecialTrustees of the Royal Free Hospital the KatharineDormandy Trust for Hemophilia and Related Disorders andthe European Association for Hemophilia and AssociatedDisorders for additional support The authors also thank DrKeith Gomez Dr Geoffrey Kemball-Cook and ProfessorEdward G Tuddenham for their generous collaborative sup-port of our studies including access to the FVIII mutations atthe Hemophilia A Database at httpwwwfactorviii-dborgRAS acknowledges funding from the Fondation du 450eme

3054

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

anniversaire de lrsquoUniversite de Lausanne and Swiss NationalScience Foundation grants 132476 and 136477

ReferencesAmara U Rittirsch D Flierl M Bruckner U Klos A Gebhard F Lambris

JD Huber-Lang M 2008 Interaction between the coagulation andcomplement system Adv Exp Med Biol 63271ndash79

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19950ndash958

Broze GJ Jr 1995 Tissue factor pathway inhibitor and the revised theoryof coagulation Annu Rev Med 46103ndash112

Brunder W Schmidt H Karch H 1997 EspP a novel extracellular serineprotease of enterohaemorrhagic Escherichia coli O157H7 cleaveshuman coagulation factor V Mol Microbiol 24767ndash778

Butenas S vanrsquot Veer C Cawthern K Brummel KE Mann KG 2000Models of blood coagulation Blood Coagul Fibrinolysis 11(Suppl 1)S9ndashS13

Castresana J 2000 Selection of conserved blocks from multiple align-ments for their use in phylogenetic analysis Mol Biol Evol 17540ndash552

Cock PJ Antao T Chang JT Chapman BA Cox CJ Dalke A Friedberg IHamelryck T Kauff F Wilczynski B et al 2009 Biopython freelyavailable Python tools for computational molecular biology andbioinformatics Bioinformatics 251422ndash1423

de Beer TA Laskowski RA Parks SL Sipos B Goldman N Thornton JM2013 Amino Acid changes in disease-associated variants differ rad-ically from variants observed in the 1000 genomes project datasetPLoS Comput Biol 91ndash15

Dong Y Dimopoulos G 2009 Anopheles fibrinogen-related proteinsprovide expanded pattern recognition capacity against bacteriaand malaria parasites J Biol Chem 2849835ndash9844

Doolittle RF 2009 Step-by-step evolution of vertebrate blood coagula-tion Cold Spring Harb Symp Quant Biol 7435ndash40

Doolittle RF Jiang Y Nand J 2008 Genomic evidence for a simplerclotting scheme in jawless vertebrates J Mol Evol 66185ndash196

Fletcher W Yang Z 2010 The effect of insertions deletions and align-ment errors on the branch-site test of positive selection Mol BiolEvol 272257ndash2267

Flicek P Ahmed I Amode MR Barrell D Beal K Brent S Carvalho-SilvaD Clapham P Coates G Fairley S et al 2013 Ensembl 2013 NucleicAcids Res 41D48ndashD55

Junier T Zdobnov EM 2010 The Newick utilitieshigh-throughput phy-logenetic tree processing in the UNIX shell Bioinformatics 261669ndash1670

Kawthalkar S 2013 Overview of Physiology of blood In essentials ofhaematology New Delhi (India) Jaypee Brothers Medical Publishers(P) Ltd p 1ndash51

Kolb AW Ane C Brandt CR 2013 Using HSV-1 genome phylogeneticsto track past human migrations PLoS One 81ndash9

Kosiol C Vinar T da Fonseca RR Hubisz MJ Bustamante CD Nielsen RSiepel A 2008 Patterns of positive selection in six Mammalian ge-nomes PLoS Genet 4e1000144

Lombardo F Ghani Y Kafatos FC Christophides GK 2013Comprehensive genetic dissection of the hemocyte immune re-sponse in the malaria mosquito Anopheles gambiae PLoS Pathog9e1003145

Lopez C Saravia C Gomez A Hoebeke J Patarroyo MA 2010Mechanisms of genetically-based resistance to malaria Gene 4671ndash12

Loytynoja A 2014 Phylogeny-aware alignment with PRANK MethodsMol Biol 1079155ndash170

Loytynoja A Goldman N 2008 Phylogeny-aware gap placement pre-vents errors in sequence alignment and evolutionary analysisScience 3201632ndash1635

Masel J 2011 Genetic drift Curr Biol 21R837ndashR838Matsuka YV Pillai S Gubba S Musser JM Olmsted SB 1999 Fibrinogen

cleavage by the Streptococcus pyogenes extracellular cysteine

protease and generation of antibodies that inhibit enzyme proteo-lytic activity Infect Immun 674326ndash4333

Meyer A Van de Peer Y 2005 From 2R to 3R evidence for a fish-specificgenome duplication (FSGD) Bioessays 27937ndash945

Montoya-Burgos JI 2011 Patterns of positive selection and neutral evo-lution in the protein-coding genes of Tetraodon and Takifugu PLoSOne 6e24800

Moretti S Laurenczy B Gharib WH Castella B Kuzniar A Schabauer HStuder RA Valle M Salamin N Stockinger H et al 2014 Selectomeupdate quality control and computational improvements to a data-base of positive selection Nucleic Acids Res 42D917ndashD921

Morrissey JH Choi SH Smith SA 2012 Polyphosphate an ancient mol-ecule that links platelets coagulation and inflammation Blood 1195972ndash5979

Nathwani AC Tuddenham EG Rangarajan S Rosales C McIntosh JLinch DC Chowdary P Riddell A Pie AJ Harrington C et al 2011Adenovirus-associated virus vector-mediated gene transfer in hemo-philia B N Engl J Med 3652357ndash2365

Nielsen R Bustamante C Clark AG Glanowski S Sackton TB Hubisz MJFledel-Alon A Tanenbaum DM Civello D White TJ et al 2005 Ascan for positively selected genes in the genomes of humans andchimpanzees PLoS Biol 3e170

Panizzi P Friedrich R Fuentes-Prior P Richter K Bock PE Bode W 2005Fibrinogen substrate recognition by staphylocoagulase(pro)throm-bin complexes J Biol Chem 131179ndash1187

Proux E Studer RA Moretti S Robinson-Rechavi M 2009 Selectome adatabase of positive selection Nucleic Acids Res 37D404ndashD407

Puy C Tucker EI Wong ZC Gailani D Smith SA Choi SH Morrissey JHGruber A McCarty OJ 2013 Factor XII promotes blood coagulationindependent of factor XI in the presence of long-chain polypho-sphates J Thromb Haemost 111341ndash1352

R Core Team 2013 R a language and environment for statistical com-puting Vienna Austria R Foundation for Statistical Computing

Rallapalli PM Kemball-Cook G Tuddenham EG Gomez K Perkins SJ2013 An interactive mutation database for human coagulationfactor IX provides novel insights into the phenotypes and geneticsof hemophilia B J Thromb Haemost 111329ndash1340

Roux J Privman E Moretti S Daub JT Robinson-Rechavi M Keller L2014 Patterns of positive selection in seven ant genomes Mol BiolEvol 311661ndash1685

Saunders RE Shiltagh N Gomez K Mellars G Cooper C Perry DJTuddenham EG Perkins SJ 2009 Structural analysis of eight noveland 112 previously reported missense mutations in the interactiveFXI mutation database reveals new insight on FXI deficiencyThromb Haemost 102287ndash301

Schymkowitz J Borg J Stricher F Nys R Rousseau F Serrano L 2005 TheFoldX web server an online force field Nucleic Acids Res 33W382ndashW388

Sillitoe I Cuff AL Dessailly BH Dawson NL Furnham N Lee D Lees JGLewis TE Studer RA Rentzsch R et al 2013 New functionalfamilies (FunFams) in CATH to improve the mapping of con-served functional sites to 3D structures Nucleic Acids Res 41D490ndashD498

Spronk HM Govers-Riemslag JW ten CH 2003 The blood coagulationsystem as a molecular machine Bioessays 251220ndash1228

Stefl S Nishi H Petukh M Panchenko AR Alexov E 2013 Molecularmechanisms of disease-causing missense mutations J Mol Biol 4253919ndash3936

Storey JD Tibshirani R 2003 Statistical significance for genomewidestudies Proc Natl Acad Sci U S A 1009440ndash9445

Studer RA Penel S Duret L Robinson-Rechavi M 2008 Pervasive pos-itive selection on duplicated and nonduplicated vertebrate proteincoding genes Genome Res 181393ndash1402

Studer RA Robinson-Rechavi M 2009a Evidence for an episodic modelof protein sequence evolution Biochem Soc Trans 37783ndash786

Studer RA Robinson-Rechavi M 2009b Large-scale analyses of positiveselection using codon models In Pontarotti P editor Evolutionarybiology Berlin (Germany) Springer-Verlag Berlin Heidelbergp 217ndash235

3055

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Sun H 2006 The interaction between pathogens and the host coagu-lation system Physiology 21281ndash288

Sutherland MR Ruf W Pryzdial EL 2012 Tissue factor and glycoproteinC on herpes simplex virus type 1 are protease-activated receptor 2cofactors that enhance infection Blood 1193638ndash3645

Thomas CE Ehrhardt A Kay MA 2003 Progress and problemswith the use of viral vectors for gene therapy Nat Rev Genet 4346ndash358

Tokuriki N1 Stricher F Schymkowitz J Serrano L Tawfik DS 2007 Thestability effects of protein mutations appear to be universally dis-tributed J Mol Biol 3691318ndash1332

Vallender EJ Lahn BT 2004 Positive selection on the human genomeHum Mol Genet 13R245ndashR254

Vilella AJ Severin J Ureta-Vidal A Heng L Durbin R Birney E 2009EnsemblCompara GeneTrees complete duplication-aware phyloge-netic trees in vertebrates Genome Res 19327ndash335

Waterhouse AM Procter JB Martin DM Clamp M Barton GJ 2009Jalview Version 2mdasha multiple sequence alignment editor and anal-ysis workbench Bioinformatics 251189ndash1191

Wong WS Yang Z Goldman N Nielsen R 2004 Accuracy and power ofstatistical methods for detecting adaptive evolution in proteincoding sequences and for identifying positively selected sitesGenetics 1681041ndash1051

Yang Z 2006 Computational molecular evolution Oxford UniversityPress Oxford

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 241586ndash1591

Yang Z Wong WS Nielsen R 2005 Bayes empirical Bayes inference ofamino acid sites under positive selection Mol Biol Evol 221107ndash1118

Zhang J Nielsen R Yang Z 2005 Evaluation of an improved branch-sitelikelihood method for detecting positive selection at the molecularlevel Mol Biol Evol 222472ndash2479

3056

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

In order to interpret the effect of evolutionary and muta-tional changes in coagulation we have examined FVIII FIXand FXI For FVIII 5474 disease-causing mutations havebeen recently compiled (Rallapalli PM Kemball-Cook GTuddenham EG Gomez K Perkins SJ unpublished datahttpwwwfactorviii-dborg last accessed January 2014)3713 mutations were recently published for FIX (Rallapalliet al 2013 httpwwwfactorixorg last accessed January2014) and 487 mutations are known for FXI (Saunders et al2009 httpwwwfactorxiorg last accessed January 2014) Weidentified 47 full genomes in the Ensembl database thatshowed good sequence coverage This enabled the assessmentof positive or negative selection pressures (Yang 2006) in theevolution of coagulation protein sequences The identificationof positive selection in at least nine of the 11 coagulationproteins during different periods of evolution showed thatan effective coagulation pathway was under considerableadaptive constraints even in primates The mutational analy-ses for FVIII FIX and FXI showed that disease-causing muta-tions primarily affected highly conserved residues undernegative selection and damaged protein stabilities This jointevolutionary-mutational study provides novel clarifications ofthe observed data on pathogenic mutations and may facilitatenew gene therapy approaches for their treatments

Results

Selection of 47 Genomes

A data set of 14 coagulation factor genes (table 1) that codefor 11 proteins across different groups of vertebrate genomes(fig 2) was identified from the Ensembl database TheEnsembl database currently holds 78 genomes Of theseonly 47 vertebrate genomes were chosen based on their se-quence coverage and assembly quality (fig 2) Ensembl genesets are built from assemblies of DNA sequences and proteininformation hence it was important to select only good-quality assemblies for the gene trees and sequence data setsused in this study The 47 vertebrate genomes were classifiedinto five major clades based on the Ensembl classificationnamely Primates Glires Laurasiatheria Sauropsida andFishes The mammalian clade of 30 genomes consisted ofthree clades used in our study (Primates Glires andLaurasiatheria) and other clades (Atlantogenata andAustralian mammals) (fig 2) The codon substitution site-models M1a M2a M7 M8 and M8a from CodeML(Materials and Methods) were used to calculate the selectionpressures on the five major clades each with at least sevensequences to ensure that the CodeML calculations were rea-sonably powerful (Anisimova et al 2002) By following this

FIG 1 Schema of the blood coagulation pathway leading to fibrin The relationships between the 11 coagulation factors that are coded by 14 genes areshown in the modern revised coagulation pathway (blue enzymes red cofactors) Tissue factor (TF also known as FIII) initiates coagulation when itbinds and activates FVII The activated TF-FVIIa complex then activates FX which then generates activated thrombin (FIIa) FVIII FIX and FXI enable theamplification of the coagulation pathway to maximize FIIa production and possess the largest number of pathogenic human mutations Activatedthrombin cleaves FG (FI green) to form fibrin polymers (blood clots) that are cross-linked by FXIIIa

3042

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

criterion the Atlantogenata clade with only five sequenceswas eliminated from the selection pressure calculations Themultiple sequence alignments from Ensembl were realignedusing a strict quality control procedure (supplementary figS1 Supplementary Material online) which was performed inorder to minimize the number of false positive results below

Presence and Absence of Coagulation Factors

The 47 genomes were searched for the 14 genes coding forthe 11 coagulation factors In cases where the Ensembl genetree for a given coagulation factor did not show the presenceof a particular genome we searched for putative genes acrossthat genomic sequence to establish whether the organismhad lost them The purpose of the coagulation cascade is toconvert FG into the fibrin clot and unsurprisingly FG (theFGA FGB and FGG genes) was found in all 47 genomes Theother central genes thrombin (F2) tissue factor (F3) the

coagulation cascade cofactors (F5 and F8) F7 F9 and F10were present across all 47 genomes from fishes to humansF11 is present in all Sarcopterygii (tetrapods and coelacanth)but seems to have been lost or highly diverged in teleostfishes because it is only recorded in the spotted garLepisosteus oculatus which belongs to Holostei a sisterclade of teleost fishes F12 appeared in the ancestor of tetra-pods and is absent in fishes but present in amphibians(Xenopus tropicalis) reptiles (Anolis carolinensis) and mam-mals For F13 the alpha chain (F13A) was observed to appearfrom the time of vertebrates because this gene has beenfound in the sea lamprey fishes and tetrapods The betachain (F13B) appears to be present at the origin of vertebratesbecause F13B sequences have been recorded in cave fish(Astyanax mexicanus) and medaka (Oryzias latipes)Occasionally a gene was not reported this was most likelyto arise from incomplete or unidentified sequences

FIG 2 Phylogenetic tree of 47 vertebrate genomes The tree shows the 47 genomes studied here alongside their clade and taxonomic groups The 47vertebrate genomes showed good sequence quality and coverage in the Ensembl database and were grouped into five major clades (Primates GliresLaurasiatheria Sauropsida and Fishes) and three minor clades The Mammals comprised of the Primates Glires and Laurasiatheria clades together withAtlantogenata (African origin mammals and Xenartha) and four Australian species totaling 30 out of 47 genomes

3043

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Selective Pressures Using Codon SubstitutionSite-Models

Codon substitution site-models from the CodeML packagewere used to analyze the protein-coding sequences in the 14genes following the quality control of their alignments Themodels assume that selective pressure does not act equally onthe entire protein sequence but varies between amino acidsites in a gene family We calculated the log likelihood values(lnL) of five different selection models (M1a M2a M7 M8and M8a) in order to perform the likelihood ratio test (LRT)The five models differed from each other in the number ofparameters (np) considered and their variable values Thevalue is the ratio of nonsynonymous and synonymous sub-stitution rates (dNdS) (Materials and Methods) The compar-ison between a null model that does not consider 4 1(M1a M7 or M8a) and an alternate model that considers4 1 (M2a or M8) is a measure of positive selectionIf the null (neutral) model is rejected in favor of thealternate (selection) model positive selection is inferred(Yang 2006)

To evaluate the five models the LRT value was calculatedas follows

LRT frac14 2ethlnLalternate lnLnullTHORN

where lnL is the maximum-likelihood estimate of the proba-bilities of selection The LRT is asymptotically distributed as a2

k function where k (also known as the degrees of freedom) isthe difference between the np in the null and alternatemodels The np values were calculated along with the lnLvalues in CodeML

k frac14 npalternate npnull

The LRT values for which the 2k distribution shows a P

valuelt 005 indicated that some sites were significantlyunder positive selection For this study the comparison ofthe M8a null model with the M8 alternate model (referred toas M8a-M8) was considered optimal to infer positive selection(table 2) The LRT M1a-M2a is too conservative whereas M7ndashM8 was considered to be more problem prone and less ac-curate compared with M8a-M8 (Wong et al 2004 Studer andRobinson-Rechavi 2009b) Because the P values result frommultiple tests performed using different genes at differentclade levels it was necessary to control for the expected pro-portion of false positives False positives were accounted forby calculating the q value correction over the P value for thedifferent genes in each clade this approach being both pow-erful and specific (Studer et al 2008)

The overall picture is that a total of nine coagulation factorproteins (out of the 11 of the cascade) underwent positiveselection at one or more stages in evolution (fig 3) Thehighest number of genes under positive selection within anindividual clade is five (for vertebrates and mammals) ForVertebrates using the LRT M8a-M8 positive selection wasfound in as many as five of the 14 coagulation genes (table2) In figure 3a these included F2 F5 F7 F10 and F13A for all47 vertebrates In figure 3b these included FGA F3 F5 F8 andF9 for the 30 mammals In figure 3cndashe these included FGB F5

and F8 (Primates) FGA and F13B (Laurasiatheria) and F3(Sauropsida) FG (FG = FGA + FGB + FGG) and F5 were ob-served to undergo the most extensive positive selection com-pared with any other coagulation genes The selectivepressure on all the genes in different clades was identifiedfrom the LRT values (table 2) except for those of F11 inLaurasiatheria Sauropsida and Fishes F12 in Sauropsida andFishes and F13B in Fishes because these genes wereabsent (supplementary table S1 Supplementary Materialonline)

Estimation of Selective Pressures Using Branch-SiteModel of Codon Substitution

The above site models estimated selective pressures that varyamong amino acid positions across multiple species But se-lective pressures can also vary among species (Studer andRobinson-Rechavi 2009a) and codon substitution branch-site models have been developed to detect such episodicpositive selection (Zhang et al 2005) Here we have usedthe results of the branch-site model from two differentsources one from the Selectome database and the otherfrom our own calculations Both approaches employedCodeML on the two data sets of this study (Materials andMethods) in order to check for consistency with the positiveselection identified in the site-model calculations (fig 3)Similar to the site model the false positives in our branch-site model calculations were accounted for by calculating theq value correction over the P value for the different genes ineach clade For both branch-site calculations we followed thetaxonomic grouping in the Selectome database (supplemen-tary table S2b Supplementary Material online) The highestorder of classification is the Sarcopterygii (coelacanths lung-fish and tetrapods) followed by its largest subgroupTetrapoda (mammals sauropsids and frogs) Other majortaxonomic subtrees were those of amninots sauropsidsmammals theria and eutheria The Actinopterygii (fishes)and Sauropsida (birds) are common to both the calculationsThe Eutheria and mammalian subtree calculations from theSelectome database were considered to be equivalent to themammalian clade of our analyses (fig 2) Positive selectionacross the vertebrates was observed for all genes except FGGF3 F7 and F11 in our branch-site model calculations (supple-mentary table S2b Supplementary Material online) We ob-tained similar results with our site-model calculations(supplementary table S2a Supplementary Material online)except for the exclusion of F3 and F7 (fig 3a) Our branch-site calculations identified positive selection in mammalianFGB F2 F8 and F9 whereas the site model identified positiveselection in mammalian FGA F3 F5 F8 and F9 Overall FG(FGA FGB) F8 and F9 were consistently identified to be pos-itively selected in mammals whereas F5 shows positive selec-tion up to amniotes after which the mammals and thesauropsids clade split In conclusion the two branch-siteand the site models all suggested signatures of positive selec-tions across F8 and F9 whereas no such selection wasobserved across and between F11

3044

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Identification of Codon Sites under Positive Selection

For up to 11 genes out of 14 identified to be under positiveselection (table 3) further analyses were performed to identifywhich codon sites were under such selection This was doneusing the Bayes Empirical Bayes (BEB) method from the M8site-model (Yang et al 2005) The number of codons poten-tially under positive selection (BEB 450) in the positivelyselected genes ranged between 3 and 64 But the number ofcodons accurately predicted under positive selection (BEB495) is actually much lower (between 0 and 4) The FGAgene in Laurasiatheria showed the highest percentage of pos-itively selected codons (1329 as estimated by CodeML and64 codons detect with two at BEB495) followed by the F5gene in Primates (1321 with 54 codons all BEBlt95 ) Thesmallest percentage of positively selected codons was ob-served in the F10 gene of Vertebrates (001 with threecodons all with BEB lt95) No positive selection was ob-served in any of the codon sites for the FGG F11 and F12genes in the taxa groups of this study

Table 2 LRT Statistics for the Site Model Comparisons of Model 8with Model 8a for the 14 Coagulation Factors among the Five Clades

Gene Clade No ofSequences

No ofCodons

(M8a-M8)

LRT P value q value

FGA Vertebrates 44 236 139 024 035Mammals 29 288 775 001 004Primates 7 865 335 007 016

Glires 7 670 289 009 019Laurasiatheria 9 267 728 001 004

Sauropsida 6 464 033 057 062Fishes 7 592 016 100 072

FGB Vertebrates 40 429 145 023 035Mammals 26 456 278 100 072Primates 6 488 744 001 004

Glires 6 471 069 041 051Laurasiatheria 8 479 144 023 035

Sauropsida 5 473 002 100 072Fishes 7 476 099 100 072

FGG Vertebrates 44 386 174 100 072Mammals 29 427 319 007 017Primates 7 433 398 005 013

Glires 7 433 344 006 016Laurasiatheria 8 432 009 076 072

Sauropsida 5 427 003 086 072Fishes 8 395 542 100 072

F2 Vertebrates 38 475 1691 000 000Mammals 23 571 464 003 011Primates 7 621 425 004 012

Glires 5 611 007 100 072Laurasiatheria 7 620 158 021 034

Sauropsida 5 593 368 005 014Fishes 8 562 005 100 072

F3 Vertebrates 47 166 209 015 027Mammals 26 240 1337 000 000Primates 7 295 001 090 072

Glires 6 287 307 008 018Laurasiatheria 7 292 097 032 044

Sauropsida 5 254 865 000 002Fishes 14 192 082 036 047

F5 Vertebrates 37 1182 1495 000 000Mammals 22 1741 3192 000 000Primates 6 2116 536 002 008

Glires 6 1847 048 049 056Laurasiatheria 6 2043 483 003 010

Sauropsida 5 1550 145 023 035Fishes 8 1534 212 015 027

F7 Vertebrates 34 330 593 001 007Mammals 29 344 306 008 018Primates 7 443 016 069 072

Glires 7 417 03 100 072Laurasiatheria 7 443 057 045 054

Sauropsida 3 425 055 046 054Fishes NA

F8 Vertebrates 38 1028 485 003 010Mammals 24 1985 1119 000 001Primates 7 2351 683 001 004

Glires 5 2257 395 005 013Laurasiatheria 7 2344 025 062 067

Sauropsida 5 1480 003 086 072Fishes 7 1413 09 034 045

F9 Vertebrates 44 329 063 043 052Mammals 24 443 1857 000 000Primates 7 461 128 026 036

Glires 5 468 185 017 030

(continued)

Table 2 Continued

Gene Clade No ofSequences

No ofCodons

(M8a-M8)

LRT P value q value

Laurasiatheria 7 459 434 004 012Sauropsida 5 435 003 100 072

Fishes 13 368 036 100 072

F10 Vertebrates 41 353 1455 000 000Mammals 26 424 023 063 067Primates 7 485 001 091 072

Glires 6 464 206 015 027Laurasiatheria 7 463 23 013 025

Sauropsida 5 385 003 087 072Fishes 8 391 071 040 051

F11 Vertebrates 23 614 034 056 062Mammals 23 614 034 056 062Primates 7 625 005 082 072

Glires 5 619 007 079 072Laurasiatheria 6 625 002 089 072

Sauropsida NAFishes NA

F12 Vertebrates 27 512 398 100 072Mammals 25 533 018 100 072Primates 7 615 002 100 072

Glires 6 557 007 100 072Laurasiatheria 8 513 46 003 011

Sauropsida NAFishes NA

F13A Vertebrates 47 631 907 000 002Mammals 26 684 258 011 022Primates 7 732 162 020 034

Glires 6 730 147 100 072Laurasiatheria 9 730 027 100 072

Sauropsida 5 732 001 092 072Fishes 14 643 259 011 022

F13B Vertebrates 31 609 13 025 036Mammals 25 640 157 021 034Primates 7 660 122 027 037

Glires 5 634 021 100 072Laurasiatheria 7 656 708 001 004

Sauropsida 5 573 093 100 072Fishes NA

NOTEmdashPositive selection is denoted in bold in column 2

Plt 005 Plt 001

3045

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Disease-Causing Missense Mutations and Relationshipto Evolutionary Pressures

The relationship between selective pressure and the proba-bility of disease-causing mutations at a codon site is of greatinterest Disease-causing missense mutations have been re-ported in all 14 coagulation genes in particular forHemophilia A (F8) Hemophilia B (F9) factor XI deficiency(F11) and thrombophilia (F2) (table 1) Extensive compila-tions of 487ndash5474 disease-causing mutations are available forthree of these coagulation factors FVIII FIX and FXI(Saunders et al 2009 Rallapalli et al 2013 Rallapalli et al2014 [unpublished data]) There is a sufficient number ofpathogenic mutation and evolutionary selection data fromthe FVIII FIX and FXI proteins to enable us to compare instatistical detail the correlation between selective pressureand disease-causing mutations

For the comparison of the FVIII FIX and FXI proteinchanges selective pressures were identified using the BEB in-ference method from model M2a (fig 4) While powerful a

disadvantage of Model M8 is that it classifies sites into 11categories (ten in negativeneutral and one positive selec-tion) These 11 categories can vary between different genefamilies making the comparison difficult Thus model M2a isadvantageous in that this has only three distinct categoriesnamely negative selection (dNdSlt 1) neutral evolution (dNdS = 1) and positive selection (dNdS4 1) Under the M1a-M2a test in the primates clade two genes (F8 and F9) showedpositive selection whereas F11 did not show positive selec-tion which differs from that shown in figure 3c which is basedon the M8 model The posterior probabilities of neutral neg-ative and positive selection for the FVIII FIX and FXI proteinsshowed that positive selection occurred in specific codonsthe most being for F8 and the fewest for F11 For completionthe same analysis for the other 11 genes is shown in supple-mentary table S2 Supplementary Material online These showsimilar outcomes to those for FVIII FIX and FXI

For each amino acid position in FVIII FIX and FXI wecompared its selective pressure (positive neutral or negative)

FIG 3 Positively selected genes of the coagulation cascade in vertebrates mammals and three major clades The layout of this figure is a simplifiedrepresentation of figure 1 Genes showing significant positive selection in the LRT calculations are highlighted in red (a) Positive selection was observedin five genes when all 47 genome sequences (vertebrates) were analyzed as a single group (b) In the Mammals group (fig 2) where positive selection isseen for five genes only the (c) Primates and (d) Laurasiatheria clades are shown with positive selection for two or three genes only because no positiveselection was seen in the Glires clade whereas insufficient sequences were available in the Atlantogenata clade for conclusions to be drawn (e) Positiveselection was seen in the Sauropsida clade but no positive selection was observed in Fishes (fig 2)

3046

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

with the reported number of disease-causing mutations Theoutcome was presented using regression plots (fig 5)The regression plots and their correlation coefficients revealedthe relationship between the selection probability values atamino acid sites with the disease-causing mutations at thosesites The average probability of a site to be under positiveselection and its number of disease-causing mutationsshowed a negative relationship with correlation coefficientsof085067 and095 in FVIII FIX and FXI respectively(fig 5 top row) The correlation coefficients for neutral selec-tion followed that of positive selection with values of 087064 and 096 for FVIII FIX and FXI respectively (fig 5middle row) In distinction to these a positive relationshipwas seen between the average probability of negative selec-tion and the disease-causing mutations with correlation co-efficients of +086 +066 and +095 for FVIII FIX and FXIrespectively (fig 5 bottom row) To summarize the fewer thenumber of disease mutations at a given site the higher is theprobability for positive selection or neutral evolution (fig 5

top row and middle row) The greater the number of disease-causing mutations at a given site the higher the probabilityfor negative selection (fig 5 bottom row) The comparisonbetween the positively selected proteins FVIII and FIX withthe nonpositively selected protein FXI is noteworthyIrrespective of the selective pressure all three proteins ex-hibited similar regression lines in each of the three rowsThese comparisons show that there is a general inverse pro-portionality relationship between the probability of positiveselection (or neutral evolution) and the number of disease-causing mutations at each site This confirms previous obser-vations that disease-causing mutations tend to occur at crit-ical sites in the protein and that these highly conserved sitesare less tolerant to new mutations (negative selection)

Stability Effect of Mutations

In order to explain the consequences of disease-causing mu-tations on the protein structure FoldX software was used to

Table 3 Positively Selected Sites in the M8a-M8 Comparison

Gene Clade No ofCodons

Analyzed

No of Codonsunder positive

selection

PercentageEstimated byCodeML ()

Sites under Positive Selection (BEB4 50)

FGA Mammals 288 13 717 71832353951101108178212219246273

FGA Laurasiatheria 267 64 1329 467121929303336405293949697100102103108109116117118123136138141143155171180182195208210213214216218219220222223225227229231232233234236239242244245250253254

256259 260261264265

FGB Primates 488 17 822 394155808182130134135137173290367388413415448

F2 Vertebrates 475 5 099 5257151162381

F3 Mammals 240 11 663 55727691115177179185195230240

F3 Sauropsida 254 13 1060 340414474798081110123170208249

F5 Vertebrates 1182 10 087 1032755745785936027779269951010

F5 Mammals 1741 54 272 237243660134389392656663667669 676680718720731745749765768782809819822824875

88089489590791191692893094596797698910161040107610801089113611391324147114791515

152215481564

F5 Primates 2116 54 1321 74052129180211336341408409434 59266070370772075478487787988889290794195997810331039

10591131116812051209121812201222123012311232123612441254126212651279128514981504

151915261667168218491918

F7 Vertebrates 330 12 522 56526688162167178179209258262

F8 Mammals 1985 27 398 7245281339342521667718765784785792807817889912113011931197

12201229128813081333168619671984

F8 Primates 2351 33 063 42275579181485086088692796496697998899610191071118012821312

136114141421143614381459152715821613162216501689173023402349

F9 Mammals 443 15 440 45100114140150177181199202206209270309348

F10 Vertebrates 353 3 001 14470

F13A Vertebrates 631 6 126 398489497502550613

F13B Laurasiatheria 656 18 152 5567378107341349350374443459461497522525539540624

P4 95 P4 99

3047

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

estimate the stability effect (G) of residue changes on thethree-dimensional structures of human FVIII (chains A and B)FIX and FXI (fig 6) FoldX was designed to predict the stabilityeffect of a mutation This is an empirical method that relies onthe estimation of physical parameters from more than 1000mutations in protein structures and has been evaluated with21 proteins (Schymkowitz et al 2005 Tokuriki et al 2007)First each amino acid at each position was replaced with eachof the other 19 amino acids in order to provide a baseline forcomparison As expected the prediction of all virtual muta-tions shows a distribution skewed toward destabilization (me-dians G range between 051 and 123 kcalmol) Next theeffect of changing the human amino acids with those seen inevolution for the other 46 sequences of this study was calcu-lated in order to examine the effect of evolutionary change onprotein stabilities This produced a narrow distribution pre-dominantly centered on a difference of stability G of0 kcalmol and showing a slight skew toward deleterious mu-tations (medians G range between 014 and 030 kcalmol) This outcome is as expected because evolutionarychanges in the amino acids are selected in order to preservethe stability of the protein structures Finally the effect ofchanging the human amino acids to those observed inunique pathogenic missense mutations was calculated Thisanalysis produced a clearly skewed distribution (mediansG range between 101 and 295 kcalmol) of G

values that ranged as high as 10 kcalmol (and even higherbut these higher values were clipped for easier visualization)this agrees with the fact that these mutations affect the pro-tein structure The first distribution for the virtual mutationsis intermediate between the latter two distributions asexpected

The same outcome of stability effects was also observed atthe individual amino acid level (fig 7) When the effect of 19amino acid replacements were computed at each position forFVIII FIX and FXI and the majority of replacements showed adestabilizing effect (410 kcalmol in red) A small propor-tion of replacements stabilized the structure (lt10 kcalmol in blue) indicating the ability of the protein to compen-sate deleterious mutations by further mutation Although thisappears to occur more strongly for the FVIII chains A and B(fig 7a and b) the proportion of allowed replacements is infact similar for all four proteins when their different sizes areconsidered When the effect of replacements by disease-caus-ing missense mutations was calculated and the G valuesshowed that these almost invariably destabilized the threeprotein structures

DiscussionIn this study we have assembled sequence data for 14 coag-ulation genes across 47 genomes in order to analyze the evo-lutionary changes in amino acids that take place between five

FIG 4 Probability of evolutionary selection across the Primates clade for the three proteins FVIII FIX and FXI The probabilities of negative neutral andpositive selection were plotted against the amino acid position in the three proteins The probability values were obtained from CodeML analyses of theseven primate sequences (gray negative selection yellow neutral selection green positive selection)

3048

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

major clades and several minor ones These changes werecompared directly with disease-causing mutation data forthree coagulation proteins FVIII FIX and FXI Good-qualitypathogenic mutational data are publicly available for thesethree proteins and the bleeding disorders caused by thesethree sets of mutational data correspond to the three mostprevalent bleeding disorders The outcome of these detailedanalyses confirms that positive selection occurred during dif-ferent stages in the evolution of several coagulation proteinsIt was also observed when comparing the protein structurestabilities that the pathogenic mutations have a destabilizingeffect on the protein structure as opposed to the stable mu-tations that are tolerated during evolution (fig 6) The com-bination of positive selection analyses with those forpathogenic mutations provides information about predictingdisease causing mutations and clarifies fresh approaches fortherapeutic interventions in bleeding disorders

Selective Pressures in Vertebrate Evolution

Evolutionary changes combine random mutational eventswith natural selection The probability of a mutation to befixed in a population depends on its consequence on thefitness of an organism New adaptive mutations that are ben-eficial in terms of fitness are more likely to be retained bypositive selection than neutral mutation Neutral mutations

that are neither beneficial nor detrimental are randomly re-tained or removed by genetic drift depending on the popu-lation size (neutral selection) Detrimental mutations aremore likely to be removed from the population by negative(or purifying) selection There are no strict patterns and adeleterious mutation can sometimes be beneficial to oneaspect of the organism The textbook example of this issickle-cell anaemia (drepanocytosis) which is a bleeding dis-order caused by the pGlu6Val mutation in the hemoglobin gene Mutations responsible for sickle-cell anaemia are gen-erally removed by natural selection In malaria-infested pop-ulations particularly in tropical or subtropical regions thesemutations are selected for in one allele because they offerpartial protection against the pathogen (Lopez et al 2010)The malaria-infected sickle cells inhibit infection by the ma-laria plasmodium in heterozygous cases (beneficial) but causesickle-cell anaemia in homozygous cases (deleterious)

Genes corresponding to proteins in direct contact with theenvironment are more likely to be under adaptive selectionThus genes involved in host-pathogen interactions are com-monly associated with positive selection Large-scale analysesrevealing positive selection in vertebrates or arthropods havebeen reported for genes involved with immunity (Nielsenet al 2005 Studer et al 2008 Montoya-Burgos 2011 Rouxet al 2014) In the present detailed study positive selection inseveral genes of the coagulation cascade was identified at

FIG 5 Relationship between disease-causing mutations and evolutionary pressures The correlation between evolutionary selection and the number oftimes a disease-causing mutation occurs at each amino acid position is shown In the columns (a) FVIII (b) FIX and (c) FXI correspond to the threecoagulation factors for which pathogenic mutational information is available in sufficient quantity from our three mutational databases The three rowscorrespond to the posterior probability (based on the BEB model of CodeML) of positive selection neutral evolution and negative selection forprimates

3049

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

different stages in evolution (fig 3) An earlier study identifiedpositive selection across six mammalian genomes (humanchimpanzee macaque mouse rat and dog) using LRTsbased on codon substitution models and false discoveryrates to evaluate multiple LRT comparisons (Kosiol et al2008) Although they found strong positive selection(Plt 005 and false discovery ratelt 005) in nine complementgenes only nominal positive selection (Plt 005) was identi-fied in four coagulation genes (F2 F12 TFPI PROC) none ofwhich were identified in figure 3b A limitation of this earlieranalysis may be the low number of six genomes used Notethat complement is related to coagulation in that the latterhas the ability to activate complement of innate immunitythat mediates host-pathogen interactions (Amara et al 2008)

Positive Selection of the Coagulation Proteins throughHost-Pathogen Interactions

As a consequence of the immune response blood coagula-tion is often exploited by pathogens for reason of infectiveand septic processes For example FG and FV are targeted bybacteria thus offering a straightforward explanation of posi-tive selection (fig 3) We discuss this in terms of FG and FIIfollowed by FIII and FVII and then FV and FX

By taking advantage of the increased number of availablegenome sequences positive selection has been identified inseveral coagulation genes during various stages in evolutionIn particular all three subunits FGA FGB and FGG of FG haveshown positive selection at periods during vertebrate evolu-tion according to both the site model and the branch-sitemodels Similarly prothrombin (FII) has shown positive selec-tion but only across all the vertebrate genomes and notacross any smaller clades Although the immune system isunder strong adaptive constraints because it needs to evolveconstantly against pathogens the coagulation system alsocomprises the first line of immune defense against bacteriato form a clot to block bacterial invasion (Sun 2006) Both FGand prothrombin (FII) are targeted by certain bacteria Forexample FG is targeted during the course of infection bycysteine protease a virulence factor of Streptococcus pyogenes(Matsuka et al 1999) and FG binding protein Efb ofStaphylococcus aureus Similarly FG-like proteins exist inAnopheles gambiae and are involved in the function of theimmune system against malaria or bacterial parasites (Dongand Dimopoulos 2009 Lombardo et al 2013) Thrombin in-teracts with staphylocoagulase a protein secreted by thehuman pathogen S aureus and activates prothrombin with-out proteolysis to form thrombin The resulting

FIG 6 The effect of amino acid replacements on the overall protein stability change G Data are shown for (a and b) FVIII A and B chains (c) FIX (d)FXI Three calculations were performed each being based on the sequences for which crystal structures are known (PDB codes 2R7E [FVIII] 2WPH [FIX]and 2F83 [FXI]) The FVIII A and B chains correspond to the N-terminal residues 19ndash760 and the C-terminal residues 1582ndash2351 that are observed in itscrystal structure The calculation represented by the dashed line indicates the distribution calculated for all 19 possible amino acid replacements Thatrepresented by the dotted line indicates the distribution calculated for all the evolutionary occurring replacements that were observed in our data set of47 genomes That represented by the continuous line indicates the distribution calculated for the disease-causing mutations from the FVIII FIX and FXImutation databases

3050

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

prothrombinmdashstaphylocoagulase complex binds to FG tocleave this into self-polymerizing fibrin This process is centralto the molecular pathology of S aureus endocarditis becausestaphylocoagulase bypasses physiological blood coagulationthrough the evasion of the recognition of prothrombin bycirculating thrombin inhibitors (Panizzi et al 2005)

Positive selection for FIII and FVII is attributed to a host-pathogen interaction with the herpes virus family These cellsurface molecules do not necessarily participate directly in animmune response but can still serve as a convenient gatewayfor pathogen entry (Vallender and Lahn 2004) Herpes virusespromote cellular surface infections by coevolving with theirhuman host indicating a mode by which pathogens exploitthe coagulation system (Sutherland et al 2012) A study of thehost-pathogen driven coevolution of herpes virus revealedthat this virus has existed from the time of mammals anddivided into the HSV-1 and HSV-2 types recently during

human evolution (Kolb et al 2013) These observations sug-gest that the pressure for FIII to evolve in early mammalsespecially through primates could be the reason for the pos-itive selection of FIII across the mammalian clade (fig 3) Wepropose that the FIIIndashFVII interaction is a significant drivingforce for positive selection in FVII

The FV and FVIII cofactors are homologous to each otherin their sequence and domain organization hence it is notsurprising to see similar selection pressures acting on them(fig 3c) A difference arises in the entire vertebrate genome(fig 3a) where FV shows positive selection whereas FVIII doesnot This difference between FV and FVIII across vertebratesmay result from the interaction of FV with Escherichia coliThe extracellular serine protease EspP from E coli cleaves FVto lead to its degradation and thus reduces the activity of thecoagulation cascade (Brunder et al 1997) The activated FVacofactor forms a FVaFXa complex with activated FX (fig 1)

FIG 7 The effect of amino acid replacements at the sequence level on the protein stability change G The residue stability changes are depicted ona seven point scale from highly destabilizing (red) to highly stabilizing (blue) for each of (a) FVIII chain A (b) FVIII chain B (c) FIX and (d) FXI Their PDBcodes are indicated in figure 6 In each panel the upper half shows the G values for all the possible 19 amino acid replacements at each residueposition that is part of the crystal structure and the lower half shows the G values for the disease-causing mutations taken from the mutationdatabases

3051

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

which is important for thrombin formation the most impor-tant step for fibrin clot formation Because both FVa and FXaare central to coagulation and are associated with bacterialproteolysis this offers an explanation for the signatures ofpositive selection in both proteins

FVIII FIX and FXIII are also associated with positive selec-tion however it is not clear what selective pressures are in-volved Although these proteins may also be targeted bypathogens no direct interactions have been reported so far

Absence of Positive Selection for FXI and FXIIFXII is unique in that it does not show positive selection eventhough it is involved in host-pathogen interactions with long-chain inorganic phosphatases (polyP) from microorganisms(Morrissey et al 2012) Small-chain polyP are produced in thehuman brain and by blood platelets Hence even during theabsence of pathogen-originated polyP the human host doesnot lack polyP and there is no pressure to evolve for this (Puyet al 2013) A F12 deficiency in patients does not cause bleed-ing disorders This means that the coagulation cascade can beinitiated even in the absence of F12 For similar reasons FXIshows no positive selection We hypothesize that this is be-cause first no interaction with the environment has beendemonstrated to date and second with respect to the polyPmechanism the coagulation cascade is capable of activatingin a FXI-independent manner

Relationship between Disease-Causing Mutations andSelection

The mechanisms of genetic drift mutational change andevolutionary change are closely related to each otherEvolutionary change is not only dependent on the geneticvariability from mutational change but also on genetic driftwhich describes the random fluctuations in the number ofgenetic variants in an organism Genetic drift and mutationsproduce random variations that selection can act upon Thenew changes incurred in one generation may be cancelled inanother generation It may therefore be important to deter-mine whether a new mutation is lost or becomes commonenough for selection to determine its fate (Masel 2011) In thisstudy we have examined the relationship between naturalselection acting on an amino acid and the disease-causingmutations reported at that amino acid position for thethree prominent coagulation factors FVIII FIX and FXI Wehave shown that the disease-causing mutations are morelikely to occur at a higher frequency at sites under negativeselection in all three proteins (fig 5) As sites under strongselective pressure are more likely to be critical to proteinfunction such a mutation can readily affect protein functionand trigger disease Lethal (or deleterious) mutations thatcause death are de facto not observed in our data set Bycontrast disease-causing mutations occur at a reduced fre-quency in sites under neutral or positive selection suggestingthat these sites are more tolerant to mutations than sitesunder negative selection Interestingly we have shown thatFVIII and FIX have undergone much positive selectionwhereas FXI has no such signatures (fig 4) Despite this dif-ference our results for three proteins show no change in the

relationship between evolutionary selection and disease-causing mutations (figs 5 and 6) In summary the reporteddisease-causing mutations show randomness and the muta-bility of a given amino acid position depends on the selectionpressure acting at that position and not on the entire protein

Effect of Disease-Causing Mutations on the Stability ofProtein Structure

Mutations that change the amino acid sequence can havestriking effects on protein stability Point mutations are re-lated to pathological and genetic conditions by reducing orabolishing coagulation protein function (table 1) Proteinfunction is related to its stability which is often measuredby the change in folding energy G of the protein structure(Stefl et al 2013) We have identified the effect of mutationson the stability of FVIII FIX and FXI in terms of G valuesThe stability effects of polymorphic amino acid changes inhuman sequences follow a Gaussian distribution (de Beeret al 2013) In that study disease-causing mutations alsoshowed the same distribution but slightly skewed towarddestabilization In this study the same trend for aminoacids substitutions in vertebrates were observed but this ismuch stronger for disease-causing mutations (fig 6) The dif-ference between the two studies may be explained by thedifferent methods used de Beer et al (2013) used the DiscreteOptimized Protein (DOPE) score of Modeller as a monitor ofglobal stability whereas we used FoldX (Materials andMethods) as an empirical method to infer directly the stabilityeffect G of a point mutation

Evolutionarily observed mutations are the most stable interms of their protein structure whereas disease-causing mu-tations destabilized the protein structure (figs 6 and 7) Eachsite may or may not tolerate a particular amino acid changedepending on the change in stability When each amino acidwas converted to each of the other 19 possible amino acids(fig 7 top panels) the G values showed that most aminoacids destabilize the structure whereas several may stabilize itThe disease-causing effect of a mutation may be affected notonly by the selection pressure acting on it but also by the typeof the amino acid replaced For the FVIII FIX and FXI bleedingdisorders destabilization was observed for most of the dis-ease-causing mutations

Utility of Evolutionary and Disease-CausingMutational Analyses for Clinical Diagnosis

Interactive locus-specific databases are useful tools for pre-senting patient mutational data for patient diagnosis andcare We plan to incorporate the above data sets on selectivepressures and stability effects into our mutational databasesfor FVIII FIX and FXI This information will provide insight forthe prediction of the effect of disease-causing mutations Thelevel of selective pressure was estimated under a maximum-likelihood framework Such methods are useful for geneticstudies but their results are sometimes difficult to interpretby a nonspecialist The presentation of this information needsto be intuitive to the viewer As shown in supplementaryfigure S3 Supplementary Material online the use of multiple

3052

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

sequence alignments and presentation of stability effectsG structural locations and selective pressures can helpin interpreting the impacts of the mutations This type ofinformation will therefore clarify the interpretation of newlydiscovered disease-causing mutations for clinicians evolu-tionary biologists and protein biochemists This combinedevolutionary-driven study of bleeding disorder-causing muta-tions is the most detailed of its kind as far as we are aware andmay be usefully extended to other biological processes andtheir proteins

Materials and Methods

Genomic Sequences

The flowchart of operations is summarized in supplementaryfigure S1 Supplementary Material online First the completemultiple sequence alignments were retrieved together withtheir gene trees for each of the 14 genes corresponding to the11 coagulation factors (table 1) from the Ensembl Comparadatabase (Vilella et al 2009) using their Ensembl IDs (table 1)and a Perl script Ensembl release 73 was used accessed onSeptember 30 2013 (Flicek et al 2013) These amino acidsequences came from Ensembl as they are generally ofgood quality and contained 74 vertebrate genomes in atotal of 80 genomes We then excluded any sequences thatcame from genomes of low coverage (2 or less) This re-sulted in 47 vertebrate genomes with alignment sizes be-tween 23 and 47 genomes These 47 sets of vertebratesequences provided sufficient genomic coverage to performphylogenetic analysis The 47 vertebrate organisms were clas-sified into five major clades (fig 2) namely Primates Glires(rodents rabbits and pika) Laurasiatheria Sauropsida (rep-tiles and birds) and Fishes (ray-finned fishes) together withminor clades including Atlantogenata and Australianmammals

Sequence Retrieval and Multiple Alignments

The multiple sequence alignments and gene trees fromEnsemblCompara included all the coagulation sequences indifferent species irrespective of their genome coverage Themultiple sequence alignments of the downloaded genes fromthe EnsemblCompara database were constructed using aglobal alignment algorithm provided within this databasefor significant number of sequences of arbitrary lengthsThe data set required for this study comprised 14 genesfrom 47 genomes A thorough quality control procedure ofthe multiple sequence alignments was performed in order togenerate minimum false positive results Thus sequences wereremoved if they showed 41 low-complexity regions orunknown nucleotides (indicated as ldquoNrdquo in a sequence)From these the phylogenetic trees were pruned usingNewick Utilities software (Junier and Zdobnov 2010) to iden-tify the tips corresponding to the retrieved species (fig 2) Thefinal alignments of these sequences were made with PRANKwhich is conservative in that it tends to align only amino acidsderived by substitution and introduces insertions and dele-tions (indels) instead of aligning sequence fragments thatappear too divergent (Loytynoja and Goldman 2008)

PRANK is an advanced probabilistic alignment algorithmthat incorporates phylogenetic information This has beenshown to reduce the number of false-positives when evalu-ating positive selection (Fletcher and Yang 2010 Loytynoja2014) All the PRANK alignments were made using defaultparameters for the empirical codon model and the prunedtrees as guides The resulting alignments are composed ofhighly conserved blocks surrounded by noise due to indelsFinally Gblocks (Castresana 2000) was used to remove thenoise in the unreliable regions of the multiple sequence align-ments which may cause false positives when testing for pos-itive selection to keep only the well-aligned parts The defaultGblocks parameters for the codontype model (parameter ndasht = c) were used with the exception of tolerating half gaps perposition (parameter -b5=h) This procedure resulted in cleanand accurate multiple sequence alignments for the final com-putations of the site and branch-site models using CodeMLtogether with their corresponding gene trees

Analyses of Evolutionary Pressure

For each of the 14 coagulation genes within each of the fivemajor clades the selective pressures were estimated using thedifferent codon substitution site models implemented inCodeML of the phylogenetic analysis with maximum-likeli-hood software (PAML release 47) (Yang 2007) By analyzingthe protein-coding genes at the nucleotide level the synon-ymous or silent changes (nucleotide substitutions that do notchange the encoded amino acid) are distinguished from thenonsynonymous or missense changes (substitutions thatchange the encoded amino acid) Because natural selectionoperates at the protein level and can take any direction ofpositive neutral or negative selection the synonymous andnonsynonymous substitutions are subjected to different se-lective pressures which occur at the rates dS and dN respec-tively Hence the ratio of nonsynonymoussynonymoussubstitution rates (= dNdS) measures the selective pressureat the protein level The value of reveals the direction andstrength of natural selection acting on the protein Values oflt 1 = 1 and 4 1 indicate negative selection neutralevolution and positive selection respectively If an existingfunction or phenotype is evolutionarily favorable then nega-tive selection (or purifying selection) favors the conservationof that function or phenotype (conservation of amino acids)The other is positive selection (or Darwinian selection) whichfavors the promotion of a new function or phenotypeNeutral evolution describes the tolerance to amino acid mu-tations which are neither deleterious nor beneficial and arefixed according to genetic drift

The following five models were used to account for thevariation of dNdS among codon sites M1a (two dNdS ratioswith negative selection [0lt 1] and neutral evolution[1 = 1]) M2a (three dNdS ratios with negative selection[0lt 1] neutral evolution [1 = 1] and positive selection[2 1]) M7 (beta) M8 (beta and 24 1) and M8a (betaand 2 = 1) The alignments contain several positions inwhich one of the sequences is missing and the ldquocleandata = 0rdquooption was selected to retain all sites The inference of positive

3053

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

selection was conducted by performing LRTs in which thefollowing two pairs of models were compared namely M1awith M2a and M8a with M8 The principle is to compare anull model that does not allow24 1 (M1a or M8a) with analternative model that does (M2a or M8) Because the alter-nate model has more parameters than the null model theLRT follows a 2 distribution The LRT is a powerful methodfor comparing two hypotheses where one is the special caseof the other The BEB method (Yang et al 2005) was used tocalculate the posterior probabilities for site classes and theBEB value was used to identify sites under positive selectionwhere the 2 distribution of the LRT was significant withPlt 005 As many different LRT comparisons (differentgenes different clades) were conducted a control for thefalse-discovery rate (expected proportion of false positives)was applied using the q value package in R (Storey andTibshirani 2003) We adjusted the parameters of the q valueto perform a bootstrap analysis (pi0meth = ldquobootstraprdquo) andto tolerate 10 of false positives in our significant results(ldquofdrlevel = 01rdquo) A bootstrap analysis is necessary as the Pvalue distribution is bimodal (Studer et al 2008)

Two sources for the branch-site model were used First theSelectome database at httpselectomeunilch (last accessedJanuary 2014) (Proux et al 2009 Moretti et al 2014) providesthe precomputed results of the branch-site model (supple-mentary table S2b Supplementary Material online) based onthe multiple sequence alignments from the EnsemblCompara database (Vilella et al 2009) The last release ofSelectome (release 6) contains an optimized procedure ofalignment refinement and quality control similar to theone used for our site-model analyses as described aboveThe branch-site model allows the dNdS value to vary notonly between sites but also between branches (Zhang et al2005) The advantage of this model is its ability to detecttraces of episodic positive selection (Studer and Robinson-Rechavi 2009a) Second a branch-site model was determinedbased on our data set (see above) We used the vertebratetrees and their multiple sequence alignments as inputs forbranch-site model calculations The multiple alignments weremanually rechecked to remove barren sequences and thecorresponding vertebrate trees were edited accordinglyusing Newick Utilities Each taxonomic unit also known asthe branch or node of the tree was tagged using an auto-mated python script The CodeML program was run on eachof these tagged trees from all the genes for this branch-sitemodel (supplementary table S2b Supplementary Materialonline) For some alignments we found misaligned short in-sertions up to a maximum of ten amino acids in the con-served blocks probably due to gene model prediction errorsIn our whole data set these errors affect 0ndash4 sequences peralignment As the F2 gene family contained most of thesesmall errors albeit at a very low level this was reanalyzed byremoving four sequences with potential problems from thealignment The recalculation of the likelihood ratio deviatedonly slightly from the initial results (P value of 393e5 for 38sequences vs P value of 217e4 for 34 sequences) This sug-gested that these errors would only have a minor impact ifany on the detection of positive selection especially as the

coagulation factor sequences were long and the model ofcodon substitutions was averaged over all columns (manyhundreds) and sequences (between 23 and 47)

Stability Effect of Mutations

When possible for each coagulation factor the molecularstructures of the domain(s) were retrieved directly fromthe CATH database (Sillitoe et al 2013) Prior to the muta-tional analyses each structure was repaired with the FoldXprogram (Schymkowitz et al 2005) to remove potential stericclashes FoldX utilizes an empirical force field model to esti-mate the stability of a protein (Gwildtype in kcalmol) FoldXwas used to estimate the stability effect on each residueposition when the residue was replaced by any of the 19other amino acids The stability effect is determined bythe difference in the wildtype and mutant energies(G = GmutantGwildtype) This resulted in a stabilitylandscape of all potential mutations including the Gvalues both for the disease mutations and for all the aminoacid substitutions observed across vertebrate evolution

Statistical Analyses

All statistical analyses were performed with R v302 (R CoreTeam 2013) and SigmaPlot (Systat Software San Jose CA)The pipeline was developed with Biopython scripts (Cocket al 2009) The visualization of the multiple sequence align-ments was performed using Jalview 28 (Waterhouse et al2009)

Data Availability

The final sequence alignments are provided on a webpagehttpwwwbiochemuclacukpavithraevolutioncoagulationindexphp (last accessed January 2014) Pathogenic mu-tation data for three human coagulation factors FVIII FIX andFXI were retrieved from our web databases at httpwwwfactorviii-dborg httpwwwfactorixorg and httpwwwfactorxiorg (last accessed January 2014) (Saunders et al2009 Rallapalli et al 2013)

Supplementary MaterialSupplementary tables S1ndashS2 and figures S1ndashS3 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

PMR and SJP thank Pfizer UK for a medical educationalgrant that generously funded this project and the SpecialTrustees of the Royal Free Hospital the KatharineDormandy Trust for Hemophilia and Related Disorders andthe European Association for Hemophilia and AssociatedDisorders for additional support The authors also thank DrKeith Gomez Dr Geoffrey Kemball-Cook and ProfessorEdward G Tuddenham for their generous collaborative sup-port of our studies including access to the FVIII mutations atthe Hemophilia A Database at httpwwwfactorviii-dborgRAS acknowledges funding from the Fondation du 450eme

3054

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

anniversaire de lrsquoUniversite de Lausanne and Swiss NationalScience Foundation grants 132476 and 136477

ReferencesAmara U Rittirsch D Flierl M Bruckner U Klos A Gebhard F Lambris

JD Huber-Lang M 2008 Interaction between the coagulation andcomplement system Adv Exp Med Biol 63271ndash79

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19950ndash958

Broze GJ Jr 1995 Tissue factor pathway inhibitor and the revised theoryof coagulation Annu Rev Med 46103ndash112

Brunder W Schmidt H Karch H 1997 EspP a novel extracellular serineprotease of enterohaemorrhagic Escherichia coli O157H7 cleaveshuman coagulation factor V Mol Microbiol 24767ndash778

Butenas S vanrsquot Veer C Cawthern K Brummel KE Mann KG 2000Models of blood coagulation Blood Coagul Fibrinolysis 11(Suppl 1)S9ndashS13

Castresana J 2000 Selection of conserved blocks from multiple align-ments for their use in phylogenetic analysis Mol Biol Evol 17540ndash552

Cock PJ Antao T Chang JT Chapman BA Cox CJ Dalke A Friedberg IHamelryck T Kauff F Wilczynski B et al 2009 Biopython freelyavailable Python tools for computational molecular biology andbioinformatics Bioinformatics 251422ndash1423

de Beer TA Laskowski RA Parks SL Sipos B Goldman N Thornton JM2013 Amino Acid changes in disease-associated variants differ rad-ically from variants observed in the 1000 genomes project datasetPLoS Comput Biol 91ndash15

Dong Y Dimopoulos G 2009 Anopheles fibrinogen-related proteinsprovide expanded pattern recognition capacity against bacteriaand malaria parasites J Biol Chem 2849835ndash9844

Doolittle RF 2009 Step-by-step evolution of vertebrate blood coagula-tion Cold Spring Harb Symp Quant Biol 7435ndash40

Doolittle RF Jiang Y Nand J 2008 Genomic evidence for a simplerclotting scheme in jawless vertebrates J Mol Evol 66185ndash196

Fletcher W Yang Z 2010 The effect of insertions deletions and align-ment errors on the branch-site test of positive selection Mol BiolEvol 272257ndash2267

Flicek P Ahmed I Amode MR Barrell D Beal K Brent S Carvalho-SilvaD Clapham P Coates G Fairley S et al 2013 Ensembl 2013 NucleicAcids Res 41D48ndashD55

Junier T Zdobnov EM 2010 The Newick utilitieshigh-throughput phy-logenetic tree processing in the UNIX shell Bioinformatics 261669ndash1670

Kawthalkar S 2013 Overview of Physiology of blood In essentials ofhaematology New Delhi (India) Jaypee Brothers Medical Publishers(P) Ltd p 1ndash51

Kolb AW Ane C Brandt CR 2013 Using HSV-1 genome phylogeneticsto track past human migrations PLoS One 81ndash9

Kosiol C Vinar T da Fonseca RR Hubisz MJ Bustamante CD Nielsen RSiepel A 2008 Patterns of positive selection in six Mammalian ge-nomes PLoS Genet 4e1000144

Lombardo F Ghani Y Kafatos FC Christophides GK 2013Comprehensive genetic dissection of the hemocyte immune re-sponse in the malaria mosquito Anopheles gambiae PLoS Pathog9e1003145

Lopez C Saravia C Gomez A Hoebeke J Patarroyo MA 2010Mechanisms of genetically-based resistance to malaria Gene 4671ndash12

Loytynoja A 2014 Phylogeny-aware alignment with PRANK MethodsMol Biol 1079155ndash170

Loytynoja A Goldman N 2008 Phylogeny-aware gap placement pre-vents errors in sequence alignment and evolutionary analysisScience 3201632ndash1635

Masel J 2011 Genetic drift Curr Biol 21R837ndashR838Matsuka YV Pillai S Gubba S Musser JM Olmsted SB 1999 Fibrinogen

cleavage by the Streptococcus pyogenes extracellular cysteine

protease and generation of antibodies that inhibit enzyme proteo-lytic activity Infect Immun 674326ndash4333

Meyer A Van de Peer Y 2005 From 2R to 3R evidence for a fish-specificgenome duplication (FSGD) Bioessays 27937ndash945

Montoya-Burgos JI 2011 Patterns of positive selection and neutral evo-lution in the protein-coding genes of Tetraodon and Takifugu PLoSOne 6e24800

Moretti S Laurenczy B Gharib WH Castella B Kuzniar A Schabauer HStuder RA Valle M Salamin N Stockinger H et al 2014 Selectomeupdate quality control and computational improvements to a data-base of positive selection Nucleic Acids Res 42D917ndashD921

Morrissey JH Choi SH Smith SA 2012 Polyphosphate an ancient mol-ecule that links platelets coagulation and inflammation Blood 1195972ndash5979

Nathwani AC Tuddenham EG Rangarajan S Rosales C McIntosh JLinch DC Chowdary P Riddell A Pie AJ Harrington C et al 2011Adenovirus-associated virus vector-mediated gene transfer in hemo-philia B N Engl J Med 3652357ndash2365

Nielsen R Bustamante C Clark AG Glanowski S Sackton TB Hubisz MJFledel-Alon A Tanenbaum DM Civello D White TJ et al 2005 Ascan for positively selected genes in the genomes of humans andchimpanzees PLoS Biol 3e170

Panizzi P Friedrich R Fuentes-Prior P Richter K Bock PE Bode W 2005Fibrinogen substrate recognition by staphylocoagulase(pro)throm-bin complexes J Biol Chem 131179ndash1187

Proux E Studer RA Moretti S Robinson-Rechavi M 2009 Selectome adatabase of positive selection Nucleic Acids Res 37D404ndashD407

Puy C Tucker EI Wong ZC Gailani D Smith SA Choi SH Morrissey JHGruber A McCarty OJ 2013 Factor XII promotes blood coagulationindependent of factor XI in the presence of long-chain polypho-sphates J Thromb Haemost 111341ndash1352

R Core Team 2013 R a language and environment for statistical com-puting Vienna Austria R Foundation for Statistical Computing

Rallapalli PM Kemball-Cook G Tuddenham EG Gomez K Perkins SJ2013 An interactive mutation database for human coagulationfactor IX provides novel insights into the phenotypes and geneticsof hemophilia B J Thromb Haemost 111329ndash1340

Roux J Privman E Moretti S Daub JT Robinson-Rechavi M Keller L2014 Patterns of positive selection in seven ant genomes Mol BiolEvol 311661ndash1685

Saunders RE Shiltagh N Gomez K Mellars G Cooper C Perry DJTuddenham EG Perkins SJ 2009 Structural analysis of eight noveland 112 previously reported missense mutations in the interactiveFXI mutation database reveals new insight on FXI deficiencyThromb Haemost 102287ndash301

Schymkowitz J Borg J Stricher F Nys R Rousseau F Serrano L 2005 TheFoldX web server an online force field Nucleic Acids Res 33W382ndashW388

Sillitoe I Cuff AL Dessailly BH Dawson NL Furnham N Lee D Lees JGLewis TE Studer RA Rentzsch R et al 2013 New functionalfamilies (FunFams) in CATH to improve the mapping of con-served functional sites to 3D structures Nucleic Acids Res 41D490ndashD498

Spronk HM Govers-Riemslag JW ten CH 2003 The blood coagulationsystem as a molecular machine Bioessays 251220ndash1228

Stefl S Nishi H Petukh M Panchenko AR Alexov E 2013 Molecularmechanisms of disease-causing missense mutations J Mol Biol 4253919ndash3936

Storey JD Tibshirani R 2003 Statistical significance for genomewidestudies Proc Natl Acad Sci U S A 1009440ndash9445

Studer RA Penel S Duret L Robinson-Rechavi M 2008 Pervasive pos-itive selection on duplicated and nonduplicated vertebrate proteincoding genes Genome Res 181393ndash1402

Studer RA Robinson-Rechavi M 2009a Evidence for an episodic modelof protein sequence evolution Biochem Soc Trans 37783ndash786

Studer RA Robinson-Rechavi M 2009b Large-scale analyses of positiveselection using codon models In Pontarotti P editor Evolutionarybiology Berlin (Germany) Springer-Verlag Berlin Heidelbergp 217ndash235

3055

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Sun H 2006 The interaction between pathogens and the host coagu-lation system Physiology 21281ndash288

Sutherland MR Ruf W Pryzdial EL 2012 Tissue factor and glycoproteinC on herpes simplex virus type 1 are protease-activated receptor 2cofactors that enhance infection Blood 1193638ndash3645

Thomas CE Ehrhardt A Kay MA 2003 Progress and problemswith the use of viral vectors for gene therapy Nat Rev Genet 4346ndash358

Tokuriki N1 Stricher F Schymkowitz J Serrano L Tawfik DS 2007 Thestability effects of protein mutations appear to be universally dis-tributed J Mol Biol 3691318ndash1332

Vallender EJ Lahn BT 2004 Positive selection on the human genomeHum Mol Genet 13R245ndashR254

Vilella AJ Severin J Ureta-Vidal A Heng L Durbin R Birney E 2009EnsemblCompara GeneTrees complete duplication-aware phyloge-netic trees in vertebrates Genome Res 19327ndash335

Waterhouse AM Procter JB Martin DM Clamp M Barton GJ 2009Jalview Version 2mdasha multiple sequence alignment editor and anal-ysis workbench Bioinformatics 251189ndash1191

Wong WS Yang Z Goldman N Nielsen R 2004 Accuracy and power ofstatistical methods for detecting adaptive evolution in proteincoding sequences and for identifying positively selected sitesGenetics 1681041ndash1051

Yang Z 2006 Computational molecular evolution Oxford UniversityPress Oxford

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 241586ndash1591

Yang Z Wong WS Nielsen R 2005 Bayes empirical Bayes inference ofamino acid sites under positive selection Mol Biol Evol 221107ndash1118

Zhang J Nielsen R Yang Z 2005 Evaluation of an improved branch-sitelikelihood method for detecting positive selection at the molecularlevel Mol Biol Evol 222472ndash2479

3056

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

criterion the Atlantogenata clade with only five sequenceswas eliminated from the selection pressure calculations Themultiple sequence alignments from Ensembl were realignedusing a strict quality control procedure (supplementary figS1 Supplementary Material online) which was performed inorder to minimize the number of false positive results below

Presence and Absence of Coagulation Factors

The 47 genomes were searched for the 14 genes coding forthe 11 coagulation factors In cases where the Ensembl genetree for a given coagulation factor did not show the presenceof a particular genome we searched for putative genes acrossthat genomic sequence to establish whether the organismhad lost them The purpose of the coagulation cascade is toconvert FG into the fibrin clot and unsurprisingly FG (theFGA FGB and FGG genes) was found in all 47 genomes Theother central genes thrombin (F2) tissue factor (F3) the

coagulation cascade cofactors (F5 and F8) F7 F9 and F10were present across all 47 genomes from fishes to humansF11 is present in all Sarcopterygii (tetrapods and coelacanth)but seems to have been lost or highly diverged in teleostfishes because it is only recorded in the spotted garLepisosteus oculatus which belongs to Holostei a sisterclade of teleost fishes F12 appeared in the ancestor of tetra-pods and is absent in fishes but present in amphibians(Xenopus tropicalis) reptiles (Anolis carolinensis) and mam-mals For F13 the alpha chain (F13A) was observed to appearfrom the time of vertebrates because this gene has beenfound in the sea lamprey fishes and tetrapods The betachain (F13B) appears to be present at the origin of vertebratesbecause F13B sequences have been recorded in cave fish(Astyanax mexicanus) and medaka (Oryzias latipes)Occasionally a gene was not reported this was most likelyto arise from incomplete or unidentified sequences

FIG 2 Phylogenetic tree of 47 vertebrate genomes The tree shows the 47 genomes studied here alongside their clade and taxonomic groups The 47vertebrate genomes showed good sequence quality and coverage in the Ensembl database and were grouped into five major clades (Primates GliresLaurasiatheria Sauropsida and Fishes) and three minor clades The Mammals comprised of the Primates Glires and Laurasiatheria clades together withAtlantogenata (African origin mammals and Xenartha) and four Australian species totaling 30 out of 47 genomes

3043

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Selective Pressures Using Codon SubstitutionSite-Models

Codon substitution site-models from the CodeML packagewere used to analyze the protein-coding sequences in the 14genes following the quality control of their alignments Themodels assume that selective pressure does not act equally onthe entire protein sequence but varies between amino acidsites in a gene family We calculated the log likelihood values(lnL) of five different selection models (M1a M2a M7 M8and M8a) in order to perform the likelihood ratio test (LRT)The five models differed from each other in the number ofparameters (np) considered and their variable values Thevalue is the ratio of nonsynonymous and synonymous sub-stitution rates (dNdS) (Materials and Methods) The compar-ison between a null model that does not consider 4 1(M1a M7 or M8a) and an alternate model that considers4 1 (M2a or M8) is a measure of positive selectionIf the null (neutral) model is rejected in favor of thealternate (selection) model positive selection is inferred(Yang 2006)

To evaluate the five models the LRT value was calculatedas follows

LRT frac14 2ethlnLalternate lnLnullTHORN

where lnL is the maximum-likelihood estimate of the proba-bilities of selection The LRT is asymptotically distributed as a2

k function where k (also known as the degrees of freedom) isthe difference between the np in the null and alternatemodels The np values were calculated along with the lnLvalues in CodeML

k frac14 npalternate npnull

The LRT values for which the 2k distribution shows a P

valuelt 005 indicated that some sites were significantlyunder positive selection For this study the comparison ofthe M8a null model with the M8 alternate model (referred toas M8a-M8) was considered optimal to infer positive selection(table 2) The LRT M1a-M2a is too conservative whereas M7ndashM8 was considered to be more problem prone and less ac-curate compared with M8a-M8 (Wong et al 2004 Studer andRobinson-Rechavi 2009b) Because the P values result frommultiple tests performed using different genes at differentclade levels it was necessary to control for the expected pro-portion of false positives False positives were accounted forby calculating the q value correction over the P value for thedifferent genes in each clade this approach being both pow-erful and specific (Studer et al 2008)

The overall picture is that a total of nine coagulation factorproteins (out of the 11 of the cascade) underwent positiveselection at one or more stages in evolution (fig 3) Thehighest number of genes under positive selection within anindividual clade is five (for vertebrates and mammals) ForVertebrates using the LRT M8a-M8 positive selection wasfound in as many as five of the 14 coagulation genes (table2) In figure 3a these included F2 F5 F7 F10 and F13A for all47 vertebrates In figure 3b these included FGA F3 F5 F8 andF9 for the 30 mammals In figure 3cndashe these included FGB F5

and F8 (Primates) FGA and F13B (Laurasiatheria) and F3(Sauropsida) FG (FG = FGA + FGB + FGG) and F5 were ob-served to undergo the most extensive positive selection com-pared with any other coagulation genes The selectivepressure on all the genes in different clades was identifiedfrom the LRT values (table 2) except for those of F11 inLaurasiatheria Sauropsida and Fishes F12 in Sauropsida andFishes and F13B in Fishes because these genes wereabsent (supplementary table S1 Supplementary Materialonline)

Estimation of Selective Pressures Using Branch-SiteModel of Codon Substitution

The above site models estimated selective pressures that varyamong amino acid positions across multiple species But se-lective pressures can also vary among species (Studer andRobinson-Rechavi 2009a) and codon substitution branch-site models have been developed to detect such episodicpositive selection (Zhang et al 2005) Here we have usedthe results of the branch-site model from two differentsources one from the Selectome database and the otherfrom our own calculations Both approaches employedCodeML on the two data sets of this study (Materials andMethods) in order to check for consistency with the positiveselection identified in the site-model calculations (fig 3)Similar to the site model the false positives in our branch-site model calculations were accounted for by calculating theq value correction over the P value for the different genes ineach clade For both branch-site calculations we followed thetaxonomic grouping in the Selectome database (supplemen-tary table S2b Supplementary Material online) The highestorder of classification is the Sarcopterygii (coelacanths lung-fish and tetrapods) followed by its largest subgroupTetrapoda (mammals sauropsids and frogs) Other majortaxonomic subtrees were those of amninots sauropsidsmammals theria and eutheria The Actinopterygii (fishes)and Sauropsida (birds) are common to both the calculationsThe Eutheria and mammalian subtree calculations from theSelectome database were considered to be equivalent to themammalian clade of our analyses (fig 2) Positive selectionacross the vertebrates was observed for all genes except FGGF3 F7 and F11 in our branch-site model calculations (supple-mentary table S2b Supplementary Material online) We ob-tained similar results with our site-model calculations(supplementary table S2a Supplementary Material online)except for the exclusion of F3 and F7 (fig 3a) Our branch-site calculations identified positive selection in mammalianFGB F2 F8 and F9 whereas the site model identified positiveselection in mammalian FGA F3 F5 F8 and F9 Overall FG(FGA FGB) F8 and F9 were consistently identified to be pos-itively selected in mammals whereas F5 shows positive selec-tion up to amniotes after which the mammals and thesauropsids clade split In conclusion the two branch-siteand the site models all suggested signatures of positive selec-tions across F8 and F9 whereas no such selection wasobserved across and between F11

3044

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Identification of Codon Sites under Positive Selection

For up to 11 genes out of 14 identified to be under positiveselection (table 3) further analyses were performed to identifywhich codon sites were under such selection This was doneusing the Bayes Empirical Bayes (BEB) method from the M8site-model (Yang et al 2005) The number of codons poten-tially under positive selection (BEB 450) in the positivelyselected genes ranged between 3 and 64 But the number ofcodons accurately predicted under positive selection (BEB495) is actually much lower (between 0 and 4) The FGAgene in Laurasiatheria showed the highest percentage of pos-itively selected codons (1329 as estimated by CodeML and64 codons detect with two at BEB495) followed by the F5gene in Primates (1321 with 54 codons all BEBlt95 ) Thesmallest percentage of positively selected codons was ob-served in the F10 gene of Vertebrates (001 with threecodons all with BEB lt95) No positive selection was ob-served in any of the codon sites for the FGG F11 and F12genes in the taxa groups of this study

Table 2 LRT Statistics for the Site Model Comparisons of Model 8with Model 8a for the 14 Coagulation Factors among the Five Clades

Gene Clade No ofSequences

No ofCodons

(M8a-M8)

LRT P value q value

FGA Vertebrates 44 236 139 024 035Mammals 29 288 775 001 004Primates 7 865 335 007 016

Glires 7 670 289 009 019Laurasiatheria 9 267 728 001 004

Sauropsida 6 464 033 057 062Fishes 7 592 016 100 072

FGB Vertebrates 40 429 145 023 035Mammals 26 456 278 100 072Primates 6 488 744 001 004

Glires 6 471 069 041 051Laurasiatheria 8 479 144 023 035

Sauropsida 5 473 002 100 072Fishes 7 476 099 100 072

FGG Vertebrates 44 386 174 100 072Mammals 29 427 319 007 017Primates 7 433 398 005 013

Glires 7 433 344 006 016Laurasiatheria 8 432 009 076 072

Sauropsida 5 427 003 086 072Fishes 8 395 542 100 072

F2 Vertebrates 38 475 1691 000 000Mammals 23 571 464 003 011Primates 7 621 425 004 012

Glires 5 611 007 100 072Laurasiatheria 7 620 158 021 034

Sauropsida 5 593 368 005 014Fishes 8 562 005 100 072

F3 Vertebrates 47 166 209 015 027Mammals 26 240 1337 000 000Primates 7 295 001 090 072

Glires 6 287 307 008 018Laurasiatheria 7 292 097 032 044

Sauropsida 5 254 865 000 002Fishes 14 192 082 036 047

F5 Vertebrates 37 1182 1495 000 000Mammals 22 1741 3192 000 000Primates 6 2116 536 002 008

Glires 6 1847 048 049 056Laurasiatheria 6 2043 483 003 010

Sauropsida 5 1550 145 023 035Fishes 8 1534 212 015 027

F7 Vertebrates 34 330 593 001 007Mammals 29 344 306 008 018Primates 7 443 016 069 072

Glires 7 417 03 100 072Laurasiatheria 7 443 057 045 054

Sauropsida 3 425 055 046 054Fishes NA

F8 Vertebrates 38 1028 485 003 010Mammals 24 1985 1119 000 001Primates 7 2351 683 001 004

Glires 5 2257 395 005 013Laurasiatheria 7 2344 025 062 067

Sauropsida 5 1480 003 086 072Fishes 7 1413 09 034 045

F9 Vertebrates 44 329 063 043 052Mammals 24 443 1857 000 000Primates 7 461 128 026 036

Glires 5 468 185 017 030

(continued)

Table 2 Continued

Gene Clade No ofSequences

No ofCodons

(M8a-M8)

LRT P value q value

Laurasiatheria 7 459 434 004 012Sauropsida 5 435 003 100 072

Fishes 13 368 036 100 072

F10 Vertebrates 41 353 1455 000 000Mammals 26 424 023 063 067Primates 7 485 001 091 072

Glires 6 464 206 015 027Laurasiatheria 7 463 23 013 025

Sauropsida 5 385 003 087 072Fishes 8 391 071 040 051

F11 Vertebrates 23 614 034 056 062Mammals 23 614 034 056 062Primates 7 625 005 082 072

Glires 5 619 007 079 072Laurasiatheria 6 625 002 089 072

Sauropsida NAFishes NA

F12 Vertebrates 27 512 398 100 072Mammals 25 533 018 100 072Primates 7 615 002 100 072

Glires 6 557 007 100 072Laurasiatheria 8 513 46 003 011

Sauropsida NAFishes NA

F13A Vertebrates 47 631 907 000 002Mammals 26 684 258 011 022Primates 7 732 162 020 034

Glires 6 730 147 100 072Laurasiatheria 9 730 027 100 072

Sauropsida 5 732 001 092 072Fishes 14 643 259 011 022

F13B Vertebrates 31 609 13 025 036Mammals 25 640 157 021 034Primates 7 660 122 027 037

Glires 5 634 021 100 072Laurasiatheria 7 656 708 001 004

Sauropsida 5 573 093 100 072Fishes NA

NOTEmdashPositive selection is denoted in bold in column 2

Plt 005 Plt 001

3045

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Disease-Causing Missense Mutations and Relationshipto Evolutionary Pressures

The relationship between selective pressure and the proba-bility of disease-causing mutations at a codon site is of greatinterest Disease-causing missense mutations have been re-ported in all 14 coagulation genes in particular forHemophilia A (F8) Hemophilia B (F9) factor XI deficiency(F11) and thrombophilia (F2) (table 1) Extensive compila-tions of 487ndash5474 disease-causing mutations are available forthree of these coagulation factors FVIII FIX and FXI(Saunders et al 2009 Rallapalli et al 2013 Rallapalli et al2014 [unpublished data]) There is a sufficient number ofpathogenic mutation and evolutionary selection data fromthe FVIII FIX and FXI proteins to enable us to compare instatistical detail the correlation between selective pressureand disease-causing mutations

For the comparison of the FVIII FIX and FXI proteinchanges selective pressures were identified using the BEB in-ference method from model M2a (fig 4) While powerful a

disadvantage of Model M8 is that it classifies sites into 11categories (ten in negativeneutral and one positive selec-tion) These 11 categories can vary between different genefamilies making the comparison difficult Thus model M2a isadvantageous in that this has only three distinct categoriesnamely negative selection (dNdSlt 1) neutral evolution (dNdS = 1) and positive selection (dNdS4 1) Under the M1a-M2a test in the primates clade two genes (F8 and F9) showedpositive selection whereas F11 did not show positive selec-tion which differs from that shown in figure 3c which is basedon the M8 model The posterior probabilities of neutral neg-ative and positive selection for the FVIII FIX and FXI proteinsshowed that positive selection occurred in specific codonsthe most being for F8 and the fewest for F11 For completionthe same analysis for the other 11 genes is shown in supple-mentary table S2 Supplementary Material online These showsimilar outcomes to those for FVIII FIX and FXI

For each amino acid position in FVIII FIX and FXI wecompared its selective pressure (positive neutral or negative)

FIG 3 Positively selected genes of the coagulation cascade in vertebrates mammals and three major clades The layout of this figure is a simplifiedrepresentation of figure 1 Genes showing significant positive selection in the LRT calculations are highlighted in red (a) Positive selection was observedin five genes when all 47 genome sequences (vertebrates) were analyzed as a single group (b) In the Mammals group (fig 2) where positive selection isseen for five genes only the (c) Primates and (d) Laurasiatheria clades are shown with positive selection for two or three genes only because no positiveselection was seen in the Glires clade whereas insufficient sequences were available in the Atlantogenata clade for conclusions to be drawn (e) Positiveselection was seen in the Sauropsida clade but no positive selection was observed in Fishes (fig 2)

3046

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

with the reported number of disease-causing mutations Theoutcome was presented using regression plots (fig 5)The regression plots and their correlation coefficients revealedthe relationship between the selection probability values atamino acid sites with the disease-causing mutations at thosesites The average probability of a site to be under positiveselection and its number of disease-causing mutationsshowed a negative relationship with correlation coefficientsof085067 and095 in FVIII FIX and FXI respectively(fig 5 top row) The correlation coefficients for neutral selec-tion followed that of positive selection with values of 087064 and 096 for FVIII FIX and FXI respectively (fig 5middle row) In distinction to these a positive relationshipwas seen between the average probability of negative selec-tion and the disease-causing mutations with correlation co-efficients of +086 +066 and +095 for FVIII FIX and FXIrespectively (fig 5 bottom row) To summarize the fewer thenumber of disease mutations at a given site the higher is theprobability for positive selection or neutral evolution (fig 5

top row and middle row) The greater the number of disease-causing mutations at a given site the higher the probabilityfor negative selection (fig 5 bottom row) The comparisonbetween the positively selected proteins FVIII and FIX withthe nonpositively selected protein FXI is noteworthyIrrespective of the selective pressure all three proteins ex-hibited similar regression lines in each of the three rowsThese comparisons show that there is a general inverse pro-portionality relationship between the probability of positiveselection (or neutral evolution) and the number of disease-causing mutations at each site This confirms previous obser-vations that disease-causing mutations tend to occur at crit-ical sites in the protein and that these highly conserved sitesare less tolerant to new mutations (negative selection)

Stability Effect of Mutations

In order to explain the consequences of disease-causing mu-tations on the protein structure FoldX software was used to

Table 3 Positively Selected Sites in the M8a-M8 Comparison

Gene Clade No ofCodons

Analyzed

No of Codonsunder positive

selection

PercentageEstimated byCodeML ()

Sites under Positive Selection (BEB4 50)

FGA Mammals 288 13 717 71832353951101108178212219246273

FGA Laurasiatheria 267 64 1329 467121929303336405293949697100102103108109116117118123136138141143155171180182195208210213214216218219220222223225227229231232233234236239242244245250253254

256259 260261264265

FGB Primates 488 17 822 394155808182130134135137173290367388413415448

F2 Vertebrates 475 5 099 5257151162381

F3 Mammals 240 11 663 55727691115177179185195230240

F3 Sauropsida 254 13 1060 340414474798081110123170208249

F5 Vertebrates 1182 10 087 1032755745785936027779269951010

F5 Mammals 1741 54 272 237243660134389392656663667669 676680718720731745749765768782809819822824875

88089489590791191692893094596797698910161040107610801089113611391324147114791515

152215481564

F5 Primates 2116 54 1321 74052129180211336341408409434 59266070370772075478487787988889290794195997810331039

10591131116812051209121812201222123012311232123612441254126212651279128514981504

151915261667168218491918

F7 Vertebrates 330 12 522 56526688162167178179209258262

F8 Mammals 1985 27 398 7245281339342521667718765784785792807817889912113011931197

12201229128813081333168619671984

F8 Primates 2351 33 063 42275579181485086088692796496697998899610191071118012821312

136114141421143614381459152715821613162216501689173023402349

F9 Mammals 443 15 440 45100114140150177181199202206209270309348

F10 Vertebrates 353 3 001 14470

F13A Vertebrates 631 6 126 398489497502550613

F13B Laurasiatheria 656 18 152 5567378107341349350374443459461497522525539540624

P4 95 P4 99

3047

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

estimate the stability effect (G) of residue changes on thethree-dimensional structures of human FVIII (chains A and B)FIX and FXI (fig 6) FoldX was designed to predict the stabilityeffect of a mutation This is an empirical method that relies onthe estimation of physical parameters from more than 1000mutations in protein structures and has been evaluated with21 proteins (Schymkowitz et al 2005 Tokuriki et al 2007)First each amino acid at each position was replaced with eachof the other 19 amino acids in order to provide a baseline forcomparison As expected the prediction of all virtual muta-tions shows a distribution skewed toward destabilization (me-dians G range between 051 and 123 kcalmol) Next theeffect of changing the human amino acids with those seen inevolution for the other 46 sequences of this study was calcu-lated in order to examine the effect of evolutionary change onprotein stabilities This produced a narrow distribution pre-dominantly centered on a difference of stability G of0 kcalmol and showing a slight skew toward deleterious mu-tations (medians G range between 014 and 030 kcalmol) This outcome is as expected because evolutionarychanges in the amino acids are selected in order to preservethe stability of the protein structures Finally the effect ofchanging the human amino acids to those observed inunique pathogenic missense mutations was calculated Thisanalysis produced a clearly skewed distribution (mediansG range between 101 and 295 kcalmol) of G

values that ranged as high as 10 kcalmol (and even higherbut these higher values were clipped for easier visualization)this agrees with the fact that these mutations affect the pro-tein structure The first distribution for the virtual mutationsis intermediate between the latter two distributions asexpected

The same outcome of stability effects was also observed atthe individual amino acid level (fig 7) When the effect of 19amino acid replacements were computed at each position forFVIII FIX and FXI and the majority of replacements showed adestabilizing effect (410 kcalmol in red) A small propor-tion of replacements stabilized the structure (lt10 kcalmol in blue) indicating the ability of the protein to compen-sate deleterious mutations by further mutation Although thisappears to occur more strongly for the FVIII chains A and B(fig 7a and b) the proportion of allowed replacements is infact similar for all four proteins when their different sizes areconsidered When the effect of replacements by disease-caus-ing missense mutations was calculated and the G valuesshowed that these almost invariably destabilized the threeprotein structures

DiscussionIn this study we have assembled sequence data for 14 coag-ulation genes across 47 genomes in order to analyze the evo-lutionary changes in amino acids that take place between five

FIG 4 Probability of evolutionary selection across the Primates clade for the three proteins FVIII FIX and FXI The probabilities of negative neutral andpositive selection were plotted against the amino acid position in the three proteins The probability values were obtained from CodeML analyses of theseven primate sequences (gray negative selection yellow neutral selection green positive selection)

3048

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

major clades and several minor ones These changes werecompared directly with disease-causing mutation data forthree coagulation proteins FVIII FIX and FXI Good-qualitypathogenic mutational data are publicly available for thesethree proteins and the bleeding disorders caused by thesethree sets of mutational data correspond to the three mostprevalent bleeding disorders The outcome of these detailedanalyses confirms that positive selection occurred during dif-ferent stages in the evolution of several coagulation proteinsIt was also observed when comparing the protein structurestabilities that the pathogenic mutations have a destabilizingeffect on the protein structure as opposed to the stable mu-tations that are tolerated during evolution (fig 6) The com-bination of positive selection analyses with those forpathogenic mutations provides information about predictingdisease causing mutations and clarifies fresh approaches fortherapeutic interventions in bleeding disorders

Selective Pressures in Vertebrate Evolution

Evolutionary changes combine random mutational eventswith natural selection The probability of a mutation to befixed in a population depends on its consequence on thefitness of an organism New adaptive mutations that are ben-eficial in terms of fitness are more likely to be retained bypositive selection than neutral mutation Neutral mutations

that are neither beneficial nor detrimental are randomly re-tained or removed by genetic drift depending on the popu-lation size (neutral selection) Detrimental mutations aremore likely to be removed from the population by negative(or purifying) selection There are no strict patterns and adeleterious mutation can sometimes be beneficial to oneaspect of the organism The textbook example of this issickle-cell anaemia (drepanocytosis) which is a bleeding dis-order caused by the pGlu6Val mutation in the hemoglobin gene Mutations responsible for sickle-cell anaemia are gen-erally removed by natural selection In malaria-infested pop-ulations particularly in tropical or subtropical regions thesemutations are selected for in one allele because they offerpartial protection against the pathogen (Lopez et al 2010)The malaria-infected sickle cells inhibit infection by the ma-laria plasmodium in heterozygous cases (beneficial) but causesickle-cell anaemia in homozygous cases (deleterious)

Genes corresponding to proteins in direct contact with theenvironment are more likely to be under adaptive selectionThus genes involved in host-pathogen interactions are com-monly associated with positive selection Large-scale analysesrevealing positive selection in vertebrates or arthropods havebeen reported for genes involved with immunity (Nielsenet al 2005 Studer et al 2008 Montoya-Burgos 2011 Rouxet al 2014) In the present detailed study positive selection inseveral genes of the coagulation cascade was identified at

FIG 5 Relationship between disease-causing mutations and evolutionary pressures The correlation between evolutionary selection and the number oftimes a disease-causing mutation occurs at each amino acid position is shown In the columns (a) FVIII (b) FIX and (c) FXI correspond to the threecoagulation factors for which pathogenic mutational information is available in sufficient quantity from our three mutational databases The three rowscorrespond to the posterior probability (based on the BEB model of CodeML) of positive selection neutral evolution and negative selection forprimates

3049

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

different stages in evolution (fig 3) An earlier study identifiedpositive selection across six mammalian genomes (humanchimpanzee macaque mouse rat and dog) using LRTsbased on codon substitution models and false discoveryrates to evaluate multiple LRT comparisons (Kosiol et al2008) Although they found strong positive selection(Plt 005 and false discovery ratelt 005) in nine complementgenes only nominal positive selection (Plt 005) was identi-fied in four coagulation genes (F2 F12 TFPI PROC) none ofwhich were identified in figure 3b A limitation of this earlieranalysis may be the low number of six genomes used Notethat complement is related to coagulation in that the latterhas the ability to activate complement of innate immunitythat mediates host-pathogen interactions (Amara et al 2008)

Positive Selection of the Coagulation Proteins throughHost-Pathogen Interactions

As a consequence of the immune response blood coagula-tion is often exploited by pathogens for reason of infectiveand septic processes For example FG and FV are targeted bybacteria thus offering a straightforward explanation of posi-tive selection (fig 3) We discuss this in terms of FG and FIIfollowed by FIII and FVII and then FV and FX

By taking advantage of the increased number of availablegenome sequences positive selection has been identified inseveral coagulation genes during various stages in evolutionIn particular all three subunits FGA FGB and FGG of FG haveshown positive selection at periods during vertebrate evolu-tion according to both the site model and the branch-sitemodels Similarly prothrombin (FII) has shown positive selec-tion but only across all the vertebrate genomes and notacross any smaller clades Although the immune system isunder strong adaptive constraints because it needs to evolveconstantly against pathogens the coagulation system alsocomprises the first line of immune defense against bacteriato form a clot to block bacterial invasion (Sun 2006) Both FGand prothrombin (FII) are targeted by certain bacteria Forexample FG is targeted during the course of infection bycysteine protease a virulence factor of Streptococcus pyogenes(Matsuka et al 1999) and FG binding protein Efb ofStaphylococcus aureus Similarly FG-like proteins exist inAnopheles gambiae and are involved in the function of theimmune system against malaria or bacterial parasites (Dongand Dimopoulos 2009 Lombardo et al 2013) Thrombin in-teracts with staphylocoagulase a protein secreted by thehuman pathogen S aureus and activates prothrombin with-out proteolysis to form thrombin The resulting

FIG 6 The effect of amino acid replacements on the overall protein stability change G Data are shown for (a and b) FVIII A and B chains (c) FIX (d)FXI Three calculations were performed each being based on the sequences for which crystal structures are known (PDB codes 2R7E [FVIII] 2WPH [FIX]and 2F83 [FXI]) The FVIII A and B chains correspond to the N-terminal residues 19ndash760 and the C-terminal residues 1582ndash2351 that are observed in itscrystal structure The calculation represented by the dashed line indicates the distribution calculated for all 19 possible amino acid replacements Thatrepresented by the dotted line indicates the distribution calculated for all the evolutionary occurring replacements that were observed in our data set of47 genomes That represented by the continuous line indicates the distribution calculated for the disease-causing mutations from the FVIII FIX and FXImutation databases

3050

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

prothrombinmdashstaphylocoagulase complex binds to FG tocleave this into self-polymerizing fibrin This process is centralto the molecular pathology of S aureus endocarditis becausestaphylocoagulase bypasses physiological blood coagulationthrough the evasion of the recognition of prothrombin bycirculating thrombin inhibitors (Panizzi et al 2005)

Positive selection for FIII and FVII is attributed to a host-pathogen interaction with the herpes virus family These cellsurface molecules do not necessarily participate directly in animmune response but can still serve as a convenient gatewayfor pathogen entry (Vallender and Lahn 2004) Herpes virusespromote cellular surface infections by coevolving with theirhuman host indicating a mode by which pathogens exploitthe coagulation system (Sutherland et al 2012) A study of thehost-pathogen driven coevolution of herpes virus revealedthat this virus has existed from the time of mammals anddivided into the HSV-1 and HSV-2 types recently during

human evolution (Kolb et al 2013) These observations sug-gest that the pressure for FIII to evolve in early mammalsespecially through primates could be the reason for the pos-itive selection of FIII across the mammalian clade (fig 3) Wepropose that the FIIIndashFVII interaction is a significant drivingforce for positive selection in FVII

The FV and FVIII cofactors are homologous to each otherin their sequence and domain organization hence it is notsurprising to see similar selection pressures acting on them(fig 3c) A difference arises in the entire vertebrate genome(fig 3a) where FV shows positive selection whereas FVIII doesnot This difference between FV and FVIII across vertebratesmay result from the interaction of FV with Escherichia coliThe extracellular serine protease EspP from E coli cleaves FVto lead to its degradation and thus reduces the activity of thecoagulation cascade (Brunder et al 1997) The activated FVacofactor forms a FVaFXa complex with activated FX (fig 1)

FIG 7 The effect of amino acid replacements at the sequence level on the protein stability change G The residue stability changes are depicted ona seven point scale from highly destabilizing (red) to highly stabilizing (blue) for each of (a) FVIII chain A (b) FVIII chain B (c) FIX and (d) FXI Their PDBcodes are indicated in figure 6 In each panel the upper half shows the G values for all the possible 19 amino acid replacements at each residueposition that is part of the crystal structure and the lower half shows the G values for the disease-causing mutations taken from the mutationdatabases

3051

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

which is important for thrombin formation the most impor-tant step for fibrin clot formation Because both FVa and FXaare central to coagulation and are associated with bacterialproteolysis this offers an explanation for the signatures ofpositive selection in both proteins

FVIII FIX and FXIII are also associated with positive selec-tion however it is not clear what selective pressures are in-volved Although these proteins may also be targeted bypathogens no direct interactions have been reported so far

Absence of Positive Selection for FXI and FXIIFXII is unique in that it does not show positive selection eventhough it is involved in host-pathogen interactions with long-chain inorganic phosphatases (polyP) from microorganisms(Morrissey et al 2012) Small-chain polyP are produced in thehuman brain and by blood platelets Hence even during theabsence of pathogen-originated polyP the human host doesnot lack polyP and there is no pressure to evolve for this (Puyet al 2013) A F12 deficiency in patients does not cause bleed-ing disorders This means that the coagulation cascade can beinitiated even in the absence of F12 For similar reasons FXIshows no positive selection We hypothesize that this is be-cause first no interaction with the environment has beendemonstrated to date and second with respect to the polyPmechanism the coagulation cascade is capable of activatingin a FXI-independent manner

Relationship between Disease-Causing Mutations andSelection

The mechanisms of genetic drift mutational change andevolutionary change are closely related to each otherEvolutionary change is not only dependent on the geneticvariability from mutational change but also on genetic driftwhich describes the random fluctuations in the number ofgenetic variants in an organism Genetic drift and mutationsproduce random variations that selection can act upon Thenew changes incurred in one generation may be cancelled inanother generation It may therefore be important to deter-mine whether a new mutation is lost or becomes commonenough for selection to determine its fate (Masel 2011) In thisstudy we have examined the relationship between naturalselection acting on an amino acid and the disease-causingmutations reported at that amino acid position for thethree prominent coagulation factors FVIII FIX and FXI Wehave shown that the disease-causing mutations are morelikely to occur at a higher frequency at sites under negativeselection in all three proteins (fig 5) As sites under strongselective pressure are more likely to be critical to proteinfunction such a mutation can readily affect protein functionand trigger disease Lethal (or deleterious) mutations thatcause death are de facto not observed in our data set Bycontrast disease-causing mutations occur at a reduced fre-quency in sites under neutral or positive selection suggestingthat these sites are more tolerant to mutations than sitesunder negative selection Interestingly we have shown thatFVIII and FIX have undergone much positive selectionwhereas FXI has no such signatures (fig 4) Despite this dif-ference our results for three proteins show no change in the

relationship between evolutionary selection and disease-causing mutations (figs 5 and 6) In summary the reporteddisease-causing mutations show randomness and the muta-bility of a given amino acid position depends on the selectionpressure acting at that position and not on the entire protein

Effect of Disease-Causing Mutations on the Stability ofProtein Structure

Mutations that change the amino acid sequence can havestriking effects on protein stability Point mutations are re-lated to pathological and genetic conditions by reducing orabolishing coagulation protein function (table 1) Proteinfunction is related to its stability which is often measuredby the change in folding energy G of the protein structure(Stefl et al 2013) We have identified the effect of mutationson the stability of FVIII FIX and FXI in terms of G valuesThe stability effects of polymorphic amino acid changes inhuman sequences follow a Gaussian distribution (de Beeret al 2013) In that study disease-causing mutations alsoshowed the same distribution but slightly skewed towarddestabilization In this study the same trend for aminoacids substitutions in vertebrates were observed but this ismuch stronger for disease-causing mutations (fig 6) The dif-ference between the two studies may be explained by thedifferent methods used de Beer et al (2013) used the DiscreteOptimized Protein (DOPE) score of Modeller as a monitor ofglobal stability whereas we used FoldX (Materials andMethods) as an empirical method to infer directly the stabilityeffect G of a point mutation

Evolutionarily observed mutations are the most stable interms of their protein structure whereas disease-causing mu-tations destabilized the protein structure (figs 6 and 7) Eachsite may or may not tolerate a particular amino acid changedepending on the change in stability When each amino acidwas converted to each of the other 19 possible amino acids(fig 7 top panels) the G values showed that most aminoacids destabilize the structure whereas several may stabilize itThe disease-causing effect of a mutation may be affected notonly by the selection pressure acting on it but also by the typeof the amino acid replaced For the FVIII FIX and FXI bleedingdisorders destabilization was observed for most of the dis-ease-causing mutations

Utility of Evolutionary and Disease-CausingMutational Analyses for Clinical Diagnosis

Interactive locus-specific databases are useful tools for pre-senting patient mutational data for patient diagnosis andcare We plan to incorporate the above data sets on selectivepressures and stability effects into our mutational databasesfor FVIII FIX and FXI This information will provide insight forthe prediction of the effect of disease-causing mutations Thelevel of selective pressure was estimated under a maximum-likelihood framework Such methods are useful for geneticstudies but their results are sometimes difficult to interpretby a nonspecialist The presentation of this information needsto be intuitive to the viewer As shown in supplementaryfigure S3 Supplementary Material online the use of multiple

3052

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

sequence alignments and presentation of stability effectsG structural locations and selective pressures can helpin interpreting the impacts of the mutations This type ofinformation will therefore clarify the interpretation of newlydiscovered disease-causing mutations for clinicians evolu-tionary biologists and protein biochemists This combinedevolutionary-driven study of bleeding disorder-causing muta-tions is the most detailed of its kind as far as we are aware andmay be usefully extended to other biological processes andtheir proteins

Materials and Methods

Genomic Sequences

The flowchart of operations is summarized in supplementaryfigure S1 Supplementary Material online First the completemultiple sequence alignments were retrieved together withtheir gene trees for each of the 14 genes corresponding to the11 coagulation factors (table 1) from the Ensembl Comparadatabase (Vilella et al 2009) using their Ensembl IDs (table 1)and a Perl script Ensembl release 73 was used accessed onSeptember 30 2013 (Flicek et al 2013) These amino acidsequences came from Ensembl as they are generally ofgood quality and contained 74 vertebrate genomes in atotal of 80 genomes We then excluded any sequences thatcame from genomes of low coverage (2 or less) This re-sulted in 47 vertebrate genomes with alignment sizes be-tween 23 and 47 genomes These 47 sets of vertebratesequences provided sufficient genomic coverage to performphylogenetic analysis The 47 vertebrate organisms were clas-sified into five major clades (fig 2) namely Primates Glires(rodents rabbits and pika) Laurasiatheria Sauropsida (rep-tiles and birds) and Fishes (ray-finned fishes) together withminor clades including Atlantogenata and Australianmammals

Sequence Retrieval and Multiple Alignments

The multiple sequence alignments and gene trees fromEnsemblCompara included all the coagulation sequences indifferent species irrespective of their genome coverage Themultiple sequence alignments of the downloaded genes fromthe EnsemblCompara database were constructed using aglobal alignment algorithm provided within this databasefor significant number of sequences of arbitrary lengthsThe data set required for this study comprised 14 genesfrom 47 genomes A thorough quality control procedure ofthe multiple sequence alignments was performed in order togenerate minimum false positive results Thus sequences wereremoved if they showed 41 low-complexity regions orunknown nucleotides (indicated as ldquoNrdquo in a sequence)From these the phylogenetic trees were pruned usingNewick Utilities software (Junier and Zdobnov 2010) to iden-tify the tips corresponding to the retrieved species (fig 2) Thefinal alignments of these sequences were made with PRANKwhich is conservative in that it tends to align only amino acidsderived by substitution and introduces insertions and dele-tions (indels) instead of aligning sequence fragments thatappear too divergent (Loytynoja and Goldman 2008)

PRANK is an advanced probabilistic alignment algorithmthat incorporates phylogenetic information This has beenshown to reduce the number of false-positives when evalu-ating positive selection (Fletcher and Yang 2010 Loytynoja2014) All the PRANK alignments were made using defaultparameters for the empirical codon model and the prunedtrees as guides The resulting alignments are composed ofhighly conserved blocks surrounded by noise due to indelsFinally Gblocks (Castresana 2000) was used to remove thenoise in the unreliable regions of the multiple sequence align-ments which may cause false positives when testing for pos-itive selection to keep only the well-aligned parts The defaultGblocks parameters for the codontype model (parameter ndasht = c) were used with the exception of tolerating half gaps perposition (parameter -b5=h) This procedure resulted in cleanand accurate multiple sequence alignments for the final com-putations of the site and branch-site models using CodeMLtogether with their corresponding gene trees

Analyses of Evolutionary Pressure

For each of the 14 coagulation genes within each of the fivemajor clades the selective pressures were estimated using thedifferent codon substitution site models implemented inCodeML of the phylogenetic analysis with maximum-likeli-hood software (PAML release 47) (Yang 2007) By analyzingthe protein-coding genes at the nucleotide level the synon-ymous or silent changes (nucleotide substitutions that do notchange the encoded amino acid) are distinguished from thenonsynonymous or missense changes (substitutions thatchange the encoded amino acid) Because natural selectionoperates at the protein level and can take any direction ofpositive neutral or negative selection the synonymous andnonsynonymous substitutions are subjected to different se-lective pressures which occur at the rates dS and dN respec-tively Hence the ratio of nonsynonymoussynonymoussubstitution rates (= dNdS) measures the selective pressureat the protein level The value of reveals the direction andstrength of natural selection acting on the protein Values oflt 1 = 1 and 4 1 indicate negative selection neutralevolution and positive selection respectively If an existingfunction or phenotype is evolutionarily favorable then nega-tive selection (or purifying selection) favors the conservationof that function or phenotype (conservation of amino acids)The other is positive selection (or Darwinian selection) whichfavors the promotion of a new function or phenotypeNeutral evolution describes the tolerance to amino acid mu-tations which are neither deleterious nor beneficial and arefixed according to genetic drift

The following five models were used to account for thevariation of dNdS among codon sites M1a (two dNdS ratioswith negative selection [0lt 1] and neutral evolution[1 = 1]) M2a (three dNdS ratios with negative selection[0lt 1] neutral evolution [1 = 1] and positive selection[2 1]) M7 (beta) M8 (beta and 24 1) and M8a (betaand 2 = 1) The alignments contain several positions inwhich one of the sequences is missing and the ldquocleandata = 0rdquooption was selected to retain all sites The inference of positive

3053

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

selection was conducted by performing LRTs in which thefollowing two pairs of models were compared namely M1awith M2a and M8a with M8 The principle is to compare anull model that does not allow24 1 (M1a or M8a) with analternative model that does (M2a or M8) Because the alter-nate model has more parameters than the null model theLRT follows a 2 distribution The LRT is a powerful methodfor comparing two hypotheses where one is the special caseof the other The BEB method (Yang et al 2005) was used tocalculate the posterior probabilities for site classes and theBEB value was used to identify sites under positive selectionwhere the 2 distribution of the LRT was significant withPlt 005 As many different LRT comparisons (differentgenes different clades) were conducted a control for thefalse-discovery rate (expected proportion of false positives)was applied using the q value package in R (Storey andTibshirani 2003) We adjusted the parameters of the q valueto perform a bootstrap analysis (pi0meth = ldquobootstraprdquo) andto tolerate 10 of false positives in our significant results(ldquofdrlevel = 01rdquo) A bootstrap analysis is necessary as the Pvalue distribution is bimodal (Studer et al 2008)

Two sources for the branch-site model were used First theSelectome database at httpselectomeunilch (last accessedJanuary 2014) (Proux et al 2009 Moretti et al 2014) providesthe precomputed results of the branch-site model (supple-mentary table S2b Supplementary Material online) based onthe multiple sequence alignments from the EnsemblCompara database (Vilella et al 2009) The last release ofSelectome (release 6) contains an optimized procedure ofalignment refinement and quality control similar to theone used for our site-model analyses as described aboveThe branch-site model allows the dNdS value to vary notonly between sites but also between branches (Zhang et al2005) The advantage of this model is its ability to detecttraces of episodic positive selection (Studer and Robinson-Rechavi 2009a) Second a branch-site model was determinedbased on our data set (see above) We used the vertebratetrees and their multiple sequence alignments as inputs forbranch-site model calculations The multiple alignments weremanually rechecked to remove barren sequences and thecorresponding vertebrate trees were edited accordinglyusing Newick Utilities Each taxonomic unit also known asthe branch or node of the tree was tagged using an auto-mated python script The CodeML program was run on eachof these tagged trees from all the genes for this branch-sitemodel (supplementary table S2b Supplementary Materialonline) For some alignments we found misaligned short in-sertions up to a maximum of ten amino acids in the con-served blocks probably due to gene model prediction errorsIn our whole data set these errors affect 0ndash4 sequences peralignment As the F2 gene family contained most of thesesmall errors albeit at a very low level this was reanalyzed byremoving four sequences with potential problems from thealignment The recalculation of the likelihood ratio deviatedonly slightly from the initial results (P value of 393e5 for 38sequences vs P value of 217e4 for 34 sequences) This sug-gested that these errors would only have a minor impact ifany on the detection of positive selection especially as the

coagulation factor sequences were long and the model ofcodon substitutions was averaged over all columns (manyhundreds) and sequences (between 23 and 47)

Stability Effect of Mutations

When possible for each coagulation factor the molecularstructures of the domain(s) were retrieved directly fromthe CATH database (Sillitoe et al 2013) Prior to the muta-tional analyses each structure was repaired with the FoldXprogram (Schymkowitz et al 2005) to remove potential stericclashes FoldX utilizes an empirical force field model to esti-mate the stability of a protein (Gwildtype in kcalmol) FoldXwas used to estimate the stability effect on each residueposition when the residue was replaced by any of the 19other amino acids The stability effect is determined bythe difference in the wildtype and mutant energies(G = GmutantGwildtype) This resulted in a stabilitylandscape of all potential mutations including the Gvalues both for the disease mutations and for all the aminoacid substitutions observed across vertebrate evolution

Statistical Analyses

All statistical analyses were performed with R v302 (R CoreTeam 2013) and SigmaPlot (Systat Software San Jose CA)The pipeline was developed with Biopython scripts (Cocket al 2009) The visualization of the multiple sequence align-ments was performed using Jalview 28 (Waterhouse et al2009)

Data Availability

The final sequence alignments are provided on a webpagehttpwwwbiochemuclacukpavithraevolutioncoagulationindexphp (last accessed January 2014) Pathogenic mu-tation data for three human coagulation factors FVIII FIX andFXI were retrieved from our web databases at httpwwwfactorviii-dborg httpwwwfactorixorg and httpwwwfactorxiorg (last accessed January 2014) (Saunders et al2009 Rallapalli et al 2013)

Supplementary MaterialSupplementary tables S1ndashS2 and figures S1ndashS3 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

PMR and SJP thank Pfizer UK for a medical educationalgrant that generously funded this project and the SpecialTrustees of the Royal Free Hospital the KatharineDormandy Trust for Hemophilia and Related Disorders andthe European Association for Hemophilia and AssociatedDisorders for additional support The authors also thank DrKeith Gomez Dr Geoffrey Kemball-Cook and ProfessorEdward G Tuddenham for their generous collaborative sup-port of our studies including access to the FVIII mutations atthe Hemophilia A Database at httpwwwfactorviii-dborgRAS acknowledges funding from the Fondation du 450eme

3054

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

anniversaire de lrsquoUniversite de Lausanne and Swiss NationalScience Foundation grants 132476 and 136477

ReferencesAmara U Rittirsch D Flierl M Bruckner U Klos A Gebhard F Lambris

JD Huber-Lang M 2008 Interaction between the coagulation andcomplement system Adv Exp Med Biol 63271ndash79

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19950ndash958

Broze GJ Jr 1995 Tissue factor pathway inhibitor and the revised theoryof coagulation Annu Rev Med 46103ndash112

Brunder W Schmidt H Karch H 1997 EspP a novel extracellular serineprotease of enterohaemorrhagic Escherichia coli O157H7 cleaveshuman coagulation factor V Mol Microbiol 24767ndash778

Butenas S vanrsquot Veer C Cawthern K Brummel KE Mann KG 2000Models of blood coagulation Blood Coagul Fibrinolysis 11(Suppl 1)S9ndashS13

Castresana J 2000 Selection of conserved blocks from multiple align-ments for their use in phylogenetic analysis Mol Biol Evol 17540ndash552

Cock PJ Antao T Chang JT Chapman BA Cox CJ Dalke A Friedberg IHamelryck T Kauff F Wilczynski B et al 2009 Biopython freelyavailable Python tools for computational molecular biology andbioinformatics Bioinformatics 251422ndash1423

de Beer TA Laskowski RA Parks SL Sipos B Goldman N Thornton JM2013 Amino Acid changes in disease-associated variants differ rad-ically from variants observed in the 1000 genomes project datasetPLoS Comput Biol 91ndash15

Dong Y Dimopoulos G 2009 Anopheles fibrinogen-related proteinsprovide expanded pattern recognition capacity against bacteriaand malaria parasites J Biol Chem 2849835ndash9844

Doolittle RF 2009 Step-by-step evolution of vertebrate blood coagula-tion Cold Spring Harb Symp Quant Biol 7435ndash40

Doolittle RF Jiang Y Nand J 2008 Genomic evidence for a simplerclotting scheme in jawless vertebrates J Mol Evol 66185ndash196

Fletcher W Yang Z 2010 The effect of insertions deletions and align-ment errors on the branch-site test of positive selection Mol BiolEvol 272257ndash2267

Flicek P Ahmed I Amode MR Barrell D Beal K Brent S Carvalho-SilvaD Clapham P Coates G Fairley S et al 2013 Ensembl 2013 NucleicAcids Res 41D48ndashD55

Junier T Zdobnov EM 2010 The Newick utilitieshigh-throughput phy-logenetic tree processing in the UNIX shell Bioinformatics 261669ndash1670

Kawthalkar S 2013 Overview of Physiology of blood In essentials ofhaematology New Delhi (India) Jaypee Brothers Medical Publishers(P) Ltd p 1ndash51

Kolb AW Ane C Brandt CR 2013 Using HSV-1 genome phylogeneticsto track past human migrations PLoS One 81ndash9

Kosiol C Vinar T da Fonseca RR Hubisz MJ Bustamante CD Nielsen RSiepel A 2008 Patterns of positive selection in six Mammalian ge-nomes PLoS Genet 4e1000144

Lombardo F Ghani Y Kafatos FC Christophides GK 2013Comprehensive genetic dissection of the hemocyte immune re-sponse in the malaria mosquito Anopheles gambiae PLoS Pathog9e1003145

Lopez C Saravia C Gomez A Hoebeke J Patarroyo MA 2010Mechanisms of genetically-based resistance to malaria Gene 4671ndash12

Loytynoja A 2014 Phylogeny-aware alignment with PRANK MethodsMol Biol 1079155ndash170

Loytynoja A Goldman N 2008 Phylogeny-aware gap placement pre-vents errors in sequence alignment and evolutionary analysisScience 3201632ndash1635

Masel J 2011 Genetic drift Curr Biol 21R837ndashR838Matsuka YV Pillai S Gubba S Musser JM Olmsted SB 1999 Fibrinogen

cleavage by the Streptococcus pyogenes extracellular cysteine

protease and generation of antibodies that inhibit enzyme proteo-lytic activity Infect Immun 674326ndash4333

Meyer A Van de Peer Y 2005 From 2R to 3R evidence for a fish-specificgenome duplication (FSGD) Bioessays 27937ndash945

Montoya-Burgos JI 2011 Patterns of positive selection and neutral evo-lution in the protein-coding genes of Tetraodon and Takifugu PLoSOne 6e24800

Moretti S Laurenczy B Gharib WH Castella B Kuzniar A Schabauer HStuder RA Valle M Salamin N Stockinger H et al 2014 Selectomeupdate quality control and computational improvements to a data-base of positive selection Nucleic Acids Res 42D917ndashD921

Morrissey JH Choi SH Smith SA 2012 Polyphosphate an ancient mol-ecule that links platelets coagulation and inflammation Blood 1195972ndash5979

Nathwani AC Tuddenham EG Rangarajan S Rosales C McIntosh JLinch DC Chowdary P Riddell A Pie AJ Harrington C et al 2011Adenovirus-associated virus vector-mediated gene transfer in hemo-philia B N Engl J Med 3652357ndash2365

Nielsen R Bustamante C Clark AG Glanowski S Sackton TB Hubisz MJFledel-Alon A Tanenbaum DM Civello D White TJ et al 2005 Ascan for positively selected genes in the genomes of humans andchimpanzees PLoS Biol 3e170

Panizzi P Friedrich R Fuentes-Prior P Richter K Bock PE Bode W 2005Fibrinogen substrate recognition by staphylocoagulase(pro)throm-bin complexes J Biol Chem 131179ndash1187

Proux E Studer RA Moretti S Robinson-Rechavi M 2009 Selectome adatabase of positive selection Nucleic Acids Res 37D404ndashD407

Puy C Tucker EI Wong ZC Gailani D Smith SA Choi SH Morrissey JHGruber A McCarty OJ 2013 Factor XII promotes blood coagulationindependent of factor XI in the presence of long-chain polypho-sphates J Thromb Haemost 111341ndash1352

R Core Team 2013 R a language and environment for statistical com-puting Vienna Austria R Foundation for Statistical Computing

Rallapalli PM Kemball-Cook G Tuddenham EG Gomez K Perkins SJ2013 An interactive mutation database for human coagulationfactor IX provides novel insights into the phenotypes and geneticsof hemophilia B J Thromb Haemost 111329ndash1340

Roux J Privman E Moretti S Daub JT Robinson-Rechavi M Keller L2014 Patterns of positive selection in seven ant genomes Mol BiolEvol 311661ndash1685

Saunders RE Shiltagh N Gomez K Mellars G Cooper C Perry DJTuddenham EG Perkins SJ 2009 Structural analysis of eight noveland 112 previously reported missense mutations in the interactiveFXI mutation database reveals new insight on FXI deficiencyThromb Haemost 102287ndash301

Schymkowitz J Borg J Stricher F Nys R Rousseau F Serrano L 2005 TheFoldX web server an online force field Nucleic Acids Res 33W382ndashW388

Sillitoe I Cuff AL Dessailly BH Dawson NL Furnham N Lee D Lees JGLewis TE Studer RA Rentzsch R et al 2013 New functionalfamilies (FunFams) in CATH to improve the mapping of con-served functional sites to 3D structures Nucleic Acids Res 41D490ndashD498

Spronk HM Govers-Riemslag JW ten CH 2003 The blood coagulationsystem as a molecular machine Bioessays 251220ndash1228

Stefl S Nishi H Petukh M Panchenko AR Alexov E 2013 Molecularmechanisms of disease-causing missense mutations J Mol Biol 4253919ndash3936

Storey JD Tibshirani R 2003 Statistical significance for genomewidestudies Proc Natl Acad Sci U S A 1009440ndash9445

Studer RA Penel S Duret L Robinson-Rechavi M 2008 Pervasive pos-itive selection on duplicated and nonduplicated vertebrate proteincoding genes Genome Res 181393ndash1402

Studer RA Robinson-Rechavi M 2009a Evidence for an episodic modelof protein sequence evolution Biochem Soc Trans 37783ndash786

Studer RA Robinson-Rechavi M 2009b Large-scale analyses of positiveselection using codon models In Pontarotti P editor Evolutionarybiology Berlin (Germany) Springer-Verlag Berlin Heidelbergp 217ndash235

3055

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Sun H 2006 The interaction between pathogens and the host coagu-lation system Physiology 21281ndash288

Sutherland MR Ruf W Pryzdial EL 2012 Tissue factor and glycoproteinC on herpes simplex virus type 1 are protease-activated receptor 2cofactors that enhance infection Blood 1193638ndash3645

Thomas CE Ehrhardt A Kay MA 2003 Progress and problemswith the use of viral vectors for gene therapy Nat Rev Genet 4346ndash358

Tokuriki N1 Stricher F Schymkowitz J Serrano L Tawfik DS 2007 Thestability effects of protein mutations appear to be universally dis-tributed J Mol Biol 3691318ndash1332

Vallender EJ Lahn BT 2004 Positive selection on the human genomeHum Mol Genet 13R245ndashR254

Vilella AJ Severin J Ureta-Vidal A Heng L Durbin R Birney E 2009EnsemblCompara GeneTrees complete duplication-aware phyloge-netic trees in vertebrates Genome Res 19327ndash335

Waterhouse AM Procter JB Martin DM Clamp M Barton GJ 2009Jalview Version 2mdasha multiple sequence alignment editor and anal-ysis workbench Bioinformatics 251189ndash1191

Wong WS Yang Z Goldman N Nielsen R 2004 Accuracy and power ofstatistical methods for detecting adaptive evolution in proteincoding sequences and for identifying positively selected sitesGenetics 1681041ndash1051

Yang Z 2006 Computational molecular evolution Oxford UniversityPress Oxford

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 241586ndash1591

Yang Z Wong WS Nielsen R 2005 Bayes empirical Bayes inference ofamino acid sites under positive selection Mol Biol Evol 221107ndash1118

Zhang J Nielsen R Yang Z 2005 Evaluation of an improved branch-sitelikelihood method for detecting positive selection at the molecularlevel Mol Biol Evol 222472ndash2479

3056

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Selective Pressures Using Codon SubstitutionSite-Models

Codon substitution site-models from the CodeML packagewere used to analyze the protein-coding sequences in the 14genes following the quality control of their alignments Themodels assume that selective pressure does not act equally onthe entire protein sequence but varies between amino acidsites in a gene family We calculated the log likelihood values(lnL) of five different selection models (M1a M2a M7 M8and M8a) in order to perform the likelihood ratio test (LRT)The five models differed from each other in the number ofparameters (np) considered and their variable values Thevalue is the ratio of nonsynonymous and synonymous sub-stitution rates (dNdS) (Materials and Methods) The compar-ison between a null model that does not consider 4 1(M1a M7 or M8a) and an alternate model that considers4 1 (M2a or M8) is a measure of positive selectionIf the null (neutral) model is rejected in favor of thealternate (selection) model positive selection is inferred(Yang 2006)

To evaluate the five models the LRT value was calculatedas follows

LRT frac14 2ethlnLalternate lnLnullTHORN

where lnL is the maximum-likelihood estimate of the proba-bilities of selection The LRT is asymptotically distributed as a2

k function where k (also known as the degrees of freedom) isthe difference between the np in the null and alternatemodels The np values were calculated along with the lnLvalues in CodeML

k frac14 npalternate npnull

The LRT values for which the 2k distribution shows a P

valuelt 005 indicated that some sites were significantlyunder positive selection For this study the comparison ofthe M8a null model with the M8 alternate model (referred toas M8a-M8) was considered optimal to infer positive selection(table 2) The LRT M1a-M2a is too conservative whereas M7ndashM8 was considered to be more problem prone and less ac-curate compared with M8a-M8 (Wong et al 2004 Studer andRobinson-Rechavi 2009b) Because the P values result frommultiple tests performed using different genes at differentclade levels it was necessary to control for the expected pro-portion of false positives False positives were accounted forby calculating the q value correction over the P value for thedifferent genes in each clade this approach being both pow-erful and specific (Studer et al 2008)

The overall picture is that a total of nine coagulation factorproteins (out of the 11 of the cascade) underwent positiveselection at one or more stages in evolution (fig 3) Thehighest number of genes under positive selection within anindividual clade is five (for vertebrates and mammals) ForVertebrates using the LRT M8a-M8 positive selection wasfound in as many as five of the 14 coagulation genes (table2) In figure 3a these included F2 F5 F7 F10 and F13A for all47 vertebrates In figure 3b these included FGA F3 F5 F8 andF9 for the 30 mammals In figure 3cndashe these included FGB F5

and F8 (Primates) FGA and F13B (Laurasiatheria) and F3(Sauropsida) FG (FG = FGA + FGB + FGG) and F5 were ob-served to undergo the most extensive positive selection com-pared with any other coagulation genes The selectivepressure on all the genes in different clades was identifiedfrom the LRT values (table 2) except for those of F11 inLaurasiatheria Sauropsida and Fishes F12 in Sauropsida andFishes and F13B in Fishes because these genes wereabsent (supplementary table S1 Supplementary Materialonline)

Estimation of Selective Pressures Using Branch-SiteModel of Codon Substitution

The above site models estimated selective pressures that varyamong amino acid positions across multiple species But se-lective pressures can also vary among species (Studer andRobinson-Rechavi 2009a) and codon substitution branch-site models have been developed to detect such episodicpositive selection (Zhang et al 2005) Here we have usedthe results of the branch-site model from two differentsources one from the Selectome database and the otherfrom our own calculations Both approaches employedCodeML on the two data sets of this study (Materials andMethods) in order to check for consistency with the positiveselection identified in the site-model calculations (fig 3)Similar to the site model the false positives in our branch-site model calculations were accounted for by calculating theq value correction over the P value for the different genes ineach clade For both branch-site calculations we followed thetaxonomic grouping in the Selectome database (supplemen-tary table S2b Supplementary Material online) The highestorder of classification is the Sarcopterygii (coelacanths lung-fish and tetrapods) followed by its largest subgroupTetrapoda (mammals sauropsids and frogs) Other majortaxonomic subtrees were those of amninots sauropsidsmammals theria and eutheria The Actinopterygii (fishes)and Sauropsida (birds) are common to both the calculationsThe Eutheria and mammalian subtree calculations from theSelectome database were considered to be equivalent to themammalian clade of our analyses (fig 2) Positive selectionacross the vertebrates was observed for all genes except FGGF3 F7 and F11 in our branch-site model calculations (supple-mentary table S2b Supplementary Material online) We ob-tained similar results with our site-model calculations(supplementary table S2a Supplementary Material online)except for the exclusion of F3 and F7 (fig 3a) Our branch-site calculations identified positive selection in mammalianFGB F2 F8 and F9 whereas the site model identified positiveselection in mammalian FGA F3 F5 F8 and F9 Overall FG(FGA FGB) F8 and F9 were consistently identified to be pos-itively selected in mammals whereas F5 shows positive selec-tion up to amniotes after which the mammals and thesauropsids clade split In conclusion the two branch-siteand the site models all suggested signatures of positive selec-tions across F8 and F9 whereas no such selection wasobserved across and between F11

3044

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Identification of Codon Sites under Positive Selection

For up to 11 genes out of 14 identified to be under positiveselection (table 3) further analyses were performed to identifywhich codon sites were under such selection This was doneusing the Bayes Empirical Bayes (BEB) method from the M8site-model (Yang et al 2005) The number of codons poten-tially under positive selection (BEB 450) in the positivelyselected genes ranged between 3 and 64 But the number ofcodons accurately predicted under positive selection (BEB495) is actually much lower (between 0 and 4) The FGAgene in Laurasiatheria showed the highest percentage of pos-itively selected codons (1329 as estimated by CodeML and64 codons detect with two at BEB495) followed by the F5gene in Primates (1321 with 54 codons all BEBlt95 ) Thesmallest percentage of positively selected codons was ob-served in the F10 gene of Vertebrates (001 with threecodons all with BEB lt95) No positive selection was ob-served in any of the codon sites for the FGG F11 and F12genes in the taxa groups of this study

Table 2 LRT Statistics for the Site Model Comparisons of Model 8with Model 8a for the 14 Coagulation Factors among the Five Clades

Gene Clade No ofSequences

No ofCodons

(M8a-M8)

LRT P value q value

FGA Vertebrates 44 236 139 024 035Mammals 29 288 775 001 004Primates 7 865 335 007 016

Glires 7 670 289 009 019Laurasiatheria 9 267 728 001 004

Sauropsida 6 464 033 057 062Fishes 7 592 016 100 072

FGB Vertebrates 40 429 145 023 035Mammals 26 456 278 100 072Primates 6 488 744 001 004

Glires 6 471 069 041 051Laurasiatheria 8 479 144 023 035

Sauropsida 5 473 002 100 072Fishes 7 476 099 100 072

FGG Vertebrates 44 386 174 100 072Mammals 29 427 319 007 017Primates 7 433 398 005 013

Glires 7 433 344 006 016Laurasiatheria 8 432 009 076 072

Sauropsida 5 427 003 086 072Fishes 8 395 542 100 072

F2 Vertebrates 38 475 1691 000 000Mammals 23 571 464 003 011Primates 7 621 425 004 012

Glires 5 611 007 100 072Laurasiatheria 7 620 158 021 034

Sauropsida 5 593 368 005 014Fishes 8 562 005 100 072

F3 Vertebrates 47 166 209 015 027Mammals 26 240 1337 000 000Primates 7 295 001 090 072

Glires 6 287 307 008 018Laurasiatheria 7 292 097 032 044

Sauropsida 5 254 865 000 002Fishes 14 192 082 036 047

F5 Vertebrates 37 1182 1495 000 000Mammals 22 1741 3192 000 000Primates 6 2116 536 002 008

Glires 6 1847 048 049 056Laurasiatheria 6 2043 483 003 010

Sauropsida 5 1550 145 023 035Fishes 8 1534 212 015 027

F7 Vertebrates 34 330 593 001 007Mammals 29 344 306 008 018Primates 7 443 016 069 072

Glires 7 417 03 100 072Laurasiatheria 7 443 057 045 054

Sauropsida 3 425 055 046 054Fishes NA

F8 Vertebrates 38 1028 485 003 010Mammals 24 1985 1119 000 001Primates 7 2351 683 001 004

Glires 5 2257 395 005 013Laurasiatheria 7 2344 025 062 067

Sauropsida 5 1480 003 086 072Fishes 7 1413 09 034 045

F9 Vertebrates 44 329 063 043 052Mammals 24 443 1857 000 000Primates 7 461 128 026 036

Glires 5 468 185 017 030

(continued)

Table 2 Continued

Gene Clade No ofSequences

No ofCodons

(M8a-M8)

LRT P value q value

Laurasiatheria 7 459 434 004 012Sauropsida 5 435 003 100 072

Fishes 13 368 036 100 072

F10 Vertebrates 41 353 1455 000 000Mammals 26 424 023 063 067Primates 7 485 001 091 072

Glires 6 464 206 015 027Laurasiatheria 7 463 23 013 025

Sauropsida 5 385 003 087 072Fishes 8 391 071 040 051

F11 Vertebrates 23 614 034 056 062Mammals 23 614 034 056 062Primates 7 625 005 082 072

Glires 5 619 007 079 072Laurasiatheria 6 625 002 089 072

Sauropsida NAFishes NA

F12 Vertebrates 27 512 398 100 072Mammals 25 533 018 100 072Primates 7 615 002 100 072

Glires 6 557 007 100 072Laurasiatheria 8 513 46 003 011

Sauropsida NAFishes NA

F13A Vertebrates 47 631 907 000 002Mammals 26 684 258 011 022Primates 7 732 162 020 034

Glires 6 730 147 100 072Laurasiatheria 9 730 027 100 072

Sauropsida 5 732 001 092 072Fishes 14 643 259 011 022

F13B Vertebrates 31 609 13 025 036Mammals 25 640 157 021 034Primates 7 660 122 027 037

Glires 5 634 021 100 072Laurasiatheria 7 656 708 001 004

Sauropsida 5 573 093 100 072Fishes NA

NOTEmdashPositive selection is denoted in bold in column 2

Plt 005 Plt 001

3045

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Disease-Causing Missense Mutations and Relationshipto Evolutionary Pressures

The relationship between selective pressure and the proba-bility of disease-causing mutations at a codon site is of greatinterest Disease-causing missense mutations have been re-ported in all 14 coagulation genes in particular forHemophilia A (F8) Hemophilia B (F9) factor XI deficiency(F11) and thrombophilia (F2) (table 1) Extensive compila-tions of 487ndash5474 disease-causing mutations are available forthree of these coagulation factors FVIII FIX and FXI(Saunders et al 2009 Rallapalli et al 2013 Rallapalli et al2014 [unpublished data]) There is a sufficient number ofpathogenic mutation and evolutionary selection data fromthe FVIII FIX and FXI proteins to enable us to compare instatistical detail the correlation between selective pressureand disease-causing mutations

For the comparison of the FVIII FIX and FXI proteinchanges selective pressures were identified using the BEB in-ference method from model M2a (fig 4) While powerful a

disadvantage of Model M8 is that it classifies sites into 11categories (ten in negativeneutral and one positive selec-tion) These 11 categories can vary between different genefamilies making the comparison difficult Thus model M2a isadvantageous in that this has only three distinct categoriesnamely negative selection (dNdSlt 1) neutral evolution (dNdS = 1) and positive selection (dNdS4 1) Under the M1a-M2a test in the primates clade two genes (F8 and F9) showedpositive selection whereas F11 did not show positive selec-tion which differs from that shown in figure 3c which is basedon the M8 model The posterior probabilities of neutral neg-ative and positive selection for the FVIII FIX and FXI proteinsshowed that positive selection occurred in specific codonsthe most being for F8 and the fewest for F11 For completionthe same analysis for the other 11 genes is shown in supple-mentary table S2 Supplementary Material online These showsimilar outcomes to those for FVIII FIX and FXI

For each amino acid position in FVIII FIX and FXI wecompared its selective pressure (positive neutral or negative)

FIG 3 Positively selected genes of the coagulation cascade in vertebrates mammals and three major clades The layout of this figure is a simplifiedrepresentation of figure 1 Genes showing significant positive selection in the LRT calculations are highlighted in red (a) Positive selection was observedin five genes when all 47 genome sequences (vertebrates) were analyzed as a single group (b) In the Mammals group (fig 2) where positive selection isseen for five genes only the (c) Primates and (d) Laurasiatheria clades are shown with positive selection for two or three genes only because no positiveselection was seen in the Glires clade whereas insufficient sequences were available in the Atlantogenata clade for conclusions to be drawn (e) Positiveselection was seen in the Sauropsida clade but no positive selection was observed in Fishes (fig 2)

3046

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

with the reported number of disease-causing mutations Theoutcome was presented using regression plots (fig 5)The regression plots and their correlation coefficients revealedthe relationship between the selection probability values atamino acid sites with the disease-causing mutations at thosesites The average probability of a site to be under positiveselection and its number of disease-causing mutationsshowed a negative relationship with correlation coefficientsof085067 and095 in FVIII FIX and FXI respectively(fig 5 top row) The correlation coefficients for neutral selec-tion followed that of positive selection with values of 087064 and 096 for FVIII FIX and FXI respectively (fig 5middle row) In distinction to these a positive relationshipwas seen between the average probability of negative selec-tion and the disease-causing mutations with correlation co-efficients of +086 +066 and +095 for FVIII FIX and FXIrespectively (fig 5 bottom row) To summarize the fewer thenumber of disease mutations at a given site the higher is theprobability for positive selection or neutral evolution (fig 5

top row and middle row) The greater the number of disease-causing mutations at a given site the higher the probabilityfor negative selection (fig 5 bottom row) The comparisonbetween the positively selected proteins FVIII and FIX withthe nonpositively selected protein FXI is noteworthyIrrespective of the selective pressure all three proteins ex-hibited similar regression lines in each of the three rowsThese comparisons show that there is a general inverse pro-portionality relationship between the probability of positiveselection (or neutral evolution) and the number of disease-causing mutations at each site This confirms previous obser-vations that disease-causing mutations tend to occur at crit-ical sites in the protein and that these highly conserved sitesare less tolerant to new mutations (negative selection)

Stability Effect of Mutations

In order to explain the consequences of disease-causing mu-tations on the protein structure FoldX software was used to

Table 3 Positively Selected Sites in the M8a-M8 Comparison

Gene Clade No ofCodons

Analyzed

No of Codonsunder positive

selection

PercentageEstimated byCodeML ()

Sites under Positive Selection (BEB4 50)

FGA Mammals 288 13 717 71832353951101108178212219246273

FGA Laurasiatheria 267 64 1329 467121929303336405293949697100102103108109116117118123136138141143155171180182195208210213214216218219220222223225227229231232233234236239242244245250253254

256259 260261264265

FGB Primates 488 17 822 394155808182130134135137173290367388413415448

F2 Vertebrates 475 5 099 5257151162381

F3 Mammals 240 11 663 55727691115177179185195230240

F3 Sauropsida 254 13 1060 340414474798081110123170208249

F5 Vertebrates 1182 10 087 1032755745785936027779269951010

F5 Mammals 1741 54 272 237243660134389392656663667669 676680718720731745749765768782809819822824875

88089489590791191692893094596797698910161040107610801089113611391324147114791515

152215481564

F5 Primates 2116 54 1321 74052129180211336341408409434 59266070370772075478487787988889290794195997810331039

10591131116812051209121812201222123012311232123612441254126212651279128514981504

151915261667168218491918

F7 Vertebrates 330 12 522 56526688162167178179209258262

F8 Mammals 1985 27 398 7245281339342521667718765784785792807817889912113011931197

12201229128813081333168619671984

F8 Primates 2351 33 063 42275579181485086088692796496697998899610191071118012821312

136114141421143614381459152715821613162216501689173023402349

F9 Mammals 443 15 440 45100114140150177181199202206209270309348

F10 Vertebrates 353 3 001 14470

F13A Vertebrates 631 6 126 398489497502550613

F13B Laurasiatheria 656 18 152 5567378107341349350374443459461497522525539540624

P4 95 P4 99

3047

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

estimate the stability effect (G) of residue changes on thethree-dimensional structures of human FVIII (chains A and B)FIX and FXI (fig 6) FoldX was designed to predict the stabilityeffect of a mutation This is an empirical method that relies onthe estimation of physical parameters from more than 1000mutations in protein structures and has been evaluated with21 proteins (Schymkowitz et al 2005 Tokuriki et al 2007)First each amino acid at each position was replaced with eachof the other 19 amino acids in order to provide a baseline forcomparison As expected the prediction of all virtual muta-tions shows a distribution skewed toward destabilization (me-dians G range between 051 and 123 kcalmol) Next theeffect of changing the human amino acids with those seen inevolution for the other 46 sequences of this study was calcu-lated in order to examine the effect of evolutionary change onprotein stabilities This produced a narrow distribution pre-dominantly centered on a difference of stability G of0 kcalmol and showing a slight skew toward deleterious mu-tations (medians G range between 014 and 030 kcalmol) This outcome is as expected because evolutionarychanges in the amino acids are selected in order to preservethe stability of the protein structures Finally the effect ofchanging the human amino acids to those observed inunique pathogenic missense mutations was calculated Thisanalysis produced a clearly skewed distribution (mediansG range between 101 and 295 kcalmol) of G

values that ranged as high as 10 kcalmol (and even higherbut these higher values were clipped for easier visualization)this agrees with the fact that these mutations affect the pro-tein structure The first distribution for the virtual mutationsis intermediate between the latter two distributions asexpected

The same outcome of stability effects was also observed atthe individual amino acid level (fig 7) When the effect of 19amino acid replacements were computed at each position forFVIII FIX and FXI and the majority of replacements showed adestabilizing effect (410 kcalmol in red) A small propor-tion of replacements stabilized the structure (lt10 kcalmol in blue) indicating the ability of the protein to compen-sate deleterious mutations by further mutation Although thisappears to occur more strongly for the FVIII chains A and B(fig 7a and b) the proportion of allowed replacements is infact similar for all four proteins when their different sizes areconsidered When the effect of replacements by disease-caus-ing missense mutations was calculated and the G valuesshowed that these almost invariably destabilized the threeprotein structures

DiscussionIn this study we have assembled sequence data for 14 coag-ulation genes across 47 genomes in order to analyze the evo-lutionary changes in amino acids that take place between five

FIG 4 Probability of evolutionary selection across the Primates clade for the three proteins FVIII FIX and FXI The probabilities of negative neutral andpositive selection were plotted against the amino acid position in the three proteins The probability values were obtained from CodeML analyses of theseven primate sequences (gray negative selection yellow neutral selection green positive selection)

3048

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

major clades and several minor ones These changes werecompared directly with disease-causing mutation data forthree coagulation proteins FVIII FIX and FXI Good-qualitypathogenic mutational data are publicly available for thesethree proteins and the bleeding disorders caused by thesethree sets of mutational data correspond to the three mostprevalent bleeding disorders The outcome of these detailedanalyses confirms that positive selection occurred during dif-ferent stages in the evolution of several coagulation proteinsIt was also observed when comparing the protein structurestabilities that the pathogenic mutations have a destabilizingeffect on the protein structure as opposed to the stable mu-tations that are tolerated during evolution (fig 6) The com-bination of positive selection analyses with those forpathogenic mutations provides information about predictingdisease causing mutations and clarifies fresh approaches fortherapeutic interventions in bleeding disorders

Selective Pressures in Vertebrate Evolution

Evolutionary changes combine random mutational eventswith natural selection The probability of a mutation to befixed in a population depends on its consequence on thefitness of an organism New adaptive mutations that are ben-eficial in terms of fitness are more likely to be retained bypositive selection than neutral mutation Neutral mutations

that are neither beneficial nor detrimental are randomly re-tained or removed by genetic drift depending on the popu-lation size (neutral selection) Detrimental mutations aremore likely to be removed from the population by negative(or purifying) selection There are no strict patterns and adeleterious mutation can sometimes be beneficial to oneaspect of the organism The textbook example of this issickle-cell anaemia (drepanocytosis) which is a bleeding dis-order caused by the pGlu6Val mutation in the hemoglobin gene Mutations responsible for sickle-cell anaemia are gen-erally removed by natural selection In malaria-infested pop-ulations particularly in tropical or subtropical regions thesemutations are selected for in one allele because they offerpartial protection against the pathogen (Lopez et al 2010)The malaria-infected sickle cells inhibit infection by the ma-laria plasmodium in heterozygous cases (beneficial) but causesickle-cell anaemia in homozygous cases (deleterious)

Genes corresponding to proteins in direct contact with theenvironment are more likely to be under adaptive selectionThus genes involved in host-pathogen interactions are com-monly associated with positive selection Large-scale analysesrevealing positive selection in vertebrates or arthropods havebeen reported for genes involved with immunity (Nielsenet al 2005 Studer et al 2008 Montoya-Burgos 2011 Rouxet al 2014) In the present detailed study positive selection inseveral genes of the coagulation cascade was identified at

FIG 5 Relationship between disease-causing mutations and evolutionary pressures The correlation between evolutionary selection and the number oftimes a disease-causing mutation occurs at each amino acid position is shown In the columns (a) FVIII (b) FIX and (c) FXI correspond to the threecoagulation factors for which pathogenic mutational information is available in sufficient quantity from our three mutational databases The three rowscorrespond to the posterior probability (based on the BEB model of CodeML) of positive selection neutral evolution and negative selection forprimates

3049

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

different stages in evolution (fig 3) An earlier study identifiedpositive selection across six mammalian genomes (humanchimpanzee macaque mouse rat and dog) using LRTsbased on codon substitution models and false discoveryrates to evaluate multiple LRT comparisons (Kosiol et al2008) Although they found strong positive selection(Plt 005 and false discovery ratelt 005) in nine complementgenes only nominal positive selection (Plt 005) was identi-fied in four coagulation genes (F2 F12 TFPI PROC) none ofwhich were identified in figure 3b A limitation of this earlieranalysis may be the low number of six genomes used Notethat complement is related to coagulation in that the latterhas the ability to activate complement of innate immunitythat mediates host-pathogen interactions (Amara et al 2008)

Positive Selection of the Coagulation Proteins throughHost-Pathogen Interactions

As a consequence of the immune response blood coagula-tion is often exploited by pathogens for reason of infectiveand septic processes For example FG and FV are targeted bybacteria thus offering a straightforward explanation of posi-tive selection (fig 3) We discuss this in terms of FG and FIIfollowed by FIII and FVII and then FV and FX

By taking advantage of the increased number of availablegenome sequences positive selection has been identified inseveral coagulation genes during various stages in evolutionIn particular all three subunits FGA FGB and FGG of FG haveshown positive selection at periods during vertebrate evolu-tion according to both the site model and the branch-sitemodels Similarly prothrombin (FII) has shown positive selec-tion but only across all the vertebrate genomes and notacross any smaller clades Although the immune system isunder strong adaptive constraints because it needs to evolveconstantly against pathogens the coagulation system alsocomprises the first line of immune defense against bacteriato form a clot to block bacterial invasion (Sun 2006) Both FGand prothrombin (FII) are targeted by certain bacteria Forexample FG is targeted during the course of infection bycysteine protease a virulence factor of Streptococcus pyogenes(Matsuka et al 1999) and FG binding protein Efb ofStaphylococcus aureus Similarly FG-like proteins exist inAnopheles gambiae and are involved in the function of theimmune system against malaria or bacterial parasites (Dongand Dimopoulos 2009 Lombardo et al 2013) Thrombin in-teracts with staphylocoagulase a protein secreted by thehuman pathogen S aureus and activates prothrombin with-out proteolysis to form thrombin The resulting

FIG 6 The effect of amino acid replacements on the overall protein stability change G Data are shown for (a and b) FVIII A and B chains (c) FIX (d)FXI Three calculations were performed each being based on the sequences for which crystal structures are known (PDB codes 2R7E [FVIII] 2WPH [FIX]and 2F83 [FXI]) The FVIII A and B chains correspond to the N-terminal residues 19ndash760 and the C-terminal residues 1582ndash2351 that are observed in itscrystal structure The calculation represented by the dashed line indicates the distribution calculated for all 19 possible amino acid replacements Thatrepresented by the dotted line indicates the distribution calculated for all the evolutionary occurring replacements that were observed in our data set of47 genomes That represented by the continuous line indicates the distribution calculated for the disease-causing mutations from the FVIII FIX and FXImutation databases

3050

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

prothrombinmdashstaphylocoagulase complex binds to FG tocleave this into self-polymerizing fibrin This process is centralto the molecular pathology of S aureus endocarditis becausestaphylocoagulase bypasses physiological blood coagulationthrough the evasion of the recognition of prothrombin bycirculating thrombin inhibitors (Panizzi et al 2005)

Positive selection for FIII and FVII is attributed to a host-pathogen interaction with the herpes virus family These cellsurface molecules do not necessarily participate directly in animmune response but can still serve as a convenient gatewayfor pathogen entry (Vallender and Lahn 2004) Herpes virusespromote cellular surface infections by coevolving with theirhuman host indicating a mode by which pathogens exploitthe coagulation system (Sutherland et al 2012) A study of thehost-pathogen driven coevolution of herpes virus revealedthat this virus has existed from the time of mammals anddivided into the HSV-1 and HSV-2 types recently during

human evolution (Kolb et al 2013) These observations sug-gest that the pressure for FIII to evolve in early mammalsespecially through primates could be the reason for the pos-itive selection of FIII across the mammalian clade (fig 3) Wepropose that the FIIIndashFVII interaction is a significant drivingforce for positive selection in FVII

The FV and FVIII cofactors are homologous to each otherin their sequence and domain organization hence it is notsurprising to see similar selection pressures acting on them(fig 3c) A difference arises in the entire vertebrate genome(fig 3a) where FV shows positive selection whereas FVIII doesnot This difference between FV and FVIII across vertebratesmay result from the interaction of FV with Escherichia coliThe extracellular serine protease EspP from E coli cleaves FVto lead to its degradation and thus reduces the activity of thecoagulation cascade (Brunder et al 1997) The activated FVacofactor forms a FVaFXa complex with activated FX (fig 1)

FIG 7 The effect of amino acid replacements at the sequence level on the protein stability change G The residue stability changes are depicted ona seven point scale from highly destabilizing (red) to highly stabilizing (blue) for each of (a) FVIII chain A (b) FVIII chain B (c) FIX and (d) FXI Their PDBcodes are indicated in figure 6 In each panel the upper half shows the G values for all the possible 19 amino acid replacements at each residueposition that is part of the crystal structure and the lower half shows the G values for the disease-causing mutations taken from the mutationdatabases

3051

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

which is important for thrombin formation the most impor-tant step for fibrin clot formation Because both FVa and FXaare central to coagulation and are associated with bacterialproteolysis this offers an explanation for the signatures ofpositive selection in both proteins

FVIII FIX and FXIII are also associated with positive selec-tion however it is not clear what selective pressures are in-volved Although these proteins may also be targeted bypathogens no direct interactions have been reported so far

Absence of Positive Selection for FXI and FXIIFXII is unique in that it does not show positive selection eventhough it is involved in host-pathogen interactions with long-chain inorganic phosphatases (polyP) from microorganisms(Morrissey et al 2012) Small-chain polyP are produced in thehuman brain and by blood platelets Hence even during theabsence of pathogen-originated polyP the human host doesnot lack polyP and there is no pressure to evolve for this (Puyet al 2013) A F12 deficiency in patients does not cause bleed-ing disorders This means that the coagulation cascade can beinitiated even in the absence of F12 For similar reasons FXIshows no positive selection We hypothesize that this is be-cause first no interaction with the environment has beendemonstrated to date and second with respect to the polyPmechanism the coagulation cascade is capable of activatingin a FXI-independent manner

Relationship between Disease-Causing Mutations andSelection

The mechanisms of genetic drift mutational change andevolutionary change are closely related to each otherEvolutionary change is not only dependent on the geneticvariability from mutational change but also on genetic driftwhich describes the random fluctuations in the number ofgenetic variants in an organism Genetic drift and mutationsproduce random variations that selection can act upon Thenew changes incurred in one generation may be cancelled inanother generation It may therefore be important to deter-mine whether a new mutation is lost or becomes commonenough for selection to determine its fate (Masel 2011) In thisstudy we have examined the relationship between naturalselection acting on an amino acid and the disease-causingmutations reported at that amino acid position for thethree prominent coagulation factors FVIII FIX and FXI Wehave shown that the disease-causing mutations are morelikely to occur at a higher frequency at sites under negativeselection in all three proteins (fig 5) As sites under strongselective pressure are more likely to be critical to proteinfunction such a mutation can readily affect protein functionand trigger disease Lethal (or deleterious) mutations thatcause death are de facto not observed in our data set Bycontrast disease-causing mutations occur at a reduced fre-quency in sites under neutral or positive selection suggestingthat these sites are more tolerant to mutations than sitesunder negative selection Interestingly we have shown thatFVIII and FIX have undergone much positive selectionwhereas FXI has no such signatures (fig 4) Despite this dif-ference our results for three proteins show no change in the

relationship between evolutionary selection and disease-causing mutations (figs 5 and 6) In summary the reporteddisease-causing mutations show randomness and the muta-bility of a given amino acid position depends on the selectionpressure acting at that position and not on the entire protein

Effect of Disease-Causing Mutations on the Stability ofProtein Structure

Mutations that change the amino acid sequence can havestriking effects on protein stability Point mutations are re-lated to pathological and genetic conditions by reducing orabolishing coagulation protein function (table 1) Proteinfunction is related to its stability which is often measuredby the change in folding energy G of the protein structure(Stefl et al 2013) We have identified the effect of mutationson the stability of FVIII FIX and FXI in terms of G valuesThe stability effects of polymorphic amino acid changes inhuman sequences follow a Gaussian distribution (de Beeret al 2013) In that study disease-causing mutations alsoshowed the same distribution but slightly skewed towarddestabilization In this study the same trend for aminoacids substitutions in vertebrates were observed but this ismuch stronger for disease-causing mutations (fig 6) The dif-ference between the two studies may be explained by thedifferent methods used de Beer et al (2013) used the DiscreteOptimized Protein (DOPE) score of Modeller as a monitor ofglobal stability whereas we used FoldX (Materials andMethods) as an empirical method to infer directly the stabilityeffect G of a point mutation

Evolutionarily observed mutations are the most stable interms of their protein structure whereas disease-causing mu-tations destabilized the protein structure (figs 6 and 7) Eachsite may or may not tolerate a particular amino acid changedepending on the change in stability When each amino acidwas converted to each of the other 19 possible amino acids(fig 7 top panels) the G values showed that most aminoacids destabilize the structure whereas several may stabilize itThe disease-causing effect of a mutation may be affected notonly by the selection pressure acting on it but also by the typeof the amino acid replaced For the FVIII FIX and FXI bleedingdisorders destabilization was observed for most of the dis-ease-causing mutations

Utility of Evolutionary and Disease-CausingMutational Analyses for Clinical Diagnosis

Interactive locus-specific databases are useful tools for pre-senting patient mutational data for patient diagnosis andcare We plan to incorporate the above data sets on selectivepressures and stability effects into our mutational databasesfor FVIII FIX and FXI This information will provide insight forthe prediction of the effect of disease-causing mutations Thelevel of selective pressure was estimated under a maximum-likelihood framework Such methods are useful for geneticstudies but their results are sometimes difficult to interpretby a nonspecialist The presentation of this information needsto be intuitive to the viewer As shown in supplementaryfigure S3 Supplementary Material online the use of multiple

3052

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

sequence alignments and presentation of stability effectsG structural locations and selective pressures can helpin interpreting the impacts of the mutations This type ofinformation will therefore clarify the interpretation of newlydiscovered disease-causing mutations for clinicians evolu-tionary biologists and protein biochemists This combinedevolutionary-driven study of bleeding disorder-causing muta-tions is the most detailed of its kind as far as we are aware andmay be usefully extended to other biological processes andtheir proteins

Materials and Methods

Genomic Sequences

The flowchart of operations is summarized in supplementaryfigure S1 Supplementary Material online First the completemultiple sequence alignments were retrieved together withtheir gene trees for each of the 14 genes corresponding to the11 coagulation factors (table 1) from the Ensembl Comparadatabase (Vilella et al 2009) using their Ensembl IDs (table 1)and a Perl script Ensembl release 73 was used accessed onSeptember 30 2013 (Flicek et al 2013) These amino acidsequences came from Ensembl as they are generally ofgood quality and contained 74 vertebrate genomes in atotal of 80 genomes We then excluded any sequences thatcame from genomes of low coverage (2 or less) This re-sulted in 47 vertebrate genomes with alignment sizes be-tween 23 and 47 genomes These 47 sets of vertebratesequences provided sufficient genomic coverage to performphylogenetic analysis The 47 vertebrate organisms were clas-sified into five major clades (fig 2) namely Primates Glires(rodents rabbits and pika) Laurasiatheria Sauropsida (rep-tiles and birds) and Fishes (ray-finned fishes) together withminor clades including Atlantogenata and Australianmammals

Sequence Retrieval and Multiple Alignments

The multiple sequence alignments and gene trees fromEnsemblCompara included all the coagulation sequences indifferent species irrespective of their genome coverage Themultiple sequence alignments of the downloaded genes fromthe EnsemblCompara database were constructed using aglobal alignment algorithm provided within this databasefor significant number of sequences of arbitrary lengthsThe data set required for this study comprised 14 genesfrom 47 genomes A thorough quality control procedure ofthe multiple sequence alignments was performed in order togenerate minimum false positive results Thus sequences wereremoved if they showed 41 low-complexity regions orunknown nucleotides (indicated as ldquoNrdquo in a sequence)From these the phylogenetic trees were pruned usingNewick Utilities software (Junier and Zdobnov 2010) to iden-tify the tips corresponding to the retrieved species (fig 2) Thefinal alignments of these sequences were made with PRANKwhich is conservative in that it tends to align only amino acidsderived by substitution and introduces insertions and dele-tions (indels) instead of aligning sequence fragments thatappear too divergent (Loytynoja and Goldman 2008)

PRANK is an advanced probabilistic alignment algorithmthat incorporates phylogenetic information This has beenshown to reduce the number of false-positives when evalu-ating positive selection (Fletcher and Yang 2010 Loytynoja2014) All the PRANK alignments were made using defaultparameters for the empirical codon model and the prunedtrees as guides The resulting alignments are composed ofhighly conserved blocks surrounded by noise due to indelsFinally Gblocks (Castresana 2000) was used to remove thenoise in the unreliable regions of the multiple sequence align-ments which may cause false positives when testing for pos-itive selection to keep only the well-aligned parts The defaultGblocks parameters for the codontype model (parameter ndasht = c) were used with the exception of tolerating half gaps perposition (parameter -b5=h) This procedure resulted in cleanand accurate multiple sequence alignments for the final com-putations of the site and branch-site models using CodeMLtogether with their corresponding gene trees

Analyses of Evolutionary Pressure

For each of the 14 coagulation genes within each of the fivemajor clades the selective pressures were estimated using thedifferent codon substitution site models implemented inCodeML of the phylogenetic analysis with maximum-likeli-hood software (PAML release 47) (Yang 2007) By analyzingthe protein-coding genes at the nucleotide level the synon-ymous or silent changes (nucleotide substitutions that do notchange the encoded amino acid) are distinguished from thenonsynonymous or missense changes (substitutions thatchange the encoded amino acid) Because natural selectionoperates at the protein level and can take any direction ofpositive neutral or negative selection the synonymous andnonsynonymous substitutions are subjected to different se-lective pressures which occur at the rates dS and dN respec-tively Hence the ratio of nonsynonymoussynonymoussubstitution rates (= dNdS) measures the selective pressureat the protein level The value of reveals the direction andstrength of natural selection acting on the protein Values oflt 1 = 1 and 4 1 indicate negative selection neutralevolution and positive selection respectively If an existingfunction or phenotype is evolutionarily favorable then nega-tive selection (or purifying selection) favors the conservationof that function or phenotype (conservation of amino acids)The other is positive selection (or Darwinian selection) whichfavors the promotion of a new function or phenotypeNeutral evolution describes the tolerance to amino acid mu-tations which are neither deleterious nor beneficial and arefixed according to genetic drift

The following five models were used to account for thevariation of dNdS among codon sites M1a (two dNdS ratioswith negative selection [0lt 1] and neutral evolution[1 = 1]) M2a (three dNdS ratios with negative selection[0lt 1] neutral evolution [1 = 1] and positive selection[2 1]) M7 (beta) M8 (beta and 24 1) and M8a (betaand 2 = 1) The alignments contain several positions inwhich one of the sequences is missing and the ldquocleandata = 0rdquooption was selected to retain all sites The inference of positive

3053

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

selection was conducted by performing LRTs in which thefollowing two pairs of models were compared namely M1awith M2a and M8a with M8 The principle is to compare anull model that does not allow24 1 (M1a or M8a) with analternative model that does (M2a or M8) Because the alter-nate model has more parameters than the null model theLRT follows a 2 distribution The LRT is a powerful methodfor comparing two hypotheses where one is the special caseof the other The BEB method (Yang et al 2005) was used tocalculate the posterior probabilities for site classes and theBEB value was used to identify sites under positive selectionwhere the 2 distribution of the LRT was significant withPlt 005 As many different LRT comparisons (differentgenes different clades) were conducted a control for thefalse-discovery rate (expected proportion of false positives)was applied using the q value package in R (Storey andTibshirani 2003) We adjusted the parameters of the q valueto perform a bootstrap analysis (pi0meth = ldquobootstraprdquo) andto tolerate 10 of false positives in our significant results(ldquofdrlevel = 01rdquo) A bootstrap analysis is necessary as the Pvalue distribution is bimodal (Studer et al 2008)

Two sources for the branch-site model were used First theSelectome database at httpselectomeunilch (last accessedJanuary 2014) (Proux et al 2009 Moretti et al 2014) providesthe precomputed results of the branch-site model (supple-mentary table S2b Supplementary Material online) based onthe multiple sequence alignments from the EnsemblCompara database (Vilella et al 2009) The last release ofSelectome (release 6) contains an optimized procedure ofalignment refinement and quality control similar to theone used for our site-model analyses as described aboveThe branch-site model allows the dNdS value to vary notonly between sites but also between branches (Zhang et al2005) The advantage of this model is its ability to detecttraces of episodic positive selection (Studer and Robinson-Rechavi 2009a) Second a branch-site model was determinedbased on our data set (see above) We used the vertebratetrees and their multiple sequence alignments as inputs forbranch-site model calculations The multiple alignments weremanually rechecked to remove barren sequences and thecorresponding vertebrate trees were edited accordinglyusing Newick Utilities Each taxonomic unit also known asthe branch or node of the tree was tagged using an auto-mated python script The CodeML program was run on eachof these tagged trees from all the genes for this branch-sitemodel (supplementary table S2b Supplementary Materialonline) For some alignments we found misaligned short in-sertions up to a maximum of ten amino acids in the con-served blocks probably due to gene model prediction errorsIn our whole data set these errors affect 0ndash4 sequences peralignment As the F2 gene family contained most of thesesmall errors albeit at a very low level this was reanalyzed byremoving four sequences with potential problems from thealignment The recalculation of the likelihood ratio deviatedonly slightly from the initial results (P value of 393e5 for 38sequences vs P value of 217e4 for 34 sequences) This sug-gested that these errors would only have a minor impact ifany on the detection of positive selection especially as the

coagulation factor sequences were long and the model ofcodon substitutions was averaged over all columns (manyhundreds) and sequences (between 23 and 47)

Stability Effect of Mutations

When possible for each coagulation factor the molecularstructures of the domain(s) were retrieved directly fromthe CATH database (Sillitoe et al 2013) Prior to the muta-tional analyses each structure was repaired with the FoldXprogram (Schymkowitz et al 2005) to remove potential stericclashes FoldX utilizes an empirical force field model to esti-mate the stability of a protein (Gwildtype in kcalmol) FoldXwas used to estimate the stability effect on each residueposition when the residue was replaced by any of the 19other amino acids The stability effect is determined bythe difference in the wildtype and mutant energies(G = GmutantGwildtype) This resulted in a stabilitylandscape of all potential mutations including the Gvalues both for the disease mutations and for all the aminoacid substitutions observed across vertebrate evolution

Statistical Analyses

All statistical analyses were performed with R v302 (R CoreTeam 2013) and SigmaPlot (Systat Software San Jose CA)The pipeline was developed with Biopython scripts (Cocket al 2009) The visualization of the multiple sequence align-ments was performed using Jalview 28 (Waterhouse et al2009)

Data Availability

The final sequence alignments are provided on a webpagehttpwwwbiochemuclacukpavithraevolutioncoagulationindexphp (last accessed January 2014) Pathogenic mu-tation data for three human coagulation factors FVIII FIX andFXI were retrieved from our web databases at httpwwwfactorviii-dborg httpwwwfactorixorg and httpwwwfactorxiorg (last accessed January 2014) (Saunders et al2009 Rallapalli et al 2013)

Supplementary MaterialSupplementary tables S1ndashS2 and figures S1ndashS3 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

PMR and SJP thank Pfizer UK for a medical educationalgrant that generously funded this project and the SpecialTrustees of the Royal Free Hospital the KatharineDormandy Trust for Hemophilia and Related Disorders andthe European Association for Hemophilia and AssociatedDisorders for additional support The authors also thank DrKeith Gomez Dr Geoffrey Kemball-Cook and ProfessorEdward G Tuddenham for their generous collaborative sup-port of our studies including access to the FVIII mutations atthe Hemophilia A Database at httpwwwfactorviii-dborgRAS acknowledges funding from the Fondation du 450eme

3054

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

anniversaire de lrsquoUniversite de Lausanne and Swiss NationalScience Foundation grants 132476 and 136477

ReferencesAmara U Rittirsch D Flierl M Bruckner U Klos A Gebhard F Lambris

JD Huber-Lang M 2008 Interaction between the coagulation andcomplement system Adv Exp Med Biol 63271ndash79

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19950ndash958

Broze GJ Jr 1995 Tissue factor pathway inhibitor and the revised theoryof coagulation Annu Rev Med 46103ndash112

Brunder W Schmidt H Karch H 1997 EspP a novel extracellular serineprotease of enterohaemorrhagic Escherichia coli O157H7 cleaveshuman coagulation factor V Mol Microbiol 24767ndash778

Butenas S vanrsquot Veer C Cawthern K Brummel KE Mann KG 2000Models of blood coagulation Blood Coagul Fibrinolysis 11(Suppl 1)S9ndashS13

Castresana J 2000 Selection of conserved blocks from multiple align-ments for their use in phylogenetic analysis Mol Biol Evol 17540ndash552

Cock PJ Antao T Chang JT Chapman BA Cox CJ Dalke A Friedberg IHamelryck T Kauff F Wilczynski B et al 2009 Biopython freelyavailable Python tools for computational molecular biology andbioinformatics Bioinformatics 251422ndash1423

de Beer TA Laskowski RA Parks SL Sipos B Goldman N Thornton JM2013 Amino Acid changes in disease-associated variants differ rad-ically from variants observed in the 1000 genomes project datasetPLoS Comput Biol 91ndash15

Dong Y Dimopoulos G 2009 Anopheles fibrinogen-related proteinsprovide expanded pattern recognition capacity against bacteriaand malaria parasites J Biol Chem 2849835ndash9844

Doolittle RF 2009 Step-by-step evolution of vertebrate blood coagula-tion Cold Spring Harb Symp Quant Biol 7435ndash40

Doolittle RF Jiang Y Nand J 2008 Genomic evidence for a simplerclotting scheme in jawless vertebrates J Mol Evol 66185ndash196

Fletcher W Yang Z 2010 The effect of insertions deletions and align-ment errors on the branch-site test of positive selection Mol BiolEvol 272257ndash2267

Flicek P Ahmed I Amode MR Barrell D Beal K Brent S Carvalho-SilvaD Clapham P Coates G Fairley S et al 2013 Ensembl 2013 NucleicAcids Res 41D48ndashD55

Junier T Zdobnov EM 2010 The Newick utilitieshigh-throughput phy-logenetic tree processing in the UNIX shell Bioinformatics 261669ndash1670

Kawthalkar S 2013 Overview of Physiology of blood In essentials ofhaematology New Delhi (India) Jaypee Brothers Medical Publishers(P) Ltd p 1ndash51

Kolb AW Ane C Brandt CR 2013 Using HSV-1 genome phylogeneticsto track past human migrations PLoS One 81ndash9

Kosiol C Vinar T da Fonseca RR Hubisz MJ Bustamante CD Nielsen RSiepel A 2008 Patterns of positive selection in six Mammalian ge-nomes PLoS Genet 4e1000144

Lombardo F Ghani Y Kafatos FC Christophides GK 2013Comprehensive genetic dissection of the hemocyte immune re-sponse in the malaria mosquito Anopheles gambiae PLoS Pathog9e1003145

Lopez C Saravia C Gomez A Hoebeke J Patarroyo MA 2010Mechanisms of genetically-based resistance to malaria Gene 4671ndash12

Loytynoja A 2014 Phylogeny-aware alignment with PRANK MethodsMol Biol 1079155ndash170

Loytynoja A Goldman N 2008 Phylogeny-aware gap placement pre-vents errors in sequence alignment and evolutionary analysisScience 3201632ndash1635

Masel J 2011 Genetic drift Curr Biol 21R837ndashR838Matsuka YV Pillai S Gubba S Musser JM Olmsted SB 1999 Fibrinogen

cleavage by the Streptococcus pyogenes extracellular cysteine

protease and generation of antibodies that inhibit enzyme proteo-lytic activity Infect Immun 674326ndash4333

Meyer A Van de Peer Y 2005 From 2R to 3R evidence for a fish-specificgenome duplication (FSGD) Bioessays 27937ndash945

Montoya-Burgos JI 2011 Patterns of positive selection and neutral evo-lution in the protein-coding genes of Tetraodon and Takifugu PLoSOne 6e24800

Moretti S Laurenczy B Gharib WH Castella B Kuzniar A Schabauer HStuder RA Valle M Salamin N Stockinger H et al 2014 Selectomeupdate quality control and computational improvements to a data-base of positive selection Nucleic Acids Res 42D917ndashD921

Morrissey JH Choi SH Smith SA 2012 Polyphosphate an ancient mol-ecule that links platelets coagulation and inflammation Blood 1195972ndash5979

Nathwani AC Tuddenham EG Rangarajan S Rosales C McIntosh JLinch DC Chowdary P Riddell A Pie AJ Harrington C et al 2011Adenovirus-associated virus vector-mediated gene transfer in hemo-philia B N Engl J Med 3652357ndash2365

Nielsen R Bustamante C Clark AG Glanowski S Sackton TB Hubisz MJFledel-Alon A Tanenbaum DM Civello D White TJ et al 2005 Ascan for positively selected genes in the genomes of humans andchimpanzees PLoS Biol 3e170

Panizzi P Friedrich R Fuentes-Prior P Richter K Bock PE Bode W 2005Fibrinogen substrate recognition by staphylocoagulase(pro)throm-bin complexes J Biol Chem 131179ndash1187

Proux E Studer RA Moretti S Robinson-Rechavi M 2009 Selectome adatabase of positive selection Nucleic Acids Res 37D404ndashD407

Puy C Tucker EI Wong ZC Gailani D Smith SA Choi SH Morrissey JHGruber A McCarty OJ 2013 Factor XII promotes blood coagulationindependent of factor XI in the presence of long-chain polypho-sphates J Thromb Haemost 111341ndash1352

R Core Team 2013 R a language and environment for statistical com-puting Vienna Austria R Foundation for Statistical Computing

Rallapalli PM Kemball-Cook G Tuddenham EG Gomez K Perkins SJ2013 An interactive mutation database for human coagulationfactor IX provides novel insights into the phenotypes and geneticsof hemophilia B J Thromb Haemost 111329ndash1340

Roux J Privman E Moretti S Daub JT Robinson-Rechavi M Keller L2014 Patterns of positive selection in seven ant genomes Mol BiolEvol 311661ndash1685

Saunders RE Shiltagh N Gomez K Mellars G Cooper C Perry DJTuddenham EG Perkins SJ 2009 Structural analysis of eight noveland 112 previously reported missense mutations in the interactiveFXI mutation database reveals new insight on FXI deficiencyThromb Haemost 102287ndash301

Schymkowitz J Borg J Stricher F Nys R Rousseau F Serrano L 2005 TheFoldX web server an online force field Nucleic Acids Res 33W382ndashW388

Sillitoe I Cuff AL Dessailly BH Dawson NL Furnham N Lee D Lees JGLewis TE Studer RA Rentzsch R et al 2013 New functionalfamilies (FunFams) in CATH to improve the mapping of con-served functional sites to 3D structures Nucleic Acids Res 41D490ndashD498

Spronk HM Govers-Riemslag JW ten CH 2003 The blood coagulationsystem as a molecular machine Bioessays 251220ndash1228

Stefl S Nishi H Petukh M Panchenko AR Alexov E 2013 Molecularmechanisms of disease-causing missense mutations J Mol Biol 4253919ndash3936

Storey JD Tibshirani R 2003 Statistical significance for genomewidestudies Proc Natl Acad Sci U S A 1009440ndash9445

Studer RA Penel S Duret L Robinson-Rechavi M 2008 Pervasive pos-itive selection on duplicated and nonduplicated vertebrate proteincoding genes Genome Res 181393ndash1402

Studer RA Robinson-Rechavi M 2009a Evidence for an episodic modelof protein sequence evolution Biochem Soc Trans 37783ndash786

Studer RA Robinson-Rechavi M 2009b Large-scale analyses of positiveselection using codon models In Pontarotti P editor Evolutionarybiology Berlin (Germany) Springer-Verlag Berlin Heidelbergp 217ndash235

3055

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Sun H 2006 The interaction between pathogens and the host coagu-lation system Physiology 21281ndash288

Sutherland MR Ruf W Pryzdial EL 2012 Tissue factor and glycoproteinC on herpes simplex virus type 1 are protease-activated receptor 2cofactors that enhance infection Blood 1193638ndash3645

Thomas CE Ehrhardt A Kay MA 2003 Progress and problemswith the use of viral vectors for gene therapy Nat Rev Genet 4346ndash358

Tokuriki N1 Stricher F Schymkowitz J Serrano L Tawfik DS 2007 Thestability effects of protein mutations appear to be universally dis-tributed J Mol Biol 3691318ndash1332

Vallender EJ Lahn BT 2004 Positive selection on the human genomeHum Mol Genet 13R245ndashR254

Vilella AJ Severin J Ureta-Vidal A Heng L Durbin R Birney E 2009EnsemblCompara GeneTrees complete duplication-aware phyloge-netic trees in vertebrates Genome Res 19327ndash335

Waterhouse AM Procter JB Martin DM Clamp M Barton GJ 2009Jalview Version 2mdasha multiple sequence alignment editor and anal-ysis workbench Bioinformatics 251189ndash1191

Wong WS Yang Z Goldman N Nielsen R 2004 Accuracy and power ofstatistical methods for detecting adaptive evolution in proteincoding sequences and for identifying positively selected sitesGenetics 1681041ndash1051

Yang Z 2006 Computational molecular evolution Oxford UniversityPress Oxford

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 241586ndash1591

Yang Z Wong WS Nielsen R 2005 Bayes empirical Bayes inference ofamino acid sites under positive selection Mol Biol Evol 221107ndash1118

Zhang J Nielsen R Yang Z 2005 Evaluation of an improved branch-sitelikelihood method for detecting positive selection at the molecularlevel Mol Biol Evol 222472ndash2479

3056

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Identification of Codon Sites under Positive Selection

For up to 11 genes out of 14 identified to be under positiveselection (table 3) further analyses were performed to identifywhich codon sites were under such selection This was doneusing the Bayes Empirical Bayes (BEB) method from the M8site-model (Yang et al 2005) The number of codons poten-tially under positive selection (BEB 450) in the positivelyselected genes ranged between 3 and 64 But the number ofcodons accurately predicted under positive selection (BEB495) is actually much lower (between 0 and 4) The FGAgene in Laurasiatheria showed the highest percentage of pos-itively selected codons (1329 as estimated by CodeML and64 codons detect with two at BEB495) followed by the F5gene in Primates (1321 with 54 codons all BEBlt95 ) Thesmallest percentage of positively selected codons was ob-served in the F10 gene of Vertebrates (001 with threecodons all with BEB lt95) No positive selection was ob-served in any of the codon sites for the FGG F11 and F12genes in the taxa groups of this study

Table 2 LRT Statistics for the Site Model Comparisons of Model 8with Model 8a for the 14 Coagulation Factors among the Five Clades

Gene Clade No ofSequences

No ofCodons

(M8a-M8)

LRT P value q value

FGA Vertebrates 44 236 139 024 035Mammals 29 288 775 001 004Primates 7 865 335 007 016

Glires 7 670 289 009 019Laurasiatheria 9 267 728 001 004

Sauropsida 6 464 033 057 062Fishes 7 592 016 100 072

FGB Vertebrates 40 429 145 023 035Mammals 26 456 278 100 072Primates 6 488 744 001 004

Glires 6 471 069 041 051Laurasiatheria 8 479 144 023 035

Sauropsida 5 473 002 100 072Fishes 7 476 099 100 072

FGG Vertebrates 44 386 174 100 072Mammals 29 427 319 007 017Primates 7 433 398 005 013

Glires 7 433 344 006 016Laurasiatheria 8 432 009 076 072

Sauropsida 5 427 003 086 072Fishes 8 395 542 100 072

F2 Vertebrates 38 475 1691 000 000Mammals 23 571 464 003 011Primates 7 621 425 004 012

Glires 5 611 007 100 072Laurasiatheria 7 620 158 021 034

Sauropsida 5 593 368 005 014Fishes 8 562 005 100 072

F3 Vertebrates 47 166 209 015 027Mammals 26 240 1337 000 000Primates 7 295 001 090 072

Glires 6 287 307 008 018Laurasiatheria 7 292 097 032 044

Sauropsida 5 254 865 000 002Fishes 14 192 082 036 047

F5 Vertebrates 37 1182 1495 000 000Mammals 22 1741 3192 000 000Primates 6 2116 536 002 008

Glires 6 1847 048 049 056Laurasiatheria 6 2043 483 003 010

Sauropsida 5 1550 145 023 035Fishes 8 1534 212 015 027

F7 Vertebrates 34 330 593 001 007Mammals 29 344 306 008 018Primates 7 443 016 069 072

Glires 7 417 03 100 072Laurasiatheria 7 443 057 045 054

Sauropsida 3 425 055 046 054Fishes NA

F8 Vertebrates 38 1028 485 003 010Mammals 24 1985 1119 000 001Primates 7 2351 683 001 004

Glires 5 2257 395 005 013Laurasiatheria 7 2344 025 062 067

Sauropsida 5 1480 003 086 072Fishes 7 1413 09 034 045

F9 Vertebrates 44 329 063 043 052Mammals 24 443 1857 000 000Primates 7 461 128 026 036

Glires 5 468 185 017 030

(continued)

Table 2 Continued

Gene Clade No ofSequences

No ofCodons

(M8a-M8)

LRT P value q value

Laurasiatheria 7 459 434 004 012Sauropsida 5 435 003 100 072

Fishes 13 368 036 100 072

F10 Vertebrates 41 353 1455 000 000Mammals 26 424 023 063 067Primates 7 485 001 091 072

Glires 6 464 206 015 027Laurasiatheria 7 463 23 013 025

Sauropsida 5 385 003 087 072Fishes 8 391 071 040 051

F11 Vertebrates 23 614 034 056 062Mammals 23 614 034 056 062Primates 7 625 005 082 072

Glires 5 619 007 079 072Laurasiatheria 6 625 002 089 072

Sauropsida NAFishes NA

F12 Vertebrates 27 512 398 100 072Mammals 25 533 018 100 072Primates 7 615 002 100 072

Glires 6 557 007 100 072Laurasiatheria 8 513 46 003 011

Sauropsida NAFishes NA

F13A Vertebrates 47 631 907 000 002Mammals 26 684 258 011 022Primates 7 732 162 020 034

Glires 6 730 147 100 072Laurasiatheria 9 730 027 100 072

Sauropsida 5 732 001 092 072Fishes 14 643 259 011 022

F13B Vertebrates 31 609 13 025 036Mammals 25 640 157 021 034Primates 7 660 122 027 037

Glires 5 634 021 100 072Laurasiatheria 7 656 708 001 004

Sauropsida 5 573 093 100 072Fishes NA

NOTEmdashPositive selection is denoted in bold in column 2

Plt 005 Plt 001

3045

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Disease-Causing Missense Mutations and Relationshipto Evolutionary Pressures

The relationship between selective pressure and the proba-bility of disease-causing mutations at a codon site is of greatinterest Disease-causing missense mutations have been re-ported in all 14 coagulation genes in particular forHemophilia A (F8) Hemophilia B (F9) factor XI deficiency(F11) and thrombophilia (F2) (table 1) Extensive compila-tions of 487ndash5474 disease-causing mutations are available forthree of these coagulation factors FVIII FIX and FXI(Saunders et al 2009 Rallapalli et al 2013 Rallapalli et al2014 [unpublished data]) There is a sufficient number ofpathogenic mutation and evolutionary selection data fromthe FVIII FIX and FXI proteins to enable us to compare instatistical detail the correlation between selective pressureand disease-causing mutations

For the comparison of the FVIII FIX and FXI proteinchanges selective pressures were identified using the BEB in-ference method from model M2a (fig 4) While powerful a

disadvantage of Model M8 is that it classifies sites into 11categories (ten in negativeneutral and one positive selec-tion) These 11 categories can vary between different genefamilies making the comparison difficult Thus model M2a isadvantageous in that this has only three distinct categoriesnamely negative selection (dNdSlt 1) neutral evolution (dNdS = 1) and positive selection (dNdS4 1) Under the M1a-M2a test in the primates clade two genes (F8 and F9) showedpositive selection whereas F11 did not show positive selec-tion which differs from that shown in figure 3c which is basedon the M8 model The posterior probabilities of neutral neg-ative and positive selection for the FVIII FIX and FXI proteinsshowed that positive selection occurred in specific codonsthe most being for F8 and the fewest for F11 For completionthe same analysis for the other 11 genes is shown in supple-mentary table S2 Supplementary Material online These showsimilar outcomes to those for FVIII FIX and FXI

For each amino acid position in FVIII FIX and FXI wecompared its selective pressure (positive neutral or negative)

FIG 3 Positively selected genes of the coagulation cascade in vertebrates mammals and three major clades The layout of this figure is a simplifiedrepresentation of figure 1 Genes showing significant positive selection in the LRT calculations are highlighted in red (a) Positive selection was observedin five genes when all 47 genome sequences (vertebrates) were analyzed as a single group (b) In the Mammals group (fig 2) where positive selection isseen for five genes only the (c) Primates and (d) Laurasiatheria clades are shown with positive selection for two or three genes only because no positiveselection was seen in the Glires clade whereas insufficient sequences were available in the Atlantogenata clade for conclusions to be drawn (e) Positiveselection was seen in the Sauropsida clade but no positive selection was observed in Fishes (fig 2)

3046

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

with the reported number of disease-causing mutations Theoutcome was presented using regression plots (fig 5)The regression plots and their correlation coefficients revealedthe relationship between the selection probability values atamino acid sites with the disease-causing mutations at thosesites The average probability of a site to be under positiveselection and its number of disease-causing mutationsshowed a negative relationship with correlation coefficientsof085067 and095 in FVIII FIX and FXI respectively(fig 5 top row) The correlation coefficients for neutral selec-tion followed that of positive selection with values of 087064 and 096 for FVIII FIX and FXI respectively (fig 5middle row) In distinction to these a positive relationshipwas seen between the average probability of negative selec-tion and the disease-causing mutations with correlation co-efficients of +086 +066 and +095 for FVIII FIX and FXIrespectively (fig 5 bottom row) To summarize the fewer thenumber of disease mutations at a given site the higher is theprobability for positive selection or neutral evolution (fig 5

top row and middle row) The greater the number of disease-causing mutations at a given site the higher the probabilityfor negative selection (fig 5 bottom row) The comparisonbetween the positively selected proteins FVIII and FIX withthe nonpositively selected protein FXI is noteworthyIrrespective of the selective pressure all three proteins ex-hibited similar regression lines in each of the three rowsThese comparisons show that there is a general inverse pro-portionality relationship between the probability of positiveselection (or neutral evolution) and the number of disease-causing mutations at each site This confirms previous obser-vations that disease-causing mutations tend to occur at crit-ical sites in the protein and that these highly conserved sitesare less tolerant to new mutations (negative selection)

Stability Effect of Mutations

In order to explain the consequences of disease-causing mu-tations on the protein structure FoldX software was used to

Table 3 Positively Selected Sites in the M8a-M8 Comparison

Gene Clade No ofCodons

Analyzed

No of Codonsunder positive

selection

PercentageEstimated byCodeML ()

Sites under Positive Selection (BEB4 50)

FGA Mammals 288 13 717 71832353951101108178212219246273

FGA Laurasiatheria 267 64 1329 467121929303336405293949697100102103108109116117118123136138141143155171180182195208210213214216218219220222223225227229231232233234236239242244245250253254

256259 260261264265

FGB Primates 488 17 822 394155808182130134135137173290367388413415448

F2 Vertebrates 475 5 099 5257151162381

F3 Mammals 240 11 663 55727691115177179185195230240

F3 Sauropsida 254 13 1060 340414474798081110123170208249

F5 Vertebrates 1182 10 087 1032755745785936027779269951010

F5 Mammals 1741 54 272 237243660134389392656663667669 676680718720731745749765768782809819822824875

88089489590791191692893094596797698910161040107610801089113611391324147114791515

152215481564

F5 Primates 2116 54 1321 74052129180211336341408409434 59266070370772075478487787988889290794195997810331039

10591131116812051209121812201222123012311232123612441254126212651279128514981504

151915261667168218491918

F7 Vertebrates 330 12 522 56526688162167178179209258262

F8 Mammals 1985 27 398 7245281339342521667718765784785792807817889912113011931197

12201229128813081333168619671984

F8 Primates 2351 33 063 42275579181485086088692796496697998899610191071118012821312

136114141421143614381459152715821613162216501689173023402349

F9 Mammals 443 15 440 45100114140150177181199202206209270309348

F10 Vertebrates 353 3 001 14470

F13A Vertebrates 631 6 126 398489497502550613

F13B Laurasiatheria 656 18 152 5567378107341349350374443459461497522525539540624

P4 95 P4 99

3047

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

estimate the stability effect (G) of residue changes on thethree-dimensional structures of human FVIII (chains A and B)FIX and FXI (fig 6) FoldX was designed to predict the stabilityeffect of a mutation This is an empirical method that relies onthe estimation of physical parameters from more than 1000mutations in protein structures and has been evaluated with21 proteins (Schymkowitz et al 2005 Tokuriki et al 2007)First each amino acid at each position was replaced with eachof the other 19 amino acids in order to provide a baseline forcomparison As expected the prediction of all virtual muta-tions shows a distribution skewed toward destabilization (me-dians G range between 051 and 123 kcalmol) Next theeffect of changing the human amino acids with those seen inevolution for the other 46 sequences of this study was calcu-lated in order to examine the effect of evolutionary change onprotein stabilities This produced a narrow distribution pre-dominantly centered on a difference of stability G of0 kcalmol and showing a slight skew toward deleterious mu-tations (medians G range between 014 and 030 kcalmol) This outcome is as expected because evolutionarychanges in the amino acids are selected in order to preservethe stability of the protein structures Finally the effect ofchanging the human amino acids to those observed inunique pathogenic missense mutations was calculated Thisanalysis produced a clearly skewed distribution (mediansG range between 101 and 295 kcalmol) of G

values that ranged as high as 10 kcalmol (and even higherbut these higher values were clipped for easier visualization)this agrees with the fact that these mutations affect the pro-tein structure The first distribution for the virtual mutationsis intermediate between the latter two distributions asexpected

The same outcome of stability effects was also observed atthe individual amino acid level (fig 7) When the effect of 19amino acid replacements were computed at each position forFVIII FIX and FXI and the majority of replacements showed adestabilizing effect (410 kcalmol in red) A small propor-tion of replacements stabilized the structure (lt10 kcalmol in blue) indicating the ability of the protein to compen-sate deleterious mutations by further mutation Although thisappears to occur more strongly for the FVIII chains A and B(fig 7a and b) the proportion of allowed replacements is infact similar for all four proteins when their different sizes areconsidered When the effect of replacements by disease-caus-ing missense mutations was calculated and the G valuesshowed that these almost invariably destabilized the threeprotein structures

DiscussionIn this study we have assembled sequence data for 14 coag-ulation genes across 47 genomes in order to analyze the evo-lutionary changes in amino acids that take place between five

FIG 4 Probability of evolutionary selection across the Primates clade for the three proteins FVIII FIX and FXI The probabilities of negative neutral andpositive selection were plotted against the amino acid position in the three proteins The probability values were obtained from CodeML analyses of theseven primate sequences (gray negative selection yellow neutral selection green positive selection)

3048

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

major clades and several minor ones These changes werecompared directly with disease-causing mutation data forthree coagulation proteins FVIII FIX and FXI Good-qualitypathogenic mutational data are publicly available for thesethree proteins and the bleeding disorders caused by thesethree sets of mutational data correspond to the three mostprevalent bleeding disorders The outcome of these detailedanalyses confirms that positive selection occurred during dif-ferent stages in the evolution of several coagulation proteinsIt was also observed when comparing the protein structurestabilities that the pathogenic mutations have a destabilizingeffect on the protein structure as opposed to the stable mu-tations that are tolerated during evolution (fig 6) The com-bination of positive selection analyses with those forpathogenic mutations provides information about predictingdisease causing mutations and clarifies fresh approaches fortherapeutic interventions in bleeding disorders

Selective Pressures in Vertebrate Evolution

Evolutionary changes combine random mutational eventswith natural selection The probability of a mutation to befixed in a population depends on its consequence on thefitness of an organism New adaptive mutations that are ben-eficial in terms of fitness are more likely to be retained bypositive selection than neutral mutation Neutral mutations

that are neither beneficial nor detrimental are randomly re-tained or removed by genetic drift depending on the popu-lation size (neutral selection) Detrimental mutations aremore likely to be removed from the population by negative(or purifying) selection There are no strict patterns and adeleterious mutation can sometimes be beneficial to oneaspect of the organism The textbook example of this issickle-cell anaemia (drepanocytosis) which is a bleeding dis-order caused by the pGlu6Val mutation in the hemoglobin gene Mutations responsible for sickle-cell anaemia are gen-erally removed by natural selection In malaria-infested pop-ulations particularly in tropical or subtropical regions thesemutations are selected for in one allele because they offerpartial protection against the pathogen (Lopez et al 2010)The malaria-infected sickle cells inhibit infection by the ma-laria plasmodium in heterozygous cases (beneficial) but causesickle-cell anaemia in homozygous cases (deleterious)

Genes corresponding to proteins in direct contact with theenvironment are more likely to be under adaptive selectionThus genes involved in host-pathogen interactions are com-monly associated with positive selection Large-scale analysesrevealing positive selection in vertebrates or arthropods havebeen reported for genes involved with immunity (Nielsenet al 2005 Studer et al 2008 Montoya-Burgos 2011 Rouxet al 2014) In the present detailed study positive selection inseveral genes of the coagulation cascade was identified at

FIG 5 Relationship between disease-causing mutations and evolutionary pressures The correlation between evolutionary selection and the number oftimes a disease-causing mutation occurs at each amino acid position is shown In the columns (a) FVIII (b) FIX and (c) FXI correspond to the threecoagulation factors for which pathogenic mutational information is available in sufficient quantity from our three mutational databases The three rowscorrespond to the posterior probability (based on the BEB model of CodeML) of positive selection neutral evolution and negative selection forprimates

3049

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

different stages in evolution (fig 3) An earlier study identifiedpositive selection across six mammalian genomes (humanchimpanzee macaque mouse rat and dog) using LRTsbased on codon substitution models and false discoveryrates to evaluate multiple LRT comparisons (Kosiol et al2008) Although they found strong positive selection(Plt 005 and false discovery ratelt 005) in nine complementgenes only nominal positive selection (Plt 005) was identi-fied in four coagulation genes (F2 F12 TFPI PROC) none ofwhich were identified in figure 3b A limitation of this earlieranalysis may be the low number of six genomes used Notethat complement is related to coagulation in that the latterhas the ability to activate complement of innate immunitythat mediates host-pathogen interactions (Amara et al 2008)

Positive Selection of the Coagulation Proteins throughHost-Pathogen Interactions

As a consequence of the immune response blood coagula-tion is often exploited by pathogens for reason of infectiveand septic processes For example FG and FV are targeted bybacteria thus offering a straightforward explanation of posi-tive selection (fig 3) We discuss this in terms of FG and FIIfollowed by FIII and FVII and then FV and FX

By taking advantage of the increased number of availablegenome sequences positive selection has been identified inseveral coagulation genes during various stages in evolutionIn particular all three subunits FGA FGB and FGG of FG haveshown positive selection at periods during vertebrate evolu-tion according to both the site model and the branch-sitemodels Similarly prothrombin (FII) has shown positive selec-tion but only across all the vertebrate genomes and notacross any smaller clades Although the immune system isunder strong adaptive constraints because it needs to evolveconstantly against pathogens the coagulation system alsocomprises the first line of immune defense against bacteriato form a clot to block bacterial invasion (Sun 2006) Both FGand prothrombin (FII) are targeted by certain bacteria Forexample FG is targeted during the course of infection bycysteine protease a virulence factor of Streptococcus pyogenes(Matsuka et al 1999) and FG binding protein Efb ofStaphylococcus aureus Similarly FG-like proteins exist inAnopheles gambiae and are involved in the function of theimmune system against malaria or bacterial parasites (Dongand Dimopoulos 2009 Lombardo et al 2013) Thrombin in-teracts with staphylocoagulase a protein secreted by thehuman pathogen S aureus and activates prothrombin with-out proteolysis to form thrombin The resulting

FIG 6 The effect of amino acid replacements on the overall protein stability change G Data are shown for (a and b) FVIII A and B chains (c) FIX (d)FXI Three calculations were performed each being based on the sequences for which crystal structures are known (PDB codes 2R7E [FVIII] 2WPH [FIX]and 2F83 [FXI]) The FVIII A and B chains correspond to the N-terminal residues 19ndash760 and the C-terminal residues 1582ndash2351 that are observed in itscrystal structure The calculation represented by the dashed line indicates the distribution calculated for all 19 possible amino acid replacements Thatrepresented by the dotted line indicates the distribution calculated for all the evolutionary occurring replacements that were observed in our data set of47 genomes That represented by the continuous line indicates the distribution calculated for the disease-causing mutations from the FVIII FIX and FXImutation databases

3050

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

prothrombinmdashstaphylocoagulase complex binds to FG tocleave this into self-polymerizing fibrin This process is centralto the molecular pathology of S aureus endocarditis becausestaphylocoagulase bypasses physiological blood coagulationthrough the evasion of the recognition of prothrombin bycirculating thrombin inhibitors (Panizzi et al 2005)

Positive selection for FIII and FVII is attributed to a host-pathogen interaction with the herpes virus family These cellsurface molecules do not necessarily participate directly in animmune response but can still serve as a convenient gatewayfor pathogen entry (Vallender and Lahn 2004) Herpes virusespromote cellular surface infections by coevolving with theirhuman host indicating a mode by which pathogens exploitthe coagulation system (Sutherland et al 2012) A study of thehost-pathogen driven coevolution of herpes virus revealedthat this virus has existed from the time of mammals anddivided into the HSV-1 and HSV-2 types recently during

human evolution (Kolb et al 2013) These observations sug-gest that the pressure for FIII to evolve in early mammalsespecially through primates could be the reason for the pos-itive selection of FIII across the mammalian clade (fig 3) Wepropose that the FIIIndashFVII interaction is a significant drivingforce for positive selection in FVII

The FV and FVIII cofactors are homologous to each otherin their sequence and domain organization hence it is notsurprising to see similar selection pressures acting on them(fig 3c) A difference arises in the entire vertebrate genome(fig 3a) where FV shows positive selection whereas FVIII doesnot This difference between FV and FVIII across vertebratesmay result from the interaction of FV with Escherichia coliThe extracellular serine protease EspP from E coli cleaves FVto lead to its degradation and thus reduces the activity of thecoagulation cascade (Brunder et al 1997) The activated FVacofactor forms a FVaFXa complex with activated FX (fig 1)

FIG 7 The effect of amino acid replacements at the sequence level on the protein stability change G The residue stability changes are depicted ona seven point scale from highly destabilizing (red) to highly stabilizing (blue) for each of (a) FVIII chain A (b) FVIII chain B (c) FIX and (d) FXI Their PDBcodes are indicated in figure 6 In each panel the upper half shows the G values for all the possible 19 amino acid replacements at each residueposition that is part of the crystal structure and the lower half shows the G values for the disease-causing mutations taken from the mutationdatabases

3051

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

which is important for thrombin formation the most impor-tant step for fibrin clot formation Because both FVa and FXaare central to coagulation and are associated with bacterialproteolysis this offers an explanation for the signatures ofpositive selection in both proteins

FVIII FIX and FXIII are also associated with positive selec-tion however it is not clear what selective pressures are in-volved Although these proteins may also be targeted bypathogens no direct interactions have been reported so far

Absence of Positive Selection for FXI and FXIIFXII is unique in that it does not show positive selection eventhough it is involved in host-pathogen interactions with long-chain inorganic phosphatases (polyP) from microorganisms(Morrissey et al 2012) Small-chain polyP are produced in thehuman brain and by blood platelets Hence even during theabsence of pathogen-originated polyP the human host doesnot lack polyP and there is no pressure to evolve for this (Puyet al 2013) A F12 deficiency in patients does not cause bleed-ing disorders This means that the coagulation cascade can beinitiated even in the absence of F12 For similar reasons FXIshows no positive selection We hypothesize that this is be-cause first no interaction with the environment has beendemonstrated to date and second with respect to the polyPmechanism the coagulation cascade is capable of activatingin a FXI-independent manner

Relationship between Disease-Causing Mutations andSelection

The mechanisms of genetic drift mutational change andevolutionary change are closely related to each otherEvolutionary change is not only dependent on the geneticvariability from mutational change but also on genetic driftwhich describes the random fluctuations in the number ofgenetic variants in an organism Genetic drift and mutationsproduce random variations that selection can act upon Thenew changes incurred in one generation may be cancelled inanother generation It may therefore be important to deter-mine whether a new mutation is lost or becomes commonenough for selection to determine its fate (Masel 2011) In thisstudy we have examined the relationship between naturalselection acting on an amino acid and the disease-causingmutations reported at that amino acid position for thethree prominent coagulation factors FVIII FIX and FXI Wehave shown that the disease-causing mutations are morelikely to occur at a higher frequency at sites under negativeselection in all three proteins (fig 5) As sites under strongselective pressure are more likely to be critical to proteinfunction such a mutation can readily affect protein functionand trigger disease Lethal (or deleterious) mutations thatcause death are de facto not observed in our data set Bycontrast disease-causing mutations occur at a reduced fre-quency in sites under neutral or positive selection suggestingthat these sites are more tolerant to mutations than sitesunder negative selection Interestingly we have shown thatFVIII and FIX have undergone much positive selectionwhereas FXI has no such signatures (fig 4) Despite this dif-ference our results for three proteins show no change in the

relationship between evolutionary selection and disease-causing mutations (figs 5 and 6) In summary the reporteddisease-causing mutations show randomness and the muta-bility of a given amino acid position depends on the selectionpressure acting at that position and not on the entire protein

Effect of Disease-Causing Mutations on the Stability ofProtein Structure

Mutations that change the amino acid sequence can havestriking effects on protein stability Point mutations are re-lated to pathological and genetic conditions by reducing orabolishing coagulation protein function (table 1) Proteinfunction is related to its stability which is often measuredby the change in folding energy G of the protein structure(Stefl et al 2013) We have identified the effect of mutationson the stability of FVIII FIX and FXI in terms of G valuesThe stability effects of polymorphic amino acid changes inhuman sequences follow a Gaussian distribution (de Beeret al 2013) In that study disease-causing mutations alsoshowed the same distribution but slightly skewed towarddestabilization In this study the same trend for aminoacids substitutions in vertebrates were observed but this ismuch stronger for disease-causing mutations (fig 6) The dif-ference between the two studies may be explained by thedifferent methods used de Beer et al (2013) used the DiscreteOptimized Protein (DOPE) score of Modeller as a monitor ofglobal stability whereas we used FoldX (Materials andMethods) as an empirical method to infer directly the stabilityeffect G of a point mutation

Evolutionarily observed mutations are the most stable interms of their protein structure whereas disease-causing mu-tations destabilized the protein structure (figs 6 and 7) Eachsite may or may not tolerate a particular amino acid changedepending on the change in stability When each amino acidwas converted to each of the other 19 possible amino acids(fig 7 top panels) the G values showed that most aminoacids destabilize the structure whereas several may stabilize itThe disease-causing effect of a mutation may be affected notonly by the selection pressure acting on it but also by the typeof the amino acid replaced For the FVIII FIX and FXI bleedingdisorders destabilization was observed for most of the dis-ease-causing mutations

Utility of Evolutionary and Disease-CausingMutational Analyses for Clinical Diagnosis

Interactive locus-specific databases are useful tools for pre-senting patient mutational data for patient diagnosis andcare We plan to incorporate the above data sets on selectivepressures and stability effects into our mutational databasesfor FVIII FIX and FXI This information will provide insight forthe prediction of the effect of disease-causing mutations Thelevel of selective pressure was estimated under a maximum-likelihood framework Such methods are useful for geneticstudies but their results are sometimes difficult to interpretby a nonspecialist The presentation of this information needsto be intuitive to the viewer As shown in supplementaryfigure S3 Supplementary Material online the use of multiple

3052

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

sequence alignments and presentation of stability effectsG structural locations and selective pressures can helpin interpreting the impacts of the mutations This type ofinformation will therefore clarify the interpretation of newlydiscovered disease-causing mutations for clinicians evolu-tionary biologists and protein biochemists This combinedevolutionary-driven study of bleeding disorder-causing muta-tions is the most detailed of its kind as far as we are aware andmay be usefully extended to other biological processes andtheir proteins

Materials and Methods

Genomic Sequences

The flowchart of operations is summarized in supplementaryfigure S1 Supplementary Material online First the completemultiple sequence alignments were retrieved together withtheir gene trees for each of the 14 genes corresponding to the11 coagulation factors (table 1) from the Ensembl Comparadatabase (Vilella et al 2009) using their Ensembl IDs (table 1)and a Perl script Ensembl release 73 was used accessed onSeptember 30 2013 (Flicek et al 2013) These amino acidsequences came from Ensembl as they are generally ofgood quality and contained 74 vertebrate genomes in atotal of 80 genomes We then excluded any sequences thatcame from genomes of low coverage (2 or less) This re-sulted in 47 vertebrate genomes with alignment sizes be-tween 23 and 47 genomes These 47 sets of vertebratesequences provided sufficient genomic coverage to performphylogenetic analysis The 47 vertebrate organisms were clas-sified into five major clades (fig 2) namely Primates Glires(rodents rabbits and pika) Laurasiatheria Sauropsida (rep-tiles and birds) and Fishes (ray-finned fishes) together withminor clades including Atlantogenata and Australianmammals

Sequence Retrieval and Multiple Alignments

The multiple sequence alignments and gene trees fromEnsemblCompara included all the coagulation sequences indifferent species irrespective of their genome coverage Themultiple sequence alignments of the downloaded genes fromthe EnsemblCompara database were constructed using aglobal alignment algorithm provided within this databasefor significant number of sequences of arbitrary lengthsThe data set required for this study comprised 14 genesfrom 47 genomes A thorough quality control procedure ofthe multiple sequence alignments was performed in order togenerate minimum false positive results Thus sequences wereremoved if they showed 41 low-complexity regions orunknown nucleotides (indicated as ldquoNrdquo in a sequence)From these the phylogenetic trees were pruned usingNewick Utilities software (Junier and Zdobnov 2010) to iden-tify the tips corresponding to the retrieved species (fig 2) Thefinal alignments of these sequences were made with PRANKwhich is conservative in that it tends to align only amino acidsderived by substitution and introduces insertions and dele-tions (indels) instead of aligning sequence fragments thatappear too divergent (Loytynoja and Goldman 2008)

PRANK is an advanced probabilistic alignment algorithmthat incorporates phylogenetic information This has beenshown to reduce the number of false-positives when evalu-ating positive selection (Fletcher and Yang 2010 Loytynoja2014) All the PRANK alignments were made using defaultparameters for the empirical codon model and the prunedtrees as guides The resulting alignments are composed ofhighly conserved blocks surrounded by noise due to indelsFinally Gblocks (Castresana 2000) was used to remove thenoise in the unreliable regions of the multiple sequence align-ments which may cause false positives when testing for pos-itive selection to keep only the well-aligned parts The defaultGblocks parameters for the codontype model (parameter ndasht = c) were used with the exception of tolerating half gaps perposition (parameter -b5=h) This procedure resulted in cleanand accurate multiple sequence alignments for the final com-putations of the site and branch-site models using CodeMLtogether with their corresponding gene trees

Analyses of Evolutionary Pressure

For each of the 14 coagulation genes within each of the fivemajor clades the selective pressures were estimated using thedifferent codon substitution site models implemented inCodeML of the phylogenetic analysis with maximum-likeli-hood software (PAML release 47) (Yang 2007) By analyzingthe protein-coding genes at the nucleotide level the synon-ymous or silent changes (nucleotide substitutions that do notchange the encoded amino acid) are distinguished from thenonsynonymous or missense changes (substitutions thatchange the encoded amino acid) Because natural selectionoperates at the protein level and can take any direction ofpositive neutral or negative selection the synonymous andnonsynonymous substitutions are subjected to different se-lective pressures which occur at the rates dS and dN respec-tively Hence the ratio of nonsynonymoussynonymoussubstitution rates (= dNdS) measures the selective pressureat the protein level The value of reveals the direction andstrength of natural selection acting on the protein Values oflt 1 = 1 and 4 1 indicate negative selection neutralevolution and positive selection respectively If an existingfunction or phenotype is evolutionarily favorable then nega-tive selection (or purifying selection) favors the conservationof that function or phenotype (conservation of amino acids)The other is positive selection (or Darwinian selection) whichfavors the promotion of a new function or phenotypeNeutral evolution describes the tolerance to amino acid mu-tations which are neither deleterious nor beneficial and arefixed according to genetic drift

The following five models were used to account for thevariation of dNdS among codon sites M1a (two dNdS ratioswith negative selection [0lt 1] and neutral evolution[1 = 1]) M2a (three dNdS ratios with negative selection[0lt 1] neutral evolution [1 = 1] and positive selection[2 1]) M7 (beta) M8 (beta and 24 1) and M8a (betaand 2 = 1) The alignments contain several positions inwhich one of the sequences is missing and the ldquocleandata = 0rdquooption was selected to retain all sites The inference of positive

3053

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

selection was conducted by performing LRTs in which thefollowing two pairs of models were compared namely M1awith M2a and M8a with M8 The principle is to compare anull model that does not allow24 1 (M1a or M8a) with analternative model that does (M2a or M8) Because the alter-nate model has more parameters than the null model theLRT follows a 2 distribution The LRT is a powerful methodfor comparing two hypotheses where one is the special caseof the other The BEB method (Yang et al 2005) was used tocalculate the posterior probabilities for site classes and theBEB value was used to identify sites under positive selectionwhere the 2 distribution of the LRT was significant withPlt 005 As many different LRT comparisons (differentgenes different clades) were conducted a control for thefalse-discovery rate (expected proportion of false positives)was applied using the q value package in R (Storey andTibshirani 2003) We adjusted the parameters of the q valueto perform a bootstrap analysis (pi0meth = ldquobootstraprdquo) andto tolerate 10 of false positives in our significant results(ldquofdrlevel = 01rdquo) A bootstrap analysis is necessary as the Pvalue distribution is bimodal (Studer et al 2008)

Two sources for the branch-site model were used First theSelectome database at httpselectomeunilch (last accessedJanuary 2014) (Proux et al 2009 Moretti et al 2014) providesthe precomputed results of the branch-site model (supple-mentary table S2b Supplementary Material online) based onthe multiple sequence alignments from the EnsemblCompara database (Vilella et al 2009) The last release ofSelectome (release 6) contains an optimized procedure ofalignment refinement and quality control similar to theone used for our site-model analyses as described aboveThe branch-site model allows the dNdS value to vary notonly between sites but also between branches (Zhang et al2005) The advantage of this model is its ability to detecttraces of episodic positive selection (Studer and Robinson-Rechavi 2009a) Second a branch-site model was determinedbased on our data set (see above) We used the vertebratetrees and their multiple sequence alignments as inputs forbranch-site model calculations The multiple alignments weremanually rechecked to remove barren sequences and thecorresponding vertebrate trees were edited accordinglyusing Newick Utilities Each taxonomic unit also known asthe branch or node of the tree was tagged using an auto-mated python script The CodeML program was run on eachof these tagged trees from all the genes for this branch-sitemodel (supplementary table S2b Supplementary Materialonline) For some alignments we found misaligned short in-sertions up to a maximum of ten amino acids in the con-served blocks probably due to gene model prediction errorsIn our whole data set these errors affect 0ndash4 sequences peralignment As the F2 gene family contained most of thesesmall errors albeit at a very low level this was reanalyzed byremoving four sequences with potential problems from thealignment The recalculation of the likelihood ratio deviatedonly slightly from the initial results (P value of 393e5 for 38sequences vs P value of 217e4 for 34 sequences) This sug-gested that these errors would only have a minor impact ifany on the detection of positive selection especially as the

coagulation factor sequences were long and the model ofcodon substitutions was averaged over all columns (manyhundreds) and sequences (between 23 and 47)

Stability Effect of Mutations

When possible for each coagulation factor the molecularstructures of the domain(s) were retrieved directly fromthe CATH database (Sillitoe et al 2013) Prior to the muta-tional analyses each structure was repaired with the FoldXprogram (Schymkowitz et al 2005) to remove potential stericclashes FoldX utilizes an empirical force field model to esti-mate the stability of a protein (Gwildtype in kcalmol) FoldXwas used to estimate the stability effect on each residueposition when the residue was replaced by any of the 19other amino acids The stability effect is determined bythe difference in the wildtype and mutant energies(G = GmutantGwildtype) This resulted in a stabilitylandscape of all potential mutations including the Gvalues both for the disease mutations and for all the aminoacid substitutions observed across vertebrate evolution

Statistical Analyses

All statistical analyses were performed with R v302 (R CoreTeam 2013) and SigmaPlot (Systat Software San Jose CA)The pipeline was developed with Biopython scripts (Cocket al 2009) The visualization of the multiple sequence align-ments was performed using Jalview 28 (Waterhouse et al2009)

Data Availability

The final sequence alignments are provided on a webpagehttpwwwbiochemuclacukpavithraevolutioncoagulationindexphp (last accessed January 2014) Pathogenic mu-tation data for three human coagulation factors FVIII FIX andFXI were retrieved from our web databases at httpwwwfactorviii-dborg httpwwwfactorixorg and httpwwwfactorxiorg (last accessed January 2014) (Saunders et al2009 Rallapalli et al 2013)

Supplementary MaterialSupplementary tables S1ndashS2 and figures S1ndashS3 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

PMR and SJP thank Pfizer UK for a medical educationalgrant that generously funded this project and the SpecialTrustees of the Royal Free Hospital the KatharineDormandy Trust for Hemophilia and Related Disorders andthe European Association for Hemophilia and AssociatedDisorders for additional support The authors also thank DrKeith Gomez Dr Geoffrey Kemball-Cook and ProfessorEdward G Tuddenham for their generous collaborative sup-port of our studies including access to the FVIII mutations atthe Hemophilia A Database at httpwwwfactorviii-dborgRAS acknowledges funding from the Fondation du 450eme

3054

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

anniversaire de lrsquoUniversite de Lausanne and Swiss NationalScience Foundation grants 132476 and 136477

ReferencesAmara U Rittirsch D Flierl M Bruckner U Klos A Gebhard F Lambris

JD Huber-Lang M 2008 Interaction between the coagulation andcomplement system Adv Exp Med Biol 63271ndash79

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19950ndash958

Broze GJ Jr 1995 Tissue factor pathway inhibitor and the revised theoryof coagulation Annu Rev Med 46103ndash112

Brunder W Schmidt H Karch H 1997 EspP a novel extracellular serineprotease of enterohaemorrhagic Escherichia coli O157H7 cleaveshuman coagulation factor V Mol Microbiol 24767ndash778

Butenas S vanrsquot Veer C Cawthern K Brummel KE Mann KG 2000Models of blood coagulation Blood Coagul Fibrinolysis 11(Suppl 1)S9ndashS13

Castresana J 2000 Selection of conserved blocks from multiple align-ments for their use in phylogenetic analysis Mol Biol Evol 17540ndash552

Cock PJ Antao T Chang JT Chapman BA Cox CJ Dalke A Friedberg IHamelryck T Kauff F Wilczynski B et al 2009 Biopython freelyavailable Python tools for computational molecular biology andbioinformatics Bioinformatics 251422ndash1423

de Beer TA Laskowski RA Parks SL Sipos B Goldman N Thornton JM2013 Amino Acid changes in disease-associated variants differ rad-ically from variants observed in the 1000 genomes project datasetPLoS Comput Biol 91ndash15

Dong Y Dimopoulos G 2009 Anopheles fibrinogen-related proteinsprovide expanded pattern recognition capacity against bacteriaand malaria parasites J Biol Chem 2849835ndash9844

Doolittle RF 2009 Step-by-step evolution of vertebrate blood coagula-tion Cold Spring Harb Symp Quant Biol 7435ndash40

Doolittle RF Jiang Y Nand J 2008 Genomic evidence for a simplerclotting scheme in jawless vertebrates J Mol Evol 66185ndash196

Fletcher W Yang Z 2010 The effect of insertions deletions and align-ment errors on the branch-site test of positive selection Mol BiolEvol 272257ndash2267

Flicek P Ahmed I Amode MR Barrell D Beal K Brent S Carvalho-SilvaD Clapham P Coates G Fairley S et al 2013 Ensembl 2013 NucleicAcids Res 41D48ndashD55

Junier T Zdobnov EM 2010 The Newick utilitieshigh-throughput phy-logenetic tree processing in the UNIX shell Bioinformatics 261669ndash1670

Kawthalkar S 2013 Overview of Physiology of blood In essentials ofhaematology New Delhi (India) Jaypee Brothers Medical Publishers(P) Ltd p 1ndash51

Kolb AW Ane C Brandt CR 2013 Using HSV-1 genome phylogeneticsto track past human migrations PLoS One 81ndash9

Kosiol C Vinar T da Fonseca RR Hubisz MJ Bustamante CD Nielsen RSiepel A 2008 Patterns of positive selection in six Mammalian ge-nomes PLoS Genet 4e1000144

Lombardo F Ghani Y Kafatos FC Christophides GK 2013Comprehensive genetic dissection of the hemocyte immune re-sponse in the malaria mosquito Anopheles gambiae PLoS Pathog9e1003145

Lopez C Saravia C Gomez A Hoebeke J Patarroyo MA 2010Mechanisms of genetically-based resistance to malaria Gene 4671ndash12

Loytynoja A 2014 Phylogeny-aware alignment with PRANK MethodsMol Biol 1079155ndash170

Loytynoja A Goldman N 2008 Phylogeny-aware gap placement pre-vents errors in sequence alignment and evolutionary analysisScience 3201632ndash1635

Masel J 2011 Genetic drift Curr Biol 21R837ndashR838Matsuka YV Pillai S Gubba S Musser JM Olmsted SB 1999 Fibrinogen

cleavage by the Streptococcus pyogenes extracellular cysteine

protease and generation of antibodies that inhibit enzyme proteo-lytic activity Infect Immun 674326ndash4333

Meyer A Van de Peer Y 2005 From 2R to 3R evidence for a fish-specificgenome duplication (FSGD) Bioessays 27937ndash945

Montoya-Burgos JI 2011 Patterns of positive selection and neutral evo-lution in the protein-coding genes of Tetraodon and Takifugu PLoSOne 6e24800

Moretti S Laurenczy B Gharib WH Castella B Kuzniar A Schabauer HStuder RA Valle M Salamin N Stockinger H et al 2014 Selectomeupdate quality control and computational improvements to a data-base of positive selection Nucleic Acids Res 42D917ndashD921

Morrissey JH Choi SH Smith SA 2012 Polyphosphate an ancient mol-ecule that links platelets coagulation and inflammation Blood 1195972ndash5979

Nathwani AC Tuddenham EG Rangarajan S Rosales C McIntosh JLinch DC Chowdary P Riddell A Pie AJ Harrington C et al 2011Adenovirus-associated virus vector-mediated gene transfer in hemo-philia B N Engl J Med 3652357ndash2365

Nielsen R Bustamante C Clark AG Glanowski S Sackton TB Hubisz MJFledel-Alon A Tanenbaum DM Civello D White TJ et al 2005 Ascan for positively selected genes in the genomes of humans andchimpanzees PLoS Biol 3e170

Panizzi P Friedrich R Fuentes-Prior P Richter K Bock PE Bode W 2005Fibrinogen substrate recognition by staphylocoagulase(pro)throm-bin complexes J Biol Chem 131179ndash1187

Proux E Studer RA Moretti S Robinson-Rechavi M 2009 Selectome adatabase of positive selection Nucleic Acids Res 37D404ndashD407

Puy C Tucker EI Wong ZC Gailani D Smith SA Choi SH Morrissey JHGruber A McCarty OJ 2013 Factor XII promotes blood coagulationindependent of factor XI in the presence of long-chain polypho-sphates J Thromb Haemost 111341ndash1352

R Core Team 2013 R a language and environment for statistical com-puting Vienna Austria R Foundation for Statistical Computing

Rallapalli PM Kemball-Cook G Tuddenham EG Gomez K Perkins SJ2013 An interactive mutation database for human coagulationfactor IX provides novel insights into the phenotypes and geneticsof hemophilia B J Thromb Haemost 111329ndash1340

Roux J Privman E Moretti S Daub JT Robinson-Rechavi M Keller L2014 Patterns of positive selection in seven ant genomes Mol BiolEvol 311661ndash1685

Saunders RE Shiltagh N Gomez K Mellars G Cooper C Perry DJTuddenham EG Perkins SJ 2009 Structural analysis of eight noveland 112 previously reported missense mutations in the interactiveFXI mutation database reveals new insight on FXI deficiencyThromb Haemost 102287ndash301

Schymkowitz J Borg J Stricher F Nys R Rousseau F Serrano L 2005 TheFoldX web server an online force field Nucleic Acids Res 33W382ndashW388

Sillitoe I Cuff AL Dessailly BH Dawson NL Furnham N Lee D Lees JGLewis TE Studer RA Rentzsch R et al 2013 New functionalfamilies (FunFams) in CATH to improve the mapping of con-served functional sites to 3D structures Nucleic Acids Res 41D490ndashD498

Spronk HM Govers-Riemslag JW ten CH 2003 The blood coagulationsystem as a molecular machine Bioessays 251220ndash1228

Stefl S Nishi H Petukh M Panchenko AR Alexov E 2013 Molecularmechanisms of disease-causing missense mutations J Mol Biol 4253919ndash3936

Storey JD Tibshirani R 2003 Statistical significance for genomewidestudies Proc Natl Acad Sci U S A 1009440ndash9445

Studer RA Penel S Duret L Robinson-Rechavi M 2008 Pervasive pos-itive selection on duplicated and nonduplicated vertebrate proteincoding genes Genome Res 181393ndash1402

Studer RA Robinson-Rechavi M 2009a Evidence for an episodic modelof protein sequence evolution Biochem Soc Trans 37783ndash786

Studer RA Robinson-Rechavi M 2009b Large-scale analyses of positiveselection using codon models In Pontarotti P editor Evolutionarybiology Berlin (Germany) Springer-Verlag Berlin Heidelbergp 217ndash235

3055

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Sun H 2006 The interaction between pathogens and the host coagu-lation system Physiology 21281ndash288

Sutherland MR Ruf W Pryzdial EL 2012 Tissue factor and glycoproteinC on herpes simplex virus type 1 are protease-activated receptor 2cofactors that enhance infection Blood 1193638ndash3645

Thomas CE Ehrhardt A Kay MA 2003 Progress and problemswith the use of viral vectors for gene therapy Nat Rev Genet 4346ndash358

Tokuriki N1 Stricher F Schymkowitz J Serrano L Tawfik DS 2007 Thestability effects of protein mutations appear to be universally dis-tributed J Mol Biol 3691318ndash1332

Vallender EJ Lahn BT 2004 Positive selection on the human genomeHum Mol Genet 13R245ndashR254

Vilella AJ Severin J Ureta-Vidal A Heng L Durbin R Birney E 2009EnsemblCompara GeneTrees complete duplication-aware phyloge-netic trees in vertebrates Genome Res 19327ndash335

Waterhouse AM Procter JB Martin DM Clamp M Barton GJ 2009Jalview Version 2mdasha multiple sequence alignment editor and anal-ysis workbench Bioinformatics 251189ndash1191

Wong WS Yang Z Goldman N Nielsen R 2004 Accuracy and power ofstatistical methods for detecting adaptive evolution in proteincoding sequences and for identifying positively selected sitesGenetics 1681041ndash1051

Yang Z 2006 Computational molecular evolution Oxford UniversityPress Oxford

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 241586ndash1591

Yang Z Wong WS Nielsen R 2005 Bayes empirical Bayes inference ofamino acid sites under positive selection Mol Biol Evol 221107ndash1118

Zhang J Nielsen R Yang Z 2005 Evaluation of an improved branch-sitelikelihood method for detecting positive selection at the molecularlevel Mol Biol Evol 222472ndash2479

3056

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Disease-Causing Missense Mutations and Relationshipto Evolutionary Pressures

The relationship between selective pressure and the proba-bility of disease-causing mutations at a codon site is of greatinterest Disease-causing missense mutations have been re-ported in all 14 coagulation genes in particular forHemophilia A (F8) Hemophilia B (F9) factor XI deficiency(F11) and thrombophilia (F2) (table 1) Extensive compila-tions of 487ndash5474 disease-causing mutations are available forthree of these coagulation factors FVIII FIX and FXI(Saunders et al 2009 Rallapalli et al 2013 Rallapalli et al2014 [unpublished data]) There is a sufficient number ofpathogenic mutation and evolutionary selection data fromthe FVIII FIX and FXI proteins to enable us to compare instatistical detail the correlation between selective pressureand disease-causing mutations

For the comparison of the FVIII FIX and FXI proteinchanges selective pressures were identified using the BEB in-ference method from model M2a (fig 4) While powerful a

disadvantage of Model M8 is that it classifies sites into 11categories (ten in negativeneutral and one positive selec-tion) These 11 categories can vary between different genefamilies making the comparison difficult Thus model M2a isadvantageous in that this has only three distinct categoriesnamely negative selection (dNdSlt 1) neutral evolution (dNdS = 1) and positive selection (dNdS4 1) Under the M1a-M2a test in the primates clade two genes (F8 and F9) showedpositive selection whereas F11 did not show positive selec-tion which differs from that shown in figure 3c which is basedon the M8 model The posterior probabilities of neutral neg-ative and positive selection for the FVIII FIX and FXI proteinsshowed that positive selection occurred in specific codonsthe most being for F8 and the fewest for F11 For completionthe same analysis for the other 11 genes is shown in supple-mentary table S2 Supplementary Material online These showsimilar outcomes to those for FVIII FIX and FXI

For each amino acid position in FVIII FIX and FXI wecompared its selective pressure (positive neutral or negative)

FIG 3 Positively selected genes of the coagulation cascade in vertebrates mammals and three major clades The layout of this figure is a simplifiedrepresentation of figure 1 Genes showing significant positive selection in the LRT calculations are highlighted in red (a) Positive selection was observedin five genes when all 47 genome sequences (vertebrates) were analyzed as a single group (b) In the Mammals group (fig 2) where positive selection isseen for five genes only the (c) Primates and (d) Laurasiatheria clades are shown with positive selection for two or three genes only because no positiveselection was seen in the Glires clade whereas insufficient sequences were available in the Atlantogenata clade for conclusions to be drawn (e) Positiveselection was seen in the Sauropsida clade but no positive selection was observed in Fishes (fig 2)

3046

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

with the reported number of disease-causing mutations Theoutcome was presented using regression plots (fig 5)The regression plots and their correlation coefficients revealedthe relationship between the selection probability values atamino acid sites with the disease-causing mutations at thosesites The average probability of a site to be under positiveselection and its number of disease-causing mutationsshowed a negative relationship with correlation coefficientsof085067 and095 in FVIII FIX and FXI respectively(fig 5 top row) The correlation coefficients for neutral selec-tion followed that of positive selection with values of 087064 and 096 for FVIII FIX and FXI respectively (fig 5middle row) In distinction to these a positive relationshipwas seen between the average probability of negative selec-tion and the disease-causing mutations with correlation co-efficients of +086 +066 and +095 for FVIII FIX and FXIrespectively (fig 5 bottom row) To summarize the fewer thenumber of disease mutations at a given site the higher is theprobability for positive selection or neutral evolution (fig 5

top row and middle row) The greater the number of disease-causing mutations at a given site the higher the probabilityfor negative selection (fig 5 bottom row) The comparisonbetween the positively selected proteins FVIII and FIX withthe nonpositively selected protein FXI is noteworthyIrrespective of the selective pressure all three proteins ex-hibited similar regression lines in each of the three rowsThese comparisons show that there is a general inverse pro-portionality relationship between the probability of positiveselection (or neutral evolution) and the number of disease-causing mutations at each site This confirms previous obser-vations that disease-causing mutations tend to occur at crit-ical sites in the protein and that these highly conserved sitesare less tolerant to new mutations (negative selection)

Stability Effect of Mutations

In order to explain the consequences of disease-causing mu-tations on the protein structure FoldX software was used to

Table 3 Positively Selected Sites in the M8a-M8 Comparison

Gene Clade No ofCodons

Analyzed

No of Codonsunder positive

selection

PercentageEstimated byCodeML ()

Sites under Positive Selection (BEB4 50)

FGA Mammals 288 13 717 71832353951101108178212219246273

FGA Laurasiatheria 267 64 1329 467121929303336405293949697100102103108109116117118123136138141143155171180182195208210213214216218219220222223225227229231232233234236239242244245250253254

256259 260261264265

FGB Primates 488 17 822 394155808182130134135137173290367388413415448

F2 Vertebrates 475 5 099 5257151162381

F3 Mammals 240 11 663 55727691115177179185195230240

F3 Sauropsida 254 13 1060 340414474798081110123170208249

F5 Vertebrates 1182 10 087 1032755745785936027779269951010

F5 Mammals 1741 54 272 237243660134389392656663667669 676680718720731745749765768782809819822824875

88089489590791191692893094596797698910161040107610801089113611391324147114791515

152215481564

F5 Primates 2116 54 1321 74052129180211336341408409434 59266070370772075478487787988889290794195997810331039

10591131116812051209121812201222123012311232123612441254126212651279128514981504

151915261667168218491918

F7 Vertebrates 330 12 522 56526688162167178179209258262

F8 Mammals 1985 27 398 7245281339342521667718765784785792807817889912113011931197

12201229128813081333168619671984

F8 Primates 2351 33 063 42275579181485086088692796496697998899610191071118012821312

136114141421143614381459152715821613162216501689173023402349

F9 Mammals 443 15 440 45100114140150177181199202206209270309348

F10 Vertebrates 353 3 001 14470

F13A Vertebrates 631 6 126 398489497502550613

F13B Laurasiatheria 656 18 152 5567378107341349350374443459461497522525539540624

P4 95 P4 99

3047

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

estimate the stability effect (G) of residue changes on thethree-dimensional structures of human FVIII (chains A and B)FIX and FXI (fig 6) FoldX was designed to predict the stabilityeffect of a mutation This is an empirical method that relies onthe estimation of physical parameters from more than 1000mutations in protein structures and has been evaluated with21 proteins (Schymkowitz et al 2005 Tokuriki et al 2007)First each amino acid at each position was replaced with eachof the other 19 amino acids in order to provide a baseline forcomparison As expected the prediction of all virtual muta-tions shows a distribution skewed toward destabilization (me-dians G range between 051 and 123 kcalmol) Next theeffect of changing the human amino acids with those seen inevolution for the other 46 sequences of this study was calcu-lated in order to examine the effect of evolutionary change onprotein stabilities This produced a narrow distribution pre-dominantly centered on a difference of stability G of0 kcalmol and showing a slight skew toward deleterious mu-tations (medians G range between 014 and 030 kcalmol) This outcome is as expected because evolutionarychanges in the amino acids are selected in order to preservethe stability of the protein structures Finally the effect ofchanging the human amino acids to those observed inunique pathogenic missense mutations was calculated Thisanalysis produced a clearly skewed distribution (mediansG range between 101 and 295 kcalmol) of G

values that ranged as high as 10 kcalmol (and even higherbut these higher values were clipped for easier visualization)this agrees with the fact that these mutations affect the pro-tein structure The first distribution for the virtual mutationsis intermediate between the latter two distributions asexpected

The same outcome of stability effects was also observed atthe individual amino acid level (fig 7) When the effect of 19amino acid replacements were computed at each position forFVIII FIX and FXI and the majority of replacements showed adestabilizing effect (410 kcalmol in red) A small propor-tion of replacements stabilized the structure (lt10 kcalmol in blue) indicating the ability of the protein to compen-sate deleterious mutations by further mutation Although thisappears to occur more strongly for the FVIII chains A and B(fig 7a and b) the proportion of allowed replacements is infact similar for all four proteins when their different sizes areconsidered When the effect of replacements by disease-caus-ing missense mutations was calculated and the G valuesshowed that these almost invariably destabilized the threeprotein structures

DiscussionIn this study we have assembled sequence data for 14 coag-ulation genes across 47 genomes in order to analyze the evo-lutionary changes in amino acids that take place between five

FIG 4 Probability of evolutionary selection across the Primates clade for the three proteins FVIII FIX and FXI The probabilities of negative neutral andpositive selection were plotted against the amino acid position in the three proteins The probability values were obtained from CodeML analyses of theseven primate sequences (gray negative selection yellow neutral selection green positive selection)

3048

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

major clades and several minor ones These changes werecompared directly with disease-causing mutation data forthree coagulation proteins FVIII FIX and FXI Good-qualitypathogenic mutational data are publicly available for thesethree proteins and the bleeding disorders caused by thesethree sets of mutational data correspond to the three mostprevalent bleeding disorders The outcome of these detailedanalyses confirms that positive selection occurred during dif-ferent stages in the evolution of several coagulation proteinsIt was also observed when comparing the protein structurestabilities that the pathogenic mutations have a destabilizingeffect on the protein structure as opposed to the stable mu-tations that are tolerated during evolution (fig 6) The com-bination of positive selection analyses with those forpathogenic mutations provides information about predictingdisease causing mutations and clarifies fresh approaches fortherapeutic interventions in bleeding disorders

Selective Pressures in Vertebrate Evolution

Evolutionary changes combine random mutational eventswith natural selection The probability of a mutation to befixed in a population depends on its consequence on thefitness of an organism New adaptive mutations that are ben-eficial in terms of fitness are more likely to be retained bypositive selection than neutral mutation Neutral mutations

that are neither beneficial nor detrimental are randomly re-tained or removed by genetic drift depending on the popu-lation size (neutral selection) Detrimental mutations aremore likely to be removed from the population by negative(or purifying) selection There are no strict patterns and adeleterious mutation can sometimes be beneficial to oneaspect of the organism The textbook example of this issickle-cell anaemia (drepanocytosis) which is a bleeding dis-order caused by the pGlu6Val mutation in the hemoglobin gene Mutations responsible for sickle-cell anaemia are gen-erally removed by natural selection In malaria-infested pop-ulations particularly in tropical or subtropical regions thesemutations are selected for in one allele because they offerpartial protection against the pathogen (Lopez et al 2010)The malaria-infected sickle cells inhibit infection by the ma-laria plasmodium in heterozygous cases (beneficial) but causesickle-cell anaemia in homozygous cases (deleterious)

Genes corresponding to proteins in direct contact with theenvironment are more likely to be under adaptive selectionThus genes involved in host-pathogen interactions are com-monly associated with positive selection Large-scale analysesrevealing positive selection in vertebrates or arthropods havebeen reported for genes involved with immunity (Nielsenet al 2005 Studer et al 2008 Montoya-Burgos 2011 Rouxet al 2014) In the present detailed study positive selection inseveral genes of the coagulation cascade was identified at

FIG 5 Relationship between disease-causing mutations and evolutionary pressures The correlation between evolutionary selection and the number oftimes a disease-causing mutation occurs at each amino acid position is shown In the columns (a) FVIII (b) FIX and (c) FXI correspond to the threecoagulation factors for which pathogenic mutational information is available in sufficient quantity from our three mutational databases The three rowscorrespond to the posterior probability (based on the BEB model of CodeML) of positive selection neutral evolution and negative selection forprimates

3049

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

different stages in evolution (fig 3) An earlier study identifiedpositive selection across six mammalian genomes (humanchimpanzee macaque mouse rat and dog) using LRTsbased on codon substitution models and false discoveryrates to evaluate multiple LRT comparisons (Kosiol et al2008) Although they found strong positive selection(Plt 005 and false discovery ratelt 005) in nine complementgenes only nominal positive selection (Plt 005) was identi-fied in four coagulation genes (F2 F12 TFPI PROC) none ofwhich were identified in figure 3b A limitation of this earlieranalysis may be the low number of six genomes used Notethat complement is related to coagulation in that the latterhas the ability to activate complement of innate immunitythat mediates host-pathogen interactions (Amara et al 2008)

Positive Selection of the Coagulation Proteins throughHost-Pathogen Interactions

As a consequence of the immune response blood coagula-tion is often exploited by pathogens for reason of infectiveand septic processes For example FG and FV are targeted bybacteria thus offering a straightforward explanation of posi-tive selection (fig 3) We discuss this in terms of FG and FIIfollowed by FIII and FVII and then FV and FX

By taking advantage of the increased number of availablegenome sequences positive selection has been identified inseveral coagulation genes during various stages in evolutionIn particular all three subunits FGA FGB and FGG of FG haveshown positive selection at periods during vertebrate evolu-tion according to both the site model and the branch-sitemodels Similarly prothrombin (FII) has shown positive selec-tion but only across all the vertebrate genomes and notacross any smaller clades Although the immune system isunder strong adaptive constraints because it needs to evolveconstantly against pathogens the coagulation system alsocomprises the first line of immune defense against bacteriato form a clot to block bacterial invasion (Sun 2006) Both FGand prothrombin (FII) are targeted by certain bacteria Forexample FG is targeted during the course of infection bycysteine protease a virulence factor of Streptococcus pyogenes(Matsuka et al 1999) and FG binding protein Efb ofStaphylococcus aureus Similarly FG-like proteins exist inAnopheles gambiae and are involved in the function of theimmune system against malaria or bacterial parasites (Dongand Dimopoulos 2009 Lombardo et al 2013) Thrombin in-teracts with staphylocoagulase a protein secreted by thehuman pathogen S aureus and activates prothrombin with-out proteolysis to form thrombin The resulting

FIG 6 The effect of amino acid replacements on the overall protein stability change G Data are shown for (a and b) FVIII A and B chains (c) FIX (d)FXI Three calculations were performed each being based on the sequences for which crystal structures are known (PDB codes 2R7E [FVIII] 2WPH [FIX]and 2F83 [FXI]) The FVIII A and B chains correspond to the N-terminal residues 19ndash760 and the C-terminal residues 1582ndash2351 that are observed in itscrystal structure The calculation represented by the dashed line indicates the distribution calculated for all 19 possible amino acid replacements Thatrepresented by the dotted line indicates the distribution calculated for all the evolutionary occurring replacements that were observed in our data set of47 genomes That represented by the continuous line indicates the distribution calculated for the disease-causing mutations from the FVIII FIX and FXImutation databases

3050

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

prothrombinmdashstaphylocoagulase complex binds to FG tocleave this into self-polymerizing fibrin This process is centralto the molecular pathology of S aureus endocarditis becausestaphylocoagulase bypasses physiological blood coagulationthrough the evasion of the recognition of prothrombin bycirculating thrombin inhibitors (Panizzi et al 2005)

Positive selection for FIII and FVII is attributed to a host-pathogen interaction with the herpes virus family These cellsurface molecules do not necessarily participate directly in animmune response but can still serve as a convenient gatewayfor pathogen entry (Vallender and Lahn 2004) Herpes virusespromote cellular surface infections by coevolving with theirhuman host indicating a mode by which pathogens exploitthe coagulation system (Sutherland et al 2012) A study of thehost-pathogen driven coevolution of herpes virus revealedthat this virus has existed from the time of mammals anddivided into the HSV-1 and HSV-2 types recently during

human evolution (Kolb et al 2013) These observations sug-gest that the pressure for FIII to evolve in early mammalsespecially through primates could be the reason for the pos-itive selection of FIII across the mammalian clade (fig 3) Wepropose that the FIIIndashFVII interaction is a significant drivingforce for positive selection in FVII

The FV and FVIII cofactors are homologous to each otherin their sequence and domain organization hence it is notsurprising to see similar selection pressures acting on them(fig 3c) A difference arises in the entire vertebrate genome(fig 3a) where FV shows positive selection whereas FVIII doesnot This difference between FV and FVIII across vertebratesmay result from the interaction of FV with Escherichia coliThe extracellular serine protease EspP from E coli cleaves FVto lead to its degradation and thus reduces the activity of thecoagulation cascade (Brunder et al 1997) The activated FVacofactor forms a FVaFXa complex with activated FX (fig 1)

FIG 7 The effect of amino acid replacements at the sequence level on the protein stability change G The residue stability changes are depicted ona seven point scale from highly destabilizing (red) to highly stabilizing (blue) for each of (a) FVIII chain A (b) FVIII chain B (c) FIX and (d) FXI Their PDBcodes are indicated in figure 6 In each panel the upper half shows the G values for all the possible 19 amino acid replacements at each residueposition that is part of the crystal structure and the lower half shows the G values for the disease-causing mutations taken from the mutationdatabases

3051

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

which is important for thrombin formation the most impor-tant step for fibrin clot formation Because both FVa and FXaare central to coagulation and are associated with bacterialproteolysis this offers an explanation for the signatures ofpositive selection in both proteins

FVIII FIX and FXIII are also associated with positive selec-tion however it is not clear what selective pressures are in-volved Although these proteins may also be targeted bypathogens no direct interactions have been reported so far

Absence of Positive Selection for FXI and FXIIFXII is unique in that it does not show positive selection eventhough it is involved in host-pathogen interactions with long-chain inorganic phosphatases (polyP) from microorganisms(Morrissey et al 2012) Small-chain polyP are produced in thehuman brain and by blood platelets Hence even during theabsence of pathogen-originated polyP the human host doesnot lack polyP and there is no pressure to evolve for this (Puyet al 2013) A F12 deficiency in patients does not cause bleed-ing disorders This means that the coagulation cascade can beinitiated even in the absence of F12 For similar reasons FXIshows no positive selection We hypothesize that this is be-cause first no interaction with the environment has beendemonstrated to date and second with respect to the polyPmechanism the coagulation cascade is capable of activatingin a FXI-independent manner

Relationship between Disease-Causing Mutations andSelection

The mechanisms of genetic drift mutational change andevolutionary change are closely related to each otherEvolutionary change is not only dependent on the geneticvariability from mutational change but also on genetic driftwhich describes the random fluctuations in the number ofgenetic variants in an organism Genetic drift and mutationsproduce random variations that selection can act upon Thenew changes incurred in one generation may be cancelled inanother generation It may therefore be important to deter-mine whether a new mutation is lost or becomes commonenough for selection to determine its fate (Masel 2011) In thisstudy we have examined the relationship between naturalselection acting on an amino acid and the disease-causingmutations reported at that amino acid position for thethree prominent coagulation factors FVIII FIX and FXI Wehave shown that the disease-causing mutations are morelikely to occur at a higher frequency at sites under negativeselection in all three proteins (fig 5) As sites under strongselective pressure are more likely to be critical to proteinfunction such a mutation can readily affect protein functionand trigger disease Lethal (or deleterious) mutations thatcause death are de facto not observed in our data set Bycontrast disease-causing mutations occur at a reduced fre-quency in sites under neutral or positive selection suggestingthat these sites are more tolerant to mutations than sitesunder negative selection Interestingly we have shown thatFVIII and FIX have undergone much positive selectionwhereas FXI has no such signatures (fig 4) Despite this dif-ference our results for three proteins show no change in the

relationship between evolutionary selection and disease-causing mutations (figs 5 and 6) In summary the reporteddisease-causing mutations show randomness and the muta-bility of a given amino acid position depends on the selectionpressure acting at that position and not on the entire protein

Effect of Disease-Causing Mutations on the Stability ofProtein Structure

Mutations that change the amino acid sequence can havestriking effects on protein stability Point mutations are re-lated to pathological and genetic conditions by reducing orabolishing coagulation protein function (table 1) Proteinfunction is related to its stability which is often measuredby the change in folding energy G of the protein structure(Stefl et al 2013) We have identified the effect of mutationson the stability of FVIII FIX and FXI in terms of G valuesThe stability effects of polymorphic amino acid changes inhuman sequences follow a Gaussian distribution (de Beeret al 2013) In that study disease-causing mutations alsoshowed the same distribution but slightly skewed towarddestabilization In this study the same trend for aminoacids substitutions in vertebrates were observed but this ismuch stronger for disease-causing mutations (fig 6) The dif-ference between the two studies may be explained by thedifferent methods used de Beer et al (2013) used the DiscreteOptimized Protein (DOPE) score of Modeller as a monitor ofglobal stability whereas we used FoldX (Materials andMethods) as an empirical method to infer directly the stabilityeffect G of a point mutation

Evolutionarily observed mutations are the most stable interms of their protein structure whereas disease-causing mu-tations destabilized the protein structure (figs 6 and 7) Eachsite may or may not tolerate a particular amino acid changedepending on the change in stability When each amino acidwas converted to each of the other 19 possible amino acids(fig 7 top panels) the G values showed that most aminoacids destabilize the structure whereas several may stabilize itThe disease-causing effect of a mutation may be affected notonly by the selection pressure acting on it but also by the typeof the amino acid replaced For the FVIII FIX and FXI bleedingdisorders destabilization was observed for most of the dis-ease-causing mutations

Utility of Evolutionary and Disease-CausingMutational Analyses for Clinical Diagnosis

Interactive locus-specific databases are useful tools for pre-senting patient mutational data for patient diagnosis andcare We plan to incorporate the above data sets on selectivepressures and stability effects into our mutational databasesfor FVIII FIX and FXI This information will provide insight forthe prediction of the effect of disease-causing mutations Thelevel of selective pressure was estimated under a maximum-likelihood framework Such methods are useful for geneticstudies but their results are sometimes difficult to interpretby a nonspecialist The presentation of this information needsto be intuitive to the viewer As shown in supplementaryfigure S3 Supplementary Material online the use of multiple

3052

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

sequence alignments and presentation of stability effectsG structural locations and selective pressures can helpin interpreting the impacts of the mutations This type ofinformation will therefore clarify the interpretation of newlydiscovered disease-causing mutations for clinicians evolu-tionary biologists and protein biochemists This combinedevolutionary-driven study of bleeding disorder-causing muta-tions is the most detailed of its kind as far as we are aware andmay be usefully extended to other biological processes andtheir proteins

Materials and Methods

Genomic Sequences

The flowchart of operations is summarized in supplementaryfigure S1 Supplementary Material online First the completemultiple sequence alignments were retrieved together withtheir gene trees for each of the 14 genes corresponding to the11 coagulation factors (table 1) from the Ensembl Comparadatabase (Vilella et al 2009) using their Ensembl IDs (table 1)and a Perl script Ensembl release 73 was used accessed onSeptember 30 2013 (Flicek et al 2013) These amino acidsequences came from Ensembl as they are generally ofgood quality and contained 74 vertebrate genomes in atotal of 80 genomes We then excluded any sequences thatcame from genomes of low coverage (2 or less) This re-sulted in 47 vertebrate genomes with alignment sizes be-tween 23 and 47 genomes These 47 sets of vertebratesequences provided sufficient genomic coverage to performphylogenetic analysis The 47 vertebrate organisms were clas-sified into five major clades (fig 2) namely Primates Glires(rodents rabbits and pika) Laurasiatheria Sauropsida (rep-tiles and birds) and Fishes (ray-finned fishes) together withminor clades including Atlantogenata and Australianmammals

Sequence Retrieval and Multiple Alignments

The multiple sequence alignments and gene trees fromEnsemblCompara included all the coagulation sequences indifferent species irrespective of their genome coverage Themultiple sequence alignments of the downloaded genes fromthe EnsemblCompara database were constructed using aglobal alignment algorithm provided within this databasefor significant number of sequences of arbitrary lengthsThe data set required for this study comprised 14 genesfrom 47 genomes A thorough quality control procedure ofthe multiple sequence alignments was performed in order togenerate minimum false positive results Thus sequences wereremoved if they showed 41 low-complexity regions orunknown nucleotides (indicated as ldquoNrdquo in a sequence)From these the phylogenetic trees were pruned usingNewick Utilities software (Junier and Zdobnov 2010) to iden-tify the tips corresponding to the retrieved species (fig 2) Thefinal alignments of these sequences were made with PRANKwhich is conservative in that it tends to align only amino acidsderived by substitution and introduces insertions and dele-tions (indels) instead of aligning sequence fragments thatappear too divergent (Loytynoja and Goldman 2008)

PRANK is an advanced probabilistic alignment algorithmthat incorporates phylogenetic information This has beenshown to reduce the number of false-positives when evalu-ating positive selection (Fletcher and Yang 2010 Loytynoja2014) All the PRANK alignments were made using defaultparameters for the empirical codon model and the prunedtrees as guides The resulting alignments are composed ofhighly conserved blocks surrounded by noise due to indelsFinally Gblocks (Castresana 2000) was used to remove thenoise in the unreliable regions of the multiple sequence align-ments which may cause false positives when testing for pos-itive selection to keep only the well-aligned parts The defaultGblocks parameters for the codontype model (parameter ndasht = c) were used with the exception of tolerating half gaps perposition (parameter -b5=h) This procedure resulted in cleanand accurate multiple sequence alignments for the final com-putations of the site and branch-site models using CodeMLtogether with their corresponding gene trees

Analyses of Evolutionary Pressure

For each of the 14 coagulation genes within each of the fivemajor clades the selective pressures were estimated using thedifferent codon substitution site models implemented inCodeML of the phylogenetic analysis with maximum-likeli-hood software (PAML release 47) (Yang 2007) By analyzingthe protein-coding genes at the nucleotide level the synon-ymous or silent changes (nucleotide substitutions that do notchange the encoded amino acid) are distinguished from thenonsynonymous or missense changes (substitutions thatchange the encoded amino acid) Because natural selectionoperates at the protein level and can take any direction ofpositive neutral or negative selection the synonymous andnonsynonymous substitutions are subjected to different se-lective pressures which occur at the rates dS and dN respec-tively Hence the ratio of nonsynonymoussynonymoussubstitution rates (= dNdS) measures the selective pressureat the protein level The value of reveals the direction andstrength of natural selection acting on the protein Values oflt 1 = 1 and 4 1 indicate negative selection neutralevolution and positive selection respectively If an existingfunction or phenotype is evolutionarily favorable then nega-tive selection (or purifying selection) favors the conservationof that function or phenotype (conservation of amino acids)The other is positive selection (or Darwinian selection) whichfavors the promotion of a new function or phenotypeNeutral evolution describes the tolerance to amino acid mu-tations which are neither deleterious nor beneficial and arefixed according to genetic drift

The following five models were used to account for thevariation of dNdS among codon sites M1a (two dNdS ratioswith negative selection [0lt 1] and neutral evolution[1 = 1]) M2a (three dNdS ratios with negative selection[0lt 1] neutral evolution [1 = 1] and positive selection[2 1]) M7 (beta) M8 (beta and 24 1) and M8a (betaand 2 = 1) The alignments contain several positions inwhich one of the sequences is missing and the ldquocleandata = 0rdquooption was selected to retain all sites The inference of positive

3053

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

selection was conducted by performing LRTs in which thefollowing two pairs of models were compared namely M1awith M2a and M8a with M8 The principle is to compare anull model that does not allow24 1 (M1a or M8a) with analternative model that does (M2a or M8) Because the alter-nate model has more parameters than the null model theLRT follows a 2 distribution The LRT is a powerful methodfor comparing two hypotheses where one is the special caseof the other The BEB method (Yang et al 2005) was used tocalculate the posterior probabilities for site classes and theBEB value was used to identify sites under positive selectionwhere the 2 distribution of the LRT was significant withPlt 005 As many different LRT comparisons (differentgenes different clades) were conducted a control for thefalse-discovery rate (expected proportion of false positives)was applied using the q value package in R (Storey andTibshirani 2003) We adjusted the parameters of the q valueto perform a bootstrap analysis (pi0meth = ldquobootstraprdquo) andto tolerate 10 of false positives in our significant results(ldquofdrlevel = 01rdquo) A bootstrap analysis is necessary as the Pvalue distribution is bimodal (Studer et al 2008)

Two sources for the branch-site model were used First theSelectome database at httpselectomeunilch (last accessedJanuary 2014) (Proux et al 2009 Moretti et al 2014) providesthe precomputed results of the branch-site model (supple-mentary table S2b Supplementary Material online) based onthe multiple sequence alignments from the EnsemblCompara database (Vilella et al 2009) The last release ofSelectome (release 6) contains an optimized procedure ofalignment refinement and quality control similar to theone used for our site-model analyses as described aboveThe branch-site model allows the dNdS value to vary notonly between sites but also between branches (Zhang et al2005) The advantage of this model is its ability to detecttraces of episodic positive selection (Studer and Robinson-Rechavi 2009a) Second a branch-site model was determinedbased on our data set (see above) We used the vertebratetrees and their multiple sequence alignments as inputs forbranch-site model calculations The multiple alignments weremanually rechecked to remove barren sequences and thecorresponding vertebrate trees were edited accordinglyusing Newick Utilities Each taxonomic unit also known asthe branch or node of the tree was tagged using an auto-mated python script The CodeML program was run on eachof these tagged trees from all the genes for this branch-sitemodel (supplementary table S2b Supplementary Materialonline) For some alignments we found misaligned short in-sertions up to a maximum of ten amino acids in the con-served blocks probably due to gene model prediction errorsIn our whole data set these errors affect 0ndash4 sequences peralignment As the F2 gene family contained most of thesesmall errors albeit at a very low level this was reanalyzed byremoving four sequences with potential problems from thealignment The recalculation of the likelihood ratio deviatedonly slightly from the initial results (P value of 393e5 for 38sequences vs P value of 217e4 for 34 sequences) This sug-gested that these errors would only have a minor impact ifany on the detection of positive selection especially as the

coagulation factor sequences were long and the model ofcodon substitutions was averaged over all columns (manyhundreds) and sequences (between 23 and 47)

Stability Effect of Mutations

When possible for each coagulation factor the molecularstructures of the domain(s) were retrieved directly fromthe CATH database (Sillitoe et al 2013) Prior to the muta-tional analyses each structure was repaired with the FoldXprogram (Schymkowitz et al 2005) to remove potential stericclashes FoldX utilizes an empirical force field model to esti-mate the stability of a protein (Gwildtype in kcalmol) FoldXwas used to estimate the stability effect on each residueposition when the residue was replaced by any of the 19other amino acids The stability effect is determined bythe difference in the wildtype and mutant energies(G = GmutantGwildtype) This resulted in a stabilitylandscape of all potential mutations including the Gvalues both for the disease mutations and for all the aminoacid substitutions observed across vertebrate evolution

Statistical Analyses

All statistical analyses were performed with R v302 (R CoreTeam 2013) and SigmaPlot (Systat Software San Jose CA)The pipeline was developed with Biopython scripts (Cocket al 2009) The visualization of the multiple sequence align-ments was performed using Jalview 28 (Waterhouse et al2009)

Data Availability

The final sequence alignments are provided on a webpagehttpwwwbiochemuclacukpavithraevolutioncoagulationindexphp (last accessed January 2014) Pathogenic mu-tation data for three human coagulation factors FVIII FIX andFXI were retrieved from our web databases at httpwwwfactorviii-dborg httpwwwfactorixorg and httpwwwfactorxiorg (last accessed January 2014) (Saunders et al2009 Rallapalli et al 2013)

Supplementary MaterialSupplementary tables S1ndashS2 and figures S1ndashS3 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

PMR and SJP thank Pfizer UK for a medical educationalgrant that generously funded this project and the SpecialTrustees of the Royal Free Hospital the KatharineDormandy Trust for Hemophilia and Related Disorders andthe European Association for Hemophilia and AssociatedDisorders for additional support The authors also thank DrKeith Gomez Dr Geoffrey Kemball-Cook and ProfessorEdward G Tuddenham for their generous collaborative sup-port of our studies including access to the FVIII mutations atthe Hemophilia A Database at httpwwwfactorviii-dborgRAS acknowledges funding from the Fondation du 450eme

3054

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

anniversaire de lrsquoUniversite de Lausanne and Swiss NationalScience Foundation grants 132476 and 136477

ReferencesAmara U Rittirsch D Flierl M Bruckner U Klos A Gebhard F Lambris

JD Huber-Lang M 2008 Interaction between the coagulation andcomplement system Adv Exp Med Biol 63271ndash79

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19950ndash958

Broze GJ Jr 1995 Tissue factor pathway inhibitor and the revised theoryof coagulation Annu Rev Med 46103ndash112

Brunder W Schmidt H Karch H 1997 EspP a novel extracellular serineprotease of enterohaemorrhagic Escherichia coli O157H7 cleaveshuman coagulation factor V Mol Microbiol 24767ndash778

Butenas S vanrsquot Veer C Cawthern K Brummel KE Mann KG 2000Models of blood coagulation Blood Coagul Fibrinolysis 11(Suppl 1)S9ndashS13

Castresana J 2000 Selection of conserved blocks from multiple align-ments for their use in phylogenetic analysis Mol Biol Evol 17540ndash552

Cock PJ Antao T Chang JT Chapman BA Cox CJ Dalke A Friedberg IHamelryck T Kauff F Wilczynski B et al 2009 Biopython freelyavailable Python tools for computational molecular biology andbioinformatics Bioinformatics 251422ndash1423

de Beer TA Laskowski RA Parks SL Sipos B Goldman N Thornton JM2013 Amino Acid changes in disease-associated variants differ rad-ically from variants observed in the 1000 genomes project datasetPLoS Comput Biol 91ndash15

Dong Y Dimopoulos G 2009 Anopheles fibrinogen-related proteinsprovide expanded pattern recognition capacity against bacteriaand malaria parasites J Biol Chem 2849835ndash9844

Doolittle RF 2009 Step-by-step evolution of vertebrate blood coagula-tion Cold Spring Harb Symp Quant Biol 7435ndash40

Doolittle RF Jiang Y Nand J 2008 Genomic evidence for a simplerclotting scheme in jawless vertebrates J Mol Evol 66185ndash196

Fletcher W Yang Z 2010 The effect of insertions deletions and align-ment errors on the branch-site test of positive selection Mol BiolEvol 272257ndash2267

Flicek P Ahmed I Amode MR Barrell D Beal K Brent S Carvalho-SilvaD Clapham P Coates G Fairley S et al 2013 Ensembl 2013 NucleicAcids Res 41D48ndashD55

Junier T Zdobnov EM 2010 The Newick utilitieshigh-throughput phy-logenetic tree processing in the UNIX shell Bioinformatics 261669ndash1670

Kawthalkar S 2013 Overview of Physiology of blood In essentials ofhaematology New Delhi (India) Jaypee Brothers Medical Publishers(P) Ltd p 1ndash51

Kolb AW Ane C Brandt CR 2013 Using HSV-1 genome phylogeneticsto track past human migrations PLoS One 81ndash9

Kosiol C Vinar T da Fonseca RR Hubisz MJ Bustamante CD Nielsen RSiepel A 2008 Patterns of positive selection in six Mammalian ge-nomes PLoS Genet 4e1000144

Lombardo F Ghani Y Kafatos FC Christophides GK 2013Comprehensive genetic dissection of the hemocyte immune re-sponse in the malaria mosquito Anopheles gambiae PLoS Pathog9e1003145

Lopez C Saravia C Gomez A Hoebeke J Patarroyo MA 2010Mechanisms of genetically-based resistance to malaria Gene 4671ndash12

Loytynoja A 2014 Phylogeny-aware alignment with PRANK MethodsMol Biol 1079155ndash170

Loytynoja A Goldman N 2008 Phylogeny-aware gap placement pre-vents errors in sequence alignment and evolutionary analysisScience 3201632ndash1635

Masel J 2011 Genetic drift Curr Biol 21R837ndashR838Matsuka YV Pillai S Gubba S Musser JM Olmsted SB 1999 Fibrinogen

cleavage by the Streptococcus pyogenes extracellular cysteine

protease and generation of antibodies that inhibit enzyme proteo-lytic activity Infect Immun 674326ndash4333

Meyer A Van de Peer Y 2005 From 2R to 3R evidence for a fish-specificgenome duplication (FSGD) Bioessays 27937ndash945

Montoya-Burgos JI 2011 Patterns of positive selection and neutral evo-lution in the protein-coding genes of Tetraodon and Takifugu PLoSOne 6e24800

Moretti S Laurenczy B Gharib WH Castella B Kuzniar A Schabauer HStuder RA Valle M Salamin N Stockinger H et al 2014 Selectomeupdate quality control and computational improvements to a data-base of positive selection Nucleic Acids Res 42D917ndashD921

Morrissey JH Choi SH Smith SA 2012 Polyphosphate an ancient mol-ecule that links platelets coagulation and inflammation Blood 1195972ndash5979

Nathwani AC Tuddenham EG Rangarajan S Rosales C McIntosh JLinch DC Chowdary P Riddell A Pie AJ Harrington C et al 2011Adenovirus-associated virus vector-mediated gene transfer in hemo-philia B N Engl J Med 3652357ndash2365

Nielsen R Bustamante C Clark AG Glanowski S Sackton TB Hubisz MJFledel-Alon A Tanenbaum DM Civello D White TJ et al 2005 Ascan for positively selected genes in the genomes of humans andchimpanzees PLoS Biol 3e170

Panizzi P Friedrich R Fuentes-Prior P Richter K Bock PE Bode W 2005Fibrinogen substrate recognition by staphylocoagulase(pro)throm-bin complexes J Biol Chem 131179ndash1187

Proux E Studer RA Moretti S Robinson-Rechavi M 2009 Selectome adatabase of positive selection Nucleic Acids Res 37D404ndashD407

Puy C Tucker EI Wong ZC Gailani D Smith SA Choi SH Morrissey JHGruber A McCarty OJ 2013 Factor XII promotes blood coagulationindependent of factor XI in the presence of long-chain polypho-sphates J Thromb Haemost 111341ndash1352

R Core Team 2013 R a language and environment for statistical com-puting Vienna Austria R Foundation for Statistical Computing

Rallapalli PM Kemball-Cook G Tuddenham EG Gomez K Perkins SJ2013 An interactive mutation database for human coagulationfactor IX provides novel insights into the phenotypes and geneticsof hemophilia B J Thromb Haemost 111329ndash1340

Roux J Privman E Moretti S Daub JT Robinson-Rechavi M Keller L2014 Patterns of positive selection in seven ant genomes Mol BiolEvol 311661ndash1685

Saunders RE Shiltagh N Gomez K Mellars G Cooper C Perry DJTuddenham EG Perkins SJ 2009 Structural analysis of eight noveland 112 previously reported missense mutations in the interactiveFXI mutation database reveals new insight on FXI deficiencyThromb Haemost 102287ndash301

Schymkowitz J Borg J Stricher F Nys R Rousseau F Serrano L 2005 TheFoldX web server an online force field Nucleic Acids Res 33W382ndashW388

Sillitoe I Cuff AL Dessailly BH Dawson NL Furnham N Lee D Lees JGLewis TE Studer RA Rentzsch R et al 2013 New functionalfamilies (FunFams) in CATH to improve the mapping of con-served functional sites to 3D structures Nucleic Acids Res 41D490ndashD498

Spronk HM Govers-Riemslag JW ten CH 2003 The blood coagulationsystem as a molecular machine Bioessays 251220ndash1228

Stefl S Nishi H Petukh M Panchenko AR Alexov E 2013 Molecularmechanisms of disease-causing missense mutations J Mol Biol 4253919ndash3936

Storey JD Tibshirani R 2003 Statistical significance for genomewidestudies Proc Natl Acad Sci U S A 1009440ndash9445

Studer RA Penel S Duret L Robinson-Rechavi M 2008 Pervasive pos-itive selection on duplicated and nonduplicated vertebrate proteincoding genes Genome Res 181393ndash1402

Studer RA Robinson-Rechavi M 2009a Evidence for an episodic modelof protein sequence evolution Biochem Soc Trans 37783ndash786

Studer RA Robinson-Rechavi M 2009b Large-scale analyses of positiveselection using codon models In Pontarotti P editor Evolutionarybiology Berlin (Germany) Springer-Verlag Berlin Heidelbergp 217ndash235

3055

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Sun H 2006 The interaction between pathogens and the host coagu-lation system Physiology 21281ndash288

Sutherland MR Ruf W Pryzdial EL 2012 Tissue factor and glycoproteinC on herpes simplex virus type 1 are protease-activated receptor 2cofactors that enhance infection Blood 1193638ndash3645

Thomas CE Ehrhardt A Kay MA 2003 Progress and problemswith the use of viral vectors for gene therapy Nat Rev Genet 4346ndash358

Tokuriki N1 Stricher F Schymkowitz J Serrano L Tawfik DS 2007 Thestability effects of protein mutations appear to be universally dis-tributed J Mol Biol 3691318ndash1332

Vallender EJ Lahn BT 2004 Positive selection on the human genomeHum Mol Genet 13R245ndashR254

Vilella AJ Severin J Ureta-Vidal A Heng L Durbin R Birney E 2009EnsemblCompara GeneTrees complete duplication-aware phyloge-netic trees in vertebrates Genome Res 19327ndash335

Waterhouse AM Procter JB Martin DM Clamp M Barton GJ 2009Jalview Version 2mdasha multiple sequence alignment editor and anal-ysis workbench Bioinformatics 251189ndash1191

Wong WS Yang Z Goldman N Nielsen R 2004 Accuracy and power ofstatistical methods for detecting adaptive evolution in proteincoding sequences and for identifying positively selected sitesGenetics 1681041ndash1051

Yang Z 2006 Computational molecular evolution Oxford UniversityPress Oxford

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 241586ndash1591

Yang Z Wong WS Nielsen R 2005 Bayes empirical Bayes inference ofamino acid sites under positive selection Mol Biol Evol 221107ndash1118

Zhang J Nielsen R Yang Z 2005 Evaluation of an improved branch-sitelikelihood method for detecting positive selection at the molecularlevel Mol Biol Evol 222472ndash2479

3056

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

with the reported number of disease-causing mutations Theoutcome was presented using regression plots (fig 5)The regression plots and their correlation coefficients revealedthe relationship between the selection probability values atamino acid sites with the disease-causing mutations at thosesites The average probability of a site to be under positiveselection and its number of disease-causing mutationsshowed a negative relationship with correlation coefficientsof085067 and095 in FVIII FIX and FXI respectively(fig 5 top row) The correlation coefficients for neutral selec-tion followed that of positive selection with values of 087064 and 096 for FVIII FIX and FXI respectively (fig 5middle row) In distinction to these a positive relationshipwas seen between the average probability of negative selec-tion and the disease-causing mutations with correlation co-efficients of +086 +066 and +095 for FVIII FIX and FXIrespectively (fig 5 bottom row) To summarize the fewer thenumber of disease mutations at a given site the higher is theprobability for positive selection or neutral evolution (fig 5

top row and middle row) The greater the number of disease-causing mutations at a given site the higher the probabilityfor negative selection (fig 5 bottom row) The comparisonbetween the positively selected proteins FVIII and FIX withthe nonpositively selected protein FXI is noteworthyIrrespective of the selective pressure all three proteins ex-hibited similar regression lines in each of the three rowsThese comparisons show that there is a general inverse pro-portionality relationship between the probability of positiveselection (or neutral evolution) and the number of disease-causing mutations at each site This confirms previous obser-vations that disease-causing mutations tend to occur at crit-ical sites in the protein and that these highly conserved sitesare less tolerant to new mutations (negative selection)

Stability Effect of Mutations

In order to explain the consequences of disease-causing mu-tations on the protein structure FoldX software was used to

Table 3 Positively Selected Sites in the M8a-M8 Comparison

Gene Clade No ofCodons

Analyzed

No of Codonsunder positive

selection

PercentageEstimated byCodeML ()

Sites under Positive Selection (BEB4 50)

FGA Mammals 288 13 717 71832353951101108178212219246273

FGA Laurasiatheria 267 64 1329 467121929303336405293949697100102103108109116117118123136138141143155171180182195208210213214216218219220222223225227229231232233234236239242244245250253254

256259 260261264265

FGB Primates 488 17 822 394155808182130134135137173290367388413415448

F2 Vertebrates 475 5 099 5257151162381

F3 Mammals 240 11 663 55727691115177179185195230240

F3 Sauropsida 254 13 1060 340414474798081110123170208249

F5 Vertebrates 1182 10 087 1032755745785936027779269951010

F5 Mammals 1741 54 272 237243660134389392656663667669 676680718720731745749765768782809819822824875

88089489590791191692893094596797698910161040107610801089113611391324147114791515

152215481564

F5 Primates 2116 54 1321 74052129180211336341408409434 59266070370772075478487787988889290794195997810331039

10591131116812051209121812201222123012311232123612441254126212651279128514981504

151915261667168218491918

F7 Vertebrates 330 12 522 56526688162167178179209258262

F8 Mammals 1985 27 398 7245281339342521667718765784785792807817889912113011931197

12201229128813081333168619671984

F8 Primates 2351 33 063 42275579181485086088692796496697998899610191071118012821312

136114141421143614381459152715821613162216501689173023402349

F9 Mammals 443 15 440 45100114140150177181199202206209270309348

F10 Vertebrates 353 3 001 14470

F13A Vertebrates 631 6 126 398489497502550613

F13B Laurasiatheria 656 18 152 5567378107341349350374443459461497522525539540624

P4 95 P4 99

3047

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

estimate the stability effect (G) of residue changes on thethree-dimensional structures of human FVIII (chains A and B)FIX and FXI (fig 6) FoldX was designed to predict the stabilityeffect of a mutation This is an empirical method that relies onthe estimation of physical parameters from more than 1000mutations in protein structures and has been evaluated with21 proteins (Schymkowitz et al 2005 Tokuriki et al 2007)First each amino acid at each position was replaced with eachof the other 19 amino acids in order to provide a baseline forcomparison As expected the prediction of all virtual muta-tions shows a distribution skewed toward destabilization (me-dians G range between 051 and 123 kcalmol) Next theeffect of changing the human amino acids with those seen inevolution for the other 46 sequences of this study was calcu-lated in order to examine the effect of evolutionary change onprotein stabilities This produced a narrow distribution pre-dominantly centered on a difference of stability G of0 kcalmol and showing a slight skew toward deleterious mu-tations (medians G range between 014 and 030 kcalmol) This outcome is as expected because evolutionarychanges in the amino acids are selected in order to preservethe stability of the protein structures Finally the effect ofchanging the human amino acids to those observed inunique pathogenic missense mutations was calculated Thisanalysis produced a clearly skewed distribution (mediansG range between 101 and 295 kcalmol) of G

values that ranged as high as 10 kcalmol (and even higherbut these higher values were clipped for easier visualization)this agrees with the fact that these mutations affect the pro-tein structure The first distribution for the virtual mutationsis intermediate between the latter two distributions asexpected

The same outcome of stability effects was also observed atthe individual amino acid level (fig 7) When the effect of 19amino acid replacements were computed at each position forFVIII FIX and FXI and the majority of replacements showed adestabilizing effect (410 kcalmol in red) A small propor-tion of replacements stabilized the structure (lt10 kcalmol in blue) indicating the ability of the protein to compen-sate deleterious mutations by further mutation Although thisappears to occur more strongly for the FVIII chains A and B(fig 7a and b) the proportion of allowed replacements is infact similar for all four proteins when their different sizes areconsidered When the effect of replacements by disease-caus-ing missense mutations was calculated and the G valuesshowed that these almost invariably destabilized the threeprotein structures

DiscussionIn this study we have assembled sequence data for 14 coag-ulation genes across 47 genomes in order to analyze the evo-lutionary changes in amino acids that take place between five

FIG 4 Probability of evolutionary selection across the Primates clade for the three proteins FVIII FIX and FXI The probabilities of negative neutral andpositive selection were plotted against the amino acid position in the three proteins The probability values were obtained from CodeML analyses of theseven primate sequences (gray negative selection yellow neutral selection green positive selection)

3048

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

major clades and several minor ones These changes werecompared directly with disease-causing mutation data forthree coagulation proteins FVIII FIX and FXI Good-qualitypathogenic mutational data are publicly available for thesethree proteins and the bleeding disorders caused by thesethree sets of mutational data correspond to the three mostprevalent bleeding disorders The outcome of these detailedanalyses confirms that positive selection occurred during dif-ferent stages in the evolution of several coagulation proteinsIt was also observed when comparing the protein structurestabilities that the pathogenic mutations have a destabilizingeffect on the protein structure as opposed to the stable mu-tations that are tolerated during evolution (fig 6) The com-bination of positive selection analyses with those forpathogenic mutations provides information about predictingdisease causing mutations and clarifies fresh approaches fortherapeutic interventions in bleeding disorders

Selective Pressures in Vertebrate Evolution

Evolutionary changes combine random mutational eventswith natural selection The probability of a mutation to befixed in a population depends on its consequence on thefitness of an organism New adaptive mutations that are ben-eficial in terms of fitness are more likely to be retained bypositive selection than neutral mutation Neutral mutations

that are neither beneficial nor detrimental are randomly re-tained or removed by genetic drift depending on the popu-lation size (neutral selection) Detrimental mutations aremore likely to be removed from the population by negative(or purifying) selection There are no strict patterns and adeleterious mutation can sometimes be beneficial to oneaspect of the organism The textbook example of this issickle-cell anaemia (drepanocytosis) which is a bleeding dis-order caused by the pGlu6Val mutation in the hemoglobin gene Mutations responsible for sickle-cell anaemia are gen-erally removed by natural selection In malaria-infested pop-ulations particularly in tropical or subtropical regions thesemutations are selected for in one allele because they offerpartial protection against the pathogen (Lopez et al 2010)The malaria-infected sickle cells inhibit infection by the ma-laria plasmodium in heterozygous cases (beneficial) but causesickle-cell anaemia in homozygous cases (deleterious)

Genes corresponding to proteins in direct contact with theenvironment are more likely to be under adaptive selectionThus genes involved in host-pathogen interactions are com-monly associated with positive selection Large-scale analysesrevealing positive selection in vertebrates or arthropods havebeen reported for genes involved with immunity (Nielsenet al 2005 Studer et al 2008 Montoya-Burgos 2011 Rouxet al 2014) In the present detailed study positive selection inseveral genes of the coagulation cascade was identified at

FIG 5 Relationship between disease-causing mutations and evolutionary pressures The correlation between evolutionary selection and the number oftimes a disease-causing mutation occurs at each amino acid position is shown In the columns (a) FVIII (b) FIX and (c) FXI correspond to the threecoagulation factors for which pathogenic mutational information is available in sufficient quantity from our three mutational databases The three rowscorrespond to the posterior probability (based on the BEB model of CodeML) of positive selection neutral evolution and negative selection forprimates

3049

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

different stages in evolution (fig 3) An earlier study identifiedpositive selection across six mammalian genomes (humanchimpanzee macaque mouse rat and dog) using LRTsbased on codon substitution models and false discoveryrates to evaluate multiple LRT comparisons (Kosiol et al2008) Although they found strong positive selection(Plt 005 and false discovery ratelt 005) in nine complementgenes only nominal positive selection (Plt 005) was identi-fied in four coagulation genes (F2 F12 TFPI PROC) none ofwhich were identified in figure 3b A limitation of this earlieranalysis may be the low number of six genomes used Notethat complement is related to coagulation in that the latterhas the ability to activate complement of innate immunitythat mediates host-pathogen interactions (Amara et al 2008)

Positive Selection of the Coagulation Proteins throughHost-Pathogen Interactions

As a consequence of the immune response blood coagula-tion is often exploited by pathogens for reason of infectiveand septic processes For example FG and FV are targeted bybacteria thus offering a straightforward explanation of posi-tive selection (fig 3) We discuss this in terms of FG and FIIfollowed by FIII and FVII and then FV and FX

By taking advantage of the increased number of availablegenome sequences positive selection has been identified inseveral coagulation genes during various stages in evolutionIn particular all three subunits FGA FGB and FGG of FG haveshown positive selection at periods during vertebrate evolu-tion according to both the site model and the branch-sitemodels Similarly prothrombin (FII) has shown positive selec-tion but only across all the vertebrate genomes and notacross any smaller clades Although the immune system isunder strong adaptive constraints because it needs to evolveconstantly against pathogens the coagulation system alsocomprises the first line of immune defense against bacteriato form a clot to block bacterial invasion (Sun 2006) Both FGand prothrombin (FII) are targeted by certain bacteria Forexample FG is targeted during the course of infection bycysteine protease a virulence factor of Streptococcus pyogenes(Matsuka et al 1999) and FG binding protein Efb ofStaphylococcus aureus Similarly FG-like proteins exist inAnopheles gambiae and are involved in the function of theimmune system against malaria or bacterial parasites (Dongand Dimopoulos 2009 Lombardo et al 2013) Thrombin in-teracts with staphylocoagulase a protein secreted by thehuman pathogen S aureus and activates prothrombin with-out proteolysis to form thrombin The resulting

FIG 6 The effect of amino acid replacements on the overall protein stability change G Data are shown for (a and b) FVIII A and B chains (c) FIX (d)FXI Three calculations were performed each being based on the sequences for which crystal structures are known (PDB codes 2R7E [FVIII] 2WPH [FIX]and 2F83 [FXI]) The FVIII A and B chains correspond to the N-terminal residues 19ndash760 and the C-terminal residues 1582ndash2351 that are observed in itscrystal structure The calculation represented by the dashed line indicates the distribution calculated for all 19 possible amino acid replacements Thatrepresented by the dotted line indicates the distribution calculated for all the evolutionary occurring replacements that were observed in our data set of47 genomes That represented by the continuous line indicates the distribution calculated for the disease-causing mutations from the FVIII FIX and FXImutation databases

3050

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

prothrombinmdashstaphylocoagulase complex binds to FG tocleave this into self-polymerizing fibrin This process is centralto the molecular pathology of S aureus endocarditis becausestaphylocoagulase bypasses physiological blood coagulationthrough the evasion of the recognition of prothrombin bycirculating thrombin inhibitors (Panizzi et al 2005)

Positive selection for FIII and FVII is attributed to a host-pathogen interaction with the herpes virus family These cellsurface molecules do not necessarily participate directly in animmune response but can still serve as a convenient gatewayfor pathogen entry (Vallender and Lahn 2004) Herpes virusespromote cellular surface infections by coevolving with theirhuman host indicating a mode by which pathogens exploitthe coagulation system (Sutherland et al 2012) A study of thehost-pathogen driven coevolution of herpes virus revealedthat this virus has existed from the time of mammals anddivided into the HSV-1 and HSV-2 types recently during

human evolution (Kolb et al 2013) These observations sug-gest that the pressure for FIII to evolve in early mammalsespecially through primates could be the reason for the pos-itive selection of FIII across the mammalian clade (fig 3) Wepropose that the FIIIndashFVII interaction is a significant drivingforce for positive selection in FVII

The FV and FVIII cofactors are homologous to each otherin their sequence and domain organization hence it is notsurprising to see similar selection pressures acting on them(fig 3c) A difference arises in the entire vertebrate genome(fig 3a) where FV shows positive selection whereas FVIII doesnot This difference between FV and FVIII across vertebratesmay result from the interaction of FV with Escherichia coliThe extracellular serine protease EspP from E coli cleaves FVto lead to its degradation and thus reduces the activity of thecoagulation cascade (Brunder et al 1997) The activated FVacofactor forms a FVaFXa complex with activated FX (fig 1)

FIG 7 The effect of amino acid replacements at the sequence level on the protein stability change G The residue stability changes are depicted ona seven point scale from highly destabilizing (red) to highly stabilizing (blue) for each of (a) FVIII chain A (b) FVIII chain B (c) FIX and (d) FXI Their PDBcodes are indicated in figure 6 In each panel the upper half shows the G values for all the possible 19 amino acid replacements at each residueposition that is part of the crystal structure and the lower half shows the G values for the disease-causing mutations taken from the mutationdatabases

3051

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

which is important for thrombin formation the most impor-tant step for fibrin clot formation Because both FVa and FXaare central to coagulation and are associated with bacterialproteolysis this offers an explanation for the signatures ofpositive selection in both proteins

FVIII FIX and FXIII are also associated with positive selec-tion however it is not clear what selective pressures are in-volved Although these proteins may also be targeted bypathogens no direct interactions have been reported so far

Absence of Positive Selection for FXI and FXIIFXII is unique in that it does not show positive selection eventhough it is involved in host-pathogen interactions with long-chain inorganic phosphatases (polyP) from microorganisms(Morrissey et al 2012) Small-chain polyP are produced in thehuman brain and by blood platelets Hence even during theabsence of pathogen-originated polyP the human host doesnot lack polyP and there is no pressure to evolve for this (Puyet al 2013) A F12 deficiency in patients does not cause bleed-ing disorders This means that the coagulation cascade can beinitiated even in the absence of F12 For similar reasons FXIshows no positive selection We hypothesize that this is be-cause first no interaction with the environment has beendemonstrated to date and second with respect to the polyPmechanism the coagulation cascade is capable of activatingin a FXI-independent manner

Relationship between Disease-Causing Mutations andSelection

The mechanisms of genetic drift mutational change andevolutionary change are closely related to each otherEvolutionary change is not only dependent on the geneticvariability from mutational change but also on genetic driftwhich describes the random fluctuations in the number ofgenetic variants in an organism Genetic drift and mutationsproduce random variations that selection can act upon Thenew changes incurred in one generation may be cancelled inanother generation It may therefore be important to deter-mine whether a new mutation is lost or becomes commonenough for selection to determine its fate (Masel 2011) In thisstudy we have examined the relationship between naturalselection acting on an amino acid and the disease-causingmutations reported at that amino acid position for thethree prominent coagulation factors FVIII FIX and FXI Wehave shown that the disease-causing mutations are morelikely to occur at a higher frequency at sites under negativeselection in all three proteins (fig 5) As sites under strongselective pressure are more likely to be critical to proteinfunction such a mutation can readily affect protein functionand trigger disease Lethal (or deleterious) mutations thatcause death are de facto not observed in our data set Bycontrast disease-causing mutations occur at a reduced fre-quency in sites under neutral or positive selection suggestingthat these sites are more tolerant to mutations than sitesunder negative selection Interestingly we have shown thatFVIII and FIX have undergone much positive selectionwhereas FXI has no such signatures (fig 4) Despite this dif-ference our results for three proteins show no change in the

relationship between evolutionary selection and disease-causing mutations (figs 5 and 6) In summary the reporteddisease-causing mutations show randomness and the muta-bility of a given amino acid position depends on the selectionpressure acting at that position and not on the entire protein

Effect of Disease-Causing Mutations on the Stability ofProtein Structure

Mutations that change the amino acid sequence can havestriking effects on protein stability Point mutations are re-lated to pathological and genetic conditions by reducing orabolishing coagulation protein function (table 1) Proteinfunction is related to its stability which is often measuredby the change in folding energy G of the protein structure(Stefl et al 2013) We have identified the effect of mutationson the stability of FVIII FIX and FXI in terms of G valuesThe stability effects of polymorphic amino acid changes inhuman sequences follow a Gaussian distribution (de Beeret al 2013) In that study disease-causing mutations alsoshowed the same distribution but slightly skewed towarddestabilization In this study the same trend for aminoacids substitutions in vertebrates were observed but this ismuch stronger for disease-causing mutations (fig 6) The dif-ference between the two studies may be explained by thedifferent methods used de Beer et al (2013) used the DiscreteOptimized Protein (DOPE) score of Modeller as a monitor ofglobal stability whereas we used FoldX (Materials andMethods) as an empirical method to infer directly the stabilityeffect G of a point mutation

Evolutionarily observed mutations are the most stable interms of their protein structure whereas disease-causing mu-tations destabilized the protein structure (figs 6 and 7) Eachsite may or may not tolerate a particular amino acid changedepending on the change in stability When each amino acidwas converted to each of the other 19 possible amino acids(fig 7 top panels) the G values showed that most aminoacids destabilize the structure whereas several may stabilize itThe disease-causing effect of a mutation may be affected notonly by the selection pressure acting on it but also by the typeof the amino acid replaced For the FVIII FIX and FXI bleedingdisorders destabilization was observed for most of the dis-ease-causing mutations

Utility of Evolutionary and Disease-CausingMutational Analyses for Clinical Diagnosis

Interactive locus-specific databases are useful tools for pre-senting patient mutational data for patient diagnosis andcare We plan to incorporate the above data sets on selectivepressures and stability effects into our mutational databasesfor FVIII FIX and FXI This information will provide insight forthe prediction of the effect of disease-causing mutations Thelevel of selective pressure was estimated under a maximum-likelihood framework Such methods are useful for geneticstudies but their results are sometimes difficult to interpretby a nonspecialist The presentation of this information needsto be intuitive to the viewer As shown in supplementaryfigure S3 Supplementary Material online the use of multiple

3052

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

sequence alignments and presentation of stability effectsG structural locations and selective pressures can helpin interpreting the impacts of the mutations This type ofinformation will therefore clarify the interpretation of newlydiscovered disease-causing mutations for clinicians evolu-tionary biologists and protein biochemists This combinedevolutionary-driven study of bleeding disorder-causing muta-tions is the most detailed of its kind as far as we are aware andmay be usefully extended to other biological processes andtheir proteins

Materials and Methods

Genomic Sequences

The flowchart of operations is summarized in supplementaryfigure S1 Supplementary Material online First the completemultiple sequence alignments were retrieved together withtheir gene trees for each of the 14 genes corresponding to the11 coagulation factors (table 1) from the Ensembl Comparadatabase (Vilella et al 2009) using their Ensembl IDs (table 1)and a Perl script Ensembl release 73 was used accessed onSeptember 30 2013 (Flicek et al 2013) These amino acidsequences came from Ensembl as they are generally ofgood quality and contained 74 vertebrate genomes in atotal of 80 genomes We then excluded any sequences thatcame from genomes of low coverage (2 or less) This re-sulted in 47 vertebrate genomes with alignment sizes be-tween 23 and 47 genomes These 47 sets of vertebratesequences provided sufficient genomic coverage to performphylogenetic analysis The 47 vertebrate organisms were clas-sified into five major clades (fig 2) namely Primates Glires(rodents rabbits and pika) Laurasiatheria Sauropsida (rep-tiles and birds) and Fishes (ray-finned fishes) together withminor clades including Atlantogenata and Australianmammals

Sequence Retrieval and Multiple Alignments

The multiple sequence alignments and gene trees fromEnsemblCompara included all the coagulation sequences indifferent species irrespective of their genome coverage Themultiple sequence alignments of the downloaded genes fromthe EnsemblCompara database were constructed using aglobal alignment algorithm provided within this databasefor significant number of sequences of arbitrary lengthsThe data set required for this study comprised 14 genesfrom 47 genomes A thorough quality control procedure ofthe multiple sequence alignments was performed in order togenerate minimum false positive results Thus sequences wereremoved if they showed 41 low-complexity regions orunknown nucleotides (indicated as ldquoNrdquo in a sequence)From these the phylogenetic trees were pruned usingNewick Utilities software (Junier and Zdobnov 2010) to iden-tify the tips corresponding to the retrieved species (fig 2) Thefinal alignments of these sequences were made with PRANKwhich is conservative in that it tends to align only amino acidsderived by substitution and introduces insertions and dele-tions (indels) instead of aligning sequence fragments thatappear too divergent (Loytynoja and Goldman 2008)

PRANK is an advanced probabilistic alignment algorithmthat incorporates phylogenetic information This has beenshown to reduce the number of false-positives when evalu-ating positive selection (Fletcher and Yang 2010 Loytynoja2014) All the PRANK alignments were made using defaultparameters for the empirical codon model and the prunedtrees as guides The resulting alignments are composed ofhighly conserved blocks surrounded by noise due to indelsFinally Gblocks (Castresana 2000) was used to remove thenoise in the unreliable regions of the multiple sequence align-ments which may cause false positives when testing for pos-itive selection to keep only the well-aligned parts The defaultGblocks parameters for the codontype model (parameter ndasht = c) were used with the exception of tolerating half gaps perposition (parameter -b5=h) This procedure resulted in cleanand accurate multiple sequence alignments for the final com-putations of the site and branch-site models using CodeMLtogether with their corresponding gene trees

Analyses of Evolutionary Pressure

For each of the 14 coagulation genes within each of the fivemajor clades the selective pressures were estimated using thedifferent codon substitution site models implemented inCodeML of the phylogenetic analysis with maximum-likeli-hood software (PAML release 47) (Yang 2007) By analyzingthe protein-coding genes at the nucleotide level the synon-ymous or silent changes (nucleotide substitutions that do notchange the encoded amino acid) are distinguished from thenonsynonymous or missense changes (substitutions thatchange the encoded amino acid) Because natural selectionoperates at the protein level and can take any direction ofpositive neutral or negative selection the synonymous andnonsynonymous substitutions are subjected to different se-lective pressures which occur at the rates dS and dN respec-tively Hence the ratio of nonsynonymoussynonymoussubstitution rates (= dNdS) measures the selective pressureat the protein level The value of reveals the direction andstrength of natural selection acting on the protein Values oflt 1 = 1 and 4 1 indicate negative selection neutralevolution and positive selection respectively If an existingfunction or phenotype is evolutionarily favorable then nega-tive selection (or purifying selection) favors the conservationof that function or phenotype (conservation of amino acids)The other is positive selection (or Darwinian selection) whichfavors the promotion of a new function or phenotypeNeutral evolution describes the tolerance to amino acid mu-tations which are neither deleterious nor beneficial and arefixed according to genetic drift

The following five models were used to account for thevariation of dNdS among codon sites M1a (two dNdS ratioswith negative selection [0lt 1] and neutral evolution[1 = 1]) M2a (three dNdS ratios with negative selection[0lt 1] neutral evolution [1 = 1] and positive selection[2 1]) M7 (beta) M8 (beta and 24 1) and M8a (betaand 2 = 1) The alignments contain several positions inwhich one of the sequences is missing and the ldquocleandata = 0rdquooption was selected to retain all sites The inference of positive

3053

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

selection was conducted by performing LRTs in which thefollowing two pairs of models were compared namely M1awith M2a and M8a with M8 The principle is to compare anull model that does not allow24 1 (M1a or M8a) with analternative model that does (M2a or M8) Because the alter-nate model has more parameters than the null model theLRT follows a 2 distribution The LRT is a powerful methodfor comparing two hypotheses where one is the special caseof the other The BEB method (Yang et al 2005) was used tocalculate the posterior probabilities for site classes and theBEB value was used to identify sites under positive selectionwhere the 2 distribution of the LRT was significant withPlt 005 As many different LRT comparisons (differentgenes different clades) were conducted a control for thefalse-discovery rate (expected proportion of false positives)was applied using the q value package in R (Storey andTibshirani 2003) We adjusted the parameters of the q valueto perform a bootstrap analysis (pi0meth = ldquobootstraprdquo) andto tolerate 10 of false positives in our significant results(ldquofdrlevel = 01rdquo) A bootstrap analysis is necessary as the Pvalue distribution is bimodal (Studer et al 2008)

Two sources for the branch-site model were used First theSelectome database at httpselectomeunilch (last accessedJanuary 2014) (Proux et al 2009 Moretti et al 2014) providesthe precomputed results of the branch-site model (supple-mentary table S2b Supplementary Material online) based onthe multiple sequence alignments from the EnsemblCompara database (Vilella et al 2009) The last release ofSelectome (release 6) contains an optimized procedure ofalignment refinement and quality control similar to theone used for our site-model analyses as described aboveThe branch-site model allows the dNdS value to vary notonly between sites but also between branches (Zhang et al2005) The advantage of this model is its ability to detecttraces of episodic positive selection (Studer and Robinson-Rechavi 2009a) Second a branch-site model was determinedbased on our data set (see above) We used the vertebratetrees and their multiple sequence alignments as inputs forbranch-site model calculations The multiple alignments weremanually rechecked to remove barren sequences and thecorresponding vertebrate trees were edited accordinglyusing Newick Utilities Each taxonomic unit also known asthe branch or node of the tree was tagged using an auto-mated python script The CodeML program was run on eachof these tagged trees from all the genes for this branch-sitemodel (supplementary table S2b Supplementary Materialonline) For some alignments we found misaligned short in-sertions up to a maximum of ten amino acids in the con-served blocks probably due to gene model prediction errorsIn our whole data set these errors affect 0ndash4 sequences peralignment As the F2 gene family contained most of thesesmall errors albeit at a very low level this was reanalyzed byremoving four sequences with potential problems from thealignment The recalculation of the likelihood ratio deviatedonly slightly from the initial results (P value of 393e5 for 38sequences vs P value of 217e4 for 34 sequences) This sug-gested that these errors would only have a minor impact ifany on the detection of positive selection especially as the

coagulation factor sequences were long and the model ofcodon substitutions was averaged over all columns (manyhundreds) and sequences (between 23 and 47)

Stability Effect of Mutations

When possible for each coagulation factor the molecularstructures of the domain(s) were retrieved directly fromthe CATH database (Sillitoe et al 2013) Prior to the muta-tional analyses each structure was repaired with the FoldXprogram (Schymkowitz et al 2005) to remove potential stericclashes FoldX utilizes an empirical force field model to esti-mate the stability of a protein (Gwildtype in kcalmol) FoldXwas used to estimate the stability effect on each residueposition when the residue was replaced by any of the 19other amino acids The stability effect is determined bythe difference in the wildtype and mutant energies(G = GmutantGwildtype) This resulted in a stabilitylandscape of all potential mutations including the Gvalues both for the disease mutations and for all the aminoacid substitutions observed across vertebrate evolution

Statistical Analyses

All statistical analyses were performed with R v302 (R CoreTeam 2013) and SigmaPlot (Systat Software San Jose CA)The pipeline was developed with Biopython scripts (Cocket al 2009) The visualization of the multiple sequence align-ments was performed using Jalview 28 (Waterhouse et al2009)

Data Availability

The final sequence alignments are provided on a webpagehttpwwwbiochemuclacukpavithraevolutioncoagulationindexphp (last accessed January 2014) Pathogenic mu-tation data for three human coagulation factors FVIII FIX andFXI were retrieved from our web databases at httpwwwfactorviii-dborg httpwwwfactorixorg and httpwwwfactorxiorg (last accessed January 2014) (Saunders et al2009 Rallapalli et al 2013)

Supplementary MaterialSupplementary tables S1ndashS2 and figures S1ndashS3 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

PMR and SJP thank Pfizer UK for a medical educationalgrant that generously funded this project and the SpecialTrustees of the Royal Free Hospital the KatharineDormandy Trust for Hemophilia and Related Disorders andthe European Association for Hemophilia and AssociatedDisorders for additional support The authors also thank DrKeith Gomez Dr Geoffrey Kemball-Cook and ProfessorEdward G Tuddenham for their generous collaborative sup-port of our studies including access to the FVIII mutations atthe Hemophilia A Database at httpwwwfactorviii-dborgRAS acknowledges funding from the Fondation du 450eme

3054

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

anniversaire de lrsquoUniversite de Lausanne and Swiss NationalScience Foundation grants 132476 and 136477

ReferencesAmara U Rittirsch D Flierl M Bruckner U Klos A Gebhard F Lambris

JD Huber-Lang M 2008 Interaction between the coagulation andcomplement system Adv Exp Med Biol 63271ndash79

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19950ndash958

Broze GJ Jr 1995 Tissue factor pathway inhibitor and the revised theoryof coagulation Annu Rev Med 46103ndash112

Brunder W Schmidt H Karch H 1997 EspP a novel extracellular serineprotease of enterohaemorrhagic Escherichia coli O157H7 cleaveshuman coagulation factor V Mol Microbiol 24767ndash778

Butenas S vanrsquot Veer C Cawthern K Brummel KE Mann KG 2000Models of blood coagulation Blood Coagul Fibrinolysis 11(Suppl 1)S9ndashS13

Castresana J 2000 Selection of conserved blocks from multiple align-ments for their use in phylogenetic analysis Mol Biol Evol 17540ndash552

Cock PJ Antao T Chang JT Chapman BA Cox CJ Dalke A Friedberg IHamelryck T Kauff F Wilczynski B et al 2009 Biopython freelyavailable Python tools for computational molecular biology andbioinformatics Bioinformatics 251422ndash1423

de Beer TA Laskowski RA Parks SL Sipos B Goldman N Thornton JM2013 Amino Acid changes in disease-associated variants differ rad-ically from variants observed in the 1000 genomes project datasetPLoS Comput Biol 91ndash15

Dong Y Dimopoulos G 2009 Anopheles fibrinogen-related proteinsprovide expanded pattern recognition capacity against bacteriaand malaria parasites J Biol Chem 2849835ndash9844

Doolittle RF 2009 Step-by-step evolution of vertebrate blood coagula-tion Cold Spring Harb Symp Quant Biol 7435ndash40

Doolittle RF Jiang Y Nand J 2008 Genomic evidence for a simplerclotting scheme in jawless vertebrates J Mol Evol 66185ndash196

Fletcher W Yang Z 2010 The effect of insertions deletions and align-ment errors on the branch-site test of positive selection Mol BiolEvol 272257ndash2267

Flicek P Ahmed I Amode MR Barrell D Beal K Brent S Carvalho-SilvaD Clapham P Coates G Fairley S et al 2013 Ensembl 2013 NucleicAcids Res 41D48ndashD55

Junier T Zdobnov EM 2010 The Newick utilitieshigh-throughput phy-logenetic tree processing in the UNIX shell Bioinformatics 261669ndash1670

Kawthalkar S 2013 Overview of Physiology of blood In essentials ofhaematology New Delhi (India) Jaypee Brothers Medical Publishers(P) Ltd p 1ndash51

Kolb AW Ane C Brandt CR 2013 Using HSV-1 genome phylogeneticsto track past human migrations PLoS One 81ndash9

Kosiol C Vinar T da Fonseca RR Hubisz MJ Bustamante CD Nielsen RSiepel A 2008 Patterns of positive selection in six Mammalian ge-nomes PLoS Genet 4e1000144

Lombardo F Ghani Y Kafatos FC Christophides GK 2013Comprehensive genetic dissection of the hemocyte immune re-sponse in the malaria mosquito Anopheles gambiae PLoS Pathog9e1003145

Lopez C Saravia C Gomez A Hoebeke J Patarroyo MA 2010Mechanisms of genetically-based resistance to malaria Gene 4671ndash12

Loytynoja A 2014 Phylogeny-aware alignment with PRANK MethodsMol Biol 1079155ndash170

Loytynoja A Goldman N 2008 Phylogeny-aware gap placement pre-vents errors in sequence alignment and evolutionary analysisScience 3201632ndash1635

Masel J 2011 Genetic drift Curr Biol 21R837ndashR838Matsuka YV Pillai S Gubba S Musser JM Olmsted SB 1999 Fibrinogen

cleavage by the Streptococcus pyogenes extracellular cysteine

protease and generation of antibodies that inhibit enzyme proteo-lytic activity Infect Immun 674326ndash4333

Meyer A Van de Peer Y 2005 From 2R to 3R evidence for a fish-specificgenome duplication (FSGD) Bioessays 27937ndash945

Montoya-Burgos JI 2011 Patterns of positive selection and neutral evo-lution in the protein-coding genes of Tetraodon and Takifugu PLoSOne 6e24800

Moretti S Laurenczy B Gharib WH Castella B Kuzniar A Schabauer HStuder RA Valle M Salamin N Stockinger H et al 2014 Selectomeupdate quality control and computational improvements to a data-base of positive selection Nucleic Acids Res 42D917ndashD921

Morrissey JH Choi SH Smith SA 2012 Polyphosphate an ancient mol-ecule that links platelets coagulation and inflammation Blood 1195972ndash5979

Nathwani AC Tuddenham EG Rangarajan S Rosales C McIntosh JLinch DC Chowdary P Riddell A Pie AJ Harrington C et al 2011Adenovirus-associated virus vector-mediated gene transfer in hemo-philia B N Engl J Med 3652357ndash2365

Nielsen R Bustamante C Clark AG Glanowski S Sackton TB Hubisz MJFledel-Alon A Tanenbaum DM Civello D White TJ et al 2005 Ascan for positively selected genes in the genomes of humans andchimpanzees PLoS Biol 3e170

Panizzi P Friedrich R Fuentes-Prior P Richter K Bock PE Bode W 2005Fibrinogen substrate recognition by staphylocoagulase(pro)throm-bin complexes J Biol Chem 131179ndash1187

Proux E Studer RA Moretti S Robinson-Rechavi M 2009 Selectome adatabase of positive selection Nucleic Acids Res 37D404ndashD407

Puy C Tucker EI Wong ZC Gailani D Smith SA Choi SH Morrissey JHGruber A McCarty OJ 2013 Factor XII promotes blood coagulationindependent of factor XI in the presence of long-chain polypho-sphates J Thromb Haemost 111341ndash1352

R Core Team 2013 R a language and environment for statistical com-puting Vienna Austria R Foundation for Statistical Computing

Rallapalli PM Kemball-Cook G Tuddenham EG Gomez K Perkins SJ2013 An interactive mutation database for human coagulationfactor IX provides novel insights into the phenotypes and geneticsof hemophilia B J Thromb Haemost 111329ndash1340

Roux J Privman E Moretti S Daub JT Robinson-Rechavi M Keller L2014 Patterns of positive selection in seven ant genomes Mol BiolEvol 311661ndash1685

Saunders RE Shiltagh N Gomez K Mellars G Cooper C Perry DJTuddenham EG Perkins SJ 2009 Structural analysis of eight noveland 112 previously reported missense mutations in the interactiveFXI mutation database reveals new insight on FXI deficiencyThromb Haemost 102287ndash301

Schymkowitz J Borg J Stricher F Nys R Rousseau F Serrano L 2005 TheFoldX web server an online force field Nucleic Acids Res 33W382ndashW388

Sillitoe I Cuff AL Dessailly BH Dawson NL Furnham N Lee D Lees JGLewis TE Studer RA Rentzsch R et al 2013 New functionalfamilies (FunFams) in CATH to improve the mapping of con-served functional sites to 3D structures Nucleic Acids Res 41D490ndashD498

Spronk HM Govers-Riemslag JW ten CH 2003 The blood coagulationsystem as a molecular machine Bioessays 251220ndash1228

Stefl S Nishi H Petukh M Panchenko AR Alexov E 2013 Molecularmechanisms of disease-causing missense mutations J Mol Biol 4253919ndash3936

Storey JD Tibshirani R 2003 Statistical significance for genomewidestudies Proc Natl Acad Sci U S A 1009440ndash9445

Studer RA Penel S Duret L Robinson-Rechavi M 2008 Pervasive pos-itive selection on duplicated and nonduplicated vertebrate proteincoding genes Genome Res 181393ndash1402

Studer RA Robinson-Rechavi M 2009a Evidence for an episodic modelof protein sequence evolution Biochem Soc Trans 37783ndash786

Studer RA Robinson-Rechavi M 2009b Large-scale analyses of positiveselection using codon models In Pontarotti P editor Evolutionarybiology Berlin (Germany) Springer-Verlag Berlin Heidelbergp 217ndash235

3055

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Sun H 2006 The interaction between pathogens and the host coagu-lation system Physiology 21281ndash288

Sutherland MR Ruf W Pryzdial EL 2012 Tissue factor and glycoproteinC on herpes simplex virus type 1 are protease-activated receptor 2cofactors that enhance infection Blood 1193638ndash3645

Thomas CE Ehrhardt A Kay MA 2003 Progress and problemswith the use of viral vectors for gene therapy Nat Rev Genet 4346ndash358

Tokuriki N1 Stricher F Schymkowitz J Serrano L Tawfik DS 2007 Thestability effects of protein mutations appear to be universally dis-tributed J Mol Biol 3691318ndash1332

Vallender EJ Lahn BT 2004 Positive selection on the human genomeHum Mol Genet 13R245ndashR254

Vilella AJ Severin J Ureta-Vidal A Heng L Durbin R Birney E 2009EnsemblCompara GeneTrees complete duplication-aware phyloge-netic trees in vertebrates Genome Res 19327ndash335

Waterhouse AM Procter JB Martin DM Clamp M Barton GJ 2009Jalview Version 2mdasha multiple sequence alignment editor and anal-ysis workbench Bioinformatics 251189ndash1191

Wong WS Yang Z Goldman N Nielsen R 2004 Accuracy and power ofstatistical methods for detecting adaptive evolution in proteincoding sequences and for identifying positively selected sitesGenetics 1681041ndash1051

Yang Z 2006 Computational molecular evolution Oxford UniversityPress Oxford

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 241586ndash1591

Yang Z Wong WS Nielsen R 2005 Bayes empirical Bayes inference ofamino acid sites under positive selection Mol Biol Evol 221107ndash1118

Zhang J Nielsen R Yang Z 2005 Evaluation of an improved branch-sitelikelihood method for detecting positive selection at the molecularlevel Mol Biol Evol 222472ndash2479

3056

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

estimate the stability effect (G) of residue changes on thethree-dimensional structures of human FVIII (chains A and B)FIX and FXI (fig 6) FoldX was designed to predict the stabilityeffect of a mutation This is an empirical method that relies onthe estimation of physical parameters from more than 1000mutations in protein structures and has been evaluated with21 proteins (Schymkowitz et al 2005 Tokuriki et al 2007)First each amino acid at each position was replaced with eachof the other 19 amino acids in order to provide a baseline forcomparison As expected the prediction of all virtual muta-tions shows a distribution skewed toward destabilization (me-dians G range between 051 and 123 kcalmol) Next theeffect of changing the human amino acids with those seen inevolution for the other 46 sequences of this study was calcu-lated in order to examine the effect of evolutionary change onprotein stabilities This produced a narrow distribution pre-dominantly centered on a difference of stability G of0 kcalmol and showing a slight skew toward deleterious mu-tations (medians G range between 014 and 030 kcalmol) This outcome is as expected because evolutionarychanges in the amino acids are selected in order to preservethe stability of the protein structures Finally the effect ofchanging the human amino acids to those observed inunique pathogenic missense mutations was calculated Thisanalysis produced a clearly skewed distribution (mediansG range between 101 and 295 kcalmol) of G

values that ranged as high as 10 kcalmol (and even higherbut these higher values were clipped for easier visualization)this agrees with the fact that these mutations affect the pro-tein structure The first distribution for the virtual mutationsis intermediate between the latter two distributions asexpected

The same outcome of stability effects was also observed atthe individual amino acid level (fig 7) When the effect of 19amino acid replacements were computed at each position forFVIII FIX and FXI and the majority of replacements showed adestabilizing effect (410 kcalmol in red) A small propor-tion of replacements stabilized the structure (lt10 kcalmol in blue) indicating the ability of the protein to compen-sate deleterious mutations by further mutation Although thisappears to occur more strongly for the FVIII chains A and B(fig 7a and b) the proportion of allowed replacements is infact similar for all four proteins when their different sizes areconsidered When the effect of replacements by disease-caus-ing missense mutations was calculated and the G valuesshowed that these almost invariably destabilized the threeprotein structures

DiscussionIn this study we have assembled sequence data for 14 coag-ulation genes across 47 genomes in order to analyze the evo-lutionary changes in amino acids that take place between five

FIG 4 Probability of evolutionary selection across the Primates clade for the three proteins FVIII FIX and FXI The probabilities of negative neutral andpositive selection were plotted against the amino acid position in the three proteins The probability values were obtained from CodeML analyses of theseven primate sequences (gray negative selection yellow neutral selection green positive selection)

3048

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

major clades and several minor ones These changes werecompared directly with disease-causing mutation data forthree coagulation proteins FVIII FIX and FXI Good-qualitypathogenic mutational data are publicly available for thesethree proteins and the bleeding disorders caused by thesethree sets of mutational data correspond to the three mostprevalent bleeding disorders The outcome of these detailedanalyses confirms that positive selection occurred during dif-ferent stages in the evolution of several coagulation proteinsIt was also observed when comparing the protein structurestabilities that the pathogenic mutations have a destabilizingeffect on the protein structure as opposed to the stable mu-tations that are tolerated during evolution (fig 6) The com-bination of positive selection analyses with those forpathogenic mutations provides information about predictingdisease causing mutations and clarifies fresh approaches fortherapeutic interventions in bleeding disorders

Selective Pressures in Vertebrate Evolution

Evolutionary changes combine random mutational eventswith natural selection The probability of a mutation to befixed in a population depends on its consequence on thefitness of an organism New adaptive mutations that are ben-eficial in terms of fitness are more likely to be retained bypositive selection than neutral mutation Neutral mutations

that are neither beneficial nor detrimental are randomly re-tained or removed by genetic drift depending on the popu-lation size (neutral selection) Detrimental mutations aremore likely to be removed from the population by negative(or purifying) selection There are no strict patterns and adeleterious mutation can sometimes be beneficial to oneaspect of the organism The textbook example of this issickle-cell anaemia (drepanocytosis) which is a bleeding dis-order caused by the pGlu6Val mutation in the hemoglobin gene Mutations responsible for sickle-cell anaemia are gen-erally removed by natural selection In malaria-infested pop-ulations particularly in tropical or subtropical regions thesemutations are selected for in one allele because they offerpartial protection against the pathogen (Lopez et al 2010)The malaria-infected sickle cells inhibit infection by the ma-laria plasmodium in heterozygous cases (beneficial) but causesickle-cell anaemia in homozygous cases (deleterious)

Genes corresponding to proteins in direct contact with theenvironment are more likely to be under adaptive selectionThus genes involved in host-pathogen interactions are com-monly associated with positive selection Large-scale analysesrevealing positive selection in vertebrates or arthropods havebeen reported for genes involved with immunity (Nielsenet al 2005 Studer et al 2008 Montoya-Burgos 2011 Rouxet al 2014) In the present detailed study positive selection inseveral genes of the coagulation cascade was identified at

FIG 5 Relationship between disease-causing mutations and evolutionary pressures The correlation between evolutionary selection and the number oftimes a disease-causing mutation occurs at each amino acid position is shown In the columns (a) FVIII (b) FIX and (c) FXI correspond to the threecoagulation factors for which pathogenic mutational information is available in sufficient quantity from our three mutational databases The three rowscorrespond to the posterior probability (based on the BEB model of CodeML) of positive selection neutral evolution and negative selection forprimates

3049

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

different stages in evolution (fig 3) An earlier study identifiedpositive selection across six mammalian genomes (humanchimpanzee macaque mouse rat and dog) using LRTsbased on codon substitution models and false discoveryrates to evaluate multiple LRT comparisons (Kosiol et al2008) Although they found strong positive selection(Plt 005 and false discovery ratelt 005) in nine complementgenes only nominal positive selection (Plt 005) was identi-fied in four coagulation genes (F2 F12 TFPI PROC) none ofwhich were identified in figure 3b A limitation of this earlieranalysis may be the low number of six genomes used Notethat complement is related to coagulation in that the latterhas the ability to activate complement of innate immunitythat mediates host-pathogen interactions (Amara et al 2008)

Positive Selection of the Coagulation Proteins throughHost-Pathogen Interactions

As a consequence of the immune response blood coagula-tion is often exploited by pathogens for reason of infectiveand septic processes For example FG and FV are targeted bybacteria thus offering a straightforward explanation of posi-tive selection (fig 3) We discuss this in terms of FG and FIIfollowed by FIII and FVII and then FV and FX

By taking advantage of the increased number of availablegenome sequences positive selection has been identified inseveral coagulation genes during various stages in evolutionIn particular all three subunits FGA FGB and FGG of FG haveshown positive selection at periods during vertebrate evolu-tion according to both the site model and the branch-sitemodels Similarly prothrombin (FII) has shown positive selec-tion but only across all the vertebrate genomes and notacross any smaller clades Although the immune system isunder strong adaptive constraints because it needs to evolveconstantly against pathogens the coagulation system alsocomprises the first line of immune defense against bacteriato form a clot to block bacterial invasion (Sun 2006) Both FGand prothrombin (FII) are targeted by certain bacteria Forexample FG is targeted during the course of infection bycysteine protease a virulence factor of Streptococcus pyogenes(Matsuka et al 1999) and FG binding protein Efb ofStaphylococcus aureus Similarly FG-like proteins exist inAnopheles gambiae and are involved in the function of theimmune system against malaria or bacterial parasites (Dongand Dimopoulos 2009 Lombardo et al 2013) Thrombin in-teracts with staphylocoagulase a protein secreted by thehuman pathogen S aureus and activates prothrombin with-out proteolysis to form thrombin The resulting

FIG 6 The effect of amino acid replacements on the overall protein stability change G Data are shown for (a and b) FVIII A and B chains (c) FIX (d)FXI Three calculations were performed each being based on the sequences for which crystal structures are known (PDB codes 2R7E [FVIII] 2WPH [FIX]and 2F83 [FXI]) The FVIII A and B chains correspond to the N-terminal residues 19ndash760 and the C-terminal residues 1582ndash2351 that are observed in itscrystal structure The calculation represented by the dashed line indicates the distribution calculated for all 19 possible amino acid replacements Thatrepresented by the dotted line indicates the distribution calculated for all the evolutionary occurring replacements that were observed in our data set of47 genomes That represented by the continuous line indicates the distribution calculated for the disease-causing mutations from the FVIII FIX and FXImutation databases

3050

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

prothrombinmdashstaphylocoagulase complex binds to FG tocleave this into self-polymerizing fibrin This process is centralto the molecular pathology of S aureus endocarditis becausestaphylocoagulase bypasses physiological blood coagulationthrough the evasion of the recognition of prothrombin bycirculating thrombin inhibitors (Panizzi et al 2005)

Positive selection for FIII and FVII is attributed to a host-pathogen interaction with the herpes virus family These cellsurface molecules do not necessarily participate directly in animmune response but can still serve as a convenient gatewayfor pathogen entry (Vallender and Lahn 2004) Herpes virusespromote cellular surface infections by coevolving with theirhuman host indicating a mode by which pathogens exploitthe coagulation system (Sutherland et al 2012) A study of thehost-pathogen driven coevolution of herpes virus revealedthat this virus has existed from the time of mammals anddivided into the HSV-1 and HSV-2 types recently during

human evolution (Kolb et al 2013) These observations sug-gest that the pressure for FIII to evolve in early mammalsespecially through primates could be the reason for the pos-itive selection of FIII across the mammalian clade (fig 3) Wepropose that the FIIIndashFVII interaction is a significant drivingforce for positive selection in FVII

The FV and FVIII cofactors are homologous to each otherin their sequence and domain organization hence it is notsurprising to see similar selection pressures acting on them(fig 3c) A difference arises in the entire vertebrate genome(fig 3a) where FV shows positive selection whereas FVIII doesnot This difference between FV and FVIII across vertebratesmay result from the interaction of FV with Escherichia coliThe extracellular serine protease EspP from E coli cleaves FVto lead to its degradation and thus reduces the activity of thecoagulation cascade (Brunder et al 1997) The activated FVacofactor forms a FVaFXa complex with activated FX (fig 1)

FIG 7 The effect of amino acid replacements at the sequence level on the protein stability change G The residue stability changes are depicted ona seven point scale from highly destabilizing (red) to highly stabilizing (blue) for each of (a) FVIII chain A (b) FVIII chain B (c) FIX and (d) FXI Their PDBcodes are indicated in figure 6 In each panel the upper half shows the G values for all the possible 19 amino acid replacements at each residueposition that is part of the crystal structure and the lower half shows the G values for the disease-causing mutations taken from the mutationdatabases

3051

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

which is important for thrombin formation the most impor-tant step for fibrin clot formation Because both FVa and FXaare central to coagulation and are associated with bacterialproteolysis this offers an explanation for the signatures ofpositive selection in both proteins

FVIII FIX and FXIII are also associated with positive selec-tion however it is not clear what selective pressures are in-volved Although these proteins may also be targeted bypathogens no direct interactions have been reported so far

Absence of Positive Selection for FXI and FXIIFXII is unique in that it does not show positive selection eventhough it is involved in host-pathogen interactions with long-chain inorganic phosphatases (polyP) from microorganisms(Morrissey et al 2012) Small-chain polyP are produced in thehuman brain and by blood platelets Hence even during theabsence of pathogen-originated polyP the human host doesnot lack polyP and there is no pressure to evolve for this (Puyet al 2013) A F12 deficiency in patients does not cause bleed-ing disorders This means that the coagulation cascade can beinitiated even in the absence of F12 For similar reasons FXIshows no positive selection We hypothesize that this is be-cause first no interaction with the environment has beendemonstrated to date and second with respect to the polyPmechanism the coagulation cascade is capable of activatingin a FXI-independent manner

Relationship between Disease-Causing Mutations andSelection

The mechanisms of genetic drift mutational change andevolutionary change are closely related to each otherEvolutionary change is not only dependent on the geneticvariability from mutational change but also on genetic driftwhich describes the random fluctuations in the number ofgenetic variants in an organism Genetic drift and mutationsproduce random variations that selection can act upon Thenew changes incurred in one generation may be cancelled inanother generation It may therefore be important to deter-mine whether a new mutation is lost or becomes commonenough for selection to determine its fate (Masel 2011) In thisstudy we have examined the relationship between naturalselection acting on an amino acid and the disease-causingmutations reported at that amino acid position for thethree prominent coagulation factors FVIII FIX and FXI Wehave shown that the disease-causing mutations are morelikely to occur at a higher frequency at sites under negativeselection in all three proteins (fig 5) As sites under strongselective pressure are more likely to be critical to proteinfunction such a mutation can readily affect protein functionand trigger disease Lethal (or deleterious) mutations thatcause death are de facto not observed in our data set Bycontrast disease-causing mutations occur at a reduced fre-quency in sites under neutral or positive selection suggestingthat these sites are more tolerant to mutations than sitesunder negative selection Interestingly we have shown thatFVIII and FIX have undergone much positive selectionwhereas FXI has no such signatures (fig 4) Despite this dif-ference our results for three proteins show no change in the

relationship between evolutionary selection and disease-causing mutations (figs 5 and 6) In summary the reporteddisease-causing mutations show randomness and the muta-bility of a given amino acid position depends on the selectionpressure acting at that position and not on the entire protein

Effect of Disease-Causing Mutations on the Stability ofProtein Structure

Mutations that change the amino acid sequence can havestriking effects on protein stability Point mutations are re-lated to pathological and genetic conditions by reducing orabolishing coagulation protein function (table 1) Proteinfunction is related to its stability which is often measuredby the change in folding energy G of the protein structure(Stefl et al 2013) We have identified the effect of mutationson the stability of FVIII FIX and FXI in terms of G valuesThe stability effects of polymorphic amino acid changes inhuman sequences follow a Gaussian distribution (de Beeret al 2013) In that study disease-causing mutations alsoshowed the same distribution but slightly skewed towarddestabilization In this study the same trend for aminoacids substitutions in vertebrates were observed but this ismuch stronger for disease-causing mutations (fig 6) The dif-ference between the two studies may be explained by thedifferent methods used de Beer et al (2013) used the DiscreteOptimized Protein (DOPE) score of Modeller as a monitor ofglobal stability whereas we used FoldX (Materials andMethods) as an empirical method to infer directly the stabilityeffect G of a point mutation

Evolutionarily observed mutations are the most stable interms of their protein structure whereas disease-causing mu-tations destabilized the protein structure (figs 6 and 7) Eachsite may or may not tolerate a particular amino acid changedepending on the change in stability When each amino acidwas converted to each of the other 19 possible amino acids(fig 7 top panels) the G values showed that most aminoacids destabilize the structure whereas several may stabilize itThe disease-causing effect of a mutation may be affected notonly by the selection pressure acting on it but also by the typeof the amino acid replaced For the FVIII FIX and FXI bleedingdisorders destabilization was observed for most of the dis-ease-causing mutations

Utility of Evolutionary and Disease-CausingMutational Analyses for Clinical Diagnosis

Interactive locus-specific databases are useful tools for pre-senting patient mutational data for patient diagnosis andcare We plan to incorporate the above data sets on selectivepressures and stability effects into our mutational databasesfor FVIII FIX and FXI This information will provide insight forthe prediction of the effect of disease-causing mutations Thelevel of selective pressure was estimated under a maximum-likelihood framework Such methods are useful for geneticstudies but their results are sometimes difficult to interpretby a nonspecialist The presentation of this information needsto be intuitive to the viewer As shown in supplementaryfigure S3 Supplementary Material online the use of multiple

3052

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

sequence alignments and presentation of stability effectsG structural locations and selective pressures can helpin interpreting the impacts of the mutations This type ofinformation will therefore clarify the interpretation of newlydiscovered disease-causing mutations for clinicians evolu-tionary biologists and protein biochemists This combinedevolutionary-driven study of bleeding disorder-causing muta-tions is the most detailed of its kind as far as we are aware andmay be usefully extended to other biological processes andtheir proteins

Materials and Methods

Genomic Sequences

The flowchart of operations is summarized in supplementaryfigure S1 Supplementary Material online First the completemultiple sequence alignments were retrieved together withtheir gene trees for each of the 14 genes corresponding to the11 coagulation factors (table 1) from the Ensembl Comparadatabase (Vilella et al 2009) using their Ensembl IDs (table 1)and a Perl script Ensembl release 73 was used accessed onSeptember 30 2013 (Flicek et al 2013) These amino acidsequences came from Ensembl as they are generally ofgood quality and contained 74 vertebrate genomes in atotal of 80 genomes We then excluded any sequences thatcame from genomes of low coverage (2 or less) This re-sulted in 47 vertebrate genomes with alignment sizes be-tween 23 and 47 genomes These 47 sets of vertebratesequences provided sufficient genomic coverage to performphylogenetic analysis The 47 vertebrate organisms were clas-sified into five major clades (fig 2) namely Primates Glires(rodents rabbits and pika) Laurasiatheria Sauropsida (rep-tiles and birds) and Fishes (ray-finned fishes) together withminor clades including Atlantogenata and Australianmammals

Sequence Retrieval and Multiple Alignments

The multiple sequence alignments and gene trees fromEnsemblCompara included all the coagulation sequences indifferent species irrespective of their genome coverage Themultiple sequence alignments of the downloaded genes fromthe EnsemblCompara database were constructed using aglobal alignment algorithm provided within this databasefor significant number of sequences of arbitrary lengthsThe data set required for this study comprised 14 genesfrom 47 genomes A thorough quality control procedure ofthe multiple sequence alignments was performed in order togenerate minimum false positive results Thus sequences wereremoved if they showed 41 low-complexity regions orunknown nucleotides (indicated as ldquoNrdquo in a sequence)From these the phylogenetic trees were pruned usingNewick Utilities software (Junier and Zdobnov 2010) to iden-tify the tips corresponding to the retrieved species (fig 2) Thefinal alignments of these sequences were made with PRANKwhich is conservative in that it tends to align only amino acidsderived by substitution and introduces insertions and dele-tions (indels) instead of aligning sequence fragments thatappear too divergent (Loytynoja and Goldman 2008)

PRANK is an advanced probabilistic alignment algorithmthat incorporates phylogenetic information This has beenshown to reduce the number of false-positives when evalu-ating positive selection (Fletcher and Yang 2010 Loytynoja2014) All the PRANK alignments were made using defaultparameters for the empirical codon model and the prunedtrees as guides The resulting alignments are composed ofhighly conserved blocks surrounded by noise due to indelsFinally Gblocks (Castresana 2000) was used to remove thenoise in the unreliable regions of the multiple sequence align-ments which may cause false positives when testing for pos-itive selection to keep only the well-aligned parts The defaultGblocks parameters for the codontype model (parameter ndasht = c) were used with the exception of tolerating half gaps perposition (parameter -b5=h) This procedure resulted in cleanand accurate multiple sequence alignments for the final com-putations of the site and branch-site models using CodeMLtogether with their corresponding gene trees

Analyses of Evolutionary Pressure

For each of the 14 coagulation genes within each of the fivemajor clades the selective pressures were estimated using thedifferent codon substitution site models implemented inCodeML of the phylogenetic analysis with maximum-likeli-hood software (PAML release 47) (Yang 2007) By analyzingthe protein-coding genes at the nucleotide level the synon-ymous or silent changes (nucleotide substitutions that do notchange the encoded amino acid) are distinguished from thenonsynonymous or missense changes (substitutions thatchange the encoded amino acid) Because natural selectionoperates at the protein level and can take any direction ofpositive neutral or negative selection the synonymous andnonsynonymous substitutions are subjected to different se-lective pressures which occur at the rates dS and dN respec-tively Hence the ratio of nonsynonymoussynonymoussubstitution rates (= dNdS) measures the selective pressureat the protein level The value of reveals the direction andstrength of natural selection acting on the protein Values oflt 1 = 1 and 4 1 indicate negative selection neutralevolution and positive selection respectively If an existingfunction or phenotype is evolutionarily favorable then nega-tive selection (or purifying selection) favors the conservationof that function or phenotype (conservation of amino acids)The other is positive selection (or Darwinian selection) whichfavors the promotion of a new function or phenotypeNeutral evolution describes the tolerance to amino acid mu-tations which are neither deleterious nor beneficial and arefixed according to genetic drift

The following five models were used to account for thevariation of dNdS among codon sites M1a (two dNdS ratioswith negative selection [0lt 1] and neutral evolution[1 = 1]) M2a (three dNdS ratios with negative selection[0lt 1] neutral evolution [1 = 1] and positive selection[2 1]) M7 (beta) M8 (beta and 24 1) and M8a (betaand 2 = 1) The alignments contain several positions inwhich one of the sequences is missing and the ldquocleandata = 0rdquooption was selected to retain all sites The inference of positive

3053

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

selection was conducted by performing LRTs in which thefollowing two pairs of models were compared namely M1awith M2a and M8a with M8 The principle is to compare anull model that does not allow24 1 (M1a or M8a) with analternative model that does (M2a or M8) Because the alter-nate model has more parameters than the null model theLRT follows a 2 distribution The LRT is a powerful methodfor comparing two hypotheses where one is the special caseof the other The BEB method (Yang et al 2005) was used tocalculate the posterior probabilities for site classes and theBEB value was used to identify sites under positive selectionwhere the 2 distribution of the LRT was significant withPlt 005 As many different LRT comparisons (differentgenes different clades) were conducted a control for thefalse-discovery rate (expected proportion of false positives)was applied using the q value package in R (Storey andTibshirani 2003) We adjusted the parameters of the q valueto perform a bootstrap analysis (pi0meth = ldquobootstraprdquo) andto tolerate 10 of false positives in our significant results(ldquofdrlevel = 01rdquo) A bootstrap analysis is necessary as the Pvalue distribution is bimodal (Studer et al 2008)

Two sources for the branch-site model were used First theSelectome database at httpselectomeunilch (last accessedJanuary 2014) (Proux et al 2009 Moretti et al 2014) providesthe precomputed results of the branch-site model (supple-mentary table S2b Supplementary Material online) based onthe multiple sequence alignments from the EnsemblCompara database (Vilella et al 2009) The last release ofSelectome (release 6) contains an optimized procedure ofalignment refinement and quality control similar to theone used for our site-model analyses as described aboveThe branch-site model allows the dNdS value to vary notonly between sites but also between branches (Zhang et al2005) The advantage of this model is its ability to detecttraces of episodic positive selection (Studer and Robinson-Rechavi 2009a) Second a branch-site model was determinedbased on our data set (see above) We used the vertebratetrees and their multiple sequence alignments as inputs forbranch-site model calculations The multiple alignments weremanually rechecked to remove barren sequences and thecorresponding vertebrate trees were edited accordinglyusing Newick Utilities Each taxonomic unit also known asthe branch or node of the tree was tagged using an auto-mated python script The CodeML program was run on eachof these tagged trees from all the genes for this branch-sitemodel (supplementary table S2b Supplementary Materialonline) For some alignments we found misaligned short in-sertions up to a maximum of ten amino acids in the con-served blocks probably due to gene model prediction errorsIn our whole data set these errors affect 0ndash4 sequences peralignment As the F2 gene family contained most of thesesmall errors albeit at a very low level this was reanalyzed byremoving four sequences with potential problems from thealignment The recalculation of the likelihood ratio deviatedonly slightly from the initial results (P value of 393e5 for 38sequences vs P value of 217e4 for 34 sequences) This sug-gested that these errors would only have a minor impact ifany on the detection of positive selection especially as the

coagulation factor sequences were long and the model ofcodon substitutions was averaged over all columns (manyhundreds) and sequences (between 23 and 47)

Stability Effect of Mutations

When possible for each coagulation factor the molecularstructures of the domain(s) were retrieved directly fromthe CATH database (Sillitoe et al 2013) Prior to the muta-tional analyses each structure was repaired with the FoldXprogram (Schymkowitz et al 2005) to remove potential stericclashes FoldX utilizes an empirical force field model to esti-mate the stability of a protein (Gwildtype in kcalmol) FoldXwas used to estimate the stability effect on each residueposition when the residue was replaced by any of the 19other amino acids The stability effect is determined bythe difference in the wildtype and mutant energies(G = GmutantGwildtype) This resulted in a stabilitylandscape of all potential mutations including the Gvalues both for the disease mutations and for all the aminoacid substitutions observed across vertebrate evolution

Statistical Analyses

All statistical analyses were performed with R v302 (R CoreTeam 2013) and SigmaPlot (Systat Software San Jose CA)The pipeline was developed with Biopython scripts (Cocket al 2009) The visualization of the multiple sequence align-ments was performed using Jalview 28 (Waterhouse et al2009)

Data Availability

The final sequence alignments are provided on a webpagehttpwwwbiochemuclacukpavithraevolutioncoagulationindexphp (last accessed January 2014) Pathogenic mu-tation data for three human coagulation factors FVIII FIX andFXI were retrieved from our web databases at httpwwwfactorviii-dborg httpwwwfactorixorg and httpwwwfactorxiorg (last accessed January 2014) (Saunders et al2009 Rallapalli et al 2013)

Supplementary MaterialSupplementary tables S1ndashS2 and figures S1ndashS3 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

PMR and SJP thank Pfizer UK for a medical educationalgrant that generously funded this project and the SpecialTrustees of the Royal Free Hospital the KatharineDormandy Trust for Hemophilia and Related Disorders andthe European Association for Hemophilia and AssociatedDisorders for additional support The authors also thank DrKeith Gomez Dr Geoffrey Kemball-Cook and ProfessorEdward G Tuddenham for their generous collaborative sup-port of our studies including access to the FVIII mutations atthe Hemophilia A Database at httpwwwfactorviii-dborgRAS acknowledges funding from the Fondation du 450eme

3054

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

anniversaire de lrsquoUniversite de Lausanne and Swiss NationalScience Foundation grants 132476 and 136477

ReferencesAmara U Rittirsch D Flierl M Bruckner U Klos A Gebhard F Lambris

JD Huber-Lang M 2008 Interaction between the coagulation andcomplement system Adv Exp Med Biol 63271ndash79

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19950ndash958

Broze GJ Jr 1995 Tissue factor pathway inhibitor and the revised theoryof coagulation Annu Rev Med 46103ndash112

Brunder W Schmidt H Karch H 1997 EspP a novel extracellular serineprotease of enterohaemorrhagic Escherichia coli O157H7 cleaveshuman coagulation factor V Mol Microbiol 24767ndash778

Butenas S vanrsquot Veer C Cawthern K Brummel KE Mann KG 2000Models of blood coagulation Blood Coagul Fibrinolysis 11(Suppl 1)S9ndashS13

Castresana J 2000 Selection of conserved blocks from multiple align-ments for their use in phylogenetic analysis Mol Biol Evol 17540ndash552

Cock PJ Antao T Chang JT Chapman BA Cox CJ Dalke A Friedberg IHamelryck T Kauff F Wilczynski B et al 2009 Biopython freelyavailable Python tools for computational molecular biology andbioinformatics Bioinformatics 251422ndash1423

de Beer TA Laskowski RA Parks SL Sipos B Goldman N Thornton JM2013 Amino Acid changes in disease-associated variants differ rad-ically from variants observed in the 1000 genomes project datasetPLoS Comput Biol 91ndash15

Dong Y Dimopoulos G 2009 Anopheles fibrinogen-related proteinsprovide expanded pattern recognition capacity against bacteriaand malaria parasites J Biol Chem 2849835ndash9844

Doolittle RF 2009 Step-by-step evolution of vertebrate blood coagula-tion Cold Spring Harb Symp Quant Biol 7435ndash40

Doolittle RF Jiang Y Nand J 2008 Genomic evidence for a simplerclotting scheme in jawless vertebrates J Mol Evol 66185ndash196

Fletcher W Yang Z 2010 The effect of insertions deletions and align-ment errors on the branch-site test of positive selection Mol BiolEvol 272257ndash2267

Flicek P Ahmed I Amode MR Barrell D Beal K Brent S Carvalho-SilvaD Clapham P Coates G Fairley S et al 2013 Ensembl 2013 NucleicAcids Res 41D48ndashD55

Junier T Zdobnov EM 2010 The Newick utilitieshigh-throughput phy-logenetic tree processing in the UNIX shell Bioinformatics 261669ndash1670

Kawthalkar S 2013 Overview of Physiology of blood In essentials ofhaematology New Delhi (India) Jaypee Brothers Medical Publishers(P) Ltd p 1ndash51

Kolb AW Ane C Brandt CR 2013 Using HSV-1 genome phylogeneticsto track past human migrations PLoS One 81ndash9

Kosiol C Vinar T da Fonseca RR Hubisz MJ Bustamante CD Nielsen RSiepel A 2008 Patterns of positive selection in six Mammalian ge-nomes PLoS Genet 4e1000144

Lombardo F Ghani Y Kafatos FC Christophides GK 2013Comprehensive genetic dissection of the hemocyte immune re-sponse in the malaria mosquito Anopheles gambiae PLoS Pathog9e1003145

Lopez C Saravia C Gomez A Hoebeke J Patarroyo MA 2010Mechanisms of genetically-based resistance to malaria Gene 4671ndash12

Loytynoja A 2014 Phylogeny-aware alignment with PRANK MethodsMol Biol 1079155ndash170

Loytynoja A Goldman N 2008 Phylogeny-aware gap placement pre-vents errors in sequence alignment and evolutionary analysisScience 3201632ndash1635

Masel J 2011 Genetic drift Curr Biol 21R837ndashR838Matsuka YV Pillai S Gubba S Musser JM Olmsted SB 1999 Fibrinogen

cleavage by the Streptococcus pyogenes extracellular cysteine

protease and generation of antibodies that inhibit enzyme proteo-lytic activity Infect Immun 674326ndash4333

Meyer A Van de Peer Y 2005 From 2R to 3R evidence for a fish-specificgenome duplication (FSGD) Bioessays 27937ndash945

Montoya-Burgos JI 2011 Patterns of positive selection and neutral evo-lution in the protein-coding genes of Tetraodon and Takifugu PLoSOne 6e24800

Moretti S Laurenczy B Gharib WH Castella B Kuzniar A Schabauer HStuder RA Valle M Salamin N Stockinger H et al 2014 Selectomeupdate quality control and computational improvements to a data-base of positive selection Nucleic Acids Res 42D917ndashD921

Morrissey JH Choi SH Smith SA 2012 Polyphosphate an ancient mol-ecule that links platelets coagulation and inflammation Blood 1195972ndash5979

Nathwani AC Tuddenham EG Rangarajan S Rosales C McIntosh JLinch DC Chowdary P Riddell A Pie AJ Harrington C et al 2011Adenovirus-associated virus vector-mediated gene transfer in hemo-philia B N Engl J Med 3652357ndash2365

Nielsen R Bustamante C Clark AG Glanowski S Sackton TB Hubisz MJFledel-Alon A Tanenbaum DM Civello D White TJ et al 2005 Ascan for positively selected genes in the genomes of humans andchimpanzees PLoS Biol 3e170

Panizzi P Friedrich R Fuentes-Prior P Richter K Bock PE Bode W 2005Fibrinogen substrate recognition by staphylocoagulase(pro)throm-bin complexes J Biol Chem 131179ndash1187

Proux E Studer RA Moretti S Robinson-Rechavi M 2009 Selectome adatabase of positive selection Nucleic Acids Res 37D404ndashD407

Puy C Tucker EI Wong ZC Gailani D Smith SA Choi SH Morrissey JHGruber A McCarty OJ 2013 Factor XII promotes blood coagulationindependent of factor XI in the presence of long-chain polypho-sphates J Thromb Haemost 111341ndash1352

R Core Team 2013 R a language and environment for statistical com-puting Vienna Austria R Foundation for Statistical Computing

Rallapalli PM Kemball-Cook G Tuddenham EG Gomez K Perkins SJ2013 An interactive mutation database for human coagulationfactor IX provides novel insights into the phenotypes and geneticsof hemophilia B J Thromb Haemost 111329ndash1340

Roux J Privman E Moretti S Daub JT Robinson-Rechavi M Keller L2014 Patterns of positive selection in seven ant genomes Mol BiolEvol 311661ndash1685

Saunders RE Shiltagh N Gomez K Mellars G Cooper C Perry DJTuddenham EG Perkins SJ 2009 Structural analysis of eight noveland 112 previously reported missense mutations in the interactiveFXI mutation database reveals new insight on FXI deficiencyThromb Haemost 102287ndash301

Schymkowitz J Borg J Stricher F Nys R Rousseau F Serrano L 2005 TheFoldX web server an online force field Nucleic Acids Res 33W382ndashW388

Sillitoe I Cuff AL Dessailly BH Dawson NL Furnham N Lee D Lees JGLewis TE Studer RA Rentzsch R et al 2013 New functionalfamilies (FunFams) in CATH to improve the mapping of con-served functional sites to 3D structures Nucleic Acids Res 41D490ndashD498

Spronk HM Govers-Riemslag JW ten CH 2003 The blood coagulationsystem as a molecular machine Bioessays 251220ndash1228

Stefl S Nishi H Petukh M Panchenko AR Alexov E 2013 Molecularmechanisms of disease-causing missense mutations J Mol Biol 4253919ndash3936

Storey JD Tibshirani R 2003 Statistical significance for genomewidestudies Proc Natl Acad Sci U S A 1009440ndash9445

Studer RA Penel S Duret L Robinson-Rechavi M 2008 Pervasive pos-itive selection on duplicated and nonduplicated vertebrate proteincoding genes Genome Res 181393ndash1402

Studer RA Robinson-Rechavi M 2009a Evidence for an episodic modelof protein sequence evolution Biochem Soc Trans 37783ndash786

Studer RA Robinson-Rechavi M 2009b Large-scale analyses of positiveselection using codon models In Pontarotti P editor Evolutionarybiology Berlin (Germany) Springer-Verlag Berlin Heidelbergp 217ndash235

3055

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Sun H 2006 The interaction between pathogens and the host coagu-lation system Physiology 21281ndash288

Sutherland MR Ruf W Pryzdial EL 2012 Tissue factor and glycoproteinC on herpes simplex virus type 1 are protease-activated receptor 2cofactors that enhance infection Blood 1193638ndash3645

Thomas CE Ehrhardt A Kay MA 2003 Progress and problemswith the use of viral vectors for gene therapy Nat Rev Genet 4346ndash358

Tokuriki N1 Stricher F Schymkowitz J Serrano L Tawfik DS 2007 Thestability effects of protein mutations appear to be universally dis-tributed J Mol Biol 3691318ndash1332

Vallender EJ Lahn BT 2004 Positive selection on the human genomeHum Mol Genet 13R245ndashR254

Vilella AJ Severin J Ureta-Vidal A Heng L Durbin R Birney E 2009EnsemblCompara GeneTrees complete duplication-aware phyloge-netic trees in vertebrates Genome Res 19327ndash335

Waterhouse AM Procter JB Martin DM Clamp M Barton GJ 2009Jalview Version 2mdasha multiple sequence alignment editor and anal-ysis workbench Bioinformatics 251189ndash1191

Wong WS Yang Z Goldman N Nielsen R 2004 Accuracy and power ofstatistical methods for detecting adaptive evolution in proteincoding sequences and for identifying positively selected sitesGenetics 1681041ndash1051

Yang Z 2006 Computational molecular evolution Oxford UniversityPress Oxford

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 241586ndash1591

Yang Z Wong WS Nielsen R 2005 Bayes empirical Bayes inference ofamino acid sites under positive selection Mol Biol Evol 221107ndash1118

Zhang J Nielsen R Yang Z 2005 Evaluation of an improved branch-sitelikelihood method for detecting positive selection at the molecularlevel Mol Biol Evol 222472ndash2479

3056

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

major clades and several minor ones These changes werecompared directly with disease-causing mutation data forthree coagulation proteins FVIII FIX and FXI Good-qualitypathogenic mutational data are publicly available for thesethree proteins and the bleeding disorders caused by thesethree sets of mutational data correspond to the three mostprevalent bleeding disorders The outcome of these detailedanalyses confirms that positive selection occurred during dif-ferent stages in the evolution of several coagulation proteinsIt was also observed when comparing the protein structurestabilities that the pathogenic mutations have a destabilizingeffect on the protein structure as opposed to the stable mu-tations that are tolerated during evolution (fig 6) The com-bination of positive selection analyses with those forpathogenic mutations provides information about predictingdisease causing mutations and clarifies fresh approaches fortherapeutic interventions in bleeding disorders

Selective Pressures in Vertebrate Evolution

Evolutionary changes combine random mutational eventswith natural selection The probability of a mutation to befixed in a population depends on its consequence on thefitness of an organism New adaptive mutations that are ben-eficial in terms of fitness are more likely to be retained bypositive selection than neutral mutation Neutral mutations

that are neither beneficial nor detrimental are randomly re-tained or removed by genetic drift depending on the popu-lation size (neutral selection) Detrimental mutations aremore likely to be removed from the population by negative(or purifying) selection There are no strict patterns and adeleterious mutation can sometimes be beneficial to oneaspect of the organism The textbook example of this issickle-cell anaemia (drepanocytosis) which is a bleeding dis-order caused by the pGlu6Val mutation in the hemoglobin gene Mutations responsible for sickle-cell anaemia are gen-erally removed by natural selection In malaria-infested pop-ulations particularly in tropical or subtropical regions thesemutations are selected for in one allele because they offerpartial protection against the pathogen (Lopez et al 2010)The malaria-infected sickle cells inhibit infection by the ma-laria plasmodium in heterozygous cases (beneficial) but causesickle-cell anaemia in homozygous cases (deleterious)

Genes corresponding to proteins in direct contact with theenvironment are more likely to be under adaptive selectionThus genes involved in host-pathogen interactions are com-monly associated with positive selection Large-scale analysesrevealing positive selection in vertebrates or arthropods havebeen reported for genes involved with immunity (Nielsenet al 2005 Studer et al 2008 Montoya-Burgos 2011 Rouxet al 2014) In the present detailed study positive selection inseveral genes of the coagulation cascade was identified at

FIG 5 Relationship between disease-causing mutations and evolutionary pressures The correlation between evolutionary selection and the number oftimes a disease-causing mutation occurs at each amino acid position is shown In the columns (a) FVIII (b) FIX and (c) FXI correspond to the threecoagulation factors for which pathogenic mutational information is available in sufficient quantity from our three mutational databases The three rowscorrespond to the posterior probability (based on the BEB model of CodeML) of positive selection neutral evolution and negative selection forprimates

3049

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

different stages in evolution (fig 3) An earlier study identifiedpositive selection across six mammalian genomes (humanchimpanzee macaque mouse rat and dog) using LRTsbased on codon substitution models and false discoveryrates to evaluate multiple LRT comparisons (Kosiol et al2008) Although they found strong positive selection(Plt 005 and false discovery ratelt 005) in nine complementgenes only nominal positive selection (Plt 005) was identi-fied in four coagulation genes (F2 F12 TFPI PROC) none ofwhich were identified in figure 3b A limitation of this earlieranalysis may be the low number of six genomes used Notethat complement is related to coagulation in that the latterhas the ability to activate complement of innate immunitythat mediates host-pathogen interactions (Amara et al 2008)

Positive Selection of the Coagulation Proteins throughHost-Pathogen Interactions

As a consequence of the immune response blood coagula-tion is often exploited by pathogens for reason of infectiveand septic processes For example FG and FV are targeted bybacteria thus offering a straightforward explanation of posi-tive selection (fig 3) We discuss this in terms of FG and FIIfollowed by FIII and FVII and then FV and FX

By taking advantage of the increased number of availablegenome sequences positive selection has been identified inseveral coagulation genes during various stages in evolutionIn particular all three subunits FGA FGB and FGG of FG haveshown positive selection at periods during vertebrate evolu-tion according to both the site model and the branch-sitemodels Similarly prothrombin (FII) has shown positive selec-tion but only across all the vertebrate genomes and notacross any smaller clades Although the immune system isunder strong adaptive constraints because it needs to evolveconstantly against pathogens the coagulation system alsocomprises the first line of immune defense against bacteriato form a clot to block bacterial invasion (Sun 2006) Both FGand prothrombin (FII) are targeted by certain bacteria Forexample FG is targeted during the course of infection bycysteine protease a virulence factor of Streptococcus pyogenes(Matsuka et al 1999) and FG binding protein Efb ofStaphylococcus aureus Similarly FG-like proteins exist inAnopheles gambiae and are involved in the function of theimmune system against malaria or bacterial parasites (Dongand Dimopoulos 2009 Lombardo et al 2013) Thrombin in-teracts with staphylocoagulase a protein secreted by thehuman pathogen S aureus and activates prothrombin with-out proteolysis to form thrombin The resulting

FIG 6 The effect of amino acid replacements on the overall protein stability change G Data are shown for (a and b) FVIII A and B chains (c) FIX (d)FXI Three calculations were performed each being based on the sequences for which crystal structures are known (PDB codes 2R7E [FVIII] 2WPH [FIX]and 2F83 [FXI]) The FVIII A and B chains correspond to the N-terminal residues 19ndash760 and the C-terminal residues 1582ndash2351 that are observed in itscrystal structure The calculation represented by the dashed line indicates the distribution calculated for all 19 possible amino acid replacements Thatrepresented by the dotted line indicates the distribution calculated for all the evolutionary occurring replacements that were observed in our data set of47 genomes That represented by the continuous line indicates the distribution calculated for the disease-causing mutations from the FVIII FIX and FXImutation databases

3050

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

prothrombinmdashstaphylocoagulase complex binds to FG tocleave this into self-polymerizing fibrin This process is centralto the molecular pathology of S aureus endocarditis becausestaphylocoagulase bypasses physiological blood coagulationthrough the evasion of the recognition of prothrombin bycirculating thrombin inhibitors (Panizzi et al 2005)

Positive selection for FIII and FVII is attributed to a host-pathogen interaction with the herpes virus family These cellsurface molecules do not necessarily participate directly in animmune response but can still serve as a convenient gatewayfor pathogen entry (Vallender and Lahn 2004) Herpes virusespromote cellular surface infections by coevolving with theirhuman host indicating a mode by which pathogens exploitthe coagulation system (Sutherland et al 2012) A study of thehost-pathogen driven coevolution of herpes virus revealedthat this virus has existed from the time of mammals anddivided into the HSV-1 and HSV-2 types recently during

human evolution (Kolb et al 2013) These observations sug-gest that the pressure for FIII to evolve in early mammalsespecially through primates could be the reason for the pos-itive selection of FIII across the mammalian clade (fig 3) Wepropose that the FIIIndashFVII interaction is a significant drivingforce for positive selection in FVII

The FV and FVIII cofactors are homologous to each otherin their sequence and domain organization hence it is notsurprising to see similar selection pressures acting on them(fig 3c) A difference arises in the entire vertebrate genome(fig 3a) where FV shows positive selection whereas FVIII doesnot This difference between FV and FVIII across vertebratesmay result from the interaction of FV with Escherichia coliThe extracellular serine protease EspP from E coli cleaves FVto lead to its degradation and thus reduces the activity of thecoagulation cascade (Brunder et al 1997) The activated FVacofactor forms a FVaFXa complex with activated FX (fig 1)

FIG 7 The effect of amino acid replacements at the sequence level on the protein stability change G The residue stability changes are depicted ona seven point scale from highly destabilizing (red) to highly stabilizing (blue) for each of (a) FVIII chain A (b) FVIII chain B (c) FIX and (d) FXI Their PDBcodes are indicated in figure 6 In each panel the upper half shows the G values for all the possible 19 amino acid replacements at each residueposition that is part of the crystal structure and the lower half shows the G values for the disease-causing mutations taken from the mutationdatabases

3051

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

which is important for thrombin formation the most impor-tant step for fibrin clot formation Because both FVa and FXaare central to coagulation and are associated with bacterialproteolysis this offers an explanation for the signatures ofpositive selection in both proteins

FVIII FIX and FXIII are also associated with positive selec-tion however it is not clear what selective pressures are in-volved Although these proteins may also be targeted bypathogens no direct interactions have been reported so far

Absence of Positive Selection for FXI and FXIIFXII is unique in that it does not show positive selection eventhough it is involved in host-pathogen interactions with long-chain inorganic phosphatases (polyP) from microorganisms(Morrissey et al 2012) Small-chain polyP are produced in thehuman brain and by blood platelets Hence even during theabsence of pathogen-originated polyP the human host doesnot lack polyP and there is no pressure to evolve for this (Puyet al 2013) A F12 deficiency in patients does not cause bleed-ing disorders This means that the coagulation cascade can beinitiated even in the absence of F12 For similar reasons FXIshows no positive selection We hypothesize that this is be-cause first no interaction with the environment has beendemonstrated to date and second with respect to the polyPmechanism the coagulation cascade is capable of activatingin a FXI-independent manner

Relationship between Disease-Causing Mutations andSelection

The mechanisms of genetic drift mutational change andevolutionary change are closely related to each otherEvolutionary change is not only dependent on the geneticvariability from mutational change but also on genetic driftwhich describes the random fluctuations in the number ofgenetic variants in an organism Genetic drift and mutationsproduce random variations that selection can act upon Thenew changes incurred in one generation may be cancelled inanother generation It may therefore be important to deter-mine whether a new mutation is lost or becomes commonenough for selection to determine its fate (Masel 2011) In thisstudy we have examined the relationship between naturalselection acting on an amino acid and the disease-causingmutations reported at that amino acid position for thethree prominent coagulation factors FVIII FIX and FXI Wehave shown that the disease-causing mutations are morelikely to occur at a higher frequency at sites under negativeselection in all three proteins (fig 5) As sites under strongselective pressure are more likely to be critical to proteinfunction such a mutation can readily affect protein functionand trigger disease Lethal (or deleterious) mutations thatcause death are de facto not observed in our data set Bycontrast disease-causing mutations occur at a reduced fre-quency in sites under neutral or positive selection suggestingthat these sites are more tolerant to mutations than sitesunder negative selection Interestingly we have shown thatFVIII and FIX have undergone much positive selectionwhereas FXI has no such signatures (fig 4) Despite this dif-ference our results for three proteins show no change in the

relationship between evolutionary selection and disease-causing mutations (figs 5 and 6) In summary the reporteddisease-causing mutations show randomness and the muta-bility of a given amino acid position depends on the selectionpressure acting at that position and not on the entire protein

Effect of Disease-Causing Mutations on the Stability ofProtein Structure

Mutations that change the amino acid sequence can havestriking effects on protein stability Point mutations are re-lated to pathological and genetic conditions by reducing orabolishing coagulation protein function (table 1) Proteinfunction is related to its stability which is often measuredby the change in folding energy G of the protein structure(Stefl et al 2013) We have identified the effect of mutationson the stability of FVIII FIX and FXI in terms of G valuesThe stability effects of polymorphic amino acid changes inhuman sequences follow a Gaussian distribution (de Beeret al 2013) In that study disease-causing mutations alsoshowed the same distribution but slightly skewed towarddestabilization In this study the same trend for aminoacids substitutions in vertebrates were observed but this ismuch stronger for disease-causing mutations (fig 6) The dif-ference between the two studies may be explained by thedifferent methods used de Beer et al (2013) used the DiscreteOptimized Protein (DOPE) score of Modeller as a monitor ofglobal stability whereas we used FoldX (Materials andMethods) as an empirical method to infer directly the stabilityeffect G of a point mutation

Evolutionarily observed mutations are the most stable interms of their protein structure whereas disease-causing mu-tations destabilized the protein structure (figs 6 and 7) Eachsite may or may not tolerate a particular amino acid changedepending on the change in stability When each amino acidwas converted to each of the other 19 possible amino acids(fig 7 top panels) the G values showed that most aminoacids destabilize the structure whereas several may stabilize itThe disease-causing effect of a mutation may be affected notonly by the selection pressure acting on it but also by the typeof the amino acid replaced For the FVIII FIX and FXI bleedingdisorders destabilization was observed for most of the dis-ease-causing mutations

Utility of Evolutionary and Disease-CausingMutational Analyses for Clinical Diagnosis

Interactive locus-specific databases are useful tools for pre-senting patient mutational data for patient diagnosis andcare We plan to incorporate the above data sets on selectivepressures and stability effects into our mutational databasesfor FVIII FIX and FXI This information will provide insight forthe prediction of the effect of disease-causing mutations Thelevel of selective pressure was estimated under a maximum-likelihood framework Such methods are useful for geneticstudies but their results are sometimes difficult to interpretby a nonspecialist The presentation of this information needsto be intuitive to the viewer As shown in supplementaryfigure S3 Supplementary Material online the use of multiple

3052

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

sequence alignments and presentation of stability effectsG structural locations and selective pressures can helpin interpreting the impacts of the mutations This type ofinformation will therefore clarify the interpretation of newlydiscovered disease-causing mutations for clinicians evolu-tionary biologists and protein biochemists This combinedevolutionary-driven study of bleeding disorder-causing muta-tions is the most detailed of its kind as far as we are aware andmay be usefully extended to other biological processes andtheir proteins

Materials and Methods

Genomic Sequences

The flowchart of operations is summarized in supplementaryfigure S1 Supplementary Material online First the completemultiple sequence alignments were retrieved together withtheir gene trees for each of the 14 genes corresponding to the11 coagulation factors (table 1) from the Ensembl Comparadatabase (Vilella et al 2009) using their Ensembl IDs (table 1)and a Perl script Ensembl release 73 was used accessed onSeptember 30 2013 (Flicek et al 2013) These amino acidsequences came from Ensembl as they are generally ofgood quality and contained 74 vertebrate genomes in atotal of 80 genomes We then excluded any sequences thatcame from genomes of low coverage (2 or less) This re-sulted in 47 vertebrate genomes with alignment sizes be-tween 23 and 47 genomes These 47 sets of vertebratesequences provided sufficient genomic coverage to performphylogenetic analysis The 47 vertebrate organisms were clas-sified into five major clades (fig 2) namely Primates Glires(rodents rabbits and pika) Laurasiatheria Sauropsida (rep-tiles and birds) and Fishes (ray-finned fishes) together withminor clades including Atlantogenata and Australianmammals

Sequence Retrieval and Multiple Alignments

The multiple sequence alignments and gene trees fromEnsemblCompara included all the coagulation sequences indifferent species irrespective of their genome coverage Themultiple sequence alignments of the downloaded genes fromthe EnsemblCompara database were constructed using aglobal alignment algorithm provided within this databasefor significant number of sequences of arbitrary lengthsThe data set required for this study comprised 14 genesfrom 47 genomes A thorough quality control procedure ofthe multiple sequence alignments was performed in order togenerate minimum false positive results Thus sequences wereremoved if they showed 41 low-complexity regions orunknown nucleotides (indicated as ldquoNrdquo in a sequence)From these the phylogenetic trees were pruned usingNewick Utilities software (Junier and Zdobnov 2010) to iden-tify the tips corresponding to the retrieved species (fig 2) Thefinal alignments of these sequences were made with PRANKwhich is conservative in that it tends to align only amino acidsderived by substitution and introduces insertions and dele-tions (indels) instead of aligning sequence fragments thatappear too divergent (Loytynoja and Goldman 2008)

PRANK is an advanced probabilistic alignment algorithmthat incorporates phylogenetic information This has beenshown to reduce the number of false-positives when evalu-ating positive selection (Fletcher and Yang 2010 Loytynoja2014) All the PRANK alignments were made using defaultparameters for the empirical codon model and the prunedtrees as guides The resulting alignments are composed ofhighly conserved blocks surrounded by noise due to indelsFinally Gblocks (Castresana 2000) was used to remove thenoise in the unreliable regions of the multiple sequence align-ments which may cause false positives when testing for pos-itive selection to keep only the well-aligned parts The defaultGblocks parameters for the codontype model (parameter ndasht = c) were used with the exception of tolerating half gaps perposition (parameter -b5=h) This procedure resulted in cleanand accurate multiple sequence alignments for the final com-putations of the site and branch-site models using CodeMLtogether with their corresponding gene trees

Analyses of Evolutionary Pressure

For each of the 14 coagulation genes within each of the fivemajor clades the selective pressures were estimated using thedifferent codon substitution site models implemented inCodeML of the phylogenetic analysis with maximum-likeli-hood software (PAML release 47) (Yang 2007) By analyzingthe protein-coding genes at the nucleotide level the synon-ymous or silent changes (nucleotide substitutions that do notchange the encoded amino acid) are distinguished from thenonsynonymous or missense changes (substitutions thatchange the encoded amino acid) Because natural selectionoperates at the protein level and can take any direction ofpositive neutral or negative selection the synonymous andnonsynonymous substitutions are subjected to different se-lective pressures which occur at the rates dS and dN respec-tively Hence the ratio of nonsynonymoussynonymoussubstitution rates (= dNdS) measures the selective pressureat the protein level The value of reveals the direction andstrength of natural selection acting on the protein Values oflt 1 = 1 and 4 1 indicate negative selection neutralevolution and positive selection respectively If an existingfunction or phenotype is evolutionarily favorable then nega-tive selection (or purifying selection) favors the conservationof that function or phenotype (conservation of amino acids)The other is positive selection (or Darwinian selection) whichfavors the promotion of a new function or phenotypeNeutral evolution describes the tolerance to amino acid mu-tations which are neither deleterious nor beneficial and arefixed according to genetic drift

The following five models were used to account for thevariation of dNdS among codon sites M1a (two dNdS ratioswith negative selection [0lt 1] and neutral evolution[1 = 1]) M2a (three dNdS ratios with negative selection[0lt 1] neutral evolution [1 = 1] and positive selection[2 1]) M7 (beta) M8 (beta and 24 1) and M8a (betaand 2 = 1) The alignments contain several positions inwhich one of the sequences is missing and the ldquocleandata = 0rdquooption was selected to retain all sites The inference of positive

3053

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

selection was conducted by performing LRTs in which thefollowing two pairs of models were compared namely M1awith M2a and M8a with M8 The principle is to compare anull model that does not allow24 1 (M1a or M8a) with analternative model that does (M2a or M8) Because the alter-nate model has more parameters than the null model theLRT follows a 2 distribution The LRT is a powerful methodfor comparing two hypotheses where one is the special caseof the other The BEB method (Yang et al 2005) was used tocalculate the posterior probabilities for site classes and theBEB value was used to identify sites under positive selectionwhere the 2 distribution of the LRT was significant withPlt 005 As many different LRT comparisons (differentgenes different clades) were conducted a control for thefalse-discovery rate (expected proportion of false positives)was applied using the q value package in R (Storey andTibshirani 2003) We adjusted the parameters of the q valueto perform a bootstrap analysis (pi0meth = ldquobootstraprdquo) andto tolerate 10 of false positives in our significant results(ldquofdrlevel = 01rdquo) A bootstrap analysis is necessary as the Pvalue distribution is bimodal (Studer et al 2008)

Two sources for the branch-site model were used First theSelectome database at httpselectomeunilch (last accessedJanuary 2014) (Proux et al 2009 Moretti et al 2014) providesthe precomputed results of the branch-site model (supple-mentary table S2b Supplementary Material online) based onthe multiple sequence alignments from the EnsemblCompara database (Vilella et al 2009) The last release ofSelectome (release 6) contains an optimized procedure ofalignment refinement and quality control similar to theone used for our site-model analyses as described aboveThe branch-site model allows the dNdS value to vary notonly between sites but also between branches (Zhang et al2005) The advantage of this model is its ability to detecttraces of episodic positive selection (Studer and Robinson-Rechavi 2009a) Second a branch-site model was determinedbased on our data set (see above) We used the vertebratetrees and their multiple sequence alignments as inputs forbranch-site model calculations The multiple alignments weremanually rechecked to remove barren sequences and thecorresponding vertebrate trees were edited accordinglyusing Newick Utilities Each taxonomic unit also known asthe branch or node of the tree was tagged using an auto-mated python script The CodeML program was run on eachof these tagged trees from all the genes for this branch-sitemodel (supplementary table S2b Supplementary Materialonline) For some alignments we found misaligned short in-sertions up to a maximum of ten amino acids in the con-served blocks probably due to gene model prediction errorsIn our whole data set these errors affect 0ndash4 sequences peralignment As the F2 gene family contained most of thesesmall errors albeit at a very low level this was reanalyzed byremoving four sequences with potential problems from thealignment The recalculation of the likelihood ratio deviatedonly slightly from the initial results (P value of 393e5 for 38sequences vs P value of 217e4 for 34 sequences) This sug-gested that these errors would only have a minor impact ifany on the detection of positive selection especially as the

coagulation factor sequences were long and the model ofcodon substitutions was averaged over all columns (manyhundreds) and sequences (between 23 and 47)

Stability Effect of Mutations

When possible for each coagulation factor the molecularstructures of the domain(s) were retrieved directly fromthe CATH database (Sillitoe et al 2013) Prior to the muta-tional analyses each structure was repaired with the FoldXprogram (Schymkowitz et al 2005) to remove potential stericclashes FoldX utilizes an empirical force field model to esti-mate the stability of a protein (Gwildtype in kcalmol) FoldXwas used to estimate the stability effect on each residueposition when the residue was replaced by any of the 19other amino acids The stability effect is determined bythe difference in the wildtype and mutant energies(G = GmutantGwildtype) This resulted in a stabilitylandscape of all potential mutations including the Gvalues both for the disease mutations and for all the aminoacid substitutions observed across vertebrate evolution

Statistical Analyses

All statistical analyses were performed with R v302 (R CoreTeam 2013) and SigmaPlot (Systat Software San Jose CA)The pipeline was developed with Biopython scripts (Cocket al 2009) The visualization of the multiple sequence align-ments was performed using Jalview 28 (Waterhouse et al2009)

Data Availability

The final sequence alignments are provided on a webpagehttpwwwbiochemuclacukpavithraevolutioncoagulationindexphp (last accessed January 2014) Pathogenic mu-tation data for three human coagulation factors FVIII FIX andFXI were retrieved from our web databases at httpwwwfactorviii-dborg httpwwwfactorixorg and httpwwwfactorxiorg (last accessed January 2014) (Saunders et al2009 Rallapalli et al 2013)

Supplementary MaterialSupplementary tables S1ndashS2 and figures S1ndashS3 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

PMR and SJP thank Pfizer UK for a medical educationalgrant that generously funded this project and the SpecialTrustees of the Royal Free Hospital the KatharineDormandy Trust for Hemophilia and Related Disorders andthe European Association for Hemophilia and AssociatedDisorders for additional support The authors also thank DrKeith Gomez Dr Geoffrey Kemball-Cook and ProfessorEdward G Tuddenham for their generous collaborative sup-port of our studies including access to the FVIII mutations atthe Hemophilia A Database at httpwwwfactorviii-dborgRAS acknowledges funding from the Fondation du 450eme

3054

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

anniversaire de lrsquoUniversite de Lausanne and Swiss NationalScience Foundation grants 132476 and 136477

ReferencesAmara U Rittirsch D Flierl M Bruckner U Klos A Gebhard F Lambris

JD Huber-Lang M 2008 Interaction between the coagulation andcomplement system Adv Exp Med Biol 63271ndash79

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19950ndash958

Broze GJ Jr 1995 Tissue factor pathway inhibitor and the revised theoryof coagulation Annu Rev Med 46103ndash112

Brunder W Schmidt H Karch H 1997 EspP a novel extracellular serineprotease of enterohaemorrhagic Escherichia coli O157H7 cleaveshuman coagulation factor V Mol Microbiol 24767ndash778

Butenas S vanrsquot Veer C Cawthern K Brummel KE Mann KG 2000Models of blood coagulation Blood Coagul Fibrinolysis 11(Suppl 1)S9ndashS13

Castresana J 2000 Selection of conserved blocks from multiple align-ments for their use in phylogenetic analysis Mol Biol Evol 17540ndash552

Cock PJ Antao T Chang JT Chapman BA Cox CJ Dalke A Friedberg IHamelryck T Kauff F Wilczynski B et al 2009 Biopython freelyavailable Python tools for computational molecular biology andbioinformatics Bioinformatics 251422ndash1423

de Beer TA Laskowski RA Parks SL Sipos B Goldman N Thornton JM2013 Amino Acid changes in disease-associated variants differ rad-ically from variants observed in the 1000 genomes project datasetPLoS Comput Biol 91ndash15

Dong Y Dimopoulos G 2009 Anopheles fibrinogen-related proteinsprovide expanded pattern recognition capacity against bacteriaand malaria parasites J Biol Chem 2849835ndash9844

Doolittle RF 2009 Step-by-step evolution of vertebrate blood coagula-tion Cold Spring Harb Symp Quant Biol 7435ndash40

Doolittle RF Jiang Y Nand J 2008 Genomic evidence for a simplerclotting scheme in jawless vertebrates J Mol Evol 66185ndash196

Fletcher W Yang Z 2010 The effect of insertions deletions and align-ment errors on the branch-site test of positive selection Mol BiolEvol 272257ndash2267

Flicek P Ahmed I Amode MR Barrell D Beal K Brent S Carvalho-SilvaD Clapham P Coates G Fairley S et al 2013 Ensembl 2013 NucleicAcids Res 41D48ndashD55

Junier T Zdobnov EM 2010 The Newick utilitieshigh-throughput phy-logenetic tree processing in the UNIX shell Bioinformatics 261669ndash1670

Kawthalkar S 2013 Overview of Physiology of blood In essentials ofhaematology New Delhi (India) Jaypee Brothers Medical Publishers(P) Ltd p 1ndash51

Kolb AW Ane C Brandt CR 2013 Using HSV-1 genome phylogeneticsto track past human migrations PLoS One 81ndash9

Kosiol C Vinar T da Fonseca RR Hubisz MJ Bustamante CD Nielsen RSiepel A 2008 Patterns of positive selection in six Mammalian ge-nomes PLoS Genet 4e1000144

Lombardo F Ghani Y Kafatos FC Christophides GK 2013Comprehensive genetic dissection of the hemocyte immune re-sponse in the malaria mosquito Anopheles gambiae PLoS Pathog9e1003145

Lopez C Saravia C Gomez A Hoebeke J Patarroyo MA 2010Mechanisms of genetically-based resistance to malaria Gene 4671ndash12

Loytynoja A 2014 Phylogeny-aware alignment with PRANK MethodsMol Biol 1079155ndash170

Loytynoja A Goldman N 2008 Phylogeny-aware gap placement pre-vents errors in sequence alignment and evolutionary analysisScience 3201632ndash1635

Masel J 2011 Genetic drift Curr Biol 21R837ndashR838Matsuka YV Pillai S Gubba S Musser JM Olmsted SB 1999 Fibrinogen

cleavage by the Streptococcus pyogenes extracellular cysteine

protease and generation of antibodies that inhibit enzyme proteo-lytic activity Infect Immun 674326ndash4333

Meyer A Van de Peer Y 2005 From 2R to 3R evidence for a fish-specificgenome duplication (FSGD) Bioessays 27937ndash945

Montoya-Burgos JI 2011 Patterns of positive selection and neutral evo-lution in the protein-coding genes of Tetraodon and Takifugu PLoSOne 6e24800

Moretti S Laurenczy B Gharib WH Castella B Kuzniar A Schabauer HStuder RA Valle M Salamin N Stockinger H et al 2014 Selectomeupdate quality control and computational improvements to a data-base of positive selection Nucleic Acids Res 42D917ndashD921

Morrissey JH Choi SH Smith SA 2012 Polyphosphate an ancient mol-ecule that links platelets coagulation and inflammation Blood 1195972ndash5979

Nathwani AC Tuddenham EG Rangarajan S Rosales C McIntosh JLinch DC Chowdary P Riddell A Pie AJ Harrington C et al 2011Adenovirus-associated virus vector-mediated gene transfer in hemo-philia B N Engl J Med 3652357ndash2365

Nielsen R Bustamante C Clark AG Glanowski S Sackton TB Hubisz MJFledel-Alon A Tanenbaum DM Civello D White TJ et al 2005 Ascan for positively selected genes in the genomes of humans andchimpanzees PLoS Biol 3e170

Panizzi P Friedrich R Fuentes-Prior P Richter K Bock PE Bode W 2005Fibrinogen substrate recognition by staphylocoagulase(pro)throm-bin complexes J Biol Chem 131179ndash1187

Proux E Studer RA Moretti S Robinson-Rechavi M 2009 Selectome adatabase of positive selection Nucleic Acids Res 37D404ndashD407

Puy C Tucker EI Wong ZC Gailani D Smith SA Choi SH Morrissey JHGruber A McCarty OJ 2013 Factor XII promotes blood coagulationindependent of factor XI in the presence of long-chain polypho-sphates J Thromb Haemost 111341ndash1352

R Core Team 2013 R a language and environment for statistical com-puting Vienna Austria R Foundation for Statistical Computing

Rallapalli PM Kemball-Cook G Tuddenham EG Gomez K Perkins SJ2013 An interactive mutation database for human coagulationfactor IX provides novel insights into the phenotypes and geneticsof hemophilia B J Thromb Haemost 111329ndash1340

Roux J Privman E Moretti S Daub JT Robinson-Rechavi M Keller L2014 Patterns of positive selection in seven ant genomes Mol BiolEvol 311661ndash1685

Saunders RE Shiltagh N Gomez K Mellars G Cooper C Perry DJTuddenham EG Perkins SJ 2009 Structural analysis of eight noveland 112 previously reported missense mutations in the interactiveFXI mutation database reveals new insight on FXI deficiencyThromb Haemost 102287ndash301

Schymkowitz J Borg J Stricher F Nys R Rousseau F Serrano L 2005 TheFoldX web server an online force field Nucleic Acids Res 33W382ndashW388

Sillitoe I Cuff AL Dessailly BH Dawson NL Furnham N Lee D Lees JGLewis TE Studer RA Rentzsch R et al 2013 New functionalfamilies (FunFams) in CATH to improve the mapping of con-served functional sites to 3D structures Nucleic Acids Res 41D490ndashD498

Spronk HM Govers-Riemslag JW ten CH 2003 The blood coagulationsystem as a molecular machine Bioessays 251220ndash1228

Stefl S Nishi H Petukh M Panchenko AR Alexov E 2013 Molecularmechanisms of disease-causing missense mutations J Mol Biol 4253919ndash3936

Storey JD Tibshirani R 2003 Statistical significance for genomewidestudies Proc Natl Acad Sci U S A 1009440ndash9445

Studer RA Penel S Duret L Robinson-Rechavi M 2008 Pervasive pos-itive selection on duplicated and nonduplicated vertebrate proteincoding genes Genome Res 181393ndash1402

Studer RA Robinson-Rechavi M 2009a Evidence for an episodic modelof protein sequence evolution Biochem Soc Trans 37783ndash786

Studer RA Robinson-Rechavi M 2009b Large-scale analyses of positiveselection using codon models In Pontarotti P editor Evolutionarybiology Berlin (Germany) Springer-Verlag Berlin Heidelbergp 217ndash235

3055

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Sun H 2006 The interaction between pathogens and the host coagu-lation system Physiology 21281ndash288

Sutherland MR Ruf W Pryzdial EL 2012 Tissue factor and glycoproteinC on herpes simplex virus type 1 are protease-activated receptor 2cofactors that enhance infection Blood 1193638ndash3645

Thomas CE Ehrhardt A Kay MA 2003 Progress and problemswith the use of viral vectors for gene therapy Nat Rev Genet 4346ndash358

Tokuriki N1 Stricher F Schymkowitz J Serrano L Tawfik DS 2007 Thestability effects of protein mutations appear to be universally dis-tributed J Mol Biol 3691318ndash1332

Vallender EJ Lahn BT 2004 Positive selection on the human genomeHum Mol Genet 13R245ndashR254

Vilella AJ Severin J Ureta-Vidal A Heng L Durbin R Birney E 2009EnsemblCompara GeneTrees complete duplication-aware phyloge-netic trees in vertebrates Genome Res 19327ndash335

Waterhouse AM Procter JB Martin DM Clamp M Barton GJ 2009Jalview Version 2mdasha multiple sequence alignment editor and anal-ysis workbench Bioinformatics 251189ndash1191

Wong WS Yang Z Goldman N Nielsen R 2004 Accuracy and power ofstatistical methods for detecting adaptive evolution in proteincoding sequences and for identifying positively selected sitesGenetics 1681041ndash1051

Yang Z 2006 Computational molecular evolution Oxford UniversityPress Oxford

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 241586ndash1591

Yang Z Wong WS Nielsen R 2005 Bayes empirical Bayes inference ofamino acid sites under positive selection Mol Biol Evol 221107ndash1118

Zhang J Nielsen R Yang Z 2005 Evaluation of an improved branch-sitelikelihood method for detecting positive selection at the molecularlevel Mol Biol Evol 222472ndash2479

3056

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

different stages in evolution (fig 3) An earlier study identifiedpositive selection across six mammalian genomes (humanchimpanzee macaque mouse rat and dog) using LRTsbased on codon substitution models and false discoveryrates to evaluate multiple LRT comparisons (Kosiol et al2008) Although they found strong positive selection(Plt 005 and false discovery ratelt 005) in nine complementgenes only nominal positive selection (Plt 005) was identi-fied in four coagulation genes (F2 F12 TFPI PROC) none ofwhich were identified in figure 3b A limitation of this earlieranalysis may be the low number of six genomes used Notethat complement is related to coagulation in that the latterhas the ability to activate complement of innate immunitythat mediates host-pathogen interactions (Amara et al 2008)

Positive Selection of the Coagulation Proteins throughHost-Pathogen Interactions

As a consequence of the immune response blood coagula-tion is often exploited by pathogens for reason of infectiveand septic processes For example FG and FV are targeted bybacteria thus offering a straightforward explanation of posi-tive selection (fig 3) We discuss this in terms of FG and FIIfollowed by FIII and FVII and then FV and FX

By taking advantage of the increased number of availablegenome sequences positive selection has been identified inseveral coagulation genes during various stages in evolutionIn particular all three subunits FGA FGB and FGG of FG haveshown positive selection at periods during vertebrate evolu-tion according to both the site model and the branch-sitemodels Similarly prothrombin (FII) has shown positive selec-tion but only across all the vertebrate genomes and notacross any smaller clades Although the immune system isunder strong adaptive constraints because it needs to evolveconstantly against pathogens the coagulation system alsocomprises the first line of immune defense against bacteriato form a clot to block bacterial invasion (Sun 2006) Both FGand prothrombin (FII) are targeted by certain bacteria Forexample FG is targeted during the course of infection bycysteine protease a virulence factor of Streptococcus pyogenes(Matsuka et al 1999) and FG binding protein Efb ofStaphylococcus aureus Similarly FG-like proteins exist inAnopheles gambiae and are involved in the function of theimmune system against malaria or bacterial parasites (Dongand Dimopoulos 2009 Lombardo et al 2013) Thrombin in-teracts with staphylocoagulase a protein secreted by thehuman pathogen S aureus and activates prothrombin with-out proteolysis to form thrombin The resulting

FIG 6 The effect of amino acid replacements on the overall protein stability change G Data are shown for (a and b) FVIII A and B chains (c) FIX (d)FXI Three calculations were performed each being based on the sequences for which crystal structures are known (PDB codes 2R7E [FVIII] 2WPH [FIX]and 2F83 [FXI]) The FVIII A and B chains correspond to the N-terminal residues 19ndash760 and the C-terminal residues 1582ndash2351 that are observed in itscrystal structure The calculation represented by the dashed line indicates the distribution calculated for all 19 possible amino acid replacements Thatrepresented by the dotted line indicates the distribution calculated for all the evolutionary occurring replacements that were observed in our data set of47 genomes That represented by the continuous line indicates the distribution calculated for the disease-causing mutations from the FVIII FIX and FXImutation databases

3050

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

prothrombinmdashstaphylocoagulase complex binds to FG tocleave this into self-polymerizing fibrin This process is centralto the molecular pathology of S aureus endocarditis becausestaphylocoagulase bypasses physiological blood coagulationthrough the evasion of the recognition of prothrombin bycirculating thrombin inhibitors (Panizzi et al 2005)

Positive selection for FIII and FVII is attributed to a host-pathogen interaction with the herpes virus family These cellsurface molecules do not necessarily participate directly in animmune response but can still serve as a convenient gatewayfor pathogen entry (Vallender and Lahn 2004) Herpes virusespromote cellular surface infections by coevolving with theirhuman host indicating a mode by which pathogens exploitthe coagulation system (Sutherland et al 2012) A study of thehost-pathogen driven coevolution of herpes virus revealedthat this virus has existed from the time of mammals anddivided into the HSV-1 and HSV-2 types recently during

human evolution (Kolb et al 2013) These observations sug-gest that the pressure for FIII to evolve in early mammalsespecially through primates could be the reason for the pos-itive selection of FIII across the mammalian clade (fig 3) Wepropose that the FIIIndashFVII interaction is a significant drivingforce for positive selection in FVII

The FV and FVIII cofactors are homologous to each otherin their sequence and domain organization hence it is notsurprising to see similar selection pressures acting on them(fig 3c) A difference arises in the entire vertebrate genome(fig 3a) where FV shows positive selection whereas FVIII doesnot This difference between FV and FVIII across vertebratesmay result from the interaction of FV with Escherichia coliThe extracellular serine protease EspP from E coli cleaves FVto lead to its degradation and thus reduces the activity of thecoagulation cascade (Brunder et al 1997) The activated FVacofactor forms a FVaFXa complex with activated FX (fig 1)

FIG 7 The effect of amino acid replacements at the sequence level on the protein stability change G The residue stability changes are depicted ona seven point scale from highly destabilizing (red) to highly stabilizing (blue) for each of (a) FVIII chain A (b) FVIII chain B (c) FIX and (d) FXI Their PDBcodes are indicated in figure 6 In each panel the upper half shows the G values for all the possible 19 amino acid replacements at each residueposition that is part of the crystal structure and the lower half shows the G values for the disease-causing mutations taken from the mutationdatabases

3051

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

which is important for thrombin formation the most impor-tant step for fibrin clot formation Because both FVa and FXaare central to coagulation and are associated with bacterialproteolysis this offers an explanation for the signatures ofpositive selection in both proteins

FVIII FIX and FXIII are also associated with positive selec-tion however it is not clear what selective pressures are in-volved Although these proteins may also be targeted bypathogens no direct interactions have been reported so far

Absence of Positive Selection for FXI and FXIIFXII is unique in that it does not show positive selection eventhough it is involved in host-pathogen interactions with long-chain inorganic phosphatases (polyP) from microorganisms(Morrissey et al 2012) Small-chain polyP are produced in thehuman brain and by blood platelets Hence even during theabsence of pathogen-originated polyP the human host doesnot lack polyP and there is no pressure to evolve for this (Puyet al 2013) A F12 deficiency in patients does not cause bleed-ing disorders This means that the coagulation cascade can beinitiated even in the absence of F12 For similar reasons FXIshows no positive selection We hypothesize that this is be-cause first no interaction with the environment has beendemonstrated to date and second with respect to the polyPmechanism the coagulation cascade is capable of activatingin a FXI-independent manner

Relationship between Disease-Causing Mutations andSelection

The mechanisms of genetic drift mutational change andevolutionary change are closely related to each otherEvolutionary change is not only dependent on the geneticvariability from mutational change but also on genetic driftwhich describes the random fluctuations in the number ofgenetic variants in an organism Genetic drift and mutationsproduce random variations that selection can act upon Thenew changes incurred in one generation may be cancelled inanother generation It may therefore be important to deter-mine whether a new mutation is lost or becomes commonenough for selection to determine its fate (Masel 2011) In thisstudy we have examined the relationship between naturalselection acting on an amino acid and the disease-causingmutations reported at that amino acid position for thethree prominent coagulation factors FVIII FIX and FXI Wehave shown that the disease-causing mutations are morelikely to occur at a higher frequency at sites under negativeselection in all three proteins (fig 5) As sites under strongselective pressure are more likely to be critical to proteinfunction such a mutation can readily affect protein functionand trigger disease Lethal (or deleterious) mutations thatcause death are de facto not observed in our data set Bycontrast disease-causing mutations occur at a reduced fre-quency in sites under neutral or positive selection suggestingthat these sites are more tolerant to mutations than sitesunder negative selection Interestingly we have shown thatFVIII and FIX have undergone much positive selectionwhereas FXI has no such signatures (fig 4) Despite this dif-ference our results for three proteins show no change in the

relationship between evolutionary selection and disease-causing mutations (figs 5 and 6) In summary the reporteddisease-causing mutations show randomness and the muta-bility of a given amino acid position depends on the selectionpressure acting at that position and not on the entire protein

Effect of Disease-Causing Mutations on the Stability ofProtein Structure

Mutations that change the amino acid sequence can havestriking effects on protein stability Point mutations are re-lated to pathological and genetic conditions by reducing orabolishing coagulation protein function (table 1) Proteinfunction is related to its stability which is often measuredby the change in folding energy G of the protein structure(Stefl et al 2013) We have identified the effect of mutationson the stability of FVIII FIX and FXI in terms of G valuesThe stability effects of polymorphic amino acid changes inhuman sequences follow a Gaussian distribution (de Beeret al 2013) In that study disease-causing mutations alsoshowed the same distribution but slightly skewed towarddestabilization In this study the same trend for aminoacids substitutions in vertebrates were observed but this ismuch stronger for disease-causing mutations (fig 6) The dif-ference between the two studies may be explained by thedifferent methods used de Beer et al (2013) used the DiscreteOptimized Protein (DOPE) score of Modeller as a monitor ofglobal stability whereas we used FoldX (Materials andMethods) as an empirical method to infer directly the stabilityeffect G of a point mutation

Evolutionarily observed mutations are the most stable interms of their protein structure whereas disease-causing mu-tations destabilized the protein structure (figs 6 and 7) Eachsite may or may not tolerate a particular amino acid changedepending on the change in stability When each amino acidwas converted to each of the other 19 possible amino acids(fig 7 top panels) the G values showed that most aminoacids destabilize the structure whereas several may stabilize itThe disease-causing effect of a mutation may be affected notonly by the selection pressure acting on it but also by the typeof the amino acid replaced For the FVIII FIX and FXI bleedingdisorders destabilization was observed for most of the dis-ease-causing mutations

Utility of Evolutionary and Disease-CausingMutational Analyses for Clinical Diagnosis

Interactive locus-specific databases are useful tools for pre-senting patient mutational data for patient diagnosis andcare We plan to incorporate the above data sets on selectivepressures and stability effects into our mutational databasesfor FVIII FIX and FXI This information will provide insight forthe prediction of the effect of disease-causing mutations Thelevel of selective pressure was estimated under a maximum-likelihood framework Such methods are useful for geneticstudies but their results are sometimes difficult to interpretby a nonspecialist The presentation of this information needsto be intuitive to the viewer As shown in supplementaryfigure S3 Supplementary Material online the use of multiple

3052

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

sequence alignments and presentation of stability effectsG structural locations and selective pressures can helpin interpreting the impacts of the mutations This type ofinformation will therefore clarify the interpretation of newlydiscovered disease-causing mutations for clinicians evolu-tionary biologists and protein biochemists This combinedevolutionary-driven study of bleeding disorder-causing muta-tions is the most detailed of its kind as far as we are aware andmay be usefully extended to other biological processes andtheir proteins

Materials and Methods

Genomic Sequences

The flowchart of operations is summarized in supplementaryfigure S1 Supplementary Material online First the completemultiple sequence alignments were retrieved together withtheir gene trees for each of the 14 genes corresponding to the11 coagulation factors (table 1) from the Ensembl Comparadatabase (Vilella et al 2009) using their Ensembl IDs (table 1)and a Perl script Ensembl release 73 was used accessed onSeptember 30 2013 (Flicek et al 2013) These amino acidsequences came from Ensembl as they are generally ofgood quality and contained 74 vertebrate genomes in atotal of 80 genomes We then excluded any sequences thatcame from genomes of low coverage (2 or less) This re-sulted in 47 vertebrate genomes with alignment sizes be-tween 23 and 47 genomes These 47 sets of vertebratesequences provided sufficient genomic coverage to performphylogenetic analysis The 47 vertebrate organisms were clas-sified into five major clades (fig 2) namely Primates Glires(rodents rabbits and pika) Laurasiatheria Sauropsida (rep-tiles and birds) and Fishes (ray-finned fishes) together withminor clades including Atlantogenata and Australianmammals

Sequence Retrieval and Multiple Alignments

The multiple sequence alignments and gene trees fromEnsemblCompara included all the coagulation sequences indifferent species irrespective of their genome coverage Themultiple sequence alignments of the downloaded genes fromthe EnsemblCompara database were constructed using aglobal alignment algorithm provided within this databasefor significant number of sequences of arbitrary lengthsThe data set required for this study comprised 14 genesfrom 47 genomes A thorough quality control procedure ofthe multiple sequence alignments was performed in order togenerate minimum false positive results Thus sequences wereremoved if they showed 41 low-complexity regions orunknown nucleotides (indicated as ldquoNrdquo in a sequence)From these the phylogenetic trees were pruned usingNewick Utilities software (Junier and Zdobnov 2010) to iden-tify the tips corresponding to the retrieved species (fig 2) Thefinal alignments of these sequences were made with PRANKwhich is conservative in that it tends to align only amino acidsderived by substitution and introduces insertions and dele-tions (indels) instead of aligning sequence fragments thatappear too divergent (Loytynoja and Goldman 2008)

PRANK is an advanced probabilistic alignment algorithmthat incorporates phylogenetic information This has beenshown to reduce the number of false-positives when evalu-ating positive selection (Fletcher and Yang 2010 Loytynoja2014) All the PRANK alignments were made using defaultparameters for the empirical codon model and the prunedtrees as guides The resulting alignments are composed ofhighly conserved blocks surrounded by noise due to indelsFinally Gblocks (Castresana 2000) was used to remove thenoise in the unreliable regions of the multiple sequence align-ments which may cause false positives when testing for pos-itive selection to keep only the well-aligned parts The defaultGblocks parameters for the codontype model (parameter ndasht = c) were used with the exception of tolerating half gaps perposition (parameter -b5=h) This procedure resulted in cleanand accurate multiple sequence alignments for the final com-putations of the site and branch-site models using CodeMLtogether with their corresponding gene trees

Analyses of Evolutionary Pressure

For each of the 14 coagulation genes within each of the fivemajor clades the selective pressures were estimated using thedifferent codon substitution site models implemented inCodeML of the phylogenetic analysis with maximum-likeli-hood software (PAML release 47) (Yang 2007) By analyzingthe protein-coding genes at the nucleotide level the synon-ymous or silent changes (nucleotide substitutions that do notchange the encoded amino acid) are distinguished from thenonsynonymous or missense changes (substitutions thatchange the encoded amino acid) Because natural selectionoperates at the protein level and can take any direction ofpositive neutral or negative selection the synonymous andnonsynonymous substitutions are subjected to different se-lective pressures which occur at the rates dS and dN respec-tively Hence the ratio of nonsynonymoussynonymoussubstitution rates (= dNdS) measures the selective pressureat the protein level The value of reveals the direction andstrength of natural selection acting on the protein Values oflt 1 = 1 and 4 1 indicate negative selection neutralevolution and positive selection respectively If an existingfunction or phenotype is evolutionarily favorable then nega-tive selection (or purifying selection) favors the conservationof that function or phenotype (conservation of amino acids)The other is positive selection (or Darwinian selection) whichfavors the promotion of a new function or phenotypeNeutral evolution describes the tolerance to amino acid mu-tations which are neither deleterious nor beneficial and arefixed according to genetic drift

The following five models were used to account for thevariation of dNdS among codon sites M1a (two dNdS ratioswith negative selection [0lt 1] and neutral evolution[1 = 1]) M2a (three dNdS ratios with negative selection[0lt 1] neutral evolution [1 = 1] and positive selection[2 1]) M7 (beta) M8 (beta and 24 1) and M8a (betaand 2 = 1) The alignments contain several positions inwhich one of the sequences is missing and the ldquocleandata = 0rdquooption was selected to retain all sites The inference of positive

3053

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

selection was conducted by performing LRTs in which thefollowing two pairs of models were compared namely M1awith M2a and M8a with M8 The principle is to compare anull model that does not allow24 1 (M1a or M8a) with analternative model that does (M2a or M8) Because the alter-nate model has more parameters than the null model theLRT follows a 2 distribution The LRT is a powerful methodfor comparing two hypotheses where one is the special caseof the other The BEB method (Yang et al 2005) was used tocalculate the posterior probabilities for site classes and theBEB value was used to identify sites under positive selectionwhere the 2 distribution of the LRT was significant withPlt 005 As many different LRT comparisons (differentgenes different clades) were conducted a control for thefalse-discovery rate (expected proportion of false positives)was applied using the q value package in R (Storey andTibshirani 2003) We adjusted the parameters of the q valueto perform a bootstrap analysis (pi0meth = ldquobootstraprdquo) andto tolerate 10 of false positives in our significant results(ldquofdrlevel = 01rdquo) A bootstrap analysis is necessary as the Pvalue distribution is bimodal (Studer et al 2008)

Two sources for the branch-site model were used First theSelectome database at httpselectomeunilch (last accessedJanuary 2014) (Proux et al 2009 Moretti et al 2014) providesthe precomputed results of the branch-site model (supple-mentary table S2b Supplementary Material online) based onthe multiple sequence alignments from the EnsemblCompara database (Vilella et al 2009) The last release ofSelectome (release 6) contains an optimized procedure ofalignment refinement and quality control similar to theone used for our site-model analyses as described aboveThe branch-site model allows the dNdS value to vary notonly between sites but also between branches (Zhang et al2005) The advantage of this model is its ability to detecttraces of episodic positive selection (Studer and Robinson-Rechavi 2009a) Second a branch-site model was determinedbased on our data set (see above) We used the vertebratetrees and their multiple sequence alignments as inputs forbranch-site model calculations The multiple alignments weremanually rechecked to remove barren sequences and thecorresponding vertebrate trees were edited accordinglyusing Newick Utilities Each taxonomic unit also known asthe branch or node of the tree was tagged using an auto-mated python script The CodeML program was run on eachof these tagged trees from all the genes for this branch-sitemodel (supplementary table S2b Supplementary Materialonline) For some alignments we found misaligned short in-sertions up to a maximum of ten amino acids in the con-served blocks probably due to gene model prediction errorsIn our whole data set these errors affect 0ndash4 sequences peralignment As the F2 gene family contained most of thesesmall errors albeit at a very low level this was reanalyzed byremoving four sequences with potential problems from thealignment The recalculation of the likelihood ratio deviatedonly slightly from the initial results (P value of 393e5 for 38sequences vs P value of 217e4 for 34 sequences) This sug-gested that these errors would only have a minor impact ifany on the detection of positive selection especially as the

coagulation factor sequences were long and the model ofcodon substitutions was averaged over all columns (manyhundreds) and sequences (between 23 and 47)

Stability Effect of Mutations

When possible for each coagulation factor the molecularstructures of the domain(s) were retrieved directly fromthe CATH database (Sillitoe et al 2013) Prior to the muta-tional analyses each structure was repaired with the FoldXprogram (Schymkowitz et al 2005) to remove potential stericclashes FoldX utilizes an empirical force field model to esti-mate the stability of a protein (Gwildtype in kcalmol) FoldXwas used to estimate the stability effect on each residueposition when the residue was replaced by any of the 19other amino acids The stability effect is determined bythe difference in the wildtype and mutant energies(G = GmutantGwildtype) This resulted in a stabilitylandscape of all potential mutations including the Gvalues both for the disease mutations and for all the aminoacid substitutions observed across vertebrate evolution

Statistical Analyses

All statistical analyses were performed with R v302 (R CoreTeam 2013) and SigmaPlot (Systat Software San Jose CA)The pipeline was developed with Biopython scripts (Cocket al 2009) The visualization of the multiple sequence align-ments was performed using Jalview 28 (Waterhouse et al2009)

Data Availability

The final sequence alignments are provided on a webpagehttpwwwbiochemuclacukpavithraevolutioncoagulationindexphp (last accessed January 2014) Pathogenic mu-tation data for three human coagulation factors FVIII FIX andFXI were retrieved from our web databases at httpwwwfactorviii-dborg httpwwwfactorixorg and httpwwwfactorxiorg (last accessed January 2014) (Saunders et al2009 Rallapalli et al 2013)

Supplementary MaterialSupplementary tables S1ndashS2 and figures S1ndashS3 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

PMR and SJP thank Pfizer UK for a medical educationalgrant that generously funded this project and the SpecialTrustees of the Royal Free Hospital the KatharineDormandy Trust for Hemophilia and Related Disorders andthe European Association for Hemophilia and AssociatedDisorders for additional support The authors also thank DrKeith Gomez Dr Geoffrey Kemball-Cook and ProfessorEdward G Tuddenham for their generous collaborative sup-port of our studies including access to the FVIII mutations atthe Hemophilia A Database at httpwwwfactorviii-dborgRAS acknowledges funding from the Fondation du 450eme

3054

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

anniversaire de lrsquoUniversite de Lausanne and Swiss NationalScience Foundation grants 132476 and 136477

ReferencesAmara U Rittirsch D Flierl M Bruckner U Klos A Gebhard F Lambris

JD Huber-Lang M 2008 Interaction between the coagulation andcomplement system Adv Exp Med Biol 63271ndash79

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19950ndash958

Broze GJ Jr 1995 Tissue factor pathway inhibitor and the revised theoryof coagulation Annu Rev Med 46103ndash112

Brunder W Schmidt H Karch H 1997 EspP a novel extracellular serineprotease of enterohaemorrhagic Escherichia coli O157H7 cleaveshuman coagulation factor V Mol Microbiol 24767ndash778

Butenas S vanrsquot Veer C Cawthern K Brummel KE Mann KG 2000Models of blood coagulation Blood Coagul Fibrinolysis 11(Suppl 1)S9ndashS13

Castresana J 2000 Selection of conserved blocks from multiple align-ments for their use in phylogenetic analysis Mol Biol Evol 17540ndash552

Cock PJ Antao T Chang JT Chapman BA Cox CJ Dalke A Friedberg IHamelryck T Kauff F Wilczynski B et al 2009 Biopython freelyavailable Python tools for computational molecular biology andbioinformatics Bioinformatics 251422ndash1423

de Beer TA Laskowski RA Parks SL Sipos B Goldman N Thornton JM2013 Amino Acid changes in disease-associated variants differ rad-ically from variants observed in the 1000 genomes project datasetPLoS Comput Biol 91ndash15

Dong Y Dimopoulos G 2009 Anopheles fibrinogen-related proteinsprovide expanded pattern recognition capacity against bacteriaand malaria parasites J Biol Chem 2849835ndash9844

Doolittle RF 2009 Step-by-step evolution of vertebrate blood coagula-tion Cold Spring Harb Symp Quant Biol 7435ndash40

Doolittle RF Jiang Y Nand J 2008 Genomic evidence for a simplerclotting scheme in jawless vertebrates J Mol Evol 66185ndash196

Fletcher W Yang Z 2010 The effect of insertions deletions and align-ment errors on the branch-site test of positive selection Mol BiolEvol 272257ndash2267

Flicek P Ahmed I Amode MR Barrell D Beal K Brent S Carvalho-SilvaD Clapham P Coates G Fairley S et al 2013 Ensembl 2013 NucleicAcids Res 41D48ndashD55

Junier T Zdobnov EM 2010 The Newick utilitieshigh-throughput phy-logenetic tree processing in the UNIX shell Bioinformatics 261669ndash1670

Kawthalkar S 2013 Overview of Physiology of blood In essentials ofhaematology New Delhi (India) Jaypee Brothers Medical Publishers(P) Ltd p 1ndash51

Kolb AW Ane C Brandt CR 2013 Using HSV-1 genome phylogeneticsto track past human migrations PLoS One 81ndash9

Kosiol C Vinar T da Fonseca RR Hubisz MJ Bustamante CD Nielsen RSiepel A 2008 Patterns of positive selection in six Mammalian ge-nomes PLoS Genet 4e1000144

Lombardo F Ghani Y Kafatos FC Christophides GK 2013Comprehensive genetic dissection of the hemocyte immune re-sponse in the malaria mosquito Anopheles gambiae PLoS Pathog9e1003145

Lopez C Saravia C Gomez A Hoebeke J Patarroyo MA 2010Mechanisms of genetically-based resistance to malaria Gene 4671ndash12

Loytynoja A 2014 Phylogeny-aware alignment with PRANK MethodsMol Biol 1079155ndash170

Loytynoja A Goldman N 2008 Phylogeny-aware gap placement pre-vents errors in sequence alignment and evolutionary analysisScience 3201632ndash1635

Masel J 2011 Genetic drift Curr Biol 21R837ndashR838Matsuka YV Pillai S Gubba S Musser JM Olmsted SB 1999 Fibrinogen

cleavage by the Streptococcus pyogenes extracellular cysteine

protease and generation of antibodies that inhibit enzyme proteo-lytic activity Infect Immun 674326ndash4333

Meyer A Van de Peer Y 2005 From 2R to 3R evidence for a fish-specificgenome duplication (FSGD) Bioessays 27937ndash945

Montoya-Burgos JI 2011 Patterns of positive selection and neutral evo-lution in the protein-coding genes of Tetraodon and Takifugu PLoSOne 6e24800

Moretti S Laurenczy B Gharib WH Castella B Kuzniar A Schabauer HStuder RA Valle M Salamin N Stockinger H et al 2014 Selectomeupdate quality control and computational improvements to a data-base of positive selection Nucleic Acids Res 42D917ndashD921

Morrissey JH Choi SH Smith SA 2012 Polyphosphate an ancient mol-ecule that links platelets coagulation and inflammation Blood 1195972ndash5979

Nathwani AC Tuddenham EG Rangarajan S Rosales C McIntosh JLinch DC Chowdary P Riddell A Pie AJ Harrington C et al 2011Adenovirus-associated virus vector-mediated gene transfer in hemo-philia B N Engl J Med 3652357ndash2365

Nielsen R Bustamante C Clark AG Glanowski S Sackton TB Hubisz MJFledel-Alon A Tanenbaum DM Civello D White TJ et al 2005 Ascan for positively selected genes in the genomes of humans andchimpanzees PLoS Biol 3e170

Panizzi P Friedrich R Fuentes-Prior P Richter K Bock PE Bode W 2005Fibrinogen substrate recognition by staphylocoagulase(pro)throm-bin complexes J Biol Chem 131179ndash1187

Proux E Studer RA Moretti S Robinson-Rechavi M 2009 Selectome adatabase of positive selection Nucleic Acids Res 37D404ndashD407

Puy C Tucker EI Wong ZC Gailani D Smith SA Choi SH Morrissey JHGruber A McCarty OJ 2013 Factor XII promotes blood coagulationindependent of factor XI in the presence of long-chain polypho-sphates J Thromb Haemost 111341ndash1352

R Core Team 2013 R a language and environment for statistical com-puting Vienna Austria R Foundation for Statistical Computing

Rallapalli PM Kemball-Cook G Tuddenham EG Gomez K Perkins SJ2013 An interactive mutation database for human coagulationfactor IX provides novel insights into the phenotypes and geneticsof hemophilia B J Thromb Haemost 111329ndash1340

Roux J Privman E Moretti S Daub JT Robinson-Rechavi M Keller L2014 Patterns of positive selection in seven ant genomes Mol BiolEvol 311661ndash1685

Saunders RE Shiltagh N Gomez K Mellars G Cooper C Perry DJTuddenham EG Perkins SJ 2009 Structural analysis of eight noveland 112 previously reported missense mutations in the interactiveFXI mutation database reveals new insight on FXI deficiencyThromb Haemost 102287ndash301

Schymkowitz J Borg J Stricher F Nys R Rousseau F Serrano L 2005 TheFoldX web server an online force field Nucleic Acids Res 33W382ndashW388

Sillitoe I Cuff AL Dessailly BH Dawson NL Furnham N Lee D Lees JGLewis TE Studer RA Rentzsch R et al 2013 New functionalfamilies (FunFams) in CATH to improve the mapping of con-served functional sites to 3D structures Nucleic Acids Res 41D490ndashD498

Spronk HM Govers-Riemslag JW ten CH 2003 The blood coagulationsystem as a molecular machine Bioessays 251220ndash1228

Stefl S Nishi H Petukh M Panchenko AR Alexov E 2013 Molecularmechanisms of disease-causing missense mutations J Mol Biol 4253919ndash3936

Storey JD Tibshirani R 2003 Statistical significance for genomewidestudies Proc Natl Acad Sci U S A 1009440ndash9445

Studer RA Penel S Duret L Robinson-Rechavi M 2008 Pervasive pos-itive selection on duplicated and nonduplicated vertebrate proteincoding genes Genome Res 181393ndash1402

Studer RA Robinson-Rechavi M 2009a Evidence for an episodic modelof protein sequence evolution Biochem Soc Trans 37783ndash786

Studer RA Robinson-Rechavi M 2009b Large-scale analyses of positiveselection using codon models In Pontarotti P editor Evolutionarybiology Berlin (Germany) Springer-Verlag Berlin Heidelbergp 217ndash235

3055

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Sun H 2006 The interaction between pathogens and the host coagu-lation system Physiology 21281ndash288

Sutherland MR Ruf W Pryzdial EL 2012 Tissue factor and glycoproteinC on herpes simplex virus type 1 are protease-activated receptor 2cofactors that enhance infection Blood 1193638ndash3645

Thomas CE Ehrhardt A Kay MA 2003 Progress and problemswith the use of viral vectors for gene therapy Nat Rev Genet 4346ndash358

Tokuriki N1 Stricher F Schymkowitz J Serrano L Tawfik DS 2007 Thestability effects of protein mutations appear to be universally dis-tributed J Mol Biol 3691318ndash1332

Vallender EJ Lahn BT 2004 Positive selection on the human genomeHum Mol Genet 13R245ndashR254

Vilella AJ Severin J Ureta-Vidal A Heng L Durbin R Birney E 2009EnsemblCompara GeneTrees complete duplication-aware phyloge-netic trees in vertebrates Genome Res 19327ndash335

Waterhouse AM Procter JB Martin DM Clamp M Barton GJ 2009Jalview Version 2mdasha multiple sequence alignment editor and anal-ysis workbench Bioinformatics 251189ndash1191

Wong WS Yang Z Goldman N Nielsen R 2004 Accuracy and power ofstatistical methods for detecting adaptive evolution in proteincoding sequences and for identifying positively selected sitesGenetics 1681041ndash1051

Yang Z 2006 Computational molecular evolution Oxford UniversityPress Oxford

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 241586ndash1591

Yang Z Wong WS Nielsen R 2005 Bayes empirical Bayes inference ofamino acid sites under positive selection Mol Biol Evol 221107ndash1118

Zhang J Nielsen R Yang Z 2005 Evaluation of an improved branch-sitelikelihood method for detecting positive selection at the molecularlevel Mol Biol Evol 222472ndash2479

3056

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

prothrombinmdashstaphylocoagulase complex binds to FG tocleave this into self-polymerizing fibrin This process is centralto the molecular pathology of S aureus endocarditis becausestaphylocoagulase bypasses physiological blood coagulationthrough the evasion of the recognition of prothrombin bycirculating thrombin inhibitors (Panizzi et al 2005)

Positive selection for FIII and FVII is attributed to a host-pathogen interaction with the herpes virus family These cellsurface molecules do not necessarily participate directly in animmune response but can still serve as a convenient gatewayfor pathogen entry (Vallender and Lahn 2004) Herpes virusespromote cellular surface infections by coevolving with theirhuman host indicating a mode by which pathogens exploitthe coagulation system (Sutherland et al 2012) A study of thehost-pathogen driven coevolution of herpes virus revealedthat this virus has existed from the time of mammals anddivided into the HSV-1 and HSV-2 types recently during

human evolution (Kolb et al 2013) These observations sug-gest that the pressure for FIII to evolve in early mammalsespecially through primates could be the reason for the pos-itive selection of FIII across the mammalian clade (fig 3) Wepropose that the FIIIndashFVII interaction is a significant drivingforce for positive selection in FVII

The FV and FVIII cofactors are homologous to each otherin their sequence and domain organization hence it is notsurprising to see similar selection pressures acting on them(fig 3c) A difference arises in the entire vertebrate genome(fig 3a) where FV shows positive selection whereas FVIII doesnot This difference between FV and FVIII across vertebratesmay result from the interaction of FV with Escherichia coliThe extracellular serine protease EspP from E coli cleaves FVto lead to its degradation and thus reduces the activity of thecoagulation cascade (Brunder et al 1997) The activated FVacofactor forms a FVaFXa complex with activated FX (fig 1)

FIG 7 The effect of amino acid replacements at the sequence level on the protein stability change G The residue stability changes are depicted ona seven point scale from highly destabilizing (red) to highly stabilizing (blue) for each of (a) FVIII chain A (b) FVIII chain B (c) FIX and (d) FXI Their PDBcodes are indicated in figure 6 In each panel the upper half shows the G values for all the possible 19 amino acid replacements at each residueposition that is part of the crystal structure and the lower half shows the G values for the disease-causing mutations taken from the mutationdatabases

3051

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

which is important for thrombin formation the most impor-tant step for fibrin clot formation Because both FVa and FXaare central to coagulation and are associated with bacterialproteolysis this offers an explanation for the signatures ofpositive selection in both proteins

FVIII FIX and FXIII are also associated with positive selec-tion however it is not clear what selective pressures are in-volved Although these proteins may also be targeted bypathogens no direct interactions have been reported so far

Absence of Positive Selection for FXI and FXIIFXII is unique in that it does not show positive selection eventhough it is involved in host-pathogen interactions with long-chain inorganic phosphatases (polyP) from microorganisms(Morrissey et al 2012) Small-chain polyP are produced in thehuman brain and by blood platelets Hence even during theabsence of pathogen-originated polyP the human host doesnot lack polyP and there is no pressure to evolve for this (Puyet al 2013) A F12 deficiency in patients does not cause bleed-ing disorders This means that the coagulation cascade can beinitiated even in the absence of F12 For similar reasons FXIshows no positive selection We hypothesize that this is be-cause first no interaction with the environment has beendemonstrated to date and second with respect to the polyPmechanism the coagulation cascade is capable of activatingin a FXI-independent manner

Relationship between Disease-Causing Mutations andSelection

The mechanisms of genetic drift mutational change andevolutionary change are closely related to each otherEvolutionary change is not only dependent on the geneticvariability from mutational change but also on genetic driftwhich describes the random fluctuations in the number ofgenetic variants in an organism Genetic drift and mutationsproduce random variations that selection can act upon Thenew changes incurred in one generation may be cancelled inanother generation It may therefore be important to deter-mine whether a new mutation is lost or becomes commonenough for selection to determine its fate (Masel 2011) In thisstudy we have examined the relationship between naturalselection acting on an amino acid and the disease-causingmutations reported at that amino acid position for thethree prominent coagulation factors FVIII FIX and FXI Wehave shown that the disease-causing mutations are morelikely to occur at a higher frequency at sites under negativeselection in all three proteins (fig 5) As sites under strongselective pressure are more likely to be critical to proteinfunction such a mutation can readily affect protein functionand trigger disease Lethal (or deleterious) mutations thatcause death are de facto not observed in our data set Bycontrast disease-causing mutations occur at a reduced fre-quency in sites under neutral or positive selection suggestingthat these sites are more tolerant to mutations than sitesunder negative selection Interestingly we have shown thatFVIII and FIX have undergone much positive selectionwhereas FXI has no such signatures (fig 4) Despite this dif-ference our results for three proteins show no change in the

relationship between evolutionary selection and disease-causing mutations (figs 5 and 6) In summary the reporteddisease-causing mutations show randomness and the muta-bility of a given amino acid position depends on the selectionpressure acting at that position and not on the entire protein

Effect of Disease-Causing Mutations on the Stability ofProtein Structure

Mutations that change the amino acid sequence can havestriking effects on protein stability Point mutations are re-lated to pathological and genetic conditions by reducing orabolishing coagulation protein function (table 1) Proteinfunction is related to its stability which is often measuredby the change in folding energy G of the protein structure(Stefl et al 2013) We have identified the effect of mutationson the stability of FVIII FIX and FXI in terms of G valuesThe stability effects of polymorphic amino acid changes inhuman sequences follow a Gaussian distribution (de Beeret al 2013) In that study disease-causing mutations alsoshowed the same distribution but slightly skewed towarddestabilization In this study the same trend for aminoacids substitutions in vertebrates were observed but this ismuch stronger for disease-causing mutations (fig 6) The dif-ference between the two studies may be explained by thedifferent methods used de Beer et al (2013) used the DiscreteOptimized Protein (DOPE) score of Modeller as a monitor ofglobal stability whereas we used FoldX (Materials andMethods) as an empirical method to infer directly the stabilityeffect G of a point mutation

Evolutionarily observed mutations are the most stable interms of their protein structure whereas disease-causing mu-tations destabilized the protein structure (figs 6 and 7) Eachsite may or may not tolerate a particular amino acid changedepending on the change in stability When each amino acidwas converted to each of the other 19 possible amino acids(fig 7 top panels) the G values showed that most aminoacids destabilize the structure whereas several may stabilize itThe disease-causing effect of a mutation may be affected notonly by the selection pressure acting on it but also by the typeof the amino acid replaced For the FVIII FIX and FXI bleedingdisorders destabilization was observed for most of the dis-ease-causing mutations

Utility of Evolutionary and Disease-CausingMutational Analyses for Clinical Diagnosis

Interactive locus-specific databases are useful tools for pre-senting patient mutational data for patient diagnosis andcare We plan to incorporate the above data sets on selectivepressures and stability effects into our mutational databasesfor FVIII FIX and FXI This information will provide insight forthe prediction of the effect of disease-causing mutations Thelevel of selective pressure was estimated under a maximum-likelihood framework Such methods are useful for geneticstudies but their results are sometimes difficult to interpretby a nonspecialist The presentation of this information needsto be intuitive to the viewer As shown in supplementaryfigure S3 Supplementary Material online the use of multiple

3052

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

sequence alignments and presentation of stability effectsG structural locations and selective pressures can helpin interpreting the impacts of the mutations This type ofinformation will therefore clarify the interpretation of newlydiscovered disease-causing mutations for clinicians evolu-tionary biologists and protein biochemists This combinedevolutionary-driven study of bleeding disorder-causing muta-tions is the most detailed of its kind as far as we are aware andmay be usefully extended to other biological processes andtheir proteins

Materials and Methods

Genomic Sequences

The flowchart of operations is summarized in supplementaryfigure S1 Supplementary Material online First the completemultiple sequence alignments were retrieved together withtheir gene trees for each of the 14 genes corresponding to the11 coagulation factors (table 1) from the Ensembl Comparadatabase (Vilella et al 2009) using their Ensembl IDs (table 1)and a Perl script Ensembl release 73 was used accessed onSeptember 30 2013 (Flicek et al 2013) These amino acidsequences came from Ensembl as they are generally ofgood quality and contained 74 vertebrate genomes in atotal of 80 genomes We then excluded any sequences thatcame from genomes of low coverage (2 or less) This re-sulted in 47 vertebrate genomes with alignment sizes be-tween 23 and 47 genomes These 47 sets of vertebratesequences provided sufficient genomic coverage to performphylogenetic analysis The 47 vertebrate organisms were clas-sified into five major clades (fig 2) namely Primates Glires(rodents rabbits and pika) Laurasiatheria Sauropsida (rep-tiles and birds) and Fishes (ray-finned fishes) together withminor clades including Atlantogenata and Australianmammals

Sequence Retrieval and Multiple Alignments

The multiple sequence alignments and gene trees fromEnsemblCompara included all the coagulation sequences indifferent species irrespective of their genome coverage Themultiple sequence alignments of the downloaded genes fromthe EnsemblCompara database were constructed using aglobal alignment algorithm provided within this databasefor significant number of sequences of arbitrary lengthsThe data set required for this study comprised 14 genesfrom 47 genomes A thorough quality control procedure ofthe multiple sequence alignments was performed in order togenerate minimum false positive results Thus sequences wereremoved if they showed 41 low-complexity regions orunknown nucleotides (indicated as ldquoNrdquo in a sequence)From these the phylogenetic trees were pruned usingNewick Utilities software (Junier and Zdobnov 2010) to iden-tify the tips corresponding to the retrieved species (fig 2) Thefinal alignments of these sequences were made with PRANKwhich is conservative in that it tends to align only amino acidsderived by substitution and introduces insertions and dele-tions (indels) instead of aligning sequence fragments thatappear too divergent (Loytynoja and Goldman 2008)

PRANK is an advanced probabilistic alignment algorithmthat incorporates phylogenetic information This has beenshown to reduce the number of false-positives when evalu-ating positive selection (Fletcher and Yang 2010 Loytynoja2014) All the PRANK alignments were made using defaultparameters for the empirical codon model and the prunedtrees as guides The resulting alignments are composed ofhighly conserved blocks surrounded by noise due to indelsFinally Gblocks (Castresana 2000) was used to remove thenoise in the unreliable regions of the multiple sequence align-ments which may cause false positives when testing for pos-itive selection to keep only the well-aligned parts The defaultGblocks parameters for the codontype model (parameter ndasht = c) were used with the exception of tolerating half gaps perposition (parameter -b5=h) This procedure resulted in cleanand accurate multiple sequence alignments for the final com-putations of the site and branch-site models using CodeMLtogether with their corresponding gene trees

Analyses of Evolutionary Pressure

For each of the 14 coagulation genes within each of the fivemajor clades the selective pressures were estimated using thedifferent codon substitution site models implemented inCodeML of the phylogenetic analysis with maximum-likeli-hood software (PAML release 47) (Yang 2007) By analyzingthe protein-coding genes at the nucleotide level the synon-ymous or silent changes (nucleotide substitutions that do notchange the encoded amino acid) are distinguished from thenonsynonymous or missense changes (substitutions thatchange the encoded amino acid) Because natural selectionoperates at the protein level and can take any direction ofpositive neutral or negative selection the synonymous andnonsynonymous substitutions are subjected to different se-lective pressures which occur at the rates dS and dN respec-tively Hence the ratio of nonsynonymoussynonymoussubstitution rates (= dNdS) measures the selective pressureat the protein level The value of reveals the direction andstrength of natural selection acting on the protein Values oflt 1 = 1 and 4 1 indicate negative selection neutralevolution and positive selection respectively If an existingfunction or phenotype is evolutionarily favorable then nega-tive selection (or purifying selection) favors the conservationof that function or phenotype (conservation of amino acids)The other is positive selection (or Darwinian selection) whichfavors the promotion of a new function or phenotypeNeutral evolution describes the tolerance to amino acid mu-tations which are neither deleterious nor beneficial and arefixed according to genetic drift

The following five models were used to account for thevariation of dNdS among codon sites M1a (two dNdS ratioswith negative selection [0lt 1] and neutral evolution[1 = 1]) M2a (three dNdS ratios with negative selection[0lt 1] neutral evolution [1 = 1] and positive selection[2 1]) M7 (beta) M8 (beta and 24 1) and M8a (betaand 2 = 1) The alignments contain several positions inwhich one of the sequences is missing and the ldquocleandata = 0rdquooption was selected to retain all sites The inference of positive

3053

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

selection was conducted by performing LRTs in which thefollowing two pairs of models were compared namely M1awith M2a and M8a with M8 The principle is to compare anull model that does not allow24 1 (M1a or M8a) with analternative model that does (M2a or M8) Because the alter-nate model has more parameters than the null model theLRT follows a 2 distribution The LRT is a powerful methodfor comparing two hypotheses where one is the special caseof the other The BEB method (Yang et al 2005) was used tocalculate the posterior probabilities for site classes and theBEB value was used to identify sites under positive selectionwhere the 2 distribution of the LRT was significant withPlt 005 As many different LRT comparisons (differentgenes different clades) were conducted a control for thefalse-discovery rate (expected proportion of false positives)was applied using the q value package in R (Storey andTibshirani 2003) We adjusted the parameters of the q valueto perform a bootstrap analysis (pi0meth = ldquobootstraprdquo) andto tolerate 10 of false positives in our significant results(ldquofdrlevel = 01rdquo) A bootstrap analysis is necessary as the Pvalue distribution is bimodal (Studer et al 2008)

Two sources for the branch-site model were used First theSelectome database at httpselectomeunilch (last accessedJanuary 2014) (Proux et al 2009 Moretti et al 2014) providesthe precomputed results of the branch-site model (supple-mentary table S2b Supplementary Material online) based onthe multiple sequence alignments from the EnsemblCompara database (Vilella et al 2009) The last release ofSelectome (release 6) contains an optimized procedure ofalignment refinement and quality control similar to theone used for our site-model analyses as described aboveThe branch-site model allows the dNdS value to vary notonly between sites but also between branches (Zhang et al2005) The advantage of this model is its ability to detecttraces of episodic positive selection (Studer and Robinson-Rechavi 2009a) Second a branch-site model was determinedbased on our data set (see above) We used the vertebratetrees and their multiple sequence alignments as inputs forbranch-site model calculations The multiple alignments weremanually rechecked to remove barren sequences and thecorresponding vertebrate trees were edited accordinglyusing Newick Utilities Each taxonomic unit also known asthe branch or node of the tree was tagged using an auto-mated python script The CodeML program was run on eachof these tagged trees from all the genes for this branch-sitemodel (supplementary table S2b Supplementary Materialonline) For some alignments we found misaligned short in-sertions up to a maximum of ten amino acids in the con-served blocks probably due to gene model prediction errorsIn our whole data set these errors affect 0ndash4 sequences peralignment As the F2 gene family contained most of thesesmall errors albeit at a very low level this was reanalyzed byremoving four sequences with potential problems from thealignment The recalculation of the likelihood ratio deviatedonly slightly from the initial results (P value of 393e5 for 38sequences vs P value of 217e4 for 34 sequences) This sug-gested that these errors would only have a minor impact ifany on the detection of positive selection especially as the

coagulation factor sequences were long and the model ofcodon substitutions was averaged over all columns (manyhundreds) and sequences (between 23 and 47)

Stability Effect of Mutations

When possible for each coagulation factor the molecularstructures of the domain(s) were retrieved directly fromthe CATH database (Sillitoe et al 2013) Prior to the muta-tional analyses each structure was repaired with the FoldXprogram (Schymkowitz et al 2005) to remove potential stericclashes FoldX utilizes an empirical force field model to esti-mate the stability of a protein (Gwildtype in kcalmol) FoldXwas used to estimate the stability effect on each residueposition when the residue was replaced by any of the 19other amino acids The stability effect is determined bythe difference in the wildtype and mutant energies(G = GmutantGwildtype) This resulted in a stabilitylandscape of all potential mutations including the Gvalues both for the disease mutations and for all the aminoacid substitutions observed across vertebrate evolution

Statistical Analyses

All statistical analyses were performed with R v302 (R CoreTeam 2013) and SigmaPlot (Systat Software San Jose CA)The pipeline was developed with Biopython scripts (Cocket al 2009) The visualization of the multiple sequence align-ments was performed using Jalview 28 (Waterhouse et al2009)

Data Availability

The final sequence alignments are provided on a webpagehttpwwwbiochemuclacukpavithraevolutioncoagulationindexphp (last accessed January 2014) Pathogenic mu-tation data for three human coagulation factors FVIII FIX andFXI were retrieved from our web databases at httpwwwfactorviii-dborg httpwwwfactorixorg and httpwwwfactorxiorg (last accessed January 2014) (Saunders et al2009 Rallapalli et al 2013)

Supplementary MaterialSupplementary tables S1ndashS2 and figures S1ndashS3 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

PMR and SJP thank Pfizer UK for a medical educationalgrant that generously funded this project and the SpecialTrustees of the Royal Free Hospital the KatharineDormandy Trust for Hemophilia and Related Disorders andthe European Association for Hemophilia and AssociatedDisorders for additional support The authors also thank DrKeith Gomez Dr Geoffrey Kemball-Cook and ProfessorEdward G Tuddenham for their generous collaborative sup-port of our studies including access to the FVIII mutations atthe Hemophilia A Database at httpwwwfactorviii-dborgRAS acknowledges funding from the Fondation du 450eme

3054

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

anniversaire de lrsquoUniversite de Lausanne and Swiss NationalScience Foundation grants 132476 and 136477

ReferencesAmara U Rittirsch D Flierl M Bruckner U Klos A Gebhard F Lambris

JD Huber-Lang M 2008 Interaction between the coagulation andcomplement system Adv Exp Med Biol 63271ndash79

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19950ndash958

Broze GJ Jr 1995 Tissue factor pathway inhibitor and the revised theoryof coagulation Annu Rev Med 46103ndash112

Brunder W Schmidt H Karch H 1997 EspP a novel extracellular serineprotease of enterohaemorrhagic Escherichia coli O157H7 cleaveshuman coagulation factor V Mol Microbiol 24767ndash778

Butenas S vanrsquot Veer C Cawthern K Brummel KE Mann KG 2000Models of blood coagulation Blood Coagul Fibrinolysis 11(Suppl 1)S9ndashS13

Castresana J 2000 Selection of conserved blocks from multiple align-ments for their use in phylogenetic analysis Mol Biol Evol 17540ndash552

Cock PJ Antao T Chang JT Chapman BA Cox CJ Dalke A Friedberg IHamelryck T Kauff F Wilczynski B et al 2009 Biopython freelyavailable Python tools for computational molecular biology andbioinformatics Bioinformatics 251422ndash1423

de Beer TA Laskowski RA Parks SL Sipos B Goldman N Thornton JM2013 Amino Acid changes in disease-associated variants differ rad-ically from variants observed in the 1000 genomes project datasetPLoS Comput Biol 91ndash15

Dong Y Dimopoulos G 2009 Anopheles fibrinogen-related proteinsprovide expanded pattern recognition capacity against bacteriaand malaria parasites J Biol Chem 2849835ndash9844

Doolittle RF 2009 Step-by-step evolution of vertebrate blood coagula-tion Cold Spring Harb Symp Quant Biol 7435ndash40

Doolittle RF Jiang Y Nand J 2008 Genomic evidence for a simplerclotting scheme in jawless vertebrates J Mol Evol 66185ndash196

Fletcher W Yang Z 2010 The effect of insertions deletions and align-ment errors on the branch-site test of positive selection Mol BiolEvol 272257ndash2267

Flicek P Ahmed I Amode MR Barrell D Beal K Brent S Carvalho-SilvaD Clapham P Coates G Fairley S et al 2013 Ensembl 2013 NucleicAcids Res 41D48ndashD55

Junier T Zdobnov EM 2010 The Newick utilitieshigh-throughput phy-logenetic tree processing in the UNIX shell Bioinformatics 261669ndash1670

Kawthalkar S 2013 Overview of Physiology of blood In essentials ofhaematology New Delhi (India) Jaypee Brothers Medical Publishers(P) Ltd p 1ndash51

Kolb AW Ane C Brandt CR 2013 Using HSV-1 genome phylogeneticsto track past human migrations PLoS One 81ndash9

Kosiol C Vinar T da Fonseca RR Hubisz MJ Bustamante CD Nielsen RSiepel A 2008 Patterns of positive selection in six Mammalian ge-nomes PLoS Genet 4e1000144

Lombardo F Ghani Y Kafatos FC Christophides GK 2013Comprehensive genetic dissection of the hemocyte immune re-sponse in the malaria mosquito Anopheles gambiae PLoS Pathog9e1003145

Lopez C Saravia C Gomez A Hoebeke J Patarroyo MA 2010Mechanisms of genetically-based resistance to malaria Gene 4671ndash12

Loytynoja A 2014 Phylogeny-aware alignment with PRANK MethodsMol Biol 1079155ndash170

Loytynoja A Goldman N 2008 Phylogeny-aware gap placement pre-vents errors in sequence alignment and evolutionary analysisScience 3201632ndash1635

Masel J 2011 Genetic drift Curr Biol 21R837ndashR838Matsuka YV Pillai S Gubba S Musser JM Olmsted SB 1999 Fibrinogen

cleavage by the Streptococcus pyogenes extracellular cysteine

protease and generation of antibodies that inhibit enzyme proteo-lytic activity Infect Immun 674326ndash4333

Meyer A Van de Peer Y 2005 From 2R to 3R evidence for a fish-specificgenome duplication (FSGD) Bioessays 27937ndash945

Montoya-Burgos JI 2011 Patterns of positive selection and neutral evo-lution in the protein-coding genes of Tetraodon and Takifugu PLoSOne 6e24800

Moretti S Laurenczy B Gharib WH Castella B Kuzniar A Schabauer HStuder RA Valle M Salamin N Stockinger H et al 2014 Selectomeupdate quality control and computational improvements to a data-base of positive selection Nucleic Acids Res 42D917ndashD921

Morrissey JH Choi SH Smith SA 2012 Polyphosphate an ancient mol-ecule that links platelets coagulation and inflammation Blood 1195972ndash5979

Nathwani AC Tuddenham EG Rangarajan S Rosales C McIntosh JLinch DC Chowdary P Riddell A Pie AJ Harrington C et al 2011Adenovirus-associated virus vector-mediated gene transfer in hemo-philia B N Engl J Med 3652357ndash2365

Nielsen R Bustamante C Clark AG Glanowski S Sackton TB Hubisz MJFledel-Alon A Tanenbaum DM Civello D White TJ et al 2005 Ascan for positively selected genes in the genomes of humans andchimpanzees PLoS Biol 3e170

Panizzi P Friedrich R Fuentes-Prior P Richter K Bock PE Bode W 2005Fibrinogen substrate recognition by staphylocoagulase(pro)throm-bin complexes J Biol Chem 131179ndash1187

Proux E Studer RA Moretti S Robinson-Rechavi M 2009 Selectome adatabase of positive selection Nucleic Acids Res 37D404ndashD407

Puy C Tucker EI Wong ZC Gailani D Smith SA Choi SH Morrissey JHGruber A McCarty OJ 2013 Factor XII promotes blood coagulationindependent of factor XI in the presence of long-chain polypho-sphates J Thromb Haemost 111341ndash1352

R Core Team 2013 R a language and environment for statistical com-puting Vienna Austria R Foundation for Statistical Computing

Rallapalli PM Kemball-Cook G Tuddenham EG Gomez K Perkins SJ2013 An interactive mutation database for human coagulationfactor IX provides novel insights into the phenotypes and geneticsof hemophilia B J Thromb Haemost 111329ndash1340

Roux J Privman E Moretti S Daub JT Robinson-Rechavi M Keller L2014 Patterns of positive selection in seven ant genomes Mol BiolEvol 311661ndash1685

Saunders RE Shiltagh N Gomez K Mellars G Cooper C Perry DJTuddenham EG Perkins SJ 2009 Structural analysis of eight noveland 112 previously reported missense mutations in the interactiveFXI mutation database reveals new insight on FXI deficiencyThromb Haemost 102287ndash301

Schymkowitz J Borg J Stricher F Nys R Rousseau F Serrano L 2005 TheFoldX web server an online force field Nucleic Acids Res 33W382ndashW388

Sillitoe I Cuff AL Dessailly BH Dawson NL Furnham N Lee D Lees JGLewis TE Studer RA Rentzsch R et al 2013 New functionalfamilies (FunFams) in CATH to improve the mapping of con-served functional sites to 3D structures Nucleic Acids Res 41D490ndashD498

Spronk HM Govers-Riemslag JW ten CH 2003 The blood coagulationsystem as a molecular machine Bioessays 251220ndash1228

Stefl S Nishi H Petukh M Panchenko AR Alexov E 2013 Molecularmechanisms of disease-causing missense mutations J Mol Biol 4253919ndash3936

Storey JD Tibshirani R 2003 Statistical significance for genomewidestudies Proc Natl Acad Sci U S A 1009440ndash9445

Studer RA Penel S Duret L Robinson-Rechavi M 2008 Pervasive pos-itive selection on duplicated and nonduplicated vertebrate proteincoding genes Genome Res 181393ndash1402

Studer RA Robinson-Rechavi M 2009a Evidence for an episodic modelof protein sequence evolution Biochem Soc Trans 37783ndash786

Studer RA Robinson-Rechavi M 2009b Large-scale analyses of positiveselection using codon models In Pontarotti P editor Evolutionarybiology Berlin (Germany) Springer-Verlag Berlin Heidelbergp 217ndash235

3055

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Sun H 2006 The interaction between pathogens and the host coagu-lation system Physiology 21281ndash288

Sutherland MR Ruf W Pryzdial EL 2012 Tissue factor and glycoproteinC on herpes simplex virus type 1 are protease-activated receptor 2cofactors that enhance infection Blood 1193638ndash3645

Thomas CE Ehrhardt A Kay MA 2003 Progress and problemswith the use of viral vectors for gene therapy Nat Rev Genet 4346ndash358

Tokuriki N1 Stricher F Schymkowitz J Serrano L Tawfik DS 2007 Thestability effects of protein mutations appear to be universally dis-tributed J Mol Biol 3691318ndash1332

Vallender EJ Lahn BT 2004 Positive selection on the human genomeHum Mol Genet 13R245ndashR254

Vilella AJ Severin J Ureta-Vidal A Heng L Durbin R Birney E 2009EnsemblCompara GeneTrees complete duplication-aware phyloge-netic trees in vertebrates Genome Res 19327ndash335

Waterhouse AM Procter JB Martin DM Clamp M Barton GJ 2009Jalview Version 2mdasha multiple sequence alignment editor and anal-ysis workbench Bioinformatics 251189ndash1191

Wong WS Yang Z Goldman N Nielsen R 2004 Accuracy and power ofstatistical methods for detecting adaptive evolution in proteincoding sequences and for identifying positively selected sitesGenetics 1681041ndash1051

Yang Z 2006 Computational molecular evolution Oxford UniversityPress Oxford

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 241586ndash1591

Yang Z Wong WS Nielsen R 2005 Bayes empirical Bayes inference ofamino acid sites under positive selection Mol Biol Evol 221107ndash1118

Zhang J Nielsen R Yang Z 2005 Evaluation of an improved branch-sitelikelihood method for detecting positive selection at the molecularlevel Mol Biol Evol 222472ndash2479

3056

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

which is important for thrombin formation the most impor-tant step for fibrin clot formation Because both FVa and FXaare central to coagulation and are associated with bacterialproteolysis this offers an explanation for the signatures ofpositive selection in both proteins

FVIII FIX and FXIII are also associated with positive selec-tion however it is not clear what selective pressures are in-volved Although these proteins may also be targeted bypathogens no direct interactions have been reported so far

Absence of Positive Selection for FXI and FXIIFXII is unique in that it does not show positive selection eventhough it is involved in host-pathogen interactions with long-chain inorganic phosphatases (polyP) from microorganisms(Morrissey et al 2012) Small-chain polyP are produced in thehuman brain and by blood platelets Hence even during theabsence of pathogen-originated polyP the human host doesnot lack polyP and there is no pressure to evolve for this (Puyet al 2013) A F12 deficiency in patients does not cause bleed-ing disorders This means that the coagulation cascade can beinitiated even in the absence of F12 For similar reasons FXIshows no positive selection We hypothesize that this is be-cause first no interaction with the environment has beendemonstrated to date and second with respect to the polyPmechanism the coagulation cascade is capable of activatingin a FXI-independent manner

Relationship between Disease-Causing Mutations andSelection

The mechanisms of genetic drift mutational change andevolutionary change are closely related to each otherEvolutionary change is not only dependent on the geneticvariability from mutational change but also on genetic driftwhich describes the random fluctuations in the number ofgenetic variants in an organism Genetic drift and mutationsproduce random variations that selection can act upon Thenew changes incurred in one generation may be cancelled inanother generation It may therefore be important to deter-mine whether a new mutation is lost or becomes commonenough for selection to determine its fate (Masel 2011) In thisstudy we have examined the relationship between naturalselection acting on an amino acid and the disease-causingmutations reported at that amino acid position for thethree prominent coagulation factors FVIII FIX and FXI Wehave shown that the disease-causing mutations are morelikely to occur at a higher frequency at sites under negativeselection in all three proteins (fig 5) As sites under strongselective pressure are more likely to be critical to proteinfunction such a mutation can readily affect protein functionand trigger disease Lethal (or deleterious) mutations thatcause death are de facto not observed in our data set Bycontrast disease-causing mutations occur at a reduced fre-quency in sites under neutral or positive selection suggestingthat these sites are more tolerant to mutations than sitesunder negative selection Interestingly we have shown thatFVIII and FIX have undergone much positive selectionwhereas FXI has no such signatures (fig 4) Despite this dif-ference our results for three proteins show no change in the

relationship between evolutionary selection and disease-causing mutations (figs 5 and 6) In summary the reporteddisease-causing mutations show randomness and the muta-bility of a given amino acid position depends on the selectionpressure acting at that position and not on the entire protein

Effect of Disease-Causing Mutations on the Stability ofProtein Structure

Mutations that change the amino acid sequence can havestriking effects on protein stability Point mutations are re-lated to pathological and genetic conditions by reducing orabolishing coagulation protein function (table 1) Proteinfunction is related to its stability which is often measuredby the change in folding energy G of the protein structure(Stefl et al 2013) We have identified the effect of mutationson the stability of FVIII FIX and FXI in terms of G valuesThe stability effects of polymorphic amino acid changes inhuman sequences follow a Gaussian distribution (de Beeret al 2013) In that study disease-causing mutations alsoshowed the same distribution but slightly skewed towarddestabilization In this study the same trend for aminoacids substitutions in vertebrates were observed but this ismuch stronger for disease-causing mutations (fig 6) The dif-ference between the two studies may be explained by thedifferent methods used de Beer et al (2013) used the DiscreteOptimized Protein (DOPE) score of Modeller as a monitor ofglobal stability whereas we used FoldX (Materials andMethods) as an empirical method to infer directly the stabilityeffect G of a point mutation

Evolutionarily observed mutations are the most stable interms of their protein structure whereas disease-causing mu-tations destabilized the protein structure (figs 6 and 7) Eachsite may or may not tolerate a particular amino acid changedepending on the change in stability When each amino acidwas converted to each of the other 19 possible amino acids(fig 7 top panels) the G values showed that most aminoacids destabilize the structure whereas several may stabilize itThe disease-causing effect of a mutation may be affected notonly by the selection pressure acting on it but also by the typeof the amino acid replaced For the FVIII FIX and FXI bleedingdisorders destabilization was observed for most of the dis-ease-causing mutations

Utility of Evolutionary and Disease-CausingMutational Analyses for Clinical Diagnosis

Interactive locus-specific databases are useful tools for pre-senting patient mutational data for patient diagnosis andcare We plan to incorporate the above data sets on selectivepressures and stability effects into our mutational databasesfor FVIII FIX and FXI This information will provide insight forthe prediction of the effect of disease-causing mutations Thelevel of selective pressure was estimated under a maximum-likelihood framework Such methods are useful for geneticstudies but their results are sometimes difficult to interpretby a nonspecialist The presentation of this information needsto be intuitive to the viewer As shown in supplementaryfigure S3 Supplementary Material online the use of multiple

3052

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

sequence alignments and presentation of stability effectsG structural locations and selective pressures can helpin interpreting the impacts of the mutations This type ofinformation will therefore clarify the interpretation of newlydiscovered disease-causing mutations for clinicians evolu-tionary biologists and protein biochemists This combinedevolutionary-driven study of bleeding disorder-causing muta-tions is the most detailed of its kind as far as we are aware andmay be usefully extended to other biological processes andtheir proteins

Materials and Methods

Genomic Sequences

The flowchart of operations is summarized in supplementaryfigure S1 Supplementary Material online First the completemultiple sequence alignments were retrieved together withtheir gene trees for each of the 14 genes corresponding to the11 coagulation factors (table 1) from the Ensembl Comparadatabase (Vilella et al 2009) using their Ensembl IDs (table 1)and a Perl script Ensembl release 73 was used accessed onSeptember 30 2013 (Flicek et al 2013) These amino acidsequences came from Ensembl as they are generally ofgood quality and contained 74 vertebrate genomes in atotal of 80 genomes We then excluded any sequences thatcame from genomes of low coverage (2 or less) This re-sulted in 47 vertebrate genomes with alignment sizes be-tween 23 and 47 genomes These 47 sets of vertebratesequences provided sufficient genomic coverage to performphylogenetic analysis The 47 vertebrate organisms were clas-sified into five major clades (fig 2) namely Primates Glires(rodents rabbits and pika) Laurasiatheria Sauropsida (rep-tiles and birds) and Fishes (ray-finned fishes) together withminor clades including Atlantogenata and Australianmammals

Sequence Retrieval and Multiple Alignments

The multiple sequence alignments and gene trees fromEnsemblCompara included all the coagulation sequences indifferent species irrespective of their genome coverage Themultiple sequence alignments of the downloaded genes fromthe EnsemblCompara database were constructed using aglobal alignment algorithm provided within this databasefor significant number of sequences of arbitrary lengthsThe data set required for this study comprised 14 genesfrom 47 genomes A thorough quality control procedure ofthe multiple sequence alignments was performed in order togenerate minimum false positive results Thus sequences wereremoved if they showed 41 low-complexity regions orunknown nucleotides (indicated as ldquoNrdquo in a sequence)From these the phylogenetic trees were pruned usingNewick Utilities software (Junier and Zdobnov 2010) to iden-tify the tips corresponding to the retrieved species (fig 2) Thefinal alignments of these sequences were made with PRANKwhich is conservative in that it tends to align only amino acidsderived by substitution and introduces insertions and dele-tions (indels) instead of aligning sequence fragments thatappear too divergent (Loytynoja and Goldman 2008)

PRANK is an advanced probabilistic alignment algorithmthat incorporates phylogenetic information This has beenshown to reduce the number of false-positives when evalu-ating positive selection (Fletcher and Yang 2010 Loytynoja2014) All the PRANK alignments were made using defaultparameters for the empirical codon model and the prunedtrees as guides The resulting alignments are composed ofhighly conserved blocks surrounded by noise due to indelsFinally Gblocks (Castresana 2000) was used to remove thenoise in the unreliable regions of the multiple sequence align-ments which may cause false positives when testing for pos-itive selection to keep only the well-aligned parts The defaultGblocks parameters for the codontype model (parameter ndasht = c) were used with the exception of tolerating half gaps perposition (parameter -b5=h) This procedure resulted in cleanand accurate multiple sequence alignments for the final com-putations of the site and branch-site models using CodeMLtogether with their corresponding gene trees

Analyses of Evolutionary Pressure

For each of the 14 coagulation genes within each of the fivemajor clades the selective pressures were estimated using thedifferent codon substitution site models implemented inCodeML of the phylogenetic analysis with maximum-likeli-hood software (PAML release 47) (Yang 2007) By analyzingthe protein-coding genes at the nucleotide level the synon-ymous or silent changes (nucleotide substitutions that do notchange the encoded amino acid) are distinguished from thenonsynonymous or missense changes (substitutions thatchange the encoded amino acid) Because natural selectionoperates at the protein level and can take any direction ofpositive neutral or negative selection the synonymous andnonsynonymous substitutions are subjected to different se-lective pressures which occur at the rates dS and dN respec-tively Hence the ratio of nonsynonymoussynonymoussubstitution rates (= dNdS) measures the selective pressureat the protein level The value of reveals the direction andstrength of natural selection acting on the protein Values oflt 1 = 1 and 4 1 indicate negative selection neutralevolution and positive selection respectively If an existingfunction or phenotype is evolutionarily favorable then nega-tive selection (or purifying selection) favors the conservationof that function or phenotype (conservation of amino acids)The other is positive selection (or Darwinian selection) whichfavors the promotion of a new function or phenotypeNeutral evolution describes the tolerance to amino acid mu-tations which are neither deleterious nor beneficial and arefixed according to genetic drift

The following five models were used to account for thevariation of dNdS among codon sites M1a (two dNdS ratioswith negative selection [0lt 1] and neutral evolution[1 = 1]) M2a (three dNdS ratios with negative selection[0lt 1] neutral evolution [1 = 1] and positive selection[2 1]) M7 (beta) M8 (beta and 24 1) and M8a (betaand 2 = 1) The alignments contain several positions inwhich one of the sequences is missing and the ldquocleandata = 0rdquooption was selected to retain all sites The inference of positive

3053

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

selection was conducted by performing LRTs in which thefollowing two pairs of models were compared namely M1awith M2a and M8a with M8 The principle is to compare anull model that does not allow24 1 (M1a or M8a) with analternative model that does (M2a or M8) Because the alter-nate model has more parameters than the null model theLRT follows a 2 distribution The LRT is a powerful methodfor comparing two hypotheses where one is the special caseof the other The BEB method (Yang et al 2005) was used tocalculate the posterior probabilities for site classes and theBEB value was used to identify sites under positive selectionwhere the 2 distribution of the LRT was significant withPlt 005 As many different LRT comparisons (differentgenes different clades) were conducted a control for thefalse-discovery rate (expected proportion of false positives)was applied using the q value package in R (Storey andTibshirani 2003) We adjusted the parameters of the q valueto perform a bootstrap analysis (pi0meth = ldquobootstraprdquo) andto tolerate 10 of false positives in our significant results(ldquofdrlevel = 01rdquo) A bootstrap analysis is necessary as the Pvalue distribution is bimodal (Studer et al 2008)

Two sources for the branch-site model were used First theSelectome database at httpselectomeunilch (last accessedJanuary 2014) (Proux et al 2009 Moretti et al 2014) providesthe precomputed results of the branch-site model (supple-mentary table S2b Supplementary Material online) based onthe multiple sequence alignments from the EnsemblCompara database (Vilella et al 2009) The last release ofSelectome (release 6) contains an optimized procedure ofalignment refinement and quality control similar to theone used for our site-model analyses as described aboveThe branch-site model allows the dNdS value to vary notonly between sites but also between branches (Zhang et al2005) The advantage of this model is its ability to detecttraces of episodic positive selection (Studer and Robinson-Rechavi 2009a) Second a branch-site model was determinedbased on our data set (see above) We used the vertebratetrees and their multiple sequence alignments as inputs forbranch-site model calculations The multiple alignments weremanually rechecked to remove barren sequences and thecorresponding vertebrate trees were edited accordinglyusing Newick Utilities Each taxonomic unit also known asthe branch or node of the tree was tagged using an auto-mated python script The CodeML program was run on eachof these tagged trees from all the genes for this branch-sitemodel (supplementary table S2b Supplementary Materialonline) For some alignments we found misaligned short in-sertions up to a maximum of ten amino acids in the con-served blocks probably due to gene model prediction errorsIn our whole data set these errors affect 0ndash4 sequences peralignment As the F2 gene family contained most of thesesmall errors albeit at a very low level this was reanalyzed byremoving four sequences with potential problems from thealignment The recalculation of the likelihood ratio deviatedonly slightly from the initial results (P value of 393e5 for 38sequences vs P value of 217e4 for 34 sequences) This sug-gested that these errors would only have a minor impact ifany on the detection of positive selection especially as the

coagulation factor sequences were long and the model ofcodon substitutions was averaged over all columns (manyhundreds) and sequences (between 23 and 47)

Stability Effect of Mutations

When possible for each coagulation factor the molecularstructures of the domain(s) were retrieved directly fromthe CATH database (Sillitoe et al 2013) Prior to the muta-tional analyses each structure was repaired with the FoldXprogram (Schymkowitz et al 2005) to remove potential stericclashes FoldX utilizes an empirical force field model to esti-mate the stability of a protein (Gwildtype in kcalmol) FoldXwas used to estimate the stability effect on each residueposition when the residue was replaced by any of the 19other amino acids The stability effect is determined bythe difference in the wildtype and mutant energies(G = GmutantGwildtype) This resulted in a stabilitylandscape of all potential mutations including the Gvalues both for the disease mutations and for all the aminoacid substitutions observed across vertebrate evolution

Statistical Analyses

All statistical analyses were performed with R v302 (R CoreTeam 2013) and SigmaPlot (Systat Software San Jose CA)The pipeline was developed with Biopython scripts (Cocket al 2009) The visualization of the multiple sequence align-ments was performed using Jalview 28 (Waterhouse et al2009)

Data Availability

The final sequence alignments are provided on a webpagehttpwwwbiochemuclacukpavithraevolutioncoagulationindexphp (last accessed January 2014) Pathogenic mu-tation data for three human coagulation factors FVIII FIX andFXI were retrieved from our web databases at httpwwwfactorviii-dborg httpwwwfactorixorg and httpwwwfactorxiorg (last accessed January 2014) (Saunders et al2009 Rallapalli et al 2013)

Supplementary MaterialSupplementary tables S1ndashS2 and figures S1ndashS3 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

PMR and SJP thank Pfizer UK for a medical educationalgrant that generously funded this project and the SpecialTrustees of the Royal Free Hospital the KatharineDormandy Trust for Hemophilia and Related Disorders andthe European Association for Hemophilia and AssociatedDisorders for additional support The authors also thank DrKeith Gomez Dr Geoffrey Kemball-Cook and ProfessorEdward G Tuddenham for their generous collaborative sup-port of our studies including access to the FVIII mutations atthe Hemophilia A Database at httpwwwfactorviii-dborgRAS acknowledges funding from the Fondation du 450eme

3054

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

anniversaire de lrsquoUniversite de Lausanne and Swiss NationalScience Foundation grants 132476 and 136477

ReferencesAmara U Rittirsch D Flierl M Bruckner U Klos A Gebhard F Lambris

JD Huber-Lang M 2008 Interaction between the coagulation andcomplement system Adv Exp Med Biol 63271ndash79

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19950ndash958

Broze GJ Jr 1995 Tissue factor pathway inhibitor and the revised theoryof coagulation Annu Rev Med 46103ndash112

Brunder W Schmidt H Karch H 1997 EspP a novel extracellular serineprotease of enterohaemorrhagic Escherichia coli O157H7 cleaveshuman coagulation factor V Mol Microbiol 24767ndash778

Butenas S vanrsquot Veer C Cawthern K Brummel KE Mann KG 2000Models of blood coagulation Blood Coagul Fibrinolysis 11(Suppl 1)S9ndashS13

Castresana J 2000 Selection of conserved blocks from multiple align-ments for their use in phylogenetic analysis Mol Biol Evol 17540ndash552

Cock PJ Antao T Chang JT Chapman BA Cox CJ Dalke A Friedberg IHamelryck T Kauff F Wilczynski B et al 2009 Biopython freelyavailable Python tools for computational molecular biology andbioinformatics Bioinformatics 251422ndash1423

de Beer TA Laskowski RA Parks SL Sipos B Goldman N Thornton JM2013 Amino Acid changes in disease-associated variants differ rad-ically from variants observed in the 1000 genomes project datasetPLoS Comput Biol 91ndash15

Dong Y Dimopoulos G 2009 Anopheles fibrinogen-related proteinsprovide expanded pattern recognition capacity against bacteriaand malaria parasites J Biol Chem 2849835ndash9844

Doolittle RF 2009 Step-by-step evolution of vertebrate blood coagula-tion Cold Spring Harb Symp Quant Biol 7435ndash40

Doolittle RF Jiang Y Nand J 2008 Genomic evidence for a simplerclotting scheme in jawless vertebrates J Mol Evol 66185ndash196

Fletcher W Yang Z 2010 The effect of insertions deletions and align-ment errors on the branch-site test of positive selection Mol BiolEvol 272257ndash2267

Flicek P Ahmed I Amode MR Barrell D Beal K Brent S Carvalho-SilvaD Clapham P Coates G Fairley S et al 2013 Ensembl 2013 NucleicAcids Res 41D48ndashD55

Junier T Zdobnov EM 2010 The Newick utilitieshigh-throughput phy-logenetic tree processing in the UNIX shell Bioinformatics 261669ndash1670

Kawthalkar S 2013 Overview of Physiology of blood In essentials ofhaematology New Delhi (India) Jaypee Brothers Medical Publishers(P) Ltd p 1ndash51

Kolb AW Ane C Brandt CR 2013 Using HSV-1 genome phylogeneticsto track past human migrations PLoS One 81ndash9

Kosiol C Vinar T da Fonseca RR Hubisz MJ Bustamante CD Nielsen RSiepel A 2008 Patterns of positive selection in six Mammalian ge-nomes PLoS Genet 4e1000144

Lombardo F Ghani Y Kafatos FC Christophides GK 2013Comprehensive genetic dissection of the hemocyte immune re-sponse in the malaria mosquito Anopheles gambiae PLoS Pathog9e1003145

Lopez C Saravia C Gomez A Hoebeke J Patarroyo MA 2010Mechanisms of genetically-based resistance to malaria Gene 4671ndash12

Loytynoja A 2014 Phylogeny-aware alignment with PRANK MethodsMol Biol 1079155ndash170

Loytynoja A Goldman N 2008 Phylogeny-aware gap placement pre-vents errors in sequence alignment and evolutionary analysisScience 3201632ndash1635

Masel J 2011 Genetic drift Curr Biol 21R837ndashR838Matsuka YV Pillai S Gubba S Musser JM Olmsted SB 1999 Fibrinogen

cleavage by the Streptococcus pyogenes extracellular cysteine

protease and generation of antibodies that inhibit enzyme proteo-lytic activity Infect Immun 674326ndash4333

Meyer A Van de Peer Y 2005 From 2R to 3R evidence for a fish-specificgenome duplication (FSGD) Bioessays 27937ndash945

Montoya-Burgos JI 2011 Patterns of positive selection and neutral evo-lution in the protein-coding genes of Tetraodon and Takifugu PLoSOne 6e24800

Moretti S Laurenczy B Gharib WH Castella B Kuzniar A Schabauer HStuder RA Valle M Salamin N Stockinger H et al 2014 Selectomeupdate quality control and computational improvements to a data-base of positive selection Nucleic Acids Res 42D917ndashD921

Morrissey JH Choi SH Smith SA 2012 Polyphosphate an ancient mol-ecule that links platelets coagulation and inflammation Blood 1195972ndash5979

Nathwani AC Tuddenham EG Rangarajan S Rosales C McIntosh JLinch DC Chowdary P Riddell A Pie AJ Harrington C et al 2011Adenovirus-associated virus vector-mediated gene transfer in hemo-philia B N Engl J Med 3652357ndash2365

Nielsen R Bustamante C Clark AG Glanowski S Sackton TB Hubisz MJFledel-Alon A Tanenbaum DM Civello D White TJ et al 2005 Ascan for positively selected genes in the genomes of humans andchimpanzees PLoS Biol 3e170

Panizzi P Friedrich R Fuentes-Prior P Richter K Bock PE Bode W 2005Fibrinogen substrate recognition by staphylocoagulase(pro)throm-bin complexes J Biol Chem 131179ndash1187

Proux E Studer RA Moretti S Robinson-Rechavi M 2009 Selectome adatabase of positive selection Nucleic Acids Res 37D404ndashD407

Puy C Tucker EI Wong ZC Gailani D Smith SA Choi SH Morrissey JHGruber A McCarty OJ 2013 Factor XII promotes blood coagulationindependent of factor XI in the presence of long-chain polypho-sphates J Thromb Haemost 111341ndash1352

R Core Team 2013 R a language and environment for statistical com-puting Vienna Austria R Foundation for Statistical Computing

Rallapalli PM Kemball-Cook G Tuddenham EG Gomez K Perkins SJ2013 An interactive mutation database for human coagulationfactor IX provides novel insights into the phenotypes and geneticsof hemophilia B J Thromb Haemost 111329ndash1340

Roux J Privman E Moretti S Daub JT Robinson-Rechavi M Keller L2014 Patterns of positive selection in seven ant genomes Mol BiolEvol 311661ndash1685

Saunders RE Shiltagh N Gomez K Mellars G Cooper C Perry DJTuddenham EG Perkins SJ 2009 Structural analysis of eight noveland 112 previously reported missense mutations in the interactiveFXI mutation database reveals new insight on FXI deficiencyThromb Haemost 102287ndash301

Schymkowitz J Borg J Stricher F Nys R Rousseau F Serrano L 2005 TheFoldX web server an online force field Nucleic Acids Res 33W382ndashW388

Sillitoe I Cuff AL Dessailly BH Dawson NL Furnham N Lee D Lees JGLewis TE Studer RA Rentzsch R et al 2013 New functionalfamilies (FunFams) in CATH to improve the mapping of con-served functional sites to 3D structures Nucleic Acids Res 41D490ndashD498

Spronk HM Govers-Riemslag JW ten CH 2003 The blood coagulationsystem as a molecular machine Bioessays 251220ndash1228

Stefl S Nishi H Petukh M Panchenko AR Alexov E 2013 Molecularmechanisms of disease-causing missense mutations J Mol Biol 4253919ndash3936

Storey JD Tibshirani R 2003 Statistical significance for genomewidestudies Proc Natl Acad Sci U S A 1009440ndash9445

Studer RA Penel S Duret L Robinson-Rechavi M 2008 Pervasive pos-itive selection on duplicated and nonduplicated vertebrate proteincoding genes Genome Res 181393ndash1402

Studer RA Robinson-Rechavi M 2009a Evidence for an episodic modelof protein sequence evolution Biochem Soc Trans 37783ndash786

Studer RA Robinson-Rechavi M 2009b Large-scale analyses of positiveselection using codon models In Pontarotti P editor Evolutionarybiology Berlin (Germany) Springer-Verlag Berlin Heidelbergp 217ndash235

3055

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Sun H 2006 The interaction between pathogens and the host coagu-lation system Physiology 21281ndash288

Sutherland MR Ruf W Pryzdial EL 2012 Tissue factor and glycoproteinC on herpes simplex virus type 1 are protease-activated receptor 2cofactors that enhance infection Blood 1193638ndash3645

Thomas CE Ehrhardt A Kay MA 2003 Progress and problemswith the use of viral vectors for gene therapy Nat Rev Genet 4346ndash358

Tokuriki N1 Stricher F Schymkowitz J Serrano L Tawfik DS 2007 Thestability effects of protein mutations appear to be universally dis-tributed J Mol Biol 3691318ndash1332

Vallender EJ Lahn BT 2004 Positive selection on the human genomeHum Mol Genet 13R245ndashR254

Vilella AJ Severin J Ureta-Vidal A Heng L Durbin R Birney E 2009EnsemblCompara GeneTrees complete duplication-aware phyloge-netic trees in vertebrates Genome Res 19327ndash335

Waterhouse AM Procter JB Martin DM Clamp M Barton GJ 2009Jalview Version 2mdasha multiple sequence alignment editor and anal-ysis workbench Bioinformatics 251189ndash1191

Wong WS Yang Z Goldman N Nielsen R 2004 Accuracy and power ofstatistical methods for detecting adaptive evolution in proteincoding sequences and for identifying positively selected sitesGenetics 1681041ndash1051

Yang Z 2006 Computational molecular evolution Oxford UniversityPress Oxford

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 241586ndash1591

Yang Z Wong WS Nielsen R 2005 Bayes empirical Bayes inference ofamino acid sites under positive selection Mol Biol Evol 221107ndash1118

Zhang J Nielsen R Yang Z 2005 Evaluation of an improved branch-sitelikelihood method for detecting positive selection at the molecularlevel Mol Biol Evol 222472ndash2479

3056

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

sequence alignments and presentation of stability effectsG structural locations and selective pressures can helpin interpreting the impacts of the mutations This type ofinformation will therefore clarify the interpretation of newlydiscovered disease-causing mutations for clinicians evolu-tionary biologists and protein biochemists This combinedevolutionary-driven study of bleeding disorder-causing muta-tions is the most detailed of its kind as far as we are aware andmay be usefully extended to other biological processes andtheir proteins

Materials and Methods

Genomic Sequences

The flowchart of operations is summarized in supplementaryfigure S1 Supplementary Material online First the completemultiple sequence alignments were retrieved together withtheir gene trees for each of the 14 genes corresponding to the11 coagulation factors (table 1) from the Ensembl Comparadatabase (Vilella et al 2009) using their Ensembl IDs (table 1)and a Perl script Ensembl release 73 was used accessed onSeptember 30 2013 (Flicek et al 2013) These amino acidsequences came from Ensembl as they are generally ofgood quality and contained 74 vertebrate genomes in atotal of 80 genomes We then excluded any sequences thatcame from genomes of low coverage (2 or less) This re-sulted in 47 vertebrate genomes with alignment sizes be-tween 23 and 47 genomes These 47 sets of vertebratesequences provided sufficient genomic coverage to performphylogenetic analysis The 47 vertebrate organisms were clas-sified into five major clades (fig 2) namely Primates Glires(rodents rabbits and pika) Laurasiatheria Sauropsida (rep-tiles and birds) and Fishes (ray-finned fishes) together withminor clades including Atlantogenata and Australianmammals

Sequence Retrieval and Multiple Alignments

The multiple sequence alignments and gene trees fromEnsemblCompara included all the coagulation sequences indifferent species irrespective of their genome coverage Themultiple sequence alignments of the downloaded genes fromthe EnsemblCompara database were constructed using aglobal alignment algorithm provided within this databasefor significant number of sequences of arbitrary lengthsThe data set required for this study comprised 14 genesfrom 47 genomes A thorough quality control procedure ofthe multiple sequence alignments was performed in order togenerate minimum false positive results Thus sequences wereremoved if they showed 41 low-complexity regions orunknown nucleotides (indicated as ldquoNrdquo in a sequence)From these the phylogenetic trees were pruned usingNewick Utilities software (Junier and Zdobnov 2010) to iden-tify the tips corresponding to the retrieved species (fig 2) Thefinal alignments of these sequences were made with PRANKwhich is conservative in that it tends to align only amino acidsderived by substitution and introduces insertions and dele-tions (indels) instead of aligning sequence fragments thatappear too divergent (Loytynoja and Goldman 2008)

PRANK is an advanced probabilistic alignment algorithmthat incorporates phylogenetic information This has beenshown to reduce the number of false-positives when evalu-ating positive selection (Fletcher and Yang 2010 Loytynoja2014) All the PRANK alignments were made using defaultparameters for the empirical codon model and the prunedtrees as guides The resulting alignments are composed ofhighly conserved blocks surrounded by noise due to indelsFinally Gblocks (Castresana 2000) was used to remove thenoise in the unreliable regions of the multiple sequence align-ments which may cause false positives when testing for pos-itive selection to keep only the well-aligned parts The defaultGblocks parameters for the codontype model (parameter ndasht = c) were used with the exception of tolerating half gaps perposition (parameter -b5=h) This procedure resulted in cleanand accurate multiple sequence alignments for the final com-putations of the site and branch-site models using CodeMLtogether with their corresponding gene trees

Analyses of Evolutionary Pressure

For each of the 14 coagulation genes within each of the fivemajor clades the selective pressures were estimated using thedifferent codon substitution site models implemented inCodeML of the phylogenetic analysis with maximum-likeli-hood software (PAML release 47) (Yang 2007) By analyzingthe protein-coding genes at the nucleotide level the synon-ymous or silent changes (nucleotide substitutions that do notchange the encoded amino acid) are distinguished from thenonsynonymous or missense changes (substitutions thatchange the encoded amino acid) Because natural selectionoperates at the protein level and can take any direction ofpositive neutral or negative selection the synonymous andnonsynonymous substitutions are subjected to different se-lective pressures which occur at the rates dS and dN respec-tively Hence the ratio of nonsynonymoussynonymoussubstitution rates (= dNdS) measures the selective pressureat the protein level The value of reveals the direction andstrength of natural selection acting on the protein Values oflt 1 = 1 and 4 1 indicate negative selection neutralevolution and positive selection respectively If an existingfunction or phenotype is evolutionarily favorable then nega-tive selection (or purifying selection) favors the conservationof that function or phenotype (conservation of amino acids)The other is positive selection (or Darwinian selection) whichfavors the promotion of a new function or phenotypeNeutral evolution describes the tolerance to amino acid mu-tations which are neither deleterious nor beneficial and arefixed according to genetic drift

The following five models were used to account for thevariation of dNdS among codon sites M1a (two dNdS ratioswith negative selection [0lt 1] and neutral evolution[1 = 1]) M2a (three dNdS ratios with negative selection[0lt 1] neutral evolution [1 = 1] and positive selection[2 1]) M7 (beta) M8 (beta and 24 1) and M8a (betaand 2 = 1) The alignments contain several positions inwhich one of the sequences is missing and the ldquocleandata = 0rdquooption was selected to retain all sites The inference of positive

3053

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

selection was conducted by performing LRTs in which thefollowing two pairs of models were compared namely M1awith M2a and M8a with M8 The principle is to compare anull model that does not allow24 1 (M1a or M8a) with analternative model that does (M2a or M8) Because the alter-nate model has more parameters than the null model theLRT follows a 2 distribution The LRT is a powerful methodfor comparing two hypotheses where one is the special caseof the other The BEB method (Yang et al 2005) was used tocalculate the posterior probabilities for site classes and theBEB value was used to identify sites under positive selectionwhere the 2 distribution of the LRT was significant withPlt 005 As many different LRT comparisons (differentgenes different clades) were conducted a control for thefalse-discovery rate (expected proportion of false positives)was applied using the q value package in R (Storey andTibshirani 2003) We adjusted the parameters of the q valueto perform a bootstrap analysis (pi0meth = ldquobootstraprdquo) andto tolerate 10 of false positives in our significant results(ldquofdrlevel = 01rdquo) A bootstrap analysis is necessary as the Pvalue distribution is bimodal (Studer et al 2008)

Two sources for the branch-site model were used First theSelectome database at httpselectomeunilch (last accessedJanuary 2014) (Proux et al 2009 Moretti et al 2014) providesthe precomputed results of the branch-site model (supple-mentary table S2b Supplementary Material online) based onthe multiple sequence alignments from the EnsemblCompara database (Vilella et al 2009) The last release ofSelectome (release 6) contains an optimized procedure ofalignment refinement and quality control similar to theone used for our site-model analyses as described aboveThe branch-site model allows the dNdS value to vary notonly between sites but also between branches (Zhang et al2005) The advantage of this model is its ability to detecttraces of episodic positive selection (Studer and Robinson-Rechavi 2009a) Second a branch-site model was determinedbased on our data set (see above) We used the vertebratetrees and their multiple sequence alignments as inputs forbranch-site model calculations The multiple alignments weremanually rechecked to remove barren sequences and thecorresponding vertebrate trees were edited accordinglyusing Newick Utilities Each taxonomic unit also known asthe branch or node of the tree was tagged using an auto-mated python script The CodeML program was run on eachof these tagged trees from all the genes for this branch-sitemodel (supplementary table S2b Supplementary Materialonline) For some alignments we found misaligned short in-sertions up to a maximum of ten amino acids in the con-served blocks probably due to gene model prediction errorsIn our whole data set these errors affect 0ndash4 sequences peralignment As the F2 gene family contained most of thesesmall errors albeit at a very low level this was reanalyzed byremoving four sequences with potential problems from thealignment The recalculation of the likelihood ratio deviatedonly slightly from the initial results (P value of 393e5 for 38sequences vs P value of 217e4 for 34 sequences) This sug-gested that these errors would only have a minor impact ifany on the detection of positive selection especially as the

coagulation factor sequences were long and the model ofcodon substitutions was averaged over all columns (manyhundreds) and sequences (between 23 and 47)

Stability Effect of Mutations

When possible for each coagulation factor the molecularstructures of the domain(s) were retrieved directly fromthe CATH database (Sillitoe et al 2013) Prior to the muta-tional analyses each structure was repaired with the FoldXprogram (Schymkowitz et al 2005) to remove potential stericclashes FoldX utilizes an empirical force field model to esti-mate the stability of a protein (Gwildtype in kcalmol) FoldXwas used to estimate the stability effect on each residueposition when the residue was replaced by any of the 19other amino acids The stability effect is determined bythe difference in the wildtype and mutant energies(G = GmutantGwildtype) This resulted in a stabilitylandscape of all potential mutations including the Gvalues both for the disease mutations and for all the aminoacid substitutions observed across vertebrate evolution

Statistical Analyses

All statistical analyses were performed with R v302 (R CoreTeam 2013) and SigmaPlot (Systat Software San Jose CA)The pipeline was developed with Biopython scripts (Cocket al 2009) The visualization of the multiple sequence align-ments was performed using Jalview 28 (Waterhouse et al2009)

Data Availability

The final sequence alignments are provided on a webpagehttpwwwbiochemuclacukpavithraevolutioncoagulationindexphp (last accessed January 2014) Pathogenic mu-tation data for three human coagulation factors FVIII FIX andFXI were retrieved from our web databases at httpwwwfactorviii-dborg httpwwwfactorixorg and httpwwwfactorxiorg (last accessed January 2014) (Saunders et al2009 Rallapalli et al 2013)

Supplementary MaterialSupplementary tables S1ndashS2 and figures S1ndashS3 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

PMR and SJP thank Pfizer UK for a medical educationalgrant that generously funded this project and the SpecialTrustees of the Royal Free Hospital the KatharineDormandy Trust for Hemophilia and Related Disorders andthe European Association for Hemophilia and AssociatedDisorders for additional support The authors also thank DrKeith Gomez Dr Geoffrey Kemball-Cook and ProfessorEdward G Tuddenham for their generous collaborative sup-port of our studies including access to the FVIII mutations atthe Hemophilia A Database at httpwwwfactorviii-dborgRAS acknowledges funding from the Fondation du 450eme

3054

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

anniversaire de lrsquoUniversite de Lausanne and Swiss NationalScience Foundation grants 132476 and 136477

ReferencesAmara U Rittirsch D Flierl M Bruckner U Klos A Gebhard F Lambris

JD Huber-Lang M 2008 Interaction between the coagulation andcomplement system Adv Exp Med Biol 63271ndash79

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19950ndash958

Broze GJ Jr 1995 Tissue factor pathway inhibitor and the revised theoryof coagulation Annu Rev Med 46103ndash112

Brunder W Schmidt H Karch H 1997 EspP a novel extracellular serineprotease of enterohaemorrhagic Escherichia coli O157H7 cleaveshuman coagulation factor V Mol Microbiol 24767ndash778

Butenas S vanrsquot Veer C Cawthern K Brummel KE Mann KG 2000Models of blood coagulation Blood Coagul Fibrinolysis 11(Suppl 1)S9ndashS13

Castresana J 2000 Selection of conserved blocks from multiple align-ments for their use in phylogenetic analysis Mol Biol Evol 17540ndash552

Cock PJ Antao T Chang JT Chapman BA Cox CJ Dalke A Friedberg IHamelryck T Kauff F Wilczynski B et al 2009 Biopython freelyavailable Python tools for computational molecular biology andbioinformatics Bioinformatics 251422ndash1423

de Beer TA Laskowski RA Parks SL Sipos B Goldman N Thornton JM2013 Amino Acid changes in disease-associated variants differ rad-ically from variants observed in the 1000 genomes project datasetPLoS Comput Biol 91ndash15

Dong Y Dimopoulos G 2009 Anopheles fibrinogen-related proteinsprovide expanded pattern recognition capacity against bacteriaand malaria parasites J Biol Chem 2849835ndash9844

Doolittle RF 2009 Step-by-step evolution of vertebrate blood coagula-tion Cold Spring Harb Symp Quant Biol 7435ndash40

Doolittle RF Jiang Y Nand J 2008 Genomic evidence for a simplerclotting scheme in jawless vertebrates J Mol Evol 66185ndash196

Fletcher W Yang Z 2010 The effect of insertions deletions and align-ment errors on the branch-site test of positive selection Mol BiolEvol 272257ndash2267

Flicek P Ahmed I Amode MR Barrell D Beal K Brent S Carvalho-SilvaD Clapham P Coates G Fairley S et al 2013 Ensembl 2013 NucleicAcids Res 41D48ndashD55

Junier T Zdobnov EM 2010 The Newick utilitieshigh-throughput phy-logenetic tree processing in the UNIX shell Bioinformatics 261669ndash1670

Kawthalkar S 2013 Overview of Physiology of blood In essentials ofhaematology New Delhi (India) Jaypee Brothers Medical Publishers(P) Ltd p 1ndash51

Kolb AW Ane C Brandt CR 2013 Using HSV-1 genome phylogeneticsto track past human migrations PLoS One 81ndash9

Kosiol C Vinar T da Fonseca RR Hubisz MJ Bustamante CD Nielsen RSiepel A 2008 Patterns of positive selection in six Mammalian ge-nomes PLoS Genet 4e1000144

Lombardo F Ghani Y Kafatos FC Christophides GK 2013Comprehensive genetic dissection of the hemocyte immune re-sponse in the malaria mosquito Anopheles gambiae PLoS Pathog9e1003145

Lopez C Saravia C Gomez A Hoebeke J Patarroyo MA 2010Mechanisms of genetically-based resistance to malaria Gene 4671ndash12

Loytynoja A 2014 Phylogeny-aware alignment with PRANK MethodsMol Biol 1079155ndash170

Loytynoja A Goldman N 2008 Phylogeny-aware gap placement pre-vents errors in sequence alignment and evolutionary analysisScience 3201632ndash1635

Masel J 2011 Genetic drift Curr Biol 21R837ndashR838Matsuka YV Pillai S Gubba S Musser JM Olmsted SB 1999 Fibrinogen

cleavage by the Streptococcus pyogenes extracellular cysteine

protease and generation of antibodies that inhibit enzyme proteo-lytic activity Infect Immun 674326ndash4333

Meyer A Van de Peer Y 2005 From 2R to 3R evidence for a fish-specificgenome duplication (FSGD) Bioessays 27937ndash945

Montoya-Burgos JI 2011 Patterns of positive selection and neutral evo-lution in the protein-coding genes of Tetraodon and Takifugu PLoSOne 6e24800

Moretti S Laurenczy B Gharib WH Castella B Kuzniar A Schabauer HStuder RA Valle M Salamin N Stockinger H et al 2014 Selectomeupdate quality control and computational improvements to a data-base of positive selection Nucleic Acids Res 42D917ndashD921

Morrissey JH Choi SH Smith SA 2012 Polyphosphate an ancient mol-ecule that links platelets coagulation and inflammation Blood 1195972ndash5979

Nathwani AC Tuddenham EG Rangarajan S Rosales C McIntosh JLinch DC Chowdary P Riddell A Pie AJ Harrington C et al 2011Adenovirus-associated virus vector-mediated gene transfer in hemo-philia B N Engl J Med 3652357ndash2365

Nielsen R Bustamante C Clark AG Glanowski S Sackton TB Hubisz MJFledel-Alon A Tanenbaum DM Civello D White TJ et al 2005 Ascan for positively selected genes in the genomes of humans andchimpanzees PLoS Biol 3e170

Panizzi P Friedrich R Fuentes-Prior P Richter K Bock PE Bode W 2005Fibrinogen substrate recognition by staphylocoagulase(pro)throm-bin complexes J Biol Chem 131179ndash1187

Proux E Studer RA Moretti S Robinson-Rechavi M 2009 Selectome adatabase of positive selection Nucleic Acids Res 37D404ndashD407

Puy C Tucker EI Wong ZC Gailani D Smith SA Choi SH Morrissey JHGruber A McCarty OJ 2013 Factor XII promotes blood coagulationindependent of factor XI in the presence of long-chain polypho-sphates J Thromb Haemost 111341ndash1352

R Core Team 2013 R a language and environment for statistical com-puting Vienna Austria R Foundation for Statistical Computing

Rallapalli PM Kemball-Cook G Tuddenham EG Gomez K Perkins SJ2013 An interactive mutation database for human coagulationfactor IX provides novel insights into the phenotypes and geneticsof hemophilia B J Thromb Haemost 111329ndash1340

Roux J Privman E Moretti S Daub JT Robinson-Rechavi M Keller L2014 Patterns of positive selection in seven ant genomes Mol BiolEvol 311661ndash1685

Saunders RE Shiltagh N Gomez K Mellars G Cooper C Perry DJTuddenham EG Perkins SJ 2009 Structural analysis of eight noveland 112 previously reported missense mutations in the interactiveFXI mutation database reveals new insight on FXI deficiencyThromb Haemost 102287ndash301

Schymkowitz J Borg J Stricher F Nys R Rousseau F Serrano L 2005 TheFoldX web server an online force field Nucleic Acids Res 33W382ndashW388

Sillitoe I Cuff AL Dessailly BH Dawson NL Furnham N Lee D Lees JGLewis TE Studer RA Rentzsch R et al 2013 New functionalfamilies (FunFams) in CATH to improve the mapping of con-served functional sites to 3D structures Nucleic Acids Res 41D490ndashD498

Spronk HM Govers-Riemslag JW ten CH 2003 The blood coagulationsystem as a molecular machine Bioessays 251220ndash1228

Stefl S Nishi H Petukh M Panchenko AR Alexov E 2013 Molecularmechanisms of disease-causing missense mutations J Mol Biol 4253919ndash3936

Storey JD Tibshirani R 2003 Statistical significance for genomewidestudies Proc Natl Acad Sci U S A 1009440ndash9445

Studer RA Penel S Duret L Robinson-Rechavi M 2008 Pervasive pos-itive selection on duplicated and nonduplicated vertebrate proteincoding genes Genome Res 181393ndash1402

Studer RA Robinson-Rechavi M 2009a Evidence for an episodic modelof protein sequence evolution Biochem Soc Trans 37783ndash786

Studer RA Robinson-Rechavi M 2009b Large-scale analyses of positiveselection using codon models In Pontarotti P editor Evolutionarybiology Berlin (Germany) Springer-Verlag Berlin Heidelbergp 217ndash235

3055

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Sun H 2006 The interaction between pathogens and the host coagu-lation system Physiology 21281ndash288

Sutherland MR Ruf W Pryzdial EL 2012 Tissue factor and glycoproteinC on herpes simplex virus type 1 are protease-activated receptor 2cofactors that enhance infection Blood 1193638ndash3645

Thomas CE Ehrhardt A Kay MA 2003 Progress and problemswith the use of viral vectors for gene therapy Nat Rev Genet 4346ndash358

Tokuriki N1 Stricher F Schymkowitz J Serrano L Tawfik DS 2007 Thestability effects of protein mutations appear to be universally dis-tributed J Mol Biol 3691318ndash1332

Vallender EJ Lahn BT 2004 Positive selection on the human genomeHum Mol Genet 13R245ndashR254

Vilella AJ Severin J Ureta-Vidal A Heng L Durbin R Birney E 2009EnsemblCompara GeneTrees complete duplication-aware phyloge-netic trees in vertebrates Genome Res 19327ndash335

Waterhouse AM Procter JB Martin DM Clamp M Barton GJ 2009Jalview Version 2mdasha multiple sequence alignment editor and anal-ysis workbench Bioinformatics 251189ndash1191

Wong WS Yang Z Goldman N Nielsen R 2004 Accuracy and power ofstatistical methods for detecting adaptive evolution in proteincoding sequences and for identifying positively selected sitesGenetics 1681041ndash1051

Yang Z 2006 Computational molecular evolution Oxford UniversityPress Oxford

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 241586ndash1591

Yang Z Wong WS Nielsen R 2005 Bayes empirical Bayes inference ofamino acid sites under positive selection Mol Biol Evol 221107ndash1118

Zhang J Nielsen R Yang Z 2005 Evaluation of an improved branch-sitelikelihood method for detecting positive selection at the molecularlevel Mol Biol Evol 222472ndash2479

3056

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

selection was conducted by performing LRTs in which thefollowing two pairs of models were compared namely M1awith M2a and M8a with M8 The principle is to compare anull model that does not allow24 1 (M1a or M8a) with analternative model that does (M2a or M8) Because the alter-nate model has more parameters than the null model theLRT follows a 2 distribution The LRT is a powerful methodfor comparing two hypotheses where one is the special caseof the other The BEB method (Yang et al 2005) was used tocalculate the posterior probabilities for site classes and theBEB value was used to identify sites under positive selectionwhere the 2 distribution of the LRT was significant withPlt 005 As many different LRT comparisons (differentgenes different clades) were conducted a control for thefalse-discovery rate (expected proportion of false positives)was applied using the q value package in R (Storey andTibshirani 2003) We adjusted the parameters of the q valueto perform a bootstrap analysis (pi0meth = ldquobootstraprdquo) andto tolerate 10 of false positives in our significant results(ldquofdrlevel = 01rdquo) A bootstrap analysis is necessary as the Pvalue distribution is bimodal (Studer et al 2008)

Two sources for the branch-site model were used First theSelectome database at httpselectomeunilch (last accessedJanuary 2014) (Proux et al 2009 Moretti et al 2014) providesthe precomputed results of the branch-site model (supple-mentary table S2b Supplementary Material online) based onthe multiple sequence alignments from the EnsemblCompara database (Vilella et al 2009) The last release ofSelectome (release 6) contains an optimized procedure ofalignment refinement and quality control similar to theone used for our site-model analyses as described aboveThe branch-site model allows the dNdS value to vary notonly between sites but also between branches (Zhang et al2005) The advantage of this model is its ability to detecttraces of episodic positive selection (Studer and Robinson-Rechavi 2009a) Second a branch-site model was determinedbased on our data set (see above) We used the vertebratetrees and their multiple sequence alignments as inputs forbranch-site model calculations The multiple alignments weremanually rechecked to remove barren sequences and thecorresponding vertebrate trees were edited accordinglyusing Newick Utilities Each taxonomic unit also known asthe branch or node of the tree was tagged using an auto-mated python script The CodeML program was run on eachof these tagged trees from all the genes for this branch-sitemodel (supplementary table S2b Supplementary Materialonline) For some alignments we found misaligned short in-sertions up to a maximum of ten amino acids in the con-served blocks probably due to gene model prediction errorsIn our whole data set these errors affect 0ndash4 sequences peralignment As the F2 gene family contained most of thesesmall errors albeit at a very low level this was reanalyzed byremoving four sequences with potential problems from thealignment The recalculation of the likelihood ratio deviatedonly slightly from the initial results (P value of 393e5 for 38sequences vs P value of 217e4 for 34 sequences) This sug-gested that these errors would only have a minor impact ifany on the detection of positive selection especially as the

coagulation factor sequences were long and the model ofcodon substitutions was averaged over all columns (manyhundreds) and sequences (between 23 and 47)

Stability Effect of Mutations

When possible for each coagulation factor the molecularstructures of the domain(s) were retrieved directly fromthe CATH database (Sillitoe et al 2013) Prior to the muta-tional analyses each structure was repaired with the FoldXprogram (Schymkowitz et al 2005) to remove potential stericclashes FoldX utilizes an empirical force field model to esti-mate the stability of a protein (Gwildtype in kcalmol) FoldXwas used to estimate the stability effect on each residueposition when the residue was replaced by any of the 19other amino acids The stability effect is determined bythe difference in the wildtype and mutant energies(G = GmutantGwildtype) This resulted in a stabilitylandscape of all potential mutations including the Gvalues both for the disease mutations and for all the aminoacid substitutions observed across vertebrate evolution

Statistical Analyses

All statistical analyses were performed with R v302 (R CoreTeam 2013) and SigmaPlot (Systat Software San Jose CA)The pipeline was developed with Biopython scripts (Cocket al 2009) The visualization of the multiple sequence align-ments was performed using Jalview 28 (Waterhouse et al2009)

Data Availability

The final sequence alignments are provided on a webpagehttpwwwbiochemuclacukpavithraevolutioncoagulationindexphp (last accessed January 2014) Pathogenic mu-tation data for three human coagulation factors FVIII FIX andFXI were retrieved from our web databases at httpwwwfactorviii-dborg httpwwwfactorixorg and httpwwwfactorxiorg (last accessed January 2014) (Saunders et al2009 Rallapalli et al 2013)

Supplementary MaterialSupplementary tables S1ndashS2 and figures S1ndashS3 are available atMolecular Biology and Evolution online (httpwwwmbeoxfordjournalsorg)

Acknowledgments

PMR and SJP thank Pfizer UK for a medical educationalgrant that generously funded this project and the SpecialTrustees of the Royal Free Hospital the KatharineDormandy Trust for Hemophilia and Related Disorders andthe European Association for Hemophilia and AssociatedDisorders for additional support The authors also thank DrKeith Gomez Dr Geoffrey Kemball-Cook and ProfessorEdward G Tuddenham for their generous collaborative sup-port of our studies including access to the FVIII mutations atthe Hemophilia A Database at httpwwwfactorviii-dborgRAS acknowledges funding from the Fondation du 450eme

3054

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

anniversaire de lrsquoUniversite de Lausanne and Swiss NationalScience Foundation grants 132476 and 136477

ReferencesAmara U Rittirsch D Flierl M Bruckner U Klos A Gebhard F Lambris

JD Huber-Lang M 2008 Interaction between the coagulation andcomplement system Adv Exp Med Biol 63271ndash79

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19950ndash958

Broze GJ Jr 1995 Tissue factor pathway inhibitor and the revised theoryof coagulation Annu Rev Med 46103ndash112

Brunder W Schmidt H Karch H 1997 EspP a novel extracellular serineprotease of enterohaemorrhagic Escherichia coli O157H7 cleaveshuman coagulation factor V Mol Microbiol 24767ndash778

Butenas S vanrsquot Veer C Cawthern K Brummel KE Mann KG 2000Models of blood coagulation Blood Coagul Fibrinolysis 11(Suppl 1)S9ndashS13

Castresana J 2000 Selection of conserved blocks from multiple align-ments for their use in phylogenetic analysis Mol Biol Evol 17540ndash552

Cock PJ Antao T Chang JT Chapman BA Cox CJ Dalke A Friedberg IHamelryck T Kauff F Wilczynski B et al 2009 Biopython freelyavailable Python tools for computational molecular biology andbioinformatics Bioinformatics 251422ndash1423

de Beer TA Laskowski RA Parks SL Sipos B Goldman N Thornton JM2013 Amino Acid changes in disease-associated variants differ rad-ically from variants observed in the 1000 genomes project datasetPLoS Comput Biol 91ndash15

Dong Y Dimopoulos G 2009 Anopheles fibrinogen-related proteinsprovide expanded pattern recognition capacity against bacteriaand malaria parasites J Biol Chem 2849835ndash9844

Doolittle RF 2009 Step-by-step evolution of vertebrate blood coagula-tion Cold Spring Harb Symp Quant Biol 7435ndash40

Doolittle RF Jiang Y Nand J 2008 Genomic evidence for a simplerclotting scheme in jawless vertebrates J Mol Evol 66185ndash196

Fletcher W Yang Z 2010 The effect of insertions deletions and align-ment errors on the branch-site test of positive selection Mol BiolEvol 272257ndash2267

Flicek P Ahmed I Amode MR Barrell D Beal K Brent S Carvalho-SilvaD Clapham P Coates G Fairley S et al 2013 Ensembl 2013 NucleicAcids Res 41D48ndashD55

Junier T Zdobnov EM 2010 The Newick utilitieshigh-throughput phy-logenetic tree processing in the UNIX shell Bioinformatics 261669ndash1670

Kawthalkar S 2013 Overview of Physiology of blood In essentials ofhaematology New Delhi (India) Jaypee Brothers Medical Publishers(P) Ltd p 1ndash51

Kolb AW Ane C Brandt CR 2013 Using HSV-1 genome phylogeneticsto track past human migrations PLoS One 81ndash9

Kosiol C Vinar T da Fonseca RR Hubisz MJ Bustamante CD Nielsen RSiepel A 2008 Patterns of positive selection in six Mammalian ge-nomes PLoS Genet 4e1000144

Lombardo F Ghani Y Kafatos FC Christophides GK 2013Comprehensive genetic dissection of the hemocyte immune re-sponse in the malaria mosquito Anopheles gambiae PLoS Pathog9e1003145

Lopez C Saravia C Gomez A Hoebeke J Patarroyo MA 2010Mechanisms of genetically-based resistance to malaria Gene 4671ndash12

Loytynoja A 2014 Phylogeny-aware alignment with PRANK MethodsMol Biol 1079155ndash170

Loytynoja A Goldman N 2008 Phylogeny-aware gap placement pre-vents errors in sequence alignment and evolutionary analysisScience 3201632ndash1635

Masel J 2011 Genetic drift Curr Biol 21R837ndashR838Matsuka YV Pillai S Gubba S Musser JM Olmsted SB 1999 Fibrinogen

cleavage by the Streptococcus pyogenes extracellular cysteine

protease and generation of antibodies that inhibit enzyme proteo-lytic activity Infect Immun 674326ndash4333

Meyer A Van de Peer Y 2005 From 2R to 3R evidence for a fish-specificgenome duplication (FSGD) Bioessays 27937ndash945

Montoya-Burgos JI 2011 Patterns of positive selection and neutral evo-lution in the protein-coding genes of Tetraodon and Takifugu PLoSOne 6e24800

Moretti S Laurenczy B Gharib WH Castella B Kuzniar A Schabauer HStuder RA Valle M Salamin N Stockinger H et al 2014 Selectomeupdate quality control and computational improvements to a data-base of positive selection Nucleic Acids Res 42D917ndashD921

Morrissey JH Choi SH Smith SA 2012 Polyphosphate an ancient mol-ecule that links platelets coagulation and inflammation Blood 1195972ndash5979

Nathwani AC Tuddenham EG Rangarajan S Rosales C McIntosh JLinch DC Chowdary P Riddell A Pie AJ Harrington C et al 2011Adenovirus-associated virus vector-mediated gene transfer in hemo-philia B N Engl J Med 3652357ndash2365

Nielsen R Bustamante C Clark AG Glanowski S Sackton TB Hubisz MJFledel-Alon A Tanenbaum DM Civello D White TJ et al 2005 Ascan for positively selected genes in the genomes of humans andchimpanzees PLoS Biol 3e170

Panizzi P Friedrich R Fuentes-Prior P Richter K Bock PE Bode W 2005Fibrinogen substrate recognition by staphylocoagulase(pro)throm-bin complexes J Biol Chem 131179ndash1187

Proux E Studer RA Moretti S Robinson-Rechavi M 2009 Selectome adatabase of positive selection Nucleic Acids Res 37D404ndashD407

Puy C Tucker EI Wong ZC Gailani D Smith SA Choi SH Morrissey JHGruber A McCarty OJ 2013 Factor XII promotes blood coagulationindependent of factor XI in the presence of long-chain polypho-sphates J Thromb Haemost 111341ndash1352

R Core Team 2013 R a language and environment for statistical com-puting Vienna Austria R Foundation for Statistical Computing

Rallapalli PM Kemball-Cook G Tuddenham EG Gomez K Perkins SJ2013 An interactive mutation database for human coagulationfactor IX provides novel insights into the phenotypes and geneticsof hemophilia B J Thromb Haemost 111329ndash1340

Roux J Privman E Moretti S Daub JT Robinson-Rechavi M Keller L2014 Patterns of positive selection in seven ant genomes Mol BiolEvol 311661ndash1685

Saunders RE Shiltagh N Gomez K Mellars G Cooper C Perry DJTuddenham EG Perkins SJ 2009 Structural analysis of eight noveland 112 previously reported missense mutations in the interactiveFXI mutation database reveals new insight on FXI deficiencyThromb Haemost 102287ndash301

Schymkowitz J Borg J Stricher F Nys R Rousseau F Serrano L 2005 TheFoldX web server an online force field Nucleic Acids Res 33W382ndashW388

Sillitoe I Cuff AL Dessailly BH Dawson NL Furnham N Lee D Lees JGLewis TE Studer RA Rentzsch R et al 2013 New functionalfamilies (FunFams) in CATH to improve the mapping of con-served functional sites to 3D structures Nucleic Acids Res 41D490ndashD498

Spronk HM Govers-Riemslag JW ten CH 2003 The blood coagulationsystem as a molecular machine Bioessays 251220ndash1228

Stefl S Nishi H Petukh M Panchenko AR Alexov E 2013 Molecularmechanisms of disease-causing missense mutations J Mol Biol 4253919ndash3936

Storey JD Tibshirani R 2003 Statistical significance for genomewidestudies Proc Natl Acad Sci U S A 1009440ndash9445

Studer RA Penel S Duret L Robinson-Rechavi M 2008 Pervasive pos-itive selection on duplicated and nonduplicated vertebrate proteincoding genes Genome Res 181393ndash1402

Studer RA Robinson-Rechavi M 2009a Evidence for an episodic modelof protein sequence evolution Biochem Soc Trans 37783ndash786

Studer RA Robinson-Rechavi M 2009b Large-scale analyses of positiveselection using codon models In Pontarotti P editor Evolutionarybiology Berlin (Germany) Springer-Verlag Berlin Heidelbergp 217ndash235

3055

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Sun H 2006 The interaction between pathogens and the host coagu-lation system Physiology 21281ndash288

Sutherland MR Ruf W Pryzdial EL 2012 Tissue factor and glycoproteinC on herpes simplex virus type 1 are protease-activated receptor 2cofactors that enhance infection Blood 1193638ndash3645

Thomas CE Ehrhardt A Kay MA 2003 Progress and problemswith the use of viral vectors for gene therapy Nat Rev Genet 4346ndash358

Tokuriki N1 Stricher F Schymkowitz J Serrano L Tawfik DS 2007 Thestability effects of protein mutations appear to be universally dis-tributed J Mol Biol 3691318ndash1332

Vallender EJ Lahn BT 2004 Positive selection on the human genomeHum Mol Genet 13R245ndashR254

Vilella AJ Severin J Ureta-Vidal A Heng L Durbin R Birney E 2009EnsemblCompara GeneTrees complete duplication-aware phyloge-netic trees in vertebrates Genome Res 19327ndash335

Waterhouse AM Procter JB Martin DM Clamp M Barton GJ 2009Jalview Version 2mdasha multiple sequence alignment editor and anal-ysis workbench Bioinformatics 251189ndash1191

Wong WS Yang Z Goldman N Nielsen R 2004 Accuracy and power ofstatistical methods for detecting adaptive evolution in proteincoding sequences and for identifying positively selected sitesGenetics 1681041ndash1051

Yang Z 2006 Computational molecular evolution Oxford UniversityPress Oxford

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 241586ndash1591

Yang Z Wong WS Nielsen R 2005 Bayes empirical Bayes inference ofamino acid sites under positive selection Mol Biol Evol 221107ndash1118

Zhang J Nielsen R Yang Z 2005 Evaluation of an improved branch-sitelikelihood method for detecting positive selection at the molecularlevel Mol Biol Evol 222472ndash2479

3056

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

anniversaire de lrsquoUniversite de Lausanne and Swiss NationalScience Foundation grants 132476 and 136477

ReferencesAmara U Rittirsch D Flierl M Bruckner U Klos A Gebhard F Lambris

JD Huber-Lang M 2008 Interaction between the coagulation andcomplement system Adv Exp Med Biol 63271ndash79

Anisimova M Bielawski JP Yang Z 2002 Accuracy and power of Bayesprediction of amino acid sites under positive selection Mol Biol Evol19950ndash958

Broze GJ Jr 1995 Tissue factor pathway inhibitor and the revised theoryof coagulation Annu Rev Med 46103ndash112

Brunder W Schmidt H Karch H 1997 EspP a novel extracellular serineprotease of enterohaemorrhagic Escherichia coli O157H7 cleaveshuman coagulation factor V Mol Microbiol 24767ndash778

Butenas S vanrsquot Veer C Cawthern K Brummel KE Mann KG 2000Models of blood coagulation Blood Coagul Fibrinolysis 11(Suppl 1)S9ndashS13

Castresana J 2000 Selection of conserved blocks from multiple align-ments for their use in phylogenetic analysis Mol Biol Evol 17540ndash552

Cock PJ Antao T Chang JT Chapman BA Cox CJ Dalke A Friedberg IHamelryck T Kauff F Wilczynski B et al 2009 Biopython freelyavailable Python tools for computational molecular biology andbioinformatics Bioinformatics 251422ndash1423

de Beer TA Laskowski RA Parks SL Sipos B Goldman N Thornton JM2013 Amino Acid changes in disease-associated variants differ rad-ically from variants observed in the 1000 genomes project datasetPLoS Comput Biol 91ndash15

Dong Y Dimopoulos G 2009 Anopheles fibrinogen-related proteinsprovide expanded pattern recognition capacity against bacteriaand malaria parasites J Biol Chem 2849835ndash9844

Doolittle RF 2009 Step-by-step evolution of vertebrate blood coagula-tion Cold Spring Harb Symp Quant Biol 7435ndash40

Doolittle RF Jiang Y Nand J 2008 Genomic evidence for a simplerclotting scheme in jawless vertebrates J Mol Evol 66185ndash196

Fletcher W Yang Z 2010 The effect of insertions deletions and align-ment errors on the branch-site test of positive selection Mol BiolEvol 272257ndash2267

Flicek P Ahmed I Amode MR Barrell D Beal K Brent S Carvalho-SilvaD Clapham P Coates G Fairley S et al 2013 Ensembl 2013 NucleicAcids Res 41D48ndashD55

Junier T Zdobnov EM 2010 The Newick utilitieshigh-throughput phy-logenetic tree processing in the UNIX shell Bioinformatics 261669ndash1670

Kawthalkar S 2013 Overview of Physiology of blood In essentials ofhaematology New Delhi (India) Jaypee Brothers Medical Publishers(P) Ltd p 1ndash51

Kolb AW Ane C Brandt CR 2013 Using HSV-1 genome phylogeneticsto track past human migrations PLoS One 81ndash9

Kosiol C Vinar T da Fonseca RR Hubisz MJ Bustamante CD Nielsen RSiepel A 2008 Patterns of positive selection in six Mammalian ge-nomes PLoS Genet 4e1000144

Lombardo F Ghani Y Kafatos FC Christophides GK 2013Comprehensive genetic dissection of the hemocyte immune re-sponse in the malaria mosquito Anopheles gambiae PLoS Pathog9e1003145

Lopez C Saravia C Gomez A Hoebeke J Patarroyo MA 2010Mechanisms of genetically-based resistance to malaria Gene 4671ndash12

Loytynoja A 2014 Phylogeny-aware alignment with PRANK MethodsMol Biol 1079155ndash170

Loytynoja A Goldman N 2008 Phylogeny-aware gap placement pre-vents errors in sequence alignment and evolutionary analysisScience 3201632ndash1635

Masel J 2011 Genetic drift Curr Biol 21R837ndashR838Matsuka YV Pillai S Gubba S Musser JM Olmsted SB 1999 Fibrinogen

cleavage by the Streptococcus pyogenes extracellular cysteine

protease and generation of antibodies that inhibit enzyme proteo-lytic activity Infect Immun 674326ndash4333

Meyer A Van de Peer Y 2005 From 2R to 3R evidence for a fish-specificgenome duplication (FSGD) Bioessays 27937ndash945

Montoya-Burgos JI 2011 Patterns of positive selection and neutral evo-lution in the protein-coding genes of Tetraodon and Takifugu PLoSOne 6e24800

Moretti S Laurenczy B Gharib WH Castella B Kuzniar A Schabauer HStuder RA Valle M Salamin N Stockinger H et al 2014 Selectomeupdate quality control and computational improvements to a data-base of positive selection Nucleic Acids Res 42D917ndashD921

Morrissey JH Choi SH Smith SA 2012 Polyphosphate an ancient mol-ecule that links platelets coagulation and inflammation Blood 1195972ndash5979

Nathwani AC Tuddenham EG Rangarajan S Rosales C McIntosh JLinch DC Chowdary P Riddell A Pie AJ Harrington C et al 2011Adenovirus-associated virus vector-mediated gene transfer in hemo-philia B N Engl J Med 3652357ndash2365

Nielsen R Bustamante C Clark AG Glanowski S Sackton TB Hubisz MJFledel-Alon A Tanenbaum DM Civello D White TJ et al 2005 Ascan for positively selected genes in the genomes of humans andchimpanzees PLoS Biol 3e170

Panizzi P Friedrich R Fuentes-Prior P Richter K Bock PE Bode W 2005Fibrinogen substrate recognition by staphylocoagulase(pro)throm-bin complexes J Biol Chem 131179ndash1187

Proux E Studer RA Moretti S Robinson-Rechavi M 2009 Selectome adatabase of positive selection Nucleic Acids Res 37D404ndashD407

Puy C Tucker EI Wong ZC Gailani D Smith SA Choi SH Morrissey JHGruber A McCarty OJ 2013 Factor XII promotes blood coagulationindependent of factor XI in the presence of long-chain polypho-sphates J Thromb Haemost 111341ndash1352

R Core Team 2013 R a language and environment for statistical com-puting Vienna Austria R Foundation for Statistical Computing

Rallapalli PM Kemball-Cook G Tuddenham EG Gomez K Perkins SJ2013 An interactive mutation database for human coagulationfactor IX provides novel insights into the phenotypes and geneticsof hemophilia B J Thromb Haemost 111329ndash1340

Roux J Privman E Moretti S Daub JT Robinson-Rechavi M Keller L2014 Patterns of positive selection in seven ant genomes Mol BiolEvol 311661ndash1685

Saunders RE Shiltagh N Gomez K Mellars G Cooper C Perry DJTuddenham EG Perkins SJ 2009 Structural analysis of eight noveland 112 previously reported missense mutations in the interactiveFXI mutation database reveals new insight on FXI deficiencyThromb Haemost 102287ndash301

Schymkowitz J Borg J Stricher F Nys R Rousseau F Serrano L 2005 TheFoldX web server an online force field Nucleic Acids Res 33W382ndashW388

Sillitoe I Cuff AL Dessailly BH Dawson NL Furnham N Lee D Lees JGLewis TE Studer RA Rentzsch R et al 2013 New functionalfamilies (FunFams) in CATH to improve the mapping of con-served functional sites to 3D structures Nucleic Acids Res 41D490ndashD498

Spronk HM Govers-Riemslag JW ten CH 2003 The blood coagulationsystem as a molecular machine Bioessays 251220ndash1228

Stefl S Nishi H Petukh M Panchenko AR Alexov E 2013 Molecularmechanisms of disease-causing missense mutations J Mol Biol 4253919ndash3936

Storey JD Tibshirani R 2003 Statistical significance for genomewidestudies Proc Natl Acad Sci U S A 1009440ndash9445

Studer RA Penel S Duret L Robinson-Rechavi M 2008 Pervasive pos-itive selection on duplicated and nonduplicated vertebrate proteincoding genes Genome Res 181393ndash1402

Studer RA Robinson-Rechavi M 2009a Evidence for an episodic modelof protein sequence evolution Biochem Soc Trans 37783ndash786

Studer RA Robinson-Rechavi M 2009b Large-scale analyses of positiveselection using codon models In Pontarotti P editor Evolutionarybiology Berlin (Germany) Springer-Verlag Berlin Heidelbergp 217ndash235

3055

Evolutionary Analysis of the Coagulation Factors doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Sun H 2006 The interaction between pathogens and the host coagu-lation system Physiology 21281ndash288

Sutherland MR Ruf W Pryzdial EL 2012 Tissue factor and glycoproteinC on herpes simplex virus type 1 are protease-activated receptor 2cofactors that enhance infection Blood 1193638ndash3645

Thomas CE Ehrhardt A Kay MA 2003 Progress and problemswith the use of viral vectors for gene therapy Nat Rev Genet 4346ndash358

Tokuriki N1 Stricher F Schymkowitz J Serrano L Tawfik DS 2007 Thestability effects of protein mutations appear to be universally dis-tributed J Mol Biol 3691318ndash1332

Vallender EJ Lahn BT 2004 Positive selection on the human genomeHum Mol Genet 13R245ndashR254

Vilella AJ Severin J Ureta-Vidal A Heng L Durbin R Birney E 2009EnsemblCompara GeneTrees complete duplication-aware phyloge-netic trees in vertebrates Genome Res 19327ndash335

Waterhouse AM Procter JB Martin DM Clamp M Barton GJ 2009Jalview Version 2mdasha multiple sequence alignment editor and anal-ysis workbench Bioinformatics 251189ndash1191

Wong WS Yang Z Goldman N Nielsen R 2004 Accuracy and power ofstatistical methods for detecting adaptive evolution in proteincoding sequences and for identifying positively selected sitesGenetics 1681041ndash1051

Yang Z 2006 Computational molecular evolution Oxford UniversityPress Oxford

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 241586ndash1591

Yang Z Wong WS Nielsen R 2005 Bayes empirical Bayes inference ofamino acid sites under positive selection Mol Biol Evol 221107ndash1118

Zhang J Nielsen R Yang Z 2005 Evaluation of an improved branch-sitelikelihood method for detecting positive selection at the molecularlevel Mol Biol Evol 222472ndash2479

3056

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from

Sun H 2006 The interaction between pathogens and the host coagu-lation system Physiology 21281ndash288

Sutherland MR Ruf W Pryzdial EL 2012 Tissue factor and glycoproteinC on herpes simplex virus type 1 are protease-activated receptor 2cofactors that enhance infection Blood 1193638ndash3645

Thomas CE Ehrhardt A Kay MA 2003 Progress and problemswith the use of viral vectors for gene therapy Nat Rev Genet 4346ndash358

Tokuriki N1 Stricher F Schymkowitz J Serrano L Tawfik DS 2007 Thestability effects of protein mutations appear to be universally dis-tributed J Mol Biol 3691318ndash1332

Vallender EJ Lahn BT 2004 Positive selection on the human genomeHum Mol Genet 13R245ndashR254

Vilella AJ Severin J Ureta-Vidal A Heng L Durbin R Birney E 2009EnsemblCompara GeneTrees complete duplication-aware phyloge-netic trees in vertebrates Genome Res 19327ndash335

Waterhouse AM Procter JB Martin DM Clamp M Barton GJ 2009Jalview Version 2mdasha multiple sequence alignment editor and anal-ysis workbench Bioinformatics 251189ndash1191

Wong WS Yang Z Goldman N Nielsen R 2004 Accuracy and power ofstatistical methods for detecting adaptive evolution in proteincoding sequences and for identifying positively selected sitesGenetics 1681041ndash1051

Yang Z 2006 Computational molecular evolution Oxford UniversityPress Oxford

Yang Z 2007 PAML 4 phylogenetic analysis by maximum likelihoodMol Biol Evol 241586ndash1591

Yang Z Wong WS Nielsen R 2005 Bayes empirical Bayes inference ofamino acid sites under positive selection Mol Biol Evol 221107ndash1118

Zhang J Nielsen R Yang Z 2005 Evaluation of an improved branch-sitelikelihood method for detecting positive selection at the molecularlevel Mol Biol Evol 222472ndash2479

3056

Rallapalli et al doi101093molbevmsu248 MBE at U

CL

Library Services on N

ovember 21 2014

httpmbeoxfordjournalsorg

Dow

nloaded from