Archaea: The First Domain of Diversified Life

Review Article

Gustavo Caetano-Anollés,1 Arshan Nasir,1 Kaiyue Zhou,1 Derek Caetano-Anollés,1 Jay E. Mittenthal,1 Feng-Jie Sun,2 and Kyung Mo Kim3

1 Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, Institute for Genomic Biology and Illinois Informatics Institute, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
2 School of Science and Technology, Georgia Gwinnett College, Lawrenceville, GA 30043, USA
3 Microbial Resource Center, Korea Research Institute of Bioscience and Biotechnology, Daejeon 305-806, Republic of Korea

Correspondence should be addressed to Gustavo Caetano-Anollés; gca@illinois.edu

Received 30 September 2013; Revised 15 February 2014; Accepted 25 March 2014; Published 2 June 2014

Academic Editor: Celine Brochier-Armanet

Copyright © 2014 Gustavo Caetano-Anollés et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Hindawi Publishing Corporation, Archaea, Volume 2014, Article ID 590214, 26 pages, http://dx.doi.org/10.1155/2014/590214

The study of the origin of diversified life has been plagued by technical and conceptual difficulties, controversy, and apriorism. It is now popularly accepted that the universal tree of life is rooted in the akaryotes and that Archaea and Eukarya are sister groups to each other. However, evolutionary studies have overwhelmingly focused on nucleic acid and protein sequences, which partially fulfill only two of the three main steps of phylogenetic analysis: formulation of realistic evolutionary models and optimization of tree reconstruction. In the absence of character polarization, that is, the ability to identify ancestral and derived character states, any statement about the rooting of the tree of life should be considered suspect. Here we show that macromolecular structure and a new phylogenetic framework of analysis that focuses on the parts of biological systems instead of the whole provide both deep and reliable phylogenetic signal and enable us to put forth hypotheses of origin. We review over a decade of phylogenomic studies, which mine information in a genomic census of millions of encoded proteins and RNAs. We show how the use of process models of molecular accumulation that comply with Weston's generality criterion supports a consistent phylogenomic scenario in which the origin of diversified life can be traced back to the early history of Archaea.

1. Introduction

"Imagine a child playing in a woodland stream, poking a stick into an eddy in the flowing current, thereby disrupting it. But the eddy quickly reforms. The child disperses it again. Again it reforms, and the fascinating game goes on. There you have it! Organisms are resilient patterns in a turbulent flow, patterns in an energy flow." (Carl Woese [1])

Understanding the origin of diversified life is a challenging proposition. It involves the use of ideographic thinking that is historical and retrodictive, as opposed to nomothetic explorations that are universal and predictive [2]. Experimental science for the most part is nomothetic; the search for truth comes from universal statements that can be conceptualized as being of general predictive utility. Nomothetic explorations are in general both philosophically and operationally less complex to pursue than any ideographic exploration. In contrast, retrodictions speak about singular or plural events in history that must be formalized by "transformations" that comply with a number of evolutionary axioms [3] and interface with a framework of maximization of explanatory power [4].

The fundamental statement that organismal diversity is the product of evolution is supported by an ensemble of three nested primary axioms of the highest level of universality [3]: (i) evolution occurs, including its principle that history of change entails spatiotemporal continuity (sensu Leibnitz), (ii) only one historical account of all living or extinct entities of life and their component parts exists as a consequence of descent with modification, and (iii) features of those entities (characters) are preserved through generations via genealogical descent. History must comply with the "principle of continuity," which crucially supports evolutionary thinking. The axiomatic rationale of "natura non facit saltum" highlighted by Leibnitz, Linnaeus, and Newton must be considered a generality of how natural things change and a "fruitful principle of discovery." We note that this axiomatic generality, which we have discussed in the context of origin of life research [5], encompasses rare punctuations (e.g., quantum leap changes such as genome duplications and rearrangements or the rare evolutionary appearance of new fold structures) embedded in a fabric of gradual change (e.g., changes induced by point mutations). Both gradual changes and punctuations are interlinked and are always expressed within spatiotemporal continuity (e.g., structural punctuations in the mappings of sequences into structures of RNA [6]). This interpretative framework can explain novelty and complexity with principles of scientific inquiry that maximize the explanatory power of assertions about retrodictions.

Phylogenetic theories are embodied in evolutionary "trees" and "models." Trees (phylogenies) are multidimensional statements of relationship of the entities that are studied (phylogenetic taxa). Models are evolutionary transformations of the biological attributes examined in data (phylogenetic characters), which define the relationships of taxa in trees. The tripartite interaction between characters, models, and trees must occur in ways that enhance retrodictive power through test and corroboration [4]. In other words, it must follow the Popperian pillars of scientific inquiry or suitable philosophical analogs. We note that retrodictive statements allow drawing inferences about the past by using information that is extant (i.e., that we can access today) and is necessarily modern. The challenge of travelling back in time rests on not only making inferences about archaic biology with information drawn from modern biological systems but also interpreting the retrodictive statements without conceptual restrictions imposed by modernity. This has been an important obstacle to historical understanding, starting with grading hypotheses inspired by Aristotle's great chain of being, the scala naturae.

It was Willi Hennig in the fifties who first formalized retrodiction in quantitative terms [7]. Since then, his "phylogenetic systematics" has benefitted from numerous conceptual and bioinformatics developments, which are now responsible for modern phylogenetic analysis of systems of any kind, from molecules and organisms to language and culture, from engineering applications to astrophysics. Astrocladistics, for example, focuses on the evolution and diversification of galaxies caused by transforming events such as accretion, interaction, and mergers (e.g., [8]). While major views have emerged in the "discovery operations" (sensu [2]) of the phylogenetic systematics paradigm, including maximum parsimony and the frequentist and uncertainty views of maximum likelihood and Bayesian thinking, the major technical and philosophical challenges persist [9]. More importantly, as we will explain below, technical and philosophical aspects of the ideographic framework in some cases have been turned into landscapes of authoritarianism and apriorism [3]. This insidious trend is pervasive in the "rooting of the tree of life" field of inquiry [10] that underlies the origins of biochemistry and biodiversity we here discuss.

In this opinion paper, we address the challenges of finding an origin to biodiversity and propose a new framework for deep phylogenetic analysis that focuses on the parts of biological systems instead of the whole. We review the application of this framework to data drawn from structural and functional genomics and argue that the origin of cellular life involved gradual accretion of molecular interactions and the rise of hierarchical and modular structure. We discuss our findings, which provide strong support to the very early rise of primordial archaeal lineages and the emergence of Archaea as the first domain of diversified cellular life (superkingdom). The term "domain" of life stresses the cohesiveness of the organism supergroup, very much like domains in proteins and nucleic acids stress the molecular cohesiveness of their atomic makeup. Instead, the term "superkingdom" (superregnum) makes explicit the fact that there is a nested hierarchy of groups of organisms, many of which share common ancestors (i.e., they are monophyletic). We propose that the rise of emerging lineages was embedded in a primordial "evolutionary grade" (sensu Huxley [11]), a group of diversifying organisms (primordial archaeons) in active transition that were initially unified by the same and archaic level of physiological complexity. Our discussion will attempt to reconcile some divergent views of the origin of diversified life and will provide a generic scenario for "turning points" of origin that may be recurrent in biology.

2. A Tripartite World of Organismal Diversity

Carl Woese and his colleagues of the Urbana School were responsible for the groundbreaking discovery that the world of organisms was tripartite; that is, it encompassed not two but three major "domains" of cellular life (Archaea, Bacteria, and Eukarya). Two of the three "aboriginal" lines of descent were initially conceptualized as "urkingdoms" of deep origin that were microbial and qualitatively different from eukaryotic organisms [12]. They corresponded to Archaea and Bacteria. The discovery of Archaea challenged the established akaryote/eukaryote divide that supported "ladder" scenarios of gradual evolution from simplistic microbes to "higher" organisms, which were tenaciously defended by molecular biologists and microbiologists of the time. (We use the term "akaryote" to describe a cell without a bona fide nucleus. This term complements the word "eukaryote" ("eu," good, and "karyon," kernel), which is ahistorical. The new term takes away the time component of the widely used "prokaryote" ("pro," before) definition, which may be incorrect for many organisms of the microbial domains.) Woese and Fox [12] made it clear: "Evolution seems to progress in a 'quantized' fashion. One level or domain of organization gives rise ultimately to a higher (more complex) one. What 'prokaryote' and 'eukaryote' actually represent are two such domains. Thus, although it is useful to define phylogenetic patterns within each domain, it is not meaningful to construct phylogenetic classifications between domains: Prokaryotic kingdoms are not comparable to eukaryotic ones." The discovery was revolutionary, especially because the akaryote/eukaryote divide was deeply rooted in the scala naturae and microbes were considered primitive forms that did not warrant equal standing when compared to the complex organization of Eukarya (see [13] for a historical account). The significance of the tripartite world was quickly realized and vividly resisted by the establishment. Its resistance is still embodied today in new proposals of origins, such as the archaeon-bacterium fusion hypothesis used to explain the rise of Eukarya (see below).

It is noteworthy that the root of the universal tree of cellular organisms, the "tree of life" (ToL), was initially not the driving issue. This changed when the sequences of proteins that had diverged by gene duplication prior to a putative universal common ancestor were analyzed with phylogenetic methods and the comparisons used to root the ToL [14, 15]. Paralogous gene couples included elongation factors (e.g., EF-Tu and EF-G), ATPases (α and β subunits), signal recognition particle proteins, and carbamoyl phosphate synthetases, all believed to be very ancient (reviewed in [16]). In many cases, bacterial sequences were the first to branch (appeared at the base) in the reconstructed trees, forcing archaeal and eukaryal sequences to be sister groups to each other. This "canonical" rooting scheme of the ToL (Figure 1(a)) was accepted as fact and was quickly endorsed by the supporters of the Urbana School [17]. In fact, the acceptance of the "canonical" rooting in Bacteria became so deep that it has now prompted the search for the origins of Eukarya in the molecular and physiological constitutions of the putative archaeal sister group [18]. For example, Embley and coworkers generated sequence-based phylogenies using conserved proteins and advanced algorithms to show that Eukarya emerged from within Archaea [19–21] (refer to [22] for critical analysis).

Importantly, these analyses suffer from technical and logical problems that are inherent in sequence-based tree reconstructions. For example, proteins such as elongation factors, tRNA synthetases, and other universal proteins used in their analyses are prone to high substitution rates [23]. Mathematically, this leads to a loss of information regarding the root of the ToL, as shown by Sober and Steel [24] (refer to [25] for more discussion). On the other hand, paralogous rootings sometimes contradicted each other and were soon and rightly considered weak and unreliable [23, 26, 27]. The validity of paralogy-based rooting methodology has proven to be severely compromised by a number of problems and artifacts of sequence analysis (e.g., long branch attraction, mutational saturation, taxon sampling, horizontal gene transfer (HGT), hidden paralogy, and historical segmental gene heterogeneity). Consequently, there is no proper outgroup that can be used to root a ToL that is built from molecular sequences, and currently there are no proper models of sequence evolution that can provide a reliable "evolutionary arrow." Because of this fact, archaeal and eukaryal rootings should be considered as probable as the canonical bacterial rooting (Figure 1(c)). This is an important realization that needs to be explicitly highlighted, especially because it affects evolutionary interpretations and the likelihood of scenarios of origins of diversified life.
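The mutational saturation mentioned above can be made concrete with a small numerical sketch. It assumes the simplest textbook substitution model (all character states equally frequent and equally exchangeable, as in the Jukes-Cantor model for nucleotides); this model and the function name `expected_p_distance` are illustrative assumptions and are not taken from the studies cited. Under that assumption, the observed fraction of differing sites levels off as the true amount of change grows, so very deep divergences become nearly indistinguishable from one another.

```python
import math

def expected_p_distance(d, states=4):
    """Expected fraction of sites that differ between two sequences that are
    separated by d substitutions per site, assuming all `states` character
    states are equally frequent and equally exchangeable."""
    ceiling = (states - 1) / states          # value approached at full saturation
    return ceiling * (1.0 - math.exp(-d / ceiling))

if __name__ == "__main__":
    # Observed difference grows ever more slowly as true divergence increases.
    for d in (0.1, 0.5, 1.0, 2.0, 5.0):
        nucleotides = expected_p_distance(d, states=4)
        amino_acids = expected_p_distance(d, states=20)
        print(f"d={d:>4}: nucleotides {nucleotides:.3f}, amino acids {amino_acids:.3f}")
```

Because the curve flattens toward its ceiling (0.75 for four states, 0.95 for twenty), small errors in observed differences translate into very large uncertainty in the inferred depth of ancient branches.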

3. Mining Ancient Phylogenetic Signal in Universal Molecules

Woese's crucial insight was the explicit selection of the ribosome for evolutionary studies. The universality of the ribosomal ensemble and its central role in protein synthesis ensured it carried an ancient and overriding memory of the cellular systems that were studied. This was made evident in the first ToL reconstructions. In contrast, many of the proteins encoded by paralogous gene couples (e.g., translation factors) likely carried convoluted histories of protein domain cooption or important phylogenetic biases induced, for example, by mutational saturation in their protein sequences. The ribosome is indeed the central feature of cellular life, the signature of "ribocells." However, its embedded phylogenetic signatures are also convoluted. The constitution of the ribosome is heterogeneous. The ribosome represents an ensemble of 3-4 ribosomal RNA (rRNA) and ~70 protein (r-protein) molecules, depending on the species considered, and embodies multiple interactions with the cellular milieu needed for function (e.g., assembly and disassembly, interactions with the membrane of the endoplasmic reticulum). A group of 34 r-proteins is present across cellular life [28]. Ribosomal history has been shown to involve complex patterns of protein-RNA coevolution within the evolutionarily conserved core [29]. These patterns are expressed distinctly in its major constituents. While both of its major subunits evolved in parallel, a primordial core that embodied both processive and catalytic functions was established quite early in evolution. This primordial core was later accessorized with structural elements (e.g., accretion of numerous rRNA helical segments and stabilizing A-minor interactions) and r-proteins (e.g., the L7/L12 protein complex that stimulates the GTPase activity of EF-G) that enhanced its functional properties. This included expansion elements in the structure of rRNA that were specific not only to subunits but also to individual domains of life. Figure 2, for example, shows a phylogenomic model of ribosomal molecular accretion derived from the survey of protein domain structures in genomes and substructures in rRNA molecules. The accretion of component parts of the universal core appears to have been a painstakingly slow process that unfolded during a period of ~2 billion years and overlapped with the first episodes of organismal diversification [30].

Despite this complexity, the focus of biodiversity studies was for decades the small subunit rRNA molecule [31]. This focus has not changed much in recent years. Consequently, the history of organisms and populations is currently recounted by the information seated in the small subunit rRNA molecule. In other words, the historical narrative generally comes from only ~1% of ribosomal molecular constitution. This important and unacknowledged bias was already made explicit in early phylogenetic studies. For example, de Rijk et al. [32] demonstrated that phylogenies reconstructed from the small and large subunit rRNA molecules were different and that the reconstructions from the large subunit were more robust and better suited to establish wide-range relationships. The structure of the small and large subunits was also shown to carry distinct phylogenetic signatures [33]. However, only a few evolutionary studies have combined small and large subunit rRNA for history reconstruction. Remarkably, in all of these cases, phylogenetic signal was significantly improved (e.g., [34]).

[Figure 1 panels: (a) unrooted and rooted node-based trees of Eukarya (E), Archaea (A), and Bacteria (B), with the urancestor (U) and the ancestor of the sister groups (a); (b) stem-based tree showing stem and crown groups, the origin of lineages, and a time axis; (c) canonical, eukaryal, and archaeal rootings along a time axis.]

Figure 1: Rooting the tree of life (ToL): an exercise for "tree-thinkers." (a) A node-based tree representation of the ToL focuses on taxa (sampled or inferred vertices illustrated with open circles) and models edges as ancestry relationships. The unrooted tree describes taxa as conglomerates of extant species for each domain of life, Archaea (A), Bacteria (B), and Eukarya (E) (abbreviations are used throughout the paper). Adding a universal ancestor vertex (rooting the ToL) implies adding a reconstructed entity (urancestor) that roots the tree but does not model ancestry relationships. The vertex is either a "most recent universal ancestor" if one defines it from an ingroup perspective or a "last universal common ancestor" (LUCA) if the definition relates to an outgroup perspective. Rooting the tree enables defining "total lineages," which are lists of ancestors spanning from the ancestral taxon to the domain taxa. (b) A stem-based view focuses on edges (branches), which are sampled and inferred; ancestral taxa are viewed as lineages under the paradigm of descent with modification. Vertices correspond to speciation events. Terminal edges represent conglomerates of lineages leading to domains of life (the terminal nodes of node-based trees), and the ancestral stem represents the lineage of the urancestor (U) (double arrowhead line). Total lineages are simply a chain of edges that goes back in time and ends in the ancestral stem. Both node-based and stem-based tree representations are mathematically isomorphic, but they are not equal [154]. They change the concept of monophyletic and paraphyletic relationships. A node-based clade starts with a lineage at the instant of the splitting event, incorporating the ancestor into the makeup of the clade. In contrast, a stem-based clade originates with a planted branch on the tree, where the branch represents a lineage between two lineage splitting events. Planting an ancestral stem defines an origin of lineages and the first speciation event in the record of life. This delimits a crown clade of two domain lineages in the stem-based tree (labeled with black lines) that includes the ancestor of the sister groups, a, and a stem domain group at its base. (c) The three possible rootings of the ToL depicted with stem-based tree representations. Terminal edges are labeled with thin lines, and these conglomerate lineages can include stem and crown groups.
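As a minimal sketch of panel (c), the three rootings of a three-domain tree can be enumerated by planting the ancestral stem on each terminal edge in turn. The function names and the use of Newick strings are illustrative assumptions; the equal weights simply restate the argument that, without character polarization, no rooting can be preferred a priori.

```python
from fractions import Fraction

DOMAINS = ("Archaea", "Bacteria", "Eukarya")

def possible_rootings(taxa=DOMAINS):
    """Map each candidate basal domain to the rooted topology (Newick string)
    obtained by planting the ancestral stem on that domain's terminal edge."""
    rooted = {}
    for basal in taxa:
        sister_a, sister_b = [t for t in taxa if t != basal]
        # The basal domain branches first; the remaining two are sister groups.
        rooted[basal] = f"({basal},({sister_a},{sister_b}));"
    return rooted

def uninformed_prior(taxa=DOMAINS):
    """Equal weight for every rooting when no character polarization is available."""
    return {basal: Fraction(1, len(taxa)) for basal in taxa}

if __name__ == "__main__":
    for basal, newick in possible_rootings().items():
        print(f"{basal} basal: {newick} weight {uninformed_prior()[basal]}")
```

In this encoding, the canonical rooting corresponds to Bacteria as the basal domain, the archaeal rooting to Archaea, and the eukaryal rooting to Eukarya.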

4. Building a Tower of Babel from a Comparative Genomic Patchwork of Sequence Homologies

Molecular evolutionists were cognizant of the limitation of looking at the history of only a few component parts, which by definition could be divergent. When genomic sequences became widely available, pioneers jumped onto the bandwagon of evolutionary genomics and the possibility of gaining systemic knowledge from entire repertoires of genes and molecules (e.g., [35–39]). The genomic revolution, for example, quickly materialized in gene content trees that reconstructed the evolution of genomes directly from their evolutionary units, the genes (e.g., [37, 39, 40]), or the domain constituents of the translated proteins [35, 41]. The sequences of multiple genes were also combined or concatenated in attempts to extract deep phylogenetic signal [42–44]. The results that were obtained consistently supported the tripartite world, backing up the claims of the Urbana School. However, clues about ancestors and lineages leading to extant taxa were still missing.
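To illustrate why gene content alone yields relationships but no ancestors, the following sketch computes pairwise distances from presence/absence of gene families. The Jaccard distance, the genome names, and the family labels are assumptions chosen for illustration; the cited studies used their own measures and data.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """1 minus the fraction of gene families shared by two genomes."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def gene_content_distances(genomes):
    """Pairwise distances between genomes described as sets of gene families."""
    return {
        (x, y): jaccard_distance(genomes[x], genomes[y])
        for x, y in combinations(sorted(genomes), 2)
    }

# Hypothetical toy repertoires; labels carry no biological meaning.
toy_genomes = {
    "genome_A": {"fam1", "fam2", "fam3"},
    "genome_B": {"fam1", "fam2", "fam4"},
    "genome_C": {"fam1", "fam5"},
}

if __name__ == "__main__":
    for pair, distance in gene_content_distances(toy_genomes).items():
        print(pair, distance)
```

The measure is symmetric: it says how different two repertoires are but not which families were gained or lost, so a tree built from such a matrix has no intrinsic direction of change.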

[Figure 2 panels: ribosomal timeline from the primordial ribosome through the mature PTC, the processive ribosome, and the mature ribosome to the present-day ribosome; timeline of life from the origin of proteins through the urancestor to the present, with the ribosomal core and diversified life marked; axes: molecular age (nd) and history of life (time, billions of years; 3.8 to 0).]

Figure 2: The evolutionary history of the ribosome was traced onto the three-dimensional structure of its rRNA and r-protein components. Ages of components were colored with hues from red (ancient) to blue (recent). Phylogenomic history shows that primordial metabolic enzymes preceded RNA-protein interactions and the ribosome. The timeline of life, derived from a universal phylogenetic tree of protein domain structure at fold superfamily level of complexity, is shown with time flowing from left to right and expressed in billions of years according to a molecular clock of fold structures. The ribosomal timeline highlights major historical events of the molecular ensemble. History of the conserved ribosomal core [29] overlaps with that of diversified life [48], suggesting that episodes of cooption and lateral transfer have pervaded early ribosomal evolution.

With few exceptions that focused on the structure of RNA and protein molecules (see below), analyses based on genomic sequences, gene content, gene order, and other genomic characteristics were unable to produce rooted trees without the help of outgroups, additional ad hoc hypotheses that are external to the group of organisms being studied and generally carry strong assumptions. An "arrow of time" was not included in the models of genomic evolution that were used. As with sequences, the ToLs that were generated were unrooted (Figure 1(a)) and generally rooted a posteriori, either claiming that the canonical root was correct or making assumptions about character change that may be strictly incorrect. In a recent example, distance-based approaches were used to build universal network trees from gene families defined by reciprocal best BLAST hits [45]. These networks showed a midpoint rooting of the ToL between Bacteria and Archaea. However, this rooting involves a complex optimization of path lengths in the split networks and critically assumes that lineages evolved at roughly similar rates. This diminishes the confidence of the midpoint rooting, especially when considering the uncertainties of distances inferred from BLAST analyses and the fact that domains in genes hold different histories and rates of change. Another promising but ill-conceptualized case is the rooting of distance-based trees inferred by studying the frequency of l-mer sets of amino acids in proteins at the proteome level [46]. The compositional data generated a ToL that was rooted in Eukarya when randomized proteome sequences were used as outgroups. The assumption of randomness associated with the root of a ToL is, however, unsupported or probably wrong, especially because protein sequence space and its mappings to structure are far from random [47]. More importantly, a large fraction of modern proteins had already evolved protein fold structures when the first diversified lineages arose from the universal common ancestor of cellular life [48]. These evolved repertoires cannot be claimed to be random.
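The dependence of midpoint rooting on equal evolutionary rates can be seen in a small sketch. It uses a plain three-taxon tree rather than the split networks of the cited study, and the branch lengths, the internal node name `x`, and all function names are hypothetical. The root is placed halfway along the longest leaf-to-leaf path, so lengthening one terminal branch is enough to move it.

```python
from itertools import combinations

def path_between(tree, start, goal):
    """Return the list of nodes on the unique path from start to goal."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for neighbor in tree[node]:
            if neighbor not in path:
                stack.append((neighbor, path + [neighbor]))
    raise ValueError("nodes are not connected")

def path_length(tree, path):
    """Sum of branch lengths along a path of adjacent nodes."""
    return sum(tree[a][b] for a, b in zip(path, path[1:]))

def midpoint_root(tree):
    """Return (edge, offset): the edge carrying the midpoint of the longest
    leaf-to-leaf path and the midpoint's distance from the edge's first node."""
    leaf_nodes = [n for n in tree if len(tree[n]) == 1]
    longest = max(
        (path_between(tree, a, b) for a, b in combinations(leaf_nodes, 2)),
        key=lambda p: path_length(tree, p),
    )
    half = path_length(tree, longest) / 2.0
    walked = 0.0
    for a, b in zip(longest, longest[1:]):
        if walked + tree[a][b] >= half:
            return (a, b), half - walked
        walked += tree[a][b]

def star(a_len, b_len, e_len):
    """Unrooted tree joining A, B, and E through one internal node 'x'."""
    lengths = {"A": a_len, "B": b_len, "E": e_len}
    tree = {"x": dict(lengths)}
    for leaf, length in lengths.items():
        tree[leaf] = {"x": length}
    return tree

# Clock-like case: every tip is 2.0 units from a root lying on the B branch.
clock_like = star(1.0, 3.0, 1.0)
# Same history, but the E terminal branch has accumulated change faster.
fast_eukarya = star(1.0, 3.0, 4.0)

if __name__ == "__main__":
    print(midpoint_root(clock_like))
    print(midpoint_root(fast_eukarya))
```

In the clock-like toy tree the midpoint falls on the B branch, where the root was placed by construction; in the second tree the identical topology is rooted on the E branch purely because that lineage was given a longer branch.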

The inability of genomics to provide clear answers to rooting questions promoted (to some extent) parsimony thinking and the exploration of evolutionary differences of significance that could act as "anchors" and could impart "polarity" to tree statements [49]. As we will describe below, we applied parsimony thinking to the evolution of protein and RNA structures [41, 50], and for over a decade we have been generating rooted ToLs that portray the evolution of proteomes with increasing explanatory power. This work, however, has been for the most part unacknowledged. Instead, molecular evolutionists focused almost exclusively on molecular sequences in their search for solutions to the ToL problem. For example, analysis of genomic insertions and deletions (indels) that are rare in paralogous gene sets rooted the ToL between the group of Actinobacteria and Gram-negative bacteria and the group of Firmicutes and Archaea [51]. Unfortunately, little is known about the dynamics of indel generation, its biological role in gene duplication, and its influence on the structure of paralogous protein pairs (e.g., [52]). If the dynamics are close to those of sequences, indels should be considered subject to all the same limitations of sequence analyses, including long branch attraction artifacts, character independence, and inapplicable characters. Given a tree topology, parsimony was also used to optimize the evolutionary transformation of genomic features. For example, when considering the occurrence of protein domains that are abundant, variants of domain structures accumulate gradually in genomes; operationally, the domain structure per se cannot be lost once it has been gained. This enabled the use of a variant of the unrooted Dollo algorithmic method (e.g., [53]). The Dollo parsimony model [54] is based on the assumption that when a biological feature that is very complex is lost in evolution, it cannot be regained through vertical descent. The method, however, could not be applied to akaryotic genomes, as these are subject to extensive lateral gene transfer, and until recently [55] was not extended to ToL reconstructions.
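The Dollo assumption can be sketched on a rooted toy tree: a complex feature is gained once, on the stem of the smallest clade containing every taxon that has it, and every absence inside that clade is explained by loss. The rooted nested-tuple tree, the taxon labels, and the function names are assumptions made for illustration; the studies cited used an unrooted variant of the method.

```python
def leaves(tree):
    """Set of leaf names in a rooted tree written as nested tuples of strings."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def count_losses(node, present):
    """Number of loss events needed at or below a node whose ancestor had the feature."""
    if not (leaves(node) & present):
        return 1            # the whole subtree lacks the feature: one loss on its stem
    if isinstance(node, str):
        return 0
    return sum(count_losses(child, present) for child in node)

def dollo(tree, present):
    """Return (clade_of_origin, losses) for a feature found in the taxa `present`,
    allowing a single gain and no regain after loss."""
    present = set(present)
    if not (leaves(tree) & present):
        raise ValueError("feature is absent from every taxon")
    node = tree
    # Descend while a single child contains every taxon that has the feature.
    while not isinstance(node, str):
        holders = [child for child in node if leaves(child) & present]
        if len(holders) > 1:
            break
        node = holders[0]
    return node, count_losses(node, present)

# Hypothetical rooted tree with two sampled taxa per domain.
toy_tree = (("A1", "A2"), (("B1", "B2"), ("E1", "E2")))

if __name__ == "__main__":
    print(dollo(toy_tree, {"B1", "E1", "E2"}))   # compact distribution
    print(dollo(toy_tree, {"A1", "E1"}))         # patchy distribution
```

A patchy distribution such as presence in `A1` and `E1` only forces the gain back to the root and requires three independent losses. A single lateral transfer could explain the same pattern, but the model cannot represent it, which is consistent with the difficulty of applying the method to akaryotic genomes noted above.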

Parsimony thinking was also invoked in "transition analysis," a method that attempts to establish polarity of character change by, for example, examining homologies of proteins in proteolytic complexes such as the proteasome, membrane and cell envelope biochemical makeup, and body structures of flagella, sometimes aided by BLAST queries [56, 57]. The elaboration, however, is restricted to the few molecular and cellular structures that are analyzed, out of thousands that populate the akaryotic and eukaryotic cells. The approach is local, has not yet made use of an objective analytic phylogenetic framework, and does not weigh gains, losses, and transfers with algorithmic implementations. Thus, many statements can fall prey to incorrect optimizations or processes of convergence of structure and function, including cooptions that are common in metabolic enzymes [58]. Transitions also fail to consider the molecular makeup of the complexes examined (e.g., evolution of the photosynthetic reaction centers), which often hold domains with heterogeneous histories in the different organismal groups. For example, the F1Fo ATP synthetase complex that powers cellular processes has a long history of accretion of domains that span almost 3.8 billion years of history, which is even older than that of the ribosome [59]. Finally, establishing the validity of evolutionary transitions in polarization schemes can be highly problematic; each transition that is studied requires well-grounded assumptions [60], some of which have not yet been properly elaborated.

The inability to solve the rooting problem and the insistence on extracting deep phylogenetic signal from molecular sequences that are prone to mutational saturation raised skepticism about the possibility of ever finding the root of the ToL. Bapteste and Brochier [60] made explicit the conceptual difficulties, claiming scientists in the field had adopted Agrippa’s logic of doubt. Misunderstandings on how to conduct ideographic inquiry in evolutionary genomics had effectively blocked the reerection of a Tower of Babel that would explain the diversification of genomes (Figure 3). Instead, there was “Confusion of Doubts.” Unfortunately, the impasse was aggravated by aprioristic tendencies inherited from systematic biology [3, 10], disagreements about the evolutionary role of vertical and lateral inheritance and the problem of homology [61], and, currently, disagreements about the actual role of Darwinian evolution in speciation and the ToL problem [13, 62].

During the decade of evolutionary genomic discovery, the effects of HGT on phylogeny [63] took center stage. HGT was invoked as an overriding process. However, little attention was paid to alternative explanations such as differential loss of gene variants, ancient or derived, and there was little concern for other sources of reticulation. HGT is certainly an important evolutionary process that complicates the “tree” concept of phylogenetic analysis and must be carefully studied (e.g., [64–66]). In some bacterial taxonomic groups, such as the proteobacteria, HGT was found to be pervasive and challenged the definition of species [67]. Cases like these prompted the radical suggestion that the ToL should be abandoned and that a “web of life” should be used to describe the diversification of microbes and multicellular organisms [68]. However, the problem has two aspects that must be separately considered.

(i) Mechanics. The widespread and impactful nature of HGT must be addressed. Does its existence truly compromise the validity of phylogenetic tree statements? While HGT seems important for some bacterial lineages [69], its evolutionary impacts in Archaea and Eukarya are not as extensive (e.g., [61, 70]) and its global role can be contested [71]. In turn, the role of viruses and RNA agents in genetic exchange continues to be understudied (e.g., [72]) and could be a crucial source of reticulation. Furthermore, the relationship between differential gene loss and HGT has not been adequately formalized, especially for genes that are very ancient. Establishing patterns of loss requires establishing polarity of change and rooted trees, which, as we have discussed earlier, remains unattained for ToLs reconstructed from molecular sequences.

Figure 3: The “Confusion of Tongues,” an engraving by Gustave Doré (1832–1883), was modified by D. Caetano-Anollés to portray the event of diversification that halted the construction of the Tower of Babel. In linguistics, the biblical story inspired tree thinking. We take the metaphor as the fall of the urancestor of cellular life and the replacement of “scala naturae” by branching processes of complexity.

(ii) Interpretation. HGT generally materializes as a mismatch between histories of genes in genomes and histories of organisms. Since genomes are by functional definition collections of genes, reticulations affect the history of the genes and not of the organisms, which, given evolutionary assumptions, result from hierarchical relationships (e.g., [73]). Thus, and at first glance, reticulation processes (HGT, gene recombination, gene duplications, and gene loss) do not obliterate vertical phylogenetic signal. The problem, however, remains complex. A ToL can be considered an ensemble of lineages nested within each other [74]. Protein domain lineages will be nested in gene lineages, gene lineages will be nested in lineages of gene families, gene families will be nested in organismal lineages, organisms will be nested in lineages of higher organismal groupings (populations, communities, ecosystems, and biomes), and so on. All lineage levels are defined by biological complexity and the hierarchical organization of life and follow a fractal pattern distribution, each level contributing vertical and lateral phylogenetic signal to the whole. However, sublineages in one hierarchical level may hold different histories compared to the histories of higher and lower levels. Historical mismatches introduce, for example, lineage sorting and reticulation problems that represent a complication to phylogenetic analysis. For example, mismatches between gene and organismal phylogenies exist in the presence of differential sorting of genomic components due to, for example, species extinctions or genomic rearrangements. These mismatches violate the fundamental cladistic assumption that history follows a branching process but can be explained by homoplasy (convergent evolution), the horizontal “trace” of the homology-homoplasy yin-yang of phylogenetic signal. The problem of homoplasy cannot be solved by conceptually preferring one particular level of the hierarchy [74]. A focus on the organismal level, for example, will not solve the problems of hybridization that are brought by strict or relaxed sexual reproduction or viral-mediated genetic exchanges or the problems of ancient events of fusions or endosymbiosis. However, homoplasy and the optimization of characters and trees of cladistic analysis provide a rigorous framework to discover the magnitude and source of nonvertical processes in evolution. Thus, cladistics or the ToL are not invalidated by network-like signals. In fact, recent discrete mathematical formulations that test the fundamental axioms of tree construction have also proven that the bifurcating history of trees is preserved despite evolutionary reticulations [75]. While there is no “confusion of doubts” at this level, the babel of confusion has not stopped.

Genomics revealed evolutionary patchiness, which incited hypotheses of chimeras and fusions. Phylogenies of genes were found highly discordant. Organisms that were being sequenced shared very few gene sequences and, more troublingly, gene trees that were generated possessed different topologies, especially within akaryotic organisms. With time, the number of nearly universal genes decreased and the patchiness and discordancy increased. The comparative genomic patchwork of sequence homologies showed that there were groups of genes that were only shared by certain domains of life. Thus, the makeup of genomes appeared chimeric. A number of hypotheses of chimeric origins of eukaryotes were proposed based on the fact that eukaryotes shared genes expressing sister group phylogenetic relationships with both Bacteria and Archaea [64, 76]. One that is notable is the rise of eukaryotes from a “ring of life” fusion of an archaeon and a bacterium (e.g., [77]), which in a single blow defeated both the tree-like and the tripartite nature of life. Under this school, homology searches with BLAST against a database of ∼3.8 million akaryotic sequences made it possible to assign archaeal, bacterial, or ambiguous ancestries to genes in the human genome and explain homology patterns as relics of the akaryotic ancestors of humans [78]. Remarkably, archaeal genes tended to be involved in informational processes, encoded shorter and more central proteins, and were less likely to be involved in heritable human diseases. The chimeric origin of eukaryotes by fusions has been rightfully contested; there is no proper evidence supporting its existence [79]. Unfortunately, chimerism opened a flood of speculations about reticulation. The history of life was seen by many through the lens of a “forest” of gene histories [80]. Reticulations and ultimately HGT were forced to explain chimeric patterns. Fusion hypotheses diverted attention from the central issue of the rooting of the ToL and prompted the separate analysis of eukaryotic origins and akaryotic evolution.

Dagan and Martin [64] contended that the ToL was a “tree of one percent” because only a small fraction of sequences could be considered universal and could be mined for deep phylogenetic signal. The rest would account for lateral processes that confounded vertical descent. The claim came fundamentally from networks constructed using BLAST heuristic searches for short and strong matches in genomic sequences. Phylogenetic “forests” of akaryotic genes later on boosted the claim [81]. These networks were built from 6,901 trees of genes using maximum likelihood methods. Only 102 of these trees were derived from nearly universal genes and contributed very little vertical phylogenetic signal. The initial conclusion was striking: “the original tree of life concept is obsolete; it would not even be a tree of one percent” [81]. However, the fact that vertical signal was present in the forests merited reevaluation: “replacement of the ToL with a network graph would change our entire perception of the process of evolution and invalidate all evolutionary reconstruction” [82]. We note that in these studies an ultrametric tree of akaryotes was recovered from vertical signal in a supertree of nearly universal genes. The rooted tree, which was used to simulate clock-like behavior, revealed the early divergence of Nanoarchaeum equitans and then Archaea [81]. While inconsistency of the forest supertree increased at high phylogenetic depths, its associated supernetwork showed there were no reticulations between Archaea and Bacteria. Thus, the deep rooting signal of archaeal diversification may be bona fide and worthy of further study.

While those that value tree thinking have contested on many grounds the idea that the ToL is “obsolete” (e.g., [13]), our take in the debate is simple. Homologies between gene sequences established through the “emperor’s BLAST” (sensu [71]) are poor substitutes for phylogenetic tree reconstructions from gene sequences. By the same token, phylogenies reconstructed from sequences are poor substitutes for phylogenies that consider the molecular structure and function of the encoded products. Gene sequences are not only prone to mutational saturation, but they generally come in pieces. These pieces represent evolutionary and structural modules that host the functions of the encoded molecules. Protein domains, for example, are three-dimensional (3D) arrangements of elements of secondary structure that fold autonomously [83], are compact [84], and are evolutionarily conserved [85, 86]. The landscape of evolutionary exploration changes when the role of protein domains in function and evolution is considered [87]. Changes at sequence level, including substitutions, insertions, and deletions, can have little impact on structure and, vice versa, sequence changes in crucial sites can have devastating consequences on function and fitness. However, there is no detailed analysis of historical mismatches between gene sequences and domain structures at the ToL level. Remarkably, while HGT seems rampant at sequence level, its impact at the domain structural level is limited [88]. This makes trees derived from domains effectively “trees of 99 percent” and their use very powerful. The fact that domains diversify mostly by vertical descent (e.g., [89–91]) suggests that gene reticulations simply reflect the pervasive and impactful combinatorial effect of domain rearrangements in proteins [92] and perhaps little else. This important claim must be carefully evaluated. It offers the possibility of dissecting how levels of organization impact processes of inheritance in biology.

5. Out of the Impasse: Parts and Wholes in the Evolving Structure of Systems

The rooting of the ToL is clearly muddled by the highly dynamic nature of change in protein and nucleic acid sequences and by the patchiness and reticulation complexities that exist at gene level. However, it is also possible that the problems of the ToL are ultimately technical. We have made the case that the use of molecular sequences is problematic on many grounds, including mutational saturation, definition of homology of sites in sequence alignments, inapplicable characters, taxon sampling and tree imbalance, and different historical signatures in domains of multidomain proteins [93]. In particular, violation of character independence by the mere existence of atomic structure represents a very serious problem that plagues phylogenetic analysis of sequences. We here present a solution to the impasse. We show that the ToL can be rooted with different approaches that focus on structure and function and that its root is congruently placed in Archaea.

Epistemologically, phylogenetic characters must comply with symmetry breaking and the irreversibility of time [94]. In other words, characters must establish transformational homology relationships and serve as independent evidential statements.

(i) Characters Must Be Homologous and Heritable across Tree Terminal Units (Taxa). Character homology is a central and controversial concept that embodies the existence of historical continuity of information [74]. Characters are “basic” evidential statements that make up columns in data matrices used for tree reconstruction. They are conjectures of perceived similarities that are accepted as fact for the duration of the study, are strengthened by Hennigian reciprocal illumination, and can be put to the test through congruence with other characters as these are fit to the trees. To be useful, characters must be heritable and informative across the taxa rows of the data matrix. This can be evaluated, for example, with the cladistics information content (CIC) measure [95]. Finding informative characters can be particularly challenging when the features that are studied change at fast pace and when taxa sample a wide and consequently deep phylogenetic spectrum. When building a ToL, the highly dynamic nature of change in the sequence makeup of protein or nucleic acid molecules challenges the ability to retrieve reliable phylogenetic signatures across taxa, even if molecules are universally distributed and harbor evolutionarily conserved regions with deep phylogenetic signal (e.g., rRNA). The reason is that, given enough time, functionally or structurally constrained regions of the sequence will be fixed (will be structurally canalized [6]) and will offer little if any phylogenetically meaningful signal to uncover, for example, the universal rRNA core. In turn, mutational saturation of unconstrained regions will quickly erase history.

The mutational saturation problem was made mathematically explicit by Sober and Steel [24] using “mutual information” and the concept of time as an information destroying process. Mutual information $I(X, Y)$ between two random variables $X$ and $Y$ is defined by

$$I(X, Y) = \sum_{x, y} P(X = x, Y = y) \log\left(\frac{P(X = x, Y = y)}{P(X = x)\,P(Y = y)}\right). \tag{1}$$

When $I(X, Y)$ approaches 0, $X$ and $Y$ become independent and no method can predict $X$ from knowledge of $Y$. Importantly, mutual information approaches 0 as the time between $X$ and $Y$ increases in a Markov chain. Regardless of the use of a parsimony, maximum likelihood, or Bayesian-based framework of analysis, $I(X, Y)$ will be particularly small when sequence sites are saturated by too many substitutions due to high substitution rates or large time scales. Ancestral states at interior nodes of the trees cannot be established with confidence from extant information, even in the most optimal situation of knowing the underlying phylogeny and the model of character evolution. Under simple models, the problem is not mitigated by the fact that the number of terminal leaves of trees and the sources of initial phylogenetic information increase with time. Since a phase transition occurs when substitution probabilities exceed a critical value [96], one way out of the impasse is to find features in sufficient number that change at much slower pace than sequence sites and test if mutual information is significant and overcomes Fano’s inequality. These features exist in molecular biology and have been used for phylogenetic reconstruction. They are, for example, the 3D fold structures of protein molecules [87] or the stem modules of RNA structures [97], features that change at very slow rate when compared to associated sequences. For example, protein 3D structural cores evolve linearly with amino acid substitutions per site and change at 3–10 times slower rates than sequences [98]. This high conservation highlights the evolutionary dynamics of molecular structure. Remarkably, rates of change of proteins performing the same function are maintained by functional constraints but accelerate when proteins perform different functions or contain indels. In turn, fold structural diversity explodes into modular structures at low sequence identities, probably triggered by functional diversification. Within the context of structural conservation, the fact that fold structures are structural and evolutionary modules that accumulate in proteomes by gene duplication and rearrangements and spread in biological networks by recruitment (e.g., [99]) also provides a solution to the problems of vanishing phylogenetic signal. Since fold accumulation increases with time in the Markov chain, mutual information must increase, reversing the “data processing inequality” that destroys information and enabling deep evolutionary information.
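The decay of mutual information with time can be made concrete with a small numerical sketch. It evaluates the definition in equation (1) for a single sequence site, assuming a four-state Jukes-Cantor substitution model with uniform state frequencies. That model is an assumption made here for illustration, not the setting analyzed in [24], and the function names are hypothetical.

```python
import math


def jc_joint(d):
    """Joint distribution P(X = x, Y = y) for an ancestral state X and a
    descendant state Y separated by d expected substitutions per site
    (Jukes-Cantor model, uniform ancestral frequencies; an assumption)."""
    p_same = 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)
    p_diff = (1.0 - p_same) / 3.0
    return [[0.25 * (p_same if x == y else p_diff) for y in range(4)]
            for x in range(4)]


def mutual_information(joint):
    """I(X, Y) in bits, summed term by term as in equation (1)."""
    px = [sum(row) for row in joint]          # marginal of X
    py = [sum(col) for col in zip(*joint)]    # marginal of Y
    total = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0.0:                     # 0 * log(0) is taken as 0
                total += pxy * math.log2(pxy / (px[x] * py[y]))
    return total


for d in (0.0, 0.1, 0.5, 1.0, 2.0, 5.0):
    print(f"d = {d:4.1f}  I(X, Y) = {mutual_information(jc_joint(d)):.4f} bits")
```

With no divergence the site carries the full two bits, and the value falls towards zero as substitutions accumulate. This is the saturation behavior that motivates the search for slowly changing characters.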

(ii) Characters Must Show at Least Two Distinct Character States. One of these two states (transformational homologs) must be ancestral (the “plesiomorphic” state) and the other must be derived (the “apomorphic” state) [74]. Only shared and derived features (synapomorphies) provide vertical phylogenetic evidence. Consequently, determining the relative ancestry of alternative character states defines the polarity of character transformations and roots the underlying tree. This is a fundamental property of phylogenetic inference. Polarization in tree reconstruction enables the “arrow of time” (sensu Eddington’s entropy-induced asymmetry), solves the rooting problem, and fulfills other epistemological requirements.

Cladistically speaking, character polarity refers to the distinction between the ancestral and derived states and the identification of synapomorphies. However, an evolutionary view of polarity also refers to the direction of character state transformations in the phylogenetic model. Historically, three accepted alternatives have been available for rooting trees [100–102]: the outgroup comparison, the ontogenetic method, and the paleontological or stratigraphic method. While the three methods do not include assumptions of evolutionary process, they have been the subject of much discussion and their interpretation of much controversy. The first two are, however, justified by the assumption that diversity results in a nested taxonomic hierarchy, which may or may not be induced by evolution. We will not discuss the stratigraphic method, as it relies on auxiliary assumptions regarding the completeness of the fossil record. The midpoint rooting criterion mentioned earlier will not be discussed either. The procedure is contested by the existence of heterogeneities in rates of change across the trees and problems with the accurate characterization of phylogenetic distances.

In outgroup comparison, polarity is inferred by the distribution of character states in the ingroup (group of taxa of interest) and the sister group (taxa outside the group of interest). In a simple case, if the character state is only found in the ingroup, it must be considered derived. Outgroup comparison is by far the method of choice because phylogeneticists tend to have confidence in the supporting assumptions: higher-level relationships are outside the ingroup, equivalent ontogenetic stages are compared, and character state distributions are appropriately surveyed. Unfortunately, the method is “indirect” in that it depends on the assumption of the existence of a higher-level relationship between the outgroup and the ingroup. Consequently, the method cannot root the ToL because at that level there is no higher-level relationship that is presently available. Moreover, the method in itself does not polarize characters. It simply connects the ingroup to the rest of the ToL [100].

The ontogenetic criterion confers polarity through the distribution of the states of homologous characters in ontogenies of the ingroup, generally by focusing on the generality of character states, with more widely distributed states being considered ancestral. Nelson’s rule states, “given an ontogenetic character transformation from a character observed to be more general to a character observed to be less general, the more general character is primitive and the less general advanced” [103]. This “biogenetic law” appears powerful in that it depends only on the assumption that ontogenies of ingroup taxa are properly surveyed. It is also a “direct” method that relies exclusively on the ingroup. Consequently, it has the potential to root the ToL. Unfortunately, Nelson’s “generality” has been interpreted in numerous ways, especially as it relates to the ontogenetic sequence, leading to much confusion [102]. It also involves comparison of developmentally nested and distinct life history stages, making it difficult to extend the method (originally conceptualized for vertebrate phylogeny) to the microbial world. However, Weston [100, 104] made it clear that the ontogenetic criterion embodies a wider “generality criterion” in which the taxic distribution of a character state is a subset of the distribution of another. In other words, character states that characterize an entire group must be considered ancestral relative to an alternative state that characterizes a subset of the group. Besides the centrality of nested patterns, the generality criterion embeds the core assumption that every homology is a synapomorphy in nature’s nested taxonomic hierarchy and that homologies in the hierarchy result from additive phylogenetic change [100]. Weston’s more general rule therefore states that “given a distribution of two homologous characters in which one, $x$, is possessed by all of the species that possess its homolog, character $y$, and by at least one other species that does not, then $y$ may be postulated to be apomorphous relative to $x$” [104]. The only assumption of the method is that relevant character states in the ingroup are properly surveyed. This new rule crucially substitutes the concept of ontogenetic transformation by the more general concept of homology and additive phylogenetic change, which can be applied to cases in which homologous entities accumulate “iteratively” in evolution (e.g., generation of paralogous genes by duplication). Since horizontally acquired characters (xenologs) are not considered synapomorphies, they contribute towards phylogenetic noise and are excluded after calculation of homoplasy and retention indices (i.e., measures of goodness of fit of characters to the phylogeny).
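Weston’s rule is essentially a statement about nested sets, which a short Python sketch can make explicit. The function below is a hypothetical illustration rather than code from [100] or [104]: it takes the taxic distributions of two homologous character states and postulates polarity only when one distribution is a proper subset of the other. The taxon names are invented for the example.

```python
def polarize(taxa_with_x, taxa_with_y):
    """Apply the generality criterion to two homologous character states.

    Returns an (ancestral, derived) pair of labels, or None when the two
    taxic distributions are not strictly nested and the rule is silent.
    """
    x, y = set(taxa_with_x), set(taxa_with_y)
    if not x or not y:
        return None            # a state observed in no taxon cannot be polarized
    if y < x:                  # x occurs wherever y does, plus at least one more taxon
        return ("x", "y")
    if x < y:
        return ("y", "x")
    return None                # identical or merely overlapping distributions


# Hypothetical survey: state x is found in four taxa, state y in two of them.
x_taxa = {"taxon1", "taxon2", "taxon3", "taxon4"}
y_taxa = {"taxon3", "taxon4"}
print(polarize(x_taxa, y_taxa))                   # x ancestral, y derived
print(polarize({"taxon1", "taxon2"}, {"taxon2", "taxon3"}))  # not nested
```

The sketch relies only on ingroup distributions, which is why the criterion is described as “direct”: no outgroup enters the comparison.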

Figure 4: Trees of life generated from the structure of RNA and protein molecules congruently show a rooting in Archaea. (a) A rooted phylogenetic tree of 5S rRNA reconstructed from both the sequence and the structure of the molecules (from [105]). Taxa = 666 5S rRNA molecules of diverse organisms; tree length = 11,381 steps; CI = 0.072, RI = 0.772, and $g_1$ = −0.13. (b) Global most-parsimonious scenario of organismal diversification based on tRNA (from [107]). A total of 571 tRNA molecules with sequence, base modification, and structural information were used to build a ToL, which failed to show monophyletic groupings. Ancestries of lineages were then inferred by constraining sets of tRNAs into monophyletic groups representing competing (shown in boxes) or noncompeting phylogenetic hypotheses and measuring tree suboptimality and lineage coalescence (illustrated with color hues in circles). (c) A ToL reconstructed from the genomic abundance counts of 1,510 FSFs as phylogenetic characters in the proteomes of 150 free-living organisms sampled equally and randomly from the three domains of life (data taken from [122, 123]). Tree length = 70,132 steps; CI = 0.155, RI = 0.739, and $g_1$ = −0.28. Taxa were labeled with circles colored according to superkingdom. CI = consistency index, RI = retention index, and $g_1$ = gamma distribution parameter.

We have applied the “generality criterion” to the rooting of the ToL through polarization strategies that embody axioms of evolutionary process. Figure 4 shows three examples. A rooted phylogeny describing the evolution of 5S rRNA molecules sampled from a wide range of organisms was reconstructed from molecular sequence and structure [105]. The ToL that was recovered was rooted paraphyletically in Archaea (Figure 4(a)). The model of character state transformation was based on the axiom that evolved RNA molecules are optimized to increase molecular persistence and produce highly stable folded conformations. Molecular persistence materializes in RNA structure in vitro, with base pairs associating and disassociating at rates as high as 0.5 s$^{-1}$ [106]. The frustrated kinetics and energetics of this folding process enable some structural conformations to quickly reach stable states. This process is evolutionarily optimized through structural canalization [6], in which evolution attains molecular functions by both increasing the average life and stability of selected conformations and decreasing their relative number. Thus, conformational diversity measured, for example, by the Shannon entropy of the base-pairing probability matrix or features of thermodynamic stability act as “evo-devo” proxy of a generality criterion for RNA molecules, in which the criterion of similarity (e.g., ontogenetic transformation) is of positional and compositional correspondence. Using a different approach, we recently broadened the use of phylogenetic constraint analysis [107, 108], borrowing from a formal cybernetic method that decomposes a reconstructable system into its components [109], and used it to root a ToL derived from tRNA sequence and structure (Figure 4(b)). The ToL was again rooted in Archaea. The number of additional steps required to force (constrain) particular taxa into a monophyletic group was used to define a lineage coalescence distance ($S$) with which to test alternative hypotheses of monophyly. These hypotheses were then ordered according to $S$ value in an evolutionary timeline. Since $S$ records the relative distribution of character states in taxic sets, it also embodies the generality criterion of rooting. Finally, we generated rooted ToLs that describe the evolution of proteomes directly from a census of protein fold structures in proteomes (Figure 4(c)). This ToL shows a paraphyletic rooting in Archaea. The method extracts phylogenetic signal from proteomic abundance of protein fold structures and considers that the most abundant and widely distributed folds are of ancient origin when defining transformation series [41]. This polarization scheme, which results in the gradual growth of the proteome repertoire, again represents an embodiment of the generality criterion in which statements of homology (fold structures) result from additive phylogenetic change (increases in abundance). It is noteworthy that these ToL reconstructions take into account the genomic abundance of each and every fold structure in each proteome and across the entire matrix, thereby generating a frustrated system. In general, very high abundance of only a few folds will not attract taxa (species) to the derived branches in the ToL. The ancestry of taxa is determined by both the abundance and interplay among fold structure characters. For example, metabolic folds such as those involved in ATP hydrolysis are widespread in cells and considered to be very ancient. These are also the most ancient folds in our phylogenies. In comparison, some popular eukaryote-specific folds (e.g., immunoglobulin superfamilies) are highly abundant but appear in a derived manner in our phylogenies. Thus, we reason that there is no circularity involved in the character polarization scheme.
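The Shannon entropy of a base-pairing probability matrix, mentioned above as a measure of conformational diversity, can be sketched as follows. The unnormalized form used here, $-\sum_{i<j} p_{ij} \ln p_{ij}$, is one common convention and is an assumption. The studies cited may normalize or scale it differently, and the two small matrices are invented rather than computed from real RNA sequences.

```python
import math


def base_pair_entropy(prob):
    """Shannon entropy of a symmetric base-pairing probability matrix.

    prob[i][j] is the probability that positions i and j are paired;
    only the upper triangle (i < j) is summed so each pair counts once.
    """
    n = len(prob)
    entropy = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            p = prob[i][j]
            if p > 0.0:                  # 0 * ln(0) is taken as 0
                entropy -= p * math.log(p)
    return entropy


# Hypothetical 4-nucleotide molecule locked into a single structure.
rigid = [[0, 0, 0, 1],
         [0, 0, 1, 0],
         [0, 1, 0, 0],
         [1, 0, 0, 0]]

# Hypothetical molecule that splits its time between two pairings.
diffuse = [[0, 0, 0.5, 0.5],
           [0, 0, 0.5, 0.5],
           [0.5, 0.5, 0, 0],
           [0.5, 0.5, 0, 0]]

print(base_pair_entropy(rigid))    # a single conformation gives zero entropy
print(base_pair_entropy(diffuse))  # competing conformations raise the entropy
```

Lower entropy corresponds to fewer competing conformations, the direction in which structural canalization is assumed to push evolved RNA molecules.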

The compositional schemes extend the concept of rooting with paralogous sequences to the entire proteome complements, from gene family level [100] to structural hierarchies. The three examples make use of different rooting rationales but provide a congruent scenario of origins of diversified life. Technically, roots are inferred by the Lundberg method [110], which does not require any outgroup taxa specification. This method roots the trees a posteriori by attaching the hypothetical ancestor to that branch of the unrooted network that would yield minimum increase in the tree length (thus preserving the principle of parsimony).

(iii) Characters Must Serve as Independent Evolutionary Hypotheses. Valid phylogenetic optimization requires that characters be independent pieces of evidence. Characters should not depend on other characters. When the assumption of independence is violated, characters are overweighted in the analysis and the resulting phylogeny fails to represent true history [111]. Possible dependencies could be of many kinds, from structural to functional, from developmental to ecological. These dependencies distort and obscure phylogenetic signal and must be either avoided or coded into the phylogenetic model through parameters or weight corrections.

As we will now elaborate, the problem of character independence is about parts and wholes in the hierarchical fabric of life and in the nested hierarchies of the ToL. Biological systems are by definition made of parts regardless of the way parts are defined. In evolution, diversification and integration of parts unify parts into cohesive entities, modules, which then diversify [112]. This process and the rise of modules may explain evolutionary waves of complexity and organization and the emergence of structure (defined broadly) in biology that is hierarchical. The hierarchical makeup is made evident in the structure of protein molecules, where lower level parts of the polymer (the amino acid residues) interact with each other and establish cohesive higher-level modular parts, which also establish interaction networks and are crucial for molecular function and for interaction of proteins with the cellular environment. Consequently, the structure of proteins can be described at increasing hierarchical levels of structural abstraction: sequences, motifs, loops, domains, families, superfamilies, topologies, folds, architectures, and classes. Two accepted gold standards of protein classification, structural classification of proteins (SCOP) [113] and class architecture topology homology (CATH) [114], use parts of this incomplete scheme to describe the atomic complexity of the molecules. We note that these classifications do not consider unrealized structural states, such as protein folds that are possible but that have never been identified in the natural world of protein structures. We also note that modules sometimes engage in combinatorial games. For example, protein domains are rearranged in evolution by fusions and fissions, producing the enormous diversity of alternative domain rearrangements that exist in multidomain proteins [92].

Since biological systems evolve and carry common ancestry, parts of these systems by definition evolve and by themselves carry common ancestry. In other words, the histories of parts are embodied in the ensemble of lineages of the ToL. The focus of phylogenetic analysis, however, has been overwhelmingly the organism as the biological system, as testified by the effort devoted to taxonomical classification, systematic biology, and building of the ToL. The genomic revolution has provided a wealth of parts in the amino acid sequence sites of proteins and nucleic acid molecules, which have been used as phylogenetic characters for analysis of molecules representing organisms. This has maintained the focus of reconstructing trees of systems (the wholes). For example, amino acid sites of a protein are generally fit into a data matrix (alignment), which is then used to build trees of genes and organisms (Figure 5(a)) using modern algorithmic implementations [115]. However, and as we have mentioned, proteins have complicated structures that result from interactions between amino acids at the 3D atomic level. These intramolecular interactions, or at least some of them [116], induce protein folding and delimit molecular function and protein stability. They are responsible for protein secondary, supersecondary, domain, and tertiary structure, and by definition their mere existence induces violation of character independence. Penny and Collins [117] proposed the simple thought experiment in which the bioinformatician exchanges rows of sequence sites in the alignment matrix and asks what was lost in the process. Randomization of characters (columns) in the data matrix does not change the phylogenetic tree. However, randomization destroys the structure of the molecule and very likely its function. This confirms that reconstruction of trees from information in sequence sites violates character independence and in the process ignores structure and biology. The effects of violation

[Figure 5, panel (a): canonical paradigm, evolution of systems. Characters are parts (amino acid sequence sites); terminals are systems (proteins from Pfu, Mja, and Mba); interactions between parts (e.g., atomic contacts between sites) have a high impact on phylogenetic reconstruction. Panel (b): new paradigm, evolution of parts. Characters are systems (proteomes of Pfu, Mja, and Mba); terminals are parts (FSF domains c.37.1, c.1.2, and c.26.1); interactions between systems (e.g., proteomes in trophic chains, from picoplankton to higher trophic levels) have a low impact on phylogenetic reconstruction.]

Figure 5: A new phylogenetic strategy simplifies the problems of character independence. (a) The canonical paradigm explores the evolution of systems, such as the evolution of organisms in the reconstruction of ToLs. For example, the terminals of phylogenetic trees can be genes sampled from different organisms (e.g., Pyrococcus furiosus (Pfu), Methanococcus jannaschii (Mja), and Methanosarcina barkeri (Mba)) and phylogenetic characters can be amino acid sequence sites of the corresponding gene products. Character states can describe the identity of the amino acid at each site. Since characters are molecular parts that interact with other parts when molecules fold into compact 3D structures, their interaction violates the principle of character independence. Consequently, the effects of covariation must be considered in the phylogenetic model used to build the trees. (b) The new paradigm explores the evolution of parts, such as the evolution of protein domains in proteomes. For example, the terminals of phylogenetic trees can be domains defined at fold superfamily level of structural complexity and the characters used to build the trees can be proteomes. Character states can be the number of domains holding the FSF structure. Since proteomes interact with other proteomes when organisms establish close interactions, interactions that could affect the abundance of domains in proteomes should be considered negligible (unless there is an obligate parasitic lifestyle involved) and there is no need to budget trophic interactions in the phylogenetic model.

of character independence may be minimal for trees of sequences that are closely related. However, trees describing deep historical relationships require that sequences be divergent, and this maximizes the chances of even wider divergences in molecular structure that are not being accounted for by the models of sequence evolution.
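The column-randomization thought experiment can be illustrated with a minimal sketch. It assumes a toy alignment of three hypothetical eight-residue sequences labeled Pfu, Mja, and Mba (invented for illustration, not real proteins) and uses pairwise Hamming distances as a stand-in for any reconstruction method that treats sites as independent characters. The function names are illustrative only.

```python
import random
from itertools import combinations


def hamming(a, b):
    """Count alignment columns at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def pairwise_distances(alignment):
    """Map each pair of taxon names to its Hamming distance."""
    return {(p, q): hamming(alignment[p], alignment[q])
            for p, q in combinations(sorted(alignment), 2)}


def shuffle_columns(alignment, seed=0):
    """Apply one random permutation of columns to every row of the matrix."""
    names = sorted(alignment)
    order = list(range(len(alignment[names[0]])))
    random.Random(seed).shuffle(order)
    return {name: "".join(alignment[name][i] for i in order) for name in names}


# Hypothetical toy alignment; the residues are invented for illustration.
alignment = {"Pfu": "ALYSGKTV", "Mja": "ALYSGRTV", "Mba": "GLSSAKTI"}
shuffled = shuffle_columns(alignment)

before = pairwise_distances(alignment)
after = pairwise_distances(shuffled)
print(before == after)
```

The distances, and hence any tree built from them, are identical before and after shuffling, although the shuffled strings no longer correspond to a foldable molecule. This is the sense in which site-based matrices carry no record of the structural dependencies among characters.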

The genomic revolution has also provided a wealth of models of 3D atomic structure. These structures are used as gold standards to assign with high confidence structural modules to sequences. As mentioned earlier, proteomes embody collections of protein domains with well-defined structures and functions. Protein domain counts in proteomes have been used to generate trees of protein domains (Figure 5(b)) using standard cladistic approaches and well-established methods (reviewed in [87]). These trees describe the evolution of protein structure at global level. They are effectively trees of parts. While domains interact with each other in multidomain proteins or establish protein-protein interactions with the domains of other proteins, these interactions of parts do not violate character independence. This is because phylogenetic characters are actually proteomes, systems defined by structural states that exist at much higher levels than the protein domain, not far away from the organism level. Remarkably, no information is lost when character columns in the matrix are randomized in the thought experiment. The order of proteome characters in the matrix does not follow any rationale. Characters are not ordered by lifestyles or trophic levels of the organisms. Interactions between free-living organisms will seldom bias their domain makeup, and if so, those characters can be excluded from analysis. Even the establishment of symbiotic or obligate parasitic interactions, such as the nodule-forming symbioses between rhizobia and legumes, may have little impact on character independence as long as the joint inclusion of the host and the symbiont is avoided.

6. Evidence Supporting the Archaeal Rooting of the Universal Tree

Figure 4 shows examples of rooted ToLs generated from the sequence and structure of RNA and protein molecules. The different phylogenomic approaches arrive at a common rooted topology that places the stem group of Archaea at the base of the ToL (the archaeal rooting of Figure 1(c)). However, the evolutionary interrelationship of parts and wholes prompts the use of trees of parts, a focus on higher-level structure, and a decrease in confidence in the power of trees of systems. These concepts have been applied to the study of the history of nucleic acid and protein structures for over a decade and

[Figure 6: Venn diagrams of (a) FSF domains, (b) GO terms, and (c) RNA families in Archaea, Bacteria, and Eukarya. Each panel shows the full distribution and, below it, the distribution restricted to f > 0.6.]

Figure 6: Venn diagrams displaying the distributions of 1733 FSF domains (a), 1924 terminal GO terms (b), and 1148 RNA families (c) in the genomes of the three domains of life. FSF domain data was taken from Nasir et al. [122, 123] and included 981 completely sequenced proteomes from 70 Archaea, 652 Bacteria, and 259 Eukarya. Terminal GO terms corresponding to the “molecular function” hierarchy defined by the GO database [118] were identified in 249 free-living organisms, including 45 Archaea, 183 Bacteria, and 21 Eukarya (data taken from [123]). The Venn diagram of RNA families and the distribution of Rfam clans and families in organisms were taken from Hoeppner et al. [129] and their Dataset S1. Shown below are distribution patterns for FSFs, GO terms, and RNA families that are present in more than 60% of the organisms examined (f > 0.6). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

provide additional evidence in support of the archaeal rooting scenario.

6.1. Comparative Genomic Argumentation. The distribution of gene-encoded products in the genomes of sequenced organisms and parsimony thinking can reveal global evolutionary patterns without formal phylogenetic reconstruction. We will show how simple numerical analyses of protein domains, molecular activities defined by the gene ontology (GO) consortium [118], and RNA families that are domain-specific or are shared between domains of life can uncover the tripartite division of cellular life, exclude chimeric scenarios of origin, and provide initial insight on the rooting of the ToL (Figure 6).

The number of unique protein folds observed in nature is very small. SCOP ver. 1.75 defines only ∼2000 fold superfamilies (FSFs), groups of homologous domains unified on the basis of common structural and evolutionary relationships, for a total of 110,800 known domains in proteins [113, 119]. FSFs represent highly conserved evolutionary units that are suitable for studying organismal diversification [87]. Yafremava et al. [120] plotted the total number of distinct FSFs (FSF diversity) versus the average reuse of FSFs in the proteome of an organism (FSF abundance). This exercise uncovered a scaling behavior typical of a Benford distribution, with a linear regime of proteomic growth for microbial organisms and a superlinear regime for eukaryotic organisms (Figure 8 in [120]). These same scaling patterns are observed when studying the relationship between open reading frames and genome size [121]. Remarkably, archaeal and eukaryal proteomes exhibited both minimum and maximum levels of FSF abundance and diversity, respectively. Bacterial proteomes, however, showed intermediate levels. We note that the general scaling behavior is consistent with a scenario in which evolutionary diversification proceeds from simpler proteomes to the more complex ones in gradual manner, supporting the principle of spatiotemporal continuity and revealing the nested phylogenetic hierarchies of organisms. Under this scenario (and results of [120]), the streamlined archaeal proteomes represent the earliest form of cellular life. Remarkably, archaeal species harboring thermophilic and hyperthermophilic lifestyles encoded the most streamlined FSF repertoires (Figure 7). Clearly, modern thermophilic archaeons are most closely related to the ancient cells that inhabited planet Earth billions of years ago (also read below).
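The two quantities plotted in these scaling analyses can be sketched as follows. The sketch assumes that a proteome is available as a flat list of FSF assignments, one entry per detected domain. The function name, the toy proteome, and the choice to report both total and mean reuse are illustrative assumptions and are not taken from [120].

```python
import math
from collections import Counter


def fsf_use_and_reuse(domain_assignments):
    """Summarize FSF diversity (use) and abundance (reuse) for one proteome.

    domain_assignments: list of FSF identifiers, one per domain occurrence.
    """
    if not domain_assignments:
        raise ValueError("proteome has no domain assignments")
    counts = Counter(domain_assignments)
    use = len(counts)                 # number of distinct FSFs
    total = sum(counts.values())      # total number of domain occurrences
    return {
        "use": use,
        "total_reuse": total,
        "mean_reuse": total / use,    # average copies per distinct FSF
        "log_use": math.log10(use),
        "log_reuse": math.log10(total),
    }


# Hypothetical proteome with three FSFs and ten domain occurrences.
toy_proteome = ["c.37.1"] * 6 + ["c.1.2"] * 3 + ["c.26.1"] * 1
stats = fsf_use_and_reuse(toy_proteome)
print(stats)
```

Plotting the log-transformed values for many proteomes gives a use-versus-reuse scatter of the kind shown in Figure 7. Streamlined proteomes fall near the origin of such a plot.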

FSF domain distributions in the genomes of the three domains of life provide further insights into their evolution. Nasir et al. [122, 123] generated Venn diagrams to illustrate FSF sharing patterns in the genomes of Archaea, Bacteria, and Eukarya (Figure 6(a)). These diagrams display the total number of FSFs that are unique to a domain of life (taxonomic groups A, B, and E), shared by only two (AB, BE, and AE), and

[Figure 7: scatter plot of FSF reuse (log) against FSF use (log) for archaeal proteomes. Legend: thermophilic, nonthermophilic. Labeled points: S. marinus, I. hospitalis, H. butylicus, T. pendens, Cand. Korarchaeum, A. pernix, M. kandleri, T. neutrophilus, P. islandicum, and N. maritimus.]

Figure 7: A plot of FSF use (diversity) against FSF reuse (abundance) reveals a linear pattern of proteomic growth in 48 archaeal proteomes. Thermophilic archaeal species occupy positions that are close to the origin of the plot. They also populate the most basal branch positions in ToLs (see discussion in the main text). Both axes are in logarithmic scale.

those that are universal (ABE). About half of the FSFs (786 out of 1733) were present in all three domains of life, 453 of which were present in at least 60% of the organisms that were examined (f > 0.6, where f is the distribution index ranging from 0 to 1). The fact that about 70% of widely shared FSFs (672) belong to the ABE taxonomic group strongly supports a common evolutionary origin for cells. Evolutionary timelines have confirmed that a large number of these universal FSFs were present in the ancestor of the three domains of life, the urancestor (Figure 1), and were for the most part retained in extant proteomes [48].
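The bookkeeping behind the Venn diagrams can be sketched as follows, assuming that presence or absence of an FSF has been recorded for every sampled proteome, grouped by domain of life. As a working assumption, the distribution index f is computed here over the proteomes of those domains in which the FSF occurs (all proteomes for an ABE FSF, bacterial and eukaryal proteomes for a BE FSF). The function names and toy data are hypothetical.

```python
def taxonomic_group(presence):
    """Return the Venn group (A, B, E, AB, AE, BE, or ABE) of one FSF.

    presence: dict mapping 'A', 'B', 'E' to lists of booleans, one per proteome.
    """
    return "".join(d for d in "ABE" if any(presence[d]))


def distribution_index(presence):
    """Fraction of proteomes holding the FSF within the domains of its group."""
    group = taxonomic_group(presence)
    if not group:
        return 0.0  # FSF absent from every sampled proteome
    flags = [hit for d in group for hit in presence[d]]
    return sum(flags) / len(flags)


# Hypothetical presence/absence records for two FSFs.
fsf_x = {"A": [True, True, False],
         "B": [True, True, True, True],
         "E": [True, True, False]}
fsf_y = {"A": [False, False, False],
         "B": [True, False, True, True],
         "E": [True, True, False]}

for name, record in (("fsf_x", fsf_x), ("fsf_y", fsf_y)):
    f = distribution_index(record)
    print(name, taxonomic_group(record), round(f, 2), f > 0.6)
```

Counting FSFs per group gives the upper Venn diagram, and counting only those with f > 0.6 gives the lower one.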

Interestingly, the number of BE FSFs was ∼10-fold greater than AE and AB FSFs (324 versus 38 and 38) (Figure 6(a)). The finding that Bacteria and Eukarya encode a significantly large number of shared FSF domains is remarkable and hints towards an unprecedented strong evolutionary association between these two domains of life. Moreover, a significant number of BE FSFs is widespread in the bacterial and eukaryotic proteomes: 56 of the 324 FSFs are shared by more than 60% of the bacterial and eukaryal organisms that were analyzed (Figure 6(a)). This is strong evidence of an ancient vertical evolutionary trace from their mutual ancestor (anticipated in [123]). This trace uniquely supports the archaeal rooting of the ToL. In turn, the canonical and eukaryotic rooting alternatives are highly unlikely, as none alone can explain the remarkable diversity of the BE taxonomic group (as well as its ancient origin [123]). The remarkable vertical trace of the ABE taxonomic group and the negligible vertical trace of the AB group (only one FSF shared by more than 60% of organisms) also falsify fusion hypotheses responsible for a putative chimeric makeup of Eukarya. In light of these findings, the most parsimonious explanation of comparative genomic data is that Archaea is the first domain of diversified cellular life.

The patterns of FSF sharing are further strengthened by the genomic distributions of terminal-level molecular function GO terms (Figure 6(b)). The three domains of life shared about a quarter (526) of the 1924 GO terms that were surveyed. A total of 232 of these ABE terms were shared by 60% of organisms analyzed. Again, the fact that about 76% of widely shared GO terms (306) belong to the ABE taxonomic group strongly supports a common vertical trace, while the finding that Bacteria and Eukarya share a significantly large number of GO terms (272) supports again the archaeal rooting of the ToL. An alternative explanation for the very large size of BE could be large-scale metabolism-related gene transfer from the ancestors of mitochondria and plastids to the ancestors of modern eukaryotes [124]. However, we note that the BE group is not restricted to only metabolic functions. It also includes FSFs and GOs involved in intracellular and extracellular processes and regulation and informational functions [125]. Thus, the very large size of BE is a significant outcome most likely shaped by vertical evolutionary scenarios and cannot solely be explained by parasitic/symbiotic relationships that exist between Bacteria and Eukarya [123]. Moreover, the simplicity of archaeal FSF repertoires is not due to the paucity of available archaeal genomic data. We confirmed that the mean FSF coverage (i.e., number of proteins annotated to FSFs/GOs out of total) for Archaea, Bacteria, and Eukarya was largely comparable (e.g., Table S1 in [122]). Finally, as we will describe below, these comparative patterns and tentative conclusions are confirmed by phylogenomic analysis [91, 122, 126]. Congruent phylogenies were obtained with (Figure 4(c)) and without equal and random sampling of taxa. Thus, the relatively low number of archaeal genomes is also not expected to compromise our inferences.
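The proportions quoted for the FSF and GO term distributions can be rechecked with a few lines of arithmetic. The counts below are the ones given in the text. The parenthetical values 672 and 306 are read here as the total numbers of widely shared (f > 0.6) FSFs and GO terms, which is an interpretation of the text rather than a stated definition. The dictionary keys are illustrative names.

```python
# Counts quoted in the text for FSFs and terminal GO terms.
fsf = {"total": 1733, "ABE": 786, "BE": 324, "AE": 38, "AB": 38,
       "ABE_wide": 453, "all_wide": 672}
go = {"total": 1924, "ABE": 526, "BE": 272,
      "ABE_wide": 232, "all_wide": 306}


def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


fsf_universal = percent(fsf["ABE"], fsf["total"])        # about 45%
fsf_wide_abe = percent(fsf["ABE_wide"], fsf["all_wide"])  # about 67%
be_fold = fsf["BE"] / fsf["AE"]                           # about 8.5-fold
go_universal = percent(go["ABE"], go["total"])            # about 27%
go_wide_abe = percent(go["ABE_wide"], go["all_wide"])     # about 76%

for label, value in [("FSFs in ABE (%)", fsf_universal),
                     ("widely shared FSFs in ABE (%)", fsf_wide_abe),
                     ("BE versus AE FSFs (fold)", be_fold),
                     ("GO terms in ABE (%)", go_universal),
                     ("widely shared GO terms in ABE (%)", go_wide_abe)]:
    print(f"{label}: {value:.1f}")
```

Under this reading, the values agree with the rounded figures in the text (about half, about 70%, about a quarter, and about 76%), and the BE to AE ratio of roughly 8.5 is of the same order as the ∼10-fold difference described.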

We note that Eukarya shares many informational genes with Archaea and many operational genes with Bacteria. While informational genes have been thought more refractory to HGT than noninformational genes, we have confirmed at GO level that this is not the case. A ToL built from GO terms showed that in fact noninformational terms were less homoplasious, while statistical enrichment analyses revealed that HGT had little if any functional preference for GO terms across GO hierarchical levels. We recently compared ToLs reconstructed from non-HGT GO terms and ToLs reconstructed from informational GO terms that were extracted from non-HGT GO terms (Kim et al., ms. in review; also read below). In both cases, Archaea appeared as a basal paraphyletic group of the ToLs and the common origin of Bacteria and Eukarya was maintained. Thus, the 272 GO terms shared by Bacteria and Eukarya harbor a strong vertical trace.

Another observation often used as support for Archaea-Eukarya kinship is the discovery of a few eukaryote-specific proteins (e.g., actin, tubulin, H3, H4, ESCRT, ribosomal proteins, and others) in some archaeal species [127] and their complete absence from Bacteria (except for documented tubulin HGT from eukaryotes to bacterial genus Prosthecobacter [128]). This suggests either that eukaryotes arose from an archaeal lineage [127] or that the ancestor of Eukarya and Archaea was complex and modern Archaea are highly reduced [25]. Indeed, few eukaryote-specific proteins have now been found in some archaeal species (in some cases just one species). We argue that this poor spread cannot be taken as evidence for the Archaea-Eukarya sister relationship. This needs to be confirmed by robust phylogenetic analysis, which is unfortunately not possible when using protein sequences. Recently, we described a new strategy for inferring vertical and horizontal traces [123]. This method calculates the spread of FSFs and GOs that are shared between two-superkingdom groups (i.e., AB, AE, and BE). Balanced distributions often indicate vertical inheritance, while biased distributions suggest horizontal flux. For example, penicillin binding molecular activity (GO:0008658) was present in 100% of the sampled bacterial proteomes but was only present in 11% of the archaeal species [123]. Thus, presence of GO:0008658 in Archaea was attributed to HGT gain from Bacteria. Using this simple method, we established that both Bacteria and Eukarya were united by much stronger vertical trace than either was to Archaea. In fact, strong reductive tendencies in the archaeal genomes were recorded [123]. Thus, in our opinion, presence of eukaryote-specific proteins in only very few archaeal species could in fact be an HGT event that is not detectable by sequence phylogenies.
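The logic of comparing spreads between two superkingdoms can be sketched as follows. The sketch takes the fraction of proteomes holding a feature in each of two domains of life and labels the pair as balanced or biased. The cutoff ratio of 3 and the function name are hypothetical choices made for illustration; they are not the criterion used in [123].

```python
def classify_trace(f_first, f_second, max_ratio=3.0):
    """Label a feature shared by two domains from its spread in each.

    f_first, f_second: fraction (0 to 1) of sampled proteomes of each domain
    that hold the feature. max_ratio is a hypothetical cutoff.
    """
    low, high = sorted((f_first, f_second))
    if low == 0:
        return "absent in one domain"
    if high / low <= max_ratio:
        return "balanced (suggests vertical inheritance)"
    return "biased (suggests horizontal flux)"


# Spread of GO:0008658 quoted in the text: 100% of Bacteria, 11% of Archaea.
penicillin_binding = classify_trace(1.00, 0.11)

# Hypothetical feature with similar spread in two domains.
balanced_example = classify_trace(0.80, 0.70)

print(penicillin_binding)
print(balanced_example)
```

With these inputs, the penicillin binding activity is labeled as biased, in line with its attribution to HGT gain from Bacteria.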

In turn, many arguments now favor the Bacteria-Eukarya sisterhood, in addition to the structure-based phylogenies and balanced distribution of molecular features. For example, Bacteria and Eukarya have similar lipid membranes that can be used as argument for their evolutionary kinship. Moreover, Archaea fundamentally differ from Eukarya in terms of their virosphere. Viruses infecting Archaea and Eukarya are drastically different, as recently discussed by Forterre [25].

The genomic distribution of RNA families was taken from Hoeppner et al. [129] and also shows a vertical evolutionary trace in five crucial Rfam clans that are universal, including tRNA, 5S rRNA, subunit rRNA, and RNase P RNA. These universal RNA groups are likely minimally affected by HGT. However, 99% of Rfam clans and families are specific to domains of life and only 11 of the 1148 groups were shared at f > 0.6 levels. This clearly shows that the functional complexity of RNA materialized very late during organismal diversification and that it is not a good genomic feature for exploring the rooting of the ToL. Only five RNA families are shared between two domains of life, and only one of these does so at f > 0.6, the G12 pseudoknot of the 23S rRNA, which is present in bacterial and eukaryotic organellar rRNA. While the large subunit rRNA scaffold supports the G12 pseudoknot with its vertical trace, all five interdomain RNA families can be explained most parsimoniously by HGT. Thus, only a handful of ancient and universal RNA species can be used to root the ToL.

We end by noting that the Venn diagrams consistently show that Archaea harbors the least number of unique (A) and shared (AB and AE) FSFs, GO terms, or RNA families. This trend supports an early divergence of this domain of life from the urancestor and the possibility that such divergence be shaped by evolutionary reductive events. We reason that such losses would be more parsimonious early on in evolution than in the later periods. This is because genes often increase their abundance in evolutionary time (by gene duplications, HGT, and other processes). Thus, it is reasonable to think that loss of an ancient gene would be more feasible very early in evolution relative to losing it very late.

6.2. Phylogenomic Evidence from the Sequence and Structure of RNA. Phylogenetic analyses of the few RNA families that are universal and display an important vertical evolutionary trace (Figure 6(c)) provide compelling evidence in favor of an early evolutionary appearance of Archaea [105, 107, 108, 130–135]. Here we briefly summarize evidence from tRNA, 5S rRNA, and RNase P RNA. Unpublished analyses of rRNA sequence and structure using advanced phylogenetic methods also show that Archaea was the first domain of life.

tRNA molecules are generally short (∼73–95 nucleotides in length) and highly conserved. Consequently, their sequence generally contains a limited amount of phylogenetic information. These limitations have been overcome by analyzing entire tRNomes [136], which extend the length of an organismal set of tRNAs to over 2000 bases. Xue et al. [130, 131] analyzed the genetic distances between tRNA sequences as averages between alloacceptor tRNAs from diverse groups of tRNomes using multiple molecular base substitution models. The distances were mapped onto an unrooted phylogeny of tRNA molecules (Figure 8(a)). The “arrow of time” assumption in these studies is that ancient tRNA paralogs closely resemble each other when lineages originate close to the time of the gene duplication. Remarkably, the results revealed a paraphyletic rooting of the ToL in Archaea. The root was specifically located close to the hyperthermophilic methanogen Methanopyrus kandleri (Figure 8(a)). The hypothesis of this specific rooting scenario has been supported by several other studies [134, 137], including a study of genetic distances between paralogous pairs of aminoacyl-tRNA synthetase (aaRS) proteins [131]. A remarkable match between distance scores of tRNA and pairs of aaRS paralogs (Figure 8(b)) not only confirms the early appearance of Archaea but also suggests a coevolutionary trace associated with molecular interactions that are responsible for the genetic code. In fact, a recent exhaustive phylogenomic analysis of tRNA and aaRS coevolution explicitly reveals the origins and evolution of the genetic code and the underlying molecular basis of genetics [59]. Di Giulio [132, 133] has also proposed an archaeal rooting of the tree of life, specifically in the lineage leading to the phylum of Nanoarchaeota. This rooting is based on unique and ancestral genomic traits of Nanoarchaeum equitans: split genes separately codifying for the 5′ and 3′ halves of tRNA and the absence of operons, which are considered molecular fossils [132]. However, this claim needs additional support, as contrasting evidence now recognizes N. equitans as a highly derived archaeal species [138].
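The idea of an average alloacceptor distance can be sketched as follows. The sketch assumes one pre-aligned representative tRNA per amino acid for a genome and uses the raw fraction of mismatched positions as the distance. The published analyses used multiple base substitution models, so this is a simplification. The sequences are invented ten-base strings, and the function names are illustrative only.

```python
from itertools import combinations
from math import comb


def mismatch_fraction(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_alloacceptor_distance(trnas):
    """Average pairwise distance over all pairs of alloacceptor tRNAs.

    trnas: dict mapping an amino acid label to one aligned tRNA sequence.
    """
    pairs = list(combinations(sorted(trnas), 2))
    total = sum(mismatch_fraction(trnas[p], trnas[q]) for p, q in pairs)
    return total / len(pairs)


# Hypothetical, invented sequences standing in for aligned tRNAs.
toy_trnas = {
    "Ala": "GGGGCUAUAG",
    "Gly": "GCGGCUAUAG",
    "Ser": "GGAGAGAUGG",
}
d_toy = mean_alloacceptor_distance(toy_trnas)
print(round(d_toy, 3))

# With one tRNA for each of 20 amino acids there are 190 pairs to average.
print(comb(20, 2))
```

Under the “arrow of time” assumption, genomes with smaller averages of this kind are those whose tRNA paralogs still resemble each other and are therefore placed closer to the root.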

In addition to sequences, structural features of tRNA molecules also support the archaeal rooting of the ToL.

[Figure 8, panel (a): unrooted ToL of tRNA sequences with terminals labeled by three-letter organism codes from Archaea, Bacteria, and Eukarya, colored by Dallo on a scale from 0.35 to 0.92. Panel (b): scatter plot of aminoacyl-tRNA synthetase history (Qars) against tRNA alloacceptor history (Dallo) for Archaea, Bacteria, and Eukarya, with Mka at the ancient-origins end of the plot.]

Figure 8: Ancient phylogenetic signal in the sequence of tRNA and their associated aminoacyl-tRNA synthetase (aaRS) enzymes. (a) Unrooted ToL derived from tRNA sequences with their alloacceptor Dallo distances traced in thermal scale (from [130]). Dallo is average of pairwise distances for 190 pairs of tRNA isoacceptors. These distances measure pairwise sequence mismatches of tRNAs for every genome and their values increase for faster evolving sequences of species of more recent origin. (b) Coevolution of aaRSs and their corresponding tRNA. Genetic distances between the top 10 potentially paralogous aaRS pairs estimated using BLASTP define a measure (Qars) of how closely the proteins resemble each other in genomes (from [131]). Larger Qars scores imply more ancestral and slowly evolving protein pairs. The plot of Qars scores against Dallo distances reveals a hidden correlation between the evolution of tRNA and aaRSs and the early origin of Archaea (larger Qars and lower Dallo distances).

The application of RNA structural evidence in phylogenetic studies [50, 139–141] has multiple advantages over sequence data when studying ancient events, especially because RNA structures are far more conserved than sequences. This has been demonstrated in a phylogenetic approach that uses RNA structural information to reconstruct evolutionary history of macromolecules such as rRNA [29, 50], tRNA [107, 108], 5S rRNA [105], RNase P RNA [135], and SINE RNA [142]. Geometrical and statistical properties of structure (e.g., stems or loops commonly found in the secondary structures of RNA molecules) are treated as linearly ordered multistate phylogenetic characters. In order to build rooted trees, an evolutionary tendency toward conformational order is used to polarize change in character state transformations. This defines a hypothetical ancestor with which to root the ingroup using the Lundberg method. Reconstructed phylogenies produce trees of molecules and ToLs (e.g., Figure 4(a)) or trees of substructures that describe the gradual evolutionary accretion of structural components into molecules (Figure 9). For example, phylogenetic trees of tRNA substructures define explicit models of molecular history and show that tRNA originated in the acceptor stem of the molecule [108]. Remarkably, trees reconstructed from tRNA drawn from individual domains of life demonstrate that the sequence of accretion events occurred differently in Archaea than in Bacteria and Eukarya, suggesting a sister group evolutionary relationship between the bacterial and eukaryotic domains (Figure 9(a)). A similar result obtained from trees of 5S rRNA substructures revealed different molecular accretion sequences of archaeal molecules when these were compared to bacterial and eukaryal counterparts [105], confirming again, in a completely different molecular system, the history of the domains of life.

An analysis of the structure of RNase P RNA also provides similar conclusions [135]. While a ToL reconstructed from molecular structure placed type A archaeal molecules at its base (a topology that resembles the ToL of 5S rRNA), a tree of RNase P RNA substructures uncovered the history of molecular accretion of the RNA component of the ancient endonuclease and revealed a remarkable reductive evolutionary trend (Figure 9(b)). Molecules originated in stem P12 and were immediately accessorized with the P1–P4 catalytic pseudoknotted core structure that interacts with RNase P proteins of the endonuclease complex and ancient segments of tRNA. Soon after this important accretion stage, the evolving molecule loses its first stem in Archaea (stem P8), several accretion steps earlier than the first loss of a stem in Eukarya or the first appearance of a Bacteria-specific stem. These phylogenetic statements provide additional strong support to the early origin of the archaeal superkingdom prior to the divergence of the shared common ancestor of Bacteria and Eukarya. As we will discuss below, the early loss of a structure in the molecular accretion process of a central and ancient RNA family is significant. It suggests

[Figure 9, panel (a): rooted trees of tRNA arm substructures (Acc, AC, DHU, TΨC, and Var) for Archaea, Bacteria, and Eukarya, colored by age (nd) from ancestral (0) to derived (1). Panel (b): tree of RNase P RNA substructures (stems P1 to P20) with an age (nd) scale from ancestral (0) to derived (1) and annotations for ancient universal stems, the universal pseudoknot, the first stem lost in Archaea (type M), the type A pseudoknot (first loss in Eukarya), the first stem specific to Bacteria (type A), the first stems specific to Archaea (type A), and bacterial type B-specific stems.]

Figure 9: The history of accretion of tRNA and RNase P RNA substructures reveals the early evolutionary appearance of Archaea. (a) Rooted trees of tRNA arm substructures reveal the early appearance of the acceptor arm (Acc) followed by the anticodon arm (AC) in Archaea or the pseudouridine (TΨC) arm in both Bacteria and Eukarya (from [108]). The result confirms the sister group relationship of Bacteria and Eukarya. (b) Trees of molecular substructures of RNase P RNAs were reconstructed from characters describing the geometry of their structures (from [135]). Branches and corresponding substructures in a 3D atomic model are colored according to the age of each substructure (nd, node distance). Note the early loss of stem P8 in Archaea immediately after the evolutionary assembly of the universal functional core of the molecule.

that the emerging archaeal lineages were subjected to strong reductive evolutionary pressures during the early evolution of a very ancient RNA molecule.

6.3. Phylogenomic Evidence from Protein Domain Structure. G. Caetano-Anollés and D. Caetano-Anollés [41] were the first to utilize protein domain structures as taxa and to reconstruct trees of domains (ToDs) describing their evolution. Figure 10 shows a rooted ToD built from a census of FSF structures in 981 genomes (data taken from [122, 123]). These trees are unique in that their terminal leaves represent a finite set of component parts [87]. These parts describe at global level the structural diversity of the protein world. When building trees of FSFs, the age of each FSF domain structure can be calculated from the ToDs by simply calculating a distance in nodes between the root of the tree and its corresponding terminal leaf. This is possible because ToDs exhibit highly unbalanced (pectinate) topologies that are the result of semipunctuated processes of domain appearance and accumulation. This node distance (nd), rescaled from 0 to 1, provides a relative timescale to study the order of FSF appearance in evolutionary history [126, 143]. Wang et al. [144] showed that nd correlates linearly with geological time and defines a global molecular clock of protein folds. Thus, nd can be used as a reliable proxy for time. Plotting the age of FSFs in each of the seven taxonomic groups confirmed evolutionary statements we had previously deduced from the Venn diagrams of Figure 6. The ABE taxonomic group included the majority of ancient and widely distributed FSFs. This is expected. In the presence of a strong vertical trace, molecular diversity must delimit a nested taxonomic hierarchy. The ABE group was followed by the evolutionary appearance of the BE group, which preceded the first domain-specific structures, which were Bacteria-specific (B). Remarkably, Archaea-specific (A) and Eukarya-specific (E) structures appeared concurrently and relatively late. These general trends, captured in the box plots of Figure 10, have been recovered repeatedly when studying domain structures at various levels of structural complexity, from folds to fold families [91, 126], when using CATH or SCOP structural definitions [122, 145], or when exploring the evolution of terminal GO terms.
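The node distance calculation can be sketched on a toy tree. The sketch assumes a rooted tree written as nested tuples with leaf names as strings, counts the nodes crossed from the root to each leaf, and rescales the counts so that the most basal leaf has nd = 0 and the most derived has nd = 1. The tree, the labels FSF_1 to FSF_5, and the exact rescaling are hypothetical; the published procedure may differ in detail.

```python
def node_depths(tree, depth=0, out=None):
    """Record, for every leaf, the number of nodes between it and the root."""
    if out is None:
        out = {}
    if isinstance(tree, str):          # a leaf (one FSF)
        out[tree] = depth
        return out
    for child in tree:                 # an internal node
        node_depths(child, depth + 1, out)
    return out


def relative_ages(tree):
    """Rescale node distances to the 0 (oldest) to 1 (youngest) interval."""
    raw = node_depths(tree)
    low, high = min(raw.values()), max(raw.values())
    if high == low:
        return {leaf: 0.0 for leaf in raw}
    return {leaf: (d - low) / (high - low) for leaf, d in raw.items()}


# Hypothetical pectinate (comb-like) tree of five FSFs.
toy_tree = ("FSF_1", ("FSF_2", ("FSF_3", ("FSF_4", "FSF_5"))))
ages = relative_ages(toy_tree)

for leaf in sorted(ages):
    print(leaf, round(ages[leaf], 2))
```

Because the tree is pectinate, each leaf sits at a distinct step of the comb, so the node counts order the FSFs from the most basal to the most derived.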

While the early rise of BE FSFs supports the early divergence of Archaea from the urancestor, the very significant trend of gradual loss of structures occurring in the lineages of the archaeal domain and the very late appearance of Archaea-specific structures (e.g., [126]) demand explanation. Since ancient BE FSFs are widely distributed in proteomes (Figure 10), they cannot arise from separate gains of FSFs in Bacteria and Eukarya or by processes of horizontal spread of structures. This was already evident from the Venn diagrams of Figure 6. Moreover, the BE sisterhood to the exclusion of Archaea was further supported by the inspection of FSFs involved in lipid synthesis and transport (Table 1). Membrane lipids are very relevant to the origins of diversified life ([146] and references therein). Bacteria and Eukarya encode similar lipid membranes, while archaeal membranes have different lipid composition (isoprenoid ethers). To check if

[Figure 10 graphic: rooted ToD with box plots of the age of FSFs (nd, axis from 0.0 to 1.0) for the ABE, BE, AB, AE, B, A, and E taxonomic groups ("Origin of FSF domains"). Tree statistics: taxa = 1733 FSF domains; characters = 981 genomes; tree length = 458226 steps; CI = 0.049, RI = 0.796, and g1 = −0.08. Basal FSFs labeled on the tree: c.55.1, c.37.1, b.40.4, c.2.1, a.4.5, and c.66.1. Color legend, with the number of FSFs in each distribution class: f from 0.0 to 0.1 (529); f from 0.1 to 0.5 (623); f from 0.5 to 0.9 (336); f from 0.9 to 1.0 (228); f = 1.0 (17).]

Figure 10: Phylogenomic tree of domains (ToD) describing the evolution of 1733 FSF domain structures. Taxa (FSFs) were colored according to their distribution (f) in the 981 genomes that were surveyed and used as characters to reconstruct the phylogenomic tree (data taken from [122, 123]). The most basal FSFs are labeled with SCOP alphanumeric identifiers (e.g., c.37.1 is the P-loop containing nucleoside triphosphate hydrolase FSF). Boxplots display the age (nd value) distribution of FSFs for the seven possible taxonomic groups. nd values were calculated directly from the tree [122] and define a timeline of FSF innovation, from the origin of proteins (nd = 0) to the present (nd = 1). The group of FSFs that are shared by the three domains of life (ABE) is the most ancient taxonomic group, which spans the entire time axis, and their FSFs are widely distributed in genomes. The appearance of the BE group coincides with the first reductive loss of an FSF in Archaea. FSF structures specific to domains of life appear much later in evolution.

lipid synthesis was another BE synapomorphy, we identified 17 FSFs that were involved in lipid metabolism and transport


Table 1: List of FSFs involved in lipid metabolism and transport, along with taxonomic distribution (data taken from [122, 123]).

Group  SCOP Id  FSF Id   FSF description
ABE    89392    b.125.1  Prokaryotic lipoproteins and lipoprotein localization factors
ABE    53092    c.55.2   Creatinase/prolidase N-terminal domain
ABE    49723    b.12.1   Lipase/lipooxygenase domain (PLAT/LH2 domain)
ABE    54637    d.38.1   Thioesterase/thiol ester dehydrase-isomerase
ABE    63825    b.68.5   YWTD domain
BE     47027    a.11.1   Acyl-CoA binding protein
BE     48431    a.118.4  Lipovitellin-phosvitin complex, superhelical domain
BE     55048    d.58.23  Probable ACP-binding domain of malonyl-CoA ACP transacylase
BE     56968    f.7.1    Lipovitellin-phosvitin complex, beta-sheet shell regions
BE     58113    h.5.1    Apolipoprotein A-I
BE     47162    a.24.1   Apolipoprotein
BE     56931    f.4.2    Outer membrane phospholipase A (OMPLA)
B      82220    b.120.1  Tp47 lipoprotein, N-terminal domain
E      47699    a.52.1   Bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin
E      82936    h.6.1    Apolipoprotein A-II
E      57190    g.3.10   Colipase-like
E      49594    b.7.4    Rab geranylgeranyltransferase alpha-subunit, insert domain

(Table 1). Remarkably, the largest share of these FSFs (7 out of 17) was unique to the BE group. In comparison, none were present in either the AE or AB groups. The ABE group included five universal FSFs, while one was unique to Bacteria and four were Eukarya-specific. In turn, no FSF was unique to Archaea. The BE FSFs cannot be explained by modern effects impinging on variations in proteomic accumulation in FSFs or by processes of domain rearrangement, since these appear in the protein world quite late in evolution [92]. The only, and most parsimonious, explanation of the patterns of FSF distribution that unfold in the ToD is the very early (and protracted) rise of the archaeal domain by processes of reductive evolution, possibly triggered by the adaptation of urancestral lineages to harsh environments and survival modes. Under extremophilic conditions, typical of hyperthermophilic environments, considerable investments of matter-energy and information must be made for protein persistence [120]. This puts limits on viable protein structures [147]. Extremophilic environments will thus poise the maintenance of a limited set of FSFs for persistence of emergent diversified lineages. This would induce a primordial episode of reductive evolution in the growing FSF repertoire, explaining why hyperthermophilic and thermophilic archaeal species hold the most reduced proteomes (Figure 7). It would also explain the biases that exist in FSFs, GO terms, and RNA families (Figure 6) and the placement of hyperthermophilic and thermophilic archaeal species at the base of ToLs. Since Archaea populate the oceans and sometimes rival Bacteria in number in those environments, we further interpret the late appearance of Archaea-specific FSFs as the result of late colonization of these mild environments by both ancient archaeons and emerging eukaryotes. This relaxes primordial extremophilic pressures on protein structures and enables the late archaeal exploration of structural flexibility and functional novelty.
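The group counts quoted for the lipid-related FSFs can be checked directly against Table 1. The short Python sketch below tallies the 17 entries by taxonomic group; the group assignments are transcribed from Table 1, while the variable names are illustrative.

```python
from collections import Counter

# Taxonomic group of each of the 17 FSFs in Table 1, keyed by SCOP Id.
TABLE1_GROUPS = {
    89392: "ABE", 53092: "ABE", 49723: "ABE", 54637: "ABE", 63825: "ABE",
    47027: "BE", 48431: "BE", 55048: "BE", 56968: "BE", 58113: "BE",
    47162: "BE", 56931: "BE",
    82220: "B",
    47699: "E", 82936: "E", 57190: "E", 49594: "E",
}

counts = Counter(TABLE1_GROUPS.values())
# A Counter returns 0 for groups with no entries (AB, AE, and A here).
for group in ("ABE", "BE", "AB", "AE", "B", "A", "E"):
    print(f"{group:>3}: {counts[group]}")
```

The tally gives 5 ABE, 7 BE, 1 B, and 4 E entries, with no FSF in the AB, AE, or A groups, so BE is the largest single group among the 17.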

6.4. Full Circle: Evidence from Trees of Proteomes and Functionomes and a Tree Derived from the Distribution of Viral Replicons in Superkingdoms. While we distrust trees of systems, especially ToLs built from sequences, the use of molecular structure at high levels of structural abstraction has the potential to mitigate some limitations of sequence analysis [93]. For example, rooted ToLs built from abundance counts of domain structures and terminal GO terms in the genomes of free-living organisms describe the evolution of proteomes (e.g., [91]). All ToL reconstructions of these kinds approximate the physiology of living organisms, dissect the three primary domains of life, and reveal the early paraphyletic origin of extremophilic archaeal lineages, followed by the late appearances of monophyletic Bacteria and Eukarya. These patterns have been reliably recovered with datasets of varying sizes, irrespective of the structural classification scheme [91, 122, 126, 145]. Even a tree reconstructed from the distribution of 2662 viral replicons in superkingdoms from an exhaustive comparative genomic analysis of viral genomes showed the basal placement of Archaea and the sister taxa relationship between Bacteria and Eukarya (Figure 11).
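The caption of Figure 11 reports the usual parsimony tree statistics (CI, RI, HI, and RC). As a reminder of how these indices relate to one another, the sketch below computes them from the minimum conceivable (m), observed (s), and maximum conceivable (g) numbers of steps, using the standard definitions CI = m/s, HI = 1 − CI, RI = (g − s)/(g − m), and RC = CI × RI. The function name and the example values of m and g are assumptions made for illustration; the caption gives only the observed length of 58 steps and the resulting indices.

```python
def parsimony_indices(m, s, g):
    """Ensemble consistency (CI), homoplasy (HI), retention (RI), and
    rescaled consistency (RC) indices computed from step counts.

    m: minimum conceivable number of steps
    s: observed number of steps on the tree
    g: maximum conceivable number of steps
    """
    ci = m / s
    # RI is undefined when g == m; returning 0.0 here is a convention
    # chosen for this sketch only.
    ri = (g - s) / (g - m) if g != m else 0.0
    return {"CI": ci, "HI": 1 - ci, "RI": ri, "RC": ci * ri}


# A tree without homoplasy has s == m, so CI = RI = RC = 1 and HI = 0.
# g = 80 is a made-up value; any g > m gives the same result when s == m.
indices = parsimony_indices(m=58, s=58, g=80)
print(indices)
```

Under these definitions, the values CI = 1, RI = 1, HI = 0, and RC = 1 reported for Figure 11 correspond to a tree on which every character change is explained without homoplasy.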

6.5. Additional Evidence from Comparative Genomics. The uneven distribution of protein domain structures in the world


[Figure 11 graphic: rooted tree with terminal taxa Archaea, Bacteria, and Eukarya; scale bar, 5 changes; bootstrap support value, 100; internal-node abundance vectors, 5 0 0 0 0 0 and 36 5 5 5 0 0.]

Figure 11: One of three optimal phylogenetic tree reconstructions with identical topologies recovered from an exhaustive maximum parsimony search (58 steps; CI = 1, RI = 1, HI = 0, RC = 1; g1 = −0.707) of the abundance of viral replicon types of dsDNA, ssDNA, dsRNA, ssRNA(+), ssRNA(−), and retrotranscribing viruses. Abundance was scored on a 0–20 scale and ranged from 0 to 759 viral replicons. Vectors of abundance reconstructions in internal nodes are given as percentage of total abundance of replicons in superkingdoms. Bootstrap support values are shown below nodes.

of proteomes (Figure 6) is preserved as we climb in the structural hierarchy. This was recently made evident when studying the evolution of CATH domains [145]. The Venn diagrams of Figure 12 show how domain structures in all taxonomic groups decrease in numbers with increases in evolutionary conservation. At the highest CATH architectural level, there were 32 universal architectures but no domain-specific architectures. The four architectures shared by Bacteria and Eukarya were present in at least 60% of the proteomes that were surveyed. The other two interdomain architectures were topological designs of considerable complexity that were poorly shared between proteomes and were evolutionarily derived. They were the clam architectures lost in Eukarya and the box architectures of nucleotide excision repair shared by Archaea and Eukarya. The most parsimonious corollary of these distribution patterns is that the BE taxonomical group must arise by loss of structures in Archaea. Indeed, ToDs describing the evolution of CATH domain structures again confirm the early appearance of BE structures and, consequently, their loss in Archaea.
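The taxonomic groups used in the Venn diagrams follow from a simple presence/absence rule: a structure is assigned to the group named by the superkingdoms in which it is detected. The Python sketch below illustrates the rule under the assumption that a structure counts as present in a superkingdom if it occurs in at least one of its proteomes; that threshold, the function name, and the toy occurrence data are assumptions for illustration and are not the criteria or data of [145].

```python
from collections import Counter


def venn_group(superkingdoms):
    """Name the Venn group (e.g., 'ABE', 'BE', 'A') for a set of superkingdoms."""
    order = "ABE"  # Archaea, Bacteria, Eukarya
    present = {name[0].upper() for name in superkingdoms}
    return "".join(letter for letter in order if letter in present)


# Hypothetical occurrences: structure -> superkingdoms in which it was found.
occurrences = {
    "arch_1": {"Archaea", "Bacteria", "Eukarya"},
    "arch_2": {"Bacteria", "Eukarya"},
    "arch_3": {"Archaea", "Eukarya"},
    "arch_4": {"Bacteria"},
}

tally = Counter(venn_group(kingdoms) for kingdoms in occurrences.values())
print(dict(tally))
```

Applied to the architecture counts given in the text (32 ABE, 4 BE, 1 AB, and 1 AE), a tally of this kind sums to the 38 CATH architectures reported for Figure 12.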

7. Paraphyletic Origins: Grades and Clades in Archaeal History

Saying that “a ToL is rooted in a domain of life” is an incorrect statement that comes from phylogenetic methodology (the use of outgroups) and the tendency to look at the past with modern eyes. Clades in ToLs have been rooted relative to each other by generating unrooted trees and by defining extant organisms as outgroup taxa. The ToL, however, must be considered rooted in the urancestor of cellular life (Figure 1). This planted edge that connects to the ingroup of extant organisms represents a cellular state in which productive diversification (in the sense of successful lineages) was absent. The primordial urancestral edge leads to a “phase transition,” the last universal cellular ancestor (LUCA), of which little is known. The physiology of the urancestor cannot be considered linked to that of any extant organism, even if it shared a common molecular core with all of them. The urancestor was not an archaeon, a bacterium, or a eukaryote [148]. It was not necessarily thermophilic. Perhaps it was a communal entity or a megametaorganism in the sense of a modern syncytium (the result of multiple cell fusions) and a modern coenocyte (the result of multiple cell divisions). The organismal boundaries were likely present, judging by the number of widely distributed protein domains that associate with membranes and appear at the base of our ToDs and by the universal existence of acidocalcisome organelles [149]. However, the molecular makeup of the urancestral cells was most likely fluid and quasistatistical, with the repertoire distributed unequally in the urancestral populations of communal parts, of course within confines delimited by persistence. This urancestral population is therefore consistent with the idea of a primordial stem line proposed by Kandler and Woese [150, 151]. However, it was relatively richer in molecular structures and functions, as opposed to the simple cellular systems hypothesized by Woese. This richness is confirmed by modern analyses of proteomes and functionomes that reveal vast numbers of universally shared protein domains and GOs among the three superkingdoms. While each syncytial/coenocyte element of the megaorganism exchanged component parts in search of cellular stability and persistence, the process could not be equated with modern HGT. The exchanging community of primordial cells was not cohesive enough to make the horizontal exchange meaningful. Macromolecules most likely established loose and diverse associations with each other and with smaller molecules, limited by the short average life of their unevolved structural conformations. With time, molecules with better-optimized properties engaged in more durable interactions, stabilizing the emergent cells and providing increased cellular cohesiveness. This poised the urancestral community towards a phase transition (a crystallization [148]), a point in which cellular groups had distinct properties and could be individuated. We believe this was the time of the origin of the archaeal lineage, 2.9 billion years ago [48].

At the base of the ToLs that were reconstructed from genomic data, basal archaeal taxa arise as paraphyletic lineages (Figures 4(a) and 4(c)). These lineages likely arose from subgroups of the urancestral population that pervasively lost crucial domain structures and molecular functions. This represents an evolutionary grade under the scenario described above. The emerging lineages shared with the urancestral community a unifying condition that was related to archaic biochemistry. In other words, the urancestral and emerging archaeal lineages expressed fundamental structural and functional equivalences in terms of their repertoires but revealed in each emerging and durable paraphyletic lineage a handful of distinct, newly developed traits. These traits could be global, such as increased thermostability of some crucial members of the protein repertoire or change in the membrane makeup, or local, such as the selective loss of crucial structures and functions. Figure 13 uses the tree paradigm to portray the structural and functional equivalences of the urancestral and emerging archaeal lineages and the slow progression from grades to clades.
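The distinction between a paraphyletic grade and a monophyletic clade can be made concrete with a small test for monophyly. The sketch below is illustrative only: it assumes a rooted tree written as nested tuples, and the toy topology and taxon names are hypothetical rather than taken from Figure 4.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def is_monophyletic(tree, group):
    """True if some subtree contains exactly the taxa in `group`."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Toy rooted tree: archaeal lineages branch off one by one at the base
# (a paraphyletic grade), while Bacteria and Eukarya each form a clade.
toy_tol = ("arc1", ("arc2", ("arc3", (("bac1", "bac2"), ("euk1", "euk2")))))

print(is_monophyletic(toy_tol, {"bac1", "bac2"}))          # True
print(is_monophyletic(toy_tol, {"arc1", "arc2", "arc3"}))  # False
```

In the toy tree no single subtree holds the three archaeal taxa and nothing else, so they form a grade at the base, whereas the bacterial and eukaryal taxa are each recovered as clades.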


[Figure 12 graphic: three Venn diagrams of Archaea, Bacteria, and Eukarya (panels: Superfamilies, Topologies, and Architectures), arranged along an axis of increasing evolutionary conservation.]

Figure 12: Venn diagrams displaying the distributions of 2221 homologous superfamilies, 1152 topologies, and 38 architectures of CATH domains in the proteomes of 492 fully sequenced genomes (from [145]). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

[Figure 13 graphic: cartoon of lineages progressing from grades to clades between Time 0 and Time 1.]

Figure 13: From grades to clades. The cartoon describes a possible progression of modes of organismal diversification during the rise of primordial archaeal lineages. The width of emerging lineages is proportional to uniquely identifying features of physiological and molecular complexity.

8. Through the Wormhole: The Makeup of the “Megaorganism” and the Emerging Archaeal Lineages

Character state reconstructions of proteome repertoires derived from ToLs, coupled to the timelines of ToDs, provide an effective way to define the ancestral protein domain complement of the urancestor [48] and, consequently, the likely makeup of the emerging archaeal lineages. The urancestral proteome possessed a lower bound of ∼70 FSF domain structures, 75% of which were composed of α/β and α+β proteins. About 50% of FSFs were part of metabolic enzymes, including a rich toolkit of transferases and enzymes of nucleotide metabolism. The rest of the domains were involved in functions related to information (translation, replication, and repair), intracellular processes (transport, protein modification, and proteolytic activities), regulation (kinases/phosphatases and DNA binding functions), and small molecule binding. The urancestor had a limited repertoire of aaRSs and translation factors. It contained a primordial ribosome with a limited core of universal ribosomal proteins. It had numerous membrane proteins necessary for transport, including a relatively advanced ATP synthetase complex, and structures necessary for cellular organization (filaments and primordial cytoskeletal structures). The cells lacked enzymes for deoxyribonucleotide production, so it is likely that the cellular urancestor itself did not harbor a DNA genome. The cells lacked functions related to extracellular processes (cell adhesion, immune response, and toxins/defense) and cellular motility, suggesting an ancient living world without competitive strategies of survival.

9. Conclusions

The rooting of the ToL has always been controversial in evolutionary biology [26, 152, 153]. While it is popularly accepted that the ToL based on sequence phylogenies is rooted in the akaryotes and that Archaea and Eukarya are sister groups to each other, only two of the three main steps of phylogenetic analysis [104] have been partially fulfilled with sequences. These steps are the selection of an appropriate statistical or nonstatistical evolutionary model of character change and of an optimization method for phylogenetic tree reconstruction. However, no adequate method exists for character polarization that identifies ancestral and derived character states in sequences. In the absence of robust polarization methodology, any statement about the rooting of the ToL should be considered suspect or subject of apriorism. Here we show that information derived from a genomic structural and functional census of millions of encoded proteins and RNAs, coupled with process models that comply with Weston’s generality criterion, provides the means to dissect the origins of diversified life. The generality criterion is fulfilled in these studies by focusing on the accumulation of modules, such as protein domain structures, elements of RNA substructures, or ontogenetic definitions of molecular function. In general, these features are the subject of accretion processes that comply with additive phylogenetic change within the nested taxonomic hierarchy and result in changes of abundance. These processes include those responsible for the growth of molecules (e.g., multidomain proteins), molecular ensembles (e.g., the ribosome), and molecular repertoires (e.g., proteomes). The new methods unfold a consistent evolutionary scenario in which the origin of diversified life traces back to the early history of Archaea. Remarkably, the archaic origin of this microbial urkingdom now does justice to its name.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgments

The authors would like to thank members and friends of the GCA laboratory for fruitful discussions. This research has been supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791) and the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625) to Gustavo Caetano-Anollés and from the KRIBB Research Initiative Program and the Next-Generation BioGreen 21 Program, Rural Development Administration (PJ0090192013), to Kyung Mo Kim.

References

[1] C. R. Woese, “A new biology for a new century,” Microbiology and Molecular Biology Reviews, vol. 68, no. 2, pp. 173–186, 2004.

[2] T. Grant, “Testing methods: the evaluation of discovery operations in evolutionary biology,” Cladistics, vol. 18, no. 1, pp. 94–111, 2002.

[3] E. O. Wiley, “Karl R. Popper, systematics, and classification: a reply to Walter Bock and other evolutionary taxonomists,” Systematic Zoology, vol. 24, no. 2, pp. 233–243, 1975.

[4] J. S. Farris, “Parsimony and explanatory power,” Cladistics, vol. 24, no. 5, pp. 825–847, 2008.

[5] G. Caetano-Anollés, K. M. Kim, and D. Caetano-Anollés, “The phylogenomic roots of modern biochemistry: origins of proteins, cofactors and protein biosynthesis,” Journal of Molecular Evolution, vol. 74, no. 1-2, pp. 1–34, 2012.

[6] W. Fontana, “Modeling “evo-devo” with RNA,” BioEssays, vol. 24, no. 12, pp. 1164–1177, 2002.

[7] W. Hennig, Phylogenetic Systematics, University of Illinois Press, Urbana, Ill, USA, 1999.

[8] D. Fraix-Burnet, T. Chattopadhyay, A. K. Chattopadhyay, E. Davoust, and M. Thuillard, “A six-parameter space to describe galaxy diversification,” Astronomy and Astrophysics, vol. 545, article A80, 24 pages, 2012.

[9] E. Sober, “The contest between parsimony and likelihood,” Systematic Biology, vol. 53, no. 4, pp. 644–653, 2004.

[10] E. K. Lienau and R. DeSalle, “Is the microbial tree of life verificationist?” Cladistics, vol. 26, no. 2, pp. 195–201, 2010.

[11] J. S. Huxley, “Evolutionary processes and taxonomy with special reference to grades,” Uppsala Universitets Arsskrift, vol. 6, pp. 21–39, 1958.

[12] C. R. Woese and G. E. Fox, “Phylogenetic structure of the prokaryotic domain: the primary kingdoms,” Proceedings of the National Academy of Sciences of the United States of America, vol. 74, no. 11, pp. 5088–5090, 1977.

[13] P. Forterre, “The universal tree of life and the last universal cellular ancestor: revolution and counterrevolutions,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 43–62, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[14] J. P. Gogarten, H. Kibak, P. Dittrich et al., “Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 17, pp. 6661–6665, 1989.

[15] N. Iwabe, K. Kuma, M. Hasegawa, S. Osawa, and T. Miyata, “Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 23, pp. 9355–9359, 1989.

[16] J. P. Gogarten and L. Olendzenski, “Orthologs, paralogs and genome comparisons,” Current Opinion in Genetics and Development, vol. 9, no. 6, pp. 630–636, 1999.

[17] C. R. Woese, O. Kandler, and M. L. Wheelis, “Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya,” Proceedings of the National Academy of Sciences of the United States of America, vol. 87, no. 12, pp. 4576–4579, 1990.

[18] J. Martijn and T. J. Ettema, “From archaeon to eukaryote: the evolutionary dark ages of the eukaryotic cell,” Biochemical Society Transactions, vol. 41, no. 1, pp. 451–457, 2013.

[19] C. J. Cox, P. G. Foster, R. P. Hirt, S. R. Harris, and T. M. Embley, “The archaebacterial origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 51, pp. 20356–20361, 2008.

[20] P. G. Foster, C. J. Cox, and T. M. Embley, “The primary divisions of life: a phylogenomic approach employing composition-heterogeneous methods,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2197–2207, 2009.

[21] T. A. Williams, P. G. Foster, T. M. Nye, C. J. Cox, and T. M. Embley, “A congruent phylogenomic signal places eukaryotes within the Archaea,” Proceedings of the Royal Society B: Biological Sciences, vol. 279, no. 1749, pp. 4870–4879, 2012.

[22] S. Gribaldo, A. M. Poole, V. Daubin, P. Forterre, and C. Brochier-Armanet, “The origin of eukaryotes and their relationship with the Archaea: are we at a phylogenomic impasse?” Nature Reviews Microbiology, vol. 8, no. 10, pp. 743–752, 2010.

[23] H. Philippe and P. Forterre, “The rooting of the universal tree of life is not reliable,” Journal of Molecular Evolution, vol. 49, no. 4, pp. 509–523, 1999.

[24] E. Sober and M. Steel, “Testing the hypothesis of common ancestry,” Journal of Theoretical Biology, vol. 218, no. 4, pp. 395–408, 2002.

[25] P. Forterre, “The common ancestor of Archaea and Eukarya was not an archaeon,” Archaea, vol. 2013, Article ID 372396, 18 pages, 2013.

[26] P. Forterre and H. Philippe, “Where is the root of the universal tree of life?” BioEssays, vol. 21, no. 10, pp. 871–879, 1999.

[27] S. Gribaldo and H. Philippe, “Ancient phylogenetic relationships,” Theoretical Population Biology, vol. 61, no. 4, pp. 391–408, 2002.

[28] J. K. Harris, S. T. Kelley, G. B. Spiegelman, and N. R. Pace, “The genetic core of the universal ancestor,” Genome Research, vol. 13, no. 3, pp. 407–412, 2003.

[29] A. Harish and G. Caetano-Anollés, “Ribosomal history reveals origins of modern protein synthesis,” PLoS ONE, vol. 7, no. 3, Article ID e32776, 2012.

[30] G. Caetano-Anollés and M. J. Seufferheld, “The coevolutionary roots of biochemistry and cellular organization challenge the RNA world paradigm,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 152–177, 2013.

[31] N. R. Pace, “Mapping the tree of life: progress and prospects,” Microbiology and Molecular Biology Reviews, vol. 73, no. 4, pp. 565–576, 2009.

[32] P. de Rijk, Y. van de Peer, I. van den Broeck, and R. de Wachter, “Evolution according to large ribosomal subunit RNA,” Journal of Molecular Evolution, vol. 41, no. 3, pp. 366–375, 1995.

[33] G. Caetano-Anollés, “Tracing the evolution of RNA structure in ribosomes,” Nucleic Acids Research, vol. 30, no. 11, pp. 2575–2587, 2002.


[34] J. Mallatt and C. J. Winchell, “Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes,” Molecular Biology and Evolution, vol. 19, no. 3, pp. 289–301, 2002.

[35] M. Gerstein, “Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census,” Proteins: Structure, Function and Bioinformatics, vol. 33, no. 4, pp. 518–534, 1998.

[36] M. Gerstein and H. Hegyi, “Comparing genomes in terms of protein structure: surveys of a finite parts list,” FEMS Microbiology Reviews, vol. 22, no. 4, pp. 277–304, 1998.

[37] B. Snel, P. Bork, and M. A. Huynen, “Genome phylogeny based on gene content,” Nature Genetics, vol. 21, no. 1, pp. 108–110, 1999.

[38] F. Tekaia, A. Lazcano, and B. Dujon, “The genomic tree as revealed from whole proteome comparisons,” Genome Research, vol. 9, no. 6, pp. 550–557, 1999.

[39] Y. I. Wolf, S. E. Brenner, P. A. Bash, and E. V. Koonin, “Distribution of protein folds in the three superkingdoms of life,” Genome Research, vol. 9, no. 1, pp. 17–26, 1999.

[40] J. O. Korbel, B. Snel, M. A. Huynen, and P. Bork, “SHOT: a web server for the construction of genome phylogenies,” Trends in Genetics, vol. 18, no. 3, pp. 158–162, 2002.

[41] G. Caetano-Anollés and D. Caetano-Anollés, “An evolutionarily structural universe of protein architecture,” Genome Research, vol. 13, no. 7, pp. 1563–1571, 2003.

[42] S. L. Baldauf, A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle, “A kingdom-level phylogeny of eukaryotes based on combined protein data,” Science, vol. 290, no. 5493, pp. 972–977, 2000.

[43] J. R. Brown, C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope, “Universal trees based on large combined protein sequence data sets,” Nature Genetics, vol. 28, no. 3, pp. 281–285, 2001.

[44] F. D. Ciccarelli, T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P. Bork, “Toward automatic reconstruction of a highly resolved tree of life,” Science, vol. 311, no. 5765, pp. 1283–1287, 2006.

[45] T. Dagan, M. Roettger, D. Bryant, and W. Martin, “Genome networks root the tree of life between prokaryotic domains,” Genome Biology and Evolution, vol. 2, no. 1, pp. 379–392, 2010.

[46] S. Jun, G. E. Sims, G. A. Wu, and S. Kim, “Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 133–138, 2010.

[47] E. A. Schultes, P. T. Hraber, and T. H. LaBean, “No molecule is an island: molecular evolution and the study of sequence space,” in Algorithmic Bioprocesses, A. Condon, D. Harel, J. N. Kok, A. Salomaa, and E. Winfree, Eds., pp. 675–704, Springer, Berlin, Germany, 2009.

[48] K. M. Kim and G. Caetano-Anollés, “The proteomic complexity and rise of the primordial ancestor of diversified life,” BMC Evolutionary Biology, vol. 11, no. 1, article 140, 2011.

[49] V. A. Albert, Parsimony, Phylogeny, and Genomics, Oxford University Press, Oxford, UK, 2005.

[50] G. Caetano-Anollés, “Evolved RNA secondary structure and the rooting of the universal tree of life,” Journal of Molecular Evolution, vol. 54, no. 3, pp. 333–345, 2002.

[51] J. A. Lake, R. G. Skophammer, C. W. Herbold, and J. A. Servin, “Genome beginnings: rooting the tree of life,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2177–2185, 2009.

[52] B. Guo, M. Zou, and A. Wagner, “Pervasive indels and their evolutionary dynamics after the fish-specific genome duplication,” Molecular Biology and Evolution, vol. 29, no. 10, pp. 3005–3022, 2012.

[53] M. K. Basu, I. B. Rogozin, O. Deusch, T. Dagan, W. Martin, and E. V. Koonin, “Evolutionary dynamics of introns in plastid-derived genes in plants: saturation nearly reached but slow intron gain continues,” Molecular Biology and Evolution, vol. 25, no. 1, pp. 111–119, 2008.

[54] J. S. Farris, “Phylogenetic analysis under Dollo’s Law,” Systematic Biology, vol. 26, no. 1, pp. 77–88, 1977.

[55] H. Fang, M. E. Oates, R. B. Pethica et al., “A daily-updated tree of (sequenced) life as a reference for genome research,” Scientific Reports, vol. 3, article 2015, 2013.

[56] T. Cavalier-Smith, “The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification,” International Journal of Systematic and Evolutionary Microbiology, vol. 52, no. 1, pp. 7–76, 2002.

[57] T. Cavalier-Smith, “Rooting the tree of life by transition analyses,” Biology Direct, vol. 1, article 19, 2006.

[58] H. S. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Widespread recruitment of ancient domain structures in modern enzymes during metabolic evolution,” Journal of Integrative Bioinformatics, vol. 10, no. 1, article 214, 2013.

[59] G. Caetano-Anollés, M. Wang, and D. Caetano-Anollés, “Structural phylogenomics retrodicts the origin of the genetic code and uncovers the evolutionary impact of protein flexibility,” PLoS ONE, vol. 8, no. 8, Article ID e72225, 2013.

[60] E. Bapteste and C. Brochier, “On the conceptual difficulties in rooting the tree of life,” Trends in Microbiology, vol. 12, no. 1, pp. 9–13, 2004.

[61] A. M. Poole, “Horizontal gene transfer and the earliest stages of the evolution of life,” Research in Microbiology, vol. 160, no. 7, pp. 473–480, 2009.

[62] D. Raoult, “The post-Darwinist rhizome of life,” The Lancet, vol. 375, no. 9709, pp. 104–105, 2010.

[63] W. F. Doolittle, “Phylogenetic classification and the universal tree,” Science, vol. 284, no. 5423, pp. 2124–2128, 1999.

[64] T. Dagan and W. Martin, “The tree of one percent,” Genome Biology, vol. 7, no. 10, article 118, 2006.

[65] W. F. Doolittle and E. Bapteste, “Pattern pluralism and the tree of life hypothesis,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 7, pp. 2043–2049, 2007.

[66] E. V. Koonin, “Towards a postmodern synthesis of evolutionary biology,” Cell Cycle, vol. 8, no. 6, pp. 799–800, 2009.

[67] T. Kloesges, O. Popa, W. Martin, and T. Dagan, “Networks of gene sharing among 329 proteobacterial genomes reveal differences in lateral gene transfer frequency at different phylogenetic depths,” Molecular Biology and Evolution, vol. 28, no. 2, pp. 1057–1074, 2011.

[68] S. Halary, J. W. Leigh, B. Cheaib, P. Lopez, and E. Bapteste, “Network analyses structure genetic diversity in independent genetic worlds,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 127–132, 2010.

[69] S. S. Abby, E. Tannier, M. Gouy, and V. Daubin, “Lateral gene transfer as a support for the tree of life,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 13, pp. 4962–4967, 2012.

[70] N. Glansdorff, Y. Xu, and B. Labedan, “The conflict between horizontal gene transfer and the safeguard of identity: origin of meiotic sexuality,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 470–480, 2009.

[71] C. G. Kurland and O. G. Berg, “A hitchhiker’s guide to evolving networks,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 361–396, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[72] L. P. Villarreal and G. Witzany, “The DNA habitat and its RNA inhabitants: at the dawn of RNA sociology,” Genomics Insights, vol. 6, pp. 1–12, 2013.

[73] S. Gribaldo and C. Brochier, “Phylogeny of prokaryotes: does it exist and why should we care?” Research in Microbiology, vol. 160, no. 7, pp. 513–521, 2009.

[74] B. D. Mishler, “The logic of the data matrix in phylogenetic analysis,” in Parsimony, Phylogeny, and Genomics, V. A. Albert, Ed., pp. 57–69, Oxford University Press, New York, NY, USA, 2005.

[75] A. Dress, V. Moulton, M. Steel, and T. Wu, “Species, clusters and the “tree of life”: a graph-theoretic perspective,” Journal of Theoretical Biology, vol. 265, no. 4, pp. 535–542, 2010.

[76] N. Lane and W. Martin, “The energetics of genome complexity,” Nature, vol. 467, no. 7318, pp. 928–934, 2010.

[77] M. C. Rivera and J. A. Lake, “The ring of life provides evidence for a genome fusion origin of eukaryotes,” Nature, vol. 431, no. 7005, pp. 152–155, 2004.

[78] D. Alvarez-Ponce and J. O. McInerney, “The human genome retains relics of its prokaryotic ancestry: human genes of archaebacterial and eubacterial origin exhibit remarkable differences,” Genome Biology and Evolution, vol. 3, no. 1, pp. 782–790, 2011.

[79] A. Poole and D. Penny, “Eukaryote evolution: engulfed by speculation,” Nature, vol. 447, no. 7147, p. 913, 2007.

[80] E. V. Koonin and Y. I. Wolf, “The fundamental units, processes and patterns of evolution, and the tree of life conundrum,” Biology Direct, vol. 4, no. 1, article 33, 2009.

[81] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Search for a “tree of life” in the thicket of the phylogenetic forest,” Journal of Biology, vol. 8, no. 6, article 59, 2009.

[82] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Seeing the tree of life behind the phylogenetic forest,” BMC Biology, vol. 11, article 46, 2013.

[83] D. B. Wetlaufer, “Nucleation, rapid folding, and globular intrachain regions in proteins,” Proceedings of the National Academy of Sciences of the United States of America, vol. 70, no. 3, pp. 697–701, 1973.

[84] J S Richardson ldquoThe anatomy and taxonomy of proteinstructurerdquo Advances in Protein Chemistry vol 34 pp 167ndash3391981

[85] M Riley and B Labedan ldquoProtein evolution viewed throughEscherichia coli protein sequences introducing the notion ofa structural segment of homology the modulerdquo Journal ofMolecular Biology vol 268 no 5 pp 857ndash868 1997

[86] J Janin and S J Wodak ldquoStructural domains in proteins andtheir role in the dynamics of protein functionrdquo Progress inBiophysics and Molecular Biology vol 42 pp 21ndash78 1983

[87] G Caetano-Anolles M Wang D Caetano-Anolles and J EMittenthal ldquoThe origin evolution and structure of the proteinworldrdquo Biochemical Journal vol 417 no 3 pp 621ndash637 2009

[88] J Gough ldquoConvergent evolution of domain architectures (israre)rdquo Bioinformatics vol 21 no 8 pp 1464ndash1471 2005

[89] K Forslund A Henricson V Hollich and E L L Sonnham-mer ldquoDomain tree-based analysis of protein architecture evo-lutionrdquoMolecular Biology and Evolution vol 25 no 2 pp 254ndash264 2008

[90] S. Yang and P. E. Bourne, “The evolutionary history of protein domains viewed by species phylogeny,” PLoS ONE, vol. 4, no. 12, Article ID e8378, 2009.

[91] K. M. Kim and G. Caetano-Anollés, “The evolutionary history of protein fold families and proteomes confirms that the archaeal ancestor is more ancient than the ancestors of other superkingdoms,” BMC Evolutionary Biology, vol. 12, article 13, 2012.

[92] M. Wang and G. Caetano-Anollés, “The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world,” Structure, vol. 17, no. 1, pp. 66–78, 2009.

[93] G. Caetano-Anollés and A. Nasir, “Benefits of using molecular structure and abundance in phylogenomic analysis,” Frontiers in Genetics, vol. 3, article 172, 2012.

[94] K. Brading and E. Castellani, Symmetries in Physics: Philosophical Reflections, Cambridge University Press, Cambridge, UK, 2003.

[95] J. A. Cotton and M. Wilkinson, “Quantifying the potential utility of phylogenetic characters,” Taxon, vol. 57, no. 1, pp. 131–136, 2008.

[96] E. Mossel and M. Steel, “A phase transition for a random cluster model on phylogenetic trees,” Mathematical Biosciences, vol. 187, no. 2, pp. 189–203, 2004.

[97] M. H. Bailor, X. Sun, and H. M. Al-Hashimi, “Topology links RNA secondary structure with global conformation, dynamics, and adaptation,” Science, vol. 327, no. 5962, pp. 202–206, 2010.

[98] K. Illergard, D. H. Ardell, and A. Elofsson, “Structure is three to ten times more conserved than sequence: a study of structural response in protein cores,” Proteins: Structure, Function and Bioinformatics, vol. 77, no. 3, pp. 499–508, 2009.

[99] G. Caetano-Anollés, S. K. Hee, and J. E. Mittenthal, “The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 22, pp. 9358–9363, 2007.

[100] P. H. Weston, “Methods for rooting cladistic trees,” in Models in Phylogeny Reconstruction, D. J. Siebert, R. W. Scotland, and D. M. Williams, Eds., pp. 125–155, Oxford University Press, Oxford, UK, 1994.

[101] H. N. Bryant, “The polarization of character transformations in phylogenetic systematics: role of axiomatic and auxiliary assumptions,” Systematic Biology, vol. 40, no. 4, pp. 433–445, 1991.

[102] H. N. Bryant and G. Wagner, “Character polarity and the rooting of cladograms,” in The Character Concept in Evolutionary Biology, G. P. Wagner, Ed., pp. 319–338, Academic Press, New York, NY, USA, 2001.

[103] G. Nelson, “Ontogeny, phylogeny, paleontology, and the biogenetic law,” Systematic Biology, vol. 27, no. 3, pp. 324–345, 1978.

[104] P. H. Weston, “Indirect and direct methods in systematics,” in Ontogeny and Systematics, C. J. Humphries, Ed., pp. 27–56, Columbia University Press, New York, NY, USA, 1988.

[105] F.-J. Sun and G. Caetano-Anollés, “The evolutionary history of the structure of 5S ribosomal RNA,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 430–443, 2009.

[106] X. Fang, T. Pan, and T. R. Sosnick, “Mg2+-dependent folding of a large ribozyme without kinetic traps,” Nature Structural Biology, vol. 6, no. 12, pp. 1091–1095, 1999.

[107] F.-J. Sun and G. Caetano-Anollés, “Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses,” PLoS Computational Biology, vol. 4, no. 3, Article ID e1000018, 2008.

[108] F.-J. Sun and G. Caetano-Anollés, “The origin and evolution of tRNA inferred from phylogenetic analysis of structure,” Journal of Molecular Evolution, vol. 66, no. 1, pp. 21–35, 2008.

[109] W. R. Ashby, An Introduction to Cybernetics, Taylor & Francis, London, UK, 1955.

[110] J. G. Lundberg, “Wagner networks and ancestors,” Systematic Biology, vol. 21, no. 4, pp. 398–413, 1972.

[111] P. T. Chippindale and J. J. Wiens, “Weighting, partitioning, and combining characters in phylogenetic analysis,” Systematic Biology, vol. 43, no. 2, pp. 278–287, 1994.

[112] J. Mittenthal, D. Caetano-Anollés, and G. Caetano-Anollés, “Biphasic patterns of diversification and the emergence of modules,” Frontiers in Genetics, vol. 3, article 147, 2012.

[113] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[114] C. A. Orengo, A. Michie, S. Jones, D. T. Jones, M. Swindells, and J. M. Thornton, “CATH: a hierarchic classification of protein domain structures,” Structure, vol. 5, no. 8, pp. 1093–1109, 1997.

[115] G. Sawa, J. Dicks, and I. N. Roberts, “Current approaches to whole genome phylogenetic analysis,” Briefings in Bioinformatics, vol. 4, no. 1, pp. 63–74, 2003.

[116] R. Sathyapriya, J. M. Duarte, H. Stehr, I. Filippis, and M. Lappe, “Defining an essence of structure determining residue contacts in proteins,” PLoS Computational Biology, vol. 5, no. 12, Article ID e1000584, 2009.

[117] D. Penny and L. J. Collins, “Evolutionary genomics leads the way,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 1–16, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[118] M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.

[119] A. Andreeva, D. Howorth, J. Chandonia et al., “Data growth and its impact on the SCOP database: new developments,” Nucleic Acids Research, vol. 36, pp. D419–D425, 2008.

[120] L. S. Yafremava, M. Wielgos, S. Thomas et al., “A general framework of persistence strategies for biological systems helps explain domains of life,” Frontiers in Genetics, vol. 4, article 16, 2013.

[121] J. L. Friar, T. Goldman, and J. Perez-Mercader, “Genome sizes and the Benford distribution,” PLoS ONE, vol. 7, no. 5, Article ID e36624, 2012.

[122] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya,” BMC Evolutionary Biology, vol. 12, article 156, 2012.

[123] A. Nasir and G. Caetano-Anollés, “Comparative analysis of proteomes and functionomes provides insights into origins of cellular diversification,” Archaea, vol. 2013, Article ID 648746, 13 pages, 2013.

[124] W. Martin and M. Muller, “The hydrogen hypothesis for the first eukaryote,” Nature, vol. 392, no. 6671, pp. 37–41, 1998.

[125] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Global patterns of protein domain gain and loss in superkingdoms,” PLoS Computational Biology, vol. 10, no. 1, Article ID e1003452, 2014.

[126] M. Wang, L. S. Yafremava, D. Caetano-Anollés, J. E. Mittenthal, and G. Caetano-Anollés, “Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world,” Genome Research, vol. 17, no. 11, pp. 1572–1585, 2007.

[127] L. Guy and T. J. G. Ettema, “The archaeal ‘TACK’ superphylum and the origin of eukaryotes,” Trends in Microbiology, vol. 19, no. 12, pp. 580–587, 2011.

[128] D. Schlieper, M. A. Oliva, J. M. Andreu, and J. Lowe, “Structure of bacterial tubulin BtubA/B: evidence for horizontal gene transfer,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 26, pp. 9170–9175, 2005.

[129] M. P. Hoeppner, P. P. Gardner, and A. M. Poole, “Comparative analysis of RNA families reveals distinct repertoires for each domain of life,” PLoS Computational Biology, vol. 8, no. 11, Article ID e1002752, 2012.

[130] H. Xue, K. Tong, C. Marck, H. Grosjean, and J. T. Wong, “Transfer RNA paralogs: evidence for genetic code-amino acid biosynthesis coevolution and an archaeal root of life,” Gene, vol. 310, no. 1-2, pp. 59–66, 2003.

[131] H. Xue, S. Ng, K. Tong, and J. T. Wong, “Congruence of evidence for a Methanopyrus-proximal root of life based on transfer RNA and aminoacyl-tRNA synthetase genes,” Gene, vol. 360, no. 2, pp. 120–130, 2005.

[132] M. Di Giulio, “Nanoarchaeum equitans is a living fossil,” Journal of Theoretical Biology, vol. 242, no. 1, pp. 257–260, 2006.

[133] M. Di Giulio, “The tree of life might be rooted in the branch leading to Nanoarchaeota,” Gene, vol. 401, no. 1-2, pp. 108–113, 2007.

[134] J. T. Wong, J. Chen, W. Mat, S. Ng, and H. Xue, “Polyphasic evidence delineating the root of life and roots of biological domains,” Gene, vol. 403, no. 1-2, pp. 39–52, 2007.

[135] F.-J. Sun and G. Caetano-Anollés, “The ancient history of the structure of ribonuclease P and the early origins of Archaea,” BMC Bioinformatics, vol. 11, article 153, 2010.

[136] C. Marck and H. Grosjean, “tRNomics: analysis of tRNA genes from 50 genomes of Eukarya, Archaea, and bacteria reveals anticodon-sparing strategies and domain-specific features,” RNA, vol. 8, no. 10, pp. 1189–1232, 2002.

[137] W. Mat, H. Xue, and J. T. Wong, “The genomics of LUCA,” Frontiers in Bioscience, vol. 13, no. 14, pp. 5605–5613, 2008.

[138] C. Brochier, S. Gribaldo, Y. Zivanovic, F. Confalonieri, and P. Forterre, “Nanoarchaea: representatives of a novel archaeal phylum or a fast-evolving euryarchaeal lineage related to Thermococcales?” Genome Biology, vol. 6, no. 5, article R42, 2005.

[139] B. Billoud, M. Guerrucci, M. Masselot, and J. S. Deutsch, “Cirripede phylogeny using a novel approach: molecular morphometrics,” Molecular Biology and Evolution, vol. 17, no. 10, pp. 1435–1445, 2000.

[140] L. J. Collins, V. Moulton, and D. Penny, “Use of RNA secondary structure for studying the evolution of RNase P and RNase MRP,” Journal of Molecular Evolution, vol. 51, no. 3, pp. 194–204, 2000.

[141] G. Caetano-Anollés, “Novel strategies to study the role of mutation and nucleic acid structure in evolution,” Plant Cell, Tissue and Organ Culture, vol. 67, no. 2, pp. 115–132, 2001.

[142] F.-J. Sun, S. Fleurdepine, C. Bousquet-Antonelli, G. Caetano-Anollés, and J. Deragon, “Common evolutionary trends for SINE RNA structures,” Trends in Genetics, vol. 23, no. 1, pp. 26–33, 2007.

[143] D. Caetano-Anollés, K. M. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Proteome evolution and the metabolic origins of translation and cellular life,” Journal of Molecular Evolution, vol. 72, no. 1, pp. 14–33, 2011.


[144] M. Wang, Y. Jiang, K. M. Kim et al., “A universal molecular clock of protein folds and its power in tracing the early history of aerobic metabolism and planet oxygenation,” Molecular Biology and Evolution, vol. 28, no. 1, pp. 567–582, 2011.

[145] S. A. Bukhari and G. Caetano-Anollés, “Origin and evolution of protein fold designs inferred from phylogenomic analysis of CATH domain structures in proteomes,” PLoS Computational Biology, vol. 9, no. 3, Article ID e1003009, 2013.

[146] M. Vesteg and J. Krajcovic, “The falsifiability of the models for the origin of eukaryotes,” Current Genetics, vol. 57, no. 6, pp. 367–390, 2011.

[147] I. N. Berezovsky and E. I. Shakhnovich, “Physics and evolution of thermophilic adaptation,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 36, pp. 12742–12747, 2005.

[148] C. Woese, “The universal ancestor,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 12, pp. 6854–6859, 1998.

[149] M. J. Seufferheld and G. Caetano-Anollés, “Phylogenomics supports a cellularly structured urancestor,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 178–191, 2013.

[150] O. Kandler, “Cell wall biochemistry and three-domain concept of life,” Systematic and Applied Microbiology, vol. 16, no. 4, pp. 501–509, 1994.

[151] C. R. Woese, “On the evolution of cells,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 13, pp. 8742–8747, 2002.

[152] S. L. Baldauf, J. D. Palmer, and W. F. Doolittle, “The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny,” Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 15, pp. 7749–7754, 1996.

[153] H. Leffers, J. Kjems, L. Østergaard, N. Larsen, and R. A. Garrett, “Evolutionary relationships amongst archaebacteria. A comparative study of 23 S ribosomal RNAs of a sulphur-dependent extreme thermophile, an extreme halophile and a thermophilic methanogen,” Journal of Molecular Biology, vol. 195, no. 1, pp. 43–61, 1987.

[154] J. Martin, D. Blackburn, and E. Wiley, “Are node-based and stem-based clades equivalent? Insights from graph theory,” PLoS Currents, vol. 2, Article ID RRN1196, 2010.


that this axiomatic generality, which we have discussed in the context of origin of life research [5], encompasses rare punctuations (e.g., quantum leap changes such as genome duplications and rearrangements or the rare evolutionary appearance of new fold structures) embedded in a fabric of gradual change (e.g., changes induced by point mutations). Both gradual changes and punctuations are interlinked and are always expressed within spatiotemporal continuity (e.g., structural punctuations in the mappings of sequences into structures of RNA [6]). This interpretative framework can explain novelty and complexity with principles of scientific inquiry that maximize the explanatory power of assertions about retrodictions.

Phylogenetic theories are embodied in evolutionary “trees” and “models.” Trees (phylogenies) are multidimensional statements of relationship of the entities that are studied (phylogenetic taxa). Models are evolutionary transformations of the biological attributes examined in data (phylogenetic characters), which define the relationships of taxa in trees. The tripartite interaction between characters, models, and trees must occur in ways that enhance retrodictive power through test and corroboration [4]. In other words, it must follow the Popperian pillars of scientific inquiry or suitable philosophical analogs. We note that retrodictive statements allow drawing inferences about the past by using information that is extant (i.e., that we can access today) and is necessarily modern. The challenge of travelling back in time rests on not only making inferences about archaic biology with information drawn from modern biological systems but also interpreting the retrodictive statements without conceptual restrictions imposed by modernity. This has been an important obstacle to historical understanding, starting with grading hypotheses inspired by Aristotle’s great chain of being, the scala naturae.

It was Willi Hennig in the fifties who first formalized retrodiction in quantitative terms [7]. Since then, his “phylogenetic systematics” has benefitted from numerous conceptual and bioinformatics developments, which are now responsible for modern phylogenetic analysis of systems of any kind, from molecules and organisms to language and culture, from engineering applications to astrophysics. Astrocladistics, for example, focuses on the evolution and diversification of galaxies caused by transforming events such as accretion, interaction, and mergers (e.g., [8]). While major views have emerged in the “discovery operations” (sensu [2]) of the phylogenetic systematics paradigm, including maximum parsimony and the frequentist and uncertainty views of maximum likelihood and Bayesian thinking, the major technical and philosophical challenges persist [9]. More importantly, as we will explain below, technical and philosophical aspects of the ideographic framework in some cases have been turned into landscapes of authoritarianism and apriorism [3]. This insidious trend is pervasive in the “rooting of the tree of life” field of inquiry [10] that underlies the origins of biochemistry and biodiversity we here discuss.

In this opinion paper, we address the challenges of finding an origin to biodiversity and propose a new framework for deep phylogenetic analysis that focuses on the parts of biological systems instead of the whole. We review the application of this framework to data drawn from structural and functional genomics and argue that the origin of cellular life involved gradual accretion of molecular interactions and the rise of hierarchical and modular structure. We discuss our findings, which provide strong support to the very early rise of primordial archaeal lineages and the emergence of Archaea as the first domain of diversified cellular life (superkingdom). The term “domain” of life stresses the cohesiveness of the organism supergroup, very much like domains in proteins and nucleic acids stress the molecular cohesiveness of their atomic makeup. Instead, the term “superkingdom” (superregnum) makes explicit the fact that there is a nested hierarchy of groups of organisms, many of which share common ancestors (i.e., they are monophyletic). We propose that the rise of emerging lineages was embedded in a primordial “evolutionary grade” (sensu Huxley [11]), a group of diversifying organisms (primordial archaeons) in active transition that were initially unified by the same and archaic level of physiological complexity. Our discussion will attempt to reconcile some divergent views of the origin of diversified life and will provide a generic scenario for “turning points” of origin that may be recurrent in biology.

2. A Tripartite World of Organismal Diversity

Carl Woese and his colleagues of the Urbana School were responsible for the groundbreaking discovery that the world of organisms was tripartite; that is, it encompassed not two but three major “domains” of cellular life (Archaea, Bacteria, and Eukarya). Two of the three “aboriginal” lines of descent were initially conceptualized as “urkingdoms” of deep origin that were microbial and qualitatively different from eukaryotic organisms [12]. They corresponded to Archaea and Bacteria. The discovery of Archaea challenged the established akaryote/eukaryote divide (we use the term “akaryote” to describe a cell without a bona fide nucleus. This term complements the word “eukaryote” (“eu,” good, and “karyon,” kernel), which is ahistorical. The new term takes away the time component of the widely used “prokaryote” (“pro,” before) definition, which may be incorrect for many organisms of the microbial domains) that supported “ladder” scenarios of gradual evolution from simplistic microbes to “higher” organisms, which were tenaciously defended by molecular biologists and microbiologists of the time. Woese and Fox [12] made it clear: “Evolution seems to progress in a ‘quantized’ fashion. One level or domain of organization gives rise ultimately to a higher (more complex) one. What ‘prokaryote’ and ‘eukaryote’ actually represent are two such domains. Thus, although it is useful to define phylogenetic patterns within each domain, it is not meaningful to construct phylogenetic classifications between domains: Prokaryotic kingdoms are not comparable to eukaryotic ones.” The discovery was revolutionary, especially because scala naturae deeply seated the roots of the akaryote/eukaryote divide and microbes were considered primitive forms that did not warrant equal standing when compared to the complex organization of Eukarya (see [13] for a historical account). The significance of the tripartite world was quickly realized and vividly resisted by the establishment. Its resistance is still embodied today in new proposals of origins, such as the archaeon-bacterium fusion hypothesis used to explain the rise of Eukarya (see below). It is noteworthy that the root of the universal tree of cellular organisms, the “tree of life” (ToL), was initially not the driving issue. This changed when the sequences of proteins that had diverged by gene duplication prior to a putative universal common ancestor were analyzed with phylogenetic methods and the comparisons used to root the ToL [14, 15]. Paralogous gene couples included elongation factors (e.g., EF-Tu and EFG), ATPases (α and β subunits), signal recognition particle proteins, and carbamoyl phosphate synthetases, all believed to be very ancient (reviewed in [16]). In many cases, bacterial sequences were the first to branch (appeared at the base) in the reconstructed trees, forcing archaeal and eukaryal sequences to be sister groups to each other. This “canonical” rooting scheme of the ToL (Figure 1(a)) was accepted as fact and was quickly endorsed by the supporters of the Urbana School [17]. In fact, the acceptance of the “canonical” rooting in Bacteria became so deep that it has now prompted the search for the origins of Eukarya in the molecular and physiological constitutions of the putative archaeal sister group [18]. For example, Embley and coworkers generated sequence-based phylogenies using conserved proteins and advanced algorithms to show that Eukarya emerged from within Archaea [19–21] (refer to [22] for critical analysis). Importantly, these analyses suffer from technical and logical problems that are inherent in sequence-based tree reconstructions. For example, proteins such as elongation factors, tRNA synthetases, and other universal proteins used in their analyses are prone to high substitution rates [23]. Mathematically, this leads to loss of information regarding the root of the ToL, as shown by Sober and Steel [24] (refer to [25] for more discussion). On the other hand, paralogous rootings sometimes contradicted each other and were soon and rightly considered weak and unreliable [23, 26, 27]. The validity of paralogy-based rooting methodology has proven to be severely compromised by a number of problems and artifacts of sequence analysis (e.g., long branch attraction, mutational saturation, taxon sampling, horizontal gene transfer (HGT), hidden paralogy, and historical segmental gene heterogeneity). Consequently, there is no proper outgroup that can be used to root a ToL that is built from molecular sequences, and currently there are no proper models of sequence evolution that can provide a reliable “evolutionary arrow.” Because of this fact, archaeal and eukaryal rootings should be considered as probable as the canonical bacterial rooting (Figure 1(c)). This is an important realization that needs to be explicitly highlighted, especially because it affects evolutionary interpretations and the likelihood of scenarios of origins of diversified life.
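The claim that the three rootings deserve equal standing can be made concrete with a small enumeration. The sketch below is illustrative only and is not part of the analyses reviewed here. It lists the three ways of planting a root on an unrooted three-domain tree and, assuming no polarizing evidence at all, gives each the same prior weight. The function name `possible_rootings` and the uniform prior are assumptions introduced for the example.

```python
DOMAINS = ("Archaea", "Bacteria", "Eukarya")


def possible_rootings(taxa):
    """List every rooted topology of an unrooted three-taxon tree.

    Each rooting plants the root on the branch leading to one taxon
    (the basal lineage) and leaves the other two as sister groups.
    """
    rootings = []
    for basal in taxa:
        sisters = tuple(t for t in taxa if t != basal)
        rootings.append({"basal": basal, "sisters": sisters})
    return rootings


rootings = possible_rootings(DOMAINS)

# Assumption: without character polarization, no rooting is favored,
# so each one receives the same prior weight.
prior = 1 / len(rootings)

for r in rootings:
    print(f"root in {r['basal']}: sisters = {r['sisters']}, prior = {prior:.3f}")
```

Under these assumptions the canonical bacterial rooting is simply one of three equally weighted hypotheses; only evidence that polarizes character change can shift that balance.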

3. Mining Ancient Phylogenetic Signal in Universal Molecules

Woese’s crucial insight was the explicit selection of the ribosome for evolutionary studies. The universality of the ribosomal ensemble and its central role in protein synthesis ensured it carried an ancient and overriding memory of the cellular systems that were studied. This was made evident in the first ToL reconstructions. In contrast, many of the proteins encoded by paralogous gene couples (e.g., translation factors) likely carried convoluted histories of protein domain cooption or important phylogenetic biases induced, for example, by mutational saturation in their protein sequences. The ribosome is indeed the central feature of cellular life, the signature of “ribocells.” However, its embedded phylogenetic signatures are also convoluted. The constitution of the ribosome is heterogeneous. The ribosome represents an ensemble of 3-4 ribosomal RNA (rRNA) and ∼70 protein (r-protein) molecules, depending on the species considered, and embodies multiple interactions with the cellular milieu needed for function (e.g., assembly and disassembly, interactions with the membrane of the endoplasmic reticulum). A group of 34 r-proteins is present across cellular life [28]. Ribosomal history has been shown to involve complex patterns of protein-RNA coevolution within the evolutionarily conserved core [29]. These patterns are expressed distinctly in its major constituents. While both of its major subunits evolved in parallel, a primordial core that embodied both processive and catalytic functions was established quite early in evolution. This primordial core was later accessorized with structural elements (e.g., accretion of numerous rRNA helical segments and stabilizing A-minor interactions) and r-proteins (e.g., the L7/L12 protein complex that stimulates the GTPase activity of EFG) that enhanced its functional properties. This included expansion elements in the structure of rRNA that were specific not only to subunits but also to individual domains of life. Figure 2, for example, shows a phylogenomic model of ribosomal molecular accretion derived from the survey of protein domain structures in genomes and substructures in rRNA molecules. The accretion of component parts of the universal core appears to have been a painstakingly slow process that unfolded during a period of ∼2 billion years and overlapped with the first episodes of organismal diversification [30]. Despite this complexity, the focus of biodiversity studies was for decades the small subunit rRNA molecule [31]. This focus has not changed much in recent years. Consequently, the history of organisms and populations is currently recounted by the information seated in the small subunit rRNA molecule. In other words, the historical narrative generally comes from only ∼1% of ribosomal molecular constitution. This important and unacknowledged bias was already made explicit in early phylogenetic studies. For example, de Rijk et al. [32] demonstrated that phylogenies reconstructed from the small and large subunit rRNA molecules were different and that the reconstructions from the large subunit were more robust and better suited to establish wide-range relationships. The structure of the small and large subunits was also shown to carry distinct phylogenetic signatures [33]. However, only a few evolutionary studies have combined small and large subunit rRNA for history reconstruction. Remarkably, in all of these cases, phylogenetic signal was significantly improved (e.g., [34]).

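[Figure 1 image: panel (a), unrooted node-based tree of Eukarya (E), Archaea (A), and Bacteria (B) with the urancestor (U) and ancestor (a); panel (b), stem-based tree with crown and stem groups along a time axis marking the origin of lineages; panel (c), canonical, eukaryal, and archaeal rootings along a time axis.]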

Figure 1: Rooting the tree of life (ToL): an exercise for “tree-thinkers.” (a) A node-based tree representation of the ToL focuses on taxa (sampled or inferred vertices illustrated with open circles) and models edges as ancestry relationships. The unrooted tree describes taxa as conglomerates of extant species for each domain of life, Archaea (A), Bacteria (B), and Eukarya (E) (abbreviations are used throughout the paper). Adding a universal ancestor vertex (rooting the ToL) implies adding a reconstructed entity (urancestor) that roots the tree but does not model ancestry relationships. The vertex is either a “most recent universal ancestor” if one defines it from an ingroup perspective or a “last universal common ancestor” (LUCA) if the definition relates to an outgroup perspective. Rooting the tree enables defining “total lineages,” which are lists of ancestors spanning from the ancestral taxon to the domain taxa. (b) A stem-based view focuses on edges (branches): sampled and inferred ancestral taxa are viewed as lineages under the paradigm of descent with modification. Vertices correspond to speciation events. Terminal edges represent conglomerates of lineages leading to domains of life (the terminal nodes of node-based trees), and the ancestral stem represents the lineage of the urancestor (U) (double arrowhead line). Total lineages are simply a chain of edges that goes back in time and ends in the ancestral stem. Both node-based and stem-based tree representations are mathematically isomorphic, but they are not equal [154]. They change the concept of monophyletic and paraphyletic relationships. A node-based clade starts with a lineage at the instant of the splitting event, incorporating the ancestor into the makeup of the clade. In contrast, a stem-based clade originates with a planted branch on the tree, where the branch represents a lineage between two lineage splitting events. Planting an ancestral stem defines an origin of lineages and the first speciation event in the record of life. This delimits a crown clade of two domain lineages in the stem-based tree (labeled with black lines) that includes the ancestor of the sister groups, a, and a stem domain group at its base. (c) The three possible rootings of the ToL depicted with stem-based tree representations. Terminal edges are labeled with thin lines, and these conglomerate lineages can include stem and crown groups.

4. Building a Tower of Babel from a Comparative Genomic Patchwork of Sequence Homologies

Molecular evolutionists were cognizant of the limitation of looking at the history of only a few component parts, which by definition could be divergent. When genomic sequences became widely available, pioneers jumped onto the bandwagon of evolutionary genomics and the possibility of gaining systemic knowledge from entire repertoires of genes and molecules (e.g., [35–39]). The genomic revolution, for example, quickly materialized in gene content trees that reconstructed the evolution of genomes directly from their evolutionary units, the genes (e.g., [37, 39, 40]), or the domain constituents of the translated proteins [35, 41]. The sequences of multiple genes were also combined or concatenated in attempts to extract deep phylogenetic signal [42–44]. The results that were obtained consistently supported the tripartite world, backing up the claims of the Urbana School. However, clues about ancestors and lineages leading to extant taxa were still missing.
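To illustrate why gene content comparisons alone leave ancestors undefined, the following minimal sketch computes pairwise distances between hypothetical gene-family repertoires. The Jaccard distance is used here only as a convenient stand-in; the cited gene content studies rely on their own measures, and the genome names and family labels are invented for the example.

```python
def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets of gene families."""
    union = a | b
    if not union:
        return 0.0  # two empty repertoires are treated as identical
    return 1.0 - len(a & b) / len(union)


# Hypothetical repertoires: each genome is the set of gene families it encodes.
genomes = {
    "genome_1": {"f1", "f2", "f3", "f4"},
    "genome_2": {"f1", "f2", "f3", "f5"},
    "genome_3": {"f1", "f6", "f7", "f8"},
}


def distance_matrix(repertoires):
    """Compute all pairwise gene content distances, keyed by genome pair."""
    names = sorted(repertoires)
    return {
        (x, y): jaccard_distance(repertoires[x], repertoires[y])
        for x in names
        for y in names
    }


distances = distance_matrix(genomes)
for (x, y), d in distances.items():
    if x < y:
        print(f"{x} vs {y}: {d:.3f}")
```

The measure is symmetric, so the resulting matrix says how different two repertoires are but not which state is ancestral. Any tree built from such distances is unrooted unless a direction of change is supplied from elsewhere.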

[Figure 2 image not reproduced. Panel labels: a ribosomal history running from the primordial ribosome through the mature PTC, the processive ribosome, and the mature ribosome to the present-day ribosome, with the ribosomal core highlighted; a history of life running from the origin of proteins through the urancestor and diversified life to the present. Axes: molecular age (nd); history of life (time, billions of years, from 3.8 to 0).]

Figure 2: The evolutionary history of the ribosome was traced onto the three-dimensional structure of its rRNA and r-protein components. Ages of components were colored with hues from red (ancient) to blue (recent). Phylogenomic history shows that primordial metabolic enzymes preceded RNA-protein interactions and the ribosome. The timeline of life, derived from a universal phylogenetic tree of protein domain structure at the fold superfamily level of complexity, is shown with time flowing from left to right and expressed in billions of years according to a molecular clock of fold structures. The ribosomal timeline highlights major historical events of the molecular ensemble. The history of the conserved ribosomal core [29] overlaps with that of diversified life [48], suggesting that episodes of cooption and lateral transfer have pervaded early ribosomal evolution.

With few exceptions that focused on the structure of RNA and protein molecules (see below), analyses based on genomic sequences, gene content, gene order, and other genomic characteristics were unable to produce rooted trees without the help of outgroups, additional ad hoc hypotheses that are external to the group of organisms being studied and generally carry strong assumptions. An “arrow of time” was not included in the models of genomic evolution that were used. As with sequences, the ToLs that were generated were unrooted (Figure 1(a)) and generally rooted a posteriori, either by claiming that the canonical root was correct or by making assumptions about character change that may be strictly incorrect. In a recent example, distance-based approaches were used to build universal network trees from gene families defined by reciprocal best BLAST hits [45]. These networks showed a midpoint rooting of the ToL between Bacteria and Archaea. However, this rooting involves a complex optimization of path lengths in the split networks and critically assumes that lineages evolved at roughly similar rates. This diminishes confidence in the midpoint rooting, especially when considering the uncertainties of distances inferred from BLAST analyses and the fact that domains in genes hold different histories and rates of change. Another promising but ill-conceptualized case is the rooting of distance-based trees inferred by studying the frequency of l-mer sets of amino acids in proteins at the proteome level [46]. The compositional data generated a ToL that was rooted in Eukarya when randomized proteome sequences were used as outgroups. The assumption of randomness associated with the root of a ToL is, however, unsupported or probably wrong, especially because protein sequence space and its mappings to structure are far from random [47]. More importantly, a large fraction of modern proteins had already evolved protein fold structures when the first diversified lineages arose from the universal common ancestor of cellular life [48]. These evolved repertoires cannot be claimed to be random.
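The sensitivity of midpoint rooting to unequal rates can be illustrated with a minimal Python sketch. The four-taxon tree, its branch lengths, and the function name `midpoint_root` are hypothetical and chosen only for illustration; the sketch does not reproduce the split-network optimization of [45]. It places the root at the midpoint of the longest leaf-to-leaf path, first for a clock-like tree and then for the same tree with the bacterial side evolving three times faster.

```python
from itertools import combinations

def midpoint_root(edges, leaves):
    """Return (u, v, offset): the midpoint of the longest leaf-to-leaf path
    lies on edge u-v, `offset` branch-length units away from u."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w

    def path_between(src, dst):
        # Depth-first search; a tree has exactly one path between two nodes.
        stack = [(src, [src])]
        while stack:
            node, path = stack.pop()
            if node == dst:
                return path
            for nxt in adj[node]:
                if nxt not in path:
                    stack.append((nxt, path + [nxt]))
        raise ValueError("nodes are not connected")

    def length(path):
        return sum(adj[a][b] for a, b in zip(path, path[1:]))

    longest = max((path_between(a, b) for a, b in combinations(leaves, 2)),
                  key=length)
    half, walked = length(longest) / 2.0, 0.0
    for u, v in zip(longest, longest[1:]):
        if walked + adj[u][v] >= half:
            return u, v, half - walked
        walked += adj[u][v]

taxa = ["A1", "A2", "B1", "B2"]  # two hypothetical archaea, two bacteria

# Clock-like tree: the true root sits in the middle of branch a-b.
clock_tree = {("A1", "a"): 1.0, ("A2", "a"): 1.0, ("a", "b"): 2.0,
              ("B1", "b"): 1.0, ("B2", "b"): 1.0}

# Same history, but everything on the bacterial side is three times longer.
# The true root is still 1.0 unit away from node a.
skewed_tree = {("A1", "a"): 1.0, ("A2", "a"): 1.0, ("a", "b"): 4.0,
               ("B1", "b"): 3.0, ("B2", "b"): 3.0}

print(midpoint_root(clock_tree, taxa))
print(midpoint_root(skewed_tree, taxa))
```

In the clock-like case the midpoint coincides with the true root, one unit away from node a. When one side evolves faster, the midpoint slides to three units from node a, toward the fast-evolving lineages, although the true root has not moved.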

The inability of genomics to provide clear answers to rooting questions promoted (to some extent) parsimony thinking and the exploration of evolutionary differences of significance that could act as “anchors” and could impart “polarity” to tree statements [49]. As we will describe below, we applied parsimony thinking to the evolution of protein and RNA structures [41, 50], and for over a decade we have been generating rooted ToLs that portray the evolution of proteomes with increasing explanatory power. This work, however, has been for the most part unacknowledged. Instead, molecular evolutionists focused almost exclusively on molecular sequences in their search for solutions to the ToL problem. For example, analysis of genomic insertions and deletions (indels) that are rare in paralogous gene sets rooted the ToL between the group of Actinobacteria and Gram-negative bacteria and the group of Firmicutes and Archaea [51]. Unfortunately, little is known about the dynamics of indel generation, its biological role in gene duplication, and its influence on the structure of paralogous protein pairs (e.g., [52]). If the dynamics are close to those of sequences, indels should be considered subject to all the same limitations of sequence analyses, including long branch attraction artifacts, character independence, and inapplicable characters. Given a tree topology, parsimony was also used to optimize the evolutionary transformation of genomic features. For example, when considering the occurrence of protein domains that are abundant, variants of domain structures accumulate gradually in genomes; operationally, the domain structure per se cannot be lost once it has been gained. This enabled the use of a variant of the unrooted Dollo algorithmic method (e.g., [53]); the Dollo parsimony model [54] is based on the assumption that when a biological feature that is very complex is lost in evolution, it cannot be regained through vertical descent. The method, however, could not be applied to akaryotic genomes, as these are subject to extensive lateral gene transfer, and until recently [55] was not extended to ToL reconstructions.
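The Dollo assumption can be made concrete with a short Python sketch. It is a simplified, rooted version written for illustration, not the unrooted variant used in [53] or the implementations of [54, 55]; the tree, the taxon names, and the function `dollo_losses` are hypothetical. A feature is assumed to arise once, at the last common ancestor of all taxa that carry it, and every absence below that point is explained by loss, never by regain.

```python
def dollo_losses(tree, present):
    """Count the losses Dollo parsimony needs for one presence/absence character.

    tree    : rooted tree as nested tuples of leaf names
    present : set of leaf names that carry the feature
    """
    def leaves(node):
        if isinstance(node, str):
            return {node}
        return set().union(*(leaves(child) for child in node))

    def find_gain(node):
        # Descend to the smallest clade that still contains every carrier;
        # the single gain is placed on the branch leading to that clade.
        if not isinstance(node, str):
            for child in node:
                if present <= leaves(child):
                    return find_gain(child)
        return node

    def count_losses(node):
        if not (leaves(node) & present):
            return 1  # the whole clade lacks the feature: one loss on its stem
        if isinstance(node, str):
            return 0  # a leaf that still carries the feature
        return sum(count_losses(child) for child in node)

    if not present:
        return 0
    return count_losses(find_gain(tree))

# Hypothetical five-taxon tree.
tree = ((("A", "B"), "C"), ("D", "E"))

print(dollo_losses(tree, {"A", "C"}))  # gain in the ancestor of A, B, C
print(dollo_losses(tree, {"A", "D"}))  # gain at the root
```

A feature shared by A and C requires a single loss (in B), whereas a feature shared only by A and D must be traced to the root and lost three times (in B, C, and E). A laterally transferred feature would be misread in exactly this way, which is why the model is fragile when lateral gene transfer is extensive.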

Parsimony thinking was also invoked in “transition analysis,” a method that attempts to establish polarity of character change by, for example, examining homologies of proteins in proteolytic complexes such as the proteasome, membrane and cell envelope biochemical makeup, and body structures of flagella, sometimes aided by BLAST queries [56, 57]. The elaboration, however, is restricted to the few molecular and cellular structures that are analyzed, out of thousands that populate the akaryotic and eukaryotic cells. The approach is local, has not yet made use of an objective analytic phylogenetic framework, and does not weigh gains, losses, and transfers with algorithmic implementations. Thus, many statements can fall prey to incorrect optimizations or processes of convergence of structure and function, including cooptions that are common in metabolic enzymes [58]. Transitions also fail to consider the molecular makeup of the complexes examined (e.g., evolution of the photosynthetic reaction centers), which often hold domains with heterogeneous histories in the different organismal groups. For example, the $F_1F_o$ ATP synthetase complex that powers cellular processes has a long history of accretion of domains that spans almost 3.8 billion years, which is even older than that of the ribosome [59]. Finally, establishing the validity of evolutionary transitions in polarization schemes can be highly problematic; each transition that is studied requires well-grounded assumptions [60], some of which have not yet been properly elaborated.

The inability to solve the rooting problem and the insistence on extracting deep phylogenetic signal from molecular sequences that are prone to mutational saturation raised skepticism about the possibility of ever finding the root of the ToL. Bapteste and Brochier [60] made explicit the conceptual difficulties, claiming that scientists in the field had adopted Agrippa’s logic of doubt. Misunderstandings on how to conduct ideographic inquiry in evolutionary genomics had effectively blocked the re-erection of a Tower of Babel that would explain the diversification of genomes (Figure 3). Instead, there was “Confusion of Doubts.” Unfortunately, the impasse was aggravated by aprioristic tendencies inherited from systematic biology [3, 10], disagreements about the evolutionary role of vertical and lateral inheritance and the problem of homology [61], and, currently, disagreements about the actual role of Darwinian evolution in speciation and the ToL problem [13, 62].

During the decade of evolutionary genomic discovery, the effects of HGT on phylogeny [63] took center stage. HGT was invoked as an overriding process. However, little attention was paid to alternative explanations such as differential loss of gene variants, ancient or derived, and there was little concern for other sources of reticulation. HGT is certainly an important evolutionary process that complicates the “tree” concept of phylogenetic analysis and must be carefully studied (e.g., [64–66]). In some bacterial taxonomic groups, such as the proteobacteria, HGT was found to be pervasive and challenged the definition of species [67]. Cases like these prompted the radical suggestion that the ToL should be abandoned and that a “web of life” should be used to describe the diversification of microbes and multicellular organisms [68]. However, the problem has two aspects that must be separately considered.

Figure 3: The “Confusion of Tongues,” an engraving by Gustave Doré (1832–1883), was modified by D. Caetano-Anollés to portray the event of diversification that halted the construction of the Tower of Babel. In linguistics, the biblical story inspired tree thinking. We take the metaphor as the fall of the urancestor of cellular life and the replacement of “scala naturae” by branching processes of complexity.

(i) Mechanics. The widespread and impactful nature of HGT must be addressed. Does its existence truly compromise the validity of phylogenetic tree statements? While HGT seems important for some bacterial lineages [69], its evolutionary impacts in Archaea and Eukarya are not as extensive (e.g., [61, 70]), and its global role can be contested [71]. In turn, the role of viruses and RNA agents in genetic exchange continues to be understudied (e.g., [72]) and could be a crucial source of reticulation. Furthermore, the relationship between differential gene loss and HGT has not been adequately formalized, especially for genes that are very ancient. Establishing patterns of loss requires establishing polarity of change and rooted trees, which, as we have discussed earlier, remains unattained for ToLs reconstructed from molecular sequences.

(ii) Interpretation. HGT generally materializes as a mismatch between histories of genes in genomes and histories of organisms. Since genomes are by functional definition collections of genes, reticulations affect the history of the genes and not of the organisms, which given evolutionary assumptions result from hierarchical relationships (e.g., [73]). Thus, and at first glance, reticulation processes (HGT, gene recombination, gene duplications, and gene loss) do not obliterate vertical phylogenetic signal. The problem, however, remains complex. A ToL can be considered an ensemble of lineages nested within each other [74]. Protein domain lineages will be nested in gene lineages, gene lineages will be nested in lineages of gene families, gene families will be nested in organismal lineages, organisms will be nested in lineages of higher organismal groupings (populations, communities, ecosystems, and biomes), and so on. All lineage levels are defined by biological complexity and the hierarchical organization of life and follow a fractal pattern distribution, each level contributing vertical and lateral phylogenetic signal to the whole. However, sublineages in one hierarchical level may hold different histories compared to the histories of higher and lower levels. Historical mismatches introduce, for example, lineage sorting and reticulation problems that represent a complication to phylogenetic analysis. For example, mismatches between gene and organismal phylogenies exist in the presence of differential sorting of genomic components due to, for example, species extinctions or genomic rearrangements. These mismatches violate the fundamental cladistic assumption that history follows a branching process but can be explained by homoplasy (convergent evolution), the horizontal “trace” of the homology-homoplasy yin-yang of phylogenetic signal. The problem of homoplasy cannot be solved by conceptually preferring one particular level of the hierarchy [74]. A focus on the organismal level, for example, will not solve the problems of hybridization that are brought by strict or relaxed sexual reproduction or viral-mediated genetic exchanges or the problems of ancient events of fusions or endosymbiosis. However, homoplasy and the optimization of characters and trees of cladistic analysis provide a rigorous framework to discover the magnitude and source of nonvertical processes in evolution. Thus, cladistics or the ToL are not invalidated by network-like signals. In fact, recent discrete mathematical formulations that test the fundamental axioms of tree construction have also proven that the bifurcating history of trees is preserved despite evolutionary reticulations [75]. While there is no “confusion of doubts” at this level, the babel of confusion has not stopped.

Genomics revealed evolutionary patchiness, which incited hypotheses of chimeras and fusions. Phylogenies of genes were found to be highly discordant. Organisms that were being sequenced shared very few gene sequences and, more troublingly, gene trees that were generated possessed different topologies, especially within akaryotic organisms. With time, the number of nearly universal genes decreased and the patchiness and discordancy increased. The comparative genomic patchwork of sequence homologies showed that there were groups of genes that were only shared by certain domains of life. Thus, the makeup of genomes appeared chimeric. A number of hypotheses of chimeric origins of eukaryotes were proposed based on the fact that eukaryotes shared genes expressing sister group phylogenetic relationships with both Bacteria and Archaea [64, 76]. One that is notable is the rise of eukaryotes from a “ring of life” fusion of an archaeon and a bacterium (e.g., [77]), which in a single blow defeated both the tree-like and the tripartite nature of life. Under this school, homology searches with BLAST against a database of ∼3.8 million akaryotic sequences made it possible to assign archaeal, bacterial, or ambiguous ancestries to genes in the human genome and explain homology patterns as relics of the akaryotic ancestors of humans [78]. Remarkably, archaeal genes tended to be involved in informational processes, encoded shorter and more central proteins, and were less likely to be involved in heritable human diseases. The chimeric origin of eukaryotes by fusions has been rightfully contested; there is no proper evidence supporting its existence [79]. Unfortunately, chimerism opened a flood of speculations about reticulation. The history of life was seen by many through the lens of a “forest” of gene histories [80]. Reticulations and ultimately HGT were forced to explain chimeric patterns. Fusion hypotheses diverted attention from the central issue of the rooting of the ToL and prompted the separate analysis of eukaryotic origins and akaryotic evolution.

Dagan and Martin [64] denounced that the ToL was a “tree of one percent” because only a small fraction of sequences could be considered universal and could be mined for deep phylogenetic signal. The rest would account for lateral processes that confounded vertical descent. The claim came fundamentally from networks constructed using BLAST heuristic searches for short and strong matches in genomic sequences. Phylogenetic “forests” of akaryotic genes later on boosted the claim [81]. These networks were built from 6901 trees of genes using maximum likelihood methods. Only 102 of these trees were derived from nearly universal genes and contributed very little vertical phylogenetic signal. The initial conclusion was striking: “the original tree of life concept is obsolete: it would not even be a tree of one percent” [81]. However, the fact that vertical signal was present in the forests merited reevaluation: “replacement of the ToL with a network graph would change our entire perception of the process of evolution and invalidate all evolutionary reconstruction” [82]. We note that in these studies an ultrametric tree of akaryotes was recovered from vertical signal in a supertree of nearly universal genes. The rooted tree, which was used to simulate clock-like behavior, revealed the early divergence of Nanoarchaeum equitans and then Archaea [81]. While inconsistency of the forest supertree increased at high phylogenetic depths, its associated supernetwork showed there were no reticulations between Archaea and Bacteria. Thus, the deep rooting signal of archaeal diversification may be bona fide and worthy of further study.

While those that value tree thinking have contested on many grounds the idea that the ToL is “obsolete” (e.g., [13]), our take in the debate is simple. Homologies between gene sequences established through the “emperor’s BLAST” (sensu [71]) are poor substitutes for phylogenetic tree reconstructions from gene sequences. By the same token, phylogenies reconstructed from sequences are poor substitutes for phylogenies that consider the molecular structure and function of the encoded products. Gene sequences are not only prone to mutational saturation, but they generally come in pieces. These pieces represent evolutionary and structural modules that host the functions of the encoded molecules. Protein domains, for example, are three-dimensional (3D) arrangements of elements of secondary structure that fold autonomously [83], are compact [84], and are evolutionarily conserved [85, 86]. The landscape of evolutionary exploration changes when the role of protein domains in function and evolution is considered [87]. Changes at sequence level, including substitutions, insertions, and deletions, can have little impact on structure and, vice versa, sequence changes in crucial sites can have devastating consequences on function and fitness. However, there is no detailed analysis of historical mismatches between gene sequences and domain structures at the ToL level. Remarkably, while HGT seems rampant at sequence level, its impact at the domain structural level is limited [88]. This makes trees derived from domains effectively “trees of 99 percent” and their use very powerful. The fact that domains diversify mostly by vertical descent (e.g., [89–91]) suggests that gene reticulations simply reflect the pervasive and impactful combinatorial effect of domain rearrangements in proteins [92] and perhaps little else. This important claim must be carefully evaluated. It offers the possibility of dissecting how levels of organization impact processes of inheritance in biology.

5. Out of the Impasse: Parts and Wholes in the Evolving Structure of Systems

The rooting of the ToL is clearly muddled by the highly dynamic nature of change in protein and nucleic acid sequences and by the patchiness and reticulation complexities that exist at gene level. However, it is also possible that the problems of the ToL are ultimately technical. We have made the case that the use of molecular sequences is problematic on many grounds, including mutational saturation, definition of homology of sites in sequence alignments, inapplicable characters, taxon sampling and tree imbalance, and different historical signatures in domains of multidomain proteins [93]. In particular, violation of character independence by the mere existence of atomic structure represents a very serious problem that plagues phylogenetic analysis of sequences. We here present a solution to the impasse. We show that the ToL can be rooted with different approaches that focus on structure and function and that its root is congruently placed in Archaea.

Epistemologically, phylogenetic characters must comply with symmetry breaking and the irreversibility of time [94]. In other words, characters must establish transformational homology relationships and serve as independent evidential statements.

(i) Characters Must Be Homologous and Heritable across Tree Terminal Units (Taxa). Character homology is a central and controversial concept that embodies the existence of historical continuity of information [74]. Characters are “basic” evidential statements that make up columns in data matrices used for tree reconstruction. They are conjectures of perceived similarities that are accepted as fact for the duration of the study, are strengthened by Hennigian reciprocal illumination, and can be put to the test through congruence with other characters as these are fit to the trees. To be useful, characters must be heritable and informative across the taxa rows of the data matrix. This can be evaluated, for example, with the cladistic information content (CIC) measure [95]. Finding informative characters can be particularly challenging when the features that are studied change at a fast pace and when taxa sample a wide and consequently deep phylogenetic spectrum. When building a ToL, the highly dynamic nature of change in the sequence makeup of protein or nucleic acid molecules challenges the ability to retrieve reliable phylogenetic signatures across taxa, even if molecules are universally distributed and harbor evolutionarily conserved regions with deep phylogenetic signal (e.g., rRNA). The reason is that, given enough time, functionally or structurally constrained regions of the sequence will be fixed (will be structurally canalized [6]) and will offer little if any phylogenetically meaningful signal to uncover, for example, the universal rRNA core. In turn, mutational saturation of unconstrained regions will quickly erase history.

The mutational saturation problem was made mathematically explicit by Sober and Steel [24] using “mutual information” and the concept of time as an information-destroying process. Mutual information $I(X;Y)$ between two random variables $X$ and $Y$ is defined by

$$I(X;Y) = \sum_{x,y} P(X = x, Y = y)\,\log\left(\frac{P(X = x, Y = y)}{P(X = x)\,P(Y = y)}\right). \tag{1}$$

When $I(X;Y)$ approaches 0, $X$ and $Y$ become independent and no method can predict $X$ from knowledge of $Y$. Importantly, mutual information approaches 0 as the time between $X$ and $Y$ increases in a Markov chain. Regardless of the use of a parsimony, maximum likelihood, or Bayesian-based framework of analysis, $I(X;Y)$ will be particularly small when sequence sites are saturated by too many substitutions due to high substitution rates or large time scales. Ancestral states at interior nodes of the trees cannot be established with confidence from extant information, even in the most optimal situation of knowing the underlying phylogeny and the model of character evolution. Under simple models, the problem is not mitigated by the fact that the number of terminal leaves of trees and the sources of initial phylogenetic information increase with time. Since a phase transition occurs when substitution probabilities exceed a critical value [96], one way out of the impasse is to find features in sufficient number that change at a much slower pace than sequence sites and test if mutual information is significant and overcomes Fano’s inequality. These features exist in molecular biology and have been used for phylogenetic reconstruction. They are, for example, the 3D fold structures of protein molecules [87] or the stem modules of RNA structures [97], features that change at a very slow rate when compared to associated sequences. For example, protein 3D structural cores evolve linearly with amino acid substitutions per site and change at 3–10 times slower rates than sequences [98]. This high conservation highlights the evolutionary dynamics of molecular structure. Remarkably, rates of change of proteins performing the same function are maintained by functional constraints but accelerate when proteins perform different functions or contain indels. In turn, fold structural diversity explodes into modular structures at low sequence identities, probably triggered by functional diversification. Within the context of structural conservation, the fact that fold structures are structural and evolutionary modules that accumulate in proteomes by gene duplication and rearrangements and spread in biological networks by recruitment (e.g., [99]) also provides a solution to the problems of vanishing phylogenetic signal. Since fold accumulation increases with time in the Markov chain, mutual information must increase, reversing the “data processing inequality” that destroys information and enabling deep evolutionary information.
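The decay of mutual information with time can be checked numerically. The sketch below evaluates equation (1) for a single site evolving under the Jukes-Cantor substitution model, which is an assumption made here for illustration and is not necessarily the model analyzed in [24]; `jc69_joint` and `mutual_information` are hypothetical helper names. Here $X$ is the ancestral nucleotide, $Y$ is its descendant, and `d` is the expected number of substitutions per site separating them.

```python
import math

def jc69_joint(d):
    """Joint distribution P(X = x, Y = y) of ancestral and descendant
    nucleotides separated by d expected substitutions per site (Jukes-Cantor)."""
    decay = math.exp(-4.0 * d / 3.0)
    same = 0.25 + 0.75 * decay   # P(Y = x | X = x)
    diff = 0.25 - 0.25 * decay   # P(Y = y | X = x) for each y != x
    # The stationary distribution is uniform, so P(X = x) = 1/4.
    return [[0.25 * (same if x == y else diff) for y in range(4)]
            for x in range(4)]

def mutual_information(joint):
    """Mutual information in bits, following equation (1) term by term."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    total = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0.0:  # terms with zero probability contribute nothing
                total += pxy * math.log2(pxy / (px[x] * py[y]))
    return total

distances = [0.0, 0.1, 0.5, 1.0, 2.0, 5.0]
mi_values = [mutual_information(jc69_joint(d)) for d in distances]
for d, mi in zip(distances, mi_values):
    print(f"d = {d:3.1f}  I(X;Y) = {mi:.6f} bits")
```

With no divergence the descendant determines the ancestor exactly (2 bits for four equiprobable nucleotides). The value then falls monotonically toward 0 as `d` grows, which is the saturation behavior described above and the reason slowly changing features are needed to retain deep signal.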

(ii) Characters Must Show at Least Two Distinct Character States. One of these two states (transformational homologs) must be ancestral (the “plesiomorphic” state) and the other must be derived (the “apomorphic” state) [74]. Only shared and derived features (synapomorphies) provide vertical phylogenetic evidence. Consequently, determining the relative ancestry of alternative character states defines the polarity of character transformations and roots the underlying tree. This is a fundamental property of phylogenetic inference. Polarization in tree reconstruction enables the “arrow of time” (sensu Eddington’s entropy-induced asymmetry), solves the rooting problem, and fulfills other epistemological requirements.

Cladistically speaking, character polarity refers to the distinction between the ancestral and derived states and the identification of synapomorphies. However, an evolutionary view of polarity also refers to the direction of character state transformations in the phylogenetic model. Historically, three accepted alternatives have been available for rooting trees [100–102]: the outgroup comparison, the ontogenetic method, and the paleontological or stratigraphic method. While the three methods do not include assumptions of evolutionary process, they have been the subject of much discussion and their interpretation of much controversy. The first two are, however, justified by the assumption that diversity results in a nested taxonomic hierarchy, which may or may not be induced by evolution. We will not discuss the stratigraphic method, as it relies on auxiliary assumptions regarding the completeness of the fossil record. The midpoint rooting criterion mentioned earlier will not be discussed either. The procedure is contested by the existence of heterogeneities in rates of change across the trees and problems with the accurate characterization of phylogenetic distances.

In outgroup comparison, polarity is inferred by the distribution of character states in the ingroup (group of taxa of interest) and the sister group (taxa outside the group of interest). In a simple case, if the character state is only found in the ingroup, it must be considered derived. Outgroup comparison is by far the method of choice because phylogeneticists tend to have confidence in the supporting assumptions: higher-level relationships are outside the ingroup, equivalent ontogenetic stages are compared, and character state distributions are appropriately surveyed. Unfortunately, the method is “indirect” in that it depends on the assumption of the existence of a higher-level relationship between the outgroup and the ingroup. Consequently, the method cannot root the ToL because at that level there is no higher-level relationship that is presently available. Moreover, the method in itself does not polarize characters. It simply connects the ingroup to the rest of the ToL [100].

The ontogenetic criterion confers polarity through the distribution of the states of homologous characters in ontogenies of the ingroup, generally by focusing on the generality of character states, with more widely distributed states being considered ancestral. Nelson’s rule states: “given an ontogenetic character transformation from a character observed to be more general to a character observed to be less general, the more general character is primitive and the less general advanced” [103]. This “biogenetic law” appears powerful in that it depends only on the assumption that ontogenies of ingroup taxa are properly surveyed. It is also a “direct” method that relies exclusively on the ingroup. Consequently, it has the potential to root the ToL. Unfortunately, Nelson’s “generality” has been interpreted in numerous ways, especially as it relates to the ontogenetic sequence, leading to much confusion [102]. It also involves comparison of developmentally nested and distinct life history stages, making it difficult to extend the method (originally conceptualized for vertebrate phylogeny) to the microbial world. However, Weston [100, 104] made it clear that the ontogenetic criterion embodies a wider “generality criterion” in which the taxic distribution of a character state is a subset of the distribution of another. In other words, character states that characterize an entire group must be considered ancestral relative to an alternative state that characterizes a subset of the group. Besides the centrality of nested patterns, the generality criterion embeds the core assumption that every homology is a synapomorphy in nature’s nested taxonomic hierarchy and that homologies in the hierarchy result from additive phylogenetic change [100]. Weston’s more general rule therefore states that “given a distribution of two homologous characters in which one, $x$, is possessed by all of the species that possess its homolog, character $y$, and by at least one other species that does not, then $y$ may be postulated to be apomorphous relative to $x$” [104]. The only assumption of the method is that relevant character states in the ingroup are properly surveyed. This new rule crucially substitutes the concept of ontogenetic transformation by the more general concept of homology and additive phylogenetic change, which can be applied to cases in which homologous entities accumulate “iteratively” in evolution (e.g., generation of paralogous genes by duplication). Since horizontally acquired characters (xenologs) are not considered synapomorphies, they contribute towards phylogenetic noise and are excluded after calculation of homoplasy and retention indices (i.e., measures of goodness of fit of characters to the phylogeny).
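Weston’s rule reduces to a test of nested taxic distributions, which the following Python sketch makes explicit. The function name `polarize` and the toy distributions are hypothetical; the taxa are placeholders and do not represent data from [100, 104].

```python
def polarize(taxa_with_x, taxa_with_y):
    """Apply Weston's generality criterion to two homologous character states.

    Returns "x" or "y" for the state postulated to be plesiomorphic
    (ancestral), or None when the two distributions are not strictly nested
    and the criterion cannot decide.
    """
    x, y = set(taxa_with_x), set(taxa_with_y)
    if y < x:   # every taxon with y also has x, and x occurs in other taxa
        return "x"
    if x < y:   # the mirror case
        return "y"
    return None

# Hypothetical survey of five ingroup taxa.
general_state = {"t1", "t2", "t3", "t4", "t5"}   # found in the whole group
restricted_state = {"t4", "t5"}                  # found in a subset only

print(polarize(general_state, restricted_state))        # nested distributions
print(polarize({"t1", "t2", "t3"}, {"t3", "t4"}))       # overlapping, not nested
```

The first call postulates the general state as ancestral and the restricted one as derived (apomorphous). The second call returns None because neither distribution contains the other, a reminder that the criterion presupposes that character states in the ingroup have been properly surveyed.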

Wehave applied the ldquogenerality criterionrdquo to the rooting ofthe ToL through polarization strategies that embody axiomsof evolutionary process Figure 4 shows three examplesA rooted phylogeny describing the evolution of 5S rRNAmolecules sampled from a wide range of organisms wasreconstructed from molecular sequence and structure [105]The ToL that was recovered was rooted paraphyletically inArchaea (Figure 4(a)) The model of character state transfor-mation was based on the axiom that evolved RNA moleculesare optimized to increase molecular persistence and producehighly stable folded conformations Molecular persistencematerializes in RNA structure in vitro with base pairsassociating and disassociating at rates as high as 05 sminus1 [106]The frustrated kinetics and energetics of this folding processenable some structural conformations to quickly reach stablestatesThis process is evolutionarily optimized through struc-tural canalization [6] in which evolution attains molecularfunctions by both increasing the average life and stability of

10 Archaea

Archaea Bacteria Eukarya

Root

Taxa = 666 5S rRNA molecules of diverse organismsCharacters = sequence and structure of tRNA

CI = 0072 RI = 0772 and g1=minus013Tree length = 11381 steps

(a)

Ancestral

AB B

EB

AE

EABEA

Derived

A E B

Timeline

Eukarya (E)

Archaea (A)

Bacteria (B)

ABE

(b)

1000

Root

Eukarya

Archaea

Bacteria

Taxa = 150 free-living proteomesCharacters = genomic abundance of 1510 FSFsTree length = 70132 stepsCI = 0155 RI = 0739 and g1=minus028

(c)

Figure 4 Trees of life generated from the structure of RNA and protein molecules congruently show a rooting in Archaea (a) A rootedphylogenetic tree of 5S rRNA reconstructed from both the sequence and the structure of the molecules (from [105]) (b) Global most-parsimonious scenario of organismal diversification based on tRNA (from [107]) A total of 571 tRNA molecules with sequence basemodification and structural information were used to build a ToL which failed to show monophyletic groupings Ancestries of lineageswere then inferred by constraining sets of tRNAs into monophyletic groups representing competing (shown in boxes) or noncompetingphylogenetic hypotheses and measuring tree suboptimality and lineage coalescence (illustrated with color hues in circles) (c) A ToLreconstructed from the genomic abundance counts of 1510 FSFs as phylogenetic characters in the proteomes of 150 free-living organismssampled equally and randomly from the three domains of life (data taken from [122 123]) Taxa were labeled with circles colored accordingto superkingdom CI = consistency index RI = retention index and 119892

1= gamma distribution parameter

selected conformations and decreasing their relative number. Thus, conformational diversity, measured for example by the Shannon entropy of the base-pairing probability matrix, or features of thermodynamic stability act as an “evo-devo” proxy of a generality criterion for RNA molecules, in which the criterion of similarity (e.g., ontogenetic transformation) is of positional and compositional correspondence. Using a different approach, we recently broadened the use of phylogenetic constraint analysis [107, 108], borrowing from a formal cybernetic method that decomposes a reconstructable system into its components [109], and used it to root a ToL derived from tRNA sequence and structure (Figure 4(b)). The ToL was again rooted in Archaea. The number of additional steps required to force (constrain) particular taxa into a monophyletic group was used to define a lineage coalescence distance (S) with which to test alternative hypotheses of monophyly. These hypotheses were then ordered according to S value in an evolutionary timeline. Since S records the relative distribution of character states in taxic sets, it also embodies the generality criterion of rooting. Finally, we generated rooted ToLs that describe the evolution of proteomes directly from a census of protein fold structures in proteomes (Figure 4(c)). This ToL shows a paraphyletic rooting in Archaea. The method extracts phylogenetic signal from the proteomic abundance of protein fold structures and considers that the most abundant and widely distributed folds are of ancient origin when defining transformation series [41]. This polarization scheme, which results in the gradual growth of the proteome repertoire, again represents an embodiment of the generality criterion, in which statements of homology (fold structures) result from additive phylogenetic change (increases in abundance). It is noteworthy that these ToL reconstructions take into account the genomic abundance of each and every fold structure in each proteome and across the entire matrix, thereby generating a frustrated system. In general, very high abundance of only a few folds will not attract taxa (species) to the derived branches in the ToL. The ancestry of taxa is determined by both the abundance and interplay among fold structure characters. For example, metabolic folds such as those involved in ATP hydrolysis are widespread in cells and considered to be very ancient. These are also the most ancient folds in our phylogenies. In comparison, some popular eukaryote-specific folds (e.g., immunoglobulin superfamilies) are highly abundant but appear in a derived manner in our phylogenies. Thus, we reason that there is no circularity involved in the character polarization scheme.
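To make the RNA conformational diversity measure mentioned above more concrete, the following minimal Python sketch computes a mean positional Shannon entropy from a base-pairing probability matrix. The function name `mean_positional_entropy`, the use of log base 2, and the treatment of the unpaired state as one extra outcome at each position are assumptions made for illustration; the cited studies may define the entropy differently.

```python
import math


def mean_positional_entropy(bpp):
    """Mean Shannon entropy (in bits) per nucleotide.

    bpp[i][j] is the probability that nucleotides i and j are paired
    (symmetric matrix, zero diagonal). Hypothetical helper, not the
    implementation used in the cited work.
    """
    n = len(bpp)
    total = 0.0
    for i in range(n):
        paired = [bpp[i][j] for j in range(n) if j != i]
        # Probability that nucleotide i stays unpaired.
        unpaired = max(0.0, 1.0 - sum(paired))
        for p in paired + [unpaired]:
            if p > 0.0:
                total -= p * math.log2(p)
    return total / n


# Toy two-nucleotide "molecules" (illustrative values only).
rigid = [[0.0, 1.0], [1.0, 0.0]]   # one pairing, always formed
floppy = [[0.0, 0.5], [0.5, 0.0]]  # paired half of the time

print(mean_positional_entropy(rigid), mean_positional_entropy(floppy))
```

Under these assumptions, a molecule that always adopts the same pairing scores zero, and the score grows as probability is spread over more alternative pairings, which is the sense in which lower entropy reflects greater conformational order.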

The compositional schemes extend the concept of rooting with paralogous sequences to the entire proteome complements, from gene family level [100] to structural hierarchies. The three examples make use of different rooting rationales but provide a congruent scenario of origins of diversified life. Technically, roots are inferred by the Lundberg method [110], which does not require specification of any outgroup taxa. This method roots the trees a posteriori by attaching the hypothetical ancestor to the branch of the unrooted network that would yield the minimum increase in tree length (thus preserving the principle of parsimony).

(iii) Characters Must Serve as Independent Evolutionary Hypotheses. Valid phylogenetic optimization requires that characters be independent pieces of evidence. Characters should not depend on other characters. When the assumption of independence is violated, characters are overweighted in the analysis and the resulting phylogeny fails to represent true history [111]. Possible dependencies could be of many kinds, from structural to functional and from developmental to ecological. These dependencies distort and obscure phylogenetic signal and must be either avoided or coded into the phylogenetic model through parameters or weight corrections.

As we will now elaborate, the problem of character independence is about parts and wholes in the hierarchical fabric of life and in the nested hierarchies of the ToL. Biological systems are by definition made of parts, regardless of the way parts are defined. In evolution, diversification and integration of parts unify parts into cohesive entities, modules, which then diversify [112]. This process and the rise of modules may explain evolutionary waves of complexity and organization and the emergence of structure (defined broadly) in biology that is hierarchical. The hierarchical makeup is made evident in the structure of protein molecules, where lower-level parts of the polymer (the amino acid residues) interact with each other and establish cohesive higher-level modular parts, which also establish interaction networks and are crucial for molecular function and for interaction of proteins with the cellular environment. Consequently, the structure of proteins can be described at increasing hierarchical levels of structural abstraction: sequences, motifs, loops, domains, families, superfamilies, topologies, folds, architectures, and classes. Two accepted gold standards of protein classification, structural classification of proteins (SCOP) [113] and class architecture topology homology (CATH) [114], use parts of this incomplete scheme to describe the atomic complexity of the molecules. We note that these classifications do not consider unrealized structural states, such as protein folds that are possible but that have never been identified in the natural world of protein structures. We also note that modules sometimes engage in combinatorial games. For example, protein domains are rearranged in evolution by fusions and fissions, producing the enormous diversity of alternative domain rearrangements that exist in multidomain proteins [92].

[Figure 5 image not reproduced. Panel (a), canonical paradigm (evolution of systems): the characters are parts (amino acid sequence sites), the terminals are systems 1 to 3 (proteins from Pfu, Mja, and Mba), and interactions between parts (e.g., atomic contacts between sites) have a high impact on phylogenetic reconstruction. Panel (b), new paradigm (evolution of parts): the characters are systems (the Pfu, Mja, and Mba proteomes), the terminals are parts 1 to 3 (FSF domains c.37.1, c.1.2, and c.26.1), and interactions between systems (e.g., proteomes in trophic chains, from picoplankton to higher trophic levels) have a low impact on phylogenetic reconstruction.]

Figure 5: A new phylogenetic strategy simplifies the problems of character independence. (a) The canonical paradigm explores the evolution of systems, such as the evolution of organisms in the reconstruction of ToLs. For example, the terminals of phylogenetic trees can be genes sampled from different organisms (e.g., Pyrococcus furiosus (Pfu), Methanococcus jannaschii (Mja), and Methanosarcina barkeri (Mba)), and phylogenetic characters can be amino acid sequence sites of the corresponding gene products. Character states can describe the identity of the amino acid at each site. Since characters are molecular parts that interact with other parts when molecules fold into compact 3D structures, their interaction violates the principle of character independence. Consequently, the effects of covariation must be considered in the phylogenetic model used to build the trees. (b) The new paradigm explores the evolution of parts, such as the evolution of protein domains in proteomes. For example, the terminals of phylogenetic trees can be domains defined at the fold superfamily level of structural complexity, and the characters used to build the trees can be proteomes. Character states can be the number of domains holding the FSF structure. Since proteomes interact with other proteomes only when organisms establish close interactions, interactions that could affect the abundance of domains in proteomes should be considered negligible (unless there is an obligate parasitic lifestyle involved), and there is no need to budget trophic interactions in the phylogenetic model.

Since biological systems evolve and carry common ancestry, parts of these systems by definition evolve and by themselves carry common ancestry. In other words, the histories of parts are embodied in the ensemble of lineages of the ToL. The focus of phylogenetic analysis, however, has been overwhelmingly the organism as the biological system, as testified by the effort devoted to taxonomical classification, systematic biology, and building of the ToL. The genomic revolution has provided a wealth of parts in the amino acid sequence sites of proteins and nucleic acid molecules, which have been used as phylogenetic characters for analysis of molecules representing organisms. This has maintained the focus of reconstructing trees of systems (the wholes). For example, amino acid sites of a protein are generally fit into a data matrix (alignment), which is then used to build trees of genes and organisms (Figure 5(a)) using modern algorithmic implementations [115]. However, and as we have mentioned, proteins have complicated structures that result from interactions between amino acids at the 3D atomic level. These intramolecular interactions, or at least some of them [116], induce protein folding and delimit molecular function and protein stability. They are responsible for protein secondary, supersecondary, domain, and tertiary structure, and by definition their mere existence induces violation of character independence. Penny and Collins [117] proposed a simple thought experiment in which the bioinformatician exchanges rows of sequence sites in the alignment matrix and asks what was lost in the process. Randomization of characters (columns) in the data matrix does not change the phylogenetic tree. However, randomization destroys the structure of the molecule and very likely its function. This confirms that reconstruction of trees from information in sequence sites violates character independence and in the process ignores structure and biology. The effects of violation of character independence may be minimal for trees of sequences that are closely related. However, trees describing deep historical relationships require that sequences be divergent, and this maximizes the chances of even wider divergences in molecular structure that are not being accounted for by the models of sequence evolution.
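The thought experiment of Penny and Collins can be illustrated with a small Python sketch. It shuffles the columns of a toy alignment and checks that all pairwise mismatch counts are unchanged. This shows only that site-by-site measures are blind to column order, while the order of residues in each sequence is scrambled. The sequences, the labels, and the helper names `pairwise_mismatches` and `shuffle_columns` are hypothetical.

```python
import random
from itertools import combinations


def pairwise_mismatches(alignment):
    """Number of differing sites for every pair of aligned sequences."""
    names = sorted(alignment)
    return {
        (a, b): sum(x != y for x, y in zip(alignment[a], alignment[b]))
        for a, b in combinations(names, 2)
    }


def shuffle_columns(alignment, seed=0):
    """Apply the same random column permutation to every sequence."""
    length = len(next(iter(alignment.values())))
    order = list(range(length))
    random.Random(seed).shuffle(order)
    return {name: "".join(seq[i] for i in order)
            for name, seq in alignment.items()}


# Toy alignment (made-up sequences, labeled after the taxa of Figure 5).
toy = {"Pfu": "GLSSAKVE", "Mja": "ALYSAKIE", "Mba": "ALYSGRIE"}
shuffled = shuffle_columns(toy, seed=42)

print(pairwise_mismatches(toy) == pairwise_mismatches(shuffled))
```

The same invariance holds for any score computed as a sum over columns, such as a parsimony tree length, which is why the information carried by the order and contacts of sites never enters such analyses.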

The genomic revolution has also provided a wealth of models of 3D atomic structure. These structures are used as gold standards to assign, with high confidence, structural modules to sequences. As mentioned earlier, proteomes embody collections of protein domains with well-defined structures and functions. Protein domain counts in proteomes have been used to generate trees of protein domains (Figure 5(b)) using standard cladistic approaches and well-established methods (reviewed in [87]). These trees describe the evolution of protein structure at global level. They are effectively trees of parts. While domains interact with each other in multidomain proteins or establish protein-protein interactions with the domains of other proteins, these interactions of parts do not violate character independence. This is because phylogenetic characters are actually proteomes, systems defined by structural states that exist at much higher levels than the protein domain, not far away from the organism level. Remarkably, no information is lost when character columns in the matrix are randomized in the thought experiment. The order of proteome characters in the matrix does not follow any rationale. Characters are not ordered by lifestyles or trophic levels of the organisms. Interactions between free-living organisms will seldom bias their domain makeup, and if so, those characters can be excluded from analysis. Even the establishment of symbiotic or obligate parasitic interactions, such as the nodule-forming symbioses between rhizobia and legumes, may have little impact on character independence as long as the joint inclusion of the host and the symbiont is avoided.

6. Evidence Supporting the Archaeal Rooting of the Universal Tree

Figure 4 shows examples of rooted ToLs generated from the sequence and structure of RNA and protein molecules. The different phylogenomic approaches arrive at a common rooted topology that places the stem group of Archaea at the base of the ToL (the archaeal rooting of Figure 1(c)). However, the evolutionary interrelationship of parts and wholes prompts the use of trees of parts, a focus on higher-level structure, and a decrease in confidence in the power of trees of systems. These concepts have been applied to the study of the history of nucleic acid and protein structures for over a decade and provide additional evidence in support of the archaeal rooting scenario.

[Figure 6 image not reproduced. Each panel shows a Venn diagram of all items and a second Venn diagram restricted to items with f > 0.6: (a) FSF domains, (b) GO terms, and (c) RNA families.]

Figure 6: Venn diagrams displaying the distributions of 1733 FSF domains (a), 1924 terminal GO terms (b), and 1148 RNA families (c) in the genomes of the three domains of life. FSF domain data were taken from Nasir et al. [122, 123] and included 981 completely sequenced proteomes from 70 Archaea, 652 Bacteria, and 259 Eukarya. Terminal GO terms corresponding to the “molecular function” hierarchy defined by the GO database [118] were identified in 249 free-living organisms, including 45 Archaea, 183 Bacteria, and 21 Eukarya (data taken from [123]). The Venn diagram of RNA families and the distribution of Rfam clans and families in organisms were taken from Hoeppner et al. [129] and their Dataset S1. Shown below are distribution patterns for FSFs, GO terms, and RNA families that are present in more than 60% of the organisms examined (f > 0.6). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

6.1. Comparative Genomic Argumentation. The distribution of gene-encoded products in the genomes of sequenced organisms and parsimony thinking can reveal global evolutionary patterns without formal phylogenetic reconstruction. We will show how simple numerical analyses of protein domains, molecular activities defined by the gene ontology (GO) consortium [118], and RNA families that are domain-specific or are shared between domains of life can uncover the tripartite division of cellular life, exclude chimeric scenarios of origin, and provide initial insight on the rooting of the ToL (Figure 6).

The number of unique protein folds observed in nature is very small. SCOP ver. 1.75 defines only ~2000 fold superfamilies (FSFs), groups of homologous domains unified on the basis of common structural and evolutionary relationships, for a total of 110,800 known domains in proteins [113, 119]. FSFs represent highly conserved evolutionary units that are suitable for studying organismal diversification [87]. Yafremava et al. [120] plotted the total number of distinct FSFs (FSF diversity) versus the average reuse of FSFs in the proteome of an organism (FSF abundance). This exercise uncovered a scaling behavior typical of a Benford distribution, with a linear regime of proteomic growth for microbial organisms and a superlinear regime for eukaryotic organisms (Figure 8 in [120]). These same scaling patterns are observed when studying the relationship between open reading frames and genome size [121]. Remarkably, archaeal and eukaryal proteomes exhibited minimum and maximum levels of FSF abundance and diversity, respectively. Bacterial proteomes, however, showed intermediate levels. We note that the general scaling behavior is consistent with a scenario in which evolutionary diversification proceeds from simpler proteomes to more complex ones in a gradual manner, supporting the principle of spatiotemporal continuity and revealing the nested phylogenetic hierarchies of organisms. Under this scenario (and the results of [120]), the streamlined archaeal proteomes represent the earliest form of cellular life. Remarkably, archaeal species harboring thermophilic and hyperthermophilic lifestyles encoded the most streamlined FSF repertoires (Figure 7). Clearly, modern thermophilic archaeons are most closely related to the ancient cells that inhabited planet Earth billions of years ago (also read below).
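As a rough illustration of the two quantities plotted in this kind of analysis, the Python sketch below derives FSF use and reuse from a list of domain assignments for one proteome. Here use is the number of distinct FSFs, and reuse is taken, as an assumption, to be the total number of domain occurrences. The text describes reuse as an average, so the exact definition in [120] may differ. The helper names and the toy proteome are hypothetical.

```python
import math
from collections import Counter


def fsf_use_and_reuse(domain_assignments):
    """Return (use, reuse) for one proteome.

    use   = number of distinct FSFs (diversity)
    reuse = total number of FSF domain occurrences (assumed definition)
    """
    counts = Counter(domain_assignments)
    return len(counts), sum(counts.values())


def log_point(domain_assignments):
    """Coordinates of the proteome in a log-log plot of use against reuse."""
    use, reuse = fsf_use_and_reuse(domain_assignments)
    return math.log10(use), math.log10(reuse)


# Toy proteome: one FSF identifier per detected domain (made-up counts).
toy_proteome = ["c.37.1"] * 6 + ["c.1.2"] * 3 + ["c.26.1"]

print(fsf_use_and_reuse(toy_proteome), log_point(toy_proteome))
```

Repeating this for every proteome yields one point per organism, and streamlined proteomes fall near the origin of the resulting plot.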

[Figure 7 image not reproduced. Scatter plot of FSF reuse (log) against FSF use (log) for thermophilic and nonthermophilic archaeal species; labeled points include S. marinus, I. hospitalis, H. butylicus, T. pendens, Cand. Korarchaeum, A. pernix, M. kandleri, T. neutrophilus, P. islandicum, and N. maritimus.]

Figure 7: A plot of FSF use (diversity) against FSF reuse (abundance) reveals a linear pattern of proteomic growth in 48 archaeal proteomes. Thermophilic archaeal species occupy positions that are close to the origin of the plot. They also populate the most basal branch positions in ToLs (see discussion in the main text). Both axes are in logarithmic scale.

FSF domain distributions in the genomes of the three domains of life provide further insights into their evolution. Nasir et al. [122, 123] generated Venn diagrams to illustrate FSF sharing patterns in the genomes of Archaea, Bacteria, and Eukarya (Figure 6(a)). These diagrams display the total number of FSFs that are unique to a domain of life (taxonomic groups A, B, and E), shared by only two (AB, BE, and AE), and those that are universal (ABE). About half of the FSFs (786 out of 1733) were present in all three domains of life, 453 of which were present in at least 60% of the organisms that were examined (f > 0.6, where f is the distribution index ranging from 0 to 1). The fact that about 70% of widely shared FSFs (672) belong to the ABE taxonomic group strongly supports a common evolutionary origin for cells. Evolutionary timelines have confirmed that a large number of these universal FSFs were present in the ancestor of the three domains of life, the urancestor (Figure 1), and were for the most part retained in extant proteomes [48].
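The bookkeeping behind such Venn diagrams can be sketched in a few lines of Python. The sketch assigns each FSF to one of the seven taxonomic groups from its presence in genomes and tallies how many items in each group exceed a distribution index cutoff. It assumes that f is the fraction of genomes carrying the item, computed over the domains that make up its group. That reading, the helper names, and the toy data are assumptions for illustration and are not the pipeline of [122, 123].

```python
from collections import Counter


def taxonomic_group(carriers, domain_of):
    """Venn group of one item: 'A', 'B', 'E', 'AB', 'AE', 'BE', or 'ABE'."""
    present = {domain_of[g] for g in carriers}
    return "".join(d for d in "ABE" if d in present)


def distribution_index(carriers, domain_of, group):
    """Fraction f of genomes in the domains of `group` that carry the item."""
    pool = [g for g, d in domain_of.items() if d in group]
    return sum(g in carriers for g in pool) / len(pool)


def venn_counts(occurrence, domain_of, cutoff=0.6):
    """Tally items per group, overall and with f above the cutoff."""
    all_items, widespread = Counter(), Counter()
    for item, carriers in occurrence.items():
        if not carriers:
            continue  # items found in no genome belong to no group
        group = taxonomic_group(carriers, domain_of)
        all_items[group] += 1
        if distribution_index(carriers, domain_of, group) > cutoff:
            widespread[group] += 1
    return all_items, widespread


# Toy data: five genomes and three hypothetical FSFs.
domain_of = {"arch1": "A", "arch2": "A", "bact1": "B", "bact2": "B", "euk1": "E"}
occurrence = {
    "fsf_x": {"arch1", "arch2", "bact1", "bact2", "euk1"},
    "fsf_y": {"bact1", "euk1"},
    "fsf_z": {"arch1"},
}
all_items, widespread = venn_counts(occurrence, domain_of)
print(dict(all_items), dict(widespread))
```

With the counts reported in the text, 453 of the 672 widely shared FSFs fall in the ABE group, that is, 453/672 ≈ 0.67, which the text rounds to about 70%.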

Interestingly, the number of BE FSFs was ~10-fold greater than the number of AE and AB FSFs (324 versus 38 and 38) (Figure 6(a)). The finding that Bacteria and Eukarya encode a significantly large number of shared FSF domains is remarkable and hints towards an unprecedented strong evolutionary association between these two domains of life. Moreover, a significant number of BE FSFs are widespread in the bacterial and eukaryotic proteomes: 56 of the 324 FSFs are shared by more than 60% of the bacterial and eukaryal organisms that were analyzed (Figure 6(a)). This is strong evidence of an ancient vertical evolutionary trace from their mutual ancestor (anticipated in [123]). This trace uniquely supports the archaeal rooting of the ToL. In turn, the canonical and eukaryotic rooting alternatives are highly unlikely, as neither alone can explain the remarkable diversity of the BE taxonomic group (as well as its ancient origin [123]). The remarkable vertical trace of the ABE taxonomic group and the negligible vertical trace of the AB group (only one FSF shared by more than 60% of organisms) also falsify fusion hypotheses responsible for a putative chimeric makeup of Eukarya. In light of these findings, the most parsimonious explanation of comparative genomic data is that Archaea is the first domain of diversified cellular life.

The patterns of FSF sharing are further strengthened by the genomic distributions of terminal-level molecular function GO terms (Figure 6(b)). The three domains of life shared about a quarter (526) of the 1924 GO terms that were surveyed. A total of 232 of these ABE terms were shared by more than 60% of the organisms analyzed. Again, the fact that about 76% of widely shared GO terms (306) belong to the ABE taxonomic group strongly supports a common vertical trace, while the finding that Bacteria and Eukarya share a significantly large number of GO terms (272) supports again the archaeal rooting of the ToL. An alternative explanation for the very large size of BE could be large-scale metabolism-related gene transfer from the ancestors of mitochondria and plastids to the ancestors of modern eukaryotes [124]. However, we note that the BE group is not restricted to only metabolic functions. It also includes FSFs and GOs involved in intracellular and extracellular processes and regulation and informational functions [125]. Thus, the very large size of BE is a significant outcome most likely shaped by vertical evolutionary scenarios and cannot solely be explained by parasitic/symbiotic relationships that exist between Bacteria and Eukarya [123]. Moreover, the simplicity of archaeal FSF repertoires is not due to the paucity of available archaeal genomic data. We confirmed that the mean FSF coverage (i.e., the number of proteins annotated to FSFs/GOs out of the total) for Archaea, Bacteria, and Eukarya was largely comparable (e.g., Table S1 in [122]). Finally, as we will describe below, these comparative patterns and tentative conclusions are confirmed by phylogenomic analysis [91, 122, 126]. Congruent phylogenies were obtained with (Figure 4(c)) and without equal and random sampling of taxa. Thus, the relatively low number of archaeal genomes is also not expected to compromise our inferences.

We note that Eukarya shares many informational genes with Archaea and many operational genes with Bacteria. While informational genes have been thought to be more refractory to HGT than noninformational genes, we have confirmed at GO level that this is not the case. A ToL built from GO terms showed that in fact noninformational terms were less homoplasious, while statistical enrichment analyses revealed that HGT had little if any functional preference for GO terms across GO hierarchical levels. We recently compared ToLs reconstructed from non-HGT GO terms and ToLs reconstructed from informational GO terms that were extracted from non-HGT GO terms (Kim et al., ms. in review; also read below). In both cases, Archaea appeared as a basal paraphyletic group of the ToLs and the common origin of Bacteria and Eukarya was maintained. Thus, the 272 GO terms shared by Bacteria and Eukarya harbor a strong vertical trace.

Another observation often used as support for Archaea-Eukarya kinship is the discovery of a few eukaryote-specific proteins (e.g., actin, tubulin, H3, H4, ESCRT, ribosomal proteins, and others) in some archaeal species [127] and their complete absence from Bacteria (except for documented tubulin HGT from eukaryotes to the bacterial genus Prosthecobacter [128]). This suggests either that eukaryotes arose from an archaeal lineage [127] or that the ancestor of Eukarya and Archaea was complex and modern Archaea are highly reduced [25]. Indeed, few eukaryote-specific proteins have now been found in some archaeal species (in some cases just one species). We argue that this poor spread cannot be taken as evidence for the Archaea-Eukarya sister relationship. This needs to be confirmed by robust phylogenetic analysis, which is unfortunately not possible when using protein sequences. Recently, we described a new strategy for inferring vertical and horizontal traces [123]. This method calculates the spread of FSFs and GOs that are shared between two-superkingdom groups (i.e., AB, AE, and BE). Balanced distributions often indicate vertical inheritance, while biased distributions suggest horizontal flux. For example, penicillin binding molecular activity (GO:0008658) was present in 100% of the sampled bacterial proteomes but was only present in 11% of the archaeal species [123]. Thus, presence of GO:0008658 in Archaea was attributed to HGT gain from Bacteria. Using this simple method, we established that Bacteria and Eukarya were united by a much stronger vertical trace than either was to Archaea. In fact, strong reductive tendencies in the archaeal genomes were recorded [123]. Thus, in our opinion, presence of eukaryote-specific proteins in only very few archaeal species could in fact be an HGT event that is not detectable by sequence phylogenies.
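The spread comparison described above can be sketched as follows. The code computes the fraction of genomes in each of two superkingdoms that carry a shared item and labels the distribution as balanced or biased. The cutoff `max_gap` and the helper names are hypothetical choices for illustration; the text does not state the decision rule used in [123].

```python
def spread(carriers, genomes_in_domain):
    """Fraction of the genomes of one superkingdom that carry the item."""
    genomes = set(genomes_in_domain)
    return len(set(carriers) & genomes) / len(genomes)


def trace(spread_1, spread_2, max_gap=0.5):
    """Label a two-superkingdom distribution.

    'balanced' hints at vertical inheritance and 'biased' at horizontal
    flux. The max_gap cutoff is an assumed value for illustration only.
    """
    return "balanced" if abs(spread_1 - spread_2) <= max_gap else "biased"


# Spreads reported in the text for penicillin binding (GO:0008658):
# 100% of sampled bacterial proteomes and 11% of archaeal species.
print(trace(1.00, 0.11))
```

With this cutoff, the GO:0008658 example is labeled biased, in line with its attribution to HGT gain from Bacteria.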

In turn, many arguments now favor the Bacteria-Eukarya sisterhood, in addition to the structure-based phylogenies and balanced distribution of molecular features. For example, Bacteria and Eukarya have similar lipid membranes, which can be used as an argument for their evolutionary kinship. Moreover, Archaea fundamentally differ from Eukarya in terms of their virosphere. Viruses infecting Archaea and Eukarya are drastically different, as recently discussed by Forterre [25].

The genomic distribution of RNA families was taken from Hoeppner et al. [129] and also shows a vertical evolutionary trace in five crucial Rfam clans that are universal, including tRNA, 5S rRNA, subunit rRNA, and RNase P RNA. These universal RNA groups are likely minimally affected by HGT. However, 99% of Rfam clans and families are specific to domains of life, and only 11 of the 1148 groups were shared at f > 0.6 levels. This clearly shows that the functional complexity of RNA materialized very late during organismal diversification and that it is not a good genomic feature for exploring the rooting of the ToL. Only five RNA families are shared between two domains of life, and only one of these does so at f > 0.6: the G12 pseudoknot of the 23S rRNA, which is present in bacterial and eukaryotic organellar rRNA. While the large subunit rRNA scaffold supports the G12 pseudoknot with its vertical trace, all five interdomain RNA families can be explained most parsimoniously by HGT. Thus, only a handful of ancient and universal RNA species can be used to root the ToL.

We end by noting that the Venn diagrams consistently show that Archaea harbors the smallest number of unique (A) and shared (AB and AE) FSFs, GO terms, or RNA families. This trend supports an early divergence of this domain of life from the urancestor and the possibility that such divergence was shaped by evolutionary reductive events. We reason that such losses would be more parsimonious early on in evolution than in later periods. This is because genes often increase their abundance in evolutionary time (by gene duplications, HGT, and other processes). Thus, it is reasonable to think that loss of an ancient gene would be more feasible very early in evolution relative to losing it very late.

6.2. Phylogenomic Evidence from the Sequence and Structure of RNA. Phylogenetic analyses of the few RNA families that are universal and display an important vertical evolutionary trace (Figure 6(c)) provide compelling evidence in favor of an early evolutionary appearance of Archaea [105, 107, 108, 130–135]. Here we briefly summarize evidence from tRNA, 5S rRNA, and RNase P RNA. Unpublished analyses of rRNA sequence and structure using advanced phylogenetic methods also show that Archaea was the first domain of life.

tRNA molecules are generally short (~73–95 nucleotides in length) and highly conserved. Consequently, their sequence generally contains a limited amount of phylogenetic information. These limitations have been overcome by analyzing entire tRNomes [136], which extend the length of an organismal set of tRNAs to over 2000 bases. Xue et al. [130, 131] analyzed the genetic distances between tRNA sequences as averages between alloacceptor tRNAs from diverse groups of tRNomes using multiple molecular base substitution models. The distances were mapped onto an unrooted phylogeny of tRNA molecules (Figure 8(a)). The “arrow of time” assumption in these studies is that ancient tRNA paralogs closely resemble each other when lineages originate close to the time of the gene duplication. Remarkably, the results revealed a paraphyletic rooting of the ToL in Archaea. The root was specifically located close to the hyperthermophilic methanogen Methanopyrus kandleri (Figure 8(a)). The hypothesis of this specific rooting scenario has been supported by several other studies [134, 137], including a study of genetic distances between paralogous pairs of aminoacyl-tRNA synthetase (aaRS) proteins [131]. A remarkable match between distance scores of tRNA and pairs of aaRS paralogs (Figure 8(b)) not only confirms the early appearance of Archaea but also suggests a coevolutionary trace associated with molecular interactions that are responsible for the genetic code. In fact, a recent exhaustive phylogenomic analysis of tRNA and aaRS coevolution explicitly reveals the origins and evolution of the genetic code and the underlying molecular basis of genetics [59]. Di Giulio [132, 133] has also proposed an archaeal rooting of the tree of life, specifically in the lineage leading to the phylum Nanoarchaeota. This rooting is based on unique and ancestral genomic traits of Nanoarchaeum equitans: split genes separately codifying for the 5′ and 3′ halves of tRNA and the absence of operons, which are considered molecular fossils [132]. However, this claim needs additional support, as contrasting evidence now recognizes N. equitans as a highly derived archaeal species [138].
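The idea of averaging distances between alloacceptor tRNAs within a genome can be sketched in Python as follows. The sketch assumes one pre-aligned sequence per amino acid and uses the raw fraction of mismatched sites as the distance, whereas the studies cited above used multiple base substitution models. The function names and the toy sequences are hypothetical.

```python
from itertools import combinations


def mismatch_fraction(seq_a, seq_b):
    """Fraction of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def d_allo(trnas_by_amino_acid):
    """Mean pairwise distance over all pairs of alloacceptor tRNAs."""
    pairs = list(combinations(sorted(trnas_by_amino_acid), 2))
    total = sum(
        mismatch_fraction(trnas_by_amino_acid[x], trnas_by_amino_acid[y])
        for x, y in pairs
    )
    return total / len(pairs)


# Toy genome with three made-up, pre-aligned tRNA fragments.
toy_genome = {"Ala": "GGGCUA", "Gly": "GGGCUU", "Ser": "GCGAUU"}
print(d_allo(toy_genome))
```

With one tRNA for each of the 20 amino acids, the number of pairs is 20 × 19 / 2 = 190. Under the arrow of time assumption stated above, lower averages would point to genomes whose tRNA paralogs have diverged less.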

In addition to sequences, structural features of tRNA molecules also support the archaeal rooting of the ToL.

[Figure 8 image not reproduced. Panel (a): unrooted tree of archaeal, bacterial, and eukaryal taxa colored by D_allo on a scale from 0.35 to 0.92. Panel (b): plot of aminoacyl-tRNA synthetase history (Q_ars) against tRNA alloacceptor history (D_allo) for Archaea, Bacteria, and Eukarya, with regions labeled ancient origins (near Mka) and recent origins.]

Figure 8: Ancient phylogenetic signal in the sequence of tRNA and their associated aminoacyl-tRNA synthetase (aaRS) enzymes. (a) Unrooted ToL derived from tRNA sequences with their alloacceptor D_allo distances traced in thermal scale (from [130]). D_allo is the average of pairwise distances for 190 pairs of tRNA isoacceptors. These distances measure pairwise sequence mismatches of tRNAs for every genome, and their values increase for faster evolving sequences of species of more recent origin. (b) Coevolution of aaRSs and their corresponding tRNA. Genetic distances between the top 10 potentially paralogous aaRS pairs estimated using BLASTP define a measure (Q_ars) of how closely the proteins resemble each other in genomes (from [131]). Larger Q_ars scores imply more ancestral and slowly evolving protein pairs. The plot of Q_ars scores against D_allo distances reveals a hidden correlation between the evolution of tRNA and aaRSs and the early origin of Archaea (larger Q_ars and lower D_allo distances).

The application of RNA structural evidence in phylogenetic studies [50, 139–141] has multiple advantages over sequence data when studying ancient events, especially because RNA structures are far more conserved than sequences. This has been demonstrated in a phylogenetic approach that uses RNA structural information to reconstruct the evolutionary history of macromolecules such as rRNA [29, 50], tRNA [107, 108], 5S rRNA [105], RNase P RNA [135], and SINE RNA [142]. Geometrical and statistical properties of structure (e.g., stems or loops commonly found in the secondary structures of RNA molecules) are treated as linearly ordered multistate phylogenetic characters. In order to build rooted trees, an evolutionary tendency toward conformational order is used to polarize change in character state transformations. This defines a hypothetical ancestor with which to root the ingroup using the Lundberg method. Reconstructed phylogenies produce trees of molecules and ToLs (e.g., Figure 4(a)) or trees of substructures that describe the gradual evolutionary accretion of structural components into molecules (Figure 9). For example, phylogenetic trees of tRNA substructures define explicit models of molecular history and show that tRNA originated in the acceptor stem of the molecule [108]. Remarkably, trees reconstructed from tRNA drawn from individual domains of life demonstrate that the sequence of accretion events occurred differently in Archaea than in Bacteria and Eukarya, suggesting a sister group evolutionary relationship between the bacterial and eukaryotic domains (Figure 9(a)). A similar result obtained from trees of 5S rRNA substructures revealed different molecular accretion sequences of archaeal molecules when these were compared to bacterial and eukaryal counterparts [105], confirming again, in a completely different molecular system, the history of the domains of life.

An analysis of the structure of RNase P RNA also provides similar conclusions [135]. While a ToL reconstructed from molecular structure placed type A archaeal molecules at its base (a topology that resembles the ToL of 5S rRNA), a tree of RNase P RNA substructures uncovered the history of molecular accretion of the RNA component of the ancient endonuclease and revealed a remarkable reductive evolutionary trend (Figure 9(b)). Molecules originated in stem P12 and were immediately accessorized with the catalytic P1–P4 pseudoknotted core structure that interacts with RNase P proteins of the endonuclease complex and ancient segments of tRNA. Soon after this important accretion stage, the evolving molecule loses its first stem in Archaea (stem P8), several accretion steps earlier than the first loss of a stem in Eukarya or the first appearance of a Bacteria-specific stem. These phylogenetic statements provide additional strong support to the early origin of the archaeal superkingdom prior to the divergence of the shared common ancestor of Bacteria and Eukarya. As we will discuss below, the early loss of a structure in the molecular accretion process of a central and ancient RNA family is significant. It suggests that the emerging archaeal lineages were subjected to strong reductive evolutionary pressures during the early evolution of a very ancient RNA molecule.

[Figure 9 image not reproduced. Panel (a): rooted trees of tRNA arm substructures (Acc, AC, DHU, TΨC, and Var arms and the 3′ end) for Archaea, Bacteria, and Eukarya, on an age (nd) scale from 0 (ancestral) to 1 (derived). Panel (b): rooted tree of RNase P RNA stems (P1 to P20) on an age (nd) scale from 0 (ancestral) to 1 (derived), with annotations for ancient universal stems, the universal pseudoknot, the first stem lost in Archaea (type M), the type A pseudoknot (first loss in Eukarya), the first stem specific to Bacteria (type A), the first stems specific to Archaea (type A), and bacterial type B-specific stems.]

Figure 9: The history of accretion of tRNA and RNase P RNA substructures reveals the early evolutionary appearance of Archaea. (a) Rooted trees of tRNA arm substructures reveal the early appearance of the acceptor arm (Acc) followed by the anticodon arm (AC) in Archaea or the pseudouridine (TΨC) arm in both Bacteria and Eukarya (from [108]). The result confirms the sister group relationship of Bacteria and Eukarya. (b) Trees of molecular substructures of RNase P RNAs were reconstructed from characters describing the geometry of their structures (from [135]). Branches and corresponding substructures in a 3D atomic model are colored according to the age of each substructure (nd, node distance). Note the early loss of stem P8 in Archaea immediately after the evolutionary assembly of the universal functional core of the molecule.

6.3. Phylogenomic Evidence from Protein Domain Structure. G. Caetano-Anollés and D. Caetano-Anollés [41] were the first to utilize protein domain structures as taxa and to reconstruct trees of domains (ToDs) describing their evolution. Figure 10 shows a rooted ToD built from a census of FSF structures in 981 genomes (data taken from [122, 123]). These trees are unique in that their terminal leaves represent a finite set of component parts [87]. These parts describe at global level the structural diversity of the protein world. When building trees of FSFs, the age of each FSF domain structure can be calculated from the ToDs by simply calculating a distance in nodes between the root of the tree and its corresponding terminal leaf. This is possible because ToDs exhibit highly unbalanced (pectinate) topologies that are the result of semipunctuated processes of domain appearance and accumulation. This node distance (nd), rescaled from 0 to 1, provides a relative timescale to study the order of FSF appearance in evolutionary history [126, 143]. Wang et al. [144] showed that nd correlates linearly with geological time and defines a global molecular clock of protein folds. Thus, nd can be used as a reliable proxy for time. Plotting the age of FSFs in each of the seven taxonomic groups confirmed evolutionary statements we had previously deduced from the Venn diagrams of Figure 6. The ABE taxonomic group included the majority of ancient and widely distributed FSFs. This is expected. In the presence of a strong vertical trace, molecular diversity must delimit a nested taxonomic hierarchy. The ABE group was followed by the evolutionary appearance of the BE group, which preceded the first domain-specific structures, which were Bacteria-specific (B). Remarkably, Archaea-specific (A) and Eukarya-specific (E) structures appeared concurrently and relatively late. These general trends, captured in the box plots of Figure 10, have been recovered repeatedly when studying domain structures at various levels of structural complexity, from folds to fold families [91, 126], when using CATH or SCOP structural definitions [122, 145], or when exploring the evolution of terminal GO terms.
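The node-distance calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published implementation: the tree is encoded as nested tuples, the leaf names are hypothetical, and the convention that a leaf attached directly to the root receives nd = 0 is a choice made for this sketch.

```python
def node_distances(tree):
    """Return {leaf: nd} for a rooted tree given as nested tuples.

    Depth counts the internal nodes crossed between the root and a leaf
    (the root itself is not counted), and nd rescales depth to [0, 1].
    """
    depths = {}

    def walk(node, depth):
        if isinstance(node, str):       # terminal leaf (one domain structure)
            depths[node] = depth
        else:                           # internal node: visit every child
            for child in node:
                walk(child, depth + 1)

    walk(tree, -1)                      # children of the root get depth 0
    max_depth = max(depths.values())
    if max_depth == 0:                  # star tree: all leaves are equally old
        return {leaf: 0.0 for leaf in depths}
    return {leaf: depth / max_depth for leaf, depth in depths.items()}


# Hypothetical pectinate (comb-like) tree of five domain structures.
toy_tree = ("fsf_a", ("fsf_b", ("fsf_c", ("fsf_d", "fsf_e"))))
nd = node_distances(toy_tree)
for leaf in sorted(nd, key=nd.get):
    print(f"{leaf}: nd = {nd[leaf]:.2f}")
```

Because the toy tree is fully pectinate, each successive leaf sits one node further from the root, so the nd values are evenly spaced between 0 (most ancient) and 1 (most recent). In a balanced tree many leaves would share the same nd, which is why the unbalanced topology of ToDs matters for using nd as a timeline.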

[Figure 10 image: a rooted ToD with the most basal FSFs labeled (c.55.1, c.37.1, b.40.4, c.2.1, a.4.5, and c.66.1) and box plots of the age of FSFs (nd, 0.0 to 1.0) for the ABE, BE, AB, AE, B, A, and E taxonomic groups, under the heading "Origin of FSF domains." Tree statistics: taxa = 1,733 FSF domains; characters = 981 genomes; tree length = 458,226 steps; CI = 0.049; RI = 0.796; g1 = −0.08. Color key for the distribution index f, with the number of FSFs in parentheses: f between 0.0 and 0.1 (529), between 0.1 and 0.5 (623), between 0.5 and 0.9 (336), between 0.9 and 1.0 (228), and f = 1.0 (17).]

Figure 10: Phylogenomic tree of domains (ToD) describing the evolution of 1,733 FSF domain structures. Taxa (FSFs) were colored according to their distribution (f) in the 981 genomes that were surveyed and used as characters to reconstruct the phylogenomic tree (data taken from [122, 123]). The most basal FSFs are labeled with SCOP alphanumeric identifiers (e.g., c.37.1 is the P-loop containing nucleoside triphosphate hydrolase FSF). Boxplots display the age (nd value) distribution of FSFs for the seven possible taxonomic groups. nd values were calculated directly from the tree [122] and define a timeline of FSF innovation from the origin of proteins (nd = 0) to the present (nd = 1). The group of FSFs that are shared by the three domains of life (ABE) is the most ancient taxonomic group, which spans the entire time axis, and their FSFs are widely distributed in genomes. The appearance of the BE group coincides with the first reductive loss of an FSF in Archaea. FSF structures specific to domains of life appear much later in evolution.

While the early rise of BE FSFs supports the early divergence of Archaea from the urancestor, the very significant trend of gradual loss of structures occurring in the lineages of the archaeal domain and the very late appearance of Archaea-specific structures (e.g., [126]) demand explanation. Since ancient BE FSFs are widely distributed in proteomes (Figure 10), they cannot arise from separate gains of FSFs in Bacteria and Eukarya or by processes of horizontal spread of structures. This was already evident from the Venn diagrams of Figure 6. Moreover, the BE sisterhood to the exclusion of Archaea was further supported by the inspection of FSFs involved in lipid synthesis and transport (Table 1). Membrane lipids are very relevant to the origins of diversified life ([146] and references therein). Bacteria and Eukarya encode similar lipid membranes, while archaeal membranes have different lipid composition (isoprenoid ethers). To check if lipid synthesis was another BE synapomorphy, we identified 17 FSFs that were involved in lipid metabolism and transport (Table 1). Remarkably, the largest share of these FSFs (7 out of 17) was unique to the BE group. In comparison, none were present in either the AE or AB groups. The ABE group included five universal FSFs, while one was unique to Bacteria and four were Eukarya-specific. In turn, no FSF was unique to Archaea. The BE FSFs cannot be explained by modern effects impinging on variations in proteomic accumulation in FSFs or by processes of domain rearrangement, since these appear in the protein world quite late in evolution [92]. The only and most parsimonious explanation of the patterns of FSF distribution that unfold in the ToD is the very early (and protracted) rise of the archaeal domain by processes of reductive evolution, possibly triggered by the adaptation of urancestral lineages to harsh environments and survival modes. Under extremophilic conditions, typical of hyperthermophilic environments, considerable investments of matter-energy and information must be made for protein persistence [120]. This puts limits on viable protein structures [147]. Extremophilic environments will thus poise the maintenance of a limited set of FSFs for persistence of emergent diversified lineages. This would induce a primordial episode of reductive evolution in the growing FSF repertoire, explaining why hyperthermophilic and thermophilic archaeal species hold the most reduced proteomes (Figure 7). It would also explain the biases that exist in FSFs, GO terms, and RNA families (Figure 6) and the placement of hyperthermophilic and thermophilic archaeal species at the base of ToLs. Since Archaea populate the oceans and sometimes rival Bacteria in number in those environments, we further interpret the late appearance of Archaea-specific FSFs as the result of late colonization of these mild environments by both ancient archaeons and emerging eukaryotes. This relaxes primordial extremophilic pressures on protein structures and enables the late archaeal exploration of structural flexibility and functional novelty.

Table 1: List of FSFs involved in lipid metabolism and transport along with taxonomic distribution (data taken from [122, 123]).

Group   SCOP Id   FSF Id     FSF description
ABE     89392     b.125.1    Prokaryotic lipoproteins and lipoprotein localization factors
ABE     53092     c.55.2     Creatinase/prolidase N-terminal domain
ABE     49723     b.12.1     Lipase/lipooxygenase domain (PLAT/LH2 domain)
ABE     54637     d.38.1     Thioesterase/thiol ester dehydrase-isomerase
ABE     63825     b.68.5     YWTD domain
BE      47027     a.11.1     Acyl-CoA binding protein
BE      48431     a.118.4    Lipovitellin-phosvitin complex, superhelical domain
BE      55048     d.58.23    Probable ACP-binding domain of malonyl-CoA ACP transacylase
BE      56968     f.7.1      Lipovitellin-phosvitin complex, beta-sheet shell regions
BE      58113     h.5.1      Apolipoprotein A-I
BE      47162     a.24.1     Apolipoprotein
BE      56931     f.4.2      Outer membrane phospholipase A (OMPLA)
B       82220     b.120.1    Tp47 lipoprotein, N-terminal domain
E       47699     a.52.1     Bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin
E       82936     h.6.1      Apolipoprotein A-II
E       57190     g.3.10     Colipase-like
E       49594     b.7.4      Rab geranylgeranyltransferase alpha-subunit, insert domain
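The group counts quoted for Table 1 can be checked with a short tally. The sketch below is illustrative only: it copies the group labels and SCOP Ids listed in Table 1, and the variable names are hypothetical.

```python
from collections import Counter

# (taxonomic group, SCOP Id) pairs as listed in Table 1.
TABLE_1 = [
    ("ABE", 89392), ("ABE", 53092), ("ABE", 49723), ("ABE", 54637),
    ("ABE", 63825),
    ("BE", 47027), ("BE", 48431), ("BE", 55048), ("BE", 56968),
    ("BE", 58113), ("BE", 47162), ("BE", 56931),
    ("B", 82220),
    ("E", 47699), ("E", 82936), ("E", 57190), ("E", 49594),
]

# The seven possible taxonomic groups for three domains of life.
GROUPS = ["ABE", "BE", "AB", "AE", "A", "B", "E"]

observed = Counter(group for group, _scop_id in TABLE_1)
tally = {group: observed.get(group, 0) for group in GROUPS}

for group in GROUPS:
    print(f"{group:>3}: {tally[group]}")
print("total:", sum(tally.values()))
```

The tally reproduces the numbers in the text: seven BE structures, five ABE, one B, four E, and none in any group that would require an Archaea-specific gain (A) or sharing between Archaea and only one other domain (AB or AE).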

6.4. Full Circle: Evidence from Trees of Proteomes and Functionomes and a Tree Derived from the Distribution of Viral Replicons in Superkingdoms. While we distrust trees of systems, especially ToLs built from sequences, the use of molecular structure at high levels of structural abstraction has the potential to mitigate some limitations of sequence analysis [93]. For example, rooted ToLs built from abundance counts of domain structures and terminal GO terms in the genomes of free-living organisms describe the evolution of proteomes (e.g., [91]). All ToL reconstructions of these kinds approximate the physiology of living organisms, dissect the three primary domains of life, and reveal the early paraphyletic origin of extremophilic archaeal lineages followed by the late appearances of monophyletic Bacteria and Eukarya. These patterns have been reliably recovered with datasets of varying sizes, irrespective of the structural classification scheme [91, 122, 126, 145]. Even a tree reconstructed from the distribution of 2,662 viral replicons in superkingdoms, taken from an exhaustive comparative genomic analysis of viral genomes, showed the basal placement of Archaea and the sister taxa relationship between Bacteria and Eukarya (Figure 11).

[Figure 11 image: tree of Archaea, Bacteria, and Eukarya with a bootstrap support value of 100, a scale bar of 5 changes, and internal-node abundance vectors (5, 0, 0, 0, 0, 0) and (36, 5, 5, 5, 0, 0).]

Figure 11: One of three optimal phylogenetic tree reconstructions with identical topologies recovered from an exhaustive maximum parsimony search (58 steps; CI = 1, RI = 1, HI = 0, RC = 1, g1 = −0.707) of the abundance of viral replicon types of dsDNA, ssDNA, dsRNA, ssRNA(+), ssRNA(−), and retrotranscribing viruses. Abundance was scored on a 0–20 scale and ranged from 0 to 759 viral replicons. Vectors of abundance reconstructions in internal nodes are given as percentage of total abundance of replicons in superkingdoms. Bootstrap support values are shown below nodes.

6.5. Additional Evidence from Comparative Genomics. The uneven distribution of protein domain structures in the world of proteomes (Figure 6) is preserved as we climb in the structural hierarchy. This was recently made evident when studying the evolution of CATH domains [145]. The Venn diagrams of Figure 12 show how domain structures in all taxonomic groups decrease in numbers with increases in evolutionary conservation. At the highest CATH architectural level, there were 32 universal architectures but no domain-specific architectures. The four architectures shared by Bacteria and Eukarya were present in at least 60% of the proteomes that were surveyed. The other two interdomain architectures were topological designs of considerable complexity that were poorly shared between proteomes and were evolutionarily derived. They were the clam architectures lost in Eukarya and the box architectures of nucleotide excision repair shared by Archaea and Eukarya. The most parsimonious corollary of these distribution patterns is that the BE taxonomical group must arise by loss of structures in Archaea. Indeed, ToDs describing the evolution of CATH domain structures again confirm the early appearance of BE structures and, consequently, their loss in Archaea.
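The partition of structures into the seven taxonomic groups of a three-set Venn diagram can be sketched as follows. This is an illustration under assumptions: the proteome records and structure names are toy data, and the function names are hypothetical rather than taken from any published pipeline.

```python
from collections import Counter

DOMAIN_ORDER = "ABE"  # Archaea, Bacteria, Eukarya


def taxonomic_group(domains):
    """Map a set of domain letters to a Venn label such as 'BE' or 'ABE'."""
    label = "".join(d for d in DOMAIN_ORDER if d in domains)
    if not label:
        raise ValueError("structure not detected in any domain of life")
    return label


def venn_counts(proteomes):
    """Count structures per taxonomic group.

    `proteomes` is an iterable of (domain_letter, set_of_structures) pairs,
    one pair per surveyed proteome.
    """
    seen_in = {}
    for domain, structures in proteomes:
        for structure in structures:
            seen_in.setdefault(structure, set()).add(domain)
    return Counter(taxonomic_group(domains) for domains in seen_in.values())


# Toy census: four proteomes and five hypothetical structures.
toy_proteomes = [
    ("A", {"s1", "s2"}),
    ("B", {"s1", "s3", "s4"}),
    ("B", {"s1", "s3"}),
    ("E", {"s1", "s2", "s3", "s5"}),
]
counts = venn_counts(toy_proteomes)
for group in ["ABE", "BE", "AB", "AE", "A", "B", "E"]:
    print(f"{group:>3}: {counts.get(group, 0)}")
```

A structure counts as present in a domain of life if it occurs in at least one proteome of that domain. Running the same partition at each level of a structural hierarchy would give one Venn diagram per level, which is the kind of comparison summarized in Figure 12.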

7. Paraphyletic Origins: Grades and Clades in Archaeal History

Saying that “a ToL is rooted in a domain of life” is an incorrect statement that comes from phylogenetic methodology (the use of outgroups) and the tendency to look at the past with modern eyes. Clades in ToLs have been rooted relative to each other by generating unrooted trees and by defining extant organisms as outgroup taxa. The ToL, however, must be considered rooted in the urancestor of cellular life (Figure 1). This planted edge that connects to the ingroup of extant organisms represents a cellular state in which productive diversification (in the sense of successful lineages) was absent. The primordial urancestral edge leads to a “phase transition,” the last universal cellular ancestor (LUCA), of which little is known. The physiology of the urancestor cannot be considered linked to that of any extant organism, even if it shared a common molecular core with all of them. The urancestor was not an archaeon, a bacterium, or a eukaryote [148]. It was not necessarily thermophilic. Perhaps it was a communal entity or a megametaorganism in the sense of a modern syncytium (the result of multiple cell fusions) and a modern coenocyte (the result of multiple cell divisions). The organismal boundaries were likely present, judging by the number of widely distributed protein domains that associate with membranes and appear at the base of our ToDs and by the universal existence of acidocalcisome organelles [149]. However, the molecular makeup of the urancestral cells was most likely fluid and quasistatistical, the repertoire distributed unequally in the urancestral populations of communal parts, of course within confines delimited by persistence. This urancestral population is therefore consistent with the idea of a primordial stem line proposed by Kandler and Woese [150, 151]. However, it was relatively richer in molecular structures and functions, as opposed to the simple cellular systems hypothesized by Woese. This richness is confirmed by modern analyses of proteomes and functionomes that reveal a vast number of universally shared protein domains and GOs among the three superkingdoms. While each syncytial/coenocyte element of the megaorganism exchanged component parts in search of cellular stability and persistence, the process could not be equated with modern HGT. The exchanging community of primordial cells was not cohesive enough to make the horizontal exchange meaningful. Macromolecules most likely established loose and diverse associations with each other and with smaller molecules, limited by the short average life of their unevolved structural conformations. With time, molecules with better-optimized properties engaged in more durable interactions, stabilizing the emergent cells and providing increased cellular cohesiveness. This poised the urancestral community towards a phase transition (a crystallization [148]), a point in which cellular groups had distinct properties and could be individuated. We believe this was the time of the origin of the archaeal lineage, 2.9 billion years ago [48].

At the base of the ToLs that were reconstructed from genomic data, basal archaeal taxa arise as paraphyletic lineages (Figures 4(a) and 4(c)). These lineages likely arose from subgroups of the urancestral population that pervasively lost crucial domain structures and molecular functions. This represents an evolutionary grade under the scenario described above. The emerging lineages shared with the urancestral community a unifying condition that was related to archaic biochemistry. In other words, the urancestral and emerging archaeal lineages expressed fundamental structural and functional equivalences in terms of their repertoires but revealed in each emerging and durable paraphyletic lineage a handful of distinct, newly developed traits. These traits could be global, such as increased thermostability of some crucial members of the protein repertoire or change in the membrane makeup, or local, such as the selective loss of crucial structures and functions. Figure 13 uses the tree paradigm to portray the structural and functional equivalences of the urancestral and emerging archaeal lineages and the slow progression from grades to clades.

[Figure 12 image: three Venn diagrams of Archaea, Bacteria, and Eukarya, one each for CATH superfamilies, topologies, and architectures, arranged along an axis of increasing evolutionary conservation.]

Figure 12: Venn diagrams displaying the distributions of 2,221 homologous superfamilies, 1,152 topologies, and 38 architectures of CATH domains in the proteomes of 492 fully sequenced genomes (from [145]). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

[Figure 13 image: cartoon of lineages progressing from grades to clades between Time 0 and Time 1.]

Figure 13: From grades to clades. The cartoon describes a possible progression of modes of organismal diversification during the rise of primordial archaeal lineages. The width of emerging lineages is proportional to uniquely identifying features of physiological and molecular complexity.

8. Through the Wormhole: The Makeup of the “Megaorganism” and the Emerging Archaeal Lineages

Character state reconstructions of proteome repertoires derived from ToLs, coupled to the timelines of ToDs, provide an effective way to define the ancestral protein domain complement of the urancestor [48] and, consequently, the likely makeup of the emerging archaeal lineages. The urancestral proteome possessed a lower bound of ∼70 FSF domain structures, 75% of which were composed of α/β and α+β proteins. About 50% of FSFs were part of metabolic enzymes, including a rich toolkit of transferases and enzymes of nucleotide metabolism. The rest of the domains were involved in functions related to information (translation, replication, and repair), intracellular processes (transport, protein modification, and proteolytic activities), regulation (kinases/phosphatases and DNA binding functions), and small molecule binding. The urancestor had a limited repertoire of aaRSs and translation factors. It contained a primordial ribosome with a limited core of universal ribosomal proteins. It had numerous membrane proteins necessary for transport, including a relatively advanced ATP synthetase complex, and structures necessary for cellular organization (filaments and primordial cytoskeletal structures). The cells lacked enzymes for deoxyribonucleotide production, so it is likely that the cellular urancestor itself did not harbor a DNA genome. The cells lacked functions related to extracellular processes (cell adhesion, immune response, and toxins/defense) and cellular motility, suggesting an ancient living world without competitive strategies of survival.

9. Conclusions

The rooting of the ToL has always been controversial in evolutionary biology [26, 152, 153]. While it is popularly accepted that the ToL based on sequence phylogenies is rooted in the akaryotes and that Archaea and Eukarya are sister groups to each other, only two of the three main steps of phylogenetic analysis [104] have been partially fulfilled with sequences. These are the selection of an appropriate statistical or nonstatistical evolutionary model of character change and of an optimization method for phylogenetic tree reconstruction. However, no adequate method exists for character polarization that identifies ancestral and derived character states in sequences. In the absence of robust polarization methodology, any statement about the rooting of the ToL should be considered suspect or subject of apriorism. Here we show that information derived from a genomic structural and functional census of millions of encoded proteins and RNAs, coupled with process models that comply with Weston’s generality criterion, provides the means to dissect the origins of diversified life. The generality criterion is fulfilled in these studies by focusing on the accumulation of modules, such as protein domain structures, elements of RNA substructures, or ontological definitions of molecular function. In general, these features are the subject of accretion processes that comply with additive phylogenetic change within the nested taxonomic hierarchy and result in changes of abundance. These processes include those responsible for the growth of molecules (e.g., multidomain proteins), molecular ensembles (e.g., the ribosome), and molecular repertoires (e.g., proteomes). The new methods unfold a consistent evolutionary scenario in which the origin of diversified life traces back to the early history of Archaea. Remarkably, the archaic origin of this microbial urkingdom now does justice to its name.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgments

The authors would like to thank members and friends of the GCA laboratory for fruitful discussions. This research has been supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791) and the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625) to Gustavo Caetano-Anollés and from the KRIBB Research Initiative Program and the Next-Generation BioGreen 21 Program, Rural Development Administration (PJ0090192013), to Kyung Mo Kim.

References

[1] C. R. Woese, “A new biology for a new century,” Microbiology and Molecular Biology Reviews, vol. 68, no. 2, pp. 173–186, 2004.

[2] T. Grant, “Testing methods: the evaluation of discovery operations in evolutionary biology,” Cladistics, vol. 18, no. 1, pp. 94–111, 2002.

[3] E. O. Wiley, “Karl R. Popper, systematics, and classification: a reply to Walter Bock and other evolutionary taxonomists,” Systematic Zoology, vol. 24, no. 2, pp. 233–243, 1975.

[4] J. S. Farris, “Parsimony and explanatory power,” Cladistics, vol. 24, no. 5, pp. 825–847, 2008.

[5] G. Caetano-Anollés, K. M. Kim, and D. Caetano-Anollés, “The phylogenomic roots of modern biochemistry: origins of proteins, cofactors and protein biosynthesis,” Journal of Molecular Evolution, vol. 74, no. 1-2, pp. 1–34, 2012.

[6] W. Fontana, “Modeling ‘evo-devo’ with RNA,” BioEssays, vol. 24, no. 12, pp. 1164–1177, 2002.

[7] W. Hennig, Phylogenetic Systematics, University of Illinois Press, Urbana, Ill, USA, 1999.

[8] D. Fraix-Burnet, T. Chattopadhyay, A. K. Chattopadhyay, E. Davoust, and M. Thuillard, “A six-parameter space to describe galaxy diversification,” Astronomy and Astrophysics, vol. 545, article A80, 24 pages, 2012.

[9] E. Sober, “The contest between parsimony and likelihood,” Systematic Biology, vol. 53, no. 4, pp. 644–653, 2004.

[10] E. K. Lienau and R. DeSalle, “Is the microbial tree of life verificationist?” Cladistics, vol. 26, no. 2, pp. 195–201, 2010.

[11] J. S. Huxley, “Evolutionary processes and taxonomy with special reference to grades,” Uppsala Universitets Arsskrift, vol. 6, pp. 21–39, 1958.

[12] C. R. Woese and G. E. Fox, “Phylogenetic structure of the prokaryotic domain: the primary kingdoms,” Proceedings of the National Academy of Sciences of the United States of America, vol. 74, no. 11, pp. 5088–5090, 1977.

[13] P. Forterre, “The universal tree of life and the last universal cellular ancestor: revolution and counterrevolutions,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 43–62, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[14] J. P. Gogarten, H. Kibak, P. Dittrich et al., “Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 17, pp. 6661–6665, 1989.

[15] N. Iwabe, K. Kuma, M. Hasegawa, S. Osawa, and T. Miyata, “Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 23, pp. 9355–9359, 1989.

[16] J. P. Gogarten and L. Olendzenski, “Orthologs, paralogs and genome comparisons,” Current Opinion in Genetics and Development, vol. 9, no. 6, pp. 630–636, 1999.

[17] C. R. Woese, O. Kandler, and M. L. Wheelis, “Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya,” Proceedings of the National Academy of Sciences of the United States of America, vol. 87, no. 12, pp. 4576–4579, 1990.

[18] J. Martijn and T. J. Ettema, “From archaeon to eukaryote: the evolutionary dark ages of the eukaryotic cell,” Biochemical Society Transactions, vol. 41, no. 1, pp. 451–457, 2013.

[19] C. J. Cox, P. G. Foster, R. P. Hirt, S. R. Harris, and T. M. Embley, “The archaebacterial origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 51, pp. 20356–20361, 2008.

[20] P. G. Foster, C. J. Cox, and T. M. Embley, “The primary divisions of life: a phylogenomic approach employing composition-heterogeneous methods,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2197–2207, 2009.

[21] T. A. Williams, P. G. Foster, T. M. Nye, C. J. Cox, and T. M. Embley, “A congruent phylogenomic signal places eukaryotes within the Archaea,” Proceedings of the Royal Society B: Biological Sciences, vol. 279, no. 1749, pp. 4870–4879, 2012.

[22] S. Gribaldo, A. M. Poole, V. Daubin, P. Forterre, and C. Brochier-Armanet, “The origin of eukaryotes and their relationship with the Archaea: are we at a phylogenomic impasse?” Nature Reviews Microbiology, vol. 8, no. 10, pp. 743–752, 2010.

[23] H. Philippe and P. Forterre, “The rooting of the universal tree of life is not reliable,” Journal of Molecular Evolution, vol. 49, no. 4, pp. 509–523, 1999.

[24] E. Sober and M. Steel, “Testing the hypothesis of common ancestry,” Journal of Theoretical Biology, vol. 218, no. 4, pp. 395–408, 2002.

[25] P. Forterre, “The common ancestor of Archaea and Eukarya was not an archaeon,” Archaea, vol. 2013, Article ID 372396, 18 pages, 2013.

[26] P. Forterre and H. Philippe, “Where is the root of the universal tree of life?” BioEssays, vol. 21, no. 10, pp. 871–879, 1999.

[27] S. Gribaldo and H. Philippe, “Ancient phylogenetic relationships,” Theoretical Population Biology, vol. 61, no. 4, pp. 391–408, 2002.

[28] J. K. Harris, S. T. Kelley, G. B. Spiegelman, and N. R. Pace, “The genetic core of the universal ancestor,” Genome Research, vol. 13, no. 3, pp. 407–412, 2003.

[29] A. Harish and G. Caetano-Anollés, “Ribosomal history reveals origins of modern protein synthesis,” PLoS ONE, vol. 7, no. 3, Article ID e32776, 2012.

[30] G. Caetano-Anollés and M. J. Seufferheld, “The coevolutionary roots of biochemistry and cellular organization challenge the RNA world paradigm,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 152–177, 2013.

[31] N. R. Pace, “Mapping the tree of life: progress and prospects,” Microbiology and Molecular Biology Reviews, vol. 73, no. 4, pp. 565–576, 2009.

[32] P. de Rijk, Y. van de Peer, I. van den Broeck, and R. de Wachter, “Evolution according to large ribosomal subunit RNA,” Journal of Molecular Evolution, vol. 41, no. 3, pp. 366–375, 1995.

[33] G. Caetano-Anollés, “Tracing the evolution of RNA structure in ribosomes,” Nucleic Acids Research, vol. 30, no. 11, pp. 2575–2587, 2002.

[34] J. Mallatt and C. J. Winchell, “Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes,” Molecular Biology and Evolution, vol. 19, no. 3, pp. 289–301, 2002.

[35] M. Gerstein, “Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census,” Proteins: Structure, Function and Bioinformatics, vol. 33, no. 4, pp. 518–534, 1998.

[36] M. Gerstein and H. Hegyi, “Comparing genomes in terms of protein structure: surveys of a finite parts list,” FEMS Microbiology Reviews, vol. 22, no. 4, pp. 277–304, 1998.

[37] B. Snel, P. Bork, and M. A. Huynen, “Genome phylogeny based on gene content,” Nature Genetics, vol. 21, no. 1, pp. 108–110, 1999.

[38] F. Tekaia, A. Lazcano, and B. Dujon, “The genomic tree as revealed from whole proteome comparisons,” Genome Research, vol. 9, no. 6, pp. 550–557, 1999.

[39] Y. I. Wolf, S. E. Brenner, P. A. Bash, and E. V. Koonin, “Distribution of protein folds in the three superkingdoms of life,” Genome Research, vol. 9, no. 1, pp. 17–26, 1999.

[40] J. O. Korbel, B. Snel, M. A. Huynen, and P. Bork, “SHOT: a web server for the construction of genome phylogenies,” Trends in Genetics, vol. 18, no. 3, pp. 158–162, 2002.

[41] G. Caetano-Anollés and D. Caetano-Anollés, “An evolutionarily structural universe of protein architecture,” Genome Research, vol. 13, no. 7, pp. 1563–1571, 2003.

[42] S. L. Baldauf, A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle, “A kingdom-level phylogeny of eukaryotes based on combined protein data,” Science, vol. 290, no. 5493, pp. 972–977, 2000.

[43] J. R. Brown, C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope, “Universal trees based on large combined protein sequence data sets,” Nature Genetics, vol. 28, no. 3, pp. 281–285, 2001.

[44] F. D. Ciccarelli, T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P. Bork, “Toward automatic reconstruction of a highly resolved tree of life,” Science, vol. 311, no. 5765, pp. 1283–1287, 2006.

[45] T. Dagan, M. Roettger, D. Bryant, and W. Martin, “Genome networks root the tree of life between prokaryotic domains,” Genome Biology and Evolution, vol. 2, no. 1, pp. 379–392, 2010.

[46] S. Jun, G. E. Sims, G. A. Wu, and S. Kim, “Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 133–138, 2010.

[47] E. A. Schultes, P. T. Hraber, and T. H. LaBean, “No molecule is an island: molecular evolution and the study of sequence space,” in Algorithmic Bioprocesses, A. Condon, D. Harel, J. N. Kok, A. Salomaa, and E. Winfree, Eds., pp. 675–704, Springer, Berlin, Germany, 2009.

[48] K. M. Kim and G. Caetano-Anollés, “The proteomic complexity and rise of the primordial ancestor of diversified life,” BMC Evolutionary Biology, vol. 11, no. 1, article 140, 2011.

[49] V. A. Albert, Parsimony, Phylogeny, and Genomics, Oxford University Press, Oxford, UK, 2005.

[50] G. Caetano-Anollés, “Evolved RNA secondary structure and the rooting of the universal tree of life,” Journal of Molecular Evolution, vol. 54, no. 3, pp. 333–345, 2002.

[51] J. A. Lake, R. G. Skophammer, C. W. Herbold, and J. A. Servin, “Genome beginnings: rooting the tree of life,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2177–2185, 2009.

[52] B. Guo, M. Zou, and A. Wagner, “Pervasive indels and their evolutionary dynamics after the fish-specific genome duplication,” Molecular Biology and Evolution, vol. 29, no. 10, pp. 3005–3022, 2012.

[53] M. K. Basu, I. B. Rogozin, O. Deusch, T. Dagan, W. Martin, and E. V. Koonin, “Evolutionary dynamics of introns in plastid-derived genes in plants: saturation nearly reached but slow intron gain continues,” Molecular Biology and Evolution, vol. 25, no. 1, pp. 111–119, 2008.

[54] J. S. Farris, “Phylogenetic analysis under Dollo’s Law,” Systematic Biology, vol. 26, no. 1, pp. 77–88, 1977.

[55] H. Fang, M. E. Oates, R. B. Pethica et al., “A daily-updated tree of (sequenced) life as a reference for genome research,” Scientific Reports, vol. 3, article 2015, 2013.

[56] T. Cavalier-Smith, “The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification,” International Journal of Systematic and Evolutionary Microbiology, vol. 52, no. 1, pp. 7–76, 2002.

[57] T. Cavalier-Smith, “Rooting the tree of life by transition analyses,” Biology Direct, vol. 1, article 19, 2006.

[58] H. S. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Widespread recruitment of ancient domain structures in modern enzymes during metabolic evolution,” Journal of Integrative Bioinformatics, vol. 10, no. 1, article 214, 2013.

[59] G. Caetano-Anollés, M. Wang, and D. Caetano-Anollés, “Structural phylogenomics retrodicts the origin of the genetic code and uncovers the evolutionary impact of protein flexibility,” PLoS ONE, vol. 8, no. 8, Article ID e72225, 2013.

[60] E. Bapteste and C. Brochier, “On the conceptual difficulties in rooting the tree of life,” Trends in Microbiology, vol. 12, no. 1, pp. 9–13, 2004.

[61] A M Poole ldquoHorizontal gene transfer and the earliest stages ofthe evolution of liferdquo Research in Microbiology vol 160 no 7pp 473ndash480 2009

[62] D Raoult ldquoThe post-Darwinist rhizome of liferdquoTheLancet vol375 no 9709 pp 104ndash105 2010

[63] W F Doolittle ldquoPhylogenetic classification and the universaltreerdquo Science vol 284 no 5423 pp 2124ndash2128 1999

[64] T Dagan and W Martin ldquoThe tree of one percentrdquo GenomeBiology vol 7 no 10 article 118 2006

[65] W FDoolittle andE Bapteste ldquoPattern pluralism and the tree oflife hypothesisrdquo Proceedings of the National Academy of Sciencesof the United States of America vol 104 no 7 pp 2043ndash20492007

[66] E V Koonin ldquoTowards a postmodern synthesis of evolutionarybiologyrdquo Cell Cycle vol 8 no 6 pp 799ndash800 2009

[67] T Kloesges O Popa W Martin and T Dagan ldquoNetworks ofgene sharing among 329 proteobacterial genomes reveal differ-ences in lateral gene transfer frequency at different phylogeneticdepthsrdquoMolecular Biology andEvolution vol 28 no 2 pp 1057ndash1074 2011

[68] S Halary J W Leigh B Cheaib P Lopez and E BaptesteldquoNetwork analyses structure genetic diversity in independentgenetic worldsrdquo Proceedings of the National Academy of Sciencesof the United States of America vol 107 no 1 pp 127ndash132 2010

[69] S S Abby E Tannier M Gouy and V Daubin ldquoLateral genetransfer as a support for the tree of liferdquo Proceedings of theNational Academy of Sciences of the United States of Americavol 109 no 13 pp 4962ndash4967 2012

[70] N. Glansdorff, Y. Xu, and B. Labedan, “The conflict between horizontal gene transfer and the safeguard of identity: origin of meiotic sexuality,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 470–480, 2009.

[71] C. G. Kurland and O. G. Berg, “A hitchhiker’s guide to evolving networks,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 361–396, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[72] L. P. Villarreal and G. Witzany, “The DNA habitat and its RNA inhabitants: at the dawn of RNA sociology,” Genomics Insights, vol. 6, pp. 1–12, 2013.

[73] S. Gribaldo and C. Brochier, “Phylogeny of prokaryotes: does it exist and why should we care?” Research in Microbiology, vol. 160, no. 7, pp. 513–521, 2009.

[74] B. D. Mishler, “The logic of the data matrix in phylogenetic analysis,” in Parsimony, Phylogeny, and Genomics, V. A. Albert, Ed., pp. 57–69, Oxford University Press, New York, NY, USA, 2005.

[75] A. Dress, V. Moulton, M. Steel, and T. Wu, “Species, clusters and the “tree of life”: a graph-theoretic perspective,” Journal of Theoretical Biology, vol. 265, no. 4, pp. 535–542, 2010.

[76] N. Lane and W. Martin, “The energetics of genome complexity,” Nature, vol. 467, no. 7318, pp. 928–934, 2010.

[77] M. C. Rivera and J. A. Lake, “The ring of life provides evidence for a genome fusion origin of eukaryotes,” Nature, vol. 431, no. 7005, pp. 152–155, 2004.

[78] D. Alvarez-Ponce and J. O. McInerney, “The human genome retains relics of its prokaryotic ancestry: human genes of archaebacterial and eubacterial origin exhibit remarkable differences,” Genome Biology and Evolution, vol. 3, no. 1, pp. 782–790, 2011.

[79] A. Poole and D. Penny, “Eukaryote evolution: engulfed by speculation,” Nature, vol. 447, no. 7147, p. 913, 2007.

[80] E. V. Koonin and Y. I. Wolf, “The fundamental units, processes and patterns of evolution, and the tree of life conundrum,” Biology Direct, vol. 4, no. 1, article 33, 2009.

[81] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Search for a “tree of life” in the thicket of the phylogenetic forest,” Journal of Biology, vol. 8, no. 6, article 59, 2009.

[82] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Seeing the tree of life behind the phylogenetic forest,” BMC Biology, vol. 11, article 46, 2013.

[83] D. B. Wetlaufer, “Nucleation, rapid folding, and globular intrachain regions in proteins,” Proceedings of the National Academy of Sciences of the United States of America, vol. 70, no. 3, pp. 697–701, 1973.

[84] J. S. Richardson, “The anatomy and taxonomy of protein structure,” Advances in Protein Chemistry, vol. 34, pp. 167–339, 1981.

[85] M. Riley and B. Labedan, “Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module,” Journal of Molecular Biology, vol. 268, no. 5, pp. 857–868, 1997.

[86] J. Janin and S. J. Wodak, “Structural domains in proteins and their role in the dynamics of protein function,” Progress in Biophysics and Molecular Biology, vol. 42, pp. 21–78, 1983.

[87] G. Caetano-Anollés, M. Wang, D. Caetano-Anollés, and J. E. Mittenthal, “The origin, evolution and structure of the protein world,” Biochemical Journal, vol. 417, no. 3, pp. 621–637, 2009.

[88] J. Gough, “Convergent evolution of domain architectures (is rare),” Bioinformatics, vol. 21, no. 8, pp. 1464–1471, 2005.

[89] K. Forslund, A. Henricson, V. Hollich, and E. L. L. Sonnhammer, “Domain tree-based analysis of protein architecture evolution,” Molecular Biology and Evolution, vol. 25, no. 2, pp. 254–264, 2008.

[90] S. Yang and P. E. Bourne, “The evolutionary history of protein domains viewed by species phylogeny,” PLoS ONE, vol. 4, no. 12, Article ID e8378, 2009.

[91] K. M. Kim and G. Caetano-Anollés, “The evolutionary history of protein fold families and proteomes confirms that the archaeal ancestor is more ancient than the ancestors of other superkingdoms,” BMC Evolutionary Biology, vol. 12, article 13, 2012.

[92] M. Wang and G. Caetano-Anollés, “The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world,” Structure, vol. 17, no. 1, pp. 66–78, 2009.

[93] G. Caetano-Anollés and A. Nasir, “Benefits of using molecular structure and abundance in phylogenomic analysis,” Frontiers in Genetics, vol. 3, article 172, 2012.

[94] K. Brading and E. Castellani, Symmetries in Physics: Philosophical Reflections, Cambridge University Press, Cambridge, UK, 2003.

[95] J. A. Cotton and M. Wilkinson, “Quantifying the potential utility of phylogenetic characters,” Taxon, vol. 57, no. 1, pp. 131–136, 2008.

[96] E. Mossel and M. Steel, “A phase transition for a random cluster model on phylogenetic trees,” Mathematical Biosciences, vol. 187, no. 2, pp. 189–203, 2004.

[97] M. H. Bailor, X. Sun, and H. M. Al-Hashimi, “Topology links RNA secondary structure with global conformation, dynamics, and adaptation,” Science, vol. 327, no. 5962, pp. 202–206, 2010.

[98] K. Illergård, D. H. Ardell, and A. Elofsson, “Structure is three to ten times more conserved than sequence: a study of structural response in protein cores,” Proteins: Structure, Function and Bioinformatics, vol. 77, no. 3, pp. 499–508, 2009.

[99] G. Caetano-Anollés, S. K. Hee, and J. E. Mittenthal, “The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 22, pp. 9358–9363, 2007.

[100] P. H. Weston, “Methods for rooting cladistic trees,” in Models in Phylogeny Reconstruction, D. J. Siebert, R. W. Scotland, and D. M. Williams, Eds., pp. 125–155, Oxford University Press, Oxford, UK, 1994.

[101] H. N. Bryant, “The polarization of character transformations in phylogenetic systematics: role of axiomatic and auxiliary assumptions,” Systematic Biology, vol. 40, no. 4, pp. 433–445, 1991.

[102] H. N. Bryant and G. Wagner, “Character polarity and the rooting of cladograms,” in The Character Concept in Evolutionary Biology, G. P. Wagner, Ed., pp. 319–338, Academic Press, New York, NY, USA, 2001.

[103] G. Nelson, “Ontogeny, phylogeny, paleontology, and the biogenetic law,” Systematic Biology, vol. 27, no. 3, pp. 324–345, 1978.

[104] P. H. Weston, “Indirect and direct methods in systematics,” in Ontogeny and Systematics, C. J. Humphries, Ed., pp. 27–56, Columbia University Press, New York, NY, USA, 1988.

[105] F.-J. Sun and G. Caetano-Anollés, “The evolutionary history of the structure of 5S ribosomal RNA,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 430–443, 2009.

[106] X. Fang, T. Pan, and T. R. Sosnick, “Mg2+-dependent folding of a large ribozyme without kinetic traps,” Nature Structural Biology, vol. 6, no. 12, pp. 1091–1095, 1999.

[107] F.-J. Sun and G. Caetano-Anollés, “Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses,” PLoS Computational Biology, vol. 4, no. 3, Article ID e1000018, 2008.

[108] F.-J. Sun and G. Caetano-Anollés, “The origin and evolution of tRNA inferred from phylogenetic analysis of structure,” Journal of Molecular Evolution, vol. 66, no. 1, pp. 21–35, 2008.

[109] W. R. Ashby, An Introduction to Cybernetics, Taylor & Francis, London, UK, 1955.

[110] J. G. Lundberg, “Wagner networks and ancestors,” Systematic Biology, vol. 21, no. 4, pp. 398–413, 1972.

[111] P. T. Chippindale and J. J. Wiens, “Weighting, partitioning, and combining characters in phylogenetic analysis,” Systematic Biology, vol. 43, no. 2, pp. 278–287, 1994.

[112] J. Mittenthal, D. Caetano-Anollés, and G. Caetano-Anollés, “Biphasic patterns of diversification and the emergence of modules,” Frontiers in Genetics, vol. 3, article 147, 2012.

[113] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[114] C. A. Orengo, A. Michie, S. Jones, D. T. Jones, M. Swindells, and J. M. Thornton, “CATH: a hierarchic classification of protein domain structures,” Structure, vol. 5, no. 8, pp. 1093–1109, 1997.

[115] G. Sawa, J. Dicks, and I. N. Roberts, “Current approaches to whole genome phylogenetic analysis,” Briefings in Bioinformatics, vol. 4, no. 1, pp. 63–74, 2003.

[116] R. Sathyapriya, J. M. Duarte, H. Stehr, I. Filippis, and M. Lappe, “Defining an essence of structure determining residue contacts in proteins,” PLoS Computational Biology, vol. 5, no. 12, Article ID e1000584, 2009.

[117] D. Penny and L. J. Collins, “Evolutionary genomics leads the way,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 1–16, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[118] M. Ashburner, C. A. Ball, J. A. Blake, et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.

[119] A. Andreeva, D. Howorth, J. Chandonia, et al., “Data growth and its impact on the SCOP database: new developments,” Nucleic Acids Research, vol. 36, pp. D419–D425, 2008.

[120] L. S. Yafremava, M. Wielgos, S. Thomas, et al., “A general framework of persistence strategies for biological systems helps explain domains of life,” Frontiers in Genetics, vol. 4, article 16, 2013.

[121] J. L. Friar, T. Goldman, and J. Pérez-Mercader, “Genome sizes and the Benford distribution,” PLoS ONE, vol. 7, no. 5, Article ID e36624, 2012.

[122] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya,” BMC Evolutionary Biology, vol. 12, article 156, 2012.

[123] A. Nasir and G. Caetano-Anollés, “Comparative analysis of proteomes and functionomes provides insights into origins of cellular diversification,” Archaea, vol. 2013, Article ID 648746, 13 pages, 2013.

[124] W. Martin and M. Müller, “The hydrogen hypothesis for the first eukaryote,” Nature, vol. 392, no. 6671, pp. 37–41, 1998.

[125] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Global patterns of protein domain gain and loss in superkingdoms,” PLoS Computational Biology, vol. 10, no. 1, Article ID e1003452, 2014.

[126] M. Wang, L. S. Yafremava, D. Caetano-Anollés, J. E. Mittenthal, and G. Caetano-Anollés, “Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world,” Genome Research, vol. 17, no. 11, pp. 1572–1585, 2007.

[127] L. Guy and T. J. G. Ettema, “The archaeal “TACK” superphylum and the origin of eukaryotes,” Trends in Microbiology, vol. 19, no. 12, pp. 580–587, 2011.

[128] D. Schlieper, M. A. Oliva, J. M. Andreu, and J. Löwe, “Structure of bacterial tubulin BtubA/B: evidence for horizontal gene transfer,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 26, pp. 9170–9175, 2005.

[129] M. P. Hoeppner, P. P. Gardner, and A. M. Poole, “Comparative analysis of RNA families reveals distinct repertoires for each domain of life,” PLoS Computational Biology, vol. 8, no. 11, Article ID e1002752, 2012.

[130] H. Xue, K. Tong, C. Marck, H. Grosjean, and J. T. Wong, “Transfer RNA paralogs: evidence for genetic code-amino acid biosynthesis coevolution and an archaeal root of life,” Gene, vol. 310, no. 1-2, pp. 59–66, 2003.

[131] H. Xue, S. Ng, K. Tong, and J. T. Wong, “Congruence of evidence for a Methanopyrus-proximal root of life based on transfer RNA and aminoacyl-tRNA synthetase genes,” Gene, vol. 360, no. 2, pp. 120–130, 2005.

[132] M. Di Giulio, “Nanoarchaeum equitans is a living fossil,” Journal of Theoretical Biology, vol. 242, no. 1, pp. 257–260, 2006.

[133] M. Di Giulio, “The tree of life might be rooted in the branch leading to Nanoarchaeota,” Gene, vol. 401, no. 1-2, pp. 108–113, 2007.

[134] J. T. Wong, J. Chen, W. Mat, S. Ng, and H. Xue, “Polyphasic evidence delineating the root of life and roots of biological domains,” Gene, vol. 403, no. 1-2, pp. 39–52, 2007.

[135] F.-J. Sun and G. Caetano-Anollés, “The ancient history of the structure of ribonuclease P and the early origins of Archaea,” BMC Bioinformatics, vol. 11, article 153, 2010.

[136] C. Marck and H. Grosjean, “tRNomics: analysis of tRNA genes from 50 genomes of Eukarya, Archaea, and bacteria reveals anticodon-sparing strategies and domain-specific features,” RNA, vol. 8, no. 10, pp. 1189–1232, 2002.

[137] W. Mat, H. Xue, and J. T. Wong, “The genomics of LUCA,” Frontiers in Bioscience, vol. 13, no. 14, pp. 5605–5613, 2008.

[138] C. Brochier, S. Gribaldo, Y. Zivanovic, F. Confalonieri, and P. Forterre, “Nanoarchaea: representatives of a novel archaeal phylum or a fast-evolving euryarchaeal lineage related to Thermococcales?” Genome Biology, vol. 6, no. 5, article R42, 2005.

[139] B. Billoud, M. Guerrucci, M. Masselot, and J. S. Deutsch, “Cirripede phylogeny using a novel approach: molecular morphometrics,” Molecular Biology and Evolution, vol. 17, no. 10, pp. 1435–1445, 2000.

[140] L. J. Collins, V. Moulton, and D. Penny, “Use of RNA secondary structure for studying the evolution of RNase P and RNase MRP,” Journal of Molecular Evolution, vol. 51, no. 3, pp. 194–204, 2000.

[141] G. Caetano-Anollés, “Novel strategies to study the role of mutation and nucleic acid structure in evolution,” Plant Cell, Tissue and Organ Culture, vol. 67, no. 2, pp. 115–132, 2001.

[142] F.-J. Sun, S. Fleurdepine, C. Bousquet-Antonelli, G. Caetano-Anollés, and J. Deragon, “Common evolutionary trends for SINE RNA structures,” Trends in Genetics, vol. 23, no. 1, pp. 26–33, 2007.

[143] D. Caetano-Anollés, K. M. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Proteome evolution and the metabolic origins of translation and cellular life,” Journal of Molecular Evolution, vol. 72, no. 1, pp. 14–33, 2011.

[144] M. Wang, Y. Jiang, K. M. Kim, et al., “A universal molecular clock of protein folds and its power in tracing the early history of aerobic metabolism and planet oxygenation,” Molecular Biology and Evolution, vol. 28, no. 1, pp. 567–582, 2011.

[145] S. A. Bukhari and G. Caetano-Anollés, “Origin and evolution of protein fold designs inferred from phylogenomic analysis of CATH domain structures in proteomes,” PLoS Computational Biology, vol. 9, no. 3, Article ID e1003009, 2013.

[146] M. Vesteg and J. Krajčovič, “The falsifiability of the models for the origin of eukaryotes,” Current Genetics, vol. 57, no. 6, pp. 367–390, 2011.

[147] I. N. Berezovsky and E. I. Shakhnovich, “Physics and evolution of thermophilic adaptation,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 36, pp. 12742–12747, 2005.

[148] C. Woese, “The universal ancestor,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 12, pp. 6854–6859, 1998.

[149] M. J. Seufferheld and G. Caetano-Anollés, “Phylogenomics supports a cellularly structured urancestor,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 178–191, 2013.

[150] O. Kandler, “Cell wall biochemistry and three-domain concept of life,” Systematic and Applied Microbiology, vol. 16, no. 4, pp. 501–509, 1994.

[151] C. R. Woese, “On the evolution of cells,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 13, pp. 8742–8747, 2002.

[152] S. L. Baldauf, J. D. Palmer, and W. F. Doolittle, “The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny,” Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 15, pp. 7749–7754, 1996.

[153] H. Leffers, J. Kjems, L. Østergaard, N. Larsen, and R. A. Garrett, “Evolutionary relationships amongst archaebacteria. A comparative study of 23 S ribosomal RNAs of a sulphur-dependent extreme thermophile, an extreme halophile and a thermophilic methanogen,” Journal of Molecular Biology, vol. 195, no. 1, pp. 43–61, 1987.

[154] J. Martin, D. Blackburn, and E. Wiley, “Are node-based and stem-based clades equivalent? Insights from graph theory,” PLoS Currents, vol. 2, Article ID RRN1196, 2010.


by the establishment. Its resistance is still embodied today in new proposals of origins, such as the archaeon-bacterium fusion hypothesis used to explain the rise of Eukarya (see below). It is noteworthy that the root of the universal tree of cellular organisms, the “tree of life” (ToL), was initially not the driving issue. This changed when the sequences of proteins that had diverged by gene duplication prior to a putative universal common ancestor were analyzed with phylogenetic methods and the comparisons were used to root the ToL [14, 15]. Paralogous gene couples included elongation factors (e.g., EF-Tu and EF-G), ATPases (α and β subunits), signal recognition particle proteins, and carbamoyl phosphate synthetases, all believed to be very ancient (reviewed in [16]). In many cases, bacterial sequences were the first to branch (appeared at the base) in the reconstructed trees, forcing archaeal and eukaryal sequences to be sister groups to each other. This “canonical” rooting scheme of the ToL (Figure 1(a)) was accepted as fact and was quickly endorsed by the supporters of the Urbana School [17]. In fact, the acceptance of the “canonical” rooting in Bacteria became so deep that it has now prompted the search for the origins of Eukarya in the molecular and physiological constitutions of the putative archaeal sister group [18]. For example, Embley and coworkers generated sequence-based phylogenies using conserved proteins and advanced algorithms to show that Eukarya emerged from within Archaea [19–21] (refer to [22] for critical analysis). Importantly, these analyses suffer from technical and logical problems that are inherent in sequence-based tree reconstructions. For example, proteins such as elongation factors, tRNA synthetases, and other universal proteins used in their analyses are prone to high substitution rates [23]. Mathematically, high substitution rates lead to loss of information regarding the root of the ToL, as shown by Sober and Steel [24] (refer to [25] for more discussion). On the other hand, paralogous rootings sometimes contradicted each other and were soon and rightly considered weak and unreliable [23, 26, 27]. The validity of paralogy-based rooting methodology has proven to be severely compromised by a number of problems and artifacts of sequence analysis (e.g., long branch attraction, mutational saturation, taxon sampling, horizontal gene transfer (HGT), hidden paralogy, and historical segmental gene heterogeneity). Consequently, there is no proper outgroup that can be used to root a ToL that is built from molecular sequences, and currently there are no proper models of sequence evolution that can provide a reliable “evolutionary arrow.” Because of this fact, archaeal and eukaryal rootings should be considered as probable as the canonical bacterial rooting (Figure 1(c)). This is an important realization that needs to be explicitly highlighted, especially because it affects evolutionary interpretations and the likelihood of scenarios of origins of diversified life.
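
The three alternatives can be made concrete with a small illustration. The following Python sketch, which is not part of any of the cited analyses, enumerates the rooted topologies obtained by planting the root on each terminal branch of an unrooted three-domain tree and assigns them the uniform prior weight implied by the absence of a reliable evolutionary arrow. The function name `rootings` and the Newick strings are assumptions made for this example.

```python
from fractions import Fraction


def rootings(taxa):
    """Map each possible basal taxon to the rooted Newick topology it implies.

    Hypothetical helper: assumes an unrooted tree with exactly three
    terminal branches, so the root can only be planted on one of them.
    """
    if len(taxa) != 3:
        raise ValueError("this sketch handles exactly three taxa")
    trees = {}
    for basal in taxa:
        sisters = sorted(t for t in taxa if t != basal)
        trees[basal] = f"({basal},({sisters[0]},{sisters[1]}));"
    return trees


domains = ["Archaea", "Bacteria", "Eukarya"]
trees = rootings(domains)

# With no polarizing evidence, each rooting receives the same prior weight.
prior = {basal: Fraction(1, len(trees)) for basal in trees}

for basal, newick in trees.items():
    print(f"root on the {basal} branch: {newick} (prior {prior[basal]})")
```

Rooting on the Bacteria branch reproduces the canonical scheme, in which Archaea and Eukarya are sister groups; the other two entries correspond to the archaeal and eukaryal rootings.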

3. Mining Ancient Phylogenetic Signal in Universal Molecules

Woese’s crucial insight was the explicit selection of the ribosome for evolutionary studies. The universality of the ribosomal ensemble and its central role in protein synthesis ensured it carried an ancient and overriding memory of the cellular systems that were studied. This was made evident in the first ToL reconstructions. In contrast, many of the proteins encoded by paralogous gene couples (e.g., translation factors) likely carried convoluted histories of protein domain cooption or important phylogenetic biases induced, for example, by mutational saturation in their protein sequences. The ribosome is indeed the central feature of cellular life, the signature of “ribocells.” However, its embedded phylogenetic signatures are also convoluted. The constitution of the ribosome is heterogeneous. The ribosome represents an ensemble of 3–4 ribosomal RNA (rRNA) and ~70 protein (r-protein) molecules, depending on the species considered, and embodies multiple interactions with the cellular milieu needed for function (e.g., assembly and disassembly, interactions with the membrane of the endoplasmic reticulum). A group of 34 r-proteins is present across cellular life [28]. Ribosomal history has been shown to involve complex patterns of protein-RNA coevolution within the evolutionarily conserved core [29]. These patterns are expressed distinctly in its major constituents. While both of its major subunits evolved in parallel, a primordial core that embodied both processive and catalytic functions was established quite early in evolution. This primordial core was later accessorized with structural elements (e.g., accretion of numerous rRNA helical segments and stabilizing A-minor interactions) and r-proteins (e.g., the L7/L12 protein complex that stimulates the GTPase activity of EF-G) that enhanced its functional properties. This included expansion elements in the structure of rRNA that were specific not only to subunits but also to individual domains of life. Figure 2, for example, shows a phylogenomic model of ribosomal molecular accretion derived from the survey of protein domain structures in genomes and substructures in rRNA molecules. The accretion of component parts of the universal core appears to have been a painstakingly slow process that unfolded during a period of ~2 billion years and overlapped with the first episodes of organismal diversification [30]. Despite this complexity, the focus of biodiversity studies was for decades the small subunit rRNA molecule [31]. This focus has not changed much in recent years. Consequently, the history of organisms and populations is currently recounted by the information seated in the small subunit rRNA molecule. In other words, the historical narrative generally comes from only ~1% of ribosomal molecular constitution. This important and unacknowledged bias was already made explicit in early phylogenetic studies. For example, de Rijk et al. [32] demonstrated that phylogenies reconstructed from the small and large subunit rRNA molecules were different and that the reconstructions from the large subunit were more robust and better suited to establish wide-range relationships. The structure of the small and large subunits was also shown to carry distinct phylogenetic signatures [33]. However, only a few evolutionary studies have combined small and large subunit rRNA for history reconstruction. Remarkably, in all of these cases, phylogenetic signal was significantly improved (e.g., [34]).

[Figure 1 (tree diagrams): (a) node-based trees relating Eukarya (E), Archaea (A), and Bacteria (B), with ancestor a and urancestor U; (b) stem-based tree marking crown and stem groups, the origin of lineages, and the direction of time; (c) canonical, eukaryal, and archaeal rootings.]

Figure 1: Rooting the tree of life (ToL): an exercise for “tree-thinkers.” (a) A node-based tree representation of the ToL focuses on taxa (sampled or inferred vertices illustrated with open circles) and models edges as ancestry relationships. The unrooted tree describes taxa as conglomerates of extant species for each domain of life, Archaea (A), Bacteria (B), and Eukarya (E) (abbreviations are used throughout the paper). Adding a universal ancestor vertex (rooting the ToL) implies adding a reconstructed entity (urancestor) that roots the tree but does not model ancestry relationships. The vertex is either a “most recent universal ancestor,” if one defines it from an ingroup perspective, or a “last universal common ancestor” (LUCA), if the definition relates to an outgroup perspective. Rooting the tree enables defining “total lineages,” which are lists of ancestors spanning from the ancestral taxon to the domain taxa. (b) A stem-based view focuses on edges (branches), which represent sampled and inferred ancestral taxa viewed as lineages under the paradigm of descent with modification. Vertices correspond to speciation events. Terminal edges represent conglomerates of lineages leading to domains of life (the terminal nodes of node-based trees), and the ancestral stem represents the lineage of the urancestor (U) (double arrowhead line). Total lineages are simply a chain of edges that goes back in time and ends in the ancestral stem. Node-based and stem-based tree representations are mathematically isomorphic, but they are not equal [154]. They change the concept of monophyletic and paraphyletic relationships. A node-based clade starts with a lineage at the instant of the splitting event, incorporating the ancestor into the makeup of the clade. In contrast, a stem-based clade originates with a planted branch on the tree, where the branch represents a lineage between two lineage splitting events. Planting an ancestral stem defines an origin of lineages and the first speciation event in the record of life. This delimits a crown clade of two domain lineages in the stem-based tree (labeled with black lines) that includes the ancestor of the sister groups, a, and a stem domain group at its base. (c) The three possible rootings of the ToL depicted with stem-based tree representations. Terminal edges are labeled with thin lines, and these conglomerate lineages can include stem and crown groups.

4. Building a Tower of Babel from a Comparative Genomic Patchwork of Sequence Homologies

Molecular evolutionists were cognizant of the limitation of looking at the history of only a few component parts, which by definition could be divergent. When genomic sequences became widely available, pioneers jumped onto the bandwagon of evolutionary genomics and the possibility of gaining systemic knowledge from entire repertoires of genes and molecules (e.g., [35–39]). The genomic revolution, for example, quickly materialized in gene content trees that reconstructed the evolution of genomes directly from their evolutionary units, the genes (e.g., [37, 39, 40]) or the domain constituents of the translated proteins [35, 41]. The sequences of multiple genes were also combined or concatenated in attempts to extract deep phylogenetic signal [42–44]. The results that were obtained consistently supported the tripartite world, backing up the claims of the Urbana School. However, clues about ancestors and lineages leading to extant taxa were still missing.
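
The logic of a gene content comparison, and the reason it does not by itself yield a root, can be sketched as follows. This minimal Python example assumes three hypothetical genomes described as sets of gene family labels (all names are invented) and uses the Jaccard distance, which is one of several possible measures and not necessarily the one used in the cited studies. Because the distance is symmetric, it groups genomes but carries no direction of change.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets of gene family labels."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(genomes):
    """Pairwise gene content distances, keyed by sorted genome-name pairs."""
    return {
        (x, y): jaccard_distance(genomes[x], genomes[y])
        for x, y in combinations(sorted(genomes), 2)
    }


# Hypothetical presence/absence data: each genome is a set of gene families.
genomes = {
    "genome_1": {"fam1", "fam2", "fam3", "fam4"},
    "genome_2": {"fam1", "fam2", "fam3", "fam5"},
    "genome_3": {"fam1", "fam6", "fam7"},
}

matrix = distance_matrix(genomes)
for (x, y), d in matrix.items():
    print(f"{x} vs {y}: {d:.3f}")
```

A tree built from such a matrix groups genome_1 with genome_2, but deciding which of the three lineages is basal requires information that the matrix does not contain.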

[Figure 2 (timeline diagram): ribosomal history, from the primordial ribosome through the mature PTC, the processive ribosome, and the mature ribosome to the present-day ribosome, aligned with the history of life (time, billions of years) from the origin of proteins through the urancestor to the present; the ribosomal core and diversified life are marked, and components are colored by molecular age (nd).]

Figure 2: The evolutionary history of the ribosome was traced onto the three-dimensional structure of its rRNA and r-protein components. Ages of components were colored with hues from red (ancient) to blue (recent). Phylogenomic history shows that primordial metabolic enzymes preceded RNA-protein interactions and the ribosome. The timeline of life, derived from a universal phylogenetic tree of protein domain structure at fold superfamily level of complexity, is shown with time flowing from left to right and expressed in billions of years according to a molecular clock of fold structures. The ribosomal timeline highlights major historical events of the molecular ensemble. History of the conserved ribosomal core [29] overlaps with that of diversified life [48], suggesting that episodes of cooption and lateral transfer have pervaded early ribosomal evolution.

With few exceptions that focused on the structure of RNA and protein molecules (see below), analyses based on genomic sequences, gene content, gene order, and other genomic characteristics were unable to produce rooted trees without the help of outgroups, additional ad hoc hypotheses that are external to the group of organisms being studied and generally carry strong assumptions. An “arrow of time” was not included in the models of genomic evolution that were used. As with sequences, the ToLs that were generated were unrooted (Figure 1(a)) and generally rooted a posteriori, either claiming that the canonical root was correct or making assumptions about character change that may be strictly incorrect. In a recent example, distance-based approaches were used to build universal network trees from gene families defined by reciprocal best BLAST hits [45]. These networks showed a midpoint rooting of the ToL between Bacteria and Archaea. However, this rooting involves a complex optimization of path lengths in the split networks and critically assumes that lineages evolved at roughly similar rates. This diminishes confidence in the midpoint rooting, especially when considering the uncertainties of distances inferred from BLAST analyses and the fact that domains in genes hold different histories and rates of change. Another promising but ill-conceptualized case is the rooting of distance-based trees inferred by studying the frequency of l-mer sets of amino acids in proteins at the proteome level [46]. The compositional data generated a ToL that was rooted in Eukarya when randomized proteome sequences were used as outgroups. The assumption of randomness associated with the root of a ToL is, however, unsupported or probably wrong, especially because protein sequence space and its mappings to structure are far from random [47]. More importantly, a large fraction of modern proteins had already evolved protein fold structures when the first diversified lineages arose from the universal common ancestor of cellular life [48]. These evolved repertoires cannot be claimed to be random.
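
The dependence of midpoint rooting on rate uniformity can be illustrated with a toy calculation. The sketch below is a plain midpoint rooting on an ordinary tree, not the split-network optimization of [45]; the branch lengths and the internal node name `x` are invented for the example. The root is placed halfway along the longest tip-to-tip path, so lengthening a single branch is enough to move it.

```python
def distances_from(adj, start):
    """Return {node: (distance, path)} for every node reachable from start."""
    seen = {start: (0.0, [start])}
    stack = [start]
    while stack:
        node = stack.pop()
        dist, path = seen[node]
        for nxt, length in adj[node].items():
            if nxt not in seen:
                seen[nxt] = (dist + length, path + [nxt])
                stack.append(nxt)
    return seen


def midpoint_root(edges):
    """Locate the midpoint of the longest tip-to-tip path of an unrooted tree.

    edges: iterable of (node, node, branch_length) tuples.
    Returns (u, v, offset): the root lies on edge u-v, offset away from u.
    """
    adj = {}
    for u, v, length in edges:
        adj.setdefault(u, {})[v] = length
        adj.setdefault(v, {})[u] = length
    tips = [n for n in adj if len(adj[n]) == 1]
    longest, best_path = -1.0, None
    for tip in tips:
        for other, (dist, path) in distances_from(adj, tip).items():
            if other in tips and dist > longest:
                longest, best_path = dist, path
    walked = 0.0
    for u, v in zip(best_path, best_path[1:]):
        if walked + adj[u][v] >= longest / 2:
            return u, v, longest / 2 - walked
        walked += adj[u][v]


# Hypothetical branch lengths; "x" is the single internal node.
clocklike = [("x", "A", 1.0), ("x", "E", 1.0), ("x", "B", 2.0)]
fast_eukarya = [("x", "A", 1.0), ("x", "E", 4.0), ("x", "B", 2.0)]

print(midpoint_root(clocklike))     # root falls on the branch leading to B
print(midpoint_root(fast_eukarya))  # root moves to the branch leading to E
```

In this toy case the topology is identical in both runs; only the rate along one lineage differs, which is why a midpoint root is informative only when rates are roughly similar across lineages.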

The inability of genomics to provide clear answers to rooting questions promoted (to some extent) parsimony thinking and the exploration of evolutionary differences of significance that could act as “anchors” and could impart “polarity” to tree statements [49]. As we will describe below, we applied parsimony thinking to the evolution of protein and RNA structures [41, 50], and for over a decade we have been generating rooted ToLs that portray the evolution of proteomes with increasing explanatory power. This work, however, has been for the most part unacknowledged. Instead, molecular evolutionists focused almost exclusively on molecular sequences in their search for solutions to the ToL problem. For example, analysis of genomic insertions and deletions (indels) that are rare in paralogous gene sets rooted the ToL between the group of Actinobacteria and Gram-negative bacteria and the group of Firmicutes and Archaea [51]. Unfortunately, little is known about the dynamics of indel generation, its biological role in gene duplication, and its influence on the structure of paralogous protein pairs (e.g., [52]). If the dynamics are close to those of sequences, indels should be considered subject to all the same limitations of sequence analyses, including long branch attraction artifacts, character independence, and inapplicable characters. Given a tree topology, parsimony was also used to optimize the evolutionary transformation of genomic features. For example, when considering the occurrence of protein domains that are abundant, variants of domain structures accumulate gradually in genomes; operationally, the domain structure per se cannot be lost once it has been gained. This enabled the use of a variant of the unrooted Dollo algorithmic method (e.g., [53]); the Dollo parsimony model [54] is based on the assumption that when a biological feature that is very complex is lost in evolution it cannot be regained through vertical descent. The method, however, could not be applied to akaryotic genomes, as these are subject to extensive lateral gene transfer, and until recently [55] was not extended to ToL reconstructions.
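The Dollo assumption described above (a complex feature arises once and, once lost, is not regained) can be sketched for a single presence/absence character. This is a minimal illustration on a rooted toy tree, not the unrooted variant used in the cited work; the nested-tuple tree format, the taxon names, and the function `dollo_events` are assumptions made for the example.

```python
def leaves(tree):
    """Return the set of leaf names of a tree written as nested tuples."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def dollo_events(tree, present):
    """Count (gains, losses) of one binary character under Dollo parsimony.

    The feature is gained once, on the branch leading to the most recent
    common ancestor (MRCA) of all taxa that carry it; every maximal subtree
    inside that clade that lacks the feature counts as one loss.
    """
    present = set(present) & leaves(tree)
    if not present:
        return 0, 0
    # Descend while a single child holds every taxon that carries the feature.
    node = tree
    while not isinstance(node, str):
        holders = [child for child in node if leaves(child) & present]
        if len(holders) != 1:
            break
        node = holders[0]

    def losses(sub):
        if not (leaves(sub) & present):
            return 1          # whole subtree lacks the feature: one loss
        if isinstance(sub, str):
            return 0          # leaf that carries the feature
        return sum(losses(child) for child in sub)

    return 1, losses(node)

# Hypothetical five-taxon tree.
toy_tree = ((("A", "B"), "C"), ("D", "E"))
print(dollo_events(toy_tree, {"A", "C"}))  # one gain, loss in B
print(dollo_events(toy_tree, {"A", "D"}))  # one gain at the root, three losses
```

The second call shows why the model is costly when a feature is patchily distributed: a pattern that unconstrained parsimony could explain with two gains requires one gain and three losses under Dollo, which is one reason lateral transfer makes the method hard to apply.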

Parsimony thinking was also invoked in “transition analysis,” a method that attempts to establish polarity of character change by, for example, examining homologies of proteins in proteolytic complexes such as the proteasome, membrane and cell envelope biochemical makeup, and body structures of flagella, sometimes aided by BLAST queries [56, 57]. The elaboration, however, is restricted to the few molecular and cellular structures that are analyzed, out of thousands that populate the akaryotic and eukaryotic cells. The approach is local, has not yet made use of an objective analytic phylogenetic framework, and does not weigh gains, losses, and transfers with algorithmic implementations. Thus, many statements can fall prey to incorrect optimizations or processes of convergence of structure and function, including cooptions that are common in metabolic enzymes [58]. Transitions also fail to consider the molecular makeup of the complexes examined (e.g., evolution of the photosynthetic reaction centers), which often hold domains with heterogeneous histories in the different organismal groups. For example, the F1Fo ATP synthetase complex that powers cellular processes has a long history of accretion of domains that spans almost 3.8 billion years, which is even older than that of the ribosome [59]. Finally, establishing the validity of evolutionary transitions in polarization schemes can be highly problematic; each transition that is studied requires well-grounded assumptions [60], some of which have not yet been properly elaborated.

The inability to solve the rooting problem and the insistence on extracting deep phylogenetic signal from molecular sequences that are prone to mutational saturation raised skepticism about the possibility of ever finding the root of the ToL. Bapteste and Brochier [60] made explicit the conceptual difficulties, claiming scientists in the field had adopted Agrippa’s logic of doubt. Misunderstandings on how to conduct ideographic inquiry in evolutionary genomics had effectively blocked the reerection of a Tower of Babel that would explain the diversification of genomes (Figure 3). Instead, there was “Confusion of Doubts.” Unfortunately, the impasse was aggravated by aprioristic tendencies inherited from systematic biology [3, 10], disagreements about the evolutionary role of vertical and lateral inheritance and the problem of homology [61], and, currently, disagreements about the actual role of Darwinian evolution in speciation and the ToL problem [13, 62].

During the decade of evolutionary genomic discovery, the effects of HGT on phylogeny [63] took center stage. HGT was invoked as an overriding process. However, little attention was paid to alternative explanations such as differential loss of gene variants, ancient or derived, and there was little concern for other sources of reticulation. HGT is certainly an important evolutionary process that complicates the “tree” concept of phylogenetic analysis and must be carefully studied (e.g., [64–66]). In some bacterial taxonomic groups, such as the proteobacteria, HGT was found to be pervasive and challenged the definition of species [67]. Cases like these prompted the radical suggestion that the ToL should be abandoned and that a “web of life” should be used to describe the diversification of microbes and multicellular organisms [68]. However, the problem has two aspects that must be separately considered.

(i) Mechanics. The widespread and impactful nature of HGT must be addressed. Does its existence truly compromise the validity of phylogenetic tree statements? While HGT seems important for some bacterial lineages [69], its evolutionary impacts in Archaea and Eukarya are not as extensive (e.g., [61, 70]) and its global role can be contested [71]. In turn, the role of viruses and RNA agents in genetic exchange continues to be understudied (e.g., [72]) and could be a crucial source of reticulation. Furthermore, the relationship between differential gene loss and HGT has not been adequately formalized, especially for genes that are very ancient. Establishing patterns of loss requires establishing polarity of change and rooted trees, which, as we have discussed earlier, remains unattained for ToLs reconstructed from molecular sequences.

Figure 3: The “Confusion of Tongues,” an engraving by Gustave Doré (1832–1883), was modified by D. Caetano-Anollés to portray the event of diversification that halted the construction of the Tower of Babel. In linguistics, the biblical story inspired tree thinking. We take the metaphor as the fall of the urancestor of cellular life and the replacement of “scala naturae” by branching processes of complexity.

(ii) Interpretation. HGT generally materializes as a mismatch between histories of genes in genomes and histories of organisms. Since genomes are by functional definition collections of genes, reticulations affect the history of the genes and not of the organisms, which given evolutionary assumptions result from hierarchical relationships (e.g., [73]). Thus, and at first glance, reticulation processes (HGT, gene recombination, gene duplications, and gene loss) do not obliterate vertical phylogenetic signal. The problem, however, remains complex. A ToL can be considered an ensemble of lineages nested within each other [74]. Protein domain lineages will be nested in gene lineages, gene lineages will be nested in lineages of gene families, gene families will be nested in organismal lineages, organisms will be nested in lineages of higher organismal groupings (populations, communities, ecosystems, and biomes), and so on. All lineage levels are defined by biological complexity and the hierarchical organization of life and follow a fractal pattern distribution, each level contributing vertical and lateral phylogenetic signal to the whole. However, sublineages in one hierarchical level may hold different histories compared to the histories of higher and lower levels. Historical mismatches introduce, for example, lineage sorting and reticulation problems that represent a complication to phylogenetic analysis. For example, mismatches between gene and organismal phylogenies exist in the presence of differential sorting of genomic components due to, for example, species extinctions or genomic rearrangements. These mismatches violate the fundamental cladistic assumption that history follows a branching process but can be explained by homoplasy (convergent evolution), the horizontal “trace” of the homology-homoplasy yin-yang of phylogenetic signal. The problem of homoplasy cannot be solved by conceptually preferring one particular level of the hierarchy [74]. A focus on the organismal level, for example, will not solve the problems of hybridization that are brought by strict or relaxed sexual reproduction or viral-mediated genetic exchanges or the problems of ancient events of fusions or endosymbiosis. However, homoplasy and the optimization of characters and trees of cladistic analysis provide a rigorous framework to discover the magnitude and source of nonvertical processes in evolution. Thus, cladistics or the ToL are not invalidated by network-like signals. In fact, recent discrete mathematical formulations that test the fundamental axioms of tree construction have also proven that the bifurcating history of trees is preserved despite evolutionary reticulations [75]. While there is no “confusion of doubts” at this level, the babel of confusion has not stopped.

Genomics revealed evolutionary patchiness, which incited hypotheses of chimeras and fusions. Phylogenies of genes were found to be highly discordant. Organisms that were being sequenced shared very few gene sequences and, more troublingly, gene trees that were generated possessed different topologies, especially within akaryotic organisms. With time, the number of nearly universal genes decreased and the patchiness and discordancy increased. The comparative genomic patchwork of sequence homologies showed that there were groups of genes that were only shared by certain domains of life. Thus, the makeup of genomes appeared chimeric. A number of hypotheses of chimeric origins of eukaryotes were proposed based on the fact that eukaryotes shared genes expressing sister group phylogenetic relationships with both Bacteria and Archaea [64, 76]. One that is notable is the rise of eukaryotes from a “ring of life” fusion of an archaeon and a bacterium (e.g., [77]), which in a single blow defeated both the tree-like and the tripartite nature of life. Under this school, homology searches with BLAST against a database of ∼3.8 million akaryotic sequences made it possible to assign archaeal, bacterial, or ambiguous ancestries to genes in the human genome and explain homology patterns as relics of the akaryotic ancestors of humans [78]. Remarkably, archaeal genes tended to be involved in informational processes, encoded shorter and more central proteins, and were less likely to be involved in heritable human diseases. The chimeric origin of eukaryotes by fusions has been rightfully contested; there is no proper evidence supporting its existence [79]. Unfortunately, chimerism opened a flood of speculations about reticulation. The history of life was seen by many through the lens of a “forest” of gene histories [80]. Reticulations and, ultimately, HGT were forced to explain chimeric patterns. Fusion hypotheses diverted attention from the central issue of the rooting of the ToL and prompted the separate analysis of eukaryotic origins and akaryotic evolution.

Dagan and Martin [64] argued that the ToL was a “tree of one percent” because only a small fraction of sequences could be considered universal and could be mined for deep phylogenetic signal. The rest would account for lateral processes that confounded vertical descent. The claim came fundamentally from networks constructed using BLAST heuristic searches for short and strong matches in genomic sequences. Phylogenetic “forests” of akaryotic genes later on boosted the claim [81]. These networks were built from 6901 trees of genes using maximum likelihood methods. Only 102 of these trees were derived from nearly universal genes and contributed very little vertical phylogenetic signal. The initial conclusion was striking: “the original tree of life concept is obsolete: it would not even be a tree of one percent” [81]. However, the fact that vertical signal was present in the forests merited reevaluation: “replacement of the ToL with a network graph would change our entire perception of the process of evolution and invalidate all evolutionary reconstruction” [82]. We note that in these studies an ultrametric tree of akaryotes was recovered from vertical signal in a supertree of nearly universal genes. The rooted tree, which was used to simulate clock-like behavior, revealed the early divergence of Nanoarchaeum equitans and then Archaea [81]. While inconsistency of the forest supertree increased at high phylogenetic depths, its associated supernetwork showed there were no reticulations between Archaea and Bacteria. Thus, the deep rooting signal of archaeal diversification may be bona fide and worthy of further study.

While those that value tree thinking have contested on many grounds the idea that the ToL is “obsolete” (e.g., [13]), our take in the debate is simple. Homologies between gene sequences established through the “emperor’s BLAST” (sensu [71]) are poor substitutes for phylogenetic tree reconstructions from gene sequences. By the same token, phylogenies reconstructed from sequences are poor substitutes for phylogenies that consider the molecular structure and function of the encoded products. Gene sequences are not only prone to mutational saturation, but they generally come in pieces. These pieces represent evolutionary and structural modules that host the functions of the encoded molecules. Protein domains, for example, are three-dimensional (3D) arrangements of elements of secondary structure that fold autonomously [83], are compact [84], and are evolutionarily conserved [85, 86]. The landscape of evolutionary exploration changes when the role of protein domains in function and evolution is considered [87]. Changes at sequence level, including substitutions, insertions, and deletions, can have little impact on structure and, vice versa, sequence changes in crucial sites can have devastating consequences on function and fitness. However, there is no detailed analysis of historical mismatches between gene sequences and domain structures at the ToL level. Remarkably, while HGT seems rampant at sequence level, its impact at the domain structural level is limited [88]. This makes trees derived from domains effectively “trees of 99 percent” and their use very powerful. The fact that domains diversify mostly by vertical descent (e.g., [89–91]) suggests that gene reticulations simply reflect the pervasive and impactful combinatorial effect of domain rearrangements in proteins [92] and perhaps little else. This important claim must be carefully evaluated. It offers the possibility of dissecting how levels of organization impact processes of inheritance in biology.

5. Out of the Impasse: Parts and Wholes in the Evolving Structure of Systems

The rooting of the ToL is clearly muddled by the highly dynamic nature of change in protein and nucleic acid sequences and by the patchiness and reticulation complexities that exist at gene level. However, it is also possible that the problems of the ToL are ultimately technical. We have made the case that the use of molecular sequences is problematic on many grounds, including mutational saturation, definition of homology of sites in sequence alignments, inapplicable characters, taxon sampling and tree imbalance, and different historical signatures in domains of multidomain proteins [93]. In particular, violation of character independence by the mere existence of atomic structure represents a very serious problem that plagues phylogenetic analysis of sequences. We here present a solution to the impasse. We show that the ToL can be rooted with different approaches that focus on structure and function and that its root is congruently placed in Archaea.

Epistemologically, phylogenetic characters must comply with symmetry breaking and the irreversibility of time [94]. In other words, characters must establish transformational homology relationships and serve as independent evidential statements.

(i) Characters Must Be Homologous and Heritable across Tree Terminal Units (Taxa). Character homology is a central and controversial concept that embodies the existence of historical continuity of information [74]. Characters are “basic” evidential statements that make up columns in data matrices used for tree reconstruction. They are conjectures of perceived similarities that are accepted as fact for the duration of the study, are strengthened by Hennigian reciprocal illumination, and can be put to the test through congruence with other characters as these are fit to the trees. To be useful, characters must be heritable and informative across the taxa rows of the data matrix. This can be evaluated, for example, with the cladistics information content (CIC) measure [95]. Finding informative characters can be particularly challenging when the features that are studied change at fast pace and when taxa sample a wide and consequently deep phylogenetic spectrum. When building a ToL, the highly dynamic nature of change in the sequence makeup of protein or nucleic acid molecules challenges the ability to retrieve reliable phylogenetic signatures across taxa, even if molecules are universally distributed and harbor evolutionarily conserved regions with deep phylogenetic signal (e.g., rRNA). The reason is that, given enough time, functionally or structurally constrained regions of the sequence will be fixed (will be structurally canalized [6]) and will offer little if any phylogenetically meaningful signal to uncover, for example, the universal rRNA core. In turn, mutational saturation of unconstrained regions will quickly erase history.

The mutational saturation problem was made mathematically explicit by Sober and Steel [24] using “mutual information” and the concept of time as an information destroying process. Mutual information $I(X;Y)$ between two random variables $X$ and $Y$ is defined by

$$I(X;Y) = \sum_{x,y} P(X=x, Y=y)\,\log\left(\frac{P(X=x, Y=y)}{P(X=x)\,P(Y=y)}\right). \tag{1}$$

When $I(X;Y)$ approaches 0, $X$ and $Y$ become independent and no method can predict $X$ from knowledge of $Y$. Importantly, mutual information approaches 0 as the time between $X$ and $Y$ increases in a Markov chain. Regardless of the use of a parsimony, maximum likelihood, or Bayesian-based framework of analysis, $I(X;Y)$ will be particularly small when sequence sites are saturated by too many substitutions due to high substitution rates or large time scales. Ancestral states at interior nodes of the trees cannot be established with confidence from extant information, even in the most optimal situation of knowing the underlying phylogeny and the model of character evolution. Under simple models, the problem is not mitigated by the fact that the number of terminal leaves of trees and the sources of initial phylogenetic information increase with time. Since a phase transition occurs when substitution probabilities exceed a critical value [96], one way out of the impasse is to find features in sufficient number that change at much slower pace than sequence sites and test if mutual information is significant and overcomes Fano’s inequality. These features exist in molecular biology and have been used for phylogenetic reconstruction. They are, for example, the 3D fold structures of protein molecules [87] or the stem modules of RNA structures [97], features that change at very slow rate when compared to associated sequences. For example, protein 3D structural cores evolve linearly with amino acid substitutions per site and change at 3–10 times slower rates than sequences [98]. This high conservation highlights the evolutionary dynamics of molecular structure. Remarkably, rates of change of proteins performing a same function are maintained by functional constraints but accelerate when proteins perform different functions or contain indels. In turn, fold structural diversity explodes into modular structures at low sequence identities, probably triggered by functional diversification. Within the context of structural conservation, the fact that fold structures are structural and evolutionary modules that accumulate in proteomes by gene duplication and rearrangements and spread in biological networks by recruitment (e.g., [99]) also provides a solution to the problems of vanishing phylogenetic signal. Since fold accumulation increases with time in the Markov chain, mutual information must increase, reversing the “data processing inequality” that destroys information and enabling deep evolutionary information.
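The decay of mutual information with time can be made concrete with a small numerical sketch. It assumes the simplest possible model, a symmetric two-state Markov chain with a uniform ancestral distribution, in which the probability that a site differs from its ancestor after time $t$ is $(1 - e^{-2\mu t})/2$; this model, the rate `rate`, and the function names are illustrative assumptions and not the setting analyzed in [24]. Logarithms are taken in base 2, so values are in bits.

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table joint[x][y] (equation (1))."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    total = 0.0
    for x, row in enumerate(joint):
        for y, p in enumerate(row):
            if p > 0:
                total += p * math.log2(p / (px[x] * py[y]))
    return total

def joint_after(time, rate=1.0):
    """Joint distribution of ancestral (X) and descendant (Y) states.

    Assumes a symmetric two-state chain with substitution rate `rate`
    and a uniform distribution over the ancestral state.
    """
    flip = 0.5 * (1.0 - math.exp(-2.0 * rate * time))
    return [[0.5 * (1.0 - flip), 0.5 * flip],
            [0.5 * flip, 0.5 * (1.0 - flip)]]

for t in (0.0, 0.5, 1.0, 2.0, 10.0):
    print(t, mutual_information(joint_after(t)))
```

Under these assumptions the descendant carries one full bit about the ancestor at $t = 0$ and essentially none once the site is saturated, whatever inference method is applied afterwards. Lowering `rate` stretches the time over which signal survives, which is the rationale for seeking slowly changing characters such as fold structures.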

(ii) Characters Must Show at Least Two Distinct Character States. One of these two states (transformational homologs) must be ancestral (the “plesiomorphic” state) and the other must be derived (the “apomorphic” state) [74]. Only shared and derived features (synapomorphies) provide vertical phylogenetic evidence. Consequently, determining the relative ancestry of alternative character states defines the polarity of character transformations and roots the underlying tree. This is a fundamental property of phylogenetic inference. Polarization in tree reconstruction enables the “arrow of time” (sensu Eddington’s entropy-induced asymmetry), solves the rooting problem, and fulfills other epistemological requirements.

Cladistically speaking, character polarity refers to the distinction between the ancestral and derived states and the identification of synapomorphies. However, an evolutionary view of polarity also refers to the direction of character state transformations in the phylogenetic model. Historically, three accepted alternatives have been available for rooting trees [100–102]: the outgroup comparison, the ontogenetic method, and the paleontological or stratigraphic method. While the three methods do not include assumptions of evolutionary process, they have been the subject of much discussion and their interpretation of much controversy. The first two are, however, justified by the assumption that diversity results in a nested taxonomic hierarchy, which may or may not be induced by evolution. We will not discuss the stratigraphic method, as it relies on auxiliary assumptions regarding the completeness of the fossil record. The midpoint rooting criterion mentioned earlier will not be discussed either. The procedure is contested by the existence of heterogeneities in rates of change across the trees and problems with the accurate characterization of phylogenetic distances.

In outgroup comparison, polarity is inferred by the distribution of character states in the ingroup (group of taxa of interest) and the sister group (taxa outside the group of interest). In a simple case, if the character state is only found in the ingroup, it must be considered derived. Outgroup comparison is by far the method of choice because phylogeneticists tend to have confidence in the supporting assumptions: higher-level relationships are outside the ingroup, equivalent ontogenetic stages are compared, and character state distributions are appropriately surveyed. Unfortunately, the method is “indirect” in that it depends on the assumption of the existence of a higher-level relationship between the outgroup and the ingroup. Consequently, the method cannot root the ToL because at that level there is no higher-level relationship that is presently available. Moreover, the method in itself does not polarize characters. It simply connects the ingroup to the rest of the ToL [100].

The ontogenetic criterion confers polarity through the distribution of the states of homologous characters in ontogenies of the ingroup, generally by focusing on the generality of character states, with more widely distributed states being considered ancestral. Nelson’s rule states: “given an ontogenetic character transformation, from a character observed to be more general to a character observed to be less general, the more general character is primitive and the less general advanced” [103]. This “biogenetic law” appears powerful in that it depends only on the assumption that ontogenies of ingroup taxa are properly surveyed. It is also a “direct” method that relies exclusively on the ingroup. Consequently, it has the potential to root the ToL. Unfortunately, Nelson’s “generality” has been interpreted in numerous ways, especially as it relates to the ontogenetic sequence, leading to much confusion [102]. It also involves comparison of developmentally nested and distinct life history stages, making it difficult to extend the method (originally conceptualized for vertebrate phylogeny) to the microbial world. However, Weston [100, 104] made it clear that the ontogenetic criterion embodies a wider “generality criterion” in which the taxic distribution of a character state is a subset of the distribution of another. In other words, character states that characterize an entire group must be considered ancestral relative to an alternative state that characterizes a subset of the group. Besides the centrality of nested patterns, the generality criterion embeds the core assumption that every homology is a synapomorphy in nature’s nested taxonomic hierarchy and that homologies in the hierarchy result from additive phylogenetic change [100]. Weston’s more general rule therefore states that “given a distribution of two homologous characters in which one, $x$, is possessed by all of the species that possess its homolog, character $y$, and by at least one other species that does not, then $y$ may be postulated to be apomorphous relative to $x$” [104]. The only assumption of the method is that relevant character states in the ingroup are properly surveyed. This new rule crucially substitutes the concept of ontogenetic transformation by the more general concept of homology and additive phylogenetic change, which can be applied to cases in which homologous entities accumulate “iteratively” in evolution (e.g., generation of paralogous genes by duplication). Since horizontally acquired characters (xenologs) are not considered synapomorphies, they contribute towards phylogenetic noise and are excluded after calculation of homoplasy and retention indices (i.e., measures of goodness of fit of characters to the phylogeny).
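Weston’s rule, as quoted above, reduces to a test of proper set inclusion between the taxic distributions of two homologous characters. The following minimal sketch expresses that test; the function name `polarize` and the species labels are hypothetical and serve only as illustration.

```python
def polarize(taxa_with_x, taxa_with_y):
    """Apply the generality criterion to two homologous characters x and y.

    Returns (ancestral, derived) when one distribution is a proper subset
    of the other, and None when the criterion cannot decide.
    """
    x, y = set(taxa_with_x), set(taxa_with_y)
    if y < x:      # x occurs in every species with y, plus at least one more
        return ("x", "y")
    if x < y:
        return ("y", "x")
    return None    # identical, overlapping, or disjoint distributions

# Hypothetical survey of five species.
print(polarize({"sp1", "sp2", "sp3", "sp4"}, {"sp2", "sp3"}))
print(polarize({"sp1", "sp2"}, {"sp2", "sp5"}))
```

The second call returns `None` because neither distribution is nested in the other. The rule polarizes only nested patterns, and its single requirement, stated in the text, is that the character states in the ingroup are properly surveyed.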

We have applied the “generality criterion” to the rooting of the ToL through polarization strategies that embody axioms of evolutionary process. Figure 4 shows three examples. A rooted phylogeny describing the evolution of 5S rRNA molecules sampled from a wide range of organisms was reconstructed from molecular sequence and structure [105]. The ToL that was recovered was rooted paraphyletically in Archaea (Figure 4(a)). The model of character state transformation was based on the axiom that evolved RNA molecules are optimized to increase molecular persistence and produce highly stable folded conformations. Molecular persistence materializes in RNA structure in vitro, with base pairs associating and disassociating at rates as high as 0.5 s$^{-1}$ [106]. The frustrated kinetics and energetics of this folding process enable some structural conformations to quickly reach stable states. This process is evolutionarily optimized through structural canalization [6], in which evolution attains molecular functions by both increasing the average life and stability of selected conformations and decreasing their relative number. Thus, conformational diversity measured, for example, by the Shannon entropy of the base-pairing probability matrix, or features of thermodynamic stability, act as “evo-devo” proxy of a generality criterion for RNA molecules, in which the criterion of similarity (e.g., ontogenetic transformation) is of positional and compositional correspondence. Using a different approach, we recently broadened the use of phylogenetic constraint analysis [107, 108], borrowing from a formal cybernetic method that decomposes a reconstructable system into its components [109], and used it to root a ToL derived from tRNA sequence and structure (Figure 4(b)). The ToL was again rooted in Archaea. The number of additional steps required to force (constrain) particular taxa into a monophyletic group was used to define a lineage coalescence distance ($S$) with which to test alternative hypotheses of monophyly. These hypotheses were then ordered according to $S$ value in an evolutionary timeline. Since $S$ records the relative distribution of character states in taxic sets, it also embodies the generality criterion of rooting. Finally, we generated rooted ToLs that describe the evolution of proteomes directly from a census of protein fold structures in proteomes (Figure 4(c)). This ToL shows a paraphyletic rooting in Archaea. The method extracts phylogenetic signal from proteomic abundance of protein fold structures and considers that the most abundant and widely distributed folds are of ancient origin when defining transformation series [41]. This polarization scheme, which results in the gradual growth of the proteome repertoire, again represents an embodiment of the generality criterion in which statements of homology (fold structures) result from additive phylogenetic change (increases in abundance). It is noteworthy that these ToL reconstructions take into account the genomic abundance of each and every fold structure in each proteome and across the entire matrix, thereby generating a frustrated system. In general, very high abundance of only a few folds will not attract taxa (species) to the derived branches in the ToL. The ancestry of taxa is determined by both the abundance and interplay among fold structure characters. For example, metabolic folds such as those involved in ATP hydrolysis are widespread in cells and considered to be very ancient. These are also the most ancient folds in our phylogenies. In comparison, some popular eukaryote-specific folds (e.g., immunoglobulin superfamilies) are highly abundant but appear in a derived manner in our phylogenies. Thus, we reason that there is no circularity involved in the character polarization scheme.

[Figure 4 panels: (a) rooted tree with lineages labeled Archaea, Bacteria, and Eukarya; taxa = 666 5S rRNA molecules of diverse organisms; characters = sequence and structure of tRNA; tree length = 11381 steps; CI = 0.072, RI = 0.772, and g1 = −0.13. (b) timeline, from ancestral to derived, of hypotheses of monophyly for Archaea (A), Bacteria (B), and Eukarya (E). (c) rooted tree with lineages labeled Eukarya, Archaea, and Bacteria; taxa = 150 free-living proteomes; characters = genomic abundance of 1510 FSFs; tree length = 70132 steps; CI = 0.155, RI = 0.739, and g1 = −0.28.]

Figure 4: Trees of life generated from the structure of RNA and protein molecules congruently show a rooting in Archaea. (a) A rooted phylogenetic tree of 5S rRNA reconstructed from both the sequence and the structure of the molecules (from [105]). (b) Global most-parsimonious scenario of organismal diversification based on tRNA (from [107]). A total of 571 tRNA molecules with sequence, base modification, and structural information were used to build a ToL, which failed to show monophyletic groupings. Ancestries of lineages were then inferred by constraining sets of tRNAs into monophyletic groups representing competing (shown in boxes) or noncompeting phylogenetic hypotheses and measuring tree suboptimality and lineage coalescence (illustrated with color hues in circles). (c) A ToL reconstructed from the genomic abundance counts of 1510 FSFs as phylogenetic characters in the proteomes of 150 free-living organisms sampled equally and randomly from the three domains of life (data taken from [122, 123]). Taxa were labeled with circles colored according to superkingdom. CI = consistency index, RI = retention index, and g1 = gamma distribution parameter.
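The use of conformational diversity as a polarizing measure for RNA can be illustrated with a small sketch of the Shannon entropy of a base-pairing probability matrix. The definition used here is one common per-nucleotide form and is an assumption, since the text does not give the formula: for each nucleotide, the entropy is taken over its pairing probabilities with every other nucleotide plus the probability of remaining unpaired, and the result is averaged over the molecule. The matrices and the function `pairing_entropy` are hypothetical.

```python
import math

def pairing_entropy(prob):
    """Mean per-nucleotide Shannon entropy (bits) of a base-pairing matrix.

    prob[i][j] is the probability that nucleotide i pairs with nucleotide j.
    The probability of i being unpaired is taken as 1 - sum(prob[i]).
    """
    total = 0.0
    for row in prob:
        unpaired = max(0.0, 1.0 - sum(row))
        for p in list(row) + [unpaired]:
            if p > 0:
                total -= p * math.log2(p)
    return total / len(prob)

# Hypothetical 4-nucleotide molecules.
# One dominant conformation: 0 pairs with 3, 1 pairs with 2.
canalized = [[0, 0, 0, 1.0],
             [0, 0, 1.0, 0],
             [0, 1.0, 0, 0],
             [1.0, 0, 0, 0]]
# Two equally likely conformations compete for every nucleotide.
diffuse = [[0, 0, 0.5, 0.5],
           [0, 0, 0.5, 0.5],
           [0.5, 0.5, 0, 0],
           [0.5, 0.5, 0, 0]]

print(pairing_entropy(canalized))
print(pairing_entropy(diffuse))
```

Under the axiom stated in the text, a molecule whose ensemble is concentrated on few conformations (low entropy) would be scored as more structurally canalized than one whose pairing probabilities are spread over many alternatives.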

The compositional schemes extend the concept of rooting with paralogous sequences to the entire proteome complements, from gene family level [100] to structural hierarchies. The three examples make use of different rooting rationales but provide a congruent scenario of origins of diversified life. Technically, roots are inferred by the Lundberg method [110], which does not require any outgroup taxa specification. This method roots the trees a posteriori by attaching the hypothetical ancestor to that branch of the unrooted network that would yield minimum increase in the tree length (thus preserving the principle of parsimony).
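The a posteriori rooting step can be sketched as follows. This is a simplified illustration of the idea, not the published procedure of [110]: it assumes ordered (Wagner) characters and a single fixed state assignment at every node of the unrooted tree, so that attaching an ancestor with state $a$ to a branch whose ends hold states $u$ and $v$ adds, for each character, the distance from $a$ to the interval between $u$ and $v$. The input format, the taxon names, and the function `lundberg_root` are hypothetical.

```python
def lundberg_root(branches, ancestor):
    """Pick the branch where attaching a hypothetical ancestor adds fewest steps.

    branches: dict mapping (node_a, node_b) -> (states_a, states_b), where each
              states tuple holds ordered character states assigned to the node
              at that end of the branch (illustrative input format).
    ancestor: tuple of character states of the hypothetical ancestor.
    Returns (best_branch, costs), with costs mapping each branch to its
    increase in tree length.
    """
    def extra_steps(states_a, states_b):
        steps = 0
        for anc, a, b in zip(ancestor, states_a, states_b):
            low, high = min(a, b), max(a, b)
            # Distance from the ancestral state to the interval spanned by
            # the two branch ends (zero when it falls inside the interval).
            steps += max(low - anc, anc - high, 0)
        return steps

    costs = {branch: extra_steps(*ends) for branch, ends in branches.items()}
    return min(costs, key=costs.get), costs

# Hypothetical three-taxon unrooted tree with one internal node "n".
internal = (2, 1, 1)
toy_branches = {
    ("T1", "n"): ((0, 1, 0), internal),
    ("T2", "n"): ((2, 1, 1), internal),
    ("T3", "n"): ((2, 3, 2), internal),
}
# Ancestor defined by the polarization scheme: state 0 is ancestral everywhere.
best, costs = lundberg_root(toy_branches, ancestor=(0, 0, 0))
print(best, costs)
```

In this toy case the root attaches to the branch leading to the taxon whose states lie closest to the hypothesized ancestral states. The position of the root therefore follows from the polarization of characters rather than from an outgroup, which matches the rationale given in the text.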

(iii) Characters Must Serve as Independent Evolutionary Hypotheses. Valid phylogenetic optimization requires that characters be independent pieces of evidence. Characters should not depend on other characters. When the assumption of independence is violated, characters are overweighted in the analysis and the resulting phylogeny fails to represent true history [111]. Possible dependencies could be of many kinds, from structural to functional, from developmental to ecological. These dependencies distort and obscure phylogenetic signal and must be either avoided or coded into the phylogenetic model through parameters or weight corrections.

As we will now elaborate, the problem of character independence is about parts and wholes in the hierarchical fabric of life and in the nested hierarchies of the ToL. Biological systems are by definition made of parts, regardless of the way parts are defined. In evolution, diversification and integration of parts unify parts into cohesive entities, modules, which then diversify [112]. This process and the rise of modules may

explain evolutionary waves of complexity and organization and the emergence of structure (defined broadly) in biology that is hierarchical. The hierarchical makeup is made evident in the structure of protein molecules, where lower level parts of the polymer (the amino acid residues) interact with each other and establish cohesive higher-level modular parts, which also establish interaction networks and are crucial for molecular function and for interaction of proteins with the cellular environment. Consequently, the structure of proteins can be described at increasing hierarchical levels of structural abstraction: sequences, motifs, loops, domains, families, superfamilies, topologies, folds, architectures, and classes. Two accepted gold standards of protein classification, structural classification of proteins (SCOP) [113] and class architecture topology homology (CATH) [114], use parts of this incomplete scheme to describe the atomic complexity of the molecules. We note that these classifications do not consider unrealized structural states, such as protein folds that are possible but that have never been identified in the natural world of protein structures. We also note that modules sometimes engage in combinatorial games. For example, protein domains are rearranged in evolution by fusions and fissions, producing the enormous diversity of alternative domain rearrangements that exist in multidomain proteins [92].

Since biological systems evolve and carry common ancestry, parts of these systems by definition evolve and by themselves carry common ancestry. In other words, the histories of parts are embodied in the ensemble of lineages of the ToL. The focus of phylogenetic analysis, however, has been overwhelmingly the organism as the biological system, as testified by the effort devoted to taxonomical classification, systematic biology, and building of the ToL. The genomic revolution has provided a wealth of parts in the amino acid sequence sites of proteins and nucleic acid molecules, which have been used as phylogenetic characters for analysis of molecules representing organisms. This has maintained the focus of reconstructing trees of systems (the wholes). For example, amino acid sites of a protein are generally fit into a data matrix (alignment), which is then used to build trees of genes and organisms (Figure 5(a)) using modern algorithmic implementations [115]. However, and as we have mentioned, proteins have complicated structures that result from interactions between amino acids at the 3D atomic level. These intramolecular interactions, or at least some of them [116], induce protein folding and delimit molecular function and protein stability. They are responsible for protein secondary, supersecondary, domain, and tertiary structure, and by definition their mere existence induces violation of character independence. Penny and Collins [117] proposed the simple thought experiment in which the bioinformatician exchanges rows of sequence sites in the alignment matrix and asks what was lost in the process. Randomization of characters (columns) in the data matrix does not change the phylogenetic tree. However, randomization destroys the structure of the molecule and very likely its function. This confirms that reconstruction of trees from information in sequence sites violates character independence and in the process ignores structure and biology. The effects of violation

Figure 5: A new phylogenetic strategy simplifies the problems of character independence. (a) The canonical paradigm explores the evolution of systems, such as the evolution of organisms in the reconstruction of ToLs. For example, the terminals of phylogenetic trees can be genes sampled from different organisms (e.g., Pyrococcus furiosus (Pfu), Methanococcus jannaschii (Mja), and Methanosarcina barkeri (Mba)), and phylogenetic characters can be amino acid sequence sites of the corresponding gene products. Character states can describe the identity of the amino acid at each site. Since characters are molecular parts that interact with other parts when molecules fold into compact 3D structures, their interaction violates the principle of character independence. Consequently, the effects of covariation must be considered in the phylogenetic model used to build the trees. (b) The new paradigm explores the evolution of parts, such as the evolution of protein domains in proteomes. For example, the terminals of phylogenetic trees can be domains defined at fold superfamily level of structural complexity, and the characters used to build the trees can be proteomes. Character states can be the number of domains holding the FSF structure. Since proteomes interact with other proteomes when organisms establish close interactions, interactions that could affect the abundance of domains in proteomes should be considered negligible (unless there is an obligate parasitic lifestyle involved), and there is no need to budget trophic interactions in the phylogenetic model.

of character independence may be minimal for trees of sequences that are closely related. However, trees describing deep historical relationships require that sequences be divergent, and this maximizes the chances of even wider divergences in molecular structure that are not being accounted for by the models of sequence evolution.
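The thought experiment of Penny and Collins can be reproduced in a few lines. The sketch below is only an illustration: the toy alignment, the helper names, and the use of uncorrected mismatch distances are assumptions made here and are not taken from [117]. Permuting the columns leaves every pairwise distance unchanged, so any distance-based tree is unchanged too, even though the order of residues, and with it any notion of neighboring or contacting sites, is scrambled.

```python
import random

# Hypothetical toy alignment (rows = taxa, columns = sequence sites).
alignment = {
    "Pfu": "ALYSGKTA",
    "Mja": "ALYSGRTA",
    "Mba": "GLSSGKSA",
}

def p_distance(a, b):
    """Fraction of aligned sites at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def distance_matrix(aln):
    """Pairwise p-distances for every unordered pair of taxa."""
    names = sorted(aln)
    return {(s, t): p_distance(aln[s], aln[t])
            for i, s in enumerate(names) for t in names[i + 1:]}

def shuffle_columns(aln, seed=0):
    """Apply one and the same random column permutation to all rows."""
    length = len(next(iter(aln.values())))
    order = list(range(length))
    random.Random(seed).shuffle(order)
    return {name: "".join(seq[i] for i in order) for name, seq in aln.items()}

shuffled = shuffle_columns(alignment)
same_distances = distance_matrix(alignment) == distance_matrix(shuffled)
print(same_distances)
```

The comparison holds for any permutation because each distance is a sum over columns, and a sum does not depend on the order of its terms.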

The genomic revolution has also provided a wealth of models of 3D atomic structure. These structures are used as gold standards to assign, with high confidence, structural modules to sequences. As mentioned earlier, proteomes embody collections of protein domains with well-defined structures and functions. Protein domain counts in proteomes have been used to generate trees of protein domains (Figure 5(b)) using standard cladistic approaches and well-established methods (reviewed in [87]). These trees describe the evolution of protein structure at global level. They are effectively trees of parts. While domains interact with each other in multidomain proteins or establish protein-protein interactions with the domains of other proteins, these interactions of parts do not violate character independence. This is because phylogenetic characters are actually proteomes, systems defined by structural states that exist at much higher levels than the protein domain, not far away from the organism level. Remarkably, no information is lost when character columns in the matrix are randomized in

the thought experiment. The order of proteome characters in the matrix does not follow any rationale. Characters are not ordered by lifestyles or trophic levels of the organisms. Interactions between free-living organisms will seldom bias their domain makeup, and if so, those characters can be excluded from analysis. Even the establishment of symbiotic or obligate parasitic interactions, such as the nodule-forming symbioses between rhizobia and legumes, may have little impact on character independence as long as the joint inclusion of the host and the symbiont is avoided.
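The switch from trees of systems to trees of parts amounts to reading the same census matrix the other way around. The following minimal sketch uses hypothetical domain counts and hypothetical names (`census`, `transpose`, `fsf_1`, and so on); it only shows how proteomes become the characters and domains become the taxa.

```python
# Hypothetical FSF domain counts; rows are proteomes, columns are domains.
domains = ["fsf_1", "fsf_2", "fsf_3"]
census = {
    "Pfu": [10, 5, 2],
    "Mja": [12, 4, 3],
    "Mba": [9, 4, 3],
}

def transpose(matrix, column_names):
    """Turn columns into rows, so that parts become the taxa."""
    row_names = list(matrix)
    flipped = {col: [matrix[r][j] for r in row_names]
               for j, col in enumerate(column_names)}
    return flipped, row_names

# Canonical view: proteomes are taxa, domain counts are characters.
# Tree-of-domains view: domains are taxa, proteomes are characters.
tod_matrix, characters = transpose(census, domains)
print(characters, tod_matrix)
```

Each row of `tod_matrix` is the abundance profile of one domain across proteomes, which is the kind of input from which a tree of domains would be built.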

6. Evidence Supporting the Archaeal Rooting of the Universal Tree

Figure 4 shows examples of rooted ToLs generated from the sequence and structure of RNA and protein molecules. The different phylogenomic approaches arrive at a common rooted topology that places the stem group of Archaea at the base of the ToL (the archaeal rooting of Figure 1(c)). However, the evolutionary interrelationship of parts and wholes prompts the use of trees of parts, a focus on higher-level structure, and a decrease in confidence in the power of trees of systems. These concepts have been applied to the study of the history of nucleic acid and protein structures for over a decade and

Figure 6: Venn diagrams displaying the distributions of 1733 FSF domains (a), 1924 terminal GO terms (b), and 1148 RNA families (c) in the genomes of the three domains of life. FSF domain data was taken from Nasir et al. [122, 123] and included 981 completely sequenced proteomes from 70 Archaea, 652 Bacteria, and 259 Eukarya. Terminal GO terms corresponding to the "molecular function" hierarchy defined by the GO database [118] were identified in 249 free-living organisms, including 45 Archaea, 183 Bacteria, and 21 Eukarya (data taken from [123]). The Venn diagram of RNA families and the distribution of Rfam clans and families in organisms were taken from Hoeppner et al. [129] and their Dataset S1. Shown below are distribution patterns for FSFs, GO terms, and RNA families that are present in more than 60% of the organisms examined (f > 0.6). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

provide additional evidence in support of the archaeal rooting scenario.

6.1. Comparative Genomic Argumentation. The distribution of gene-encoded products in the genomes of sequenced organisms and parsimony thinking can reveal global evolutionary patterns without formal phylogenetic reconstruction. We will show how simple numerical analyses of protein domains, molecular activities defined by the gene ontology (GO) consortium [118], and RNA families that are domain-specific or are shared between domains of life can uncover the tripartite division of cellular life, exclude chimeric scenarios of origin, and provide initial insight on the rooting of the ToL (Figure 6).

The number of unique protein folds observed in nature is very small. SCOP ver. 1.75 defines only ~2000 fold superfamilies (FSFs), groups of homologous domains unified on the basis of common structural and evolutionary relationships, for a total of 110,800 known domains in proteins [113, 119]. FSFs represent highly conserved evolutionary units that are suitable for studying organismal diversification [87]. Yafremava et al. [120] plotted the total number of distinct FSFs (FSF diversity) versus the average reuse of FSFs in the proteome of an organism (FSF abundance). This exercise uncovered a scaling behavior typical of a Benford distribution, with a linear regime of proteomic growth for microbial

organisms and a superlinear regime for eukaryotic organisms (Figure 8 in [120]). These same scaling patterns are observed when studying the relationship between open reading frames and genome size [121]. Remarkably, archaeal and eukaryal proteomes exhibited minimum and maximum levels of FSF abundance and diversity, respectively. Bacterial proteomes, however, showed intermediate levels. We note that the general scaling behavior is consistent with a scenario in which evolutionary diversification proceeds from simpler proteomes to the more complex ones in a gradual manner, supporting the principle of spatiotemporal continuity and revealing the nested phylogenetic hierarchies of organisms. Under this scenario (and results of [120]), the streamlined archaeal proteomes represent the earliest form of cellular life. Remarkably, archaeal species harboring thermophilic and hyperthermophilic lifestyles encoded the most streamlined FSF repertoires (Figure 7). Clearly, modern thermophilic archaeons are most closely related to the ancient cells that inhabited planet Earth billions of years ago (also read below).

FSF domain distributions in the genomes of the three domains of life provide further insights into their evolution. Nasir et al. [122, 123] generated Venn diagrams to illustrate FSF sharing patterns in the genomes of Archaea, Bacteria, and Eukarya (Figure 6(a)). These diagrams display the total number of FSFs that are unique to a domain of life (taxonomic groups A, B, and E), shared by only two (AB, BE, and AE), and

Figure 7: A plot of FSF use (diversity) against FSF reuse (abundance) reveals a linear pattern of proteomic growth in 48 archaeal proteomes. Thermophilic archaeal species occupy positions that are close to the origin of the plot. They also populate the most basal branch positions in ToLs (see discussion in the main text). Both axes are in logarithmic scale.

those that are universal (ABE). About half of the FSFs (786 out of 1733) were present in all three domains of life, 453 of which were present in at least 60% of the organisms that were examined (f > 0.6, where f is the distribution index, ranging from 0 to 1). The fact that about 70% of widely shared FSFs (672) belong to the ABE taxonomic group strongly supports a common evolutionary origin for cells. Evolutionary timelines have confirmed that a large number of these universal FSFs were present in the ancestor of the three domains of life, the urancestor (Figure 1), and were for the most part retained in extant proteomes [48].
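The bookkeeping behind the Venn diagrams can be sketched as follows. This is a hedged illustration rather than the pipeline of [122, 123]: the feature names and occurrence counts are hypothetical, and f is assumed here to be computed over all surveyed organisms. Only the proteome totals per domain (70, 652, and 259) and the figures 786, 1733, 453, and 672 come from the text.

```python
def distribution_index(present_in, total):
    """f = fraction of surveyed organisms that carry the feature."""
    return present_in / total

def taxonomic_group(counts):
    """Venn group label (A, B, E, AB, AE, BE, or ABE) from occurrence counts."""
    return "".join(d for d in "ABE" if counts.get(d, 0) > 0)

# Number of proteomes per domain in the FSF census described in the text.
TOTALS = {"A": 70, "B": 652, "E": 259}

# Hypothetical occurrence counts for three made-up features.
census = {
    "fsf_x": {"A": 70, "B": 640, "E": 259},
    "fsf_y": {"A": 0, "B": 500, "E": 200},
    "fsf_z": {"A": 3, "B": 0, "E": 0},
}

summary = {}
for name, counts in census.items():
    f = distribution_index(sum(counts.values()), sum(TOTALS.values()))
    summary[name] = (taxonomic_group(counts), round(f, 3), f > 0.6)

# Arithmetic behind "about half" and "about 70%" in the text.
universal_share = 786 / 1733
abe_share_of_widespread = 453 / 672
print(summary, round(universal_share, 2), round(abe_share_of_widespread, 2))
```

With these definitions a feature is assigned to exactly one of the seven groups, and the f > 0.6 filter is applied independently of the group label.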

Interestingly, the number of BE FSFs was ~10-fold greater than AE and AB FSFs (324 versus 38 and 38) (Figure 6(a)). The finding that Bacteria and Eukarya encode a significantly large number of shared FSF domains is remarkable and hints towards an unprecedented strong evolutionary association between these two domains of life. Moreover, a significant number of BE FSFs is widespread in the bacterial and eukaryotic proteomes: 56 of the 324 FSFs are shared by more than 60% of the bacterial and eukaryal organisms that were analyzed (Figure 6(a)). This is strong evidence of an ancient vertical evolutionary trace from their mutual ancestor (anticipated in [123]). This trace uniquely supports the archaeal rooting of the ToL. In turn, the canonical and eukaryotic rooting alternatives are highly unlikely, as none alone can explain the remarkable diversity of the BE taxonomic group (as well as its ancient origin [123]). The remarkable vertical trace of the ABE taxonomic group and the negligible vertical trace of the AB group (only one FSF

shared by more than 60% of organisms) also falsify fusion hypotheses responsible for a putative chimeric makeup of Eukarya. In light of these findings, the most parsimonious explanation of comparative genomic data is that Archaea is the first domain of diversified cellular life.

The patterns of FSF sharing are further strengthened by the genomic distributions of terminal-level molecular function GO terms (Figure 6(b)). The three domains of life shared about a quarter (526) of the 1924 GO terms that were surveyed. A total of 232 of these ABE terms were shared by more than 60% of organisms analyzed. Again, the fact that about 76% of widely shared GO terms (306) belong to the ABE taxonomic group strongly supports a common vertical trace, while the finding that Bacteria and Eukarya share a significantly large number of GO terms (272) supports again the archaeal rooting of the ToL. An alternative explanation for the very large size of BE could be large-scale metabolism-related gene transfer from the ancestors of mitochondria and plastids to the ancestors of modern eukaryotes [124]. However, we note that the BE group is not restricted to only metabolic functions. It also includes FSFs and GOs involved in intracellular and extracellular processes and regulation and informational functions [125]. Thus, the very large size of BE is a significant outcome most likely shaped by vertical evolutionary scenarios and cannot solely be explained by parasitic/symbiotic relationships that exist between Bacteria and Eukarya [123]. Moreover, the simplicity of archaeal FSF repertoires is not due to the paucity of available archaeal genomic data. We confirmed that the mean FSF coverage (i.e., number of proteins annotated to FSFs/GOs out of total) for Archaea, Bacteria, and Eukarya was largely comparable (e.g., Table S1 in [122]). Finally, as we will describe below, these comparative patterns and tentative conclusions are confirmed by phylogenomic analysis [91, 122, 126]. Congruent phylogenies were obtained with (Figure 4(c)) and without equal and random sampling of taxa. Thus, the relatively low number of archaeal genomes is also not expected to compromise our inferences.

We note that Eukarya shares many informational genes with Archaea and many operational genes with Bacteria. While informational genes have been thought more refractory to HGT than noninformational genes, we have confirmed at GO level that this is not the case. A ToL built from GO terms showed that in fact noninformational terms were less homoplasious, while statistical enrichment analyses revealed that HGT had little if any functional preference for GO terms across GO hierarchical levels. We recently compared ToLs reconstructed from non-HGT GO terms and ToLs reconstructed from informational GO terms that were extracted from non-HGT GO terms (Kim et al., ms. in review; also read below). In both cases, Archaea appeared as a basal paraphyletic group of the ToLs and the common origin of Bacteria and Eukarya was maintained. Thus, the 272 GO terms shared by Bacteria and Eukarya harbor a strong vertical trace.

Another observation often used as support for Archaea-Eukarya kinship is the discovery of a few eukaryote-specific proteins (e.g., actin, tubulin, H3, H4, ESCRT, ribosomal proteins, and others) in some archaeal species [127] and


their complete absence from Bacteria (except for documented tubulin HGT from eukaryotes to the bacterial genus Prosthecobacter [128]). This suggests either that eukaryotes arose from an archaeal lineage [127] or that the ancestor of Eukarya and Archaea was complex and modern Archaea are highly reduced [25]. Indeed, a few eukaryote-specific proteins have now been found in some archaeal species (in some cases, just one species). We argue that this poor spread cannot be taken as evidence for the Archaea-Eukarya sister relationship. This needs to be confirmed by robust phylogenetic analysis, which is unfortunately not possible when using protein sequences. Recently, we described a new strategy for inferring vertical and horizontal traces [123]. This method calculates the spread of FSFs and GOs that are shared between two-superkingdom groups (i.e., AB, AE, and BE). Balanced distributions often indicate vertical inheritance, while biased distributions suggest horizontal flux. For example, penicillin binding molecular activity (GO:0008658) was present in 100% of the sampled bacterial proteomes but was only present in 11% of the archaeal species [123]. Thus, presence of GO:0008658 in Archaea was attributed to HGT gain from Bacteria. Using this simple method, we established that both Bacteria and Eukarya were united by a much stronger vertical trace than either was to Archaea. In fact, strong reductive tendencies in the archaeal genomes were recorded [123]. Thus, in our opinion, presence of eukaryote-specific proteins in only very few archaeal species could in fact be an HGT event that is not detectable by sequence phylogenies.
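The logic of comparing spreads can be sketched in a few lines. The decision rule below is an assumption made for illustration: the ratio cutoff `max_ratio` and the function name `classify_trace` are hypothetical and are not the criterion used in [123]. Only the two spreads for GO:0008658 (100% and 11%) are taken from the text.

```python
def classify_trace(spread_1, spread_2, max_ratio=2.0):
    """Label a feature shared by two domains as 'balanced' or 'biased'.

    spread_1 and spread_2 are the fractions of sampled proteomes in each
    domain that carry the feature. max_ratio is a hypothetical cutoff.
    """
    low, high = sorted((spread_1, spread_2))
    if low == 0:
        return "biased"
    return "balanced" if high / low <= max_ratio else "biased"

# Spreads reported in the text for penicillin binding (GO:0008658).
go_0008658 = classify_trace(1.00, 0.11)

# Hypothetical feature with a similar spread in both domains.
even_feature = classify_trace(0.85, 0.78)
print(go_0008658, even_feature)
```

A biased label is only a hint of horizontal flux, and a balanced label is only a hint of vertical inheritance; the sketch does not replace phylogenetic testing.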

In turn, many arguments now favor the Bacteria-Eukarya sisterhood, in addition to the structure-based phylogenies and balanced distribution of molecular features. For example, Bacteria and Eukarya have similar lipid membranes that can be used as an argument for their evolutionary kinship. Moreover, Archaea fundamentally differ from Eukarya in terms of their virosphere. Viruses infecting Archaea and Eukarya are drastically different, as recently discussed by Forterre [25].

The genomic distribution of RNA families was taken from Hoeppner et al. [129] and also shows a vertical evolutionary trace in five crucial Rfam clans that are universal, including tRNA, 5S rRNA, subunit rRNA, and RNase P RNA. These universal RNA groups are likely minimally affected by HGT. However, 99% of Rfam clans and families are specific to domains of life, and only 11 of the 1148 groups were shared at f > 0.6 levels. This clearly shows that the functional complexity of RNA materialized very late during organismal diversification and that it is not a good genomic feature for exploring the rooting of the ToL. Only five RNA families are shared between two domains of life, and only one of these does so at f > 0.6: the G12 pseudoknot of the 23S rRNA, which is present in bacterial and eukaryotic organellar rRNA. While the large subunit rRNA scaffold supports the G12 pseudoknot with its vertical trace, all five interdomain RNA families can be explained most parsimoniously by HGT. Thus, only a handful of ancient and universal RNA species can be used to root the ToL.

We end by noting that the Venn diagrams consistently show that Archaea harbors the smallest number of unique (A) and shared (AB and AE) FSFs, GO terms, or RNA families. This trend supports an early divergence of this domain of life

from the urancestor and the possibility that such divergence be shaped by evolutionary reductive events. We reason that such losses would be more parsimonious early on in evolution than in the later periods. This is because genes often increase their abundance in evolutionary time (by gene duplications, HGT, and other processes). Thus, it is reasonable to think that loss of an ancient gene would be more feasible very early in evolution relative to losing it very late.

6.2. Phylogenomic Evidence from the Sequence and Structure of RNA. Phylogenetic analyses of the few RNA families that are universal and display an important vertical evolutionary trace (Figure 6(c)) provide compelling evidence in favor of an early evolutionary appearance of Archaea [105, 107, 108, 130–135]. Here we briefly summarize evidence from tRNA, 5S rRNA, and RNase P RNA. Unpublished analyses of rRNA sequence and structure using advanced phylogenetic methods also show that Archaea was the first domain of life.

tRNA molecules are generally short (~73–95 nucleotides in length) and highly conserved. Consequently, their sequence generally contains a limited amount of phylogenetic information. These limitations have been overcome by analyzing entire tRNomes [136], which extend the length of an organismal set of tRNAs to over 2000 bases. Xue et al. [130, 131] analyzed the genetic distances between tRNA sequences as averages between alloacceptor tRNAs from diverse groups of tRNomes using multiple molecular base substitution models. The distances were mapped onto an unrooted phylogeny of tRNA molecules (Figure 8(a)). The "arrow of time" assumption in these studies is that ancient tRNA paralogs closely resemble each other when lineages originate close to the time of the gene duplication. Remarkably, the results revealed a paraphyletic rooting of the ToL in Archaea. The root was specifically located close to the hyperthermophilic methanogen Methanopyrus kandleri (Figure 8(a)). The hypothesis of this specific rooting scenario has been supported by several other studies [134, 137], including a study of genetic distances between paralogous pairs of aminoacyl-tRNA synthetase (aaRS) proteins [131]. A remarkable match between distance scores of tRNA and pairs of aaRS paralogs (Figure 8(b)) not only confirms the early appearance of Archaea but also suggests a coevolutionary trace associated with molecular interactions that are responsible for the genetic code. In fact, a recent exhaustive phylogenomic analysis of tRNA and aaRS coevolution explicitly reveals the origins and evolution of the genetic code and the underlying molecular basis of genetics [59]. Di Giulio [132, 133] has also proposed an archaeal rooting of the tree of life, specifically in the lineage leading to the phylum of Nanoarchaeota. This rooting is based on unique and ancestral genomic traits of Nanoarchaeum equitans: split genes separately codifying for the 5′ and 3′ halves of tRNA and the absence of operons, which are considered molecular fossils [132]. However, this claim needs additional support, as contrasting evidence now recognizes N. equitans as a highly derived archaeal species [138].
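The idea of an average distance between alloacceptor tRNAs can be made concrete with a small sketch. It assumes pre-aligned sequences of equal length and uses plain mismatch fractions, whereas the studies cited above used several base substitution models; the toy sequences and function names are hypothetical. With 20 amino acid specificities there are C(20, 2) = 190 alloacceptor pairs per genome.

```python
from itertools import combinations
from math import comb

def mismatch_fraction(a, b):
    """Fraction of aligned positions that differ (gaps are not handled)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def mean_pairwise_distance(seqs):
    """Average mismatch fraction over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    return sum(mismatch_fraction(a, b) for a, b in pairs) / len(pairs)

# Twenty specificities give C(20, 2) = 190 alloacceptor pairs.
n_pairs = comb(20, 2)

# Hypothetical, pre-aligned toy "tRNAs" of equal length from one genome.
toy_trnas = ["GCGGAUUUAG", "GCGGACUUAG", "GCCGAUAUAG", "GGGGAUUUAC"]
d_toy = mean_pairwise_distance(toy_trnas)
print(n_pairs, round(d_toy, 3))
```

Under the "arrow of time" assumption, a genome whose alloacceptor tRNAs still resemble each other (a low average distance) would be read as lying closer to the root.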

In addition to sequences, structural features of tRNA molecules also support the archaeal rooting of the ToL.

Figure 8: Ancient phylogenetic signal in the sequence of tRNA and their associated aminoacyl-tRNA synthetase (aaRS) enzymes. (a) Unrooted ToL derived from tRNA sequences with their alloacceptor Dallo distances traced in thermal scale (from [130]). Dallo is the average of pairwise distances for 190 pairs of tRNA isoacceptors. These distances measure pairwise sequence mismatches of tRNAs for every genome, and their values increase for faster evolving sequences of species of more recent origin. (b) Coevolution of aaRSs and their corresponding tRNA. Genetic distances between the top 10 potentially paralogous aaRS pairs, estimated using BLASTP, define a measure (Qars) of how closely the proteins resemble each other in genomes (from [131]). Larger Qars scores imply more ancestral and slowly evolving protein pairs. The plot of Qars scores (aminoacyl-tRNA synthetase history) against Dallo distances (tRNA alloacceptor history) reveals a hidden correlation between the evolution of tRNA and aaRSs and the early origin of Archaea (larger Qars and lower Dallo distances).

The application of RNA structural evidence in phylogenetic studies [50, 139–141] has multiple advantages over sequence data when studying ancient events, especially because RNA structures are far more conserved than sequences. This has been demonstrated in a phylogenetic approach that uses RNA structural information to reconstruct evolutionary history of macromolecules such as rRNA [29, 50], tRNA [107, 108], 5S rRNA [105], RNase P RNA [135], and SINE RNA [142]. Geometrical and statistical properties of structure (e.g., stems or loops commonly found in the secondary structures of RNA molecules) are treated as linearly ordered multistate phylogenetic characters. In order to build rooted trees, an evolutionary tendency toward conformational order is used to polarize change in character state transformations. This defines a hypothetical ancestor with which to root the ingroup using the Lundberg method. Reconstructed phylogenies produce trees of molecules and ToLs (e.g., Figure 4(a)) or trees of substructures that describe the gradual evolutionary accretion of structural components into molecules (Figure 9). For example, phylogenetic trees of tRNA substructures define explicit models of molecular history and show that tRNA originated in the acceptor stem of the molecule [108]. Remarkably, trees reconstructed from tRNA drawn from individual domains of life demonstrate that the sequence of accretion events occurred differently in Archaea than in Bacteria and Eukarya, suggesting a sister group evolutionary relationship between the bacterial and eukaryotic domains

(Figure 9(a)). A similar result obtained from trees of 5S rRNA substructures revealed different molecular accretion sequences of archaeal molecules when these were compared to bacterial and eukaryal counterparts [105], confirming again, in a completely different molecular system, the history of the domains of life.

An analysis of the structure of RNase P RNA also provides similar conclusions [135]. While a ToL reconstructed from molecular structure placed type A archaeal molecules at its base (a topology that resembles the ToL of 5S rRNA), a tree of RNase P RNA substructures uncovered the history of molecular accretion of the RNA component of the ancient endonuclease and revealed a remarkable reductive evolutionary trend (Figure 9(b)). Molecules originated in stem P12 and were immediately accessorized with the catalytic P1–P4 pseudoknotted core structure that interacts with RNase P proteins of the endonuclease complex and ancient segments of tRNA. Soon after this important accretion stage, the evolving molecule loses its first stem in Archaea (stem P8), several accretion steps earlier than the first loss of a stem in Eukarya or the first appearance of a Bacteria-specific stem. These phylogenetic statements provide additional strong support to the early origin of the archaeal superkingdom prior to the divergence of the shared common ancestor of Bacteria and Eukarya. As we will discuss below, the early loss of a structure in the molecular accretion process of a central and ancient RNA family is significant. It suggests

Figure 9: The history of accretion of tRNA and RNase P RNA substructures reveals the early evolutionary appearance of Archaea. (a) Rooted trees of tRNA arm substructures reveal the early appearance of the acceptor arm (Acc), followed by the anticodon arm (AC) in Archaea or the pseudouridine (TΨC) arm in both Bacteria and Eukarya (from [108]). The result confirms the sister group relationship of Bacteria and Eukarya. (b) Trees of molecular substructures of RNase P RNAs were reconstructed from characters describing the geometry of their structures (from [135]). Branches and corresponding substructures in a 3D atomic model are colored according to the age of each substructure (nd, node distance). Note the early loss of stem P8 in Archaea immediately after the evolutionary assembly of the universal functional core of the molecule.


that the emerging archaeal lineages were subjected to strong reductive evolutionary pressures during the early evolution of a very ancient RNA molecule.

6.3. Phylogenomic Evidence from Protein Domain Structure. G. Caetano-Anollés and D. Caetano-Anollés [41] were the first to utilize protein domain structures as taxa and to reconstruct trees of domains (ToDs) describing their evolution. Figure 10 shows a rooted ToD built from a census of FSF structures in 981 genomes (data taken from [122, 123]). These trees are unique in that their terminal leaves represent a finite set of component parts [87]. These parts describe at global level the structural diversity of the protein world. When building trees of FSFs, the age of each FSF domain structure can be calculated from the ToDs by simply calculating a distance in nodes between the root of the tree and its corresponding terminal leaf. This is possible because ToDs exhibit highly unbalanced (pectinate) topologies that are the result of semipunctuated processes of domain appearance and accumulation. This node distance (nd), rescaled from 0 to 1, provides a relative timescale to study the order of FSF appearance in evolutionary history [126, 143]. Wang et al. [144] showed that nd correlates linearly with geological time and defines a global molecular clock of protein folds. Thus, nd can be used as a reliable proxy for time. Plotting the age of FSFs in each of the seven taxonomic groups confirmed evolutionary statements we had previously deduced from the Venn diagrams of Figure 6. The ABE taxonomic group included the majority of ancient and widely distributed FSFs. This is expected. In the presence of a strong vertical trace, molecular diversity must delimit a nested taxonomic hierarchy. The ABE group was followed by the evolutionary appearance of the BE group, which preceded the first domain-specific structures, which were Bacteria-specific (B). Remarkably, Archaea-specific (A) and Eukarya-specific (E) structures appeared concurrently and relatively late. These general trends, captured in the box plots of Figure 10, have been recovered repeatedly when studying domain structures at various levels of structural complexity, from folds to fold families [91, 126], when using CATH or SCOP structural definitions [122, 145], or when exploring the evolution of terminal GO terms.
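As an illustration of how a node distance can be derived from a rooted tree, the following Python sketch counts the nodes separating each terminal leaf from the root and rescales the counts to the interval from 0 to 1. The toy tree, its labels (`n0`, `fsf_a`, and so on), the child-to-parent encoding, and the min-max rescaling are assumptions made for illustration only; they are not the data or the software used to build the ToD of Figure 10.

```python
from typing import Dict, Optional


def node_distance(parent: Dict[str, Optional[str]]) -> Dict[str, float]:
    """Return a rescaled node distance (nd) for every terminal leaf.

    `parent` maps each node to its parent; the root maps to None.
    """
    internal = {p for p in parent.values() if p is not None}
    leaves = [node for node in parent if node not in internal]

    def nodes_to_root(node: str) -> int:
        # Count the branches (one per ancestral node) between leaf and root.
        steps = 0
        while parent[node] is not None:
            node = parent[node]
            steps += 1
        return steps

    raw = {leaf: nodes_to_root(leaf) for leaf in leaves}
    low, high = min(raw.values()), max(raw.values())
    if high == low:  # degenerate tree: all leaves are equally old
        return {leaf: 0.0 for leaf in raw}
    # Assumed rescaling: most basal leaf gets nd = 0, most derived gets nd = 1.
    return {leaf: (d - low) / (high - low) for leaf, d in raw.items()}


# Hypothetical pectinate tree: (fsf_a, (fsf_b, (fsf_c, fsf_d)))
toy_tree = {
    "n0": None,
    "fsf_a": "n0", "n1": "n0",
    "fsf_b": "n1", "n2": "n1",
    "fsf_c": "n2", "fsf_d": "n2",
}
nd = node_distance(toy_tree)
for leaf in sorted(nd, key=nd.get):
    print(leaf, round(nd[leaf], 2))
```

In a pectinate topology of this kind each successive leaf sits one node farther from the root, which is why a simple node count orders the leaves along a relative timeline.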

While the early rise of BE FSFs supports the early divergence of Archaea from the urancestor, the very significant trend of gradual loss of structures occurring in the lineages of the archaeal domain and the very late appearance of Archaea-specific structures (e.g., [126]) demand explanation. Since ancient BE FSFs are widely distributed in proteomes (Figure 10), they cannot arise from separate gains of FSFs in Bacteria and Eukarya or by processes of horizontal spread of structures. This was already evident from the Venn diagrams of Figure 6. Moreover, the BE sisterhood to the exclusion of Archaea was further supported by the inspection of FSFs involved in lipid synthesis and transport (Table 1). Membrane lipids are very relevant to the origins of diversified life ([146] and references therein). Bacteria and Eukarya encode similar lipid membranes, while archaeal membranes have different lipid composition (isoprenoid ethers). To check if

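[Figure 10 graphic: rooted ToD (taxa = 1733 FSF domains; characters = 981 genomes; tree length = 458226 steps; CI = 0.049, RI = 0.796, and g1 = −0.08) with the basal FSFs c.55.1, c.37.1, b.40.4, c.2.1, a.4.5, and c.66.1 labeled, and box plots of the origin of FSF domains for the ABE, BE, AB, AE, B, A, and E taxonomic groups along the axis "Age of FSFs (nd)" (0.0 to 1.0). Color legend by distribution index f: 0.0 to 0.1 (529 FSFs), 0.1 to 0.5 (623), 0.5 to 0.9 (336), 0.9 to 1.0 (228), and f = 1.0 (17).]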

Figure 10: Phylogenomic tree of domains (ToD) describing the evolution of 1733 FSF domain structures. Taxa (FSFs) were colored according to their distribution (f) in the 981 genomes that were surveyed and used as characters to reconstruct the phylogenomic tree (data taken from [122, 123]). The most basal FSFs are labeled with SCOP alphanumeric identifiers (e.g., c.37.1 is the P-loop containing nucleoside triphosphate hydrolase FSF). Boxplots display the age (nd value) distribution of FSFs for the seven possible taxonomic groups. nd values were calculated directly from the tree [122] and define a timeline of FSF innovation, from the origin of proteins (nd = 0) to the present (nd = 1). The group of FSFs that are shared by the three domains of life (ABE) is the most ancient taxonomic group, which spans the entire time axis, and their FSFs are widely distributed in genomes. The appearance of the BE group coincides with the first reductive loss of an FSF in Archaea. FSF structures specific to domains of life appear much later in evolution.

lipid synthesis was another BE synapomorphy, we identified 17 FSFs that were involved in lipid metabolism and transport

Table 1: List of FSFs involved in lipid metabolism and transport along with taxonomic distribution (data taken from [122, 123]).

Group  SCOP Id  FSF Id   FSF description
ABE    89392    b.125.1  Prokaryotic lipoproteins and lipoprotein localization factors
ABE    53092    c.55.2   Creatinase/prolidase N-terminal domain
ABE    49723    b.12.1   Lipase/lipooxygenase domain (PLAT/LH2 domain)
ABE    54637    d.38.1   Thioesterase/thiol ester dehydrase-isomerase
ABE    63825    b.68.5   YWTD domain
BE     47027    a.11.1   Acyl-CoA binding protein
BE     48431    a.118.4  Lipovitellin-phosvitin complex, superhelical domain
BE     55048    d.58.23  Probable ACP-binding domain of malonyl-CoA ACP transacylase
BE     56968    f.7.1    Lipovitellin-phosvitin complex, beta-sheet shell regions
BE     58113    h.5.1    Apolipoprotein A-I
BE     47162    a.24.1   Apolipoprotein
BE     56931    f.4.2    Outer membrane phospholipase A (OMPLA)
B      82220    b.120.1  Tp47 lipoprotein, N-terminal domain
E      47699    a.52.1   Bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin
E      82936    h.6.1    Apolipoprotein A-II
E      57190    g.3.10   Colipase-like
E      49594    b.7.4    Rab geranylgeranyltransferase alpha-subunit, insert domain

(Table 1). Remarkably, the largest share of these FSFs (7 out of 17) was unique to the BE group. In comparison, none were present in either AE or AB groups. The ABE group included five universal FSFs, while one was unique to Bacteria and four were Eukarya-specific. In turn, no FSF was unique to Archaea. The BE FSFs cannot be explained by modern effects impinging on variations in proteomic accumulation in FSFs or by processes of domain rearrangement, since these appear in the protein world quite late in evolution [92]. The only and most parsimonious explanation of the patterns of FSF distribution that unfold in the ToD is the very early (and protracted) rise of the archaeal domain by processes of reductive evolution, possibly triggered by the adaptation of urancestral lineages to harsh environments and survival modes. Under extremophilic conditions, typical of hyperthermophilic environments, considerable investments of matter-energy and information must be made for protein persistence [120]. This puts limits on viable protein structures [147]. Extremophilic environments will thus poise the maintenance of a limited set of FSFs for persistence of emergent diversified lineages. This would induce a primordial episode of reductive evolution in the growing FSF repertoire, explaining why hyperthermophilic and thermophilic archaeal species hold the most reduced proteomes (Figure 7). It would also explain the biases that exist in FSFs, GO terms, and RNA families (Figure 6) and the placement of hyperthermophilic and thermophilic archaeal species at the base of ToLs. Since Archaea populate the oceans and sometimes rival in number Bacteria in those environments, we further interpret the late appearance of Archaea-specific FSFs as the result of late colonization of these mild environments by both ancient archaeons and emerging eukaryotes. This relaxes primordial extremophilic pressures on protein structures and enables the late archaeal exploration of structural flexibility and functional novelty.
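The group counts quoted above (7 BE, 5 ABE, 1 B, 4 E, and none for A, AB, or AE) can be checked directly against Table 1. The short Python tally below is only a bookkeeping sketch: the group labels and SCOP identifiers are transcribed from Table 1, while the variable names (`TABLE_1`, `ALL_GROUPS`, `counts`) are introduced here for illustration.

```python
from collections import Counter

# (taxonomic group, SCOP FSF identifier) pairs transcribed from Table 1.
TABLE_1 = [
    ("ABE", "b.125.1"), ("ABE", "c.55.2"), ("ABE", "b.12.1"),
    ("ABE", "d.38.1"), ("ABE", "b.68.5"),
    ("BE", "a.11.1"), ("BE", "a.118.4"), ("BE", "d.58.23"), ("BE", "f.7.1"),
    ("BE", "h.5.1"), ("BE", "a.24.1"), ("BE", "f.4.2"),
    ("B", "b.120.1"),
    ("E", "a.52.1"), ("E", "h.6.1"), ("E", "g.3.10"), ("E", "b.7.4"),
]

# The seven possible taxonomic groups, in the order used in Figure 10.
ALL_GROUPS = ["ABE", "BE", "AB", "AE", "B", "A", "E"]

counts = Counter(group for group, _ in TABLE_1)
total = len(TABLE_1)
for group in ALL_GROUPS:
    n = counts.get(group, 0)  # groups absent from the table count as zero
    print(f"{group:>3}: {n:2d} of {total} ({n / total:.0%})")
```

Listing all seven groups, including those with zero entries, makes the absence of A, AB, and AE structures as visible as the excess of BE structures.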

6.4. Full Circle: Evidence from Trees of Proteomes and Functionomes and a Tree Derived from the Distribution of Viral Replicons in Superkingdoms. While we distrust trees of systems, especially ToLs built from sequences, the use of molecular structure at high levels of structural abstraction has the potential to mitigate some limitations of sequence analysis [93]. For example, rooted ToLs built from abundance counts of domain structures and terminal GO terms in the genomes of free-living organisms describe the evolution of proteomes (e.g., [91]). All ToL reconstructions of these kinds approximate the physiology of living organisms, dissect the three primary domains of life, and reveal the early paraphyletic origin of extremophilic archaeal lineages followed by the late appearances of monophyletic Bacteria and Eukarya. These patterns have been reliably recovered with datasets of varying sizes, irrespective of the structural classification scheme [91, 122, 126, 145]. Even a tree reconstructed from the distribution of 2662 viral replicons in superkingdoms from an exhaustive comparative genomic analysis of viral genomes showed the basal placement of Archaea and the sister taxa relationship between Bacteria and Eukarya (Figure 11).
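The tree statistics reported with the parsimony reconstructions (CI, RI, HI, and RC in Figures 10 and 11) follow from three step counts: the minimum number of steps the data could require, the number of steps observed on the tree, and the maximum number of steps possible. The Python sketch below uses the standard ensemble definitions of these indices. The function name and the value used for the maximum number of steps are hypothetical, since that value is not reported here; only the tree length of 58 steps and CI = 1 come from Figure 11.

```python
def tree_indices(min_steps: int, observed_steps: int, max_steps: int) -> dict:
    """Ensemble parsimony indices from minimum, observed, and maximum steps."""
    if not (0 < min_steps <= observed_steps <= max_steps):
        raise ValueError("expected 0 < min_steps <= observed_steps <= max_steps")
    if max_steps == min_steps:
        raise ValueError("RI is undefined when max_steps equals min_steps")
    ci = min_steps / observed_steps                       # consistency index
    ri = (max_steps - observed_steps) / (max_steps - min_steps)  # retention index
    return {
        "CI": ci,
        "HI": 1.0 - ci,   # homoplasy index
        "RI": ri,
        "RC": ci * ri,    # rescaled consistency index
    }


# Figure 11: 58 steps with CI = 1, so the minimum is also 58 steps.
# max_steps = 100 is a hypothetical placeholder, not a reported value.
no_homoplasy = tree_indices(min_steps=58, observed_steps=58, max_steps=100)
print(no_homoplasy)

# A made-up tree with homoplasy, for contrast.
with_homoplasy = tree_indices(min_steps=10, observed_steps=20, max_steps=30)
print(with_homoplasy)
```

When the observed length equals the minimum length, every index takes its extreme value regardless of the maximum, which is the pattern reported for Figure 11 (CI = 1, RI = 1, HI = 0, RC = 1).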

6.5. Additional Evidence from Comparative Genomics. The uneven distribution of protein domain structures in the world

[Figure 11 graphic: tree with terminal taxa Archaea, Bacteria, and Eukarya; scale bar = 5 changes; bootstrap support = 100; abundance vectors at internal nodes (5, 0, 0, 0, 0, 0) and (36, 5, 5, 5, 0, 0).]

Figure 11: One of three optimal phylogenetic tree reconstructions with identical topologies recovered from an exhaustive maximum parsimony search (58 steps; CI = 1, RI = 1, HI = 0, RC = 1; g1 = −0.707) of the abundance of viral replicon types of dsDNA, ssDNA, dsRNA, ssRNA(+), ssRNA(−), and retrotranscribing viruses. Abundance was scored on a 0–20 scale and ranged from 0 to 759 viral replicons. Vectors of abundance reconstructions in internal nodes are given as percentage of total abundance of replicons in superkingdoms. Bootstrap support values are shown below nodes.

of proteomes (Figure 6) is preserved as we climb in the structural hierarchy. This was recently made evident when studying the evolution of CATH domains [145]. The Venn diagrams of Figure 12 show how domain structures in all taxonomic groups decrease in numbers with increases in evolutionary conservation. At the highest CATH architectural level, there were 32 universal architectures but no domain-specific architectures. The four architectures shared by Bacteria and Eukarya were present in at least 60% of the proteomes that were surveyed. The other two interdomain architectures were topological designs of considerable complexity that were poorly shared between proteomes and were evolutionarily derived. They were the clam architectures, lost in Eukarya, and the box architectures of nucleotide excision repair, shared by Archaea and Eukarya. The most parsimonious corollary of these distribution patterns is that the BE taxonomical group must arise by loss of structures in Archaea. Indeed, ToDs describing the evolution of CATH domain structures again confirm the early appearance of BE structures and, consequently, their loss in Archaea.
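The assignment of a domain structure to one of the seven taxonomic groups of a Venn diagram, together with the fraction of proteomes that harbor it, reduces to a simple presence/absence lookup. The following Python sketch shows one way to do this. All proteome and structure names are hypothetical, and the function `classify` is an illustrative helper rather than the procedure used to produce Figures 6 and 12.

```python
from typing import Dict, Set, Tuple

# Hypothetical census: each proteome belongs to one domain of life
# (A = Archaea, B = Bacteria, E = Eukarya).
proteome_domain: Dict[str, str] = {"p1": "A", "p2": "B", "p3": "B", "p4": "E"}

# Hypothetical presence data: structure -> proteomes in which it was detected.
presence: Dict[str, Set[str]] = {
    "s1": {"p1", "p2", "p3", "p4"},
    "s2": {"p2", "p4"},
    "s3": {"p3"},
}


def classify(
    presence: Dict[str, Set[str]], proteome_domain: Dict[str, str]
) -> Dict[str, Tuple[str, float]]:
    """Return (taxonomic group, distribution fraction) for each structure."""
    n_proteomes = len(proteome_domain)
    result = {}
    for structure, proteomes in presence.items():
        domains = {proteome_domain[p] for p in proteomes}
        # Build the label in the fixed order A, B, E (e.g., "BE", "ABE").
        group = "".join(d for d in "ABE" if d in domains)
        fraction = len(proteomes) / n_proteomes
        result[structure] = (group, fraction)
    return result


groups = classify(presence, proteome_domain)
for structure, (group, fraction) in groups.items():
    print(structure, group, fraction)
```

In this toy census `s2` falls in the BE group because it is detected in at least one bacterial and one eukaryal proteome and in no archaeal proteome, which is the kind of pattern discussed above for the four BE architectures.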

7. Paraphyletic Origins: Grades and Clades in Archaeal History

Saying that “a ToL is rooted in a domain of life” is an incorrect statement that comes from phylogenetic methodology (the use of outgroups) and the tendency to look at the past with modern eyes. Clades in ToLs have been rooted relative to each other by generating unrooted trees and by defining extant organisms as outgroup taxa. The ToL, however, must be considered rooted in the urancestor of cellular life (Figure 1). This planted edge that connects to the ingroup of extant organisms represents a cellular state in which productive diversification (in the sense of successful lineages) was absent. The primordial urancestral edge leads to a “phase transition,” the last universal cellular ancestor (LUCA), of which little is known. The physiology of the urancestor cannot be considered linked to that of any extant organism, even if it shared a common molecular core with all of them. The urancestor was not an archaeon, a bacterium, or a eukaryote [148]. It was not necessarily thermophilic. Perhaps it was a communal entity or a megametaorganism in the sense of a modern syncytium (the result of multiple cell fusions) and a modern coenocyte (the result of multiple cell divisions). The organismal boundaries were likely present, judging by the number of widely distributed protein domains that associate with membranes and appear at the base of our ToDs and by the universal existence of acidocalcisome organelles [149]. However, the molecular makeup of the urancestral cells was most likely fluid and quasistatistical, the repertoire distributed unequally in the urancestral populations of communal parts, of course, within confines delimited by persistence. This urancestral population is therefore consistent with the idea of a primordial stem line proposed by Kandler and Woese [150, 151]. However, it was relatively richer in molecular structures and functions, as opposed to the simple cellular systems hypothesized by Woese. This richness is confirmed by modern analyses of proteomes and functionomes that reveal a vast number of universally shared protein domains and GOs among the three superkingdoms. While each syncytial/coenocyte element of the megaorganism exchanged component parts in search of cellular stability and persistence, the process could not be equated with modern HGT. The exchanging community of primordial cells was not cohesive enough to make the horizontal exchange meaningful. Macromolecules most likely established loose and diverse associations with each other and with smaller molecules, limited by the short average life of their unevolved structural conformations. With time, molecules with better-optimized properties engaged in more durable interactions, stabilizing the emergent cells and providing increased cellular cohesiveness. This poised the urancestral community towards a phase transition (a crystallization [148]), a point in which cellular groups had distinct properties and could be individuated. We believe this was the time of the origin of the archaeal lineage, 2.9 billion years ago [48].

At the base of the ToLs that were reconstructed from genomic data, basal archaeal taxa arise as paraphyletic lineages (Figures 4(a) and 4(c)). These lineages likely arose from subgroups of the urancestral population that pervasively lost crucial domain structures and molecular functions. This represents an evolutionary grade under the scenario described above. The emerging lineages shared with the urancestral community a unifying condition that was related to archaic biochemistry. In other words, the urancestral and emerging archaeal lineages expressed fundamental structural and functional equivalences in terms of their repertoires but revealed in each emerging and durable paraphyletic lineage a handful of distinct, newly developed traits. These traits could be global, such as increased thermostability of some crucial members of the protein repertoire or change in the membrane makeup, or local, such as the selective loss of crucial structures and functions. Figure 13 uses the tree paradigm to portray the structural and functional equivalences of the urancestral and emerging archaeal lineages and the slow progression from grades to clades.

[Figure 12 graphic: three Venn diagrams (Superfamilies, Topologies, and Architectures), arranged by increasing evolutionary conservation, each showing the sharing of CATH domain structures among Archaea, Bacteria, and Eukarya.]

Figure 12: Venn diagrams displaying the distributions of 2221 homologous superfamilies, 1152 topologies, and 38 architectures of CATH domains in the proteomes of 492 fully sequenced genomes (from [145]). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

[Figure 13 graphic: cartoon of diverging lineages labeled “Grades” and “Clades” along a time axis running from Time 0 to Time 1.]

Figure 13: From grades to clades. The cartoon describes a possible progression of modes of organismal diversification during the rise of primordial archaeal lineages. The width of emerging lineages is proportional to uniquely identifying features of physiological and molecular complexity.

8. Through the Wormhole: The Makeup of the “Megaorganism” and the Emerging Archaeal Lineages

Character state reconstructions of proteome repertoires derived from ToLs, coupled to the timelines of ToDs, provide an effective way to define the ancestral protein domain complement of the urancestor [48] and, consequently, the likely makeup of the emerging archaeal lineages. The urancestral proteome possessed a lower bound of ~70 FSF domain structures, 75% of which were composed of α/β and α+β proteins. About 50% of FSFs were part of metabolic enzymes, including a rich toolkit of transferases and enzymes of nucleotide metabolism. The rest of domains were involved in functions related to information (translation, replication, and repair), intracellular processes (transport, protein modification, and proteolytic activities), regulation (kinases/phosphatases and DNA binding functions), and small molecule binding. The urancestor had a limited repertoire of aaRSs and translation factors. It contained a primordial ribosome with a limited core of universal ribosomal proteins. It had numerous membrane proteins necessary for transport, including a relatively advanced ATP synthetase complex, and structures necessary for cellular organization (filaments and primordial cytoskeletal structures). The cells lacked enzymes for deoxyribonucleotide production, so it is likely that the cellular urancestor itself did not harbor a DNA genome. The cells lacked functions related to extracellular processes (cell adhesion, immune response, and toxins/defense) and cellular motility, suggesting an ancient living world without competitive strategies of survival.

9. Conclusions

The rooting of the ToL has always been controversial in evolutionary biology [26, 152, 153]. While it is popularly accepted that the ToL based on sequence phylogenies is rooted in the akaryotes and that Archaea and Eukarya are sister groups to each other, only two of the three main steps of phylogenetic analysis [104] have been partially fulfilled with sequences. These include selecting an appropriate statistical or nonstatistical evolutionary model of character change and an optimization method for phylogenetic tree reconstruction. However, no adequate method exists for character polarization that identifies ancestral and derived character states in sequences. In the absence of robust polarization methodology, any statement about the rooting of the ToL should be considered suspect or subject to apriorism. Here we show that information derived from a genomic structural and functional census of millions of encoded proteins and RNAs, coupled with process models that comply with Weston’s generality criterion, provides the means to dissect the origins of diversified life. The generality criterion is fulfilled in these studies by focusing on the accumulation of modules, such as protein domain structures, elements of RNA substructures, or ontological definitions of molecular function. In general, these features are the subject of accretion processes that comply with additive phylogenetic change within the nested taxonomic hierarchy and result in changes of abundance. These processes include those responsible for the growth of molecules (e.g., multidomain proteins), molecular ensembles (e.g., the ribosome), and molecular repertoires (e.g., proteomes). The new methods unfold a consistent evolutionary scenario in which the origin of diversified life traces back to the early history of Archaea. Remarkably, the archaic origin of this microbial urkingdom now does justice to its name.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank members and friends of the GCA laboratory for fruitful discussions. This research has been supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791) and the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625) to Gustavo Caetano-Anollés and from KRIBB Research Initiative Program and the Next-Generation BioGreen 21 Program, Rural Development Administration (PJ0090192013), to Kyung Mo Kim.

References

[1] C. R. Woese, “A new biology for a new century,” Microbiology and Molecular Biology Reviews, vol. 68, no. 2, pp. 173–186, 2004.

[2] T. Grant, “Testing methods: the evaluation of discovery operations in evolutionary biology,” Cladistics, vol. 18, no. 1, pp. 94–111, 2002.

[3] E. O. Wiley, “Karl R. Popper, systematics, and classification: a reply to Walter Bock and other evolutionary taxonomists,” Systematic Zoology, vol. 24, no. 2, pp. 233–243, 1975.

[4] J. S. Farris, “Parsimony and explanatory power,” Cladistics, vol. 24, no. 5, pp. 825–847, 2008.

[5] G. Caetano-Anollés, K. M. Kim, and D. Caetano-Anollés, “The phylogenomic roots of modern biochemistry: origins of proteins, cofactors and protein biosynthesis,” Journal of Molecular Evolution, vol. 74, no. 1-2, pp. 1–34, 2012.

[6] W. Fontana, “Modeling “evo-devo” with RNA,” BioEssays, vol. 24, no. 12, pp. 1164–1177, 2002.

[7] W. Hennig, Phylogenetic Systematics, University of Illinois Press, Urbana, Ill, USA, 1999.

[8] D. Fraix-Burnet, T. Chattopadhyay, A. K. Chattopadhyay, E. Davoust, and M. Thuillard, “A six-parameter space to describe galaxy diversification,” Astronomy and Astrophysics, vol. 545, article A80, 24 pages, 2012.

[9] E. Sober, “The contest between parsimony and likelihood,” Systematic Biology, vol. 53, no. 4, pp. 644–653, 2004.

[10] E. K. Lienau and R. DeSalle, “Is the microbial tree of life verificationist?” Cladistics, vol. 26, no. 2, pp. 195–201, 2010.

[11] J. S. Huxley, “Evolutionary processes and taxonomy with special reference to grades,” Uppsala Universitets Arsskrift, vol. 6, pp. 21–39, 1958.

[12] C. R. Woese and G. E. Fox, “Phylogenetic structure of the prokaryotic domain: the primary kingdoms,” Proceedings of the National Academy of Sciences of the United States of America, vol. 74, no. 11, pp. 5088–5090, 1977.

[13] P. Forterre, “The universal tree of life and the last universal cellular ancestor: revolution and counterrevolutions,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 43–62, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[14] J. P. Gogarten, H. Kibak, P. Dittrich et al., “Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 17, pp. 6661–6665, 1989.

[15] N. Iwabe, K. Kuma, M. Hasegawa, S. Osawa, and T. Miyata, “Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 23, pp. 9355–9359, 1989.

[16] J. P. Gogarten and L. Olendzenski, “Orthologs, paralogs and genome comparisons,” Current Opinion in Genetics and Development, vol. 9, no. 6, pp. 630–636, 1999.

[17] C. R. Woese, O. Kandler, and M. L. Wheelis, “Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya,” Proceedings of the National Academy of Sciences of the United States of America, vol. 87, no. 12, pp. 4576–4579, 1990.

[18] J. Martijn and T. J. Ettema, “From archaeon to eukaryote: the evolutionary dark ages of the eukaryotic cell,” Biochemical Society Transactions, vol. 41, no. 1, pp. 451–457, 2013.

[19] C. J. Cox, P. G. Foster, R. P. Hirt, S. R. Harris, and T. M. Embley, “The archaebacterial origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 51, pp. 20356–20361, 2008.

[20] P. G. Foster, C. J. Cox, and T. M. Embley, “The primary divisions of life: a phylogenomic approach employing composition-heterogeneous methods,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2197–2207, 2009.

[21] T. A. Williams, P. G. Foster, T. M. Nye, C. J. Cox, and T. M. Embley, “A congruent phylogenomic signal places eukaryotes within the Archaea,” Proceedings of the Royal Society B: Biological Sciences, vol. 279, no. 1749, pp. 4870–4879, 2012.

[22] S. Gribaldo, A. M. Poole, V. Daubin, P. Forterre, and C. Brochier-Armanet, “The origin of eukaryotes and their relationship with the Archaea: are we at a phylogenomic impasse?” Nature Reviews Microbiology, vol. 8, no. 10, pp. 743–752, 2010.

[23] H. Philippe and P. Forterre, “The rooting of the universal tree of life is not reliable,” Journal of Molecular Evolution, vol. 49, no. 4, pp. 509–523, 1999.

[24] E. Sober and M. Steel, “Testing the hypothesis of common ancestry,” Journal of Theoretical Biology, vol. 218, no. 4, pp. 395–408, 2002.

[25] P. Forterre, “The common ancestor of Archaea and Eukarya was not an archaeon,” Archaea, vol. 2013, Article ID 372396, 18 pages, 2013.

[26] P. Forterre and H. Philippe, “Where is the root of the universal tree of life?” BioEssays, vol. 21, no. 10, pp. 871–879, 1999.

[27] S. Gribaldo and H. Philippe, “Ancient phylogenetic relationships,” Theoretical Population Biology, vol. 61, no. 4, pp. 391–408, 2002.

[28] J. K. Harris, S. T. Kelley, G. B. Spiegelman, and N. R. Pace, “The genetic core of the universal ancestor,” Genome Research, vol. 13, no. 3, pp. 407–412, 2003.

[29] A. Harish and G. Caetano-Anollés, “Ribosomal history reveals origins of modern protein synthesis,” PLoS ONE, vol. 7, no. 3, Article ID e32776, 2012.

[30] G. Caetano-Anollés and M. J. Seufferheld, “The coevolutionary roots of biochemistry and cellular organization challenge the RNA world paradigm,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 152–177, 2013.

[31] N. R. Pace, “Mapping the tree of life: progress and prospects,” Microbiology and Molecular Biology Reviews, vol. 73, no. 4, pp. 565–576, 2009.

[32] P. de Rijk, Y. van de Peer, I. van den Broeck, and R. de Wachter, “Evolution according to large ribosomal subunit RNA,” Journal of Molecular Evolution, vol. 41, no. 3, pp. 366–375, 1995.

[33] G. Caetano-Anollés, “Tracing the evolution of RNA structure in ribosomes,” Nucleic Acids Research, vol. 30, no. 11, pp. 2575–2587, 2002.

[34] J. Mallatt and C. J. Winchell, “Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes,” Molecular Biology and Evolution, vol. 19, no. 3, pp. 289–301, 2002.

[35] M. Gerstein, “Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census,” Proteins: Structure, Function and Bioinformatics, vol. 33, no. 4, pp. 518–534, 1998.

[36] M. Gerstein and H. Hegyi, “Comparing genomes in terms of protein structure: surveys of a finite parts list,” FEMS Microbiology Reviews, vol. 22, no. 4, pp. 277–304, 1998.

[37] B. Snel, P. Bork, and M. A. Huynen, “Genome phylogeny based on gene content,” Nature Genetics, vol. 21, no. 1, pp. 108–110, 1999.

[38] F. Tekaia, A. Lazcano, and B. Dujon, “The genomic tree as revealed from whole proteome comparisons,” Genome Research, vol. 9, no. 6, pp. 550–557, 1999.

[39] Y. I. Wolf, S. E. Brenner, P. A. Bash, and E. V. Koonin, “Distribution of protein folds in the three superkingdoms of life,” Genome Research, vol. 9, no. 1, pp. 17–26, 1999.

[40] J. O. Korbel, B. Snel, M. A. Huynen, and P. Bork, “SHOT: a web server for the construction of genome phylogenies,” Trends in Genetics, vol. 18, no. 3, pp. 158–162, 2002.

[41] G. Caetano-Anollés and D. Caetano-Anollés, “An evolutionarily structured universe of protein architecture,” Genome Research, vol. 13, no. 7, pp. 1563–1571, 2003.

[42] S. L. Baldauf, A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle, “A kingdom-level phylogeny of eukaryotes based on combined protein data,” Science, vol. 290, no. 5493, pp. 972–977, 2000.

[43] J. R. Brown, C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope, “Universal trees based on large combined protein sequence data sets,” Nature Genetics, vol. 28, no. 3, pp. 281–285, 2001.

[44] F. D. Ciccarelli, T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P. Bork, “Toward automatic reconstruction of a highly resolved tree of life,” Science, vol. 311, no. 5765, pp. 1283–1287, 2006.

[45] T. Dagan, M. Roettger, D. Bryant, and W. Martin, “Genome networks root the tree of life between prokaryotic domains,” Genome Biology and Evolution, vol. 2, no. 1, pp. 379–392, 2010.

[46] S. Jun, G. E. Sims, G. A. Wu, and S. Kim, “Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 133–138, 2010.

[47] E. A. Schultes, P. T. Hraber, and T. H. LaBean, “No molecule is an island: molecular evolution and the study of sequence space,” in Algorithmic Bioprocesses, A. Condon, D. Harel, J. N. Kok, A. Salomaa, and E. Winfree, Eds., pp. 675–704, Springer, Berlin, Germany, 2009.

[48] K. M. Kim and G. Caetano-Anollés, “The proteomic complexity and rise of the primordial ancestor of diversified life,” BMC Evolutionary Biology, vol. 11, no. 1, article 140, 2011.

[49] V. A. Albert, Parsimony, Phylogeny, and Genomics, Oxford University Press, Oxford, UK, 2005.

[50] G. Caetano-Anollés, “Evolved RNA secondary structure and the rooting of the universal tree of life,” Journal of Molecular Evolution, vol. 54, no. 3, pp. 333–345, 2002.

[51] J. A. Lake, R. G. Skophammer, C. W. Herbold, and J. A. Servin, “Genome beginnings: rooting the tree of life,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2177–2185, 2009.

[52] B. Guo, M. Zou, and A. Wagner, “Pervasive indels and their evolutionary dynamics after the fish-specific genome duplication,” Molecular Biology and Evolution, vol. 29, no. 10, pp. 3005–3022, 2012.

[53] M. K. Basu, I. B. Rogozin, O. Deusch, T. Dagan, W. Martin, and E. V. Koonin, “Evolutionary dynamics of introns in plastid-derived genes in plants: saturation nearly reached but slow intron gain continues,” Molecular Biology and Evolution, vol. 25, no. 1, pp. 111–119, 2008.

[54] J. S. Farris, “Phylogenetic analysis under Dollo’s Law,” Systematic Biology, vol. 26, no. 1, pp. 77–88, 1977.

[55] H. Fang, M. E. Oates, R. B. Pethica et al., “A daily-updated tree of (sequenced) life as a reference for genome research,” Scientific Reports, vol. 3, article 2015, 2013.

[56] T. Cavalier-Smith, “The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification,” International Journal of Systematic and Evolutionary Microbiology, vol. 52, no. 1, pp. 7–76, 2002.

[57] T. Cavalier-Smith, “Rooting the tree of life by transition analyses,” Biology Direct, vol. 1, article 19, 2006.

[58] H. S. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Widespread recruitment of ancient domain structures in modern enzymes during metabolic evolution,” Journal of Integrative Bioinformatics, vol. 10, no. 1, article 214, 2013.

[59] G. Caetano-Anollés, M. Wang, and D. Caetano-Anollés, “Structural phylogenomics retrodicts the origin of the genetic code and uncovers the evolutionary impact of protein flexibility,” PLoS ONE, vol. 8, no. 8, Article ID e72225, 2013.

[60] E. Bapteste and C. Brochier, “On the conceptual difficulties in rooting the tree of life,” Trends in Microbiology, vol. 12, no. 1, pp. 9–13, 2004.

[61] A. M. Poole, “Horizontal gene transfer and the earliest stages of the evolution of life,” Research in Microbiology, vol. 160, no. 7, pp. 473–480, 2009.

[62] D. Raoult, “The post-Darwinist rhizome of life,” The Lancet, vol. 375, no. 9709, pp. 104–105, 2010.

[63] W. F. Doolittle, “Phylogenetic classification and the universal tree,” Science, vol. 284, no. 5423, pp. 2124–2128, 1999.

[64] T. Dagan and W. Martin, “The tree of one percent,” Genome Biology, vol. 7, no. 10, article 118, 2006.

[65] W. F. Doolittle and E. Bapteste, “Pattern pluralism and the tree of life hypothesis,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 7, pp. 2043–2049, 2007.

[66] E. V. Koonin, “Towards a postmodern synthesis of evolutionary biology,” Cell Cycle, vol. 8, no. 6, pp. 799–800, 2009.

[67] T. Kloesges, O. Popa, W. Martin, and T. Dagan, “Networks of gene sharing among 329 proteobacterial genomes reveal differences in lateral gene transfer frequency at different phylogenetic depths,” Molecular Biology and Evolution, vol. 28, no. 2, pp. 1057–1074, 2011.

[68] S. Halary, J. W. Leigh, B. Cheaib, P. Lopez, and E. Bapteste, “Network analyses structure genetic diversity in independent genetic worlds,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 127–132, 2010.

[69] S. S. Abby, E. Tannier, M. Gouy, and V. Daubin, “Lateral gene transfer as a support for the tree of life,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 13, pp. 4962–4967, 2012.

[70] N. Glansdorff, Y. Xu, and B. Labedan, “The conflict between horizontal gene transfer and the safeguard of identity: origin of meiotic sexuality,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 470–480, 2009.

[71] C. G. Kurland and O. G. Berg, “A hitchhiker’s guide to evolving networks,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 361–396, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[72] L. P. Villarreal and G. Witzany, “The DNA habitat and its RNA inhabitants: at the dawn of RNA sociology,” Genomics Insights, vol. 6, pp. 1–12, 2013.

[73] S. Gribaldo and C. Brochier, “Phylogeny of prokaryotes: does it exist and why should we care?” Research in Microbiology, vol. 160, no. 7, pp. 513–521, 2009.

[74] B. D. Mishler, “The logic of the data matrix in phylogenetic analysis,” in Parsimony, Phylogeny, and Genomics, V. A. Albert, Ed., pp. 57–69, Oxford University Press, New York, NY, USA, 2005.

[75] A. Dress, V. Moulton, M. Steel, and T. Wu, “Species, clusters and the “tree of life”: a graph-theoretic perspective,” Journal of Theoretical Biology, vol. 265, no. 4, pp. 535–542, 2010.

[76] N. Lane and W. Martin, “The energetics of genome complexity,” Nature, vol. 467, no. 7318, pp. 928–934, 2010.

[77] M. C. Rivera and J. A. Lake, “The ring of life provides evidence for a genome fusion origin of eukaryotes,” Nature, vol. 431, no. 7005, pp. 152–155, 2004.

[78] D. Alvarez-Ponce and J. O. McInerney, “The human genome retains relics of its prokaryotic ancestry: human genes of archaebacterial and eubacterial origin exhibit remarkable differences,” Genome Biology and Evolution, vol. 3, no. 1, pp. 782–790, 2011.

[79] A. Poole and D. Penny, “Eukaryote evolution: engulfed by speculation,” Nature, vol. 447, no. 7147, p. 913, 2007.

[80] E. V. Koonin and Y. I. Wolf, “The fundamental units, processes and patterns of evolution, and the tree of life conundrum,” Biology Direct, vol. 4, no. 1, article 33, 2009.

[81] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Search for a “tree of life” in the thicket of the phylogenetic forest,” Journal of Biology, vol. 8, no. 6, article 59, 2009.

[82] P Puigbo Y I Wolf and E V Koonin ldquoSeeing the tree of lifebehind the phylogenetic forestrdquo BMC Biology vol 11 article 462013

[83] D. B. Wetlaufer, “Nucleation, rapid folding, and globular intrachain regions in proteins,” Proceedings of the National Academy of Sciences of the United States of America, vol. 70, no. 3, pp. 697–701, 1973.

[84] J. S. Richardson, “The anatomy and taxonomy of protein structure,” Advances in Protein Chemistry, vol. 34, pp. 167–339, 1981.

[85] M. Riley and B. Labedan, “Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module,” Journal of Molecular Biology, vol. 268, no. 5, pp. 857–868, 1997.

[86] J. Janin and S. J. Wodak, “Structural domains in proteins and their role in the dynamics of protein function,” Progress in Biophysics and Molecular Biology, vol. 42, pp. 21–78, 1983.

[87] G. Caetano-Anollés, M. Wang, D. Caetano-Anollés, and J. E. Mittenthal, “The origin, evolution and structure of the protein world,” Biochemical Journal, vol. 417, no. 3, pp. 621–637, 2009.

[88] J. Gough, “Convergent evolution of domain architectures (is rare),” Bioinformatics, vol. 21, no. 8, pp. 1464–1471, 2005.

[89] K. Forslund, A. Henricson, V. Hollich, and E. L. L. Sonnhammer, “Domain tree-based analysis of protein architecture evolution,” Molecular Biology and Evolution, vol. 25, no. 2, pp. 254–264, 2008.

[90] S. Yang and P. E. Bourne, “The evolutionary history of protein domains viewed by species phylogeny,” PLoS ONE, vol. 4, no. 12, Article ID e8378, 2009.

[91] K. M. Kim and G. Caetano-Anollés, “The evolutionary history of protein fold families and proteomes confirms that the archaeal ancestor is more ancient than the ancestors of other superkingdoms,” BMC Evolutionary Biology, vol. 12, article 13, 2012.

[92] M. Wang and G. Caetano-Anollés, “The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world,” Structure, vol. 17, no. 1, pp. 66–78, 2009.

[93] G. Caetano-Anollés and A. Nasir, “Benefits of using molecular structure and abundance in phylogenomic analysis,” Frontiers in Genetics, vol. 3, article 172, 2012.

[94] K. Brading and E. Castellani, Symmetries in Physics: Philosophical Reflections, Cambridge University Press, Cambridge, UK, 2003.

[95] J. A. Cotton and M. Wilkinson, “Quantifying the potential utility of phylogenetic characters,” Taxon, vol. 57, no. 1, pp. 131–136, 2008.

[96] E. Mossel and M. Steel, “A phase transition for a random cluster model on phylogenetic trees,” Mathematical Biosciences, vol. 187, no. 2, pp. 189–203, 2004.

[97] M. H. Bailor, X. Sun, and H. M. Al-Hashimi, “Topology links RNA secondary structure with global conformation, dynamics, and adaptation,” Science, vol. 327, no. 5962, pp. 202–206, 2010.

[98] K. Illergard, D. H. Ardell, and A. Elofsson, “Structure is three to ten times more conserved than sequence – a study of structural response in protein cores,” Proteins: Structure, Function and Bioinformatics, vol. 77, no. 3, pp. 499–508, 2009.

[99] G. Caetano-Anollés, S. K. Hee, and J. E. Mittenthal, “The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 22, pp. 9358–9363, 2007.

[100] P. H. Weston, “Methods for rooting cladistic trees,” in Models in Phylogeny Reconstruction, D. J. Siebert, R. W. Scotland, and D. M. Williams, Eds., pp. 125–155, Oxford University Press, Oxford, UK, 1994.

[101] H. N. Bryant, “The polarization of character transformations in phylogenetic systematics: role of axiomatic and auxiliary assumptions,” Systematic Biology, vol. 40, no. 4, pp. 433–445, 1991.

[102] H. N. Bryant and G. Wagner, “Character polarity and the rooting of cladograms,” in The Character Concept in Evolutionary Biology, G. P. Wagner, Ed., pp. 319–338, Academic Press, New York, NY, USA, 2001.

[103] G. Nelson, “Ontogeny, phylogeny, paleontology, and the biogenetic law,” Systematic Biology, vol. 27, no. 3, pp. 324–345, 1978.

[104] P. H. Weston, “Indirect and direct methods in systematics,” in Ontogeny and Systematics, C. J. Humphries, Ed., pp. 27–56, Columbia University Press, New York, NY, USA, 1988.

[105] F.-J. Sun and G. Caetano-Anollés, “The evolutionary history of the structure of 5S ribosomal RNA,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 430–443, 2009.

[106] X. Fang, T. Pan, and T. R. Sosnick, “Mg2+-dependent folding of a large ribozyme without kinetic traps,” Nature Structural Biology, vol. 6, no. 12, pp. 1091–1095, 1999.

[107] F.-J. Sun and G. Caetano-Anollés, “Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses,” PLoS Computational Biology, vol. 4, no. 3, Article ID e1000018, 2008.

[108] F.-J. Sun and G. Caetano-Anollés, “The origin and evolution of tRNA inferred from phylogenetic analysis of structure,” Journal of Molecular Evolution, vol. 66, no. 1, pp. 21–35, 2008.

[109] W. R. Ashby, An Introduction to Cybernetics, Taylor & Francis, London, UK, 1955.

[110] J. G. Lundberg, “Wagner networks and ancestors,” Systematic Biology, vol. 21, no. 4, pp. 398–413, 1972.

[111] P. T. Chippindale and J. J. Wiens, “Weighting, partitioning, and combining characters in phylogenetic analysis,” Systematic Biology, vol. 43, no. 2, pp. 278–287, 1994.

[112] J. Mittenthal, D. Caetano-Anollés, and G. Caetano-Anollés, “Biphasic patterns of diversification and the emergence of modules,” Frontiers in Genetics, vol. 3, article 147, 2012.

[113] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[114] C. A. Orengo, A. Michie, S. Jones, D. T. Jones, M. Swindells, and J. M. Thornton, “CATH – a hierarchic classification of protein domain structures,” Structure, vol. 5, no. 8, pp. 1093–1109, 1997.

[115] G. Sawa, J. Dicks, and I. N. Roberts, “Current approaches to whole genome phylogenetic analysis,” Briefings in Bioinformatics, vol. 4, no. 1, pp. 63–74, 2003.

[116] R. Sathyapriya, J. M. Duarte, H. Stehr, I. Filippis, and M. Lappe, “Defining an essence of structure determining residue contacts in proteins,” PLoS Computational Biology, vol. 5, no. 12, Article ID e1000584, 2009.

[117] D. Penny and L. J. Collins, “Evolutionary genomics leads the way,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 1–16, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[118] M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.

[119] A. Andreeva, D. Howorth, J. Chandonia et al., “Data growth and its impact on the SCOP database: new developments,” Nucleic Acids Research, vol. 36, pp. D419–D425, 2008.

[120] L. S. Yafremava, M. Wielgos, S. Thomas et al., “A general framework of persistence strategies for biological systems helps explain domains of life,” Frontiers in Genetics, vol. 4, article 16, 2013.

[121] J. L. Friar, T. Goldman, and J. Perez-Mercader, “Genome sizes and the Benford distribution,” PLoS ONE, vol. 7, no. 5, Article ID e36624, 2012.

[122] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya,” BMC Evolutionary Biology, vol. 12, article 156, 2012.

[123] A. Nasir and G. Caetano-Anollés, “Comparative analysis of proteomes and functionomes provides insights into origins of cellular diversification,” Archaea, vol. 2013, Article ID 648746, 13 pages, 2013.

[124] W. Martin and M. Muller, “The hydrogen hypothesis for the first eukaryote,” Nature, vol. 392, no. 6671, pp. 37–41, 1998.

[125] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Global patterns of protein domain gain and loss in superkingdoms,” PLoS Computational Biology, vol. 10, no. 1, Article ID e1003452, 2014.

[126] M. Wang, L. S. Yafremava, D. Caetano-Anollés, J. E. Mittenthal, and G. Caetano-Anollés, “Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world,” Genome Research, vol. 17, no. 11, pp. 1572–1585, 2007.

[127] L. Guy and T. J. G. Ettema, “The archaeal “TACK” superphylum and the origin of eukaryotes,” Trends in Microbiology, vol. 19, no. 12, pp. 580–587, 2011.

[128] D. Schlieper, M. A. Oliva, J. M. Andreu, and J. Lowe, “Structure of bacterial tubulin BtubA/B: evidence for horizontal gene transfer,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 26, pp. 9170–9175, 2005.

[129] M. P. Hoeppner, P. P. Gardner, and A. M. Poole, “Comparative analysis of RNA families reveals distinct repertoires for each domain of life,” PLoS Computational Biology, vol. 8, no. 11, Article ID e1002752, 2012.

[130] H. Xue, K. Tong, C. Marck, H. Grosjean, and J. T. Wong, “Transfer RNA paralogs: evidence for genetic code-amino acid biosynthesis coevolution and an archaeal root of life,” Gene, vol. 310, no. 1-2, pp. 59–66, 2003.

[131] H. Xue, S. Ng, K. Tong, and J. T. Wong, “Congruence of evidence for a Methanopyrus-proximal root of life based on transfer RNA and aminoacyl-tRNA synthetase genes,” Gene, vol. 360, no. 2, pp. 120–130, 2005.

[132] M. Di Giulio, “Nanoarchaeum equitans is a living fossil,” Journal of Theoretical Biology, vol. 242, no. 1, pp. 257–260, 2006.

[133] M. Di Giulio, “The tree of life might be rooted in the branch leading to Nanoarchaeota,” Gene, vol. 401, no. 1-2, pp. 108–113, 2007.

[134] J. T. Wong, J. Chen, W. Mat, S. Ng, and H. Xue, “Polyphasic evidence delineating the root of life and roots of biological domains,” Gene, vol. 403, no. 1-2, pp. 39–52, 2007.

[135] F.-J. Sun and G. Caetano-Anollés, “The ancient history of the structure of ribonuclease P and the early origins of Archaea,” BMC Bioinformatics, vol. 11, article 153, 2010.

[136] C. Marck and H. Grosjean, “tRNomics: analysis of tRNA genes from 50 genomes of Eukarya, Archaea, and bacteria reveals anticodon-sparing strategies and domain-specific features,” RNA, vol. 8, no. 10, pp. 1189–1232, 2002.

[137] W. Mat, H. Xue, and J. T. Wong, “The genomics of LUCA,” Frontiers in Bioscience, vol. 13, no. 14, pp. 5605–5613, 2008.

[138] C. Brochier, S. Gribaldo, Y. Zivanovic, F. Confalonieri, and P. Forterre, “Nanoarchaea: representatives of a novel archaeal phylum or a fast-evolving euryarchaeal lineage related to Thermococcales?” Genome Biology, vol. 6, no. 5, article R42, 2005.

[139] B. Billoud, M. Guerrucci, M. Masselot, and J. S. Deutsch, “Cirripede phylogeny using a novel approach: molecular morphometrics,” Molecular Biology and Evolution, vol. 17, no. 10, pp. 1435–1445, 2000.

[140] L. J. Collins, V. Moulton, and D. Penny, “Use of RNA secondary structure for studying the evolution of RNase P and RNase MRP,” Journal of Molecular Evolution, vol. 51, no. 3, pp. 194–204, 2000.

[141] G. Caetano-Anollés, “Novel strategies to study the role of mutation and nucleic acid structure in evolution,” Plant Cell, Tissue and Organ Culture, vol. 67, no. 2, pp. 115–132, 2001.

[142] F.-J. Sun, S. Fleurdepine, C. Bousquet-Antonelli, G. Caetano-Anollés, and J. Deragon, “Common evolutionary trends for SINE RNA structures,” Trends in Genetics, vol. 23, no. 1, pp. 26–33, 2007.

[143] D. Caetano-Anollés, K. M. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Proteome evolution and the metabolic origins of translation and cellular life,” Journal of Molecular Evolution, vol. 72, no. 1, pp. 14–33, 2011.

[144] M. Wang, Y. Jiang, K. M. Kim et al., “A universal molecular clock of protein folds and its power in tracing the early history of aerobic metabolism and planet oxygenation,” Molecular Biology and Evolution, vol. 28, no. 1, pp. 567–582, 2011.

[145] S. A. Bukhari and G. Caetano-Anollés, “Origin and evolution of protein fold designs inferred from phylogenomic analysis of CATH domain structures in proteomes,” PLoS Computational Biology, vol. 9, no. 3, Article ID e1003009, 2013.

[146] M. Vesteg and J. Krajcovic, “The falsifiability of the models for the origin of eukaryotes,” Current Genetics, vol. 57, no. 6, pp. 367–390, 2011.

[147] I. N. Berezovsky and E. I. Shakhnovich, “Physics and evolution of thermophilic adaptation,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 36, pp. 12742–12747, 2005.

[148] C. Woese, “The universal ancestor,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 12, pp. 6854–6859, 1998.

[149] M. J. Seufferheld and G. Caetano-Anollés, “Phylogenomics supports a cellularly structured urancestor,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 178–191, 2013.

[150] O. Kandler, “Cell wall biochemistry and three-domain concept of life,” Systematic and Applied Microbiology, vol. 16, no. 4, pp. 501–509, 1994.

[151] C. R. Woese, “On the evolution of cells,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 13, pp. 8742–8747, 2002.

[152] S. L. Baldauf, J. D. Palmer, and W. F. Doolittle, “The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny,” Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 15, pp. 7749–7754, 1996.

[153] H. Leffers, J. Kjems, L. Østergaard, N. Larsen, and R. A. Garrett, “Evolutionary relationships amongst archaebacteria. A comparative study of 23 S ribosomal RNAs of a sulphur-dependent extreme thermophile, an extreme halophile and a thermophilic methanogen,” Journal of Molecular Biology, vol. 195, no. 1, pp. 43–61, 1987.

[154] J. Martin, D. Blackburn, and E. Wiley, “Are node-based and stem-based clades equivalent? Insights from graph theory,” PLoS Currents, vol. 2, Article ID RRN1196, 2010.


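[Figure 1 image. Panel (a): node-based tree with vertices Eukarya (E), Archaea (A), Bacteria (B), the urancestor (U), and the ancestor (a). Panel (b): stem-based tree showing crown and stem groups along a time axis that starts at the origin of lineages. Panel (c): canonical, eukaryal, and archaeal rootings drawn along a time axis.]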

Figure 1: Rooting the tree of life (ToL): an exercise for “tree-thinkers.” (a) A node-based tree representation of the ToL focuses on taxa (sampled or inferred vertices, illustrated with open circles) and models edges as ancestry relationships. The unrooted tree describes taxa as conglomerates of extant species for each domain of life: Archaea (A), Bacteria (B), and Eukarya (E) (abbreviations are used throughout the paper). Adding a universal ancestor vertex (rooting the ToL) implies adding a reconstructed entity (urancestor) that roots the tree but does not model ancestry relationships. The vertex is either a “most recent universal ancestor,” if one defines it from an ingroup perspective, or a “last universal common ancestor” (LUCA), if the definition relates to an outgroup perspective. Rooting the tree enables defining “total lineages,” which are lists of ancestors spanning from the ancestral taxon to the domain taxa. (b) A stem-based view focuses on edges (branches); sampled and inferred ancestral taxa are viewed as lineages under the paradigm of descent with modification. Vertices correspond to speciation events. Terminal edges represent conglomerates of lineages leading to domains of life (the terminal nodes of node-based trees), and the ancestral stem represents the lineage of the urancestor (U) (double arrowhead line). Total lineages are simply a chain of edges that goes back in time and ends in the ancestral stem. Node-based and stem-based tree representations are mathematically isomorphic, but they are not equal [154]. They change the concept of monophyletic and paraphyletic relationships. A node-based clade starts with a lineage at the instant of the splitting event, incorporating the ancestor into the makeup of the clade. In contrast, a stem-based clade originates with a planted branch on the tree, where the branch represents a lineage between two lineage splitting events. Planting an ancestral stem defines an origin of lineages and the first speciation event in the record of life. This delimits a crown clade of two domain lineages in the stem-based tree (labeled with black lines), which includes the ancestor of the sister groups, a, and a stem domain group at its base. (c) The three possible rootings of the ToL depicted with stem-based tree representations. Terminal edges are labeled with thin lines, and these conglomerate lineages can include stem and crown groups.

4. Building a Tower of Babel from a Comparative Genomic Patchwork of Sequence Homologies

Molecular evolutionists were cognizant of the limitation of looking at the history of only a few component parts, which by definition could be divergent. When genomic sequences became widely available, pioneers jumped onto the bandwagon of evolutionary genomics and the possibility of gaining systemic knowledge from entire repertoires of genes and molecules (e.g., [35–39]). The genomic revolution, for example, quickly materialized in gene content trees that reconstructed the evolution of genomes directly from their evolutionary units, the genes (e.g., [37, 39, 40]), or the domain constituents of the translated proteins [35, 41]. The sequences of multiple genes were also combined or concatenated in attempts to extract deep phylogenetic signal [42–44]. The results that were obtained consistently supported the tripartite world, backing up the claims of the Urbana School. However, clues about ancestors and lineages leading to extant taxa were still missing.
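To make the idea of a gene content comparison concrete, the following minimal sketch computes pairwise distances between genomes treated as sets of gene families. It is an illustration only, not the procedure of the cited studies: the Jaccard distance, the function names, and the toy gene repertoires are all assumptions introduced here.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| (0.0 when both sets are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def gene_content_distances(genomes):
    """Map each unordered pair of genome names to its gene content distance."""
    return {
        (x, y): jaccard_distance(genomes[x], genomes[y])
        for x, y in combinations(sorted(genomes), 2)
    }

# Hypothetical repertoires: gene family identifiers per genome (toy data).
toy_genomes = {
    "A": {"g1", "g2", "g3", "g4"},
    "B": {"g1", "g2", "g5", "g6"},
    "E": {"g1", "g2", "g3", "g7"},
}
distances = gene_content_distances(toy_genomes)
for pair, value in distances.items():
    print(pair, round(value, 3))
```

A distance of this kind is symmetric, so a tree built from it carries no direction of change; any root has to come from additional information.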

[Figure 2 image. Timeline with the axes “Molecular age (nd)” and “History of life (time, billions of years),” the latter running from 3.8 to 0 (the present). Labeled events: origin of proteins, urancestor, primordial ribosome, mature PTC, processive ribosome, mature ribosome, and present-day ribosome. Labeled spans: ribosomal core, diversified life, and ribosomal history.]

Figure 2: The evolutionary history of the ribosome was traced onto the three-dimensional structure of its rRNA and r-protein components. Ages of components were colored with hues from red (ancient) to blue (recent). Phylogenomic history shows that primordial metabolic enzymes preceded RNA-protein interactions and the ribosome. The timeline of life, derived from a universal phylogenetic tree of protein domain structure at the fold superfamily level of complexity, is shown with time flowing from left to right and expressed in billions of years according to a molecular clock of fold structures. The ribosomal timeline highlights major historical events of the molecular ensemble. The history of the conserved ribosomal core [29] overlaps with that of diversified life [48], suggesting that episodes of cooption and lateral transfer have pervaded early ribosomal evolution.

With few exceptions that focused on the structure of RNA and protein molecules (see below), analyses based on genomic sequences, gene content, gene order, and other genomic characteristics were unable to produce rooted trees without the help of outgroups, additional ad hoc hypotheses that are external to the group of organisms being studied and generally carry strong assumptions. An “arrow of time” was not included in the models of genomic evolution that were used. As with sequences, the ToLs that were generated were unrooted (Figure 1(a)) and generally rooted a posteriori, either claiming that the canonical root was correct or making assumptions about character change that may be strictly incorrect. In a recent example, distance-based approaches were used to build universal network trees from gene families defined by reciprocal best BLAST hits [45]. These networks showed a midpoint rooting of the ToL between Bacteria and Archaea. However, this rooting involves a complex optimization of path lengths in the split networks and critically assumes that lineages evolved at roughly similar rates. This diminishes confidence in the midpoint rooting, especially when considering the uncertainties of distances inferred from BLAST analyses and the fact that domains in genes hold different histories and rates of change. Another promising but ill-conceptualized case is the rooting of distance-based trees inferred by studying the frequency of l-mer sets of amino acids in proteins at the proteome level [46]. The compositional data generated a ToL that was rooted in Eukarya when randomized proteome sequences were used as outgroups. The assumption of randomness associated with the root of a ToL is, however, unsupported or probably wrong, especially because protein sequence space and its mappings to structure are far from random [47]. More importantly, a large fraction of modern proteins had already evolved protein fold structures when the first diversified lineages arose from the universal common ancestor of cellular life [48]. These evolved repertoires cannot be claimed to be random.
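The dependence of midpoint rooting on similar evolutionary rates can be illustrated with a small sketch. It works on a plain three-taxon tree rather than on the split networks of [45], and the tree encoding, function names, and branch lengths are hypothetical choices made for this example. The root is placed halfway along the longest leaf-to-leaf path.

```python
def farthest(tree, start):
    """Return (node, distance, path) for the node farthest from `start`."""
    best = (start, 0.0, [start])
    stack = [(start, None, 0.0, [start])]
    while stack:
        node, parent, dist, path = stack.pop()
        if dist > best[1]:
            best = (node, dist, path)
        for neighbor, length in tree[node].items():
            if neighbor != parent:  # a tree has no cycles, so this suffices
                stack.append((neighbor, node, dist + length, path + [neighbor]))
    return best

def midpoint_root(tree):
    """Return (u, v, offset): the root sits on edge u-v, `offset` away from u."""
    end1, _, _ = farthest(tree, next(iter(tree)))
    _, length, path = farthest(tree, end1)  # longest path in the tree
    half, walked = length / 2.0, 0.0
    for u, v in zip(path, path[1:]):
        if walked + tree[u][v] >= half:
            return u, v, half - walked
        walked += tree[u][v]
    raise ValueError("tree has no edges")

def star(a, b, e):
    """Unrooted three-taxon tree: internal node 'x' joined to A, B, and E."""
    return {"x": {"A": a, "B": b, "E": e}, "A": {"x": a}, "B": {"x": b}, "E": {"x": e}}

# Clock-like toy case: the midpoint falls on the long branch leading to B.
clock_like = midpoint_root(star(1, 3, 1))
# Same topology, but lineage E is assumed to evolve five times faster.
fast_e = midpoint_root(star(1, 3, 5))
print(clock_like, fast_e)
```

In the second case the root moves onto the branch leading to E although the topology is unchanged, which is the kind of rate-driven artifact that weakens confidence in a midpoint root.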

The inability of genomics to provide clear answers to rooting questions promoted (to some extent) parsimony thinking and the exploration of evolutionary differences of significance that could act as “anchors” and could impart “polarity” to tree statements [49]. As we will describe below, we applied parsimony thinking to the evolution of protein and RNA structures [41, 50], and for over a decade we have been generating rooted ToLs that portray the evolution of proteomes with increasing explanatory power. This work, however, has been for the most part unacknowledged. Instead, molecular evolutionists focused almost exclusively on molecular sequences in their search for solutions to the ToL problem. For example, analysis of genomic insertions and deletions (indels) that are rare in paralogous gene sets rooted the ToL between the group of Actinobacteria and Gram-negative bacteria and the group of Firmicutes and Archaea [51]. Unfortunately, little is known about the dynamics of indel generation, its biological role in gene duplication, and its influence on the structure of paralogous protein pairs (e.g., [52]). If the dynamics are close to those of sequences, indels should be considered subject to all the same limitations of sequence analyses, including long branch attraction artifacts, character independence, and inapplicable characters. Given a tree topology, parsimony was also used to optimize the evolutionary transformation of genomic features. For example, when considering the occurrence of protein domains that are abundant, variants of domain structures accumulate gradually in genomes; operationally, the domain structure per se cannot be lost once it has been gained. This enabled the use of a variant of the unrooted Dollo algorithmic method (e.g., [53]); the Dollo parsimony model [54] is based on the assumption that when a biological feature that is very complex is lost in evolution, it cannot be regained through vertical descent. The method, however, could not be applied to akaryotic genomes, as these are subject to extensive lateral gene transfer, and until recently [55] was not extended to ToL reconstructions.
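The Dollo assumption can be sketched for a single presence/absence character. The example below is a simplified illustration and not the unrooted variant used in [53]: it assumes a rooted tree written as nested tuples, a single gain on the stem of the smallest subtree that contains every carrier, and hypothetical taxon names. It returns the number of losses that the single-gain assumption then requires.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def count_losses(node, carriers):
    """Count one loss for every maximal subtree that holds no carrier."""
    if not (leaves(node) & carriers):
        return 1
    if isinstance(node, str):
        return 0
    return sum(count_losses(child, carriers) for child in node)

def dollo_losses(tree, present):
    """Losses needed when the character is gained once and never regained."""
    carriers = leaves(tree) & set(present)
    if not carriers:
        return 0
    node = tree
    # Descend while only one child holds carriers: the gain is placed as late as possible.
    while not isinstance(node, str):
        holding = [child for child in node if leaves(child) & carriers]
        if len(holding) != 1:
            break
        node = holding[0]
    return count_losses(node, carriers)

# Hypothetical rooted tree with two taxa per domain (toy data).
toy_tree = ((("A1", "A2"), ("E1", "E2")), ("B1", "B2"))
print(dollo_losses(toy_tree, {"A1", "E1", "E2"}))  # gain in the A+E ancestor, A2 loses it
print(dollo_losses(toy_tree, {"A1", "B1"}))        # gain at the root, three separate losses
```

A patchy distribution such as the second one forces several losses under this model, whereas a lateral transfer would explain it with a second acquisition. The single-gain model cannot express that alternative, which is why extensive lateral gene transfer is a problem for the method.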

Parsimony thinking was also invoked in “transition analysis,” a method that attempts to establish polarity of character change by, for example, examining homologies of proteins in proteolytic complexes such as the proteasome, membrane and cell envelope biochemical makeup, and body structures of flagella, sometimes aided by BLAST queries [56, 57]. The elaboration, however, is restricted to the few molecular and cellular structures that are analyzed, out of thousands that populate the akaryotic and eukaryotic cells. The approach is local, has not yet made use of an objective analytic phylogenetic framework, and does not weigh gains, losses, and transfers with algorithmic implementations. Thus, many statements can fall prey to incorrect optimizations or processes of convergence of structure and function, including cooptions that are common in metabolic enzymes [58]. Transitions also fail to consider the molecular makeup of the complexes examined (e.g., evolution of the photosynthetic reaction centers), which often hold domains with heterogeneous histories in the different organismal groups. For example, the F1Fo ATP synthetase complex that powers cellular processes has a long history of accretion of domains that spans almost 3.8 billion years, which is even older than that of the ribosome [59]. Finally, establishing the validity of evolutionary transitions in polarization schemes can be highly problematic; each transition that is studied requires well-grounded assumptions [60], some of which have not yet been properly elaborated.

The inability to solve the rooting problem and the insistence on extracting deep phylogenetic signal from molecular sequences that are prone to mutational saturation raised skepticism about the possibility of ever finding the root of the ToL. Bapteste and Brochier [60] made the conceptual difficulties explicit, claiming that scientists in the field had adopted Agrippa’s logic of doubt. Misunderstandings on how to conduct ideographic inquiry in evolutionary genomics had effectively blocked the reerection of a Tower of Babel that would explain the diversification of genomes (Figure 3). Instead, there was a “Confusion of Doubts.” Unfortunately, the impasse was aggravated by aprioristic tendencies inherited from systematic biology [3, 10], disagreements about the evolutionary role of vertical and lateral inheritance and the problem of homology [61], and, currently, disagreements about the actual role of Darwinian evolution in speciation and the ToL problem [13, 62].

During the decade of evolutionary genomic discovery, the effects of HGT on phylogeny [63] took center stage. HGT was invoked as an overriding process. However, little attention was paid to alternative explanations, such as differential loss of gene variants, ancient or derived, and there was little concern for other sources of reticulation. HGT is certainly an important evolutionary process that complicates the “tree” concept of phylogenetic analysis and must be carefully studied (e.g., [64–66]). In some bacterial taxonomic groups, such as the proteobacteria, HGT was found to be pervasive and challenged the definition of species [67]. Cases like these prompted the radical suggestion that the ToL should be abandoned and that a “web of life” should be used to describe the diversification of microbes and multicellular organisms [68]. However, the problem has two aspects that must be considered separately.

(i) Mechanics. The widespread and impactful nature of HGT must be addressed. Does its existence truly compromise the validity of phylogenetic tree statements? While HGT seems important for some bacterial lineages [69], its evolutionary impacts in Archaea and Eukarya are not as extensive (e.g., [61, 70]), and its global role can be contested [71]. In turn, the role of viruses and RNA agents in genetic exchange continues to be understudied (e.g., [72]) and could be a crucial source of reticulation. Furthermore, the relationship between differential gene loss and HGT has not been adequately formalized, especially for genes that are very ancient. Establishing patterns of loss requires establishing polarity of change and rooted trees, which, as we have discussed earlier, remains unattained for ToLs reconstructed from molecular sequences.

Figure 3: The “Confusion of Tongues,” an engraving by Gustave Doré (1832–1883), was modified by D. Caetano-Anollés to portray the event of diversification that halted the construction of the Tower of Babel. In linguistics, the biblical story inspired tree thinking. We take the metaphor as the fall of the urancestor of cellular life and the replacement of “scala naturae” by branching processes of complexity.

(ii) Interpretation. HGT generally materializes as a mismatch between histories of genes in genomes and histories of organisms. Since genomes are, by functional definition, collections of genes, reticulations affect the history of the genes and not of the organisms, which, given evolutionary assumptions, result from hierarchical relationships (e.g., [73]). Thus, and at first glance, reticulation processes (HGT, gene recombination, gene duplications, and gene loss) do not obliterate vertical phylogenetic signal. The problem, however, remains complex. A ToL can be considered an ensemble of lineages nested within each other [74]. Protein domain lineages will be nested in gene lineages, gene lineages will be nested in lineages of gene families, gene families will be nested in organismal lineages, organisms will be nested in lineages of higher organismal groupings (populations, communities, ecosystems, and biomes), and so on. All lineage levels are defined by biological complexity and the hierarchical organization of life and follow a fractal pattern distribution, each level contributing vertical and lateral phylogenetic signal to the whole. However, sublineages in one hierarchical level may hold different histories compared to the histories of higher and lower levels. Historical mismatches introduce, for example, lineage sorting and reticulation problems that complicate phylogenetic analysis. Mismatches between gene and organismal phylogenies exist, for instance, in the presence of differential sorting of genomic components due to species extinctions or genomic rearrangements. These mismatches violate the fundamental cladistic assumption that history follows a branching process but can be explained by homoplasy (convergent evolution), the horizontal “trace” of the homology-homoplasy yin-yang of phylogenetic signal. The problem of homoplasy cannot be solved by conceptually preferring one particular level of the hierarchy [74]. A focus on the organismal level, for example, will not solve the problems of hybridization that are brought by strict or relaxed sexual reproduction or viral-mediated genetic exchanges or the problems of ancient events of fusions or endosymbiosis. However, homoplasy and the optimization of characters and trees of cladistic analysis provide a rigorous framework to discover the magnitude and source of nonvertical processes in evolution. Thus, cladistics or the ToL are not invalidated by network-like signals. In fact, recent discrete mathematical formulations that test the fundamental axioms of tree construction have also proven that the bifurcating history of trees is preserved despite evolutionary reticulations [75]. While there is no “confusion of doubts” at this level, the babel of confusion has not stopped.
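A mismatch between a gene history and an organismal history can be made explicit by comparing the clades that each tree implies. The sketch below is only an illustration under stated assumptions: both trees are rooted and written as nested tuples, the taxon names and topologies are invented, and the comparison simply lists the gene-tree clades that are absent from the organismal tree.

```python
def clades(tree):
    """Return the set of clades (frozensets of leaf names) of a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(walk(child) for child in node))
        found.add(members)  # every internal node defines one clade
        return members

    walk(tree)
    return found

def conflicting_clades(gene_tree, organism_tree):
    """Clades supported by the gene tree that the organismal tree does not contain."""
    return clades(gene_tree) - clades(organism_tree)

# Hypothetical organismal tree and a gene tree in which the B1 copy groups with A1.
organism_tree = ((("A1", "A2"), ("E1", "E2")), ("B1", "B2"))
gene_tree = ((("A1", "B1"), "A2"), (("E1", "E2"), "B2"))
conflicts = conflicting_clades(gene_tree, organism_tree)
for clade in sorted(conflicts, key=lambda c: (len(c), sorted(c))):
    print(sorted(clade))
```

The comparison only flags the discordance. Deciding whether it reflects HGT, differential gene loss, or lineage sorting requires the polarized, rooted framework discussed in the text.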

Genomics revealed evolutionary patchiness, which incited hypotheses of chimeras and fusions. Phylogenies of genes were found to be highly discordant. Organisms that were being sequenced shared very few gene sequences and, more troublingly, gene trees that were generated possessed different topologies, especially within akaryotic organisms. With time, the number of nearly universal genes decreased and the patchiness and discordancy increased. The comparative genomic patchwork of sequence homologies showed that there were groups of genes that were only shared by certain domains of life. Thus, the makeup of genomes appeared chimeric. A number of hypotheses of chimeric origins of eukaryotes were proposed based on the fact that eukaryotes shared genes expressing sister group phylogenetic relationships with both Bacteria and Archaea [64, 76]. One that is notable is the rise of eukaryotes from a “ring of life” fusion of an archaeon and a bacterium (e.g., [77]), which in a single blow defeated both the tree-like and the tripartite nature of life. Under this school, homology searches with BLAST against a database of ~3.8 million akaryotic sequences made it possible to assign archaeal, bacterial, or ambiguous ancestries to genes in the human genome and explain homology patterns as relics of the akaryotic ancestors of humans [78]. Remarkably, archaeal genes tended to be involved in informational processes, encoded shorter and more central proteins, and were less likely to be involved in heritable human diseases. The chimeric origin of eukaryotes by fusions has been rightfully contested; there is no proper evidence supporting its existence [79]. Unfortunately, chimerism opened a flood of speculations about reticulation. The history of life was seen by many through the lens of a “forest” of gene histories

[80]. Reticulations and, ultimately, HGT were forced to explain chimeric patterns. Fusion hypotheses diverted attention from the central issue of the rooting of the ToL and prompted the separate analysis of eukaryotic origins and akaryotic evolution.

Dagan and Martin [64] denounced that the ToL was a “tree of one percent” because only a small fraction of sequences could be considered universal and could be mined for deep phylogenetic signal. The rest would account for lateral processes that confounded vertical descent. The claim came fundamentally from networks constructed using BLAST heuristic searches for short and strong matches in genomic sequences. Phylogenetic “forests” of akaryotic genes later on boosted the claim [81]. These networks were built from 6,901 trees of genes using maximum likelihood methods. Only 102 of these trees were derived from nearly universal genes and contributed very little vertical phylogenetic signal. The initial conclusion was striking: “the original tree of life concept is obsolete: it would not even be a tree of one percent” [81]. However, the fact that vertical signal was present in the forests merited reevaluation: “replacement of the ToL with a network graph would change our entire perception of the process of evolution and invalidate all evolutionary reconstruction” [82]. We note that in these studies an ultrametric tree of akaryotes was recovered from vertical signal in a supertree of nearly universal genes. The rooted tree, which was used to simulate clock-like behavior, revealed the early divergence of Nanoarchaeum equitans and then Archaea [81]. While inconsistency of the forest supertree increased at high phylogenetic depths, its associated supernetwork showed there were no reticulations between Archaea and Bacteria. Thus, the deep rooting signal of archaeal diversification may be bona fide and worthy of further study.

While those that value tree thinking have contested on many grounds the idea that the ToL is “obsolete” (e.g., [13]), our take in the debate is simple. Homologies between gene sequences established through the “emperor’s BLAST” (sensu [71]) are poor substitutes for phylogenetic tree reconstructions from gene sequences. By the same token, phylogenies reconstructed from sequences are poor substitutes for phylogenies that consider the molecular structure and function of the encoded products. Gene sequences are not only prone to mutational saturation, but they generally come in pieces. These pieces represent evolutionary and structural modules that host the functions of the encoded molecules. Protein domains, for example, are three-dimensional (3D) arrangements of elements of secondary structure that fold autonomously [83], are compact [84], and are evolutionarily conserved [85, 86]. The landscape of evolutionary exploration changes when the role of protein domains in function and evolution is considered [87]. Changes at sequence level, including substitutions, insertions, and deletions, can have little impact on structure and, vice versa, sequence changes in crucial sites can have devastating consequences on function and fitness. However, there is no detailed analysis of historical mismatches between gene sequences and domain structures at the ToL level. Remarkably, while HGT seems rampant at sequence level, its impact at the domain structural level is limited [88]. This makes trees derived from domains


effectively “trees of 99 percent” and their use very powerful. The fact that domains diversify mostly by vertical descent (e.g., [89–91]) suggests that gene reticulations simply reflect the pervasive and impactful combinatorial effect of domain rearrangements in proteins [92] and perhaps little else. This important claim must be carefully evaluated. It offers the possibility of dissecting how levels of organization impact processes of inheritance in biology.

5. Out of the Impasse: Parts and Wholes in the Evolving Structure of Systems

The rooting of the ToL is clearly muddled by the highly dynamic nature of change in protein and nucleic acid sequences and by the patchiness and reticulation complexities that exist at gene level. However, it is also possible that the problems of the ToL are ultimately technical. We have made the case that the use of molecular sequences is problematic on many grounds, including mutational saturation, definition of homology of sites in sequence alignments, inapplicable characters, taxon sampling and tree imbalance, and different historical signatures in domains of multidomain proteins [93]. In particular, violation of character independence by the mere existence of atomic structure represents a very serious problem that plagues phylogenetic analysis of sequences. We here present a solution to the impasse. We show that the ToL can be rooted with different approaches that focus on structure and function and that its root is congruently placed in Archaea.

Epistemologically, phylogenetic characters must comply with symmetry breaking and the irreversibility of time [94]. In other words, characters must establish transformational homology relationships and serve as independent evidential statements.

(i) Characters Must Be Homologous and Heritable across Tree Terminal Units (Taxa). Character homology is a central and controversial concept that embodies the existence of historical continuity of information [74]. Characters are “basic” evidential statements that make up columns in data matrices used for tree reconstruction. They are conjectures of perceived similarities that are accepted as fact for the duration of the study, are strengthened by Hennigian reciprocal illumination, and can be put to the test through congruence with other characters as these are fit to the trees. To be useful, characters must be heritable and informative across the taxa rows of the data matrix. This can be evaluated, for example, with the cladistic information content (CIC) measure [95]. Finding informative characters can be particularly challenging when the features that are studied change at a fast pace and when taxa sample a wide and consequently deep phylogenetic spectrum. When building a ToL, the highly dynamic nature of change in the sequence makeup of protein or nucleic acid molecules challenges the ability to retrieve reliable phylogenetic signatures across taxa, even if molecules are universally distributed and harbor evolutionarily conserved regions with deep phylogenetic signal (e.g., rRNA). The reason is that, given enough time, functionally or structurally constrained

regions of the sequence will be fixed (will be structurally canalized [6]) and will offer little if any phylogenetically meaningful signal to uncover, for example, the universal rRNA core. In turn, mutational saturation of unconstrained regions will quickly erase history.

The mutational saturation problem was made mathematically explicit by Sober and Steel [24] using “mutual information” and the concept of time as an information-destroying process. Mutual information I(X; Y) between two random variables X and Y is defined by

$$I(X;Y) = \sum_{x,y} P(X = x, Y = y)\,\log\left(\frac{P(X = x, Y = y)}{P(X = x)\,P(Y = y)}\right) \tag{1}$$

When I(X; Y) approaches 0, X and Y become independent and no method can predict X from knowledge of Y. Importantly, mutual information approaches 0 as the time between X and Y increases in a Markov chain. Regardless of the use of a parsimony, maximum likelihood, or Bayesian-based framework of analysis, I(X; Y) will be particularly small when sequence sites are saturated by too many substitutions due to high substitution rates or large time scales. Ancestral states at interior nodes of the trees cannot be established with confidence from extant information, even in the most optimal situation of knowing the underlying phylogeny and the model of character evolution. Under simple models, the problem is not mitigated by the fact that the number of terminal leaves of trees and the sources of initial phylogenetic information increase with time. Since a phase transition occurs when substitution probabilities exceed a critical value [96], one way out of the impasse is to find features in sufficient number that change at a much slower pace than sequence sites and test if mutual information is significant and overcomes Fano’s inequality. These features exist in molecular biology and have been used for phylogenetic reconstruction. They are, for example, the 3D fold structures of protein molecules [87] or the stem modules of RNA structures [97], features that change at a very slow rate when compared to associated sequences. For example, protein 3D structural cores evolve linearly with amino acid substitutions per site and change at 3–10 times slower rates than sequences [98]. This high conservation highlights the evolutionary dynamics of molecular structure. Remarkably, rates of change of proteins performing the same function are maintained by functional constraints but accelerate when proteins perform different functions or contain indels. In turn, fold structural diversity explodes into modular structures at low sequence identities, probably triggered by functional diversification. Within the context of structural conservation, the fact that fold structures are structural and evolutionary modules that accumulate in proteomes by gene duplication and rearrangements and spread in biological networks by recruitment (e.g., [99]) also provides a solution to the problems of vanishing phylogenetic signal. Since fold accumulation increases with time in the Markov chain, mutual information must increase, reversing


the “data processing inequality” that destroys information and enabling the retrieval of deep evolutionary information.
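The decay of mutual information along a Markov chain can be illustrated numerically. The following is a minimal sketch, assuming a symmetric four-state substitution model of the Jukes-Cantor type with a uniform stationary distribution. The model choice, the function names, and the `rate` parameter are illustrative assumptions and are not taken from Sober and Steel [24]. The sketch evaluates equation (1) for the states X and Y of a single site separated by time t.

```python
import math


def transition_probabilities(t, rate=1.0, k=4):
    """Symmetric k-state Markov model (Jukes-Cantor type when k = 4).

    Returns (p_same, p_diff): the probability that a site shows the same
    state after time t, and the probability of each specific different state.
    `rate` is the expected number of substitutions per site per unit time
    (an assumed, illustrative parameter).
    """
    decay = math.exp(-k * rate * t / (k - 1))
    p_same = 1.0 / k + (k - 1) / k * decay
    p_diff = 1.0 / k - decay / k
    return p_same, p_diff


def mutual_information(t, rate=1.0, k=4):
    """I(X; Y) in bits between the states X and Y of one site separated by time t."""
    p_same, p_diff = transition_probabilities(t, rate, k)
    p_marginal = 1.0 / k  # uniform stationary distribution for both X and Y
    total = 0.0
    for x in range(k):
        for y in range(k):
            p_joint = p_marginal * (p_same if x == y else p_diff)
            if p_joint > 0.0:  # the term 0 * log(0) is taken as 0
                total += p_joint * math.log2(p_joint / (p_marginal * p_marginal))
    return total


if __name__ == "__main__":
    # Mutual information shrinks toward 0 as the separation time grows.
    for t in (0.0, 0.1, 0.5, 1.0, 2.0, 5.0):
        print(f"t = {t:4.1f}  I(X; Y) = {mutual_information(t):.6f} bits")
```

Under these assumptions the value starts at log2(4) = 2 bits for t = 0 and falls toward 0 as sites saturate, which is the behavior the text attributes to fast-evolving sequence sites. Slowly changing features correspond to a small `rate` and retain information over longer time spans.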

(ii) Characters Must Show at Least Two Distinct Character States. One of these two states (transformational homologs) must be ancestral (the “plesiomorphic” state) and the other must be derived (the “apomorphic” state) [74]. Only shared and derived features (synapomorphies) provide vertical phylogenetic evidence. Consequently, determining the relative ancestry of alternative character states defines the polarity of character transformations and roots the underlying tree. This is a fundamental property of phylogenetic inference. Polarization in tree reconstruction enables the “arrow of time” (sensu Eddington’s entropy-induced asymmetry), solves the rooting problem, and fulfills other epistemological requirements.

Cladistically speaking, character polarity refers to the distinction between the ancestral and derived states and the identification of synapomorphies. However, an evolutionary view of polarity also refers to the direction of character state transformations in the phylogenetic model. Historically, three accepted alternatives have been available for rooting trees [100–102]: the outgroup comparison, the ontogenetic method, and the paleontological or stratigraphic method. While the three methods do not include assumptions of evolutionary process, they have been the subject of much discussion and their interpretation of much controversy. The first two are, however, justified by the assumption that diversity results in a nested taxonomic hierarchy, which may or may not be induced by evolution. We will not discuss the stratigraphic method, as it relies on auxiliary assumptions regarding the completeness of the fossil record. The midpoint rooting criterion mentioned earlier will not be discussed either. The procedure is contested by the existence of heterogeneities in rates of change across the trees and problems with the accurate characterization of phylogenetic distances.

In outgroup comparison, polarity is inferred by the distribution of character states in the ingroup (group of taxa of interest) and the sister group (taxa outside the group of interest). In a simple case, if the character state is only found in the ingroup, it must be considered derived. Outgroup comparison is by far the method of choice because phylogeneticists tend to have confidence in the supporting assumptions: higher-level relationships are outside the ingroup, equivalent ontogenetic stages are compared, and character state distributions are appropriately surveyed. Unfortunately, the method is “indirect” in that it depends on the assumption of the existence of a higher-level relationship between the outgroup and the ingroup. Consequently, the method cannot root the ToL because at that level there is no higher-level relationship that is presently available. Moreover, the method in itself does not polarize characters. It simply connects the ingroup to the rest of the ToL [100].

The ontogenetic criterion confers polarity through the distribution of the states of homologous characters in ontogenies of the ingroup, generally by focusing on the generality of character states, with more widely distributed states being considered ancestral. Nelson’s rule states: “given an ontogenetic

character transformation, from a character observed to be more general to a character observed to be less general, the more general character is primitive and the less general advanced” [103]. This “biogenetic law” appears powerful in that it depends only on the assumption that ontogenies of ingroup taxa are properly surveyed. It is also a “direct” method that relies exclusively on the ingroup. Consequently, it has the potential to root the ToL. Unfortunately, Nelson’s “generality” has been interpreted in numerous ways, especially as it relates to the ontogenetic sequence, leading to much confusion [102]. It also involves comparison of developmentally nested and distinct life history stages, making it difficult to extend the method (originally conceptualized for vertebrate phylogeny) to the microbial world. However, Weston [100, 104] made it clear that the ontogenetic criterion embodies a wider “generality criterion” in which the taxic distribution of a character state is a subset of the distribution of another. In other words, character states that characterize an entire group must be considered ancestral relative to an alternative state that characterizes a subset of the group. Besides the centrality of nested patterns, the generality criterion embeds the core assumption that every homology is a synapomorphy in nature’s nested taxonomic hierarchy and that homologies in the hierarchy result from additive phylogenetic change [100]. Weston’s more general rule therefore states that “given a distribution of two homologous characters in which one, x, is possessed by all of the species that possess its homolog, character y, and by at least one other species that does not, then y may be postulated to be apomorphous relative to x” [104]. The only assumption of the method is that relevant character states in the ingroup are properly surveyed. This new rule crucially substitutes the concept of ontogenetic transformation by the more general concept of homology and additive phylogenetic change, which can be applied to cases in which homologous entities accumulate “iteratively” in evolution (e.g., generation of paralogous genes by duplication). Since horizontally acquired characters (xenologs) are not considered synapomorphies, they contribute towards phylogenetic noise and are excluded after calculation of homoplasy and retention indices (i.e., measures of goodness of fit of characters to the phylogeny).
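Weston’s rule, as quoted above, reduces to a test of proper set inclusion between the taxic distributions of two homologous character states. The following minimal sketch expresses that test. The function name `polarize` and the species labels are hypothetical and serve only as an illustration; the sketch assumes, as the rule does, that the character states have been properly surveyed in the ingroup.

```python
def polarize(species_with_x, species_with_y):
    """Apply the generality criterion to two homologous character states.

    Each argument is the collection of species that possess one state.
    Returns "y" if y may be postulated apomorphous (derived) relative to x,
    "x" in the converse case, and None when the distributions are not
    nested (equal, disjoint, or only partially overlapping).
    """
    x = set(species_with_x)
    y = set(species_with_y)
    if y < x:  # every species with y also has x, and x occurs elsewhere too
        return "y"
    if x < y:
        return "x"
    return None


if __name__ == "__main__":
    # Hypothetical survey of four species (labels are placeholders).
    widespread_state = {"sp1", "sp2", "sp3", "sp4"}
    restricted_state = {"sp3", "sp4"}
    print(polarize(widespread_state, restricted_state))
```

When the function returns None, the distributions carry no nested pattern, and the rule makes no statement about polarity for that pair of states.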

We have applied the “generality criterion” to the rooting of the ToL through polarization strategies that embody axioms of evolutionary process. Figure 4 shows three examples. A rooted phylogeny describing the evolution of 5S rRNA molecules sampled from a wide range of organisms was reconstructed from molecular sequence and structure [105]. The ToL that was recovered was rooted paraphyletically in Archaea (Figure 4(a)). The model of character state transformation was based on the axiom that evolved RNA molecules are optimized to increase molecular persistence and produce highly stable folded conformations. Molecular persistence materializes in RNA structure in vitro, with base pairs associating and disassociating at rates as high as 0.5 s⁻¹ [106]. The frustrated kinetics and energetics of this folding process enable some structural conformations to quickly reach stable states. This process is evolutionarily optimized through structural canalization [6], in which evolution attains molecular functions by both increasing the average life and stability of

[Figure 4 panels. (a) Rooted tree of 5S rRNA with the Archaea, Bacteria, and Eukarya lineages indicated; taxa = 666 5S rRNA molecules of diverse organisms; tree length = 11,381 steps; CI = 0.072, RI = 0.772, and g1 = −0.13. (b) Timeline of lineage ancestries, from ancestral to derived, for Archaea (A), Bacteria (B), and Eukarya (E). (c) Rooted tree of proteomes with the Eukarya, Archaea, and Bacteria lineages indicated; taxa = 150 free-living proteomes; characters = genomic abundance of 1,510 FSFs; tree length = 70,132 steps; CI = 0.155, RI = 0.739, and g1 = −0.28.]

Figure 4: Trees of life generated from the structure of RNA and protein molecules congruently show a rooting in Archaea. (a) A rooted phylogenetic tree of 5S rRNA reconstructed from both the sequence and the structure of the molecules (from [105]). (b) Global most-parsimonious scenario of organismal diversification based on tRNA (from [107]). A total of 571 tRNA molecules with sequence, base modification, and structural information were used to build a ToL, which failed to show monophyletic groupings. Ancestries of lineages were then inferred by constraining sets of tRNAs into monophyletic groups representing competing (shown in boxes) or noncompeting phylogenetic hypotheses and measuring tree suboptimality and lineage coalescence (illustrated with color hues in circles). (c) A ToL reconstructed from the genomic abundance counts of 1,510 FSFs as phylogenetic characters in the proteomes of 150 free-living organisms sampled equally and randomly from the three domains of life (data taken from [122, 123]). Taxa were labeled with circles colored according to superkingdom. CI = consistency index, RI = retention index, and g1 = gamma distribution parameter.

selected conformations and decreasing their relative number. Thus, conformational diversity measured, for example, by the Shannon entropy of the base-pairing probability matrix, or features of thermodynamic stability, act as “evo-devo” proxies of a generality criterion for RNA molecules, in which the criterion of similarity (e.g., ontogenetic transformation) is of positional and compositional correspondence. Using a different approach, we recently broadened the use of phylogenetic constraint analysis [107, 108], borrowing from a formal

cybernetic method that decomposes a reconstructable system into its components [109], and used it to root a ToL derived from tRNA sequence and structure (Figure 4(b)). The ToL was again rooted in Archaea. The number of additional steps required to force (constrain) particular taxa into a monophyletic group was used to define a lineage coalescence distance (S) with which to test alternative hypotheses of monophyly. These hypotheses were then ordered according to S value in an evolutionary timeline. Since S records


the relative distribution of character states in taxic sets, it also embodies the generality criterion of rooting. Finally, we generated rooted ToLs that describe the evolution of proteomes directly from a census of protein fold structures in proteomes (Figure 4(c)). This ToL shows a paraphyletic rooting in Archaea. The method extracts phylogenetic signal from proteomic abundance of protein fold structures and considers that the most abundant and widely distributed folds are of ancient origin when defining transformation series [41]. This polarization scheme, which results in the gradual growth of the proteome repertoire, again represents an embodiment of the generality criterion in which statements of homology (fold structures) result from additive phylogenetic change (increases in abundance). It is noteworthy that these ToL reconstructions take into account the genomic abundance of each and every fold structure in each proteome and across the entire matrix, thereby generating a frustrated system. In general, very high abundance of only a few folds will not attract taxa (species) to the derived branches in the ToL. The ancestry of taxa is determined by both the abundance and interplay among fold structure characters. For example, metabolic folds such as those involved in ATP hydrolysis are widespread in cells and considered to be very ancient. These are also the most ancient folds in our phylogenies. In comparison, some popular eukaryote-specific folds (e.g., immunoglobulin superfamilies) are highly abundant but appear in a derived manner in our phylogenies. Thus, we reason that there is no circularity involved in the character polarization scheme.
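The bookkeeping behind the lineage coalescence distance described above can be sketched in a few lines. This is only an illustration of the arithmetic, assuming that tree lengths (in steps) of the unconstrained and constrained most-parsimonious trees have already been obtained from a parsimony search. The function names, the hypothesis labels, and the tree lengths are hypothetical and are not values from [107, 108]. The passage does not state how the ordering by S maps onto ancestral and derived ends of the timeline, so the sketch only sorts the hypotheses by S.

```python
def coalescence_distances(unconstrained_length, constrained_lengths):
    """Return {hypothesis: S}.

    S is the number of additional steps needed when the taxa named by a
    hypothesis are forced into a monophyletic group, i.e. the length of the
    constrained tree minus the length of the unconstrained tree.
    """
    distances = {}
    for hypothesis, length in constrained_lengths.items():
        s = length - unconstrained_length
        if s < 0:
            raise ValueError(
                f"constrained tree for {hypothesis!r} is shorter than the "
                "unconstrained tree; the unconstrained search was not optimal"
            )
        distances[hypothesis] = s
    return distances


def order_by_s(distances):
    """Order hypotheses by increasing S (ties broken alphabetically)."""
    return sorted(distances, key=lambda h: (distances[h], h))


if __name__ == "__main__":
    # Hypothetical tree lengths, in steps, for three monophyly constraints.
    lengths = {"H1": 1030, "H2": 1012, "H3": 1045}
    s_values = coalescence_distances(1000, lengths)
    for hypothesis in order_by_s(s_values):
        print(hypothesis, s_values[hypothesis])
```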

The compositional schemes extend the concept of rooting with paralogous sequences to the entire proteome complements, from gene family level [100] to structural hierarchies. The three examples make use of different rooting rationales but provide a congruent scenario of origins of diversified life. Technically, roots are inferred by the Lundberg method [110], which does not require any outgroup taxa specification. This method roots the trees a posteriori by attaching the hypothetical ancestor to that branch of the unrooted network that would yield minimum increase in the tree length (thus preserving the principle of parsimony).

(iii) Characters Must Serve as Independent Evolutionary Hypotheses. Valid phylogenetic optimization requires that characters be independent pieces of evidence. Characters should not depend on other characters. When the assumption of independence is violated, characters are overweighted in the analysis and the resulting phylogeny fails to represent true history [111]. Possible dependencies could be of many kinds, from structural to functional, from developmental to ecological. These dependencies distort and obscure phylogenetic signal and must be either avoided or coded into the phylogenetic model through parameters or weight corrections.

As we will now elaborate, the problem of character independence is about parts and wholes in the hierarchical fabric of life and in the nested hierarchies of the ToL. Biological systems are by definition made of parts, regardless of the way parts are defined. In evolution, diversification and integration of parts unify parts into cohesive entities, modules, which then diversify [112]. This process and the rise of modules may

explain evolutionary waves of complexity and organization and the emergence of structure (defined broadly) in biology that is hierarchical. The hierarchical makeup is made evident in the structure of protein molecules, where lower-level parts of the polymer (the amino acid residues) interact with each other and establish cohesive higher-level modular parts, which also establish interaction networks and are crucial for molecular function and for interaction of proteins with the cellular environment. Consequently, the structure of proteins can be described at increasing hierarchical levels of structural abstraction: sequences, motifs, loops, domains, families, superfamilies, topologies, folds, architectures, and classes. Two accepted gold standards of protein classification, structural classification of proteins (SCOP) [113] and class architecture topology homology (CATH) [114], use parts of this incomplete scheme to describe the atomic complexity of the molecules. We note that these classifications do not consider unrealized structural states, such as protein folds that are possible but that have never been identified in the natural world of protein structures. We also note that modules sometimes engage in combinatorial games. For example, protein domains are rearranged in evolution by fusions and fissions, producing the enormous diversity of alternative domain rearrangements that exist in multidomain proteins [92].

Since biological systems evolve and carry common ancestry, parts of these systems by definition evolve and by themselves carry common ancestry. In other words, the histories of parts are embodied in the ensemble of lineages of the ToL. The focus of phylogenetic analysis, however, has been overwhelmingly the organism as the biological system, as testified by the effort devoted to taxonomical classification, systematic biology, and building of the ToL. The genomic revolution has provided a wealth of parts in the amino acid sequence sites of proteins and nucleic acid molecules, which have been used as phylogenetic characters for analysis of molecules representing organisms. This has maintained the focus of reconstructing trees of systems (the wholes). For example, amino acid sites of a protein are generally fit into a data matrix (alignment), which is then used to build trees of genes and organisms (Figure 5(a)) using modern algorithmic implementations [115]. However, and as we have mentioned, proteins have complicated structures that result from interactions between amino acids at the 3D atomic level. These intramolecular interactions, or at least some of them [116], induce protein folding and delimit molecular function and protein stability. They are responsible for protein secondary, supersecondary, domain, and tertiary structure, and by definition their mere existence induces violation of character independence. Penny and Collins [117] proposed the simple thought experiment in which the bioinformatician exchanges rows of sequence sites in the alignment matrix and asks what was lost in the process. Randomization of characters (columns) in the data matrix does not change the phylogenetic tree. However, randomization destroys the structure of the molecule and very likely its function. This confirms that reconstruction of trees from information in sequence sites violates character independence and in the process ignores structure and biology. The effects of violation

[Figure 5 panels. (a) Canonical paradigm (evolution of systems): characters are parts (amino acid sequence sites); taxa are systems (proteins from Pfu, Mja, and Mba); interactions between parts (e.g., atomic contacts between sites) have a high impact on phylogenetic reconstruction. (b) New paradigm (evolution of parts): characters are systems (proteomes of Pfu, Mja, and Mba); taxa are parts (FSF domains c.26.1, c.1.2, and c.37.1); interactions between systems (e.g., proteomes in trophic chains, from picoplankton to higher trophic levels) have a low impact on phylogenetic reconstruction.]

Figure 5: A new phylogenetic strategy simplifies the problems of character independence. (a) The canonical paradigm explores the evolution of systems, such as the evolution of organisms in the reconstruction of ToLs. For example, the terminals of phylogenetic trees can be genes sampled from different organisms (e.g., Pyrococcus furiosus (Pfu), Methanococcus jannaschii (Mja), and Methanosarcina barkeri (Mba)) and phylogenetic characters can be amino acid sequence sites of the corresponding gene products. Character states can describe the identity of the amino acid at each site. Since characters are molecular parts that interact with other parts when molecules fold into compact 3D structures, their interaction violates the principle of character independence. Consequently, the effects of covariation must be considered in the phylogenetic model used to build the trees. (b) The new paradigm explores the evolution of parts, such as the evolution of protein domains in proteomes. For example, the terminals of phylogenetic trees can be domains defined at fold superfamily level of structural complexity and the characters used to build the trees can be proteomes. Character states can be the number of domains holding the FSF structure. Since proteomes interact with other proteomes when organisms establish close interactions, interactions that could affect the abundance of domains in proteomes should be considered negligible (unless there is an obligate parasitic lifestyle involved) and there is no need to budget trophic interactions in the phylogenetic model.

of character independence may be minimal for trees of sequences that are closely related. However, trees describing deep historical relationships require that sequences be divergent, and this maximizes the chances of even wider divergences in molecular structure that are not being accounted for by the models of sequence evolution.
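The thought experiment of Penny and Collins [117] can be reproduced on a toy alignment. The sketch below is an illustration under stated assumptions: the three short sequences are invented (they are not real protein data), and pairwise Hamming distances stand in for any tree-building criterion that treats sites as independent characters. Shuffling the alignment columns leaves every pairwise distance unchanged, even though the linear order of residues, and therefore any folded structure, is lost.

```python
import random


def hamming_matrix(alignment):
    """Pairwise counts of differing sites for an alignment {name: sequence}."""
    names = sorted(alignment)
    return {
        (a, b): sum(c1 != c2 for c1, c2 in zip(alignment[a], alignment[b]))
        for a in names
        for b in names
    }


def shuffle_columns(alignment, seed=0):
    """Apply one random permutation of the columns to every sequence."""
    length = len(next(iter(alignment.values())))
    order = list(range(length))
    random.Random(seed).shuffle(order)
    return {name: "".join(seq[i] for i in order) for name, seq in alignment.items()}


if __name__ == "__main__":
    # Invented toy sequences labeled with the organism codes used in Figure 5.
    toy = {"Pfu": "ALYSGKT", "Mja": "ALYSGKS", "Mba": "GLSSGRT"}
    shuffled = shuffle_columns(toy, seed=1)
    print(shuffled)
    print(hamming_matrix(toy) == hamming_matrix(shuffled))
```

The same invariance holds for a matrix of proteome characters, but in that case the column order carries no structural meaning to begin with, so nothing is lost by shuffling.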

The genomic revolution has also provided a wealth of models of 3D atomic structure. These structures are used as gold standards to assign, with high confidence, structural modules to sequences. As mentioned earlier, proteomes embody collections of protein domains with well-defined structures and functions. Protein domain counts in proteomes have been used to generate trees of protein domains (Figure 5(b)) using standard cladistic approaches and well-established methods (reviewed in [87]). These trees describe the evolution of protein structure at global level. They are effectively trees of parts. While domains interact with each other in multidomain proteins or establish protein-protein interactions with the domains of other proteins, these interactions of parts do not violate character independence. This is because phylogenetic characters are actually proteomes, systems defined by structural states that exist at much higher levels than the protein domain, not far away from the organism level. Remarkably, no information is lost when character columns in the matrix are randomized in

the thought experiment. The order of proteome characters in the matrix does not follow any rationale. Characters are not ordered by lifestyles or trophic levels of the organisms. Interactions between free-living organisms will seldom bias their domain makeup, and if so, those characters can be excluded from analysis. Even the establishment of symbiotic or obligate parasitic interactions, such as the nodule-forming symbioses between rhizobia and legumes, may have little impact on character independence as long as the joint inclusion of the host and the symbiont is avoided.

6. Evidence Supporting the Archaeal Rooting of the Universal Tree

Figure 4 shows examples of rooted ToLs generated from the sequence and structure of RNA and protein molecules. The different phylogenomic approaches arrive at a common rooted topology that places the stem group of Archaea at the base of the ToL (the archaeal rooting of Figure 1(c)). However, the evolutionary interrelationship of parts and wholes prompts the use of trees of parts, a focus on higher-level structure, and a decrease in confidence in the power of trees of systems. These concepts have been applied to the study of the history of nucleic acid and protein structures for over a decade and

[Figure 6: Venn diagrams, panels (a) FSF domains, (b) GO terms, and (c) RNA families, each for Archaea, Bacteria, and Eukarya, with a second diagram per panel restricted to f > 0.6.]

Figure 6: Venn diagrams displaying the distributions of 1733 FSF domains (a), 1924 terminal GO terms (b), and 1148 RNA families (c) in the genomes of the three domains of life. FSF domain data was taken from Nasir et al. [122, 123] and included 981 completely sequenced proteomes from 70 Archaea, 652 Bacteria, and 259 Eukarya. Terminal GO terms corresponding to the “molecular function” hierarchy defined by the GO database [118] were identified in 249 free-living organisms, including 45 Archaea, 183 Bacteria, and 21 Eukarya (data taken from [123]). The Venn diagram of RNA families and the distribution of Rfam clans and families in organisms were taken from Hoeppner et al. [129] and their Dataset S1. Shown below are distribution patterns for FSFs, GO terms, and RNA families that are present in more than 60% of the organisms examined (f > 0.6). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

provide additional evidence in support of the archaeal rooting scenario.

6.1. Comparative Genomic Argumentation. The distribution of gene-encoded products in the genomes of sequenced organisms and parsimony thinking can reveal global evolutionary patterns without formal phylogenetic reconstruction. We will show how simple numerical analyses of protein domains, molecular activities defined by the gene ontology (GO) consortium [118], and RNA families that are domain-specific or are shared between domains of life can uncover the tripartite division of cellular life, exclude chimeric scenarios of origin, and provide initial insight on the rooting of the ToL (Figure 6).

The number of unique protein folds observed in nature is very small. SCOP ver. 1.75 defines only ∼2000 fold superfamilies (FSFs), groups of homologous domains unified on the basis of common structural and evolutionary relationships, for a total of 110,800 known domains in proteins [113, 119]. FSFs represent highly conserved evolutionary units that are suitable for studying organismal diversification [87]. Yafremava et al. [120] plotted the total number of distinct FSFs (FSF diversity) versus the average reuse of FSFs in the proteome of an organism (FSF abundance). This exercise uncovered a scaling behavior typical of a Benford distribution, with a linear regime of proteomic growth for microbial organisms and a superlinear regime for eukaryotic organisms (Figure 8 in [120]). These same scaling patterns are observed when studying the relationship between open reading frames and genome size [121]. Remarkably, archaeal and eukaryal proteomes exhibited the minimum and maximum levels of FSF abundance and diversity, respectively. Bacterial proteomes, however, showed intermediate levels. We note that the general scaling behavior is consistent with a scenario in which evolutionary diversification proceeds from simpler proteomes to more complex ones in a gradual manner, supporting the principle of spatiotemporal continuity and revealing the nested phylogenetic hierarchies of organisms. Under this scenario (and the results of [120]), the streamlined archaeal proteomes represent the earliest form of cellular life. Remarkably, archaeal species harboring thermophilic and hyperthermophilic lifestyles encoded the most streamlined FSF repertoires (Figure 7). Clearly, modern thermophilic archaeons are most closely related to the ancient cells that inhabited planet Earth billions of years ago (see also below).
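The use-versus-reuse comparison described above can be illustrated with a minimal sketch. It assumes that each proteome is available as a flat list of FSF assignments (a hypothetical input format) and that "reuse" is read as the raw total of FSF occurrences; the function names and the toy proteomes are invented for illustration and are not taken from [120].

```python
import math
from collections import Counter


def fsf_use_and_reuse(domain_hits):
    """Return (use, reuse) for one proteome.

    domain_hits: iterable of FSF identifiers, one entry per domain
    occurrence assigned in the proteome (hypothetical input format).
    use   = number of distinct FSFs (diversity)
    reuse = total number of FSF occurrences (abundance); reading
            "reuse" as a raw total is an assumption of this sketch.
    """
    counts = Counter(domain_hits)
    return len(counts), sum(counts.values())


def log_log_slope(points):
    """Least-squares slope of log10(reuse) regressed on log10(use)."""
    xs = [math.log10(use) for use, _ in points]
    ys = [math.log10(reuse) for _, reuse in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy proteomes (invented labels and counts, not data from the review).
toy_proteomes = {
    "proteome_small": ["fsf_a"] * 3 + ["fsf_b"] * 2,
    "proteome_mid": ["fsf_a"] * 6 + ["fsf_b"] * 4 + ["fsf_c"] * 5 + ["fsf_d"] * 5,
    "proteome_large": ["fsf_%d" % i for i in range(8) for _ in range(10)],
}
points = [fsf_use_and_reuse(hits) for hits in toy_proteomes.values()]
slope = log_log_slope(points)
```

In a log-log plot of this kind, a slope near 1 corresponds to proportional growth of reuse with use, and a slope above 1 corresponds to faster-than-proportional growth.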

FSF domain distributions in the genomes of the three domains of life provide further insights into their evolution. Nasir et al. [122, 123] generated Venn diagrams to illustrate FSF sharing patterns in the genomes of Archaea, Bacteria, and Eukarya (Figure 6(a)). These diagrams display the total number of FSFs that are unique to a domain of life (taxonomic groups A, B, and E), shared by only two (AB, BE, and AE), and

[Figure 7: scatter plot of FSF reuse (log) against FSF use (log) for archaeal proteomes, with thermophilic and nonthermophilic species distinguished. Labeled species: S. marinus, I. hospitalis, H. butylicus, T. pendens, Cand. Korarchaeum, A. pernix, M. kandleri, T. neutrophilus, P. islandicum, and N. maritimus.]

Figure 7: A plot of FSF use (diversity) against FSF reuse (abundance) reveals a linear pattern of proteomic growth in 48 archaeal proteomes. Thermophilic archaeal species occupy positions that are close to the origin of the plot. They also populate the most basal branch positions in ToLs (see discussion in the main text). Both axes are in logarithmic scale.

those that are universal (ABE). About half of the FSFs (786 out of 1733) were present in all three domains of life, 453 of which were present in at least 60% of the organisms that were examined (f > 0.6, where f is the distribution index ranging from 0 to 1). The fact that about 70% of the 672 widely shared FSFs belong to the ABE taxonomic group strongly supports a common evolutionary origin for cells. Evolutionary timelines have confirmed that a large number of these universal FSFs were present in the ancestor of the three domains of life, the urancestor (Figure 1), and were for the most part retained in extant proteomes [48].
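The bookkeeping behind such Venn diagrams can be sketched as follows. This is a minimal illustration, not the procedure of [122, 123]: the organism names and feature sets are invented, and the sketch assumes that the distribution index f of a feature is computed over the organisms of the domains that make up its taxonomic group.

```python
from collections import Counter


def taxonomic_group(carriers, domain_of):
    """Label a feature by the domains that encode it, e.g. 'ABE' or 'BE'."""
    present = {domain_of[organism] for organism in carriers}
    return "".join(d for d in "ABE" if d in present)


def distribution_index(carriers, domain_of, group):
    """Fraction f of organisms in the group's domains that carry the feature.

    Restricting the denominator to the domains of the group is an
    assumption of this sketch.
    """
    pool = [o for o, d in domain_of.items() if d in group]
    return sum(1 for o in pool if o in carriers) / len(pool)


def venn_counts(features, domain_of, f_min=0.0):
    """Count features per taxonomic group, keeping those with f > f_min."""
    counts = Counter()
    for carriers in features.values():
        if not carriers:
            continue
        group = taxonomic_group(carriers, domain_of)
        if distribution_index(carriers, domain_of, group) > f_min:
            counts[group] += 1
    return counts


# Toy census (invented organisms and features).
domain_of = {"arc1": "A", "arc2": "A", "bac1": "B", "bac2": "B",
             "bac3": "B", "euk1": "E", "euk2": "E"}
features = {
    "fsf_universal": set(domain_of),
    "fsf_be_common": {"bac1", "bac2", "euk1", "euk2"},
    "fsf_be_rare": {"bac1", "euk1"},
    "fsf_archaeal": {"arc1"},
}
all_groups = venn_counts(features, domain_of)
widely_shared = venn_counts(features, domain_of, f_min=0.6)
```

Calling `venn_counts` once without a cutoff and once with `f_min=0.6` mirrors the two diagrams shown per panel in Figure 6.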

Interestingly, the number of BE FSFs was ∼10-fold greater than AE and AB FSFs (324 versus 38 and 38) (Figure 6(a)). The finding that Bacteria and Eukarya encode a significantly large number of shared FSF domains is remarkable and hints towards an unprecedentedly strong evolutionary association between these two domains of life. Moreover, a significant number of BE FSFs are widespread in the bacterial and eukaryotic proteomes; 56 of the 324 FSFs are shared by more than 60% of the bacterial and eukaryal organisms that were analyzed (Figure 6(a)). This is strong evidence of an ancient vertical evolutionary trace from their mutual ancestor (anticipated in [123]). This trace uniquely supports the archaeal rooting of the ToL. In turn, the canonical and eukaryotic rooting alternatives are highly unlikely, as none alone can explain the remarkable diversity of the BE taxonomic group (as well as its ancient origin [123]). The remarkable vertical trace of the ABE taxonomic group and the negligible vertical trace of the AB group (only one FSF shared by more than 60% of organisms) also falsify fusion hypotheses that posit a chimeric makeup of Eukarya. In light of these findings, the most parsimonious explanation of comparative genomic data is that Archaea is the first domain of diversified cellular life.

The patterns of FSF sharing are further strengthened by the genomic distributions of terminal-level molecular function GO terms (Figure 6(b)). The three domains of life shared about a quarter (526) of the 1924 GO terms that were surveyed. A total of 232 of these ABE terms were shared by more than 60% of the organisms analyzed. Again, the fact that about 76% of the 306 widely shared GO terms belong to the ABE taxonomic group strongly supports a common vertical trace, while the finding that Bacteria and Eukarya share a significantly large number of GO terms (272) supports again the archaeal rooting of the ToL. An alternative explanation for the very large size of BE could be large-scale metabolism-related gene transfer from the ancestors of mitochondria and plastids to the ancestors of modern eukaryotes [124]. However, we note that the BE group is not restricted to only metabolic functions. It also includes FSFs and GOs involved in intracellular and extracellular processes and regulation and informational functions [125]. Thus, the very large size of BE is a significant outcome most likely shaped by vertical evolutionary scenarios and cannot solely be explained by parasitic/symbiotic relationships that exist between Bacteria and Eukarya [123]. Moreover, the simplicity of archaeal FSF repertoires is not due to the paucity of available archaeal genomic data. We confirmed that the mean FSF coverage (i.e., number of proteins annotated to FSFs/GOs out of total) for Archaea, Bacteria, and Eukarya was largely comparable (e.g., Table S1 in [122]). Finally, as we will describe below, these comparative patterns and tentative conclusions are confirmed by phylogenomic analysis [91, 122, 126]. Congruent phylogenies were obtained with (Figure 4(c)) and without equal and random sampling of taxa. Thus, the relatively low number of archaeal genomes is also not expected to compromise our inferences.

We note that Eukarya shares many informational genes with Archaea and many operational genes with Bacteria. While informational genes have been thought more refractory to HGT than noninformational genes, we have confirmed at GO level that this is not the case. A ToL built from GO terms showed that in fact noninformational terms were less homoplasious, while statistical enrichment analyses revealed that HGT had little if any functional preference for GO terms across GO hierarchical levels. We recently compared ToLs reconstructed from non-HGT GO terms and ToLs reconstructed from informational GO terms that were extracted from non-HGT GO terms (Kim et al., ms. in review; see also below). In both cases, Archaea appeared as a basal paraphyletic group of the ToLs and the common origin of Bacteria and Eukarya was maintained. Thus, the 272 GO terms shared by Bacteria and Eukarya harbor a strong vertical trace.

Another observation often used as support for Archaea-Eukarya kinship is the discovery of a few eukaryote-specific proteins (e.g., actin, tubulin, H3, H4, ESCRT, ribosomal proteins, and others) in some archaeal species [127] and

their complete absence from Bacteria (except for documented tubulin HGT from eukaryotes to the bacterial genus Prosthecobacter [128]). This suggests either that eukaryotes arose from an archaeal lineage [127] or that the ancestor of Eukarya and Archaea was complex and modern Archaea are highly reduced [25]. Indeed, only a few eukaryote-specific proteins have now been found in some archaeal species (in some cases, just one species). We argue that this poor spread cannot be taken as evidence for the Archaea-Eukarya sister relationship. This needs to be confirmed by robust phylogenetic analysis, which is unfortunately not possible when using protein sequences. Recently, we described a new strategy for inferring vertical and horizontal traces [123]. This method calculates the spread of FSFs and GOs that are shared between two-superkingdom groups (i.e., AB, AE, and BE). Balanced distributions often indicate vertical inheritance, while biased distributions suggest horizontal flux. For example, penicillin binding molecular activity (GO:0008658) was present in 100% of the sampled bacterial proteomes but was only present in 11% of the archaeal species [123]. Thus, the presence of GO:0008658 in Archaea was attributed to HGT gain from Bacteria. Using this simple method, we established that Bacteria and Eukarya were united by a much stronger vertical trace than either was to Archaea. In fact, strong reductive tendencies in the archaeal genomes were recorded [123]. Thus, in our opinion, the presence of eukaryote-specific proteins in only very few archaeal species could in fact be the result of an HGT event that is not detectable by sequence phylogenies.
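The balanced-versus-biased reasoning can be made concrete with a small sketch. The text gives no numerical cutoff, so the `max_gap` threshold below is purely hypothetical, as are the function names; only the two spread values for GO:0008658 come from the text.

```python
def spread(carriers, members):
    """Fraction of a superkingdom's proteomes that carry the feature."""
    members = set(members)
    return len(set(carriers) & members) / len(members)


def classify_spread(f_by_superkingdom, max_gap=0.5):
    """Label a feature shared by exactly two superkingdoms.

    max_gap is a hypothetical cutoff chosen for illustration; the review
    does not state a numerical threshold.
    Returns ("balanced", None) or ("biased", richer_superkingdom).
    """
    if len(f_by_superkingdom) != 2:
        raise ValueError("expected exactly two superkingdoms")
    (name_1, f_1), (name_2, f_2) = f_by_superkingdom.items()
    if abs(f_1 - f_2) <= max_gap:
        return "balanced", None
    return "biased", (name_1 if f_1 > f_2 else name_2)


# Spread values quoted in the text for penicillin binding (GO:0008658).
penicillin = classify_spread({"Bacteria": 1.00, "Archaea": 0.11})
```

Under this illustrative cutoff, the penicillin binding example is labeled as biased toward Bacteria, which matches the HGT interpretation given in the text; a real analysis would need a justified threshold or a statistical test.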

In turn, many arguments now favor the Bacteria-Eukarya sisterhood in addition to the structure-based phylogenies and balanced distribution of molecular features. For example, Bacteria and Eukarya have similar lipid membranes, which can be used as an argument for their evolutionary kinship. Moreover, Archaea fundamentally differ from Eukarya in terms of their virosphere. Viruses infecting Archaea and Eukarya are drastically different, as recently discussed by Forterre [25].

The genomic distribution of RNA families was taken from Hoeppner et al. [129] and also shows a vertical evolutionary trace in five crucial Rfam clans that are universal, including tRNA, 5S rRNA, subunit rRNA, and RNase P RNA. These universal RNA groups are likely minimally affected by HGT. However, 99% of Rfam clans and families are specific to domains of life and only 11 of the 1148 groups were shared at f > 0.6 levels. This clearly shows that the functional complexity of RNA materialized very late during organismal diversification and that it is not a good genomic feature for exploring the rooting of the ToL. Only five RNA families are shared between two domains of life, and only one of these does so at f > 0.6: the G12 pseudoknot of the 23S rRNA, which is present in bacterial and eukaryotic organellar rRNA. While the large subunit rRNA scaffold supports the G12 pseudoknot with its vertical trace, all five interdomain RNA families can be explained most parsimoniously by HGT. Thus, only a handful of ancient and universal RNA species can be used to root the ToL.

We end by noting that the Venn diagrams consistently show that Archaea harbors the smallest number of unique (A) and shared (AB and AE) FSFs, GO terms, or RNA families. This trend supports an early divergence of this domain of life from the urancestor and the possibility that such divergence was shaped by evolutionary reductive events. We reason that such losses would be more parsimonious early on in evolution than in later periods. This is because genes often increase their abundance in evolutionary time (by gene duplications, HGT, and other processes). Thus, it is reasonable to think that loss of an ancient gene would be more feasible very early in evolution relative to losing it very late.

6.2. Phylogenomic Evidence from the Sequence and Structure of RNA. Phylogenetic analyses of the few RNA families that are universal and display an important vertical evolutionary trace (Figure 6(c)) provide compelling evidence in favor of an early evolutionary appearance of Archaea [105, 107, 108, 130–135]. Here we briefly summarize evidence from tRNA, 5S rRNA, and RNase P RNA. Unpublished analyses of rRNA sequence and structure using advanced phylogenetic methods also show that Archaea was the first domain of life.

tRNA molecules are generally short (∼73–95 nucleotides in length) and highly conserved. Consequently, their sequence generally contains a limited amount of phylogenetic information. These limitations have been overcome by analyzing entire tRNomes [136], which extend the length of an organismal set of tRNAs to over 2000 bases. Xue et al. [130, 131] analyzed the genetic distances between tRNA sequences as averages between alloacceptor tRNAs from diverse groups of tRNomes using multiple molecular base substitution models. The distances were mapped onto an unrooted phylogeny of tRNA molecules (Figure 8(a)). The “arrow of time” assumption in these studies is that ancient tRNA paralogs closely resemble each other when lineages originate close to the time of the gene duplication. Remarkably, the results revealed a paraphyletic rooting of the ToL in Archaea. The root was specifically located close to the hyperthermophilic methanogen Methanopyrus kandleri (Figure 8(a)). The hypothesis of this specific rooting scenario has been supported by several other studies [134, 137], including a study of genetic distances between paralogous pairs of aminoacyl-tRNA synthetase (aaRS) proteins [131]. A remarkable match between distance scores of tRNA and pairs of aaRS paralogs (Figure 8(b)) not only confirms the early appearance of Archaea but also suggests a coevolutionary trace associated with molecular interactions that are responsible for the genetic code. In fact, a recent exhaustive phylogenomic analysis of tRNA and aaRS coevolution explicitly reveals the origins and evolution of the genetic code and the underlying molecular basis of genetics [59]. Di Giulio [132, 133] has also proposed an archaeal rooting of the tree of life, specifically in the lineage leading to the phylum of Nanoarchaeota. This rooting is based on unique and ancestral genomic traits of Nanoarchaeum equitans, namely, split genes separately codifying for the 5′ and 3′ halves of tRNA and the absence of operons, which are considered molecular fossils [132]. However, this claim needs additional support, as contrasting evidence now recognizes N. equitans as a highly derived archaeal species [138].
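An alloacceptor distance of this kind can be sketched as an average over all pairs of tRNAs that accept different amino acids. The sketch below is a simplification under stated assumptions: it uses one pre-aligned representative sequence per amino acid and an uncorrected mismatch fraction, whereas the cited studies used multiple base substitution models. The function names and the toy sequences are invented. With 20 amino acids, the number of pairs is 190, the count quoted in the caption of Figure 8.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Uncorrected fraction of mismatched positions in two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)


def d_allo(trnas):
    """Average pairwise distance between alloacceptor tRNAs of one genome.

    trnas: dict mapping an amino acid to one aligned tRNA sequence
    (one representative per amino acid is an assumption of this sketch).
    """
    pairs = list(combinations(sorted(trnas), 2))
    return sum(p_distance(trnas[a], trnas[b]) for a, b in pairs) / len(pairs)


# Toy aligned sequences (invented, far shorter than real tRNAs).
toy_trnas = {"Ala": "GGGG", "Gly": "GGGA", "Ser": "GGAA"}
toy_d_allo = d_allo(toy_trnas)

# Number of alloacceptor pairs for the 20 standard amino acids.
n_pairs = len(list(combinations(range(20), 2)))
```

Under the "arrow of time" assumption stated above, a lower value of this average would be read as a sign of more ancient origin.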

In addition to sequences, structural features of tRNA molecules also support the archaeal rooting of the ToL.

[Figure 8: (a) unrooted ToL of tRNA sequences with tips labeled by three-letter species abbreviations (e.g., Mka, Mja, Neq, Eco, Sce, and Hsa), grouped into Archaea, Bacteria, and Eukarya and colored on a D_allo scale from 0.35 to 0.92. (b) Scatter plot of aminoacyl-tRNA synthetase history (Q_ars) against tRNA alloacceptor history (D_allo) for Archaea, Bacteria, and Eukarya, with regions marked as ancient origins and recent origins.]

Figure 8: Ancient phylogenetic signal in the sequence of tRNA and their associated aminoacyl-tRNA synthetase (aaRS) enzymes. (a) Unrooted ToL derived from tRNA sequences with their alloacceptor D_allo distances traced in thermal scale (from [130]). D_allo is the average of pairwise distances for 190 pairs of tRNA isoacceptors. These distances measure pairwise sequence mismatches of tRNAs for every genome and their values increase for faster evolving sequences of species of more recent origin. (b) Coevolution of aaRSs and their corresponding tRNA. Genetic distances between the top 10 potentially paralogous aaRS pairs estimated using BLASTP define a measure (Q_ars) of how closely the proteins resemble each other in genomes (from [131]). Larger Q_ars scores imply more ancestral and slowly evolving protein pairs. The plot of Q_ars scores against D_allo distances reveals a hidden correlation between the evolution of tRNA and aaRSs and the early origin of Archaea (larger Q_ars and lower D_allo distances).

The application of RNA structural evidence in phylogenetic studies [50, 139–141] has multiple advantages over sequence data when studying ancient events, especially because RNA structures are far more conserved than sequences. This has been demonstrated in a phylogenetic approach that uses RNA structural information to reconstruct the evolutionary history of macromolecules such as rRNA [29, 50], tRNA [107, 108], 5S rRNA [105], RNase P RNA [135], and SINE RNA [142]. Geometrical and statistical properties of structure (e.g., stems or loops commonly found in the secondary structures of RNA molecules) are treated as linearly ordered multistate phylogenetic characters. In order to build rooted trees, an evolutionary tendency toward conformational order is used to polarize change in character state transformations. This defines a hypothetical ancestor with which to root the ingroup using the Lundberg method. Reconstructed phylogenies produce trees of molecules and ToLs (e.g., Figure 4(a)) or trees of substructures that describe the gradual evolutionary accretion of structural components into molecules (Figure 9). For example, phylogenetic trees of tRNA substructures define explicit models of molecular history and show that tRNA originated in the acceptor stem of the molecule [108]. Remarkably, trees reconstructed from tRNA drawn from individual domains of life demonstrate that the sequence of accretion events occurred differently in Archaea than in Bacteria and Eukarya, suggesting a sister group evolutionary relationship between the bacterial and eukaryotic domains (Figure 9(a)). A similar result obtained from trees of 5S rRNA substructures revealed different molecular accretion sequences of archaeal molecules when these were compared to bacterial and eukaryal counterparts [105], confirming again, in a completely different molecular system, the history of the domains of life.
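To illustrate what treating a structural feature as a linearly ordered multistate character implies for tree length, the following sketch counts the minimum number of steps for one ordered (Wagner) character on a fixed binary tree, using the standard interval-based down pass. It is not the authors' pipeline: it does not polarize characters or perform Lundberg rooting, and the tree, tip names, and character states are invented.

```python
def wagner_length(tree, states):
    """Minimum number of steps for one ordered character on a binary tree.

    tree: nested 2-tuples with tip names (str) as leaves.
    states: dict mapping each tip name to an integer character state.
    Moving between states i and j costs |i - j| steps.
    """
    def down_pass(node):
        if isinstance(node, str):
            state = states[node]
            return (state, state), 0
        left, right = node
        (lo_1, hi_1), cost_1 = down_pass(left)
        (lo_2, hi_2), cost_2 = down_pass(right)
        lo, hi = max(lo_1, lo_2), min(hi_1, hi_2)
        if lo <= hi:
            # Child intervals overlap: keep the intersection at no extra cost.
            return (lo, hi), cost_1 + cost_2
        # Disjoint intervals: keep the gap between them and pay its width.
        return (hi, lo), cost_1 + cost_2 + (lo - hi)

    return down_pass(tree)[1]


# Toy example: a stem-length character coded in states 0 to 3 (invented).
toy_tree = (("mol_A", "mol_B"), ("mol_C", "mol_D"))
toy_states = {"mol_A": 0, "mol_B": 2, "mol_C": 3, "mol_D": 3}
toy_steps = wagner_length(toy_tree, toy_states)
```

In this toy case the ordered character requires three steps, whereas the same states treated as unordered would require only two changes, which shows how ordering penalizes jumps between distant states.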

An analysis of the structure of RNase P RNA also provides similar conclusions [135]. While a ToL reconstructed from molecular structure placed type A archaeal molecules at its base (a topology that resembles the ToL of 5S rRNA), a tree of RNase P RNA substructures uncovered the history of molecular accretion of the RNA component of the ancient endonuclease and revealed a remarkable reductive evolutionary trend (Figure 9(b)). Molecules originated in stem P12 and were immediately accessorized with the P1–P4 catalytic pseudoknotted core structure that interacts with RNase P proteins of the endonuclease complex and ancient segments of tRNA. Soon after this important accretion stage, the evolving molecule loses its first stem in Archaea (stem P8), several accretion steps earlier than the first loss of a stem in Eukarya or the first appearance of a Bacteria-specific stem. These phylogenetic statements provide additional strong support to the early origin of the archaeal superkingdom prior to the divergence of the shared common ancestor of Bacteria and Eukarya. As we will discuss below, the early loss of a structure in the molecular accretion process of a central and ancient RNA family is significant. It suggests

[Figure 9: (a) rooted trees of tRNA arm substructures (Acc, AC, DHU, TΨC, Var, and 3′ end) for Eukarya, Bacteria, and Archaea, with bootstrap support values and an age (nd) scale running from ancestral (0) to derived (1). (b) Tree of RNase P RNA stem substructures (P1 to P20) with bootstrap support values, the same ancestral-to-derived age (nd) scale, and annotations marking the ancient universal stems, the universal pseudoknot, the first stem lost in Archaea (type M), the type A pseudoknot (first loss in Eukarya), the first stem specific to Bacteria (type A), the first stems specific to Archaea (type A), and bacterial type B-specific stems; scale bar, 100 changes.]

Figure 9: The history of accretion of tRNA and RNase P RNA substructures reveals the early evolutionary appearance of Archaea. (a) Rooted trees of tRNA arm substructures reveal the early appearance of the acceptor arm (Acc) followed by the anticodon arm (AC) in Archaea or the pseudouridine (TΨC) arm in both Bacteria and Eukarya (from [108]). The result confirms the sister group relationship of Bacteria and Eukarya. (b) Trees of molecular substructures of RNase P RNAs were reconstructed from characters describing the geometry of their structures (from [135]). Branches and corresponding substructures in a 3D atomic model are colored according to the age of each substructure (nd, node distance). Note the early loss of stem P8 in Archaea immediately after the evolutionary assembly of the universal functional core of the molecule.

that the emerging archaeal lineages were subjected to strong reductive evolutionary pressures during the early evolution of a very ancient RNA molecule.

6.3. Phylogenomic Evidence from Protein Domain Structure. G. Caetano-Anollés and D. Caetano-Anollés [41] were the first to utilize protein domain structures as taxa and to reconstruct trees of domains (ToDs) describing their evolution. Figure 10 shows a rooted ToD built from a census of FSF structures in 981 genomes (data taken from [122, 123]). These trees are unique in that their terminal leaves represent a finite set of component parts [87]. These parts describe at global level the structural diversity of the protein world. When building trees of FSFs, the age of each FSF domain structure can be calculated from the ToDs by simply calculating a distance in nodes between the root of the tree and its corresponding terminal leaf. This is possible because ToDs exhibit highly unbalanced (pectinate) topologies that are the result of semipunctuated processes of domain appearance and accumulation. This node distance (nd), rescaled from 0 to 1, provides a relative timescale to study the order of FSF appearance in evolutionary history [126, 143]. Wang et al. [144] showed that nd correlates linearly with geological time and defines a global molecular clock of protein folds. Thus, nd can be used as a reliable proxy for time. Plotting the age of FSFs in each of the seven taxonomic groups confirmed evolutionary statements we had previously deduced from the Venn diagrams of Figure 6. The ABE taxonomic group included the majority of ancient and widely distributed FSFs. This is expected. In the presence of a strong vertical trace, molecular diversity must delimit a nested taxonomic hierarchy. The ABE group was followed by the evolutionary appearance of the BE group, which preceded the first domain-specific structures, which were Bacteria-specific (B). Remarkably, Archaea-specific (A) and Eukarya-specific (E) structures appeared concurrently and relatively late. These general trends, captured in the box plots of Figure 10, have been recovered repeatedly when studying domain structures at various levels of structural complexity, from folds to fold families [91, 126], when using CATH or SCOP structural definitions [122, 145], or when exploring the evolution of terminal GO terms.
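The node distance calculation can be sketched on a small rooted tree. The sketch assumes that nd is obtained by counting the internal nodes on the path from the root to each leaf and then rescaling so that the most basal leaf receives 0 and the most derived leaf receives 1; the exact counting convention of [122] may differ. The tree and the leaf names are invented.

```python
def node_distances(tree):
    """Return a dict mapping each leaf to its rescaled node distance (nd).

    tree: nested tuples with leaf names (str) at the tips.
    Raw distance = number of internal nodes on the path from the root
    to the leaf; values are then min-max rescaled to the range 0 to 1.
    """
    raw = {}

    def walk(node, depth):
        if isinstance(node, str):
            raw[node] = depth
        else:
            for child in node:
                walk(child, depth + 1)

    walk(tree, 0)
    lowest, highest = min(raw.values()), max(raw.values())
    span = highest - lowest
    return {leaf: ((d - lowest) / span if span else 0.0)
            for leaf, d in raw.items()}


# Toy pectinate tree (invented leaf names; "fsf_1" is the most basal leaf).
toy_tod = ("fsf_1", ("fsf_2", ("fsf_3", "fsf_4")))
toy_nd = node_distances(toy_tod)
```

For a tree as deep as a ToD with well over a thousand taxa, an iterative traversal would likely be needed in place of this recursive walk to stay within Python's default recursion limit.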

While the early rise of BE FSFs supports the early divergence of Archaea from the urancestor, the very significant trend of gradual loss of structures occurring in the lineages of the archaeal domain and the very late appearance of Archaea-specific structures (e.g., [126]) demand explanation. Since ancient BE FSFs are widely distributed in proteomes (Figure 10), they cannot arise from separate gains of FSFs in Bacteria and Eukarya or by processes of horizontal spread of structures. This was already evident from the Venn diagrams of Figure 6. Moreover, the BE sisterhood to the exclusion of Archaea was further supported by the inspection of FSFs involved in lipid synthesis and transport (Table 1). Membrane lipids are very relevant to the origins of diversified life ([146] and references therein). Bacteria and Eukarya encode similar lipid membranes, while archaeal membranes have a different lipid composition (isoprenoid ethers). To check if

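[Figure 10: rooted tree of domains (taxa = 1733 FSF domains; characters = 981 genomes; tree length = 458,226 steps; CI = 0.049, RI = 0.796, and g1 = −0.08), with basal FSFs labeled c.55.1, c.37.1, b.40.4, c.2.1, a.4.5, and c.66.1. Taxa are colored by distribution index f in five classes: 0.0 to 0.1 (529 FSFs), 0.1 to 0.5 (623), 0.5 to 0.9 (336), 0.9 to 1.0 (228), and f = 1.0 (17). Box plots titled “Origin of FSF domains” show the age of FSFs (nd, 0.0 to 1.0) for the ABE, BE, AB, AE, B, A, and E taxonomic groups.]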

Figure 10: Phylogenomic tree of domains (ToD) describing the evolution of 1733 FSF domain structures. Taxa (FSFs) were colored according to their distribution (f) in the 981 genomes that were surveyed and used as characters to reconstruct the phylogenomic tree (data taken from [122, 123]). The most basal FSFs are labeled with SCOP alphanumeric identifiers (e.g., c.37.1 is the P-loop containing nucleoside triphosphate hydrolase FSF). Boxplots display the age (nd value) distribution of FSFs for the seven possible taxonomic groups. nd values were calculated directly from the tree [122] and define a timeline of FSF innovation from the origin of proteins (nd = 0) to the present (nd = 1). The group of FSFs that are shared by the three domains of life (ABE) is the most ancient taxonomic group, which spans the entire time axis, and their FSFs are widely distributed in genomes. The appearance of the BE group coincides with the first reductive loss of an FSF in Archaea. FSF structures specific to domains of life appear much later in evolution.

lipid synthesis was another BE synapomorphy, we identified 17 FSFs that were involved in lipid metabolism and transport

Table 1: List of FSFs involved in lipid metabolism and transport along with taxonomic distribution (data taken from [122, 123]).

Group  SCOP Id  FSF Id   FSF description
ABE    89392    b.125.1  Prokaryotic lipoproteins and lipoprotein localization factors
ABE    53092    c.55.2   Creatinase/prolidase N-terminal domain
ABE    49723    b.12.1   Lipase/lipooxygenase domain (PLAT/LH2 domain)
ABE    54637    d.38.1   Thioesterase/thiol ester dehydrase-isomerase
ABE    63825    b.68.5   YWTD domain
BE     47027    a.11.1   Acyl-CoA binding protein
BE     48431    a.118.4  Lipovitellin-phosvitin complex, superhelical domain
BE     55048    d.58.23  Probable ACP-binding domain of malonyl-CoA ACP transacylase
BE     56968    f.7.1    Lipovitellin-phosvitin complex, beta-sheet shell regions
BE     58113    h.5.1    Apolipoprotein A-I
BE     47162    a.24.1   Apolipoprotein
BE     56931    f.4.2    Outer membrane phospholipase A (OMPLA)
B      82220    b.120.1  Tp47 lipoprotein, N-terminal domain
E      47699    a.52.1   Bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin
E      82936    h.6.1    Apolipoprotein A-II
E      57190    g.3.10   Colipase-like
E      49594    b.7.4    Rab geranylgeranyltransferase alpha-subunit, insert domain

(Table 1). Remarkably, the largest share of these FSFs (7 out of 17) was unique to the BE group. In comparison, none were present in either the AE or AB groups. The ABE group included five universal FSFs, while one was unique to Bacteria and four were Eukarya-specific. In turn, no FSF was unique to Archaea. The BE FSFs cannot be explained by modern effects impinging on variations in proteomic accumulation in FSFs or by processes of domain rearrangement, since these appear in the protein world quite late in evolution [92]. The most parsimonious explanation of the patterns of FSF distribution that unfold in the ToD is the very early (and protracted) rise of the archaeal domain by processes of reductive evolution, possibly triggered by the adaptation of urancestral lineages to harsh environments and survival modes. Under extremophilic conditions typical of hyperthermophilic environments, considerable investments of matter-energy and information must be made for protein persistence [120]. This puts limits on viable protein structures [147]. Extremophilic environments will thus poise the maintenance of a limited set of FSFs for persistence of emergent diversified lineages. This would induce a primordial episode of reductive evolution in the growing FSF repertoire, explaining why hyperthermophilic and thermophilic archaeal species hold the most reduced proteomes (Figure 7). It would also explain the biases that exist in FSFs, GO terms, and RNA families (Figure 6) and the placement of hyperthermophilic and thermophilic archaeal species at the base of ToLs. Since Archaea populate the oceans and sometimes rival Bacteria in number in those environments, we further interpret the late appearance of Archaea-specific FSFs as the result of late colonization of these mild environments by both ancient archaeons and emerging eukaryotes. This relaxes primordial extremophilic pressures on protein structures and enables the late archaeal exploration of structural flexibility and functional novelty.

6.4. Full Circle: Evidence from Trees of Proteomes and Functionomes and a Tree Derived from the Distribution of Viral Replicons in Superkingdoms. While we distrust trees of systems, especially ToLs built from sequences, the use of molecular structure at high levels of structural abstraction has the potential to mitigate some limitations of sequence analysis [93]. For example, rooted ToLs built from abundance counts of domain structures and terminal GO terms in the genomes of free-living organisms describe the evolution of proteomes (e.g., [91]). All ToL reconstructions of these kinds approximate the physiology of living organisms, dissect the three primary domains of life, and reveal the early paraphyletic origin of extremophilic archaeal lineages followed by the late appearances of monophyletic Bacteria and Eukarya. These patterns have been reliably recovered with datasets of varying sizes, irrespective of the structural classification scheme [91, 122, 126, 145]. Even a tree reconstructed from the distribution of 2662 viral replicons in superkingdoms from an exhaustive comparative genomic analysis of viral genomes showed the basal placement of Archaea and the sister taxa relationship between Bacteria and Eukarya (Figure 11).

6.5. Additional Evidence from Comparative Genomics. The uneven distribution of protein domain structures in the world

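[Figure 11: rooted tree of Archaea, Bacteria, and Eukarya with a bootstrap support value of 100, vectors of replicon abundance at nodes, and a scale bar of 5 changes.]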

Figure 11: One of three optimal phylogenetic tree reconstructions with identical topologies recovered from an exhaustive maximum parsimony search (58 steps; CI = 1, RI = 1, HI = 0, RC = 1, g1 = −0.707) of the abundance of viral replicon types of dsDNA, ssDNA, dsRNA, ssRNA(+), ssRNA(−), and retrotranscribing viruses. Abundance was scored on a 0–20 scale and ranged from 0 to 759 viral replicons. Vectors of abundance reconstructions in internal nodes are given as percentage of total abundance of replicons in superkingdoms. Bootstrap support values are shown below nodes.

of proteomes (Figure 6) is preserved as we climb in the structural hierarchy. This was recently made evident when studying the evolution of CATH domains [145]. The Venn diagrams of Figure 12 show how domain structures in all taxonomic groups decrease in numbers with increases in evolutionary conservation. At the highest CATH architectural level, there were 32 universal architectures but no domain-specific architectures. The four architectures shared by Bacteria and Eukarya were present in at least 60% of the proteomes that were surveyed. The other two interdomain architectures were topological designs of considerable complexity that were poorly shared between proteomes and were evolutionarily derived. They were the clam architectures lost in Eukarya and the box architectures of nucleotide excision repair shared by Archaea and Eukarya. The most parsimonious corollary of these distribution patterns is that the BE taxonomical group must arise by loss of structures in Archaea. Indeed, ToDs describing the evolution of CATH domain structures again confirm the early appearance of BE structures and, consequently, their loss in Archaea.
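The taxonomic groups of a Venn diagram (ABE, BE, AE, AB, A, B, and E) follow directly from the presence or absence of a structure in each superkingdom, and a statement such as "present in at least 60% of the proteomes" is a simple distribution fraction. A minimal sketch of both calculations follows. It uses made-up proteome names and architecture labels, not the CATH data of [145], and the helper names are hypothetical.

```python
from collections import defaultdict


def venn_group(superkingdoms_present):
    """Return the Venn label (e.g. 'ABE', 'BE', 'A') for a set of
    superkingdom codes drawn from 'A', 'B' and 'E'."""
    return "".join(k for k in "ABE" if k in superkingdoms_present)


def summarize(proteomes):
    """proteomes: dict name -> (superkingdom code, set of structures).

    Returns dict structure -> (Venn label, fraction of proteomes with it).
    """
    kingdoms = defaultdict(set)  # structure -> superkingdoms where it occurs
    hits = defaultdict(int)      # structure -> number of proteomes with it
    for kingdom, structures in proteomes.values():
        for s in structures:
            kingdoms[s].add(kingdom)
            hits[s] += 1
    total = len(proteomes)
    return {s: (venn_group(kingdoms[s]), hits[s] / total) for s in kingdoms}


# Toy data: all names and contents are hypothetical.
toy = {
    "archaeon_1": ("A", {"arch_x", "arch_z"}),
    "bacterium_1": ("B", {"arch_x", "arch_y"}),
    "bacterium_2": ("B", {"arch_x", "arch_y"}),
    "eukaryote_1": ("E", {"arch_x", "arch_y", "arch_z"}),
    "eukaryote_2": ("E", {"arch_x", "arch_y"}),
}
summary = summarize(toy)
for structure, (label, fraction) in sorted(summary.items()):
    print(structure, label, f"{fraction:.2f}")
```

In this toy set, `arch_y` falls in the BE group and is widely distributed, while `arch_z` falls in AE and is poorly shared. That is the kind of contrast the paragraph above draws between the four BE architectures and the two other interdomain ones.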

7. Paraphyletic Origins: Grades and Clades in Archaeal History

Saying that “a ToL is rooted in a domain of life” is an incorrect statement that comes from phylogenetic methodology (the use of outgroups) and the tendency to look at the past with modern eyes. Clades in ToLs have been rooted relative to each other by generating unrooted trees and by defining extant organisms as outgroup taxa. The ToL, however, must be considered rooted in the urancestor of cellular life (Figure 1). This planted edge that connects to the ingroup of extant organisms represents a cellular state in which productive diversification (in the sense of successful lineages) was absent. The primordial urancestral edge leads to a “phase transition,” the last universal cellular ancestor (LUCA), of which little is known. The physiology of the urancestor cannot be considered linked to that of any extant organism, even if it shared a common molecular core with all of them. The urancestor was not an archaeon, a bacterium, or a eukaryote [148]. It was not necessarily thermophilic. Perhaps it was a communal entity or a megametaorganism in the sense of a modern syncytium (the result of multiple cell fusions) and a modern coenocyte (the result of multiple cell divisions). The organismal boundaries were likely present, judging by the number of widely distributed protein domains that associate with membranes and appear at the base of our ToDs and by the universal existence of acidocalcisome organelles [149]. However, the molecular makeup of the urancestral cells was most likely fluid and quasistatistical, the repertoire distributed unequally in the urancestral populations of communal parts, of course within confines delimited by persistence. This urancestral population is therefore consistent with the idea of a primordial stem line proposed by Kandler and Woese [150, 151]. However, it was relatively richer in molecular structures and functions, as opposed to the simple cellular systems hypothesized by Woese. This richness is confirmed by modern analyses of proteomes and functionomes that reveal vast numbers of universally shared protein domains and GOs among the three superkingdoms. While each syncytial/coenocyte element of the megaorganism exchanged component parts in search of cellular stability and persistence, the process could not be equated with modern HGT. The exchanging community of primordial cells was not cohesive enough to make the horizontal exchange meaningful. Macromolecules most likely established loose and diverse associations with each other and with smaller molecules, limited by the short average life of their unevolved structural conformations. With time, molecules with better-optimized properties engaged in more durable interactions, stabilizing the emergent cells and providing increased cellular cohesiveness. This poised the urancestral community towards a phase transition (a crystallization [148]), a point in which cellular groups had distinct properties and could be individuated. We believe this was the time of the origin of the archaeal lineage, 2.9 billion years ago [48].

At the base of the ToLs that were reconstructed from genomic data, basal archaeal taxa arise as paraphyletic lineages (Figures 4(a) and 4(c)). These lineages likely arose from subgroups of the urancestral population that pervasively lost crucial domain structures and molecular functions. This represents an evolutionary grade under the scenario described above. The emerging lineages shared with the urancestral community a unifying condition that was related to archaic biochemistry. In other words, the urancestral and emerging archaeal lineages expressed fundamental structural and functional equivalences in terms of their repertoires but revealed, in each emerging and durable paraphyletic lineage, a handful of distinct, newly developed traits. These traits could be global, such as increased thermostability of some crucial members of the protein repertoire or change in the membrane makeup, or local, such as the selective loss of crucial structures and functions. Figure 13 uses the tree paradigm to portray the structural and functional equivalences of the urancestral and emerging archaeal lineages and the slow progression from grades to clades.
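In tree terms, a grade of basal lineages is paraphyletic because no single clade contains all of its members and nothing else, whereas a monophyletic group corresponds exactly to one clade. The sketch below makes that test explicit on a rooted tree written as nested tuples. The topology and taxon names are hypothetical and only mimic the pattern described above; they are not the trees of Figure 4.

```python
def leaves(node):
    """Return the set of leaf names under a node (leaf = str, clade = tuple)."""
    if isinstance(node, str):
        return {node}
    result = set()
    for child in node:
        result |= leaves(child)
    return result


def is_monophyletic(tree, group):
    """True if some clade of the rooted tree holds exactly the taxa in group."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Hypothetical rooted topology: archaeal taxa branch off one after another
# at the base, and Bacteria and Eukarya form sister clades at the crown.
toy_tree = ("arc1", ("arc2", ("arc3", (("bac1", "bac2"), ("euk1", "euk2")))))

for name, taxa in [
    ("Bacteria", {"bac1", "bac2"}),
    ("Eukarya", {"euk1", "euk2"}),
    ("Archaea", {"arc1", "arc2", "arc3"}),
]:
    print(name, "monophyletic" if is_monophyletic(toy_tree, taxa) else "not monophyletic")
```

On this toy topology the bacterial and eukaryal taxa each form a clade, while the three archaeal taxa do not, which is the grade-like arrangement the paragraph describes.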


Figure 12: Venn diagrams displaying the distributions of 2221 homologous superfamilies, 1152 topologies, and 38 architectures of CATH domains in the proteomes of 492 fully sequenced genomes (from [145]). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

Figure 13: From grades to clades. The cartoon describes a possible progression of modes of organismal diversification during the rise of primordial archaeal lineages. The width of emerging lineages is proportional to uniquely identifying features of physiological and molecular complexity.

8. Through the Wormhole: The Makeup of the “Megaorganism” and the Emerging Archaeal Lineages

Character state reconstructions of proteome repertoires derived from ToLs, coupled to the timelines of ToDs, provide an effective way to define the ancestral protein domain complement of the urancestor [48] and, consequently, the likely makeup of the emerging archaeal lineages. The urancestral proteome possessed a lower bound of ∼70 FSF domain structures, 75% of which were composed of α/β and α+β proteins. About 50% of FSFs were part of metabolic enzymes, including a rich toolkit of transferases and enzymes of nucleotide metabolism. The rest of the domains were involved in functions related to information (translation, replication, and repair), intracellular processes (transport, protein modification, and proteolytic activities), regulation (kinases/phosphatases and DNA binding functions), and small molecule binding. The urancestor had a limited repertoire of aaRSs and translation factors. It contained a primordial ribosome with a limited core of universal ribosomal proteins. It had numerous membrane proteins necessary for transport, including a relatively advanced ATP synthetase complex, and structures necessary for cellular organization (filaments and primordial cytoskeletal structures). The cells lacked enzymes for deoxyribonucleotide production, so it is likely that the cellular urancestor itself did not harbor a DNA genome. The cells lacked functions related to extracellular processes (cell adhesion, immune response, and toxins/defense) and cellular motility, suggesting an ancient living world without competitive strategies of survival.

9. Conclusions

The rooting of the ToL has always been controversial in evolutionary biology [26, 152, 153]. While it is popularly accepted that the ToL based on sequence phylogenies is rooted in the akaryotes and that Archaea and Eukarya are sister groups to each other, only two of the three main steps of phylogenetic analysis [104] have been partially fulfilled with sequences. These are the selection of an appropriate statistical or nonstatistical evolutionary model of character change and of an optimization method for phylogenetic tree reconstruction. However, no adequate method exists for character polarization that identifies ancestral and derived character states in sequences. In the absence of robust polarization methodology, any statement about the rooting of the ToL should be considered suspect or subject to apriorism. Here we show that information derived from a genomic structural and functional census of millions of encoded proteins and RNAs, coupled with process models that comply with Weston’s generality criterion, provides the means to dissect the origins of diversified life. The generality criterion is fulfilled in these studies by focusing on the accumulation of modules such as protein domain structures, elements of RNA substructures, or ontogenetic definitions of molecular function. In general, these features are the subject of accretion processes that comply with additive phylogenetic change within the nested taxonomic hierarchy and result in changes of abundance. These processes include those responsible for the growth of molecules (e.g., multidomain proteins), molecular ensembles (e.g., the ribosome), and molecular repertoires (e.g., proteomes). The new methods unfold a consistent evolutionary scenario in which the origin of diversified life traces back to the early history of Archaea. Remarkably, the archaic origin of this microbial urkingdom now does justice to its name.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgments

The authors would like to thank members and friends of the GCA laboratory for fruitful discussions. This research has been supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791) and the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625) to Gustavo Caetano-Anollés and from KRIBB Research Initiative Program and the Next-Generation BioGreen 21 Program, Rural Development Administration (PJ0090192013), to Kyung Mo Kim.

References

[1] C. R. Woese, “A new biology for a new century,” Microbiology and Molecular Biology Reviews, vol. 68, no. 2, pp. 173–186, 2004.

[2] T. Grant, “Testing methods: the evaluation of discovery operations in evolutionary biology,” Cladistics, vol. 18, no. 1, pp. 94–111, 2002.

[3] E. O. Wiley, “Karl R. Popper, systematics, and classification: a reply to Walter Bock and other evolutionary taxonomists,” Systematic Zoology, vol. 24, no. 2, pp. 233–243, 1975.

[4] J. S. Farris, “Parsimony and explanatory power,” Cladistics, vol. 24, no. 5, pp. 825–847, 2008.

[5] G. Caetano-Anollés, K. M. Kim, and D. Caetano-Anollés, “The phylogenomic roots of modern biochemistry: origins of proteins, cofactors and protein biosynthesis,” Journal of Molecular Evolution, vol. 74, no. 1-2, pp. 1–34, 2012.

[6] W. Fontana, “Modeling “evo-devo” with RNA,” BioEssays, vol. 24, no. 12, pp. 1164–1177, 2002.

[7] W. Hennig, Phylogenetic Systematics, University of Illinois Press, Urbana, Ill, USA, 1999.

[8] D. Fraix-Burnet, T. Chattopadhyay, A. K. Chattopadhyay, E. Davoust, and M. Thuillard, “A six-parameter space to describe galaxy diversification,” Astronomy and Astrophysics, vol. 545, article A80, 24 pages, 2012.

[9] E. Sober, “The contest between parsimony and likelihood,” Systematic Biology, vol. 53, no. 4, pp. 644–653, 2004.

[10] E. K. Lienau and R. DeSalle, “Is the microbial tree of life verificationist?” Cladistics, vol. 26, no. 2, pp. 195–201, 2010.

[11] J. S. Huxley, “Evolutionary processes and taxonomy with special reference to grades,” Uppsala Universitets Årsskrift, vol. 6, pp. 21–39, 1958.

[12] C. R. Woese and G. E. Fox, “Phylogenetic structure of the prokaryotic domain: the primary kingdoms,” Proceedings of the National Academy of Sciences of the United States of America, vol. 74, no. 11, pp. 5088–5090, 1977.

[13] P. Forterre, “The universal tree of life and the last universal cellular ancestor: revolution and counterrevolutions,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 43–62, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[14] J. P. Gogarten, H. Kibak, P. Dittrich et al., “Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 17, pp. 6661–6665, 1989.

[15] N. Iwabe, K. Kuma, M. Hasegawa, S. Osawa, and T. Miyata, “Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 23, pp. 9355–9359, 1989.

[16] J. P. Gogarten and L. Olendzenski, “Orthologs, paralogs and genome comparisons,” Current Opinion in Genetics and Development, vol. 9, no. 6, pp. 630–636, 1999.

[17] C. R. Woese, O. Kandler, and M. L. Wheelis, “Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya,” Proceedings of the National Academy of Sciences of the United States of America, vol. 87, no. 12, pp. 4576–4579, 1990.

[18] J. Martijn and T. J. Ettema, “From archaeon to eukaryote: the evolutionary dark ages of the eukaryotic cell,” Biochemical Society Transactions, vol. 41, no. 1, pp. 451–457, 2013.

[19] C. J. Cox, P. G. Foster, R. P. Hirt, S. R. Harris, and T. M. Embley, “The archaebacterial origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 51, pp. 20356–20361, 2008.

[20] P. G. Foster, C. J. Cox, and T. M. Embley, “The primary divisions of life: a phylogenomic approach employing composition-heterogeneous methods,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2197–2207, 2009.

[21] T. A. Williams, P. G. Foster, T. M. Nye, C. J. Cox, and T. M. Embley, “A congruent phylogenomic signal places eukaryotes within the Archaea,” Proceedings of the Royal Society B: Biological Sciences, vol. 279, no. 1749, pp. 4870–4879, 2012.

[22] S. Gribaldo, A. M. Poole, V. Daubin, P. Forterre, and C. Brochier-Armanet, “The origin of eukaryotes and their relationship with the Archaea: are we at a phylogenomic impasse?” Nature Reviews Microbiology, vol. 8, no. 10, pp. 743–752, 2010.

[23] H. Philippe and P. Forterre, “The rooting of the universal tree of life is not reliable,” Journal of Molecular Evolution, vol. 49, no. 4, pp. 509–523, 1999.

[24] E. Sober and M. Steel, “Testing the hypothesis of common ancestry,” Journal of Theoretical Biology, vol. 218, no. 4, pp. 395–408, 2002.

[25] P. Forterre, “The common ancestor of Archaea and Eukarya was not an archaeon,” Archaea, vol. 2013, Article ID 372396, 18 pages, 2013.

[26] P. Forterre and H. Philippe, “Where is the root of the universal tree of life?” BioEssays, vol. 21, no. 10, pp. 871–879, 1999.

[27] S. Gribaldo and H. Philippe, “Ancient phylogenetic relationships,” Theoretical Population Biology, vol. 61, no. 4, pp. 391–408, 2002.

[28] J. K. Harris, S. T. Kelley, G. B. Spiegelman, and N. R. Pace, “The genetic core of the universal ancestor,” Genome Research, vol. 13, no. 3, pp. 407–412, 2003.

[29] A. Harish and G. Caetano-Anollés, “Ribosomal history reveals origins of modern protein synthesis,” PLoS ONE, vol. 7, no. 3, Article ID e32776, 2012.

[30] G. Caetano-Anollés and M. J. Seufferheld, “The coevolutionary roots of biochemistry and cellular organization challenge the RNA world paradigm,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 152–177, 2013.

[31] N. R. Pace, “Mapping the tree of life: progress and prospects,” Microbiology and Molecular Biology Reviews, vol. 73, no. 4, pp. 565–576, 2009.

[32] P. de Rijk, Y. van de Peer, I. van den Broeck, and R. de Wachter, “Evolution according to large ribosomal subunit RNA,” Journal of Molecular Evolution, vol. 41, no. 3, pp. 366–375, 1995.

[33] G. Caetano-Anollés, “Tracing the evolution of RNA structure in ribosomes,” Nucleic Acids Research, vol. 30, no. 11, pp. 2575–2587, 2002.


[34] J. Mallatt and C. J. Winchell, “Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes,” Molecular Biology and Evolution, vol. 19, no. 3, pp. 289–301, 2002.

[35] M. Gerstein, “Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census,” Proteins: Structure, Function, and Bioinformatics, vol. 33, no. 4, pp. 518–534, 1998.

[36] M. Gerstein and H. Hegyi, “Comparing genomes in terms of protein structure: surveys of a finite parts list,” FEMS Microbiology Reviews, vol. 22, no. 4, pp. 277–304, 1998.

[37] B. Snel, P. Bork, and M. A. Huynen, “Genome phylogeny based on gene content,” Nature Genetics, vol. 21, no. 1, pp. 108–110, 1999.

[38] F. Tekaia, A. Lazcano, and B. Dujon, “The genomic tree as revealed from whole proteome comparisons,” Genome Research, vol. 9, no. 6, pp. 550–557, 1999.

[39] Y. I. Wolf, S. E. Brenner, P. A. Bash, and E. V. Koonin, “Distribution of protein folds in the three superkingdoms of life,” Genome Research, vol. 9, no. 1, pp. 17–26, 1999.

[40] J. O. Korbel, B. Snel, M. A. Huynen, and P. Bork, “SHOT: a web server for the construction of genome phylogenies,” Trends in Genetics, vol. 18, no. 3, pp. 158–162, 2002.

[41] G. Caetano-Anollés and D. Caetano-Anollés, “An evolutionarily structural universe of protein architecture,” Genome Research, vol. 13, no. 7, pp. 1563–1571, 2003.

[42] S. L. Baldauf, A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle, “A kingdom-level phylogeny of eukaryotes based on combined protein data,” Science, vol. 290, no. 5493, pp. 972–977, 2000.

[43] J. R. Brown, C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope, “Universal trees based on large combined protein sequence data sets,” Nature Genetics, vol. 28, no. 3, pp. 281–285, 2001.

[44] F. D. Ciccarelli, T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P. Bork, “Toward automatic reconstruction of a highly resolved tree of life,” Science, vol. 311, no. 5765, pp. 1283–1287, 2006.

[45] T. Dagan, M. Roettger, D. Bryant, and W. Martin, “Genome networks root the tree of life between prokaryotic domains,” Genome Biology and Evolution, vol. 2, no. 1, pp. 379–392, 2010.

[46] S. Jun, G. E. Sims, G. A. Wu, and S. Kim, “Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 133–138, 2010.

[47] E. A. Schultes, P. T. Hraber, and T. H. LaBean, “No molecule is an island: molecular evolution and the study of sequence space,” in Algorithmic Bioprocesses, A. Condon, D. Harel, J. N. Kok, A. Salomaa, and E. Winfree, Eds., pp. 675–704, Springer, Berlin, Germany, 2009.

[48] K. M. Kim and G. Caetano-Anollés, “The proteomic complexity and rise of the primordial ancestor of diversified life,” BMC Evolutionary Biology, vol. 11, no. 1, article 140, 2011.

[49] V. A. Albert, Parsimony, Phylogeny, and Genomics, Oxford University Press, Oxford, UK, 2005.

[50] G. Caetano-Anollés, “Evolved RNA secondary structure and the rooting of the universal tree of life,” Journal of Molecular Evolution, vol. 54, no. 3, pp. 333–345, 2002.

[51] J. A. Lake, R. G. Skophammer, C. W. Herbold, and J. A. Servin, “Genome beginnings: rooting the tree of life,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2177–2185, 2009.

[52] B. Guo, M. Zou, and A. Wagner, “Pervasive indels and their evolutionary dynamics after the fish-specific genome duplication,” Molecular Biology and Evolution, vol. 29, no. 10, pp. 3005–3022, 2012.

[53] M. K. Basu, I. B. Rogozin, O. Deusch, T. Dagan, W. Martin, and E. V. Koonin, “Evolutionary dynamics of introns in plastid-derived genes in plants: saturation nearly reached but slow intron gain continues,” Molecular Biology and Evolution, vol. 25, no. 1, pp. 111–119, 2008.

[54] J. S. Farris, “Phylogenetic analysis under Dollo’s Law,” Systematic Biology, vol. 26, no. 1, pp. 77–88, 1977.

[55] H. Fang, M. E. Oates, R. B. Pethica et al., “A daily-updated tree of (sequenced) life as a reference for genome research,” Scientific Reports, vol. 3, article 2015, 2013.

[56] T. Cavalier-Smith, “The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification,” International Journal of Systematic and Evolutionary Microbiology, vol. 52, no. 1, pp. 7–76, 2002.

[57] T. Cavalier-Smith, “Rooting the tree of life by transition analyses,” Biology Direct, vol. 1, article 19, 2006.

[58] H. S. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Widespread recruitment of ancient domain structures in modern enzymes during metabolic evolution,” Journal of Integrative Bioinformatics, vol. 10, no. 1, article 214, 2013.

[59] G. Caetano-Anollés, M. Wang, and D. Caetano-Anollés, “Structural phylogenomics retrodicts the origin of the genetic code and uncovers the evolutionary impact of protein flexibility,” PLoS ONE, vol. 8, no. 8, Article ID e72225, 2013.

[60] E. Bapteste and C. Brochier, “On the conceptual difficulties in rooting the tree of life,” Trends in Microbiology, vol. 12, no. 1, pp. 9–13, 2004.

[61] A. M. Poole, “Horizontal gene transfer and the earliest stages of the evolution of life,” Research in Microbiology, vol. 160, no. 7, pp. 473–480, 2009.

[62] D. Raoult, “The post-Darwinist rhizome of life,” The Lancet, vol. 375, no. 9709, pp. 104–105, 2010.

[63] W. F. Doolittle, “Phylogenetic classification and the universal tree,” Science, vol. 284, no. 5423, pp. 2124–2128, 1999.

[64] T. Dagan and W. Martin, “The tree of one percent,” Genome Biology, vol. 7, no. 10, article 118, 2006.

[65] W. F. Doolittle and E. Bapteste, “Pattern pluralism and the tree of life hypothesis,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 7, pp. 2043–2049, 2007.

[66] E. V. Koonin, “Towards a postmodern synthesis of evolutionary biology,” Cell Cycle, vol. 8, no. 6, pp. 799–800, 2009.

[67] T. Kloesges, O. Popa, W. Martin, and T. Dagan, “Networks of gene sharing among 329 proteobacterial genomes reveal differences in lateral gene transfer frequency at different phylogenetic depths,” Molecular Biology and Evolution, vol. 28, no. 2, pp. 1057–1074, 2011.

[68] S. Halary, J. W. Leigh, B. Cheaib, P. Lopez, and E. Bapteste, “Network analyses structure genetic diversity in independent genetic worlds,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 127–132, 2010.

[69] S. S. Abby, E. Tannier, M. Gouy, and V. Daubin, “Lateral gene transfer as a support for the tree of life,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 13, pp. 4962–4967, 2012.

[70] N. Glansdorff, Y. Xu, and B. Labedan, “The conflict between horizontal gene transfer and the safeguard of identity: origin of meiotic sexuality,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 470–480, 2009.

[71] C. G. Kurland and O. G. Berg, “A hitchhiker’s guide to evolving networks,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 361–396, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[72] L. P. Villarreal and G. Witzany, “The DNA habitat and its RNA inhabitants: at the dawn of RNA sociology,” Genomics Insights, vol. 6, pp. 1–12, 2013.

[73] S. Gribaldo and C. Brochier, “Phylogeny of prokaryotes: does it exist and why should we care?” Research in Microbiology, vol. 160, no. 7, pp. 513–521, 2009.

[74] B. D. Mishler, “The logic of the data matrix in phylogenetic analysis,” in Parsimony, Phylogeny, and Genomics, V. A. Albert, Ed., pp. 57–69, Oxford University Press, New York, NY, USA, 2005.

[75] A. Dress, V. Moulton, M. Steel, and T. Wu, “Species, clusters and the “tree of life”: a graph-theoretic perspective,” Journal of Theoretical Biology, vol. 265, no. 4, pp. 535–542, 2010.

[76] N. Lane and W. Martin, “The energetics of genome complexity,” Nature, vol. 467, no. 7318, pp. 928–934, 2010.

[77] M. C. Rivera and J. A. Lake, “The ring of life provides evidence for a genome fusion origin of eukaryotes,” Nature, vol. 431, no. 7005, pp. 152–155, 2004.

[78] D. Alvarez-Ponce and J. O. McInerney, “The human genome retains relics of its prokaryotic ancestry: human genes of archaebacterial and eubacterial origin exhibit remarkable differences,” Genome Biology and Evolution, vol. 3, no. 1, pp. 782–790, 2011.

[79] A. Poole and D. Penny, “Eukaryote evolution: engulfed by speculation,” Nature, vol. 447, no. 7147, p. 913, 2007.

[80] E. V. Koonin and Y. I. Wolf, “The fundamental units, processes and patterns of evolution, and the tree of life conundrum,” Biology Direct, vol. 4, no. 1, article 33, 2009.

[81] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Search for a “tree of life” in the thicket of the phylogenetic forest,” Journal of Biology, vol. 8, no. 6, article 59, 2009.

[82] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Seeing the tree of life behind the phylogenetic forest,” BMC Biology, vol. 11, article 46, 2013.

[83] D. B. Wetlaufer, “Nucleation, rapid folding, and globular intrachain regions in proteins,” Proceedings of the National Academy of Sciences of the United States of America, vol. 70, no. 3, pp. 697–701, 1973.

[84] J. S. Richardson, “The anatomy and taxonomy of protein structure,” Advances in Protein Chemistry, vol. 34, pp. 167–339, 1981.

[85] M. Riley and B. Labedan, “Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module,” Journal of Molecular Biology, vol. 268, no. 5, pp. 857–868, 1997.

[86] J. Janin and S. J. Wodak, “Structural domains in proteins and their role in the dynamics of protein function,” Progress in Biophysics and Molecular Biology, vol. 42, pp. 21–78, 1983.

[87] G. Caetano-Anollés, M. Wang, D. Caetano-Anollés, and J. E. Mittenthal, “The origin, evolution and structure of the protein world,” Biochemical Journal, vol. 417, no. 3, pp. 621–637, 2009.

[88] J. Gough, “Convergent evolution of domain architectures (is rare),” Bioinformatics, vol. 21, no. 8, pp. 1464–1471, 2005.

[89] K. Forslund, A. Henricson, V. Hollich, and E. L. L. Sonnhammer, “Domain tree-based analysis of protein architecture evolution,” Molecular Biology and Evolution, vol. 25, no. 2, pp. 254–264, 2008.

[90] S. Yang and P. E. Bourne, “The evolutionary history of protein domains viewed by species phylogeny,” PLoS ONE, vol. 4, no. 12, Article ID e8378, 2009.

[91] K. M. Kim and G. Caetano-Anollés, “The evolutionary history of protein fold families and proteomes confirms that the archaeal ancestor is more ancient than the ancestors of other superkingdoms,” BMC Evolutionary Biology, vol. 12, article 13, 2012.

[92] M. Wang and G. Caetano-Anollés, “The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world,” Structure, vol. 17, no. 1, pp. 66–78, 2009.

[93] G. Caetano-Anollés and A. Nasir, “Benefits of using molecular structure and abundance in phylogenomic analysis,” Frontiers in Genetics, vol. 3, article 172, 2012.

[94] K. Brading and E. Castellani, Symmetries in Physics: Philosophical Reflections, Cambridge University Press, Cambridge, UK, 2003.

[95] J. A. Cotton and M. Wilkinson, “Quantifying the potential utility of phylogenetic characters,” Taxon, vol. 57, no. 1, pp. 131–136, 2008.

[96] E. Mossel and M. Steel, “A phase transition for a random cluster model on phylogenetic trees,” Mathematical Biosciences, vol. 187, no. 2, pp. 189–203, 2004.

[97] M. H. Bailor, X. Sun, and H. M. Al-Hashimi, “Topology links RNA secondary structure with global conformation, dynamics, and adaptation,” Science, vol. 327, no. 5962, pp. 202–206, 2010.

[98] K. Illergård, D. H. Ardell, and A. Elofsson, “Structure is three to ten times more conserved than sequence – a study of structural response in protein cores,” Proteins: Structure, Function, and Bioinformatics, vol. 77, no. 3, pp. 499–508, 2009.

[99] G. Caetano-Anollés, S. K. Hee, and J. E. Mittenthal, “The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 22, pp. 9358–9363, 2007.

[100] P. H. Weston, “Methods for rooting cladistic trees,” in Models in Phylogeny Reconstruction, D. J. Siebert, R. W. Scotland, and D. M. Williams, Eds., pp. 125–155, Oxford University Press, Oxford, UK, 1994.

[101] H. N. Bryant, “The polarization of character transformations in phylogenetic systematics: role of axiomatic and auxiliary assumptions,” Systematic Biology, vol. 40, no. 4, pp. 433–445, 1991.

[102] H. N. Bryant and G. Wagner, “Character polarity and the rooting of cladograms,” in The Character Concept in Evolutionary Biology, G. P. Wagner, Ed., pp. 319–338, Academic Press, New York, NY, USA, 2001.

[103] G. Nelson, “Ontogeny, phylogeny, paleontology, and the biogenetic law,” Systematic Biology, vol. 27, no. 3, pp. 324–345, 1978.

[104] P. H. Weston, “Indirect and direct methods in systematics,” in Ontogeny and Systematics, C. J. Humphries, Ed., pp. 27–56, Columbia University Press, New York, NY, USA, 1988.

[105] F.-J. Sun and G. Caetano-Anollés, “The evolutionary history of the structure of 5S ribosomal RNA,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 430–443, 2009.

[106] X. Fang, T. Pan, and T. R. Sosnick, “Mg2+-dependent folding of a large ribozyme without kinetic traps,” Nature Structural Biology, vol. 6, no. 12, pp. 1091–1095, 1999.

[107] F.-J. Sun and G. Caetano-Anollés, “Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses,” PLoS Computational Biology, vol. 4, no. 3, Article ID e1000018, 2008.

[108] F.-J. Sun and G. Caetano-Anollés, “The origin and evolution of tRNA inferred from phylogenetic analysis of structure,” Journal of Molecular Evolution, vol. 66, no. 1, pp. 21–35, 2008.

[109] W. R. Ashby, An Introduction to Cybernetics, Taylor & Francis, London, UK, 1955.

[110] J. G. Lundberg, “Wagner networks and ancestors,” Systematic Biology, vol. 21, no. 4, pp. 398–413, 1972.

[111] P. T. Chippindale and J. J. Wiens, “Weighting, partitioning, and combining characters in phylogenetic analysis,” Systematic Biology, vol. 43, no. 2, pp. 278–287, 1994.

[112] J. Mittenthal, D. Caetano-Anollés, and G. Caetano-Anollés, “Biphasic patterns of diversification and the emergence of modules,” Frontiers in Genetics, vol. 3, article 147, 2012.

[113] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[114] C. A. Orengo, A. Michie, S. Jones, D. T. Jones, M. Swindells, and J. M. Thornton, “CATH – a hierarchic classification of protein domain structures,” Structure, vol. 5, no. 8, pp. 1093–1109, 1997.

[115] G. Sawa, J. Dicks, and I. N. Roberts, “Current approaches to whole genome phylogenetic analysis,” Briefings in Bioinformatics, vol. 4, no. 1, pp. 63–74, 2003.

[116] R. Sathyapriya, J. M. Duarte, H. Stehr, I. Filippis, and M. Lappe, “Defining an essence of structure determining residue contacts in proteins,” PLoS Computational Biology, vol. 5, no. 12, Article ID e1000584, 2009.

[117] D. Penny and L. J. Collins, “Evolutionary genomics leads the way,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 1–16, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[118] M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.

[119] A. Andreeva, D. Howorth, J. Chandonia et al., “Data growth and its impact on the SCOP database: new developments,” Nucleic Acids Research, vol. 36, pp. D419–D425, 2008.

[120] L. S. Yafremava, M. Wielgos, S. Thomas et al., “A general framework of persistence strategies for biological systems helps explain domains of life,” Frontiers in Genetics, vol. 4, article 16, 2013.

[121] J. L. Friar, T. Goldman, and J. Pérez-Mercader, “Genome sizes and the Benford distribution,” PLoS ONE, vol. 7, no. 5, Article ID e36624, 2012.

[122] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya,” BMC Evolutionary Biology, vol. 12, article 156, 2012.

[123] A. Nasir and G. Caetano-Anollés, “Comparative analysis of proteomes and functionomes provides insights into origins of cellular diversification,” Archaea, vol. 2013, Article ID 648746, 13 pages, 2013.

[124] W. Martin and M. Müller, “The hydrogen hypothesis for the first eukaryote,” Nature, vol. 392, no. 6671, pp. 37–41, 1998.

[125] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Global patterns of protein domain gain and loss in superkingdoms,” PLoS Computational Biology, vol. 10, no. 1, Article ID e1003452, 2014.

[126] MWang L S Yafremava D Caetano-Anolles J E Mittenthaland G Caetano-Anolles ldquoReductive evolution of architectural

repertoires in proteomes and the birth of the tripartite worldrdquoGenome Research vol 17 no 11 pp 1572ndash1585 2007

[127] L Guy and T J G Ettema ldquoThe archaeal ldquoTACKrdquo superphylumand the origin of eukaryotesrdquoTrends inMicrobiology vol 19 no12 pp 580ndash587 2011

[128] D Schlieper M A Oliva J M Andreu and J Lowe ldquoStructureof bacterial tubulin BtubAB evidence for horizontal genetransferrdquo Proceedings of the National Academy of Sciences of theUnited States of America vol 102 no 26 pp 9170ndash9175 2005

[129] M P Hoeppner P P Gardner and A M Poole ldquoComparativeanalysis of RNA families reveals distinct repertoires for eachdomain of liferdquo PLoS Computational Biology vol 8 no 11Article ID e1002752 2012

[130] H Xue K Tong C Marck H Grosjean and J T WongldquoTransfer RNA paralogs evidence for genetic code-amino acidbiosynthesis coevolution and an archaeal root of liferdquoGene vol310 no 1-2 pp 59ndash66 2003

[131] H Xue SNg K Tong and J TWong ldquoCongruence of evidencefor aMethanopyrus-proximal root of life based on transfer RNAand aminoacyl-tRNA synthetase genesrdquo Gene vol 360 no 2pp 120ndash130 2005

[132] MDiGiulio ldquoNanoarchaeum equitans is a living fossilrdquo Journalof Theoretical Biology vol 242 no 1 pp 257ndash260 2006

[133] M Di Giulio ldquoThe tree of life might be rooted in the branchleading to Nanoarchaeotardquo Gene vol 401 no 1-2 pp 108ndash1132007

[134] J T Wong J Chen W Mat S Ng and H Xue ldquoPolyphasicevidence delineating the root of life and roots of biologicaldomainsrdquo Gene vol 403 no 1-2 pp 39ndash52 2007

[135] F-J Sun and G Caetano-Anolles ldquoThe ancient history of thestructure of ribonuclease P and the early origins of ArchaeardquoBMC Bioinformatics vol 11 article 153 2010

[136] C Marck and H Grosjean ldquotRNomics analysis of tRNA genesfrom 50 genomes of Eukarya Archaea and bacteria revealsanticodon-sparing strategies and domain-specific featuresrdquoRNA vol 8 no 10 pp 1189ndash1232 2002

[137] W Mat H Xue and J T Wong ldquoThe genomics of LUCArdquoFrontiers in Bioscience vol 13 no 14 pp 5605ndash5613 2008

[138] C Brochier S Gribaldo Y Zivanovic F Confalonieri andP Forterre ldquoNanoarchaea representatives of a novel archaealphylum or a fast-evolving euryarchaeal lineage related toThermococcalesrdquo Genome Biology vol 6 no 5 article R422005

[139] B Billoud M Guerrucci M Masselot and J S DeutschldquoCirripede phylogeny using a novel approach molecular mor-phometricsrdquoMolecular Biology and Evolution vol 17 no 10 pp1435ndash1445 2000

[140] L J Collins V Moulton and D Penny ldquoUse of RNA secondarystructure for studying the evolution of RNase P and RNaseMRPrdquo Journal of Molecular Evolution vol 51 no 3 pp 194ndash2042000

[141] G Caetano-Anolles ldquoNovel strategies to study the role ofmutation and nucleic acid structure in evolutionrdquo Plant CellTissue and Organ Culture vol 67 no 2 pp 115ndash132 2001

[142] F-J Sun S Fleurdepine C Bousquet-Antonelli G Caetano-Anolles and J Deragon ldquoCommon evolutionary trends forSINE RNA structuresrdquo Trends in Genetics vol 23 no 1 pp 26ndash33 2007

[143] D Caetano-Anolles K M Kim J E Mittenthal and GCaetano-Anolles ldquoProteome evolution and the metabolic ori-gins of translation and cellular liferdquo Journal of MolecularEvolution vol 72 no 1 pp 14ndash33 2011


[144] M. Wang, Y. Jiang, K. M. Kim et al., “A universal molecular clock of protein folds and its power in tracing the early history of aerobic metabolism and planet oxygenation,” Molecular Biology and Evolution, vol. 28, no. 1, pp. 567–582, 2011.

[145] S. A. Bukhari and G. Caetano-Anollés, “Origin and evolution of protein fold designs inferred from phylogenomic analysis of CATH domain structures in proteomes,” PLoS Computational Biology, vol. 9, no. 3, Article ID e1003009, 2013.

[146] M. Vesteg and J. Krajčovič, “The falsifiability of the models for the origin of eukaryotes,” Current Genetics, vol. 57, no. 6, pp. 367–390, 2011.

[147] I. N. Berezovsky and E. I. Shakhnovich, “Physics and evolution of thermophilic adaptation,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 36, pp. 12742–12747, 2005.

[148] C. Woese, “The universal ancestor,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 12, pp. 6854–6859, 1998.

[149] M. J. Seufferheld and G. Caetano-Anollés, “Phylogenomics supports a cellularly structured urancestor,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 178–191, 2013.

[150] O. Kandler, “Cell wall biochemistry and three-domain concept of life,” Systematic and Applied Microbiology, vol. 16, no. 4, pp. 501–509, 1994.

[151] C. R. Woese, “On the evolution of cells,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 13, pp. 8742–8747, 2002.

[152] S. L. Baldauf, J. D. Palmer, and W. F. Doolittle, “The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny,” Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 15, pp. 7749–7754, 1996.

[153] H. Leffers, J. Kjems, L. Østergaard, N. Larsen, and R. A. Garrett, “Evolutionary relationships amongst archaebacteria. A comparative study of 23 S ribosomal RNAs of a sulphur-dependent extreme thermophile, an extreme halophile and a thermophilic methanogen,” Journal of Molecular Biology, vol. 195, no. 1, pp. 43–61, 1987.

[154] J. Martin, D. Blackburn, and E. Wiley, “Are node-based and stem-based clades equivalent? Insights from graph theory,” PLoS Currents, vol. 2, Article ID RRN1196, 2010.


[Figure 2 graphic. Ribosomal history labels: primordial ribosome, mature PTC, processive ribosome, mature ribosome, present-day ribosome, ribosomal core. Timeline labels: origin of proteins, urancestor (markers 1, 2, 3), diversified life, the present. Axes: molecular age (nd); history of life (time, billions of years: 3.8, 3.0, 2.0, 1.0, 0).]

Figure 2: The evolutionary history of the ribosome was traced onto the three-dimensional structure of its rRNA and r-protein components. Ages of components were colored with hues from red (ancient) to blue (recent). Phylogenomic history shows that primordial metabolic enzymes preceded RNA-protein interactions and the ribosome. The timeline of life, derived from a universal phylogenetic tree of protein domain structure at fold superfamily level of complexity, is shown with time flowing from left to right and expressed in billions of years according to a molecular clock of fold structures. The ribosomal timeline highlights major historical events of the molecular ensemble. History of the conserved ribosomal core [29] overlaps with that of diversified life [48], suggesting that episodes of cooption and lateral transfer have pervaded early ribosomal evolution.

were used to build universal network trees from gene families defined by reciprocal best BLAST hits [45]. These networks showed a midpoint rooting of the ToL between Bacteria and Archaea. However, this rooting involves a complex optimization of path lengths in the split networks and critically assumes that lineages evolved at roughly similar rates. This diminishes confidence in the midpoint rooting, especially when considering the uncertainties of distances inferred from BLAST analyses and the fact that domains in genes hold different histories and rates of change. Another promising but ill-conceptualized case is the rooting of distance-based trees inferred by studying the frequency of l-mer sets of amino acids in proteins at the proteome level [46]. The compositional data generated a ToL that was rooted in Eukarya when randomized proteome sequences were used as outgroups. The assumption of randomness associated with the root of a ToL is, however, unsupported or probably wrong, especially because protein sequence space and its mappings to structure are far from random [47]. More importantly, a large fraction of modern proteins had already evolved protein fold structures when the first diversified lineages arose from the universal common ancestor of cellular life [48]. These evolved repertoires cannot be claimed to be random.
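The dependence of midpoint rooting on roughly equal rates can be illustrated with a minimal Python sketch. It is not the split-network procedure of [45]; it assumes a toy unrooted tree with hypothetical taxa A to D, hypothetical internal nodes X and Y, and made-up branch lengths, and it simply places the root halfway along the longest tip-to-tip path.

```python
from collections import defaultdict

def midpoint_root(edges):
    """Locate the midpoint root of an unrooted tree.

    edges: list of (node_a, node_b, branch_length).
    Returns (u, v, offset): the root sits on edge u-v, `offset` away from u.
    """
    adj = defaultdict(list)
    for a, b, w in edges:
        adj[a].append((b, w))
        adj[b].append((a, w))

    def farthest(start):
        # Depth-first walk recording distance and parent of every node.
        dist, parent = {start: 0.0}, {start: None}
        stack = [start]
        while stack:
            n = stack.pop()
            for m, w in adj[n]:
                if m not in dist:
                    dist[m] = dist[n] + w
                    parent[m] = n
                    stack.append(m)
        return max(dist, key=dist.get), dist, parent

    a, _, _ = farthest(next(iter(adj)))   # one end of the longest path
    b, dist, parent = farthest(a)         # the other end, distances from a
    half = dist[b] / 2.0
    node = b
    while dist[parent[node]] > half:      # walk back from b towards a
        node = parent[node]
    u = parent[node]
    return u, node, half - dist[u]

# Hypothetical toy trees (illustrative branch lengths only).
balanced = [("A", "X", 1), ("B", "X", 1), ("X", "Y", 1), ("C", "Y", 1), ("D", "Y", 1)]
skewed = [("A", "X", 1), ("B", "X", 1), ("X", "Y", 1), ("C", "Y", 5), ("D", "Y", 1)]

print(midpoint_root(balanced))  # root falls on the internal X-Y edge
print(midpoint_root(skewed))    # root is pulled onto the long C branch
```

With equal branch lengths the root falls on the internal edge separating (A, B) from (C, D). Lengthening only the branch leading to C, as a fast-evolving lineage would, moves the inferred root onto that branch even though the topology is unchanged.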

The inability of genomics to provide clear answers to rooting questions promoted (to some extent) parsimony thinking and the exploration of evolutionary differences of significance that could act as “anchors” and could impart “polarity” to tree statements [49]. As we will describe below, we applied parsimony thinking to the evolution of protein and RNA structures [41, 50], and for over a decade we have been generating rooted ToLs that portray the evolution of proteomes with increasing explanatory power. This work, however, has been for the most part unacknowledged. Instead, molecular evolutionists focused almost exclusively on molecular sequences in their search for solutions to the ToL problem. For example, analysis of genomic insertions and deletions (indels) that are rare in paralogous gene sets rooted the ToL between the group of Actinobacteria and Gram-negative bacteria and the group of Firmicutes and Archaea [51]. Unfortunately, little is known about the dynamics of indel generation, its biological role in gene duplication, and its influence on the structure of paralogous protein pairs (e.g., [52]). If the dynamics are close to that of sequences, indels should be considered subject to all the same limitations of sequence analyses, including long branch attraction artifacts, character independence, and inapplicable characters. Given a tree topology, parsimony was also used to optimize the evolutionary transformation of genomic features. For example, when considering the occurrence of protein domains that are abundant, variants of domain structures accumulate gradually in genomes; operationally, the domain structure per se cannot be lost once it has been gained. This enabled the use of a variant of the unrooted Dollo algorithmic method (e.g., [53]). The Dollo parsimony model [54] is based on the assumption that when a biological feature that is very complex is lost in evolution, it cannot be regained through vertical descent. The method, however, could not be applied to akaryotic genomes, as these are subject to extensive lateral gene transfer, and until recently [55] was not extended to ToL reconstructions.
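The Dollo assumption can be made concrete with a short Python sketch. This is a simplified rooted version written for illustration, not the unrooted variant used in [53]; the tree (nested tuples) and taxon names are hypothetical. A feature is gained once, at the most recent common ancestor of all taxa that hold it, and every absence inside that clade is explained by a loss.

```python
def leaves(tree):
    """Return the set of tip names under a node (tips are strings)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def dollo_events(tree, present):
    """Dollo reconstruction of one binary character on a rooted tree.

    present: taxa in which the feature is observed.
    Returns (clade_of_single_gain, number_of_losses).
    """
    present = set(present)
    # Descend to the most recent common ancestor of all feature holders.
    node = tree
    while not isinstance(node, str):
        holders = [c for c in node if leaves(c) & present]
        if len(holders) != 1:
            break
        node = holders[0]

    def losses(sub):
        if not leaves(sub) & present:
            return 1          # whole subtree lacks the feature: one loss
        if isinstance(sub, str):
            return 0
        return sum(losses(c) for c in sub)

    return leaves(node), losses(node)

# Hypothetical five-taxon tree.
tree = ((("A", "B"), "C"), ("D", "E"))
print(dollo_events(tree, {"A", "C"}))  # gain in clade A-B-C, one loss (B)
print(dollo_events(tree, {"A", "D"}))  # gain at the root, three losses
```

The second call shows why lateral transfer is problematic for the model: a feature shared by two distant taxa is forced back to the root and paid for with multiple losses, because an independent second acquisition is not allowed.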

Parsimony thinking was also invoked in “transition analysis,” a method that attempts to establish polarity of character change by, for example, examining homologies of proteins in proteolytic complexes such as the proteasome, membrane and cell envelope biochemical makeup, and body structures of flagella, sometimes aided by BLAST queries [56, 57]. The elaboration, however, is restricted to the few molecular and cellular structures that are analyzed, out of thousands that populate the akaryotic and eukaryotic cells. The approach is local, has not yet made use of an objective analytic phylogenetic framework, and does not weigh gains, losses, and transfers with algorithmic implementations. Thus, many statements can fall prey to incorrect optimizations or processes of convergence of structure and function, including cooptions that are common in metabolic enzymes [58]. Transitions also fail to consider the molecular makeup of the complexes examined (e.g., evolution of the photosynthetic reaction centers), which often hold domains with heterogeneous histories in the different organismal groups. For example, the F1Fo ATP synthetase complex that powers cellular processes has a long history of accretion of domains that spans almost 3.8 billion years, which is even older than that of the ribosome [59]. Finally, establishing the validity of evolutionary transitions in polarization schemes can be highly problematic: each transition that is studied requires well-grounded assumptions [60], some of which have not yet been properly elaborated.

The inability to solve the rooting problem and the insistence on extracting deep phylogenetic signal from molecular sequences that are prone to mutational saturation raised skepticism about the possibility of ever finding the root of the ToL. Bapteste and Brochier [60] made explicit the conceptual difficulties, claiming that scientists in the field had adopted Agrippa's logic of doubt. Misunderstandings on how to conduct ideographic inquiry in evolutionary genomics had effectively blocked the re-erection of a Tower of Babel that would explain the diversification of genomes (Figure 3). Instead, there was “Confusion of Doubts.” Unfortunately, the impasse was aggravated by aprioristic tendencies inherited from systematic biology [3, 10], disagreements about the evolutionary role of vertical and lateral inheritance and the problem of homology [61], and, currently, disagreements about the actual role of Darwinian evolution in speciation and the ToL problem [13, 62].

During the decade of evolutionary genomic discovery, the effects of HGT on phylogeny [63] took center stage. HGT was invoked as an overriding process. However, little attention was paid to alternative explanations, such as differential loss of gene variants, ancient or derived, and there was little concern for other sources of reticulation. HGT is certainly an important evolutionary process that complicates the “tree” concept of phylogenetic analysis and must be carefully studied (e.g., [64–66]). In some bacterial taxonomic groups, such as the proteobacteria, HGT was found to be pervasive and challenged the definition of species [67]. Cases like these prompted the radical suggestion that the ToL should be abandoned and that a “web of life” should be used to describe the diversification of microbes and multicellular organisms [68]. However, the problem has two aspects that must be separately considered.

(i) Mechanics. The widespread and impactful nature of HGT must be addressed. Does its existence truly compromise the validity of phylogenetic tree statements? While HGT seems important for some bacterial lineages [69], its evolutionary impacts in Archaea and Eukarya are not as extensive (e.g., [61, 70]) and its global role can be contested [71]. In turn, the role of viruses and RNA agents in genetic exchange continues to be understudied (e.g., [72]) and could be a crucial source of reticulation. Furthermore, the relationship between differential gene loss and HGT has not been adequately formalized, especially for genes that are very ancient. Establishing patterns of loss requires establishing polarity of change and rooted trees, which, as we have discussed earlier, remains unattained for ToLs reconstructed from molecular sequences.

Figure 3: The “Confusion of Tongues,” an engraving by Gustave Doré (1832–1883), was modified by D. Caetano-Anollés to portray the event of diversification that halted the construction of the Tower of Babel. In linguistics, the biblical story inspired tree thinking. We take the metaphor as the fall of the urancestor of cellular life and the replacement of “scala naturae” by branching processes of complexity.

(ii) Interpretation. HGT generally materializes as a mismatch between histories of genes in genomes and histories of organisms. Since genomes are, by functional definition, collections of genes, reticulations affect the history of the genes and not of the organisms, which, given evolutionary assumptions, result from hierarchical relationships (e.g., [73]). Thus, and at first glance, reticulation processes (HGT, gene recombination, gene duplications, and gene loss) do not obliterate vertical phylogenetic signal. The problem, however, remains complex. A ToL can be considered an ensemble of lineages nested within each other [74]. Protein domain lineages will be nested in gene lineages, gene lineages will be nested in lineages of gene families, gene families will be nested in organismal lineages, organisms will be nested in lineages of higher organismal groupings (populations, communities, ecosystems, and biomes), and so on. All lineage levels are defined by biological complexity and the hierarchical organization of life and follow a fractal pattern distribution, each level contributing vertical and lateral phylogenetic signal to the whole. However, sublineages in one hierarchical level may hold different histories compared to the histories of higher and lower levels. Historical mismatches introduce, for example, lineage sorting and reticulation problems that represent a complication to phylogenetic analysis. For example, mismatches between gene and organismal phylogenies exist in the presence of differential sorting of genomic components due to, for example, species extinctions or genomic rearrangements. These mismatches violate the fundamental cladistic assumption that history follows a branching process but can be explained by homoplasy (convergent evolution), the horizontal “trace” of the homology-homoplasy yin-yang of phylogenetic signal. The problem of homoplasy cannot be solved by conceptually preferring one particular level of the hierarchy [74]. A focus on the organismal level, for example, will not solve the problems of hybridization that are brought by strict or relaxed sexual reproduction or viral-mediated genetic exchanges, or the problems of ancient events of fusions or endosymbiosis. However, homoplasy and the optimization of characters and trees of cladistic analysis provide a rigorous framework to discover the magnitude and source of nonvertical processes in evolution. Thus, cladistics or the ToL are not invalidated by network-like signals. In fact, recent discrete mathematical formulations that test the fundamental axioms of tree construction have also proven that the bifurcating history of trees is preserved despite evolutionary reticulations [75]. While there is no “confusion of doubts” at this level, the babel of confusion has not stopped.

Genomics revealed evolutionary patchiness, which incited hypotheses of chimeras and fusions. Phylogenies of genes were found to be highly discordant. Organisms that were being sequenced shared very few gene sequences and, more troublingly, gene trees that were generated possessed different topologies, especially within akaryotic organisms. With time, the number of nearly universal genes decreased and the patchiness and discordancy increased. The comparative genomic patchwork of sequence homologies showed that there were groups of genes that were only shared by certain domains of life. Thus, the makeup of genomes appeared chimeric. A number of hypotheses of chimeric origins of eukaryotes were proposed based on the fact that eukaryotes shared genes expressing sister group phylogenetic relationships with both Bacteria and Archaea [64, 76]. One that is notable is the rise of eukaryotes from a “ring of life” fusion of an archaeon and a bacterium (e.g., [77]), which in a single blow defeated both the tree-like and the tripartite nature of life. Under this school, homology searches with BLAST against a database of ∼38 million akaryotic sequences made it possible to assign archaeal, bacterial, or ambiguous ancestries to genes in the human genome and to explain homology patterns as relics of the akaryotic ancestors of humans [78]. Remarkably, archaeal genes tended to be involved in informational processes, encoded shorter and more central proteins, and were less likely to be involved in heritable human diseases. The chimeric origin of eukaryotes by fusions has been rightfully contested: there is no proper evidence supporting its existence [79]. Unfortunately, chimerism opened a flood of speculations about reticulation. The history of life was seen by many through the lens of a “forest” of gene histories [80]. Reticulations, and ultimately HGT, were forced to explain chimeric patterns. Fusion hypotheses diverted the central issue of the rooting of the ToL and prompted the separate analysis of eukaryotic origins and akaryotic evolution.

Dagan and Martin [64] charged that the ToL was a “tree of one percent” because only a small fraction of sequences could be considered universal and could be mined for deep phylogenetic signal. The rest would account for lateral processes that confounded vertical descent. The claim came fundamentally from networks constructed using BLAST heuristic searches for short and strong matches in genomic sequences. Phylogenetic “forests” of akaryotic genes later boosted the claim [81]. These networks were built from 6,901 trees of genes using maximum likelihood methods. Only 102 of these trees were derived from nearly universal genes and contributed very little vertical phylogenetic signal. The initial conclusion was striking: “the original tree of life concept is obsolete: it would not even be a tree of one percent” [81]. However, the fact that vertical signal was present in the forests merited reevaluation: “replacement of the ToL with a network graph would change our entire perception of the process of evolution and invalidate all evolutionary reconstruction” [82]. We note that in these studies an ultrametric tree of akaryotes was recovered from vertical signal in a supertree of nearly universal genes. The rooted tree, which was used to simulate clock-like behavior, revealed the early divergence of Nanoarchaeum equitans and then Archaea [81]. While inconsistency of the forest supertree increased at high phylogenetic depths, its associated supernetwork showed there were no reticulations between Archaea and Bacteria. Thus, the deep rooting signal of archaeal diversification may be bona fide and worthy of further study.

While those that value tree thinking have contested on many grounds the idea that the ToL is “obsolete” (e.g., [13]), our take in the debate is simple. Homologies between gene sequences established through the “emperor's BLAST” (sensu [71]) are poor substitutes for phylogenetic tree reconstructions from gene sequences. By the same token, phylogenies reconstructed from sequences are poor substitutes for phylogenies that consider the molecular structure and function of the encoded products. Gene sequences are not only prone to mutational saturation, but they generally come in pieces. These pieces represent evolutionary and structural modules that host the functions of the encoded molecules. Protein domains, for example, are three-dimensional (3D) arrangements of elements of secondary structure that fold autonomously [83], are compact [84], and are evolutionarily conserved [85, 86]. The landscape of evolutionary exploration changes when the role of protein domains in function and evolution is considered [87]. Changes at sequence level, including substitutions, insertions, and deletions, can have little impact on structure and, vice versa, sequence changes in crucial sites can have devastating consequences on function and fitness. However, there is no detailed analysis of historical mismatches between gene sequences and domain structures at the ToL level. Remarkably, while HGT seems rampant at sequence level, its impact at the domain structural level is limited [88]. This makes trees derived from domains effectively “trees of 99 percent” and their use very powerful. The fact that domains diversify mostly by vertical descent (e.g., [89–91]) suggests that gene reticulations simply reflect the pervasive and impactful combinatorial effect of domain rearrangements in proteins [92] and perhaps little else. This important claim must be carefully evaluated. It offers the possibility of dissecting how levels of organization impact processes of inheritance in biology.

5. Out of the Impasse: Parts and Wholes in the Evolving Structure of Systems

The rooting of the ToL is clearly muddled by the highly dynamic nature of change in protein and nucleic acid sequences and by the patchiness and reticulation complexities that exist at gene level. However, it is also possible that the problems of the ToL are ultimately technical. We have made the case that the use of molecular sequences is problematic on many grounds, including mutational saturation, definition of homology of sites in sequence alignments, inapplicable characters, taxon sampling and tree imbalance, and different historical signatures in domains of multidomain proteins [93]. In particular, violation of character independence by the mere existence of atomic structure represents a very serious problem that plagues phylogenetic analysis of sequences. We here present a solution to the impasse. We show that the ToL can be rooted with different approaches that focus on structure and function and that its root is congruently placed in Archaea.

Epistemologically, phylogenetic characters must comply with symmetry breaking and the irreversibility of time [94]. In other words, characters must establish transformational homology relationships and serve as independent evidential statements.

(i) Characters Must Be Homologous and Heritable across Tree Terminal Units (Taxa). Character homology is a central and controversial concept that embodies the existence of historical continuity of information [74]. Characters are “basic” evidential statements that make up columns in data matrices used for tree reconstruction. They are conjectures of perceived similarities that are accepted as fact for the duration of the study, are strengthened by Hennigian reciprocal illumination, and can be put to the test through congruence with other characters as these are fit to the trees. To be useful, characters must be heritable and informative across the taxa rows of the data matrix. This can be evaluated, for example, with the cladistic information content (CIC) measure [95]. Finding informative characters can be particularly challenging when the features that are studied change at a fast pace and when taxa sample a wide and consequently deep phylogenetic spectrum. When building a ToL, the highly dynamic nature of change in the sequence makeup of protein or nucleic acid molecules challenges the ability to retrieve reliable phylogenetic signatures across taxa, even if molecules are universally distributed and harbor evolutionarily conserved regions with deep phylogenetic signal (e.g., rRNA). The reason is that, given enough time, functionally or structurally constrained regions of the sequence will be fixed (will be structurally canalized [6]) and will offer little if any phylogenetically meaningful signal to uncover, for example, the universal rRNA core. In turn, mutational saturation of unconstrained regions will quickly erase history.

The mutational saturation problem was made mathematically explicit by Sober and Steel [24] using “mutual information” and the concept of time as an information destroying process. Mutual information $I(X;Y)$ between two random variables $X$ and $Y$ is defined by

$$I(X;Y) = \sum_{x,y} P(X=x, Y=y)\,\log\left(\frac{P(X=x, Y=y)}{P(X=x)\,P(Y=y)}\right). \tag{1}$$

When $I(X;Y)$ approaches 0, $X$ and $Y$ become independent and no method can predict $X$ from knowledge of $Y$. Importantly, mutual information approaches 0 as the time between $X$ and $Y$ increases in a Markov chain. Regardless of the use of a parsimony, maximum likelihood, or Bayesian-based framework of analysis, $I(X;Y)$ will be particularly small when sequence sites are saturated by too many substitutions due to high substitution rates or large time scales. Ancestral states at interior nodes of the trees cannot be established with confidence from extant information, even in the most optimal situation of knowing the underlying phylogeny and the model of character evolution. Under simple models, the problem is not mitigated by the fact that the number of terminal leaves of trees and the sources of initial phylogenetic information increase with time. Since a phase transition occurs when substitution probabilities exceed a critical value [96], one way out of the impasse is to find features in sufficient number that change at a much slower pace than sequence sites and test if mutual information is significant and overcomes Fano's inequality. These features exist in molecular biology and have been used for phylogenetic reconstruction. They are, for example, the 3D fold structures of protein molecules [87] or the stem modules of RNA structures [97], features that change at a very slow rate when compared to associated sequences. For example, protein 3D structural cores evolve linearly with amino acid substitutions per site and change at 3–10 times slower rates than sequences [98]. This high conservation highlights the evolutionary dynamics of molecular structure. Remarkably, rates of change of proteins performing the same function are maintained by functional constraints but accelerate when proteins perform different functions or contain indels. In turn, fold structural diversity explodes into modular structures at low sequence identities, probably triggered by functional diversification. Within the context of structural conservation, the fact that fold structures are structural and evolutionary modules that accumulate in proteomes by gene duplication and rearrangements and spread in biological networks by recruitment (e.g., [99]) also provides a solution to the problems of vanishing phylogenetic signal. Since fold accumulation increases with time in the Markov chain, mutual information must increase, reversing the “data processing inequality” that destroys information and enabling the retrieval of deep evolutionary information.
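The decay of mutual information along a Markov chain can be checked numerically. The Python sketch below is only an illustration under stated assumptions: it uses a symmetric four-state substitution model with a uniform starting distribution (a Jukes-Cantor-like simplification, not the models analyzed in [24]), an arbitrary per-step change probability `mu`, and base-2 logarithms so that equation (1) is reported in bits.

```python
import math

def mutual_information(joint):
    """Equation (1) for a joint table joint[x][y] = P(X = x, Y = y), in bits."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log2(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )

def joint_after(steps, mu, k=4):
    """Joint distribution of ancestral and descendant states.

    Assumes a symmetric k-state chain in which a site changes with
    probability mu per step, equally to each other state, starting
    from a uniform distribution (illustrative assumption).
    """
    decay = (1.0 - mu * k / (k - 1)) ** steps   # non-unit eigenvalue ** steps
    same = 1.0 / k + (1.0 - 1.0 / k) * decay    # P(state unchanged overall)
    diff = (1.0 - same) / (k - 1)               # P(a given other state)
    return [[(same if i == j else diff) / k for j in range(k)] for i in range(k)]

for t in (0, 10, 100, 1000):
    print(t, mutual_information(joint_after(t, mu=0.01)))
```

At zero steps the descendant state determines the ancestral one and the mutual information equals log2(4) = 2 bits; the printed values then shrink toward 0 as the number of steps grows, which is the saturation effect described above. A slower `mu`, standing in for slowly changing structural features, postpones that loss.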

(ii) Characters Must Show at Least Two Distinct Character States. One of these two states (transformational homologs) must be ancestral (the “plesiomorphic” state) and the other must be derived (the “apomorphic” state) [74]. Only shared and derived features (synapomorphies) provide vertical phylogenetic evidence. Consequently, determining the relative ancestry of alternative character states defines the polarity of character transformations and roots the underlying tree. This is a fundamental property of phylogenetic inference. Polarization in tree reconstruction enables the “arrow of time” (sensu Eddington's entropy-induced asymmetry), solves the rooting problem, and fulfills other epistemological requirements.

Cladistically speaking, character polarity refers to the distinction between the ancestral and derived states and the identification of synapomorphies. However, an evolutionary view of polarity also refers to the direction of character state transformations in the phylogenetic model. Historically, three accepted alternatives have been available for rooting trees [100–102]: the outgroup comparison, the ontogenetic method, and the paleontological or stratigraphic method. While the three methods do not include assumptions of evolutionary process, they have been the subject of much discussion and their interpretation of much controversy. The first two are, however, justified by the assumption that diversity results in a nested taxonomic hierarchy, which may or may not be induced by evolution. We will not discuss the stratigraphic method, as it relies on auxiliary assumptions regarding the completeness of the fossil record. The midpoint rooting criterion mentioned earlier will not be discussed either. The procedure is contested by the existence of heterogeneities in rates of change across the trees and problems with the accurate characterization of phylogenetic distances.

In outgroup comparison, polarity is inferred by the distribution of character states in the ingroup (group of taxa of interest) and the sister group (taxa outside the group of interest). In a simple case, if the character state is only found in the ingroup, it must be considered derived. Outgroup comparison is by far the method of choice because phylogeneticists tend to have confidence in the supporting assumptions: higher-level relationships are outside the ingroup, equivalent ontogenetic stages are compared, and character state distributions are appropriately surveyed. Unfortunately, the method is “indirect” in that it depends on the assumption of the existence of a higher-level relationship between the outgroup and the ingroup. Consequently, the method cannot root the ToL because at that level there is no higher-level relationship that is presently available. Moreover, the method in itself does not polarize characters. It simply connects the ingroup to the rest of the ToL [100].

The ontogenetic criterion confers polarity through the distribution of the states of homologous characters in ontogenies of the ingroup, generally by focusing on the generality of character states, with more widely distributed states being considered ancestral. Nelson’s rule states, “given an ontogenetic character transformation from a character observed to be more general to a character observed to be less general, the more general character is primitive and the less general advanced” [103]. This “biogenetic law” appears powerful in that it depends only on the assumption that ontogenies of ingroup taxa are properly surveyed. It is also a “direct” method that relies exclusively on the ingroup. Consequently, it has the potential to root the ToL. Unfortunately, Nelson’s “generality” has been interpreted in numerous ways, especially as it relates to the ontogenetic sequence, leading to much confusion [102]. It also involves comparison of developmentally nested and distinct life history stages, making it difficult to extend the method (originally conceptualized for vertebrate phylogeny) to the microbial world. However, Weston [100, 104] made it clear that the ontogenetic criterion embodies a wider “generality criterion” in which the taxic distribution of a character state is a subset of the distribution of another. In other words, character states that characterize an entire group must be considered ancestral relative to an alternative state that characterizes a subset of the group. Besides the centrality of nested patterns, the generality criterion embeds the core assumption that every homology is a synapomorphy in nature’s nested taxonomic hierarchy and that homologies in the hierarchy result from additive phylogenetic change [100]. Weston’s more general rule therefore states that “given a distribution of two homologous characters in which one, x, is possessed by all of the species that possess its homolog, character y, and by at least one other species that does not, then y may be postulated to be apomorphous relative to x” [104]. The only assumption of the method is that relevant character states in the ingroup are properly surveyed. This new rule crucially substitutes the concept of ontogenetic transformation by the more general concept of homology and additive phylogenetic change, which can be applied to cases in which homologous entities accumulate “iteratively” in evolution (e.g., generation of paralogous genes by duplication). Since horizontally acquired characters (xenologs) are not considered synapomorphies, they contribute towards phylogenetic noise and are excluded after calculation of homoplasy and retention indices (i.e., measures of goodness of fit of characters to the phylogeny).
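Weston’s rule is, at its core, a statement about nested sets of taxa, and it can be sketched in a few lines. The following is a minimal illustration, not an implementation taken from [100, 104]; the function name, the return labels, and the toy taxon names are hypothetical.

```python
def polarize(taxa_with_x, taxa_with_y):
    """Apply the generality criterion to two homologous character states.

    State y is postulated to be apomorphic (derived) relative to x when x is
    present in every taxon that has y and in at least one taxon that lacks y,
    that is, when the taxa with y form a proper subset of the taxa with x.
    """
    x, y = set(taxa_with_x), set(taxa_with_y)
    if y < x:  # proper subset test
        return "y apomorphic relative to x"
    if x < y:
        return "x apomorphic relative to y"
    # Identical or only partially overlapping distributions are not nested,
    # so the criterion makes no statement about polarity.
    return "undetermined"


if __name__ == "__main__":
    # Toy survey (hypothetical taxa): x is found in all four taxa, y in two.
    print(polarize({"t1", "t2", "t3", "t4"}, {"t3", "t4"}))
```

The sketch makes the single assumption of the method explicit: the answer is only as good as the survey of character states supplied to it.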

We have applied the “generality criterion” to the rooting of the ToL through polarization strategies that embody axioms of evolutionary process. Figure 4 shows three examples. A rooted phylogeny describing the evolution of 5S rRNA molecules sampled from a wide range of organisms was reconstructed from molecular sequence and structure [105]. The ToL that was recovered was rooted paraphyletically in Archaea (Figure 4(a)). The model of character state transformation was based on the axiom that evolved RNA molecules are optimized to increase molecular persistence and produce highly stable folded conformations. Molecular persistence materializes in RNA structure in vitro, with base pairs associating and disassociating at rates as high as 0.5 s−1 [106]. The frustrated kinetics and energetics of this folding process enable some structural conformations to quickly reach stable states. This process is evolutionarily optimized through structural canalization [6], in which evolution attains molecular functions by both increasing the average life and stability of

[Figure 4 graphic not reproduced. Panel annotations: (a) tree of Archaea, Bacteria, and Eukarya with a marked root; taxa = 666 5S rRNA molecules of diverse organisms; characters = sequence and structure of tRNA (as labeled in the panel); CI = 0.072, RI = 0.772, and g1 = −0.13; tree length = 11,381 steps. (b) Timeline from ancestral to derived for groupings of Archaea (A), Bacteria (B), and Eukarya (E). (c) tree of Eukarya, Archaea, and Bacteria with a marked root; taxa = 150 free-living proteomes; characters = genomic abundance of 1,510 FSFs; tree length = 70,132 steps; CI = 0.155, RI = 0.739, and g1 = −0.28.]

Figure 4: Trees of life generated from the structure of RNA and protein molecules congruently show a rooting in Archaea. (a) A rooted phylogenetic tree of 5S rRNA reconstructed from both the sequence and the structure of the molecules (from [105]). (b) Global most-parsimonious scenario of organismal diversification based on tRNA (from [107]). A total of 571 tRNA molecules with sequence, base modification, and structural information were used to build a ToL, which failed to show monophyletic groupings. Ancestries of lineages were then inferred by constraining sets of tRNAs into monophyletic groups representing competing (shown in boxes) or noncompeting phylogenetic hypotheses and measuring tree suboptimality and lineage coalescence (illustrated with color hues in circles). (c) A ToL reconstructed from the genomic abundance counts of 1,510 FSFs as phylogenetic characters in the proteomes of 150 free-living organisms sampled equally and randomly from the three domains of life (data taken from [122, 123]). Taxa were labeled with circles colored according to superkingdom. CI = consistency index, RI = retention index, and g1 = gamma distribution parameter.

selected conformations and decreasing their relative number. Thus, conformational diversity, measured, for example, by the Shannon entropy of the base-pairing probability matrix, or features of thermodynamic stability act as “evo-devo” proxy of a generality criterion for RNA molecules, in which the criterion of similarity (e.g., ontogenetic transformation) is of positional and compositional correspondence. Using a different approach, we recently broadened the use of phylogenetic constraint analysis [107, 108], borrowing from a formal cybernetic method that decomposes a reconstructable system into its components [109], and used it to root a ToL derived from tRNA sequence and structure (Figure 4(b)). The ToL was again rooted in Archaea. The number of additional steps required to force (constrain) particular taxa into a monophyletic group was used to define a lineage coalescence distance (S) with which to test alternative hypotheses of monophyly. These hypotheses were then ordered according to S value in an evolutionary timeline. Since S records the relative distribution of character states in taxic sets, it also embodies the generality criterion of rooting. Finally, we generated rooted ToLs that describe the evolution of proteomes directly from a census of protein fold structures in proteomes (Figure 4(c)). This ToL shows a paraphyletic rooting in Archaea. The method extracts phylogenetic signal from proteomic abundance of protein fold structures and considers that the most abundant and widely distributed folds are of ancient origin when defining transformation series [41]. This polarization scheme, which results in the gradual growth of the proteome repertoire, again represents an embodiment of the generality criterion, in which statements of homology (fold structures) result from additive phylogenetic change (increases in abundance). It is noteworthy that these ToL reconstructions take into account the genomic abundance of each and every fold structure in each proteome and across the entire matrix, thereby generating a frustrated system. In general, very high abundance of only a few folds will not attract taxa (species) to the derived branches in the ToL. The ancestry of taxa is determined by both the abundance and interplay among fold structure characters. For example, metabolic folds such as those involved in ATP hydrolysis are widespread in cells and considered to be very ancient. These are also the most ancient folds in our phylogenies. In comparison, some popular eukaryote-specific folds (e.g., immunoglobulin superfamilies) are highly abundant but appear in a derived manner in our phylogenies. Thus, we reason that there is no circularity involved in the character polarization scheme.

The compositional schemes extend the concept of rooting with paralogous sequences to the entire proteome complements, from gene family level [100] to structural hierarchies. The three examples make use of different rooting rationales but provide a congruent scenario of origins of diversified life. Technically, roots are inferred by the Lundberg method [110], which does not require any outgroup taxa specification. This method roots the trees a posteriori by attaching the hypothetical ancestor to that branch of the unrooted network that would yield minimum increase in the tree length (thus preserving the principle of parsimony).

(iii) Characters Must Serve as Independent Evolutionary Hypotheses. Valid phylogenetic optimization requires that characters be independent pieces of evidence. Characters should not depend on other characters. When the assumption of independence is violated, characters are overweighted in the analysis and the resulting phylogeny fails to represent true history [111]. Possible dependencies could be of many kinds, from structural to functional, from developmental to ecological. These dependencies distort and obscure phylogenetic signal and must be either avoided or coded into the phylogenetic model through parameters or weight corrections.

As we will now elaborate, the problem of character independence is about parts and wholes in the hierarchical fabric of life and in the nested hierarchies of the ToL. Biological systems are by definition made of parts, regardless of the way parts are defined. In evolution, diversification and integration of parts unify parts into cohesive entities, modules, which then diversify [112]. This process and the rise of modules may explain evolutionary waves of complexity and organization and the emergence of structure (defined broadly) in biology that is hierarchical. The hierarchical makeup is made evident in the structure of protein molecules, where lower level parts of the polymer (the amino acid residues) interact with each other and establish cohesive higher-level modular parts, which also establish interaction networks and are crucial for molecular function and for interaction of proteins with the cellular environment. Consequently, the structure of proteins can be described at increasing hierarchical levels of structural abstraction: sequences, motifs, loops, domains, families, superfamilies, topologies, folds, architectures, and classes. Two accepted gold standards of protein classification, structural classification of proteins (SCOP) [113] and class architecture topology homology (CATH) [114], use parts of this incomplete scheme to describe the atomic complexity of the molecules. We note that these classifications do not consider unrealized structural states, such as protein folds that are possible but that have never been identified in the natural world of protein structures. We also note that modules sometimes engage in combinatorial games. For example, protein domains are rearranged in evolution by fusions and fissions, producing the enormous diversity of alternative domain rearrangements that exist in multidomain proteins [92].

Since biological systems evolve and carry common ancestry, parts of these systems by definition evolve and by themselves carry common ancestry. In other words, the histories of parts are embodied in the ensemble of lineages of the ToL. The focus of phylogenetic analysis, however, has been overwhelmingly the organism as the biological system, as testified by the effort devoted to taxonomical classification, systematic biology, and building of the ToL. The genomic revolution has provided a wealth of parts in the amino acid sequence sites of proteins and nucleic acid molecules, which have been used as phylogenetic characters for analysis of molecules representing organisms. This has maintained the focus of reconstructing trees of systems (the wholes). For example, amino acid sites of a protein are generally fit into a data matrix (alignment), which is then used to build trees of genes and organisms (Figure 5(a)) using modern algorithmic implementations [115]. However, and as we have mentioned, proteins have complicated structures that result from interactions between amino acids at the 3D atomic level. These intramolecular interactions, or at least some of them [116], induce protein folding and delimit molecular function and protein stability. They are responsible for protein secondary, supersecondary, domain, and tertiary structure, and by definition their mere existence induces violation of character independence. Penny and Collins [117] proposed the simple thought experiment in which the bioinformatician exchanges rows of sequence sites in the alignment matrix and asks what was lost in the process. Randomization of characters (columns) in the data matrix does not change the phylogenetic tree. However, randomization destroys the structure of the molecule and very likely its function. This confirms that reconstruction of trees from information in sequence sites violates character independence and in the process ignores structure and biology. The effects of violation

[Figure 5 graphic not reproduced. Panel annotations: (a) Canonical paradigm (evolution of systems): characters are parts (amino acid sequence sites); the systems are proteins from Pfu, Mja, and Mba; interactions between parts (e.g., atomic contacts between sites); impact on phylogenetic reconstruction: high. (b) New paradigm (evolution of parts): characters are systems (proteomes of Pfu, Mja, and Mba); the parts are FSF domains c.26.1, c.1.2, and c.37.1; interactions between systems (e.g., proteomes in trophic chains, from picoplankton to higher trophic levels); impact on phylogenetic reconstruction: low.]

Figure 5: A new phylogenetic strategy simplifies the problems of character independence. (a) The canonical paradigm explores the evolution of systems, such as the evolution of organisms in the reconstruction of ToLs. For example, the terminals of phylogenetic trees can be genes sampled from different organisms (e.g., Pyrococcus furiosus (Pfu), Methanococcus jannaschii (Mja), and Methanosarcina barkeri (Mba)), and phylogenetic characters can be amino acid sequence sites of the corresponding gene products. Character states can describe the identity of the amino acid at each site. Since characters are molecular parts that interact with other parts when molecules fold into compact 3D structures, their interaction violates the principle of character independence. Consequently, the effects of covariation must be considered in the phylogenetic model used to build the trees. (b) The new paradigm explores the evolution of parts, such as the evolution of protein domains in proteomes. For example, the terminals of phylogenetic trees can be domains defined at fold superfamily level of structural complexity, and the characters used to build the trees can be proteomes. Character states can be the number of domains holding the FSF structure. Since proteomes interact with other proteomes when organisms establish close interactions, interactions that could affect the abundance of domains in proteomes should be considered negligible (unless there is an obligate parasitic lifestyle involved), and there is no need to budget trophic interactions in the phylogenetic model.

of character independence may be minimal for trees of sequences that are closely related. However, trees describing deep historical relationships require that sequences be divergent, and this maximizes the chances of even wider divergences in molecular structure that are not being accounted for by the models of sequence evolution.
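The column-exchange thought experiment of Penny and Collins [117] can be made concrete with a small sketch. It uses pairwise Hamming distances as a simple stand-in for the input to a tree-building method (an assumption made for brevity; parsimony scores are likewise sums over columns), and the toy alignment and function names are hypothetical. Shuffling the columns leaves every pairwise distance unchanged, even though the order of residues along each molecule, and hence any structural meaning, is scrambled.

```python
import random


def hamming_matrix(alignment):
    """Pairwise count of differing sites for every pair of aligned sequences."""
    names = sorted(alignment)
    return {
        (a, b): sum(c1 != c2 for c1, c2 in zip(alignment[a], alignment[b]))
        for a in names
        for b in names
        if a < b
    }


def shuffle_columns(alignment, seed=0):
    """Apply the same random permutation of sites (columns) to every sequence."""
    length = len(next(iter(alignment.values())))
    order = list(range(length))
    random.Random(seed).shuffle(order)
    return {name: "".join(seq[i] for i in order) for name, seq in alignment.items()}


if __name__ == "__main__":
    # Toy alignment (hypothetical sequences, not data from the article).
    toy = {"sys1": "GLSSAKTV", "sys2": "ALYSAKSV", "sys3": "ALYSGRSV"}
    shuffled = shuffle_columns(toy, seed=1)
    print(shuffled)
    print(hamming_matrix(toy) == hamming_matrix(shuffled))
```

The invariance holds for any permutation because the distance is a sum of independent per-column contributions, which is exactly the independence assumption that folded molecules violate.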

The genomic revolution has also provided a wealth of models of 3D atomic structure. These structures are used as gold standards to assign, with high confidence, structural modules to sequences. As mentioned earlier, proteomes embody collections of protein domains with well-defined structures and functions. Protein domain counts in proteomes have been used to generate trees of protein domains (Figure 5(b)) using standard cladistic approaches and well-established methods (reviewed in [87]). These trees describe the evolution of protein structure at global level. They are effectively trees of parts. While domains interact with each other in multidomain proteins or establish protein-protein interactions with the domains of other proteins, these interactions of parts do not violate character independence. This is because phylogenetic characters are actually proteomes, systems defined by structural states that exist at much higher levels than the protein domain, not far away from the organism level. Remarkably, no information is lost when character columns in the matrix are randomized in the thought experiment. The order of proteome characters in the matrix does not follow any rationale. Characters are not ordered by lifestyles or trophic levels of the organisms. Interactions between free-living organisms will seldom bias their domain makeup, and if so, those characters can be excluded from analysis. Even the establishment of symbiotic or obligate parasitic interactions, such as the nodule-forming symbioses between rhizobia and legumes, may have little impact on character independence as long as the joint inclusion of the host and the symbiont is avoided.

6. Evidence Supporting the Archaeal Rooting of the Universal Tree

Figure 4 shows examples of rooted ToLs generated from the sequence and structure of RNA and protein molecules. The different phylogenomic approaches arrive at a common rooted topology that places the stem group of Archaea at the base of the ToL (the archaeal rooting of Figure 1(c)). However, the evolutionary interrelationship of parts and wholes prompts the use of trees of parts, a focus on higher-level structure, and a decrease in confidence in the power of trees of systems. These concepts have been applied to the study of the history of nucleic acid and protein structures for over a decade and

[Figure 6 graphic not reproduced. Each panel is a three-set Venn diagram for Archaea, Bacteria, and Eukarya, with a second diagram below it restricted to f > 0.6. Region counts recovered from the panels (the assignment of counts to regions is not recoverable here): (a) FSF domains: 23, 38, 786, 360, 324, 164, and 38; at f > 0.6: 2, 33, 453, 126, 56, 1, and 1. (b) GO terms: 1, 11, 526, 852, 272, 162, and 100; at f > 0.6: 0, 3, 232, 35, 15, 5, and 16. (c) RNA families: region counts are not reliably recoverable.]

Figure 6: Venn diagrams displaying the distributions of 1,733 FSF domains (a), 1,924 terminal GO terms (b), and 1,148 RNA families (c) in the genomes of the three domains of life. FSF domain data was taken from Nasir et al. [122, 123] and included 981 completely sequenced proteomes from 70 Archaea, 652 Bacteria, and 259 Eukarya. Terminal GO terms corresponding to the “molecular function” hierarchy defined by the GO database [118] were identified in 249 free-living organisms, including 45 Archaea, 183 Bacteria, and 21 Eukarya (data taken from [123]). The Venn diagram of RNA families and the distribution of Rfam clans and families in organisms were taken from Hoeppner et al. [129] and their Dataset S1. Shown below are distribution patterns for FSFs, GO terms, and RNA families that are present in more than 60% of the organisms examined (f > 0.6). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

provide additional evidence in support of the archaeal rooting scenario.

6.1. Comparative Genomic Argumentation. The distribution of gene-encoded products in the genomes of sequenced organisms and parsimony thinking can reveal global evolutionary patterns without formal phylogenetic reconstruction. We will show how simple numerical analyses of protein domains, molecular activities defined by the gene ontology (GO) consortium [118], and RNA families that are domain-specific or are shared between domains of life can uncover the tripartite division of cellular life, exclude chimeric scenarios of origin, and provide initial insight on the rooting of the ToL (Figure 6).

The number of unique protein folds observed in nature is very small. SCOP ver. 1.75 defines only ∼2,000 fold superfamilies (FSFs), groups of homologous domains unified on the basis of common structural and evolutionary relationships, for a total of 110,800 known domains in proteins [113, 119]. FSFs represent highly conserved evolutionary units that are suitable for studying organismal diversification [87]. Yafremava et al. [120] plotted the total number of distinct FSFs (FSF diversity) versus the average reuse of FSFs in the proteome of an organism (FSF abundance). This exercise uncovered a scaling behavior typical of a Benford distribution, with a linear regime of proteomic growth for microbial organisms and a superlinear regime for eukaryotic organisms (Figure 8 in [120]). These same scaling patterns are observed when studying the relationship between open reading frames and genome size [121]. Remarkably, archaeal and eukaryal proteomes exhibited the minimum and maximum levels of FSF abundance and diversity, respectively. Bacterial proteomes, however, showed intermediate levels. We note that the general scaling behavior is consistent with a scenario in which evolutionary diversification proceeds from simpler proteomes to the more complex ones in gradual manner, supporting the principle of spatiotemporal continuity and revealing the nested phylogenetic hierarchies of organisms. Under this scenario (and results of [120]), the streamlined archaeal proteomes represent the earliest form of cellular life. Remarkably, archaeal species harboring thermophilic and hyperthermophilic lifestyles encoded the most streamlined FSF repertoires (Figure 7). Clearly, modern thermophilic archaeons are most closely related to the ancient cells that inhabited planet Earth billions of years ago (also read below).
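The two quantities behind such a plot can be computed directly from a census of domains in a proteome. The sketch below is only illustrative: it takes FSF use as the number of distinct FSFs present and FSF reuse as the total number of domain instances, which is an assumption (the cited studies may define reuse differently, for example as an average per FSF), and the function name and FSF labels are hypothetical.

```python
import math


def use_and_reuse(domain_counts):
    """Return (log10 FSF use, log10 FSF reuse) for one proteome.

    domain_counts maps an FSF identifier to the number of domains of that
    FSF found in the proteome. Use is the number of distinct FSFs present;
    reuse is taken here as the total number of domain instances.
    """
    present = [n for n in domain_counts.values() if n > 0]
    if not present:
        raise ValueError("proteome has no annotated FSF domains")
    return math.log10(len(present)), math.log10(sum(present))


if __name__ == "__main__":
    # Toy census (hypothetical counts, not data from the article).
    toy_proteome = {"fsf_a": 84, "fsf_b": 54, "fsf_c": 22, "fsf_d": 0}
    log_use, log_reuse = use_and_reuse(toy_proteome)
    print(f"log10(use) = {log_use:.3f}, log10(reuse) = {log_reuse:.3f}")
```

Plotting these pairs for many proteomes on shared axes would give the kind of log-log scatter in which linear and superlinear regimes can be compared.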

FSF domain distributions in the genomes of the three domains of life provide further insights into their evolution. Nasir et al. [122, 123] generated Venn diagrams to illustrate FSF sharing patterns in the genomes of Archaea, Bacteria, and Eukarya (Figure 6(a)). These diagrams display the total number of FSFs that are unique to a domain of life (taxonomic groups A, B, and E), shared by only two (AB, BE, and AE), and

[Figure 7 graphic not reproduced. Scatter plot of FSF use (log) and FSF reuse (log) for thermophilic and nonthermophilic archaeal proteomes. Labeled points: S. marinus, I. hospitalis, H. butylicus, T. pendens, Cand. Korarchaeum, A. pernix, M. kandleri, T. neutrophilus, P. islandicum, and N. maritimus.]

Figure 7: A plot of FSF use (diversity) against FSF reuse (abundance) reveals a linear pattern of proteomic growth in 48 archaeal proteomes. Thermophilic archaeal species occupy positions that are close to the origin of the plot. They also populate the most basal branch positions in ToLs (see discussion in the main text). Both axes are in logarithmic scale.

those that are universal (ABE). About half of the FSFs (786 out of 1,733) were present in all three domains of life, 453 of which were present in at least 60% of the organisms that were examined (f > 0.6, where f is the distribution index ranging from 0 to 1). The fact that about 70% of widely shared FSFs (672) belong to the ABE taxonomic group strongly supports a common evolutionary origin for cells. Evolutionary timelines have confirmed that a large number of these universal FSFs were present in the ancestor of the three domains of life, the urancestor (Figure 1), and were for the most part retained in extant proteomes [48].
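The bookkeeping behind such Venn diagrams can be sketched as follows. This is a minimal illustration rather than the procedure of [122, 123]: it assumes that a feature belongs to a domain of life when at least one surveyed organism of that domain encodes it, and that f is the fraction of surveyed organisms, within the domains that make up the taxonomic group, that encode the feature. The function names and the example counts are hypothetical; only the proteome totals come from the Figure 6 caption.

```python
def taxonomic_group(counts):
    """Name the Venn region (A, B, E, AB, AE, BE, or ABE) of one feature.

    counts maps a domain letter ("A", "B", "E") to the number of organisms
    of that domain in which the feature was detected.
    """
    return "".join(d for d in "ABE" if counts.get(d, 0) > 0)


def distribution_index(counts, totals):
    """Fraction f (0 to 1) of organisms in the group's domains with the feature."""
    group = taxonomic_group(counts)
    if not group:
        return 0.0
    with_feature = sum(counts[d] for d in group)
    examined = sum(totals[d] for d in group)
    return with_feature / examined


if __name__ == "__main__":
    # Proteome totals from the Figure 6 caption: 70 Archaea, 652 Bacteria, 259 Eukarya.
    totals = {"A": 70, "B": 652, "E": 259}
    # Hypothetical feature found in 500 bacterial and 200 eukaryal proteomes.
    feature = {"A": 0, "B": 500, "E": 200}
    f = distribution_index(feature, totals)
    print(taxonomic_group(feature), round(f, 3), f > 0.6)
```

The proportions quoted in the text follow the same arithmetic: 453 of the 672 widely shared FSFs is about 67%, which the text rounds to about 70%.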

Interestingly, the number of BE FSFs was ∼10-fold greater than AE and AB FSFs (324 versus 38 and 38) (Figure 6(a)). The finding that Bacteria and Eukarya encode a significantly large number of shared FSF domains is remarkable and hints towards an unprecedented strong evolutionary association between these two domains of life. Moreover, a significant number of BE FSFs is widespread in the bacterial and eukaryotic proteomes: 56 of the 324 FSFs are shared by more than 60% of the bacterial and eukaryal organisms that were analyzed (Figure 6(a)). This is strong evidence of an ancient vertical evolutionary trace from their mutual ancestor (anticipated in [123]). This trace uniquely supports the archaeal rooting of the ToL. In turn, the canonical and eukaryotic rooting alternatives are highly unlikely, as none alone can explain the remarkable diversity of the BE taxonomic group (as well as its ancient origin [123]). The remarkable vertical trace of the ABE taxonomic group and the negligible vertical trace of the AB group (only one FSF shared by more than 60% of organisms) also falsify fusion hypotheses responsible for a putative chimeric makeup of Eukarya. In light of these findings, the most parsimonious explanation of comparative genomic data is that Archaea is the first domain of diversified cellular life.

The patterns of FSF sharing are further strengthened by the genomic distributions of terminal-level molecular function GO terms (Figure 6(b)). The three domains of life shared about a quarter (526) of the 1,924 GO terms that were surveyed. A total of 232 of these ABE terms were shared by more than 60% of the organisms analyzed. Again, the fact that about 76% of widely shared GO terms (306) belong to the ABE taxonomic group strongly supports a common vertical trace, while the finding that Bacteria and Eukarya share a significantly large number of GO terms (272) supports again the archaeal rooting of the ToL. An alternative explanation for the very large size of BE could be large-scale metabolism-related gene transfer from the ancestors of mitochondria and plastids to the ancestors of modern eukaryotes [124]. However, we note that the BE group is not restricted to only metabolic functions. It also includes FSFs and GOs involved in intracellular and extracellular processes and regulation and informational functions [125]. Thus, the very large size of BE is a significant outcome most likely shaped by vertical evolutionary scenarios and cannot solely be explained by parasitic/symbiotic relationships that exist between Bacteria and Eukarya [123]. Moreover, the simplicity of archaeal FSF repertoires is not due to the paucity of available archaeal genomic data. We confirmed that the mean FSF coverage (i.e., number of proteins annotated to FSFs/GOs out of total) for Archaea, Bacteria, and Eukarya was largely comparable (e.g., Table S1 in [122]). Finally, as we will describe below, these comparative patterns and tentative conclusions are confirmed by phylogenomic analysis [91, 122, 126]. Congruent phylogenies were obtained with (Figure 4(c)) and without equal and random sampling of taxa. Thus, the relatively low number of archaeal genomes is also not expected to compromise our inferences.

We note that Eukarya shares many informational genes with Archaea and many operational genes with Bacteria. While informational genes have been thought more refractory to HGT than noninformational genes, we have confirmed at GO level that this is not the case. A ToL built from GO terms showed that in fact noninformational terms were less homoplasious, while statistical enrichment analyses revealed that HGT had little if any functional preference for GO terms across GO hierarchical levels. We recently compared ToLs reconstructed from non-HGT GO terms and ToLs reconstructed from informational GO terms that were extracted from non-HGT GO terms (Kim et al., ms. in review; also read below). In both cases, Archaea appeared as a basal paraphyletic group of the ToLs and the common origin of Bacteria and Eukarya was maintained. Thus, the 272 GO terms shared by Bacteria and Eukarya harbor a strong vertical trace.

Another observation often used as support for Archaea-Eukarya kinship is the discovery of a few eukaryote-specific proteins (e.g., actin, tubulin, H3, H4, ESCRT, ribosomal proteins, and others) in some archaeal species [127] and their complete absence from Bacteria (except for documented tubulin HGT from eukaryotes to bacterial genus Prosthecobacter [128]). This suggests either that eukaryotes arose from an archaeal lineage [127] or that the ancestor of Eukarya and Archaea was complex and modern Archaea are highly reduced [25]. Indeed, a few eukaryote-specific proteins have now been found in some archaeal species (in some cases just one species). We argue that this poor spread cannot be taken as evidence for the Archaea-Eukarya sister relationship. This needs to be confirmed by robust phylogenetic analysis, which is unfortunately not possible when using protein sequences. Recently, we described a new strategy for inferring vertical and horizontal traces [123]. This method calculates the spread of FSFs and GOs that are shared between two-superkingdom groups (i.e., AB, AE, and BE). Balanced distributions often indicate vertical inheritance, while biased distributions suggest horizontal flux. For example, penicillin binding molecular activity (GO:0008658) was present in 100% of the sampled bacterial proteomes but was only present in 11% of the archaeal species [123]. Thus, presence of GO:0008658 in Archaea was attributed to HGT gain from Bacteria. Using this simple method, we established that both Bacteria and Eukarya were united by much stronger vertical trace than either was to Archaea. In fact, strong reductive tendencies in the archaeal genomes were recorded [123]. Thus, in our opinion, presence of eukaryote-specific proteins in only very few archaeal species could in fact be an HGT event that is not detectable by sequence phylogenies.
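The contrast between balanced and biased distributions can be sketched with a simple ratio. This is an illustration of the idea only, not the procedure of [123]: the function name, the use of a ratio of spreads, and the cutoff of 0.5 are all assumptions introduced here.

```python
def classify_sharing(f_first, f_second, balance_cutoff=0.5):
    """Label a two-superkingdom feature as 'balanced' or 'biased'.

    f_first and f_second are the fractions (0 to 1) of organisms in each
    superkingdom that encode the feature. The balance score is the smaller
    spread divided by the larger one; balance_cutoff is a hypothetical
    threshold chosen only for illustration.
    """
    high, low = max(f_first, f_second), min(f_first, f_second)
    if high == 0:
        return "absent"
    return "balanced" if low / high >= balance_cutoff else "biased"


if __name__ == "__main__":
    # Spreads quoted in the text for penicillin binding (GO:0008658):
    # 100% of sampled bacterial proteomes, 11% of archaeal species.
    print(classify_sharing(1.00, 0.11))
```

Under these assumptions, a strongly biased spread such as the GO:0008658 example would be flagged as a candidate for horizontal gain, while comparable spreads in both superkingdoms would be read as consistent with vertical inheritance.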

In turn, many arguments now favor the Bacteria-Eukarya sisterhood in addition to the structure-based phylogenies and balanced distribution of molecular features. For example, Bacteria and Eukarya have similar lipid membranes that can be used as an argument for their evolutionary kinship. Moreover, Archaea fundamentally differ from Eukarya in terms of their virosphere. Viruses infecting Archaea and Eukarya are drastically different, as recently discussed by Forterre [25].

The genomic distribution of RNA families was taken from Hoeppner et al. [129] and also shows a vertical evolutionary trace in five crucial Rfam clans that are universal, including tRNA, 5S rRNA, subunit rRNA, and RNase P RNA. These universal RNA groups are likely minimally affected by HGT. However, 99% of Rfam clans and families are specific to domains of life, and only 11 of the 1,148 groups were shared at f > 0.6 levels. This clearly shows that the functional complexity of RNA materialized very late during organismal diversification and that it is not a good genomic feature for exploring the rooting of the ToL. Only five RNA families are shared between two domains of life, and only one of these does so at f > 0.6, the G12 pseudoknot of the 23S rRNA, which is present in bacterial and eukaryotic organellar rRNA. While the large subunit rRNA scaffold supports the G12 pseudoknot with its vertical trace, all five interdomain RNA families can be explained most parsimoniously by HGT. Thus, only a handful of ancient and universal RNA species can be used to root the ToL.

We end by noting that the Venn diagrams consistently show that Archaea harbors the smallest number of unique (A) and shared (AB and AE) FSFs, GO terms, or RNA families. This trend supports an early divergence of this domain of life from the urancestor and the possibility that such divergence was shaped by evolutionary reductive events. We reason that such losses would be more parsimonious early on in evolution than in later periods. This is because genes often increase their abundance in evolutionary time (by gene duplications, HGT, and other processes). Thus, it is reasonable to think that the loss of an ancient gene would be more feasible very early in evolution than very late.

6.2. Phylogenomic Evidence from the Sequence and Structure of RNA. Phylogenetic analyses of the few RNA families that are universal and display an important vertical evolutionary trace (Figure 6(c)) provide compelling evidence in favor of an early evolutionary appearance of Archaea [105, 107, 108, 130–135]. Here we briefly summarize evidence from tRNA, 5S rRNA, and RNase P RNA. Unpublished analyses of rRNA sequence and structure using advanced phylogenetic methods also show that Archaea was the first domain of life.

tRNA molecules are generally short (∼73–95 nucleotides in length) and highly conserved. Consequently, their sequence generally contains a limited amount of phylogenetic information. These limitations have been overcome by analyzing entire tRNomes [136], which extend the length of an organismal set of tRNAs to over 2000 bases. Xue et al. [130, 131] analyzed the genetic distances between tRNA sequences as averages between alloacceptor tRNAs from diverse groups of tRNomes using multiple molecular base substitution models. The distances were mapped onto an unrooted phylogeny of tRNA molecules (Figure 8(a)). The “arrow of time” assumption in these studies is that ancient tRNA paralogs closely resemble each other when lineages originate close to the time of the gene duplication. Remarkably, the results revealed a paraphyletic rooting of the ToL in Archaea. The root was specifically located close to the hyperthermophilic methanogen Methanopyrus kandleri (Figure 8(a)). The hypothesis of this specific rooting scenario has been supported by several other studies [134, 137], including a study of genetic distances between paralogous pairs of aminoacyl-tRNA synthetase (aaRS) proteins [131]. A remarkable match between distance scores of tRNA and pairs of aaRS paralogs (Figure 8(b)) not only confirms the early appearance of Archaea but also suggests a coevolutionary trace associated with molecular interactions that are responsible for the genetic code. In fact, a recent exhaustive phylogenomic analysis of tRNA and aaRS coevolution explicitly reveals the origins and evolution of the genetic code and the underlying molecular basis of genetics [59]. Di Giulio [132, 133] has also proposed an archaeal rooting of the tree of life, specifically in the lineage leading to the phylum Nanoarchaeota. This rooting is based on unique and ancestral genomic traits of Nanoarchaeum equitans: split genes separately codifying for the 5′ and 3′ halves of tRNA and the absence of operons, which are considered molecular fossils [132]. However, this claim needs additional support, as contrasting evidence now recognizes N. equitans as a highly derived archaeal species [138].
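The idea of an average alloacceptor distance can be sketched as follows. This is a simplified illustration only: it assumes one pre-aligned, equal-length tRNA sequence per amino acid specificity and uses the plain fraction of mismatched positions, whereas the studies cited above [130, 131] relied on multiple base substitution models. The function names and the toy sequences are hypothetical.

```python
from itertools import combinations
from typing import Dict


def p_distance(a: str, b: str) -> float:
    """Fraction of mismatched positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def d_allo(trnas: Dict[str, str]) -> float:
    """Mean pairwise distance over all pairs of alloacceptor tRNAs."""
    pairs = list(combinations(sorted(trnas), 2))
    return sum(p_distance(trnas[i], trnas[j]) for i, j in pairs) / len(pairs)


# Toy alignment (hypothetical sequences; three specificities only).
toy = {"Ala": "GGGCUA", "Gly": "GGGCUU", "Val": "GGACUA"}
score = d_allo(toy)

# Twenty amino acid specificities give 20 * 19 / 2 = 190 pairs.
n_pairs = len(list(combinations(range(20), 2)))
```

Under the “arrow of time” assumption, a genome whose alloacceptor tRNAs still resemble one another yields a low score of this kind and is read as being of more ancient origin.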

In addition to sequences, structural features of tRNA molecules also support the archaeal rooting of the ToL.

Figure 8: Ancient phylogenetic signal in the sequence of tRNA and their associated aminoacyl-tRNA synthetase (aaRS) enzymes. (a) Unrooted ToL derived from tRNA sequences with their alloacceptor D_allo distances traced in thermal scale (from [130]). D_allo is the average of pairwise distances for 190 pairs of tRNA isoacceptors. These distances measure pairwise sequence mismatches of tRNAs for every genome, and their values increase for faster evolving sequences of species of more recent origin. (b) Coevolution of aaRSs and their corresponding tRNA. Genetic distances between the top 10 potentially paralogous aaRS pairs estimated using BLASTP define a measure (Q_ars) of how closely the proteins resemble each other in genomes (from [131]). Larger Q_ars scores imply more ancestral and slowly evolving protein pairs. The plot of Q_ars scores against D_allo distances reveals a hidden correlation between the evolution of tRNA and aaRSs and the early origin of Archaea (larger Q_ars and lower D_allo distances).

The application of RNA structural evidence in phylogenetic studies [50, 139–141] has multiple advantages over sequence data when studying ancient events, especially because RNA structures are far more conserved than sequences. This has been demonstrated in a phylogenetic approach that uses RNA structural information to reconstruct the evolutionary history of macromolecules such as rRNA [29, 50], tRNA [107, 108], 5S rRNA [105], RNase P RNA [135], and SINE RNA [142]. Geometrical and statistical properties of structure (e.g., stems or loops commonly found in the secondary structures of RNA molecules) are treated as linearly ordered multistate phylogenetic characters. In order to build rooted trees, an evolutionary tendency toward conformational order is used to polarize change in character state transformations. This defines a hypothetical ancestor with which to root the ingroup using the Lundberg method. Reconstructed phylogenies produce trees of molecules and ToLs (e.g., Figure 4(a)) or trees of substructures that describe the gradual evolutionary accretion of structural components into molecules (Figure 9). For example, phylogenetic trees of tRNA substructures define explicit models of molecular history and show that tRNA originated in the acceptor stem of the molecule [108]. Remarkably, trees reconstructed from tRNA drawn from individual domains of life demonstrate that the sequence of accretion events occurred differently in Archaea than in Bacteria and Eukarya, suggesting a sister group evolutionary relationship between the bacterial and eukaryotic domains (Figure 9(a)). A similar result obtained from trees of 5S rRNA substructures revealed different molecular accretion sequences of archaeal molecules when these were compared to bacterial and eukaryal counterparts [105], confirming again, in a completely different molecular system, the history of the domains of life.

An analysis of the structure of RNase P RNA also provides similar conclusions [135]. While a ToL reconstructed from molecular structure placed type A archaeal molecules at its base (a topology that resembles the ToL of 5S rRNA), a tree of RNase P RNA substructures uncovered the history of molecular accretion of the RNA component of the ancient endonuclease and revealed a remarkable reductive evolutionary trend (Figure 9(b)). Molecules originated in stem P12 and were immediately accessorized with the catalytic P1–P4 pseudoknotted core structure that interacts with RNase P proteins of the endonuclease complex and ancient segments of tRNA. Soon after this important accretion stage, the evolving molecule loses its first stem in Archaea (stem P8), several accretion steps earlier than the first loss of a stem in Eukarya or the first appearance of a Bacteria-specific stem. These phylogenetic statements provide additional strong support to the early origin of the archaeal superkingdom prior to the divergence of the shared common ancestor of Bacteria and Eukarya. As we will discuss below, the early loss of a structure in the molecular accretion process of a central and ancient RNA family is significant. It suggests

Figure 9: The history of accretion of tRNA and RNase P RNA substructures reveals the early evolutionary appearance of Archaea. (a) Rooted trees of tRNA arm substructures reveal the early appearance of the acceptor arm (Acc) followed by the anticodon arm (AC) in Archaea or the pseudouridine (TΨC) arm in both Bacteria and Eukarya (from [108]). The result confirms the sister group relationship of Bacteria and Eukarya. (b) Trees of molecular substructures of RNase P RNAs were reconstructed from characters describing the geometry of their structures (from [135]). Branches and corresponding substructures in a 3D atomic model are colored according to the age of each substructure (nd, node distance). Note the early loss of stem P8 in Archaea immediately after the evolutionary assembly of the universal functional core of the molecule.


that the emerging archaeal lineages were subjected to strong reductive evolutionary pressures during the early evolution of a very ancient RNA molecule.

6.3. Phylogenomic Evidence from Protein Domain Structure. G. Caetano-Anollés and D. Caetano-Anollés [41] were the first to utilize protein domain structures as taxa and to reconstruct trees of domains (ToDs) describing their evolution. Figure 10 shows a rooted ToD built from a census of FSF structures in 981 genomes (data taken from [122, 123]). These trees are unique in that their terminal leaves represent a finite set of component parts [87]. These parts describe, at a global level, the structural diversity of the protein world. When building trees of FSFs, the age of each FSF domain structure can be calculated from the ToDs by simply calculating a distance in nodes between the root of the tree and its corresponding terminal leaf. This is possible because ToDs exhibit highly unbalanced (pectinate) topologies that are the result of semipunctuated processes of domain appearance and accumulation. This node distance (nd), rescaled from 0 to 1, provides a relative timescale to study the order of FSF appearance in evolutionary history [126, 143]. Wang et al. [144] showed that nd correlates linearly with geological time and defines a global molecular clock of protein folds. Thus, nd can be used as a reliable proxy for time. Plotting the age of FSFs in each of the seven taxonomic groups confirmed evolutionary statements we had previously deduced from the Venn diagrams of Figure 6. The ABE taxonomic group included the majority of ancient and widely distributed FSFs. This is expected. In the presence of a strong vertical trace, molecular diversity must delimit a nested taxonomic hierarchy. The ABE group was followed by the evolutionary appearance of the BE group, which preceded the first domain-specific structures, which were Bacteria-specific (B). Remarkably, Archaea-specific (A) and Eukarya-specific (E) structures appeared concurrently and relatively late. These general trends, captured in the box plots of Figure 10, have been recovered repeatedly when studying domain structures at various levels of structural complexity, from folds to fold families [91, 126], when using CATH or SCOP structural definitions [122, 145], or when exploring the evolution of terminal GO terms.
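The node distance calculation can be sketched on a toy tree. The example assumes a rooted tree written as nested tuples with string leaves, counts the internal nodes on the path from the root to each leaf, and applies a min-max rescaling so that the most basal leaf receives nd = 0; that rescaling, the function names, and the leaf labels are assumptions made for illustration rather than the exact procedure of [122, 126, 143].

```python
from typing import Dict, Union

Tree = Union[str, tuple]  # a leaf label, or a tuple of subtrees


def leaf_depths(tree: Tree, depth: int = 0) -> Dict[str, int]:
    """Count internal nodes on the path from the root to every leaf."""
    if isinstance(tree, str):
        return {tree: depth}
    depths: Dict[str, int] = {}
    for child in tree:
        depths.update(leaf_depths(child, depth + 1))
    return depths


def node_distance(tree: Tree) -> Dict[str, float]:
    """Rescale leaf depths to [0, 1]: 0 is most basal, 1 is most derived."""
    depths = leaf_depths(tree)
    lo, hi = min(depths.values()), max(depths.values())
    span = (hi - lo) or 1  # guard against trees in which all leaves share a depth
    return {leaf: (d - lo) / span for leaf, d in depths.items()}


# Toy pectinate tree with hypothetical FSF labels (oldest leaf branches first).
toy_tod = ("FSF1", ("FSF2", ("FSF3", ("FSF4", "FSF5"))))
nd = node_distance(toy_tod)
```

In a pectinate topology of this kind each leaf sits one node further from the root than the previous one, which is why a simple node count orders the taxa in time.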

While the early rise of BE FSFs supports the early divergence of Archaea from the urancestor, the very significant trend of gradual loss of structures occurring in the lineages of the archaeal domain and the very late appearance of Archaea-specific structures (e.g., [126]) demand explanation. Since ancient BE FSFs are widely distributed in proteomes (Figure 10), they cannot arise from separate gains of FSFs in Bacteria and Eukarya or by processes of horizontal spread of structures. This was already evident from the Venn diagrams of Figure 6. Moreover, the BE sisterhood to the exclusion of Archaea was further supported by the inspection of FSFs involved in lipid synthesis and transport (Table 1). Membrane lipids are very relevant to the origins of diversified life ([146] and references therein). Bacteria and Eukarya encode similar lipid membranes, while archaeal membranes have a different lipid composition (isoprenoid ethers). To check if

Figure 10: Phylogenomic tree of domains (ToD) describing the evolution of 1733 FSF domain structures. Taxa (FSFs) were colored according to their distribution (f) in the 981 genomes that were surveyed and used as characters to reconstruct the phylogenomic tree (data taken from [122, 123]). The most basal FSFs are labeled with SCOP alphanumeric identifiers (e.g., c.37.1 is the P-loop containing nucleoside triphosphate hydrolase FSF). Boxplots display the age (nd value) distribution of FSFs for the seven possible taxonomic groups. nd values were calculated directly from the tree [122] and define a timeline of FSF innovation from the origin of proteins (nd = 0) to the present (nd = 1). The group of FSFs that are shared by the three domains of life (ABE) is the most ancient taxonomic group, which spans the entire time axis, and their FSFs are widely distributed in genomes. The appearance of the BE group coincides with the first reductive loss of an FSF in Archaea. FSF structures specific to domains of life appear much later in evolution. Tree statistics shown in the figure: taxa = 1733 FSF domains; characters = 981 genomes; tree length = 458226 steps; CI = 0.049, RI = 0.796, and g1 = −0.08.

lipid synthesis was another BE synapomorphy, we identified 17 FSFs that were involved in lipid metabolism and transport


Table 1: List of FSFs involved in lipid metabolism and transport along with taxonomic distribution (data taken from [122, 123]).

Group  SCOP Id  FSF Id   FSF description
ABE    89392    b.125.1  Prokaryotic lipoproteins and lipoprotein localization factors
ABE    53092    c.55.2   Creatinase/prolidase N-terminal domain
ABE    49723    b.12.1   Lipase/lipooxygenase domain (PLAT/LH2 domain)
ABE    54637    d.38.1   Thioesterase/thiol ester dehydrase-isomerase
ABE    63825    b.68.5   YWTD domain
BE     47027    a.11.1   Acyl-CoA binding protein
BE     48431    a.118.4  Lipovitellin-phosvitin complex, superhelical domain
BE     55048    d.58.23  Probable ACP-binding domain of malonyl-CoA ACP transacylase
BE     56968    f.7.1    Lipovitellin-phosvitin complex, beta-sheet shell regions
BE     58113    h.5.1    Apolipoprotein A-I
BE     47162    a.24.1   Apolipoprotein
BE     56931    f.4.2    Outer membrane phospholipase A (OMPLA)
B      82220    b.120.1  Tp47 lipoprotein, N-terminal domain
E      47699    a.52.1   Bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin
E      82936    h.6.1    Apolipoprotein A-II
E      57190    g.3.10   Colipase-like
E      49594    b.7.4    Rab geranylgeranyltransferase alpha-subunit, insert domain

(Table 1). Remarkably, the largest share of these FSFs (7 out of 17) was unique to the BE group. In comparison, none were present in either the AE or AB groups. The ABE group included five universal FSFs, while one was unique to Bacteria and four were Eukarya-specific. In turn, no FSF was unique to Archaea. The BE FSFs cannot be explained by modern effects impinging on variations in proteomic accumulation in FSFs or by processes of domain rearrangement, since these appear in the protein world quite late in evolution [92]. The only and most parsimonious explanation of the patterns of FSF distribution that unfold in the ToD is the very early (and protracted) rise of the archaeal domain by processes of reductive evolution, possibly triggered by the adaptation of urancestral lineages to harsh environments and survival modes. Under extremophilic conditions, typical of hyperthermophilic environments, considerable investments of matter-energy and information must be made for protein persistence [120]. This puts limits on viable protein structures [147]. Extremophilic environments will thus poise the maintenance of a limited set of FSFs for persistence of emergent diversified lineages. This would induce a primordial episode of reductive evolution in the growing FSF repertoire, explaining why hyperthermophilic and thermophilic archaeal species hold the most reduced proteomes (Figure 7). It would also explain the biases that exist in FSFs, GO terms, and RNA families (Figure 6) and the placement of hyperthermophilic and thermophilic archaeal species at the base of ToLs. Since Archaea populate the oceans and sometimes rival Bacteria in number in those environments, we further interpret the late appearance of Archaea-specific FSFs as the result of late colonization of these mild environments by both ancient archaeons and emerging eukaryotes. This relaxes primordial extremophilic pressures on protein structures and enables the late archaeal exploration of structural flexibility and functional novelty.

6.4. Full Circle: Evidence from Trees of Proteomes and Functionomes and a Tree Derived from the Distribution of Viral Replicons in Superkingdoms. While we distrust trees of systems, especially ToLs built from sequences, the use of molecular structure at high levels of structural abstraction has the potential to mitigate some limitations of sequence analysis [93]. For example, rooted ToLs built from abundance counts of domain structures and terminal GO terms in the genomes of free-living organisms describe the evolution of proteomes (e.g., [91]). All ToL reconstructions of these kinds approximate the physiology of living organisms, dissect the three primary domains of life, and reveal the early paraphyletic origin of extremophilic archaeal lineages followed by the late appearances of monophyletic Bacteria and Eukarya. These patterns have been reliably recovered with datasets of varying sizes, irrespective of the structural classification scheme [91, 122, 126, 145]. Even a tree reconstructed from the distribution of 2662 viral replicons in superkingdoms from an exhaustive comparative genomic analysis of viral genomes showed the basal placement of Archaea and the sister taxa relationship between Bacteria and Eukarya (Figure 11).

6.5. Additional Evidence from Comparative Genomics. The uneven distribution of protein domain structures in the world

Figure 11: One of three optimal phylogenetic tree reconstructions with identical topologies recovered from an exhaustive maximum parsimony search (58 steps; CI = 1, RI = 1, HI = 0, RC = 1, g1 = −0.707) of the abundance of viral replicon types of dsDNA, ssDNA, dsRNA, ssRNA(+), ssRNA(−), and retrotranscribing viruses. Abundance was scored on a 0–20 scale and ranged from 0 to 759 viral replicons. Vectors of abundance reconstructions in internal nodes are given as percentage of total abundance of replicons in superkingdoms. Bootstrap support values are shown below nodes.

of proteomes (Figure 6) is preserved as we climb in the structural hierarchy. This was recently made evident when studying the evolution of CATH domains [145]. The Venn diagrams of Figure 12 show how domain structures in all taxonomic groups decrease in numbers with increases in evolutionary conservation. At the highest CATH architectural level, there were 32 universal architectures but no domain-specific architectures. The four architectures shared by Bacteria and Eukarya were present in at least 60% of the proteomes that were surveyed. The other two interdomain architectures were topological designs of considerable complexity that were poorly shared between proteomes and were evolutionarily derived. They were the clam architectures lost in Eukarya and the box architectures of nucleotide excision repair shared by Archaea and Eukarya. The most parsimonious corollary of these distribution patterns is that the BE taxonomical group must arise by loss of structures in Archaea. Indeed, ToDs describing the evolution of CATH domain structures again confirm the early appearance of BE structures and, consequently, their loss in Archaea.

7. Paraphyletic Origins: Grades and Clades in Archaeal History

Saying that “a ToL is rooted in a domain of life” is an incorrect statement that comes from phylogenetic methodology (the use of outgroups) and the tendency to look at the past with modern eyes. Clades in ToLs have been rooted relative to each other by generating unrooted trees and by defining extant organisms as outgroup taxa. The ToL, however, must be considered rooted in the urancestor of cellular life (Figure 1). This planted edge that connects to the ingroup of extant organisms represents a cellular state in which productive diversification (in the sense of successful lineages) was absent. The primordial urancestral edge leads to a “phase transition,” the last universal cellular ancestor (LUCA), of which little is known. The physiology of the urancestor cannot be considered linked to that of any extant organism, even if it shared a common molecular core with all of them. The urancestor was not an archaeon, a bacterium, or a eukaryote [148]. It was not necessarily thermophilic. Perhaps it was a communal entity or a megametaorganism in the sense of a modern syncytium (the result of multiple cell fusions) and a modern coenocyte (the result of multiple cell divisions). The organismal boundaries were likely present, judging by the number of widely distributed protein domains that associate with membranes and appear at the base of our ToDs and by the universal existence of acidocalcisome organelles [149]. However, the molecular makeup of the urancestral cells was most likely fluid and quasistatistical, the repertoire distributed unequally in the urancestral populations of communal parts, of course within confines delimited by persistence. This urancestral population is therefore consistent with the idea of a primordial stem line proposed by Kandler and Woese [150, 151]. However, it was relatively richer in molecular structures and functions, as opposed to the simple cellular systems hypothesized by Woese. This richness is confirmed by modern analyses of proteomes and functionomes that reveal a vast number of universally shared protein domains and GOs among the three superkingdoms. While each syncytial/coenocyte element of the megaorganism exchanged component parts in search of cellular stability and persistence, the process could not be equated with modern HGT. The exchanging community of primordial cells was not cohesive enough to make the horizontal exchange meaningful. Macromolecules most likely established loose and diverse associations with each other and with smaller molecules, limited by the short average life of their unevolved structural conformations. With time, molecules with better-optimized properties engaged in more durable interactions, stabilizing the emergent cells and providing increased cellular cohesiveness. This poised the urancestral community towards a phase transition (a crystallization [148]), a point in which cellular groups had distinct properties and could be individuated. We believe this was the time of the origin of the archaeal lineage, 2.9 billion years ago [48].

At the base of the ToLs that were reconstructed from genomic data, basal archaeal taxa arise as paraphyletic lineages (Figures 4(a) and 4(c)). These lineages likely arose from subgroups of the urancestral population that pervasively lost crucial domain structures and molecular functions. This represents an evolutionary grade under the scenario described above. The emerging lineages shared with the urancestral community a unifying condition that was related to archaic biochemistry. In other words, the urancestral and emerging archaeal lineages expressed fundamental structural and functional equivalences in terms of their repertoires but revealed, in each emerging and durable paraphyletic lineage, a handful of distinct, newly developed traits. These traits could be global, such as increased thermostability of some crucial members of the protein repertoire or change in the membrane makeup, or local, such as the selective loss of crucial structures and functions. Figure 13 uses the tree paradigm to portray the structural and functional equivalences of the urancestral and emerging archaeal lineages and the slow progression from grades to clades.

Figure 12: Venn diagrams displaying the distributions of 2221 homologous superfamilies, 1152 topologies, and 38 architectures of CATH domains in the proteomes of 492 fully sequenced genomes (from [145]). Panels are arranged in order of increasing evolutionary conservation: superfamilies, topologies, and architectures. All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

Figure 13: From grades to clades. The cartoon describes a possible progression of modes of organismal diversification during the rise of primordial archaeal lineages. The width of emerging lineages is proportional to uniquely identifying features of physiological and molecular complexity.

8. Through the Wormhole: The Makeup of the “Megaorganism” and the Emerging Archaeal Lineages

Character state reconstructions of proteome repertoires derived from ToLs, coupled to the timelines of ToDs, provide an effective way to define the ancestral protein domain complement of the urancestor [48] and, consequently, the likely makeup of the emerging archaeal lineages. The urancestral proteome possessed a lower bound of ∼70 FSF domain structures, 75% of which were composed of α/β and α+β proteins. About 50% of FSFs were part of metabolic enzymes, including a rich toolkit of transferases and enzymes of nucleotide metabolism. The rest of the domains were involved in functions related to information (translation, replication, and repair), intracellular processes (transport, protein modification, and proteolytic activities), regulation (kinases/phosphatases and DNA binding functions), and small molecule binding. The urancestor had a limited repertoire of aaRSs and translation factors. It contained a primordial ribosome with a limited core of universal ribosomal proteins. It had numerous membrane proteins necessary for transport, including a relatively advanced ATP synthetase complex, and structures necessary for cellular organization (filaments and primordial cytoskeletal structures). The cells lacked enzymes for deoxyribonucleotide production, so it is likely that the cellular urancestor itself did not harbor a DNA genome. The cells lacked functions related to extracellular processes (cell adhesion, immune response, and toxins/defense) and cellular motility, suggesting an ancient living world without competitive strategies of survival.

9. Conclusions

The rooting of the ToL has always been controversial in evolutionary biology [26, 152, 153]. While it is popularly accepted that the ToL based on sequence phylogenies is rooted in the akaryotes and that Archaea and Eukarya are sister groups to each other, only two of the three main steps of phylogenetic analysis [104] have been partially fulfilled with sequences. These are the selection of an appropriate statistical or nonstatistical evolutionary model of character change and of an optimization method for phylogenetic tree reconstruction. However, no adequate method exists for character polarization that identifies ancestral and derived character states in sequences. In the absence of robust polarization methodology, any statement about the rooting of the ToL should be considered suspect or subject to apriorism. Here we show that information derived from a genomic structural and functional census of millions of encoded proteins and RNAs, coupled with process models that comply with Weston’s generality criterion, provides the means to dissect the origins of diversified life. The generality criterion is fulfilled in these studies by focusing on the accumulation of modules such as protein domain structures, elements of RNA substructures, or ontogenetic definitions of molecular function. In general, these features are the subject of accretion processes that comply with additive phylogenetic change within the nested taxonomic hierarchy and result in changes of abundance. These processes include those responsible for the growth of molecules (e.g., multidomain proteins), molecular ensembles (e.g., the ribosome), and molecular repertoires (e.g., proteomes). The new methods unfold a consistent evolutionary scenario in which the origin of diversified life traces back to the early history of Archaea. Remarkably, the archaic origin of this microbial urkingdom now does justice to its name.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgments

The authors would like to thank members and friends of the GCA laboratory for fruitful discussions. This research has been supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791) and the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625) to Gustavo Caetano-Anollés and from the KRIBB Research Initiative Program and the Next-Generation BioGreen 21 Program, Rural Development Administration (PJ0090192013), to Kyung Mo Kim.

References

[1] C R Woese ldquoA new biology for a new centuryrdquo MicrobiologyandMolecular Biology Reviews vol 68 no 2 pp 173ndash186 2004

[2] T Grant ldquoTesting methods the evaluation of discovery opera-tions in evolutionary biologyrdquo Cladistics vol 18 no 1 pp 94ndash111 2002

[3] E O Wiley ldquoKarl R Popper systematics and classificationa reply to Walter Bock and other evolutionary taxonomistsrdquoSystematic Zoology vol 24 no 2 pp 233ndash243 1975

[4] J S Farris ldquoParsimony and explanatory powerrdquo Cladistics vol24 no 5 pp 825ndash847 2008

[5] G. Caetano-Anollés, K. M. Kim, and D. Caetano-Anollés, "The phylogenomic roots of modern biochemistry: origins of proteins, cofactors and protein biosynthesis," Journal of Molecular Evolution, vol. 74, no. 1-2, pp. 1–34, 2012.

[6] W. Fontana, "Modeling 'evo-devo' with RNA," BioEssays, vol. 24, no. 12, pp. 1164–1177, 2002.

[7] W. Hennig, Phylogenetic Systematics, University of Illinois Press, Urbana, Ill, USA, 1999.

[8] D. Fraix-Burnet, T. Chattopadhyay, A. K. Chattopadhyay, E. Davoust, and M. Thuillard, "A six-parameter space to describe galaxy diversification," Astronomy and Astrophysics, vol. 545, article A80, 24 pages, 2012.

[9] E. Sober, "The contest between parsimony and likelihood," Systematic Biology, vol. 53, no. 4, pp. 644–653, 2004.

[10] E. K. Lienau and R. DeSalle, "Is the microbial tree of life verificationist?" Cladistics, vol. 26, no. 2, pp. 195–201, 2010.

[11] J. S. Huxley, "Evolutionary processes and taxonomy with special reference to grades," Uppsala Universitets Årsskrift, vol. 6, pp. 21–39, 1958.

[12] C. R. Woese and G. E. Fox, "Phylogenetic structure of the prokaryotic domain: the primary kingdoms," Proceedings of the National Academy of Sciences of the United States of America, vol. 74, no. 11, pp. 5088–5090, 1977.

[13] P. Forterre, "The universal tree of life and the last universal cellular ancestor: revolution and counterrevolutions," in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 43–62, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[14] J. P. Gogarten, H. Kibak, P. Dittrich et al., "Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes," Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 17, pp. 6661–6665, 1989.

[15] N. Iwabe, K. Kuma, M. Hasegawa, S. Osawa, and T. Miyata, "Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes," Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 23, pp. 9355–9359, 1989.

[16] J. P. Gogarten and L. Olendzenski, "Orthologs, paralogs and genome comparisons," Current Opinion in Genetics and Development, vol. 9, no. 6, pp. 630–636, 1999.

[17] C. R. Woese, O. Kandler, and M. L. Wheelis, "Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya," Proceedings of the National Academy of Sciences of the United States of America, vol. 87, no. 12, pp. 4576–4579, 1990.

[18] J. Martijn and T. J. Ettema, "From archaeon to eukaryote: the evolutionary dark ages of the eukaryotic cell," Biochemical Society Transactions, vol. 41, no. 1, pp. 451–457, 2013.

[19] C. J. Cox, P. G. Foster, R. P. Hirt, S. R. Harris, and T. M. Embley, "The archaebacterial origin of eukaryotes," Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 51, pp. 20356–20361, 2008.

[20] P. G. Foster, C. J. Cox, and T. M. Embley, "The primary divisions of life: a phylogenomic approach employing composition-heterogeneous methods," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2197–2207, 2009.

[21] T. A. Williams, P. G. Foster, T. M. Nye, C. J. Cox, and T. M. Embley, "A congruent phylogenomic signal places eukaryotes within the Archaea," Proceedings of the Royal Society B: Biological Sciences, vol. 279, no. 1749, pp. 4870–4879, 2012.

[22] S. Gribaldo, A. M. Poole, V. Daubin, P. Forterre, and C. Brochier-Armanet, "The origin of eukaryotes and their relationship with the Archaea: are we at a phylogenomic impasse?" Nature Reviews Microbiology, vol. 8, no. 10, pp. 743–752, 2010.

[23] H. Philippe and P. Forterre, "The rooting of the universal tree of life is not reliable," Journal of Molecular Evolution, vol. 49, no. 4, pp. 509–523, 1999.

[24] E. Sober and M. Steel, "Testing the hypothesis of common ancestry," Journal of Theoretical Biology, vol. 218, no. 4, pp. 395–408, 2002.

[25] P. Forterre, "The common ancestor of Archaea and Eukarya was not an archaeon," Archaea, vol. 2013, Article ID 372396, 18 pages, 2013.

[26] P. Forterre and H. Philippe, "Where is the root of the universal tree of life?" BioEssays, vol. 21, no. 10, pp. 871–879, 1999.

[27] S. Gribaldo and H. Philippe, "Ancient phylogenetic relationships," Theoretical Population Biology, vol. 61, no. 4, pp. 391–408, 2002.

[28] J. K. Harris, S. T. Kelley, G. B. Spiegelman, and N. R. Pace, "The genetic core of the universal ancestor," Genome Research, vol. 13, no. 3, pp. 407–412, 2003.

[29] A. Harish and G. Caetano-Anollés, "Ribosomal history reveals origins of modern protein synthesis," PLoS ONE, vol. 7, no. 3, Article ID e32776, 2012.

[30] G. Caetano-Anollés and M. J. Seufferheld, "The coevolutionary roots of biochemistry and cellular organization challenge the RNA world paradigm," Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 152–177, 2013.

[31] N. R. Pace, "Mapping the tree of life: progress and prospects," Microbiology and Molecular Biology Reviews, vol. 73, no. 4, pp. 565–576, 2009.

[32] P. de Rijk, Y. van de Peer, I. van den Broeck, and R. de Wachter, "Evolution according to large ribosomal subunit RNA," Journal of Molecular Evolution, vol. 41, no. 3, pp. 366–375, 1995.

[33] G. Caetano-Anollés, "Tracing the evolution of RNA structure in ribosomes," Nucleic Acids Research, vol. 30, no. 11, pp. 2575–2587, 2002.

[34] J. Mallatt and C. J. Winchell, "Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes," Molecular Biology and Evolution, vol. 19, no. 3, pp. 289–301, 2002.

[35] M. Gerstein, "Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census," Proteins: Structure, Function and Bioinformatics, vol. 33, no. 4, pp. 518–534, 1998.

[36] M. Gerstein and H. Hegyi, "Comparing genomes in terms of protein structure: surveys of a finite parts list," FEMS Microbiology Reviews, vol. 22, no. 4, pp. 277–304, 1998.

[37] B. Snel, P. Bork, and M. A. Huynen, "Genome phylogeny based on gene content," Nature Genetics, vol. 21, no. 1, pp. 108–110, 1999.

[38] F. Tekaia, A. Lazcano, and B. Dujon, "The genomic tree as revealed from whole proteome comparisons," Genome Research, vol. 9, no. 6, pp. 550–557, 1999.

[39] Y. I. Wolf, S. E. Brenner, P. A. Bash, and E. V. Koonin, "Distribution of protein folds in the three superkingdoms of life," Genome Research, vol. 9, no. 1, pp. 17–26, 1999.

[40] J. O. Korbel, B. Snel, M. A. Huynen, and P. Bork, "SHOT: a web server for the construction of genome phylogenies," Trends in Genetics, vol. 18, no. 3, pp. 158–162, 2002.

[41] G. Caetano-Anollés and D. Caetano-Anollés, "An evolutionarily structured universe of protein architecture," Genome Research, vol. 13, no. 7, pp. 1563–1571, 2003.

[42] S. L. Baldauf, A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle, "A kingdom-level phylogeny of eukaryotes based on combined protein data," Science, vol. 290, no. 5493, pp. 972–977, 2000.

[43] J. R. Brown, C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope, "Universal trees based on large combined protein sequence data sets," Nature Genetics, vol. 28, no. 3, pp. 281–285, 2001.

[44] F. D. Ciccarelli, T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P. Bork, "Toward automatic reconstruction of a highly resolved tree of life," Science, vol. 311, no. 5765, pp. 1283–1287, 2006.

[45] T. Dagan, M. Roettger, D. Bryant, and W. Martin, "Genome networks root the tree of life between prokaryotic domains," Genome Biology and Evolution, vol. 2, no. 1, pp. 379–392, 2010.

[46] S. Jun, G. E. Sims, G. A. Wu, and S. Kim, "Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution," Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 133–138, 2010.

[47] E. A. Schultes, P. T. Hraber, and T. H. LaBean, "No molecule is an island: molecular evolution and the study of sequence space," in Algorithmic Bioprocesses, A. Condon, D. Harel, J. N. Kok, A. Salomaa, and E. Winfree, Eds., pp. 675–704, Springer, Berlin, Germany, 2009.

[48] K. M. Kim and G. Caetano-Anollés, "The proteomic complexity and rise of the primordial ancestor of diversified life," BMC Evolutionary Biology, vol. 11, no. 1, article 140, 2011.

[49] V. A. Albert, Parsimony, Phylogeny, and Genomics, Oxford University Press, Oxford, UK, 2005.

[50] G. Caetano-Anollés, "Evolved RNA secondary structure and the rooting of the universal tree of life," Journal of Molecular Evolution, vol. 54, no. 3, pp. 333–345, 2002.

[51] J. A. Lake, R. G. Skophammer, C. W. Herbold, and J. A. Servin, "Genome beginnings: rooting the tree of life," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2177–2185, 2009.

[52] B. Guo, M. Zou, and A. Wagner, "Pervasive indels and their evolutionary dynamics after the fish-specific genome duplication," Molecular Biology and Evolution, vol. 29, no. 10, pp. 3005–3022, 2012.

[53] M. K. Basu, I. B. Rogozin, O. Deusch, T. Dagan, W. Martin, and E. V. Koonin, "Evolutionary dynamics of introns in plastid-derived genes in plants: saturation nearly reached but slow intron gain continues," Molecular Biology and Evolution, vol. 25, no. 1, pp. 111–119, 2008.

[54] J. S. Farris, "Phylogenetic analysis under Dollo's Law," Systematic Biology, vol. 26, no. 1, pp. 77–88, 1977.

[55] H. Fang, M. E. Oates, R. B. Pethica et al., "A daily-updated tree of (sequenced) life as a reference for genome research," Scientific Reports, vol. 3, article 2015, 2013.

[56] T. Cavalier-Smith, "The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification," International Journal of Systematic and Evolutionary Microbiology, vol. 52, no. 1, pp. 7–76, 2002.

[57] T. Cavalier-Smith, "Rooting the tree of life by transition analyses," Biology Direct, vol. 1, article 19, 2006.

[58] H. S. Kim, J. E. Mittenthal, and G. Caetano-Anollés, "Widespread recruitment of ancient domain structures in modern enzymes during metabolic evolution," Journal of Integrative Bioinformatics, vol. 10, no. 1, article 214, 2013.

[59] G. Caetano-Anollés, M. Wang, and D. Caetano-Anollés, "Structural phylogenomics retrodicts the origin of the genetic code and uncovers the evolutionary impact of protein flexibility," PLoS ONE, vol. 8, no. 8, Article ID e72225, 2013.

[60] E. Bapteste and C. Brochier, "On the conceptual difficulties in rooting the tree of life," Trends in Microbiology, vol. 12, no. 1, pp. 9–13, 2004.

[61] A. M. Poole, "Horizontal gene transfer and the earliest stages of the evolution of life," Research in Microbiology, vol. 160, no. 7, pp. 473–480, 2009.

[62] D. Raoult, "The post-Darwinist rhizome of life," The Lancet, vol. 375, no. 9709, pp. 104–105, 2010.

[63] W. F. Doolittle, "Phylogenetic classification and the universal tree," Science, vol. 284, no. 5423, pp. 2124–2128, 1999.

[64] T. Dagan and W. Martin, "The tree of one percent," Genome Biology, vol. 7, no. 10, article 118, 2006.

[65] W. F. Doolittle and E. Bapteste, "Pattern pluralism and the tree of life hypothesis," Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 7, pp. 2043–2049, 2007.

[66] E. V. Koonin, "Towards a postmodern synthesis of evolutionary biology," Cell Cycle, vol. 8, no. 6, pp. 799–800, 2009.

[67] T. Kloesges, O. Popa, W. Martin, and T. Dagan, "Networks of gene sharing among 329 proteobacterial genomes reveal differences in lateral gene transfer frequency at different phylogenetic depths," Molecular Biology and Evolution, vol. 28, no. 2, pp. 1057–1074, 2011.

[68] S. Halary, J. W. Leigh, B. Cheaib, P. Lopez, and E. Bapteste, "Network analyses structure genetic diversity in independent genetic worlds," Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 127–132, 2010.

[69] S. S. Abby, E. Tannier, M. Gouy, and V. Daubin, "Lateral gene transfer as a support for the tree of life," Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 13, pp. 4962–4967, 2012.

[70] N. Glansdorff, Y. Xu, and B. Labedan, "The conflict between horizontal gene transfer and the safeguard of identity: origin of meiotic sexuality," Journal of Molecular Evolution, vol. 69, no. 5, pp. 470–480, 2009.

[71] C. G. Kurland and O. G. Berg, "A hitchhiker's guide to evolving networks," in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 361–396, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[72] L. P. Villarreal and G. Witzany, "The DNA habitat and its RNA inhabitants: at the dawn of RNA sociology," Genomics Insights, vol. 6, pp. 1–12, 2013.

[73] S. Gribaldo and C. Brochier, "Phylogeny of prokaryotes: does it exist and why should we care?" Research in Microbiology, vol. 160, no. 7, pp. 513–521, 2009.

[74] B. D. Mishler, "The logic of the data matrix in phylogenetic analysis," in Parsimony, Phylogeny, and Genomics, V. A. Albert, Ed., pp. 57–69, Oxford University Press, New York, NY, USA, 2005.

[75] A. Dress, V. Moulton, M. Steel, and T. Wu, "Species, clusters and the 'tree of life': a graph-theoretic perspective," Journal of Theoretical Biology, vol. 265, no. 4, pp. 535–542, 2010.

[76] N. Lane and W. Martin, "The energetics of genome complexity," Nature, vol. 467, no. 7318, pp. 928–934, 2010.

[77] M. C. Rivera and J. A. Lake, "The ring of life provides evidence for a genome fusion origin of eukaryotes," Nature, vol. 431, no. 7005, pp. 152–155, 2004.

[78] D. Alvarez-Ponce and J. O. McInerney, "The human genome retains relics of its prokaryotic ancestry: human genes of archaebacterial and eubacterial origin exhibit remarkable differences," Genome Biology and Evolution, vol. 3, no. 1, pp. 782–790, 2011.

[79] A. Poole and D. Penny, "Eukaryote evolution: engulfed by speculation," Nature, vol. 447, no. 7147, p. 913, 2007.

[80] E. V. Koonin and Y. I. Wolf, "The fundamental units, processes and patterns of evolution, and the tree of life conundrum," Biology Direct, vol. 4, no. 1, article 33, 2009.

[81] P. Puigbò, Y. I. Wolf, and E. V. Koonin, "Search for a 'tree of life' in the thicket of the phylogenetic forest," Journal of Biology, vol. 8, no. 6, article 59, 2009.

[82] P. Puigbò, Y. I. Wolf, and E. V. Koonin, "Seeing the tree of life behind the phylogenetic forest," BMC Biology, vol. 11, article 46, 2013.

[83] D. B. Wetlaufer, "Nucleation, rapid folding, and globular intrachain regions in proteins," Proceedings of the National Academy of Sciences of the United States of America, vol. 70, no. 3, pp. 697–701, 1973.

[84] J. S. Richardson, "The anatomy and taxonomy of protein structure," Advances in Protein Chemistry, vol. 34, pp. 167–339, 1981.

[85] M. Riley and B. Labedan, "Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module," Journal of Molecular Biology, vol. 268, no. 5, pp. 857–868, 1997.

[86] J. Janin and S. J. Wodak, "Structural domains in proteins and their role in the dynamics of protein function," Progress in Biophysics and Molecular Biology, vol. 42, pp. 21–78, 1983.

[87] G. Caetano-Anollés, M. Wang, D. Caetano-Anollés, and J. E. Mittenthal, "The origin, evolution and structure of the protein world," Biochemical Journal, vol. 417, no. 3, pp. 621–637, 2009.

[88] J. Gough, "Convergent evolution of domain architectures (is rare)," Bioinformatics, vol. 21, no. 8, pp. 1464–1471, 2005.

[89] K. Forslund, A. Henricson, V. Hollich, and E. L. L. Sonnhammer, "Domain tree-based analysis of protein architecture evolution," Molecular Biology and Evolution, vol. 25, no. 2, pp. 254–264, 2008.

[90] S. Yang and P. E. Bourne, "The evolutionary history of protein domains viewed by species phylogeny," PLoS ONE, vol. 4, no. 12, Article ID e8378, 2009.

[91] K. M. Kim and G. Caetano-Anollés, "The evolutionary history of protein fold families and proteomes confirms that the archaeal ancestor is more ancient than the ancestors of other superkingdoms," BMC Evolutionary Biology, vol. 12, article 13, 2012.

[92] M. Wang and G. Caetano-Anollés, "The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world," Structure, vol. 17, no. 1, pp. 66–78, 2009.

[93] G. Caetano-Anollés and A. Nasir, "Benefits of using molecular structure and abundance in phylogenomic analysis," Frontiers in Genetics, vol. 3, article 172, 2012.

[94] K. Brading and E. Castellani, Symmetries in Physics: Philosophical Reflections, Cambridge University Press, Cambridge, UK, 2003.

[95] J. A. Cotton and M. Wilkinson, "Quantifying the potential utility of phylogenetic characters," Taxon, vol. 57, no. 1, pp. 131–136, 2008.

[96] E. Mossel and M. Steel, "A phase transition for a random cluster model on phylogenetic trees," Mathematical Biosciences, vol. 187, no. 2, pp. 189–203, 2004.

[97] M. H. Bailor, X. Sun, and H. M. Al-Hashimi, "Topology links RNA secondary structure with global conformation, dynamics, and adaptation," Science, vol. 327, no. 5962, pp. 202–206, 2010.

[98] K. Illergård, D. H. Ardell, and A. Elofsson, "Structure is three to ten times more conserved than sequence - a study of structural response in protein cores," Proteins: Structure, Function and Bioinformatics, vol. 77, no. 3, pp. 499–508, 2009.

[99] G. Caetano-Anollés, S. K. Hee, and J. E. Mittenthal, "The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture," Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 22, pp. 9358–9363, 2007.

[100] P. H. Weston, "Methods for rooting cladistic trees," in Models in Phylogeny Reconstruction, D. J. Siebert, R. W. Scotland, and D. M. Williams, Eds., pp. 125–155, Oxford University Press, Oxford, UK, 1994.

[101] H. N. Bryant, "The polarization of character transformations in phylogenetic systematics: role of axiomatic and auxiliary assumptions," Systematic Biology, vol. 40, no. 4, pp. 433–445, 1991.

[102] H. N. Bryant and G. Wagner, "Character polarity and the rooting of cladograms," in The Character Concept in Evolutionary Biology, G. P. Wagner, Ed., pp. 319–338, Academic Press, New York, NY, USA, 2001.

[103] G. Nelson, "Ontogeny, phylogeny, paleontology, and the biogenetic law," Systematic Biology, vol. 27, no. 3, pp. 324–345, 1978.

[104] P. H. Weston, "Indirect and direct methods in systematics," in Ontogeny and Systematics, C. J. Humphries, Ed., pp. 27–56, Columbia University Press, New York, NY, USA, 1988.

[105] F.-J. Sun and G. Caetano-Anollés, "The evolutionary history of the structure of 5S ribosomal RNA," Journal of Molecular Evolution, vol. 69, no. 5, pp. 430–443, 2009.

[106] X. Fang, T. Pan, and T. R. Sosnick, "Mg2+-dependent folding of a large ribozyme without kinetic traps," Nature Structural Biology, vol. 6, no. 12, pp. 1091–1095, 1999.

[107] F.-J. Sun and G. Caetano-Anollés, "Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses," PLoS Computational Biology, vol. 4, no. 3, Article ID e1000018, 2008.

[108] F.-J. Sun and G. Caetano-Anollés, "The origin and evolution of tRNA inferred from phylogenetic analysis of structure," Journal of Molecular Evolution, vol. 66, no. 1, pp. 21–35, 2008.

[109] W. R. Ashby, An Introduction to Cybernetics, Taylor & Francis, London, UK, 1955.

[110] J. G. Lundberg, "Wagner networks and ancestors," Systematic Biology, vol. 21, no. 4, pp. 398–413, 1972.

[111] P. T. Chippindale and J. J. Wiens, "Weighting, partitioning, and combining characters in phylogenetic analysis," Systematic Biology, vol. 43, no. 2, pp. 278–287, 1994.

[112] J. Mittenthal, D. Caetano-Anollés, and G. Caetano-Anollés, "Biphasic patterns of diversification and the emergence of modules," Frontiers in Genetics, vol. 3, article 147, 2012.

[113] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, "SCOP: a structural classification of proteins database for the investigation of sequences and structures," Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[114] C. A. Orengo, A. Michie, S. Jones, D. T. Jones, M. Swindells, and J. M. Thornton, "CATH - a hierarchic classification of protein domain structures," Structure, vol. 5, no. 8, pp. 1093–1109, 1997.

[115] G. Sawa, J. Dicks, and I. N. Roberts, "Current approaches to whole genome phylogenetic analysis," Briefings in Bioinformatics, vol. 4, no. 1, pp. 63–74, 2003.

[116] R. Sathyapriya, J. M. Duarte, H. Stehr, I. Filippis, and M. Lappe, "Defining an essence of structure determining residue contacts in proteins," PLoS Computational Biology, vol. 5, no. 12, Article ID e1000584, 2009.

[117] D. Penny and L. J. Collins, "Evolutionary genomics leads the way," in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 1–16, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[118] M. Ashburner, C. A. Ball, J. A. Blake et al., "Gene ontology: tool for the unification of biology," Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.

[119] A. Andreeva, D. Howorth, J. Chandonia et al., "Data growth and its impact on the SCOP database: new developments," Nucleic Acids Research, vol. 36, pp. D419–D425, 2008.

[120] L. S. Yafremava, M. Wielgos, S. Thomas et al., "A general framework of persistence strategies for biological systems helps explain domains of life," Frontiers in Genetics, vol. 4, article 16, 2013.

[121] J. L. Friar, T. Goldman, and J. Pérez-Mercader, "Genome sizes and the Benford distribution," PLoS ONE, vol. 7, no. 5, Article ID e36624, 2012.

[122] A. Nasir, K. M. Kim, and G. Caetano-Anollés, "Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya," BMC Evolutionary Biology, vol. 12, article 156, 2012.

[123] A. Nasir and G. Caetano-Anollés, "Comparative analysis of proteomes and functionomes provides insights into origins of cellular diversification," Archaea, vol. 2013, Article ID 648746, 13 pages, 2013.

[124] W. Martin and M. Müller, "The hydrogen hypothesis for the first eukaryote," Nature, vol. 392, no. 6671, pp. 37–41, 1998.

[125] A. Nasir, K. M. Kim, and G. Caetano-Anollés, "Global patterns of protein domain gain and loss in superkingdoms," PLoS Computational Biology, vol. 10, no. 1, Article ID e1003452, 2014.

[126] M. Wang, L. S. Yafremava, D. Caetano-Anollés, J. E. Mittenthal, and G. Caetano-Anollés, "Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world," Genome Research, vol. 17, no. 11, pp. 1572–1585, 2007.

[127] L. Guy and T. J. G. Ettema, "The archaeal 'TACK' superphylum and the origin of eukaryotes," Trends in Microbiology, vol. 19, no. 12, pp. 580–587, 2011.

[128] D. Schlieper, M. A. Oliva, J. M. Andreu, and J. Löwe, "Structure of bacterial tubulin BtubA/B: evidence for horizontal gene transfer," Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 26, pp. 9170–9175, 2005.

[129] M. P. Hoeppner, P. P. Gardner, and A. M. Poole, "Comparative analysis of RNA families reveals distinct repertoires for each domain of life," PLoS Computational Biology, vol. 8, no. 11, Article ID e1002752, 2012.

[130] H. Xue, K. Tong, C. Marck, H. Grosjean, and J. T. Wong, "Transfer RNA paralogs: evidence for genetic code-amino acid biosynthesis coevolution and an archaeal root of life," Gene, vol. 310, no. 1-2, pp. 59–66, 2003.

[131] H. Xue, S. Ng, K. Tong, and J. T. Wong, "Congruence of evidence for a Methanopyrus-proximal root of life based on transfer RNA and aminoacyl-tRNA synthetase genes," Gene, vol. 360, no. 2, pp. 120–130, 2005.

[132] M. Di Giulio, "Nanoarchaeum equitans is a living fossil," Journal of Theoretical Biology, vol. 242, no. 1, pp. 257–260, 2006.

[133] M. Di Giulio, "The tree of life might be rooted in the branch leading to Nanoarchaeota," Gene, vol. 401, no. 1-2, pp. 108–113, 2007.

[134] J. T. Wong, J. Chen, W. Mat, S. Ng, and H. Xue, "Polyphasic evidence delineating the root of life and roots of biological domains," Gene, vol. 403, no. 1-2, pp. 39–52, 2007.

[135] F.-J. Sun and G. Caetano-Anollés, "The ancient history of the structure of ribonuclease P and the early origins of Archaea," BMC Bioinformatics, vol. 11, article 153, 2010.

[136] C. Marck and H. Grosjean, "tRNomics: analysis of tRNA genes from 50 genomes of Eukarya, Archaea, and bacteria reveals anticodon-sparing strategies and domain-specific features," RNA, vol. 8, no. 10, pp. 1189–1232, 2002.

[137] W. Mat, H. Xue, and J. T. Wong, "The genomics of LUCA," Frontiers in Bioscience, vol. 13, no. 14, pp. 5605–5613, 2008.

[138] C. Brochier, S. Gribaldo, Y. Zivanovic, F. Confalonieri, and P. Forterre, "Nanoarchaea: representatives of a novel archaeal phylum or a fast-evolving euryarchaeal lineage related to Thermococcales?" Genome Biology, vol. 6, no. 5, article R42, 2005.

[139] B. Billoud, M. Guerrucci, M. Masselot, and J. S. Deutsch, "Cirripede phylogeny using a novel approach: molecular morphometrics," Molecular Biology and Evolution, vol. 17, no. 10, pp. 1435–1445, 2000.

[140] L. J. Collins, V. Moulton, and D. Penny, "Use of RNA secondary structure for studying the evolution of RNase P and RNase MRP," Journal of Molecular Evolution, vol. 51, no. 3, pp. 194–204, 2000.

[141] G. Caetano-Anollés, "Novel strategies to study the role of mutation and nucleic acid structure in evolution," Plant Cell, Tissue and Organ Culture, vol. 67, no. 2, pp. 115–132, 2001.

[142] F.-J. Sun, S. Fleurdepine, C. Bousquet-Antonelli, G. Caetano-Anollés, and J. Deragon, "Common evolutionary trends for SINE RNA structures," Trends in Genetics, vol. 23, no. 1, pp. 26–33, 2007.

[143] D. Caetano-Anollés, K. M. Kim, J. E. Mittenthal, and G. Caetano-Anollés, "Proteome evolution and the metabolic origins of translation and cellular life," Journal of Molecular Evolution, vol. 72, no. 1, pp. 14–33, 2011.

[144] M. Wang, Y. Jiang, K. M. Kim et al., "A universal molecular clock of protein folds and its power in tracing the early history of aerobic metabolism and planet oxygenation," Molecular Biology and Evolution, vol. 28, no. 1, pp. 567–582, 2011.

[145] S. A. Bukhari and G. Caetano-Anollés, "Origin and evolution of protein fold designs inferred from phylogenomic analysis of CATH domain structures in proteomes," PLoS Computational Biology, vol. 9, no. 3, Article ID e1003009, 2013.

[146] M. Vesteg and J. Krajčovič, "The falsifiability of the models for the origin of eukaryotes," Current Genetics, vol. 57, no. 6, pp. 367–390, 2011.

[147] I. N. Berezovsky and E. I. Shakhnovich, "Physics and evolution of thermophilic adaptation," Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 36, pp. 12742–12747, 2005.

[148] C. Woese, "The universal ancestor," Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 12, pp. 6854–6859, 1998.

[149] M. J. Seufferheld and G. Caetano-Anollés, "Phylogenomics supports a cellularly structured urancestor," Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 178–191, 2013.

[150] O. Kandler, "Cell wall biochemistry and three-domain concept of life," Systematic and Applied Microbiology, vol. 16, no. 4, pp. 501–509, 1994.

[151] C. R. Woese, "On the evolution of cells," Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 13, pp. 8742–8747, 2002.

[152] S. L. Baldauf, J. D. Palmer, and W. F. Doolittle, "The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 15, pp. 7749–7754, 1996.

[153] H. Leffers, J. Kjems, L. Østergaard, N. Larsen, and R. A. Garrett, "Evolutionary relationships amongst archaebacteria. A comparative study of 23 S ribosomal RNAs of a sulphur-dependent extreme thermophile, an extreme halophile and a thermophilic methanogen," Journal of Molecular Biology, vol. 195, no. 1, pp. 43–61, 1987.

[154] J. Martin, D. Blackburn, and E. Wiley, "Are node-based and stem-based clades equivalent? Insights from graph theory," PLoS Currents, vol. 2, Article ID RRN1196, 2010.

analytic phylogenetic framework and does not weigh gains, losses, and transfers with algorithmic implementations. Thus, many statements can fall prey to incorrect optimizations or to processes of convergence of structure and function, including cooptions that are common in metabolic enzymes [58]. Transitions also fail to consider the molecular makeup of the complexes examined (e.g., evolution of the photosynthetic reaction centers), which often hold domains with heterogeneous histories in the different organismal groups. For example, the F1Fo ATP synthetase complex that powers cellular processes has a long history of accretion of domains that spans almost 3.8 billion years, which is even older than that of the ribosome [59]. Finally, establishing the validity of evolutionary transitions in polarization schemes can be highly problematic; each transition that is studied requires well-grounded assumptions [60], some of which have not yet been properly elaborated.

The inability to solve the rooting problem and the insistence on extracting deep phylogenetic signal from molecular sequences that are prone to mutational saturation raised skepticism about the possibility of ever finding the root of the ToL. Bapteste and Brochier [60] made explicit the conceptual difficulties, claiming scientists in the field had adopted Agrippa's logic of doubt. Misunderstandings on how to conduct ideographic inquiry in evolutionary genomics had effectively blocked the re-erection of a Tower of Babel that would explain the diversification of genomes (Figure 3). Instead, there was "Confusion of Doubts." Unfortunately, the impasse was aggravated by aprioristic tendencies inherited from systematic biology [3, 10], disagreements about the evolutionary role of vertical and lateral inheritance and the problem of homology [61], and, currently, disagreements about the actual role of Darwinian evolution in speciation and the ToL problem [13, 62].

During the decade of evolutionary genomic discovery, the effects of HGT on phylogeny [63] took center stage. HGT was invoked as an overriding process. However, little attention was paid to alternative explanations such as differential loss of gene variants, ancient or derived, and there was little concern for other sources of reticulation. HGT is certainly an important evolutionary process that complicates the "tree" concept of phylogenetic analysis and must be carefully studied (e.g., [64–66]). In some bacterial taxonomic groups, such as the proteobacteria, HGT was found to be pervasive and challenged the definition of species [67]. Cases like these prompted the radical suggestion that the ToL should be abandoned and that a "web of life" should be used to describe the diversification of microbes and multicellular organisms [68]. However, the problem has two aspects that must be separately considered.

Figure 3: The "Confusion of Tongues," an engraving by Gustave Doré (1832–1883), was modified by D. Caetano-Anollés to portray the event of diversification that halted the construction of the Tower of Babel. In linguistics, the biblical story inspired tree thinking. We take the metaphor as the fall of the urancestor of cellular life and the replacement of "scala naturae" by branching processes of complexity.

(i) Mechanics. The widespread and impactful nature of HGT must be addressed. Does its existence truly compromise the validity of phylogenetic tree statements? While HGT seems important for some bacterial lineages [69], its evolutionary impacts in Archaea and Eukarya are not as extensive (e.g., [61, 70]) and its global role can be contested [71]. In turn, the role of viruses and RNA agents in genetic exchange continues to be understudied (e.g., [72]) and could be a crucial source of reticulation. Furthermore, the relationship between differential gene loss and HGT has not been adequately formalized, especially for genes that are very ancient. Establishing patterns of loss requires establishing polarity of change and rooted trees, which, as we have discussed earlier, remains unattained for ToLs reconstructed from molecular sequences.

(ii) Interpretation. HGT generally materializes as a mismatch between histories of genes in genomes and histories of organisms. Since genomes are, by functional definition, collections of genes, reticulations affect the history of the genes and not of the organisms, which, given evolutionary assumptions, result from hierarchical relationships (e.g., [73]). Thus, and at first glance, reticulation processes (HGT, gene recombination, gene duplications, and gene loss) do not obliterate vertical phylogenetic signal. The problem, however, remains complex. A ToL can be considered an ensemble of lineages nested within each other [74]. Protein domain lineages will be nested in gene lineages, gene lineages will be nested in lineages of gene families, gene families will be nested in organismal lineages, organisms will be nested in lineages of higher organismal groupings (populations, communities, ecosystems, and biomes), and so on. All lineage levels are defined by biological complexity and the hierarchical organization of life and follow a fractal pattern distribution, each level contributing vertical and lateral phylogenetic signal to the whole. However, sublineages in one hierarchical level may hold different histories compared to the histories of higher and lower levels. Historical mismatches introduce, for example, lineage sorting and reticulation problems that represent a complication to phylogenetic analysis. For example, mismatches between gene and organismal phylogenies exist in the presence of differential sorting of genomic components due to, for example, species extinctions or genomic rearrangements. These mismatches violate the fundamental cladistic assumption that history follows a branching process but can be explained by homoplasy (convergent evolution), the horizontal "trace" of the homology-homoplasy yin-yang of phylogenetic signal. The problem of homoplasy cannot be solved by conceptually preferring one particular level of the hierarchy [74]. A focus on the organismal level, for example, will not solve the problems of hybridization that are brought by strict or relaxed sexual reproduction or viral-mediated genetic exchanges or the problems of ancient events of fusions or endosymbiosis. However, homoplasy and the optimization of characters and trees of cladistic analysis provide a rigorous framework to discover the magnitude and source of nonvertical processes in evolution. Thus, cladistics or the ToL are not invalidated by network-like signals. In fact, recent discrete mathematical formulations that test the fundamental axioms of tree construction have also proven that the bifurcating history of trees is preserved despite evolutionary reticulations [75]. While there is no "confusion of doubts" at this level, the babel of confusion has not stopped.

Genomics revealed evolutionary patchiness, which incited hypotheses of chimeras and fusions. Phylogenies of genes were found to be highly discordant. Organisms that were being sequenced shared very few gene sequences and, more troublingly, the gene trees that were generated possessed different topologies, especially within akaryotic organisms. With time, the number of nearly universal genes decreased and the patchiness and discordancy increased. The comparative genomic patchwork of sequence homologies showed that there were groups of genes that were only shared by certain domains of life. Thus, the makeup of genomes appeared chimeric. A number of hypotheses of chimeric origins of eukaryotes were proposed based on the fact that eukaryotes shared genes expressing sister group phylogenetic relationships with both Bacteria and Archaea [64, 76]. One that is notable is the rise of eukaryotes from a “ring of life” fusion of an archaeon and a bacterium (e.g., [77]), which in a single blow defeated both the tree-like and the tripartite nature of life. Under this school, homology searches with BLAST against a database of ∼3.8 million akaryotic sequences made it possible to assign archaeal, bacterial, or ambiguous ancestries to genes in the human genome and to explain homology patterns as relics of the akaryotic ancestors of humans [78]. Remarkably, archaeal genes tended to be involved in informational processes, encoded shorter and more central proteins, and were less likely to be involved in heritable human diseases. The chimeric origin of eukaryotes by fusions has been rightfully contested; there is no proper evidence supporting its existence [79]. Unfortunately, chimerism opened a flood of speculations about reticulation. The history of life was seen by many through the lens of a “forest” of gene histories [80]. Reticulations and, ultimately, HGT were forced to explain chimeric patterns. Fusion hypotheses diverted attention from the central issue of the rooting of the ToL and prompted the separate analysis of eukaryotic origins and akaryotic evolution.

Dagan and Martin [64] argued that the ToL was a “tree of one percent” because only a small fraction of sequences could be considered universal and could be mined for deep phylogenetic signal. The rest would account for lateral processes that confounded vertical descent. The claim came fundamentally from networks constructed using BLAST heuristic searches for short and strong matches in genomic sequences. Phylogenetic “forests” of akaryotic genes later on boosted the claim [81]. These networks were built from 6901 trees of genes using maximum likelihood methods. Only 102 of these trees were derived from nearly universal genes and contributed very little vertical phylogenetic signal. The initial conclusion was striking: “the original tree of life concept is obsolete: it would not even be a tree of one percent” [81]. However, the fact that vertical signal was present in the forests merited reevaluation: “replacement of the ToL with a network graph would change our entire perception of the process of evolution and invalidate all evolutionary reconstruction” [82]. We note that in these studies an ultrametric tree of akaryotes was recovered from vertical signal in a supertree of nearly universal genes. The rooted tree, which was used to simulate clock-like behavior, revealed the early divergence of Nanoarchaeum equitans and then Archaea [81]. While inconsistency of the forest supertree increased at high phylogenetic depths, its associated supernetwork showed there were no reticulations between Archaea and Bacteria. Thus, the deep rooting signal of archaeal diversification may be bona fide and worthy of further study.

While those that value tree thinking have contested on many grounds the idea that the ToL is “obsolete” (e.g., [13]), our take in the debate is simple. Homologies between gene sequences established through the “emperor’s BLAST” (sensu [71]) are poor substitutes for phylogenetic tree reconstructions from gene sequences. By the same token, phylogenies reconstructed from sequences are poor substitutes for phylogenies that consider the molecular structure and function of the encoded products. Gene sequences are not only prone to mutational saturation, but they generally come in pieces. These pieces represent evolutionary and structural modules that host the functions of the encoded molecules. Protein domains, for example, are three-dimensional (3D) arrangements of elements of secondary structure that fold autonomously [83], are compact [84], and are evolutionarily conserved [85, 86]. The landscape of evolutionary exploration changes when the role of protein domains in function and evolution is considered [87]. Changes at sequence level, including substitutions, insertions, and deletions, can have little impact on structure and, vice versa, sequence changes in crucial sites can have devastating consequences on function and fitness. However, there is no detailed analysis of historical mismatches between gene sequences and domain structures at the ToL level. Remarkably, while HGT seems rampant at sequence level, its impact at the domain structural level is limited [88]. This makes trees derived from domains effectively “trees of 99 percent” and their use very powerful. The fact that domains diversify mostly by vertical descent (e.g., [89–91]) suggests that gene reticulations simply reflect the pervasive and impactful combinatorial effect of domain rearrangements in proteins [92] and perhaps little else. This important claim must be carefully evaluated. It offers the possibility of dissecting how levels of organization impact processes of inheritance in biology.

5. Out of the Impasse: Parts and Wholes in the Evolving Structure of Systems

The rooting of the ToL is clearly muddled by the highly dynamic nature of change in protein and nucleic acid sequences and by the patchiness and reticulation complexities that exist at gene level. However, it is also possible that the problems of the ToL are ultimately technical. We have made the case that the use of molecular sequences is problematic on many grounds, including mutational saturation, definition of homology of sites in sequence alignments, inapplicable characters, taxon sampling and tree imbalance, and different historical signatures in domains of multidomain proteins [93]. In particular, violation of character independence by the mere existence of atomic structure represents a very serious problem that plagues phylogenetic analysis of sequences. We here present a solution to the impasse. We show that the ToL can be rooted with different approaches that focus on structure and function and that its root is congruently placed in Archaea.

Epistemologically, phylogenetic characters must comply with symmetry breaking and the irreversibility of time [94]. In other words, characters must establish transformational homology relationships and serve as independent evidential statements.

(i) Characters Must Be Homologous and Heritable across Tree Terminal Units (Taxa). Character homology is a central and controversial concept that embodies the existence of historical continuity of information [74]. Characters are “basic” evidential statements that make up columns in data matrices used for tree reconstruction. They are conjectures of perceived similarities that are accepted as fact for the duration of the study, are strengthened by Hennigian reciprocal illumination, and can be put to the test through congruence with other characters as these are fit to the trees. To be useful, characters must be heritable and informative across the taxa rows of the data matrix. This can be evaluated, for example, with the cladistic information content (CIC) measure [95]. Finding informative characters can be particularly challenging when the features that are studied change at a fast pace and when taxa sample a wide and consequently deep phylogenetic spectrum. When building a ToL, the highly dynamic nature of change in the sequence makeup of protein or nucleic acid molecules challenges the ability to retrieve reliable phylogenetic signatures across taxa, even if molecules are universally distributed and harbor evolutionarily conserved regions with deep phylogenetic signal (e.g., rRNA). The reason is that, given enough time, functionally or structurally constrained regions of the sequence will be fixed (will be structurally canalized [6]) and will offer little if any phylogenetically meaningful signal to uncover (the universal rRNA core is one example). In turn, mutational saturation of unconstrained regions will quickly erase history.

The mutational saturation problem was made mathematically explicit by Sober and Steel [24] using “mutual information” and the concept of time as an information-destroying process. Mutual information $I(X, Y)$ between two random variables $X$ and $Y$ is defined by

\[
I(X, Y) = \sum_{x, y} P(X = x, Y = y) \log \left( \frac{P(X = x, Y = y)}{P(X = x)\, P(Y = y)} \right). \tag{1}
\]

When $I(X, Y)$ approaches 0, $X$ and $Y$ become independent and no method can predict $X$ from knowledge of $Y$. Importantly, mutual information approaches 0 as the time between $X$ and $Y$ increases in a Markov chain. Regardless of the use of a parsimony, maximum likelihood, or Bayesian-based framework of analysis, $I(X, Y)$ will be particularly small when sequence sites are saturated by too many substitutions due to high substitution rates or large time scales. Ancestral states at interior nodes of the trees cannot be established with confidence from extant information, even in the most optimal situation of knowing the underlying phylogeny and the model of character evolution. Under simple models, the problem is not mitigated by the fact that the number of terminal leaves of trees and the sources of initial phylogenetic information increase with time. Since a phase transition occurs when substitution probabilities exceed a critical value [96], one way out of the impasse is to find features in sufficient number that change at a much slower pace than sequence sites and test if mutual information is significant and overcomes Fano’s inequality. These features exist in molecular biology and have been used for phylogenetic reconstruction. They are, for example, the 3D fold structures of protein molecules [87] or the stem modules of RNA structures [97], features that change at a very slow rate when compared to associated sequences. For example, protein 3D structural cores evolve linearly with amino acid substitutions per site and change at 3–10 times slower rates than sequences [98]. This high conservation highlights the evolutionary dynamics of molecular structure. Remarkably, rates of change of proteins performing the same function are maintained by functional constraints but accelerate when proteins perform different functions or contain indels. In turn, fold structural diversity explodes into modular structures at low sequence identities, probably triggered by functional diversification. Within the context of structural conservation, the fact that fold structures are structural and evolutionary modules that accumulate in proteomes by gene duplication and rearrangements and spread in biological networks by recruitment (e.g., [99]) also provides a solution to the problems of vanishing phylogenetic signal. Since fold accumulation increases with time in the Markov chain, mutual information must increase, reversing the “data processing inequality” that destroys information and enabling deep evolutionary information.
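The loss of mutual information with time can be illustrated numerically. The following is a minimal sketch, not the analysis of [24]: it assumes a symmetric two-state Markov chain with a uniform prior on the ancestral state (a toy model chosen here for illustration), and it evaluates equation (1) with base-2 logarithms so that values are in bits. All function names are hypothetical.

```python
import math

def transition_matrix(rate, t):
    """Symmetric two-state continuous-time Markov chain (toy model).

    The probability that the state differs after time t is
    0.5 * (1 - exp(-2 * rate * t)).
    """
    p_change = 0.5 * (1.0 - math.exp(-2.0 * rate * t))
    return [[1.0 - p_change, p_change],
            [p_change, 1.0 - p_change]]

def mutual_information(joint):
    """I(X, Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    total = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0.0:  # terms with zero probability contribute nothing
                total += pxy * math.log2(pxy / (px[x] * py[y]))
    return total

def ancestor_descendant_information(rate, t):
    """I(X, Y) between an ancestral state X and its descendant Y after time t."""
    prior = [0.5, 0.5]  # assumption: uniform prior on the ancestral state
    trans = transition_matrix(rate, t)
    joint = [[prior[x] * trans[x][y] for y in range(2)] for x in range(2)]
    return mutual_information(joint)

if __name__ == "__main__":
    for t in (0.0, 0.1, 0.5, 1.0, 2.0, 5.0):
        info = ancestor_descendant_information(1.0, t)
        print(f"t = {t:4.1f}  I(X, Y) = {info:.4f} bits")
```

Under these assumptions the information starts at 1 bit when no time has elapsed and falls monotonically towards 0 as the elapsed time grows, which is the saturation behavior described above. Slower rates of change simply stretch the time axis, so the same amount of information survives over a deeper time span.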

(ii) Characters Must Show at Least Two Distinct Character States. One of these two states (transformational homologs) must be ancestral (the “plesiomorphic” state) and the other must be derived (the “apomorphic” state) [74]. Only shared and derived features (synapomorphies) provide vertical phylogenetic evidence. Consequently, determining the relative ancestry of alternative character states defines the polarity of character transformations and roots the underlying tree. This is a fundamental property of phylogenetic inference. Polarization in tree reconstruction enables the “arrow of time” (sensu Eddington’s entropy-induced asymmetry), solves the rooting problem, and fulfills other epistemological requirements.

Cladistically speaking, character polarity refers to the distinction between the ancestral and derived states and the identification of synapomorphies. However, an evolutionary view of polarity also refers to the direction of character state transformations in the phylogenetic model. Historically, three accepted alternatives have been available for rooting trees [100–102]: the outgroup comparison, the ontogenetic method, and the paleontological or stratigraphic method. While the three methods do not include assumptions of evolutionary process, they have been the subject of much discussion and their interpretation of much controversy. The first two are, however, justified by the assumption that diversity results in a nested taxonomic hierarchy, which may or may not be induced by evolution. We will not discuss the stratigraphic method, as it relies on auxiliary assumptions regarding the completeness of the fossil record. The midpoint rooting criterion mentioned earlier will not be discussed either. The procedure is contested by the existence of heterogeneities in rates of change across the trees and problems with the accurate characterization of phylogenetic distances.

In outgroup comparison, polarity is inferred by the distribution of character states in the ingroup (group of taxa of interest) and the sister group (taxa outside the group of interest). In a simple case, if the character state is only found in the ingroup, it must be considered derived. Outgroup comparison is by far the method of choice because phylogeneticists tend to have confidence in the supporting assumptions: higher-level relationships are outside the ingroup, equivalent ontogenetic stages are compared, and character state distributions are appropriately surveyed. Unfortunately, the method is “indirect” in that it depends on the assumption of the existence of a higher-level relationship between the outgroup and the ingroup. Consequently, the method cannot root the ToL because at that level there is no higher-level relationship that is presently available. Moreover, the method in itself does not polarize characters. It simply connects the ingroup to the rest of the ToL [100].
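The simple case of outgroup comparison can be written down in a few lines. This is a minimal sketch for a single character, with hypothetical taxon names and states; it assumes the usual convention that a state also found in the outgroup is treated as ancestral, which goes slightly beyond the simple case stated above.

```python
def polarize_by_outgroup(ingroup, outgroup):
    """Polarize the states of one character by outgroup comparison.

    ingroup, outgroup: dicts mapping taxon name -> character state.
    Returns a dict mapping each state seen in the ingroup to
    'derived' (restricted to the ingroup) or 'ancestral'
    (also present in the outgroup; an assumed convention).
    """
    outgroup_states = set(outgroup.values())
    return {
        state: "ancestral" if state in outgroup_states else "derived"
        for state in set(ingroup.values())
    }

# Hypothetical example: state 1 occurs only in the ingroup.
example_ingroup = {"taxon1": 0, "taxon2": 1, "taxon3": 1}
example_outgroup = {"sister1": 0, "sister2": 0}
example_polarity = polarize_by_outgroup(example_ingroup, example_outgroup)
```

The function needs an outgroup as input, which makes the indirect nature of the method explicit: at the level of the ToL there is nothing to pass as the second argument.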

The ontogenetic criterion confers polarity through the distribution of the states of homologous characters in ontogenies of the ingroup, generally by focusing on the generality of character states, with more widely distributed states being considered ancestral. Nelson’s rule states: “given an ontogenetic character transformation from a character observed to be more general to a character observed to be less general, the more general character is primitive and the less general advanced” [103]. This “biogenetic law” appears powerful in that it depends only on the assumption that ontogenies of ingroup taxa are properly surveyed. It is also a “direct” method that relies exclusively on the ingroup. Consequently, it has the potential to root the ToL. Unfortunately, Nelson’s “generality” has been interpreted in numerous ways, especially as it relates to the ontogenetic sequence, leading to much confusion [102]. It also involves comparison of developmentally nested and distinct life history stages, making it difficult to extend the method (originally conceptualized for vertebrate phylogeny) to the microbial world. However, Weston [100, 104] made it clear that the ontogenetic criterion embodies a wider “generality criterion” in which the taxic distribution of a character state is a subset of the distribution of another. In other words, character states that characterize an entire group must be considered ancestral relative to an alternative state that characterizes a subset of the group. Besides the centrality of nested patterns, the generality criterion embeds the core assumption that every homology is a synapomorphy in nature’s nested taxonomic hierarchy and that homologies in the hierarchy result from additive phylogenetic change [100]. Weston’s more general rule therefore states that “given a distribution of two homologous characters in which one, $x$, is possessed by all of the species that possess its homolog, character $y$, and by at least one other species that does not, then $y$ may be postulated to be apomorphous relative to $x$” [104]. The only assumption of the method is that relevant character states in the ingroup are properly surveyed. This new rule crucially substitutes the concept of ontogenetic transformation by the more general concept of homology and additive phylogenetic change, which can be applied to cases in which homologous entities accumulate “iteratively” in evolution (e.g., generation of paralogous genes by duplication). Since horizontally acquired characters (xenologs) are not considered synapomorphies, they contribute towards phylogenetic noise and are excluded after calculation of homoplasy and retention indices (i.e., measures of goodness of fit of characters to the phylogeny).
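Weston’s rule is a statement about nested sets, so it can be expressed directly with set operations. The sketch below is one possible reading of the rule as quoted above; the function name, return format, and taxon labels are hypothetical.

```python
def weston_polarity(taxa_with_x, taxa_with_y):
    """Apply Weston's generality rule to two homologous characters.

    taxa_with_x, taxa_with_y: iterables of the taxa that possess x and y.
    If every taxon with y also has x, and at least one other taxon has x
    only, y is postulated to be apomorphic relative to x (and vice versa).
    Returns None when neither distribution is nested in the other.
    """
    x, y = frozenset(taxa_with_x), frozenset(taxa_with_y)
    if not x or not y:
        raise ValueError("each character must occur in at least one taxon")
    if y < x:  # proper subset: y is the less general character
        return {"plesiomorphic": "x", "apomorphic": "y"}
    if x < y:
        return {"plesiomorphic": "y", "apomorphic": "x"}
    return None  # identical or non-nested distributions: no polarity

# Hypothetical distributions over four taxa.
nested = weston_polarity({"t1", "t2", "t3", "t4"}, {"t3", "t4"})
overlapping = weston_polarity({"t1", "t2"}, {"t2", "t3"})
```

Note that the rule uses only the ingroup: no outgroup argument is required, and distributions that merely overlap (as in the second call) are left unpolarized.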

We have applied the “generality criterion” to the rooting of the ToL through polarization strategies that embody axioms of evolutionary process. Figure 4 shows three examples. A rooted phylogeny describing the evolution of 5S rRNA molecules sampled from a wide range of organisms was reconstructed from molecular sequence and structure [105]. The ToL that was recovered was rooted paraphyletically in Archaea (Figure 4(a)). The model of character state transformation was based on the axiom that evolved RNA molecules are optimized to increase molecular persistence and produce highly stable folded conformations. Molecular persistence materializes in RNA structure in vitro, with base pairs associating and dissociating at rates as high as 0.5 s$^{-1}$ [106]. The frustrated kinetics and energetics of this folding process enable some structural conformations to quickly reach stable states. This process is evolutionarily optimized through structural canalization [6], in which evolution attains molecular functions by both increasing the average life and stability of


[Figure 4 panels. (a) Rooted tree with Archaea, Bacteria, and Eukarya. Taxa = 666 5S rRNA molecules of diverse organisms; characters = sequence and structure of the molecules; tree length = 11381 steps; CI = 0.072, RI = 0.772, and $g_1$ = −0.13. (b) Timeline from ancestral to derived lineage groupings of Archaea (A), Bacteria (B), and Eukarya (E). (c) Rooted tree with Eukarya, Archaea, and Bacteria. Taxa = 150 free-living proteomes; characters = genomic abundance of 1510 FSFs; tree length = 70132 steps; CI = 0.155, RI = 0.739, and $g_1$ = −0.28.]

Figure 4: Trees of life generated from the structure of RNA and protein molecules congruently show a rooting in Archaea. (a) A rooted phylogenetic tree of 5S rRNA reconstructed from both the sequence and the structure of the molecules (from [105]). (b) Global most-parsimonious scenario of organismal diversification based on tRNA (from [107]). A total of 571 tRNA molecules with sequence, base modification, and structural information were used to build a ToL, which failed to show monophyletic groupings. Ancestries of lineages were then inferred by constraining sets of tRNAs into monophyletic groups representing competing (shown in boxes) or noncompeting phylogenetic hypotheses and measuring tree suboptimality and lineage coalescence (illustrated with color hues in circles). (c) A ToL reconstructed from the genomic abundance counts of 1510 FSFs as phylogenetic characters in the proteomes of 150 free-living organisms sampled equally and randomly from the three domains of life (data taken from [122, 123]). Taxa were labeled with circles colored according to superkingdom. CI = consistency index, RI = retention index, and $g_1$ = gamma distribution parameter.

selected conformations and decreasing their relative number. Thus, conformational diversity measured, for example, by the Shannon entropy of the base-pairing probability matrix or by features of thermodynamic stability acts as an “evo-devo” proxy of a generality criterion for RNA molecules, in which the criterion of similarity (e.g., ontogenetic transformation) is of positional and compositional correspondence. Using a different approach, we recently broadened the use of phylogenetic constraint analysis [107, 108], borrowing from a formal cybernetic method that decomposes a reconstructable system into its components [109], and used it to root a ToL derived from tRNA sequence and structure (Figure 4(b)). The ToL was again rooted in Archaea. The number of additional steps required to force (constrain) particular taxa into a monophyletic group was used to define a lineage coalescence distance ($S$) with which to test alternative hypotheses of monophyly. These hypotheses were then ordered according to $S$ value in an evolutionary timeline. Since $S$ records the relative distribution of character states in taxic sets, it also embodies the generality criterion of rooting. Finally, we generated rooted ToLs that describe the evolution of proteomes directly from a census of protein fold structures in proteomes (Figure 4(c)). This ToL shows a paraphyletic rooting in Archaea. The method extracts phylogenetic signal from proteomic abundance of protein fold structures and considers that the most abundant and widely distributed folds are of ancient origin when defining transformation series [41]. This polarization scheme, which results in the gradual growth of the proteome repertoire, again represents an embodiment of the generality criterion in which statements of homology (fold structures) result from additive phylogenetic change (increases in abundance). It is noteworthy that these ToL reconstructions take into account the genomic abundance of each and every fold structure in each proteome and across the entire matrix, thereby generating a frustrated system. In general, very high abundance of only a few folds will not attract taxa (species) to the derived branches in the ToL. The ancestry of taxa is determined by both the abundance and interplay among fold structure characters. For example, metabolic folds such as those involved in ATP hydrolysis are widespread in cells and considered to be very ancient. These are also the most ancient folds in our phylogenies. In comparison, some popular eukaryote-specific folds (e.g., immunoglobulin superfamilies) are highly abundant but appear in a derived manner in our phylogenies. Thus, we reason that there is no circularity involved in the character polarization scheme.
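The lineage coalescence distance $S$ described above for the tRNA analysis is, in essence, a difference of tree lengths. A minimal sketch of that bookkeeping follows; the tree lengths are hypothetical placeholders, the hypothesis labels are invented, and the searches that would produce the lengths (an optimal search and one constrained search per hypothesis) are assumed to have been run elsewhere.

```python
def coalescence_distances(optimal_length, constrained_lengths):
    """S = extra steps needed to force each hypothesis of monophyly.

    optimal_length: length (in steps) of the unconstrained optimal tree.
    constrained_lengths: dict mapping hypothesis label -> length of the
    best tree found under that monophyly constraint.
    """
    distances = {}
    for hypothesis, length in constrained_lengths.items():
        if length < optimal_length:
            raise ValueError("a constrained tree cannot be shorter than the optimum")
        distances[hypothesis] = length - optimal_length
    return distances

def order_by_distance(distances):
    """Order hypotheses by increasing S (ties broken by label)."""
    return sorted(distances, key=lambda h: (distances[h], h))

# Hypothetical tree lengths, for illustration only.
example_S = coalescence_distances(
    1000, {"hypothesis_1": 1012, "hypothesis_2": 1030, "hypothesis_3": 1021}
)
example_order = order_by_distance(example_S)
```

The sketch only sorts hypotheses by $S$; how positions in that ordering map onto older or younger lineages is defined in [107, 108] and is not assumed here.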

The compositional schemes extend the concept of rooting with paralogous sequences to the entire proteome complements, from gene family level [100] to structural hierarchies. The three examples make use of different rooting rationales but provide a congruent scenario of origins of diversified life. Technically, roots are inferred by the Lundberg method [110], which does not require any outgroup taxa specification. This method roots the trees a posteriori by attaching the hypothetical ancestor to the branch of the unrooted network that would yield the minimum increase in tree length (thus preserving the principle of parsimony).
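The a posteriori rooting step can be sketched on a toy example. The code below is an illustration under simplifying assumptions, not the implementation used in the cited analyses: it scores trees with Fitch parsimony for unordered characters, uses an all-zero hypothetical ancestor, and reserves the labels "ATTACH" and "ANCESTOR" for the nodes it adds. The four-taxon tree and its binary states are invented.

```python
from collections import defaultdict

def fitch_length(edges, leaf_states, start):
    """Fitch parsimony length of a tree given as a list of edges.

    leaf_states maps leaf name -> list of character states; `start` is the
    node where the traversal begins (the length does not depend on it).
    """
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    steps = 0

    def combine(a, b):
        nonlocal steps
        common = a & b
        if common:
            return common
        steps += 1  # disjoint state sets imply one extra change
        return a | b

    def visit(node, parent):
        sets = [{s} for s in leaf_states[node]] if node in leaf_states else None
        for child in adj[node]:
            if child == parent:
                continue
            child_sets = visit(child, node)
            if sets is None:
                sets = child_sets
            else:
                sets = [combine(a, b) for a, b in zip(sets, child_sets)]
        return sets

    visit(start, None)
    return steps

def lundberg_root(edges, leaf_states, ancestor_states):
    """Return (branch, increase) for the cheapest attachment of the ancestor."""
    base = fitch_length(edges, leaf_states, next(iter(leaf_states)))
    states = dict(leaf_states)
    states["ANCESTOR"] = ancestor_states
    best_edge, best_increase = None, None
    for u, v in edges:
        # Split branch (u, v) with a new node and hang the ancestor on it.
        trial = [e for e in edges if e != (u, v)]
        trial += [(u, "ATTACH"), ("ATTACH", v), ("ATTACH", "ANCESTOR")]
        increase = fitch_length(trial, states, "ANCESTOR") - base
        if best_increase is None or increase < best_increase:
            best_edge, best_increase = (u, v), increase
    return best_edge, best_increase

# Invented unrooted tree ((A, B), (C, D)) with three binary characters.
toy_edges = [("A", "n1"), ("B", "n1"), ("n1", "n2"), ("C", "n2"), ("D", "n2")]
toy_states = {"A": [0, 0, 0], "B": [1, 0, 0], "C": [1, 1, 0], "D": [1, 1, 1]}
toy_root = lundberg_root(toy_edges, toy_states, [0, 0, 0])
```

In this toy case the ancestor attaches to the branch leading to taxon A at no extra cost, so the tree is rooted there. When several branches tie, the sketch keeps the first one it encounters; a real analysis would need to report such ties.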

(iii) Characters Must Serve as Independent Evolutionary Hypotheses. Valid phylogenetic optimization requires that characters be independent pieces of evidence. Characters should not depend on other characters. When the assumption of independence is violated, characters are overweighted in the analysis and the resulting phylogeny fails to represent true history [111]. Possible dependencies could be of many kinds, from structural to functional, from developmental to ecological. These dependencies distort and obscure phylogenetic signal and must be either avoided or coded into the phylogenetic model through parameters or weight corrections.

As we will now elaborate, the problem of character independence is about parts and wholes in the hierarchical fabric of life and in the nested hierarchies of the ToL. Biological systems are by definition made of parts, regardless of the way parts are defined. In evolution, diversification and integration of parts unify parts into cohesive entities, modules, which then diversify [112]. This process and the rise of modules may explain evolutionary waves of complexity and organization and the emergence of structure (defined broadly) in biology that is hierarchical. The hierarchical makeup is made evident in the structure of protein molecules, where lower level parts of the polymer (the amino acid residues) interact with each other and establish cohesive higher-level modular parts, which also establish interaction networks and are crucial for molecular function and for interaction of proteins with the cellular environment. Consequently, the structure of proteins can be described at increasing hierarchical levels of structural abstraction: sequences, motifs, loops, domains, families, superfamilies, topologies, folds, architectures, and classes. Two accepted gold standards of protein classification, structural classification of proteins (SCOP) [113] and class architecture topology homology (CATH) [114], use parts of this incomplete scheme to describe the atomic complexity of the molecules. We note that these classifications do not consider unrealized structural states, such as protein folds that are possible but that have never been identified in the natural world of protein structures. We also note that modules sometimes engage in combinatorial games. For example, protein domains are rearranged in evolution by fusions and fissions, producing the enormous diversity of alternative domain rearrangements that exist in multidomain proteins [92].

Since biological systems evolve and carry common ancestry, parts of these systems by definition evolve and by themselves carry common ancestry. In other words, the histories of parts are embodied in the ensemble of lineages of the ToL. The focus of phylogenetic analysis, however, has been overwhelmingly the organism as the biological system, as testified by the effort devoted to taxonomical classification, systematic biology, and building of the ToL. The genomic revolution has provided a wealth of parts in the amino acid sequence sites of proteins and nucleic acid molecules, which have been used as phylogenetic characters for analysis of molecules representing organisms. This has maintained the focus on reconstructing trees of systems (the wholes). For example, amino acid sites of a protein are generally fit into a data matrix (alignment), which is then used to build trees of genes and organisms (Figure 5(a)) using modern algorithmic implementations [115]. However, and as we have mentioned, proteins have complicated structures that result from interactions between amino acids at the 3D atomic level. These intramolecular interactions, or at least some of them [116], induce protein folding and delimit molecular function and protein stability. They are responsible for protein secondary, supersecondary, domain, and tertiary structure, and by definition their mere existence induces violation of character independence. Penny and Collins [117] proposed a simple thought experiment in which the bioinformatician exchanges columns of sequence sites in the alignment matrix and asks what was lost in the process. Randomization of characters (columns) in the data matrix does not change the phylogenetic tree. However, randomization destroys the structure of the molecule and very likely its function. This confirms that reconstruction of trees from information in sequence sites violates character independence and in the process ignores structure and biology. The effects of violation


[Figure 5 panels. (a) Canonical paradigm: phylogenetic model of the evolution of systems (e.g., proteins from Pfu, Mja, and Mba). Characters are parts (amino acid sequence sites); interactions between parts (e.g., atomic contacts between sites) have a high impact on phylogenetic reconstruction. (b) New paradigm: phylogenetic model of the evolution of parts (e.g., FSF domains c.37.1, c.1.2, and c.26.1). Characters are systems (proteomes of Pfu, Mja, and Mba); interactions between systems (e.g., proteomes in trophic chains, from picoplankton to higher trophic levels) have a low impact on phylogenetic reconstruction.]

Figure 5: A new phylogenetic strategy simplifies the problems of character independence. (a) The canonical paradigm explores the evolution of systems, such as the evolution of organisms in the reconstruction of ToLs. For example, the terminals of phylogenetic trees can be genes sampled from different organisms (e.g., Pyrococcus furiosus (Pfu), Methanococcus jannaschii (Mja), and Methanosarcina barkeri (Mba)) and phylogenetic characters can be amino acid sequence sites of the corresponding gene products. Character states can describe the identity of the amino acid at each site. Since characters are molecular parts that interact with other parts when molecules fold into compact 3D structures, their interaction violates the principle of character independence. Consequently, the effects of covariation must be considered in the phylogenetic model used to build the trees. (b) The new paradigm explores the evolution of parts, such as the evolution of protein domains in proteomes. For example, the terminals of phylogenetic trees can be domains defined at fold superfamily level of structural complexity and the characters used to build the trees can be proteomes. Character states can be the number of domains holding the FSF structure. Since proteomes interact with other proteomes only when organisms establish close interactions, interactions that could affect the abundance of domains in proteomes should be considered negligible (unless there is an obligate parasitic lifestyle involved) and there is no need to budget trophic interactions in the phylogenetic model.

of character independence may be minimal for trees of sequences that are closely related. However, trees describing deep historical relationships require that sequences be divergent, and this maximizes the chances of even wider divergences in molecular structure that are not being accounted for by the models of sequence evolution.
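The thought experiment of Penny and Collins [117] discussed above can be made concrete with a toy alignment. In this sketch the sequences are invented and pairwise Hamming distances stand in for a full tree search, on the assumption that any method treating sites as independent and exchangeable behaves the same way.

```python
import random

def hamming(a, b):
    """Number of aligned sites at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def pairwise_distances(alignment):
    """Hamming distance for every unordered pair of sequences."""
    names = sorted(alignment)
    return {(p, q): hamming(alignment[p], alignment[q])
            for i, p in enumerate(names) for q in names[i + 1:]}

def shuffle_columns(alignment, seed=0):
    """Apply one random permutation of sites to every sequence."""
    n_sites = len(next(iter(alignment.values())))
    order = list(range(n_sites))
    random.Random(seed).shuffle(order)
    shuffled = {name: "".join(seq[i] for i in order)
                for name, seq in alignment.items()}
    return shuffled, order

# Invented three-sequence alignment, for illustration only.
toy_alignment = {
    "seq1": "ALYSGKTW",
    "seq2": "ALYSGRTW",
    "seq3": "GLSSGKTF",
}
toy_shuffled, toy_order = shuffle_columns(toy_alignment, seed=1)
```

The distances between sequences are identical before and after shuffling, so the phylogenetic signal seen by such methods is untouched. What changes is the order of residues along each chain, that is, exactly the information that determines which sites come into contact when the molecule folds.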

The genomic revolution has also provided a wealth ofmodels of 3D atomic structure These structures are usedas gold standards to assign with high confidence structuralmodules to sequences As mentioned earlier proteomesembody collections of protein domains with well-definedstructures and functions Protein domain counts in pro-teomes have been used to generate trees of protein domains(Figure 5(b)) using standard cladistic approaches and well-established methods (reviewed in [87]) These trees describethe evolution of protein structure at global level Theyare effectively trees of parts While domains interact witheach other in multidomain proteins or establish protein-protein interactions with the domains of other proteinsthese interactions of parts do not violate character indepen-dence This is because phylogenetic characters are actuallyproteomes systems defined by structural states that exist atmuch higher levels than the protein domain not far awayfrom the organism level Remarkably no information is lostwhen character columns in the matrix are randomized in

the thought experiment. The order of proteome characters in the matrix does not follow any rationale. Characters are not ordered by lifestyles or trophic levels of the organisms. Interactions between free-living organisms will seldom bias their domain makeup, and if so, those characters can be excluded from analysis. Even the establishment of symbiotic or obligate parasitic interactions, such as the nodule-forming symbioses between rhizobia and legumes, may have little impact on character independence as long as the joint inclusion of the host and the symbiont is avoided.
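The statement that column order carries no information can be illustrated with a small sketch. The proteome names, FSF labels, and counts below are invented for illustration, and the Manhattan distance between rows is only an assumed stand-in for the character-by-character comparisons made during tree reconstruction; it is not the method used in the studies cited above.

```python
import random


def build_matrix(proteomes, fsfs):
    """Rows are FSF taxa, columns are proteome characters, cells are domain counts."""
    names = sorted(proteomes)
    matrix = [[proteomes[name].get(fsf, 0) for name in names] for fsf in fsfs]
    return names, matrix


def manhattan(row_a, row_b):
    """Sum of absolute differences between two rows, compared column by column."""
    return sum(abs(a - b) for a, b in zip(row_a, row_b))


def shuffle_columns(matrix, seed=0):
    """Apply one random permutation to the columns of every row."""
    order = list(range(len(matrix[0])))
    random.Random(seed).shuffle(order)
    return [[row[i] for i in order] for row in matrix]


# Hypothetical census: number of domains of each FSF found in each proteome.
proteomes = {
    "proteome_1": {"fsf_x": 12, "fsf_y": 3},
    "proteome_2": {"fsf_x": 20, "fsf_z": 4},
    "proteome_3": {"fsf_x": 7, "fsf_z": 1},
}
fsfs = ["fsf_x", "fsf_y", "fsf_z"]

names, matrix = build_matrix(proteomes, fsfs)
shuffled = shuffle_columns(matrix)

# Distances between taxa (rows) are unchanged by reordering characters (columns).
for i in range(len(fsfs)):
    for j in range(i + 1, len(fsfs)):
        assert manhattan(matrix[i], matrix[j]) == manhattan(shuffled[i], shuffled[j])
```

Because every row is permuted in the same way, any measure that sums over characters gives the same value before and after shuffling, which is the sense in which the ordering of proteomes in the matrix is arbitrary.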

6. Evidence Supporting the Archaeal Rooting of the Universal Tree

Figure 4 shows examples of rooted ToLs generated from the sequence and structure of RNA and protein molecules. The different phylogenomic approaches arrive at a common rooted topology that places the stem group of Archaea at the base of the ToL (the archaeal rooting of Figure 1(c)). However, the evolutionary interrelationship of parts and wholes prompts the use of trees of parts, a focus on higher-level structure, and a decrease in confidence in the power of trees of systems. These concepts have been applied to the study of the history of nucleic acid and protein structures for over a decade and

[Figure 6 graphic: Venn diagrams of (a) FSF domains, (b) GO terms, and (c) RNA families shared among Archaea, Bacteria, and Eukarya, each paired with a second diagram restricted to features with f > 0.6.]

Figure 6: Venn diagrams displaying the distributions of 1733 FSF domains (a), 1924 terminal GO terms (b), and 1148 RNA families (c) in the genomes of the three domains of life. FSF domain data was taken from Nasir et al. [122, 123] and included 981 completely sequenced proteomes from 70 Archaea, 652 Bacteria, and 259 Eukarya. Terminal GO terms corresponding to the “molecular function” hierarchy defined by the GO database [118] were identified in 249 free-living organisms, including 45 Archaea, 183 Bacteria, and 21 Eukarya (data taken from [123]). The Venn diagram of RNA families and the distribution of Rfam clans and families in organisms were taken from Hoeppner et al. [129] and their Dataset S1. Shown below are distribution patterns for FSFs, GO terms, and RNA families that are present in more than 60% of the organisms examined (f > 0.6). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

provide additional evidence in support of the archaeal rooting scenario.

6.1. Comparative Genomic Argumentation. The distribution of gene-encoded products in the genomes of sequenced organisms and parsimony thinking can reveal global evolutionary patterns without formal phylogenetic reconstruction. We will show how simple numerical analyses of protein domains, molecular activities defined by the gene ontology (GO) consortium [118], and RNA families that are domain-specific or are shared between domains of life can uncover the tripartite division of cellular life, exclude chimeric scenarios of origin, and provide initial insight on the rooting of the ToL (Figure 6).

The number of unique protein folds observed in nature is very small. SCOP ver. 1.75 defines only ~2000 fold superfamilies (FSFs), groups of homologous domains unified on the basis of common structural and evolutionary relationships, for a total of 110800 known domains in proteins [113, 119]. FSFs represent highly conserved evolutionary units that are suitable for studying organismal diversification [87]. Yafremava et al. [120] plotted the total number of distinct FSFs (FSF diversity) versus the average reuse of FSFs in the proteome of an organism (FSF abundance). This exercise uncovered a scaling behavior typical of a Benford distribution with a linear regime of proteomic growth for microbial

organisms and a superlinear regime for eukaryotic organisms (Figure 8 in [120]). These same scaling patterns are observed when studying the relationship between open reading frames and genome size [121]. Remarkably, archaeal and eukaryal proteomes exhibited the minimum and maximum levels, respectively, of FSF abundance and diversity. Bacterial proteomes, however, showed intermediate levels. We note that the general scaling behavior is consistent with a scenario in which evolutionary diversification proceeds from simpler proteomes to more complex ones in a gradual manner, supporting the principle of spatiotemporal continuity and revealing the nested phylogenetic hierarchies of organisms. Under this scenario (and results of [120]), the streamlined archaeal proteomes represent the earliest form of cellular life. Remarkably, archaeal species harboring thermophilic and hyperthermophilic lifestyles encoded the most streamlined FSF repertoires (Figure 7). Clearly, modern thermophilic archaeons are most closely related to the ancient cells that inhabited planet Earth billions of years ago (also read below).

FSF domain distributions in the genomes of the three domains of life provide further insights into their evolution. Nasir et al. [122, 123] generated Venn diagrams to illustrate FSF sharing patterns in the genomes of Archaea, Bacteria, and Eukarya (Figure 6(a)). These diagrams display the total number of FSFs that are unique to a domain of life (taxonomic groups A, B, and E), shared by only two (AB, BE, and AE), and

[Figure 7 graphic: scatter plot of FSF reuse (log) against FSF use (log) for thermophilic and nonthermophilic archaeal proteomes. Labeled species: S. marinus, I. hospitalis, H. butylicus, T. pendens, Cand. Korarchaeum, A. pernix, M. kandleri, T. neutrophilus, P. islandicum, and N. maritimus.]

Figure 7: A plot of FSF use (diversity) against FSF reuse (abundance) reveals a linear pattern of proteomic growth in 48 archaeal proteomes. Thermophilic archaeal species occupy positions that are close to the origin of the plot. They also populate the most basal branch positions in ToLs (see discussion in the main text). Both axes are in logarithmic scale.

those that are universal (ABE). About half of the FSFs (786 out of 1733) were present in all three domains of life, 453 of which were present in at least 60% of the organisms that were examined (f > 0.6, where f is the distribution index ranging from 0 to 1). The fact that about 70% of the 672 widely shared FSFs belong to the ABE taxonomic group strongly supports a common evolutionary origin for cells. Evolutionary timelines have confirmed that a large number of these universal FSFs were present in the ancestor of the three domains of life, the urancestor (Figure 1), and were for the most part retained in extant proteomes [48].
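The bookkeeping behind these Venn groups can be sketched in a few lines. The toy proteomes and FSF labels are hypothetical, and the distribution index f is assumed here to be the fraction of all examined proteomes that contain a given FSF; the cited studies may compute it within the relevant domains only.

```python
from collections import Counter


def taxonomic_group(domains_present):
    """Return the Venn group label ('A', 'BE', 'ABE', ...) for a set of domain letters."""
    return "".join(letter for letter in "ABE" if letter in domains_present)


def venn_census(proteomes):
    """Map each FSF to (taxonomic group, distribution index f)."""
    domains_by_fsf = {}
    occurrences = Counter()
    for domain, fsfs in proteomes.values():
        for fsf in fsfs:
            domains_by_fsf.setdefault(fsf, set()).add(domain)
            occurrences[fsf] += 1
    total = len(proteomes)
    return {
        fsf: (taxonomic_group(domains), occurrences[fsf] / total)
        for fsf, domains in domains_by_fsf.items()
    }


# Hypothetical proteomes: name -> (domain of life, set of FSFs present).
proteomes = {
    "archaeon_1": ("A", {"fsf_1", "fsf_2"}),
    "bacterium_1": ("B", {"fsf_1", "fsf_2", "fsf_3"}),
    "bacterium_2": ("B", {"fsf_1", "fsf_3"}),
    "eukaryote_1": ("E", {"fsf_1", "fsf_3", "fsf_4"}),
}
census = venn_census(proteomes)
widely_shared = {fsf for fsf, (_, f) in census.items() if f > 0.6}

# Proportions quoted in the text, recomputed from the reported counts.
universal_fraction = 786 / 1733      # FSFs present in all three domains
abe_among_widely_shared = 453 / 672  # ABE share of FSFs with f > 0.6
```

With the reported counts, `universal_fraction` is about 0.45 and `abe_among_widely_shared` is about 0.67, matching the "about half" and "about 70%" stated above.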

Interestingly, the number of BE FSFs was ~10-fold greater than AE and AB FSFs (324 versus 38 and 38) (Figure 6(a)). The finding that Bacteria and Eukarya encode a significantly large number of shared FSF domains is remarkable and hints towards an unprecedented strong evolutionary association between these two domains of life. Moreover, a significant number of BE FSFs is widespread in the bacterial and eukaryotic proteomes: 56 of the 324 FSFs are shared by more than 60% of the bacterial and eukaryal organisms that were analyzed (Figure 6(a)). This is strong evidence of an ancient vertical evolutionary trace from their mutual ancestor (anticipated in [123]). This trace uniquely supports the archaeal rooting of the ToL. In turn, the canonical and eukaryotic rooting alternatives are highly unlikely, as none alone can explain the remarkable diversity of the BE taxonomic group (as well as its ancient origin [123]). The remarkable vertical trace of the ABE taxonomic group and the negligible vertical trace of the AB group (only one FSF

shared by more than 60% of organisms) also falsify fusion hypotheses responsible for a putative chimeric makeup of Eukarya. In light of these findings, the most parsimonious explanation of comparative genomic data is that Archaea is the first domain of diversified cellular life.

The patterns of FSF sharing are further strengthened by the genomic distributions of terminal-level molecular function GO terms (Figure 6(b)). The three domains of life shared about a quarter (526) of the 1924 GO terms that were surveyed. A total of 232 of these ABE terms were shared by 60% of organisms analyzed. Again, the fact that about 76% of the 306 widely shared GO terms belong to the ABE taxonomic group strongly supports a common vertical trace, while the finding that Bacteria and Eukarya share a significantly large number of GO terms (272) supports again the archaeal rooting of the ToL. An alternative explanation for the very large size of BE could be large-scale metabolism-related gene transfer from the ancestors of mitochondria and plastids to the ancestors of modern eukaryotes [124]. However, we note that the BE group is not restricted to only metabolic functions. It also includes FSFs and GOs involved in intracellular and extracellular processes and regulation and informational functions [125]. Thus, the very large size of BE is a significant outcome most likely shaped by vertical evolutionary scenarios and cannot solely be explained by parasitic/symbiotic relationships that exist between Bacteria and Eukarya [123]. Moreover, the simplicity of archaeal FSF repertoires is not due to the paucity of available archaeal genomic data. We confirmed that the mean FSF coverage (i.e., the number of proteins annotated to FSFs/GOs out of the total) for Archaea, Bacteria, and Eukarya was largely comparable (e.g., Table S1 in [122]). Finally, as we will describe below, these comparative patterns and tentative conclusions are confirmed by phylogenomic analysis [91, 122, 126]. Congruent phylogenies were obtained with (Figure 4(c)) and without equal and random sampling of taxa. Thus, the relatively low number of archaeal genomes is also not expected to compromise our inferences.

We note that Eukarya shares many informational genes with Archaea and many operational genes with Bacteria. While informational genes have been thought more refractory to HGT than noninformational genes, we have confirmed at GO level that this is not the case. A ToL built from GO terms showed that, in fact, noninformational terms were less homoplasious, while statistical enrichment analyses revealed that HGT had little if any functional preference for GO terms across GO hierarchical levels. We recently compared ToLs reconstructed from non-HGT GO terms and ToLs reconstructed from informational GO terms that were extracted from non-HGT GO terms (Kim et al., ms. in review; also read below). In both cases, Archaea appeared as a basal paraphyletic group of the ToLs and the common origin of Bacteria and Eukarya was maintained. Thus, the 272 GO terms shared by Bacteria and Eukarya harbor a strong vertical trace.

Another observation often used as support for Archaea-Eukarya kinship is the discovery of few eukaryote-specific proteins (e.g., actin, tubulin, H3, H4, ESCRT, ribosomal proteins, and others) in some archaeal species [127] and

their complete absence from Bacteria (except for documented tubulin HGT from eukaryotes to bacterial genus Prosthecobacter [128]). This suggests either that eukaryotes arose from an archaeal lineage [127] or that the ancestor of Eukarya and Archaea was complex and modern Archaea are highly reduced [25]. Indeed, few eukaryote-specific proteins have now been found in some archaeal species (in some cases, just one species). We argue that this poor spread cannot be taken as evidence for the Archaea-Eukarya sister relationship. This needs to be confirmed by robust phylogenetic analysis, which is unfortunately not possible when using protein sequences. Recently, we described a new strategy for inferring vertical and horizontal traces [123]. This method calculates the spread of FSFs and GOs that are shared between two-superkingdom groups (i.e., AB, AE, and BE). Balanced distributions often indicate vertical inheritance while biased distributions suggest horizontal flux. For example, penicillin binding molecular activity (GO: 0008658) was present in 100% of the sampled bacterial proteomes but was only present in 11% of the archaeal species [123]. Thus, presence of GO: 0008658 in Archaea was attributed to HGT gain from Bacteria. Using this simple method, we established that both Bacteria and Eukarya were united by much stronger vertical trace than either was to Archaea. In fact, strong reductive tendencies in the archaeal genomes were recorded [123]. Thus, in our opinion, presence of eukaryote-specific proteins in only very few archaeal species could in fact be an HGT event that is not detectable by sequence phylogenies.
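The balanced-versus-biased reasoning can be made concrete with a minimal sketch. The function name and the 0.5 cutoff on the difference in spread are assumptions introduced here for illustration only; the criterion actually used in [123] is not specified in this section.

```python
def classify_trace(f_first, f_second, max_gap=0.5):
    """Label a feature shared by two superkingdoms from its spread in each.

    f_first and f_second are the fractions (0 to 1) of proteomes in each
    superkingdom that contain the feature. max_gap is a hypothetical cutoff.
    """
    gap = abs(f_first - f_second)
    if gap > max_gap:
        return "biased"    # uneven spread, suggestive of horizontal flux
    return "balanced"      # similar spread, compatible with vertical inheritance


# Penicillin binding (GO: 0008658): 100% of bacterial versus 11% of archaeal proteomes.
penicillin_binding = classify_trace(1.00, 0.11)

# A hypothetical feature with similar spread in both superkingdoms.
even_feature = classify_trace(0.80, 0.70)
```

Under this assumed cutoff the penicillin binding example is labeled as biased, in line with its attribution to HGT gain from Bacteria.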

In turn, many arguments now favor the Bacteria-Eukarya sisterhood, in addition to the structure-based phylogenies and balanced distribution of molecular features. For example, Bacteria and Eukarya have similar lipid membranes that can be used as argument for their evolutionary kinship. Moreover, Archaea fundamentally differ from Eukarya in terms of their virosphere. Viruses infecting Archaea and Eukarya are drastically different, as recently discussed by Forterre [25].

The genomic distribution of RNA families was taken from Hoeppner et al. [129] and also shows a vertical evolutionary trace in five crucial Rfam clans that are universal, including tRNA, 5S rRNA, subunit rRNA, and RNase P RNA. These universal RNA groups are likely minimally affected by HGT. However, 99% of Rfam clans and families are specific to domains of life and only 11 of the 1148 groups were shared at f > 0.6 levels. This clearly shows that the functional complexity of RNA materialized very late during organismal diversification and that it is not a good genomic feature for exploring the rooting of the ToL. Only five RNA families are shared between two domains of life, and only one of these does so at f > 0.6: the G12 pseudoknot of the 23S rRNA, which is present in bacterial and eukaryotic organellar rRNA. While the large subunit rRNA scaffold supports the G12 pseudoknot with its vertical trace, all five interdomain RNA families can be explained most parsimoniously by HGT. Thus, only a handful of ancient and universal RNA species can be used to root the ToL.

We end by noting that the Venn diagrams consistently show that Archaea harbors the least number of unique (A) and shared (AB and AE) FSFs, GO terms, or RNA families. This trend supports an early divergence of this domain of life

from the urancestor and the possibility that such divergence be shaped by evolutionary reductive events. We reason that such losses would be more parsimonious early on in evolution than in the later periods. This is because genes often increase their abundance in evolutionary time (by gene duplications, HGT, and other processes). Thus, it is reasonable to think that loss of an ancient gene would be more feasible very early in evolution relative to losing it very late.

6.2. Phylogenomic Evidence from the Sequence and Structure of RNA. Phylogenetic analyses of the few RNA families that are universal and display an important vertical evolutionary trace (Figure 6(c)) provide compelling evidence in favor of an early evolutionary appearance of Archaea [105, 107, 108, 130–135]. Here we briefly summarize evidence from tRNA, 5S rRNA, and RNase P RNA. Unpublished analyses of rRNA sequence and structure using advanced phylogenetic methods also show that Archaea was the first domain of life.

tRNA molecules are generally short (~73–95 nucleotides in length) and highly conserved. Consequently, their sequence generally contains a limited amount of phylogenetic information. These limitations have been overcome by analyzing entire tRNomes [136], which extend the length of an organismal set of tRNAs to over 2000 bases. Xue et al. [130, 131] analyzed the genetic distances between tRNA sequences as averages between alloacceptor tRNAs from diverse groups of tRNomes using multiple molecular base substitution models. The distances were mapped onto an unrooted phylogeny of tRNA molecules (Figure 8(a)). The “arrow of time” assumption in these studies is that ancient tRNA paralogs closely resemble each other when lineages originate close to the time of the gene duplication. Remarkably, the results revealed a paraphyletic rooting of the ToL in Archaea. The root was specifically located close to the hyperthermophilic methanogen Methanopyrus kandleri (Figure 8(a)). The hypothesis of this specific rooting scenario has been supported by several other studies [134, 137], including a study of genetic distances between paralogous pairs of aminoacyl-tRNA synthetase (aaRS) proteins [131]. A remarkable match between distance scores of tRNA and pairs of aaRS paralogs (Figure 8(b)) not only confirms the early appearance of Archaea but also suggests a coevolutionary trace associated with molecular interactions that are responsible for the genetic code. In fact, a recent exhaustive phylogenomic analysis of tRNA and aaRS coevolution explicitly reveals the origins and evolution of the genetic code and the underlying molecular basis of genetics [59]. Di Giulio [132, 133] has also proposed an archaeal rooting of the tree of life, specifically in the lineage leading to the phylum of Nanoarchaeota. This rooting is based on unique and ancestral genomic traits of Nanoarchaeum equitans: split genes separately codifying for the 5′ and 3′ halves of tRNA and the absence of operons, which are considered molecular fossils [132]. However, this claim needs additional support, as contrasting evidence now recognizes N. equitans as a highly derived archaeal species [138].
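A genome-level average of pairwise tRNA distances can be sketched as follows. This is a simplified illustration, not the procedure of Xue et al.: it assumes pre-aligned sequences of equal length and uses the plain proportion of mismatched sites, whereas the original work applied multiple base substitution models. The three short sequences are invented. Assuming one representative tRNA for each of 20 alloacceptor classes gives the 190 pairs mentioned in the caption of Figure 8.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of mismatched sites between two aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def mean_pairwise_distance(sequences):
    """Average p-distance over all unordered pairs of sequences."""
    pairs = list(combinations(sequences, 2))
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Number of unordered pairs among 20 alloacceptor classes.
n_pairs = len(list(combinations(range(20), 2)))

# Hypothetical aligned fragments standing in for tRNAs of one genome.
toy_trnas = ["ACGU", "ACGA", "ACCA"]
toy_distance = mean_pairwise_distance(toy_trnas)
```

Under the "arrow of time" assumption described above, a genome whose paralogous tRNAs still resemble each other would yield a lower average distance than a genome of more recent origin.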

In addition to sequences, structural features of tRNA molecules also support the archaeal rooting of the ToL.

[Figure 8 graphic: (a) unrooted ToL of tRNA sequences from Archaea, Bacteria, and Eukarya, with D_allo traced on a thermal scale from 0.35 to 0.92; (b) plot of aminoacyl-tRNA synthetase history (Q_ars) against tRNA alloacceptor history (D_allo), running from ancient origins (e.g., Mka) to recent origins.]

Figure 8: Ancient phylogenetic signal in the sequence of tRNA and their associated aminoacyl-tRNA synthetase (aaRS) enzymes. (a) Unrooted ToL derived from tRNA sequences with their alloacceptor D_allo distances traced in thermal scale (from [130]). D_allo is the average of pairwise distances for 190 pairs of tRNA isoacceptors. These distances measure pairwise sequence mismatches of tRNAs for every genome and their values increase for faster evolving sequences of species of more recent origin. (b) Coevolution of aaRSs and their corresponding tRNA. Genetic distances between the top 10 potentially paralogous aaRS pairs estimated using BLASTP define a measure (Q_ars) of how closely the proteins resemble each other in genomes (from [131]). Larger Q_ars scores imply more ancestral and slowly evolving protein pairs. The plot of Q_ars scores against D_allo distances reveals a hidden correlation between the evolution of tRNA and aaRSs and the early origin of Archaea (larger Q_ars and lower D_allo distances).

The application of RNA structural evidence in phylogenetic studies [50, 139–141] has multiple advantages over sequence data when studying ancient events, especially because RNA structures are far more conserved than sequences. This has been demonstrated in a phylogenetic approach that uses RNA structural information to reconstruct evolutionary history of macromolecules such as rRNA [29, 50], tRNA [107, 108], 5S rRNA [105], RNase P RNA [135], and SINE RNA [142]. Geometrical and statistical properties of structure (e.g., stems or loops commonly found in the secondary structures of RNA molecules) are treated as linearly ordered multistate phylogenetic characters. In order to build rooted trees, an evolutionary tendency toward conformational order is used to polarize change in character state transformations. This defines a hypothetical ancestor with which to root the ingroup using the Lundberg method. Reconstructed phylogenies produce trees of molecules and ToLs (e.g., Figure 4(a)) or trees of substructures that describe the gradual evolutionary accretion of structural components into molecules (Figure 9). For example, phylogenetic trees of tRNA substructures define explicit models of molecular history and show that tRNA originated in the acceptor stem of the molecule [108]. Remarkably, trees reconstructed from tRNA drawn from individual domains of life demonstrate that the sequence of accretion events occurred differently in Archaea than in Bacteria and Eukarya, suggesting a sister group evolutionary relationship between the bacterial and eukaryotic domains

(Figure 9(a)). A similar result obtained from trees of 5S rRNA substructures revealed different molecular accretion sequences of archaeal molecules when these were compared to bacterial and eukaryal counterparts [105], confirming again, in a completely different molecular system, the history of the domains of life.

An analysis of the structure of RNase P RNA also provides similar conclusions [135]. While a ToL reconstructed from molecular structure placed type A archaeal molecules at its base (a topology that resembles the ToL of 5S rRNA), a tree of RNase P RNA substructures uncovered the history of molecular accretion of the RNA component of the ancient endonuclease and revealed a remarkable reductive evolutionary trend (Figure 9(b)). Molecules originated in stem P12 and were immediately accessorized with the P1–P4 catalytic pseudoknotted core structure that interacts with RNase P proteins of the endonuclease complex and ancient segments of tRNA. Soon after this important accretion stage, the evolving molecule loses its first stem in Archaea (stem P8), several accretion steps earlier than the first loss of a stem in Eukarya or the first appearance of a Bacteria-specific stem. These phylogenetic statements provide additional strong support to the early origin of the archaeal superkingdom prior to the divergence of the shared common ancestor of Bacteria and Eukarya. As we will discuss below, the early loss of a structure in the molecular accretion process of a central and ancient RNA family is significant. It suggests

[Figure 9 graphic: (a) rooted trees of tRNA arm substructures (Acc, AC, DHU, TΨC, Var, and 3′ end) for Archaea, Bacteria, and Eukarya, colored by age (nd) from ancestral (0) to derived (1); (b) tree of RNase P RNA stem substructures (P1 to P20) with bootstrap support values and a 3D model colored by age (nd), annotated with the ancient universal stems, the universal pseudoknot, the first stem lost in Archaea, and the first domain-specific stems.]

Figure 9: The history of accretion of tRNA and RNase P RNA substructures reveals the early evolutionary appearance of Archaea. (a) Rooted trees of tRNA arm substructures reveal the early appearance of the acceptor arm (Acc) followed by the anticodon arm (AC) in Archaea or the pseudouridine (TΨC) arm in both Bacteria and Eukarya (from [108]). The result confirms the sister group relationship of Bacteria and Eukarya. (b) Trees of molecular substructures of RNase P RNAs were reconstructed from characters describing the geometry of their structures (from [135]). Branches and corresponding substructures in a 3D atomic model are colored according to the age of each substructure (nd, node distance). Note the early loss of stem P8 in Archaea immediately after the evolutionary assembly of the universal functional core of the molecule.

that the emerging archaeal lineages were subjected to strong reductive evolutionary pressures during the early evolution of a very ancient RNA molecule.

6.3. Phylogenomic Evidence from Protein Domain Structure. G. Caetano-Anollés and D. Caetano-Anollés [41] were the first to utilize protein domain structures as taxa and to reconstruct trees of domains (ToDs) describing their evolution. Figure 10 shows a rooted ToD built from a census of FSF structures in 981 genomes (data taken from [122, 123]). These trees are unique in that their terminal leaves represent a finite set of component parts [87]. These parts describe at global level the structural diversity of the protein world. When building trees of FSFs, the age of each FSF domain structure can be calculated from the ToDs by simply calculating a distance in nodes between the root of the tree and its corresponding terminal leaf. This is possible because ToDs exhibit highly unbalanced (pectinate) topologies that are the result of semipunctuated processes of domain appearance and accumulation. This node distance (nd), rescaled from 0 to 1, provides a relative timescale to study the order of FSF appearance in evolutionary history [126, 143]. Wang et al. [144] showed that nd correlates linearly with geological time and defines a global molecular clock of protein folds. Thus, nd can be used as a reliable proxy for time. Plotting the age of FSFs in each of the seven taxonomic groups confirmed evolutionary statements we had previously deduced from the Venn diagrams of Figure 6. The ABE taxonomic group included the majority of ancient and widely distributed FSFs. This is expected. In the presence of a strong vertical trace, molecular diversity must delimit a nested taxonomic hierarchy. The ABE group was followed by the evolutionary appearance of the BE group, which preceded the first domain-specific structures, which were Bacteria-specific (B). Remarkably, Archaea-specific (A) and Eukarya-specific (E) structures appeared concurrently and relatively late. These general trends, captured in the box plots of Figure 10, have been recovered repeatedly when studying domain structures at various levels of structural complexity, from folds to fold families [91, 126], when using CATH or SCOP structural definitions [122, 145], or when exploring the evolution of terminal GO terms.
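The node distance calculation lends itself to a short sketch. The nested-tuple tree and its leaf labels are hypothetical, and the counting convention (internal nodes crossed after the root, so that the most basal leaf receives nd = 0) is an assumption made here to reproduce the 0 to 1 scale; the cited studies may count nodes differently.

```python
def node_distances(tree, depth=0, out=None):
    """Map each leaf to the number of internal nodes crossed after the root.

    A tree is either a leaf label (str) or a tuple of subtrees.
    """
    if out is None:
        out = {}
    if isinstance(tree, str):
        out[tree] = max(depth - 1, 0)
    else:
        for child in tree:
            node_distances(child, depth + 1, out)
    return out


def relative_age(tree):
    """Rescale node distances to the 0 (most ancient) to 1 (most recent) range."""
    raw = node_distances(tree)
    deepest = max(raw.values()) or 1  # avoid division by zero for trivial trees
    return {leaf: count / deepest for leaf, count in raw.items()}


# Hypothetical pectinate (comb-like) tree of five FSFs, rooted at the outermost tuple.
toy_tod = ("fsf_1", ("fsf_2", ("fsf_3", ("fsf_4", "fsf_5"))))
nd = relative_age(toy_tod)
```

In this toy tree `fsf_1` branches off first and receives nd = 0, while `fsf_4` and `fsf_5` share the most derived node and receive nd = 1. The approach is only meaningful for strongly unbalanced trees, as noted above.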

While the early rise of BE FSFs supports the early divergence of Archaea from the urancestor, the very significant trend of gradual loss of structures occurring in the lineages of the archaeal domain and the very late appearance of Archaea-specific structures (e.g., [126]) demand explanation. Since ancient BE FSFs are widely distributed in proteomes (Figure 10), they cannot arise from separate gains of FSFs in Bacteria and Eukarya or by processes of horizontal spread of structures. This was already evident from the Venn diagrams of Figure 6. Moreover, the BE sisterhood to the exclusion of Archaea was further supported by the inspection of FSFs involved in lipid synthesis and transport (Table 1). Membrane lipids are very relevant to the origins of diversified life ([146] and references therein). Bacteria and Eukarya encode similar lipid membranes, while archaeal membranes have different lipid composition (isoprenoid ethers). To check if

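[Figure 10 graphic: rooted tree of domains (taxa = 1733 FSF domains; characters = 981 genomes; tree length = 458226 steps; CI = 0.049, RI = 0.796, and g1 = −0.08) with the most basal FSFs labeled (c.55.1, c.37.1, b.40.4, c.2.1, a.4.5, and c.66.1) and leaves colored by distribution index: f between 0.0 and 0.1 (529), between 0.1 and 0.5 (623), between 0.5 and 0.9 (336), between 0.9 and 1.0 (228), and f = 1.0 (17). Box plots show the origin of FSF domains, as age of FSFs (nd) from 0.0 to 1.0, for the ABE, BE, AB, AE, B, A, and E taxonomic groups.]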

Figure 10: Phylogenomic tree of domains (ToD) describing the evolution of 1733 FSF domain structures. Taxa (FSFs) were colored according to their distribution (f) in the 981 genomes that were surveyed and used as characters to reconstruct the phylogenomic tree (data taken from [122, 123]). The most basal FSFs are labeled with SCOP alphanumeric identifiers (e.g., c.37.1 is the P-loop containing nucleoside triphosphate hydrolase FSF). Boxplots display the age (nd value) distribution of FSFs for the seven possible taxonomic groups. nd values were calculated directly from the tree [122] and define a timeline of FSF innovation from the origin of proteins (nd = 0) to the present (nd = 1). The group of FSFs that are shared by the three domains of life (ABE) is the most ancient taxonomic group, which spans the entire time axis, and their FSFs are widely distributed in genomes. The appearance of the BE group coincides with the first reductive loss of an FSF in Archaea. FSF structures specific to domains of life appear much later in evolution.

lipid synthesis was another BE synapomorphy, we identified 17 FSFs that were involved in lipid metabolism and transport

Table 1: List of FSFs involved in lipid metabolism and transport along with taxonomic distribution (data taken from [122, 123]).

Group | SCOP Id | FSF Id | FSF description
ABE | 89392 | b.125.1 | Prokaryotic lipoproteins and lipoprotein localization factors
ABE | 53092 | c.55.2 | Creatinase/prolidase N-terminal domain
ABE | 49723 | b.12.1 | Lipase/lipooxygenase domain (PLAT/LH2 domain)
ABE | 54637 | d.38.1 | Thioesterase/thiol ester dehydrase-isomerase
ABE | 63825 | b.68.5 | YWTD domain
BE | 47027 | a.11.1 | Acyl-CoA binding protein
BE | 48431 | a.118.4 | Lipovitellin-phosvitin complex, superhelical domain
BE | 55048 | d.58.23 | Probable ACP-binding domain of malonyl-CoA ACP transacylase
BE | 56968 | f.7.1 | Lipovitellin-phosvitin complex, beta-sheet shell regions
BE | 58113 | h.5.1 | Apolipoprotein A-I
BE | 47162 | a.24.1 | Apolipoprotein
BE | 56931 | f.4.2 | Outer membrane phospholipase A (OMPLA)
B | 82220 | b.120.1 | Tp47 lipoprotein, N-terminal domain
E | 47699 | a.52.1 | Bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin
E | 82936 | h.6.1 | Apolipoprotein A-II
E | 57190 | g.3.10 | Colipase-like
E | 49594 | b.7.4 | Rab geranylgeranyltransferase alpha-subunit, insert domain

(Table 1). Remarkably, the largest share of these FSFs (7 out of 17) was unique to the BE group. In comparison, none were present in either AE or AB groups. The ABE group included five universal FSFs, while one was unique to Bacteria and four were Eukarya-specific. In turn, no FSF was unique to Archaea. The BE FSFs cannot be explained by modern effects impinging on variations in proteomic accumulation in FSFs or by processes of domain rearrangement, since these appear in the protein world quite late in evolution [92]. The only and most parsimonious explanation of the patterns of FSF distribution that unfold in the ToD is the very early (and protracted) rise of the archaeal domain by processes of reductive evolution, possibly triggered by the adaptation of urancestral lineages to harsh environments and survival modes. Under extremophilic conditions typical of hyperthermophilic environments, considerable investments of matter-energy and information must be made for protein persistence [120]. This puts limits on viable protein structures [147]. Extremophilic environments will thus poise the maintenance of a limited set of FSFs for persistence of emergent diversified lineages. This would induce a primordial episode of reductive evolution in the growing FSF repertoire, explaining why hyperthermophilic and thermophilic archaeal species hold the most reduced proteomes (Figure 7). It would also explain the biases that exist in FSFs, GO terms, and RNA families (Figure 6) and the placement of hyperthermophilic and thermophilic archaeal species at the base of ToLs. Since Archaea populate the oceans and sometimes rival in number Bacteria in those environments, we further interpret the late appearance of Archaea-specific FSFs as the result of late

colonization of these mild environments by both ancient archaeons and emerging eukaryotes. This relaxes primordial extremophilic pressures on protein structures and enables the late archaeal exploration of structural flexibility and functional novelty.

6.4. Full Circle: Evidence from Trees of Proteomes and Functionomes and a Tree Derived from the Distribution of Viral Replicons in Superkingdoms. While we distrust trees of systems, especially ToLs built from sequences, the use of molecular structure at high levels of structural abstraction has the potential to mitigate some limitations of sequence analysis [93]. For example, rooted ToLs built from abundance counts of domain structures and terminal GO terms in the genomes of free-living organisms describe the evolution of proteomes (e.g., [91]). All ToL reconstructions of these kinds approximate the physiology of living organisms, dissect the three primary domains of life, and reveal the early paraphyletic origin of extremophilic archaeal lineages followed by the late appearances of monophyletic Bacteria and Eukarya. These patterns have been reliably recovered with datasets of varying sizes, irrespective of the structural classification scheme [91, 122, 126, 145]. Even a tree reconstructed from the distribution of 2662 viral replicons in superkingdoms from an exhaustive comparative genomic analysis of viral genomes showed the basal placement of Archaea and the sister taxa relationship between Bacteria and Eukarya (Figure 11).

Figure 11: One of three optimal phylogenetic tree reconstructions with identical topologies recovered from an exhaustive maximum parsimony search (58 steps; CI = 1, RI = 1, HI = 0, RC = 1; g1 = −0.707) of the abundance of viral replicon types of dsDNA, ssDNA, dsRNA, ssRNA(+), ssRNA(−), and retrotranscribing viruses. Abundance was scored on a 0–20 scale and ranged from 0 to 759 viral replicons. Vectors of abundance reconstructions in internal nodes are given as percentage of total abundance of replicons in superkingdoms. Bootstrap support values are shown below nodes.

6.5. Additional Evidence from Comparative Genomics. The uneven distribution of protein domain structures in the world of proteomes (Figure 6) is preserved as we climb in the structural hierarchy. This was recently made evident when studying the evolution of CATH domains [145]. The Venn diagrams of Figure 12 show how domain structures in all taxonomic groups decrease in numbers with increases in evolutionary conservation. At the highest CATH architectural level, there were 32 universal architectures but no domain-specific architectures. The four architectures shared by Bacteria and Eukarya were present in at least 60% of the proteomes that were surveyed. The other two interdomain architectures were topological designs of considerable complexity that were poorly shared between proteomes and were evolutionarily derived. They were the clam architectures lost in Eukarya and the box architectures of nucleotide excision repair shared by Archaea and Eukarya. The most parsimonious corollary of these distribution patterns is that the BE taxonomical group must arise by loss of structures in Archaea. Indeed, ToDs describing the evolution of CATH domain structures again confirm the early appearance of BE structures and, consequently, their loss in Archaea.
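The bookkeeping behind these Venn distributions can be sketched in a few lines of Python. The example below assigns each domain structure to a taxonomic group (A, B, E, AB, AE, BE, or ABE) from its presence in the three superkingdoms. The identifiers are hypothetical placeholders chosen only so that the group sizes reproduce the architecture-level counts given in the text (32 ABE, 4 BE, 1 AB, and 1 AE); they are not actual CATH codes, and the function name is an assumption.

```python
from collections import Counter


def venn_groups(occurrence):
    """Map each domain id to its taxonomic group label, such as 'ABE' or 'BE'.

    occurrence: dict of superkingdom letter ('A', 'B', 'E') -> set of domain ids
    """
    all_ids = set().union(*occurrence.values())
    groups = {}
    for dom in all_ids:
        # Concatenate the letters of every superkingdom that holds the domain.
        groups[dom] = "".join(k for k in "ABE" if dom in occurrence.get(k, set()))
    return groups


# Hypothetical identifiers mirroring the architecture-level counts in the text.
universal = {f"u{i}" for i in range(32)}       # shared by all three superkingdoms
bact_euk = {f"be{i}" for i in range(4)}        # shared by Bacteria and Eukarya
occurrence = {
    "A": universal | {"clam", "box"},
    "B": universal | bact_euk | {"clam"},      # clam: absent from Eukarya
    "E": universal | bact_euk | {"box"},       # box: shared by Archaea and Eukarya
}

counts = Counter(venn_groups(occurrence).values())
print(dict(counts))
```

With these placeholder sets the four occupied groups sum to the 38 architectures reported in Figure 12, and the A, B, and E groups remain empty.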

7. Paraphyletic Origins: Grades and Clades in Archaeal History

Saying that “a ToL is rooted in a domain of life” is an incorrect statement that comes from phylogenetic methodology (the use of outgroups) and the tendency to look at the past with modern eyes. Clades in ToLs have been rooted relative to each other by generating unrooted trees and by defining extant organisms as outgroup taxa. The ToL, however, must be considered rooted in the urancestor of cellular life (Figure 1). This planted edge that connects to the ingroup of extant organisms represents a cellular state in which productive diversification (in the sense of successful lineages) was absent. The primordial urancestral edge leads to a “phase transition,” the last universal cellular ancestor (LUCA), of which little is known. The physiology of the urancestor cannot be considered linked to that of any extant organism, even if it shared a common molecular core with all of them. The urancestor was not an archaeon, a bacterium, or a eukaryote [148]. It was not necessarily thermophilic. Perhaps it was a communal entity or a megametaorganism in the sense of a modern syncytium (the result of multiple cell fusions) and a modern coenocyte (the result of multiple cell divisions). The organismal boundaries were likely present, judging by the number of widely distributed protein domains that associate with membranes and appear at the base of our ToDs and by the universal existence of acidocalcisome organelles [149]. However, the molecular makeup of the urancestral cells was most likely fluid and quasistatistical, the repertoire distributed unequally in the urancestral populations of communal parts, of course within confines delimited by persistence. This urancestral population is therefore consistent with the idea of a primordial stem line proposed by Kandler and Woese [150, 151]. However, it was relatively richer in molecular structures and functions, as opposed to the simple cellular systems hypothesized by Woese. This richness is confirmed by modern analyses of proteomes and functionomes that reveal a vast number of universally shared protein domains and GOs among the three superkingdoms. While each syncytial/coenocyte element of the megaorganism exchanged component parts in search of cellular stability and persistence, the process could not be equated with modern HGT. The exchanging community of primordial cells was not cohesive enough to make the horizontal exchange meaningful. Macromolecules most likely established loose and diverse associations with each other and with smaller molecules, limited by the short average life of their unevolved structural conformations. With time, molecules with better-optimized properties engaged in more durable interactions, stabilizing the emergent cells and providing increased cellular cohesiveness. This poised the urancestral community towards a phase transition (a crystallization [148]), a point in which cellular groups had distinct properties and could be individuated. We believe this was the time of the origin of the archaeal lineage, 2.9 billion years ago [48].

At the base of the ToLs that were reconstructed from genomic data, basal archaeal taxa arise as paraphyletic lineages (Figures 4(a) and 4(c)). These lineages likely arose from subgroups of the urancestral population that pervasively lost crucial domain structures and molecular functions. This represents an evolutionary grade under the scenario described above. The emerging lineages shared with the urancestral community a unifying condition that was related to archaic biochemistry. In other words, the urancestral and emerging archaeal lineages expressed fundamental structural and functional equivalences in terms of their repertoires but revealed, in each emerging and durable paraphyletic lineage, a handful of distinct newly developed traits. These traits could be global, such as increased thermostability of some crucial members of the protein repertoire or change in the membrane makeup, or local, such as the selective loss of crucial structures and functions. Figure 13 uses the tree paradigm to portray the structural and functional equivalences of the urancestral and emerging archaeal lineages and the slow progression from grades to clades.
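The distinction between a grade and a clade can be made concrete with a small test for monophyly. The Python sketch below is illustrative only: the nested-tuple tree is a hypothetical toy topology in which three archaeal lineages branch off successively at the base, and the helper names are assumptions rather than part of any published pipeline.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def is_monophyletic(tree, taxa):
    """True if some subtree contains exactly the given taxa (a clade)."""
    taxa = set(taxa)
    if leaves(tree) == taxa:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, taxa) for child in tree)


# Hypothetical rooted toy tree: archaeal lineages A1, A2, and A3 branch off in
# succession, followed by sister clades of Bacteria (B) and Eukarya (E).
toy_tree = ("A1", ("A2", ("A3", (("B1", "B2"), ("E1", "E2")))))

print(is_monophyletic(toy_tree, {"A1", "A2", "A3"}))  # archaeal taxa
print(is_monophyletic(toy_tree, {"B1", "B2"}))        # bacterial taxa
print(is_monophyletic(toy_tree, {"E1", "E2"}))        # eukaryal taxa
```

In this toy tree no single subtree holds exactly the three archaeal taxa, so they form a basal paraphyletic grade, whereas Bacteria and Eukarya each form a clade.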

Figure 12: Venn diagrams displaying the distributions of 2221 homologous superfamilies, 1152 topologies, and 38 architectures of CATH domains in the proteomes of 492 fully sequenced genomes (from [145]). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

Figure 13: From grades to clades. The cartoon describes a possible progression of modes of organismal diversification during the rise of primordial archaeal lineages. The width of emerging lineages is proportional to uniquely identifying features of physiological and molecular complexity.

8. Through the Wormhole: The Makeup of the “Megaorganism” and the Emerging Archaeal Lineages

Character state reconstructions of proteome repertoires derived from ToLs, coupled to the timelines of ToDs, provide an effective way to define the ancestral protein domain complement of the urancestor [48] and, consequently, the likely makeup of the emerging archaeal lineages. The urancestral proteome possessed a lower bound of ∼70 FSF domain structures, 75% of which were composed of α/β and α+β proteins. About 50% of FSFs were part of metabolic enzymes, including a rich toolkit of transferases and enzymes of nucleotide metabolism. The rest of the domains were involved in functions related to information (translation, replication, and repair), intracellular processes (transport, protein modification, and proteolytic activities), regulation (kinases/phosphatases and DNA binding functions), and small molecule binding. The urancestor had a limited repertoire of aaRSs and translation factors. It contained a primordial ribosome with a limited core of universal ribosomal proteins. It had numerous membrane proteins necessary for transport, including a relatively advanced ATP synthetase complex, and structures necessary for cellular organization (filaments and primordial cytoskeletal structures). The cells lacked enzymes for deoxyribonucleotide production, so it is likely that the cellular urancestor itself did not harbor a DNA genome. The cells lacked functions related to extracellular processes (cell adhesion, immune response, and toxins/defense) and cellular motility, suggesting an ancient living world without competitive strategies of survival.

9. Conclusions

The rooting of the ToL has always been controversial in evolutionary biology [26, 152, 153]. While it is popularly accepted that the ToL based on sequence phylogenies is rooted in the akaryotes and that Archaea and Eukarya are sister groups to each other, only two of the three main steps of phylogenetic analysis [104] have been partially fulfilled with sequences. These include selecting an appropriate statistical or nonstatistical evolutionary model of character change and an optimization method for phylogenetic tree reconstruction. However, no adequate method exists for character polarization that identifies ancestral and derived character states in sequences. In the absence of robust polarization methodology, any statement about the rooting of the ToL should be considered suspect or subject of apriorism. Here we show that information derived from a genomic structural and functional census of millions of encoded proteins and RNAs, coupled with process models that comply with Weston’s generality criterion, provides the means to dissect the origins of diversified life. The generality criterion is fulfilled in these studies by focusing on the accumulation of modules, such as protein domain structures, elements of RNA substructures, or ontogenetic definitions of molecular function. In general, these features are the subject of accretion processes that comply with additive phylogenetic change within the nested taxonomic hierarchy and result in changes of abundance. These processes include those responsible for the growth of molecules (e.g., multidomain proteins), molecular ensembles (e.g., the ribosome), and molecular repertoires (e.g., proteomes). The new methods unfold a consistent evolutionary scenario in which the origin of diversified life traces back to the early history of Archaea. Remarkably, the archaic origin of this microbial urkingdom now does justice to its name.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgments

The authors would like to thank members and friends of the GCA laboratory for fruitful discussions. This research has been supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791) and the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625) to Gustavo Caetano-Anollés and from the KRIBB Research Initiative Program and the Next-Generation BioGreen 21 Program, Rural Development Administration (PJ0090192013), to Kyung Mo Kim.

References

[1] C. R. Woese, “A new biology for a new century,” Microbiology and Molecular Biology Reviews, vol. 68, no. 2, pp. 173–186, 2004.

[2] T. Grant, “Testing methods: the evaluation of discovery operations in evolutionary biology,” Cladistics, vol. 18, no. 1, pp. 94–111, 2002.

[3] E. O. Wiley, “Karl R. Popper, systematics, and classification: a reply to Walter Bock and other evolutionary taxonomists,” Systematic Zoology, vol. 24, no. 2, pp. 233–243, 1975.

[4] J. S. Farris, “Parsimony and explanatory power,” Cladistics, vol. 24, no. 5, pp. 825–847, 2008.

[5] G. Caetano-Anollés, K. M. Kim, and D. Caetano-Anollés, “The phylogenomic roots of modern biochemistry: origins of proteins, cofactors and protein biosynthesis,” Journal of Molecular Evolution, vol. 74, no. 1-2, pp. 1–34, 2012.

[6] W. Fontana, “Modeling “evo-devo” with RNA,” BioEssays, vol. 24, no. 12, pp. 1164–1177, 2002.

[7] W. Hennig, Phylogenetic Systematics, University of Illinois Press, Urbana, Ill, USA, 1999.

[8] D. Fraix-Burnet, T. Chattopadhyay, A. K. Chattopadhyay, E. Davoust, and M. Thuillard, “A six-parameter space to describe galaxy diversification,” Astronomy and Astrophysics, vol. 545, article A80, 24 pages, 2012.

[9] E. Sober, “The contest between parsimony and likelihood,” Systematic Biology, vol. 53, no. 4, pp. 644–653, 2004.

[10] E. K. Lienau and R. DeSalle, “Is the microbial tree of life verificationist?” Cladistics, vol. 26, no. 2, pp. 195–201, 2010.

[11] J. S. Huxley, “Evolutionary processes and taxonomy with special reference to grades,” Uppsala Universitets Arsskrift, vol. 6, pp. 21–39, 1958.

[12] C. R. Woese and G. E. Fox, “Phylogenetic structure of the prokaryotic domain: the primary kingdoms,” Proceedings of the National Academy of Sciences of the United States of America, vol. 74, no. 11, pp. 5088–5090, 1977.

[13] P. Forterre, “The universal tree of life and the last universal cellular ancestor: revolution and counterrevolutions,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 43–62, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[14] J. P. Gogarten, H. Kibak, P. Dittrich et al., “Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 17, pp. 6661–6665, 1989.

[15] N. Iwabe, K. Kuma, M. Hasegawa, S. Osawa, and T. Miyata, “Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 23, pp. 9355–9359, 1989.

[16] J. P. Gogarten and L. Olendzenski, “Orthologs, paralogs and genome comparisons,” Current Opinion in Genetics and Development, vol. 9, no. 6, pp. 630–636, 1999.

[17] C. R. Woese, O. Kandler, and M. L. Wheelis, “Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya,” Proceedings of the National Academy of Sciences of the United States of America, vol. 87, no. 12, pp. 4576–4579, 1990.

[18] J. Martijn and T. J. Ettema, “From archaeon to eukaryote: the evolutionary dark ages of the eukaryotic cell,” Biochemical Society Transactions, vol. 41, no. 1, pp. 451–457, 2013.

[19] C. J. Cox, P. G. Foster, R. P. Hirt, S. R. Harris, and T. M. Embley, “The archaebacterial origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 51, pp. 20356–20361, 2008.

[20] P. G. Foster, C. J. Cox, and T. M. Embley, “The primary divisions of life: a phylogenomic approach employing composition-heterogeneous methods,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2197–2207, 2009.

[21] T. A. Williams, P. G. Foster, T. M. Nye, C. J. Cox, and T. M. Embley, “A congruent phylogenomic signal places eukaryotes within the Archaea,” Proceedings of the Royal Society B: Biological Sciences, vol. 279, no. 1749, pp. 4870–4879, 2012.

[22] S. Gribaldo, A. M. Poole, V. Daubin, P. Forterre, and C. Brochier-Armanet, “The origin of eukaryotes and their relationship with the Archaea: are we at a phylogenomic impasse?” Nature Reviews Microbiology, vol. 8, no. 10, pp. 743–752, 2010.

[23] H. Philippe and P. Forterre, “The rooting of the universal tree of life is not reliable,” Journal of Molecular Evolution, vol. 49, no. 4, pp. 509–523, 1999.

[24] E. Sober and M. Steel, “Testing the hypothesis of common ancestry,” Journal of Theoretical Biology, vol. 218, no. 4, pp. 395–408, 2002.

[25] P. Forterre, “The common ancestor of Archaea and Eukarya was not an archaeon,” Archaea, vol. 2013, Article ID 372396, 18 pages, 2013.

[26] P. Forterre and H. Philippe, “Where is the root of the universal tree of life?” BioEssays, vol. 21, no. 10, pp. 871–879, 1999.

[27] S. Gribaldo and H. Philippe, “Ancient phylogenetic relationships,” Theoretical Population Biology, vol. 61, no. 4, pp. 391–408, 2002.

[28] J. K. Harris, S. T. Kelley, G. B. Spiegelman, and N. R. Pace, “The genetic core of the universal ancestor,” Genome Research, vol. 13, no. 3, pp. 407–412, 2003.

[29] A. Harish and G. Caetano-Anollés, “Ribosomal history reveals origins of modern protein synthesis,” PLoS ONE, vol. 7, no. 3, Article ID e32776, 2012.

[30] G. Caetano-Anollés and M. J. Seufferheld, “The coevolutionary roots of biochemistry and cellular organization challenge the RNA world paradigm,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 152–177, 2013.

[31] N. R. Pace, “Mapping the tree of life: progress and prospects,” Microbiology and Molecular Biology Reviews, vol. 73, no. 4, pp. 565–576, 2009.

[32] P. de Rijk, Y. van de Peer, I. van den Broeck, and R. de Wachter, “Evolution according to large ribosomal subunit RNA,” Journal of Molecular Evolution, vol. 41, no. 3, pp. 366–375, 1995.

[33] G. Caetano-Anollés, “Tracing the evolution of RNA structure in ribosomes,” Nucleic Acids Research, vol. 30, no. 11, pp. 2575–2587, 2002.


[34] J. Mallatt and C. J. Winchell, “Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes,” Molecular Biology and Evolution, vol. 19, no. 3, pp. 289–301, 2002.

[35] M. Gerstein, “Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census,” Proteins: Structure, Function and Bioinformatics, vol. 33, no. 4, pp. 518–534, 1998.

[36] M. Gerstein and H. Hegyi, “Comparing genomes in terms of protein structure: surveys of a finite parts list,” FEMS Microbiology Reviews, vol. 22, no. 4, pp. 277–304, 1998.

[37] B. Snel, P. Bork, and M. A. Huynen, “Genome phylogeny based on gene content,” Nature Genetics, vol. 21, no. 1, pp. 108–110, 1999.

[38] F. Tekaia, A. Lazcano, and B. Dujon, “The genomic tree as revealed from whole proteome comparisons,” Genome Research, vol. 9, no. 6, pp. 550–557, 1999.

[39] Y. I. Wolf, S. E. Brenner, P. A. Bash, and E. V. Koonin, “Distribution of protein folds in the three superkingdoms of life,” Genome Research, vol. 9, no. 1, pp. 17–26, 1999.

[40] J. O. Korbel, B. Snel, M. A. Huynen, and P. Bork, “SHOT: a web server for the construction of genome phylogenies,” Trends in Genetics, vol. 18, no. 3, pp. 158–162, 2002.

[41] G. Caetano-Anollés and D. Caetano-Anollés, “An evolutionarily structural universe of protein architecture,” Genome Research, vol. 13, no. 7, pp. 1563–1571, 2003.

[42] S. L. Baldauf, A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle, “A kingdom-level phylogeny of eukaryotes based on combined protein data,” Science, vol. 290, no. 5493, pp. 972–977, 2000.

[43] J. R. Brown, C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope, “Universal trees based on large combined protein sequence data sets,” Nature Genetics, vol. 28, no. 3, pp. 281–285, 2001.

[44] F. D. Ciccarelli, T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P. Bork, “Toward automatic reconstruction of a highly resolved tree of life,” Science, vol. 311, no. 5765, pp. 1283–1287, 2006.

[45] T. Dagan, M. Roettger, D. Bryant, and W. Martin, “Genome networks root the tree of life between prokaryotic domains,” Genome Biology and Evolution, vol. 2, no. 1, pp. 379–392, 2010.

[46] S. Jun, G. E. Sims, G. A. Wu, and S. Kim, “Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 133–138, 2010.

[47] E. A. Schultes, P. T. Hraber, and T. H. LaBean, “No molecule is an island: molecular evolution and the study of sequence space,” in Algorithmic Bioprocesses, A. Condon, D. Harel, J. N. Kok, A. Salomaa, and E. Winfree, Eds., pp. 675–704, Springer, Berlin, Germany, 2009.

[48] K. M. Kim and G. Caetano-Anollés, “The proteomic complexity and rise of the primordial ancestor of diversified life,” BMC Evolutionary Biology, vol. 11, no. 1, article 140, 2011.

[49] V. A. Albert, Parsimony, Phylogeny, and Genomics, Oxford University Press, Oxford, UK, 2005.

[50] G. Caetano-Anollés, “Evolved RNA secondary structure and the rooting of the universal tree of life,” Journal of Molecular Evolution, vol. 54, no. 3, pp. 333–345, 2002.

[51] J. A. Lake, R. G. Skophammer, C. W. Herbold, and J. A. Servin, “Genome beginnings: rooting the tree of life,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2177–2185, 2009.

[52] B. Guo, M. Zou, and A. Wagner, “Pervasive indels and their evolutionary dynamics after the fish-specific genome duplication,” Molecular Biology and Evolution, vol. 29, no. 10, pp. 3005–3022, 2012.

[53] M. K. Basu, I. B. Rogozin, O. Deusch, T. Dagan, W. Martin, and E. V. Koonin, “Evolutionary dynamics of introns in plastid-derived genes in plants: saturation nearly reached but slow intron gain continues,” Molecular Biology and Evolution, vol. 25, no. 1, pp. 111–119, 2008.

[54] J. S. Farris, “Phylogenetic analysis under Dollo’s Law,” Systematic Biology, vol. 26, no. 1, pp. 77–88, 1977.

[55] H. Fang, M. E. Oates, R. B. Pethica et al., “A daily-updated tree of (sequenced) life as a reference for genome research,” Scientific Reports, vol. 3, article 2015, 2013.

[56] T. Cavalier-Smith, “The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification,” International Journal of Systematic and Evolutionary Microbiology, vol. 52, no. 1, pp. 7–76, 2002.

[57] T. Cavalier-Smith, “Rooting the tree of life by transition analyses,” Biology Direct, vol. 1, article 19, 2006.

[58] H. S. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Widespread recruitment of ancient domain structures in modern enzymes during metabolic evolution,” Journal of Integrative Bioinformatics, vol. 10, no. 1, article 214, 2013.

[59] G. Caetano-Anollés, M. Wang, and D. Caetano-Anollés, “Structural phylogenomics retrodicts the origin of the genetic code and uncovers the evolutionary impact of protein flexibility,” PLoS ONE, vol. 8, no. 8, Article ID e72225, 2013.

[60] E. Bapteste and C. Brochier, “On the conceptual difficulties in rooting the tree of life,” Trends in Microbiology, vol. 12, no. 1, pp. 9–13, 2004.

[61] A. M. Poole, “Horizontal gene transfer and the earliest stages of the evolution of life,” Research in Microbiology, vol. 160, no. 7, pp. 473–480, 2009.

[62] D. Raoult, “The post-Darwinist rhizome of life,” The Lancet, vol. 375, no. 9709, pp. 104–105, 2010.

[63] W. F. Doolittle, “Phylogenetic classification and the universal tree,” Science, vol. 284, no. 5423, pp. 2124–2128, 1999.

[64] T. Dagan and W. Martin, “The tree of one percent,” Genome Biology, vol. 7, no. 10, article 118, 2006.

[65] W. F. Doolittle and E. Bapteste, “Pattern pluralism and the tree of life hypothesis,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 7, pp. 2043–2049, 2007.

[66] E. V. Koonin, “Towards a postmodern synthesis of evolutionary biology,” Cell Cycle, vol. 8, no. 6, pp. 799–800, 2009.

[67] T. Kloesges, O. Popa, W. Martin, and T. Dagan, “Networks of gene sharing among 329 proteobacterial genomes reveal differences in lateral gene transfer frequency at different phylogenetic depths,” Molecular Biology and Evolution, vol. 28, no. 2, pp. 1057–1074, 2011.

[68] S. Halary, J. W. Leigh, B. Cheaib, P. Lopez, and E. Bapteste, “Network analyses structure genetic diversity in independent genetic worlds,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 127–132, 2010.

[69] S. S. Abby, E. Tannier, M. Gouy, and V. Daubin, “Lateral gene transfer as a support for the tree of life,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 13, pp. 4962–4967, 2012.

[70] N. Glansdorff, Y. Xu, and B. Labedan, “The conflict between horizontal gene transfer and the safeguard of identity: origin of meiotic sexuality,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 470–480, 2009.

[71] C. G. Kurland and O. G. Berg, “A hitchhiker’s guide to evolving networks,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 361–396, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[72] L. P. Villarreal and G. Witzany, “The DNA habitat and its RNA inhabitants: at the dawn of RNA sociology,” Genomics Insights, vol. 6, pp. 1–12, 2013.

[73] S. Gribaldo and C. Brochier, “Phylogeny of prokaryotes: does it exist and why should we care?” Research in Microbiology, vol. 160, no. 7, pp. 513–521, 2009.

[74] B. D. Mishler, “The logic of the data matrix in phylogenetic analysis,” in Parsimony, Phylogeny, and Genomics, V. A. Albert, Ed., pp. 57–69, Oxford University Press, New York, NY, USA, 2005.

[75] A. Dress, V. Moulton, M. Steel, and T. Wu, “Species, clusters and the “tree of life”: a graph-theoretic perspective,” Journal of Theoretical Biology, vol. 265, no. 4, pp. 535–542, 2010.

[76] N. Lane and W. Martin, “The energetics of genome complexity,” Nature, vol. 467, no. 7318, pp. 928–934, 2010.

[77] M. C. Rivera and J. A. Lake, “The ring of life provides evidence for a genome fusion origin of eukaryotes,” Nature, vol. 431, no. 7005, pp. 152–155, 2004.

[78] D. Alvarez-Ponce and J. O. McInerney, “The human genome retains relics of its prokaryotic ancestry: human genes of archaebacterial and eubacterial origin exhibit remarkable differences,” Genome Biology and Evolution, vol. 3, no. 1, pp. 782–790, 2011.

[79] A. Poole and D. Penny, “Eukaryote evolution: engulfed by speculation,” Nature, vol. 447, no. 7147, p. 913, 2007.

[80] E. V. Koonin and Y. I. Wolf, “The fundamental units, processes and patterns of evolution, and the tree of life conundrum,” Biology Direct, vol. 4, no. 1, article 33, 2009.

[81] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Search for a “tree of life” in the thicket of the phylogenetic forest,” Journal of Biology, vol. 8, no. 6, article 59, 2009.

[82] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Seeing the tree of life behind the phylogenetic forest,” BMC Biology, vol. 11, article 46, 2013.

[83] D. B. Wetlaufer, “Nucleation, rapid folding, and globular intrachain regions in proteins,” Proceedings of the National Academy of Sciences of the United States of America, vol. 70, no. 3, pp. 697–701, 1973.

[84] J. S. Richardson, “The anatomy and taxonomy of protein structure,” Advances in Protein Chemistry, vol. 34, pp. 167–339, 1981.

[85] M. Riley and B. Labedan, “Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module,” Journal of Molecular Biology, vol. 268, no. 5, pp. 857–868, 1997.

[86] J. Janin and S. J. Wodak, “Structural domains in proteins and their role in the dynamics of protein function,” Progress in Biophysics and Molecular Biology, vol. 42, pp. 21–78, 1983.

[87] G. Caetano-Anollés, M. Wang, D. Caetano-Anollés, and J. E. Mittenthal, “The origin, evolution and structure of the protein world,” Biochemical Journal, vol. 417, no. 3, pp. 621–637, 2009.

[88] J. Gough, “Convergent evolution of domain architectures (is rare),” Bioinformatics, vol. 21, no. 8, pp. 1464–1471, 2005.

[89] K. Forslund, A. Henricson, V. Hollich, and E. L. L. Sonnhammer, “Domain tree-based analysis of protein architecture evolution,” Molecular Biology and Evolution, vol. 25, no. 2, pp. 254–264, 2008.

[90] S. Yang and P. E. Bourne, “The evolutionary history of protein domains viewed by species phylogeny,” PLoS ONE, vol. 4, no. 12, Article ID e8378, 2009.

[91] K. M. Kim and G. Caetano-Anollés, “The evolutionary history of protein fold families and proteomes confirms that the archaeal ancestor is more ancient than the ancestors of other superkingdoms,” BMC Evolutionary Biology, vol. 12, article 13, 2012.

[92] M. Wang and G. Caetano-Anollés, “The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world,” Structure, vol. 17, no. 1, pp. 66–78, 2009.

[93] G. Caetano-Anollés and A. Nasir, “Benefits of using molecular structure and abundance in phylogenomic analysis,” Frontiers in Genetics, vol. 3, article 172, 2012.

[94] K. Brading and E. Castellani, Symmetries in Physics: Philosophical Reflections, Cambridge University Press, Cambridge, UK, 2003.

[95] J. A. Cotton and M. Wilkinson, “Quantifying the potential utility of phylogenetic characters,” Taxon, vol. 57, no. 1, pp. 131–136, 2008.

[96] E. Mossel and M. Steel, “A phase transition for a random cluster model on phylogenetic trees,” Mathematical Biosciences, vol. 187, no. 2, pp. 189–203, 2004.

[97] M. H. Bailor, X. Sun, and H. M. Al-Hashimi, “Topology links RNA secondary structure with global conformation, dynamics, and adaptation,” Science, vol. 327, no. 5962, pp. 202–206, 2010.

[98] K. Illergård, D. H. Ardell, and A. Elofsson, “Structure is three to ten times more conserved than sequence—a study of structural response in protein cores,” Proteins: Structure, Function and Bioinformatics, vol. 77, no. 3, pp. 499–508, 2009.

[99] G. Caetano-Anollés, S. K. Hee, and J. E. Mittenthal, “The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 22, pp. 9358–9363, 2007.

[100] P. H. Weston, “Methods for rooting cladistic trees,” in Models in Phylogeny Reconstruction, D. J. Siebert, R. W. Scotland, and D. M. Williams, Eds., pp. 125–155, Oxford University Press, Oxford, UK, 1994.

[101] H. N. Bryant, “The polarization of character transformations in phylogenetic systematics: role of axiomatic and auxiliary assumptions,” Systematic Biology, vol. 40, no. 4, pp. 433–445, 1991.

[102] H. N. Bryant and G. Wagner, “Character polarity and the rooting of cladograms,” in The Character Concept in Evolutionary Biology, G. P. Wagner, Ed., pp. 319–338, Academic Press, New York, NY, USA, 2001.

[103] G. Nelson, “Ontogeny, phylogeny, paleontology, and the biogenetic law,” Systematic Biology, vol. 27, no. 3, pp. 324–345, 1978.

[104] P. H. Weston, “Indirect and direct methods in systematics,” in Ontogeny and Systematics, C. J. Humphries, Ed., pp. 27–56, Columbia University Press, New York, NY, USA, 1988.

[105] F.-J. Sun and G. Caetano-Anollés, “The evolutionary history of the structure of 5S ribosomal RNA,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 430–443, 2009.

[106] X. Fang, T. Pan, and T. R. Sosnick, “Mg2+-dependent folding of a large ribozyme without kinetic traps,” Nature Structural Biology, vol. 6, no. 12, pp. 1091–1095, 1999.

[107] F.-J. Sun and G. Caetano-Anollés, “Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses,” PLoS Computational Biology, vol. 4, no. 3, Article ID e1000018, 2008.

[108] F.-J. Sun and G. Caetano-Anollés, “The origin and evolution of tRNA inferred from phylogenetic analysis of structure,” Journal of Molecular Evolution, vol. 66, no. 1, pp. 21–35, 2008.

[109] W. R. Ashby, An Introduction to Cybernetics, Taylor & Francis, London, UK, 1955.

[110] J. G. Lundberg, “Wagner networks and ancestors,” Systematic Biology, vol. 21, no. 4, pp. 398–413, 1972.

[111] P. T. Chippindale and J. J. Wiens, “Weighting, partitioning, and combining characters in phylogenetic analysis,” Systematic Biology, vol. 43, no. 2, pp. 278–287, 1994.

[112] J. Mittenthal, D. Caetano-Anollés, and G. Caetano-Anollés, “Biphasic patterns of diversification and the emergence of modules,” Frontiers in Genetics, vol. 3, article 147, 2012.

[113] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[114] C. A. Orengo, A. Michie, S. Jones, D. T. Jones, M. Swindells, and J. M. Thornton, “CATH—a hierarchic classification of protein domain structures,” Structure, vol. 5, no. 8, pp. 1093–1109, 1997.

[115] G. Sawa, J. Dicks, and I. N. Roberts, “Current approaches to whole genome phylogenetic analysis,” Briefings in Bioinformatics, vol. 4, no. 1, pp. 63–74, 2003.

[116] R. Sathyapriya, J. M. Duarte, H. Stehr, I. Filippis, and M. Lappe, “Defining an essence of structure determining residue contacts in proteins,” PLoS Computational Biology, vol. 5, no. 12, Article ID e1000584, 2009.

[117] D. Penny and L. J. Collins, “Evolutionary genomics leads the way,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 1–16, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[118] M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.

[119] A. Andreeva, D. Howorth, J. Chandonia et al., “Data growth and its impact on the SCOP database: new developments,” Nucleic Acids Research, vol. 36, pp. D419–D425, 2008.

[120] L S Yafremava M Wielgos S Thomas et al ldquoA generalframework of persistence strategies for biological systems helpsexplain domains of liferdquo Frontiers in Genetics vol 4 article 162013

[121] J L Friar T Goldman and J Perez-Mercader ldquoGenome sizesand the Benford distributionrdquo PLoS ONE vol 7 no 5 ArticleID e36624 2012

[122] A Nasir K M Kim and G Caetano-Anolles ldquoGiant virusescoexisted with the cellular ancestors and represent a distinctsupergroup along with superkingdoms Archaea Bacteria andEukaryardquo BMC Evolutionary Biology vol 12 article 156 2012

[123] A Nasir and G Caetano-Anolles ldquoComparative analysis ofproteomes and functionomes provides insights into origins ofcellular diversificationrdquo Archaea vol 2013 Article ID 64874613 pages 2013

[124] WMartin andMMuller ldquoThe hydrogen hypothesis for the firsteukaryoterdquo Nature vol 392 no 6671 pp 37ndash41 1998

[125] A Nasir K M Kim and G Caetano-Anolles ldquoGlobal patternsof protein domain gain and loss in superkingdomsrdquo PLoSComputational Biology vol 10 no 1 Article ID e1003452 2014

[126] MWang L S Yafremava D Caetano-Anolles J E Mittenthaland G Caetano-Anolles ldquoReductive evolution of architectural

repertoires in proteomes and the birth of the tripartite worldrdquoGenome Research vol 17 no 11 pp 1572ndash1585 2007

[127] L. Guy and T. J. G. Ettema, “The archaeal ‘TACK’ superphylum and the origin of eukaryotes,” Trends in Microbiology, vol. 19, no. 12, pp. 580–587, 2011.

[128] D. Schlieper, M. A. Oliva, J. M. Andreu, and J. Lowe, “Structure of bacterial tubulin BtubA/B: evidence for horizontal gene transfer,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 26, pp. 9170–9175, 2005.

[129] M. P. Hoeppner, P. P. Gardner, and A. M. Poole, “Comparative analysis of RNA families reveals distinct repertoires for each domain of life,” PLoS Computational Biology, vol. 8, no. 11, Article ID e1002752, 2012.

[130] H. Xue, K. Tong, C. Marck, H. Grosjean, and J. T. Wong, “Transfer RNA paralogs: evidence for genetic code-amino acid biosynthesis coevolution and an archaeal root of life,” Gene, vol. 310, no. 1-2, pp. 59–66, 2003.

[131] H. Xue, S. Ng, K. Tong, and J. T. Wong, “Congruence of evidence for a Methanopyrus-proximal root of life based on transfer RNA and aminoacyl-tRNA synthetase genes,” Gene, vol. 360, no. 2, pp. 120–130, 2005.

[132] M. Di Giulio, “Nanoarchaeum equitans is a living fossil,” Journal of Theoretical Biology, vol. 242, no. 1, pp. 257–260, 2006.

[133] M. Di Giulio, “The tree of life might be rooted in the branch leading to Nanoarchaeota,” Gene, vol. 401, no. 1-2, pp. 108–113, 2007.

[134] J. T. Wong, J. Chen, W. Mat, S. Ng, and H. Xue, “Polyphasic evidence delineating the root of life and roots of biological domains,” Gene, vol. 403, no. 1-2, pp. 39–52, 2007.

[135] F.-J. Sun and G. Caetano-Anollés, “The ancient history of the structure of ribonuclease P and the early origins of Archaea,” BMC Bioinformatics, vol. 11, article 153, 2010.

[136] C. Marck and H. Grosjean, “tRNomics: analysis of tRNA genes from 50 genomes of Eukarya, Archaea, and bacteria reveals anticodon-sparing strategies and domain-specific features,” RNA, vol. 8, no. 10, pp. 1189–1232, 2002.

[137] W. Mat, H. Xue, and J. T. Wong, “The genomics of LUCA,” Frontiers in Bioscience, vol. 13, no. 14, pp. 5605–5613, 2008.

[138] C. Brochier, S. Gribaldo, Y. Zivanovic, F. Confalonieri, and P. Forterre, “Nanoarchaea: representatives of a novel archaeal phylum or a fast-evolving euryarchaeal lineage related to Thermococcales?” Genome Biology, vol. 6, no. 5, article R42, 2005.

[139] B. Billoud, M. Guerrucci, M. Masselot, and J. S. Deutsch, “Cirripede phylogeny using a novel approach: molecular morphometrics,” Molecular Biology and Evolution, vol. 17, no. 10, pp. 1435–1445, 2000.

[140] L. J. Collins, V. Moulton, and D. Penny, “Use of RNA secondary structure for studying the evolution of RNase P and RNase MRP,” Journal of Molecular Evolution, vol. 51, no. 3, pp. 194–204, 2000.

[141] G. Caetano-Anollés, “Novel strategies to study the role of mutation and nucleic acid structure in evolution,” Plant Cell, Tissue and Organ Culture, vol. 67, no. 2, pp. 115–132, 2001.

[142] F.-J. Sun, S. Fleurdepine, C. Bousquet-Antonelli, G. Caetano-Anollés, and J. Deragon, “Common evolutionary trends for SINE RNA structures,” Trends in Genetics, vol. 23, no. 1, pp. 26–33, 2007.

[143] D. Caetano-Anollés, K. M. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Proteome evolution and the metabolic origins of translation and cellular life,” Journal of Molecular Evolution, vol. 72, no. 1, pp. 14–33, 2011.

[144] M. Wang, Y. Jiang, K. M. Kim et al., “A universal molecular clock of protein folds and its power in tracing the early history of aerobic metabolism and planet oxygenation,” Molecular Biology and Evolution, vol. 28, no. 1, pp. 567–582, 2011.

[145] S. A. Bukhari and G. Caetano-Anollés, “Origin and evolution of protein fold designs inferred from phylogenomic analysis of CATH domain structures in proteomes,” PLoS Computational Biology, vol. 9, no. 3, Article ID e1003009, 2013.

[146] M. Vesteg and J. Krajcovic, “The falsifiability of the models for the origin of eukaryotes,” Current Genetics, vol. 57, no. 6, pp. 367–390, 2011.

[147] I. N. Berezovsky and E. I. Shakhnovich, “Physics and evolution of thermophilic adaptation,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 36, pp. 12742–12747, 2005.

[148] C. Woese, “The universal ancestor,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 12, pp. 6854–6859, 1998.

[149] M. J. Seufferheld and G. Caetano-Anollés, “Phylogenomics supports a cellularly structured urancestor,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 178–191, 2013.

[150] O. Kandler, “Cell wall biochemistry and three-domain concept of life,” Systematic and Applied Microbiology, vol. 16, no. 4, pp. 501–509, 1994.

[151] C. R. Woese, “On the evolution of cells,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 13, pp. 8742–8747, 2002.

[152] S. L. Baldauf, J. D. Palmer, and W. F. Doolittle, “The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny,” Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 15, pp. 7749–7754, 1996.

[153] H. Leffers, J. Kjems, L. Østergaard, N. Larsen, and R. A. Garrett, “Evolutionary relationships amongst archaebacteria. A comparative study of 23 S ribosomal RNAs of a sulphur-dependent extreme thermophile, an extreme halophile and a thermophilic methanogen,” Journal of Molecular Biology, vol. 195, no. 1, pp. 43–61, 1987.

[154] J. Martin, D. Blackburn, and E. Wiley, “Are node-based and stem-based clades equivalent? Insights from graph theory,” PLoS Currents, vol. 2, Article ID RRN1196, 2010.


to the whole. However, sublineages in one hierarchical level may hold different histories compared to the histories of higher and lower levels. Historical mismatches introduce, for example, lineage sorting and reticulation problems that complicate phylogenetic analysis. For instance, mismatches between gene and organismal phylogenies arise when genomic components are sorted differentially because of species extinctions or genomic rearrangements. These mismatches violate the fundamental cladistic assumption that history follows a branching process but can be explained by homoplasy (convergent evolution), the horizontal “trace” of the homology-homoplasy yin-yang of phylogenetic signal. The problem of homoplasy cannot be solved by conceptually preferring one particular level of the hierarchy [74]. A focus on the organismal level, for example, will not solve the problems of hybridization that are brought by strict or relaxed sexual reproduction or viral-mediated genetic exchanges, or the problems of ancient events of fusions or endosymbiosis. However, homoplasy and the optimization of characters and trees of cladistic analysis provide a rigorous framework to discover the magnitude and source of nonvertical processes in evolution. Thus, cladistics or the ToL are not invalidated by network-like signals. In fact, recent discrete mathematical formulations that test the fundamental axioms of tree construction have also proven that the bifurcating history of trees is preserved despite evolutionary reticulations [75]. While there is no “confusion of doubts” at this level, the babel of confusion has not stopped.

Genomics revealed evolutionary patchiness, which incited hypotheses of chimeras and fusions. Phylogenies of genes were found to be highly discordant. Organisms that were being sequenced shared very few gene sequences and, more troublingly, the gene trees that were generated possessed different topologies, especially within akaryotic organisms. With time, the number of nearly universal genes decreased and the patchiness and discordance increased. The comparative genomic patchwork of sequence homologies showed that there were groups of genes that were only shared by certain domains of life. Thus, the makeup of genomes appeared chimeric. A number of hypotheses of chimeric origins of eukaryotes were proposed based on the fact that eukaryotes shared genes expressing sister group phylogenetic relationships with both Bacteria and Archaea [64, 76]. One that is notable is the rise of eukaryotes from a “ring of life” fusion of an archaeon and a bacterium (e.g., [77]), which in a single blow defeated both the tree-like and the tripartite nature of life. Under this school, homology searches with BLAST against a database of ~3.8 million akaryotic sequences made it possible to assign archaeal, bacterial, or ambiguous ancestries to genes in the human genome and to explain homology patterns as relics of the akaryotic ancestors of humans [78]. Remarkably, archaeal genes tended to be involved in informational processes, encoded shorter and more central proteins, and were less likely to be involved in heritable human diseases. The chimeric origin of eukaryotes by fusions has been rightfully contested; there is no proper evidence supporting its existence [79]. Unfortunately, chimerism opened a flood of speculations about reticulation. The history of life was seen by many through the lens of a “forest” of gene histories [80]. Reticulations and ultimately HGT were forced to explain chimeric patterns. Fusion hypotheses diverted attention from the central issue of the rooting of the ToL and prompted the separate analysis of eukaryotic origins and akaryotic evolution.

Dagan and Martin [64] charged that the ToL was a “tree of one percent” because only a small fraction of sequences could be considered universal and could be mined for deep phylogenetic signal. The rest would account for lateral processes that confounded vertical descent. The claim came fundamentally from networks constructed using BLAST heuristic searches for short and strong matches in genomic sequences. Phylogenetic “forests” of akaryotic genes later on boosted the claim [81]. These networks were built from 6901 trees of genes using maximum likelihood methods. Only 102 of these trees were derived from nearly universal genes and contributed very little vertical phylogenetic signal. The initial conclusion was striking: “the original tree of life concept is obsolete: it would not even be a tree of one percent” [81]. However, the fact that vertical signal was present in the forests merited reevaluation: “replacement of the ToL with a network graph would change our entire perception of the process of evolution and invalidate all evolutionary reconstruction” [82]. We note that in these studies an ultrametric tree of akaryotes was recovered from vertical signal in a supertree of nearly universal genes. The rooted tree, which was used to simulate clock-like behavior, revealed the early divergence of Nanoarchaeum equitans and then Archaea [81]. While inconsistency of the forest supertree increased at high phylogenetic depths, its associated supernetwork showed there were no reticulations between Archaea and Bacteria. Thus, the deep rooting signal of archaeal diversification may be bona fide and worthy of further study.

While those that value tree thinking have contested on many grounds the idea that the ToL is “obsolete” (e.g., [13]), our take in the debate is simple. Homologies between gene sequences established through the “emperor's BLAST” (sensu [71]) are poor substitutes for phylogenetic tree reconstructions from gene sequences. By the same token, phylogenies reconstructed from sequences are poor substitutes for phylogenies that consider the molecular structure and function of the encoded products. Gene sequences are not only prone to mutational saturation, but they also generally come in pieces. These pieces represent evolutionary and structural modules that host the functions of the encoded molecules. Protein domains, for example, are three-dimensional (3D) arrangements of elements of secondary structure that fold autonomously [83], are compact [84], and are evolutionarily conserved [85, 86]. The landscape of evolutionary exploration changes when the role of protein domains in function and evolution is considered [87]. Changes at sequence level, including substitutions, insertions, and deletions, can have little impact on structure; conversely, sequence changes in crucial sites can have devastating consequences on function and fitness. However, there is no detailed analysis of historical mismatches between gene sequences and domain structures at the ToL level. Remarkably, while HGT seems rampant at sequence level, its impact at the domain structural level is limited [88]. This makes trees derived from domains effectively “trees of 99 percent” and their use very powerful. The fact that domains diversify mostly by vertical descent (e.g., [89–91]) suggests that gene reticulations simply reflect the pervasive and impactful combinatorial effect of domain rearrangements in proteins [92] and perhaps little else. This important claim must be carefully evaluated. It offers the possibility of dissecting how levels of organization impact processes of inheritance in biology.

5. Out of the Impasse: Parts and Wholes in the Evolving Structure of Systems

The rooting of the ToL is clearly muddled by the highly dynamic nature of change in protein and nucleic acid sequences and by the patchiness and reticulation complexities that exist at gene level. However, it is also possible that the problems of the ToL are ultimately technical. We have made the case that the use of molecular sequences is problematic on many grounds, including mutational saturation, definition of homology of sites in sequence alignments, inapplicable characters, taxon sampling and tree imbalance, and different historical signatures in domains of multidomain proteins [93]. In particular, violation of character independence by the mere existence of atomic structure represents a very serious problem that plagues phylogenetic analysis of sequences. We here present a solution to the impasse. We show that the ToL can be rooted with different approaches that focus on structure and function and that its root is congruently placed in Archaea.

Epistemologically, phylogenetic characters must comply with symmetry breaking and the irreversibility of time [94]. In other words, characters must establish transformational homology relationships and serve as independent evidential statements.

(i) Characters Must Be Homologous and Heritable across Tree Terminal Units (Taxa). Character homology is a central and controversial concept that embodies the existence of historical continuity of information [74]. Characters are “basic” evidential statements that make up columns in data matrices used for tree reconstruction. They are conjectures of perceived similarities that are accepted as fact for the duration of the study, are strengthened by Hennigian reciprocal illumination, and can be put to the test through congruence with other characters as these are fit to the trees. To be useful, characters must be heritable and informative across the taxa rows of the data matrix. This can be evaluated, for example, with the cladistic information content (CIC) measure [95]. Finding informative characters can be particularly challenging when the features that are studied change at fast pace and when taxa sample a wide and consequently deep phylogenetic spectrum. When building a ToL, the highly dynamic nature of change in the sequence makeup of protein or nucleic acid molecules challenges the ability to retrieve reliable phylogenetic signatures across taxa, even if molecules are universally distributed and harbor evolutionarily conserved regions with deep phylogenetic signal (e.g., rRNA). The reason is that, given enough time, functionally or structurally constrained regions of the sequence will be fixed (will be structurally canalized [6]) and will offer little if any phylogenetically meaningful signal to uncover (for example, the universal rRNA core). In turn, mutational saturation of unconstrained regions will quickly erase history.

The mutational saturation problem was made mathematically explicit by Sober and Steel [24] using “mutual information” and the concept of time as an information-destroying process. Mutual information $I(X; Y)$ between two random variables $X$ and $Y$ is defined by

$$I(X; Y) = \sum_{x, y} P(X = x, Y = y) \log\left(\frac{P(X = x, Y = y)}{P(X = x)\,P(Y = y)}\right) \tag{1}$$

When $I(X; Y)$ approaches 0, $X$ and $Y$ become independent and no method can predict $X$ from knowledge of $Y$. Importantly, mutual information approaches 0 as the time between $X$ and $Y$ increases in a Markov chain. Regardless of the use of a parsimony, maximum likelihood, or Bayesian-based framework of analysis, $I(X; Y)$ will be particularly small when sequence sites are saturated by too many substitutions due to high substitution rates or large time scales. Ancestral states at interior nodes of the trees cannot be established with confidence from extant information, even in the most optimal situation of knowing the underlying phylogeny and the model of character evolution. Under simple models, the problem is not mitigated by the fact that the number of terminal leaves of trees and the sources of initial phylogenetic information increase with time. Since a phase transition occurs when substitution probabilities exceed a critical value [96], one way out of the impasse is to find features in sufficient number that change at much slower pace than sequence sites and test if mutual information is significant and overcomes Fano's inequality. These features exist in molecular biology and have been used for phylogenetic reconstruction. They are, for example, the 3D fold structures of protein molecules [87] or the stem modules of RNA structures [97], features that change at a very slow rate when compared to associated sequences. For example, protein 3D structural cores evolve linearly with amino acid substitutions per site and change at 3–10 times slower rates than sequences [98]. This high conservation highlights the evolutionary dynamics of molecular structure. Remarkably, rates of change of proteins performing the same function are maintained by functional constraints but accelerate when proteins perform different functions or contain indels. In turn, fold structural diversity explodes into modular structures at low sequence identities, probably triggered by functional diversification. Within the context of structural conservation, the fact that fold structures are structural and evolutionary modules that accumulate in proteomes by gene duplication and rearrangements and spread in biological networks by recruitment (e.g., [99]) also provides a solution to the problems of vanishing phylogenetic signal. Since fold accumulation increases with time in the Markov chain, mutual information must increase, reversing the “data processing inequality” that destroys information and enabling the recovery of deep evolutionary information.
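The decay of mutual information with time can be illustrated with a minimal sketch. It assumes the simplest possible setting, which is not taken from Sober and Steel [24]: a single site with two states, a symmetric continuous-time Markov chain, and a uniform ancestral distribution. The function names and the `rate` parameter are hypothetical, and logarithms are taken in base 2 so that information is reported in bits.

```python
import math


def mutual_information(joint):
    """Mutual information (in bits) of a joint distribution given as a 2D list."""
    px = [sum(row) for row in joint]            # marginal of X (rows)
    py = [sum(col) for col in zip(*joint)]      # marginal of Y (columns)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:                           # 0 * log(0) is taken as 0
                mi += p * math.log2(p / (px[i] * py[j]))
    return mi


def two_state_joint(rate, t):
    """Joint distribution of ancestral state X and descendant state Y.

    Assumes a symmetric two-state Markov chain with substitution rate
    `rate` (hypothetical parameter) and a uniform distribution for X.
    """
    p_same = 0.5 + 0.5 * math.exp(-2.0 * rate * t)
    p_diff = 1.0 - p_same
    return [[0.5 * p_same, 0.5 * p_diff],
            [0.5 * p_diff, 0.5 * p_same]]


def mi_over_time(rate, times):
    """Mutual information between X and Y for each elapsed time in `times`."""
    return [mutual_information(two_state_joint(rate, t)) for t in times]


if __name__ == "__main__":
    for t, mi in zip([0.0, 0.1, 0.5, 1.0, 5.0],
                     mi_over_time(1.0, [0.0, 0.1, 0.5, 1.0, 5.0])):
        print(f"t = {t:4.1f}  I(X;Y) = {mi:.6f} bits")
```

In this toy model the information starts at one bit and shrinks toward zero as elapsed time or substitution rate grows, which is the saturation behavior described above. Slowly changing features correspond to a small `rate`.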

(ii) Characters Must Show at Least Two Distinct Character States. One of these two states (transformational homologs) must be ancestral (the “plesiomorphic” state) and the other must be derived (the “apomorphic” state) [74]. Only shared and derived features (synapomorphies) provide vertical phylogenetic evidence. Consequently, determining the relative ancestry of alternative character states defines the polarity of character transformations and roots the underlying tree. This is a fundamental property of phylogenetic inference. Polarization in tree reconstruction enables the “arrow of time” (sensu Eddington's entropy-induced asymmetry), solves the rooting problem, and fulfills other epistemological requirements.

Cladistically speaking, character polarity refers to the distinction between the ancestral and derived states and the identification of synapomorphies. However, an evolutionary view of polarity also refers to the direction of character state transformations in the phylogenetic model. Historically, three accepted alternatives have been available for rooting trees [100–102]: the outgroup comparison, the ontogenetic method, and the paleontological or stratigraphic method. While the three methods do not include assumptions of evolutionary process, they have been the subject of much discussion and their interpretation of much controversy. The first two are, however, justified by the assumption that diversity results in a nested taxonomic hierarchy, which may or may not be induced by evolution. We will not discuss the stratigraphic method as it relies on auxiliary assumptions regarding the completeness of the fossil record. The midpoint rooting criterion mentioned earlier will not be discussed either. The procedure is contested by the existence of heterogeneities in rates of change across the trees and problems with the accurate characterization of phylogenetic distances.

In outgroup comparison, polarity is inferred by the distribution of character states in the ingroup (group of taxa of interest) and the sister group (taxa outside the group of interest). In a simple case, if the character state is only found in the ingroup, it must be considered derived. Outgroup comparison is by far the method of choice because phylogeneticists tend to have confidence in the supporting assumptions: higher-level relationships are outside the ingroup, equivalent ontogenetic stages are compared, and character state distributions are appropriately surveyed. Unfortunately, the method is “indirect” in that it depends on the assumption of the existence of a higher-level relationship between the outgroup and the ingroup. Consequently, the method cannot root the ToL because at that level there is no higher-level relationship that is presently available. Moreover, the method in itself does not polarize characters. It simply connects the ingroup to the rest of the ToL [100].
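The simple case of outgroup comparison can be sketched as follows. This is an illustration only: the function name, the taxon labels, and the state labels are hypothetical, and the rule that a state shared with the outgroup is treated as ancestral is the conventional reading of the method rather than something spelled out in the text above.

```python
def polarize_by_outgroup(ingroup_states, outgroup_states):
    """Assign a polarity to each character state observed in the ingroup.

    Both arguments map taxon names to the state of one character.
    A state found only in the ingroup is labeled derived; a state that
    also occurs in the outgroup is labeled ancestral (assumed convention).
    """
    seen_in_outgroup = set(outgroup_states.values())
    polarity = {}
    for state in sorted(set(ingroup_states.values())):
        polarity[state] = "ancestral" if state in seen_in_outgroup else "derived"
    return polarity


if __name__ == "__main__":
    # Hypothetical taxa and states, for illustration only.
    ingroup = {"taxon1": "a", "taxon2": "b", "taxon3": "b"}
    outgroup = {"out1": "a"}
    print(polarize_by_outgroup(ingroup, outgroup))
```

The sketch makes the dependency on the outgroup explicit: without `outgroup_states` no polarity can be assigned, which is why the method cannot be applied at the level of the ToL.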

The ontogenetic criterion confers polarity through the distribution of the states of homologous characters in ontogenies of the ingroup, generally by focusing on the generality of character states, with more widely distributed states being considered ancestral. Nelson's rule states: “given an ontogenetic character transformation from a character observed to be more general to a character observed to be less general, the more general character is primitive and the less general advanced” [103]. This “biogenetic law” appears powerful in that it depends only on the assumption that ontogenies of ingroup taxa are properly surveyed. It is also a “direct” method that relies exclusively on the ingroup. Consequently, it has the potential to root the ToL. Unfortunately, Nelson's “generality” has been interpreted in numerous ways, especially as it relates to the ontogenetic sequence, leading to much confusion [102]. It also involves comparison of developmentally nested and distinct life history stages, making it difficult to extend the method (originally conceptualized for vertebrate phylogeny) to the microbial world. However, Weston [100, 104] made it clear that the ontogenetic criterion embodies a wider “generality criterion” in which the taxic distribution of a character state is a subset of the distribution of another. In other words, character states that characterize an entire group must be considered ancestral relative to an alternative state that characterizes a subset of the group. Besides the centrality of nested patterns, the generality criterion embeds the core assumption that every homology is a synapomorphy in nature's nested taxonomic hierarchy and that homologies in the hierarchy result from additive phylogenetic change [100]. Weston's more general rule therefore states that “given a distribution of two homologous characters in which one, $x$, is possessed by all of the species that possess its homolog, character $y$, and by at least one other species that does not, then $y$ may be postulated to be apomorphous relative to $x$” [104]. The only assumption of the method is that relevant character states in the ingroup are properly surveyed. This new rule crucially substitutes the concept of ontogenetic transformation by the more general concept of homology and additive phylogenetic change, which can be applied to cases in which homologous entities accumulate “iteratively” in evolution (e.g., generation of paralogous genes by duplication). Since horizontally acquired characters (xenologs) are not considered synapomorphies, they contribute towards phylogenetic noise and are excluded after calculation of homoplasy and retention indices (i.e., measures of goodness of fit of characters to the phylogeny).
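Weston's rule, as quoted above, amounts to a proper-subset test on taxic distributions. The sketch below is one way to express it and is not code from the cited work; the function name, the return labels, and the example taxa are hypothetical.

```python
def weston_polarity(taxa_with_x, taxa_with_y):
    """Apply the generality criterion to two homologous characters x and y.

    Each argument is a collection of the taxa that possess the character.
    If every taxon with y also has x, and at least one further taxon has x
    but not y, then y is postulated to be apomorphic (derived) relative to x.
    """
    x, y = set(taxa_with_x), set(taxa_with_y)
    if not x or not y:
        return "undetermined"            # nothing to compare
    if y < x:                            # y is a proper subset of x
        return "y apomorphic relative to x"
    if x < y:                            # the symmetric case
        return "x apomorphic relative to y"
    return "undetermined"                # equal or non-nested distributions


if __name__ == "__main__":
    # Hypothetical taxa, for illustration only.
    print(weston_polarity({"t1", "t2", "t3", "t4"}, {"t3", "t4"}))
    print(weston_polarity({"t1", "t2"}, {"t2", "t3"}))
```

Only nested distributions yield a polarity statement; overlapping but non-nested distributions are left undetermined, which reflects the centrality of nested patterns in the criterion.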

We have applied the “generality criterion” to the rooting of the ToL through polarization strategies that embody axioms of evolutionary process. Figure 4 shows three examples. A rooted phylogeny describing the evolution of 5S rRNA molecules sampled from a wide range of organisms was reconstructed from molecular sequence and structure [105]. The ToL that was recovered was rooted paraphyletically in Archaea (Figure 4(a)). The model of character state transformation was based on the axiom that evolved RNA molecules are optimized to increase molecular persistence and produce highly stable folded conformations. Molecular persistence materializes in RNA structure in vitro, with base pairs associating and disassociating at rates as high as 0.5 s⁻¹ [106]. The frustrated kinetics and energetics of this folding process enable some structural conformations to quickly reach stable states. This process is evolutionarily optimized through structural canalization [6], in which evolution attains molecular functions by both increasing the average life and stability of selected conformations and decreasing their relative number. Thus, conformational diversity measured, for example, by the Shannon entropy of the base-pairing probability matrix or features of thermodynamic stability acts as an “evo-devo” proxy of a generality criterion for RNA molecules, in which the criterion of similarity (e.g., ontogenetic transformation) is of positional and compositional correspondence. Using a different approach, we recently broadened the use of phylogenetic constraint analysis [107, 108], borrowing from a formal cybernetic method that decomposes a reconstructable system into its components [109], and used it to root a ToL derived from tRNA sequence and structure (Figure 4(b)). The ToL was again rooted in Archaea. The number of additional steps required to force (constrain) particular taxa into a monophyletic group was used to define a lineage coalescence distance ($S$) with which to test alternative hypotheses of monophyly. These hypotheses were then ordered according to $S$ value in an evolutionary timeline. Since $S$ records the relative distribution of character states in taxic sets, it also embodies the generality criterion of rooting. Finally, we generated rooted ToLs that describe the evolution of proteomes directly from a census of protein fold structures in proteomes (Figure 4(c)). This ToL shows a paraphyletic rooting in Archaea. The method extracts phylogenetic signal from proteomic abundance of protein fold structures and considers that the most abundant and widely distributed folds are of ancient origin when defining transformation series [41]. This polarization scheme, which results in the gradual growth of the proteome repertoire, again represents an embodiment of the generality criterion in which statements of homology (fold structures) result from additive phylogenetic change (increases in abundance). It is noteworthy that these ToL reconstructions take into account the genomic abundance of each and every fold structure in each proteome and across the entire matrix, thereby generating a frustrated system. In general, very high abundance of only a few folds will not attract taxa (species) to the derived branches in the ToL. The ancestry of taxa is determined by both the abundance and interplay among fold structure characters. For example, metabolic folds such as those involved in ATP hydrolysis are widespread in cells and considered to be very ancient. These are also the most ancient folds in our phylogenies. In comparison, some popular eukaryote-specific folds (e.g., immunoglobulin superfamilies) are highly abundant but appear in a derived manner in our phylogenies. Thus, we reason that there is no circularity involved in the character polarization scheme.

[Figure 4 panel annotations. (a) Archaea, Bacteria, Eukarya; Root. Taxa = 666 5S rRNA molecules of diverse organisms; characters = sequence and structure of tRNA; tree length = 11381 steps; CI = 0.072, RI = 0.772, and g1 = −0.13. (b) Timeline from ancestral to derived for lineage sets of Archaea (A), Bacteria (B), and Eukarya (E). (c) Eukarya, Archaea, Bacteria; Root. Taxa = 150 free-living proteomes; characters = genomic abundance of 1510 FSFs; tree length = 70132 steps; CI = 0.155, RI = 0.739, and g1 = −0.28.]

Figure 4: Trees of life generated from the structure of RNA and protein molecules congruently show a rooting in Archaea. (a) A rooted phylogenetic tree of 5S rRNA reconstructed from both the sequence and the structure of the molecules (from [105]). (b) Global most-parsimonious scenario of organismal diversification based on tRNA (from [107]). A total of 571 tRNA molecules with sequence, base modification, and structural information were used to build a ToL, which failed to show monophyletic groupings. Ancestries of lineages were then inferred by constraining sets of tRNAs into monophyletic groups representing competing (shown in boxes) or noncompeting phylogenetic hypotheses and measuring tree suboptimality and lineage coalescence (illustrated with color hues in circles). (c) A ToL reconstructed from the genomic abundance counts of 1510 FSFs as phylogenetic characters in the proteomes of 150 free-living organisms sampled equally and randomly from the three domains of life (data taken from [122, 123]). Taxa were labeled with circles colored according to superkingdom. CI = consistency index, RI = retention index, and g1 = gamma distribution parameter.
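The bookkeeping behind the lineage coalescence distance $S$ can be sketched in a few lines, assuming that tree lengths for the unconstrained and constrained searches have already been obtained from a parsimony program. The function names are hypothetical, the tree lengths in the example are made-up numbers and not results from [107], and the ascending sort is only one possible ordering, since the text above does not state which end of the ordering corresponds to older lineages.

```python
def coalescence_distances(unconstrained_length, constrained_lengths):
    """Return S for each hypothesis of monophyly.

    S is the number of additional steps needed when a set of taxa is forced
    into a monophyletic group: constrained length minus unconstrained length.
    """
    return {hypothesis: length - unconstrained_length
            for hypothesis, length in constrained_lengths.items()}


def order_hypotheses(distances):
    """Order hypotheses by S (ascending; the direction is an assumption)."""
    return sorted(distances, key=lambda hypothesis: distances[hypothesis])


if __name__ == "__main__":
    # Made-up tree lengths, for illustration only.
    s_values = coalescence_distances(
        1000, {"hypothesis_1": 1012, "hypothesis_2": 1047, "hypothesis_3": 1030})
    print(s_values)
    print(order_hypotheses(s_values))
```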

The compositional schemes extend the concept of rooting with paralogous sequences to the entire proteome complements, from gene family level [100] to structural hierarchies. The three examples make use of different rooting rationales but provide a congruent scenario of origins of diversified life. Technically, roots are inferred by the Lundberg method [110], which does not require any outgroup taxa specification. This method roots the trees a posteriori by attaching the hypothetical ancestor to that branch of the unrooted network that would yield minimum increase in the tree length (thus preserving the principle of parsimony).

(iii) Characters Must Serve as Independent Evolutionary Hypotheses. Valid phylogenetic optimization requires that characters be independent pieces of evidence. Characters should not depend on other characters. When the assumption of independence is violated, characters are overweighted in the analysis and the resulting phylogeny fails to represent true history [111]. Possible dependencies could be of many kinds, from structural to functional, from developmental to ecological. These dependencies distort and obscure phylogenetic signal and must be either avoided or coded into the phylogenetic model through parameters or weight corrections.

As we will now elaborate, the problem of character independence is about parts and wholes in the hierarchical fabric of life and in the nested hierarchies of the ToL. Biological systems are by definition made of parts, regardless of the way parts are defined. In evolution, diversification and integration of parts unify parts into cohesive entities, modules, which then diversify [112]. This process and the rise of modules may explain evolutionary waves of complexity and organization and the emergence of structure (defined broadly) in biology that is hierarchical. The hierarchical makeup is made evident in the structure of protein molecules, where lower level parts of the polymer (the amino acid residues) interact with each other and establish cohesive higher-level modular parts, which also establish interaction networks and are crucial for molecular function and for interaction of proteins with the cellular environment. Consequently, the structure of proteins can be described at increasing hierarchical levels of structural abstraction: sequences, motifs, loops, domains, families, superfamilies, topologies, folds, architectures, and classes. Two accepted gold standards of protein classification, structural classification of proteins (SCOP) [113] and class architecture topology homology (CATH) [114], use parts of this incomplete scheme to describe the atomic complexity of the molecules. We note that these classifications do not consider unrealized structural states, such as protein folds that are possible but that have never been identified in the natural world of protein structures. We also note that modules sometimes engage in combinatorial games. For example, protein domains are rearranged in evolution by fusions and fissions, producing the enormous diversity of alternative domain rearrangements that exist in multidomain proteins [92].

Since biological systems evolve and carry common ancestry, parts of these systems by definition evolve and by themselves carry common ancestry. In other words, the histories of parts are embodied in the ensemble of lineages of the ToL. The focus of phylogenetic analysis, however, has been overwhelmingly the organism as the biological system, as testified by the effort devoted to taxonomical classification, systematic biology, and building of the ToL. The genomic revolution has provided a wealth of parts in the amino acid sequence sites of proteins and nucleic acid molecules, which have been used as phylogenetic characters for analysis of molecules representing organisms. This has maintained the focus of reconstructing trees of systems (the wholes). For example, amino acid sites of a protein are generally fit into a data matrix (alignment), which is then used to build trees of genes and organisms (Figure 5(a)) using modern algorithmic implementations [115]. However, and as we have mentioned, proteins have complicated structures that result from interactions between amino acids at the 3D atomic level. These intramolecular interactions, or at least some of them [116], induce protein folding and delimit molecular function and protein stability. They are responsible for protein secondary, supersecondary, domain, and tertiary structure, and by definition their mere existence induces violation of character independence. Penny and Collins [117] proposed the simple thought experiment in which the bioinformatician exchanges rows of sequence sites in the alignment matrix and asks what was lost in the process. Randomization of characters (columns) in the data matrix does not change the phylogenetic tree. However, randomization destroys the structure of the molecule and very likely its function. This confirms that reconstruction of trees from information in sequence sites violates character independence and in the process ignores structure and biology. The effects of violation of character independence may be minimal for trees of sequences that are closely related. However, trees describing deep historical relationships require that sequences be divergent, and this maximizes the chances of even wider divergences in molecular structure that are not being accounted for by the models of sequence evolution.

Figure 5: A new phylogenetic strategy simplifies the problems of character independence. (a) The canonical paradigm explores the evolution of systems, such as the evolution of organisms in the reconstruction of ToLs. For example, the terminals of phylogenetic trees can be genes sampled from different organisms (e.g., Pyrococcus furiosus (Pfu), Methanococcus jannaschii (Mja), and Methanosarcina barkeri (Mba)) and phylogenetic characters can be amino acid sequence sites of the corresponding gene products. Character states can describe the identity of the amino acid at each site. Since characters are molecular parts that interact with other parts when molecules fold into compact 3D structures, their interaction violates the principle of character independence. Consequently, the effects of covariation must be considered in the phylogenetic model used to build the trees. (b) The new paradigm explores the evolution of parts, such as the evolution of protein domains in proteomes. For example, the terminals of phylogenetic trees can be domains defined at fold superfamily level of structural complexity and the characters used to build the trees can be proteomes. Character states can be the number of domains holding the FSF structure. Since proteomes interact with other proteomes when organisms establish close interactions, interactions that could affect the abundance of domains in proteomes should be considered negligible (unless there is an obligate parasitic lifestyle involved) and there is no need to budget trophic interactions in the phylogenetic model.
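The column-shuffling thought experiment of Penny and Collins described above can be made concrete with a minimal sketch. The aligned toy sequences and the use of simple mismatch counts as tree-building input are assumptions made for illustration. The point carries over to any method that treats sites as independent: permuting columns leaves the pairwise comparisons, and hence the tree, unchanged, even though the site order that underlies the folded structure is scrambled.

```python
# Sketch of the Penny and Collins column-shuffling thought experiment.
# ASSUMPTIONS: the aligned toy sequences and the use of simple mismatch
# (Hamming) distances as the tree-building input are illustrative only.
import random
from itertools import combinations

def pairwise_mismatches(alignment):
    """Count mismatching sites for every pair of aligned sequences."""
    return {
        (a, b): sum(x != y for x, y in zip(alignment[a], alignment[b]))
        for a, b in combinations(sorted(alignment), 2)
    }

def shuffle_columns(alignment, seed=0):
    """Apply one random permutation of site columns to all sequences."""
    length = len(next(iter(alignment.values())))
    order = list(range(length))
    random.Random(seed).shuffle(order)
    return {name: "".join(seq[i] for i in order)
            for name, seq in alignment.items()}

# Hypothetical alignment of one protein from three systems (organisms).
alignment = {"sys1": "ALYSKV", "sys2": "ALYSRV", "sys3": "GLSSRI"}
shuffled = shuffle_columns(alignment)

# The distances are identical before and after shuffling the columns.
same_distances = pairwise_mismatches(alignment) == pairwise_mismatches(shuffled)
print(same_distances)
```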

The genomic revolution has also provided a wealth of models of 3D atomic structure. These structures are used as gold standards to assign, with high confidence, structural modules to sequences. As mentioned earlier, proteomes embody collections of protein domains with well-defined structures and functions. Protein domain counts in proteomes have been used to generate trees of protein domains (Figure 5(b)) using standard cladistic approaches and well-established methods (reviewed in [87]). These trees describe the evolution of protein structure at global level. They are effectively trees of parts. While domains interact with each other in multidomain proteins or establish protein-protein interactions with the domains of other proteins, these interactions of parts do not violate character independence. This is because phylogenetic characters are actually proteomes, systems defined by structural states that exist at much higher levels than the protein domain, not far away from the organism level. Remarkably, no information is lost when character columns in the matrix are randomized in the thought experiment. The order of proteome characters in the matrix does not follow any rationale. Characters are not ordered by lifestyles or trophic levels of the organisms. Interactions between free-living organisms will seldom bias their domain makeup, and if so, those characters can be excluded from analysis. Even the establishment of symbiotic or obligate parasitic interactions, such as the nodule-forming symbioses between rhizobia and legumes, may have little impact on character independence as long as the joint inclusion of the host and the symbiont is avoided.

6. Evidence Supporting the Archaeal Rooting of the Universal Tree

Figure 4 shows examples of rooted ToLs generated from the sequence and structure of RNA and protein molecules. The different phylogenomic approaches arrive at a common rooted topology that places the stem group of Archaea at the base of the ToL (the archaeal rooting of Figure 1(c)). However, the evolutionary interrelationship of parts and wholes prompts the use of trees of parts, a focus on higher-level structure, and a decrease in confidence in the power of trees of systems. These concepts have been applied to the study of the history of nucleic acid and protein structures for over a decade and provide additional evidence in support of the archaeal rooting scenario.

Figure 6: Venn diagrams displaying the distributions of 1733 FSF domains (a), 1924 terminal GO terms (b), and 1148 RNA families (c) in the genomes of the three domains of life. FSF domain data was taken from Nasir et al. [122, 123] and included 981 completely sequenced proteomes from 70 Archaea, 652 Bacteria, and 259 Eukarya. Terminal GO terms corresponding to the "molecular function" hierarchy defined by the GO database [118] were identified in 249 free-living organisms, including 45 Archaea, 183 Bacteria, and 21 Eukarya (data taken from [123]). The Venn diagram of RNA families and the distribution of Rfam clans and families in organisms were taken from Hoeppner et al. [129] and their Dataset S1. Shown below are distribution patterns for FSFs, GO terms, and RNA families that are present in more than 60% of the organisms examined (f > 0.6). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

6.1. Comparative Genomic Argumentation. The distribution of gene-encoded products in the genomes of sequenced organisms and parsimony thinking can reveal global evolutionary patterns without formal phylogenetic reconstruction. We will show how simple numerical analyses of protein domains, molecular activities defined by the gene ontology (GO) consortium [118], and RNA families that are domain-specific or are shared between domains of life can uncover the tripartite division of cellular life, exclude chimeric scenarios of origin, and provide initial insight on the rooting of the ToL (Figure 6).

The number of unique protein folds observed in nature is very small. SCOP ver. 1.75 defines only ~2000 fold superfamilies (FSFs), groups of homologous domains unified on the basis of common structural and evolutionary relationships, for a total of 110,800 known domains in proteins [113, 119]. FSFs represent highly conserved evolutionary units that are suitable for studying organismal diversification [87]. Yafremava et al. [120] plotted the total number of distinct FSFs (FSF diversity) versus the average reuse of FSFs in the proteome of an organism (FSF abundance). This exercise uncovered a scaling behavior typical of a Benford distribution, with a linear regime of proteomic growth for microbial organisms and a superlinear regime for eukaryotic organisms (Figure 8 in [120]). These same scaling patterns are observed when studying the relationship between open reading frames and genome size [121]. Remarkably, archaeal and eukaryal proteomes exhibited the minimum and maximum levels of FSF abundance and diversity, respectively. Bacterial proteomes, however, showed intermediate levels. We note that the general scaling behavior is consistent with a scenario in which evolutionary diversification proceeds from simpler proteomes to the more complex ones in gradual manner, supporting the principle of spatiotemporal continuity and revealing the nested phylogenetic hierarchies of organisms. Under this scenario (and results of [120]), the streamlined archaeal proteomes represent the earliest form of cellular life. Remarkably, archaeal species harboring thermophilic and hyperthermophilic lifestyles encoded the most streamlined FSF repertoires (Figure 7). Clearly, modern thermophilic archaeons are most closely related to the ancient cells that inhabited planet Earth billions of years ago (also read below).
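The distinction between linear and superlinear regimes of proteomic growth can be read off the slope of a log-log plot. The sketch below is illustrative only: the proteome values are invented, and an ordinary least-squares fit on log-transformed values is an assumed stand-in for whatever fitting procedure was used in [120]. A slope near 1 corresponds to proportional (linear) scaling, and a slope above 1 to superlinear scaling.

```python
# Sketch: estimating a scaling exponent between FSF use (diversity) and
# FSF reuse (abundance) on log-log axes.
# ASSUMPTIONS: the proteome values below are invented for illustration.
import math

def loglog_slope(xs, ys):
    """Least-squares slope and intercept of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical proteomes: FSF use and FSF reuse values.
fsf_use = [200, 300, 400, 600]
fsf_reuse = [400, 600, 800, 1200]  # exactly proportional to fsf_use

slope, intercept = loglog_slope(fsf_use, fsf_reuse)
print(round(slope, 3), round(intercept, 3))
```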

Figure 7: A plot of FSF use (diversity) against FSF reuse (abundance) reveals a linear pattern of proteomic growth in 48 archaeal proteomes. Thermophilic archaeal species occupy positions that are close to the origin of the plot. They also populate the most basal branch positions in ToLs (see discussion in the main text). Both axes are in logarithmic scale.

FSF domain distributions in the genomes of the three domains of life provide further insights into their evolution. Nasir et al. [122, 123] generated Venn diagrams to illustrate FSF sharing patterns in the genomes of Archaea, Bacteria, and Eukarya (Figure 6(a)). These diagrams display the total number of FSFs that are unique to a domain of life (taxonomic groups A, B, and E), shared by only two (AB, BE, and AE), and those that are universal (ABE). About half of the FSFs (786 out of 1733) were present in all three domains of life, 453 of which were present in more than 60% of the organisms that were examined (f > 0.6, where f is the distribution index ranging from 0 to 1). The fact that about 70% of widely shared FSFs (453 of 672) belong to the ABE taxonomic group strongly supports a common evolutionary origin for cells. Evolutionary timelines have confirmed that a large number of these universal FSFs were present in the ancestor of the three domains of life, the urancestor (Figure 1), and were for the most part retained in extant proteomes [48].
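The bookkeeping behind the Venn groups and the distribution index can be sketched in a few lines. The census below is invented, and f is taken here to be the fraction of all sampled organisms that encode an FSF, which is an assumption about how the index is normalized. The names `domain_of`, `census`, `venn_group`, and `distribution_index` are hypothetical.

```python
# Sketch: assigning FSFs to Venn taxonomic groups and computing a
# distribution index f.
# ASSUMPTIONS: the census below is invented; f is taken here to be the
# fraction of all sampled organisms that encode the FSF.

# Hypothetical organisms and their domain of life (A, B, or E).
domain_of = {"arc1": "A", "arc2": "A", "bac1": "B",
             "bac2": "B", "euk1": "E", "euk2": "E"}

# Hypothetical census: FSF identifier -> set of organisms encoding it.
census = {
    "fsf_1": {"arc1", "arc2", "bac1", "bac2", "euk1", "euk2"},
    "fsf_2": {"bac1", "bac2", "euk1", "euk2"},
    "fsf_3": {"bac1"},
    "fsf_4": {"arc1", "euk1"},
}

def venn_group(organisms):
    """Return the taxonomic group label, e.g. 'ABE', 'BE', or 'A'."""
    present = {domain_of[o] for o in organisms}
    return "".join(d for d in "ABE" if d in present)

def distribution_index(organisms):
    """Fraction of sampled organisms that encode the FSF (0 to 1)."""
    return len(organisms) / len(domain_of)

summary = {fsf: (venn_group(orgs), distribution_index(orgs))
           for fsf, orgs in census.items()}
widely_shared = sorted(fsf for fsf, (_, f) in summary.items() if f > 0.6)
print(summary, widely_shared)
```

Counting the labels returned by `venn_group` over a full census would give the seven numbers of a Venn diagram, and filtering on f > 0.6 would give the "widely shared" subset.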

Interestingly, the number of BE FSFs was roughly 8.5-fold greater than the number of AE or AB FSFs (324 versus 38 and 38) (Figure 6(a)). The finding that Bacteria and Eukarya encode a significantly large number of shared FSF domains is remarkable and hints towards an unprecedented strong evolutionary association between these two domains of life. Moreover, a significant number of BE FSFs is widespread in the bacterial and eukaryotic proteomes; 56 of the 324 FSFs are shared by more than 60% of the bacterial and eukaryal organisms that were analyzed (Figure 6(a)). This is strong evidence of an ancient vertical evolutionary trace from their mutual ancestor (anticipated in [123]). This trace uniquely supports the archaeal rooting of the ToL. In turn, the canonical and eukaryotic rooting alternatives are highly unlikely, as none alone can explain the remarkable diversity of the BE taxonomic group (as well as its ancient origin [123]). The remarkable vertical trace of the ABE taxonomic group and the negligible vertical trace of the AB group (only one FSF shared by more than 60% of organisms) also falsify fusion hypotheses responsible for a putative chimeric makeup of Eukarya. In light of these findings, the most parsimonious explanation of comparative genomic data is that Archaea is the first domain of diversified cellular life.

The patterns of FSF sharing are further strengthened by the genomic distributions of terminal-level molecular function GO terms (Figure 6(b)). The three domains of life shared about a quarter (526) of the 1924 GO terms that were surveyed. A total of 232 of these ABE terms were shared by more than 60% of organisms analyzed. Again, the fact that about 76% of widely shared GO terms (232 of 306) belong to the ABE taxonomic group strongly supports a common vertical trace, while the finding that Bacteria and Eukarya share a significantly large number of GO terms (272) supports again the archaeal rooting of the ToL. An alternative explanation for the very large size of BE could be large-scale metabolism-related gene transfer from the ancestors of mitochondria and plastids to the ancestors of modern eukaryotes [124]. However, we note that the BE group is not restricted to only metabolic functions. It also includes FSFs and GOs involved in intracellular and extracellular processes and regulation and informational functions [125]. Thus, the very large size of BE is a significant outcome most likely shaped by vertical evolutionary scenarios and cannot solely be explained by parasitic/symbiotic relationships that exist between Bacteria and Eukarya [123]. Moreover, the simplicity of archaeal FSF repertoires is not due to the paucity of available archaeal genomic data. We confirmed that the mean FSF coverage (i.e., number of proteins annotated to FSFs/GOs out of total) for Archaea, Bacteria, and Eukarya was largely comparable (e.g., Table S1 in [122]). Finally, as we will describe below, these comparative patterns and tentative conclusions are confirmed by phylogenomic analysis [91, 122, 126]. Congruent phylogenies were obtained with (Figure 4(c)) and without equal and random sampling of taxa. Thus, the relatively low number of archaeal genomes is also not expected to compromise our inferences.
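The sharing proportions quoted for FSFs and GO terms in the preceding paragraphs can be rechecked with a few lines of arithmetic. All counts in the sketch are taken from the text and from the totals stated for Figure 6; only the variable names are new.

```python
# Worked arithmetic for the sharing proportions quoted in the text.
# All counts are taken from the text; only the variable names are new.
fsf_total, fsf_abe, fsf_be, fsf_ae, fsf_ab = 1733, 786, 324, 38, 38
fsf_wide_total, fsf_wide_abe = 672, 453
go_total, go_abe = 1924, 526
go_wide_total, go_wide_abe = 306, 232

proportions = {
    "ABE share of all FSFs": fsf_abe / fsf_total,
    "ABE share of widely shared FSFs": fsf_wide_abe / fsf_wide_total,
    "ABE share of all GO terms": go_abe / go_total,
    "ABE share of widely shared GO terms": go_wide_abe / go_wide_total,
    "BE to AE (or AB) FSF ratio": fsf_be / fsf_ae,
}
for label, value in proportions.items():
    print(f"{label}: {value:.3f}")
```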

We note that Eukarya shares many informational genes with Archaea and many operational genes with Bacteria. While informational genes have been thought more refractory to HGT than noninformational genes, we have confirmed at GO level that this is not the case. A ToL built from GO terms showed that in fact noninformational terms were less homoplasious, while statistical enrichment analyses revealed that HGT had little if any functional preference for GO terms across GO hierarchical levels. We recently compared ToLs reconstructed from non-HGT GO terms and ToLs reconstructed from informational GO terms that were extracted from non-HGT GO terms (Kim et al., ms. in review; also read below). In both cases, Archaea appeared as a basal paraphyletic group of the ToLs and the common origin of Bacteria and Eukarya was maintained. Thus, the 272 GO terms shared by Bacteria and Eukarya harbor a strong vertical trace.

Another observation often used as support for Archaea-Eukarya kinship is the discovery of few eukaryote-specific proteins (e.g., actin, tubulin, H3, H4, ESCRT, ribosomal proteins, and others) in some archaeal species [127] and their complete absence from Bacteria (except for documented tubulin HGT from eukaryotes to bacterial genus Prosthecobacter [128]). This suggests either that eukaryotes arose from an archaeal lineage [127] or that the ancestor of Eukarya and Archaea was complex and modern Archaea are highly reduced [25]. Indeed, few eukaryote-specific proteins have now been found in some archaeal species (in some cases just one species). We argue that this poor spread cannot be taken as evidence for the Archaea-Eukarya sister relationship. This needs to be confirmed by robust phylogenetic analysis, which is unfortunately not possible when using protein sequences. Recently, we described a new strategy for inferring vertical and horizontal traces [123]. This method calculates the spread of FSFs and GOs that are shared between two-superkingdom groups (i.e., AB, AE, and BE). Balanced distributions often indicate vertical inheritance, while biased distributions suggest horizontal flux. For example, penicillin binding molecular activity (GO:0008658) was present in 100% of the sampled bacterial proteomes but was only present in 11% of the archaeal species [123]. Thus, presence of GO:0008658 in Archaea was attributed to HGT gain from Bacteria. Using this simple method, we established that both Bacteria and Eukarya were united by much stronger vertical trace than either was to Archaea. In fact, strong reductive tendencies in the archaeal genomes were recorded [123]. Thus, in our opinion, presence of eukaryote-specific proteins in only very few archaeal species could in fact be an HGT event that is not detectable by sequence phylogenies.
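The contrast between balanced and biased spreads can be sketched as a simple score. This is not the published method of [123]. The balance score (smaller spread divided by larger spread), the 0.5 cutoff, and the function name `classify_trace` are hypothetical choices made only to illustrate the reasoning with the penicillin binding example given above.

```python
# Sketch: flagging vertical versus horizontal traces from the spread of a
# feature in two superkingdoms.
# ASSUMPTIONS: the balance score (smaller spread divided by larger spread)
# and the 0.5 cutoff are hypothetical choices, not the published method.

def classify_trace(spread_1, spread_2, cutoff=0.5):
    """Label a shared feature from its spread (fraction of proteomes, 0 to 1)
    in each of the two superkingdoms that share it."""
    larger = max(spread_1, spread_2)
    if larger == 0:
        raise ValueError("feature is absent from both superkingdoms")
    balance = min(spread_1, spread_2) / larger
    if balance >= cutoff:
        label = "balanced (vertical-like)"
    else:
        label = "biased (HGT-like)"
    return balance, label

# Penicillin binding (GO:0008658): 100% of bacterial proteomes versus
# 11% of archaeal species, as quoted in the text.
balance, label = classify_trace(1.00, 0.11)
print(round(balance, 2), label)
```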

In turn, many arguments now favor the Bacteria-Eukarya sisterhood in addition to the structure-based phylogenies and balanced distribution of molecular features. For example, Bacteria and Eukarya have similar lipid membranes that can be used as argument for their evolutionary kinship. Moreover, Archaea fundamentally differ from Eukarya in terms of their virosphere. Viruses infecting Archaea and Eukarya are drastically different, as recently discussed by Forterre [25].

The genomic distribution of RNA families was taken from Hoeppner et al. [129] and also shows a vertical evolutionary trace in five crucial Rfam clans that are universal, including tRNA, 5S rRNA, subunit rRNA, and RNase P RNA. These universal RNA groups are likely minimally affected by HGT. However, 99% of Rfam clans and families are specific to domains of life and only 11 of the 1148 groups were shared at f > 0.6 levels. This clearly shows that the functional complexity of RNA materialized very late during organismal diversification and that it is not a good genomic feature for exploring the rooting of the ToL. Only five RNA families are shared between two domains of life, and only one of these does so at f > 0.6, the G12 pseudoknot of the 23S rRNA, which is present in bacterial and eukaryotic organellar rRNA. While the large subunit rRNA scaffold supports the G12 pseudoknot with its vertical trace, all five interdomain RNA families can be explained most parsimoniously by HGT. Thus, only a handful of ancient and universal RNA species can be used to root the ToL.

We end by noting that the Venn diagrams consistently show that Archaea harbors the smallest number of unique (A) and shared (AB and AE) FSFs, GO terms, or RNA families. This trend supports an early divergence of this domain of life from the urancestor and the possibility that such divergence was shaped by evolutionary reductive events. We reason that such losses would be more parsimonious early on in evolution than in the later periods. This is because genes often increase their abundance in evolutionary time (by gene duplications, HGT, and other processes). Thus, it is reasonable to think that loss of an ancient gene would be more feasible very early in evolution relative to losing it very late.

6.2. Phylogenomic Evidence from the Sequence and Structure of RNA. Phylogenetic analyses of the few RNA families that are universal and display an important vertical evolutionary trace (Figure 6(c)) provide compelling evidence in favor of an early evolutionary appearance of Archaea [105, 107, 108, 130–135]. Here we briefly summarize evidence from tRNA, 5S rRNA, and RNase P RNA. Unpublished analyses of rRNA sequence and structure using advanced phylogenetic methods also show that Archaea was the first domain of life.

tRNA molecules are generally short (~73–95 nucleotides in length) and highly conserved. Consequently, their sequence generally contains limited amount of phylogenetic information. These limitations have been overcome by analyzing entire tRNomes [136], which extend the length of an organismal set of tRNAs to over 2000 bases. Xue et al. [130, 131] analyzed the genetic distances between tRNA sequences as averages between alloacceptor tRNAs from diverse groups of tRNomes using multiple molecular base substitution models. The distances were mapped onto an unrooted phylogeny of tRNA molecules (Figure 8(a)). The "arrow of time" assumption in these studies is that ancient tRNA paralogs closely resemble each other when lineages originate close to the time of the gene duplication. Remarkably, the results revealed a paraphyletic rooting of the ToL in Archaea. The root was specifically located close to the hyperthermophilic methanogen Methanopyrus kandleri (Figure 8(a)). The hypothesis of this specific rooting scenario has been supported by several other studies [134, 137], including a study of genetic distances between paralogous pairs of aminoacyl-tRNA synthetase (aaRS) proteins [131]. A remarkable match between distance scores of tRNA and pairs of aaRS paralogs (Figure 8(b)) not only confirms the early appearance of Archaea but also suggests a coevolutionary trace associated with molecular interactions that are responsible for the genetic code. In fact, a recent exhaustive phylogenomic analysis of tRNA and aaRS coevolution explicitly reveals the origins and evolution of the genetic code and the underlying molecular basis of genetics [59]. Di Giulio [132, 133] has also proposed an archaeal rooting of the tree of life, specifically in the lineage leading to the phylum of Nanoarchaeota. This rooting is based on unique and ancestral genomic traits of Nanoarchaeum equitans, split genes separately codifying for the 5′ and 3′ halves of tRNA and the absence of operons, which are considered molecular fossils [132]. However, this claim needs additional support, as contrasting evidence now recognizes N. equitans as a highly derived archaeal species [138].
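The idea of averaging distances over alloacceptor tRNA pairs can be sketched as follows. This is a simplified illustration, not the procedure of Xue et al.: it assumes one pre-aligned, equal-length tRNA per amino acid and uses a plain mismatch fraction instead of a substitution model, and the toy sequences and function names are invented. With 20 amino acids, one tRNA each would give 190 pairs.

```python
# Sketch: an alloacceptor distance as the mean pairwise mismatch fraction
# between tRNAs that accept different amino acids.
# ASSUMPTIONS: one pre-aligned, equal-length tRNA per amino acid and a plain
# mismatch fraction are simplifications; the toy sequences are invented.
from itertools import combinations
from math import comb

def mismatch_fraction(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)

def alloacceptor_distance(trnas):
    """Mean mismatch fraction over all pairs of alloacceptor tRNAs."""
    pairs = list(combinations(sorted(trnas), 2))
    total = sum(mismatch_fraction(trnas[a], trnas[b]) for a, b in pairs)
    return total / len(pairs)

# Twenty amino acids give comb(20, 2) alloacceptor pairs.
n_pairs = comb(20, 2)

# Hypothetical three-member tRNA set (keys are amino acids).
toy_trnas = {"Ala": "GGGCAU", "Gly": "GGGCAA", "Ser": "GGACAA"}
d_allo = alloacceptor_distance(toy_trnas)
print(n_pairs, round(d_allo, 3))
```

Under the "arrow of time" assumption described above, a genome with a smaller value of such a distance would be read as closer to the time of the ancestral duplications.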

Figure 8: Ancient phylogenetic signal in the sequence of tRNA and their associated aminoacyl-tRNA synthetase (aaRS) enzymes. (a) Unrooted ToL derived from tRNA sequences with their alloacceptor D_allo distances traced in thermal scale (from [130]). D_allo is the average of pairwise distances for 190 pairs of tRNA isoacceptors. These distances measure pairwise sequence mismatches of tRNAs for every genome and their values increase for faster evolving sequences of species of more recent origin. (b) Coevolution of aaRSs and their corresponding tRNA. Genetic distances between the top 10 potentially paralogous aaRS pairs estimated using BLASTP define a measure (Q_ars) of how closely the proteins resemble each other in genomes (from [131]). Larger Q_ars scores imply more ancestral and slowly evolving protein pairs. The plot of Q_ars scores against D_allo distances reveals a hidden correlation between the evolution of tRNA and aaRSs and the early origin of Archaea (larger Q_ars and lower D_allo distances).

In addition to sequences, structural features of tRNA molecules also support the archaeal rooting of the ToL. The application of RNA structural evidence in phylogenetic studies [50, 139–141] has multiple advantages over sequence data when studying ancient events, especially because RNA structures are far more conserved than sequences. This has been demonstrated in a phylogenetic approach that uses RNA structural information to reconstruct evolutionary history of macromolecules such as rRNA [29, 50], tRNA [107, 108], 5S rRNA [105], RNase P RNA [135], and SINE RNA [142]. Geometrical and statistical properties of structure (e.g., stems or loops commonly found in the secondary structures of RNA molecules) are treated as linearly ordered multistate phylogenetic characters. In order to build rooted trees, an evolutionary tendency toward conformational order is used to polarize change in character state transformations. This defines a hypothetical ancestor with which to root the ingroup using the Lundberg method. Reconstructed phylogenies produce trees of molecules and ToLs (e.g., Figure 4(a)) or trees of substructures that describe the gradual evolutionary accretion of structural components into molecules (Figure 9). For example, phylogenetic trees of tRNA substructures define explicit models of molecular history and show that tRNA originated in the acceptor stem of the molecule [108]. Remarkably, trees reconstructed from tRNA drawn from individual domains of life demonstrate that the sequence of accretion events occurred differently in Archaea than in Bacteria and Eukarya, suggesting a sister group evolutionary relationship between the bacterial and eukaryotic domains (Figure 9(a)). A similar result obtained from trees of 5S rRNA substructures revealed different molecular accretion sequences of archaeal molecules when these were compared to bacterial and eukaryal counterparts [105], confirming again in a completely different molecular system the history of the domains of life.

An analysis of the structure of RNase PRNAalso providessimilar conclusions [135] While a ToL reconstructed frommolecular structure placed type A archaeal molecules at itsbase (a topology that resembles the ToL of 5S rRNA) atree of RNase P RNA substructures uncovered the history ofmolecular accretion of the RNA component of the ancientendonuclease and revealed a remarkable reductive evolution-ary trend (Figure 9(b)) Molecules originated in stem P12and were immediately accessorized with the catalytic P1ndashP4 catalytic pseudoknotted core structure that interacts withRNase P proteins of the endonuclease complex and ancientsegments of tRNA Soon after this important accretion stagethe evolving molecule loses its first stem in Archaea (stemP8) several accretion steps earlier than the first loss of a stemin Eukarya or the first appearance of a Bacteria-specific stemThese phylogenetic statements provide additional strongsupport to the early origin of the archaeal superkingdomprior to the divergence of the shared common ancestor ofBacteria and Eukarya As we will discuss below the earlyloss of a structure in the molecular accretion process of acentral and ancient RNA family is significant It suggests

Archaea 17

100

100100 Var

DHUAC

AccEukarya

100

100100 Var

DHUAC

AccBacteria

100

100100 Var

DHU

ACAcc

ArchaeaAC

AccDHU

Ancestral Derived0 105

Age (nd)

3998400endTΨCTΨC

TΨC

TΨC

Var

(a)

P2

P3

P1 P9 P12

P7

P51

P151P15

P4

P5

P8P101

P18

0 101 02 03 04 05 06 07 08 09

P101P151P51P20P15-16

P162P16-17

P161P19

P14P18

P13P17

P16P6

P15P5

P7P8

P9P10-11

P2P4

P3P1

P12

Ancestral Derived

First stem lost inArchaea (type M)

Age (nd)

Universalpseudoknot

Type A pseudoknot(first loss in Eukarya)

100

100

100

89

75

51

66

74

51

93

95

75

75

9084

First stem specificto bacteria (type A)

Ancientuniversalstems

Universalstems

First stems specificto Archaea (type A)

Bacterialtype B-specific

P19

100 changes

(b)

Figure 9The history of accretion of tRNA and RNase P RNA substructures reveals the early evolutionary appearance of Archaea (a) Rootedtrees of tRNA arm substructures reveal the early appearance of the acceptor arm (Acc) followed by the anticodon arm (AC) in Archaeaor the pseudouridine (TΨC) arm in both Bacteria and Eukarya (from [108]) The result confirms the sister group relationship of Bacteriaand Eukarya (b) Trees of molecular substructures of RNase P RNAs were reconstructed from characters describing the geometry of theirstructures (from [135]) Branches and corresponding substructures in a 3D atomicmodel are colored according to the age of each substructure(nd node distance) Note the early loss of stem P8 in Archaea immediately after the evolutionary assembly of the universal functional coreof the molecule

that the emerging archaeal lineages were subjected to strong reductive evolutionary pressures during the early evolution of a very ancient RNA molecule.

6.3. Phylogenomic Evidence from Protein Domain Structure. G. Caetano-Anollés and D. Caetano-Anollés [41] were the first to utilize protein domain structures as taxa and to reconstruct trees of domains (ToDs) describing their evolution. Figure 10 shows a rooted ToD built from a census of FSF structures in 981 genomes (data taken from [122, 123]). These trees are unique in that their terminal leaves represent a finite set of component parts [87]. These parts describe, at a global level, the structural diversity of the protein world. When building trees of FSFs, the age of each FSF domain structure can be calculated from the ToDs by simply calculating a distance in nodes between the root of the tree and its corresponding terminal leaf. This is possible because ToDs exhibit highly unbalanced (pectinate) topologies that are the result of semipunctuated processes of domain appearance and accumulation. This node distance (nd), rescaled from 0 to 1, provides a relative timescale to study the order of FSF appearance in evolutionary history [126, 143]. Wang et al. [144] showed that nd correlates linearly with geological time and defines a global molecular clock of protein folds. Thus, nd can be used as a reliable proxy for time. Plotting the age of FSFs in each of the seven taxonomic groups confirmed evolutionary statements we had previously deduced from the Venn diagrams of Figure 6. The ABE taxonomic group included the majority of ancient and widely distributed FSFs. This is expected. In the presence of a strong vertical trace, molecular diversity must delimit a nested taxonomic hierarchy. The ABE group was followed by the evolutionary appearance of the BE group, which preceded the first domain-specific structures, which were Bacteria-specific (B). Remarkably, Archaea-specific (A) and Eukarya-specific (E) structures appeared concurrently and relatively late. These general trends, captured in the box plots of Figure 10, have been recovered repeatedly when studying domain structures at various levels of structural complexity, from folds to fold families [91, 126], when using CATH or SCOP structural definitions [122, 145], or when exploring the evolution of terminal GO terms.
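
The node-distance calculation described above can be sketched in a few lines of Python. The fragment counts the internal nodes crossed on the way from the root to each leaf of a small rooted tree and rescales the counts to the 0–1 interval. The `Node` class, the function name `node_distances`, and the five toy leaves are illustrative assumptions; this is not the authors' software, and the exact counting convention used in [122] may differ.

```python
from typing import Dict, List, Optional


class Node:
    """A node of a rooted tree; leaves carry a name, internal nodes carry children."""

    def __init__(self, name: Optional[str] = None,
                 children: Optional[List["Node"]] = None) -> None:
        self.name = name
        self.children = children or []


def node_distances(root: Node) -> Dict[str, float]:
    """Return nd for every leaf: internal nodes below the root on the
    root-to-leaf path, divided by the largest such count in the tree."""
    raw: Dict[str, int] = {}
    stack = [(root, 0)]  # (node, number of edges from the root)
    while stack:
        node, edges = stack.pop()
        if not node.children:
            # Assumed convention: a leaf attached directly to the root gets 0.
            raw[node.name] = max(edges - 1, 0)
        else:
            for child in node.children:
                stack.append((child, edges + 1))
    deepest = max(raw.values())
    if deepest == 0:  # degenerate tree: every leaf hangs from the root
        return {leaf: 0.0 for leaf in raw}
    return {leaf: count / deepest for leaf, count in raw.items()}


# Hypothetical pectinate (comb-like) tree with five made-up FSF leaves.
toy_tree = Node(children=[
    Node("fsf_A"),
    Node(children=[
        Node("fsf_B"),
        Node(children=[
            Node("fsf_C"),
            Node(children=[Node("fsf_D"), Node("fsf_E")]),
        ]),
    ]),
])

nd = node_distances(toy_tree)
for leaf in sorted(nd):
    print(leaf, round(nd[leaf], 3))
```

In a pectinate topology each successive leaf sits one node further from the root, which is why a simple node count can order the leaves along a relative timeline.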

While the early rise of BE FSFs supports the early divergence of Archaea from the urancestor, the very significant trend of gradual loss of structures occurring in the lineages of the archaeal domain and the very late appearance of Archaea-specific structures (e.g., [126]) demand explanation. Since ancient BE FSFs are widely distributed in proteomes (Figure 10), they cannot arise from separate gains of FSFs in Bacteria and Eukarya or by processes of horizontal spread of structures. This was already evident from the Venn diagrams of Figure 6. Moreover, the BE sisterhood to the exclusion of Archaea was further supported by the inspection of FSFs involved in lipid synthesis and transport (Table 1). Membrane lipids are very relevant to the origins of diversified life ([146] and references therein). Bacteria and Eukarya encode similar lipid membranes, while archaeal membranes have a different lipid composition (isoprenoid ethers). To check if

[Figure 10 image not reproduced. Tree of domains with boxplots of the age of FSFs (nd, 0.0 to 1.0) marking the origin of FSF domains in the taxonomic groups ABE, BE, AB, AE, B, A, and E. Tree statistics: taxa = 1733 FSF domains; characters = 981 genomes; tree length = 458226 steps; CI = 0.049, RI = 0.796, and g1 = −0.08. Basal FSFs labeled c.55.1, c.37.1, b.40.4, c.2.1, a.4.5, and c.66.1. Color legend by distribution index f, with the number of FSFs in parentheses: 0.0 < f < 0.1 (529); 0.1 < f < 0.5 (623); 0.5 < f < 0.9 (336); 0.9 < f < 1.0 (228); f = 1.0 (17).]

Figure 10: Phylogenomic tree of domains (ToD) describing the evolution of 1733 FSF domain structures. Taxa (FSFs) were colored according to their distribution (f) in the 981 genomes that were surveyed and used as characters to reconstruct the phylogenomic tree (data taken from [122, 123]). The most basal FSFs are labeled with SCOP alphanumeric identifiers (e.g., c.37.1 is the P-loop containing nucleoside triphosphate hydrolase FSF). Boxplots display the age (nd value) distribution of FSFs for the seven possible taxonomic groups. nd values were calculated directly from the tree [122] and define a timeline of FSF innovation from the origin of proteins (nd = 0) to the present (nd = 1). The group of FSFs that are shared by the three domains of life (ABE) is the most ancient taxonomic group, which spans the entire time axis, and their FSFs are widely distributed in genomes. The appearance of the BE group coincides with the first reductive loss of an FSF in Archaea. FSF structures specific to domains of life appear much later in evolution.

lipid synthesis was another BE synapomorphy, we identified 17 FSFs that were involved in lipid metabolism and transport

Table 1: List of FSFs involved in lipid metabolism and transport along with taxonomic distribution (data taken from [122, 123]).

Group  SCOP Id  FSF Id    FSF description
ABE    89392    b.125.1   Prokaryotic lipoproteins and lipoprotein localization factors
ABE    53092    c.55.2    Creatinase/prolidase N-terminal domain
ABE    49723    b.12.1    Lipase/lipooxygenase domain (PLAT/LH2 domain)
ABE    54637    d.38.1    Thioesterase/thiol ester dehydrase-isomerase
ABE    63825    b.68.5    YWTD domain
BE     47027    a.11.1    Acyl-CoA binding protein
BE     48431    a.118.4   Lipovitellin-phosvitin complex, superhelical domain
BE     55048    d.58.23   Probable ACP-binding domain of malonyl-CoA ACP transacylase
BE     56968    f.7.1     Lipovitellin-phosvitin complex, beta-sheet shell regions
BE     58113    h.5.1     Apolipoprotein A-I
BE     47162    a.24.1    Apolipoprotein
BE     56931    f.4.2     Outer membrane phospholipase A (OMPLA)
B      82220    b.120.1   Tp47 lipoprotein, N-terminal domain
E      47699    a.52.1    Bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin
E      82936    h.6.1     Apolipoprotein A-II
E      57190    g.3.10    Colipase-like
E      49594    b.7.4     Rab geranylgeranyltransferase alpha-subunit, insert domain

(Table 1). Remarkably, the largest share of these FSFs (7 out of 17) was unique to the BE group. In comparison, none were present in either the AE or AB groups. The ABE group included five universal FSFs, while one was unique to Bacteria and four were Eukarya-specific. In turn, no FSF was unique to Archaea. The BE FSFs cannot be explained by modern effects impinging on variations in proteomic accumulation in FSFs or by processes of domain rearrangement, since these appear in the protein world quite late in evolution [92]. The only and most parsimonious explanation of the patterns of FSF distribution that unfold in the ToD is the very early (and protracted) rise of the archaeal domain by processes of reductive evolution, possibly triggered by the adaptation of urancestral lineages to harsh environments and survival modes. Under extremophilic conditions typical of hyperthermophilic environments, considerable investments of matter-energy and information must be made for protein persistence [120]. This puts limits on viable protein structures [147]. Extremophilic environments will thus favor the maintenance of a limited set of FSFs for persistence of emergent diversified lineages. This would induce a primordial episode of reductive evolution in the growing FSF repertoire, explaining why hyperthermophilic and thermophilic archaeal species hold the most reduced proteomes (Figure 7). It would also explain the biases that exist in FSFs, GO terms, and RNA families (Figure 6) and the placement of hyperthermophilic and thermophilic archaeal species at the base of ToLs. Since Archaea populate the oceans and sometimes rival Bacteria in number in those environments, we further interpret the late appearance of Archaea-specific FSFs as the result of late colonization of these mild environments by both ancient archaeons and emerging eukaryotes. This relaxes primordial extremophilic pressures on protein structures and enables the late archaeal exploration of structural flexibility and functional novelty.
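
The group counts quoted in this paragraph can be checked directly against Table 1. The short Python tally below transcribes the group assignment of each of the 17 lipid-related FSFs (SCOP identifiers written in dotted form, e.g., b.12.1) and counts them per taxonomic group. The variable names are illustrative assumptions, and the data are limited to what Table 1 lists.

```python
from collections import Counter

# Taxonomic group of each lipid-related FSF, transcribed from Table 1.
LIPID_FSF_GROUPS = {
    "b.125.1": "ABE", "c.55.2": "ABE", "b.12.1": "ABE",
    "d.38.1": "ABE", "b.68.5": "ABE",
    "a.11.1": "BE", "a.118.4": "BE", "d.58.23": "BE", "f.7.1": "BE",
    "h.5.1": "BE", "a.24.1": "BE", "f.4.2": "BE",
    "b.120.1": "B",
    "a.52.1": "E", "h.6.1": "E", "g.3.10": "E", "b.7.4": "E",
}

# The seven possible taxonomic groups of a three-domain Venn diagram.
ALL_GROUPS = ["ABE", "BE", "AB", "AE", "A", "B", "E"]

counts = Counter(LIPID_FSF_GROUPS.values())
summary = {group: counts.get(group, 0) for group in ALL_GROUPS}

for group in ALL_GROUPS:
    print(f"{group:>3}: {summary[group]}")
```

The tally reproduces the distribution stated in the text: BE is the largest group, and the AB, AE, and A groups are empty.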

6.4. Full Circle: Evidence from Trees of Proteomes and Functionomes and a Tree Derived from the Distribution of Viral Replicons in Superkingdoms. While we distrust trees of systems, especially ToLs built from sequences, the use of molecular structure at high levels of structural abstraction has the potential to mitigate some limitations of sequence analysis [93]. For example, rooted ToLs built from abundance counts of domain structures and terminal GO terms in the genomes of free-living organisms describe the evolution of proteomes (e.g., [91]). All ToL reconstructions of these kinds approximate the physiology of living organisms, dissect the three primary domains of life, and reveal the early paraphyletic origin of extremophilic archaeal lineages, followed by the late appearances of monophyletic Bacteria and Eukarya. These patterns have been reliably recovered with datasets of varying sizes, irrespective of the structural classification scheme [91, 122, 126, 145]. Even a tree reconstructed from the distribution of 2662 viral replicons in superkingdoms, obtained from an exhaustive comparative genomic analysis of viral genomes, showed the basal placement of Archaea and the sister taxa relationship between Bacteria and Eukarya (Figure 11).
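
Trees built from abundance counts need the raw counts converted into a bounded set of ordered character states; the caption of Figure 11, for instance, reports abundances from 0 to 759 scored on a 0–20 scale. The text does not state the transformation used, so the sketch below assumes a logarithmic rescaling, which is one common way to compress skewed counts. The function name `scale_abundance` and its parameters are hypothetical and are not taken from the cited studies.

```python
import math


def scale_abundance(count: int, max_count: int, n_states: int = 20) -> int:
    """Map a raw abundance onto an integer character state in 0..n_states.

    Assumption: log-scaling, round(ln(count + 1) / ln(max_count + 1) * n_states).
    The source text does not specify the transformation actually applied.
    """
    if max_count <= 0:
        raise ValueError("max_count must be positive")
    if not 0 <= count <= max_count:
        raise ValueError("count must lie between 0 and max_count")
    return round(math.log(count + 1) / math.log(max_count + 1) * n_states)


# Range reported in the Figure 11 caption: 0 to 759 viral replicons.
MAX_REPLICONS = 759
for raw in (0, 5, 36, 759):
    print(raw, "->", scale_abundance(raw, MAX_REPLICONS))
```

Under this assumed rescaling the states are ordered, so they can be treated as linearly ordered multistate characters in a parsimony analysis.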

6.5. Additional Evidence from Comparative Genomics. The uneven distribution of protein domain structures in the world

[Figure 11 image not reproduced. Rooted tree of the superkingdoms Archaea, Bacteria, and Eukarya with a bootstrap support value of 100, vectors of reconstructed abundance at internal nodes, and a scale bar of 5 changes.]

Figure 11: One of three optimal phylogenetic tree reconstructions with identical topologies recovered from an exhaustive maximum parsimony search (58 steps; CI = 1, RI = 1, HI = 0, RC = 1, g1 = −0.707) of the abundance of viral replicon types of dsDNA, ssDNA, dsRNA, ssRNA(+), ssRNA(−), and retrotranscribing viruses. Abundance was scored on a 0–20 scale and ranged from 0 to 759 viral replicons. Vectors of abundance reconstructions in internal nodes are given as percentage of total abundance of replicons in superkingdoms. Bootstrap support values are shown below nodes.

of proteomes (Figure 6) is preserved as we climb in the structural hierarchy. This was recently made evident when studying the evolution of CATH domains [145]. The Venn diagrams of Figure 12 show how domain structures in all taxonomic groups decrease in numbers with increases in evolutionary conservation. At the highest CATH architectural level, there were 32 universal architectures but no domain-specific architectures. The four architectures shared by Bacteria and Eukarya were present in at least 60% of the proteomes that were surveyed. The other two interdomain architectures were topological designs of considerable complexity that were poorly shared between proteomes and were evolutionarily derived. They were the clam architectures lost in Eukarya and the box architectures of nucleotide excision repair shared by Archaea and Eukarya. The most parsimonious corollary of these distribution patterns is that the BE taxonomical group must arise by loss of structures in Archaea. Indeed, ToDs describing the evolution of CATH domain structures again confirm the early appearance of BE structures and, consequently, their loss in Archaea.
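
The Venn-diagram bookkeeping behind these statements amounts to assigning each structure to one of the seven taxonomic groups according to the superkingdoms in which it occurs. The following Python sketch illustrates that assignment under stated assumptions: the function `taxonomic_group` and the four toy structures are hypothetical and stand in for a real census of CATH or SCOP domains.

```python
from collections import Counter
from typing import Dict, Iterable, Set

# Fixed order so that labels always read A, then B, then E.
DOMAIN_ORDER = ("Archaea", "Bacteria", "Eukarya")


def taxonomic_group(present_in: Iterable[str]) -> str:
    """Return the Venn group label (e.g. 'ABE', 'BE', 'A') for the set of
    superkingdoms in which a structure was detected."""
    found: Set[str] = set(present_in)
    unknown = found - set(DOMAIN_ORDER)
    if unknown or not found:
        raise ValueError(f"expected a non-empty subset of {DOMAIN_ORDER}")
    return "".join(domain[0] for domain in DOMAIN_ORDER if domain in found)


# Hypothetical occurrence data: structure name -> superkingdoms of occurrence.
toy_occurrence: Dict[str, Set[str]] = {
    "structure_universal": {"Archaea", "Bacteria", "Eukarya"},
    "structure_shared_be": {"Bacteria", "Eukarya"},
    "structure_shared_ae": {"Archaea", "Eukarya"},
    "structure_bacterial": {"Bacteria"},
}

group_sizes = Counter(taxonomic_group(doms) for doms in toy_occurrence.values())
for label in ("ABE", "BE", "AB", "AE", "A", "B", "E"):
    print(label, group_sizes[label])
```

Counting the labels over a full census yields the cell sizes of a three-set Venn diagram, from which patterns such as an empty Archaea-specific group can be read off.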

7. Paraphyletic Origins: Grades and Clades in Archaeal History

Saying that “a ToL is rooted in a domain of life” is an incorrect statement that comes from phylogenetic methodology (the use of outgroups) and the tendency to look at the past with modern eyes. Clades in ToLs have been rooted relative to each other by generating unrooted trees and by defining extant organisms as outgroup taxa. The ToL, however, must be considered rooted in the urancestor of cellular life (Figure 1). This planted edge that connects to the ingroup of extant organisms represents a cellular state in which productive diversification (in the sense of successful lineages) was absent. The primordial urancestral edge leads to a “phase transition,” the last universal cellular ancestor (LUCA), of which little is known. The physiology of the urancestor cannot be considered linked to that of any extant organism, even if it shared a common molecular core with all of them. The urancestor was not an archaeon, a bacterium, or a eukaryote [148]. It was not necessarily thermophilic. Perhaps it was a communal entity or a megametaorganism, in the sense of a modern syncytium (the result of multiple cell fusions) and a modern coenocyte (the result of multiple nuclear divisions without cell division). The organismal boundaries were likely present, judging by the number of widely distributed protein domains that associate with membranes and appear at the base of our ToDs and by the universal existence of acidocalcisome organelles [149]. However, the molecular makeup of the urancestral cells was most likely fluid and quasistatistical, with the repertoire distributed unequally in the urancestral populations of communal parts, of course within confines delimited by persistence. This urancestral population is therefore consistent with the idea of a primordial stem line proposed by Kandler and Woese [150, 151]. However, it was relatively richer in molecular structures and functions, as opposed to the simple cellular systems hypothesized by Woese. This richness is confirmed by modern analyses of proteomes and functionomes that reveal a vast number of universally shared protein domains and GOs among the three superkingdoms. While each syncytial/coenocyte element of the megaorganism exchanged component parts in search of cellular stability and persistence, the process could not be equated with modern HGT. The exchanging community of primordial cells was not cohesive enough to make the horizontal exchange meaningful. Macromolecules most likely established loose and diverse associations with each other and with smaller molecules, limited by the short average life of their unevolved structural conformations. With time, molecules with better-optimized properties engaged in more durable interactions, stabilizing the emergent cells and providing increased cellular cohesiveness. This poised the urancestral community towards a phase transition (a crystallization [148]), a point in which cellular groups had distinct properties and could be individuated. We believe this was the time of the origin of the archaeal lineage, 2.9 billion years ago [48].

At the base of the ToLs that were reconstructed from genomic data, basal archaeal taxa arise as paraphyletic lineages (Figures 4(a) and 4(c)). These lineages likely arose from subgroups of the urancestral population that pervasively lost crucial domain structures and molecular functions. This represents an evolutionary grade under the scenario described above. The emerging lineages shared with the urancestral community a unifying condition that was related to archaic biochemistry. In other words, the urancestral and emerging archaeal lineages expressed fundamental structural and functional equivalences in terms of their repertoires but revealed, in each emerging and durable paraphyletic lineage, a handful of distinct, newly developed traits. These traits could be global, such as increased thermostability of some crucial members of the protein repertoire or change in the membrane makeup, or local, such as the selective loss of crucial structures and functions. Figure 13 uses the tree paradigm to portray the structural and functional equivalences of the urancestral and emerging archaeal lineages and the slow progression from grades to clades.

[Figure 12 image not reproduced. Three Venn diagrams of Archaea, Bacteria, and Eukarya, one each for CATH superfamilies, topologies, and architectures, arranged along an axis of increasing evolutionary conservation.]

Figure 12: Venn diagrams displaying the distributions of 2221 homologous superfamilies, 1152 topologies, and 38 architectures of CATH domains in the proteomes of 492 fully sequenced genomes (from [145]). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

[Figure 13 image not reproduced. Cartoon of lineages progressing from grades to clades between Time 0 and Time 1.]

Figure 13: From grades to clades. The cartoon describes a possible progression of modes of organismal diversification during the rise of primordial archaeal lineages. The width of emerging lineages is proportional to uniquely identifying features of physiological and molecular complexity.

8. Through the Wormhole: The Makeup of the “Megaorganism” and the Emerging Archaeal Lineages

Character state reconstructions of proteome repertoires derived from ToLs, coupled to the timelines of ToDs, provide an effective way to define the ancestral protein domain complement of the urancestor [48] and, consequently, the likely makeup of the emerging archaeal lineages. The urancestral proteome possessed a lower bound of ~70 FSF domain structures, 75% of which were composed of α/β and α+β proteins. About 50% of FSFs were part of metabolic enzymes, including a rich toolkit of transferases and enzymes of nucleotide metabolism. The rest of the domains were involved in functions related to information (translation, replication, and repair), intracellular processes (transport, protein modification, and proteolytic activities), regulation (kinases/phosphatases and DNA binding functions), and small molecule binding. The urancestor had a limited repertoire of aaRSs and translation factors. It contained a primordial ribosome with a limited core of universal ribosomal proteins. It had numerous membrane proteins necessary for transport, including a relatively advanced ATP synthetase complex, and structures necessary for cellular organization (filaments and primordial cytoskeletal structures). The cells lacked enzymes for deoxyribonucleotide production, so it is likely that the cellular urancestor itself did not harbor a DNA genome. The cells also lacked functions related to extracellular processes (cell adhesion, immune response, and toxins/defense) and cellular motility, suggesting an ancient living world without competitive strategies of survival.

9. Conclusions

The rooting of the ToL has always been controversial in evolutionary biology [26, 152, 153]. While it is popularly accepted that the ToL based on sequence phylogenies is rooted in the akaryotes and that Archaea and Eukarya are sister groups to each other, only two of the three main steps of phylogenetic analysis [104] have been partially fulfilled with sequences. These include selecting an appropriate statistical or nonstatistical evolutionary model of character change and an optimization method for phylogenetic tree reconstruction. However, no adequate method exists for character polarization that identifies ancestral and derived character states in sequences. In the absence of robust polarization methodology, any statement about the rooting of the ToL should be considered suspect or subject to apriorism. Here we show that information derived from a genomic structural and functional census of millions of encoded proteins and RNAs, coupled with process models that comply with Weston’s generality criterion, provides the means to dissect the origins of diversified life. The generality criterion is fulfilled in these studies by focusing on the accumulation of modules such as protein domain structures, elements of RNA substructures, or ontogenetic definitions of molecular function. In general, these features are the subject of accretion processes that comply with additive phylogenetic change within the nested taxonomic hierarchy and result in changes of abundance. These processes include those responsible for the growth of molecules (e.g., multidomain proteins), molecular ensembles (e.g., the ribosome), and molecular repertoires (e.g., proteomes). The new methods unfold a consistent evolutionary scenario in which the origin of diversified life traces back to the early history of Archaea. Remarkably, the archaic origin of this microbial urkingdom now does justice to its name.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank members and friends of the GCA laboratory for fruitful discussions. This research has been supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791) and the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625) to Gustavo Caetano-Anollés and from the KRIBB Research Initiative Program and the Next-Generation BioGreen 21 Program, Rural Development Administration (PJ0090192013), to Kyung Mo Kim.

References

[1] C. R. Woese, “A new biology for a new century,” Microbiology and Molecular Biology Reviews, vol. 68, no. 2, pp. 173–186, 2004.

[2] T. Grant, “Testing methods: the evaluation of discovery operations in evolutionary biology,” Cladistics, vol. 18, no. 1, pp. 94–111, 2002.

[3] E. O. Wiley, “Karl R. Popper, systematics, and classification: a reply to Walter Bock and other evolutionary taxonomists,” Systematic Zoology, vol. 24, no. 2, pp. 233–243, 1975.

[4] J. S. Farris, “Parsimony and explanatory power,” Cladistics, vol. 24, no. 5, pp. 825–847, 2008.

[5] G. Caetano-Anollés, K. M. Kim, and D. Caetano-Anollés, “The phylogenomic roots of modern biochemistry: origins of proteins, cofactors and protein biosynthesis,” Journal of Molecular Evolution, vol. 74, no. 1-2, pp. 1–34, 2012.

[6] W. Fontana, “Modeling ‘evo-devo’ with RNA,” BioEssays, vol. 24, no. 12, pp. 1164–1177, 2002.

[7] W. Hennig, Phylogenetic Systematics, University of Illinois Press, Urbana, Ill, USA, 1999.

[8] D. Fraix-Burnet, T. Chattopadhyay, A. K. Chattopadhyay, E. Davoust, and M. Thuillard, “A six-parameter space to describe galaxy diversification,” Astronomy and Astrophysics, vol. 545, article A80, 24 pages, 2012.

[9] E. Sober, “The contest between parsimony and likelihood,” Systematic Biology, vol. 53, no. 4, pp. 644–653, 2004.

[10] E. K. Lienau and R. DeSalle, “Is the microbial tree of life verificationist?” Cladistics, vol. 26, no. 2, pp. 195–201, 2010.

[11] J. S. Huxley, “Evolutionary processes and taxonomy with special reference to grades,” Uppsala Universitets Årsskrift, vol. 6, pp. 21–39, 1958.

[12] C. R. Woese and G. E. Fox, “Phylogenetic structure of the prokaryotic domain: the primary kingdoms,” Proceedings of the National Academy of Sciences of the United States of America, vol. 74, no. 11, pp. 5088–5090, 1977.

[13] P. Forterre, “The universal tree of life and the last universal cellular ancestor: revolution and counterrevolutions,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 43–62, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[14] J. P. Gogarten, H. Kibak, P. Dittrich et al., “Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 17, pp. 6661–6665, 1989.

[15] N. Iwabe, K. Kuma, M. Hasegawa, S. Osawa, and T. Miyata, “Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 23, pp. 9355–9359, 1989.

[16] J. P. Gogarten and L. Olendzenski, “Orthologs, paralogs and genome comparisons,” Current Opinion in Genetics and Development, vol. 9, no. 6, pp. 630–636, 1999.

[17] C. R. Woese, O. Kandler, and M. L. Wheelis, “Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya,” Proceedings of the National Academy of Sciences of the United States of America, vol. 87, no. 12, pp. 4576–4579, 1990.

[18] J. Martijn and T. J. Ettema, “From archaeon to eukaryote: the evolutionary dark ages of the eukaryotic cell,” Biochemical Society Transactions, vol. 41, no. 1, pp. 451–457, 2013.

[19] C. J. Cox, P. G. Foster, R. P. Hirt, S. R. Harris, and T. M. Embley, “The archaebacterial origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 51, pp. 20356–20361, 2008.

[20] P. G. Foster, C. J. Cox, and T. M. Embley, “The primary divisions of life: a phylogenomic approach employing composition-heterogeneous methods,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2197–2207, 2009.

[21] T. A. Williams, P. G. Foster, T. M. Nye, C. J. Cox, and T. M. Embley, “A congruent phylogenomic signal places eukaryotes within the Archaea,” Proceedings of the Royal Society B: Biological Sciences, vol. 279, no. 1749, pp. 4870–4879, 2012.

[22] S. Gribaldo, A. M. Poole, V. Daubin, P. Forterre, and C. Brochier-Armanet, “The origin of eukaryotes and their relationship with the Archaea: are we at a phylogenomic impasse?” Nature Reviews Microbiology, vol. 8, no. 10, pp. 743–752, 2010.

[23] H. Philippe and P. Forterre, “The rooting of the universal tree of life is not reliable,” Journal of Molecular Evolution, vol. 49, no. 4, pp. 509–523, 1999.

[24] E. Sober and M. Steel, “Testing the hypothesis of common ancestry,” Journal of Theoretical Biology, vol. 218, no. 4, pp. 395–408, 2002.

[25] P. Forterre, “The common ancestor of Archaea and Eukarya was not an archaeon,” Archaea, vol. 2013, Article ID 372396, 18 pages, 2013.

[26] P. Forterre and H. Philippe, “Where is the root of the universal tree of life?” BioEssays, vol. 21, no. 10, pp. 871–879, 1999.

[27] S. Gribaldo and H. Philippe, “Ancient phylogenetic relationships,” Theoretical Population Biology, vol. 61, no. 4, pp. 391–408, 2002.

[28] J. K. Harris, S. T. Kelley, G. B. Spiegelman, and N. R. Pace, “The genetic core of the universal ancestor,” Genome Research, vol. 13, no. 3, pp. 407–412, 2003.

[29] A. Harish and G. Caetano-Anollés, “Ribosomal history reveals origins of modern protein synthesis,” PLoS ONE, vol. 7, no. 3, Article ID e32776, 2012.

[30] G. Caetano-Anollés and M. J. Seufferheld, “The coevolutionary roots of biochemistry and cellular organization challenge the RNA world paradigm,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 152–177, 2013.

[31] N. R. Pace, “Mapping the tree of life: progress and prospects,” Microbiology and Molecular Biology Reviews, vol. 73, no. 4, pp. 565–576, 2009.

[32] P. de Rijk, Y. van de Peer, I. van den Broeck, and R. de Wachter, “Evolution according to large ribosomal subunit RNA,” Journal of Molecular Evolution, vol. 41, no. 3, pp. 366–375, 1995.

[33] G. Caetano-Anollés, “Tracing the evolution of RNA structure in ribosomes,” Nucleic Acids Research, vol. 30, no. 11, pp. 2575–2587, 2002.

[34] J. Mallatt and C. J. Winchell, “Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes,” Molecular Biology and Evolution, vol. 19, no. 3, pp. 289–301, 2002.

[35] M. Gerstein, “Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census,” Proteins: Structure, Function, and Bioinformatics, vol. 33, no. 4, pp. 518–534, 1998.

[36] M. Gerstein and H. Hegyi, “Comparing genomes in terms of protein structure: surveys of a finite parts list,” FEMS Microbiology Reviews, vol. 22, no. 4, pp. 277–304, 1998.

[37] B. Snel, P. Bork, and M. A. Huynen, “Genome phylogeny based on gene content,” Nature Genetics, vol. 21, no. 1, pp. 108–110, 1999.

[38] F. Tekaia, A. Lazcano, and B. Dujon, “The genomic tree as revealed from whole proteome comparisons,” Genome Research, vol. 9, no. 6, pp. 550–557, 1999.

[39] Y. I. Wolf, S. E. Brenner, P. A. Bash, and E. V. Koonin, “Distribution of protein folds in the three superkingdoms of life,” Genome Research, vol. 9, no. 1, pp. 17–26, 1999.

[40] J. O. Korbel, B. Snel, M. A. Huynen, and P. Bork, “SHOT: a web server for the construction of genome phylogenies,” Trends in Genetics, vol. 18, no. 3, pp. 158–162, 2002.

[41] G. Caetano-Anollés and D. Caetano-Anollés, “An evolutionarily structured universe of protein architecture,” Genome Research, vol. 13, no. 7, pp. 1563–1571, 2003.

[42] S. L. Baldauf, A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle, “A kingdom-level phylogeny of eukaryotes based on combined protein data,” Science, vol. 290, no. 5493, pp. 972–977, 2000.

[43] J. R. Brown, C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope, “Universal trees based on large combined protein sequence data sets,” Nature Genetics, vol. 28, no. 3, pp. 281–285, 2001.

[44] F. D. Ciccarelli, T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P. Bork, “Toward automatic reconstruction of a highly resolved tree of life,” Science, vol. 311, no. 5765, pp. 1283–1287, 2006.

[45] T. Dagan, M. Roettger, D. Bryant, and W. Martin, “Genome networks root the tree of life between prokaryotic domains,” Genome Biology and Evolution, vol. 2, no. 1, pp. 379–392, 2010.

[46] S. Jun, G. E. Sims, G. A. Wu, and S. Kim, “Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 133–138, 2010.

[47] E. A. Schultes, P. T. Hraber, and T. H. LaBean, “No molecule is an island: molecular evolution and the study of sequence space,” in Algorithmic Bioprocesses, A. Condon, D. Harel, J. N. Kok, A. Salomaa, and E. Winfree, Eds., pp. 675–704, Springer, Berlin, Germany, 2009.

[48] K. M. Kim and G. Caetano-Anollés, “The proteomic complexity and rise of the primordial ancestor of diversified life,” BMC Evolutionary Biology, vol. 11, no. 1, article 140, 2011.

[49] V. A. Albert, Parsimony, Phylogeny, and Genomics, Oxford University Press, Oxford, UK, 2005.

[50] G. Caetano-Anollés, “Evolved RNA secondary structure and the rooting of the universal tree of life,” Journal of Molecular Evolution, vol. 54, no. 3, pp. 333–345, 2002.

[51] J. A. Lake, R. G. Skophammer, C. W. Herbold, and J. A. Servin, “Genome beginnings: rooting the tree of life,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2177–2185, 2009.

[52] B. Guo, M. Zou, and A. Wagner, “Pervasive indels and their evolutionary dynamics after the fish-specific genome duplication,” Molecular Biology and Evolution, vol. 29, no. 10, pp. 3005–3022, 2012.

[53] M. K. Basu, I. B. Rogozin, O. Deusch, T. Dagan, W. Martin, and E. V. Koonin, “Evolutionary dynamics of introns in plastid-derived genes in plants: saturation nearly reached but slow intron gain continues,” Molecular Biology and Evolution, vol. 25, no. 1, pp. 111–119, 2008.

[54] J. S. Farris, “Phylogenetic analysis under Dollo’s Law,” Systematic Biology, vol. 26, no. 1, pp. 77–88, 1977.

[55] H. Fang, M. E. Oates, R. B. Pethica et al., “A daily-updated tree of (sequenced) life as a reference for genome research,” Scientific Reports, vol. 3, article 2015, 2013.

[56] T. Cavalier-Smith, “The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification,” International Journal of Systematic and Evolutionary Microbiology, vol. 52, no. 1, pp. 7–76, 2002.

[57] T. Cavalier-Smith, “Rooting the tree of life by transition analyses,” Biology Direct, vol. 1, article 19, 2006.

[58] H. S. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Widespread recruitment of ancient domain structures in modern enzymes during metabolic evolution,” Journal of Integrative Bioinformatics, vol. 10, no. 1, article 214, 2013.

[59] G. Caetano-Anollés, M. Wang, and D. Caetano-Anollés, “Structural phylogenomics retrodicts the origin of the genetic code and uncovers the evolutionary impact of protein flexibility,” PLoS ONE, vol. 8, no. 8, Article ID e72225, 2013.

[60] E. Bapteste and C. Brochier, “On the conceptual difficulties in rooting the tree of life,” Trends in Microbiology, vol. 12, no. 1, pp. 9–13, 2004.

[61] A. M. Poole, “Horizontal gene transfer and the earliest stages of the evolution of life,” Research in Microbiology, vol. 160, no. 7, pp. 473–480, 2009.

[62] D. Raoult, “The post-Darwinist rhizome of life,” The Lancet, vol. 375, no. 9709, pp. 104–105, 2010.

[63] W. F. Doolittle, “Phylogenetic classification and the universal tree,” Science, vol. 284, no. 5423, pp. 2124–2128, 1999.

[64] T. Dagan and W. Martin, “The tree of one percent,” Genome Biology, vol. 7, no. 10, article 118, 2006.

[65] W. F. Doolittle and E. Bapteste, “Pattern pluralism and the tree of life hypothesis,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 7, pp. 2043–2049, 2007.

[66] E. V. Koonin, “Towards a postmodern synthesis of evolutionary biology,” Cell Cycle, vol. 8, no. 6, pp. 799–800, 2009.

[67] T. Kloesges, O. Popa, W. Martin, and T. Dagan, “Networks of gene sharing among 329 proteobacterial genomes reveal differences in lateral gene transfer frequency at different phylogenetic depths,” Molecular Biology and Evolution, vol. 28, no. 2, pp. 1057–1074, 2011.

[68] S. Halary, J. W. Leigh, B. Cheaib, P. Lopez, and E. Bapteste, “Network analyses structure genetic diversity in independent genetic worlds,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 127–132, 2010.

[69] S. S. Abby, E. Tannier, M. Gouy, and V. Daubin, “Lateral gene transfer as a support for the tree of life,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 13, pp. 4962–4967, 2012.

[70] N. Glansdorff, Y. Xu, and B. Labedan, “The conflict between horizontal gene transfer and the safeguard of identity: origin of meiotic sexuality,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 470–480, 2009.

[71] C. G. Kurland and O. G. Berg, “A hitchhiker’s guide to evolving networks,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 361–396, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[72] L. P. Villarreal and G. Witzany, “The DNA habitat and its RNA inhabitants: at the dawn of RNA sociology,” Genomics Insights, vol. 6, pp. 1–12, 2013.

[73] S. Gribaldo and C. Brochier, “Phylogeny of prokaryotes: does it exist and why should we care?” Research in Microbiology, vol. 160, no. 7, pp. 513–521, 2009.

[74] B. D. Mishler, “The logic of the data matrix in phylogenetic analysis,” in Parsimony, Phylogeny, and Genomics, V. A. Albert, Ed., pp. 57–69, Oxford University Press, New York, NY, USA, 2005.

[75] A. Dress, V. Moulton, M. Steel, and T. Wu, “Species, clusters and the ‘tree of life’: a graph-theoretic perspective,” Journal of Theoretical Biology, vol. 265, no. 4, pp. 535–542, 2010.

[76] N. Lane and W. Martin, “The energetics of genome complexity,” Nature, vol. 467, no. 7318, pp. 928–934, 2010.

[77] M. C. Rivera and J. A. Lake, “The ring of life provides evidence for a genome fusion origin of eukaryotes,” Nature, vol. 431, no. 7005, pp. 152–155, 2004.

[78] D. Alvarez-Ponce and J. O. McInerney, “The human genome retains relics of its prokaryotic ancestry: human genes of archaebacterial and eubacterial origin exhibit remarkable differences,” Genome Biology and Evolution, vol. 3, no. 1, pp. 782–790, 2011.

[79] A. Poole and D. Penny, “Eukaryote evolution: engulfed by speculation,” Nature, vol. 447, no. 7147, p. 913, 2007.

[80] E. V. Koonin and Y. I. Wolf, “The fundamental units, processes and patterns of evolution, and the tree of life conundrum,” Biology Direct, vol. 4, no. 1, article 33, 2009.

[81] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Search for a ‘tree of life’ in the thicket of the phylogenetic forest,” Journal of Biology, vol. 8, no. 6, article 59, 2009.

[82] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Seeing the tree of life behind the phylogenetic forest,” BMC Biology, vol. 11, article 46, 2013.

[83] D. B. Wetlaufer, “Nucleation, rapid folding, and globular intrachain regions in proteins,” Proceedings of the National Academy of Sciences of the United States of America, vol. 70, no. 3, pp. 697–701, 1973.

[84] J. S. Richardson, “The anatomy and taxonomy of protein structure,” Advances in Protein Chemistry, vol. 34, pp. 167–339, 1981.

[85] M. Riley and B. Labedan, “Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module,” Journal of Molecular Biology, vol. 268, no. 5, pp. 857–868, 1997.

[86] J. Janin and S. J. Wodak, “Structural domains in proteins and their role in the dynamics of protein function,” Progress in Biophysics and Molecular Biology, vol. 42, pp. 21–78, 1983.

[87] G. Caetano-Anollés, M. Wang, D. Caetano-Anollés, and J. E. Mittenthal, “The origin, evolution and structure of the protein world,” Biochemical Journal, vol. 417, no. 3, pp. 621–637, 2009.

[88] J. Gough, “Convergent evolution of domain architectures (is rare),” Bioinformatics, vol. 21, no. 8, pp. 1464–1471, 2005.

[89] K. Forslund, A. Henricson, V. Hollich, and E. L. L. Sonnhammer, “Domain tree-based analysis of protein architecture evolution,” Molecular Biology and Evolution, vol. 25, no. 2, pp. 254–264, 2008.

[90] S. Yang and P. E. Bourne, “The evolutionary history of protein domains viewed by species phylogeny,” PLoS ONE, vol. 4, no. 12, Article ID e8378, 2009.

[91] K. M. Kim and G. Caetano-Anollés, “The evolutionary history of protein fold families and proteomes confirms that the archaeal ancestor is more ancient than the ancestors of other superkingdoms,” BMC Evolutionary Biology, vol. 12, article 13, 2012.

[92] M. Wang and G. Caetano-Anollés, “The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world,” Structure, vol. 17, no. 1, pp. 66–78, 2009.

[93] G. Caetano-Anollés and A. Nasir, “Benefits of using molecular structure and abundance in phylogenomic analysis,” Frontiers in Genetics, vol. 3, article 172, 2012.

[94] K. Brading and E. Castellani, Symmetries in Physics: Philosophical Reflections, Cambridge University Press, Cambridge, UK, 2003.

[95] J. A. Cotton and M. Wilkinson, “Quantifying the potential utility of phylogenetic characters,” Taxon, vol. 57, no. 1, pp. 131–136, 2008.

[96] E. Mossel and M. Steel, “A phase transition for a random cluster model on phylogenetic trees,” Mathematical Biosciences, vol. 187, no. 2, pp. 189–203, 2004.

[97] M. H. Bailor, X. Sun, and H. M. Al-Hashimi, “Topology links RNA secondary structure with global conformation, dynamics, and adaptation,” Science, vol. 327, no. 5962, pp. 202–206, 2010.

[98] K. Illergård, D. H. Ardell, and A. Elofsson, “Structure is three to ten times more conserved than sequence – a study of structural response in protein cores,” Proteins: Structure, Function and Bioinformatics, vol. 77, no. 3, pp. 499–508, 2009.

[99] G. Caetano-Anollés, S. K. Hee, and J. E. Mittenthal, “The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 22, pp. 9358–9363, 2007.

[100] P. H. Weston, “Methods for rooting cladistic trees,” in Models in Phylogeny Reconstruction, D. J. Siebert, R. W. Scotland, and D. M. Williams, Eds., pp. 125–155, Oxford University Press, Oxford, UK, 1994.

[101] H. N. Bryant, “The polarization of character transformations in phylogenetic systematics: role of axiomatic and auxiliary assumptions,” Systematic Biology, vol. 40, no. 4, pp. 433–445, 1991.

[102] H. N. Bryant and G. Wagner, “Character polarity and the rooting of cladograms,” in The Character Concept in Evolutionary Biology, G. P. Wagner, Ed., pp. 319–338, Academic Press, New York, NY, USA, 2001.

[103] G. Nelson, “Ontogeny, phylogeny, paleontology, and the biogenetic law,” Systematic Biology, vol. 27, no. 3, pp. 324–345, 1978.

[104] P. H. Weston, “Indirect and direct methods in systematics,” in Ontogeny and Systematics, C. J. Humphries, Ed., pp. 27–56, Columbia University Press, New York, NY, USA, 1988.

[105] F.-J. Sun and G. Caetano-Anollés, “The evolutionary history of the structure of 5S ribosomal RNA,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 430–443, 2009.

[106] X. Fang, T. Pan, and T. R. Sosnick, “Mg2+-dependent folding of a large ribozyme without kinetic traps,” Nature Structural Biology, vol. 6, no. 12, pp. 1091–1095, 1999.

[107] F.-J. Sun and G. Caetano-Anollés, “Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses,” PLoS Computational Biology, vol. 4, no. 3, Article ID e1000018, 2008.

[108] F.-J. Sun and G. Caetano-Anollés, “The origin and evolution of tRNA inferred from phylogenetic analysis of structure,” Journal of Molecular Evolution, vol. 66, no. 1, pp. 21–35, 2008.

[109] W. R. Ashby, An Introduction to Cybernetics, Taylor & Francis, London, UK, 1955.

[110] J. G. Lundberg, “Wagner networks and ancestors,” Systematic Biology, vol. 21, no. 4, pp. 398–413, 1972.

[111] P. T. Chippindale and J. J. Wiens, “Weighting, partitioning, and combining characters in phylogenetic analysis,” Systematic Biology, vol. 43, no. 2, pp. 278–287, 1994.

[112] J. Mittenthal, D. Caetano-Anollés, and G. Caetano-Anollés, “Biphasic patterns of diversification and the emergence of modules,” Frontiers in Genetics, vol. 3, article 147, 2012.

[113] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[114] C. A. Orengo, A. Michie, S. Jones, D. T. Jones, M. Swindells, and J. M. Thornton, “CATH – a hierarchic classification of protein domain structures,” Structure, vol. 5, no. 8, pp. 1093–1109, 1997.

[115] G. Sawa, J. Dicks, and I. N. Roberts, “Current approaches to whole genome phylogenetic analysis,” Briefings in Bioinformatics, vol. 4, no. 1, pp. 63–74, 2003.

[116] R. Sathyapriya, J. M. Duarte, H. Stehr, I. Filippis, and M. Lappe, “Defining an essence of structure determining residue contacts in proteins,” PLoS Computational Biology, vol. 5, no. 12, Article ID e1000584, 2009.

[117] D. Penny and L. J. Collins, “Evolutionary genomics leads the way,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 1–16, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[118] M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.

[119] A. Andreeva, D. Howorth, J. Chandonia et al., “Data growth and its impact on the SCOP database: new developments,” Nucleic Acids Research, vol. 36, pp. D419–D425, 2008.

[120] L. S. Yafremava, M. Wielgos, S. Thomas et al., “A general framework of persistence strategies for biological systems helps explain domains of life,” Frontiers in Genetics, vol. 4, article 16, 2013.

[121] J. L. Friar, T. Goldman, and J. Pérez-Mercader, “Genome sizes and the Benford distribution,” PLoS ONE, vol. 7, no. 5, Article ID e36624, 2012.

[122] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya,” BMC Evolutionary Biology, vol. 12, article 156, 2012.

[123] A. Nasir and G. Caetano-Anollés, “Comparative analysis of proteomes and functionomes provides insights into origins of cellular diversification,” Archaea, vol. 2013, Article ID 648746, 13 pages, 2013.

[124] W. Martin and M. Müller, “The hydrogen hypothesis for the first eukaryote,” Nature, vol. 392, no. 6671, pp. 37–41, 1998.

[125] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Global patterns of protein domain gain and loss in superkingdoms,” PLoS Computational Biology, vol. 10, no. 1, Article ID e1003452, 2014.

[126] M. Wang, L. S. Yafremava, D. Caetano-Anollés, J. E. Mittenthal, and G. Caetano-Anollés, “Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world,” Genome Research, vol. 17, no. 11, pp. 1572–1585, 2007.

[127] L. Guy and T. J. G. Ettema, “The archaeal ‘TACK’ superphylum and the origin of eukaryotes,” Trends in Microbiology, vol. 19, no. 12, pp. 580–587, 2011.

[128] D. Schlieper, M. A. Oliva, J. M. Andreu, and J. Löwe, “Structure of bacterial tubulin BtubA/B: evidence for horizontal gene transfer,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 26, pp. 9170–9175, 2005.

[129] M. P. Hoeppner, P. P. Gardner, and A. M. Poole, “Comparative analysis of RNA families reveals distinct repertoires for each domain of life,” PLoS Computational Biology, vol. 8, no. 11, Article ID e1002752, 2012.

[130] H. Xue, K. Tong, C. Marck, H. Grosjean, and J. T. Wong, “Transfer RNA paralogs: evidence for genetic code-amino acid biosynthesis coevolution and an archaeal root of life,” Gene, vol. 310, no. 1-2, pp. 59–66, 2003.

[131] H. Xue, S. Ng, K. Tong, and J. T. Wong, “Congruence of evidence for a Methanopyrus-proximal root of life based on transfer RNA and aminoacyl-tRNA synthetase genes,” Gene, vol. 360, no. 2, pp. 120–130, 2005.

[132] M. Di Giulio, “Nanoarchaeum equitans is a living fossil,” Journal of Theoretical Biology, vol. 242, no. 1, pp. 257–260, 2006.

[133] M. Di Giulio, “The tree of life might be rooted in the branch leading to Nanoarchaeota,” Gene, vol. 401, no. 1-2, pp. 108–113, 2007.

[134] J. T. Wong, J. Chen, W. Mat, S. Ng, and H. Xue, “Polyphasic evidence delineating the root of life and roots of biological domains,” Gene, vol. 403, no. 1-2, pp. 39–52, 2007.

[135] F.-J. Sun and G. Caetano-Anollés, “The ancient history of the structure of ribonuclease P and the early origins of Archaea,” BMC Bioinformatics, vol. 11, article 153, 2010.

[136] C. Marck and H. Grosjean, “tRNomics: analysis of tRNA genes from 50 genomes of Eukarya, Archaea, and bacteria reveals anticodon-sparing strategies and domain-specific features,” RNA, vol. 8, no. 10, pp. 1189–1232, 2002.

[137] W. Mat, H. Xue, and J. T. Wong, “The genomics of LUCA,” Frontiers in Bioscience, vol. 13, no. 14, pp. 5605–5613, 2008.

[138] C. Brochier, S. Gribaldo, Y. Zivanovic, F. Confalonieri, and P. Forterre, “Nanoarchaea: representatives of a novel archaeal phylum or a fast-evolving euryarchaeal lineage related to Thermococcales?” Genome Biology, vol. 6, no. 5, article R42, 2005.

[139] B. Billoud, M. Guerrucci, M. Masselot, and J. S. Deutsch, “Cirripede phylogeny using a novel approach: molecular morphometrics,” Molecular Biology and Evolution, vol. 17, no. 10, pp. 1435–1445, 2000.

[140] L. J. Collins, V. Moulton, and D. Penny, “Use of RNA secondary structure for studying the evolution of RNase P and RNase MRP,” Journal of Molecular Evolution, vol. 51, no. 3, pp. 194–204, 2000.

[141] G. Caetano-Anollés, “Novel strategies to study the role of mutation and nucleic acid structure in evolution,” Plant Cell, Tissue and Organ Culture, vol. 67, no. 2, pp. 115–132, 2001.

[142] F.-J. Sun, S. Fleurdepine, C. Bousquet-Antonelli, G. Caetano-Anollés, and J. Deragon, “Common evolutionary trends for SINE RNA structures,” Trends in Genetics, vol. 23, no. 1, pp. 26–33, 2007.

[143] D. Caetano-Anollés, K. M. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Proteome evolution and the metabolic origins of translation and cellular life,” Journal of Molecular Evolution, vol. 72, no. 1, pp. 14–33, 2011.

[144] M. Wang, Y. Jiang, K. M. Kim et al., “A universal molecular clock of protein folds and its power in tracing the early history of aerobic metabolism and planet oxygenation,” Molecular Biology and Evolution, vol. 28, no. 1, pp. 567–582, 2011.

[145] S. A. Bukhari and G. Caetano-Anollés, “Origin and evolution of protein fold designs inferred from phylogenomic analysis of CATH domain structures in proteomes,” PLoS Computational Biology, vol. 9, no. 3, Article ID e1003009, 2013.

[146] M. Vesteg and J. Krajčovič, “The falsifiability of the models for the origin of eukaryotes,” Current Genetics, vol. 57, no. 6, pp. 367–390, 2011.

[147] I. N. Berezovsky and E. I. Shakhnovich, “Physics and evolution of thermophilic adaptation,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 36, pp. 12742–12747, 2005.

[148] C. Woese, “The universal ancestor,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 12, pp. 6854–6859, 1998.

[149] M. J. Seufferheld and G. Caetano-Anollés, “Phylogenomics supports a cellularly structured urancestor,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 178–191, 2013.

[150] O. Kandler, “Cell wall biochemistry and three-domain concept of life,” Systematic and Applied Microbiology, vol. 16, no. 4, pp. 501–509, 1994.

[151] C. R. Woese, “On the evolution of cells,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 13, pp. 8742–8747, 2002.

[152] S. L. Baldauf, J. D. Palmer, and W. F. Doolittle, “The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny,” Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 15, pp. 7749–7754, 1996.

[153] H. Leffers, J. Kjems, L. Østergaard, N. Larsen, and R. A. Garrett, “Evolutionary relationships amongst archaebacteria. A comparative study of 23 S ribosomal RNAs of a sulphur-dependent extreme thermophile, an extreme halophile and a thermophilic methanogen,” Journal of Molecular Biology, vol. 195, no. 1, pp. 43–61, 1987.

[154] J. Martin, D. Blackburn, and E. Wiley, “Are node-based and stem-based clades equivalent? Insights from graph theory,” PLoS Currents, vol. 2, Article ID RRN1196, 2010.


effectively “trees of 99 percent” and their use very powerful. The fact that domains diversify mostly by vertical descent (e.g., [89–91]) suggests that gene reticulations simply reflect the pervasive and impactful combinatorial effect of domain rearrangements in proteins [92] and perhaps little else. This important claim must be carefully evaluated. It offers the possibility of dissecting how levels of organization impact processes of inheritance in biology.

5. Out of the Impasse: Parts and Wholes in the Evolving Structure of Systems

The rooting of the ToL is clearly muddled by the highly dynamic nature of change in protein and nucleic acid sequences and by the patchiness and reticulation complexities that exist at gene level. However, it is also possible that the problems of the ToL are ultimately technical. We have made the case that the use of molecular sequences is problematic on many grounds, including mutational saturation, definition of homology of sites in sequence alignments, inapplicable characters, taxon sampling and tree imbalance, and different historical signatures in domains of multidomain proteins [93]. In particular, violation of character independence by the mere existence of atomic structure represents a very serious problem that plagues phylogenetic analysis of sequences. We here present a solution to the impasse. We show that the ToL can be rooted with different approaches that focus on structure and function and that its root is congruently placed in Archaea.

Epistemologically, phylogenetic characters must comply with symmetry breaking and the irreversibility of time [94]. In other words, characters must establish transformational homology relationships and serve as independent evidential statements.

(i) Characters Must Be Homologous and Heritable across Tree Terminal Units (Taxa). Character homology is a central and controversial concept that embodies the existence of historical continuity of information [74]. Characters are “basic” evidential statements that make up columns in data matrices used for tree reconstruction. They are conjectures of perceived similarities that are accepted as fact for the duration of the study, are strengthened by Hennigian reciprocal illumination, and can be put to the test through congruence with other characters as these are fit to the trees. To be useful, characters must be heritable and informative across the taxa rows of the data matrix. This can be evaluated, for example, with the cladistic information content (CIC) measure [95]. Finding informative characters can be particularly challenging when the features that are studied change at a fast pace and when taxa sample a wide and consequently deep phylogenetic spectrum. When building a ToL, the highly dynamic nature of change in the sequence makeup of protein or nucleic acid molecules challenges the ability to retrieve reliable phylogenetic signatures across taxa, even if molecules are universally distributed and harbor evolutionarily conserved regions with deep phylogenetic signal (e.g., rRNA). The reason is that, given enough time, functionally or structurally constrained regions of the sequence will be fixed (will be structurally canalized [6]) and will offer little if any phylogenetically meaningful signal to uncover, for example, the universal rRNA core. In turn, mutational saturation of unconstrained regions will quickly erase history.
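As a rough illustration of how the informativeness of a character can be scored, the sketch below computes a cladistic information content for a single binary character. It assumes one common definition, CIC = −log2 of the fraction of unrooted binary trees that display the split induced by the character; the function names, the restriction to binary characters, and the absence of missing data are assumptions of this example and are not taken from [95].

```python
from math import log2

def double_factorial(k: int) -> int:
    """Return k!! for odd k >= -1, using the convention (-1)!! = 1!! = 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def count_unrooted_trees(n: int) -> int:
    """Number of unrooted binary trees on n >= 3 labelled leaves: (2n - 5)!!."""
    return double_factorial(2 * n - 5)

def cic_binary(a: int, b: int) -> float:
    """CIC in bits of a binary character with a taxa in one state and b in the other.

    Assumption: a split with sides of size a and b is displayed by
    (2a - 3)!! * (2b - 3)!! of the (2n - 5)!! unrooted binary trees on n = a + b taxa.
    """
    if min(a, b) < 2:
        return 0.0  # a state seen in fewer than two taxa excludes no tree
    compatible = double_factorial(2 * a - 3) * double_factorial(2 * b - 3)
    # Work with logarithms of exact integers so that large taxon sets do not underflow.
    return log2(count_unrooted_trees(a + b)) - log2(compatible)
```

Under these assumptions an evenly split character excludes more trees than a lopsided one over the same number of taxa, so it carries more potential information.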

The mutational saturation problem was made mathematically explicit by Sober and Steel [24] using “mutual information” and the concept of time as an information-destroying process. Mutual information $I(X;Y)$ between two random variables $X$ and $Y$ is defined by

\[
I(X;Y) = \sum_{x,y} P(X = x, Y = y) \log\left(\frac{P(X = x, Y = y)}{P(X = x)\,P(Y = y)}\right). \tag{1}
\]

When $I(X;Y)$ approaches 0, $X$ and $Y$ become independent and no method can predict $X$ from knowledge of $Y$. Importantly, mutual information approaches 0 as the time between $X$ and $Y$ increases in a Markov chain. Regardless of the use of a parsimony, maximum likelihood, or Bayesian-based framework of analysis, $I(X;Y)$ will be particularly small when sequence sites are saturated by too many substitutions due to high substitution rates or large time scales. Ancestral states at interior nodes of the trees cannot be established with confidence from extant information, even in the most optimal situation of knowing the underlying phylogeny and the model of character evolution. Under simple models, the problem is not mitigated by the fact that the number of terminal leaves of trees and the sources of initial phylogenetic information increase with time. Since a phase transition occurs when substitution probabilities exceed a critical value [96], one way out of the impasse is to find features in sufficient number that change at a much slower pace than sequence sites and test if mutual information is significant and overcomes Fano’s inequality. These features exist in molecular biology and have been used for phylogenetic reconstruction. They are, for example, the 3D fold structures of protein molecules [87] or the stem modules of RNA structures [97], features that change at a very slow rate when compared to associated sequences. For example, protein 3D structural cores evolve linearly with amino acid substitutions per site and change at 3–10 times slower rates than sequences [98]. This high conservation highlights the evolutionary dynamics of molecular structure. Remarkably, rates of change of proteins performing the same function are maintained by functional constraints but accelerate when proteins perform different functions or contain indels. In turn, fold structural diversity explodes into modular structures at low sequence identities, probably triggered by functional diversification. Within the context of structural conservation, the fact that fold structures are structural and evolutionary modules that accumulate in proteomes by gene duplication and rearrangements and spread in biological networks by recruitment (e.g., [99]) also provides a solution to the problems of vanishing phylogenetic signal. Since fold accumulation increases with time in the Markov chain, mutual information must increase, reversing the “data processing inequality” that destroys information and enabling deep evolutionary information.
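The loss of mutual information with time under saturation can be illustrated numerically. The sketch below evaluates $I(X;Y)$ from the definition in (1) for an ancestral state $X$ and a descendant state $Y$ of a single site. It assumes a symmetric two-state continuous-time Markov chain with a uniform ancestral distribution; that toy model, the rate and time values, and the function names are assumptions of this example rather than the setting analyzed by Sober and Steel [24].

```python
from math import exp, log2

def mutual_information(joint):
    """I(X;Y) in bits for a joint distribution given as a nested list joint[x][y]."""
    px = [sum(row) for row in joint]          # marginal of X
    py = [sum(col) for col in zip(*joint)]    # marginal of Y
    total = 0.0
    for x, row in enumerate(joint):
        for y, pxy in enumerate(row):
            if pxy > 0.0:                     # terms with zero probability contribute 0
                total += pxy * log2(pxy / (px[x] * py[y]))
    return total

def two_state_joint(rate, time):
    """Joint distribution of ancestor X and descendant Y (assumed symmetric two-state chain)."""
    p_same = 0.5 + 0.5 * exp(-2.0 * rate * time)   # probability that Y equals X
    p_diff = 1.0 - p_same
    return [[0.5 * p_same, 0.5 * p_diff],
            [0.5 * p_diff, 0.5 * p_same]]

# Illustrative rate and times only; they are not estimates from any data set.
for t in (0.0, 0.5, 1.0, 5.0, 20.0):
    print(t, mutual_information(two_state_joint(rate=1.0, time=t)))
```

In this toy model the information starts at one bit when no time has elapsed and shrinks toward zero as the expected number of substitutions grows, which is the behavior the saturation argument relies on.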

(ii) Characters Must Show at Least Two Distinct Character States. One of these two states (transformational homologs) must be ancestral (the “plesiomorphic” state) and the other must be derived (the “apomorphic” state) [74]. Only shared and derived features (synapomorphies) provide vertical phylogenetic evidence. Consequently, determining the relative ancestry of alternative character states defines the polarity of character transformations and roots the underlying tree. This is a fundamental property of phylogenetic inference. Polarization in tree reconstruction enables the “arrow of time” (sensu Eddington’s entropy-induced asymmetry), solves the rooting problem, and fulfills other epistemological requirements.

Cladistically speaking, character polarity refers to the distinction between the ancestral and derived states and the identification of synapomorphies. However, an evolutionary view of polarity also refers to the direction of character state transformations in the phylogenetic model. Historically, three accepted alternatives have been available for rooting trees [100–102]: the outgroup comparison, the ontogenetic method, and the paleontological or stratigraphic method. While the three methods do not include assumptions of evolutionary process, they have been the subject of much discussion and their interpretation of much controversy. The first two are, however, justified by the assumption that diversity results in a nested taxonomic hierarchy, which may or may not be induced by evolution. We will not discuss the stratigraphic method as it relies on auxiliary assumptions regarding the completeness of the fossil record. The midpoint rooting criterion mentioned earlier will not be discussed either. The procedure is contested by the existence of heterogeneities in rates of change across the trees and problems with the accurate characterization of phylogenetic distances.

In outgroup comparison, polarity is inferred by the distribution of character states in the ingroup (group of taxa of interest) and the sister group (taxa outside the group of interest). In a simple case, if the character state is only found in the ingroup, it must be considered derived. Outgroup comparison is by far the method of choice because phylogeneticists tend to have confidence in the supporting assumptions: higher-level relationships are outside the ingroup, equivalent ontogenetic stages are compared, and character state distributions are appropriately surveyed. Unfortunately, the method is “indirect” in that it depends on the assumption of the existence of a higher-level relationship between the outgroup and the ingroup. Consequently, the method cannot root the ToL because at that level there is no higher-level relationship that is presently available. Moreover, the method in itself does not polarize characters. It simply connects the ingroup to the rest of the ToL [100].
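The simple case of outgroup comparison described above can be written as a few lines of set logic. This is only a minimal sketch: the function name and the state labels are hypothetical, and it deliberately ignores complications such as polymorphic outgroups or multiple outgroups with conflicting states.

```python
def polarize_by_outgroup(ingroup_states, outgroup_states):
    """Label each ingroup state 'ancestral' if the outgroup also shows it, else 'derived'."""
    outgroup = set(outgroup_states)
    return {state: ("ancestral" if state in outgroup else "derived")
            for state in set(ingroup_states)}

# Hypothetical character with states "0" and "1"; only "0" occurs in the outgroup.
print(polarize_by_outgroup(["0", "1", "1"], ["0"]))
```

The sketch makes the dependence on the outgroup explicit: without a group outside the ingroup, as at the level of the ToL, the function has nothing to compare against.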

The ontogenetic criterion confers polarity through the distribution of the states of homologous characters in ontogenies of the ingroup, generally by focusing on the generality of character states, with more widely distributed states being considered ancestral. Nelson’s rule states: “given an ontogenetic character transformation, from a character observed to be more general to a character observed to be less general, the more general character is primitive and the less general advanced” [103]. This “biogenetic law” appears powerful in that it depends only on the assumption that ontogenies of ingroup taxa are properly surveyed. It is also a “direct” method that relies exclusively on the ingroup. Consequently, it has the potential to root the ToL. Unfortunately, Nelson’s “generality” has been interpreted in numerous ways, especially as it relates to the ontogenetic sequence, leading to much confusion [102]. It also involves comparison of developmentally nested and distinct life history stages, making it difficult to extend the method (originally conceptualized for vertebrate phylogeny) to the microbial world. However, Weston [100, 104] made it clear that the ontogenetic criterion embodies a wider “generality criterion” in which the taxic distribution of a character state is a subset of the distribution of another. In other words, character states that characterize an entire group must be considered ancestral relative to an alternative state that characterizes a subset of the group. Besides the centrality of nested patterns, the generality criterion embeds the core assumption that every homology is a synapomorphy in nature’s nested taxonomic hierarchy and that homologies in the hierarchy result from additive phylogenetic change [100]. Weston’s more general rule therefore states that “given a distribution of two homologous characters in which one, $x$, is possessed by all of the species that possess its homolog, character $y$, and by at least one other species that does not, then $y$ may be postulated to be apomorphous relative to $x$” [104]. The only assumption of the method is that relevant character states in the ingroup are properly surveyed. This new rule crucially substitutes the concept of ontogenetic transformation by the more general concept of homology and additive phylogenetic change, which can be applied to cases in which homologous entities accumulate “iteratively” in evolution (e.g., generation of paralogous genes by duplication). Since horizontally acquired characters (xenologs) are not considered synapomorphies, they contribute towards phylogenetic noise and are excluded after calculation of homoplasy and retention indices (i.e., measures of goodness of fit of characters to the phylogeny).
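Weston’s rule reduces to a test of proper set inclusion between the taxic distributions of two homologous character states. The sketch below encodes that test directly; the function name, the returned strings, and the taxon labels are hypothetical and serve only to illustrate the logic of the rule quoted above.

```python
def weston_polarity(taxa_with_x, taxa_with_y):
    """Polarize two homologous states x and y from their taxic distributions.

    y is postulated to be apomorphic relative to x when every taxon with y also
    has x and at least one further taxon has x but not y (proper subset).
    """
    x, y = frozenset(taxa_with_x), frozenset(taxa_with_y)
    if not x or not y:
        return "undetermined"              # a state that was never observed cannot be polarized
    if y < x:                              # '<' on sets tests for a proper subset
        return "y is apomorphic relative to x"
    if x < y:
        return "x is apomorphic relative to y"
    return "undetermined"                  # identical or non-nested distributions

# Hypothetical survey: x occurs in four taxa, y only in one of them.
print(weston_polarity({"A1", "A2", "B1", "E1"}, {"E1"}))
```

Distributions that overlap without nesting return no polarity, which reflects the dependence of the criterion on nested patterns.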

We have applied the “generality criterion” to the rooting of the ToL through polarization strategies that embody axioms of evolutionary process. Figure 4 shows three examples. A rooted phylogeny describing the evolution of 5S rRNA molecules sampled from a wide range of organisms was reconstructed from molecular sequence and structure [105]. The ToL that was recovered was rooted paraphyletically in Archaea (Figure 4(a)). The model of character state transformation was based on the axiom that evolved RNA molecules are optimized to increase molecular persistence and produce highly stable folded conformations. Molecular persistence materializes in RNA structure in vitro, with base pairs associating and disassociating at rates as high as 0.5 s$^{-1}$ [106]. The frustrated kinetics and energetics of this folding process enable some structural conformations to quickly reach stable states. This process is evolutionarily optimized through structural canalization [6], in which evolution attains molecular functions by both increasing the average life and stability of

[Figure 4, panels (a)–(c): rooted trees and a timeline for Archaea (A), Bacteria (B), and Eukarya (E). Panel (a): taxa = 666 5S rRNA molecules of diverse organisms; characters = sequence and structure of tRNA; tree length = 11381 steps; CI = 0.072, RI = 0.772, and $g_1$ = −0.13. Panel (b): timeline, from ancestral to derived, of hypotheses of monophyly involving A, B, and E. Panel (c): taxa = 150 free-living proteomes; characters = genomic abundance of 1510 FSFs; tree length = 70132 steps; CI = 0.155, RI = 0.739, and $g_1$ = −0.28.]

Figure 4: Trees of life generated from the structure of RNA and protein molecules congruently show a rooting in Archaea. (a) A rooted phylogenetic tree of 5S rRNA reconstructed from both the sequence and the structure of the molecules (from [105]). (b) Global most-parsimonious scenario of organismal diversification based on tRNA (from [107]). A total of 571 tRNA molecules with sequence, base modification, and structural information were used to build a ToL, which failed to show monophyletic groupings. Ancestries of lineages were then inferred by constraining sets of tRNAs into monophyletic groups representing competing (shown in boxes) or noncompeting phylogenetic hypotheses and measuring tree suboptimality and lineage coalescence (illustrated with color hues in circles). (c) A ToL reconstructed from the genomic abundance counts of 1510 FSFs as phylogenetic characters in the proteomes of 150 free-living organisms sampled equally and randomly from the three domains of life (data taken from [122, 123]). Taxa were labeled with circles colored according to superkingdom. CI = consistency index, RI = retention index, and $g_1$ = gamma distribution parameter.

selected conformations and decreasing their relative number. Thus, conformational diversity, measured, for example, by the Shannon entropy of the base-pairing probability matrix, or features of thermodynamic stability act as an "evo-devo" proxy of a generality criterion for RNA molecules in which the criterion of similarity (e.g., ontogenetic transformation) is of positional and compositional correspondence. Using a different approach, we recently broadened the use of phylogenetic constraint analysis [107, 108], borrowing from a formal cybernetic method that decomposes a reconstructable system into its components [109], and used it to root a ToL derived from tRNA sequence and structure (Figure 4(b)). The ToL was again rooted in Archaea. The number of additional steps required to force (constrain) particular taxa into a monophyletic group was used to define a lineage coalescence distance (S) with which to test alternative hypotheses of monophyly. These hypotheses were then ordered according to S value in an evolutionary timeline. Since S records the relative distribution of character states in taxic sets, it also embodies the generality criterion of rooting. Finally, we generated rooted ToLs that describe the evolution of proteomes directly from a census of protein fold structures in proteomes (Figure 4(c)). This ToL shows a paraphyletic rooting in Archaea. The method extracts phylogenetic signal from proteomic abundance of protein fold structures and considers that the most abundant and widely distributed folds are of ancient origin when defining transformation series [41]. This polarization scheme, which results in the gradual growth of the proteome repertoire, again represents an embodiment of the generality criterion in which statements of homology (fold structures) result from additive phylogenetic change (increases in abundance). It is noteworthy that these ToL reconstructions take into account the genomic abundance of each and every fold structure in each proteome and across the entire matrix, thereby generating a frustrated system. In general, very high abundance of only a few folds will not attract taxa (species) to the derived branches in the ToL. The ancestry of taxa is determined by both the abundance and interplay among fold structure characters. For example, metabolic folds such as those involved in ATP hydrolysis are widespread in cells and considered to be very ancient. These are also the most ancient folds in our phylogenies. In comparison, some popular eukaryote-specific folds (e.g., immunoglobulin superfamilies) are highly abundant but appear in a derived manner in our phylogenies. Thus, we reason that there is no circularity involved in the character polarization scheme.
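The conformational diversity measure mentioned above can be illustrated with a small numerical sketch. The code below computes a mean per-base Shannon entropy from a base-pairing probability matrix. The normalization used here (each base is either unpaired or paired with one partner, and the unpaired probability is one minus the row sum) is an assumption made for illustration and may differ from the definition used in the cited studies; the two 4-base matrices are hypothetical.

```python
import math


def shannon_entropy(pair_probs):
    """Mean per-base Shannon entropy (in bits) of a base-pairing probability matrix.

    pair_probs[i][j] is the probability that bases i and j are paired
    (symmetric, zero diagonal). The probability that base i is unpaired is
    taken as 1 minus its row sum, which is an assumption of this sketch.
    """
    n = len(pair_probs)
    total = 0.0
    for i in range(n):
        row = [p for j, p in enumerate(pair_probs[i]) if j != i]
        unpaired = max(0.0, 1.0 - sum(row))
        for p in row + [unpaired]:
            if p > 0.0:
                total -= p * math.log2(p)
    return total / n


# Hypothetical molecule with a single dominant conformation:
# base 0 always pairs with base 3, base 1 always pairs with base 2.
ordered = [
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]

# Hypothetical molecule with two competing conformations of equal weight.
frustrated = [
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]

entropy_ordered = shannon_entropy(ordered)
entropy_frustrated = shannon_entropy(frustrated)
```

Under these assumptions the molecule locked into one conformation has zero entropy, while the molecule split between two conformations has one bit per base, so a polarization scheme that favors conformational order would treat the first as the more derived state.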

The compositional schemes extend the concept of rooting with paralogous sequences to the entire proteome complements, from gene family level [100] to structural hierarchies. The three examples make use of different rooting rationales but provide a congruent scenario of origins of diversified life. Technically, roots are inferred by the Lundberg method [110], which does not require any outgroup taxa specification. This method roots the trees a posteriori by attaching the hypothetical ancestor to that branch of the unrooted network that would yield minimum increase in the tree length (thus preserving the principle of parsimony).
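The rooting step described above can be sketched in a few lines. The code below takes a fixed unrooted binary tree, attaches a hypothetical ancestor to each branch in turn, and reports the branch that adds the fewest steps. It is only a toy illustration: it scores trees with Fitch parsimony for unordered characters (the studies reviewed here use ordered multistate characters), it assumes the unrooted tree is already given, and the taxon names, character states, and the `lundberg_root` helper are all hypothetical.

```python
def fitch(node, states, c):
    """Return (state set, steps) for character c on a rooted binary tree."""
    if isinstance(node, str):
        return {states[node][c]}, 0
    left_set, left_steps = fitch(node[0], states, c)
    right_set, right_steps = fitch(node[1], states, c)
    if left_set & right_set:
        return left_set & right_set, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, states):
    """Total number of Fitch steps over all characters."""
    n_chars = len(next(iter(states.values())))
    return sum(fitch(tree, states, c)[1] for c in range(n_chars))


def subtree(adj, node, parent):
    """Nested-tuple subtree hanging from `node` when entered from `parent`."""
    children = [n for n in adj[node] if n != parent]
    if not children:
        return node  # a leaf
    return tuple(subtree(adj, child, node) for child in children)


def lundberg_root(adj, states, ancestor):
    """Pick the branch where attaching the ancestor adds the fewest steps."""
    with_anc = dict(states, ANC=ancestor)
    edges = sorted({tuple(sorted((a, b))) for a in adj for b in adj[a]})
    a0, b0 = edges[0]
    # Fitch length of the unrooted tree does not depend on where it is rooted.
    base = tree_length((subtree(adj, a0, b0), subtree(adj, b0, a0)), states)
    scored = []
    for a, b in edges:
        rooted = (subtree(adj, a, b), subtree(adj, b, a))
        extra = tree_length(("ANC", rooted), with_anc) - base
        scored.append((extra, (a, b)))
    extra, edge = min(scored)
    return edge, extra


# Hypothetical unrooted tree: leaves A1, A2, B1, E1; internal nodes u and v.
adj = {
    "A1": ["u"], "A2": ["u"], "B1": ["v"], "E1": ["v"],
    "u": ["A1", "A2", "v"], "v": ["B1", "E1", "u"],
}
# Two hypothetical binary characters; "0" is the assumed ancestral state.
states = {"A1": "00", "A2": "10", "B1": "11", "E1": "11"}

root_edge, extra_steps = lundberg_root(adj, states, ancestor="00")
```

In this toy matrix the all-ancestral hypothetical ancestor attaches at no extra cost only on the branch leading to A1, which yields the rooted topology (A1, (A2, (B1, E1))). Ties between branches are broken arbitrarily here by sort order.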

(iii) Characters Must Serve as Independent Evolutionary Hypotheses. Valid phylogenetic optimization requires that characters be independent pieces of evidence. Characters should not depend on other characters. When the assumption of independence is violated, characters are overweighted in the analysis and the resulting phylogeny fails to represent true history [111]. Possible dependencies could be of many kinds, from structural to functional, from developmental to ecological. These dependencies distort and obscure phylogenetic signal and must be either avoided or coded into the phylogenetic model through parameters or weight corrections.

As we will now elaborate, the problem of character independence is about parts and wholes in the hierarchical fabric of life and in the nested hierarchies of the ToL. Biological systems are by definition made of parts, regardless of the way parts are defined. In evolution, diversification and integration of parts unify parts into cohesive entities, modules, which then diversify [112]. This process and the rise of modules may explain evolutionary waves of complexity and organization and the emergence of structure (defined broadly) in biology that is hierarchical. The hierarchical makeup is made evident in the structure of protein molecules, where lower level parts of the polymer (the amino acid residues) interact with each other and establish cohesive higher-level modular parts, which also establish interaction networks and are crucial for molecular function and for interaction of proteins with the cellular environment. Consequently, the structure of proteins can be described at increasing hierarchical levels of structural abstraction: sequences, motifs, loops, domains, families, superfamilies, topologies, folds, architectures, and classes. Two accepted gold standards of protein classification, structural classification of proteins (SCOP) [113] and class architecture topology homology (CATH) [114], use parts of this incomplete scheme to describe the atomic complexity of the molecules. We note that these classifications do not consider unrealized structural states, such as protein folds that are possible but that have never been identified in the natural world of protein structures. We also note that modules sometimes engage in combinatorial games. For example, protein domains are rearranged in evolution by fusions and fissions, producing the enormous diversity of alternative domain rearrangements that exist in multidomain proteins [92].

Since biological systems evolve and carry common ancestry, parts of these systems by definition evolve and by themselves carry common ancestry. In other words, the histories of parts are embodied in the ensemble of lineages of the ToL. The focus of phylogenetic analysis, however, has been overwhelmingly the organism as the biological system, as testified by the effort devoted to taxonomical classification, systematic biology, and building of the ToL. The genomic revolution has provided a wealth of parts in the amino acid sequence sites of proteins and nucleic acid molecules, which have been used as phylogenetic characters for analysis of molecules representing organisms. This has maintained the focus of reconstructing trees of systems (the wholes). For example, amino acid sites of a protein are generally fit into a data matrix (alignment), which is then used to build trees of genes and organisms (Figure 5(a)) using modern algorithmic implementations [115]. However, and as we have mentioned, proteins have complicated structures that result from interactions between amino acids at the 3D atomic level. These intramolecular interactions, or at least some of them [116], induce protein folding and delimit molecular function and protein stability. They are responsible for protein secondary, supersecondary, domain, and tertiary structure, and by definition their mere existence induces violation of character independence. Penny and Collins [117] proposed the simple thought experiment in which the bioinformatician exchanges rows of sequence sites in the alignment matrix and asks what was lost in the process. Randomization of characters (columns) in the data matrix does not change the phylogenetic tree. However, randomization destroys the structure of the molecule and very likely its function. This confirms that reconstruction of trees from information in sequence sites violates character independence and in the process ignores structure and biology. The effects of violation

Figure 5: A new phylogenetic strategy simplifies the problems of character independence. (a) The canonical paradigm explores the evolution of systems, such as the evolution of organisms in the reconstruction of ToLs. For example, the terminals of phylogenetic trees can be genes sampled from different organisms (e.g., Pyrococcus furiosus (Pfu), Methanococcus jannaschii (Mja), and Methanosarcina barkeri (Mba)) and phylogenetic characters can be amino acid sequence sites of the corresponding gene products. Character states can describe the identity of the amino acid at each site. Since characters are molecular parts that interact with other parts when molecules fold into compact 3D structures, their interaction violates the principle of character independence. Consequently, the effects of covariation must be considered in the phylogenetic model used to build the trees. (b) The new paradigm explores the evolution of parts, such as the evolution of protein domains in proteomes. For example, the terminals of phylogenetic trees can be domains defined at fold superfamily level of structural complexity and the characters used to build the trees can be proteomes. Character states can be the number of domains holding the FSF structure. Since proteomes interact with other proteomes when organisms establish close interactions, interactions that could affect the abundance of domains in proteomes should be considered negligible (unless there is an obligate parasitic lifestyle involved) and there is no need to budget trophic interactions in the phylogenetic model.

of character independence may be minimal for trees of sequences that are closely related. However, trees describing deep historical relationships require that sequences be divergent, and this maximizes the chances of even wider divergences in molecular structure that are not being accounted for by the models of sequence evolution.
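The thought experiment of Penny and Collins can be checked numerically on a toy alignment. The sketch below scores the three possible unrooted topologies for four hypothetical taxa with Fitch parsimony, shuffles the alignment columns, and scores them again. The sequences, taxon names, and helper functions are invented for illustration, and Fitch parsimony stands in for whatever optimality criterion a real analysis would use.

```python
import random


def fitch(node, seqs, site):
    """Return (state set, steps) for one alignment column on a rooted binary tree."""
    if isinstance(node, str):
        return {seqs[node][site]}, 0
    left_set, left_steps = fitch(node[0], seqs, site)
    right_set, right_steps = fitch(node[1], seqs, site)
    if left_set & right_set:
        return left_set & right_set, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, seqs):
    """Parsimony length: columns are scored independently and summed."""
    n_sites = len(next(iter(seqs.values())))
    return sum(fitch(tree, seqs, s)[1] for s in range(n_sites))


def shuffle_columns(seqs, rng):
    """Apply the same random column permutation to every sequence."""
    n_sites = len(next(iter(seqs.values())))
    order = list(range(n_sites))
    rng.shuffle(order)
    return {name: "".join(seq[i] for i in order) for name, seq in seqs.items()}


# Hypothetical aligned protein fragments for four taxa.
alignment = {
    "t1": "ALYSGK",
    "t2": "ALYSGR",
    "t3": "GLSSAR",
    "t4": "GLSTAR",
}

# The three unrooted topologies for four taxa, written with an arbitrary root.
trees = [
    (("t1", "t2"), ("t3", "t4")),
    (("t1", "t3"), ("t2", "t4")),
    (("t1", "t4"), ("t2", "t3")),
]

shuffled = shuffle_columns(alignment, random.Random(0))
lengths_before = [tree_length(t, alignment) for t in trees]
lengths_after = [tree_length(t, shuffled) for t in trees]
```

Because each column contributes to the tree length independently of its neighbors, the scores of all candidate trees are identical before and after shuffling. The method is blind to the positional information that the shuffle destroys, which is the point of the thought experiment.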

The genomic revolution has also provided a wealth of models of 3D atomic structure. These structures are used as gold standards to assign, with high confidence, structural modules to sequences. As mentioned earlier, proteomes embody collections of protein domains with well-defined structures and functions. Protein domain counts in proteomes have been used to generate trees of protein domains (Figure 5(b)) using standard cladistic approaches and well-established methods (reviewed in [87]). These trees describe the evolution of protein structure at global level. They are effectively trees of parts. While domains interact with each other in multidomain proteins or establish protein-protein interactions with the domains of other proteins, these interactions of parts do not violate character independence. This is because phylogenetic characters are actually proteomes, systems defined by structural states that exist at much higher levels than the protein domain, not far away from the organism level. Remarkably, no information is lost when character columns in the matrix are randomized in the thought experiment. The order of proteome characters in the matrix does not follow any rationale. Characters are not ordered by lifestyles or trophic levels of the organisms. Interactions between free-living organisms will seldom bias their domain makeup, and if so, those characters can be excluded from analysis. Even the establishment of symbiotic or obligate parasitic interactions, such as the nodule-forming symbioses between rhizobia and legumes, may have little impact on character independence as long as the joint inclusion of the host and the symbiont is avoided.

6. Evidence Supporting the Archaeal Rooting of the Universal Tree

Figure 4 shows examples of rooted ToLs generated from the sequence and structure of RNA and protein molecules. The different phylogenomic approaches arrive at a common rooted topology that places the stem group of Archaea at the base of the ToL (the archaeal rooting of Figure 1(c)). However, the evolutionary interrelationship of parts and wholes prompts the use of trees of parts, a focus on higher-level structure, and a decrease in confidence in the power of trees of systems. These concepts have been applied to the study of the history of nucleic acid and protein structures for over a decade and

Figure 6: Venn diagrams displaying the distributions of 1733 FSF domains (a), 1924 terminal GO terms (b), and 1148 RNA families (c) in the genomes of the three domains of life. FSF domain data was taken from Nasir et al. [122, 123] and included 981 completely sequenced proteomes from 70 Archaea, 652 Bacteria, and 259 Eukarya. Terminal GO terms corresponding to the "molecular function" hierarchy defined by the GO database [118] were identified in 249 free-living organisms, including 45 Archaea, 183 Bacteria, and 21 Eukarya (data taken from [123]). The Venn diagram of RNA families and the distribution of Rfam clans and families in organisms were taken from Hoeppner et al. [129] and their Dataset S1. Shown below are distribution patterns for FSFs, GO terms, and RNA families that are present in more than 60% of the organisms examined (f > 0.6). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

provide additional evidence in support of the archaeal rooting scenario.

6.1. Comparative Genomic Argumentation. The distribution of gene-encoded products in the genomes of sequenced organisms and parsimony thinking can reveal global evolutionary patterns without formal phylogenetic reconstruction. We will show how simple numerical analyses of protein domains, molecular activities defined by the gene ontology (GO) consortium [118], and RNA families that are domain-specific or are shared between domains of life can uncover the tripartite division of cellular life, exclude chimeric scenarios of origin, and provide initial insight on the rooting of the ToL (Figure 6).

The number of unique protein folds observed in nature is very small. SCOP ver. 1.75 defines only ∼2000 fold superfamilies (FSFs), groups of homologous domains unified on the basis of common structural and evolutionary relationships, for a total of 110800 known domains in proteins [113, 119]. FSFs represent highly conserved evolutionary units that are suitable for studying organismal diversification [87]. Yafremava et al. [120] plotted the total number of distinct FSFs (FSF diversity) versus the average reuse of FSFs in the proteome of an organism (FSF abundance). This exercise uncovered a scaling behavior typical of a Benford distribution, with a linear regime of proteomic growth for microbial organisms and a superlinear regime for eukaryotic organisms (Figure 8 in [120]). These same scaling patterns are observed when studying the relationship between open reading frames and genome size [121]. Remarkably, archaeal and eukaryal proteomes exhibited the minimum and maximum levels of FSF abundance and diversity, respectively. Bacterial proteomes, however, showed intermediate levels. We note that the general scaling behavior is consistent with a scenario in which evolutionary diversification proceeds from simpler proteomes to the more complex ones in a gradual manner, supporting the principle of spatiotemporal continuity and revealing the nested phylogenetic hierarchies of organisms. Under this scenario (and results of [120]), the streamlined archaeal proteomes represent the earliest form of cellular life. Remarkably, archaeal species harboring thermophilic and hyperthermophilic lifestyles encoded the most streamlined FSF repertoires (Figure 7). Clearly, modern thermophilic archaeons are most closely related to the ancient cells that inhabited planet Earth billions of years ago (also read below).

FSF domain distributions in the genomes of the three domains of life provide further insights into their evolution. Nasir et al. [122, 123] generated Venn diagrams to illustrate FSF sharing patterns in the genomes of Archaea, Bacteria, and Eukarya (Figure 6(a)). These diagrams display the total number of FSFs that are unique to a domain of life (taxonomic groups A, B, and E), shared by only two (AB, BE, and AE), and

Figure 7: A plot of FSF use (diversity) against FSF reuse (abundance) reveals a linear pattern of proteomic growth in 48 archaeal proteomes. Thermophilic archaeal species occupy positions that are close to the origin of the plot. They also populate the most basal branch positions in ToLs (see discussion in the main text). Both axes are in logarithmic scale. Points are marked as thermophilic or nonthermophilic; labeled species include S. marinus, I. hospitalis, H. butylicus, T. pendens, Cand. Korarchaeum, A. pernix, M. kandleri, T. neutrophilus, P. islandicum, and N. maritimus.

those that are universal (ABE). About half of the FSFs (786 out of 1733) were present in all three domains of life, 453 of which were present in at least 60% of the organisms that were examined (f > 0.6, where f is the distribution index ranging from 0 to 1). The fact that about 70% of widely shared FSFs (672) belong to the ABE taxonomic group strongly supports a common evolutionary origin for cells. Evolutionary timelines have confirmed that a large number of these universal FSFs were present in the ancestor of the three domains of life, the urancestor (Figure 1), and were for the most part retained in extant proteomes [48].
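The bookkeeping behind these Venn diagrams is simple enough to sketch. The code below assigns each feature (for example, an FSF) to a taxonomic group according to the domains in which it occurs and computes a distribution index f. The organisms and FSF names are hypothetical, and computing f over only the organisms of the domains that make up the group is an assumption of this sketch rather than a statement of how the cited studies define it.

```python
from collections import Counter

# Hypothetical organisms and their domain of life (A, B, or E).
domain_of = {
    "a1": "A", "a2": "A",
    "b1": "B", "b2": "B", "b3": "B",
    "e1": "E", "e2": "E",
}

# Hypothetical presence data: feature -> set of organisms encoding it.
presence = {
    "fsf_universal": {"a1", "a2", "b1", "b2", "b3", "e1", "e2"},
    "fsf_be_wide": {"b1", "b2", "b3", "e1"},
    "fsf_be_rare": {"b1", "e1"},
    "fsf_ab_rare": {"a1", "b2"},
    "fsf_archaeal": {"a1"},
}


def venn_group(organisms):
    """Taxonomic group label such as 'A', 'BE', or 'ABE'."""
    return "".join(sorted({domain_of[org] for org in organisms}))


def distribution_index(organisms):
    """f: fraction of organisms, within the group's domains, that carry the feature."""
    group = venn_group(organisms)
    pool = [org for org, dom in domain_of.items() if dom in group]
    return len(organisms) / len(pool)


groups = {name: venn_group(orgs) for name, orgs in presence.items()}
f_values = {name: distribution_index(orgs) for name, orgs in presence.items()}

# Counts per Venn region, for all features and for widely shared ones (f > 0.6).
all_counts = Counter(groups.values())
wide_counts = Counter(groups[name] for name, f in f_values.items() if f > 0.6)
```

With real data, `all_counts` would correspond to the upper Venn diagrams of Figure 6 and `wide_counts` to the f > 0.6 diagrams below them.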

Interestingly, the number of BE FSFs was ∼10-fold greater than AE and AB FSFs (324 versus 38 and 38) (Figure 6(a)). The finding that Bacteria and Eukarya encode a significantly large number of shared FSF domains is remarkable and hints towards an unprecedented strong evolutionary association between these two domains of life. Moreover, a significant number of BE FSFs is widespread in the bacterial and eukaryotic proteomes: 56 of the 324 FSFs are shared by more than 60% of the bacterial and eukaryal organisms that were analyzed (Figure 6(a)). This is strong evidence of an ancient vertical evolutionary trace from their mutual ancestor (anticipated in [123]). This trace uniquely supports the archaeal rooting of the ToL. In turn, the canonical and eukaryotic rooting alternatives are highly unlikely, as none alone can explain the remarkable diversity of the BE taxonomic group (as well as its ancient origin [123]). The remarkable vertical trace of the ABE taxonomic group and the negligible vertical trace of the AB group (only one FSF shared by more than 60% of organisms) also falsify fusion hypotheses responsible for a putative chimeric makeup of Eukarya. In light of these findings, the most parsimonious explanation of comparative genomic data is that Archaea is the first domain of diversified cellular life.

The patterns of FSF sharing are further strengthened by the genomic distributions of terminal-level molecular function GO terms (Figure 6(b)). The three domains of life shared about a quarter (526) of the 1924 GO terms that were surveyed. A total of 232 of these ABE terms were shared by more than 60% of the organisms analyzed. Again, the fact that about 76% of widely shared GO terms (306) belong to the ABE taxonomic group strongly supports a common vertical trace, while the finding that Bacteria and Eukarya share a significantly large number of GO terms (272) supports again the archaeal rooting of the ToL. An alternative explanation for the very large size of BE could be large-scale metabolism-related gene transfer from the ancestors of mitochondria and plastids to the ancestors of modern eukaryotes [124]. However, we note that the BE group is not restricted to only metabolic functions. It also includes FSFs and GOs involved in intracellular and extracellular processes and regulation and informational functions [125]. Thus, the very large size of BE is a significant outcome most likely shaped by vertical evolutionary scenarios and cannot solely be explained by parasitic/symbiotic relationships that exist between Bacteria and Eukarya [123]. Moreover, the simplicity of archaeal FSF repertoires is not due to the paucity of available archaeal genomic data. We confirmed that the mean FSF coverage (i.e., number of proteins annotated to FSFs/GOs out of total) for Archaea, Bacteria, and Eukarya was largely comparable (e.g., Table S1 in [122]). Finally, as we will describe below, these comparative patterns and tentative conclusions are confirmed by phylogenomic analysis [91, 122, 126]. Congruent phylogenies were obtained with (Figure 4(c)) and without equal and random sampling of taxa. Thus, the relatively low number of archaeal genomes is also not expected to compromise our inferences.

We note that Eukarya shares many informational genes with Archaea and many operational genes with Bacteria. While informational genes have been thought more refractory to HGT than noninformational genes, we have confirmed at GO level that this is not the case. A ToL built from GO terms showed that in fact noninformational terms were less homoplasious, while statistical enrichment analyses revealed that HGT had little if any functional preference for GO terms across GO hierarchical levels. We recently compared ToLs reconstructed from non-HGT GO terms and ToLs reconstructed from informational GO terms that were extracted from non-HGT GO terms (Kim et al., ms. in review; also read below). In both cases, Archaea appeared as a basal paraphyletic group of the ToLs and the common origin of Bacteria and Eukarya was maintained. Thus, the 272 GO terms shared by Bacteria and Eukarya harbor a strong vertical trace.

Another observation often used as support for Archaea-Eukarya kinship is the discovery of a few eukaryote-specific proteins (e.g., actin, tubulin, H3, H4, ESCRT, ribosomal proteins, and others) in some archaeal species [127] and their complete absence from Bacteria (except for documented tubulin HGT from eukaryotes to the bacterial genus Prosthecobacter [128]). This suggests either that eukaryotes arose from an archaeal lineage [127] or that the ancestor of Eukarya and Archaea was complex and modern Archaea are highly reduced [25]. Indeed, a few eukaryote-specific proteins have now been found in some archaeal species (in some cases, just one species). We argue that this poor spread cannot be taken as evidence for the Archaea-Eukarya sister relationship. This needs to be confirmed by robust phylogenetic analysis, which is unfortunately not possible when using protein sequences. Recently, we described a new strategy for inferring vertical and horizontal traces [123]. This method calculates the spread of FSFs and GOs that are shared between two-superkingdom groups (i.e., AB, AE, and BE). Balanced distributions often indicate vertical inheritance, while biased distributions suggest horizontal flux. For example, penicillin binding molecular activity (GO:0008658) was present in 100% of the sampled bacterial proteomes but was only present in 11% of the archaeal species [123]. Thus, presence of GO:0008658 in Archaea was attributed to HGT gain from Bacteria. Using this simple method, we established that both Bacteria and Eukarya were united by much stronger vertical trace than either was to Archaea. In fact, strong reductive tendencies in the archaeal genomes were recorded [123]. Thus, in our opinion, presence of eukaryote-specific proteins in only very few archaeal species could in fact be an HGT event that is not detectable by sequence phylogenies.
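The balanced-versus-biased reasoning can be expressed as a small comparison of spreads. In the sketch below, the spread values for GO:0008658 (100% of bacterial and 11% of archaeal proteomes) come from the text, while the `classify_trace` helper, the 0.5 cutoff on the difference in spread, and the second feature are hypothetical choices made only for illustration; the cited study may use a different decision rule.

```python
def classify_trace(spread_first, spread_second, max_gap=0.5):
    """Label a two-superkingdom feature by how evenly it is spread.

    spread_first and spread_second are the fractions of organisms in each
    superkingdom that carry the feature. The max_gap cutoff is an assumed,
    illustrative threshold.
    """
    gap = abs(spread_first - spread_second)
    return "biased" if gap > max_gap else "balanced"


# Penicillin binding (GO:0008658): spreads in Bacteria and Archaea, as reported in the text.
penicillin_binding = classify_trace(1.00, 0.11)

# Hypothetical BE feature that is widespread in both Bacteria and Eukarya.
hypothetical_be_feature = classify_trace(0.80, 0.70)
```

A "biased" label marks a candidate for horizontal flux, and a "balanced" label is compatible with vertical inheritance; neither label is proof on its own.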

In turn, many arguments now favor the Bacteria-Eukarya sisterhood, in addition to the structure-based phylogenies and balanced distribution of molecular features. For example, Bacteria and Eukarya have similar lipid membranes, which can be used as an argument for their evolutionary kinship. Moreover, Archaea fundamentally differ from Eukarya in terms of their virosphere. Viruses infecting Archaea and Eukarya are drastically different, as recently discussed by Forterre [25].

The genomic distribution of RNA families was taken from Hoeppner et al. [129] and also shows a vertical evolutionary trace in five crucial Rfam clans that are universal, including tRNA, 5S rRNA, subunit rRNA, and RNase P RNA. These universal RNA groups are likely minimally affected by HGT. However, 99% of Rfam clans and families are specific to domains of life and only 11 of the 1148 groups were shared at f > 0.6 levels. This clearly shows that the functional complexity of RNA materialized very late during organismal diversification and that it is not a good genomic feature for exploring the rooting of the ToL. Only five RNA families are shared between two domains of life, and only one of these does so at f > 0.6, the G12 pseudoknot of the 23S rRNA, which is present in bacterial and eukaryotic organellar rRNA. While the large subunit rRNA scaffold supports the G12 pseudoknot with its vertical trace, all five interdomain RNA families can be explained most parsimoniously by HGT. Thus, only a handful of ancient and universal RNA species can be used to root the ToL.

We end by noting that the Venn diagrams consistently show that Archaea harbors the least number of unique (A) and shared (AB and AE) FSFs, GO terms, or RNA families. This trend supports an early divergence of this domain of life from the urancestor and the possibility that such divergence be shaped by evolutionary reductive events. We reason that such losses would be more parsimonious early on in evolution than in the later periods. This is because genes often increase their abundance in evolutionary time (by gene duplications, HGT, and other processes). Thus, it is reasonable to think that loss of an ancient gene would be more feasible very early in evolution relative to losing it very late.

6.2. Phylogenomic Evidence from the Sequence and Structure of RNA. Phylogenetic analyses of the few RNA families that are universal and display an important vertical evolutionary trace (Figure 6(c)) provide compelling evidence in favor of an early evolutionary appearance of Archaea [105, 107, 108, 130–135]. Here we briefly summarize evidence from tRNA, 5S rRNA, and RNase P RNA. Unpublished analyses of rRNA sequence and structure using advanced phylogenetic methods also show that Archaea was the first domain of life.

tRNA molecules are generally short (∼73–95 nucleotides in length) and highly conserved. Consequently, their sequence generally contains a limited amount of phylogenetic information. These limitations have been overcome by analyzing entire tRNomes [136], which extend the length of an organismal set of tRNAs to over 2000 bases. Xue et al. [130, 131] analyzed the genetic distances between tRNA sequences as averages between alloacceptor tRNAs from diverse groups of tRNomes using multiple molecular base substitution models. The distances were mapped onto an unrooted phylogeny of tRNA molecules (Figure 8(a)). The "arrow of time" assumption in these studies is that ancient tRNA paralogs closely resemble each other when lineages originate close to the time of the gene duplication. Remarkably, the results revealed a paraphyletic rooting of the ToL in Archaea. The root was specifically located close to the hyperthermophilic methanogen Methanopyrus kandleri (Figure 8(a)). The hypothesis of this specific rooting scenario has been supported by several other studies [134, 137], including a study of genetic distances between paralogous pairs of aminoacyl-tRNA synthetase (aaRS) proteins [131]. A remarkable match between distance scores of tRNA and pairs of aaRS paralogs (Figure 8(b)) not only confirms the early appearance of Archaea but also suggests a coevolutionary trace associated with molecular interactions that are responsible for the genetic code. In fact, a recent exhaustive phylogenomic analysis of tRNA and aaRS coevolution explicitly reveals the origins and evolution of the genetic code and the underlying molecular basis of genetics [59]. Di Giulio [132, 133] has also proposed an archaeal rooting of the tree of life, specifically in the lineage leading to the phylum of Nanoarchaeota. This rooting is based on unique and ancestral genomic traits of Nanoarchaeum equitans, split genes separately codifying for the 5′ and 3′ halves of tRNA and the absence of operons, which are considered molecular fossils [132]. However, this claim needs additional support, as contrasting evidence now recognizes N. equitans as a highly derived archaeal species [138].
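The alloacceptor distance used by Xue et al. is, in essence, an average over all pairs of tRNAs that accept different amino acids. The sketch below computes such an average with a plain mismatch fraction (p-distance) on pre-aligned sequences of equal length. This is a simplifying assumption: the cited studies used multiple base substitution models, and the three short sequences and the `d_allo` helper here are hypothetical.

```python
from itertools import combinations
from math import comb


def p_distance(seq_a, seq_b):
    """Fraction of mismatched positions between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)


def d_allo(trnas):
    """Mean pairwise distance over all pairs of alloacceptor tRNAs in one genome.

    trnas maps an amino acid to one representative aligned tRNA sequence.
    """
    pairs = list(combinations(sorted(trnas), 2))
    return sum(p_distance(trnas[a], trnas[b]) for a, b in pairs) / len(pairs)


# Hypothetical 8-base fragments standing in for three alloacceptor tRNAs.
toy_genome = {
    "Ala": "GGGCGUAG",
    "Gly": "GCGCGUAG",
    "Ser": "GGGCAUAC",
}

toy_d_allo = d_allo(toy_genome)

# With one tRNA for each of the 20 amino acids there are C(20, 2) pairs.
n_pairs_for_20 = comb(20, 2)
```

Under the "arrow of time" assumption described above, a genome with a lower average distance would be read as lying closer to the root. The pair count for 20 amino acids matches the 190 pairs mentioned in the caption of Figure 8.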

In addition to sequences, structural features of tRNA molecules also support the archaeal rooting of the ToL.

Figure 8: Ancient phylogenetic signal in the sequence of tRNA and their associated aminoacyl-tRNA synthetase (aaRS) enzymes. (a) Unrooted ToL derived from tRNA sequences with their alloacceptor D_allo distances traced in thermal scale (from [130]). D_allo is the average of pairwise distances for 190 pairs of tRNA isoacceptors. These distances measure pairwise sequence mismatches of tRNAs for every genome, and their values increase for faster evolving sequences of species of more recent origin. (b) Coevolution of aaRSs and their corresponding tRNA. Genetic distances between the top 10 potentially paralogous aaRS pairs, estimated using BLASTP, define a measure (Q_ars) of how closely the proteins resemble each other in genomes (from [131]). Larger Q_ars scores imply more ancestral and slowly evolving protein pairs. The plot of Q_ars scores (aminoacyl-tRNA synthetase history) against D_allo distances (tRNA alloacceptor history) reveals a hidden correlation between the evolution of tRNA and aaRSs and the early origin of Archaea (larger Q_ars and lower D_allo distances).

The application of RNA structural evidence in phylogenetic studies [50, 139–141] has multiple advantages over sequence data when studying ancient events, especially because RNA structures are far more conserved than sequences. This has been demonstrated in a phylogenetic approach that uses RNA structural information to reconstruct the evolutionary history of macromolecules such as rRNA [29, 50], tRNA [107, 108], 5S rRNA [105], RNase P RNA [135], and SINE RNA [142]. Geometrical and statistical properties of structure (e.g., stems or loops commonly found in the secondary structures of RNA molecules) are treated as linearly ordered multistate phylogenetic characters. In order to build rooted trees, an evolutionary tendency toward conformational order is used to polarize change in character state transformations. This defines a hypothetical ancestor with which to root the ingroup using the Lundberg method. Reconstructed phylogenies produce trees of molecules and ToLs (e.g., Figure 4(a)) or trees of substructures that describe the gradual evolutionary accretion of structural components into molecules (Figure 9). For example, phylogenetic trees of tRNA substructures define explicit models of molecular history and show that tRNA originated in the acceptor stem of the molecule [108]. Remarkably, trees reconstructed from tRNA drawn from individual domains of life demonstrate that the sequence of accretion events occurred differently in Archaea than in Bacteria and Eukarya, suggesting a sister group evolutionary relationship between the bacterial and eukaryotic domains (Figure 9(a)). A similar result obtained from trees of 5S rRNA substructures revealed different molecular accretion sequences of archaeal molecules when these were compared to bacterial and eukaryal counterparts [105], confirming again, in a completely different molecular system, the history of the domains of life.

An analysis of the structure of RNase P RNA also provides similar conclusions [135]. While a ToL reconstructed from molecular structure placed type A archaeal molecules at its base (a topology that resembles the ToL of 5S rRNA), a tree of RNase P RNA substructures uncovered the history of molecular accretion of the RNA component of the ancient endonuclease and revealed a remarkable reductive evolutionary trend (Figure 9(b)). Molecules originated in stem P12 and were immediately accessorized with the catalytic P1–P4 pseudoknotted core structure that interacts with RNase P proteins of the endonuclease complex and ancient segments of tRNA. Soon after this important accretion stage, the evolving molecule lost its first stem in Archaea (stem P8), several accretion steps earlier than the first loss of a stem in Eukarya or the first appearance of a Bacteria-specific stem. These phylogenetic statements provide additional strong support to the early origin of the archaeal superkingdom prior to the divergence of the shared common ancestor of Bacteria and Eukarya. As we will discuss below, the early loss of a structure in the molecular accretion process of a central and ancient RNA family is significant. It suggests


[Figure 9 graphic: (a) rooted trees of tRNA arm substructures (Acc, AC, DHU, TΨC, Var, and 3′ end) for Eukarya, Bacteria, and Archaea, with branches colored by age (nd) from ancestral (0) to derived (1); (b) tree of RNase P RNA stem substructures (P1 to P20) with a timeline of age (nd) from ancestral to derived, annotated with the ancient universal stems, the universal pseudoknot, the first stem lost in Archaea (type M), the type A pseudoknot (first loss in Eukarya), the first stem specific to Bacteria (type A), the first stems specific to Archaea (type A), and bacterial type B-specific stems.]

Figure 9: The history of accretion of tRNA and RNase P RNA substructures reveals the early evolutionary appearance of Archaea. (a) Rooted trees of tRNA arm substructures reveal the early appearance of the acceptor arm (Acc) followed by the anticodon arm (AC) in Archaea or the pseudouridine (TΨC) arm in both Bacteria and Eukarya (from [108]). The result confirms the sister group relationship of Bacteria and Eukarya. (b) Trees of molecular substructures of RNase P RNAs were reconstructed from characters describing the geometry of their structures (from [135]). Branches and corresponding substructures in a 3D atomic model are colored according to the age of each substructure (nd, node distance). Note the early loss of stem P8 in Archaea immediately after the evolutionary assembly of the universal functional core of the molecule.


that the emerging archaeal lineages were subjected to strong reductive evolutionary pressures during the early evolution of a very ancient RNA molecule.

6.3. Phylogenomic Evidence from Protein Domain Structure. G. Caetano-Anollés and D. Caetano-Anollés [41] were the first to utilize protein domain structures as taxa and to reconstruct trees of domains (ToDs) describing their evolution. Figure 10 shows a rooted ToD built from a census of FSF structures in 981 genomes (data taken from [122, 123]). These trees are unique in that their terminal leaves represent a finite set of component parts [87]. These parts describe at global level the structural diversity of the protein world. When building trees of FSFs, the age of each FSF domain structure can be calculated from the ToDs by simply calculating a distance in nodes between the root of the tree and its corresponding terminal leaf. This is possible because ToDs exhibit highly unbalanced (pectinate) topologies that are the result of semipunctuated processes of domain appearance and accumulation. This node distance (nd), rescaled from 0 to 1, provides a relative timescale to study the order of FSF appearance in evolutionary history [126, 143]. Wang et al. [144] showed that nd correlates linearly with geological time and defines a global molecular clock of protein folds. Thus, nd can be used as a reliable proxy for time. Plotting the age of FSFs in each of the seven taxonomic groups confirmed evolutionary statements we had previously deduced from the Venn diagrams of Figure 6. The ABE taxonomic group included the majority of ancient and widely distributed FSFs. This is expected. In the presence of a strong vertical trace, molecular diversity must delimit a nested taxonomic hierarchy. The ABE group was followed by the evolutionary appearance of the BE group, which preceded the first domain-specific structures, which were Bacteria-specific (B). Remarkably, Archaea-specific (A) and Eukarya-specific (E) structures appeared concurrently and relatively late. These general trends, captured in the box plots of Figure 10, have been recovered repeatedly when studying domain structures at various levels of structural complexity, from folds to fold families [91, 126], when using CATH or SCOP structural definitions [122, 145], or when exploring the evolution of terminal GO terms.
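The node distance calculation described above can be illustrated with a minimal sketch. It assumes a rooted tree stored as a parent-to-children dictionary and takes nd as the number of branches separating each leaf from the root, min-max rescaled so that the most basal leaf scores 0 and the most derived leaf scores 1. The toy pectinate tree, the labels `FSF_1` to `FSF_5`, and the function name `node_distances` are hypothetical and are not taken from the cited studies, whose exact rescaling may differ.

```python
from collections import deque


def node_distances(children, root):
    """Return {leaf: nd} for a rooted tree given as {parent: [children]}.

    Assumption: nd is the number of branches between the root and a leaf,
    min-max rescaled to the 0-1 interval across all leaves.
    """
    depth = {root: 0}
    queue = deque([root])
    while queue:  # breadth-first traversal from the root
        node = queue.popleft()
        for child in children.get(node, []):
            depth[child] = depth[node] + 1
            queue.append(child)

    # Leaves are nodes with no children of their own.
    leaves = {n: d for n, d in depth.items() if not children.get(n)}
    lo, hi = min(leaves.values()), max(leaves.values())
    span = hi - lo
    return {leaf: (d - lo) / span if span else 0.0 for leaf, d in leaves.items()}


# Hypothetical pectinate (comb-like) tree of five FSFs.
toy_tree = {
    "root": ["FSF_1", "n1"],
    "n1": ["FSF_2", "n2"],
    "n2": ["FSF_3", "n3"],
    "n3": ["FSF_4", "FSF_5"],
}

nd = node_distances(toy_tree, "root")
for fsf in sorted(nd, key=nd.get):
    print(f"{fsf}: nd = {nd[fsf]:.2f}")
```

In a pectinate topology each successive leaf sits one node deeper than the previous one, which is why a simple node count can order the taxa along a relative timeline.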

While the early rise of BE FSFs supports the early divergence of Archaea from the urancestor, the very significant trend of gradual loss of structures occurring in the lineages of the archaeal domain and the very late appearance of Archaea-specific structures (e.g., [126]) demand explanation. Since ancient BE FSFs are widely distributed in proteomes (Figure 10), they cannot arise from separate gains of FSFs in Bacteria and Eukarya or by processes of horizontal spread of structures. This was already evident from the Venn diagrams of Figure 6. Moreover, the BE sisterhood to the exclusion of Archaea was further supported by the inspection of FSFs involved in lipid synthesis and transport (Table 1). Membrane lipids are very relevant to the origins of diversified life ([146] and references therein). Bacteria and Eukarya encode similar lipid membranes, while archaeal membranes have different lipid composition (isoprenoid ethers). To check if

[Figure 10 graphic: rooted tree of FSF domains (taxa = 1,733 FSF domains; characters = 981 genomes; tree length = 458,226 steps; CI = 0.049, RI = 0.796, and $g_1$ = −0.08), with the most basal FSFs labeled (c.55.1, c.37.1, b.40.4, c.2.1, a.4.5, and c.66.1) and taxa binned by distribution index $f$ into five classes bounded at $f$ = 0.1, 0.5, and 0.9 and containing 529, 623, 336, 228, and 17 ($f$ = 1.0) FSFs; box plots show the age of FSFs (nd, 0.0 to 1.0) at the origin of FSF domains for the ABE, BE, AB, AE, B, A, and E taxonomic groups.]

Figure 10: Phylogenomic tree of domains (ToD) describing the evolution of 1,733 FSF domain structures. Taxa (FSFs) were colored according to their distribution ($f$) in the 981 genomes that were surveyed and used as characters to reconstruct the phylogenomic tree (data taken from [122, 123]). The most basal FSFs are labeled with SCOP alphanumeric identifiers (e.g., c.37.1 is the P-loop containing nucleoside triphosphate hydrolase FSF). Boxplots display the age (nd value) distribution of FSFs for the seven possible taxonomic groups. nd values were calculated directly from the tree [122] and define a timeline of FSF innovation from the origin of proteins (nd = 0) to the present (nd = 1). The group of FSFs that are shared by the three domains of life (ABE) is the most ancient taxonomic group, which spans the entire time axis, and their FSFs are widely distributed in genomes. The appearance of the BE group coincides with the first reductive loss of an FSF in Archaea. FSF structures specific to domains of life appear much later in evolution.

lipid synthesis was another BE synapomorphy, we identified 17 FSFs that were involved in lipid metabolism and transport


Table 1: List of FSFs involved in lipid metabolism and transport along with taxonomic distribution (data taken from [122, 123]).

Group  SCOP Id  FSF Id    FSF description
ABE    89392    b.125.1   Prokaryotic lipoproteins and lipoprotein localization factors
ABE    53092    c.55.2    Creatinase/prolidase N-terminal domain
ABE    49723    b.12.1    Lipase/lipooxygenase domain (PLAT/LH2 domain)
ABE    54637    d.38.1    Thioesterase/thiol ester dehydrase-isomerase
ABE    63825    b.68.5    YWTD domain
BE     47027    a.11.1    Acyl-CoA binding protein
BE     48431    a.118.4   Lipovitellin-phosvitin complex, superhelical domain
BE     55048    d.58.23   Probable ACP-binding domain of malonyl-CoA ACP transacylase
BE     56968    f.7.1     Lipovitellin-phosvitin complex, beta-sheet shell regions
BE     58113    h.5.1     Apolipoprotein A-I
BE     47162    a.24.1    Apolipoprotein
BE     56931    f.4.2     Outer membrane phospholipase A (OMPLA)
B      82220    b.120.1   Tp47 lipoprotein, N-terminal domain
E      47699    a.52.1    Bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin
E      82936    h.6.1     Apolipoprotein A-II
E      57190    g.3.10    Colipase-like
E      49594    b.7.4     Rab geranylgeranyltransferase alpha-subunit, insert domain

(Table 1). Remarkably, the largest share of these FSFs (7 out of 17) was unique to the BE group. In comparison, none were present in either AE or AB groups. The ABE group included five universal FSFs, while one was unique to Bacteria and four were Eukarya-specific. In turn, no FSF was unique to Archaea. The BE FSFs cannot be explained by modern effects impinging on variations in proteomic accumulation in FSFs or by processes of domain rearrangement, since these appear in the protein world quite late in evolution [92]. The only and most-parsimonious explanation of the patterns of FSF distribution that unfold in the ToD is the very early (and protracted) rise of the archaeal domain by processes of reductive evolution, possibly triggered by the adaptation of urancestral lineages to harsh environments and survival modes. Under extremophilic conditions, typical of hyperthermophilic environments, considerable investments of matter-energy and information must be made for protein persistence [120]. This puts limits on viable protein structures [147]. Extremophilic environments will thus poise the maintenance of a limited set of FSFs for persistence of emergent diversified lineages. This would induce a primordial episode of reductive evolution in the growing FSF repertoire, explaining why hyperthermophilic and thermophilic archaeal species hold the most reduced proteomes (Figure 7). It would also explain the biases that exist in FSFs, GO terms, and RNA families (Figure 6) and the placement of hyperthermophilic and thermophilic archaeal species at the base of ToLs. Since Archaea populate the oceans and sometimes rival in number Bacteria in those environments, we further interpret the late appearance of Archaea-specific FSFs as the result of late colonization of these mild environments by both ancient archaeons and emerging eukaryotes. This relaxes primordial extremophilic pressures on protein structures and enables the late archaeal exploration of structural flexibility and functional novelty.
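The tally of lipid-related FSFs reported above can be checked directly against Table 1. The following minimal sketch hard-codes the group and SCOP Id columns of Table 1 and counts FSFs per taxonomic group; the variable and function names are illustrative only.

```python
from collections import Counter

# (taxonomic group, SCOP Id) pairs transcribed from Table 1.
TABLE_1 = [
    ("ABE", 89392), ("ABE", 53092), ("ABE", 49723), ("ABE", 54637), ("ABE", 63825),
    ("BE", 47027), ("BE", 48431), ("BE", 55048), ("BE", 56968),
    ("BE", 58113), ("BE", 47162), ("BE", 56931),
    ("B", 82220),
    ("E", 47699), ("E", 82936), ("E", 57190), ("E", 49594),
]

# The seven possible taxonomic groups of a three-domain Venn diagram.
GROUPS = ["ABE", "BE", "AB", "AE", "A", "B", "E"]


def tally_groups(rows):
    """Count FSFs per taxonomic group, reporting empty groups as zero."""
    counts = Counter(group for group, _ in rows)
    return {group: counts.get(group, 0) for group in GROUPS}


counts = tally_groups(TABLE_1)
total = sum(counts.values())
for group in GROUPS:
    print(f"{group:>3}: {counts[group]:2d} of {total} ({counts[group] / total:.0%})")
```

No FSF falls in the AB, AE, or A groups, which is the distribution pattern that the text interprets as reductive loss in Archaea.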

6.4. Full Circle: Evidence from Trees of Proteomes and Functionomes and a Tree Derived from the Distribution of Viral Replicons in Superkingdoms. While we distrust trees of systems, especially ToLs built from sequences, the use of molecular structure at high levels of structural abstraction has the potential to mitigate some limitations of sequence analysis [93]. For example, rooted ToLs built from abundance counts of domain structures and terminal GO terms in the genomes of free-living organisms describe the evolution of proteomes (e.g., [91]). All ToL reconstructions of these kinds approximate the physiology of living organisms, dissect the three primary domains of life, and reveal the early paraphyletic origin of extremophilic archaeal lineages followed by the late appearances of monophyletic Bacteria and Eukarya. These patterns have been reliably recovered with datasets of varying sizes, irrespective of the structural classification scheme [91, 122, 126, 145]. Even a tree reconstructed from the distribution of 2,662 viral replicons in superkingdoms from an exhaustive comparative genomic analysis of viral genomes showed the basal placement of Archaea and the sister taxa relationship between Bacteria and Eukarya (Figure 11).

6.5. Additional Evidence from Comparative Genomics. The uneven distribution of protein domain structures in the world


[Figure 11 graphic: rooted tree of Archaea, Bacteria, and Eukarya with a bootstrap support value of 100, a scale bar of 5 changes, and internal-node abundance vectors (5, 0, 0, 0, 0, 0) and (36, 5, 5, 5, 0, 0).]

Figure 11: One of three optimal phylogenetic tree reconstructions with identical topologies recovered from an exhaustive maximum parsimony search (58 steps; CI = 1, RI = 1, HI = 0, RC = 1, $g_1$ = −0.707) of the abundance of viral replicon types of dsDNA, ssDNA, dsRNA, ssRNA(+), ssRNA(−), and retrotranscribing viruses. Abundance was scored on a 0–20 scale and ranged from 0 to 759 viral replicons. Vectors of abundance reconstructions in internal nodes are given as percentage of total abundance of replicons in superkingdoms. Bootstrap support values are shown below nodes.

of proteomes (Figure 6) is preserved as we climb in the structural hierarchy. This was recently made evident when studying the evolution of CATH domains [145]. The Venn diagrams of Figure 12 show how domain structures in all taxonomic groups decrease in numbers with increases in evolutionary conservation. At the highest CATH architectural level, there were 32 universal architectures but no domain-specific architectures. The four architectures shared by Bacteria and Eukarya were present in at least 60% of the proteomes that were surveyed. The other two interdomain architectures were topological designs of considerable complexity that were poorly shared between proteomes and were evolutionarily derived. They were the clam architectures lost in Eukarya and the box architectures of nucleotide excision repair shared by Archaea and Eukarya. The most parsimonious corollary of these distribution patterns is that the BE taxonomical group must arise by loss of structures in Archaea. Indeed, ToDs describing the evolution of CATH domain structures again confirm the early appearance of BE structures and, consequently, their loss in Archaea.
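The taxonomic groups used in these Venn diagrams follow from the presence or absence of a structure in each superkingdom. The sketch below is a minimal illustration, assuming that a structure counts as present in a superkingdom when it occurs in at least one of its proteomes; the census values and the labels `arch_1` to `arch_5` are invented for the example and do not come from [145].

```python
from collections import Counter


def venn_group(proteome_counts):
    """Map per-superkingdom proteome counts to a Venn group label.

    proteome_counts: dict with keys "A", "B", "E" giving the number of
    proteomes in Archaea, Bacteria, and Eukarya that contain the structure.
    Assumption: presence means a count greater than zero.
    """
    label = "".join(k for k in ("A", "B", "E") if proteome_counts.get(k, 0) > 0)
    return label or "absent"


# Hypothetical census: structure label -> number of proteomes containing it.
toy_census = {
    "arch_1": {"A": 40, "B": 300, "E": 120},  # found in all three superkingdoms
    "arch_2": {"A": 0, "B": 250, "E": 90},    # missing from Archaea only
    "arch_3": {"A": 3, "B": 5, "E": 0},
    "arch_4": {"A": 2, "B": 0, "E": 4},
    "arch_5": {"A": 0, "B": 0, "E": 7},
}

groups = {name: venn_group(c) for name, c in toy_census.items()}
print(groups)
print(Counter(groups.values()))

# Consistency check on the architecture counts quoted in the text:
# 32 universal (ABE) + 4 BE + 2 other interdomain = 38 architectures.
assert 32 + 4 + 2 == 38
```

A stricter presence threshold (for example, a minimum fraction of proteomes) would shift rarely shared structures between groups, so the choice of threshold is itself an assumption worth stating in any such census.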

7. Paraphyletic Origins: Grades and Clades in Archaeal History

Saying that “a ToL is rooted in a domain of life” is an incorrect statement that comes from phylogenetic methodology (the use of outgroups) and the tendency to look at the past with modern eyes. Clades in ToLs have been rooted relative to each other by generating unrooted trees and by defining extant organisms as outgroup taxa. The ToL, however, must be considered rooted in the urancestor of cellular life (Figure 1). This planted edge that connects to the ingroup of extant organisms represents a cellular state in which productive diversification (in the sense of successful lineages) was absent. The primordial urancestral edge leads to a “phase transition,” the last universal cellular ancestor (LUCA), of which little is known. The physiology of the urancestor cannot be considered linked to that of any extant organism, even if it shared a common molecular core with all of them. The urancestor was not an archaeon, a bacterium, or a eukaryote [148]. It was not necessarily thermophilic. Perhaps it was a communal entity or a megametaorganism in the sense of a modern syncytium (the result of multiple cell fusions) and a modern coenocyte (the result of multiple cell divisions). The organismal boundaries were likely present, judging by the number of widely distributed protein domains that associate with membranes and appear at the base of our ToDs and by the universal existence of acidocalcisome organelles [149]. However, the molecular makeup of the urancestral cells was most likely fluid and quasistatistical, the repertoire distributed unequally in the urancestral populations of communal parts, of course within confines delimited by persistence. This urancestral population is therefore consistent with the idea of a primordial stem line proposed by Kandler and Woese [150, 151]. However, it was relatively richer in molecular structures and functions, as opposed to the simple cellular systems hypothesized by Woese. This richness is confirmed by modern analyses of proteomes and functionomes that reveal a vast number of universally shared protein domains and GOs among the three superkingdoms. While each syncytial/coenocyte element of the megaorganism exchanged component parts in search of cellular stability and persistence, the process could not be equated with modern HGT. The exchanging community of primordial cells was not cohesive enough to make the horizontal exchange meaningful. Macromolecules most likely established loose and diverse associations with each other and with smaller molecules, limited by the short average life of their unevolved structural conformations. With time, molecules with better-optimized properties engaged in more durable interactions, stabilizing the emergent cells and providing increased cellular cohesiveness. This poised the urancestral community towards a phase transition (a crystallization [148]), a point in which cellular groups had distinct properties and could be individuated. We believe this was the time of the origin of the archaeal lineage, 2.9 billion years ago [48].

At the base of the ToLs that were reconstructed from genomic data, basal archaeal taxa arise as paraphyletic lineages (Figures 4(a) and 4(c)). These lineages likely arose from subgroups of the urancestral population that pervasively lost crucial domain structures and molecular functions. This represents an evolutionary grade under the scenario described above. The emerging lineages shared with the urancestral community a unifying condition that was related to archaic biochemistry. In other words, the urancestral and emerging archaeal lineages expressed fundamental structural and functional equivalences in terms of their repertoires but revealed in each emerging and durable paraphyletic lineage a handful of distinct newly developed traits. These traits could be global, such as increased thermostability of some crucial members of the protein repertoire or change in the membrane makeup, or local, such as the selective loss of crucial structures and functions. Figure 13 uses the tree paradigm to portray the structural and functional equivalences of the urancestral and emerging archaeal lineages and the slow progression from grades to clades.


[Figure 12 graphic: three Venn diagrams (superfamilies, topologies, and architectures) of Archaea, Bacteria, and Eukarya, arranged in order of increasing evolutionary conservation.]

Figure 12: Venn diagrams displaying the distributions of 2,221 homologous superfamilies, 1,152 topologies, and 38 architectures of CATH domains in the proteomes of 492 fully sequenced genomes (from [145]). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

[Figure 13 graphic: cartoon of lineages progressing from grades to clades between Time 0 and Time 1.]

Figure 13: From grades to clades. The cartoon describes a possible progression of modes of organismal diversification during the rise of primordial archaeal lineages. The width of emerging lineages is proportional to uniquely identifying features of physiological and molecular complexity.

8. Through the Wormhole: The Makeup of the “Megaorganism” and the Emerging Archaeal Lineages

Character state reconstructions of proteome repertoires derived from ToLs, coupled to the timelines of ToDs, provide an effective way to define the ancestral protein domain complement of the urancestor [48] and, consequently, the likely makeup of the emerging archaeal lineages. The urancestral proteome possessed a lower bound of ∼70 FSF domain structures, 75% of which were composed of α/β and α+β proteins. About 50% of FSFs were part of metabolic enzymes, including a rich toolkit of transferases and enzymes of nucleotide metabolism. The rest of the domains were involved in functions related to information (translation, replication, and repair), intracellular processes (transport, protein modification, and proteolytic activities), regulation (kinases/phosphatases and DNA binding functions), and small molecule binding. The urancestor had a limited repertoire of aaRSs and translation factors. It contained a primordial ribosome with a limited core of universal ribosomal proteins. It had numerous membrane proteins necessary for transport, including a relatively advanced ATP synthetase complex, and structures necessary for cellular organization (filaments and primordial cytoskeletal structures). The cells lacked enzymes for deoxyribonucleotide production, so it is likely that the cellular urancestor itself did not harbor a DNA genome. The cells lacked functions related to extracellular processes (cell adhesion, immune response, and toxins/defense) and cellular motility, suggesting an ancient living world without competitive strategies of survival.

9. Conclusions

The rooting of the ToL has always been controversial in evolutionary biology [26, 152, 153]. While it is popularly accepted that the ToL based on sequence phylogenies is rooted in the akaryotes and that Archaea and Eukarya are sister groups to each other, only two of the three main steps of phylogenetic analysis [104] have been partially fulfilled with sequences. This includes selecting an appropriate statistical or nonstatistical evolutionary model of character change and an optimization method for phylogenetic tree reconstruction. However, no adequate method exists for character polarization that identifies ancestral and derived character states in sequences. In the absence of robust polarization methodology, any statement about the rooting of the ToL should be considered suspect or subject of apriorism. Here we show that information derived from a genomic structural and functional census of millions of encoded proteins and RNAs, coupled with process models that comply with Weston’s generality criterion, provides the means to dissect the origins of diversified life. The generality criterion is fulfilled in these studies by focusing on the accumulation of modules such as protein domain structures, elements of RNA substructures, or ontogenetic definitions of molecular function. In general, these features are the subject of accretion processes that comply with additive phylogenetic change within the nested taxonomic hierarchy and result in changes of abundance. These processes include those responsible for the growth of molecules (e.g., multidomain proteins), molecular ensembles (e.g., the ribosome), and molecular repertoires (e.g., proteomes). The new methods unfold a consistent evolutionary scenario in which the origin of diversified life traces back to the early history of Archaea. Remarkably, the archaic origin of this microbial urkingdom now does justice to its name.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgments

The authors would like to thank members and friends of the GCA laboratory for fruitful discussions. This research has been supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791) and the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625) to Gustavo Caetano-Anollés and from the KRIBB Research Initiative Program and the Next-Generation BioGreen 21 Program, Rural Development Administration (PJ0090192013), to Kyung Mo Kim.

References

[1] C. R. Woese, “A new biology for a new century,” Microbiology and Molecular Biology Reviews, vol. 68, no. 2, pp. 173–186, 2004.

[2] T. Grant, “Testing methods: the evaluation of discovery operations in evolutionary biology,” Cladistics, vol. 18, no. 1, pp. 94–111, 2002.

[3] E. O. Wiley, “Karl R. Popper, systematics, and classification: a reply to Walter Bock and other evolutionary taxonomists,” Systematic Zoology, vol. 24, no. 2, pp. 233–243, 1975.

[4] J. S. Farris, “Parsimony and explanatory power,” Cladistics, vol. 24, no. 5, pp. 825–847, 2008.

[5] G. Caetano-Anollés, K. M. Kim, and D. Caetano-Anollés, “The phylogenomic roots of modern biochemistry: origins of proteins, cofactors and protein biosynthesis,” Journal of Molecular Evolution, vol. 74, no. 1-2, pp. 1–34, 2012.

[6] W. Fontana, “Modeling ‘evo-devo’ with RNA,” BioEssays, vol. 24, no. 12, pp. 1164–1177, 2002.

[7] W. Hennig, Phylogenetic Systematics, University of Illinois Press, Urbana, Ill, USA, 1999.

[8] D. Fraix-Burnet, T. Chattopadhyay, A. K. Chattopadhyay, E. Davoust, and M. Thuillard, “A six-parameter space to describe galaxy diversification,” Astronomy and Astrophysics, vol. 545, article A80, 24 pages, 2012.

[9] E. Sober, “The contest between parsimony and likelihood,” Systematic Biology, vol. 53, no. 4, pp. 644–653, 2004.

[10] E. K. Lienau and R. DeSalle, “Is the microbial tree of life verificationist?” Cladistics, vol. 26, no. 2, pp. 195–201, 2010.

[11] J. S. Huxley, “Evolutionary processes and taxonomy with special reference to grades,” Uppsala Universitets Årsskrift, vol. 6, pp. 21–39, 1958.

[12] C. R. Woese and G. E. Fox, “Phylogenetic structure of the prokaryotic domain: the primary kingdoms,” Proceedings of the National Academy of Sciences of the United States of America, vol. 74, no. 11, pp. 5088–5090, 1977.

[13] P. Forterre, “The universal tree of life and the last universal cellular ancestor: revolution and counterrevolutions,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 43–62, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[14] J. P. Gogarten, H. Kibak, P. Dittrich et al., “Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 17, pp. 6661–6665, 1989.

[15] N. Iwabe, K. Kuma, M. Hasegawa, S. Osawa, and T. Miyata, “Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 23, pp. 9355–9359, 1989.

[16] J. P. Gogarten and L. Olendzenski, “Orthologs, paralogs and genome comparisons,” Current Opinion in Genetics and Development, vol. 9, no. 6, pp. 630–636, 1999.

[17] C. R. Woese, O. Kandler, and M. L. Wheelis, “Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya,” Proceedings of the National Academy of Sciences of the United States of America, vol. 87, no. 12, pp. 4576–4579, 1990.

[18] J. Martijn and T. J. Ettema, “From archaeon to eukaryote: the evolutionary dark ages of the eukaryotic cell,” Biochemical Society Transactions, vol. 41, no. 1, pp. 451–457, 2013.

[19] C. J. Cox, P. G. Foster, R. P. Hirt, S. R. Harris, and T. M. Embley, “The archaebacterial origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 51, pp. 20356–20361, 2008.

[20] P. G. Foster, C. J. Cox, and T. M. Embley, “The primary divisions of life: a phylogenomic approach employing composition-heterogeneous methods,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2197–2207, 2009.

[21] T. A. Williams, P. G. Foster, T. M. Nye, C. J. Cox, and T. M. Embley, “A congruent phylogenomic signal places eukaryotes within the Archaea,” Proceedings of the Royal Society B: Biological Sciences, vol. 279, no. 1749, pp. 4870–4879, 2012.

[22] S. Gribaldo, A. M. Poole, V. Daubin, P. Forterre, and C. Brochier-Armanet, “The origin of eukaryotes and their relationship with the Archaea: are we at a phylogenomic impasse?” Nature Reviews Microbiology, vol. 8, no. 10, pp. 743–752, 2010.

[23] H. Philippe and P. Forterre, “The rooting of the universal tree of life is not reliable,” Journal of Molecular Evolution, vol. 49, no. 4, pp. 509–523, 1999.

[24] E. Sober and M. Steel, “Testing the hypothesis of common ancestry,” Journal of Theoretical Biology, vol. 218, no. 4, pp. 395–408, 2002.

[25] P. Forterre, “The common ancestor of Archaea and Eukarya was not an archaeon,” Archaea, vol. 2013, Article ID 372396, 18 pages, 2013.

[26] P. Forterre and H. Philippe, “Where is the root of the universal tree of life?” BioEssays, vol. 21, no. 10, pp. 871–879, 1999.

[27] S. Gribaldo and H. Philippe, “Ancient phylogenetic relationships,” Theoretical Population Biology, vol. 61, no. 4, pp. 391–408, 2002.

[28] J. K. Harris, S. T. Kelley, G. B. Spiegelman, and N. R. Pace, “The genetic core of the universal ancestor,” Genome Research, vol. 13, no. 3, pp. 407–412, 2003.

[29] A. Harish and G. Caetano-Anollés, “Ribosomal history reveals origins of modern protein synthesis,” PLoS ONE, vol. 7, no. 3, Article ID e32776, 2012.

[30] G. Caetano-Anollés and M. J. Seufferheld, “The coevolutionary roots of biochemistry and cellular organization challenge the RNA world paradigm,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 152–177, 2013.

[31] N. R. Pace, “Mapping the tree of life: progress and prospects,” Microbiology and Molecular Biology Reviews, vol. 73, no. 4, pp. 565–576, 2009.

[32] P. de Rijk, Y. van de Peer, I. van den Broeck, and R. de Wachter, “Evolution according to large ribosomal subunit RNA,” Journal of Molecular Evolution, vol. 41, no. 3, pp. 366–375, 1995.

[33] G. Caetano-Anollés, “Tracing the evolution of RNA structure in ribosomes,” Nucleic Acids Research, vol. 30, no. 11, pp. 2575–2587, 2002.


[34] J. Mallatt and C. J. Winchell, “Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes,” Molecular Biology and Evolution, vol. 19, no. 3, pp. 289–301, 2002.

[35] M. Gerstein, “Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census,” Proteins: Structure, Function and Bioinformatics, vol. 33, no. 4, pp. 518–534, 1998.

[36] M. Gerstein and H. Hegyi, “Comparing genomes in terms of protein structure: surveys of a finite parts list,” FEMS Microbiology Reviews, vol. 22, no. 4, pp. 277–304, 1998.

[37] B. Snel, P. Bork, and M. A. Huynen, “Genome phylogeny based on gene content,” Nature Genetics, vol. 21, no. 1, pp. 108–110, 1999.

[38] F. Tekaia, A. Lazcano, and B. Dujon, “The genomic tree as revealed from whole proteome comparisons,” Genome Research, vol. 9, no. 6, pp. 550–557, 1999.

[39] Y. I. Wolf, S. E. Brenner, P. A. Bash, and E. V. Koonin, “Distribution of protein folds in the three superkingdoms of life,” Genome Research, vol. 9, no. 1, pp. 17–26, 1999.

[40] J. O. Korbel, B. Snel, M. A. Huynen, and P. Bork, “SHOT: a web server for the construction of genome phylogenies,” Trends in Genetics, vol. 18, no. 3, pp. 158–162, 2002.

[41] G. Caetano-Anollés and D. Caetano-Anollés, “An evolutionarily structured universe of protein architecture,” Genome Research, vol. 13, no. 7, pp. 1563–1571, 2003.

[42] S. L. Baldauf, A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle, “A kingdom-level phylogeny of eukaryotes based on combined protein data,” Science, vol. 290, no. 5493, pp. 972–977, 2000.

[43] J. R. Brown, C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope, “Universal trees based on large combined protein sequence data sets,” Nature Genetics, vol. 28, no. 3, pp. 281–285, 2001.

[44] F. D. Ciccarelli, T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P. Bork, “Toward automatic reconstruction of a highly resolved tree of life,” Science, vol. 311, no. 5765, pp. 1283–1287, 2006.

[45] T. Dagan, M. Roettger, D. Bryant, and W. Martin, “Genome networks root the tree of life between prokaryotic domains,” Genome Biology and Evolution, vol. 2, no. 1, pp. 379–392, 2010.

[46] S. Jun, G. E. Sims, G. A. Wu, and S. Kim, “Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 133–138, 2010.

[47] E. A. Schultes, P. T. Hraber, and T. H. LaBean, “No molecule is an island: molecular evolution and the study of sequence space,” in Algorithmic Bioprocesses, A. Condon, D. Harel, J. N. Kok, A. Salomaa, and E. Winfree, Eds., pp. 675–704, Springer, Berlin, Germany, 2009.

[48] KM Kim andG Caetano-Anolles ldquoThe proteomic complexityand rise of the primordial ancestor of diversified liferdquo BMCEvolutionary Biology vol 11 no 1 article 140 2011

[49] V A Albert Parsimony Phylogeny and Genomics OxfordUniversity Press Oxford UK 2005

[50] G Caetano-Anolles ldquoEvolved RNA secondary structure andthe rooting of the universal tree of liferdquo Journal of MolecularEvolution vol 54 no 3 pp 333ndash345 2002

[51] J A Lake R G Skophammer C W Herbold and J A ServinldquoGenome beginnings rooting the tree of liferdquo PhilosophicalTransactions of the Royal Society B Biological Sciences vol 364no 1527 pp 2177ndash2185 2009

[52] B GuoM Zou andAWagner ldquoPervasive indels and their evo-lutionary dynamics after the fish-specific genome duplicationrdquoMolecular Biology and Evolution vol 29 no 10 pp 3005ndash30222012

[53] M K Basu I B Rogozin O Deusch T Dagan W Martinand E V Koonin ldquoEvolutionary dynamics of introns in plastid-derived genes in plants saturation nearly reached but slowintron gain continuesrdquoMolecular Biology and Evolution vol 25no 1 pp 111ndash119 2008

[54] J S Farris ldquoPhylogenetic analysis underDollorsquos Lawrdquo SystematicBiology vol 26 no 1 pp 77ndash88 1977

[55] H. Fang, M. E. Oates, R. B. Pethica et al., "A daily-updated tree of (sequenced) life as a reference for genome research," Scientific Reports, vol. 3, article 2015, 2013.

[56] T. Cavalier-Smith, "The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification," International Journal of Systematic and Evolutionary Microbiology, vol. 52, no. 1, pp. 7–76, 2002.

[57] T. Cavalier-Smith, "Rooting the tree of life by transition analyses," Biology Direct, vol. 1, article 19, 2006.

[58] H. S. Kim, J. E. Mittenthal, and G. Caetano-Anollés, "Widespread recruitment of ancient domain structures in modern enzymes during metabolic evolution," Journal of Integrative Bioinformatics, vol. 10, no. 1, article 214, 2013.

[59] G. Caetano-Anollés, M. Wang, and D. Caetano-Anollés, "Structural phylogenomics retrodicts the origin of the genetic code and uncovers the evolutionary impact of protein flexibility," PLoS ONE, vol. 8, no. 8, Article ID e72225, 2013.

[60] E. Bapteste and C. Brochier, "On the conceptual difficulties in rooting the tree of life," Trends in Microbiology, vol. 12, no. 1, pp. 9–13, 2004.

[61] A. M. Poole, "Horizontal gene transfer and the earliest stages of the evolution of life," Research in Microbiology, vol. 160, no. 7, pp. 473–480, 2009.

[62] D. Raoult, "The post-Darwinist rhizome of life," The Lancet, vol. 375, no. 9709, pp. 104–105, 2010.

[63] W. F. Doolittle, "Phylogenetic classification and the universal tree," Science, vol. 284, no. 5423, pp. 2124–2128, 1999.

[64] T. Dagan and W. Martin, "The tree of one percent," Genome Biology, vol. 7, no. 10, article 118, 2006.

[65] W. F. Doolittle and E. Bapteste, "Pattern pluralism and the tree of life hypothesis," Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 7, pp. 2043–2049, 2007.

[66] E. V. Koonin, "Towards a postmodern synthesis of evolutionary biology," Cell Cycle, vol. 8, no. 6, pp. 799–800, 2009.

[67] T. Kloesges, O. Popa, W. Martin, and T. Dagan, "Networks of gene sharing among 329 proteobacterial genomes reveal differences in lateral gene transfer frequency at different phylogenetic depths," Molecular Biology and Evolution, vol. 28, no. 2, pp. 1057–1074, 2011.

[68] S. Halary, J. W. Leigh, B. Cheaib, P. Lopez, and E. Bapteste, "Network analyses structure genetic diversity in independent genetic worlds," Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 127–132, 2010.

[69] S. S. Abby, E. Tannier, M. Gouy, and V. Daubin, "Lateral gene transfer as a support for the tree of life," Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 13, pp. 4962–4967, 2012.

[70] N. Glansdorff, Y. Xu, and B. Labedan, "The conflict between horizontal gene transfer and the safeguard of identity: origin of meiotic sexuality," Journal of Molecular Evolution, vol. 69, no. 5, pp. 470–480, 2009.

[71] C. G. Kurland and O. G. Berg, "A hitchhiker's guide to evolving networks," in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 361–396, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[72] L. P. Villarreal and G. Witzany, "The DNA habitat and its RNA inhabitants: at the dawn of RNA sociology," Genomics Insights, vol. 6, pp. 1–12, 2013.

[73] S. Gribaldo and C. Brochier, "Phylogeny of prokaryotes: does it exist and why should we care?" Research in Microbiology, vol. 160, no. 7, pp. 513–521, 2009.

[74] B. D. Mishler, "The logic of the data matrix in phylogenetic analysis," in Parsimony, Phylogeny, and Genomics, V. A. Albert, Ed., pp. 57–69, Oxford University Press, New York, NY, USA, 2005.

[75] A. Dress, V. Moulton, M. Steel, and T. Wu, "Species, clusters and the 'tree of life': a graph-theoretic perspective," Journal of Theoretical Biology, vol. 265, no. 4, pp. 535–542, 2010.

[76] N. Lane and W. Martin, "The energetics of genome complexity," Nature, vol. 467, no. 7318, pp. 928–934, 2010.

[77] M. C. Rivera and J. A. Lake, "The ring of life provides evidence for a genome fusion origin of eukaryotes," Nature, vol. 431, no. 7005, pp. 152–155, 2004.

[78] D. Alvarez-Ponce and J. O. McInerney, "The human genome retains relics of its prokaryotic ancestry: human genes of archaebacterial and eubacterial origin exhibit remarkable differences," Genome Biology and Evolution, vol. 3, no. 1, pp. 782–790, 2011.

[79] A. Poole and D. Penny, "Eukaryote evolution: engulfed by speculation," Nature, vol. 447, no. 7147, p. 913, 2007.

[80] E. V. Koonin and Y. I. Wolf, "The fundamental units, processes and patterns of evolution, and the tree of life conundrum," Biology Direct, vol. 4, no. 1, article 33, 2009.

[81] P. Puigbò, Y. I. Wolf, and E. V. Koonin, "Search for a 'tree of life' in the thicket of the phylogenetic forest," Journal of Biology, vol. 8, no. 6, article 59, 2009.

[82] P. Puigbò, Y. I. Wolf, and E. V. Koonin, "Seeing the tree of life behind the phylogenetic forest," BMC Biology, vol. 11, article 46, 2013.

[83] D. B. Wetlaufer, "Nucleation, rapid folding, and globular intrachain regions in proteins," Proceedings of the National Academy of Sciences of the United States of America, vol. 70, no. 3, pp. 697–701, 1973.

[84] J. S. Richardson, "The anatomy and taxonomy of protein structure," Advances in Protein Chemistry, vol. 34, pp. 167–339, 1981.

[85] M. Riley and B. Labedan, "Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module," Journal of Molecular Biology, vol. 268, no. 5, pp. 857–868, 1997.

[86] J. Janin and S. J. Wodak, "Structural domains in proteins and their role in the dynamics of protein function," Progress in Biophysics and Molecular Biology, vol. 42, pp. 21–78, 1983.

[87] G. Caetano-Anollés, M. Wang, D. Caetano-Anollés, and J. E. Mittenthal, "The origin, evolution and structure of the protein world," Biochemical Journal, vol. 417, no. 3, pp. 621–637, 2009.

[88] J. Gough, "Convergent evolution of domain architectures (is rare)," Bioinformatics, vol. 21, no. 8, pp. 1464–1471, 2005.

[89] K. Forslund, A. Henricson, V. Hollich, and E. L. L. Sonnhammer, "Domain tree-based analysis of protein architecture evolution," Molecular Biology and Evolution, vol. 25, no. 2, pp. 254–264, 2008.

[90] S. Yang and P. E. Bourne, "The evolutionary history of protein domains viewed by species phylogeny," PLoS ONE, vol. 4, no. 12, Article ID e8378, 2009.

[91] K. M. Kim and G. Caetano-Anollés, "The evolutionary history of protein fold families and proteomes confirms that the archaeal ancestor is more ancient than the ancestors of other superkingdoms," BMC Evolutionary Biology, vol. 12, article 13, 2012.

[92] M. Wang and G. Caetano-Anollés, "The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world," Structure, vol. 17, no. 1, pp. 66–78, 2009.

[93] G. Caetano-Anollés and A. Nasir, "Benefits of using molecular structure and abundance in phylogenomic analysis," Frontiers in Genetics, vol. 3, article 172, 2012.

[94] K. Brading and E. Castellani, Symmetries in Physics: Philosophical Reflections, Cambridge University Press, Cambridge, UK, 2003.

[95] J. A. Cotton and M. Wilkinson, "Quantifying the potential utility of phylogenetic characters," Taxon, vol. 57, no. 1, pp. 131–136, 2008.

[96] E. Mossel and M. Steel, "A phase transition for a random cluster model on phylogenetic trees," Mathematical Biosciences, vol. 187, no. 2, pp. 189–203, 2004.

[97] M. H. Bailor, X. Sun, and H. M. Al-Hashimi, "Topology links RNA secondary structure with global conformation, dynamics, and adaptation," Science, vol. 327, no. 5962, pp. 202–206, 2010.

[98] K. Illergård, D. H. Ardell, and A. Elofsson, "Structure is three to ten times more conserved than sequence – a study of structural response in protein cores," Proteins: Structure, Function and Bioinformatics, vol. 77, no. 3, pp. 499–508, 2009.

[99] G. Caetano-Anollés, S. K. Hee, and J. E. Mittenthal, "The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture," Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 22, pp. 9358–9363, 2007.

[100] P. H. Weston, "Methods for rooting cladistic trees," in Models in Phylogeny Reconstruction, D. J. Siebert, R. W. Scotland, and D. M. Williams, Eds., pp. 125–155, Oxford University Press, Oxford, UK, 1994.

[101] H. N. Bryant, "The polarization of character transformations in phylogenetic systematics: role of axiomatic and auxiliary assumptions," Systematic Biology, vol. 40, no. 4, pp. 433–445, 1991.

[102] H. N. Bryant and G. Wagner, "Character polarity and the rooting of cladograms," in The Character Concept in Evolutionary Biology, G. P. Wagner, Ed., pp. 319–338, Academic Press, New York, NY, USA, 2001.

[103] G. Nelson, "Ontogeny, phylogeny, paleontology, and the biogenetic law," Systematic Biology, vol. 27, no. 3, pp. 324–345, 1978.

[104] P. H. Weston, "Indirect and direct methods in systematics," in Ontogeny and Systematics, C. J. Humphries, Ed., pp. 27–56, Columbia University Press, New York, NY, USA, 1988.

[105] F.-J. Sun and G. Caetano-Anollés, "The evolutionary history of the structure of 5S ribosomal RNA," Journal of Molecular Evolution, vol. 69, no. 5, pp. 430–443, 2009.

[106] X. Fang, T. Pan, and T. R. Sosnick, "Mg2+-dependent folding of a large ribozyme without kinetic traps," Nature Structural Biology, vol. 6, no. 12, pp. 1091–1095, 1999.

[107] F.-J. Sun and G. Caetano-Anollés, "Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses," PLoS Computational Biology, vol. 4, no. 3, Article ID e1000018, 2008.

[108] F.-J. Sun and G. Caetano-Anollés, "The origin and evolution of tRNA inferred from phylogenetic analysis of structure," Journal of Molecular Evolution, vol. 66, no. 1, pp. 21–35, 2008.

[109] W. R. Ashby, An Introduction to Cybernetics, Taylor & Francis, London, UK, 1955.

[110] J. G. Lundberg, "Wagner networks and ancestors," Systematic Biology, vol. 21, no. 4, pp. 398–413, 1972.

[111] P. T. Chippindale and J. J. Wiens, "Weighting, partitioning, and combining characters in phylogenetic analysis," Systematic Biology, vol. 43, no. 2, pp. 278–287, 1994.

[112] J. Mittenthal, D. Caetano-Anollés, and G. Caetano-Anollés, "Biphasic patterns of diversification and the emergence of modules," Frontiers in Genetics, vol. 3, article 147, 2012.

[113] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, "SCOP: a structural classification of proteins database for the investigation of sequences and structures," Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[114] C. A. Orengo, A. Michie, S. Jones, D. T. Jones, M. Swindells, and J. M. Thornton, "CATH – a hierarchic classification of protein domain structures," Structure, vol. 5, no. 8, pp. 1093–1109, 1997.

[115] G. Sawa, J. Dicks, and I. N. Roberts, "Current approaches to whole genome phylogenetic analysis," Briefings in Bioinformatics, vol. 4, no. 1, pp. 63–74, 2003.

[116] R. Sathyapriya, J. M. Duarte, H. Stehr, I. Filippis, and M. Lappe, "Defining an essence of structure determining residue contacts in proteins," PLoS Computational Biology, vol. 5, no. 12, Article ID e1000584, 2009.

[117] D. Penny and L. J. Collins, "Evolutionary genomics leads the way," in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 1–16, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[118] M. Ashburner, C. A. Ball, J. A. Blake et al., "Gene ontology: tool for the unification of biology," Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.

[119] A. Andreeva, D. Howorth, J. Chandonia et al., "Data growth and its impact on the SCOP database: new developments," Nucleic Acids Research, vol. 36, pp. D419–D425, 2008.

[120] L. S. Yafremava, M. Wielgos, S. Thomas et al., "A general framework of persistence strategies for biological systems helps explain domains of life," Frontiers in Genetics, vol. 4, article 16, 2013.

[121] J. L. Friar, T. Goldman, and J. Pérez-Mercader, "Genome sizes and the Benford distribution," PLoS ONE, vol. 7, no. 5, Article ID e36624, 2012.

[122] A. Nasir, K. M. Kim, and G. Caetano-Anollés, "Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya," BMC Evolutionary Biology, vol. 12, article 156, 2012.

[123] A. Nasir and G. Caetano-Anollés, "Comparative analysis of proteomes and functionomes provides insights into origins of cellular diversification," Archaea, vol. 2013, Article ID 648746, 13 pages, 2013.

[124] W. Martin and M. Müller, "The hydrogen hypothesis for the first eukaryote," Nature, vol. 392, no. 6671, pp. 37–41, 1998.

[125] A. Nasir, K. M. Kim, and G. Caetano-Anollés, "Global patterns of protein domain gain and loss in superkingdoms," PLoS Computational Biology, vol. 10, no. 1, Article ID e1003452, 2014.

[126] M. Wang, L. S. Yafremava, D. Caetano-Anollés, J. E. Mittenthal, and G. Caetano-Anollés, "Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world," Genome Research, vol. 17, no. 11, pp. 1572–1585, 2007.

[127] L. Guy and T. J. G. Ettema, "The archaeal 'TACK' superphylum and the origin of eukaryotes," Trends in Microbiology, vol. 19, no. 12, pp. 580–587, 2011.

[128] D. Schlieper, M. A. Oliva, J. M. Andreu, and J. Löwe, "Structure of bacterial tubulin BtubA/B: evidence for horizontal gene transfer," Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 26, pp. 9170–9175, 2005.

[129] M. P. Hoeppner, P. P. Gardner, and A. M. Poole, "Comparative analysis of RNA families reveals distinct repertoires for each domain of life," PLoS Computational Biology, vol. 8, no. 11, Article ID e1002752, 2012.

[130] H. Xue, K. Tong, C. Marck, H. Grosjean, and J. T. Wong, "Transfer RNA paralogs: evidence for genetic code-amino acid biosynthesis coevolution and an archaeal root of life," Gene, vol. 310, no. 1-2, pp. 59–66, 2003.

[131] H. Xue, S. Ng, K. Tong, and J. T. Wong, "Congruence of evidence for a Methanopyrus-proximal root of life based on transfer RNA and aminoacyl-tRNA synthetase genes," Gene, vol. 360, no. 2, pp. 120–130, 2005.

[132] M. Di Giulio, "Nanoarchaeum equitans is a living fossil," Journal of Theoretical Biology, vol. 242, no. 1, pp. 257–260, 2006.

[133] M. Di Giulio, "The tree of life might be rooted in the branch leading to Nanoarchaeota," Gene, vol. 401, no. 1-2, pp. 108–113, 2007.

[134] J. T. Wong, J. Chen, W. Mat, S. Ng, and H. Xue, "Polyphasic evidence delineating the root of life and roots of biological domains," Gene, vol. 403, no. 1-2, pp. 39–52, 2007.

[135] F.-J. Sun and G. Caetano-Anollés, "The ancient history of the structure of ribonuclease P and the early origins of Archaea," BMC Bioinformatics, vol. 11, article 153, 2010.

[136] C. Marck and H. Grosjean, "tRNomics: analysis of tRNA genes from 50 genomes of Eukarya, Archaea, and bacteria reveals anticodon-sparing strategies and domain-specific features," RNA, vol. 8, no. 10, pp. 1189–1232, 2002.

[137] W. Mat, H. Xue, and J. T. Wong, "The genomics of LUCA," Frontiers in Bioscience, vol. 13, no. 14, pp. 5605–5613, 2008.

[138] C. Brochier, S. Gribaldo, Y. Zivanovic, F. Confalonieri, and P. Forterre, "Nanoarchaea: representatives of a novel archaeal phylum or a fast-evolving euryarchaeal lineage related to Thermococcales?" Genome Biology, vol. 6, no. 5, article R42, 2005.

[139] B. Billoud, M. Guerrucci, M. Masselot, and J. S. Deutsch, "Cirripede phylogeny using a novel approach: molecular morphometrics," Molecular Biology and Evolution, vol. 17, no. 10, pp. 1435–1445, 2000.

[140] L. J. Collins, V. Moulton, and D. Penny, "Use of RNA secondary structure for studying the evolution of RNase P and RNase MRP," Journal of Molecular Evolution, vol. 51, no. 3, pp. 194–204, 2000.

[141] G. Caetano-Anollés, "Novel strategies to study the role of mutation and nucleic acid structure in evolution," Plant Cell, Tissue and Organ Culture, vol. 67, no. 2, pp. 115–132, 2001.

[142] F.-J. Sun, S. Fleurdepine, C. Bousquet-Antonelli, G. Caetano-Anollés, and J. Deragon, "Common evolutionary trends for SINE RNA structures," Trends in Genetics, vol. 23, no. 1, pp. 26–33, 2007.

[143] D. Caetano-Anollés, K. M. Kim, J. E. Mittenthal, and G. Caetano-Anollés, "Proteome evolution and the metabolic origins of translation and cellular life," Journal of Molecular Evolution, vol. 72, no. 1, pp. 14–33, 2011.

[144] M. Wang, Y. Jiang, K. M. Kim et al., "A universal molecular clock of protein folds and its power in tracing the early history of aerobic metabolism and planet oxygenation," Molecular Biology and Evolution, vol. 28, no. 1, pp. 567–582, 2011.

[145] S. A. Bukhari and G. Caetano-Anollés, "Origin and evolution of protein fold designs inferred from phylogenomic analysis of CATH domain structures in proteomes," PLoS Computational Biology, vol. 9, no. 3, Article ID e1003009, 2013.

[146] M. Vesteg and J. Krajčovič, "The falsifiability of the models for the origin of eukaryotes," Current Genetics, vol. 57, no. 6, pp. 367–390, 2011.

[147] I. N. Berezovsky and E. I. Shakhnovich, "Physics and evolution of thermophilic adaptation," Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 36, pp. 12742–12747, 2005.

[148] C. Woese, "The universal ancestor," Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 12, pp. 6854–6859, 1998.

[149] M. J. Seufferheld and G. Caetano-Anollés, "Phylogenomics supports a cellularly structured urancestor," Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 178–191, 2013.

[150] O. Kandler, "Cell wall biochemistry and three-domain concept of life," Systematic and Applied Microbiology, vol. 16, no. 4, pp. 501–509, 1994.

[151] C. R. Woese, "On the evolution of cells," Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 13, pp. 8742–8747, 2002.

[152] S. L. Baldauf, J. D. Palmer, and W. F. Doolittle, "The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 15, pp. 7749–7754, 1996.

[153] H. Leffers, J. Kjems, L. Østergaard, N. Larsen, and R. A. Garrett, "Evolutionary relationships amongst archaebacteria. A comparative study of 23 S ribosomal RNAs of a sulphur-dependent extreme thermophile, an extreme halophile and a thermophilic methanogen," Journal of Molecular Biology, vol. 195, no. 1, pp. 43–61, 1987.

[154] J. Martin, D. Blackburn, and E. Wiley, "Are node-based and stem-based clades equivalent? Insights from graph theory," PLoS Currents, vol. 2, Article ID RRN1196, 2010.


the "data processing inequality" that destroys information and enabling deep evolutionary information.

(ii) Characters Must Show at Least Two Distinct Character States. One of these two states (transformational homologs) must be ancestral (the "plesiomorphic" state) and the other must be derived (the "apomorphic" state) [74]. Only shared and derived features (synapomorphies) provide vertical phylogenetic evidence. Consequently, determining the relative ancestry of alternative character states defines the polarity of character transformations and roots the underlying tree. This is a fundamental property of phylogenetic inference. Polarization in tree reconstruction enables the "arrow of time" (sensu Eddington's entropy-induced asymmetry), solves the rooting problem, and fulfills other epistemological requirements.

Cladistically speaking, character polarity refers to the distinction between the ancestral and derived states and the identification of synapomorphies. However, an evolutionary view of polarity also refers to the direction of character state transformations in the phylogenetic model. Historically, three accepted alternatives have been available for rooting trees [100–102]: the outgroup comparison, the ontogenetic method, and the paleontological or stratigraphic method. While the three methods do not include assumptions of evolutionary process, they have been the subject of much discussion and their interpretation of much controversy. The first two are, however, justified by the assumption that diversity results in a nested taxonomic hierarchy, which may or may not be induced by evolution. We will not discuss the stratigraphic method as it relies on auxiliary assumptions regarding the completeness of the fossil record. The midpoint rooting criterion mentioned earlier will not be discussed either. The procedure is contested by the existence of heterogeneities in rates of change across the trees and problems with the accurate characterization of phylogenetic distances.

In outgroup comparison, polarity is inferred by the distribution of character states in the ingroup (group of taxa of interest) and the sister group (taxa outside the group of interest). In a simple case, if the character state is only found in the ingroup, it must be considered derived. Outgroup comparison is by far the method of choice because phylogeneticists tend to have confidence in the supporting assumptions: higher-level relationships are outside the ingroup, equivalent ontogenetic stages are compared, and character state distributions are appropriately surveyed. Unfortunately, the method is "indirect" in that it depends on the assumption of the existence of a higher-level relationship between the outgroup and the ingroup. Consequently, the method cannot root the ToL because at that level there is no higher-level relationship that is presently available. Moreover, the method in itself does not polarize characters. It simply connects the ingroup to the rest of the ToL [100].
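The simple case of outgroup comparison described above can be sketched in a few lines of Python. This is only an illustration, assuming a single character scored with discrete states; the function name and the taxon labels are hypothetical and do not come from any phylogenetic package.

```python
def polarize_by_outgroup(ingroup, outgroup):
    """Polarize one character by outgroup comparison (simple case only).

    ingroup, outgroup: dicts mapping taxon name -> observed character state.
    A state seen in the ingroup but absent from the outgroup is called
    'derived'; a state shared with the outgroup is called 'ancestral'.
    """
    outgroup_states = set(outgroup.values())
    polarity = {}
    for state in sorted(set(ingroup.values())):
        polarity[state] = "ancestral" if state in outgroup_states else "derived"
    return polarity


# Hypothetical toy data: state 1 occurs only inside the ingroup.
ingroup = {"taxon_A": 1, "taxon_B": 1, "taxon_C": 0}
outgroup = {"sister_1": 0, "sister_2": 0}
print(polarize_by_outgroup(ingroup, outgroup))  # {0: 'ancestral', 1: 'derived'}
```

The sketch makes the dependency explicit: without an `outgroup` argument there is nothing to compare against, which is the situation faced when rooting the ToL.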

The ontogenetic criterion confers polarity through the distribution of the states of homologous characters in ontogenies of the ingroup, generally by focusing on the generality of character states, with more widely distributed states being considered ancestral. Nelson's rule states, "given an ontogenetic character transformation, from a character observed to be more general to a character observed to be less general, the more general character is primitive and the less general advanced" [103]. This "biogenetic law" appears powerful in that it depends only on the assumption that ontogenies of ingroup taxa are properly surveyed. It is also a "direct" method that relies exclusively on the ingroup. Consequently, it has the potential to root the ToL. Unfortunately, Nelson's "generality" has been interpreted in numerous ways, especially as it relates to the ontogenetic sequence, leading to much confusion [102]. It also involves comparison of developmentally nested and distinct life history stages, making it difficult to extend the method (originally conceptualized for vertebrate phylogeny) to the microbial world.

However, Weston [100, 104] made it clear that the ontogenetic criterion embodies a wider "generality criterion" in which the taxic distribution of a character state is a subset of the distribution of another. In other words, character states that characterize an entire group must be considered ancestral relative to an alternative state that characterizes a subset of the group. Besides the centrality of nested patterns, the generality criterion embeds the core assumption that every homology is a synapomorphy in nature's nested taxonomic hierarchy and that homologies in the hierarchy result from additive phylogenetic change [100]. Weston's more general rule therefore states that "given a distribution of two homologous characters in which one, x, is possessed by all of the species that possess its homolog, character y, and by at least one other species that does not, then y may be postulated to be apomorphous relative to x" [104]. The only assumption of the method is that relevant character states in the ingroup are properly surveyed. This new rule crucially substitutes the concept of ontogenetic transformation by the more general concept of homology and additive phylogenetic change, which can be applied to cases in which homologous entities accumulate "iteratively" in evolution (e.g., generation of paralogous genes by duplication). Since horizontally acquired characters (xenologs) are not considered synapomorphies, they contribute towards phylogenetic noise and are excluded after calculation of homoplasy and retention indices (i.e., measures of goodness of fit of characters to the phylogeny).
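Weston's rule, as quoted above, amounts to a proper-subset test on the taxic distributions of two homologous characters. The following Python sketch makes that reading explicit; the function name and the species labels are hypothetical, and the code assumes that the ingroup has been properly surveyed, as the rule itself requires.

```python
def weston_apomorphic(species_with_x, species_with_y):
    """Apply Weston's generality rule to two homologous characters x and y.

    Returns 'y' if y may be postulated apomorphic relative to x, 'x' for the
    reverse case, and None when the two distributions are not properly nested
    (identical, disjoint, or only partially overlapping).
    """
    x, y = set(species_with_x), set(species_with_y)
    if y and y < x:   # every species with y also has x, plus at least one more
        return "y"
    if x and x < y:
        return "x"
    return None


# Hypothetical survey: x is found in four species, y in a nested subset of two.
print(weston_apomorphic({"sp1", "sp2", "sp3", "sp4"}, {"sp3", "sp4"}))  # y
print(weston_apomorphic({"sp1", "sp2"}, {"sp2", "sp3"}))                # None
```

Returning `None` for non-nested distributions reflects the point made in the text that the criterion rests on nested patterns; it makes no statement when nesting is absent.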

Figure 4: Trees of life generated from the structure of RNA and protein molecules congruently show a rooting in Archaea. (a) A rooted phylogenetic tree of 5S rRNA reconstructed from both the sequence and the structure of the molecules (from [105]). Taxa = 666 5S rRNA molecules of diverse organisms; tree length = 11,381 steps; CI = 0.072, RI = 0.772, and g1 = −0.13. (b) Global most-parsimonious scenario of organismal diversification based on tRNA (from [107]). A total of 571 tRNA molecules with sequence, base modification, and structural information were used to build a ToL, which failed to show monophyletic groupings. Ancestries of lineages were then inferred by constraining sets of tRNAs into monophyletic groups representing competing (shown in boxes) or noncompeting phylogenetic hypotheses and measuring tree suboptimality and lineage coalescence (illustrated with color hues in circles). (c) A ToL reconstructed from the genomic abundance counts of 1510 FSFs as phylogenetic characters in the proteomes of 150 free-living organisms sampled equally and randomly from the three domains of life (data taken from [122, 123]). Tree length = 70,132 steps; CI = 0.155, RI = 0.739, and g1 = −0.28. Taxa were labeled with circles colored according to superkingdom. CI = consistency index, RI = retention index, and g1 = gamma distribution parameter.

We have applied the "generality criterion" to the rooting of the ToL through polarization strategies that embody axioms of evolutionary process. Figure 4 shows three examples. A rooted phylogeny describing the evolution of 5S rRNA molecules sampled from a wide range of organisms was reconstructed from molecular sequence and structure [105]. The ToL that was recovered was rooted paraphyletically in Archaea (Figure 4(a)). The model of character state transformation was based on the axiom that evolved RNA molecules are optimized to increase molecular persistence and produce highly stable folded conformations. Molecular persistence materializes in RNA structure in vitro, with base pairs associating and disassociating at rates as high as 0.5 s−1 [106]. The frustrated kinetics and energetics of this folding process enable some structural conformations to quickly reach stable states. This process is evolutionarily optimized through structural canalization [6], in which evolution attains molecular functions by both increasing the average life and stability of selected conformations and decreasing their relative number. Thus, conformational diversity measured, for example, by the Shannon entropy of the base-pairing probability matrix or features of thermodynamic stability acts as an "evo-devo" proxy of a generality criterion for RNA molecules, in which the criterion of similarity (e.g., ontogenetic transformation) is of positional and compositional correspondence.

Using a different approach, we recently broadened the use of phylogenetic constraint analysis [107, 108], borrowing from a formal cybernetic method that decomposes a reconstructable system into its components [109], and used it to root a ToL derived from tRNA sequence and structure (Figure 4(b)). The ToL was again rooted in Archaea. The number of additional steps required to force (constrain) particular taxa into a monophyletic group was used to define a lineage coalescence distance (S) with which to test alternative hypotheses of monophyly. These hypotheses were then ordered according to S value in an evolutionary timeline. Since S records the relative distribution of character states in taxic sets, it also embodies the generality criterion of rooting.

Finally, we generated rooted ToLs that describe the evolution of proteomes directly from a census of protein fold structures in proteomes (Figure 4(c)). This ToL shows a paraphyletic rooting in Archaea. The method extracts phylogenetic signal from proteomic abundance of protein fold structures and considers that the most abundant and widely distributed folds are of ancient origin when defining transformation series [41]. This polarization scheme, which results in the gradual growth of the proteome repertoire, again represents an embodiment of the generality criterion in which statements of homology (fold structures) result from additive phylogenetic change (increases in abundance). It is noteworthy that these ToL reconstructions take into account the genomic abundance of each and every fold structure in each proteome and across the entire matrix, thereby generating a frustrated system. In general, very high abundance of only a few folds will not attract taxa (species) to the derived branches in the ToL. The ancestry of taxa is determined by both the abundance and interplay among fold structure characters. For example, metabolic folds such as those involved in ATP hydrolysis are widespread in cells and considered to be very ancient. These are also the most ancient folds in our phylogenies. In comparison, some popular eukaryote-specific folds (e.g., immunoglobulin superfamilies) are highly abundant but appear in a derived manner in our phylogenies. Thus, we reason that there is no circularity involved in the character polarization scheme.
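The contrast drawn above between folds that are abundant and widely distributed and folds that are abundant but restricted to one group can be illustrated with a toy census. The sketch below is not the phylogenomic pipeline used for Figure 4(c); it only tabulates, for invented counts and hypothetical fold labels, how widely each fold is distributed and how abundant it is in total.

```python
def fold_census(proteomes):
    """Summarize a census of fold structures across proteomes.

    proteomes: dict mapping proteome name -> {fold name: genomic abundance}.
    Returns {fold: (distribution_index, total_abundance)}, where the
    distribution index is the fraction of proteomes containing the fold.
    """
    folds = sorted({fold for counts in proteomes.values() for fold in counts})
    n = len(proteomes)
    census = {}
    for fold in folds:
        present = sum(1 for counts in proteomes.values() if counts.get(fold, 0) > 0)
        total = sum(counts.get(fold, 0) for counts in proteomes.values())
        census[fold] = (present / n, total)
    return census


# Invented counts for illustration only (not data from the article).
toy = {
    "archaeon_1": {"atp_hydrolysis_fold": 40, "immunoglobulin_fold": 0},
    "bacterium_1": {"atp_hydrolysis_fold": 60, "immunoglobulin_fold": 0},
    "eukaryote_1": {"atp_hydrolysis_fold": 300, "immunoglobulin_fold": 900},
}
for fold, (spread, total) in fold_census(toy).items():
    print(f"{fold}: found in {spread:.0%} of proteomes, total abundance {total}")
```

In this toy matrix the immunoglobulin-like fold has the larger total abundance but occurs in only one of three proteomes, whereas the ATP hydrolysis fold is present everywhere, which is the pattern the text associates with ancient origin.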

The compositional schemes extend the concept of rooting with paralogous sequences to the entire proteome complements, from gene family level [100] to structural hierarchies. The three examples make use of different rooting rationales but provide a congruent scenario of origins of diversified life. Technically, roots are inferred by the Lundberg method [110], which does not require any outgroup taxa specification. This method roots the trees a posteriori by attaching the hypothetical ancestor to that branch of the unrooted network that would yield the minimum increase in tree length (thus preserving the principle of parsimony).
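The rooting step just described can be illustrated with a toy example. The sketch below is not the implementation used in the studies reviewed here: it assumes a small unrooted binary tree, unordered binary characters scored with Fitch parsimony (the reviewed analyses use ordered multistate characters), and an all-zero hypothetical ancestor. All node names and character states are invented for illustration.

```python
def build_adjacency(edges):
    """Adjacency lists for an undirected tree given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def fitch_length(adj, states, root_leaf):
    """Fitch parsimony length of a binary tree, evaluated from a leaf used as root."""
    length = 0
    for c in range(len(states[root_leaf])):
        def down(node, parent):
            nonlocal length
            kids = [k for k in adj[node] if k != parent]
            if not kids:  # terminal leaf: its observed state
                return {states[node][c]}
            left, right = (down(k, node) for k in kids)  # binary trees only
            if left & right:
                return left & right
            length += 1  # disjoint state sets imply one change
            return left | right

        (child,) = adj[root_leaf]
        if states[root_leaf][c] not in down(child, root_leaf):
            length += 1
    return length


def lundberg_root(edges, states, ancestor):
    """Try the hypothetical ancestor on every branch; return the cheapest branch."""
    scores = {}
    for i, (u, v) in enumerate(edges):
        trial = edges[:i] + edges[i + 1:] + [(u, "_new"), ("_new", v), ("_new", "_anc")]
        trial_states = {**states, "_anc": ancestor}
        scores[(u, v)] = fitch_length(build_adjacency(trial), trial_states, "_anc")
    return min(scores, key=scores.get), scores


# Toy unrooted tree ((A,B),(C,D)) with internal nodes n1 and n2 (hypothetical data).
edges = [("A", "n1"), ("B", "n1"), ("n1", "n2"), ("C", "n2"), ("D", "n2")]
states = {"A": "0000", "B": "1000", "C": "1110", "D": "1101"}
best, scores = lundberg_root(edges, states, ancestor="0000")
print(best, scores[best])
```

Because the length of the unrooted tree is the same for every candidate branch, minimizing the length of the tree with the ancestor attached is equivalent to minimizing the increase in tree length.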

(iii) Characters Must Serve as Independent Evolutionary Hypotheses. Valid phylogenetic optimization requires that characters be independent pieces of evidence. Characters should not depend on other characters. When the assumption of independence is violated, characters are overweighted in the analysis and the resulting phylogeny fails to represent true history [111]. Possible dependencies could be of many kinds, from structural to functional, from developmental to ecological. These dependencies distort and obscure phylogenetic signal and must be either avoided or coded into the phylogenetic model through parameters or weight corrections.

As we will now elaborate, the problem of character independence is about parts and wholes in the hierarchical fabric of life and in the nested hierarchies of the ToL. Biological systems are by definition made of parts, regardless of the way parts are defined. In evolution, diversification and integration of parts unify parts into cohesive entities, modules, which then diversify [112]. This process and the rise of modules may explain evolutionary waves of complexity and organization and the emergence of structure (defined broadly) in biology that is hierarchical. The hierarchical makeup is made evident in the structure of protein molecules, where lower level parts of the polymer (the amino acid residues) interact with each other and establish cohesive higher-level modular parts, which also establish interaction networks and are crucial for molecular function and for interaction of proteins with the cellular environment. Consequently, the structure of proteins can be described at increasing hierarchical levels of structural abstraction: sequences, motifs, loops, domains, families, superfamilies, topologies, folds, architectures, and classes. Two accepted gold standards of protein classification, structural classification of proteins (SCOP) [113] and class architecture topology homology (CATH) [114], use parts of this incomplete scheme to describe the atomic complexity of the molecules. We note that these classifications do not consider unrealized structural states, such as protein folds that are possible but that have never been identified in the natural world of protein structures. We also note that modules sometimes engage in combinatorial games. For example, protein domains are rearranged in evolution by fusions and fissions, producing the enormous diversity of alternative domain rearrangements that exist in multidomain proteins [92].

[Figure 5 panels. (a) Canonical paradigm (evolution of systems): characters are parts (amino acid sequence sites); systems are proteins from Pfu, Mja, and Mba; interactions between parts (e.g., atomic contacts between sites); impact on phylogenetic reconstruction: high. (b) New paradigm (evolution of parts): characters are systems (proteomes of Pfu, Mja, and Mba); parts are FSF domains c.37.1, c.1.2, and c.26.1; interactions between systems (e.g., proteomes in trophic chains); impact on phylogenetic reconstruction: low.]

Figure 5: A new phylogenetic strategy simplifies the problems of character independence. (a) The canonical paradigm explores the evolution of systems, such as the evolution of organisms in the reconstruction of ToLs. For example, the terminals of phylogenetic trees can be genes sampled from different organisms (e.g., Pyrococcus furiosus (Pfu), Methanococcus jannaschii (Mja), and Methanosarcina barkeri (Mba)), and phylogenetic characters can be amino acid sequence sites of the corresponding gene products. Character states can describe the identity of the amino acid at each site. Since characters are molecular parts that interact with other parts when molecules fold into compact 3D structures, their interaction violates the principle of character independence. Consequently, the effects of covariation must be considered in the phylogenetic model used to build the trees. (b) The new paradigm explores the evolution of parts, such as the evolution of protein domains in proteomes. For example, the terminals of phylogenetic trees can be domains defined at fold superfamily level of structural complexity, and the characters used to build the trees can be proteomes. Character states can be the number of domains holding the FSF structure. Since proteomes interact with other proteomes when organisms establish close interactions, interactions that could affect the abundance of domains in proteomes should be considered negligible (unless there is an obligate parasitic lifestyle involved), and there is no need to budget trophic interactions in the phylogenetic model.

Since biological systems evolve and carry common ancestry, parts of these systems by definition evolve and by themselves carry common ancestry. In other words, the histories of parts are embodied in the ensemble of lineages of the ToL. The focus of phylogenetic analysis, however, has been overwhelmingly the organism as the biological system, as testified by the effort devoted to taxonomical classification, systematic biology, and building of the ToL. The genomic revolution has provided a wealth of parts in the amino acid sequence sites of proteins and nucleic acid molecules, which have been used as phylogenetic characters for analysis of molecules representing organisms. This has maintained the focus of reconstructing trees of systems (the wholes). For example, amino acid sites of a protein are generally fit into a data matrix (alignment), which is then used to build trees of genes and organisms (Figure 5(a)) using modern algorithmic implementations [115]. However, and as we have mentioned, proteins have complicated structures that result from interactions between amino acids at the 3D atomic level. These intramolecular interactions, or at least some of them [116], induce protein folding and delimit molecular function and protein stability. They are responsible for protein secondary, supersecondary, domain, and tertiary structure, and by definition their mere existence induces violation of character independence. Penny and Collins [117] proposed the simple thought experiment in which the bioinformatician exchanges rows of sequence sites in the alignment matrix and asks what was lost in the process. Randomization of characters (columns) in the data matrix does not change the phylogenetic tree. However, randomization destroys the structure of the molecule and very likely its function. This confirms that reconstruction of trees from information in sequence sites violates character independence and in the process ignores structure and biology. The effects of violation of character independence may be minimal for trees of sequences that are closely related. However, trees describing deep historical relationships require that sequences be divergent, and this maximizes the chances of even wider divergences in molecular structure that are not being accounted for by the models of sequence evolution.
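The thought experiment of Penny and Collins can be reproduced with a toy alignment. The sketch below assumes a simple mismatch-count (Hamming) distance as a stand-in for any tree-building input that treats sites as independent. The three sequences are invented and only borrow the organism abbreviations used in Figure 5.

```python
import random
from itertools import combinations


def hamming(a, b):
    """Number of aligned sites at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def pairwise_distances(alignment):
    """Mismatch counts for every pair of sequences in the alignment."""
    return {
        (i, j): hamming(alignment[i], alignment[j])
        for i, j in combinations(sorted(alignment), 2)
    }


def shuffle_columns(alignment, seed=0):
    """Apply the same random permutation of sites (columns) to every sequence."""
    n_sites = len(next(iter(alignment.values())))
    order = list(range(n_sites))
    random.Random(seed).shuffle(order)
    return {name: "".join(seq[k] for k in order) for name, seq in alignment.items()}


# Hypothetical toy alignment, for illustration only.
toy = {"Pfu": "ALYSGKT", "Mja": "ALYSGRT", "Mba": "GLSSAKT"}
shuffled = shuffle_columns(toy)

# The distances that feed tree reconstruction are unchanged by the shuffle,
# even though the order of residues (and hence any folded structure) is lost.
print(pairwise_distances(toy) == pairwise_distances(shuffled))  # prints True
```

The same invariance holds for parsimony and likelihood scores computed site by site, which is why column order carries no weight in the canonical paradigm.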

The genomic revolution has also provided a wealth of models of 3D atomic structure. These structures are used as gold standards to assign, with high confidence, structural modules to sequences. As mentioned earlier, proteomes embody collections of protein domains with well-defined structures and functions. Protein domain counts in proteomes have been used to generate trees of protein domains (Figure 5(b)) using standard cladistic approaches and well-established methods (reviewed in [87]). These trees describe the evolution of protein structure at global level. They are effectively trees of parts. While domains interact with each other in multidomain proteins or establish protein-protein interactions with the domains of other proteins, these interactions of parts do not violate character independence. This is because phylogenetic characters are actually proteomes, systems defined by structural states that exist at much higher levels than the protein domain, not far away from the organism level. Remarkably, no information is lost when character columns in the matrix are randomized in the thought experiment. The order of proteome characters in the matrix does not follow any rationale. Characters are not ordered by lifestyles or trophic levels of the organisms. Interactions between free-living organisms will seldom bias their domain makeup, and if so, those characters can be excluded from analysis. Even the establishment of symbiotic or obligate parasitic interactions, such as the nodule-forming symbioses between rhizobia and legumes, may have little impact on character independence as long as the joint inclusion of the host and the symbiont is avoided.

6. Evidence Supporting the Archaeal Rooting of the Universal Tree

Figure 4 shows examples of rooted ToLs generated from the sequence and structure of RNA and protein molecules. The different phylogenomic approaches arrive at a common rooted topology that places the stem group of Archaea at the base of the ToL (the archaeal rooting of Figure 1(c)). However, the evolutionary interrelationship of parts and wholes prompts the use of trees of parts, a focus on higher-level structure, and a decrease in confidence in the power of trees of systems. These concepts have been applied to the study of the history of nucleic acid and protein structures for over a decade and provide additional evidence in support of the archaeal rooting scenario.

Figure 6: Venn diagrams displaying the distributions of 1,733 FSF domains (a), 1,924 terminal GO terms (b), and 1,148 RNA families (c) in the genomes of the three domains of life. FSF domain data was taken from Nasir et al. [122, 123] and included 981 completely sequenced proteomes from 70 Archaea, 652 Bacteria, and 259 Eukarya. Terminal GO terms corresponding to the "molecular function" hierarchy defined by the GO database [118] were identified in 249 free-living organisms, including 45 Archaea, 183 Bacteria, and 21 Eukarya (data taken from [123]). The Venn diagram of RNA families and the distribution of Rfam clans and families in organisms were taken from Hoeppner et al. [129] and their Dataset S1. Shown below are distribution patterns for FSFs, GO terms, and RNA families that are present in more than 60% of the organisms examined (f > 0.6). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

6.1. Comparative Genomic Argumentation. The distribution of gene-encoded products in the genomes of sequenced organisms and parsimony thinking can reveal global evolutionary patterns without formal phylogenetic reconstruction. We will show how simple numerical analyses of protein domains, molecular activities defined by the gene ontology (GO) consortium [118], and RNA families that are domain-specific or are shared between domains of life can uncover the tripartite division of cellular life, exclude chimeric scenarios of origin, and provide initial insight on the rooting of the ToL (Figure 6).

The number of unique protein folds observed in nature is very small. SCOP ver. 1.75 defines only ~2,000 fold superfamilies (FSFs), groups of homologous domains unified on the basis of common structural and evolutionary relationships, for a total of 110,800 known domains in proteins [113, 119]. FSFs represent highly conserved evolutionary units that are suitable for studying organismal diversification [87]. Yafremava et al. [120] plotted the total number of distinct FSFs (FSF diversity) versus the average reuse of FSFs in the proteome of an organism (FSF abundance). This exercise uncovered a scaling behavior typical of a Benford distribution, with a linear regime of proteomic growth for microbial organisms and a superlinear regime for eukaryotic organisms (Figure 8 in [120]). These same scaling patterns are observed when studying the relationship between open reading frames and genome size [121]. Remarkably, archaeal and eukaryal proteomes exhibited minimum and maximum levels of FSF abundance and diversity, respectively. Bacterial proteomes, however, showed intermediate levels. We note that the general scaling behavior is consistent with a scenario in which evolutionary diversification proceeds from simpler proteomes to more complex ones in a gradual manner, supporting the principle of spatiotemporal continuity and revealing the nested phylogenetic hierarchies of organisms. Under this scenario (and the results of [120]), the streamlined archaeal proteomes represent the earliest form of cellular life. Remarkably, archaeal species harboring thermophilic and hyperthermophilic lifestyles encoded the most streamlined FSF repertoires (Figure 7). Clearly, modern thermophilic archaeons are most closely related to the ancient cells that inhabited planet Earth billions of years ago (also read below).
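One way to tell a linear from a superlinear scaling regime is to estimate the slope of the relationship on log-log axes. The sketch below assumes that "linear" corresponds to a log-log slope near 1 and "superlinear" to a slope above 1, which is one common reading and may differ from the exact treatment in [120]. The data points are synthetic and do not come from any proteome.

```python
import math


def loglog_slope(xs, ys):
    """Ordinary least-squares slope of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    return sxy / sxx


# Synthetic illustration: FSF diversity (x) versus FSF reuse (y).
diversity = [200, 300, 400, 600]
linear_reuse = [2.0 * d for d in diversity]            # y proportional to x
superlinear_reuse = [0.01 * d ** 2 for d in diversity]  # y proportional to x squared

print(round(loglog_slope(diversity, linear_reuse), 3))
print(round(loglog_slope(diversity, superlinear_reuse), 3))
```

With real data the two regimes would be fitted separately for microbial and eukaryotic proteomes, and the slopes compared with their confidence intervals.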

[Figure 7 plot. Axes: FSF use (log) versus FSF reuse (log). Legend: thermophilic, nonthermophilic. Labeled species: S. marinus, I. hospitalis, H. butylicus, T. pendens, Cand. Korarchaeum, A. pernix, M. kandleri, T. neutrophilus, P. islandicum, and N. maritimus.]

Figure 7: A plot of FSF use (diversity) against FSF reuse (abundance) reveals a linear pattern of proteomic growth in 48 archaeal proteomes. Thermophilic archaeal species occupy positions that are close to the origin of the plot. They also populate the most basal branch positions in ToLs (see discussion in the main text). Both axes are in logarithmic scale.

FSF domain distributions in the genomes of the three domains of life provide further insights into their evolution. Nasir et al. [122, 123] generated Venn diagrams to illustrate FSF sharing patterns in the genomes of Archaea, Bacteria, and Eukarya (Figure 6(a)). These diagrams display the total number of FSFs that are unique to a domain of life (taxonomic groups A, B, and E), shared by only two (AB, BE, and AE), and those that are universal (ABE). About half of the FSFs (786 out of 1,733) were present in all three domains of life, 453 of which were present in at least 60% of the organisms that were examined (f > 0.6, where f is the distribution index ranging from 0 to 1). The fact that about 70% of the 672 widely shared FSFs belong to the ABE taxonomic group strongly supports a common evolutionary origin for cells. Evolutionary timelines have confirmed that a large number of these universal FSFs were present in the ancestor of the three domains of life, the urancestor (Figure 1), and were for the most part retained in extant proteomes [48].
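The bookkeeping behind the Venn groups and the distribution index can be sketched as follows. This is a minimal illustration that assumes f is the fraction of organisms harboring an FSF among all organisms of the domains in which that FSF occurs. The exact denominator used in [122, 123] is not restated here, and the organism names and occurrences are hypothetical.

```python
def classify_fsf(present_in, organism_domain):
    """Return (taxonomic group, f) for one FSF.

    present_in: set of organism names that encode the FSF.
    organism_domain: dict mapping every sampled organism to 'A', 'B', or 'E'.
    """
    if not present_in:
        raise ValueError("the FSF must occur in at least one organism")
    domains = {organism_domain[o] for o in present_in}
    group = "".join(d for d in "ABE" if d in domains)
    # Assumption: f is computed over organisms of the domains that hold the FSF.
    pool = [o for o, d in organism_domain.items() if d in domains]
    return group, len(present_in) / len(pool)


# Hypothetical sample of six organisms.
organism_domain = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "b3": "B", "e1": "E"}

group, f = classify_fsf({"b1", "b2", "e1"}, organism_domain)
print(group, f, f > 0.6)
```

Counting FSFs per group, with and without the f > 0.6 filter, yields the two sets of numbers displayed in each panel of a Venn diagram such as Figure 6(a).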

Interestingly, the number of BE FSFs was ~10-fold greater than AE and AB FSFs (324 versus 38 and 38) (Figure 6(a)). The finding that Bacteria and Eukarya encode a significantly large number of shared FSF domains is remarkable and hints towards an unprecedented strong evolutionary association between these two domains of life. Moreover, a significant number of BE FSFs are widespread in the bacterial and eukaryotic proteomes: 56 of the 324 FSFs are shared by more than 60% of the bacterial and eukaryal organisms that were analyzed (Figure 6(a)). This is strong evidence of an ancient vertical evolutionary trace from their mutual ancestor (anticipated in [123]). This trace uniquely supports the archaeal rooting of the ToL. In turn, the canonical and eukaryotic rooting alternatives are highly unlikely, as none alone can explain the remarkable diversity of the BE taxonomic group (as well as its ancient origin [123]). The remarkable vertical trace of the ABE taxonomic group and the negligible vertical trace of the AB group (only one FSF shared by more than 60% of organisms) also falsify fusion hypotheses responsible for a putative chimeric makeup of Eukarya. In light of these findings, the most parsimonious explanation of comparative genomic data is that Archaea is the first domain of diversified cellular life.
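The proportions quoted in this comparison follow directly from the FSF counts given in the text. The short check below uses only those counts; the variable names are ours.

```python
# FSF counts quoted in the text for Figure 6(a).
total_fsfs = 1733          # all FSFs surveyed
abe = 786                  # universal FSFs (ABE)
be, ae, ab = 324, 38, 38   # FSFs shared by exactly two domains
abe_widespread = 453       # ABE FSFs with f > 0.6
all_widespread = 672       # FSFs with f > 0.6 in any taxonomic group

print(round(100 * abe / total_fsfs, 1))                # 45.4, "about half"
print(round(100 * abe_widespread / all_widespread, 1))  # 67.4, "about 70%"
print(round(be / ae, 1))                                # 8.5, on the order of 10-fold
```

The BE group is therefore roughly 8.5 times larger than either the AE or the AB group, which the text rounds to an order-of-magnitude difference.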

The patterns of FSF sharing are further strengthened by the genomic distributions of terminal-level molecular function GO terms (Figure 6(b)). The three domains of life shared about a quarter (526) of the 1,924 GO terms that were surveyed. A total of 232 of these ABE terms were shared by 60% of organisms analyzed. Again, the fact that about 76% of the 306 widely shared GO terms belong to the ABE taxonomic group strongly supports a common vertical trace, while the finding that Bacteria and Eukarya share a significantly large number of GO terms (272) supports again the archaeal rooting of the ToL. An alternative explanation for the very large size of BE could be large-scale metabolism-related gene transfer from the ancestors of mitochondria and plastids to the ancestors of modern eukaryotes [124]. However, we note that the BE group is not restricted to only metabolic functions. It also includes FSFs and GOs involved in intracellular and extracellular processes and regulation and informational functions [125]. Thus, the very large size of BE is a significant outcome most likely shaped by vertical evolutionary scenarios and cannot solely be explained by parasitic/symbiotic relationships that exist between Bacteria and Eukarya [123]. Moreover, the simplicity of archaeal FSF repertoires is not due to the paucity of available archaeal genomic data. We confirmed that the mean FSF coverage (i.e., number of proteins annotated to FSFs/GOs out of total) for Archaea, Bacteria, and Eukarya was largely comparable (e.g., Table S1 in [122]). Finally, as we will describe below, these comparative patterns and tentative conclusions are confirmed by phylogenomic analysis [91, 122, 126]. Congruent phylogenies were obtained with (Figure 4(c)) and without equal and random sampling of taxa. Thus, the relatively low number of archaeal genomes is also not expected to compromise our inferences.

We note that Eukarya shares many informational genes with Archaea and many operational genes with Bacteria. While informational genes have been thought more refractory to HGT than noninformational genes, we have confirmed at GO level that this is not the case. A ToL built from GO terms showed that in fact noninformational terms were less homoplasious, while statistical enrichment analyses revealed that HGT had little if any functional preference for GO terms across GO hierarchical levels. We recently compared ToLs reconstructed from non-HGT GO terms and ToLs reconstructed from informational GO terms that were extracted from non-HGT GO terms (Kim et al., ms. in review; also read below). In both cases, Archaea appeared as a basal paraphyletic group of the ToLs and the common origin of Bacteria and Eukarya was maintained. Thus, the 272 GO terms shared by Bacteria and Eukarya harbor a strong vertical trace.

Another observation often used as support for Archaea-Eukarya kinship is the discovery of a few eukaryote-specific proteins (e.g., actin, tubulin, H3, H4, ESCRT, ribosomal proteins, and others) in some archaeal species [127] and their complete absence from Bacteria (except for documented tubulin HGT from eukaryotes to the bacterial genus Prosthecobacter [128]). This suggests either that eukaryotes arose from an archaeal lineage [127] or that the ancestor of Eukarya and Archaea was complex and modern Archaea are highly reduced [25]. Indeed, a few eukaryote-specific proteins have now been found in some archaeal species (in some cases just one species). We argue that this poor spread cannot be taken as evidence for the Archaea-Eukarya sister relationship. This needs to be confirmed by robust phylogenetic analysis, which is unfortunately not possible when using protein sequences. Recently, we described a new strategy for inferring vertical and horizontal traces [123]. This method calculates the spread of FSFs and GOs that are shared between two-superkingdom groups (i.e., AB, AE, and BE). Balanced distributions often indicate vertical inheritance, while biased distributions suggest horizontal flux. For example, penicillin binding molecular activity (GO:0008658) was present in 100% of the sampled bacterial proteomes but was only present in 11% of the archaeal species [123]. Thus, presence of GO:0008658 in Archaea was attributed to HGT gain from Bacteria. Using this simple method, we established that both Bacteria and Eukarya were united by a much stronger vertical trace than either was to Archaea. In fact, strong reductive tendencies in the archaeal genomes were recorded [123]. Thus, in our opinion, presence of eukaryote-specific proteins in only very few archaeal species could in fact be an HGT event that is not detectable by sequence phylogenies.
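The contrast between balanced and biased distributions can be expressed as a simple ratio of spreads. The sketch below is only an illustration of the idea. The ratio, the 0.5 cutoff, and the function names are assumptions made here and are not the statistics used in [123]. The penicillin binding values are those quoted in the text.

```python
def spread_ratio(f_one, f_other):
    """Ratio of the smaller to the larger spread (1 = perfectly balanced)."""
    low, high = sorted((f_one, f_other))
    if high == 0:
        raise ValueError("the feature is absent from both groups")
    return low / high


def infer_trace(f_one, f_other, balance_cutoff=0.5):
    """Label a shared feature as a likely vertical or horizontal trace.

    balance_cutoff is a hypothetical threshold chosen for illustration.
    """
    if spread_ratio(f_one, f_other) >= balance_cutoff:
        return "vertical"
    return "horizontal"


# Penicillin binding (GO:0008658): 100% of bacterial proteomes, 11% of archaeal species.
print(infer_trace(1.00, 0.11))
```

A feature present at similar frequencies in both superkingdoms would give a ratio close to 1 and be labeled as a vertical trace under the same cutoff.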

In turn, many arguments now favor the Bacteria-Eukarya sisterhood in addition to the structure-based phylogenies and balanced distribution of molecular features. For example, Bacteria and Eukarya have similar lipid membranes, which can be used as an argument for their evolutionary kinship. Moreover, Archaea fundamentally differ from Eukarya in terms of their virosphere. Viruses infecting Archaea and Eukarya are drastically different, as recently discussed by Forterre [25].

The genomic distribution of RNA families was taken from Hoeppner et al. [129] and also shows a vertical evolutionary trace in five crucial Rfam clans that are universal, including tRNA, 5S rRNA, subunit rRNA, and RNase P RNA. These universal RNA groups are likely minimally affected by HGT. However, 99% of Rfam clans and families are specific to domains of life, and only 11 of the 1,148 groups were shared at f > 0.6 levels. This clearly shows that the functional complexity of RNA materialized very late during organismal diversification and that it is not a good genomic feature for exploring the rooting of the ToL. Only five RNA families are shared between two domains of life, and only one of these does so at f > 0.6: the G12 pseudoknot of the 23S rRNA, which is present in bacterial and eukaryotic organellar rRNA. While the large subunit rRNA scaffold supports the G12 pseudoknot with its vertical trace, all five interdomain RNA families can be explained most parsimoniously by HGT. Thus, only a handful of ancient and universal RNA species can be used to root the ToL.

We end by noting that the Venn diagrams consistently show that Archaea harbors the smallest number of unique (A) and shared (AB and AE) FSFs, GO terms, or RNA families. This trend supports an early divergence of this domain of life from the urancestor and the possibility that such divergence was shaped by evolutionary reductive events. We reason that such losses would be more parsimonious early on in evolution than in later periods. This is because genes often increase their abundance in evolutionary time (by gene duplications, HGT, and other processes). Thus, it is reasonable to think that loss of an ancient gene would be more feasible very early in evolution relative to losing it very late.

6.2. Phylogenomic Evidence from the Sequence and Structure of RNA. Phylogenetic analyses of the few RNA families that are universal and display an important vertical evolutionary trace (Figure 6(c)) provide compelling evidence in favor of an early evolutionary appearance of Archaea [105, 107, 108, 130–135]. Here we briefly summarize evidence from tRNA, 5S rRNA, and RNase P RNA. Unpublished analyses of rRNA sequence and structure using advanced phylogenetic methods also show that Archaea was the first domain of life.

tRNA molecules are generally short (~73–95 nucleotides in length) and highly conserved. Consequently, their sequence generally contains a limited amount of phylogenetic information. These limitations have been overcome by analyzing entire tRNomes [136], which extend the length of an organismal set of tRNAs to over 2,000 bases. Xue et al. [130, 131] analyzed the genetic distances between tRNA sequences as averages between alloacceptor tRNAs from diverse groups of tRNomes using multiple molecular base substitution models. The distances were mapped onto an unrooted phylogeny of tRNA molecules (Figure 8(a)). The "arrow of time" assumption in these studies is that ancient tRNA paralogs closely resemble each other when lineages originate close to the time of the gene duplication. Remarkably, the results revealed a paraphyletic rooting of the ToL in Archaea. The root was specifically located close to the hyperthermophilic methanogen Methanopyrus kandleri (Figure 8(a)). The hypothesis of this specific rooting scenario has been supported by several other studies [134, 137], including a study of genetic distances between paralogous pairs of aminoacyl-tRNA synthetase (aaRS) proteins [131]. A remarkable match between distance scores of tRNA and pairs of aaRS paralogs (Figure 8(b)) not only confirms the early appearance of Archaea but also suggests a coevolutionary trace associated with molecular interactions that are responsible for the genetic code. In fact, a recent exhaustive phylogenomic analysis of tRNA and aaRS coevolution explicitly reveals the origins and evolution of the genetic code and the underlying molecular basis of genetics [59]. Di Giulio [132, 133] has also proposed an archaeal rooting of the tree of life, specifically in the lineage leading to the phylum of Nanoarchaeota. This rooting is based on unique and ancestral genomic traits of Nanoarchaeum equitans: split genes separately codifying for the 5′ and 3′ halves of tRNA and the absence of operons, which are considered molecular fossils [132]. However, this claim needs additional support, as contrasting evidence now recognizes N. equitans as a highly derived archaeal species [138].
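The averaging of alloacceptor distances within a genome can be sketched as follows. This is a simplified illustration that assumes one pre-aligned, equal-length sequence per amino acid specificity and an uncorrected mismatch proportion (p-distance), whereas the studies cited above used multiple base substitution models. The three short sequences are invented. With 20 specificities, the number of pairs is 190, which matches the number of pairs mentioned for Figure 8.

```python
from itertools import combinations
from math import comb


def p_distance(a, b):
    """Proportion of mismatched sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def d_allo(trnas):
    """Average pairwise distance between tRNAs of different specificity."""
    pairs = list(combinations(sorted(trnas), 2))
    return sum(p_distance(trnas[i], trnas[j]) for i, j in pairs) / len(pairs)


# Hypothetical, very short "tRNAs" keyed by amino acid specificity.
toy_trnome = {"Ala": "GGGC", "Gly": "GCGC", "Ser": "GGAC"}
print(d_allo(toy_trnome))

# Twenty amino acid specificities give 190 alloacceptor pairs.
print(comb(20, 2))  # prints 190
```

Under the "arrow of time" assumption stated above, genomes with lower average distances would be interpreted as lying closer to the time of the original tRNA duplications.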

In addition to sequences, structural features of tRNA molecules also support the archaeal rooting of the ToL.

[Figure 8 panels. (a) Unrooted ToL with terminals labeled by three-letter species codes and grouped into Archaea, Bacteria, and Eukarya; thermal scale for D_allo from 0.35 to 0.92. (b) Scatter plot of aminoacyl-tRNA synthetase history (Q_ars) against tRNA alloacceptor history (D_allo), with regions marked "ancient origins" and "recent origins".]

Figure 8: Ancient phylogenetic signal in the sequence of tRNA and their associated aminoacyl-tRNA synthetase (aaRS) enzymes. (a) Unrooted ToL derived from tRNA sequences with their alloacceptor D_allo distances traced in thermal scale (from [130]). D_allo is the average of pairwise distances for 190 pairs of tRNA isoacceptors. These distances measure pairwise sequence mismatches of tRNAs for every genome, and their values increase for faster evolving sequences of species of more recent origin. (b) Coevolution of aaRSs and their corresponding tRNA. Genetic distances between the top 10 potentially paralogous aaRS pairs estimated using BLASTP define a measure (Q_ars) of how closely the proteins resemble each other in genomes (from [131]). Larger Q_ars scores imply more ancestral and slowly evolving protein pairs. The plot of Q_ars scores against D_allo distances reveals a hidden correlation between the evolution of tRNA and aaRSs and the early origin of Archaea (larger Q_ars and lower D_allo distances).

The application of RNA structural evidence in phylogenetic studies [50, 139–141] has multiple advantages over sequence data when studying ancient events, especially because RNA structures are far more conserved than sequences. This has been demonstrated in a phylogenetic approach that uses RNA structural information to reconstruct the evolutionary history of macromolecules such as rRNA [29, 50], tRNA [107, 108], 5S rRNA [105], RNase P RNA [135], and SINE RNA [142]. Geometrical and statistical properties of structure (e.g., stems or loops commonly found in the secondary structures of RNA molecules) are treated as linearly ordered multistate phylogenetic characters. In order to build rooted trees, an evolutionary tendency toward conformational order is used to polarize change in character state transformations. This defines a hypothetical ancestor with which to root the ingroup using the Lundberg method. Reconstructed phylogenies produce trees of molecules and ToLs (e.g., Figure 4(a)) or trees of substructures that describe the gradual evolutionary accretion of structural components into molecules (Figure 9). For example, phylogenetic trees of tRNA substructures define explicit models of molecular history and show that tRNA originated in the acceptor stem of the molecule [108]. Remarkably, trees reconstructed from tRNA drawn from individual domains of life demonstrate that the sequence of accretion events occurred differently in Archaea than in Bacteria and Eukarya, suggesting a sister group evolutionary relationship between the bacterial and eukaryotic domains (Figure 9(a)). A similar result obtained from trees of 5S rRNA substructures revealed different molecular accretion sequences of archaeal molecules when these were compared to bacterial and eukaryal counterparts [105], confirming again, in a completely different molecular system, the history of the domains of life.

An analysis of the structure of RNase PRNAalso providessimilar conclusions [135] While a ToL reconstructed frommolecular structure placed type A archaeal molecules at itsbase (a topology that resembles the ToL of 5S rRNA) atree of RNase P RNA substructures uncovered the history ofmolecular accretion of the RNA component of the ancientendonuclease and revealed a remarkable reductive evolution-ary trend (Figure 9(b)) Molecules originated in stem P12and were immediately accessorized with the catalytic P1ndashP4 catalytic pseudoknotted core structure that interacts withRNase P proteins of the endonuclease complex and ancientsegments of tRNA Soon after this important accretion stagethe evolving molecule loses its first stem in Archaea (stemP8) several accretion steps earlier than the first loss of a stemin Eukarya or the first appearance of a Bacteria-specific stemThese phylogenetic statements provide additional strongsupport to the early origin of the archaeal superkingdomprior to the divergence of the shared common ancestor ofBacteria and Eukarya As we will discuss below the earlyloss of a structure in the molecular accretion process of acentral and ancient RNA family is significant It suggests

Archaea 17

100

100100 Var

DHUAC

AccEukarya

100

100100 Var

DHUAC

AccBacteria

100

100100 Var

DHU

ACAcc

ArchaeaAC

AccDHU

Ancestral Derived0 105

Age (nd)

3998400endTΨCTΨC

TΨC

TΨC

Var

(a)

P2

P3

P1 P9 P12

P7

P51

P151P15

P4

P5

P8P101

P18

0 101 02 03 04 05 06 07 08 09

P101P151P51P20P15-16

P162P16-17

P161P19

P14P18

P13P17

P16P6

P15P5

P7P8

P9P10-11

P2P4

P3P1

P12

Ancestral Derived

First stem lost inArchaea (type M)

Age (nd)

Universalpseudoknot

Type A pseudoknot(first loss in Eukarya)

100

100

100

89

75

51

66

74

51

93

95

75

75

9084

First stem specificto bacteria (type A)

Ancientuniversalstems

Universalstems

First stems specificto Archaea (type A)

Bacterialtype B-specific

P19

100 changes

(b)

Figure 9The history of accretion of tRNA and RNase P RNA substructures reveals the early evolutionary appearance of Archaea (a) Rootedtrees of tRNA arm substructures reveal the early appearance of the acceptor arm (Acc) followed by the anticodon arm (AC) in Archaeaor the pseudouridine (TΨC) arm in both Bacteria and Eukarya (from [108]) The result confirms the sister group relationship of Bacteriaand Eukarya (b) Trees of molecular substructures of RNase P RNAs were reconstructed from characters describing the geometry of theirstructures (from [135]) Branches and corresponding substructures in a 3D atomicmodel are colored according to the age of each substructure(nd node distance) Note the early loss of stem P8 in Archaea immediately after the evolutionary assembly of the universal functional coreof the molecule


that the emerging archaeal lineages were subjected to strong reductive evolutionary pressures during the early evolution of a very ancient RNA molecule.

6.3. Phylogenomic Evidence from Protein Domain Structure. G. Caetano-Anollés and D. Caetano-Anollés [41] were the first to utilize protein domain structures as taxa and to reconstruct trees of domains (ToDs) describing their evolution. Figure 10 shows a rooted ToD built from a census of FSF structures in 981 genomes (data taken from [122, 123]). These trees are unique in that their terminal leaves represent a finite set of component parts [87]. These parts describe, at a global level, the structural diversity of the protein world. When building trees of FSFs, the age of each FSF domain structure can be calculated from the ToDs by simply counting the nodes between the root of the tree and its corresponding terminal leaf. This is possible because ToDs exhibit highly unbalanced (pectinate) topologies that are the result of semipunctuated processes of domain appearance and accumulation. This node distance (nd), rescaled from 0 to 1, provides a relative timescale to study the order of FSF appearance in evolutionary history [126, 143]. Wang et al. [144] showed that nd correlates linearly with geological time and defines a global molecular clock of protein folds. Thus, nd can be used as a reliable proxy for time. Plotting the age of FSFs in each of the seven taxonomic groups confirmed evolutionary statements we had previously deduced from the Venn diagrams of Figure 6. The ABE taxonomic group included the majority of ancient and widely distributed FSFs. This is expected. In the presence of a strong vertical trace, molecular diversity must delimit a nested taxonomic hierarchy. The ABE group was followed by the evolutionary appearance of the BE group, which preceded the first domain-specific structures, which were Bacteria-specific (B). Remarkably, Archaea-specific (A) and Eukarya-specific (E) structures appeared concurrently and relatively late. These general trends, captured in the box plots of Figure 10, have been recovered repeatedly when studying domain structures at various levels of structural complexity, from folds to fold families [91, 126], when using CATH or SCOP structural definitions [122, 145], or when exploring the evolution of terminal GO terms.
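The node distance described above can be illustrated with a minimal Python sketch. It assumes a rooted tree stored as a child-to-parent map and rescales the raw node counts so that the most basal leaf receives nd = 0 and the most derived leaf receives nd = 1. The function name `node_distances`, the toy tree, and the leaf labels are hypothetical, and the exact rescaling used in the cited studies may differ.

```python
from typing import Dict, List, Optional


def node_distances(parent: Dict[str, Optional[str]], leaves: List[str]) -> Dict[str, float]:
    """Return a rescaled node distance (nd) in [0, 1] for every leaf.

    `parent` maps each node to its parent; the root maps to None.
    """
    def steps_to_root(node: str) -> int:
        # Count the edges crossed while walking from the node up to the root.
        steps = 0
        while parent[node] is not None:
            node = parent[node]
            steps += 1
        return steps

    raw = {leaf: steps_to_root(leaf) for leaf in leaves}
    oldest, youngest = min(raw.values()), max(raw.values())
    span = (youngest - oldest) or 1  # avoid division by zero on a one-level tree
    return {leaf: (raw[leaf] - oldest) / span for leaf in leaves}


# Hypothetical pectinate (comb-like) tree:
# root -> (fsf_a, n1), n1 -> (fsf_b, n2), n2 -> (fsf_c, fsf_d)
toy_parent = {
    "root": None,
    "fsf_a": "root", "n1": "root",
    "fsf_b": "n1", "n2": "n1",
    "fsf_c": "n2", "fsf_d": "n2",
}
nd = node_distances(toy_parent, ["fsf_a", "fsf_b", "fsf_c", "fsf_d"])
print(nd)
```

In this toy tree the leaf closest to the root is treated as the oldest structure, which mirrors how a pectinate topology turns branching order into a relative timeline.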

While the early rise of BE FSFs supports the early divergence of Archaea from the urancestor, the very significant trend of gradual loss of structures occurring in the lineages of the archaeal domain and the very late appearance of Archaea-specific structures (e.g., [126]) demand explanation. Since ancient BE FSFs are widely distributed in proteomes (Figure 10), they cannot arise from separate gains of FSFs in Bacteria and Eukarya or by processes of horizontal spread of structures. This was already evident from the Venn diagrams of Figure 6. Moreover, the BE sisterhood to the exclusion of Archaea was further supported by the inspection of FSFs involved in lipid synthesis and transport (Table 1). Membrane lipids are very relevant to the origins of diversified life ([146] and references therein). Bacteria and Eukarya encode similar lipid membranes, while archaeal membranes have different lipid composition (isoprenoid ethers). To check if

Figure 10: Phylogenomic tree of domains (ToD) describing the evolution of 1733 FSF domain structures (taxa = 1733 FSF domains; characters = 981 genomes; tree length = 458226 steps; CI = 0.049, RI = 0.796, and g1 = −0.08). Taxa (FSFs) were colored according to their distribution (f) in the 981 genomes that were surveyed and used as characters to reconstruct the phylogenomic tree (data taken from [122, 123]). The legend bins hold 529 FSFs with f between 0.0 and 0.1, 623 with f between 0.1 and 0.5, 336 with f between 0.5 and 0.9, 228 with f between 0.9 and 1.0, and 17 with f = 1.0. The most basal FSFs are labeled with SCOP alphanumeric identifiers (c.55.1, c.37.1, b.40.4, c.2.1, a.4.5, and c.66.1; e.g., c.37.1 is the P-loop containing nucleoside triphosphate hydrolase FSF). Boxplots display the age (nd value) distribution of FSFs for the seven possible taxonomic groups (ABE, BE, AB, AE, B, A, and E). nd values were calculated directly from the tree [122] and define a timeline of FSF innovation from the origin of proteins (nd = 0) to the present (nd = 1). The group of FSFs that are shared by the three domains of life (ABE) is the most ancient taxonomic group, which spans the entire time axis, and their FSFs are widely distributed in genomes. The appearance of the BE group coincides with the first reductive loss of an FSF in Archaea. FSF structures specific to domains of life appear much later in evolution.

lipid synthesis was another BE synapomorphy, we identified 17 FSFs that were involved in lipid metabolism and transport


Table 1: List of FSFs involved in lipid metabolism and transport along with taxonomic distribution (data taken from [122, 123]).

Group   SCOP Id   FSF Id    FSF description
ABE     89392     b.125.1   Prokaryotic lipoproteins and lipoprotein localization factors
ABE     53092     c.55.2    Creatinase/prolidase N-terminal domain
ABE     49723     b.12.1    Lipase/lipooxygenase domain (PLAT/LH2 domain)
ABE     54637     d.38.1    Thioesterase/thiol ester dehydrase-isomerase
ABE     63825     b.68.5    YWTD domain
BE      47027     a.11.1    Acyl-CoA binding protein
BE      48431     a.118.4   Lipovitellin-phosvitin complex, superhelical domain
BE      55048     d.58.23   Probable ACP-binding domain of malonyl-CoA ACP transacylase
BE      56968     f.7.1     Lipovitellin-phosvitin complex, beta-sheet shell regions
BE      58113     h.5.1     Apolipoprotein A-I
BE      47162     a.24.1    Apolipoprotein
BE      56931     f.4.2     Outer membrane phospholipase A (OMPLA)
B       82220     b.120.1   Tp47 lipoprotein, N-terminal domain
E       47699     a.52.1    Bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin
E       82936     h.6.1     Apolipoprotein A-II
E       57190     g.3.10    Colipase-like
E       49594     b.7.4     Rab geranylgeranyltransferase alpha-subunit, insert domain

(Table 1). Remarkably, the largest share of these FSFs (7 out of 17) was unique to the BE group. In comparison, none were present in either the AE or AB groups. The ABE group included five universal FSFs, while one was unique to Bacteria and four were Eukarya-specific. In turn, no FSF was unique to Archaea. The BE FSFs cannot be explained by modern effects impinging on variations in proteomic accumulation in FSFs or by processes of domain rearrangement, since these appear in the protein world quite late in evolution [92]. The only and most-parsimonious explanation of the patterns of FSF distribution that unfold in the ToD is the very early (and protracted) rise of the archaeal domain by processes of reductive evolution, possibly triggered by the adaptation of urancestral lineages to harsh environments and survival modes. Under extremophilic conditions, typical of hyperthermophilic environments, considerable investments of matter-energy and information must be made for protein persistence [120]. This puts limits on viable protein structures [147]. Extremophilic environments will thus poise the maintenance of a limited set of FSFs for persistence of emergent diversified lineages. This would induce a primordial episode of reductive evolution in the growing FSF repertoire, explaining why hyperthermophilic and thermophilic archaeal species hold the most reduced proteomes (Figure 7). It would also explain the biases that exist in FSFs, GO terms, and RNA families (Figure 6) and the placement of hyperthermophilic and thermophilic archaeal species at the base of ToLs. Since Archaea populate the oceans and sometimes rival Bacteria in number in those environments, we further interpret the late appearance of Archaea-specific FSFs as the result of late colonization of these mild environments by both ancient archaeons and emerging eukaryotes. This relaxes primordial extremophilic pressures on protein structures and enables the late archaeal exploration of structural flexibility and functional novelty.
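The counts quoted in this paragraph can be checked with a short Python tally of the Table 1 entries. This is only an illustrative sketch: the group labels and FSF identifiers are copied from Table 1, the dotted SCOP punctuation of the identifiers is an assumption about how they were originally written, and the variable names are hypothetical.

```python
from collections import Counter

# (taxonomic group, FSF identifier) pairs transcribed from Table 1.
# The dotted form of each identifier is an assumed reconstruction.
table1 = [
    ("ABE", "b.125.1"), ("ABE", "c.55.2"), ("ABE", "b.12.1"),
    ("ABE", "d.38.1"), ("ABE", "b.68.5"),
    ("BE", "a.11.1"), ("BE", "a.118.4"), ("BE", "d.58.23"),
    ("BE", "f.7.1"), ("BE", "h.5.1"), ("BE", "a.24.1"), ("BE", "f.4.2"),
    ("B", "b.120.1"),
    ("E", "a.52.1"), ("E", "h.6.1"), ("E", "g.3.10"), ("E", "b.7.4"),
]

# Count how many lipid-related FSFs fall in each of the seven possible groups.
counts = Counter(group for group, _ in table1)
for group in ("ABE", "BE", "AB", "AE", "A", "B", "E"):
    print(group, counts[group])
```

A `Counter` returns zero for groups that never occur, which makes the absence of AB, AE, and Archaea-specific entries explicit.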

6.4. Full Circle: Evidence from Trees of Proteomes and Functionomes and a Tree Derived from the Distribution of Viral Replicons in Superkingdoms. While we distrust trees of systems, especially ToLs built from sequences, the use of molecular structure at high levels of structural abstraction has the potential to mitigate some limitations of sequence analysis [93]. For example, rooted ToLs built from abundance counts of domain structures and terminal GO terms in the genomes of free-living organisms describe the evolution of proteomes (e.g., [91]). All ToL reconstructions of these kinds approximate the physiology of living organisms, dissect the three primary domains of life, and reveal the early paraphyletic origin of extremophilic archaeal lineages followed by the late appearances of monophyletic Bacteria and Eukarya. These patterns have been reliably recovered with datasets of varying sizes, irrespective of the structural classification scheme [91, 122, 126, 145]. Even a tree reconstructed from the distribution of 2662 viral replicons in superkingdoms from an exhaustive comparative genomic analysis of viral genomes showed the basal placement of Archaea and the sister taxa relationship between Bacteria and Eukarya (Figure 11).

6.5. Additional Evidence from Comparative Genomics. The uneven distribution of protein domain structures in the world


Figure 11: One of three optimal phylogenetic tree reconstructions with identical topologies recovered from an exhaustive maximum parsimony search (58 steps; CI = 1, RI = 1, HI = 0, RC = 1, g1 = −0.707) of the abundance of viral replicon types of dsDNA, ssDNA, dsRNA, ssRNA(+), ssRNA(−), and retrotranscribing viruses. Abundance was scored on a 0–20 scale and ranged from 0 to 759 viral replicons. Vectors of abundance reconstructions in internal nodes are given as percentage of total abundance of replicons in superkingdoms. Bootstrap support values are shown below nodes.

of proteomes (Figure 6) is preserved as we climb in the structural hierarchy. This was recently made evident when studying the evolution of CATH domains [145]. The Venn diagrams of Figure 12 show how domain structures in all taxonomic groups decrease in numbers with increases in evolutionary conservation. At the highest CATH architectural level, there were 32 universal architectures but no domain-specific architectures. The four architectures shared by Bacteria and Eukarya were present in at least 60% of the proteomes that were surveyed. The other two interdomain architectures were topological designs of considerable complexity that were poorly shared between proteomes and were evolutionarily derived. They were the clam architectures lost in Eukarya and the box architectures of nucleotide excision repair shared by Archaea and Eukarya. The most parsimonious corollary of these distribution patterns is that the BE taxonomical group must arise by loss of structures in Archaea. Indeed, ToDs describing the evolution of CATH domain structures again confirm the early appearance of BE structures and, consequently, their loss in Archaea.
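The bookkeeping behind such Venn-type distributions can be sketched in a few lines of Python. The example below assumes a presence table that records, for each proteome, its superkingdom (A, B, or E) and the set of structures it encodes; it then derives the taxonomic group of each structure and the fraction of proteomes that contain it. The function name `summarize`, the proteome names, and the structure labels are all hypothetical and are not taken from the cited datasets.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


def summarize(proteomes: Dict[str, Tuple[str, Set[str]]]) -> Dict[str, Tuple[str, float]]:
    """Map each structure to (taxonomic group, fraction of proteomes containing it)."""
    kingdoms_with = defaultdict(set)  # structure -> superkingdoms where it occurs
    hits = defaultdict(int)           # structure -> number of proteomes containing it
    for kingdom, structures in proteomes.values():
        for structure in structures:
            kingdoms_with[structure].add(kingdom)
            hits[structure] += 1
    total = len(proteomes)
    # Sorting the letters yields the labels used in the text (ABE, BE, AE, ...).
    return {s: ("".join(sorted(kingdoms_with[s])), hits[s] / total) for s in hits}


# Hypothetical toy census of five proteomes and three structures.
toy = {
    "arch1": ("A", {"x"}),
    "bact1": ("B", {"x", "y"}),
    "bact2": ("B", {"x", "y", "z"}),
    "euk1": ("E", {"x", "y"}),
    "euk2": ("E", {"x"}),
}
summary = summarize(toy)
print(summary)
```

Here structure `y` falls in the BE group and is present in three of the five toy proteomes, the kind of widely shared but Archaea-free pattern discussed in the text.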

7. Paraphyletic Origins, Grades, and Clades in Archaeal History

Saying that “a ToL is rooted in a domain of life” is an incorrect statement that comes from phylogenetic methodology (the use of outgroups) and the tendency to look at the past with modern eyes. Clades in ToLs have been rooted relative to each other by generating unrooted trees and by defining extant organisms as outgroup taxa. The ToL, however, must be considered rooted in the urancestor of cellular life (Figure 1). This planted edge that connects to the ingroup of extant organisms represents a cellular state in which productive diversification (in the sense of successful lineages) was absent. The primordial urancestral edge leads to a “phase transition,” the last universal cellular ancestor (LUCA), of which little is known. The physiology of the urancestor cannot be considered linked to that of any extant organism, even if it shared a common molecular core with all of them. The urancestor was not an archaeon, a bacterium, or a eukaryote [148]. It was not necessarily thermophilic. Perhaps it was a communal entity or a megametaorganism in the sense of a modern syncytium (the result of multiple cell fusions) and a modern coenocyte (the result of multiple cell divisions). The organismal boundaries were likely present, judging by the number of widely distributed protein domains that associate with membranes and appear at the base of our ToDs and by the universal existence of acidocalcisome organelles [149]. However, the molecular makeup of the urancestral cells was most likely fluid and quasistatistical, with the repertoire distributed unequally in the urancestral populations of communal parts, of course within confines delimited by persistence. This urancestral population is therefore consistent with the idea of a primordial stem line proposed by Kandler and Woese [150, 151]. However, it was relatively richer in molecular structures and functions, as opposed to the simple cellular systems hypothesized by Woese. This richness is confirmed by modern analyses of proteomes and functionomes that reveal a vast number of universally shared protein domains and GOs among the three superkingdoms. While each syncytial/coenocyte element of the megaorganism exchanged component parts in search of cellular stability and persistence, the process could not be equated with modern HGT. The exchanging community of primordial cells was not cohesive enough to make the horizontal exchange meaningful. Macromolecules most likely established loose and diverse associations with each other and with smaller molecules, limited by the short average life of their unevolved structural conformations. With time, molecules with better-optimized properties engaged in more durable interactions, stabilizing the emergent cells and providing increased cellular cohesiveness. This poised the urancestral community towards a phase transition (a crystallization [148]), a point in which cellular groups had distinct properties and could be individuated. We believe this was the time of the origin of the archaeal lineage, 2.9 billion years ago [48].

At the base of the ToLs that were reconstructed from genomic data, basal archaeal taxa arise as paraphyletic lineages (Figures 4(a) and 4(c)). These lineages likely arose from subgroups of the urancestral population that pervasively lost crucial domain structures and molecular functions. This represents an evolutionary grade under the scenario described above. The emerging lineages shared with the urancestral community a unifying condition that was related to archaic biochemistry. In other words, the urancestral and emerging archaeal lineages expressed fundamental structural and functional equivalences in terms of their repertoires but revealed in each emerging and durable paraphyletic lineage a handful of distinct, newly developed traits. These traits could be global, such as increased thermostability of some crucial members of the protein repertoire or change in the membrane makeup, or local, such as the selective loss of crucial structures and functions. Figure 13 uses the tree paradigm to portray the structural and functional equivalences of the urancestral and emerging archaeal lineages and the slow progression from grades to clades.


Figure 12: Venn diagrams displaying the distributions of 2221 homologous superfamilies, 1152 topologies, and 38 architectures of CATH domains in the proteomes of 492 fully sequenced genomes (from [145]). The three diagrams (superfamilies, topologies, and architectures) are ordered by increasing evolutionary conservation. All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

Figure 13: From grades to clades. The cartoon describes a possible progression of modes of organismal diversification, from grades (Time 0) to clades (Time 1), during the rise of primordial archaeal lineages. The width of emerging lineages is proportional to uniquely identifying features of physiological and molecular complexity.

8. Through the Wormhole: The Makeup of the “Megaorganism” and the Emerging Archaeal Lineages

Character state reconstructions of proteome repertoires derived from ToLs, coupled to the timelines of ToDs, provide an effective way to define the ancestral protein domain complement of the urancestor [48] and, consequently, the likely makeup of the emerging archaeal lineages. The urancestral proteome possessed a lower bound of ~70 FSF domain structures, 75% of which were composed of α/β and α+β proteins. About 50% of FSFs were part of metabolic enzymes, including a rich toolkit of transferases and enzymes of nucleotide metabolism. The rest of the domains were involved in functions related to information (translation, replication, and repair), intracellular processes (transport, protein modification, and proteolytic activities), regulation (kinases/phosphatases and DNA binding functions), and small molecule binding. The urancestor had a limited repertoire of aaRSs and translation factors. It contained a primordial ribosome with a limited core of universal ribosomal proteins. It had numerous membrane proteins necessary for transport, including a relatively advanced ATP synthetase complex, and structures necessary for cellular organization (filaments and primordial cytoskeletal structures). The cells lacked enzymes for deoxyribonucleotide production, so it is likely that the cellular urancestor itself did not harbor a DNA genome. The cells lacked functions related to extracellular processes (cell adhesion, immune response, and toxins/defense) and cellular motility, suggesting an ancient living world without competitive strategies of survival.

9. Conclusions

The rooting of the ToL has always been controversial in evolutionary biology [26, 152, 153]. While it is popularly accepted that the ToL based on sequence phylogenies is rooted in the akaryotes and that Archaea and Eukarya are sister groups to each other, only two of the three main steps of phylogenetic analysis [104] have been partially fulfilled with sequences. These include selecting an appropriate statistical or nonstatistical evolutionary model of character change and an optimization method for phylogenetic tree reconstruction. However, no adequate method exists for character polarization that identifies ancestral and derived character states in sequences. In the absence of robust polarization methodology, any statement about the rooting of the ToL should be considered suspect or subject to apriorism. Here we show that information derived from a genomic structural and functional census of millions of encoded proteins and RNAs, coupled with process models that comply with Weston’s generality criterion, provides the means to dissect the origins of diversified life. The generality criterion is fulfilled in these studies by focusing on the accumulation of modules such as protein domain structures, elements of RNA substructures, or ontological definitions of molecular function. In general, these features are the subject of accretion processes that comply with additive phylogenetic change within the nested taxonomic hierarchy and result in changes of abundance. These processes include those responsible for the growth of molecules (e.g., multidomain proteins), molecular ensembles (e.g., the ribosome), and molecular repertoires (e.g., proteomes). The new methods unfold a consistent evolutionary scenario in which the origin of diversified life traces back to the early history of Archaea. Remarkably, the archaic origin of this microbial urkingdom now does justice to its name.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgments

The authors would like to thank members and friends of the GCA laboratory for fruitful discussions. This research has been supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791) and the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625) to Gustavo Caetano-Anollés and from KRIBB Research Initiative Program and the Next-Generation BioGreen 21 Program, Rural Development Administration (PJ0090192013), to Kyung Mo Kim.

References

[1] C. R. Woese, “A new biology for a new century,” Microbiology and Molecular Biology Reviews, vol. 68, no. 2, pp. 173–186, 2004.

[2] T. Grant, “Testing methods: the evaluation of discovery operations in evolutionary biology,” Cladistics, vol. 18, no. 1, pp. 94–111, 2002.

[3] E. O. Wiley, “Karl R. Popper, systematics, and classification: a reply to Walter Bock and other evolutionary taxonomists,” Systematic Zoology, vol. 24, no. 2, pp. 233–243, 1975.

[4] J. S. Farris, “Parsimony and explanatory power,” Cladistics, vol. 24, no. 5, pp. 825–847, 2008.

[5] G. Caetano-Anollés, K. M. Kim, and D. Caetano-Anollés, “The phylogenomic roots of modern biochemistry: origins of proteins, cofactors and protein biosynthesis,” Journal of Molecular Evolution, vol. 74, no. 1-2, pp. 1–34, 2012.

[6] W. Fontana, “Modeling “evo-devo” with RNA,” BioEssays, vol. 24, no. 12, pp. 1164–1177, 2002.

[7] W. Hennig, Phylogenetic Systematics, University of Illinois Press, Urbana, Ill, USA, 1999.

[8] D. Fraix-Burnet, T. Chattopadhyay, A. K. Chattopadhyay, E. Davoust, and M. Thuillard, “A six-parameter space to describe galaxy diversification,” Astronomy and Astrophysics, vol. 545, article A80, 24 pages, 2012.

[9] E. Sober, “The contest between parsimony and likelihood,” Systematic Biology, vol. 53, no. 4, pp. 644–653, 2004.

[10] E. K. Lienau and R. DeSalle, “Is the microbial tree of life verificationist?” Cladistics, vol. 26, no. 2, pp. 195–201, 2010.

[11] J. S. Huxley, “Evolutionary processes and taxonomy with special reference to grades,” Uppsala Universitets Arsskrift, vol. 6, pp. 21–39, 1958.

[12] C. R. Woese and G. E. Fox, “Phylogenetic structure of the prokaryotic domain: the primary kingdoms,” Proceedings of the National Academy of Sciences of the United States of America, vol. 74, no. 11, pp. 5088–5090, 1977.

[13] P. Forterre, “The universal tree of life and the last universal cellular ancestor: revolution and counterrevolutions,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 43–62, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[14] J. P. Gogarten, H. Kibak, P. Dittrich et al., “Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 17, pp. 6661–6665, 1989.

[15] N. Iwabe, K. Kuma, M. Hasegawa, S. Osawa, and T. Miyata, “Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 23, pp. 9355–9359, 1989.

[16] J. P. Gogarten and L. Olendzenski, “Orthologs, paralogs and genome comparisons,” Current Opinion in Genetics and Development, vol. 9, no. 6, pp. 630–636, 1999.

[17] C. R. Woese, O. Kandler, and M. L. Wheelis, “Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya,” Proceedings of the National Academy of Sciences of the United States of America, vol. 87, no. 12, pp. 4576–4579, 1990.

[18] J. Martijn and T. J. Ettema, “From archaeon to eukaryote: the evolutionary dark ages of the eukaryotic cell,” Biochemical Society Transactions, vol. 41, no. 1, pp. 451–457, 2013.

[19] C. J. Cox, P. G. Foster, R. P. Hirt, S. R. Harris, and T. M. Embley, “The archaebacterial origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 51, pp. 20356–20361, 2008.

[20] P. G. Foster, C. J. Cox, and T. M. Embley, “The primary divisions of life: a phylogenomic approach employing composition-heterogeneous methods,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2197–2207, 2009.

[21] T. A. Williams, P. G. Foster, T. M. Nye, C. J. Cox, and T. M. Embley, “A congruent phylogenomic signal places eukaryotes within the Archaea,” Proceedings of the Royal Society B: Biological Sciences, vol. 279, no. 1749, pp. 4870–4879, 2012.

[22] S. Gribaldo, A. M. Poole, V. Daubin, P. Forterre, and C. Brochier-Armanet, “The origin of eukaryotes and their relationship with the Archaea: are we at a phylogenomic impasse?” Nature Reviews Microbiology, vol. 8, no. 10, pp. 743–752, 2010.

[23] H. Philippe and P. Forterre, “The rooting of the universal tree of life is not reliable,” Journal of Molecular Evolution, vol. 49, no. 4, pp. 509–523, 1999.

[24] E. Sober and M. Steel, “Testing the hypothesis of common ancestry,” Journal of Theoretical Biology, vol. 218, no. 4, pp. 395–408, 2002.

[25] P. Forterre, “The common ancestor of Archaea and Eukarya was not an archaeon,” Archaea, vol. 2013, Article ID 372396, 18 pages, 2013.

[26] P. Forterre and H. Philippe, “Where is the root of the universal tree of life?” BioEssays, vol. 21, no. 10, pp. 871–879, 1999.

[27] S. Gribaldo and H. Philippe, “Ancient phylogenetic relationships,” Theoretical Population Biology, vol. 61, no. 4, pp. 391–408, 2002.

[28] J. K. Harris, S. T. Kelley, G. B. Spiegelman, and N. R. Pace, “The genetic core of the universal ancestor,” Genome Research, vol. 13, no. 3, pp. 407–412, 2003.

[29] A. Harish and G. Caetano-Anollés, “Ribosomal history reveals origins of modern protein synthesis,” PLoS ONE, vol. 7, no. 3, Article ID e32776, 2012.

[30] G. Caetano-Anollés and M. J. Seufferheld, “The coevolutionary roots of biochemistry and cellular organization challenge the RNA world paradigm,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 152–177, 2013.

[31] N. R. Pace, “Mapping the tree of life: progress and prospects,” Microbiology and Molecular Biology Reviews, vol. 73, no. 4, pp. 565–576, 2009.

[32] P. de Rijk, Y. van de Peer, I. van den Broeck, and R. de Wachter, “Evolution according to large ribosomal subunit RNA,” Journal of Molecular Evolution, vol. 41, no. 3, pp. 366–375, 1995.

[33] G. Caetano-Anollés, “Tracing the evolution of RNA structure in ribosomes,” Nucleic Acids Research, vol. 30, no. 11, pp. 2575–2587, 2002.


[34] J. Mallatt and C. J. Winchell, “Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes,” Molecular Biology and Evolution, vol. 19, no. 3, pp. 289–301, 2002.

[35] M. Gerstein, “Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census,” Proteins: Structure, Function and Bioinformatics, vol. 33, no. 4, pp. 518–534, 1998.

[36] M. Gerstein and H. Hegyi, “Comparing genomes in terms of protein structure: surveys of a finite parts list,” FEMS Microbiology Reviews, vol. 22, no. 4, pp. 277–304, 1998.

[37] B. Snel, P. Bork, and M. A. Huynen, “Genome phylogeny based on gene content,” Nature Genetics, vol. 21, no. 1, pp. 108–110, 1999.

[38] F. Tekaia, A. Lazcano, and B. Dujon, “The genomic tree as revealed from whole proteome comparisons,” Genome Research, vol. 9, no. 6, pp. 550–557, 1999.

[39] Y. I. Wolf, S. E. Brenner, P. A. Bash, and E. V. Koonin, “Distribution of protein folds in the three superkingdoms of life,” Genome Research, vol. 9, no. 1, pp. 17–26, 1999.

[40] J. O. Korbel, B. Snel, M. A. Huynen, and P. Bork, “SHOT: a web server for the construction of genome phylogenies,” Trends in Genetics, vol. 18, no. 3, pp. 158–162, 2002.

[41] G. Caetano-Anollés and D. Caetano-Anollés, “An evolutionarily structured universe of protein architecture,” Genome Research, vol. 13, no. 7, pp. 1563–1571, 2003.

[42] S. L. Baldauf, A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle, “A kingdom-level phylogeny of eukaryotes based on combined protein data,” Science, vol. 290, no. 5493, pp. 972–977, 2000.

[43] J. R. Brown, C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope, “Universal trees based on large combined protein sequence data sets,” Nature Genetics, vol. 28, no. 3, pp. 281–285, 2001.

[44] F. D. Ciccarelli, T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P. Bork, “Toward automatic reconstruction of a highly resolved tree of life,” Science, vol. 311, no. 5765, pp. 1283–1287, 2006.

[45] T. Dagan, M. Roettger, D. Bryant, and W. Martin, “Genome networks root the tree of life between prokaryotic domains,” Genome Biology and Evolution, vol. 2, no. 1, pp. 379–392, 2010.

[46] S. Jun, G. E. Sims, G. A. Wu, and S. Kim, “Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 133–138, 2010.

[47] E. A. Schultes, P. T. Hraber, and T. H. LaBean, “No molecule is an island: molecular evolution and the study of sequence space,” in Algorithmic Bioprocesses, A. Condon, D. Harel, J. N. Kok, A. Salomaa, and E. Winfree, Eds., pp. 675–704, Springer, Berlin, Germany, 2009.

[48] K. M. Kim and G. Caetano-Anollés, “The proteomic complexity and rise of the primordial ancestor of diversified life,” BMC Evolutionary Biology, vol. 11, no. 1, article 140, 2011.

[49] V. A. Albert, Parsimony, Phylogeny, and Genomics, Oxford University Press, Oxford, UK, 2005.

[50] G. Caetano-Anollés, “Evolved RNA secondary structure and the rooting of the universal tree of life,” Journal of Molecular Evolution, vol. 54, no. 3, pp. 333–345, 2002.

[51] J. A. Lake, R. G. Skophammer, C. W. Herbold, and J. A. Servin, “Genome beginnings: rooting the tree of life,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2177–2185, 2009.

[52] B. Guo, M. Zou, and A. Wagner, “Pervasive indels and their evolutionary dynamics after the fish-specific genome duplication,” Molecular Biology and Evolution, vol. 29, no. 10, pp. 3005–3022, 2012.

[53] M. K. Basu, I. B. Rogozin, O. Deusch, T. Dagan, W. Martin, and E. V. Koonin, “Evolutionary dynamics of introns in plastid-derived genes in plants: saturation nearly reached but slow intron gain continues,” Molecular Biology and Evolution, vol. 25, no. 1, pp. 111–119, 2008.

[54] J. S. Farris, “Phylogenetic analysis under Dollo’s Law,” Systematic Biology, vol. 26, no. 1, pp. 77–88, 1977.

[55] H. Fang, M. E. Oates, R. B. Pethica et al., “A daily-updated tree of (sequenced) life as a reference for genome research,” Scientific Reports, vol. 3, article 2015, 2013.

[56] T. Cavalier-Smith, “The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification,” International Journal of Systematic and Evolutionary Microbiology, vol. 52, no. 1, pp. 7–76, 2002.

[57] T. Cavalier-Smith, “Rooting the tree of life by transition analyses,” Biology Direct, vol. 1, article 19, 2006.

[58] H. S. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Widespread recruitment of ancient domain structures in modern enzymes during metabolic evolution,” Journal of Integrative Bioinformatics, vol. 10, no. 1, article 214, 2013.

[59] G. Caetano-Anollés, M. Wang, and D. Caetano-Anollés, “Structural phylogenomics retrodicts the origin of the genetic code and uncovers the evolutionary impact of protein flexibility,” PLoS ONE, vol. 8, no. 8, Article ID e72225, 2013.

[60] E. Bapteste and C. Brochier, “On the conceptual difficulties in rooting the tree of life,” Trends in Microbiology, vol. 12, no. 1, pp. 9–13, 2004.

[61] A. M. Poole, “Horizontal gene transfer and the earliest stages of the evolution of life,” Research in Microbiology, vol. 160, no. 7, pp. 473–480, 2009.

[62] D. Raoult, “The post-Darwinist rhizome of life,” The Lancet, vol. 375, no. 9709, pp. 104–105, 2010.

[63] W. F. Doolittle, “Phylogenetic classification and the universal tree,” Science, vol. 284, no. 5423, pp. 2124–2128, 1999.

[64] T. Dagan and W. Martin, “The tree of one percent,” Genome Biology, vol. 7, no. 10, article 118, 2006.

[65] W. F. Doolittle and E. Bapteste, “Pattern pluralism and the tree of life hypothesis,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 7, pp. 2043–2049, 2007.

[66] E. V. Koonin, “Towards a postmodern synthesis of evolutionary biology,” Cell Cycle, vol. 8, no. 6, pp. 799–800, 2009.

[67] T. Kloesges, O. Popa, W. Martin, and T. Dagan, “Networks of gene sharing among 329 proteobacterial genomes reveal differences in lateral gene transfer frequency at different phylogenetic depths,” Molecular Biology and Evolution, vol. 28, no. 2, pp. 1057–1074, 2011.

[68] S. Halary, J. W. Leigh, B. Cheaib, P. Lopez, and E. Bapteste, “Network analyses structure genetic diversity in independent genetic worlds,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 127–132, 2010.

[69] S. S. Abby, E. Tannier, M. Gouy, and V. Daubin, “Lateral gene transfer as a support for the tree of life,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 13, pp. 4962–4967, 2012.

[70] N Glansdorff Y Xu and B Labedan ldquoThe conflict betweenhorizontal gene transfer and the safeguard of identity origin of

24 Archaea

meiotic sexualityrdquo Journal of Molecular Evolution vol 69 no 5pp 470ndash480 2009

[71] C. G. Kurland and O. G. Berg, “A hitchhiker’s guide to evolving networks,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 361–396, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[72] L. P. Villarreal and G. Witzany, “The DNA habitat and its RNA inhabitants: at the dawn of RNA sociology,” Genomics Insights, vol. 6, pp. 1–12, 2013.

[73] S. Gribaldo and C. Brochier, “Phylogeny of prokaryotes: does it exist and why should we care?” Research in Microbiology, vol. 160, no. 7, pp. 513–521, 2009.

[74] B. D. Mishler, “The logic of the data matrix in phylogenetic analysis,” in Parsimony, Phylogeny, and Genomics, V. A. Albert, Ed., pp. 57–69, Oxford University Press, New York, NY, USA, 2005.

[75] A. Dress, V. Moulton, M. Steel, and T. Wu, “Species, clusters and the “tree of life”: a graph-theoretic perspective,” Journal of Theoretical Biology, vol. 265, no. 4, pp. 535–542, 2010.

[76] N. Lane and W. Martin, “The energetics of genome complexity,” Nature, vol. 467, no. 7318, pp. 928–934, 2010.

[77] M. C. Rivera and J. A. Lake, “The ring of life provides evidence for a genome fusion origin of eukaryotes,” Nature, vol. 431, no. 7005, pp. 152–155, 2004.

[78] D. Alvarez-Ponce and J. O. McInerney, “The human genome retains relics of its prokaryotic ancestry: human genes of archaebacterial and eubacterial origin exhibit remarkable differences,” Genome Biology and Evolution, vol. 3, no. 1, pp. 782–790, 2011.

[79] A. Poole and D. Penny, “Eukaryote evolution: engulfed by speculation,” Nature, vol. 447, no. 7147, p. 913, 2007.

[80] E. V. Koonin and Y. I. Wolf, “The fundamental units, processes and patterns of evolution, and the tree of life conundrum,” Biology Direct, vol. 4, no. 1, article 33, 2009.

[81] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Search for a “tree of life” in the thicket of the phylogenetic forest,” Journal of Biology, vol. 8, no. 6, article 59, 2009.

[82] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Seeing the tree of life behind the phylogenetic forest,” BMC Biology, vol. 11, article 46, 2013.

[83] D. B. Wetlaufer, “Nucleation, rapid folding, and globular intrachain regions in proteins,” Proceedings of the National Academy of Sciences of the United States of America, vol. 70, no. 3, pp. 697–701, 1973.

[84] J. S. Richardson, “The anatomy and taxonomy of protein structure,” Advances in Protein Chemistry, vol. 34, pp. 167–339, 1981.

[85] M. Riley and B. Labedan, “Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module,” Journal of Molecular Biology, vol. 268, no. 5, pp. 857–868, 1997.

[86] J. Janin and S. J. Wodak, “Structural domains in proteins and their role in the dynamics of protein function,” Progress in Biophysics and Molecular Biology, vol. 42, pp. 21–78, 1983.

[87] G. Caetano-Anollés, M. Wang, D. Caetano-Anollés, and J. E. Mittenthal, “The origin, evolution and structure of the protein world,” Biochemical Journal, vol. 417, no. 3, pp. 621–637, 2009.

[88] J. Gough, “Convergent evolution of domain architectures (is rare),” Bioinformatics, vol. 21, no. 8, pp. 1464–1471, 2005.

[89] K. Forslund, A. Henricson, V. Hollich, and E. L. L. Sonnhammer, “Domain tree-based analysis of protein architecture evolution,” Molecular Biology and Evolution, vol. 25, no. 2, pp. 254–264, 2008.

[90] S. Yang and P. E. Bourne, “The evolutionary history of protein domains viewed by species phylogeny,” PLoS ONE, vol. 4, no. 12, Article ID e8378, 2009.

[91] K. M. Kim and G. Caetano-Anollés, “The evolutionary history of protein fold families and proteomes confirms that the archaeal ancestor is more ancient than the ancestors of other superkingdoms,” BMC Evolutionary Biology, vol. 12, article 13, 2012.

[92] M. Wang and G. Caetano-Anollés, “The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world,” Structure, vol. 17, no. 1, pp. 66–78, 2009.

[93] G. Caetano-Anollés and A. Nasir, “Benefits of using molecular structure and abundance in phylogenomic analysis,” Frontiers in Genetics, vol. 3, article 172, 2012.

[94] K. Brading and E. Castellani, Symmetries in Physics: Philosophical Reflections, Cambridge University Press, Cambridge, UK, 2003.

[95] J. A. Cotton and M. Wilkinson, “Quantifying the potential utility of phylogenetic characters,” Taxon, vol. 57, no. 1, pp. 131–136, 2008.

[96] E. Mossel and M. Steel, “A phase transition for a random cluster model on phylogenetic trees,” Mathematical Biosciences, vol. 187, no. 2, pp. 189–203, 2004.

[97] M. H. Bailor, X. Sun, and H. M. Al-Hashimi, “Topology links RNA secondary structure with global conformation, dynamics, and adaptation,” Science, vol. 327, no. 5962, pp. 202–206, 2010.

[98] K. Illergård, D. H. Ardell, and A. Elofsson, “Structure is three to ten times more conserved than sequence – a study of structural response in protein cores,” Proteins: Structure, Function and Bioinformatics, vol. 77, no. 3, pp. 499–508, 2009.

[99] G. Caetano-Anollés, S. K. Hee, and J. E. Mittenthal, “The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 22, pp. 9358–9363, 2007.

[100] P. H. Weston, “Methods for rooting cladistic trees,” in Models in Phylogeny Reconstruction, D. J. Siebert, R. W. Scotland, and D. M. Williams, Eds., pp. 125–155, Oxford University Press, Oxford, UK, 1994.

[101] H. N. Bryant, “The polarization of character transformations in phylogenetic systematics: role of axiomatic and auxiliary assumptions,” Systematic Biology, vol. 40, no. 4, pp. 433–445, 1991.

[102] H. N. Bryant and G. Wagner, “Character polarity and the rooting of cladograms,” in The Character Concept in Evolutionary Biology, G. P. Wagner, Ed., pp. 319–338, Academic Press, New York, NY, USA, 2001.

[103] G. Nelson, “Ontogeny, phylogeny, paleontology, and the biogenetic law,” Systematic Biology, vol. 27, no. 3, pp. 324–345, 1978.

[104] P. H. Weston, “Indirect and direct methods in systematics,” in Ontogeny and Systematics, C. J. Humphries, Ed., pp. 27–56, Columbia University Press, New York, NY, USA, 1988.

[105] F.-J. Sun and G. Caetano-Anollés, “The evolutionary history of the structure of 5S ribosomal RNA,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 430–443, 2009.

[106] X. Fang, T. Pan, and T. R. Sosnick, “Mg2+-dependent folding of a large ribozyme without kinetic traps,” Nature Structural Biology, vol. 6, no. 12, pp. 1091–1095, 1999.

[107] F.-J. Sun and G. Caetano-Anollés, “Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses,” PLoS Computational Biology, vol. 4, no. 3, Article ID e1000018, 2008.

[108] F.-J. Sun and G. Caetano-Anollés, “The origin and evolution of tRNA inferred from phylogenetic analysis of structure,” Journal of Molecular Evolution, vol. 66, no. 1, pp. 21–35, 2008.

[109] W. R. Ashby, An Introduction to Cybernetics, Taylor & Francis, London, UK, 1955.

[110] J. G. Lundberg, “Wagner networks and ancestors,” Systematic Biology, vol. 21, no. 4, pp. 398–413, 1972.

[111] P. T. Chippindale and J. J. Wiens, “Weighting, partitioning, and combining characters in phylogenetic analysis,” Systematic Biology, vol. 43, no. 2, pp. 278–287, 1994.

[112] J. Mittenthal, D. Caetano-Anollés, and G. Caetano-Anollés, “Biphasic patterns of diversification and the emergence of modules,” Frontiers in Genetics, vol. 3, article 147, 2012.

[113] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[114] C. A. Orengo, A. Michie, S. Jones, D. T. Jones, M. Swindells, and J. M. Thornton, “CATH – a hierarchic classification of protein domain structures,” Structure, vol. 5, no. 8, pp. 1093–1109, 1997.

[115] G. Sawa, J. Dicks, and I. N. Roberts, “Current approaches to whole genome phylogenetic analysis,” Briefings in Bioinformatics, vol. 4, no. 1, pp. 63–74, 2003.

[116] R. Sathyapriya, J. M. Duarte, H. Stehr, I. Filippis, and M. Lappe, “Defining an essence of structure determining residue contacts in proteins,” PLoS Computational Biology, vol. 5, no. 12, Article ID e1000584, 2009.

[117] D. Penny and L. J. Collins, “Evolutionary genomics leads the way,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 1–16, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[118] M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.

[119] A. Andreeva, D. Howorth, J. Chandonia et al., “Data growth and its impact on the SCOP database: new developments,” Nucleic Acids Research, vol. 36, pp. D419–D425, 2008.

[120] L. S. Yafremava, M. Wielgos, S. Thomas et al., “A general framework of persistence strategies for biological systems helps explain domains of life,” Frontiers in Genetics, vol. 4, article 16, 2013.

[121] J. L. Friar, T. Goldman, and J. Pérez-Mercader, “Genome sizes and the Benford distribution,” PLoS ONE, vol. 7, no. 5, Article ID e36624, 2012.

[122] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya,” BMC Evolutionary Biology, vol. 12, article 156, 2012.

[123] A. Nasir and G. Caetano-Anollés, “Comparative analysis of proteomes and functionomes provides insights into origins of cellular diversification,” Archaea, vol. 2013, Article ID 648746, 13 pages, 2013.

[124] W. Martin and M. Müller, “The hydrogen hypothesis for the first eukaryote,” Nature, vol. 392, no. 6671, pp. 37–41, 1998.

[125] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Global patterns of protein domain gain and loss in superkingdoms,” PLoS Computational Biology, vol. 10, no. 1, Article ID e1003452, 2014.

[126] M. Wang, L. S. Yafremava, D. Caetano-Anollés, J. E. Mittenthal, and G. Caetano-Anollés, “Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world,” Genome Research, vol. 17, no. 11, pp. 1572–1585, 2007.

[127] L. Guy and T. J. G. Ettema, “The archaeal “TACK” superphylum and the origin of eukaryotes,” Trends in Microbiology, vol. 19, no. 12, pp. 580–587, 2011.

[128] D. Schlieper, M. A. Oliva, J. M. Andreu, and J. Löwe, “Structure of bacterial tubulin BtubA/B: evidence for horizontal gene transfer,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 26, pp. 9170–9175, 2005.

[129] M. P. Hoeppner, P. P. Gardner, and A. M. Poole, “Comparative analysis of RNA families reveals distinct repertoires for each domain of life,” PLoS Computational Biology, vol. 8, no. 11, Article ID e1002752, 2012.

[130] H. Xue, K. Tong, C. Marck, H. Grosjean, and J. T. Wong, “Transfer RNA paralogs: evidence for genetic code-amino acid biosynthesis coevolution and an archaeal root of life,” Gene, vol. 310, no. 1-2, pp. 59–66, 2003.

[131] H. Xue, S. Ng, K. Tong, and J. T. Wong, “Congruence of evidence for a Methanopyrus-proximal root of life based on transfer RNA and aminoacyl-tRNA synthetase genes,” Gene, vol. 360, no. 2, pp. 120–130, 2005.

[132] M. Di Giulio, “Nanoarchaeum equitans is a living fossil,” Journal of Theoretical Biology, vol. 242, no. 1, pp. 257–260, 2006.

[133] M. Di Giulio, “The tree of life might be rooted in the branch leading to Nanoarchaeota,” Gene, vol. 401, no. 1-2, pp. 108–113, 2007.

[134] J. T. Wong, J. Chen, W. Mat, S. Ng, and H. Xue, “Polyphasic evidence delineating the root of life and roots of biological domains,” Gene, vol. 403, no. 1-2, pp. 39–52, 2007.

[135] F.-J. Sun and G. Caetano-Anollés, “The ancient history of the structure of ribonuclease P and the early origins of Archaea,” BMC Bioinformatics, vol. 11, article 153, 2010.

[136] C. Marck and H. Grosjean, “tRNomics: analysis of tRNA genes from 50 genomes of Eukarya, Archaea, and bacteria reveals anticodon-sparing strategies and domain-specific features,” RNA, vol. 8, no. 10, pp. 1189–1232, 2002.

[137] W. Mat, H. Xue, and J. T. Wong, “The genomics of LUCA,” Frontiers in Bioscience, vol. 13, no. 14, pp. 5605–5613, 2008.

[138] C. Brochier, S. Gribaldo, Y. Zivanovic, F. Confalonieri, and P. Forterre, “Nanoarchaea: representatives of a novel archaeal phylum or a fast-evolving euryarchaeal lineage related to Thermococcales?” Genome Biology, vol. 6, no. 5, article R42, 2005.

[139] B. Billoud, M. Guerrucci, M. Masselot, and J. S. Deutsch, “Cirripede phylogeny using a novel approach: molecular morphometrics,” Molecular Biology and Evolution, vol. 17, no. 10, pp. 1435–1445, 2000.

[140] L. J. Collins, V. Moulton, and D. Penny, “Use of RNA secondary structure for studying the evolution of RNase P and RNase MRP,” Journal of Molecular Evolution, vol. 51, no. 3, pp. 194–204, 2000.

[141] G. Caetano-Anollés, “Novel strategies to study the role of mutation and nucleic acid structure in evolution,” Plant Cell, Tissue and Organ Culture, vol. 67, no. 2, pp. 115–132, 2001.

[142] F.-J. Sun, S. Fleurdepine, C. Bousquet-Antonelli, G. Caetano-Anollés, and J. Deragon, “Common evolutionary trends for SINE RNA structures,” Trends in Genetics, vol. 23, no. 1, pp. 26–33, 2007.

[143] D. Caetano-Anollés, K. M. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Proteome evolution and the metabolic origins of translation and cellular life,” Journal of Molecular Evolution, vol. 72, no. 1, pp. 14–33, 2011.

[144] M. Wang, Y. Jiang, K. M. Kim et al., “A universal molecular clock of protein folds and its power in tracing the early history of aerobic metabolism and planet oxygenation,” Molecular Biology and Evolution, vol. 28, no. 1, pp. 567–582, 2011.

[145] S. A. Bukhari and G. Caetano-Anollés, “Origin and evolution of protein fold designs inferred from phylogenomic analysis of CATH domain structures in proteomes,” PLoS Computational Biology, vol. 9, no. 3, Article ID e1003009, 2013.

[146] M. Vesteg and J. Krajčovič, “The falsifiability of the models for the origin of eukaryotes,” Current Genetics, vol. 57, no. 6, pp. 367–390, 2011.

[147] I. N. Berezovsky and E. I. Shakhnovich, “Physics and evolution of thermophilic adaptation,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 36, pp. 12742–12747, 2005.

[148] C. Woese, “The universal ancestor,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 12, pp. 6854–6859, 1998.

[149] M. J. Seufferheld and G. Caetano-Anollés, “Phylogenomics supports a cellularly structured urancestor,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 178–191, 2013.

[150] O. Kandler, “Cell wall biochemistry and three-domain concept of life,” Systematic and Applied Microbiology, vol. 16, no. 4, pp. 501–509, 1994.

[151] C. R. Woese, “On the evolution of cells,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 13, pp. 8742–8747, 2002.

[152] S. L. Baldauf, J. D. Palmer, and W. F. Doolittle, “The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny,” Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 15, pp. 7749–7754, 1996.

[153] H. Leffers, J. Kjems, L. Østergaard, N. Larsen, and R. A. Garrett, “Evolutionary relationships amongst archaebacteria. A comparative study of 23 S ribosomal RNAs of a sulphur-dependent extreme thermophile, an extreme halophile and a thermophilic methanogen,” Journal of Molecular Biology, vol. 195, no. 1, pp. 43–61, 1987.

[154] J. Martin, D. Blackburn, and E. Wiley, “Are node-based and stem-based clades equivalent? Insights from graph theory,” PLoS Currents, vol. 2, Article ID RRN1196, 2010.


[Figure 4 graphic not reproduced. Panel (a) labels: Archaea, Bacteria, Eukarya, Root; taxa = 666 5S rRNA molecules of diverse organisms; characters = sequence and structure of tRNA; tree length = 11,381 steps; CI = 0.072, RI = 0.772, and g1 = −0.13. Panel (b) labels: timeline from ancestral to derived lineage groupings of Archaea (A), Bacteria (B), and Eukarya (E). Panel (c) labels: Root, Eukarya, Archaea, Bacteria; taxa = 150 free-living proteomes; characters = genomic abundance of 1510 FSFs; tree length = 70,132 steps; CI = 0.155, RI = 0.739, and g1 = −0.28.]

Figure 4: Trees of life generated from the structure of RNA and protein molecules congruently show a rooting in Archaea. (a) A rooted phylogenetic tree of 5S rRNA reconstructed from both the sequence and the structure of the molecules (from [105]). (b) Global most-parsimonious scenario of organismal diversification based on tRNA (from [107]). A total of 571 tRNA molecules with sequence, base modification, and structural information were used to build a ToL, which failed to show monophyletic groupings. Ancestries of lineages were then inferred by constraining sets of tRNAs into monophyletic groups representing competing (shown in boxes) or noncompeting phylogenetic hypotheses and measuring tree suboptimality and lineage coalescence (illustrated with color hues in circles). (c) A ToL reconstructed from the genomic abundance counts of 1510 FSFs as phylogenetic characters in the proteomes of 150 free-living organisms sampled equally and randomly from the three domains of life (data taken from [122, 123]). Taxa were labeled with circles colored according to superkingdom. CI = consistency index, RI = retention index, and g1 = gamma distribution parameter.

selected conformations and decreasing their relative number. Thus, conformational diversity measured, for example, by the Shannon entropy of the base-pairing probability matrix or by features of thermodynamic stability acts as an “evo-devo” proxy of a generality criterion for RNA molecules, in which the criterion of similarity (e.g., ontogenetic transformation) is of positional and compositional correspondence. Using a different approach, we recently broadened the use of phylogenetic constraint analysis [107, 108], borrowing from a formal cybernetic method that decomposes a reconstructable system into its components [109], and used it to root a ToL derived from tRNA sequence and structure (Figure 4(b)). The ToL was again rooted in Archaea. The number of additional steps required to force (constrain) particular taxa into a monophyletic group was used to define a lineage coalescence distance (S) with which to test alternative hypotheses of monophyly. These hypotheses were then ordered according to S value in an evolutionary timeline. Since S records the relative distribution of character states in taxic sets, it also embodies the generality criterion of rooting. Finally, we generated rooted ToLs that describe the evolution of proteomes directly from a census of protein fold structures in proteomes (Figure 4(c)). This ToL shows a paraphyletic rooting in Archaea. The method extracts phylogenetic signal from the proteomic abundance of protein fold structures and considers that the most abundant and widely distributed folds are of ancient origin when defining transformation series [41]. This polarization scheme, which results in the gradual growth of the proteome repertoire, again represents an embodiment of the generality criterion, in which statements of homology (fold structures) result from additive phylogenetic change (increases in abundance). It is noteworthy that these ToL reconstructions take into account the genomic abundance of each and every fold structure in each proteome and across the entire matrix, thereby generating a frustrated system. In general, very high abundance of only a few folds will not attract taxa (species) to the derived branches in the ToL. The ancestry of taxa is determined by both the abundance and interplay among fold structure characters. For example, metabolic folds such as those involved in ATP hydrolysis are widespread in cells and considered to be very ancient. These are also the most ancient folds in our phylogenies. In comparison, some popular eukaryote-specific folds (e.g., immunoglobulin superfamilies) are highly abundant but appear in a derived manner in our phylogenies. Thus, we reason that there is no circularity involved in the character polarization scheme.
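The conformational diversity measure mentioned at the start of this passage can be made concrete with a small sketch. It assumes the entropy is taken over the upper triangle of a base-pairing probability matrix, Q = −Σ p_ij ln p_ij for i < j; the exact normalization used in the cited studies is not stated here, and the two toy matrices are hypothetical.

```python
import math


def base_pair_entropy(bpp):
    """Shannon entropy Q = -sum_{i<j} p_ij * ln(p_ij) of a base-pairing
    probability matrix given as a square list of lists (assumed form)."""
    n = len(bpp)
    q = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            p = bpp[i][j]
            if p > 0.0:  # terms with p = 0 contribute nothing
                q -= p * math.log(p)
    return q


# Hypothetical 4-nucleotide molecules.
# "rigid": base 0 pairs with base 3 with certainty (one conformation).
rigid = [[0.0, 0.0, 0.0, 1.0],
         [0.0, 0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0, 0.0],
         [1.0, 0.0, 0.0, 0.0]]

# "flexible": base 0 pairs with base 2 or base 3 with equal probability.
flexible = [[0.0, 0.0, 0.5, 0.5],
            [0.0, 0.0, 0.0, 0.0],
            [0.5, 0.0, 0.0, 0.0],
            [0.5, 0.0, 0.0, 0.0]]

q_rigid = base_pair_entropy(rigid)
q_flexible = base_pair_entropy(flexible)
print(q_rigid, q_flexible)
```

A molecule with a single certain pairing scores zero, while one that splits its probability between alternative pairings scores higher, which is the sense in which the entropy tracks the number of competing conformations.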

The compositional schemes extend the concept of rooting with paralogous sequences to the entire proteome complements, from gene family level [100] to structural hierarchies. The three examples make use of different rooting rationales but provide a congruent scenario of origins of diversified life. Technically, roots are inferred by the Lundberg method [110], which does not require any outgroup taxa specification. This method roots the trees a posteriori by attaching the hypothetical ancestor to that branch of the unrooted network that would yield minimum increase in the tree length (thus preserving the principle of parsimony).
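The a posteriori rooting step can be sketched as follows. This is a simplified illustration and not the implementation used in the cited studies: it assumes ordered (Wagner) characters, assumes the states at both ends of every branch are already fixed, and uses an all-zero hypothetical ancestor. Taxon names and state vectors are invented for the example. Under these assumptions, the extra length from joining an ancestor to a branch u-v is [d(anc, u) + d(anc, v) − d(u, v)] / 2, with d the Manhattan distance.

```python
def manhattan(a, b):
    """Number of steps between two nodes under ordered (Wagner) characters."""
    return sum(abs(x - y) for x, y in zip(a, b))


def attachment_cost(ancestor, u, v):
    """Extra tree length from joining `ancestor` to the branch u-v.

    Interval-distance formula for ordered characters; assumes the states
    at both ends of the branch are already fixed.
    """
    return (manhattan(ancestor, u) + manhattan(ancestor, v) - manhattan(u, v)) / 2


def lundberg_root(branches, states, ancestor):
    """Return the branch with the minimum attachment cost, plus all costs."""
    costs = {
        (a, b): attachment_cost(ancestor, states[a], states[b])
        for a, b in branches
    }
    return min(costs, key=costs.get), costs


# Hypothetical unrooted network: three terminals joined at internal node X.
states = {
    "A": [0, 2, 2],
    "B": [3, 0, 2],
    "E": [4, 3, 5],
    "X": [3, 2, 2],  # per-character median of the three terminals
}
branches = [("A", "X"), ("B", "X"), ("E", "X")]
ancestor = [0, 0, 0]  # assumed: every character in its ancestral state

root_branch, costs = lundberg_root(branches, states, ancestor)
print(root_branch, costs)
```

In this toy network the ancestor attaches most cheaply to the branch leading to terminal A, so the root is placed there without naming any outgroup.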

(iii) Characters Must Serve as Independent Evolutionary Hypotheses. Valid phylogenetic optimization requires that characters be independent pieces of evidence. Characters should not depend on other characters. When the assumption of independence is violated, characters are overweighted in the analysis and the resulting phylogeny fails to represent true history [111]. Possible dependencies could be of many kinds, from structural to functional, from developmental to ecological. These dependencies distort and obscure phylogenetic signal and must be either avoided or coded into the phylogenetic model through parameters or weight corrections.
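A toy example may help show what overweighting means in practice. The sketch below is only an illustration under strong assumptions: dependence is reduced to its most extreme form (two characters that are exact copies of each other), and the correction simply gives each copy a weight of 1/k when k identical columns exist. Real dependencies are rarely this clean, and the matrix is hypothetical.

```python
from collections import Counter


def column_weights(matrix):
    """Weight each character by 1/k, where k is the number of identical columns."""
    columns = list(zip(*matrix))
    counts = Counter(columns)
    return [1 / counts[col] for col in columns]


def weighted_distance(row_a, row_b, weights):
    """Sum of weights over the characters in which two taxa differ."""
    return sum(w for x, y, w in zip(row_a, row_b, weights) if x != y)


# Hypothetical matrix: rows are taxa, columns are characters.
# Character 2 is an exact copy of character 1 (a fully dependent character).
matrix = [
    [0, 1, 1, 0],  # taxon t1
    [0, 0, 0, 1],  # taxon t2
    [1, 0, 0, 0],  # taxon t3
]

equal = [1.0] * 4
corrected = column_weights(matrix)

d12_equal = weighted_distance(matrix[0], matrix[1], equal)
d12_corrected = weighted_distance(matrix[0], matrix[1], corrected)
d23_equal = weighted_distance(matrix[1], matrix[2], equal)
d23_corrected = weighted_distance(matrix[1], matrix[2], corrected)
print(d12_equal, d12_corrected, d23_equal, d23_corrected)
```

With equal weights the duplicated character counts twice and pushes t1 and t2 further apart than t2 and t3; after the correction the two pairs are equally distant, because the copied column no longer contributes a second vote.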

As we will now elaborate, the problem of character independence is about parts and wholes in the hierarchical fabric of life and in the nested hierarchies of the ToL. Biological systems are by definition made of parts, regardless of the way parts are defined. In evolution, diversification and integration of parts unify parts into cohesive entities, modules, which then diversify [112]. This process and the rise of modules may explain evolutionary waves of complexity and organization and the emergence of structure (defined broadly) in biology that is hierarchical. The hierarchical makeup is made evident in the structure of protein molecules, where lower level parts of the polymer (the amino acid residues) interact with each other and establish cohesive higher-level modular parts, which also establish interaction networks and are crucial for molecular function and for interaction of proteins with the cellular environment. Consequently, the structure of proteins can be described at increasing hierarchical levels of structural abstraction: sequences, motifs, loops, domains, families, superfamilies, topologies, folds, architectures, and classes. Two accepted gold standards of protein classification, structural classification of proteins (SCOP) [113] and class architecture topology homology (CATH) [114], use parts of this incomplete scheme to describe the atomic complexity of the molecules. We note that these classifications do not consider unrealized structural states, such as protein folds that are possible but that have never been identified in the natural world of protein structures. We also note that modules sometimes engage in combinatorial games. For example, protein domains are rearranged in evolution by fusions and fissions, producing the enormous diversity of alternative domain rearrangements that exist in multidomain proteins [92].

Since biological systems evolve and carry common ancestry, parts of these systems by definition evolve and by themselves carry common ancestry. In other words, the histories of parts are embodied in the ensemble of lineages of the ToL. The focus of phylogenetic analysis, however, has been overwhelmingly the organism as the biological system, as testified by the effort devoted to taxonomical classification, systematic biology, and building of the ToL. The genomic revolution has provided a wealth of parts in the amino acid sequence sites of proteins and nucleic acid molecules, which have been used as phylogenetic characters for analysis of molecules representing organisms. This has maintained the focus of reconstructing trees of systems (the wholes). For example, amino acid sites of a protein are generally fit into a data matrix (alignment), which is then used to build trees of genes and organisms (Figure 5(a)) using modern algorithmic implementations [115]. However, and as we have mentioned, proteins have complicated structures that result from interactions between amino acids at the 3D atomic level. These intramolecular interactions, or at least some of them [116], induce protein folding and delimit molecular function and protein stability. They are responsible for protein secondary, supersecondary, domain, and tertiary structure, and by definition, their mere existence induces violation of character independence. Penny and Collins [117] proposed the simple thought experiment in which the bioinformatician exchanges columns of sequence sites in the alignment matrix and asks what was lost in the process. Randomization of characters (columns) in the data matrix does not change the phylogenetic tree. However, randomization destroys the structure of the molecule and very likely its function. This confirms that reconstruction of trees from information in sequence sites violates character independence and in the process ignores structure and biology. The effects of violation of character independence may be minimal for trees of sequences that are closely related. However, trees describing deep historical relationships require that sequences be divergent, and this maximizes the chances of even wider divergences in molecular structure that are not being accounted for by the models of sequence evolution.

[Figure 5 graphic not reproduced. Panel (a), canonical paradigm (evolution of systems): systems 1 to 3 are proteins from Pfu, Mja, and Mba; characters are parts (amino acid sequence sites 1, 5, 6, and 7, with example rows A L Y S, A L Y S, and G L S S); interactions between parts (e.g., atomic contacts between sites) have a high impact on phylogenetic reconstruction. Panel (b), new paradigm (evolution of parts): parts 1 to 3 are FSF domains c.37.1, c.1.2, and c.26.1; characters are systems (proteomes of Pfu, Mja, and Mba, with example rows 22 29 28, 54 44 43, and 84 88 80); interactions between systems (e.g., proteomes in trophic chains, from picoplankton, nanoplankton, microplankton, and mesoplankton to higher trophic levels) have a low impact on phylogenetic reconstruction.]

Figure 5: A new phylogenetic strategy simplifies the problems of character independence. (a) The canonical paradigm explores the evolution of systems, such as the evolution of organisms in the reconstruction of ToLs. For example, the terminals of phylogenetic trees can be genes sampled from different organisms (e.g., Pyrococcus furiosus (Pfu), Methanococcus jannaschii (Mja), and Methanosarcina barkeri (Mba)) and phylogenetic characters can be amino acid sequence sites of the corresponding gene products. Character states can describe the identity of the amino acid at each site. Since characters are molecular parts that interact with other parts when molecules fold into compact 3D structures, their interaction violates the principle of character independence. Consequently, the effects of covariation must be considered in the phylogenetic model used to build the trees. (b) The new paradigm explores the evolution of parts, such as the evolution of protein domains in proteomes. For example, the terminals of phylogenetic trees can be domains defined at fold superfamily level of structural complexity and the characters used to build the trees can be proteomes. Character states can be the number of domains holding the FSF structure. Since proteomes interact with other proteomes when organisms establish close interactions, interactions that could affect the abundance of domains in proteomes should be considered negligible (unless there is an obligate parasitic lifestyle involved) and there is no need to budget trophic interactions in the phylogenetic model.
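The thought experiment of Penny and Collins can be run on the toy alignment shown in Figure 5(a). The sketch below uses pairwise Hamming distances as a stand-in for any tree-building criterion that treats sites as exchangeable; the system labels and their pairing with the three rows are assumptions, since the figure layout does not survive in the text.

```python
from itertools import combinations, permutations


def hamming(a, b):
    """Number of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def pairwise_distances(alignment):
    """Hamming distance for every pair of sequences in the alignment."""
    names = sorted(alignment)
    return {
        (p, q): hamming(alignment[p], alignment[q])
        for p, q in combinations(names, 2)
    }


def permute_columns(alignment, order):
    """Reorder the sites (columns) of every sequence in the same way."""
    return {
        name: "".join(seq[i] for i in order)
        for name, seq in alignment.items()
    }


# Toy alignment from Figure 5(a); row-to-system pairing is assumed.
alignment = {"sys1": "ALYS", "sys2": "ALYS", "sys3": "GLSS"}

reference = pairwise_distances(alignment)

# Try every possible column order and check that nothing changes.
unchanged = all(
    pairwise_distances(permute_columns(alignment, order)) == reference
    for order in permutations(range(4))
)
print(reference, unchanged)
```

Every one of the 24 column orders yields the same distances, so the analysis cannot see which sites were neighbors or in contact. That information, which determines how the molecule folds, is exactly what the shuffled matrix has lost.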

The genomic revolution has also provided a wealth of models of 3D atomic structure. These structures are used as gold standards to assign with high confidence structural modules to sequences. As mentioned earlier, proteomes embody collections of protein domains with well-defined structures and functions. Protein domain counts in proteomes have been used to generate trees of protein domains (Figure 5(b)) using standard cladistic approaches and well-established methods (reviewed in [87]). These trees describe the evolution of protein structure at global level. They are effectively trees of parts. While domains interact with each other in multidomain proteins or establish protein-protein interactions with the domains of other proteins, these interactions of parts do not violate character independence. This is because phylogenetic characters are actually proteomes, systems defined by structural states that exist at much higher levels than the protein domain, not far away from the organism level. Remarkably, no information is lost when character columns in the matrix are randomized in the thought experiment. The order of proteome characters in the matrix does not follow any rationale. Characters are not ordered by lifestyles or trophic levels of the organisms. Interactions between free-living organisms will seldom bias their domain makeup, and if so, those characters can be excluded from analysis. Even the establishment of symbiotic or obligate parasitic interactions, such as the nodule-forming symbioses between rhizobia and legumes, may have little impact on character independence as long as the joint inclusion of the host and the symbiont is avoided.
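A data matrix for a tree of parts can be sketched from the counts shown in Figure 5(b), with FSF domains as rows (taxa) and proteomes as columns (characters). Two things here are assumptions rather than statements from the text: the pairing of each row of counts with an FSF identifier, which is a reading of the flattened figure, and the coding step, which uses a logarithmic rescaling of counts to integer states 0 to 20 relative to the largest count in each proteome. The text specifies only that character states record the number of domains holding each FSF structure.

```python
import math

PROTEOMES = ["Pfu", "Mja", "Mba"]

# Domain counts per proteome; the row-to-FSF pairing is assumed.
abundance = {
    "c.37.1": [84, 88, 80],
    "c.1.2": [54, 44, 43],
    "c.26.1": [22, 29, 28],
}


def rescale(count, max_count, top_state=20):
    """Map a raw count to an integer state in 0..top_state (assumed coding)."""
    return round(math.log(count + 1) / math.log(max_count + 1) * top_state)


def code_matrix(abundance, n_characters):
    """Rescale each character (proteome column) by its own maximum count."""
    maxima = [
        max(row[j] for row in abundance.values())
        for j in range(n_characters)
    ]
    return {
        fsf: [rescale(row[j], maxima[j]) for j in range(n_characters)]
        for fsf, row in abundance.items()
    }


states = code_matrix(abundance, len(PROTEOMES))
for fsf, row in states.items():
    print(fsf, dict(zip(PROTEOMES, row)))
```

Because every proteome column is rescaled independently, reordering the columns changes nothing about the coded states of any FSF, which matches the remark above that the order of proteome characters follows no rationale.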

6. Evidence Supporting the Archaeal Rooting of the Universal Tree

Figure 4 shows examples of rooted ToLs generated from the sequence and structure of RNA and protein molecules. The different phylogenomic approaches arrive at a common rooted topology that places the stem group of Archaea at the base of the ToL (the archaeal rooting of Figure 1(c)). However, the evolutionary interrelationship of parts and wholes prompts the use of trees of parts, a focus on higher-level structure, and a decrease in confidence in the power of trees of systems. These concepts have been applied to the study of the history of nucleic acid and protein structures for over a decade and provide additional evidence in support of the archaeal rooting scenario.

[Figure 6 graphic not reproduced. Each panel shows two three-set Venn diagrams for Archaea, Bacteria, and Eukarya: one for all items and one for items with f > 0.6. Panel (a): FSF domains. Panel (b): GO terms. Panel (c): RNA families.]

Figure 6: Venn diagrams displaying the distributions of 1733 FSF domains (a), 1924 terminal GO terms (b), and 1148 RNA families (c) in the genomes of the three domains of life. FSF domain data was taken from Nasir et al. [122, 123] and included 981 completely sequenced proteomes from 70 Archaea, 652 Bacteria, and 259 Eukarya. Terminal GO terms corresponding to the “molecular function” hierarchy defined by the GO database [118] were identified in 249 free-living organisms, including 45 Archaea, 183 Bacteria, and 21 Eukarya (data taken from [123]). The Venn diagram of RNA families and the distribution of Rfam clans and families in organisms were taken from Hoeppner et al. [129] and their Dataset S1. Shown below are distribution patterns for FSFs, GO terms, and RNA families that are present in more than 60% of the organisms examined (f > 0.6). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

6.1. Comparative Genomic Argumentation. The distribution of gene-encoded products in the genomes of sequenced organisms and parsimony thinking can reveal global evolutionary patterns without formal phylogenetic reconstruction. We will show how simple numerical analyses of protein domains, molecular activities defined by the gene ontology (GO) consortium [118], and RNA families that are domain-specific or are shared between domains of life can uncover the tripartite division of cellular life, exclude chimeric scenarios of origin, and provide initial insight on the rooting of the ToL (Figure 6).

The number of unique protein folds observed in nature is very small. SCOP ver. 1.75 defines only ~2000 fold superfamilies (FSFs), groups of homologous domains unified on the basis of common structural and evolutionary relationships, for a total of 110,800 known domains in proteins [113, 119]. FSFs represent highly conserved evolutionary units that are suitable for studying organismal diversification [87]. Yafremava et al. [120] plotted the total number of distinct FSFs (FSF diversity) versus the average reuse of FSFs in the proteome of an organism (FSF abundance). This exercise uncovered a scaling behavior typical of a Benford distribution, with a linear regime of proteomic growth for microbial organisms and a superlinear regime for eukaryotic organisms (Figure 8 in [120]). These same scaling patterns are observed when studying the relationship between open reading frames and genome size [121]. Remarkably, archaeal and eukaryal proteomes exhibited minimum and maximum levels of FSF abundance and diversity, respectively. Bacterial proteomes, however, showed intermediate levels. We note that the general scaling behavior is consistent with a scenario in which evolutionary diversification proceeds from simpler proteomes to more complex ones in a gradual manner, supporting the principle of spatiotemporal continuity and revealing the nested phylogenetic hierarchies of organisms. Under this scenario (and the results of [120]), the streamlined archaeal proteomes represent the earliest form of cellular life. Remarkably, archaeal species harboring thermophilic and hyperthermophilic lifestyles encoded the most streamlined FSF repertoires (Figure 7). Clearly, modern thermophilic archaeons are most closely related to the ancient cells that inhabited planet Earth billions of years ago (also read below).
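The two quantities compared in this kind of plot can be computed from a simple list of domain assignments. The sketch below is only an illustration and not the procedure of [120]: the function name `fsf_use_and_reuse` and the toy proteomes are hypothetical, and FSF reuse is assumed here to be the mean number of domain copies per distinct FSF, since the text describes it only as "average reuse."

```python
import math
from collections import Counter


def fsf_use_and_reuse(domain_assignments):
    """Return (log10 diversity, log10 mean reuse) for one proteome.

    domain_assignments: iterable of FSF identifiers, one entry per domain
    detected in the proteome (repeats allowed).
    """
    counts = Counter(domain_assignments)
    diversity = len(counts)  # number of distinct FSFs (FSF use)
    if diversity == 0:
        raise ValueError("proteome has no assigned domains")
    # Assumption: reuse = total domain copies / number of distinct FSFs.
    mean_reuse = sum(counts.values()) / diversity
    return math.log10(diversity), math.log10(mean_reuse)


# Hypothetical toy proteomes (not real data).
small = ["c.37.1"] * 4 + ["c.2.1"] * 2 + ["a.4.5"] * 2
large = ["c.37.1"] * 30 + ["c.2.1"] * 20 + ["a.4.5"] * 10 + ["b.40.4"] * 20

small_point = fsf_use_and_reuse(small)
large_point = fsf_use_and_reuse(large)
```

Under these assumptions, a streamlined proteome such as `small` falls closer to the origin of a log-log plot than `large` on both axes, which is the pattern the text describes for thermophilic archaeal species.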

[Figure 7: scatter plot with axes FSF use (log) and FSF reuse (log) for archaeal proteomes, with thermophilic and nonthermophilic species marked. Labeled species: S. marinus, I. hospitalis, H. butylicus, T. pendens, Cand. Korarchaeum, A. pernix, M. kandleri, T. neutrophilus, P. islandicum, and N. maritimus.]

Figure 7: A plot of FSF use (diversity) against FSF reuse (abundance) reveals a linear pattern of proteomic growth in 48 archaeal proteomes. Thermophilic archaeal species occupy positions that are close to the origin of the plot. They also populate the most basal branch positions in ToLs (see discussion in the main text). Both axes are in logarithmic scale.

FSF domain distributions in the genomes of the three domains of life provide further insights into their evolution. Nasir et al. [122, 123] generated Venn diagrams to illustrate FSF sharing patterns in the genomes of Archaea, Bacteria, and Eukarya (Figure 6(a)). These diagrams display the total number of FSFs that are unique to a domain of life (taxonomic groups A, B, and E), shared by only two (AB, BE, and AE), and those that are universal (ABE). About half of the FSFs (786 out of 1733) were present in all three domains of life, 453 of which were present in at least 60% of the organisms that were examined (f > 0.6, where f is the distribution index ranging from 0 to 1). The fact that about 70% of the 672 widely shared FSFs belong to the ABE taxonomic group strongly supports a common evolutionary origin for cells. Evolutionary timelines have confirmed that a large number of these universal FSFs were present in the ancestor of the three domains of life, the urancestor (Figure 1), and were for the most part retained in extant proteomes [48].
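The assignment of a feature to one of the seven taxonomic groups, together with its distribution index f, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the per-domain organism counts are those given for the FSF dataset in the caption of Figure 6, the example feature is invented, and f is assumed to be computed over the organisms of the domains that make up the group.

```python
def taxonomic_group(presence):
    """Map {'A': bool, 'B': bool, 'E': bool} to a label such as 'BE' or 'ABE'."""
    return "".join(domain for domain in "ABE" if presence[domain])


def classify(feature_hits, domain_sizes):
    """Return (group label, distribution index f) for one feature.

    feature_hits: organisms per domain that encode the feature, e.g. {'B': 500}.
    domain_sizes: organisms surveyed per domain.
    Assumption: f is taken over the organisms of the domains in the group.
    """
    presence = {d: feature_hits.get(d, 0) > 0 for d in "ABE"}
    group = taxonomic_group(presence)
    if not group:
        raise ValueError("feature is absent from all three domains")
    hits = sum(feature_hits.get(d, 0) for d in group)
    surveyed = sum(domain_sizes[d] for d in group)
    return group, hits / surveyed


# Organism counts for the FSF census (Figure 6 caption).
DOMAIN_SIZES = {"A": 70, "B": 652, "E": 259}

# Hypothetical feature found in 500 bacterial and 200 eukaryal proteomes.
group, f = classify({"A": 0, "B": 500, "E": 200}, DOMAIN_SIZES)
widely_shared = f > 0.6
```

In this toy case the feature falls in the BE group with f of about 0.77, so it would count among the widely shared BE features.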

Interestingly, the number of BE FSFs was roughly 8.5-fold greater than the number of AE or AB FSFs (324 versus 38 and 38) (Figure 6(a)). The finding that Bacteria and Eukarya encode a significantly large number of shared FSF domains is remarkable and hints at an unprecedentedly strong evolutionary association between these two domains of life. Moreover, a significant number of BE FSFs are widespread in bacterial and eukaryotic proteomes: 56 of the 324 FSFs are shared by more than 60% of the bacterial and eukaryal organisms that were analyzed (Figure 6(a)). This is strong evidence of an ancient vertical evolutionary trace from their mutual ancestor (anticipated in [123]). This trace uniquely supports the archaeal rooting of the ToL. In turn, the canonical and eukaryotic rooting alternatives are highly unlikely, as neither alone can explain the remarkable diversity of the BE taxonomic group (as well as its ancient origin [123]). The remarkable vertical trace of the ABE taxonomic group and the negligible vertical trace of the AB group (only one FSF shared by more than 60% of organisms) also falsify fusion hypotheses responsible for a putative chimeric makeup of Eukarya. In light of these findings, the most parsimonious explanation of comparative genomic data is that Archaea is the first domain of diversified cellular life.

The patterns of FSF sharing are further strengthened by the genomic distributions of terminal-level molecular function GO terms (Figure 6(b)). The three domains of life shared about a quarter (526) of the 1924 GO terms that were surveyed. A total of 232 of these ABE terms were shared by 60% of the organisms analyzed. Again, the fact that about 76% of widely shared GO terms (306) belong to the ABE taxonomic group strongly supports a common vertical trace, while the finding that Bacteria and Eukarya share a significantly large number of GO terms (272) supports again the archaeal rooting of the ToL. An alternative explanation for the very large size of BE could be large-scale metabolism-related gene transfer from the ancestors of mitochondria and plastids to the ancestors of modern eukaryotes [124]. However, we note that the BE group is not restricted to only metabolic functions. It also includes FSFs and GOs involved in intracellular and extracellular processes, regulation, and informational functions [125]. Thus, the very large size of BE is a significant outcome most likely shaped by vertical evolutionary scenarios and cannot solely be explained by parasitic/symbiotic relationships that exist between Bacteria and Eukarya [123]. Moreover, the simplicity of archaeal FSF repertoires is not due to the paucity of available archaeal genomic data. We confirmed that the mean FSF coverage (i.e., the number of proteins annotated to FSFs/GOs out of the total) for Archaea, Bacteria, and Eukarya was largely comparable (e.g., Table S1 in [122]). Finally, as we will describe below, these comparative patterns and tentative conclusions are confirmed by phylogenomic analysis [91, 122, 126]. Congruent phylogenies were obtained with (Figure 4(c)) and without equal and random sampling of taxa. Thus, the relatively low number of archaeal genomes is also not expected to compromise our inferences.
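The proportions quoted for FSFs and GO terms follow directly from the counts reported in the text and in Figure 6. The short check below is only a reader's aid: the dictionary keys are hypothetical names, while the counts themselves are the ones given above.

```python
# Counts as reported in the text (FSFs from Figure 6(a), GO terms from Figure 6(b)).
fsf = {"total": 1733, "ABE": 786, "BE": 324, "AE": 38, "AB": 38,
       "ABE_widely_shared": 453, "all_widely_shared": 672}
go = {"total": 1924, "ABE": 526, "BE": 272,
      "ABE_widely_shared": 232, "all_widely_shared": 306}

ratios = {
    # Fraction of all FSFs that are universal (ABE).
    "fsf_universal_fraction": fsf["ABE"] / fsf["total"],
    # Fraction of widely shared FSFs (f > 0.6) that belong to ABE.
    "fsf_abe_share_of_widely_shared": fsf["ABE_widely_shared"] / fsf["all_widely_shared"],
    # Size of the BE group relative to the AE group (AB has the same size).
    "fsf_be_over_ae": fsf["BE"] / fsf["AE"],
    # Same two fractions for terminal GO terms.
    "go_universal_fraction": go["ABE"] / go["total"],
    "go_abe_share_of_widely_shared": go["ABE_widely_shared"] / go["all_widely_shared"],
}

for name, value in ratios.items():
    print(f"{name}: {value:.3f}")
```

With these counts, the universal fractions come to about 0.45 for FSFs and 0.27 for GO terms, the ABE share of widely shared features to about 0.67 and 0.76, and the BE-to-AE ratio for FSFs to about 8.5.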

We note that Eukarya shares many informational genes with Archaea and many operational genes with Bacteria. While informational genes have been thought more refractory to HGT than noninformational genes, we have confirmed at GO level that this is not the case. A ToL built from GO terms showed that, in fact, noninformational terms were less homoplasious, while statistical enrichment analyses revealed that HGT had little if any functional preference for GO terms across GO hierarchical levels. We recently compared ToLs reconstructed from non-HGT GO terms and ToLs reconstructed from informational GO terms that were extracted from non-HGT GO terms (Kim et al., ms. in review; also read below). In both cases, Archaea appeared as a basal paraphyletic group of the ToLs and the common origin of Bacteria and Eukarya was maintained. Thus, the 272 GO terms shared by Bacteria and Eukarya harbor a strong vertical trace.

Another observation often used as support for Archaea-Eukarya kinship is the discovery of a few eukaryote-specific proteins (e.g., actin, tubulin, H3, H4, ESCRT, ribosomal proteins, and others) in some archaeal species [127] and their complete absence from Bacteria (except for documented tubulin HGT from eukaryotes to the bacterial genus Prosthecobacter [128]). This suggests either that eukaryotes arose from an archaeal lineage [127] or that the ancestor of Eukarya and Archaea was complex and modern Archaea are highly reduced [25]. Indeed, a few eukaryote-specific proteins have now been found in some archaeal species (in some cases just one species). We argue that this poor spread cannot be taken as evidence for the Archaea-Eukarya sister relationship. This needs to be confirmed by robust phylogenetic analysis, which is unfortunately not possible when using protein sequences. Recently, we described a new strategy for inferring vertical and horizontal traces [123]. This method calculates the spread of FSFs and GOs that are shared between two-superkingdom groups (i.e., AB, AE, and BE). Balanced distributions often indicate vertical inheritance, while biased distributions suggest horizontal flux. For example, penicillin binding molecular activity (GO:0008658) was present in 100% of the sampled bacterial proteomes but was only present in 11% of the archaeal species [123]. Thus, the presence of GO:0008658 in Archaea was attributed to HGT gain from Bacteria. Using this simple method, we established that Bacteria and Eukarya were united by a much stronger vertical trace than either was to Archaea. In fact, strong reductive tendencies in the archaeal genomes were recorded [123]. Thus, in our opinion, the presence of eukaryote-specific proteins in only very few archaeal species could in fact be the result of an HGT event that is not detectable by sequence phylogenies.
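The logic of comparing the spread of a shared feature in two superkingdoms can be sketched in a few lines. The example below is not the published method of [123]: the function names are hypothetical, the bias is assumed to be the absolute difference between the two spreads, and the 0.5 cutoff is an arbitrary value chosen only for illustration. The penicillin binding figures (100% of bacterial and 11% of archaeal proteomes) are the ones quoted in the text.

```python
def spread_bias(f_first, f_second):
    """Absolute difference between the spreads (0..1) of one feature in two domains."""
    return abs(f_first - f_second)


def infer_trace(f_first, f_second, threshold=0.5):
    """Label a shared feature as balanced or biased.

    threshold is a hypothetical cutoff used here only for illustration.
    """
    if spread_bias(f_first, f_second) > threshold:
        return "biased (suggests horizontal flux)"
    return "balanced (suggests vertical inheritance)"


# Penicillin binding activity (GO:0008658): Bacteria 100%, Archaea 11% [123].
penicillin_binding = infer_trace(1.00, 0.11)

# Hypothetical feature with similar spread in two domains.
balanced_example = infer_trace(0.80, 0.70)
```

Under these assumptions the penicillin binding activity is flagged as biased, in line with its attribution to HGT gain from Bacteria.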

In turn, many arguments now favor the Bacteria-Eukarya sisterhood, in addition to the structure-based phylogenies and balanced distribution of molecular features. For example, Bacteria and Eukarya have similar lipid membranes, which can be used as an argument for their evolutionary kinship. Moreover, Archaea fundamentally differ from Eukarya in terms of their virosphere. Viruses infecting Archaea and Eukarya are drastically different, as recently discussed by Forterre [25].

The genomic distribution of RNA families was taken from Hoeppner et al. [129] and also shows a vertical evolutionary trace in five crucial Rfam clans that are universal, including tRNA, 5S rRNA, subunit rRNA, and RNase P RNA. These universal RNA groups are likely minimally affected by HGT. However, 99% of Rfam clans and families are specific to domains of life, and only 11 of the 1148 groups were shared at f > 0.6 levels. This clearly shows that the functional complexity of RNA materialized very late during organismal diversification and that it is not a good genomic feature for exploring the rooting of the ToL. Only five RNA families are shared between two domains of life, and only one of these does so at f > 0.6: the G12 pseudoknot of the 23S rRNA, which is present in bacterial and eukaryotic organellar rRNA. While the large subunit rRNA scaffold supports the G12 pseudoknot with its vertical trace, all five interdomain RNA families can be explained most parsimoniously by HGT. Thus, only a handful of ancient and universal RNA species can be used to root the ToL.

We end by noting that the Venn diagrams consistently show that Archaea harbors the smallest number of unique (A) and shared (AB and AE) FSFs, GO terms, or RNA families. This trend supports an early divergence of this domain of life from the urancestor and the possibility that such divergence was shaped by evolutionary reductive events. We reason that such losses would be more parsimonious early on in evolution than in later periods. This is because genes often increase their abundance in evolutionary time (by gene duplications, HGT, and other processes). Thus, it is reasonable to think that loss of an ancient gene would be more feasible very early in evolution relative to losing it very late.

6.2. Phylogenomic Evidence from the Sequence and Structure of RNA. Phylogenetic analyses of the few RNA families that are universal and display an important vertical evolutionary trace (Figure 6(c)) provide compelling evidence in favor of an early evolutionary appearance of Archaea [105, 107, 108, 130–135]. Here we briefly summarize evidence from tRNA, 5S rRNA, and RNase P RNA. Unpublished analyses of rRNA sequence and structure using advanced phylogenetic methods also show that Archaea was the first domain of life.

tRNA molecules are generally short (~73–95 nucleotides in length) and highly conserved. Consequently, their sequence generally contains a limited amount of phylogenetic information. These limitations have been overcome by analyzing entire tRNomes [136], which extend the length of an organismal set of tRNAs to over 2000 bases. Xue et al. [130, 131] analyzed the genetic distances between tRNA sequences as averages between alloacceptor tRNAs from diverse groups of tRNomes using multiple molecular base substitution models. The distances were mapped onto an unrooted phylogeny of tRNA molecules (Figure 8(a)). The “arrow of time” assumption in these studies is that ancient tRNA paralogs closely resemble each other when lineages originate close to the time of the gene duplication. Remarkably, the results revealed a paraphyletic rooting of the ToL in Archaea. The root was specifically located close to the hyperthermophilic methanogen Methanopyrus kandleri (Figure 8(a)). The hypothesis of this specific rooting scenario has been supported by several other studies [134, 137], including a study of genetic distances between paralogous pairs of aminoacyl-tRNA synthetase (aaRS) proteins [131]. A remarkable match between distance scores of tRNA and pairs of aaRS paralogs (Figure 8(b)) not only confirms the early appearance of Archaea but also suggests a coevolutionary trace associated with molecular interactions that are responsible for the genetic code. In fact, a recent exhaustive phylogenomic analysis of tRNA and aaRS coevolution explicitly reveals the origins and evolution of the genetic code and the underlying molecular basis of genetics [59]. Di Giulio [132, 133] has also proposed an archaeal rooting of the tree of life, specifically in the lineage leading to the phylum Nanoarchaeota. This rooting is based on unique and ancestral genomic traits of Nanoarchaeum equitans: split genes separately coding for the 5′ and 3′ halves of tRNA and the absence of operons, which are considered molecular fossils [132]. However, this claim needs additional support, as contrasting evidence now recognizes N. equitans as a highly derived archaeal species [138].
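The idea of an average alloacceptor distance can be illustrated with a small calculation. The sketch below is not the procedure of Xue et al. [130, 131]: it assumes one pre-aligned, equal-length sequence per amino acid and uses an uncorrected mismatch fraction (p-distance) in place of the substitution models used in those studies. The function names and the three toy sequences are hypothetical.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of mismatched positions between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def d_allo(trnas_by_amino_acid):
    """Mean p-distance over all pairs of tRNAs charged with different amino acids."""
    pairs = combinations(sorted(trnas_by_amino_acid), 2)
    distances = [
        p_distance(trnas_by_amino_acid[x], trnas_by_amino_acid[y]) for x, y in pairs
    ]
    return sum(distances) / len(distances)


# With 20 amino acids there are 190 alloacceptor pairs.
n_pairs_for_20 = len(list(combinations(range(20), 2)))

# Hypothetical toy fragments, one per amino acid (not real tRNA sequences).
toy_trnome = {"Ala": "GGGGA", "Gly": "GGGCA", "Ser": "GCGCA"}
toy_d_allo = d_allo(toy_trnome)
```

A lower value of such an average indicates that the tRNAs of a genome still resemble each other closely, which under the "arrow of time" assumption in the text points to a more ancient lineage.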

In addition to sequences, structural features of tRNA molecules also support the archaeal rooting of the ToL.

[Figure 8, panels: (a) unrooted tree of tRNA sequences from Archaea, Bacteria, and Eukarya, with D_allo traced on a color scale from 0.35 to 0.92; (b) scatter plot of aminoacyl-tRNA synthetase history (Q_ars) against tRNA alloacceptor history (D_allo) for Archaea, Bacteria, and Eukarya, with regions marked "Ancient origins" and "Recent origins".]

Figure 8: Ancient phylogenetic signal in the sequence of tRNA and their associated aminoacyl-tRNA synthetase (aaRS) enzymes. (a) Unrooted ToL derived from tRNA sequences with their alloacceptor D_allo distances traced in thermal scale (from [130]). D_allo is the average of pairwise distances for 190 pairs of tRNA alloacceptors. These distances measure pairwise sequence mismatches of tRNAs for every genome, and their values increase for faster evolving sequences of species of more recent origin. (b) Coevolution of aaRSs and their corresponding tRNA. Genetic distances between the top 10 potentially paralogous aaRS pairs, estimated using BLASTP, define a measure (Q_ars) of how closely the proteins resemble each other in genomes (from [131]). Larger Q_ars scores imply more ancestral and slowly evolving protein pairs. The plot of Q_ars scores against D_allo distances reveals a hidden correlation between the evolution of tRNA and aaRSs and the early origin of Archaea (larger Q_ars and lower D_allo distances).

The application of RNA structural evidence in phylogenetic studies [50, 139–141] has multiple advantages over sequence data when studying ancient events, especially because RNA structures are far more conserved than sequences. This has been demonstrated in a phylogenetic approach that uses RNA structural information to reconstruct the evolutionary history of macromolecules such as rRNA [29, 50], tRNA [107, 108], 5S rRNA [105], RNase P RNA [135], and SINE RNA [142]. Geometrical and statistical properties of structure (e.g., stems or loops commonly found in the secondary structures of RNA molecules) are treated as linearly ordered multistate phylogenetic characters. In order to build rooted trees, an evolutionary tendency toward conformational order is used to polarize change in character state transformations. This defines a hypothetical ancestor with which to root the ingroup using the Lundberg method. Reconstructed phylogenies produce trees of molecules and ToLs (e.g., Figure 4(a)) or trees of substructures that describe the gradual evolutionary accretion of structural components into molecules (Figure 9). For example, phylogenetic trees of tRNA substructures define explicit models of molecular history and show that tRNA originated in the acceptor stem of the molecule [108]. Remarkably, trees reconstructed from tRNA drawn from individual domains of life demonstrate that the sequence of accretion events occurred differently in Archaea than in Bacteria and Eukarya, suggesting a sister group evolutionary relationship between the bacterial and eukaryotic domains (Figure 9(a)). A similar result obtained from trees of 5S rRNA substructures revealed different molecular accretion sequences of archaeal molecules when these were compared to bacterial and eukaryal counterparts [105], confirming again, in a completely different molecular system, the history of the domains of life.
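Treating a structural feature as a linearly ordered multistate character means that a change between states i and j costs |i − j| steps. The sketch below counts the minimum number of steps for one such character on a fixed rooted binary tree using the standard interval optimization for ordered (Wagner) characters. It is only an illustration of how ordered characters are scored: it does not perform the character polarization or the Lundberg rooting described above, and the tree, leaf names, and state values are hypothetical.

```python
def wagner_steps(tree, states):
    """Return ((low, high), steps) for an ordered character on a rooted binary tree.

    tree: a leaf name (str) or a 2-tuple of subtrees.
    states: {leaf name: integer character state}.
    The interval is the set of optimal state assignments at the subtree root.
    """
    if isinstance(tree, str):
        state = states[tree]
        return (state, state), 0
    left, right = tree
    (low1, high1), steps1 = wagner_steps(left, states)
    (low2, high2), steps2 = wagner_steps(right, states)
    low, high = max(low1, low2), min(high1, high2)
    if low <= high:
        # Child intervals overlap: keep the intersection, no extra steps.
        return (low, high), steps1 + steps2
    # Child intervals are disjoint: the gap between them is the new
    # interval, and its width is the number of added steps.
    return (high, low), steps1 + steps2 + (low - high)


# Hypothetical example: four molecules scored for one ordered character
# (e.g., a binned stem length), on an assumed tree.
toy_tree = (("mol_a", "mol_b"), ("mol_c", "mol_d"))
toy_states = {"mol_a": 0, "mol_b": 2, "mol_c": 5, "mol_d": 5}

root_interval, total_steps = wagner_steps(toy_tree, toy_states)
```

In this toy case the character requires five steps, two within the first pair of molecules and three between the two pairs.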

An analysis of the structure of RNase P RNA also provides similar conclusions [135]. While a ToL reconstructed from molecular structure placed type A archaeal molecules at its base (a topology that resembles the ToL of 5S rRNA), a tree of RNase P RNA substructures uncovered the history of molecular accretion of the RNA component of the ancient endonuclease and revealed a remarkable reductive evolutionary trend (Figure 9(b)). Molecules originated in stem P12 and were immediately accessorized with the catalytic P1–P4 pseudoknotted core structure that interacts with RNase P proteins of the endonuclease complex and ancient segments of tRNA. Soon after this important accretion stage, the evolving molecule loses its first stem in Archaea (stem P8), several accretion steps earlier than the first loss of a stem in Eukarya or the first appearance of a Bacteria-specific stem. These phylogenetic statements provide additional strong support to the early origin of the archaeal superkingdom prior to the divergence of the shared common ancestor of Bacteria and Eukarya. As we will discuss below, the early loss of a structure in the molecular accretion process of a central and ancient RNA family is significant. It suggests that the emerging archaeal lineages were subjected to strong reductive evolutionary pressures during the early evolution of a very ancient RNA molecule.

[Figure 9, panels: (a) rooted trees of tRNA arm substructures (Acc, AC, DHU, TΨC, Var, and 3′ end) for Archaea, Bacteria, and Eukarya, with age (nd) running from 0 (ancestral) to 1 (derived); (b) tree of RNase P RNA substructures (stems P1 to P20) with age (nd) running from 0 (ancestral) to 1 (derived), bootstrap support values on branches, and annotations marking the universal pseudoknot, the ancient universal stems, and the first domain-specific gains and losses of stems.]

Figure 9: The history of accretion of tRNA and RNase P RNA substructures reveals the early evolutionary appearance of Archaea. (a) Rooted trees of tRNA arm substructures reveal the early appearance of the acceptor arm (Acc) followed by the anticodon arm (AC) in Archaea or the pseudouridine (TΨC) arm in both Bacteria and Eukarya (from [108]). The result confirms the sister group relationship of Bacteria and Eukarya. (b) Trees of molecular substructures of RNase P RNAs were reconstructed from characters describing the geometry of their structures (from [135]). Branches and corresponding substructures in a 3D atomic model are colored according to the age of each substructure (nd, node distance). Note the early loss of stem P8 in Archaea immediately after the evolutionary assembly of the universal functional core of the molecule.

6.3. Phylogenomic Evidence from Protein Domain Structure. G. Caetano-Anollés and D. Caetano-Anollés [41] were the first to utilize protein domain structures as taxa and to reconstruct trees of domains (ToDs) describing their evolution. Figure 10 shows a rooted ToD built from a census of FSF structures in 981 genomes (data taken from [122, 123]). These trees are unique in that their terminal leaves represent a finite set of component parts [87]. These parts describe at global level the structural diversity of the protein world. When building trees of FSFs, the age of each FSF domain structure can be calculated from the ToDs by simply calculating a distance in nodes between the root of the tree and its corresponding terminal leaf. This is possible because ToDs exhibit highly unbalanced (pectinate) topologies that are the result of semipunctuated processes of domain appearance and accumulation. This node distance (nd), rescaled from 0 to 1, provides a relative timescale to study the order of FSF appearance in evolutionary history [126, 143]. Wang et al. [144] showed that nd correlates linearly with geological time and defines a global molecular clock of protein folds. Thus, nd can be used as a reliable proxy for time. Plotting the age of FSFs in each of the seven taxonomic groups confirmed evolutionary statements we had previously deduced from the Venn diagrams of Figure 6. The ABE taxonomic group included the majority of ancient and widely distributed FSFs. This is expected. In the presence of a strong vertical trace, molecular diversity must delimit a nested taxonomic hierarchy. The ABE group was followed by the evolutionary appearance of the BE group, which preceded the first domain-specific structures, which were Bacteria-specific (B). Remarkably, Archaea-specific (A) and Eukarya-specific (E) structures appeared concurrently and relatively late. These general trends, captured in the box plots of Figure 10, have been recovered repeatedly when studying domain structures at various levels of structural complexity, from folds to fold families [91, 126], when using CATH or SCOP structural definitions [122, 145], or when exploring the evolution of terminal GO terms.
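The node distance described above can be sketched for a small rooted tree. The example below is a minimal illustration, not the authors' software: the tree is a hypothetical pectinate topology with placeholder leaf names, and nd is assumed to be the number of internal nodes crossed between the root and a leaf (not counting the root itself), divided by the largest such count in the tree.

```python
def nodes_from_root(tree, crossed=0, out=None):
    """Record, for each leaf, how many internal nodes lie below the root on its path.

    tree: a leaf name (str) or a tuple of subtrees.
    """
    if out is None:
        out = {}
    for child in tree:
        if isinstance(child, str):
            out[child] = crossed
        else:
            # Entering another internal node on the way to deeper leaves.
            nodes_from_root(child, crossed + 1, out)
    return out


def node_distance(tree):
    """Rescale node counts to the 0..1 range (nd = 0 oldest, nd = 1 youngest)."""
    counts = nodes_from_root(tree)
    deepest = max(counts.values())
    if deepest == 0:
        return {leaf: 0.0 for leaf in counts}
    return {leaf: count / deepest for leaf, count in counts.items()}


# Hypothetical pectinate tree of four FSFs (placeholder names, not real data).
toy_tod = ("fsf_1", ("fsf_2", ("fsf_3", "fsf_4")))
nd = node_distance(toy_tod)
```

In this toy tree the most basal leaf receives nd = 0, the next receives nd = 0.5, and the two most derived leaves receive nd = 1, which mirrors how a pectinate topology yields an ordered timeline of appearance.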

[Figure 10: rooted tree of domains (taxa = 1733 FSF domains; characters = 981 genomes; tree length = 458,226 steps; CI = 0.049, RI = 0.796, and g1 = −0.08) with box plots of the age of FSFs (nd, from 0.0 to 1.0) for the taxonomic groups ABE, BE, AB, AE, B, A, and E. Leaves are colored by distribution index f: f = 1.0 (17 FSFs), 0.9 to 1.0 (228), 0.5 to 0.9 (336), 0.1 to 0.5 (623), and 0.0 to 0.1 (529). Basal FSFs labeled in the figure: c.55.1, c.37.1, b.40.4, c.2.1, a.4.5, and c.66.1.]

Figure 10: Phylogenomic tree of domains (ToD) describing the evolution of 1733 FSF domain structures. Taxa (FSFs) were colored according to their distribution (f) in the 981 genomes that were surveyed and used as characters to reconstruct the phylogenomic tree (data taken from [122, 123]). The most basal FSFs are labeled with SCOP alphanumeric identifiers (e.g., c.37.1 is the P-loop containing nucleoside triphosphate hydrolase FSF). Boxplots display the age (nd value) distribution of FSFs for the seven possible taxonomic groups. nd values were calculated directly from the tree [122] and define a timeline of FSF innovation from the origin of proteins (nd = 0) to the present (nd = 1). The group of FSFs that are shared by the three domains of life (ABE) is the most ancient taxonomic group, which spans the entire time axis, and their FSFs are widely distributed in genomes. The appearance of the BE group coincides with the first reductive loss of an FSF in Archaea. FSF structures specific to domains of life appear much later in evolution.

While the early rise of BE FSFs supports the early divergence of Archaea from the urancestor, the very significant trend of gradual loss of structures occurring in the lineages of the archaeal domain and the very late appearance of Archaea-specific structures (e.g., [126]) demand explanation. Since ancient BE FSFs are widely distributed in proteomes (Figure 10), they cannot arise from separate gains of FSFs in Bacteria and Eukarya or by processes of horizontal spread of structures. This was already evident from the Venn diagrams of Figure 6. Moreover, the BE sisterhood to the exclusion of Archaea was further supported by the inspection of FSFs involved in lipid synthesis and transport (Table 1). Membrane lipids are very relevant to the origins of diversified life ([146] and references therein). Bacteria and Eukarya encode similar lipid membranes, while archaeal membranes have a different lipid composition (isoprenoid ethers). To check if lipid synthesis was another BE synapomorphy, we identified 17 FSFs that were involved in lipid metabolism and transport (Table 1). Remarkably, the largest share of these FSFs (7 out of 17) was unique to the BE group. In comparison, none were present in either the AE or AB groups. The ABE group included five universal FSFs, while one was unique to Bacteria and four were Eukarya-specific. In turn, no FSF was unique to Archaea. The BE FSFs cannot be explained by modern effects impinging on variations in proteomic accumulation in FSFs or by processes of domain rearrangement, since these appear in the protein world quite late in evolution [92]. The only and most parsimonious explanation of the patterns of FSF distribution that unfold in the ToD is the very early (and protracted) rise of the archaeal domain by processes of reductive evolution, possibly triggered by the adaptation of urancestral lineages to harsh environments and survival modes. Under extremophilic conditions, typical of hyperthermophilic environments, considerable investments of matter-energy and information must be made for protein persistence [120]. This puts limits on viable protein structures [147]. Extremophilic environments will thus poise the maintenance of a limited set of FSFs for persistence of emergent diversified lineages. This would induce a primordial episode of reductive evolution in the growing FSF repertoire, explaining why hyperthermophilic and thermophilic archaeal species hold the most reduced proteomes (Figure 7). It would also explain the biases that exist in FSFs, GO terms, and RNA families (Figure 6) and the placement of hyperthermophilic and thermophilic archaeal species at the base of ToLs. Since Archaea populate the oceans and sometimes rival Bacteria in number in those environments, we further interpret the late appearance of Archaea-specific FSFs as the result of late colonization of these mild environments by both ancient archaeons and emerging eukaryotes. This relaxes primordial extremophilic pressures on protein structures and enables the late archaeal exploration of structural flexibility and functional novelty.

Table 1: List of FSFs involved in lipid metabolism and transport along with taxonomic distribution (data taken from [122, 123]).

Group   SCOP Id   FSF Id    FSF description
ABE     89392     b.125.1   Prokaryotic lipoproteins and lipoprotein localization factors
ABE     53092     c.55.2    Creatinase/prolidase N-terminal domain
ABE     49723     b.12.1    Lipase/lipooxygenase domain (PLAT/LH2 domain)
ABE     54637     d.38.1    Thioesterase/thiol ester dehydrase-isomerase
ABE     63825     b.68.5    YWTD domain
BE      47027     a.11.1    Acyl-CoA binding protein
BE      48431     a.118.4   Lipovitellin-phosvitin complex, superhelical domain
BE      55048     d.58.23   Probable ACP-binding domain of malonyl-CoA ACP transacylase
BE      56968     f.7.1     Lipovitellin-phosvitin complex, beta-sheet shell regions
BE      58113     h.5.1     Apolipoprotein A-I
BE      47162     a.24.1    Apolipoprotein
BE      56931     f.4.2     Outer membrane phospholipase A (OMPLA)
B       82220     b.120.1   Tp47 lipoprotein, N-terminal domain
E       47699     a.52.1    Bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin
E       82936     h.6.1     Apolipoprotein A-II
E       57190     g.3.10    Colipase-like
E       49594     b.7.4     Rab geranylgeranyltransferase alpha-subunit, insert domain

6.4. Full Circle: Evidence from Trees of Proteomes and Functionomes and a Tree Derived from the Distribution of Viral Replicons in Superkingdoms. While we distrust trees of systems, especially ToLs built from sequences, the use of molecular structure at high levels of structural abstraction has the potential to mitigate some limitations of sequence analysis [93]. For example, rooted ToLs built from abundance counts of domain structures and terminal GO terms in the genomes of free-living organisms describe the evolution of proteomes (e.g., [91]). All ToL reconstructions of these kinds approximate the physiology of living organisms, dissect the three primary domains of life, and reveal the early paraphyletic origin of extremophilic archaeal lineages followed by the late appearances of monophyletic Bacteria and Eukarya. These patterns have been reliably recovered with datasets of varying sizes, irrespective of the structural classification scheme [91, 122, 126, 145]. Even a tree reconstructed from the distribution of 2662 viral replicons in superkingdoms, from an exhaustive comparative genomic analysis of viral genomes, showed the basal placement of Archaea and the sister taxa relationship between Bacteria and Eukarya (Figure 11).

Figure 11: One of three optimal phylogenetic tree reconstructions with identical topologies recovered from an exhaustive maximum parsimony search (58 steps; CI = 1, RI = 1, HI = 0, RC = 1, g1 = −0.707) of the abundance of viral replicon types of dsDNA, ssDNA, dsRNA, ssRNA(+), ssRNA(−), and retrotranscribing viruses. Abundance was scored on a 0–20 scale and ranged from 0 to 759 viral replicons. Vectors of abundance reconstructions in internal nodes are given as percentage of total abundance of replicons in superkingdoms. Bootstrap support values are shown below nodes.

6.5. Additional Evidence from Comparative Genomics. The uneven distribution of protein domain structures in the world of proteomes (Figure 6) is preserved as we climb in the structural hierarchy. This was recently made evident when studying the evolution of CATH domains [145]. The Venn diagrams of Figure 12 show how domain structures in all taxonomic groups decrease in numbers with increases in evolutionary conservation. At the highest CATH architectural level there were 32 universal architectures but no domain-specific architectures. The four architectures shared by Bacteria and Eukarya were present in at least 60% of the proteomes that were surveyed. The other two interdomain architectures were topological designs of considerable complexity that were poorly shared between proteomes and were evolutionarily derived. They were the clam architectures lost in Eukarya and the box architectures of nucleotide excision repair shared by Archaea and Eukarya. The most parsimonious corollary of these distribution patterns is that the BE taxonomical group must arise by loss of structures in Archaea. Indeed, ToDs describing the evolution of CATH domain structures again confirm the early appearance of BE structures and consequently their loss in Archaea.
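The tally of architectures described above (32 universal, four shared by Bacteria and Eukarya, and two other interdomain architectures, 38 in total) can be reproduced from presence/absence data. The Python sketch below is only an illustration of that bookkeeping: the identifiers are hypothetical placeholders, and the assignment of the clam architecture to the AB group is our reading of "lost in Eukarya".

```python
from collections import Counter

SUPERKINGDOMS = ("A", "B", "E")  # Archaea, Bacteria, Eukarya


def taxonomic_group(present_in):
    """Return the Venn label (e.g. 'BE') for the superkingdoms holding a structure."""
    present = set(present_in)
    if not present or present - set(SUPERKINGDOMS):
        raise ValueError(f"unexpected superkingdom set: {present_in!r}")
    return "".join(s for s in SUPERKINGDOMS if s in present)


def venn_counts(occurrence):
    """Count structures per Venn group from a {structure: superkingdoms} mapping."""
    return Counter(taxonomic_group(groups) for groups in occurrence.values())


# Hypothetical identifiers; only the number of entries per group follows the text.
occurrence = {}
occurrence.update({f"universal_{i}": {"A", "B", "E"} for i in range(32)})
occurrence.update({f"bacteria_eukarya_{i}": {"B", "E"} for i in range(4)})
occurrence["clam"] = {"A", "B"}  # assumed AB: present in Archaea and Bacteria
occurrence["box"] = {"A", "E"}   # shared by Archaea and Eukarya

counts = venn_counts(occurrence)
for group in ("ABE", "BE", "AB", "AE", "A", "B", "E"):
    print(group, counts[group])
```

Groups with no members (A, B, and E alone) report a count of zero, which corresponds to the absence of domain-specific architectures noted in the text.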

7. Paraphyletic Origins: Grades and Clades in Archaeal History

Saying that “a ToL is rooted in a domain of life” is an incorrect statement that comes from phylogenetic methodology (the use of outgroups) and the tendency to look at the past with modern eyes. Clades in ToLs have been rooted relative to each other by generating unrooted trees and by defining extant organisms as outgroup taxa. The ToL, however, must be considered rooted in the urancestor of cellular life (Figure 1). This planted edge that connects to the ingroup of extant organisms represents a cellular state in which productive diversification (in the sense of successful lineages) was absent. The primordial urancestral edge leads to a “phase transition,” the last universal cellular ancestor (LUCA), of which little is known. The physiology of the urancestor cannot be considered linked to that of any extant organism, even if it shared a common molecular core with all of them. The urancestor was not an archaeon, a bacterium, or a eukaryote [148]. It was not necessarily thermophilic. Perhaps it was a communal entity or a megametaorganism in the sense of a modern syncytium (the result of multiple cell fusions) and a modern coenocyte (the result of multiple cell divisions). The organismal boundaries were likely present, judging by the number of widely distributed protein domains that associate with membranes and appear at the base of our ToDs and by the universal existence of acidocalcisome organelles [149]. However, the molecular makeup of the urancestral cells was most likely fluid and quasistatistical, with the repertoire distributed unequally in the urancestral populations of communal parts, though within confines delimited by persistence. This urancestral population is therefore consistent with the idea of a primordial stem line proposed by Kandler and Woese [150, 151]. However, it was relatively richer in molecular structures and functions, as opposed to the simple cellular systems hypothesized by Woese. This richness is confirmed by modern analyses of proteomes and functionomes that reveal a vast number of universally shared protein domains and GOs among the three superkingdoms. While each syncytial/coenocyte element of the megaorganism exchanged component parts in search of cellular stability and persistence, the process could not be equated with modern HGT. The exchanging community of primordial cells was not cohesive enough to make the horizontal exchange meaningful. Macromolecules most likely established loose and diverse associations with each other and with smaller molecules, limited by the short average life of their unevolved structural conformations. With time, molecules with better-optimized properties engaged in more durable interactions, stabilizing the emergent cells and providing increased cellular cohesiveness. This poised the urancestral community towards a phase transition (a crystallization [148]), a point in which cellular groups had distinct properties and could be individuated. We believe this was the time of the origin of the archaeal lineage, 2.9 billion years ago [48].

At the base of the ToLs that were reconstructed from genomic data, basal archaeal taxa arise as paraphyletic lineages (Figures 4(a) and 4(c)). These lineages likely arose from subgroups of the urancestral population that pervasively lost crucial domain structures and molecular functions. This represents an evolutionary grade under the scenario described above. The emerging lineages shared with the urancestral community a unifying condition that was related to archaic biochemistry. In other words, the urancestral and emerging archaeal lineages expressed fundamental structural and functional equivalences in terms of their repertoires but revealed, in each emerging and durable paraphyletic lineage, a handful of distinct, newly developed traits. These traits could be global, such as increased thermostability of some crucial members of the protein repertoire or change in the membrane makeup, or local, such as the selective loss of crucial structures and functions. Figure 13 uses the tree paradigm to portray the structural and functional equivalences of the urancestral and emerging archaeal lineages and the slow progression from grades to clades.
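The distinction between paraphyletic grades and monophyletic clades can be made operational on a rooted tree. The sketch below is a minimal illustration, not the procedure used to build the ToLs: it labels a group of taxa as monophyletic when it matches a clade exactly and as paraphyletic when it equals a clade minus one nested clade. The toy tree, the taxon labels, and the function names are hypothetical.

```python
def leaves(node):
    """Leaf set of a rooted tree given as nested tuples with string leaves."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def clades(node):
    """Yield the leaf set of every clade (subtree), single leaves included."""
    yield leaves(node)
    if not isinstance(node, str):
        for child in node:
            yield from clades(child)


def classify(tree, group):
    """Classify a set of taxa as 'monophyletic', 'paraphyletic', or 'other'."""
    group = frozenset(group)
    all_clades = set(clades(tree))
    if not group or not group <= leaves(tree):
        raise ValueError("group must be a non-empty subset of the tree's leaves")
    if group in all_clades:
        return "monophyletic"
    # The smallest clade containing the group is the leaf set under its
    # most recent common ancestor.
    mrca_clade = min((c for c in all_clades if group <= c), key=len)
    # Operational test: the excluded taxa form a single nested clade.
    if (mrca_clade - group) in all_clades:
        return "paraphyletic"
    return "other"


# Hypothetical rooted tree: archaeal lineages branch off successively at the
# base, and Bacteria and Eukarya form a derived sister pair.
toy_tol = ("Arch1", ("Arch2", ("Arch3", ("Bact", "Euk"))))

print(classify(toy_tol, {"Arch1", "Arch2", "Arch3"}))
print(classify(toy_tol, {"Bact", "Euk"}))
```

In this toy topology the three archaeal leaves share their most recent common ancestor with Bacteria and Eukarya, so they form a grade rather than a clade, which mirrors the basal paraphyletic pattern described in the text.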

Figure 12: Venn diagrams displaying the distributions of 2221 homologous superfamilies, 1152 topologies, and 38 architectures of CATH domains in the proteomes of 492 fully sequenced genomes (from [145]). The three panels (Superfamilies, Topologies, and Architectures) are arranged by increasing evolutionary conservation. All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

Figure 13: From grades to clades. The cartoon describes a possible progression of modes of organismal diversification during the rise of primordial archaeal lineages. The width of emerging lineages is proportional to uniquely identifying features of physiological and molecular complexity. Panel labels: Grades, Clades; time axis from Time 0 to Time 1.

8. Through the Wormhole: The Makeup of the “Megaorganism” and the Emerging Archaeal Lineages

Character state reconstructions of proteome repertoires derived from ToLs, coupled to the timelines of ToDs, provide an effective way to define the ancestral protein domain complement of the urancestor [48] and, consequently, the likely makeup of the emerging archaeal lineages. The urancestral proteome possessed a lower bound of ∼70 FSF domain structures, 75% of which were composed of α/β and α+β proteins. About 50% of FSFs were part of metabolic enzymes, including a rich toolkit of transferases and enzymes of nucleotide metabolism. The rest of the domains were involved in functions related to information (translation, replication, and repair), intracellular processes (transport, protein modification, and proteolytic activities), regulation (kinases/phosphatases and DNA binding functions), and small molecule binding. The urancestor had a limited repertoire of aaRSs and translation factors. It contained a primordial ribosome with a limited core of universal ribosomal proteins. It had numerous membrane proteins necessary for transport, including a relatively advanced ATP synthetase complex, and structures necessary for cellular organization (filaments and primordial cytoskeletal structures). The cells lacked enzymes for deoxyribonucleotide production, so it is likely that the cellular urancestor itself did not harbor a DNA genome. The cells lacked functions related to extracellular processes (cell adhesion, immune response, and toxins/defense) and cellular motility, suggesting an ancient living world without competitive strategies of survival.

9. Conclusions

The rooting of the ToL has always been controversial in evolutionary biology [26, 152, 153]. While it is popularly accepted that the ToL based on sequence phylogenies is rooted in the akaryotes and that Archaea and Eukarya are sister groups to each other, only two of the three main steps of phylogenetic analysis [104] have been partially fulfilled with sequences. These two steps are the selection of an appropriate statistical or nonstatistical evolutionary model of character change and the choice of an optimization method for phylogenetic tree reconstruction. However, no adequate method exists for character polarization that identifies ancestral and derived character states in sequences. In the absence of robust polarization methodology, any statement about the rooting of the ToL should be considered suspect or subject to apriorism. Here we show that information derived from a genomic structural and functional census of millions of encoded proteins and RNAs, coupled with process models that comply with Weston’s generality criterion, provides the means to dissect the origins of diversified life. The generality criterion is fulfilled in these studies by focusing on the accumulation of modules, such as protein domain structures, elements of RNA substructures, or ontogenetic definitions of molecular function. In general, these features are the subject of accretion processes that comply with additive phylogenetic change within the nested taxonomic hierarchy and result in changes of abundance. These processes include those responsible for the growth of molecules (e.g., multidomain proteins), molecular ensembles (e.g., the ribosome), and molecular repertoires (e.g., proteomes). The new methods unfold a consistent evolutionary scenario in which the origin of diversified life traces back to the early history of Archaea. Remarkably, the archaic origin of this microbial urkingdom now does justice to its name.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgments

The authors would like to thank members and friends of the GCA laboratory for fruitful discussions. This research has been supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791) and the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625) to Gustavo Caetano-Anollés and from the KRIBB Research Initiative Program and the Next-Generation BioGreen 21 Program, Rural Development Administration (PJ0090192013), to Kyung Mo Kim.

References

[1] C. R. Woese, “A new biology for a new century,” Microbiology and Molecular Biology Reviews, vol. 68, no. 2, pp. 173–186, 2004.

[2] T. Grant, “Testing methods: the evaluation of discovery operations in evolutionary biology,” Cladistics, vol. 18, no. 1, pp. 94–111, 2002.

[3] E. O. Wiley, “Karl R. Popper, systematics, and classification: a reply to Walter Bock and other evolutionary taxonomists,” Systematic Zoology, vol. 24, no. 2, pp. 233–243, 1975.

[4] J. S. Farris, “Parsimony and explanatory power,” Cladistics, vol. 24, no. 5, pp. 825–847, 2008.

[5] G. Caetano-Anollés, K. M. Kim, and D. Caetano-Anollés, “The phylogenomic roots of modern biochemistry: origins of proteins, cofactors and protein biosynthesis,” Journal of Molecular Evolution, vol. 74, no. 1-2, pp. 1–34, 2012.

[6] W. Fontana, “Modeling “evo-devo” with RNA,” BioEssays, vol. 24, no. 12, pp. 1164–1177, 2002.

[7] W. Hennig, Phylogenetic Systematics, University of Illinois Press, Urbana, Ill, USA, 1999.

[8] D. Fraix-Burnet, T. Chattopadhyay, A. K. Chattopadhyay, E. Davoust, and M. Thuillard, “A six-parameter space to describe galaxy diversification,” Astronomy and Astrophysics, vol. 545, article A80, 24 pages, 2012.

[9] E. Sober, “The contest between parsimony and likelihood,” Systematic Biology, vol. 53, no. 4, pp. 644–653, 2004.

[10] E. K. Lienau and R. DeSalle, “Is the microbial tree of life verificationist?” Cladistics, vol. 26, no. 2, pp. 195–201, 2010.

[11] J. S. Huxley, “Evolutionary processes and taxonomy with special reference to grades,” Uppsala Universitets Arsskrift, vol. 6, pp. 21–39, 1958.

[12] C. R. Woese and G. E. Fox, “Phylogenetic structure of the prokaryotic domain: the primary kingdoms,” Proceedings of the National Academy of Sciences of the United States of America, vol. 74, no. 11, pp. 5088–5090, 1977.

[13] P. Forterre, “The universal tree of life and the last universal cellular ancestor: revolution and counterrevolutions,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 43–62, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[14] J. P. Gogarten, H. Kibak, P. Dittrich et al., “Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 17, pp. 6661–6665, 1989.

[15] N. Iwabe, K. Kuma, M. Hasegawa, S. Osawa, and T. Miyata, “Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 23, pp. 9355–9359, 1989.

[16] J. P. Gogarten and L. Olendzenski, “Orthologs, paralogs and genome comparisons,” Current Opinion in Genetics and Development, vol. 9, no. 6, pp. 630–636, 1999.

[17] C. R. Woese, O. Kandler, and M. L. Wheelis, “Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya,” Proceedings of the National Academy of Sciences of the United States of America, vol. 87, no. 12, pp. 4576–4579, 1990.

[18] J. Martijn and T. J. Ettema, “From archaeon to eukaryote: the evolutionary dark ages of the eukaryotic cell,” Biochemical Society Transactions, vol. 41, no. 1, pp. 451–457, 2013.

[19] C. J. Cox, P. G. Foster, R. P. Hirt, S. R. Harris, and T. M. Embley, “The archaebacterial origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 51, pp. 20356–20361, 2008.

[20] P. G. Foster, C. J. Cox, and T. M. Embley, “The primary divisions of life: a phylogenomic approach employing composition-heterogeneous methods,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2197–2207, 2009.

[21] T. A. Williams, P. G. Foster, T. M. Nye, C. J. Cox, and T. M. Embley, “A congruent phylogenomic signal places eukaryotes within the Archaea,” Proceedings of the Royal Society B: Biological Sciences, vol. 279, no. 1749, pp. 4870–4879, 2012.

[22] S. Gribaldo, A. M. Poole, V. Daubin, P. Forterre, and C. Brochier-Armanet, “The origin of eukaryotes and their relationship with the Archaea: are we at a phylogenomic impasse?” Nature Reviews Microbiology, vol. 8, no. 10, pp. 743–752, 2010.

[23] H. Philippe and P. Forterre, “The rooting of the universal tree of life is not reliable,” Journal of Molecular Evolution, vol. 49, no. 4, pp. 509–523, 1999.

[24] E. Sober and M. Steel, “Testing the hypothesis of common ancestry,” Journal of Theoretical Biology, vol. 218, no. 4, pp. 395–408, 2002.

[25] P. Forterre, “The common ancestor of Archaea and Eukarya was not an archaeon,” Archaea, vol. 2013, Article ID 372396, 18 pages, 2013.

[26] P. Forterre and H. Philippe, “Where is the root of the universal tree of life?” BioEssays, vol. 21, no. 10, pp. 871–879, 1999.

[27] S. Gribaldo and H. Philippe, “Ancient phylogenetic relationships,” Theoretical Population Biology, vol. 61, no. 4, pp. 391–408, 2002.

[28] J. K. Harris, S. T. Kelley, G. B. Spiegelman, and N. R. Pace, “The genetic core of the universal ancestor,” Genome Research, vol. 13, no. 3, pp. 407–412, 2003.

[29] A. Harish and G. Caetano-Anollés, “Ribosomal history reveals origins of modern protein synthesis,” PLoS ONE, vol. 7, no. 3, Article ID e32776, 2012.

[30] G. Caetano-Anollés and M. J. Seufferheld, “The coevolutionary roots of biochemistry and cellular organization challenge the RNA world paradigm,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 152–177, 2013.

[31] N. R. Pace, “Mapping the tree of life: progress and prospects,” Microbiology and Molecular Biology Reviews, vol. 73, no. 4, pp. 565–576, 2009.

[32] P. de Rijk, Y. van de Peer, I. van den Broeck, and R. de Wachter, “Evolution according to large ribosomal subunit RNA,” Journal of Molecular Evolution, vol. 41, no. 3, pp. 366–375, 1995.

[33] G. Caetano-Anollés, “Tracing the evolution of RNA structure in ribosomes,” Nucleic Acids Research, vol. 30, no. 11, pp. 2575–2587, 2002.


[34] J. Mallatt and C. J. Winchell, “Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes,” Molecular Biology and Evolution, vol. 19, no. 3, pp. 289–301, 2002.

[35] M. Gerstein, “Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census,” Proteins: Structure, Function, and Bioinformatics, vol. 33, no. 4, pp. 518–534, 1998.

[36] M. Gerstein and H. Hegyi, “Comparing genomes in terms of protein structure: surveys of a finite parts list,” FEMS Microbiology Reviews, vol. 22, no. 4, pp. 277–304, 1998.

[37] B. Snel, P. Bork, and M. A. Huynen, “Genome phylogeny based on gene content,” Nature Genetics, vol. 21, no. 1, pp. 108–110, 1999.

[38] F. Tekaia, A. Lazcano, and B. Dujon, “The genomic tree as revealed from whole proteome comparisons,” Genome Research, vol. 9, no. 6, pp. 550–557, 1999.

[39] Y. I. Wolf, S. E. Brenner, P. A. Bash, and E. V. Koonin, “Distribution of protein folds in the three superkingdoms of life,” Genome Research, vol. 9, no. 1, pp. 17–26, 1999.

[40] J. O. Korbel, B. Snel, M. A. Huynen, and P. Bork, “SHOT: a web server for the construction of genome phylogenies,” Trends in Genetics, vol. 18, no. 3, pp. 158–162, 2002.

[41] G. Caetano-Anollés and D. Caetano-Anollés, “An evolutionarily structural universe of protein architecture,” Genome Research, vol. 13, no. 7, pp. 1563–1571, 2003.

[42] S. L. Baldauf, A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle, “A kingdom-level phylogeny of eukaryotes based on combined protein data,” Science, vol. 290, no. 5493, pp. 972–977, 2000.

[43] J. R. Brown, C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope, “Universal trees based on large combined protein sequence data sets,” Nature Genetics, vol. 28, no. 3, pp. 281–285, 2001.

[44] F. D. Ciccarelli, T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P. Bork, “Toward automatic reconstruction of a highly resolved tree of life,” Science, vol. 311, no. 5765, pp. 1283–1287, 2006.

[45] T. Dagan, M. Roettger, D. Bryant, and W. Martin, “Genome networks root the tree of life between prokaryotic domains,” Genome Biology and Evolution, vol. 2, no. 1, pp. 379–392, 2010.

[46] S. Jun, G. E. Sims, G. A. Wu, and S. Kim, “Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 133–138, 2010.

[47] E. A. Schultes, P. T. Hraber, and T. H. LaBean, “No molecule is an island: molecular evolution and the study of sequence space,” in Algorithmic Bioprocesses, A. Condon, D. Harel, J. N. Kok, A. Salomaa, and E. Winfree, Eds., pp. 675–704, Springer, Berlin, Germany, 2009.

[48] K. M. Kim and G. Caetano-Anollés, “The proteomic complexity and rise of the primordial ancestor of diversified life,” BMC Evolutionary Biology, vol. 11, no. 1, article 140, 2011.

[49] V. A. Albert, Parsimony, Phylogeny, and Genomics, Oxford University Press, Oxford, UK, 2005.

[50] G. Caetano-Anollés, “Evolved RNA secondary structure and the rooting of the universal tree of life,” Journal of Molecular Evolution, vol. 54, no. 3, pp. 333–345, 2002.

[51] J. A. Lake, R. G. Skophammer, C. W. Herbold, and J. A. Servin, “Genome beginnings: rooting the tree of life,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2177–2185, 2009.

[52] B. Guo, M. Zou, and A. Wagner, “Pervasive indels and their evolutionary dynamics after the fish-specific genome duplication,” Molecular Biology and Evolution, vol. 29, no. 10, pp. 3005–3022, 2012.

[53] M. K. Basu, I. B. Rogozin, O. Deusch, T. Dagan, W. Martin, and E. V. Koonin, “Evolutionary dynamics of introns in plastid-derived genes in plants: saturation nearly reached but slow intron gain continues,” Molecular Biology and Evolution, vol. 25, no. 1, pp. 111–119, 2008.

[54] J. S. Farris, “Phylogenetic analysis under Dollo’s Law,” Systematic Biology, vol. 26, no. 1, pp. 77–88, 1977.

[55] H. Fang, M. E. Oates, R. B. Pethica et al., “A daily-updated tree of (sequenced) life as a reference for genome research,” Scientific Reports, vol. 3, article 2015, 2013.

[56] T. Cavalier-Smith, “The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification,” International Journal of Systematic and Evolutionary Microbiology, vol. 52, no. 1, pp. 7–76, 2002.

[57] T. Cavalier-Smith, “Rooting the tree of life by transition analyses,” Biology Direct, vol. 1, article 19, 2006.

[58] H. S. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Widespread recruitment of ancient domain structures in modern enzymes during metabolic evolution,” Journal of Integrative Bioinformatics, vol. 10, no. 1, article 214, 2013.

[59] G. Caetano-Anollés, M. Wang, and D. Caetano-Anollés, “Structural phylogenomics retrodicts the origin of the genetic code and uncovers the evolutionary impact of protein flexibility,” PLoS ONE, vol. 8, no. 8, Article ID e72225, 2013.

[60] E. Bapteste and C. Brochier, “On the conceptual difficulties in rooting the tree of life,” Trends in Microbiology, vol. 12, no. 1, pp. 9–13, 2004.

[61] A. M. Poole, “Horizontal gene transfer and the earliest stages of the evolution of life,” Research in Microbiology, vol. 160, no. 7, pp. 473–480, 2009.

[62] D. Raoult, “The post-Darwinist rhizome of life,” The Lancet, vol. 375, no. 9709, pp. 104–105, 2010.

[63] W. F. Doolittle, “Phylogenetic classification and the universal tree,” Science, vol. 284, no. 5423, pp. 2124–2128, 1999.

[64] T. Dagan and W. Martin, “The tree of one percent,” Genome Biology, vol. 7, no. 10, article 118, 2006.

[65] W. F. Doolittle and E. Bapteste, “Pattern pluralism and the tree of life hypothesis,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 7, pp. 2043–2049, 2007.

[66] E. V. Koonin, “Towards a postmodern synthesis of evolutionary biology,” Cell Cycle, vol. 8, no. 6, pp. 799–800, 2009.

[67] T. Kloesges, O. Popa, W. Martin, and T. Dagan, “Networks of gene sharing among 329 proteobacterial genomes reveal differences in lateral gene transfer frequency at different phylogenetic depths,” Molecular Biology and Evolution, vol. 28, no. 2, pp. 1057–1074, 2011.

[68] S. Halary, J. W. Leigh, B. Cheaib, P. Lopez, and E. Bapteste, “Network analyses structure genetic diversity in independent genetic worlds,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 127–132, 2010.

[69] S. S. Abby, E. Tannier, M. Gouy, and V. Daubin, “Lateral gene transfer as a support for the tree of life,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 13, pp. 4962–4967, 2012.

[70] N. Glansdorff, Y. Xu, and B. Labedan, “The conflict between horizontal gene transfer and the safeguard of identity: origin of meiotic sexuality,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 470–480, 2009.

[71] C. G. Kurland and O. G. Berg, “A hitchhiker’s guide to evolving networks,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 361–396, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[72] L. P. Villarreal and G. Witzany, “The DNA habitat and its RNA inhabitants: at the dawn of RNA sociology,” Genomics Insights, vol. 6, pp. 1–12, 2013.

[73] S. Gribaldo and C. Brochier, “Phylogeny of prokaryotes: does it exist and why should we care?” Research in Microbiology, vol. 160, no. 7, pp. 513–521, 2009.

[74] B. D. Mishler, “The logic of the data matrix in phylogenetic analysis,” in Parsimony, Phylogeny, and Genomics, V. A. Albert, Ed., pp. 57–69, Oxford University Press, New York, NY, USA, 2005.

[75] A. Dress, V. Moulton, M. Steel, and T. Wu, “Species, clusters and the “tree of life”: a graph-theoretic perspective,” Journal of Theoretical Biology, vol. 265, no. 4, pp. 535–542, 2010.

[76] N. Lane and W. Martin, “The energetics of genome complexity,” Nature, vol. 467, no. 7318, pp. 928–934, 2010.

[77] M. C. Rivera and J. A. Lake, “The ring of life provides evidence for a genome fusion origin of eukaryotes,” Nature, vol. 431, no. 7005, pp. 152–155, 2004.

[78] D. Alvarez-Ponce and J. O. McInerney, “The human genome retains relics of its prokaryotic ancestry: human genes of archaebacterial and eubacterial origin exhibit remarkable differences,” Genome Biology and Evolution, vol. 3, no. 1, pp. 782–790, 2011.

[79] A. Poole and D. Penny, “Eukaryote evolution: engulfed by speculation,” Nature, vol. 447, no. 7147, p. 913, 2007.

[80] E. V. Koonin and Y. I. Wolf, “The fundamental units, processes and patterns of evolution, and the tree of life conundrum,” Biology Direct, vol. 4, no. 1, article 33, 2009.

[81] P. Puigbo, Y. I. Wolf, and E. V. Koonin, “Search for a “tree of life” in the thicket of the phylogenetic forest,” Journal of Biology, vol. 8, no. 6, article 59, 2009.

[82] P. Puigbo, Y. I. Wolf, and E. V. Koonin, “Seeing the tree of life behind the phylogenetic forest,” BMC Biology, vol. 11, article 46, 2013.

[83] D. B. Wetlaufer, “Nucleation, rapid folding, and globular intrachain regions in proteins,” Proceedings of the National Academy of Sciences of the United States of America, vol. 70, no. 3, pp. 697–701, 1973.

[84] J. S. Richardson, “The anatomy and taxonomy of protein structure,” Advances in Protein Chemistry, vol. 34, pp. 167–339, 1981.

[85] M. Riley and B. Labedan, “Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module,” Journal of Molecular Biology, vol. 268, no. 5, pp. 857–868, 1997.

[86] J. Janin and S. J. Wodak, “Structural domains in proteins and their role in the dynamics of protein function,” Progress in Biophysics and Molecular Biology, vol. 42, pp. 21–78, 1983.

[87] G. Caetano-Anollés, M. Wang, D. Caetano-Anollés, and J. E. Mittenthal, “The origin, evolution and structure of the protein world,” Biochemical Journal, vol. 417, no. 3, pp. 621–637, 2009.

[88] J. Gough, “Convergent evolution of domain architectures (is rare),” Bioinformatics, vol. 21, no. 8, pp. 1464–1471, 2005.

[89] K. Forslund, A. Henricson, V. Hollich, and E. L. L. Sonnhammer, “Domain tree-based analysis of protein architecture evolution,” Molecular Biology and Evolution, vol. 25, no. 2, pp. 254–264, 2008.

[90] S. Yang and P. E. Bourne, “The evolutionary history of protein domains viewed by species phylogeny,” PLoS ONE, vol. 4, no. 12, Article ID e8378, 2009.

[91] K. M. Kim and G. Caetano-Anollés, “The evolutionary history of protein fold families and proteomes confirms that the archaeal ancestor is more ancient than the ancestors of other superkingdoms,” BMC Evolutionary Biology, vol. 12, article 13, 2012.

[92] M. Wang and G. Caetano-Anollés, “The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world,” Structure, vol. 17, no. 1, pp. 66–78, 2009.

[93] G. Caetano-Anollés and A. Nasir, “Benefits of using molecular structure and abundance in phylogenomic analysis,” Frontiers in Genetics, vol. 3, article 172, 2012.

[94] K. Brading and E. Castellani, Symmetries in Physics: Philosophical Reflections, Cambridge University Press, Cambridge, UK, 2003.

[95] J. A. Cotton and M. Wilkinson, “Quantifying the potential utility of phylogenetic characters,” Taxon, vol. 57, no. 1, pp. 131–136, 2008.

[96] E. Mossel and M. Steel, “A phase transition for a random cluster model on phylogenetic trees,” Mathematical Biosciences, vol. 187, no. 2, pp. 189–203, 2004.

[97] M. H. Bailor, X. Sun, and H. M. Al-Hashimi, “Topology links RNA secondary structure with global conformation, dynamics, and adaptation,” Science, vol. 327, no. 5962, pp. 202–206, 2010.

[98] K. Illergard, D. H. Ardell, and A. Elofsson, “Structure is three to ten times more conserved than sequence – a study of structural response in protein cores,” Proteins: Structure, Function, and Bioinformatics, vol. 77, no. 3, pp. 499–508, 2009.

[99] G. Caetano-Anollés, S. K. Hee, and J. E. Mittenthal, “The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 22, pp. 9358–9363, 2007.

[100] P. H. Weston, “Methods for rooting cladistic trees,” in Models in Phylogeny Reconstruction, D. J. Siebert, R. W. Scotland, and D. M. Williams, Eds., pp. 125–155, Oxford University Press, Oxford, UK, 1994.

[101] H. N. Bryant, “The polarization of character transformations in phylogenetic systematics: role of axiomatic and auxiliary assumptions,” Systematic Biology, vol. 40, no. 4, pp. 433–445, 1991.

[102] H. N. Bryant and G. Wagner, “Character polarity and the rooting of cladograms,” in The Character Concept in Evolutionary Biology, G. P. Wagner, Ed., pp. 319–338, Academic Press, New York, NY, USA, 2001.

[103] G. Nelson, “Ontogeny, phylogeny, paleontology, and the biogenetic law,” Systematic Biology, vol. 27, no. 3, pp. 324–345, 1978.

[104] P. H. Weston, “Indirect and direct methods in systematics,” in Ontogeny and Systematics, C. J. Humphries, Ed., pp. 27–56, Columbia University Press, New York, NY, USA, 1988.

[105] F.-J. Sun and G. Caetano-Anollés, “The evolutionary history of the structure of 5S ribosomal RNA,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 430–443, 2009.

[106] X. Fang, T. Pan, and T. R. Sosnick, “Mg2+-dependent folding of a large ribozyme without kinetic traps,” Nature Structural Biology, vol. 6, no. 12, pp. 1091–1095, 1999.

[107] F.-J. Sun and G. Caetano-Anollés, “Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses,” PLoS Computational Biology, vol. 4, no. 3, Article ID e1000018, 2008.

[108] F.-J. Sun and G. Caetano-Anollés, “The origin and evolution of tRNA inferred from phylogenetic analysis of structure,” Journal of Molecular Evolution, vol. 66, no. 1, pp. 21–35, 2008.

[109] W. R. Ashby, An Introduction to Cybernetics, Taylor & Francis, London, UK, 1955.

[110] J. G. Lundberg, “Wagner networks and ancestors,” Systematic Biology, vol. 21, no. 4, pp. 398–413, 1972.

[111] P. T. Chippindale and J. J. Wiens, “Weighting, partitioning, and combining characters in phylogenetic analysis,” Systematic Biology, vol. 43, no. 2, pp. 278–287, 1994.

[112] J. Mittenthal, D. Caetano-Anollés, and G. Caetano-Anollés, “Biphasic patterns of diversification and the emergence of modules,” Frontiers in Genetics, vol. 3, article 147, 2012.

[113] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[114] C. A. Orengo, A. Michie, S. Jones, D. T. Jones, M. Swindells, and J. M. Thornton, “CATH – a hierarchic classification of protein domain structures,” Structure, vol. 5, no. 8, pp. 1093–1109, 1997.

[115] G. Sawa, J. Dicks, and I. N. Roberts, “Current approaches to whole genome phylogenetic analysis,” Briefings in Bioinformatics, vol. 4, no. 1, pp. 63–74, 2003.

[116] R. Sathyapriya, J. M. Duarte, H. Stehr, I. Filippis, and M. Lappe, “Defining an essence of structure determining residue contacts in proteins,” PLoS Computational Biology, vol. 5, no. 12, Article ID e1000584, 2009.

[117] D. Penny and L. J. Collins, “Evolutionary genomics leads the way,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 1–16, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[118] M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.

[119] A. Andreeva, D. Howorth, J. Chandonia et al., “Data growth and its impact on the SCOP database: new developments,” Nucleic Acids Research, vol. 36, pp. D419–D425, 2008.

[120] L. S. Yafremava, M. Wielgos, S. Thomas et al., “A general framework of persistence strategies for biological systems helps explain domains of life,” Frontiers in Genetics, vol. 4, article 16, 2013.

[121] J. L. Friar, T. Goldman, and J. Perez-Mercader, “Genome sizes and the Benford distribution,” PLoS ONE, vol. 7, no. 5, Article ID e36624, 2012.

[122] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya,” BMC Evolutionary Biology, vol. 12, article 156, 2012.

[123] A. Nasir and G. Caetano-Anollés, “Comparative analysis of proteomes and functionomes provides insights into origins of cellular diversification,” Archaea, vol. 2013, Article ID 648746, 13 pages, 2013.

[124] W. Martin and M. Muller, “The hydrogen hypothesis for the first eukaryote,” Nature, vol. 392, no. 6671, pp. 37–41, 1998.

[125] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Global patterns of protein domain gain and loss in superkingdoms,” PLoS Computational Biology, vol. 10, no. 1, Article ID e1003452, 2014.

[126] MWang L S Yafremava D Caetano-Anolles J E Mittenthaland G Caetano-Anolles ldquoReductive evolution of architectural

repertoires in proteomes and the birth of the tripartite worldrdquoGenome Research vol 17 no 11 pp 1572ndash1585 2007

[127] L Guy and T J G Ettema ldquoThe archaeal ldquoTACKrdquo superphylumand the origin of eukaryotesrdquoTrends inMicrobiology vol 19 no12 pp 580ndash587 2011

[128] D Schlieper M A Oliva J M Andreu and J Lowe ldquoStructureof bacterial tubulin BtubAB evidence for horizontal genetransferrdquo Proceedings of the National Academy of Sciences of theUnited States of America vol 102 no 26 pp 9170ndash9175 2005

[129] M P Hoeppner P P Gardner and A M Poole ldquoComparativeanalysis of RNA families reveals distinct repertoires for eachdomain of liferdquo PLoS Computational Biology vol 8 no 11Article ID e1002752 2012

[130] H Xue K Tong C Marck H Grosjean and J T WongldquoTransfer RNA paralogs evidence for genetic code-amino acidbiosynthesis coevolution and an archaeal root of liferdquoGene vol310 no 1-2 pp 59ndash66 2003

[131] H Xue SNg K Tong and J TWong ldquoCongruence of evidencefor aMethanopyrus-proximal root of life based on transfer RNAand aminoacyl-tRNA synthetase genesrdquo Gene vol 360 no 2pp 120ndash130 2005

[132] MDiGiulio ldquoNanoarchaeum equitans is a living fossilrdquo Journalof Theoretical Biology vol 242 no 1 pp 257ndash260 2006

[133] M Di Giulio ldquoThe tree of life might be rooted in the branchleading to Nanoarchaeotardquo Gene vol 401 no 1-2 pp 108ndash1132007

[134] J T Wong J Chen W Mat S Ng and H Xue ldquoPolyphasicevidence delineating the root of life and roots of biologicaldomainsrdquo Gene vol 403 no 1-2 pp 39ndash52 2007

[135] F-J Sun and G Caetano-Anolles ldquoThe ancient history of thestructure of ribonuclease P and the early origins of ArchaeardquoBMC Bioinformatics vol 11 article 153 2010

[136] C Marck and H Grosjean ldquotRNomics analysis of tRNA genesfrom 50 genomes of Eukarya Archaea and bacteria revealsanticodon-sparing strategies and domain-specific featuresrdquoRNA vol 8 no 10 pp 1189ndash1232 2002

[137] W Mat H Xue and J T Wong ldquoThe genomics of LUCArdquoFrontiers in Bioscience vol 13 no 14 pp 5605ndash5613 2008

[138] C Brochier S Gribaldo Y Zivanovic F Confalonieri andP Forterre ldquoNanoarchaea representatives of a novel archaealphylum or a fast-evolving euryarchaeal lineage related toThermococcalesrdquo Genome Biology vol 6 no 5 article R422005

[139] B Billoud M Guerrucci M Masselot and J S DeutschldquoCirripede phylogeny using a novel approach molecular mor-phometricsrdquoMolecular Biology and Evolution vol 17 no 10 pp1435ndash1445 2000

[140] L J Collins V Moulton and D Penny ldquoUse of RNA secondarystructure for studying the evolution of RNase P and RNaseMRPrdquo Journal of Molecular Evolution vol 51 no 3 pp 194ndash2042000

[141] G Caetano-Anolles ldquoNovel strategies to study the role ofmutation and nucleic acid structure in evolutionrdquo Plant CellTissue and Organ Culture vol 67 no 2 pp 115ndash132 2001

[142] F-J Sun S Fleurdepine C Bousquet-Antonelli G Caetano-Anolles and J Deragon ldquoCommon evolutionary trends forSINE RNA structuresrdquo Trends in Genetics vol 23 no 1 pp 26ndash33 2007

[143] D Caetano-Anolles K M Kim J E Mittenthal and GCaetano-Anolles ldquoProteome evolution and the metabolic ori-gins of translation and cellular liferdquo Journal of MolecularEvolution vol 72 no 1 pp 14ndash33 2011


[144] M. Wang, Y. Jiang, K. M. Kim et al., “A universal molecular clock of protein folds and its power in tracing the early history of aerobic metabolism and planet oxygenation,” Molecular Biology and Evolution, vol. 28, no. 1, pp. 567–582, 2011.

[145] S. A. Bukhari and G. Caetano-Anollés, “Origin and evolution of protein fold designs inferred from phylogenomic analysis of CATH domain structures in proteomes,” PLoS Computational Biology, vol. 9, no. 3, Article ID e1003009, 2013.

[146] M. Vesteg and J. Krajcovic, “The falsifiability of the models for the origin of eukaryotes,” Current Genetics, vol. 57, no. 6, pp. 367–390, 2011.

[147] I. N. Berezovsky and E. I. Shakhnovich, “Physics and evolution of thermophilic adaptation,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 36, pp. 12742–12747, 2005.

[148] C. Woese, “The universal ancestor,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 12, pp. 6854–6859, 1998.

[149] M. J. Seufferheld and G. Caetano-Anollés, “Phylogenomics supports a cellularly structured urancestor,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 178–191, 2013.

[150] O. Kandler, “Cell wall biochemistry and three-domain concept of life,” Systematic and Applied Microbiology, vol. 16, no. 4, pp. 501–509, 1994.

[151] C. R. Woese, “On the evolution of cells,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 13, pp. 8742–8747, 2002.

[152] S. L. Baldauf, J. D. Palmer, and W. F. Doolittle, “The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny,” Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 15, pp. 7749–7754, 1996.

[153] H. Leffers, J. Kjems, L. Østergaard, N. Larsen, and R. A. Garrett, “Evolutionary relationships amongst archaebacteria. A comparative study of 23 S ribosomal RNAs of a sulphur-dependent extreme thermophile, an extreme halophile and a thermophilic methanogen,” Journal of Molecular Biology, vol. 195, no. 1, pp. 43–61, 1987.

[154] J. Martin, D. Blackburn, and E. Wiley, “Are node-based and stem-based clades equivalent? Insights from graph theory,” PLoS Currents, vol. 2, Article ID RRN1196, 2010.


the relative distribution of character states in taxic sets, it also embodies the generality criterion of rooting. Finally, we generated rooted ToLs that describe the evolution of proteomes directly from a census of protein fold structures in proteomes (Figure 4(c)). This ToL shows a paraphyletic rooting in Archaea. The method extracts phylogenetic signal from the proteomic abundance of protein fold structures and considers that the most abundant and widely distributed folds are of ancient origin when defining transformation series [41]. This polarization scheme, which results in the gradual growth of the proteome repertoire, again represents an embodiment of the generality criterion in which statements of homology (fold structures) result from additive phylogenetic change (increases in abundance). It is noteworthy that these ToL reconstructions take into account the genomic abundance of each and every fold structure in each proteome and across the entire matrix, thereby generating a frustrated system. In general, very high abundance of only a few folds will not attract taxa (species) to the derived branches in the ToL. The ancestry of taxa is determined by both the abundance and interplay among fold structure characters. For example, metabolic folds such as those involved in ATP hydrolysis are widespread in cells and considered to be very ancient. These are also the most ancient folds in our phylogenies. In comparison, some popular eukaryote-specific folds (e.g., immunoglobulin superfamilies) are highly abundant but appear in a derived manner in our phylogenies. Thus, we reason that there is no circularity involved in the character polarization scheme.

The compositional schemes extend the concept of rooting with paralogous sequences to the entire proteome complements, from gene family level [100] to structural hierarchies. The three examples make use of different rooting rationales but provide a congruent scenario of origins of diversified life. Technically, roots are inferred by the Lundberg method [110], which does not require any outgroup taxa specification. This method roots the trees a posteriori by attaching the hypothetical ancestor to the branch of the unrooted network that would yield the minimum increase in tree length (thus preserving the principle of parsimony).
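The a posteriori rooting step can be illustrated with a minimal sketch. It assumes unordered characters scored with Fitch parsimony and a brute-force scan over branches; neither choice is stated in the text, and the function names, the four-taxon tree, and the character states are hypothetical. The sketch attaches an all-ancestral pseudo-taxon to each branch of an unrooted tree in turn and keeps the branch that gives the shortest tree.

```python
def fitch_length(adj, states, root):
    """Fitch parsimony length of a tree (adjacency dict) traversed from `root`."""
    total = 0

    def visit(node, parent):
        nonlocal total
        children = [c for c in adj[node] if c != parent]
        if not children:  # terminal node: one singleton state set per character
            return [{s} for s in states[node]]
        sets = visit(children[0], node)
        for child in children[1:]:
            merged = []
            for a, b in zip(sets, visit(child, node)):
                common = a & b
                if common:
                    merged.append(common)
                else:  # no shared state: one extra step for this character
                    merged.append(a | b)
                    total += 1
            sets = merged
        return sets

    visit(root, None)
    return total


def lundberg_root(adj, states, ancestor):
    """Return (tree length, branch) for the most parsimonious ancestor attachment."""
    edges = sorted({tuple(sorted((u, v))) for u in adj for v in adj[u]})
    best = None
    for u, v in edges:
        trial = {n: set(nb) for n, nb in adj.items()}
        trial[u].remove(v)
        trial[v].remove(u)
        trial["MID"] = {u, v, "ANC"}  # new node splitting the branch
        trial[u].add("MID")
        trial[v].add("MID")
        trial["ANC"] = {"MID"}        # hypothetical ancestor as an extra terminal
        length = fitch_length(trial, {**states, "ANC": ancestor}, "MID")
        if best is None or length < best[0]:  # ties keep the first branch found
            best = (length, (u, v))
    return best


# Hypothetical unrooted tree ((A,B)X,(C,D)Y) with three binary characters.
adj = {"A": {"X"}, "B": {"X"}, "C": {"Y"}, "D": {"Y"},
       "X": {"A", "B", "Y"}, "Y": {"C", "D", "X"}}
states = {"A": (0, 0, 0), "B": (1, 0, 0), "C": (1, 1, 1), "D": (1, 1, 0)}
print(lundberg_root(adj, states, ancestor=(0, 0, 0)))
```

In this toy case the ancestor attaches to the branch leading to taxon A, which therefore becomes the basal lineage of the rooted tree.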

(iii) Characters Must Serve as Independent Evolutionary Hypotheses. Valid phylogenetic optimization requires that characters be independent pieces of evidence. Characters should not depend on other characters. When the assumption of independence is violated, characters are overweighted in the analysis and the resulting phylogeny fails to represent true history [111]. Possible dependencies could be of many kinds, from structural to functional, from developmental to ecological. These dependencies distort and obscure phylogenetic signal and must be either avoided or coded into the phylogenetic model through parameters or weight corrections.

As we will now elaborate, the problem of character independence is about parts and wholes in the hierarchical fabric of life and in the nested hierarchies of the ToL. Biological systems are by definition made of parts, regardless of the way parts are defined. In evolution, diversification and integration of parts unify parts into cohesive entities, modules, which then diversify [112]. This process and the rise of modules may explain evolutionary waves of complexity and organization and the emergence of structure (defined broadly) in biology that is hierarchical. The hierarchical makeup is made evident in the structure of protein molecules, where lower level parts of the polymer (the amino acid residues) interact with each other and establish cohesive higher-level modular parts, which also establish interaction networks and are crucial for molecular function and for interaction of proteins with the cellular environment. Consequently, the structure of proteins can be described at increasing hierarchical levels of structural abstraction: sequences, motifs, loops, domains, families, superfamilies, topologies, folds, architectures, and classes. Two accepted gold standards of protein classification, structural classification of proteins (SCOP) [113] and class architecture topology homology (CATH) [114], use parts of this incomplete scheme to describe the atomic complexity of the molecules. We note that these classifications do not consider unrealized structural states, such as protein folds that are possible but that have never been identified in the natural world of protein structures. We also note that modules sometimes engage in combinatorial games. For example, protein domains are rearranged in evolution by fusions and fissions, producing the enormous diversity of alternative domain rearrangements that exist in multidomain proteins [92].

[Figure 5 panels. (a) Canonical paradigm, evolution of systems: characters are parts (amino acid sequence sites); interactions between parts (e.g., atomic contacts between sites) have a high impact on phylogenetic reconstruction. (b) New paradigm, evolution of parts: characters are systems (proteomes); interactions between systems (e.g., proteomes in trophic chains) have a low impact on phylogenetic reconstruction.]

Figure 5: A new phylogenetic strategy simplifies the problems of character independence. (a) The canonical paradigm explores the evolution of systems, such as the evolution of organisms in the reconstruction of ToLs. For example, the terminals of phylogenetic trees can be genes sampled from different organisms (e.g., Pyrococcus furiosus (Pfu), Methanococcus jannaschii (Mja), and Methanosarcina barkeri (Mba)), and phylogenetic characters can be amino acid sequence sites of the corresponding gene products. Character states can describe the identity of the amino acid at each site. Since characters are molecular parts that interact with other parts when molecules fold into compact 3D structures, their interaction violates the principle of character independence. Consequently, the effects of covariation must be considered in the phylogenetic model used to build the trees. (b) The new paradigm explores the evolution of parts, such as the evolution of protein domains in proteomes. For example, the terminals of phylogenetic trees can be domains defined at fold superfamily level of structural complexity, and the characters used to build the trees can be proteomes. Character states can be the number of domains holding the FSF structure. Since proteomes interact with other proteomes when organisms establish close interactions, interactions that could affect the abundance of domains in proteomes should be considered negligible (unless there is an obligate parasitic lifestyle involved), and there is no need to budget trophic interactions in the phylogenetic model.

Since biological systems evolve and carry common ancestry, parts of these systems by definition evolve and by themselves carry common ancestry. In other words, the histories of parts are embodied in the ensemble of lineages of the ToL. The focus of phylogenetic analysis, however, has been overwhelmingly the organism as the biological system, as testified by the effort devoted to taxonomical classification, systematic biology, and building of the ToL. The genomic revolution has provided a wealth of parts in the amino acid sequence sites of proteins and nucleic acid molecules, which have been used as phylogenetic characters for analysis of molecules representing organisms. This has maintained the focus of reconstructing trees of systems (the wholes). For example, amino acid sites of a protein are generally fit into a data matrix (alignment), which is then used to build trees of genes and organisms (Figure 5(a)) using modern algorithmic implementations [115]. However, and as we have mentioned, proteins have complicated structures that result from interactions between amino acids at the 3D atomic level. These intramolecular interactions, or at least some of them [116], induce protein folding and delimit molecular function and protein stability. They are responsible for protein secondary, supersecondary, domain, and tertiary structure, and by definition their mere existence induces violation of character independence. Penny and Collins [117] proposed the simple thought experiment in which the bioinformatician exchanges rows of sequence sites in the alignment matrix and asks what was lost in the process. Randomization of characters (columns) in the data matrix does not change the phylogenetic tree. However, randomization destroys the structure of the molecule and very likely its function. This confirms that reconstruction of trees from information in sequence sites violates character independence and in the process ignores structure and biology. The effects of violation of character independence may be minimal for trees of sequences that are closely related. However, trees describing deep historical relationships require that sequences be divergent, and this maximizes the chances of even wider divergences in molecular structure that are not being accounted for by the models of sequence evolution.
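The thought experiment of Penny and Collins can be sketched in a few lines. The example below is only an illustration under stated assumptions: the three short sequences are hypothetical, and pairwise Hamming distances stand in for whatever site-independent criterion is used to build the tree. Applying the same random permutation of sites to every sequence leaves the distances, and hence any tree computed from them, unchanged, even though the linear order of residues that underlies 3D contacts is scrambled.

```python
import random


def hamming_matrix(alignment):
    """Pairwise Hamming distances between aligned sequences."""
    names = sorted(alignment)
    return {
        (a, b): sum(x != y for x, y in zip(alignment[a], alignment[b]))
        for a in names for b in names if a < b
    }


def shuffle_columns(alignment, seed=0):
    """Apply one random permutation of sites (columns) to every sequence."""
    length = len(next(iter(alignment.values())))
    order = list(range(length))
    random.Random(seed).shuffle(order)
    return {name: "".join(seq[i] for i in order)
            for name, seq in alignment.items()}


# Hypothetical aligned fragments; labels follow the organisms of Figure 5(a).
alignment = {"Pfu": "GLSSAKLV", "Mja": "ALYSAKIV", "Mba": "ALYSGRIV"}
shuffled = shuffle_columns(alignment)

# The input to tree building is identical before and after shuffling.
print(hamming_matrix(alignment) == hamming_matrix(shuffled))
```

The distance matrix is the same in both cases because each site contributes to it independently of its position, which is exactly the property that ignores residue contacts.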

The genomic revolution has also provided a wealth of models of 3D atomic structure. These structures are used as gold standards to assign, with high confidence, structural modules to sequences. As mentioned earlier, proteomes embody collections of protein domains with well-defined structures and functions. Protein domain counts in proteomes have been used to generate trees of protein domains (Figure 5(b)) using standard cladistic approaches and well-established methods (reviewed in [87]). These trees describe the evolution of protein structure at global level. They are effectively trees of parts. While domains interact with each other in multidomain proteins or establish protein-protein interactions with the domains of other proteins, these interactions of parts do not violate character independence. This is because phylogenetic characters are actually proteomes, systems defined by structural states that exist at much higher levels than the protein domain, not far away from the organism level. Remarkably, no information is lost when character columns in the matrix are randomized in the thought experiment. The order of proteome characters in the matrix does not follow any rationale. Characters are not ordered by lifestyles or trophic levels of the organisms. Interactions between free-living organisms will seldom bias their domain makeup, and if so, those characters can be excluded from analysis. Even the establishment of symbiotic or obligate parasitic interactions, such as the nodule-forming symbioses between rhizobia and legumes, may have little impact on character independence as long as the joint inclusion of the host and the symbiont is avoided.

6. Evidence Supporting the Archaeal Rooting of the Universal Tree

Figure 4 shows examples of rooted ToLs generated from the sequence and structure of RNA and protein molecules. The different phylogenomic approaches arrive at a common rooted topology that places the stem group of Archaea at the base of the ToL (the archaeal rooting of Figure 1(c)). However, the evolutionary interrelationship of parts and wholes prompts the use of trees of parts, a focus on higher-level structure, and a decrease in confidence in the power of trees of systems. These concepts have been applied to the study of the history of nucleic acid and protein structures for over a decade and provide additional evidence in support of the archaeal rooting scenario.

[Figure 6 panels: (a) FSF domains, (b) GO terms, (c) RNA families. Each panel shows a Venn diagram for Archaea, Bacteria, and Eukarya for all features and a second diagram for features with f > 0.6.]

Figure 6: Venn diagrams displaying the distributions of 1733 FSF domains (a), 1924 terminal GO terms (b), and 1148 RNA families (c) in the genomes of the three domains of life. FSF domain data was taken from Nasir et al. [122, 123] and included 981 completely sequenced proteomes from 70 Archaea, 652 Bacteria, and 259 Eukarya. Terminal GO terms corresponding to the “molecular function” hierarchy defined by the GO database [118] were identified in 249 free-living organisms, including 45 Archaea, 183 Bacteria, and 21 Eukarya (data taken from [123]). The Venn diagram of RNA families and the distribution of Rfam clans and families in organisms were taken from Hoeppner et al. [129] and their Dataset S1. Shown below are distribution patterns for FSFs, GO terms, and RNA families that are present in more than 60% of the organisms examined (f > 0.6). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

6.1. Comparative Genomic Argumentation. The distribution of gene-encoded products in the genomes of sequenced organisms and parsimony thinking can reveal global evolutionary patterns without formal phylogenetic reconstruction. We will show how simple numerical analyses of protein domains, molecular activities defined by the gene ontology (GO) consortium [118], and RNA families that are domain-specific or are shared between domains of life can uncover the tripartite division of cellular life, exclude chimeric scenarios of origin, and provide initial insight on the rooting of the ToL (Figure 6).

The number of unique protein folds observed in nature is very small. SCOP ver. 1.75 defines only ~2000 fold superfamilies (FSFs), groups of homologous domains unified on the basis of common structural and evolutionary relationships, for a total of 110,800 known domains in proteins [113, 119]. FSFs represent highly conserved evolutionary units that are suitable for studying organismal diversification [87]. Yafremava et al. [120] plotted the total number of distinct FSFs (FSF diversity) versus the average reuse of FSFs in the proteome of an organism (FSF abundance). This exercise uncovered a scaling behavior typical of a Benford distribution, with a linear regime of proteomic growth for microbial organisms and a superlinear regime for eukaryotic organisms (Figure 8 in [120]). These same scaling patterns are observed when studying the relationship between open reading frames and genome size [121]. Remarkably, archaeal and eukaryal proteomes exhibited the minimum and maximum levels of FSF abundance and diversity, respectively. Bacterial proteomes, however, showed intermediate levels. We note that the general scaling behavior is consistent with a scenario in which evolutionary diversification proceeds from simpler proteomes to the more complex ones in gradual manner, supporting the principle of spatiotemporal continuity and revealing the nested phylogenetic hierarchies of organisms. Under this scenario (and results of [120]), the streamlined archaeal proteomes represent the earliest form of cellular life. Remarkably, archaeal species harboring thermophilic and hyperthermophilic lifestyles encoded the most streamlined FSF repertoires (Figure 7). Clearly, modern thermophilic archaeons are most closely related to the ancient cells that inhabited planet Earth billions of years ago (also read below).
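The use and reuse quantities discussed above can be computed from a table of domain counts, as in the following minimal sketch. Two assumptions are made that the text does not confirm: FSF use is taken as the number of distinct FSFs in a proteome, and FSF reuse as the total number of domain instances (the text describes reuse as an average, so this reading may differ from the published definition). The function names and the example counts are hypothetical.

```python
import math


def use_and_reuse(domain_counts):
    """Return (use, reuse) for one proteome given FSF -> number of domains."""
    use = sum(1 for n in domain_counts.values() if n > 0)  # distinct FSFs present
    reuse = sum(domain_counts.values())                    # all domain instances
    return use, reuse


def loglog_slope(points):
    """Least-squares slope of log10(reuse) against log10(use)."""
    xs = [math.log10(use) for use, _ in points]
    ys = [math.log10(reuse) for _, reuse in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical proteome with three SCOP-style FSF identifiers.
print(use_and_reuse({"c.37.1": 5, "c.1.2": 2, "c.26.1": 0}))

# Hypothetical (use, reuse) pairs that grow proportionally on a log-log scale.
print(loglog_slope([(10, 100), (100, 1000), (1000, 10000)]))
```

A slope near 1 on the log-log plot corresponds to the linear regime described for microbial proteomes, whereas a slope above 1 would correspond to a superlinear regime.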

[Figure 7 plot: axes are FSF use (log) and FSF reuse (log); points are marked as thermophilic or nonthermophilic archaeal species.]

Figure 7: A plot of FSF use (diversity) against FSF reuse (abundance) reveals a linear pattern of proteomic growth in 48 archaeal proteomes. Thermophilic archaeal species occupy positions that are close to the origin of the plot. They also populate the most basal branch positions in ToLs (see discussion in the main text). Both axes are in logarithmic scale.

FSF domain distributions in the genomes of the three domains of life provide further insights into their evolution. Nasir et al. [122, 123] generated Venn diagrams to illustrate FSF sharing patterns in the genomes of Archaea, Bacteria, and Eukarya (Figure 6(a)). These diagrams display the total number of FSFs that are unique to a domain of life (taxonomic groups A, B, and E), shared by only two (AB, BE, and AE), and those that are universal (ABE). About half of the FSFs (786 out of 1733) were present in all three domains of life, 453 of which were present in at least 60% of the organisms that were examined (f > 0.6, where f is the distribution index ranging from 0 to 1). The fact that about 70% of the 672 widely shared FSFs belong to the ABE taxonomic group strongly supports a common evolutionary origin for cells. Evolutionary timelines have confirmed that a large number of these universal FSFs were present in the ancestor of the three domains of life, the urancestor (Figure 1), and were for the most part retained in extant proteomes [48].
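The bookkeeping behind the Venn diagrams can be sketched as follows. This is a minimal illustration, not the published pipeline: the function names and the presence/absence data are hypothetical, and the distribution index f is assumed here to be the fraction of organisms carrying a feature, counted over the domains of life in which the feature occurs. The last lines recompute the shares quoted in the text from the counts given for Figure 6.

```python
def taxonomic_group(presence):
    """Label a feature A, B, E, AB, AE, BE, or ABE from per-domain presence flags."""
    return "".join(d for d in "ABE" if any(presence[d]))


def distribution_index(presence):
    """Fraction f of organisms carrying the feature in the domains where it occurs."""
    flags = [flag for d in "ABE" if any(presence[d]) for flag in presence[d]]
    return sum(flags) / len(flags) if flags else 0.0


# Hypothetical feature: absent from 2 archaea, present in 2 of 3 bacteria
# and in both eukaryotes.
presence = {"A": [False, False], "B": [True, True, False], "E": [True, True]}
print(taxonomic_group(presence), distribution_index(presence))

# Shares of widely shared features (f > 0.6) that fall in the ABE group.
fsf_share = 453 / 672   # FSF domains
go_share = 232 / 306    # terminal GO terms
print(round(fsf_share, 2), round(go_share, 2))
```

With these counts, roughly two-thirds of the widely shared FSFs and three-quarters of the widely shared GO terms are universal, in line with the "about 70%" and "about 76%" figures reported.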

Interestingly, the number of BE FSFs was ~10-fold greater than AE and AB FSFs (324 versus 38 and 38) (Figure 6(a)). The finding that Bacteria and Eukarya encode a significantly large number of shared FSF domains is remarkable and hints towards an unprecedented strong evolutionary association between these two domains of life. Moreover, a significant number of BE FSFs is widespread in the bacterial and eukaryotic proteomes: 56 of the 324 FSFs are shared by more than 60% of the bacterial and eukaryal organisms that were analyzed (Figure 6(a)). This is strong evidence of an ancient vertical evolutionary trace from their mutual ancestor (anticipated in [123]). This trace uniquely supports the archaeal rooting of the ToL. In turn, the canonical and eukaryotic rooting alternatives are highly unlikely, as none alone can explain the remarkable diversity of the BE taxonomic group (as well as its ancient origin [123]). The remarkable vertical trace of the ABE taxonomic group and the negligible vertical trace of the AB group (only one FSF shared by more than 60% of organisms) also falsify fusion hypotheses responsible for a putative chimeric makeup of Eukarya. In light of these findings, the most parsimonious explanation of comparative genomic data is that Archaea is the first domain of diversified cellular life.

The patterns of FSF sharing are further strengthened by the genomic distributions of terminal-level molecular function GO terms (Figure 6(b)). The three domains of life shared about a quarter (526) of the 1924 GO terms that were surveyed. A total of 232 of these ABE terms were shared by more than 60% of the organisms analyzed. Again, the fact that about 76% of the 306 widely shared GO terms belong to the ABE taxonomic group strongly supports a common vertical trace, while the finding that Bacteria and Eukarya share a significantly large number of GO terms (272) supports again the archaeal rooting of the ToL. An alternative explanation for the very large size of BE could be large-scale metabolism-related gene transfer from the ancestors of mitochondria and plastids to the ancestors of modern eukaryotes [124]. However, we note that the BE group is not restricted to only metabolic functions. It also includes FSFs and GOs involved in intracellular and extracellular processes and regulation and informational functions [125]. Thus, the very large size of BE is a significant outcome most likely shaped by vertical evolutionary scenarios and cannot solely be explained by parasitic/symbiotic relationships that exist between Bacteria and Eukarya [123]. Moreover, the simplicity of archaeal FSF repertoires is not due to the paucity of available archaeal genomic data. We confirmed that the mean FSF coverage (i.e., number of proteins annotated to FSFs/GOs out of total) for Archaea, Bacteria, and Eukarya was largely comparable (e.g., Table S1 in [122]). Finally, as we will describe below, these comparative patterns and tentative conclusions are confirmed by phylogenomic analysis [91, 122, 126]. Congruent phylogenies were obtained with (Figure 4(c)) and without equal and random sampling of taxa. Thus, the relatively low number of archaeal genomes is also not expected to compromise our inferences.

We note that Eukarya shares many informational genes with Archaea and many operational genes with Bacteria. While informational genes have been thought more refractory to HGT than noninformational genes, we have confirmed at GO level that this is not the case. A ToL built from GO terms showed that in fact noninformational terms were less homoplasious, while statistical enrichment analyses revealed that HGT had little if any functional preference for GO terms across GO hierarchical levels. We recently compared ToLs reconstructed from non-HGT GO terms and ToLs reconstructed from informational GO terms that were extracted from non-HGT GO terms (Kim et al., ms. in review; also read below). In both cases, Archaea appeared as a basal paraphyletic group of the ToLs and the common origin of Bacteria and Eukarya was maintained. Thus, the 272 GO terms shared by Bacteria and Eukarya harbor a strong vertical trace.

Another observation often used as support for Archaea-Eukarya kinship is the discovery of a few eukaryote-specific proteins (e.g., actin, tubulin, H3, H4, ESCRT, ribosomal proteins, and others) in some archaeal species [127] and their complete absence from Bacteria (except for documented tubulin HGT from eukaryotes to the bacterial genus Prosthecobacter [128]). This suggests either that eukaryotes arose from an archaeal lineage [127] or that the ancestor of Eukarya and Archaea was complex and modern Archaea are highly reduced [25]. Indeed, few eukaryote-specific proteins have now been found in some archaeal species (in some cases, just one species). We argue that this poor spread cannot be taken as evidence for the Archaea-Eukarya sister relationship. This needs to be confirmed by robust phylogenetic analysis, which is unfortunately not possible when using protein sequences. Recently, we described a new strategy for inferring vertical and horizontal traces [123]. This method calculates the spread of FSFs and GOs that are shared between two-superkingdom groups (i.e., AB, AE, and BE). Balanced distributions often indicate vertical inheritance, while biased distributions suggest horizontal flux. For example, penicillin binding molecular activity (GO: 0008658) was present in 100% of the sampled bacterial proteomes but was only present in 11% of the archaeal species [123]. Thus, presence of GO: 0008658 in Archaea was attributed to HGT gain from Bacteria. Using this simple method, we established that both Bacteria and Eukarya were united by much stronger vertical trace than either was to Archaea. In fact, strong reductive tendencies in the archaeal genomes were recorded [123]. Thus, in our opinion, presence of eukaryote-specific proteins in only very few archaeal species could in fact be an HGT event that is not detectable by sequence phylogenies.
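The balanced-versus-biased reasoning can be illustrated with a small sketch. It is not the method of [123]: the function name, the labels, and in particular the `max_ratio` threshold of 3 are hypothetical choices made only to show how the spread of a feature in two superkingdoms might be compared.

```python
def classify_trace(spread_x, spread_y, max_ratio=3.0):
    """Compare the fraction of organisms carrying a feature in two superkingdoms.

    `max_ratio` is an arbitrary illustrative cutoff, not a published value.
    """
    low, high = sorted((spread_x, spread_y))
    if low == 0:
        return "not shared"
    if high / low <= max_ratio:
        return "balanced (vertical-like)"
    return "biased (HGT-like)"


# Penicillin binding (GO: 0008658): 100% of bacterial proteomes, 11% of archaeal species.
print(classify_trace(1.00, 0.11))

# Hypothetical feature with a similar spread in both superkingdoms.
print(classify_trace(0.70, 0.80))
```

Under this toy cutoff the penicillin binding example is flagged as biased, which matches the interpretation in the text that its presence in Archaea reflects HGT gain from Bacteria.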

In turn, many arguments now favor the Bacteria-Eukarya sisterhood in addition to the structure-based phylogenies and balanced distribution of molecular features. For example, Bacteria and Eukarya have similar lipid membranes, which can be used as an argument for their evolutionary kinship. Moreover, Archaea fundamentally differ from Eukarya in terms of their virosphere. Viruses infecting Archaea and Eukarya are drastically different, as recently discussed by Forterre [25].

The genomic distribution of RNA families was taken from Hoeppner et al. [129] and also shows a vertical evolutionary trace in five crucial Rfam clans that are universal, including tRNA, 5S rRNA, subunit rRNA, and RNase P RNA. These universal RNA groups are likely minimally affected by HGT. However, 99% of Rfam clans and families are specific to domains of life and only 11 of the 1148 groups were shared at f > 0.6 levels. This clearly shows that the functional complexity of RNA materialized very late during organismal diversification and that it is not a good genomic feature for exploring the rooting of the ToL. Only five RNA families are shared between two domains of life, and only one of these does so at f > 0.6: the G12 pseudoknot of the 23S rRNA, which is present in bacterial and eukaryotic organellar rRNA. While the large subunit rRNA scaffold supports the G12 pseudoknot with its vertical trace, all five interdomain RNA families can be explained most parsimoniously by HGT. Thus, only a handful of ancient and universal RNA species can be used to root the ToL.

We end by noting that the Venn diagrams consistently show that Archaea harbors the smallest number of unique (A) and shared (AB and AE) FSFs, GO terms, or RNA families. This trend supports an early divergence of this domain of life from the urancestor and the possibility that such divergence was shaped by evolutionary reductive events. We reason that such losses would be more parsimonious early on in evolution than in the later periods. This is because genes often increase their abundance in evolutionary time (by gene duplications, HGT, and other processes). Thus, it is reasonable to think that loss of an ancient gene would be more feasible very early in evolution relative to losing it very late.
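The Venn diagrams partition features into seven taxonomic groups (A, B, E, AB, AE, BE, and ABE) according to the superkingdoms in which they occur. A minimal sketch of that bookkeeping follows; the function name `venn_group` and the toy presence/absence records are invented for illustration and are not data from the census.

```python
from collections import Counter


def venn_group(in_archaea: bool, in_bacteria: bool, in_eukarya: bool) -> str:
    """Return the Venn label (A, B, E, AB, AE, BE, or ABE) for one feature."""
    label = (
        ("A" if in_archaea else "")
        + ("B" if in_bacteria else "")
        + ("E" if in_eukarya else "")
    )
    if not label:
        raise ValueError("feature is absent from all three superkingdoms")
    return label


# Toy presence/absence records (Archaea, Bacteria, Eukarya); hypothetical values.
toy_features = {
    "feature_1": (True, True, True),
    "feature_2": (False, True, True),
    "feature_3": (False, True, True),
    "feature_4": (False, True, False),
    "feature_5": (True, False, True),
}

counts = Counter(venn_group(*presence) for presence in toy_features.values())
print(dict(counts))
```

Counting the labels in this way is all that is needed to compare the sizes of the archaeal groups (A, AB, AE) with those of the other groups.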

6.2. Phylogenomic Evidence from the Sequence and Structure of RNA. Phylogenetic analyses of the few RNA families that are universal and display an important vertical evolutionary trace (Figure 6(c)) provide compelling evidence in favor of an early evolutionary appearance of Archaea [105, 107, 108, 130–135]. Here we briefly summarize evidence from tRNA, 5S rRNA, and RNase P RNA. Unpublished analyses of rRNA sequence and structure using advanced phylogenetic methods also show that Archaea was the first domain of life.

tRNA molecules are generally short (~73–95 nucleotides in length) and highly conserved. Consequently, their sequence generally contains limited amount of phylogenetic information. These limitations have been overcome by analyzing entire tRNomes [136], which extend the length of an organismal set of tRNAs to over 2000 bases. Xue et al. [130, 131] analyzed the genetic distances between tRNA sequences as averages between alloacceptor tRNAs from diverse groups of tRNomes using multiple molecular base substitution models. The distances were mapped onto an unrooted phylogeny of tRNA molecules (Figure 8(a)). The “arrow of time” assumption in these studies is that ancient tRNA paralogs closely resemble each other when lineages originate close to the time of the gene duplication. Remarkably, the results revealed a paraphyletic rooting of the ToL in Archaea. The root was specifically located close to the hyperthermophilic methanogen Methanopyrus kandleri (Figure 8(a)). The hypothesis of this specific rooting scenario has been supported by several other studies [134, 137], including a study of genetic distances between paralogous pairs of aminoacyl-tRNA synthetase (aaRS) proteins [131]. A remarkable match between distance scores of tRNA and pairs of aaRS paralogs (Figure 8(b)) not only confirms the early appearance of Archaea but also suggests a coevolutionary trace associated with molecular interactions that are responsible for the genetic code. In fact, a recent exhaustive phylogenomic analysis of tRNA and aaRS coevolution explicitly reveals the origins and evolution of the genetic code and the underlying molecular basis of genetics [59]. Di Giulio [132, 133] has also proposed an archaeal rooting of the tree of life, specifically in the lineage leading to the phylum of Nanoarchaeota. This rooting is based on unique and ancestral genomic traits of Nanoarchaeum equitans: split genes separately codifying for the 5′ and 3′ halves of tRNA and the absence of operons, which are considered molecular fossils [132]. However, this claim needs additional support, as contrasting evidence now recognizes N. equitans as a highly derived archaeal species [138].
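The idea of averaging distances between alloacceptor tRNAs within a genome can be sketched as follows. This is an illustration under stated assumptions, not the method of Xue et al. [130, 131]: it uses a plain mismatch fraction (p-distance) instead of the substitution models used in those studies, it assumes the sequences are already aligned to equal length, and the 10-nucleotide fragments and function names are invented for the example.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of mismatched positions between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def mean_alloacceptor_distance(trnas: dict) -> float:
    """Average p-distance over all pairs of tRNAs that accept different amino acids."""
    pairs = list(combinations(sorted(trnas), 2))
    return sum(p_distance(trnas[x], trnas[y]) for x, y in pairs) / len(pairs)


# Invented 10-nt fragments, one per amino acid; real tRNAs are ~73-95 nt long.
toy_trnome = {
    "Ala": "GGGGCTATAG",
    "Gly": "GGGGCTATAG",
    "Ser": "GGGGCTATAC",
    "Val": "GGAGCTATAC",
}

print(round(mean_alloacceptor_distance(toy_trnome), 3))  # prints 0.117
```

With one tRNA for each of 20 amino acids, the same pairing yields 190 pairs per genome. Under the “arrow of time” assumption, a lower average indicates paralogs that still resemble each other and hence a lineage that originated closer to the time of the duplications.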

In addition to sequences, structural features of tRNA molecules also support the archaeal rooting of the ToL.

Figure 8: Ancient phylogenetic signal in the sequence of tRNA and their associated aminoacyl-tRNA synthetase (aaRS) enzymes. (a) Unrooted ToL derived from tRNA sequences with their alloacceptor D_allo distances traced in thermal scale (from [130]). D_allo is average of pairwise distances for 190 pairs of tRNA isoacceptors. These distances measure pairwise sequence mismatches of tRNAs for every genome, and their values increase for faster evolving sequences of species of more recent origin. (b) Coevolution of aaRSs and their corresponding tRNA. Genetic distances between the top 10 potentially paralogous aaRS pairs, estimated using BLASTP, define a measure (Q_ars) of how closely the proteins resemble each other in genomes (from [131]). Larger Q_ars scores imply more ancestral and slowly evolving protein pairs. The plot of Q_ars scores against D_allo distances reveals a hidden correlation between the evolution of tRNA and aaRSs and the early origin of Archaea (larger Q_ars and lower D_allo distances).

The application of RNA structural evidence in phylogenetic studies [50, 139–141] has multiple advantages over sequence data when studying ancient events, especially because RNA structures are far more conserved than sequences. This has been demonstrated in a phylogenetic approach that uses RNA structural information to reconstruct evolutionary history of macromolecules such as rRNA [29, 50], tRNA [107, 108], 5S rRNA [105], RNase P RNA [135], and SINE RNA [142]. Geometrical and statistical properties of structure (e.g., stems or loops commonly found in the secondary structures of RNA molecules) are treated as linearly ordered multistate phylogenetic characters. In order to build rooted trees, an evolutionary tendency toward conformational order is used to polarize change in character state transformations. This defines a hypothetical ancestor with which to root the ingroup using the Lundberg method. Reconstructed phylogenies produce trees of molecules and ToLs (e.g., Figure 4(a)) or trees of substructures that describe the gradual evolutionary accretion of structural components into molecules (Figure 9). For example, phylogenetic trees of tRNA substructures define explicit models of molecular history and show that tRNA originated in the acceptor stem of the molecule [108]. Remarkably, trees reconstructed from tRNA drawn from individual domains of life demonstrate that the sequence of accretion events occurred differently in Archaea than in Bacteria and Eukarya, suggesting a sister group evolutionary relationship between the bacterial and eukaryotic domains (Figure 9(a)). A similar result obtained from trees of 5S rRNA substructures revealed different molecular accretion sequences of archaeal molecules when these were compared to bacterial and eukaryal counterparts [105], confirming again, in a completely different molecular system, the history of the domains of life.

An analysis of the structure of RNase P RNA also provides similar conclusions [135]. While a ToL reconstructed from molecular structure placed type A archaeal molecules at its base (a topology that resembles the ToL of 5S rRNA), a tree of RNase P RNA substructures uncovered the history of molecular accretion of the RNA component of the ancient endonuclease and revealed a remarkable reductive evolutionary trend (Figure 9(b)). Molecules originated in stem P12 and were immediately accessorized with the catalytic P1–P4 pseudoknotted core structure that interacts with RNase P proteins of the endonuclease complex and ancient segments of tRNA. Soon after this important accretion stage, the evolving molecule loses its first stem in Archaea (stem P8), several accretion steps earlier than the first loss of a stem in Eukarya or the first appearance of a Bacteria-specific stem. These phylogenetic statements provide additional strong support to the early origin of the archaeal superkingdom prior to the divergence of the shared common ancestor of Bacteria and Eukarya. As we will discuss below, the early loss of a structure in the molecular accretion process of a central and ancient RNA family is significant. It suggests that the emerging archaeal lineages were subjected to strong reductive evolutionary pressures during the early evolution of a very ancient RNA molecule.

Figure 9: The history of accretion of tRNA and RNase P RNA substructures reveals the early evolutionary appearance of Archaea. (a) Rooted trees of tRNA arm substructures reveal the early appearance of the acceptor arm (Acc) followed by the anticodon arm (AC) in Archaea or the pseudouridine (TΨC) arm in both Bacteria and Eukarya (from [108]). The result confirms the sister group relationship of Bacteria and Eukarya. (b) Trees of molecular substructures of RNase P RNAs were reconstructed from characters describing the geometry of their structures (from [135]). Branches and corresponding substructures in a 3D atomic model are colored according to the age of each substructure (nd, node distance). Note the early loss of stem P8 in Archaea immediately after the evolutionary assembly of the universal functional core of the molecule.

6.3. Phylogenomic Evidence from Protein Domain Structure. G. Caetano-Anollés and D. Caetano-Anollés [41] were the first to utilize protein domain structures as taxa and to reconstruct trees of domains (ToDs) describing their evolution. Figure 10 shows a rooted ToD built from a census of FSF structures in 981 genomes (data taken from [122, 123]). These trees are unique in that their terminal leaves represent a finite set of component parts [87]. These parts describe at global level the structural diversity of the protein world. When building trees of FSFs, the age of each FSF domain structure can be calculated from the ToDs by simply calculating a distance in nodes between the root of the tree and its corresponding terminal leaf. This is possible because ToDs exhibit highly unbalanced (pectinate) topologies that are the result of semipunctuated processes of domain appearance and accumulation. This node distance (nd), rescaled from 0 to 1, provides a relative timescale to study the order of FSF appearance in evolutionary history [126, 143]. Wang et al. [144] showed that nd correlates linearly with geological time and defines a global molecular clock of protein folds. Thus, nd can be used as a reliable proxy for time. Plotting the age of FSFs in each of the seven taxonomic groups confirmed evolutionary statements we had previously deduced from the Venn diagrams of Figure 6. The ABE taxonomic group included the majority of ancient and widely distributed FSFs. This is expected. In the presence of a strong vertical trace, molecular diversity must delimit a nested taxonomic hierarchy. The ABE group was followed by the evolutionary appearance of the BE group, which preceded the first domain-specific structures, which were Bacteria-specific (B). Remarkably, Archaea-specific (A) and Eukarya-specific (E) structures appeared concurrently and relatively late. These general trends, captured in the box plots of Figure 10, have been recovered repeatedly when studying domain structures at various levels of structural complexity, from folds to fold families [91, 126], when using CATH or SCOP structural definitions [122, 145], or when exploring the evolution of terminal GO terms.
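The node distance can be made concrete with a toy example. The sketch below assumes one plausible counting convention (internal nodes crossed between the root and a leaf, divided by the largest such count, so that the most basal leaf has nd = 0); the nested-tuple tree encoding, the function names, and the leaf labels are hypothetical and are not taken from the cited studies.

```python
def node_distances(tree, depth=0, out=None):
    """Count internal nodes crossed between the root and each terminal leaf."""
    if out is None:
        out = {}
    for child in tree:
        if isinstance(child, str):   # terminal leaf (one FSF)
            out[child] = depth
        else:                        # internal node: one step further from the root
            node_distances(child, depth + 1, out)
    return out


def rescale(raw):
    """Rescale raw node counts to the 0-1 interval."""
    top = max(raw.values())
    if top == 0:
        return {leaf: 0.0 for leaf in raw}
    return {leaf: count / top for leaf, count in raw.items()}


# A small pectinate (comb-like) rooted tree written as nested tuples.
toy_tod = ("fsf_oldest", ("fsf_2", ("fsf_3", ("fsf_4", "fsf_youngest"))))

nd = rescale(node_distances(toy_tod))
for leaf, age in nd.items():
    print(leaf, round(age, 2))
```

Because the toy tree is pectinate, each step away from the root adds one node, so the leaves are ordered along the nd axis in their order of appearance.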

While the early rise of BE FSFs supports the early divergence of Archaea from the urancestor, the very significant trend of gradual loss of structures occurring in the lineages of the archaeal domain and the very late appearance of Archaea-specific structures (e.g., [126]) demand explanation. Since ancient BE FSFs are widely distributed in proteomes (Figure 10), they cannot arise from separate gains of FSFs in Bacteria and Eukarya or by processes of horizontal spread of structures. This was already evident from the Venn diagrams of Figure 6. Moreover, the BE sisterhood to the exclusion of Archaea was further supported by the inspection of FSFs involved in lipid synthesis and transport (Table 1). Membrane lipids are very relevant to the origins of diversified life ([146] and references therein). Bacteria and Eukarya encode similar lipid membranes, while archaeal membranes have different lipid composition (isoprenoid ethers). To check if lipid synthesis was another BE synapomorphy, we identified 17 FSFs that were involved in lipid metabolism and transport (Table 1). Remarkably, the largest share of these FSFs (7 out of 17) was unique to the BE group. In comparison, none were present in either AE or AB groups. The ABE group included five universal FSFs, while one was unique to Bacteria and four were Eukarya-specific. In turn, no FSF was unique to Archaea. The BE FSFs cannot be explained by modern effects impinging on variations in proteomic accumulation in FSFs or by processes of domain rearrangement, since these appear in the protein world quite late in evolution [92]. The only and most-parsimonious explanation of the patterns of FSF distribution that unfold in the ToD is the very early (and protracted) rise of the archaeal domain by processes of reductive evolution, possibly triggered by the adaptation of urancestral lineages to harsh environments and survival modes. Under extremophilic conditions, typical of hyperthermophilic environments, considerable investments of matter-energy and information must be made for protein persistence [120]. This puts limits on viable protein structures [147]. Extremophilic environments will thus poise the maintenance of a limited set of FSFs for persistence of emergent diversified lineages. This would induce a primordial episode of reductive evolution in the growing FSF repertoire, explaining why hyperthermophilic and thermophilic archaeal species hold the most reduced proteomes (Figure 7). It would also explain the biases that exist in FSFs, GO terms, and RNA families (Figure 6) and the placement of hyperthermophilic and thermophilic archaeal species at the base of ToLs. Since Archaea populate the oceans and sometimes rival in number Bacteria in those environments, we further interpret the late appearance of Archaea-specific FSFs as the result of late colonization of these mild environments by both ancient archaeons and emerging eukaryotes. This relaxes primordial extremophilic pressures on protein structures and enables the late archaeal exploration of structural flexibility and functional novelty.

Figure 10: Phylogenomic tree of domains (ToD) describing the evolution of 1733 FSF domain structures. Taxa (FSFs) were colored according to their distribution (f) in the 981 genomes that were surveyed and used as characters to reconstruct the phylogenomic tree (data taken from [122, 123]). The most basal FSFs are labeled with SCOP alphanumeric identifiers (e.g., c.37.1 is the P-loop containing nucleoside triphosphate hydrolase FSF). Boxplots display the age (nd value) distribution of FSFs for the seven possible taxonomic groups. nd values were calculated directly from the tree [122] and define a timeline of FSF innovation from the origin of proteins (nd = 0) to the present (nd = 1). The group of FSFs that are shared by the three domains of life (ABE) is the most ancient taxonomic group, which spans the entire time axis, and their FSFs are widely distributed in genomes. The appearance of the BE group coincides with the first reductive loss of an FSF in Archaea. FSF structures specific to domains of life appear much later in evolution. [Tree statistics shown in the figure: taxa = 1733 FSF domains; characters = 981 genomes; tree length = 458,226 steps; CI = 0.049, RI = 0.796, and g1 = −0.08. Number of FSFs per distribution class: f = 1.0 (17); f from 0.9 to 1.0 (228); f from 0.5 to 0.9 (336); f from 0.1 to 0.5 (623); f from 0.0 to 0.1 (529).]

Table 1: List of FSFs involved in lipid metabolism and transport along with taxonomic distribution (data taken from [122, 123]).

Group  SCOP Id  FSF Id    FSF description
ABE    89392    b.125.1   Prokaryotic lipoproteins and lipoprotein localization factors
ABE    53092    c.55.2    Creatinase/prolidase N-terminal domain
ABE    49723    b.12.1    Lipase/lipooxygenase domain (PLAT/LH2 domain)
ABE    54637    d.38.1    Thioesterase/thiol ester dehydrase-isomerase
ABE    63825    b.68.5    YWTD domain
BE     47027    a.11.1    Acyl-CoA binding protein
BE     48431    a.118.4   Lipovitellin-phosvitin complex, superhelical domain
BE     55048    d.58.23   Probable ACP-binding domain of malonyl-CoA ACP transacylase
BE     56968    f.7.1     Lipovitellin-phosvitin complex, beta-sheet shell regions
BE     58113    h.5.1     Apolipoprotein A-I
BE     47162    a.24.1    Apolipoprotein
BE     56931    f.4.2     Outer membrane phospholipase A (OMPLA)
B      82220    b.120.1   Tp47 lipoprotein, N-terminal domain
E      47699    a.52.1    Bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin
E      82936    h.6.1     Apolipoprotein A-II
E      57190    g.3.10    Colipase-like
E      49594    b.7.4     Rab geranylgeranyltransferase alpha-subunit, insert domain
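The group counts quoted from Table 1 can be re-tallied in a few lines. The list below simply transcribes the Group column of the table (five ABE, seven BE, one B, and four E rows); the variable names are hypothetical.

```python
from collections import Counter

# Group labels of the 17 lipid-related FSFs, in the row order of Table 1.
table1_groups = ["ABE"] * 5 + ["BE"] * 7 + ["B"] * 1 + ["E"] * 4

tally = Counter(table1_groups)
total = sum(tally.values())

# FSFs that are Archaea-specific or shared by Archaea with only one other domain.
archaeal_minority_groups = tally["A"] + tally["AB"] + tally["AE"]
be_share = tally["BE"] / total

print(total, tally["BE"], archaeal_minority_groups, round(be_share, 2))
# prints: 17 7 0 0.41
```

The tally shows that BE is the largest single group (7 of 17, about 41%) and that the A, AB, and AE groups are empty for this set of FSFs.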

6.4. Full Circle: Evidence from Trees of Proteomes and Functionomes and a Tree Derived from the Distribution of Viral Replicons in Superkingdoms. While we distrust trees of systems, especially ToLs built from sequences, the use of molecular structure at high levels of structural abstraction has the potential to mitigate some limitations of sequence analysis [93]. For example, rooted ToLs built from abundance counts of domain structures and terminal GO terms in the genomes of free-living organisms describe the evolution of proteomes (e.g., [91]). All ToL reconstructions of these kinds approximate the physiology of living organisms, dissect the three primary domains of life, and reveal the early paraphyletic origin of extremophilic archaeal lineages followed by the late appearances of monophyletic Bacteria and Eukarya. These patterns have been reliably recovered with datasets of varying sizes, irrespective of the structural classification scheme [91, 122, 126, 145]. Even a tree reconstructed from the distribution of 2662 viral replicons in superkingdoms, from an exhaustive comparative genomic analysis of viral genomes, showed the basal placement of Archaea and the sister taxa relationship between Bacteria and Eukarya (Figure 11).

6.5. Additional Evidence from Comparative Genomics. The uneven distribution of protein domain structures in the world of proteomes (Figure 6) is preserved as we climb in the structural hierarchy. This was recently made evident when studying the evolution of CATH domains [145]. The Venn diagrams of Figure 12 show how domain structures in all taxonomic groups decrease in numbers with increases in evolutionary conservation. At the highest CATH architectural level, there were 32 universal architectures but no domain-specific architectures. The four architectures shared by Bacteria and Eukarya were present in at least 60% of the proteomes that were surveyed. The other two interdomain architectures were topological designs of considerable complexity that were poorly shared between proteomes and were evolutionarily derived. They were the clam architectures lost in Eukarya and the box architectures of nucleotide excision repair shared by Archaea and Eukarya. The most parsimonious corollary of these distribution patterns is that the BE taxonomical group must arise by loss of structures in Archaea. Indeed, ToDs describing the evolution of CATH domain structures again confirm the early appearance of BE structures and, consequently, their loss in Archaea.

Figure 11: One of three optimal phylogenetic tree reconstructions with identical topologies recovered from an exhaustive maximum parsimony search (58 steps; CI = 1, RI = 1, HI = 0, RC = 1, g1 = −0.707) of the abundance of viral replicon types of dsDNA, ssDNA, dsRNA, ssRNA(+), ssRNA(−), and retrotranscribing viruses. Abundance was scored on a 0–20 scale and ranged from 0 to 759 viral replicons. Vectors of abundance reconstructions in internal nodes are given as percentage of total abundance of replicons in superkingdoms. Bootstrap support values are shown below nodes.

7. Paraphyletic Origins: Grades and Clades in Archaeal History

Saying that “a ToL is rooted in a domain of life” is an incorrect statement that comes from phylogenetic methodology (the use of outgroups) and the tendency to look at the past with modern eyes. Clades in ToLs have been rooted relative to each other by generating unrooted trees and by defining extant organisms as outgroup taxa. The ToL, however, must be considered rooted in the urancestor of cellular life (Figure 1). This planted edge that connects to the ingroup of extant organisms represents a cellular state in which productive diversification (in the sense of successful lineages) was absent. The primordial urancestral edge leads to a “phase transition,” the last universal cellular ancestor (LUCA), of which little is known. The physiology of the urancestor cannot be considered linked to that of any extant organism, even if it shared a common molecular core with all of them. The urancestor was not an archaeon, a bacterium, or a eukaryote [148]. It was not necessarily thermophilic. Perhaps it was a communal entity or a megametaorganism in the sense of a modern syncytium (the result of multiple cell fusions) and a modern coenocyte (the result of multiple cell divisions). The organismal boundaries were likely present, judging by the number of widely distributed protein domains that associate with membranes and appear at the base of our ToDs and by the universal existence of acidocalcisome organelles [149]. However, the molecular makeup of the urancestral cells was most likely fluid and quasistatistical, with the repertoire distributed unequally in the urancestral populations of communal parts, of course within confines delimited by persistence. This urancestral population is therefore consistent with the idea of a primordial stem line proposed by Kandler and Woese [150, 151]. However, it was relatively richer in molecular structures and functions, as opposed to the simple cellular systems hypothesized by Woese. This richness is confirmed by modern analyses of proteomes and functionomes that reveal a vast number of universally shared protein domains and GOs among the three superkingdoms. While each syncytial/coenocyte element of the megaorganism exchanged component parts in search of cellular stability and persistence, the process could not be equated with modern HGT. The exchanging community of primordial cells was not cohesive enough to make the horizontal exchange meaningful. Macromolecules most likely established loose and diverse associations with each other and with smaller molecules, limited by the short average life of their unevolved structural conformations. With time, molecules with better-optimized properties engaged in more durable interactions, stabilizing the emergent cells and providing increased cellular cohesiveness. This poised the urancestral community towards a phase transition (a crystallization [148]), a point in which cellular groups had distinct properties and could be individuated. We believe this was the time of the origin of the archaeal lineage, 2.9 billion years ago [48].

At the base of the ToLs that were reconstructed from genomic data, basal archaeal taxa arise as paraphyletic lineages (Figures 4(a) and 4(c)). These lineages likely arose from subgroups of the urancestral population that pervasively lost crucial domain structures and molecular functions. This represents an evolutionary grade under the scenario described above. The emerging lineages shared with the urancestral community a unifying condition that was related to archaic biochemistry. In other words, the urancestral and emerging archaeal lineages expressed fundamental structural and functional equivalences in terms of their repertoires but revealed in each emerging and durable paraphyletic lineage a handful of distinct newly developed traits. These traits could be global, such as increased thermostability of some crucial members of the protein repertoire or change in the membrane makeup, or local, such as the selective loss of crucial structures and functions. Figure 13 uses the tree paradigm to portray the structural and functional equivalences of the urancestral and emerging archaeal lineages and the slow progression from grades to clades.

Figure 12: Venn diagrams displaying the distributions of 2221 homologous superfamilies, 1152 topologies, and 38 architectures of CATH domains in the proteomes of 492 fully sequenced genomes (from [145]). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

Figure 13: From grades to clades. The cartoon describes a possible progression of modes of organismal diversification during the rise of primordial archaeal lineages. The width of emerging lineages is proportional to uniquely identifying features of physiological and molecular complexity.

8. Through the Wormhole: The Makeup of the “Megaorganism” and the Emerging Archaeal Lineages

Character state reconstructions of proteome repertoires derived from ToLs, coupled to the timelines of ToDs, provide an effective way to define the ancestral protein domain complement of the urancestor [48] and, consequently, the likely makeup of the emerging archaeal lineages. The urancestral proteome possessed a lower bound of ~70 FSF domain structures, 75% of which were composed of α/β and α+β proteins. About 50% of FSFs were part of metabolic enzymes, including a rich toolkit of transferases and enzymes of nucleotide metabolism. The rest of domains were involved in functions related to information (translation, replication, and repair), intracellular processes (transport, protein modification, and proteolytic activities), regulation (kinases/phosphatases and DNA binding functions), and small molecule binding. The urancestor had a limited repertoire of aaRSs and translation factors. It contained a primordial ribosome with a limited core of universal ribosomal proteins. It had numerous membrane proteins necessary for transport, including a relatively advanced ATP synthetase complex, and structures necessary for cellular organization (filaments and primordial cytoskeletal structures). The cells lacked enzymes for deoxyribonucleotide production, so it is likely that the cellular urancestor itself did not harbor a DNA genome. The cells lacked functions related to extracellular processes (cell adhesion, immune response, and toxins/defense) and cellular motility, suggesting an ancient living world without competitive strategies of survival.

9. Conclusions

The rooting of the ToL has always been controversial in evolutionary biology [26, 152, 153]. While it is popularly accepted that the ToL based on sequence phylogenies is rooted in the akaryotes and that Archaea and Eukarya are sister groups to each other, only two of the three main steps of phylogenetic analysis [104] have been partially fulfilled with sequences. This includes selecting an appropriate statistical or nonstatistical evolutionary model of character change and an optimization method for phylogenetic tree reconstruction. However, no adequate method exists for character polarization that identifies ancestral and derived character states in sequences. In the absence of robust polarization methodology, any statement about the rooting of the ToL should be considered suspect or subject of apriorism. Here we show that information derived from a genomic structural and functional census of millions of encoded proteins and RNAs, coupled with process models that comply with Weston’s generality criterion, provides the means to dissect the origins of diversified life. The generality criterion is fulfilled in these studies by focusing on the accumulation of modules such as protein domain structures, elements of RNA substructures, or ontogenetic definitions of molecular function. In general, these features are the subject of accretion processes that comply with additive phylogenetic change within the nested taxonomic hierarchy and result in changes of abundance. These processes include those responsible for the growth of molecules (e.g., multidomain proteins), molecular ensembles (e.g., the ribosome), and molecular repertoires (e.g., proteomes). The new methods unfold a consistent evolutionary scenario in which the origin of diversified life traces back to the early history of Archaea. Remarkably, the archaic origin of this microbial urkingdom now does justice to its name.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank members and friends of the GCA laboratory for fruitful discussions. This research has been supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791) and the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625) to Gustavo Caetano-Anollés and from KRIBB Research Initiative Program and the Next-Generation BioGreen 21 Program, Rural Development Administration (PJ0090192013), to Kyung Mo Kim.

References

[1] C. R. Woese, “A new biology for a new century,” Microbiology and Molecular Biology Reviews, vol. 68, no. 2, pp. 173–186, 2004.

[2] T. Grant, “Testing methods: the evaluation of discovery operations in evolutionary biology,” Cladistics, vol. 18, no. 1, pp. 94–111, 2002.

[3] E. O. Wiley, “Karl R. Popper, systematics, and classification: a reply to Walter Bock and other evolutionary taxonomists,” Systematic Zoology, vol. 24, no. 2, pp. 233–243, 1975.

[4] J. S. Farris, “Parsimony and explanatory power,” Cladistics, vol. 24, no. 5, pp. 825–847, 2008.

[5] G. Caetano-Anollés, K. M. Kim, and D. Caetano-Anollés, “The phylogenomic roots of modern biochemistry: origins of proteins, cofactors and protein biosynthesis,” Journal of Molecular Evolution, vol. 74, no. 1-2, pp. 1–34, 2012.

[6] W. Fontana, “Modeling ‘evo-devo’ with RNA,” BioEssays, vol. 24, no. 12, pp. 1164–1177, 2002.

[7] W. Hennig, Phylogenetic Systematics, University of Illinois Press, Urbana, Ill, USA, 1999.

[8] D. Fraix-Burnet, T. Chattopadhyay, A. K. Chattopadhyay, E. Davoust, and M. Thuillard, “A six-parameter space to describe galaxy diversification,” Astronomy and Astrophysics, vol. 545, article A80, 24 pages, 2012.

[9] E. Sober, “The contest between parsimony and likelihood,” Systematic Biology, vol. 53, no. 4, pp. 644–653, 2004.

[10] E K Lienau and R DeSalle ldquoIs the microbial tree of lifeverificationistrdquo Cladistics vol 26 no 2 pp 195ndash201 2010

[11] J S Huxley ldquoEvolutionary processes and taxonomywith specialreference to gradesrdquoUppsalaUniversitets Arsskrift vol 6 pp 21ndash39 1958

[12] C R Woese and G E Fox ldquoPhylogenetic structure of theprokaryotic domain the primary kingdomsrdquo Proceedings of theNational Academy of Sciences of the United States of Americavol 74 no 11 pp 5088ndash5090 1977

[13] P Forterre ldquoThe universal tree of life and the last universalcellular ancestor revolution and counterrevolutionsrdquo in Evolu-tionary Genomics and Systems Biology G Caetano-Anolles Edpp 43ndash62 Wiley-Blackwell Hoboken NJ USA 2010

[14] J P Gogarten H Kibak P Dittrich et al ldquoEvolution of thevacuolar H+-ATPase implications for the origin of eukaryotesrdquoProceedings of the National Academy of Sciences of the UnitedStates of America vol 86 no 17 pp 6661ndash6665 1989

[15] N Iwabe K Kuma M Hasegawa S Osawa and T MiyataldquoEvolutionary relationship of archaebacteria eubacteria andeukaryotes inferred from phylogenetic trees of duplicatedgenesrdquo Proceedings of the National Academy of Sciences of theUnited States of America vol 86 no 23 pp 9355ndash9359 1989

[16] J P Gogarten and L Olendzenski ldquoOrthologs paralogs andgenome comparisonsrdquo Current Opinion in Genetics and Devel-opment vol 9 no 6 pp 630ndash636 1999

[17] C. R. Woese, O. Kandler, and M. L. Wheelis, “Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya,” Proceedings of the National Academy of Sciences of the United States of America, vol. 87, no. 12, pp. 4576–4579, 1990.

[18] J. Martijn and T. J. Ettema, “From archaeon to eukaryote: the evolutionary dark ages of the eukaryotic cell,” Biochemical Society Transactions, vol. 41, no. 1, pp. 451–457, 2013.

[19] C. J. Cox, P. G. Foster, R. P. Hirt, S. R. Harris, and T. M. Embley, “The archaebacterial origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 51, pp. 20356–20361, 2008.

[20] P. G. Foster, C. J. Cox, and T. M. Embley, “The primary divisions of life: a phylogenomic approach employing composition-heterogeneous methods,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2197–2207, 2009.

[21] T. A. Williams, P. G. Foster, T. M. Nye, C. J. Cox, and T. M. Embley, “A congruent phylogenomic signal places eukaryotes within the Archaea,” Proceedings of the Royal Society B: Biological Sciences, vol. 279, no. 1749, pp. 4870–4879, 2012.

[22] S. Gribaldo, A. M. Poole, V. Daubin, P. Forterre, and C. Brochier-Armanet, “The origin of eukaryotes and their relationship with the Archaea: are we at a phylogenomic impasse?” Nature Reviews Microbiology, vol. 8, no. 10, pp. 743–752, 2010.

[23] H. Philippe and P. Forterre, “The rooting of the universal tree of life is not reliable,” Journal of Molecular Evolution, vol. 49, no. 4, pp. 509–523, 1999.

[24] E. Sober and M. Steel, “Testing the hypothesis of common ancestry,” Journal of Theoretical Biology, vol. 218, no. 4, pp. 395–408, 2002.

[25] P. Forterre, “The common ancestor of Archaea and Eukarya was not an archaeon,” Archaea, vol. 2013, Article ID 372396, 18 pages, 2013.

[26] P. Forterre and H. Philippe, “Where is the root of the universal tree of life?” BioEssays, vol. 21, no. 10, pp. 871–879, 1999.

[27] S. Gribaldo and H. Philippe, “Ancient phylogenetic relationships,” Theoretical Population Biology, vol. 61, no. 4, pp. 391–408, 2002.

[28] J. K. Harris, S. T. Kelley, G. B. Spiegelman, and N. R. Pace, “The genetic core of the universal ancestor,” Genome Research, vol. 13, no. 3, pp. 407–412, 2003.

[29] A. Harish and G. Caetano-Anollés, “Ribosomal history reveals origins of modern protein synthesis,” PLoS ONE, vol. 7, no. 3, Article ID e32776, 2012.

[30] G. Caetano-Anollés and M. J. Seufferheld, “The coevolutionary roots of biochemistry and cellular organization challenge the RNA world paradigm,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 152–177, 2013.

[31] N. R. Pace, “Mapping the tree of life: progress and prospects,” Microbiology and Molecular Biology Reviews, vol. 73, no. 4, pp. 565–576, 2009.

[32] P. de Rijk, Y. van de Peer, I. van den Broeck, and R. de Wachter, “Evolution according to large ribosomal subunit RNA,” Journal of Molecular Evolution, vol. 41, no. 3, pp. 366–375, 1995.

[33] G. Caetano-Anollés, “Tracing the evolution of RNA structure in ribosomes,” Nucleic Acids Research, vol. 30, no. 11, pp. 2575–2587, 2002.

[34] J. Mallatt and C. J. Winchell, “Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes,” Molecular Biology and Evolution, vol. 19, no. 3, pp. 289–301, 2002.

[35] M. Gerstein, “Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census,” Proteins: Structure, Function and Bioinformatics, vol. 33, no. 4, pp. 518–534, 1998.

[36] M. Gerstein and H. Hegyi, “Comparing genomes in terms of protein structure: surveys of a finite parts list,” FEMS Microbiology Reviews, vol. 22, no. 4, pp. 277–304, 1998.

[37] B. Snel, P. Bork, and M. A. Huynen, “Genome phylogeny based on gene content,” Nature Genetics, vol. 21, no. 1, pp. 108–110, 1999.

[38] F. Tekaia, A. Lazcano, and B. Dujon, “The genomic tree as revealed from whole proteome comparisons,” Genome Research, vol. 9, no. 6, pp. 550–557, 1999.

[39] Y. I. Wolf, S. E. Brenner, P. A. Bash, and E. V. Koonin, “Distribution of protein folds in the three superkingdoms of life,” Genome Research, vol. 9, no. 1, pp. 17–26, 1999.

[40] J. O. Korbel, B. Snel, M. A. Huynen, and P. Bork, “SHOT: a web server for the construction of genome phylogenies,” Trends in Genetics, vol. 18, no. 3, pp. 158–162, 2002.

[41] G. Caetano-Anollés and D. Caetano-Anollés, “An evolutionarily structured universe of protein architecture,” Genome Research, vol. 13, no. 7, pp. 1563–1571, 2003.

[42] S. L. Baldauf, A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle, “A kingdom-level phylogeny of eukaryotes based on combined protein data,” Science, vol. 290, no. 5493, pp. 972–977, 2000.

[43] J. R. Brown, C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope, “Universal trees based on large combined protein sequence data sets,” Nature Genetics, vol. 28, no. 3, pp. 281–285, 2001.

[44] F. D. Ciccarelli, T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P. Bork, “Toward automatic reconstruction of a highly resolved tree of life,” Science, vol. 311, no. 5765, pp. 1283–1287, 2006.

[45] T. Dagan, M. Roettger, D. Bryant, and W. Martin, “Genome networks root the tree of life between prokaryotic domains,” Genome Biology and Evolution, vol. 2, no. 1, pp. 379–392, 2010.

[46] S. Jun, G. E. Sims, G. A. Wu, and S. Kim, “Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 133–138, 2010.

[47] E. A. Schultes, P. T. Hraber, and T. H. LaBean, “No molecule is an island: molecular evolution and the study of sequence space,” in Algorithmic Bioprocesses, A. Condon, D. Harel, J. N. Kok, A. Salomaa, and E. Winfree, Eds., pp. 675–704, Springer, Berlin, Germany, 2009.

[48] K. M. Kim and G. Caetano-Anollés, “The proteomic complexity and rise of the primordial ancestor of diversified life,” BMC Evolutionary Biology, vol. 11, no. 1, article 140, 2011.

[49] V. A. Albert, Parsimony, Phylogeny, and Genomics, Oxford University Press, Oxford, UK, 2005.

[50] G. Caetano-Anollés, “Evolved RNA secondary structure and the rooting of the universal tree of life,” Journal of Molecular Evolution, vol. 54, no. 3, pp. 333–345, 2002.

[51] J. A. Lake, R. G. Skophammer, C. W. Herbold, and J. A. Servin, “Genome beginnings: rooting the tree of life,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2177–2185, 2009.

[52] B. Guo, M. Zou, and A. Wagner, “Pervasive indels and their evolutionary dynamics after the fish-specific genome duplication,” Molecular Biology and Evolution, vol. 29, no. 10, pp. 3005–3022, 2012.

[53] M. K. Basu, I. B. Rogozin, O. Deusch, T. Dagan, W. Martin, and E. V. Koonin, “Evolutionary dynamics of introns in plastid-derived genes in plants: saturation nearly reached but slow intron gain continues,” Molecular Biology and Evolution, vol. 25, no. 1, pp. 111–119, 2008.

[54] J. S. Farris, “Phylogenetic analysis under Dollo’s Law,” Systematic Biology, vol. 26, no. 1, pp. 77–88, 1977.

[55] H. Fang, M. E. Oates, R. B. Pethica et al., “A daily-updated tree of (sequenced) life as a reference for genome research,” Scientific Reports, vol. 3, article 2015, 2013.

[56] T. Cavalier-Smith, “The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification,” International Journal of Systematic and Evolutionary Microbiology, vol. 52, no. 1, pp. 7–76, 2002.

[57] T. Cavalier-Smith, “Rooting the tree of life by transition analyses,” Biology Direct, vol. 1, article 19, 2006.

[58] H. S. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Widespread recruitment of ancient domain structures in modern enzymes during metabolic evolution,” Journal of Integrative Bioinformatics, vol. 10, no. 1, article 214, 2013.

[59] G. Caetano-Anollés, M. Wang, and D. Caetano-Anollés, “Structural phylogenomics retrodicts the origin of the genetic code and uncovers the evolutionary impact of protein flexibility,” PLoS ONE, vol. 8, no. 8, Article ID e72225, 2013.

[60] E. Bapteste and C. Brochier, “On the conceptual difficulties in rooting the tree of life,” Trends in Microbiology, vol. 12, no. 1, pp. 9–13, 2004.

[61] A. M. Poole, “Horizontal gene transfer and the earliest stages of the evolution of life,” Research in Microbiology, vol. 160, no. 7, pp. 473–480, 2009.

[62] D. Raoult, “The post-Darwinist rhizome of life,” The Lancet, vol. 375, no. 9709, pp. 104–105, 2010.

[63] W. F. Doolittle, “Phylogenetic classification and the universal tree,” Science, vol. 284, no. 5423, pp. 2124–2128, 1999.

[64] T. Dagan and W. Martin, “The tree of one percent,” Genome Biology, vol. 7, no. 10, article 118, 2006.

[65] W. F. Doolittle and E. Bapteste, “Pattern pluralism and the tree of life hypothesis,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 7, pp. 2043–2049, 2007.

[66] E. V. Koonin, “Towards a postmodern synthesis of evolutionary biology,” Cell Cycle, vol. 8, no. 6, pp. 799–800, 2009.

[67] T. Kloesges, O. Popa, W. Martin, and T. Dagan, “Networks of gene sharing among 329 proteobacterial genomes reveal differences in lateral gene transfer frequency at different phylogenetic depths,” Molecular Biology and Evolution, vol. 28, no. 2, pp. 1057–1074, 2011.

[68] S. Halary, J. W. Leigh, B. Cheaib, P. Lopez, and E. Bapteste, “Network analyses structure genetic diversity in independent genetic worlds,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 127–132, 2010.

[69] S. S. Abby, E. Tannier, M. Gouy, and V. Daubin, “Lateral gene transfer as a support for the tree of life,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 13, pp. 4962–4967, 2012.

[70] N. Glansdorff, Y. Xu, and B. Labedan, “The conflict between horizontal gene transfer and the safeguard of identity: origin of meiotic sexuality,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 470–480, 2009.

[71] C. G. Kurland and O. G. Berg, “A hitchhiker’s guide to evolving networks,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 361–396, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[72] L. P. Villarreal and G. Witzany, “The DNA habitat and its RNA inhabitants: at the dawn of RNA sociology,” Genomics Insights, vol. 6, pp. 1–12, 2013.

[73] S. Gribaldo and C. Brochier, “Phylogeny of prokaryotes: does it exist and why should we care?” Research in Microbiology, vol. 160, no. 7, pp. 513–521, 2009.

[74] B. D. Mishler, “The logic of the data matrix in phylogenetic analysis,” in Parsimony, Phylogeny, and Genomics, V. A. Albert, Ed., pp. 57–69, Oxford University Press, New York, NY, USA, 2005.

[75] A. Dress, V. Moulton, M. Steel, and T. Wu, “Species, clusters and the “tree of life”: a graph-theoretic perspective,” Journal of Theoretical Biology, vol. 265, no. 4, pp. 535–542, 2010.

[76] N. Lane and W. Martin, “The energetics of genome complexity,” Nature, vol. 467, no. 7318, pp. 928–934, 2010.

[77] M. C. Rivera and J. A. Lake, “The ring of life provides evidence for a genome fusion origin of eukaryotes,” Nature, vol. 431, no. 7005, pp. 152–155, 2004.

[78] D. Alvarez-Ponce and J. O. McInerney, “The human genome retains relics of its prokaryotic ancestry: human genes of archaebacterial and eubacterial origin exhibit remarkable differences,” Genome Biology and Evolution, vol. 3, no. 1, pp. 782–790, 2011.

[79] A. Poole and D. Penny, “Eukaryote evolution: engulfed by speculation,” Nature, vol. 447, no. 7147, p. 913, 2007.

[80] E. V. Koonin and Y. I. Wolf, “The fundamental units, processes and patterns of evolution, and the tree of life conundrum,” Biology Direct, vol. 4, no. 1, article 33, 2009.

[81] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Search for a “tree of life” in the thicket of the phylogenetic forest,” Journal of Biology, vol. 8, no. 6, article 59, 2009.

[82] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Seeing the tree of life behind the phylogenetic forest,” BMC Biology, vol. 11, article 46, 2013.

[83] D. B. Wetlaufer, “Nucleation, rapid folding, and globular intrachain regions in proteins,” Proceedings of the National Academy of Sciences of the United States of America, vol. 70, no. 3, pp. 697–701, 1973.

[84] J. S. Richardson, “The anatomy and taxonomy of protein structure,” Advances in Protein Chemistry, vol. 34, pp. 167–339, 1981.

[85] M. Riley and B. Labedan, “Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module,” Journal of Molecular Biology, vol. 268, no. 5, pp. 857–868, 1997.

[86] J. Janin and S. J. Wodak, “Structural domains in proteins and their role in the dynamics of protein function,” Progress in Biophysics and Molecular Biology, vol. 42, pp. 21–78, 1983.

[87] G. Caetano-Anollés, M. Wang, D. Caetano-Anollés, and J. E. Mittenthal, “The origin, evolution and structure of the protein world,” Biochemical Journal, vol. 417, no. 3, pp. 621–637, 2009.

[88] J. Gough, “Convergent evolution of domain architectures (is rare),” Bioinformatics, vol. 21, no. 8, pp. 1464–1471, 2005.

[89] K. Forslund, A. Henricson, V. Hollich, and E. L. L. Sonnhammer, “Domain tree-based analysis of protein architecture evolution,” Molecular Biology and Evolution, vol. 25, no. 2, pp. 254–264, 2008.

[90] S. Yang and P. E. Bourne, “The evolutionary history of protein domains viewed by species phylogeny,” PLoS ONE, vol. 4, no. 12, Article ID e8378, 2009.

[91] K. M. Kim and G. Caetano-Anollés, “The evolutionary history of protein fold families and proteomes confirms that the archaeal ancestor is more ancient than the ancestors of other superkingdoms,” BMC Evolutionary Biology, vol. 12, article 13, 2012.

[92] M. Wang and G. Caetano-Anollés, “The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world,” Structure, vol. 17, no. 1, pp. 66–78, 2009.

[93] G. Caetano-Anollés and A. Nasir, “Benefits of using molecular structure and abundance in phylogenomic analysis,” Frontiers in Genetics, vol. 3, article 172, 2012.

[94] K. Brading and E. Castellani, Symmetries in Physics: Philosophical Reflections, Cambridge University Press, Cambridge, UK, 2003.

[95] J. A. Cotton and M. Wilkinson, “Quantifying the potential utility of phylogenetic characters,” Taxon, vol. 57, no. 1, pp. 131–136, 2008.

[96] E. Mossel and M. Steel, “A phase transition for a random cluster model on phylogenetic trees,” Mathematical Biosciences, vol. 187, no. 2, pp. 189–203, 2004.

[97] M. H. Bailor, X. Sun, and H. M. Al-Hashimi, “Topology links RNA secondary structure with global conformation, dynamics, and adaptation,” Science, vol. 327, no. 5962, pp. 202–206, 2010.

[98] K. Illergård, D. H. Ardell, and A. Elofsson, “Structure is three to ten times more conserved than sequence – a study of structural response in protein cores,” Proteins: Structure, Function and Bioinformatics, vol. 77, no. 3, pp. 499–508, 2009.

[99] G. Caetano-Anollés, S. K. Hee, and J. E. Mittenthal, “The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 22, pp. 9358–9363, 2007.

[100] P. H. Weston, “Methods for rooting cladistic trees,” in Models in Phylogeny Reconstruction, D. J. Siebert, R. W. Scotland, and D. M. Williams, Eds., pp. 125–155, Oxford University Press, Oxford, UK, 1994.

[101] H. N. Bryant, “The polarization of character transformations in phylogenetic systematics: role of axiomatic and auxiliary assumptions,” Systematic Biology, vol. 40, no. 4, pp. 433–445, 1991.

[102] H. N. Bryant and G. Wagner, “Character polarity and the rooting of cladograms,” in The Character Concept in Evolutionary Biology, G. P. Wagner, Ed., pp. 319–338, Academic Press, New York, NY, USA, 2001.

[103] G. Nelson, “Ontogeny, phylogeny, paleontology, and the biogenetic law,” Systematic Biology, vol. 27, no. 3, pp. 324–345, 1978.

[104] P. H. Weston, “Indirect and direct methods in systematics,” in Ontogeny and Systematics, C. J. Humphries, Ed., pp. 27–56, Columbia University Press, New York, NY, USA, 1988.

[105] F.-J. Sun and G. Caetano-Anollés, “The evolutionary history of the structure of 5S ribosomal RNA,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 430–443, 2009.

[106] X. Fang, T. Pan, and T. R. Sosnick, “Mg2+-dependent folding of a large ribozyme without kinetic traps,” Nature Structural Biology, vol. 6, no. 12, pp. 1091–1095, 1999.

[107] F.-J. Sun and G. Caetano-Anollés, “Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses,” PLoS Computational Biology, vol. 4, no. 3, Article ID e1000018, 2008.

[108] F.-J. Sun and G. Caetano-Anollés, “The origin and evolution of tRNA inferred from phylogenetic analysis of structure,” Journal of Molecular Evolution, vol. 66, no. 1, pp. 21–35, 2008.

[109] W. R. Ashby, An Introduction to Cybernetics, Taylor & Francis, London, UK, 1955.

[110] J. G. Lundberg, “Wagner networks and ancestors,” Systematic Biology, vol. 21, no. 4, pp. 398–413, 1972.

[111] P. T. Chippindale and J. J. Wiens, “Weighting, partitioning, and combining characters in phylogenetic analysis,” Systematic Biology, vol. 43, no. 2, pp. 278–287, 1994.

[112] J. Mittenthal, D. Caetano-Anollés, and G. Caetano-Anollés, “Biphasic patterns of diversification and the emergence of modules,” Frontiers in Genetics, vol. 3, article 147, 2012.

[113] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[114] C. A. Orengo, A. Michie, S. Jones, D. T. Jones, M. Swindells, and J. M. Thornton, “CATH – a hierarchic classification of protein domain structures,” Structure, vol. 5, no. 8, pp. 1093–1109, 1997.

[115] G. Sawa, J. Dicks, and I. N. Roberts, “Current approaches to whole genome phylogenetic analysis,” Briefings in Bioinformatics, vol. 4, no. 1, pp. 63–74, 2003.

[116] R. Sathyapriya, J. M. Duarte, H. Stehr, I. Filippis, and M. Lappe, “Defining an essence of structure determining residue contacts in proteins,” PLoS Computational Biology, vol. 5, no. 12, Article ID e1000584, 2009.

[117] D. Penny and L. J. Collins, “Evolutionary genomics leads the way,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 1–16, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[118] M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.

[119] A. Andreeva, D. Howorth, J. Chandonia et al., “Data growth and its impact on the SCOP database: new developments,” Nucleic Acids Research, vol. 36, pp. D419–D425, 2008.

[120] L. S. Yafremava, M. Wielgos, S. Thomas et al., “A general framework of persistence strategies for biological systems helps explain domains of life,” Frontiers in Genetics, vol. 4, article 16, 2013.

[121] J. L. Friar, T. Goldman, and J. Pérez-Mercader, “Genome sizes and the Benford distribution,” PLoS ONE, vol. 7, no. 5, Article ID e36624, 2012.

[122] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya,” BMC Evolutionary Biology, vol. 12, article 156, 2012.

[123] A. Nasir and G. Caetano-Anollés, “Comparative analysis of proteomes and functionomes provides insights into origins of cellular diversification,” Archaea, vol. 2013, Article ID 648746, 13 pages, 2013.

[124] W. Martin and M. Müller, “The hydrogen hypothesis for the first eukaryote,” Nature, vol. 392, no. 6671, pp. 37–41, 1998.

[125] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Global patterns of protein domain gain and loss in superkingdoms,” PLoS Computational Biology, vol. 10, no. 1, Article ID e1003452, 2014.

[126] M. Wang, L. S. Yafremava, D. Caetano-Anollés, J. E. Mittenthal, and G. Caetano-Anollés, “Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world,” Genome Research, vol. 17, no. 11, pp. 1572–1585, 2007.

[127] L. Guy and T. J. G. Ettema, “The archaeal “TACK” superphylum and the origin of eukaryotes,” Trends in Microbiology, vol. 19, no. 12, pp. 580–587, 2011.

[128] D. Schlieper, M. A. Oliva, J. M. Andreu, and J. Löwe, “Structure of bacterial tubulin BtubA/B: evidence for horizontal gene transfer,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 26, pp. 9170–9175, 2005.

[129] M. P. Hoeppner, P. P. Gardner, and A. M. Poole, “Comparative analysis of RNA families reveals distinct repertoires for each domain of life,” PLoS Computational Biology, vol. 8, no. 11, Article ID e1002752, 2012.

[130] H. Xue, K. Tong, C. Marck, H. Grosjean, and J. T. Wong, “Transfer RNA paralogs: evidence for genetic code-amino acid biosynthesis coevolution and an archaeal root of life,” Gene, vol. 310, no. 1-2, pp. 59–66, 2003.

[131] H. Xue, S. Ng, K. Tong, and J. T. Wong, “Congruence of evidence for a Methanopyrus-proximal root of life based on transfer RNA and aminoacyl-tRNA synthetase genes,” Gene, vol. 360, no. 2, pp. 120–130, 2005.

[132] M. Di Giulio, “Nanoarchaeum equitans is a living fossil,” Journal of Theoretical Biology, vol. 242, no. 1, pp. 257–260, 2006.

[133] M. Di Giulio, “The tree of life might be rooted in the branch leading to Nanoarchaeota,” Gene, vol. 401, no. 1-2, pp. 108–113, 2007.

[134] J. T. Wong, J. Chen, W. Mat, S. Ng, and H. Xue, “Polyphasic evidence delineating the root of life and roots of biological domains,” Gene, vol. 403, no. 1-2, pp. 39–52, 2007.

[135] F.-J. Sun and G. Caetano-Anollés, “The ancient history of the structure of ribonuclease P and the early origins of Archaea,” BMC Bioinformatics, vol. 11, article 153, 2010.

[136] C. Marck and H. Grosjean, “tRNomics: analysis of tRNA genes from 50 genomes of Eukarya, Archaea, and bacteria reveals anticodon-sparing strategies and domain-specific features,” RNA, vol. 8, no. 10, pp. 1189–1232, 2002.

[137] W. Mat, H. Xue, and J. T. Wong, “The genomics of LUCA,” Frontiers in Bioscience, vol. 13, no. 14, pp. 5605–5613, 2008.

[138] C. Brochier, S. Gribaldo, Y. Zivanovic, F. Confalonieri, and P. Forterre, “Nanoarchaea: representatives of a novel archaeal phylum or a fast-evolving euryarchaeal lineage related to Thermococcales?” Genome Biology, vol. 6, no. 5, article R42, 2005.

[139] B. Billoud, M. Guerrucci, M. Masselot, and J. S. Deutsch, “Cirripede phylogeny using a novel approach: molecular morphometrics,” Molecular Biology and Evolution, vol. 17, no. 10, pp. 1435–1445, 2000.

[140] L. J. Collins, V. Moulton, and D. Penny, “Use of RNA secondary structure for studying the evolution of RNase P and RNase MRP,” Journal of Molecular Evolution, vol. 51, no. 3, pp. 194–204, 2000.

[141] G. Caetano-Anollés, “Novel strategies to study the role of mutation and nucleic acid structure in evolution,” Plant Cell, Tissue and Organ Culture, vol. 67, no. 2, pp. 115–132, 2001.

[142] F.-J. Sun, S. Fleurdepine, C. Bousquet-Antonelli, G. Caetano-Anollés, and J. Deragon, “Common evolutionary trends for SINE RNA structures,” Trends in Genetics, vol. 23, no. 1, pp. 26–33, 2007.

[143] D. Caetano-Anollés, K. M. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Proteome evolution and the metabolic origins of translation and cellular life,” Journal of Molecular Evolution, vol. 72, no. 1, pp. 14–33, 2011.

[144] M. Wang, Y. Jiang, K. M. Kim et al., “A universal molecular clock of protein folds and its power in tracing the early history of aerobic metabolism and planet oxygenation,” Molecular Biology and Evolution, vol. 28, no. 1, pp. 567–582, 2011.

[145] S. A. Bukhari and G. Caetano-Anollés, “Origin and evolution of protein fold designs inferred from phylogenomic analysis of CATH domain structures in proteomes,” PLoS Computational Biology, vol. 9, no. 3, Article ID e1003009, 2013.

[146] M. Vesteg and J. Krajčovič, “The falsifiability of the models for the origin of eukaryotes,” Current Genetics, vol. 57, no. 6, pp. 367–390, 2011.

[147] I. N. Berezovsky and E. I. Shakhnovich, “Physics and evolution of thermophilic adaptation,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 36, pp. 12742–12747, 2005.

[148] C. Woese, “The universal ancestor,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 12, pp. 6854–6859, 1998.

[149] M. J. Seufferheld and G. Caetano-Anollés, “Phylogenomics supports a cellularly structured urancestor,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 178–191, 2013.

[150] O. Kandler, “Cell wall biochemistry and three-domain concept of life,” Systematic and Applied Microbiology, vol. 16, no. 4, pp. 501–509, 1994.

[151] C. R. Woese, “On the evolution of cells,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 13, pp. 8742–8747, 2002.

[152] S. L. Baldauf, J. D. Palmer, and W. F. Doolittle, “The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny,” Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 15, pp. 7749–7754, 1996.

[153] H. Leffers, J. Kjems, L. Østergaard, N. Larsen, and R. A. Garrett, “Evolutionary relationships amongst archaebacteria. A comparative study of 23 S ribosomal RNAs of a sulphur-dependent extreme thermophile, an extreme halophile and a thermophilic methanogen,” Journal of Molecular Biology, vol. 195, no. 1, pp. 43–61, 1987.

[154] J. Martin, D. Blackburn, and E. Wiley, “Are node-based and stem-based clades equivalent? Insights from graph theory,” PLoS Currents, vol. 2, Article ID RRN1196, 2010.

[Figure 5, schematic in two panels. (a) Canonical paradigm (evolution of systems): terminals are systems (System 1, protein from Pfu; System 2, protein from Mja; System 3, protein from Mba); characters are parts (amino acid sequence sites; example states A L Y S, A L Y S, and G L S S at sites 1, 5, 6, and 7); interactions between parts (e.g., atomic contacts between sites); impact on phylogenetic reconstruction: high. (b) New paradigm (evolution of parts): terminals are parts (Part 1, FSF domain c.37.1; Part 2, FSF domain c.1.2; Part 3, FSF domain c.26.1); characters are systems (proteomes of Pfu, Mja, and Mba; example abundance rows 22 29 28, 54 44 43, and 84 88 80); interactions between systems (e.g., proteomes in trophic chains, from picoplankton to higher trophic levels); impact on phylogenetic reconstruction: low.]

Figure 5: A new phylogenetic strategy simplifies the problems of character independence. (a) The canonical paradigm explores the evolution of systems, such as the evolution of organisms in the reconstruction of ToLs. For example, the terminals of phylogenetic trees can be genes sampled from different organisms (e.g., Pyrococcus furiosus (Pfu), Methanococcus jannaschii (Mja), and Methanosarcina barkeri (Mba)) and phylogenetic characters can be amino acid sequence sites of the corresponding gene products. Character states can describe the identity of the amino acid at each site. Since characters are molecular parts that interact with other parts when molecules fold into compact 3D structures, their interaction violates the principle of character independence. Consequently, the effects of covariation must be considered in the phylogenetic model used to build the trees. (b) The new paradigm explores the evolution of parts, such as the evolution of protein domains in proteomes. For example, the terminals of phylogenetic trees can be domains defined at fold superfamily (FSF) level of structural complexity and the characters used to build the trees can be proteomes. Character states can be the number of domains holding the FSF structure. Since proteomes interact with other proteomes when organisms establish close interactions, interactions that could affect the abundance of domains in proteomes should be considered negligible (unless there is an obligate parasitic lifestyle involved) and there is no need to budget trophic interactions in the phylogenetic model.

of character independence may be minimal for trees of sequences that are closely related. However, trees describing deep historical relationships require that sequences be divergent, and this maximizes the chances of even wider divergences in molecular structure that are not being accounted for by the models of sequence evolution.

The genomic revolution has also provided a wealth of models of 3D atomic structure. These structures are used as gold standards to assign structural modules to sequences with high confidence. As mentioned earlier, proteomes embody collections of protein domains with well-defined structures and functions. Protein domain counts in proteomes have been used to generate trees of protein domains (Figure 5(b)) using standard cladistic approaches and well-established methods (reviewed in [87]). These trees describe the evolution of protein structure at a global level. They are effectively trees of parts. While domains interact with each other in multidomain proteins or establish protein-protein interactions with the domains of other proteins, these interactions of parts do not violate character independence. This is because phylogenetic characters are actually proteomes, systems defined by structural states that exist at much higher levels than the protein domain, not far away from the organism level. Remarkably, no information is lost when character columns in the matrix are randomized in the thought experiment. The order of proteome characters in the matrix does not follow any rationale. Characters are not ordered by lifestyles or trophic levels of the organisms. Interactions between free-living organisms will seldom bias their domain makeup, and if they do, those characters can be excluded from analysis. Even the establishment of symbiotic or obligate parasitic interactions, such as the nodule-forming symbioses between rhizobia and legumes, may have little impact on character independence as long as the joint inclusion of the host and the symbiont is avoided.

6. Evidence Supporting the Archaeal Rooting of the Universal Tree

Figure 4 shows examples of rooted ToLs generated from the sequence and structure of RNA and protein molecules. The different phylogenomic approaches arrive at a common rooted topology that places the stem group of Archaea at the base of the ToL (the archaeal rooting of Figure 1(c)). However, the evolutionary interrelationship of parts and wholes prompts the use of trees of parts, a focus on higher-level structure, and a decrease in confidence in the power of trees of systems. These concepts have been applied to the study of the history of nucleic acid and protein structures for over a decade and

Archaea 13

23

38

786

360324

164

38

2

33

453

126 56

1

1

FSF domains

Archaea

Archaea

EukaryaEukarya

Bact

eria

Bact

eria

f gt 06

(a)

1

11

526

852272

162

100

0

3

232

35 15

5

16

GO terms

Archaea

Archaea

EukaryaBa

cter

iaEukarya

Bact

eria

f gt 06

(b)

69

0

7

8393

228

2

1

0

5

91

1

0

RNA families

Archaea

Eukarya

Bact

eria

Archaea

Eukarya

Bact

eria

f gt 06

(c)

Figure 6 Venn diagrams displaying the distributions of 1733 FSF domains (a) 1924 terminal GO terms (b) and 1148 RNA families (c)in the genomes of the three domains of life FSF domain data was taken from Nasir et al [122 123] and included 981 completely sequencedproteomes from 70 Archaea 652 Bacteria and 259 Eukarya Terminal GO terms corresponding to the ldquomolecular functionrdquo hierarchy definedby the GO database [118] were identified in 249 free-living organisms including 45 Archaea 183 Bacteria and 21 Eukarya (data taken from[123]) The Venn diagram of RNA families and the distribution of Rfam clans and families in organisms were taken from Hoeppner et al[129] and their Dataset S1 Shown below are distribution patterns for FSFs GO terms and RNA families that are present in more than 60 ofthe organisms examined (119891 gt 06) All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimalsharing in archaeal taxonomic groups

provide additional evidence in support of the archaeal rootingscenario

6.1. Comparative Genomic Argumentation. The distribution of gene-encoded products in the genomes of sequenced organisms and parsimony thinking can reveal global evolutionary patterns without formal phylogenetic reconstruction. We will show how simple numerical analyses of protein domains, molecular activities defined by the gene ontology (GO) consortium [118], and RNA families that are domain-specific or are shared between domains of life can uncover the tripartite division of cellular life, exclude chimeric scenarios of origin, and provide initial insight on the rooting of the ToL (Figure 6).

The number of unique protein folds observed in nature is very small. SCOP ver. 1.75 defines only ∼2,000 fold superfamilies (FSFs), groups of homologous domains unified on the basis of common structural and evolutionary relationships, for a total of 110,800 known domains in proteins [113, 119]. FSFs represent highly conserved evolutionary units that are suitable for studying organismal diversification [87]. Yafremava et al. [120] plotted the total number of distinct FSFs (FSF diversity) versus the average reuse of FSFs in the proteome of an organism (FSF abundance). This exercise uncovered a scaling behavior typical of a Benford distribution, with a linear regime of proteomic growth for microbial organisms and a superlinear regime for eukaryotic organisms (Figure 8 in [120]). These same scaling patterns are observed when studying the relationship between open reading frames and genome size [121]. Remarkably, archaeal and eukaryal proteomes exhibited minimum and maximum levels of FSF abundance and diversity, respectively. Bacterial proteomes, however, showed intermediate levels. We note that the general scaling behavior is consistent with a scenario in which evolutionary diversification proceeds from simpler proteomes to the more complex ones in a gradual manner, supporting the principle of spatiotemporal continuity and revealing the nested phylogenetic hierarchies of organisms. Under this scenario (and results of [120]), the streamlined archaeal proteomes represent the earliest form of cellular life. Remarkably, archaeal species harboring thermophilic and hyperthermophilic lifestyles encoded the most streamlined FSF repertoires (Figure 7). Clearly, modern thermophilic archaeons are most closely related to the ancient cells that inhabited planet Earth billions of years ago (also read below).
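The two quantities plotted by Yafremava et al. can be computed from a list of domain assignments, as in the minimal sketch below. The function name, the toy proteome, and the FSF labels are hypothetical. The text does not spell out the formula for "average reuse", so the sketch reports both the total number of domain occurrences and the mean occurrences per FSF as possible readings.

```python
from collections import Counter
import math


def fsf_use_and_reuse(domain_assignments):
    """Return (use, total, mean_reuse) for one proteome.

    use        = number of distinct FSFs (diversity)
    total      = number of domain occurrences assigned to any FSF
    mean_reuse = total / use (one assumed reading of "average reuse")
    """
    counts = Counter(domain_assignments)
    if not counts:
        raise ValueError("proteome has no FSF assignments")
    use = len(counts)
    total = sum(counts.values())
    return use, total, total / use


# Hypothetical proteome: one FSF label per detected domain.
toy_proteome = ["fsf_1", "fsf_1", "fsf_1", "fsf_2", "fsf_3", "fsf_3"]
use, total, mean_reuse = fsf_use_and_reuse(toy_proteome)

# Coordinates for a log-log plot of use against reuse.
log_point = (math.log10(use), math.log10(total))
```

Repeating the calculation for every proteome and plotting the resulting points on logarithmic axes would give a scatter of the kind the text describes.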

FSF domain distributions in the genomes of the three domains of life provide further insights into their evolution. Nasir et al. [122, 123] generated Venn diagrams to illustrate FSF sharing patterns in the genomes of Archaea, Bacteria, and Eukarya (Figure 6(a)). These diagrams display the total number of FSFs that are unique to a domain of life (taxonomic groups A, B, and E), shared by only two (AB, BE, and AE), and those that are universal (ABE). About half of the FSFs (786 out of 1733) were present in all three domains of life, 453 of which were present in more than 60% of the organisms that were examined (f > 0.6, where f is the distribution index ranging from 0 to 1). The fact that about 70% of widely shared FSFs (453 of 672) belong to the ABE taxonomic group strongly supports a common evolutionary origin for cells. Evolutionary timelines have confirmed that a large number of these universal FSFs were present in the ancestor of the three domains of life, the urancestor (Figure 1), and were for the most part retained in extant proteomes [48].

Figure 7: A plot of FSF use (diversity) against FSF reuse (abundance) reveals a linear pattern of proteomic growth in 48 archaeal proteomes. Thermophilic archaeal species occupy positions that are close to the origin of the plot. They also populate the most basal branch positions in ToLs (see discussion in the main text). Both axes are in logarithmic scale. Labeled points: S. marinus, I. hospitalis, H. butylicus, T. pendens, Cand. Korarchaeum, A. pernix, M. kandleri, T. neutrophilus, P. islandicum, and N. maritimus; markers distinguish thermophilic from nonthermophilic species.
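The taxonomic groups and the distribution index f can be derived from presence/absence data, as in the sketch below. The organism names, FSF labels, and function name are hypothetical. The sketch also assumes that f is computed over the organisms of the domains that make up the group, in line with the wording "shared by more than 60% of the bacterial and eukaryal organisms"; the source does not give an explicit formula.

```python
def taxonomic_group(hosts, domain_of):
    """Return the Venn group label (e.g. "BE") and the distribution index f.

    hosts:     set of organisms whose proteome contains the FSF
    domain_of: dict mapping every sampled organism to "A", "B", or "E"
    """
    letters = sorted({domain_of[org] for org in hosts})
    if not letters:
        raise ValueError("FSF is absent from every organism")
    # Assumption: f is measured against organisms of the participating domains.
    pool = [org for org, dom in domain_of.items() if dom in letters]
    return "".join(letters), len(hosts) / len(pool)


# Hypothetical sample of six organisms.
domain_of = {
    "arc1": "A", "arc2": "A",
    "bac1": "B", "bac2": "B", "bac3": "B",
    "euk1": "E",
}
fsf_hosts = {
    "fsf_x": {"arc1", "bac1", "bac2", "bac3", "euk1"},
    "fsf_y": {"bac1", "bac2", "euk1"},
    "fsf_z": {"arc2"},
}
groups = {name: taxonomic_group(hosts, domain_of)
          for name, hosts in fsf_hosts.items()}
widely_shared = sorted(name for name, (_, f) in groups.items() if f > 0.6)
```

Counting the labels over all FSFs gives the seven cells of a Venn diagram, and filtering on f > 0.6 gives the second, "widely shared" diagram.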

Interestingly, the number of BE FSFs was roughly 8.5-fold greater than AE and AB FSFs (324 versus 38 and 38) (Figure 6(a)). The finding that Bacteria and Eukarya encode a significantly large number of shared FSF domains is remarkable and hints towards an unprecedented strong evolutionary association between these two domains of life. Moreover, a significant number of BE FSFs is widespread in the bacterial and eukaryotic proteomes; 56 of the 324 FSFs are shared by more than 60% of the bacterial and eukaryal organisms that were analyzed (Figure 6(a)). This is strong evidence of an ancient vertical evolutionary trace from their mutual ancestor (anticipated in [123]). This trace uniquely supports the archaeal rooting of the ToL. In turn, the canonical and eukaryotic rooting alternatives are highly unlikely, as none alone can explain the remarkable diversity of the BE taxonomic group (as well as its ancient origin [123]). The remarkable vertical trace of the ABE taxonomic group and the negligible vertical trace of the AB group (only one FSF shared by more than 60% of organisms) also falsify fusion hypotheses responsible for a putative chimeric makeup of Eukarya. In light of these findings, the most parsimonious explanation of comparative genomic data is that Archaea is the first domain of diversified cellular life.

The patterns of FSF sharing are further strengthened by the genomic distributions of terminal-level molecular function GO terms (Figure 6(b)). The three domains of life shared about a quarter (526) of the 1924 GO terms that were surveyed. A total of 232 of these ABE terms were shared by more than 60% of organisms analyzed. Again, the fact that about 76% of widely shared GO terms (232 of 306) belong to the ABE taxonomic group strongly supports a common vertical trace, while the finding that Bacteria and Eukarya share a significantly large number of GO terms (272) supports again the archaeal rooting of the ToL. An alternative explanation for the very large size of BE could be large-scale metabolism-related gene transfer from the ancestors of mitochondria and plastids to the ancestors of modern eukaryotes [124]. However, we note that the BE group is not restricted to only metabolic functions. It also includes FSFs and GOs involved in intracellular and extracellular processes and regulation and informational functions [125]. Thus, the very large size of BE is a significant outcome most likely shaped by vertical evolutionary scenarios and cannot solely be explained by parasitic/symbiotic relationships that exist between Bacteria and Eukarya [123]. Moreover, the simplicity of archaeal FSF repertoires is not due to the paucity of available archaeal genomic data. We confirmed that the mean FSF coverage (i.e., number of proteins annotated to FSFs/GOs out of total) for Archaea, Bacteria, and Eukarya was largely comparable (e.g., Table S1 in [122]). Finally, as we will describe below, these comparative patterns and tentative conclusions are confirmed by phylogenomic analysis [91, 122, 126]. Congruent phylogenies were obtained with (Figure 4(c)) and without equal and random sampling of taxa. Thus, the relatively low number of archaeal genomes is also not expected to compromise our inferences.
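The proportions quoted for the Venn diagrams of Figure 6 can be checked with a few lines of arithmetic. The sketch uses only counts stated in the text; the variable names are illustrative.

```python
# Counts quoted in the text for Figure 6 (no other data are assumed).
fsf_abe_wide, fsf_all_wide = 453, 672   # FSFs with f > 0.6: ABE versus all groups
go_abe_wide, go_all_wide = 232, 306     # GO terms with f > 0.6: ABE versus all groups
fsf_be, fsf_ae, fsf_ab = 324, 38, 38    # FSFs shared by exactly two domains

fsf_abe_share = fsf_abe_wide / fsf_all_wide
go_abe_share = go_abe_wide / go_all_wide
be_to_ae_ratio = fsf_be / fsf_ae

print(f"ABE share of widely shared FSFs: {fsf_abe_share:.1%}")
print(f"ABE share of widely shared GO terms: {go_abe_share:.1%}")
print(f"BE versus AE (or AB) FSFs: {be_to_ae_ratio:.1f}-fold")
```

The shares come out near 67% and 76%, and the BE group is about 8.5 times the size of the AE or AB group.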

We note that Eukarya shares many informational genes with Archaea and many operational genes with Bacteria. While informational genes have been thought more refractory to HGT than noninformational genes, we have confirmed at GO level that this is not the case. A ToL built from GO terms showed that in fact noninformational terms were less homoplasious, while statistical enrichment analyses revealed that HGT had little if any functional preference for GO terms across GO hierarchical levels. We recently compared ToLs reconstructed from non-HGT GO terms and ToLs reconstructed from informational GO terms that were extracted from non-HGT GO terms (Kim et al., ms. in review; also read below). In both cases, Archaea appeared as a basal paraphyletic group of the ToLs and the common origin of Bacteria and Eukarya was maintained. Thus, the 272 GO terms shared by Bacteria and Eukarya harbor a strong vertical trace.

Another observation often used as support for Archaea-Eukarya kinship is the discovery of a few eukaryote-specific proteins (e.g., actin, tubulin, H3, H4, ESCRT, ribosomal proteins, and others) in some archaeal species [127] and their complete absence from Bacteria (except for documented tubulin HGT from eukaryotes to the bacterial genus Prosthecobacter [128]). This suggests either that eukaryotes arose from an archaeal lineage [127] or that the ancestor of Eukarya and Archaea was complex and modern Archaea are highly reduced [25]. Indeed, a few eukaryote-specific proteins have now been found in some archaeal species (in some cases, just one species). We argue that this poor spread cannot be taken as evidence for the Archaea-Eukarya sister relationship. This needs to be confirmed by robust phylogenetic analysis, which is unfortunately not possible when using protein sequences. Recently, we described a new strategy for inferring vertical and horizontal traces [123]. This method calculates the spread of FSFs and GOs that are shared between two-superkingdom groups (i.e., AB, AE, and BE). Balanced distributions often indicate vertical inheritance, while biased distributions suggest horizontal flux. For example, penicillin binding molecular activity (GO:0008658) was present in 100% of the sampled bacterial proteomes but was only present in 11% of the archaeal species [123]. Thus, presence of GO:0008658 in Archaea was attributed to HGT gain from Bacteria. Using this simple method, we established that both Bacteria and Eukarya were united by a much stronger vertical trace than either was to Archaea. In fact, strong reductive tendencies in the archaeal genomes were recorded [123]. Thus, in our opinion, presence of eukaryote-specific proteins in only very few archaeal species could in fact be an HGT event that is not detectable by sequence phylogenies.
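The balanced-versus-biased reasoning can be sketched as follows. The function names and the cutoff `max_ratio` are hypothetical choices made for this illustration; the method in [123] is not reproduced here and may use a different criterion. Only the two spreads for GO:0008658 come from the text.

```python
def spread(hosts, sampled):
    """Fraction of the sampled organisms of one superkingdom that carry a feature."""
    if not sampled:
        raise ValueError("no organisms sampled")
    return len(set(hosts) & set(sampled)) / len(sampled)


def trace(f_first, f_second, max_ratio=2.0):
    """Label a feature shared by two superkingdoms as 'balanced' or 'biased'.

    max_ratio is a hypothetical cutoff chosen for this sketch only.
    """
    low, high = sorted((f_first, f_second))
    if low <= 0:
        raise ValueError("feature must be present in both superkingdoms")
    return "balanced" if high / low <= max_ratio else "biased"


# Spreads quoted in the text for GO:0008658 (penicillin binding).
go_0008658 = trace(1.00, 0.11)

# Hypothetical feature with a similar spread in both superkingdoms.
even_feature = trace(0.80, 0.70)

# Hypothetical use of spread(): two of four sampled bacteria carry the feature.
half_spread = spread({"bac1", "bac2"}, ["bac1", "bac2", "bac3", "bac4"])
```

Under this toy rule the penicillin binding term is labeled biased, matching the HGT interpretation given in the text, while the evenly spread feature is labeled balanced.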

In turn, many arguments now favor the Bacteria-Eukarya sisterhood in addition to the structure-based phylogenies and balanced distribution of molecular features. For example, Bacteria and Eukarya have similar lipid membranes that can be used as an argument for their evolutionary kinship. Moreover, Archaea fundamentally differ from Eukarya in terms of their virosphere. Viruses infecting Archaea and Eukarya are drastically different, as recently discussed by Forterre [25].

The genomic distribution of RNA families was taken from Hoeppner et al. [129] and also shows a vertical evolutionary trace in five crucial Rfam clans that are universal, including tRNA, 5S rRNA, subunit rRNA, and RNase P RNA. These universal RNA groups are likely minimally affected by HGT. However, 99% of Rfam clans and families are specific to domains of life and only 11 of the 1148 groups were shared at f > 0.6 levels. This clearly shows that the functional complexity of RNA materialized very late during organismal diversification and that it is not a good genomic feature for exploring the rooting of the ToL. Only five RNA families are shared between two domains of life, and only one of these does so at f > 0.6, the G12 pseudoknot of the 23S rRNA, which is present in bacterial and eukaryotic organellar rRNA. While the large subunit rRNA scaffold supports the G12 pseudoknot with its vertical trace, all five interdomain RNA families can be explained most parsimoniously by HGT. Thus, only a handful of ancient and universal RNA species can be used to root the ToL.

We end by noting that the Venn diagrams consistently show that Archaea harbors the smallest number of unique (A) and shared (AB and AE) FSFs, GO terms, or RNA families. This trend supports an early divergence of this domain of life from the urancestor and the possibility that such divergence was shaped by evolutionary reductive events. We reason that such losses would be more parsimonious early on in evolution than in the later periods. This is because genes often increase their abundance in evolutionary time (by gene duplications, HGT, and other processes). Thus, it is reasonable to think that loss of an ancient gene would be more feasible very early in evolution relative to losing it very late.

6.2. Phylogenomic Evidence from the Sequence and Structure of RNA. Phylogenetic analyses of the few RNA families that are universal and display an important vertical evolutionary trace (Figure 6(c)) provide compelling evidence in favor of an early evolutionary appearance of Archaea [105, 107, 108, 130–135]. Here we briefly summarize evidence from tRNA, 5S rRNA, and RNase P RNA. Unpublished analyses of rRNA sequence and structure using advanced phylogenetic methods also show that Archaea was the first domain of life.

tRNA molecules are generally short (∼73–95 nucleotides in length) and highly conserved. Consequently, their sequence generally contains a limited amount of phylogenetic information. These limitations have been overcome by analyzing entire tRNomes [136], which extend the length of an organismal set of tRNAs to over 2,000 bases. Xue et al. [130, 131] analyzed the genetic distances between tRNA sequences as averages between alloacceptor tRNAs from diverse groups of tRNomes using multiple molecular base substitution models. The distances were mapped onto an unrooted phylogeny of tRNA molecules (Figure 8(a)). The “arrow of time” assumption in these studies is that ancient tRNA paralogs closely resemble each other when lineages originate close to the time of the gene duplication. Remarkably, the results revealed a paraphyletic rooting of the ToL in Archaea. The root was specifically located close to the hyperthermophilic methanogen Methanopyrus kandleri (Figure 8(a)). The hypothesis of this specific rooting scenario has been supported by several other studies [134, 137], including a study of genetic distances between paralogous pairs of aminoacyl-tRNA synthetase (aaRS) proteins [131]. A remarkable match between distance scores of tRNA and pairs of aaRS paralogs (Figure 8(b)) not only confirms the early appearance of Archaea but also suggests a coevolutionary trace associated with molecular interactions that are responsible for the genetic code. In fact, a recent exhaustive phylogenomic analysis of tRNA and aaRS coevolution explicitly reveals the origins and evolution of the genetic code and the underlying molecular basis of genetics [59]. Di Giulio [132, 133] has also proposed an archaeal rooting of the tree of life, specifically in the lineage leading to the phylum of Nanoarchaeota. This rooting is based on unique and ancestral genomic traits of Nanoarchaeum equitans, split genes separately codifying for the 5′ and 3′ halves of tRNA and the absence of operons, which are considered molecular fossils [132]. However, this claim needs additional support, as contrasting evidence now recognizes N. equitans as a highly derived archaeal species [138].
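The averaging step behind the alloacceptor distance can be sketched in a few lines. The Figure 8 caption describes D_allo as an average over 190 pairwise distances that measure sequence mismatches, and 190 is the number of unordered pairs among 20 items. The sketch below is a simplification under stated assumptions: sequences are taken as already aligned and of equal length, the distance is a plain mismatch fraction instead of the substitution models used in [130, 131], and the three toy sequences and all names are hypothetical.

```python
from itertools import combinations


def mismatch_fraction(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def mean_pairwise_distance(trnas):
    """Average mismatch fraction over all unordered pairs of tRNA classes."""
    pairs = list(combinations(sorted(trnas), 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    total = sum(mismatch_fraction(trnas[x], trnas[y]) for x, y in pairs)
    return total / len(pairs)


# Twenty tRNA classes give 190 unordered pairs.
pairs_for_twenty = len(list(combinations(range(20), 2)))

# Hypothetical, very short aligned fragments for three classes.
toy_trnas = {"Ala": "GGGC", "Gly": "GGGA", "Ser": "GCGA"}
toy_distance = mean_pairwise_distance(toy_trnas)
```

A genome whose tRNA classes still resemble one another yields a low average, which is the pattern the "arrow of time" assumption associates with lineages of ancient origin.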

In addition to sequences, structural features of tRNA molecules also support the archaeal rooting of the ToL.

Figure 8: Ancient phylogenetic signal in the sequence of tRNA and their associated aminoacyl-tRNA synthetase (aaRS) enzymes. (a) Unrooted ToL derived from tRNA sequences with their alloacceptor D_allo distances traced in thermal scale (from [130]). D_allo is the average of pairwise distances for 190 pairs of tRNA isoacceptors. These distances measure pairwise sequence mismatches of tRNAs for every genome, and their values increase for faster evolving sequences of species of more recent origin. (b) Coevolution of aaRSs and their corresponding tRNA. Genetic distances between the top 10 potentially paralogous aaRS pairs estimated using BLASTP define a measure (Q_ars) of how closely the proteins resemble each other in genomes (from [131]). Larger Q_ars scores imply more ancestral and slowly evolving protein pairs. The plot of Q_ars scores (aminoacyl-tRNA synthetase history, vertical axis) against D_allo distances (tRNA alloacceptor history, horizontal axis) reveals a hidden correlation between the evolution of tRNA and aaRSs and the early origin of Archaea (larger Q_ars and lower D_allo distances).

The application of RNA structural evidence in phylogenetic studies [50, 139–141] has multiple advantages over sequence data when studying ancient events, especially because RNA structures are far more conserved than sequences. This has been demonstrated in a phylogenetic approach that uses RNA structural information to reconstruct the evolutionary history of macromolecules such as rRNA [29, 50], tRNA [107, 108], 5S rRNA [105], RNase P RNA [135], and SINE RNA [142]. Geometrical and statistical properties of structure (e.g., stems or loops commonly found in the secondary structures of RNA molecules) are treated as linearly ordered multistate phylogenetic characters. In order to build rooted trees, an evolutionary tendency toward conformational order is used to polarize change in character state transformations. This defines a hypothetical ancestor with which to root the ingroup using the Lundberg method. Reconstructed phylogenies produce trees of molecules and ToLs (e.g., Figure 4(a)) or trees of substructures that describe the gradual evolutionary accretion of structural components into molecules (Figure 9). For example, phylogenetic trees of tRNA substructures define explicit models of molecular history and show that tRNA originated in the acceptor stem of the molecule [108]. Remarkably, trees reconstructed from tRNA drawn from individual domains of life demonstrate that the sequence of accretion events occurred differently in Archaea than in Bacteria and Eukarya, suggesting a sister group evolutionary relationship between the bacterial and eukaryotic domains (Figure 9(a)). A similar result obtained from trees of 5S rRNA substructures revealed different molecular accretion sequences of archaeal molecules when these were compared to bacterial and eukaryal counterparts [105], confirming again, in a completely different molecular system, the history of the domains of life.

An analysis of the structure of RNase P RNA also provides similar conclusions [135]. While a ToL reconstructed from molecular structure placed type A archaeal molecules at its base (a topology that resembles the ToL of 5S rRNA), a tree of RNase P RNA substructures uncovered the history of molecular accretion of the RNA component of the ancient endonuclease and revealed a remarkable reductive evolutionary trend (Figure 9(b)). Molecules originated in stem P12 and were immediately accessorized with the P1–P4 catalytic pseudoknotted core structure that interacts with RNase P proteins of the endonuclease complex and ancient segments of tRNA. Soon after this important accretion stage, the evolving molecule loses its first stem in Archaea (stem P8), several accretion steps earlier than the first loss of a stem in Eukarya or the first appearance of a Bacteria-specific stem. These phylogenetic statements provide additional strong support to the early origin of the archaeal superkingdom prior to the divergence of the shared common ancestor of Bacteria and Eukarya. As we will discuss below, the early loss of a structure in the molecular accretion process of a central and ancient RNA family is significant. It suggests that the emerging archaeal lineages were subjected to strong reductive evolutionary pressures during the early evolution of a very ancient RNA molecule.

Figure 9: The history of accretion of tRNA and RNase P RNA substructures reveals the early evolutionary appearance of Archaea. (a) Rooted trees of tRNA arm substructures reveal the early appearance of the acceptor arm (Acc) followed by the anticodon arm (AC) in Archaea or the pseudouridine (TΨC) arm in both Bacteria and Eukarya (from [108]). The result confirms the sister group relationship of Bacteria and Eukarya. (b) Trees of molecular substructures of RNase P RNAs were reconstructed from characters describing the geometry of their structures (from [135]). Branches and corresponding substructures in a 3D atomic model are colored according to the age of each substructure (nd, node distance). Note the early loss of stem P8 in Archaea immediately after the evolutionary assembly of the universal functional core of the molecule.

6.3. Phylogenomic Evidence from Protein Domain Structure. G. Caetano-Anollés and D. Caetano-Anollés [41] were the first to utilize protein domain structures as taxa and to reconstruct trees of domains (ToDs) describing their evolution. Figure 10 shows a rooted ToD built from a census of FSF structures in 981 genomes (data taken from [122, 123]). These trees are unique in that their terminal leaves represent a finite set of component parts [87]. These parts describe at global level the structural diversity of the protein world. When building trees of FSFs, the age of each FSF domain structure can be calculated from the ToDs by simply calculating a distance in nodes between the root of the tree and its corresponding terminal leaf. This is possible because ToDs exhibit highly unbalanced (pectinate) topologies that are the result of semipunctuated processes of domain appearance and accumulation. This node distance (nd), rescaled from 0 to 1, provides a relative timescale to study the order of FSF appearance in evolutionary history [126, 143]. Wang et al. [144] showed that nd correlates linearly with geological time and defines a global molecular clock of protein folds. Thus, nd can be used as a reliable proxy for time. Plotting the age of FSFs in each of the seven taxonomic groups confirmed evolutionary statements we had previously deduced from the Venn diagrams of Figure 6. The ABE taxonomic group included the majority of ancient and widely distributed FSFs. This is expected. In the presence of a strong vertical trace, molecular diversity must delimit a nested taxonomic hierarchy. The ABE group was followed by the evolutionary appearance of the BE group, which preceded the first domain-specific structures, which were Bacteria-specific (B). Remarkably, Archaea-specific (A) and Eukarya-specific (E) structures appeared concurrently and relatively late. These general trends, captured in the box plots of Figure 10, have been recovered repeatedly when studying domain structures at various levels of structural complexity, from folds to fold families [91, 126], when using CATH or SCOP structural definitions [122, 145], or when exploring the evolution of terminal GO terms.
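The node distance calculation can be illustrated on a toy pectinate tree. The nested-tuple tree encoding, the leaf labels, and the min-max rescaling are assumptions made for this sketch; the source states only that nd is a distance in nodes from the root to a leaf, rescaled from 0 to 1.

```python
def node_distances(tree):
    """Map each leaf to its node distance (nd), rescaled to the 0-1 range.

    tree: nested tuples for internal nodes, strings for leaves.
    """
    raw = {}

    def walk(node, depth):
        if isinstance(node, str):
            raw[node] = depth  # number of internal nodes on the path from the root
            return
        for child in node:
            walk(child, depth + 1)

    walk(tree, 0)
    lowest, highest = min(raw.values()), max(raw.values())
    if highest == lowest:
        return {leaf: 0.0 for leaf in raw}
    # Assumption: min-max rescaling, so the most basal leaf gets nd = 0.
    return {leaf: (d - lowest) / (highest - lowest) for leaf, d in raw.items()}


# Hypothetical pectinate (comb-like) tree; the first leaf branches off closest to the root.
toy_tod = ("fsf_a", ("fsf_b", ("fsf_c", ("fsf_d", "fsf_e"))))
nd = node_distances(toy_tod)
```

Because a pectinate tree adds roughly one leaf per node along its backbone, the rank order of nd values reads directly as an order of appearance, which is why the text treats nd as a relative timescale.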

While the early rise of BE FSFs supports the early divergence of Archaea from the urancestor, the very significant trend of gradual loss of structures occurring in the lineages of the archaeal domain and the very late appearance of Archaea-specific structures (e.g., [126]) demand explanation. Since ancient BE FSFs are widely distributed in proteomes (Figure 10), they cannot arise from separate gains of FSFs in Bacteria and Eukarya or by processes of horizontal spread of structures. This was already evident from the Venn diagrams of Figure 6. Moreover, the BE sisterhood to the exclusion of Archaea was further supported by the inspection of FSFs involved in lipid synthesis and transport (Table 1). Membrane lipids are very relevant to the origins of diversified life ([146] and references therein). Bacteria and Eukarya encode similar lipid membranes, while archaeal membranes have a different lipid composition (isoprenoid ethers). To check if lipid synthesis was another BE synapomorphy, we identified 17 FSFs that were involved in lipid metabolism and transport (Table 1). Remarkably, the largest share of these FSFs (7 out of 17) was unique to the BE group. In comparison, none were present in either AE or AB groups. The ABE group included five universal FSFs, while one was unique to Bacteria and four were Eukarya-specific. In turn, no FSF was unique to Archaea. The BE FSFs cannot be explained by modern effects impinging on variations in proteomic accumulation in FSFs or by processes of domain rearrangement, since these appear in the protein world quite late in evolution [92]. The only and most parsimonious explanation of the patterns of FSF distribution that unfold in the ToD is the very early (and protracted) rise of the archaeal domain by processes of reductive evolution, possibly triggered by the adaptation of urancestral lineages to harsh environments and survival modes. Under extremophilic conditions typical of hyperthermophilic environments, considerable investments of matter-energy and information must be made for protein persistence [120]. This puts limits on viable protein structures [147]. Extremophilic environments will thus poise the maintenance of a limited set of FSFs for persistence of emergent diversified lineages. This would induce a primordial episode of reductive evolution in the growing FSF repertoire, explaining why hyperthermophilic and thermophilic archaeal species hold the most reduced proteomes (Figure 7). It would also explain the biases that exist in FSFs, GO terms, and RNA families (Figure 6) and the placement of hyperthermophilic and thermophilic archaeal species at the base of ToLs. Since Archaea populate the oceans and sometimes rival in number Bacteria in those environments, we further interpret the late appearance of Archaea-specific FSFs as the result of late colonization of these mild environments by both ancient archaeons and emerging eukaryotes. This relaxes primordial extremophilic pressures on protein structures and enables the late archaeal exploration of structural flexibility and functional novelty.

Figure 10: Phylogenomic tree of domains (ToD) describing the evolution of 1733 FSF domain structures. Taxa (FSFs) were colored according to their distribution (f) in the 981 genomes that were surveyed and used as characters to reconstruct the phylogenomic tree (data taken from [122, 123]). The most basal FSFs are labeled with SCOP alphanumeric identifiers (e.g., c.37.1 is the P-loop containing nucleoside triphosphate hydrolase FSF). Boxplots display the age (nd value) distribution of FSFs for the seven possible taxonomic groups. nd values were calculated directly from the tree [122] and define a timeline of FSF innovation from the origin of proteins (nd = 0) to the present (nd = 1). The group of FSFs that are shared by the three domains of life (ABE) is the most ancient taxonomic group, which spans the entire time axis, and their FSFs are widely distributed in genomes. The appearance of the BE group coincides with the first reductive loss of an FSF in Archaea. FSF structures specific to domains of life appear much later in evolution. Tree statistics: taxa = 1733 FSF domains; characters = 981 genomes; tree length = 458,226 steps; CI = 0.049, RI = 0.796, and g1 = −0.08. Distribution classes (number of FSFs): f between 0.0 and 0.1 (529), between 0.1 and 0.5 (623), between 0.5 and 0.9 (336), between 0.9 and 1.0 (228), and f = 1.0 (17). Labeled basal FSFs: c.55.1, c.37.1, b.40.4, c.2.1, a.4.5, and c.66.1.

Table 1: List of FSFs involved in lipid metabolism and transport along with taxonomic distribution (data taken from [122, 123]).

Group  SCOP Id  FSF Id    FSF description
ABE    89392    b.125.1   Prokaryotic lipoproteins and lipoprotein localization factors
ABE    53092    c.55.2    Creatinase/prolidase N-terminal domain
ABE    49723    b.12.1    Lipase/lipooxygenase domain (PLAT/LH2 domain)
ABE    54637    d.38.1    Thioesterase/thiol ester dehydrase-isomerase
ABE    63825    b.68.5    YWTD domain
BE     47027    a.11.1    Acyl-CoA binding protein
BE     48431    a.118.4   Lipovitellin-phosvitin complex, superhelical domain
BE     55048    d.58.23   Probable ACP-binding domain of malonyl-CoA ACP transacylase
BE     56968    f.7.1     Lipovitellin-phosvitin complex, beta-sheet shell regions
BE     58113    h.5.1     Apolipoprotein A-I
BE     47162    a.24.1    Apolipoprotein
BE     56931    f.4.2     Outer membrane phospholipase A (OMPLA)
B      82220    b.120.1   Tp47 lipoprotein, N-terminal domain
E      47699    a.52.1    Bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin
E      82936    h.6.1     Apolipoprotein A-II
E      57190    g.3.10    Colipase-like
E      49594    b.7.4     Rab geranylgeranyltransferase alpha-subunit, insert domain

6.4. Full Circle: Evidence from Trees of Proteomes and Functionomes and a Tree Derived from the Distribution of Viral Replicons in Superkingdoms. While we distrust trees of systems, especially ToLs built from sequences, the use of molecular structure at high levels of structural abstraction has the potential to mitigate some limitations of sequence analysis [93]. For example, rooted ToLs built from abundance counts of domain structures and terminal GO terms in the genomes of free-living organisms describe the evolution of proteomes (e.g., [91]). All ToL reconstructions of these kinds approximate the physiology of living organisms, dissect the three primary domains of life, and reveal the early paraphyletic origin of extremophilic archaeal lineages followed by the late appearances of monophyletic Bacteria and Eukarya. These patterns have been reliably recovered with datasets of varying sizes, irrespective of the structural classification scheme [91, 122, 126, 145]. Even a tree reconstructed from the distribution of 2662 viral replicons in superkingdoms from an exhaustive comparative genomic analysis of viral genomes showed the basal placement of Archaea and the sister taxa relationship between Bacteria and Eukarya (Figure 11).
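Trees built from abundance counts need those counts converted into a limited number of character states. The caption of Figure 11 states that replicon abundance, which ranged from 0 to 759, was scored on a 0–20 scale, but the scoring formula is not given in this section. The sketch below therefore assumes a log transform followed by linear rescaling and rounding; the function name and the intermediate test counts are hypothetical.

```python
import math


def score_abundance(count, max_count, top_state=20):
    """Map a raw abundance to an integer character state from 0 to top_state.

    Assumption: log-transform, rescale by the largest count, then round.
    """
    if max_count <= 0:
        raise ValueError("max_count must be positive")
    if not 0 <= count <= max_count:
        raise ValueError("count must lie between 0 and max_count")
    return round(top_state * math.log(count + 1) / math.log(max_count + 1))


max_replicons = 759  # largest raw abundance quoted in the Figure 11 caption
states = [score_abundance(c, max_replicons) for c in (0, 1, 27, 759)]
```

A log transform keeps a few very abundant categories from occupying most of the state range; a purely linear rescaling would be an equally simple alternative if that compression is not wanted.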

6.5. Additional Evidence from Comparative Genomics. The uneven distribution of protein domain structures in the world of proteomes (Figure 6) is preserved as we climb in the structural hierarchy. This was recently made evident when studying the evolution of CATH domains [145]. The Venn diagrams of Figure 12 show how domain structures in all taxonomic groups decrease in numbers with increases in evolutionary conservation. At the highest CATH architectural level, there were 32 universal architectures but no domain-specific architectures. The four architectures shared by Bacteria and Eukarya were present in at least 60% of the proteomes that were surveyed. The other two interdomain architectures were topological designs of considerable complexity that were poorly shared between proteomes and were evolutionarily derived. They were the clam architectures, lost in Eukarya, and the box architectures of nucleotide excision repair, shared by Archaea and Eukarya. The most parsimonious corollary of these distribution patterns is that the BE taxonomical group must arise by loss of structures in Archaea. Indeed, ToDs describing the evolution of CATH domain structures again confirm the early appearance of BE structures and, consequently, their loss in Archaea.

Figure 11: One of three optimal phylogenetic tree reconstructions with identical topologies recovered from an exhaustive maximum parsimony search (58 steps; CI = 1, RI = 1, HI = 0, RC = 1; g1 = −0.707) of the abundance of viral replicon types of dsDNA, ssDNA, dsRNA, ssRNA(+), ssRNA(−), and retrotranscribing viruses. Abundance was scored on a 0–20 scale and ranged from 0 to 759 viral replicons. Vectors of abundance reconstructions in internal nodes are given as percentage of total abundance of replicons in superkingdoms. Bootstrap support values are shown below nodes. (Tree labels: Archaea, Bacteria, and Eukarya; scale bar: 5 changes.)
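The tree statistics quoted in the Figure 11 caption (CI, RI, HI, and RC) follow from three step counts. The sketch below uses the standard ensemble definitions of the consistency, homoplasy, retention, and rescaled consistency indices. The function name is hypothetical, and the maximum number of steps (g = 90) is an invented value, since the caption reports only the tree length of 58 steps.

```python
def parsimony_indices(min_steps, obs_steps, max_steps):
    """Return CI, HI, RI, and RC for a tree.

    min_steps: fewest steps the characters could require on any tree (m)
    obs_steps: steps required on the tree being evaluated (s)
    max_steps: most steps the characters could require on any tree (g)
    """
    if not 0 < min_steps <= obs_steps <= max_steps or max_steps == min_steps:
        raise ValueError("need 0 < m <= s <= g and g > m")
    ci = min_steps / obs_steps                              # consistency index
    ri = (max_steps - obs_steps) / (max_steps - min_steps)  # retention index
    return {"CI": ci, "HI": 1 - ci, "RI": ri, "RC": ci * ri}


# 58 steps is the tree length given in the Figure 11 caption; CI = 1 implies m = s = 58.
# g = 90 is a hypothetical value chosen only so that the example runs.
print(parsimony_indices(58, 58, 90))
```

When the observed tree length equals the minimum possible length, as the caption implies, the data show no homoplasy on that tree, which is why CI, RI, and RC all equal 1 and HI equals 0 regardless of the value of g.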

7. Paraphyletic Origins: Grades and Clades in Archaeal History

Saying that “a ToL is rooted in a domain of life” is an incorrect statement that comes from phylogenetic methodology (the use of outgroups) and the tendency to look at the past with modern eyes. Clades in ToLs have been rooted relative to each other by generating unrooted trees and by defining extant organisms as outgroup taxa. The ToL, however, must be considered rooted in the urancestor of cellular life (Figure 1). This planted edge that connects to the ingroup of extant organisms represents a cellular state in which productive diversification (in the sense of successful lineages) was absent. The primordial urancestral edge leads to a “phase transition,” the last universal cellular ancestor (LUCA), of which little is known. The physiology of the urancestor cannot be considered linked to that of any extant organism, even if it shared a common molecular core with all of them. The urancestor was not an archaeon, a bacterium, or a eukaryote [148]. It was not necessarily thermophilic. Perhaps it was a communal entity or a megametaorganism in the sense of a modern syncytium (the result of multiple cell fusions) and a modern coenocyte (the result of multiple cell divisions). The organismal boundaries were likely present, judging by the number of widely distributed protein domains that associate with membranes and appear at the base of our ToDs and by the universal existence of acidocalcisome organelles [149]. However, the molecular makeup of the urancestral cells was most likely fluid and quasistatistical, with the repertoire distributed unequally in the urancestral populations of communal parts, of course within confines delimited by persistence. This urancestral population is therefore consistent with the idea of a primordial stem line proposed by Kandler and Woese [150, 151]. However, it was relatively richer in molecular structures and functions, as opposed to the simple cellular systems hypothesized by Woese. This richness is confirmed by modern analyses of proteomes and functionomes that reveal a vast number of universally shared protein domains and GOs among the three superkingdoms. While each syncytial/coenocyte element of the megaorganism exchanged component parts in search of cellular stability and persistence, the process could not be equated with modern HGT. The exchanging community of primordial cells was not cohesive enough to make the horizontal exchange meaningful. Macromolecules most likely established loose and diverse associations with each other and with smaller molecules, limited by the short average life of their unevolved structural conformations. With time, molecules with better-optimized properties engaged in more durable interactions, stabilizing the emergent cells and providing increased cellular cohesiveness. This poised the urancestral community towards a phase transition (a crystallization [148]), a point in which cellular groups had distinct properties and could be individuated. We believe this was the time of the origin of the archaeal lineage, 2.9 billion years ago [48].

At the base of the ToLs that were reconstructed from genomic data, basal archaeal taxa arise as paraphyletic lineages (Figures 4(a) and 4(c)). These lineages likely arose from subgroups of the urancestral population that pervasively lost crucial domain structures and molecular functions. This represents an evolutionary grade under the scenario described above. The emerging lineages shared with the urancestral community a unifying condition that was related to archaic biochemistry. In other words, the urancestral and emerging archaeal lineages expressed fundamental structural and functional equivalences in terms of their repertoires but revealed, in each emerging and durable paraphyletic lineage, a handful of distinct, newly developed traits. These traits could be global, such as increased thermostability of some crucial members of the protein repertoire or change in the membrane makeup, or local, such as the selective loss of crucial structures and functions. Figure 13 uses the tree paradigm to portray the structural and functional equivalences of the urancestral and emerging archaeal lineages and the slow progression from grades to clades.

Figure 12: Venn diagrams displaying the distributions of 2221 homologous superfamilies, 1152 topologies, and 38 architectures of CATH domains in the proteomes of 492 fully sequenced genomes (from [145]). The three panels (superfamilies, topologies, and architectures) are ordered by increasing evolutionary conservation. All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.
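The Venn distributions of Figure 12 amount to labelling each domain structure by the set of superkingdoms in which it occurs and counting the labels. A minimal sketch follows, assuming a simple mapping from structure identifiers to sets of superkingdoms. The function names and identifiers are placeholders, and the toy data reproduce only the architecture-level counts stated in the text: 32 universal architectures, four shared by Bacteria and Eukarya, one clam architecture lost in Eukarya, and one box architecture shared by Archaea and Eukarya.

```python
from collections import Counter

SUPERKINGDOMS = ("A", "B", "E")  # Archaea, Bacteria, Eukarya


def venn_group(present_in):
    """Return the Venn label (e.g. 'ABE', 'BE', 'A') for one domain structure."""
    label = "".join(k for k in SUPERKINGDOMS if k in present_in)
    if not label:
        raise ValueError("structure is absent from every superkingdom")
    return label


def venn_counts(occurrence):
    """Count structures per Venn group; occurrence maps structure id -> set of superkingdoms."""
    return Counter(venn_group(s) for s in occurrence.values())


# Toy data with placeholder identifiers, matching the architecture counts given in the text.
occurrence = {}
for i in range(32):
    occurrence[f"arch_ABE_{i}"] = {"A", "B", "E"}
for i in range(4):
    occurrence[f"arch_BE_{i}"] = {"B", "E"}
occurrence["clam"] = {"A", "B"}  # lost in Eukarya
occurrence["box"] = {"A", "E"}   # shared by Archaea and Eukarya

counts = venn_counts(occurrence)
print(dict(counts))
```

In this toy tabulation no architecture falls in a single-superkingdom group, which mirrors the statement that there were no domain-specific architectures at the highest CATH level.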

Figure 13: From grades to clades. The cartoon describes a possible progression of modes of organismal diversification during the rise of primordial archaeal lineages. The width of emerging lineages is proportional to uniquely identifying features of physiological and molecular complexity. (Panel labels: Grades and Clades; time axis from Time 0 to Time 1.)

8. Through the Wormhole: The Makeup of the “Megaorganism” and the Emerging Archaeal Lineages

Character state reconstructions of proteome repertoires derived from ToLs, coupled to the timelines of ToDs, provide an effective way to define the ancestral protein domain complement of the urancestor [48] and, consequently, the likely makeup of the emerging archaeal lineages. The urancestral proteome possessed a lower bound of ~70 FSF domain structures, 75% of which were composed of α/β and α+β proteins. About 50% of FSFs were part of metabolic enzymes, including a rich toolkit of transferases and enzymes of nucleotide metabolism. The rest of the domains were involved in functions related to information (translation, replication, and repair), intracellular processes (transport, protein modification, and proteolytic activities), regulation (kinases/phosphatases and DNA binding functions), and small molecule binding. The urancestor had a limited repertoire of aaRSs and translation factors. It contained a primordial ribosome with a limited core of universal ribosomal proteins. It had numerous membrane proteins necessary for transport, including a relatively advanced ATP synthetase complex, and structures necessary for cellular organization (filaments and primordial cytoskeletal structures). The cells lacked enzymes for deoxyribonucleotide production, so it is likely that the cellular urancestor itself did not harbor a DNA genome. The cells also lacked functions related to extracellular processes (cell adhesion, immune response, and toxins/defense) and cellular motility, suggesting an ancient living world without competitive strategies of survival.

9. Conclusions

The rooting of the ToL has always been controversial in evolutionary biology [26, 152, 153]. While it is popularly accepted that the ToL based on sequence phylogenies is rooted in the akaryotes and that Archaea and Eukarya are sister groups to each other, only two of the three main steps of phylogenetic analysis [104] have been partially fulfilled with sequences. These are the selection of an appropriate statistical or nonstatistical evolutionary model of character change and of an optimization method for phylogenetic tree reconstruction. However, no adequate method exists for character polarization that identifies ancestral and derived character states in sequences. In the absence of robust polarization methodology, any statement about the rooting of the ToL should be considered suspect or subject of apriorism. Here we show that information derived from a genomic structural and functional census of millions of encoded proteins and RNAs, coupled with process models that comply with Weston’s generality criterion, provides the means to dissect the origins of diversified life. The generality criterion is fulfilled in these studies by focusing on the accumulation of modules such as protein domain structures, elements of RNA substructures, or ontogenetic definitions of molecular function. In general, these features are the subject of accretion processes that comply with additive phylogenetic change within the nested taxonomic hierarchy and result in changes of abundance. These processes include those responsible for the growth of molecules (e.g., multidomain proteins), molecular ensembles (e.g., the ribosome), and molecular repertoires (e.g., proteomes). The new methods unfold a consistent evolutionary scenario in which the origin of diversified life traces back to the early history of Archaea. Remarkably, the archaic origin of this microbial urkingdom now does justice to its name.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgments

The authors would like to thank members and friends of the GCA laboratory for fruitful discussions. This research has been supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791) and the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625) to Gustavo Caetano-Anollés and from the KRIBB Research Initiative Program and the Next-Generation BioGreen 21 Program, Rural Development Administration (PJ0090192013), to Kyung Mo Kim.

References

[1] C. R. Woese, “A new biology for a new century,” Microbiology and Molecular Biology Reviews, vol. 68, no. 2, pp. 173–186, 2004.

[2] T. Grant, “Testing methods: the evaluation of discovery operations in evolutionary biology,” Cladistics, vol. 18, no. 1, pp. 94–111, 2002.

[3] E. O. Wiley, “Karl R. Popper, systematics, and classification: a reply to Walter Bock and other evolutionary taxonomists,” Systematic Zoology, vol. 24, no. 2, pp. 233–243, 1975.

[4] J. S. Farris, “Parsimony and explanatory power,” Cladistics, vol. 24, no. 5, pp. 825–847, 2008.

[5] G. Caetano-Anollés, K. M. Kim, and D. Caetano-Anollés, “The phylogenomic roots of modern biochemistry: origins of proteins, cofactors and protein biosynthesis,” Journal of Molecular Evolution, vol. 74, no. 1-2, pp. 1–34, 2012.

[6] W. Fontana, “Modeling “evo-devo” with RNA,” BioEssays, vol. 24, no. 12, pp. 1164–1177, 2002.

[7] W. Hennig, Phylogenetic Systematics, University of Illinois Press, Urbana, Ill, USA, 1999.

[8] D. Fraix-Burnet, T. Chattopadhyay, A. K. Chattopadhyay, E. Davoust, and M. Thuillard, “A six-parameter space to describe galaxy diversification,” Astronomy and Astrophysics, vol. 545, article A80, 24 pages, 2012.

[9] E. Sober, “The contest between parsimony and likelihood,” Systematic Biology, vol. 53, no. 4, pp. 644–653, 2004.

[10] E. K. Lienau and R. DeSalle, “Is the microbial tree of life verificationist?” Cladistics, vol. 26, no. 2, pp. 195–201, 2010.

[11] J. S. Huxley, “Evolutionary processes and taxonomy with special reference to grades,” Uppsala Universitets Arsskrift, vol. 6, pp. 21–39, 1958.

[12] C. R. Woese and G. E. Fox, “Phylogenetic structure of the prokaryotic domain: the primary kingdoms,” Proceedings of the National Academy of Sciences of the United States of America, vol. 74, no. 11, pp. 5088–5090, 1977.

[13] P. Forterre, “The universal tree of life and the last universal cellular ancestor: revolution and counterrevolutions,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 43–62, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[14] J. P. Gogarten, H. Kibak, P. Dittrich et al., “Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 17, pp. 6661–6665, 1989.

[15] N. Iwabe, K. Kuma, M. Hasegawa, S. Osawa, and T. Miyata, “Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 23, pp. 9355–9359, 1989.

[16] J. P. Gogarten and L. Olendzenski, “Orthologs, paralogs and genome comparisons,” Current Opinion in Genetics and Development, vol. 9, no. 6, pp. 630–636, 1999.

[17] C. R. Woese, O. Kandler, and M. L. Wheelis, “Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya,” Proceedings of the National Academy of Sciences of the United States of America, vol. 87, no. 12, pp. 4576–4579, 1990.

[18] J. Martijn and T. J. Ettema, “From archaeon to eukaryote: the evolutionary dark ages of the eukaryotic cell,” Biochemical Society Transactions, vol. 41, no. 1, pp. 451–457, 2013.

[19] C. J. Cox, P. G. Foster, R. P. Hirt, S. R. Harris, and T. M. Embley, “The archaebacterial origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 51, pp. 20356–20361, 2008.

[20] P. G. Foster, C. J. Cox, and T. M. Embley, “The primary divisions of life: a phylogenomic approach employing composition-heterogeneous methods,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2197–2207, 2009.

[21] T. A. Williams, P. G. Foster, T. M. Nye, C. J. Cox, and T. M. Embley, “A congruent phylogenomic signal places eukaryotes within the Archaea,” Proceedings of the Royal Society B: Biological Sciences, vol. 279, no. 1749, pp. 4870–4879, 2012.

[22] S. Gribaldo, A. M. Poole, V. Daubin, P. Forterre, and C. Brochier-Armanet, “The origin of eukaryotes and their relationship with the Archaea: are we at a phylogenomic impasse?” Nature Reviews Microbiology, vol. 8, no. 10, pp. 743–752, 2010.

[23] H. Philippe and P. Forterre, “The rooting of the universal tree of life is not reliable,” Journal of Molecular Evolution, vol. 49, no. 4, pp. 509–523, 1999.

[24] E. Sober and M. Steel, “Testing the hypothesis of common ancestry,” Journal of Theoretical Biology, vol. 218, no. 4, pp. 395–408, 2002.

[25] P. Forterre, “The common ancestor of Archaea and Eukarya was not an archaeon,” Archaea, vol. 2013, Article ID 372396, 18 pages, 2013.

[26] P. Forterre and H. Philippe, “Where is the root of the universal tree of life?” BioEssays, vol. 21, no. 10, pp. 871–879, 1999.

[27] S. Gribaldo and H. Philippe, “Ancient phylogenetic relationships,” Theoretical Population Biology, vol. 61, no. 4, pp. 391–408, 2002.

[28] J. K. Harris, S. T. Kelley, G. B. Spiegelman, and N. R. Pace, “The genetic core of the universal ancestor,” Genome Research, vol. 13, no. 3, pp. 407–412, 2003.

[29] A. Harish and G. Caetano-Anollés, “Ribosomal history reveals origins of modern protein synthesis,” PLoS ONE, vol. 7, no. 3, Article ID e32776, 2012.

[30] G. Caetano-Anollés and M. J. Seufferheld, “The coevolutionary roots of biochemistry and cellular organization challenge the RNA world paradigm,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 152–177, 2013.

[31] N. R. Pace, “Mapping the tree of life: progress and prospects,” Microbiology and Molecular Biology Reviews, vol. 73, no. 4, pp. 565–576, 2009.

[32] P. de Rijk, Y. van de Peer, I. van den Broeck, and R. de Wachter, “Evolution according to large ribosomal subunit RNA,” Journal of Molecular Evolution, vol. 41, no. 3, pp. 366–375, 1995.

[33] G. Caetano-Anollés, “Tracing the evolution of RNA structure in ribosomes,” Nucleic Acids Research, vol. 30, no. 11, pp. 2575–2587, 2002.


[34] J. Mallatt and C. J. Winchell, “Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes,” Molecular Biology and Evolution, vol. 19, no. 3, pp. 289–301, 2002.

[35] M. Gerstein, “Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census,” Proteins: Structure, Function and Bioinformatics, vol. 33, no. 4, pp. 518–534, 1998.

[36] M. Gerstein and H. Hegyi, “Comparing genomes in terms of protein structure: surveys of a finite parts list,” FEMS Microbiology Reviews, vol. 22, no. 4, pp. 277–304, 1998.

[37] B. Snel, P. Bork, and M. A. Huynen, “Genome phylogeny based on gene content,” Nature Genetics, vol. 21, no. 1, pp. 108–110, 1999.

[38] F. Tekaia, A. Lazcano, and B. Dujon, “The genomic tree as revealed from whole proteome comparisons,” Genome Research, vol. 9, no. 6, pp. 550–557, 1999.

[39] Y. I. Wolf, S. E. Brenner, P. A. Bash, and E. V. Koonin, “Distribution of protein folds in the three superkingdoms of life,” Genome Research, vol. 9, no. 1, pp. 17–26, 1999.

[40] J. O. Korbel, B. Snel, M. A. Huynen, and P. Bork, “SHOT: a web server for the construction of genome phylogenies,” Trends in Genetics, vol. 18, no. 3, pp. 158–162, 2002.

[41] G. Caetano-Anollés and D. Caetano-Anollés, “An evolutionarily structured universe of protein architecture,” Genome Research, vol. 13, no. 7, pp. 1563–1571, 2003.

[42] S. L. Baldauf, A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle, “A kingdom-level phylogeny of eukaryotes based on combined protein data,” Science, vol. 290, no. 5493, pp. 972–977, 2000.

[43] J. R. Brown, C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope, “Universal trees based on large combined protein sequence data sets,” Nature Genetics, vol. 28, no. 3, pp. 281–285, 2001.

[44] F. D. Ciccarelli, T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P. Bork, “Toward automatic reconstruction of a highly resolved tree of life,” Science, vol. 311, no. 5765, pp. 1283–1287, 2006.

[45] T. Dagan, M. Roettger, D. Bryant, and W. Martin, “Genome networks root the tree of life between prokaryotic domains,” Genome Biology and Evolution, vol. 2, no. 1, pp. 379–392, 2010.

[46] S. Jun, G. E. Sims, G. A. Wu, and S. Kim, “Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 133–138, 2010.

[47] E. A. Schultes, P. T. Hraber, and T. H. LaBean, “No molecule is an island: molecular evolution and the study of sequence space,” in Algorithmic Bioprocesses, A. Condon, D. Harel, J. N. Kok, A. Salomaa, and E. Winfree, Eds., pp. 675–704, Springer, Berlin, Germany, 2009.

[48] K. M. Kim and G. Caetano-Anollés, “The proteomic complexity and rise of the primordial ancestor of diversified life,” BMC Evolutionary Biology, vol. 11, no. 1, article 140, 2011.

[49] V. A. Albert, Parsimony, Phylogeny, and Genomics, Oxford University Press, Oxford, UK, 2005.

[50] G. Caetano-Anollés, “Evolved RNA secondary structure and the rooting of the universal tree of life,” Journal of Molecular Evolution, vol. 54, no. 3, pp. 333–345, 2002.

[51] J. A. Lake, R. G. Skophammer, C. W. Herbold, and J. A. Servin, “Genome beginnings: rooting the tree of life,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2177–2185, 2009.

[52] B. Guo, M. Zou, and A. Wagner, “Pervasive indels and their evolutionary dynamics after the fish-specific genome duplication,” Molecular Biology and Evolution, vol. 29, no. 10, pp. 3005–3022, 2012.

[53] M. K. Basu, I. B. Rogozin, O. Deusch, T. Dagan, W. Martin, and E. V. Koonin, “Evolutionary dynamics of introns in plastid-derived genes in plants: saturation nearly reached but slow intron gain continues,” Molecular Biology and Evolution, vol. 25, no. 1, pp. 111–119, 2008.

[54] J. S. Farris, “Phylogenetic analysis under Dollo’s Law,” Systematic Biology, vol. 26, no. 1, pp. 77–88, 1977.

[55] H. Fang, M. E. Oates, R. B. Pethica et al., “A daily-updated tree of (sequenced) life as a reference for genome research,” Scientific Reports, vol. 3, article 2015, 2013.

[56] T. Cavalier-Smith, “The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification,” International Journal of Systematic and Evolutionary Microbiology, vol. 52, no. 1, pp. 7–76, 2002.

[57] T. Cavalier-Smith, “Rooting the tree of life by transition analyses,” Biology Direct, vol. 1, article 19, 2006.

[58] H. S. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Widespread recruitment of ancient domain structures in modern enzymes during metabolic evolution,” Journal of Integrative Bioinformatics, vol. 10, no. 1, article 214, 2013.

[59] G. Caetano-Anollés, M. Wang, and D. Caetano-Anollés, “Structural phylogenomics retrodicts the origin of the genetic code and uncovers the evolutionary impact of protein flexibility,” PLoS ONE, vol. 8, no. 8, Article ID e72225, 2013.

[60] E. Bapteste and C. Brochier, “On the conceptual difficulties in rooting the tree of life,” Trends in Microbiology, vol. 12, no. 1, pp. 9–13, 2004.

[61] A. M. Poole, “Horizontal gene transfer and the earliest stages of the evolution of life,” Research in Microbiology, vol. 160, no. 7, pp. 473–480, 2009.

[62] D. Raoult, “The post-Darwinist rhizome of life,” The Lancet, vol. 375, no. 9709, pp. 104–105, 2010.

[63] W. F. Doolittle, “Phylogenetic classification and the universal tree,” Science, vol. 284, no. 5423, pp. 2124–2128, 1999.

[64] T. Dagan and W. Martin, “The tree of one percent,” Genome Biology, vol. 7, no. 10, article 118, 2006.

[65] W. F. Doolittle and E. Bapteste, “Pattern pluralism and the tree of life hypothesis,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 7, pp. 2043–2049, 2007.

[66] E. V. Koonin, “Towards a postmodern synthesis of evolutionary biology,” Cell Cycle, vol. 8, no. 6, pp. 799–800, 2009.

[67] T. Kloesges, O. Popa, W. Martin, and T. Dagan, “Networks of gene sharing among 329 proteobacterial genomes reveal differences in lateral gene transfer frequency at different phylogenetic depths,” Molecular Biology and Evolution, vol. 28, no. 2, pp. 1057–1074, 2011.

[68] S. Halary, J. W. Leigh, B. Cheaib, P. Lopez, and E. Bapteste, “Network analyses structure genetic diversity in independent genetic worlds,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 127–132, 2010.

[69] S. S. Abby, E. Tannier, M. Gouy, and V. Daubin, “Lateral gene transfer as a support for the tree of life,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 13, pp. 4962–4967, 2012.

[70] N. Glansdorff, Y. Xu, and B. Labedan, “The conflict between horizontal gene transfer and the safeguard of identity: origin of meiotic sexuality,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 470–480, 2009.

[71] C. G. Kurland and O. G. Berg, “A hitchhiker’s guide to evolving networks,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 361–396, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[72] L. P. Villarreal and G. Witzany, “The DNA habitat and its RNA inhabitants: at the dawn of RNA sociology,” Genomics Insights, vol. 6, pp. 1–12, 2013.

[73] S. Gribaldo and C. Brochier, “Phylogeny of prokaryotes: does it exist and why should we care?” Research in Microbiology, vol. 160, no. 7, pp. 513–521, 2009.

[74] B. D. Mishler, “The logic of the data matrix in phylogenetic analysis,” in Parsimony, Phylogeny, and Genomics, V. A. Albert, Ed., pp. 57–69, Oxford University Press, New York, NY, USA, 2005.

[75] A. Dress, V. Moulton, M. Steel, and T. Wu, “Species, clusters and the “tree of life”: a graph-theoretic perspective,” Journal of Theoretical Biology, vol. 265, no. 4, pp. 535–542, 2010.

[76] N. Lane and W. Martin, “The energetics of genome complexity,” Nature, vol. 467, no. 7318, pp. 928–934, 2010.

[77] M. C. Rivera and J. A. Lake, “The ring of life provides evidence for a genome fusion origin of eukaryotes,” Nature, vol. 431, no. 7005, pp. 152–155, 2004.

[78] D. Alvarez-Ponce and J. O. McInerney, “The human genome retains relics of its prokaryotic ancestry: human genes of archaebacterial and eubacterial origin exhibit remarkable differences,” Genome Biology and Evolution, vol. 3, no. 1, pp. 782–790, 2011.

[79] A. Poole and D. Penny, “Eukaryote evolution: engulfed by speculation,” Nature, vol. 447, no. 7147, p. 913, 2007.

[80] E. V. Koonin and Y. I. Wolf, “The fundamental units, processes and patterns of evolution, and the tree of life conundrum,” Biology Direct, vol. 4, no. 1, article 33, 2009.

[81] P. Puigbo, Y. I. Wolf, and E. V. Koonin, “Search for a “tree of life” in the thicket of the phylogenetic forest,” Journal of Biology, vol. 8, no. 6, article 59, 2009.

[82] P. Puigbo, Y. I. Wolf, and E. V. Koonin, “Seeing the tree of life behind the phylogenetic forest,” BMC Biology, vol. 11, article 46, 2013.

[83] D. B. Wetlaufer, “Nucleation, rapid folding, and globular intrachain regions in proteins,” Proceedings of the National Academy of Sciences of the United States of America, vol. 70, no. 3, pp. 697–701, 1973.

[84] J. S. Richardson, “The anatomy and taxonomy of protein structure,” Advances in Protein Chemistry, vol. 34, pp. 167–339, 1981.

[85] M. Riley and B. Labedan, “Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module,” Journal of Molecular Biology, vol. 268, no. 5, pp. 857–868, 1997.

[86] J. Janin and S. J. Wodak, “Structural domains in proteins and their role in the dynamics of protein function,” Progress in Biophysics and Molecular Biology, vol. 42, pp. 21–78, 1983.

[87] G. Caetano-Anollés, M. Wang, D. Caetano-Anollés, and J. E. Mittenthal, “The origin, evolution and structure of the protein world,” Biochemical Journal, vol. 417, no. 3, pp. 621–637, 2009.

[88] J. Gough, “Convergent evolution of domain architectures (is rare),” Bioinformatics, vol. 21, no. 8, pp. 1464–1471, 2005.

[89] K. Forslund, A. Henricson, V. Hollich, and E. L. L. Sonnhammer, “Domain tree-based analysis of protein architecture evolution,” Molecular Biology and Evolution, vol. 25, no. 2, pp. 254–264, 2008.

[90] S. Yang and P. E. Bourne, “The evolutionary history of protein domains viewed by species phylogeny,” PLoS ONE, vol. 4, no. 12, Article ID e8378, 2009.

[91] K. M. Kim and G. Caetano-Anollés, “The evolutionary history of protein fold families and proteomes confirms that the archaeal ancestor is more ancient than the ancestors of other superkingdoms,” BMC Evolutionary Biology, vol. 12, article 13, 2012.

[92] M. Wang and G. Caetano-Anollés, “The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world,” Structure, vol. 17, no. 1, pp. 66–78, 2009.

[93] G. Caetano-Anollés and A. Nasir, “Benefits of using molecular structure and abundance in phylogenomic analysis,” Frontiers in Genetics, vol. 3, article 172, 2012.

[94] K. Brading and E. Castellani, Symmetries in Physics: Philosophical Reflections, Cambridge University Press, Cambridge, UK, 2003.

[95] J. A. Cotton and M. Wilkinson, “Quantifying the potential utility of phylogenetic characters,” Taxon, vol. 57, no. 1, pp. 131–136, 2008.

[96] E. Mossel and M. Steel, “A phase transition for a random cluster model on phylogenetic trees,” Mathematical Biosciences, vol. 187, no. 2, pp. 189–203, 2004.

[97] M. H. Bailor, X. Sun, and H. M. Al-Hashimi, “Topology links RNA secondary structure with global conformation, dynamics, and adaptation,” Science, vol. 327, no. 5962, pp. 202–206, 2010.

[98] K. Illergard, D. H. Ardell, and A. Elofsson, “Structure is three to ten times more conserved than sequence - a study of structural response in protein cores,” Proteins: Structure, Function and Bioinformatics, vol. 77, no. 3, pp. 499–508, 2009.

[99] G. Caetano-Anollés, S. K. Hee, and J. E. Mittenthal, “The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 22, pp. 9358–9363, 2007.

[100] P. H. Weston, “Methods for rooting cladistic trees,” in Models in Phylogeny Reconstruction, D. J. Siebert, R. W. Scotland, and D. M. Williams, Eds., pp. 125–155, Oxford University Press, Oxford, UK, 1994.

[101] H. N. Bryant, “The polarization of character transformations in phylogenetic systematics: role of axiomatic and auxiliary assumptions,” Systematic Biology, vol. 40, no. 4, pp. 433–445, 1991.

[102] H. N. Bryant and G. Wagner, “Character polarity and the rooting of cladograms,” in The Character Concept in Evolutionary Biology, G. P. Wagner, Ed., pp. 319–338, Academic Press, New York, NY, USA, 2001.

[103] G. Nelson, “Ontogeny, phylogeny, paleontology, and the biogenetic law,” Systematic Biology, vol. 27, no. 3, pp. 324–345, 1978.

[104] P. H. Weston, “Indirect and direct methods in systematics,” in Ontogeny and Systematics, C. J. Humphries, Ed., pp. 27–56, Columbia University Press, New York, NY, USA, 1988.

[105] F.-J. Sun and G. Caetano-Anollés, “The evolutionary history of the structure of 5S ribosomal RNA,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 430–443, 2009.

[106] X. Fang, T. Pan, and T. R. Sosnick, “Mg2+-dependent folding of a large ribozyme without kinetic traps,” Nature Structural Biology, vol. 6, no. 12, pp. 1091–1095, 1999.

[107] F.-J. Sun and G. Caetano-Anollés, “Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses,” PLoS Computational Biology, vol. 4, no. 3, Article ID e1000018, 2008.

[108] F.-J. Sun and G. Caetano-Anollés, “The origin and evolution of tRNA inferred from phylogenetic analysis of structure,” Journal of Molecular Evolution, vol. 66, no. 1, pp. 21–35, 2008.

[109] W. R. Ashby, An Introduction to Cybernetics, Taylor & Francis, London, UK, 1955.

[110] J. G. Lundberg, “Wagner networks and ancestors,” Systematic Biology, vol. 21, no. 4, pp. 398–413, 1972.

[111] P. T. Chippindale and J. J. Wiens, “Weighting, partitioning, and combining characters in phylogenetic analysis,” Systematic Biology, vol. 43, no. 2, pp. 278–287, 1994.

[112] J. Mittenthal, D. Caetano-Anollés, and G. Caetano-Anollés, “Biphasic patterns of diversification and the emergence of modules,” Frontiers in Genetics, vol. 3, article 147, 2012.

[113] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[114] C. A. Orengo, A. Michie, S. Jones, D. T. Jones, M. Swindells, and J. M. Thornton, “CATH - a hierarchic classification of protein domain structures,” Structure, vol. 5, no. 8, pp. 1093–1109, 1997.

[115] G. Sawa, J. Dicks, and I. N. Roberts, “Current approaches to whole genome phylogenetic analysis,” Briefings in Bioinformatics, vol. 4, no. 1, pp. 63–74, 2003.

[116] R. Sathyapriya, J. M. Duarte, H. Stehr, I. Filippis, and M. Lappe, “Defining an essence of structure determining residue contacts in proteins,” PLoS Computational Biology, vol. 5, no. 12, Article ID e1000584, 2009.

[117] D. Penny and L. J. Collins, “Evolutionary genomics leads the way,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 1–16, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[118] M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.

[119] A. Andreeva, D. Howorth, J. Chandonia et al., “Data growth and its impact on the SCOP database: new developments,” Nucleic Acids Research, vol. 36, pp. D419–D425, 2008.

[120] L. S. Yafremava, M. Wielgos, S. Thomas et al., “A general framework of persistence strategies for biological systems helps explain domains of life,” Frontiers in Genetics, vol. 4, article 16, 2013.

[121] J. L. Friar, T. Goldman, and J. Perez-Mercader, “Genome sizes and the Benford distribution,” PLoS ONE, vol. 7, no. 5, Article ID e36624, 2012.

[122] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya,” BMC Evolutionary Biology, vol. 12, article 156, 2012.

[123] A. Nasir and G. Caetano-Anollés, “Comparative analysis of proteomes and functionomes provides insights into origins of cellular diversification,” Archaea, vol. 2013, Article ID 648746, 13 pages, 2013.

[124] W. Martin and M. Muller, “The hydrogen hypothesis for the first eukaryote,” Nature, vol. 392, no. 6671, pp. 37–41, 1998.

[125] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Global patterns of protein domain gain and loss in superkingdoms,” PLoS Computational Biology, vol. 10, no. 1, Article ID e1003452, 2014.

[126] MWang L S Yafremava D Caetano-Anolles J E Mittenthaland G Caetano-Anolles ldquoReductive evolution of architectural

repertoires in proteomes and the birth of the tripartite worldrdquoGenome Research vol 17 no 11 pp 1572ndash1585 2007

[127] L Guy and T J G Ettema ldquoThe archaeal ldquoTACKrdquo superphylumand the origin of eukaryotesrdquoTrends inMicrobiology vol 19 no12 pp 580ndash587 2011

[128] D Schlieper M A Oliva J M Andreu and J Lowe ldquoStructureof bacterial tubulin BtubAB evidence for horizontal genetransferrdquo Proceedings of the National Academy of Sciences of theUnited States of America vol 102 no 26 pp 9170ndash9175 2005

[129] M P Hoeppner P P Gardner and A M Poole ldquoComparativeanalysis of RNA families reveals distinct repertoires for eachdomain of liferdquo PLoS Computational Biology vol 8 no 11Article ID e1002752 2012

[130] H Xue K Tong C Marck H Grosjean and J T WongldquoTransfer RNA paralogs evidence for genetic code-amino acidbiosynthesis coevolution and an archaeal root of liferdquoGene vol310 no 1-2 pp 59ndash66 2003

[131] H Xue SNg K Tong and J TWong ldquoCongruence of evidencefor aMethanopyrus-proximal root of life based on transfer RNAand aminoacyl-tRNA synthetase genesrdquo Gene vol 360 no 2pp 120ndash130 2005

[132] MDiGiulio ldquoNanoarchaeum equitans is a living fossilrdquo Journalof Theoretical Biology vol 242 no 1 pp 257ndash260 2006

[133] M Di Giulio ldquoThe tree of life might be rooted in the branchleading to Nanoarchaeotardquo Gene vol 401 no 1-2 pp 108ndash1132007

[134] J T Wong J Chen W Mat S Ng and H Xue ldquoPolyphasicevidence delineating the root of life and roots of biologicaldomainsrdquo Gene vol 403 no 1-2 pp 39ndash52 2007

[135] F-J Sun and G Caetano-Anolles ldquoThe ancient history of thestructure of ribonuclease P and the early origins of ArchaeardquoBMC Bioinformatics vol 11 article 153 2010

[136] C Marck and H Grosjean ldquotRNomics analysis of tRNA genesfrom 50 genomes of Eukarya Archaea and bacteria revealsanticodon-sparing strategies and domain-specific featuresrdquoRNA vol 8 no 10 pp 1189ndash1232 2002

[137] W Mat H Xue and J T Wong ldquoThe genomics of LUCArdquoFrontiers in Bioscience vol 13 no 14 pp 5605ndash5613 2008

[138] C Brochier S Gribaldo Y Zivanovic F Confalonieri andP Forterre ldquoNanoarchaea representatives of a novel archaealphylum or a fast-evolving euryarchaeal lineage related toThermococcalesrdquo Genome Biology vol 6 no 5 article R422005

[139] B Billoud M Guerrucci M Masselot and J S DeutschldquoCirripede phylogeny using a novel approach molecular mor-phometricsrdquoMolecular Biology and Evolution vol 17 no 10 pp1435ndash1445 2000

[140] L J Collins V Moulton and D Penny ldquoUse of RNA secondarystructure for studying the evolution of RNase P and RNaseMRPrdquo Journal of Molecular Evolution vol 51 no 3 pp 194ndash2042000

[141] G Caetano-Anolles ldquoNovel strategies to study the role ofmutation and nucleic acid structure in evolutionrdquo Plant CellTissue and Organ Culture vol 67 no 2 pp 115ndash132 2001

[142] F-J Sun S Fleurdepine C Bousquet-Antonelli G Caetano-Anolles and J Deragon ldquoCommon evolutionary trends forSINE RNA structuresrdquo Trends in Genetics vol 23 no 1 pp 26ndash33 2007

[143] D Caetano-Anolles K M Kim J E Mittenthal and GCaetano-Anolles ldquoProteome evolution and the metabolic ori-gins of translation and cellular liferdquo Journal of MolecularEvolution vol 72 no 1 pp 14ndash33 2011


[144] M. Wang, Y. Jiang, K. M. Kim et al., “A universal molecular clock of protein folds and its power in tracing the early history of aerobic metabolism and planet oxygenation,” Molecular Biology and Evolution, vol. 28, no. 1, pp. 567–582, 2011.

[145] S. A. Bukhari and G. Caetano-Anollés, “Origin and evolution of protein fold designs inferred from phylogenomic analysis of CATH domain structures in proteomes,” PLoS Computational Biology, vol. 9, no. 3, Article ID e1003009, 2013.

[146] M. Vesteg and J. Krajcovic, “The falsifiability of the models for the origin of eukaryotes,” Current Genetics, vol. 57, no. 6, pp. 367–390, 2011.

[147] I. N. Berezovsky and E. I. Shakhnovich, “Physics and evolution of thermophilic adaptation,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 36, pp. 12742–12747, 2005.

[148] C. Woese, “The universal ancestor,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 12, pp. 6854–6859, 1998.

[149] M. J. Seufferheld and G. Caetano-Anollés, “Phylogenomics supports a cellularly structured urancestor,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 178–191, 2013.

[150] O. Kandler, “Cell wall biochemistry and three-domain concept of life,” Systematic and Applied Microbiology, vol. 16, no. 4, pp. 501–509, 1994.

[151] C. R. Woese, “On the evolution of cells,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 13, pp. 8742–8747, 2002.

[152] S. L. Baldauf, J. D. Palmer, and W. F. Doolittle, “The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny,” Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 15, pp. 7749–7754, 1996.

[153] H. Leffers, J. Kjems, L. Østergaard, N. Larsen, and R. A. Garrett, “Evolutionary relationships amongst archaebacteria. A comparative study of 23 S ribosomal RNAs of a sulphur-dependent extreme thermophile, an extreme halophile and a thermophilic methanogen,” Journal of Molecular Biology, vol. 195, no. 1, pp. 43–61, 1987.

[154] J. Martin, D. Blackburn, and E. Wiley, “Are node-based and stem-based clades equivalent? Insights from graph theory,” PLoS Currents, vol. 2, Article ID RRN1196, 2010.


Figure 6: Venn diagrams displaying the distributions of 1,733 FSF domains (a), 1,924 terminal GO terms (b), and 1,148 RNA families (c) in the genomes of the three domains of life (Archaea, Bacteria, and Eukarya). FSF domain data were taken from Nasir et al. [122, 123] and included 981 completely sequenced proteomes from 70 Archaea, 652 Bacteria, and 259 Eukarya. Terminal GO terms corresponding to the “molecular function” hierarchy defined by the GO database [118] were identified in 249 free-living organisms, including 45 Archaea, 183 Bacteria, and 21 Eukarya (data taken from [123]). The Venn diagram of RNA families and the distribution of Rfam clans and families in organisms were taken from Hoeppner et al. [129] and their Dataset S1. The lower diagram in each panel shows distribution patterns for FSFs, GO terms, and RNA families that are present in more than 60% of the organisms examined (f > 0.6). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

provide additional evidence in support of the archaeal rooting scenario.

6.1. Comparative Genomic Argumentation. The distribution of gene-encoded products in the genomes of sequenced organisms and parsimony thinking can reveal global evolutionary patterns without formal phylogenetic reconstruction. We will show how simple numerical analyses of protein domains, molecular activities defined by the gene ontology (GO) consortium [118], and RNA families that are domain-specific or are shared between domains of life can uncover the tripartite division of cellular life, exclude chimeric scenarios of origin, and provide initial insight on the rooting of the ToL (Figure 6).

The number of unique protein folds observed in nature is very small. SCOP ver. 1.75 defines only ∼2,000 fold superfamilies (FSFs), groups of homologous domains unified on the basis of common structural and evolutionary relationships, for a total of 110,800 known domains in proteins [113, 119]. FSFs represent highly conserved evolutionary units that are suitable for studying organismal diversification [87]. Yafremava et al. [120] plotted the total number of distinct FSFs (FSF diversity) versus the average reuse of FSFs in the proteome of an organism (FSF abundance). This exercise uncovered a scaling behavior typical of a Benford distribution, with a linear regime of proteomic growth for microbial organisms and a superlinear regime for eukaryotic organisms (Figure 8 in [120]). These same scaling patterns are observed when studying the relationship between open reading frames and genome size [121]. Remarkably, archaeal and eukaryal proteomes exhibited the minimum and maximum levels of FSF abundance and diversity, respectively. Bacterial proteomes, however, showed intermediate levels. We note that the general scaling behavior is consistent with a scenario in which evolutionary diversification proceeds gradually from simpler proteomes to more complex ones, supporting the principle of spatiotemporal continuity and revealing the nested phylogenetic hierarchies of organisms. Under this scenario (and the results of [120]), the streamlined archaeal proteomes represent the earliest form of cellular life. Remarkably, archaeal species harboring thermophilic and hyperthermophilic lifestyles encoded the most streamlined FSF repertoires (Figure 7). Clearly, modern thermophilic archaeons are most closely related to the ancient cells that inhabited planet Earth billions of years ago (also read below).
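The diversity-versus-abundance plot described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the toy proteomes are hypothetical, and reuse is taken here to be the total count of FSF occurrences in a proteome, which may differ from the exact measure used in [120].

```python
import math

def use_and_reuse(fsf_hits):
    """Return (use, reuse) for one proteome.

    fsf_hits: list of FSF identifiers, one entry per domain found in the proteome.
    use   = number of distinct FSFs (diversity).
    reuse = total number of FSF occurrences (abundance); assumed definition.
    """
    return len(set(fsf_hits)), len(fsf_hits)

def loglog_slope(points):
    """Least-squares slope of log10(reuse) against log10(use)."""
    xs = [math.log10(use) for use, _ in points]
    ys = [math.log10(reuse) for _, reuse in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical proteomes; the labels are toy placeholders, not census data.
proteomes = {
    "archaeon_1": ["c.37.1", "c.37.1", "a.4.5"],
    "bacterium_1": ["c.37.1"] * 4 + ["a.4.5"] * 3 + ["c.2.1"] * 2 + ["b.40.4"],
}
points = [use_and_reuse(hits) for hits in proteomes.values()]
print(points, loglog_slope(points))
```

A slope near 1 on the log-log plot would correspond to the linear regime of proteomic growth mentioned in the text, and a slope above 1 to the superlinear regime.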

Figure 7: A plot of FSF use (diversity) against FSF reuse (abundance) reveals a linear pattern of proteomic growth in 48 archaeal proteomes. Thermophilic archaeal species occupy positions that are close to the origin of the plot. They also populate the most basal branch positions in ToLs (see discussion in the main text). Both axes are in logarithmic scale. Points are marked as thermophilic or nonthermophilic; labeled species include S. marinus, I. hospitalis, H. butylicus, T. pendens, Cand. Korarchaeum, A. pernix, M. kandleri, T. neutrophilus, P. islandicum, and N. maritimus.

FSF domain distributions in the genomes of the three domains of life provide further insights into their evolution. Nasir et al. [122, 123] generated Venn diagrams to illustrate FSF sharing patterns in the genomes of Archaea, Bacteria, and Eukarya (Figure 6(a)). These diagrams display the total number of FSFs that are unique to a domain of life (taxonomic groups A, B, and E), shared by only two (AB, BE, and AE), and those that are universal (ABE). About half of the FSFs (786 out of 1,733) were present in all three domains of life, 453 of which were present in at least 60% of the organisms that were examined (f > 0.6, where f is the distribution index, ranging from 0 to 1). The fact that about 70% of the widely shared FSFs (672 in total) belong to the ABE taxonomic group strongly supports a common evolutionary origin for cells. Evolutionary timelines have confirmed that a large number of these universal FSFs were present in the ancestor of the three domains of life, the urancestor (Figure 1), and were for the most part retained in extant proteomes [48].
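The bookkeeping behind such Venn diagrams can be sketched as follows. The code assigns each feature (an FSF, a GO term, or an RNA family) to one of the seven taxonomic groups and computes a distribution index f. The names, the toy census, and in particular the choice to pool f over the organisms of the domains in the feature's own group are assumptions made for illustration; they are not taken from [122, 123].

```python
from collections import Counter

DOMAINS = "ABE"  # Archaea, Bacteria, Eukarya

def taxonomic_group(presence):
    """Name the Venn group (A, B, E, AB, AE, BE, or ABE) of one feature.

    presence: dict mapping a domain letter to the set of organisms of that
    domain whose genome encodes the feature.
    """
    return "".join(d for d in DOMAINS if presence.get(d))

def distribution_index(presence, totals):
    """Fraction f (0 to 1) of organisms that encode the feature.

    Assumption: f is pooled over the organisms of the domains that make up
    the feature's own taxonomic group.
    """
    group = taxonomic_group(presence)
    carriers = sum(len(presence[d]) for d in group)
    surveyed = sum(totals[d] for d in group)
    return carriers / surveyed

# Hypothetical census: 2 archaea, 3 bacteria, and 2 eukaryotes.
totals = {"A": 2, "B": 3, "E": 2}
census = {
    "feature_1": {"A": {"a1", "a2"}, "B": {"b1", "b2", "b3"}, "E": {"e1"}},
    "feature_2": {"B": {"b1"}, "E": {"e1"}},
    "feature_3": {"A": {"a1"}},
}
venn = Counter(taxonomic_group(p) for p in census.values())
widespread = [name for name, p in census.items()
              if distribution_index(p, totals) > 0.6]
print(dict(venn), widespread)
```

In this toy census only feature_1 passes the f > 0.6 filter, which mirrors how the lower diagrams of Figure 6 retain only widely distributed features.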

Interestingly, the number of BE FSFs was ∼10-fold greater than the number of AE and AB FSFs (324 versus 38 and 38) (Figure 6(a)). The finding that Bacteria and Eukarya encode such a large number of shared FSF domains is remarkable and hints at an unprecedentedly strong evolutionary association between these two domains of life. Moreover, a significant number of BE FSFs are widespread in bacterial and eukaryotic proteomes: 56 of the 324 FSFs are shared by more than 60% of the bacterial and eukaryal organisms that were analyzed (Figure 6(a)). This is strong evidence of an ancient vertical evolutionary trace from their mutual ancestor (anticipated in [123]). This trace uniquely supports the archaeal rooting of the ToL. In turn, the canonical and eukaryotic rooting alternatives are highly unlikely, as neither alone can explain the remarkable diversity of the BE taxonomic group (as well as its ancient origin [123]). The remarkable vertical trace of the ABE taxonomic group and the negligible vertical trace of the AB group (only one FSF shared by more than 60% of organisms) also falsify fusion hypotheses responsible for a putative chimeric makeup of Eukarya. In light of these findings, the most parsimonious explanation of comparative genomic data is that Archaea is the first domain of diversified cellular life.

The patterns of FSF sharing are further strengthened by the genomic distributions of terminal-level molecular function GO terms (Figure 6(b)). The three domains of life shared about a quarter (526) of the 1,924 GO terms that were surveyed. A total of 232 of these ABE terms were shared by more than 60% of the organisms analyzed. Again, the fact that about 76% of the widely shared GO terms (306 in total) belong to the ABE taxonomic group strongly supports a common vertical trace, while the finding that Bacteria and Eukarya share a large number of GO terms (272) again supports the archaeal rooting of the ToL. An alternative explanation for the very large size of BE could be large-scale metabolism-related gene transfer from the ancestors of mitochondria and plastids to the ancestors of modern eukaryotes [124]. However, we note that the BE group is not restricted to metabolic functions. It also includes FSFs and GOs involved in intracellular and extracellular processes, regulation, and informational functions [125]. Thus, the very large size of BE is a significant outcome most likely shaped by vertical evolutionary scenarios and cannot be explained solely by the parasitic/symbiotic relationships that exist between Bacteria and Eukarya [123]. Moreover, the simplicity of archaeal FSF repertoires is not due to a paucity of available archaeal genomic data. We confirmed that the mean FSF coverage (i.e., the number of proteins annotated to FSFs/GOs out of the total) for Archaea, Bacteria, and Eukarya was largely comparable (e.g., Table S1 in [122]). Finally, as we will describe below, these comparative patterns and tentative conclusions are confirmed by phylogenomic analysis [91, 122, 126]. Congruent phylogenies were obtained with (Figure 4(c)) and without equal and random sampling of taxa. Thus, the relatively low number of archaeal genomes is also not expected to compromise our inferences.

We note that Eukarya shares many informational genes with Archaea and many operational genes with Bacteria. While informational genes have been thought to be more refractory to HGT than noninformational genes, we have confirmed at the GO level that this is not the case. A ToL built from GO terms showed that noninformational terms were in fact less homoplasious, while statistical enrichment analyses revealed that HGT had little, if any, functional preference for GO terms across GO hierarchical levels. We recently compared ToLs reconstructed from non-HGT GO terms and ToLs reconstructed from informational GO terms that were extracted from non-HGT GO terms (Kim et al., ms. in review; also read below). In both cases, Archaea appeared as a basal paraphyletic group of the ToLs and the common origin of Bacteria and Eukarya was maintained. Thus, the 272 GO terms shared by Bacteria and Eukarya harbor a strong vertical trace.

Another observation often used as support for Archaea-Eukarya kinship is the discovery of a few eukaryote-specific proteins (e.g., actin, tubulin, H3, H4, ESCRT, ribosomal proteins, and others) in some archaeal species [127] and their complete absence from Bacteria (except for documented tubulin HGT from eukaryotes to the bacterial genus Prosthecobacter [128]). This suggests either that eukaryotes arose from an archaeal lineage [127] or that the ancestor of Eukarya and Archaea was complex and modern Archaea are highly reduced [25]. Indeed, a few eukaryote-specific proteins have now been found in some archaeal species (in some cases, in just one species). We argue that this poor spread cannot be taken as evidence for the Archaea-Eukarya sister relationship. The relationship needs to be confirmed by robust phylogenetic analysis, which is unfortunately not possible when using protein sequences. Recently, we described a new strategy for inferring vertical and horizontal traces [123]. This method calculates the spread of FSFs and GOs that are shared between two-superkingdom groups (i.e., AB, AE, and BE). Balanced distributions often indicate vertical inheritance, while biased distributions suggest horizontal flux. For example, penicillin binding molecular activity (GO:0008658) was present in 100% of the sampled bacterial proteomes but was only present in 11% of the archaeal species [123]. Thus, the presence of GO:0008658 in Archaea was attributed to HGT gain from Bacteria. Using this simple method, we established that Bacteria and Eukarya were united by a much stronger vertical trace than either was to Archaea. In fact, strong reductive tendencies in the archaeal genomes were recorded [123]. Thus, in our opinion, the presence of eukaryote-specific proteins in only very few archaeal species could in fact result from an HGT event that is not detectable by sequence phylogenies.
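The balanced-versus-biased reasoning described above can be illustrated with a minimal Python sketch. The function names and the cutoff `max_gap` are hypothetical and are assumptions of this sketch; the statistic actually used in [123] may differ. Only the two spreads for GO:0008658 come from the text.

```python
def spread(carriers, surveyed):
    """Fraction of surveyed proteomes of one superkingdom that carry a feature."""
    return carriers / surveyed

def classify_spread(f_first, f_second, max_gap=0.5):
    """Label a two-superkingdom feature as 'balanced' or 'biased'.

    max_gap is a hypothetical cutoff on the absolute difference between the
    two spreads; it is an assumption of this sketch, not a value from [123].
    """
    return "biased" if abs(f_first - f_second) > max_gap else "balanced"

# Penicillin binding (GO:0008658): spreads quoted in the text.
f_bacteria, f_archaea = 1.00, 0.11
print(classify_spread(f_bacteria, f_archaea))  # biased, suggesting horizontal flux

# Hypothetical feature with a similar spread in two superkingdoms.
print(classify_spread(spread(150, 183), spread(17, 21)))
```

Under this reading, a biased label flags a candidate for horizontal transfer, while a balanced label is compatible with vertical inheritance; neither is conclusive on its own.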

In turn, many arguments now favor the Bacteria-Eukarya sisterhood, in addition to the structure-based phylogenies and the balanced distribution of molecular features. For example, Bacteria and Eukarya have similar lipid membranes, which can be used as an argument for their evolutionary kinship. Moreover, Archaea fundamentally differ from Eukarya in terms of their virosphere. Viruses infecting Archaea and Eukarya are drastically different, as recently discussed by Forterre [25].

The genomic distribution of RNA families was taken from Hoeppner et al. [129] and also shows a vertical evolutionary trace in five crucial Rfam clans that are universal, including tRNA, 5S rRNA, subunit rRNA, and RNase P RNA. These universal RNA groups are likely minimally affected by HGT. However, 99% of Rfam clans and families are specific to domains of life, and only 11 of the 1,148 groups were shared at f > 0.6 levels. This clearly shows that the functional complexity of RNA materialized very late during organismal diversification and that it is not a good genomic feature for exploring the rooting of the ToL. Only five RNA families are shared between two domains of life, and only one of these does so at f > 0.6: the G12 pseudoknot of the 23S rRNA, which is present in bacterial and eukaryotic organellar rRNA. While the large subunit rRNA scaffold supports the G12 pseudoknot with its vertical trace, all five interdomain RNA families can be explained most parsimoniously by HGT. Thus, only a handful of ancient and universal RNA species can be used to root the ToL.

We end by noting that the Venn diagrams consistently show that Archaea harbors the smallest number of unique (A) and shared (AB and AE) FSFs, GO terms, or RNA families. This trend supports an early divergence of this domain of life from the urancestor and the possibility that such divergence was shaped by evolutionary reductive events. We reason that such losses would be more parsimonious early in evolution than in later periods. This is because genes often increase their abundance in evolutionary time (by gene duplications, HGT, and other processes). Thus, it is reasonable to think that the loss of an ancient gene would be more feasible very early in evolution than very late.

6.2. Phylogenomic Evidence from the Sequence and Structure of RNA. Phylogenetic analyses of the few RNA families that are universal and display an important vertical evolutionary trace (Figure 6(c)) provide compelling evidence in favor of an early evolutionary appearance of Archaea [105, 107, 108, 130–135]. Here we briefly summarize evidence from tRNA, 5S rRNA, and RNase P RNA. Unpublished analyses of rRNA sequence and structure using advanced phylogenetic methods also show that Archaea was the first domain of life.

tRNA molecules are generally short (∼73–95 nucleotides in length) and highly conserved. Consequently, their sequences generally contain a limited amount of phylogenetic information. These limitations have been overcome by analyzing entire tRNomes [136], which extend the length of an organismal set of tRNAs to over 2,000 bases. Xue et al. [130, 131] analyzed the genetic distances between tRNA sequences as averages between alloacceptor tRNAs from diverse groups of tRNomes using multiple molecular base substitution models. The distances were mapped onto an unrooted phylogeny of tRNA molecules (Figure 8(a)). The “arrow of time” assumption in these studies is that ancient tRNA paralogs closely resemble each other when lineages originate close to the time of the gene duplication. Remarkably, the results revealed a paraphyletic rooting of the ToL in Archaea. The root was specifically located close to the hyperthermophilic methanogen Methanopyrus kandleri (Figure 8(a)). This specific rooting scenario has been supported by several other studies [134, 137], including a study of genetic distances between paralogous pairs of aminoacyl-tRNA synthetase (aaRS) proteins [131]. A remarkable match between distance scores of tRNA and pairs of aaRS paralogs (Figure 8(b)) not only confirms the early appearance of Archaea but also suggests a coevolutionary trace associated with the molecular interactions that are responsible for the genetic code. In fact, a recent exhaustive phylogenomic analysis of tRNA and aaRS coevolution explicitly reveals the origins and evolution of the genetic code and the underlying molecular basis of genetics [59]. Di Giulio [132, 133] has also proposed an archaeal rooting of the tree of life, specifically in the lineage leading to the phylum Nanoarchaeota. This rooting is based on unique and ancestral genomic traits of Nanoarchaeum equitans: split genes separately codifying for the 5′ and 3′ halves of tRNA and the absence of operons, which are considered molecular fossils [132]. However, this claim needs additional support, as contrasting evidence now recognizes N. equitans as a highly derived archaeal species [138].
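The idea of averaging distances over alloacceptor tRNA pairs can be sketched in Python. This is a simplified illustration: it uses a plain mismatch fraction (p-distance) in place of the base substitution models applied in [130, 131], assumes one pre-aligned sequence per amino acid, and uses hypothetical function names and toy sequences.

```python
from itertools import combinations

def p_distance(seq_a, seq_b):
    """Fraction of mismatched positions between two aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return mismatches / len(seq_a)

def mean_alloacceptor_distance(trnas):
    """Average p-distance over all pairs of tRNAs that accept different amino acids.

    trnas: dict mapping an amino acid to one aligned tRNA sequence.
    With 20 amino acids this averages over 20 * 19 / 2 = 190 pairs.
    """
    pairs = list(combinations(sorted(trnas), 2))
    return sum(p_distance(trnas[a], trnas[b]) for a, b in pairs) / len(pairs)

# Hypothetical aligned fragments (not real tRNA sequences).
toy = {"Ala": "GGGCGUA", "Gly": "GCGCGUA", "Ser": "GCGAGUU"}
print(round(mean_alloacceptor_distance(toy), 3))
```

Under the “arrow of time” assumption stated above, a genome with a lower average would be read as retaining more of the ancestral similarity among tRNA paralogs.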

Figure 8: Ancient phylogenetic signal in the sequence of tRNA and their associated aminoacyl-tRNA synthetase (aaRS) enzymes. (a) Unrooted ToL derived from tRNA sequences with their alloacceptor D_allo distances traced in thermal scale (from [130]). D_allo is the average of pairwise distances for 190 pairs of tRNA alloacceptors. These distances measure pairwise sequence mismatches of tRNAs for every genome, and their values increase for faster evolving sequences of species of more recent origin. (b) Coevolution of aaRSs and their corresponding tRNA. Genetic distances between the top 10 potentially paralogous aaRS pairs, estimated using BLASTP, define a measure (Q_ars) of how closely the proteins resemble each other in genomes (from [131]). Larger Q_ars scores imply more ancestral and slowly evolving protein pairs. The plot of Q_ars scores (aminoacyl-tRNA synthetase history) against D_allo distances (tRNA alloacceptor history) reveals a hidden correlation between the evolution of tRNA and aaRSs and the early origin of Archaea (larger Q_ars and lower D_allo distances).

In addition to sequences, structural features of tRNA molecules also support the archaeal rooting of the ToL. The application of RNA structural evidence in phylogenetic studies [50, 139–141] has multiple advantages over sequence data when studying ancient events, especially because RNA structures are far more conserved than sequences. This has been demonstrated in a phylogenetic approach that uses RNA structural information to reconstruct the evolutionary history of macromolecules such as rRNA [29, 50], tRNA [107, 108], 5S rRNA [105], RNase P RNA [135], and SINE RNA [142]. Geometrical and statistical properties of structure (e.g., stems or loops commonly found in the secondary structures of RNA molecules) are treated as linearly ordered multistate phylogenetic characters. In order to build rooted trees, an evolutionary tendency toward conformational order is used to polarize change in character state transformations. This defines a hypothetical ancestor with which to root the ingroup using the Lundberg method. Reconstructed phylogenies produce trees of molecules and ToLs (e.g., Figure 4(a)) or trees of substructures that describe the gradual evolutionary accretion of structural components into molecules (Figure 9). For example, phylogenetic trees of tRNA substructures define explicit models of molecular history and show that tRNA originated in the acceptor stem of the molecule [108]. Remarkably, trees reconstructed from tRNA drawn from individual domains of life demonstrate that the sequence of accretion events occurred differently in Archaea than in Bacteria and Eukarya, suggesting a sister group evolutionary relationship between the bacterial and eukaryotic domains (Figure 9(a)). A similar result obtained from trees of 5S rRNA substructures revealed different molecular accretion sequences of archaeal molecules when these were compared to bacterial and eukaryal counterparts [105], confirming again, in a completely different molecular system, the history of the domains of life.

An analysis of the structure of RNase P RNA also provides similar conclusions [135]. While a ToL reconstructed from molecular structure placed type A archaeal molecules at its base (a topology that resembles the ToL of 5S rRNA), a tree of RNase P RNA substructures uncovered the history of molecular accretion of the RNA component of the ancient endonuclease and revealed a remarkable reductive evolutionary trend (Figure 9(b)). Molecules originated in stem P12 and were immediately accessorized with the catalytic P1–P4 pseudoknotted core structure that interacts with RNase P proteins of the endonuclease complex and ancient segments of tRNA. Soon after this important accretion stage, the evolving molecule lost its first stem in Archaea (stem P8), several accretion steps earlier than the first loss of a stem in Eukarya or the first appearance of a Bacteria-specific stem. These phylogenetic statements provide additional strong support for the early origin of the archaeal superkingdom prior to the divergence of the shared common ancestor of Bacteria and Eukarya. As we will discuss below, the early loss of a structure in the molecular accretion process of a central and ancient RNA family is significant. It suggests that the emerging archaeal lineages were subjected to strong reductive evolutionary pressures during the early evolution of a very ancient RNA molecule.

Figure 9: The history of accretion of tRNA and RNase P RNA substructures reveals the early evolutionary appearance of Archaea. (a) Rooted trees of tRNA arm substructures reveal the early appearance of the acceptor arm (Acc) followed by the anticodon arm (AC) in Archaea or the pseudouridine (TΨC) arm in both Bacteria and Eukarya (from [108]). The result confirms the sister group relationship of Bacteria and Eukarya. (b) Trees of molecular substructures of RNase P RNAs were reconstructed from characters describing the geometry of their structures (from [135]). Branches and corresponding substructures in a 3D atomic model are colored according to the age of each substructure (nd, node distance). Note the early loss of stem P8 in Archaea immediately after the evolutionary assembly of the universal functional core of the molecule.

6.3. Phylogenomic Evidence from Protein Domain Structure. G. Caetano-Anollés and D. Caetano-Anollés [41] were the first to utilize protein domain structures as taxa and to reconstruct trees of domains (ToDs) describing their evolution. Figure 10 shows a rooted ToD built from a census of FSF structures in 981 genomes (data taken from [122, 123]). These trees are unique in that their terminal leaves represent a finite set of component parts [87]. These parts describe at global level the structural diversity of the protein world. When building trees of FSFs, the age of each FSF domain structure can be calculated from the ToDs by simply calculating a distance in nodes between the root of the tree and its corresponding terminal leaf. This is possible because ToDs exhibit highly unbalanced (pectinate) topologies that are the result of semipunctuated processes of domain appearance and accumulation. This node distance (nd), rescaled from 0 to 1, provides a relative timescale to study the order of FSF appearance in evolutionary history [126, 143]. Wang et al. [144] showed that nd correlates linearly with geological time and defines a global molecular clock of protein folds. Thus, nd can be used as a reliable proxy for time. Plotting the age of FSFs in each of the seven taxonomic groups confirmed evolutionary statements we had previously deduced from the Venn diagrams of Figure 6. The ABE taxonomic group included the majority of ancient and widely distributed FSFs. This is expected. In the presence of a strong vertical trace, molecular diversity must delimit a nested taxonomic hierarchy. The ABE group was followed by the evolutionary appearance of the BE group, which preceded the first domain-specific structures, which were Bacteria-specific (B). Remarkably, Archaea-specific (A) and Eukarya-specific (E) structures appeared concurrently and relatively late. These general trends, captured in the box plots of Figure 10, have been recovered repeatedly when studying domain structures at various levels of structural complexity, from folds to fold families [91, 126], when using CATH or SCOP structural definitions [122, 145], or when exploring the evolution of terminal GO terms.
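The node distance described above can be illustrated with a minimal sketch. It assumes a rooted tree stored as a dictionary of child lists, and it assumes that nd counts the nodes passed below the root so that the most basal leaf receives nd = 0; the function name, the toy tree, and its labels are hypothetical and are not taken from the original analyses.

```python
from collections import deque

def node_distance(children, root):
    """Return nd for each leaf of a rooted tree given as {node: [child, ...]}."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            depth[child] = depth[node] + 1
            queue.append(child)
    # Leaves have no children; a leaf at depth d lies d - 1 nodes below the root.
    leaves = [node for node in depth if not children.get(node)]
    steps = {leaf: depth[leaf] - 1 for leaf in leaves}
    longest = max(steps.values())
    if longest <= 0:  # degenerate case: every leaf attaches directly to the root
        return {leaf: 0.0 for leaf in leaves}
    # Rescale so that the most basal leaf is 0 and the most derived leaf is 1.
    return {leaf: steps[leaf] / longest for leaf in leaves}

# Hypothetical pectinate (comb-like) tree with four FSF leaves.
toy_tree = {
    "root": ["fsf_a", "n1"],
    "n1": ["fsf_b", "n2"],
    "n2": ["fsf_c", "fsf_d"],
}
nd = node_distance(toy_tree, "root")
for leaf in sorted(nd):
    print(leaf, nd[leaf])
```

In a pectinate topology each successive split adds one node to the path, which is why a simple node count orders the leaves by relative age.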

While the early rise of BE FSFs supports the early divergence of Archaea from the urancestor, the very significant trend of gradual loss of structures occurring in the lineages of the archaeal domain and the very late appearance of Archaea-specific structures (e.g., [126]) demand explanation. Since ancient BE FSFs are widely distributed in proteomes (Figure 10), they cannot arise from separate gains of FSFs in Bacteria and Eukarya or by processes of horizontal spread of structures. This was already evident from the Venn diagrams of Figure 6. Moreover, the BE sisterhood to the exclusion of Archaea was further supported by the inspection of FSFs involved in lipid synthesis and transport (Table 1). Membrane lipids are very relevant to the origins of diversified life ([146] and references therein). Bacteria and Eukarya encode similar lipid membranes, while archaeal membranes have different lipid composition (isoprenoid ethers). To check if

[Figure 10 graphic: rooted ToD (taxa = 1733 FSF domains; characters = 981 genomes; tree length = 458226 steps; CI = 0.049, RI = 0.796, and g1 = −0.08) with the origin of FSF domains at its base and the basal FSFs c.55.1, c.37.1, b.40.4, c.2.1, a.4.5, and c.66.1 labeled. Taxa are colored by distribution index f (number of FSFs in parentheses): 0.0 to 0.1 (529), 0.1 to 0.5 (623), 0.5 to 0.9 (336), 0.9 to 1.0 (228), and f = 1.0 (17). Box plots show the age of FSFs (nd, 0.0 to 1.0) for the ABE, BE, AB, AE, B, A, and E taxonomic groups.]

Figure 10: Phylogenomic tree of domains (ToD) describing the evolution of 1733 FSF domain structures. Taxa (FSFs) were colored according to their distribution (f) in the 981 genomes that were surveyed and used as characters to reconstruct the phylogenomic tree (data taken from [122, 123]). The most basal FSFs are labeled with SCOP alphanumeric identifiers (e.g., c.37.1 is the P-loop containing nucleoside triphosphate hydrolase FSF). Boxplots display the age (nd value) distribution of FSFs for the seven possible taxonomic groups. nd values were calculated directly from the tree [122] and define a timeline of FSF innovation from the origin of proteins (nd = 0) to the present (nd = 1). The group of FSFs that are shared by the three domains of life (ABE) is the most ancient taxonomic group, which spans the entire time axis, and their FSFs are widely distributed in genomes. The appearance of the BE group coincides with the first reductive loss of an FSF in Archaea. FSF structures specific to domains of life appear much later in evolution.

lipid synthesis was another BE synapomorphy, we identified 17 FSFs that were involved in lipid metabolism and transport


Table 1: List of FSFs involved in lipid metabolism and transport, along with taxonomic distribution (data taken from [122, 123]).

Group  SCOP Id  FSF Id   FSF description
ABE    89392    b.125.1  Prokaryotic lipoproteins and lipoprotein localization factors
ABE    53092    c.55.2   Creatinase/prolidase N-terminal domain
ABE    49723    b.12.1   Lipase/lipooxygenase domain (PLAT/LH2 domain)
ABE    54637    d.38.1   Thioesterase/thiol ester dehydrase-isomerase
ABE    63825    b.68.5   YWTD domain
BE     47027    a.11.1   Acyl-CoA binding protein
BE     48431    a.118.4  Lipovitellin-phosvitin complex, superhelical domain
BE     55048    d.58.23  Probable ACP-binding domain of malonyl-CoA ACP transacylase
BE     56968    f.7.1    Lipovitellin-phosvitin complex, beta-sheet shell regions
BE     58113    h.5.1    Apolipoprotein A-I
BE     47162    a.24.1   Apolipoprotein
BE     56931    f.4.2    Outer membrane phospholipase A (OMPLA)
B      82220    b.120.1  Tp47 lipoprotein, N-terminal domain
E      47699    a.52.1   Bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin
E      82936    h.6.1    Apolipoprotein A-II
E      57190    g.3.10   Colipase-like
E      49594    b.7.4    Rab geranylgeranyltransferase alpha-subunit, insert domain

(Table 1). Remarkably, the largest share of these FSFs (7 out of 17) was unique to the BE group. In comparison, none were present in either AE or AB groups. The ABE group included five universal FSFs, while one was unique to Bacteria and four were Eukarya-specific. In turn, no FSF was unique to Archaea. The BE FSFs cannot be explained by modern effects impinging on variations in proteomic accumulation in FSFs or by processes of domain rearrangement, since these appear in the protein world quite late in evolution [92]. The most parsimonious explanation of the patterns of FSF distribution that unfold in the ToD is the very early (and protracted) rise of the archaeal domain by processes of reductive evolution, possibly triggered by the adaptation of urancestral lineages to harsh environments and survival modes. Under extremophilic conditions, typical of hyperthermophilic environments, considerable investments of matter-energy and information must be made for protein persistence [120]. This puts limits on viable protein structures [147]. Extremophilic environments will thus poise the maintenance of a limited set of FSFs for persistence of emergent diversified lineages. This would induce a primordial episode of reductive evolution in the growing FSF repertoire, explaining why hyperthermophilic and thermophilic archaeal species hold the most reduced proteomes (Figure 7). It would also explain the biases that exist in FSFs, GO terms, and RNA families (Figure 6) and the placement of hyperthermophilic and thermophilic archaeal species at the base of ToLs. Since Archaea populate the oceans and sometimes rival in number Bacteria in those environments, we further interpret the late appearance of Archaea-specific FSFs as the result of late colonization of these mild environments by both ancient archaeons and emerging eukaryotes. This relaxes primordial extremophilic pressures on protein structures and enables the late archaeal exploration of structural flexibility and functional novelty.
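The group counts quoted for the lipid-related FSFs can be checked with a short tally. The sketch below transcribes the (group, FSF Id) pairs of Table 1, writing the identifiers in dotted SCOP notation; the variable names are illustrative only.

```python
from collections import Counter

# (taxonomic group, FSF Id) pairs transcribed from Table 1.
TABLE_1 = [
    ("ABE", "b.125.1"), ("ABE", "c.55.2"), ("ABE", "b.12.1"),
    ("ABE", "d.38.1"), ("ABE", "b.68.5"),
    ("BE", "a.11.1"), ("BE", "a.118.4"), ("BE", "d.58.23"), ("BE", "f.7.1"),
    ("BE", "h.5.1"), ("BE", "a.24.1"), ("BE", "f.4.2"),
    ("B", "b.120.1"),
    ("E", "a.52.1"), ("E", "h.6.1"), ("E", "g.3.10"), ("E", "b.7.4"),
]

# Count FSFs per taxonomic group; groups absent from the table count as zero.
counts = Counter(group for group, _ in TABLE_1)
for group in ("ABE", "BE", "AB", "AE", "A", "B", "E"):
    print(f"{group:>3}: {counts[group]:2d} of {len(TABLE_1)}")
```

The tally shows the asymmetry discussed in the text: the BE group holds the most entries, while the A, AB, and AE groups hold none.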

6.4. Full Circle: Evidence from Trees of Proteomes and Functionomes and a Tree Derived from the Distribution of Viral Replicons in Superkingdoms. While we distrust trees of systems, especially ToLs built from sequences, the use of molecular structure at high levels of structural abstraction has the potential to mitigate some limitations of sequence analysis [93]. For example, rooted ToLs built from abundance counts of domain structures and terminal GO terms in the genomes of free-living organisms describe the evolution of proteomes (e.g., [91]). All ToL reconstructions of these kinds approximate the physiology of living organisms, dissect the three primary domains of life, and reveal the early paraphyletic origin of extremophilic archaeal lineages followed by the late appearances of monophyletic Bacteria and Eukarya. These patterns have been reliably recovered with datasets of varying sizes, irrespective of the structural classification scheme [91, 122, 126, 145]. Even a tree reconstructed from the distribution of 2662 viral replicons in superkingdoms from an exhaustive comparative genomic analysis of viral genomes showed the basal placement of Archaea and the sister taxa relationship between Bacteria and Eukarya (Figure 11).
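Trees of this kind treat abundance counts as multistate characters, and the caption of Figure 11 states that replicon abundance (0 to 759) was scored on a 0–20 scale. The transformation itself is not given in this section, so the sketch below simply assumes a logarithmic rescaling; the function name and the input counts are hypothetical.

```python
import math

def rescale_abundance(counts, n_states=20):
    """Map raw counts onto integer character states 0..n_states (log scale, an assumption)."""
    largest = max(counts)
    if largest == 0:
        return [0 for _ in counts]
    # ln(c + 1) keeps zero counts at state 0; the largest count maps to n_states.
    return [round(n_states * math.log(c + 1) / math.log(largest + 1)) for c in counts]

# Hypothetical counts; only the upper bound (759) comes from the Figure 11 caption.
raw = [0, 12, 150, 759]
states = rescale_abundance(raw)
print(states)
```

A logarithmic scale compresses very large counts so that a few highly abundant characters do not dominate the parsimony score; a linear rescaling would be an equally simple alternative.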

6.5. Additional Evidence from Comparative Genomics. The uneven distribution of protein domain structures in the world


[Figure 11 graphic: tree with terminal taxa Archaea, Bacteria, and Eukarya; scale bar = 5 changes; bootstrap support = 100; abundance vectors at internal nodes: (5, 0, 0, 0, 0, 0) and (36, 5, 5, 5, 0, 0).]

Figure 11: One of three optimal phylogenetic tree reconstructions with identical topologies recovered from an exhaustive maximum parsimony search (58 steps; CI = 1, RI = 1, HI = 0, RC = 1, g1 = −0.707) of the abundance of viral replicon types of dsDNA, ssDNA, dsRNA, ssRNA(+), ssRNA(−), and retrotranscribing viruses. Abundance was scored on a 0–20 scale and ranged from 0 to 759 viral replicons. Vectors of abundance reconstructions in internal nodes are given as percentage of total abundance of replicons in superkingdoms. Bootstrap support values are shown below nodes.
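The tree statistics quoted in the caption follow the standard parsimony definitions: CI is the minimum possible number of steps divided by the observed tree length, RI compares the observed length with the maximum and minimum possible lengths, HI = 1 − CI, and RC = CI × RI. A minimal sketch follows; the function name and the maximum-step values are hypothetical, since only the tree length of 58 steps is reported.

```python
def ensemble_indices(min_steps, observed_steps, max_steps):
    """Compute consistency (CI), retention (RI), homoplasy (HI), and rescaled consistency (RC)."""
    ci = min_steps / observed_steps
    ri = (max_steps - observed_steps) / (max_steps - min_steps)
    return {"CI": ci, "RI": ri, "HI": 1.0 - ci, "RC": ci * ri}

# No homoplasy: the observed length equals the minimum (58 steps, as in Figure 11).
# max_steps = 100 is hypothetical; any value above 58 gives the same indices.
perfect = ensemble_indices(58, 58, 100)

# Hypothetical tree with homoplasy, for contrast.
homoplastic = ensemble_indices(40, 80, 120)

print(perfect)
print(homoplastic)
```

With CI = 1 and RI = 1 the remaining values reported in the caption (HI = 0, RC = 1) follow directly.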

of proteomes (Figure 6) is preserved as we climb in the structural hierarchy. This was recently made evident when studying the evolution of CATH domains [145]. The Venn diagrams of Figure 12 show how domain structures in all taxonomic groups decrease in numbers with increases in evolutionary conservation. At the highest CATH architectural level, there were 32 universal architectures but no domain-specific architectures. The four architectures shared by Bacteria and Eukarya were present in at least 60% of the proteomes that were surveyed. The other two interdomain architectures were topological designs of considerable complexity that were poorly shared between proteomes and were evolutionarily derived. They were the clam architectures, lost in Eukarya, and the box architectures of nucleotide excision repair, shared by Archaea and Eukarya. The most parsimonious corollary of these distribution patterns is that the BE taxonomical group must arise by loss of structures in Archaea. Indeed, ToDs describing the evolution of CATH domain structures again confirm the early appearance of BE structures and, consequently, their loss in Archaea.
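The Venn partitions and sharing levels discussed above can be derived from a presence/absence census. The sketch below assigns each structure to one of the seven taxonomic groups and computes the fraction of proteomes that contain it (the distribution index f); the census, the proteome names, and the structure labels are hypothetical toy data.

```python
from collections import Counter

# Hypothetical toy census: proteome -> (superkingdom, set of domain structures).
CENSUS = {
    "archaeon_1": ("Archaea", {"s1", "s4"}),
    "bacterium_1": ("Bacteria", {"s1", "s2", "s3"}),
    "bacterium_2": ("Bacteria", {"s1", "s2"}),
    "eukaryote_1": ("Eukarya", {"s1", "s2", "s4"}),
}

def taxonomic_group(structure, census):
    """Return the Venn group (A, B, E, AB, AE, BE, or ABE) of a structure."""
    present = {kingdom for kingdom, structures in census.values() if structure in structures}
    codes = (("A", "Archaea"), ("B", "Bacteria"), ("E", "Eukarya"))
    return "".join(code for code, kingdom in codes if kingdom in present)

def distribution_index(structure, census):
    """Return f, the fraction of surveyed proteomes that contain the structure."""
    hits = sum(structure in structures for _, structures in census.values())
    return hits / len(census)

all_structures = set().union(*(structures for _, structures in CENSUS.values()))
groups = {s: taxonomic_group(s, CENSUS) for s in sorted(all_structures)}
group_sizes = Counter(groups.values())

for s, g in groups.items():
    print(s, g, distribution_index(s, CENSUS))
```

In this toy census a structure found in both Bacteria and Eukarya but absent from Archaea falls in the BE group, which is the pattern the text attributes to loss in Archaea when such structures are also ancient and widely distributed.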

7. Paraphyletic Origins: Grades and Clades in Archaeal History

Saying that “a ToL is rooted in a domain of life” is an incorrect statement that comes from phylogenetic methodology (the use of outgroups) and the tendency to look at the past with modern eyes. Clades in ToLs have been rooted relative to each other by generating unrooted trees and by defining extant organisms as outgroup taxa. The ToL, however, must be considered rooted in the urancestor of cellular life (Figure 1). This planted edge that connects to the ingroup of extant organisms represents a cellular state in which productive diversification (in the sense of successful lineages) was absent. The primordial urancestral edge leads to a “phase transition,” the last universal cellular ancestor (LUCA), of which little is known. The physiology of the urancestor cannot be considered linked to that of any extant organism, even if it shared a common molecular core with all of them. The urancestor was not an archaeon, a bacterium, or a eukaryote [148]. It was not necessarily thermophilic. Perhaps it was a communal entity or a megametaorganism in the sense of a modern syncytium (the result of multiple cell fusions) and a modern coenocyte (the result of multiple cell divisions). The organismal boundaries were likely present, judging by the number of widely distributed protein domains that associate with membranes and appear at the base of our ToDs and by the universal existence of acidocalcisome organelles [149]. However, the molecular makeup of the urancestral cells was most likely fluid and quasistatistical, the repertoire distributed unequally in the urancestral populations of communal parts, of course within confines delimited by persistence. This urancestral population is therefore consistent with the idea of a primordial stem line proposed by Kandler and Woese [150, 151]. However, it was relatively richer in molecular structures and functions, as opposed to the simple cellular systems hypothesized by Woese. This richness is confirmed by modern analyses of proteomes and functionomes that reveal a vast number of universally shared protein domains and GOs among the three superkingdoms. While each syncytial/coenocyte element of the megaorganism exchanged component parts in search of cellular stability and persistence, the process could not be equated with modern HGT. The exchanging community of primordial cells was not cohesive enough to make the horizontal exchange meaningful. Macromolecules most likely established loose and diverse associations with each other and with smaller molecules, limited by the short average life of their unevolved structural conformations. With time, molecules with better-optimized properties engaged in more durable interactions, stabilizing the emergent cells and providing increased cellular cohesiveness. This poised the urancestral community towards a phase transition (a crystallization [148]), a point in which cellular groups had distinct properties and could be individuated. We believe this was the time of the origin of the archaeal lineage, 2.9 billion years ago [48].

At the base of the ToLs that were reconstructed from genomic data, basal archaeal taxa arise as paraphyletic lineages (Figures 4(a) and 4(c)). These lineages likely arose from subgroups of the urancestral population that pervasively lost crucial domain structures and molecular functions. This represents an evolutionary grade under the scenario described above. The emerging lineages shared with the urancestral community a unifying condition that was related to archaic biochemistry. In other words, the urancestral and emerging archaeal lineages expressed fundamental structural and functional equivalences in terms of their repertoires but revealed in each emerging and durable paraphyletic lineage a handful of distinct, newly developed traits. These traits could be global, such as increased thermostability of some crucial members of the protein repertoire or change in the membrane makeup, or local, such as the selective loss of crucial structures and functions. Figure 13 uses the tree paradigm to portray the structural and functional equivalences of the urancestral and emerging archaeal lineages and the slow progression from grades to clades.


[Figure 12 graphic: three Venn diagrams (panels: Superfamilies, Topologies, and Architectures; circles: Archaea, Bacteria, and Eukarya) arranged along an axis of increasing evolutionary conservation.]

Figure 12: Venn diagrams displaying the distributions of 2221 homologous superfamilies, 1152 topologies, and 38 architectures of CATH domains in the proteomes of 492 fully sequenced genomes (from [145]). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

[Figure 13 graphic: cartoon of emerging lineages; labels: Grades, Clades, Time 0, and Time 1.]

Figure 13: From grades to clades. The cartoon describes a possible progression of modes of organismal diversification during the rise of primordial archaeal lineages. The width of emerging lineages is proportional to uniquely identifying features of physiological and molecular complexity.

8. Through the Wormhole: The Makeup of the “Megaorganism” and the Emerging Archaeal Lineages

Character state reconstructions of proteome repertoires derived from ToLs, coupled to the timelines of ToDs, provide an effective way to define the ancestral protein domain complement of the urancestor [48] and, consequently, the likely makeup of the emerging archaeal lineages. The urancestral proteome possessed a lower bound of ∼70 FSF domain structures, 75% of which were composed of α/β and α+β proteins. About 50% of FSFs were part of metabolic enzymes, including a rich toolkit of transferases and enzymes of nucleotide metabolism. The rest of the domains were involved in functions related to information (translation, replication, and repair), intracellular processes (transport, protein modification, and proteolytic activities), regulation (kinases/phosphatases and DNA binding functions), and small molecule binding. The urancestor had a limited repertoire of aaRSs and translation factors. It contained a primordial ribosome with a limited core of universal ribosomal proteins. It had numerous membrane proteins necessary for transport, including a relatively advanced ATP synthetase complex, and structures necessary for cellular organization (filaments and primordial cytoskeletal structures). The cells lacked enzymes for deoxyribonucleotide production, so it is likely that the cellular urancestor itself did not harbor a DNA genome. The cells lacked functions related to extracellular processes (cell adhesion, immune response, and toxins/defense) and cellular motility, suggesting an ancient living world without competitive strategies of survival.

9. Conclusions

The rooting of the ToL has always been controversial in evolutionary biology [26, 152, 153]. While it is popularly accepted that the ToL based on sequence phylogenies is rooted in the akaryotes and that Archaea and Eukarya are sister groups to each other, only two of the three main steps of phylogenetic analysis [104] have been partially fulfilled with sequences. These steps are the selection of an appropriate statistical or nonstatistical evolutionary model of character change and the choice of an optimization method for phylogenetic tree reconstruction. However, no adequate method exists for character polarization that identifies ancestral and derived character states in sequences. In the absence of robust polarization methodology, any statement about the rooting of the ToL should be considered suspect or subject to apriorism. Here we show that information derived from a genomic structural and functional census of millions of encoded proteins and RNAs, coupled with process models that comply with Weston’s generality criterion, provides the means to dissect the origins of diversified life. The generality criterion is fulfilled in these studies by focusing on the accumulation of modules, such as protein domain structures, elements of RNA substructures, or ontological definitions of molecular function. In general, these features are the subject of accretion processes that comply with additive phylogenetic change within the nested taxonomic hierarchy and result in changes of abundance. These processes include those responsible for the growth of molecules (e.g., multidomain proteins), molecular ensembles (e.g., the ribosome), and molecular repertoires (e.g., proteomes). The new methods unfold a consistent evolutionary scenario in which the origin of diversified life traces back to the early history of Archaea. Remarkably, the archaic origin of this microbial urkingdom now does justice to its name.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgments

The authors would like to thank members and friends of the GCA laboratory for fruitful discussions. This research has been supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791) and the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625) to Gustavo Caetano-Anollés and from KRIBB Research Initiative Program and the Next-Generation BioGreen 21 Program, Rural Development Administration (PJ0090192013), to Kyung Mo Kim.

References

[1] C. R. Woese, “A new biology for a new century,” Microbiology and Molecular Biology Reviews, vol. 68, no. 2, pp. 173–186, 2004.

[2] T. Grant, “Testing methods: the evaluation of discovery operations in evolutionary biology,” Cladistics, vol. 18, no. 1, pp. 94–111, 2002.

[3] E. O. Wiley, “Karl R. Popper, systematics, and classification: a reply to Walter Bock and other evolutionary taxonomists,” Systematic Zoology, vol. 24, no. 2, pp. 233–243, 1975.

[4] J. S. Farris, “Parsimony and explanatory power,” Cladistics, vol. 24, no. 5, pp. 825–847, 2008.

[5] G. Caetano-Anollés, K. M. Kim, and D. Caetano-Anollés, “The phylogenomic roots of modern biochemistry: origins of proteins, cofactors and protein biosynthesis,” Journal of Molecular Evolution, vol. 74, no. 1-2, pp. 1–34, 2012.

[6] W. Fontana, “Modeling “evo-devo” with RNA,” BioEssays, vol. 24, no. 12, pp. 1164–1177, 2002.

[7] W. Hennig, Phylogenetic Systematics, University of Illinois Press, Urbana, Ill, USA, 1999.

[8] D. Fraix-Burnet, T. Chattopadhyay, A. K. Chattopadhyay, E. Davoust, and M. Thuillard, “A six-parameter space to describe galaxy diversification,” Astronomy and Astrophysics, vol. 545, article A80, 24 pages, 2012.

[9] E. Sober, “The contest between parsimony and likelihood,” Systematic Biology, vol. 53, no. 4, pp. 644–653, 2004.

[10] E. K. Lienau and R. DeSalle, “Is the microbial tree of life verificationist?” Cladistics, vol. 26, no. 2, pp. 195–201, 2010.

[11] J. S. Huxley, “Evolutionary processes and taxonomy with special reference to grades,” Uppsala Universitets Årsskrift, vol. 6, pp. 21–39, 1958.

[12] C. R. Woese and G. E. Fox, “Phylogenetic structure of the prokaryotic domain: the primary kingdoms,” Proceedings of the National Academy of Sciences of the United States of America, vol. 74, no. 11, pp. 5088–5090, 1977.

[13] P. Forterre, “The universal tree of life and the last universal cellular ancestor: revolution and counterrevolutions,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 43–62, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[14] J. P. Gogarten, H. Kibak, P. Dittrich et al., “Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 17, pp. 6661–6665, 1989.

[15] N. Iwabe, K. Kuma, M. Hasegawa, S. Osawa, and T. Miyata, “Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 23, pp. 9355–9359, 1989.

[16] J. P. Gogarten and L. Olendzenski, “Orthologs, paralogs and genome comparisons,” Current Opinion in Genetics and Development, vol. 9, no. 6, pp. 630–636, 1999.

[17] C. R. Woese, O. Kandler, and M. L. Wheelis, “Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya,” Proceedings of the National Academy of Sciences of the United States of America, vol. 87, no. 12, pp. 4576–4579, 1990.

[18] J. Martijn and T. J. Ettema, “From archaeon to eukaryote: the evolutionary dark ages of the eukaryotic cell,” Biochemical Society Transactions, vol. 41, no. 1, pp. 451–457, 2013.

[19] C. J. Cox, P. G. Foster, R. P. Hirt, S. R. Harris, and T. M. Embley, “The archaebacterial origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 51, pp. 20356–20361, 2008.

[20] P. G. Foster, C. J. Cox, and T. M. Embley, “The primary divisions of life: a phylogenomic approach employing composition-heterogeneous methods,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2197–2207, 2009.

[21] T. A. Williams, P. G. Foster, T. M. Nye, C. J. Cox, and T. M. Embley, “A congruent phylogenomic signal places eukaryotes within the Archaea,” Proceedings of the Royal Society B: Biological Sciences, vol. 279, no. 1749, pp. 4870–4879, 2012.

[22] S. Gribaldo, A. M. Poole, V. Daubin, P. Forterre, and C. Brochier-Armanet, “The origin of eukaryotes and their relationship with the Archaea: are we at a phylogenomic impasse?” Nature Reviews Microbiology, vol. 8, no. 10, pp. 743–752, 2010.

[23] H. Philippe and P. Forterre, “The rooting of the universal tree of life is not reliable,” Journal of Molecular Evolution, vol. 49, no. 4, pp. 509–523, 1999.

[24] E. Sober and M. Steel, “Testing the hypothesis of common ancestry,” Journal of Theoretical Biology, vol. 218, no. 4, pp. 395–408, 2002.

[25] P. Forterre, “The common ancestor of Archaea and Eukarya was not an archaeon,” Archaea, vol. 2013, Article ID 372396, 18 pages, 2013.

[26] P. Forterre and H. Philippe, “Where is the root of the universal tree of life?” BioEssays, vol. 21, no. 10, pp. 871–879, 1999.

[27] S. Gribaldo and H. Philippe, “Ancient phylogenetic relationships,” Theoretical Population Biology, vol. 61, no. 4, pp. 391–408, 2002.

[28] J. K. Harris, S. T. Kelley, G. B. Spiegelman, and N. R. Pace, “The genetic core of the universal ancestor,” Genome Research, vol. 13, no. 3, pp. 407–412, 2003.

[29] A. Harish and G. Caetano-Anollés, “Ribosomal history reveals origins of modern protein synthesis,” PLoS ONE, vol. 7, no. 3, Article ID e32776, 2012.

[30] G. Caetano-Anollés and M. J. Seufferheld, “The coevolutionary roots of biochemistry and cellular organization challenge the RNA world paradigm,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 152–177, 2013.

[31] N. R. Pace, “Mapping the tree of life: progress and prospects,” Microbiology and Molecular Biology Reviews, vol. 73, no. 4, pp. 565–576, 2009.

[32] P. de Rijk, Y. van de Peer, I. van den Broeck, and R. de Wachter, “Evolution according to large ribosomal subunit RNA,” Journal of Molecular Evolution, vol. 41, no. 3, pp. 366–375, 1995.

[33] G. Caetano-Anollés, “Tracing the evolution of RNA structure in ribosomes,” Nucleic Acids Research, vol. 30, no. 11, pp. 2575–2587, 2002.


[34] J. Mallatt and C. J. Winchell, “Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes,” Molecular Biology and Evolution, vol. 19, no. 3, pp. 289–301, 2002.

[35] M. Gerstein, “Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census,” Proteins: Structure, Function, and Bioinformatics, vol. 33, no. 4, pp. 518–534, 1998.

[36] M. Gerstein and H. Hegyi, “Comparing genomes in terms of protein structure: surveys of a finite parts list,” FEMS Microbiology Reviews, vol. 22, no. 4, pp. 277–304, 1998.

[37] B. Snel, P. Bork, and M. A. Huynen, “Genome phylogeny based on gene content,” Nature Genetics, vol. 21, no. 1, pp. 108–110, 1999.

[38] F. Tekaia, A. Lazcano, and B. Dujon, “The genomic tree as revealed from whole proteome comparisons,” Genome Research, vol. 9, no. 6, pp. 550–557, 1999.

[39] Y. I. Wolf, S. E. Brenner, P. A. Bash, and E. V. Koonin, “Distribution of protein folds in the three superkingdoms of life,” Genome Research, vol. 9, no. 1, pp. 17–26, 1999.

[40] J. O. Korbel, B. Snel, M. A. Huynen, and P. Bork, “SHOT: a web server for the construction of genome phylogenies,” Trends in Genetics, vol. 18, no. 3, pp. 158–162, 2002.

[41] G. Caetano-Anollés and D. Caetano-Anollés, “An evolutionarily structured universe of protein architecture,” Genome Research, vol. 13, no. 7, pp. 1563–1571, 2003.

[42] S. L. Baldauf, A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle, “A kingdom-level phylogeny of eukaryotes based on combined protein data,” Science, vol. 290, no. 5493, pp. 972–977, 2000.

[43] J. R. Brown, C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope, “Universal trees based on large combined protein sequence data sets,” Nature Genetics, vol. 28, no. 3, pp. 281–285, 2001.

[44] F. D. Ciccarelli, T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P. Bork, “Toward automatic reconstruction of a highly resolved tree of life,” Science, vol. 311, no. 5765, pp. 1283–1287, 2006.

[45] T. Dagan, M. Roettger, D. Bryant, and W. Martin, “Genome networks root the tree of life between prokaryotic domains,” Genome Biology and Evolution, vol. 2, no. 1, pp. 379–392, 2010.

[46] S. Jun, G. E. Sims, G. A. Wu, and S. Kim, “Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 133–138, 2010.

[47] E. A. Schultes, P. T. Hraber, and T. H. LaBean, “No molecule is an island: molecular evolution and the study of sequence space,” in Algorithmic Bioprocesses, A. Condon, D. Harel, J. N. Kok, A. Salomaa, and E. Winfree, Eds., pp. 675–704, Springer, Berlin, Germany, 2009.

[48] K. M. Kim and G. Caetano-Anollés, “The proteomic complexity and rise of the primordial ancestor of diversified life,” BMC Evolutionary Biology, vol. 11, no. 1, article 140, 2011.

[49] V. A. Albert, Parsimony, Phylogeny, and Genomics, Oxford University Press, Oxford, UK, 2005.

[50] G. Caetano-Anollés, “Evolved RNA secondary structure and the rooting of the universal tree of life,” Journal of Molecular Evolution, vol. 54, no. 3, pp. 333–345, 2002.

[51] J. A. Lake, R. G. Skophammer, C. W. Herbold, and J. A. Servin, “Genome beginnings: rooting the tree of life,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2177–2185, 2009.

[52] B. Guo, M. Zou, and A. Wagner, “Pervasive indels and their evolutionary dynamics after the fish-specific genome duplication,” Molecular Biology and Evolution, vol. 29, no. 10, pp. 3005–3022, 2012.

[53] M. K. Basu, I. B. Rogozin, O. Deusch, T. Dagan, W. Martin, and E. V. Koonin, “Evolutionary dynamics of introns in plastid-derived genes in plants: saturation nearly reached but slow intron gain continues,” Molecular Biology and Evolution, vol. 25, no. 1, pp. 111–119, 2008.

[54] J. S. Farris, “Phylogenetic analysis under Dollo’s Law,” Systematic Biology, vol. 26, no. 1, pp. 77–88, 1977.

[55] H. Fang, M. E. Oates, R. B. Pethica et al., “A daily-updated tree of (sequenced) life as a reference for genome research,” Scientific Reports, vol. 3, article 2015, 2013.

[56] T. Cavalier-Smith, “The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification,” International Journal of Systematic and Evolutionary Microbiology, vol. 52, no. 1, pp. 7–76, 2002.

[57] T. Cavalier-Smith, “Rooting the tree of life by transition analyses,” Biology Direct, vol. 1, article 19, 2006.

[58] H. S. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Widespread recruitment of ancient domain structures in modern enzymes during metabolic evolution,” Journal of Integrative Bioinformatics, vol. 10, no. 1, article 214, 2013.

[59] G. Caetano-Anollés, M. Wang, and D. Caetano-Anollés, “Structural phylogenomics retrodicts the origin of the genetic code and uncovers the evolutionary impact of protein flexibility,” PLoS ONE, vol. 8, no. 8, Article ID e72225, 2013.

[60] E. Bapteste and C. Brochier, “On the conceptual difficulties in rooting the tree of life,” Trends in Microbiology, vol. 12, no. 1, pp. 9–13, 2004.

[61] A. M. Poole, “Horizontal gene transfer and the earliest stages of the evolution of life,” Research in Microbiology, vol. 160, no. 7, pp. 473–480, 2009.

[62] D. Raoult, “The post-Darwinist rhizome of life,” The Lancet, vol. 375, no. 9709, pp. 104–105, 2010.

[63] W. F. Doolittle, “Phylogenetic classification and the universal tree,” Science, vol. 284, no. 5423, pp. 2124–2128, 1999.

[64] T. Dagan and W. Martin, “The tree of one percent,” Genome Biology, vol. 7, no. 10, article 118, 2006.

[65] W. F. Doolittle and E. Bapteste, “Pattern pluralism and the tree of life hypothesis,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 7, pp. 2043–2049, 2007.

[66] E. V. Koonin, “Towards a postmodern synthesis of evolutionary biology,” Cell Cycle, vol. 8, no. 6, pp. 799–800, 2009.

[67] T. Kloesges, O. Popa, W. Martin, and T. Dagan, “Networks of gene sharing among 329 proteobacterial genomes reveal differences in lateral gene transfer frequency at different phylogenetic depths,” Molecular Biology and Evolution, vol. 28, no. 2, pp. 1057–1074, 2011.

[68] S. Halary, J. W. Leigh, B. Cheaib, P. Lopez, and E. Bapteste, “Network analyses structure genetic diversity in independent genetic worlds,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 127–132, 2010.

[69] S. S. Abby, E. Tannier, M. Gouy, and V. Daubin, “Lateral gene transfer as a support for the tree of life,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 13, pp. 4962–4967, 2012.

[70] N. Glansdorff, Y. Xu, and B. Labedan, “The conflict between horizontal gene transfer and the safeguard of identity: origin of meiotic sexuality,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 470–480, 2009.

[71] C. G. Kurland and O. G. Berg, “A hitchhiker’s guide to evolving networks,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 361–396, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[72] L. P. Villarreal and G. Witzany, “The DNA habitat and its RNA inhabitants: at the dawn of RNA sociology,” Genomics Insights, vol. 6, pp. 1–12, 2013.

[73] S. Gribaldo and C. Brochier, “Phylogeny of prokaryotes: does it exist and why should we care?” Research in Microbiology, vol. 160, no. 7, pp. 513–521, 2009.

[74] B. D. Mishler, “The logic of the data matrix in phylogenetic analysis,” in Parsimony, Phylogeny, and Genomics, V. A. Albert, Ed., pp. 57–69, Oxford University Press, New York, NY, USA, 2005.

[75] A. Dress, V. Moulton, M. Steel, and T. Wu, “Species, clusters and the “tree of life”: a graph-theoretic perspective,” Journal of Theoretical Biology, vol. 265, no. 4, pp. 535–542, 2010.

[76] N. Lane and W. Martin, “The energetics of genome complexity,” Nature, vol. 467, no. 7318, pp. 928–934, 2010.

[77] M. C. Rivera and J. A. Lake, “The ring of life provides evidence for a genome fusion origin of eukaryotes,” Nature, vol. 431, no. 7005, pp. 152–155, 2004.

[78] D. Alvarez-Ponce and J. O. McInerney, “The human genome retains relics of its prokaryotic ancestry: human genes of archaebacterial and eubacterial origin exhibit remarkable differences,” Genome Biology and Evolution, vol. 3, no. 1, pp. 782–790, 2011.

[79] A. Poole and D. Penny, “Eukaryote evolution: engulfed by speculation,” Nature, vol. 447, no. 7147, p. 913, 2007.

[80] E. V. Koonin and Y. I. Wolf, “The fundamental units, processes and patterns of evolution, and the tree of life conundrum,” Biology Direct, vol. 4, no. 1, article 33, 2009.

[81] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Search for a “tree of life” in the thicket of the phylogenetic forest,” Journal of Biology, vol. 8, no. 6, article 59, 2009.

[82] P Puigbo Y I Wolf and E V Koonin ldquoSeeing the tree of lifebehind the phylogenetic forestrdquo BMC Biology vol 11 article 462013

[83] D. B. Wetlaufer, “Nucleation, rapid folding, and globular intrachain regions in proteins,” Proceedings of the National Academy of Sciences of the United States of America, vol. 70, no. 3, pp. 697–701, 1973.

[84] J. S. Richardson, “The anatomy and taxonomy of protein structure,” Advances in Protein Chemistry, vol. 34, pp. 167–339, 1981.

[85] M. Riley and B. Labedan, “Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module,” Journal of Molecular Biology, vol. 268, no. 5, pp. 857–868, 1997.

[86] J. Janin and S. J. Wodak, “Structural domains in proteins and their role in the dynamics of protein function,” Progress in Biophysics and Molecular Biology, vol. 42, pp. 21–78, 1983.

[87] G. Caetano-Anollés, M. Wang, D. Caetano-Anollés, and J. E. Mittenthal, “The origin, evolution and structure of the protein world,” Biochemical Journal, vol. 417, no. 3, pp. 621–637, 2009.

[88] J. Gough, “Convergent evolution of domain architectures (is rare),” Bioinformatics, vol. 21, no. 8, pp. 1464–1471, 2005.

[89] K. Forslund, A. Henricson, V. Hollich, and E. L. L. Sonnhammer, “Domain tree-based analysis of protein architecture evolution,” Molecular Biology and Evolution, vol. 25, no. 2, pp. 254–264, 2008.

[90] S. Yang and P. E. Bourne, “The evolutionary history of protein domains viewed by species phylogeny,” PLoS ONE, vol. 4, no. 12, Article ID e8378, 2009.

[91] K. M. Kim and G. Caetano-Anollés, “The evolutionary history of protein fold families and proteomes confirms that the archaeal ancestor is more ancient than the ancestors of other superkingdoms,” BMC Evolutionary Biology, vol. 12, article 13, 2012.

[92] M. Wang and G. Caetano-Anollés, “The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world,” Structure, vol. 17, no. 1, pp. 66–78, 2009.

[93] G. Caetano-Anollés and A. Nasir, “Benefits of using molecular structure and abundance in phylogenomic analysis,” Frontiers in Genetics, vol. 3, article 172, 2012.

[94] K. Brading and E. Castellani, Symmetries in Physics: Philosophical Reflections, Cambridge University Press, Cambridge, UK, 2003.

[95] J. A. Cotton and M. Wilkinson, “Quantifying the potential utility of phylogenetic characters,” Taxon, vol. 57, no. 1, pp. 131–136, 2008.

[96] E. Mossel and M. Steel, “A phase transition for a random cluster model on phylogenetic trees,” Mathematical Biosciences, vol. 187, no. 2, pp. 189–203, 2004.

[97] M. H. Bailor, X. Sun, and H. M. Al-Hashimi, “Topology links RNA secondary structure with global conformation, dynamics, and adaptation,” Science, vol. 327, no. 5962, pp. 202–206, 2010.

[98] K. Illergård, D. H. Ardell, and A. Elofsson, “Structure is three to ten times more conserved than sequence—a study of structural response in protein cores,” Proteins: Structure, Function and Bioinformatics, vol. 77, no. 3, pp. 499–508, 2009.

[99] G. Caetano-Anollés, S. K. Hee, and J. E. Mittenthal, “The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 22, pp. 9358–9363, 2007.

[100] P. H. Weston, “Methods for rooting cladistic trees,” in Models in Phylogeny Reconstruction, D. J. Siebert, R. W. Scotland, and D. M. Williams, Eds., pp. 125–155, Oxford University Press, Oxford, UK, 1994.

[101] H. N. Bryant, “The polarization of character transformations in phylogenetic systematics: role of axiomatic and auxiliary assumptions,” Systematic Biology, vol. 40, no. 4, pp. 433–445, 1991.

[102] H. N. Bryant and G. Wagner, “Character polarity and the rooting of cladograms,” in The Character Concept in Evolutionary Biology, G. P. Wagner, Ed., pp. 319–338, Academic Press, New York, NY, USA, 2001.

[103] G. Nelson, “Ontogeny, phylogeny, paleontology, and the biogenetic law,” Systematic Biology, vol. 27, no. 3, pp. 324–345, 1978.

[104] P. H. Weston, “Indirect and direct methods in systematics,” in Ontogeny and Systematics, C. J. Humphries, Ed., pp. 27–56, Columbia University Press, New York, NY, USA, 1988.

[105] F.-J. Sun and G. Caetano-Anollés, “The evolutionary history of the structure of 5S ribosomal RNA,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 430–443, 2009.

[106] X. Fang, T. Pan, and T. R. Sosnick, “Mg2+-dependent folding of a large ribozyme without kinetic traps,” Nature Structural Biology, vol. 6, no. 12, pp. 1091–1095, 1999.

[107] F.-J. Sun and G. Caetano-Anollés, “Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses,” PLoS Computational Biology, vol. 4, no. 3, Article ID e1000018, 2008.

[108] F.-J. Sun and G. Caetano-Anollés, “The origin and evolution of tRNA inferred from phylogenetic analysis of structure,” Journal of Molecular Evolution, vol. 66, no. 1, pp. 21–35, 2008.

[109] W. R. Ashby, An Introduction to Cybernetics, Taylor & Francis, London, UK, 1955.

[110] J. G. Lundberg, “Wagner networks and ancestors,” Systematic Biology, vol. 21, no. 4, pp. 398–413, 1972.

[111] P. T. Chippindale and J. J. Wiens, “Weighting, partitioning, and combining characters in phylogenetic analysis,” Systematic Biology, vol. 43, no. 2, pp. 278–287, 1994.

[112] J. Mittenthal, D. Caetano-Anollés, and G. Caetano-Anollés, “Biphasic patterns of diversification and the emergence of modules,” Frontiers in Genetics, vol. 3, article 147, 2012.

[113] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[114] C. A. Orengo, A. Michie, S. Jones, D. T. Jones, M. Swindells, and J. M. Thornton, “CATH—a hierarchic classification of protein domain structures,” Structure, vol. 5, no. 8, pp. 1093–1109, 1997.

[115] G. Sawa, J. Dicks, and I. N. Roberts, “Current approaches to whole genome phylogenetic analysis,” Briefings in Bioinformatics, vol. 4, no. 1, pp. 63–74, 2003.

[116] R. Sathyapriya, J. M. Duarte, H. Stehr, I. Filippis, and M. Lappe, “Defining an essence of structure determining residue contacts in proteins,” PLoS Computational Biology, vol. 5, no. 12, Article ID e1000584, 2009.

[117] D. Penny and L. J. Collins, “Evolutionary genomics leads the way,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 1–16, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[118] M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.

[119] A. Andreeva, D. Howorth, J. Chandonia et al., “Data growth and its impact on the SCOP database: new developments,” Nucleic Acids Research, vol. 36, pp. D419–D425, 2008.

[120] L. S. Yafremava, M. Wielgos, S. Thomas et al., “A general framework of persistence strategies for biological systems helps explain domains of life,” Frontiers in Genetics, vol. 4, article 16, 2013.

[121] J. L. Friar, T. Goldman, and J. Pérez-Mercader, “Genome sizes and the Benford distribution,” PLoS ONE, vol. 7, no. 5, Article ID e36624, 2012.

[122] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya,” BMC Evolutionary Biology, vol. 12, article 156, 2012.

[123] A. Nasir and G. Caetano-Anollés, “Comparative analysis of proteomes and functionomes provides insights into origins of cellular diversification,” Archaea, vol. 2013, Article ID 648746, 13 pages, 2013.

[124] W. Martin and M. Müller, “The hydrogen hypothesis for the first eukaryote,” Nature, vol. 392, no. 6671, pp. 37–41, 1998.

[125] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Global patterns of protein domain gain and loss in superkingdoms,” PLoS Computational Biology, vol. 10, no. 1, Article ID e1003452, 2014.

[126] M. Wang, L. S. Yafremava, D. Caetano-Anollés, J. E. Mittenthal, and G. Caetano-Anollés, “Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world,” Genome Research, vol. 17, no. 11, pp. 1572–1585, 2007.

[127] L. Guy and T. J. G. Ettema, “The archaeal “TACK” superphylum and the origin of eukaryotes,” Trends in Microbiology, vol. 19, no. 12, pp. 580–587, 2011.

[128] D. Schlieper, M. A. Oliva, J. M. Andreu, and J. Löwe, “Structure of bacterial tubulin BtubA/B: evidence for horizontal gene transfer,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 26, pp. 9170–9175, 2005.

[129] M. P. Hoeppner, P. P. Gardner, and A. M. Poole, “Comparative analysis of RNA families reveals distinct repertoires for each domain of life,” PLoS Computational Biology, vol. 8, no. 11, Article ID e1002752, 2012.

[130] H. Xue, K. Tong, C. Marck, H. Grosjean, and J. T. Wong, “Transfer RNA paralogs: evidence for genetic code-amino acid biosynthesis coevolution and an archaeal root of life,” Gene, vol. 310, no. 1-2, pp. 59–66, 2003.

[131] H. Xue, S. Ng, K. Tong, and J. T. Wong, “Congruence of evidence for a Methanopyrus-proximal root of life based on transfer RNA and aminoacyl-tRNA synthetase genes,” Gene, vol. 360, no. 2, pp. 120–130, 2005.

[132] M. Di Giulio, “Nanoarchaeum equitans is a living fossil,” Journal of Theoretical Biology, vol. 242, no. 1, pp. 257–260, 2006.

[133] M. Di Giulio, “The tree of life might be rooted in the branch leading to Nanoarchaeota,” Gene, vol. 401, no. 1-2, pp. 108–113, 2007.

[134] J. T. Wong, J. Chen, W. Mat, S. Ng, and H. Xue, “Polyphasic evidence delineating the root of life and roots of biological domains,” Gene, vol. 403, no. 1-2, pp. 39–52, 2007.

[135] F.-J. Sun and G. Caetano-Anollés, “The ancient history of the structure of ribonuclease P and the early origins of Archaea,” BMC Bioinformatics, vol. 11, article 153, 2010.

[136] C. Marck and H. Grosjean, “tRNomics: analysis of tRNA genes from 50 genomes of Eukarya, Archaea, and bacteria reveals anticodon-sparing strategies and domain-specific features,” RNA, vol. 8, no. 10, pp. 1189–1232, 2002.

[137] W. Mat, H. Xue, and J. T. Wong, “The genomics of LUCA,” Frontiers in Bioscience, vol. 13, no. 14, pp. 5605–5613, 2008.

[138] C. Brochier, S. Gribaldo, Y. Zivanovic, F. Confalonieri, and P. Forterre, “Nanoarchaea: representatives of a novel archaeal phylum or a fast-evolving euryarchaeal lineage related to Thermococcales?” Genome Biology, vol. 6, no. 5, article R42, 2005.

[139] B. Billoud, M. Guerrucci, M. Masselot, and J. S. Deutsch, “Cirripede phylogeny using a novel approach: molecular morphometrics,” Molecular Biology and Evolution, vol. 17, no. 10, pp. 1435–1445, 2000.

[140] L. J. Collins, V. Moulton, and D. Penny, “Use of RNA secondary structure for studying the evolution of RNase P and RNase MRP,” Journal of Molecular Evolution, vol. 51, no. 3, pp. 194–204, 2000.

[141] G. Caetano-Anollés, “Novel strategies to study the role of mutation and nucleic acid structure in evolution,” Plant Cell, Tissue and Organ Culture, vol. 67, no. 2, pp. 115–132, 2001.

[142] F.-J. Sun, S. Fleurdepine, C. Bousquet-Antonelli, G. Caetano-Anollés, and J. Deragon, “Common evolutionary trends for SINE RNA structures,” Trends in Genetics, vol. 23, no. 1, pp. 26–33, 2007.

[143] D. Caetano-Anollés, K. M. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Proteome evolution and the metabolic origins of translation and cellular life,” Journal of Molecular Evolution, vol. 72, no. 1, pp. 14–33, 2011.


[144] M. Wang, Y. Jiang, K. M. Kim et al., “A universal molecular clock of protein folds and its power in tracing the early history of aerobic metabolism and planet oxygenation,” Molecular Biology and Evolution, vol. 28, no. 1, pp. 567–582, 2011.

[145] S. A. Bukhari and G. Caetano-Anollés, “Origin and evolution of protein fold designs inferred from phylogenomic analysis of CATH domain structures in proteomes,” PLoS Computational Biology, vol. 9, no. 3, Article ID e1003009, 2013.

[146] M. Vesteg and J. Krajčovič, “The falsifiability of the models for the origin of eukaryotes,” Current Genetics, vol. 57, no. 6, pp. 367–390, 2011.

[147] I. N. Berezovsky and E. I. Shakhnovich, “Physics and evolution of thermophilic adaptation,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 36, pp. 12742–12747, 2005.

[148] C. Woese, “The universal ancestor,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 12, pp. 6854–6859, 1998.

[149] M. J. Seufferheld and G. Caetano-Anollés, “Phylogenomics supports a cellularly structured urancestor,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 178–191, 2013.

[150] O. Kandler, “Cell wall biochemistry and three-domain concept of life,” Systematic and Applied Microbiology, vol. 16, no. 4, pp. 501–509, 1994.

[151] C. R. Woese, “On the evolution of cells,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 13, pp. 8742–8747, 2002.

[152] S. L. Baldauf, J. D. Palmer, and W. F. Doolittle, “The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny,” Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 15, pp. 7749–7754, 1996.

[153] H. Leffers, J. Kjems, L. Østergaard, N. Larsen, and R. A. Garrett, “Evolutionary relationships amongst archaebacteria. A comparative study of 23 S ribosomal RNAs of a sulphur-dependent extreme thermophile, an extreme halophile and a thermophilic methanogen,” Journal of Molecular Biology, vol. 195, no. 1, pp. 43–61, 1987.

[154] J. Martin, D. Blackburn, and E. Wiley, “Are node-based and stem-based clades equivalent? Insights from graph theory,” PLoS Currents, vol. 2, Article ID RRN1196, 2010.


[Figure 7 plot: x-axis, FSF use (log); y-axis, FSF reuse (log); legend: thermophilic, nonthermophilic; labeled species: S. marinus, I. hospitalis, H. butylicus, T. pendens, Cand. Korarchaeum, A. pernix, M. kandleri, T. neutrophilus, P. islandicum, N. maritimus.]

Figure 7: A plot of FSF use (diversity) against FSF reuse (abundance) reveals a linear pattern of proteomic growth in 48 archaeal proteomes. Thermophilic archaeal species occupy positions that are close to the origin of the plot. They also populate the most basal branch positions in ToLs (see discussion in the main text). Both axes are in logarithmic scale.

those that are universal (ABE). About half of the FSFs (786 out of 1733) were present in all three domains of life, 453 of which were present in at least 60% of the organisms that were examined (f > 0.6, where f is the distribution index ranging from 0 to 1). The fact that about 70% of widely shared FSFs (672) belong to the ABE taxonomic group strongly supports a common evolutionary origin for cells. Evolutionary timelines have confirmed that a large number of these universal FSFs were present in the ancestor of the three domains of life, the urancestor (Figure 1), and were for the most part retained in extant proteomes [48].
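The census behind these counts can be illustrated with a minimal sketch. It assumes a simple presence/absence table and computes f as the fraction of all sampled organisms that encode a feature; the function names, the data layout, and the toy values are hypothetical and are not taken from the study.

```python
def taxonomic_group(domains):
    """Return the Venn label ('A', 'BE', 'ABE', ...) for a set of domain codes."""
    return "".join(code for code in "ABE" if code in domains)


def census(occurrences, organism_domain, threshold=0.6):
    """Summarize each feature by Venn group and distribution index f.

    occurrences: feature id -> set of organism ids encoding it (assumed layout)
    organism_domain: organism id -> 'A', 'B', or 'E'
    """
    n_organisms = len(organism_domain)
    summary = {}
    for feature, organisms in occurrences.items():
        f = len(organisms) / n_organisms  # fraction of all sampled organisms
        domains = {organism_domain[o] for o in organisms}
        summary[feature] = {
            "group": taxonomic_group(domains),
            "f": f,
            "widespread": f > threshold,
        }
    return summary


# Toy data for illustration only.
organism_domain = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "e1": "E"}
occurrences = {
    "fsf_universal": {"a1", "a2", "b1", "b2", "e1"},
    "fsf_shared_be": {"b1", "e1"},
    "fsf_bacterial": {"b2"},
}
summary = census(occurrences, organism_domain)
for feature, row in summary.items():
    print(feature, row["group"], round(row["f"], 2), row["widespread"])
```

Counting the features in each group, and those flagged as widespread, would give the kind of tallies reported for the Venn diagrams; for two-domain groups the denominator could instead be restricted to organisms of those domains.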

Interestingly, the number of BE FSFs was ~10-fold greater than AE and AB FSFs (324 versus 38 and 38) (Figure 6(a)). The finding that Bacteria and Eukarya encode a significantly large number of shared FSF domains is remarkable and hints at an unprecedentedly strong evolutionary association between these two domains of life. Moreover, a significant number of BE FSFs is widespread in the bacterial and eukaryotic proteomes: 56 of the 324 FSFs are shared by more than 60% of the bacterial and eukaryal organisms that were analyzed (Figure 6(a)). This is strong evidence of an ancient vertical evolutionary trace from their mutual ancestor (anticipated in [123]). This trace uniquely supports the archaeal rooting of the ToL. In turn, the canonical and eukaryotic rooting alternatives are highly unlikely, as neither alone can explain the remarkable diversity of the BE taxonomic group (as well as its ancient origin [123]). The remarkable vertical trace of the ABE taxonomic group and the negligible vertical trace of the AB group (only one FSF shared by more than 60% of organisms) also falsify fusion hypotheses responsible for a putative chimeric makeup of Eukarya. In light of these findings, the most parsimonious explanation of comparative genomic data is that Archaea is the first domain of diversified cellular life.

The patterns of FSF sharing are further strengthened by the genomic distributions of terminal-level molecular function GO terms (Figure 6(b)). The three domains of life shared about a quarter (526) of the 1924 GO terms that were surveyed. A total of 232 of these ABE terms were shared by 60% of organisms analyzed. Again, the fact that about 76% of widely shared GO terms (306) belong to the ABE taxonomic group strongly supports a common vertical trace, while the finding that Bacteria and Eukarya share a significantly large number of GO terms (272) supports again the archaeal rooting of the ToL. An alternative explanation for the very large size of BE could be large-scale metabolism-related gene transfer from the ancestors of mitochondria and plastids to the ancestors of modern eukaryotes [124]. However, we note that the BE group is not restricted to only metabolic functions. It also includes FSFs and GOs involved in intracellular and extracellular processes and regulation and informational functions [125]. Thus, the very large size of BE is a significant outcome most likely shaped by vertical evolutionary scenarios and cannot solely be explained by parasitic/symbiotic relationships that exist between Bacteria and Eukarya [123]. Moreover, the simplicity of archaeal FSF repertoires is not due to the paucity of available archaeal genomic data. We confirmed that the mean FSF coverage (i.e., number of proteins annotated to FSFs/GOs out of total) for Archaea, Bacteria, and Eukarya was largely comparable (e.g., Table S1 in [122]). Finally, as we will describe below, these comparative patterns and tentative conclusions are confirmed by phylogenomic analysis [91, 122, 126]. Congruent phylogenies were obtained with (Figure 4(c)) and without equal and random sampling of taxa. Thus, the relatively low number of archaeal genomes is also not expected to compromise our inferences.

We note that Eukarya shares many informational genes with Archaea and many operational genes with Bacteria. While informational genes have been thought more refractory to HGT than noninformational genes, we have confirmed at GO level that this is not the case. A ToL built from GO terms showed that in fact noninformational terms were less homoplasious, while statistical enrichment analyses revealed that HGT had little if any functional preference for GO terms across GO hierarchical levels. We recently compared ToLs reconstructed from non-HGT GO terms and ToLs reconstructed from informational GO terms that were extracted from non-HGT GO terms (Kim et al., ms. in review; also read below). In both cases, Archaea appeared as a basal paraphyletic group of the ToLs and the common origin of Bacteria and Eukarya was maintained. Thus, the 272 GO terms shared by Bacteria and Eukarya harbor a strong vertical trace.

Another observation often used as support for Archaea-Eukarya kinship is the discovery of a few eukaryote-specific proteins (e.g., actin, tubulin, H3, H4, ESCRT, ribosomal proteins, and others) in some archaeal species [127] and their complete absence from Bacteria (except for documented tubulin HGT from eukaryotes to the bacterial genus Prosthecobacter [128]). This suggests either that eukaryotes arose from an archaeal lineage [127] or that the ancestor of Eukarya and Archaea was complex and modern Archaea are highly reduced [25]. Indeed, few eukaryote-specific proteins have now been found in some archaeal species (in some cases just one species). We argue that this poor spread cannot be taken as evidence for the Archaea-Eukarya sister relationship. This needs to be confirmed by robust phylogenetic analysis, which is unfortunately not possible when using protein sequences. Recently, we described a new strategy for inferring vertical and horizontal traces [123]. This method calculates the spread of FSFs and GOs that are shared between two-superkingdom groups (i.e., AB, AE, and BE). Balanced distributions often indicate vertical inheritance while biased distributions suggest horizontal flux. For example, penicillin binding molecular activity (GO:0008658) was present in 100% of the sampled bacterial proteomes but was only present in 11% of the archaeal species [123]. Thus, presence of GO:0008658 in Archaea was attributed to HGT gain from Bacteria. Using this simple method, we established that both Bacteria and Eukarya were united by a much stronger vertical trace than either was to Archaea. In fact, strong reductive tendencies in the archaeal genomes were recorded [123]. Thus, in our opinion, presence of eukaryote-specific proteins in only very few archaeal species could in fact be an HGT event that is not detectable by sequence phylogenies.
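The spread comparison can be sketched as follows. This is only an illustration of the idea of balanced versus biased distributions: the cutoff `max_gap`, the function names, and the sample sizes are assumptions introduced here, and the actual criteria used in [123] may differ.

```python
def spread(n_with_feature, n_sampled):
    """Fraction of sampled proteomes of one superkingdom that carry a feature."""
    if n_sampled <= 0:
        raise ValueError("n_sampled must be positive")
    return n_with_feature / n_sampled


def classify_trace(spread_first, spread_second, max_gap=0.5):
    """Label a feature shared by two superkingdoms as 'balanced' or 'biased'.

    max_gap is a hypothetical cutoff chosen for illustration only.
    """
    gap = abs(spread_first - spread_second)
    return "biased" if gap > max_gap else "balanced"


# Penicillin binding (GO:0008658): 100% of bacterial and 11% of archaeal proteomes.
# The sample size of 100 per superkingdom is a placeholder, not the real sampling.
bacteria = spread(100, 100)
archaea = spread(11, 100)
label = classify_trace(bacteria, archaea)
print(label)
```

Under this sketch a biased label would point to horizontal flux, with the superkingdom of higher spread as the likely donor, while a balanced label would be compatible with vertical inheritance.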

In turn, many arguments now favor the Bacteria-Eukarya sisterhood in addition to the structure-based phylogenies and balanced distribution of molecular features. For example, Bacteria and Eukarya have similar lipid membranes that can be used as an argument for their evolutionary kinship. Moreover, Archaea fundamentally differ from Eukarya in terms of their virosphere. Viruses infecting Archaea and Eukarya are drastically different, as recently discussed by Forterre [25].

The genomic distribution of RNA families was taken from Hoeppner et al. [129] and also shows a vertical evolutionary trace in five crucial Rfam clans that are universal, including tRNA, 5S rRNA, subunit rRNA, and RNase P RNA. These universal RNA groups are likely minimally affected by HGT. However, 99% of Rfam clans and families are specific to domains of life and only 11 of the 1148 groups were shared at f > 0.6 levels. This clearly shows that the functional complexity of RNA materialized very late during organismal diversification and that it is not a good genomic feature for exploring the rooting of the ToL. Only five RNA families are shared between two domains of life, and only one of these does so at f > 0.6: the G12 pseudoknot of the 23S rRNA, which is present in bacterial and eukaryotic organellar rRNA. While the large subunit rRNA scaffold supports the G12 pseudoknot with its vertical trace, all five interdomain RNA families can be explained most parsimoniously by HGT. Thus, only a handful of ancient and universal RNA species can be used to root the ToL.

We end by noting that the Venn diagrams consistently show that Archaea harbors the smallest number of unique (A) and shared (AB and AE) FSFs, GO terms, or RNA families. This trend supports an early divergence of this domain of life from the urancestor and the possibility that such divergence was shaped by evolutionary reductive events. We reason that such losses would be more parsimonious early on in evolution than in the later periods. This is because genes often increase their abundance in evolutionary time (by gene duplications, HGT, and other processes). Thus, it is reasonable to think that loss of an ancient gene would be more feasible very early in evolution relative to losing it very late.

6.2. Phylogenomic Evidence from the Sequence and Structure of RNA. Phylogenetic analyses of the few RNA families that are universal and display an important vertical evolutionary trace (Figure 6(c)) provide compelling evidence in favor of an early evolutionary appearance of Archaea [105, 107, 108, 130–135]. Here we briefly summarize evidence from tRNA, 5S rRNA, and RNase P RNA. Unpublished analyses of rRNA sequence and structure using advanced phylogenetic methods also show that Archaea was the first domain of life.

tRNA molecules are generally short (~73–95 nucleotides in length) and highly conserved. Consequently, their sequence generally contains a limited amount of phylogenetic information. These limitations have been overcome by analyzing entire tRNomes [136], which extend the length of an organismal set of tRNAs to over 2000 bases. Xue et al. [130, 131] analyzed the genetic distances between tRNA sequences as averages between alloacceptor tRNAs from diverse groups of tRNomes using multiple molecular base substitution models. The distances were mapped onto an unrooted phylogeny of tRNA molecules (Figure 8(a)). The “arrow of time” assumption in these studies is that ancient tRNA paralogs closely resemble each other when lineages originate close to the time of the gene duplication. Remarkably, the results revealed a paraphyletic rooting of the ToL in Archaea. The root was specifically located close to the hyperthermophilic methanogen Methanopyrus kandleri (Figure 8(a)). The hypothesis of this specific rooting scenario has been supported by several other studies [134, 137], including a study of genetic distances between paralogous pairs of aminoacyl-tRNA synthetase (aaRS) proteins [131]. A remarkable match between distance scores of tRNA and pairs of aaRS paralogs (Figure 8(b)) not only confirms the early appearance of Archaea but also suggests a coevolutionary trace associated with molecular interactions that are responsible for the genetic code. In fact, a recent exhaustive phylogenomic analysis of tRNA and aaRS coevolution explicitly reveals the origins and evolution of the genetic code and the underlying molecular basis of genetics [59]. Di Giulio [132, 133] has also proposed an archaeal rooting of the tree of life, specifically in the lineage leading to the phylum of Nanoarchaeota. This rooting is based on unique and ancestral genomic traits of Nanoarchaeum equitans, split genes separately codifying for the 5′ and 3′ halves of tRNA and the absence of operons, which are considered molecular fossils [132]. However, this claim needs additional support, as contrasting evidence now recognizes N. equitans as a highly derived archaeal species [138].
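The idea of an average alloacceptor distance can be sketched in a few lines. The example assumes one pre-aligned, equal-length tRNA sequence per amino acid and uses a plain mismatch fraction (p-distance) in place of the substitution models applied in [130, 131]; the function names and the toy sequences are hypothetical.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of mismatched positions between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def mean_alloacceptor_distance(trnas):
    """Average p-distance over all pairs of tRNAs accepting different amino acids.

    trnas: amino acid -> aligned tRNA sequence (assumed layout, one per acceptor).
    With 20 acceptors this averages over 20 * 19 / 2 = 190 pairs.
    """
    pairs = list(combinations(sorted(trnas), 2))
    if not pairs:
        raise ValueError("at least two tRNAs are required")
    total = sum(p_distance(trnas[x], trnas[y]) for x, y in pairs)
    return total / len(pairs)


# Toy four-base "tRNAs" for illustration only.
toy_trnome = {"Ala": "GGGC", "Gly": "GGGA", "Ser": "GCGA"}
d_allo = mean_alloacceptor_distance(toy_trnome)
print(round(d_allo, 4))
```

In this reading, a genome whose alloacceptor tRNAs still resemble each other yields a low average distance, which is the pattern the cited studies interpret as a sign of ancient origin.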

In addition to sequences, structural features of tRNA molecules also support the archaeal rooting of the ToL.

[Figure 8 panels: (a) unrooted tree with tips labeled by three-letter species codes (e.g., Mka) for Archaea, Bacteria, and Eukarya, colored by Dallo; (b) scatter plot with tRNA alloacceptor history (Dallo) on the x-axis and aminoacyl-tRNA synthetase history (Qars) on the y-axis, running from ancient origins (Mka) to recent origins.]

Figure 8: Ancient phylogenetic signal in the sequence of tRNA and their associated aminoacyl-tRNA synthetase (aaRS) enzymes. (a) Unrooted ToL derived from tRNA sequences with their alloacceptor Dallo distances traced in thermal scale (from [130]). Dallo is the average of pairwise distances for 190 pairs of tRNA isoacceptors. These distances measure pairwise sequence mismatches of tRNAs for every genome, and their values increase for faster evolving sequences of species of more recent origin. (b) Coevolution of aaRSs and their corresponding tRNA. Genetic distances between the top 10 potentially paralogous aaRS pairs, estimated using BLASTP, define a measure (Qars) of how closely the proteins resemble each other in genomes (from [131]). Larger Qars scores imply more ancestral and slowly evolving protein pairs. The plot of Qars scores against Dallo distances reveals a hidden correlation between the evolution of tRNA and aaRSs and the early origin of Archaea (larger Qars and lower Dallo distances).

The application of RNA structural evidence in phylogenetic studies [50, 139–141] has multiple advantages over sequence data when studying ancient events, especially because RNA structures are far more conserved than sequences. This has been demonstrated in a phylogenetic approach that uses RNA structural information to reconstruct the evolutionary history of macromolecules such as rRNA [29, 50], tRNA [107, 108], 5S rRNA [105], RNase P RNA [135], and SINE RNA [142]. Geometrical and statistical properties of structure (e.g., stems or loops commonly found in the secondary structures of RNA molecules) are treated as linearly ordered multistate phylogenetic characters. In order to build rooted trees, an evolutionary tendency toward conformational order is used to polarize change in character state transformations. This defines a hypothetical ancestor with which to root the ingroup using the Lundberg method. Reconstructed phylogenies produce trees of molecules and ToLs (e.g., Figure 4(a)) or trees of substructures that describe the gradual evolutionary accretion of structural components into molecules (Figure 9). For example, phylogenetic trees of tRNA substructures define explicit models of molecular history and show that tRNA originated in the acceptor stem of the molecule [108]. Remarkably, trees reconstructed from tRNA drawn from individual domains of life demonstrate that the sequence of accretion events occurred differently in Archaea than in Bacteria and Eukarya, suggesting a sister group evolutionary relationship between the bacterial and eukaryotic domains (Figure 9(a)). A similar result obtained from trees of 5S rRNA substructures revealed different molecular accretion sequences of archaeal molecules when these were compared to bacterial and eukaryal counterparts [105], confirming again, in a completely different molecular system, the history of the domains of life.

An analysis of the structure of RNase PRNAalso providessimilar conclusions [135] While a ToL reconstructed frommolecular structure placed type A archaeal molecules at itsbase (a topology that resembles the ToL of 5S rRNA) atree of RNase P RNA substructures uncovered the history ofmolecular accretion of the RNA component of the ancientendonuclease and revealed a remarkable reductive evolution-ary trend (Figure 9(b)) Molecules originated in stem P12and were immediately accessorized with the catalytic P1ndashP4 catalytic pseudoknotted core structure that interacts withRNase P proteins of the endonuclease complex and ancientsegments of tRNA Soon after this important accretion stagethe evolving molecule loses its first stem in Archaea (stemP8) several accretion steps earlier than the first loss of a stemin Eukarya or the first appearance of a Bacteria-specific stemThese phylogenetic statements provide additional strongsupport to the early origin of the archaeal superkingdomprior to the divergence of the shared common ancestor ofBacteria and Eukarya As we will discuss below the earlyloss of a structure in the molecular accretion process of acentral and ancient RNA family is significant It suggests

Figure 9: The history of accretion of tRNA and RNase P RNA substructures reveals the early evolutionary appearance of Archaea. (a) Rooted trees of tRNA arm substructures reveal the early appearance of the acceptor arm (Acc), followed by the anticodon arm (AC) in Archaea or the pseudouridine (TΨC) arm in both Bacteria and Eukarya (from [108]). The result confirms the sister group relationship of Bacteria and Eukarya. (b) Trees of molecular substructures of RNase P RNAs were reconstructed from characters describing the geometry of their structures (from [135]). Branches and corresponding substructures in a 3D atomic model are colored according to the age of each substructure (nd, node distance). Note the early loss of stem P8 in Archaea immediately after the evolutionary assembly of the universal functional core of the molecule.


that the emerging archaeal lineages were subjected to strong reductive evolutionary pressures during the early evolution of a very ancient RNA molecule.

6.3. Phylogenomic Evidence from Protein Domain Structure. G. Caetano-Anollés and D. Caetano-Anollés [41] were the first to utilize protein domain structures as taxa and to reconstruct trees of domains (ToDs) describing their evolution. Figure 10 shows a rooted ToD built from a census of FSF structures in 981 genomes (data taken from [122, 123]). These trees are unique in that their terminal leaves represent a finite set of component parts [87]. These parts describe at global level the structural diversity of the protein world. When building trees of FSFs, the age of each FSF domain structure can be calculated from the ToDs by simply calculating a distance in nodes between the root of the tree and its corresponding terminal leaf. This is possible because ToDs exhibit highly unbalanced (pectinate) topologies that are the result of semipunctuated processes of domain appearance and accumulation. This node distance (nd), rescaled from 0 to 1, provides a relative timescale to study the order of FSF appearance in evolutionary history [126, 143]. Wang et al. [144] showed that nd correlates linearly with geological time and defines a global molecular clock of protein folds. Thus, nd can be used as a reliable proxy for time. Plotting the age of FSFs in each of the seven taxonomic groups confirmed evolutionary statements we had previously deduced from the Venn diagrams of Figure 6. The ABE taxonomic group included the majority of ancient and widely distributed FSFs. This is expected. In the presence of a strong vertical trace, molecular diversity must delimit a nested taxonomic hierarchy. The ABE group was followed by the evolutionary appearance of the BE group, which preceded the first domain-specific structures, which were Bacteria-specific (B). Remarkably, Archaea-specific (A) and Eukarya-specific (E) structures appeared concurrently and relatively late. These general trends, captured in the box plots of Figure 10, have been recovered repeatedly when studying domain structures at various levels of structural complexity, from folds to fold families [91, 126], when using CATH or SCOP structural definitions [122, 145], or when exploring the evolution of terminal GO terms.
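The node distance calculation described above can be sketched in a few lines of Python. The sketch assumes that a rooted tree is given as nested tuples and that nd is the number of internal nodes crossed between the root and a leaf, divided by the largest such count in the tree; the exact counting convention used in the cited studies may differ, and the taxon labels are placeholders rather than real FSFs.

```python
def node_distances(tree):
    """Return {leaf: nd} for a rooted tree written as nested tuples of leaf names."""
    raw = {}

    def walk(node, depth):
        if isinstance(node, str):      # terminal leaf
            raw[node] = depth
        else:                          # internal node: one more node on the path
            for child in node:
                walk(child, depth + 1)

    for child in tree:                 # leaves attached directly to the root get 0
        walk(child, 0)

    deepest = max(raw.values())
    return {leaf: (d / deepest if deepest else 0.0) for leaf, d in raw.items()}


# Hypothetical pectinate (comb-like) tree of five domain structures
toy_tod = ("fsf_1", ("fsf_2", ("fsf_3", ("fsf_4", "fsf_5"))))
nd = node_distances(toy_tod)
for leaf in sorted(nd, key=nd.get):
    print(f"{leaf}\t{nd[leaf]:.2f}")
```

In a strongly pectinate tree almost every leaf sits at a different depth, which is why a simple node count can order the taxa along a relative timeline.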

While the early rise of BE FSFs supports the early divergence of Archaea from the urancestor, the very significant trend of gradual loss of structures occurring in the lineages of the archaeal domain and the very late appearance of Archaea-specific structures (e.g., [126]) demand explanation. Since ancient BE FSFs are widely distributed in proteomes (Figure 10), they cannot arise from separate gains of FSFs in Bacteria and Eukarya or by processes of horizontal spread of structures. This was already evident from the Venn diagrams of Figure 6. Moreover, the BE sisterhood to the exclusion of Archaea was further supported by the inspection of FSFs involved in lipid synthesis and transport (Table 1). Membrane lipids are very relevant to the origins of diversified life ([146] and references therein). Bacteria and Eukarya encode similar lipid membranes, while archaeal membranes have a different lipid composition (isoprenoid ethers). To check if

Figure 10: Phylogenomic tree of domains (ToD) describing the evolution of 1733 FSF domain structures. Taxa (FSFs) were colored according to their distribution (f) in the 981 genomes that were surveyed and used as characters to reconstruct the phylogenomic tree (data taken from [122, 123]). The most basal FSFs are labeled with SCOP alphanumeric identifiers (e.g., c.37.1 is the P-loop containing nucleoside triphosphate hydrolase FSF). Boxplots display the age (nd value) distribution of FSFs for the seven possible taxonomic groups. nd values were calculated directly from the tree [122] and define a timeline of FSF innovation from the origin of proteins (nd = 0) to the present (nd = 1). The group of FSFs that are shared by the three domains of life (ABE) is the most ancient taxonomic group, which spans the entire time axis, and their FSFs are widely distributed in genomes. The appearance of the BE group coincides with the first reductive loss of an FSF in Archaea. FSF structures specific to domains of life appear much later in evolution. Tree statistics shown in the figure: taxa = 1733 FSF domains; characters = 981 genomes; tree length = 458,226 steps; CI = 0.049, RI = 0.796, and g1 = −0.08. Legend bins for f (number of FSFs): 0.0–0.1 (529), 0.1–0.5 (623), 0.5–0.9 (336), 0.9–1.0 (228), and f = 1.0 (17).

lipid synthesis was another BE synapomorphy, we identified 17 FSFs that were involved in lipid metabolism and transport


Table 1: List of FSFs involved in lipid metabolism and transport, along with taxonomic distribution (data taken from [122, 123]).

Group  SCOP Id  FSF Id    FSF description
ABE    89392    b.125.1   Prokaryotic lipoproteins and lipoprotein localization factors
ABE    53092    c.55.2    Creatinase/prolidase N-terminal domain
ABE    49723    b.12.1    Lipase/lipooxygenase domain (PLAT/LH2 domain)
ABE    54637    d.38.1    Thioesterase/thiol ester dehydrase-isomerase
ABE    63825    b.68.5    YWTD domain
BE     47027    a.11.1    Acyl-CoA binding protein
BE     48431    a.118.4   Lipovitellin-phosvitin complex, superhelical domain
BE     55048    d.58.23   Probable ACP-binding domain of malonyl-CoA ACP transacylase
BE     56968    f.7.1     Lipovitellin-phosvitin complex, beta-sheet shell regions
BE     58113    h.5.1     Apolipoprotein A-I
BE     47162    a.24.1    Apolipoprotein
BE     56931    f.4.2     Outer membrane phospholipase A (OMPLA)
B      82220    b.120.1   Tp47 lipoprotein, N-terminal domain
E      47699    a.52.1    Bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin
E      82936    h.6.1     Apolipoprotein A-II
E      57190    g.3.10    Colipase-like
E      49594    b.7.4     Rab geranylgeranyltransferase alpha-subunit, insert domain

(Table 1). Remarkably, the largest share of these FSFs (7 out of 17) was unique to the BE group. In comparison, none were present in either the AE or AB groups. The ABE group included five universal FSFs, while one was unique to Bacteria and four were Eukarya-specific. In turn, no FSF was unique to Archaea. The BE FSFs cannot be explained by modern effects impinging on variations in proteomic accumulation in FSFs or by processes of domain rearrangement, since these appear in the protein world quite late in evolution [92]. The only, and most parsimonious, explanation of the patterns of FSF distribution that unfold in the ToD is the very early (and protracted) rise of the archaeal domain by processes of reductive evolution, possibly triggered by the adaptation of urancestral lineages to harsh environments and survival modes. Under extremophilic conditions typical of hyperthermophilic environments, considerable investments of matter-energy and information must be made for protein persistence [120]. This puts limits on viable protein structures [147]. Extremophilic environments will thus poise the maintenance of a limited set of FSFs for persistence of emergent diversified lineages. This would induce a primordial episode of reductive evolution in the growing FSF repertoire, explaining why hyperthermophilic and thermophilic archaeal species hold the most reduced proteomes (Figure 7). It would also explain the biases that exist in FSFs, GO terms, and RNA families (Figure 6) and the placement of hyperthermophilic and thermophilic archaeal species at the base of ToLs. Since Archaea populate the oceans and sometimes rival Bacteria in number in those environments, we further interpret the late appearance of Archaea-specific FSFs as the result of late colonization of these mild environments by both ancient archaeons and emerging eukaryotes. This relaxes primordial extremophilic pressures on protein structures and enables the late archaeal exploration of structural flexibility and functional novelty.
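The tallies quoted above can be checked directly from Table 1. The short Python sketch below copies the group column of the table and counts FSFs per taxonomic group; the helper `taxonomic_group`, which builds a group label from the set of superkingdoms in which an FSF is present, is a hypothetical illustration of how such labels could be assigned and is not taken from the cited studies.

```python
from collections import Counter


def taxonomic_group(present_in):
    """Build a label such as 'BE' from a set of superkingdom initials (A, B, E)."""
    return "".join(k for k in "ABE" if k in present_in)


# Group column of Table 1 (17 FSFs of lipid metabolism and transport)
table1_groups = ["ABE"] * 5 + ["BE"] * 7 + ["B"] * 1 + ["E"] * 4

counts = Counter(table1_groups)
for group in ["ABE", "BE", "AB", "AE", "A", "B", "E"]:
    print(group, counts[group])   # Counter returns 0 for groups with no FSFs

# Example of labeling an FSF found in Bacteria and Eukarya only
print(taxonomic_group({"E", "B"}))
```

The empty AB, AE, and A groups are the pattern the text relies on: every lipid-related FSF that Archaea encode is also found in both other superkingdoms.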

6.4. Full Circle: Evidence from Trees of Proteomes and Functionomes and a Tree Derived from the Distribution of Viral Replicons in Superkingdoms. While we distrust trees of systems, especially ToLs built from sequences, the use of molecular structure at high levels of structural abstraction has the potential to mitigate some limitations of sequence analysis [93]. For example, rooted ToLs built from abundance counts of domain structures and terminal GO terms in the genomes of free-living organisms describe the evolution of proteomes (e.g., [91]). All ToL reconstructions of these kinds approximate the physiology of living organisms, dissect the three primary domains of life, and reveal the early paraphyletic origin of extremophilic archaeal lineages followed by the late appearances of monophyletic Bacteria and Eukarya. These patterns have been reliably recovered with datasets of varying sizes, irrespective of the structural classification scheme [91, 122, 126, 145]. Even a tree reconstructed from the distribution of 2662 viral replicons in superkingdoms from an exhaustive comparative genomic analysis of viral genomes showed the basal placement of Archaea and the sister taxa relationship between Bacteria and Eukarya (Figure 11).

6.5. Additional Evidence from Comparative Genomics. The uneven distribution of protein domain structures in the world


Figure 11: One of three optimal phylogenetic tree reconstructions with identical topologies recovered from an exhaustive maximum parsimony search (58 steps; CI = 1, RI = 1, HI = 0, RC = 1, g1 = −0.707) of the abundance of viral replicon types of dsDNA, ssDNA, dsRNA, ssRNA(+), ssRNA(−), and retrotranscribing viruses. Abundance was scored on a 0–20 scale and ranged from 0 to 759 viral replicons. Vectors of abundance reconstructions in internal nodes are given as percentage of total abundance of replicons in superkingdoms. Bootstrap support values are shown below nodes.

of proteomes (Figure 6) is preserved as we climb in the structural hierarchy. This was recently made evident when studying the evolution of CATH domains [145]. The Venn diagrams of Figure 12 show how domain structures in all taxonomic groups decrease in numbers with increases in evolutionary conservation. At the highest CATH architectural level, there were 32 universal architectures but no domain-specific architectures. The four architectures shared by Bacteria and Eukarya were present in at least 60% of the proteomes that were surveyed. The other two interdomain architectures were topological designs of considerable complexity that were poorly shared between proteomes and were evolutionarily derived. They were the clam architectures lost in Eukarya and the box architectures of nucleotide excision repair shared by Archaea and Eukarya. The most parsimonious corollary of these distribution patterns is that the BE taxonomical group must arise by loss of structures in Archaea. Indeed, ToDs describing the evolution of CATH domain structures again confirm the early appearance of BE structures and, consequently, their loss in Archaea.

7. Paraphyletic Origins, Grades, and Clades in Archaeal History

Saying that “a ToL is rooted in a domain of life” is an incorrect statement that comes from phylogenetic methodology (the use of outgroups) and the tendency to look at the past with modern eyes. Clades in ToLs have been rooted relative to each other by generating unrooted trees and by defining extant organisms as outgroup taxa. The ToL, however, must be considered rooted in the urancestor of cellular life (Figure 1). This planted edge that connects to the ingroup of extant organisms represents a cellular state in which productive diversification (in the sense of successful lineages) was absent. The primordial urancestral edge leads to a “phase transition,” the last universal cellular ancestor (LUCA), of which little is known. The physiology of the urancestor cannot be considered linked to that of any extant organism, even if it shared a common molecular core with all of them. The urancestor was not an archaeon, a bacterium, or a eukaryote [148]. It was not necessarily thermophilic. Perhaps it was a communal entity or a megametaorganism in the sense of a modern syncytium (the result of multiple cell fusions) and a modern coenocyte (the result of multiple cell divisions). The organismal boundaries were likely present, judging by the number of widely distributed protein domains that associate with membranes and appear at the base of our ToDs and by the universal existence of acidocalcisome organelles [149]. However, the molecular makeup of the urancestral cells was most likely fluid and quasistatistical, the repertoire distributed unequally in the urancestral populations of communal parts, of course within confines delimited by persistence. This urancestral population is therefore consistent with the idea of a primordial stem line proposed by Kandler and Woese [150, 151]. However, it was relatively richer in molecular structures and functions, as opposed to the simple cellular systems hypothesized by Woese. This richness is confirmed by modern analyses of proteomes and functionomes that reveal a vast number of universally shared protein domains and GOs among the three superkingdoms. While each syncytial/coenocyte element of the megaorganism exchanged component parts in search of cellular stability and persistence, the process could not be equated with modern HGT. The exchanging community of primordial cells was not cohesive enough to make the horizontal exchange meaningful. Macromolecules most likely established loose and diverse associations with each other and with smaller molecules, limited by the short average life of their unevolved structural conformations. With time, molecules with better-optimized properties engaged in more durable interactions, stabilizing the emergent cells and providing increased cellular cohesiveness. This poised the urancestral community towards a phase transition (a crystallization [148]), a point in which cellular groups had distinct properties and could be individuated. We believe this was the time of the origin of the archaeal lineage, 2.9 billion years ago [48].

At the base of the ToLs that were reconstructed from genomic data, basal archaeal taxa arise as paraphyletic lineages (Figures 4(a) and 4(c)). These lineages likely arose from subgroups of the urancestral population that pervasively lost crucial domain structures and molecular functions. This represents an evolutionary grade under the scenario described above. The emerging lineages shared with the urancestral community a unifying condition that was related to archaic biochemistry. In other words, the urancestral and emerging archaeal lineages expressed fundamental structural and functional equivalences in terms of their repertoires but revealed in each emerging and durable paraphyletic lineage a handful of distinct, newly developed traits. These traits could be global, such as increased thermostability of some crucial members of the protein repertoire or change in the membrane makeup, or local, such as the selective loss of crucial structures and functions. Figure 13 uses the tree paradigm to portray the structural and functional equivalences of the urancestral and emerging archaeal lineages and the slow progression from grades to clades.


Figure 12: Venn diagrams displaying the distributions of 2221 homologous superfamilies, 1152 topologies, and 38 architectures of CATH domains in the proteomes of 492 fully sequenced genomes (from [145]). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

Figure 13: From grades to clades. The cartoon describes a possible progression of modes of organismal diversification during the rise of primordial archaeal lineages. The width of emerging lineages is proportional to uniquely identifying features of physiological and molecular complexity.

8. Through the Wormhole: The Makeup of the “Megaorganism” and the Emerging Archaeal Lineages

Character state reconstructions of proteome repertoires derived from ToLs, coupled to the timelines of ToDs, provide an effective way to define the ancestral protein domain complement of the urancestor [48] and, consequently, the likely makeup of the emerging archaeal lineages. The urancestral proteome possessed a lower bound of ~70 FSF domain structures, 75% of which were composed of α/β and α+β proteins. About 50% of FSFs were part of metabolic enzymes, including a rich toolkit of transferases and enzymes of nucleotide metabolism. The rest of the domains were involved in functions related to information (translation, replication, and repair), intracellular processes (transport, protein modification, and proteolytic activities), regulation (kinases/phosphatases and DNA binding functions), and small molecule binding. The urancestor had a limited repertoire of aaRSs and translation factors. It contained a primordial ribosome with a limited core of universal ribosomal proteins. It had numerous membrane proteins necessary for transport, including a relatively advanced ATP synthetase complex, and structures necessary for cellular organization (filaments and primordial cytoskeletal structures). The cells lacked enzymes for deoxyribonucleotide production, so it is likely that the cellular urancestor itself did not harbor a DNA genome. The cells lacked functions related to extracellular processes (cell adhesion, immune response, and toxins/defense) and cellular motility, suggesting an ancient living world without competitive strategies of survival.

9. Conclusions

The rooting of the ToL has always been controversial in evolutionary biology [26, 152, 153]. While it is popularly accepted that the ToL based on sequence phylogenies is rooted in the akaryotes and that Archaea and Eukarya are sister groups to each other, only two of the three main steps of phylogenetic analysis [104] have been partially fulfilled with sequences. This includes selecting an appropriate statistical or nonstatistical evolutionary model of character change and an optimization method for phylogenetic tree reconstruction. However, no adequate method exists for character polarization that identifies ancestral and derived character states in sequences. In the absence of robust polarization methodology, any statement about the rooting of the ToL should be considered suspect or subject to apriorism. Here we show that information derived from a genomic structural and functional census of millions of encoded proteins and RNAs, coupled with process models that comply with Weston’s generality criterion, provides the means to dissect the origins of diversified life. The generality criterion is fulfilled in these studies by focusing on the accumulation of modules such as protein domain structures, elements of RNA substructures, or ontological definitions of molecular function. In general, these features are the subject of accretion processes that comply with additive phylogenetic change within the nested taxonomic hierarchy and result in changes of abundance. These processes include those responsible for the growth of molecules (e.g., multidomain proteins), molecular ensembles (e.g., the ribosome), and molecular repertoires (e.g., proteomes). The new methods unfold a consistent evolutionary scenario in which the origin of diversified life traces back to the early history of Archaea. Remarkably, the archaic origin of this microbial urkingdom now does justice to its name.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgments

The authors would like to thank members and friends of the GCA laboratory for fruitful discussions. This research has been supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791) and the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625) to Gustavo Caetano-Anollés and from the KRIBB Research Initiative Program and the Next-Generation BioGreen 21 Program, Rural Development Administration (PJ0090192013), to Kyung Mo Kim.

References

[1] C. R. Woese, “A new biology for a new century,” Microbiology and Molecular Biology Reviews, vol. 68, no. 2, pp. 173–186, 2004.

[2] T. Grant, “Testing methods: the evaluation of discovery operations in evolutionary biology,” Cladistics, vol. 18, no. 1, pp. 94–111, 2002.

[3] E. O. Wiley, “Karl R. Popper, systematics, and classification: a reply to Walter Bock and other evolutionary taxonomists,” Systematic Zoology, vol. 24, no. 2, pp. 233–243, 1975.

[4] J. S. Farris, “Parsimony and explanatory power,” Cladistics, vol. 24, no. 5, pp. 825–847, 2008.

[5] G. Caetano-Anollés, K. M. Kim, and D. Caetano-Anollés, “The phylogenomic roots of modern biochemistry: origins of proteins, cofactors and protein biosynthesis,” Journal of Molecular Evolution, vol. 74, no. 1-2, pp. 1–34, 2012.

[6] W. Fontana, “Modeling “evo-devo” with RNA,” BioEssays, vol. 24, no. 12, pp. 1164–1177, 2002.

[7] W. Hennig, Phylogenetic Systematics, University of Illinois Press, Urbana, Ill, USA, 1999.

[8] D. Fraix-Burnet, T. Chattopadhyay, A. K. Chattopadhyay, E. Davoust, and M. Thuillard, “A six-parameter space to describe galaxy diversification,” Astronomy and Astrophysics, vol. 545, article A80, 24 pages, 2012.

[9] E. Sober, “The contest between parsimony and likelihood,” Systematic Biology, vol. 53, no. 4, pp. 644–653, 2004.

[10] E. K. Lienau and R. DeSalle, “Is the microbial tree of life verificationist?” Cladistics, vol. 26, no. 2, pp. 195–201, 2010.

[11] J. S. Huxley, “Evolutionary processes and taxonomy with special reference to grades,” Uppsala Universitets Arsskrift, vol. 6, pp. 21–39, 1958.

[12] C. R. Woese and G. E. Fox, “Phylogenetic structure of the prokaryotic domain: the primary kingdoms,” Proceedings of the National Academy of Sciences of the United States of America, vol. 74, no. 11, pp. 5088–5090, 1977.

[13] P. Forterre, “The universal tree of life and the last universal cellular ancestor: revolution and counterrevolutions,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 43–62, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[14] J. P. Gogarten, H. Kibak, P. Dittrich et al., “Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 17, pp. 6661–6665, 1989.

[15] N. Iwabe, K. Kuma, M. Hasegawa, S. Osawa, and T. Miyata, “Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 23, pp. 9355–9359, 1989.

[16] J. P. Gogarten and L. Olendzenski, “Orthologs, paralogs and genome comparisons,” Current Opinion in Genetics and Development, vol. 9, no. 6, pp. 630–636, 1999.

[17] C. R. Woese, O. Kandler, and M. L. Wheelis, “Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya,” Proceedings of the National Academy of Sciences of the United States of America, vol. 87, no. 12, pp. 4576–4579, 1990.

[18] J. Martijn and T. J. Ettema, “From archaeon to eukaryote: the evolutionary dark ages of the eukaryotic cell,” Biochemical Society Transactions, vol. 41, no. 1, pp. 451–457, 2013.

[19] C. J. Cox, P. G. Foster, R. P. Hirt, S. R. Harris, and T. M. Embley, “The archaebacterial origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 51, pp. 20356–20361, 2008.

[20] P. G. Foster, C. J. Cox, and T. M. Embley, “The primary divisions of life: a phylogenomic approach employing composition-heterogeneous methods,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2197–2207, 2009.

[21] T. A. Williams, P. G. Foster, T. M. Nye, C. J. Cox, and T. M. Embley, “A congruent phylogenomic signal places eukaryotes within the Archaea,” Proceedings of the Royal Society B: Biological Sciences, vol. 279, no. 1749, pp. 4870–4879, 2012.

[22] S. Gribaldo, A. M. Poole, V. Daubin, P. Forterre, and C. Brochier-Armanet, “The origin of eukaryotes and their relationship with the Archaea: are we at a phylogenomic impasse?” Nature Reviews Microbiology, vol. 8, no. 10, pp. 743–752, 2010.

[23] H. Philippe and P. Forterre, “The rooting of the universal tree of life is not reliable,” Journal of Molecular Evolution, vol. 49, no. 4, pp. 509–523, 1999.

[24] E. Sober and M. Steel, “Testing the hypothesis of common ancestry,” Journal of Theoretical Biology, vol. 218, no. 4, pp. 395–408, 2002.

[25] P. Forterre, “The common ancestor of Archaea and Eukarya was not an archaeon,” Archaea, vol. 2013, Article ID 372396, 18 pages, 2013.

[26] P. Forterre and H. Philippe, “Where is the root of the universal tree of life?” BioEssays, vol. 21, no. 10, pp. 871–879, 1999.

[27] S. Gribaldo and H. Philippe, “Ancient phylogenetic relationships,” Theoretical Population Biology, vol. 61, no. 4, pp. 391–408, 2002.

[28] J. K. Harris, S. T. Kelley, G. B. Spiegelman, and N. R. Pace, “The genetic core of the universal ancestor,” Genome Research, vol. 13, no. 3, pp. 407–412, 2003.

[29] A. Harish and G. Caetano-Anollés, “Ribosomal history reveals origins of modern protein synthesis,” PLoS ONE, vol. 7, no. 3, Article ID e32776, 2012.

[30] G. Caetano-Anollés and M. J. Seufferheld, “The coevolutionary roots of biochemistry and cellular organization challenge the RNA world paradigm,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 152–177, 2013.

[31] N. R. Pace, “Mapping the tree of life: progress and prospects,” Microbiology and Molecular Biology Reviews, vol. 73, no. 4, pp. 565–576, 2009.

[32] P. de Rijk, Y. van de Peer, I. van den Broeck, and R. de Wachter, “Evolution according to large ribosomal subunit RNA,” Journal of Molecular Evolution, vol. 41, no. 3, pp. 366–375, 1995.

[33] G. Caetano-Anollés, “Tracing the evolution of RNA structure in ribosomes,” Nucleic Acids Research, vol. 30, no. 11, pp. 2575–2587, 2002.


[34] J. Mallatt and C. J. Winchell, “Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes,” Molecular Biology and Evolution, vol. 19, no. 3, pp. 289–301, 2002.

[35] M. Gerstein, “Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census,” Proteins: Structure, Function and Bioinformatics, vol. 33, no. 4, pp. 518–534, 1998.

[36] M. Gerstein and H. Hegyi, “Comparing genomes in terms of protein structure: surveys of a finite parts list,” FEMS Microbiology Reviews, vol. 22, no. 4, pp. 277–304, 1998.

[37] B. Snel, P. Bork, and M. A. Huynen, “Genome phylogeny based on gene content,” Nature Genetics, vol. 21, no. 1, pp. 108–110, 1999.

[38] F. Tekaia, A. Lazcano, and B. Dujon, “The genomic tree as revealed from whole proteome comparisons,” Genome Research, vol. 9, no. 6, pp. 550–557, 1999.

[39] Y. I. Wolf, S. E. Brenner, P. A. Bash, and E. V. Koonin, “Distribution of protein folds in the three superkingdoms of life,” Genome Research, vol. 9, no. 1, pp. 17–26, 1999.

[40] J. O. Korbel, B. Snel, M. A. Huynen, and P. Bork, “SHOT: a web server for the construction of genome phylogenies,” Trends in Genetics, vol. 18, no. 3, pp. 158–162, 2002.

[41] G. Caetano-Anollés and D. Caetano-Anollés, “An evolutionarily structured universe of protein architecture,” Genome Research, vol. 13, no. 7, pp. 1563–1571, 2003.

[42] S. L. Baldauf, A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle, “A kingdom-level phylogeny of eukaryotes based on combined protein data,” Science, vol. 290, no. 5493, pp. 972–977, 2000.

[43] J. R. Brown, C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope, “Universal trees based on large combined protein sequence data sets,” Nature Genetics, vol. 28, no. 3, pp. 281–285, 2001.

[44] F. D. Ciccarelli, T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P. Bork, “Toward automatic reconstruction of a highly resolved tree of life,” Science, vol. 311, no. 5765, pp. 1283–1287, 2006.

[45] T. Dagan, M. Roettger, D. Bryant, and W. Martin, “Genome networks root the tree of life between prokaryotic domains,” Genome Biology and Evolution, vol. 2, no. 1, pp. 379–392, 2010.

[46] S. Jun, G. E. Sims, G. A. Wu, and S. Kim, “Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 133–138, 2010.

[47] E. A. Schultes, P. T. Hraber, and T. H. LaBean, “No molecule is an island: molecular evolution and the study of sequence space,” in Algorithmic Bioprocesses, A. Condon, D. Harel, J. N. Kok, A. Salomaa, and E. Winfree, Eds., pp. 675–704, Springer, Berlin, Germany, 2009.

[48] K. M. Kim and G. Caetano-Anollés, “The proteomic complexity and rise of the primordial ancestor of diversified life,” BMC Evolutionary Biology, vol. 11, no. 1, article 140, 2011.

[49] V. A. Albert, Parsimony, Phylogeny, and Genomics, Oxford University Press, Oxford, UK, 2005.

[50] G. Caetano-Anollés, “Evolved RNA secondary structure and the rooting of the universal tree of life,” Journal of Molecular Evolution, vol. 54, no. 3, pp. 333–345, 2002.

[51] J. A. Lake, R. G. Skophammer, C. W. Herbold, and J. A. Servin, “Genome beginnings: rooting the tree of life,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2177–2185, 2009.

[52] B. Guo, M. Zou, and A. Wagner, “Pervasive indels and their evolutionary dynamics after the fish-specific genome duplication,” Molecular Biology and Evolution, vol. 29, no. 10, pp. 3005–3022, 2012.

[53] M. K. Basu, I. B. Rogozin, O. Deusch, T. Dagan, W. Martin, and E. V. Koonin, “Evolutionary dynamics of introns in plastid-derived genes in plants: saturation nearly reached but slow intron gain continues,” Molecular Biology and Evolution, vol. 25, no. 1, pp. 111–119, 2008.

[54] J. S. Farris, “Phylogenetic analysis under Dollo’s Law,” Systematic Biology, vol. 26, no. 1, pp. 77–88, 1977.

[55] H. Fang, M. E. Oates, R. B. Pethica et al., “A daily-updated tree of (sequenced) life as a reference for genome research,” Scientific Reports, vol. 3, article 2015, 2013.

[56] T. Cavalier-Smith, “The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification,” International Journal of Systematic and Evolutionary Microbiology, vol. 52, no. 1, pp. 7–76, 2002.

[57] T. Cavalier-Smith, “Rooting the tree of life by transition analyses,” Biology Direct, vol. 1, article 19, 2006.

[58] H. S. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Widespread recruitment of ancient domain structures in modern enzymes during metabolic evolution,” Journal of Integrative Bioinformatics, vol. 10, no. 1, article 214, 2013.

[59] G. Caetano-Anollés, M. Wang, and D. Caetano-Anollés, “Structural phylogenomics retrodicts the origin of the genetic code and uncovers the evolutionary impact of protein flexibility,” PLoS ONE, vol. 8, no. 8, Article ID e72225, 2013.

[60] E. Bapteste and C. Brochier, “On the conceptual difficulties in rooting the tree of life,” Trends in Microbiology, vol. 12, no. 1, pp. 9–13, 2004.

[61] A. M. Poole, “Horizontal gene transfer and the earliest stages of the evolution of life,” Research in Microbiology, vol. 160, no. 7, pp. 473–480, 2009.

[62] D. Raoult, “The post-Darwinist rhizome of life,” The Lancet, vol. 375, no. 9709, pp. 104–105, 2010.

[63] W. F. Doolittle, “Phylogenetic classification and the universal tree,” Science, vol. 284, no. 5423, pp. 2124–2128, 1999.

[64] T. Dagan and W. Martin, “The tree of one percent,” Genome Biology, vol. 7, no. 10, article 118, 2006.

[65] W. F. Doolittle and E. Bapteste, “Pattern pluralism and the tree of life hypothesis,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 7, pp. 2043–2049, 2007.

[66] E. V. Koonin, “Towards a postmodern synthesis of evolutionary biology,” Cell Cycle, vol. 8, no. 6, pp. 799–800, 2009.

[67] T. Kloesges, O. Popa, W. Martin, and T. Dagan, “Networks of gene sharing among 329 proteobacterial genomes reveal differences in lateral gene transfer frequency at different phylogenetic depths,” Molecular Biology and Evolution, vol. 28, no. 2, pp. 1057–1074, 2011.

[68] S. Halary, J. W. Leigh, B. Cheaib, P. Lopez, and E. Bapteste, “Network analyses structure genetic diversity in independent genetic worlds,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 127–132, 2010.

[69] S. S. Abby, E. Tannier, M. Gouy, and V. Daubin, “Lateral gene transfer as a support for the tree of life,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 13, pp. 4962–4967, 2012.

[70] N. Glansdorff, Y. Xu, and B. Labedan, “The conflict between horizontal gene transfer and the safeguard of identity: origin of meiotic sexuality,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 470–480, 2009.

[71] C. G. Kurland and O. G. Berg, “A hitchhiker’s guide to evolving networks,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 361–396, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[72] L. P. Villarreal and G. Witzany, “The DNA habitat and its RNA inhabitants: at the dawn of RNA sociology,” Genomics Insights, vol. 6, pp. 1–12, 2013.

[73] S. Gribaldo and C. Brochier, “Phylogeny of prokaryotes: does it exist and why should we care?” Research in Microbiology, vol. 160, no. 7, pp. 513–521, 2009.

[74] B. D. Mishler, “The logic of the data matrix in phylogenetic analysis,” in Parsimony, Phylogeny, and Genomics, V. A. Albert, Ed., pp. 57–69, Oxford University Press, New York, NY, USA, 2005.

[75] A. Dress, V. Moulton, M. Steel, and T. Wu, “Species, clusters and the “tree of life”: a graph-theoretic perspective,” Journal of Theoretical Biology, vol. 265, no. 4, pp. 535–542, 2010.

[76] N. Lane and W. Martin, “The energetics of genome complexity,” Nature, vol. 467, no. 7318, pp. 928–934, 2010.

[77] M. C. Rivera and J. A. Lake, “The ring of life provides evidence for a genome fusion origin of eukaryotes,” Nature, vol. 431, no. 7005, pp. 152–155, 2004.

[78] D. Alvarez-Ponce and J. O. McInerney, “The human genome retains relics of its prokaryotic ancestry: human genes of archaebacterial and eubacterial origin exhibit remarkable differences,” Genome Biology and Evolution, vol. 3, no. 1, pp. 782–790, 2011.

[79] A. Poole and D. Penny, “Eukaryote evolution: engulfed by speculation,” Nature, vol. 447, no. 7147, p. 913, 2007.

[80] E. V. Koonin and Y. I. Wolf, “The fundamental units, processes and patterns of evolution, and the tree of life conundrum,” Biology Direct, vol. 4, no. 1, article 33, 2009.

[81] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Search for a “tree of life” in the thicket of the phylogenetic forest,” Journal of Biology, vol. 8, no. 6, article 59, 2009.

[82] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Seeing the tree of life behind the phylogenetic forest,” BMC Biology, vol. 11, article 46, 2013.

[83] D. B. Wetlaufer, “Nucleation, rapid folding, and globular intrachain regions in proteins,” Proceedings of the National Academy of Sciences of the United States of America, vol. 70, no. 3, pp. 697–701, 1973.

[84] J. S. Richardson, “The anatomy and taxonomy of protein structure,” Advances in Protein Chemistry, vol. 34, pp. 167–339, 1981.

[85] M. Riley and B. Labedan, “Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module,” Journal of Molecular Biology, vol. 268, no. 5, pp. 857–868, 1997.

[86] J. Janin and S. J. Wodak, “Structural domains in proteins and their role in the dynamics of protein function,” Progress in Biophysics and Molecular Biology, vol. 42, pp. 21–78, 1983.

[87] G. Caetano-Anollés, M. Wang, D. Caetano-Anollés, and J. E. Mittenthal, “The origin, evolution and structure of the protein world,” Biochemical Journal, vol. 417, no. 3, pp. 621–637, 2009.

[88] J. Gough, “Convergent evolution of domain architectures (is rare),” Bioinformatics, vol. 21, no. 8, pp. 1464–1471, 2005.

[89] K. Forslund, A. Henricson, V. Hollich, and E. L. L. Sonnhammer, “Domain tree-based analysis of protein architecture evolution,” Molecular Biology and Evolution, vol. 25, no. 2, pp. 254–264, 2008.

[90] S. Yang and P. E. Bourne, “The evolutionary history of protein domains viewed by species phylogeny,” PLoS ONE, vol. 4, no. 12, Article ID e8378, 2009.

[91] K. M. Kim and G. Caetano-Anollés, “The evolutionary history of protein fold families and proteomes confirms that the archaeal ancestor is more ancient than the ancestors of other superkingdoms,” BMC Evolutionary Biology, vol. 12, article 13, 2012.

[92] M. Wang and G. Caetano-Anollés, “The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world,” Structure, vol. 17, no. 1, pp. 66–78, 2009.

[93] G. Caetano-Anollés and A. Nasir, “Benefits of using molecular structure and abundance in phylogenomic analysis,” Frontiers in Genetics, vol. 3, article 172, 2012.

[94] K. Brading and E. Castellani, Symmetries in Physics: Philosophical Reflections, Cambridge University Press, Cambridge, UK, 2003.

[95] J. A. Cotton and M. Wilkinson, “Quantifying the potential utility of phylogenetic characters,” Taxon, vol. 57, no. 1, pp. 131–136, 2008.

[96] E. Mossel and M. Steel, “A phase transition for a random cluster model on phylogenetic trees,” Mathematical Biosciences, vol. 187, no. 2, pp. 189–203, 2004.

[97] M. H. Bailor, X. Sun, and H. M. Al-Hashimi, “Topology links RNA secondary structure with global conformation, dynamics, and adaptation,” Science, vol. 327, no. 5962, pp. 202–206, 2010.

[98] K. Illergård, D. H. Ardell, and A. Elofsson, “Structure is three to ten times more conserved than sequence—a study of structural response in protein cores,” Proteins: Structure, Function and Bioinformatics, vol. 77, no. 3, pp. 499–508, 2009.

[99] G. Caetano-Anollés, S. K. Hee, and J. E. Mittenthal, “The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 22, pp. 9358–9363, 2007.

[100] P. H. Weston, “Methods for rooting cladistic trees,” in Models in Phylogeny Reconstruction, D. J. Siebert, R. W. Scotland, and D. M. Williams, Eds., pp. 125–155, Oxford University Press, Oxford, UK, 1994.

[101] H. N. Bryant, “The polarization of character transformations in phylogenetic systematics: role of axiomatic and auxiliary assumptions,” Systematic Biology, vol. 40, no. 4, pp. 433–445, 1991.

[102] H. N. Bryant and G. Wagner, “Character polarity and the rooting of cladograms,” in The Character Concept in Evolutionary Biology, G. P. Wagner, Ed., pp. 319–338, Academic Press, New York, NY, USA, 2001.

[103] G. Nelson, “Ontogeny, phylogeny, paleontology, and the biogenetic law,” Systematic Biology, vol. 27, no. 3, pp. 324–345, 1978.

[104] P. H. Weston, “Indirect and direct methods in systematics,” in Ontogeny and Systematics, C. J. Humphries, Ed., pp. 27–56, Columbia University Press, New York, NY, USA, 1988.

[105] F.-J. Sun and G. Caetano-Anollés, “The evolutionary history of the structure of 5S ribosomal RNA,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 430–443, 2009.

[106] X. Fang, T. Pan, and T. R. Sosnick, “Mg2+-dependent folding of a large ribozyme without kinetic traps,” Nature Structural Biology, vol. 6, no. 12, pp. 1091–1095, 1999.

[107] F.-J. Sun and G. Caetano-Anollés, “Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses,” PLoS Computational Biology, vol. 4, no. 3, Article ID e1000018, 2008.

[108] F.-J. Sun and G. Caetano-Anollés, “The origin and evolution of tRNA inferred from phylogenetic analysis of structure,” Journal of Molecular Evolution, vol. 66, no. 1, pp. 21–35, 2008.

[109] W. R. Ashby, An Introduction to Cybernetics, Taylor & Francis, London, UK, 1955.

[110] J. G. Lundberg, “Wagner networks and ancestors,” Systematic Biology, vol. 21, no. 4, pp. 398–413, 1972.

[111] P. T. Chippindale and J. J. Wiens, “Weighting, partitioning, and combining characters in phylogenetic analysis,” Systematic Biology, vol. 43, no. 2, pp. 278–287, 1994.

[112] J. Mittenthal, D. Caetano-Anollés, and G. Caetano-Anollés, “Biphasic patterns of diversification and the emergence of modules,” Frontiers in Genetics, vol. 3, article 147, 2012.

[113] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[114] C. A. Orengo, A. Michie, S. Jones, D. T. Jones, M. Swindells, and J. M. Thornton, “CATH—a hierarchic classification of protein domain structures,” Structure, vol. 5, no. 8, pp. 1093–1109, 1997.

[115] G. Sawa, J. Dicks, and I. N. Roberts, “Current approaches to whole genome phylogenetic analysis,” Briefings in Bioinformatics, vol. 4, no. 1, pp. 63–74, 2003.

[116] R. Sathyapriya, J. M. Duarte, H. Stehr, I. Filippis, and M. Lappe, “Defining an essence of structure determining residue contacts in proteins,” PLoS Computational Biology, vol. 5, no. 12, Article ID e1000584, 2009.

[117] D. Penny and L. J. Collins, “Evolutionary genomics leads the way,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 1–16, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[118] M. Ashburner, C. A. Ball, J. A. Blake, et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.

[119] A. Andreeva, D. Howorth, J. Chandonia, et al., “Data growth and its impact on the SCOP database: new developments,” Nucleic Acids Research, vol. 36, pp. D419–D425, 2008.

[120] L. S. Yafremava, M. Wielgos, S. Thomas, et al., “A general framework of persistence strategies for biological systems helps explain domains of life,” Frontiers in Genetics, vol. 4, article 16, 2013.

[121] J. L. Friar, T. Goldman, and J. Perez-Mercader, “Genome sizes and the Benford distribution,” PLoS ONE, vol. 7, no. 5, Article ID e36624, 2012.

[122] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya,” BMC Evolutionary Biology, vol. 12, article 156, 2012.

[123] A. Nasir and G. Caetano-Anollés, “Comparative analysis of proteomes and functionomes provides insights into origins of cellular diversification,” Archaea, vol. 2013, Article ID 648746, 13 pages, 2013.

[124] W. Martin and M. Müller, “The hydrogen hypothesis for the first eukaryote,” Nature, vol. 392, no. 6671, pp. 37–41, 1998.

[125] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Global patterns of protein domain gain and loss in superkingdoms,” PLoS Computational Biology, vol. 10, no. 1, Article ID e1003452, 2014.

[126] M. Wang, L. S. Yafremava, D. Caetano-Anollés, J. E. Mittenthal, and G. Caetano-Anollés, “Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world,” Genome Research, vol. 17, no. 11, pp. 1572–1585, 2007.

[127] L. Guy and T. J. G. Ettema, “The archaeal “TACK” superphylum and the origin of eukaryotes,” Trends in Microbiology, vol. 19, no. 12, pp. 580–587, 2011.

[128] D. Schlieper, M. A. Oliva, J. M. Andreu, and J. Löwe, “Structure of bacterial tubulin BtubA/B: evidence for horizontal gene transfer,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 26, pp. 9170–9175, 2005.

[129] M. P. Hoeppner, P. P. Gardner, and A. M. Poole, “Comparative analysis of RNA families reveals distinct repertoires for each domain of life,” PLoS Computational Biology, vol. 8, no. 11, Article ID e1002752, 2012.

[130] H. Xue, K. Tong, C. Marck, H. Grosjean, and J. T. Wong, “Transfer RNA paralogs: evidence for genetic code-amino acid biosynthesis coevolution and an archaeal root of life,” Gene, vol. 310, no. 1-2, pp. 59–66, 2003.

[131] H. Xue, S. Ng, K. Tong, and J. T. Wong, “Congruence of evidence for a Methanopyrus-proximal root of life based on transfer RNA and aminoacyl-tRNA synthetase genes,” Gene, vol. 360, no. 2, pp. 120–130, 2005.

[132] M. Di Giulio, “Nanoarchaeum equitans is a living fossil,” Journal of Theoretical Biology, vol. 242, no. 1, pp. 257–260, 2006.

[133] M. Di Giulio, “The tree of life might be rooted in the branch leading to Nanoarchaeota,” Gene, vol. 401, no. 1-2, pp. 108–113, 2007.

[134] J. T. Wong, J. Chen, W. Mat, S. Ng, and H. Xue, “Polyphasic evidence delineating the root of life and roots of biological domains,” Gene, vol. 403, no. 1-2, pp. 39–52, 2007.

[135] F.-J. Sun and G. Caetano-Anollés, “The ancient history of the structure of ribonuclease P and the early origins of Archaea,” BMC Bioinformatics, vol. 11, article 153, 2010.

[136] C. Marck and H. Grosjean, “tRNomics: analysis of tRNA genes from 50 genomes of Eukarya, Archaea, and bacteria reveals anticodon-sparing strategies and domain-specific features,” RNA, vol. 8, no. 10, pp. 1189–1232, 2002.

[137] W. Mat, H. Xue, and J. T. Wong, “The genomics of LUCA,” Frontiers in Bioscience, vol. 13, no. 14, pp. 5605–5613, 2008.

[138] C. Brochier, S. Gribaldo, Y. Zivanovic, F. Confalonieri, and P. Forterre, “Nanoarchaea: representatives of a novel archaeal phylum or a fast-evolving euryarchaeal lineage related to Thermococcales?” Genome Biology, vol. 6, no. 5, article R42, 2005.

[139] B. Billoud, M. Guerrucci, M. Masselot, and J. S. Deutsch, “Cirripede phylogeny using a novel approach: molecular morphometrics,” Molecular Biology and Evolution, vol. 17, no. 10, pp. 1435–1445, 2000.

[140] L. J. Collins, V. Moulton, and D. Penny, “Use of RNA secondary structure for studying the evolution of RNase P and RNase MRP,” Journal of Molecular Evolution, vol. 51, no. 3, pp. 194–204, 2000.

[141] G. Caetano-Anollés, “Novel strategies to study the role of mutation and nucleic acid structure in evolution,” Plant Cell, Tissue and Organ Culture, vol. 67, no. 2, pp. 115–132, 2001.

[142] F.-J. Sun, S. Fleurdepine, C. Bousquet-Antonelli, G. Caetano-Anollés, and J. Deragon, “Common evolutionary trends for SINE RNA structures,” Trends in Genetics, vol. 23, no. 1, pp. 26–33, 2007.

[143] D. Caetano-Anollés, K. M. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Proteome evolution and the metabolic origins of translation and cellular life,” Journal of Molecular Evolution, vol. 72, no. 1, pp. 14–33, 2011.


[144] M. Wang, Y. Jiang, K. M. Kim, et al., “A universal molecular clock of protein folds and its power in tracing the early history of aerobic metabolism and planet oxygenation,” Molecular Biology and Evolution, vol. 28, no. 1, pp. 567–582, 2011.

[145] S. A. Bukhari and G. Caetano-Anollés, “Origin and evolution of protein fold designs inferred from phylogenomic analysis of CATH domain structures in proteomes,” PLoS Computational Biology, vol. 9, no. 3, Article ID e1003009, 2013.

[146] M. Vesteg and J. Krajcovic, “The falsifiability of the models for the origin of eukaryotes,” Current Genetics, vol. 57, no. 6, pp. 367–390, 2011.

[147] I. N. Berezovsky and E. I. Shakhnovich, “Physics and evolution of thermophilic adaptation,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 36, pp. 12742–12747, 2005.

[148] C. Woese, “The universal ancestor,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 12, pp. 6854–6859, 1998.

[149] M. J. Seufferheld and G. Caetano-Anollés, “Phylogenomics supports a cellularly structured urancestor,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 178–191, 2013.

[150] O. Kandler, “Cell wall biochemistry and three-domain concept of life,” Systematic and Applied Microbiology, vol. 16, no. 4, pp. 501–509, 1994.

[151] C. R. Woese, “On the evolution of cells,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 13, pp. 8742–8747, 2002.

[152] S. L. Baldauf, J. D. Palmer, and W. F. Doolittle, “The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny,” Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 15, pp. 7749–7754, 1996.

[153] H. Leffers, J. Kjems, L. Østergaard, N. Larsen, and R. A. Garrett, “Evolutionary relationships amongst archaebacteria. A comparative study of 23 S ribosomal RNAs of a sulphur-dependent extreme thermophile, an extreme halophile and a thermophilic methanogen,” Journal of Molecular Biology, vol. 195, no. 1, pp. 43–61, 1987.

[154] J. Martin, D. Blackburn, and E. Wiley, “Are node-based and stem-based clades equivalent? Insights from graph theory,” PLoS Currents, vol. 2, Article ID RRN1196, 2010.


their complete absence from Bacteria (except for documented tubulin HGT from eukaryotes to the bacterial genus Prosthecobacter [128]). This suggests either that eukaryotes arose from an archaeal lineage [127] or that the ancestor of Eukarya and Archaea was complex and modern Archaea are highly reduced [25]. Indeed, a few eukaryote-specific proteins have now been found in some archaeal species (in some cases, just one species). We argue that this poor spread cannot be taken as evidence for the Archaea-Eukarya sister relationship. This needs to be confirmed by robust phylogenetic analysis, which is unfortunately not possible when using protein sequences. Recently, we described a new strategy for inferring vertical and horizontal traces [123]. This method calculates the spread of FSFs and GOs that are shared between two-superkingdom groups (i.e., AB, AE, and BE). Balanced distributions often indicate vertical inheritance, while biased distributions suggest horizontal flux. For example, penicillin binding molecular activity (GO:0008658) was present in 100% of the sampled bacterial proteomes but was only present in 11% of the archaeal species [123]. Thus, the presence of GO:0008658 in Archaea was attributed to HGT gain from Bacteria. Using this simple method, we established that Bacteria and Eukarya were united by a much stronger vertical trace than either was to Archaea. In fact, strong reductive tendencies in the archaeal genomes were recorded [123]. Thus, in our opinion, the presence of eukaryote-specific proteins in only very few archaeal species could in fact be an HGT event that is not detectable by sequence phylogenies.
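The balanced-versus-biased comparison described above can be illustrated with a minimal sketch. It assumes that "spread" is the fraction of sampled proteomes in a superkingdom that contain a feature, and it uses an arbitrary two-fold cutoff to call a distribution biased; the function names and the cutoff are hypothetical choices for illustration and are not taken from [123].

```python
# Sketch only: compare how evenly a shared feature is spread across the
# proteomes of two superkingdoms. The 2-fold cutoff is an illustrative
# assumption, not a value reported in the source.

def spread(presence_flags):
    """Fraction (0 to 1) of sampled proteomes that contain the feature."""
    return sum(presence_flags) / len(presence_flags)


def classify_trace(f_first, f_second, max_fold=2.0):
    """Label a two-superkingdom feature as 'balanced' or 'biased'."""
    low, high = sorted((f_first, f_second))
    if low == 0:
        return "not shared"
    return "balanced" if high / low <= max_fold else "biased"


# GO:0008658 values quoted in the text: 100% of bacterial proteomes,
# 11% of archaeal species.
print(classify_trace(1.00, 0.11))
```

Under these assumptions a biased label only flags a candidate for horizontal flux; it does not by itself establish the direction of transfer.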

In turn, many arguments now favor the Bacteria-Eukarya sisterhood, in addition to the structure-based phylogenies and the balanced distribution of molecular features. For example, Bacteria and Eukarya have similar lipid membranes, which can be used as an argument for their evolutionary kinship. Moreover, Archaea fundamentally differ from Eukarya in terms of their virosphere. Viruses infecting Archaea and Eukarya are drastically different, as recently discussed by Forterre [25].

The genomic distribution of RNA families was taken from Hoeppner et al. [129] and also shows a vertical evolutionary trace in five crucial Rfam clans that are universal, including tRNA, 5S rRNA, subunit rRNA, and RNase P RNA. These universal RNA groups are likely minimally affected by HGT. However, 99% of Rfam clans and families are specific to domains of life, and only 11 of the 1148 groups were shared at f > 0.6 levels. This clearly shows that the functional complexity of RNA materialized very late during organismal diversification and that it is not a good genomic feature for exploring the rooting of the ToL. Only five RNA families are shared between two domains of life, and only one of these does so at f > 0.6: the G12 pseudoknot of the 23S rRNA, which is present in bacterial and eukaryotic organellar rRNA. While the large subunit rRNA scaffold supports the G12 pseudoknot with its vertical trace, all five interdomain RNA families can be explained most parsimoniously by HGT. Thus, only a handful of ancient and universal RNA species can be used to root the ToL.

We end by noting that the Venn diagrams consistently show that Archaea harbors the smallest number of unique (A) and shared (AB and AE) FSFs, GO terms, or RNA families. This trend supports an early divergence of this domain of life from the urancestor and the possibility that such divergence was shaped by evolutionary reductive events. We reason that such losses would be more parsimonious early on in evolution than in later periods. This is because genes often increase their abundance in evolutionary time (by gene duplications, HGT, and other processes). Thus, it is reasonable to think that loss of an ancient gene would be more feasible very early in evolution relative to losing it very late.
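As an illustration of how the seven Venn groups (A, B, E, AB, AE, BE, and ABE) can be tallied, the following minimal sketch assigns each feature to a group from presence/absence sets. The feature identifiers are hypothetical placeholders, and the sketch assumes a feature counts as present in a superkingdom if it occurs in at least one of its proteomes.

```python
# Sketch only: assign features to Venn groups from per-superkingdom sets.
# Feature identifiers below are hypothetical placeholders, not real FSFs.
from collections import Counter


def venn_groups(archaea, bacteria, eukarya):
    """Map each feature to one of A, B, E, AB, AE, BE, or ABE."""
    groups = {}
    for feature in archaea | bacteria | eukarya:
        label = ""
        if feature in archaea:
            label += "A"
        if feature in bacteria:
            label += "B"
        if feature in eukarya:
            label += "E"
        groups[feature] = label
    return groups


archaea = {"f1", "f2"}
bacteria = {"f1", "f3", "f4"}
eukarya = {"f1", "f3", "f5"}

groups = venn_groups(archaea, bacteria, eukarya)
counts = Counter(groups.values())
print(sorted(counts.items()))
```

Comparing the A, AB, and AE counts with the remaining groups is then a matter of reading the tallies.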

6.2. Phylogenomic Evidence from the Sequence and Structure of RNA. Phylogenetic analyses of the few RNA families that are universal and display an important vertical evolutionary trace (Figure 6(c)) provide compelling evidence in favor of an early evolutionary appearance of Archaea [105, 107, 108, 130–135]. Here we briefly summarize evidence from tRNA, 5S rRNA, and RNase P RNA. Unpublished analyses of rRNA sequence and structure using advanced phylogenetic methods also show that Archaea was the first domain of life.

tRNA molecules are generally short (∼73–95 nucleotides in length) and highly conserved. Consequently, their sequence generally contains a limited amount of phylogenetic information. These limitations have been overcome by analyzing entire tRNomes [136], which extend the length of an organismal set of tRNAs to over 2000 bases. Xue et al. [130, 131] analyzed the genetic distances between tRNA sequences as averages between alloacceptor tRNAs from diverse groups of tRNomes using multiple molecular base substitution models. The distances were mapped onto an unrooted phylogeny of tRNA molecules (Figure 8(a)). The “arrow of time” assumption in these studies is that ancient tRNA paralogs closely resemble each other when lineages originate close to the time of the gene duplication. Remarkably, the results revealed a paraphyletic rooting of the ToL in Archaea. The root was specifically located close to the hyperthermophilic methanogen Methanopyrus kandleri (Figure 8(a)). The hypothesis of this specific rooting scenario has been supported by several other studies [134, 137], including a study of genetic distances between paralogous pairs of aminoacyl-tRNA synthetase (aaRS) proteins [131]. A remarkable match between distance scores of tRNA and pairs of aaRS paralogs (Figure 8(b)) not only confirms the early appearance of Archaea but also suggests a coevolutionary trace associated with molecular interactions that are responsible for the genetic code. In fact, a recent exhaustive phylogenomic analysis of tRNA and aaRS coevolution explicitly reveals the origins and evolution of the genetic code and the underlying molecular basis of genetics [59]. Di Giulio [132, 133] has also proposed an archaeal rooting of the tree of life, specifically in the lineage leading to the phylum Nanoarchaeota. This rooting is based on unique and ancestral genomic traits of Nanoarchaeum equitans, namely split genes separately codifying for the 5′ and 3′ halves of tRNA and the absence of operons, which are considered molecular fossils [132]. However, this claim needs additional support, as contrasting evidence now recognizes N. equitans as a highly derived archaeal species [138].
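To make the alloacceptor distance concrete, the sketch below averages a simple mismatch fraction over all pairs of tRNA sequences in a genome. With 20 alloacceptor classes there are C(20, 2) = 190 pairs, which matches the number of pairs quoted for D_allo in Figure 8. The sketch assumes pre-aligned sequences of equal length and uses an uncorrected p-distance; Xue et al. [130, 131] used substitution models, so this is a simplified stand-in, and the function names are hypothetical.

```python
# Sketch only: average pairwise mismatch distance among aligned tRNA sequences.
# Uses an uncorrected p-distance as a stand-in for the substitution models
# used in the cited studies; sequences must be pre-aligned to equal length.
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def d_allo(sequences):
    """Mean p-distance over all unordered pairs of sequences."""
    pairs = list(combinations(sequences, 2))
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Twenty alloacceptor classes give 190 unordered pairs.
n_pairs = len(list(combinations(range(20), 2)))
print(n_pairs)

# Toy aligned sequences (hypothetical, far shorter than real tRNAs).
toy = ["ACGU", "ACGU", "ACGA"]
print(d_allo(toy))
```

Under the "arrow of time" assumption stated above, a lower value for a genome would be read as closer resemblance among its tRNA paralogs.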

In addition to sequences, structural features of tRNA molecules also support the archaeal rooting of the ToL.


[Figure 8 graphic. (a) Unrooted tree of tRNA sequences with species labeled by three-letter codes and grouped as Archaea, Bacteria, and Eukarya, colored by D_allo on a scale from 0.35 to 0.92. (b) Plot of aminoacyl-tRNA synthetase history (Q_ars, vertical axis) against tRNA alloacceptor history (D_allo, horizontal axis) for Archaea, Bacteria, and Eukarya, with regions labeled "Ancient origins" and "Recent origins".]

Figure 8: Ancient phylogenetic signal in the sequence of tRNA and their associated aminoacyl-tRNA synthetase (aaRS) enzymes. (a) Unrooted ToL derived from tRNA sequences with their alloacceptor D_allo distances traced in thermal scale (from [130]). D_allo is the average of pairwise distances for 190 pairs of tRNA alloacceptors. These distances measure pairwise sequence mismatches of tRNAs for every genome, and their values increase for faster evolving sequences of species of more recent origin. (b) Coevolution of aaRSs and their corresponding tRNA. Genetic distances between the top 10 potentially paralogous aaRS pairs, estimated using BLASTP, define a measure (Q_ars) of how closely the proteins resemble each other in genomes (from [131]). Larger Q_ars scores imply more ancestral and slowly evolving protein pairs. The plot of Q_ars scores against D_allo distances reveals a hidden correlation between the evolution of tRNA and aaRSs and the early origin of Archaea (larger Q_ars and lower D_allo distances).

The application of RNA structural evidence in phylogenetic studies [50, 139–141] has multiple advantages over sequence data when studying ancient events, especially because RNA structures are far more conserved than sequences. This has been demonstrated in a phylogenetic approach that uses RNA structural information to reconstruct the evolutionary history of macromolecules such as rRNA [29, 50], tRNA [107, 108], 5S rRNA [105], RNase P RNA [135], and SINE RNA [142]. Geometrical and statistical properties of structure (e.g., stems or loops commonly found in the secondary structures of RNA molecules) are treated as linearly ordered multistate phylogenetic characters. In order to build rooted trees, an evolutionary tendency toward conformational order is used to polarize change in character state transformations. This defines a hypothetical ancestor with which to root the ingroup using the Lundberg method. Reconstructed phylogenies produce trees of molecules and ToLs (e.g., Figure 4(a)) or trees of substructures that describe the gradual evolutionary accretion of structural components into molecules (Figure 9). For example, phylogenetic trees of tRNA substructures define explicit models of molecular history and show that tRNA originated in the acceptor stem of the molecule [108]. Remarkably, trees reconstructed from tRNA drawn from individual domains of life demonstrate that the sequence of accretion events occurred differently in Archaea than in Bacteria and Eukarya, suggesting a sister group evolutionary relationship between the bacterial and eukaryotic domains (Figure 9(a)). A similar result obtained from trees of 5S rRNA substructures revealed different molecular accretion sequences of archaeal molecules when these were compared to bacterial and eukaryal counterparts [105], confirming again, in a completely different molecular system, the history of the domains of life.
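The idea of linearly ordered multistate characters and a polarized hypothetical ancestor can be sketched as follows. Each structural character (for example, a stem length) is coded as an integer state, a change between states i and j costs |i - j| steps, and the hypothetical ancestor takes the state at the assumed ancestral pole of each character. Which pole is ancestral for a given character is passed in as a parameter here because it is an assumption of this illustration; the data matrix and names are hypothetical, and the sketch does not perform the tree search or the Lundberg rooting itself.

```python
# Sketch only: ordered (additive) multistate characters and a hypothetical
# ancestor built from assumed polarities. The matrix, the polarities, and
# all names are illustrative; no tree search or Lundberg rooting is done.

def ordered_steps(state_a, state_b):
    """Cost of changing between two states of a linearly ordered character."""
    return abs(state_a - state_b)


def hypothetical_ancestor(matrix, ancestral_poles):
    """Take the 'max' or 'min' observed state of each character as ancestral."""
    columns = list(zip(*matrix.values()))
    return [
        max(column) if pole == "max" else min(column)
        for column, pole in zip(columns, ancestral_poles)
    ]


def steps_to_ancestor(matrix, ancestor):
    """Total ordered-character steps separating each taxon from the ancestor."""
    return {
        taxon: sum(ordered_steps(s, a) for s, a in zip(states, ancestor))
        for taxon, states in matrix.items()
    }


# Toy matrix: three molecules scored for three structural characters.
matrix = {
    "mol1": [5, 2, 0],
    "mol2": [7, 2, 1],
    "mol3": [4, 3, 3],
}
poles = ["max", "max", "min"]  # assumed polarity per character

ancestor = hypothetical_ancestor(matrix, poles)
steps = steps_to_ancestor(matrix, ancestor)
print(ancestor, steps)
```

In a full analysis the hypothetical ancestor would be attached to the most parsimonious unrooted tree at the position that adds the fewest steps, which is the role the Lundberg method plays in the text above.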

An analysis of the structure of RNase PRNAalso providessimilar conclusions [135] While a ToL reconstructed frommolecular structure placed type A archaeal molecules at itsbase (a topology that resembles the ToL of 5S rRNA) atree of RNase P RNA substructures uncovered the history ofmolecular accretion of the RNA component of the ancientendonuclease and revealed a remarkable reductive evolution-ary trend (Figure 9(b)) Molecules originated in stem P12and were immediately accessorized with the catalytic P1ndashP4 catalytic pseudoknotted core structure that interacts withRNase P proteins of the endonuclease complex and ancientsegments of tRNA Soon after this important accretion stagethe evolving molecule loses its first stem in Archaea (stemP8) several accretion steps earlier than the first loss of a stemin Eukarya or the first appearance of a Bacteria-specific stemThese phylogenetic statements provide additional strongsupport to the early origin of the archaeal superkingdomprior to the divergence of the shared common ancestor ofBacteria and Eukarya As we will discuss below the earlyloss of a structure in the molecular accretion process of acentral and ancient RNA family is significant It suggests

Archaea 17

100

100100 Var

DHUAC

AccEukarya

100

100100 Var

DHUAC

AccBacteria

100

100100 Var

DHU

ACAcc

ArchaeaAC

AccDHU

Ancestral Derived0 105

Age (nd)

3998400endTΨCTΨC

TΨC

TΨC

Var

(a)

P2

P3

P1 P9 P12

P7

P51

P151P15

P4

P5

P8P101

P18

0 101 02 03 04 05 06 07 08 09

P101P151P51P20P15-16

P162P16-17

P161P19

P14P18

P13P17

P16P6

P15P5

P7P8

P9P10-11

P2P4

P3P1

P12

Ancestral Derived

First stem lost inArchaea (type M)

Age (nd)

Universalpseudoknot

Type A pseudoknot(first loss in Eukarya)

100

100

100

89

75

51

66

74

51

93

95

75

75

9084

First stem specificto bacteria (type A)

Ancientuniversalstems

Universalstems

First stems specificto Archaea (type A)

Bacterialtype B-specific

P19

100 changes

(b)

Figure 9The history of accretion of tRNA and RNase P RNA substructures reveals the early evolutionary appearance of Archaea (a) Rootedtrees of tRNA arm substructures reveal the early appearance of the acceptor arm (Acc) followed by the anticodon arm (AC) in Archaeaor the pseudouridine (TΨC) arm in both Bacteria and Eukarya (from [108]) The result confirms the sister group relationship of Bacteriaand Eukarya (b) Trees of molecular substructures of RNase P RNAs were reconstructed from characters describing the geometry of theirstructures (from [135]) Branches and corresponding substructures in a 3D atomicmodel are colored according to the age of each substructure(nd node distance) Note the early loss of stem P8 in Archaea immediately after the evolutionary assembly of the universal functional coreof the molecule


that the emerging archaeal lineages were subjected to strong reductive evolutionary pressures during the early evolution of a very ancient RNA molecule.

6.3. Phylogenomic Evidence from Protein Domain Structure. G. Caetano-Anollés and D. Caetano-Anollés [41] were the first to utilize protein domain structures as taxa and to reconstruct trees of domains (ToDs) describing their evolution. Figure 10 shows a rooted ToD built from a census of FSF structures in 981 genomes (data taken from [122, 123]). These trees are unique in that their terminal leaves represent a finite set of component parts [87]. These parts describe at global level the structural diversity of the protein world. When building trees of FSFs, the age of each FSF domain structure can be calculated from the ToDs by simply calculating a distance in nodes between the root of the tree and its corresponding terminal leaf. This is possible because ToDs exhibit highly unbalanced (pectinate) topologies that are the result of semipunctuated processes of domain appearance and accumulation. This node distance (nd), rescaled from 0 to 1, provides a relative timescale to study the order of FSF appearance in evolutionary history [126, 143]. Wang et al. [144] showed that nd correlates linearly with geological time and defines a global molecular clock of protein folds. Thus, nd can be used as a reliable proxy for time. Plotting the age of FSFs in each of the seven taxonomic groups confirmed evolutionary statements we had previously deduced from the Venn diagrams of Figure 6. The ABE taxonomic group included the majority of ancient and widely distributed FSFs. This is expected. In the presence of a strong vertical trace, molecular diversity must delimit a nested taxonomic hierarchy. The ABE group was followed by the evolutionary appearance of the BE group, which preceded the first domain-specific structures, which were Bacteria-specific (B). Remarkably, Archaea-specific (A) and Eukarya-specific (E) structures appeared concurrently and relatively late. These general trends, captured in the box plots of Figure 10, have been recovered repeatedly when studying domain structures at various levels of structural complexity, from folds to fold families [91, 126], when using CATH or SCOP structural definitions [122, 145], or when exploring the evolution of terminal GO terms.
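The node distance described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the cited studies: the tree is a hypothetical five-leaf pectinate ToD written as nested tuples, the leaf names (`fsf_1` to `fsf_5`) are invented, and the rescaling to the 0 to 1 interval is a simple min-max rescaling of the node counts.

```python
from typing import Dict, Optional, Tuple, Union

# A tree is either a leaf name (str) or a tuple of child subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def leaf_depths(tree: Tree, depth: int = 0,
                out: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Count the nodes crossed between the root and each terminal leaf."""
    if out is None:
        out = {}
    if isinstance(tree, str):
        out[tree] = depth
    else:
        for child in tree:
            leaf_depths(child, depth + 1, out)
    return out


def node_distance(tree: Tree) -> Dict[str, float]:
    """Rescale the leaf depths to the 0 (most basal) to 1 (most derived) range."""
    depths = leaf_depths(tree)
    lo, hi = min(depths.values()), max(depths.values())
    span = hi - lo
    return {leaf: (d - lo) / span if span else 0.0 for leaf, d in depths.items()}


# Hypothetical pectinate (comb-like) tree of domains; names are illustrative only.
toy_tod: Tree = ("fsf_1", ("fsf_2", ("fsf_3", ("fsf_4", "fsf_5"))))
nd = node_distance(toy_tod)
```

In this toy tree the most basal leaf (`fsf_1`) receives nd = 0 and the two most derived leaves receive nd = 1, which mirrors how a highly unbalanced topology turns branching order into a relative timeline.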

While the early rise of BE FSFs supports the early divergence of Archaea from the urancestor, the very significant trend of gradual loss of structures occurring in the lineages of the archaeal domain and the very late appearance of Archaea-specific structures (e.g., [126]) demand explanation. Since ancient BE FSFs are widely distributed in proteomes (Figure 10), they cannot arise from separate gains of FSFs in Bacteria and Eukarya or by processes of horizontal spread of structures. This was already evident from the Venn diagrams of Figure 6. Moreover, the BE sisterhood to the exclusion of Archaea was further supported by the inspection of FSFs involved in lipid synthesis and transport (Table 1). Membrane lipids are very relevant to the origins of diversified life ([146] and references therein). Bacteria and Eukarya encode similar lipid membranes, while archaeal membranes have different lipid composition (isoprenoid ethers). To check if

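[Figure 10 image not reproduced. Rooted tree of domains with the most basal FSFs labeled (c.55.1, c.37.1, b.40.4, c.2.1, a.4.5, and c.66.1) and box plots of the age of FSFs (nd, 0.0 to 1.0) marking the origin of FSF domains in the taxonomic groups ABE, BE, AB, AE, B, A, and E. Taxa = 1733 FSF domains; characters = 981 genomes; tree length = 458226 steps; CI = 0.049, RI = 0.796, and g1 = −0.08. Legend, number of FSFs by distribution index f: 0.0 to 0.1 (529); 0.1 to 0.5 (623); 0.5 to 0.9 (336); 0.9 to 1.0 (228); f = 1.0 (17).]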

Figure 10: Phylogenomic tree of domains (ToD) describing the evolution of 1733 FSF domain structures. Taxa (FSFs) were colored according to their distribution (f) in the 981 genomes that were surveyed and used as characters to reconstruct the phylogenomic tree (data taken from [122, 123]). The most basal FSFs are labeled with SCOP alphanumeric identifiers (e.g., c.37.1 is the P-loop containing nucleoside triphosphate hydrolase FSF). Boxplots display the age (nd value) distribution of FSFs for the seven possible taxonomic groups. nd values were calculated directly from the tree [122] and define a timeline of FSF innovation from the origin of proteins (nd = 0) to the present (nd = 1). The group of FSFs that are shared by the three domains of life (ABE) is the most ancient taxonomic group, which spans the entire time axis, and their FSFs are widely distributed in genomes. The appearance of the BE group coincides with the first reductive loss of an FSF in Archaea. FSF structures specific to domains of life appear much later in evolution.

lipid synthesis was another BE synapomorphy, we identified 17 FSFs that were involved in lipid metabolism and transport


Table 1: List of FSFs involved in lipid metabolism and transport along with taxonomic distribution (data taken from [122, 123]).

Group  SCOP Id  FSF Id    FSF description
ABE    89392    b.125.1   Prokaryotic lipoproteins and lipoprotein localization factors
ABE    53092    c.55.2    Creatinase/prolidase N-terminal domain
ABE    49723    b.12.1    Lipase/lipooxygenase domain (PLAT/LH2 domain)
ABE    54637    d.38.1    Thioesterase/thiol ester dehydrase-isomerase
ABE    63825    b.68.5    YWTD domain
BE     47027    a.11.1    Acyl-CoA binding protein
BE     48431    a.118.4   Lipovitellin-phosvitin complex, superhelical domain
BE     55048    d.58.23   Probable ACP-binding domain of malonyl-CoA ACP transacylase
BE     56968    f.7.1     Lipovitellin-phosvitin complex, beta-sheet shell regions
BE     58113    h.5.1     Apolipoprotein A-I
BE     47162    a.24.1    Apolipoprotein
BE     56931    f.4.2     Outer membrane phospholipase A (OMPLA)
B      82220    b.120.1   Tp47 lipoprotein, N-terminal domain
E      47699    a.52.1    Bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin
E      82936    h.6.1     Apolipoprotein A-II
E      57190    g.3.10    Colipase-like
E      49594    b.7.4     Rab geranylgeranyltransferase alpha-subunit, insert domain

(Table 1). Remarkably, the largest share of these FSFs (7 out of 17) was unique to the BE group. In comparison, none were present in either the AE or AB groups. The ABE group included five universal FSFs, while one was unique to Bacteria and four were Eukarya-specific. In turn, no FSF was unique to Archaea. The BE FSFs cannot be explained by modern effects impinging on variations in proteomic accumulation in FSFs or by processes of domain rearrangement, since these appear in the protein world quite late in evolution [92]. The only and most parsimonious explanation of the patterns of FSF distribution that unfold in the ToD is the very early (and protracted) rise of the archaeal domain by processes of reductive evolution, possibly triggered by the adaptation of urancestral lineages to harsh environments and survival modes. Under extremophilic conditions, typical of hyperthermophilic environments, considerable investments of matter-energy and information must be made for protein persistence [120]. This puts limits on viable protein structures [147]. Extremophilic environments will thus favor the maintenance of a limited set of FSFs for persistence of emergent diversified lineages. This would induce a primordial episode of reductive evolution in the growing FSF repertoire, explaining why hyperthermophilic and thermophilic archaeal species hold the most reduced proteomes (Figure 7). It would also explain the biases that exist in FSFs, GO terms, and RNA families (Figure 6) and the placement of hyperthermophilic and thermophilic archaeal species at the base of ToLs. Since Archaea populate the oceans and sometimes rival Bacteria in number in those environments, we further interpret the late appearance of Archaea-specific FSFs as the result of late colonization of these mild environments by both ancient archaeons and emerging eukaryotes. This relaxes primordial extremophilic pressures on protein structures and enables the late archaeal exploration of structural flexibility and functional novelty.
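The counts quoted in this paragraph can be checked directly against Table 1. The short Python sketch below transcribes the group assignment of the 17 lipid-related FSFs (SCOP identifiers written in dotted form) and tallies them; the variable names are illustrative choices and are not part of the original analysis.

```python
from collections import Counter

# Taxonomic group of each lipid-related FSF, transcribed from Table 1.
TABLE_1_GROUPS = {
    "b.125.1": "ABE", "c.55.2": "ABE", "b.12.1": "ABE", "d.38.1": "ABE",
    "b.68.5": "ABE",
    "a.11.1": "BE", "a.118.4": "BE", "d.58.23": "BE", "f.7.1": "BE",
    "h.5.1": "BE", "a.24.1": "BE", "f.4.2": "BE",
    "b.120.1": "B",
    "a.52.1": "E", "h.6.1": "E", "g.3.10": "E", "b.7.4": "E",
}

# Number of FSFs per taxonomic group (groups with no FSF count as zero).
group_counts = Counter(TABLE_1_GROUPS.values())

# FSFs found in Archaea are those whose group label contains the letter "A".
in_archaea = sum(n for group, n in group_counts.items() if "A" in group)
absent_from_archaea = sum(group_counts.values()) - in_archaea
```

The tally gives 5 ABE, 7 BE, 1 B, and 4 E structures and none in the A, AB, or AE groups, so 12 of the 17 FSFs are absent from Archaea while every FSF present in Archaea is universal.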

6.4. Full Circle: Evidence from Trees of Proteomes and Functionomes and a Tree Derived from the Distribution of Viral Replicons in Superkingdoms. While we distrust trees of systems, especially ToLs built from sequences, the use of molecular structure at high levels of structural abstraction has the potential to mitigate some limitations of sequence analysis [93]. For example, rooted ToLs built from abundance counts of domain structures and terminal GO terms in the genomes of free-living organisms describe the evolution of proteomes (e.g., [91]). All ToL reconstructions of these kinds approximate the physiology of living organisms, dissect the three primary domains of life, and reveal the early paraphyletic origin of extremophilic archaeal lineages, followed by the late appearances of monophyletic Bacteria and Eukarya. These patterns have been reliably recovered with datasets of varying sizes, irrespective of the structural classification scheme [91, 122, 126, 145]. Even a tree reconstructed from the distribution of 2662 viral replicons in superkingdoms from an exhaustive comparative genomic analysis of viral genomes showed the basal placement of Archaea and the sister taxa relationship between Bacteria and Eukarya (Figure 11).
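The fit statistics reported with the replicon tree of Figure 11 (CI, RI, HI, and RC) follow the standard ensemble definitions used in parsimony analysis. The Python sketch below shows how the four indices relate to one another; it is an illustration only. The function name and the maximum tree length of 90 steps are hypothetical, since the source reports only the observed length (58 steps) and the resulting indices.

```python
from typing import NamedTuple


class TreeFit(NamedTuple):
    ci: float  # consistency index
    ri: float  # retention index
    hi: float  # homoplasy index
    rc: float  # rescaled consistency index


def tree_fit(min_steps: int, observed_steps: int, max_steps: int) -> TreeFit:
    """Compute ensemble fit indices from minimum, observed, and maximum tree length."""
    if not 0 < min_steps <= observed_steps <= max_steps:
        raise ValueError("expected 0 < min_steps <= observed_steps <= max_steps")
    if max_steps == min_steps:
        raise ValueError("RI is undefined when max_steps equals min_steps")
    ci = min_steps / observed_steps
    ri = (max_steps - observed_steps) / (max_steps - min_steps)
    return TreeFit(ci=ci, ri=ri, hi=1.0 - ci, rc=ci * ri)


# A tree of 58 steps with CI = 1 has an observed length equal to the minimum length.
# max_steps=90 is a made-up value; any maximum above 58 gives the same indices.
no_homoplasy = tree_fit(min_steps=58, observed_steps=58, max_steps=90)

# Hypothetical contrasting case in which half of the changes are homoplastic.
with_homoplasy = tree_fit(min_steps=10, observed_steps=20, max_steps=30)
```

When the observed length equals the minimum possible length, the indices reduce to CI = 1, RI = 1, HI = 0, and RC = 1, which is the combination reported for the replicon tree.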

6.5. Additional Evidence from Comparative Genomics. The uneven distribution of protein domain structures in the world


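[Figure 11 image not reproduced. Rooted tree of Archaea, Bacteria, and Eukarya with a bootstrap support value of 100 and a scale bar of 5 changes; the abundance vectors printed at the internal nodes read 5, 0, 0, 0, 0, 0 and 36, 5, 5, 5, 0, 0.]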

Figure 11: One of three optimal phylogenetic tree reconstructions with identical topologies recovered from an exhaustive maximum parsimony search (58 steps; CI = 1, RI = 1, HI = 0, RC = 1, g1 = −0.707) of the abundance of viral replicon types of dsDNA, ssDNA, dsRNA, ssRNA(+), ssRNA(−), and retrotranscribing viruses. Abundance was scored on a 0–20 scale and ranged from 0 to 759 viral replicons. Vectors of abundance reconstructions in internal nodes are given as percentage of total abundance of replicons in superkingdoms. Bootstrap support values are shown below nodes.

of proteomes (Figure 6) is preserved as we climb in the structural hierarchy. This was recently made evident when studying the evolution of CATH domains [145]. The Venn diagrams of Figure 12 show how domain structures in all taxonomic groups decrease in numbers with increases in evolutionary conservation. At the highest CATH architectural level, there were 32 universal architectures but no domain-specific architectures. The four architectures shared by Bacteria and Eukarya were present in at least 60% of the proteomes that were surveyed. The other two interdomain architectures were topological designs of considerable complexity that were poorly shared between proteomes and were evolutionarily derived. They were the clam architectures, lost in Eukarya, and the box architectures of nucleotide excision repair, shared by Archaea and Eukarya. The most parsimonious corollary of these distribution patterns is that the BE taxonomical group must arise by loss of structures in Archaea. Indeed, ToDs describing the evolution of CATH domain structures again confirm the early appearance of BE structures and, consequently, their loss in Archaea.
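The two quantities used throughout this comparison, the Venn taxonomic group of a structure and its distribution across proteomes, can both be derived from a presence/absence census. The Python sketch below illustrates the bookkeeping with invented data: the genome names, domain names, and occurrences are hypothetical and do not come from the CATH or SCOP censuses cited above.

```python
from typing import Dict, Set, Tuple

# Hypothetical census: superkingdom of each genome (A, B, or E).
GENOME_KINGDOM: Dict[str, str] = {
    "arch_1": "A", "arch_2": "A",
    "bact_1": "B", "bact_2": "B",
    "euk_1": "E",
}

# Hypothetical census: genomes in which each domain structure was detected.
DOMAIN_GENOMES: Dict[str, Set[str]] = {
    "dom_universal": {"arch_1", "arch_2", "bact_1", "bact_2", "euk_1"},
    "dom_be": {"bact_1", "bact_2", "euk_1"},
    "dom_bacterial": {"bact_1"},
}


def classify(domain: str) -> Tuple[str, float]:
    """Return the Venn group (e.g., 'BE') and the fraction of genomes holding the domain."""
    genomes = DOMAIN_GENOMES[domain]
    kingdoms = {GENOME_KINGDOM[g] for g in genomes}
    group = "".join(k for k in "ABE" if k in kingdoms)
    f = len(genomes) / len(GENOME_KINGDOM)
    return group, f


summary = {domain: classify(domain) for domain in DOMAIN_GENOMES}
```

In this toy census `dom_be` falls in the BE group with f = 0.6, the kind of widely shared interdomain structure that the text interprets as a loss in Archaea rather than as separate gains in Bacteria and Eukarya.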

7. Paraphyletic Origins: Grades and Clades in Archaeal History

Saying that “a ToL is rooted in a domain of life” is an incorrect statement that comes from phylogenetic methodology (the use of outgroups) and the tendency to look at the past with modern eyes. Clades in ToLs have been rooted relative to each other by generating unrooted trees and by defining extant organisms as outgroup taxa. The ToL, however, must be considered rooted in the urancestor of cellular life (Figure 1). This planted edge that connects to the ingroup of extant organisms represents a cellular state in which productive diversification (in the sense of successful lineages) was absent. The primordial urancestral edge leads to a “phase transition,” the last universal cellular ancestor (LUCA), of which little is known. The physiology of the urancestor cannot be considered linked to that of any extant organism, even if it shared a common molecular core with all of them. The urancestor was not an archaeon, a bacterium, or a eukaryote [148]. It was not necessarily thermophilic. Perhaps it was a communal entity or a megametaorganism in the sense of a modern syncytium (the result of multiple cell fusions) and a modern coenocyte (the result of multiple cell divisions). The organismal boundaries were likely present, judging by the number of widely distributed protein domains that associate with membranes and appear at the base of our ToDs and by the universal existence of acidocalcisome organelles [149]. However, the molecular makeup of the urancestral cells was most likely fluid and quasistatistical, with the repertoire distributed unequally in the urancestral populations of communal parts, of course within confines delimited by persistence. This urancestral population is therefore consistent with the idea of a primordial stem line proposed by Kandler and Woese [150, 151]. However, it was relatively richer in molecular structures and functions, as opposed to the simple cellular systems hypothesized by Woese. This richness is confirmed by modern analyses of proteomes and functionomes that reveal a vast number of universally shared protein domains and GOs among the three superkingdoms. While each syncytial/coenocyte element of the megaorganism exchanged component parts in search of cellular stability and persistence, the process could not be equated with modern HGT. The exchanging community of primordial cells was not cohesive enough to make the horizontal exchange meaningful. Macromolecules most likely established loose and diverse associations with each other and with smaller molecules, limited by the short average life of their unevolved structural conformations. With time, molecules with better-optimized properties engaged in more durable interactions, stabilizing the emergent cells and providing increased cellular cohesiveness. This poised the urancestral community towards a phase transition (a crystallization [148]), a point in which cellular groups had distinct properties and could be individuated. We believe this was the time of the origin of the archaeal lineage, 2.9 billion years ago [48].

At the base of the ToLs that were reconstructed from genomic data, basal archaeal taxa arise as paraphyletic lineages (Figures 4(a) and 4(c)). These lineages likely arose from subgroups of the urancestral population that pervasively lost crucial domain structures and molecular functions. This represents an evolutionary grade under the scenario described above. The emerging lineages shared with the urancestral community a unifying condition that was related to archaic biochemistry. In other words, the urancestral and emerging archaeal lineages expressed fundamental structural and functional equivalences in terms of their repertoires but revealed in each emerging and durable paraphyletic lineage a handful of distinct, newly developed traits. These traits could be global, such as increased thermostability of some crucial members of the protein repertoire or change in the membrane makeup, or local, such as the selective loss of crucial structures and functions. Figure 13 uses the tree paradigm to portray the structural and functional equivalences of the urancestral and emerging archaeal lineages and the slow progression from grades to clades.

[Figure 12 image not reproduced. Three Venn diagrams (superfamilies, topologies, and architectures, ordered by increasing evolutionary conservation) showing how CATH domains are shared among Archaea, Bacteria, and Eukarya.]

Figure 12: Venn diagrams displaying the distributions of 2221 homologous superfamilies, 1152 topologies, and 38 architectures of CATH domains in the proteomes of 492 fully sequenced genomes (from [145]). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

[Figure 13 image not reproduced. Cartoon of lineages progressing from grades to clades between Time 0 and Time 1.]

Figure 13: From grades to clades. The cartoon describes a possible progression of modes of organismal diversification during the rise of primordial archaeal lineages. The width of emerging lineages is proportional to uniquely identifying features of physiological and molecular complexity.

8. Through the Wormhole: The Makeup of the “Megaorganism” and the Emerging Archaeal Lineages

Character state reconstructions of proteome repertoires derived from ToLs, coupled to the timelines of ToDs, provide an effective way to define the ancestral protein domain complement of the urancestor [48] and, consequently, the likely makeup of the emerging archaeal lineages. The urancestral proteome possessed a lower bound of ~70 FSF domain structures, 75% of which were composed of α/β and α+β proteins. About 50% of FSFs were part of metabolic enzymes, including a rich toolkit of transferases and enzymes of nucleotide metabolism. The rest of the domains were involved in functions related to information (translation, replication, and repair), intracellular processes (transport, protein modification, and proteolytic activities), regulation (kinases/phosphatases and DNA binding functions), and small molecule binding. The urancestor had a limited repertoire of aaRSs and translation factors. It contained a primordial ribosome with a limited core of universal ribosomal proteins. It had numerous membrane proteins necessary for transport, including a relatively advanced ATP synthetase complex, and structures necessary for cellular organization (filaments and primordial cytoskeletal structures). The cells lacked enzymes for deoxyribonucleotide production, so it is likely that the cellular urancestor itself did not harbor a DNA genome. The cells lacked functions related to extracellular processes (cell adhesion, immune response, and toxins/defense) and cellular motility, suggesting an ancient living world without competitive strategies of survival.
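To give a sense of what a character state reconstruction on a rooted tree involves, the Python sketch below runs the bottom-up pass of Fitch parsimony on a single presence/absence character. Fitch parsimony is a textbook technique swapped in here for illustration; it is not the polarized, abundance-based reconstruction used in [48]. The three-taxon tree with Archaea at the base and the single BE character are simplifying assumptions.

```python
from typing import Dict, Set, Tuple, Union

# A rooted binary tree is either a leaf name or a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch(tree: Tree, states: Dict[str, int]) -> Tuple[Set[int], int]:
    """Bottom-up Fitch pass: return the candidate states at the node and the step count."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_states, left_steps = fitch(left, states)
    right_states, right_steps = fitch(right, states)
    shared = left_states & right_states
    if shared:
        # The children agree on at least one state, so no change is needed here.
        return shared, left_steps + right_steps
    # The children disagree: keep both options and count one change.
    return left_states | right_states, left_steps + right_steps + 1


# Assumed topology: Archaea basal, Bacteria and Eukarya as sister groups.
tol: Tree = ("Archaea", ("Bacteria", "Eukarya"))

# Hypothetical BE character: structure present (1) in Bacteria and Eukarya, absent (0) in Archaea.
be_character = {"Archaea": 0, "Bacteria": 1, "Eukarya": 1}

root_states, steps = fitch(tol, be_character)
```

The pass needs a single change but leaves both states possible at the root, so an unordered and unpolarized character cannot distinguish a gain in the BE ancestor from a loss in Archaea. That ambiguity is why the reconstructions described above depend on polarized characters and on the timelines provided by ToDs.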

9. Conclusions

The rooting of the ToL has always been controversial in evolutionary biology [26, 152, 153]. While it is popularly accepted that the ToL based on sequence phylogenies is rooted in the akaryotes and that Archaea and Eukarya are sister groups to each other, only two of the three main steps of phylogenetic analysis [104] have been partially fulfilled with sequences. These are the selection of an appropriate statistical or nonstatistical evolutionary model of character change and of an optimization method for phylogenetic tree reconstruction. However, no adequate method exists for character polarization that identifies ancestral and derived character states in sequences. In the absence of robust polarization methodology, any statement about the rooting of the ToL should be considered suspect or subject to apriorism. Here we show that information derived from a genomic structural and functional census of millions of encoded proteins and RNAs, coupled with process models that comply with Weston’s generality criterion, provides the means to dissect the origins of diversified life. The generality criterion is fulfilled in these studies by focusing on the accumulation of modules such as protein domain structures, elements of RNA substructures, or ontological definitions of molecular function. In general, these features are the subject of accretion processes that comply with additive phylogenetic change within the nested taxonomic hierarchy and result in changes of abundance. These processes include those responsible for the growth of molecules (e.g., multidomain proteins), molecular ensembles (e.g., the ribosome), and molecular repertoires (e.g., proteomes). The new methods unfold a consistent evolutionary scenario in which the origin of diversified life traces back to the early history of Archaea. Remarkably, the archaic origin of this microbial urkingdom now does justice to its name.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgments

The authors would like to thank members and friends of the GCA laboratory for fruitful discussions. This research has been supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791) and the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625) to Gustavo Caetano-Anollés and from the KRIBB Research Initiative Program and the Next-Generation BioGreen 21 Program, Rural Development Administration (PJ0090192013), to Kyung Mo Kim.

References

[1] C. R. Woese, “A new biology for a new century,” Microbiology and Molecular Biology Reviews, vol. 68, no. 2, pp. 173–186, 2004.

[2] T. Grant, “Testing methods: the evaluation of discovery operations in evolutionary biology,” Cladistics, vol. 18, no. 1, pp. 94–111, 2002.

[3] E. O. Wiley, “Karl R. Popper, systematics, and classification: a reply to Walter Bock and other evolutionary taxonomists,” Systematic Zoology, vol. 24, no. 2, pp. 233–243, 1975.

[4] J. S. Farris, “Parsimony and explanatory power,” Cladistics, vol. 24, no. 5, pp. 825–847, 2008.

[5] G. Caetano-Anollés, K. M. Kim, and D. Caetano-Anollés, “The phylogenomic roots of modern biochemistry: origins of proteins, cofactors and protein biosynthesis,” Journal of Molecular Evolution, vol. 74, no. 1-2, pp. 1–34, 2012.

[6] W. Fontana, “Modeling “evo-devo” with RNA,” BioEssays, vol. 24, no. 12, pp. 1164–1177, 2002.

[7] W. Hennig, Phylogenetic Systematics, University of Illinois Press, Urbana, Ill, USA, 1999.

[8] D. Fraix-Burnet, T. Chattopadhyay, A. K. Chattopadhyay, E. Davoust, and M. Thuillard, “A six-parameter space to describe galaxy diversification,” Astronomy and Astrophysics, vol. 545, article A80, 24 pages, 2012.

[9] E. Sober, “The contest between parsimony and likelihood,” Systematic Biology, vol. 53, no. 4, pp. 644–653, 2004.

[10] E. K. Lienau and R. DeSalle, “Is the microbial tree of life verificationist?” Cladistics, vol. 26, no. 2, pp. 195–201, 2010.

[11] J. S. Huxley, “Evolutionary processes and taxonomy with special reference to grades,” Uppsala Universitets Årsskrift, vol. 6, pp. 21–39, 1958.

[12] C. R. Woese and G. E. Fox, “Phylogenetic structure of the prokaryotic domain: the primary kingdoms,” Proceedings of the National Academy of Sciences of the United States of America, vol. 74, no. 11, pp. 5088–5090, 1977.

[13] P. Forterre, “The universal tree of life and the last universal cellular ancestor: revolution and counterrevolutions,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 43–62, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[14] J. P. Gogarten, H. Kibak, P. Dittrich et al., “Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 17, pp. 6661–6665, 1989.

[15] N. Iwabe, K. Kuma, M. Hasegawa, S. Osawa, and T. Miyata, “Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 23, pp. 9355–9359, 1989.

[16] J. P. Gogarten and L. Olendzenski, “Orthologs, paralogs and genome comparisons,” Current Opinion in Genetics and Development, vol. 9, no. 6, pp. 630–636, 1999.

[17] C. R. Woese, O. Kandler, and M. L. Wheelis, “Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya,” Proceedings of the National Academy of Sciences of the United States of America, vol. 87, no. 12, pp. 4576–4579, 1990.

[18] J. Martijn and T. J. Ettema, “From archaeon to eukaryote: the evolutionary dark ages of the eukaryotic cell,” Biochemical Society Transactions, vol. 41, no. 1, pp. 451–457, 2013.

[19] C. J. Cox, P. G. Foster, R. P. Hirt, S. R. Harris, and T. M. Embley, “The archaebacterial origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 51, pp. 20356–20361, 2008.

[20] P. G. Foster, C. J. Cox, and T. M. Embley, “The primary divisions of life: a phylogenomic approach employing composition-heterogeneous methods,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2197–2207, 2009.

[21] T. A. Williams, P. G. Foster, T. M. Nye, C. J. Cox, and T. M. Embley, “A congruent phylogenomic signal places eukaryotes within the Archaea,” Proceedings of the Royal Society B: Biological Sciences, vol. 279, no. 1749, pp. 4870–4879, 2012.

[22] S. Gribaldo, A. M. Poole, V. Daubin, P. Forterre, and C. Brochier-Armanet, “The origin of eukaryotes and their relationship with the Archaea: are we at a phylogenomic impasse?” Nature Reviews Microbiology, vol. 8, no. 10, pp. 743–752, 2010.

[23] H. Philippe and P. Forterre, “The rooting of the universal tree of life is not reliable,” Journal of Molecular Evolution, vol. 49, no. 4, pp. 509–523, 1999.

[24] E. Sober and M. Steel, “Testing the hypothesis of common ancestry,” Journal of Theoretical Biology, vol. 218, no. 4, pp. 395–408, 2002.

[25] P. Forterre, “The common ancestor of Archaea and Eukarya was not an archaeon,” Archaea, vol. 2013, Article ID 372396, 18 pages, 2013.

[26] P. Forterre and H. Philippe, “Where is the root of the universal tree of life?” BioEssays, vol. 21, no. 10, pp. 871–879, 1999.

[27] S. Gribaldo and H. Philippe, “Ancient phylogenetic relationships,” Theoretical Population Biology, vol. 61, no. 4, pp. 391–408, 2002.

[28] J. K. Harris, S. T. Kelley, G. B. Spiegelman, and N. R. Pace, “The genetic core of the universal ancestor,” Genome Research, vol. 13, no. 3, pp. 407–412, 2003.

[29] A. Harish and G. Caetano-Anollés, “Ribosomal history reveals origins of modern protein synthesis,” PLoS ONE, vol. 7, no. 3, Article ID e32776, 2012.

[30] G. Caetano-Anollés and M. J. Seufferheld, “The coevolutionary roots of biochemistry and cellular organization challenge the RNA world paradigm,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 152–177, 2013.

[31] N. R. Pace, “Mapping the tree of life: progress and prospects,” Microbiology and Molecular Biology Reviews, vol. 73, no. 4, pp. 565–576, 2009.

[32] P. de Rijk, Y. van de Peer, I. van den Broeck, and R. de Wachter, “Evolution according to large ribosomal subunit RNA,” Journal of Molecular Evolution, vol. 41, no. 3, pp. 366–375, 1995.

[33] G. Caetano-Anollés, “Tracing the evolution of RNA structure in ribosomes,” Nucleic Acids Research, vol. 30, no. 11, pp. 2575–2587, 2002.


[34] J. Mallatt and C. J. Winchell, “Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes,” Molecular Biology and Evolution, vol. 19, no. 3, pp. 289–301, 2002.

[35] M. Gerstein, “Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census,” Proteins: Structure, Function, and Bioinformatics, vol. 33, no. 4, pp. 518–534, 1998.

[36] M. Gerstein and H. Hegyi, “Comparing genomes in terms of protein structure: surveys of a finite parts list,” FEMS Microbiology Reviews, vol. 22, no. 4, pp. 277–304, 1998.

[37] B. Snel, P. Bork, and M. A. Huynen, “Genome phylogeny based on gene content,” Nature Genetics, vol. 21, no. 1, pp. 108–110, 1999.

[38] F. Tekaia, A. Lazcano, and B. Dujon, “The genomic tree as revealed from whole proteome comparisons,” Genome Research, vol. 9, no. 6, pp. 550–557, 1999.

[39] Y. I. Wolf, S. E. Brenner, P. A. Bash, and E. V. Koonin, “Distribution of protein folds in the three superkingdoms of life,” Genome Research, vol. 9, no. 1, pp. 17–26, 1999.

[40] J. O. Korbel, B. Snel, M. A. Huynen, and P. Bork, “SHOT: a web server for the construction of genome phylogenies,” Trends in Genetics, vol. 18, no. 3, pp. 158–162, 2002.

[41] G. Caetano-Anollés and D. Caetano-Anollés, “An evolutionarily structural universe of protein architecture,” Genome Research, vol. 13, no. 7, pp. 1563–1571, 2003.

[42] S. L. Baldauf, A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle, “A kingdom-level phylogeny of eukaryotes based on combined protein data,” Science, vol. 290, no. 5493, pp. 972–977, 2000.

[43] J. R. Brown, C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope, “Universal trees based on large combined protein sequence data sets,” Nature Genetics, vol. 28, no. 3, pp. 281–285, 2001.

[44] F. D. Ciccarelli, T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P. Bork, “Toward automatic reconstruction of a highly resolved tree of life,” Science, vol. 311, no. 5765, pp. 1283–1287, 2006.

[45] T. Dagan, M. Roettger, D. Bryant, and W. Martin, “Genome networks root the tree of life between prokaryotic domains,” Genome Biology and Evolution, vol. 2, no. 1, pp. 379–392, 2010.

[46] S. Jun, G. E. Sims, G. A. Wu, and S. Kim, “Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 133–138, 2010.

[47] E. A. Schultes, P. T. Hraber, and T. H. LaBean, “No molecule is an island: molecular evolution and the study of sequence space,” in Algorithmic Bioprocesses, A. Condon, D. Harel, J. N. Kok, A. Salomaa, and E. Winfree, Eds., pp. 675–704, Springer, Berlin, Germany, 2009.

[48] K. M. Kim and G. Caetano-Anollés, “The proteomic complexity and rise of the primordial ancestor of diversified life,” BMC Evolutionary Biology, vol. 11, no. 1, article 140, 2011.

[49] V. A. Albert, Parsimony, Phylogeny, and Genomics, Oxford University Press, Oxford, UK, 2005.

[50] G. Caetano-Anollés, “Evolved RNA secondary structure and the rooting of the universal tree of life,” Journal of Molecular Evolution, vol. 54, no. 3, pp. 333–345, 2002.

[51] J. A. Lake, R. G. Skophammer, C. W. Herbold, and J. A. Servin, “Genome beginnings: rooting the tree of life,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2177–2185, 2009.

[52] B. Guo, M. Zou, and A. Wagner, “Pervasive indels and their evolutionary dynamics after the fish-specific genome duplication,” Molecular Biology and Evolution, vol. 29, no. 10, pp. 3005–3022, 2012.

[53] M. K. Basu, I. B. Rogozin, O. Deusch, T. Dagan, W. Martin, and E. V. Koonin, “Evolutionary dynamics of introns in plastid-derived genes in plants: saturation nearly reached but slow intron gain continues,” Molecular Biology and Evolution, vol. 25, no. 1, pp. 111–119, 2008.

[54] J. S. Farris, “Phylogenetic analysis under Dollo’s Law,” Systematic Biology, vol. 26, no. 1, pp. 77–88, 1977.

[55] H. Fang, M. E. Oates, R. B. Pethica et al., “A daily-updated tree of (sequenced) life as a reference for genome research,” Scientific Reports, vol. 3, article 2015, 2013.

[56] T. Cavalier-Smith, “The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification,” International Journal of Systematic and Evolutionary Microbiology, vol. 52, no. 1, pp. 7–76, 2002.

[57] T. Cavalier-Smith, “Rooting the tree of life by transition analyses,” Biology Direct, vol. 1, article 19, 2006.

[58] H. S. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Widespread recruitment of ancient domain structures in modern enzymes during metabolic evolution,” Journal of Integrative Bioinformatics, vol. 10, no. 1, article 214, 2013.

[59] G. Caetano-Anollés, M. Wang, and D. Caetano-Anollés, “Structural phylogenomics retrodicts the origin of the genetic code and uncovers the evolutionary impact of protein flexibility,” PLoS ONE, vol. 8, no. 8, Article ID e72225, 2013.

[60] E. Bapteste and C. Brochier, “On the conceptual difficulties in rooting the tree of life,” Trends in Microbiology, vol. 12, no. 1, pp. 9–13, 2004.

[61] A. M. Poole, “Horizontal gene transfer and the earliest stages of the evolution of life,” Research in Microbiology, vol. 160, no. 7, pp. 473–480, 2009.

[62] D. Raoult, “The post-Darwinist rhizome of life,” The Lancet, vol. 375, no. 9709, pp. 104–105, 2010.

[63] W. F. Doolittle, “Phylogenetic classification and the universal tree,” Science, vol. 284, no. 5423, pp. 2124–2128, 1999.

[64] T. Dagan and W. Martin, “The tree of one percent,” Genome Biology, vol. 7, no. 10, article 118, 2006.

[65] W. F. Doolittle and E. Bapteste, “Pattern pluralism and the tree of life hypothesis,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 7, pp. 2043–2049, 2007.

[66] E. V. Koonin, “Towards a postmodern synthesis of evolutionary biology,” Cell Cycle, vol. 8, no. 6, pp. 799–800, 2009.

[67] T. Kloesges, O. Popa, W. Martin, and T. Dagan, “Networks of gene sharing among 329 proteobacterial genomes reveal differences in lateral gene transfer frequency at different phylogenetic depths,” Molecular Biology and Evolution, vol. 28, no. 2, pp. 1057–1074, 2011.

[68] S. Halary, J. W. Leigh, B. Cheaib, P. Lopez, and E. Bapteste, “Network analyses structure genetic diversity in independent genetic worlds,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 127–132, 2010.

[69] S. S. Abby, E. Tannier, M. Gouy, and V. Daubin, “Lateral gene transfer as a support for the tree of life,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 13, pp. 4962–4967, 2012.

[70] N Glansdorff Y Xu and B Labedan ldquoThe conflict betweenhorizontal gene transfer and the safeguard of identity origin of

24 Archaea

meiotic sexualityrdquo Journal of Molecular Evolution vol 69 no 5pp 470ndash480 2009

[71] C. G. Kurland and O. G. Berg, “A hitchhiker’s guide to evolving networks,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 361–396, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[72] L. P. Villarreal and G. Witzany, “The DNA habitat and its RNA inhabitants: at the dawn of RNA sociology,” Genomics Insights, vol. 6, pp. 1–12, 2013.

[73] S. Gribaldo and C. Brochier, “Phylogeny of prokaryotes: does it exist and why should we care?” Research in Microbiology, vol. 160, no. 7, pp. 513–521, 2009.

[74] B. D. Mishler, “The logic of the data matrix in phylogenetic analysis,” in Parsimony, Phylogeny and Genomics, V. A. Albert, Ed., pp. 57–69, Oxford University Press, New York, NY, USA, 2005.

[75] A. Dress, V. Moulton, M. Steel, and T. Wu, “Species, clusters and the “tree of life”: a graph-theoretic perspective,” Journal of Theoretical Biology, vol. 265, no. 4, pp. 535–542, 2010.

[76] N. Lane and W. Martin, “The energetics of genome complexity,” Nature, vol. 467, no. 7318, pp. 928–934, 2010.

[77] M. C. Rivera and J. A. Lake, “The ring of life provides evidence for a genome fusion origin of eukaryotes,” Nature, vol. 431, no. 7005, pp. 152–155, 2004.

[78] D. Alvarez-Ponce and J. O. McInerney, “The human genome retains relics of its prokaryotic ancestry: human genes of archaebacterial and eubacterial origin exhibit remarkable differences,” Genome Biology and Evolution, vol. 3, no. 1, pp. 782–790, 2011.

[79] A. Poole and D. Penny, “Eukaryote evolution: engulfed by speculation,” Nature, vol. 447, no. 7147, p. 913, 2007.

[80] E. V. Koonin and Y. I. Wolf, “The fundamental units, processes and patterns of evolution, and the tree of life conundrum,” Biology Direct, vol. 4, no. 1, article 33, 2009.

[81] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Search for a “tree of life” in the thicket of the phylogenetic forest,” Journal of Biology, vol. 8, no. 6, article 59, 2009.

[82] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Seeing the tree of life behind the phylogenetic forest,” BMC Biology, vol. 11, article 46, 2013.

[83] D. B. Wetlaufer, “Nucleation, rapid folding, and globular intrachain regions in proteins,” Proceedings of the National Academy of Sciences of the United States of America, vol. 70, no. 3, pp. 697–701, 1973.

[84] J. S. Richardson, “The anatomy and taxonomy of protein structure,” Advances in Protein Chemistry, vol. 34, pp. 167–339, 1981.

[85] M. Riley and B. Labedan, “Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module,” Journal of Molecular Biology, vol. 268, no. 5, pp. 857–868, 1997.

[86] J. Janin and S. J. Wodak, “Structural domains in proteins and their role in the dynamics of protein function,” Progress in Biophysics and Molecular Biology, vol. 42, pp. 21–78, 1983.

[87] G. Caetano-Anollés, M. Wang, D. Caetano-Anollés, and J. E. Mittenthal, “The origin, evolution and structure of the protein world,” Biochemical Journal, vol. 417, no. 3, pp. 621–637, 2009.

[88] J. Gough, “Convergent evolution of domain architectures (is rare),” Bioinformatics, vol. 21, no. 8, pp. 1464–1471, 2005.

[89] K. Forslund, A. Henricson, V. Hollich, and E. L. L. Sonnhammer, “Domain tree-based analysis of protein architecture evolution,” Molecular Biology and Evolution, vol. 25, no. 2, pp. 254–264, 2008.

[90] S. Yang and P. E. Bourne, “The evolutionary history of protein domains viewed by species phylogeny,” PLoS ONE, vol. 4, no. 12, Article ID e8378, 2009.

[91] K. M. Kim and G. Caetano-Anollés, “The evolutionary history of protein fold families and proteomes confirms that the archaeal ancestor is more ancient than the ancestors of other superkingdoms,” BMC Evolutionary Biology, vol. 12, article 13, 2012.

[92] M. Wang and G. Caetano-Anollés, “The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world,” Structure, vol. 17, no. 1, pp. 66–78, 2009.

[93] G. Caetano-Anollés and A. Nasir, “Benefits of using molecular structure and abundance in phylogenomic analysis,” Frontiers in Genetics, vol. 3, article 172, 2012.

[94] K. Brading and E. Castellani, Symmetries in Physics: Philosophical Reflections, Cambridge University Press, Cambridge, UK, 2003.

[95] J. A. Cotton and M. Wilkinson, “Quantifying the potential utility of phylogenetic characters,” Taxon, vol. 57, no. 1, pp. 131–136, 2008.

[96] E. Mossel and M. Steel, “A phase transition for a random cluster model on phylogenetic trees,” Mathematical Biosciences, vol. 187, no. 2, pp. 189–203, 2004.

[97] M. H. Bailor, X. Sun, and H. M. Al-Hashimi, “Topology links RNA secondary structure with global conformation, dynamics, and adaptation,” Science, vol. 327, no. 5962, pp. 202–206, 2010.

[98] K. Illergård, D. H. Ardell, and A. Elofsson, “Structure is three to ten times more conserved than sequence – a study of structural response in protein cores,” Proteins: Structure, Function and Bioinformatics, vol. 77, no. 3, pp. 499–508, 2009.

[99] G. Caetano-Anollés, S. K. Hee, and J. E. Mittenthal, “The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 22, pp. 9358–9363, 2007.

[100] P. H. Weston, “Methods for rooting cladistic trees,” in Models in Phylogeny Reconstruction, D. J. Siebert, R. W. Scotland, and D. M. Williams, Eds., pp. 125–155, Oxford University Press, Oxford, UK, 1994.

[101] H. N. Bryant, “The polarization of character transformations in phylogenetic systematics: role of axiomatic and auxiliary assumptions,” Systematic Biology, vol. 40, no. 4, pp. 433–445, 1991.

[102] H. N. Bryant and G. Wagner, “Character polarity and the rooting of cladograms,” in The Character Concept in Evolutionary Biology, G. P. Wagner, Ed., pp. 319–338, Academic Press, New York, NY, USA, 2001.

[103] G. Nelson, “Ontogeny, phylogeny, paleontology, and the biogenetic law,” Systematic Biology, vol. 27, no. 3, pp. 324–345, 1978.

[104] P. H. Weston, “Indirect and direct methods in systematics,” in Ontogeny and Systematics, C. J. Humphries, Ed., pp. 27–56, Columbia University Press, New York, NY, USA, 1988.

[105] F.-J. Sun and G. Caetano-Anollés, “The evolutionary history of the structure of 5S ribosomal RNA,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 430–443, 2009.

[106] X. Fang, T. Pan, and T. R. Sosnick, “Mg2+-dependent folding of a large ribozyme without kinetic traps,” Nature Structural Biology, vol. 6, no. 12, pp. 1091–1095, 1999.

[107] F.-J. Sun and G. Caetano-Anollés, “Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses,” PLoS Computational Biology, vol. 4, no. 3, Article ID e1000018, 2008.

[108] F.-J. Sun and G. Caetano-Anollés, “The origin and evolution of tRNA inferred from phylogenetic analysis of structure,” Journal of Molecular Evolution, vol. 66, no. 1, pp. 21–35, 2008.

[109] W. R. Ashby, An Introduction to Cybernetics, Taylor & Francis, London, UK, 1955.

[110] J. G. Lundberg, “Wagner networks and ancestors,” Systematic Biology, vol. 21, no. 4, pp. 398–413, 1972.

[111] P. T. Chippindale and J. J. Wiens, “Weighting, partitioning, and combining characters in phylogenetic analysis,” Systematic Biology, vol. 43, no. 2, pp. 278–287, 1994.

[112] J. Mittenthal, D. Caetano-Anollés, and G. Caetano-Anollés, “Biphasic patterns of diversification and the emergence of modules,” Frontiers in Genetics, vol. 3, article 147, 2012.

[113] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[114] C. A. Orengo, A. Michie, S. Jones, D. T. Jones, M. Swindells, and J. M. Thornton, “CATH – a hierarchic classification of protein domain structures,” Structure, vol. 5, no. 8, pp. 1093–1109, 1997.

[115] G. Sawa, J. Dicks, and I. N. Roberts, “Current approaches to whole genome phylogenetic analysis,” Briefings in Bioinformatics, vol. 4, no. 1, pp. 63–74, 2003.

[116] R. Sathyapriya, J. M. Duarte, H. Stehr, I. Filippis, and M. Lappe, “Defining an essence of structure determining residue contacts in proteins,” PLoS Computational Biology, vol. 5, no. 12, Article ID e1000584, 2009.

[117] D. Penny and L. J. Collins, “Evolutionary genomics leads the way,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 1–16, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[118] M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.

[119] A. Andreeva, D. Howorth, J. Chandonia et al., “Data growth and its impact on the SCOP database: new developments,” Nucleic Acids Research, vol. 36, pp. D419–D425, 2008.

[120] L. S. Yafremava, M. Wielgos, S. Thomas et al., “A general framework of persistence strategies for biological systems helps explain domains of life,” Frontiers in Genetics, vol. 4, article 16, 2013.

[121] J. L. Friar, T. Goldman, and J. Pérez-Mercader, “Genome sizes and the Benford distribution,” PLoS ONE, vol. 7, no. 5, Article ID e36624, 2012.

[122] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya,” BMC Evolutionary Biology, vol. 12, article 156, 2012.

[123] A. Nasir and G. Caetano-Anollés, “Comparative analysis of proteomes and functionomes provides insights into origins of cellular diversification,” Archaea, vol. 2013, Article ID 648746, 13 pages, 2013.

[124] W. Martin and M. Müller, “The hydrogen hypothesis for the first eukaryote,” Nature, vol. 392, no. 6671, pp. 37–41, 1998.

[125] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Global patterns of protein domain gain and loss in superkingdoms,” PLoS Computational Biology, vol. 10, no. 1, Article ID e1003452, 2014.

[126] M. Wang, L. S. Yafremava, D. Caetano-Anollés, J. E. Mittenthal, and G. Caetano-Anollés, “Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world,” Genome Research, vol. 17, no. 11, pp. 1572–1585, 2007.

[127] L. Guy and T. J. G. Ettema, “The archaeal “TACK” superphylum and the origin of eukaryotes,” Trends in Microbiology, vol. 19, no. 12, pp. 580–587, 2011.

[128] D. Schlieper, M. A. Oliva, J. M. Andreu, and J. Löwe, “Structure of bacterial tubulin BtubA/B: evidence for horizontal gene transfer,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 26, pp. 9170–9175, 2005.

[129] M. P. Hoeppner, P. P. Gardner, and A. M. Poole, “Comparative analysis of RNA families reveals distinct repertoires for each domain of life,” PLoS Computational Biology, vol. 8, no. 11, Article ID e1002752, 2012.

[130] H. Xue, K. Tong, C. Marck, H. Grosjean, and J. T. Wong, “Transfer RNA paralogs: evidence for genetic code-amino acid biosynthesis coevolution and an archaeal root of life,” Gene, vol. 310, no. 1-2, pp. 59–66, 2003.

[131] H. Xue, S. Ng, K. Tong, and J. T. Wong, “Congruence of evidence for a Methanopyrus-proximal root of life based on transfer RNA and aminoacyl-tRNA synthetase genes,” Gene, vol. 360, no. 2, pp. 120–130, 2005.

[132] M. Di Giulio, “Nanoarchaeum equitans is a living fossil,” Journal of Theoretical Biology, vol. 242, no. 1, pp. 257–260, 2006.

[133] M. Di Giulio, “The tree of life might be rooted in the branch leading to Nanoarchaeota,” Gene, vol. 401, no. 1-2, pp. 108–113, 2007.

[134] J. T. Wong, J. Chen, W. Mat, S. Ng, and H. Xue, “Polyphasic evidence delineating the root of life and roots of biological domains,” Gene, vol. 403, no. 1-2, pp. 39–52, 2007.

[135] F.-J. Sun and G. Caetano-Anollés, “The ancient history of the structure of ribonuclease P and the early origins of Archaea,” BMC Bioinformatics, vol. 11, article 153, 2010.

[136] C. Marck and H. Grosjean, “tRNomics: analysis of tRNA genes from 50 genomes of Eukarya, Archaea, and bacteria reveals anticodon-sparing strategies and domain-specific features,” RNA, vol. 8, no. 10, pp. 1189–1232, 2002.

[137] W. Mat, H. Xue, and J. T. Wong, “The genomics of LUCA,” Frontiers in Bioscience, vol. 13, no. 14, pp. 5605–5613, 2008.

[138] C. Brochier, S. Gribaldo, Y. Zivanovic, F. Confalonieri, and P. Forterre, “Nanoarchaea: representatives of a novel archaeal phylum or a fast-evolving euryarchaeal lineage related to Thermococcales?” Genome Biology, vol. 6, no. 5, article R42, 2005.

[139] B. Billoud, M. Guerrucci, M. Masselot, and J. S. Deutsch, “Cirripede phylogeny using a novel approach: molecular morphometrics,” Molecular Biology and Evolution, vol. 17, no. 10, pp. 1435–1445, 2000.

[140] L. J. Collins, V. Moulton, and D. Penny, “Use of RNA secondary structure for studying the evolution of RNase P and RNase MRP,” Journal of Molecular Evolution, vol. 51, no. 3, pp. 194–204, 2000.

[141] G. Caetano-Anollés, “Novel strategies to study the role of mutation and nucleic acid structure in evolution,” Plant Cell, Tissue and Organ Culture, vol. 67, no. 2, pp. 115–132, 2001.

[142] F.-J. Sun, S. Fleurdepine, C. Bousquet-Antonelli, G. Caetano-Anollés, and J. Deragon, “Common evolutionary trends for SINE RNA structures,” Trends in Genetics, vol. 23, no. 1, pp. 26–33, 2007.

[143] D. Caetano-Anollés, K. M. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Proteome evolution and the metabolic origins of translation and cellular life,” Journal of Molecular Evolution, vol. 72, no. 1, pp. 14–33, 2011.

[144] M. Wang, Y. Jiang, K. M. Kim et al., “A universal molecular clock of protein folds and its power in tracing the early history of aerobic metabolism and planet oxygenation,” Molecular Biology and Evolution, vol. 28, no. 1, pp. 567–582, 2011.

[145] S. A. Bukhari and G. Caetano-Anollés, “Origin and evolution of protein fold designs inferred from phylogenomic analysis of CATH domain structures in proteomes,” PLoS Computational Biology, vol. 9, no. 3, Article ID e1003009, 2013.

[146] M. Vesteg and J. Krajčovič, “The falsifiability of the models for the origin of eukaryotes,” Current Genetics, vol. 57, no. 6, pp. 367–390, 2011.

[147] I. N. Berezovsky and E. I. Shakhnovich, “Physics and evolution of thermophilic adaptation,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 36, pp. 12742–12747, 2005.

[148] C. Woese, “The universal ancestor,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 12, pp. 6854–6859, 1998.

[149] M. J. Seufferheld and G. Caetano-Anollés, “Phylogenomics supports a cellularly structured urancestor,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 178–191, 2013.

[150] O. Kandler, “Cell wall biochemistry and three-domain concept of life,” Systematic and Applied Microbiology, vol. 16, no. 4, pp. 501–509, 1994.

[151] C. R. Woese, “On the evolution of cells,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 13, pp. 8742–8747, 2002.

[152] S. L. Baldauf, J. D. Palmer, and W. F. Doolittle, “The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny,” Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 15, pp. 7749–7754, 1996.

[153] H. Leffers, J. Kjems, L. Østergaard, N. Larsen, and R. A. Garrett, “Evolutionary relationships amongst archaebacteria. A comparative study of 23 S ribosomal RNAs of a sulphur-dependent extreme thermophile, an extreme halophile and a thermophilic methanogen,” Journal of Molecular Biology, vol. 195, no. 1, pp. 43–61, 1987.

[154] J. Martin, D. Blackburn, and E. Wiley, “Are node-based and stem-based clades equivalent? Insights from graph theory,” PLoS Currents, vol. 2, Article ID RRN1196, 2010.

Figure 8: Ancient phylogenetic signal in the sequence of tRNA and their associated aminoacyl-tRNA synthetase (aaRS) enzymes. (a) Unrooted ToL derived from tRNA sequences with their alloacceptor D_allo distances traced in thermal scale (from [130]). D_allo is the average of pairwise distances for 190 pairs of tRNA isoacceptors. These distances measure pairwise sequence mismatches of tRNAs for every genome, and their values increase for faster evolving sequences of species of more recent origin. (b) Coevolution of aaRSs and their corresponding tRNA. Genetic distances between the top 10 potentially paralogous aaRS pairs, estimated using BLASTP, define a measure (Q_ars) of how closely the proteins resemble each other in genomes (from [131]). Larger Q_ars scores imply more ancestral and slowly evolving protein pairs. The plot of Q_ars scores against D_allo distances reveals a hidden correlation between the evolution of tRNA and aaRSs and the early origin of Archaea (larger Q_ars and lower D_allo distances).
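
The averaging behind a distance such as D_allo can be illustrated with a minimal Python sketch. It assumes that each tRNA group is represented by a single pre-aligned sequence and that distance is the plain fraction of mismatched positions; both are simplifications introduced here, not the procedure of [130]. The function names and the toy sequences are hypothetical. The figure of 190 pairs is consistent with the number of unordered pairs among 20 amino acid specificities, which is an interpretation rather than something stated in the text.

```python
from itertools import combinations


def mismatch_fraction(a: str, b: str) -> float:
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def average_pairwise_distance(seqs: dict[str, str]) -> float:
    """Mean mismatch fraction over all unordered pairs of sequences in one genome."""
    pairs = list(combinations(sorted(seqs), 2))
    return sum(mismatch_fraction(seqs[p], seqs[q]) for p, q in pairs) / len(pairs)


# Toy, made-up sequences for three tRNA groups (hypothetical data).
toy = {"Ala": "GGGCA", "Gly": "GGGCU", "Ser": "GAGCU"}
print(average_pairwise_distance(toy))

# Twenty groups give 190 unordered pairs.
print(len(list(combinations(range(20), 2))))
```

Under this simplified reading, a genome whose tRNAs have diverged more from one another yields a larger average, which is the direction of change the caption associates with species of more recent origin.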

The application of RNA structural evidence in phylogenetic studies [50, 139–141] has multiple advantages over sequence data when studying ancient events, especially because RNA structures are far more conserved than sequences. This has been demonstrated in a phylogenetic approach that uses RNA structural information to reconstruct the evolutionary history of macromolecules such as rRNA [29, 50], tRNA [107, 108], 5S rRNA [105], RNase P RNA [135], and SINE RNA [142]. Geometrical and statistical properties of structure (e.g., stems or loops commonly found in the secondary structures of RNA molecules) are treated as linearly ordered multistate phylogenetic characters. In order to build rooted trees, an evolutionary tendency toward conformational order is used to polarize change in character state transformations. This defines a hypothetical ancestor with which to root the ingroup using the Lundberg method. Reconstructed phylogenies produce trees of molecules and ToLs (e.g., Figure 4(a)) or trees of substructures that describe the gradual evolutionary accretion of structural components into molecules (Figure 9). For example, phylogenetic trees of tRNA substructures define explicit models of molecular history and show that tRNA originated in the acceptor stem of the molecule [108]. Remarkably, trees reconstructed from tRNA drawn from individual domains of life demonstrate that the sequence of accretion events occurred differently in Archaea than in Bacteria and Eukarya, suggesting a sister group evolutionary relationship between the bacterial and eukaryotic domains (Figure 9(a)). A similar result obtained from trees of 5S rRNA substructures revealed different molecular accretion sequences of archaeal molecules when these were compared to bacterial and eukaryal counterparts [105], confirming again, in a completely different molecular system, the history of the domains of life.

An analysis of the structure of RNase P RNA also provides similar conclusions [135]. While a ToL reconstructed from molecular structure placed type A archaeal molecules at its base (a topology that resembles the ToL of 5S rRNA), a tree of RNase P RNA substructures uncovered the history of molecular accretion of the RNA component of the ancient endonuclease and revealed a remarkable reductive evolutionary trend (Figure 9(b)). Molecules originated in stem P12 and were immediately accessorized with the catalytic P1–P4 pseudoknotted core structure that interacts with RNase P proteins of the endonuclease complex and ancient segments of tRNA. Soon after this important accretion stage, the evolving molecule lost its first stem in Archaea (stem P8), several accretion steps earlier than the first loss of a stem in Eukarya or the first appearance of a Bacteria-specific stem. These phylogenetic statements provide additional strong support to the early origin of the archaeal superkingdom prior to the divergence of the shared common ancestor of Bacteria and Eukarya. As we will discuss below, the early loss of a structure in the molecular accretion process of a central and ancient RNA family is significant. It suggests

Figure 9: The history of accretion of tRNA and RNase P RNA substructures reveals the early evolutionary appearance of Archaea. (a) Rooted trees of tRNA arm substructures reveal the early appearance of the acceptor arm (Acc) followed by the anticodon arm (AC) in Archaea or the pseudouridine (TΨC) arm in both Bacteria and Eukarya (from [108]). The result confirms the sister group relationship of Bacteria and Eukarya. (b) Trees of molecular substructures of RNase P RNAs were reconstructed from characters describing the geometry of their structures (from [135]). Branches and corresponding substructures in a 3D atomic model are colored according to the age of each substructure (nd, node distance). Note the early loss of stem P8 in Archaea immediately after the evolutionary assembly of the universal functional core of the molecule.

that the emerging archaeal lineages were subjected to strong reductive evolutionary pressures during the early evolution of a very ancient RNA molecule.

6.3. Phylogenomic Evidence from Protein Domain Structure. G. Caetano-Anollés and D. Caetano-Anollés [41] were the first to utilize protein domain structures as taxa and to reconstruct trees of domains (ToDs) describing their evolution. Figure 10 shows a rooted ToD built from a census of FSF structures in 981 genomes (data taken from [122, 123]). These trees are unique in that their terminal leaves represent a finite set of component parts [87]. These parts describe at global level the structural diversity of the protein world. When building trees of FSFs, the age of each FSF domain structure can be calculated from the ToDs by simply counting the distance in nodes between the root of the tree and its corresponding terminal leaf. This is possible because ToDs exhibit highly unbalanced (pectinate) topologies that are the result of semipunctuated processes of domain appearance and accumulation. This node distance (nd), rescaled from 0 to 1, provides a relative timescale to study the order of FSF appearance in evolutionary history [126, 143]. Wang et al. [144] showed that nd correlates linearly with geological time and defines a global molecular clock of protein folds. Thus, nd can be used as a reliable proxy for time. Plotting the age of FSFs in each of the seven taxonomic groups confirmed evolutionary statements we had previously deduced from the Venn diagrams of Figure 6. The ABE taxonomic group included the majority of ancient and widely distributed FSFs. This is expected. In the presence of a strong vertical trace, molecular diversity must delimit a nested taxonomic hierarchy. The ABE group was followed by the evolutionary appearance of the BE group, which preceded the first domain-specific structures, which were Bacteria-specific (B). Remarkably, Archaea-specific (A) and Eukarya-specific (E) structures appeared concurrently and relatively late. These general trends, captured in the box plots of Figure 10, have been recovered repeatedly when studying domain structures at various levels of structural complexity, from folds to fold families [91, 126], when using CATH or SCOP structural definitions [122, 145], or when exploring the evolution of terminal GO terms.
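
The node distance described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the tree is given as a hypothetical parent-to-children dictionary, raw distance is the number of edges walked from the root to each leaf, and rescaling to the 0 to 1 range is done by min-max normalization (one plausible reading of "rescaled from 0 to 1"). The function name and the leaf labels are made up for the example.

```python
def node_distances(children: dict[str, list[str]], root: str) -> dict[str, float]:
    """Return a rescaled node distance (nd) for every leaf of a rooted tree.

    Raw distance is the number of edges from the root to the leaf; values are
    then min-max rescaled so the most basal leaf gets 0 and the most derived 1.
    """
    raw: dict[str, int] = {}
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        kids = children.get(node, [])
        if not kids:
            raw[node] = depth  # a leaf: record how deep it sits
        else:
            stack.extend((kid, depth + 1) for kid in kids)
    lo, hi = min(raw.values()), max(raw.values())
    span = hi - lo
    return {leaf: ((d - lo) / span if span else 0.0) for leaf, d in raw.items()}


# A small pectinate (comb-like) toy tree with four hypothetical FSF leaves.
toy_tree = {
    "root": ["fsf_1", "n1"],
    "n1": ["fsf_2", "n2"],
    "n2": ["fsf_3", "fsf_4"],
}
print(node_distances(toy_tree, "root"))
```

On a strongly pectinate tree each successive leaf sits one node deeper than the last, so the rescaled values spread across the whole interval; on a balanced tree most leaves would share the same depth and the measure would carry little ordering information.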

While the early rise of BE FSFs supports the early divergence of Archaea from the urancestor, the very significant trend of gradual loss of structures occurring in the lineages of the archaeal domain and the very late appearance of Archaea-specific structures (e.g., [126]) demand explanation. Since ancient BE FSFs are widely distributed in proteomes (Figure 10), they cannot arise from separate gains of FSFs in Bacteria and Eukarya or by processes of horizontal spread of structures. This was already evident from the Venn diagrams of Figure 6. Moreover, the BE sisterhood to the exclusion of Archaea was further supported by the inspection of FSFs involved in lipid synthesis and transport (Table 1). Membrane lipids are very relevant to the origins of diversified life ([146] and references therein). Bacteria and Eukarya encode similar lipid membranes, while archaeal membranes have a different lipid composition (isoprenoid ethers). To check if

Figure 10: Phylogenomic tree of domains (ToD) describing the evolution of 1733 FSF domain structures. Taxa (FSFs) were colored according to their distribution (f) in the 981 genomes that were surveyed and used as characters to reconstruct the phylogenomic tree (data taken from [122, 123]). The most basal FSFs are labeled with SCOP alphanumeric identifiers (e.g., c.37.1 is the P-loop containing nucleoside triphosphate hydrolase FSF). Boxplots display the age (nd value) distribution of FSFs for the seven possible taxonomic groups. nd values were calculated directly from the tree [122] and define a timeline of FSF innovation from the origin of proteins (nd = 0) to the present (nd = 1). The group of FSFs that are shared by the three domains of life (ABE) is the most ancient taxonomic group, which spans the entire time axis, and their FSFs are widely distributed in genomes. The appearance of the BE group coincides with the first reductive loss of an FSF in Archaea. FSF structures specific to domains of life appear much later in evolution. Tree statistics shown in the figure: taxa = 1733 FSF domains; characters = 981 genomes; tree length = 458226 steps; CI = 0.049, RI = 0.796, and g1 = −0.08. Number of FSFs in each distribution class of the legend: f of 0.0–0.1 (529), 0.1–0.5 (623), 0.5–0.9 (336), 0.9–1.0 (228), and f = 1.0 (17).

lipid synthesis was another BE synapomorphy, we identified 17 FSFs that were involved in lipid metabolism and transport

Table 1: List of FSFs involved in lipid metabolism and transport along with taxonomic distribution (data taken from [122, 123]).

Group  SCOP Id  FSF Id   FSF description
ABE    89392    b.125.1  Prokaryotic lipoproteins and lipoprotein localization factors
ABE    53092    c.55.2   Creatinase/prolidase N-terminal domain
ABE    49723    b.12.1   Lipase/lipooxygenase domain (PLAT/LH2 domain)
ABE    54637    d.38.1   Thioesterase/thiol ester dehydrase-isomerase
ABE    63825    b.68.5   YWTD domain
BE     47027    a.11.1   Acyl-CoA binding protein
BE     48431    a.118.4  Lipovitellin-phosvitin complex, superhelical domain
BE     55048    d.58.23  Probable ACP-binding domain of malonyl-CoA ACP transacylase
BE     56968    f.7.1    Lipovitellin-phosvitin complex, beta-sheet shell regions
BE     58113    h.5.1    Apolipoprotein A-I
BE     47162    a.24.1   Apolipoprotein
BE     56931    f.4.2    Outer membrane phospholipase A (OMPLA)
B      82220    b.120.1  Tp47 lipoprotein, N-terminal domain
E      47699    a.52.1   Bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin
E      82936    h.6.1    Apolipoprotein A-II
E      57190    g.3.10   Colipase-like
E      49594    b.7.4    Rab geranylgeranyltransferase alpha-subunit, insert domain

(Table 1). Remarkably, the largest share of these FSFs (7 out of 17) was unique to the BE group. In comparison, none were present in either the AE or AB groups. The ABE group included five universal FSFs, while one was unique to Bacteria and four were Eukarya-specific. In turn, no FSF was unique to Archaea. The BE FSFs cannot be explained by modern effects impinging on variations in proteomic accumulation in FSFs or by processes of domain rearrangement, since these appear in the protein world quite late in evolution [92]. The only, and most parsimonious, explanation of the patterns of FSF distribution that unfold in the ToD is the very early (and protracted) rise of the archaeal domain by processes of reductive evolution, possibly triggered by the adaptation of urancestral lineages to harsh environments and survival modes. Under extremophilic conditions typical of hyperthermophilic environments, considerable investments of matter-energy and information must be made for protein persistence [120]. This puts limits on viable protein structures [147]. Extremophilic environments will thus favor the maintenance of a limited set of FSFs for persistence of emergent diversified lineages. This would induce a primordial episode of reductive evolution in the growing FSF repertoire, explaining why hyperthermophilic and thermophilic archaeal species hold the most reduced proteomes (Figure 7). It would also explain the biases that exist in FSFs, GO terms, and RNA families (Figure 6) and the placement of hyperthermophilic and thermophilic archaeal species at the base of ToLs. Since Archaea populate the oceans and sometimes rival Bacteria in number in those environments, we further interpret the late appearance of Archaea-specific FSFs as the result of late colonization of these mild environments by both ancient archaeons and emerging eukaryotes. This relaxes primordial extremophilic pressures on protein structures and enables the late archaeal exploration of structural flexibility and functional novelty.
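
The tally of Table 1 quoted in this paragraph can be checked with a short Python sketch. The group labels are copied from Table 1; the FSF identifiers are written in dotted SCOP notation, and the names `TABLE_1`, `ALL_GROUPS`, and `tally_groups` are hypothetical helpers introduced only for this illustration.

```python
from collections import Counter

# Taxonomic group of each lipid-related FSF, copied from Table 1.
TABLE_1 = {
    "b.125.1": "ABE", "c.55.2": "ABE", "b.12.1": "ABE", "d.38.1": "ABE",
    "b.68.5": "ABE",
    "a.11.1": "BE", "a.118.4": "BE", "d.58.23": "BE", "f.7.1": "BE",
    "h.5.1": "BE", "a.24.1": "BE", "f.4.2": "BE",
    "b.120.1": "B",
    "a.52.1": "E", "h.6.1": "E", "g.3.10": "E", "b.7.4": "E",
}

# The seven possible taxonomic groups for three superkingdoms (A, B, E).
ALL_GROUPS = ["ABE", "BE", "AB", "AE", "A", "B", "E"]


def tally_groups(group_by_fsf: dict[str, str]) -> dict[str, int]:
    """Count FSFs per taxonomic group, reporting empty groups as zero."""
    counts = Counter(group_by_fsf.values())
    return {group: counts.get(group, 0) for group in ALL_GROUPS}


counts = tally_groups(TABLE_1)
for group in ALL_GROUPS:
    print(f"{group:>3}: {counts[group]}")
print("total:", sum(counts.values()))
```

The counts reproduce the figures given in the text: 7 BE, 5 ABE, 1 B, and 4 E structures out of 17, with no FSF in the AB, AE, or A groups.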

6.4. Full Circle: Evidence from Trees of Proteomes and Functionomes and a Tree Derived from the Distribution of Viral Replicons in Superkingdoms. While we distrust trees of systems, especially ToLs built from sequences, the use of molecular structure at high levels of structural abstraction has the potential to mitigate some limitations of sequence analysis [93]. For example, rooted ToLs built from abundance counts of domain structures and terminal GO terms in the genomes of free-living organisms describe the evolution of proteomes (e.g., [91]). All ToL reconstructions of these kinds approximate the physiology of living organisms, dissect the three primary domains of life, and reveal the early paraphyletic origin of extremophilic archaeal lineages followed by the late appearances of monophyletic Bacteria and Eukarya. These patterns have been reliably recovered with datasets of varying sizes, irrespective of the structural classification scheme [91, 122, 126, 145]. Even a tree reconstructed from the distribution of 2662 viral replicons in superkingdoms, taken from an exhaustive comparative genomic analysis of viral genomes, showed the basal placement of Archaea and the sister taxa relationship between Bacteria and Eukarya (Figure 11).

6.5. Additional Evidence from Comparative Genomics. The uneven distribution of protein domain structures in the world of proteomes (Figure 6) is preserved as we climb in the structural hierarchy. This was recently made evident when studying the evolution of CATH domains [145]. The Venn diagrams of Figure 12 show how domain structures in all taxonomic groups decrease in numbers with increases in evolutionary conservation. At the highest CATH architectural level, there were 32 universal architectures but no domain-specific architectures. The four architectures shared by Bacteria and Eukarya were present in at least 60% of the proteomes that were surveyed. The other two interdomain architectures were topological designs of considerable complexity that were poorly shared between proteomes and were evolutionarily derived. They were the clam architectures lost in Eukarya and the box architectures of nucleotide excision repair shared by Archaea and Eukarya. The most parsimonious corollary of these distribution patterns is that the BE taxonomic group must arise by loss of structures in Archaea. Indeed, ToDs describing the evolution of CATH domain structures again confirm the early appearance of BE structures and, consequently, their loss in Archaea.

[Figure 11 image: tree with terminal taxa Archaea, Bacteria, and Eukarya; scale bar, 5 changes.]

Figure 11: One of three optimal phylogenetic tree reconstructions with identical topologies recovered from an exhaustive maximum parsimony search (58 steps; CI = 1, RI = 1, HI = 0, RC = 1; g1 = −0.707) of the abundance of viral replicon types of dsDNA, ssDNA, dsRNA, ssRNA(+), ssRNA(−), and retrotranscribing viruses. Abundance was scored on a 0–20 scale and ranged from 0 to 759 viral replicons. Vectors of abundance reconstructions in internal nodes are given as percentage of total abundance of replicons in superkingdoms. Bootstrap support values are shown below nodes.
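The caption of Figure 11 states that replicon abundance, which ranged from 0 to 759, was scored on a 0–20 scale before parsimony analysis. The sketch below shows one way such a rescaling could be done. The logarithmic transformation used here is an assumption made for illustration (this section does not state which transformation was applied), and the function name `rescale_abundance` and the example counts are hypothetical.

```python
import math


def rescale_abundance(count: int, max_count: int, n_states: int = 21) -> int:
    """Map a raw abundance count onto integer character states 0..n_states-1.

    Assumption: a log transform, state = round(top * ln(count + 1) / ln(max + 1)),
    which compresses large counts so that rare and abundant types stay comparable.
    """
    if max_count <= 0 or not (0 <= count <= max_count):
        raise ValueError("expected 0 <= count <= max_count and max_count > 0")
    top_state = n_states - 1
    return round(top_state * math.log(count + 1) / math.log(max_count + 1))


MAX_REPLICONS = 759  # upper end of the range given in the Figure 11 caption

# Hypothetical raw counts, used only to show how the scale behaves.
for raw in (0, 1, 10, 100, 759):
    print(raw, "->", rescale_abundance(raw, MAX_REPLICONS))
```

With 21 states, the smallest count maps to state 0 and the largest to state 20, so the resulting ordered multistate characters span the 0–20 range mentioned in the caption.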

7. Paraphyletic Origins, Grades, and Clades in Archaeal History

Saying that “a ToL is rooted in a domain of life” is an incorrect statement that comes from phylogenetic methodology (the use of outgroups) and the tendency to look at the past with modern eyes. Clades in ToLs have been rooted relative to each other by generating unrooted trees and by defining extant organisms as outgroup taxa. The ToL, however, must be considered rooted in the urancestor of cellular life (Figure 1). This planted edge that connects to the ingroup of extant organisms represents a cellular state in which productive diversification (in the sense of successful lineages) was absent. The primordial urancestral edge leads to a “phase transition,” the last universal cellular ancestor (LUCA), of which little is known. The physiology of the urancestor cannot be considered linked to that of any extant organism, even if it shared a common molecular core with all of them. The urancestor was not an archaeon, a bacterium, or a eukaryote [148]. It was not necessarily thermophilic. Perhaps it was a communal entity or a megametaorganism in the sense of a modern syncytium (the result of multiple cell fusions) and a modern coenocyte (the result of multiple cell divisions). The organismal boundaries were likely present, judging by the number of widely distributed protein domains that associate with membranes and appear at the base of our ToDs and by the universal existence of acidocalcisome organelles [149]. However, the molecular makeup of the urancestral cells was most likely fluid and quasistatistical, with the repertoire distributed unequally in the urancestral populations of communal parts, of course within confines delimited by persistence. This urancestral population is therefore consistent with the idea of a primordial stem line proposed by Kandler and Woese [150, 151]. However, it was relatively richer in molecular structures and functions, as opposed to the simple cellular systems hypothesized by Woese. This richness is confirmed by modern analyses of proteomes and functionomes that reveal a vast number of universally shared protein domains and GOs among the three superkingdoms. While each syncytial/coenocyte element of the megaorganism exchanged component parts in search of cellular stability and persistence, the process could not be equated with modern HGT. The exchanging community of primordial cells was not cohesive enough to make the horizontal exchange meaningful. Macromolecules most likely established loose and diverse associations with each other and with smaller molecules, limited by the short average life of their unevolved structural conformations. With time, molecules with better-optimized properties engaged in more durable interactions, stabilizing the emergent cells and providing increased cellular cohesiveness. This poised the urancestral community towards a phase transition (a crystallization [148]), a point at which cellular groups had distinct properties and could be individuated. We believe this was the time of the origin of the archaeal lineage, 2.9 billion years ago [48].

At the base of the ToLs that were reconstructed from genomic data, basal archaeal taxa arise as paraphyletic lineages (Figures 4(a) and 4(c)). These lineages likely arose from subgroups of the urancestral population that pervasively lost crucial domain structures and molecular functions. This represents an evolutionary grade under the scenario described above. The emerging lineages shared with the urancestral community a unifying condition that was related to archaic biochemistry. In other words, the urancestral and emerging archaeal lineages expressed fundamental structural and functional equivalences in terms of their repertoires but revealed in each emerging and durable paraphyletic lineage a handful of distinct, newly developed traits. These traits could be global, such as increased thermostability of some crucial members of the protein repertoire or change in the membrane makeup, or local, such as the selective loss of crucial structures and functions. Figure 13 uses the tree paradigm to portray the structural and functional equivalences of the urancestral and emerging archaeal lineages and the slow progression from grades to clades.

[Figure 12 image: Venn diagrams for Superfamilies, Topologies, and Architectures across Archaea, Bacteria, and Eukarya, arranged by increasing evolutionary conservation.]

Figure 12: Venn diagrams displaying the distributions of 2,221 homologous superfamilies, 1,152 topologies, and 38 architectures of CATH domains in the proteomes of 492 fully sequenced genomes (from [145]). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

[Figure 13 image: lineages progressing from grades to clades between Time 0 and Time 1.]

Figure 13: From grades to clades. The cartoon describes a possible progression of modes of organismal diversification during the rise of primordial archaeal lineages. The width of emerging lineages is proportional to uniquely identifying features of physiological and molecular complexity.

8. Through the Wormhole: The Makeup of the “Megaorganism” and the Emerging Archaeal Lineages

Character state reconstructions of proteome repertoires derived from ToLs, coupled to the timelines of ToDs, provide an effective way to define the ancestral protein domain complement of the urancestor [48] and, consequently, the likely makeup of the emerging archaeal lineages. The urancestral proteome possessed a lower bound of ∼70 FSF domain structures, 75% of which were composed of α/β and α+β proteins. About 50% of FSFs were part of metabolic enzymes, including a rich toolkit of transferases and enzymes of nucleotide metabolism. The rest of the domains were involved in functions related to information (translation, replication, and repair), intracellular processes (transport, protein modification, and proteolytic activities), regulation (kinases/phosphatases and DNA binding functions), and small molecule binding. The urancestor had a limited repertoire of aaRSs and translation factors. It contained a primordial ribosome with a limited core of universal ribosomal proteins. It had numerous membrane proteins necessary for transport, including a relatively advanced ATP synthetase complex, and structures necessary for cellular organization (filaments and primordial cytoskeletal structures). The cells lacked enzymes for deoxyribonucleotide production, so it is likely that the cellular urancestor itself did not harbor a DNA genome. The cells also lacked functions related to extracellular processes (cell adhesion, immune response, and toxins/defense) and cellular motility, suggesting an ancient living world without competitive strategies of survival.

9. Conclusions

The rooting of the ToL has always been controversial in evolutionary biology [26, 152, 153]. While it is popularly accepted that the ToL based on sequence phylogenies is rooted in the akaryotes and that Archaea and Eukarya are sister groups to each other, only two of the three main steps of phylogenetic analysis [104] have been partially fulfilled with sequences. These are the selection of an appropriate statistical or nonstatistical evolutionary model of character change and of an optimization method for phylogenetic tree reconstruction. However, no adequate method exists for character polarization that identifies ancestral and derived character states in sequences. In the absence of robust polarization methodology, any statement about the rooting of the ToL should be considered suspect or subject to apriorism. Here we show that information derived from a genomic structural and functional census of millions of encoded proteins and RNAs, coupled with process models that comply with Weston’s generality criterion, provides the means to dissect the origins of diversified life. The generality criterion is fulfilled in these studies by focusing on the accumulation of modules such as protein domain structures, elements of RNA substructures, or ontological definitions of molecular function. In general, these features are the subject of accretion processes that comply with additive phylogenetic change within the nested taxonomic hierarchy and result in changes of abundance. These processes include those responsible for the growth of molecules (e.g., multidomain proteins), molecular ensembles (e.g., the ribosome), and molecular repertoires (e.g., proteomes). The new methods unfold a consistent evolutionary scenario in which the origin of diversified life traces back to the early history of Archaea. Remarkably, the archaic origin of this microbial urkingdom now does justice to its name.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgments

The authors would like to thank members and friends of the GCA laboratory for fruitful discussions. This research has been supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791) and the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625) to Gustavo Caetano-Anollés and from the KRIBB Research Initiative Program and the Next-Generation BioGreen 21 Program, Rural Development Administration (PJ0090192013), to Kyung Mo Kim.

References

[1] C. R. Woese, “A new biology for a new century,” Microbiology and Molecular Biology Reviews, vol. 68, no. 2, pp. 173–186, 2004.

[2] T. Grant, “Testing methods: the evaluation of discovery operations in evolutionary biology,” Cladistics, vol. 18, no. 1, pp. 94–111, 2002.

[3] E. O. Wiley, “Karl R. Popper, systematics, and classification: a reply to Walter Bock and other evolutionary taxonomists,” Systematic Zoology, vol. 24, no. 2, pp. 233–243, 1975.

[4] J. S. Farris, “Parsimony and explanatory power,” Cladistics, vol. 24, no. 5, pp. 825–847, 2008.

[5] G. Caetano-Anollés, K. M. Kim, and D. Caetano-Anollés, “The phylogenomic roots of modern biochemistry: origins of proteins, cofactors and protein biosynthesis,” Journal of Molecular Evolution, vol. 74, no. 1-2, pp. 1–34, 2012.

[6] W. Fontana, “Modeling ‘evo-devo’ with RNA,” BioEssays, vol. 24, no. 12, pp. 1164–1177, 2002.

[7] W. Hennig, Phylogenetic Systematics, University of Illinois Press, Urbana, Ill, USA, 1999.

[8] D. Fraix-Burnet, T. Chattopadhyay, A. K. Chattopadhyay, E. Davoust, and M. Thuillard, “A six-parameter space to describe galaxy diversification,” Astronomy and Astrophysics, vol. 545, article A80, 24 pages, 2012.

[9] E. Sober, “The contest between parsimony and likelihood,” Systematic Biology, vol. 53, no. 4, pp. 644–653, 2004.

[10] E. K. Lienau and R. DeSalle, “Is the microbial tree of life verificationist?” Cladistics, vol. 26, no. 2, pp. 195–201, 2010.

[11] J. S. Huxley, “Evolutionary processes and taxonomy with special reference to grades,” Uppsala Universitets Årsskrift, vol. 6, pp. 21–39, 1958.

[12] C. R. Woese and G. E. Fox, “Phylogenetic structure of the prokaryotic domain: the primary kingdoms,” Proceedings of the National Academy of Sciences of the United States of America, vol. 74, no. 11, pp. 5088–5090, 1977.

[13] P. Forterre, “The universal tree of life and the last universal cellular ancestor: revolution and counterrevolutions,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 43–62, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[14] J. P. Gogarten, H. Kibak, P. Dittrich et al., “Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 17, pp. 6661–6665, 1989.

[15] N. Iwabe, K. Kuma, M. Hasegawa, S. Osawa, and T. Miyata, “Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 23, pp. 9355–9359, 1989.

[16] J. P. Gogarten and L. Olendzenski, “Orthologs, paralogs and genome comparisons,” Current Opinion in Genetics and Development, vol. 9, no. 6, pp. 630–636, 1999.

[17] C. R. Woese, O. Kandler, and M. L. Wheelis, “Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya,” Proceedings of the National Academy of Sciences of the United States of America, vol. 87, no. 12, pp. 4576–4579, 1990.

[18] J. Martijn and T. J. Ettema, “From archaeon to eukaryote: the evolutionary dark ages of the eukaryotic cell,” Biochemical Society Transactions, vol. 41, no. 1, pp. 451–457, 2013.

[19] C. J. Cox, P. G. Foster, R. P. Hirt, S. R. Harris, and T. M. Embley, “The archaebacterial origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 51, pp. 20356–20361, 2008.

[20] P. G. Foster, C. J. Cox, and T. M. Embley, “The primary divisions of life: a phylogenomic approach employing composition-heterogeneous methods,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2197–2207, 2009.

[21] T. A. Williams, P. G. Foster, T. M. Nye, C. J. Cox, and T. M. Embley, “A congruent phylogenomic signal places eukaryotes within the Archaea,” Proceedings of the Royal Society B: Biological Sciences, vol. 279, no. 1749, pp. 4870–4879, 2012.

[22] S. Gribaldo, A. M. Poole, V. Daubin, P. Forterre, and C. Brochier-Armanet, “The origin of eukaryotes and their relationship with the Archaea: are we at a phylogenomic impasse?” Nature Reviews Microbiology, vol. 8, no. 10, pp. 743–752, 2010.

[23] H. Philippe and P. Forterre, “The rooting of the universal tree of life is not reliable,” Journal of Molecular Evolution, vol. 49, no. 4, pp. 509–523, 1999.

[24] E. Sober and M. Steel, “Testing the hypothesis of common ancestry,” Journal of Theoretical Biology, vol. 218, no. 4, pp. 395–408, 2002.

[25] P. Forterre, “The common ancestor of Archaea and Eukarya was not an archaeon,” Archaea, vol. 2013, Article ID 372396, 18 pages, 2013.

[26] P. Forterre and H. Philippe, “Where is the root of the universal tree of life?” BioEssays, vol. 21, no. 10, pp. 871–879, 1999.

[27] S. Gribaldo and H. Philippe, “Ancient phylogenetic relationships,” Theoretical Population Biology, vol. 61, no. 4, pp. 391–408, 2002.

[28] J. K. Harris, S. T. Kelley, G. B. Spiegelman, and N. R. Pace, “The genetic core of the universal ancestor,” Genome Research, vol. 13, no. 3, pp. 407–412, 2003.

[29] A. Harish and G. Caetano-Anollés, “Ribosomal history reveals origins of modern protein synthesis,” PLoS ONE, vol. 7, no. 3, Article ID e32776, 2012.

[30] G. Caetano-Anollés and M. J. Seufferheld, “The coevolutionary roots of biochemistry and cellular organization challenge the RNA world paradigm,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 152–177, 2013.

[31] N. R. Pace, “Mapping the tree of life: progress and prospects,” Microbiology and Molecular Biology Reviews, vol. 73, no. 4, pp. 565–576, 2009.

[32] P. de Rijk, Y. van de Peer, I. van den Broeck, and R. de Wachter, “Evolution according to large ribosomal subunit RNA,” Journal of Molecular Evolution, vol. 41, no. 3, pp. 366–375, 1995.

[33] G. Caetano-Anollés, “Tracing the evolution of RNA structure in ribosomes,” Nucleic Acids Research, vol. 30, no. 11, pp. 2575–2587, 2002.


[34] J. Mallatt and C. J. Winchell, “Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes,” Molecular Biology and Evolution, vol. 19, no. 3, pp. 289–301, 2002.

[35] M. Gerstein, “Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census,” Proteins: Structure, Function and Bioinformatics, vol. 33, no. 4, pp. 518–534, 1998.

[36] M. Gerstein and H. Hegyi, “Comparing genomes in terms of protein structure: surveys of a finite parts list,” FEMS Microbiology Reviews, vol. 22, no. 4, pp. 277–304, 1998.

[37] B. Snel, P. Bork, and M. A. Huynen, “Genome phylogeny based on gene content,” Nature Genetics, vol. 21, no. 1, pp. 108–110, 1999.

[38] F. Tekaia, A. Lazcano, and B. Dujon, “The genomic tree as revealed from whole proteome comparisons,” Genome Research, vol. 9, no. 6, pp. 550–557, 1999.

[39] Y. I. Wolf, S. E. Brenner, P. A. Bash, and E. V. Koonin, “Distribution of protein folds in the three superkingdoms of life,” Genome Research, vol. 9, no. 1, pp. 17–26, 1999.

[40] J. O. Korbel, B. Snel, M. A. Huynen, and P. Bork, “SHOT: a web server for the construction of genome phylogenies,” Trends in Genetics, vol. 18, no. 3, pp. 158–162, 2002.

[41] G. Caetano-Anollés and D. Caetano-Anollés, “An evolutionarily structured universe of protein architecture,” Genome Research, vol. 13, no. 7, pp. 1563–1571, 2003.

[42] S. L. Baldauf, A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle, “A kingdom-level phylogeny of eukaryotes based on combined protein data,” Science, vol. 290, no. 5493, pp. 972–977, 2000.

[43] J. R. Brown, C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope, “Universal trees based on large combined protein sequence data sets,” Nature Genetics, vol. 28, no. 3, pp. 281–285, 2001.

[44] F. D. Ciccarelli, T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P. Bork, “Toward automatic reconstruction of a highly resolved tree of life,” Science, vol. 311, no. 5765, pp. 1283–1287, 2006.

[45] T. Dagan, M. Roettger, D. Bryant, and W. Martin, “Genome networks root the tree of life between prokaryotic domains,” Genome Biology and Evolution, vol. 2, no. 1, pp. 379–392, 2010.

[46] S. Jun, G. E. Sims, G. A. Wu, and S. Kim, “Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 133–138, 2010.

[47] E. A. Schultes, P. T. Hraber, and T. H. LaBean, “No molecule is an island: molecular evolution and the study of sequence space,” in Algorithmic Bioprocesses, A. Condon, D. Harel, J. N. Kok, A. Salomaa, and E. Winfree, Eds., pp. 675–704, Springer, Berlin, Germany, 2009.

[48] K. M. Kim and G. Caetano-Anollés, “The proteomic complexity and rise of the primordial ancestor of diversified life,” BMC Evolutionary Biology, vol. 11, no. 1, article 140, 2011.

[49] V. A. Albert, Parsimony, Phylogeny, and Genomics, Oxford University Press, Oxford, UK, 2005.

[50] G. Caetano-Anollés, “Evolved RNA secondary structure and the rooting of the universal tree of life,” Journal of Molecular Evolution, vol. 54, no. 3, pp. 333–345, 2002.

[51] J. A. Lake, R. G. Skophammer, C. W. Herbold, and J. A. Servin, “Genome beginnings: rooting the tree of life,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2177–2185, 2009.

[52] B. Guo, M. Zou, and A. Wagner, “Pervasive indels and their evolutionary dynamics after the fish-specific genome duplication,” Molecular Biology and Evolution, vol. 29, no. 10, pp. 3005–3022, 2012.

[53] M. K. Basu, I. B. Rogozin, O. Deusch, T. Dagan, W. Martin, and E. V. Koonin, “Evolutionary dynamics of introns in plastid-derived genes in plants: saturation nearly reached but slow intron gain continues,” Molecular Biology and Evolution, vol. 25, no. 1, pp. 111–119, 2008.

[54] J. S. Farris, “Phylogenetic analysis under Dollo’s Law,” Systematic Biology, vol. 26, no. 1, pp. 77–88, 1977.

[55] H. Fang, M. E. Oates, R. B. Pethica et al., “A daily-updated tree of (sequenced) life as a reference for genome research,” Scientific Reports, vol. 3, article 2015, 2013.

[56] T. Cavalier-Smith, “The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification,” International Journal of Systematic and Evolutionary Microbiology, vol. 52, no. 1, pp. 7–76, 2002.

[57] T. Cavalier-Smith, “Rooting the tree of life by transition analyses,” Biology Direct, vol. 1, article 19, 2006.

[58] H. S. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Widespread recruitment of ancient domain structures in modern enzymes during metabolic evolution,” Journal of Integrative Bioinformatics, vol. 10, no. 1, article 214, 2013.

[59] G. Caetano-Anollés, M. Wang, and D. Caetano-Anollés, “Structural phylogenomics retrodicts the origin of the genetic code and uncovers the evolutionary impact of protein flexibility,” PLoS ONE, vol. 8, no. 8, Article ID e72225, 2013.

[60] E. Bapteste and C. Brochier, “On the conceptual difficulties in rooting the tree of life,” Trends in Microbiology, vol. 12, no. 1, pp. 9–13, 2004.

[61] A. M. Poole, “Horizontal gene transfer and the earliest stages of the evolution of life,” Research in Microbiology, vol. 160, no. 7, pp. 473–480, 2009.

[62] D. Raoult, “The post-Darwinist rhizome of life,” The Lancet, vol. 375, no. 9709, pp. 104–105, 2010.

[63] W. F. Doolittle, “Phylogenetic classification and the universal tree,” Science, vol. 284, no. 5423, pp. 2124–2128, 1999.

[64] T. Dagan and W. Martin, “The tree of one percent,” Genome Biology, vol. 7, no. 10, article 118, 2006.

[65] W. F. Doolittle and E. Bapteste, “Pattern pluralism and the tree of life hypothesis,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 7, pp. 2043–2049, 2007.

[66] E. V. Koonin, “Towards a postmodern synthesis of evolutionary biology,” Cell Cycle, vol. 8, no. 6, pp. 799–800, 2009.

[67] T. Kloesges, O. Popa, W. Martin, and T. Dagan, “Networks of gene sharing among 329 proteobacterial genomes reveal differences in lateral gene transfer frequency at different phylogenetic depths,” Molecular Biology and Evolution, vol. 28, no. 2, pp. 1057–1074, 2011.

[68] S. Halary, J. W. Leigh, B. Cheaib, P. Lopez, and E. Bapteste, “Network analyses structure genetic diversity in independent genetic worlds,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 127–132, 2010.

[69] S. S. Abby, E. Tannier, M. Gouy, and V. Daubin, “Lateral gene transfer as a support for the tree of life,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 13, pp. 4962–4967, 2012.

[70] N. Glansdorff, Y. Xu, and B. Labedan, “The conflict between horizontal gene transfer and the safeguard of identity: origin of meiotic sexuality,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 470–480, 2009.

[71] C. G. Kurland and O. G. Berg, “A hitchhiker’s guide to evolving networks,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 361–396, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[72] L. P. Villarreal and G. Witzany, “The DNA habitat and its RNA inhabitants: at the dawn of RNA sociology,” Genomics Insights, vol. 6, pp. 1–12, 2013.

[73] S. Gribaldo and C. Brochier, “Phylogeny of prokaryotes: does it exist and why should we care?” Research in Microbiology, vol. 160, no. 7, pp. 513–521, 2009.

[74] B. D. Mishler, “The logic of the data matrix in phylogenetic analysis,” in Parsimony, Phylogeny and Genomics, V. A. Albert, Ed., pp. 57–69, Oxford University Press, New York, NY, USA, 2005.

[75] A. Dress, V. Moulton, M. Steel, and T. Wu, “Species, clusters and the ‘tree of life’: a graph-theoretic perspective,” Journal of Theoretical Biology, vol. 265, no. 4, pp. 535–542, 2010.

[76] N. Lane and W. Martin, “The energetics of genome complexity,” Nature, vol. 467, no. 7318, pp. 928–934, 2010.

[77] M. C. Rivera and J. A. Lake, “The ring of life provides evidence for a genome fusion origin of eukaryotes,” Nature, vol. 431, no. 7005, pp. 152–155, 2004.

[78] D. Alvarez-Ponce and J. O. McInerney, “The human genome retains relics of its prokaryotic ancestry: human genes of archaebacterial and eubacterial origin exhibit remarkable differences,” Genome Biology and Evolution, vol. 3, no. 1, pp. 782–790, 2011.

[79] A. Poole and D. Penny, “Eukaryote evolution: engulfed by speculation,” Nature, vol. 447, no. 7147, p. 913, 2007.

[80] E. V. Koonin and Y. I. Wolf, “The fundamental units, processes and patterns of evolution, and the tree of life conundrum,” Biology Direct, vol. 4, no. 1, article 33, 2009.

[81] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Search for a ‘tree of life’ in the thicket of the phylogenetic forest,” Journal of Biology, vol. 8, no. 6, article 59, 2009.

[82] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Seeing the tree of life behind the phylogenetic forest,” BMC Biology, vol. 11, article 46, 2013.

[83] D. B. Wetlaufer, “Nucleation, rapid folding, and globular intrachain regions in proteins,” Proceedings of the National Academy of Sciences of the United States of America, vol. 70, no. 3, pp. 697–701, 1973.

[84] J. S. Richardson, “The anatomy and taxonomy of protein structure,” Advances in Protein Chemistry, vol. 34, pp. 167–339, 1981.

[85] M. Riley and B. Labedan, “Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module,” Journal of Molecular Biology, vol. 268, no. 5, pp. 857–868, 1997.

[86] J. Janin and S. J. Wodak, “Structural domains in proteins and their role in the dynamics of protein function,” Progress in Biophysics and Molecular Biology, vol. 42, pp. 21–78, 1983.

[87] G. Caetano-Anollés, M. Wang, D. Caetano-Anollés, and J. E. Mittenthal, “The origin, evolution and structure of the protein world,” Biochemical Journal, vol. 417, no. 3, pp. 621–637, 2009.

[88] J. Gough, “Convergent evolution of domain architectures (is rare),” Bioinformatics, vol. 21, no. 8, pp. 1464–1471, 2005.

[89] K. Forslund, A. Henricson, V. Hollich, and E. L. L. Sonnhammer, “Domain tree-based analysis of protein architecture evolution,” Molecular Biology and Evolution, vol. 25, no. 2, pp. 254–264, 2008.

[90] S. Yang and P. E. Bourne, “The evolutionary history of protein domains viewed by species phylogeny,” PLoS ONE, vol. 4, no. 12, Article ID e8378, 2009.

[91] K. M. Kim and G. Caetano-Anollés, “The evolutionary history of protein fold families and proteomes confirms that the archaeal ancestor is more ancient than the ancestors of other superkingdoms,” BMC Evolutionary Biology, vol. 12, article 13, 2012.

[92] M. Wang and G. Caetano-Anollés, “The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world,” Structure, vol. 17, no. 1, pp. 66–78, 2009.

[93] G. Caetano-Anollés and A. Nasir, “Benefits of using molecular structure and abundance in phylogenomic analysis,” Frontiers in Genetics, vol. 3, article 172, 2012.

[94] K. Brading and E. Castellani, Symmetries in Physics: Philosophical Reflections, Cambridge University Press, Cambridge, UK, 2003.

[95] J. A. Cotton and M. Wilkinson, “Quantifying the potential utility of phylogenetic characters,” Taxon, vol. 57, no. 1, pp. 131–136, 2008.

[96] E. Mossel and M. Steel, “A phase transition for a random cluster model on phylogenetic trees,” Mathematical Biosciences, vol. 187, no. 2, pp. 189–203, 2004.

[97] M. H. Bailor, X. Sun, and H. M. Al-Hashimi, “Topology links RNA secondary structure with global conformation, dynamics, and adaptation,” Science, vol. 327, no. 5962, pp. 202–206, 2010.

[98] K. Illergård, D. H. Ardell, and A. Elofsson, “Structure is three to ten times more conserved than sequence—a study of structural response in protein cores,” Proteins: Structure, Function and Bioinformatics, vol. 77, no. 3, pp. 499–508, 2009.

[99] G. Caetano-Anollés, S. K. Hee, and J. E. Mittenthal, “The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 22, pp. 9358–9363, 2007.

[100] P. H. Weston, “Methods for rooting cladistic trees,” in Models in Phylogeny Reconstruction, D. J. Siebert, R. W. Scotland, and D. M. Williams, Eds., pp. 125–155, Oxford University Press, Oxford, UK, 1994.

[101] H. N. Bryant, “The polarization of character transformations in phylogenetic systematics: role of axiomatic and auxiliary assumptions,” Systematic Biology, vol. 40, no. 4, pp. 433–445, 1991.

[102] H. N. Bryant and G. Wagner, “Character polarity and the rooting of cladograms,” in The Character Concept in Evolutionary Biology, G. P. Wagner, Ed., pp. 319–338, Academic Press, New York, NY, USA, 2001.

[103] G. Nelson, “Ontogeny, phylogeny, paleontology, and the biogenetic law,” Systematic Biology, vol. 27, no. 3, pp. 324–345, 1978.

[104] P. H. Weston, “Indirect and direct methods in systematics,” in Ontogeny and Systematics, C. J. Humphries, Ed., pp. 27–56, Columbia University Press, New York, NY, USA, 1988.

[105] F.-J. Sun and G. Caetano-Anollés, “The evolutionary history of the structure of 5S ribosomal RNA,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 430–443, 2009.

[106] X. Fang, T. Pan, and T. R. Sosnick, “Mg2+-dependent folding of a large ribozyme without kinetic traps,” Nature Structural Biology, vol. 6, no. 12, pp. 1091–1095, 1999.

[107] F.-J. Sun and G. Caetano-Anollés, “Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses,” PLoS Computational Biology, vol. 4, no. 3, Article ID e1000018, 2008.

[108] F.-J. Sun and G. Caetano-Anollés, “The origin and evolution of tRNA inferred from phylogenetic analysis of structure,” Journal of Molecular Evolution, vol. 66, no. 1, pp. 21–35, 2008.

[109] W. R. Ashby, An Introduction to Cybernetics, Taylor & Francis, London, UK, 1955.

[110] J. G. Lundberg, “Wagner networks and ancestors,” Systematic Biology, vol. 21, no. 4, pp. 398–413, 1972.

[111] P. T. Chippindale and J. J. Wiens, “Weighting, partitioning, and combining characters in phylogenetic analysis,” Systematic Biology, vol. 43, no. 2, pp. 278–287, 1994.

[112] J. Mittenthal, D. Caetano-Anollés, and G. Caetano-Anollés, “Biphasic patterns of diversification and the emergence of modules,” Frontiers in Genetics, vol. 3, article 147, 2012.

[113] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[114] C. A. Orengo, A. Michie, S. Jones, D. T. Jones, M. Swindells, and J. M. Thornton, “CATH—a hierarchic classification of protein domain structures,” Structure, vol. 5, no. 8, pp. 1093–1109, 1997.

[115] G. Sawa, J. Dicks, and I. N. Roberts, “Current approaches to whole genome phylogenetic analysis,” Briefings in Bioinformatics, vol. 4, no. 1, pp. 63–74, 2003.

[116] R. Sathyapriya, J. M. Duarte, H. Stehr, I. Filippis, and M. Lappe, “Defining an essence of structure determining residue contacts in proteins,” PLoS Computational Biology, vol. 5, no. 12, Article ID e1000584, 2009.

[117] D. Penny and L. J. Collins, “Evolutionary genomics leads the way,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 1–16, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[118] M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.

[119] A. Andreeva, D. Howorth, J. Chandonia et al., “Data growth and its impact on the SCOP database: new developments,” Nucleic Acids Research, vol. 36, pp. D419–D425, 2008.

[120] L. S. Yafremava, M. Wielgos, S. Thomas et al., “A general framework of persistence strategies for biological systems helps explain domains of life,” Frontiers in Genetics, vol. 4, article 16, 2013.

[121] J. L. Friar, T. Goldman, and J. Pérez-Mercader, “Genome sizes and the Benford distribution,” PLoS ONE, vol. 7, no. 5, Article ID e36624, 2012.

[122] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya,” BMC Evolutionary Biology, vol. 12, article 156, 2012.

[123] A. Nasir and G. Caetano-Anollés, “Comparative analysis of proteomes and functionomes provides insights into origins of cellular diversification,” Archaea, vol. 2013, Article ID 648746, 13 pages, 2013.

[124] W. Martin and M. Müller, “The hydrogen hypothesis for the first eukaryote,” Nature, vol. 392, no. 6671, pp. 37–41, 1998.

[125] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Global patterns of protein domain gain and loss in superkingdoms,” PLoS Computational Biology, vol. 10, no. 1, Article ID e1003452, 2014.

[126] M. Wang, L. S. Yafremava, D. Caetano-Anollés, J. E. Mittenthal, and G. Caetano-Anollés, “Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world,” Genome Research, vol. 17, no. 11, pp. 1572–1585, 2007.

[127] L. Guy and T. J. G. Ettema, “The archaeal ‘TACK’ superphylum and the origin of eukaryotes,” Trends in Microbiology, vol. 19, no. 12, pp. 580–587, 2011.

[128] D. Schlieper, M. A. Oliva, J. M. Andreu, and J. Löwe, “Structure of bacterial tubulin BtubA/B: evidence for horizontal gene transfer,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 26, pp. 9170–9175, 2005.

[129] M. P. Hoeppner, P. P. Gardner, and A. M. Poole, “Comparative analysis of RNA families reveals distinct repertoires for each domain of life,” PLoS Computational Biology, vol. 8, no. 11, Article ID e1002752, 2012.

[130] H. Xue, K. Tong, C. Marck, H. Grosjean, and J. T. Wong, “Transfer RNA paralogs: evidence for genetic code-amino acid biosynthesis coevolution and an archaeal root of life,” Gene, vol. 310, no. 1-2, pp. 59–66, 2003.

[131] H. Xue, S. Ng, K. Tong, and J. T. Wong, “Congruence of evidence for a Methanopyrus-proximal root of life based on transfer RNA and aminoacyl-tRNA synthetase genes,” Gene, vol. 360, no. 2, pp. 120–130, 2005.

[132] M. Di Giulio, “Nanoarchaeum equitans is a living fossil,” Journal of Theoretical Biology, vol. 242, no. 1, pp. 257–260, 2006.

[133] M. Di Giulio, “The tree of life might be rooted in the branch leading to Nanoarchaeota,” Gene, vol. 401, no. 1-2, pp. 108–113, 2007.

[134] J. T. Wong, J. Chen, W. Mat, S. Ng, and H. Xue, “Polyphasic evidence delineating the root of life and roots of biological domains,” Gene, vol. 403, no. 1-2, pp. 39–52, 2007.

[135] F.-J. Sun and G. Caetano-Anollés, “The ancient history of the structure of ribonuclease P and the early origins of Archaea,” BMC Bioinformatics, vol. 11, article 153, 2010.

[136] C. Marck and H. Grosjean, “tRNomics: analysis of tRNA genes from 50 genomes of Eukarya, Archaea, and bacteria reveals anticodon-sparing strategies and domain-specific features,” RNA, vol. 8, no. 10, pp. 1189–1232, 2002.

[137] W. Mat, H. Xue, and J. T. Wong, “The genomics of LUCA,” Frontiers in Bioscience, vol. 13, no. 14, pp. 5605–5613, 2008.

[138] C. Brochier, S. Gribaldo, Y. Zivanovic, F. Confalonieri, and P. Forterre, “Nanoarchaea: representatives of a novel archaeal phylum or a fast-evolving euryarchaeal lineage related to Thermococcales?” Genome Biology, vol. 6, no. 5, article R42, 2005.

[139] B. Billoud, M. Guerrucci, M. Masselot, and J. S. Deutsch, “Cirripede phylogeny using a novel approach: molecular morphometrics,” Molecular Biology and Evolution, vol. 17, no. 10, pp. 1435–1445, 2000.

[140] L. J. Collins, V. Moulton, and D. Penny, “Use of RNA secondary structure for studying the evolution of RNase P and RNase MRP,” Journal of Molecular Evolution, vol. 51, no. 3, pp. 194–204, 2000.

[141] G. Caetano-Anollés, “Novel strategies to study the role of mutation and nucleic acid structure in evolution,” Plant Cell, Tissue and Organ Culture, vol. 67, no. 2, pp. 115–132, 2001.

[142] F.-J. Sun, S. Fleurdépine, C. Bousquet-Antonelli, G. Caetano-Anollés, and J. Deragon, “Common evolutionary trends for SINE RNA structures,” Trends in Genetics, vol. 23, no. 1, pp. 26–33, 2007.

[143] D. Caetano-Anollés, K. M. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Proteome evolution and the metabolic origins of translation and cellular life,” Journal of Molecular Evolution, vol. 72, no. 1, pp. 14–33, 2011.

[144] M. Wang, Y. Jiang, K. M. Kim et al., “A universal molecular clock of protein folds and its power in tracing the early history of aerobic metabolism and planet oxygenation,” Molecular Biology and Evolution, vol. 28, no. 1, pp. 567–582, 2011.

[145] S. A. Bukhari and G. Caetano-Anollés, “Origin and evolution of protein fold designs inferred from phylogenomic analysis of CATH domain structures in proteomes,” PLoS Computational Biology, vol. 9, no. 3, Article ID e1003009, 2013.

[146] M. Vesteg and J. Krajčovič, “The falsifiability of the models for the origin of eukaryotes,” Current Genetics, vol. 57, no. 6, pp. 367–390, 2011.

[147] I. N. Berezovsky and E. I. Shakhnovich, “Physics and evolution of thermophilic adaptation,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 36, pp. 12742–12747, 2005.

[148] C. Woese, “The universal ancestor,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 12, pp. 6854–6859, 1998.

[149] M. J. Seufferheld and G. Caetano-Anollés, “Phylogenomics supports a cellularly structured urancestor,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 178–191, 2013.

[150] O. Kandler, “Cell wall biochemistry and three-domain concept of life,” Systematic and Applied Microbiology, vol. 16, no. 4, pp. 501–509, 1994.

[151] C. R. Woese, “On the evolution of cells,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 13, pp. 8742–8747, 2002.

[152] S. L. Baldauf, J. D. Palmer, and W. F. Doolittle, “The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny,” Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 15, pp. 7749–7754, 1996.

[153] H. Leffers, J. Kjems, L. Østergaard, N. Larsen, and R. A. Garrett, “Evolutionary relationships amongst archaebacteria. A comparative study of 23 S ribosomal RNAs of a sulphur-dependent extreme thermophile, an extreme halophile and a thermophilic methanogen,” Journal of Molecular Biology, vol. 195, no. 1, pp. 43–61, 1987.

[154] J. Martin, D. Blackburn, and E. Wiley, “Are node-based and stem-based clades equivalent? Insights from graph theory,” PLoS Currents, vol. 2, Article ID RRN1196, 2010.


Figure 9: The history of accretion of tRNA and RNase P RNA substructures reveals the early evolutionary appearance of Archaea. (a) Rooted trees of tRNA arm substructures reveal the early appearance of the acceptor arm (Acc) followed by the anticodon arm (AC) in Archaea or the pseudouridine (TΨC) arm in both Bacteria and Eukarya (from [108]). The result confirms the sister group relationship of Bacteria and Eukarya. (b) Trees of molecular substructures of RNase P RNAs were reconstructed from characters describing the geometry of their structures (from [135]). Branches and corresponding substructures in a 3D atomic model are colored according to the age of each substructure (nd, node distance). Note the early loss of stem P8 in Archaea immediately after the evolutionary assembly of the universal functional core of the molecule.

that the emerging archaeal lineages were subjected to strong reductive evolutionary pressures during the early evolution of a very ancient RNA molecule.

6.3. Phylogenomic Evidence from Protein Domain Structure. G. Caetano-Anollés and D. Caetano-Anollés [41] were the first to utilize protein domain structures as taxa and to reconstruct trees of domains (ToDs) describing their evolution. Figure 10 shows a rooted ToD built from a census of FSF structures in 981 genomes (data taken from [122, 123]). These trees are unique in that their terminal leaves represent a finite set of component parts [87]. These parts describe at global level the structural diversity of the protein world. When building trees of FSFs, the age of each FSF domain structure can be calculated from the ToDs by simply counting the number of nodes between the root of the tree and its corresponding terminal leaf. This is possible because ToDs exhibit highly unbalanced (pectinate) topologies that are the result of semipunctuated processes of domain appearance and accumulation. This node distance (nd), rescaled from 0 to 1, provides a relative timescale to study the order of FSF appearance in evolutionary history [126, 143]. Wang et al. [144] showed that nd correlates linearly with geological time and defines a global molecular clock of protein folds. Thus, nd can be used as a reliable proxy for time. Plotting the age of FSFs in each of the seven taxonomic groups confirmed evolutionary statements we had previously deduced from the Venn diagrams of Figure 6. The ABE taxonomic group included the majority of ancient and widely distributed FSFs. This is expected. In the presence of a strong vertical trace, molecular diversity must delimit a nested taxonomic hierarchy. The ABE group was followed by the evolutionary appearance of the BE group, which preceded the first domain-specific structures, which were Bacteria-specific (B). Remarkably, Archaea-specific (A) and Eukarya-specific (E) structures appeared concurrently and relatively late. These general trends, captured in the box plots of Figure 10, have been recovered repeatedly when studying domain structures at various levels of structural complexity, from folds to fold families [91, 126], when using CATH or SCOP structural definitions [122, 145], or when exploring the evolution of terminal GO terms.
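The node distance described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in [122, 126, 143]: the `Node` class, the helper names, and the five-leaf toy tree with placeholder labels `fsf_1` to `fsf_5` are hypothetical, and the rescaling assumes that the most basal leaf is assigned nd = 0 and the most derived leaf nd = 1.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node of a rooted tree; a node without children is a leaf (here, an FSF)."""
    name: str = ""
    children: list = field(default_factory=list)


def leaf_depths(root):
    """Return, for every leaf, the number of branches separating it from the root."""
    depths = {}
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if not node.children:
            depths[node.name] = depth
        for child in node.children:
            stack.append((child, depth + 1))
    return depths


def node_distance(root):
    """Rescale leaf depths so the most basal leaf gets nd = 0 and the most derived nd = 1."""
    depths = leaf_depths(root)
    lo, hi = min(depths.values()), max(depths.values())
    if hi == lo:  # degenerate tree: all leaves are equally deep
        return {name: 0.0 for name in depths}
    return {name: (d - lo) / (hi - lo) for name, d in depths.items()}


# Hypothetical pectinate (comb-like) tree: (fsf_1, (fsf_2, (fsf_3, (fsf_4, fsf_5))))
toy_tree = Node(children=[
    Node("fsf_1"),
    Node(children=[
        Node("fsf_2"),
        Node(children=[
            Node("fsf_3"),
            Node(children=[Node("fsf_4"), Node("fsf_5")]),
        ]),
    ]),
])

nd = node_distance(toy_tree)
for name in sorted(nd):
    print(f"{name}: nd = {nd[name]:.2f}")
```

Because the toy tree is fully pectinate, each step away from the root adds one node to the path, so the leaves are ordered along the nd axis in the same order in which they branch off.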

While the early rise of BE FSFs supports the early divergence of Archaea from the urancestor, the very significant trend of gradual loss of structures occurring in the lineages of the archaeal domain and the very late appearance of Archaea-specific structures (e.g., [126]) demand explanation. Since ancient BE FSFs are widely distributed in proteomes (Figure 10), they cannot arise from separate gains of FSFs in Bacteria and Eukarya or by processes of horizontal spread of structures. This was already evident from the Venn diagrams of Figure 6. Moreover, the BE sisterhood to the exclusion of Archaea was further supported by the inspection of FSFs involved in lipid synthesis and transport (Table 1). Membrane lipids are very relevant to the origins of diversified life ([146] and references therein). Bacteria and Eukarya encode similar lipid membranes, while archaeal membranes have a different lipid composition (isoprenoid ethers). To check if

Figure 10: Phylogenomic tree of domains (ToD) describing the evolution of 1733 FSF domain structures (taxa = 1733 FSF domains; characters = 981 genomes; tree length = 458226 steps; CI = 0.049, RI = 0.796, and g1 = −0.08). Taxa (FSFs) were colored according to their distribution (f) in the 981 genomes that were surveyed and used as characters to reconstruct the phylogenomic tree (data taken from [122, 123]); the legend lists 529 FSFs with f between 0.0 and 0.1, 623 with f between 0.1 and 0.5, 336 with f between 0.5 and 0.9, 228 with f between 0.9 and 1.0, and 17 with f = 1.0. The most basal FSFs (c.55.1, c.37.1, b.40.4, c.2.1, a.4.5, and c.66.1) are labeled with SCOP alphanumeric identifiers (e.g., c.37.1 is the P-loop containing nucleoside triphosphate hydrolase FSF). Boxplots display the age (nd value) distribution of FSFs for the seven possible taxonomic groups (ABE, BE, AB, AE, B, A, and E). nd values were calculated directly from the tree [122] and define a timeline of FSF innovation from the origin of proteins (nd = 0) to the present (nd = 1). The group of FSFs that are shared by the three domains of life (ABE) is the most ancient taxonomic group, which spans the entire time axis, and their FSFs are widely distributed in genomes. The appearance of the BE group coincides with the first reductive loss of an FSF in Archaea. FSF structures specific to domains of life appear much later in evolution.

lipid synthesis was another BE synapomorphy, we identified 17 FSFs that were involved in lipid metabolism and transport

Table 1: List of FSFs involved in lipid metabolism and transport along with taxonomic distribution (data taken from [122, 123]).

Group  SCOP Id  FSF Id    FSF description
ABE    89392    b.125.1   Prokaryotic lipoproteins and lipoprotein localization factors
ABE    53092    c.55.2    Creatinase/prolidase N-terminal domain
ABE    49723    b.12.1    Lipase/lipooxygenase domain (PLAT/LH2 domain)
ABE    54637    d.38.1    Thioesterase/thiol ester dehydrase-isomerase
ABE    63825    b.68.5    YWTD domain
BE     47027    a.11.1    Acyl-CoA binding protein
BE     48431    a.118.4   Lipovitellin-phosvitin complex, superhelical domain
BE     55048    d.58.23   Probable ACP-binding domain of malonyl-CoA ACP transacylase
BE     56968    f.7.1     Lipovitellin-phosvitin complex, beta-sheet shell regions
BE     58113    h.5.1     Apolipoprotein A-I
BE     47162    a.24.1    Apolipoprotein
BE     56931    f.4.2     Outer membrane phospholipase A (OMPLA)
B      82220    b.120.1   Tp47 lipoprotein, N-terminal domain
E      47699    a.52.1    Bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin
E      82936    h.6.1     Apolipoprotein A-II
E      57190    g.3.10    Colipase-like
E      49594    b.7.4     Rab geranylgeranyltransferase alpha-subunit, insert domain

(Table 1). Remarkably, the largest share of these FSFs (7 out of 17) was unique to the BE group. In comparison, none were present in either AE or AB groups. The ABE group included five universal FSFs, while one was unique to Bacteria and four were Eukarya-specific. In turn, no FSF was unique to Archaea. The BE FSFs cannot be explained by modern effects impinging on variations in proteomic accumulation in FSFs or by processes of domain rearrangement, since these appear in the protein world quite late in evolution [92]. The only and most parsimonious explanation of the patterns of FSF distribution that unfold in the ToD is the very early (and protracted) rise of the archaeal domain by processes of reductive evolution, possibly triggered by the adaptation of urancestral lineages to harsh environments and survival modes. Under extremophilic conditions, typical of hyperthermophilic environments, considerable investments of matter-energy and information must be made for protein persistence [120]. This puts limits on viable protein structures [147]. Extremophilic environments will thus poise the maintenance of a limited set of FSFs for persistence of emergent diversified lineages. This would induce a primordial episode of reductive evolution in the growing FSF repertoire, explaining why hyperthermophilic and thermophilic archaeal species hold the most reduced proteomes (Figure 7). It would also explain the biases that exist in FSFs, GO terms, and RNA families (Figure 6) and the placement of hyperthermophilic and thermophilic archaeal species at the base of ToLs. Since Archaea populate the oceans and sometimes rival Bacteria in number in those environments, we further interpret the late appearance of Archaea-specific FSFs as the result of late colonization of these mild environments by both ancient archaeons and emerging eukaryotes. This relaxes primordial extremophilic pressures on protein structures and enables the late archaeal exploration of structural flexibility and functional novelty.
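The group tallies quoted for Table 1 can be checked with a short Python sketch. The FSF identifiers and group labels are transcribed from Table 1, with the dotted SCOP notation restored. The function `taxonomic_group` and the dictionary name are hypothetical helpers introduced only for illustration; they are not part of the analyses in [122, 123].

```python
from collections import Counter


def taxonomic_group(in_archaea, in_bacteria, in_eukarya):
    """Build a Venn-group label (e.g. 'BE') from presence in each superkingdom."""
    label = ""
    if in_archaea:
        label += "A"
    if in_bacteria:
        label += "B"
    if in_eukarya:
        label += "E"
    return label or None  # None when the FSF occurs in no superkingdom


# Group assignment of the 17 lipid-related FSFs, as listed in Table 1.
TABLE1_GROUPS = {
    "b.125.1": "ABE", "c.55.2": "ABE", "b.12.1": "ABE", "d.38.1": "ABE",
    "b.68.5": "ABE",
    "a.11.1": "BE", "a.118.4": "BE", "d.58.23": "BE", "f.7.1": "BE",
    "h.5.1": "BE", "a.24.1": "BE", "f.4.2": "BE",
    "b.120.1": "B",
    "a.52.1": "E", "h.6.1": "E", "g.3.10": "E", "b.7.4": "E",
}

counts = Counter(TABLE1_GROUPS.values())
total = sum(counts.values())

# Report all seven possible groups, including the empty ones (AB, AE, A).
for group in ("ABE", "BE", "AB", "AE", "B", "A", "E"):
    print(f"{group:>3}: {counts[group]:2d} of {total}")
```

The tally makes the asymmetry explicit: seven FSFs fall in BE, none in AB, AE, or A, which is the pattern the text interprets as loss in Archaea rather than independent gain in Bacteria and Eukarya.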

6.4. Full Circle: Evidence from Trees of Proteomes and Functionomes and a Tree Derived from the Distribution of Viral Replicons in Superkingdoms. While we distrust trees of systems, especially ToLs built from sequences, the use of molecular structure at high levels of structural abstraction has the potential to mitigate some limitations of sequence analysis [93]. For example, rooted ToLs built from abundance counts of domain structures and terminal GO terms in the genomes of free-living organisms describe the evolution of proteomes (e.g., [91]). All ToL reconstructions of these kinds approximate the physiology of living organisms, dissect the three primary domains of life, and reveal the early paraphyletic origin of extremophilic archaeal lineages, followed by the late appearances of monophyletic Bacteria and Eukarya. These patterns have been reliably recovered with datasets of varying sizes, irrespective of the structural classification scheme [91, 122, 126, 145]. Even a tree reconstructed from the distribution of 2662 viral replicons in superkingdoms, from an exhaustive comparative genomic analysis of viral genomes, showed the basal placement of Archaea and the sister taxa relationship between Bacteria and Eukarya (Figure 11).
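The tree statistics reported with Figure 11 (58 steps, CI = 1, RI = 1, HI = 0, and RC = 1) are tied together by the standard parsimony definitions, which the following Python sketch spells out. The function names are illustrative. The minimum number of steps (58) follows from CI = 1, but the maximum number of steps is not given in the text, so the value of 100 used below is a purely hypothetical placeholder.

```python
def consistency_index(min_steps, observed_steps):
    """CI = m / s: minimum conceivable steps over steps observed on the tree."""
    return min_steps / observed_steps


def retention_index(min_steps, max_steps, observed_steps):
    """RI = (g - s) / (g - m), with g the maximum conceivable number of steps."""
    return (max_steps - observed_steps) / (max_steps - min_steps)


def homoplasy_index(ci):
    """HI = 1 - CI."""
    return 1.0 - ci


def rescaled_consistency_index(ci, ri):
    """RC = CI * RI."""
    return ci * ri


observed = 58      # tree length reported for Figure 11
minimum = 58       # implied by CI = 1 (no homoplasy)
maximum = 100      # hypothetical: the source does not report this value

ci = consistency_index(minimum, observed)
ri = retention_index(minimum, maximum, observed)
hi = homoplasy_index(ci)
rc = rescaled_consistency_index(ci, ri)
print(f"CI = {ci}, RI = {ri}, HI = {hi}, RC = {rc}")
```

When the observed length equals the minimum length, RI equals 1 for any maximum larger than the minimum, so the placeholder does not affect the result.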

6.5. Additional Evidence from Comparative Genomics. The uneven distribution of protein domain structures in the world

Figure 11: One of three optimal phylogenetic tree reconstructions with identical topologies recovered from an exhaustive maximum parsimony search (58 steps; CI = 1, RI = 1, HI = 0, RC = 1; g1 = −0.707) of the abundance of viral replicon types of dsDNA, ssDNA, dsRNA, ssRNA(+), ssRNA(−), and retrotranscribing viruses. Abundance was scored on a 0–20 scale and ranged from 0 to 759 viral replicons. Vectors of abundance reconstructions in internal nodes are given as percentage of total abundance of replicons in superkingdoms. Bootstrap support values are shown below nodes.

of proteomes (Figure 6) is preserved as we climb in the structural hierarchy. This was recently made evident when studying the evolution of CATH domains [145]. The Venn diagrams of Figure 12 show how domain structures in all taxonomic groups decrease in numbers with increases in evolutionary conservation. At the highest CATH architectural level, there were 32 universal architectures but no domain-specific architectures. The four architectures shared by Bacteria and Eukarya were present in at least 60% of the proteomes that were surveyed. The other two interdomain architectures were topological designs of considerable complexity that were poorly shared between proteomes and were evolutionarily derived. They were the clam architectures lost in Eukarya and the box architectures of nucleotide excision repair shared by Archaea and Eukarya. The most parsimonious corollary of these distribution patterns is that the BE taxonomical group must arise by loss of structures in Archaea. Indeed, ToDs describing the evolution of CATH domain structures again confirm the early appearance of BE structures and, consequently, their loss in Archaea.

7. Paraphyletic Origins, Grades, and Clades in Archaeal History

Saying that “a ToL is rooted in a domain of life” is an incorrect statement that comes from phylogenetic methodology (the use of outgroups) and the tendency to look at the past with modern eyes. Clades in ToLs have been rooted relative to each other by generating unrooted trees and by defining extant organisms as outgroup taxa. The ToL, however, must be considered rooted in the urancestor of cellular life (Figure 1). This planted edge that connects to the ingroup of extant organisms represents a cellular state in which productive diversification (in the sense of successful lineages) was absent. The primordial urancestral edge leads to a “phase transition,” the last universal cellular ancestor (LUCA), of which little is known. The physiology of the urancestor cannot be considered linked to that of any extant organism, even if it shared a common molecular core with all of them. The urancestor was not an archaeon, a bacterium, or a eukaryote [148]. It was not necessarily thermophilic. Perhaps it was a communal entity or a megametaorganism in the sense of a modern syncytium (the result of multiple cell fusions) and a modern coenocyte (the result of multiple cell divisions). The organismal boundaries were likely present, judging by the number of widely distributed protein domains that associate with membranes and appear at the base of our ToDs and by the universal existence of acidocalcisome organelles [149]. However, the molecular makeup of the urancestral cells was most likely fluid and quasistatistical, the repertoire distributed unequally in the urancestral populations of communal parts, of course within confines delimited by persistence. This urancestral population is therefore consistent with the idea of a primordial stem line proposed by Kandler and Woese [150, 151]. However, it was relatively richer in molecular structures and functions, as opposed to the simple cellular systems hypothesized by Woese. This richness is confirmed by modern analyses of proteomes and functionomes that reveal a vast number of universally shared protein domains and GOs among the three superkingdoms. While each syncytial/coenocyte element of the megaorganism exchanged component parts in search of cellular stability and persistence, the process could not be equated with modern HGT. The exchanging community of primordial cells was not cohesive enough to make the horizontal exchange meaningful. Macromolecules most likely established loose and diverse associations with each other and with smaller molecules, limited by the short average life of their unevolved structural conformations. With time, molecules with better-optimized properties engaged in more durable interactions, stabilizing the emergent cells and providing increased cellular cohesiveness. This poised the urancestral community towards a phase transition (a crystallization [148]), a point in which cellular groups had distinct properties and could be individuated. We believe this was the time of the origin of the archaeal lineage, 2.9 billion years ago [48].

At the base of the ToLs that were reconstructed from genomic data, basal archaeal taxa arise as paraphyletic lineages (Figures 4(a) and 4(c)). These lineages likely arose from subgroups of the urancestral population that pervasively lost crucial domain structures and molecular functions. This represents an evolutionary grade under the scenario described above. The emerging lineages shared with the urancestral community a unifying condition that was related to archaic biochemistry. In other words, the urancestral and emerging archaeal lineages expressed fundamental structural and functional equivalences in terms of their repertoires but revealed, in each emerging and durable paraphyletic lineage, a handful of distinct newly developed traits. These traits could be global, such as increased thermostability of some crucial members of the protein repertoire or change in the membrane makeup, or local, such as the selective loss of crucial structures and functions. Figure 13 uses the tree paradigm to portray the structural and functional equivalences of the urancestral and emerging archaeal lineages and the slow progression from grades to clades.

Figure 12: Venn diagrams displaying the distributions of 2221 homologous superfamilies, 1152 topologies, and 38 architectures of CATH domains in the proteomes of 492 fully sequenced genomes (from [145]). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

Figure 13: From grades to clades. The cartoon describes a possible progression of modes of organismal diversification, from grades (time 0) to clades (time 1), during the rise of primordial archaeal lineages. The width of emerging lineages is proportional to uniquely identifying features of physiological and molecular complexity.

8. Through the Wormhole: The Makeup of the “Megaorganism” and the Emerging Archaeal Lineages

Character state reconstructions of proteome repertoires derived from ToLs, coupled to the timelines of ToDs, provide an effective way to define the ancestral protein domain complement of the urancestor [48] and, consequently, the likely makeup of the emerging archaeal lineages. The urancestral proteome possessed a lower bound of ∼70 FSF domain structures, 75% of which were composed of α/β and α+β proteins. About 50% of FSFs were part of metabolic enzymes, including a rich toolkit of transferases and enzymes of nucleotide metabolism. The rest of the domains were involved in functions related to information (translation, replication, and repair), intracellular processes (transport, protein modification, and proteolytic activities), regulation (kinases/phosphatases and DNA binding functions), and small molecule binding. The urancestor had a limited repertoire of aaRSs and translation factors. It contained a primordial ribosome with a limited core of universal ribosomal proteins. It had numerous membrane proteins necessary for transport, including a relatively advanced ATP synthetase complex, and structures necessary for cellular organization (filaments and primordial cytoskeletal structures). The cells lacked enzymes for deoxyribonucleotide production, so it is likely that the cellular urancestor itself did not harbor a DNA genome. The cells lacked functions related to extracellular processes (cell adhesion, immune response, and toxins/defense) and cellular motility, suggesting an ancient living world without competitive strategies of survival.
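To make the idea of character state reconstruction at internal nodes concrete, the following Python sketch runs the first (downpass) stage of Wagner parsimony for a single ordered abundance character on a three-taxon rooted tree. It rests on several assumptions that are not stated in this section: that the character is treated as ordered, that the tree is binary, and that the tip values (2, 5, and 9) are hypothetical scores invented for the example. It produces preliminary state intervals and the tree length only; it is not the full reconstruction procedure behind [48].

```python
def wagner_downpass(tree, tip_states):
    """Return (state_interval, steps) for the subtree rooted at `tree`.

    `tree` is either a tip name (str) or a pair (left, right) of subtrees.
    For an ordered character, each node receives the interval of states that
    minimizes the number of steps below it: the intersection of the child
    intervals when they overlap, otherwise the gap between them, whose width
    is added to the step count.
    """
    if isinstance(tree, str):
        state = tip_states[tree]
        return (state, state), 0

    left, right = tree
    (a, b), left_steps = wagner_downpass(left, tip_states)
    (c, d), right_steps = wagner_downpass(right, tip_states)

    lo, hi = max(a, c), min(b, d)
    if lo <= hi:                        # child intervals overlap: no extra steps
        return (lo, hi), left_steps + right_steps
    gap = lo - hi                       # child intervals are disjoint
    return (hi, lo), left_steps + right_steps + gap


# Hypothetical abundance scores (0-20 scale) of one character in three taxa.
tips = {"Archaea": 2, "Bacteria": 5, "Eukarya": 9}
rooted_tree = ("Archaea", ("Bacteria", "Eukarya"))

root_interval, length = wagner_downpass(rooted_tree, tips)
print("preliminary root interval:", root_interval)
print("steps for this character:", length)
```

In this toy case the Bacteria and Eukarya tips first define the interval 5 to 9 at their common node at a cost of four steps, and joining Archaea adds three more, so the character needs seven steps on the tree.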

9. Conclusions

The rooting of the ToL has always been controversial in evolutionary biology [26, 152, 153]. While it is popularly accepted that the ToL based on sequence phylogenies is rooted in the akaryotes and that Archaea and Eukarya are sister groups to each other, only two of the three main steps of phylogenetic analysis [104] have been partially fulfilled with sequences. This includes selecting an appropriate statistical or nonstatistical evolutionary model of character change and an optimization method for phylogenetic tree reconstruction. However, no adequate method exists for character polarization that identifies ancestral and derived character states in sequences. In the absence of robust polarization methodology, any statement about the rooting of the ToL should be considered suspect or subject of apriorism. Here we show that information derived from a genomic structural and functional census of millions of encoded proteins and RNAs, coupled with process models that comply with Weston’s generality criterion, provides the means to dissect the origins of diversified life. The generality criterion is fulfilled in these studies by focusing on the accumulation of modules such as protein domain structures, elements of RNA substructures, or ontogenetic definitions of molecular function. In general, these features are the subject of accretion processes that comply with additive phylogenetic change within the nested taxonomic hierarchy and result in changes of abundance. These processes include those responsible for the growth of molecules (e.g., multidomain proteins), molecular ensembles (e.g., the ribosome), and molecular repertoires (e.g., proteomes). The new methods unfold a consistent evolutionary scenario in which the origin of diversified life traces back to the early history of Archaea. Remarkably, the archaic origin of this microbial urkingdom now does justice to its name.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgments

The authors would like to thank members and friends of the GCA laboratory for fruitful discussions. This research has been supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791) and the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625) to Gustavo Caetano-Anollés and from KRIBB Research Initiative Program and the Next-Generation BioGreen 21 Program, Rural Development Administration (PJ0090192013), to Kyung Mo Kim.

References

[1] C. R. Woese, “A new biology for a new century,” Microbiology and Molecular Biology Reviews, vol. 68, no. 2, pp. 173–186, 2004.

[2] T. Grant, “Testing methods: the evaluation of discovery operations in evolutionary biology,” Cladistics, vol. 18, no. 1, pp. 94–111, 2002.

[3] E. O. Wiley, “Karl R. Popper, systematics, and classification: a reply to Walter Bock and other evolutionary taxonomists,” Systematic Zoology, vol. 24, no. 2, pp. 233–243, 1975.

[4] J. S. Farris, “Parsimony and explanatory power,” Cladistics, vol. 24, no. 5, pp. 825–847, 2008.

[5] G. Caetano-Anollés, K. M. Kim, and D. Caetano-Anollés, “The phylogenomic roots of modern biochemistry: origins of proteins, cofactors and protein biosynthesis,” Journal of Molecular Evolution, vol. 74, no. 1-2, pp. 1–34, 2012.

[6] W. Fontana, “Modeling ‘evo-devo’ with RNA,” BioEssays, vol. 24, no. 12, pp. 1164–1177, 2002.

[7] W. Hennig, Phylogenetic Systematics, University of Illinois Press, Urbana, Ill, USA, 1999.

[8] D. Fraix-Burnet, T. Chattopadhyay, A. K. Chattopadhyay, E. Davoust, and M. Thuillard, “A six-parameter space to describe galaxy diversification,” Astronomy and Astrophysics, vol. 545, article A80, 24 pages, 2012.

[9] E. Sober, “The contest between parsimony and likelihood,” Systematic Biology, vol. 53, no. 4, pp. 644–653, 2004.

[10] E. K. Lienau and R. DeSalle, “Is the microbial tree of life verificationist?” Cladistics, vol. 26, no. 2, pp. 195–201, 2010.

[11] J. S. Huxley, “Evolutionary processes and taxonomy with special reference to grades,” Uppsala Universitets Årsskrift, vol. 6, pp. 21–39, 1958.

[12] C. R. Woese and G. E. Fox, “Phylogenetic structure of the prokaryotic domain: the primary kingdoms,” Proceedings of the National Academy of Sciences of the United States of America, vol. 74, no. 11, pp. 5088–5090, 1977.

[13] P. Forterre, “The universal tree of life and the last universal cellular ancestor: revolution and counterrevolutions,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 43–62, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[14] J. P. Gogarten, H. Kibak, P. Dittrich et al., “Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 17, pp. 6661–6665, 1989.

[15] N. Iwabe, K. Kuma, M. Hasegawa, S. Osawa, and T. Miyata, “Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 23, pp. 9355–9359, 1989.

[16] J. P. Gogarten and L. Olendzenski, “Orthologs, paralogs and genome comparisons,” Current Opinion in Genetics and Development, vol. 9, no. 6, pp. 630–636, 1999.

[17] C. R. Woese, O. Kandler, and M. L. Wheelis, “Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya,” Proceedings of the National Academy of Sciences of the United States of America, vol. 87, no. 12, pp. 4576–4579, 1990.

[18] J. Martijn and T. J. Ettema, “From archaeon to eukaryote: the evolutionary dark ages of the eukaryotic cell,” Biochemical Society Transactions, vol. 41, no. 1, pp. 451–457, 2013.

[19] C. J. Cox, P. G. Foster, R. P. Hirt, S. R. Harris, and T. M. Embley, “The archaebacterial origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 51, pp. 20356–20361, 2008.

[20] P. G. Foster, C. J. Cox, and T. M. Embley, “The primary divisions of life: a phylogenomic approach employing composition-heterogeneous methods,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2197–2207, 2009.

[21] T. A. Williams, P. G. Foster, T. M. Nye, C. J. Cox, and T. M. Embley, “A congruent phylogenomic signal places eukaryotes within the Archaea,” Proceedings of the Royal Society B: Biological Sciences, vol. 279, no. 1749, pp. 4870–4879, 2012.

[22] S. Gribaldo, A. M. Poole, V. Daubin, P. Forterre, and C. Brochier-Armanet, “The origin of eukaryotes and their relationship with the Archaea: are we at a phylogenomic impasse?” Nature Reviews Microbiology, vol. 8, no. 10, pp. 743–752, 2010.

[23] H. Philippe and P. Forterre, “The rooting of the universal tree of life is not reliable,” Journal of Molecular Evolution, vol. 49, no. 4, pp. 509–523, 1999.

[24] E. Sober and M. Steel, “Testing the hypothesis of common ancestry,” Journal of Theoretical Biology, vol. 218, no. 4, pp. 395–408, 2002.

[25] P. Forterre, “The common ancestor of Archaea and Eukarya was not an archaeon,” Archaea, vol. 2013, Article ID 372396, 18 pages, 2013.

[26] P. Forterre and H. Philippe, “Where is the root of the universal tree of life?” BioEssays, vol. 21, no. 10, pp. 871–879, 1999.

[27] S. Gribaldo and H. Philippe, “Ancient phylogenetic relationships,” Theoretical Population Biology, vol. 61, no. 4, pp. 391–408, 2002.

[28] J. K. Harris, S. T. Kelley, G. B. Spiegelman, and N. R. Pace, “The genetic core of the universal ancestor,” Genome Research, vol. 13, no. 3, pp. 407–412, 2003.

[29] A. Harish and G. Caetano-Anollés, “Ribosomal history reveals origins of modern protein synthesis,” PLoS ONE, vol. 7, no. 3, Article ID e32776, 2012.

[30] G. Caetano-Anollés and M. J. Seufferheld, “The coevolutionary roots of biochemistry and cellular organization challenge the RNA world paradigm,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 152–177, 2013.

[31] N. R. Pace, “Mapping the tree of life: progress and prospects,” Microbiology and Molecular Biology Reviews, vol. 73, no. 4, pp. 565–576, 2009.

[32] P. de Rijk, Y. van de Peer, I. van den Broeck, and R. de Wachter, “Evolution according to large ribosomal subunit RNA,” Journal of Molecular Evolution, vol. 41, no. 3, pp. 366–375, 1995.

[33] G. Caetano-Anollés, “Tracing the evolution of RNA structure in ribosomes,” Nucleic Acids Research, vol. 30, no. 11, pp. 2575–2587, 2002.

[34] J. Mallatt and C. J. Winchell, “Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes,” Molecular Biology and Evolution, vol. 19, no. 3, pp. 289–301, 2002.

[35] M. Gerstein, “Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census,” Proteins: Structure, Function, and Bioinformatics, vol. 33, no. 4, pp. 518–534, 1998.

[36] M. Gerstein and H. Hegyi, “Comparing genomes in terms of protein structure: surveys of a finite parts list,” FEMS Microbiology Reviews, vol. 22, no. 4, pp. 277–304, 1998.

[37] B. Snel, P. Bork, and M. A. Huynen, “Genome phylogeny based on gene content,” Nature Genetics, vol. 21, no. 1, pp. 108–110, 1999.

[38] F. Tekaia, A. Lazcano, and B. Dujon, “The genomic tree as revealed from whole proteome comparisons,” Genome Research, vol. 9, no. 6, pp. 550–557, 1999.

[39] Y. I. Wolf, S. E. Brenner, P. A. Bash, and E. V. Koonin, “Distribution of protein folds in the three superkingdoms of life,” Genome Research, vol. 9, no. 1, pp. 17–26, 1999.

[40] J. O. Korbel, B. Snel, M. A. Huynen, and P. Bork, “SHOT: a web server for the construction of genome phylogenies,” Trends in Genetics, vol. 18, no. 3, pp. 158–162, 2002.

[41] G. Caetano-Anollés and D. Caetano-Anollés, “An evolutionarily structured universe of protein architecture,” Genome Research, vol. 13, no. 7, pp. 1563–1571, 2003.

[42] S. L. Baldauf, A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle, “A kingdom-level phylogeny of eukaryotes based on combined protein data,” Science, vol. 290, no. 5493, pp. 972–977, 2000.

[43] J. R. Brown, C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope, “Universal trees based on large combined protein sequence data sets,” Nature Genetics, vol. 28, no. 3, pp. 281–285, 2001.

[44] F. D. Ciccarelli, T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P. Bork, “Toward automatic reconstruction of a highly resolved tree of life,” Science, vol. 311, no. 5765, pp. 1283–1287, 2006.

[45] T. Dagan, M. Roettger, D. Bryant, and W. Martin, “Genome networks root the tree of life between prokaryotic domains,” Genome Biology and Evolution, vol. 2, no. 1, pp. 379–392, 2010.

[46] S. Jun, G. E. Sims, G. A. Wu, and S. Kim, “Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 133–138, 2010.

[47] E. A. Schultes, P. T. Hraber, and T. H. LaBean, “No molecule is an island: molecular evolution and the study of sequence space,” in Algorithmic Bioprocesses, A. Condon, D. Harel, J. N. Kok, A. Salomaa, and E. Winfree, Eds., pp. 675–704, Springer, Berlin, Germany, 2009.

[48] K. M. Kim and G. Caetano-Anollés, “The proteomic complexity and rise of the primordial ancestor of diversified life,” BMC Evolutionary Biology, vol. 11, no. 1, article 140, 2011.

[49] V. A. Albert, Parsimony, Phylogeny, and Genomics, Oxford University Press, Oxford, UK, 2005.

[50] G. Caetano-Anollés, “Evolved RNA secondary structure and the rooting of the universal tree of life,” Journal of Molecular Evolution, vol. 54, no. 3, pp. 333–345, 2002.

[51] J. A. Lake, R. G. Skophammer, C. W. Herbold, and J. A. Servin, “Genome beginnings: rooting the tree of life,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2177–2185, 2009.

[52] B. Guo, M. Zou, and A. Wagner, “Pervasive indels and their evolutionary dynamics after the fish-specific genome duplication,” Molecular Biology and Evolution, vol. 29, no. 10, pp. 3005–3022, 2012.

[53] M. K. Basu, I. B. Rogozin, O. Deusch, T. Dagan, W. Martin, and E. V. Koonin, “Evolutionary dynamics of introns in plastid-derived genes in plants: saturation nearly reached but slow intron gain continues,” Molecular Biology and Evolution, vol. 25, no. 1, pp. 111–119, 2008.

[54] J. S. Farris, “Phylogenetic analysis under Dollo’s Law,” Systematic Biology, vol. 26, no. 1, pp. 77–88, 1977.

[55] H. Fang, M. E. Oates, R. B. Pethica et al., “A daily-updated tree of (sequenced) life as a reference for genome research,” Scientific Reports, vol. 3, article 2015, 2013.

[56] T. Cavalier-Smith, “The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification,” International Journal of Systematic and Evolutionary Microbiology, vol. 52, no. 1, pp. 7–76, 2002.

[57] T. Cavalier-Smith, “Rooting the tree of life by transition analyses,” Biology Direct, vol. 1, article 19, 2006.

[58] H. S. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Widespread recruitment of ancient domain structures in modern enzymes during metabolic evolution,” Journal of Integrative Bioinformatics, vol. 10, no. 1, article 214, 2013.

[59] G. Caetano-Anollés, M. Wang, and D. Caetano-Anollés, “Structural phylogenomics retrodicts the origin of the genetic code and uncovers the evolutionary impact of protein flexibility,” PLoS ONE, vol. 8, no. 8, Article ID e72225, 2013.

[60] E. Bapteste and C. Brochier, “On the conceptual difficulties in rooting the tree of life,” Trends in Microbiology, vol. 12, no. 1, pp. 9–13, 2004.

[61] A. M. Poole, “Horizontal gene transfer and the earliest stages of the evolution of life,” Research in Microbiology, vol. 160, no. 7, pp. 473–480, 2009.

[62] D. Raoult, “The post-Darwinist rhizome of life,” The Lancet, vol. 375, no. 9709, pp. 104–105, 2010.

[63] W. F. Doolittle, “Phylogenetic classification and the universal tree,” Science, vol. 284, no. 5423, pp. 2124–2128, 1999.

[64] T. Dagan and W. Martin, “The tree of one percent,” Genome Biology, vol. 7, no. 10, article 118, 2006.

[65] W. F. Doolittle and E. Bapteste, “Pattern pluralism and the tree of life hypothesis,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 7, pp. 2043–2049, 2007.

[66] E. V. Koonin, “Towards a postmodern synthesis of evolutionary biology,” Cell Cycle, vol. 8, no. 6, pp. 799–800, 2009.

[67] T. Kloesges, O. Popa, W. Martin, and T. Dagan, “Networks of gene sharing among 329 proteobacterial genomes reveal differences in lateral gene transfer frequency at different phylogenetic depths,” Molecular Biology and Evolution, vol. 28, no. 2, pp. 1057–1074, 2011.

[68] S. Halary, J. W. Leigh, B. Cheaib, P. Lopez, and E. Bapteste, “Network analyses structure genetic diversity in independent genetic worlds,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 127–132, 2010.

[69] S. S. Abby, E. Tannier, M. Gouy, and V. Daubin, “Lateral gene transfer as a support for the tree of life,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 13, pp. 4962–4967, 2012.

[70] N. Glansdorff, Y. Xu, and B. Labedan, “The conflict between horizontal gene transfer and the safeguard of identity: origin of meiotic sexuality,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 470–480, 2009.

[71] C. G. Kurland and O. G. Berg, “A hitchhiker’s guide to evolving networks,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 361–396, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[72] L. P. Villarreal and G. Witzany, “The DNA habitat and its RNA inhabitants: at the dawn of RNA sociology,” Genomics Insights, vol. 6, pp. 1–12, 2013.

[73] S. Gribaldo and C. Brochier, “Phylogeny of prokaryotes: does it exist and why should we care?” Research in Microbiology, vol. 160, no. 7, pp. 513–521, 2009.

[74] B. D. Mishler, “The logic of the data matrix in phylogenetic analysis,” in Parsimony, Phylogeny, and Genomics, V. A. Albert, Ed., pp. 57–69, Oxford University Press, New York, NY, USA, 2005.

[75] A. Dress, V. Moulton, M. Steel, and T. Wu, “Species, clusters and the “tree of life”: a graph-theoretic perspective,” Journal of Theoretical Biology, vol. 265, no. 4, pp. 535–542, 2010.

[76] N. Lane and W. Martin, “The energetics of genome complexity,” Nature, vol. 467, no. 7318, pp. 928–934, 2010.

[77] M. C. Rivera and J. A. Lake, “The ring of life provides evidence for a genome fusion origin of eukaryotes,” Nature, vol. 431, no. 7005, pp. 152–155, 2004.

[78] D. Alvarez-Ponce and J. O. McInerney, “The human genome retains relics of its prokaryotic ancestry: human genes of archaebacterial and eubacterial origin exhibit remarkable differences,” Genome Biology and Evolution, vol. 3, no. 1, pp. 782–790, 2011.

[79] A. Poole and D. Penny, “Eukaryote evolution: engulfed by speculation,” Nature, vol. 447, no. 7147, p. 913, 2007.

[80] E. V. Koonin and Y. I. Wolf, “The fundamental units, processes and patterns of evolution, and the tree of life conundrum,” Biology Direct, vol. 4, no. 1, article 33, 2009.

[81] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Search for a “tree of life” in the thicket of the phylogenetic forest,” Journal of Biology, vol. 8, no. 6, article 59, 2009.

[82] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Seeing the tree of life behind the phylogenetic forest,” BMC Biology, vol. 11, article 46, 2013.

[83] D. B. Wetlaufer, “Nucleation, rapid folding, and globular intrachain regions in proteins,” Proceedings of the National Academy of Sciences of the United States of America, vol. 70, no. 3, pp. 697–701, 1973.

[84] J. S. Richardson, “The anatomy and taxonomy of protein structure,” Advances in Protein Chemistry, vol. 34, pp. 167–339, 1981.

[85] M. Riley and B. Labedan, “Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module,” Journal of Molecular Biology, vol. 268, no. 5, pp. 857–868, 1997.

[86] J. Janin and S. J. Wodak, “Structural domains in proteins and their role in the dynamics of protein function,” Progress in Biophysics and Molecular Biology, vol. 42, pp. 21–78, 1983.

[87] G. Caetano-Anollés, M. Wang, D. Caetano-Anollés, and J. E. Mittenthal, “The origin, evolution and structure of the protein world,” Biochemical Journal, vol. 417, no. 3, pp. 621–637, 2009.

[88] J. Gough, “Convergent evolution of domain architectures (is rare),” Bioinformatics, vol. 21, no. 8, pp. 1464–1471, 2005.

[89] K. Forslund, A. Henricson, V. Hollich, and E. L. L. Sonnhammer, “Domain tree-based analysis of protein architecture evolution,” Molecular Biology and Evolution, vol. 25, no. 2, pp. 254–264, 2008.

[90] S. Yang and P. E. Bourne, “The evolutionary history of protein domains viewed by species phylogeny,” PLoS ONE, vol. 4, no. 12, Article ID e8378, 2009.

[91] K. M. Kim and G. Caetano-Anollés, “The evolutionary history of protein fold families and proteomes confirms that the archaeal ancestor is more ancient than the ancestors of other superkingdoms,” BMC Evolutionary Biology, vol. 12, article 13, 2012.

[92] M. Wang and G. Caetano-Anollés, “The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world,” Structure, vol. 17, no. 1, pp. 66–78, 2009.

[93] G. Caetano-Anollés and A. Nasir, “Benefits of using molecular structure and abundance in phylogenomic analysis,” Frontiers in Genetics, vol. 3, article 172, 2012.

[94] K. Brading and E. Castellani, Symmetries in Physics: Philosophical Reflections, Cambridge University Press, Cambridge, UK, 2003.

[95] J. A. Cotton and M. Wilkinson, “Quantifying the potential utility of phylogenetic characters,” Taxon, vol. 57, no. 1, pp. 131–136, 2008.

[96] E. Mossel and M. Steel, “A phase transition for a random cluster model on phylogenetic trees,” Mathematical Biosciences, vol. 187, no. 2, pp. 189–203, 2004.

[97] M. H. Bailor, X. Sun, and H. M. Al-Hashimi, “Topology links RNA secondary structure with global conformation, dynamics, and adaptation,” Science, vol. 327, no. 5962, pp. 202–206, 2010.

[98] K. Illergård, D. H. Ardell, and A. Elofsson, “Structure is three to ten times more conserved than sequence: a study of structural response in protein cores,” Proteins: Structure, Function, and Bioinformatics, vol. 77, no. 3, pp. 499–508, 2009.

[99] G. Caetano-Anollés, S. K. Hee, and J. E. Mittenthal, “The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 22, pp. 9358–9363, 2007.

[100] P. H. Weston, “Methods for rooting cladistic trees,” in Models in Phylogeny Reconstruction, D. J. Siebert, R. W. Scotland, and D. M. Williams, Eds., pp. 125–155, Oxford University Press, Oxford, UK, 1994.

[101] H. N. Bryant, “The polarization of character transformations in phylogenetic systematics: role of axiomatic and auxiliary assumptions,” Systematic Biology, vol. 40, no. 4, pp. 433–445, 1991.

[102] H. N. Bryant and G. Wagner, “Character polarity and the rooting of cladograms,” in The Character Concept in Evolutionary Biology, G. P. Wagner, Ed., pp. 319–338, Academic Press, New York, NY, USA, 2001.

[103] G. Nelson, “Ontogeny, phylogeny, paleontology, and the biogenetic law,” Systematic Biology, vol. 27, no. 3, pp. 324–345, 1978.

[104] P. H. Weston, “Indirect and direct methods in systematics,” in Ontogeny and Systematics, C. J. Humphries, Ed., pp. 27–56, Columbia University Press, New York, NY, USA, 1988.

[105] F.-J. Sun and G. Caetano-Anollés, “The evolutionary history of the structure of 5S ribosomal RNA,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 430–443, 2009.

[106] X. Fang, T. Pan, and T. R. Sosnick, “Mg2+-dependent folding of a large ribozyme without kinetic traps,” Nature Structural Biology, vol. 6, no. 12, pp. 1091–1095, 1999.

[107] F.-J. Sun and G. Caetano-Anollés, “Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses,” PLoS Computational Biology, vol. 4, no. 3, Article ID e1000018, 2008.

[108] F.-J. Sun and G. Caetano-Anollés, “The origin and evolution of tRNA inferred from phylogenetic analysis of structure,” Journal of Molecular Evolution, vol. 66, no. 1, pp. 21–35, 2008.

[109] W. R. Ashby, An Introduction to Cybernetics, Taylor & Francis, London, UK, 1955.

[110] J. G. Lundberg, “Wagner networks and ancestors,” Systematic Biology, vol. 21, no. 4, pp. 398–413, 1972.

[111] P. T. Chippindale and J. J. Wiens, “Weighting, partitioning, and combining characters in phylogenetic analysis,” Systematic Biology, vol. 43, no. 2, pp. 278–287, 1994.

[112] J. Mittenthal, D. Caetano-Anollés, and G. Caetano-Anollés, “Biphasic patterns of diversification and the emergence of modules,” Frontiers in Genetics, vol. 3, article 147, 2012.

[113] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[114] C. A. Orengo, A. Michie, S. Jones, D. T. Jones, M. Swindells, and J. M. Thornton, “CATH: a hierarchic classification of protein domain structures,” Structure, vol. 5, no. 8, pp. 1093–1109, 1997.

[115] G. Sawa, J. Dicks, and I. N. Roberts, “Current approaches to whole genome phylogenetic analysis,” Briefings in Bioinformatics, vol. 4, no. 1, pp. 63–74, 2003.

[116] R. Sathyapriya, J. M. Duarte, H. Stehr, I. Filippis, and M. Lappe, “Defining an essence of structure determining residue contacts in proteins,” PLoS Computational Biology, vol. 5, no. 12, Article ID e1000584, 2009.

[117] D. Penny and L. J. Collins, “Evolutionary genomics leads the way,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 1–16, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[118] M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.

[119] A. Andreeva, D. Howorth, J. Chandonia et al., “Data growth and its impact on the SCOP database: new developments,” Nucleic Acids Research, vol. 36, pp. D419–D425, 2008.

[120] L. S. Yafremava, M. Wielgos, S. Thomas et al., “A general framework of persistence strategies for biological systems helps explain domains of life,” Frontiers in Genetics, vol. 4, article 16, 2013.

[121] J. L. Friar, T. Goldman, and J. Pérez-Mercader, “Genome sizes and the Benford distribution,” PLoS ONE, vol. 7, no. 5, Article ID e36624, 2012.

[122] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya,” BMC Evolutionary Biology, vol. 12, article 156, 2012.

[123] A. Nasir and G. Caetano-Anollés, “Comparative analysis of proteomes and functionomes provides insights into origins of cellular diversification,” Archaea, vol. 2013, Article ID 648746, 13 pages, 2013.

[124] W. Martin and M. Müller, “The hydrogen hypothesis for the first eukaryote,” Nature, vol. 392, no. 6671, pp. 37–41, 1998.

[125] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Global patterns of protein domain gain and loss in superkingdoms,” PLoS Computational Biology, vol. 10, no. 1, Article ID e1003452, 2014.

[126] M. Wang, L. S. Yafremava, D. Caetano-Anollés, J. E. Mittenthal, and G. Caetano-Anollés, “Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world,” Genome Research, vol. 17, no. 11, pp. 1572–1585, 2007.

[127] L. Guy and T. J. G. Ettema, “The archaeal “TACK” superphylum and the origin of eukaryotes,” Trends in Microbiology, vol. 19, no. 12, pp. 580–587, 2011.

[128] D. Schlieper, M. A. Oliva, J. M. Andreu, and J. Löwe, “Structure of bacterial tubulin BtubA/B: evidence for horizontal gene transfer,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 26, pp. 9170–9175, 2005.

[129] M. P. Hoeppner, P. P. Gardner, and A. M. Poole, “Comparative analysis of RNA families reveals distinct repertoires for each domain of life,” PLoS Computational Biology, vol. 8, no. 11, Article ID e1002752, 2012.

[130] H. Xue, K. Tong, C. Marck, H. Grosjean, and J. T. Wong, “Transfer RNA paralogs: evidence for genetic code-amino acid biosynthesis coevolution and an archaeal root of life,” Gene, vol. 310, no. 1-2, pp. 59–66, 2003.

[131] H. Xue, S. Ng, K. Tong, and J. T. Wong, “Congruence of evidence for a Methanopyrus-proximal root of life based on transfer RNA and aminoacyl-tRNA synthetase genes,” Gene, vol. 360, no. 2, pp. 120–130, 2005.

[132] M. Di Giulio, “Nanoarchaeum equitans is a living fossil,” Journal of Theoretical Biology, vol. 242, no. 1, pp. 257–260, 2006.

[133] M. Di Giulio, “The tree of life might be rooted in the branch leading to Nanoarchaeota,” Gene, vol. 401, no. 1-2, pp. 108–113, 2007.

[134] J. T. Wong, J. Chen, W. Mat, S. Ng, and H. Xue, “Polyphasic evidence delineating the root of life and roots of biological domains,” Gene, vol. 403, no. 1-2, pp. 39–52, 2007.

[135] F.-J. Sun and G. Caetano-Anollés, “The ancient history of the structure of ribonuclease P and the early origins of Archaea,” BMC Bioinformatics, vol. 11, article 153, 2010.

[136] C. Marck and H. Grosjean, “tRNomics: analysis of tRNA genes from 50 genomes of Eukarya, Archaea, and bacteria reveals anticodon-sparing strategies and domain-specific features,” RNA, vol. 8, no. 10, pp. 1189–1232, 2002.

[137] W. Mat, H. Xue, and J. T. Wong, “The genomics of LUCA,” Frontiers in Bioscience, vol. 13, no. 14, pp. 5605–5613, 2008.

[138] C. Brochier, S. Gribaldo, Y. Zivanovic, F. Confalonieri, and P. Forterre, “Nanoarchaea: representatives of a novel archaeal phylum or a fast-evolving euryarchaeal lineage related to Thermococcales?” Genome Biology, vol. 6, no. 5, article R42, 2005.

[139] B. Billoud, M. Guerrucci, M. Masselot, and J. S. Deutsch, “Cirripede phylogeny using a novel approach: molecular morphometrics,” Molecular Biology and Evolution, vol. 17, no. 10, pp. 1435–1445, 2000.

[140] L. J. Collins, V. Moulton, and D. Penny, “Use of RNA secondary structure for studying the evolution of RNase P and RNase MRP,” Journal of Molecular Evolution, vol. 51, no. 3, pp. 194–204, 2000.

[141] G. Caetano-Anollés, “Novel strategies to study the role of mutation and nucleic acid structure in evolution,” Plant Cell, Tissue and Organ Culture, vol. 67, no. 2, pp. 115–132, 2001.

[142] F.-J. Sun, S. Fleurdepine, C. Bousquet-Antonelli, G. Caetano-Anollés, and J. Deragon, “Common evolutionary trends for SINE RNA structures,” Trends in Genetics, vol. 23, no. 1, pp. 26–33, 2007.

[143] D. Caetano-Anollés, K. M. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Proteome evolution and the metabolic origins of translation and cellular life,” Journal of Molecular Evolution, vol. 72, no. 1, pp. 14–33, 2011.

[144] M. Wang, Y. Jiang, K. M. Kim et al., “A universal molecular clock of protein folds and its power in tracing the early history of aerobic metabolism and planet oxygenation,” Molecular Biology and Evolution, vol. 28, no. 1, pp. 567–582, 2011.

[145] S. A. Bukhari and G. Caetano-Anollés, “Origin and evolution of protein fold designs inferred from phylogenomic analysis of CATH domain structures in proteomes,” PLoS Computational Biology, vol. 9, no. 3, Article ID e1003009, 2013.

[146] M. Vesteg and J. Krajčovič, “The falsifiability of the models for the origin of eukaryotes,” Current Genetics, vol. 57, no. 6, pp. 367–390, 2011.

[147] I. N. Berezovsky and E. I. Shakhnovich, “Physics and evolution of thermophilic adaptation,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 36, pp. 12742–12747, 2005.

[148] C. Woese, “The universal ancestor,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 12, pp. 6854–6859, 1998.

[149] M. J. Seufferheld and G. Caetano-Anollés, “Phylogenomics supports a cellularly structured urancestor,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 178–191, 2013.

[150] O. Kandler, “Cell wall biochemistry and three-domain concept of life,” Systematic and Applied Microbiology, vol. 16, no. 4, pp. 501–509, 1994.

[151] C. R. Woese, “On the evolution of cells,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 13, pp. 8742–8747, 2002.

[152] S. L. Baldauf, J. D. Palmer, and W. F. Doolittle, “The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny,” Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 15, pp. 7749–7754, 1996.

[153] H. Leffers, J. Kjems, L. Østergaard, N. Larsen, and R. A. Garrett, “Evolutionary relationships amongst archaebacteria. A comparative study of 23 S ribosomal RNAs of a sulphur-dependent extreme thermophile, an extreme halophile and a thermophilic methanogen,” Journal of Molecular Biology, vol. 195, no. 1, pp. 43–61, 1987.

[154] J. Martin, D. Blackburn, and E. Wiley, “Are node-based and stem-based clades equivalent? Insights from graph theory,” PLoS Currents, vol. 2, Article ID RRN1196, 2010.

that the emerging archaeal lineages were subjected to strong reductive evolutionary pressures during the early evolution of a very ancient RNA molecule.

6.3. Phylogenomic Evidence from Protein Domain Structure. G. Caetano-Anollés and D. Caetano-Anollés [41] were the first to utilize protein domain structures as taxa and to reconstruct trees of domains (ToDs) describing their evolution. Figure 10 shows a rooted ToD built from a census of FSF structures in 981 genomes (data taken from [122, 123]). These trees are unique in that their terminal leaves represent a finite set of component parts [87]. These parts describe at global level the structural diversity of the protein world. When building trees of FSFs, the age of each FSF domain structure can be calculated from the ToDs by simply counting the number of nodes between the root of the tree and its corresponding terminal leaf. This is possible because ToDs exhibit highly unbalanced (pectinate) topologies that are the result of semipunctuated processes of domain appearance and accumulation. This node distance (nd), rescaled from 0 to 1, provides a relative timescale to study the order of FSF appearance in evolutionary history [126, 143]. Wang et al. [144] showed that nd correlates linearly with geological time and defines a global molecular clock of protein folds. Thus, nd can be used as a reliable proxy for time. Plotting the age of FSFs in each of the seven taxonomic groups confirmed evolutionary statements we had previously deduced from the Venn diagrams of Figure 6. The ABE taxonomic group included the majority of ancient and widely distributed FSFs. This is expected. In the presence of a strong vertical trace, molecular diversity must delimit a nested taxonomic hierarchy. The ABE group was followed by the evolutionary appearance of the BE group, which preceded the first domain-specific structures, which were Bacteria-specific (B). Remarkably, Archaea-specific (A) and Eukarya-specific (E) structures appeared concurrently and relatively late. These general trends, captured in the box plots of Figure 10, have been recovered repeatedly when studying domain structures at various levels of structural complexity, from folds to fold families [91, 126], when using CATH or SCOP structural definitions [122, 145], or when exploring the evolution of terminal GO terms.
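The node distance calculation described above can be illustrated with a minimal sketch. It assumes a rooted tree stored as a child-to-parent map, counts the branches walked from each leaf up to the root, and applies a min-max rescaling so that values fall between 0 and 1. The function name, the toy tree, its labels, and the choice of min-max rescaling are all hypothetical and are not taken from the cited studies.

```python
from typing import Dict, Iterable, Optional


def node_distances(parent: Dict[str, Optional[str]],
                   leaves: Iterable[str]) -> Dict[str, float]:
    """Return a rescaled node distance (nd) for every leaf.

    `parent` maps each node to its parent; the root maps to None.
    The min-max rescaling to [0, 1] is an assumption of this sketch.
    """
    def steps_to_root(node: str) -> int:
        # Count how many ancestors lie between this node and the root.
        steps = 0
        up = parent[node]
        while up is not None:
            steps += 1
            up = parent[up]
        return steps

    raw = {leaf: steps_to_root(leaf) for leaf in leaves}
    lowest, highest = min(raw.values()), max(raw.values())
    span = highest - lowest
    return {leaf: (d - lowest) / span if span else 0.0
            for leaf, d in raw.items()}


# Hypothetical pectinate (comb-like) tree: each internal node splits off
# one leaf and passes the remaining taxa to the next internal node.
toy_parent = {
    "root": None,
    "fsf_a": "root", "n1": "root",
    "fsf_b": "n1", "n2": "n1",
    "fsf_c": "n2", "fsf_d": "n2",
}
toy_nd = node_distances(toy_parent, ["fsf_a", "fsf_b", "fsf_c", "fsf_d"])
print(toy_nd)
```

In this toy tree the leaf that branches off at the root receives the smallest nd and the leaves in the most derived pair receive the largest, which is the ordering that a pectinate topology makes interpretable as relative age.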

While the early rise of BE FSFs supports the early divergence of Archaea from the urancestor, the very significant trend of gradual loss of structures occurring in the lineages of the archaeal domain and the very late appearance of Archaea-specific structures (e.g., [126]) demand explanation. Since ancient BE FSFs are widely distributed in proteomes (Figure 10), they cannot arise from separate gains of FSFs in Bacteria and Eukarya or by processes of horizontal spread of structures. This was already evident from the Venn diagrams of Figure 6. Moreover, the BE sisterhood to the exclusion of Archaea was further supported by the inspection of FSFs involved in lipid synthesis and transport (Table 1). Membrane lipids are very relevant to the origins of diversified life ([146] and references therein). Bacteria and Eukarya encode similar lipid membranes, while archaeal membranes have different lipid composition (isoprenoid ethers). To check if

[Figure 10 image: rooted tree of domains with box plots of FSF age (nd, axis from 0.0 to 1.0) for the taxonomic groups ABE, BE, AB, AE, B, A, and E, ordered by origin of FSF domains. Tree statistics: taxa = 1,733 FSF domains; characters = 981 genomes; tree length = 458,226 steps; CI = 0.049, RI = 0.796, and g1 = −0.08. Basal FSFs labeled on the tree: c.55.1, c.37.1, b.40.4, c.2.1, a.4.5, and c.66.1. Legend (distribution index f, with number of FSFs in parentheses): f between 0.0 and 0.1 (529); f between 0.1 and 0.5 (623); f between 0.5 and 0.9 (336); f between 0.9 and 1.0 (228); f = 1.0 (17).]

Figure 10: Phylogenomic tree of domains (ToD) describing the evolution of 1,733 FSF domain structures. Taxa (FSFs) were colored according to their distribution (f) in the 981 genomes that were surveyed and used as characters to reconstruct the phylogenomic tree (data taken from [122, 123]). The most basal FSFs are labeled with SCOP alphanumeric identifiers (e.g., c.37.1 is the P-loop containing nucleoside triphosphate hydrolase FSF). Box plots display the age (nd value) distribution of FSFs for the seven possible taxonomic groups. nd values were calculated directly from the tree [122] and define a timeline of FSF innovation from the origin of proteins (nd = 0) to the present (nd = 1). The group of FSFs that are shared by the three domains of life (ABE) is the most ancient taxonomic group, which spans the entire time axis, and their FSFs are widely distributed in genomes. The appearance of the BE group coincides with the first reductive loss of an FSF in Archaea. FSF structures specific to domains of life appear much later in evolution.

lipid synthesis was another BE synapomorphy, we identified 17 FSFs that were involved in lipid metabolism and transport


Table 1: List of FSFs involved in lipid metabolism and transport, along with taxonomic distribution (data taken from [122, 123]).

Group  SCOP Id  FSF Id   FSF description
ABE    89392    b.125.1  Prokaryotic lipoproteins and lipoprotein localization factors
ABE    53092    c.55.2   Creatinase/prolidase N-terminal domain
ABE    49723    b.12.1   Lipase/lipooxygenase domain (PLAT/LH2 domain)
ABE    54637    d.38.1   Thioesterase/thiol ester dehydrase-isomerase
ABE    63825    b.68.5   YWTD domain
BE     47027    a.11.1   Acyl-CoA binding protein
BE     48431    a.118.4  Lipovitellin-phosvitin complex, superhelical domain
BE     55048    d.58.23  Probable ACP-binding domain of malonyl-CoA ACP transacylase
BE     56968    f.7.1    Lipovitellin-phosvitin complex, beta-sheet shell regions
BE     58113    h.5.1    Apolipoprotein A-I
BE     47162    a.24.1   Apolipoprotein
BE     56931    f.4.2    Outer membrane phospholipase A (OMPLA)
B      82220    b.120.1  Tp47 lipoprotein, N-terminal domain
E      47699    a.52.1   Bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin
E      82936    h.6.1    Apolipoprotein A-II
E      57190    g.3.10   Colipase-like
E      49594    b.7.4    Rab geranylgeranyltransferase alpha-subunit, insert domain

(Table 1). Remarkably, the largest share of these FSFs (7 out of 17) was unique to the BE group. In comparison, none were present in either the AE or AB groups. The ABE group included five universal FSFs, while one was unique to Bacteria and four were Eukarya-specific. In turn, no FSF was unique to Archaea. The BE FSFs cannot be explained by modern effects impinging on variations in proteomic accumulation in FSFs or by processes of domain rearrangement, since these appear in the protein world quite late in evolution [92]. The only and most parsimonious explanation of the patterns of FSF distribution that unfold in the ToD is the very early (and protracted) rise of the archaeal domain by processes of reductive evolution, possibly triggered by the adaptation of urancestral lineages to harsh environments and survival modes. Under extremophilic conditions, typical of hyperthermophilic environments, considerable investments of matter-energy and information must be made for protein persistence [120]. This puts limits on viable protein structures [147]. Extremophilic environments will thus poise the maintenance of a limited set of FSFs for persistence of emergent diversified lineages. This would induce a primordial episode of reductive evolution in the growing FSF repertoire, explaining why hyperthermophilic and thermophilic archaeal species hold the most reduced proteomes (Figure 7). It would also explain the biases that exist in FSFs, GO terms, and RNA families (Figure 6) and the placement of hyperthermophilic and thermophilic archaeal species at the base of ToLs. Since Archaea populate the oceans and sometimes rival Bacteria in number in those environments, we further interpret the late appearance of Archaea-specific FSFs as the result of late colonization of these mild environments by both ancient archaeons and emerging eukaryotes. This relaxes primordial extremophilic pressures on protein structures and enables the late archaeal exploration of structural flexibility and functional novelty.
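The group counts quoted from Table 1 can be checked with a short tally. The sketch below is illustrative only: the rows are transcribed from Table 1, the dotted SCOP identifiers reflect our reading of the table, and the names `TABLE_1` and `tally_by_group` are hypothetical helpers rather than part of any published pipeline.

```python
from collections import Counter

# (taxonomic group, FSF id) pairs transcribed from Table 1.
# The dotted FSF ids are an assumption about how the table should be read.
TABLE_1 = [
    ("ABE", "b.125.1"), ("ABE", "c.55.2"), ("ABE", "b.12.1"),
    ("ABE", "d.38.1"), ("ABE", "b.68.5"),
    ("BE", "a.11.1"), ("BE", "a.118.4"), ("BE", "d.58.23"),
    ("BE", "f.7.1"), ("BE", "h.5.1"), ("BE", "a.24.1"), ("BE", "f.4.2"),
    ("B", "b.120.1"),
    ("E", "a.52.1"), ("E", "h.6.1"), ("E", "g.3.10"), ("E", "b.7.4"),
]


def tally_by_group(rows):
    """Count how many FSFs fall in each taxonomic group."""
    return Counter(group for group, _fsf in rows)


counts = tally_by_group(TABLE_1)

# Report all seven possible groups; a Counter returns 0 for absent groups.
for group in ("ABE", "BE", "AB", "AE", "A", "B", "E"):
    print(f"{group:>3}: {counts[group]}")
```

Listing all seven groups, including the empty ones, makes the absence of AB, AE, and Archaea-specific lipid FSFs as visible as the size of the BE group.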

6.4. Full Circle: Evidence from Trees of Proteomes and Functionomes and a Tree Derived from the Distribution of Viral Replicons in Superkingdoms. While we distrust trees of systems, especially ToLs built from sequences, the use of molecular structure at high levels of structural abstraction has the potential to mitigate some limitations of sequence analysis [93]. For example, rooted ToLs built from abundance counts of domain structures and terminal GO terms in the genomes of free-living organisms describe the evolution of proteomes (e.g., [91]). All ToL reconstructions of these kinds approximate the physiology of living organisms, dissect the three primary domains of life, and reveal the early paraphyletic origin of extremophilic archaeal lineages followed by the late appearances of monophyletic Bacteria and Eukarya. These patterns have been reliably recovered with datasets of varying sizes, irrespective of the structural classification scheme [91, 122, 126, 145]. Even a tree reconstructed from the distribution of 2662 viral replicons in superkingdoms from an exhaustive comparative genomic analysis of viral genomes showed the basal placement of Archaea and the sister taxa relationship between Bacteria and Eukarya (Figure 11).
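For readers unfamiliar with the tree statistics reported for Figure 11 (CI, RI, HI, and RC), the following minimal sketch computes them from the standard ensemble definitions used in parsimony analysis: CI = M/S, RI = (G − S)/(G − M), HI = 1 − CI, and RC = CI × RI, where M, S, and G are the minimum, observed, and maximum possible numbers of steps. The function name and the value used for G in the usage lines are assumptions made for illustration; only the 58 observed steps and the resulting index values come from the figure.

```python
from typing import NamedTuple


class ParsimonyIndices(NamedTuple):
    ci: float  # consistency index
    ri: float  # retention index
    hi: float  # homoplasy index
    rc: float  # rescaled consistency index


def parsimony_indices(min_steps, observed_steps, max_steps):
    """Compute ensemble CI, RI, HI, and RC from step counts summed over characters."""
    if not 0 < min_steps <= observed_steps <= max_steps:
        raise ValueError("expected 0 < min_steps <= observed_steps <= max_steps")
    if max_steps == min_steps:
        raise ValueError("RI is undefined when max_steps equals min_steps")
    ci = min_steps / observed_steps
    ri = (max_steps - observed_steps) / (max_steps - min_steps)
    return ParsimonyIndices(ci=ci, ri=ri, hi=1 - ci, rc=ci * ri)


# A tree of 58 steps with CI = 1 implies that the minimum is also 58 steps.
# max_steps=120 is a hypothetical value; RI equals 1 for any G > 58 here.
fig11 = parsimony_indices(min_steps=58, observed_steps=58, max_steps=120)
print(fig11)
```

A tree with no homoplasy, as in Figure 11, attains the extreme values of all four indices regardless of G, which is why the placeholder value does not affect the result.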

6.5. Additional Evidence from Comparative Genomics. The uneven distribution of protein domain structures in the world


[Figure 11 graphic: tree of the three superkingdoms (Archaea, Bacteria, and Eukarya) with a scale bar of 5 changes, a bootstrap support value of 100, and abundance vectors at internal nodes.]

Figure 11: One of three optimal phylogenetic tree reconstructions with identical topologies recovered from an exhaustive maximum parsimony search (58 steps; CI = 1, RI = 1, HI = 0, RC = 1, g1 = −0.707) of the abundance of viral replicon types of dsDNA, ssDNA, dsRNA, ssRNA(+), ssRNA(−), and retrotranscribing viruses. Abundance was scored on a 0–20 scale and ranged from 0 to 759 viral replicons. Vectors of abundance reconstructions in internal nodes are given as percentage of total abundance of replicons in superkingdoms. Bootstrap support values are shown below nodes.

of proteomes (Figure 6) is preserved as we climb in the structural hierarchy. This was recently made evident when studying the evolution of CATH domains [145]. The Venn diagrams of Figure 12 show how domain structures in all taxonomic groups decrease in numbers with increases in evolutionary conservation. At the highest CATH architectural level, there were 32 universal architectures but no domain-specific architectures. The four architectures shared by Bacteria and Eukarya were present in at least 60% of the proteomes that were surveyed. The other two interdomain architectures were topological designs of considerable complexity that were poorly shared between proteomes and were evolutionarily derived. They were the clam architectures, lost in Eukarya, and the box architectures of nucleotide excision repair, shared by Archaea and Eukarya. The most parsimonious corollary of these distribution patterns is that the BE taxonomical group must arise by loss of structures in Archaea. Indeed, ToDs describing the evolution of CATH domain structures again confirm the early appearance of BE structures and, consequently, their loss in Archaea.
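The sorting of structures into Venn groups can be sketched as follows. This is an illustrative helper under stated assumptions: the structure names are placeholders, the function names are hypothetical, and the toy input simply mirrors the architecture-level counts given above (32 universal, four BE, one shared only by Archaea and Bacteria, and one shared only by Archaea and Eukarya), which is our reading of the text and of Figure 12.

```python
from collections import Counter

DOMAIN_ORDER = "ABE"  # Archaea, Bacteria, Eukarya


def venn_group(domains):
    """Return the Venn label (e.g. 'ABE', 'BE', 'A') for a set of domain letters."""
    unknown = set(domains) - set(DOMAIN_ORDER)
    if unknown:
        raise ValueError(f"unknown domain letters: {sorted(unknown)}")
    return "".join(letter for letter in DOMAIN_ORDER if letter in domains)


def count_groups(presence):
    """Count structures per Venn group; structures present nowhere are skipped."""
    return Counter(venn_group(d) for d in presence.values() if d)


# Toy presence/absence data with placeholder names (not real CATH identifiers).
presence = {f"universal_{i}": {"A", "B", "E"} for i in range(32)}
presence.update({f"be_{i}": {"B", "E"} for i in range(4)})
presence["clam"] = {"A", "B"}  # absent from Eukarya
presence["box"] = {"A", "E"}   # absent from Bacteria

counts = count_groups(presence)
print(dict(counts))
```

With real data, the `presence` mapping would be built from a genomic census; the grouping logic itself does not change with the level of the structural hierarchy.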

7. Paraphyletic Origins, Grades, and Clades in Archaeal History

Saying that “a ToL is rooted in a domain of life” is an incorrect statement that comes from phylogenetic methodology (the use of outgroups) and the tendency to look at the past with modern eyes. Clades in ToLs have been rooted relative to each other by generating unrooted trees and by defining extant organisms as outgroup taxa. The ToL, however, must be considered rooted in the urancestor of cellular life (Figure 1). This planted edge that connects to the ingroup of extant organisms represents a cellular state in which productive diversification (in the sense of successful lineages) was absent. The primordial urancestral edge leads to a “phase transition,” the last universal cellular ancestor (LUCA), of which little is known. The physiology of the urancestor cannot be considered linked to that of any extant organism, even if it shared a common molecular core with all of them. The urancestor was not an archaeon, a bacterium, or a eukaryote [148]. It was not necessarily thermophilic. Perhaps it was a communal entity or a megametaorganism in the sense of a modern syncytium (the result of multiple cell fusions) and a modern coenocyte (the result of multiple cell divisions). The organismal boundaries were likely present, judging by the number of widely distributed protein domains that associate with membranes and appear at the base of our ToDs and by the universal existence of acidocalcisome organelles [149]. However, the molecular makeup of the urancestral cells was most likely fluid and quasistatistical, with the repertoire distributed unequally in the urancestral populations of communal parts, of course within confines delimited by persistence. This urancestral population is therefore consistent with the idea of a primordial stem line proposed by Kandler and Woese [150, 151]. However, it was relatively richer in molecular structures and functions, as opposed to the simple cellular systems hypothesized by Woese. This richness is confirmed by modern analyses of proteomes and functionomes that reveal vast numbers of universally shared protein domains and GOs among the three superkingdoms. While each syncytial/coenocyte element of the megaorganism exchanged component parts in search of cellular stability and persistence, the process could not be equated with modern HGT. The exchanging community of primordial cells was not cohesive enough to make the horizontal exchange meaningful. Macromolecules most likely established loose and diverse associations with each other and with smaller molecules, limited by the short average life of their unevolved structural conformations. With time, molecules with better-optimized properties engaged in more durable interactions, stabilizing the emergent cells and providing increased cellular cohesiveness. This poised the urancestral community towards a phase transition (a crystallization [148]), a point in which cellular groups had distinct properties and could be individuated. We believe this was the time of the origin of the archaeal lineage, 2.9 billion years ago [48].

At the base of the ToLs that were reconstructed from genomic data, basal archaeal taxa arise as paraphyletic lineages (Figures 4(a) and 4(c)). These lineages likely arose from subgroups of the urancestral population that pervasively lost crucial domain structures and molecular functions. This represents an evolutionary grade under the scenario described above. The emerging lineages shared with the urancestral community a unifying condition that was related to archaic biochemistry. In other words, the urancestral and emerging archaeal lineages expressed fundamental structural and functional equivalences in terms of their repertoires but revealed, in each emerging and durable paraphyletic lineage, a handful of distinct, newly developed traits. These traits could be global, such as increased thermostability of some crucial members of the protein repertoire or change in the membrane makeup, or local, such as the selective loss of crucial structures and functions. Figure 13 uses the tree paradigm to portray the structural and functional equivalences of the urancestral and emerging archaeal lineages and the slow progression from grades to clades.


[Figure 12 graphic: three Venn diagrams of Archaea, Bacteria, and Eukarya, one each for Superfamilies, Topologies, and Architectures, ordered by increasing evolutionary conservation.]

Figure 12: Venn diagrams displaying the distributions of 2221 homologous superfamilies, 1152 topologies, and 38 architectures of CATH domains in the proteomes of 492 fully sequenced genomes (from [145]). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

[Figure 13 graphic: cartoon of lineages progressing from grades to clades between Time 0 and Time 1.]

Figure 13: From grades to clades. The cartoon describes a possible progression of modes of organismal diversification during the rise of primordial archaeal lineages. The width of emerging lineages is proportional to uniquely identifying features of physiological and molecular complexity.

8. Through the Wormhole: The Makeup of the “Megaorganism” and the Emerging Archaeal Lineages

Character state reconstructions of proteome repertoires derived from ToLs, coupled to the timelines of ToDs, provide an effective way to define the ancestral protein domain complement of the urancestor [48] and, consequently, the likely makeup of the emerging archaeal lineages. The urancestral proteome possessed a lower bound of ~70 FSF domain structures, 75% of which were composed of α/β and α+β proteins. About 50% of FSFs were part of metabolic enzymes, including a rich toolkit of transferases and enzymes of nucleotide metabolism. The rest of the domains were involved in functions related to information (translation, replication, and repair), intracellular processes (transport, protein modification, and proteolytic activities), regulation (kinases/phosphatases and DNA binding functions), and small molecule binding. The urancestor had a limited repertoire of aaRSs and translation factors. It contained a primordial ribosome with a limited core of universal ribosomal proteins. It had numerous membrane proteins necessary for transport, including a relatively advanced ATP synthetase complex, and structures necessary for cellular organization (filaments and primordial cytoskeletal structures). The cells lacked enzymes for deoxyribonucleotide production, so it is likely that the cellular urancestor itself did not harbor a DNA genome. The cells lacked functions related to extracellular processes (cell adhesion, immune response, and toxins/defense) and cellular motility, suggesting an ancient living world without competitive strategies of survival.

9. Conclusions

The rooting of the ToL has always been controversial in evolutionary biology [26, 152, 153]. While it is popularly accepted that the ToL based on sequence phylogenies is rooted in the akaryotes and that Archaea and Eukarya are sister groups to each other, only two of the three main steps of phylogenetic analysis [104] have been partially fulfilled with sequences. These are the selection of an appropriate statistical or nonstatistical evolutionary model of character change and of an optimization method for phylogenetic tree reconstruction. However, no adequate method exists for character polarization that identifies ancestral and derived character states in sequences. In the absence of robust polarization methodology, any statement about the rooting of the ToL should be considered suspect or subject to apriorism. Here we show that information derived from a genomic structural and functional census of millions of encoded proteins and RNAs, coupled with process models that comply with Weston’s generality criterion, provides the means to dissect the origins of diversified life. The generality criterion is fulfilled in these studies by focusing on the accumulation of modules such as protein domain structures, elements of RNA substructures, or ontological definitions of molecular function. In general, these features are the subject of accretion processes that comply with additive phylogenetic change within the nested taxonomic hierarchy and result in changes of abundance. These processes include those responsible for the growth of molecules (e.g., multidomain proteins), molecular ensembles (e.g., the ribosome), and molecular repertoires (e.g., proteomes). The new methods unfold a consistent evolutionary scenario in which the origin of diversified life traces back to the early history of Archaea. Remarkably, the archaic origin of this microbial urkingdom now does justice to its name.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgments

The authors would like to thank members and friends of the GCA laboratory for fruitful discussions. This research has been supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791) and the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625) to Gustavo Caetano-Anollés and from the KRIBB Research Initiative Program and the Next-Generation BioGreen 21 Program, Rural Development Administration (PJ0090192013), to Kyung Mo Kim.

References

[1] C. R. Woese, “A new biology for a new century,” Microbiology and Molecular Biology Reviews, vol. 68, no. 2, pp. 173–186, 2004.

[2] T. Grant, “Testing methods: the evaluation of discovery operations in evolutionary biology,” Cladistics, vol. 18, no. 1, pp. 94–111, 2002.

[3] E. O. Wiley, “Karl R. Popper, systematics, and classification: a reply to Walter Bock and other evolutionary taxonomists,” Systematic Zoology, vol. 24, no. 2, pp. 233–243, 1975.

[4] J. S. Farris, “Parsimony and explanatory power,” Cladistics, vol. 24, no. 5, pp. 825–847, 2008.

[5] G. Caetano-Anollés, K. M. Kim, and D. Caetano-Anollés, “The phylogenomic roots of modern biochemistry: origins of proteins, cofactors and protein biosynthesis,” Journal of Molecular Evolution, vol. 74, no. 1-2, pp. 1–34, 2012.

[6] W. Fontana, “Modeling ‘evo-devo’ with RNA,” BioEssays, vol. 24, no. 12, pp. 1164–1177, 2002.

[7] W. Hennig, Phylogenetic Systematics, University of Illinois Press, Urbana, Ill, USA, 1999.

[8] D. Fraix-Burnet, T. Chattopadhyay, A. K. Chattopadhyay, E. Davoust, and M. Thuillard, “A six-parameter space to describe galaxy diversification,” Astronomy and Astrophysics, vol. 545, article A80, 24 pages, 2012.

[9] E. Sober, “The contest between parsimony and likelihood,” Systematic Biology, vol. 53, no. 4, pp. 644–653, 2004.

[10] E. K. Lienau and R. DeSalle, “Is the microbial tree of life verificationist?” Cladistics, vol. 26, no. 2, pp. 195–201, 2010.

[11] J. S. Huxley, “Evolutionary processes and taxonomy with special reference to grades,” Uppsala Universitets Arsskrift, vol. 6, pp. 21–39, 1958.

[12] C. R. Woese and G. E. Fox, “Phylogenetic structure of the prokaryotic domain: the primary kingdoms,” Proceedings of the National Academy of Sciences of the United States of America, vol. 74, no. 11, pp. 5088–5090, 1977.

[13] P. Forterre, “The universal tree of life and the last universal cellular ancestor: revolution and counterrevolutions,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 43–62, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[14] J. P. Gogarten, H. Kibak, P. Dittrich et al., “Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 17, pp. 6661–6665, 1989.

[15] N. Iwabe, K. Kuma, M. Hasegawa, S. Osawa, and T. Miyata, “Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 23, pp. 9355–9359, 1989.

[16] J. P. Gogarten and L. Olendzenski, “Orthologs, paralogs and genome comparisons,” Current Opinion in Genetics and Development, vol. 9, no. 6, pp. 630–636, 1999.

[17] C. R. Woese, O. Kandler, and M. L. Wheelis, “Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya,” Proceedings of the National Academy of Sciences of the United States of America, vol. 87, no. 12, pp. 4576–4579, 1990.

[18] J. Martijn and T. J. Ettema, “From archaeon to eukaryote: the evolutionary dark ages of the eukaryotic cell,” Biochemical Society Transactions, vol. 41, no. 1, pp. 451–457, 2013.

[19] C. J. Cox, P. G. Foster, R. P. Hirt, S. R. Harris, and T. M. Embley, “The archaebacterial origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 51, pp. 20356–20361, 2008.

[20] P. G. Foster, C. J. Cox, and T. M. Embley, “The primary divisions of life: a phylogenomic approach employing composition-heterogeneous methods,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2197–2207, 2009.

[21] T. A. Williams, P. G. Foster, T. M. Nye, C. J. Cox, and T. M. Embley, “A congruent phylogenomic signal places eukaryotes within the Archaea,” Proceedings of the Royal Society B: Biological Sciences, vol. 279, no. 1749, pp. 4870–4879, 2012.

[22] S. Gribaldo, A. M. Poole, V. Daubin, P. Forterre, and C. Brochier-Armanet, “The origin of eukaryotes and their relationship with the Archaea: are we at a phylogenomic impasse?” Nature Reviews Microbiology, vol. 8, no. 10, pp. 743–752, 2010.

[23] H. Philippe and P. Forterre, “The rooting of the universal tree of life is not reliable,” Journal of Molecular Evolution, vol. 49, no. 4, pp. 509–523, 1999.

[24] E. Sober and M. Steel, “Testing the hypothesis of common ancestry,” Journal of Theoretical Biology, vol. 218, no. 4, pp. 395–408, 2002.

[25] P. Forterre, “The common ancestor of Archaea and Eukarya was not an archaeon,” Archaea, vol. 2013, Article ID 372396, 18 pages, 2013.

[26] P. Forterre and H. Philippe, “Where is the root of the universal tree of life?” BioEssays, vol. 21, no. 10, pp. 871–879, 1999.

[27] S. Gribaldo and H. Philippe, “Ancient phylogenetic relationships,” Theoretical Population Biology, vol. 61, no. 4, pp. 391–408, 2002.

[28] J. K. Harris, S. T. Kelley, G. B. Spiegelman, and N. R. Pace, “The genetic core of the universal ancestor,” Genome Research, vol. 13, no. 3, pp. 407–412, 2003.

[29] A. Harish and G. Caetano-Anollés, “Ribosomal history reveals origins of modern protein synthesis,” PLoS ONE, vol. 7, no. 3, Article ID e32776, 2012.

[30] G. Caetano-Anollés and M. J. Seufferheld, “The coevolutionary roots of biochemistry and cellular organization challenge the RNA world paradigm,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 152–177, 2013.

[31] N. R. Pace, “Mapping the tree of life: progress and prospects,” Microbiology and Molecular Biology Reviews, vol. 73, no. 4, pp. 565–576, 2009.

[32] P. de Rijk, Y. van de Peer, I. van den Broeck, and R. de Wachter, “Evolution according to large ribosomal subunit RNA,” Journal of Molecular Evolution, vol. 41, no. 3, pp. 366–375, 1995.

[33] G. Caetano-Anollés, “Tracing the evolution of RNA structure in ribosomes,” Nucleic Acids Research, vol. 30, no. 11, pp. 2575–2587, 2002.


[34] J. Mallatt and C. J. Winchell, “Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes,” Molecular Biology and Evolution, vol. 19, no. 3, pp. 289–301, 2002.

[35] M. Gerstein, “Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census,” Proteins: Structure, Function and Bioinformatics, vol. 33, no. 4, pp. 518–534, 1998.

[36] M. Gerstein and H. Hegyi, “Comparing genomes in terms of protein structure: surveys of a finite parts list,” FEMS Microbiology Reviews, vol. 22, no. 4, pp. 277–304, 1998.

[37] B. Snel, P. Bork, and M. A. Huynen, “Genome phylogeny based on gene content,” Nature Genetics, vol. 21, no. 1, pp. 108–110, 1999.

[38] F. Tekaia, A. Lazcano, and B. Dujon, “The genomic tree as revealed from whole proteome comparisons,” Genome Research, vol. 9, no. 6, pp. 550–557, 1999.

[39] Y. I. Wolf, S. E. Brenner, P. A. Bash, and E. V. Koonin, “Distribution of protein folds in the three superkingdoms of life,” Genome Research, vol. 9, no. 1, pp. 17–26, 1999.

[40] J. O. Korbel, B. Snel, M. A. Huynen, and P. Bork, “SHOT: a web server for the construction of genome phylogenies,” Trends in Genetics, vol. 18, no. 3, pp. 158–162, 2002.

[41] G. Caetano-Anollés and D. Caetano-Anollés, “An evolutionarily structured universe of protein architecture,” Genome Research, vol. 13, no. 7, pp. 1563–1571, 2003.

[42] S. L. Baldauf, A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle, “A kingdom-level phylogeny of eukaryotes based on combined protein data,” Science, vol. 290, no. 5493, pp. 972–977, 2000.

[43] J. R. Brown, C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope, “Universal trees based on large combined protein sequence data sets,” Nature Genetics, vol. 28, no. 3, pp. 281–285, 2001.

[44] F. D. Ciccarelli, T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P. Bork, “Toward automatic reconstruction of a highly resolved tree of life,” Science, vol. 311, no. 5765, pp. 1283–1287, 2006.

[45] T. Dagan, M. Roettger, D. Bryant, and W. Martin, “Genome networks root the tree of life between prokaryotic domains,” Genome Biology and Evolution, vol. 2, no. 1, pp. 379–392, 2010.

[46] S. Jun, G. E. Sims, G. A. Wu, and S. Kim, “Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 133–138, 2010.

[47] E. A. Schultes, P. T. Hraber, and T. H. LaBean, “No molecule is an island: molecular evolution and the study of sequence space,” in Algorithmic Bioprocesses, A. Condon, D. Harel, J. N. Kok, A. Salomaa, and E. Winfree, Eds., pp. 675–704, Springer, Berlin, Germany, 2009.

[48] K. M. Kim and G. Caetano-Anollés, “The proteomic complexity and rise of the primordial ancestor of diversified life,” BMC Evolutionary Biology, vol. 11, no. 1, article 140, 2011.

[49] V. A. Albert, Parsimony, Phylogeny, and Genomics, Oxford University Press, Oxford, UK, 2005.

[50] G. Caetano-Anollés, “Evolved RNA secondary structure and the rooting of the universal tree of life,” Journal of Molecular Evolution, vol. 54, no. 3, pp. 333–345, 2002.

[51] J. A. Lake, R. G. Skophammer, C. W. Herbold, and J. A. Servin, “Genome beginnings: rooting the tree of life,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2177–2185, 2009.

[52] B. Guo, M. Zou, and A. Wagner, “Pervasive indels and their evolutionary dynamics after the fish-specific genome duplication,” Molecular Biology and Evolution, vol. 29, no. 10, pp. 3005–3022, 2012.

[53] M. K. Basu, I. B. Rogozin, O. Deusch, T. Dagan, W. Martin, and E. V. Koonin, “Evolutionary dynamics of introns in plastid-derived genes in plants: saturation nearly reached but slow intron gain continues,” Molecular Biology and Evolution, vol. 25, no. 1, pp. 111–119, 2008.

[54] J. S. Farris, “Phylogenetic analysis under Dollo’s Law,” Systematic Biology, vol. 26, no. 1, pp. 77–88, 1977.

[55] H. Fang, M. E. Oates, R. B. Pethica et al., “A daily-updated tree of (sequenced) life as a reference for genome research,” Scientific Reports, vol. 3, article 2015, 2013.

[56] T. Cavalier-Smith, “The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification,” International Journal of Systematic and Evolutionary Microbiology, vol. 52, no. 1, pp. 7–76, 2002.

[57] T. Cavalier-Smith, “Rooting the tree of life by transition analyses,” Biology Direct, vol. 1, article 19, 2006.

[58] H. S. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Widespread recruitment of ancient domain structures in modern enzymes during metabolic evolution,” Journal of Integrative Bioinformatics, vol. 10, no. 1, article 214, 2013.

[59] G. Caetano-Anollés, M. Wang, and D. Caetano-Anollés, “Structural phylogenomics retrodicts the origin of the genetic code and uncovers the evolutionary impact of protein flexibility,” PLoS ONE, vol. 8, no. 8, Article ID e72225, 2013.

[60] E. Bapteste and C. Brochier, “On the conceptual difficulties in rooting the tree of life,” Trends in Microbiology, vol. 12, no. 1, pp. 9–13, 2004.

[61] A. M. Poole, “Horizontal gene transfer and the earliest stages of the evolution of life,” Research in Microbiology, vol. 160, no. 7, pp. 473–480, 2009.

[62] D. Raoult, “The post-Darwinist rhizome of life,” The Lancet, vol. 375, no. 9709, pp. 104–105, 2010.

[63] W. F. Doolittle, “Phylogenetic classification and the universal tree,” Science, vol. 284, no. 5423, pp. 2124–2128, 1999.

[64] T. Dagan and W. Martin, “The tree of one percent,” Genome Biology, vol. 7, no. 10, article 118, 2006.

[65] W. F. Doolittle and E. Bapteste, “Pattern pluralism and the tree of life hypothesis,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 7, pp. 2043–2049, 2007.

[66] E. V. Koonin, “Towards a postmodern synthesis of evolutionary biology,” Cell Cycle, vol. 8, no. 6, pp. 799–800, 2009.

[67] T. Kloesges, O. Popa, W. Martin, and T. Dagan, “Networks of gene sharing among 329 proteobacterial genomes reveal differences in lateral gene transfer frequency at different phylogenetic depths,” Molecular Biology and Evolution, vol. 28, no. 2, pp. 1057–1074, 2011.

[68] S. Halary, J. W. Leigh, B. Cheaib, P. Lopez, and E. Bapteste, “Network analyses structure genetic diversity in independent genetic worlds,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 127–132, 2010.

[69] S. S. Abby, E. Tannier, M. Gouy, and V. Daubin, “Lateral gene transfer as a support for the tree of life,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 13, pp. 4962–4967, 2012.

[70] N. Glansdorff, Y. Xu, and B. Labedan, “The conflict between horizontal gene transfer and the safeguard of identity: origin of meiotic sexuality,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 470–480, 2009.

[71] C G Kurland and O G Berg ldquoA hitchhikerrsquos guide to evolvingnetworksrdquo in Evolutionary Genomics and Systems Biology GCaetano-Anolles Ed pp 361ndash396 Wiley-Blackwell HobokenNJ USA 2010

[72] L P Villarreal and G Witzany ldquoThe DNA habitat and its RNAinhabitants at the dawn of RNA sociologyrdquo Genomics Insightsvol 6 pp 1ndash12 2013

[73] S Gribaldo and C Brochier ldquoPhylogeny of prokaryotes does itexist and why should we carerdquo Research in Microbiology vol160 no 7 pp 513ndash521 2009

[74] B D Mishler ldquoThe logic of the data matrix in phylogeneticanalysisrdquo in Parsimony Phylogeny and Genomics V A AlbertEd pp 57ndash69 Oxford University Press New York NY USA2005

[75] A Dress V Moulton M Steel and T Wu ldquoSpecies clustersand the ldquotree of liferdquo a graph-theoretic perspectiverdquo Journal ofTheoretical Biology vol 265 no 4 pp 535ndash542 2010

[76] N Lane andWMartin ldquoThe energetics of genome complexityrdquoNature vol 467 no 7318 pp 928ndash934 2010

[77] M C Rivera and J A Lake ldquoThe ring of life provides evidencefor a genome fusion origin of eukaryotesrdquo Nature vol 431 no7005 pp 152ndash155 2004

[78] D Alvarez-Ponce and J O McInerney ldquoThe human genomeretains relics of its prokaryotic ancestry human genes of archae-bacterial and eubacterial origin exhibit remarkable differencesrdquoGenome Biology and Evolution vol 3 no 1 pp 782ndash790 2011

[79] A Poole and D Penny ldquoEukaryote evolution engulfed byspeculationrdquo Nature vol 447 no 7147 p 913 2007

[80] E V Koonin and Y I Wolf ldquoThe fundamental units processesand patterns of evolution and the tree of life conundrumrdquoBiology Direct vol 4 no 1 article 33 2009

[81] P Puigbo Y IWolf and E V Koonin ldquoSearch for a ldquotree of liferdquoin the thicket of the phylogenetic forestrdquo Journal of Biology vol8 no 6 article 59 2009

[82] P Puigbo Y I Wolf and E V Koonin ldquoSeeing the tree of lifebehind the phylogenetic forestrdquo BMC Biology vol 11 article 462013

[83] D B Wetlaufer ldquoNucleation rapid folding and globular intra-chain regions in proteinsrdquo Proceedings of the National Academyof Sciences of the United States of America vol 70 no 3 pp 697ndash701 1973

[84] J S Richardson ldquoThe anatomy and taxonomy of proteinstructurerdquo Advances in Protein Chemistry vol 34 pp 167ndash3391981

[85] M Riley and B Labedan ldquoProtein evolution viewed throughEscherichia coli protein sequences introducing the notion ofa structural segment of homology the modulerdquo Journal ofMolecular Biology vol 268 no 5 pp 857ndash868 1997

[86] J Janin and S J Wodak ldquoStructural domains in proteins andtheir role in the dynamics of protein functionrdquo Progress inBiophysics and Molecular Biology vol 42 pp 21ndash78 1983

[87] G Caetano-Anolles M Wang D Caetano-Anolles and J EMittenthal ldquoThe origin evolution and structure of the proteinworldrdquo Biochemical Journal vol 417 no 3 pp 621ndash637 2009

[88] J Gough ldquoConvergent evolution of domain architectures (israre)rdquo Bioinformatics vol 21 no 8 pp 1464ndash1471 2005

[89] K Forslund A Henricson V Hollich and E L L Sonnham-mer ldquoDomain tree-based analysis of protein architecture evo-lutionrdquoMolecular Biology and Evolution vol 25 no 2 pp 254ndash264 2008

[90] S Yang and P E Bourne ldquoThe evolutionary history of proteindomains viewed by species phylogenyrdquo PLoS ONE vol 4 no 12Article ID e8378 2009

[91] K M Kim and G Caetano-Anolles ldquoThe evolutionary historyof protein fold families and proteomes confirms that thearchaeal ancestor is more ancient than the ancestors of othersuperkingdomsrdquo BMC Evolutionary Biology vol 12 article 132012

[92] M Wang and G Caetano-Anolles ldquoThe evolutionary mechan-ics of domain organization in proteomes and the rise ofmodularity in the protein worldrdquo Structure vol 17 no 1 pp 66ndash78 2009

[93] G. Caetano-Anollés and A. Nasir, “Benefits of using molecular structure and abundance in phylogenomic analysis,” Frontiers in Genetics, vol. 3, article 172, 2012.

[94] K. Brading and E. Castellani, Symmetries in Physics: Philosophical Reflections, Cambridge University Press, Cambridge, UK, 2003.

[95] J. A. Cotton and M. Wilkinson, “Quantifying the potential utility of phylogenetic characters,” Taxon, vol. 57, no. 1, pp. 131–136, 2008.

[96] E. Mossel and M. Steel, “A phase transition for a random cluster model on phylogenetic trees,” Mathematical Biosciences, vol. 187, no. 2, pp. 189–203, 2004.

[97] M. H. Bailor, X. Sun, and H. M. Al-Hashimi, “Topology links RNA secondary structure with global conformation, dynamics, and adaptation,” Science, vol. 327, no. 5962, pp. 202–206, 2010.

[98] K. Illergard, D. H. Ardell, and A. Elofsson, “Structure is three to ten times more conserved than sequence: a study of structural response in protein cores,” Proteins: Structure, Function and Bioinformatics, vol. 77, no. 3, pp. 499–508, 2009.

[99] G. Caetano-Anollés, S. K. Hee, and J. E. Mittenthal, “The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 22, pp. 9358–9363, 2007.

[100] P. H. Weston, “Methods for rooting cladistic trees,” in Models in Phylogeny Reconstruction, D. J. Siebert, R. W. Scotland, and D. M. Williams, Eds., pp. 125–155, Oxford University Press, Oxford, UK, 1994.

[101] H. N. Bryant, “The polarization of character transformations in phylogenetic systematics: role of axiomatic and auxiliary assumptions,” Systematic Biology, vol. 40, no. 4, pp. 433–445, 1991.

[102] H. N. Bryant and G. Wagner, “Character polarity and the rooting of cladograms,” in The Character Concept in Evolutionary Biology, G. P. Wagner, Ed., pp. 319–338, Academic Press, New York, NY, USA, 2001.

[103] G. Nelson, “Ontogeny, phylogeny, paleontology, and the biogenetic law,” Systematic Biology, vol. 27, no. 3, pp. 324–345, 1978.

[104] P. H. Weston, “Indirect and direct methods in systematics,” in Ontogeny and Systematics, C. J. Humphries, Ed., pp. 27–56, Columbia University Press, New York, NY, USA, 1988.

[105] F.-J. Sun and G. Caetano-Anollés, “The evolutionary history of the structure of 5S ribosomal RNA,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 430–443, 2009.

[106] X. Fang, T. Pan, and T. R. Sosnick, “Mg2+-dependent folding of a large ribozyme without kinetic traps,” Nature Structural Biology, vol. 6, no. 12, pp. 1091–1095, 1999.

[107] F.-J. Sun and G. Caetano-Anollés, “Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses,” PLoS Computational Biology, vol. 4, no. 3, Article ID e1000018, 2008.

[108] F.-J. Sun and G. Caetano-Anollés, “The origin and evolution of tRNA inferred from phylogenetic analysis of structure,” Journal of Molecular Evolution, vol. 66, no. 1, pp. 21–35, 2008.

[109] W. R. Ashby, An Introduction to Cybernetics, Taylor & Francis, London, UK, 1955.

[110] J. G. Lundberg, “Wagner networks and ancestors,” Systematic Biology, vol. 21, no. 4, pp. 398–413, 1972.

[111] P. T. Chippindale and J. J. Wiens, “Weighting, partitioning, and combining characters in phylogenetic analysis,” Systematic Biology, vol. 43, no. 2, pp. 278–287, 1994.

[112] J. Mittenthal, D. Caetano-Anollés, and G. Caetano-Anollés, “Biphasic patterns of diversification and the emergence of modules,” Frontiers in Genetics, vol. 3, article 147, 2012.

[113] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[114] C. A. Orengo, A. Michie, S. Jones, D. T. Jones, M. Swindells, and J. M. Thornton, “CATH: a hierarchic classification of protein domain structures,” Structure, vol. 5, no. 8, pp. 1093–1109, 1997.

[115] G. Sawa, J. Dicks, and I. N. Roberts, “Current approaches to whole genome phylogenetic analysis,” Briefings in Bioinformatics, vol. 4, no. 1, pp. 63–74, 2003.

[116] R. Sathyapriya, J. M. Duarte, H. Stehr, I. Filippis, and M. Lappe, “Defining an essence of structure determining residue contacts in proteins,” PLoS Computational Biology, vol. 5, no. 12, Article ID e1000584, 2009.

[117] D. Penny and L. J. Collins, “Evolutionary genomics leads the way,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 1–16, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[118] M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.

[119] A. Andreeva, D. Howorth, J. Chandonia et al., “Data growth and its impact on the SCOP database: new developments,” Nucleic Acids Research, vol. 36, pp. D419–D425, 2008.

[120] L. S. Yafremava, M. Wielgos, S. Thomas et al., “A general framework of persistence strategies for biological systems helps explain domains of life,” Frontiers in Genetics, vol. 4, article 16, 2013.

[121] J. L. Friar, T. Goldman, and J. Perez-Mercader, “Genome sizes and the Benford distribution,” PLoS ONE, vol. 7, no. 5, Article ID e36624, 2012.

[122] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya,” BMC Evolutionary Biology, vol. 12, article 156, 2012.

[123] A. Nasir and G. Caetano-Anollés, “Comparative analysis of proteomes and functionomes provides insights into origins of cellular diversification,” Archaea, vol. 2013, Article ID 648746, 13 pages, 2013.

[124] W. Martin and M. Muller, “The hydrogen hypothesis for the first eukaryote,” Nature, vol. 392, no. 6671, pp. 37–41, 1998.

[125] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Global patterns of protein domain gain and loss in superkingdoms,” PLoS Computational Biology, vol. 10, no. 1, Article ID e1003452, 2014.

[126] M. Wang, L. S. Yafremava, D. Caetano-Anollés, J. E. Mittenthal, and G. Caetano-Anollés, “Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world,” Genome Research, vol. 17, no. 11, pp. 1572–1585, 2007.

[127] L. Guy and T. J. G. Ettema, “The archaeal ‘TACK’ superphylum and the origin of eukaryotes,” Trends in Microbiology, vol. 19, no. 12, pp. 580–587, 2011.

[128] D. Schlieper, M. A. Oliva, J. M. Andreu, and J. Lowe, “Structure of bacterial tubulin BtubA/B: evidence for horizontal gene transfer,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 26, pp. 9170–9175, 2005.

[129] M. P. Hoeppner, P. P. Gardner, and A. M. Poole, “Comparative analysis of RNA families reveals distinct repertoires for each domain of life,” PLoS Computational Biology, vol. 8, no. 11, Article ID e1002752, 2012.

[130] H. Xue, K. Tong, C. Marck, H. Grosjean, and J. T. Wong, “Transfer RNA paralogs: evidence for genetic code-amino acid biosynthesis coevolution and an archaeal root of life,” Gene, vol. 310, no. 1-2, pp. 59–66, 2003.

[131] H. Xue, S. Ng, K. Tong, and J. T. Wong, “Congruence of evidence for a Methanopyrus-proximal root of life based on transfer RNA and aminoacyl-tRNA synthetase genes,” Gene, vol. 360, no. 2, pp. 120–130, 2005.

[132] M. Di Giulio, “Nanoarchaeum equitans is a living fossil,” Journal of Theoretical Biology, vol. 242, no. 1, pp. 257–260, 2006.

[133] M. Di Giulio, “The tree of life might be rooted in the branch leading to Nanoarchaeota,” Gene, vol. 401, no. 1-2, pp. 108–113, 2007.

[134] J. T. Wong, J. Chen, W. Mat, S. Ng, and H. Xue, “Polyphasic evidence delineating the root of life and roots of biological domains,” Gene, vol. 403, no. 1-2, pp. 39–52, 2007.

[135] F.-J. Sun and G. Caetano-Anollés, “The ancient history of the structure of ribonuclease P and the early origins of Archaea,” BMC Bioinformatics, vol. 11, article 153, 2010.

[136] C. Marck and H. Grosjean, “tRNomics: analysis of tRNA genes from 50 genomes of Eukarya, Archaea, and bacteria reveals anticodon-sparing strategies and domain-specific features,” RNA, vol. 8, no. 10, pp. 1189–1232, 2002.

[137] W. Mat, H. Xue, and J. T. Wong, “The genomics of LUCA,” Frontiers in Bioscience, vol. 13, no. 14, pp. 5605–5613, 2008.

[138] C. Brochier, S. Gribaldo, Y. Zivanovic, F. Confalonieri, and P. Forterre, “Nanoarchaea: representatives of a novel archaeal phylum or a fast-evolving euryarchaeal lineage related to Thermococcales?” Genome Biology, vol. 6, no. 5, article R42, 2005.

[139] B. Billoud, M. Guerrucci, M. Masselot, and J. S. Deutsch, “Cirripede phylogeny using a novel approach: molecular morphometrics,” Molecular Biology and Evolution, vol. 17, no. 10, pp. 1435–1445, 2000.

[140] L. J. Collins, V. Moulton, and D. Penny, “Use of RNA secondary structure for studying the evolution of RNase P and RNase MRP,” Journal of Molecular Evolution, vol. 51, no. 3, pp. 194–204, 2000.

[141] G. Caetano-Anollés, “Novel strategies to study the role of mutation and nucleic acid structure in evolution,” Plant Cell, Tissue and Organ Culture, vol. 67, no. 2, pp. 115–132, 2001.

[142] F.-J. Sun, S. Fleurdepine, C. Bousquet-Antonelli, G. Caetano-Anollés, and J. Deragon, “Common evolutionary trends for SINE RNA structures,” Trends in Genetics, vol. 23, no. 1, pp. 26–33, 2007.

[143] D. Caetano-Anollés, K. M. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Proteome evolution and the metabolic origins of translation and cellular life,” Journal of Molecular Evolution, vol. 72, no. 1, pp. 14–33, 2011.


[144] M. Wang, Y. Jiang, K. M. Kim et al., “A universal molecular clock of protein folds and its power in tracing the early history of aerobic metabolism and planet oxygenation,” Molecular Biology and Evolution, vol. 28, no. 1, pp. 567–582, 2011.

[145] S. A. Bukhari and G. Caetano-Anollés, “Origin and evolution of protein fold designs inferred from phylogenomic analysis of CATH domain structures in proteomes,” PLoS Computational Biology, vol. 9, no. 3, Article ID e1003009, 2013.

[146] M. Vesteg and J. Krajcovic, “The falsifiability of the models for the origin of eukaryotes,” Current Genetics, vol. 57, no. 6, pp. 367–390, 2011.

[147] I. N. Berezovsky and E. I. Shakhnovich, “Physics and evolution of thermophilic adaptation,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 36, pp. 12742–12747, 2005.

[148] C. Woese, “The universal ancestor,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 12, pp. 6854–6859, 1998.

[149] M. J. Seufferheld and G. Caetano-Anollés, “Phylogenomics supports a cellularly structured urancestor,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 178–191, 2013.

[150] O. Kandler, “Cell wall biochemistry and three-domain concept of life,” Systematic and Applied Microbiology, vol. 16, no. 4, pp. 501–509, 1994.

[151] C. R. Woese, “On the evolution of cells,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 13, pp. 8742–8747, 2002.

[152] S. L. Baldauf, J. D. Palmer, and W. F. Doolittle, “The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny,” Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 15, pp. 7749–7754, 1996.

[153] H. Leffers, J. Kjems, L. Østergaard, N. Larsen, and R. A. Garrett, “Evolutionary relationships amongst archaebacteria. A comparative study of 23 S ribosomal RNAs of a sulphur-dependent extreme thermophile, an extreme halophile and a thermophilic methanogen,” Journal of Molecular Biology, vol. 195, no. 1, pp. 43–61, 1987.

[154] J. Martin, D. Blackburn, and E. Wiley, “Are node-based and stem-based clades equivalent? Insights from graph theory,” PLoS Currents, vol. 2, Article ID RRN1196, 2010.



Table 1: List of FSFs involved in lipid metabolism and transport along with taxonomic distribution (data taken from [122, 123]).

Group   SCOP Id   FSF Id    FSF description
ABE     89392     b.125.1   Prokaryotic lipoproteins and lipoprotein localization factors
ABE     53092     c.55.2    Creatinase/prolidase N-terminal domain
ABE     49723     b.12.1    Lipase/lipooxygenase domain (PLAT/LH2 domain)
ABE     54637     d.38.1    Thioesterase/thiol ester dehydrase-isomerase
ABE     63825     b.68.5    YWTD domain
BE      47027     a.11.1    Acyl-CoA binding protein
BE      48431     a.118.4   Lipovitellin-phosvitin complex, superhelical domain
BE      55048     d.58.23   Probable ACP-binding domain of malonyl-CoA ACP transacylase
BE      56968     f.7.1     Lipovitellin-phosvitin complex, beta-sheet shell regions
BE      58113     h.5.1     Apolipoprotein A-I
BE      47162     a.24.1    Apolipoprotein
BE      56931     f.4.2     Outer membrane phospholipase A (OMPLA)
B       82220     b.120.1   Tp47 lipoprotein, N-terminal domain
E       47699     a.52.1    Bifunctional inhibitor/lipid-transfer protein/seed storage 2S albumin
E       82936     h.6.1     Apolipoprotein A-II
E       57190     g.3.10    Colipase-like
E       49594     b.7.4     Rab geranylgeranyltransferase alpha-subunit, insert domain

(Table 1). Remarkably, the largest share of these FSFs (7 out of 17) was unique to the BE group. In comparison, none were present in either the AE or AB groups. The ABE group included five universal FSFs, while one FSF was unique to Bacteria and four were Eukarya-specific. In turn, no FSF was unique to Archaea. The BE FSFs cannot be explained by modern effects impinging on variations in proteomic accumulation in FSFs or by processes of domain rearrangement, since these appear in the protein world quite late in evolution [92]. The most parsimonious explanation of the patterns of FSF distribution that unfold in the ToD is the very early (and protracted) rise of the archaeal domain by processes of reductive evolution, possibly triggered by the adaptation of urancestral lineages to harsh environments and survival modes. Under extremophilic conditions typical of hyperthermophilic environments, considerable investments of matter-energy and information must be made for protein persistence [120]. This puts limits on viable protein structures [147]. Extremophilic environments will thus poise the maintenance of a limited set of FSFs for persistence of emergent diversified lineages. This would induce a primordial episode of reductive evolution in the growing FSF repertoire, explaining why hyperthermophilic and thermophilic archaeal species hold the most reduced proteomes (Figure 7). It would also explain the biases that exist in FSFs, GO terms, and RNA families (Figure 6) and the placement of hyperthermophilic and thermophilic archaeal species at the base of ToLs. Since Archaea populate the oceans and sometimes rival Bacteria in number in those environments, we further interpret the late appearance of Archaea-specific FSFs as the result of late colonization of these mild environments by both ancient archaeons and emerging eukaryotes. This relaxes primordial extremophilic pressures on protein structures and enables the late archaeal exploration of structural flexibility and functional novelty.
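The group counts quoted above can be rechecked directly from Table 1. The following is a minimal sketch, assuming the 17 rows of Table 1 as transcribed here; the names `TABLE_1`, `VENN_GROUPS`, and `tally_groups` are illustrative and are not part of the original analysis.

```python
from collections import Counter

# (taxonomic group, SCOP Id) pairs transcribed from Table 1.
TABLE_1 = [
    ("ABE", 89392), ("ABE", 53092), ("ABE", 49723), ("ABE", 54637), ("ABE", 63825),
    ("BE", 47027), ("BE", 48431), ("BE", 55048), ("BE", 56968),
    ("BE", 58113), ("BE", 47162), ("BE", 56931),
    ("B", 82220),
    ("E", 47699), ("E", 82936), ("E", 57190), ("E", 49594),
]

# The seven possible Venn groups for three superkingdoms
# (A = Archaea, B = Bacteria, E = Eukarya).
VENN_GROUPS = ["A", "B", "E", "AB", "AE", "BE", "ABE"]


def tally_groups(rows):
    """Count FSFs per Venn group, reporting zero for groups with no entries."""
    counts = Counter(group for group, _ in rows)
    return {group: counts.get(group, 0) for group in VENN_GROUPS}


group_counts = tally_groups(TABLE_1)
for group, n in group_counts.items():
    print(f"{group:>3}: {n}")
```

The tally makes the asymmetry explicit: every group that contains Archaea without all three superkingdoms (A, AB, AE) is empty for this functional category.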

6.4. Full Circle: Evidence from Trees of Proteomes and Functionomes and a Tree Derived from the Distribution of Viral Replicons in Superkingdoms. While we distrust trees of systems, especially ToLs built from sequences, the use of molecular structure at high levels of structural abstraction has the potential to mitigate some limitations of sequence analysis [93]. For example, rooted ToLs built from abundance counts of domain structures and terminal GO terms in the genomes of free-living organisms describe the evolution of proteomes (e.g., [91]). All ToL reconstructions of these kinds approximate the physiology of living organisms, dissect the three primary domains of life, and reveal the early paraphyletic origin of extremophilic archaeal lineages, followed by the late appearances of monophyletic Bacteria and Eukarya. These patterns have been reliably recovered with datasets of varying sizes, irrespective of the structural classification scheme [91, 122, 126, 145]. Even a tree reconstructed from the distribution of 2662 viral replicons in superkingdoms, from an exhaustive comparative genomic analysis of viral genomes, showed the basal placement of Archaea and the sister taxa relationship between Bacteria and Eukarya (Figure 11).
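The tree statistics reported with Figure 11 (58 steps, CI = 1, RI = 1, HI = 0, RC = 1) are related by the standard ensemble definitions used in parsimony analysis. The sketch below illustrates those relationships only; the function name `ensemble_indices` and the value used for the maximum number of steps are assumptions made for illustration and are not reported in the text.

```python
def ensemble_indices(min_steps, observed_steps, max_steps):
    """Standard ensemble parsimony indices.

    min_steps      -- minimum conceivable number of steps for the data (m)
    observed_steps -- tree length, i.e. steps on the tree being evaluated (s)
    max_steps      -- maximum conceivable number of steps (g); must exceed m
    """
    if not (0 < min_steps <= observed_steps <= max_steps):
        raise ValueError("expected 0 < min_steps <= observed_steps <= max_steps")
    if max_steps == min_steps:
        raise ValueError("retention index is undefined when max_steps == min_steps")
    ci = min_steps / observed_steps                              # consistency index
    hi = 1.0 - ci                                                # homoplasy index
    ri = (max_steps - observed_steps) / (max_steps - min_steps)  # retention index
    rc = ci * ri                                                 # rescaled consistency index
    return {"CI": ci, "HI": hi, "RI": ri, "RC": rc}


# Tree length of 58 steps with CI = 1 implies min_steps = 58.
# max_steps = 80 is a hypothetical placeholder; any value above 58 gives RI = 1 here.
fig11 = ensemble_indices(min_steps=58, observed_steps=58, max_steps=80)
print(fig11)
```

Under these definitions, CI = 1 together with HI = 0 indicates that the replicon abundance characters fit the recovered topology without homoplasy.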

6.5. Additional Evidence from Comparative Genomics. The uneven distribution of protein domain structures in the world


[Figure 11: phylogenetic tree with terminal taxa Archaea, Bacteria, and Eukarya; scale bar, 5 changes; bootstrap support value, 100; abundance vectors at internal nodes as extracted: (5, 0, 0, 0, 0, 0) and (36, 5, 5, 5, 0, 0).]

Figure 11: One of three optimal phylogenetic tree reconstructions with identical topologies recovered from an exhaustive maximum parsimony search (58 steps; CI = 1, RI = 1, HI = 0, RC = 1; g1 = −0.707) of the abundance of viral replicon types of dsDNA, ssDNA, dsRNA, ssRNA(+), ssRNA(−), and retrotranscribing viruses. Abundance was scored on a 0–20 scale and ranged from 0 to 759 viral replicons. Vectors of abundance reconstructions in internal nodes are given as percentage of total abundance of replicons in superkingdoms. Bootstrap support values are shown below nodes.

of proteomes (Figure 6) is preserved as we climb in the structural hierarchy. This was recently made evident when studying the evolution of CATH domains [145]. The Venn diagrams of Figure 12 show how domain structures in all taxonomic groups decrease in numbers with increases in evolutionary conservation. At the highest CATH architectural level, there were 32 universal architectures but no domain-specific architectures. The four architectures shared by Bacteria and Eukarya were present in at least 60% of the proteomes that were surveyed. The other two interdomain architectures were topological designs of considerable complexity that were poorly shared between proteomes and were evolutionarily derived. They were the clam architectures, lost in Eukarya, and the box architectures of nucleotide excision repair, shared by Archaea and Eukarya. The most parsimonious corollary of these distribution patterns is that the BE taxonomical group must arise by loss of structures in Archaea. Indeed, ToDs describing the evolution of CATH domain structures again confirm the early appearance of BE structures and, consequently, their loss in Archaea.
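The architecture-level counts in this paragraph can be checked against the total of 38 CATH architectures given for Figure 12. The sketch below is illustrative only: the names `venn_group` and `ARCHITECTURE_COUNTS` are hypothetical, and assigning the clam architecture to the AB group is our reading of "lost in Eukarya" rather than a value stated explicitly in the text.

```python
SUPERKINGDOM_CODES = {"Archaea": "A", "Bacteria": "B", "Eukarya": "E"}


def venn_group(superkingdoms):
    """Map the superkingdoms in which a structure occurs to its Venn group label."""
    codes = {SUPERKINGDOM_CODES[name] for name in superkingdoms}
    return "".join(code for code in "ABE" if code in codes)


# Architecture counts as read from the paragraph above.
ARCHITECTURE_COUNTS = {
    "ABE": 32,  # universal architectures
    "BE": 4,    # shared by Bacteria and Eukarya
    "AB": 1,    # clam architecture, lost in Eukarya (assumed to mean Archaea + Bacteria)
    "AE": 1,    # box architecture of nucleotide excision repair
    "A": 0, "B": 0, "E": 0,  # no domain-specific architectures
}

total_architectures = sum(ARCHITECTURE_COUNTS.values())
print(venn_group({"Bacteria", "Eukarya"}), total_architectures)
```

The total agrees with the 38 architectures surveyed, with all but six of them universal.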

7. Paraphyletic Origins: Grades and Clades in Archaeal History

Saying that “a ToL is rooted in a domain of life” is an incorrect statement that comes from phylogenetic methodology (the use of outgroups) and the tendency to look at the past with modern eyes. Clades in ToLs have been rooted relative to each other by generating unrooted trees and by defining extant organisms as outgroup taxa. The ToL, however, must be considered rooted in the urancestor of cellular life (Figure 1). This planted edge that connects to the ingroup of extant organisms represents a cellular state in which productive diversification (in the sense of successful lineages) was absent. The primordial urancestral edge leads to a “phase transition,” the last universal cellular ancestor (LUCA), of which little is known. The physiology of the urancestor cannot be considered linked to that of any extant organism, even if it shared a common molecular core with all of them. The urancestor was not an archaeon, a bacterium, or a eukaryote [148]. It was not necessarily thermophilic. Perhaps it was a communal entity or a megametaorganism in the sense of a modern syncytium (the result of multiple cell fusions) and a modern coenocyte (the result of multiple cell divisions). The organismal boundaries were likely present, judging by the number of widely distributed protein domains that associate with membranes and appear at the base of our ToDs and by the universal existence of acidocalcisome organelles [149]. However, the molecular makeup of the urancestral cells was most likely fluid and quasistatistical, the repertoire distributed unequally in the urancestral populations of communal parts, of course within confines delimited by persistence. This urancestral population is therefore consistent with the idea of a primordial stem line proposed by Kandler and Woese [150, 151]. However, it was relatively richer in molecular structures and functions, as opposed to the simple cellular systems hypothesized by Woese. This richness is confirmed by modern analyses of proteomes and functionomes that reveal a vast number of universally shared protein domains and GOs among the three superkingdoms. While each syncytial/coenocyte element of the megaorganism exchanged component parts in search of cellular stability and persistence, the process could not be equated with modern HGT. The exchanging community of primordial cells was not cohesive enough to make the horizontal exchange meaningful. Macromolecules most likely established loose and diverse associations with each other and with smaller molecules, limited by the short average life of their unevolved structural conformations. With time, molecules with better-optimized properties engaged in more durable interactions, stabilizing the emergent cells and providing increased cellular cohesiveness. This poised the urancestral community towards a phase transition (a crystallization [148]), a point in which cellular groups had distinct properties and could be individuated. We believe this was the time of the origin of the archaeal lineage, 2.9 billion years ago [48].

At the base of the ToLs that were reconstructed from genomic data, basal archaeal taxa arise as paraphyletic lineages (Figures 4(a) and 4(c)). These lineages likely arose from subgroups of the urancestral population that pervasively lost crucial domain structures and molecular functions. This represents an evolutionary grade under the scenario described above. The emerging lineages shared with the urancestral community a unifying condition that was related to archaic biochemistry. In other words, the urancestral and emerging archaeal lineages expressed fundamental structural and functional equivalences in terms of their repertoires but revealed, in each emerging and durable paraphyletic lineage, a handful of distinct, newly developed traits. These traits could be global, such as increased thermostability of some crucial members of the protein repertoire or change in the membrane makeup, or local, such as the selective loss of crucial structures and functions. Figure 13 uses the tree paradigm to portray the structural and functional equivalences of the urancestral and emerging archaeal lineages and the slow progression from grades to clades.


[Figure 12: three Venn diagrams of Archaea, Bacteria, and Eukarya, labeled Superfamilies, Topologies, and Architectures and ordered by increasing evolutionary conservation.]

Figure 12: Venn diagrams displaying the distributions of 2221 homologous superfamilies, 1152 topologies, and 38 architectures of CATH domains in the proteomes of 492 fully sequenced genomes (from [145]). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.

[Figure 13: cartoon of lineages progressing from grades to clades between Time 0 and Time 1.]

Figure 13: From grades to clades. The cartoon describes a possible progression of modes of organismal diversification during the rise of primordial archaeal lineages. The width of emerging lineages is proportional to uniquely identifying features of physiological and molecular complexity.

8. Through the Wormhole: The Makeup of the “Megaorganism” and the Emerging Archaeal Lineages

Character state reconstructions of proteome repertoires derived from ToLs, coupled to the timelines of ToDs, provide an effective way to define the ancestral protein domain complement of the urancestor [48] and, consequently, the likely makeup of the emerging archaeal lineages. The urancestral proteome possessed a lower bound of ∼70 FSF domain structures, 75% of which were composed of α/β and α+β proteins. About 50% of FSFs were part of metabolic enzymes, including a rich toolkit of transferases and enzymes of nucleotide metabolism. The rest of the domains were involved in functions related to information (translation, replication, and repair), intracellular processes (transport, protein modification, and proteolytic activities), regulation (kinases/phosphatases and DNA binding functions), and small molecule binding. The urancestor had a limited repertoire of aaRSs and translation factors. It contained a primordial ribosome with a limited core of universal ribosomal proteins. It had numerous membrane proteins necessary for transport, including a relatively advanced ATP synthetase complex, and structures necessary for cellular organization (filaments and primordial cytoskeletal structures). The cells lacked enzymes for deoxyribonucleotide production, so it is likely that the cellular urancestor itself did not harbor a DNA genome. The cells also lacked functions related to extracellular processes (cell adhesion, immune response, and toxins/defense) and cellular motility, suggesting an ancient living world without competitive strategies of survival.

9. Conclusions

The rooting of the ToL has always been controversial in evolutionary biology [26, 152, 153]. While it is popularly accepted that the ToL based on sequence phylogenies is rooted in the akaryotes and that Archaea and Eukarya are sister groups to each other, only two of the three main steps of phylogenetic analysis [104] have been partially fulfilled with sequences. These are the selection of an appropriate statistical or nonstatistical evolutionary model of character change and of an optimization method for phylogenetic tree reconstruction. However, no adequate method exists for character polarization that identifies ancestral and derived character states in sequences. In the absence of robust polarization methodology, any statement about the rooting of the ToL should be considered suspect or subject to apriorism. Here we show that information derived from a genomic structural and functional census of millions of encoded proteins and RNAs, coupled with process models that comply with Weston’s generality criterion, provides the means to dissect the origins of diversified life. The generality criterion is fulfilled in these studies by focusing on the accumulation of modules such as protein domain structures, elements of RNA substructures, or ontogenetic definitions of molecular function. In general, these features are the subject of accretion processes that comply with additive phylogenetic change within the nested taxonomic hierarchy and result in changes of abundance. These processes include those responsible for the growth of molecules (e.g., multidomain proteins), molecular ensembles (e.g., the ribosome), and molecular repertoires (e.g., proteomes). The new methods unfold a consistent evolutionary scenario in which the origin of diversified life traces back to the early history of Archaea. Remarkably, the archaic origin of this microbial urkingdom now does justice to its name.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


Acknowledgments

The authors would like to thank members and friends of the GCA laboratory for fruitful discussions. This research has been supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791) and the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625) to Gustavo Caetano-Anollés and from the KRIBB Research Initiative Program and the Next-Generation BioGreen 21 Program, Rural Development Administration (PJ0090192013), to Kyung Mo Kim.

References

[1] C. R. Woese, “A new biology for a new century,” Microbiology and Molecular Biology Reviews, vol. 68, no. 2, pp. 173–186, 2004.

[2] T. Grant, “Testing methods: the evaluation of discovery operations in evolutionary biology,” Cladistics, vol. 18, no. 1, pp. 94–111, 2002.

[3] E. O. Wiley, “Karl R. Popper, systematics, and classification: a reply to Walter Bock and other evolutionary taxonomists,” Systematic Zoology, vol. 24, no. 2, pp. 233–243, 1975.

[4] J. S. Farris, “Parsimony and explanatory power,” Cladistics, vol. 24, no. 5, pp. 825–847, 2008.

[5] G. Caetano-Anollés, K. M. Kim, and D. Caetano-Anollés, “The phylogenomic roots of modern biochemistry: origins of proteins, cofactors and protein biosynthesis,” Journal of Molecular Evolution, vol. 74, no. 1-2, pp. 1–34, 2012.

[6] W. Fontana, “Modeling ‘evo-devo’ with RNA,” BioEssays, vol. 24, no. 12, pp. 1164–1177, 2002.

[7] W. Hennig, Phylogenetic Systematics, University of Illinois Press, Urbana, Ill, USA, 1999.

[8] D. Fraix-Burnet, T. Chattopadhyay, A. K. Chattopadhyay, E. Davoust, and M. Thuillard, “A six-parameter space to describe galaxy diversification,” Astronomy and Astrophysics, vol. 545, article A80, 24 pages, 2012.

[9] E. Sober, “The contest between parsimony and likelihood,” Systematic Biology, vol. 53, no. 4, pp. 644–653, 2004.

[10] E. K. Lienau and R. DeSalle, “Is the microbial tree of life verificationist?” Cladistics, vol. 26, no. 2, pp. 195–201, 2010.

[11] J. S. Huxley, “Evolutionary processes and taxonomy with special reference to grades,” Uppsala Universitets Arsskrift, vol. 6, pp. 21–39, 1958.

[12] C. R. Woese and G. E. Fox, “Phylogenetic structure of the prokaryotic domain: the primary kingdoms,” Proceedings of the National Academy of Sciences of the United States of America, vol. 74, no. 11, pp. 5088–5090, 1977.

[13] P. Forterre, “The universal tree of life and the last universal cellular ancestor: revolution and counterrevolutions,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 43–62, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[14] J. P. Gogarten, H. Kibak, P. Dittrich et al., “Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 17, pp. 6661–6665, 1989.

[15] N. Iwabe, K. Kuma, M. Hasegawa, S. Osawa, and T. Miyata, “Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 23, pp. 9355–9359, 1989.

[16] J. P. Gogarten and L. Olendzenski, “Orthologs, paralogs and genome comparisons,” Current Opinion in Genetics and Development, vol. 9, no. 6, pp. 630–636, 1999.

[17] C R Woese O Kandler and M L Wheelis ldquoTowards anatural system of organisms proposal for the domains ArchaeaBacteria and Eucaryardquo Proceedings of the National Academy ofSciences of the United States of America vol 87 no 12 pp 4576ndash4579 1990

[18] J Martijn and T J Ettema ldquoFrom archaeon to eukaryotethe evolutionary dark ages of the eukaryotic cellrdquo BiochemicalSociety Transactions vol 41 no 1 pp 451ndash457 2013

[19] C J Cox P G Foster R P Hirt S R Harris and T M EmbleyldquoThe archaebacterial origin of eukaryotesrdquo Proceedings of theNational Academy of Sciences of the United States of Americavol 105 no 51 pp 20356ndash20361 2008

[20] P G Foster C J Cox and TM Embley ldquoThe primary divisionsof life a phylogenomic approach employing composition-heterogeneousmethodsrdquoPhilosophical Transactions of the RoyalSociety B Biological Sciences vol 364 no 1527 pp 2197ndash22072009

[21] T A Williams P G Foster T M Nye C J Cox and T MEmbley ldquoA congruent phylogenomic signal places eukaryoteswithin the Archaeardquo Proceedings of the Royal Society B Biologi-cal Sciences vol 279 no 1749 pp 4870ndash4879 2012

[22] S Gribaldo A M Poole V Daubin P Forterre and CBrochier-Armanet ldquoThe origin of eukaryotes and their rela-tionship with the Archaea are we at a phylogenomic impasserdquoNature Reviews Microbiology vol 8 no 10 pp 743ndash752 2010

[23] H. Philippe and P. Forterre, “The rooting of the universal tree of life is not reliable,” Journal of Molecular Evolution, vol. 49, no. 4, pp. 509–523, 1999.

[24] E. Sober and M. Steel, “Testing the hypothesis of common ancestry,” Journal of Theoretical Biology, vol. 218, no. 4, pp. 395–408, 2002.

[25] P. Forterre, “The common ancestor of Archaea and Eukarya was not an archaeon,” Archaea, vol. 2013, Article ID 372396, 18 pages, 2013.

[26] P. Forterre and H. Philippe, “Where is the root of the universal tree of life?” BioEssays, vol. 21, no. 10, pp. 871–879, 1999.

[27] S. Gribaldo and H. Philippe, “Ancient phylogenetic relationships,” Theoretical Population Biology, vol. 61, no. 4, pp. 391–408, 2002.

[28] J. K. Harris, S. T. Kelley, G. B. Spiegelman, and N. R. Pace, “The genetic core of the universal ancestor,” Genome Research, vol. 13, no. 3, pp. 407–412, 2003.

[29] A. Harish and G. Caetano-Anollés, “Ribosomal history reveals origins of modern protein synthesis,” PLoS ONE, vol. 7, no. 3, Article ID e32776, 2012.

[30] G. Caetano-Anollés and M. J. Seufferheld, “The coevolutionary roots of biochemistry and cellular organization challenge the RNA world paradigm,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 152–177, 2013.

[31] N. R. Pace, “Mapping the tree of life: progress and prospects,” Microbiology and Molecular Biology Reviews, vol. 73, no. 4, pp. 565–576, 2009.

[32] P. de Rijk, Y. van de Peer, I. van den Broeck, and R. de Wachter, “Evolution according to large ribosomal subunit RNA,” Journal of Molecular Evolution, vol. 41, no. 3, pp. 366–375, 1995.

[33] G. Caetano-Anollés, “Tracing the evolution of RNA structure in ribosomes,” Nucleic Acids Research, vol. 30, no. 11, pp. 2575–2587, 2002.

[34] J. Mallatt and C. J. Winchell, “Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes,” Molecular Biology and Evolution, vol. 19, no. 3, pp. 289–301, 2002.

[35] M. Gerstein, “Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census,” Proteins: Structure, Function and Bioinformatics, vol. 33, no. 4, pp. 518–534, 1998.

[36] M. Gerstein and H. Hegyi, “Comparing genomes in terms of protein structure: surveys of a finite parts list,” FEMS Microbiology Reviews, vol. 22, no. 4, pp. 277–304, 1998.

[37] B. Snel, P. Bork, and M. A. Huynen, “Genome phylogeny based on gene content,” Nature Genetics, vol. 21, no. 1, pp. 108–110, 1999.

[38] F. Tekaia, A. Lazcano, and B. Dujon, “The genomic tree as revealed from whole proteome comparisons,” Genome Research, vol. 9, no. 6, pp. 550–557, 1999.

[39] Y. I. Wolf, S. E. Brenner, P. A. Bash, and E. V. Koonin, “Distribution of protein folds in the three superkingdoms of life,” Genome Research, vol. 9, no. 1, pp. 17–26, 1999.

[40] J. O. Korbel, B. Snel, M. A. Huynen, and P. Bork, “SHOT: a web server for the construction of genome phylogenies,” Trends in Genetics, vol. 18, no. 3, pp. 158–162, 2002.

[41] G. Caetano-Anollés and D. Caetano-Anollés, “An evolutionarily structured universe of protein architecture,” Genome Research, vol. 13, no. 7, pp. 1563–1571, 2003.

[42] S. L. Baldauf, A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle, “A kingdom-level phylogeny of eukaryotes based on combined protein data,” Science, vol. 290, no. 5493, pp. 972–977, 2000.

[43] J. R. Brown, C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope, “Universal trees based on large combined protein sequence data sets,” Nature Genetics, vol. 28, no. 3, pp. 281–285, 2001.

[44] F. D. Ciccarelli, T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P. Bork, “Toward automatic reconstruction of a highly resolved tree of life,” Science, vol. 311, no. 5765, pp. 1283–1287, 2006.

[45] T. Dagan, M. Roettger, D. Bryant, and W. Martin, “Genome networks root the tree of life between prokaryotic domains,” Genome Biology and Evolution, vol. 2, no. 1, pp. 379–392, 2010.

[46] S. Jun, G. E. Sims, G. A. Wu, and S. Kim, “Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 133–138, 2010.

[47] E. A. Schultes, P. T. Hraber, and T. H. LaBean, “No molecule is an island: molecular evolution and the study of sequence space,” in Algorithmic Bioprocesses, A. Condon, D. Harel, J. N. Kok, A. Salomaa, and E. Winfree, Eds., pp. 675–704, Springer, Berlin, Germany, 2009.

[48] K. M. Kim and G. Caetano-Anollés, “The proteomic complexity and rise of the primordial ancestor of diversified life,” BMC Evolutionary Biology, vol. 11, no. 1, article 140, 2011.

[49] V. A. Albert, Parsimony, Phylogeny, and Genomics, Oxford University Press, Oxford, UK, 2005.

[50] G. Caetano-Anollés, “Evolved RNA secondary structure and the rooting of the universal tree of life,” Journal of Molecular Evolution, vol. 54, no. 3, pp. 333–345, 2002.

[51] J. A. Lake, R. G. Skophammer, C. W. Herbold, and J. A. Servin, “Genome beginnings: rooting the tree of life,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2177–2185, 2009.

[52] B. Guo, M. Zou, and A. Wagner, “Pervasive indels and their evolutionary dynamics after the fish-specific genome duplication,” Molecular Biology and Evolution, vol. 29, no. 10, pp. 3005–3022, 2012.

[53] M. K. Basu, I. B. Rogozin, O. Deusch, T. Dagan, W. Martin, and E. V. Koonin, “Evolutionary dynamics of introns in plastid-derived genes in plants: saturation nearly reached but slow intron gain continues,” Molecular Biology and Evolution, vol. 25, no. 1, pp. 111–119, 2008.

[54] J. S. Farris, “Phylogenetic analysis under Dollo’s Law,” Systematic Biology, vol. 26, no. 1, pp. 77–88, 1977.

[55] H. Fang, M. E. Oates, R. B. Pethica et al., “A daily-updated tree of (sequenced) life as a reference for genome research,” Scientific Reports, vol. 3, article 2015, 2013.

[56] T. Cavalier-Smith, “The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification,” International Journal of Systematic and Evolutionary Microbiology, vol. 52, no. 1, pp. 7–76, 2002.

[57] T. Cavalier-Smith, “Rooting the tree of life by transition analyses,” Biology Direct, vol. 1, article 19, 2006.

[58] H. S. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Widespread recruitment of ancient domain structures in modern enzymes during metabolic evolution,” Journal of Integrative Bioinformatics, vol. 10, no. 1, article 214, 2013.

[59] G. Caetano-Anollés, M. Wang, and D. Caetano-Anollés, “Structural phylogenomics retrodicts the origin of the genetic code and uncovers the evolutionary impact of protein flexibility,” PLoS ONE, vol. 8, no. 8, Article ID e72225, 2013.

[60] E. Bapteste and C. Brochier, “On the conceptual difficulties in rooting the tree of life,” Trends in Microbiology, vol. 12, no. 1, pp. 9–13, 2004.

[61] A. M. Poole, “Horizontal gene transfer and the earliest stages of the evolution of life,” Research in Microbiology, vol. 160, no. 7, pp. 473–480, 2009.

[62] D. Raoult, “The post-Darwinist rhizome of life,” The Lancet, vol. 375, no. 9709, pp. 104–105, 2010.

[63] W. F. Doolittle, “Phylogenetic classification and the universal tree,” Science, vol. 284, no. 5423, pp. 2124–2128, 1999.

[64] T. Dagan and W. Martin, “The tree of one percent,” Genome Biology, vol. 7, no. 10, article 118, 2006.

[65] W. F. Doolittle and E. Bapteste, “Pattern pluralism and the tree of life hypothesis,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 7, pp. 2043–2049, 2007.

[66] E. V. Koonin, “Towards a postmodern synthesis of evolutionary biology,” Cell Cycle, vol. 8, no. 6, pp. 799–800, 2009.

[67] T. Kloesges, O. Popa, W. Martin, and T. Dagan, “Networks of gene sharing among 329 proteobacterial genomes reveal differences in lateral gene transfer frequency at different phylogenetic depths,” Molecular Biology and Evolution, vol. 28, no. 2, pp. 1057–1074, 2011.

[68] S. Halary, J. W. Leigh, B. Cheaib, P. Lopez, and E. Bapteste, “Network analyses structure genetic diversity in independent genetic worlds,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 127–132, 2010.

[69] S. S. Abby, E. Tannier, M. Gouy, and V. Daubin, “Lateral gene transfer as a support for the tree of life,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 13, pp. 4962–4967, 2012.

[70] N. Glansdorff, Y. Xu, and B. Labedan, “The conflict between horizontal gene transfer and the safeguard of identity: origin of meiotic sexuality,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 470–480, 2009.

[71] C. G. Kurland and O. G. Berg, “A hitchhiker’s guide to evolving networks,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 361–396, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[72] L. P. Villarreal and G. Witzany, “The DNA habitat and its RNA inhabitants: at the dawn of RNA sociology,” Genomics Insights, vol. 6, pp. 1–12, 2013.

[73] S. Gribaldo and C. Brochier, “Phylogeny of prokaryotes: does it exist and why should we care?” Research in Microbiology, vol. 160, no. 7, pp. 513–521, 2009.

[74] B. D. Mishler, “The logic of the data matrix in phylogenetic analysis,” in Parsimony, Phylogeny, and Genomics, V. A. Albert, Ed., pp. 57–69, Oxford University Press, New York, NY, USA, 2005.

[75] A. Dress, V. Moulton, M. Steel, and T. Wu, “Species, clusters and the “tree of life”: a graph-theoretic perspective,” Journal of Theoretical Biology, vol. 265, no. 4, pp. 535–542, 2010.

[76] N. Lane and W. Martin, “The energetics of genome complexity,” Nature, vol. 467, no. 7318, pp. 928–934, 2010.

[77] M. C. Rivera and J. A. Lake, “The ring of life provides evidence for a genome fusion origin of eukaryotes,” Nature, vol. 431, no. 7005, pp. 152–155, 2004.

[78] D. Alvarez-Ponce and J. O. McInerney, “The human genome retains relics of its prokaryotic ancestry: human genes of archaebacterial and eubacterial origin exhibit remarkable differences,” Genome Biology and Evolution, vol. 3, no. 1, pp. 782–790, 2011.

[79] A. Poole and D. Penny, “Eukaryote evolution: engulfed by speculation,” Nature, vol. 447, no. 7147, p. 913, 2007.

[80] E. V. Koonin and Y. I. Wolf, “The fundamental units, processes and patterns of evolution, and the tree of life conundrum,” Biology Direct, vol. 4, no. 1, article 33, 2009.

[81] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Search for a “tree of life” in the thicket of the phylogenetic forest,” Journal of Biology, vol. 8, no. 6, article 59, 2009.

[82] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Seeing the tree of life behind the phylogenetic forest,” BMC Biology, vol. 11, article 46, 2013.

[83] D. B. Wetlaufer, “Nucleation, rapid folding, and globular intrachain regions in proteins,” Proceedings of the National Academy of Sciences of the United States of America, vol. 70, no. 3, pp. 697–701, 1973.

[84] J. S. Richardson, “The anatomy and taxonomy of protein structure,” Advances in Protein Chemistry, vol. 34, pp. 167–339, 1981.

[85] M. Riley and B. Labedan, “Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module,” Journal of Molecular Biology, vol. 268, no. 5, pp. 857–868, 1997.

[86] J. Janin and S. J. Wodak, “Structural domains in proteins and their role in the dynamics of protein function,” Progress in Biophysics and Molecular Biology, vol. 42, pp. 21–78, 1983.

[87] G. Caetano-Anollés, M. Wang, D. Caetano-Anollés, and J. E. Mittenthal, “The origin, evolution and structure of the protein world,” Biochemical Journal, vol. 417, no. 3, pp. 621–637, 2009.

[88] J. Gough, “Convergent evolution of domain architectures (is rare),” Bioinformatics, vol. 21, no. 8, pp. 1464–1471, 2005.

[89] K. Forslund, A. Henricson, V. Hollich, and E. L. L. Sonnhammer, “Domain tree-based analysis of protein architecture evolution,” Molecular Biology and Evolution, vol. 25, no. 2, pp. 254–264, 2008.

[90] S. Yang and P. E. Bourne, “The evolutionary history of protein domains viewed by species phylogeny,” PLoS ONE, vol. 4, no. 12, Article ID e8378, 2009.

[91] K. M. Kim and G. Caetano-Anollés, “The evolutionary history of protein fold families and proteomes confirms that the archaeal ancestor is more ancient than the ancestors of other superkingdoms,” BMC Evolutionary Biology, vol. 12, article 13, 2012.

[92] M. Wang and G. Caetano-Anollés, “The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world,” Structure, vol. 17, no. 1, pp. 66–78, 2009.

[93] G. Caetano-Anollés and A. Nasir, “Benefits of using molecular structure and abundance in phylogenomic analysis,” Frontiers in Genetics, vol. 3, article 172, 2012.

[94] K. Brading and E. Castellani, Symmetries in Physics: Philosophical Reflections, Cambridge University Press, Cambridge, UK, 2003.

[95] J. A. Cotton and M. Wilkinson, “Quantifying the potential utility of phylogenetic characters,” Taxon, vol. 57, no. 1, pp. 131–136, 2008.

[96] E. Mossel and M. Steel, “A phase transition for a random cluster model on phylogenetic trees,” Mathematical Biosciences, vol. 187, no. 2, pp. 189–203, 2004.

[97] M. H. Bailor, X. Sun, and H. M. Al-Hashimi, “Topology links RNA secondary structure with global conformation, dynamics, and adaptation,” Science, vol. 327, no. 5962, pp. 202–206, 2010.

[98] K. Illergård, D. H. Ardell, and A. Elofsson, “Structure is three to ten times more conserved than sequence--a study of structural response in protein cores,” Proteins: Structure, Function and Bioinformatics, vol. 77, no. 3, pp. 499–508, 2009.

[99] G. Caetano-Anollés, S. K. Hee, and J. E. Mittenthal, “The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 22, pp. 9358–9363, 2007.

[100] P. H. Weston, “Methods for rooting cladistic trees,” in Models in Phylogeny Reconstruction, D. J. Siebert, R. W. Scotland, and D. M. Williams, Eds., pp. 125–155, Oxford University Press, Oxford, UK, 1994.

[101] H. N. Bryant, “The polarization of character transformations in phylogenetic systematics: role of axiomatic and auxiliary assumptions,” Systematic Biology, vol. 40, no. 4, pp. 433–445, 1991.

[102] H. N. Bryant and G. Wagner, “Character polarity and the rooting of cladograms,” in The Character Concept in Evolutionary Biology, G. P. Wagner, Ed., pp. 319–338, Academic Press, New York, NY, USA, 2001.

[103] G. Nelson, “Ontogeny, phylogeny, paleontology, and the biogenetic law,” Systematic Biology, vol. 27, no. 3, pp. 324–345, 1978.

[104] P. H. Weston, “Indirect and direct methods in systematics,” in Ontogeny and Systematics, C. J. Humphries, Ed., pp. 27–56, Columbia University Press, New York, NY, USA, 1988.

[105] F.-J. Sun and G. Caetano-Anollés, “The evolutionary history of the structure of 5S ribosomal RNA,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 430–443, 2009.

[106] X. Fang, T. Pan, and T. R. Sosnick, “Mg2+-dependent folding of a large ribozyme without kinetic traps,” Nature Structural Biology, vol. 6, no. 12, pp. 1091–1095, 1999.

[107] F.-J. Sun and G. Caetano-Anollés, “Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses,” PLoS Computational Biology, vol. 4, no. 3, Article ID e1000018, 2008.

[108] F.-J. Sun and G. Caetano-Anollés, “The origin and evolution of tRNA inferred from phylogenetic analysis of structure,” Journal of Molecular Evolution, vol. 66, no. 1, pp. 21–35, 2008.

[109] W. R. Ashby, An Introduction to Cybernetics, Taylor & Francis, London, UK, 1955.

[110] J. G. Lundberg, “Wagner networks and ancestors,” Systematic Biology, vol. 21, no. 4, pp. 398–413, 1972.

[111] P. T. Chippindale and J. J. Wiens, “Weighting, partitioning, and combining characters in phylogenetic analysis,” Systematic Biology, vol. 43, no. 2, pp. 278–287, 1994.

[112] J. Mittenthal, D. Caetano-Anollés, and G. Caetano-Anollés, “Biphasic patterns of diversification and the emergence of modules,” Frontiers in Genetics, vol. 3, article 147, 2012.

[113] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[114] C. A. Orengo, A. Michie, S. Jones, D. T. Jones, M. Swindells, and J. M. Thornton, “CATH--a hierarchic classification of protein domain structures,” Structure, vol. 5, no. 8, pp. 1093–1109, 1997.

[115] G. Sawa, J. Dicks, and I. N. Roberts, “Current approaches to whole genome phylogenetic analysis,” Briefings in Bioinformatics, vol. 4, no. 1, pp. 63–74, 2003.

[116] R. Sathyapriya, J. M. Duarte, H. Stehr, I. Filippis, and M. Lappe, “Defining an essence of structure determining residue contacts in proteins,” PLoS Computational Biology, vol. 5, no. 12, Article ID e1000584, 2009.

[117] D. Penny and L. J. Collins, “Evolutionary genomics leads the way,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 1–16, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[118] M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.

[119] A. Andreeva, D. Howorth, J. Chandonia et al., “Data growth and its impact on the SCOP database: new developments,” Nucleic Acids Research, vol. 36, pp. D419–D425, 2008.

[120] L. S. Yafremava, M. Wielgos, S. Thomas et al., “A general framework of persistence strategies for biological systems helps explain domains of life,” Frontiers in Genetics, vol. 4, article 16, 2013.

[121] J. L. Friar, T. Goldman, and J. Pérez-Mercader, “Genome sizes and the Benford distribution,” PLoS ONE, vol. 7, no. 5, Article ID e36624, 2012.

[122] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya,” BMC Evolutionary Biology, vol. 12, article 156, 2012.

[123] A. Nasir and G. Caetano-Anollés, “Comparative analysis of proteomes and functionomes provides insights into origins of cellular diversification,” Archaea, vol. 2013, Article ID 648746, 13 pages, 2013.

[124] W. Martin and M. Müller, “The hydrogen hypothesis for the first eukaryote,” Nature, vol. 392, no. 6671, pp. 37–41, 1998.

[125] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Global patterns of protein domain gain and loss in superkingdoms,” PLoS Computational Biology, vol. 10, no. 1, Article ID e1003452, 2014.

[126] M. Wang, L. S. Yafremava, D. Caetano-Anollés, J. E. Mittenthal, and G. Caetano-Anollés, “Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world,” Genome Research, vol. 17, no. 11, pp. 1572–1585, 2007.

[127] L. Guy and T. J. G. Ettema, “The archaeal “TACK” superphylum and the origin of eukaryotes,” Trends in Microbiology, vol. 19, no. 12, pp. 580–587, 2011.

[128] D. Schlieper, M. A. Oliva, J. M. Andreu, and J. Löwe, “Structure of bacterial tubulin BtubA/B: evidence for horizontal gene transfer,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 26, pp. 9170–9175, 2005.

[129] M. P. Hoeppner, P. P. Gardner, and A. M. Poole, “Comparative analysis of RNA families reveals distinct repertoires for each domain of life,” PLoS Computational Biology, vol. 8, no. 11, Article ID e1002752, 2012.

[130] H. Xue, K. Tong, C. Marck, H. Grosjean, and J. T. Wong, “Transfer RNA paralogs: evidence for genetic code-amino acid biosynthesis coevolution and an archaeal root of life,” Gene, vol. 310, no. 1-2, pp. 59–66, 2003.

[131] H. Xue, S. Ng, K. Tong, and J. T. Wong, “Congruence of evidence for a Methanopyrus-proximal root of life based on transfer RNA and aminoacyl-tRNA synthetase genes,” Gene, vol. 360, no. 2, pp. 120–130, 2005.

[132] M. Di Giulio, “Nanoarchaeum equitans is a living fossil,” Journal of Theoretical Biology, vol. 242, no. 1, pp. 257–260, 2006.

[133] M. Di Giulio, “The tree of life might be rooted in the branch leading to Nanoarchaeota,” Gene, vol. 401, no. 1-2, pp. 108–113, 2007.

[134] J. T. Wong, J. Chen, W. Mat, S. Ng, and H. Xue, “Polyphasic evidence delineating the root of life and roots of biological domains,” Gene, vol. 403, no. 1-2, pp. 39–52, 2007.

[135] F.-J. Sun and G. Caetano-Anollés, “The ancient history of the structure of ribonuclease P and the early origins of Archaea,” BMC Bioinformatics, vol. 11, article 153, 2010.

[136] C. Marck and H. Grosjean, “tRNomics: analysis of tRNA genes from 50 genomes of Eukarya, Archaea, and bacteria reveals anticodon-sparing strategies and domain-specific features,” RNA, vol. 8, no. 10, pp. 1189–1232, 2002.

[137] W. Mat, H. Xue, and J. T. Wong, “The genomics of LUCA,” Frontiers in Bioscience, vol. 13, no. 14, pp. 5605–5613, 2008.

[138] C. Brochier, S. Gribaldo, Y. Zivanovic, F. Confalonieri, and P. Forterre, “Nanoarchaea: representatives of a novel archaeal phylum or a fast-evolving euryarchaeal lineage related to Thermococcales?” Genome Biology, vol. 6, no. 5, article R42, 2005.

[139] B. Billoud, M. Guerrucci, M. Masselot, and J. S. Deutsch, “Cirripede phylogeny using a novel approach: molecular morphometrics,” Molecular Biology and Evolution, vol. 17, no. 10, pp. 1435–1445, 2000.

[140] L. J. Collins, V. Moulton, and D. Penny, “Use of RNA secondary structure for studying the evolution of RNase P and RNase MRP,” Journal of Molecular Evolution, vol. 51, no. 3, pp. 194–204, 2000.

[141] G. Caetano-Anollés, “Novel strategies to study the role of mutation and nucleic acid structure in evolution,” Plant Cell, Tissue and Organ Culture, vol. 67, no. 2, pp. 115–132, 2001.

[142] F.-J. Sun, S. Fleurdepine, C. Bousquet-Antonelli, G. Caetano-Anollés, and J. Deragon, “Common evolutionary trends for SINE RNA structures,” Trends in Genetics, vol. 23, no. 1, pp. 26–33, 2007.

[143] D. Caetano-Anollés, K. M. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Proteome evolution and the metabolic origins of translation and cellular life,” Journal of Molecular Evolution, vol. 72, no. 1, pp. 14–33, 2011.

[144] M. Wang, Y. Jiang, K. M. Kim et al., “A universal molecular clock of protein folds and its power in tracing the early history of aerobic metabolism and planet oxygenation,” Molecular Biology and Evolution, vol. 28, no. 1, pp. 567–582, 2011.

[145] S. A. Bukhari and G. Caetano-Anollés, “Origin and evolution of protein fold designs inferred from phylogenomic analysis of CATH domain structures in proteomes,” PLoS Computational Biology, vol. 9, no. 3, Article ID e1003009, 2013.

[146] M. Vesteg and J. Krajčovič, “The falsifiability of the models for the origin of eukaryotes,” Current Genetics, vol. 57, no. 6, pp. 367–390, 2011.

[147] I. N. Berezovsky and E. I. Shakhnovich, “Physics and evolution of thermophilic adaptation,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 36, pp. 12742–12747, 2005.

[148] C. Woese, “The universal ancestor,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 12, pp. 6854–6859, 1998.

[149] M. J. Seufferheld and G. Caetano-Anollés, “Phylogenomics supports a cellularly structured urancestor,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 178–191, 2013.

[150] O. Kandler, “Cell wall biochemistry and three-domain concept of life,” Systematic and Applied Microbiology, vol. 16, no. 4, pp. 501–509, 1994.

[151] C. R. Woese, “On the evolution of cells,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 13, pp. 8742–8747, 2002.

[152] S. L. Baldauf, J. D. Palmer, and W. F. Doolittle, “The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny,” Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 15, pp. 7749–7754, 1996.

[153] H. Leffers, J. Kjems, L. Østergaard, N. Larsen, and R. A. Garrett, “Evolutionary relationships amongst archaebacteria. A comparative study of 23 S ribosomal RNAs of a sulphur-dependent extreme thermophile, an extreme halophile and a thermophilic methanogen,” Journal of Molecular Biology, vol. 195, no. 1, pp. 43–61, 1987.

[154] J. Martin, D. Blackburn, and E. Wiley, “Are node-based and stem-based clades equivalent? Insights from graph theory,” PLoS Currents, vol. 2, Article ID RRN1196, 2010.

[Figure 11 tree: taxa Archaea, Bacteria, and Eukarya; scale bar, 5 changes; bootstrap support, 100; internal-node abundance vectors, 5 0 0 0 0 0 and 36 5 5 5 0 0.]

Figure 11: One of three optimal phylogenetic tree reconstructions with identical topologies recovered from an exhaustive maximum parsimony search (58 steps; CI = 1, RI = 1, HI = 0, RC = 1, g1 = −0.707) of the abundance of viral replicon types of dsDNA, ssDNA, dsRNA, ssRNA(+), ssRNA(−), and retrotranscribing viruses. Abundance was scored on a 0–20 scale and ranged from 0 to 759 viral replicons. Vectors of abundance reconstructions in internal nodes are given as percentage of total abundance of replicons in superkingdoms. Bootstrap support values are shown below nodes.

of proteomes (Figure 6) is preserved as we climb in the structural hierarchy. This was recently made evident when studying the evolution of CATH domains [145]. The Venn diagrams of Figure 12 show how domain structures in all taxonomic groups decrease in numbers with increases in evolutionary conservation. At the highest CATH architectural level there were 32 universal architectures but no domain-specific architectures. The four architectures shared by Bacteria and Eukarya were present in at least 60% of the proteomes that were surveyed. The other two interdomain architectures were topological designs of considerable complexity that were poorly shared between proteomes and were evolutionarily derived. They were the clam architectures, lost in Eukarya, and the box architectures of nucleotide excision repair, shared by Archaea and Eukarya. The most parsimonious corollary of these distribution patterns is that the BE taxonomical group must arise by loss of structures in Archaea. Indeed, ToDs describing the evolution of CATH domain structures again confirm the early appearance of BE structures and, consequently, their loss in Archaea.
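The partition of CATH architectures into Venn taxonomic groups described above can be illustrated with a minimal sketch. It is not the analysis pipeline of [145]: the architecture identifiers and helper names are hypothetical placeholders, and only the group counts (32 ABE, 4 BE, 1 AB for the clam architectures, and 1 AE for the box architectures) are taken from the text.

```python
from collections import Counter

SUPERKINGDOMS = "ABE"  # A = Archaea, B = Bacteria, E = Eukarya


def venn_group(present_in):
    """Label a structure by the superkingdoms holding it, e.g. 'ABE' or 'BE'."""
    return "".join(k for k in SUPERKINGDOMS if k in present_in)


def tally_groups(occurrence):
    """Count structures per Venn group from a {structure: set of superkingdoms} map."""
    return Counter(venn_group(kingdoms) for kingdoms in occurrence.values())


def build_toy_table():
    """Toy occurrence table; identifiers are hypothetical, counts follow the text."""
    table = {}
    for group, n in (("ABE", 32), ("BE", 4), ("AB", 1), ("AE", 1)):
        for i in range(n):
            table[f"{group}_arch_{i}"] = set(group)
    return table


counts = tally_groups(build_toy_table())
print(dict(counts))  # {'ABE': 32, 'BE': 4, 'AB': 1, 'AE': 1}
```

The tally sums to the 38 architectures of Figure 12 and leaves the domain-specific groups (A, B, and E) empty, which is the pattern the parsimony argument rests on.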

7. Paraphyletic Origins: Grades and Clades in Archaeal History

Saying that “a ToL is rooted in a domain of life” is an incorrect statement that comes from phylogenetic methodology (the use of outgroups) and the tendency to look at the past with modern eyes. Clades in ToLs have been rooted relative to each other by generating unrooted trees and by defining extant organisms as outgroup taxa. The ToL, however, must be considered rooted in the urancestor of cellular life (Figure 1). This planted edge that connects to the ingroup of extant organisms represents a cellular state in which productive diversification (in the sense of successful lineages) was absent. The primordial urancestral edge leads to a “phase transition,” the last universal cellular ancestor (LUCA), of which little is known. The physiology of the urancestor cannot be considered linked to that of any extant organism, even if it shared a common molecular core with all of them. The urancestor was not an archaeon, a bacterium, or a eukaryote [148]. It was not necessarily thermophilic. Perhaps it was a communal entity or a megametaorganism in the sense of a modern syncytium (the result of multiple cell fusions) and a modern coenocyte (the result of multiple cell divisions). The organismal boundaries were likely present, judging by the number of widely distributed protein domains that associate with membranes and appear at the base of our ToDs and by the universal existence of acidocalcisome organelles [149]. However, the molecular makeup of the urancestral cells was most likely fluid and quasistatistical, the repertoire distributed unequally in the urancestral populations of communal parts, of course within confines delimited by persistence. This urancestral population is therefore consistent with the idea of a primordial stem line proposed by Kandler and Woese [150, 151]. However, it was relatively richer in molecular structures and functions, as opposed to the simple cellular systems hypothesized by Woese. This richness is confirmed by modern analyses of proteomes and functionomes that reveal a vast number of universally shared protein domains and GOs among the three superkingdoms. While each syncytial/coenocyte element of the megaorganism exchanged component parts in search of cellular stability and persistence, the process could not be equated with modern HGT. The exchanging community of primordial cells was not cohesive enough to make the horizontal exchange meaningful. Macromolecules most likely established loose and diverse associations with each other and with smaller molecules, limited by the short average life of their unevolved structural conformations. With time, molecules with better-optimized properties engaged in more durable interactions, stabilizing the emergent cells and providing increased cellular cohesiveness. This poised the urancestral community towards a phase transition (a crystallization [148]), a point in which cellular groups had distinct properties and could be individuated. We believe this was the time of the origin of the archaeal lineage, 2.9 billion years ago [48].

At the base of the ToLs that were reconstructed from genomic data, basal archaeal taxa arise as paraphyletic lineages (Figures 4(a) and 4(c)). These lineages likely arose from subgroups of the urancestral population that pervasively lost crucial domain structures and molecular functions. This represents an evolutionary grade under the scenario described above. The emerging lineages shared with the urancestral community a unifying condition that was related to archaic biochemistry. In other words, the urancestral and emerging archaeal lineages expressed fundamental structural and functional equivalences in terms of their repertoires but revealed in each emerging and durable paraphyletic lineage a handful of distinct, newly developed traits. These traits could be global, such as increased thermostability of some crucial members of the protein repertoire or change in the membrane makeup, or local, such as the selective loss of crucial structures and functions. Figure 13 uses the tree paradigm to portray the structural and functional equivalences of the urancestral and emerging archaeal lineages and the slow progression from grades to clades.

[Figure 12 panels: Superfamilies, Topologies, Architectures. Each Venn diagram partitions domains among Archaea, Bacteria, and Eukarya; label: Evolutionary conservation.]

Figure 12: Venn diagrams displaying the distributions of 2,221 homologous superfamilies, 1,152 topologies, and 38 architectures of CATH domains in the proteomes of 492 fully sequenced genomes (from [145]). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.
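The partition behind a Venn diagram of this kind can be sketched in a few lines. The following is a minimal illustration, not the pipeline used in [145]: it assumes a hypothetical table that records, for each domain, the superkingdoms (A, B, E) in which it was detected, and it counts how many domains fall into each of the seven taxonomic groups. All identifiers and the toy data are invented for the example.

```python
from collections import Counter

# The seven possible Venn groups for Archaea (A), Bacteria (B), Eukarya (E).
VENN_GROUPS = ("A", "B", "E", "AB", "AE", "BE", "ABE")


def venn_label(kingdoms):
    """Return the group label (e.g. 'BE') for a set of superkingdom codes."""
    return "".join(code for code in "ABE" if code in kingdoms)


def venn_counts(occurrence):
    """Count domains per Venn group.

    occurrence: dict mapping a domain identifier to the set of
    superkingdom codes ('A', 'B', 'E') in which the domain occurs.
    Domains that occur nowhere are ignored.
    """
    labels = (venn_label(k) for k in occurrence.values())
    tally = Counter(label for label in labels if label)
    return {group: tally.get(group, 0) for group in VENN_GROUPS}


# Hypothetical toy data: six domains and where they were detected.
toy_occurrence = {
    "dom1": {"A", "B", "E"},
    "dom2": {"B", "E"},
    "dom3": {"B"},
    "dom4": {"A", "B", "E"},
    "dom5": {"E"},
    "dom6": {"A", "E"},
}

toy_counts = venn_counts(toy_occurrence)
for group in VENN_GROUPS:
    print(group, toy_counts[group])
```

With real data, the same tally would be run once per classification level (superfamilies, topologies, architectures), giving one Venn diagram per level.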

[Figure 13 labels: Grades, Clades; time axis from Time 0 to Time 1.]

Figure 13: From grades to clades. The cartoon describes a possible progression of modes of organismal diversification during the rise of primordial archaeal lineages. The width of emerging lineages is proportional to uniquely identifying features of physiological and molecular complexity.

8. Through the Wormhole: The Makeup of the "Megaorganism" and the Emerging Archaeal Lineages

Character state reconstructions of proteome repertoires derived from ToLs, coupled to the timelines of ToDs, provide an effective way to define the ancestral protein domain complement of the urancestor [48] and consequently the likely makeup of the emerging archaeal lineages. The urancestral proteome possessed a lower bound of ∼70 FSF domain structures, 75% of which were composed of α/β and α+β proteins. About 50% of FSFs were part of metabolic enzymes, including a rich toolkit of transferases and enzymes of nucleotide metabolism. The rest of the domains were involved in functions related to information (translation, replication, and repair), intracellular processes (transport, protein modification, and proteolytic activities), regulation (kinases/phosphatases and DNA binding functions), and small molecule binding. The urancestor had a limited repertoire of aaRSs and translation factors. It contained a primordial ribosome with a limited core of universal ribosomal proteins. It had numerous membrane proteins necessary for transport, including a relatively advanced ATP synthetase complex, and structures necessary for cellular organization (filaments and primordial cytoskeletal structures). The cells lacked enzymes for deoxyribonucleotide production, so it is likely that the cellular urancestor itself did not harbor a DNA genome. The cells lacked functions related to extracellular processes (cell adhesion, immune response, and toxins/defense) and cellular motility, suggesting an ancient living world without competitive strategies of survival.

9. Conclusions

The rooting of the ToL has always been controversial in evolutionary biology [26, 152, 153]. While it is popularly accepted that the ToL based on sequence phylogenies is rooted in the akaryotes and that Archaea and Eukarya are sister groups to each other, only two of the three main steps of phylogenetic analysis [104] have been partially fulfilled with sequences. These are the selection of an appropriate statistical or nonstatistical evolutionary model of character change and of an optimization method for phylogenetic tree reconstruction. However, no adequate method exists for character polarization that identifies ancestral and derived character states in sequences. In the absence of robust polarization methodology, any statement about the rooting of the ToL should be considered suspect or subject to apriorism. Here we show that information derived from a genomic structural and functional census of millions of encoded proteins and RNAs, coupled with process models that comply with Weston's generality criterion, provides the means to dissect the origins of diversified life. The generality criterion is fulfilled in these studies by focusing on the accumulation of modules, such as protein domain structures, elements of RNA substructures, or ontogenetic definitions of molecular function. In general, these features are the subject of accretion processes that comply with additive phylogenetic change within the nested taxonomic hierarchy and result in changes of abundance. These processes include those responsible for the growth of molecules (e.g., multidomain proteins), molecular ensembles (e.g., the ribosome), and molecular repertoires (e.g., proteomes). The new methods unfold a consistent evolutionary scenario in which the origin of diversified life traces back to the early history of Archaea. Remarkably, the archaic origin of this microbial urkingdom now does justice to its name.
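As a rough illustration of how changes of abundance can be handled as ordered characters, the sketch below rescales raw domain counts from one proteome to a fixed number of ordered states on a logarithmic scale. The log transform, the choice of 24 states, and all names and counts are assumptions made for this example and are not taken from this section. Which end of the scale is treated as ancestral is a separate modeling decision that the code does not encode.

```python
import math


def rescale_abundance(counts, n_states=24):
    """Map raw abundance counts to ordered integer states 0..n_states-1.

    counts: dict mapping a module identifier (hypothetical) to the number
    of times that module occurs in one proteome.
    A count of 0 maps to state 0 and the largest count maps to the top
    state; intermediate counts are placed on a log scale.
    """
    if not counts:
        return {}
    g_max = max(counts.values())
    if g_max <= 0:
        return {name: 0 for name in counts}
    top_state = n_states - 1
    denominator = math.log(g_max + 1)
    return {
        name: round(math.log(count + 1) / denominator * top_state)
        for name, count in counts.items()
    }


# Hypothetical toy proteome with four modules of very different abundance.
toy_counts = {"fsf_a": 0, "fsf_b": 1, "fsf_c": 15, "fsf_d": 120}
toy_states = rescale_abundance(toy_counts)
print(toy_states)
```

The log scale compresses the large spread between rare and highly reused modules, so that a single abundant module does not dominate the character matrix.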

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank members and friends of the GCA laboratory for fruitful discussions. This research has been supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791) and the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625) to Gustavo Caetano-Anollés and from the KRIBB Research Initiative Program and the Next-Generation BioGreen 21 Program, Rural Development Administration (PJ0090192013), to Kyung Mo Kim.

References

[1] C. R. Woese, "A new biology for a new century," Microbiology and Molecular Biology Reviews, vol. 68, no. 2, pp. 173–186, 2004.

[2] T. Grant, "Testing methods: the evaluation of discovery operations in evolutionary biology," Cladistics, vol. 18, no. 1, pp. 94–111, 2002.

[3] E. O. Wiley, "Karl R. Popper, systematics, and classification: a reply to Walter Bock and other evolutionary taxonomists," Systematic Zoology, vol. 24, no. 2, pp. 233–243, 1975.

[4] J. S. Farris, "Parsimony and explanatory power," Cladistics, vol. 24, no. 5, pp. 825–847, 2008.

[5] G. Caetano-Anollés, K. M. Kim, and D. Caetano-Anollés, "The phylogenomic roots of modern biochemistry: origins of proteins, cofactors and protein biosynthesis," Journal of Molecular Evolution, vol. 74, no. 1-2, pp. 1–34, 2012.

[6] W. Fontana, "Modeling 'evo-devo' with RNA," BioEssays, vol. 24, no. 12, pp. 1164–1177, 2002.

[7] W. Hennig, Phylogenetic Systematics, University of Illinois Press, Urbana, Ill, USA, 1999.

[8] D. Fraix-Burnet, T. Chattopadhyay, A. K. Chattopadhyay, E. Davoust, and M. Thuillard, "A six-parameter space to describe galaxy diversification," Astronomy and Astrophysics, vol. 545, article A80, 24 pages, 2012.

[9] E. Sober, "The contest between parsimony and likelihood," Systematic Biology, vol. 53, no. 4, pp. 644–653, 2004.

[10] E. K. Lienau and R. DeSalle, "Is the microbial tree of life verificationist?" Cladistics, vol. 26, no. 2, pp. 195–201, 2010.

[11] J. S. Huxley, "Evolutionary processes and taxonomy with special reference to grades," Uppsala Universitets Arsskrift, vol. 6, pp. 21–39, 1958.

[12] C. R. Woese and G. E. Fox, "Phylogenetic structure of the prokaryotic domain: the primary kingdoms," Proceedings of the National Academy of Sciences of the United States of America, vol. 74, no. 11, pp. 5088–5090, 1977.

[13] P. Forterre, "The universal tree of life and the last universal cellular ancestor: revolution and counterrevolutions," in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 43–62, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[14] J. P. Gogarten, H. Kibak, P. Dittrich et al., "Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes," Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 17, pp. 6661–6665, 1989.

[15] N. Iwabe, K. Kuma, M. Hasegawa, S. Osawa, and T. Miyata, "Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes," Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 23, pp. 9355–9359, 1989.

[16] J. P. Gogarten and L. Olendzenski, "Orthologs, paralogs and genome comparisons," Current Opinion in Genetics and Development, vol. 9, no. 6, pp. 630–636, 1999.

[17] C. R. Woese, O. Kandler, and M. L. Wheelis, "Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya," Proceedings of the National Academy of Sciences of the United States of America, vol. 87, no. 12, pp. 4576–4579, 1990.

[18] J. Martijn and T. J. Ettema, "From archaeon to eukaryote: the evolutionary dark ages of the eukaryotic cell," Biochemical Society Transactions, vol. 41, no. 1, pp. 451–457, 2013.

[19] C. J. Cox, P. G. Foster, R. P. Hirt, S. R. Harris, and T. M. Embley, "The archaebacterial origin of eukaryotes," Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 51, pp. 20356–20361, 2008.

[20] P. G. Foster, C. J. Cox, and T. M. Embley, "The primary divisions of life: a phylogenomic approach employing composition-heterogeneous methods," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2197–2207, 2009.

[21] T. A. Williams, P. G. Foster, T. M. Nye, C. J. Cox, and T. M. Embley, "A congruent phylogenomic signal places eukaryotes within the Archaea," Proceedings of the Royal Society B: Biological Sciences, vol. 279, no. 1749, pp. 4870–4879, 2012.

[22] S. Gribaldo, A. M. Poole, V. Daubin, P. Forterre, and C. Brochier-Armanet, "The origin of eukaryotes and their relationship with the Archaea: are we at a phylogenomic impasse?" Nature Reviews Microbiology, vol. 8, no. 10, pp. 743–752, 2010.

[23] H. Philippe and P. Forterre, "The rooting of the universal tree of life is not reliable," Journal of Molecular Evolution, vol. 49, no. 4, pp. 509–523, 1999.

[24] E. Sober and M. Steel, "Testing the hypothesis of common ancestry," Journal of Theoretical Biology, vol. 218, no. 4, pp. 395–408, 2002.

[25] P. Forterre, "The common ancestor of Archaea and Eukarya was not an archaeon," Archaea, vol. 2013, Article ID 372396, 18 pages, 2013.

[26] P. Forterre and H. Philippe, "Where is the root of the universal tree of life?" BioEssays, vol. 21, no. 10, pp. 871–879, 1999.

[27] S. Gribaldo and H. Philippe, "Ancient phylogenetic relationships," Theoretical Population Biology, vol. 61, no. 4, pp. 391–408, 2002.

[28] J. K. Harris, S. T. Kelley, G. B. Spiegelman, and N. R. Pace, "The genetic core of the universal ancestor," Genome Research, vol. 13, no. 3, pp. 407–412, 2003.

[29] A. Harish and G. Caetano-Anollés, "Ribosomal history reveals origins of modern protein synthesis," PLoS ONE, vol. 7, no. 3, Article ID e32776, 2012.

[30] G. Caetano-Anollés and M. J. Seufferheld, "The coevolutionary roots of biochemistry and cellular organization challenge the RNA world paradigm," Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 152–177, 2013.

[31] N. R. Pace, "Mapping the tree of life: progress and prospects," Microbiology and Molecular Biology Reviews, vol. 73, no. 4, pp. 565–576, 2009.

[32] P. de Rijk, Y. van de Peer, I. van den Broeck, and R. de Wachter, "Evolution according to large ribosomal subunit RNA," Journal of Molecular Evolution, vol. 41, no. 3, pp. 366–375, 1995.

[33] G. Caetano-Anollés, "Tracing the evolution of RNA structure in ribosomes," Nucleic Acids Research, vol. 30, no. 11, pp. 2575–2587, 2002.

[34] J. Mallatt and C. J. Winchell, "Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes," Molecular Biology and Evolution, vol. 19, no. 3, pp. 289–301, 2002.

[35] M. Gerstein, "Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census," Proteins: Structure, Function and Bioinformatics, vol. 33, no. 4, pp. 518–534, 1998.

[36] M. Gerstein and H. Hegyi, "Comparing genomes in terms of protein structure: surveys of a finite parts list," FEMS Microbiology Reviews, vol. 22, no. 4, pp. 277–304, 1998.

[37] B. Snel, P. Bork, and M. A. Huynen, "Genome phylogeny based on gene content," Nature Genetics, vol. 21, no. 1, pp. 108–110, 1999.

[38] F. Tekaia, A. Lazcano, and B. Dujon, "The genomic tree as revealed from whole proteome comparisons," Genome Research, vol. 9, no. 6, pp. 550–557, 1999.

[39] Y. I. Wolf, S. E. Brenner, P. A. Bash, and E. V. Koonin, "Distribution of protein folds in the three superkingdoms of life," Genome Research, vol. 9, no. 1, pp. 17–26, 1999.

[40] J. O. Korbel, B. Snel, M. A. Huynen, and P. Bork, "SHOT: a web server for the construction of genome phylogenies," Trends in Genetics, vol. 18, no. 3, pp. 158–162, 2002.

[41] G. Caetano-Anollés and D. Caetano-Anollés, "An evolutionarily structural universe of protein architecture," Genome Research, vol. 13, no. 7, pp. 1563–1571, 2003.

[42] S. L. Baldauf, A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle, "A kingdom-level phylogeny of eukaryotes based on combined protein data," Science, vol. 290, no. 5493, pp. 972–977, 2000.

[43] J. R. Brown, C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope, "Universal trees based on large combined protein sequence data sets," Nature Genetics, vol. 28, no. 3, pp. 281–285, 2001.

[44] F. D. Ciccarelli, T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P. Bork, "Toward automatic reconstruction of a highly resolved tree of life," Science, vol. 311, no. 5765, pp. 1283–1287, 2006.

[45] T. Dagan, M. Roettger, D. Bryant, and W. Martin, "Genome networks root the tree of life between prokaryotic domains," Genome Biology and Evolution, vol. 2, no. 1, pp. 379–392, 2010.

[46] S. Jun, G. E. Sims, G. A. Wu, and S. Kim, "Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution," Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 133–138, 2010.

[47] E. A. Schultes, P. T. Hraber, and T. H. LaBean, "No molecule is an island: molecular evolution and the study of sequence space," in Algorithmic Bioprocesses, A. Condon, D. Harel, J. N. Kok, A. Salomaa, and E. Winfree, Eds., pp. 675–704, Springer, Berlin, Germany, 2009.

[48] K. M. Kim and G. Caetano-Anollés, "The proteomic complexity and rise of the primordial ancestor of diversified life," BMC Evolutionary Biology, vol. 11, no. 1, article 140, 2011.

[49] V. A. Albert, Parsimony, Phylogeny, and Genomics, Oxford University Press, Oxford, UK, 2005.

[50] G. Caetano-Anollés, "Evolved RNA secondary structure and the rooting of the universal tree of life," Journal of Molecular Evolution, vol. 54, no. 3, pp. 333–345, 2002.

[51] J. A. Lake, R. G. Skophammer, C. W. Herbold, and J. A. Servin, "Genome beginnings: rooting the tree of life," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2177–2185, 2009.

[52] B. Guo, M. Zou, and A. Wagner, "Pervasive indels and their evolutionary dynamics after the fish-specific genome duplication," Molecular Biology and Evolution, vol. 29, no. 10, pp. 3005–3022, 2012.

[53] M. K. Basu, I. B. Rogozin, O. Deusch, T. Dagan, W. Martin, and E. V. Koonin, "Evolutionary dynamics of introns in plastid-derived genes in plants: saturation nearly reached but slow intron gain continues," Molecular Biology and Evolution, vol. 25, no. 1, pp. 111–119, 2008.

[54] J. S. Farris, "Phylogenetic analysis under Dollo's Law," Systematic Biology, vol. 26, no. 1, pp. 77–88, 1977.

[55] H. Fang, M. E. Oates, R. B. Pethica et al., "A daily-updated tree of (sequenced) life as a reference for genome research," Scientific Reports, vol. 3, article 2015, 2013.

[56] T. Cavalier-Smith, "The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification," International Journal of Systematic and Evolutionary Microbiology, vol. 52, no. 1, pp. 7–76, 2002.

[57] T. Cavalier-Smith, "Rooting the tree of life by transition analyses," Biology Direct, vol. 1, article 19, 2006.

[58] H. S. Kim, J. E. Mittenthal, and G. Caetano-Anollés, "Widespread recruitment of ancient domain structures in modern enzymes during metabolic evolution," Journal of Integrative Bioinformatics, vol. 10, no. 1, article 214, 2013.

[59] G. Caetano-Anollés, M. Wang, and D. Caetano-Anollés, "Structural phylogenomics retrodicts the origin of the genetic code and uncovers the evolutionary impact of protein flexibility," PLoS ONE, vol. 8, no. 8, Article ID e72225, 2013.

[60] E. Bapteste and C. Brochier, "On the conceptual difficulties in rooting the tree of life," Trends in Microbiology, vol. 12, no. 1, pp. 9–13, 2004.

[61] A. M. Poole, "Horizontal gene transfer and the earliest stages of the evolution of life," Research in Microbiology, vol. 160, no. 7, pp. 473–480, 2009.

[62] D. Raoult, "The post-Darwinist rhizome of life," The Lancet, vol. 375, no. 9709, pp. 104–105, 2010.

[63] W. F. Doolittle, "Phylogenetic classification and the universal tree," Science, vol. 284, no. 5423, pp. 2124–2128, 1999.

[64] T. Dagan and W. Martin, "The tree of one percent," Genome Biology, vol. 7, no. 10, article 118, 2006.

[65] W. F. Doolittle and E. Bapteste, "Pattern pluralism and the tree of life hypothesis," Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 7, pp. 2043–2049, 2007.

[66] E. V. Koonin, "Towards a postmodern synthesis of evolutionary biology," Cell Cycle, vol. 8, no. 6, pp. 799–800, 2009.

[67] T. Kloesges, O. Popa, W. Martin, and T. Dagan, "Networks of gene sharing among 329 proteobacterial genomes reveal differences in lateral gene transfer frequency at different phylogenetic depths," Molecular Biology and Evolution, vol. 28, no. 2, pp. 1057–1074, 2011.

[68] S. Halary, J. W. Leigh, B. Cheaib, P. Lopez, and E. Bapteste, "Network analyses structure genetic diversity in independent genetic worlds," Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 127–132, 2010.

[69] S. S. Abby, E. Tannier, M. Gouy, and V. Daubin, "Lateral gene transfer as a support for the tree of life," Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 13, pp. 4962–4967, 2012.

[70] N. Glansdorff, Y. Xu, and B. Labedan, "The conflict between horizontal gene transfer and the safeguard of identity: origin of meiotic sexuality," Journal of Molecular Evolution, vol. 69, no. 5, pp. 470–480, 2009.

[71] C. G. Kurland and O. G. Berg, "A hitchhiker's guide to evolving networks," in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 361–396, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[72] L. P. Villarreal and G. Witzany, "The DNA habitat and its RNA inhabitants: at the dawn of RNA sociology," Genomics Insights, vol. 6, pp. 1–12, 2013.

[73] S. Gribaldo and C. Brochier, "Phylogeny of prokaryotes: does it exist and why should we care?" Research in Microbiology, vol. 160, no. 7, pp. 513–521, 2009.

[74] B. D. Mishler, "The logic of the data matrix in phylogenetic analysis," in Parsimony, Phylogeny, and Genomics, V. A. Albert, Ed., pp. 57–69, Oxford University Press, New York, NY, USA, 2005.

[75] A. Dress, V. Moulton, M. Steel, and T. Wu, "Species, clusters and the 'tree of life': a graph-theoretic perspective," Journal of Theoretical Biology, vol. 265, no. 4, pp. 535–542, 2010.

[76] N. Lane and W. Martin, "The energetics of genome complexity," Nature, vol. 467, no. 7318, pp. 928–934, 2010.

[77] M. C. Rivera and J. A. Lake, "The ring of life provides evidence for a genome fusion origin of eukaryotes," Nature, vol. 431, no. 7005, pp. 152–155, 2004.

[78] D. Alvarez-Ponce and J. O. McInerney, "The human genome retains relics of its prokaryotic ancestry: human genes of archaebacterial and eubacterial origin exhibit remarkable differences," Genome Biology and Evolution, vol. 3, no. 1, pp. 782–790, 2011.

[79] A. Poole and D. Penny, "Eukaryote evolution: engulfed by speculation," Nature, vol. 447, no. 7147, p. 913, 2007.

[80] E. V. Koonin and Y. I. Wolf, "The fundamental units, processes and patterns of evolution, and the tree of life conundrum," Biology Direct, vol. 4, no. 1, article 33, 2009.

[81] P. Puigbo, Y. I. Wolf, and E. V. Koonin, "Search for a 'tree of life' in the thicket of the phylogenetic forest," Journal of Biology, vol. 8, no. 6, article 59, 2009.

[82] P. Puigbo, Y. I. Wolf, and E. V. Koonin, "Seeing the tree of life behind the phylogenetic forest," BMC Biology, vol. 11, article 46, 2013.

[83] D. B. Wetlaufer, "Nucleation, rapid folding, and globular intrachain regions in proteins," Proceedings of the National Academy of Sciences of the United States of America, vol. 70, no. 3, pp. 697–701, 1973.

[84] J. S. Richardson, "The anatomy and taxonomy of protein structure," Advances in Protein Chemistry, vol. 34, pp. 167–339, 1981.

[85] M. Riley and B. Labedan, "Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module," Journal of Molecular Biology, vol. 268, no. 5, pp. 857–868, 1997.

[86] J. Janin and S. J. Wodak, "Structural domains in proteins and their role in the dynamics of protein function," Progress in Biophysics and Molecular Biology, vol. 42, pp. 21–78, 1983.

[87] G. Caetano-Anollés, M. Wang, D. Caetano-Anollés, and J. E. Mittenthal, "The origin, evolution and structure of the protein world," Biochemical Journal, vol. 417, no. 3, pp. 621–637, 2009.

[88] J. Gough, "Convergent evolution of domain architectures (is rare)," Bioinformatics, vol. 21, no. 8, pp. 1464–1471, 2005.

[89] K. Forslund, A. Henricson, V. Hollich, and E. L. L. Sonnhammer, "Domain tree-based analysis of protein architecture evolution," Molecular Biology and Evolution, vol. 25, no. 2, pp. 254–264, 2008.

[90] S. Yang and P. E. Bourne, "The evolutionary history of protein domains viewed by species phylogeny," PLoS ONE, vol. 4, no. 12, Article ID e8378, 2009.

[91] K. M. Kim and G. Caetano-Anollés, "The evolutionary history of protein fold families and proteomes confirms that the archaeal ancestor is more ancient than the ancestors of other superkingdoms," BMC Evolutionary Biology, vol. 12, article 13, 2012.

[92] M. Wang and G. Caetano-Anollés, "The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world," Structure, vol. 17, no. 1, pp. 66–78, 2009.

[93] G. Caetano-Anollés and A. Nasir, "Benefits of using molecular structure and abundance in phylogenomic analysis," Frontiers in Genetics, vol. 3, article 172, 2012.

[94] K. Brading and E. Castellani, Symmetries in Physics: Philosophical Reflections, Cambridge University Press, Cambridge, UK, 2003.

[95] J. A. Cotton and M. Wilkinson, "Quantifying the potential utility of phylogenetic characters," Taxon, vol. 57, no. 1, pp. 131–136, 2008.

[96] E. Mossel and M. Steel, "A phase transition for a random cluster model on phylogenetic trees," Mathematical Biosciences, vol. 187, no. 2, pp. 189–203, 2004.

[97] M. H. Bailor, X. Sun, and H. M. Al-Hashimi, "Topology links RNA secondary structure with global conformation, dynamics, and adaptation," Science, vol. 327, no. 5962, pp. 202–206, 2010.

[98] K. Illergard, D. H. Ardell, and A. Elofsson, "Structure is three to ten times more conserved than sequence: a study of structural response in protein cores," Proteins: Structure, Function and Bioinformatics, vol. 77, no. 3, pp. 499–508, 2009.

[99] G. Caetano-Anollés, S. K. Hee, and J. E. Mittenthal, "The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture," Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 22, pp. 9358–9363, 2007.

[100] P. H. Weston, "Methods for rooting cladistic trees," in Models in Phylogeny Reconstruction, D. J. Siebert, R. W. Scotland, and D. M. Williams, Eds., pp. 125–155, Oxford University Press, Oxford, UK, 1994.

[101] H. N. Bryant, "The polarization of character transformations in phylogenetic systematics: role of axiomatic and auxiliary assumptions," Systematic Biology, vol. 40, no. 4, pp. 433–445, 1991.

[102] H. N. Bryant and G. Wagner, "Character polarity and the rooting of cladograms," in The Character Concept in Evolutionary Biology, G. P. Wagner, Ed., pp. 319–338, Academic Press, New York, NY, USA, 2001.

[103] G. Nelson, "Ontogeny, phylogeny, paleontology, and the biogenetic law," Systematic Biology, vol. 27, no. 3, pp. 324–345, 1978.

[104] P. H. Weston, "Indirect and direct methods in systematics," in Ontogeny and Systematics, C. J. Humphries, Ed., pp. 27–56, Columbia University Press, New York, NY, USA, 1988.

[105] F.-J. Sun and G. Caetano-Anollés, "The evolutionary history of the structure of 5S ribosomal RNA," Journal of Molecular Evolution, vol. 69, no. 5, pp. 430–443, 2009.

[106] X. Fang, T. Pan, and T. R. Sosnick, "Mg2+-dependent folding of a large ribozyme without kinetic traps," Nature Structural Biology, vol. 6, no. 12, pp. 1091–1095, 1999.

[107] F.-J. Sun and G. Caetano-Anollés, "Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses," PLoS Computational Biology, vol. 4, no. 3, Article ID e1000018, 2008.

[108] F.-J. Sun and G. Caetano-Anollés, "The origin and evolution of tRNA inferred from phylogenetic analysis of structure," Journal of Molecular Evolution, vol. 66, no. 1, pp. 21–35, 2008.

[109] W. R. Ashby, An Introduction to Cybernetics, Taylor & Francis, London, UK, 1955.

[110] J. G. Lundberg, "Wagner networks and ancestors," Systematic Biology, vol. 21, no. 4, pp. 398–413, 1972.

[111] P. T. Chippindale and J. J. Wiens, "Weighting, partitioning, and combining characters in phylogenetic analysis," Systematic Biology, vol. 43, no. 2, pp. 278–287, 1994.

[112] J. Mittenthal, D. Caetano-Anollés, and G. Caetano-Anollés, "Biphasic patterns of diversification and the emergence of modules," Frontiers in Genetics, vol. 3, article 147, 2012.

[113] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, "SCOP: a structural classification of proteins database for the investigation of sequences and structures," Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[114] C. A. Orengo, A. Michie, S. Jones, D. T. Jones, M. Swindells, and J. M. Thornton, "CATH: a hierarchic classification of protein domain structures," Structure, vol. 5, no. 8, pp. 1093–1109, 1997.

[115] G. Sawa, J. Dicks, and I. N. Roberts, "Current approaches to whole genome phylogenetic analysis," Briefings in Bioinformatics, vol. 4, no. 1, pp. 63–74, 2003.

[116] R. Sathyapriya, J. M. Duarte, H. Stehr, I. Filippis, and M. Lappe, "Defining an essence of structure determining residue contacts in proteins," PLoS Computational Biology, vol. 5, no. 12, Article ID e1000584, 2009.

[117] D. Penny and L. J. Collins, "Evolutionary genomics leads the way," in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 1–16, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[118] M. Ashburner, C. A. Ball, J. A. Blake et al., "Gene ontology: tool for the unification of biology," Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.

[119] A. Andreeva, D. Howorth, J. Chandonia et al., "Data growth and its impact on the SCOP database: new developments," Nucleic Acids Research, vol. 36, pp. D419–D425, 2008.

[120] L. S. Yafremava, M. Wielgos, S. Thomas et al., "A general framework of persistence strategies for biological systems helps explain domains of life," Frontiers in Genetics, vol. 4, article 16, 2013.

[121] J. L. Friar, T. Goldman, and J. Perez-Mercader, "Genome sizes and the Benford distribution," PLoS ONE, vol. 7, no. 5, Article ID e36624, 2012.

[122] A. Nasir, K. M. Kim, and G. Caetano-Anollés, "Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya," BMC Evolutionary Biology, vol. 12, article 156, 2012.

[123] A. Nasir and G. Caetano-Anollés, "Comparative analysis of proteomes and functionomes provides insights into origins of cellular diversification," Archaea, vol. 2013, Article ID 648746, 13 pages, 2013.

[124] W. Martin and M. Muller, "The hydrogen hypothesis for the first eukaryote," Nature, vol. 392, no. 6671, pp. 37–41, 1998.

[125] A. Nasir, K. M. Kim, and G. Caetano-Anollés, "Global patterns of protein domain gain and loss in superkingdoms," PLoS Computational Biology, vol. 10, no. 1, Article ID e1003452, 2014.

[126] M. Wang, L. S. Yafremava, D. Caetano-Anollés, J. E. Mittenthal, and G. Caetano-Anollés, "Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world," Genome Research, vol. 17, no. 11, pp. 1572–1585, 2007.

[127] L. Guy and T. J. G. Ettema, "The archaeal 'TACK' superphylum and the origin of eukaryotes," Trends in Microbiology, vol. 19, no. 12, pp. 580–587, 2011.

[128] D. Schlieper, M. A. Oliva, J. M. Andreu, and J. Lowe, "Structure of bacterial tubulin BtubA/B: evidence for horizontal gene transfer," Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 26, pp. 9170–9175, 2005.

[129] M. P. Hoeppner, P. P. Gardner, and A. M. Poole, "Comparative analysis of RNA families reveals distinct repertoires for each domain of life," PLoS Computational Biology, vol. 8, no. 11, Article ID e1002752, 2012.

[130] H. Xue, K. Tong, C. Marck, H. Grosjean, and J. T. Wong, "Transfer RNA paralogs: evidence for genetic code-amino acid biosynthesis coevolution and an archaeal root of life," Gene, vol. 310, no. 1-2, pp. 59–66, 2003.

[131] H. Xue, S. Ng, K. Tong, and J. T. Wong, "Congruence of evidence for a Methanopyrus-proximal root of life based on transfer RNA and aminoacyl-tRNA synthetase genes," Gene, vol. 360, no. 2, pp. 120–130, 2005.

[132] M. Di Giulio, "Nanoarchaeum equitans is a living fossil," Journal of Theoretical Biology, vol. 242, no. 1, pp. 257–260, 2006.

[133] M. Di Giulio, "The tree of life might be rooted in the branch leading to Nanoarchaeota," Gene, vol. 401, no. 1-2, pp. 108–113, 2007.

[134] J. T. Wong, J. Chen, W. Mat, S. Ng, and H. Xue, "Polyphasic evidence delineating the root of life and roots of biological domains," Gene, vol. 403, no. 1-2, pp. 39–52, 2007.

[135] F.-J. Sun and G. Caetano-Anollés, "The ancient history of the structure of ribonuclease P and the early origins of Archaea," BMC Bioinformatics, vol. 11, article 153, 2010.

[136] C. Marck and H. Grosjean, "tRNomics: analysis of tRNA genes from 50 genomes of Eukarya, Archaea, and bacteria reveals anticodon-sparing strategies and domain-specific features," RNA, vol. 8, no. 10, pp. 1189–1232, 2002.

[137] W. Mat, H. Xue, and J. T. Wong, "The genomics of LUCA," Frontiers in Bioscience, vol. 13, no. 14, pp. 5605–5613, 2008.

[138] C. Brochier, S. Gribaldo, Y. Zivanovic, F. Confalonieri, and P. Forterre, "Nanoarchaea: representatives of a novel archaeal phylum or a fast-evolving euryarchaeal lineage related to Thermococcales?" Genome Biology, vol. 6, no. 5, article R42, 2005.

[139] B. Billoud, M. Guerrucci, M. Masselot, and J. S. Deutsch, "Cirripede phylogeny using a novel approach: molecular morphometrics," Molecular Biology and Evolution, vol. 17, no. 10, pp. 1435–1445, 2000.

[140] L. J. Collins, V. Moulton, and D. Penny, "Use of RNA secondary structure for studying the evolution of RNase P and RNase MRP," Journal of Molecular Evolution, vol. 51, no. 3, pp. 194–204, 2000.

[141] G. Caetano-Anollés, "Novel strategies to study the role of mutation and nucleic acid structure in evolution," Plant Cell, Tissue and Organ Culture, vol. 67, no. 2, pp. 115–132, 2001.

[142] F.-J. Sun, S. Fleurdepine, C. Bousquet-Antonelli, G. Caetano-Anollés, and J. Deragon, "Common evolutionary trends for SINE RNA structures," Trends in Genetics, vol. 23, no. 1, pp. 26–33, 2007.

[143] D. Caetano-Anollés, K. M. Kim, J. E. Mittenthal, and G. Caetano-Anollés, "Proteome evolution and the metabolic origins of translation and cellular life," Journal of Molecular Evolution, vol. 72, no. 1, pp. 14–33, 2011.

[144] M. Wang, Y. Jiang, K. M. Kim et al., "A universal molecular clock of protein folds and its power in tracing the early history of aerobic metabolism and planet oxygenation," Molecular Biology and Evolution, vol. 28, no. 1, pp. 567–582, 2011.

[145] S. A. Bukhari and G. Caetano-Anollés, "Origin and evolution of protein fold designs inferred from phylogenomic analysis of CATH domain structures in proteomes," PLoS Computational Biology, vol. 9, no. 3, Article ID e1003009, 2013.

[146] M. Vesteg and J. Krajcovic, "The falsifiability of the models for the origin of eukaryotes," Current Genetics, vol. 57, no. 6, pp. 367–390, 2011.

[147] I. N. Berezovsky and E. I. Shakhnovich, "Physics and evolution of thermophilic adaptation," Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 36, pp. 12742–12747, 2005.

[148] C. Woese, "The universal ancestor," Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 12, pp. 6854–6859, 1998.

[149] M. J. Seufferheld and G. Caetano-Anollés, "Phylogenomics supports a cellularly structured urancestor," Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 178–191, 2013.

[150] O. Kandler, "Cell wall biochemistry and three-domain concept of life," Systematic and Applied Microbiology, vol. 16, no. 4, pp. 501–509, 1994.

[151] C. R. Woese, "On the evolution of cells," Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 13, pp. 8742–8747, 2002.

[152] S. L. Baldauf, J. D. Palmer, and W. F. Doolittle, "The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 15, pp. 7749–7754, 1996.

[153] H. Leffers, J. Kjems, L. Østergaard, N. Larsen, and R. A. Garrett, "Evolutionary relationships amongst archaebacteria. A comparative study of 23 S ribosomal RNAs of a sulphur-dependent extreme thermophile, an extreme halophile and a thermophilic methanogen," Journal of Molecular Biology, vol. 195, no. 1, pp. 43–61, 1987.

[154] J. Martin, D. Blackburn, and E. Wiley, "Are node-based and stem-based clades equivalent? Insights from graph theory," PLoS Currents, vol. 2, Article ID RRN1196, 2010.

[Figure 12 graphic: three Venn diagrams, labeled Superfamilies, Topologies, and Architectures, each drawn over the sets Archaea, Bacteria, and Eukarya, with a scale for evolutionary conservation.]

Figure 12: Venn diagrams displaying the distributions of 2221 homologous superfamilies, 1152 topologies, and 38 architectures of CATH domains in the proteomes of 492 fully sequenced genomes (from [145]). All distributions highlight maximum sharing in the ancient ABE and BE taxonomic groups and minimal sharing in archaeal taxonomic groups.
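The tally behind a Venn diagram of this kind can be sketched in a few lines. The example below is only an illustration and not the procedure used in [145]: the domain identifiers and their presence/absence assignments are made-up toy data, and the function names `venn_group` and `venn_counts` are hypothetical. It assigns each domain to one of the seven taxonomic groups (ABE, AB, AE, BE, A, B, E) according to the superkingdoms in which it is detected, and then counts the domains in each group.

```python
from collections import Counter

# Hypothetical toy data: domain identifier -> superkingdoms in which it is detected.
# 'A' = Archaea, 'B' = Bacteria, 'E' = Eukarya. Identifiers are invented for illustration.
presence = {
    "dom1": {"A", "B", "E"},
    "dom2": {"B", "E"},
    "dom3": {"A", "B", "E"},
    "dom4": {"B"},
    "dom5": {"A", "E"},
    "dom6": {"E"},
}


def venn_group(superkingdoms):
    """Return a Venn label such as 'ABE' or 'BE' for a set of superkingdoms."""
    return "".join(k for k in "ABE" if k in superkingdoms)


def venn_counts(presence_map):
    """Count how many domains fall in each Venn group."""
    return Counter(venn_group(s) for s in presence_map.values())


counts = venn_counts(presence)
for group in ("ABE", "AB", "AE", "BE", "A", "B", "E"):
    # Counter returns 0 for groups that contain no domains.
    print(group, counts[group])
```

With real data, `presence` would be built from a genomic census of domain assignments; the grouping step itself would stay the same.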

[Figure 13 graphic: lineages progressing from grades to clades between Time 0 and Time 1.]

Figure 13: From grades to clades. The cartoon describes a possible progression of modes of organismal diversification during the rise of primordial archaeal lineages. The width of emerging lineages is proportional to uniquely identifying features of physiological and molecular complexity.

8. Through the Wormhole: The Makeup of the “Megaorganism” and the Emerging Archaeal Lineages

Character state reconstructions of proteome repertoires derived from ToLs, coupled to the timelines of ToDs, provide an effective way to define the ancestral protein domain complement of the urancestor [48] and, consequently, the likely makeup of the emerging archaeal lineages. The urancestral proteome possessed a lower bound of ~70 FSF domain structures, 75% of which were composed of α/β and α+β proteins. About 50% of FSFs were part of metabolic enzymes, including a rich toolkit of transferases and enzymes of nucleotide metabolism. The rest of the domains were involved in functions related to information (translation, replication, and repair), intracellular processes (transport, protein modification, and proteolytic activities), regulation (kinases/phosphatases and DNA binding functions), and small molecule binding. The urancestor had a limited repertoire of aaRSs and translation factors. It contained a primordial ribosome with a limited core of universal ribosomal proteins. It had numerous membrane proteins necessary for transport, including a relatively advanced ATP synthetase complex, and structures necessary for cellular organization (filaments and primordial cytoskeletal structures). The cells lacked enzymes for deoxyribonucleotide production, so it is likely that the cellular urancestor itself did not harbor a DNA genome. The cells also lacked functions related to extracellular processes (cell adhesion, immune response, and toxins/defense) and cellular motility, suggesting an ancient living world without competitive strategies of survival.

9. Conclusions

The rooting of the ToL has always been controversial in evolutionary biology [26, 152, 153]. While it is popularly accepted that the ToL based on sequence phylogenies is rooted in the akaryotes and that Archaea and Eukarya are sister groups to each other, only two of the three main steps of phylogenetic analysis [104] have been partially fulfilled with sequences. These are the selection of an appropriate statistical or nonstatistical evolutionary model of character change and of an optimization method for phylogenetic tree reconstruction. However, no adequate method exists for character polarization that identifies ancestral and derived character states in sequences. In the absence of robust polarization methodology, any statement about the rooting of the ToL should be considered suspect or subject to apriorism. Here we show that information derived from a genomic structural and functional census of millions of encoded proteins and RNAs, coupled with process models that comply with Weston’s generality criterion, provides the means to dissect the origins of diversified life. The generality criterion is fulfilled in these studies by focusing on the accumulation of modules, such as protein domain structures, elements of RNA substructures, or ontogenetic definitions of molecular function. In general, these features are the subject of accretion processes that comply with additive phylogenetic change within the nested taxonomic hierarchy and result in changes of abundance. These processes include those responsible for the growth of molecules (e.g., multidomain proteins), molecular ensembles (e.g., the ribosome), and molecular repertoires (e.g., proteomes). The new methods unfold a consistent evolutionary scenario in which the origin of diversified life traces back to the early history of Archaea. Remarkably, the archaic origin of this microbial urkingdom now does justice to its name.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank members and friends of the GCA laboratory for fruitful discussions. This research has been supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791) and the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625) to Gustavo Caetano-Anollés and from KRIBB Research Initiative Program and the Next-Generation BioGreen 21 Program, Rural Development Administration (PJ0090192013), to Kyung Mo Kim.

References

[1] C. R. Woese, “A new biology for a new century,” Microbiology and Molecular Biology Reviews, vol. 68, no. 2, pp. 173–186, 2004.

[2] T. Grant, “Testing methods: the evaluation of discovery operations in evolutionary biology,” Cladistics, vol. 18, no. 1, pp. 94–111, 2002.

[3] E. O. Wiley, “Karl R. Popper, systematics, and classification: a reply to Walter Bock and other evolutionary taxonomists,” Systematic Zoology, vol. 24, no. 2, pp. 233–243, 1975.

[4] J. S. Farris, “Parsimony and explanatory power,” Cladistics, vol. 24, no. 5, pp. 825–847, 2008.

[5] G. Caetano-Anollés, K. M. Kim, and D. Caetano-Anollés, “The phylogenomic roots of modern biochemistry: origins of proteins, cofactors and protein biosynthesis,” Journal of Molecular Evolution, vol. 74, no. 1-2, pp. 1–34, 2012.

[6] W. Fontana, “Modeling “evo-devo” with RNA,” BioEssays, vol. 24, no. 12, pp. 1164–1177, 2002.

[7] W. Hennig, Phylogenetic Systematics, University of Illinois Press, Urbana, Ill, USA, 1999.

[8] D. Fraix-Burnet, T. Chattopadhyay, A. K. Chattopadhyay, E. Davoust, and M. Thuillard, “A six-parameter space to describe galaxy diversification,” Astronomy and Astrophysics, vol. 545, article A80, 24 pages, 2012.

[9] E. Sober, “The contest between parsimony and likelihood,” Systematic Biology, vol. 53, no. 4, pp. 644–653, 2004.

[10] E. K. Lienau and R. DeSalle, “Is the microbial tree of life verificationist?” Cladistics, vol. 26, no. 2, pp. 195–201, 2010.

[11] J. S. Huxley, “Evolutionary processes and taxonomy with special reference to grades,” Uppsala Universitets Arsskrift, vol. 6, pp. 21–39, 1958.

[12] C. R. Woese and G. E. Fox, “Phylogenetic structure of the prokaryotic domain: the primary kingdoms,” Proceedings of the National Academy of Sciences of the United States of America, vol. 74, no. 11, pp. 5088–5090, 1977.

[13] P. Forterre, “The universal tree of life and the last universal cellular ancestor: revolution and counterrevolutions,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 43–62, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[14] J. P. Gogarten, H. Kibak, P. Dittrich et al., “Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 17, pp. 6661–6665, 1989.

[15] N. Iwabe, K. Kuma, M. Hasegawa, S. Osawa, and T. Miyata, “Evolutionary relationship of archaebacteria, eubacteria, and eukaryotes inferred from phylogenetic trees of duplicated genes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 86, no. 23, pp. 9355–9359, 1989.

[16] J. P. Gogarten and L. Olendzenski, “Orthologs, paralogs and genome comparisons,” Current Opinion in Genetics and Development, vol. 9, no. 6, pp. 630–636, 1999.

[17] C. R. Woese, O. Kandler, and M. L. Wheelis, “Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya,” Proceedings of the National Academy of Sciences of the United States of America, vol. 87, no. 12, pp. 4576–4579, 1990.

[18] J. Martijn and T. J. Ettema, “From archaeon to eukaryote: the evolutionary dark ages of the eukaryotic cell,” Biochemical Society Transactions, vol. 41, no. 1, pp. 451–457, 2013.

[19] C. J. Cox, P. G. Foster, R. P. Hirt, S. R. Harris, and T. M. Embley, “The archaebacterial origin of eukaryotes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 51, pp. 20356–20361, 2008.

[20] P. G. Foster, C. J. Cox, and T. M. Embley, “The primary divisions of life: a phylogenomic approach employing composition-heterogeneous methods,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2197–2207, 2009.

[21] T. A. Williams, P. G. Foster, T. M. Nye, C. J. Cox, and T. M. Embley, “A congruent phylogenomic signal places eukaryotes within the Archaea,” Proceedings of the Royal Society B: Biological Sciences, vol. 279, no. 1749, pp. 4870–4879, 2012.

[22] S. Gribaldo, A. M. Poole, V. Daubin, P. Forterre, and C. Brochier-Armanet, “The origin of eukaryotes and their relationship with the Archaea: are we at a phylogenomic impasse?” Nature Reviews Microbiology, vol. 8, no. 10, pp. 743–752, 2010.

[23] H. Philippe and P. Forterre, “The rooting of the universal tree of life is not reliable,” Journal of Molecular Evolution, vol. 49, no. 4, pp. 509–523, 1999.

[24] E. Sober and M. Steel, “Testing the hypothesis of common ancestry,” Journal of Theoretical Biology, vol. 218, no. 4, pp. 395–408, 2002.

[25] P. Forterre, “The common ancestor of Archaea and Eukarya was not an archaeon,” Archaea, vol. 2013, Article ID 372396, 18 pages, 2013.

[26] P. Forterre and H. Philippe, “Where is the root of the universal tree of life?” BioEssays, vol. 21, no. 10, pp. 871–879, 1999.

[27] S. Gribaldo and H. Philippe, “Ancient phylogenetic relationships,” Theoretical Population Biology, vol. 61, no. 4, pp. 391–408, 2002.

[28] J. K. Harris, S. T. Kelley, G. B. Spiegelman, and N. R. Pace, “The genetic core of the universal ancestor,” Genome Research, vol. 13, no. 3, pp. 407–412, 2003.

[29] A. Harish and G. Caetano-Anollés, “Ribosomal history reveals origins of modern protein synthesis,” PLoS ONE, vol. 7, no. 3, Article ID e32776, 2012.

[30] G. Caetano-Anollés and M. J. Seufferheld, “The coevolutionary roots of biochemistry and cellular organization challenge the RNA world paradigm,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 152–177, 2013.

[31] N. R. Pace, “Mapping the tree of life: progress and prospects,” Microbiology and Molecular Biology Reviews, vol. 73, no. 4, pp. 565–576, 2009.

[32] P. de Rijk, Y. van de Peer, I. van den Broeck, and R. de Wachter, “Evolution according to large ribosomal subunit RNA,” Journal of Molecular Evolution, vol. 41, no. 3, pp. 366–375, 1995.

[33] G. Caetano-Anollés, “Tracing the evolution of RNA structure in ribosomes,” Nucleic Acids Research, vol. 30, no. 11, pp. 2575–2587, 2002.

[34] J. Mallatt and C. J. Winchell, “Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes,” Molecular Biology and Evolution, vol. 19, no. 3, pp. 289–301, 2002.

[35] M. Gerstein, “Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census,” Proteins: Structure, Function and Bioinformatics, vol. 33, no. 4, pp. 518–534, 1998.

[36] M. Gerstein and H. Hegyi, “Comparing genomes in terms of protein structure: surveys of a finite parts list,” FEMS Microbiology Reviews, vol. 22, no. 4, pp. 277–304, 1998.

[37] B. Snel, P. Bork, and M. A. Huynen, “Genome phylogeny based on gene content,” Nature Genetics, vol. 21, no. 1, pp. 108–110, 1999.

[38] F. Tekaia, A. Lazcano, and B. Dujon, “The genomic tree as revealed from whole proteome comparisons,” Genome Research, vol. 9, no. 6, pp. 550–557, 1999.

[39] Y. I. Wolf, S. E. Brenner, P. A. Bash, and E. V. Koonin, “Distribution of protein folds in the three superkingdoms of life,” Genome Research, vol. 9, no. 1, pp. 17–26, 1999.

[40] J. O. Korbel, B. Snel, M. A. Huynen, and P. Bork, “SHOT: a web server for the construction of genome phylogenies,” Trends in Genetics, vol. 18, no. 3, pp. 158–162, 2002.

[41] G. Caetano-Anollés and D. Caetano-Anollés, “An evolutionarily structural universe of protein architecture,” Genome Research, vol. 13, no. 7, pp. 1563–1571, 2003.

[42] S. L. Baldauf, A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle, “A kingdom-level phylogeny of eukaryotes based on combined protein data,” Science, vol. 290, no. 5493, pp. 972–977, 2000.

[43] J. R. Brown, C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope, “Universal trees based on large combined protein sequence data sets,” Nature Genetics, vol. 28, no. 3, pp. 281–285, 2001.

[44] F. D. Ciccarelli, T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P. Bork, “Toward automatic reconstruction of a highly resolved tree of life,” Science, vol. 311, no. 5765, pp. 1283–1287, 2006.

[45] T. Dagan, M. Roettger, D. Bryant, and W. Martin, “Genome networks root the tree of life between prokaryotic domains,” Genome Biology and Evolution, vol. 2, no. 1, pp. 379–392, 2010.

[46] S. Jun, G. E. Sims, G. A. Wu, and S. Kim, “Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 133–138, 2010.

[47] E. A. Schultes, P. T. Hraber, and T. H. LaBean, “No molecule is an island: molecular evolution and the study of sequence space,” in Algorithmic Bioprocesses, A. Condon, D. Harel, J. N. Kok, A. Salomaa, and E. Winfree, Eds., pp. 675–704, Springer, Berlin, Germany, 2009.

[48] K. M. Kim and G. Caetano-Anollés, “The proteomic complexity and rise of the primordial ancestor of diversified life,” BMC Evolutionary Biology, vol. 11, no. 1, article 140, 2011.

[49] V. A. Albert, Parsimony, Phylogeny, and Genomics, Oxford University Press, Oxford, UK, 2005.

[50] G. Caetano-Anollés, “Evolved RNA secondary structure and the rooting of the universal tree of life,” Journal of Molecular Evolution, vol. 54, no. 3, pp. 333–345, 2002.

[51] J. A. Lake, R. G. Skophammer, C. W. Herbold, and J. A. Servin, “Genome beginnings: rooting the tree of life,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2177–2185, 2009.

[52] B. Guo, M. Zou, and A. Wagner, “Pervasive indels and their evolutionary dynamics after the fish-specific genome duplication,” Molecular Biology and Evolution, vol. 29, no. 10, pp. 3005–3022, 2012.

[53] M. K. Basu, I. B. Rogozin, O. Deusch, T. Dagan, W. Martin, and E. V. Koonin, “Evolutionary dynamics of introns in plastid-derived genes in plants: saturation nearly reached but slow intron gain continues,” Molecular Biology and Evolution, vol. 25, no. 1, pp. 111–119, 2008.

[54] J. S. Farris, “Phylogenetic analysis under Dollo’s Law,” Systematic Biology, vol. 26, no. 1, pp. 77–88, 1977.

[55] H. Fang, M. E. Oates, R. B. Pethica et al., “A daily-updated tree of (sequenced) life as a reference for genome research,” Scientific Reports, vol. 3, article 2015, 2013.

[56] T. Cavalier-Smith, “The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification,” International Journal of Systematic and Evolutionary Microbiology, vol. 52, no. 1, pp. 7–76, 2002.

[57] T. Cavalier-Smith, “Rooting the tree of life by transition analyses,” Biology Direct, vol. 1, article 19, 2006.

[58] H. S. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Widespread recruitment of ancient domain structures in modern enzymes during metabolic evolution,” Journal of Integrative Bioinformatics, vol. 10, no. 1, article 214, 2013.

[59] G. Caetano-Anollés, M. Wang, and D. Caetano-Anollés, “Structural phylogenomics retrodicts the origin of the genetic code and uncovers the evolutionary impact of protein flexibility,” PLoS ONE, vol. 8, no. 8, Article ID e72225, 2013.

[60] E. Bapteste and C. Brochier, “On the conceptual difficulties in rooting the tree of life,” Trends in Microbiology, vol. 12, no. 1, pp. 9–13, 2004.

[61] A. M. Poole, “Horizontal gene transfer and the earliest stages of the evolution of life,” Research in Microbiology, vol. 160, no. 7, pp. 473–480, 2009.

[62] D. Raoult, “The post-Darwinist rhizome of life,” The Lancet, vol. 375, no. 9709, pp. 104–105, 2010.

[63] W. F. Doolittle, “Phylogenetic classification and the universal tree,” Science, vol. 284, no. 5423, pp. 2124–2128, 1999.

[64] T. Dagan and W. Martin, “The tree of one percent,” Genome Biology, vol. 7, no. 10, article 118, 2006.

[65] W. F. Doolittle and E. Bapteste, “Pattern pluralism and the tree of life hypothesis,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 7, pp. 2043–2049, 2007.

[66] E. V. Koonin, “Towards a postmodern synthesis of evolutionary biology,” Cell Cycle, vol. 8, no. 6, pp. 799–800, 2009.

[67] T. Kloesges, O. Popa, W. Martin, and T. Dagan, “Networks of gene sharing among 329 proteobacterial genomes reveal differences in lateral gene transfer frequency at different phylogenetic depths,” Molecular Biology and Evolution, vol. 28, no. 2, pp. 1057–1074, 2011.

[68] S. Halary, J. W. Leigh, B. Cheaib, P. Lopez, and E. Bapteste, “Network analyses structure genetic diversity in independent genetic worlds,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 127–132, 2010.

[69] S. S. Abby, E. Tannier, M. Gouy, and V. Daubin, “Lateral gene transfer as a support for the tree of life,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 13, pp. 4962–4967, 2012.

[70] N. Glansdorff, Y. Xu, and B. Labedan, “The conflict between horizontal gene transfer and the safeguard of identity: origin of meiotic sexuality,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 470–480, 2009.

[71] C. G. Kurland and O. G. Berg, “A hitchhiker’s guide to evolving networks,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 361–396, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[72] L. P. Villarreal and G. Witzany, “The DNA habitat and its RNA inhabitants: at the dawn of RNA sociology,” Genomics Insights, vol. 6, pp. 1–12, 2013.

[73] S. Gribaldo and C. Brochier, “Phylogeny of prokaryotes: does it exist and why should we care?” Research in Microbiology, vol. 160, no. 7, pp. 513–521, 2009.

[74] B. D. Mishler, “The logic of the data matrix in phylogenetic analysis,” in Parsimony, Phylogeny, and Genomics, V. A. Albert, Ed., pp. 57–69, Oxford University Press, New York, NY, USA, 2005.

[75] A. Dress, V. Moulton, M. Steel, and T. Wu, “Species, clusters and the “tree of life”: a graph-theoretic perspective,” Journal of Theoretical Biology, vol. 265, no. 4, pp. 535–542, 2010.

[76] N. Lane and W. Martin, “The energetics of genome complexity,” Nature, vol. 467, no. 7318, pp. 928–934, 2010.

[77] M. C. Rivera and J. A. Lake, “The ring of life provides evidence for a genome fusion origin of eukaryotes,” Nature, vol. 431, no. 7005, pp. 152–155, 2004.

[78] D. Alvarez-Ponce and J. O. McInerney, “The human genome retains relics of its prokaryotic ancestry: human genes of archaebacterial and eubacterial origin exhibit remarkable differences,” Genome Biology and Evolution, vol. 3, no. 1, pp. 782–790, 2011.

[79] A. Poole and D. Penny, “Eukaryote evolution: engulfed by speculation,” Nature, vol. 447, no. 7147, p. 913, 2007.

[80] E. V. Koonin and Y. I. Wolf, “The fundamental units, processes and patterns of evolution, and the tree of life conundrum,” Biology Direct, vol. 4, no. 1, article 33, 2009.

[81] P. Puigbo, Y. I. Wolf, and E. V. Koonin, “Search for a “tree of life” in the thicket of the phylogenetic forest,” Journal of Biology, vol. 8, no. 6, article 59, 2009.

[82] P. Puigbo, Y. I. Wolf, and E. V. Koonin, “Seeing the tree of life behind the phylogenetic forest,” BMC Biology, vol. 11, article 46, 2013.

[83] D. B. Wetlaufer, “Nucleation, rapid folding, and globular intrachain regions in proteins,” Proceedings of the National Academy of Sciences of the United States of America, vol. 70, no. 3, pp. 697–701, 1973.

[84] J. S. Richardson, “The anatomy and taxonomy of protein structure,” Advances in Protein Chemistry, vol. 34, pp. 167–339, 1981.

[85] M. Riley and B. Labedan, “Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module,” Journal of Molecular Biology, vol. 268, no. 5, pp. 857–868, 1997.

[86] J. Janin and S. J. Wodak, “Structural domains in proteins and their role in the dynamics of protein function,” Progress in Biophysics and Molecular Biology, vol. 42, pp. 21–78, 1983.

[87] G. Caetano-Anollés, M. Wang, D. Caetano-Anollés, and J. E. Mittenthal, “The origin, evolution and structure of the protein world,” Biochemical Journal, vol. 417, no. 3, pp. 621–637, 2009.

[88] J. Gough, “Convergent evolution of domain architectures (is rare),” Bioinformatics, vol. 21, no. 8, pp. 1464–1471, 2005.

[89] K. Forslund, A. Henricson, V. Hollich, and E. L. L. Sonnhammer, “Domain tree-based analysis of protein architecture evolution,” Molecular Biology and Evolution, vol. 25, no. 2, pp. 254–264, 2008.

[90] S. Yang and P. E. Bourne, “The evolutionary history of protein domains viewed by species phylogeny,” PLoS ONE, vol. 4, no. 12, Article ID e8378, 2009.

[91] K. M. Kim and G. Caetano-Anollés, “The evolutionary history of protein fold families and proteomes confirms that the archaeal ancestor is more ancient than the ancestors of other superkingdoms,” BMC Evolutionary Biology, vol. 12, article 13, 2012.

[92] M. Wang and G. Caetano-Anollés, “The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world,” Structure, vol. 17, no. 1, pp. 66–78, 2009.

[93] G. Caetano-Anollés and A. Nasir, “Benefits of using molecular structure and abundance in phylogenomic analysis,” Frontiers in Genetics, vol. 3, article 172, 2012.

[94] K. Brading and E. Castellani, Symmetries in Physics: Philosophical Reflections, Cambridge University Press, Cambridge, UK, 2003.

[95] J. A. Cotton and M. Wilkinson, “Quantifying the potential utility of phylogenetic characters,” Taxon, vol. 57, no. 1, pp. 131–136, 2008.

[96] E. Mossel and M. Steel, “A phase transition for a random cluster model on phylogenetic trees,” Mathematical Biosciences, vol. 187, no. 2, pp. 189–203, 2004.

[97] M. H. Bailor, X. Sun, and H. M. Al-Hashimi, “Topology links RNA secondary structure with global conformation, dynamics, and adaptation,” Science, vol. 327, no. 5962, pp. 202–206, 2010.

[98] K. Illergard, D. H. Ardell, and A. Elofsson, “Structure is three to ten times more conserved than sequence: a study of structural response in protein cores,” Proteins: Structure, Function and Bioinformatics, vol. 77, no. 3, pp. 499–508, 2009.

[99] G. Caetano-Anollés, S. K. Hee, and J. E. Mittenthal, “The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 22, pp. 9358–9363, 2007.

[100] P. H. Weston, “Methods for rooting cladistic trees,” in Models in Phylogeny Reconstruction, D. J. Siebert, R. W. Scotland, and D. M. Williams, Eds., pp. 125–155, Oxford University Press, Oxford, UK, 1994.

[101] H. N. Bryant, “The polarization of character transformations in phylogenetic systematics: role of axiomatic and auxiliary assumptions,” Systematic Biology, vol. 40, no. 4, pp. 433–445, 1991.

[102] H. N. Bryant and G. Wagner, “Character polarity and the rooting of cladograms,” in The Character Concept in Evolutionary Biology, G. P. Wagner, Ed., pp. 319–338, Academic Press, New York, NY, USA, 2001.

[103] G. Nelson, “Ontogeny, phylogeny, paleontology, and the biogenetic law,” Systematic Biology, vol. 27, no. 3, pp. 324–345, 1978.

[104] P. H. Weston, “Indirect and direct methods in systematics,” in Ontogeny and Systematics, C. J. Humphries, Ed., pp. 27–56, Columbia University Press, New York, NY, USA, 1988.

[105] F.-J. Sun and G. Caetano-Anollés, “The evolutionary history of the structure of 5S ribosomal RNA,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 430–443, 2009.

[106] X. Fang, T. Pan, and T. R. Sosnick, “Mg2+-dependent folding of a large ribozyme without kinetic traps,” Nature Structural Biology, vol. 6, no. 12, pp. 1091–1095, 1999.

[107] F.-J. Sun and G. Caetano-Anollés, “Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses,” PLoS Computational Biology, vol. 4, no. 3, Article ID e1000018, 2008.

[108] F.-J. Sun and G. Caetano-Anollés, “The origin and evolution of tRNA inferred from phylogenetic analysis of structure,” Journal of Molecular Evolution, vol. 66, no. 1, pp. 21–35, 2008.

[109] W. R. Ashby, An Introduction to Cybernetics, Taylor & Francis, London, UK, 1955.

[110] J. G. Lundberg, “Wagner networks and ancestors,” Systematic Biology, vol. 21, no. 4, pp. 398–413, 1972.

[111] P. T. Chippindale and J. J. Wiens, “Weighting, partitioning, and combining characters in phylogenetic analysis,” Systematic Biology, vol. 43, no. 2, pp. 278–287, 1994.

[112] J. Mittenthal, D. Caetano-Anollés, and G. Caetano-Anollés, “Biphasic patterns of diversification and the emergence of modules,” Frontiers in Genetics, vol. 3, article 147, 2012.

[113] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[114] C. A. Orengo, A. Michie, S. Jones, D. T. Jones, M. Swindells, and J. M. Thornton, “CATH: a hierarchic classification of protein domain structures,” Structure, vol. 5, no. 8, pp. 1093–1109, 1997.

[115] G. Sawa, J. Dicks, and I. N. Roberts, “Current approaches to whole genome phylogenetic analysis,” Briefings in Bioinformatics, vol. 4, no. 1, pp. 63–74, 2003.

[116] R. Sathyapriya, J. M. Duarte, H. Stehr, I. Filippis, and M. Lappe, “Defining an essence of structure determining residue contacts in proteins,” PLoS Computational Biology, vol. 5, no. 12, Article ID e1000584, 2009.

[117] D. Penny and L. J. Collins, “Evolutionary genomics leads the way,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 1–16, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[118] M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.

[119] A. Andreeva, D. Howorth, J. Chandonia et al., “Data growth and its impact on the SCOP database: new developments,” Nucleic Acids Research, vol. 36, pp. D419–D425, 2008.

[120] L. S. Yafremava, M. Wielgos, S. Thomas et al., “A general framework of persistence strategies for biological systems helps explain domains of life,” Frontiers in Genetics, vol. 4, article 16, 2013.

[121] J. L. Friar, T. Goldman, and J. Perez-Mercader, “Genome sizes and the Benford distribution,” PLoS ONE, vol. 7, no. 5, Article ID e36624, 2012.

[122] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya,” BMC Evolutionary Biology, vol. 12, article 156, 2012.

[123] A. Nasir and G. Caetano-Anollés, “Comparative analysis of proteomes and functionomes provides insights into origins of cellular diversification,” Archaea, vol. 2013, Article ID 648746, 13 pages, 2013.

[124] W. Martin and M. Muller, “The hydrogen hypothesis for the first eukaryote,” Nature, vol. 392, no. 6671, pp. 37–41, 1998.

[125] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Global patterns of protein domain gain and loss in superkingdoms,” PLoS Computational Biology, vol. 10, no. 1, Article ID e1003452, 2014.

[126] M. Wang, L. S. Yafremava, D. Caetano-Anollés, J. E. Mittenthal, and G. Caetano-Anollés, “Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world,” Genome Research, vol. 17, no. 11, pp. 1572–1585, 2007.

[127] L. Guy and T. J. G. Ettema, “The archaeal “TACK” superphylum and the origin of eukaryotes,” Trends in Microbiology, vol. 19, no. 12, pp. 580–587, 2011.

[128] D. Schlieper, M. A. Oliva, J. M. Andreu, and J. Lowe, “Structure of bacterial tubulin BtubA/B: evidence for horizontal gene transfer,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 26, pp. 9170–9175, 2005.

[129] M. P. Hoeppner, P. P. Gardner, and A. M. Poole, “Comparative analysis of RNA families reveals distinct repertoires for each domain of life,” PLoS Computational Biology, vol. 8, no. 11, Article ID e1002752, 2012.

[130] H. Xue, K. Tong, C. Marck, H. Grosjean, and J. T. Wong, “Transfer RNA paralogs: evidence for genetic code-amino acid biosynthesis coevolution and an archaeal root of life,” Gene, vol. 310, no. 1-2, pp. 59–66, 2003.

[131] H. Xue, S. Ng, K. Tong, and J. T. Wong, “Congruence of evidence for a Methanopyrus-proximal root of life based on transfer RNA and aminoacyl-tRNA synthetase genes,” Gene, vol. 360, no. 2, pp. 120–130, 2005.

[132] M. Di Giulio, “Nanoarchaeum equitans is a living fossil,” Journal of Theoretical Biology, vol. 242, no. 1, pp. 257–260, 2006.

[133] M. Di Giulio, “The tree of life might be rooted in the branch leading to Nanoarchaeota,” Gene, vol. 401, no. 1-2, pp. 108–113, 2007.

[134] J. T. Wong, J. Chen, W. Mat, S. Ng, and H. Xue, “Polyphasic evidence delineating the root of life and roots of biological domains,” Gene, vol. 403, no. 1-2, pp. 39–52, 2007.

[135] F.-J. Sun and G. Caetano-Anollés, “The ancient history of the structure of ribonuclease P and the early origins of Archaea,” BMC Bioinformatics, vol. 11, article 153, 2010.

[136] C. Marck and H. Grosjean, “tRNomics: analysis of tRNA genes from 50 genomes of Eukarya, Archaea, and bacteria reveals anticodon-sparing strategies and domain-specific features,” RNA, vol. 8, no. 10, pp. 1189–1232, 2002.

[137] W. Mat, H. Xue, and J. T. Wong, “The genomics of LUCA,” Frontiers in Bioscience, vol. 13, no. 14, pp. 5605–5613, 2008.

[138] C. Brochier, S. Gribaldo, Y. Zivanovic, F. Confalonieri, and P. Forterre, “Nanoarchaea: representatives of a novel archaeal phylum or a fast-evolving euryarchaeal lineage related to Thermococcales?” Genome Biology, vol. 6, no. 5, article R42, 2005.

[139] B Billoud M Guerrucci M Masselot and J S DeutschldquoCirripede phylogeny using a novel approach molecular mor-phometricsrdquoMolecular Biology and Evolution vol 17 no 10 pp1435ndash1445 2000

[140] L J Collins V Moulton and D Penny ldquoUse of RNA secondarystructure for studying the evolution of RNase P and RNaseMRPrdquo Journal of Molecular Evolution vol 51 no 3 pp 194ndash2042000

[141] G Caetano-Anolles ldquoNovel strategies to study the role ofmutation and nucleic acid structure in evolutionrdquo Plant CellTissue and Organ Culture vol 67 no 2 pp 115ndash132 2001

[142] F-J Sun S Fleurdepine C Bousquet-Antonelli G Caetano-Anolles and J Deragon ldquoCommon evolutionary trends forSINE RNA structuresrdquo Trends in Genetics vol 23 no 1 pp 26ndash33 2007

[143] D Caetano-Anolles K M Kim J E Mittenthal and GCaetano-Anolles ldquoProteome evolution and the metabolic ori-gins of translation and cellular liferdquo Journal of MolecularEvolution vol 72 no 1 pp 14ndash33 2011

26 Archaea

[144] MWang Y Jiang KMKim et al ldquoA universalmolecular clockof protein folds and its power in tracing the early history ofaerobic metabolism and planet oxygenationrdquoMolecular Biologyand Evolution vol 28 no 1 pp 567ndash582 2011

[145] S A Bukhari and G Caetano-Anolles ldquoOrigin and evolutionof protein fold designs inferred from phylogenomic analysis ofCATH domain structures in proteomesrdquo PLoS ComputationalBiology vol 9 no 3 Article ID e1003009 2013

[146] M Vesteg and J Krajcovic ldquoThe falsifiability of the models forthe origin of eukaryotesrdquo Current Genetics vol 57 no 6 pp367ndash390 2011

[147] I N Berezovsky and E I Shakhnovich ldquoPhysics and evolu-tion of thermophilic adaptationrdquo Proceedings of the NationalAcademy of Sciences of the United States of America vol 102 no36 pp 12742ndash12747 2005

[148] C Woese ldquoThe universal ancestorrdquo Proceedings of the NationalAcademy of Sciences of the United States of America vol 95 no12 pp 6854ndash6859 1998

[149] M J Seufferheld andG Caetano-Anolles ldquoPhylogenomics sup-ports a cellularly structured urancestorrdquo Journal of MolecularMicrobiology and Biotechnology vol 23 no 1-2 pp 178ndash1912013

[150] O Kandler ldquoCell wall biochemistry and three-domain conceptof liferdquo Systematic and Applied Microbiology vol 16 no 4 pp501ndash509 1994

[151] C R Woese ldquoOn the evolution of cellsrdquo Proceedings of theNational Academy of Sciences of the United States of Americavol 99 no 13 pp 8742ndash8747 2002

[152] S L Baldauf J D Palmer and W F Doolittle ldquoThe root of theuniversal tree and the origin of eukaryotes based on elongationfactor phylogenyrdquo Proceedings of the National Academy ofSciences of the United States of America vol 93 no 15 pp 7749ndash7754 1996

[153] H Leffers J Kjems L Oslashstergaard N Larsen and R AGarrett ldquoEvolutionary relationships amongst archaebacteriaA comparative study of 23 S ribosomal RNAs of a sulphur-dependent extreme thermophile an extreme halophile and athermophilic methanogenrdquo Journal of Molecular Biology vol195 no 1 pp 43ndash61 1987

[154] J Martin D Blackburn and E Wiley ldquoAre node-based andstem-based clades equivalent Insights from graph theoryrdquoPLoS Currents vol 2 Article ID RRN1196 2010


Acknowledgments

The authors would like to thank members and friends of the GCA laboratory for fruitful discussions. This research has been supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791) and the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625) to Gustavo Caetano-Anollés and from the KRIBB Research Initiative Program and the Next-Generation BioGreen 21 Program, Rural Development Administration (PJ0090192013), to Kyung Mo Kim.

References

[1] C R Woese ldquoA new biology for a new centuryrdquo MicrobiologyandMolecular Biology Reviews vol 68 no 2 pp 173ndash186 2004

[2] T Grant ldquoTesting methods the evaluation of discovery opera-tions in evolutionary biologyrdquo Cladistics vol 18 no 1 pp 94ndash111 2002

[3] E O Wiley ldquoKarl R Popper systematics and classificationa reply to Walter Bock and other evolutionary taxonomistsrdquoSystematic Zoology vol 24 no 2 pp 233ndash243 1975

[4] J S Farris ldquoParsimony and explanatory powerrdquo Cladistics vol24 no 5 pp 825ndash847 2008

[5] G Caetano-Anolles K M Kim and D Caetano-Anolles ldquoThephylogenomic roots of modern biochemistry origins of pro-teins cofactors and protein biosynthesisrdquo Journal of MolecularEvolution vol 74 no 1-2 pp 1ndash34 2012

[6] W Fontana ldquoModeling ldquoevo-devordquo with RNArdquo BioEssays vol24 no 12 pp 1164ndash1177 2002

[7] WHennigPhylogenetic Systematics University of Illinois PressUrbana Ill USA 1999

[8] D Fraix-Burnet T Chattopadhyay A K Chattopadhyay EDavoust and M Thuillard ldquoA six-parameter space to describegalaxy diversificationrdquo Astronomy and Astrophysics vol 545article A80 24 pages 2012

[9] E Sober ldquoThe contest between parsimony and likelihoodrdquoSystematic Biology vol 53 no 4 pp 644ndash653 2004

[10] E K Lienau and R DeSalle ldquoIs the microbial tree of lifeverificationistrdquo Cladistics vol 26 no 2 pp 195ndash201 2010

[11] J S Huxley ldquoEvolutionary processes and taxonomywith specialreference to gradesrdquoUppsalaUniversitets Arsskrift vol 6 pp 21ndash39 1958

[12] C R Woese and G E Fox ldquoPhylogenetic structure of theprokaryotic domain the primary kingdomsrdquo Proceedings of theNational Academy of Sciences of the United States of Americavol 74 no 11 pp 5088ndash5090 1977

[13] P Forterre ldquoThe universal tree of life and the last universalcellular ancestor revolution and counterrevolutionsrdquo in Evolu-tionary Genomics and Systems Biology G Caetano-Anolles Edpp 43ndash62 Wiley-Blackwell Hoboken NJ USA 2010

[14] J P Gogarten H Kibak P Dittrich et al ldquoEvolution of thevacuolar H+-ATPase implications for the origin of eukaryotesrdquoProceedings of the National Academy of Sciences of the UnitedStates of America vol 86 no 17 pp 6661ndash6665 1989

[15] N Iwabe K Kuma M Hasegawa S Osawa and T MiyataldquoEvolutionary relationship of archaebacteria eubacteria andeukaryotes inferred from phylogenetic trees of duplicatedgenesrdquo Proceedings of the National Academy of Sciences of theUnited States of America vol 86 no 23 pp 9355ndash9359 1989

[16] J P Gogarten and L Olendzenski ldquoOrthologs paralogs andgenome comparisonsrdquo Current Opinion in Genetics and Devel-opment vol 9 no 6 pp 630ndash636 1999

[17] C R Woese O Kandler and M L Wheelis ldquoTowards anatural system of organisms proposal for the domains ArchaeaBacteria and Eucaryardquo Proceedings of the National Academy ofSciences of the United States of America vol 87 no 12 pp 4576ndash4579 1990

[18] J Martijn and T J Ettema ldquoFrom archaeon to eukaryotethe evolutionary dark ages of the eukaryotic cellrdquo BiochemicalSociety Transactions vol 41 no 1 pp 451ndash457 2013

[19] C J Cox P G Foster R P Hirt S R Harris and T M EmbleyldquoThe archaebacterial origin of eukaryotesrdquo Proceedings of theNational Academy of Sciences of the United States of Americavol 105 no 51 pp 20356ndash20361 2008

[20] P G Foster C J Cox and TM Embley ldquoThe primary divisionsof life a phylogenomic approach employing composition-heterogeneousmethodsrdquoPhilosophical Transactions of the RoyalSociety B Biological Sciences vol 364 no 1527 pp 2197ndash22072009

[21] T A Williams P G Foster T M Nye C J Cox and T MEmbley ldquoA congruent phylogenomic signal places eukaryoteswithin the Archaeardquo Proceedings of the Royal Society B Biologi-cal Sciences vol 279 no 1749 pp 4870ndash4879 2012

[22] S Gribaldo A M Poole V Daubin P Forterre and CBrochier-Armanet ldquoThe origin of eukaryotes and their rela-tionship with the Archaea are we at a phylogenomic impasserdquoNature Reviews Microbiology vol 8 no 10 pp 743ndash752 2010

[23] H Philippe and P Forterre ldquoThe rooting of the universal tree oflife is not reliablerdquo Journal of Molecular Evolution vol 49 no 4pp 509ndash523 1999

[24] E Sober and M Steel ldquoTesting the hypothesis of commonancestryrdquo Journal of Theoretical Biology vol 218 no 4 pp 395ndash408 2002

[25] P Forterre ldquoThe common ancestor of Archaea and Eukarya wasnot an archaeonrdquoArchaea vol 2013 Article ID 372396 18 pages2013

[26] P Forterre and H Philippe ldquoWhere is the root of the universaltree of liferdquo BioEssays vol 21 no 10 pp 871ndash879 1999

[27] S Gribaldo and H Philippe ldquoAncient phylogenetic relation-shipsrdquoTheoretical Population Biology vol 61 no 4 pp 391ndash4082002

[28] J K Harris S T Kelley G B Spiegelman and N R Pace ldquoThegenetic core of the universal ancestorrdquoGenome Research vol 13no 3 pp 407ndash412 2003

[29] A Harish and G Caetano-Anolles ldquoRibosomal history revealsorigins of modern protein synthesisrdquo PLoS ONE vol 7 no 3Article ID e32776 2012

[30] G Caetano-Anolles and M J Seufferheld ldquoThe coevolutionaryroots of biochemistry and cellular organization challenge theRNA world paradigmrdquo Journal of Molecular Microbiology andBiotechnology vol 23 no 1-2 pp 152ndash177 2013

[31] N R Pace ldquoMapping the tree of life progress and prospectsrdquoMicrobiology and Molecular Biology Reviews vol 73 no 4 pp565ndash576 2009

[32] P de Rijk Y van de Peer I van den Broeck and R de WachterldquoEvolution according to large ribosomal subunit RNArdquo Journalof Molecular Evolution vol 41 no 3 pp 366ndash375 1995

[33] G Caetano-Anolles ldquoTracing the evolution of RNA structurein ribosomesrdquo Nucleic Acids Research vol 30 no 11 pp 2575ndash2587 2002

Archaea 23

[34] J Mallatt and C J Winchell ldquoTesting the new animal phy-logeny first use of combined large-subunit and small-subunitrRNA gene sequences to classify the protostomesrdquo MolecularBiology and Evolution vol 19 no 3 pp 289ndash301 2002

[35] M Gerstein ldquoPatterns of proteinminusfold usage in eight microbialgenomes a comprehensive structural censusrdquo Proteins Struc-ture Function and Bioinformatics vol 33 no 4 pp 518ndash5341998

[36] M Gerstein and H Hegyi ldquoComparing genomes in termsof protein structure surveys of a finite parts listrdquo FEMSMicrobiology Reviews vol 22 no 4 pp 277ndash304 1998

[37] B Snel P Bork and M A Huynen ldquoGenome phylogeny basedon gene contentrdquo Nature Genetics vol 21 no 1 pp 108ndash1101999

[38] F Tekaia A Lazcano and B Dujon ldquoThe genomic tree asrevealed fromwhole proteome comparisonsrdquoGenomeResearchvol 9 no 6 pp 550ndash557 1999

[39] Y I Wolf S E Brenner P A Bash and E V KooninldquoDistribution of protein folds in the three superkingdoms ofliferdquo Genome Research vol 9 no 1 pp 17ndash26 1999

[40] J O Korbel B Snel M A Huynen and P Bork ldquoSHOT a webserver for the construction of genome phylogeniesrdquo Trends inGenetics vol 18 no 3 pp 158ndash162 2002

[41] G Caetano-Anolles andD Caetano-Anolles ldquoAn evolutionarilystructural universe of protein architecturerdquo Genome Researchvol 13 no 7 pp 1563ndash1571 2003

[42] S L Baldauf A J Roger I Wenk-Siefert and W F DoolittleldquoA kingdom-level phylogeny of eukaryotes based on combinedprotein datardquo Science vol 290 no 5493 pp 972ndash977 2000

[43] J R Brown C J Douady M J Italia W E Marshall and MJ Stanhope ldquoUniversal trees based on large combined proteinsequence data setsrdquo Nature Genetics vol 28 no 3 pp 281ndash2852001

[44] F D Ciccarelli T Doerks C von Mering C J Creevey BSnel and P Bork ldquoToward automatic reconstruction of a highlyresolved tree of liferdquo Science vol 311 no 5765 pp 1283ndash12872006

[45] T Dagan M Roettger D Bryant and W Martin ldquoGenomenetworks root the tree of life between prokaryotic domainsrdquoGenome Biology and Evolution vol 2 no 1 pp 379ndash392 2010

[46] S Jun G E Sims G A Wu and S Kim ldquoWhole-proteomephylogeny of prokaryotes by feature frequency profiles analignment-free method with optimal feature resolutionrdquo Pro-ceedings of the National Academy of Sciences of the United Statesof America vol 107 no 1 pp 133ndash138 2010

[47] E A Schultes P T Hraber and T H LaBean ldquoNo molecule isan islandmolecular evolution and the study of sequence spacerdquoin Algorithmic Bioprocesses A Condon D Harel J N Kok ASalomaa and E Winfree Eds pp 675ndash704 Springer BerlinGermany 2009

[48] KM Kim andG Caetano-Anolles ldquoThe proteomic complexityand rise of the primordial ancestor of diversified liferdquo BMCEvolutionary Biology vol 11 no 1 article 140 2011

[49] V A Albert Parsimony Phylogeny and Genomics OxfordUniversity Press Oxford UK 2005

[50] G Caetano-Anolles ldquoEvolved RNA secondary structure andthe rooting of the universal tree of liferdquo Journal of MolecularEvolution vol 54 no 3 pp 333ndash345 2002

[51] J A Lake R G Skophammer C W Herbold and J A ServinldquoGenome beginnings rooting the tree of liferdquo PhilosophicalTransactions of the Royal Society B Biological Sciences vol 364no 1527 pp 2177ndash2185 2009

[52] B GuoM Zou andAWagner ldquoPervasive indels and their evo-lutionary dynamics after the fish-specific genome duplicationrdquoMolecular Biology and Evolution vol 29 no 10 pp 3005ndash30222012

[53] M K Basu I B Rogozin O Deusch T Dagan W Martinand E V Koonin ldquoEvolutionary dynamics of introns in plastid-derived genes in plants saturation nearly reached but slowintron gain continuesrdquoMolecular Biology and Evolution vol 25no 1 pp 111ndash119 2008

[54] J S Farris ldquoPhylogenetic analysis underDollorsquos Lawrdquo SystematicBiology vol 26 no 1 pp 77ndash88 1977

[55] H Fang M E Oates R B Pethica et al ldquoA daily-updated treeof (sequenced) life as a reference for genome researchrdquo ScientificReports vol 3 article 2015 2013

[56] T Cavalier-Smith ldquoThe neomuran origin of archaebacteria thenegibacterial root of the universal tree and bacterial megaclas-sificationrdquo International Journal of Systematic and EvolutionaryMicrobiology vol 52 no 1 pp 7ndash76 2002

[57] T Cavalier-Smith ldquoRooting the tree of life by transition analy-sesrdquo Biology Direct vol 1 article 19 2006

[58] H S Kim J E Mittenthal and G Caetano-AnollesldquoWidespread recruitment of ancient domain structures inmodern enzymes during metabolic evolutionrdquo Journal ofIntegrative Bioinformatics vol 10 no 1 article 214 2013

[59] G Caetano-Anolles MWang and D Caetano-Anolles ldquoStruc-tural phylogenomics retrodicts the origin of the genetic codeand uncovers the evolutionary impact of protein flexibilityrdquoPLoS ONE vol 8 no 8 Article ID e72225 2013

[60] E Bapteste and C Brochier ldquoOn the conceptual difficulties inrooting the tree of liferdquo Trends in Microbiology vol 12 no 1 pp9ndash13 2004

[61] A M Poole ldquoHorizontal gene transfer and the earliest stages ofthe evolution of liferdquo Research in Microbiology vol 160 no 7pp 473ndash480 2009

[62] D Raoult ldquoThe post-Darwinist rhizome of liferdquoTheLancet vol375 no 9709 pp 104ndash105 2010

[63] W F Doolittle ldquoPhylogenetic classification and the universaltreerdquo Science vol 284 no 5423 pp 2124ndash2128 1999

[64] T Dagan and W Martin ldquoThe tree of one percentrdquo GenomeBiology vol 7 no 10 article 118 2006

[65] W FDoolittle andE Bapteste ldquoPattern pluralism and the tree oflife hypothesisrdquo Proceedings of the National Academy of Sciencesof the United States of America vol 104 no 7 pp 2043ndash20492007

[66] E V Koonin ldquoTowards a postmodern synthesis of evolutionarybiologyrdquo Cell Cycle vol 8 no 6 pp 799ndash800 2009

[67] T Kloesges O Popa W Martin and T Dagan ldquoNetworks ofgene sharing among 329 proteobacterial genomes reveal differ-ences in lateral gene transfer frequency at different phylogeneticdepthsrdquoMolecular Biology andEvolution vol 28 no 2 pp 1057ndash1074 2011

[68] S Halary J W Leigh B Cheaib P Lopez and E BaptesteldquoNetwork analyses structure genetic diversity in independentgenetic worldsrdquo Proceedings of the National Academy of Sciencesof the United States of America vol 107 no 1 pp 127ndash132 2010

[69] S S Abby E Tannier M Gouy and V Daubin ldquoLateral genetransfer as a support for the tree of liferdquo Proceedings of theNational Academy of Sciences of the United States of Americavol 109 no 13 pp 4962ndash4967 2012

[70] N Glansdorff Y Xu and B Labedan ldquoThe conflict betweenhorizontal gene transfer and the safeguard of identity origin of

24 Archaea

meiotic sexualityrdquo Journal of Molecular Evolution vol 69 no 5pp 470ndash480 2009

[71] C G Kurland and O G Berg ldquoA hitchhikerrsquos guide to evolvingnetworksrdquo in Evolutionary Genomics and Systems Biology GCaetano-Anolles Ed pp 361ndash396 Wiley-Blackwell HobokenNJ USA 2010

[72] L P Villarreal and G Witzany ldquoThe DNA habitat and its RNAinhabitants at the dawn of RNA sociologyrdquo Genomics Insightsvol 6 pp 1ndash12 2013

[73] S Gribaldo and C Brochier ldquoPhylogeny of prokaryotes does itexist and why should we carerdquo Research in Microbiology vol160 no 7 pp 513ndash521 2009

[74] B D Mishler ldquoThe logic of the data matrix in phylogeneticanalysisrdquo in Parsimony Phylogeny and Genomics V A AlbertEd pp 57ndash69 Oxford University Press New York NY USA2005

[75] A Dress V Moulton M Steel and T Wu ldquoSpecies clustersand the ldquotree of liferdquo a graph-theoretic perspectiverdquo Journal ofTheoretical Biology vol 265 no 4 pp 535ndash542 2010

[76] N Lane andWMartin ldquoThe energetics of genome complexityrdquoNature vol 467 no 7318 pp 928ndash934 2010

[77] M C Rivera and J A Lake ldquoThe ring of life provides evidencefor a genome fusion origin of eukaryotesrdquo Nature vol 431 no7005 pp 152ndash155 2004

[78] D Alvarez-Ponce and J O McInerney ldquoThe human genomeretains relics of its prokaryotic ancestry human genes of archae-bacterial and eubacterial origin exhibit remarkable differencesrdquoGenome Biology and Evolution vol 3 no 1 pp 782ndash790 2011

[79] A Poole and D Penny ldquoEukaryote evolution engulfed byspeculationrdquo Nature vol 447 no 7147 p 913 2007

[80] E V Koonin and Y I Wolf ldquoThe fundamental units processesand patterns of evolution and the tree of life conundrumrdquoBiology Direct vol 4 no 1 article 33 2009

[81] P Puigbo Y IWolf and E V Koonin ldquoSearch for a ldquotree of liferdquoin the thicket of the phylogenetic forestrdquo Journal of Biology vol8 no 6 article 59 2009

[82] P Puigbo Y I Wolf and E V Koonin ldquoSeeing the tree of lifebehind the phylogenetic forestrdquo BMC Biology vol 11 article 462013

[83] D B Wetlaufer ldquoNucleation rapid folding and globular intra-chain regions in proteinsrdquo Proceedings of the National Academyof Sciences of the United States of America vol 70 no 3 pp 697ndash701 1973

[84] J S Richardson ldquoThe anatomy and taxonomy of proteinstructurerdquo Advances in Protein Chemistry vol 34 pp 167ndash3391981

[85] M Riley and B Labedan ldquoProtein evolution viewed throughEscherichia coli protein sequences introducing the notion ofa structural segment of homology the modulerdquo Journal ofMolecular Biology vol 268 no 5 pp 857ndash868 1997

[86] J Janin and S J Wodak ldquoStructural domains in proteins andtheir role in the dynamics of protein functionrdquo Progress inBiophysics and Molecular Biology vol 42 pp 21ndash78 1983

[87] G Caetano-Anolles M Wang D Caetano-Anolles and J EMittenthal ldquoThe origin evolution and structure of the proteinworldrdquo Biochemical Journal vol 417 no 3 pp 621ndash637 2009

[88] J Gough ldquoConvergent evolution of domain architectures (israre)rdquo Bioinformatics vol 21 no 8 pp 1464ndash1471 2005

[89] K Forslund A Henricson V Hollich and E L L Sonnham-mer ldquoDomain tree-based analysis of protein architecture evo-lutionrdquoMolecular Biology and Evolution vol 25 no 2 pp 254ndash264 2008

[90] S Yang and P E Bourne ldquoThe evolutionary history of proteindomains viewed by species phylogenyrdquo PLoS ONE vol 4 no 12Article ID e8378 2009

[91] K M Kim and G Caetano-Anolles ldquoThe evolutionary historyof protein fold families and proteomes confirms that thearchaeal ancestor is more ancient than the ancestors of othersuperkingdomsrdquo BMC Evolutionary Biology vol 12 article 132012

[92] M Wang and G Caetano-Anolles ldquoThe evolutionary mechan-ics of domain organization in proteomes and the rise ofmodularity in the protein worldrdquo Structure vol 17 no 1 pp 66ndash78 2009

[93] G Caetano-Anolles and A Nasir ldquoBenefits of using molecularstructure and abundance in phylogenomic analysisrdquo Frontiersin Genetics vol 3 article 172 2012

[94] K Brading and E Castellani Symmetries in Physics Philosoph-ical Reflections Cambridge University Press Cambridge UK2003

[95] J A Cotton and M Wilkinson ldquoQuantifying the potentialutility of phylogenetic charactersrdquo Taxon vol 57 no 1 pp 131ndash136 2008

[96] E Mossel andM Steel ldquoA phase transition for a random clustermodel on phylogenetic treesrdquoMathematical Biosciences vol 187no 2 pp 189ndash203 2004

[97] M H Bailor X Sun and H M Al-Hashimi ldquoTopology linksRNA secondary structure with global conformation dynamicsand adaptationrdquo Science vol 327 no 5962 pp 202ndash206 2010

[98] K Illergard D H Ardell and A Elofsson ldquoStructure is three toten times more conserved than sequencemdasha study of structuralresponse in protein coresrdquo Proteins Structure Function andBioinformatics vol 77 no 3 pp 499ndash508 2009

[99] G Caetano-Anolles S K Hee and J E Mittenthal ldquoThe originof modern metabolic networks inferred from phylogenomicanalysis of protein architecturerdquo Proceedings of the NationalAcademy of Sciences of the United States of America vol 104 no22 pp 9358ndash9363 2007

[100] P HWeston ldquoMethods for rooting cladistic treesrdquo inModels inPhylogeny Reconstruction D J Siebert R W Scotland and DMWilliams Eds pp 125ndash155OxfordUniversity PressOxfordUK 1994

[101] H N Bryant ldquoThe polarization of character transformationsin phylogenetic systematics role of axiomatic and auxiliaryassumptionsrdquo Systematic Biology vol 40 no 4 pp 433ndash4451991

[102] H N Bryant and G Wagner ldquoCharacter polarity and the root-ing of cladogramsrdquo in The Character Concept in EvolutionaryBiology G P Wagner Ed pp 319ndash338 Academic Press NewYork NY USA 2001

[103] G Nelson ldquoOntogeny phylogeny paleontology and the bio-genetic lawrdquo Systematic Biology vol 27 no 3 pp 324ndash345 1978

[104] P H Weston ldquoIndirect and direct methods in systematicsrdquo inOntogeny and Systematics C J Humphries Ed pp 27ndash56 NewYork NY USA Columbia University Press edition 1988

[105] F-J Sun and G Caetano-Anolles ldquoThe evolutionary historyof the structure of 5S ribosomal RNArdquo Journal of MolecularEvolution vol 69 no 5 pp 430ndash443 2009

[106] X Fang T Pan andT R Sosnick ldquoMg2+-dependent folding of alarge ribozymewithout kinetic trapsrdquoNature Structural Biologyvol 6 no 12 pp 1091ndash1095 1999

[107] F-J Sun and G Caetano-Anolles ldquoEvolutionary patterns inthe sequence and structure of transfer RNA early origins of

Archaea 25

Archaea and virusesrdquo PLoS Computational Biology vol 4 no3 Article ID e1000018 2008

[108] F-J Sun and G Caetano-Anolles ldquoThe origin and evolution oftRNA inferred from phylogenetic analysis of structurerdquo Journalof Molecular Evolution vol 66 no 1 pp 21ndash35 2008

[109] W R Ashby An Introduction to Cybernetics Taylor amp FrancisLondon UK 1955

[110] J G Lundberg ldquoWagner networks and ancestorsrdquo SystematicBiology vol 21 no 4 pp 398ndash413 1972

[111] P T Chippindale and J J Wiens ldquoWeighting partitioningand combining characters in phylogenetic analysisrdquo SystematicBiology vol 43 no 2 pp 278ndash287 1994

[112] J Mittenthal D Caetano-Anolles and G Caetano-AnollesldquoBiphasic patterns of diversification and the emergence ofmodulesrdquo Frontiers in Genetics vol 3 article 147 2012

[113] A G Murzin S E Brenner T Hubbard and C ChothialdquoSCOP a structural classification of proteins database for theinvestigation of sequences and structuresrdquo Journal of MolecularBiology vol 247 no 4 pp 536ndash540 1995

[114] C A Orengo AMichie S Jones D T Jones M Swindells andJ M Thornton ldquoCATHmdasha hierarchic classification of proteindomain structuresrdquo Structure vol 5 no 8 pp 1093ndash1109 1997

[115] G Sawa J Dicks and I N Roberts ldquoCurrent approaches towhole genome phylogenetic analysisrdquo Briefings in Bioinformat-ics vol 4 no 1 pp 63ndash74 2003

[116] R Sathyapriya J M Duarte H Stehr I Filippis and M LappeldquoDefining an essence of structure determining residue contactsin proteinsrdquo PLoS Computational Biology vol 5 no 12 ArticleID e1000584 2009

[117] D Penny and L J Collins ldquoEvolutionary genomics leadsthe wayrdquo in Evolutionary Genomics and Systems Biology GCaetano-Anolles Ed pp 1ndash16 Wiley-Blackwell Hoboken NJUSA 2010

[118] M Ashburner C A Ball J A Blake et al ldquoGene ontology toolfor the unification of biologyrdquoNature Genetics vol 25 no 1 pp25ndash29 2000

[119] A Andreeva DHoworth J Chandonia et al ldquoData growth andits impact on the SCOP database new developmentsrdquo NucleicAcids Research vol 36 pp D419ndashD425 2008

[120] L S Yafremava M Wielgos S Thomas et al ldquoA generalframework of persistence strategies for biological systems helpsexplain domains of liferdquo Frontiers in Genetics vol 4 article 162013

[121] J L Friar T Goldman and J Perez-Mercader ldquoGenome sizesand the Benford distributionrdquo PLoS ONE vol 7 no 5 ArticleID e36624 2012

[122] A Nasir K M Kim and G Caetano-Anolles ldquoGiant virusescoexisted with the cellular ancestors and represent a distinctsupergroup along with superkingdoms Archaea Bacteria andEukaryardquo BMC Evolutionary Biology vol 12 article 156 2012

[123] A Nasir and G Caetano-Anolles ldquoComparative analysis ofproteomes and functionomes provides insights into origins ofcellular diversificationrdquo Archaea vol 2013 Article ID 64874613 pages 2013

[124] WMartin andMMuller ldquoThe hydrogen hypothesis for the firsteukaryoterdquo Nature vol 392 no 6671 pp 37ndash41 1998

[125] A Nasir K M Kim and G Caetano-Anolles ldquoGlobal patternsof protein domain gain and loss in superkingdomsrdquo PLoSComputational Biology vol 10 no 1 Article ID e1003452 2014

[126] MWang L S Yafremava D Caetano-Anolles J E Mittenthaland G Caetano-Anolles ldquoReductive evolution of architectural

repertoires in proteomes and the birth of the tripartite worldrdquoGenome Research vol 17 no 11 pp 1572ndash1585 2007

[127] L Guy and T J G Ettema ldquoThe archaeal ldquoTACKrdquo superphylumand the origin of eukaryotesrdquoTrends inMicrobiology vol 19 no12 pp 580ndash587 2011

[128] D Schlieper M A Oliva J M Andreu and J Lowe ldquoStructureof bacterial tubulin BtubAB evidence for horizontal genetransferrdquo Proceedings of the National Academy of Sciences of theUnited States of America vol 102 no 26 pp 9170ndash9175 2005

[129] M P Hoeppner P P Gardner and A M Poole ldquoComparativeanalysis of RNA families reveals distinct repertoires for eachdomain of liferdquo PLoS Computational Biology vol 8 no 11Article ID e1002752 2012

[130] H Xue K Tong C Marck H Grosjean and J T WongldquoTransfer RNA paralogs evidence for genetic code-amino acidbiosynthesis coevolution and an archaeal root of liferdquoGene vol310 no 1-2 pp 59ndash66 2003

[131] H Xue SNg K Tong and J TWong ldquoCongruence of evidencefor aMethanopyrus-proximal root of life based on transfer RNAand aminoacyl-tRNA synthetase genesrdquo Gene vol 360 no 2pp 120ndash130 2005

[132] MDiGiulio ldquoNanoarchaeum equitans is a living fossilrdquo Journalof Theoretical Biology vol 242 no 1 pp 257ndash260 2006

[133] M Di Giulio ldquoThe tree of life might be rooted in the branchleading to Nanoarchaeotardquo Gene vol 401 no 1-2 pp 108ndash1132007

[134] J T Wong J Chen W Mat S Ng and H Xue ldquoPolyphasicevidence delineating the root of life and roots of biologicaldomainsrdquo Gene vol 403 no 1-2 pp 39ndash52 2007

[135] F-J Sun and G Caetano-Anolles ldquoThe ancient history of thestructure of ribonuclease P and the early origins of ArchaeardquoBMC Bioinformatics vol 11 article 153 2010

[136] C Marck and H Grosjean ldquotRNomics analysis of tRNA genesfrom 50 genomes of Eukarya Archaea and bacteria revealsanticodon-sparing strategies and domain-specific featuresrdquoRNA vol 8 no 10 pp 1189ndash1232 2002

[137] W Mat H Xue and J T Wong ldquoThe genomics of LUCArdquoFrontiers in Bioscience vol 13 no 14 pp 5605ndash5613 2008

[138] C Brochier S Gribaldo Y Zivanovic F Confalonieri andP Forterre ldquoNanoarchaea representatives of a novel archaealphylum or a fast-evolving euryarchaeal lineage related toThermococcalesrdquo Genome Biology vol 6 no 5 article R422005

[139] B Billoud M Guerrucci M Masselot and J S DeutschldquoCirripede phylogeny using a novel approach molecular mor-phometricsrdquoMolecular Biology and Evolution vol 17 no 10 pp1435ndash1445 2000

[140] L J Collins V Moulton and D Penny ldquoUse of RNA secondarystructure for studying the evolution of RNase P and RNaseMRPrdquo Journal of Molecular Evolution vol 51 no 3 pp 194ndash2042000

[141] G Caetano-Anolles ldquoNovel strategies to study the role ofmutation and nucleic acid structure in evolutionrdquo Plant CellTissue and Organ Culture vol 67 no 2 pp 115ndash132 2001

[142] F-J Sun S Fleurdepine C Bousquet-Antonelli G Caetano-Anolles and J Deragon ldquoCommon evolutionary trends forSINE RNA structuresrdquo Trends in Genetics vol 23 no 1 pp 26ndash33 2007

[143] D Caetano-Anolles K M Kim J E Mittenthal and GCaetano-Anolles ldquoProteome evolution and the metabolic ori-gins of translation and cellular liferdquo Journal of MolecularEvolution vol 72 no 1 pp 14ndash33 2011


[144] M. Wang, Y. Jiang, K. M. Kim, et al., “A universal molecular clock of protein folds and its power in tracing the early history of aerobic metabolism and planet oxygenation,” Molecular Biology and Evolution, vol. 28, no. 1, pp. 567–582, 2011.

[145] S. A. Bukhari and G. Caetano-Anollés, “Origin and evolution of protein fold designs inferred from phylogenomic analysis of CATH domain structures in proteomes,” PLoS Computational Biology, vol. 9, no. 3, Article ID e1003009, 2013.

[146] M. Vesteg and J. Krajčovič, “The falsifiability of the models for the origin of eukaryotes,” Current Genetics, vol. 57, no. 6, pp. 367–390, 2011.

[147] I. N. Berezovsky and E. I. Shakhnovich, “Physics and evolution of thermophilic adaptation,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 36, pp. 12742–12747, 2005.

[148] C. Woese, “The universal ancestor,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 12, pp. 6854–6859, 1998.

[149] M. J. Seufferheld and G. Caetano-Anollés, “Phylogenomics supports a cellularly structured urancestor,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 178–191, 2013.

[150] O. Kandler, “Cell wall biochemistry and three-domain concept of life,” Systematic and Applied Microbiology, vol. 16, no. 4, pp. 501–509, 1994.

[151] C. R. Woese, “On the evolution of cells,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 13, pp. 8742–8747, 2002.

[152] S. L. Baldauf, J. D. Palmer, and W. F. Doolittle, “The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny,” Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 15, pp. 7749–7754, 1996.

[153] H. Leffers, J. Kjems, L. Østergaard, N. Larsen, and R. A. Garrett, “Evolutionary relationships amongst archaebacteria. A comparative study of 23 S ribosomal RNAs of a sulphur-dependent extreme thermophile, an extreme halophile and a thermophilic methanogen,” Journal of Molecular Biology, vol. 195, no. 1, pp. 43–61, 1987.

[154] J. Martin, D. Blackburn, and E. Wiley, “Are node-based and stem-based clades equivalent? Insights from graph theory,” PLoS Currents, vol. 2, Article ID RRN1196, 2010.


[34] J. Mallatt and C. J. Winchell, “Testing the new animal phylogeny: first use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes,” Molecular Biology and Evolution, vol. 19, no. 3, pp. 289–301, 2002.

[35] M. Gerstein, “Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census,” Proteins: Structure, Function and Bioinformatics, vol. 33, no. 4, pp. 518–534, 1998.

[36] M. Gerstein and H. Hegyi, “Comparing genomes in terms of protein structure: surveys of a finite parts list,” FEMS Microbiology Reviews, vol. 22, no. 4, pp. 277–304, 1998.

[37] B. Snel, P. Bork, and M. A. Huynen, “Genome phylogeny based on gene content,” Nature Genetics, vol. 21, no. 1, pp. 108–110, 1999.

[38] F. Tekaia, A. Lazcano, and B. Dujon, “The genomic tree as revealed from whole proteome comparisons,” Genome Research, vol. 9, no. 6, pp. 550–557, 1999.

[39] Y. I. Wolf, S. E. Brenner, P. A. Bash, and E. V. Koonin, “Distribution of protein folds in the three superkingdoms of life,” Genome Research, vol. 9, no. 1, pp. 17–26, 1999.

[40] J. O. Korbel, B. Snel, M. A. Huynen, and P. Bork, “SHOT: a web server for the construction of genome phylogenies,” Trends in Genetics, vol. 18, no. 3, pp. 158–162, 2002.

[41] G. Caetano-Anollés and D. Caetano-Anollés, “An evolutionarily structural universe of protein architecture,” Genome Research, vol. 13, no. 7, pp. 1563–1571, 2003.

[42] S. L. Baldauf, A. J. Roger, I. Wenk-Siefert, and W. F. Doolittle, “A kingdom-level phylogeny of eukaryotes based on combined protein data,” Science, vol. 290, no. 5493, pp. 972–977, 2000.

[43] J. R. Brown, C. J. Douady, M. J. Italia, W. E. Marshall, and M. J. Stanhope, “Universal trees based on large combined protein sequence data sets,” Nature Genetics, vol. 28, no. 3, pp. 281–285, 2001.

[44] F. D. Ciccarelli, T. Doerks, C. von Mering, C. J. Creevey, B. Snel, and P. Bork, “Toward automatic reconstruction of a highly resolved tree of life,” Science, vol. 311, no. 5765, pp. 1283–1287, 2006.

[45] T. Dagan, M. Roettger, D. Bryant, and W. Martin, “Genome networks root the tree of life between prokaryotic domains,” Genome Biology and Evolution, vol. 2, no. 1, pp. 379–392, 2010.

[46] S. Jun, G. E. Sims, G. A. Wu, and S. Kim, “Whole-proteome phylogeny of prokaryotes by feature frequency profiles: an alignment-free method with optimal feature resolution,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 133–138, 2010.

[47] E. A. Schultes, P. T. Hraber, and T. H. LaBean, “No molecule is an island: molecular evolution and the study of sequence space,” in Algorithmic Bioprocesses, A. Condon, D. Harel, J. N. Kok, A. Salomaa, and E. Winfree, Eds., pp. 675–704, Springer, Berlin, Germany, 2009.

[48] K. M. Kim and G. Caetano-Anollés, “The proteomic complexity and rise of the primordial ancestor of diversified life,” BMC Evolutionary Biology, vol. 11, no. 1, article 140, 2011.

[49] V. A. Albert, Parsimony, Phylogeny, and Genomics, Oxford University Press, Oxford, UK, 2005.

[50] G. Caetano-Anollés, “Evolved RNA secondary structure and the rooting of the universal tree of life,” Journal of Molecular Evolution, vol. 54, no. 3, pp. 333–345, 2002.

[51] J. A. Lake, R. G. Skophammer, C. W. Herbold, and J. A. Servin, “Genome beginnings: rooting the tree of life,” Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 364, no. 1527, pp. 2177–2185, 2009.

[52] B. Guo, M. Zou, and A. Wagner, “Pervasive indels and their evolutionary dynamics after the fish-specific genome duplication,” Molecular Biology and Evolution, vol. 29, no. 10, pp. 3005–3022, 2012.

[53] M. K. Basu, I. B. Rogozin, O. Deusch, T. Dagan, W. Martin, and E. V. Koonin, “Evolutionary dynamics of introns in plastid-derived genes in plants: saturation nearly reached but slow intron gain continues,” Molecular Biology and Evolution, vol. 25, no. 1, pp. 111–119, 2008.

[54] J. S. Farris, “Phylogenetic analysis under Dollo’s Law,” Systematic Biology, vol. 26, no. 1, pp. 77–88, 1977.

[55] H. Fang, M. E. Oates, R. B. Pethica, et al., “A daily-updated tree of (sequenced) life as a reference for genome research,” Scientific Reports, vol. 3, article 2015, 2013.

[56] T. Cavalier-Smith, “The neomuran origin of archaebacteria, the negibacterial root of the universal tree and bacterial megaclassification,” International Journal of Systematic and Evolutionary Microbiology, vol. 52, no. 1, pp. 7–76, 2002.

[57] T. Cavalier-Smith, “Rooting the tree of life by transition analyses,” Biology Direct, vol. 1, article 19, 2006.

[58] H. S. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Widespread recruitment of ancient domain structures in modern enzymes during metabolic evolution,” Journal of Integrative Bioinformatics, vol. 10, no. 1, article 214, 2013.

[59] G. Caetano-Anollés, M. Wang, and D. Caetano-Anollés, “Structural phylogenomics retrodicts the origin of the genetic code and uncovers the evolutionary impact of protein flexibility,” PLoS ONE, vol. 8, no. 8, Article ID e72225, 2013.

[60] E. Bapteste and C. Brochier, “On the conceptual difficulties in rooting the tree of life,” Trends in Microbiology, vol. 12, no. 1, pp. 9–13, 2004.

[61] A. M. Poole, “Horizontal gene transfer and the earliest stages of the evolution of life,” Research in Microbiology, vol. 160, no. 7, pp. 473–480, 2009.

[62] D. Raoult, “The post-Darwinist rhizome of life,” The Lancet, vol. 375, no. 9709, pp. 104–105, 2010.

[63] W. F. Doolittle, “Phylogenetic classification and the universal tree,” Science, vol. 284, no. 5423, pp. 2124–2128, 1999.

[64] T. Dagan and W. Martin, “The tree of one percent,” Genome Biology, vol. 7, no. 10, article 118, 2006.

[65] W. F. Doolittle and E. Bapteste, “Pattern pluralism and the tree of life hypothesis,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 7, pp. 2043–2049, 2007.

[66] E. V. Koonin, “Towards a postmodern synthesis of evolutionary biology,” Cell Cycle, vol. 8, no. 6, pp. 799–800, 2009.

[67] T. Kloesges, O. Popa, W. Martin, and T. Dagan, “Networks of gene sharing among 329 proteobacterial genomes reveal differences in lateral gene transfer frequency at different phylogenetic depths,” Molecular Biology and Evolution, vol. 28, no. 2, pp. 1057–1074, 2011.

[68] S. Halary, J. W. Leigh, B. Cheaib, P. Lopez, and E. Bapteste, “Network analyses structure genetic diversity in independent genetic worlds,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 1, pp. 127–132, 2010.

[69] S. S. Abby, E. Tannier, M. Gouy, and V. Daubin, “Lateral gene transfer as a support for the tree of life,” Proceedings of the National Academy of Sciences of the United States of America, vol. 109, no. 13, pp. 4962–4967, 2012.

[70] N. Glansdorff, Y. Xu, and B. Labedan, “The conflict between horizontal gene transfer and the safeguard of identity: origin of meiotic sexuality,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 470–480, 2009.

[71] C. G. Kurland and O. G. Berg, “A hitchhiker’s guide to evolving networks,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 361–396, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[72] L. P. Villarreal and G. Witzany, “The DNA habitat and its RNA inhabitants: at the dawn of RNA sociology,” Genomics Insights, vol. 6, pp. 1–12, 2013.

[73] S. Gribaldo and C. Brochier, “Phylogeny of prokaryotes: does it exist and why should we care?” Research in Microbiology, vol. 160, no. 7, pp. 513–521, 2009.

[74] B. D. Mishler, “The logic of the data matrix in phylogenetic analysis,” in Parsimony, Phylogeny, and Genomics, V. A. Albert, Ed., pp. 57–69, Oxford University Press, New York, NY, USA, 2005.

[75] A. Dress, V. Moulton, M. Steel, and T. Wu, “Species, clusters and the “tree of life”: a graph-theoretic perspective,” Journal of Theoretical Biology, vol. 265, no. 4, pp. 535–542, 2010.

[76] N. Lane and W. Martin, “The energetics of genome complexity,” Nature, vol. 467, no. 7318, pp. 928–934, 2010.

[77] M. C. Rivera and J. A. Lake, “The ring of life provides evidence for a genome fusion origin of eukaryotes,” Nature, vol. 431, no. 7005, pp. 152–155, 2004.

[78] D. Alvarez-Ponce and J. O. McInerney, “The human genome retains relics of its prokaryotic ancestry: human genes of archaebacterial and eubacterial origin exhibit remarkable differences,” Genome Biology and Evolution, vol. 3, no. 1, pp. 782–790, 2011.

[79] A. Poole and D. Penny, “Eukaryote evolution: engulfed by speculation,” Nature, vol. 447, no. 7147, p. 913, 2007.

[80] E. V. Koonin and Y. I. Wolf, “The fundamental units, processes and patterns of evolution, and the tree of life conundrum,” Biology Direct, vol. 4, no. 1, article 33, 2009.

[81] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Search for a “tree of life” in the thicket of the phylogenetic forest,” Journal of Biology, vol. 8, no. 6, article 59, 2009.

[82] P. Puigbò, Y. I. Wolf, and E. V. Koonin, “Seeing the tree of life behind the phylogenetic forest,” BMC Biology, vol. 11, article 46, 2013.

[83] D. B. Wetlaufer, “Nucleation, rapid folding, and globular intrachain regions in proteins,” Proceedings of the National Academy of Sciences of the United States of America, vol. 70, no. 3, pp. 697–701, 1973.

[84] J. S. Richardson, “The anatomy and taxonomy of protein structure,” Advances in Protein Chemistry, vol. 34, pp. 167–339, 1981.

[85] M. Riley and B. Labedan, “Protein evolution viewed through Escherichia coli protein sequences: introducing the notion of a structural segment of homology, the module,” Journal of Molecular Biology, vol. 268, no. 5, pp. 857–868, 1997.

[86] J. Janin and S. J. Wodak, “Structural domains in proteins and their role in the dynamics of protein function,” Progress in Biophysics and Molecular Biology, vol. 42, pp. 21–78, 1983.

[87] G. Caetano-Anollés, M. Wang, D. Caetano-Anollés, and J. E. Mittenthal, “The origin, evolution and structure of the protein world,” Biochemical Journal, vol. 417, no. 3, pp. 621–637, 2009.

[88] J. Gough, “Convergent evolution of domain architectures (is rare),” Bioinformatics, vol. 21, no. 8, pp. 1464–1471, 2005.

[89] K. Forslund, A. Henricson, V. Hollich, and E. L. L. Sonnhammer, “Domain tree-based analysis of protein architecture evolution,” Molecular Biology and Evolution, vol. 25, no. 2, pp. 254–264, 2008.

[90] S. Yang and P. E. Bourne, “The evolutionary history of protein domains viewed by species phylogeny,” PLoS ONE, vol. 4, no. 12, Article ID e8378, 2009.

[91] K. M. Kim and G. Caetano-Anollés, “The evolutionary history of protein fold families and proteomes confirms that the archaeal ancestor is more ancient than the ancestors of other superkingdoms,” BMC Evolutionary Biology, vol. 12, article 13, 2012.

[92] M. Wang and G. Caetano-Anollés, “The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world,” Structure, vol. 17, no. 1, pp. 66–78, 2009.

[93] G. Caetano-Anollés and A. Nasir, “Benefits of using molecular structure and abundance in phylogenomic analysis,” Frontiers in Genetics, vol. 3, article 172, 2012.

[94] K. Brading and E. Castellani, Symmetries in Physics: Philosophical Reflections, Cambridge University Press, Cambridge, UK, 2003.

[95] J. A. Cotton and M. Wilkinson, “Quantifying the potential utility of phylogenetic characters,” Taxon, vol. 57, no. 1, pp. 131–136, 2008.

[96] E. Mossel and M. Steel, “A phase transition for a random cluster model on phylogenetic trees,” Mathematical Biosciences, vol. 187, no. 2, pp. 189–203, 2004.

[97] M. H. Bailor, X. Sun, and H. M. Al-Hashimi, “Topology links RNA secondary structure with global conformation, dynamics, and adaptation,” Science, vol. 327, no. 5962, pp. 202–206, 2010.

[98] K. Illergård, D. H. Ardell, and A. Elofsson, “Structure is three to ten times more conserved than sequence—a study of structural response in protein cores,” Proteins: Structure, Function and Bioinformatics, vol. 77, no. 3, pp. 499–508, 2009.

[99] G. Caetano-Anollés, S. K. Hee, and J. E. Mittenthal, “The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 22, pp. 9358–9363, 2007.

[100] P. H. Weston, “Methods for rooting cladistic trees,” in Models in Phylogeny Reconstruction, D. J. Siebert, R. W. Scotland, and D. M. Williams, Eds., pp. 125–155, Oxford University Press, Oxford, UK, 1994.

[101] H. N. Bryant, “The polarization of character transformations in phylogenetic systematics: role of axiomatic and auxiliary assumptions,” Systematic Biology, vol. 40, no. 4, pp. 433–445, 1991.

[102] H. N. Bryant and G. Wagner, “Character polarity and the rooting of cladograms,” in The Character Concept in Evolutionary Biology, G. P. Wagner, Ed., pp. 319–338, Academic Press, New York, NY, USA, 2001.

[103] G. Nelson, “Ontogeny, phylogeny, paleontology, and the biogenetic law,” Systematic Biology, vol. 27, no. 3, pp. 324–345, 1978.

[104] P. H. Weston, “Indirect and direct methods in systematics,” in Ontogeny and Systematics, C. J. Humphries, Ed., pp. 27–56, Columbia University Press, New York, NY, USA, 1988.

[105] F.-J. Sun and G. Caetano-Anollés, “The evolutionary history of the structure of 5S ribosomal RNA,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 430–443, 2009.

[106] X. Fang, T. Pan, and T. R. Sosnick, “Mg2+-dependent folding of a large ribozyme without kinetic traps,” Nature Structural Biology, vol. 6, no. 12, pp. 1091–1095, 1999.

[107] F.-J. Sun and G. Caetano-Anollés, “Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses,” PLoS Computational Biology, vol. 4, no. 3, Article ID e1000018, 2008.

[108] F.-J. Sun and G. Caetano-Anollés, “The origin and evolution of tRNA inferred from phylogenetic analysis of structure,” Journal of Molecular Evolution, vol. 66, no. 1, pp. 21–35, 2008.

[109] W. R. Ashby, An Introduction to Cybernetics, Taylor & Francis, London, UK, 1955.

[110] J. G. Lundberg, “Wagner networks and ancestors,” Systematic Biology, vol. 21, no. 4, pp. 398–413, 1972.

[111] P. T. Chippindale and J. J. Wiens, “Weighting, partitioning, and combining characters in phylogenetic analysis,” Systematic Biology, vol. 43, no. 2, pp. 278–287, 1994.

[112] J. Mittenthal, D. Caetano-Anollés, and G. Caetano-Anollés, “Biphasic patterns of diversification and the emergence of modules,” Frontiers in Genetics, vol. 3, article 147, 2012.

[113] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[114] C. A. Orengo, A. Michie, S. Jones, D. T. Jones, M. Swindells, and J. M. Thornton, “CATH—a hierarchic classification of protein domain structures,” Structure, vol. 5, no. 8, pp. 1093–1109, 1997.

[115] G. Sawa, J. Dicks, and I. N. Roberts, “Current approaches to whole genome phylogenetic analysis,” Briefings in Bioinformatics, vol. 4, no. 1, pp. 63–74, 2003.

[116] R. Sathyapriya, J. M. Duarte, H. Stehr, I. Filippis, and M. Lappe, “Defining an essence of structure determining residue contacts in proteins,” PLoS Computational Biology, vol. 5, no. 12, Article ID e1000584, 2009.

[117] D. Penny and L. J. Collins, “Evolutionary genomics leads the way,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 1–16, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[118] M. Ashburner, C. A. Ball, J. A. Blake, et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.

[119] A. Andreeva, D. Howorth, J. Chandonia, et al., “Data growth and its impact on the SCOP database: new developments,” Nucleic Acids Research, vol. 36, pp. D419–D425, 2008.

[120] L. S. Yafremava, M. Wielgos, S. Thomas, et al., “A general framework of persistence strategies for biological systems helps explain domains of life,” Frontiers in Genetics, vol. 4, article 16, 2013.

[121] J. L. Friar, T. Goldman, and J. Pérez-Mercader, “Genome sizes and the Benford distribution,” PLoS ONE, vol. 7, no. 5, Article ID e36624, 2012.

[122] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya,” BMC Evolutionary Biology, vol. 12, article 156, 2012.

[123] A. Nasir and G. Caetano-Anollés, “Comparative analysis of proteomes and functionomes provides insights into origins of cellular diversification,” Archaea, vol. 2013, Article ID 648746, 13 pages, 2013.

[124] W. Martin and M. Müller, “The hydrogen hypothesis for the first eukaryote,” Nature, vol. 392, no. 6671, pp. 37–41, 1998.

[125] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Global patterns of protein domain gain and loss in superkingdoms,” PLoS Computational Biology, vol. 10, no. 1, Article ID e1003452, 2014.

[126] M. Wang, L. S. Yafremava, D. Caetano-Anollés, J. E. Mittenthal, and G. Caetano-Anollés, “Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world,” Genome Research, vol. 17, no. 11, pp. 1572–1585, 2007.

[127] L. Guy and T. J. G. Ettema, “The archaeal “TACK” superphylum and the origin of eukaryotes,” Trends in Microbiology, vol. 19, no. 12, pp. 580–587, 2011.

[128] D. Schlieper, M. A. Oliva, J. M. Andreu, and J. Löwe, “Structure of bacterial tubulin BtubA/B: evidence for horizontal gene transfer,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 26, pp. 9170–9175, 2005.

[129] M. P. Hoeppner, P. P. Gardner, and A. M. Poole, “Comparative analysis of RNA families reveals distinct repertoires for each domain of life,” PLoS Computational Biology, vol. 8, no. 11, Article ID e1002752, 2012.

[130] H. Xue, K. Tong, C. Marck, H. Grosjean, and J. T. Wong, “Transfer RNA paralogs: evidence for genetic code-amino acid biosynthesis coevolution and an archaeal root of life,” Gene, vol. 310, no. 1-2, pp. 59–66, 2003.

[131] H. Xue, S. Ng, K. Tong, and J. T. Wong, “Congruence of evidence for a Methanopyrus-proximal root of life based on transfer RNA and aminoacyl-tRNA synthetase genes,” Gene, vol. 360, no. 2, pp. 120–130, 2005.

[132] M. Di Giulio, “Nanoarchaeum equitans is a living fossil,” Journal of Theoretical Biology, vol. 242, no. 1, pp. 257–260, 2006.

24 Archaea

meiotic sexualityrdquo Journal of Molecular Evolution vol 69 no 5pp 470ndash480 2009

[71] C G Kurland and O G Berg ldquoA hitchhikerrsquos guide to evolvingnetworksrdquo in Evolutionary Genomics and Systems Biology GCaetano-Anolles Ed pp 361ndash396 Wiley-Blackwell HobokenNJ USA 2010

[72] L P Villarreal and G Witzany ldquoThe DNA habitat and its RNAinhabitants at the dawn of RNA sociologyrdquo Genomics Insightsvol 6 pp 1ndash12 2013

[73] S Gribaldo and C Brochier ldquoPhylogeny of prokaryotes does itexist and why should we carerdquo Research in Microbiology vol160 no 7 pp 513ndash521 2009

[74] B D Mishler ldquoThe logic of the data matrix in phylogeneticanalysisrdquo in Parsimony Phylogeny and Genomics V A AlbertEd pp 57ndash69 Oxford University Press New York NY USA2005

[75] A Dress V Moulton M Steel and T Wu ldquoSpecies clustersand the ldquotree of liferdquo a graph-theoretic perspectiverdquo Journal ofTheoretical Biology vol 265 no 4 pp 535ndash542 2010

[76] N Lane andWMartin ldquoThe energetics of genome complexityrdquoNature vol 467 no 7318 pp 928ndash934 2010

[77] M C Rivera and J A Lake ldquoThe ring of life provides evidencefor a genome fusion origin of eukaryotesrdquo Nature vol 431 no7005 pp 152ndash155 2004

[78] D Alvarez-Ponce and J O McInerney ldquoThe human genomeretains relics of its prokaryotic ancestry human genes of archae-bacterial and eubacterial origin exhibit remarkable differencesrdquoGenome Biology and Evolution vol 3 no 1 pp 782ndash790 2011

[79] A Poole and D Penny ldquoEukaryote evolution engulfed byspeculationrdquo Nature vol 447 no 7147 p 913 2007

[80] E V Koonin and Y I Wolf ldquoThe fundamental units processesand patterns of evolution and the tree of life conundrumrdquoBiology Direct vol 4 no 1 article 33 2009

[81] P Puigbo Y IWolf and E V Koonin ldquoSearch for a ldquotree of liferdquoin the thicket of the phylogenetic forestrdquo Journal of Biology vol8 no 6 article 59 2009

[82] P Puigbo Y I Wolf and E V Koonin ldquoSeeing the tree of lifebehind the phylogenetic forestrdquo BMC Biology vol 11 article 462013

[83] D B Wetlaufer ldquoNucleation rapid folding and globular intra-chain regions in proteinsrdquo Proceedings of the National Academyof Sciences of the United States of America vol 70 no 3 pp 697ndash701 1973

[84] J S Richardson ldquoThe anatomy and taxonomy of proteinstructurerdquo Advances in Protein Chemistry vol 34 pp 167ndash3391981

[85] M Riley and B Labedan ldquoProtein evolution viewed throughEscherichia coli protein sequences introducing the notion ofa structural segment of homology the modulerdquo Journal ofMolecular Biology vol 268 no 5 pp 857ndash868 1997

[86] J Janin and S J Wodak ldquoStructural domains in proteins andtheir role in the dynamics of protein functionrdquo Progress inBiophysics and Molecular Biology vol 42 pp 21ndash78 1983

[87] G Caetano-Anolles M Wang D Caetano-Anolles and J EMittenthal ldquoThe origin evolution and structure of the proteinworldrdquo Biochemical Journal vol 417 no 3 pp 621ndash637 2009

[88] J Gough ldquoConvergent evolution of domain architectures (israre)rdquo Bioinformatics vol 21 no 8 pp 1464ndash1471 2005

[89] K Forslund A Henricson V Hollich and E L L Sonnham-mer ldquoDomain tree-based analysis of protein architecture evo-lutionrdquoMolecular Biology and Evolution vol 25 no 2 pp 254ndash264 2008

[90] S Yang and P E Bourne ldquoThe evolutionary history of proteindomains viewed by species phylogenyrdquo PLoS ONE vol 4 no 12Article ID e8378 2009

[91] K M Kim and G Caetano-Anolles ldquoThe evolutionary historyof protein fold families and proteomes confirms that thearchaeal ancestor is more ancient than the ancestors of othersuperkingdomsrdquo BMC Evolutionary Biology vol 12 article 132012

[92] M Wang and G Caetano-Anolles ldquoThe evolutionary mechan-ics of domain organization in proteomes and the rise ofmodularity in the protein worldrdquo Structure vol 17 no 1 pp 66ndash78 2009

[93] G. Caetano-Anollés and A. Nasir, “Benefits of using molecular structure and abundance in phylogenomic analysis,” Frontiers in Genetics, vol. 3, article 172, 2012.

[94] K. Brading and E. Castellani, Symmetries in Physics: Philosophical Reflections, Cambridge University Press, Cambridge, UK, 2003.

[95] J. A. Cotton and M. Wilkinson, “Quantifying the potential utility of phylogenetic characters,” Taxon, vol. 57, no. 1, pp. 131–136, 2008.

[96] E. Mossel and M. Steel, “A phase transition for a random cluster model on phylogenetic trees,” Mathematical Biosciences, vol. 187, no. 2, pp. 189–203, 2004.

[97] M. H. Bailor, X. Sun, and H. M. Al-Hashimi, “Topology links RNA secondary structure with global conformation, dynamics, and adaptation,” Science, vol. 327, no. 5962, pp. 202–206, 2010.

[98] K. Illergard, D. H. Ardell, and A. Elofsson, “Structure is three to ten times more conserved than sequence—a study of structural response in protein cores,” Proteins: Structure, Function and Bioinformatics, vol. 77, no. 3, pp. 499–508, 2009.

[99] G. Caetano-Anollés, S. K. Hee, and J. E. Mittenthal, “The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 22, pp. 9358–9363, 2007.

[100] P. H. Weston, “Methods for rooting cladistic trees,” in Models in Phylogeny Reconstruction, D. J. Siebert, R. W. Scotland, and D. M. Williams, Eds., pp. 125–155, Oxford University Press, Oxford, UK, 1994.

[101] H. N. Bryant, “The polarization of character transformations in phylogenetic systematics: role of axiomatic and auxiliary assumptions,” Systematic Biology, vol. 40, no. 4, pp. 433–445, 1991.

[102] H. N. Bryant and G. Wagner, “Character polarity and the rooting of cladograms,” in The Character Concept in Evolutionary Biology, G. P. Wagner, Ed., pp. 319–338, Academic Press, New York, NY, USA, 2001.

[103] G. Nelson, “Ontogeny, phylogeny, paleontology, and the biogenetic law,” Systematic Biology, vol. 27, no. 3, pp. 324–345, 1978.

[104] P. H. Weston, “Indirect and direct methods in systematics,” in Ontogeny and Systematics, C. J. Humphries, Ed., pp. 27–56, Columbia University Press, New York, NY, USA, 1988.

[105] F.-J. Sun and G. Caetano-Anollés, “The evolutionary history of the structure of 5S ribosomal RNA,” Journal of Molecular Evolution, vol. 69, no. 5, pp. 430–443, 2009.

[106] X. Fang, T. Pan, and T. R. Sosnick, “Mg2+-dependent folding of a large ribozyme without kinetic traps,” Nature Structural Biology, vol. 6, no. 12, pp. 1091–1095, 1999.

[107] F.-J. Sun and G. Caetano-Anollés, “Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses,” PLoS Computational Biology, vol. 4, no. 3, Article ID e1000018, 2008.

[108] F.-J. Sun and G. Caetano-Anollés, “The origin and evolution of tRNA inferred from phylogenetic analysis of structure,” Journal of Molecular Evolution, vol. 66, no. 1, pp. 21–35, 2008.

[109] W. R. Ashby, An Introduction to Cybernetics, Taylor & Francis, London, UK, 1955.

[110] J. G. Lundberg, “Wagner networks and ancestors,” Systematic Biology, vol. 21, no. 4, pp. 398–413, 1972.

[111] P. T. Chippindale and J. J. Wiens, “Weighting, partitioning, and combining characters in phylogenetic analysis,” Systematic Biology, vol. 43, no. 2, pp. 278–287, 1994.

[112] J. Mittenthal, D. Caetano-Anollés, and G. Caetano-Anollés, “Biphasic patterns of diversification and the emergence of modules,” Frontiers in Genetics, vol. 3, article 147, 2012.

[113] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.

[114] C. A. Orengo, A. Michie, S. Jones, D. T. Jones, M. Swindells, and J. M. Thornton, “CATH—a hierarchic classification of protein domain structures,” Structure, vol. 5, no. 8, pp. 1093–1109, 1997.

[115] G. Sawa, J. Dicks, and I. N. Roberts, “Current approaches to whole genome phylogenetic analysis,” Briefings in Bioinformatics, vol. 4, no. 1, pp. 63–74, 2003.

[116] R. Sathyapriya, J. M. Duarte, H. Stehr, I. Filippis, and M. Lappe, “Defining an essence of structure determining residue contacts in proteins,” PLoS Computational Biology, vol. 5, no. 12, Article ID e1000584, 2009.

[117] D. Penny and L. J. Collins, “Evolutionary genomics leads the way,” in Evolutionary Genomics and Systems Biology, G. Caetano-Anollés, Ed., pp. 1–16, Wiley-Blackwell, Hoboken, NJ, USA, 2010.

[118] M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.

[119] A. Andreeva, D. Howorth, J. Chandonia et al., “Data growth and its impact on the SCOP database: new developments,” Nucleic Acids Research, vol. 36, pp. D419–D425, 2008.

[120] L. S. Yafremava, M. Wielgos, S. Thomas et al., “A general framework of persistence strategies for biological systems helps explain domains of life,” Frontiers in Genetics, vol. 4, article 16, 2013.

[121] J. L. Friar, T. Goldman, and J. Perez-Mercader, “Genome sizes and the Benford distribution,” PLoS ONE, vol. 7, no. 5, Article ID e36624, 2012.

[122] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Giant viruses coexisted with the cellular ancestors and represent a distinct supergroup along with superkingdoms Archaea, Bacteria and Eukarya,” BMC Evolutionary Biology, vol. 12, article 156, 2012.

[123] A. Nasir and G. Caetano-Anollés, “Comparative analysis of proteomes and functionomes provides insights into origins of cellular diversification,” Archaea, vol. 2013, Article ID 648746, 13 pages, 2013.

[124] W. Martin and M. Muller, “The hydrogen hypothesis for the first eukaryote,” Nature, vol. 392, no. 6671, pp. 37–41, 1998.

[125] A. Nasir, K. M. Kim, and G. Caetano-Anollés, “Global patterns of protein domain gain and loss in superkingdoms,” PLoS Computational Biology, vol. 10, no. 1, Article ID e1003452, 2014.

[126] M. Wang, L. S. Yafremava, D. Caetano-Anollés, J. E. Mittenthal, and G. Caetano-Anollés, “Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world,” Genome Research, vol. 17, no. 11, pp. 1572–1585, 2007.

[127] L. Guy and T. J. G. Ettema, “The archaeal ‘TACK’ superphylum and the origin of eukaryotes,” Trends in Microbiology, vol. 19, no. 12, pp. 580–587, 2011.

[128] D. Schlieper, M. A. Oliva, J. M. Andreu, and J. Lowe, “Structure of bacterial tubulin BtubA/B: evidence for horizontal gene transfer,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 26, pp. 9170–9175, 2005.

[129] M. P. Hoeppner, P. P. Gardner, and A. M. Poole, “Comparative analysis of RNA families reveals distinct repertoires for each domain of life,” PLoS Computational Biology, vol. 8, no. 11, Article ID e1002752, 2012.

[130] H. Xue, K. Tong, C. Marck, H. Grosjean, and J. T. Wong, “Transfer RNA paralogs: evidence for genetic code-amino acid biosynthesis coevolution and an archaeal root of life,” Gene, vol. 310, no. 1-2, pp. 59–66, 2003.

[131] H. Xue, S. Ng, K. Tong, and J. T. Wong, “Congruence of evidence for a Methanopyrus-proximal root of life based on transfer RNA and aminoacyl-tRNA synthetase genes,” Gene, vol. 360, no. 2, pp. 120–130, 2005.

[132] M. Di Giulio, “Nanoarchaeum equitans is a living fossil,” Journal of Theoretical Biology, vol. 242, no. 1, pp. 257–260, 2006.

[133] M. Di Giulio, “The tree of life might be rooted in the branch leading to Nanoarchaeota,” Gene, vol. 401, no. 1-2, pp. 108–113, 2007.

[134] J. T. Wong, J. Chen, W. Mat, S. Ng, and H. Xue, “Polyphasic evidence delineating the root of life and roots of biological domains,” Gene, vol. 403, no. 1-2, pp. 39–52, 2007.

[135] F.-J. Sun and G. Caetano-Anollés, “The ancient history of the structure of ribonuclease P and the early origins of Archaea,” BMC Bioinformatics, vol. 11, article 153, 2010.

[136] C. Marck and H. Grosjean, “tRNomics: analysis of tRNA genes from 50 genomes of Eukarya, Archaea, and bacteria reveals anticodon-sparing strategies and domain-specific features,” RNA, vol. 8, no. 10, pp. 1189–1232, 2002.

[137] W. Mat, H. Xue, and J. T. Wong, “The genomics of LUCA,” Frontiers in Bioscience, vol. 13, no. 14, pp. 5605–5613, 2008.

[138] C. Brochier, S. Gribaldo, Y. Zivanovic, F. Confalonieri, and P. Forterre, “Nanoarchaea: representatives of a novel archaeal phylum or a fast-evolving euryarchaeal lineage related to Thermococcales?” Genome Biology, vol. 6, no. 5, article R42, 2005.

[139] B. Billoud, M. Guerrucci, M. Masselot, and J. S. Deutsch, “Cirripede phylogeny using a novel approach: molecular morphometrics,” Molecular Biology and Evolution, vol. 17, no. 10, pp. 1435–1445, 2000.

[140] L. J. Collins, V. Moulton, and D. Penny, “Use of RNA secondary structure for studying the evolution of RNase P and RNase MRP,” Journal of Molecular Evolution, vol. 51, no. 3, pp. 194–204, 2000.

[141] G. Caetano-Anollés, “Novel strategies to study the role of mutation and nucleic acid structure in evolution,” Plant Cell, Tissue and Organ Culture, vol. 67, no. 2, pp. 115–132, 2001.

[142] F.-J. Sun, S. Fleurdepine, C. Bousquet-Antonelli, G. Caetano-Anollés, and J. Deragon, “Common evolutionary trends for SINE RNA structures,” Trends in Genetics, vol. 23, no. 1, pp. 26–33, 2007.

[143] D. Caetano-Anollés, K. M. Kim, J. E. Mittenthal, and G. Caetano-Anollés, “Proteome evolution and the metabolic origins of translation and cellular life,” Journal of Molecular Evolution, vol. 72, no. 1, pp. 14–33, 2011.


[144] M. Wang, Y. Jiang, K. M. Kim et al., “A universal molecular clock of protein folds and its power in tracing the early history of aerobic metabolism and planet oxygenation,” Molecular Biology and Evolution, vol. 28, no. 1, pp. 567–582, 2011.

[145] S. A. Bukhari and G. Caetano-Anollés, “Origin and evolution of protein fold designs inferred from phylogenomic analysis of CATH domain structures in proteomes,” PLoS Computational Biology, vol. 9, no. 3, Article ID e1003009, 2013.

[146] M. Vesteg and J. Krajcovic, “The falsifiability of the models for the origin of eukaryotes,” Current Genetics, vol. 57, no. 6, pp. 367–390, 2011.

[147] I. N. Berezovsky and E. I. Shakhnovich, “Physics and evolution of thermophilic adaptation,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 36, pp. 12742–12747, 2005.

[148] C. Woese, “The universal ancestor,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 12, pp. 6854–6859, 1998.

[149] M. J. Seufferheld and G. Caetano-Anollés, “Phylogenomics supports a cellularly structured urancestor,” Journal of Molecular Microbiology and Biotechnology, vol. 23, no. 1-2, pp. 178–191, 2013.

[150] O. Kandler, “Cell wall biochemistry and three-domain concept of life,” Systematic and Applied Microbiology, vol. 16, no. 4, pp. 501–509, 1994.

[151] C. R. Woese, “On the evolution of cells,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 13, pp. 8742–8747, 2002.

[152] S. L. Baldauf, J. D. Palmer, and W. F. Doolittle, “The root of the universal tree and the origin of eukaryotes based on elongation factor phylogeny,” Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 15, pp. 7749–7754, 1996.

[153] H. Leffers, J. Kjems, L. Østergaard, N. Larsen, and R. A. Garrett, “Evolutionary relationships amongst archaebacteria. A comparative study of 23 S ribosomal RNAs of a sulphur-dependent extreme thermophile, an extreme halophile and a thermophilic methanogen,” Journal of Molecular Biology, vol. 195, no. 1, pp. 43–61, 1987.

[154] J. Martin, D. Blackburn, and E. Wiley, “Are node-based and stem-based clades equivalent? Insights from graph theory,” PLoS Currents, vol. 2, Article ID RRN1196, 2010.

26 Archaea

[144] MWang Y Jiang KMKim et al ldquoA universalmolecular clockof protein folds and its power in tracing the early history ofaerobic metabolism and planet oxygenationrdquoMolecular Biologyand Evolution vol 28 no 1 pp 567ndash582 2011

[145] S A Bukhari and G Caetano-Anolles ldquoOrigin and evolutionof protein fold designs inferred from phylogenomic analysis ofCATH domain structures in proteomesrdquo PLoS ComputationalBiology vol 9 no 3 Article ID e1003009 2013

[146] M Vesteg and J Krajcovic ldquoThe falsifiability of the models forthe origin of eukaryotesrdquo Current Genetics vol 57 no 6 pp367ndash390 2011

[147] I N Berezovsky and E I Shakhnovich ldquoPhysics and evolu-tion of thermophilic adaptationrdquo Proceedings of the NationalAcademy of Sciences of the United States of America vol 102 no36 pp 12742ndash12747 2005

[148] C Woese ldquoThe universal ancestorrdquo Proceedings of the NationalAcademy of Sciences of the United States of America vol 95 no12 pp 6854ndash6859 1998

[149] M J Seufferheld andG Caetano-Anolles ldquoPhylogenomics sup-ports a cellularly structured urancestorrdquo Journal of MolecularMicrobiology and Biotechnology vol 23 no 1-2 pp 178ndash1912013

[150] O Kandler ldquoCell wall biochemistry and three-domain conceptof liferdquo Systematic and Applied Microbiology vol 16 no 4 pp501ndash509 1994

[151] C R Woese ldquoOn the evolution of cellsrdquo Proceedings of theNational Academy of Sciences of the United States of Americavol 99 no 13 pp 8742ndash8747 2002

[152] S L Baldauf J D Palmer and W F Doolittle ldquoThe root of theuniversal tree and the origin of eukaryotes based on elongationfactor phylogenyrdquo Proceedings of the National Academy ofSciences of the United States of America vol 93 no 15 pp 7749ndash7754 1996

[153] H Leffers J Kjems L Oslashstergaard N Larsen and R AGarrett ldquoEvolutionary relationships amongst archaebacteriaA comparative study of 23 S ribosomal RNAs of a sulphur-dependent extreme thermophile an extreme halophile and athermophilic methanogenrdquo Journal of Molecular Biology vol195 no 1 pp 43ndash61 1987

[154] J Martin D Blackburn and E Wiley ldquoAre node-based andstem-based clades equivalent Insights from graph theoryrdquoPLoS Currents vol 2 Article ID RRN1196 2010
