Molecular mechanisms of adaptation emerging from the physics and evolution of nucleic acids and proteins

Alexander Goncearenco1,2, Bin-Guang Ma1 and Igor N. Berezovsky3,4

1CBU, University of Bergen, 5020 Bergen, Norway, 2Department of Informatics, University of Bergen, 5020 Bergen, Norway, 3Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street 07-01, Matrix, 138671 Singapore and 4Department of Biological Chemistry, Weizmann Institute of Science, Rehovot 76100, Israel

Received July 31, 2013; Revised November 27, 2013; Accepted December 1, 2013

ABSTRACT

DNA, RNA and proteins are major biological macromolecules that coevolve and adapt to environments as components of one highly interconnected system. We explore here sequence/structure determinants of mechanisms of adaptation of these molecules, links between them and results of their mutual evolution. We complemented statistical analysis of genomic and proteomic sequences with folding simulations of RNA molecules, unraveling causal relations between compositional and sequence biases reflecting molecular adaptation on DNA, RNA and protein levels. We found many compositional peculiarities related to environmental adaptation and the life style. Specifically, thermal adaptation of protein-coding sequences in Archaea is characterized by a stronger codon bias than in Bacteria. Guanine and cytosine load in the third codon position is important for supporting the aerobic life style, and it is highly pronounced in Bacteria. The third codon position also provides a tradeoff between arginine and lysine, which are favorable for thermal adaptation and aerobicity, respectively. Dinucleotide composition provides stability of nucleic acids via strong base-stacking in ApG dinucleotides. In relation to coevolution of nucleic acids and proteins, thermostability-related demands on the amino acid composition affect the nucleotide content in the second codon position in Archaea.

INTRODUCTION

More than 50 years have passed since Francis Crick, in 1958, first envisioned a way of information transfer from genes to proteins, known as the central dogma of molecular biology (1). The dogma illuminates a relationship between the genotype (coding DNA sequences) and phenotype (proteins) through the mRNA that serves as an 'interpreter' from nucleotide to protein sequences. As a result, the phenotype secures survival, reproduction and evolution of the genotype based on the fitness and evolvability of the latter (2–4). Therefore, even though the basic information flow from genotype to phenotype has an unequivocal directionality, the 'phenotype-to-genotype' feedback or, in other words, the epigenetic variation that facilitates genetic adaptation is an indispensable component of molecular evolution and adaptation (5). The goal of this work is an exhaustive survey of compositional and sequence biases, and of their mutual influence and adjustment, that underlie molecular mechanisms of adaptation of DNA, RNA and protein molecules.

Comparative analysis of genomes and proteomes has proven to be a powerful instrument in finding genes, predicting structures and functions of proteins, and phylogenetic inference. Usually, orthologous sequences from the compared organisms are considered. Despite the obvious importance of the comparative analysis, detection of orthologs depends strongly on the quality of sequence alignments, which is hard to control automatically for large datasets (6). Besides, overall organismal characteristics and species relatedness are not necessarily represented correctly by the resemblance of specific protein-coding sequences, because of ancestral gene duplication, emergence of pseudogenes and gene loss, and lateral/horizontal gene transfer (7). Therefore, if the organismal level of molecular adaptation is concerned, it is important to obtain the whole-genome/proteome average of every characteristic. Molecular mechanisms of adaptation in proteins have been the subject of intense discussion for several decades. The role of nucleotide content in mechanisms of adaptation of nucleic acids (8–11), as well as possible effects of nucleotide composition on the amino acid one (12–16), have been discussed in a number of works (8,10,13,16–20). The (A+G) content, or so-called purine load (11,21,22), and the (G+C) content (8,10,13,16–20) were suggested as signatures of thermal adaptation in prokaryotes (21,23). It has been shown that the increase of the purine load in the coding DNA is to a large extent a result of the thermal adaptation of protein sequences (22) and a signal of stabilizing stacking interactions between purine bases in rRNA (11,21,22). The origin and role of the (G+C) content is a topic of special interest. Specifically, it has been claimed that it is essentially governed by the genome replication and DNA-repair mechanisms (19), is involved in lineage- and niche-specific molecular strategies of adaptation (17), and drives codon usage (20) and even amino acid composition (12–14,16). In the case of protein structure, distinct stabilizing interactions (24–28), their structural determinants (24,29–33) and contributions from different amino acid residues (22,24,25,29,34–36) have been studied extensively (37–40). However, until recently, all the studies focused on a few proteins or small sets of them, on individual stabilizing interactions, or considered anecdotal cases of organisms thriving under normal or extreme temperatures. The first combined predictor of thermostability was proposed in Ponnuswamy et al. (41), and it was finally exhaustively enumerated for monomeric proteins in Zeldovich et al. (22) and for protein complexes in Berezovsky (42) and Ma et al. (43). Rapid growth of genomic data allows one to tackle topics that seemed unreachable a few years ago. Here, we compare the mechanisms of molecular adaptation in Archaeal and Bacterial domains of life. Profound knowledge on phylogeny, metabolism and evolutionary peculiarities of Archaea (44,45) in comparison with Bacteria has been accumulated (46,47). Both Archaea and Bacteria are unicellular prokaryotic organisms, and their macromolecules are under immediate influence of the environment. This makes a comparative study of Archaeal and Bacterial compositional biases and sequence peculiarities an ideal framework for studying mechanisms of adaptation on the molecular level. We analyze complete sets of their coding DNA, mRNA, tRNA, rRNA, non-coding DNA (ncDNA) and protein sequences in order to find generic trends associated with mechanisms of their adaptation, as well as differences between these trends in Archaea and Bacteria. We use 244 Archaeal and Bacterial complete genomes of species with optimal growth temperatures (OGT) spanning from 8°C to 100°C and representing aerobic and anaerobic life styles.

To whom correspondence should be addressed. Email: igorb@bii.a-star.edu.sg
Present address: Bin-Guang Ma, Center for Bioinformatics, Huazhong Agricultural University, Wuhan 430070, China.

Published online 25 December 2013. Nucleic Acids Research, 2014, Vol. 42, No. 5, 2879–2892. doi:10.1093/nar/gkt1336

The Author(s) 2013. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution and reproduction in any medium, provided the original work is properly cited.

Downloaded from http://nar.oxfordjournals.org/ by guest on May 6, 2014

MATERIALS AND METHODS

Datasets

Complete sets of coding sequences for 244 prokaryotic organisms were downloaded from NCBI Refseq and Genbank. We obtained the data on OGT and aerobicity. NCBI Refseq and Genbank contain many more organisms than were used here; however, the data on their OGT and aerobicity are scarce. Therefore, the size of the dataset in this study was determined by the availability of both OGT and aerobicity data for genomes and by the demand for good coverage of the whole temperature interval. Jackknife tests performed in earlier works (22,43) showed that the number of genomes/proteomes in the dataset is sufficient for obtaining unbiased and reliable conclusions, which will persist in future analyses with an extended set of organisms. We classified the genomes according to their domain of life (Archaea and Bacteria), temperature (psychrophiles, OGT < 24°C; mesophiles, 24°C ≤ OGT < 50°C; thermophiles, 50°C ≤ OGT < 80°C; hyperthermophiles, 80°C ≤ OGT) and oxygen tolerance (aerobic, anaerobic, facultative and microaerophilic). In calculations of correlations with OGT (and in the corresponding compositional analysis), we excluded organisms with OGT of 26°C, 30°C and 37°C (116 genomes), since it has been previously shown that they are represented mainly by parasites and symbionts possessing traits unrelated to temperature adaptation (22). The compositional analysis of aerobicity was performed based on the set of 244 genomes, 146 of them classified either as aerobic or anaerobic. Data originating from NCBI Refseq and Genbank were processed with Python programs and imported into a PostgreSQL database with constraints for additional control of data integrity. Molecular features were calculated with Python programs and stored in the database. R scripts were used for statistical analysis. The database is freely accessible for download at http://folk.uib.no/agoncear
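The temperature classes and the exclusion rule described above can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the function names are hypothetical, and the lower bound of each class is assumed to be inclusive.

```python
# Illustrative sketch; names are hypothetical, and inclusive lower bounds are an assumption.

# OGT values excluded from the OGT-correlation analysis (mostly parasites and symbionts).
EXCLUDED_OGT = {26.0, 30.0, 37.0}


def thermal_class(ogt_celsius: float) -> str:
    """Assign a genome to a thermal group by its optimal growth temperature."""
    if ogt_celsius < 24:
        return "psychrophile"
    if ogt_celsius < 50:
        return "mesophile"
    if ogt_celsius < 80:
        return "thermophile"
    return "hyperthermophile"


def used_in_ogt_correlation(ogt_celsius: float) -> bool:
    """True if the genome is kept when correlating features with OGT."""
    return float(ogt_celsius) not in EXCLUDED_OGT
```

For example, a genome with an OGT of 37°C would be classified as a mesophile but left out of the OGT correlations.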

DNA/RNA analysis

We have separated DNA and amino acid sequences of protein-coding genes, nucleic acid sequences of tRNA- and rRNA-coding genes, and ncDNA sequences from the intergenic regions. We generated DNA sequences with unbiased codon usage (NCB, non-codon-bias) by uniformly choosing a codon for each amino acid from all possible codons. Dinucleotide and nucleotide compositions are not preserved in NCB sequences. We reshuffled codons in the DNA sequence (Shfld) by choosing a synonymous codon for each amino acid from the list of possible codons, weighted by their genomic codon usage, hence keeping intact the amino acid sequence. This procedure preserves positional nucleotide composition and positional dinucleotide composition for positions 1-2 and 2-3, but destroys the natural frequency of 3-1 dinucleotides. The Dicodon-shuffle program (48) was used to reshuffle dicodons (dShfld sequence). This procedure is applied gene-wise. It preserves the amino acid sequence and positional nucleotide and dinucleotide frequencies, but destroys the natural mRNA sequence. We analyzed different phases in double-stranded RNA stems. Phases I, II and III mean that respective codon positions 1, 2 and 3 in the sense and anti-sense strands are complementary ones.
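The two codon-level randomizations (NCB and Shfld) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: it uses the standard codon assignments, all function names are hypothetical, and `codon_usage` stands for a genome-wide codon count table supplied by the caller. The dicodon shuffle (dShfld) is not reproduced here.

```python
import random
from collections import defaultdict

# Standard codon assignments, bases ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
SYNONYMS = defaultdict(list)
for _codon, _aa in CODON_TABLE.items():
    SYNONYMS[_aa].append(_codon)


def split_codons(cds):
    """Split a coding sequence into codons, dropping an incomplete trailing codon."""
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def translate(cds):
    """Translate a coding sequence ('*' marks a stop codon)."""
    return "".join(CODON_TABLE[c] for c in split_codons(cds))


def ncb_sequence(cds, rng):
    """NCB: pick a codon uniformly among all synonyms of each encoded amino acid."""
    return "".join(rng.choice(SYNONYMS[CODON_TABLE[c]]) for c in split_codons(cds))


def shuffled_sequence(cds, codon_usage, rng):
    """Shfld: pick a synonymous codon with probability proportional to genomic usage."""
    out = []
    for c in split_codons(cds):
        options = SYNONYMS[CODON_TABLE[c]]
        weights = [codon_usage.get(o, 0) for o in options]
        if sum(weights) == 0:  # no usage data for this amino acid: fall back to uniform
            weights = [1] * len(options)
        out.append(rng.choices(options, weights=weights, k=1)[0])
    return "".join(out)
```

Both functions keep the encoded protein unchanged, which can be checked by comparing `translate()` of the input and of the output; passing a seeded `random.Random` instance makes the randomization reproducible.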

We analyzed the nucleotide composition in natural, shuffled and dShfld sequences and in sequences without the codon bias, calculating genomic averages for tRNA, rRNA and ncDNA regions and for each codon position separately in protein-coding DNA. We grouped genomes based on the domain of life, oxygen tolerance and environmental temperature factors. For each group of genomes, we calculated weighted averages of the genomic compositions, with the weight proportional to genome size, so that each group is represented as one meta-genome. We report the weighted averages in Supplementary File S1 and show the selected values in Table 1. The standard deviations are reported in Supplementary Files S5 and S8, along with the P-values for OGT correlations. P-values for comparisons between the groups reported in the text are calculated using a two-sample weighted t-test (Student's t-test with the Welch correction to the degrees of freedom, which is a standard procedure in R software for unequal variances):

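$$
\mathrm{df}_{\mathrm{Welch}} = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
$$

where $s_i^2$ and $n_i$ are the sample variance and the number of observations in group $i$, respectively.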
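As a quick numerical check of the Welch correction, the degrees of freedom can be computed directly from the two sample variances and group sizes. The sketch below is illustrative only: the function names are hypothetical, and the weighting of observations by genome size used in the study is shown only for the group mean, not for the variances.

```python
def welch_df(var1, n1, var2, n2):
    """Welch-Satterthwaite degrees of freedom for two samples with unequal variances."""
    a = var1 / n1
    b = var2 / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))


def weighted_mean(values, weights):
    """Average of per-genome values with weights, e.g. proportional to genome size."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total
```

With equal variances and equal group sizes the result reduces to n1 + n2 - 2, the degrees of freedom of the ordinary Student's t-test; it becomes smaller as the variances or group sizes diverge.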

Supplementary File S6 describes the tests for (G+C)3 composition in connection to oxygen tolerance. In Supplementary Files S7 and S9, one-sample weighted t-test P-values are reported for the comparisons of nucleotide compositions with 0.25 and 0.5 for nucleotides and nucleotide combinations, respectively. The significance of dinucleotide contrasts (DCT) is assessed by comparing the weighted DCT averages to 1, and it is also reported in the same supplementary files.

RNA folding in silico

The 'RNAfold' program (49) in the 'Vienna RNA Package' was used for performing the RNA folding. Native and randomized sequences were folded using Zuker's energy minimization algorithm (50), which determines the folding free energy for the most favorable conformation from a vast number of possible simulated structures. Calculations were performed with default parameters and a temperature setting of 37°C for all organisms. The latter allowed us to detect the effect of the organismal nucleotide composition on the mRNA structure and stability. Sequences in a sliding window of 50 bases were folded (51) and their characteristics were calculated. We have also checked other sliding windows (100 and 150 bases), observing that they give qualitatively similar outputs. The choice of the 50-base window is justified by the previous knowledge that most known functionally important secondary structures are small and local (48), and structures >50 bases would not normally be formed in actively translating mRNAs (48). We used the dicodon-shuffle program by Katz and Burge (48) for obtaining control sets of reshuffled mRNA sequences. The base pairing pattern [base pair frequencies at three Phases (52) of the mRNA sequence], folding energy of folded mRNAs and their correlations with OGT of the corresponding organisms were examined. The comparison of these features between Archaea and Bacteria was also performed. To purify the signal, i.e. to focus exclusively on the effect of the local mRNA structure, each quantity is also represented as a ratio between the natural sequence signal and the signal from dicodon-shuffled sequences (dShfld). The purine load in the loop and stem regions of the predicted mRNA secondary structures was also analyzed (21).
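The sliding-window scheme and the natural-to-shuffled ratio can be outlined as below. This is a hedged sketch, not the pipeline used in the study: the function names are hypothetical, the window step of one base is an assumption (the text does not state it), and the folding itself is delegated to a caller-supplied `fold_energy` function, for instance a wrapper around RNAfold, so that no particular folding API is assumed here.

```python
from typing import Callable, Iterator, Sequence


def sliding_windows(seq: str, size: int = 50, step: int = 1) -> Iterator[str]:
    """Yield all full-length windows of `size` bases, moving by `step` bases."""
    for start in range(0, len(seq) - size + 1, step):
        yield seq[start:start + size]


def mean_window_energy(seq: str, fold_energy: Callable[[str], float],
                       size: int = 50, step: int = 1) -> float:
    """Average folding energy over all windows of one sequence."""
    energies = [fold_energy(w) for w in sliding_windows(seq, size, step)]
    if not energies:
        raise ValueError("sequence is shorter than the window")
    return sum(energies) / len(energies)


def natural_to_shuffled_ratio(natural: str, shuffled_controls: Sequence[str],
                              fold_energy: Callable[[str], float],
                              size: int = 50, step: int = 1) -> float:
    """Ratio of the natural-sequence signal to the mean signal of shuffled controls."""
    control = [mean_window_energy(s, fold_energy, size, step) for s in shuffled_controls]
    return mean_window_energy(natural, fold_energy, size, step) / (sum(control) / len(control))
```

Keeping the folding routine as a parameter also makes it straightforward to repeat the calculation with 100- or 150-base windows by changing `size`.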

Amino acid and dipeptide composition and OGT correlation

Amino acid Z-scored predictors of the OGT (43) were derived for Archaeal and Bacterial proteomes. Additionally, we analyzed the groups of amino acids according to their physical–chemical properties [charged (D, E, K, R), hydrophobic (C, F, I, L, M, P, V, W) and polar (A, G, H, N, Q, S, T, Y)]. The dipeptide classes for the above residue types and their correlations with OGT were examined separately for Archaea and Bacteria. The dipeptide frequencies were normalized by the individual amino acid frequencies in order to exclude the compositional bias.
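One plausible reading of the normalization described above is to divide each dipeptide frequency by the product of the frequencies of its two residues. The sketch below follows that reading; it is an assumption about the exact formula, and the function names are hypothetical.

```python
from collections import Counter

# Residue classes as listed in the text.
CHARGED = set("DEKR")
HYDROPHOBIC = set("CFILMPVW")
POLAR = set("AGHNQSTY")


def residue_class(aa: str) -> str:
    """Map a one-letter amino acid code to its physical-chemical class."""
    if aa in CHARGED:
        return "charged"
    if aa in HYDROPHOBIC:
        return "hydrophobic"
    if aa in POLAR:
        return "polar"
    raise ValueError(f"unknown residue: {aa}")


def normalized_dipeptides(protein: str) -> dict:
    """Dipeptide frequency divided by the product of the two residue frequencies."""
    aa_counts = Counter(protein)
    di_counts = Counter(protein[i:i + 2] for i in range(len(protein) - 1))
    n_aa = len(protein)
    n_di = len(protein) - 1
    return {
        pair: (count / n_di) / ((aa_counts[pair[0]] / n_aa) * (aa_counts[pair[1]] / n_aa))
        for pair, count in di_counts.items()
    }
```

A value above 1 means the pair occurs more often than expected from the amino acid composition alone; the same counts can be pooled by `residue_class` to obtain the class-level dipeptides.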

RESULTS

In order to achieve our goal of understanding compositional and sequence signals of evolution and adaptation of DNA, RNA and protein macromolecules, we pursue the following strategy in the analysis. First, wherever possible, we single out compositional biases existing in these molecules. Then, we establish possible connections between detected biases and mechanisms of adaptation. Specifically, we discuss nucleotide, dinucleotide, amino acid and dipeptide biases, their correlations with different environmental factors, and their potential role in determining and tuning molecular mechanisms of stability and adaptation in the corresponding macromolecules. We also seek to understand how biases in one type of molecule can affect, or can be affected by, the demands on the sequences/structures of others. Finally, we explore causal relationships between them in light of their evolutionary history and phylogeny. To this end, we analyze the difference in compositional/sequence biases between Archaea and Bacteria in conjunction with their mechanisms of adaptation.

Nucleotide compositions of DNA and RNA

ncDNA has significantly higher adenine and thymine (A+T) content in Archaea (Table 2 and Supplementary File S9), hinting at the role of nucleotide composition in discriminating coding and ncDNA of Archaea; the same mechanism exists in eukaryotes. Deviations of nucleotide contents from even fractions differ between Archaea and Bacteria: A and C are significantly deviated in Archaea, whereas in Bacteria the composition of T and G nucleotides is skewed (Table 2). There is an increased (G+C) load in t- and rRNA of Archaea and tRNA of Bacteria. Bacterial rRNA yields higher guanine load. Both guanine and cytosine loads of t- and rRNA correlate with OGT, and the correlation is stronger in Archaea than in Bacteria (Table 2; correlation coefficients and their significance levels are given in parentheses, P-values are in Supplementary File S8). The role of the (G+C) content in thermal adaptation of t- and rRNAs is further corroborated by folding simulations. The (G+C) content increases and energy per base pair decreases with temperature in stems of folded t- and rRNAs (Table 3). These data point to base-pairing interactions as an important contributor to thermal stabilization (53) of t- and rRNAs of prokaryotes, and it is more strongly manifested in Archaea than in Bacteria.

Table 1. Compositional and sequence signals in coding and ncDNA

A. Correlations between nucleotide compositions and OGT

Codon position 1
  Archaea:  A_Nat/NCB R = 0.69; (A+T)_Nat/NCB R = 0.68; (A+G)_Nat/NCB R = 0.71; (A+C)_Nat/NCB R = 0.71
  Bacteria: A_Nat/NCB R = 0.28+; (A+T)_Nat/NCB R = 0.29; (A+G)_Nat/NCB R = 0.34; (A+C)_Nat/NCB R = 0.36
Codon position 2
  Archaea:  T_Nat R = 0.55; T_NCB R = 0.55; G_Nat/NCB R = 0.60; (T+G)_Nat R = 0.72; (T+G)_NCB R = 0.68; (G+C)_Nat/NCB R = 0.60
  Bacteria: T_Nat R = 0.37; T_NCB R = 0.37; (T+G)_Nat R = 0.36; (T+G)_NCB R = 0.42
Codon position 3
  Archaea:  A_Nat R = 0.02; A_NCB R = 0.70; G_Nat R = 0.22; G_NCB R = 0.55; (A+G)_Nat R = 0.56; (A+G)_NCB R = 0.67
  Bacteria: A_Nat R = 0.13; A_NCB R = 0.76; (A+G)_Nat R = 0.60; (A+G)_NCB R = 0.43; (A+G)_Nat/NCB R = 0.11

B. Position-independent dinucleotide contrasts

ncDNA
  Archaea:  ApA/TpT 1.15; CpC/GpG 1.24; GpT/ApC 0.82
  Bacteria: ApA/TpT 1.26; GpC 1.25
cDNA
  Bacteria: ApA 1.25; ApA_NCB 1.06; TpT 1.20; TpT_NCB 1.13; GpC 1.24; GpC_NCB 1.07
tRNA
  Archaea:  ApA 1.32; CpC 1.26; TpC 1.26; ApC 0.52; TpG 0.70
  Bacteria: ApG 1.28; TpC 1.35
rRNA
  Archaea:  ApA 1.25; CpC 1.27
  Bacteria: ApA 1.15; CpC 1.19

C. Correlations of position-independent dinucleotide contrasts (DCT) with OGT

ncDNA
  Archaea:  CpT R = 0.63; ApG R = 0.64
  Bacteria: GpG R = 0.49
cDNA
  Archaea:  CpT_Nat R = 0.66; CpT_NCB R = 0.57; CpT_Shfld R = 0.57; ApG_Nat R = 0.80; ApG_NCB R = 0.59; ApG_Shfld R = 0.78
  Bacteria: GpG_Nat R = 0.40; GpG_NCB R = 0.09; GpG_Shfld R = 0.32
tRNA
  Archaea:  ApA R = 0.82; TpA R = 0.72; TpT R = 0.80; GpG R = 0.44; CpC R = 0.80
  Bacteria: ApA R = 0.45
rRNA
  Archaea:  ApA R = 0.92; TpA R = 0.88; GpG R = 0.72; CpC R = 0.83
  Bacteria: ApA R = 0.67; CpC R = 0.52

D. Position-dependent dinucleotide composition ratios (Freq), dinucleotide contrasts and their OGT correlations

Codon position 1-2
  Archaea:  Freq(ApG)_Nat/NCB 1.35 (na); Freq(CpG)_Nat/NCB 0.62 (na); ApG 1.02; ApG_NCB 0.79; CpG 0.76; CpG_NCB 1.17
  Bacteria: Freq(CpG)_Nat/NCB 1.27 (na); TpT 1.45; TpT_NCB 1.49; TpA 0.69; TpA_NCB 0.65
  GpT 0.67; GpT_NCB 0.67
  Archaea:  ApG R = 0.77; ApG_NCB R = 0.11; GpC R = 0.55; GpC_NCB R = 0.44; CpC R = 0.37+; CpC_NCB R = 0.37+; ApG_Nat/NCB R = 0.63; GpC_Nat/NCB R = 0.60; CpC_Nat/NCB R = 0.37+
  Archaea:  RpR R = 0.40; RpR_NCB R = 0.15; YpY R = 0.34; YpY_NCB R = 0.05+; RpR_Nat/NCB R = 0.59; YpY_Nat/NCB R = 0.64
  Bacteria: RpR R = 0.33; RpR_NCB R = 0.27+; YpY R = 0.39; YpY_NCB R = 0.33; RpR_Nat/NCB R = 0.37; YpY_Nat/NCB R = 0.45
Codon position 2-3
  Archaea:  TpA R = 0.63; TpA_NCB R = 0.28; CpT R = 0.61; CpT_NCB R = 0.68; ApG R = 0.69; ApG_NCB R = 0.37+
  Bacteria: ApA 1.50; ApA_NCB 1.05; GpC 1.33; GpC_NCB 0.98; ApC_Nat/NCB R = 0.52; GpA_Nat/NCB R = 0.41
  Archaea:  RpR R = 0.20; RpR_NCB R = 0.65; YpY R = 0.28; YpY_NCB R = 0.66
  Bacteria: RpR R = 0.51; RpR_NCB R = 0.73; YpY R = 0.58; YpY_NCB R = 0.75
  RpR_Nat/NCB R = 0.39; YpY_Nat/NCB R = 0.43
Codon position 3-1
  Archaea:  CpT R = 0.60; CpT_Shfld R = 0.05; ApG R = 0.53; ApG_Shfld R = 0.02
  Archaea:  RpR R = 0.30+; RpR_Shfld R = 0.20; YpY R = 0.35+; YpY_Shfld R = 0.06
  Bacteria: RpR R = 0.48; RpR_Shfld R = 0.01; YpY R = 0.56; YpY_Shfld R = 0.17

Pearson correlation coefficient is denoted by R. DCT is calculated as the ratio of dinucleotide frequency to the product of frequencies of the corresponding independent nucleotides. Nucleotides with purine (A or G) and pyrimidine (T or C) bases are denoted with R and Y, respectively. The lower index distinguishes the values observed for natural sequences (Nat), sequences with eliminated codon bias (NCB) and values observed after shuffling of amino acid sequences (Shfld). If the lower index is omitted, the value is given for the natural sequences. The P-values for correlations and for the dinucleotide contrast t-tests (H0: DCT is 1.0) are shown in superscripts as significance levels: +, P-value < 0.05; the two stricter levels are < 0.01 and < 0.0001. Supplementary Files S5 and S7 list all P-values for position-specific correlations, SD for compositions and the P-values for t-tests. Supplementary Files S8 and S9 show the P-values for position-independent contrast t-tests and correlations.

Dinucleotide composition of nucleic acid sequences

Dinucleotide contrasts (DCTs) show the ratio of observed dinucleotide frequencies to the expected ones given the composition of individual nucleotides. Coupling of the same nucleotides (Table 1) is preferred in ncDNA sequences in Archaea (ApA/TpT and CpC/GpG) and Bacteria (ApA/TpT). ApA and CpC dinucleotides are prevalent in t- and rRNA of Archaea and rRNA of Bacteria. Other outstanding contrasts in Archaea are found in ncDNA (GpT/ApC) and tRNA (TpC, ApC and TpG). In Bacteria, ApG and TpC are prevalent in tRNA (Table 1), while GpC is preferred in both non-coding and coding sequences. In coding sequences of Archaea there is no preference for coupling of similar nucleotides, while in Bacteria it is found for ApA and TpT (Table 1).

In both coding and ncDNA sequences of Archaea, there is a clear excess of complementary ApG and CpT dinucleotides, which is persistent after elimination of the codon bias and reshuffling of amino acid sequence (Table 1). In Bacteria, the excess of GpG dinucleotides in non-coding and coding sequences is most pronounced (Table 1), provided by the codon bias in coding sequences. Pairing of similar nucleotides is highly correlated with OGT in tRNA (ApA, TpT, GpG and CpC) and rRNA (ApA, GpG and CpC) in Archaea. Correlation of the same nucleotide pairs is also found in Bacteria, though it is for fewer pairs and is weaker (tRNA, ApA; rRNA, ApA and CpC). In Archaeal t- and rRNA there is also strong correlation of TpA pairs with OGT (Table 1).
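The dinucleotide contrast defined at the start of this section, observed dinucleotide frequency divided by the product of the two single-nucleotide frequencies, can be computed as in the following sketch. The function name is hypothetical, and overlapping dinucleotides along a single strand are counted, which is an assumption about the counting scheme.

```python
from collections import Counter


def dinucleotide_contrasts(seq: str) -> dict:
    """DCT(XpY) = f(XY) / (f(X) * f(Y)) for every dinucleotide present in `seq`."""
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono = len(seq)
    n_di = len(seq) - 1
    return {
        pair: (count / n_di) / ((mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono))
        for pair, count in di.items()
    }
```

A contrast of 1 means the dinucleotide occurs exactly as often as expected from the nucleotide composition; values above 1 indicate an excess and values below 1 a depletion.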

Correlation between nucleotide frequencies and temperature in different codon positions

The first codon position in Archaea is characterized by high correlation of the ratio of natural to NCB frequencies of adenine with OGT [A_Nat/NCB, R = 0.69; Table 1]. The second codon position reveals correlation of the thymine frequency with OGT (Table 1). The combination of thymine and guanine nucleotides is also correlated with OGT; however, thymine is the major contributor (Table 1). The correlation of the guanine content with OGT is supported by the codon bias (Table 1). The combination of guanine with cytosine is also correlated with OGT in the second position of Archaeal sequences and is provided by the codon bias. The third position reveals strong selection against adenine and guanine in relation to OGT (Table 1). Bringing together (anti-)correlations with OGT in different codon positions, one can draw the combined triplet that is optimal from the point of view of thermal adaptation (Table 4) as [A]1 [TG]2 [non-(AG)]3. Prevalence of thymine in the second codon position is linked to an excess of hydrophobes. Indeed, codons with thymine in the second position encode Ile, Leu, Met, Phe and Val. These are strongly hydrophobic residues and the aromatic Phe, which can form many van der Waals contacts and thus contribute to the packing of the hydrophobic core. Notably, elimination of the codon bias does not affect correlation of the thymine fraction with OGT (R = 0.55 in both native and NCB sequences). It shows that the excess of thymine in the second codon position is a result of selection on the protein level. An apparent explanation for such selection is domination of the structure-based strategy in thermostabilization of Archaeal proteins. This strategy is characterized by the increased compactness of the hydrophobic core provided by the massive van der Waals contacts. The correlation of the combination of A and G with OGT is not provided by tuning of the codon bias, which points to adaptation on a level other than DNA.

Table 2. Nucleotide compositions and their OGT correlations in DNA and RNAs

Domain of life  Nucleotide  cDNA_Nat  cDNA_NCB  ncDNA   tRNA (R)        rRNA (R)
Archaea         A           28.45     27.94     30.73   17.13 (-0.76)   23.68 (-0.85)
                T           23.89     24.44     30.65   18.17 (-0.89)   19.24 (-0.88)
                G           26.00     26.12     19.30   33.91 (0.84)    32.08 (0.94)
                C           21.66     21.51     19.31   30.80 (0.84)    24.99 (0.76)
Bacteria        A           24.05     26.42     26.57   19.66 (-0.44)   26.09 (-0.52)
                T           22.68     23.70     26.61   21.56 (-0.51)   20.68 (-0.77)
                G           27.50     26.66     23.41   31.21 (0.53)    31.10 (0.72)
                C           25.77     23.23     23.41   27.58 (0.39)    22.13 (0.60)

The numbers represent the average frequencies of nucleotides in the corresponding parts of genomes, while the numbers in parentheses are correlation coefficients (R) of nucleotide frequencies with OGT. The most important biases and correlations are shown in bold font. The P-values for correlations (H0: correlation coefficient R = 0) and for the nucleotide composition t-tests (H0: mean frequency is 0.25) are shown in superscripts as significance levels (P-value < 0.01, < 0.0001). Supplementary Files S8 and S9 list all correlations and composition tests. cDNA_Nat, natural nucleotide composition in coding DNA; cDNA_NCB, nucleotide composition in coding sequences with eliminated codon bias.

Table 3. OGT correlations in r- and t-RNA observed in folding simulations

Domain of life  RNA type  R((G+C), OGT) (P-value)   R(<Ebp>, OGT) (P-value)
Archaea         rRNA      0.89 (<10^-22)            -0.93 (1.1 x 10^-20)
                tRNA      0.84 (1.84 x 10^-13)      -0.71 (3.34 x 10^-8)
Bacteria        rRNA      0.73 (1.38 x 10^-14)      -0.66 (1 x 10^-11)
                tRNA      0.53 (3.1 x 10^-7)        -0.51 (9.1 x 10^-7)

R((G+C), OGT), correlation coefficient between the (G+C) content and the OGT; R(<Ebp>, OGT), correlation coefficient between the averaged per base pair free energy of RNA folding and OGT.
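The statement that codons with thymine in the second position encode Ile, Leu, Met, Phe and Val can be verified directly against the standard codon assignments. The short sketch below does so; the helper name is hypothetical.

```python
# Standard codon assignments, bases ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def amino_acids_with(base: str, position: int) -> set:
    """Amino acids encoded by codons carrying `base` at codon position 1, 2 or 3."""
    return {
        aa for codon, aa in CODON_TABLE.items()
        if codon[position - 1] == base and aa != "*"
    }


# One-letter codes: F = Phe, I = Ile, L = Leu, M = Met, V = Val.
print(sorted(amino_acids_with("T", 2)))
```

Because every second-position T codon maps to one of these five hydrophobic residues, a protein-level demand for hydrophobic residues necessarily raises the thymine content in the second codon position, regardless of which synonymous codon is chosen.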

Bacterial coding DNA sequences yield fewer position-dependent correlations with OGT than Archaeal ones. The ratio of natural adenine frequency to the one for eliminated codon bias is weakly correlated with OGT (Table 1). There are also moderate correlations of thymine frequency and of the thymine and guanine combination with OGT in the second codon position (Table 1). The third codon position is characterized by selection against adenine, provided by the codon bias (Table 1). However, the combination of adenine and guanine is correlated with OGT. It does not depend on the codon bias to a large extent (same as in Archaea), pointing to a possible signal of adaptation on a level other than the protein-coding DNA sequence. The generalized codon in Bacteria (Table 4) characterizing the thermophilic trends therefore reads [weak A]1 [weak T]2 [non-(AG)]3.

The role of excessive (G+C) load in the third codon position in adaptation to aerobicity

In both Archaea and Bacteria, the average nucleotide composition in different codon positions is not affected by the codon bias to a large extent, except the third codon position in Bacteria. However, the compositional variance in the third position is essentially destroyed when there is no codon bias (standard deviation (SD) diminishes from >10% to <1%, see Supplementary File S5). The natural (G+C)3 load in Bacteria yields a 9% excess in comparison with NCB sequences (P-value = 1.7e-07, Supplementary File S6), while in Archaea there is no significant difference. We found that, despite frequent involvement of G3 and C3 nucleotides in the GC/CG pairs, the (G+C)3 load does not play a crucial role in thermal stability of mRNA (52,54). Overall, the difference in (G+C)3 load between Archaea and Bacteria is insignificant. If we consider separate thermal groups, the only difference appears in mesophilic Bacteria (9.5% more in Bacteria, P = 0.09). However, when we take into account the oxygen tolerance factor, the difference in (G+C)3 load becomes extremely pronounced between aerobic (70.13% in natural and 50.84% in NCB) and anaerobic (48.94% in natural and 50.46% in NCB) species (P-value = 2e-09). Considering the lack of OGT correlation in protein-coding nucleotides (except thymine in the second codon position), one can conclude that increased (G+C)3 load is a result of the aerobic life style rather than thermal adaptation (Supplementary Files S2 and S3). The comparison of aerobes and anaerobes in the group of mesophilic Bacteria corroborates this conclusion (Supplementary File S6). The difference is >20% of (G+C)3 content: 70.97% in aerobes and 50.51% in anaerobes (P = 0.000178). This adaptation is entirely driven by the codon bias, because if the codon bias is removed, the (G+C)3 bias as well as the difference between aerobes and anaerobes disappears (50.87% and 50.64%).

We also compared usage of codons with G or C in the third position with other codons ('Codon Usage' in Supplementary Files S3 and S6). There are three amino acids encoded by six codons (Leu, Arg, Ser), five by four codons (Ala, Pro, Thr, Gly, Val), one by three codons (Ile), nine by two codons (Lys, Asn, Asp, Phe, Cys, Gln, Glu, His, Tyr) and two by one codon (Met, Trp). Typically, half of the synonymous codons of each amino acid have G or C in the third codon position, the others A or T. It appeared that almost all of the codons with G and C in the third position have higher frequencies in aerobes compared to anaerobes, and in Bacteria than in Archaea (see 'Codon Usage' in Supplementary File S3). This observation holds for all codons of the 'two-codon' amino acids and for the majority of codons of the three-, four- and six-codon amino acids. The noticeable exception is significant suppression of AGA/AGG codons of Arg (aerobes–anaerobes 177–125, P-values 4.46e-8/5.8e-8; Bacteria–Archaea 174–2625, P-values 1.65e-7/1.14e-11). In lysine, the codon AAA is suppressed in aerobes (difference 1716, P-value 6.44e-6), while the codon AAG is preferred. The AAG codon of lysine is favorable in aerobes (aerobes/anaerobes 1.43), and the other lysine codon, AAA, can be turned into AAG by only one mutation in the third codon position. Therefore, one can speculate that the demand for discriminating between Arg and Lys is the most probable cause for the suppression of AGA and AGG codons of arginine in aerobes. Lysine is thus preferred in aerobic adaptation over arginine (Supplementary Files S3 and S6). The contribution of the amino acid in relation to thermophilic adaptation is not compromised, because of the similarity between physical–chemical characteristics of lysine and arginine from the point of view of thermostability. Overall, an excess of the G3 codons is always more pronounced in aerobes compared to anaerobes, which corroborates the conclusion that the (G+C)3 load is an indication of the aerobic life style (55–58). The similar trend in the difference between Bacteria and Archaea is a result of the domination of the aerobic life style in Bacteria in the analyzed dataset.

Table 4. Generalized nucleotides and dinucleotides in different codon positions favorable for thermostability

Nucleotides correlated with OGT

Domain of life                       Codon position 1   2                                 3
Archaea    Nucleotide                A                  TG                                Non-[AG]
           Origin of the bias        Codon bias         T: amino acid; G: codon bias      Against codon bias
Bacteria   Nucleotide                Weak A             Weak T                            Non-[AG]
           Origin of the bias        Codon bias         Amino acid                        Against codon bias

Dinucleotides in Purine (R) and Pyrimidine (Y) notation correlated with OGT

Domain of life                       Codon position 1-2   2-3               3-1
Archaea    Dinucleotide              RpR, YpY             Non-RpR, YpY      RpR, YpY
           Origin of the bias        Codon bias           Codon bias        Amino acid
Bacteria   Dinucleotide              RpR, YpY             Non-RpR, YpY      RpR, YpY
           Origin of the bias        Not codon bias       Codon bias        Amino acid

Part 1: Thermophilic-prone nucleotide biases. Columns 1, 2, 3 contain information on favorable nucleotides and origin of the bias in codon positions 1, 2, 3. Part 2: Thermophilic-prone dinucleotide biases. Columns 1-2, 2-3, 3-1 contain information on favorable nucleotides and origin of the bias in codon positions 1-2, 2-3, 3-1.
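Two quantities used in this section, the codon degeneracy of each amino acid and the (G+C) load in the third codon position, are easy to reproduce. The sketch below is illustrative: it relies on the standard codon assignments, and the function names are hypothetical.

```python
from collections import Counter

# Standard codon assignments, bases ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def degeneracy_distribution() -> dict:
    """Map 'number of synonymous codons' to 'number of amino acids' (stops excluded)."""
    per_aa = Counter(aa for aa in CODON_TABLE.values() if aa != "*")
    return dict(Counter(per_aa.values()))


def gc3(cds: str) -> float:
    """Fraction of complete codons in `cds` that end in G or C."""
    thirds = [cds[i + 2] for i in range(0, len(cds) - len(cds) % 3, 3)]
    return sum(base in "GC" for base in thirds) / len(thirds)


def gc3_share_of_synonyms(aa: str) -> float:
    """Share of the synonymous codons of one amino acid that end in G or C."""
    codons = [c for c, a in CODON_TABLE.items() if a == aa]
    return sum(c[2] in "GC" for c in codons) / len(codons)
```

`degeneracy_distribution()` reproduces the 3/5/1/9/2 split quoted above, and `gc3_share_of_synonyms` returns 0.5 for the two-, four- and six-codon amino acids, which is why an unbiased (NCB) sequence is expected to have a (G+C)3 load close to 50%.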

Correlation between dinucleotide frequencies in different codon positions and OGT

In order to analyze dinucleotides in different codon positions (1-2, 2-3 and 3-1), we used the DCTNat, DCTNCB (for positions 1-2 and 2-3) and shuffled (DCTShfld, for position 3-1) sequences. We considered correlations of these contrasts with OGT. The most pronounced biases are found for dinucleotides in coding sequences of Archaea. The ApG dinucleotides show the strongest excess in all positions of natural sequences compared to NCB and shuffled ones. In codon positions 1-2 and 2-3, this compositional peculiarity and its correlation with OGT are provided by the codon bias (Table 1). In position 3-1, the high correlation of ApG frequencies with OGT is determined by the amino acid sequences, and it vanishes after amino acid reshuffling (Table 1). There is also a correlation of the frequency of the complementary CpT dinucleotide in position 3-1 with OGT, which disappears after reshuffling of amino acid sequences (Table 1). The most plausible role of ApG dinucleotides is a contribution to stabilization via base-stacking interactions between the purine rings (22). This conclusion is supported by the fact that the excess of ApG dinucleotides is provided by the codon bias, thus manifesting adaptation on the DNA level. The complementary CpT dinucleotides indicate enrichment of the anti-sense strand of double-stranded DNA (dsDNA) with the stabilizing ApG ones. The resulting mosaic of ApG stacking in sense and anti-sense strands can provide stabilization over long distances without compromising flexibility of the dsDNA. Other overrepresented dinucleotides (regardless of the codon bias) correlated with OGT in positions 1-2 are GpC, CpC, TpA and CpT (Table 1). The frequencies of TpA and CpT dinucleotides are correlated with OGT in position 2-3, where the former is provided by the codon bias and the latter is not.
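As an illustration of how dinucleotides can be assigned to codon positions 1-2, 2-3 and 3-1, the following sketch counts them in an in-frame coding sequence. The function name and the returned data structure are assumptions made for this example only.

```python
from collections import Counter


def dinucleotide_counts(cds: str) -> dict:
    """Count dinucleotides separately for codon positions 1-2, 2-3 and 3-1.

    Position 3-1 joins the last base of a codon with the first base of the
    next codon. Assumes an in-frame coding sequence (illustrative sketch).
    """
    seq = cds.upper()
    seq = seq[: len(seq) - len(seq) % 3]   # drop a trailing partial codon
    labels = ("1-2", "2-3", "3-1")
    counts = {label: Counter() for label in labels}
    for i in range(len(seq) - 1):
        # i % 3 == 0 starts at codon position 1, == 1 at position 2, == 2 at position 3
        counts[labels[i % 3]][seq[i:i + 2]] += 1
    return counts


if __name__ == "__main__":
    for position, counter in dinucleotide_counts("AAGAGT").items():
        print(position, dict(counter))
```

Here "AG" corresponds to ApG in the 5' to 3' direction. Normalizing each counter by its total gives frequencies that could then be correlated with OGT across genomes.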

In purine (R) and pyrimidine (Y) notation, codon bias provides grouping (stacking) of similar nucleotides in position 1-2, which increases with OGT (Table 1). At the same time, codon bias works against such grouping in position 2-3 (Table 1). The amino acid sequence is responsible for the grouping of similar nucleotides and correlations with OGT in position 3-1 (Table 1). The resulting generalized thermophilic pattern of dinucleotides in Archaeal cDNA sequences reads [RpR, YpY]1-2 [non-RpR, non-YpY]2-3 [RpR/YpY]3-1. Most of the compositional peculiarities found in Bacteria are not determined by the codon bias, yielding weak correlations with temperature (Table 1). The thermophilic pattern of dinucleotides in Bacteria (Table 4) is the same as the Archaeal one. The difference from Archaea is that selection for RpR and YpY dinucleotides in position 1-2 is not determined by the codon bias in Bacteria (Tables 1 and 4).
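The purine/pyrimidine notation used above can be made concrete with a short sketch that maps a dinucleotide to its R/Y class and reports the share of homo-class (RpR or YpY) dinucleotides. Both function names are hypothetical.

```python
PURINES = set("AG")
PYRIMIDINES = set("CTU")


def ry_class(dinucleotide: str) -> str:
    """Translate a dinucleotide such as 'AG' into R/Y notation ('RpR')."""
    bases = dinucleotide.upper()
    if len(bases) != 2 or any(b not in PURINES | PYRIMIDINES for b in bases):
        raise ValueError(f"not a dinucleotide: {dinucleotide!r}")
    codes = ["R" if b in PURINES else "Y" for b in bases]
    return f"{codes[0]}p{codes[1]}"


def homo_fraction(dinucleotides) -> float:
    """Share of dinucleotides that group similar bases (RpR or YpY)."""
    classes = [ry_class(d) for d in dinucleotides]
    return sum(c in ("RpR", "YpY") for c in classes) / len(classes)


if __name__ == "__main__":
    print(ry_class("AG"), ry_class("CT"), ry_class("TA"))
    print(homo_fraction(["AG", "CT", "TA", "GC"]))
```

Applied separately to the dinucleotides of positions 1-2, 2-3 and 3-1, such a share could be compared with the pattern [RpR, YpY]1-2 [non-RpR, non-YpY]2-3 [RpR/YpY]3-1 described in the text.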

Compositional and sequence biases observed in folding simulations of mRNA

Though RNA molecules have different overall structures, one may expect that there are some common mechanisms providing stability and function of the folded t-, r- and mRNAs. We have shown above that the (G+C) content apparently contributes to the stability of the t- and rRNA (36,59) and not to the stability of coding DNA (54,59,60). It is manifested in the correlation of the former with OGT (Table 2) and the anti-correlation of the base-pairing energy in the folded structures with OGT (Table 3). Nucleotide compositions in coding DNA (hence in mRNA as well) do not correlate with OGT. The mRNA case is of special interest because the redundancy of the genetic code endows the nucleotide sequence with a potential to satisfy requirements for DNA, mRNA and protein stability. For example, it has been hypothesized (61) that the optimization of the base-pairing in the mRNA contributes to the formation of stem fragments. The authors claimed that there is a corresponding periodic pattern of the mRNA secondary structure in human and mouse, which is determined chiefly by the selection that operates on the third codon position (52). An opposing opinion suggests that the secondary structure in mRNA interferes with translation and, therefore, should be avoided (62). Overall, three scenarios of the relation between mRNA and proteins have been considered earlier (52): (i) the biases are determined by the demands on protein stability; (ii) mRNA stability is the major determinant of sequence biases; (iii) complete independence of the sequence biases related to mechanisms of stability on each level. We have computationally folded the mRNA sequences from 244 prokaryotic genomes and analyzed their characteristics

2886 Nucleic Acids Research 2014 Vol 42 No 5


(Table 5). All the parameters are reported as relative ones, where the signal in the native sequence is normalized by the value obtained for dicodon-shuffled ones (48). The dicodon shuffling randomizes the mRNA sequence while preserving the native protein sequence, native codon usage and native dinucleotide composition. Therefore, it allows one to separate peculiarities of mRNA caused by the demands on its stability and function from the others related to the coding DNA and proteins.
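The idea of normalizing a native signal by shuffled controls can be sketched as follows. Note the loud assumptions: the shuffle below is a simplified synonymous-codon permutation that keeps the protein sequence and codon usage but does NOT preserve dinucleotide composition, so it is not the dicodon shuffle of reference (48); averaging over the controls is also an assumption, and all function names are hypothetical.

```python
import random
from itertools import product
from statistics import mean

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def split_codons(cds: str) -> list:
    """Split an in-frame, upper-case DNA sequence into codons."""
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def translate(cds: str) -> str:
    return "".join(CODON_TABLE[codon] for codon in split_codons(cds))


def synonymous_shuffle(cds: str, rng: random.Random) -> str:
    """Permute codons among positions that encode the same amino acid.

    Simplified control: preserves protein sequence and codon usage only.
    """
    codons = split_codons(cds)
    positions_by_aa = {}
    for index, codon in enumerate(codons):
        positions_by_aa.setdefault(CODON_TABLE[codon], []).append(index)
    shuffled = codons[:]
    for positions in positions_by_aa.values():
        pool = [codons[i] for i in positions]
        rng.shuffle(pool)
        for index, codon in zip(positions, pool):
            shuffled[index] = codon
    return "".join(shuffled)


def relative_signal(native_value: float, control_values) -> float:
    """Native signal divided by the mean signal of the shuffled controls."""
    return native_value / mean(control_values)


if __name__ == "__main__":
    rng = random.Random(0)
    cds = "ATGAAAAAGCGTAGATTAAAA"
    control = synonymous_shuffle(cds, rng)
    print(translate(cds) == translate(control))
```

A relative value above 1 would indicate that the native sequence carries more of the measured signal than expected from its protein sequence and codon usage alone, under the simplifications stated above.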

Structure of stems in folded mRNA is characterized by phases, which show the relative location of triplets in opposite strands of the stem. Specifically, Phases I, II and III correspond to positioning of triplets where, respectively, the first, second and third nucleotides are complementary. The most and the least frequent pairs in all phases of the dsDNA (see Materials and methods section) are considered. Phases I and II are similar to each other, yielding CG, GC, GU, UG, AU and UA (Table 5). Overall, the Archaea and thermophilic Bacteria have similar trends in the most and least frequent pairs. Mesophilic Bacteria stand out with A13U and G33C as the most frequent pairs in Phases II and III, respectively. The major contribution from the three pairing phases (52) to the total amount of stem pairs is provided by Phase III, and nucleotides in codon position 3 are most frequent in stem pairs (Table 5 and Supplementary Table S1). OGT correlations highlight the contribution from the GU wobble pair to the thermostability of archaeal mRNA structures (Table 5 and Supplementary Table S1). We found the highest OGT correlation for the Archaeal pair U32G (Table 5), confirming the importance of the GU wobble pairs. The stronger effect for Archaeal rather than for Bacterial mRNA is possibly a relic of the ancient RNA world, where RNA was the carrier of genetic information and harsh environmental conditions demanded its increased stability.
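The phase definition above can be turned into a small calculation. In an antiparallel stem, if base i pairs with base j, then base i+1 pairs with base j-1, so the sum of the two codon positions modulo 3 stays constant along the stem. The sketch below uses this observation, which is a derivation made here from the stated definition rather than the authors' code; it assumes 0-based indices into an in-frame coding sequence, and the names are hypothetical.

```python
# (p + q) % 3 for paired codon positions p and q:
#   1-1, 2-3, 3-2 -> 2 (Phase I); 2-2, 1-3, 3-1 -> 1 (Phase II); 3-3, 1-2, 2-1 -> 0 (Phase III)
PHASE_BY_SUM = {2: "I", 1: "II", 0: "III"}


def codon_position(index: int, frame_start: int = 0) -> int:
    """Codon position (1, 2 or 3) of a 0-based sequence index."""
    return (index - frame_start) % 3 + 1


def pairing_phase(i: int, j: int, frame_start: int = 0) -> str:
    """Phase of a stem base pair between sequence indices i and j."""
    p = codon_position(i, frame_start)
    q = codon_position(j, frame_start)
    return PHASE_BY_SUM[(p + q) % 3]


if __name__ == "__main__":
    # First bases of two codons paired: Phase I, and so is the neighbouring pair
    print(pairing_phase(0, 9), pairing_phase(1, 8))
    # Third bases paired: Phase III
    print(pairing_phase(2, 5))
```

Under this reading, pairs between codon positions 2 and 3 belong to Phase I, pairs between positions 1 and 3 to Phase II, and pairs between positions 3 and 3 or 1 and 2 to Phase III, which agrees with the position labels listed per phase in Table 5.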

Table 6 shows that purine load (the contents of A+G, R/Y, ApG) is larger in the loop regions than in the stems. In Archaea, this difference is slightly more pronounced than in Bacteria (Supplementary Table S2). The purine load is also in good positive correlation with OGT for the loops, but not for the stems. However, the amount of ApG dinucleotides is correlated with OGT in both loops and stems (Table 6). For synonymous codons, the fractions of purine-rich codons (e.g. GGR versus GGY for glycine and AGR versus CGY for arginine) are correlated with OGT in both loop and stem regions (slightly stronger in loops). For non-synonymous codons, however, the amount of purine-rich codons (GAR for glutamate) is correlated with OGT in both loops and stems, while fractions of purine-rich AAR codons for lysine do not correlate with OGT (Table 6). While Forsdyke et al. (21) reported an increase of the purine content and its positive correlation with OGT in loops, we found that the fraction of purine-rich synonymous codons is correlated with OGT in stems as well. At the same time, the amount of non-synonymous purine-rich codons of lysine is not correlated with OGT in either loop or stem regions.
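A minimal sketch of how purine load could be compared between loops and stems is given below. It assumes that the predicted secondary structure is available in dot-bracket notation, where '.' marks an unpaired base and brackets mark paired bases; this input format and the function name are assumptions for illustration, not a description of the authors' pipeline.

```python
def purine_load(sequence: str, structure: str) -> dict:
    """A+G fraction in unpaired (loop) and paired (stem) positions.

    `structure` is assumed to be a dot-bracket string of the same length
    as `sequence`.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    totals = {"loop": [0, 0], "stem": [0, 0]}   # [purines, all bases]
    for base, mark in zip(sequence.upper(), structure):
        region = "loop" if mark == "." else "stem"
        totals[region][0] += base in "AG"
        totals[region][1] += 1
    return {
        region: (purines / size if size else float("nan"))
        for region, (purines, size) in totals.items()
    }


if __name__ == "__main__":
    print(purine_load("GGGAAACCC", "(((...)))"))
```

The same loop/stem split could be reused for the other features of Table 6, for example by counting ApG dinucleotides or purine-rich codons within each region.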

Signals of thermophilic adaptation in protein sequences

It has been shown earlier (22) that environmental temperature directly affects the amino acid composition of prokaryotic proteins, and the IVYWREL combination can serve as a predictor of the OGT in prokaryotes. We use here a 'Z-score' predictor of OGT (43), which takes into account differences in variances of the proteomic frequencies of individual amino acids (Table 7). We have shown that the Z-score predictor properly corrects for these differences (43), better reflecting the contribution of the amino acid combinations to thermal adaptation. Separate predictors for Archaea and Bacteria (based on 33 and 99 proteomes with OGTs spread over 5–100 and 10–85°C, respectively) yield the same general trend in the increase of hydrophobic and charged residues at the

Table 5. Compositional and dicodon signals of mRNA adaptation purified by the dicodon shuffling

Characteristic                         Archaea                      Bacteria

The most and least (after the comma) frequent pairs in stems of mRNA
  Mesophiles
    Phase I                            C23G, G23U                   G32U, U32G
    Phase II                           C31G, G13U                   A13U, G13U
    Phase III                          A33U, U33G                   G33C, U33G
  Thermophiles
    Phase I                            G32C, G23U                   C23G, G23U
    Phase II                           C22G, G13U                   C31G, G13U
    Phase III                          A33U, U33G                   U33A, U33G

Correlations with OGT of
    Segment energy <Esg>               R=–0.71                      R=–0.39
    Energy per base pair <Ebp>         R=–0.73                      R=–0.54

The most significant correlations with OGT
                                       Phase I U32G R=0.77          (U21G)Nat PIII R=0.46
                                       Phase I G23U R=0.71          (G12U)Nat PIII R=0.45
                                       Phase I All pairs R=0.7      (U21G)dShfld PIII R=0.45
                                                                    (G12U)dShfld PIII R=0.41

Phases I, II and III correspond to positioning of triplets where, respectively, the first, second and third nucleotides are complementary. All signals (except OGT correlations in Bacteria) are normalized by corresponding values for control sequences after the dicodon shuffling. See explanations of abbreviations in the Materials and methods section.


expense of polar ones (34), showing minor differences in the presence of individual residues.

We have shown earlier (22) that the amino acid bias working in thermal adaptation of proteins does not depend on the overall nucleotide composition of coding DNA sequences. However, one can still expect that nucleotide compositions of particular codon positions and/or dinucleotide compositions of positions 1-2 and 2-3 in codons may be somehow linked to amino acid biases. We indeed found a significant difference between nucleotide load in individual codon positions of Archaeal and Bacterial coding DNA sequences (Table 1). First, Archaeal protein-coding sequences yield a strong correlation between the (T+G)2 load in the second codon position and OGT (r=0.72). The corresponding non-codon-biased sequences show a comparable correlation (r=0.68), indicating that the (T+G) load reflects tuning of the amino acid composition in connection to thermal adaptation. The main contributor to this correlation is thymine (Table 1), which provides an increase of strongly hydrophobic residues LMFIV (Supplementary File S1). The (T+G) load is much more weakly correlated with OGT in Bacterial sequences (r=0.36 for natural and r=0.42 for NCB sequences). The correlation of the thymine load with OGT is 0.37 for both natural (NAT) and non-codon-biased (NCB) sequences. This observation agrees with the presumed prevalence of the structure-based strategy of thermal adaptation in Archaeal proteins (29), contrary to the sequence-based one in Bacterial proteomes. Specifically, there is a clear correlation between amounts of charged residues and the OGT in the set of Bacterial proteomes (Table 7), presumably reflecting the domination of the sequence-based strategy in thermophilic adaptation of Bacteria (29).

Finally, we checked if there exists any specific connection between the dipeptide composition of proteins and that of 3-1 dinucleotides. We found (data not shown) that the strongest correlations between dinucleotides 3-1 and dipeptides exist mostly for the dinucleotides with guanine in position 3. The most frequent pair of amino acids contains methionine (Met, encoded exclusively by the AUG codon) as the first and any polar/charged amino acid as the second one. The preference of polar/charged amino acids to be next to Met can be explained by the typically surface location of the protein N-termini and its role as a signal for ubiquitination (63).
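The quantities used in this subsection, a position-specific nucleotide load such as (T+G)2, the IVYWREL fraction of a protein, and their correlation with OGT, can be sketched as below. The function names are hypothetical, the Pearson coefficient is written out explicitly to keep the example self-contained, and the Z-score predictor of reference (43) is not reproduced.

```python
from math import sqrt


def positional_load(cds: str, position: int, bases: str) -> float:
    """Fraction of codons with one of `bases` at codon `position` (1, 2 or 3)."""
    if position not in (1, 2, 3):
        raise ValueError("position must be 1, 2 or 3")
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    column = seq[position - 1:usable:3]
    return sum(base in bases for base in column) / len(column)


def ivywrel_fraction(protein: str) -> float:
    """Fraction of residues belonging to the IVYWREL set (22)."""
    residues = protein.upper()
    return sum(aa in "IVYWREL" for aa in residues) / len(residues)


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Codons ATG, GTT, AAA: second positions are T, T, A
    print(positional_load("ATGGTTAAA", 2, "TG"))
    print(ivywrel_fraction("MIVK"))
```

In practice one value per genome (or proteome) would be paired with that organism's OGT, and `pearson_r` applied to the two resulting lists.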

DISCUSSION

We will make a brief overview of the most pronounced biases and causal relationships in DNA, RNA and proteins. The major signal in nucleotide compositions, specifically the (G+C) load, is related to thermal stabilization of t- and rRNA (36,54,55,59,60) and to discrimination between the coding and non-coding sequences in Archaea (Table 2). Comparative analysis of nucleotide and amino acid compositions in relation to thermophilic adaptation prompts the conclusion that the (G+C) content does not contribute to thermostabilization of coding DNA (18,22,36,54,59), nor does it affect amino acid composition and its thermophilic trends (15,22). When

Table 6. Purine loading in loop and stem regions of folded mRNA and its OGT correlation

                    Loop                           Stem
Feature             Mean content  OGT correlation  Mean content  OGT correlation  L-v-S P-value

A+G                 0.560          0.59            0.500         –0.26            <2.2E–16
R/Y                 1.299          0.61            1.002         –0.26            <2.2E–16
ApG                 0.061          0.79            0.051          0.50            0.0002
GGR (glycine)       0.027          0.62            0.045          0.42            1.2E–17
GGY (glycine)       0.017         –0.05            0.079         –0.23            1.0E–42
AGR (arginine)      0.035          0.72            0.022          0.56            2.4E–10
CGR (arginine)      0.018         –0.22            0.040         –0.24            7.1E–09
CGY (arginine)      0.016         –0.25            0.037         –0.26            6.2E–10
GAR (glutamate)     0.060          0.71            0.036          0.59            <2.2E–16
AAR (lysine)        0.086          0.22            0.017          0.11            <2.2E–16
GAY (aspartate)     0.037         –0.17            0.040         –0.37            0.0002

Feature: analyzed nucleotide, dinucleotide or amino acid. Loop and Stem: information on mean content and OGT correlation of the above. L-v-S: a comparison between corresponding contents in the loop and stem regions by Wilcoxon tests; P-values are shown in this column. Correlation coefficients with OGT are shown; P-value < 0.01; P-value < 0.0001.

Table 7. Signals of thermophilic adaptation in protein sequences of Archaea and Bacteria

Correlation with OGT                     Archaea                 Bacteria

Z-scored predictor                       ILVW Y DKR (R=0.93)     IPV Y EKR (R=0.89)
The most abundant residues in >70%
(>60%) of all Z-scored predictors        VIWL Y ER               VPI Y ER(K)
Individual amino acids                   + L W                   + E
                                                                 – T Q D
Types of amino acids                     + h                     + h c
                                         – p                     – p
Dipeptides                               + hp ph                 + cc
                                         – cp pc                 – pc

+, increase of the amino acid (or amino acid type) fraction with OGT; –, decrease of the amino acid (or amino acid type) fraction with OGT. Capital letters are names of amino acids; h, p, c are hydrophobic, polar, charged types of amino acids. Correlation coefficients between the Z-scored thermostability predictors and OGT are given in parentheses.


separate codon positions are considered for coding DNA, the only compositional bias observed for nucleotide compositions is an excess of the (G+C) load in the third codon position in Bacteria. There is a significant excess of guanine and cytosine nucleotides (29.2 and 32.82) compared to the NCB nucleotide contents (26.09 and 24.66). A plausible explanation for this bias is the role of the (G+C)3 load in the adaptation to the aerobic life style, which dominates in Bacteria. The (G+C)3 load can be advantageous for several complementary reasons. Indeed, additional GC base pairing will contribute to the stability of the double helix of DNA and of stem regions of RNA molecules. Moreover, G3 bases can work as scavengers of oxidizing agents, providing protection for G bases in other codon positions (56–58). At the same time, this bias would not lead to any changes in amino acid composition, leaving protein structure and stability intact (55).

In this work, we considered compositional and sequence biases in proteins in relation to those in corresponding nucleic acid sequences and to the phylogeny of species (Archaea or Bacteria). OGT correlations of nucleotides in different codon positions are similar in Archaea and Bacteria, but higher in the former (Table 1). The codon that is optimal from the point of view of thermophilic adaptation (i.e. most correlated with OGT) reads A1 [TG]2 non-[AG]3 in Archaea, but [weak A]1 [weak T]2 non-[AG]3 in Bacteria (Table 4). Codon bias supports the strong correlation between the amount of adenine and OGT in the first codon position in Archaea, and the same but weaker correlation in Bacteria. While we showed earlier (22) that the amino acid thermophilic trend is not determined by the overall nucleotide composition, consideration of separate codon positions reveals an interesting link between positional nucleotide frequencies and amino acid composition in relation to the domain of Life. Specifically, the prevalence of the nucleotide thymine in the second codon position and its strong correlation with OGT in Archaea is apparently a result of the demand on enrichment of Archaeal proteins with hydrophobic residues (Table 7). Noteworthy, a massive increase of van der Waals interactions (24,64) was found to be the cornerstone of the structure-based evolutionary strategy of protein thermostability in ancient species (29). In Bacterial proteins, however, there is an increase of the fractions of charged residues with OGT (Table 7). Thus, we observe a transition from the structure-based evolutionary strategy of protein thermostability in Archaea to the sequence-based one in Bacteria, and corresponding nucleotide and protein compositional biases (Table 7) underlying these strategies (29). The third codon position is characterized by selection against adenine and guanine in Archaea and against adenine in Bacteria, which is a result of this bias.

The most pronounced dinucleotide compositional biases are the excess of homo-dinucleotides and its correlation with OGT in ncDNA, tRNA and rRNA. It is presumably a relic of ancient primitive homo(poly)nucleotides from which life started. Overrepresentation of the complementary ApG/CpT dinucleotides points to their contribution to DNA/RNA stability via strong base-stacking interactions (65,66). The consideration of dinucleotide biases in purine (R) and pyrimidine (Y) notation also reveals an interesting picture. Overall, for both Archaea and Bacteria, the signature for thermophilic dinucleotides reads RpR/YpY, non-(RpR/YpY), RpR/YpY for codon positions 1-2, 2-3 and 3-1, respectively (Table 4). The major difference between Archaea and Bacteria in this case is that the dinucleotide bias in positions 1-2 is caused by demands on the nucleotide sequence level (and provided by the codon bias) in Archaea, while some other factors work in Bacteria. The preference for homodinucleotides in positions 1-2 and 3-1 can be a possible reason for avoiding homodinucleotides in positions 2-3, because heterodinucleotides can contribute to tuning of DNA flexibility. Indeed, it has been shown that, given an average flexibility F of homodinucleotides (either RpR or YpY), the flexibilities of the heterodinucleotides YpR and RpY are 2F and 0.5F, respectively (67). The codon positions 2-3 are most versatile in terms of the relationship between the nucleotide and protein sequences. Therefore, the presence of heterodinucleotides in the codon position 2-3 makes it possible to adjust flexibility of the nucleotide sequence without changing physical–chemical characteristics of the encoded amino acid residue. Additionally, we found that the link between dipeptide and corresponding 3-1 dinucleotide frequencies is determined by the amino acid dipeptide. The dominating bias reads as Met followed by a polar or charged residue.

On the RNA level, simple consideration of the nucleotide composition suggests discriminating the tRNA and rRNA molecules from the mRNA. Indeed, based on the correlation with OGT, we can conclude that the (G+C) content is important for providing stability of tRNA and rRNA molecules, but not for mRNA. The role of the (G+C) content was further confirmed by folding simulations (Table 3). We also found other complementary signals in mRNA sequences related to its structure, stability and function. Archaea and Bacteria yield similar trends in the most and least frequent nucleotide pairs, and the third codon position makes a major contribution to the base-pairing mechanism of stem stabilization. Specifically, in addition to Watson–Crick pairing, the GU wobble pairs sufficiently contribute to the thermal stability of the Archaeal mRNA, in agreement with earlier observations (68,69). Overall, it results in moderate correlation of the segment energy and energy per base pair of the folded mRNA with OGT (Table 5). Folding simulations performed in this work also allowed us to analyze peculiarities of the nucleotide contents and structural contacts in stem and loop regions of folded mRNA. Purine load in loop regions is higher than in stems, and this effect is slightly stronger in Archaea than in Bacteria (Table 6). Purine load also correlates with OGT in loops, but not in stems. This correlation was observed earlier by Forsdyke et al. (21), and it was described as 'polite purine load of the loop regions' that prevents undesired mRNA–mRNA single-strand interactions. The authors concluded that increased purine load can affect the codon choice and, consequently, the amino acid composition (21). Our data [as well as careful analysis of the original data in (21)] do not support the latter claim. There is indeed a correlation between the fraction of


some purine-rich codons [GGR (Gly), AGR (Arg) and GAR (Glu)] and OGT (Table 6). However, the fraction of purine-rich non-synonymous codons AAR (Lys) is not correlated with OGT (Table 6). Moreover, these codons are synonymous and cannot directly affect the amino acid composition. It has been shown earlier (22,70) that an increase of Glu and Arg fractions is a trend in protein thermophilic adaptation. At the same time, purine load remains high after elimination of the codon bias, pointing to the amino acid composition as the most probable cause for the former (22). All the above prompts one to conclude that, in the mutual tuning of nucleotide and amino acid compositions, the purine load does not dominate and determine biases in amino acid composition, if not the opposite. The analysis of purine load shows that it correlates with OGT in both stems and loops (Table 6) in the case of the purine-rich codons GGR (Gly), AGR (Arg) and GAR (Glu). Further, the amount of ApG dinucleotides strongly correlates with OGT in both loops and stems (Table 6), and ApG provides a strong base stacking important for thermal stability of nucleic acid sequences (22,65). Purine load, therefore, is apparently a determinant of the base-stacking mechanism in mRNA and/or DNA thermal stability, as well as a result of the thermophilic amino acid composition trend.

Prokaryotes thrive over a temperature interval spanning more than a hundred degrees, and they represent two major life styles: aerobic and anaerobic. Analysis of complete Archaeal and Bacterial genomes unraveled compositional and sequence signals related to molecular mechanisms of stability and adaptation, unaffected by selective sequencing or by the comparison of orthologs. Overall, codon bias works stronger in Archaea and is mostly utilized in thermophilic adaptation of nucleic acids. It apparently reflects the longer evolutionary history of Archaea, which presumably started close to the origin of life in hot conditions (29). The codon bias and amino acid sequences (dipeptide composition) work in accord to support enrichment of the nucleotide sequences with ApG/CpT dinucleotides, the determinants of the base-stacking mechanism of nucleic acid stability (65–67). We also found that the second codon position reveals a strong link between the nucleotide and amino acid compositions. Specifically, the excess of thymine in this position is a result of a demand on the enrichment of Archaeal proteins with hydrophobic amino acids. The third codon position in Bacteria is the only case where codon bias is detectable already on the level of pure composition. We found that the (G+C)3 load is related to the aerobic life style dominating in Bacteria. From the point of view of thermophilic adaptation, codon bias works against G in the third codon position in both Archaea and Bacteria. It thus supports a specific role of the third codon position in discriminating between adaptations to temperature and aerobic life style. Finally, we obtained an interesting and complex picture of the relationship between the nucleotide composition (purine load) and amino acid composition (selection between Arg and Lys) in relation to thermal adaptation, aerobicity and phylogeny. Specifically, purine load in both Archaea and Bacteria is a result of the 'from both ends of hydrophobicity scale' trend in thermal adaptation of proteins (34), reflected in the IVYWREL predictor of the OGT (22). According to this predictor, Arg is a preferred amino acid for thermal adaptation, though Lys is the next candidate, present in the top most correlated predictors as well (22,43). In particular, selection for Lys over Arg in some species was shown to be important for the entropic mechanism of protein thermostabilization (70). However, the overall preference for Arg over Lys in the case of thermal adaptation is well manifested in the excess of the purine-rich (AGR) codons of Arg at the expense of the purine-rich ones of Lys (AAR). At the same time, discrimination between Arg and Lys works in the opposite direction in aerobicity, where the Arg codons AGA and AGG are suppressed in favor of the purine-rich AAA and AAG of Lys. It thus reflects a preference for Lys over Arg in aerobes compared to anaerobes in particular, and in Bacteria versus Archaea in general, preserving at the same time the high purine load necessary for providing base stacking in the corresponding nucleic acids (22,53,65). An intricate connection between Lys/Arg and their codons in relation to thermal adaptation and aerobicity exemplifies how selection can work on nucleic acids and proteins simultaneously in response to demands of different environments. Obviously, the whole picture of molecular mechanisms of adaptation and relations between them is far from being complete. Consideration of other environmental factors, such as salinity, pressure, etc., will help to unravel new mechanisms of stability and their sequence/structure determinants, and to understand the tradeoffs that Nature embraced en route of evolution and adaptation.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Funding for open access charge: Functional Genomics Program (FUGE II), Norwegian Research Council.

Conflict of interest statement. None declared.

REFERENCES

1. Crick,F. (1970) Central dogma of molecular biology. Nature, 227, 561–563.

2. Tawfik,D.S. (2010) Messy biology and the origins of evolutionary innovations. Nat. Chem. Biol., 6, 692–696.

3. Tokuriki,N. and Tawfik,D.S. (2009) Stability effects of mutations and protein evolvability. Curr. Opin. Struct. Biol., 19, 596–604.

4. Tokuriki,N. and Tawfik,D.S. (2009) Protein dynamism and evolvability. Science, 324, 203–207.

5. Koonin,E.V. (2012) Does the central dogma still stand? Biol. Direct, 7, 27.

6. Pe'er,I., Felder,C.E., Man,O., Silman,I., Sussman,J.L. and Beckmann,J.S. (2004) Proteomic signatures: amino acid and oligopeptide compositions differentiate among phyla. Proteins, 54, 20–40.

7. Aravind,L., Tatusov,R.L., Wolf,Y.I., Walker,D.R. and Koonin,E.V. (1998) Evidence for massive gene exchange between archaeal and bacterial hyperthermophiles. Trends Genet., 14, 442–444.

8. Khachane,A.N., Timmis,K.N. and dos Santos,V.A. (2005) Uracil content of 16S rRNA of thermophilic and psychrophilic prokaryotes correlates inversely with their optimal growth temperatures. Nucleic Acids Res., 33, 4016–4022.

9. Suhre,K. and Claverie,J.M. (2003) Genomic correlates of hyperthermostability, an update. J. Biol. Chem., 278, 17198–17202.

10. Tekaia,F. and Yeramian,E. (2006) Evolution of proteomes: fundamental signatures and global trends in amino acid compositions. BMC Genom., 7, 307.

11. Wang,H.C. and Hickey,D.A. (2002) Evidence for strong selective constraint acting on the nucleotide composition of 16S ribosomal RNA genes. Nucleic Acids Res., 30, 2501–2507.

12. Friedman,R., Drake,J.W. and Hughes,A.L. (2004) Genome-wide patterns of nucleotide substitution reveal stringent functional constraints on the protein sequences of thermophiles. Genetics, 167, 1507–1512.

13. Lynn,D.J., Singer,G.A. and Hickey,D.A. (2002) Synonymous codon usage is subject to selection in thermophilic bacteria. Nucleic Acids Res., 30, 4272–4277.

14. Singer,G.A. and Hickey,D.A. (2000) Nucleotide bias causes a genomewide bias in the amino acid composition of proteins. Mol. Biol. Evol., 17, 1581–1588.

15. Singer,G.A. and Hickey,D.A. (2003) Thermophilic prokaryotes have characteristic patterns of codon usage, amino acid composition and nucleotide content. Gene, 317, 39–47.

16. Tekaia,F., Yeramian,E. and Dujon,B. (2002) Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene, 297, 51–60.

17. Roy Chowdhury,A. and Dutta,C. (2012) A pursuit of lineage-specific and niche-specific proteome features in the world of archaea. BMC Genom., 13, 236.

18. Wang,H.C., Susko,E. and Roger,A.J. (2006) On the correlation between genomic G+C content and optimal growth temperature in prokaryotes: data quality and confounding factors. Biochem. Biophys. Res. Commun., 342, 681–684.

19. Wu,H., Zhang,Z., Hu,S. and Yu,J. (2012) On the molecular mechanism of GC content variation among eubacterial genomes. Biol. Direct, 7, 2.

20. Knight,R.D., Freeland,S.J. and Landweber,L.F. (2001) A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes. Genome Biol., 2, RESEARCH0010.

21. Lao,P.J. and Forsdyke,D.R. (2000) Thermophilic bacteria strictly obey Szybalski's transcription direction rule and politely purine-load RNAs with both adenine and guanine. Genome Res., 10, 228–236.

22. Zeldovich,K.B., Berezovsky,I.N. and Shakhnovich,E.I. (2007) Protein and DNA sequence determinants of thermophilic adaptation. PLoS Comput. Biol., 3, e5.

23. Kreil,D.P. and Ouzounis,C.A. (2001) Identification of thermophilic species by the amino acid compositions deduced from their genomes. Nucleic Acids Res., 29, 1608–1615.

24. Berezovsky,I.N., Tumanyan,V.G. and Esipova,N.G. (1997) Representation of amino acid sequences in terms of interaction energy in protein globules. FEBS Lett., 418, 43–46.

25. Cambillau,C. and Claverie,J.M. (2000) Structural and genomic correlates of hyperthermostability. J. Biol. Chem., 275, 32383–32386.

26. Greaves,R.B. and Warwicker,J. (2007) Mechanisms for stabilisation and the maintenance of solubility in proteins from thermophiles. BMC Struct. Biol., 7, 18.

27. Jaenicke,R. (1999) Stability and folding of domain proteins. Prog. Biophys. Mol. Biol., 71, 155–241.

28. Jaenicke,R. and Bohm,G. (1998) The stability of proteins in extreme environments. Curr. Opin. Struct. Biol., 8, 738–748.

29. Berezovsky,I.N. and Shakhnovich,E.I. (2005) Physics and evolution of thermophilic adaptation. Proc. Natl Acad. Sci. USA, 102, 12742–12747.

30. Chakravarty,S. and Varadarajan,R. (2002) Elucidation of factors responsible for enhanced thermal stability of proteins: a structural genomics based study. Biochemistry, 41, 8152–8161.

31. Glyakina,A.V., Garbuzynskiy,S.O., Lobanov,M.Y. and Galzitskaya,O.V. (2007) Different packing of external residues can explain differences in the thermostability of proteins from thermophilic and mesophilic organisms. Bioinformatics, 23, 2231–2238.

32. Thompson,M.J. and Eisenberg,D. (1999) Transproteomic evidence of a loop-deletion mechanism for enhancing protein thermostability. J. Mol. Biol., 290, 595–604.

33. Tokuriki,N., Oldfield,C.J., Uversky,V.N., Berezovsky,I.N. and Tawfik,D.S. (2009) Do viral proteins possess unique biophysical features? Trends Biochem. Sci., 34, 53–59.

34. Berezovsky,I.N., Zeldovich,K.B. and Shakhnovich,E.I. (2007) Positive and negative design in stability and thermal adaptation of natural proteins. PLoS Comput. Biol., 3, e52.

35. Bharanidharan,D., Bhargavi,G.R., Uthanumallian,K. and Gautham,N. (2004) Correlations between nucleotide frequencies and amino acid composition in 115 bacterial species. Biochem. Biophys. Res. Commun., 315, 1097–1103.

36. Nakashima,H., Fukuchi,S. and Nishikawa,K. (2003) Compositional changes in RNA, DNA and proteins for bacterial adaptation to higher and lower temperatures. J. Biochem. (Tokyo), 133, 507–513.

37. Dehouck,Y., Folch,B. and Rooman,M. (2008) Revisiting the correlation between proteins' thermoresistance and organisms' thermophilicity. Protein Eng. Des. Sel., 21, 275–278.

38. Folch,B., Dehouck,Y. and Rooman,M. (2010) Thermo- and mesostabilizing protein interactions identified by temperature-dependent statistical potentials. Biophys. J., 98, 667–677.

39. Folch,B., Rooman,M. and Dehouck,Y. (2008) Thermostability of salt bridges versus hydrophobic interactions in proteins probed by statistical potentials. J. Chem. Inf. Model., 48, 119–127.

40. Gonnelli,G., Rooman,M. and Dehouck,Y. (2012) Structure-based mutant stability predictions on proteins of unknown structure. J. Biotechnol., 161, 287–293.

41. Ponnuswamy,P., Muthusamy,R. and Manavalan,P. (1982) Amino acid composition and thermal stability of globular proteins. Int. J. Biol. Macromol., 4, 186–190.

42. Berezovsky,I.N. (2011) The diversity of physical forces and mechanisms in intermolecular interactions. Phys. Biol., 8, 035002.

43. Ma,B.G., Goncearenco,A. and Berezovsky,I.N. (2010) Thermophilic adaptation of protein complexes inferred from proteomic homology modeling. Structure, 18, 819–828.

44. Makarova,K.S. and Koonin,E.V. (2005) Evolutionary and functional genomics of the Archaea. Curr. Opin. Microbiol., 8, 586–594.

45. Novichkov,P.S., Wolf,Y.I., Dubchak,I. and Koonin,E.V. (2009) Trends in prokaryotic evolution revealed by comparison of closely related bacterial and archaeal genomes. J. Bacteriol., 191, 65–73.

46. Koonin,E.V. and Wolf,Y.I. (2008) Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res., 36, 6688–6719.

47. Koonin,E.V., Mushegian,A.R., Galperin,M.Y. and Walker,D.R. (1997) Comparison of archaeal and bacterial genomes: computer analysis of protein sequences predicts novel functions and suggests a chimeric origin for the archaea. Mol. Microbiol., 25, 619–637.

48. Katz,L. and Burge,C.B. (2003) Widespread selection for local RNA secondary structure in coding regions of bacterial genes. Genome Res., 13, 2042–2051.

49. Hofacker,I.L., Priwitzer,B. and Stadler,P.F. (2004) Prediction of locally stable RNA secondary structures for genome-wide surveys. Bioinformatics, 20, 186–190.

50. Zuker,M. and Stiegler,P. (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res., 9, 133–148.

51. Mathews,D.H., Sabina,J., Zuker,M. and Turner,D.H. (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol., 288, 911–940.

52 ShabalinaSA OgurtsovAY and SpiridonovNA (2006) Aperiodic pattern of mRNA secondary structure created by thegenetic code Nucleic Acids Res 34 2428ndash2437

53 MarmurJ and DotyP (1962) Determination of the basecomposition of deoxyribonucleic acid from its thermaldenaturation temperature J Mol Biol 5 109ndash118

Nucleic Acids Research 2014 Vol 42 No 5 2891


54. Hurst LD and Merchant AR (2001) High guanine-cytosine content is not an adaptation to high temperature: a comparative analysis amongst prokaryotes. Proc Biol Sci 268, 493–497.

55. Naya H, Romero H, Zavala A, Alvarez B and Musto H (2002) Aerobiosis increases the genomic guanine plus cytosine content (GC%) in prokaryotes. J Mol Evol 55, 260–264.

56. Beckman KB and Ames BN (1997) Oxidative decay of DNA. J Biol Chem 272, 19633–19636.

57. Cheng KC, Cahill DS, Kasai H, Nishimura S and Loeb LA (1992) 8-Hydroxyguanine, an abundant form of oxidative DNA damage, causes G→T and A→C substitutions. J Biol Chem 267, 166–172.

58. Vieira-Silva S and Rocha EP (2008) An assessment of the impacts of molecular oxygen on the evolution of proteomes. Mol Biol Evol 25, 1931–1942.

59. Galtier N and Lobry JR (1997) Relationships between genomic G+C content, RNA secondary structures, and optimal growth temperature in prokaryotes. J Mol Evol 44, 632–636.

60. Musto H, Naya H, Zavala A, Romero H, Alvarez-Valin F and Bernardi G (2004) Correlations between genomic GC levels and optimal growth temperatures in prokaryotes. FEBS Lett 573, 73–77.

61. Fitch WM (1974) The large extent of putative secondary nucleic acid structure in random nucleotide sequences or amino acid derived messenger-RNA. J Mol Evol 3, 279–291.

62. Workman C and Krogh A (1999) No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. Nucleic Acids Res 27, 4816–4822.

63. Bachmair A, Finley D and Varshavsky A (1986) In vivo half-life of a protein is a function of its amino-terminal residue. Science 234, 179–186.

64. Berezovsky IN, Namiot VA, Tumanyan VG and Esipova NG (1999) Hierarchy of the interaction energy distribution in the spatial structure of globular proteins and the problem of domain definition. J Biomol Struct Dyn 17, 133–155.

65. Yakovchuk P, Protozanova E and Frank-Kamenetskii MD (2006) Base-stacking and base-pairing contributions into thermal stability of the DNA double helix. Nucleic Acids Res 34, 564–574.

66. Friedman RA and Honig B (1995) A free energy analysis of nucleic acid base stacking in aqueous solution. Biophys J 69, 1528–1535.

67. Okonogi TM, Alley SC, Reese AW, Hopkins PB and Robinson BH (2002) Sequence-dependent dynamics of duplex DNA: the applicability of a dinucleotide model. Biophys J 83, 3446–3459.

68. Masquida B and Westhof E (2000) On the wobble GoU and related pairs. RNA 6, 9–15.

69. Varani G and McClain WH (2000) The G x U wobble base pair. A fundamental building block of RNA structure crucial to RNA function in diverse biological systems. EMBO Rep 1, 18–23.

70. Berezovsky IN, Chen WW, Choi PJ and Shakhnovich EI (2005) Entropic stabilization of proteins and its proteomic consequences. PLoS Comput Biol 1, e47.


as well as possible effects of nucleotide composition on the amino acid one (12–16), have been discussed in a number of works (8,10,13,16–20). The (A+G) content, or so-called purine load (11,21,22), and the (G+C) content (8,10,13,16–20) were suggested as signatures of thermal adaptation in prokaryotes (21,23). It has been shown that the increase of the purine load in coding DNA is to a large extent a result of the thermal adaptation of protein sequences (22) and a signal of stabilizing stacking interactions between purine bases in rRNA (11,21,22). The origin and role of the (G+C) content is a topic of special interest. Specifically, it has been claimed that it is essentially governed by genome replication and DNA-repair mechanisms (19), is involved in lineage- and niche-specific molecular strategies of adaptation (17), and drives codon usage (20) and even amino acid composition (12–14,16).

In the case of protein structure, distinct stabilizing interactions (24–28), their structural determinants (24,29–33) and the contributions of different amino acid residues (22,24,25,29,34–36) have been studied extensively (37–40). However, until recently all studies focused on a few proteins or small sets of them, on individual stabilizing interactions, or on anecdotal cases of organisms thriving under normal or extreme temperatures. The first combined predictor of thermostability was proposed by Ponnuswamy et al. (41), and it was finally exhaustively enumerated for monomeric proteins by Zeldovich et al. (22) and for protein complexes by Berezovsky (42) and Ma et al. (43).

Rapid growth of genomic data allows one to tackle topics that seemed unreachable a few years ago. Here we compare the mechanisms of molecular adaptation in the Archaeal and Bacterial domains of life. Profound knowledge on the phylogeny, metabolism and evolutionary peculiarities of Archaea (44,45) in comparison with Bacteria has been accumulated (46,47). Both Archaea and Bacteria are unicellular prokaryotic organisms, and their macromolecules are under the immediate influence of the environment. This makes a comparative study of Archaeal and Bacterial compositional biases and sequence peculiarities an ideal framework for studying mechanisms of adaptation at the molecular level. We analyze complete sets of their coding DNA, mRNA, tRNA, rRNA, non-coding DNA (ncDNA) and protein sequences in order to find generic trends associated with mechanisms of their adaptation, as well as differences between these trends in Archaea and Bacteria. We use 244 Archaeal and Bacterial complete genomes of species with optimal growth temperatures (OGT) spanning from 8°C to 100°C and representing aerobic and anaerobic life styles.

MATERIALS AND METHODS

Datasets

Complete sets of coding sequences for 244 prokaryotic organisms were downloaded from NCBI RefSeq and GenBank. We obtained the data on OGT and aerobicity. NCBI RefSeq and GenBank contain many more organisms than were used here; however, data on their OGT and aerobicity are scarce. Therefore, the size of the dataset in this study was determined by the availability of both OGT and aerobicity data for the genomes and by the demand for good coverage of the whole temperature interval. Jackknife tests performed in earlier works (22,43) showed that the number of genomes/proteomes in the dataset is sufficient for obtaining unbiased and reliable conclusions, which will persist in future analyses with an extended set of organisms. We classified the genomes according to their domain of life (Archaea and Bacteria), temperature (psychrophiles, OGT < 24°C; mesophiles, 24°C ≤ OGT < 50°C; thermophiles, 50°C ≤ OGT < 80°C; hyperthermophiles, 80°C ≤ OGT) and oxygen tolerance (aerobic, anaerobic, facultative and microaerophilic). In calculations of correlations with OGT (and in the corresponding compositional analysis) we excluded organisms with OGT of 26°C, 30°C and 37°C (116 genomes), since it has been previously shown that they are represented mainly by parasites and symbionts possessing traits unrelated to temperature adaptation (22). The compositional analysis of aerobicity was performed on the set of 244 genomes, 146 of which are classified as either aerobic or anaerobic. Data originating from NCBI RefSeq and GenBank were processed with Python programs and imported into a PostgreSQL database with constraints for additional control of data integrity. Molecular features were calculated with Python programs and stored in the database. R scripts were used for statistical analysis. The database is freely accessible for download at http://folk.uib.no/agoncear
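The temperature classes and the exclusion rule described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names are hypothetical, and the lower bound of each class is assumed to be inclusive.

```python
# Hypothetical helper names; thresholds follow the class definitions in the text.
# Assumption: the lower bound of each temperature class is inclusive.
EXCLUDED_OGT = {26, 30, 37}  # OGT values dominated by parasites and symbionts


def thermal_class(ogt_celsius):
    """Return the thermal class for an optimal growth temperature in °C."""
    if ogt_celsius < 24:
        return "psychrophile"
    if ogt_celsius < 50:
        return "mesophile"
    if ogt_celsius < 80:
        return "thermophile"
    return "hyperthermophile"


def usable_for_ogt_correlation(ogt_celsius):
    """Genomes at the excluded OGT values are left out of OGT correlations."""
    return ogt_celsius not in EXCLUDED_OGT
```

For example, `thermal_class(85)` gives `"hyperthermophile"`, while a genome with an OGT of 37°C is still classified as a mesophile but skipped when correlations with OGT are computed.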

DNA/RNA analysis

We separated the DNA and amino acid sequences of protein-coding genes, the nucleic acid sequences of tRNA- and rRNA-coding genes, and the ncDNA sequences from the intergenic regions. We generated DNA sequences with unbiased codon usage (NCB, non-codon-bias) by choosing a codon for each amino acid uniformly from all possible codons. Dinucleotide and nucleotide compositions are not preserved in NCB sequences. We reshuffled codons in the DNA sequence (Shfld) by choosing a synonymous codon for each amino acid from the list of possible codons weighted by their genomic codon usage, hence keeping the amino acid sequence intact. This procedure preserves the positional nucleotide composition and the positional dinucleotide composition for positions 1-2 and 2-3 but destroys the natural frequency of 3-1 dinucleotides. The dicodon-shuffle program (48) was used to reshuffle dicodons (dShfld sequences). This procedure is applied gene-wise. It preserves the amino acid sequence and the positional nucleotide and dinucleotide frequencies but destroys the natural mRNA sequence. We analyzed different phases in double-stranded RNA stems: Phases I, II and III mean that the respective codon positions 1, 2 and 3 in the sense and anti-sense strands are complementary.
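A minimal sketch of the two codon-level randomizations described above (NCB and Shfld) is given below. It is not the authors' implementation: the function names and the `codon_usage` dictionary are hypothetical, the standard codon assignments are assumed, and stop codons are treated like any other synonymous family.

```python
import random
from collections import defaultdict
from itertools import product

# Standard codon assignments, written in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

SYNONYMS = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    SYNONYMS[aa].append(codon)


def split_codons(cds):
    """Split a coding sequence into codons, ignoring a trailing partial codon."""
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def translate(cds):
    return "".join(CODON_TABLE[c] for c in split_codons(cds))


def ncb_sequence(cds, rng):
    """NCB: draw each codon uniformly from all codons of the same amino acid."""
    return "".join(rng.choice(SYNONYMS[CODON_TABLE[c]]) for c in split_codons(cds))


def shuffled_sequence(cds, codon_usage, rng):
    """Shfld: draw each synonymous codon weighted by genomic codon usage."""
    out = []
    for c in split_codons(cds):
        options = SYNONYMS[CODON_TABLE[c]]
        weights = [codon_usage.get(o, 0) for o in options]
        if sum(weights) == 0:  # no usage data for this family: fall back to uniform
            weights = [1] * len(options)
        out.append(rng.choices(options, weights=weights, k=1)[0])
    return "".join(out)
```

Both functions change only synonymous codons, so `translate` returns the same protein for the input and for either randomized sequence. Passing an explicit `random.Random` instance keeps the randomization reproducible.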

We analyzed the nucleotide composition in natural, shuffled and dShfld sequences and in sequences without the codon bias, calculating genomic averages for tRNA, rRNA and ncDNA regions and for each codon position separately in protein-coding DNA. We grouped genomes based on the domain of life, oxygen tolerance and environmental temperature factors. For each group of genomes we calculated weighted averages of the


genomic compositions with the weight proportional to genome size, so that each group is represented as one meta-genome. We report the weighted averages in Supplementary File S1 and show selected values in Table 1. The standard deviations are reported in Supplementary Files S5 and S8 along with the P-values for OGT correlations. P-values for comparisons between the groups reported in the text are calculated using a two-sample weighted t-test (Student's t-test with the Welch correction to the degrees of freedom, which is the standard procedure in R for unequal variances):

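\[
df_{\mathrm{Welch}} = \frac{\left( \dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2} \right)^{2}}{\dfrac{\left( s_1^2 / n_1 \right)^{2}}{n_1 - 1} + \dfrac{\left( s_2^2 / n_2 \right)^{2}}{n_2 - 1}}
\]

where $s_i^2$ and $n_i$ are the sample variance and the number of observations in group $i$, respectively.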
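As a worked illustration of the two quantities used above, the following sketch computes a genome-size-weighted average and the Welch degrees of freedom. The function names are hypothetical, and the variances passed to `welch_df` are assumed to be already computed (the paper's weighting of the variances is not reproduced here).

```python
def weighted_mean(values, weights):
    """Average of per-genome values weighted, for example, by genome size."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


def welch_df(var1, n1, var2, n2):
    """Welch-Satterthwaite degrees of freedom for two groups.

    var1, var2: sample variances; n1, n2: numbers of observations.
    """
    a = var1 / n1
    b = var2 / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
```

With equal variances and equal group sizes the formula reduces to n1 + n2 - 2, which is a convenient sanity check: `welch_df(4.0, 10, 4.0, 10)` is 18 up to floating-point rounding.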

Supplementary File S6 describes the tests for (G+C)3 composition in connection to oxygen tolerance. In Supplementary Files S7 and S9, one-sample weighted t-test P-values are reported for the comparisons of nucleotide compositions with 0.25 and 0.5 for nucleotides and nucleotide combinations, respectively. The significance of dinucleotide contrasts (DCT) is assessed by comparing the weighted DCT averages to 1, and it is also reported in the same supplementary files.

RNA folding in silico

The 'RNAfold' program (49) in the 'Vienna RNA Package' was used for performing the RNA folding. Native and randomized sequences were folded using Zuker's energy minimization algorithm (50), which determines the folding free energy for the most favorable conformation from a vast number of possible simulated structures. Calculations were performed with default parameters and a temperature setting of 37°C for all organisms. The latter allowed us to detect the effect of the organismal nucleotide composition on the mRNA structure and stability. Sequences in a sliding window of 50 bases were folded (51) and their characteristics were calculated. We also checked other sliding windows (100 and 150 bases), observing that they give qualitatively similar outputs. The choice of the 50-base window is justified by previous knowledge that most known functionally important secondary structures are small and local (48), and structures >50 bases would not normally be formed in actively translating mRNAs (48). We used the dicodon-shuffle program by Katz and Burge (48) to obtain control sets of reshuffled mRNA sequences. The base-pairing pattern [base pair frequencies at the three Phases (52) of the mRNA sequence], the folding energy of folded mRNAs and their correlations with the OGT of the corresponding organisms were examined. The comparison of these features between Archaea and Bacteria was also performed. To purify the signal, i.e. to focus exclusively on the effect of the local mRNA structure, each quantity is also represented as a ratio between the natural sequence signal and the signal from dicodon-shuffled sequences (dShfld). The purine load in the loop and stem regions of the predicted mRNA secondary structures was also analyzed (21).
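The windowing and the natural-to-shuffled ratio described above can be sketched as follows. Everything here is an assumption made for illustration: the function names are hypothetical, the window step of one base is not stated in the text, and `fold_energy` stands for any callable that returns a folding free energy for a window (in practice this would wrap an RNA folding program; that wiring is not shown).

```python
def sliding_windows(seq, size=50, step=1):
    """Yield consecutive windows of `size` bases; `step` is an assumed parameter."""
    for start in range(0, len(seq) - size + 1, step):
        yield seq[start:start + size]


def mean_window_energy(seq, fold_energy, size=50, step=1):
    """Average the folding energy over all windows of a sequence."""
    energies = [fold_energy(w) for w in sliding_windows(seq, size, step)]
    return sum(energies) / len(energies) if energies else None


def natural_to_shuffled_ratio(natural, shuffled_controls, fold_energy, size=50):
    """Ratio of the natural signal to the mean signal of shuffled controls."""
    nat = mean_window_energy(natural, fold_energy, size)
    ctrl = [mean_window_energy(s, fold_energy, size) for s in shuffled_controls]
    return nat / (sum(ctrl) / len(ctrl))
```

Keeping the folding step as a pluggable callable makes it straightforward to repeat the calculation with other window sizes, such as the 100 and 150 bases mentioned in the text.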

Amino acid and dipeptide composition and OGT correlation

Amino acid Z-scored predictors of the OGT (43) were derived for Archaeal and Bacterial proteomes. Additionally, we analyzed groups of amino acids according to their physical-chemical properties [charged (D, E, K, R), hydrophobic (C, F, I, L, M, P, V, W) and polar (A, G, H, N, Q, S, T, Y)]. The dipeptide classes for the above residue types and their correlations with OGT were examined separately for Archaea and Bacteria. The dipeptide frequencies were normalized by the individual amino acid frequencies in order to exclude the compositional bias.

RESULTS

In order to understand the compositional and sequence signals of evolution and adaptation of DNA, RNA and protein macromolecules, we pursue the following strategy in the analysis. First, wherever possible, we single out compositional biases existing in these molecules. Then we establish possible connections between the detected biases and mechanisms of adaptation. Specifically, we discuss nucleotide, dinucleotide, amino acid and dipeptide biases, their correlations with different environmental factors, and their potential role in determining and tuning molecular mechanisms of stability and adaptation in the corresponding macromolecules. We also seek to understand how biases in one type of molecule can affect, or can be affected by, the demands on the sequences/structures of others. Finally, we explore causal relationships between them in light of their evolutionary history and phylogeny. To this end, we analyze the difference in compositional/sequence biases between Archaea and Bacteria in conjunction with their mechanisms of adaptation.

Nucleotide compositions of DNA and RNA

ncDNA has a significantly higher adenine and thymine (A+T) content in Archaea (Table 2 and Supplementary File S9), hinting at a role of nucleotide composition in discriminating coding DNA and ncDNA in Archaea; the same mechanism exists in eukaryotes. Deviations of nucleotide contents from even fractions differ between Archaea and Bacteria: the A and C contents deviate significantly in Archaea, whereas in Bacteria the composition of T and G nucleotides is skewed (Table 2). There is an increased (G+C) load in t- and rRNA of Archaea and tRNA of Bacteria. Bacterial rRNA yields a higher guanine load. Both guanine and cytosine loads of t- and rRNA correlate with OGT, and the correlation is stronger in Archaea than in Bacteria (Table 2; correlation coefficients and their significance levels are given in parentheses, P-values are in Supplementary File S8). The role of the (G+C) content in thermal adaptation of t- and rRNAs is further corroborated by folding simulations. The (G+C)


Table 1 Compositional and sequence signals in coding and ncDNA

ARCHAEA BACTERIA

Characteristic Value Characteristic Value

A Correlations between nucleotide compositions and OGT

Codon position 1ANatNCB R=069 ANatNCB R=028+

(A+T)NatNCB R=068 (A+T)NatNCB R=029(A+G)NatNCB R=071 (A+G)NatNCB R=034(A+C)NatNCB R=071 (A+C)NatNCB R=036

Codon position 2TNat R=055 TNat R=037TNCB R=055 TNCB R=037GNatNCB R=060 (T+G)Nat R=036(T+G)Nat R=072 (T+G)NCB R=042(T+G)NCB R=068(G+C)NatNCB R=060

Codon position 3ANat R=002 ANat R=013ANCB R=070 ANCB R=076GNat R=022 (A+G)Nat R=060GNCB R=055 (A+G)NCB R=043(A+G)Nat R=056 (A+G)NatNCB R=011(A+G)NCB R=067

B Position-independent dinucleotide contrasts

ncDNAApATpT 115 ApATpT 126CpCGpG 124 GpC 125GpTApC 082

cDNAApA 125ApANCB 106TpT 120TpTNCB 113GpC 124GpCNCB 107

tRNAApA 132 ApG 128CpC 126 TpC 135TpC 126ApC 052TpG 070

rRNAApA 125 ApA 115CpC 127 CpC 119

C Correlations of position-independent dinucleotide contrasts (DCT) with OGT

ncDNACpT R=063 GpG R=049ApG R=064

cDNACpTNat R=066 GpGNat R=040CpTNCB R=057 GpGNCB R=009CpTShfld R=057 GpGShfld R=032ApGNat R=080ApGNCB R=059ApGShfld R=078

tRNAApA R=082 ApA R=045TpA R=072TpT R=080


GpG R=044CpC R=080

rRNAApA R=092 ApA R=067TpA R=088 CpC R=052GpG R=072CpC R=083

D Position-dependent dinucleotide composition ratios (Freq) dinucleotide contrasts and their OGT correlations

Codon position 1-2Freq(ApG)NatNCB 135na Freq(CpG)NatNCB 127na

Freq(CpG)NatNCB 062na

ApG 102 TpT 145ApGNCB 079 TpTNCB 149CpG 076 TpA 069CpGNCB 117 TpANCB 065

GpT 067GpTNCB 067

ApG R=077ApGNCB R=011GpC R=055GpCNCB R=044CpC R=037+

CpCNCB R=037+

ApGNatNCB R=063GpCNatNCB R=060CpCNatNCB R=037+

RpR R=040 RpR R=033RpRNCB R=015 RpRNCB R=027+

YpY R=034 YpY R=039YpYNCB R=005+ YpYNCB R=033RpRNatNCB R=059 RpRNatNCB R=037YpYNatNCB R=064 YpYNatNCB R=045

Codon position 2-3TpA R=063 ApA 150TpANCB R=028 ApANCB 105CpT R=061 GpC 133CpTNCB R=068 GpCNCB 098ApG R=069 ApCNatNCB R=052ApGNCB R=037+ GpANatNCB R=041

RpR R=020 RpR R=051RpRNCB R=065 RpRNCB R=073YpY R=028 YpY R=058YpYNCB R=066 YpYNCB R=075

RpRNatNCB R=039YpYNatNCB R=043

Codon position 3-1CpT R=060CpTShfld R=005ApG R=053ApGShfld R=002

RpR R=030+ RpR R=048RpRShfld R=020 RpRShfld R=001YpY R=035+ YpY R=056YpYShfld R=006 YpYShfld R=017

Pearson correlation coefficient is denoted by R. DCT is calculated as the ratio of the dinucleotide frequency to the product of the frequencies of the corresponding independent nucleotides. Nucleotides with purine (A or G) and pyrimidine (T or C) bases are denoted by R and Y, respectively. The lower index distinguishes the values observed for natural sequences (Nat), sequences with eliminated codon bias (NCB) and values observed after shuffling of amino acid sequences (Shfld). If the lower index is omitted, the value is given for the natural sequences. The P-values for correlations and for the dinucleotide contrast t-tests (H0: DCT is 1.0) are shown in superscripts as significance levels: +P-value < 0.05, < 0.01, < 0.0001. Supplementary Files S5 and S7 list all P-values for position-specific correlations, SD for compositions and the P-values for t-tests. Supplementary Files S8 and S9 show the P-values for position-independent contrast t-tests and correlations.


content increases and the energy per base pair decreases with temperature in stems of folded t- and rRNAs (Table 3). These data point to base-pairing interactions as an important contributor to thermal stabilization (53) of t- and rRNAs of prokaryotes, and this effect is more strongly manifested in Archaea than in Bacteria.
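The kind of calculation behind these correlations can be illustrated with a short sketch that computes the (G+C) content of a sequence and the Pearson correlation coefficient between two series, such as per-genome (G+C) content and OGT. The function names are hypothetical and no data from the paper are reproduced here.

```python
from math import sqrt


def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)
```

In use, one would collect `gc_content` for the rRNA (or tRNA) genes of each genome and pass that list together with the corresponding OGT values to `pearson_r`; a positive coefficient indicates that the (G+C) content rises with growth temperature.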

Dinucleotide composition of nucleic acid sequences

Dinucleotide contrasts (DCTs) show the ratio of observed dinucleotide frequencies to the expected ones given the composition of individual nucleotides. Coupling of the same nucleotides (Table 1) is preferred in ncDNA sequences in Archaea (ApA/TpT and CpC/GpG) and Bacteria (ApA/TpT). ApA and CpC dinucleotides are prevalent in t- and rRNA of Archaea and rRNA of Bacteria. Other outstanding contrasts in Archaea are found in ncDNA (GpT/ApC) and tRNA (TpC, ApC and TpG). In Bacteria, ApG and TpC are prevalent in tRNA (Table 1), while GpC is preferred in both non-coding and coding sequences. In coding sequences of Archaea there is no preference for coupling of similar nucleotides, while in Bacteria it is found for ApA and TpT (Table 1).

In both coding and ncDNA sequences of Archaea there is a clear excess of complementary ApG and CpT dinucleotides, which persists after elimination of the codon bias and reshuffling of the amino acid sequence (Table 1). In Bacteria, the excess of GpG dinucleotides in non-coding and coding sequences is most pronounced (Table 1), provided by the codon bias in coding sequences. Pairing of similar nucleotides is highly correlated with OGT in tRNA (ApA, TpT, GpG and CpC) and rRNA (ApA, GpG and CpC) in Archaea. Correlation of the same nucleotide pairs is also found in Bacteria, though for fewer pairs and more weakly (tRNA: ApA; rRNA: ApA and CpC). In Archaeal t- and rRNA there is also a strong correlation of TpA pairs with OGT (Table 1).
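The dinucleotide contrast defined at the start of this section, observed dinucleotide frequency divided by the product of the two mononucleotide frequencies, can be computed as in the following sketch. The function name is hypothetical, and overlapping dinucleotides on a single strand are assumed.

```python
from collections import Counter


def dinucleotide_contrasts(seq):
    """Return DCT = f(XpY) / (f(X) * f(Y)) for every dinucleotide observed in seq."""
    seq = seq.upper()
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono = len(seq)
    n_di = len(seq) - 1
    return {
        pair: (count / n_di) / ((mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono))
        for pair, count in di.items()
    }
```

A value above 1 means the dinucleotide occurs more often than expected from the base composition alone, and a value below 1 means it is avoided. For the toy sequence `"AGAGAGAG"`, ApG has a contrast of 16/7 because it fills four of the seven dinucleotide slots while A and G each make up half of the bases.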

Correlation between nucleotide frequencies and temperature in different codon positions

The first codon position in Archaea is characterized by a high correlation with OGT of the ratio of natural to NCB frequencies of adenine [A_Nat/NCB, R = 0.69; Table 1]. The second codon position reveals a correlation of the thymine frequency with OGT (Table 1). The combination of thymine and guanine nucleotides is also correlated with OGT; however, thymine is the major contributor (Table 1). The correlation of the guanine content with OGT is supported by the codon bias (Table 1). The combination of guanine with cytosine is also correlated with OGT in the second position of Archaeal sequences and is provided by the codon bias. The third position reveals strong selection against adenine and guanine in relation to OGT (Table 1). Bringing together the (anti-)correlations with OGT in different codon positions, one can draw the combined triplet that is optimal from the point of view of thermal adaptation (Table 4) as [A]1 [T/G]2 [non-(A/G)]3. The prevalence of thymine in the second codon position is linked to an excess of hydrophobic residues. Indeed, codons with thymine in the second position encode Ile, Leu, Met, Phe and Val. These are strongly hydrophobic residues, including the aromatic Phe, which can form many van der Waals contacts and contribute

Table 2. Nucleotide compositions and their OGT correlations in DNA and RNAs

Domain of life | Nucleotide | cDNA_Nat | cDNA_NCB | ncDNA | tRNA (R) | rRNA (R)
Archaea | A | 28.45 | 27.94 | 30.73 | 17.13 (−0.76) | 23.68 (−0.85)
Archaea | T | 23.89 | 24.44 | 30.65 | 18.17 (−0.89) | 19.24 (−0.88)
Archaea | G | 26.00 | 26.12 | 19.30 | 33.91 (0.84) | 32.08 (0.94)
Archaea | C | 21.66 | 21.51 | 19.31 | 30.80 (0.84) | 24.99 (0.76)
Bacteria | A | 24.05 | 26.42 | 26.57 | 19.66 (−0.44) | 26.09 (−0.52)
Bacteria | T | 22.68 | 23.70 | 26.61 | 21.56 (−0.51) | 20.68 (−0.77)
Bacteria | G | 27.50 | 26.66 | 23.41 | 31.21 (0.53) | 31.10 (0.72)
Bacteria | C | 25.77 | 23.23 | 23.41 | 27.58 (0.39) | 22.13 (0.60)

The numbers represent the average frequencies (%) of nucleotides in the corresponding parts of genomes, while the numbers in parentheses are correlation coefficients (R) of nucleotide frequencies with OGT. The most important biases and correlations are shown in bold font. The P-values for correlations (H0: correlation coefficient R = 0) and for the nucleotide composition t-tests (H0: mean frequency is 0.25) are shown in superscripts as significance levels: P-value < 0.01, < 0.0001. Supplementary Files S8 and S9 list all correlations and composition tests. cDNA_Nat, natural nucleotide composition in coding DNA; cDNA_NCB, nucleotide composition in coding sequences with eliminated codon bias.

Table 3. OGT correlations in r- and tRNA observed in folding simulations

Domain of life | RNA type | R((G+C), OGT) (P-value) | R(<Ebp>, OGT) (P-value)
Archaea | rRNA | 0.89 (<10^−22) | −0.93 (1.1 × 10^−20)
Archaea | tRNA | 0.84 (1.84 × 10^−13) | −0.71 (3.34 × 10^−8)
Bacteria | rRNA | 0.73 (1.38 × 10^−14) | −0.66 (1 × 10^−11)
Bacteria | tRNA | 0.53 (3.1 × 10^−7) | −0.51 (9.1 × 10^−7)

R((G+C), OGT), correlation coefficient between the (G+C) content and the OGT; R(<Ebp>, OGT), correlation coefficient between the free energy of RNA folding averaged per base pair and the OGT.


thus to the packing of the hydrophobic core. Notably, elimination of the codon bias does not affect the correlation of the thymine fraction with OGT (R = 0.55 in both native and NCB sequences). This shows that the excess of thymine in the second codon position is a result of selection at the protein level. An apparent explanation for such selection is the domination of the structure-based strategy in thermostabilization of Archaeal proteins. This strategy is characterized by increased compactness of the hydrophobic core, provided by massive van der Waals contacts. The correlation of the combination of A and G with OGT is not provided by tuning of the codon bias, which points to adaptation at a level other than DNA.

Bacterial coding DNA sequences yield fewer position-dependent correlations with OGT than Archaeal ones. In the first codon position, the ratio of the natural adenine frequency to that for eliminated codon bias is weakly correlated with OGT (Table 1). There are also moderate correlations of the thymine frequency and of the thymine and guanine combination with OGT in the second codon position (Table 1). The third codon position is characterized by selection against adenine provided by the codon bias (Table 1). However, the combination of adenine and guanine is correlated with OGT. To a large extent it does not depend on the codon bias (as in Archaea), pointing to a possible signal of adaptation at a level other than the protein-coding DNA sequence. The generalized codon in Bacteria (Table 4) characterizing the thermophilic trends therefore reads [weak A]1 [weak T]2 [non-(A/G)]3.

The role of excessive (G+C) load in the third codon position in adaptation to aerobicity

In both Archaea and Bacteria the average nucleotide composition in different codon positions is not affected by the codon bias to a large extent, except for the third codon position in Bacteria. However, the compositional variance in the third position is essentially destroyed when there is no codon bias [standard deviation (SD) diminishes from >10% to <1%, see Supplementary File S5]. The natural (G+C)3 load in Bacteria yields a 9% excess in comparison with NCB sequences (P-value = 1.7e−07, Supplementary File S6), while in Archaea there is no significant difference. We found that, despite the frequent involvement of G3 and C3 nucleotides in GC/CG pairs, the (G+C)3 load does not play a crucial role in the thermal stability of mRNA (52,54). Overall, the difference in (G+C)3 load between Archaea and Bacteria is insignificant. If we consider separate thermal groups, the only difference appears in mesophilic Bacteria (9.5% more in Bacteria, P = 0.09). However, when we take into account the oxygen tolerance factor, the difference in (G+C)3 load becomes extremely pronounced between aerobic (70.13% in natural and 50.84% in NCB) and anaerobic (48.94% in natural and 50.46% in NCB) species (P-value = 2e−09). Considering the lack of OGT correlation in protein-coding nucleotides (except thymine in the second codon position), one can conclude that the increased (G+C)3 load is a result of the aerobic life style rather than thermal adaptation (Supplementary Files S2 and S3). The comparison of aerobes and anaerobes in the group of mesophilic Bacteria corroborates this conclusion (Supplementary File S6). The difference is >20% of (G+C)3 content: 70.97% in aerobes and 50.51% in anaerobes (P = 0.000178). This adaptation is entirely driven by the codon bias, because if the codon bias is removed, the (G+C)3 bias as well as the difference between aerobes and anaerobes disappear (50.87% and 50.64%).

We also compared the usage of codons with G or C in the third position with other codons ('Codon Usage' in Supplementary Files S3 and S6). There are three amino acids encoded by six codons (Leu, Arg, Ser), five by four codons (Ala, Pro, Thr, Gly, Val), one by three codons (Ile), nine by two codons (Lys, Asn, Asp, Phe, Cys, Gln, Glu,

Table 4. Generalized nucleotides and dinucleotides in different codon positions favorable for thermostability

Nucleotides correlated with OGT

Domain of life | Codon position | 1 | 2 | 3
Archaea | Nucleotide | A | T/G | Non-[A/G]
Archaea | Origin of the bias | Codon bias | T: amino acid; G: codon bias | Against codon bias
Bacteria | Nucleotide | Weak A | Weak T | Non-[A/G]
Bacteria | Origin of the bias | Codon bias | Amino acid | Against codon bias

Dinucleotides in purine (R) and pyrimidine (Y) notation correlated with OGT

Domain of life | Codon position | 1-2 | 2-3 | 3-1
Archaea | Dinucleotide | RpR/YpY | Non-RpR/YpY | RpR/YpY
Archaea | Origin of the bias | Codon bias | Codon bias | Amino acid
Bacteria | Dinucleotide | RpR/YpY | Non-RpR/YpY | RpR/YpY
Bacteria | Origin of the bias | Not codon bias | Codon bias | Amino acid

Part 1: thermophilic-prone nucleotide biases. Columns 1, 2, 3 contain information on favorable nucleotides and the origin of the bias in codon positions 1, 2, 3. Part 2: thermophilic-prone dinucleotide biases. Columns 1-2, 2-3, 3-1 contain information on favorable dinucleotides and the origin of the bias in codon positions 1-2, 2-3, 3-1.


His, Tyr) and two by one codon (Met, Trp). Typically, half of the synonymous codons of each amino acid have G or C in the third codon position, and the others have A or T. It appeared that almost all of the codons with G or C in the third position have higher frequencies in aerobes compared to anaerobes, and in Bacteria compared to Archaea (see 'Codon Usage' in Supplementary File S3). This observation holds for all codons of the 'two-codon' amino acids and for the majority of codons of the three-, four- and six-codon amino acids. The noticeable exception is the significant suppression of the AGA/AGG codons of Arg (aerobes–anaerobes 177–125, P-values 4.46e−8/5.8e−8; Bacteria–Archaea 174–2625, P-values 1.65e−7/1.14e−11). In lysine, the codon AAA is suppressed in aerobes (difference 1716, P-value 6.44e−6), while the codon AAG is preferred. The AAG codon of lysine is favorable in aerobes (aerobes/anaerobes 143), and the other lysine codon, AAA, can be turned into AAG by only one mutation in the third codon position. Therefore, one can speculate that the demand for discriminating between Arg and Lys is the most probable cause for the suppression of the AGA and AGG codons of arginine in aerobes. Lysine is thus preferred over arginine in aerobic adaptation (Supplementary Files S3 and S6). The contribution of the amino acid to thermophilic adaptation is not compromised, because of the similarity between the physical–chemical characteristics of lysine and arginine from the point of view of thermostability. Overall, an excess of the G3 codons is always more pronounced in aerobes compared to anaerobes, which corroborates the conclusion that the (G+C)3 load is an indication of the aerobic life style (55–58). The similar trend in the difference between Bacteria and Archaea is a result of the domination of the aerobic life style in the Bacteria in the analyzed dataset.
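The codon-family counts quoted in this paragraph, and the (G+C)3 load itself, can be checked with a short sketch. It assumes the standard codon assignments; the function names are hypothetical and not taken from the paper.

```python
from collections import Counter
from itertools import product

# Standard codon assignments, written in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def degeneracy_classes():
    """Group amino acids (one-letter codes) by their number of synonymous codons."""
    per_aa = Counter(aa for aa in CODON_TABLE.values() if aa != "*")
    classes = {}
    for aa, n_codons in per_aa.items():
        classes.setdefault(n_codons, []).append(aa)
    return {n: sorted(aas) for n, aas in classes.items()}


def gc3_fraction_of_synonyms(aa):
    """Share of an amino acid's synonymous codons that end in G or C."""
    codons = [c for c, a in CODON_TABLE.items() if a == aa]
    return sum(c[2] in "GC" for c in codons) / len(codons)


def gc3_load(cds):
    """Fraction of G or C among third codon positions of a coding sequence."""
    thirds = cds[2::3]
    return sum(b in "GC" for b in thirds) / len(thirds)
```

`degeneracy_classes()` reproduces the grouping given in the text: three six-codon, five four-codon, one three-codon, nine two-codon and two one-codon amino acids. `gc3_fraction_of_synonyms` shows that exactly half of the codons end in G or C for every amino acid except Ile, Met and Trp.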

Correlation between dinucleotide frequencies in different codon positions and OGT

In order to analyze dinucleotides in different codon positions (1-2, 2-3 and 3-1), we used the DCTNat and DCTNCB (for positions 1-2 and 2-3) and shuffled (DCTShfld, for position 3-1) sequences. We considered correlations of these contrasts with OGT. The most pronounced biases are found for dinucleotides in coding sequences of Archaea. The ApG dinucleotides show the strongest excess in all positions of natural sequences compared to NCB and shuffled ones. In codon positions 1-2 and 2-3, this compositional peculiarity and its correlation with OGT are provided by the codon bias (Table 1). In position 3-1, the high correlation of ApG frequencies with OGT is determined by the amino acid sequences, and it vanishes after amino acid reshuffling (Table 1). There is also a correlation of the frequency of the complementary CpT dinucleotide in position 3-1 with OGT, which disappears after reshuffling of amino acid sequences (Table 1). The most plausible role of ApG dinucleotides is a contribution to stabilization via base-stacking interactions between the purine rings (22). This conclusion is supported by the fact that the excess of ApG dinucleotides is provided by the codon bias, thus manifesting adaptation on a DNA level. The complementary CpT dinucleotides indicate enrichment of the anti-sense strand of double-stranded DNA (dsDNA) with the stabilizing ApG ones. The resulting mosaic of ApG stacking in sense and anti-sense strands can provide stabilization over long distances without compromising flexibility of the dsDNA. Other overrepresented dinucleotides (regardless of the codon bias) correlated with OGT in positions 1-2 are GpC, CpC, TpA and CpT (Table 1). The frequencies of TpA and CpT dinucleotides are correlated with OGT in position 2-3, where the former is provided by the codon bias and the latter is not.
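To make the three dinucleotide frames concrete, the sketch below tallies dinucleotides separately for codon positions 1-2, 2-3 and 3-1 (the last spanning the boundary between consecutive codons). It is an illustration under stated assumptions, not the procedure used in the paper: the helper names and the toy sequence are hypothetical, and the input is assumed to be in frame.

```python
from collections import Counter


def positional_dinucleotides(cds: str) -> dict:
    """Count dinucleotides separately for codon positions 1-2, 2-3 and 3-1."""
    seq = cds.upper().replace("U", "T")
    if len(seq) % 3 != 0:
        raise ValueError("expected an in-frame coding sequence")
    counts = {"1-2": Counter(), "2-3": Counter(), "3-1": Counter()}
    for i in range(0, len(seq), 3):
        counts["1-2"][seq[i:i + 2]] += 1
        counts["2-3"][seq[i + 1:i + 3]] += 1
        if i + 3 < len(seq):  # 3-1 needs the first base of the next codon
            counts["3-1"][seq[i + 2:i + 4]] += 1
    return counts


def frequency(counter: Counter, dinucleotide: str) -> float:
    """Relative frequency of one dinucleotide within one positional frame."""
    total = sum(counter.values())
    return counter[dinucleotide] / total if total else 0.0


# Toy example (hypothetical sequence): codons ATG AAG CGC AAA TGA
counts = positional_dinucleotides("ATGAAGCGCAAATGA")
print(frequency(counts["2-3"], "AG"))  # ApG frequency in positions 2-3
print(dict(counts["3-1"]))             # dinucleotides across codon boundaries
```

Running the same tally on natural, non-codon-biased and shuffled versions of a genome, and correlating each frequency with OGT across genomes, would reproduce the kind of contrast described above.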

In purine (R) and pyrimidine (Y) notation, codon bias provides grouping (stacking) of similar nucleotides in position 1-2, which increases with OGT (Table 1). At the same time, codon bias works against such grouping in position 2-3 (Table 1). The amino acid sequence is responsible for the grouping of similar nucleotides and correlations with OGT in position 3-1 (Table 1). The resulting generalized thermophilic pattern of dinucleotides in Archaeal cDNA sequences reads [RpR, YpY]1-2 [non-RpR, non-YpY]2-3 [RpR/YpY]3-1. Most of the compositional peculiarities found in Bacteria are not determined by the codon bias, yielding weak correlations with temperature (Table 1). The thermophilic pattern of dinucleotides in Bacteria (Table 4) is the same as the Archaeal one. The difference from Archaea is that selection for RpR and YpY dinucleotides in position 1-2 is not determined by the codon bias in Bacteria (Tables 1 and 4).
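The R/Y pattern can be checked on any coding sequence by collapsing each dinucleotide to its purine/pyrimidine class and measuring the share of "grouped" (RpR or YpY) dinucleotides in each frame. This is a minimal sketch with hypothetical function names and a toy sequence; it assumes an in-frame sequence and ignores any trailing incomplete codon.

```python
PURINES = set("AG")
PYRIMIDINES = set("CTU")


def ry_class(dinucleotide: str) -> str:
    """Map a dinucleotide such as 'AG' to its R/Y class, e.g. 'RpR'."""
    def letter(base: str) -> str:
        if base in PURINES:
            return "R"
        if base in PYRIMIDINES:
            return "Y"
        raise ValueError(f"unexpected base: {base!r}")

    first, second = dinucleotide.upper()
    return f"{letter(first)}p{letter(second)}"


def homo_ry_fraction(cds: str) -> dict:
    """Fraction of RpR + YpY dinucleotides in codon positions 1-2, 2-3 and 3-1."""
    seq = cds.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    frames = {
        "1-2": [c[0:2] for c in codons],
        "2-3": [c[1:3] for c in codons],
        "3-1": [a[2] + b[0] for a, b in zip(codons, codons[1:])],
    }
    return {
        frame: (sum(ry_class(d) in ("RpR", "YpY") for d in dinucs) / len(dinucs)
                if dinucs else 0.0)
        for frame, dinucs in frames.items()
    }


# Toy example (hypothetical sequence): codons ATG AAG CGC AAA TGA
print(homo_ry_fraction("ATGAAGCGCAAATGA"))
```

Under the thermophilic pattern quoted above, one would expect the 1-2 and 3-1 fractions to rise with OGT and the 2-3 fraction to fall, when compared across genomes.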

Compositional and sequence biases observed in folding simulations of mRNA

Though RNA molecules have different overall structures, one may expect that there are some common mechanisms providing stability and function of the folded t-, r- and mRNAs. We have shown above that the (G+C) content apparently contributes to the stability of the t- and rRNA (36,59) and not to the stability of coding DNA (54,59,60). It is manifested in the correlation of the former with OGT (Table 2) and the anti-correlation of the base-pairing energy in the folded structures with OGT (Table 3). Nucleotide compositions in coding DNA (hence in mRNA as well) do not correlate with OGT. The mRNA case is of special interest, because redundancy of the genetic code endows the nucleotide sequence with a potential to satisfy requirements for DNA, mRNA and protein stability. For example, it has been hypothesized (61) that the optimization of the base-pairing in the mRNA contributes to formation of stem fragments. The authors claimed that there is a corresponding periodic pattern of the mRNA secondary structure in human and mouse, which is determined chiefly by the selection that operates on the third codon position (52). An opposing opinion suggests that the secondary structure in mRNA interferes with translation and therefore should be avoided (62). Overall, three scenarios of the relation between mRNA and proteins have been considered earlier (52): (i) the biases are determined by the demands on protein stability; (ii) mRNA stability is the major determinant of sequence biases; (iii) complete independence of the sequence biases related to mechanisms of stability on each level. We have computationally folded the mRNA sequences from 244 prokaryotic genomes and analyzed their characteristics


(Table 5). All the parameters are reported as relative ones, where the signal in the native sequence is normalized by the value obtained for dicodon-shuffled ones (48). The dicodon shuffling randomizes the mRNA sequence while preserving the native protein sequence, native codon usage and native dinucleotide composition. Therefore, it allows one to separate peculiarities of mRNA caused by the demands on its stability and function from the others related to the coding DNA and proteins.
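The idea of a protein-preserving control can be illustrated with a much simpler shuffle than the dicodon procedure of (48). The sketch below only permutes synonymous codons among the positions of the same amino acid, so it preserves the protein sequence and the codon usage but, unlike dicodon shuffling, does not preserve dinucleotide composition. All names are hypothetical, and the standard genetic code is assumed.

```python
import random
from collections import Counter
from itertools import product

# Standard genetic code in the conventional TCAG ordering ('*' marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(cds: str) -> list:
    seq = cds.upper().replace("U", "T")
    if len(seq) % 3 != 0:
        raise ValueError("expected an in-frame coding sequence")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def translate(cds: str) -> str:
    return "".join(CODON_TABLE[codon] for codon in split_codons(cds))


def synonymous_shuffle(cds: str, rng: random.Random) -> str:
    """Permute codons among positions that encode the same amino acid.

    Simplified control: keeps protein sequence and codon usage,
    but NOT the dinucleotide composition kept by dicodon shuffling.
    """
    codons = split_codons(cds)
    positions_by_aa = {}
    for index, codon in enumerate(codons):
        positions_by_aa.setdefault(CODON_TABLE[codon], []).append(index)
    shuffled = list(codons)
    for positions in positions_by_aa.values():
        pool = [codons[i] for i in positions]
        rng.shuffle(pool)
        for i, codon in zip(positions, pool):
            shuffled[i] = codon
    return "".join(shuffled)


# Toy example (hypothetical sequence); a fixed seed keeps the run reproducible.
native = "ATGAAAGGTAAGGGCCTGTTATAA"
control = synonymous_shuffle(native, random.Random(0))
print(translate(native) == translate(control))                        # True
print(Counter(split_codons(native)) == Counter(split_codons(control)))  # True
```

A relative signal of the kind reported in Table 5 would then be the value measured on the native sequence divided by the mean value over many such control sequences, with the caveat that the published analysis used the stricter dicodon shuffle.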

The structure of stems in folded mRNA is characterized by phases, which show the relative location of triplets in opposite strands of the stem. Specifically, Phases I, II and III correspond to positioning of triplets where, respectively, first, second and third nucleotides are complementary. The most and the least frequent pairs in all phases of the dsDNA (see Materials and methods section) are considered. Phases I and II are similar to each other, yielding CG, GC, GU, UG, AU and UA (Table 5). Overall, the Archaea and thermophilic Bacteria have similar trends in most and least frequent pairs. Mesophilic Bacteria stand out with A13U and G33C as the most frequent pairs in Phases II and III, respectively. The major contribution from the three pairing phases (52) to the total amount of stem pairs is provided by Phase III, and nucleotides in codon position 3 are most frequent in stem pairs (Table 5 and Supplementary Table S1). OGT correlations highlight the contribution from the GU wobble pair to the thermostability of archaeal mRNA structures (Table 5 and Supplementary Table S1). We found the highest OGT correlation for the Archaeal pairs U32G (Table 5), confirming the importance of the GU wobble pairs. The stronger effect for Archaeal rather than for Bacterial mRNA is possibly a relic of the ancient RNA world, where RNA was the carrier of genetic information and harsh environmental conditions demanded its increased stability.
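The phase of a base pair follows from the codon positions of its two nucleotides. In an antiparallel stem, pairing first positions (1-1) forces the neighbouring pairs to be 2-3 and 3-2, and similarly for the other phases, so the phase depends only on the sum of the two codon positions modulo 3. The sketch below assumes that a label such as C23G in Table 5 means "C in codon position 2 paired with G in codon position 3"; this reading of the notation, the function names and the toy input are assumptions, not taken from the Materials and methods.

```python
from collections import Counter


def pairing_phase(pos_a: int, pos_b: int) -> str:
    """Phase (I, II or III) of a base pair from the codon positions (1-3) of its bases.

    Phase I:   1-1, 2-3, 3-2  (sum of positions = 2 mod 3)
    Phase II:  2-2, 1-3, 3-1  (sum of positions = 1 mod 3)
    Phase III: 3-3, 1-2, 2-1  (sum of positions = 0 mod 3)
    """
    if pos_a not in (1, 2, 3) or pos_b not in (1, 2, 3):
        raise ValueError("codon positions must be 1, 2 or 3")
    return {2: "I", 1: "II", 0: "III"}[(pos_a + pos_b) % 3]


def phase_counts(seq: str, pairs: list) -> Counter:
    """Tally (phase, label) for base pairs given as 0-based index pairs (i, j).

    `seq` is assumed to start at codon position 1.
    """
    counts = Counter()
    for i, j in pairs:
        a, b = i % 3 + 1, j % 3 + 1
        label = f"{seq[i]}{a}{b}{seq[j]}"  # e.g. 'G13C'
        counts[(pairing_phase(a, b), label)] += 1
    return counts


# Toy example (hypothetical hairpin): three stacked pairs closing a short loop.
print(phase_counts("GGGAAACCC", [(0, 8), (1, 7), (2, 6)]))
```

In the toy hairpin all three stacked pairs fall into the same phase, which is the expected behaviour for consecutive pairs of one uninterrupted stem.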

Table 6 shows that purine load (the contents of A+G, R/Y, ApG) is larger in the loop regions than in the stems. In Archaea this difference is slightly more pronounced than in Bacteria (Supplementary Table S2). The purine load is also in good positive correlation with OGT for the loops, but not for stems. However, the amount of ApG dinucleotides is correlated with OGT in both loops and stems. For synonymous codons, the fractions of purine-rich codons (e.g. GGR versus GGY for glycine and AGR versus CGY for arginine) are correlated with OGT in both loop and stem regions (slightly stronger in loops). For non-synonymous codons, however, the amount of purine-rich codons (GAR for glutamate) is correlated with OGT in both loops and stems, while fractions of purine-rich AAR codons for lysine do not correlate with OGT (Table 6). While Forsdyke et al. (21) reported an increase of the purine content and its positive correlation with OGT in loops, we found that the fraction of purine-rich synonymous codons is correlated with OGT in stems as well. At the same time, the amount of non-synonymous purine-rich codons of lysine is not correlated with OGT in either loop or stem regions. Additionally, the amount of ApG dinucleotides is correlated with OGT in both loops and stems (Table 6).
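The loop-versus-stem comparison of purine load can be sketched for a single folded sequence. The example assumes the secondary structure is available in dot-bracket notation (paired bases as parentheses, unpaired bases as dots), which is a common output format of RNA folding tools; the function name and the toy hairpin are hypothetical, and no folding is performed here.

```python
def purine_load(seq: str, structure: str) -> dict:
    """A+G fraction in unpaired ('.') and paired ('(' or ')') positions."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    bases = seq.upper()
    loop = [b for b, s in zip(bases, structure) if s == "."]
    stem = [b for b, s in zip(bases, structure) if s in "()"]

    def fraction(region: list) -> float:
        # NaN signals that the region is empty rather than purine-free.
        return sum(b in "AG" for b in region) / len(region) if region else float("nan")

    return {"loop": fraction(loop), "stem": fraction(stem)}


# Toy hairpin (hypothetical): a four-base all-purine loop closed by a G/C stem.
print(purine_load("GGGCAAAAGCCC", "((((....))))"))
```

Averaging these two fractions over all folded mRNAs of a genome, and then correlating the genome-level values with OGT, would give quantities comparable in kind to the A+G rows of Table 6.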

Signals of thermophilic adaptation in protein sequences

It has been shown earlier (22) that environmental temperature directly affects the amino acid composition of prokaryotic proteins, and the IVYWREL combination can serve as a predictor of the OGT in prokaryotes. We use here a 'Z-score' predictor of OGT (43), which takes into account differences in variances of the proteomic frequencies of individual amino acids (Table 7). We have shown that the Z-score predictor properly corrects for these differences (43), better reflecting the contribution of the amino acid combinations to thermal adaptation. Separate predictors for Archaea and Bacteria (based on 33 and 99 proteomes with OGTs spread over 5–100 and 10–85°C, respectively) yield the same general trend in the increase of hydrophobic and charged residues at the

Table 5. Compositional and dicodon signals of mRNA adaptation purified by the dicodon shuffling

Characteristic                              Archaea                    Bacteria

The most and least (after the comma) frequent pairs in stems of mRNA

Mesophiles
  Phase I                                   C23G, G23U                 G32U, U32G
  Phase II                                  C31G, G13U                 A13U, G13U
  Phase III                                 A33U, U33G                 G33C, U33G

Thermophiles
  Phase I                                   G32C, G23U                 C23G, G23U
  Phase II                                  C22G, G13U                 C31G, G13U
  Phase III                                 A33U, U33G                 U33A, U33G

Correlations with OGT of
  Segment energy <Esg>                      R = –0.71                  R = –0.39
  Energy per base pair <Ebp>                R = –0.73                  R = –0.54

The most significant correlations with OGT
                                            Phase I U32G, R = 0.77     (U21G)Nat PIII, R = 0.46
                                            Phase I G23U, R = 0.71     (G12U)Nat PIII, R = 0.45
                                            Phase I All pairs, R = 0.7 (U21G)dShfld PIII, R = 0.45
                                                                       (G12U)dShfld PIII, R = 0.41

Phases I, II and III correspond to positioning of triplets where, respectively, first, second and third nucleotides are complementary. All signals (except OGT correlations in Bacteria) are normalized by corresponding values for control sequences after the dicodon shuffling. See explanations of abbreviations in the Materials and methods section.


expense of polar ones (34), showing minor differences in the presence of individual residues.

We have shown earlier (22) that the amino acid bias working in thermal adaptation of proteins does not depend on the overall nucleotide composition of coding DNA sequences. However, one can still expect that nucleotide compositions of particular codon positions and/or dinucleotide compositions of positions 1-2 and 2-3 in codons may be somehow linked to amino acid biases. We indeed found a significant difference between nucleotide load in individual codon positions of Archaeal and Bacterial coding DNA sequences (Table 1). First, Archaeal protein coding sequences yield a strong correlation between the (T+G)2 load in the second codon position and OGT (r = 0.72). The corresponding non-codon-biased sequences show a comparable correlation (r = 0.68), indicating that the (T+G) load reflects tuning of the amino acid composition in connection to thermal adaptation. The main contributor to this correlation is thymine (Table 1), which provides an increase of the strongly hydrophobic residues LMFIV (Supplementary File S1). The (T+G) load is much more weakly correlated with OGT in Bacterial sequences (r = 0.36 for natural and r = 0.42 for NCB sequences). The correlation of the thymine load with OGT is 0.37 for both Natural (NAT) and Non-codon-biased (NCB) sequences. This observation agrees with the presumed prevalence of the structure-based strategy of thermal adaptation in Archaeal proteins (29), contrary to the sequence-based one in Bacterial proteomes. Specifically, there is a clear correlation between amounts of charged residues and the OGT in the set of Bacterial proteomes (Table 7), presumably reflecting the domination of the sequence-based strategy in thermophilic adaptation of Bacteria (29).

Finally, we checked if there exists any specific connection between the dipeptide composition of proteins and that of 3-1 dinucleotides. We found (data not shown) that the strongest correlations between dinucleotides 3-1 and dipeptides exist mostly for the dinucleotides with guanine in position 3. The most frequent pair of amino acids contains methionine (Met, encoded exclusively by the AUG codon) as the first and any polar/charged amino acid as the second one. The preference of a polar/charged amino acid to be next to Met can be explained by the typically surface location of the protein N-termini and its role as a signal for ubiquitination (63).
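The correlations quoted above combine two simple steps: measuring a nucleotide load at one codon position in each genome, and computing a Pearson correlation of that load with OGT across genomes. The sketch below shows both steps under stated assumptions: the function names are hypothetical, the input is assumed to be in frame, and the numbers in the usage example are made-up toy values, not data from the paper.

```python
from math import sqrt


def position_load(cds: str, position: int, bases: str) -> float:
    """Fraction of codons carrying one of `bases` at codon `position` (1, 2 or 3)."""
    if position not in (1, 2, 3):
        raise ValueError("position must be 1, 2 or 3")
    seq = cds.upper().replace("U", "T")
    column = seq[position - 1::3]
    if not column:
        raise ValueError("sequence too short")
    return sum(b in bases for b in column) / len(column)


def pearson_r(xs: list, ys: list) -> float:
    """Plain Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# (T+G) load in the second codon position of a toy sequence
# (codons ATG AAG CGC AAA TGA have second positions T, A, G, A, G).
print(position_load("ATGAAGCGCAAATGA", 2, "TG"))

# Made-up illustrative values: one load and one OGT per hypothetical genome.
toy_loads = [0.40, 0.42, 0.45, 0.47]
toy_ogt = [30.0, 37.0, 70.0, 85.0]
print(pearson_r(toy_loads, toy_ogt))
```

Applying `position_load` to concatenated coding sequences of each genome, once for the natural and once for the non-codon-biased version, would give the two series whose correlations with OGT are compared in the text.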

DISCUSSION

We will make a brief overview of the most pronounced biases and causal relationships in DNA, RNA and proteins. The major signal in nucleotide compositions, specifically the (G+C) load, is related to thermal stabilization of t- and rRNA (36,54,55,59,60) and to discrimination between the coding and non-coding sequences in Archaea (Table 2). Comparative analysis of nucleotide and amino acid compositions in relation to thermophilic adaptation prompts the conclusion that the (G+C) content does not contribute to thermostabilization of coding DNA (18,22,36,54,59), nor does it affect the amino acid composition and its thermophilic trends (15,22). When

Table 6. Purine loading in loop and stem regions of folded mRNA and its OGT correlation

                     Loop                           Stem
Feature              Mean content  OGT correlation  Mean content  OGT correlation  L-v-S P-value

A+G                  0.560          0.59            0.500         –0.26            <2.2E–16
R/Y                  1.299          0.61            1.002         –0.26            <2.2E–16
ApG                  0.061          0.79            0.051          0.50            0.0002
GGR (glycine)        0.027          0.62            0.045          0.42            1.2E–17
GGY (glycine)        0.017         –0.05            0.079         –0.23            1.0E–42
AGR (arginine)       0.035          0.72            0.022          0.56            2.4E–10
CGR (arginine)       0.018         –0.22            0.040         –0.24            7.1E–09
CGY (arginine)       0.016         –0.25            0.037         –0.26            6.2E–10
GAR (glutamate)      0.060          0.71            0.036          0.59            <2.2E–16
AAR (lysine)         0.086          0.22            0.017          0.11            <2.2E–16
GAY (aspartate)      0.037         –0.17            0.040         –0.37            0.0002

Feature: analyzed nucleotide, dinucleotide or amino acid. Loop and Stem: information on mean content and OGT correlation of the above. L-v-S: a comparison between corresponding contents in the loop and stem regions by Wilcoxon tests; P-values are shown in this column. Correlation coefficients with OGT are shown; P-value < 0.01; P-value < 0.0001.

Table 7. Signals of thermophilic adaptation in protein sequences of Archaea and Bacteria

Correlation with OGT

Z-scored predictor: Archaea, ILVW Y DKR (R = 0.93); Bacteria, IPV Y EKR (R = 0.89)

The most abundant residues in >70% (>60%) of all Z-scored predictors: Archaea, VIWL Y ER; Bacteria, VPI Y ER(K)

Individual amino acids: Archaea, + L W; Bacteria, + E; – T Q D

Types of amino acids: Archaea, + h, – p; Bacteria, + h c, – p

Dipeptides: Archaea, + hp ph, – cp pc; Bacteria, + cc, – pc

+, increase of the amino acid (or amino acid type) fraction with OGT; –, decrease of the amino acid (or amino acid type) fraction with OGT. Capital letters are names of amino acids; h, p, c are hydrophobic, polar, charged types of amino acids. Correlation coefficients between the Z-scored thermostability predictors and OGT are given in parentheses.


separate codon positions are considered for coding DNA, the only compositional bias observed for nucleotide compositions is an excess of (G+C) load in the third codon position in Bacteria. There is a significant excess of guanine and cytosine nucleotides (29.2 and 32.82%) compared to the NCB nucleotide contents (26.09 and 24.66%). A plausible explanation for this bias is the role of the (G+C)3 load in the adaptation to the aerobic life style, which dominates in Bacteria. The (G+C)3 load can be advantageous for several complementary reasons. Indeed, additional GC base pairing will contribute to the stability of the double helix of DNA and of stem regions of RNA molecules. Moreover, G3 bases can work as scavengers of oxidizing agents, providing protection for G bases in other codon positions (56–58). At the same time, this bias would not lead to any changes in amino acid composition, leaving protein structure and stability intact (55).

In this work we considered compositional and sequence biases in proteins in relation to those in corresponding nucleic acid sequences and to the phylogeny of species (Archaea or Bacteria). OGT correlations of nucleotides in different codon positions are similar in Archaea and Bacteria, but higher in the former (Table 1). The optimal codon from the point of view of thermophilic adaptation (i.e. most correlated with OGT) reads A1 [TG]2 non-[AG]3 in Archaea, but [weak A]1 [weak T]2 non-[AG]3 in Bacteria (Table 4). Codon bias supports the strong correlation between the amount of adenine and OGT in the first codon position in Archaea, and the same but weaker correlation in Bacteria. While we showed earlier (22) that the amino acid thermophilic trend is not determined by the overall nucleotide composition, consideration of separate codon positions reveals an interesting link between positional nucleotide frequencies and amino acid composition in relation to the domain of Life. Specifically, the prevalence of the nucleotide thymine in the second codon position and its strong correlation with OGT in Archaea is apparently a result of the demand for enrichment of Archaeal proteins with hydrophobic residues (Table 7). Noteworthy, a massive increase of van der Waals interactions (24,64) was found to be the cornerstone of the structure-based evolutionary strategy of protein thermostability in ancient species (29). In Bacterial proteins, however, there is an increase of the fractions of charged residues with OGT (Table 7). Thus, we observe a transition from the structure-based evolutionary strategy of protein thermostability in Archaea to the sequence-based one in Bacteria, and corresponding nucleotide and protein compositional biases (Table 7) underlying these strategies (29). The third codon position is characterized by selection against adenine and guanine in Archaea and against adenine in Bacteria, which is a result of this bias.

The most pronounced dinucleotide compositional biases are the excess of homo-dinucleotides and its correlation with OGT in ncDNA, tRNA and rRNA. It is presumably a relic of ancient primitive homo(poly)nucleotides from which life started. Overrepresentation of the complementary ApG/CpT dinucleotides points to their contribution to DNA/RNA stability via strong base-stacking interactions (65,66). The consideration of dinucleotide biases in purine (R) and pyrimidine (Y) notation also reveals an interesting picture. Overall, for both Archaea and Bacteria, the signature for thermophilic dinucleotides reads RpR/YpY, non-(RpR/YpY), RpR/YpY for codon positions 1-2, 2-3 and 3-1, respectively (Table 4). The major difference between Archaea and Bacteria in this case is that the dinucleotide bias in positions 1-2 is caused by demands on the nucleotide sequence level (and provided by the codon bias) in Archaea, while some other factors work in Bacteria. The preference for homodinucleotides in positions 1-2 and 3-1 can be a possible reason for avoiding homodinucleotides in positions 2-3, because heterodinucleotides can contribute to DNA flexibility tuning. Indeed, it has been shown that, given an average flexibility F of homodinucleotides (either RpR or YpY), the flexibilities of the heterodinucleotides YpR and RpY are 2F and 0.5F, respectively (67). The codon positions 2-3 are most versatile in terms of the relationship between the nucleotide and protein sequences. Therefore, the presence of heterodinucleotides in codon position 2-3 makes it possible to adjust flexibility of the nucleotide sequence without changing physical–chemical characteristics of the encoded amino acid residue. Additionally, we found that the link between dipeptide and corresponding 3-1 dinucleotide frequencies is determined by the amino acid dipeptide. The dominating bias reads as Met followed by a polar or charged residue.

On the RNA level, simple consideration of the nucleotide composition suggests discriminating the tRNA and rRNA molecules from the mRNA. Indeed, based on the correlation with OGT, we can conclude that (G+C) content is important for providing stability of tRNA and rRNA molecules, but not for mRNA. The role of the (G+C) content was further confirmed by folding simulations (Table 3). We also found other complementary signals in mRNA sequences related to its structure, stability and function. Archaea and Bacteria yield similar trends in the most and least frequent nucleotide pairs, and the third codon position makes a major contribution to the base-pairing mechanism of stem stabilization. Specifically, in addition to Watson–Crick pairing, the GU wobble pairs substantially contribute to the thermal stability of the Archaeal mRNA, in agreement with earlier observations (68,69). Overall, it results in moderate correlation of the segment energy and energy per base pair of the folded mRNA with OGT (Table 5). Folding simulations performed in this work also allowed us to analyze peculiarities of the nucleotide contents and structural contacts in stem and loop regions of folded mRNA. Purine load in loop regions is higher than in stems, and this effect is slightly stronger in Archaea than in Bacteria (Table 6). Purine load also correlates with OGT in loops, but not in stems. This correlation was observed earlier by Forsdyke et al. (21), and it was described as 'polite purine load of the loop regions' that prevents undesired mRNA–mRNA single-strand interactions. The authors concluded that increased purine load can affect the codon choice and, consequently, the amino acid composition (21). Our data [as well as careful analysis of the original data in (21)] do not support the latter claim. There is indeed a correlation between the fraction of


some purine-rich codons [GGR (Gly), AGR (Arg) and GAR (Glu)] and OGT (Table 6). However, the fraction of purine-rich non-synonymous codons AAR (Lys) is not correlated with OGT (Table 6). Moreover, these codons are synonymous and cannot directly affect the amino acid composition. It has been shown earlier (22,70) that an increase of Glu and Arg fractions is a trend in protein thermophilic adaptation. At the same time, purine load remains high after elimination of the codon bias, pointing to the amino acid composition as the most probable cause for the former (22). All the above prompts one to conclude that, in mutual tuning of nucleotide and amino acid compositions, the purine load does not dominate and determine biases in amino acid composition, if not the opposite. The analysis of purine load shows that it correlates with OGT in both stems and loops (Table 6) in the case of the purine-rich codons GGR (Gly), AGR (Arg) and GAR (Glu). Further, the amount of ApG dinucleotides strongly correlates with OGT in both loops and stems (Table 6), and ApG provides a strong base stacking important for thermal stability of nucleic acid sequences (22,65). Purine load, therefore, is apparently a determinant of the base-stacking mechanism in mRNA and/or DNA thermal stability, as well as a result of the thermophilic amino acid composition trend.

Prokaryotes thrive in a temperature interval spanning over a hundred degrees, and they represent two major life styles, aerobic and anaerobic. Analysis of complete Archaeal and Bacterial genomes unraveled compositional and sequence signals related to molecular mechanisms of stability and adaptation, unaffected by selective sequencing or by the comparison of orthologs. Overall, codon bias works stronger in Archaea and is mostly utilized in thermophilic adaptation of nucleic acids. It apparently reflects the longer evolutionary history of Archaea, which presumably started close to the origin of life in hot conditions (29). The codon bias and amino acid sequences (dipeptide composition) work in accord for supporting enrichment of the nucleotide sequences with ApG/CpT dinucleotides, determinants of the base-stacking mechanism of nucleic acid stability (65–67). We also found that the second codon position reveals a strong link between the nucleotide and amino acid compositions. Specifically, the excess of thymine in this position is a result of a demand for the enrichment of Archaeal proteins with hydrophobic amino acids. The third codon position in Bacteria is the only case where codon bias is detectable already on the level of pure composition. We found that the (G+C)3 load is related to the aerobic life style dominating in Bacteria. From the point of view of thermophilic adaptation, codon bias works against G in the third codon position in both Archaea and Bacteria. It thus supports a specific role of the third codon position in discriminating between adaptations to temperature and to the aerobic life style. Finally, we obtained an interesting and complex picture of the relationship between the nucleotide composition (purine load) and amino acid composition (selection between Arg and Lys) in relation to thermal adaptation, aerobicity and phylogeny. Specifically, purine load in both Archaea and Bacteria is a result of the 'from both ends of the hydrophobicity scale' trend in thermal adaptation of proteins (34), reflected in the IVYWREL predictor of the OGT (22). According to this predictor, Arg is a preferred amino acid for thermal adaptation, though Lys is the next candidate, present in the top most correlated predictors as well (22,43). In particular, selection for Lys over Arg in some species was shown to be important for the entropic mechanism of protein thermostabilization (70). However, the overall preference for Arg over Lys in the case of thermal adaptation is well manifested in the excess of the purine-rich (AGR) codons of Arg at the expense of the purine-rich ones of Lys (AAR). At the same time, discrimination between Arg and Lys works in the opposite direction in aerobicity, where the Arg codons AGA and AGG are suppressed in favor of the purine-rich AAA and AAG of Lys. It thus reflects a preference for Lys over Arg in aerobes compared to anaerobes in particular, and in Bacteria versus Archaea in general, preserving at the same time the high purine load necessary for providing base-stacking in corresponding nucleic acids (22,53,65). An intricate connection between Lys/Arg and their codons in relation to thermal adaptation and aerobicity exemplifies how selection can work on nucleic acids and proteins simultaneously in response to demands of different environments. Obviously, the whole picture of molecular mechanisms of adaptation and relations between them is far from being complete. Consideration of other environmental factors such as salinity, pressure, etc. will help to unravel new mechanisms of stability and their sequence/structure determinants, and to understand tradeoffs that Nature embraced en route of the evolution and adaptation.
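As a small illustration of the IVYWREL signal mentioned above, the sketch below computes the summed fraction of these seven residues over a set of protein sequences. It assumes the predictor is read as a total proteome-wide fraction; the function name and the toy sequences are hypothetical, and no conversion of the fraction into a temperature is attempted, since the regression coefficients are not given in this text.

```python
IVYWREL = frozenset("IVYWREL")


def ivywrel_fraction(proteins: list) -> float:
    """Summed fraction of I, V, Y, W, R, E and L over all residues of a proteome."""
    total = 0
    hits = 0
    for sequence in proteins:
        for residue in sequence.upper():
            if residue.isalpha():  # skip gaps, stop symbols and whitespace
                total += 1
                hits += residue in IVYWREL
    if total == 0:
        raise ValueError("no residues found")
    return hits / total


# Toy proteome (hypothetical sequences): 7 of 14 residues belong to IVYWREL.
print(ivywrel_fraction(["IVYWREL", "AAAAAAA"]))
```

Computed per proteome and correlated with OGT across species, a quantity of this kind is what the Arg-versus-Lys discussion above refers to when it notes that Arg, but not Lys, enters the predictor.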

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online

FUNDING

Funding for Open access charges: Functional Genomics Program (FUGE II), Norwegian Research Council.

Conflict of interest statement. None declared.

REFERENCES

1. Crick,F. (1970) Central dogma of molecular biology. Nature, 227, 561–563.

2. Tawfik,D.S. (2010) Messy biology and the origins of evolutionary innovations. Nat. Chem. Biol., 6, 692–696.

3. Tokuriki,N. and Tawfik,D.S. (2009) Stability effects of mutations and protein evolvability. Curr. Opin. Struct. Biol., 19, 596–604.

4. Tokuriki,N. and Tawfik,D.S. (2009) Protein dynamism and evolvability. Science, 324, 203–207.

5. Koonin,E.V. (2012) Does the central dogma still stand? Biol. Direct, 7, 27.

6. Pe'er,I., Felder,C.E., Man,O., Silman,I., Sussman,J.L. and Beckmann,J.S. (2004) Proteomic signatures: amino acid and oligopeptide compositions differentiate among phyla. Proteins, 54, 20–40.

7. Aravind,L., Tatusov,R.L., Wolf,Y.I., Walker,D.R. and Koonin,E.V. (1998) Evidence for massive gene exchange between archaeal and bacterial hyperthermophiles. Trends Genet., 14, 442–444.

8. Khachane,A.N., Timmis,K.N. and dos Santos,V.A. (2005) Uracil content of 16S rRNA of thermophilic and psychrophilic prokaryotes correlates inversely with their optimal growth temperatures. Nucleic Acids Res., 33, 4016–4022.

9. Suhre,K. and Claverie,J.M. (2003) Genomic correlates of hyperthermostability, an update. J. Biol. Chem., 278, 17198–17202.

10. Tekaia,F. and Yeramian,E. (2006) Evolution of proteomes: fundamental signatures and global trends in amino acid compositions. BMC Genom., 7, 307.

11. Wang,H.C. and Hickey,D.A. (2002) Evidence for strong selective constraint acting on the nucleotide composition of 16S ribosomal RNA genes. Nucleic Acids Res., 30, 2501–2507.

12. Friedman,R., Drake,J.W. and Hughes,A.L. (2004) Genome-wide patterns of nucleotide substitution reveal stringent functional constraints on the protein sequences of thermophiles. Genetics, 167, 1507–1512.

13. Lynn,D.J., Singer,G.A. and Hickey,D.A. (2002) Synonymous codon usage is subject to selection in thermophilic bacteria. Nucleic Acids Res., 30, 4272–4277.

14. Singer,G.A. and Hickey,D.A. (2000) Nucleotide bias causes a genomewide bias in the amino acid composition of proteins. Mol. Biol. Evol., 17, 1581–1588.

15. Singer,G.A. and Hickey,D.A. (2003) Thermophilic prokaryotes have characteristic patterns of codon usage, amino acid composition and nucleotide content. Gene, 317, 39–47.

16. Tekaia,F., Yeramian,E. and Dujon,B. (2002) Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene, 297, 51–60.

17. Roy Chowdhury,A. and Dutta,C. (2012) A pursuit of lineage-specific and niche-specific proteome features in the world of archaea. BMC Genom., 13, 236.

18. Wang,H.C., Susko,E. and Roger,A.J. (2006) On the correlation between genomic G+C content and optimal growth temperature in prokaryotes: data quality and confounding factors. Biochem. Biophys. Res. Commun., 342, 681–684.

19. Wu,H., Zhang,Z., Hu,S. and Yu,J. (2012) On the molecular mechanism of GC content variation among eubacterial genomes. Biol. Direct, 7, 2.

20. Knight,R.D., Freeland,S.J. and Landweber,L.F. (2001) A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes. Genome Biol., 2, RESEARCH0010.

21. Lao,P.J. and Forsdyke,D.R. (2000) Thermophilic bacteria strictly obey Szybalski's transcription direction rule and politely purine-load RNAs with both adenine and guanine. Genome Res., 10, 228–236.

22. Zeldovich,K.B., Berezovsky,I.N. and Shakhnovich,E.I. (2007) Protein and DNA sequence determinants of thermophilic adaptation. PLoS Comput. Biol., 3, e5.

23. Kreil,D.P. and Ouzounis,C.A. (2001) Identification of thermophilic species by the amino acid compositions deduced from their genomes. Nucleic Acids Res., 29, 1608–1615.

24. Berezovsky,I.N., Tumanyan,V.G. and Esipova,N.G. (1997) Representation of amino acid sequences in terms of interaction energy in protein globules. FEBS Lett., 418, 43–46.

25. Cambillau,C. and Claverie,J.M. (2000) Structural and genomic correlates of hyperthermostability. J. Biol. Chem., 275, 32383–32386.

26. Greaves,R.B. and Warwicker,J. (2007) Mechanisms for stabilisation and the maintenance of solubility in proteins from thermophiles. BMC Struct. Biol., 7, 18.

27. Jaenicke,R. (1999) Stability and folding of domain proteins. Prog. Biophys. Mol. Biol., 71, 155–241.

28. Jaenicke,R. and Bohm,G. (1998) The stability of proteins in extreme environments. Curr. Opin. Struct. Biol., 8, 738–748.

29. Berezovsky,I.N. and Shakhnovich,E.I. (2005) Physics and evolution of thermophilic adaptation. Proc. Natl Acad. Sci. USA, 102, 12742–12747.

30. Chakravarty,S. and Varadarajan,R. (2002) Elucidation of factors responsible for enhanced thermal stability of proteins: a structural genomics based study. Biochemistry, 41, 8152–8161.

31. Glyakina,A.V., Garbuzynskiy,S.O., Lobanov,M.Y. and Galzitskaya,O.V. (2007) Different packing of external residues can explain differences in the thermostability of proteins from thermophilic and mesophilic organisms. Bioinformatics, 23, 2231–2238.

32. Thompson,M.J. and Eisenberg,D. (1999) Transproteomic evidence of a loop-deletion mechanism for enhancing protein thermostability. J. Mol. Biol., 290, 595–604.

33. Tokuriki,N., Oldfield,C.J., Uversky,V.N., Berezovsky,I.N. and Tawfik,D.S. (2009) Do viral proteins possess unique biophysical features? Trends Biochem. Sci., 34, 53–59.

34. Berezovsky,I.N., Zeldovich,K.B. and Shakhnovich,E.I. (2007) Positive and negative design in stability and thermal adaptation of natural proteins. PLoS Comput. Biol., 3, e52.

35. Bharanidharan,D., Bhargavi,G.R., Uthanumallian,K. and Gautham,N. (2004) Correlations between nucleotide frequencies and amino acid composition in 115 bacterial species. Biochem. Biophys. Res. Commun., 315, 1097–1103.

36. Nakashima,H., Fukuchi,S. and Nishikawa,K. (2003) Compositional changes in RNA, DNA and proteins for bacterial adaptation to higher and lower temperatures. J. Biochem. (Tokyo), 133, 507–513.

37. Dehouck,Y., Folch,B. and Rooman,M. (2008) Revisiting the correlation between proteins' thermoresistance and organisms' thermophilicity. Protein Eng. Des. Sel., 21, 275–278.

38. Folch,B., Dehouck,Y. and Rooman,M. (2010) Thermo- and mesostabilizing protein interactions identified by temperature-dependent statistical potentials. Biophys. J., 98, 667–677.

39. Folch,B., Rooman,M. and Dehouck,Y. (2008) Thermostability of salt bridges versus hydrophobic interactions in proteins probed by statistical potentials. J. Chem. Inf. Model., 48, 119–127.

40. Gonnelli,G., Rooman,M. and Dehouck,Y. (2012) Structure-based mutant stability predictions on proteins of unknown structure. J. Biotechnol., 161, 287–293.

41. Ponnuswamy,P., Muthusamy,R. and Manavalan,P. (1982) Amino acid composition and thermal stability of globular proteins. Int. J. Biol. Macromol., 4, 186–190.

42. Berezovsky,I.N. (2011) The diversity of physical forces and mechanisms in intermolecular interactions. Phys. Biol., 8, 035002.

43. Ma,B.G., Goncearenco,A. and Berezovsky,I.N. (2010) Thermophilic adaptation of protein complexes inferred from proteomic homology modeling. Structure, 18, 819–828.

44. Makarova,K.S. and Koonin,E.V. (2005) Evolutionary and functional genomics of the Archaea. Curr. Opin. Microbiol., 8, 586–594.

45. Novichkov,P.S., Wolf,Y.I., Dubchak,I. and Koonin,E.V. (2009) Trends in prokaryotic evolution revealed by comparison of closely related bacterial and archaeal genomes. J. Bacteriol., 191, 65–73.

46. Koonin,E.V. and Wolf,Y.I. (2008) Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res., 36, 6688–6719.

47. Koonin,E.V., Mushegian,A.R., Galperin,M.Y. and Walker,D.R. (1997) Comparison of archaeal and bacterial genomes: computer analysis of protein sequences predicts novel functions and suggests a chimeric origin for the archaea. Mol. Microbiol., 25, 619–637.

48. Katz,L. and Burge,C.B. (2003) Widespread selection for local RNA secondary structure in coding regions of bacterial genes. Genome Res., 13, 2042–2051.

49. Hofacker,I.L., Priwitzer,B. and Stadler,P.F. (2004) Prediction of locally stable RNA secondary structures for genome-wide surveys. Bioinformatics, 20, 186–190.

50. Zuker,M. and Stiegler,P. (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res., 9, 133–148.

51 MathewsDH SabinaJ ZukerM and TurnerDH (1999)Expanded sequence dependence of thermodynamic parametersimproves prediction of RNA secondary structure J Mol Biol288 911ndash940

52 ShabalinaSA OgurtsovAY and SpiridonovNA (2006) Aperiodic pattern of mRNA secondary structure created by thegenetic code Nucleic Acids Res 34 2428ndash2437

53 MarmurJ and DotyP (1962) Determination of the basecomposition of deoxyribonucleic acid from its thermaldenaturation temperature J Mol Biol 5 109ndash118

Nucleic Acids Research 2014 Vol 42 No 5 2891


54. Hurst,L.D. and Merchant,A.R. (2001) High guanine-cytosine content is not an adaptation to high temperature: a comparative analysis amongst prokaryotes. Proc. Biol. Sci., 268, 493–497.

55. Naya,H., Romero,H., Zavala,A., Alvarez,B. and Musto,H. (2002) Aerobiosis increases the genomic guanine plus cytosine content (GC%) in prokaryotes. J. Mol. Evol., 55, 260–264.

56. Beckman,K.B. and Ames,B.N. (1997) Oxidative decay of DNA. J. Biol. Chem., 272, 19633–19636.

57. Cheng,K.C., Cahill,D.S., Kasai,H., Nishimura,S. and Loeb,L.A. (1992) 8-Hydroxyguanine, an abundant form of oxidative DNA damage, causes G→T and A→C substitutions. J. Biol. Chem., 267, 166–172.

58. Vieira-Silva,S. and Rocha,E.P. (2008) An assessment of the impacts of molecular oxygen on the evolution of proteomes. Mol. Biol. Evol., 25, 1931–1942.

59. Galtier,N. and Lobry,J.R. (1997) Relationships between genomic G+C content, RNA secondary structures, and optimal growth temperature in prokaryotes. J. Mol. Evol., 44, 632–636.

60. Musto,H., Naya,H., Zavala,A., Romero,H., Alvarez-Valin,F. and Bernardi,G. (2004) Correlations between genomic GC levels and optimal growth temperatures in prokaryotes. FEBS Lett., 573, 73–77.

61. Fitch,W.M. (1974) The large extent of putative secondary nucleic acid structure in random nucleotide sequences or amino acid derived messenger-RNA. J. Mol. Evol., 3, 279–291.

62. Workman,C. and Krogh,A. (1999) No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. Nucleic Acids Res., 27, 4816–4822.

63. Bachmair,A., Finley,D. and Varshavsky,A. (1986) In vivo half-life of a protein is a function of its amino-terminal residue. Science, 234, 179–186.

64. Berezovsky,I.N., Namiot,V.A., Tumanyan,V.G. and Esipova,N.G. (1999) Hierarchy of the interaction energy distribution in the spatial structure of globular proteins and the problem of domain definition. J. Biomol. Struct. Dyn., 17, 133–155.

65. Yakovchuk,P., Protozanova,E. and Frank-Kamenetskii,M.D. (2006) Base-stacking and base-pairing contributions into thermal stability of the DNA double helix. Nucleic Acids Res., 34, 564–574.

66. Friedman,R.A. and Honig,B. (1995) A free energy analysis of nucleic acid base stacking in aqueous solution. Biophys. J., 69, 1528–1535.

67. Okonogi,T.M., Alley,S.C., Reese,A.W., Hopkins,P.B. and Robinson,B.H. (2002) Sequence-dependent dynamics of duplex DNA: the applicability of a dinucleotide model. Biophys. J., 83, 3446–3459.

68. Masquida,B. and Westhof,E. (2000) On the wobble GoU and related pairs. RNA, 6, 9–15.

69. Varani,G. and McClain,W.H. (2000) The G x U wobble base pair. A fundamental building block of RNA structure crucial to RNA function in diverse biological systems. EMBO Rep., 1, 18–23.

70. Berezovsky,I.N., Chen,W.W., Choi,P.J. and Shakhnovich,E.I. (2005) Entropic stabilization of proteins and its proteomic consequences. PLoS Comput. Biol., 1, e47.


genomic compositions with the weight proportional to genome size, so that each group is represented as one meta-genome. We report the weighted averages in Supplementary File S1 and show the selected values in Table 1. The standard deviations are reported in Supplementary Files S5 and S8, along with the P-values for OGT correlations. P-values for comparisons between the groups reported in the text are calculated using a two-sample weighted t-test (Student's t-test with the Welch correction to the degrees of freedom, which is the standard procedure in the R software for unequal variances):

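$$
df_{\mathrm{Welch}} = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
$$

where $s_i^2$ and $n_i$ are the sample variance and the number of observations in group $i$, respectively.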
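
The Welch correction to the degrees of freedom can be sketched in a few lines of Python. This is a minimal illustration of the formula only: the function name `welch_df` is hypothetical, and the sketch takes plain (unweighted) sample variances, whereas the analysis described here uses genome-size weights.

```python
def welch_df(var1: float, n1: int, var2: float, n2: int) -> float:
    """Welch-corrected degrees of freedom for two groups.

    var1, var2 are the sample variances and n1, n2 the numbers of
    observations (hypothetical argument names chosen for this sketch).
    """
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two observations")
    a = var1 / n1  # squared standard error of group 1
    b = var2 / n2  # squared standard error of group 2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))


# With equal variances and equal group sizes the result equals n1 + n2 - 2.
print(welch_df(4.0, 10, 4.0, 10))
```

For unequal variances the value falls between the smaller of n1 - 1 and n2 - 1 and the pooled value n1 + n2 - 2.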

Supplementary File S6 describes the tests for (G+C)3 composition in connection with oxygen tolerance. In Supplementary Files S7 and S9, one-sample weighted t-test P-values are reported for the comparisons of nucleotide compositions with 0.25 and 0.5 for nucleotides and nucleotide combinations, respectively. The significance of dinucleotide contrasts (DCT) is assessed by comparing the weighted DCT averages to 1, and it is also reported in the same supplementary files.
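
The genome-size weighting used for the group averages can be illustrated as follows. This is a sketch under the assumption that each genome contributes one composition value and one weight equal to its size; the function name `weighted_mean` and the example numbers are hypothetical and are not taken from the supplementary data.

```python
def weighted_mean(values, weights):
    """Average of per-genome values with weights proportional to genome size."""
    values = list(values)
    weights = list(weights)
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total


# Two hypothetical genomes: adenine fraction 0.30 in a 3 Mb genome, 0.20 in a 1 Mb genome.
group_mean = weighted_mean([0.30, 0.20], [3_000_000, 1_000_000])
# Deviation of the group average from the even fraction 0.25 used as the null value.
print(group_mean - 0.25)
```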

RNA folding in silico

The 'RNAfold' program (49) in the 'Vienna RNA Package' was used to perform the RNA folding. Native and randomized sequences were folded using Zuker's energy minimization algorithm (50), which determines the folding free energy for the most favorable conformation from a vast number of possible simulated structures. Calculations were performed with default parameters and a temperature setting of 37°C for all organisms. The latter allowed us to detect the effect of the organismal nucleotide composition on the mRNA structure and stability. Sequences in a sliding window of 50 bases were folded (51) and their characteristics were calculated. We also checked other sliding windows (100 and 150 bases) and observed that they give qualitatively similar results. The choice of the 50-base window is justified by the previous knowledge that most known functionally important secondary structures are small and local (48), and that structures >50 bases would not normally be formed in actively translating mRNAs (48). We used the dicodon-shuffle program by Katz and Burge (48) to obtain control sets of reshuffled mRNA sequences. The base-pairing pattern [base pair frequencies at three Phases (52) of the mRNA sequence], the folding energy of folded mRNAs and their correlations with the OGT of the corresponding organisms were examined. The comparison of these features between Archaea and Bacteria was also performed. To purify the signal, i.e. to focus exclusively on the effect of the local mRNA structure, each quantity is also represented as a ratio between the natural sequence signal and the signal from dicodon-shuffled sequences (dShfld). The purine load in the loop and stem regions of the predicted mRNA secondary structures was also analyzed (21).
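
The windowing and the native-to-shuffled normalization described above can be sketched as follows. All names here are hypothetical, the window step of one base is an assumption (the step is not stated in the text), and `gc_fraction` is only a stand-in signal so that the sketch runs without a folding program; in the actual analysis the per-window signal would come from the folding output.

```python
from statistics import mean
from typing import Callable, Iterable, Iterator


def sliding_windows(seq: str, size: int = 50, step: int = 1) -> Iterator[str]:
    """Yield consecutive windows of `size` bases (step is an assumed parameter)."""
    for start in range(0, len(seq) - size + 1, step):
        yield seq[start:start + size]


def mean_window_signal(seq: str, signal: Callable[[str], float],
                       size: int = 50, step: int = 1) -> float:
    """Average a per-window signal over all windows of one sequence."""
    return mean(signal(w) for w in sliding_windows(seq, size, step))


def native_to_shuffled_ratio(native: str, shuffled: Iterable[str],
                             signal: Callable[[str], float],
                             size: int = 50, step: int = 1) -> float:
    """Ratio of the native signal to the mean signal of shuffled controls."""
    native_value = mean_window_signal(native, signal, size, step)
    control_value = mean(mean_window_signal(s, signal, size, step) for s in shuffled)
    return native_value / control_value


def gc_fraction(window: str) -> float:
    """Placeholder signal: fraction of G and C bases in the window."""
    return sum(base in "GC" for base in window) / len(window)


print(list(sliding_windows("ACGUACGU", size=4, step=2)))
```

A sequence shorter than the window yields no windows, so `mean` raises an error in that case; a real pipeline would need to decide how to treat such short genes.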

Amino acid and dipeptide composition and OGT correlation

Amino acid Z-scored predictors of the OGT (43) were derived for Archaeal and Bacterial proteomes. Additionally, we analyzed the groups of amino acids according to their physical-chemical properties [charged (D, E, K, R), hydrophobic (C, F, I, L, M, P, V, W) and polar (A, G, H, N, Q, S, T, Y)]. The dipeptide classes for the above residue types and their correlations with OGT were examined separately for Archaea and Bacteria. The dipeptide frequencies were normalized by the individual amino acid frequencies in order to exclude the compositional bias.
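
A possible form of this normalization is sketched below for the three residue classes. The exact normalization is not spelled out in the text, so the sketch assumes the observed class-dipeptide frequency is divided by the product of the two class frequencies; the function and variable names are hypothetical.

```python
from collections import Counter

# Residue classes as listed in the text.
RESIDUE_CLASS = {}
for class_name, residues in {"charged": "DEKR",
                             "hydrophobic": "CFILMPVW",
                             "polar": "AGHNQSTY"}.items():
    for residue in residues:
        RESIDUE_CLASS[residue] = class_name


def class_dipeptide_contrasts(protein: str) -> dict:
    """Observed/expected ratio for each pair of adjacent residue classes."""
    labels = [RESIDUE_CLASS.get(aa) for aa in protein.upper()]
    known = [c for c in labels if c is not None]
    # Skip pairs that contain a residue outside the 20 standard amino acids.
    pairs = [(a, b) for a, b in zip(labels, labels[1:])
             if a is not None and b is not None]
    if not pairs:
        raise ValueError("sequence has no usable dipeptides")
    single = Counter(known)
    double = Counter(pairs)
    return {
        pair: (count / len(pairs))
        / ((single[pair[0]] / len(known)) * (single[pair[1]] / len(known)))
        for pair, count in double.items()
    }


print(class_dipeptide_contrasts("ILKE"))
```

A ratio above 1 means the class pair occurs more often than expected from the composition alone.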

RESULTS

In order to understand the compositional and sequence signals of evolution and adaptation of DNA, RNA and protein macromolecules, we pursue the following strategy in the analysis. First, wherever possible, we single out compositional biases existing in these molecules. Then, we establish possible connections between the detected biases and mechanisms of adaptation. Specifically, we discuss nucleotide, dinucleotide, amino acid and dipeptide biases, their correlations with different environmental factors, and their potential role in determining and tuning molecular mechanisms of stability and adaptation in the corresponding macromolecules. We also seek to understand how biases in one type of molecule can affect, or can be affected by, the demands on the sequences/structures of others. Finally, we explore causal relationships between them in light of their evolutionary history and phylogeny. To this end, we analyze the difference in compositional/sequence biases between Archaea and Bacteria in conjunction with their mechanisms of adaptation.

Nucleotide compositions of DNA and RNA

ncDNA has significantly higher adenine and thymine (A+T) content in Archaea (Table 2 and Supplementary File S9), hinting at the role of nucleotide composition in discriminating coding DNA and ncDNA of Archaea; the same mechanism exists in eukaryotes. Deviations of nucleotide contents from even fractions differ between Archaea and Bacteria: the contents of A and C deviate significantly in Archaea, whereas in Bacteria the composition of T and G nucleotides is skewed (Table 2). There is an increased (G+C) load in t- and rRNA of Archaea and tRNA of Bacteria. Bacterial rRNA yields a higher guanine load. Both guanine and cytosine loads of t- and rRNA correlate with OGT, and the correlation is stronger in Archaea than in Bacteria (Table 2; correlation coefficients and their significance levels are given in parentheses, P-values are in Supplementary File S8). The role of the (G+C) content in thermal adaptation of t- and rRNAs is further corroborated by folding simulations. The (G+C)


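Table 1. Compositional and sequence signals in coding and ncDNA

A. Correlations between nucleotide compositions and OGT

| Codon position | Archaea: characteristic | Archaea: value | Bacteria: characteristic | Bacteria: value |
|---|---|---|---|---|
| 1 | A_Nat/NCB | R=0.69 | A_Nat/NCB | R=0.28+ |
| | (A+T)_Nat/NCB | R=0.68 | (A+T)_Nat/NCB | R=0.29 |
| | (A+G)_Nat/NCB | R=0.71 | (A+G)_Nat/NCB | R=0.34 |
| | (A+C)_Nat/NCB | R=0.71 | (A+C)_Nat/NCB | R=0.36 |
| 2 | T_Nat | R=0.55 | T_Nat | R=0.37 |
| | T_NCB | R=0.55 | T_NCB | R=0.37 |
| | G_Nat/NCB | R=0.60 | (T+G)_Nat | R=0.36 |
| | (T+G)_Nat | R=0.72 | (T+G)_NCB | R=0.42 |
| | (T+G)_NCB | R=0.68 | | |
| | (G+C)_Nat/NCB | R=0.60 | | |
| 3 | A_Nat | R=0.02 | A_Nat | R=0.13 |
| | A_NCB | R=0.70 | A_NCB | R=0.76 |
| | G_Nat | R=0.22 | (A+G)_Nat | R=0.60 |
| | G_NCB | R=0.55 | (A+G)_NCB | R=0.43 |
| | (A+G)_Nat | R=0.56 | (A+G)_Nat/NCB | R=0.11 |
| | (A+G)_NCB | R=0.67 | | |

B. Position-independent dinucleotide contrasts

| Sequence type | Archaea: characteristic | Archaea: value | Bacteria: characteristic | Bacteria: value |
|---|---|---|---|---|
| ncDNA | ApA/TpT | 1.15 | ApA/TpT | 1.26 |
| | CpC/GpG | 1.24 | GpC | 1.25 |
| | GpT/ApC | 0.82 | | |
| cDNA | | | ApA | 1.25 |
| | | | ApA_NCB | 1.06 |
| | | | TpT | 1.20 |
| | | | TpT_NCB | 1.13 |
| | | | GpC | 1.24 |
| | | | GpC_NCB | 1.07 |
| tRNA | ApA | 1.32 | ApG | 1.28 |
| | CpC | 1.26 | TpC | 1.35 |
| | TpC | 1.26 | | |
| | ApC | 0.52 | | |
| | TpG | 0.70 | | |
| rRNA | ApA | 1.25 | ApA | 1.15 |
| | CpC | 1.27 | CpC | 1.19 |

C. Correlations of position-independent dinucleotide contrasts (DCT) with OGT

| Sequence type | Archaea: characteristic | Archaea: value | Bacteria: characteristic | Bacteria: value |
|---|---|---|---|---|
| ncDNA | CpT | R=0.63 | GpG | R=0.49 |
| | ApG | R=0.64 | | |
| cDNA | CpT_Nat | R=0.66 | GpG_Nat | R=0.40 |
| | CpT_NCB | R=0.57 | GpG_NCB | R=0.09 |
| | CpT_Shfld | R=0.57 | GpG_Shfld | R=0.32 |
| | ApG_Nat | R=0.80 | | |
| | ApG_NCB | R=0.59 | | |
| | ApG_Shfld | R=0.78 | | |
| tRNA | ApA | R=0.82 | ApA | R=0.45 |
| | TpA | R=0.72 | | |
| | TpT | R=0.80 | | |
| | GpG | R=0.44 | | |
| | CpC | R=0.80 | | |
| rRNA | ApA | R=0.92 | ApA | R=0.67 |
| | TpA | R=0.88 | CpC | R=0.52 |
| | GpG | R=0.72 | | |
| | CpC | R=0.83 | | |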

D Position-dependent dinucleotide composition ratios (Freq) dinucleotide contrasts and their OGT correlations

Codon position 1-2Freq(ApG)NatNCB 135na Freq(CpG)NatNCB 127na

Freq(CpG)NatNCB 062na

ApG 102 TpT 145ApGNCB 079 TpTNCB 149CpG 076 TpA 069CpGNCB 117 TpANCB 065

GpT 067GpTNCB 067

ApG R=077ApGNCB R=011GpC R=055GpCNCB R=044CpC R=037+

CpCNCB R=037+

ApGNatNCB R=063GpCNatNCB R=060CpCNatNCB R=037+

RpR R=040 RpR R=033RpRNCB R=015 RpRNCB R=027+

YpY R=034 YpY R=039YpYNCB R=005+ YpYNCB R=033RpRNatNCB R=059 RpRNatNCB R=037YpYNatNCB R=064 YpYNatNCB R=045

Codon position 2-3TpA R=063 ApA 150TpANCB R=028 ApANCB 105CpT R=061 GpC 133CpTNCB R=068 GpCNCB 098ApG R=069 ApCNatNCB R=052ApGNCB R=037+ GpANatNCB R=041

RpR R=020 RpR R=051RpRNCB R=065 RpRNCB R=073YpY R=028 YpY R=058YpYNCB R=066 YpYNCB R=075

RpRNatNCB R=039YpYNatNCB R=043

Codon position 3-1CpT R=060CpTShfld R=005ApG R=053ApGShfld R=002

RpR R=030+ RpR R=048RpRShfld R=020 RpRShfld R=001YpY R=035+ YpY R=056YpYShfld R=006 YpYShfld R=017

Pearson correlation coefficient is denoted by R. DCT is calculated as the ratio of the dinucleotide frequency to the product of the frequencies of the corresponding independent nucleotides. Nucleotides with purine (A or G) and pyrimidine (T or C) bases are denoted by R and Y, respectively. The lower index distinguishes the values observed for natural sequences (Nat), sequences with eliminated codon bias (NCB), and values observed after shuffling of amino acid sequences (Shfld). If the lower index is omitted, the value is given for the natural sequences. The P-values for correlations and for the dinucleotide contrast t-tests (H0: DCT is 1.0) are shown in superscripts as significance levels: +P-value < 0.05, < 0.01, < 0.0001. Supplementary Files S5 and S7 list all P-values for position-specific correlations, SD for compositions and the P-values for t-tests. Supplementary Files S8 and S9 show the P-values for position-independent contrast t-tests and correlations.


content increases and the energy per base pair decreases with temperature in stems of folded t- and rRNAs (Table 3). These data point to base-pairing interactions as an important contributor to thermal stabilization (53) of t- and rRNAs of prokaryotes, and this effect is more strongly manifested in Archaea than in Bacteria.

Dinucleotide composition of nucleic acid sequences

Dinucleotide contrasts (DCTs) show the ratio of observed dinucleotide frequencies to the expected ones, given the composition of individual nucleotides. Coupling of the same nucleotides (Table 1) is preferred in ncDNA sequences in Archaea (ApA/TpT and CpC/GpG) and Bacteria (ApA/TpT). ApA and CpC dinucleotides are prevalent in t- and rRNA of Archaea and rRNA of Bacteria. Other outstanding contrasts in Archaea are found in ncDNA (GpT/ApC) and tRNA (TpC, ApC and TpG). In Bacteria, ApG and TpC are prevalent in tRNA (Table 1), while GpC is preferred in both non-coding and coding sequences. In coding sequences of Archaea there is no preference for coupling of similar nucleotides, while in Bacteria it is found for ApA and TpT (Table 1).

In both coding and ncDNA sequences of Archaea there is a clear excess of complementary ApG and CpT dinucleotides, which persists after elimination of the codon bias and reshuffling of the amino acid sequence (Table 1). In Bacteria, the excess of GpG dinucleotides in non-coding and coding sequences is most pronounced (Table 1); in coding sequences it is provided by the codon bias. Pairing of similar nucleotides is highly correlated with OGT in tRNA (ApA, TpT, GpG and CpC) and rRNA (ApA, GpG and CpC) in Archaea. Correlation for the same nucleotide pairs is also found in Bacteria, though it is weaker and holds for fewer pairs (tRNA: ApA; rRNA: ApA and CpC). In Archaeal t- and rRNA there is also a strong correlation of TpA pairs with OGT (Table 1).
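
The dinucleotide contrast defined above (observed dinucleotide frequency divided by the product of the two single-nucleotide frequencies) can be computed with a short sketch like the one below. The function name is hypothetical, and the sketch treats a single strand of one sequence; it does not merge complementary dinucleotides or apply genome-size weights.

```python
from collections import Counter


def dinucleotide_contrasts(seq: str) -> dict:
    """Observed/expected ratio for every dinucleotide present in `seq`."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two nucleotides")
    mono = Counter(seq)
    # Overlapping dinucleotides: positions (0,1), (1,2), ...
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono = len(seq)
    n_di = len(seq) - 1
    return {
        pair: (count / n_di)
        / ((mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono))
        for pair, count in di.items()
    }


print(dinucleotide_contrasts("ACAC"))
```

A contrast above 1 indicates that the dinucleotide is more frequent than expected from the nucleotide composition, and a contrast below 1 that it is avoided.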

Correlation between nucleotide frequencies and temperature in different codon positions

The first codon position in Archaea is characterized by a high correlation of the ratio of natural to NCB frequencies of adenine with OGT [A_Nat/NCB, R=0.69; Table 1]. The second codon position reveals a correlation of the thymine frequency with OGT (Table 1). The combination of thymine and guanine nucleotides is also correlated with OGT; however, thymine is the major contributor (Table 1). The correlation of the guanine content with OGT is supported by the codon bias (Table 1). The combination of guanine with cytosine is also correlated with OGT in the second position of Archaeal sequences and is provided by the codon bias. The third position reveals strong selection against adenine and guanine in relation to OGT (Table 1). Bringing together the (anti-)correlations with OGT in different codon positions, one can draw the combined triplet that is optimal from the point of view of thermal adaptation (Table 4) as [A]1 [TG]2 [non-(AG)]3. Prevalence of thymine in the second codon position is linked to an excess of hydrophobic residues. Indeed, codons with thymine in the second position encode Ile, Leu, Met, Phe and Val. These are strongly hydrophobic residues and the aromatic Phe, which can form many van der Waals contacts and contribute

Table 2. Nucleotide compositions and their OGT correlations in DNA and RNAs

| Domain of life | Nucleotide | cDNA_Nat (%) | cDNA_NCB (%) | ncDNA (%) | tRNA (%) (R) | rRNA (%) (R) |
|---|---|---|---|---|---|---|
| Archaea | A | 28.45 | 27.94 | 30.73 | 17.13 (−0.76) | 23.68 (−0.85) |
| | T | 23.89 | 24.44 | 30.65 | 18.17 (−0.89) | 19.24 (−0.88) |
| | G | 26.00 | 26.12 | 19.30 | 33.91 (0.84) | 32.08 (0.94) |
| | C | 21.66 | 21.51 | 19.31 | 30.80 (0.84) | 24.99 (0.76) |
| Bacteria | A | 24.05 | 26.42 | 26.57 | 19.66 (−0.44) | 26.09 (−0.52) |
| | T | 22.68 | 23.70 | 26.61 | 21.56 (−0.51) | 20.68 (−0.77) |
| | G | 27.50 | 26.66 | 23.41 | 31.21 (0.53) | 31.10 (0.72) |
| | C | 25.77 | 23.23 | 23.41 | 27.58 (0.39) | 22.13 (0.60) |

The numbers represent the average frequencies of nucleotides in the corresponding parts of genomes, while the numbers in parentheses are correlation coefficients (R) of nucleotide frequencies with OGT. The most important biases and correlations are shown in bold font. The P-values for correlations (H0: correlation coefficient R=0) and for the nucleotide composition t-tests (H0: mean frequency is 0.25) are shown in superscripts as significance levels: P-value < 0.01, < 0.0001. Supplementary Files S8 and S9 list all correlations and composition tests. cDNA_Nat, natural nucleotide composition in coding DNA; cDNA_NCB, nucleotide composition in coding sequences with eliminated codon bias.

Table 3. OGT correlations in r- and tRNA observed in folding simulations

| Domain of life | RNA type | R((G+C), OGT) (P-value) | R(<Ebp>, OGT) (P-value) |
|---|---|---|---|
| Archaea | rRNA | 0.89 (<10^−22) | −0.93 (1.1 × 10^−20) |
| | tRNA | 0.84 (1.84 × 10^−13) | −0.71 (3.34 × 10^−8) |
| Bacteria | rRNA | 0.73 (1.38 × 10^−14) | −0.66 (1 × 10^−11) |
| | tRNA | 0.53 (3.1 × 10^−7) | −0.51 (9.1 × 10^−7) |

R((G+C), OGT), correlation coefficient between the (G+C) content and the OGT; R(<Ebp>, OGT), correlation coefficient between the free energy of RNA folding averaged per base pair and the OGT.


thus to the packing of the hydrophobic core. Notably, elimination of the codon bias does not affect the correlation of the thymine fraction with OGT (R=0.55 in both native and NCB sequences). This shows that the excess of thymine in the second codon position is a result of selection on the protein level. An apparent explanation for such selection is the domination of the structure-based strategy in thermostabilization of Archaeal proteins. This strategy is characterized by the increased compactness of the hydrophobic core provided by massive van der Waals contacts. The correlation of the combination of A and G with OGT is not provided by tuning of the codon bias, which points to adaptation on a level other than DNA.
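
The link between thymine in the second codon position and hydrophobic residues can be checked directly against the standard genetic code. The sketch below builds the standard codon table from the conventional TCAG ordering; the helper name `residues_with_second_base` is hypothetical.

```python
# Standard genetic code in the conventional TCAG ordering ("*" marks stop codons).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def residues_with_second_base(base: str) -> set:
    """One-letter codes of amino acids encoded by codons with `base` in position 2."""
    return {aa for codon, aa in CODON_TABLE.items()
            if codon[1] == base and aa != "*"}


# Codons with T in the second position: Phe (F), Leu (L), Ile (I), Met (M), Val (V).
print(sorted(residues_with_second_base("T")))
```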

Bacterial coding DNA sequences yield fewer position-dependent correlations with OGT than Archaeal ones. The ratio of the natural adenine frequency to the one with eliminated codon bias is weakly correlated with OGT (Table 1). There are also moderate correlations of the thymine frequency and of the thymine and guanine combination with OGT in the second codon position (Table 1). The third codon position is characterized by selection against adenine provided by the codon bias (Table 1). However, the combination of adenine and guanine is correlated with OGT. It does not depend on the codon bias to a large extent (same as in Archaea), pointing to a possible signal of adaptation on a level other than the protein-coding DNA sequence. The generalized codon in Bacteria (Table 4) characterizing the thermophilic trends therefore reads [weak A]1 [weak T]2 [non-(AG)]3.

The role of excessive (G+C) load in the third codon position in adaptation to aerobicity

In both Archaea and Bacteria, the average nucleotide composition in different codon positions is not affected by the codon bias to a large extent, except for the third codon position in Bacteria. However, the compositional variance in the third position is essentially destroyed when there is no codon bias [standard deviation (SD) diminishes from >10 to <1; see Supplementary File S5]. The natural (G+C)3 load in Bacteria yields a 9% excess in comparison with NCB sequences (P-value = 1.7e−07, Supplementary File S6), while in Archaea there is no significant difference. We found that, despite frequent involvement of G3 and C3 nucleotides in the GC/CG pairs, the (G+C)3 load does not play a crucial role in the thermal stability of mRNA (52,54). Overall, the difference in (G+C)3 load between Archaea and Bacteria is insignificant. If we consider separate thermal groups, the only difference appears in mesophilic Bacteria (9.5% more in Bacteria, P = 0.09). However, when we take into account the oxygen tolerance factor, the difference in (G+C)3 load becomes extremely pronounced between aerobic (70.13% in natural and 50.84% in NCB) and anaerobic (48.94% in natural and 50.46% in NCB) species (P-value = 2e−09). Considering the lack of OGT correlation in protein-coding nucleotides (except thymine in the second codon position), one can conclude that the increased (G+C)3 load is a result of the aerobic life style rather than thermal adaptation (Supplementary Files S2 and S3). The comparison of aerobes and anaerobes in the group of mesophilic Bacteria corroborates this conclusion (Supplementary File S6). The difference is >20% of (G+C)3 content: 70.97% in aerobes and 50.51% in anaerobes (P = 0.000178). This adaptation is entirely driven by the codon bias, because if the codon bias is removed, the (G+C)3 bias as well as the difference between aerobes and anaerobes disappears (50.87% and 50.64%).

We also compared the usage of codons with G or C in the third position with other codons ('Codon Usage' in Supplementary Files S3 and S6). There are three amino acids encoded by six codons (Leu, Arg, Ser), five by four codons (Ala, Pro, Thr, Gly, Val), one by three codons (Ile), nine by two codons (Lys, Asn, Asp, Phe, Cys, Gln, Glu,

Table 4. Generalized nucleotides and dinucleotides in different codon positions favorable for thermostability

Nucleotides correlated with OGT

| Domain of life | | Codon position 1 | Codon position 2 | Codon position 3 |
|---|---|---|---|---|
| Archaea | Nucleotide | A | TG | Non-[AG] |
| | Origin of the bias | Codon bias | T: amino acid; G: codon bias | Against codon bias |
| Bacteria | Nucleotide | Weak A | Weak T | Non-[AG] |
| | Origin of the bias | Codon bias | Amino acid | Against codon bias |

Dinucleotides in purine (R) and pyrimidine (Y) notation correlated with OGT

| Domain of life | | Codon position 1-2 | Codon position 2-3 | Codon position 3-1 |
|---|---|---|---|---|
| Archaea | Dinucleotide | RpR, YpY | Non-RpR, non-YpY | RpR, YpY |
| | Origin of the bias | Codon bias | Codon bias | Amino acid |
| Bacteria | Dinucleotide | RpR, YpY | Non-RpR, non-YpY | RpR, YpY |
| | Origin of the bias | Not codon bias | Codon bias | Amino acid |

Part 1: thermophilic-prone nucleotide biases. Columns 1, 2, 3 contain information on favorable nucleotides and the origin of the bias in codon positions 1, 2, 3. Part 2: thermophilic-prone dinucleotide biases. Columns 1-2, 2-3, 3-1 contain information on favorable nucleotides and the origin of the bias in codon positions 1-2, 2-3, 3-1.


His, Tyr) and two by one codon (Met, Trp). Typically, half of the synonymous codons of each amino acid have G or C in the third codon position, and the others have A or T. It appeared that almost all of the codons with G or C in the third position have higher frequencies in aerobes compared to anaerobes, and in Bacteria compared to Archaea (see 'Codon Usage' in Supplementary File S3). This observation holds for all codons of the 'two-codon' amino acids and for the majority of codons of the three-, four- and six-codon amino acids. The noticeable exception is the significant suppression of the AGA/AGG codons of Arg (aerobes–anaerobes 177–125, P-values 446e–858e–8; Bacteria–Archaea 174–2625, P-values 165e–7114e–11). In lysine, the codon AAA is suppressed in aerobes (difference 1716, P-value 6.44e−6), while the codon AAG is preferred. The AAG codon of lysine is favorable in aerobes (aerobes/anaerobes 143), and the other lysine codon, AAA, can be turned into AAG by only one mutation in the third codon position. Therefore, one can speculate that the demand for discriminating between Arg and Lys is the most probable cause for the suppression of the AGA and AGG codons of arginine in aerobes. Lysine is thus preferred over arginine in aerobic adaptation (Supplementary Files S3 and S6). The contribution of the amino acid to thermophilic adaptation is not compromised, because of the similarity between the physical–chemical characteristics of lysine and arginine from the point of view of thermostability. Overall, an excess of the G3 codons is always more pronounced in aerobes compared to anaerobes, which corroborates the conclusion that the (G+C)3 load is an indication of the aerobic life style (55–58). The similar trend in the difference between Bacteria and Archaea is a result of the domination of the aerobic life style in Bacteria in the analyzed dataset.
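
The (G+C)3 load discussed in this section is the share of codons that carry G or C in the third position. A minimal sketch of that calculation for a single coding sequence follows; the function name `gc3_content` is hypothetical, and the sketch assumes the sequence starts in frame at codon position 1 and ignores any incomplete trailing codon.

```python
def gc3_content(cds: str) -> float:
    """Percentage of codons with G or C in the third codon position."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # drop an incomplete trailing codon
    if usable == 0:
        raise ValueError("sequence must contain at least one full codon")
    third_bases = cds[2:usable:3]
    return 100.0 * sum(base in "GC" for base in third_bases) / len(third_bases)


# Codons ATG, AAG, AAA, TAA: third bases G, G, A, A.
print(gc3_content("ATGAAGAAATAA"))
```

Per-genome values computed this way could then be averaged within the aerobic and anaerobic groups to compare the two life styles.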

Correlation between dinucleotide frequencies in different codon positions and OGT

In order to analyze dinucleotides in different codon positions (1-2, 2-3 and 3-1), we used the contrasts for natural (DCT_Nat), NCB (DCT_NCB, for positions 1-2 and 2-3) and shuffled (DCT_Shfld, for position 3-1) sequences. We considered correlations of these contrasts with OGT. The most pronounced biases are found for dinucleotides in coding sequences of Archaea. The ApG dinucleotides show the strongest excess among dinucleotides in all positions of natural sequences compared to NCB and shuffled ones. In codon positions 1-2 and 2-3, this compositional peculiarity and its correlation with OGT are provided by the codon bias (Table 1). In position 3-1, the high correlation of ApG frequencies with OGT is determined by the amino acid sequences, and it vanishes after amino acid reshuffling (Table 1). There is also a correlation of the frequency of the complementary CpT dinucleotide in position 3-1 with OGT, which disappears after reshuffling of amino acid sequences (Table 1). The most plausible role of ApG dinucleotides is a contribution to stabilization via base-stacking interactions between the purine rings (22). This conclusion is supported by the fact that the excess of ApG dinucleotides is provided by the codon bias, thus manifesting adaptation on the DNA level. The complementary CpT dinucleotides indicate enrichment of the anti-sense strand of double-stranded DNA (dsDNA) with the stabilizing ApG ones. The resulting mosaic of ApG stacking in sense and anti-sense strands can provide stabilization over long distances without compromising the flexibility of the dsDNA. Other overrepresented dinucleotides (regardless of the codon bias) correlated with OGT in positions 1-2 are GpC, CpC, TpA and CpT (Table 1). The frequencies of TpA and CpT dinucleotides are correlated with OGT in position 2-3, where the former is provided by the codon bias and the latter is not.

In purine (R) and pyrimidine (Y) notation, codon bias provides grouping (stacking) of similar nucleotides in position 1-2, which increases with OGT (Table 1). At the same time, codon bias works against such grouping in position 2-3 (Table 1). The amino acid sequence is responsible for the grouping of similar nucleotides and the correlations with OGT in position 3-1 (Table 1). The resulting generalized thermophilic pattern of dinucleotides in Archaeal cDNA sequences reads [RpR, YpY]1-2 [non-RpR, non-YpY]2-3 [RpR, YpY]3-1. Most of the compositional peculiarities found in Bacteria are not determined by the codon bias, yielding weak correlations with temperature (Table 1). The thermophilic pattern of dinucleotides in Bacteria (Table 4) is the same as the Archaeal one. The difference from Archaea is that selection for RpR and YpY dinucleotides in position 1-2 is not determined by the codon bias in Bacteria (Tables 1 and 4).
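
The position-dependent purine/pyrimidine grouping can be illustrated with a short sketch that classifies every dinucleotide step of a coding sequence as RpR, YpY, RpY or YpR and assigns it to codon positions 1-2, 2-3 or 3-1. The function name is hypothetical, and the sketch assumes the sequence starts in frame at codon position 1; it reports plain fractions, not the contrasts or OGT correlations of Table 1.

```python
from collections import Counter

PURINES = set("AG")
NUCLEOTIDES = set("ACGTU")


def ry_steps_by_codon_position(cds: str) -> dict:
    """Fractions of RpR, YpY, RpY and YpR steps at codon positions 1-2, 2-3, 3-1."""
    cds = cds.upper()
    labels = ("1-2", "2-3", "3-1")
    counts = {label: Counter() for label in labels}
    for i in range(len(cds) - 1):
        first, second = cds[i], cds[i + 1]
        if first not in NUCLEOTIDES or second not in NUCLEOTIDES:
            continue  # skip ambiguous symbols such as N
        step = (("R" if first in PURINES else "Y") + "p"
                + ("R" if second in PURINES else "Y"))
        # Index 0 is codon position 1, so i % 3 selects 1-2, 2-3 or 3-1.
        counts[labels[i % 3]][step] += 1
    return {
        label: {step: n / sum(counter.values()) for step, n in counter.items()}
        for label, counter in counts.items() if counter
    }


print(ry_steps_by_codon_position("AGCAGC"))
```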

Compositional and sequence biases observed in folding simulations of mRNA

Though RNA molecules have different overall structures, one may expect that there are some common mechanisms providing stability and function of the folded t-, r- and mRNAs. We have shown above that the (G+C) content apparently contributes to the stability of the t- and rRNA (36,59) and not to the stability of coding DNA (54,59,60). It is manifested in the correlation of the former with OGT (Table 2) and the anti-correlation of the base-pairing energy in the folded structures with OGT (Table 3). Nucleotide compositions in coding DNA (hence in mRNA as well) do not correlate with OGT. The mRNA case is of special interest, because the redundancy of the genetic code endows the nucleotide sequence with a potential to satisfy requirements for DNA, mRNA and protein stability. For example, it has been hypothesized (61) that the optimization of the base-pairing in the mRNA contributes to the formation of stem fragments. It has been claimed that there is a corresponding periodic pattern of the mRNA secondary structure in human and mouse, which is determined chiefly by the selection that operates on the third codon position (52). An opposing opinion suggests that the secondary structure in mRNA interferes with translation and therefore should be avoided (62). Overall, three scenarios of the relation between mRNA and proteins have been considered earlier (52): (i) the biases are determined by the demands on protein stability; (ii) mRNA stability is the major determinant of sequence biases; (iii) complete independence of the sequence biases related to mechanisms of stability on each level. We have computationally folded the mRNA sequences from 244 prokaryotic genomes and analyzed their characteristics


(Table 5). All the parameters are reported as relative ones, where the signal in the native sequence is normalized by the value obtained for dicodon-shuffled ones (48). The dicodon shuffling randomizes the mRNA sequence while preserving the native protein sequence, native codon usage and native dinucleotide composition. Therefore, it allows one to separate the peculiarities of mRNA caused by the demands on its stability and function from the others related to the coding DNA and proteins.

The structure of stems in folded mRNA is characterized by phases, which show the relative location of triplets in the opposite strands of the stem. Specifically, Phases I, II and III correspond to the positioning of triplets where, respectively, the first, second and third nucleotides are complementary. The most and the least frequent pairs in all phases of the dsDNA (see Materials and Methods section) are considered. Phases I and II are similar to each other, yielding CG, GC, GU, UG, AU and UA (Table 5). Overall, the Archaea and thermophilic Bacteria have similar trends in the most and least frequent pairs. Mesophilic Bacteria stand out with A13U and G33C as the most frequent pairs in Phases II and III, respectively. The major contribution from the three pairing phases (52) to the total amount of stem pairs is provided by Phase III, and nucleotides in codon position 3 are the most frequent in stem pairs (Table 5 and Supplementary Table S1). OGT correlations highlight the contribution of the GU wobble pair to the thermostability of archaeal mRNA structures (Table 5 and Supplementary Table S1). We found the highest OGT correlation for the Archaeal pairs U32G (Table 5), confirming the importance of the GU wobble pairs. The stronger effect for Archaeal rather than for Bacterial mRNA is possibly a relic of the ancient RNA world, where RNA was the carrier of genetic information and harsh environmental conditions demanded its increased stability.

Table 6 shows that purine load (the contents of A+G, R/Y and ApG) is larger in the loop regions than in the stems. In Archaea, this difference is slightly more pronounced than in Bacteria (Supplementary Table S2). The purine load is also in good positive correlation with OGT for the loops, but not for the stems. However, the amount of ApG dinucleotides is correlated with OGT in both loops and stems. For synonymous codons, the fractions of purine-rich codons (e.g. GGR versus GGY for glycine, and AGR versus CGY for arginine) are correlated with OGT in both loop and stem regions (slightly stronger in loops). For non-synonymous codons, however, the amount of purine-rich codons (GAR for glutamate) is correlated with OGT in both loops and stems, while fractions of purine-rich AAR codons for lysine do not correlate with OGT (Table 6). While Forsdyke et al. (21) reported an increase of the purine content and its positive correlation with OGT in loops, we found that the fraction of purine-rich synonymous codons is correlated with OGT in stems as well. At the same time, the amount of non-synonymous purine-rich codons of lysine is not correlated with OGT in either loop or stem regions. Additionally, the amount of ApG dinucleotides is correlated with OGT in both loops and stems (Table 6).
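The loop-versus-stem comparison of purine load can be illustrated with a short sketch. It assumes that the folded structure is available in dot-bracket notation, where unpaired positions (".") are treated as loops and paired positions as stems; the function name and the toy input are hypothetical.

```python
def purine_load(seq, dot_bracket):
    """Return the (loop, stem) fractions of purines (A or G).

    Unpaired positions ('.') count as loop, paired positions as stem.
    """
    loop = [b for b, s in zip(seq, dot_bracket) if s == "."]
    stem = [b for b, s in zip(seq, dot_bracket) if s != "."]

    def fraction(bases):
        # NaN signals an empty region rather than a zero purine load.
        return sum(b in "AG" for b in bases) / len(bases) if bases else float("nan")

    return fraction(loop), fraction(stem)

# Toy hairpin with an all-purine loop.
loop_load, stem_load = purine_load("GGGAAAUCC", "(((...)))")
```

In a genome-wide analysis these fractions would be accumulated over all folded segments before being compared between regions or correlated with OGT.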

Signals of thermophilic adaptation in protein sequences

It has been shown earlier (22) that environmental temperature directly affects the amino acid composition of prokaryotic proteins, and the IVYWREL combination can serve as a predictor of the OGT in prokaryotes. We use here a 'Z-score' predictor of OGT (43), which takes into account differences in variances of the proteomic frequencies of individual amino acids (Table 7). We have shown that the Z-score predictor properly corrects for these differences (43), better reflecting the contribution of the amino acid combinations to thermal adaptation. Separate predictors for Archaea and Bacteria (based on 33 and 99 proteomes with OGTs spread over 5–100 and 10–85°C, respectively) yield the same general trend in the increase of hydrophobic and charged residues at the

Table 5. Compositional and dicodon signals of mRNA adaptation purified by the dicodon shuffling

Characteristic                        Archaea                        Bacteria

The most and least (after the comma) frequent pairs in stems of mRNA
Mesophiles
  Phase I                             C23G, G23U                     G32U, U32G
  Phase II                            C31G, G13U                     A13U, G13U
  Phase III                           A33U, U33G                     G33C, U33G
Thermophiles
  Phase I                             G32C, G23U                     C23G, G23U
  Phase II                            C22G, G13U                     C31G, G13U
  Phase III                           A33U, U33G                     U33A, U33G

Correlations with OGT of
  Segment energy <Esg>                R=–0.71                        R=–0.39
  Energy per base pair <Ebp>          R=–0.73                        R=–0.54

The most significant correlations with OGT
                                      Phase I U32G R=0.77            (U21G)Nat PIII R=0.46
                                      Phase I G23U R=0.71            (G12U)Nat PIII R=0.45
                                      Phase I All pairs R=0.7        (U21G)dShfld PIII R=0.45
                                                                     (G12U)dShfld PIII R=0.41

Phases I, II and III correspond to positioning of triplets where, respectively, first, second and third nucleotides are complementary. All signals (except OGT correlations in Bacteria) are normalized by corresponding values for control sequences after the dicodon shuffling. See explanations of abbreviations in the Materials and methods section.

expense of polar ones (34), showing minor differences in the presence of individual residues.

We have shown earlier (22) that the amino acid bias working in thermal adaptation of proteins does not depend on the overall nucleotide composition of coding DNA sequences. However, one can still expect that nucleotide compositions of particular codon positions and/or dinucleotide compositions of positions 1-2 and 2-3 in codons may be somehow linked to amino acid biases. We indeed found a significant difference between nucleotide load in individual codon positions of Archaeal and Bacterial coding DNA sequences (Table 1). First, Archaeal protein-coding sequences yield a strong correlation between (T+G)2 load in the second codon position and OGT (r=0.72). The corresponding non-codon-biased sequences show a comparable correlation (r=0.68), indicating that the (T+G) load reflects tuning of the amino acid composition in connection to thermal adaptation. The main contributor to this correlation is thymine (Table 1), which provides an increase of strongly hydrophobic residues LMFIV (Supplementary File S1). The (T+G) load is much more weakly correlated with OGT in Bacterial sequences (r=0.36 for natural and r=0.42 for NCB sequences). The correlation of the thymine load with OGT is 0.37 for both Natural (NAT) and Non-codon-biased (NCB) sequences. This observation agrees with the presumed prevalence of the structure-based strategy of thermal adaptation in Archaeal proteins (29), contrary to the sequence-based one in Bacterial proteomes. Specifically, there is a clear correlation between amounts of charged residues and the OGT in the set of Bacterial proteomes (Table 7), presumably reflecting the domination of the sequence-based strategy in thermophilic adaptation of Bacteria (29).

Finally, we checked if there exists any specific connection between the dipeptide composition of proteins and that of 3-1 dinucleotides. We found (data not shown) that the strongest correlations between dinucleotides 3-1 and dipeptides exist mostly for the dinucleotides with guanine in position 3. The most frequent pair of amino acids contains methionine (Met, encoded exclusively by the AUG codon) as the first and any polar/charged amino acid as the second one. The preference of a polar/charged amino acid to be next to Met can be explained by the typically surface location of the protein N-termini and its role as a signal for ubiquitination (63).
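The positional nucleotide load and its correlation with OGT, as used for the (T+G) load in the second codon position, can be sketched in a few lines. The sketch assumes in-frame coding sequences written in the DNA alphabet; the function names and the toy data are hypothetical and do not reproduce the values of Table 1.

```python
from math import sqrt

def positional_load(cds, position, bases):
    """Fraction of codons whose nucleotide at `position` (1, 2 or 3) is in `bases`."""
    picked = cds[position - 1::3]
    return sum(b in bases for b in picked) / len(picked)

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Toy input (made up): one concatenated coding sequence and one OGT per genome.
genomes = {"g1": ("ATGGTCAAA", 37.0), "g2": ("ATGCTGTTT", 60.0), "g3": ("ATGAAACAA", 25.0)}
loads = [positional_load(cds, 2, "TG") for cds, _ in genomes.values()]
ogts = [ogt for _, ogt in genomes.values()]
r = pearson(loads, ogts)
```

Running the same calculation on sequences with eliminated codon bias would show whether a correlation originates from the amino acid composition or from the codon choice.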

DISCUSSION

We will make a brief overview of the most pronounced biases and causal relationships in DNA, RNA and proteins. The major signal in nucleotide compositions, specifically the (G+C) load, is related to thermal stabilization of t- and rRNA (36,54,55,59,60) and to discrimination between the coding and non-coding sequences in Archaea (Table 2). Comparative analysis of nucleotide and amino acid compositions in relation to thermophilic adaptation prompts us to conclude that the (G+C) content does not contribute to thermostabilization of coding DNA (18,22,36,54,59), nor does it affect amino acid composition and its thermophilic trends (15,22). When

Table 6. Purine loading in loop and stem regions of folded mRNA and its OGT correlation

Feature             Loop                              Stem                              L-v-S P-value
                    Mean content   OGT correlation    Mean content   OGT correlation
A+G                 0.560           0.59              0.500          –0.26              <2.2E–16
R/Y                 1.299           0.61              1.002          –0.26              <2.2E–16
ApG                 0.061           0.79              0.051           0.50              0.0002
GGR (glycine)       0.027           0.62              0.045           0.42              1.2E–17
GGY (glycine)       0.017          –0.05              0.079          –0.23              1.0E–42
AGR (arginine)      0.035           0.72              0.022           0.56              2.4E–10
CGR (arginine)      0.018          –0.22              0.040          –0.24              7.1E–09
CGY (arginine)      0.016          –0.25              0.037          –0.26              6.2E–10
GAR (glutamate)     0.060           0.71              0.036           0.59              <2.2E–16
AAR (lysine)        0.086           0.22              0.017           0.11              <2.2E–16
GAY (aspartate)     0.037          –0.17              0.040          –0.37              0.0002

Feature: analyzed nucleotide, dinucleotide or amino acid. Loop and stem: information on mean content and OGT correlation of the above. L-v-S: a comparison between corresponding contents in the loop and stem regions by Wilcoxon tests; P-values are shown in this column. Correlation coefficients with OGT are shown. Significance levels: P-value < 0.01, P-value < 0.0001.

Table 7. Signals of thermophilic adaptation in protein sequences of Archaea and Bacteria

Correlation with OGT (cells listed in the order Archaea; Bacteria)
Z-scored predictor: ILVW Y DKR (R=0.93); IPV Y EKR (R=0.89)
The most abundant residues in >70% (>60%) of all Z-scored predictors: VIWL Y ER; VPI Y ER(K)
Individual amino acids: + L W; + E; – T Q D
Types of amino acids: + h; + h c; – p; – p
Dipeptides: + hp ph; + cc; – cp pc; – pc

+, increase of the amino acid (or amino acid type) fraction with OGT; –, decrease of the amino acid (or amino acid type) fraction with OGT. Capital letters are names of amino acids; h, p and c are hydrophobic, polar and charged types of amino acids. Correlation coefficients between the Z-scored thermostability predictors and OGT are given in parentheses.

separate codon positions are considered for coding DNA, the only compositional bias observed for nucleotide compositions is an excess of (G+C) load in the third codon position in Bacteria. There is a significant excess of guanine and cytosine nucleotides (29.2 and 32.82%) compared to the NCB nucleotide contents (26.09 and 24.66%). A plausible explanation for this bias is the role of the (G+C)3 load in the adaptation to the aerobic life style, which dominates in Bacteria. The (G+C)3 load can be advantageous for several complementary reasons. Indeed, additional GC base pairing will contribute to the stability of the double helix of DNA and of stem regions of RNA molecules. Moreover, G3 bases can work as scavengers of oxidizing agents, providing protection for G bases in other codon positions (56–58). At the same time, this bias would not lead to any changes in amino acid composition, leaving protein structure and stability intact (55).
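The comparison of the natural (G+C)3 load with a codon-bias-free expectation can be sketched as below. The sketch assumes that eliminating codon bias means using all synonymous codons of an amino acid with equal probability; this is an illustrative assumption and may differ from the NCB procedure defined in the Materials and methods. The standard genetic code is built from the usual TCAG ordering, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

SYNONYMS = defaultdict(list)
for codon, aa in CODON_TO_AA.items():
    SYNONYMS[aa].append(codon)

def gc3(cds):
    """Observed fraction of G or C in the third codon position."""
    third = cds[2::3]
    return sum(b in "GC" for b in third) / len(third)

def expected_gc3_without_codon_bias(cds):
    """Expected (G+C)3 if synonymous codons were used uniformly (assumption)."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    total = 0.0
    for codon in codons:
        synonyms = SYNONYMS[CODON_TO_AA[codon]]
        total += sum(c[2] in "GC" for c in synonyms) / len(synonyms)
    return total / len(codons)

# Toy sequence: two lysine codons, both ending in G.
observed = gc3("AAGAAG")
expected = expected_gc3_without_codon_bias("AAGAAG")
```

An observed value above the expectation, as in this toy case, is the kind of excess that the text attributes to codon bias, since the encoded protein is unchanged.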

In this work, we considered compositional and sequence biases in proteins in relation to those in corresponding nucleic acid sequences and to the phylogeny of species (Archaea or Bacteria). OGT correlations of nucleotides in different codon positions are similar in Archaea and Bacteria, but higher in the former (Table 1). The codon that is optimal from the point of view of thermophilic adaptation (i.e. most correlated with OGT) reads A1 [TG]2 non-[AG]3 in Archaea, but [weak A]1 [weak T]2 non-[AG]3 in Bacteria (Table 4). Codon bias supports the strong correlation between the amount of adenine and OGT in the first codon position in Archaea, and the same but weaker correlation in Bacteria. While we showed earlier (22) that the amino acid thermophilic trend is not determined by the overall nucleotide composition, consideration of separate codon positions reveals an interesting link between positional nucleotide frequencies and amino acid composition in relation to the domain of Life. Specifically, the prevalence of thymine in the second codon position and its strong correlation with OGT in Archaea is apparently a result of the demand on enrichment of Archaeal proteins with hydrophobic residues (Table 7). Noteworthy, a massive increase of van der Waals interactions (24,64) was found to be the cornerstone of the structure-based evolutionary strategy of protein thermostability in ancient species (29). In Bacterial proteins, however, there is an increase of the fractions of charged residues with OGT (Table 7). Thus, we observe a transition from the structure-based evolutionary strategy of protein thermostability in Archaea to the sequence-based one in Bacteria, and corresponding nucleotide and protein compositional biases (Table 7) underlying these strategies (29). The third codon position is characterized by selection against adenine and guanine in Archaea and against adenine in Bacteria, which is a result of the codon bias.

The most pronounced dinucleotide compositional biases are an excess of homo-dinucleotides and its correlation with OGT in ncDNA, tRNA and rRNA. It is presumably a relic of the ancient primitive homo(poly)nucleotides from which life started. Overrepresentation of the complementary ApG/CpT dinucleotides points to their contribution to DNA/RNA stability via strong base-stacking interactions (65,66). The consideration of dinucleotide biases in purine (R) and pyrimidine (Y) notation also reveals an interesting picture. Overall, for both Archaea and Bacteria, the signature for thermophilic dinucleotides reads RpR/YpY, non-(RpR/YpY), RpR/YpY for codon positions 1-2, 2-3 and 3-1, respectively (Table 4). The major difference between Archaea and Bacteria in this case is that the dinucleotide bias in positions 1-2 is caused by demands on the nucleotide sequence level (and provided by the codon bias) in Archaea, while some other factors work in Bacteria. The preference for homodinucleotides in positions 1-2 and 3-1 can be a possible reason for avoiding homodinucleotides in positions 2-3, because heterodinucleotides can contribute to DNA flexibility tuning. Indeed, it has been shown that, given an average flexibility F of homodinucleotides (either RpR or YpY), the flexibilities of the heterodinucleotides YpR and RpY are 2F and 0.5F, respectively (67). The codon positions 2-3 are most versatile in terms of the relationship between the nucleotide and protein sequences. Therefore, the presence of heterodinucleotides in the codon positions 2-3 makes it possible to adjust the flexibility of the nucleotide sequence without changing physical–chemical characteristics of the encoded amino acid residue. Additionally, we found that the link between dipeptide and corresponding 3-1 dinucleotide frequencies is determined by the amino acid dipeptide. The dominating bias reads as Met followed by a polar or charged residue.

On the RNA level, simple consideration of the nucleotide composition suggests discriminating the tRNA and rRNA molecules from the mRNA. Indeed, based on the correlation with OGT, we can conclude that (G+C) content is important for providing stability of tRNA and rRNA molecules, but not for mRNA. The role of the (G+C) content was further confirmed by folding simulations (Table 3). We also found other complementary signals in mRNA sequences related to its structure, stability and function. Archaea and Bacteria yield similar trends in the most and least frequent nucleotide pairs, and the third codon position makes a major contribution to the base-pairing mechanism of stem stabilization. Specifically, in addition to Watson–Crick pairing, the GU wobble pairs contribute substantially to the thermal stability of the Archaeal mRNA, in agreement with earlier observations (68,69). Overall, it results in a moderate correlation of the segment energy and energy per base pair of the folded mRNA with OGT (Table 5). Folding simulations performed in this work also allowed us to analyze peculiarities of the nucleotide contents and structural contacts in stem and loop regions of folded mRNA. Purine load in loop regions is higher than in stems, and this effect is slightly stronger in Archaea than in Bacteria (Table 6). Purine load also correlates with OGT in loops, but not in stems. This correlation was observed earlier by Forsdyke et al. (21), and it was described as 'polite purine load of the loop regions' that prevents undesired mRNA–mRNA single-strand interactions. The authors concluded that increased purine load can affect the codon choice and, consequently, the amino acid composition (21). Our data [as well as careful analysis of the original data in (21)] do not support the latter claim. There is indeed a correlation between the fraction of

some purine-rich codons (GGR(Gly), AGR(Arg) and GAR(Glu)) and OGT (Table 6). However, the fraction of purine-rich non-synonymous codons AAR (Lys) is not correlated with OGT (Table 6). Moreover, these codons are synonymous and cannot directly affect the amino acid composition. It has been shown earlier (22,70) that an increase of Glu and Arg fractions is a trend in protein thermophilic adaptation. At the same time, purine load remains high after elimination of the codon bias, pointing to the amino acid composition as the most probable cause for the former (22). All the above prompts one to conclude that, in mutual tuning of nucleotide and amino acid compositions, the purine load does not dominate and determine biases in amino acid composition, if not the opposite. The analysis of purine load shows that it correlates with OGT in both stems and loops (Table 6) in the case of purine-rich codons GGR(Gly), AGR(Arg) and GAR(Glu). Further, the amount of ApG dinucleotides strongly correlates with OGT in both loops and stems (Table 6), and ApG provides a strong base stacking important for thermal stability of nucleic acid sequences (22,65). Purine load, therefore, is apparently a determinant of the base-stacking mechanism in mRNA and/or DNA thermal stability, as well as a result of the thermophilic amino acid composition trend.

Prokaryotes thrive under a temperature interval spanning over a hundred degrees, and they represent two major life styles, aerobic and anaerobic. Analysis of complete Archaeal and Bacterial genomes unraveled compositional and sequence signals related to molecular mechanisms of stability and adaptation, unaffected by selective sequencing or by the comparison of orthologs. Overall, codon bias works stronger in Archaea and is mostly utilized in thermophilic adaptation of nucleic acids. It apparently reflects the longer evolutionary history of Archaea, which presumably started close to the origin of life in hot conditions (29). The codon bias and amino acid sequences (dipeptide composition) work in accord for supporting enrichment of the nucleotide sequences with ApG/CpT dinucleotides, determinants of the base-stacking mechanism of nucleic acid stability (65–67). We also found that the second codon position reveals a strong link between the nucleotide and amino acid compositions. Specifically, the excess of thymine in this position is a result of a demand on the enrichment of Archaeal proteins with hydrophobic amino acids. The third codon position in Bacteria is the only case where codon bias is detectable already on the level of pure composition. We found that the (G+C)3 load is related to the aerobic life style dominating in Bacteria. From the point of view of thermophilic adaptation, codon bias works against G in the third codon position in both Archaea and Bacteria. It thus supports a specific role of the third codon position in discriminating between adaptations to temperature and aerobic life style. Finally, we obtained an interesting and complex picture of the relationship between the nucleotide composition (purine load) and amino acid composition (selection between Arg and Lys) in relation to thermal adaptation, aerobicity and phylogeny. Specifically, purine load in both Archaea and Bacteria is a result of the 'from both ends of hydrophobicity scale' trend in thermal adaptation of proteins (34), reflected in the IVYWREL predictor of the OGT (22). According to this predictor, Arg is a preferred amino acid for thermal adaptation, though Lys is the next candidate, present in the top most correlated predictors as well (22,43). In particular, selection for Lys over Arg in some species was shown to be important for the entropic mechanism of protein thermostabilization (70). However, the overall preference for Arg over Lys in the case of thermal adaptation is well manifested in the excess of the purine-rich (AGR) codons of Arg at the expense of the purine-rich ones of Lys (AAR). At the same time, discrimination between Arg and Lys works in the opposite direction in aerobicity, where Arg codons AGA and AGG are suppressed in favor of the purine-rich AAA and AAG of Lys. It thus reflects a preference for Lys over Arg in aerobes compared to anaerobes in particular, and in Bacteria versus Archaea in general, preserving at the same time the high purine load necessary for providing base-stacking in the corresponding nucleic acids (22,53,65). An intricate connection between Lys/Arg and their codons in relation to thermal adaptation and aerobicity exemplifies how selection can work on nucleic acids and proteins simultaneously in response to demands of different environments. Obviously, the whole picture of molecular mechanisms of adaptation and relations between them is far from being complete. Consideration of other environmental factors, such as salinity, pressure, etc., will help to unravel new mechanisms of stability and their sequence/structure determinants, and to understand tradeoffs that Nature embraced en route of the evolution and adaptation.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online

FUNDING

Funding for open access charge: Functional Genomics Program (FUGE II), Norwegian Research Council.

Conflict of interest statement. None declared.

REFERENCES

1. Crick,F. (1970) Central dogma of molecular biology. Nature, 227, 561–563.

2. Tawfik,D.S. (2010) Messy biology and the origins of evolutionary innovations. Nat. Chem. Biol., 6, 692–696.

3. Tokuriki,N. and Tawfik,D.S. (2009) Stability effects of mutations and protein evolvability. Curr. Opin. Struct. Biol., 19, 596–604.

4. Tokuriki,N. and Tawfik,D.S. (2009) Protein dynamism and evolvability. Science, 324, 203–207.

5. Koonin,E.V. (2012) Does the central dogma still stand? Biol. Direct, 7, 27.

6. Pe'er,I., Felder,C.E., Man,O., Silman,I., Sussman,J.L. and Beckmann,J.S. (2004) Proteomic signatures: amino acid and oligopeptide compositions differentiate among phyla. Proteins, 54, 20–40.

7. Aravind,L., Tatusov,R.L., Wolf,Y.I., Walker,D.R. and Koonin,E.V. (1998) Evidence for massive gene exchange between archaeal and bacterial hyperthermophiles. Trends Genet., 14, 442–444.

8. Khachane,A.N., Timmis,K.N. and dos Santos,V.A. (2005) Uracil content of 16S rRNA of thermophilic and psychrophilic prokaryotes correlates inversely with their optimal growth temperatures. Nucleic Acids Res., 33, 4016–4022.

9. Suhre,K. and Claverie,J.M. (2003) Genomic correlates of hyperthermostability, an update. J. Biol. Chem., 278, 17198–17202.

10. Tekaia,F. and Yeramian,E. (2006) Evolution of proteomes: fundamental signatures and global trends in amino acid compositions. BMC Genom., 7, 307.

11. Wang,H.C. and Hickey,D.A. (2002) Evidence for strong selective constraint acting on the nucleotide composition of 16S ribosomal RNA genes. Nucleic Acids Res., 30, 2501–2507.

12. Friedman,R., Drake,J.W. and Hughes,A.L. (2004) Genome-wide patterns of nucleotide substitution reveal stringent functional constraints on the protein sequences of thermophiles. Genetics, 167, 1507–1512.

13. Lynn,D.J., Singer,G.A. and Hickey,D.A. (2002) Synonymous codon usage is subject to selection in thermophilic bacteria. Nucleic Acids Res., 30, 4272–4277.

14. Singer,G.A. and Hickey,D.A. (2000) Nucleotide bias causes a genomewide bias in the amino acid composition of proteins. Mol. Biol. Evol., 17, 1581–1588.

15. Singer,G.A. and Hickey,D.A. (2003) Thermophilic prokaryotes have characteristic patterns of codon usage, amino acid composition and nucleotide content. Gene, 317, 39–47.

16. Tekaia,F., Yeramian,E. and Dujon,B. (2002) Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene, 297, 51–60.

17. Roy Chowdhury,A. and Dutta,C. (2012) A pursuit of lineage-specific and niche-specific proteome features in the world of archaea. BMC Genom., 13, 236.

18. Wang,H.C., Susko,E. and Roger,A.J. (2006) On the correlation between genomic G+C content and optimal growth temperature in prokaryotes: data quality and confounding factors. Biochem. Biophys. Res. Commun., 342, 681–684.

19. Wu,H., Zhang,Z., Hu,S. and Yu,J. (2012) On the molecular mechanism of GC content variation among eubacterial genomes. Biol. Direct, 7, 2.

20. Knight,R.D., Freeland,S.J. and Landweber,L.F. (2001) A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes. Genome Biol., 2, RESEARCH0010.

21. Lao,P.J. and Forsdyke,D.R. (2000) Thermophilic bacteria strictly obey Szybalski's transcription direction rule and politely purine-load RNAs with both adenine and guanine. Genome Res., 10, 228–236.

22. Zeldovich,K.B., Berezovsky,I.N. and Shakhnovich,E.I. (2007) Protein and DNA sequence determinants of thermophilic adaptation. PLoS Comput. Biol., 3, e5.

23. Kreil,D.P. and Ouzounis,C.A. (2001) Identification of thermophilic species by the amino acid compositions deduced from their genomes. Nucleic Acids Res., 29, 1608–1615.

24. Berezovsky,I.N., Tumanyan,V.G. and Esipova,N.G. (1997) Representation of amino acid sequences in terms of interaction energy in protein globules. FEBS Lett., 418, 43–46.

25. Cambillau,C. and Claverie,J.M. (2000) Structural and genomic correlates of hyperthermostability. J. Biol. Chem., 275, 32383–32386.

26. Greaves,R.B. and Warwicker,J. (2007) Mechanisms for stabilisation and the maintenance of solubility in proteins from thermophiles. BMC Struct. Biol., 7, 18.

27. Jaenicke,R. (1999) Stability and folding of domain proteins. Prog. Biophys. Mol. Biol., 71, 155–241.

28. Jaenicke,R. and Bohm,G. (1998) The stability of proteins in extreme environments. Curr. Opin. Struct. Biol., 8, 738–748.

29. Berezovsky,I.N. and Shakhnovich,E.I. (2005) Physics and evolution of thermophilic adaptation. Proc. Natl Acad. Sci. USA, 102, 12742–12747.

30. Chakravarty,S. and Varadarajan,R. (2002) Elucidation of factors responsible for enhanced thermal stability of proteins: a structural genomics based study. Biochemistry, 41, 8152–8161.

31. Glyakina,A.V., Garbuzynskiy,S.O., Lobanov,M.Y. and Galzitskaya,O.V. (2007) Different packing of external residues can explain differences in the thermostability of proteins from thermophilic and mesophilic organisms. Bioinformatics, 23, 2231–2238.

32. Thompson,M.J. and Eisenberg,D. (1999) Transproteomic evidence of a loop-deletion mechanism for enhancing protein thermostability. J. Mol. Biol., 290, 595–604.

33. Tokuriki,N., Oldfield,C.J., Uversky,V.N., Berezovsky,I.N. and Tawfik,D.S. (2009) Do viral proteins possess unique biophysical features? Trends Biochem. Sci., 34, 53–59.

34. Berezovsky,I.N., Zeldovich,K.B. and Shakhnovich,E.I. (2007) Positive and negative design in stability and thermal adaptation of natural proteins. PLoS Comput. Biol., 3, e52.

35. Bharanidharan,D., Bhargavi,G.R., Uthanumallian,K. and Gautham,N. (2004) Correlations between nucleotide frequencies and amino acid composition in 115 bacterial species. Biochem. Biophys. Res. Commun., 315, 1097–1103.

36. Nakashima,H., Fukuchi,S. and Nishikawa,K. (2003) Compositional changes in RNA, DNA and proteins for bacterial adaptation to higher and lower temperatures. J. Biochem. (Tokyo), 133, 507–513.

37. Dehouck,Y., Folch,B. and Rooman,M. (2008) Revisiting the correlation between proteins' thermoresistance and organisms' thermophilicity. Protein Eng. Des. Sel., 21, 275–278.

38. Folch,B., Dehouck,Y. and Rooman,M. (2010) Thermo- and mesostabilizing protein interactions identified by temperature-dependent statistical potentials. Biophys. J., 98, 667–677.

39. Folch,B., Rooman,M. and Dehouck,Y. (2008) Thermostability of salt bridges versus hydrophobic interactions in proteins probed by statistical potentials. J. Chem. Inf. Model., 48, 119–127.

40. Gonnelli,G., Rooman,M. and Dehouck,Y. (2012) Structure-based mutant stability predictions on proteins of unknown structure. J. Biotechnol., 161, 287–293.

41. Ponnuswamy,P., Muthusamy,R. and Manavalan,P. (1982) Amino acid composition and thermal stability of globular proteins. Int. J. Biol. Macromol., 4, 186–190.

42. Berezovsky,I.N. (2011) The diversity of physical forces and mechanisms in intermolecular interactions. Phys. Biol., 8, 035002.

43. Ma,B.G., Goncearenco,A. and Berezovsky,I.N. (2010) Thermophilic adaptation of protein complexes inferred from proteomic homology modeling. Structure, 18, 819–828.

44. Makarova,K.S. and Koonin,E.V. (2005) Evolutionary and functional genomics of the Archaea. Curr. Opin. Microbiol., 8, 586–594.

45. Novichkov,P.S., Wolf,Y.I., Dubchak,I. and Koonin,E.V. (2009) Trends in prokaryotic evolution revealed by comparison of closely related bacterial and archaeal genomes. J. Bacteriol., 191, 65–73.

46. Koonin,E.V. and Wolf,Y.I. (2008) Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res., 36, 6688–6719.

47. Koonin,E.V., Mushegian,A.R., Galperin,M.Y. and Walker,D.R. (1997) Comparison of archaeal and bacterial genomes: computer analysis of protein sequences predicts novel functions and suggests a chimeric origin for the archaea. Mol. Microbiol., 25, 619–637.

48. Katz,L. and Burge,C.B. (2003) Widespread selection for local RNA secondary structure in coding regions of bacterial genes. Genome Res., 13, 2042–2051.

49. Hofacker,I.L., Priwitzer,B. and Stadler,P.F. (2004) Prediction of locally stable RNA secondary structures for genome-wide surveys. Bioinformatics, 20, 186–190.

50. Zuker,M. and Stiegler,P. (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res., 9, 133–148.

51. Mathews,D.H., Sabina,J., Zuker,M. and Turner,D.H. (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol., 288, 911–940.

52. Shabalina,S.A., Ogurtsov,A.Y. and Spiridonov,N.A. (2006) A periodic pattern of mRNA secondary structure created by the genetic code. Nucleic Acids Res., 34, 2428–2437.

53. Marmur,J. and Doty,P. (1962) Determination of the base composition of deoxyribonucleic acid from its thermal denaturation temperature. J. Mol. Biol., 5, 109–118.

54. Hurst,L.D. and Merchant,A.R. (2001) High guanine-cytosine content is not an adaptation to high temperature: a comparative analysis amongst prokaryotes. Proc. Biol. Sci., 268, 493–497.

55. Naya,H., Romero,H., Zavala,A., Alvarez,B. and Musto,H. (2002) Aerobiosis increases the genomic guanine plus cytosine content (GC%) in prokaryotes. J. Mol. Evol., 55, 260–264.

56. Beckman,K.B. and Ames,B.N. (1997) Oxidative decay of DNA. J. Biol. Chem., 272, 19633–19636.

57. Cheng,K.C., Cahill,D.S., Kasai,H., Nishimura,S. and Loeb,L.A. (1992) 8-Hydroxyguanine, an abundant form of oxidative DNA damage, causes G→T and A→C substitutions. J. Biol. Chem., 267, 166–172.

58. Vieira-Silva,S. and Rocha,E.P. (2008) An assessment of the impacts of molecular oxygen on the evolution of proteomes. Mol. Biol. Evol., 25, 1931–1942.

59. Galtier,N. and Lobry,J.R. (1997) Relationships between genomic G+C content, RNA secondary structures, and optimal growth temperature in prokaryotes. J. Mol. Evol., 44, 632–636.

60. Musto,H., Naya,H., Zavala,A., Romero,H., Alvarez-Valin,F. and Bernardi,G. (2004) Correlations between genomic GC levels and optimal growth temperatures in prokaryotes. FEBS Lett., 573, 73–77.

61. Fitch,W.M. (1974) The large extent of putative secondary nucleic acid structure in random nucleotide sequences or amino acid derived messenger-RNA. J. Mol. Evol., 3, 279–291.

62. Workman,C. and Krogh,A. (1999) No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. Nucleic Acids Res., 27, 4816–4822.

63. Bachmair,A., Finley,D. and Varshavsky,A. (1986) In vivo half-life of a protein is a function of its amino-terminal residue. Science, 234, 179–186.

64. Berezovsky,I.N., Namiot,V.A., Tumanyan,V.G. and Esipova,N.G. (1999) Hierarchy of the interaction energy distribution in the spatial structure of globular proteins and the problem of domain definition. J. Biomol. Struct. Dyn., 17, 133–155.

65. Yakovchuk,P., Protozanova,E. and Frank-Kamenetskii,M.D. (2006) Base-stacking and base-pairing contributions into thermal stability of the DNA double helix. Nucleic Acids Res., 34, 564–574.

66. Friedman,R.A. and Honig,B. (1995) A free energy analysis of nucleic acid base stacking in aqueous solution. Biophys. J., 69, 1528–1535.

67. Okonogi,T.M., Alley,S.C., Reese,A.W., Hopkins,P.B. and Robinson,B.H. (2002) Sequence-dependent dynamics of duplex DNA: the applicability of a dinucleotide model. Biophys. J., 83, 3446–3459.

68. Masquida,B. and Westhof,E. (2000) On the wobble GoU and related pairs. RNA, 6, 9–15.

69. Varani,G. and McClain,W.H. (2000) The G x U wobble base pair. A fundamental building block of RNA structure crucial to RNA function in diverse biological systems. EMBO Rep., 1, 18–23.

70. Berezovsky,I.N., Chen,W.W., Choi,P.J. and Shakhnovich,E.I. (2005) Entropic stabilization of proteins and its proteomic consequences. PLoS Comput. Biol., 1, e47.

Table 1. Compositional and sequence signals in coding and ncDNA

ARCHAEA                               BACTERIA
Characteristic        Value           Characteristic        Value

A. Correlations between nucleotide compositions and OGT

Codon position 1
A Nat/NCB             R=0.69          A Nat/NCB             R=0.28+
(A+T) Nat/NCB         R=0.68          (A+T) Nat/NCB         R=0.29
(A+G) Nat/NCB         R=0.71          (A+G) Nat/NCB         R=0.34
(A+C) Nat/NCB         R=0.71          (A+C) Nat/NCB         R=0.36

Codon position 2
T Nat                 R=0.55          T Nat                 R=0.37
T NCB                 R=0.55          T NCB                 R=0.37
G Nat/NCB             R=0.60          (T+G) Nat             R=0.36
(T+G) Nat             R=0.72          (T+G) NCB             R=0.42
(T+G) NCB             R=0.68
(G+C) Nat/NCB         R=0.60

Codon position 3
A Nat                 R=0.02          A Nat                 R=0.13
A NCB                 R=0.70          A NCB                 R=0.76
G Nat                 R=0.22          (A+G) Nat             R=0.60
G NCB                 R=0.55          (A+G) NCB             R=0.43
(A+G) Nat             R=0.56          (A+G) Nat/NCB         R=0.11
(A+G) NCB             R=0.67

B. Position-independent dinucleotide contrasts

ncDNA
ApA/TpT               1.15            ApA/TpT               1.26
CpC/GpG               1.24            GpC                   1.25
GpT/ApC               0.82

cDNA
ApA                   1.25
ApA NCB               1.06
TpT                   1.20
TpT NCB               1.13
GpC                   1.24
GpC NCB               1.07

tRNA
ApA                   1.32            ApG                   1.28
CpC                   1.26            TpC                   1.35
TpC                   1.26
ApC                   0.52
TpG                   0.70

rRNA
ApA                   1.25            ApA                   1.15
CpC                   1.27            CpC                   1.19

C. Correlations of position-independent dinucleotide contrasts (DCT) with OGT

ncDNA
CpT                   R=0.63          GpG                   R=0.49
ApG                   R=0.64

cDNA
CpT Nat               R=0.66          GpG Nat               R=0.40
CpT NCB               R=0.57          GpG NCB               R=0.09
CpT Shfld             R=0.57          GpG Shfld             R=0.32
ApG Nat               R=0.80
ApG NCB               R=0.59
ApG Shfld             R=0.78

tRNA
ApA                   R=0.82          ApA                   R=0.45
TpA                   R=0.72
TpT                   R=0.80

(continued)


Table 1. Continued

                    ARCHAEA                      BACTERIA
                    Characteristic    Value      Characteristic    Value

tRNA (continued)    GpG               R=0.44
                    CpC               R=0.80

rRNA                ApA               R=0.92     ApA               R=0.67
                    TpA               R=0.88     CpC               R=0.52
                    GpG               R=0.72
                    CpC               R=0.83

D. Position-dependent dinucleotide composition ratios (Freq), dinucleotide contrasts and their OGT correlations

Codon position 1-2  Freq(ApG)Nat/NCB  1.35 na    Freq(CpG)Nat/NCB  1.27 na
                    Freq(CpG)Nat/NCB  0.62 na
                    ApG               1.02       TpT               1.45
                    ApGNCB            0.79       TpTNCB            1.49
                    CpG               0.76       TpA               0.69
                    CpGNCB            1.17       TpANCB            0.65
                                                 GpT               0.67
                                                 GpTNCB            0.67
                    ApG               R=0.77
                    ApGNCB            R=0.11
                    GpC               R=0.55
                    GpCNCB            R=0.44
                    CpC               R=0.37+
                    CpCNCB            R=0.37+
                    ApGNat/NCB        R=0.63
                    GpCNat/NCB        R=0.60
                    CpCNat/NCB        R=0.37+
                    RpR               R=0.40     RpR               R=0.33
                    RpRNCB            R=0.15     RpRNCB            R=0.27+
                    YpY               R=0.34     YpY               R=0.39
                    YpYNCB            R=0.05+    YpYNCB            R=0.33
                    RpRNat/NCB        R=0.59     RpRNat/NCB        R=0.37
                    YpYNat/NCB        R=0.64     YpYNat/NCB        R=0.45

Codon position 2-3  TpA               R=0.63     ApA               1.50
                    TpANCB            R=0.28     ApANCB            1.05
                    CpT               R=0.61     GpC               1.33
                    CpTNCB            R=0.68     GpCNCB            0.98
                    ApG               R=0.69     ApCNat/NCB        R=0.52
                    ApGNCB            R=0.37+    GpANat/NCB        R=0.41
                    RpR               R=0.20     RpR               R=0.51
                    RpRNCB            R=0.65     RpRNCB            R=0.73
                    YpY               R=0.28     YpY               R=0.58
                    YpYNCB            R=0.66     YpYNCB            R=0.75
                                                 RpRNat/NCB        R=0.39
                                                 YpYNat/NCB        R=0.43

Codon position 3-1  CpT               R=0.60
                    CpTShfld          R=0.05
                    ApG               R=0.53
                    ApGShfld          R=0.02
                    RpR               R=0.30+    RpR               R=0.48
                    RpRShfld          R=0.20     RpRShfld          R=0.01
                    YpY               R=0.35+    YpY               R=0.56
                    YpYShfld          R=0.06     YpYShfld          R=0.17

Pearson correlation coefficient is denoted by R. DCT is calculated as the ratio of dinucleotide frequency to the product of frequencies of the corresponding independent nucleotides. Nucleotides with purine (A or G) and pyrimidine (T or C) bases are denoted with R and Y, respectively. The lower index distinguishes the values observed for natural sequences (Nat), sequences with eliminated codon bias (NCB) and values observed after shuffling of amino acid sequences (Shfld). If the lower index is omitted, the value is given for the natural sequences. The P-values for correlations and for the dinucleotide contrast t-tests (H0: DCT is 1.0) are shown in superscripts as significance levels: +P-value < 0.05, with further levels at < 0.01 and < 0.0001. Supplementary Files S5 and S7 list all P-values for position-specific correlations, SD for compositions and the P-values for t-tests. Supplementary Files S8 and S9 show the P-values for position-independent contrast t-tests and correlations.


content increases and energy per base pair decreases with temperature in stems of folded t- and rRNAs (Table 3). These data point to base-pairing interactions as an important contributor to thermal stabilization (53) of t- and rRNAs of prokaryotes, and this effect is more strongly manifested in Archaea than in Bacteria.

Dinucleotide composition of nucleic acid sequences

Dinucleotide contrasts (DCTs) show the ratio of observed dinucleotide frequencies to the expected ones given the composition of individual nucleotides. Coupling of the same nucleotides (Table 1) is preferred in ncDNA sequences in Archaea (ApA/TpT and CpC/GpG) and Bacteria (ApA/TpT). ApA and CpC dinucleotides are prevalent in t- and rRNA of Archaea and rRNA of Bacteria. Other outstanding contrasts in Archaea are found in ncDNA (GpT/ApC) and tRNA (TpC, ApC and TpG). In Bacteria, ApG and TpC are prevalent in tRNA (Table 1), while GpC is preferred in both non-coding and coding sequences. In coding sequences of Archaea there is no preference for coupling of similar nucleotides, while in Bacteria it is found for ApA and TpT (Table 1).

In both coding and ncDNA sequences of Archaea there is a clear excess of complementary ApG and CpT dinucleotides, which persists after elimination of the codon bias and reshuffling of the amino acid sequence (Table 1). In Bacteria, the excess of GpG dinucleotides in non-coding and coding sequences is most pronounced (Table 1), provided by the codon bias in coding sequences. Pairing of similar nucleotides is highly correlated with OGT in tRNA (ApA, TpT, GpG and CpC) and rRNA (ApA, GpG and CpC) in Archaea. Correlation of the same nucleotide pairs is also found in Bacteria, though for fewer pairs and more weakly (tRNA: ApA; rRNA: ApA and CpC). In Archaeal t- and rRNA there is also a strong correlation of TpA pairs with OGT (Table 1).
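The dinucleotide contrast defined above (observed dinucleotide frequency divided by the product of the two mononucleotide frequencies) can be illustrated with a minimal sketch. The function name and the toy sequence are hypothetical and are not taken from the original analysis; overlapping dinucleotides are counted, which is an assumption about the position-independent case.

```python
from collections import Counter


def dinucleotide_contrasts(seq):
    """Map each observed dinucleotide XpY to f(XpY) / (f(X) * f(Y))."""
    seq = seq.upper()
    mono = Counter(seq)
    # Overlapping dinucleotides (assumed for the position-independent count).
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    contrasts = {}
    for pair, count in di.items():
        expected = (mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono)
        contrasts[pair] = (count / n_di) / expected
    return contrasts


# Hypothetical toy sequence, not one of the analyzed genomes.
toy_sequence = "AAGGAAGGCTCTAAGG"
for pair, dct in sorted(dinucleotide_contrasts(toy_sequence).items()):
    print(pair, round(dct, 2))
```

A contrast above 1.0 indicates a dinucleotide that occurs more often than its nucleotide composition alone would predict, and a value below 1.0 indicates avoidance.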

Correlation between nucleotide frequencies and temperature in different codon positions

The first codon position in Archaea is characterized by a high correlation of the ratio of natural to NCB frequencies of adenine with OGT [ANat/NCB, R=0.69; Table 1]. The second codon position reveals correlation of the thymine frequency with OGT (Table 1). The combination of thymine and guanine nucleotides is also correlated with OGT; however, thymine is the major contributor (Table 1). The correlation of the guanine content with OGT is supported by the codon bias (Table 1). The combination of guanine with cytosine is also correlated with OGT in the second position of Archaeal sequences and is provided by the codon bias. The third position reveals strong selection against adenine and guanine in relation to OGT (Table 1). Bringing together the (anti-)correlations with OGT in different codon positions, one can draw the combined triplet that is optimal from the point of view of thermal adaptation (Table 4) as [A]1 [T/G]2 [non-(A/G)]3. Prevalence of thymine in the second codon position is linked to an excess of hydrophobes. Indeed, codons with thymine in the second position encode Ile, Leu, Met, Phe and Val. These are strongly hydrophobic residues and aromatic Phe, which can form many van der Waals contacts and contribute

Table 2. Nucleotide compositions and their OGT correlations in DNA and RNAs

Domain of life  Nucleotide  cDNANat  cDNANCB  ncDNA   tRNA (R)        rRNA (R)

Archaea         A           28.45    27.94    30.73   17.13 (-0.76)   23.68 (-0.85)
                T           23.89    24.44    30.65   18.17 (-0.89)   19.24 (-0.88)
                G           26.00    26.12    19.30   33.91 (0.84)    32.08 (0.94)
                C           21.66    21.51    19.31   30.80 (0.84)    24.99 (0.76)
Bacteria        A           24.05    26.42    26.57   19.66 (-0.44)   26.09 (-0.52)
                T           22.68    23.70    26.61   21.56 (-0.51)   20.68 (-0.77)
                G           27.50    26.66    23.41   31.21 (0.53)    31.10 (0.72)
                C           25.77    23.23    23.41   27.58 (0.39)    22.13 (0.60)

The numbers represent the average frequencies of nucleotides, in percent, in the corresponding parts of genomes, while the numbers in parentheses are correlation coefficients (R) of nucleotide frequencies with OGT. The most important biases and correlations are shown in bold font. The P-values for correlations (H0: correlation coefficient R=0) and for the nucleotide composition t-tests (H0: mean frequency is 0.25) are shown in superscripts as significance levels at P-value < 0.01 and < 0.0001. Supplementary Files S8 and S9 list all correlations and composition tests. cDNANat, natural nucleotide composition in coding DNA; cDNANCB, nucleotide composition in coding sequences with eliminated codon bias.

Table 3. OGT correlations in r- and t-RNA observed in folding simulations

Domain of life  RNA type  R((G+C), OGT) (P-value)   R(<Ebp>, OGT) (P-value)

Archaea         rRNA      0.89 (<10^-22)            -0.93 (1.1 × 10^-20)
                tRNA      0.84 (1.84 × 10^-13)      -0.71 (3.34 × 10^-8)
Bacteria        rRNA      0.73 (1.38 × 10^-14)      -0.66 (1 × 10^-11)
                tRNA      0.53 (3.1 × 10^-7)        -0.51 (9.1 × 10^-7)

R((G+C), OGT), correlation coefficient between the (G+C) content and the OGT; R(<Ebp>, OGT), correlation coefficient between the free energy of RNA folding averaged per base pair and the OGT.


thus to the packing of the hydrophobic core. Notably, elimination of the codon bias does not affect the correlation of the thymine fraction with OGT (R=0.55 in both native and NCB sequences). It shows that the excess of thymine in the second codon position is a result of selection on the protein level. An apparent explanation for such selection is the domination of the structure-based strategy in thermostabilization of Archaeal proteins. This strategy is characterized by the increased compactness of the hydrophobic core provided by massive van der Waals contacts. The correlation of the combination of A and G with OGT is not provided by tuning of the codon bias, which points to adaptation on a level other than DNA.
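The link between thymine in the second codon position and hydrophobic residues follows directly from the standard genetic code, and it can be checked with a short sketch. The table below is the standard code written in the conventional TCAG ordering; the helper name is illustrative only.

```python
BASES = "TCAG"
# Standard genetic code; codons ordered TTT, TTC, TTA, TTG, TCT, ..., GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def residues_by_second_base(base):
    """Amino acids (one-letter codes) encoded by codons with `base` in position 2."""
    return sorted({aa for codon, aa in CODON_TABLE.items()
                   if codon[1] == base and aa != "*"})


print(residues_by_second_base("T"))  # ['F', 'I', 'L', 'M', 'V']
print(residues_by_second_base("A"))  # ['D', 'E', 'H', 'K', 'N', 'Q', 'Y']
```

Codons with T in the second position encode only Phe, Ile, Leu, Met and Val, whereas A in the second position yields exclusively polar and charged residues.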

Bacterial coding DNA sequences yield fewer position-dependent correlations with OGT than Archaeal ones. The ratio of the natural adenine frequency to the one for eliminated codon bias is weakly correlated with OGT (Table 1). There are also moderate correlations of the thymine frequency and of the thymine and guanine combination with OGT in the second codon position (Table 1). The third codon position is characterized by selection against adenine provided by the codon bias (Table 1). However, the combination of adenine and guanine is correlated with OGT. It does not depend on the codon bias to a large extent (same as in Archaea), pointing to a possible signal of adaptation on a level other than the protein-coding DNA sequence. The generalized codon in Bacteria (Table 4) characterizing the thermophilic trends therefore reads [weak A]1 [weak T]2 [non-(A/G)]3.
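The position-specific analysis described in this section amounts to computing, for each genome, the frequency of a nucleotide (or nucleotide combination) in a given codon position and correlating it with OGT across genomes. A minimal sketch under stated assumptions: the sequences are in-frame coding DNA, and the three "genomes" with their OGT values are invented toy data, not results from the paper.

```python
import math


def position_frequency(cds, position, nucleotides):
    """Fraction of codons whose base at `position` (1, 2 or 3) is in `nucleotides`."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop a trailing incomplete codon
    bases = cds[:usable][position - 1::3]
    return sum(b in nucleotides for b in bases) / len(bases)


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical (OGT in degrees C, coding sequence) pairs for illustration only.
toy_genomes = [
    (37.0, "ATGGCAGAAGTA"),
    (65.0, "ATGGTTGAAGTC"),
    (85.0, "ATGGTTCTTATC"),
]
ogt = [t for t, _ in toy_genomes]
t2 = [position_frequency(cds, 2, "T") for _, cds in toy_genomes]
print("T2 frequencies:", t2)
print("R(T2, OGT):", round(pearson_r(t2, ogt), 2))
```

The same two functions cover the combined signals of Table 1, for example `position_frequency(cds, 2, "TG")` for the (T+G) load in the second position.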

The role of excessive (G+C) load in the third codon position in adaptation to aerobicity

In both Archaea and Bacteria the average nucleotide composition in different codon positions is not affected by the codon bias to a large extent, except for the third codon position in Bacteria. However, the compositional variance in the third position is essentially destroyed when there is no codon bias [standard deviation (SD) diminishes from >10 to <1, see Supplementary File S5]. The natural (G+C)3 load in Bacteria yields a 9% excess in comparison with NCB sequences (P-value=1.7e-07, Supplementary File S6), while in Archaea there is no significant difference. We found that, despite frequent involvement of G3 and C3 nucleotides in GC/CG pairs, the (G+C)3 load does not play a crucial role in thermal stability of mRNA (52,54). Overall, the difference in (G+C)3 load between Archaea and Bacteria is insignificant. If we consider separate thermal groups, the only difference appears in mesophilic Bacteria (9.5% more in Bacteria, P=0.09). However, when we take into account the oxygen tolerance factor, the difference in (G+C)3 load becomes extremely pronounced between aerobic (70.13% in natural and 50.84% in NCB) and anaerobic (48.94% in natural and 50.46% in NCB) species (P-value=2e-09). Considering the lack of OGT correlation in protein-coding nucleotides (except thymine in the second codon position), one can conclude that the increased (G+C)3 load is a result of the aerobic life style rather than thermal adaptation (Supplementary Files S2 and S3). The comparison of aerobes and anaerobes in the group of mesophilic Bacteria corroborates this conclusion (Supplementary File S6). The difference is >20% of (G+C)3 content: 70.97% in aerobes and 50.51% in anaerobes (P=0.000178). This adaptation is entirely driven by the codon bias, because if the codon bias is removed, the (G+C)3 bias as well as the difference between aerobes and anaerobes disappear (50.87% and 50.64%).

We also compared usage of codons with G or C in the third position with other codons ('Codon Usage' in Supplementary Files S3 and S6). There are three amino acids encoded by six codons (Leu, Arg, Ser), five by four codons (Ala, Pro, Thr, Gly, Val), one by three codons (Ile), nine by two codons (Lys, Asn, Asp, Phe, Cys, Gln, Glu,

Table 4. Generalized nucleotides and dinucleotides in different codon positions favorable for thermostability

Nucleotides correlated with OGT

Domain of life                       Codon position
                                     1                2                   3

Archaea    Nucleotide                A                T/G                 Non-[A/G]
           Origin of the bias        Codon bias       T: amino acid;      Against codon bias
                                                      G: codon bias
Bacteria   Nucleotide                Weak A           Weak T              Non-[A/G]
           Origin of the bias        Codon bias       Amino acid          Against codon bias

Dinucleotides in purine (R) and pyrimidine (Y) notation correlated with OGT

Domain of life                       Codon position
                                     1-2              2-3                 3-1

Archaea    Dinucleotide              RpR/YpY          Non-RpR/YpY         RpR/YpY
           Origin of the bias        Codon bias       Codon bias          Amino acid
Bacteria   Dinucleotide              RpR/YpY          Non-RpR/YpY         RpR/YpY
           Origin of the bias        Not codon bias   Codon bias          Amino acid

Part 1: thermophilic-prone nucleotide biases. Columns 1, 2, 3 contain information on favorable nucleotides and the origin of the bias in codon positions 1, 2, 3. Part 2: thermophilic-prone dinucleotide biases. Columns 1-2, 2-3, 3-1 contain information on favorable dinucleotides and the origin of the bias in codon positions 1-2, 2-3, 3-1.


His, Tyr) and two by one codon (Met, Trp). Typically, half of the synonymous codons of each amino acid have G or C in the third codon position, the others A or T. It appeared that almost all of the codons with G and C in the third position have higher frequencies in aerobes compared to anaerobes, and in Bacteria than in Archaea (see 'Codon Usage' in Supplementary File S3). This observation holds for all codons of the 'two-codon' amino acids and for the majority of codons of the three-, four- and six-codon amino acids. The noticeable exception is significant suppression of the AGA/AGG codons of Arg (aerobes–anaerobes 177–125, P-values 4.46e-8/5.8e-8; Bacteria–Archaea 174–2625, P-values 1.65e-7/1.14e-11). In lysine, the codon AAA is suppressed in aerobes (difference 1716, P-value 6.44e-6), while the codon AAG is preferred. The AAG codon of lysine is favorable in aerobes (aerobes/anaerobes 143), and the other lysine codon, AAA, can be turned into AAG by only one mutation in the third codon position. Therefore, one can speculate that the demand for discriminating between Arg and Lys is the most probable cause for the suppression of the AGA and AGG codons of arginine in aerobes. Lysine is thus preferred over arginine in aerobic adaptation (Supplementary Files S3 and S6). The contribution of the amino acid to thermophilic adaptation is not compromised, because of the similarity between the physical–chemical characteristics of lysine and arginine from the point of view of thermostability. Overall, an excess of the G3 codons is always more pronounced in aerobes compared to anaerobes, which corroborates the conclusion that the (G+C)3 load is an indication of the aerobic life style (55–58). The similar trend in the difference between Bacteria and Archaea is a result of the domination of the aerobic life style in Bacteria in the analyzed dataset.
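Both the degeneracy classes quoted above and the (G+C)3 load itself can be reproduced from the standard genetic code. The following is a minimal sketch, assuming the standard code and in-frame coding DNA; all names are illustrative and do not come from the original pipeline.

```python
from collections import Counter, defaultdict

BASES = "TCAG"
# Standard genetic code; codons ordered TTT, TTC, TTA, TTG, TCT, ..., GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def gc3_load(cds):
    """Fraction of codons with G or C in the third position."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop a trailing incomplete codon
    third = cds[:usable][2::3]
    return sum(b in "GC" for b in third) / len(third)


# Group sense codons by the amino acid they encode.
synonymous = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        synonymous[aa].append(codon)

# Number of amino acids in each degeneracy class (codons per amino acid).
degeneracy_classes = Counter(len(codons) for codons in synonymous.values())

# Share of each amino acid's synonymous codons that end in G or C.
gc_ending_share = {
    aa: sum(c[2] in "GC" for c in codons) / len(codons)
    for aa, codons in synonymous.items()
}

print(sorted(degeneracy_classes.items()))  # [(1, 2), (2, 9), (3, 1), (4, 5), (6, 3)]
print({aa: round(s, 2) for aa, s in gc_ending_share.items() if s != 0.5})
```

Only Ile, Met and Trp deviate from the "half of the codons end in G or C" rule, which is why the (G+C)3 load of sequences without codon bias stays close to 50%.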

Correlation between dinucleotide frequencies in different codon positions and OGT

In order to analyze dinucleotides in different codon positions (1-2, 2-3 and 3-1), we used the DCTNat, DCTNCB (for positions 1-2 and 2-3) and shuffled (DCTShfld, for position 3-1) sequences. We considered correlations of these contrasts with OGT. The most pronounced biases are found for dinucleotides in coding sequences of Archaea. The ApG dinucleotides show the strongest excess of dinucleotides in all positions of natural sequences compared to NCB and shuffled ones. In codon positions 1-2 and 2-3, this compositional peculiarity and its correlation with OGT are provided by the codon bias (Table 1). In position 3-1, the high correlation of ApG frequencies with OGT is determined by the amino acid sequences, and it vanishes after amino acid reshuffling (Table 1). There is also a correlation of the frequency of the complementary CpT dinucleotide in position 3-1 with OGT, which disappears after reshuffling of amino acid sequences (Table 1). The most plausible role of ApG dinucleotides is a contribution to stabilization via base-stacking interactions between the purine rings (22). This conclusion is supported by the fact that the excess of ApG dinucleotides is provided by the codon bias, thus manifesting adaptation on a DNA level. The complementary CpT dinucleotides indicate enrichment of the anti-sense strand of double-stranded DNA (dsDNA) with the stabilizing ApG ones. The resulting mosaic of ApG stacking in sense and anti-sense strands can provide stabilization over long distances without compromising flexibility of the dsDNA. Other overrepresented dinucleotides (regardless of the codon bias) correlated with OGT in positions 1-2 are GpC, CpC, TpA and CpT (Table 1). The frequencies of TpA and CpT dinucleotides are correlated with OGT in position 2-3, where the former is provided by the codon bias and the latter is not.

In purine (R) and pyrimidine (Y) notation, codon bias provides grouping (stacking) of similar nucleotides in position 1-2, which increases with OGT (Table 1). At the same time, codon bias works against such grouping in position 2-3 (Table 1). The amino acid sequence is responsible for the grouping of similar nucleotides and correlations with OGT in position 3-1 (Table 1). The resulting generalized thermophilic pattern of dinucleotides in Archaeal cDNA sequences reads [RpR/YpY]1-2 [non-RpR/non-YpY]2-3 [RpR/YpY]3-1. Most of the compositional peculiarities found in Bacteria are not determined by the codon bias, yielding weak correlations with temperature (Table 1). The thermophilic pattern of dinucleotides in Bacteria (Table 4) is the same as the Archaeal one. The difference from Archaea is that selection for RpR and YpY dinucleotides in position 1-2 is not determined by the codon bias in Bacteria (Tables 1 and 4).
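The position-dependent counts in purine/pyrimidine notation can be sketched as follows. This is an illustration only: the function names are hypothetical, the input is assumed to be in-frame coding DNA containing only A, C, G and T (or U), and position 3-1 is taken as the third base of one codon followed by the first base of the next.

```python
from collections import Counter

PURINES = set("AG")


def to_ry(seq):
    """Recode a nucleotide sequence: purines (A, G) -> R, everything else -> Y."""
    return "".join("R" if base in PURINES else "Y" for base in seq.upper())


def positional_dinucleotides(cds):
    """Count R/Y dinucleotides at codon positions 1-2, 2-3 and 3-1."""
    ry = to_ry(cds[:len(cds) - len(cds) % 3])  # drop a trailing incomplete codon
    counts = {"1-2": Counter(), "2-3": Counter(), "3-1": Counter()}
    for i in range(0, len(ry), 3):
        counts["1-2"][ry[i:i + 2]] += 1
        counts["2-3"][ry[i + 1:i + 3]] += 1
        if i + 3 < len(ry):  # the last codon has no following codon
            counts["3-1"][ry[i + 2:i + 4]] += 1
    return counts


# Hypothetical toy sequence: codons AAG, CTT, GGA recode to RRR, YYY, RRR.
for position, counter in positional_dinucleotides("AAGCTTGGA").items():
    print(position, dict(counter))
```

Dividing the RpR and YpY counts by the total at each position gives the fraction of "grouped" dinucleotides, which is the quantity whose correlation with OGT is discussed above.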

Compositional and sequence biases observed in folding simulations of mRNA

Though RNA molecules have different overall structures, one may expect that there are some common mechanisms providing stability and function of the folded t-, r- and mRNAs. We have shown above that the (G+C) content apparently contributes to the stability of the t- and rRNA (36,59) and not to the stability of coding DNA (54,59,60). It is manifested in the correlation of the former with OGT (Table 2) and the anti-correlation of the base-pairing energy in the folded structures with OGT (Table 3). Nucleotide compositions in coding DNA (hence in mRNA as well) do not correlate with OGT. The mRNA case is of special interest because redundancy of the genetic code endows the nucleotide sequence with a potential to satisfy requirements for DNA, mRNA and protein stability. For example, it has been hypothesized (61) that the optimization of the base-pairing in the mRNA contributes to formation of stem fragments. The authors claimed that there is a corresponding periodic pattern of the mRNA secondary structure in human and mouse, which is determined chiefly by the selection that operates on the third codon position (52). An opposing opinion suggests that the secondary structure in mRNA interferes with translation and therefore should be avoided (62). Overall, three scenarios of the relation between mRNA and proteins have been considered earlier (52): (i) the biases are determined by the demands on protein stability; (ii) mRNA stability is the major determinant of sequence biases; (iii) complete independence of the sequence biases related to mechanisms of stability on each level. We have computationally folded the mRNA sequences from 244 prokaryotic genomes and analyzed their characteristics


(Table 5). All the parameters are reported as relative ones, where the signal in the native sequence is normalized by the value obtained for dicodon-shuffled ones (48). The dicodon shuffling randomizes the mRNA sequence while preserving the native protein sequence, native codon usage and native dinucleotide composition. Therefore, it allows one to separate the peculiarities of mRNA caused by the demands on its stability and function from the others related to the coding DNA and proteins.
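The idea of reporting a native signal relative to a randomized control can be sketched with a deliberately simpler control than the one used here. The code below permutes synonymous codons among the positions of the same amino acid, which preserves the protein sequence and the codon usage but, unlike the dicodon shuffling of (48), does not preserve the dinucleotide composition. All function names, the number of controls and the example measure are assumptions made for illustration.

```python
import random
from collections import defaultdict

BASES = "TCAG"
# Standard genetic code; codons ordered TTT, TTC, TTA, TTG, TCT, ..., GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def split_codons(cds):
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def translate(cds):
    return "".join(CODON_TABLE[codon] for codon in split_codons(cds))


def synonymous_shuffle(cds, rng):
    """Permute codons among positions that encode the same amino acid."""
    codons = split_codons(cds)
    positions = defaultdict(list)
    for index, codon in enumerate(codons):
        positions[CODON_TABLE[codon]].append(index)
    shuffled = list(codons)
    for indices in positions.values():
        pool = [codons[i] for i in indices]
        rng.shuffle(pool)
        for i, codon in zip(indices, pool):
            shuffled[i] = codon
    return "".join(shuffled)


def relative_signal(cds, measure, n_controls=100, seed=0):
    """Native value of `measure` divided by its mean over shuffled controls."""
    rng = random.Random(seed)
    controls = [measure(synonymous_shuffle(cds, rng)) for _ in range(n_controls)]
    mean_control = sum(controls) / n_controls
    return measure(cds) / mean_control if mean_control else float("nan")


def junction_purine_pairs(seq):
    """Example measure: RpR dinucleotides at codon position 3-1."""
    return sum(seq[i] in "AG" and seq[i + 1] in "AG"
               for i in range(2, len(seq) - 1, 3))


# Hypothetical toy coding sequence (DNA alphabet, in frame).
toy_cds = "ATGAAAAAGCTGTTAAAATGA"
print(translate(toy_cds))
print(relative_signal(toy_cds, junction_purine_pairs))
```

A relative value close to 1 means that the native sequence does not differ from the controls for the chosen measure; a full dicodon shuffle would additionally hold the dinucleotide composition fixed.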

Structure of stems in folded mRNA is characterized by phases, which show the relative location of triplets in opposite strands of the stem. Specifically, Phases I, II and III correspond to positioning of triplets where, respectively, the first, second and third nucleotides are complementary. The most and the least frequent pairs in all phases of the dsDNA (see Materials and methods section) are considered. Phases I and II are similar to each other, yielding CG, GC, GU, UG, AU and UA (Table 5). Overall, the Archaea and thermophilic Bacteria have similar trends in the most and least frequent pairs. Mesophilic Bacteria stand out with A13U and G33C as the most frequent pairs in Phases II and III, respectively. The major contribution from the three pairing phases (52) to the total amount of stem pairs is provided by Phase III, and nucleotides in codon position 3 are the most frequent in stem pairs (Table 5 and Supplementary Table S1). OGT correlations highlight the contribution from the GU wobble pair to the thermostability of Archaeal mRNA structures (Table 5 and Supplementary Table S1). We found the highest OGT correlation for the Archaeal pairs U32G (Table 5), confirming the importance of the GU wobble pairs. The stronger effect for Archaeal rather than for Bacterial mRNA is possibly a relic of the ancient RNA world, where RNA was the carrier of genetic information and harsh environmental conditions demanded its increased stability.

Table 6 shows that the purine load (the contents of A+G, R/Y and ApG) is larger in the loop regions than in the stems. In Archaea this difference is slightly more pronounced than in Bacteria (Supplementary Table S2). The purine load is also in good positive correlation with OGT for the loops, but not for the stems. However, the amount of ApG dinucleotides is correlated with OGT in both loops and stems. For synonymous codons, the fractions of purine-rich codons (e.g. GGR versus GGY for glycine and AGR versus CGY for arginine) are correlated with OGT in both loop and stem regions (slightly stronger in loops). For non-synonymous codons, however, the amount of purine-rich codons (GAR for glutamate) is correlated with OGT in both loops and stems, while the fractions of purine-rich AAR codons for lysine do not correlate with OGT (Table 6). While Forsdyke et al. (21) reported an increase of the purine content and its positive correlation with OGT in loops, we found that the fraction of purine-rich synonymous codons is correlated with OGT in stems as well. At the same time, the amount of non-synonymous purine-rich codons of lysine is not correlated with OGT in either loop or stem regions. Additionally, the amount of ApG dinucleotides is correlated with OGT in both loops and stems (Table 6).
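The loop versus stem comparison of purine load can be illustrated with a small sketch. It assumes that the predicted secondary structure is available in dot-bracket notation ('.' for unpaired positions, '(' and ')' for paired ones), which is a common output format of RNA folding programs but is not stated in the text; the hairpin below is a made-up example.

```python
def purine_load(sequence, structure):
    """Return the A+G fraction in (loop, stem) regions.

    `structure` is assumed to be dot-bracket notation aligned to `sequence`:
    '.' marks unpaired (loop) positions, '(' and ')' mark paired (stem) ones.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    sequence = sequence.upper()
    loop = [b for b, s in zip(sequence, structure) if s == "."]
    stem = [b for b, s in zip(sequence, structure) if s in "()"]

    def fraction(bases):
        return sum(b in "AG" for b in bases) / len(bases) if bases else float("nan")

    return fraction(loop), fraction(stem)


# Hypothetical hairpin: a four-base-pair stem closing a purine-only loop.
toy_sequence = "GGCCAAGAGGCC"
toy_structure = "((((....))))"
print(purine_load(toy_sequence, toy_structure))  # (1.0, 0.5)
```

In a fully paired Watson–Crick stem every purine faces a pyrimidine, so the stem purine fraction stays near 0.5, consistent with the stem values in Table 6; only the loops are free to accumulate purines.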

Signals of thermophilic adaptation in protein sequences

It has been shown earlier (22) that environmental temperature directly affects the amino acid composition of prokaryotic proteins, and that the IVYWREL combination can serve as a predictor of the OGT in prokaryotes. We use here a 'Z-score' predictor of OGT (43), which takes into account differences in variances of the proteomic frequencies of individual amino acids (Table 7). We have shown that the Z-score predictor properly corrects for these differences (43), better reflecting the contribution of the amino acid combinations to thermal adaptation. Separate predictors for Archaea and Bacteria (based on 33 and 99 proteomes with OGTs spread over 5–100 and 10–85°C, respectively) yield the same general trend in the increase of hydrophobic and charged residues at the

Table 5. Compositional and dicodon signals of mRNA adaptation purified by the dicodon shuffling

Characteristic                    Archaea                      Bacteria

The most and least (after the comma) frequent pairs in stems of mRNA

Mesophiles                        Phase I:   C23G, G23U        Phase I:   G32U, U32G
                                  Phase II:  C31G, G13U        Phase II:  A13U, G13U
                                  Phase III: A33U, U33G        Phase III: G33C, U33G

Thermophiles                      Phase I:   G32C, G23U        Phase I:   C23G, G23U
                                  Phase II:  C22G, G13U        Phase II:  C31G, G13U
                                  Phase III: A33U, U33G        Phase III: U33A, U33G

Correlations with OGT of
  Segment energy <Esg>            R=-0.71                      R=-0.39
  Energy per base pair <Ebp>      R=-0.73                      R=-0.54

The most significant              Phase I U32G, R=0.77         (U21G)Nat PIII, R=0.46
correlations with OGT             Phase I G23U, R=0.71         (G12U)Nat PIII, R=0.45
                                  Phase I All pairs, R=0.7     (U21G)dShfld PIII, R=0.45
                                                               (G12U)dShfld PIII, R=0.41

Phases I, II and III correspond to positioning of triplets where, respectively, the first, second and third nucleotides are complementary. All signals (except OGT correlations in Bacteria) are normalized by the corresponding values for control sequences after the dicodon shuffling. See explanations of abbreviations in the Materials and methods section.


expense of polar ones (34), showing minor differences in the presence of individual residues.

We have shown earlier (22) that the amino acid bias working in thermal adaptation of proteins does not depend on the overall nucleotide composition of coding DNA sequences. However, one can still expect that nucleotide compositions of particular codon positions and/or dinucleotide compositions of positions 1-2 and 2-3 in codons may be somehow linked to amino acid biases. We indeed found a significant difference between the nucleotide load in individual codon positions of Archaeal and Bacterial coding DNA sequences (Table 1). First, Archaeal protein-coding sequences yield a strong correlation between the (T+G)2 load in the second codon position and OGT (r=0.72). The corresponding non-codon-biased sequences show a comparable correlation (r=0.68), indicating that the (T+G) load reflects tuning of the amino acid composition in connection to thermal adaptation. The main contributor to this correlation is thymine (Table 1), which provides an increase of the strongly hydrophobic residues LMFIV (Supplementary File S1). The (T+G) load is much more weakly correlated with OGT in Bacterial sequences (r=0.36 for natural and r=0.42 for NCB sequences). The correlation of the thymine load with OGT is 0.37 for both natural (NAT) and non-codon-biased (NCB) sequences. This observation agrees with the presumed prevalence of the structure-based strategy of thermal adaptation in Archaeal proteins (29), contrary to the sequence-based one in Bacterial proteomes. Specifically, there is a clear correlation between the amounts of charged residues and the OGT in the set of Bacterial proteomes (Table 7), presumably reflecting the domination of the sequence-based strategy in thermophilic adaptation of Bacteria (29).

Finally, we checked if there exists any specific connection between the dipeptide composition of proteins and that of 3-1 dinucleotides. We found (data not shown) that the strongest correlations between dinucleotides 3-1 and dipeptides exist mostly for the dinucleotides with guanine in position 3. The most frequent pair of amino acids contains methionine (Met, encoded exclusively by the AUG codon) as the first and any polar/charged amino acid as the second one. The preference of a polar/charged amino acid to be next to Met can be explained by the typically surface location of the protein N-termini and its role as a signal for ubiquitination (63).

DISCUSSION

We will make a brief overview of the most pronounced biases and causal relationships in DNA, RNA and proteins. The major signal in nucleotide compositions, specifically the (G+C) load, is related to thermal stabilization of t- and rRNA (36,54,55,59,60) and to discrimination between the coding and non-coding sequences in Archaea (Table 2). Comparative analysis of nucleotide and amino acid compositions in relation to thermophilic adaptation prompts the conclusion that the (G+C) content does not contribute to thermostabilization of coding DNA (18,22,36,54,59), nor does it affect amino acid composition and its thermophilic trends (15,22). When

Table 6. Purine loading in loop and stem regions of folded mRNA and its OGT correlation

Feature            Loop                            Stem                            L-v-S
                   Mean content  OGT correlation   Mean content  OGT correlation   P-value

A+G                0.560         0.59              0.500         -0.26             <2.2E-16
R/Y                1.299         0.61              1.002         -0.26             <2.2E-16
ApG                0.061         0.79              0.051         0.50              0.0002
GGR (glycine)      0.027         0.62              0.045         0.42              1.2E-17
GGY (glycine)      0.017         -0.05             0.079         -0.23             1.0E-42
AGR (arginine)     0.035         0.72              0.022         0.56              2.4E-10
CGR (arginine)     0.018         -0.22             0.040         -0.24             7.1E-09
CGY (arginine)     0.016         -0.25             0.037         -0.26             6.2E-10
GAR (glutamate)    0.060         0.71              0.036         0.59              <2.2E-16
AAR (lysine)       0.086         0.22              0.017         0.11              <2.2E-16
GAY (aspartate)    0.037         -0.17             0.040         -0.37             0.0002

Feature: analyzed nucleotide, dinucleotide or amino acid. Loop and Stem: information on the mean content and OGT correlation of the above. L-v-S: a comparison between the corresponding contents in the loop and stem regions by Wilcoxon tests; P-values are shown in this column. Correlation coefficients with OGT are shown with significance levels at P-value < 0.01 and P-value < 0.0001.

Table 7. Signals of thermophilic adaptation in protein sequences of Archaea and Bacteria

Correlation with OGT                  Archaea                 Bacteria

Z-scored predictor                    ILVW Y DKR (R=0.93)     IPV Y EKR (R=0.89)

The most abundant residues in         VIWL Y ER               VPI Y ER(K)
>70% (>60%) of all Z-scored
predictors

Individual amino acids                + L W                   + E
                                                              - T Q D

Types of amino acids                  + h                     + h c
                                      - p                     - p

Dipeptides                            + hp ph                 + cc
                                      - cp pc                 - pc

+, increase of the amino acid (or amino acid type) fraction with OGT; -, decrease of the amino acid (or amino acid type) fraction with OGT. Capital letters are names of amino acids; h, p, c are hydrophobic, polar, charged types of amino acids. Correlation coefficients between the Z-scored thermostability predictors and OGT are given in parentheses.


separate codon positions are considered for coding DNA, the only compositional bias observed for nucleotide compositions is an excess of (G+C) load in the third codon position in Bacteria. There is a significant excess of guanine and cytosine nucleotides (29.2% and 32.82%) compared to the NCB nucleotide contents (26.09% and 24.66%). A plausible explanation for this bias is the role of the (G+C)3 load in the adaptation to the aerobic life style, which dominates in Bacteria. The (G+C)3 load can be advantageous for several complementary reasons. Indeed, additional GC base pairing will contribute to the stability of the double helix of DNA and the stem regions of RNA molecules. Moreover, G3 bases can work as scavengers of oxidizing agents, providing protection for G bases in other codon positions (56–58). At the same time, this bias would not lead to any changes in amino acid composition, leaving protein structure and stability intact (55).

In this work, we considered compositional and sequence biases in proteins in relation to those in the corresponding nucleic acid sequences and to the phylogeny of species (Archaea or Bacteria). OGT correlations of nucleotides in different codon positions are similar in Archaea and Bacteria, but higher in the former (Table 1). The codon that is optimal from the point of view of thermophilic adaptation (i.e. most correlated with OGT) reads A1 [T/G]2 non-[A/G]3 in Archaea, but [weak A]1 [weak T]2 non-[A/G]3 in Bacteria (Table 4). Codon bias supports the strong correlation between the amount of adenine and OGT in the first codon position in Archaea, with the same but weaker correlation in Bacteria. While we showed earlier (22) that the amino acid thermophilic trend is not determined by the overall nucleotide composition, consideration of separate codon positions reveals an interesting link between positional nucleotide frequencies and amino acid composition in relation to the domain of life. Specifically, the prevalence of thymine in the second codon position and its strong correlation with OGT in Archaea is apparently a result of the demand for enrichment of Archaeal proteins with hydrophobic residues (Table 7). Notably, a massive increase of van der Waals interactions (24,64) was found to be the cornerstone of the structure-based evolutionary strategy of protein thermostability in ancient species (29). In Bacterial proteins, however, there is an increase of the fractions of charged residues with OGT (Table 7). Thus, we observe a transition from the structure-based evolutionary strategy of protein thermostability in Archaea to the sequence-based one in Bacteria, and the corresponding nucleotide and protein compositional biases (Table 7) underlying these strategies (29). The third codon position is characterized by selection against adenine and guanine in Archaea and against adenine in Bacteria, which is a result of this bias.

The most pronounced dinucleotide compositional biases are the excess of homo-dinucleotides and its correlation with OGT in ncDNA, tRNA and rRNA. It is presumably a relic of the ancient primitive homo(poly)nucleotides from which life started. Overrepresentation of the complementary ApG/CpT dinucleotides points to their contribution to DNA/RNA stability via strong base-stacking interactions (65,66). The consideration of dinucleotide biases in purine (R) and pyrimidine (Y) notation also reveals an interesting picture. Overall, for both Archaea and Bacteria, the signature for thermophilic dinucleotides reads RpR/YpY, non-(RpR/YpY), RpR/YpY for codon positions 1-2, 2-3 and 3-1, respectively (Table 4). The major difference between Archaea and Bacteria in this case is that the dinucleotide bias in positions 1-2 is caused by demands on the nucleotide sequence level (and provided by the codon bias) in Archaea, while some other factors work in Bacteria. The preference for homodinucleotides in positions 1-2 and 3-1 can be a possible reason for avoiding homodinucleotides in positions 2-3, because heterodinucleotides can contribute to the tuning of DNA flexibility. Indeed, it has been shown that, given an average flexibility F of homodinucleotides (either RpR or YpY), the flexibilities of the heterodinucleotides YpR and RpY are 2F and 0.5F, respectively (67). The codon positions 2-3 are the most versatile in terms of the relationship between the nucleotide and protein sequences. Therefore, the presence of heterodinucleotides in codon positions 2-3 makes it possible to adjust the flexibility of the nucleotide sequence without changing the physical–chemical characteristics of the encoded amino acid residue. Additionally, we found that the link between dipeptide and corresponding 3-1 dinucleotide frequencies is determined by the amino acid dipeptide. The dominating bias reads as Met followed by a polar or charged residue.

On the RNA level, simple consideration of the nucleotide composition suggests that tRNA and rRNA molecules should be distinguished from mRNA. Indeed, based on the correlation with OGT, we can conclude that (G+C) content is important for providing stability of tRNA and rRNA molecules, but not of mRNA. The role of the (G+C) content was further confirmed by folding simulations (Table 3). We also found other complementary signals in mRNA sequences related to its structure, stability and function. Archaea and Bacteria yield similar trends in the most and least frequent nucleotide pairs, and the third codon position makes a major contribution to the base-pairing mechanism of stem stabilization. Specifically, in addition to Watson–Crick pairing, the GU wobble pairs contribute significantly to the thermal stability of the Archaeal mRNA, in agreement with earlier observations (68,69). Overall, it results in moderate correlation of the segment energy and energy per base pair of the folded mRNA (Table 5). Folding simulations performed in this work also allowed us to analyze peculiarities of the nucleotide contents and structural contacts in stem and loop regions of folded mRNA. Purine load in loop regions is higher than in stems, and this effect is slightly stronger in Archaea than in Bacteria (Table 6). Purine load also correlates with OGT in loops, but not in stems. This correlation was observed earlier by Forsdyke et al. (21), and it was described as 'polite purine load of the loop regions' that prevents undesired mRNA–mRNA single-strand interactions. The authors concluded that increased purine load can affect the codon choice and, consequently, the amino acid composition (21). Our data [as well as careful analysis of the original data in (21)] do not support the latter claim. There is indeed a correlation between the fraction of

Nucleic Acids Research 2014 Vol 42 No 5 2889

by guest on May 6 2014

httpnaroxfordjournalsorgD

ownloaded from

some purine-rich codons [GGR (Gly), AGR (Arg) and GAR (Glu)] and OGT (Table 6). However, the fraction of the purine-rich non-synonymous codons AAR (Lys) is not correlated with OGT (Table 6). Moreover, these codons are synonymous and cannot directly affect the amino acid composition. It has been shown earlier (22,70) that an increase of the Glu and Arg fractions is a trend in protein thermophilic adaptation. At the same time, purine load remains high after elimination of the codon bias, pointing to the amino acid composition as the most probable cause of the former (22). All of the above prompts one to conclude that, in the mutual tuning of nucleotide and amino acid compositions, the purine load does not dominate or determine biases in amino acid composition; if anything, the opposite is true. The analysis of purine load shows that it correlates with OGT in both stems and loops (Table 6) in the case of the purine-rich codons GGR (Gly), AGR (Arg) and GAR (Glu). Further, the amount of ApG dinucleotides strongly correlates with OGT in both loops and stems (Table 6), and ApG provides strong base stacking important for the thermal stability of nucleic acid sequences (22,65). Purine load is therefore apparently a determinant of the base-stacking mechanism in mRNA and/or DNA thermal stability, as well as a result of the thermophilic amino acid composition trend.

Prokaryotes thrive under a temperature interval spanning over a hundred degrees, and they represent two major life styles: aerobic and anaerobic. Analysis of complete Archaeal and Bacterial genomes unraveled compositional and sequence signals related to molecular mechanisms of stability and adaptation, unaffected by selective sequencing or by the comparison of orthologs. Overall, codon bias is stronger in Archaea and is mostly utilized in thermophilic adaptation of nucleic acids. It apparently reflects the longer evolutionary history of Archaea, which presumably started close to the origin of life in hot conditions (29). The codon bias and amino acid sequences (dipeptide composition) work in accord to support enrichment of the nucleotide sequences with ApG/CpT dinucleotides, the determinants of the base-stacking mechanism of nucleic acid stability (65–67). We also found that the second codon position reveals a strong link between the nucleotide and amino acid compositions. Specifically, the excess of thymine in this position is a result of a demand for the enrichment of Archaeal proteins with hydrophobic amino acids. The third codon position in Bacteria is the only case where codon bias is detectable already on the level of pure composition. We found that the (G+C)3 load is related to the aerobic life style dominating in Bacteria. From the point of view of thermophilic adaptation, codon bias works against G in the third codon position in both Archaea and Bacteria. It thus supports a specific role of the third codon position in discriminating between adaptations to temperature and to the aerobic life style. Finally, we obtained an interesting and complex picture of the relationship between the nucleotide composition (purine load) and amino acid composition (selection between Arg and Lys) in relation to thermal adaptation, aerobicity and phylogeny. Specifically, purine load in both Archaea and Bacteria is a result of the 'from both ends of the hydrophobicity scale' trend in thermal adaptation of proteins (34), reflected in the IVYWREL predictor of the OGT (22). According to this predictor, Arg is a preferred amino acid for thermal adaptation, though Lys is the next candidate, present in the top most correlated predictors as well (22,43). In particular, selection for Lys over Arg in some species was shown to be important for the entropic mechanism of protein thermostabilization (70). However, the overall preference for Arg over Lys in the case of thermal adaptation is well manifested in the excess of the purine-rich (AGR) codons of Arg at the expense of the purine-rich ones of Lys (AAR). At the same time, discrimination between Arg and Lys works in the opposite direction in aerobicity, where the Arg codons AGA and AGG are suppressed in favor of the purine-rich AAA and AAG of Lys. It thus reflects a preference for Lys over Arg in aerobes compared to anaerobes in particular, and in Bacteria versus Archaea in general, preserving at the same time the high purine load necessary for providing base stacking in the corresponding nucleic acids (22,53,65). The intricate connection between Lys/Arg and their codons in relation to thermal adaptation and aerobicity exemplifies how selection can work on nucleic acids and proteins simultaneously in response to the demands of different environments. Obviously, the whole picture of molecular mechanisms of adaptation and the relations between them is far from complete. Consideration of other environmental factors, such as salinity, pressure, etc., will help to unravel new mechanisms of stability and their sequence/structure determinants, and to understand the tradeoffs that Nature embraced en route of evolution and adaptation.
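The relationship between purine load and the Arg/Lys codon choice discussed above can be illustrated with a minimal Python sketch. It is not the pipeline used in this work: the function names and the toy sequence are hypothetical, and the code only assumes an in-frame coding sequence written in the A, C, G, T alphabet.

```python
def purine_load(seq):
    """Fraction of purines (A or G) in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "AG" for base in seq) / len(seq)


def arg_lys_purine_codons(cds):
    """Count the purine-rich AGR (Arg) and AAR (Lys) codons in an in-frame CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = {"Arg_AGR": 0, "Lys_AAR": 0}
    for i in range(0, usable, 3):
        codon = cds[i:i + 3]
        if codon in ("AGA", "AGG"):
            counts["Arg_AGR"] += 1
        elif codon in ("AAA", "AAG"):
            counts["Lys_AAR"] += 1
    return counts


# Hypothetical toy sequence: codons AGA, AAG, AGG, TTT
toy_cds = "AGAAAGAGGTTT"
toy_counts = arg_lys_purine_codons(toy_cds)
toy_load = purine_load(toy_cds)
```

Because AGA, AGG, AAA and AAG consist of purines only, exchanging Arg (AGR) for Lys (AAR) codons leaves the purine load of the sequence unchanged, which is consistent with the statement that the Lys preference in aerobes preserves high purine load.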

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Funding for open access charge: Functional Genomics Program (FUGE II), Norwegian Research Council.

Conflict of interest statement. None declared.

REFERENCES

1. Crick,F. (1970) Central dogma of molecular biology. Nature, 227, 561–563.

2. Tawfik,D.S. (2010) Messy biology and the origins of evolutionary innovations. Nat. Chem. Biol., 6, 692–696.

3. Tokuriki,N. and Tawfik,D.S. (2009) Stability effects of mutations and protein evolvability. Curr. Opin. Struct. Biol., 19, 596–604.

4. Tokuriki,N. and Tawfik,D.S. (2009) Protein dynamism and evolvability. Science, 324, 203–207.

5. Koonin,E.V. (2012) Does the central dogma still stand? Biol. Direct, 7, 27.

6. Pe'er,I., Felder,C.E., Man,O., Silman,I., Sussman,J.L. and Beckmann,J.S. (2004) Proteomic signatures: amino acid and oligopeptide compositions differentiate among phyla. Proteins, 54, 20–40.

7. Aravind,L., Tatusov,R.L., Wolf,Y.I., Walker,D.R. and Koonin,E.V. (1998) Evidence for massive gene exchange between archaeal and bacterial hyperthermophiles. Trends Genet., 14, 442–444.

8. Khachane,A.N., Timmis,K.N. and dos Santos,V.A. (2005) Uracil content of 16S rRNA of thermophilic and psychrophilic prokaryotes correlates inversely with their optimal growth temperatures. Nucleic Acids Res., 33, 4016–4022.

9. Suhre,K. and Claverie,J.M. (2003) Genomic correlates of hyperthermostability, an update. J. Biol. Chem., 278, 17198–17202.

10. Tekaia,F. and Yeramian,E. (2006) Evolution of proteomes: fundamental signatures and global trends in amino acid compositions. BMC Genom., 7, 307.

11. Wang,H.C. and Hickey,D.A. (2002) Evidence for strong selective constraint acting on the nucleotide composition of 16S ribosomal RNA genes. Nucleic Acids Res., 30, 2501–2507.

12. Friedman,R., Drake,J.W. and Hughes,A.L. (2004) Genome-wide patterns of nucleotide substitution reveal stringent functional constraints on the protein sequences of thermophiles. Genetics, 167, 1507–1512.

13. Lynn,D.J., Singer,G.A. and Hickey,D.A. (2002) Synonymous codon usage is subject to selection in thermophilic bacteria. Nucleic Acids Res., 30, 4272–4277.

14. Singer,G.A. and Hickey,D.A. (2000) Nucleotide bias causes a genomewide bias in the amino acid composition of proteins. Mol. Biol. Evol., 17, 1581–1588.

15. Singer,G.A. and Hickey,D.A. (2003) Thermophilic prokaryotes have characteristic patterns of codon usage, amino acid composition and nucleotide content. Gene, 317, 39–47.

16. Tekaia,F., Yeramian,E. and Dujon,B. (2002) Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene, 297, 51–60.

17. Roy Chowdhury,A. and Dutta,C. (2012) A pursuit of lineage-specific and niche-specific proteome features in the world of archaea. BMC Genom., 13, 236.

18. Wang,H.C., Susko,E. and Roger,A.J. (2006) On the correlation between genomic G+C content and optimal growth temperature in prokaryotes: data quality and confounding factors. Biochem. Biophys. Res. Commun., 342, 681–684.

19. Wu,H., Zhang,Z., Hu,S. and Yu,J. (2012) On the molecular mechanism of GC content variation among eubacterial genomes. Biol. Direct, 7, 2.

20. Knight,R.D., Freeland,S.J. and Landweber,L.F. (2001) A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes. Genome Biol., 2, RESEARCH0010.

21. Lao,P.J. and Forsdyke,D.R. (2000) Thermophilic bacteria strictly obey Szybalski's transcription direction rule and politely purine-load RNAs with both adenine and guanine. Genome Res., 10, 228–236.

22. Zeldovich,K.B., Berezovsky,I.N. and Shakhnovich,E.I. (2007) Protein and DNA sequence determinants of thermophilic adaptation. PLoS Comput. Biol., 3, e5.

23. Kreil,D.P. and Ouzounis,C.A. (2001) Identification of thermophilic species by the amino acid compositions deduced from their genomes. Nucleic Acids Res., 29, 1608–1615.

24. Berezovsky,I.N., Tumanyan,V.G. and Esipova,N.G. (1997) Representation of amino acid sequences in terms of interaction energy in protein globules. FEBS Lett., 418, 43–46.

25. Cambillau,C. and Claverie,J.M. (2000) Structural and genomic correlates of hyperthermostability. J. Biol. Chem., 275, 32383–32386.

26. Greaves,R.B. and Warwicker,J. (2007) Mechanisms for stabilisation and the maintenance of solubility in proteins from thermophiles. BMC Struct. Biol., 7, 18.

27. Jaenicke,R. (1999) Stability and folding of domain proteins. Prog. Biophys. Mol. Biol., 71, 155–241.

28. Jaenicke,R. and Bohm,G. (1998) The stability of proteins in extreme environments. Curr. Opin. Struct. Biol., 8, 738–748.

29. Berezovsky,I.N. and Shakhnovich,E.I. (2005) Physics and evolution of thermophilic adaptation. Proc. Natl Acad. Sci. USA, 102, 12742–12747.

30. Chakravarty,S. and Varadarajan,R. (2002) Elucidation of factors responsible for enhanced thermal stability of proteins: a structural genomics based study. Biochemistry, 41, 8152–8161.

31. Glyakina,A.V., Garbuzynskiy,S.O., Lobanov,M.Y. and Galzitskaya,O.V. (2007) Different packing of external residues can explain differences in the thermostability of proteins from thermophilic and mesophilic organisms. Bioinformatics, 23, 2231–2238.

32. Thompson,M.J. and Eisenberg,D. (1999) Transproteomic evidence of a loop-deletion mechanism for enhancing protein thermostability. J. Mol. Biol., 290, 595–604.

33. Tokuriki,N., Oldfield,C.J., Uversky,V.N., Berezovsky,I.N. and Tawfik,D.S. (2009) Do viral proteins possess unique biophysical features? Trends Biochem. Sci., 34, 53–59.

34. Berezovsky,I.N., Zeldovich,K.B. and Shakhnovich,E.I. (2007) Positive and negative design in stability and thermal adaptation of natural proteins. PLoS Comput. Biol., 3, e52.

35. Bharanidharan,D., Bhargavi,G.R., Uthanumallian,K. and Gautham,N. (2004) Correlations between nucleotide frequencies and amino acid composition in 115 bacterial species. Biochem. Biophys. Res. Commun., 315, 1097–1103.

36. Nakashima,H., Fukuchi,S. and Nishikawa,K. (2003) Compositional changes in RNA, DNA and proteins for bacterial adaptation to higher and lower temperatures. J. Biochem. (Tokyo), 133, 507–513.

37. Dehouck,Y., Folch,B. and Rooman,M. (2008) Revisiting the correlation between proteins' thermoresistance and organisms' thermophilicity. Protein Eng. Des. Sel., 21, 275–278.

38. Folch,B., Dehouck,Y. and Rooman,M. (2010) Thermo- and mesostabilizing protein interactions identified by temperature-dependent statistical potentials. Biophys. J., 98, 667–677.

39. Folch,B., Rooman,M. and Dehouck,Y. (2008) Thermostability of salt bridges versus hydrophobic interactions in proteins probed by statistical potentials. J. Chem. Inf. Model., 48, 119–127.

40. Gonnelli,G., Rooman,M. and Dehouck,Y. (2012) Structure-based mutant stability predictions on proteins of unknown structure. J. Biotechnol., 161, 287–293.

41. Ponnuswamy,P., Muthusamy,R. and Manavalan,P. (1982) Amino acid composition and thermal stability of globular proteins. Int. J. Biol. Macromol., 4, 186–190.

42. Berezovsky,I.N. (2011) The diversity of physical forces and mechanisms in intermolecular interactions. Phys. Biol., 8, 035002.

43. Ma,B.G., Goncearenco,A. and Berezovsky,I.N. (2010) Thermophilic adaptation of protein complexes inferred from proteomic homology modeling. Structure, 18, 819–828.

44. Makarova,K.S. and Koonin,E.V. (2005) Evolutionary and functional genomics of the Archaea. Curr. Opin. Microbiol., 8, 586–594.

45. Novichkov,P.S., Wolf,Y.I., Dubchak,I. and Koonin,E.V. (2009) Trends in prokaryotic evolution revealed by comparison of closely related bacterial and archaeal genomes. J. Bacteriol., 191, 65–73.

46. Koonin,E.V. and Wolf,Y.I. (2008) Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res., 36, 6688–6719.

47. Koonin,E.V., Mushegian,A.R., Galperin,M.Y. and Walker,D.R. (1997) Comparison of archaeal and bacterial genomes: computer analysis of protein sequences predicts novel functions and suggests a chimeric origin for the archaea. Mol. Microbiol., 25, 619–637.

48. Katz,L. and Burge,C.B. (2003) Widespread selection for local RNA secondary structure in coding regions of bacterial genes. Genome Res., 13, 2042–2051.

49. Hofacker,I.L., Priwitzer,B. and Stadler,P.F. (2004) Prediction of locally stable RNA secondary structures for genome-wide surveys. Bioinformatics, 20, 186–190.

50. Zuker,M. and Stiegler,P. (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res., 9, 133–148.

51. Mathews,D.H., Sabina,J., Zuker,M. and Turner,D.H. (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol., 288, 911–940.

52. Shabalina,S.A., Ogurtsov,A.Y. and Spiridonov,N.A. (2006) A periodic pattern of mRNA secondary structure created by the genetic code. Nucleic Acids Res., 34, 2428–2437.

53. Marmur,J. and Doty,P. (1962) Determination of the base composition of deoxyribonucleic acid from its thermal denaturation temperature. J. Mol. Biol., 5, 109–118.

54. Hurst,L.D. and Merchant,A.R. (2001) High guanine-cytosine content is not an adaptation to high temperature: a comparative analysis amongst prokaryotes. Proc. Biol. Sci., 268, 493–497.

55. Naya,H., Romero,H., Zavala,A., Alvarez,B. and Musto,H. (2002) Aerobiosis increases the genomic guanine plus cytosine content (GC) in prokaryotes. J. Mol. Evol., 55, 260–264.

56. Beckman,K.B. and Ames,B.N. (1997) Oxidative decay of DNA. J. Biol. Chem., 272, 19633–19636.

57. Cheng,K.C., Cahill,D.S., Kasai,H., Nishimura,S. and Loeb,L.A. (1992) 8-Hydroxyguanine, an abundant form of oxidative DNA damage, causes G→T and A→C substitutions. J. Biol. Chem., 267, 166–172.

58. Vieira-Silva,S. and Rocha,E.P. (2008) An assessment of the impacts of molecular oxygen on the evolution of proteomes. Mol. Biol. Evol., 25, 1931–1942.

59. Galtier,N. and Lobry,J.R. (1997) Relationships between genomic G+C content, RNA secondary structures, and optimal growth temperature in prokaryotes. J. Mol. Evol., 44, 632–636.

60. Musto,H., Naya,H., Zavala,A., Romero,H., Alvarez-Valin,F. and Bernardi,G. (2004) Correlations between genomic GC levels and optimal growth temperatures in prokaryotes. FEBS Lett., 573, 73–77.

61. Fitch,W.M. (1974) The large extent of putative secondary nucleic acid structure in random nucleotide sequences or amino acid derived messenger-RNA. J. Mol. Evol., 3, 279–291.

62. Workman,C. and Krogh,A. (1999) No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. Nucleic Acids Res., 27, 4816–4822.

63. Bachmair,A., Finley,D. and Varshavsky,A. (1986) In vivo half-life of a protein is a function of its amino-terminal residue. Science, 234, 179–186.

64. Berezovsky,I.N., Namiot,V.A., Tumanyan,V.G. and Esipova,N.G. (1999) Hierarchy of the interaction energy distribution in the spatial structure of globular proteins and the problem of domain definition. J. Biomol. Struct. Dyn., 17, 133–155.

65. Yakovchuk,P., Protozanova,E. and Frank-Kamenetskii,M.D. (2006) Base-stacking and base-pairing contributions into thermal stability of the DNA double helix. Nucleic Acids Res., 34, 564–574.

66. Friedman,R.A. and Honig,B. (1995) A free energy analysis of nucleic acid base stacking in aqueous solution. Biophys. J., 69, 1528–1535.

67. Okonogi,T.M., Alley,S.C., Reese,A.W., Hopkins,P.B. and Robinson,B.H. (2002) Sequence-dependent dynamics of duplex DNA: the applicability of a dinucleotide model. Biophys. J., 83, 3446–3459.

68. Masquida,B. and Westhof,E. (2000) On the wobble GoU and related pairs. RNA, 6, 9–15.

69. Varani,G. and McClain,W.H. (2000) The G x U wobble base pair. A fundamental building block of RNA structure crucial to RNA function in diverse biological systems. EMBO Rep., 1, 18–23.

70. Berezovsky,I.N., Chen,W.W., Choi,P.J. and Shakhnovich,E.I. (2005) Entropic stabilization of proteins and its proteomic consequences. PLoS Comput. Biol., 1, e47.

Table 1. Continued

ARCHAEA | BACTERIA
Characteristic, Value | Characteristic, Value

GpG R=0.44
CpC R=0.80

rRNA
ApA R=0.92 | ApA R=0.67
TpA R=0.88 | CpC R=0.52
GpG R=0.72
CpC R=0.83

D. Position-dependent dinucleotide composition ratios (Freq), dinucleotide contrasts and their OGT correlations

Codon position 1-2
Freq(ApG)_Nat/NCB 1.35 (na) | Freq(CpG)_Nat/NCB 1.27 (na)
Freq(CpG)_Nat/NCB 0.62 (na)
ApG 1.02 | TpT 1.45
ApG_NCB 0.79 | TpT_NCB 1.49
CpG 0.76 | TpA 0.69
CpG_NCB 1.17 | TpA_NCB 0.65
GpT 0.67
GpT_NCB 0.67
ApG R=0.77
ApG_NCB R=0.11
GpC R=0.55
GpC_NCB R=0.44
CpC R=0.37+
CpC_NCB R=0.37+
ApG_Nat/NCB R=0.63
GpC_Nat/NCB R=0.60
CpC_Nat/NCB R=0.37+
RpR R=0.40 | RpR R=0.33
RpR_NCB R=0.15 | RpR_NCB R=0.27+
YpY R=0.34 | YpY R=0.39
YpY_NCB R=0.05+ | YpY_NCB R=0.33
RpR_Nat/NCB R=0.59 | RpR_Nat/NCB R=0.37
YpY_Nat/NCB R=0.64 | YpY_Nat/NCB R=0.45

Codon position 2-3
TpA R=0.63 | ApA 1.50
TpA_NCB R=0.28 | ApA_NCB 1.05
CpT R=0.61 | GpC 1.33
CpT_NCB R=0.68 | GpC_NCB 0.98
ApG R=0.69 | ApC_Nat/NCB R=0.52
ApG_NCB R=0.37+ | GpA_Nat/NCB R=0.41
RpR R=0.20 | RpR R=0.51
RpR_NCB R=0.65 | RpR_NCB R=0.73
YpY R=0.28 | YpY R=0.58
YpY_NCB R=0.66 | YpY_NCB R=0.75
RpR_Nat/NCB R=0.39
YpY_Nat/NCB R=0.43

Codon position 3-1
CpT R=0.60
CpT_Shfld R=0.05
ApG R=0.53
ApG_Shfld R=0.02
RpR R=0.30+ | RpR R=0.48
RpR_Shfld R=0.20 | RpR_Shfld R=0.01
YpY R=0.35+ | YpY R=0.56
YpY_Shfld R=0.06 | YpY_Shfld R=0.17

Pearson correlation coefficient is denoted by R. DCT is calculated as the ratio of dinucleotide frequency to the product of frequencies of the corresponding independent nucleotides. Nucleotides with purine (A or G) and pyrimidine (T or C) bases are denoted with R and Y, respectively. The lower index distinguishes the values observed for natural sequences (Nat), sequences with eliminated codon bias (NCB) and values observed after shuffling of amino acid sequences (Shfld). If the lower index is omitted, the value is given for the natural sequences. The P-values for correlations and for the dinucleotide contrast t-tests (H0: DCT is 1.0) are shown in superscripts as significance levels: +P-value < 0.05, < 0.01, < 0.0001. Supplementary Files S5 and S7 list all P-values for position-specific correlations, SD for compositions and the P-values for t-tests. Supplementary Files S8 and S9 show the P-values for position-independent contrast t-tests and correlations.

content increases and energy per base pair decreases with temperature in stems of folded t- and rRNAs (Table 3). These data point to base-pairing interactions as an important contributor to thermal stabilization (53) of t- and rRNAs of prokaryotes, and it is more strongly manifested in Archaea than in Bacteria.

Dinucleotide composition of nucleic acid sequences

Dinucleotide contrasts (DCTs) show the ratio of observed dinucleotide frequencies to the expected ones, given the composition of individual nucleotides. Coupling of the same nucleotides (Table 1) is preferred in ncDNA sequences in Archaea (ApA/TpT and CpC/GpG) and Bacteria (ApA/TpT). ApA and CpC dinucleotides are prevalent in t- and rRNA of Archaea and in rRNA of Bacteria. Other outstanding contrasts in Archaea are found in ncDNA (GpT/ApC) and tRNA (TpC, ApC and TpG). In Bacteria, ApG and TpC are prevalent in tRNA (Table 1), while GpC is preferred in both non-coding and coding sequences. In coding sequences of Archaea, there is no preference for coupling of similar nucleotides, while in Bacteria it is found for ApA and TpT (Table 1).

In both coding and ncDNA sequences of Archaea, there is a clear excess of the complementary ApG and CpT dinucleotides, which persists after elimination of the codon bias and reshuffling of the amino acid sequence (Table 1). In Bacteria, the excess of GpG dinucleotides in non-coding and coding sequences is the most pronounced (Table 1), and it is provided by the codon bias in coding sequences. Pairing of similar nucleotides is highly correlated with OGT in tRNA (ApA, TpT, GpG and CpC) and rRNA (ApA, GpG and CpC) in Archaea. Correlation of the same nucleotide pairs is also found in Bacteria, though it is found for fewer pairs and is weaker (tRNA: ApA; rRNA: ApA and CpC). In Archaeal t- and rRNA, there is also a strong correlation of TpA pairs with OGT (Table 1).
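As an illustration of the DCT definition (observed dinucleotide frequency divided by the product of the frequencies of the two constituent nucleotides), the following minimal Python sketch computes contrasts for a single sequence. The function name and the toy input are hypothetical; the sketch counts overlapping dinucleotides on one strand and does not reproduce the NCB or shuffling controls used in this work.

```python
from collections import Counter


def dinucleotide_contrasts(seq):
    """Return {dinucleotide: observed frequency / expected frequency}."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two nucleotides")
    mono = Counter(seq)                                  # single-nucleotide counts
    di = Counter(seq[i:i + 2] for i in range(n - 1))     # overlapping dinucleotides
    contrasts = {}
    for pair, count in di.items():
        f_obs = count / (n - 1)
        f_exp = (mono[pair[0]] / n) * (mono[pair[1]] / n)
        contrasts[pair] = f_obs / f_exp
    return contrasts


# Hypothetical toy sequence; values above 1 indicate overrepresentation
toy_dct = dinucleotide_contrasts("AGAG")
```

A contrast of 1.0 means that the dinucleotide occurs exactly as often as expected from the nucleotide composition alone, which is the null hypothesis (H0: DCT is 1.0) of the t-tests reported in Table 1.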

Correlation between nucleotide frequencies and temperature in different codon positions

The first codon position in Archaea is characterized by a high correlation of the ratio of natural to NCB frequencies of adenine with OGT [A_Nat/NCB, R = 0.69; Table 1]. The second codon position reveals a correlation of the thymine frequency with OGT (Table 1). The combination of thymine and guanine nucleotides is also correlated with OGT; however, thymine is the major contributor (Table 1). The correlation of the guanine content with OGT is supported by the codon bias (Table 1). The combination of guanine with cytosine is also correlated with OGT in the second position of Archaeal sequences and is provided by the codon bias. The third position reveals strong selection against adenine and guanine in relation to OGT (Table 1). Bringing together the (anti-)correlations with OGT in different codon positions, one can draw the combined triplet that is optimal from the point of view of thermal adaptation (Table 4) as [A]1 [TG]2 [non-(AG)]3. The prevalence of thymine in the second codon position is linked to an excess of hydrophobes. Indeed, codons with thymine in the second position encode Ile, Leu, Met, Phe and Val. These are strongly hydrophobic residues and the aromatic Phe, which can form many van der Waals contacts and contribute

Table 2. Nucleotide compositions and their OGT correlations in DNA and RNAs

Domain of life | Nucleotide | cDNA_Nat | cDNA_NCB | ncDNA | tRNA (R) | rRNA (R)
Archaea | A | 28.45 | 27.94 | 30.73 | 17.13 (-0.76) | 23.68 (-0.85)
Archaea | T | 23.89 | 24.44 | 30.65 | 18.17 (-0.89) | 19.24 (-0.88)
Archaea | G | 26.00 | 26.12 | 19.30 | 33.91 (0.84) | 32.08 (0.94)
Archaea | C | 21.66 | 21.51 | 19.31 | 30.80 (0.84) | 24.99 (0.76)
Bacteria | A | 24.05 | 26.42 | 26.57 | 19.66 (-0.44) | 26.09 (-0.52)
Bacteria | T | 22.68 | 23.70 | 26.61 | 21.56 (-0.51) | 20.68 (-0.77)
Bacteria | G | 27.50 | 26.66 | 23.41 | 31.21 (0.53) | 31.10 (0.72)
Bacteria | C | 25.77 | 23.23 | 23.41 | 27.58 (0.39) | 22.13 (0.60)

The numbers represent the average frequencies of nucleotides in the corresponding parts of genomes, while the numbers in parentheses are correlation coefficients (R) of nucleotide frequencies with OGT. The most important biases and correlations are shown in bold font. The P-values for correlations (H0: correlation coefficient R = 0) and for the nucleotide composition t-tests (H0: mean frequency is 0.25) are shown in superscripts as significance levels: P-value < 0.01, < 0.0001. Supplementary Files S8 and S9 list all correlations and composition tests. cDNA_Nat, natural nucleotide composition in coding DNA; cDNA_NCB, nucleotide composition in coding sequences with eliminated codon bias.

Table 3. OGT correlations in r- and t-RNA observed in folding simulations

Domain of life | RNA type | R((G+C), OGT) (P-value) | R(<Ebp>, OGT) (P-value)
Archaea | rRNA | 0.89 (<10^-22) | -0.93 (1.1 × 10^-20)
Archaea | tRNA | 0.84 (1.84 × 10^-13) | -0.71 (3.34 × 10^-8)
Bacteria | rRNA | 0.73 (1.38 × 10^-14) | -0.66 (1 × 10^-11)
Bacteria | tRNA | 0.53 (3.1 × 10^-7) | -0.51 (9.1 × 10^-7)

R((G+C), OGT), correlation coefficient between the (G+C) content and the OGT; R(<Ebp>, OGT), correlation coefficient between the free energy of RNA folding averaged per base pair and OGT.

thus to the packing of the hydrophobic core. Notably, elimination of the codon bias does not affect the correlation of the thymine fraction with OGT (R = 0.55 in both native and NCB sequences). It shows that the excess of thymine in the second codon position is a result of selection on the protein level. An apparent explanation for such selection is the domination of the structure-based strategy in thermostabilization of Archaeal proteins. This strategy is characterized by increased compactness of the hydrophobic core, provided by massive van der Waals contacts. The correlation of the combination of A and G with OGT is not provided by tuning of the codon bias, which points to adaptation on a level other than DNA.
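The statement that codons with thymine in the second position encode only Ile, Leu, Met, Phe and Val can be checked directly against the standard genetic code. The short Python sketch below builds the standard code table (assumed here; alternative translation tables are not considered) and lists the residues by second-position nucleotide. The function name is hypothetical.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering of codon positions 1, 2, 3
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
GENETIC_CODE = {"".join(codon): aa
                for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def residues_by_second_base(base):
    """Sorted one-letter codes of residues whose codons carry `base` in position 2."""
    return sorted({aa for codon, aa in GENETIC_CODE.items()
                   if codon[1] == base and aa != "*"})


t2_residues = residues_by_second_base("T")  # Phe, Ile, Leu, Met, Val
```

All 16 codons with T in the second position map to these five hydrophobic residues, so an excess of T2 necessarily reflects an excess of hydrophobic amino acids, independently of synonymous codon choice.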

Bacterial coding DNA sequences yield fewer position-dependent correlations with OGT than Archaeal ones. The ratio of the natural adenine frequency to the one for eliminated codon bias is weakly correlated with OGT (Table 1). There are also moderate correlations of the thymine frequency and of the thymine and guanine combination with OGT in the second codon position (Table 1). The third codon position is characterized by selection against adenine provided by the codon bias (Table 1). However, the combination of adenine and guanine is correlated with OGT. It does not depend on the codon bias to a large extent (same as in Archaea), pointing to a possible signal of adaptation on a level other than the protein-coding DNA sequence. The generalized codon in Bacteria (Table 4) characterizing the thermophilic trends therefore reads [weak A]1 [weak T]2 [non-(AG)]3.
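The position-dependent analysis described in this section can be sketched as follows: compute the frequency of a nucleotide at a given codon position for each genome, then correlate these frequencies with OGT. This is a minimal illustration, not the procedure used in this work; the function names are hypothetical, the three toy sequences and their OGT values are invented for the example, and the NCB control is not reproduced.

```python
from math import sqrt


def positional_frequency(cds, nucleotide, position):
    """Fraction of codons in an in-frame CDS with `nucleotide` at codon position 1, 2 or 3."""
    if position not in (1, 2, 3):
        raise ValueError("position must be 1, 2 or 3")
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    bases = cds.upper()[position - 1:usable:3]
    if not bases:
        raise ValueError("sequence contains no complete codon")
    return bases.count(nucleotide.upper()) / len(bases)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


# Invented toy data: (coding sequence, OGT in degrees Celsius)
toy_genomes = [("ATGGCTAAAGAT", 30.0), ("ATTACTAAAATA", 60.0), ("ATAATTAAAACT", 85.0)]
a1 = [positional_frequency(cds, "A", 1) for cds, _ in toy_genomes]
ogt = [temperature for _, temperature in toy_genomes]
r_a1 = pearson_r(a1, ogt)  # correlation of A in codon position 1 with OGT
```

In a real analysis each data point would be a whole genome, and the same calculation would be repeated for every nucleotide and codon position.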

The role of excessive (G+C) load in the third codon position in adaptation to aerobicity

In both Archaea and Bacteria, the average nucleotide composition in different codon positions is not affected by the codon bias to a large extent, except for the third codon position in Bacteria. However, the compositional variance in the third position is essentially destroyed when there is no codon bias [standard deviation (SD) diminishes from >10 to <1, see Supplementary File S5]. The natural (G+C)3 load in Bacteria yields a 9% excess in comparison with NCB sequences (P-value = 1.7e-07, Supplementary File S6), while in Archaea there is no significant difference. We found that, despite frequent involvement of G3 and C3 nucleotides in the GC/CG pairs, the (G+C)3 load does not play a crucial role in the thermal stability of mRNA (52,54). Overall, the difference in (G+C)3 load between Archaea and Bacteria is insignificant. If we consider separate thermal groups, the only difference appears in mesophilic Bacteria (9.5% more in Bacteria, P = 0.09). However, when we take into account the oxygen tolerance factor, the difference in (G+C)3 load becomes extremely pronounced between aerobic (70.13% in natural and 50.84% in NCB) and anaerobic (48.94% in natural and 50.46% in NCB) species (P-value = 2e-09). Considering the lack of OGT correlation in protein-coding nucleotides (except thymine in the second codon position), one can conclude that the increased (G+C)3 load is a result of the aerobic life style rather than thermal adaptation (Supplementary Files S2 and S3). The comparison of aerobes and anaerobes in the group of mesophilic Bacteria corroborates this conclusion (Supplementary File S6). The difference is >20% of (G+C)3 content: 70.97% in aerobes and 50.51% in anaerobes (P = 0.000178). This adaptation is entirely driven by the codon bias, because if the codon bias is removed, the (G+C)3 bias, as well as the difference between aerobes and anaerobes, disappears (50.87% and 50.64%).

We also compared the usage of codons with G or C in the third position with that of other codons ('Codon Usage' in Supplementary Files S3 and S6). There are three amino acids encoded by six codons (Leu, Arg, Ser), five by four codons (Ala, Pro, Thr, Gly, Val), one by three codons (Ile), nine by two codons (Lys, Asn, Asp, Phe, Cys, Gln, Glu,

Table 4. Generalized nucleotides and dinucleotides in different codon positions favorable for thermostability

Nucleotides correlated with OGT

Domain of life | Codon position | 1 | 2 | 3
Archaea | Nucleotide | A | TG | Non-[AG]
Archaea | Origin of the bias | Codon bias | T: amino acid; G: codon bias | Against codon bias
Bacteria | Nucleotide | Weak A | Weak T | Non-[AG]
Bacteria | Origin of the bias | Codon bias | Amino acid | Against codon bias

Dinucleotides in purine (R) and pyrimidine (Y) notation correlated with OGT

Domain of life | Codon position | 1-2 | 2-3 | 3-1
Archaea | Dinucleotide | RpR/YpY | Non-RpR/YpY | RpR/YpY
Archaea | Origin of the bias | Codon bias | Codon bias | Amino acid
Bacteria | Dinucleotide | RpR/YpY | Non-RpR/YpY | RpR/YpY
Bacteria | Origin of the bias | Not codon bias | Codon bias | Amino acid

Part 1: thermophilic-prone nucleotide biases. Columns 1, 2, 3 contain information on favorable nucleotides and the origin of the bias in codon positions 1, 2, 3. Part 2: thermophilic-prone dinucleotide biases. Columns 1-2, 2-3, 3-1 contain information on favorable dinucleotides and the origin of the bias in codon positions 1-2, 2-3, 3-1.

His, Tyr) and two by one codon (Met, Trp). Typically, half of the synonymous codons of each amino acid have G or C in the third codon position and the others have A or T. It appeared that almost all of the codons with G or C in the third position have higher frequencies in aerobes compared to anaerobes, and in Bacteria compared to Archaea (see 'Codon Usage' in Supplementary File S3). This observation holds for all codons of the 'two-codon' amino acids and for the majority of codons of the three-, four- and six-codon amino acids. The noticeable exception is the significant suppression of the AGA/AGG codons of Arg (aerobes–anaerobes 177–125, P-values 4.46e-8 and 5.8e-8; Bacteria–Archaea 174–2625, P-values 1.65e-7 and 1.14e-11). In lysine, the codon AAA is suppressed in aerobes (difference 1716, P-value 6.44e-6), while the codon AAG is preferred. The AAG codon of lysine is favorable in aerobes (aerobes/anaerobes 143), and the other lysine codon, AAA, can be turned into AAG by only one mutation in the third codon position. Therefore, one can speculate that the demand for discriminating between Arg and Lys is the most probable cause of the suppression of the AGA and AGG codons of arginine in aerobes. Lysine is thus preferred over arginine in aerobic adaptation (Supplementary Files S3 and S6). The contribution of the amino acid in relation to thermophilic adaptation is not compromised, because of the similarity between the physical–chemical characteristics of lysine and arginine from the point of view of thermostability. Overall, an excess of the G3 codons is always more pronounced in aerobes compared to anaerobes, which corroborates the conclusion that the (G+C)3 load is an indication of the aerobic life style (55–58). The similar trend in the difference between Bacteria and Archaea is a result of the domination of the aerobic life style in Bacteria in the analyzed dataset.
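The codon-degeneracy classes listed above and the (G+C)3 load itself are straightforward to compute. The following Python sketch assumes the standard genetic code; the function names are hypothetical and the code is an illustration rather than the procedure used in this work.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code in the conventional TCAG ordering of codon positions 1, 2, 3
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
GENETIC_CODE = {"".join(codon): aa
                for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def degeneracy_classes():
    """Map the number of synonymous codons to the sorted list of amino acids."""
    per_residue = Counter(aa for aa in GENETIC_CODE.values() if aa != "*")
    classes = defaultdict(list)
    for aa, n_codons in per_residue.items():
        classes[n_codons].append(aa)
    return {n: sorted(residues) for n, residues in classes.items()}


def gc3_load(cds):
    """Fraction of codons in an in-frame CDS with G or C in the third position."""
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    third = cds.upper()[2:usable:3]
    if not third:
        raise ValueError("sequence contains no complete codon")
    return sum(base in "GC" for base in third) / len(third)


classes = degeneracy_classes()
class_sizes = {n: len(residues) for n, residues in classes.items()}
```

The resulting class sizes (three six-codon, five four-codon, one three-codon, nine two-codon and two one-codon amino acids) match the breakdown given in the text, and `gc3_load` applied to each gene or genome gives the quantity compared between aerobes and anaerobes.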

Correlation between dinucleotide frequencies in different codon positions and OGT

In order to analyze dinucleotides in different codon positions (1-2, 2-3 and 3-1), we used the DCTNat, DCTNCB (for positions 1-2 and 2-3) and shuffled (DCTShfld, for position 3-1) sequences. We considered correlations of these contrasts with OGT. The most pronounced biases are found for dinucleotides in coding sequences of Archaea. The ApG dinucleotides show the strongest excess in all positions of natural sequences compared to NCB and shuffled ones. In codon positions 1-2 and 2-3, this compositional peculiarity and its correlation with OGT are provided by the codon bias (Table 1). In position 3-1, the high correlation of ApG frequencies with OGT is determined by the amino acid sequences, and it vanishes after amino acid reshuffling (Table 1). There is also a correlation of the frequency of the complementary CpT dinucleotide in position 3-1 with OGT, which disappears after reshuffling of amino acid sequences (Table 1). The most plausible role of ApG dinucleotides is a contribution to stabilization via base-stacking interactions between the purine rings (22). This conclusion is supported by the fact that the excess of ApG dinucleotides is provided by the codon bias, thus manifesting adaptation on the DNA level. The complementary CpT dinucleotides indicate enrichment of the anti-sense strand of double-stranded DNA (dsDNA) with the stabilizing ApG ones. The resulting mosaic of ApG stacking in sense and anti-sense strands can provide stabilization over long distances without compromising flexibility of the dsDNA. Other overrepresented dinucleotides (regardless of the codon bias) correlated with OGT in positions 1-2 are GpC, CpC, TpA and CpT (Table 1). The frequencies of TpA and CpT dinucleotides are correlated with OGT in position 2-3, where the former is provided by the codon bias and the latter is not.

In purine (R) and pyrimidine (Y) notation, codon bias provides grouping (stacking) of similar nucleotides in position 1-2, which increases with OGT (Table 1). At the same time, codon bias works against such grouping in position 2-3 (Table 1). Amino acid sequence is responsible for the grouping of similar nucleotides and correlations with OGT in position 3-1 (Table 1). The resulting generalized thermophilic pattern of dinucleotides in Archaeal cDNA sequences reads [RpR, YpY]1-2 [non-RpR, non-YpY]2-3 [RpR/YpY]3-1. Most of the compositional peculiarities found in Bacteria are not determined by the codon bias, yielding weak correlations with temperature (Table 1). The thermophilic pattern of dinucleotides in Bacteria (Table 4) is the same as the Archaeal one. The difference from Archaea is that selection for RpR and YpY dinucleotides in position 1-2 is not determined by the codon bias in Bacteria (Tables 1 and 4).
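To make the position-specific bookkeeping concrete, the following Python sketch counts dinucleotides separately for codon position pairs 1-2, 2-3 and 3-1, optionally in purine/pyrimidine (R/Y) notation. It is an illustration under stated assumptions: the function names are hypothetical, the input is assumed to be an in-frame coding sequence containing only A, C, G and T/U, and the comparison with NCB or shuffled controls described in the text is not reproduced.

```python
from collections import Counter

PURINES = set("AG")


def to_ry(base):
    """Map a nucleotide to purine (R) or pyrimidine (Y); assumes A, C, G or T only."""
    return "R" if base in PURINES else "Y"


def dinucleotide_counts(cds, ry=False):
    """Count dinucleotides per codon position pair: 1-2, 2-3 and 3-1.

    The 3-1 pair joins the third base of a codon with the first base
    of the next codon.
    """
    seq = cds.upper().replace("U", "T")
    seq = seq[:len(seq) - len(seq) % 3]
    labels = ("1-2", "2-3", "3-1")
    counts = {label: Counter() for label in labels}
    for i in range(len(seq) - 1):
        a, b = seq[i], seq[i + 1]
        if ry:
            a, b = to_ry(a), to_ry(b)
        counts[labels[i % 3]][a + "p" + b] += 1
    return counts


# Hypothetical toy sequence: ATG AAG CGC
print(dinucleotide_counts("ATGAAGCGC"))
print(dinucleotide_counts("ATGAAGCGC", ry=True))
```

Dividing each count by the total for its position pair gives frequencies that could then be correlated with OGT across genomes.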

Compositional and sequence biases observed in folding simulations of mRNA

Though RNA molecules have different overall structures, one may expect that there are some common mechanisms providing stability and function of the folded t-, r- and mRNAs. We have shown above that the (G+C) content apparently contributes to the stability of the t- and rRNA (36,59) and not to the stability of coding DNA (54,59,60). It is manifested in the correlation of the former with OGT (Table 2) and the anti-correlation of the base-pairing energy in the folded structures with OGT (Table 3). Nucleotide compositions in coding DNA (hence in mRNA as well) do not correlate with OGT. The mRNA case is of special interest, because redundancy of the genetic code endows the nucleotide sequence with a potential to satisfy requirements for DNA, mRNA and protein stability. For example, it has been hypothesized (61) that the optimization of the base-pairing in the mRNA contributes to formation of stem fragments. The authors claimed that there is a corresponding periodic pattern of the mRNA secondary structure in human and mouse, which is determined chiefly by the selection that operates on the third codon position (52). An opposing opinion suggests that the secondary structure in mRNA interferes with translation and therefore should be avoided (62). Overall, three scenarios of the relation between mRNA and proteins have been considered earlier (52): (i) the biases are determined by the demands on protein stability; (ii) mRNA stability is the major determinant of sequence biases; (iii) complete independence of the sequence biases related to mechanisms of stability on each level. We have computationally folded the mRNA sequences from 244 prokaryotic genomes and analyzed their characteristics

(Table 5). All the parameters are reported as relative ones, where the signal in the native sequence is normalized by the value obtained for dicodon-shuffled ones (48). The dicodon shuffling randomizes the mRNA sequence while preserving the native protein sequence, native codon usage and native dinucleotide composition. Therefore, it allows one to separate peculiarities of mRNA caused by the demands on its stability and function from the others related to the coding DNA and proteins.
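The three invariants named above (protein sequence, codon usage and dinucleotide composition) can be checked for any candidate control sequence with a short Python sketch. It does not implement the dicodon shuffling algorithm of (48); it only verifies what a shuffled control is supposed to preserve. The function names are hypothetical, translation assumes the standard genetic code, and the dinucleotide composition is assumed to be counted over all adjacent nucleotide pairs.

```python
from collections import Counter

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codons(seq):
    """Split an in-frame sequence into codons (a trailing partial codon is dropped)."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def translate(seq):
    """Translate an in-frame sequence containing only A, C, G, T/U."""
    return "".join(CODON_TABLE[c] for c in codons(seq))


def dinucleotides(seq):
    """Counts of all adjacent nucleotide pairs."""
    seq = seq.upper().replace("U", "T")
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def shuffle_invariants(native, shuffled):
    """Report which properties of the native sequence the control preserves."""
    return {
        "protein": translate(native) == translate(shuffled),
        "codon_usage": Counter(codons(native)) == Counter(codons(shuffled)),
        "dinucleotides": dinucleotides(native) == dinucleotides(shuffled),
    }


# A plain synonymous-codon permutation keeps the protein and codon usage
# but, in this hypothetical example, not the dinucleotide composition.
print(shuffle_invariants("GCTGCAGCTGCA", "GCAGCTGCAGCT"))
```

A control that passes all three checks differs from the native sequence only in the arrangement that the shuffling is meant to randomize, which is what makes the native-to-shuffled ratio interpretable.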

Structure of stems in folded mRNA is characterized by phases, which show the relative location of triplets in opposite strands of the stem. Specifically, Phases I, II and III correspond to positioning of triplets where, respectively, the first, second and third nucleotides are complementary. The most and the least frequent pairs in all phases of the dsDNA (see Materials and methods section) are considered. Phases I and II are similar to each other, yielding CG, GC, GU, UG, AU and UA (Table 5). Overall, the Archaea and thermophilic Bacteria have similar trends in the most and least frequent pairs. Mesophilic Bacteria stand out, with A13U and G33C as the most frequent pairs in Phases II and III, respectively. The major contribution from the three pairing phases (52) to the total amount of stem pairs is provided by Phase III, and nucleotides in codon position 3 are most frequent in stem pairs (Table 5 and Supplementary Table S1). OGT correlations highlight the contribution from the GU wobble pair to the thermostability of archaeal mRNA structures (Table 5 and Supplementary Table S1). We found the highest OGT correlation for the Archaeal pair U32G (Table 5), confirming the importance of the GU wobble pairs. The stronger effect for Archaeal rather than for Bacterial mRNA is possibly a relic of the ancient RNA world, where RNA was the carrier of genetic information and harsh environmental conditions demanded its increased stability.
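The phase definition can be illustrated with a small Python sketch. It rests on an inference from the definition above and from the pair labels in Table 5 (for example C23G in Phase I, A13U in Phase II, G33C in Phase III): in an antiparallel stem, the sum of the codon positions of the two paired nucleotides has a fixed remainder modulo 3, namely 2 for Phase I (1-1, 2-3, 3-2), 1 for Phase II (2-2, 1-3, 3-1) and 0 for Phase III (3-3, 1-2, 2-1). The function names, the `cds_start` parameter and the example sequence are hypothetical.

```python
# Remainder of (codon position of i + codon position of j) modulo 3 -> phase.
PHASE_BY_REMAINDER = {2: "I", 1: "II", 0: "III"}


def codon_position(index, cds_start=0):
    """1-based codon position (1, 2 or 3) of a 0-based nucleotide index."""
    return (index - cds_start) % 3 + 1


def pairing_phase(i, j, cds_start=0):
    """Phase of a stem base pair formed by nucleotides at 0-based indices i and j."""
    total = codon_position(i, cds_start) + codon_position(j, cds_start)
    return PHASE_BY_REMAINDER[total % 3]


def pair_label(seq, i, j, cds_start=0):
    """Label in the style of Table 5, e.g. 'C23G': base, two codon positions, base."""
    pi, pj = codon_position(i, cds_start), codon_position(j, cds_start)
    return f"{seq[i]}{pi}{pj}{seq[j]}"


# Hypothetical example: index 1 (codon position 2) pairs with index 8 (codon position 3).
seq = "ACUAAAAAG"
print(pair_label(seq, 1, 8), pairing_phase(1, 8))
```

Because neighboring pairs in a stem shift one index up on one strand and one index down on the other, every pair of a continuous stem falls into the same phase under this rule.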

Table 6 shows that purine load (the contents of A+G, R/Y, ApG) is larger in the loop regions than in the stems. In Archaea this difference is slightly more pronounced than in Bacteria (Supplementary Table S2). The purine load is also in good positive correlation with OGT for the loops, but not for the stems. However, the amount of ApG dinucleotides is correlated with OGT in both loops and stems. For synonymous codons, the fractions of purine-rich codons (e.g. GGR versus GGY for glycine and AGR versus CGY for arginine) are correlated with OGT in both loop and stem regions (slightly stronger in loops). For non-synonymous codons, however, the amount of purine-rich codons (GAR for glutamate) is correlated with OGT in both loops and stems, while the fractions of purine-rich AAR codons for lysine do not correlate with OGT (Table 6). While Forsdyke et al. (21) reported an increase of the purine content and its positive correlation with OGT in loops, we found that the fraction of purine-rich synonymous codons is correlated with OGT in stems as well. At the same time, the amount of non-synonymous purine-rich codons of lysine is not correlated with OGT in either loop or stem regions. Additionally, the amount of ApG dinucleotides is correlated with OGT in both loops and stems (Table 6).
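A minimal Python sketch of the loop-versus-stem comparison is given below. It assumes that the folded structure is available in dot-bracket notation (unpaired positions as `.`, paired positions as `(` or `)`), treats every unpaired position as loop, and takes R/Y to be the ratio of purine to pyrimidine counts. These conventions and the function names are assumptions made for illustration, not details taken from the Materials and methods.

```python
def split_by_structure(seq, dot_bracket):
    """Separate nucleotides into unpaired (loop) and paired (stem) positions."""
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    loops = [b for b, s in zip(seq, dot_bracket) if s == "."]
    stems = [b for b, s in zip(seq, dot_bracket) if s in "()"]
    return loops, stems


def purine_load(bases):
    """A+G fraction and purine/pyrimidine (R/Y) ratio of a non-empty list of bases."""
    purines = sum(b in "AG" for b in bases)
    pyrimidines = len(bases) - purines
    return {
        "A+G": purines / len(bases),
        "R/Y": purines / pyrimidines if pyrimidines else float("inf"),
    }


# Hypothetical hairpin: a GC stem closing an A-rich loop.
loops, stems = split_by_structure("GGGAAAACCC", "(((....)))")
print(purine_load(loops))
print(purine_load(stems))
```

Applied to every folded mRNA of a genome, the per-region means could then be compared between loops and stems and correlated with OGT, as summarized in Table 6.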

Signals of thermophilic adaptation in protein sequences

It has been shown earlier (22) that environmental temperature directly affects the amino acid composition of prokaryotic proteins, and that the IVYWREL combination can serve as a predictor of the OGT in prokaryotes. We use here a ‘Z-score’ predictor of OGT (43), which takes into account differences in variances of the proteomic frequencies of individual amino acids (Table 7). We have shown that the Z-score predictor properly corrects for these differences (43), better reflecting the contribution of the amino acid combinations to thermal adaptation. Separate predictors for Archaea and Bacteria (based on 33 and 99 proteomes with OGTs spread over 5–100 and 10–85 °C, respectively) yield the same general trend in the increase of hydrophobic and charged residues at the

Table 5. Compositional and dicodon signals of mRNA adaptation purified by the dicodon shuffling

Characteristic                      Archaea                        Bacteria

The most and least (after the comma) frequent pairs in stems of mRNA
  Mesophiles                        Phase I    C23G, G23U          Phase I    G32U, U32G
                                    Phase II   C31G, G13U          Phase II   A13U, G13U
                                    Phase III  A33U, U33G          Phase III  G33C, U33G
  Thermophiles                      Phase I    G32C, G23U          Phase I    C23G, G23U
                                    Phase II   C22G, G13U          Phase II   C31G, G13U
                                    Phase III  A33U, U33G          Phase III  U33A, U33G

Correlations with OGT of
  Segment energy <Esg>              R = -0.71                      R = -0.39
  Energy per base pair <Ebp>        R = -0.73                      R = -0.54

The most significant correlations with OGT
                                    Phase I U32G       R = 0.77    (U21G)Nat PIII     R = 0.46
                                    Phase I G23U       R = 0.71    (G12U)Nat PIII     R = 0.45
                                    Phase I All pairs  R = 0.7     (U21G)dShfld PIII  R = 0.45
                                                                   (G12U)dShfld PIII  R = 0.41

Phases I, II and III correspond to positioning of triplets where, respectively, the first, second and third nucleotides are complementary. All signals (except OGT correlations in Bacteria) are normalized by corresponding values for control sequences after the dicodon shuffling. See explanations of abbreviations in the Materials and methods section.

expense of polar ones (34), showing minor differences in the presence of individual residues.

We have shown earlier (22) that the amino acid bias working in thermal adaptation of proteins does not depend on the overall nucleotide composition of coding DNA sequences. However, one can still expect that nucleotide compositions of particular codon positions and/or dinucleotide compositions of positions 1-2 and 2-3 in codons may be somehow linked to amino acid biases. We indeed found a significant difference between nucleotide load in individual codon positions of Archaeal and Bacterial coding DNA sequences (Table 1). First, Archaeal protein coding sequences yield a strong correlation between the (T+G)2 load in the second codon position and OGT (r = 0.72). The corresponding non-codon-biased sequences show a comparable correlation (r = 0.68), indicating that the (T+G) load reflects tuning of the amino acid composition in connection with thermal adaptation. The main contributor to this correlation is thymine (Table 1), which provides an increase of the strongly hydrophobic residues LMFIV (Supplementary File S1). The (T+G) load is much more weakly correlated with OGT in Bacterial sequences (r = 0.36 for natural and r = 0.42 for NCB sequences). The correlation of the thymine load with OGT is 0.37 for both natural (NAT) and non-codon-biased (NCB) sequences. This observation agrees with the presumed prevalence of the structure-based strategy of thermal adaptation in Archaeal proteins (29), contrary to the sequence-based one in Bacterial proteomes. Specifically, there is a clear correlation between amounts of charged residues and the OGT in the set of Bacterial proteomes (Table 7), presumably reflecting the domination of the sequence-based strategy in thermophilic adaptation of Bacteria (29).

Finally, we checked whether there exists any specific connection between the dipeptide composition of proteins and that of 3-1 dinucleotides. We found (data not shown) that the strongest correlations between dinucleotides 3-1 and dipeptides exist mostly for the dinucleotides with guanine in position 3. The most frequent pair of amino acids contains methionine (Met, encoded exclusively by the AUG codon) as the first and any polar/charged amino acid as the second one. The preference of a polar/charged amino acid to be next to Met can be explained by the typical surface location of the protein N-termini and its role as a signal for ubiquitination (63).
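The link between thymine in the second codon position and the hydrophobic residues LMFIV follows directly from the standard genetic code, as the Python sketch below shows: every codon with T (U) in position 2 encodes Phe, Leu, Ile, Met or Val. The helper names are hypothetical, and `t2_load` is only an illustrative way to compute the thymine load of position 2 for a non-empty, in-frame coding sequence.

```python
# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def residues_by_second_base(base):
    """Sorted one-letter codes of all residues encoded by codons with `base` in position 2."""
    return sorted({aa for codon, aa in CODON_TABLE.items() if codon[1] == base})


def t2_load(cds):
    """Fraction of codons with thymine in the second codon position."""
    cds = cds.upper().replace("U", "T")
    cs = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    return sum(c[1] == "T" for c in cs) / len(cs)


print(residues_by_second_base("T"))
print(t2_load("ATGCTGGTTAAA"))  # hypothetical toy sequence: ATG CTG GTT AAA
```

Since all sixteen such codons encode hydrophobic residues, an increased demand for LMFIV in proteins necessarily raises the thymine load of the second codon position, independently of the codon bias.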

DISCUSSION

We will make a brief overview of the most pronounced biases and causal relationships in DNA, RNA and proteins. The major signal in nucleotide compositions, specifically the (G+C) load, is related to thermal stabilization of t- and rRNA (36,54,55,59,60) and to discrimination between the coding and non-coding sequences in Archaea (Table 2). Comparative analysis of nucleotide and amino acid compositions in relation to thermophilic adaptation prompts one to conclude that the (G+C) content does not contribute to thermostabilization of coding DNA (18,22,36,54,59), nor does it affect amino acid composition and its thermophilic trends (15,22). When

Table 6. Purine loading in loop and stem regions of folded mRNA and its OGT correlation

Feature            Loop                            Stem                            L-v-S P-value
                   Mean content  OGT correlation   Mean content  OGT correlation

A+G                0.560          0.59             0.500         -0.26             <2.2E-16
R/Y                1.299          0.61             1.002         -0.26             <2.2E-16
ApG                0.061          0.79             0.051          0.50             0.0002
GGR (glycine)      0.027          0.62             0.045          0.42             1.2E-17
GGY (glycine)      0.017         -0.05             0.079         -0.23             1.0E-42
AGR (arginine)     0.035          0.72             0.022          0.56             2.4E-10
CGR (arginine)     0.018         -0.22             0.040         -0.24             7.1E-09
CGY (arginine)     0.016         -0.25             0.037         -0.26             6.2E-10
GAR (glutamate)    0.060          0.71             0.036          0.59             <2.2E-16
AAR (lysine)       0.086          0.22             0.017          0.11             <2.2E-16
GAY (aspartate)    0.037         -0.17             0.040         -0.37             0.0002

Feature: analyzed nucleotide, dinucleotide or amino acid. Loop and Stem: information on mean content and OGT correlation of the above. L-v-S: a comparison between corresponding contents in the loop and stem regions by Wilcoxon tests; P-values are shown in this column. Correlation coefficients with OGT are shown; significance thresholds are P-value < 0.01 and P-value < 0.0001.

Table 7. Signals of thermophilic adaptation in protein sequences of Archaea and Bacteria

Correlation with OGT                 Archaea                    Bacteria

Z-scored predictor                   ILVW Y DKR (R = 0.93)      IPV Y EKR (R = 0.89)
The most abundant residues in
>70% (>60%) of all Z-scored
predictors                           VIWL Y ER                  VPI Y ER(K)
Individual amino acids               + L W                      + E
                                     - T Q D
Types of amino acids                 + h                        + h c
                                     - p                        - p
Dipeptides                           + hp ph                    + cc
                                     - cp pc                    - pc

+, increase of the amino acid (or amino acid type) fraction with OGT; -, decrease of the amino acid (or amino acid type) fraction with OGT. Capital letters are names of amino acids; h, p, c are hydrophobic, polar, charged types of amino acids. Correlation coefficients between the Z-scored thermostability predictors and OGT are given in parentheses.

separate codon positions are considered for coding DNA, the only compositional bias observed for nucleotide compositions is an excess of (G+C) load in the third codon position in Bacteria. There is a significant excess of guanine and cytosine nucleotides (29.2 and 32.82%) compared to the NCB nucleotide contents (26.09 and 24.66%). A plausible explanation for this bias is the role of the (G+C)3 load in the adaptation to the aerobic life style, which dominates in Bacteria. The (G+C)3 load can be advantageous for several complementary reasons. Indeed, additional GC base pairing will contribute to the stability of the double helix of DNA and stem regions of RNA molecules. Moreover, G3 bases can work as scavengers of oxidizing agents, providing protection for G bases in other codon positions (56–58). At the same time, this bias would not lead to any changes in amino acid composition, leaving protein structure and stability intact (55).

In this work we considered compositional and sequence biases in proteins in relation to those in corresponding nucleic acid sequences and to the phylogeny of species (Archaea or Bacteria). OGT correlations of nucleotides in different codon positions are similar in Archaea and Bacteria, but higher in the former (Table 1). The codon that is optimal from the point of view of thermophilic adaptation (i.e. most correlated with OGT) reads A1 [TG]2 non-[AG]3 in Archaea, but [weak A]1 [weak T]2 non-[AG]3 in Bacteria (Table 4). Codon bias supports the strong correlation between the amount of adenine and OGT in the first codon position in Archaea, and the same but weaker correlation in Bacteria. While we showed earlier (22) that the amino acid thermophilic trend is not determined by the overall nucleotide composition, consideration of separate codon positions reveals an interesting link between positional nucleotide frequencies and amino acid composition in relation to the domain of Life. Specifically, the prevalence of the nucleotide thymine in the second codon position and its strong correlation with OGT in Archaea is apparently a result of the demand for enrichment of Archaeal proteins with hydrophobic residues (Table 7). Noteworthy, a massive increase of van der Waals interactions (24,64) was found to be the cornerstone of the structure-based evolutionary strategy of protein thermostability in ancient species (29). In Bacterial proteins, however, there is an increase of the fractions of charged residues with OGT (Table 7). Thus, we observe a transition from the structure-based evolutionary strategy of protein thermostability in Archaea to the sequence-based one in Bacteria, and corresponding nucleotide and protein compositional biases (Table 7) underlying these strategies (29). The selection in the third codon position against adenine and guanine in Archaea, and against adenine in Bacteria, is a result of this bias.

The most pronounced dinucleotide compositional biases are an excess of homo-dinucleotides and its correlation with OGT in ncDNA, tRNA and rRNA. It is presumably a relic of ancient primitive homo(poly)nucleotides from which life started. Overrepresentation of the complementary ApG/CpT dinucleotides points to their contribution to DNA/RNA stability via strong base-stacking interactions (65,66). The consideration of dinucleotide biases in purine (R) and pyrimidine (Y) notation also reveals an interesting picture. Overall, for both Archaea and Bacteria, the signature for thermophilic dinucleotides reads RpR/YpY, non-(RpR/YpY), RpR/YpY for codon positions 1-2, 2-3 and 3-1, respectively (Table 4). The major difference between Archaea and Bacteria in this case is that the dinucleotide bias in positions 1-2 is caused by demands on the nucleotide sequence level (and provided by the codon bias) in Archaea, while some other factors work in Bacteria. The preference for homodinucleotides in positions 1-2 and 3-1 can be a possible reason for avoiding homodinucleotides in positions 2-3, because heterodinucleotides can contribute to DNA flexibility tuning. Indeed, it has been shown that, given an average flexibility F of homodinucleotides (either RpR or YpY), the flexibilities of the heterodinucleotides YpR and RpY are 2F and 0.5F, respectively (67). The codon positions 2-3 are most versatile in terms of the relationship between the nucleotide and protein sequences. Therefore, the presence of heterodinucleotides in the codon position 2-3 makes it possible to adjust flexibility of the nucleotide sequence without changing physical–chemical characteristics of the encoded amino acid residue. Additionally, we found that the link between dipeptide and corresponding 3-1 dinucleotide frequencies is determined by the amino acid dipeptide. The dominating bias reads as Met followed by a polar or charged residue.

On the RNA level, simple consideration of the nucleotide composition suggests discriminating the tRNA and rRNA molecules from the mRNA. Indeed, based on the correlation with OGT, we can conclude that (G+C) content is important for providing stability of tRNA and rRNA molecules, but not for mRNA. The role of the (G+C) content was further confirmed by folding simulations (Table 3). We also found other complementary signals in mRNA sequences related to its structure, stability and function. Archaea and Bacteria yield similar trends in the most and least frequent nucleotide pairs, and the third codon position makes a major contribution to the base-pairing mechanism of stem stabilization. Specifically, in addition to Watson–Crick pairing, the GU wobble pairs contribute significantly to the thermal stability of the Archaeal mRNA, in agreement with earlier observations (68,69). Overall, it results in moderate correlation of the segment energy and energy per base pair of the folded mRNA with OGT (Table 5). Folding simulations performed in this work also allowed us to analyze peculiarities of the nucleotide contents and structural contacts in stem and loop regions of folded mRNA. Purine load in loop regions is higher than in stems, and this effect is slightly stronger in Archaea than in Bacteria (Table 6). Purine load also correlates with OGT in loops, but not in stems. This correlation was observed earlier by Forsdyke et al. (21), and it was described as ‘polite purine load of the loop regions’ that prevents undesired mRNA–mRNA single-strand interactions. The authors concluded that increased purine load can affect the codon choice and, consequently, the amino acid composition (21). Our data [as well as careful analysis of the original data in (21)] do not support the latter claim. There is indeed a correlation between the fraction of

some purine-rich codons [GGR (Gly), AGR (Arg) and GAR (Glu)] and OGT (Table 6). However, the fraction of purine-rich non-synonymous codons AAR (Lys) is not correlated with OGT (Table 6). Moreover, these codons are synonymous and cannot directly affect the amino acid composition. It has been shown earlier (22,70) that an increase of Glu and Arg fractions is a trend in protein thermophilic adaptation. At the same time, purine load remains high after elimination of the codon bias, pointing to the amino acid composition as the most probable cause of the former (22). All the above prompts one to conclude that, in the mutual tuning of nucleotide and amino acid compositions, the purine load does not dominate and determine biases in amino acid composition, if not the opposite. The analysis of purine load shows that it correlates with OGT in both stems and loops (Table 6) in the case of the purine-rich codons GGR (Gly), AGR (Arg) and GAR (Glu). Further, the amount of ApG dinucleotides strongly correlates with OGT in both loops and stems (Table 6), and ApG provides strong base stacking important for thermal stability of nucleic acid sequences (22,65). Purine load, therefore, is apparently a determinant of the base-stacking mechanism in mRNA and/or DNA thermal stability, as well as a result of the thermophilic amino acid composition trend.

Prokaryotes thrive over a temperature interval spanning more than a hundred degrees, and they represent two major life styles, aerobic and anaerobic. Analysis of complete Archaeal and Bacterial genomes unraveled compositional and sequence signals related to molecular mechanisms of stability and adaptation, unaffected by selective sequencing or by the comparison of orthologs. Overall, codon bias works more strongly in Archaea and is mostly utilized in thermophilic adaptation of nucleic acids. It apparently reflects the longer evolutionary history of Archaea, which presumably started close to the origin of life in hot conditions (29). The codon bias and amino acid sequences (dipeptide composition) work in accord to support enrichment of the nucleotide sequences with ApG/CpT dinucleotides, the determinants of the base-stacking mechanism of nucleic acid stability (65–67). We also found that the second codon position reveals a strong link between the nucleotide and amino acid compositions. Specifically, the excess of thymine in this position is a result of a demand for the enrichment of Archaeal proteins with hydrophobic amino acids. The third codon position in Bacteria is the only case where codon bias is detectable already on the level of pure composition. We found that the (G+C)3 load is related to the aerobic life style dominating in Bacteria. From the point of view of thermophilic adaptation, codon bias works against G in the third codon position in both Archaea and Bacteria. It thus supports a specific role of the third codon position in discriminating between adaptations to temperature and to the aerobic life style. Finally, we obtained an interesting and complex picture of the relationship between the nucleotide composition (purine load) and amino acid composition (selection between Arg and Lys) in relation to thermal adaptation, aerobicity and phylogeny. Specifically, purine load in both Archaea and Bacteria is a result of the ‘from both ends of hydrophobicity scale’ trend in thermal adaptation of proteins (34), reflected in the IVYWREL predictor of the OGT (22). According to this predictor, Arg is a preferred amino acid for thermal adaptation, though Lys is the next candidate, present in the top most correlated predictors as well (22,43). In particular, selection for Lys over Arg in some species was shown to be important for the entropic mechanism of protein thermostabilization (70). However, the overall preference for Arg over Lys in the case of thermal adaptation is well manifested in the excess of the purine-rich (AGR) codons of Arg at the expense of the purine-rich ones of Lys (AAR). At the same time, discrimination between Arg and Lys works in the opposite direction in aerobicity, where the Arg codons AGA and AGG are suppressed in favor of the purine-rich AAA and AAG of Lys. It thus reflects a preference for Lys over Arg in aerobes compared to anaerobes in particular, and in Bacteria versus Archaea in general, preserving at the same time the high purine load necessary for providing base stacking in the corresponding nucleic acids (22,53,65). An intricate connection between Lys/Arg and their codons in relation to thermal adaptation and aerobicity exemplifies how selection can work on nucleic acids and proteins simultaneously in response to demands of different environments. Obviously, the whole picture of molecular mechanisms of adaptation and relations between them is far from being complete. Consideration of other environmental factors, such as salinity, pressure, etc., will help to unravel new mechanisms of stability and their sequence/structure determinants, and to understand tradeoffs that Nature embraced en route of the evolution and adaptation.
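As an illustration of a composition-based OGT predictor of the kind discussed above, the Python sketch below computes the summed fraction of a residue set in a proteome and its Pearson correlation with OGT. It assumes, as a simplification, that the IVYWREL predictor is the total proteomic fraction of the seven residues I, V, Y, W, R, E and L; the Z-score predictor of (43) is not reproduced. The function names are hypothetical, and the proteomes and OGT values are invented toy data used only to show the calculation.

```python
import math

IVYWREL = set("IVYWREL")


def residue_set_fraction(proteins, residues=IVYWREL):
    """Summed fraction of the given residues over all proteins of a proteome."""
    total = sum(len(p) for p in proteins)
    hits = sum(aa in residues for p in proteins for aa in p)
    return hits / total


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Invented toy data: (proteome as a list of protein sequences, OGT in degrees C).
toy_proteomes = [
    (["MKTAQDNS", "MSTNQGAK"], 25.0),
    (["MKIVELRA", "MSLNQGAK"], 55.0),
    (["MRIVELRY", "MELLVIRW"], 85.0),
]
fractions = [residue_set_fraction(proteins) for proteins, _ in toy_proteomes]
ogts = [ogt for _, ogt in toy_proteomes]
print(fractions, pearson_r(fractions, ogts))
```

The same two functions can be applied to any other residue set, for example to compare the fractions of Arg and Lys between aerobes and anaerobes.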

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Funding for open access charge: Functional Genomics Program (FUGE II), Norwegian Research Council.

Conflict of interest statement. None declared.

REFERENCES

1. Crick,F. (1970) Central dogma of molecular biology. Nature, 227, 561–563.

2. Tawfik,D.S. (2010) Messy biology and the origins of evolutionary innovations. Nat. Chem. Biol., 6, 692–696.

3. Tokuriki,N. and Tawfik,D.S. (2009) Stability effects of mutations and protein evolvability. Curr. Opin. Struct. Biol., 19, 596–604.

4. Tokuriki,N. and Tawfik,D.S. (2009) Protein dynamism and evolvability. Science, 324, 203–207.

5. Koonin,E.V. (2012) Does the central dogma still stand? Biol. Direct, 7, 27.

6. Pe’er,I., Felder,C.E., Man,O., Silman,I., Sussman,J.L. and Beckmann,J.S. (2004) Proteomic signatures: amino acid and oligopeptide compositions differentiate among phyla. Proteins, 54, 20–40.

7. Aravind,L., Tatusov,R.L., Wolf,Y.I., Walker,D.R. and Koonin,E.V. (1998) Evidence for massive gene exchange between archaeal and bacterial hyperthermophiles. Trends Genet., 14, 442–444.

8. Khachane,A.N., Timmis,K.N. and dos Santos,V.A. (2005) Uracil content of 16S rRNA of thermophilic and psychrophilic prokaryotes correlates inversely with their optimal growth temperatures. Nucleic Acids Res., 33, 4016–4022.

9. Suhre,K. and Claverie,J.M. (2003) Genomic correlates of hyperthermostability, an update. J. Biol. Chem., 278, 17198–17202.

10. Tekaia,F. and Yeramian,E. (2006) Evolution of proteomes: fundamental signatures and global trends in amino acid compositions. BMC Genom., 7, 307.

11. Wang,H.C. and Hickey,D.A. (2002) Evidence for strong selective constraint acting on the nucleotide composition of 16S ribosomal RNA genes. Nucleic Acids Res., 30, 2501–2507.

12. Friedman,R., Drake,J.W. and Hughes,A.L. (2004) Genome-wide patterns of nucleotide substitution reveal stringent functional constraints on the protein sequences of thermophiles. Genetics, 167, 1507–1512.

13. Lynn,D.J., Singer,G.A. and Hickey,D.A. (2002) Synonymous codon usage is subject to selection in thermophilic bacteria. Nucleic Acids Res., 30, 4272–4277.

14. Singer,G.A. and Hickey,D.A. (2000) Nucleotide bias causes a genomewide bias in the amino acid composition of proteins. Mol. Biol. Evol., 17, 1581–1588.

15. Singer,G.A. and Hickey,D.A. (2003) Thermophilic prokaryotes have characteristic patterns of codon usage, amino acid composition and nucleotide content. Gene, 317, 39–47.

16. Tekaia,F., Yeramian,E. and Dujon,B. (2002) Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene, 297, 51–60.

17. Roy Chowdhury,A. and Dutta,C. (2012) A pursuit of lineage-specific and niche-specific proteome features in the world of archaea. BMC Genom., 13, 236.

18. Wang,H.C., Susko,E. and Roger,A.J. (2006) On the correlation between genomic G+C content and optimal growth temperature in prokaryotes: data quality and confounding factors. Biochem. Biophys. Res. Commun., 342, 681–684.

19. Wu,H., Zhang,Z., Hu,S. and Yu,J. (2012) On the molecular mechanism of GC content variation among eubacterial genomes. Biol. Direct, 7, 2.

20. Knight,R.D., Freeland,S.J. and Landweber,L.F. (2001) A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes. Genome Biol., 2, RESEARCH0010.

21. Lao,P.J. and Forsdyke,D.R. (2000) Thermophilic bacteria strictly obey Szybalski’s transcription direction rule and politely purine-load RNAs with both adenine and guanine. Genome Res., 10, 228–236.

22. Zeldovich,K.B., Berezovsky,I.N. and Shakhnovich,E.I. (2007) Protein and DNA sequence determinants of thermophilic adaptation. PLoS Comput. Biol., 3, e5.

23. Kreil,D.P. and Ouzounis,C.A. (2001) Identification of thermophilic species by the amino acid compositions deduced from their genomes. Nucleic Acids Res., 29, 1608–1615.

24. Berezovsky,I.N., Tumanyan,V.G. and Esipova,N.G. (1997) Representation of amino acid sequences in terms of interaction energy in protein globules. FEBS Lett., 418, 43–46.

25. Cambillau,C. and Claverie,J.M. (2000) Structural and genomic correlates of hyperthermostability. J. Biol. Chem., 275, 32383–32386.

26. Greaves,R.B. and Warwicker,J. (2007) Mechanisms for stabilisation and the maintenance of solubility in proteins from thermophiles. BMC Struct. Biol., 7, 18.

27. Jaenicke,R. (1999) Stability and folding of domain proteins. Prog. Biophys. Mol. Biol., 71, 155–241.

28. Jaenicke,R. and Bohm,G. (1998) The stability of proteins in extreme environments. Curr. Opin. Struct. Biol., 8, 738–748.

29. Berezovsky,I.N. and Shakhnovich,E.I. (2005) Physics and evolution of thermophilic adaptation. Proc. Natl Acad. Sci. USA, 102, 12742–12747.

30. Chakravarty,S. and Varadarajan,R. (2002) Elucidation of factors responsible for enhanced thermal stability of proteins: a structural genomics based study. Biochemistry, 41, 8152–8161.

31. Glyakina,A.V., Garbuzynskiy,S.O., Lobanov,M.Y. and Galzitskaya,O.V. (2007) Different packing of external residues can explain differences in the thermostability of proteins from thermophilic and mesophilic organisms. Bioinformatics, 23, 2231–2238.

32. Thompson,M.J. and Eisenberg,D. (1999) Transproteomic evidence of a loop-deletion mechanism for enhancing protein thermostability. J. Mol. Biol., 290, 595–604.

33. Tokuriki,N., Oldfield,C.J., Uversky,V.N., Berezovsky,I.N. and Tawfik,D.S. (2009) Do viral proteins possess unique biophysical features? Trends Biochem. Sci., 34, 53–59.

34. Berezovsky,I.N., Zeldovich,K.B. and Shakhnovich,E.I. (2007) Positive and negative design in stability and thermal adaptation of natural proteins. PLoS Comput. Biol., 3, e52.

35. Bharanidharan,D., Bhargavi,G.R., Uthanumallian,K. and Gautham,N. (2004) Correlations between nucleotide frequencies and amino acid composition in 115 bacterial species. Biochem. Biophys. Res. Commun., 315, 1097–1103.

36. Nakashima,H., Fukuchi,S. and Nishikawa,K. (2003) Compositional changes in RNA, DNA and proteins for bacterial adaptation to higher and lower temperatures. J. Biochem. (Tokyo), 133, 507–513.

37. Dehouck,Y., Folch,B. and Rooman,M. (2008) Revisiting the correlation between proteins’ thermoresistance and organisms’ thermophilicity. Protein Eng. Des. Sel., 21, 275–278.

38. Folch,B., Dehouck,Y. and Rooman,M. (2010) Thermo- and mesostabilizing protein interactions identified by temperature-dependent statistical potentials. Biophys. J., 98, 667–677.

39. Folch,B., Rooman,M. and Dehouck,Y. (2008) Thermostability of salt bridges versus hydrophobic interactions in proteins probed by statistical potentials. J. Chem. Inf. Model., 48, 119–127.

40. Gonnelli,G., Rooman,M. and Dehouck,Y. (2012) Structure-based mutant stability predictions on proteins of unknown structure. J. Biotechnol., 161, 287–293.

41. Ponnuswamy,P., Muthusamy,R. and Manavalan,P. (1982) Amino acid composition and thermal stability of globular proteins. Int. J. Biol. Macromol., 4, 186–190.

42. Berezovsky,I.N. (2011) The diversity of physical forces and mechanisms in intermolecular interactions. Phys. Biol., 8, 035002.

43. Ma,B.G., Goncearenco,A. and Berezovsky,I.N. (2010) Thermophilic adaptation of protein complexes inferred from proteomic homology modeling. Structure, 18, 819–828.

44. Makarova,K.S. and Koonin,E.V. (2005) Evolutionary and functional genomics of the Archaea. Curr. Opin. Microbiol., 8, 586–594.

45. Novichkov,P.S., Wolf,Y.I., Dubchak,I. and Koonin,E.V. (2009) Trends in prokaryotic evolution revealed by comparison of closely related bacterial and archaeal genomes. J. Bacteriol., 191, 65–73.

46. Koonin,E.V. and Wolf,Y.I. (2008) Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res., 36, 6688–6719.

47. Koonin,E.V., Mushegian,A.R., Galperin,M.Y. and Walker,D.R. (1997) Comparison of archaeal and bacterial genomes: computer analysis of protein sequences predicts novel functions and suggests a chimeric origin for the archaea. Mol. Microbiol., 25, 619–637.

48. Katz,L. and Burge,C.B. (2003) Widespread selection for local RNA secondary structure in coding regions of bacterial genes. Genome Res., 13, 2042–2051.

49. Hofacker,I.L., Priwitzer,B. and Stadler,P.F. (2004) Prediction of locally stable RNA secondary structures for genome-wide surveys. Bioinformatics, 20, 186–190.

50. Zuker,M. and Stiegler,P. (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res., 9, 133–148.

51 MathewsDH SabinaJ ZukerM and TurnerDH (1999)Expanded sequence dependence of thermodynamic parametersimproves prediction of RNA secondary structure J Mol Biol288 911ndash940

52 ShabalinaSA OgurtsovAY and SpiridonovNA (2006) Aperiodic pattern of mRNA secondary structure created by thegenetic code Nucleic Acids Res 34 2428ndash2437

53 MarmurJ and DotyP (1962) Determination of the basecomposition of deoxyribonucleic acid from its thermaldenaturation temperature J Mol Biol 5 109ndash118

Nucleic Acids Research 2014 Vol 42 No 5 2891

54. Hurst LD and Merchant AR (2001) High guanine-cytosine content is not an adaptation to high temperature: a comparative analysis amongst prokaryotes. Proc Biol Sci, 268, 493–497.

55. Naya H, Romero H, Zavala A, Alvarez B and Musto H (2002) Aerobiosis increases the genomic guanine plus cytosine content (GC%) in prokaryotes. J Mol Evol, 55, 260–264.

56. Beckman KB and Ames BN (1997) Oxidative decay of DNA. J Biol Chem, 272, 19633–19636.

57. Cheng KC, Cahill DS, Kasai H, Nishimura S and Loeb LA (1992) 8-Hydroxyguanine, an abundant form of oxidative DNA damage, causes G→T and A→C substitutions. J Biol Chem, 267, 166–172.

58. Vieira-Silva S and Rocha EP (2008) An assessment of the impacts of molecular oxygen on the evolution of proteomes. Mol Biol Evol, 25, 1931–1942.

59. Galtier N and Lobry JR (1997) Relationships between genomic G+C content, RNA secondary structures and optimal growth temperature in prokaryotes. J Mol Evol, 44, 632–636.

60. Musto H, Naya H, Zavala A, Romero H, Alvarez-Valin F and Bernardi G (2004) Correlations between genomic GC levels and optimal growth temperatures in prokaryotes. FEBS Lett, 573, 73–77.

61. Fitch WM (1974) The large extent of putative secondary nucleic acid structure in random nucleotide sequences or amino acid derived messenger-RNA. J Mol Evol, 3, 279–291.

62. Workman C and Krogh A (1999) No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. Nucleic Acids Res, 27, 4816–4822.

63. Bachmair A, Finley D and Varshavsky A (1986) In vivo half-life of a protein is a function of its amino-terminal residue. Science, 234, 179–186.

64. Berezovsky IN, Namiot VA, Tumanyan VG and Esipova NG (1999) Hierarchy of the interaction energy distribution in the spatial structure of globular proteins and the problem of domain definition. J Biomol Struct Dyn, 17, 133–155.

65. Yakovchuk P, Protozanova E and Frank-Kamenetskii MD (2006) Base-stacking and base-pairing contributions into thermal stability of the DNA double helix. Nucleic Acids Res, 34, 564–574.

66. Friedman RA and Honig B (1995) A free energy analysis of nucleic acid base stacking in aqueous solution. Biophys J, 69, 1528–1535.

67. Okonogi TM, Alley SC, Reese AW, Hopkins PB and Robinson BH (2002) Sequence-dependent dynamics of duplex DNA: the applicability of a dinucleotide model. Biophys J, 83, 3446–3459.

68. Masquida B and Westhof E (2000) On the wobble GoU and related pairs. RNA, 6, 9–15.

69. Varani G and McClain WH (2000) The G x U wobble base pair. A fundamental building block of RNA structure crucial to RNA function in diverse biological systems. EMBO Rep, 1, 18–23.

70. Berezovsky IN, Chen WW, Choi PJ and Shakhnovich EI (2005) Entropic stabilization of proteins and its proteomic consequences. PLoS Comput Biol, 1, e47.

content increases and energy per base pair decreases with temperature in stems of folded t- and rRNAs (Table 3). These data point to base-pairing interactions as an important contributor to thermal stabilization (53) of t- and rRNAs of prokaryotes, and this effect is more strongly manifested in Archaea than in Bacteria.

Dinucleotide composition of nucleic acid sequences

Dinucleotide contrasts (DCTs) show the ratio of observed dinucleotide frequencies to the expected ones, given the composition of individual nucleotides. Coupling of the same nucleotides (Table 1) is preferred in ncDNA sequences in Archaea (ApA/TpT and CpC/GpG) and Bacteria (ApA/TpT). ApA and CpC dinucleotides are prevalent in t- and rRNA of Archaea and rRNA of Bacteria. Other outstanding contrasts in Archaea are found in ncDNA (GpT/ApC) and tRNA (TpC, ApC and TpG). In Bacteria, ApG and TpC are prevalent in tRNA (Table 1), while GpC is preferred in both non-coding and coding sequences. In coding sequences of Archaea there is no preference for coupling of similar nucleotides, while in Bacteria it is found for ApA and TpT (Table 1).

In both coding and ncDNA sequences of Archaea there is a clear excess of complementary ApG and CpT dinucleotides, which persists after elimination of the codon bias and reshuffling of the amino acid sequence (Table 1). In Bacteria, the excess of GpG dinucleotides in non-coding and coding sequences is most pronounced (Table 1); in coding sequences it is provided by the codon bias. Pairing of similar nucleotides is highly correlated with OGT in tRNA (ApA, TpT, GpG and CpC) and rRNA (ApA, GpG and CpC) in Archaea. Correlation of the same nucleotide pairs is also found in Bacteria, though for fewer pairs and more weakly (tRNA: ApA; rRNA: ApA and CpC). In Archaeal t- and rRNA there is also a strong correlation of TpA pairs with OGT (Table 1).
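The dinucleotide contrast defined at the start of this section can be illustrated with a minimal Python sketch. It assumes that the expected frequency of a dinucleotide is the product of the frequencies of its two nucleotides and that overlapping dinucleotides are counted; the function name and the toy sequences are illustrative and are not taken from the paper.

```python
from collections import Counter

def dinucleotide_contrasts(seq):
    """Return {dinucleotide: observed/expected frequency} for one sequence.

    Assumption: expected frequency = f(X) * f(Y) from mononucleotide
    frequencies; dinucleotides are counted with overlap.
    """
    seq = seq.upper()
    n = len(seq)
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(n - 1))
    contrasts = {}
    for pair, count in di.items():
        observed = count / (n - 1)
        expected = (mono[pair[0]] / n) * (mono[pair[1]] / n)
        contrasts[pair] = observed / expected
    return contrasts

# Toy sequence: ApC occurs more often than its composition alone predicts.
toy_contrasts = dinucleotide_contrasts("ACAC")
```

A contrast above 1 indicates over-representation of the dinucleotide relative to the nucleotide composition, and a contrast below 1 indicates under-representation.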

Table 2. Nucleotide compositions and their OGT correlations in DNA and RNAs

Domain of life  Nucleotide  cDNANat  cDNANCB  ncDNA  tRNA (R)       rRNA (R)
Archaea         A           28.45    27.94    30.73  17.13 (–0.76)  23.68 (–0.85)
                T           23.89    24.44    30.65  18.17 (–0.89)  19.24 (–0.88)
                G           26.00    26.12    19.30  33.91 (0.84)   32.08 (0.94)
                C           21.66    21.51    19.31  30.80 (0.84)   24.99 (0.76)
Bacteria        A           24.05    26.42    26.57  19.66 (–0.44)  26.09 (–0.52)
                T           22.68    23.70    26.61  21.56 (–0.51)  20.68 (–0.77)
                G           27.50    26.66    23.41  31.21 (0.53)   31.10 (0.72)
                C           25.77    23.23    23.41  27.58 (0.39)   22.13 (0.60)

The numbers represent the average frequencies of nucleotides (in percent) in the corresponding parts of genomes, while the numbers in parentheses are correlation coefficients (R) of nucleotide frequencies with OGT. The most important biases and correlations are shown in bold font. The P-values for correlations (H0: correlation coefficient R = 0) and for the nucleotide composition t-tests (H0: mean frequency is 0.25) are shown in superscripts as significance levels (P-value < 0.01 and P-value < 0.0001). Supplementary Files S8 and S9 list all correlations and composition tests. cDNANat: natural nucleotide composition in coding DNA; cDNANCB: nucleotide composition in coding sequences with eliminated codon bias.

Table 3. OGT correlations in r- and tRNA observed in folding simulations

Domain of life  RNA type  R((G+C), OGT) (P-value)  R(⟨Ebp⟩, OGT) (P-value)
Archaea         rRNA      0.89 (<10^–22)           –0.93 (1.1 × 10^–20)
                tRNA      0.84 (1.84 × 10^–13)     –0.71 (3.34 × 10^–8)
Bacteria        rRNA      0.73 (1.38 × 10^–14)     –0.66 (1 × 10^–11)
                tRNA      0.53 (3.1 × 10^–7)       –0.51 (9.1 × 10^–7)

R((G+C), OGT): correlation coefficient between the (G+C) content and the OGT. R(⟨Ebp⟩, OGT): correlation coefficient between the free energy of RNA folding averaged per base pair and the OGT.

Correlation between nucleotide frequencies and temperature in different codon positions

The first codon position in Archaea is characterized by a high correlation of the ratio of natural to NCB frequencies of adenine with OGT [ANat/NCB, R = 0.69; Table 1]. The second codon position reveals a correlation of the thymine frequency with OGT (Table 1). The combination of thymine and guanine nucleotides is also correlated with OGT; however, thymine is the major contributor (Table 1). The correlation of the guanine content with OGT is supported by the codon bias (Table 1). The combination of guanine with cytosine is also correlated with OGT in the second position of Archaeal sequences and is provided by the codon bias. The third position reveals strong selection against adenine and guanine in relation to OGT (Table 1). Bringing together (anti-)correlations with OGT in different codon positions, one can draw the combined triplet that is optimal from the point of view of thermal adaptation (Table 4) as [A]1 [T/G]2 [non-(A/G)]3. Prevalence of thymine in the second codon position is linked to an excess of hydrophobes. Indeed, codons with thymine in the second position encode Ile, Leu, Met, Phe and Val. These are strongly hydrophobic residues and aromatic Phe, which can form many van der Waals contacts and thus contribute to the packing of the hydrophobic core. Notably, elimination of the codon bias does not affect the correlation of the thymine fraction with OGT (R = 0.55 in both native and NCB sequences). This shows that the excess of thymine in the second codon position is a result of selection on the protein level. An apparent explanation for such selection is the domination of the structure-based strategy in thermostabilization of Archaeal proteins. This strategy is characterized by the increased compactness of the hydrophobic core provided by massive van der Waals contacts. The correlation of the combination of A and G with OGT is not provided by tuning of the codon bias, which points to adaptation on a level other than DNA.
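The link between thymine in the second codon position and hydrophobic residues can be checked directly against the standard genetic code. The sketch below assumes the standard code (NCBI translation table 1) written in TCAG order; the helper name is illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def residues_by_second_base(base):
    """Sorted one-letter codes of all residues whose codons carry `base` in position 2."""
    return sorted({aa for codon, aa in CODON_TABLE.items() if codon[1] == base})

second_t = residues_by_second_base("T")  # expected: F, I, L, M, V
```

All 16 codons with T in the second position map to Phe, Ile, Leu, Met or Val, so any selection for these residues on the protein level necessarily raises the thymine fraction in that position.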

Bacterial coding DNA sequences yield fewer position-dependent correlations with OGT than Archaeal ones. The ratio of the natural adenine frequency to the one for eliminated codon bias is weakly correlated with OGT (Table 1). There are also moderate correlations of the thymine frequency and of the thymine and guanine combination with OGT in the second codon position (Table 1). The third codon position is characterized by selection against adenine provided by the codon bias (Table 1). However, the combination of adenine and guanine is correlated with OGT. It does not depend on the codon bias to a large extent (same as in Archaea), pointing to a possible signal of adaptation on a level other than the protein-coding DNA sequence. The generalized codon in Bacteria (Table 4) characterizing the thermophilic trends therefore reads [weak A]1 [weak T]2 [non-(A/G)]3.
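The position-specific correlations discussed above can be sketched as follows. This is a simplified illustration: it correlates the raw frequency of a nucleotide in a given codon position with OGT, whereas the analysis in the text also uses the ratio of natural to NCB frequencies, which requires a codon-bias-free control that is not reproduced here. The sequences and temperatures are made-up toy values.

```python
from math import sqrt

def position_frequency(cds, base, position):
    """Fraction of codons in `cds` carrying `base` at codon position 1, 2 or 3."""
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return sum(c[position - 1] == base for c in codons) / len(codons)

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: (coding sequence, OGT in degrees C) for three toy genomes.
toy_genomes = [("ATGGCAGAA", 30.0), ("ATGGTTGAA", 55.0), ("ATGGTTCTT", 80.0)]
t2 = [position_frequency(seq, "T", 2) for seq, _ in toy_genomes]
ogt = [temperature for _, temperature in toy_genomes]
r_t2 = pearson_r(t2, ogt)
```

In a real analysis each genome would contribute the frequency computed over all of its coding sequences, and the significance of R would be tested against the null hypothesis R = 0.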

Table 4. Generalized nucleotides and dinucleotides in different codon positions favorable for thermostability

Nucleotides correlated with OGT

Domain of life  Codon position      1           2                             3
Archaea         Nucleotide          A           T/G                           Non-[A/G]
                Origin of the bias  Codon bias  T: amino acid; G: codon bias  Against codon bias
Bacteria        Nucleotide          Weak A      Weak T                        Non-[A/G]
                Origin of the bias  Codon bias  Amino acid                    Against codon bias

Dinucleotides in purine (R) and pyrimidine (Y) notation correlated with OGT

Domain of life  Codon position      1-2             2-3          3-1
Archaea         Dinucleotide        RpR/YpY         Non-RpR/YpY  RpR/YpY
                Origin of the bias  Codon bias      Codon bias   Amino acid
Bacteria        Dinucleotide        RpR/YpY         Non-RpR/YpY  RpR/YpY
                Origin of the bias  Not codon bias  Codon bias   Amino acid

Part 1: thermophilic-prone nucleotide biases. Columns 1, 2, 3 contain information on favorable nucleotides and the origin of the bias in codon positions 1, 2, 3. Part 2: thermophilic-prone dinucleotide biases. Columns 1-2, 2-3, 3-1 contain information on favorable nucleotides and the origin of the bias in codon positions 1-2, 2-3, 3-1.

The role of excessive (G+C) load in the third codon position in adaptation to aerobicity

In both Archaea and Bacteria, the average nucleotide composition in different codon positions is not affected by the codon bias to a large extent, except for the third codon position in Bacteria. However, the compositional variance in the third position is essentially destroyed when there is no codon bias (the standard deviation (SD) diminishes from >10 to <1; see Supplementary File S5). The natural (G+C)3 load in Bacteria yields a 9% excess in comparison with NCB sequences (P-value = 1.7e–07, Supplementary File S6), while in Archaea there is no significant difference. We found that, despite the frequent involvement of G3 and C3 nucleotides in GC/CG pairs, the (G+C)3 load does not play a crucial role in the thermal stability of mRNA (52,54). Overall, the difference in (G+C)3 load between Archaea and Bacteria is insignificant. If we consider separate thermal groups, the only difference appears in mesophilic Bacteria (9.5% more in Bacteria, P = 0.09). However, when we take into account the oxygen tolerance factor, the difference in (G+C)3 load becomes extremely pronounced between aerobic (70.13% in natural and 50.84% in NCB) and anaerobic (48.94% in natural and 50.46% in NCB) species (P-value = 2e–09). Considering the lack of OGT correlation in protein-coding nucleotides (except thymine in the second codon position), one can conclude that the increased (G+C)3 load is a result of the aerobic life style rather than thermal adaptation (Supplementary Files S2 and S3). The comparison of aerobes and anaerobes in the group of mesophilic Bacteria corroborates this conclusion (Supplementary File S6). The difference is >20% of (G+C)3 content: 70.97% in aerobes and 50.51% in anaerobes (P = 0.000178). This adaptation is entirely driven by the codon bias, because if the codon bias is removed, the (G+C)3 bias as well as the difference between aerobes and anaerobes disappears (50.87% and 50.64%).

We also compared the usage of codons with G or C in the third position with other codons ('Codon Usage' in Supplementary Files S3 and S6). There are three amino acids encoded by six codons (Leu, Arg, Ser), five by four codons (Ala, Pro, Thr, Gly, Val), one by three codons (Ile), nine by two codons (Lys, Asn, Asp, Phe, Cys, Gln, Glu, His, Tyr) and two by one codon (Met, Trp). Typically, half of the synonymous codons of each amino acid have G or C in the third codon position, and the others have A or T. It appeared that almost all of the codons with G and C in the third position have higher frequencies in aerobes compared to anaerobes and in Bacteria compared to Archaea (see 'Codon Usage' in Supplementary File S3). This observation holds for all codons of the 'two-codon' amino acids and for the majority of codons of the three-, four- and six-codon amino acids. The noticeable exception is the significant suppression of the AGA/AGG codons of Arg (aerobes–anaerobes 177–125, P-values 4.46e–8 and 5.8e–8; Bacteria–Archaea 174–2625, P-values 1.65e–7 and 1.14e–11). In lysine, the codon AAA is suppressed in aerobes (difference 1716, P-value 6.44e–6), while the codon AAG is preferred. The AAG codon of lysine is favorable in aerobes (aerobes/anaerobes 143), and the other lysine codon, AAA, can be turned into AAG by only one mutation in the third codon position. Therefore, one can speculate that the demand for discriminating between Arg and Lys is the most probable cause of the suppression of the AGA and AGG codons of arginine in aerobes. Lysine is thus preferred over arginine in aerobic adaptation (Supplementary Files S3 and S6). The contribution of the amino acid to thermophilic adaptation is not compromised, because the physical–chemical characteristics of lysine and arginine are similar from the point of view of thermostability. Overall, an excess of the G3 codons is always more pronounced in aerobes compared to anaerobes, which corroborates the conclusion that the (G+C)3 load is an indication of the aerobic life style (55–58). The similar trend in the difference between Bacteria and Archaea is a result of the domination of the aerobic life style among Bacteria in the analyzed dataset.
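The grouping of amino acids by the number of synonymous codons, and the share of G- or C-ending codons in each family, can be recomputed from the standard genetic code. The sketch below assumes the standard code (NCBI translation table 1) in TCAG order; the variable names are illustrative.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def synonymous_families():
    """Map each amino acid (one-letter code) to the list of its codons."""
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    return dict(families)

families = synonymous_families()
# Number of amino acids per family size (6, 4, 3, 2 or 1 codons).
degeneracy = Counter(len(codons) for codons in families.values())
# Fraction of codons ending in G or C within each synonymous family.
gc3_share = {aa: sum(c[2] in "GC" for c in codons) / len(codons)
             for aa, codons in families.items()}
```

The counts reproduce the 3/5/1/9/2 split quoted above, and the G- or C-ending share is exactly one half for every two-, four- and six-codon family; only Ile (one of three codons) and the single-codon Met and Trp deviate.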

Correlation between dinucleotide frequencies in different codon positions and OGT

In order to analyze dinucleotides in different codon positions (1-2, 2-3 and 3-1), we used the DCTNat, DCTNCB (for positions 1-2 and 2-3) and shuffled (DCTShfld, for position 3-1) sequences. We considered correlations of these contrasts with OGT. The most pronounced biases are found for dinucleotides in coding sequences of Archaea. The ApG dinucleotides show the strongest excess of all dinucleotides in all positions of natural sequences compared to NCB and shuffled ones. In codon positions 1-2 and 2-3, this compositional peculiarity and its correlation with OGT are provided by the codon bias (Table 1). In position 3-1, the high correlation of ApG frequencies with OGT is determined by the amino acid sequences, and it vanishes after amino acid reshuffling (Table 1). There is also a correlation of the frequency of the complementary CpT dinucleotide in position 3-1 with OGT, which disappears after reshuffling of amino acid sequences (Table 1). The most plausible role of ApG dinucleotides is a contribution to stabilization via base-stacking interactions between the purine rings (22). This conclusion is supported by the fact that the excess of ApG dinucleotides is provided by the codon bias, thus manifesting adaptation on the DNA level. The complementary CpT dinucleotides indicate enrichment of the anti-sense strand of double-stranded DNA (dsDNA) with the stabilizing ApG ones. The resulting mosaic of ApG stacking in sense and anti-sense strands can provide stabilization over long distances without compromising the flexibility of the dsDNA. Other overrepresented dinucleotides (regardless of the codon bias) correlated with OGT in position 1-2 are GpC, CpC, TpA and CpT (Table 1). The frequencies of TpA and CpT dinucleotides are correlated with OGT in position 2-3, where the former is provided by the codon bias and the latter is not.

In purine (R) and pyrimidine (Y) notation, codon bias provides grouping (stacking) of similar nucleotides in position 1-2, which increases with OGT (Table 1). At the same time, codon bias works against such grouping in position 2-3 (Table 1). The amino acid sequence is responsible for the grouping of similar nucleotides and correlations with OGT in position 3-1 (Table 1). The resulting generalized thermophilic pattern of dinucleotides in Archaeal cDNA sequences reads [RpR/YpY]1-2 [non-RpR/non-YpY]2-3 [RpR/YpY]3-1. Most of the compositional peculiarities found in Bacteria are not determined by the codon bias and yield weak correlations with temperature (Table 1). The thermophilic pattern of dinucleotides in Bacteria (Table 4) is the same as the Archaeal one. The difference from Archaea is that selection for RpR and YpY dinucleotides in position 1-2 is not determined by the codon bias in Bacteria (Tables 1 and 4).
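The purine/pyrimidine bookkeeping by codon junction can be sketched in a few lines of Python. The sketch assumes an in-frame coding sequence that starts at its first codon and contains only A, C, G and T (or U); the function names are illustrative and not taken from the paper.

```python
from collections import Counter

PURINES = set("AG")

def to_ry(base):
    """Map A/G to 'R' (purine) and any other base to 'Y' (pyrimidine)."""
    return "R" if base in PURINES else "Y"

def ry_dinucleotides_by_junction(cds):
    """Count RY-notation dinucleotides at codon junctions 1-2, 2-3 and 3-1.

    Assumption: `cds` is in frame, so index 0 is codon position 1.
    """
    cds = cds.upper().replace("U", "T")
    labels = ("1-2", "2-3", "3-1")
    counts = {label: Counter() for label in labels}
    for i in range(len(cds) - 1):
        pair = to_ry(cds[i]) + "p" + to_ry(cds[i + 1])
        counts[labels[i % 3]][pair] += 1
    return counts

toy_counts = ry_dinucleotides_by_junction("ATGGCA")
```

Summing RpR and YpY counts per junction and dividing by the total number of dinucleotides at that junction gives the grouping of similar nucleotides that the generalized pattern [RpR/YpY]1-2 [non-RpR/non-YpY]2-3 [RpR/YpY]3-1 summarizes.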

Compositional and sequence biases observed in folding simulations of mRNA

Though RNA molecules have different overall structures, one may expect that there are some common mechanisms providing stability and function of the folded t-, r- and mRNAs. We have shown above that the (G+C) content apparently contributes to the stability of the t- and rRNA (36,59) and not to the stability of coding DNA (54,59,60). It is manifested in the correlation of the former with OGT (Table 2) and the anti-correlation of the base-pairing energy in the folded structures with OGT (Table 3). Nucleotide compositions in coding DNA (hence, in mRNA as well) do not correlate with OGT. The mRNA case is of special interest, because the redundancy of the genetic code endows the nucleotide sequence with a potential to satisfy requirements for DNA, mRNA and protein stability. For example, it has been hypothesized (61) that the optimization of the base-pairing in the mRNA contributes to the formation of stem fragments. It has also been claimed that there is a corresponding periodic pattern of the mRNA secondary structure in human and mouse, which is determined chiefly by the selection that operates on the third codon position (52). An opposing opinion suggests that the secondary structure in mRNA interferes with translation and, therefore, should be avoided (62). Overall, three scenarios of the relation between mRNA and proteins have been considered earlier (52): (i) the biases are determined by the demands on protein stability; (ii) mRNA stability is the major determinant of sequence biases; (iii) the sequence biases related to mechanisms of stability on each level are completely independent. We have computationally folded the mRNA sequences from 244 prokaryotic genomes and analyzed their characteristics (Table 5). All the parameters are reported as relative ones, where the signal in the native sequence is normalized by the value obtained for dicodon-shuffled ones (48). The dicodon shuffling randomizes the mRNA sequence while preserving the native protein sequence, native codon usage and native dinucleotide composition. Therefore, it allows one to separate the peculiarities of mRNA caused by the demands on its stability and function from the others related to the coding DNA and proteins.

The structure of stems in folded mRNA is characterized by phases, which show the relative location of triplets in opposite strands of the stem. Specifically, Phases I, II and III correspond to positioning of triplets where, respectively, the first, second and third nucleotides are complementary. The most and the least frequent pairs in all phases of the dsDNA (see Materials and methods section) are considered. Phases I and II are similar to each other, yielding CG, GC, GU, UG, AU and UA (Table 5). Overall, the Archaea and thermophilic Bacteria have similar trends in the most and least frequent pairs. Mesophilic Bacteria stand out with A13U and G33C as the most frequent pairs in Phases II and III, respectively. The major contribution from the three pairing phases (52) to the total amount of stem pairs is provided by Phase III, and nucleotides in codon position 3 are most frequent in stem pairs (Table 5 and Supplementary Table S1). OGT correlations highlight the contribution from the GU wobble pair to the thermostability of Archaeal mRNA structures (Table 5 and Supplementary Table S1). We found the highest OGT correlation for the Archaeal pairs U32G (Table 5), confirming the importance of the GU wobble pairs. The stronger effect for Archaeal rather than for Bacterial mRNA is possibly a relic of the ancient RNA world, where RNA was the carrier of genetic information and harsh environmental conditions demanded its increased stability.
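One way to tabulate stem pairs together with the codon positions of both partners is sketched below. It assumes a predicted secondary structure in dot-bracket notation without pseudoknots and an mRNA that starts in frame; the tuple label (base, codon position, codon position, base) is a hypothetical stand-in for the pair notation used in Table 5, and the assignment of pairs to Phases I to III is not reproduced.

```python
from collections import Counter

def base_pairs(dot_bracket):
    """Return (i, j) index pairs from a dot-bracket string without pseudoknots."""
    stack, pairs = [], []
    for j, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            pairs.append((stack.pop(), j))
    return pairs

def codon_position_pairs(seq, dot_bracket):
    """Count stem pairs labelled (5' base, its codon position, partner codon position, 3' base)."""
    counts = Counter()
    for i, j in base_pairs(dot_bracket):
        counts[(seq[i], i % 3 + 1, j % 3 + 1, seq[j])] += 1
    return counts

# Toy hairpin: two G-C pairs closed by one G-U wobble pair.
toy_pairs = codon_position_pairs("GGGAAAUCC", "(((...)))")
```

Accumulating such counters over all folded mRNAs of a genome, and dividing by the counts obtained for shuffled controls, would give relative pair frequencies of the kind that can then be correlated with OGT.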

Table 6 shows that the purine load (the contents of A+G, R/Y and ApG) is larger in the loop regions than in the stems. In Archaea, this difference is slightly more pronounced than in Bacteria (Supplementary Table S2). The purine load is also in good positive correlation with OGT for the loops, but not for the stems. However, the amount of ApG dinucleotides is correlated with OGT in both loops and stems (Table 6). For synonymous codons, the fractions of purine-rich codons (e.g. GGR versus GGY for glycine and AGR versus CGY for arginine) are correlated with OGT in both loop and stem regions (slightly stronger in loops). For non-synonymous codons, however, the amount of purine-rich codons (GAR for glutamate) is correlated with OGT in both loops and stems, while the fractions of purine-rich AAR codons for lysine do not correlate with OGT (Table 6). While Forsdyke et al. (21) reported an increase of the purine content and its positive correlation with OGT in loops, we found that the fraction of purine-rich synonymous codons is correlated with OGT in stems as well. At the same time, the amount of non-synonymous purine-rich codons of lysine is not correlated with OGT in either loop or stem regions.
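The loop-versus-stem comparison of purine load can be sketched as follows, assuming that the folded structure is available in dot-bracket notation ('.' for unpaired loop positions, parentheses for paired stem positions). The function name and the toy hairpin are illustrative.

```python
def purine_load(seq, dot_bracket):
    """Return (A+G fraction in loops, A+G fraction in stems) for one folded RNA."""
    seq = seq.upper()
    loop = [base for base, s in zip(seq, dot_bracket) if s == "."]
    stem = [base for base, s in zip(seq, dot_bracket) if s in "()"]

    def fraction(bases):
        # NaN signals that the region is absent from this structure.
        return sum(b in "AG" for b in bases) / len(bases) if bases else float("nan")

    return fraction(loop), fraction(stem)

# Toy hairpin with a purine-only loop and a G-C stem.
loop_ag, stem_ag = purine_load("GGCAGAAGCC", "(((....)))")
```

In a perfectly paired Watson-Crick stem every purine faces a pyrimidine, so the stem value is pinned near 0.5 (as for A+G in the Stem column of Table 6), while loops are free to accumulate purines.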

Table 5. Compositional and dicodon signals of mRNA adaptation purified by the dicodon shuffling

Characteristic                          Archaea                         Bacteria

The most and least (after the comma) frequent pairs in stems of mRNA
Mesophiles                              Phase I    C23G, G23U           Phase I    G32U, U32G
                                        Phase II   C31G, G13U           Phase II   A13U, G13U
                                        Phase III  A33U, U33G           Phase III  G33C, U33G
Thermophiles                            Phase I    G32C, G23U           Phase I    C23G, G23U
                                        Phase II   C22G, G13U           Phase II   C31G, G13U
                                        Phase III  A33U, U33G           Phase III  U33A, U33G

Correlations with OGT of
Segment energy ⟨Esg⟩                    R = –0.71                       R = –0.39
Energy per base pair ⟨Ebp⟩              R = –0.73                       R = –0.54

The most significant correlations       Phase I U32G, R = 0.77          (U21G)Nat PIII, R = 0.46
with OGT                                Phase I G23U, R = 0.71          (G12U)Nat PIII, R = 0.45
                                        Phase I all pairs, R = 0.7      (U21G)dShfld PIII, R = 0.45
                                                                        (G12U)dShfld PIII, R = 0.41

Phases I, II and III correspond to positioning of triplets where, respectively, the first, second and third nucleotides are complementary. All signals (except OGT correlations in Bacteria) are normalized by the corresponding values for control sequences after the dicodon shuffling. See explanations of abbreviations in the Materials and methods section.

Signals of thermophilic adaptation in protein sequences

It has been shown earlier (22) that environmental temperature directly affects the amino acid composition of prokaryotic proteins and that the IVYWREL combination can serve as a predictor of the OGT in prokaryotes. We use here a 'Z-score' predictor of OGT (43), which takes into account differences in variances of the proteomic frequencies of individual amino acids (Table 7). We have shown that the Z-score predictor properly corrects for these differences (43), better reflecting the contribution of the amino acid combinations to thermal adaptation. Separate predictors for Archaea and Bacteria (based on 33 and 99 proteomes with OGTs spread over 5–100 and 10–85°C, respectively) yield the same general trend of an increase of hydrophobic and charged residues at the expense of polar ones (34), showing minor differences in the presence of individual residues.

We have shown earlier (22) that the amino acid bias working in thermal adaptation of proteins does not depend on the overall nucleotide composition of coding DNA sequences. However, one can still expect that nucleotide compositions of particular codon positions and/or dinucleotide compositions of positions 1-2 and 2-3 in codons may be somehow linked to amino acid biases. We indeed found a significant difference between the nucleotide load in individual codon positions of Archaeal and Bacterial coding DNA sequences (Table 1). First, Archaeal protein coding sequences yield a strong correlation between the (T+G)2 load in the second codon position and OGT (r = 0.72). The corresponding non-codon-biased sequences show a comparable correlation (r = 0.68), indicating that the (T+G) load reflects tuning of the amino acid composition in connection with thermal adaptation. The main contributor to this correlation is thymine (Table 1), which provides an increase of the strongly hydrophobic residues LMFIV (Supplementary File S1). The (T+G) load is much more weakly correlated with OGT in Bacterial sequences (r = 0.36 for natural and r = 0.42 for NCB sequences). The correlation of the thymine load with OGT is 0.37 for both natural (Nat) and non-codon-biased (NCB) sequences. This observation agrees with the presumed prevalence of the structure-based strategy of thermal adaptation in Archaeal proteins (29), contrary to the sequence-based one in Bacterial proteomes. Specifically, there is a clear correlation between the amounts of charged residues and the OGT in the set of Bacterial proteomes (Table 7), presumably reflecting the domination of the sequence-based strategy in thermophilic adaptation of Bacteria (29).

Finally, we checked if there exists any specific connection between the dipeptide composition of proteins and that of 3-1 dinucleotides. We found (data not shown) that the strongest correlations between dinucleotides 3-1 and dipeptides exist mostly for the dinucleotides with guanine in position 3. The most frequent pair of amino acids contains methionine (Met, encoded exclusively by the AUG codon) as the first and any polar/charged amino acid as the second one. The preference of a polar/charged amino acid to be next to Met can be explained by the typical surface location of the protein N-termini and its role as a signal for ubiquitination (63).
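The idea behind the 'Z-score' predictor of OGT mentioned earlier in this section can be illustrated with a short sketch. The exact definition is given in (43) and is not reproduced here; the code below only assumes that each amino acid frequency is standardized across proteomes (so that residues with large and small variances contribute on the same scale) and that the standardized values are summed over a chosen residue set. The toy frequencies are made up.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize a list of numbers to zero mean and unit (population) variance."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def z_score_predictor(proteome_freqs, residues):
    """Sum of z-scored residue frequencies for each proteome.

    proteome_freqs: list of dicts {one-letter amino acid: frequency}, one per proteome.
    residues: iterable of one-letter codes included in the predictor (assumption).
    """
    per_residue = {aa: z_scores([p[aa] for p in proteome_freqs]) for aa in residues}
    return [sum(per_residue[aa][k] for aa in residues)
            for k in range(len(proteome_freqs))]

# Hypothetical frequencies of Ile and Lys in three toy proteomes.
toy_proteomes = [{"I": 0.05, "K": 0.04}, {"I": 0.07, "K": 0.06}, {"I": 0.09, "K": 0.08}]
toy_scores = z_score_predictor(toy_proteomes, "IK")
```

The resulting score per proteome can then be correlated with OGT; a residue whose frequency does not vary across proteomes has zero variance and would have to be excluded before standardization.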

Table 6. Purine loading in loop and stem regions of folded mRNA and its OGT correlation

Feature          Loop                            Stem                            L-v-S P-value
                 Mean content  OGT correlation   Mean content  OGT correlation
A+G              0.560         0.59              0.500         –0.26             <2.2E–16
R/Y              1.299         0.61              1.002         –0.26             <2.2E–16
ApG              0.061         0.79              0.051         0.50              0.0002
GGR (glycine)    0.027         0.62              0.045         0.42              1.2E–17
GGY (glycine)    0.017         –0.05             0.079         –0.23             1.0E–42
AGR (arginine)   0.035         0.72              0.022         0.56              2.4E–10
CGR (arginine)   0.018         –0.22             0.040         –0.24             7.1E–09
CGY (arginine)   0.016         –0.25             0.037         –0.26             6.2E–10
GAR (glutamate)  0.060         0.71              0.036         0.59              <2.2E–16
AAR (lysine)     0.086         0.22              0.017         0.11              <2.2E–16
GAY (aspartate)  0.037         –0.17             0.040         –0.37             0.0002

Feature: analyzed nucleotide, dinucleotide or amino acid. Loop and Stem: information on the mean content and OGT correlation of the above. L-v-S: a comparison between the corresponding contents in the loop and stem regions by Wilcoxon tests; P-values are shown in this column. Correlation coefficients with OGT are shown with significance levels P-value < 0.01 and P-value < 0.0001.

Table 7. Signals of thermophilic adaptation in protein sequences of Archaea and Bacteria

Correlation with OGT: Archaea | Bacteria
Z-scored predictor: ILVW Y DKR (R = 0.93) | IPV Y EKR (R = 0.89)
The most abundant residues in >70% (>60%) of all Z-scored predictors: VIWL Y ER | VPI Y ER(K)
Individual amino acids: + L W | + E; – T Q D
Types of amino acids: + h | + h c; – p | – p
Dipeptides: + hp ph | + cc; – cp pc | – pc

+: increase of the amino acid (or amino acid type) fraction with OGT; –: decrease of the amino acid (or amino acid type) fraction with OGT. Capital letters are names of amino acids; h, p, c are hydrophobic, polar and charged types of amino acids. Correlation coefficients between the Z-scored thermostability predictors and OGT are given in parentheses.

DISCUSSION

We will make a brief overview of the most pronounced biases and causal relationships in DNA, RNA and proteins. The major signal in nucleotide compositions, specifically the (G+C) load, is related to thermal stabilization of t- and rRNA (36,54,55,59,60) and to discrimination between the coding and non-coding sequences in Archaea (Table 2). Comparative analysis of nucleotide and amino acid compositions in relation to thermophilic adaptation prompts us to conclude that the (G+C) content does not contribute to thermostabilization of coding DNA (18,22,36,54,59), nor does it affect the amino acid composition and its thermophilic trends (15,22). When separate codon positions are considered for coding DNA, the only compositional bias observed for nucleotide compositions is an excess of (G+C) load in the third codon position in Bacteria. There is a significant excess of guanine and cytosine nucleotides (29.2% and 32.82%) compared to the NCB nucleotide contents (26.09% and 24.66%). A plausible explanation for this bias is the role of the (G+C)3 load in the adaptation to the aerobic life style, which dominates in Bacteria. The (G+C)3 load can be advantageous for several complementary reasons. Indeed, additional GC base pairing will contribute to the stability of the double helix of DNA and stem regions of RNA molecules. Moreover, G3 bases can work as scavengers of oxidizing agents, providing protection for G bases in other codon positions (56–58). At the same time, this bias would not lead to any changes in amino acid composition, leaving protein structure and stability intact (55).
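The (G+C)3 load discussed above is simply the percentage of codons that carry G or C in the third position. A minimal sketch, assuming an in-frame coding sequence (the function name and the toy sequence are illustrative):

```python
def gc3_load(cds):
    """Percentage of codons in `cds` with G or C in the third codon position.

    Assumption: `cds` starts in frame; an incomplete trailing codon is ignored.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    third = cds[2:usable:3]
    return 100.0 * sum(base in "GC" for base in third) / len(third)

# Codons ATG, GCC, AAG, TTA: three of four end in G or C.
toy_gc3 = gc3_load("ATGGCCAAGTTA")
```

Comparing this value between natural sequences and controls with eliminated codon bias (not reproduced here) is what separates the codon-bias contribution from the one imposed by the amino acid composition.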

In this work, we considered compositional and sequence biases in proteins in relation to those in the corresponding nucleic acid sequences and to the phylogeny of species (Archaea or Bacteria). OGT correlations of nucleotides in different codon positions are similar in Archaea and Bacteria, but higher in the former (Table 1). The codon that is optimal from the point of view of thermophilic adaptation (i.e. most correlated with OGT) reads A1 [TG]2 non-[AG]3 in Archaea, but [weak A]1 [weak T]2 non-[AG]3 in Bacteria (Table 4). Codon bias supports the strong correlation between the amount of adenine and OGT in the first codon position in Archaea, and the same but weaker correlation in Bacteria. While we showed earlier (22) that the amino acid thermophilic trend is not determined by the overall nucleotide composition, consideration of separate codon positions reveals an interesting link between positional nucleotide frequencies and amino acid composition in relation to the domain of life. Specifically, the prevalence of thymine in the second codon position and its strong correlation with OGT in Archaea is apparently a result of the demand for enrichment of Archaeal proteins with hydrophobic residues (Table 7). Noteworthy, a massive increase of van der Waals interactions (24,64) was found to be the cornerstone of the structure-based evolutionary strategy of protein thermostability in ancient species (29). In Bacterial proteins, however, there is an increase of the fractions of charged residues with OGT (Table 7). Thus, we observe a transition from the structure-based evolutionary strategy of protein thermostability in Archaea to the sequence-based one in Bacteria, together with the corresponding nucleotide and protein compositional biases (Table 7) underlying these strategies (29). The selection against adenine and guanine in Archaea, and against adenine in Bacteria, which characterizes the third codon position, is a result of this bias.

The most pronounced dinucleotide compositional biases are the excess of homo-dinucleotides and its correlation with OGT in ncDNA, tRNA and rRNA. It is presumably a relic of ancient primitive homo(poly)nucleotides from which life started. Overrepresentation of the complementary ApG/CpT dinucleotides points to their contribution to DNA/RNA stability via strong base-stacking interactions (65,66). The consideration of dinucleotide biases

in purine (R) and pyrimidine (Y) notation also reveals an interesting picture. Overall, for both Archaea and Bacteria, the signature for thermophilic dinucleotides reads RpR/YpY, non-(RpR/YpY), RpR/YpY for codon positions 1-2, 2-3 and 3-1, respectively (Table 4). The major difference between Archaea and Bacteria in this case is that the dinucleotide bias in positions 1-2 is caused by demands on the nucleotide sequence level (and provided by the codon bias) in Archaea, while some other factors work in Bacteria. The preference for homodinucleotides in positions 1-2 and 3-1 can be a possible reason for avoiding homodinucleotides in positions 2-3, because heterodinucleotides can contribute to the tuning of DNA flexibility. Indeed, it has been shown that, given an average flexibility F of homodinucleotides (either RpR or YpY), the flexibilities of the heterodinucleotides YpR and RpY are 2F and 0.5F, respectively (67). The codon positions 2-3 are the most versatile in terms of the relationship between the nucleotide and protein sequences. Therefore, the presence of heterodinucleotides in codon positions 2-3 makes it possible to adjust the flexibility of the nucleotide sequence without changing the physical–chemical characteristics of the encoded amino acid residue. Additionally, we found that the link between dipeptide and corresponding 3-1 dinucleotide frequencies is determined by the amino acid dipeptide. The dominating bias reads as Met followed by a polar or charged residue.

On the RNA level, simple consideration of the nucleotide composition suggests distinguishing the tRNA and rRNA molecules from the mRNA. Indeed, based on the correlation with OGT, we can conclude that (G+C) content is important for providing stability of tRNA and rRNA molecules, but not of mRNA. The role of the (G+C) content was further confirmed by folding simulations (Table 3). We also found other complementary signals in mRNA sequences related to its structure, stability and function. Archaea and Bacteria yield similar trends in the most and least frequent nucleotide pairs, and the third codon position makes a major contribution to the base-pairing mechanism of stem stabilization. Specifically, in addition to Watson–Crick pairing, the GU wobble pairs sufficiently contribute to the thermal stability of the Archaeal mRNA, in agreement with earlier observations (68,69). Overall, it results in a moderate correlation of the segment energy and the energy per base pair of the folded mRNA (Table 5). Folding simulations performed in this work also allowed us to analyze peculiarities of the nucleotide contents and structural contacts in stem and loop regions of folded mRNA. Purine load in loop regions is higher than in stems, and this effect is slightly stronger in Archaea than in Bacteria (Table 6). Purine load also correlates with OGT in loops, but not in stems. This correlation was observed earlier by Forsdyke et al. (21), and it was described as 'polite purine load of the loop regions' that prevents undesired mRNA–mRNA single-strand interactions. The authors concluded that increased purine load can affect the codon choice and, consequently, the amino acid composition (21). Our data [as well as careful analysis of the original data in (21)] do not support the latter claim. There is indeed a correlation between the fraction of


some purine-rich codons [GGR (Gly), AGR (Arg) and GAR (Glu)] and OGT (Table 6). However, the fraction of the purine-rich non-synonymous codons AAR (Lys) is not correlated with OGT (Table 6). Moreover, these codons are synonymous and cannot directly affect the amino acid composition. It has been shown earlier (22,70) that an increase of the Glu and Arg fractions is a trend in protein thermophilic adaptation. At the same time, purine load remains high after elimination of the codon bias, pointing to the amino acid composition as the most probable cause of the former (22). All the above prompts one to conclude that, in the mutual tuning of nucleotide and amino acid compositions, the purine load does not dominate and does not determine biases in amino acid composition, if not the opposite. The analysis of purine load shows that it correlates with OGT in both stems and loops (Table 6) in the case of the purine-rich codons GGR (Gly), AGR (Arg) and GAR (Glu). Further, the amount of ApG dinucleotides strongly correlates with OGT in both loops and stems (Table 6), and ApG provides a strong base stacking important for the thermal stability of nucleic acid sequences (22,65). Purine load, therefore, is apparently a determinant of the base-stacking mechanism in mRNA and/or DNA thermal stability, as well as a result of the thermophilic amino acid composition trend.

Prokaryotes thrive under the temperature interval

spanning over a hundred degrees, and they represent two major life styles, aerobic and anaerobic. Analysis of complete Archaeal and Bacterial genomes unraveled compositional and sequence signals related to molecular mechanisms of stability and adaptation, unaffected by selective sequencing or by the comparison of orthologs. Overall, codon bias works stronger in Archaea and is mostly utilized in thermophilic adaptation of nucleic acids. It apparently reflects the longer evolutionary history of Archaea, which presumably started close to the origin of life in hot conditions (29). The codon bias and amino acid sequences (dipeptide composition) work in accord to support the enrichment of the nucleotide sequences with ApG/CpT dinucleotides, the determinants of the base-stacking mechanism of nucleic acid stability (65–67). We also found that the second codon position reveals a strong link between the nucleotide and amino acid compositions. Specifically, the excess of thymine in this position is a result of a demand for the enrichment of Archaeal proteins with hydrophobic amino acids. The third codon position in Bacteria is the only case where codon bias is detectable already on the level of pure composition. We found that the (G+C)3 load is related to the aerobic life style dominating in Bacteria. From the point of view of thermophilic adaptation, codon bias works against G in the third codon position in both Archaea and Bacteria. It thus supports a specific role of the third codon position in discriminating between adaptations to temperature and to the aerobic life style. Finally, we obtained an interesting and complex picture of the relationship between the nucleotide composition (purine load) and amino acid composition (selection between Arg and Lys) in relation to thermal adaptation, aerobicity and phylogeny. Specifically, purine load in both Archaea and Bacteria is a result of the 'from both ends of the hydrophobicity scale' trend in thermal adaptation of

proteins (34), reflected in the IVYWREL predictor of the OGT (22). According to this predictor, Arg is a preferred amino acid for thermal adaptation, though Lys is the next candidate, present in the top most correlated predictors as well (22,43). In particular, selection for Lys over Arg in some species was shown to be important for the entropic mechanism of protein thermostabilization (70). However, the overall preference for Arg over Lys in the case of thermal adaptation is well manifested in the excess of the purine-rich codons of Arg (AGR) at the expense of the purine-rich ones of Lys (AAR). At the same time, discrimination between Arg and Lys works in the opposite direction in aerobicity, where the Arg codons AGA and AGG are suppressed in favor of the purine-rich AAA and AAG of Lys. It thus reflects a preference for Lys over Arg in aerobes compared to anaerobes in particular, and in Bacteria versus Archaea in general, preserving at the same time the high purine load necessary for providing base stacking in the corresponding nucleic acids (22,53,65). The intricate connection between Lys/Arg and their codons in relation to thermal adaptation and aerobicity exemplifies how selection can work on nucleic acids and proteins simultaneously in response to the demands of different environments. Obviously, the whole picture of molecular mechanisms of adaptation and relations between them is far from complete. Consideration of other environmental factors, such as salinity, pressure, etc., will help to unravel new mechanisms of stability and their sequence/structure determinants, and to understand the tradeoffs that Nature embraced en route of evolution and adaptation.
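The Arg/Lys tradeoff described above can be made concrete by counting the purine-rich AGR (Arg) and AAR (Lys) codons alongside the overall purine load. The sketch below is only an illustration under simple assumptions (an in-frame DNA string as input; the helper names and the toy sequence are hypothetical), not the procedure used to produce the reported statistics.

```python
from collections import Counter


def codon_counts(cds: str) -> Counter:
    """Count in-frame codons of a coding sequence (hypothetical helper)."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def purine_load(seq: str) -> float:
    """Fraction of A and G bases in a non-empty sequence."""
    seq = seq.upper()
    return sum(base in "AG" for base in seq) / len(seq)


def arg_lys_purine_codons(cds: str) -> tuple:
    """Return (AGR count, AAR count): purine-rich codons of Arg and Lys."""
    counts = codon_counts(cds)
    agr = counts["AGA"] + counts["AGG"]   # purine-rich Arg codons
    aar = counts["AAA"] + counts["AAG"]   # Lys codons
    return agr, aar


toy_cds = "ATGAGAAAGAAAAGGTAA"  # made-up example: M R K K R stop
print(arg_lys_purine_codons(toy_cds), round(purine_load(toy_cds), 3))
```

Swapping an AGR codon for an AAR codon changes the encoded residue from Arg to Lys while leaving all three bases purines, which is why the purine load can stay high under either preference.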

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Funding for open access charge: Functional Genomics Program (FUGE II), Norwegian Research Council.

Conflict of interest statement. None declared.

REFERENCES

1. Crick,F. (1970) Central dogma of molecular biology. Nature, 227, 561–563.

2. Tawfik,D.S. (2010) Messy biology and the origins of evolutionary innovations. Nat. Chem. Biol., 6, 692–696.

3. Tokuriki,N. and Tawfik,D.S. (2009) Stability effects of mutations and protein evolvability. Curr. Opin. Struct. Biol., 19, 596–604.

4. Tokuriki,N. and Tawfik,D.S. (2009) Protein dynamism and evolvability. Science, 324, 203–207.

5. Koonin,E.V. (2012) Does the central dogma still stand? Biol. Direct, 7, 27.

6. Pe'er,I., Felder,C.E., Man,O., Silman,I., Sussman,J.L. and Beckmann,J.S. (2004) Proteomic signatures: amino acid and oligopeptide compositions differentiate among phyla. Proteins, 54, 20–40.

7. Aravind,L., Tatusov,R.L., Wolf,Y.I., Walker,D.R. and Koonin,E.V. (1998) Evidence for massive gene exchange between archaeal and bacterial hyperthermophiles. Trends Genet., 14, 442–444.

8. Khachane,A.N., Timmis,K.N. and dos Santos,V.A. (2005) Uracil content of 16S rRNA of thermophilic and psychrophilic prokaryotes correlates inversely with their optimal growth temperatures. Nucleic Acids Res., 33, 4016–4022.

9. Suhre,K. and Claverie,J.M. (2003) Genomic correlates of hyperthermostability, an update. J. Biol. Chem., 278, 17198–17202.

10. Tekaia,F. and Yeramian,E. (2006) Evolution of proteomes: fundamental signatures and global trends in amino acid compositions. BMC Genom., 7, 307.

11. Wang,H.C. and Hickey,D.A. (2002) Evidence for strong selective constraint acting on the nucleotide composition of 16S ribosomal RNA genes. Nucleic Acids Res., 30, 2501–2507.

12. Friedman,R., Drake,J.W. and Hughes,A.L. (2004) Genome-wide patterns of nucleotide substitution reveal stringent functional constraints on the protein sequences of thermophiles. Genetics, 167, 1507–1512.

13. Lynn,D.J., Singer,G.A. and Hickey,D.A. (2002) Synonymous codon usage is subject to selection in thermophilic bacteria. Nucleic Acids Res., 30, 4272–4277.

14. Singer,G.A. and Hickey,D.A. (2000) Nucleotide bias causes a genomewide bias in the amino acid composition of proteins. Mol. Biol. Evol., 17, 1581–1588.

15. Singer,G.A. and Hickey,D.A. (2003) Thermophilic prokaryotes have characteristic patterns of codon usage, amino acid composition and nucleotide content. Gene, 317, 39–47.

16. Tekaia,F., Yeramian,E. and Dujon,B. (2002) Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene, 297, 51–60.

17. Roy Chowdhury,A. and Dutta,C. (2012) A pursuit of lineage-specific and niche-specific proteome features in the world of archaea. BMC Genom., 13, 236.

18. Wang,H.C., Susko,E. and Roger,A.J. (2006) On the correlation between genomic G+C content and optimal growth temperature in prokaryotes: data quality and confounding factors. Biochem. Biophys. Res. Commun., 342, 681–684.

19. Wu,H., Zhang,Z., Hu,S. and Yu,J. (2012) On the molecular mechanism of GC content variation among eubacterial genomes. Biol. Direct, 7, 2.

20. Knight,R.D., Freeland,S.J. and Landweber,L.F. (2001) A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes. Genome Biol., 2, RESEARCH0010.

21. Lao,P.J. and Forsdyke,D.R. (2000) Thermophilic bacteria strictly obey Szybalski's transcription direction rule and politely purine-load RNAs with both adenine and guanine. Genome Res., 10, 228–236.

22. Zeldovich,K.B., Berezovsky,I.N. and Shakhnovich,E.I. (2007) Protein and DNA sequence determinants of thermophilic adaptation. PLoS Comput. Biol., 3, e5.

23. Kreil,D.P. and Ouzounis,C.A. (2001) Identification of thermophilic species by the amino acid compositions deduced from their genomes. Nucleic Acids Res., 29, 1608–1615.

24. Berezovsky,I.N., Tumanyan,V.G. and Esipova,N.G. (1997) Representation of amino acid sequences in terms of interaction energy in protein globules. FEBS Lett., 418, 43–46.

25. Cambillau,C. and Claverie,J.M. (2000) Structural and genomic correlates of hyperthermostability. J. Biol. Chem., 275, 32383–32386.

26. Greaves,R.B. and Warwicker,J. (2007) Mechanisms for stabilisation and the maintenance of solubility in proteins from thermophiles. BMC Struct. Biol., 7, 18.

27. Jaenicke,R. (1999) Stability and folding of domain proteins. Prog. Biophys. Mol. Biol., 71, 155–241.

28. Jaenicke,R. and Bohm,G. (1998) The stability of proteins in extreme environments. Curr. Opin. Struct. Biol., 8, 738–748.

29. Berezovsky,I.N. and Shakhnovich,E.I. (2005) Physics and evolution of thermophilic adaptation. Proc. Natl Acad. Sci. USA, 102, 12742–12747.

30. Chakravarty,S. and Varadarajan,R. (2002) Elucidation of factors responsible for enhanced thermal stability of proteins: a structural genomics based study. Biochemistry, 41, 8152–8161.

31. Glyakina,A.V., Garbuzynskiy,S.O., Lobanov,M.Y. and Galzitskaya,O.V. (2007) Different packing of external residues can explain differences in the thermostability of proteins from thermophilic and mesophilic organisms. Bioinformatics, 23, 2231–2238.

32. Thompson,M.J. and Eisenberg,D. (1999) Transproteomic evidence of a loop-deletion mechanism for enhancing protein thermostability. J. Mol. Biol., 290, 595–604.

33. Tokuriki,N., Oldfield,C.J., Uversky,V.N., Berezovsky,I.N. and Tawfik,D.S. (2009) Do viral proteins possess unique biophysical features? Trends Biochem. Sci., 34, 53–59.

34. Berezovsky,I.N., Zeldovich,K.B. and Shakhnovich,E.I. (2007) Positive and negative design in stability and thermal adaptation of natural proteins. PLoS Comput. Biol., 3, e52.

35. Bharanidharan,D., Bhargavi,G.R., Uthanumallian,K. and Gautham,N. (2004) Correlations between nucleotide frequencies and amino acid composition in 115 bacterial species. Biochem. Biophys. Res. Commun., 315, 1097–1103.

36. Nakashima,H., Fukuchi,S. and Nishikawa,K. (2003) Compositional changes in RNA, DNA and proteins for bacterial adaptation to higher and lower temperatures. J. Biochem. (Tokyo), 133, 507–513.

37. Dehouck,Y., Folch,B. and Rooman,M. (2008) Revisiting the correlation between proteins' thermoresistance and organisms' thermophilicity. Protein Eng. Des. Sel., 21, 275–278.

38. Folch,B., Dehouck,Y. and Rooman,M. (2010) Thermo- and mesostabilizing protein interactions identified by temperature-dependent statistical potentials. Biophys. J., 98, 667–677.

39. Folch,B., Rooman,M. and Dehouck,Y. (2008) Thermostability of salt bridges versus hydrophobic interactions in proteins probed by statistical potentials. J. Chem. Inf. Model., 48, 119–127.

40. Gonnelli,G., Rooman,M. and Dehouck,Y. (2012) Structure-based mutant stability predictions on proteins of unknown structure. J. Biotechnol., 161, 287–293.

41. Ponnuswamy,P., Muthusamy,R. and Manavalan,P. (1982) Amino acid composition and thermal stability of globular proteins. Int. J. Biol. Macromol., 4, 186–190.

42. Berezovsky,I.N. (2011) The diversity of physical forces and mechanisms in intermolecular interactions. Phys. Biol., 8, 035002.

43. Ma,B.G., Goncearenco,A. and Berezovsky,I.N. (2010) Thermophilic adaptation of protein complexes inferred from proteomic homology modeling. Structure, 18, 819–828.

44. Makarova,K.S. and Koonin,E.V. (2005) Evolutionary and functional genomics of the Archaea. Curr. Opin. Microbiol., 8, 586–594.

45. Novichkov,P.S., Wolf,Y.I., Dubchak,I. and Koonin,E.V. (2009) Trends in prokaryotic evolution revealed by comparison of closely related bacterial and archaeal genomes. J. Bacteriol., 191, 65–73.

46. Koonin,E.V. and Wolf,Y.I. (2008) Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res., 36, 6688–6719.

47. Koonin,E.V., Mushegian,A.R., Galperin,M.Y. and Walker,D.R. (1997) Comparison of archaeal and bacterial genomes: computer analysis of protein sequences predicts novel functions and suggests a chimeric origin for the archaea. Mol. Microbiol., 25, 619–637.

48. Katz,L. and Burge,C.B. (2003) Widespread selection for local RNA secondary structure in coding regions of bacterial genes. Genome Res., 13, 2042–2051.

49. Hofacker,I.L., Priwitzer,B. and Stadler,P.F. (2004) Prediction of locally stable RNA secondary structures for genome-wide surveys. Bioinformatics, 20, 186–190.

50. Zuker,M. and Stiegler,P. (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res., 9, 133–148.

51. Mathews,D.H., Sabina,J., Zuker,M. and Turner,D.H. (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol., 288, 911–940.

52. Shabalina,S.A., Ogurtsov,A.Y. and Spiridonov,N.A. (2006) A periodic pattern of mRNA secondary structure created by the genetic code. Nucleic Acids Res., 34, 2428–2437.

53. Marmur,J. and Doty,P. (1962) Determination of the base composition of deoxyribonucleic acid from its thermal denaturation temperature. J. Mol. Biol., 5, 109–118.


54. Hurst,L.D. and Merchant,A.R. (2001) High guanine-cytosine content is not an adaptation to high temperature: a comparative analysis amongst prokaryotes. Proc. Biol. Sci., 268, 493–497.

55. Naya,H., Romero,H., Zavala,A., Alvarez,B. and Musto,H. (2002) Aerobiosis increases the genomic guanine plus cytosine content (GC%) in prokaryotes. J. Mol. Evol., 55, 260–264.

56. Beckman,K.B. and Ames,B.N. (1997) Oxidative decay of DNA. J. Biol. Chem., 272, 19633–19636.

57. Cheng,K.C., Cahill,D.S., Kasai,H., Nishimura,S. and Loeb,L.A. (1992) 8-Hydroxyguanine, an abundant form of oxidative DNA damage, causes G→T and A→C substitutions. J. Biol. Chem., 267, 166–172.

58. Vieira-Silva,S. and Rocha,E.P. (2008) An assessment of the impacts of molecular oxygen on the evolution of proteomes. Mol. Biol. Evol., 25, 1931–1942.

59. Galtier,N. and Lobry,J.R. (1997) Relationships between genomic G+C content, RNA secondary structures, and optimal growth temperature in prokaryotes. J. Mol. Evol., 44, 632–636.

60. Musto,H., Naya,H., Zavala,A., Romero,H., Alvarez-Valin,F. and Bernardi,G. (2004) Correlations between genomic GC levels and optimal growth temperatures in prokaryotes. FEBS Lett., 573, 73–77.

61. Fitch,W.M. (1974) The large extent of putative secondary nucleic acid structure in random nucleotide sequences or amino acid derived messenger-RNA. J. Mol. Evol., 3, 279–291.

62. Workman,C. and Krogh,A. (1999) No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. Nucleic Acids Res., 27, 4816–4822.

63. Bachmair,A., Finley,D. and Varshavsky,A. (1986) In vivo half-life of a protein is a function of its amino-terminal residue. Science, 234, 179–186.

64. Berezovsky,I.N., Namiot,V.A., Tumanyan,V.G. and Esipova,N.G. (1999) Hierarchy of the interaction energy distribution in the spatial structure of globular proteins and the problem of domain definition. J. Biomol. Struct. Dyn., 17, 133–155.

65. Yakovchuk,P., Protozanova,E. and Frank-Kamenetskii,M.D. (2006) Base-stacking and base-pairing contributions into thermal stability of the DNA double helix. Nucleic Acids Res., 34, 564–574.

66. Friedman,R.A. and Honig,B. (1995) A free energy analysis of nucleic acid base stacking in aqueous solution. Biophys. J., 69, 1528–1535.

67. Okonogi,T.M., Alley,S.C., Reese,A.W., Hopkins,P.B. and Robinson,B.H. (2002) Sequence-dependent dynamics of duplex DNA: the applicability of a dinucleotide model. Biophys. J., 83, 3446–3459.

68. Masquida,B. and Westhof,E. (2000) On the wobble GoU and related pairs. RNA, 6, 9–15.

69. Varani,G. and McClain,W.H. (2000) The G x U wobble base pair. A fundamental building block of RNA structure crucial to RNA function in diverse biological systems. EMBO Rep., 1, 18–23.

70. Berezovsky,I.N., Chen,W.W., Choi,P.J. and Shakhnovich,E.I. (2005) Entropic stabilization of proteins and its proteomic consequences. PLoS Comput. Biol., 1, e47.


thus to the packing of the hydrophobic core. Noteworthy, elimination of the codon bias does not affect the correlation of the thymine fraction with OGT (R = 0.55 in both native and NCB sequences). This shows that the excess of thymine in the second codon position is a result of selection on the protein level. An apparent explanation for such selection is the domination of the structure-based strategy in the thermostabilization of Archaeal proteins. This strategy is characterized by the increased compactness of the hydrophobic core provided by massive van der Waals contacts. The correlation of the combination of A and G with OGT is not provided by tuning of the codon bias, which points to adaptation on a level other than DNA.
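The comparison of native and NCB sequences can be illustrated with a small sketch. It assumes, as a simplification that may differ from the procedure in the paper's Methods, that a no-codon-bias control is built by replacing every codon with a uniformly chosen synonymous codon from the standard genetic code; `remove_codon_bias` and `translate` are hypothetical helper names.

```python
import random
from collections import defaultdict

# Standard genetic code, bases ordered T, C, A, G at each codon position
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TO_AA = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
SYNONYMS = defaultdict(list)
for codon, aa in CODON_TO_AA.items():
    SYNONYMS[aa].append(codon)


def codons(cds: str):
    """Yield in-frame codons of an upper-case DNA coding sequence."""
    usable = len(cds) - len(cds) % 3
    return (cds[i:i + 3] for i in range(0, usable, 3))


def translate(cds: str) -> str:
    return "".join(CODON_TO_AA[c] for c in codons(cds.upper()))


def remove_codon_bias(cds: str, rng: random.Random) -> str:
    """Assumed NCB model: pick each codon uniformly among its synonyms."""
    return "".join(rng.choice(SYNONYMS[CODON_TO_AA[c]]) for c in codons(cds.upper()))


native = "ATGGCCAAGCTGTAA"  # made-up example sequence
ncb = remove_codon_bias(native, random.Random(0))
print(translate(native) == translate(ncb))                    # protein is preserved
print(native[1::3].count("T") == ncb[1::3].count("T"))        # T in position 2 is preserved
```

Under this assumed model the thymine count in the second codon position cannot change, because every codon with T in position 2 (Phe, Leu, Ile, Met, Val) has only synonyms that also carry T there. This is consistent with the identical correlation reported above for native and NCB sequences.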

Bacterial coding DNA sequences yield fewer position-dependent correlations with OGT than Archaeal ones. The ratio of the natural adenine frequency over the one for eliminated codon bias is weakly correlated with OGT (Table 1). There are also moderate correlations of the thymine frequency and of the thymine and guanine combination with OGT in the second codon position (Table 1). The third codon position is characterized by selection against adenine provided by the codon bias (Table 1). However, the combination of adenine and guanine is correlated with OGT. It does not depend on the codon bias to a large extent (same as in Archaea), pointing to a possible signal of adaptation on a level other than the protein-coding DNA sequence. The generalized codon in Bacteria (Table 4) characterizing the thermophilic trends therefore reads [weak A]1 [weak T]2 [non-(AG)]3.
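The position-dependent correlations with OGT referred to above combine two simple steps: a per-position nucleotide fraction for each genome and a correlation of that fraction with OGT across genomes. The following sketch uses the Pearson coefficient as an assumption (the correlation measure is not restated in this section), and the three toy "genomes" with their OGT values are invented purely for illustration.

```python
from math import sqrt


def positional_fraction(cds: str, base: str, position: int) -> float:
    """Fraction of `base` among nucleotides at codon position 1, 2 or 3."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    column = cds[position - 1:usable:3]
    return column.count(base) / len(column)


def pearson_r(xs, ys) -> float:
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


# Hypothetical (coding sequence, OGT in degrees C) pairs, not real genomes
toy_genomes = [("ATGAAAGCCTAA", 37.0),
               ("ATGAAAAGATAA", 65.0),
               ("ATGAGAAAAATATAA", 85.0)]
a1 = [positional_fraction(seq, "A", 1) for seq, _ in toy_genomes]
ogt = [t for _, t in toy_genomes]
print(pearson_r(a1, ogt))
```

Running the same calculation on codon-bias-free controls and comparing the two coefficients would indicate whether a given correlation is provided by the codon bias.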

The role of excessive (G+C) load in the third codonposition in adaptation to aerobicity

In both Archaea and Bacteria, the average nucleotide composition in different codon positions is not affected by the codon bias to a large extent, except for the third codon position in Bacteria. However, the compositional

variance on the third position is essentially destroyed when there is no codon bias [standard deviation (SD) diminishes from >10 to <1, see Supplementary File S5]. The natural (G+C)3 load in Bacteria yields a 9% excess in comparison with NCB sequences (P-value = 1.7e−07, Supplementary File S6), while in Archaea there is no significant difference. We found that, despite the frequent involvement of G3 and C3 nucleotides in the GC/CG pairs, the (G+C)3 load does not play a crucial role in the thermal stability of mRNA (52,54). Overall, the difference in (G+C)3 load between Archaea and Bacteria is insignificant. If we consider separate thermal groups, the only difference appears in mesophilic Bacteria (9.5% more in Bacteria, P = 0.09). However, when we take into account the oxygen tolerance factor, the difference in (G+C)3 load becomes extremely pronounced between aerobic (70.13% in natural and 50.84% in NCB) and anaerobic (48.94% in natural and 50.46% in NCB) species (P-value = 2e−09). Considering the lack of OGT correlation in protein-coding nucleotides (except thymine in the second codon position), one can conclude that the increased (G+C)3 load is a result of the aerobic life style rather than of thermal adaptation (Supplementary Files S2 and S3). The comparison of aerobes and anaerobes in the group of mesophilic Bacteria corroborates this conclusion (Supplementary File S6). The difference is >20% of (G+C)3 content: 70.97% in aerobes and 50.51% in anaerobes (P = 0.000178). This adaptation is entirely driven by the codon bias, because if the codon bias is removed, the (G+C)3 bias, as well as the difference between aerobes and anaerobes, disappears (50.87% and 50.64%).

We also compared usage of codons with G or C in a

third position with other codons ('Codon Usage' in Supplementary Files S3 and S6). There are three amino acids encoded by six codons (Leu, Arg, Ser), five by four codons (Ala, Pro, Thr, Gly, Val), one by three codons (Ile), nine by two codons (Lys, Asn, Asp, Phe, Cys, Gln, Glu,

Table 4. Generalized nucleotides and dinucleotides in different codon positions favorable for thermostability

Nucleotides correlated with OGT

Domain of life   Codon position       1            2                              3
Archaea          Nucleotide           A            TG                             Non-[AG]
                 Origin of the bias   Codon bias   T: amino acid; G: codon bias   Against codon bias
Bacteria         Nucleotide           Weak A       Weak T                         Non-[AG]
                 Origin of the bias   Codon bias   Amino acid                     Against codon bias

Dinucleotides in purine (R) and pyrimidine (Y) notation correlated with OGT

Domain of life   Codon position       1-2              2-3           3-1
Archaea          Dinucleotide         RpR/YpY          Non-RpR/YpY   RpR/YpY
                 Origin of the bias   Codon bias       Codon bias    Amino acid
Bacteria         Dinucleotide         RpR/YpY          Non-RpR/YpY   RpR/YpY
                 Origin of the bias   Not codon bias   Codon bias    Amino acid

Part 1: thermophilic-prone nucleotide biases. Columns 1, 2, 3 contain information on favorable nucleotides and the origin of the bias in codon positions 1, 2, 3. Part 2: thermophilic-prone dinucleotide biases. Columns 1-2, 2-3, 3-1 contain information on favorable dinucleotides and the origin of the bias in codon positions 1-2, 2-3, 3-1.


His, Tyr) and two by one codon (Met, Trp). Typically, half of the synonymous codons of each amino acid have G or C in the third codon position, the others A or T. It appeared that almost all of the codons with G or C in the third position have higher frequencies in aerobes compared to anaerobes, and in Bacteria compared to Archaea (see 'Codon Usage' in Supplementary File S3). This observation holds for all codons of the 'two-codon' amino acids and for the majority of codons of the three-, four- and six-codon amino acids. The noticeable exception is the significant suppression of the AGA/AGG codons of Arg (aerobes–anaerobes 177–125, P-values 4.46e−8, 5.8e−8; Bacteria–Archaea 174–2625, P-values 1.65e−7, 1.14e−11). In lysine, the codon AAA is suppressed in aerobes (difference 1716, P-value 6.44e−6), while the codon AAG is preferred. The AAG codon of lysine is favorable in aerobes (aerobes/anaerobes 143), and the other lysine codon, AAA, can be turned into AAG by only one mutation in the third codon position. Therefore, one can speculate that the demand for discriminating between Arg and Lys is the most probable cause of the suppression of the AGA and AGG codons of arginine in aerobes. Lysine is thus preferred over arginine in aerobic adaptation (Supplementary Files S3 and S6). The contribution of the amino acid in relation to thermophilic adaptation is not compromised, because of the similarity between the physical–chemical characteristics of lysine and arginine from the point of view of thermostability. Overall, an excess of the G3 codons is always more pronounced in aerobes compared to anaerobes, which corroborates the conclusion that the (G+C)3 load is an indication of the aerobic life style (55–58). The similar trend in the difference between Bacteria and Archaea is a result of the domination of the aerobic life style among the Bacteria in the analyzed dataset.
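The degeneracy classes listed above, and the statement that typically half of the synonymous codons end in G or C, can be checked directly against the standard genetic code. The sketch below builds the code table from its conventional T, C, A, G ordering; the variable names are illustrative.

```python
from collections import Counter, defaultdict

# Standard genetic code, bases ordered T, C, A, G at each codon position
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

codons_by_aa = defaultdict(list)
for i, a in enumerate(BASES):
    for j, b in enumerate(BASES):
        for k, c in enumerate(BASES):
            aa = AMINO[16 * i + 4 * j + k]
            if aa != "*":                     # skip the three stop codons
                codons_by_aa[aa].append(a + b + c)

# How many amino acids are encoded by 6, 4, 3, 2 and 1 codons
degeneracy_classes = Counter(len(cs) for cs in codons_by_aa.values())

# Share of each amino acid's codons that carry G or C in the third position
gc3_share = {aa: sum(c[2] in "GC" for c in cs) / len(cs)
             for aa, cs in codons_by_aa.items()}

print(sorted(degeneracy_classes.items(), reverse=True))
print({aa: round(share, 2) for aa, share in gc3_share.items() if share != 0.5})
```

The only amino acids whose G/C-ending share differs from one half are Ile (one of three codons) and the single-codon Met and Trp, which is why the text says 'typically'.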

Correlation between dinucleotide frequencies in different codon positions and OGT

In order to analyze dinucleotides in different codon positions (1-2, 2-3 and 3-1), we used the DCTNat, DCTNCB (for positions 1-2 and 2-3) and shuffled (DCTShfld, for position 3-1) sequences. We considered correlations of these contrasts with OGT. The most pronounced biases are found for dinucleotides in coding sequences of Archaea. The ApG dinucleotides show the strongest excess among dinucleotides in all positions of natural sequences compared to NCB and shuffled ones. In codon positions 1-2 and 2-3, this compositional peculiarity and its correlation with OGT are provided by the codon bias (Table 1). In position 3-1, the high correlation of ApG frequencies with OGT is determined by the amino acid sequences, and it vanishes after amino acid reshuffling (Table 1). There is also a correlation of the frequency of the complementary CpT dinucleotide in position 3-1 with OGT, which disappears after reshuffling of amino acid sequences (Table 1). The most plausible role of ApG dinucleotides is a contribution to stabilization via base-stacking interactions between the purine rings (22). This conclusion is supported by the fact that the excess of ApG dinucleotides is provided by the codon bias, thus manifesting adaptation on the DNA level. The complementary CpT dinucleotides indicate enrichment of the anti-sense strand of

double-stranded DNA (dsDNA) with the stabilizing ApG ones. The resulting mosaic of ApG stacking in sense and anti-sense strands can provide stabilization over long distances without compromising the flexibility of the dsDNA. Other overrepresented dinucleotides (regardless of the codon bias) correlated with OGT in positions 1-2 are GpC, CpC, TpA and CpT (Table 1). The frequencies of TpA and CpT dinucleotides are correlated with OGT in position 2-3, where the former is provided by the codon bias and the latter is not.
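Dinucleotides in codon positions 1-2 and 2-3 lie inside one codon, while position 3-1 spans the junction between neighboring codons and therefore depends on the order of amino acids. A minimal sketch of how such position-specific counts could be collected is given below; the function name and the toy sequence are assumptions made for illustration.

```python
from collections import Counter


def dinucleotides_by_position(cds: str) -> dict:
    """Count dinucleotides separately for codon positions 1-2, 2-3 and 3-1."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    counts = {"1-2": Counter(), "2-3": Counter(), "3-1": Counter()}
    for i in range(0, usable, 3):
        counts["1-2"][cds[i:i + 2]] += 1          # within the codon
        counts["2-3"][cds[i + 1:i + 3]] += 1      # within the codon
        if i + 3 < usable:                        # junction with the next codon
            counts["3-1"][cds[i + 2:i + 4]] += 1
    return counts


toy_cds = "ATGAGAGGAAAGTAA"  # made-up example: ATG AGA GGA AAG TAA
counts = dinucleotides_by_position(toy_cds)
print({position: c["AG"] for position, c in counts.items()})
```

Because the 3-1 counts change when codons are reordered, shuffling the amino acid sequence is a natural control for that position, whereas randomizing synonymous codons probes positions 1-2 and 2-3.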

In purine (R) and pyrimidine (Y) notation, codon bias provides grouping (stacking) of similar nucleotides in position 1-2, which increases with OGT (Table 1). At the same time, codon bias works against such grouping in position 2-3 (Table 1). The amino acid sequence is responsible for the grouping of similar nucleotides and correlations with OGT in position 3-1 (Table 1). The resulting generalized thermophilic pattern of dinucleotides in Archaeal cDNA sequences reads [RpR/YpY]1-2 [non-RpR/non-YpY]2-3 [RpR/YpY]3-1. Most of the compositional peculiarities found in Bacteria are not determined by the codon bias, yielding weak correlations with temperature (Table 1). The thermophilic pattern of dinucleotides in Bacteria (Table 4) is the same as the Archaeal one. The difference from Archaea is that selection for RpR and YpY dinucleotides in position 1-2 is not determined by the codon bias in Bacteria (Tables 1 and 4).
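The generalized pattern above can be read as a statement about the share of homodinucleotides (RpR or YpY) at each codon position. As a rough illustration, the sketch below converts a sequence to R/Y notation and computes that share per position; the helper names and the toy sequence, which was constructed to match the pattern exactly, are hypothetical.

```python
# Purines (A, G) map to R; pyrimidines (C, T, U) map to Y
RY_TABLE = str.maketrans("AGCTU", "RRYYY")


def to_ry(seq: str) -> str:
    return seq.upper().translate(RY_TABLE)


def homo_fraction_by_position(cds: str) -> dict:
    """Share of RpR/YpY dinucleotides at codon positions 1-2, 2-3 and 3-1."""
    ry = to_ry(cds)
    usable = len(ry) - len(ry) % 3
    result = {}
    for label, offset in (("1-2", 0), ("2-3", 1), ("3-1", 2)):
        pairs = [ry[i + offset:i + offset + 2]
                 for i in range(0, usable, 3)
                 if i + offset + 2 <= usable]      # drop the incomplete last junction
        result[label] = sum(p in ("RR", "YY") for p in pairs) / len(pairs)
    return result


toy_cds = "AGTCCAGGC"  # made-up example; in R/Y notation: RRY YYR RRY
print(to_ry(toy_cds), homo_fraction_by_position(toy_cds))
```

For real coding sequences the three fractions would of course lie between these extremes, and it is their correlation with OGT, not their absolute values, that defines the thermophilic pattern.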

Compositional and sequence biases observed in folding simulations of mRNA

Though RNA molecules have different overall structures, one may expect that there are some common mechanisms providing stability and function of the folded t-, r- and mRNAs. We have shown above that the (G+C) content apparently contributes to the stability of the t- and rRNA (36,59), and not to the stability of coding DNA (54,59,60). It is manifested in the correlation of the former with OGT (Table 2) and the anti-correlation of the base-pairing energy in the folded structures with OGT (Table 3). Nucleotide compositions in coding DNA (hence in mRNA as well) do not correlate with OGT. The mRNA case is of special interest, because the redundancy of the genetic code endows the nucleotide sequence with the potential to satisfy requirements for DNA, mRNA and protein stability. For example, it has been hypothesized (61) that the optimization of the base pairing in the mRNA contributes to the formation of stem fragments. It has also been claimed that there is a corresponding periodic pattern of the mRNA secondary structure in human and mouse, which is determined chiefly by the selection that operates on the third codon position (52). An opposing opinion suggests that the secondary structure in mRNA interferes with translation and therefore should be avoided (62). Overall, three scenarios of the relation between mRNA and proteins have been considered earlier (52): (i) the biases are determined by the demands on protein stability; (ii) mRNA stability is the major determinant of sequence biases; (iii) complete independence of the sequence biases related to mechanisms of stability on each level. We have computationally folded the mRNA sequences from 244 prokaryotic genomes and analyzed their characteristics

2886 Nucleic Acids Research 2014 Vol 42 No 5


(Table 5). All the parameters are reported as relative ones, where the signal in the native sequence is normalized by the value obtained for dicodon-shuffled ones (48). The dicodon shuffling randomizes the mRNA sequence while preserving the native protein sequence, native codon usage and native dinucleotide composition. Therefore, it allows one to separate peculiarities of mRNA caused by the demands on its stability and function from the others related to the coding DNA and proteins.
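The idea of a control that randomizes the mRNA while keeping the protein intact can be sketched as follows. This is a deliberately simplified stand-in and not the dicodon shuffling procedure of ref. (48): it permutes codons among positions encoding the same amino acid, so it preserves the protein sequence and codon usage but, unlike the dicodon shuffle, it does not preserve dinucleotide composition. All names are hypothetical, and the input is assumed to be an in-frame DNA sequence over A, C, G and T translated with the standard genetic code.

```python
import random
from collections import defaultdict

# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def split_codons(cds):
    """Split an in-frame sequence into codons, dropping a trailing partial codon."""
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]

def translate(cds):
    """Translate with the standard code ('*' marks stop codons)."""
    return "".join(CODON_TABLE[c] for c in split_codons(cds))

def synonymous_shuffle(cds, rng=None):
    """Permute codons among positions that encode the same amino acid."""
    rng = rng or random.Random()
    codons = split_codons(cds)
    slots = defaultdict(list)
    for idx, codon in enumerate(codons):
        slots[CODON_TABLE[codon]].append(idx)
    shuffled = list(codons)
    for indices in slots.values():
        pool = [codons[i] for i in indices]
        rng.shuffle(pool)
        for i, codon in zip(indices, pool):
            shuffled[i] = codon
    return "".join(shuffled)

native = "ATGAAAAAGGGTGGCCGTAGATAA"
control = synonymous_shuffle(native, random.Random(0))
print(translate(native) == translate(control))
```

A relative signal in the sense used above would then be the value measured on the native sequence divided by the mean value over many such control sequences.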

Structure of stems in folded mRNA is characterized by phases, which show the relative location of triplets in opposite strands of the stem. Specifically, Phases I, II and III correspond to positioning of triplets where, respectively, first, second and third nucleotides are complementary. The most and the least frequent pairs in all phases of the dsDNA (see 'Materials and Methods' section) are considered. Phases I and II are similar to each other, yielding CG, GC, GU, UG, AU and UA (Table 5). Overall, the Archaea and thermophilic Bacteria have similar trends in most and least frequent pairs. Mesophilic Bacteria stand out with A13U and G33C as the most frequent pairs in Phases II and III, respectively. The major contribution from the three pairing phases (52) to the total amount of stem pairs is provided by Phase III, and nucleotides in codon position 3 are most frequent in stem pairs (Table 5 and Supplementary Table S1). OGT correlations highlight the contribution from the GU wobble pair to the thermostability of archaeal mRNA structures (Table 5 and Supplementary Table S1). We found the highest OGT correlation for the Archaeal pairs U32G (Table 5), confirming the importance of the GU wobble pairs. The stronger effect for Archaeal rather than for Bacterial mRNA is possibly a relic of the ancient RNA world, where RNA was the carrier of genetic information and harsh environmental conditions demanded its increased stability.
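The phase classification of stem pairs can be sketched from a predicted secondary structure in dot-bracket notation. The sketch below rests on two assumptions that are not stated explicitly in the text: a label such as U32G is read as U in codon position 3 paired with G in codon position 2 (5' partner first), and the phase is derived from the sum of the two codon positions, because in an antiparallel stem a 1-1 pairing implies neighbouring 2-3 and 3-2 pairs (Phase I), a 2-2 pairing implies 1-3 and 3-1 pairs (Phase II), and a 3-3 pairing implies 1-2 and 2-1 pairs (Phase III). Function names are hypothetical, and pseudoknot-free structures are assumed.

```python
from collections import Counter

# (pos_i + pos_j) % 3 -> phase, see the assumptions stated in the lead-in.
PHASE_BY_SUM = {2: "I", 1: "II", 0: "III"}

def base_pairs(dot_bracket):
    """Return (i, j) index pairs for a pseudoknot-free dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return pairs

def stem_pairs_by_phase(seq, dot_bracket, frame_offset=0):
    """Count stem pairs keyed by (phase, label), e.g. ('I', 'U32G').

    frame_offset is the index of the first nucleotide of the first codon.
    """
    counts = Counter()
    for i, j in base_pairs(dot_bracket):
        pos_i = (i - frame_offset) % 3 + 1
        pos_j = (j - frame_offset) % 3 + 1
        phase = PHASE_BY_SUM[(pos_i + pos_j) % 3]
        counts[(phase, f"{seq[i]}{pos_i}{pos_j}{seq[j]}")] += 1
    return counts

# Toy hairpin: three stacked pairs closing a three-nucleotide loop.
print(stem_pairs_by_phase("GCAUUUUGC", "(((...)))"))
```

In a continuous stem all pairs fall into the same phase, which is why the phase can be treated as a property of the stem rather than of an individual pair.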

Table 6 shows that purine load (the contents of A+G, R/Y, ApG) is larger in the loop regions than in the stems. In Archaea this difference is slightly more pronounced than in Bacteria (Supplementary Table S2). The purine load is also in good positive correlation with OGT for the loops, but not for the stems. However, the amount of ApG dinucleotides is correlated with OGT in both loops and stems. For synonymous codons, the fractions of purine-rich codons (e.g. GGR versus GGY for glycine and AGR versus CGY for arginine) are correlated with OGT in both loop and stem regions (slightly stronger in loops). For non-synonymous codons, however, the amount of purine-rich codons (GAR for glutamate) is correlated with OGT in both loops and stems, while fractions of purine-rich AAR codons for lysine do not correlate with OGT (Table 6). While Forsdyke et al. (21) reported an increase of the purine content and its positive correlation with OGT in loops, we found that the fraction of purine-rich synonymous codons is correlated with OGT in stems as well. At the same time, the amount of non-synonymous purine-rich codons of lysine is not correlated with OGT in either loop or stem regions. Additionally, the amount of ApG dinucleotides is correlated with OGT in both loops and stems (Table 6).
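The loop versus stem comparison of purine load can be sketched as below, assuming that a folded mRNA is available as a sequence plus a dot-bracket string in which '.' marks unpaired (loop) positions and brackets mark paired (stem) positions. The function name is hypothetical, and only the A+G fraction is computed; the R/Y ratio and ApG content in Table 6 would need analogous counters.

```python
def purine_load(seq, dot_bracket):
    """Return (A+G fraction in loop positions, A+G fraction in stem positions)."""
    seq = seq.upper()
    loop = [b for b, s in zip(seq, dot_bracket) if s == "."]
    stem = [b for b, s in zip(seq, dot_bracket) if s in "()"]

    def purine_fraction(bases):
        # NaN signals that the region is empty rather than purine-free.
        if not bases:
            return float("nan")
        return sum(b in "AG" for b in bases) / len(bases)

    return purine_fraction(loop), purine_fraction(stem)

# Toy hairpin with a purine-only loop (AAG) and a half-purine stem.
loop_ag, stem_ag = purine_load("GCCAAGGGC", "(((...)))")
print(loop_ag, stem_ag)
```

Computing both fractions per genome and correlating each with OGT separately corresponds to the two 'OGT correlation' columns of Table 6.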

Signals of thermophilic adaptation in protein sequences

It has been shown earlier (22) that environmental temperature directly affects the amino acid composition of prokaryotic proteins, and that the IVYWREL combination can serve as a predictor of the OGT in prokaryotes. We use here a 'Z-score' predictor of OGT (43), which takes into account differences in variances of the proteomic frequencies of individual amino acids (Table 7). We have shown that the Z-score predictor properly corrects for these differences (43), better reflecting the contribution of the amino acid combinations to thermal adaptation. Separate predictors for Archaea and Bacteria (based on 33 and 99 proteomes with OGTs spread over 5–100 and 10–85°C, respectively) yield the same general trend in the increase of hydrophobic and charged residues at the

Table 5. Compositional and dicodon signals of mRNA adaptation purified by the dicodon shuffling

Characteristic                  Archaea                        Bacteria

The most and least (after the comma) frequent pairs in stems of mRNA
Mesophiles
  Phase I                       C23G, G23U                     G32U, U32G
  Phase II                      C31G, G13U                     A13U, G13U
  Phase III                     A33U, U33G                     G33C, U33G
Thermophiles
  Phase I                       G32C, G23U                     C23G, G23U
  Phase II                      C22G, G13U                     C31G, G13U
  Phase III                     A33U, U33G                     U33A, U33G

Correlations with OGT of
  Segment energy <Esg>          R = -0.71                      R = -0.39
  Energy per base pair <Ebp>    R = -0.73                      R = -0.54

The most significant correlations with OGT
                                Phase I U32G, R = 0.77         (U21G)Nat PIII, R = 0.46
                                Phase I G23U, R = 0.71         (G12U)Nat PIII, R = 0.45
                                Phase I all pairs, R = 0.7     (U21G)dShfld PIII, R = 0.45
                                                               (G12U)dShfld PIII, R = 0.41

Phases I, II and III correspond to positioning of triplets where, respectively, first, second and third nucleotides are complementary. All signals (except OGT correlations in Bacteria) are normalized by corresponding values for control sequences after the dicodon shuffling. See explanations of abbreviations in the 'Materials and Methods' section.


expense of polar ones (34), showing minor differences in the presence of individual residues.

We have shown earlier (22) that the amino acid bias working in thermal adaptation of proteins does not depend on the overall nucleotide composition of coding DNA sequences. However, one can still expect that nucleotide compositions of particular codon positions and/or dinucleotide compositions of positions 1-2 and 2-3 in codons may be somehow linked to amino acid biases. We indeed found a significant difference between nucleotide load in individual codon positions of Archaeal and Bacterial coding DNA sequences (Table 1). First, Archaeal protein-coding sequences yield a strong correlation between the (T+G)2 load in the second codon position and OGT (r = 0.72). The corresponding non-codon-biased sequences show a comparable correlation (r = 0.68), indicating that the (T+G) load reflects tuning of the amino acid composition in connection with thermal adaptation. The main contributor to this correlation is thymine (Table 1), which provides an increase of the strongly hydrophobic residues LMFIV (Supplementary File S1). The (T+G) load is much more weakly correlated with OGT in Bacterial sequences (r = 0.36 for natural and r = 0.42 for NCB sequences). The correlation of the thymine load with OGT is 0.37 for both natural (NAT) and non-codon-biased (NCB) sequences. This observation agrees with the presumed prevalence of the structure-based strategy of thermal adaptation in Archaeal proteins (29), contrary to the sequence-based one in Bacterial proteomes. Specifically, there is a clear correlation between amounts of charged residues and the OGT in the set of Bacterial proteomes (Table 7), presumably reflecting the domination of the sequence-based strategy in thermophilic adaptation of Bacteria (29).

Finally, we checked if there exists any specific connection between the dipeptide composition of proteins and that of 3-1 dinucleotides. We found (data not shown) that the strongest correlations between dinucleotides 3-1 and dipeptides exist mostly for the dinucleotides with guanine in position 3. The most frequent pair of amino acids contains methionine (Met, encoded exclusively by the AUG codon) as the first and any polar/charged amino acid as the second one. The preference of polar/charged amino acids to be next to Met can be explained by the typically surface location of the protein N-termini and its role as a signal for ubiquitination (63).
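The positional loads discussed in this section, such as the (T+G)2 load and its correlation with OGT, can be sketched as follows. The helper names and the toy numbers are hypothetical and serve only to show the arithmetic; they are not data from this work. The sketch assumes in-frame coding sequences and uses the plain Pearson correlation coefficient.

```python
from math import sqrt

def position_load(cds, position, bases):
    """Fraction of codons whose given position (1, 2 or 3) holds one of `bases`."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    column = cds[position - 1:usable:3]
    return sum(b in bases for b in column) / len(column)

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical toy genomes (one short CDS each) with made-up OGT values in °C.
toy_cds = ["ATGGCAAAA", "ATGGTTAAA", "ATGGTTCTG"]
toy_ogt = [30.0, 60.0, 90.0]
loads = [position_load(cds, 2, "TG") for cds in toy_cds]
print(loads, pearson_r(loads, toy_ogt))
```

Applying the same two functions to natural and to non-codon-biased control sequences would allow the kind of comparison made above between the NAT and NCB correlations.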

DISCUSSION

We will make a brief overview of the most pronounced biases and causal relationships in DNA, RNA and proteins. The major signal in nucleotide compositions, specifically the (G+C) load, is related to thermal stabilization of t- and rRNA (36,54,55,59,60) and to discrimination between the coding and non-coding sequences in Archaea (Table 2). Comparative analysis of nucleotide and amino acid compositions in relation to thermophilic adaptation prompts the conclusion that the (G+C) content does not contribute to thermostabilization of coding DNA (18,22,36,54,59), nor does it affect amino acid composition and its thermophilic trends (15,22). When

Table 6. Purine loading in loop and stem regions of folded mRNA and its OGT correlation

Feature            Loop                              Stem                              L-v-S P-value
                   Mean content   OGT correlation    Mean content   OGT correlation

A+G                0.560          0.59               0.500          -0.26              <2.2E-16
R/Y                1.299          0.61               1.002          -0.26              <2.2E-16
ApG                0.061          0.79               0.051          0.50               0.0002
GGR (glycine)      0.027          0.62               0.045          0.42               1.2E-17
GGY (glycine)      0.017          -0.05              0.079          -0.23              1.0E-42
AGR (arginine)     0.035          0.72               0.022          0.56               2.4E-10
CGR (arginine)     0.018          -0.22              0.040          -0.24              7.1E-09
CGY (arginine)     0.016          -0.25              0.037          -0.26              6.2E-10
GAR (glutamate)    0.060          0.71               0.036          0.59               <2.2E-16
AAR (lysine)       0.086          0.22               0.017          0.11               <2.2E-16
GAY (aspartate)    0.037          -0.17              0.040          -0.37              0.0002

Feature: analyzed nucleotide, dinucleotide or amino acid. Loop and Stem: information on mean content and OGT correlation of the above. L-v-S: a comparison between corresponding contents in the loop and stem regions by Wilcoxon tests; P-values are shown in this column. Correlation coefficients with OGT are shown (significance levels: P-value < 0.01; P-value < 0.0001).

Table 7. Signals of thermophilic adaptation in protein sequences of Archaea and Bacteria

Correlation with OGT                        Archaea                   Bacteria

Z-scored predictor                          ILVW Y DKR (R = 0.93)     IPV Y EKR (R = 0.89)
The most abundant residues in >70% (>60%)
  of all Z-scored predictors                VIWL Y ER                 VPI Y ER(K)
Individual amino acids                      + L W                     + E                       - T Q D
Types of amino acids                        + h; - p                  + h c; - p
Dipeptides                                  + hp ph; - cp pc          + cc; - pc

+, increase of the amino acid (or amino acid type) fraction with OGT; -, decrease of the amino acid (or amino acid type) fraction with OGT. Capital letters are names of amino acids; h, p, c are hydrophobic, polar, charged types of amino acids. Correlation coefficients between the Z-scored thermostability predictors and OGT are given in parentheses.


separate codon positions are considered for coding DNA, the only compositional bias observed for nucleotide compositions is an excess of (G+C) load in the third codon position in Bacteria. There is a significant excess of guanine and cytosine nucleotides (29.2 and 32.82%) compared to the NCB nucleotide contents (26.09 and 24.66%). A plausible explanation for this bias is the role of the (G+C)3 load in the adaptation to the aerobic life style, which dominates in Bacteria. The (G+C)3 load can be advantageous for several complementary reasons. Indeed, additional GC base pairing will contribute to the stability of the double helix of DNA and of stem regions of RNA molecules. Moreover, G3 bases can work as scavengers of oxidizing agents, providing protection for G bases in other codon positions (56–58). At the same time, this bias would not lead to any changes in amino acid composition, leaving protein structure and stability intact (55).
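The comparison of the third-position (G+C) load between native and control sequences can be sketched as follows. The function names are hypothetical; the control sequence stands in for a non-codon-biased (NCB) counterpart, whose construction is not reproduced here, and both sequences are assumed to be in frame.

```python
from collections import Counter

def positional_composition(cds):
    """Nucleotide fractions per codon position: {1: {'A': ...}, 2: {...}, 3: {...}}."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    table = {}
    for pos in (1, 2, 3):
        column = cds[pos - 1:usable:3]
        counts = Counter(column)
        table[pos] = {base: counts[base] / len(column) for base in "ACGT"}
    return table

def gc3_excess(native_cds, control_cds):
    """(G+C)3 of the native sequence minus (G+C)3 of a control sequence."""
    def gc3(cds):
        third = positional_composition(cds)[3]
        return third["G"] + third["C"]
    return gc3(native_cds) - gc3(control_cds)

# Toy example: same protein (Ala-Ala-Lys), G/C-ending versus A/T-ending codons.
print(gc3_excess("GCGGCCAAG", "GCAGCTAAA"))
```

A positive value indicates that the native sequence carries more G and C in the third codon position than the control, which is the direction of the bias reported above for Bacteria.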

In this work, we considered compositional and sequence biases in proteins in relation to those in corresponding nucleic acid sequences and to the phylogeny of species (Archaea or Bacteria). OGT correlations of nucleotides in different codon positions are similar in Archaea and Bacteria, but higher in the former (Table 1). The optimal codon from the point of view of thermophilic adaptation (i.e. most correlated with OGT) reads A1 [TG]2 non-[AG]3 in Archaea, but [weak A]1 [weak T]2 non-[AG]3 in Bacteria (Table 4). Codon bias supports the strong correlation between the amount of adenine and OGT in the first codon position in Archaea, and the same but weaker correlation in Bacteria. While we showed earlier (22) that the amino acid thermophilic trend is not determined by the overall nucleotide composition, consideration of separate codon positions reveals an interesting link between positional nucleotide frequencies and amino acid composition in relation to the domain of life. Specifically, the prevalence of thymine in the second codon position and its strong correlation with OGT in Archaea is apparently a result of the demand for enrichment of Archaeal proteins with hydrophobic residues (Table 7). Notably, a massive increase of van der Waals interactions (24,64) was found to be the cornerstone of the structure-based evolutionary strategy of protein thermostability in ancient species (29). In Bacterial proteins, however, there is an increase of the fractions of charged residues with OGT (Table 7). Thus, we observe a transition from the structure-based evolutionary strategy of protein thermostability in Archaea to the sequence-based one in Bacteria, with corresponding nucleotide and protein compositional biases (Table 7) underlying these strategies (29). The third codon position is characterized by selection against adenine and guanine in Archaea and against adenine in Bacteria, which is a result of this bias.

The most pronounced dinucleotide compositional biases are the excess of homo-dinucleotides and its correlation with OGT in ncDNA, tRNA and rRNA. It is presumably a relic of ancient primitive homo(poly)nucleotides from which life started. Overrepresentation of the complementary ApG/CpT dinucleotides points to their contribution to DNA/RNA stability via strong base-stacking interactions (65,66). The consideration of dinucleotide biases in purine (R) and pyrimidine (Y) notation also reveals an interesting picture. Overall, for both Archaea and Bacteria, the signature for thermophilic dinucleotides reads RpR/YpY, non-(RpR/YpY), RpR/YpY for codon positions 1-2, 2-3 and 3-1, respectively (Table 4). The major difference between Archaea and Bacteria in this case is that the dinucleotide bias in positions 1-2 is caused by demands on the nucleotide sequence level (and provided by the codon bias) in Archaea, while some other factors work in Bacteria. The preference for homodinucleotides in positions 1-2 and 3-1 can be a possible reason for avoiding homodinucleotides in positions 2-3, because heterodinucleotides can contribute to DNA flexibility tuning. Indeed, it has been shown that, given an average flexibility F of homodinucleotides (either RpR or YpY), the flexibilities of the heterodinucleotides YpR and RpY are 2F and 0.5F, respectively (67). The codon positions 2-3 are most versatile in terms of the relationship between the nucleotide and protein sequences. Therefore, the presence of heterodinucleotides in the codon positions 2-3 makes it possible to adjust flexibility of the nucleotide sequence without changing physical–chemical characteristics of the encoded amino acid residue. Additionally, we found that the link between dipeptide and corresponding 3-1 dinucleotide frequencies is determined by the amino acid dipeptide. The dominating bias reads as Met followed by a polar or charged residue.

On the RNA level, simple consideration of the nucleotide composition suggests discriminating the tRNA and rRNA molecules from the mRNA. Indeed, based on the correlation with OGT, we can conclude that (G+C) content is important for providing stability of tRNA and rRNA molecules, but not for mRNA. The role of the (G+C) content was further confirmed by folding simulations (Table 3). We also found other complementary signals in mRNA sequences related to its structure, stability and function. Archaea and Bacteria yield similar trends in the most and least frequent nucleotide pairs, and the third codon position makes a major contribution to the base-pairing mechanism of stem stabilization. Specifically, in addition to Watson–Crick pairing, the GU wobble pairs contribute substantially to the thermal stability of the Archaeal mRNA, in agreement with earlier observations (68,69). Overall, it results in moderate correlation of the segment energy and energy per base pair of the folded mRNA with OGT (Table 5). Folding simulations performed in this work also allowed us to analyze peculiarities of the nucleotide contents and structural contacts in stem and loop regions of folded mRNA. Purine load in loop regions is higher than in stems, and this effect is slightly stronger in Archaea than in Bacteria (Table 6). Purine load also correlates with OGT in loops, but not in stems. This correlation was observed earlier by Forsdyke et al. (21), and it was described as 'polite purine load of the loop regions' that prevents undesired mRNA–mRNA single-strand interactions. The authors concluded that increased purine load can affect the codon choice and, consequently, the amino acid composition (21). Our data [as well as careful analysis of the original data in (21)] do not support the latter claim. There is indeed a correlation between the fraction of


some purine-rich codons [GGR (Gly), AGR (Arg) and GAR (Glu)] and OGT (Table 6). However, the fraction of purine-rich non-synonymous codons AAR (Lys) is not correlated with OGT (Table 6). Moreover, these codons are synonymous and cannot directly affect the amino acid composition. It has been shown earlier (22,70) that the increase of Glu and Arg fractions is a trend in protein thermophilic adaptation. At the same time, purine load remains high after elimination of the codon bias, pointing to the amino acid composition as the most probable cause for the former (22). All the above prompts one to conclude that, in the mutual tuning of nucleotide and amino acid compositions, the purine load does not dominate and determine biases in amino acid composition, if not the opposite. The analysis of purine load shows that it correlates with OGT in both stems and loops (Table 6) in the case of the purine-rich codons GGR (Gly), AGR (Arg) and GAR (Glu). Further, the amount of ApG dinucleotides strongly correlates with OGT in both loops and stems (Table 6), and ApG provides a strong base stacking important for thermal stability of nucleic acid sequences (22,65). Purine load, therefore, is apparently a determinant of the base-stacking mechanism in mRNA and/or DNA thermal stability, as well as a result of the thermophilic amino acid composition trend.

Prokaryotes thrive under a temperature interval spanning over a hundred degrees, and they represent two major life styles: aerobic and anaerobic. Analysis of complete Archaeal and Bacterial genomes unraveled compositional and sequence signals related to molecular mechanisms of stability and adaptation, unaffected by selective sequencing or by the comparison of orthologs. Overall, codon bias works stronger in Archaea and is mostly utilized in thermophilic adaptation of nucleic acids. It apparently reflects the longer evolutionary history of Archaea, which presumably started close to the origin of life in hot conditions (29). The codon bias and amino acid sequences (dipeptide composition) work in accord to support enrichment of the nucleotide sequences with ApG/CpT dinucleotides, determinants of the base-stacking mechanism of the nucleic acid stability (65–67). We also found that the second codon position reveals a strong link between the nucleotide and amino acid compositions. Specifically, the excess of thymine in this position is a result of a demand for the enrichment of Archaeal proteins with hydrophobic amino acids. The third codon position in Bacteria is the only case where codon bias is detectable already on the level of pure composition. We found that the (G+C)3 load is related to the aerobic life style dominating in Bacteria. From the point of view of thermophilic adaptation, codon bias works against G in the third codon position in both Archaea and Bacteria. It thus supports a specific role of the third codon position in discriminating between adaptations to temperature and aerobic life style. Finally, we obtained an interesting and complex picture of the relationship between the nucleotide composition (purine load) and amino acid composition (selection between Arg and Lys) in relation to thermal adaptation, aerobicity and phylogeny. Specifically, purine load in both Archaea and Bacteria is a result of the 'from both ends of hydrophobicity scale' trend in thermal adaptation of proteins (34), reflected in the IVYWREL predictor of the OGT (22). According to this predictor, Arg is a preferred amino acid for thermal adaptation, though Lys is the next candidate, present in the top most correlated predictors as well (22,43). In particular, selection for Lys over Arg in some species was shown to be important for the entropic mechanism of protein thermostabilization (70). However, the overall preference for Arg over Lys in the case of thermal adaptation is well manifested in the excess of the purine-rich (AGR) codons of Arg at the expense of the purine-rich ones of Lys (AAR). At the same time, discrimination between Arg and Lys works in the opposite direction in aerobicity, where Arg codons AGA and AGG are suppressed in favor of the purine-rich AAA and AAG of Lys. It thus reflects a preference for Lys over Arg in aerobes compared to anaerobes in particular, and in Bacteria versus Archaea in general, preserving at the same time the high purine load necessary for providing base stacking in corresponding nucleic acids (22,53,65). An intricate connection between Lys/Arg and their codons in relation to thermal adaptation and aerobicity exemplifies how selection can work on nucleic acids and proteins simultaneously in response to demands of different environments. Obviously, the whole picture of molecular mechanisms of adaptation and relations between them is far from being complete. Consideration of other environmental factors such as salinity, pressure, etc. will help to unravel new mechanisms of stability and their sequence/structure determinants, and to understand tradeoffs that Nature embraced en route of the evolution and adaptation.
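The tradeoff between the purine-rich codons of Arg (AGR) and Lys (AAR) discussed above can be quantified with a minimal sketch that reports the fraction of all codons falling into each purine-rich group. The group labels follow the notation of Table 6; the function name and the toy sequence are hypothetical, and an in-frame coding sequence is assumed.

```python
# Purine-rich codon groups in the notation of Table 6 (R = A or G in position 3).
PURINE_RICH_CODONS = {
    "AGR (Arg)": {"AGA", "AGG"},
    "AAR (Lys)": {"AAA", "AAG"},
    "GAR (Glu)": {"GAA", "GAG"},
    "GGR (Gly)": {"GGA", "GGG"},
}

def codon_group_fractions(cds, groups=PURINE_RICH_CODONS):
    """Fraction of all codons of `cds` that belong to each named codon group."""
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    return {name: sum(c in members for c in codons) / len(codons)
            for name, members in groups.items()}

# Toy sequence with codons AGA, AAG, AAA, GGT.
fractions = codon_group_fractions("AGAAAGAAAGGT")
print(fractions["AGR (Arg)"], fractions["AAR (Lys)"])
```

Comparing the AGR and AAR fractions between thermophiles and mesophiles, and separately between aerobes and anaerobes, would show the two opposite directions of the Arg/Lys discrimination described above.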

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Funding for open access charge: Functional Genomics Program (FUGE II), Norwegian Research Council.

Conflict of interest statement. None declared.

REFERENCES

1. Crick,F. (1970) Central dogma of molecular biology. Nature, 227, 561–563.
2. Tawfik,D.S. (2010) Messy biology and the origins of evolutionary innovations. Nat. Chem. Biol., 6, 692–696.
3. Tokuriki,N. and Tawfik,D.S. (2009) Stability effects of mutations and protein evolvability. Curr. Opin. Struct. Biol., 19, 596–604.
4. Tokuriki,N. and Tawfik,D.S. (2009) Protein dynamism and evolvability. Science, 324, 203–207.
5. Koonin,E.V. (2012) Does the central dogma still stand? Biol. Direct, 7, 27.
6. Pe'er,I., Felder,C.E., Man,O., Silman,I., Sussman,J.L. and Beckmann,J.S. (2004) Proteomic signatures: amino acid and oligopeptide compositions differentiate among phyla. Proteins, 54, 20–40.
7. Aravind,L., Tatusov,R.L., Wolf,Y.I., Walker,D.R. and Koonin,E.V. (1998) Evidence for massive gene exchange between archaeal and bacterial hyperthermophiles. Trends Genet., 14, 442–444.
8. Khachane,A.N., Timmis,K.N. and dos Santos,V.A. (2005) Uracil content of 16S rRNA of thermophilic and psychrophilic prokaryotes correlates inversely with their optimal growth temperatures. Nucleic Acids Res., 33, 4016–4022.

9. Suhre,K. and Claverie,J.M. (2003) Genomic correlates of hyperthermostability, an update. J. Biol. Chem., 278, 17198–17202.
10. Tekaia,F. and Yeramian,E. (2006) Evolution of proteomes: fundamental signatures and global trends in amino acid compositions. BMC Genom., 7, 307.
11. Wang,H.C. and Hickey,D.A. (2002) Evidence for strong selective constraint acting on the nucleotide composition of 16S ribosomal RNA genes. Nucleic Acids Res., 30, 2501–2507.
12. Friedman,R., Drake,J.W. and Hughes,A.L. (2004) Genome-wide patterns of nucleotide substitution reveal stringent functional constraints on the protein sequences of thermophiles. Genetics, 167, 1507–1512.
13. Lynn,D.J., Singer,G.A. and Hickey,D.A. (2002) Synonymous codon usage is subject to selection in thermophilic bacteria. Nucleic Acids Res., 30, 4272–4277.
14. Singer,G.A. and Hickey,D.A. (2000) Nucleotide bias causes a genomewide bias in the amino acid composition of proteins. Mol. Biol. Evol., 17, 1581–1588.
15. Singer,G.A. and Hickey,D.A. (2003) Thermophilic prokaryotes have characteristic patterns of codon usage, amino acid composition and nucleotide content. Gene, 317, 39–47.
16. Tekaia,F., Yeramian,E. and Dujon,B. (2002) Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene, 297, 51–60.
17. Roy Chowdhury,A. and Dutta,C. (2012) A pursuit of lineage-specific and niche-specific proteome features in the world of archaea. BMC Genom., 13, 236.
18. Wang,H.C., Susko,E. and Roger,A.J. (2006) On the correlation between genomic G+C content and optimal growth temperature in prokaryotes: data quality and confounding factors. Biochem. Biophys. Res. Commun., 342, 681–684.
19. Wu,H., Zhang,Z., Hu,S. and Yu,J. (2012) On the molecular mechanism of GC content variation among eubacterial genomes. Biol. Direct, 7, 2.
20. Knight,R.D., Freeland,S.J. and Landweber,L.F. (2001) A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes. Genome Biol., 2, RESEARCH0010.
21. Lao,P.J. and Forsdyke,D.R. (2000) Thermophilic bacteria strictly obey Szybalski's transcription direction rule and politely purine-load RNAs with both adenine and guanine. Genome Res., 10, 228–236.
22. Zeldovich,K.B., Berezovsky,I.N. and Shakhnovich,E.I. (2007) Protein and DNA sequence determinants of thermophilic adaptation. PLoS Comput. Biol., 3, e5.
23. Kreil,D.P. and Ouzounis,C.A. (2001) Identification of thermophilic species by the amino acid compositions deduced from their genomes. Nucleic Acids Res., 29, 1608–1615.
24. Berezovsky,I.N., Tumanyan,V.G. and Esipova,N.G. (1997) Representation of amino acid sequences in terms of interaction energy in protein globules. FEBS Lett., 418, 43–46.
25. Cambillau,C. and Claverie,J.M. (2000) Structural and genomic correlates of hyperthermostability. J. Biol. Chem., 275, 32383–32386.
26. Greaves,R.B. and Warwicker,J. (2007) Mechanisms for stabilisation and the maintenance of solubility in proteins from thermophiles. BMC Struct. Biol., 7, 18.
27. Jaenicke,R. (1999) Stability and folding of domain proteins. Prog. Biophys. Mol. Biol., 71, 155–241.
28. Jaenicke,R. and Bohm,G. (1998) The stability of proteins in extreme environments. Curr. Opin. Struct. Biol., 8, 738–748.
29. Berezovsky,I.N. and Shakhnovich,E.I. (2005) Physics and evolution of thermophilic adaptation. Proc. Natl Acad. Sci. USA, 102, 12742–12747.
30. Chakravarty,S. and Varadarajan,R. (2002) Elucidation of factors responsible for enhanced thermal stability of proteins: a structural genomics based study. Biochemistry, 41, 8152–8161.
31. Glyakina,A.V., Garbuzynskiy,S.O., Lobanov,M.Y. and Galzitskaya,O.V. (2007) Different packing of external residues can explain differences in the thermostability of proteins from thermophilic and mesophilic organisms. Bioinformatics, 23, 2231–2238.
32. Thompson,M.J. and Eisenberg,D. (1999) Transproteomic evidence of a loop-deletion mechanism for enhancing protein thermostability. J. Mol. Biol., 290, 595–604.
33. Tokuriki,N., Oldfield,C.J., Uversky,V.N., Berezovsky,I.N. and Tawfik,D.S. (2009) Do viral proteins possess unique biophysical features? Trends Biochem. Sci., 34, 53–59.
34. Berezovsky,I.N., Zeldovich,K.B. and Shakhnovich,E.I. (2007) Positive and negative design in stability and thermal adaptation of natural proteins. PLoS Comput. Biol., 3, e52.
35. Bharanidharan,D., Bhargavi,G.R., Uthanumallian,K. and Gautham,N. (2004) Correlations between nucleotide frequencies and amino acid composition in 115 bacterial species. Biochem. Biophys. Res. Commun., 315, 1097–1103.
36. Nakashima,H., Fukuchi,S. and Nishikawa,K. (2003) Compositional changes in RNA, DNA and proteins for bacterial adaptation to higher and lower temperatures. J. Biochem. (Tokyo), 133, 507–513.
37. Dehouck,Y., Folch,B. and Rooman,M. (2008) Revisiting the correlation between proteins' thermoresistance and organisms' thermophilicity. Protein Eng. Des. Sel., 21, 275–278.
38. Folch,B., Dehouck,Y. and Rooman,M. (2010) Thermo- and mesostabilizing protein interactions identified by temperature-dependent statistical potentials. Biophys. J., 98, 667–677.
39. Folch,B., Rooman,M. and Dehouck,Y. (2008) Thermostability of salt bridges versus hydrophobic interactions in proteins probed by statistical potentials. J. Chem. Inf. Model., 48, 119–127.
40. Gonnelli,G., Rooman,M. and Dehouck,Y. (2012) Structure-based mutant stability predictions on proteins of unknown structure. J. Biotechnol., 161, 287–293.
41. Ponnuswamy,P., Muthusamy,R. and Manavalan,P. (1982) Amino acid composition and thermal stability of globular proteins. Int. J. Biol. Macromol., 4, 186–190.
42. Berezovsky,I.N. (2011) The diversity of physical forces and mechanisms in intermolecular interactions. Phys. Biol., 8, 035002.
43. Ma,B.G., Goncearenco,A. and Berezovsky,I.N. (2010) Thermophilic adaptation of protein complexes inferred from proteomic homology modeling. Structure, 18, 819–828.
44. Makarova,K.S. and Koonin,E.V. (2005) Evolutionary and functional genomics of the Archaea. Curr. Opin. Microbiol., 8, 586–594.
45. Novichkov,P.S., Wolf,Y.I., Dubchak,I. and Koonin,E.V. (2009) Trends in prokaryotic evolution revealed by comparison of closely related bacterial and archaeal genomes. J. Bacteriol., 191, 65–73.
46. Koonin,E.V. and Wolf,Y.I. (2008) Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res., 36, 6688–6719.
47. Koonin,E.V., Mushegian,A.R., Galperin,M.Y. and Walker,D.R. (1997) Comparison of archaeal and bacterial genomes: computer analysis of protein sequences predicts novel functions and suggests a chimeric origin for the archaea. Mol. Microbiol., 25, 619–637.
48. Katz,L. and Burge,C.B. (2003) Widespread selection for local RNA secondary structure in coding regions of bacterial genes. Genome Res., 13, 2042–2051.
49. Hofacker,I.L., Priwitzer,B. and Stadler,P.F. (2004) Prediction of locally stable RNA secondary structures for genome-wide surveys. Bioinformatics, 20, 186–190.
50. Zuker,M. and Stiegler,P. (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res., 9, 133–148.
51. Mathews,D.H., Sabina,J., Zuker,M. and Turner,D.H. (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol., 288, 911–940.
52. Shabalina,S.A., Ogurtsov,A.Y. and Spiridonov,N.A. (2006) A periodic pattern of mRNA secondary structure created by the genetic code. Nucleic Acids Res., 34, 2428–2437.
53. Marmur,J. and Doty,P. (1962) Determination of the base composition of deoxyribonucleic acid from its thermal denaturation temperature. J. Mol. Biol., 5, 109–118.


54. Hurst,L.D. and Merchant,A.R. (2001) High guanine-cytosine content is not an adaptation to high temperature: a comparative analysis amongst prokaryotes. Proc. Biol. Sci., 268, 493–497.
55. Naya,H., Romero,H., Zavala,A., Alvarez,B. and Musto,H. (2002) Aerobiosis increases the genomic guanine plus cytosine content (GC%) in prokaryotes. J. Mol. Evol., 55, 260–264.
56. Beckman,K.B. and Ames,B.N. (1997) Oxidative decay of DNA. J. Biol. Chem., 272, 19633–19636.
57. Cheng,K.C., Cahill,D.S., Kasai,H., Nishimura,S. and Loeb,L.A. (1992) 8-Hydroxyguanine, an abundant form of oxidative DNA damage, causes G→T and A→C substitutions. J. Biol. Chem., 267, 166–172.
58. Vieira-Silva,S. and Rocha,E.P. (2008) An assessment of the impacts of molecular oxygen on the evolution of proteomes. Mol. Biol. Evol., 25, 1931–1942.
59. Galtier,N. and Lobry,J.R. (1997) Relationships between genomic G+C content, RNA secondary structures, and optimal growth temperature in prokaryotes. J. Mol. Evol., 44, 632–636.
60. Musto,H., Naya,H., Zavala,A., Romero,H., Alvarez-Valin,F. and Bernardi,G. (2004) Correlations between genomic GC levels and optimal growth temperatures in prokaryotes. FEBS Lett., 573, 73–77.
61. Fitch,W.M. (1974) The large extent of putative secondary nucleic acid structure in random nucleotide sequences or amino acid derived messenger-RNA. J. Mol. Evol., 3, 279–291.
62. Workman,C. and Krogh,A. (1999) No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. Nucleic Acids Res., 27, 4816–4822.
63. Bachmair,A., Finley,D. and Varshavsky,A. (1986) In vivo half-life of a protein is a function of its amino-terminal residue. Science, 234, 179–186.
64. Berezovsky,I.N., Namiot,V.A., Tumanyan,V.G. and Esipova,N.G. (1999) Hierarchy of the interaction energy distribution in the spatial structure of globular proteins and the problem of domain definition. J. Biomol. Struct. Dyn., 17, 133–155.
65. Yakovchuk,P., Protozanova,E. and Frank-Kamenetskii,M.D. (2006) Base-stacking and base-pairing contributions into thermal stability of the DNA double helix. Nucleic Acids Res., 34, 564–574.
66. Friedman,R.A. and Honig,B. (1995) A free energy analysis of nucleic acid base stacking in aqueous solution. Biophys. J., 69, 1528–1535.
67. Okonogi,T.M., Alley,S.C., Reese,A.W., Hopkins,P.B. and Robinson,B.H. (2002) Sequence-dependent dynamics of duplex DNA: the applicability of a dinucleotide model. Biophys. J., 83, 3446–3459.
68. Masquida,B. and Westhof,E. (2000) On the wobble GoU and related pairs. RNA, 6, 9–15.
69. Varani,G. and McClain,W.H. (2000) The G x U wobble base pair. A fundamental building block of RNA structure crucial to RNA function in diverse biological systems. EMBO Rep., 1, 18–23.
70. Berezovsky,I.N., Chen,W.W., Choi,P.J. and Shakhnovich,E.I. (2005) Entropic stabilization of proteins and its proteomic consequences. PLoS Comput. Biol., 1, e47.


His, Tyr) and two by one codon (Met, Trp). Typically, half of the synonymous codons of each amino acid have G or C in the third codon position, the others A or T. Almost all of the codons with G or C in the third position have higher frequencies in aerobes than in anaerobes, and in Bacteria than in Archaea (see 'Codon Usage' in Supplementary File S3). This observation holds for all codons of the 'two-codon' amino acids and for the majority of codons of the three-, four- and six-codon amino acids. The noticeable exception is a significant suppression of the AGA/AGG codons of Arg (aerobes–anaerobes 177–125, P-values 4.46e–8, 5.8e–8; Bacteria–Archaea 174–2625, P-values 1.65e–7, 1.14e–11). In lysine, the codon AAA is suppressed in aerobes (difference 1716, P-value 6.44e–6), while the codon AAG is preferred. The AAG codon of lysine is favorable in aerobes (aerobes/anaerobes 143), and the other lysine codon, AAA, can be turned into AAG by a single mutation in the third codon position. Therefore, one can speculate that the demand for discriminating between Arg and Lys is the most probable cause of the suppression of the AGA and AGG codons of arginine in aerobes. Lysine is thus preferred over arginine in aerobic adaptation (Supplementary Files S3 and S6). The contribution of the amino acid to thermophilic adaptation is not compromised, because lysine and arginine have similar physical–chemical characteristics from the point of view of thermostability. Overall, an excess of the G3 codons is always more pronounced in aerobes than in anaerobes, which corroborates the conclusion that the (G+C)3 load is an indication of the aerobic life style (55–58). The similar trend in the difference between Bacteria and Archaea is a result of the domination of the aerobic life style among the Bacteria in the analyzed dataset.
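The third-position statistics discussed above can be illustrated with a minimal Python sketch. It assumes an in-frame coding sequence given as a plain string; the function names (`split_codons`, `gc3_fraction`, `synonymous_share`) and the toy sequence are hypothetical and are not taken from the analysis pipeline of this work.

```python
def split_codons(cds: str) -> list[str]:
    """Split an in-frame coding sequence into complete codons (trailing bases are dropped)."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def gc3_fraction(cds: str) -> float:
    """Fraction of codons carrying G or C in the third position, i.e. the (G+C)3 load."""
    codons = split_codons(cds)
    if not codons:
        raise ValueError("sequence contains no complete codon")
    return sum(codon[2] in "GC" for codon in codons) / len(codons)


def synonymous_share(cds: str, codon: str, family: tuple[str, ...]) -> float:
    """Share of one codon among all occurrences of its synonymous family."""
    in_family = [c for c in split_codons(cds) if c in family]
    if not in_family:
        raise ValueError("no codon of this family in the sequence")
    return in_family.count(codon) / len(in_family)


# Toy example (hypothetical sequence): codons ATG AAA AAG CGC
toy = "ATGAAAAAGCGC"
print(gc3_fraction(toy))                             # 0.75
print(synonymous_share(toy, "AAG", ("AAA", "AAG")))  # 0.5
```

Applied per genome, such fractions could then be compared between aerobes and anaerobes; the statistical tests behind the P-values quoted above are not reproduced here.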

Correlation between dinucleotide frequencies in different codon positions and OGT

In order to analyze dinucleotides in different codon positions (1-2, 2-3 and 3-1), we used the DCTNat, DCTNCB (for positions 1-2 and 2-3) and shuffled (DCTShfld, for positions 3 and 1) sequences. We considered correlations of these contrasts with OGT. The most pronounced biases are found for dinucleotides in coding sequences of Archaea. The ApG dinucleotides show the strongest excess in all positions of natural sequences compared to NCB and shuffled ones. In codon positions 1-2 and 2-3, this compositional peculiarity and its correlation with OGT are provided by the codon bias (Table 1). In position 3-1, the high correlation of ApG frequencies with OGT is determined by the amino acid sequences, and it vanishes after amino acid reshuffling (Table 1). There is also a correlation of the frequency of the complementary CpT dinucleotide in position 3-1 with OGT, which disappears after reshuffling of amino acid sequences (Table 1). The most plausible role of ApG dinucleotides is a contribution to stabilization via base-stacking interactions between the purine rings (22). This conclusion is supported by the fact that the excess of ApG dinucleotides is provided by the codon bias, thus manifesting adaptation on the DNA level. The complementary CpT dinucleotides indicate enrichment of the anti-sense strand of double-stranded DNA (dsDNA) with the stabilizing ApG ones. The resulting mosaic of ApG stacking in sense and anti-sense strands can provide stabilization over long distances without compromising the flexibility of the dsDNA. Other overrepresented dinucleotides (regardless of the codon bias) correlated with OGT in positions 1-2 are GpC, CpC, TpA and CpT (Table 1). The frequencies of TpA and CpT dinucleotides are correlated with OGT in position 2-3, where the former is provided by the codon bias and the latter is not.
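A minimal sketch of how dinucleotide frequencies could be counted separately for codon positions 1-2, 2-3 and 3-1, and then correlated with OGT, is given below. It is an illustration under stated assumptions: the helper names, the `OFFSETS` mapping and the toy sequence are hypothetical, and the contrasts against NCB and shuffled control sequences used in this work are not reproduced.

```python
from collections import Counter
from math import sqrt

# Index within the codon at which each dinucleotide starts (hypothetical naming).
OFFSETS = {"1-2": 0, "2-3": 1, "3-1": 2}


def dinucleotide_freqs(cds: str, position: str) -> dict[str, float]:
    """Relative dinucleotide frequencies at one codon position pair.

    Position "3-1" spans the third base of a codon and the first base of the next one.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    start = OFFSETS[position]
    pairs = [cds[i:i + 2] for i in range(start, usable - 1, 3)]
    counts = Counter(pairs)
    total = sum(counts.values())
    return {dinuc: n / total for dinuc, n in counts.items()}


def pearson(x: list[float], y: list[float]) -> float:
    """Plain Pearson correlation coefficient, e.g. dinucleotide frequency versus OGT."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Toy example (hypothetical sequence): codons ATG AAG GCT
toy = "ATGAAGGCT"
for position in OFFSETS:
    print(position, dinucleotide_freqs(toy, position))
```

In a real analysis, one frequency per genome (for instance ApG in position 3-1) would be paired with the OGT of the organism and passed to `pearson`.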

In purine (R) and pyrimidine (Y) notation, codon bias provides grouping (stacking) of similar nucleotides in position 1-2, which increases with OGT (Table 1). At the same time, codon bias works against such grouping in position 2-3 (Table 1). The amino acid sequence is responsible for the grouping of similar nucleotides and for correlations with OGT in position 3-1 (Table 1). The resulting generalized thermophilic pattern of dinucleotides in Archaeal cDNA sequences reads [RpR, YpY]1-2 [non-RpR, non-YpY]2-3 [RpR/YpY]3-1. Most of the compositional peculiarities found in Bacteria are not determined by the codon bias and yield weak correlations with temperature (Table 1). The thermophilic pattern of dinucleotides in Bacteria (Table 4) is the same as the Archaeal one. The difference from Archaea is that selection for RpR and YpY dinucleotides in position 1-2 is not determined by the codon bias in Bacteria (Tables 1 and 4).
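The purine/pyrimidine pattern above can be made concrete with a short sketch that reduces each dinucleotide to its R/Y class and measures the share of RpR and YpY dinucleotides at a given codon position pair. The function names and the toy sequence are hypothetical; this is an illustration of the notation, not the procedure used to obtain Tables 1 and 4.

```python
PURINES = "AG"
PYRIMIDINES = "CT"


def ry_class(dinuc: str) -> str:
    """Map a dinucleotide to purine (R) / pyrimidine (Y) notation, e.g. 'AG' -> 'RR'."""
    classes = []
    for base in dinuc.upper():
        if base in PURINES:
            classes.append("R")
        elif base in PYRIMIDINES:
            classes.append("Y")
        else:
            raise ValueError(f"unexpected base: {base!r}")
    return "".join(classes)


def homo_ry_fraction(cds: str, start: int) -> float:
    """Share of RpR and YpY dinucleotides; start is 0 (positions 1-2), 1 (2-3) or 2 (3-1)."""
    usable = len(cds) - len(cds) % 3
    classes = [ry_class(cds[i:i + 2]) for i in range(start, usable - 1, 3)]
    if not classes:
        raise ValueError("sequence too short for this codon position pair")
    return sum(c in ("RR", "YY") for c in classes) / len(classes)


# Toy example (hypothetical sequence): codons ATG AAG GCT
toy = "ATGAAGGCT"
print(homo_ry_fraction(toy, 0))  # positions 1-2: AT, AA, GC
print(homo_ry_fraction(toy, 1))  # positions 2-3: TG, AG, CT
print(homo_ry_fraction(toy, 2))  # positions 3-1: GA, GG -> 1.0
```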

Compositional and sequence biases observed in folding simulations of mRNA

Though RNA molecules have different overall structures, one may expect that there are some common mechanisms providing stability and function of the folded t-, r- and mRNAs. We have shown above that the (G+C) content apparently contributes to the stability of the t- and rRNA (36,59) and not to the stability of coding DNA (54,59,60). This is manifested in the correlation of the former with OGT (Table 2) and the anti-correlation of the base-pairing energy in the folded structures with OGT (Table 3). Nucleotide compositions in coding DNA (hence in mRNA as well) do not correlate with OGT. The mRNA case is of special interest, because redundancy of the genetic code endows the nucleotide sequence with a potential to satisfy requirements for DNA, mRNA and protein stability. For example, it has been hypothesized (61) that the optimization of the base-pairing in the mRNA contributes to formation of stem fragments. Authors claimed that there is a corresponding periodic pattern of the mRNA secondary structure in human and mouse, which is determined chiefly by the selection that operates on the third codon position (52). An opposing opinion suggests that secondary structure in mRNA interferes with translation and therefore should be avoided (62). Overall, three scenarios of the relation between mRNA and proteins have been considered earlier (52): (i) the biases are determined by the demands on protein stability; (ii) mRNA stability is the major determinant of sequence biases; (iii) the sequence biases related to mechanisms of stability on each level are completely independent. We have computationally folded the mRNA sequences from 244 prokaryotic genomes and analyzed their characteristics (Table 5). All the parameters are reported as relative ones, where the signal in the native sequence is normalized by the value obtained for dicodon-shuffled ones (48). The dicodon shuffling randomizes the mRNA sequence while preserving the native protein sequence, native codon usage and native dinucleotide composition. Therefore, it allows one to separate peculiarities of mRNA caused by the demands on its stability and function from the others related to the coding DNA and proteins.
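The normalization of a native signal by shuffled controls can be sketched as follows. Note the swapped-in technique: the control used here is a simple permutation of synonymous codons, which preserves the protein sequence and codon usage but, unlike the dicodon shuffling of (48), does NOT preserve dinucleotide composition. All names (`synonymous_shuffle`, `relative_signal`, the `measure` callback) are hypothetical.

```python
import random
from collections import defaultdict

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def split_codons(cds):
    """Split an in-frame sequence into complete codons (DNA alphabet, upper case)."""
    cds = cds.upper().replace("U", "T")
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def translate(cds):
    """Translate with the standard code; codons with non-ACGT symbols raise KeyError."""
    return "".join(CODON_TABLE[codon] for codon in split_codons(cds))


def synonymous_shuffle(cds, rng):
    """Simplified control: permute codons among positions encoding the same residue."""
    codons = split_codons(cds)
    slots = defaultdict(list)
    for index, codon in enumerate(codons):
        slots[CODON_TABLE[codon]].append(index)
    shuffled = list(codons)
    for indices in slots.values():
        pool = [codons[i] for i in indices]
        rng.shuffle(pool)
        for i, codon in zip(indices, pool):
            shuffled[i] = codon
    return "".join(shuffled)


def relative_signal(cds, measure, n_controls=100, seed=0):
    """Native value of `measure` divided by its mean over shuffled controls.

    Assumes the mean control value is non-zero.
    """
    rng = random.Random(seed)
    controls = [measure(synonymous_shuffle(cds, rng)) for _ in range(n_controls)]
    return measure(cds) / (sum(controls) / n_controls)
```

Here `measure` stands for any per-sequence quantity, for example a folding energy returned by an external folding program; a faithful reproduction would replace `synonymous_shuffle` with a dicodon shuffle.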

The structure of stems in folded mRNA is characterized by phases, which show the relative location of triplets in opposite strands of the stem. Specifically, Phases I, II and III correspond to positioning of triplets where, respectively, the first, second and third nucleotides are complementary. The most and the least frequent pairs in all phases of the dsDNA (see Materials and methods section) are considered. Phases I and II are similar to each other, yielding CG, GC, GU, UG, AU and UA (Table 5). Overall, the Archaea and thermophilic Bacteria have similar trends in the most and least frequent pairs. Mesophilic Bacteria stand out with A13U and G33C as the most frequent pairs in Phases II and III, respectively. The major contribution from the three pairing phases (52) to the total amount of stem pairs is provided by Phase III, and nucleotides in codon position 3 are the most frequent in stem pairs (Table 5 and Supplementary Table S1). OGT correlations highlight the contribution of the GU wobble pair to the thermostability of archaeal mRNA structures (Table 5 and Supplementary Table S1). We found the highest OGT correlation for the Archaeal pairs U32G (Table 5), confirming the importance of the GU wobble pairs. The stronger effect for Archaeal than for Bacterial mRNA is possibly a relic of the ancient RNA world, where RNA was the carrier of genetic information and harsh environmental conditions demanded its increased stability.
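To illustrate how stem pairs can be annotated with the codon positions of the paired nucleotides, the following sketch parses a secondary structure in dot-bracket notation and labels each pair. It is a hypothetical illustration: the dot-bracket input, the label format (for example `G3-U1`), the `frame_start` parameter and the toy hairpin are assumptions and do not reproduce the phase assignment described in the Materials and methods section.

```python
from collections import Counter


def base_pairs(structure):
    """Return (i, j) index pairs from a dot-bracket string using a stack of open brackets."""
    stack, pairs = [], []
    for index, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(index)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), index))
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def pair_position_counts(seq, structure, frame_start=0):
    """Count base pairs labelled by nucleotide and codon position (1, 2 or 3) of each partner."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    counts = Counter()
    for i, j in base_pairs(structure):
        pos_i = (i - frame_start) % 3 + 1
        pos_j = (j - frame_start) % 3 + 1
        counts[f"{seq[i]}{pos_i}-{seq[j]}{pos_j}"] += 1
    return counts


# Toy hairpin (hypothetical): three stem pairs closing a three-nucleotide loop
print(pair_position_counts("GGGAAAUCC", "(((...)))"))
```

In the toy hairpin, the innermost pair is a G-U wobble pair between codon positions 3 and 1, while the two outer pairs are G-C pairs.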

Table 6 shows that purine load (the contents of A+G, R/Y, ApG) is larger in the loop regions than in the stems. In Archaea, this difference is slightly more pronounced than in Bacteria (Supplementary Table S2). The purine load is also in good positive correlation with OGT for the loops, but not for the stems. However, the amount of ApG dinucleotides is correlated with OGT in both loops and stems. For synonymous codons, the fractions of purine-rich codons (e.g. GGR versus GGY for glycine and AGR versus CGY for arginine) are correlated with OGT in both loop and stem regions (slightly stronger in loops). For non-synonymous codons, however, the amount of purine-rich codons (GAR for glutamate) is correlated with OGT in both loops and stems, while fractions of purine-rich AAR codons for lysine do not correlate with OGT (Table 6). While Forsdyke et al. (21) reported an increase of the purine content and its positive correlation with OGT in loops, we found that the fraction of purine-rich synonymous codons is correlated with OGT in stems as well. At the same time, the amount of non-synonymous purine-rich codons of lysine is not correlated with OGT in either loop or stem regions. Additionally, the amount of ApG dinucleotides is correlated with OGT in both loops and stems (Table 6).
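The loop-versus-stem comparison of purine load can be sketched for a single folded sequence as below. The dot-bracket input, the function name `purine_load` and the toy hairpin are assumptions made for illustration; the Wilcoxon tests and OGT correlations of Table 6 are not reproduced.

```python
def purine_load(seq, structure):
    """A+G fraction of unpaired (loop, '.') and paired (stem, '(' or ')') nucleotides."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    loop = [base for base, s in zip(seq.upper(), structure) if s == "."]
    stem = [base for base, s in zip(seq.upper(), structure) if s in "()"]

    def fraction(bases):
        # NaN marks a region that is absent from the structure.
        return sum(b in "AG" for b in bases) / len(bases) if bases else float("nan")

    return {"loop": fraction(loop), "stem": fraction(stem)}


# Toy hairpin (hypothetical): purine-only loop, mixed stem
print(purine_load("GGGAAAUCC", "(((...)))"))  # {'loop': 1.0, 'stem': 0.5}
```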

Table 5. Compositional and dicodon signals of mRNA adaptation purified by the dicodon shuffling

Characteristic                    Archaea                       Bacteria

The most and least (after the comma) frequent pairs in stems of mRNA
  Mesophiles                      Phase I:   C23G, G23U         Phase I:   G32U, U32G
                                  Phase II:  C31G, G13U         Phase II:  A13U, G13U
                                  Phase III: A33U, U33G         Phase III: G33C, U33G
  Thermophiles                    Phase I:   G32C, G23U         Phase I:   C23G, G23U
                                  Phase II:  C22G, G13U         Phase II:  C31G, G13U
                                  Phase III: A33U, U33G         Phase III: U33A, U33G

Correlations with OGT of
  Segment energy <Esg>            R = –0.71                     R = –0.39
  Energy per base pair <Ebp>      R = –0.73                     R = –0.54

The most significant correlations with OGT
                                  Phase I U32G: R = 0.77        (U21G)Nat PIII: R = 0.46
                                  Phase I G23U: R = 0.71        (G12U)Nat PIII: R = 0.45
                                  Phase I All pairs: R = 0.7    (U21G)dShfld PIII: R = 0.45
                                                                (G12U)dShfld PIII: R = 0.41

Phases I, II and III correspond to positioning of triplets where, respectively, the first, second and third nucleotides are complementary. All signals (except OGT correlations in Bacteria) are normalized by corresponding values for control sequences after the dicodon shuffling. See explanations of abbreviations in the Materials and methods section.

Signals of thermophilic adaptation in protein sequences

It has been shown earlier (22) that environmental temperature directly affects the amino acid composition of prokaryotic proteins, and that the IVYWREL combination can serve as a predictor of the OGT in prokaryotes. We use here a 'Z-score' predictor of OGT (43), which takes into account differences in variances of the proteomic frequencies of individual amino acids (Table 7). We have shown that the Z-score predictor properly corrects for these differences (43), better reflecting the contribution of the amino acid combinations to thermal adaptation. Separate predictors for Archaea and Bacteria (based on 33 and 99 proteomes with OGTs spread over 5–100 and 10–85°C, respectively) yield the same general trend in the increase of hydrophobic and charged residues at the expense of polar ones (34), showing minor differences in the presence of individual residues.

We have shown earlier (22) that the amino acid bias working in thermal adaptation of proteins does not depend on the overall nucleotide composition of coding DNA sequences. However, one can still expect that nucleotide compositions of particular codon positions and/or dinucleotide compositions of positions 1-2 and 2-3 in codons may be somehow linked to amino acid biases. We indeed found a significant difference between nucleotide loads in individual codon positions of Archaeal and Bacterial coding DNA sequences (Table 1). First, Archaeal protein coding sequences yield a strong correlation between the (T+G)2 load in the second codon position and OGT (r = 0.72). The corresponding non-codon-biased sequences show a comparable correlation (r = 0.68), indicating that the (T+G) load reflects tuning of the amino acid composition in connection with thermal adaptation. The main contributor to this correlation is thymine (Table 1), which provides an increase of the strongly hydrophobic residues LMFIV (Supplementary File S1). The (T+G) load is much more weakly correlated with OGT in Bacterial sequences (r = 0.36 for natural and r = 0.42 for NCB sequences). The correlation of the thymine load with OGT is 0.37 for both natural (NAT) and non-codon-biased (NCB) sequences. This observation agrees with the presumed prevalence of the structure-based strategy of thermal adaptation in Archaeal proteins (29), in contrast to the sequence-based one in Bacterial proteomes. Specifically, there is a clear correlation between the amounts of charged residues and the OGT in the set of Bacterial proteomes (Table 7), presumably reflecting the domination of the sequence-based strategy in thermophilic adaptation of Bacteria (29).

Finally, we checked whether there exists any specific connection between the dipeptide composition of proteins and that of 3-1 dinucleotides. We found (data not shown) that the strongest correlations between 3-1 dinucleotides and dipeptides exist mostly for the dinucleotides with guanine in position 3. The most frequent pair of amino acids contains methionine (Met, encoded exclusively by the AUG codon) as the first and any polar/charged amino acid as the second one. The preference for a polar/charged amino acid next to Met can be explained by the typical surface location of the protein N-terminus and its role as a signal for ubiquitination (63).
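The amino acid predictors mentioned in this section can be illustrated with a short sketch. `aa_fraction` computes the combined fraction of a residue set such as IVYWREL. `zscored_sum` is only one plausible reading of a variance-corrected ('Z-score') combination, in which each residue frequency is standardized across proteomes before summation; it is an assumption made for illustration and not the procedure of (43). All names and the toy proteomes are hypothetical.

```python
from statistics import mean, pstdev


def aa_fraction(proteome: str, residues: str = "IVYWREL") -> float:
    """Combined fraction of the given residues in a concatenated proteome sequence."""
    return sum(aa in residues for aa in proteome) / len(proteome)


def zscored_sum(proteomes: list[str], residues: str) -> list[float]:
    """Assumed variance correction: sum of per-residue z-scores for every proteome.

    Each residue frequency is standardized across the set of proteomes, so that
    residues with a large spread of frequencies do not dominate the combination.
    """
    scores = [0.0] * len(proteomes)
    for aa in residues:
        freqs = [p.count(aa) / len(p) for p in proteomes]
        mu, sd = mean(freqs), pstdev(freqs)
        if sd == 0:
            continue  # a residue with constant frequency carries no signal
        for k, f in enumerate(freqs):
            scores[k] += (f - mu) / sd
    return scores


# Toy example (hypothetical sequences, not real proteomes)
print(aa_fraction("IVYWRELAAA"))                 # 0.7
print(zscored_sum(["IIAA", "IAAA", "AAAA"], "I"))
```

Either score could then be correlated with OGT across a set of proteomes to obtain coefficients of the kind reported in Table 7.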

DISCUSSION

We will make a brief overview of the most pronounced biases and causal relationships in DNA, RNA and proteins. The major signal in nucleotide compositions, specifically the (G+C) load, is related to thermal stabilization of t- and rRNA (36,54,55,59,60) and to discrimination between the coding and non-coding sequences in Archaea (Table 2). Comparative analysis of nucleotide and amino acid compositions in relation to thermophilic adaptation prompts the conclusion that the (G+C) content does not contribute to thermostabilization of coding DNA (18,22,36,54,59), nor does it affect the amino acid composition and its thermophilic trends (15,22). When separate codon positions are considered for coding DNA, the only compositional bias observed for nucleotide compositions is an excess of (G+C) load in the third codon position in Bacteria. There is a significant excess of guanine and cytosine nucleotides (29.2 and 32.82%) compared to the NCB nucleotide contents (26.09 and 24.66%). A plausible explanation for this bias is the role of the (G+C)3 load in the adaptation to the aerobic life style, which dominates in Bacteria. The (G+C)3 load can be advantageous for several complementary reasons. Indeed, additional GC base pairing will contribute to the stability of the double helix of DNA and of stem regions of RNA molecules. Moreover, G3 bases can work as scavengers of oxidizing agents, providing protection for G bases in other codon positions (56–58). At the same time, this bias would not lead to any changes in amino acid composition, leaving protein structure and stability intact (55).

Table 6. Purine loading in loop and stem regions of folded mRNA and its OGT correlation

                    Loop                              Stem
Feature             Mean content   OGT correlation    Mean content   OGT correlation    L-v-S P-value
A+G                 0.560          0.59               0.500          –0.26              <2.2E–16
R/Y                 1.299          0.61               1.002          –0.26              <2.2E–16
ApG                 0.061          0.79               0.051          0.50               0.0002
GGR (glycine)       0.027          0.62               0.045          0.42               1.2E–17
GGY (glycine)       0.017          –0.05              0.079          –0.23              1.0E–42
AGR (arginine)      0.035          0.72               0.022          0.56               2.4E–10
CGR (arginine)      0.018          –0.22              0.040          –0.24              7.1E–09
CGY (arginine)      0.016          –0.25              0.037          –0.26              6.2E–10
GAR (glutamate)     0.060          0.71               0.036          0.59               <2.2E–16
AAR (lysine)        0.086          0.22               0.017          0.11               <2.2E–16
GAY (aspartate)     0.037          –0.17              0.040          –0.37              0.0002

Feature: analyzed nucleotide, dinucleotide or amino acid. Loop and stem: information on mean content and OGT correlation of the above. L-v-S: a comparison between corresponding contents in the loop and stem regions by Wilcoxon tests; P-values are shown in this column. Correlation coefficients with OGT are shown; P-value < 0.01; P-value < 0.0001.

Table 7. Signals of thermophilic adaptation in protein sequences of Archaea and Bacteria

Correlation with OGT                        Archaea                   Bacteria
Z-scored predictor                          ILVW Y DKR (R = 0.93)     IPV Y EKR (R = 0.89)
The most abundant residues in >70% (>60%)   VIWL Y ER                 VPI Y ER(K)
  of all Z-scored predictors
Individual amino acids                      + L W                     + E
                                            – T Q D
Types of amino acids                        + h                       + h c
                                            – p                       – p
Dipeptides                                  + hp ph                   + cc
                                            – cp pc                   – pc

+, increase of the amino acid (or amino acid type) fraction with OGT; –, decrease of the amino acid (or amino acid type) fraction with OGT. Capital letters are names of amino acids; h, p, c are hydrophobic, polar, charged types of amino acids. Correlation coefficients between the Z-scored thermostability predictors and OGT are given in parentheses.

In this work, we considered compositional and sequence biases in proteins in relation to those in the corresponding nucleic acid sequences and to the phylogeny of species (Archaea or Bacteria). OGT correlations of nucleotides in different codon positions are similar in Archaea and Bacteria, but higher in the former (Table 1). The codon that is optimal from the point of view of thermophilic adaptation (i.e. most correlated with OGT) reads A1 [TG]2 non-[AG]3 in Archaea, but [weak A]1 [weak T]2 non-[AG]3 in Bacteria (Table 4). Codon bias supports the strong correlation between the amount of adenine and OGT in the first codon position in Archaea, and the same but weaker correlation in Bacteria. While we showed earlier (22) that the amino acid thermophilic trend is not determined by the overall nucleotide composition, consideration of separate codon positions reveals an interesting link between positional nucleotide frequencies and amino acid composition in relation to the domain of Life. Specifically, the prevalence of thymine in the second codon position and its strong correlation with OGT in Archaea are apparently a result of the demand for enrichment of Archaeal proteins with hydrophobic residues (Table 7). Notably, a massive increase of van der Waals interactions (24,64) was found to be the cornerstone of the structure-based evolutionary strategy of protein thermostability in ancient species (29). In Bacterial proteins, however, there is an increase of the fractions of charged residues with OGT (Table 7). Thus, we observe a transition from the structure-based evolutionary strategy of protein thermostability in Archaea to the sequence-based one in Bacteria, with corresponding nucleotide and protein compositional biases (Table 7) underlying these strategies (29). The third codon position is characterized by selection against adenine and guanine in Archaea and against adenine in Bacteria, which is a result of this bias.
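The link between thymine in the second codon position and hydrophobic residues follows directly from the standard genetic code, as the following sketch shows. The function names and the toy sequence are hypothetical; the codon table is the standard one, written in the conventional TCAG ordering.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code ('*' marks stop codons).
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def residues_with_second_base(base: str) -> list[str]:
    """Amino acids encoded by codons that carry the given base in the second position."""
    return sorted({aa for codon, aa in CODON_TABLE.items() if codon[1] == base})


def second_position_load(cds: str, bases: str = "TG") -> float:
    """Fraction of codons with one of the given bases in position 2, e.g. the (T+G)2 load."""
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons:
        raise ValueError("sequence contains no complete codon")
    return sum(codon[1] in bases for codon in codons) / len(codons)


print(residues_with_second_base("T"))        # ['F', 'I', 'L', 'M', 'V']
print(second_position_load("ATGAAAAAGCGC"))  # 0.5 (codons ATG AAA AAG CGC)
```

All codons with T in the second position encode one of the residues L, M, F, I or V, so any enrichment of proteins with these residues necessarily raises the thymine load of the second codon position.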

The most pronounced dinucleotide compositional biases are an excess of homo-dinucleotides and its correlation with OGT in ncDNA, tRNA and rRNA. It is presumably a relic of the ancient primitive homo(poly)nucleotides from which life started. Overrepresentation of the complementary ApG/CpT dinucleotides points to their contribution to DNA/RNA stability via strong base-stacking interactions (65,66). The consideration of dinucleotide biases in purine (R) and pyrimidine (Y) notation also reveals an interesting picture. Overall, for both Archaea and Bacteria, the signature for thermophilic dinucleotides reads RpR/YpY, non-(RpR/YpY), RpR/YpY for codon positions 1-2, 2-3 and 3-1, respectively (Table 4). The major difference between Archaea and Bacteria in this case is that the dinucleotide bias in positions 1-2 is caused by demands on the nucleotide sequence level (and provided by the codon bias) in Archaea, while some other factors work in Bacteria. The preference for homodinucleotides in positions 1-2 and 3-1 can be a possible reason for avoiding homodinucleotides in positions 2-3, because heterodinucleotides can contribute to tuning of DNA flexibility. Indeed, it has been shown that, given an average flexibility F of homodinucleotides (either RpR or YpY), the flexibilities of the heterodinucleotides YpR and RpY are 2F and 0.5F, respectively (67). The codon positions 2-3 are the most versatile in terms of the relationship between the nucleotide and protein sequences. Therefore, the presence of heterodinucleotides in codon positions 2-3 makes it possible to adjust the flexibility of the nucleotide sequence without changing the physical–chemical characteristics of the encoded amino acid residue. Additionally, we found that the link between dipeptide and corresponding 3-1 dinucleotide frequencies is determined by the amino acid dipeptide. The dominating bias reads as Met followed by a polar or charged residue.

On the RNA level, simple consideration of the nucleotide composition suggests discriminating the tRNA and rRNA molecules from the mRNA. Indeed, based on the correlation with OGT, we can conclude that the (G+C) content is important for providing stability of tRNA and rRNA molecules, but not of mRNA. The role of the (G+C) content was further confirmed by folding simulations (Table 3). We also found other complementary signals in mRNA sequences related to its structure, stability and function. Archaea and Bacteria yield similar trends in the most and least frequent nucleotide pairs, and the third codon position makes a major contribution to the base-pairing mechanism of stem stabilization. Specifically, in addition to Watson–Crick pairing, the GU wobble pairs contribute substantially to the thermal stability of the Archaeal mRNA, in agreement with earlier observations (68,69). Overall, it results in a moderate correlation of the segment energy and energy per base pair of the folded mRNA with OGT (Table 5). Folding simulations performed in this work also allowed us to analyze peculiarities of the nucleotide contents and structural contacts in stem and loop regions of folded mRNA. Purine load in loop regions is higher than in stems, and this effect is slightly stronger in Archaea than in Bacteria (Table 6). Purine load also correlates with OGT in loops, but not in stems. This correlation was observed earlier by Forsdyke et al. (21), and it was described as 'polite purine load of the loop regions' that prevents undesired mRNA–mRNA single-strand interactions. The authors concluded that increased purine load can affect the codon choice and, consequently, the amino acid composition (21). Our data [as well as careful analysis of the original data in (21)] do not support the latter claim. There is indeed a correlation between the fraction of some purine-rich codons (GGR(Gly), AGR(Arg) and GAR(Glu)) and OGT (Table 6). However, the fraction of purine-rich non-synonymous codons AAR (Lys) is not correlated with OGT (Table 6). Moreover, these codons are synonymous and cannot directly affect the amino acid composition. It has been shown earlier (22,70) that an increase of Glu and Arg fractions is a trend in protein thermophilic adaptation. At the same time, purine load remains high after elimination of the codon bias, pointing to the amino acid composition as the most probable cause of the former (22). All the above prompts one to conclude that, in the mutual tuning of nucleotide and amino acid compositions, the purine load does not dominate or determine biases in amino acid composition, if not the opposite. The analysis of purine load shows that it correlates with OGT in both stems and loops (Table 6) in the case of the purine-rich codons GGR(Gly), AGR(Arg) and GAR(Glu). Further, the amount of ApG dinucleotides strongly correlates with OGT in both loops and stems (Table 6), and ApG provides strong base stacking, important for the thermal stability of nucleic acid sequences (22,65). Purine load, therefore, is apparently a determinant of the base-stacking mechanism in mRNA and/or DNA thermal stability, as well as a result of the thermophilic amino acid composition trend.

Prokaryotes thrive over a temperature interval spanning more than a hundred degrees, and they represent two major life styles, aerobic and anaerobic. Analysis of complete Archaeal and Bacterial genomes unraveled compositional and sequence signals related to molecular mechanisms of stability and adaptation, unaffected by selective sequencing or by the comparison of orthologs. Overall, codon bias works more strongly in Archaea and is mostly utilized in thermophilic adaptation of nucleic acids. It apparently reflects the longer evolutionary history of Archaea, which presumably started close to the origin of life in hot conditions (29). The codon bias and amino acid sequences (dipeptide composition) work in accord to support enrichment of the nucleotide sequences with ApG/CpT dinucleotides, determinants of the base-stacking mechanism of nucleic acid stability (65–67). We also found that the second codon position reveals a strong link between the nucleotide and amino acid compositions. Specifically, the excess of thymine in this position is a result of a demand for enrichment of Archaeal proteins with hydrophobic amino acids. The third codon position in Bacteria is the only case where codon bias is detectable already on the level of pure composition. We found that the (G+C)3 load is related to the aerobic life style dominating in Bacteria. From the point of view of thermophilic adaptation, codon bias works against G in the third codon position in both Archaea and Bacteria. It thus supports a specific role of the third codon position in discriminating between adaptations to temperature and to the aerobic life style. Finally, we obtained an interesting and complex picture of the relationship between the nucleotide composition (purine load) and amino acid composition (selection between Arg and Lys) in relation to thermal adaptation, aerobicity and phylogeny. Specifically, purine load in both Archaea and Bacteria is a result of the 'from both ends of the hydrophobicity scale' trend in thermal adaptation of proteins (34), reflected in the IVYWREL predictor of the OGT (22). According to this predictor, Arg is a preferred amino acid for thermal adaptation, though Lys is the next candidate, present in the top most correlated predictors as well (22,43). In particular, selection for Lys over Arg in some species was shown to be important for the entropic mechanism of protein thermostabilization (70). However, the overall preference for Arg over Lys in the case of thermal adaptation is well manifested in the excess of the purine-rich (AGR) codons of Arg at the expense of the purine-rich ones of Lys (AAR). At the same time, discrimination between Arg and Lys works in the opposite direction in aerobicity, where the Arg codons AGA and AGG are suppressed in favor of the purine-rich AAA and AAG of Lys. It thus reflects a preference for Lys over Arg in aerobes compared to anaerobes in particular, and in Bacteria versus Archaea in general, preserving at the same time the high purine load necessary for providing base stacking in the corresponding nucleic acids (22,53,65). The intricate connection between Lys/Arg and their codons in relation to thermal adaptation and aerobicity exemplifies how selection can work on nucleic acids and proteins simultaneously in response to the demands of different environments. Obviously, the whole picture of molecular mechanisms of adaptation and the relations between them is far from complete. Consideration of other environmental factors, such as salinity, pressure, etc., will help to unravel new mechanisms of stability and their sequence/structure determinants, and to understand the tradeoffs that Nature embraced en route of evolution and adaptation.
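The dinucleotide flexibility model cited in the Discussion (67), in which homodinucleotides (RpR or YpY) have flexibility F while YpR and RpY steps have 2F and 0.5F, can be turned into a simple sequence score. The sketch below averages the relative flexibility (in units of F) over all dinucleotide steps of a sequence; the averaging itself, the function name and the toy sequences are assumptions made for illustration.

```python
# Relative flexibilities in units of F, as quoted in the Discussion from (67).
RELATIVE_FLEXIBILITY = {"RR": 1.0, "YY": 1.0, "YR": 2.0, "RY": 0.5}


def mean_relative_flexibility(seq: str) -> float:
    """Average relative flexibility over consecutive dinucleotide steps (assumed scoring)."""
    classes = []
    for base in seq.upper().replace("U", "T"):
        if base in "AG":
            classes.append("R")
        elif base in "CT":
            classes.append("Y")
        else:
            raise ValueError(f"unexpected base: {base!r}")
    if len(classes) < 2:
        raise ValueError("at least two nucleotides are required")
    steps = [RELATIVE_FLEXIBILITY[a + b] for a, b in zip(classes, classes[1:])]
    return sum(steps) / len(steps)


print(mean_relative_flexibility("AAAA"))  # 1.0: only RpR steps
print(mean_relative_flexibility("TATA"))  # 1.5: steps YpR, RpY, YpR
```

Restricting the steps to codon positions 2-3 would show how a choice among synonymous codons can shift this score without changing the encoded residue.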

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online

FUNDING

Funding for open access charges: Functional Genomics Program (FUGE II), Norwegian Research Council.

Conflict of interest statement. None declared.

REFERENCES

1. Crick, F. (1970) Central dogma of molecular biology. Nature, 227, 561–563.

2. Tawfik, D.S. (2010) Messy biology and the origins of evolutionary innovations. Nat. Chem. Biol., 6, 692–696.

3. Tokuriki, N. and Tawfik, D.S. (2009) Stability effects of mutations and protein evolvability. Curr. Opin. Struct. Biol., 19, 596–604.

4. Tokuriki, N. and Tawfik, D.S. (2009) Protein dynamism and evolvability. Science, 324, 203–207.

5. Koonin, E.V. (2012) Does the central dogma still stand? Biol. Direct, 7, 27.

6. Pe'er, I., Felder, C.E., Man, O., Silman, I., Sussman, J.L. and Beckmann, J.S. (2004) Proteomic signatures: amino acid and oligopeptide compositions differentiate among phyla. Proteins, 54, 20–40.

7. Aravind, L., Tatusov, R.L., Wolf, Y.I., Walker, D.R. and Koonin, E.V. (1998) Evidence for massive gene exchange between archaeal and bacterial hyperthermophiles. Trends Genet., 14, 442–444.

8. Khachane, A.N., Timmis, K.N. and dos Santos, V.A. (2005) Uracil content of 16S rRNA of thermophilic and psychrophilic prokaryotes correlates inversely with their optimal growth temperatures. Nucleic Acids Res., 33, 4016–4022.

9. Suhre,K. and Claverie,J.M. (2003) Genomic correlates of hyperthermostability, an update. J. Biol. Chem., 278, 17198–17202.

10. Tekaia,F. and Yeramian,E. (2006) Evolution of proteomes: fundamental signatures and global trends in amino acid compositions. BMC Genom., 7, 307.

11. Wang,H.C. and Hickey,D.A. (2002) Evidence for strong selective constraint acting on the nucleotide composition of 16S ribosomal RNA genes. Nucleic Acids Res., 30, 2501–2507.

12. Friedman,R., Drake,J.W. and Hughes,A.L. (2004) Genome-wide patterns of nucleotide substitution reveal stringent functional constraints on the protein sequences of thermophiles. Genetics, 167, 1507–1512.

13. Lynn,D.J., Singer,G.A. and Hickey,D.A. (2002) Synonymous codon usage is subject to selection in thermophilic bacteria. Nucleic Acids Res., 30, 4272–4277.

14. Singer,G.A. and Hickey,D.A. (2000) Nucleotide bias causes a genomewide bias in the amino acid composition of proteins. Mol. Biol. Evol., 17, 1581–1588.

15. Singer,G.A. and Hickey,D.A. (2003) Thermophilic prokaryotes have characteristic patterns of codon usage, amino acid composition and nucleotide content. Gene, 317, 39–47.

16. Tekaia,F., Yeramian,E. and Dujon,B. (2002) Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene, 297, 51–60.

17. Roy Chowdhury,A. and Dutta,C. (2012) A pursuit of lineage-specific and niche-specific proteome features in the world of archaea. BMC Genom., 13, 236.

18. Wang,H.C., Susko,E. and Roger,A.J. (2006) On the correlation between genomic G+C content and optimal growth temperature in prokaryotes: data quality and confounding factors. Biochem. Biophys. Res. Commun., 342, 681–684.

19. Wu,H., Zhang,Z., Hu,S. and Yu,J. (2012) On the molecular mechanism of GC content variation among eubacterial genomes. Biol. Direct, 7, 2.

20. Knight,R.D., Freeland,S.J. and Landweber,L.F. (2001) A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes. Genome Biol., 2, RESEARCH0010.

21. Lao,P.J. and Forsdyke,D.R. (2000) Thermophilic bacteria strictly obey Szybalski's transcription direction rule and politely purine-load RNAs with both adenine and guanine. Genome Res., 10, 228–236.

22. Zeldovich,K.B., Berezovsky,I.N. and Shakhnovich,E.I. (2007) Protein and DNA sequence determinants of thermophilic adaptation. PLoS Comput. Biol., 3, e5.

23. Kreil,D.P. and Ouzounis,C.A. (2001) Identification of thermophilic species by the amino acid compositions deduced from their genomes. Nucleic Acids Res., 29, 1608–1615.

24. Berezovsky,I.N., Tumanyan,V.G. and Esipova,N.G. (1997) Representation of amino acid sequences in terms of interaction energy in protein globules. FEBS Lett., 418, 43–46.

25. Cambillau,C. and Claverie,J.M. (2000) Structural and genomic correlates of hyperthermostability. J. Biol. Chem., 275, 32383–32386.

26. Greaves,R.B. and Warwicker,J. (2007) Mechanisms for stabilisation and the maintenance of solubility in proteins from thermophiles. BMC Struct. Biol., 7, 18.

27. Jaenicke,R. (1999) Stability and folding of domain proteins. Prog. Biophys. Mol. Biol., 71, 155–241.

28. Jaenicke,R. and Bohm,G. (1998) The stability of proteins in extreme environments. Curr. Opin. Struct. Biol., 8, 738–748.

29. Berezovsky,I.N. and Shakhnovich,E.I. (2005) Physics and evolution of thermophilic adaptation. Proc. Natl Acad. Sci. USA, 102, 12742–12747.

30. Chakravarty,S. and Varadarajan,R. (2002) Elucidation of factors responsible for enhanced thermal stability of proteins: a structural genomics based study. Biochemistry, 41, 8152–8161.

31. Glyakina,A.V., Garbuzynskiy,S.O., Lobanov,M.Y. and Galzitskaya,O.V. (2007) Different packing of external residues can explain differences in the thermostability of proteins from thermophilic and mesophilic organisms. Bioinformatics, 23, 2231–2238.

32. Thompson,M.J. and Eisenberg,D. (1999) Transproteomic evidence of a loop-deletion mechanism for enhancing protein thermostability. J. Mol. Biol., 290, 595–604.

33. Tokuriki,N., Oldfield,C.J., Uversky,V.N., Berezovsky,I.N. and Tawfik,D.S. (2009) Do viral proteins possess unique biophysical features? Trends Biochem. Sci., 34, 53–59.

34. Berezovsky,I.N., Zeldovich,K.B. and Shakhnovich,E.I. (2007) Positive and negative design in stability and thermal adaptation of natural proteins. PLoS Comput. Biol., 3, e52.

35. Bharanidharan,D., Bhargavi,G.R., Uthanumallian,K. and Gautham,N. (2004) Correlations between nucleotide frequencies and amino acid composition in 115 bacterial species. Biochem. Biophys. Res. Commun., 315, 1097–1103.

36. Nakashima,H., Fukuchi,S. and Nishikawa,K. (2003) Compositional changes in RNA, DNA and proteins for bacterial adaptation to higher and lower temperatures. J. Biochem. (Tokyo), 133, 507–513.

37. Dehouck,Y., Folch,B. and Rooman,M. (2008) Revisiting the correlation between proteins' thermoresistance and organisms' thermophilicity. Protein Eng. Des. Sel., 21, 275–278.

38. Folch,B., Dehouck,Y. and Rooman,M. (2010) Thermo- and mesostabilizing protein interactions identified by temperature-dependent statistical potentials. Biophys. J., 98, 667–677.

39. Folch,B., Rooman,M. and Dehouck,Y. (2008) Thermostability of salt bridges versus hydrophobic interactions in proteins probed by statistical potentials. J. Chem. Inf. Model., 48, 119–127.

40. Gonnelli,G., Rooman,M. and Dehouck,Y. (2012) Structure-based mutant stability predictions on proteins of unknown structure. J. Biotechnol., 161, 287–293.

41. Ponnuswamy,P., Muthusamy,R. and Manavalan,P. (1982) Amino acid composition and thermal stability of globular proteins. Int. J. Biol. Macromol., 4, 186–190.

42. Berezovsky,I.N. (2011) The diversity of physical forces and mechanisms in intermolecular interactions. Phys. Biol., 8, 035002.

43. Ma,B.G., Goncearenco,A. and Berezovsky,I.N. (2010) Thermophilic adaptation of protein complexes inferred from proteomic homology modeling. Structure, 18, 819–828.

44. Makarova,K.S. and Koonin,E.V. (2005) Evolutionary and functional genomics of the Archaea. Curr. Opin. Microbiol., 8, 586–594.

45. Novichkov,P.S., Wolf,Y.I., Dubchak,I. and Koonin,E.V. (2009) Trends in prokaryotic evolution revealed by comparison of closely related bacterial and archaeal genomes. J. Bacteriol., 191, 65–73.

46. Koonin,E.V. and Wolf,Y.I. (2008) Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res., 36, 6688–6719.

47. Koonin,E.V., Mushegian,A.R., Galperin,M.Y. and Walker,D.R. (1997) Comparison of archaeal and bacterial genomes: computer analysis of protein sequences predicts novel functions and suggests a chimeric origin for the archaea. Mol. Microbiol., 25, 619–637.

48. Katz,L. and Burge,C.B. (2003) Widespread selection for local RNA secondary structure in coding regions of bacterial genes. Genome Res., 13, 2042–2051.

49. Hofacker,I.L., Priwitzer,B. and Stadler,P.F. (2004) Prediction of locally stable RNA secondary structures for genome-wide surveys. Bioinformatics, 20, 186–190.

50. Zuker,M. and Stiegler,P. (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res., 9, 133–148.

51. Mathews,D.H., Sabina,J., Zuker,M. and Turner,D.H. (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol., 288, 911–940.

52. Shabalina,S.A., Ogurtsov,A.Y. and Spiridonov,N.A. (2006) A periodic pattern of mRNA secondary structure created by the genetic code. Nucleic Acids Res., 34, 2428–2437.

53. Marmur,J. and Doty,P. (1962) Determination of the base composition of deoxyribonucleic acid from its thermal denaturation temperature. J. Mol. Biol., 5, 109–118.


54. Hurst,L.D. and Merchant,A.R. (2001) High guanine-cytosine content is not an adaptation to high temperature: a comparative analysis amongst prokaryotes. Proc. Biol. Sci., 268, 493–497.

55. Naya,H., Romero,H., Zavala,A., Alvarez,B. and Musto,H. (2002) Aerobiosis increases the genomic guanine plus cytosine content (GC%) in prokaryotes. J. Mol. Evol., 55, 260–264.

56. Beckman,K.B. and Ames,B.N. (1997) Oxidative decay of DNA. J. Biol. Chem., 272, 19633–19636.

57. Cheng,K.C., Cahill,D.S., Kasai,H., Nishimura,S. and Loeb,L.A. (1992) 8-Hydroxyguanine, an abundant form of oxidative DNA damage, causes G→T and A→C substitutions. J. Biol. Chem., 267, 166–172.

58. Vieira-Silva,S. and Rocha,E.P. (2008) An assessment of the impacts of molecular oxygen on the evolution of proteomes. Mol. Biol. Evol., 25, 1931–1942.

59. Galtier,N. and Lobry,J.R. (1997) Relationships between genomic G+C content, RNA secondary structures, and optimal growth temperature in prokaryotes. J. Mol. Evol., 44, 632–636.

60. Musto,H., Naya,H., Zavala,A., Romero,H., Alvarez-Valin,F. and Bernardi,G. (2004) Correlations between genomic GC levels and optimal growth temperatures in prokaryotes. FEBS Lett., 573, 73–77.

61. Fitch,W.M. (1974) The large extent of putative secondary nucleic acid structure in random nucleotide sequences or amino acid derived messenger-RNA. J. Mol. Evol., 3, 279–291.

62. Workman,C. and Krogh,A. (1999) No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. Nucleic Acids Res., 27, 4816–4822.

63. Bachmair,A., Finley,D. and Varshavsky,A. (1986) In vivo half-life of a protein is a function of its amino-terminal residue. Science, 234, 179–186.

64. Berezovsky,I.N., Namiot,V.A., Tumanyan,V.G. and Esipova,N.G. (1999) Hierarchy of the interaction energy distribution in the spatial structure of globular proteins and the problem of domain definition. J. Biomol. Struct. Dyn., 17, 133–155.

65. Yakovchuk,P., Protozanova,E. and Frank-Kamenetskii,M.D. (2006) Base-stacking and base-pairing contributions into thermal stability of the DNA double helix. Nucleic Acids Res., 34, 564–574.

66. Friedman,R.A. and Honig,B. (1995) A free energy analysis of nucleic acid base stacking in aqueous solution. Biophys. J., 69, 1528–1535.

67. Okonogi,T.M., Alley,S.C., Reese,A.W., Hopkins,P.B. and Robinson,B.H. (2002) Sequence-dependent dynamics of duplex DNA: the applicability of a dinucleotide model. Biophys. J., 83, 3446–3459.

68. Masquida,B. and Westhof,E. (2000) On the wobble GoU and related pairs. RNA, 6, 9–15.

69. Varani,G. and McClain,W.H. (2000) The G x U wobble base pair. A fundamental building block of RNA structure crucial to RNA function in diverse biological systems. EMBO Rep., 1, 18–23.

70. Berezovsky,I.N., Chen,W.W., Choi,P.J. and Shakhnovich,E.I. (2005) Entropic stabilization of proteins and its proteomic consequences. PLoS Comput. Biol., 1, e47.


(Table 5). All the parameters are reported as relative values, where the signal in the native sequence is normalized by the value obtained for dicodon-shuffled sequences (48). Dicodon shuffling randomizes the mRNA sequence while preserving the native protein sequence, native codon usage and native dinucleotide composition. It therefore allows one to separate the peculiarities of mRNA caused by demands on its stability and function from those related to the coding DNA and proteins.
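The normalization against shuffled controls described above can be illustrated with a minimal Python sketch. It is not the dicodon shuffling procedure of (48): the control used here, `synonymous_shuffle`, is a simpler hypothetical stand-in that preserves only the protein sequence and the codon usage, not the dinucleotide composition. All function names are assumptions introduced for illustration.

```python
import random
from collections import defaultdict
from statistics import mean

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order.
CODON_TO_AA = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def split_codons(cds):
    """Split a coding sequence into complete codons (DNA alphabet)."""
    cds = cds.upper().replace("U", "T")
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def translate(codons):
    """Translate a list of codons with the standard genetic code."""
    return "".join(CODON_TO_AA[c] for c in codons)


def synonymous_shuffle(codons, rng):
    """Permute codons among positions that encode the same amino acid.

    Simplified control (an assumption of this sketch): it preserves the
    protein sequence and codon usage, but NOT dinucleotide composition,
    which true dicodon shuffling also preserves.
    """
    slots = defaultdict(list)
    for index, codon in enumerate(codons):
        slots[CODON_TO_AA[codon]].append(index)
    shuffled = list(codons)
    for indices in slots.values():
        pool = [codons[i] for i in indices]
        rng.shuffle(pool)
        for i, codon in zip(indices, pool):
            shuffled[i] = codon
    return shuffled


def relative_signal(native_value, control_values):
    """Native signal divided by the mean signal over shuffled controls."""
    return native_value / mean(control_values)


native = split_codons("ATGGCTGCAGCGAAAAAGTAA")
control = synonymous_shuffle(native, random.Random(0))
```

A relative value above 1 for a folding energy would then mean that the native sequence is more stable than its shuffled controls, under whatever energy function is supplied by the user.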

Structure of stems in folded mRNA is characterized by phases, which show the relative location of triplets in opposite strands of the stem. Specifically, Phases I, II and III correspond to positioning of triplets where, respectively, the first, second and third nucleotides are complementary. The most and the least frequent pairs in all phases of the dsDNA (see 'Materials and Methods' section) are considered. Phases I and II are similar to each other, yielding CG, GC, GU, UG, AU and UA (Table 5). Overall, the Archaea and thermophilic Bacteria have similar trends in the most and least frequent pairs. Mesophilic Bacteria stand out, with A13U and G33C as the most frequent pairs in Phases II and III, respectively. The major contribution from the three pairing phases (52) to the total amount of stem pairs is provided by Phase III, and nucleotides in codon position 3 are the most frequent in stem pairs (Table 5 and Supplementary Table S1). OGT correlations highlight the contribution from the GU wobble pair to the thermostability of archaeal mRNA structures (Table 5 and Supplementary Table S1). We found the highest OGT correlation for the Archaeal pair U32G (Table 5), confirming the importance of the GU wobble pairs. The stronger effect for Archaeal rather than Bacterial mRNA is possibly a relic of the ancient RNA world, where RNA was the carrier of genetic information and harsh environmental conditions demanded its increased stability.
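As an illustration of how stem pairs can be tallied by base identity and codon position, the following Python sketch parses a secondary structure in dot-bracket notation and counts each pair together with the codon positions of its two partners. The dot-bracket input, the `frame_start` parameter and the function names are assumptions of this sketch; it does not reproduce the phase definition or the folding pipeline used in this work.

```python
from collections import Counter


def base_pairs(structure):
    """Return sorted (i, j) index pairs from a dot-bracket string."""
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure")
    return sorted(pairs)


def pair_census(seq, structure, frame_start=0):
    """Count stem pairs keyed by (bases, codon position 5', codon position 3').

    frame_start is the index of the first nucleotide of a codon
    (hypothetical parameter; 0 means the sequence starts in frame).
    """
    seq = seq.upper().replace("T", "U")
    if len(seq) != len(structure):
        raise ValueError("sequence and structure differ in length")
    census = Counter()
    for i, j in base_pairs(structure):
        pos_i = (i - frame_start) % 3 + 1
        pos_j = (j - frame_start) % 3 + 1
        census[(seq[i] + seq[j], pos_i, pos_j)] += 1
    return census


def wobble_fraction(census):
    """Fraction of stem pairs that are GU or UG wobble pairs."""
    total = sum(census.values())
    wobble = sum(n for (pair, _, _), n in census.items()
                 if pair in ("GU", "UG"))
    return wobble / total if total else float("nan")


census = pair_census("GGGAAAUCU", "(((...)))")
```

In this toy hairpin two of the three pairs are GU wobble pairs; on real data the per-genome wobble fraction could then be correlated with OGT.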

Table 6 shows that purine load (the contents of A+G, R/Y, ApG) is larger in the loop regions than in the stems. In Archaea this difference is slightly more pronounced than in Bacteria (Supplementary Table S2). The purine load is also in good positive correlation with OGT for the loops, but not for the stems. However, the amount of ApG dinucleotides is correlated with OGT in both loops and stems (Table 6). For synonymous codons, the fractions of purine-rich codons (e.g. GGR versus GGY for glycine and AGR versus CGY for arginine) are correlated with OGT in both loop and stem regions (slightly stronger in loops). For non-synonymous codons, however, the amount of purine-rich codons (GAR for glutamate) is correlated with OGT in both loops and stems, while the fractions of purine-rich AAR codons for lysine do not correlate with OGT in either loop or stem regions (Table 6). While Forsdyke et al. (21) reported an increase of the purine content and its positive correlation with OGT in loops, we found that the fraction of purine-rich synonymous codons is correlated with OGT in stems as well.
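The loop and stem purine-load measures discussed above can be sketched in Python as follows. The sketch assumes that the folded structure is available in dot-bracket notation and that the ApG content is the fraction of dinucleotide steps lying entirely within one region; both are assumptions made for illustration, and the exact definitions used for Table 6 may differ.

```python
def region_stats(seq, structure):
    """Purine load of loop ('.') and stem ('(' or ')') positions.

    Returns, per region, the A+G fraction, the purine/pyrimidine (R/Y)
    ratio and the fraction of ApG among dinucleotide steps whose two
    nucleotides both belong to the region.
    """
    seq = seq.upper().replace("T", "U")
    if len(seq) != len(structure):
        raise ValueError("sequence and structure differ in length")
    nan = float("nan")
    stats = {}
    for name, symbols in (("loop", "."), ("stem", "()")):
        indices = [i for i, s in enumerate(structure) if s in symbols]
        members = set(indices)
        bases = [seq[i] for i in indices]
        purines = sum(b in "AG" for b in bases)
        pyrimidines = sum(b in "CU" for b in bases)
        steps = [seq[i:i + 2] for i in indices if i + 1 in members]
        stats[name] = {
            "A+G": purines / len(bases) if bases else nan,
            "R/Y": purines / pyrimidines if pyrimidines else nan,
            "ApG": steps.count("AG") / len(steps) if steps else nan,
        }
    return stats


stats = region_stats("GGGAAAGCCC", "(((....)))")
```

For this toy hairpin the loop is purine-only (A+G fraction of 1.0) while the stem has equal numbers of purines and pyrimidines, which mirrors the direction of the loop versus stem difference reported in Table 6 without reproducing its values.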

Table 5. Compositional and dicodon signals of mRNA adaptation purified by the dicodon shuffling

Characteristic                      Archaea                       Bacteria

The most and least (after the comma) frequent pairs in stems of mRNA
  Mesophiles
    Phase I                         C23G, G23U                    G32U, U32G
    Phase II                        C31G, G13U                    A13U, G13U
    Phase III                       A33U, U33G                    G33C, U33G
  Thermophiles
    Phase I                         G32C, G23U                    C23G, G23U
    Phase II                        C22G, G13U                    C31G, G13U
    Phase III                       A33U, U33G                    U33A, U33G

Correlations with OGT of
  Segment energy <Esg>              R = –0.71                     R = –0.39
  Energy per base pair <Ebp>        R = –0.73                     R = –0.54

The most significant correlations with OGT
                                    Phase I U32G, R = 0.77        (U21G)Nat PIII, R = 0.46
                                    Phase I G23U, R = 0.71        (G12U)Nat PIII, R = 0.45
                                    Phase I All pairs, R = 0.7    (U21G)dShfld PIII, R = 0.45
                                                                  (G12U)dShfld PIII, R = 0.41

Phases I, II and III correspond to positioning of triplets where, respectively, the first, second and third nucleotides are complementary. All signals (except OGT correlations in Bacteria) are normalized by the corresponding values for control sequences after the dicodon shuffling. See explanations of abbreviations in the 'Materials and Methods' section.

Signals of thermophilic adaptation in protein sequences

It has been shown earlier (22) that environmental temperature directly affects the amino acid composition of prokaryotic proteins, and that the IVYWREL combination can serve as a predictor of the OGT in prokaryotes. We use here a 'Z-score' predictor of OGT (43), which takes into account differences in the variances of the proteomic frequencies of individual amino acids (Table 7). We have shown that the Z-score predictor properly corrects for these differences (43), better reflecting the contribution of the amino acid combinations to thermal adaptation. Separate predictors for Archaea and Bacteria (based on 33 and 99 proteomes with OGTs spread over 5–100 and 10–85°C, respectively) yield the same general trend of an increase of hydrophobic and charged residues at the expense of polar ones (34), showing minor differences in the presence of individual residues.

We have shown earlier (22) that the amino acid bias working in thermal adaptation of proteins does not depend on the overall nucleotide composition of coding DNA sequences. However, one can still expect that nucleotide compositions of particular codon positions and/or dinucleotide compositions of positions 1-2 and 2-3 in codons may be somehow linked to amino acid biases. We indeed found a significant difference between nucleotide load in individual codon positions of Archaeal and Bacterial coding DNA sequences (Table 1). First, Archaeal protein-coding sequences yield a strong correlation between the (T+G)2 load in the second codon position and OGT (r = 0.72). The corresponding non-codon-biased (NCB) sequences show a comparable correlation (r = 0.68), indicating that the (T+G) load reflects tuning of the amino acid composition in connection with thermal adaptation. The main contributor to this correlation is thymine (Table 1), which provides an increase of the strongly hydrophobic residues LMFIV (Supplementary File S1). The (T+G) load is much more weakly correlated with OGT in Bacterial sequences (r = 0.36 for natural and r = 0.42 for NCB sequences). The correlation of the thymine load with OGT is 0.37 for both natural (NAT) and NCB sequences. This observation agrees with the presumed prevalence of the structure-based strategy of thermal adaptation in Archaeal proteins (29), in contrast to the sequence-based one in Bacterial proteomes. Specifically, there is a clear correlation between the amounts of charged residues and the OGT in the set of Bacterial proteomes (Table 7), presumably reflecting the domination of the sequence-based strategy in thermophilic adaptation of Bacteria (29).

Finally, we checked whether there exists any specific connection between the dipeptide composition of proteins and that of 3-1 dinucleotides. We found (data not shown) that the strongest correlations between 3-1 dinucleotides and dipeptides exist mostly for the dinucleotides with guanine in position 3. The most frequent pair of amino acids contains methionine (Met, encoded exclusively by the AUG codon) as the first and any polar/charged amino acid as the second one. The preference for a polar/charged amino acid next to Met can be explained by the typical surface location of the protein N-terminus and its role as a signal for ubiquitination (63).
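The positional nucleotide loads used in this section, such as the (T+G)2 load in the second codon position, and their correlation with OGT can be sketched in Python as follows. The function names are hypothetical, and treating the load as the fraction of codons carrying the given bases at that position is an assumption of this sketch.

```python
import math


def positional_load(cds, position, bases):
    """Fraction of codons whose given position (1, 2 or 3) holds one of bases."""
    if position not in (1, 2, 3):
        raise ValueError("codon position must be 1, 2 or 3")
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore an incomplete trailing codon
    column = cds[position - 1:usable:3]
    if not column:
        raise ValueError("sequence contains no complete codon")
    return sum(column.count(base) for base in bases) / len(column)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


tg2 = positional_load("ATGGTTTGA", 2, "TG")   # codons ATG GTT TGA
gc3 = positional_load("ATGGTTTGA", 3, "GC")
```

With one load value per genome and the corresponding OGTs in two lists, `pearson(loads, ogts)` would give a correlation coefficient of the kind quoted in the text (for example, r for the (T+G)2 load against OGT).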

Table 6. Purine loading in loop and stem regions of folded mRNA and its OGT correlation

Feature            Loop mean content   Loop OGT correlation   Stem mean content   Stem OGT correlation   L-v-S p-value
A+G                0.560               0.59                   0.500               –0.26                  <2.2E–16
R/Y                1.299               0.61                   1.002               –0.26                  <2.2E–16
ApG                0.061               0.79                   0.051               0.50                   0.0002
GGR (glycine)      0.027               0.62                   0.045               0.42                   1.2E–17
GGY (glycine)      0.017               –0.05                  0.079               –0.23                  1.0E–42
AGR (arginine)     0.035               0.72                   0.022               0.56                   2.4E–10
CGR (arginine)     0.018               –0.22                  0.040               –0.24                  7.1E–09
CGY (arginine)     0.016               –0.25                  0.037               –0.26                  6.2E–10
GAR (glutamate)    0.060               0.71                   0.036               0.59                   <2.2E–16
AAR (lysine)       0.086               0.22                   0.017               0.11                   <2.2E–16
GAY (aspartate)    0.037               –0.17                  0.040               –0.37                  0.0002

Feature: analyzed nucleotide, dinucleotide or amino acid. Loop and stem: information on mean content and OGT correlation of the above. L-v-S: a comparison between corresponding contents in the loop and stem regions by Wilcoxon tests; P-values are shown in this column. Correlation coefficients with OGT are shown; significance levels: P-value < 0.01, P-value < 0.0001.

Table 7. Signals of thermophilic adaptation in protein sequences of Archaea and Bacteria

Correlation with OGT, listed for Archaea and Bacteria:
Z-scored predictor: ILVW Y DKR (R = 0.93) in Archaea; IPV Y EKR (R = 0.89) in Bacteria
The most abundant residues in >70% (>60%) of all Z-scored predictors: VIWL Y ER in Archaea; VPI Y ER(K) in Bacteria
Individual amino acids: + L W in Archaea; + E in Bacteria; – T Q D
Types of amino acids: + h, – p in Archaea; + h c, – p in Bacteria
Dipeptides: + hp ph, – cp pc in Archaea; + cc, – pc in Bacteria

+, increase of the amino acid (or amino acid type) fraction with OGT; –, decrease of the amino acid (or amino acid type) fraction with OGT. Capital letters are names of amino acids; h, p, c are hydrophobic, polar, charged types of amino acids. Correlation coefficients between the Z-scored thermostability predictors and OGT are given in parentheses.

DISCUSSION

We will make a brief overview of the most pronounced biases and causal relationships in DNA, RNA and proteins. The major signal in nucleotide compositions, specifically the (G+C) load, is related to thermal stabilization of t- and rRNA (36,54,55,59,60) and to discrimination between the coding and non-coding sequences in Archaea (Table 2). Comparative analysis of nucleotide and amino acid compositions in relation to thermophilic adaptation leads to the conclusion that the (G+C) content does not contribute to thermostabilization of coding DNA (18,22,36,54,59), nor does it affect the amino acid composition and its thermophilic trends (15,22). When separate codon positions are considered for coding DNA, the only compositional bias observed for nucleotide compositions is an excess of (G+C) load in the third codon position in Bacteria. There is a significant excess of guanine and cytosine nucleotides (29.2 and 32.82%) compared with the NCB nucleotide contents (26.09 and 24.66%). A plausible explanation for this bias is the role of the (G+C)3 load in adaptation to the aerobic life style, which dominates in Bacteria. The (G+C)3 load can be advantageous for several complementary reasons. Indeed, additional GC base pairing will contribute to the stability of the DNA double helix and of the stem regions of RNA molecules. Moreover, G3 bases can work as scavengers of oxidizing agents, providing protection for G bases in other codon positions (56–58). At the same time, this bias would not lead to any changes in amino acid composition, leaving protein structure and stability intact (55).

In this work, we considered compositional and sequence biases in proteins in relation to those in the corresponding nucleic acid sequences and to the phylogeny of species (Archaea or Bacteria). OGT correlations of nucleotides in different codon positions are similar in Archaea and Bacteria, but higher in the former (Table 1). The codon that is optimal from the point of view of thermophilic adaptation (i.e. most correlated with OGT) reads A1 [TG]2 non-[AG]3 in Archaea, but [weak A]1 [weak T]2 non-[AG]3 in Bacteria (Table 4). Codon bias supports the strong correlation between the amount of adenine and OGT in the first codon position in Archaea, and the same but weaker correlation in Bacteria. While we showed earlier (22) that the amino acid thermophilic trend is not determined by the overall nucleotide composition, consideration of separate codon positions reveals an interesting link between positional nucleotide frequencies and amino acid composition in relation to the domain of life. Specifically, the prevalence of thymine in the second codon position and its strong correlation with OGT in Archaea is apparently a result of the demand for enrichment of Archaeal proteins with hydrophobic residues (Table 7). Notably, a massive increase of van der Waals interactions (24,64) was found to be the cornerstone of the structure-based evolutionary strategy of protein thermostability in ancient species (29). In Bacterial proteins, however, there is an increase of the fractions of charged residues with OGT (Table 7). Thus, we observe a transition from the structure-based evolutionary strategy of protein thermostability in Archaea to the sequence-based one in Bacteria, and the corresponding nucleotide and protein compositional biases (Table 7) underlying these strategies (29). The third codon position is characterized by selection against adenine and guanine in Archaea and against adenine in Bacteria, which is a result of codon bias.

The most pronounced dinucleotide compositional biases are an excess of homo-dinucleotides and its correlation with OGT in ncDNA, tRNA and rRNA. It is presumably a relic of the ancient primitive homo(poly)nucleotides from which life started. Overrepresentation of the complementary ApG/CpT dinucleotides points to their contribution to DNA/RNA stability via strong base-stacking interactions (65,66). The consideration of dinucleotide biases in purine (R) and pyrimidine (Y) notation also reveals an interesting picture. Overall, for both Archaea and Bacteria, the signature for thermophilic dinucleotides reads RpR/YpY, non-(RpR/YpY), RpR/YpY for codon positions 1-2, 2-3 and 3-1, respectively (Table 4). The major difference between Archaea and Bacteria in this case is that the dinucleotide bias in positions 1-2 is caused by demands at the nucleotide sequence level (and provided by the codon bias) in Archaea, while some other factors work in Bacteria. The preference for homodinucleotides in positions 1-2 and 3-1 can be a possible reason for avoiding homodinucleotides in positions 2-3, because heterodinucleotides can contribute to the tuning of DNA flexibility. Indeed, it has been shown that, given an average flexibility F of homodinucleotides (either RpR or YpY), the flexibilities of the heterodinucleotides YpR and RpY are 2F and 0.5F, respectively (67). The codon positions 2-3 are the most versatile in terms of the relationship between the nucleotide and protein sequences. Therefore, the presence of heterodinucleotides in codon positions 2-3 makes it possible to adjust the flexibility of the nucleotide sequence without changing the physical–chemical characteristics of the encoded amino acid residue. Additionally, we found that the link between dipeptide and corresponding 3-1 dinucleotide frequencies is determined by the amino acid dipeptide. The dominating bias reads as Met followed by a polar or charged residue.

On the RNA level, simple consideration of the nucleotide composition suggests discriminating the tRNA and rRNA molecules from the mRNA. Indeed, based on the correlation with OGT, we can conclude that the (G+C) content is important for providing stability of tRNA and rRNA molecules, but not of mRNA. The role of the (G+C) content was further confirmed by folding simulations (Table 3). We also found other complementary signals in mRNA sequences related to its structure, stability and function. Archaea and Bacteria yield similar trends in the most and least frequent nucleotide pairs, and the third codon position makes a major contribution to the base-pairing mechanism of stem stabilization. Specifically, in addition to Watson–Crick pairing, the GU wobble pairs contribute substantially to the thermal stability of the Archaeal mRNA, in agreement with earlier observations (68,69). Overall, it results in a moderate correlation of the segment energy and energy per base pair of the folded mRNA with OGT (Table 5). Folding simulations performed in this work also allowed us to analyze peculiarities of the nucleotide contents and structural contacts in stem and loop regions of folded mRNA. Purine load in loop regions is higher than in stems, and this effect is slightly stronger in Archaea than in Bacteria (Table 6). Purine load also correlates with OGT in loops, but not in stems. This correlation was observed earlier by Forsdyke et al. (21), and it was described as 'polite purine load of the loop regions' that prevents undesired mRNA–mRNA single-strand interactions. The authors concluded that increased purine load can affect the codon choice and, consequently, the amino acid composition (21). Our data [as well as careful analysis of the original data in (21)] do not support the latter claim. There is indeed a correlation between the fraction of some purine-rich codons [GGR (Gly), AGR (Arg) and GAR (Glu)] and OGT (Table 6). However, the fraction of purine-rich non-synonymous codons AAR (Lys) is not correlated with OGT (Table 6). Moreover, these codons are synonymous and cannot directly affect the amino acid composition. It has been shown earlier (22,70) that an increase of the Glu and Arg fractions is a trend in protein thermophilic adaptation. At the same time, purine load remains high after elimination of the codon bias, pointing to the amino acid composition as the most probable cause of the former (22). All of the above prompts one to conclude that, in the mutual tuning of nucleotide and amino acid compositions, the purine load does not dominate and determine biases in amino acid composition, if not the opposite. The analysis of purine load shows that it correlates with OGT in both stems and loops (Table 6) in the case of the purine-rich codons GGR (Gly), AGR (Arg) and GAR (Glu). Further, the amount of ApG dinucleotides strongly correlates with OGT in both loops and stems (Table 6), and ApG provides strong base stacking important for the thermal stability of nucleic acid sequences (22,65). Purine load is therefore apparently a determinant of the base-stacking mechanism in mRNA and/or DNA thermal stability, as well as a result of the thermophilic amino acid composition trend.

Prokaryotes thrive in a temperature interval spanning over a hundred degrees, and they represent two major life styles, aerobic and anaerobic. Analysis of complete Archaeal and Bacterial genomes unraveled compositional and sequence signals related to molecular mechanisms of stability and adaptation, unaffected by selective sequencing or by the comparison of orthologs. Overall, codon bias works more strongly in Archaea and is mostly utilized in thermophilic adaptation of nucleic acids. It apparently reflects the longer evolutionary history of Archaea, which presumably started close to the origin of life in hot conditions (29). The codon bias and amino acid sequences (dipeptide composition) work in accord to support enrichment of the nucleotide sequences with ApG/CpT dinucleotides, determinants of the base-stacking mechanism of nucleic acid stability (65–67). We also found that the second codon position reveals a strong link between the nucleotide and amino acid compositions. Specifically, the excess of thymine in this position is a result of a demand for the enrichment of Archaeal proteins with hydrophobic amino acids. The third codon position in Bacteria is the only case where codon bias is detectable already at the level of pure composition. We found that the (G+C)3 load is related to the aerobic life style dominating in Bacteria. From the point of view of thermophilic adaptation, codon bias works against G in the third codon position in both Archaea and Bacteria. It thus supports a specific role of the third codon position in discriminating between adaptations to temperature and to the aerobic life style. Finally, we obtained an interesting and complex picture of the relationship between the nucleotide composition (purine load) and amino acid composition (selection between Arg and Lys) in relation to thermal adaptation, aerobicity and phylogeny. Specifically, purine load in both Archaea and Bacteria is a result of the 'from both ends of the hydrophobicity scale' trend in thermal adaptation of proteins (34), reflected in the IVYWREL predictor of the OGT (22). According to this predictor, Arg is a preferred amino acid for thermal adaptation, though Lys is the next candidate, present in the top most correlated predictors as well (22,43). In particular, selection for Lys over Arg in some species was shown to be important for the entropic mechanism of protein thermostabilization (70). However, the overall preference for Arg over Lys in the case of thermal adaptation is well manifested in the excess of the purine-rich (AGR) codons of Arg at the expense of the purine-rich ones of Lys (AAR). At the same time, discrimination between Arg and Lys works in the opposite direction in aerobicity, where the Arg codons AGA and AGG are suppressed in favor of the purine-rich AAA and AAG of Lys. It thus reflects a preference for Lys over Arg in aerobes compared with anaerobes in particular, and in Bacteria versus Archaea in general, preserving at the same time the high purine load necessary for providing base stacking in the corresponding nucleic acids (22,53,65). An intricate connection between Lys/Arg and their codons in relation to thermal adaptation and aerobicity exemplifies how selection can work on nucleic acids and proteins simultaneously in response to the demands of different environments. Obviously, the whole picture of molecular mechanisms of adaptation and the relations between them is far from being complete. Consideration of other environmental factors, such as salinity, pressure, etc., will help to unravel new mechanisms of stability and their sequence/structure determinants, and to understand the tradeoffs that Nature embraced en route of evolution and adaptation.
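The balance between the purine-rich codons of Arg (AGR) and Lys (AAR) discussed above can be quantified with a short Python sketch. It matches codons against IUPAC-style patterns (R for purine, Y for pyrimidine) and reports their fraction among all codons of a coding sequence. The function names and the choice of normalizing by the total number of codons are assumptions made for illustration.

```python
# IUPAC-style symbols used in the text: R = purine, Y = pyrimidine.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "N": "ACGT"}


def matches(codon, pattern):
    """True if every base of the codon is allowed by the pattern symbol."""
    return all(base in IUPAC[symbol] for base, symbol in zip(codon, pattern))


def codon_fraction(cds, pattern):
    """Fraction of complete codons in cds that match a pattern such as 'AGR'."""
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons:
        raise ValueError("sequence contains no complete codon")
    return sum(matches(codon, pattern) for codon in codons) / len(codons)


toy_cds = "AGAAAGCGTAAA"                 # codons AGA AAG CGT AAA
arg_agr = codon_fraction(toy_cds, "AGR")
lys_aar = codon_fraction(toy_cds, "AAR")
```

Comparing `arg_agr` with `lys_aar` across genomes grouped by OGT or by aerobicity would be one simple way to look at the Arg versus Lys tradeoff described in the text.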

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Funding for open access charge: Functional Genomics Program (FUGE II), Norwegian Research Council.

Conflict of interest statement. None declared.

REFERENCES

1. Crick,F. (1970) Central dogma of molecular biology. Nature, 227, 561–563.

2. Tawfik,D.S. (2010) Messy biology and the origins of evolutionary innovations. Nat. Chem. Biol., 6, 692–696.

3. Tokuriki,N. and Tawfik,D.S. (2009) Stability effects of mutations and protein evolvability. Curr. Opin. Struct. Biol., 19, 596–604.

4. Tokuriki,N. and Tawfik,D.S. (2009) Protein dynamism and evolvability. Science, 324, 203–207.

5. Koonin,E.V. (2012) Does the central dogma still stand? Biol. Direct, 7, 27.

6. Pe'er,I., Felder,C.E., Man,O., Silman,I., Sussman,J.L. and Beckmann,J.S. (2004) Proteomic signatures: amino acid and oligopeptide compositions differentiate among phyla. Proteins, 54, 20–40.

7. Aravind,L., Tatusov,R.L., Wolf,Y.I., Walker,D.R. and Koonin,E.V. (1998) Evidence for massive gene exchange between archaeal and bacterial hyperthermophiles. Trends Genet., 14, 442–444.

8 KhachaneAN TimmisKN and dos SantosVA (2005) Uracilcontent of 16S rRNA of thermophilic and psychrophilic

2890 Nucleic Acids Research 2014 Vol 42 No 5

by guest on May 6 2014

httpnaroxfordjournalsorgD

ownloaded from

prokaryotes correlates inversely with their optimal growthtemperatures Nucleic Acids Res 33 4016ndash4022

9. Suhre,K. and Claverie,J.M. (2003) Genomic correlates of hyperthermostability, an update. J. Biol. Chem., 278, 17198–17202.

10. Tekaia,F. and Yeramian,E. (2006) Evolution of proteomes: fundamental signatures and global trends in amino acid compositions. BMC Genom., 7, 307.

11. Wang,H.C. and Hickey,D.A. (2002) Evidence for strong selective constraint acting on the nucleotide composition of 16S ribosomal RNA genes. Nucleic Acids Res., 30, 2501–2507.

12. Friedman,R., Drake,J.W. and Hughes,A.L. (2004) Genome-wide patterns of nucleotide substitution reveal stringent functional constraints on the protein sequences of thermophiles. Genetics, 167, 1507–1512.

13. Lynn,D.J., Singer,G.A. and Hickey,D.A. (2002) Synonymous codon usage is subject to selection in thermophilic bacteria. Nucleic Acids Res., 30, 4272–4277.

14. Singer,G.A. and Hickey,D.A. (2000) Nucleotide bias causes a genomewide bias in the amino acid composition of proteins. Mol. Biol. Evol., 17, 1581–1588.

15. Singer,G.A. and Hickey,D.A. (2003) Thermophilic prokaryotes have characteristic patterns of codon usage, amino acid composition and nucleotide content. Gene, 317, 39–47.

16. Tekaia,F., Yeramian,E. and Dujon,B. (2002) Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene, 297, 51–60.

17. Roy Chowdhury,A. and Dutta,C. (2012) A pursuit of lineage-specific and niche-specific proteome features in the world of archaea. BMC Genom., 13, 236.

18. Wang,H.C., Susko,E. and Roger,A.J. (2006) On the correlation between genomic G+C content and optimal growth temperature in prokaryotes: data quality and confounding factors. Biochem. Biophys. Res. Commun., 342, 681–684.

19. Wu,H., Zhang,Z., Hu,S. and Yu,J. (2012) On the molecular mechanism of GC content variation among eubacterial genomes. Biol. Direct, 7, 2.

20. Knight,R.D., Freeland,S.J. and Landweber,L.F. (2001) A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes. Genome Biol., 2, RESEARCH0010.

21. Lao,P.J. and Forsdyke,D.R. (2000) Thermophilic bacteria strictly obey Szybalski's transcription direction rule and politely purine-load RNAs with both adenine and guanine. Genome Res., 10, 228–236.

22. Zeldovich,K.B., Berezovsky,I.N. and Shakhnovich,E.I. (2007) Protein and DNA sequence determinants of thermophilic adaptation. PLoS Comput. Biol., 3, e5.

23. Kreil,D.P. and Ouzounis,C.A. (2001) Identification of thermophilic species by the amino acid compositions deduced from their genomes. Nucleic Acids Res., 29, 1608–1615.

24. Berezovsky,I.N., Tumanyan,V.G. and Esipova,N.G. (1997) Representation of amino acid sequences in terms of interaction energy in protein globules. FEBS Lett., 418, 43–46.

25. Cambillau,C. and Claverie,J.M. (2000) Structural and genomic correlates of hyperthermostability. J. Biol. Chem., 275, 32383–32386.

26. Greaves,R.B. and Warwicker,J. (2007) Mechanisms for stabilisation and the maintenance of solubility in proteins from thermophiles. BMC Struct. Biol., 7, 18.

27. Jaenicke,R. (1999) Stability and folding of domain proteins. Prog. Biophys. Mol. Biol., 71, 155–241.

28. Jaenicke,R. and Bohm,G. (1998) The stability of proteins in extreme environments. Curr. Opin. Struct. Biol., 8, 738–748.

29. Berezovsky,I.N. and Shakhnovich,E.I. (2005) Physics and evolution of thermophilic adaptation. Proc. Natl Acad. Sci. USA, 102, 12742–12747.

30. Chakravarty,S. and Varadarajan,R. (2002) Elucidation of factors responsible for enhanced thermal stability of proteins: a structural genomics based study. Biochemistry, 41, 8152–8161.

31. Glyakina,A.V., Garbuzynskiy,S.O., Lobanov,M.Y. and Galzitskaya,O.V. (2007) Different packing of external residues can explain differences in the thermostability of proteins from thermophilic and mesophilic organisms. Bioinformatics, 23, 2231–2238.

32. Thompson,M.J. and Eisenberg,D. (1999) Transproteomic evidence of a loop-deletion mechanism for enhancing protein thermostability. J. Mol. Biol., 290, 595–604.

33. Tokuriki,N., Oldfield,C.J., Uversky,V.N., Berezovsky,I.N. and Tawfik,D.S. (2009) Do viral proteins possess unique biophysical features? Trends Biochem. Sci., 34, 53–59.

34. Berezovsky,I.N., Zeldovich,K.B. and Shakhnovich,E.I. (2007) Positive and negative design in stability and thermal adaptation of natural proteins. PLoS Comput. Biol., 3, e52.

35. Bharanidharan,D., Bhargavi,G.R., Uthanumallian,K. and Gautham,N. (2004) Correlations between nucleotide frequencies and amino acid composition in 115 bacterial species. Biochem. Biophys. Res. Commun., 315, 1097–1103.

36. Nakashima,H., Fukuchi,S. and Nishikawa,K. (2003) Compositional changes in RNA, DNA and proteins for bacterial adaptation to higher and lower temperatures. J. Biochem. (Tokyo), 133, 507–513.

37. Dehouck,Y., Folch,B. and Rooman,M. (2008) Revisiting the correlation between proteins' thermoresistance and organisms' thermophilicity. Protein Eng. Des. Sel., 21, 275–278.

38. Folch,B., Dehouck,Y. and Rooman,M. (2010) Thermo- and mesostabilizing protein interactions identified by temperature-dependent statistical potentials. Biophys. J., 98, 667–677.

39. Folch,B., Rooman,M. and Dehouck,Y. (2008) Thermostability of salt bridges versus hydrophobic interactions in proteins probed by statistical potentials. J. Chem. Inf. Model., 48, 119–127.

40. Gonnelli,G., Rooman,M. and Dehouck,Y. (2012) Structure-based mutant stability predictions on proteins of unknown structure. J. Biotechnol., 161, 287–293.

41. Ponnuswamy,P., Muthusamy,R. and Manavalan,P. (1982) Amino acid composition and thermal stability of globular proteins. Int. J. Biol. Macromol., 4, 186–190.

42. Berezovsky,I.N. (2011) The diversity of physical forces and mechanisms in intermolecular interactions. Phys. Biol., 8, 035002.

43. Ma,B.G., Goncearenco,A. and Berezovsky,I.N. (2010) Thermophilic adaptation of protein complexes inferred from proteomic homology modeling. Structure, 18, 819–828.

44. Makarova,K.S. and Koonin,E.V. (2005) Evolutionary and functional genomics of the Archaea. Curr. Opin. Microbiol., 8, 586–594.

45. Novichkov,P.S., Wolf,Y.I., Dubchak,I. and Koonin,E.V. (2009) Trends in prokaryotic evolution revealed by comparison of closely related bacterial and archaeal genomes. J. Bacteriol., 191, 65–73.

46. Koonin,E.V. and Wolf,Y.I. (2008) Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res., 36, 6688–6719.

47. Koonin,E.V., Mushegian,A.R., Galperin,M.Y. and Walker,D.R. (1997) Comparison of archaeal and bacterial genomes: computer analysis of protein sequences predicts novel functions and suggests a chimeric origin for the archaea. Mol. Microbiol., 25, 619–637.

48. Katz,L. and Burge,C.B. (2003) Widespread selection for local RNA secondary structure in coding regions of bacterial genes. Genome Res., 13, 2042–2051.

49. Hofacker,I.L., Priwitzer,B. and Stadler,P.F. (2004) Prediction of locally stable RNA secondary structures for genome-wide surveys. Bioinformatics, 20, 186–190.

50. Zuker,M. and Stiegler,P. (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res., 9, 133–148.

51. Mathews,D.H., Sabina,J., Zuker,M. and Turner,D.H. (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol., 288, 911–940.

52. Shabalina,S.A., Ogurtsov,A.Y. and Spiridonov,N.A. (2006) A periodic pattern of mRNA secondary structure created by the genetic code. Nucleic Acids Res., 34, 2428–2437.

53. Marmur,J. and Doty,P. (1962) Determination of the base composition of deoxyribonucleic acid from its thermal denaturation temperature. J. Mol. Biol., 5, 109–118.


54. Hurst,L.D. and Merchant,A.R. (2001) High guanine-cytosine content is not an adaptation to high temperature: a comparative analysis amongst prokaryotes. Proc. Biol. Sci., 268, 493–497.

55. Naya,H., Romero,H., Zavala,A., Alvarez,B. and Musto,H. (2002) Aerobiosis increases the genomic guanine plus cytosine content (GC%) in prokaryotes. J. Mol. Evol., 55, 260–264.

56. Beckman,K.B. and Ames,B.N. (1997) Oxidative decay of DNA. J. Biol. Chem., 272, 19633–19636.

57. Cheng,K.C., Cahill,D.S., Kasai,H., Nishimura,S. and Loeb,L.A. (1992) 8-Hydroxyguanine, an abundant form of oxidative DNA damage, causes G→T and A→C substitutions. J. Biol. Chem., 267, 166–172.

58. Vieira-Silva,S. and Rocha,E.P. (2008) An assessment of the impacts of molecular oxygen on the evolution of proteomes. Mol. Biol. Evol., 25, 1931–1942.

59. Galtier,N. and Lobry,J.R. (1997) Relationships between genomic G+C content, RNA secondary structures, and optimal growth temperature in prokaryotes. J. Mol. Evol., 44, 632–636.

60. Musto,H., Naya,H., Zavala,A., Romero,H., Alvarez-Valin,F. and Bernardi,G. (2004) Correlations between genomic GC levels and optimal growth temperatures in prokaryotes. FEBS Lett., 573, 73–77.

61. Fitch,W.M. (1974) The large extent of putative secondary nucleic acid structure in random nucleotide sequences or amino acid derived messenger-RNA. J. Mol. Evol., 3, 279–291.

62. Workman,C. and Krogh,A. (1999) No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. Nucleic Acids Res., 27, 4816–4822.

63. Bachmair,A., Finley,D. and Varshavsky,A. (1986) In vivo half-life of a protein is a function of its amino-terminal residue. Science, 234, 179–186.

64. Berezovsky,I.N., Namiot,V.A., Tumanyan,V.G. and Esipova,N.G. (1999) Hierarchy of the interaction energy distribution in the spatial structure of globular proteins and the problem of domain definition. J. Biomol. Struct. Dyn., 17, 133–155.

65. Yakovchuk,P., Protozanova,E. and Frank-Kamenetskii,M.D. (2006) Base-stacking and base-pairing contributions into thermal stability of the DNA double helix. Nucleic Acids Res., 34, 564–574.

66. Friedman,R.A. and Honig,B. (1995) A free energy analysis of nucleic acid base stacking in aqueous solution. Biophys. J., 69, 1528–1535.

67. Okonogi,T.M., Alley,S.C., Reese,A.W., Hopkins,P.B. and Robinson,B.H. (2002) Sequence-dependent dynamics of duplex DNA: the applicability of a dinucleotide model. Biophys. J., 83, 3446–3459.

68. Masquida,B. and Westhof,E. (2000) On the wobble GoU and related pairs. RNA, 6, 9–15.

69. Varani,G. and McClain,W.H. (2000) The G x U wobble base pair. A fundamental building block of RNA structure crucial to RNA function in diverse biological systems. EMBO Rep., 1, 18–23.

70. Berezovsky,I.N., Chen,W.W., Choi,P.J. and Shakhnovich,E.I. (2005) Entropic stabilization of proteins and its proteomic consequences. PLoS Comput. Biol., 1, e47.

Nucleic Acids Research, 2014, Vol. 42, No. 5


expense of polar ones (34), showing minor differences in the presence of individual residues.

We have shown earlier (22) that the amino acid bias working in thermal adaptation of proteins does not depend on the overall nucleotide composition of coding DNA sequences. However, one can still expect that nucleotide compositions of particular codon positions and/or dinucleotide compositions of positions 1-2 and 2-3 in codons may be linked to amino acid biases. We indeed found a significant difference between the nucleotide loads in individual codon positions of Archaeal and Bacterial coding DNA sequences (Table 1). First, Archaeal protein-coding sequences yield a strong correlation between the (T+G)2 load in the second codon position and OGT (r = 0.72). The corresponding non-codon-biased sequences show a comparable correlation (r = 0.68), indicating that the (T+G) load reflects tuning of the amino acid composition in connection with thermal adaptation. The main contributor to this correlation is thymine (Table 1), which provides an increase of the strongly hydrophobic residues L, M, F, I and V (Supplementary File S1). The (T+G) load is much more weakly correlated with OGT in Bacterial sequences (r = 0.36 for natural and r = 0.42 for NCB sequences). The correlation of the thymine load with OGT is 0.37 for both natural (NAT) and non-codon-biased (NCB) sequences. This observation agrees with the presumed prevalence of the structure-based strategy of thermal adaptation in Archaeal proteins (29), in contrast to the sequence-based one in Bacterial proteomes. Specifically, there is a clear correlation between the amounts of charged residues and OGT in the set of Bacterial proteomes (Table 7), presumably reflecting the domination of the sequence-based strategy in thermophilic adaptation of Bacteria (29).

Finally, we checked whether there is any specific connection between the dipeptide composition of proteins and that of 3-1 dinucleotides. We found (data not shown) that the strongest correlations between 3-1 dinucleotides and dipeptides exist mostly for the dinucleotides with guanine in position 3. The most frequent pair of amino acids contains methionine (Met, encoded exclusively by the AUG codon) as the first and any polar/charged amino acid as the second one. The preference for a polar/charged amino acid next to Met can be explained by the typically surface location of the protein N-termini and its role as a signal for ubiquitination (63).
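As a rough illustration of the 3-1 dinucleotide/dipeptide comparison described above, the following Python sketch counts, for one coding sequence, each dinucleotide formed by the third position of a codon and the first position of the next codon, together with the dipeptide encoded by that codon pair. The function name `count_31_pairs`, the toy sequence and the use of the standard genetic code are assumptions made for illustration; the text does not describe how the original analysis was implemented.

```python
from collections import Counter
from itertools import product

# Standard genetic code with codons enumerated in TCAG order (an assumption;
# the text does not state which translation table was used).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_31_pairs(cds):
    """Count 3-1 dinucleotides and the dipeptides spanning the same codon pairs."""
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    dinucleotides = Counter()
    dipeptides = Counter()
    for first, second in zip(codons, codons[1:]):
        if first not in CODON_TABLE or second not in CODON_TABLE:
            continue  # skip codons containing ambiguous bases
        aa1, aa2 = CODON_TABLE[first], CODON_TABLE[second]
        if "*" in (aa1, aa2):
            continue  # skip pairs that involve a stop codon
        dinucleotides[first[2] + second[0]] += 1
        dipeptides[aa1 + aa2] += 1
    return dinucleotides, dipeptides


# Toy coding sequence (hypothetical): Met-Lys-Glu-Leu followed by a stop codon.
dinucs, dipeps = count_31_pairs("ATGAAAGAACTGTAA")
print(dinucs)
print(dipeps)
```

Summing such counts over all genes of a genome would give the two frequency vectors whose correlation is discussed in the text.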

Table 6. Purine loading in loop and stem regions of folded mRNA and its OGT correlation

Feature            Loop                             Stem                             L-v-S
                   Mean content   OGT correlation   Mean content   OGT correlation   P-value
A+G                0.560          0.59              0.500          -0.26             <2.2E-16
R/Y                1.299          0.61              1.002          -0.26             <2.2E-16
ApG                0.061          0.79              0.051          0.50              0.0002
GGR (glycine)      0.027          0.62              0.045          0.42              1.2E-17
GGY (glycine)      0.017          -0.05             0.079          -0.23             1.0E-42
AGR (arginine)     0.035          0.72              0.022          0.56              2.4E-10
CGR (arginine)     0.018          -0.22             0.040          -0.24             7.1E-09
CGY (arginine)     0.016          -0.25             0.037          -0.26             6.2E-10
GAR (glutamate)    0.060          0.71              0.036          0.59              <2.2E-16
AAR (lysine)       0.086          0.22              0.017          0.11              <2.2E-16
GAY (aspartate)    0.037          -0.17             0.040          -0.37             0.0002

Feature: analyzed nucleotide, dinucleotide or amino acid. Loop and Stem: information on mean content and OGT correlation of the above. L-v-S: a comparison between corresponding contents in the loop and stem regions by Wilcoxon tests; P-values are shown in this column. Correlation coefficients with OGT are shown. P-value < 0.01; P-value < 0.0001.

Table 7. Signals of thermophilic adaptation in protein sequences of Archaea and Bacteria

Correlation with OGT; Archaea; Bacteria
Z-scored predictor; ILVW Y DKR (R = 0.93); IPV Y EKR (R = 0.89)
The most abundant residues in >70% (>60%) of all Z-scored predictors; VIWL Y ER; VPI Y ER(K)
Individual amino acids; + L W; + E; – T Q D
Types of amino acids; + h; + h c; – p; – p
Dipeptides; + hp ph; + cc; – cp pc; – pc

+, increase of the amino acid (or amino acid type) fraction with OGT; –, decrease of the amino acid (or amino acid type) fraction with OGT. Capital letters are names of amino acids; h, p and c are hydrophobic, polar and charged types of amino acids. Correlation coefficients between the Z-scored thermostability predictors and OGT are given in parentheses.

DISCUSSION

We will make a brief overview of the most pronounced biases and causal relationships in DNA, RNA and proteins. The major signal in nucleotide compositions, specifically the (G+C) load, is related to thermal stabilization of t- and rRNA (36,54,55,59,60) and to discrimination between the coding and non-coding sequences in Archaea (Table 2). Comparative analysis of nucleotide and amino acid compositions in relation to thermophilic adaptation leads to the conclusion that the (G+C) content does not contribute to thermostabilization of coding DNA (18,22,36,54,59), nor does it affect the amino acid composition and its thermophilic trends (15,22). When separate codon positions are considered for coding DNA, the only compositional bias observed for nucleotide compositions is an excess of (G+C) load in the third codon position in Bacteria. There is a significant excess of guanine and cytosine nucleotides (29.2 and 32.82%) compared to the NCB nucleotide contents (26.09 and 24.66%). A plausible explanation for this bias is the role of the (G+C)3 load in the adaptation to the aerobic life style, which dominates in Bacteria. The (G+C)3 load can be advantageous for several complementary reasons. Indeed, additional GC base pairing will contribute to the stability of the double helix of DNA and of stem regions of RNA molecules. Moreover, G3 bases can work as scavengers of oxidizing agents, providing protection for G bases in other codon positions (56–58). At the same time, this bias would not lead to any changes in amino acid composition, leaving protein structure and stability intact (55).
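The (G+C)3 comparison above can be sketched as follows. The code computes the base fractions in the third codon position of a coding sequence and compares the resulting (G+C)3 value with a control in which every codon is replaced by a randomly drawn synonymous codon. This uniform synonymous resampling is only an assumed stand-in for the non-codon-biased (NCB) sequences, whose exact construction is not given in this section; the function names and the toy sequence are hypothetical.

```python
import random
from collections import defaultdict
from itertools import product

# Standard genetic code in TCAG order (assumed translation table).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SYNONYMS = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    SYNONYMS[aa].append(codon)


def split_codons(cds):
    """Split a coding sequence into complete codons (toy input: A/C/G/T only)."""
    cds = cds.upper().replace("U", "T")
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def third_position_fractions(cds):
    """Return the fraction of each base in the third codon position."""
    thirds = [codon[2] for codon in split_codons(cds)]
    return {base: thirds.count(base) / len(thirds) for base in "ACGT"}


def synonymous_control(cds, rng):
    """Replace each codon by a uniformly drawn synonymous codon (assumed NCB stand-in)."""
    return "".join(rng.choice(SYNONYMS[CODON_TABLE[c]]) for c in split_codons(cds))


def translate(cds):
    """Translate a coding sequence with the table above."""
    return "".join(CODON_TABLE[c] for c in split_codons(cds))


natural = "ATGGCCCTGAAGGGCGACCGCACCTAA"  # toy sequence, hypothetical
control = synonymous_control(natural, random.Random(0))
nat = third_position_fractions(natural)
ncb = third_position_fractions(control)
print("natural (G+C)3:", nat["G"] + nat["C"])
print("control (G+C)3:", ncb["G"] + ncb["C"])
```

Because the control encodes the same protein, any difference between the two (G+C)3 values in such a comparison comes from codon choice rather than from amino acid composition.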

In this work, we considered compositional and sequence biases in proteins in relation to those in the corresponding nucleic acid sequences and to the phylogeny of species (Archaea or Bacteria). OGT correlations of nucleotides in different codon positions are similar in Archaea and Bacteria, but higher in the former (Table 1). The codon that is optimal from the point of view of thermophilic adaptation (i.e. most correlated with OGT) reads A1 [TG]2 non-[AG]3 in Archaea, but [weak A]1 [weak T]2 non-[AG]3 in Bacteria (Table 4). Codon bias supports the strong correlation between the amount of adenine and OGT in the first codon position in Archaea, and the same but weaker correlation in Bacteria. While we showed earlier (22) that the amino acid thermophilic trend is not determined by the overall nucleotide composition, consideration of separate codon positions reveals an interesting link between positional nucleotide frequencies and amino acid composition in relation to the domain of life. Specifically, the prevalence of thymine in the second codon position and its strong correlation with OGT in Archaea are apparently a result of the demand for enrichment of Archaeal proteins with hydrophobic residues (Table 7). Notably, a massive increase of van der Waals interactions (24,64) was found to be the cornerstone of the structure-based evolutionary strategy of protein thermostability in ancient species (29). In Bacterial proteins, however, there is an increase of the fractions of charged residues with OGT (Table 7). Thus, we observe a transition from the structure-based evolutionary strategy of protein thermostability in Archaea to the sequence-based one in Bacteria, together with the corresponding nucleotide and protein compositional biases (Table 7) underlying these strategies (29). The third codon position is characterized by selection against adenine and guanine in Archaea and against adenine in Bacteria, which is a result of this bias.
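A minimal sketch of how a positional nucleotide load and its correlation with OGT might be computed is given below. The helper names, the three toy sequences and their OGT values are hypothetical and serve only to make the arithmetic concrete; they are not data from this study.

```python
from math import sqrt


def positional_load(cds, position, bases):
    """Fraction of codons whose nucleotide at `position` (1, 2 or 3) is in `bases`."""
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    hits = sum(1 for codon in codons if codon[position - 1] in bases)
    return hits / len(codons)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical genomes: (concatenated coding sequence, OGT in degrees Celsius).
genomes = [
    ("ATGGCAGAACCA", 30.0),
    ("ATGGTAGAACTA", 60.0),
    ("ATGGTACTTCTA", 90.0),
]
# (T+G) load in the second codon position for each toy genome.
loads = [positional_load(seq, 2, "TG") for seq, _ in genomes]
ogts = [ogt for _, ogt in genomes]
r = pearson_r(loads, ogts)
print(loads)
print(r)
```

The same two helpers cover the other positional signals mentioned in the text, for example `positional_load(seq, 1, "A")` for adenine in the first codon position.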

The most pronounced dinucleotide compositional biases are the excess of homo-dinucleotides and its correlation with OGT in ncDNA, tRNA and rRNA. It is presumably a relic of ancient primitive homo(poly)nucleotides from which life started. Overrepresentation of the complementary ApG/CpT dinucleotides points to their contribution to DNA/RNA stability via strong base-stacking interactions (65,66). The consideration of dinucleotide biases in purine (R) and pyrimidine (Y) notation also reveals an interesting picture. Overall, for both Archaea and Bacteria, the signature for thermophilic dinucleotides reads RpR/YpY, non-(RpR/YpY), RpR/YpY for codon positions 1-2, 2-3 and 3-1, respectively (Table 4). The major difference between Archaea and Bacteria in this case is that the dinucleotide bias in positions 1-2 is caused by demands at the nucleotide sequence level (and provided by the codon bias) in Archaea, while some other factors work in Bacteria. The preference for homodinucleotides in positions 1-2 and 3-1 can be a possible reason for avoiding homodinucleotides in positions 2-3, because heterodinucleotides can contribute to the tuning of DNA flexibility. Indeed, it has been shown that, given an average flexibility F of homodinucleotides (either RpR or YpY), the flexibilities of the heterodinucleotides YpR and RpY are 2F and 0.5F, respectively (67). The codon positions 2-3 are the most versatile in terms of the relationship between the nucleotide and protein sequences. Therefore, the presence of heterodinucleotides in codon positions 2-3 makes it possible to adjust the flexibility of the nucleotide sequence without changing the physical–chemical characteristics of the encoded amino acid residue. Additionally, we found that the link between dipeptide and corresponding 3-1 dinucleotide frequencies is determined by the amino acid dipeptide. The dominating bias reads as Met followed by a polar or charged residue.

On the RNA level, simple consideration of the nucleotide composition suggests discriminating the tRNA and rRNA molecules from the mRNA. Indeed, based on the correlation with OGT, we can conclude that the (G+C) content is important for providing stability of tRNA and rRNA molecules, but not of mRNA. The role of the (G+C) content was further confirmed by folding simulations (Table 3). We also found other complementary signals in mRNA sequences related to its structure, stability and function. Archaea and Bacteria yield similar trends in the most and least frequent nucleotide pairs, and the third codon position makes a major contribution to the base-pairing mechanism of stem stabilization. Specifically, in addition to Watson–Crick pairing, the GU wobble pairs contribute substantially to the thermal stability of the Archaeal mRNA, in agreement with earlier observations (68,69). Overall, it results in a moderate correlation of the segment energy and energy per base pair of the folded mRNA (Table 5). Folding simulations performed in this work also allowed us to analyze peculiarities of the nucleotide contents and structural contacts in stem and loop regions of folded mRNA. Purine load in loop regions is higher than in stems, and this effect is slightly stronger in Archaea than in Bacteria (Table 6). Purine load also correlates with OGT in loops, but not in stems. This correlation was observed earlier by Forsdyke et al. (21), and it was described as 'polite purine load of the loop regions' that prevents undesired mRNA–mRNA single-strand interactions. The authors concluded that increased purine load can affect the codon choice and, consequently, the amino acid composition (21). Our data [as well as careful analysis of the original data in (21)] do not support the latter claim. There is indeed a correlation between the fraction of


some purine-rich codons [GGR (Gly), AGR (Arg) and GAR (Glu)] and OGT (Table 6). However, the fraction of purine-rich non-synonymous codons AAR (Lys) is not correlated with OGT (Table 6). Moreover, these codons are synonymous and cannot directly affect the amino acid composition. It has been shown earlier (22,70) that an increase of the Glu and Arg fractions is a trend in protein thermophilic adaptation. At the same time, purine load remains high after elimination of the codon bias, pointing to the amino acid composition as the most probable cause for the former (22). All the above prompts one to conclude that, in the mutual tuning of nucleotide and amino acid compositions, the purine load does not dominate and determine biases in amino acid composition, if not the opposite. The analysis of purine load shows that it correlates with OGT in both stems and loops (Table 6) in the case of the purine-rich codons GGR (Gly), AGR (Arg) and GAR (Glu). Further, the amount of ApG dinucleotides strongly correlates with OGT in both loops and stems (Table 6), and ApG provides a strong base stacking important for the thermal stability of nucleic acid sequences (22,65). Purine load, therefore, is apparently a determinant of the base-stacking mechanism in mRNA and/or DNA thermal stability, as well as a result of the thermophilic amino acid composition trend.

Prokaryotes thrive in a temperature interval spanning over a hundred degrees, and they represent two major life styles: aerobic and anaerobic. Analysis of complete Archaeal and Bacterial genomes unraveled compositional and sequence signals related to molecular mechanisms of stability and adaptation, unaffected by selective sequencing or by the comparison of orthologs. Overall, codon bias works more strongly in Archaea and is mostly utilized in thermophilic adaptation of nucleic acids. It apparently reflects the longer evolutionary history of Archaea, which presumably started close to the origin of life in hot conditions (29). The codon bias and amino acid sequences (dipeptide composition) work in accord to support enrichment of the nucleotide sequences with ApG/CpT dinucleotides, determinants of the base-stacking mechanism of nucleic acid stability (65–67). We also found that the second codon position reveals a strong link between the nucleotide and amino acid compositions. Specifically, the excess of thymine in this position is a result of a demand for the enrichment of Archaeal proteins with hydrophobic amino acids. The third codon position in Bacteria is the only case where codon bias is detectable already at the level of pure composition. We found that the (G+C)3 load is related to the aerobic life style dominating in Bacteria. From the point of view of thermophilic adaptation, codon bias works against G in the third codon position in both Archaea and Bacteria. It thus supports a specific role of the third codon position in discriminating between adaptations to temperature and to the aerobic life style. Finally, we obtained an interesting and complex picture of the relationship between the nucleotide composition (purine load) and amino acid composition (selection between Arg and Lys) in relation to thermal adaptation, aerobicity and phylogeny. Specifically, purine load in both Archaea and Bacteria is a result of the 'from both ends of the hydrophobicity scale' trend in thermal adaptation of proteins (34), reflected in the IVYWREL predictor of the OGT (22). According to this predictor, Arg is a preferred amino acid for thermal adaptation, though Lys is the next candidate, present in the top most correlated predictors as well (22,43). In particular, selection for Lys over Arg in some species was shown to be important for the entropic mechanism of protein thermostabilization (70). However, the overall preference for Arg over Lys in thermal adaptation is well manifested in the excess of the purine-rich (AGR) codons of Arg at the expense of the purine-rich ones of Lys (AAR). At the same time, discrimination between Arg and Lys works in the opposite direction in aerobicity, where the Arg codons AGA and AGG are suppressed in favor of the purine-rich AAA and AAG of Lys. It thus reflects a preference for Lys over Arg in aerobes compared to anaerobes in particular, and in Bacteria versus Archaea in general, while preserving the high purine load necessary for providing base stacking in the corresponding nucleic acids (22,53,65). An intricate connection between Lys/Arg and their codons in relation to thermal adaptation and aerobicity exemplifies how selection can work on nucleic acids and proteins simultaneously in response to the demands of different environments. Obviously, the whole picture of molecular mechanisms of adaptation and the relations between them is far from being complete. Consideration of other environmental factors, such as salinity, pressure, etc., will help to unravel new mechanisms of stability and their sequence/structure determinants, and to understand the tradeoffs that Nature embraced en route of evolution and adaptation.
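To make the Arg/Lys trade-off concrete, the sketch below counts the purine-rich Arg codons (AGR, i.e. AGA and AGG) and the Lys codons (AAR, i.e. AAA and AAG) in a coding sequence and reports them as fractions of all codons. The function name and the toy sequence are hypothetical, and normalizing by the total number of codons is an assumption rather than the procedure used in this work.

```python
from collections import Counter

ARG_AGR = {"AGA", "AGG"}  # purine-rich arginine codons
LYS_AAR = {"AAA", "AAG"}  # lysine codons


def arg_lys_purine_codons(cds):
    """Return the fractions of AGR (Arg) and AAR (Lys) codons among all codons."""
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(codons)
    total = len(codons)
    agr = sum(counts[c] for c in ARG_AGR) / total
    aar = sum(counts[c] for c in LYS_AAR) / total
    return {"AGR": agr, "AAR": aar}


# Toy sequence (hypothetical): Met-Arg-Arg-Lys-Arg-Glu, with one Arg encoded by CGT.
fractions = arg_lys_purine_codons("ATGAGAAGGAAACGTGAA")
print(fractions)
```

Comparing the two fractions across genomes grouped by OGT or by oxygen requirement would show the direction of the Arg/Lys codon shift discussed in the text; both codon families keep the purine load high because AGR and AAR consist of purines only.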

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Funding for open access charge: Functional Genomics Program (FUGE II), Norwegian Research Council.

Conflict of interest statement. None declared.


53 MarmurJ and DotyP (1962) Determination of the basecomposition of deoxyribonucleic acid from its thermaldenaturation temperature J Mol Biol 5 109ndash118

Nucleic Acids Research 2014 Vol 42 No 5 2891


54. Hurst, L.D. and Merchant, A.R. (2001) High guanine-cytosine content is not an adaptation to high temperature: a comparative analysis amongst prokaryotes. Proc. Biol. Sci., 268, 493–497.

55. Naya, H., Romero, H., Zavala, A., Alvarez, B. and Musto, H. (2002) Aerobiosis increases the genomic guanine plus cytosine content (GC%) in prokaryotes. J. Mol. Evol., 55, 260–264.

56. Beckman, K.B. and Ames, B.N. (1997) Oxidative decay of DNA. J. Biol. Chem., 272, 19633–19636.

57. Cheng, K.C., Cahill, D.S., Kasai, H., Nishimura, S. and Loeb, L.A. (1992) 8-Hydroxyguanine, an abundant form of oxidative DNA damage, causes G→T and A→C substitutions. J. Biol. Chem., 267, 166–172.

58. Vieira-Silva, S. and Rocha, E.P. (2008) An assessment of the impacts of molecular oxygen on the evolution of proteomes. Mol. Biol. Evol., 25, 1931–1942.

59. Galtier, N. and Lobry, J.R. (1997) Relationships between genomic G+C content, RNA secondary structures, and optimal growth temperature in prokaryotes. J. Mol. Evol., 44, 632–636.

60. Musto, H., Naya, H., Zavala, A., Romero, H., Alvarez-Valin, F. and Bernardi, G. (2004) Correlations between genomic GC levels and optimal growth temperatures in prokaryotes. FEBS Lett., 573, 73–77.

61. Fitch, W.M. (1974) The large extent of putative secondary nucleic acid structure in random nucleotide sequences or amino acid derived messenger-RNA. J. Mol. Evol., 3, 279–291.

62. Workman, C. and Krogh, A. (1999) No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. Nucleic Acids Res., 27, 4816–4822.

63. Bachmair, A., Finley, D. and Varshavsky, A. (1986) In vivo half-life of a protein is a function of its amino-terminal residue. Science, 234, 179–186.

64. Berezovsky, I.N., Namiot, V.A., Tumanyan, V.G. and Esipova, N.G. (1999) Hierarchy of the interaction energy distribution in the spatial structure of globular proteins and the problem of domain definition. J. Biomol. Struct. Dyn., 17, 133–155.

65. Yakovchuk, P., Protozanova, E. and Frank-Kamenetskii, M.D. (2006) Base-stacking and base-pairing contributions into thermal stability of the DNA double helix. Nucleic Acids Res., 34, 564–574.

66. Friedman, R.A. and Honig, B. (1995) A free energy analysis of nucleic acid base stacking in aqueous solution. Biophys. J., 69, 1528–1535.

67. Okonogi, T.M., Alley, S.C., Reese, A.W., Hopkins, P.B. and Robinson, B.H. (2002) Sequence-dependent dynamics of duplex DNA: the applicability of a dinucleotide model. Biophys. J., 83, 3446–3459.

68. Masquida, B. and Westhof, E. (2000) On the wobble GoU and related pairs. RNA, 6, 9–15.

69. Varani, G. and McClain, W.H. (2000) The G x U wobble base pair. A fundamental building block of RNA structure crucial to RNA function in diverse biological systems. EMBO Rep., 1, 18–23.

70. Berezovsky, I.N., Chen, W.W., Choi, P.J. and Shakhnovich, E.I. (2005) Entropic stabilization of proteins and its proteomic consequences. PLoS Comput. Biol., 1, e47.


separate codon positions are considered for coding DNA, the only compositional bias observed in nucleotide composition is an excess of (G+C) load in the third codon position in Bacteria. There is a significant excess of guanine and cytosine nucleotides (29.2 and 32.82) compared with the NCB nucleotide contents (26.09 and 24.66). A plausible explanation for this bias is the role of the (G+C)3 load in adaptation to the aerobic life style, which dominates in Bacteria. The (G+C)3 load can be advantageous for several complementary reasons. First, additional GC base pairing contributes to the stability of the DNA double helix and of stem regions in RNA molecules. Moreover, G3 bases can work as scavengers of oxidizing agents, protecting G bases in other codon positions (56–58). At the same time, this bias does not lead to any changes in amino acid composition, leaving protein structure and stability intact (55).
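The positional composition discussed above can be illustrated with a minimal sketch. It is not the pipeline used in this work: the function names and the toy sequence are assumptions introduced only for illustration, and the sequence is assumed to be an in-frame coding sequence.

```python
from collections import Counter


def position_composition(cds):
    """Return nucleotide fractions at codon positions 1, 2 and 3.

    `cds` is assumed to be an in-frame coding sequence; a trailing
    partial codon and any non-ACGT symbols are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    result = {}
    for pos in range(3):
        counts = Counter(cds[pos:usable:3])
        total = sum(counts[b] for b in "ACGT")
        result[pos + 1] = {
            b: (counts[b] / total if total else 0.0) for b in "ACGT"
        }
    return result


def gc3(cds):
    """(G+C) load in the third codon position."""
    third = position_composition(cds)[3]
    return third["G"] + third["C"]


# Hypothetical toy sequence, not taken from any genome analyzed here.
toy_cds = "ATGGCCAAGGGCTTTTAA"
print(position_composition(toy_cds)[3])
print(gc3(toy_cds))
```

Comparing such per-position fractions with those of a suitable reference composition is one simple way to expose a bias that is confined to a single codon position.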

In this work, we considered compositional and sequence biases in proteins in relation to those in the corresponding nucleic acid sequences and to the phylogeny of species (Archaea or Bacteria). OGT correlations of nucleotides in different codon positions are similar in Archaea and Bacteria, but higher in the former (Table 1). The codon that is optimal from the point of view of thermophilic adaptation (i.e. most correlated with OGT) reads A1 [TG]2 non-[AG]3 in Archaea, but [weak A]1 [weak T]2 non-[AG]3 in Bacteria (Table 4). Codon bias supports the strong correlation between the amount of adenine in the first codon position and OGT in Archaea, and the same but weaker correlation in Bacteria. While we showed earlier (22) that the amino acid thermophilic trend is not determined by the overall nucleotide composition, consideration of separate codon positions reveals an interesting link between positional nucleotide frequencies and amino acid composition in relation to the domain of Life. Specifically, the prevalence of thymine in the second codon position and its strong correlation with OGT in Archaea is apparently a result of the demand for enrichment of Archaeal proteins with hydrophobic residues (Table 7). Notably, a massive increase of van der Waals interactions (24,64) was found to be the cornerstone of the structure-based evolutionary strategy of protein thermostability in ancient species (29). In Bacterial proteins, however, the fractions of charged residues increase with OGT (Table 7). Thus, we observe a transition from the structure-based evolutionary strategy of protein thermostability in Archaea to the sequence-based one in Bacteria, together with the corresponding nucleotide and protein compositional biases (Table 7) underlying these strategies (29). The third codon position is characterized by selection against adenine and guanine in Archaea and against adenine in Bacteria, which is a result of this bias.
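A correlation between a positional nucleotide frequency and OGT, of the kind summarized above, can be sketched as follows. This is only an illustrative sketch: the helper names are hypothetical, a plain Pearson coefficient is assumed as the correlation measure, and the (OGT, sequence) pairs are made up rather than taken from the data set of this work.

```python
from math import sqrt


def position_fraction(cds, position, base):
    """Fraction of codons carrying `base` at codon position 1, 2 or 3."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop a trailing partial codon
    column = cds[position - 1:usable:3]
    return column.count(base) / len(column) if column else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant sample")
    return sxy / sqrt(sxx * syy)


# Hypothetical (OGT in degrees Celsius, coding sequence) pairs.
toy_genomes = [
    (30.0, "ATGGCCGGTCTT"),
    (60.0, "ATGAAGGCTATT"),
    (95.0, "ATGAAAAGGATA"),
]
ogt = [t for t, _ in toy_genomes]
a1 = [position_fraction(seq, 1, "A") for _, seq in toy_genomes]
print(pearson(a1, ogt))
```

In a real analysis each point would be a complete genome, and the calculation would be repeated for every nucleotide and codon position, separately for Archaea and Bacteria.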

The most pronounced dinucleotide compositional biases are the excess of homo-dinucleotides and its correlation with OGT in ncDNA, tRNA and rRNA. This excess is presumably a relic of the ancient primitive homo(poly)nucleotides from which life started. Overrepresentation of the complementary ApG/CpT dinucleotides points to their contribution to DNA/RNA stability via strong base-stacking interactions (65,66). Consideration of dinucleotide biases in purine (R) and pyrimidine (Y) notation also reveals an interesting picture. Overall, for both Archaea and Bacteria, the signature of thermophilic dinucleotides reads RpR/YpY, non-(RpR/YpY), RpR/YpY for codon positions 1-2, 2-3 and 3-1, respectively (Table 4). The major difference between Archaea and Bacteria in this case is that the dinucleotide bias in positions 1-2 is caused by demands at the level of the nucleotide sequence (and is provided by the codon bias) in Archaea, while some other factors work in Bacteria. The preference for homodinucleotides in positions 1-2 and 3-1 may be a reason for avoiding homodinucleotides in positions 2-3, because heterodinucleotides can contribute to tuning of DNA flexibility. Indeed, it has been shown that, given an average flexibility F of homodinucleotides (either RpR or YpY), the flexibilities of the heterodinucleotides YpR and RpY are 2F and 0.5F, respectively (67). Codon positions 2-3 are the most versatile in terms of the relationship between the nucleotide and protein sequences. Therefore, the presence of heterodinucleotides in codon positions 2-3 makes it possible to adjust the flexibility of the nucleotide sequence without changing the physical–chemical characteristics of the encoded amino acid residue. Additionally, we found that the link between dipeptide frequencies and the corresponding 3-1 dinucleotide frequencies is determined by the amino acid dipeptide. The dominating bias reads as Met followed by a polar or charged residue.

On the RNA level, simple consideration of the nucleotide composition suggests that tRNA and rRNA molecules should be distinguished from mRNA. Indeed, based on the correlation with OGT, we can conclude that (G+C) content is important for the stability of tRNA and rRNA molecules, but not of mRNA. The role of the (G+C) content was further confirmed by folding simulations (Table 3). We also found other complementary signals in mRNA sequences related to its structure, stability and function. Archaea and Bacteria yield similar trends in the most and least frequent nucleotide pairs, and the third codon position makes a major contribution to the base-pairing mechanism of stem stabilization. Specifically, in addition to Watson–Crick pairing, GU wobble pairs contribute significantly to the thermal stability of Archaeal mRNA, in agreement with earlier observations (68,69). Overall, this results in a moderate correlation of the segment energy and the energy per base pair of the folded mRNA (Table 5). Folding simulations performed in this work also allowed us to analyze peculiarities of the nucleotide contents and structural contacts in stem and loop regions of folded mRNA. Purine load in loop regions is higher than in stems, and this effect is slightly stronger in Archaea than in Bacteria (Table 6). Purine load also correlates with OGT in loops, but not in stems. This correlation was observed earlier by Forsdyke et al. (21) and was described as 'polite purine load of the loop regions' that prevents undesired mRNA–mRNA single-strand interactions. The authors concluded that increased purine load can affect the codon choice and, consequently, the amino acid composition (21). Our data [as well as careful analysis of the original data in (21)] do not support the latter claim. There is indeed a correlation between the fraction of


some purine-rich codons [GGR(Gly), AGR(Arg) and GAR(Glu)] and OGT (Table 6). However, the fraction of the purine-rich non-synonymous codons AAR (Lys) is not correlated with OGT (Table 6). Moreover, these codons are synonymous and cannot directly affect the amino acid composition. It has been shown earlier (22,70) that an increase of the Glu and Arg fractions is a trend in protein thermophilic adaptation. At the same time, purine load remains high after elimination of the codon bias, pointing to the amino acid composition as the most probable cause of the former (22). All of the above prompts one to conclude that, in the mutual tuning of nucleotide and amino acid compositions, the purine load does not dominate and determine biases in amino acid composition; if anything, the opposite holds. The analysis of purine load shows that it correlates with OGT in both stems and loops (Table 6) in the case of the purine-rich codons GGR(Gly), AGR(Arg) and GAR(Glu). Further, the amount of ApG dinucleotides strongly correlates with OGT in both loops and stems (Table 6), and ApG provides strong base stacking, which is important for the thermal stability of nucleic acid sequences (22,65). Purine load is therefore apparently a determinant of the base-stacking mechanism of mRNA and/or DNA thermal stability, as well as a result of the thermophilic trend in amino acid composition.

Prokaryotes thrive in a temperature interval spanning over a hundred degrees, and they represent two major life styles: aerobic and anaerobic. Analysis of complete Archaeal and Bacterial genomes unraveled compositional and sequence signals related to molecular mechanisms of stability and adaptation that are unaffected by selective sequencing or by the comparison of orthologs. Overall, codon bias is stronger in Archaea and is mostly utilized in the thermophilic adaptation of nucleic acids. This apparently reflects the longer evolutionary history of Archaea, which presumably started close to the origin of life in hot conditions (29). The codon bias and amino acid sequences (dipeptide composition) work in accord to support enrichment of the nucleotide sequences with ApG/CpT dinucleotides, which are determinants of the base-stacking mechanism of nucleic acid stability (65–67). We also found that the second codon position reveals a strong link between the nucleotide and amino acid compositions. Specifically, the excess of thymine in this position is a result of the demand for enrichment of Archaeal proteins with hydrophobic amino acids. The third codon position in Bacteria is the only case where codon bias is detectable already at the level of pure composition. We found that the (G+C)3 load is related to the aerobic life style dominating in Bacteria. From the point of view of thermophilic adaptation, codon bias works against G in the third codon position in both Archaea and Bacteria. This supports a specific role of the third codon position in discriminating between adaptations to temperature and to the aerobic life style. Finally, we obtained an interesting and complex picture of the relationship between the nucleotide composition (purine load) and amino acid composition (selection between Arg and Lys) in relation to thermal adaptation, aerobicity and phylogeny. Specifically, purine load in both Archaea and Bacteria is a result of the 'from both ends of the hydrophobicity scale' trend in thermal adaptation of proteins (34), reflected in the IVYWREL predictor of the OGT (22). According to this predictor, Arg is a preferred amino acid for thermal adaptation, though Lys is the next candidate and is also present in the top most correlated predictors (22,43). In particular, selection for Lys over Arg in some species was shown to be important for the entropic mechanism of protein thermostabilization (70). However, the overall preference for Arg over Lys in thermal adaptation is well manifested in the excess of the purine-rich codons of Arg (AGR) at the expense of the purine-rich codons of Lys (AAR). At the same time, discrimination between Arg and Lys works in the opposite direction in aerobicity, where the Arg codons AGA and AGG are suppressed in favor of the purine-rich AAA and AAG of Lys. It thus reflects a preference for Lys over Arg in aerobes compared with anaerobes in particular, and in Bacteria versus Archaea in general, while preserving the high purine load necessary for base stacking in the corresponding nucleic acids (22,53,65). The intricate connection between Lys/Arg and their codons in relation to thermal adaptation and aerobicity exemplifies how selection can work on nucleic acids and proteins simultaneously in response to the demands of different environments. Obviously, the whole picture of molecular mechanisms of adaptation and the relations between them is far from complete. Consideration of other environmental factors, such as salinity, pressure, etc., will help to unravel new mechanisms of stability and their sequence/structure determinants, and to understand the tradeoffs that Nature embraced en route of evolution and adaptation.
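The Arg/Lys codon tradeoff described above can be made concrete with a minimal sketch that counts the purine-rich codons of Arg (AGR) and Lys (AAR) and the overall purine load of a coding sequence. The function names and the toy sequence are hypothetical and serve only as an illustration; this is not the procedure used to produce the tables of this work.

```python
ARG_AGR = {"AGA", "AGG"}  # purine-rich codons of Arg
LYS_AAR = {"AAA", "AAG"}  # purine-rich codons of Lys


def split_codons(cds):
    """Split an in-frame coding sequence into codons (partial tail dropped)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def codon_set_fraction(cds, codon_set):
    """Fraction of codons in `cds` that belong to `codon_set`."""
    codons = split_codons(cds)
    if not codons:
        return 0.0
    return sum(codon in codon_set for codon in codons) / len(codons)


def purine_load(seq):
    """Fraction of A and G among the unambiguous nucleotides of `seq`."""
    known = [base for base in seq.upper() if base in "ACGT"]
    if not known:
        return 0.0
    return sum(base in "AG" for base in known) / len(known)


# Hypothetical toy sequence, used only to exercise the functions.
toy_cds = "ATGAGAAAGAGGCTTAAATAA"
print(codon_set_fraction(toy_cds, ARG_AGR))
print(codon_set_fraction(toy_cds, LYS_AAR))
print(purine_load(toy_cds))
```

Because both codon families are purine-rich, shifting usage from AGR to AAR (or back) changes the Arg/Lys balance while leaving the purine load essentially unchanged, which is the point made in the text.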

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Funding for open access charge: Functional Genomics Program (FUGE II), Norwegian Research Council.

Conflict of interest statement. None declared.

REFERENCES

1. Crick, F. (1970) Central dogma of molecular biology. Nature, 227, 561–563.

2. Tawfik, D.S. (2010) Messy biology and the origins of evolutionary innovations. Nat. Chem. Biol., 6, 692–696.

3. Tokuriki, N. and Tawfik, D.S. (2009) Stability effects of mutations and protein evolvability. Curr. Opin. Struct. Biol., 19, 596–604.

4. Tokuriki, N. and Tawfik, D.S. (2009) Protein dynamism and evolvability. Science, 324, 203–207.

5. Koonin, E.V. (2012) Does the central dogma still stand? Biol. Direct, 7, 27.

6. Pe'er, I., Felder, C.E., Man, O., Silman, I., Sussman, J.L. and Beckmann, J.S. (2004) Proteomic signatures: amino acid and oligopeptide compositions differentiate among phyla. Proteins, 54, 20–40.

7. Aravind, L., Tatusov, R.L., Wolf, Y.I., Walker, D.R. and Koonin, E.V. (1998) Evidence for massive gene exchange between archaeal and bacterial hyperthermophiles. Trends Genet., 14, 442–444.

8. Khachane, A.N., Timmis, K.N. and dos Santos, V.A. (2005) Uracil content of 16S rRNA of thermophilic and psychrophilic prokaryotes correlates inversely with their optimal growth temperatures. Nucleic Acids Res., 33, 4016–4022.

9. Suhre, K. and Claverie, J.M. (2003) Genomic correlates of hyperthermostability, an update. J. Biol. Chem., 278, 17198–17202.

10. Tekaia, F. and Yeramian, E. (2006) Evolution of proteomes: fundamental signatures and global trends in amino acid compositions. BMC Genom., 7, 307.

11. Wang, H.C. and Hickey, D.A. (2002) Evidence for strong selective constraint acting on the nucleotide composition of 16S ribosomal RNA genes. Nucleic Acids Res., 30, 2501–2507.

12. Friedman, R., Drake, J.W. and Hughes, A.L. (2004) Genome-wide patterns of nucleotide substitution reveal stringent functional constraints on the protein sequences of thermophiles. Genetics, 167, 1507–1512.

13. Lynn, D.J., Singer, G.A. and Hickey, D.A. (2002) Synonymous codon usage is subject to selection in thermophilic bacteria. Nucleic Acids Res., 30, 4272–4277.

14. Singer, G.A. and Hickey, D.A. (2000) Nucleotide bias causes a genomewide bias in the amino acid composition of proteins. Mol. Biol. Evol., 17, 1581–1588.

15. Singer, G.A. and Hickey, D.A. (2003) Thermophilic prokaryotes have characteristic patterns of codon usage, amino acid composition and nucleotide content. Gene, 317, 39–47.

16. Tekaia, F., Yeramian, E. and Dujon, B. (2002) Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene, 297, 51–60.

17. Roy Chowdhury, A. and Dutta, C. (2012) A pursuit of lineage-specific and niche-specific proteome features in the world of archaea. BMC Genom., 13, 236.

18. Wang, H.C., Susko, E. and Roger, A.J. (2006) On the correlation between genomic G+C content and optimal growth temperature in prokaryotes: data quality and confounding factors. Biochem. Biophys. Res. Commun., 342, 681–684.

19. Wu, H., Zhang, Z., Hu, S. and Yu, J. (2012) On the molecular mechanism of GC content variation among eubacterial genomes. Biol. Direct, 7, 2.

20. Knight, R.D., Freeland, S.J. and Landweber, L.F. (2001) A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes. Genome Biol., 2, RESEARCH0010.

21. Lao, P.J. and Forsdyke, D.R. (2000) Thermophilic bacteria strictly obey Szybalski's transcription direction rule and politely purine-load RNAs with both adenine and guanine. Genome Res., 10, 228–236.

22. Zeldovich, K.B., Berezovsky, I.N. and Shakhnovich, E.I. (2007) Protein and DNA sequence determinants of thermophilic adaptation. PLoS Comput. Biol., 3, e5.

23. Kreil, D.P. and Ouzounis, C.A. (2001) Identification of thermophilic species by the amino acid compositions deduced from their genomes. Nucleic Acids Res., 29, 1608–1615.

24. Berezovsky, I.N., Tumanyan, V.G. and Esipova, N.G. (1997) Representation of amino acid sequences in terms of interaction energy in protein globules. FEBS Lett., 418, 43–46.

25. Cambillau, C. and Claverie, J.M. (2000) Structural and genomic correlates of hyperthermostability. J. Biol. Chem., 275, 32383–32386.

26. Greaves, R.B. and Warwicker, J. (2007) Mechanisms for stabilisation and the maintenance of solubility in proteins from thermophiles. BMC Struct. Biol., 7, 18.

27. Jaenicke, R. (1999) Stability and folding of domain proteins. Prog. Biophys. Mol. Biol., 71, 155–241.

28 JaenickeR and BohmG (1998) The stability of proteins inextreme environments Curr Opin Struct Biol 8 738ndash748

29 BerezovskyIN and ShakhnovichEI (2005) Physics andevolution of thermophilic adaptation Proc Natl Acad Sci USA102 12742ndash12747

30 ChakravartyS and VaradarajanR (2002) Elucidation of factorsresponsible for enhanced thermal stability of proteins a structuralgenomics based study Biochemistry 41 8152ndash8161

31 GlyakinaAV GarbuzynskiySO LobanovMY andGalzitskayaOV (2007) Different packing of external residues can

explain differences in the thermostability of proteins fromthermophilic and mesophilic organisms Bioinformatics 232231ndash2238

32 ThompsonMJ and EisenbergD (1999) Transproteomic evidenceof a loop-deletion mechanism for enhancing proteinthermostability J Mol Biol 290 595ndash604

33 TokurikiN OldfieldCJ UverskyVN BerezovskyIN andTawfikDS (2009) Do viral proteins possess unique biophysicalfeatures Trends Biochem Sci 34 53ndash59

34 BerezovskyIN ZeldovichKB and ShakhnovichEI (2007)Positive and negative design in stability and thermal adaptationof natural proteins PLoS Comput Biol 3 e52

35 BharanidharanD BhargaviGR UthanumallianK andGauthamN (2004) Correlations between nucleotide frequenciesand amino acid composition in 115 bacterial species BiochemBiophys Res Commun 315 1097ndash1103

36 NakashimaH FukuchiS and NishikawaK (2003)Compositional changes in RNA DNA and proteins for bacterialadaptation to higher and lower temperatures J Biochem(Tokyo) 133 507ndash513

37 DehouckY FolchB and RoomanM (2008) Revisiting thecorrelation between proteinsrsquo thermoresistance and organismsrsquothermophilicity Protein Eng Des Sel 21 275ndash278

38 FolchB DehouckY and RoomanM (2010) Thermo- andmesostabilizing protein interactions identified by temperature-dependent statistical potentials Biophys J 98 667ndash677

39 FolchB RoomanM and DehouckY (2008) Thermostability ofsalt bridges versus hydrophobic interactions in proteins probed bystatistical potentials J Chem Inf Model 48 119ndash127

40 GonnelliG RoomanM and DehouckY (2012) Structure-basedmutant stability predictions on proteins of unknown structureJ Biotechnol 161 287ndash293

41 PonnuswamyP MuthusamyR and ManavalanP (1982) Aminoacid composition and thermal stability of globular proteins IntJ Biol Macromol 4 186ndash190

42 BerezovskyIN (2011) The diversity of physical forces andmechanisms in intermolecular interactions Phys Biol 8 035002

43 MaBG GoncearencoA and BerezovskyIN (2010)Thermophilic adaptation of protein complexes inferred fromproteomic homology modeling Structure 18 819ndash828

44 MakarovaKS and KooninEV (2005) Evolutionary andfunctional genomics of the Archaea Curr Opin Microbiol 8586ndash594

45 NovichkovPS WolfYI DubchakI and KooninEV (2009)Trends in prokaryotic evolution revealed by comparison of closelyrelated bacterial and archaeal genomes J Bacteriol 191 65ndash73

46 KooninEV and WolfYI (2008) Genomics of bacteria andarchaea the emerging dynamic view of the prokaryotic worldNucleic Acids Res 36 6688ndash6719

47 KooninEV MushegianAR GalperinMY and WalkerDR(1997) Comparison of archaeal and bacterial genomes computeranalysis of protein sequences predicts novel functions andsuggests a chimeric origin for the archaea Mol Microbiol 25619ndash637

48 KatzL and BurgeCB (2003) Widespread selection for localRNA secondary structure in coding regions of bacterial genesGenome Res 13 2042ndash2051

49 HofackerIL PriwitzerB and StadlerPF (2004) Prediction oflocally stable RNA secondary structures for genome-wide surveysBioinformatics 20 186ndash190

50 ZukerM and StieglerP (1981) Optimal computer folding oflarge RNA sequences using thermodynamics and auxiliaryinformation Nucleic Acids Res 9 133ndash148

51 MathewsDH SabinaJ ZukerM and TurnerDH (1999)Expanded sequence dependence of thermodynamic parametersimproves prediction of RNA secondary structure J Mol Biol288 911ndash940

52 ShabalinaSA OgurtsovAY and SpiridonovNA (2006) Aperiodic pattern of mRNA secondary structure created by thegenetic code Nucleic Acids Res 34 2428ndash2437

53 MarmurJ and DotyP (1962) Determination of the basecomposition of deoxyribonucleic acid from its thermaldenaturation temperature J Mol Biol 5 109ndash118

Nucleic Acids Research 2014 Vol 42 No 5 2891

by guest on May 6 2014

httpnaroxfordjournalsorgD

ownloaded from

54 HurstLD and MerchantAR (2001) High guanine-cytosinecontent is not an adaptation to high temperature acomparative analysis amongst prokaryotes Proc Biol Sci 268493ndash497

55 NayaH RomeroH ZavalaA AlvarezB and MustoH (2002)Aerobiosis increases the genomic guanine plus cytosine content(GC) in prokaryotes J Mol Evol 55 260ndash264

56 BeckmanKB and AmesBN (1997) Oxidative decay of DNAJ Biol Chem 272 19633ndash19636

57 ChengKC CahillDS KasaiH NishimuraS and LoebLA(1992) 8-Hydroxyguanine an abundant form of oxidative DNAdamage causes Gmdash-T and Amdash-C substitutions J Biol Chem267 166ndash172

58 Vieira-SilvaS and RochaEP (2008) An assessment of theimpacts of molecular oxygen on the evolution of proteomes MolBiol Evol 25 1931ndash1942

59 GaltierN and LobryJR (1997) Relationships between genomicG+C content RNA secondary structures and optimal growthtemperature in prokaryotes J Mol Evol 44 632ndash636

60 MustoH NayaH ZavalaA RomeroH Alvarez-ValinF andBernardiG (2004) Correlations between genomic GC levels andoptimal growth temperatures in prokaryotes FEBS Lett 57373ndash77

61 FitchWM (1974) The large extent of putative secondary nucleicacid structure in random nucleotide sequences or amino acidderived messenger-RNA J Mol Evol 3 279ndash291

62 WorkmanC and KroghA (1999) No evidence that mRNAshave lower folding free energies than random sequences with

the same dinucleotide distribution Nucleic Acids Res 274816ndash4822

63 BachmairA FinleyD and VarshavskyA (1986) In vivo half-lifeof a protein is a function of its amino-terminal residue Science234 179ndash186

64 BerezovskyIN NamiotVA TumanyanVG and EsipovaNG(1999) Hierarchy of the interaction energy distribution in thespatial structure of globular proteins and the problem of domaindefinition J Biomol Struct Dyn 17 133ndash155

65 YakovchukP ProtozanovaE and Frank-KamenetskiiMD(2006) Base-stacking and base-pairing contributions into thermalstability of the DNA double helix Nucleic Acids Res 34564ndash574

66 FriedmanRA and HonigB (1995) A free energy analysis ofnucleic acid base stacking in aqueous solution Biophys J 691528ndash1535

67 OkonogiTM AlleySC ReeseAW HopkinsPB andRobinsonBH (2002) Sequence-dependent dynamics of duplexDNA the applicability of a dinucleotide model Biophys J 833446ndash3459

68 MasquidaB and WesthofE (2000) On the wobble GoU andrelated pairs RNA 6 9ndash15

69 VaraniG and McClainWH (2000) The G x U wobble base pairA fundamental building block of RNA structure crucial to RNAfunction in diverse biological systems EMBO Rep 1 18ndash23

70 BerezovskyIN ChenWW ChoiPJ and ShakhnovichEI(2005) Entropic stabilization of proteins and its proteomicconsequences PLoS Comput Biol 1 e47

2892 Nucleic Acids Research 2014 Vol 42 No 5

by guest on May 6 2014

httpnaroxfordjournalsorgD

ownloaded from

some purine-rich codons (GGR(Gly) AGR(Arg) andGAR(Glu)) and OGT (Table 6) However the fractionof purine-rich non-synonymous codons AAR (Lys) isnot correlated with OGT (Table 6) Moreover thesecodons are synonymous and cannot directly affect theamino acid composition It has been shown earlier(2270) that increase of Glu and Arg fractions is a trendin protein thermophilic adaptation At the same timepurine load remains high after elimination of the codonbias pointing to the amino acid composition as the mostprobable cause for the former (22) All the above promptsone to conclude that in mutual tuning of nucleotide andamino acid compositions the purine load does notdominate and determine biases in amino acid compos-ition if not opposite The analysis of purine load showsthat it correlates with OGT in both stems and loops(Table 6) in case of purine-rich codons GGR(Gly)AGR(Arg) and GAR(Glu) Further the amount ofApG dinucleotides strongly correlates with OGT in bothloops and stems (Table 6) and ApG provides a strongbase stacking important for thermal stability of nucleicacid sequences (2265) Purine load therefore is appar-ently a determinant of the base-stacking mechanism inmRNA andor DNA thermal stability as well as a resultof the thermophilic amino acid composition trendProkaryotes thrive under the temperature interval

spanning over hundred degrees and they represent twomajor life stylesmdashaerobic and anaerobic Analysis ofcomplete Archaeal and Bacterial genomes unraveled com-positional and sequence signals related to molecular mech-anisms of stability and adaptation unaffected by selectivesequencing or by the comparison of orthologs Overallcodon bias works stronger in Archaea and is mostlyutilized in thermophilic adaptation of nucleic acids It ap-parently reflects longer evolutionary history of Archaeawhich presumably started close to the origin of life in hotconditions (29) The codon bias and amino acid sequences(dipeptide composition) work in accord for supportingenrichment of the nucleotide sequences with ApGCpT di-nucleotidesmdashdeterminants of the base-stacking mechan-ism of the nucleic acid stability (65ndash67) We also foundthat the second codon position reveals a strong linkbetween the nucleotide and amino acid compositionsSpecifically excess of thymine in this position is a resultof a demand on the enrichment of Archaeal proteins withhydrophobic amino acids The third codon position inBacteria is the only case where codon bias is detectablealready on the level of pure composition We found thatthe (G+C)3 load is related to aerobic life style dominatingin Bacteria From the point of view of thermophilic adap-tation codon bias works against G in the third codonposition in both Archaea and Bacteria It supports thus aspecific role of the third codon position in discriminatingbetween adaptations to temperature and aerobic life styleFinally we obtained an interesting and complex picture ofthe relationship between the nucleotide composition(purine load) and amino acid composition (selectionbetween Arg and Lys) in relation to thermal adaptationaerobicity and phylogeny Specifically purine load in bothArchaea and Bacteria is a result of the lsquofrom both endof hydrophobicity scalersquo trend in thermal adaptation of

proteins (34) reflected in the IVYWREL predictor of theOGT (22) According to this predictor Arg is a preferredamino acid for thermal adaptation though Lys is the nextcandidate present in top most correlated predictors as well(2243) In particular selection for Lys over the Arg insome species was shown to be important for the entropicmechanism of protein thermostabilization (70) Howeveroverall preference for Arg over the Lys in case of thermaladaptation is well manifested in the excess of the purine-rich (AGR) codons of Arg at the expense of purine-richones of Lys (AAR) At the same time discriminationbetween Arg and Lys works in opposite direction inaerobicity where Arg codons AGA and AGG are sup-pressed in favor of the purine-rich AAA and AAG ofLys It thus reflect a preference for Lys over Arg inaerobes compared to anaerobes in particular and inBacteria versus Archaea in general preserving at thesame time high purine load necessary for providing abase-stacking in corresponding nucleic acids (225365)An intricate connection between the LysArg and theircodons in relation to thermal adaptation and aerobicityexemplifies how selection can work on nucleic acids andprotein simultaneously in response to demands of differentenvironments Obviously the whole picture of molecularmechanisms of adaptation and relations between them isfar from being complete Consideration of other environ-mental factors such as salinity pressure etc will help to un-ravel new mechanisms of stability their sequencestructuredeterminants and to understand tradeoffs that Natureembraced en route of the evolution and adaptation

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Funding for open access charges: Functional Genomics Program (FUGE II), Norwegian Research Council.

Conflict of interest statement. None declared.

REFERENCES

1. Crick,F. (1970) Central dogma of molecular biology. Nature, 227, 561–563.

2. Tawfik,D.S. (2010) Messy biology and the origins of evolutionary innovations. Nat. Chem. Biol., 6, 692–696.

3. Tokuriki,N. and Tawfik,D.S. (2009) Stability effects of mutations and protein evolvability. Curr. Opin. Struct. Biol., 19, 596–604.

4. Tokuriki,N. and Tawfik,D.S. (2009) Protein dynamism and evolvability. Science, 324, 203–207.

5. Koonin,E.V. (2012) Does the central dogma still stand? Biol. Direct, 7, 27.

6. Pe'er,I., Felder,C.E., Man,O., Silman,I., Sussman,J.L. and Beckmann,J.S. (2004) Proteomic signatures: amino acid and oligopeptide compositions differentiate among phyla. Proteins, 54, 20–40.

7. Aravind,L., Tatusov,R.L., Wolf,Y.I., Walker,D.R. and Koonin,E.V. (1998) Evidence for massive gene exchange between archaeal and bacterial hyperthermophiles. Trends Genet., 14, 442–444.

8. Khachane,A.N., Timmis,K.N. and dos Santos,V.A. (2005) Uracil content of 16S rRNA of thermophilic and psychrophilic prokaryotes correlates inversely with their optimal growth temperatures. Nucleic Acids Res., 33, 4016–4022.

2890 Nucleic Acids Research 2014 Vol 42 No 5

9. Suhre,K. and Claverie,J.M. (2003) Genomic correlates of hyperthermostability: an update. J. Biol. Chem., 278, 17198–17202.

10. Tekaia,F. and Yeramian,E. (2006) Evolution of proteomes: fundamental signatures and global trends in amino acid compositions. BMC Genom., 7, 307.

11. Wang,H.C. and Hickey,D.A. (2002) Evidence for strong selective constraint acting on the nucleotide composition of 16S ribosomal RNA genes. Nucleic Acids Res., 30, 2501–2507.

12. Friedman,R., Drake,J.W. and Hughes,A.L. (2004) Genome-wide patterns of nucleotide substitution reveal stringent functional constraints on the protein sequences of thermophiles. Genetics, 167, 1507–1512.

13. Lynn,D.J., Singer,G.A. and Hickey,D.A. (2002) Synonymous codon usage is subject to selection in thermophilic bacteria. Nucleic Acids Res., 30, 4272–4277.

14. Singer,G.A. and Hickey,D.A. (2000) Nucleotide bias causes a genomewide bias in the amino acid composition of proteins. Mol. Biol. Evol., 17, 1581–1588.

15. Singer,G.A. and Hickey,D.A. (2003) Thermophilic prokaryotes have characteristic patterns of codon usage, amino acid composition and nucleotide content. Gene, 317, 39–47.

16. Tekaia,F., Yeramian,E. and Dujon,B. (2002) Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene, 297, 51–60.

17. Roy Chowdhury,A. and Dutta,C. (2012) A pursuit of lineage-specific and niche-specific proteome features in the world of archaea. BMC Genom., 13, 236.

18. Wang,H.C., Susko,E. and Roger,A.J. (2006) On the correlation between genomic G+C content and optimal growth temperature in prokaryotes: data quality and confounding factors. Biochem. Biophys. Res. Commun., 342, 681–684.

19. Wu,H., Zhang,Z., Hu,S. and Yu,J. (2012) On the molecular mechanism of GC content variation among eubacterial genomes. Biol. Direct, 7, 2.

20. Knight,R.D., Freeland,S.J. and Landweber,L.F. (2001) A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes. Genome Biol., 2, RESEARCH0010.

21. Lao,P.J. and Forsdyke,D.R. (2000) Thermophilic bacteria strictly obey Szybalski's transcription direction rule and politely purine-load RNAs with both adenine and guanine. Genome Res., 10, 228–236.

22. Zeldovich,K.B., Berezovsky,I.N. and Shakhnovich,E.I. (2007) Protein and DNA sequence determinants of thermophilic adaptation. PLoS Comput. Biol., 3, e5.

23. Kreil,D.P. and Ouzounis,C.A. (2001) Identification of thermophilic species by the amino acid compositions deduced from their genomes. Nucleic Acids Res., 29, 1608–1615.

24. Berezovsky,I.N., Tumanyan,V.G. and Esipova,N.G. (1997) Representation of amino acid sequences in terms of interaction energy in protein globules. FEBS Lett., 418, 43–46.

25. Cambillau,C. and Claverie,J.M. (2000) Structural and genomic correlates of hyperthermostability. J. Biol. Chem., 275, 32383–32386.

26. Greaves,R.B. and Warwicker,J. (2007) Mechanisms for stabilisation and the maintenance of solubility in proteins from thermophiles. BMC Struct. Biol., 7, 18.

27. Jaenicke,R. (1999) Stability and folding of domain proteins. Prog. Biophys. Mol. Biol., 71, 155–241.

28. Jaenicke,R. and Bohm,G. (1998) The stability of proteins in extreme environments. Curr. Opin. Struct. Biol., 8, 738–748.

29. Berezovsky,I.N. and Shakhnovich,E.I. (2005) Physics and evolution of thermophilic adaptation. Proc. Natl Acad. Sci. USA, 102, 12742–12747.

30. Chakravarty,S. and Varadarajan,R. (2002) Elucidation of factors responsible for enhanced thermal stability of proteins: a structural genomics based study. Biochemistry, 41, 8152–8161.

31. Glyakina,A.V., Garbuzynskiy,S.O., Lobanov,M.Y. and Galzitskaya,O.V. (2007) Different packing of external residues can explain differences in the thermostability of proteins from thermophilic and mesophilic organisms. Bioinformatics, 23, 2231–2238.

32. Thompson,M.J. and Eisenberg,D. (1999) Transproteomic evidence of a loop-deletion mechanism for enhancing protein thermostability. J. Mol. Biol., 290, 595–604.

33. Tokuriki,N., Oldfield,C.J., Uversky,V.N., Berezovsky,I.N. and Tawfik,D.S. (2009) Do viral proteins possess unique biophysical features? Trends Biochem. Sci., 34, 53–59.

34. Berezovsky,I.N., Zeldovich,K.B. and Shakhnovich,E.I. (2007) Positive and negative design in stability and thermal adaptation of natural proteins. PLoS Comput. Biol., 3, e52.

35. Bharanidharan,D., Bhargavi,G.R., Uthanumallian,K. and Gautham,N. (2004) Correlations between nucleotide frequencies and amino acid composition in 115 bacterial species. Biochem. Biophys. Res. Commun., 315, 1097–1103.

36. Nakashima,H., Fukuchi,S. and Nishikawa,K. (2003) Compositional changes in RNA, DNA and proteins for bacterial adaptation to higher and lower temperatures. J. Biochem. (Tokyo), 133, 507–513.

37. Dehouck,Y., Folch,B. and Rooman,M. (2008) Revisiting the correlation between proteins' thermoresistance and organisms' thermophilicity. Protein Eng. Des. Sel., 21, 275–278.

38. Folch,B., Dehouck,Y. and Rooman,M. (2010) Thermo- and mesostabilizing protein interactions identified by temperature-dependent statistical potentials. Biophys. J., 98, 667–677.

39. Folch,B., Rooman,M. and Dehouck,Y. (2008) Thermostability of salt bridges versus hydrophobic interactions in proteins probed by statistical potentials. J. Chem. Inf. Model., 48, 119–127.

40. Gonnelli,G., Rooman,M. and Dehouck,Y. (2012) Structure-based mutant stability predictions on proteins of unknown structure. J. Biotechnol., 161, 287–293.

41. Ponnuswamy,P., Muthusamy,R. and Manavalan,P. (1982) Amino acid composition and thermal stability of globular proteins. Int. J. Biol. Macromol., 4, 186–190.

42. Berezovsky,I.N. (2011) The diversity of physical forces and mechanisms in intermolecular interactions. Phys. Biol., 8, 035002.

43. Ma,B.G., Goncearenco,A. and Berezovsky,I.N. (2010) Thermophilic adaptation of protein complexes inferred from proteomic homology modeling. Structure, 18, 819–828.

44. Makarova,K.S. and Koonin,E.V. (2005) Evolutionary and functional genomics of the Archaea. Curr. Opin. Microbiol., 8, 586–594.

45. Novichkov,P.S., Wolf,Y.I., Dubchak,I. and Koonin,E.V. (2009) Trends in prokaryotic evolution revealed by comparison of closely related bacterial and archaeal genomes. J. Bacteriol., 191, 65–73.

46. Koonin,E.V. and Wolf,Y.I. (2008) Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res., 36, 6688–6719.

47. Koonin,E.V., Mushegian,A.R., Galperin,M.Y. and Walker,D.R. (1997) Comparison of archaeal and bacterial genomes: computer analysis of protein sequences predicts novel functions and suggests a chimeric origin for the archaea. Mol. Microbiol., 25, 619–637.

48. Katz,L. and Burge,C.B. (2003) Widespread selection for local RNA secondary structure in coding regions of bacterial genes. Genome Res., 13, 2042–2051.

49. Hofacker,I.L., Priwitzer,B. and Stadler,P.F. (2004) Prediction of locally stable RNA secondary structures for genome-wide surveys. Bioinformatics, 20, 186–190.

50. Zuker,M. and Stiegler,P. (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res., 9, 133–148.

51. Mathews,D.H., Sabina,J., Zuker,M. and Turner,D.H. (1999) Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol., 288, 911–940.

52. Shabalina,S.A., Ogurtsov,A.Y. and Spiridonov,N.A. (2006) A periodic pattern of mRNA secondary structure created by the genetic code. Nucleic Acids Res., 34, 2428–2437.

53. Marmur,J. and Doty,P. (1962) Determination of the base composition of deoxyribonucleic acid from its thermal denaturation temperature. J. Mol. Biol., 5, 109–118.


54. Hurst,L.D. and Merchant,A.R. (2001) High guanine-cytosine content is not an adaptation to high temperature: a comparative analysis amongst prokaryotes. Proc. Biol. Sci., 268, 493–497.

55. Naya,H., Romero,H., Zavala,A., Alvarez,B. and Musto,H. (2002) Aerobiosis increases the genomic guanine plus cytosine content (GC) in prokaryotes. J. Mol. Evol., 55, 260–264.

56. Beckman,K.B. and Ames,B.N. (1997) Oxidative decay of DNA. J. Biol. Chem., 272, 19633–19636.

57. Cheng,K.C., Cahill,D.S., Kasai,H., Nishimura,S. and Loeb,L.A. (1992) 8-Hydroxyguanine, an abundant form of oxidative DNA damage, causes G→T and A→C substitutions. J. Biol. Chem., 267, 166–172.

58. Vieira-Silva,S. and Rocha,E.P. (2008) An assessment of the impacts of molecular oxygen on the evolution of proteomes. Mol. Biol. Evol., 25, 1931–1942.

59. Galtier,N. and Lobry,J.R. (1997) Relationships between genomic G+C content, RNA secondary structures, and optimal growth temperature in prokaryotes. J. Mol. Evol., 44, 632–636.

60. Musto,H., Naya,H., Zavala,A., Romero,H., Alvarez-Valin,F. and Bernardi,G. (2004) Correlations between genomic GC levels and optimal growth temperatures in prokaryotes. FEBS Lett., 573, 73–77.

61. Fitch,W.M. (1974) The large extent of putative secondary nucleic acid structure in random nucleotide sequences or amino acid derived messenger-RNA. J. Mol. Evol., 3, 279–291.

62. Workman,C. and Krogh,A. (1999) No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. Nucleic Acids Res., 27, 4816–4822.

63. Bachmair,A., Finley,D. and Varshavsky,A. (1986) In vivo half-life of a protein is a function of its amino-terminal residue. Science, 234, 179–186.

64. Berezovsky,I.N., Namiot,V.A., Tumanyan,V.G. and Esipova,N.G. (1999) Hierarchy of the interaction energy distribution in the spatial structure of globular proteins and the problem of domain definition. J. Biomol. Struct. Dyn., 17, 133–155.

65. Yakovchuk,P., Protozanova,E. and Frank-Kamenetskii,M.D. (2006) Base-stacking and base-pairing contributions into thermal stability of the DNA double helix. Nucleic Acids Res., 34, 564–574.

66. Friedman,R.A. and Honig,B. (1995) A free energy analysis of nucleic acid base stacking in aqueous solution. Biophys. J., 69, 1528–1535.

67. Okonogi,T.M., Alley,S.C., Reese,A.W., Hopkins,P.B. and Robinson,B.H. (2002) Sequence-dependent dynamics of duplex DNA: the applicability of a dinucleotide model. Biophys. J., 83, 3446–3459.

68. Masquida,B. and Westhof,E. (2000) On the wobble GoU and related pairs. RNA, 6, 9–15.

69. Varani,G. and McClain,W.H. (2000) The G x U wobble base pair. A fundamental building block of RNA structure crucial to RNA function in diverse biological systems. EMBO Rep., 1, 18–23.

70. Berezovsky,I.N., Chen,W.W., Choi,P.J. and Shakhnovich,E.I. (2005) Entropic stabilization of proteins and its proteomic consequences. PLoS Comput. Biol., 1, e47.
